Joseph Doherty fd2e96fea2 feat: replace debug view polling with real-time SignalR streaming

The debug view polled every 2s by re-subscribing for full snapshots. Now a
persistent DebugStreamBridgeActor on central subscribes once and receives
incremental Akka stream events from the site, forwarding them to the Blazor
component via callbacks and to the CLI via a new SignalR hub at
/hubs/debug-stream. Adds `debug stream` CLI command with auto-reconnect.

2026-03-21 01:34:53 -04:00

Component: Central–Site Communication

Purpose

The Communication component manages all messaging between the central cluster and site clusters using Akka.NET. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs).

Location

Both central and site clusters. Central runs the CentralCommunicationActor and each site runs a SiteCommunicationActor; these actors handle message routing on their side.

Responsibilities

Resolve site addresses from the configuration database and maintain a cached address map.
Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist.
Route messages between central and site clusters in a hub-and-spoke topology.
Broker requests from external systems (via central) to sites and return responses.
Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
Detect site connectivity status for health monitoring.

Communication Patterns

1. Deployment (Central → Site)

Pattern: Request/Response.
Central sends a flattened configuration to a site.
Site Runtime receives, compiles scripts, creates/updates Instance Actors, and responds with success/failure.
No buffering at central — if the site is unreachable, the deployment fails immediately.

2. Instance Lifecycle Commands (Central → Site)

Pattern: Request/Response.
Central sends disable, enable, or delete commands for specific instances.
Site Runtime processes the command and responds with success/failure.
If the site is unreachable, the command fails immediately (no buffering).

3. System-Wide Artifact Deployment (Central → Site(s))

Pattern: Broadcast with per-site acknowledgment (deploy to all sites), or targeted to a single site (per-site deployment).
When shared scripts, external system definitions, database connections, data connections, notification lists, or SMTP configuration are explicitly deployed, central sends them to the target site(s).
Each site acknowledges receipt and reports success/failure independently.
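The broadcast-with-acknowledgment pattern can be sketched as follows. This is a minimal, language-neutral model in Python, not the Akka.NET implementation: `broadcast_artifacts`, the `send` callable, and the outcome strings are hypothetical names chosen for illustration. It shows each site being contacted concurrently and reporting its own outcome, so one slow or failing site does not change the result for the others.

```python
import asyncio


async def broadcast_artifacts(sites, artifacts, send, timeout_s=120.0):
    """Send artifacts to every target site concurrently.

    `send(site, artifacts)` is an assumed coroutine that resolves to True
    (site reported success) or False (site reported failure).
    Returns a dict mapping each site to "success", "failure", or "timeout".
    """

    async def deploy_one(site):
        try:
            ok = await asyncio.wait_for(send(site, artifacts), timeout_s)
        except asyncio.TimeoutError:
            # No acknowledgment arrived within the per-site timeout.
            return site, "timeout"
        return site, ("success" if ok else "failure")

    # Each site is awaited independently; results are collected per site.
    results = await asyncio.gather(*(deploy_one(site) for site in sites))
    return dict(results)
```

A targeted per-site deployment is the same call with a single-element `sites` list.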

4. Integration Routing (External System → Central → Site → Central → External System)

Pattern: Request/Response (brokered).
External system sends a request to central (e.g., MES requests machine values).
Central routes the request to the appropriate site.
Site reads values from the Instance Actor and responds.
Central returns the response to the external system.

5. Recipe/Command Delivery (External System → Central → Site)

Pattern: Fire-and-forget with acknowledgment (the site returns only an acknowledgment, not result data).
External system sends a command to central (e.g., recipe manager sends recipe).
Central routes to the site.
Site applies and acknowledges.

6. Debug Streaming (Site → Central)

Pattern: Subscribe/push with initial snapshot (no polling).
A DebugStreamBridgeActor (one per active debug session) is created on the central cluster by the DebugStreamService. The bridge actor sends a SubscribeDebugViewRequest to the site via CentralCommunicationActor, with itself as the Sender. The site's InstanceActor registers the bridge actor as the debug subscriber.
Site requests a snapshot of all current attribute values and alarm states from the Instance Actor and sends it to the bridge actor.
Site then subscribes to the site-wide Akka stream filtered by the instance's unique name and forwards AttributeValueChanged and AlarmStateChanged events to the bridge actor in real time via Akka remoting.
The bridge actor forwards received events to the consumer via callbacks (Blazor component or SignalR hub).
Attribute value stream messages: [InstanceUniqueName].[AttributePath].[AttributeName], value, quality, timestamp.
Alarm state stream messages: [InstanceUniqueName].[AlarmName], state (active/normal), priority, timestamp.
Central sends an unsubscribe request when the debug session ends. The site removes its stream subscription and the bridge actor is stopped.
The stream is session-based and temporary.
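The subscribe/push flow above can be modelled roughly as follows. This is a simplified Python sketch rather than actor code: the class and field names mirror the message names in this section, but the constructor arguments, the `site_forward` helper, and the `key` properties are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class AttributeValueChanged:
    instance: str
    attribute_path: str
    attribute_name: str
    value: Any
    quality: str
    timestamp: str

    @property
    def key(self) -> str:
        # [InstanceUniqueName].[AttributePath].[AttributeName]
        return f"{self.instance}.{self.attribute_path}.{self.attribute_name}"


@dataclass(frozen=True)
class AlarmStateChanged:
    instance: str
    alarm_name: str
    state: str  # "active" or "normal"
    priority: int
    timestamp: str

    @property
    def key(self) -> str:
        # [InstanceUniqueName].[AlarmName]
        return f"{self.instance}.{self.alarm_name}"


class DebugStreamBridge:
    """Stand-in for one DebugStreamBridgeActor (one per debug session)."""

    def __init__(self, instance: str, on_snapshot: Callable, on_event: Callable,
                 on_terminated: Callable):
        self.instance = instance
        self._on_snapshot = on_snapshot
        self._on_event = on_event
        self._on_terminated = on_terminated
        self.stopped = False

    def handle_snapshot(self, snapshot):
        if not self.stopped:
            self._on_snapshot(snapshot)

    def handle_event(self, event):
        # Events arriving after the session ended are ignored.
        if not self.stopped:
            self._on_event(event)

    def terminate(self):
        if not self.stopped:
            self.stopped = True
            self._on_terminated()


def site_forward(event, bridge: DebugStreamBridge):
    """Site side: forward only events for the subscribed instance."""
    if event.instance == bridge.instance:
        bridge.handle_event(event)
```

The consumer callbacks stand in for the Blazor component or the SignalR hub; the bridge itself does not know which one it is feeding.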

Central-Side Debug Stream Components

DebugStreamService: Singleton service that manages debug stream sessions. Resolves instance ID to unique name and site, creates and tears down DebugStreamBridgeActor instances, and provides a clean API for both Blazor components and the SignalR hub.
DebugStreamBridgeActor: One per active debug session. Acts as the Akka-level subscriber registered with the site's InstanceActor. Receives real-time AttributeValueChanged and AlarmStateChanged events from the site and forwards them to the consumer via callbacks.
DebugStreamHub: SignalR hub at /hubs/debug-stream for external consumers (e.g., CLI). Authenticates via Basic Auth + LDAP and requires the Deployment role. Server-to-client methods: OnSnapshot, OnAttributeChanged, OnAlarmChanged, OnStreamTerminated.

6a. Debug Snapshot (Central → Site)

Pattern: Request/Response (one-shot, no subscription).
Central sends a DebugSnapshotRequest (identified by instance unique name) to the site.
Site's Deployment Manager routes to the Instance Actor by unique name.
Instance Actor builds and returns a DebugViewSnapshot with all current attribute values and alarm states (same payload as the streaming initial snapshot).
No subscription is created; no stream is established.
Uses the 30-second QueryTimeout.

7. Health Reporting (Site → Central)

Pattern: Periodic push.
Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.

8. Remote Queries (Central → Site)

Pattern: Request/Response.
Central queries sites for:
- Parked messages (store-and-forward dead letters).
- Site event logs.
- Instance debug snapshots (attribute values and alarm states).
Central can also send management commands:
- Retry or discard parked messages.

Topology

Central Cluster
  ├── ClusterClient → Site A Cluster (SiteCommunicationActor via Receptionist)
  ├── ClusterClient → Site B Cluster (SiteCommunicationActor via Receptionist)
  └── ClusterClient → Site N Cluster (SiteCommunicationActor via Receptionist)

Site Clusters
  └── ClusterClient → Central Cluster (CentralCommunicationActor via Receptionist)

Sites do not communicate with each other.
All inter-cluster communication flows through central.
Both CentralCommunicationActor and SiteCommunicationActor are registered with their cluster's ClusterClientReceptionist for cross-cluster discovery.

Site Address Resolution

Central discovers site addresses through the configuration database, not runtime registration:

Each site record in the Sites table includes optional NodeAAddress and NodeBAddress fields containing base Akka addresses of the site's cluster nodes (e.g., akka.tcp://scadalink@host:port).
The CentralCommunicationActor loads all site addresses from the database at startup and creates one ClusterClient per site, configured with both NodeA and NodeB as contact points.
The address cache is refreshed every 60 seconds and on-demand when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
When routing a message to a site, central sends via ClusterClient.Send("/user/site-communication", msg). ClusterClient handles failover between NodeA and NodeB internally — there is no application-level NodeA preference/NodeB fallback logic.
Heartbeats from sites serve health monitoring only — they do not serve as a registration or address discovery mechanism.
If no addresses are configured for a site, messages to that site are dropped and the caller's Ask times out.
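The address-resolution rules above can be illustrated with a small cache model. This is a hedged Python sketch, not the CentralCommunicationActor itself: `SiteRecord`, `SiteAddressCache`, `load_sites`, and `make_client` are hypothetical names, and `make_client` stands in for whatever creates a ClusterClient from a set of contact points.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SiteRecord:
    site_id: str
    node_a_address: Optional[str] = None
    node_b_address: Optional[str] = None


def contact_points(site: SiteRecord) -> tuple:
    """Both node addresses, skipping any that are not configured."""
    return tuple(a for a in (site.node_a_address, site.node_b_address) if a)


class SiteAddressCache:
    def __init__(self, load_sites, make_client):
        self._load_sites = load_sites    # assumed: returns all site records
        self._make_client = make_client  # assumed: builds a client from contact points
        self._clients = {}               # site_id -> (contact_points, client)

    def refresh(self):
        """Reload site records; recreate a client only if its contact points changed."""
        seen = set()
        for site in self._load_sites():
            seen.add(site.site_id)
            points = contact_points(site)
            current = self._clients.get(site.site_id)
            if not points:
                self._clients.pop(site.site_id, None)
            elif current is None or current[0] != points:
                self._clients[site.site_id] = (points, self._make_client(points))
        for gone in set(self._clients) - seen:
            del self._clients[gone]      # site record was deleted

    def client_for(self, site_id):
        """Return the client for a site, or None if no addresses are configured."""
        entry = self._clients.get(site_id)
        return entry[1] if entry else None
```

In this model a `None` result corresponds to the dropped-message case: nothing is sent, so the caller's Ask eventually times out. `refresh()` would be called on the 60-second timer and whenever a site record changes.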

Site → Central Communication

Site nodes configure a list of CentralContactPoints (both central node addresses) instead of a single CentralActorPath.
The site creates a ClusterClient using the central contact points and sends heartbeats, health reports, and other messages via ClusterClient.Send("/user/central-communication", msg).
ClusterClient handles automatic failover between central nodes — if the active central node goes down, the site's ClusterClient reconnects to the standby node transparently.

Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

Pattern	Default Timeout	Rationale
1. Deployment	120 seconds	Script compilation at the site can be slow
2. Instance Lifecycle	30 seconds	Stopping and starting actors is fast
3. System-Wide Artifacts	120 seconds per site	Includes shared script recompilation
4. Integration Routing	30 seconds	External system waiting for response; Inbound API per-method timeout may cap this further
5. Recipe/Command Delivery	30 seconds	Fire-and-forget with ack
8. Remote Queries	30 seconds	Querying parked messages, event logs, or debug snapshots (6a)

Timeouts use the Akka.NET ask pattern. If no response is received within the timeout, the caller receives a timeout failure.
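As an illustration of default timeouts with configuration overrides, the table can be modelled as a lookup plus a timed request. This is a Python sketch using `asyncio`, not the Akka.NET ask pattern; the pattern keys, `resolve_timeout`, and `ask` are hypothetical names, and only the default values come from the table.

```python
import asyncio

# Defaults from the table above, in seconds.
DEFAULT_TIMEOUTS_S = {
    "deployment": 120,
    "instance_lifecycle": 30,
    "system_wide_artifacts": 120,  # per site
    "integration_routing": 30,
    "recipe_command_delivery": 30,
    "remote_queries": 30,
}


def resolve_timeout(pattern, overrides=None):
    """Return the configured override for a pattern, else its default."""
    merged = {**DEFAULT_TIMEOUTS_S, **(overrides or {})}
    return float(merged[pattern])


async def ask(pattern, send, overrides=None):
    """Await `send()` and fail with TimeoutError if no response arrives in time.

    `send` is an assumed zero-argument coroutine function that performs the
    request and resolves to the response.
    """
    timeout_s = resolve_timeout(pattern, overrides)
    try:
        return await asyncio.wait_for(send(), timeout_s)
    except asyncio.TimeoutError:
        raise TimeoutError(f"{pattern}: no response within {timeout_s:g} s") from None
```

There is deliberately no retry in `ask`: a timeout surfaces to the caller as a failure, matching the no-buffering behaviour described for central.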

Transport Configuration

Akka.NET remoting provides the underlying transport for both intra-cluster communication and ClusterClient connections. The following transport-level settings are explicitly configured (not left to framework defaults) for predictable behavior:

Transport heartbeat interval: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
Failure detection threshold: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
Reconnection: ClusterClient handles reconnection and failover between contact points automatically for cross-cluster communication. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.
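The relationship between the two settings is simple arithmetic: the detection window is the heartbeat interval multiplied by the missed-heartbeat threshold. The Python sketch below only illustrates that arithmetic with the example values from this section; it does not model Akka.NET's actual failure detector, and `ConnectionMonitor` is a hypothetical name.

```python
def detection_window_s(heartbeat_interval_s, missed_heartbeats):
    """Seconds of silence after which the connection is considered lost."""
    return heartbeat_interval_s * missed_heartbeats


class ConnectionMonitor:
    """Tracks the last heartbeat time and reports loss after the window elapses."""

    def __init__(self, heartbeat_interval_s=5.0, missed_heartbeats=3, now=0.0):
        self.window_s = detection_window_s(heartbeat_interval_s, missed_heartbeats)
        self.last_heartbeat = now

    def heartbeat(self, now):
        self.last_heartbeat = now

    def is_lost(self, now):
        return now - self.last_heartbeat >= self.window_s
```

With a 5-second interval and a threshold of 3, `detection_window_s(5, 3)` gives 15 seconds. A shorter window detects outages sooner but is more likely to flag a slow link as lost.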

Application-Level Correlation

All request/response messages include an application-level correlation ID to ensure correct pairing of requests and responses, even across reconnection events:

Deployments include a deployment ID and revision hash for idempotency (see Deployment Manager).
Lifecycle commands include a command ID for deduplication.
Remote queries include a query ID for response correlation.

This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.
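A minimal sketch of how correlation and deduplication IDs might be used is shown below. It is written in Python for brevity and is an assumption about the mechanics, not the project's code: `PendingRequests` and `CommandDeduplicator` are hypothetical names, and generating IDs with `uuid4` is an illustrative choice.

```python
import uuid


class PendingRequests:
    """Caller side: pair each response with the request that produced it."""

    def __init__(self):
        self._pending = {}

    def register(self, request):
        correlation_id = uuid.uuid4().hex
        self._pending[correlation_id] = request
        return correlation_id

    def resolve(self, correlation_id, response):
        """Return (request, response), or None for an unknown or late response."""
        request = self._pending.pop(correlation_id, None)
        if request is None:
            return None
        return request, response


class CommandDeduplicator:
    """Receiver side: run each command ID at most once and replay its result."""

    def __init__(self):
        self._results = {}

    def handle(self, command_id, execute):
        if command_id not in self._results:
            self._results[command_id] = execute()
        return self._results[command_id]
```

A response that arrives twice, or after the caller has already timed out and discarded the entry, resolves to `None` and can be dropped safely. A command that is resent after a reconnect returns the stored result without being executed again.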

Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication component relies on this guarantee: messages from a given sender to a given site are delivered in the order they were sent. Callers do not need to handle out-of-order delivery.

ManagementActor and ClusterClient

The ManagementActor is registered at the well-known path /user/management on central nodes and advertised via ClusterClientReceptionist. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. This is a separate ClusterClient usage from the inter-cluster ClusterClient connections used for central-site messaging — the CLI does not participate in cluster membership or affect the hub-and-spoke topology.

Connection Failure Behavior

In-flight messages: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is no automatic retry or buffering at central — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
Debug streams: Any connection interruption (failover or network blip) terminates the debug stream. The DebugStreamBridgeActor is stopped and the consumer is notified via OnStreamTerminated. The engineer must reopen the debug view to re-establish the subscription with a fresh snapshot. There is no auto-resume.

Failover Behavior

Central failover: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Site ClusterClients automatically reconnect to the standby central node via their configured contact points.
Site failover: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central's per-site ClusterClient automatically reconnects to the surviving site node. Ongoing debug streams are interrupted and must be re-established by the engineer.

Dependencies

Akka.NET Remoting + ClusterClient: Provides the transport layer. ClusterClient/ClusterClientReceptionist used for all cross-cluster messaging.
Cluster Infrastructure: Manages node roles and failover detection.
Configuration Database: Provides site node addresses (NodeAAddress, NodeBAddress) for address resolution.

Interactions

Deployment Manager (central): Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and receive status.
Site Runtime: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
Central UI: Debug view requests and remote queries flow through communication.
Health Monitoring: Receives periodic health reports from sites.
Store-and-Forward Engine (site): Parked message queries/commands are routed through communication.
Site Event Logging: Event log queries are routed through communication.
Management Service: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.
