feat: complete gRPC streaming channel — site host, docker config, docs, integration tests

Switch site host to WebApplicationBuilder with Kestrel HTTP/2 gRPC server,
add GrpcPort/keepalive config, wire SiteStreamManager as ISiteStreamSubscriber,
expose gRPC ports in docker-compose, add site seed script, update all 10
requirement docs + CLAUDE.md + README.md for the new dual-transport architecture.
Joseph Doherty
2026-03-21 12:38:33 -04:00
parent 3fe3c4161b
commit 416a03b782
34 changed files with 728 additions and 156 deletions


@@ -2,7 +2,7 @@
## Purpose
The Communication component manages all messaging between the central cluster and site clusters using Akka.NET. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs).
The Communication component manages all messaging between the central cluster and site clusters. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs). Two transports are used: **Akka.NET ClusterClient** for command/control messaging and **gRPC server-streaming** for real-time data (attribute values, alarm states).
## Location
@@ -10,12 +10,15 @@ Both central and site clusters. Each side has communication actors that handle m
## Responsibilities
- Resolve site addresses from the configuration database and maintain a cached address map.
- Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist.
- Resolve site addresses (Akka remoting and gRPC) from the configuration database and maintain a cached address map.
- Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist for command/control.
- Establish and maintain per-site gRPC streaming connections for real-time data delivery (site→central).
- Route messages between central and site clusters in a hub-and-spoke topology.
- Broker requests from external systems (via central) to sites and return responses.
- Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
- Detect site connectivity status for health monitoring.
- Host the **SiteStreamGrpcServer** on site nodes (Kestrel HTTP/2) to serve real-time event streams.
- Manage per-site **SiteStreamGrpcClient** instances on central nodes via **SiteStreamGrpcClientFactory**.
## Communication Patterns
@@ -50,22 +53,55 @@ Both central and site clusters. Each side has communication actors that handle m
- Site applies and acknowledges.
### 6. Debug Streaming (Site → Central)
- **Pattern**: Subscribe/push with initial snapshot (no polling).
- A **DebugStreamBridgeActor** (one per active debug session) is created on the central cluster by the **DebugStreamService**. The bridge actor sends a `SubscribeDebugViewRequest` to the site via `CentralCommunicationActor`. The site's `InstanceActor` stores the subscription's correlation ID and replies with an initial snapshot via the ClusterClient reply path.
- Site requests a **snapshot** of all current attribute values and alarm states from the Instance Actor and sends it back to the bridge actor (via the ClusterClient reply path, which works for immediate responses).
- For ongoing events, the InstanceActor wraps `AttributeValueChanged` and `AlarmStateChanged` in a `DebugStreamEvent(correlationId, event)` message and sends it to the local `SiteCommunicationActor`. The SiteCommunicationActor forwards it to central via its own ClusterClient (`ClusterClient.Send("/user/central-communication", event)`). The `CentralCommunicationActor` looks up the bridge actor by correlation ID and delivers the event. This follows the same site→central pattern as health reports.
- **Pattern**: Subscribe/push with initial snapshot. Two transports: **ClusterClient** for the subscribe/unsubscribe handshake and initial snapshot, **gRPC server-streaming** for ongoing real-time events.
- A **DebugStreamBridgeActor** (one per active debug session) is created on the central cluster by the **DebugStreamService**. The bridge actor first opens a **gRPC server-streaming subscription** to the site via `SiteStreamGrpcClient`, then sends a `SubscribeDebugViewRequest` to the site via `CentralCommunicationActor` (ClusterClient). The site's `InstanceActor` replies with an initial snapshot via the ClusterClient reply path.
- **gRPC stream (real-time events)**: The site's **SiteStreamGrpcServer** receives the gRPC `SubscribeInstance` call and creates a **StreamRelayActor** that subscribes to **SiteStreamManager** for the requested instance. Events (`AttributeValueChanged`, `AlarmStateChanged`) flow from `SiteStreamManager` → `StreamRelayActor` → `Channel<SiteStreamEvent>` (bounded, 1000, DropOldest) → gRPC response stream → `SiteStreamGrpcClient` on central → `DebugStreamBridgeActor`.
- The `DebugStreamEvent` message type no longer exists — events are not routed through ClusterClient. `SiteCommunicationActor` and `CentralCommunicationActor` have no role in streaming event delivery.
- The bridge actor forwards received events to the consumer via callbacks (Blazor component or SignalR hub).
- **Snapshot-to-stream handoff**: The gRPC stream is opened **before** the snapshot request to avoid missing events. The consumer applies the snapshot as baseline, then replays buffered gRPC events with timestamps newer than the snapshot (timestamp-based dedup).
- Attribute value stream messages: `[InstanceUniqueName].[AttributePath].[AttributeName]`, value, quality, timestamp.
- Alarm state stream messages: `[InstanceUniqueName].[AlarmName]`, state (active/normal), priority, timestamp.
- Central sends an unsubscribe request when the debug session ends. The site removes its stream subscription and the bridge actor is stopped.
- Central sends an unsubscribe request via ClusterClient when the debug session ends. The gRPC stream is cancelled. The site's `StreamRelayActor` is stopped and the SiteStreamManager subscription is removed.
- The stream is session-based and temporary.
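The bounded, drop-oldest buffer between the relay actor and the gRPC response stream can be sketched with `System.Threading.Channels` (a capacity of 3 is used here instead of the documented 1000 so the drop is visible; the event strings are illustrative):

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class ChannelDemo
{
    static async Task Main()
    {
        // Bounded channel mirroring the relay buffer described above.
        var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(3)
        {
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true // one gRPC response stream drains it
        });

        // A slow consumer: five events arrive before anything is read,
        // so DropOldest silently discards the two oldest.
        for (int i = 1; i <= 5; i++)
            await channel.Writer.WriteAsync($"event-{i}");
        channel.Writer.Complete();

        await foreach (var e in channel.Reader.ReadAllAsync())
            Console.WriteLine(e); // prints event-3, event-4, event-5
    }
}
```

`DropOldest` matches the stated design intent: under backpressure a debug view prefers losing the oldest samples over blocking the producer.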
#### Site-Side gRPC Streaming Components
- **SiteStreamGrpcServer**: gRPC service (`SiteStreamService.SiteStreamServiceBase`) hosted on each site node via Kestrel HTTP/2 on a dedicated port (default 8083). Implements the `SubscribeInstance` RPC. For each subscription, creates a `StreamRelayActor` that subscribes to `SiteStreamManager`, bridges events through a `Channel<SiteStreamEvent>` to the gRPC response stream. Tracks active subscriptions by `correlation_id` — duplicate IDs cancel the old stream. Enforces a max concurrent stream limit (default 100). Rejects streams with `StatusCode.Unavailable` before the actor system is ready.
- **StreamRelayActor**: Short-lived actor created per gRPC subscription. Receives domain events (`AttributeValueChanged`, `AlarmStateChanged`) from `SiteStreamManager`, converts them to protobuf `SiteStreamEvent` messages, and writes to the `Channel<SiteStreamEvent>` writer. Stopped when the gRPC stream is cancelled or the client disconnects.
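A minimal sketch of the `SubscribeInstance` handler, assuming the service base class and message types generated from `sitestream.proto`; `IRelayActorFactory` is a hypothetical abstraction, and the duplicate-correlation-id check, max-stream limit, and readiness rejection described above are elided:

```csharp
// Sketch only — not the actual implementation.
public class SiteStreamGrpcServer : SiteStreamService.SiteStreamServiceBase
{
    private readonly IRelayActorFactory _relayFactory; // hypothetical

    public SiteStreamGrpcServer(IRelayActorFactory relayFactory) =>
        _relayFactory = relayFactory;

    public override async Task SubscribeInstance(
        InstanceStreamRequest request,
        IServerStreamWriter<SiteStreamEvent> responseStream,
        ServerCallContext context)
    {
        var channel = Channel.CreateBounded<SiteStreamEvent>(
            new BoundedChannelOptions(1000) { FullMode = BoundedChannelFullMode.DropOldest });

        // The StreamRelayActor subscribes to SiteStreamManager and writes
        // converted protobuf events into the channel writer.
        var relay = _relayFactory.Create(request.InstanceUniqueName, channel.Writer);
        try
        {
            // context.CancellationToken fires on client disconnect or
            // keepalive-PING failure, ending the stream.
            await foreach (var evt in channel.Reader.ReadAllAsync(context.CancellationToken))
                await responseStream.WriteAsync(evt);
        }
        finally
        {
            relay.Stop(); // removes the SiteStreamManager subscription
        }
    }
}
```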
#### Central-Side Debug Stream Components
- **DebugStreamService**: Singleton service that manages debug stream sessions. Resolves instance ID to unique name and site, creates and tears down `DebugStreamBridgeActor` instances, and provides a clean API for both Blazor components and the SignalR hub.
- **DebugStreamBridgeActor**: One per active debug session. Acts as the Akka-level subscriber registered with the site's `InstanceActor`. Receives real-time `AttributeValueChanged` and `AlarmStateChanged` events from the site and forwards them to the consumer via callbacks.
- **DebugStreamService**: Singleton service that manages debug stream sessions. Resolves instance ID to unique name and site, creates and tears down `DebugStreamBridgeActor` instances, and provides a clean API for both Blazor components and the SignalR hub. Injects `SiteStreamGrpcClientFactory` for gRPC stream creation.
- **DebugStreamBridgeActor**: One per active debug session. Opens a gRPC streaming subscription via `SiteStreamGrpcClient` and receives real-time events via callback. Also receives the initial `DebugViewSnapshot` via ClusterClient. Forwards all events to the consumer via callbacks. Handles gRPC stream errors with reconnection logic: tries the other site node endpoint, retries with backoff (max 3 retries), terminates the session if all retries fail.
- **SiteStreamGrpcClient**: Per-site gRPC client that manages `GrpcChannel` instances and streaming subscriptions. Reads from the gRPC response stream in a background task, converts protobuf messages to domain events, and invokes the `onEvent` callback.
- **SiteStreamGrpcClientFactory**: Caches per-site `SiteStreamGrpcClient` instances. Reads `GrpcNodeAAddress` / `GrpcNodeBAddress` from the `Site` entity (loaded by `CentralCommunicationActor`). Falls back to NodeB if NodeA connection fails. Disposes clients on site removal or address change.
- **DebugStreamHub**: SignalR hub at `/hubs/debug-stream` for external consumers (e.g., CLI). Authenticates via Basic Auth + LDAP and requires the **Deployment** role. Server-to-client methods: `OnSnapshot`, `OnAttributeChanged`, `OnAlarmChanged`, `OnStreamTerminated`.
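The client side read loop can be sketched as follows, assuming the client types generated from `sitestream.proto`; `ToDomainEvent` is a hypothetical converter standing in for the protobuf-to-domain mapping described above:

```csharp
// Sketch of SiteStreamGrpcClient — error handling and NodeB failover elided.
public sealed class SiteStreamGrpcClient : IDisposable
{
    private readonly GrpcChannel _channel;
    private readonly SiteStreamService.SiteStreamServiceClient _client;

    public SiteStreamGrpcClient(string address)
    {
        _channel = GrpcChannel.ForAddress(address);
        _client = new SiteStreamService.SiteStreamServiceClient(_channel);
    }

    public async Task SubscribeAsync(InstanceStreamRequest request,
        Action<object> onEvent, CancellationToken ct)
    {
        using var call = _client.SubscribeInstance(request, cancellationToken: ct);
        // Read loop (run as a background task by the caller): convert each
        // protobuf event and hand it to the DebugStreamBridgeActor callback.
        await foreach (var evt in call.ResponseStream.ReadAllAsync(ct))
            onEvent(ToDomainEvent(evt));
    }

    public void Dispose() => _channel.Dispose();

    private static object ToDomainEvent(SiteStreamEvent evt) => evt; // placeholder
}
```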
#### gRPC Proto Definition
The streaming protocol is defined in `sitestream.proto` (`src/ScadaLink.Communication/Protos/sitestream.proto`):
- **Service**: `SiteStreamService` with a single RPC `SubscribeInstance(InstanceStreamRequest) returns (stream SiteStreamEvent)`.
- **Messages**: `InstanceStreamRequest` (correlation_id, instance_unique_name), `SiteStreamEvent` (correlation_id, oneof event: `AttributeValueUpdate`, `AlarmStateUpdate`).
- The `oneof event` pattern is extensible — future event types (health metrics, connection state changes) are added as new fields without breaking existing consumers.
- Proto field numbers are never reused. Old clients ignore unknown `oneof` variants.
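A sketch of `sitestream.proto` reconstructed from the description above; the payload fields and field numbers are illustrative, taken from the stream message formats listed in section 6:

```proto
syntax = "proto3";

package scadalink.communication;

service SiteStreamService {
  rpc SubscribeInstance (InstanceStreamRequest) returns (stream SiteStreamEvent);
}

message InstanceStreamRequest {
  string correlation_id = 1;
  string instance_unique_name = 2;
}

message SiteStreamEvent {
  string correlation_id = 1;
  oneof event {
    AttributeValueUpdate attribute_value = 2;
    AlarmStateUpdate alarm_state = 3;
    // Future event types (health metrics, connection state) are added
    // here as new fields; old clients ignore unknown oneof variants.
  }
}

message AttributeValueUpdate {
  string path = 1;      // [InstanceUniqueName].[AttributePath].[AttributeName]
  string value = 2;
  string quality = 3;
  int64 timestamp = 4;
}

message AlarmStateUpdate {
  string name = 1;      // [InstanceUniqueName].[AlarmName]
  string state = 2;     // active / normal
  int32 priority = 3;
  int64 timestamp = 4;
}
```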
#### gRPC Connection Keepalive
Three layers of dead-client detection prevent orphan streams on site nodes:
| Layer | Detects | Timeline | Mechanism |
|-------|---------|----------|-----------|
| TCP RST | Clean process death, connection close | 15s | OS-level TCP, `WriteAsync` throws |
| gRPC keepalive PING | Network partition, silent crash, firewall drop | ~25s | HTTP/2 PING frames, `CancellationToken` fires |
| Session timeout | Misconfigured keepalive, long-lived zombie streams | 4 hours | `CancellationTokenSource.CancelAfter` |
Keepalive settings are configurable via `CommunicationOptions`:
- `GrpcKeepAlivePingDelay`: 15 seconds (default)
- `GrpcKeepAlivePingTimeout`: 10 seconds (default)
- `GrpcMaxStreamLifetime`: 4 hours (default)
- `GrpcMaxConcurrentStreams`: 100 (default)
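The server-side half of this configuration can be sketched with minimal hosting, assuming `CommunicationOptions` carries the settings above and `GrpcPort` defaults to 8083; the exact option binding is illustrative:

```csharp
// Sketch of the site-node host wiring (Program.cs style).
var builder = WebApplication.CreateBuilder(args);
var opts = builder.Configuration.GetSection("Communication")
    .Get<CommunicationOptions>();

builder.WebHost.ConfigureKestrel(kestrel =>
{
    // Dedicated HTTP/2 port for the gRPC streaming server (default 8083).
    kestrel.ListenAnyIP(opts.GrpcPort, listen => listen.Protocols = HttpProtocols.Http2);

    // Server-side HTTP/2 keepalive PINGs detect dead clients in ~25s
    // worst case with the defaults (15s delay + 10s timeout).
    kestrel.Limits.Http2.KeepAlivePingDelay = opts.GrpcKeepAlivePingDelay;
    kestrel.Limits.Http2.KeepAlivePingTimeout = opts.GrpcKeepAlivePingTimeout;
});

builder.Services.AddGrpc();
var app = builder.Build();
app.MapGrpcService<SiteStreamGrpcServer>();

// The per-stream 4-hour cap (GrpcMaxStreamLifetime) is applied inside the
// service via CancellationTokenSource.CancelAfter, not at the host level.
app.Run();
```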
### 6a. Debug Snapshot (Central → Site)
- **Pattern**: Request/Response (one-shot, no subscription).
- Central sends a `DebugSnapshotRequest` (identified by instance unique name) to the site.
@@ -91,12 +127,17 @@ Both central and site clusters. Each side has communication actors that handle m
```
Central Cluster
├── ClusterClient → Site A Cluster (SiteCommunicationActor via Receptionist)
├── ClusterClient → Site B Cluster (SiteCommunicationActor via Receptionist)
└── ClusterClient → Site N Cluster (SiteCommunicationActor via Receptionist)
├── ClusterClient → Site A Cluster (SiteCommunicationActor via Receptionist) [command/control]
├── ClusterClient → Site B Cluster (SiteCommunicationActor via Receptionist) [command/control]
└── ClusterClient → Site N Cluster (SiteCommunicationActor via Receptionist) [command/control]
├── SiteStreamGrpcClient ◄── gRPC stream ── Site A (SiteStreamGrpcServer) [real-time data]
├── SiteStreamGrpcClient ◄── gRPC stream ── Site B (SiteStreamGrpcServer) [real-time data]
└── SiteStreamGrpcClient ◄── gRPC stream ── Site N (SiteStreamGrpcServer) [real-time data]
Site Clusters
└── ClusterClient → Central Cluster (CentralCommunicationActor via Receptionist)
├── ClusterClient → Central Cluster (CentralCommunicationActor via Receptionist) [command/control]

└── SiteStreamGrpcServer (Kestrel HTTP/2, port 8083) → serves gRPC streams [real-time data]
```
- Sites do **not** communicate with each other.
@@ -107,8 +148,8 @@ Site Clusters
Central discovers site addresses through the **configuration database**, not runtime registration:
- Each site record in the Sites table includes optional **NodeAAddress** and **NodeBAddress** fields containing base Akka addresses of the site's cluster nodes (e.g., `akka.tcp://scadalink@host:port`).
- The **CentralCommunicationActor** loads all site addresses from the database at startup and creates one **ClusterClient per site**, configured with both NodeA and NodeB as contact points.
- Each site record in the Sites table includes optional **NodeAAddress** and **NodeBAddress** fields containing base Akka addresses of the site's cluster nodes (e.g., `akka.tcp://scadalink@host:port`), and optional **GrpcNodeAAddress** and **GrpcNodeBAddress** fields containing gRPC endpoints (e.g., `http://host:8083`).
- The **CentralCommunicationActor** loads all site addresses from the database at startup and creates one **ClusterClient per site**, configured with both NodeA and NodeB as contact points. The **SiteStreamGrpcClientFactory** uses `GrpcNodeAAddress` / `GrpcNodeBAddress` to create per-site gRPC channels for streaming.
- The address cache is **refreshed every 60 seconds** and **on-demand** when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
- When routing a message to a site, central sends via `ClusterClient.Send("/user/site-communication", msg)`. **ClusterClient handles failover between NodeA and NodeB internally** — there is no application-level NodeA preference/NodeB fallback logic.
- **Heartbeats** from sites serve **health monitoring only** — they do not serve as a registration or address discovery mechanism.
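The factory's per-site caching can be sketched as below; the `Site` shape (a `Guid` key plus the gRPC address fields described above) and the disposal-on-address-change handling are assumptions:

```csharp
// Sketch of SiteStreamGrpcClientFactory — NodeB fallback wiring elided.
public sealed class SiteStreamGrpcClientFactory
{
    private readonly ConcurrentDictionary<Guid, SiteStreamGrpcClient> _clients = new();

    public SiteStreamGrpcClient GetOrCreate(Site site) =>
        _clients.GetOrAdd(site.Id, _ =>
            new SiteStreamGrpcClient(site.GrpcNodeAAddress)); // NodeB used on failure

    // Called on site removal or when GrpcNodeAAddress/GrpcNodeBAddress change.
    public void Remove(Guid siteId)
    {
        if (_clients.TryRemove(siteId, out var client))
            client.Dispose(); // closes the underlying GrpcChannel
    }
}
```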
@@ -166,7 +207,7 @@ The ManagementActor is registered at the well-known path `/user/management` on c
## Connection Failure Behavior
- **In-flight messages**: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central** — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
- **Debug streams**: Any connection interruption (failover or network blip) kills the debug stream. The `DebugStreamBridgeActor` is stopped and the consumer is notified via `OnStreamTerminated`. The engineer must reopen the debug view to re-establish the subscription with a fresh snapshot. There is no auto-resume.
- **Debug streams**: Any gRPC stream interruption triggers reconnection logic in the `DebugStreamBridgeActor`. The bridge actor attempts to reconnect to the other site node endpoint (NodeB if NodeA failed, or vice versa), with up to 3 retries and 5-second backoff. If all retries fail, the consumer is notified via `OnStreamTerminated` and the bridge actor is stopped. Events during the reconnection gap are lost (acceptable for real-time debug view). On successful reconnection, the consumer can request a fresh snapshot to re-sync state.
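The retry policy above can be sketched as a small helper: alternate between the two node endpoints, starting with the other node, with up to 3 attempts and a 5-second backoff. The generic `tryConnect` delegate is an assumption standing in for the bridge actor's stream-open call:

```csharp
using System;
using System.Threading.Tasks;

static class Failover
{
    // Returns the endpoint that reconnected, or null if all retries failed
    // (at which point the consumer is notified via OnStreamTerminated).
    public static async Task<string?> ReconnectAsync(
        Func<string, Task<bool>> tryConnect,
        string nodeA, string nodeB,
        int maxRetries = 3, TimeSpan? backoff = null)
    {
        var delay = backoff ?? TimeSpan.FromSeconds(5);
        var endpoints = new[] { nodeB, nodeA }; // try the *other* node first
        for (int attempt = 0; attempt < maxRetries; attempt++)
        {
            var endpoint = endpoints[attempt % 2];
            if (await tryConnect(endpoint))
                return endpoint; // reconnected; consumer re-requests a snapshot
            await Task.Delay(delay);
        }
        return null;
    }
}
```

Events in the gap are lost either way, which is why a fresh snapshot request after reconnection is the re-sync mechanism.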
## Failover Behavior
@@ -175,9 +216,11 @@ The ManagementActor is registered at the well-known path `/user/management` on c
## Dependencies
- **Akka.NET Remoting + ClusterClient**: Provides the transport layer. ClusterClient/ClusterClientReceptionist used for all cross-cluster messaging.
- **Akka.NET Remoting + ClusterClient**: Provides the command/control transport layer. ClusterClient/ClusterClientReceptionist used for cross-cluster command/control messaging (deployments, lifecycle, subscribe/unsubscribe handshake, snapshots).
- **gRPC (Grpc.AspNetCore + Grpc.Net.Client)**: Provides the real-time data streaming transport. Site nodes host a gRPC server (SiteStreamGrpcServer); central nodes create per-site gRPC clients (SiteStreamGrpcClient).
- **Cluster Infrastructure**: Manages node roles and failover detection.
- **Configuration Database**: Provides site node addresses (NodeAAddress, NodeBAddress) for address resolution.
- **Configuration Database**: Provides site node addresses (NodeAAddress, NodeBAddress for Akka remoting; GrpcNodeAAddress, GrpcNodeBAddress for gRPC streaming) for address resolution.
- **Site Runtime (SiteStreamManager)**: The SiteStreamGrpcServer subscribes to SiteStreamManager to receive real-time events for gRPC delivery.
## Interactions