Replaces ClusterClient-based event streaming with dedicated gRPC server-streaming channels. Covers proto definition, server/client patterns, Channel<T> bridging, keepalive/orphan prevention, failover scenarios, port/address configuration, extensibility guide for new event types, testing strategy, and implementation guardrails.
gRPC Streaming Channel: Site → Central Real-Time Data
Context
Debug streaming events currently flow through Akka.NET ClusterClient (InstanceActor → SiteCommunicationActor → ClusterClient.Send → CentralCommunicationActor → bridge actor). ClusterClient wasn't built for high-throughput value streaming — it's a cluster coordination tool with gossip-based routing. As we scale beyond debug view to health streaming, alarm feeds, or future live dashboards, pushing all real-time data through ClusterClient will become a bottleneck.
Goal: Add a dedicated gRPC server-streaming channel on each site node. Central subscribes to sites over gRPC for real-time data. ClusterClient continues to handle command/control (subscribe, unsubscribe, deploy, lifecycle) but all streaming values flow through the gRPC channel.
Scope: General-purpose site→central streaming transport. Debug view is the first consumer, but the proto and server are designed so future features (health streaming, alarm feeds, live dashboards) can subscribe with different event types and filters.
Why gRPC Streaming Instead of ClusterClient
| Concern | ClusterClient | gRPC Server Streaming |
|---|---|---|
| Purpose | Cluster coordination, service discovery, request/response | High-throughput data streaming |
| Sender preservation | Temporary proxy ref — breaks for stored future Tells | N/A — callback-based, no actor refs cross boundary |
| Flow control | None (fire-and-forget Tell) | HTTP/2 flow control + Channel backpressure |
| Scalability | Gossip-based routing, single receptionist | Direct TCP/HTTP2 per-site, multiplexed streams |
| Reconnection | ClusterClient auto-reconnect (coarse, cluster-level) | gRPC channel-level reconnect per subscription |
| Serialization | Akka.NET Hyperion (runtime IL, fragile across versions) | Protocol Buffers (schema-driven, cross-platform) |
The DCL already uses this exact pattern — RealLmxProxyClient opens gRPC server-streaming subscriptions to LmxProxy servers for real-time tag value updates. This plan applies the same pattern to site→central communication.
Architecture
Central Cluster Site Cluster
───────────── ────────────
DebugStreamBridgeActor InstanceActor
│ │
│── SubscribeDebugView ──► │ (ClusterClient: command/control)
│◄── DebugViewSnapshot ── │
│ │
│ │ publishes AttributeValueChanged
│ │ publishes AlarmStateChanged
│ ▼
SiteStreamGrpcClient ◄──── gRPC stream ───── SiteStreamGrpcServer
(per-site, on central) (HTTP/2) (Kestrel, on site)
│ │
│ reads from gRPC stream │ receives from SiteStreamManager
│ routes by correlationId │ filters by instance name
▼ │
DebugStreamBridgeActor │
│ │
▼ │
SignalR Hub / Blazor UI │
Key separation: ClusterClient handles subscribe/unsubscribe/snapshot (request-response). gRPC handles the ongoing value stream (server-streaming).
Port & Address Configuration
Site-Side (appsettings)
ScadaLink:Node:GrpcPort — explicit config setting, not derived from RemotingPort:
"Node": {
"Role": "Site",
"NodeHostname": "scadalink-site-a-a",
"RemotingPort": 8082,
"GrpcPort": 8083
}
Why explicit, not offset: RemotingPort is itself a config value (8081 central, 8082 sites). A rigid offset silently breaks if someone changes RemotingPort to a non-standard value. Explicit ports are visible and independently configurable.
Add GrpcPort to NodeOptions (src/ScadaLink.Host/NodeOptions.cs):
public int GrpcPort { get; set; } = 8083;
Add validation in StartupValidator (site role only — central doesn't host a gRPC streaming server).
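A minimal sketch of what that validation could look like. The class and method shape are illustrative (the real StartupValidator's API may differ); the rules — site role only, valid port range, no collision with RemotingPort — follow from the config design above:

```csharp
using System;

// Sketch only — names are hypothetical, not the real StartupValidator API.
public static class GrpcPortValidation
{
    public static void Validate(string role, int grpcPort, int remotingPort)
    {
        // Central doesn't host a gRPC streaming server, so nothing to check.
        if (!string.Equals(role, "Site", StringComparison.OrdinalIgnoreCase))
            return;

        if (grpcPort is < 1 or > 65535)
            throw new InvalidOperationException(
                $"ScadaLink:Node:GrpcPort must be between 1 and 65535, got {grpcPort}");

        // Explicit ports are independently configurable, so guard against the
        // one silent mistake that remains: pointing both at the same port.
        if (grpcPort == remotingPort)
            throw new InvalidOperationException(
                $"ScadaLink:Node:GrpcPort ({grpcPort}) must differ from RemotingPort");
    }
}
```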
Central-Side (Database — Site Entity)
Central needs to know each site node's gRPC endpoint. Add two fields to the Site entity:
Modify: src/ScadaLink.Commons/Entities/Sites/Site.cs
public class Site
{
public int Id { get; set; }
public string Name { get; set; }
public string SiteIdentifier { get; set; }
public string? Description { get; set; }
public string? NodeAAddress { get; set; } // Akka: "akka.tcp://scadalink@host:8082"
public string? NodeBAddress { get; set; } // Akka: "akka.tcp://scadalink@host:8082"
public string? GrpcNodeAAddress { get; set; } // gRPC: "http://host:8083"
public string? GrpcNodeBAddress { get; set; } // gRPC: "http://host:8083"
}
Database Migration
Add GrpcNodeAAddress and GrpcNodeBAddress nullable string columns to the Sites table. Existing sites get NULL (gRPC streaming unavailable until configured).
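An illustrative EF Core migration for the two columns. The migration class name and the `Sites` table name are assumptions — `dotnet ef migrations add` generates the authoritative version:

```csharp
using Microsoft.EntityFrameworkCore.Migrations;

// Sketch of the generated migration — both columns nullable so existing
// rows default to NULL (gRPC streaming unavailable until configured).
public partial class AddSiteGrpcAddresses : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.AddColumn<string>(
            name: "GrpcNodeAAddress", table: "Sites", nullable: true);
        migrationBuilder.AddColumn<string>(
            name: "GrpcNodeBAddress", table: "Sites", nullable: true);
    }

    protected override void Down(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.DropColumn(name: "GrpcNodeAAddress", table: "Sites");
        migrationBuilder.DropColumn(name: "GrpcNodeBAddress", table: "Sites");
    }
}
```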
Management Commands
Modify: src/ScadaLink.Commons/Messages/Management/SiteCommands.cs
public record CreateSiteCommand(
string Name, string SiteIdentifier, string? Description,
string? NodeAAddress = null, string? NodeBAddress = null,
string? GrpcNodeAAddress = null, string? GrpcNodeBAddress = null);
public record UpdateSiteCommand(
int SiteId, string Name, string? Description,
string? NodeAAddress = null, string? NodeBAddress = null,
string? GrpcNodeAAddress = null, string? GrpcNodeBAddress = null);
ManagementActor Handlers
Modify: src/ScadaLink.ManagementService/ManagementActor.cs
Update HandleCreateSite and HandleUpdateSite to pass gRPC addresses through to the repository.
CLI
Modify: src/ScadaLink.CLI/Commands/SiteCommands.cs
Add --grpc-node-a-address and --grpc-node-b-address options to site create and site update:
scadalink site create --name "Site A" --identifier site-a \
--node-a-address "akka.tcp://scadalink@site-a-a:8082" \
--node-b-address "akka.tcp://scadalink@site-a-b:8082" \
--grpc-node-a-address "http://site-a-a:8083" \
--grpc-node-b-address "http://site-a-b:8083"
Central UI
Modify: src/ScadaLink.CentralUI/Components/Pages/Admin/Sites.razor
Add two form fields below the existing Node A / Node B address inputs in the site create/edit form:
<label class="form-label small">gRPC Node A Address</label>
<input type="text" class="form-control form-control-sm" @bind="_formGrpcNodeAAddress"
placeholder="http://host:8083" />
<label class="form-label small">gRPC Node B Address</label>
<input type="text" class="form-control form-control-sm" @bind="_formGrpcNodeBAddress"
placeholder="http://host:8083" />
Add corresponding columns to the sites list table. Wire _formGrpcNodeAAddress / _formGrpcNodeBAddress into CreateSiteCommand / UpdateSiteCommand in the save handler.
SiteStreamGrpcClientFactory
Reads GrpcNodeAAddress / GrpcNodeBAddress from the Site entity (loaded by CentralCommunicationActor.LoadSiteAddressesFromDb()) when creating per-site gRPC channels. Falls back to NodeB if NodeA connection fails (same pattern as ClusterClient dual-contact-point failover).
Docker Compose Port Allocation
Modify: docker/docker-compose.yml
Expose gRPC ports for each site node (internal 8083):
- Site-A: 9023:8083 (node A), 9024:8083 (node B)
- Site-B: 9033:8083 (node A), 9034:8083 (node B)
- Site-C: 9043:8083 (node A), 9044:8083 (node B)
Files Affected by Port & Address Configuration
| File | Change |
|---|---|
| src/ScadaLink.Host/NodeOptions.cs | Add GrpcPort property |
| src/ScadaLink.Host/StartupValidator.cs | Validate GrpcPort for site role |
| src/ScadaLink.Host/appsettings.Site.json | Add GrpcPort: 8083 |
| src/ScadaLink.Commons/Entities/Sites/Site.cs | Add GrpcNodeAAddress, GrpcNodeBAddress |
| src/ScadaLink.Commons/Messages/Management/SiteCommands.cs | Add gRPC address params |
| src/ScadaLink.ConfigurationDatabase/ | EF migration for new columns |
| src/ScadaLink.ManagementService/ManagementActor.cs | Pass gRPC addresses in handlers |
| src/ScadaLink.CLI/Commands/SiteCommands.cs | Add --grpc-node-a-address / --grpc-node-b-address |
| src/ScadaLink.CentralUI/Components/Pages/Admin/Sites.razor | Add gRPC address form fields + table columns |
| docker/docker-compose.yml | Expose gRPC ports |
Proto Definition
File: src/ScadaLink.Communication/Protos/sitestream.proto
The oneof event pattern is extensible — future event types (health metrics, connection state changes, etc.) are added as new fields without breaking existing consumers.
syntax = "proto3";
option csharp_namespace = "ScadaLink.Communication.Grpc";
package sitestream;
service SiteStreamService {
// Subscribe to real-time events filtered by instance.
// Server streams events until the client cancels or the site shuts down.
rpc SubscribeInstance(InstanceStreamRequest) returns (stream SiteStreamEvent);
}
message InstanceStreamRequest {
string correlation_id = 1;
string instance_unique_name = 2;
}
message SiteStreamEvent {
string correlation_id = 1;
oneof event {
AttributeValueUpdate attribute_changed = 2;
AlarmStateUpdate alarm_changed = 3;
// Future: HealthMetricUpdate health_metric = 4;
// Future: ConnectionStateUpdate connection_state = 5;
}
}
message AttributeValueUpdate {
string instance_unique_name = 1;
string attribute_path = 2;
string attribute_name = 3;
string value = 4; // string-encoded (same as LmxProxy VtqMessage pattern)
string quality = 5; // "Good", "Uncertain", "Bad"
int64 timestamp_utc_ticks = 6;
}
message AlarmStateUpdate {
string instance_unique_name = 1;
string alarm_name = 2;
int32 state = 3; // 0=Normal, 1=Active (maps to AlarmState enum)
int32 priority = 4;
int64 timestamp_utc_ticks = 5;
}
Pre-generate C# stubs and check into src/ScadaLink.Communication/SiteStreamGrpc/ (same pattern as LmxProxy — no protoc in Docker for ARM64 compatibility).
Server-Streaming Pattern (Site Side)
gRPC Server Implementation
SiteStreamGrpcServer inherits from SiteStreamService.SiteStreamServiceBase:
public override async Task SubscribeInstance(
InstanceStreamRequest request,
IServerStreamWriter<SiteStreamEvent> responseStream,
ServerCallContext context)
{
var channel = Channel.CreateBounded<SiteStreamEvent>(
new BoundedChannelOptions(1000) { FullMode = BoundedChannelFullMode.DropOldest });
// Local actor subscribes to SiteStreamManager, writes to channel
var relayActor = actorSystem.ActorOf(
Props.Create(() => new StreamRelayActor(request, channel.Writer)));
streamManager.Subscribe(request.InstanceUniqueName, relayActor);
try
{
await foreach (var evt in channel.Reader.ReadAllAsync(context.CancellationToken))
{
await responseStream.WriteAsync(evt, context.CancellationToken);
}
}
finally
{
streamManager.RemoveSubscriber(relayActor);
actorSystem.Stop(relayActor);
}
}
Channel<T> Bridging Pattern
IServerStreamWriter<T> is not thread-safe. Multiple Akka actors may publish events concurrently. The Channel<SiteStreamEvent> bridges these worlds:
Akka Actor Thread(s) gRPC Response Stream
│ ▲
│ channel.Writer.TryWrite(evt) │ await responseStream.WriteAsync(evt)
▼ │
┌─────────────────────────────────────────┐
│ Channel<SiteStreamEvent> │
│ BoundedChannelOptions(1000) │
│ FullMode = DropOldest │
└─────────────────────────────────────────┘
- Bounded capacity (1000): prevents unbounded memory growth if the gRPC client is slow
- DropOldest: matches the existing SiteStreamManager overflow strategy
- ReadAllAsync: yields items as they arrive, naturally async
Kestrel HTTP/2 Setup
Site hosts switch from Host.CreateDefaultBuilder() to WebApplicationBuilder with Kestrel configured for a dedicated gRPC port:
builder.WebHost.ConfigureKestrel(options =>
{
options.ListenAnyIP(grpcPort, listenOptions =>
{
listenOptions.Protocols = HttpProtocols.Http2; // gRPC requires HTTP/2
});
});
builder.Services.AddGrpc();
// ... existing site services ...
app.MapGrpcService<SiteStreamGrpcServer>();
Reference: infra/lmxfakeproxy/Program.cs uses the identical Kestrel setup.
Streaming Client (Central Side)
gRPC Client Implementation
SiteStreamGrpcClient manages per-site gRPC channels and streaming subscriptions:
public async Task<StreamSubscription> SubscribeAsync(
string correlationId, string instanceUniqueName,
Action<object> onEvent, CancellationToken ct)
{
var request = new InstanceStreamRequest
{
CorrelationId = correlationId,
InstanceUniqueName = instanceUniqueName
};
var call = _client.SubscribeInstance(request, cancellationToken: ct);
// Background task reads from the gRPC response stream
_ = Task.Run(async () =>
{
try
{
await foreach (var evt in call.ResponseStream.ReadAllAsync(ct))
{
var domainEvent = ConvertToDomainEvent(evt);
onEvent(domainEvent);
}
}
catch (RpcException ex) when (ex.StatusCode == StatusCode.Cancelled)
{
// Normal cancellation
}
}, ct);
return new StreamSubscription(correlationId, call);
}
Reference: src/ScadaLink.DataConnectionLayer/Adapters/RealLmxProxyClient.cs uses the identical background-task-reading-stream pattern for LmxProxy subscriptions.
Client Factory & Endpoint Resolution
SiteStreamGrpcClientFactory caches per-site GrpcChannel instances (same pattern as CentralCommunicationActor._siteClients caching per-site ClusterClient instances). Endpoints are resolved from the Site entity's GrpcNodeAAddress / GrpcNodeBAddress fields — no port arithmetic happens on the central side.
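A dependency-free sketch of the one-client-per-site cache described here. The real factory caches GrpcChannel instances; this sketch caches resolved endpoint URIs so it stays self-contained, and the Invalidate method is an assumption covering the "refresh on address change" behavior:

```csharp
using System;
using System.Collections.Concurrent;

// Sketch of the per-site cache pattern — names are illustrative, not the
// real SiteStreamGrpcClientFactory API.
public sealed class SiteEndpointCache
{
    private readonly ConcurrentDictionary<int, Uri> _endpoints = new();

    // First use resolves (e.g. from the Site entity's GrpcNodeAAddress);
    // subsequent calls reuse the cached entry.
    public Uri GetOrCreate(int siteId, Func<int, Uri> resolve) =>
        _endpoints.GetOrAdd(siteId, resolve);

    // When a site's addresses change via the management API, drop the cached
    // entry so the next subscription re-resolves from the database.
    public void Invalidate(int siteId) => _endpoints.TryRemove(siteId, out _);
}
```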
Failover & Reconnection
Four failure scenarios to handle, each with different behavior:
1. Site Node Failover (Active → Standby)
What happens: The active site node goes down. The site's Akka cluster promotes the standby to active. The Deployment Manager singleton moves to the new node, recreating Instance Actors from persisted config.
gRPC impact: The gRPC stream on the old node breaks — ResponseStream.MoveNext() throws RpcException on the central client.
Central response (DebugStreamBridgeActor):
- gRPC stream breaks → onStreamError callback fires
- Bridge actor receives the error, enters reconnecting state
- Attempts to open a new gRPC stream to the site's NodeB address (via SiteStreamGrpcClientFactory failover)
- If NodeB succeeds → stream resumes, new events flow. The consumer (SignalR/Blazor) sees a brief gap but no action needed.
- If both nodes unreachable → onTerminated callback fires, session ends, consumer notified
Data gap: Events that occurred between the old node dying and the new stream connecting are lost. This is acceptable for debug view (real-time monitoring, not historical replay). If needed, the consumer can request a fresh snapshot via ClusterClient after reconnection to re-sync state.
Reconnection timing: The bridge actor should retry with backoff:
- Immediate retry to NodeB (site failover is fast, ~25s for Akka singleton handover)
- If NodeB fails, retry NodeA after 5s (original node may have restarted)
- Max 3 retries, then give up and terminate the session
// In DebugStreamBridgeActor, on gRPC stream error:
private void HandleGrpcStreamError(Exception ex)
{
_log.Warning("gRPC stream broke for {0}: {1}", _instanceUniqueName, ex.Message);
if (_retryCount >= MaxRetries)
{
_onTerminated();
Context.Stop(Self);
return;
}
_retryCount++;
// Try the other node, then cycle back
_currentEndpoint = _currentEndpoint == _grpcNodeA ? _grpcNodeB : _grpcNodeA;
Context.System.Scheduler.ScheduleTellOnce(
TimeSpan.FromSeconds(_retryCount > 1 ? 5 : 0),
Self, new ReconnectGrpcStream(), ActorRefs.NoSender);
}
2. Central Node Failover (Active → Standby)
What happens: The active central node goes down. The standby becomes leader within ~25s (Akka failover). Traefik detects via /health/active and routes new traffic to the new leader.
gRPC impact: All SiteStreamGrpcClient instances, GrpcChannels, and DebugStreamBridgeActors on the old central node are destroyed. The site-side gRPC server detects dead clients via keepalive (see Connection Keepalive section) and cleans up.
Central response: Nothing to do — the old node is dead. On the new active node:
- Users reconnect to the debug view (Blazor circuit was lost with the old node)
- CLI clients reconnect via .WithAutomaticReconnect() (SignalR)
- Fresh DebugStreamBridgeActor + gRPC stream created on demand
No automatic session migration: Debug sessions are not persisted. When central fails over, active debug/stream sessions end. Users re-subscribe. This is consistent with how the system already handles central failover for all stateful sessions (Blazor circuits, SignalR connections).
3. Network Partition (Central ↔ Site Temporarily Unreachable)
What happens: Network between central and site drops but both clusters are running fine.
gRPC impact: gRPC keepalive pings fail on both sides:
- Site side: Detects dead client within ~25s, tears down subscription (see Keepalive section)
- Central side: ResponseStream.MoveNext() throws RpcException after keepalive timeout
Central response: Same as site node failover — bridge actor enters reconnecting state, retries with backoff. When network recovers, reconnection succeeds and streaming resumes.
ClusterClient behavior: ClusterClient also detects the partition independently (transport heartbeat failure, 10s threshold). CentralCommunicationActor fires ConnectionStateChanged(isConnected: false) and sends DebugStreamTerminated to the bridge actor. The bridge actor may receive both the gRPC error and the DebugStreamTerminated — it should handle both idempotently (first one triggers reconnect/terminate, second is ignored).
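The idempotency requirement above boils down to a once-only guard: whichever signal arrives first triggers reconnect/terminate, and the other is a no-op. A minimal sketch (names are illustrative — inside an actor, where message handling is single-threaded, a plain bool field would suffice; Interlocked is shown for generality):

```csharp
using System;
using System.Threading;

// Sketch: both the gRPC stream error and DebugStreamTerminated funnel into
// this guard; only the first caller acts.
public sealed class TerminationGuard
{
    private int _terminated; // 0 = live, 1 = terminated

    // Returns true exactly once, regardless of how many signals race in.
    public bool TryTerminate() =>
        Interlocked.Exchange(ref _terminated, 1) == 0;
}
```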
4. Site Node Restart (Same Node Comes Back)
What happens: A site node restarts (e.g., Windows Service restart, container recreation). The Akka cluster reforms, Instance Actors are recreated.
gRPC impact: Same as site node failover — the gRPC stream on that node was broken when the process died. The gRPC server starts fresh on the restarted node.
Central response: Bridge actor reconnects (same retry logic as scenario 1). The new gRPC stream connects to the restarted node's fresh SiteStreamGrpcServer, which subscribes to the newly recreated SiteStreamManager.
Reconnection State Machine (DebugStreamBridgeActor)
┌──────────────────┐
│ Streaming │ ◄── Normal state: gRPC stream active
└────────┬─────────┘
│ gRPC stream error / keepalive timeout
▼
┌──────────────────┐
┌──► │ Reconnecting │ ── try other node endpoint
│ └────────┬─────────┘
│ │
│ ┌────────┴─────────┐
│ │ │
│ success failure (retry < max)
│ │ │
│ ▼ │
│ Streaming schedule retry (5s backoff)
│ │
└───────────────────────┘
│
failure (retry >= max)
│
▼
┌──────────────────┐
│ Terminated │ ── notify consumer, stop actor
└──────────────────┘
Summary
| Scenario | Site Cleanup | Central Response | Data Gap |
|---|---|---|---|
| Site failover | Automatic (process death) | Reconnect to NodeB, retry NodeA | Yes (~30s) |
| Central failover | Keepalive timeout (~25s) | Sessions lost, user re-subscribes | Session ends |
| Network partition | Keepalive timeout (~25s) | Reconnect with backoff | Yes (partition duration) |
| Site restart | Automatic (process death) | Reconnect to restarted node | Yes (~30s) |
Backpressure and Flow Control
Three layers of flow control:
- gRPC/HTTP2: Built-in TCP flow control. If the central client is slow reading, the site's WriteAsync eventually blocks.
- Channel<T>: Bounded at 1000 with DropOldest. If gRPC is backpressured AND the channel fills, oldest events are dropped (consistent with SiteStreamManager's existing overflow strategy).
- SiteStreamManager: Akka.Streams source with per-subscriber bounded buffer (configurable via SiteRuntimeOptions.StreamBufferSize, default 1000, DropHead).
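Layer 2's overflow behavior can be seen in isolation with a small capacity. With DropOldest, a stalled reader loses the oldest buffered events, never the newest — exactly the semantics the plan relies on:

```csharp
using System;
using System.Threading.Channels;

// Capacity 3 stands in for the real 1000. Writing 5 items while nothing
// reads drops the two oldest.
var channel = Channel.CreateBounded<int>(
    new BoundedChannelOptions(capacity: 3) { FullMode = BoundedChannelFullMode.DropOldest });

for (var i = 1; i <= 5; i++)
    channel.Writer.TryWrite(i); // 1 and 2 are evicted once capacity is hit

channel.Writer.Complete();
while (channel.Reader.TryRead(out var evt))
    Console.WriteLine(evt); // prints 3, 4, 5
```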
Connection Keepalive & Orphan Stream Prevention
If the central cluster crashes, loses network, or fails over without unsubscribing, the site must detect the dead client and tear down the streaming subscription. Three complementary layers handle this:
1. TCP-Level Detection (seconds — clean disconnect)
When the central process dies or the TCP connection resets cleanly, the site's IServerStreamWriter.WriteAsync() throws RpcException with StatusCode.Cancelled. The gRPC method's finally block runs, cleaning up the SiteStreamManager subscription. This fires within seconds on a clean TCP RST.
2. gRPC Keepalive Pings (10–25s — network partition / silent failure)
gRPC supports HTTP/2 PING frames for proactive liveness detection. Configure on both sides:
Site server (Kestrel):
builder.Services.AddGrpc();
// Note: server-side keepalive is NOT a GrpcServiceOptions setting — it is
// configured via Kestrel's Http2Limits below. If the server sends a PING and
// gets no ACK within KeepAlivePingTimeout, it closes the connection. This
// catches silent client death (crash without TCP RST, network partition).
// Kestrel HTTP/2 keep-alive settings
builder.WebHost.ConfigureKestrel(options =>
{
options.ListenAnyIP(grpcPort, listenOptions =>
{
listenOptions.Protocols = HttpProtocols.Http2;
});
// Kestrel sends HTTP/2 PING frames at this interval
options.Limits.Http2.KeepAlivePingDelay = TimeSpan.FromSeconds(15);
// Close connection if PING ACK not received within this timeout
options.Limits.Http2.KeepAlivePingTimeout = TimeSpan.FromSeconds(10);
});
Central client (GrpcChannel):
var channel = GrpcChannel.ForAddress(endpoint, new GrpcChannelOptions
{
HttpHandler = new SocketsHttpHandler
{
// Client sends PING frames at this interval to keep the connection alive
// and detect server-side death
KeepAlivePingDelay = TimeSpan.FromSeconds(15),
// Close connection if no PING ACK within this timeout
KeepAlivePingTimeout = TimeSpan.FromSeconds(10),
// Send pings even when no active streams (keeps channel warm for fast reconnect)
KeepAlivePingPolicy = HttpKeepAlivePingPolicy.Always
}
});
Detection timeline: If central dies silently (no TCP RST), the site detects within KeepAlivePingDelay + KeepAlivePingTimeout = ~25 seconds. The ServerCallContext.CancellationToken fires, unblocking the ReadAllAsync loop, and the finally block cleans up.
3. Server-Side Stream Timeout (safety net — defense in depth)
As a final safety net, the gRPC server method can enforce a maximum stream duration or idle timeout. This catches edge cases where keepalive pings are disabled or misconfigured:
// In SiteStreamGrpcServer.SubscribeInstance():
// Link the gRPC cancellation token with a maximum session timeout
using var sessionTimeout = CancellationTokenSource.CreateLinkedTokenSource(
context.CancellationToken);
sessionTimeout.CancelAfter(TimeSpan.FromHours(4)); // max stream lifetime
await foreach (var evt in channel.Reader.ReadAllAsync(sessionTimeout.Token))
{
await responseStream.WriteAsync(evt, sessionTimeout.Token);
}
The 4-hour lifetime is a safety net, not the primary detection mechanism. Normal cleanup happens via gRPC keepalive (25s) or TCP reset (seconds).
Summary of Detection Layers
| Layer | Detects | Timeline | Mechanism |
|---|---|---|---|
| TCP RST | Clean process death, connection close | 1–5s | OS-level TCP, WriteAsync throws |
| gRPC keepalive PING | Network partition, silent crash, firewall drop | ~25s | HTTP/2 PING frames, CancellationToken fires |
| Session timeout | Misconfigured keepalive, long-lived zombie streams | 4 hours | CancellationTokenSource.CancelAfter |
All three trigger the same cleanup path: CancellationToken cancels → ReadAllAsync exits → finally block removes SiteStreamManager subscription and stops relay actor.
Configuration Defaults
Add to CommunicationOptions (src/ScadaLink.Communication/CommunicationOptions.cs):
public TimeSpan GrpcKeepAlivePingDelay { get; set; } = TimeSpan.FromSeconds(15);
public TimeSpan GrpcKeepAlivePingTimeout { get; set; } = TimeSpan.FromSeconds(10);
public TimeSpan GrpcMaxStreamLifetime { get; set; } = TimeSpan.FromHours(4);
Bind from appsettings.json under ScadaLink:Communication (existing options section).
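Assuming the existing ScadaLink:Communication section, the bound keys might look like the fragment below (the .NET configuration binder parses TimeSpan values in "hh:mm:ss" format):

```json
{
  "ScadaLink": {
    "Communication": {
      "GrpcKeepAlivePingDelay": "00:00:15",
      "GrpcKeepAlivePingTimeout": "00:00:10",
      "GrpcMaxStreamLifetime": "04:00:00"
    }
  }
}
```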
Security Considerations
- Internal Docker network: Plain HTTP/2 (no TLS) — all site↔central traffic flows within the scadalink-net Docker bridge network. Same as current Akka.NET remoting (unencrypted by default).
- Production: Enable TLS on the Kestrel gRPC endpoint via HTTPS certificate configuration. The GrpcChannel on central switches to the https:// scheme.
- No authentication on the gRPC channel: The channel is internal infrastructure (site↔central only). Authentication happens at the user-facing boundaries (LDAP/JWT for UI, Basic Auth for CLI/Management API).
Adding New Event Types
The streaming channel is designed to carry any event type that can be scoped to an instance. Adding a new event type requires changes at four layers: proto, SiteStreamManager, gRPC server relay, and gRPC client conversion.
Requirements for a New Event Type
- Must carry InstanceUniqueName — SiteStreamManager.ForwardToSubscribers() filters by instance name. Without it, the event can't be routed to the correct subscriber.
- Must be published to SiteStreamManager — the gRPC server subscribes to SiteStreamManager via a relay actor. Any event that reaches the relay actor gets written to the gRPC stream.
- Must have a proto message — the event crosses the gRPC wire boundary, so it needs a protobuf representation in sitestream.proto.
- Must be added to the oneof event in SiteStreamEvent — this is how the client knows which event type arrived. Existing field numbers are never reused.
Step-by-Step: Adding a New Event Type
Example: adding ScriptErrorEvent to the stream.
1. Define the domain record
File: src/ScadaLink.Commons/Messages/Streaming/ScriptErrorEvent.cs
namespace ScadaLink.Commons.Messages.Streaming;
public record ScriptErrorEvent(
string InstanceUniqueName, // Required for filtering
string ScriptName,
string ErrorMessage,
DateTimeOffset Timestamp);
2. Add the proto message
File: src/ScadaLink.Communication/Protos/sitestream.proto
Add the message definition and a new field to the oneof:
message SiteStreamEvent {
string correlation_id = 1;
oneof event {
AttributeValueUpdate attribute_changed = 2;
AlarmStateUpdate alarm_changed = 3;
ScriptErrorUpdate script_error = 4; // ← new field, next available number
}
}
message ScriptErrorUpdate {
string instance_unique_name = 1;
string script_name = 2;
string error_message = 3;
int64 timestamp_utc_ticks = 4;
}
Re-generate the C# stubs and check them into SiteStreamGrpc/.
3. Publish to SiteStreamManager
File: src/ScadaLink.SiteRuntime/Streaming/SiteStreamManager.cs
Add a publish method (follows the existing pattern exactly):
public void PublishScriptError(ScriptErrorEvent error)
{
_sourceActor?.Tell(error);
ForwardToSubscribers(error.InstanceUniqueName, error);
}
ForwardToSubscribers already accepts object message — no changes needed to the forwarding infrastructure. It filters by instanceName and Tells the message to all matching subscribers.
4. Call the publish method from the source actor
File: wherever the event originates (e.g., ScriptExecutionActor.cs)
_streamManager?.PublishScriptError(new ScriptErrorEvent(
_instanceUniqueName, _scriptName, ex.Message, DateTimeOffset.UtcNow));
5. Handle in the gRPC server relay actor
File: src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs (the StreamRelayActor)
The relay actor receives events from SiteStreamManager and writes them to the Channel<SiteStreamEvent>. Add a Receive<ScriptErrorEvent> handler that converts to the proto message:
Receive<ScriptErrorEvent>(evt =>
{
var protoEvt = new SiteStreamEvent
{
CorrelationId = _correlationId,
ScriptError = new ScriptErrorUpdate
{
InstanceUniqueName = evt.InstanceUniqueName,
ScriptName = evt.ScriptName,
ErrorMessage = evt.ErrorMessage,
TimestampUtcTicks = evt.Timestamp.UtcTicks
}
};
_channel.TryWrite(protoEvt);
});
6. Handle in the gRPC client converter
File: src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs (the ConvertToDomainEvent method)
Add a case for the new proto oneof variant:
private static object ConvertToDomainEvent(SiteStreamEvent evt) => evt.EventCase switch
{
SiteStreamEvent.EventOneofCase.AttributeChanged => /* existing */,
SiteStreamEvent.EventOneofCase.AlarmChanged => /* existing */,
SiteStreamEvent.EventOneofCase.ScriptError => new ScriptErrorEvent(
evt.ScriptError.InstanceUniqueName,
evt.ScriptError.ScriptName,
evt.ScriptError.ErrorMessage,
new DateTimeOffset(evt.ScriptError.TimestampUtcTicks, TimeSpan.Zero)),
_ => evt // Unknown types pass through as raw proto
};
7. Handle in the consumer (if needed)
The DebugStreamBridgeActor already passes events to its _onEvent callback as object. Consumers that care about the new type add a pattern match:
// In SignalR hub callback:
ScriptErrorEvent error =>
hubClients.Client(connectionId).SendAsync("OnScriptError", error),
// In Blazor component:
case ScriptErrorEvent error:
_scriptErrors.Add(error);
_ = InvokeAsync(StateHasChanged);
break;
Checklist for New Event Types
| Step | File(s) | What to Do |
|---|---|---|
| Domain record | Commons/Messages/Streaming/ | Create record with InstanceUniqueName field |
| Proto message | Communication/Protos/sitestream.proto | Add message + oneof field (next number) |
| Regenerate stubs | Communication/SiteStreamGrpc/ | Run protoc, check in generated files |
| Publish method | SiteRuntime/Streaming/SiteStreamManager.cs | Add PublishXxx() method |
| Source actor | Wherever event originates | Call _streamManager?.PublishXxx(...) |
| Server relay | Communication/Grpc/SiteStreamGrpcServer.cs | Add Receive<T> → proto conversion |
| Client converter | Communication/Grpc/SiteStreamGrpcClient.cs | Add EventOneofCase → domain conversion |
| Consumer (optional) | Hub, Blazor, CLI | Add pattern match for new type |
Proto Versioning Rules
- Never reuse field numbers — deleted fields' numbers are reserved forever
- Add new oneof variants with the next available field number — old clients ignore unknown fields
- Never change existing field types or numbers — this breaks wire compatibility
- The oneof pattern guarantees forward compatibility — a client that doesn't know about ScriptErrorUpdate simply sees EventCase == None and can skip it
Existing Codebase References
| Pattern | File | Relevance |
|---|---|---|
| Proto file definition | src/ScadaLink.DataConnectionLayer/Adapters/Protos/scada.proto | Same proto3 syntax, server streaming (stream VtqMessage) |
| Pre-generated C# stubs | src/ScadaLink.DataConnectionLayer/Adapters/LmxProxyGrpc/ | Same approach — checked-in stubs, no protoc at build time |
| gRPC client + stream reader | src/ScadaLink.DataConnectionLayer/Adapters/RealLmxProxyClient.cs | Background task reads ResponseStream, invokes callback |
| gRPC server implementation | infra/lmxfakeproxy/Services/ScadaServiceImpl.cs | Service base class override pattern |
| Kestrel HTTP/2 setup | infra/lmxfakeproxy/Program.cs | HttpProtocols.Http2, AddGrpc(), MapGrpcService<T>() |
| SiteStreamManager | src/ScadaLink.SiteRuntime/Streaming/SiteStreamManager.cs | Subscribe/filter by instance, per-subscriber buffer, DropHead overflow |
| Per-site client caching | CentralCommunicationActor._siteClients dictionary | One client per site, refresh on address change |
| Bridge actor pattern | src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs | Per-session actor with callbacks, adapted to use gRPC instead of Akka messages |
Implementation Summary
New Files
| File | Purpose |
|---|---|
| src/ScadaLink.Communication/Protos/sitestream.proto | Proto definition |
| src/ScadaLink.Communication/SiteStreamGrpc/ | Pre-generated C# stubs |
| src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs | Site-side gRPC streaming server |
| src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs | Central-side gRPC streaming client |
| src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs | Per-site gRPC client factory/cache — reads GrpcNodeA/BAddress from Site entity |
Modified Files
| File | Change |
|---|---|
| src/ScadaLink.Communication/ScadaLink.Communication.csproj | Add Grpc.AspNetCore + Grpc.Net.Client packages |
| src/ScadaLink.Host/Program.cs | Site: switch to WebApplicationBuilder + Kestrel gRPC |
| src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs | Use gRPC client for streaming (ClusterClient for snapshot only) |
| src/ScadaLink.Communication/DebugStreamService.cs | Inject SiteStreamGrpcClientFactory |
| src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs | Remove DebugStreamEvent forwarding (publish to SiteStreamManager only) |
| src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs | Remove Receive<DebugStreamEvent> handler |
| src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs | Remove HandleDebugStreamEvent |
| docker/docker-compose.yml | Expose gRPC ports for site nodes |
Deleted Files
| File | Reason |
|---|---|
| src/ScadaLink.Commons/Messages/DebugView/DebugStreamEvent.cs | No longer needed — events flow via gRPC, not ClusterClient |
Design Review Notes
The following concerns were identified during external review. Items marked [V1] should be addressed in the initial implementation. Items marked [Future] are noted for consideration as the transport scales beyond debug view.
[V1] Snapshot-to-Stream Handoff Race
The initial snapshot arrives via ClusterClient, then the gRPC stream opens separately. Events between snapshot generation and stream establishment can be missed or duplicated.
Mitigation: Open the gRPC stream first, then request the snapshot via ClusterClient. The gRPC stream buffers events from the moment it connects. The consumer applies the snapshot as the baseline, then replays any buffered gRPC events with timestamps newer than the snapshot. This is a simple timestamp-based dedup — no sequence numbers needed for V1.
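The consumer-side handoff can be sketched as follows. This is a minimal sketch only; `DebugEvent`, `Snapshot`, `Apply`, and `ApplyBaseline` are illustrative names, not actual project types:

```csharp
// Sketch: the gRPC stream is opened first and its events are buffered until
// the ClusterClient snapshot arrives. The snapshot becomes the baseline;
// buffered events older than the snapshot are discarded (already reflected
// in the baseline), newer ones are replayed. Timestamp-based dedup, no
// sequence numbers.
var buffered = new List<DebugEvent>();
var snapshotApplied = false;

void OnGrpcEvent(DebugEvent evt)
{
    if (!snapshotApplied)
    {
        buffered.Add(evt);   // stream connected before the snapshot landed
        return;
    }
    Apply(evt);
}

void OnSnapshot(Snapshot snap)
{
    ApplyBaseline(snap);
    foreach (var evt in buffered.Where(e => e.Timestamp > snap.GeneratedAt))
        Apply(evt);          // replay only events newer than the snapshot
    buffered.Clear();
    snapshotApplied = true;
}
```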
[V1] Stream Authority — Which Site Node to Connect To
Both site nodes may be running, but only the active node (hosting the Deployment Manager singleton) has live Instance Actors and a populated SiteStreamManager.
Rule: Central connects to the site node whose Akka address matches the singleton owner. In practice, CentralCommunicationActor already tracks which site node is reachable via ClusterClient — the gRPC client factory should use the same node selection. Try GrpcNodeAAddress first; on failure, try GrpcNodeBAddress.
The standby site node's gRPC server will accept connections but its SiteStreamManager will have no subscribers and no events. A connected stream to the standby will simply be idle (no data). On site failover, the bridge actor reconnects and picks up the new active node.
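The node-selection rule might look like the following sketch, assuming `Site` exposes the `GrpcNodeAAddress`/`GrpcNodeBAddress` properties this design adds:

```csharp
// Sketch: prefer Node A, fall back to Node B. A real client may skip the
// explicit probe and simply open the stream, treating connection failure
// (or StatusCode.Unavailable) as "try the other node".
async Task<GrpcChannel> ConnectToActiveNodeAsync(Site site, CancellationToken ct)
{
    foreach (var address in new[] { site.GrpcNodeAAddress, site.GrpcNodeBAddress })
    {
        var channel = GrpcChannel.ForAddress(address);
        try
        {
            await channel.ConnectAsync(ct);   // probe reachability
            return channel;
        }
        catch (Exception)
        {
            channel.Dispose();                // unreachable — try the next node
        }
    }
    throw new InvalidOperationException("Neither site node is reachable over gRPC.");
}
```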
[V1] Startup/Shutdown Ordering
Switching site host to WebApplicationBuilder requires coordinating ASP.NET Core and Akka.NET lifecycles:
- Startup: Actor system and `SiteStreamManager` must be initialized before `MapGrpcService<SiteStreamGrpcServer>()` begins accepting connections. Gate gRPC readiness on actor system startup (reject streams with `StatusCode.Unavailable` until ready).
- Shutdown: On `CoordinatedShutdown`, stop accepting new gRPC streams first, cancel all active streams (triggering client reconnect), then tear down actors. Use `IHostApplicationLifetime.ApplicationStopping` to signal the gRPC server.
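The startup gate can be sketched at the top of the streaming RPC. `SubscribeInstanceRequest`, `StreamEvent`, and the `_readiness` flag holder are illustrative names (the exact generated types depend on the final proto):

```csharp
// Sketch: reject subscriptions until the actor system is up. Unavailable is
// retryable, so the central-side client will simply reconnect later.
public override async Task SubscribeInstance(
    SubscribeInstanceRequest request,
    IServerStreamWriter<StreamEvent> responseStream,
    ServerCallContext context)
{
    if (!_readiness.ActorSystemReady)
        throw new RpcException(new Status(
            StatusCode.Unavailable, "Actor system not ready; retry shortly."));

    // ... relay events from SiteStreamManager to responseStream until
    //     context.CancellationToken fires ...
    await Task.CompletedTask;
}
```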
[V1] Duplicate Stream Prevention
Central bugs or reconnect storms could create multiple gRPC streams for the same instance/correlationId.
Rule: SiteStreamGrpcServer tracks active subscriptions by correlation_id. If a new SubscribeInstance arrives with a correlation_id that's already active, cancel the old stream before starting the new one. One stream per correlation ID.
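A sketch of that tracking, assuming the server keeps a dictionary of active streams (field and method names are illustrative):

```csharp
// Sketch: one stream per correlation id. Registering a new stream cancels
// any existing stream with the same id before taking its place.
private readonly ConcurrentDictionary<string, CancellationTokenSource> _active = new();

private CancellationTokenSource RegisterStream(string correlationId)
{
    var cts = new CancellationTokenSource();
    _active.AddOrUpdate(correlationId, cts, (_, old) =>
    {
        old.Cancel();    // terminate the stale stream before replacing it
        return cts;
    });
    return cts;
}

private void UnregisterStream(string correlationId, CancellationTokenSource cts)
{
    // Remove only if this stream is still the registered one, so a newer
    // stream that replaced us is not accidentally evicted.
    _active.TryRemove(
        new KeyValuePair<string, CancellationTokenSource>(correlationId, cts));
}
```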
[V1] Observability
Add metrics for silent degradation detection:
| Metric | Source | Purpose |
|---|---|---|
| `grpc_streams_active` | `SiteStreamGrpcServer` | Active stream count per site node |
| `grpc_streams_events_sent` | `SiteStreamGrpcServer` | Events written to gRPC, by type |
| `grpc_streams_events_dropped` | Channel writer | Events dropped due to bounded buffer overflow |
| `grpc_streams_reconnects` | `DebugStreamBridgeActor` | Reconnection count and reason |
| `grpc_streams_duration` | `SiteStreamGrpcServer` | Stream lifetime (histogram) |
Emit via Serilog structured logging (existing infrastructure). Consider Prometheus counters if metrics export is added later.
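A sketch of the Serilog form, using the metric names from the table above (property names are illustrative):

```csharp
// Sketch: emit stream metrics as Serilog structured events so they can be
// queried and aggregated by the existing logging pipeline.
using Serilog;

var streamTimer = System.Diagnostics.Stopwatch.StartNew();
// ... stream runs ...
streamTimer.Stop();

Log.Information(
    "grpc_streams_duration {DurationMs} for {CorrelationId} on {SiteNode}",
    streamTimer.ElapsedMilliseconds, correlationId, nodeName);
Log.Warning(
    "grpc_streams_events_dropped {DroppedCount} for {CorrelationId} (bounded buffer overflow)",
    droppedCount, correlationId);
```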
[V1] Proto Improvements
Use protobuf-native types instead of .NET-specific scalars:
```proto
import "google/protobuf/timestamp.proto";

enum Quality {
  QUALITY_UNSPECIFIED = 0;
  QUALITY_GOOD = 1;
  QUALITY_UNCERTAIN = 2;
  QUALITY_BAD = 3;
}

enum AlarmState {
  ALARM_STATE_UNSPECIFIED = 0;
  ALARM_STATE_NORMAL = 1;
  ALARM_STATE_ACTIVE = 2;
}

message AttributeValueUpdate {
  string instance_unique_name = 1;
  string attribute_path = 2;
  string attribute_name = 3;
  string value = 4;
  Quality quality = 5;
  google.protobuf.Timestamp timestamp = 6;
}

message AlarmStateUpdate {
  string instance_unique_name = 1;
  string alarm_name = 2;
  AlarmState state = 3;
  int32 priority = 4;
  google.protobuf.Timestamp timestamp = 5;
}
```
Reserve enum zero as UNSPECIFIED per proto3 convention. Use google.protobuf.Timestamp instead of int64 ticks for cross-platform compatibility.
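On the C# side the conversion is handled by the `Google.Protobuf.WellKnownTypes` helpers, e.g.:

```csharp
// Converting between google.protobuf.Timestamp and DateTimeOffset using the
// helpers shipped with the Google.Protobuf package.
using Google.Protobuf.WellKnownTypes;

var now = DateTimeOffset.UtcNow;
Timestamp wire = Timestamp.FromDateTimeOffset(now);   // domain → wire
DateTimeOffset roundTripped = wire.ToDateTimeOffset(); // wire → domain
```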
[V1] Max Concurrent Streams
Define limits to prevent resource exhaustion:
- Max 100 concurrent gRPC streams per site node (configurable via `CommunicationOptions`)
- Server rejects with `StatusCode.ResourceExhausted` when the limit is reached
- One stream per `correlation_id` (duplicate prevention above)
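The cap can be enforced with a simple counter around the stream body (a sketch; `_maxStreams` would come from `CommunicationOptions`, `_activeStreams` is an illustrative field):

```csharp
// Sketch: reserve a slot before serving the stream; release it on exit.
if (Interlocked.Increment(ref _activeStreams) > _maxStreams)
{
    Interlocked.Decrement(ref _activeStreams);
    throw new RpcException(new Status(
        StatusCode.ResourceExhausted,
        $"Stream limit of {_maxStreams} reached on this site node."));
}
try
{
    // ... serve the stream ...
}
finally
{
    Interlocked.Decrement(ref _activeStreams);
}
```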
Documentation Updates
All documentation changes must be completed as part of the implementation — not deferred. The design docs are the source of truth for this system's architecture.
High-Level Requirements
Modify: docs/requirements/HighLevelReqs.md
Update the following sections:
- Section 5 (Central–Site Communication): Add gRPC streaming as a transport alongside ClusterClient. Clarify that ClusterClient handles command/control and gRPC handles real-time data streaming.
- Section 13 (Non-Functional / Performance): Add gRPC streaming throughput expectations and backpressure behavior if applicable.
Component-Level Requirements
| Document | Changes |
|---|---|
| `docs/requirements/Component-Communication.md` | Pattern 6 (Debug Streaming): Replace ClusterClient streaming path with gRPC. Add `SiteStreamGrpcServer`, `SiteStreamGrpcClient`, `SiteStreamGrpcClientFactory` to component responsibilities. Add gRPC port configuration to shared settings. Update dependencies/interactions. |
| `docs/requirements/Component-SiteRuntime.md` | Update `SiteStreamManager` section: note that the gRPC server subscribes to the stream for cross-cluster delivery. `InstanceActor` no longer forwards `DebugStreamEvent` directly. |
| `docs/requirements/Component-Host.md` | Site host now uses `WebApplicationBuilder` with Kestrel HTTP/2 for gRPC. Document `GrpcPort` config and startup/shutdown ordering with Akka.NET. |
| `docs/requirements/Component-CentralUI.md` | Debug view streaming path updated (events via gRPC, not ClusterClient). |
| `docs/requirements/Component-CLI.md` | `site create` / `site update` commands updated with `--grpc-node-a-address` / `--grpc-node-b-address`. |
| `docs/requirements/Component-ConfigurationDatabase.md` | Migration for `GrpcNodeAAddress` / `GrpcNodeBAddress` on the `Sites` table. |
| `docs/requirements/Component-ClusterInfrastructure.md` | Note the gRPC port alongside the Akka remoting port in node configuration. |
CLAUDE.md
Update key design decisions:
- Add "gRPC streaming for site→central real-time data; ClusterClient for command/control only" under Data & Communication
- Add gRPC port convention under Architecture & Runtime
- Update current component count if a new component is introduced
README.md
Update the architecture diagram to show the gRPC streaming channel between site and central clusters.
Testing Strategy
Tests to Update (Existing)
| Test File | Change |
|---|---|
| `tests/ScadaLink.SiteRuntime.Tests/Actors/InstanceActorIntegrationTests.cs` | Remove the `DebugStreamEventForwarder` test helper and `DebugStreamEvent` expectations. Debug subscriber tests should verify events reach `SiteStreamManager` only (not direct forwarding). |
| `tests/ScadaLink.Communication.Tests/` | Remove any tests for `DebugStreamEvent` routing through `CentralCommunicationActor` and `SiteCommunicationActor`. |
| `tests/ScadaLink.Host.Tests/HealthCheckTests.cs` | May need updates if site host builder changes affect test factory setup. |
New Tests Required
| Test | Project | What to Verify |
|---|---|---|
| `SiteStreamGrpcServerTests` | `ScadaLink.Communication.Tests` | Server accepts subscription, relays events from a mock `SiteStreamManager` to the gRPC stream, cleans up on cancellation, rejects duplicate `correlation_id`, enforces the max concurrent streams limit, rejects before the actor system is ready. |
| `SiteStreamGrpcClientTests` | `ScadaLink.Communication.Tests` | Client connects, reads the stream, converts proto→domain types, invokes the callback, handles stream errors, reconnects to Node B on failure. |
| `SiteStreamGrpcClientFactoryTests` | `ScadaLink.Communication.Tests` | Creates and caches per-site clients, derives endpoints from the Site entity, disposes on site removal. |
| `DebugStreamBridgeActorTests` (update) | `ScadaLink.Communication.Tests` | Verify bridge actor opens the gRPC stream before requesting the snapshot (per the handoff mitigation), receives events via the gRPC callback (not Akka messages), reconnects on stream error with node failover, terminates after max retries. |
| `GrpcStreamIntegrationTest` | `ScadaLink.IntegrationTests` | End-to-end: site gRPC server → central gRPC client → bridge actor → callback. Use an in-process test server (`WebApplicationFactory` or `TestServer`). Verify event delivery, cancellation cleanup, keepalive behavior. |
| `SiteHostStartupTests` | `ScadaLink.Host.Tests` | Site host starts with `WebApplicationBuilder`, gRPC port configured, `MapGrpcService` registered, rejects streams before the actor system is ready. |
| `ProtoRoundtripTests` | `ScadaLink.Communication.Tests` | Serialize/deserialize each proto message type; verify `oneof` discrimination, enum mappings, and `google.protobuf.Timestamp` conversion. |
Test Coverage Guardrails
To ensure the implementation stays compliant with this plan:
- Proto contract tests: A test that loads `sitestream.proto` and verifies all `oneof` variants have handlers in both `StreamRelayActor` (server) and `ConvertToDomainEvent` (client). If a new proto field is added without handlers, the test fails. Prevents silent event type gaps.
- Architectural constraint test: Add to `ScadaLink.Commons.Tests/ArchitecturalConstraintTests.cs` — verify that the `DebugStreamEvent` type no longer exists in the assembly (ensures the ClusterClient streaming path is fully removed and doesn't creep back).
- Startup validation test: Site host integration test that verifies the gRPC server rejects `SubscribeInstance` calls with `StatusCode.Unavailable` before the actor system is ready, and accepts them after.
- Cleanup verification test: gRPC server test that verifies that after stream cancellation, the SiteStreamManager subscription count returns to its pre-test value (no leaked subscriptions).
- No ClusterClient streaming regression: Integration test that subscribes via gRPC, triggers attribute changes, and verifies events arrive via the gRPC callback — NOT via `DebugStreamEvent` through `CentralCommunicationActor`. This prevents accidental reintroduction of the ClusterClient streaming path.
Implementation Plan Guardrails
When creating implementation plans (work packages) from this document, each plan must include:
- Pre-implementation checklist:
  - Identify all requirement docs affected by this work package
  - Identify all existing tests that need updating
  - List new tests required before marking the work package complete
- Per-step requirements:
  - Every new public class/interface must have covering unit tests
  - Every modified actor must have its existing tests updated to reflect new behavior
  - Every proto change must have roundtrip serialization tests
  - Every config change must be validated in `StartupValidator` with a covering test
- Completion criteria (a work package is not done until):
  - All identified requirement docs are updated
  - All existing tests pass (no regressions)
  - All new tests listed in this document are implemented and passing
  - `dotnet build` succeeds with zero warnings
  - CLAUDE.md key design decisions are updated if architectural choices changed
  - `docker/deploy.sh` succeeds and end-to-end streaming is verified manually
- Review checkpoint: After each implementation phase (site server, central client, cleanup, config), run the full test suite and verify the streaming path end-to-end before proceeding to the next phase. Do not batch all phases into a single commit.