77cb0ad0e2
The largest themed batch — small mechanical fixes across 11 modules.
API / message hygiene:
- Comm-020: SiteAddressCacheLoaded now carries IReadOnlyDictionary /
IReadOnlyList — Akka messages must be immutable.
- Commons-016: BundleSession.MaxUnlockAttempts named constant replaces
magic 3.
- Commons-018: IOperationTrackingStore + IPartitionMaintenance moved from
Interfaces/ root to Interfaces/Services/ (namespace preserved — 9
consumers exceeded the in-prompt move threshold).
- Commons-023: TrackingStatusSnapshot.SourceNode now consistent with the
trailing-optional-with-default pattern used elsewhere.
- SR-022: AuditingDbCommand.DbConnection.set no longer uses reflection —
exposes AuditingDbConnection.Inner via internal API surface.
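A note on Comm-020: List<T>.AsReadOnly() returns a read-only wrapper over the same list, so the payload stays safe only while the producer stops mutating the list after sending. The sketch below contrasts that wrapper with a copy. Freeze is a hypothetical helper, not part of this commit, and the addresses are placeholder strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (an assumption, not the commit's code): copies every
// bucket so later writes to the source lists cannot reach the message payload.
static IReadOnlyDictionary<string, IReadOnlyList<string>> Freeze(
    Dictionary<string, List<string>> buckets) =>
    buckets.ToDictionary(
        kvp => kvp.Key,
        kvp => (IReadOnlyList<string>)kvp.Value.ToArray());

var buckets = new Dictionary<string, List<string>> { ["site-a"] = new() { "addr-1" } };

var view = buckets["site-a"].AsReadOnly();   // read-only wrapper over the live list
var frozen = Freeze(buckets);                 // independent copy

buckets["site-a"].Add("addr-2");              // producer mutates after "sending"

Console.WriteLine($"{view.Count} {frozen["site-a"].Count}");   // prints "2 1"
```

The wrapper is sufficient when the producer discards its lists right after building the message; the copy trades an allocation for not depending on that discipline.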
Dead code / config cleanup:
- ClusterInfra-011: decorative SectionName constant deleted.
- ClusterInfra-014: dead AddClusterInfrastructureActors method + its
"throws-when-called" test deleted.
- Host-021: Microsoft Logging:LogLevel block deleted from appsettings.json
(dead under Serilog).
Fail-loud over fail-silent:
- DM-021: ResolveSiteIdentifierAsync throws on missing site (was silently
substituting a DB id).
- DM-022: dropped transient Pending write — record now lands directly in
InProgress (no UI flicker, one fewer DB write).
- Host-020: LoggerConfigurationFactory emits a Console.Error warning when
both Serilog:MinimumLevel and ScadaLink:Logging:MinimumLevel are set
(ScadaLink remains truth per Host-011).
- SnF-022: NotifyCachedCallObserverAsync logs Warning on unparseable
TrackedOperationId (was silently dropping).
- SnF-023: empty siteId default replaced with $unknown-site sentinel
+ constructor normalisation.
Correctness:
- SCA-001: SupervisorStrategy XML rewritten to match actual
DefaultDecider/Restart semantics (was claiming Resume).
- SCA-003: OnUpsertAsync now restamps IngestedAtUtc on every upsert.
- SR-021: HandleDeployArtifacts now dispatches an internal
ApplyArtifactDataConnectionsToDcl message after the SQLite write so
system-wide artifact-deploy data-connection changes go live
immediately (was requiring a site restart).
- SnF-020: RetryParkedMessageAsync captures the parked row BEFORE the
local write so a concurrent delete can't skip standby replication.
Sentinels / naming collisions:
- HM-021: CentralSiteId changed from "central" to "$central"
(cannot collide: a leading $ is forbidden in real SiteIdentifiers).
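Why the "$" prefix removes the collision can be shown in a few lines. This is a sketch under a simplifying assumption: the validator below is hypothetical and checks only the rules it names, not the project's full SiteIdentifier validation.

```csharp
using System;

// Sentinel value taken from the commit message.
const string CentralSiteId = "$central";

// Hypothetical validator: a real identifier is non-blank and never starts with '$'.
static bool IsValidSiteIdentifier(string id) =>
    !string.IsNullOrWhiteSpace(id) && id[0] != '$';

Console.WriteLine(IsValidSiteIdentifier("central"));       // True: a real site could use this name
Console.WriteLine(IsValidSiteIdentifier(CentralSiteId));   // False: the sentinel can never be a real site
```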
Doc / surface cleanups:
- SEL-018: FailedWriteCount promoted to ISiteEventLogger; XML softened
to "Available for future Health Monitoring integration".
- SnF-019: VERIFY outcome — documented parking-after-DefaultMaxRetries
in Component-StoreAndForward.md + DefaultMaxRetries XML (uniform
cap; maxRetries:0 is the unbounded escape hatch).
- SnF-021: Component-StoreAndForward.md no longer claims the tracking
table lives in SnF — it's in SiteRuntime, the interface is in Commons.
- CLI-020: bundle export response parse guarded with try/catch on
JsonException / KeyNotFoundException / FormatException — emits a
clean INVALID_RESPONSE exit instead of a stack trace.
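The guard described in CLI-020 might look roughly like the following. It is a sketch only: the "bundle" field name, the base64 encoding and the numeric exit codes are assumptions for illustration, since the CLI's real response shape is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical exit codes; the real CLI's values are not shown in the commit.
const int ExitOk = 0;
const int ExitInvalidResponse = 3;

int ParseBundleResponse(string body, out byte[] bundle)
{
    bundle = Array.Empty<byte>();
    try
    {
        using var doc = JsonDocument.Parse(body);             // JsonException on malformed JSON
        var encoded = doc.RootElement.GetProperty("bundle")   // KeyNotFoundException if the field is absent
            .GetString() ?? throw new FormatException("bundle is null");
        bundle = Convert.FromBase64String(encoded);           // FormatException on invalid base64
        return ExitOk;
    }
    catch (Exception ex) when (ex is JsonException or KeyNotFoundException or FormatException)
    {
        // One clean line on stderr instead of a stack trace.
        Console.Error.WriteLine($"INVALID_RESPONSE: {ex.Message}");
        return ExitInvalidResponse;
    }
}

Console.WriteLine(ParseBundleResponse("{\"bundle\":\"aGk=\"}", out var ok));   // prints 0
Console.WriteLine(ParseBundleResponse("not json", out _));                      // prints 3
```

Only the three exception types named in the commit are caught, so any other failure, such as a non-string "bundle" value, still surfaces as an unhandled exception.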
Config:
- ClusterInfra-013: intent comment added to "catastrophic config" test.
- Host-016: appsettings.Site.json second CentralContactPoints entry
removed (was pointing at the SITE's own port); doc-key explains how
to extend.
- Host-018: NodeName added to both shipped per-role configs (was
causing SourceNode to be null on audit rows).
UI:
- CentralUI-029: replaced JS.InvokeAsync<int>("eval", …) with an ES
module import (new wwwroot/js/browser-time.js).
- CentralUI-032: AuditResultsGrid gains a Previous button backed by a
cursor stack.
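For CentralUI-032, the cursor-stack idea can be sketched without any UI code. Everything here is illustrative: the integer ids, the page size and the Fetch function stand in for the real grid query, which is not shown in this commit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in data source: ids 1..10, keyset-paged by "id greater than cursor".
var rows = Enumerable.Range(1, 10).ToList();
const int PageSize = 3;
List<int> Fetch(int? after) =>
    rows.Where(r => after is null || r > after).Take(PageSize).ToList();

var history = new Stack<int?>();   // cursors of pages already visited
int? current = null;               // cursor that produced the page on screen
var page = Fetch(current);

void Next()
{
    if (page.Count == 0) return;
    history.Push(current);         // remember how to get back to this page
    current = page[^1];            // last id on this page becomes the next cursor
    page = Fetch(current);
}

void Previous()
{
    if (history.Count == 0) return;   // already on the first page
    current = history.Pop();
    page = Fetch(current);
}

Next(); Next();                    // page is now 7,8,9
Previous();                        // back to 4,5,6
Console.WriteLine(string.Join(",", page));   // prints "4,5,6"
```

A stack fits because forward-only cursors cannot be inverted: the only way back is to replay a cursor that was already used.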
10+ new regression tests across the affected projects. Build clean;
all suites green. README regenerated: 6 open (was 33).
Session-to-date: 130 of 136 originally-open Theme findings closed.
583 lines
26 KiB
C#
using System.Collections.Immutable;
using Akka.Actor;
using Akka.Cluster.Tools.Client;
using Akka.Cluster.Tools.PublishSubscribe;
using Akka.Event;
using Microsoft.Extensions.DependencyInjection;
using ScadaLink.Commons.Interfaces.Repositories;
using ScadaLink.Commons.Messages.Audit;
using ScadaLink.Commons.Messages.Communication;
using ScadaLink.Commons.Messages.Health;
using ScadaLink.Commons.Messages.Notification;
using ScadaLink.HealthMonitoring;

namespace ScadaLink.Communication.Actors;

/// <summary>
/// Abstraction for creating ClusterClient instances per site, enabling testability.
/// </summary>
public interface ISiteClientFactory
{
    /// <summary>Creates a ClusterClient actor for the given site with the specified contact points.</summary>
    /// <param name="system">The actor system in which to create the client.</param>
    /// <param name="siteId">The site identifier, used to name the actor.</param>
    /// <param name="contacts">The set of receptionist actor paths to use as initial contacts.</param>
    /// <returns>An actor reference for the new ClusterClient.</returns>
    IActorRef Create(ActorSystem system, string siteId, ImmutableHashSet<ActorPath> contacts);
}

/// <summary>
/// Default implementation that creates a real ClusterClient for each site.
/// </summary>
public class DefaultSiteClientFactory : ISiteClientFactory
{
    /// <inheritdoc />
    public IActorRef Create(ActorSystem system, string siteId, ImmutableHashSet<ActorPath> contacts)
    {
        var settings = ClusterClientSettings.Create(system).WithInitialContacts(contacts);
        return system.ActorOf(ClusterClient.Props(settings), $"site-client-{siteId}");
    }
}

/// <summary>
/// Central-side actor that routes messages from central to site clusters via ClusterClient.
/// Resolves site addresses from the database on a periodic refresh cycle and manages
/// per-site ClusterClient instances.
///
/// WP-4: All 8 message patterns routed through this actor.
/// WP-5: Ask timeout on connection drop (no central buffering). Debug streams killed on interruption.
/// </summary>
public class CentralCommunicationActor : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();
    private readonly IServiceProvider _serviceProvider;
    private readonly ISiteClientFactory _siteClientFactory;

    /// <summary>
    /// Per-site ClusterClient instances and their contact addresses.
    /// Maps SiteIdentifier → (ClusterClient actor, set of contact address strings).
    /// Refreshed periodically via RefreshSiteAddresses.
    /// </summary>
    private readonly Dictionary<string, (IActorRef Client, ImmutableHashSet<string> ContactAddresses)> _siteClients = new();

    // Communication-016: the previous _debugSubscriptions / _inProgressDeployments
    // dictionaries existed solely to support a documented "synchronous kill streams +
    // mark deployments failed on site disconnect" workflow triggered by
    // ConnectionStateChanged. No production code ever emitted that message — only
    // the unit test did — so the workflow was dead from end to end. Disconnect
    // detection is owned by the underlying transports: the gRPC keepalive PING
    // signals stream interruption in ~25s (handled by DebugStreamBridgeActor's own
    // reconnection logic), and an Ask round-trip for a deploy times out at the
    // CommunicationService layer (caller sees failure). The tracking dicts +
    // ConnectionStateChanged record + HandleConnectionStateChanged handler are
    // removed; see docs/requirements/Component-Communication.md "Connection
    // Failure Behavior" for the keepalive-based contract that survives.

    private ICancelable? _refreshSchedule;

    /// <summary>
    /// Communication-019: per-actor lifecycle CTS threaded into the periodic
    /// <see cref="LoadSiteAddressesFromDb"/> repository call so a hung MS SQL
    /// connection is bounded by actor shutdown rather than holding piped tasks
    /// open indefinitely. Cancelled in <see cref="PostStop"/>; never reset.
    /// </summary>
    private readonly CancellationTokenSource _lifecycleCts = new();

    /// <summary>
    /// Proxy <see cref="IActorRef"/> for the central NotificationOutboxActor cluster singleton.
    /// Set via <see cref="RegisterNotificationOutbox"/> — the Host creates the singleton proxy
    /// after this actor and registers it (mirrors how the site-side actor receives its
    /// runtime <see cref="IActorRef"/>s). Null until registration completes; a notification
    /// arriving before then is rejected with a non-accepted ack so the site retries.
    /// </summary>
    private IActorRef? _notificationOutboxProxy;

    /// <summary>
    /// Proxy <see cref="IActorRef"/> for the central AuditLogIngestActor cluster
    /// singleton. Set via <see cref="RegisterAuditIngest"/> — the Host creates the
    /// singleton proxy after this actor and registers it (mirrors
    /// <see cref="_notificationOutboxProxy"/>). Null until registration completes;
    /// an audit ingest command arriving before then is answered with an empty
    /// reply so the site keeps its rows Pending and retries.
    ///
    /// Once registered, the handler Asks this proxy and pipes the reply straight
    /// back to the caller. On an Ask timeout or a faulted reply, PipeTo forwards a
    /// <see cref="Status.Failure"/> to the caller — the fault propagates rather
    /// than being swallowed. This differs from the gRPC handler
    /// (<c>SiteStreamGrpcServer</c>), which catches the exception and returns an
    /// empty ack; here the faulted Ask is the transient signal the site relies on
    /// (see <see cref="HandleIngestAuditEvents"/>).
    /// </summary>
    private IActorRef? _auditIngestProxy;

    /// <summary>
    /// Default Ask timeout for routing audit ingest commands to the
    /// AuditLogIngestActor proxy — 30 s, matching the value of
    /// <c>SiteStreamGrpcServer.AuditIngestAskTimeout</c> (that constant is private
    /// to the gRPC server and not reachable here, so it is declared locally). A
    /// generous window absorbs a slow MS SQL connection without the round-trip
    /// surfacing as a failure on a healthy site. When the window is exceeded the
    /// Ask faults and that fault is piped back to the caller as a
    /// <see cref="Status.Failure"/> (see <see cref="HandleIngestAuditEvents"/>).
    /// </summary>
    private static readonly TimeSpan DefaultAuditIngestAskTimeout = TimeSpan.FromSeconds(30);

    /// <summary>
    /// Effective Ask timeout for audit ingest routing. Defaults to
    /// <see cref="DefaultAuditIngestAskTimeout"/>; overridable via the constructor
    /// so tests can exercise the timeout/fault path without waiting 30 s.
    /// </summary>
    private readonly TimeSpan _auditIngestAskTimeout;

    /// <summary>
    /// DistributedPubSub topic used to fan health reports out to the peer
    /// central node so both per-node aggregators stay in sync. See
    /// <see cref="SiteHealthReportReplica"/> for the protocol rationale.
    /// </summary>
    private const string HealthReportTopic = "site-health-replica";

    /// <summary>Initializes the <see cref="CentralCommunicationActor"/> and wires all message handlers.</summary>
    /// <param name="serviceProvider">DI service provider for scoped repository and aggregator access.</param>
    /// <param name="siteClientFactory">Factory used to create per-site ClusterClient actors.</param>
    /// <param name="auditIngestAskTimeout">
    /// Optional override for the audit-ingest Ask timeout; defaults to
    /// <see cref="DefaultAuditIngestAskTimeout"/> (30 s). Exists only so tests can
    /// exercise the timeout/fault path quickly — production always uses the default.
    /// </param>
    public CentralCommunicationActor(
        IServiceProvider serviceProvider,
        ISiteClientFactory siteClientFactory,
        TimeSpan? auditIngestAskTimeout = null)
    {
        _serviceProvider = serviceProvider;
        _siteClientFactory = siteClientFactory;
        _auditIngestAskTimeout = auditIngestAskTimeout ?? DefaultAuditIngestAskTimeout;

        // Site address cache loaded from database
        Receive<SiteAddressCacheLoaded>(HandleSiteAddressCacheLoaded);

        // Periodic refresh trigger
        Receive<RefreshSiteAddresses>(_ => LoadSiteAddressesFromDb());

        // Communication-006: a faulted LoadSiteAddressesFromDb task is piped here as a
        // Status.Failure. Without this handler the failure was an unhandled message
        // (debug-level only) and the refresh failed silently — operators could not
        // distinguish "no sites configured" from "database is down". Log at Warning.
        Receive<Status.Failure>(failure =>
            _log.Warning(failure.Cause,
                "Failed to load site addresses from the database; the site ClusterClient "
                + "cache was not refreshed and may be stale or empty"));

        // Health monitoring: heartbeats and health reports from sites
        Receive<HeartbeatMessage>(HandleHeartbeat);
        Receive<SiteHealthReport>(HandleSiteHealthReport);
        Receive<SiteHealthReportReplica>(r => ProcessLocally(r.Report));
        Receive<SubscribeAck>(_ => { /* DistributedPubSub subscribe confirmation */ });

        // Route enveloped messages to sites
        Receive<SiteEnvelope>(HandleSiteEnvelope);

        // Notification Outbox: the Host registers the outbox singleton proxy after this
        // actor is created (the proxy cannot exist before this actor's construction).
        Receive<RegisterNotificationOutbox>(msg =>
        {
            _notificationOutboxProxy = msg.OutboxProxy;
            _log.Info("Registered notification outbox proxy");
        });

        // Notification Outbox ingest: a site forwards a buffered NotificationSubmit to the
        // central cluster via ClusterClient. Forward to the outbox proxy so the original
        // Sender (the site's ClusterClient path) is preserved and the NotificationSubmitAck
        // routes straight back to the site.
        Receive<NotificationSubmit>(HandleNotificationSubmit);

        // Notification Outbox status query: forward to the outbox proxy, preserving Sender
        // so the NotificationStatusResponse routes back to the querying site.
        Receive<NotificationStatusQuery>(HandleNotificationStatusQuery);

        // Audit Log (#23): the Host registers the AuditLogIngestActor singleton
        // proxy after this actor is created (the proxy cannot exist before this
        // actor's construction).
        Receive<RegisterAuditIngest>(msg =>
        {
            _auditIngestProxy = msg.AuditIngestActor;
            _log.Info("Registered audit ingest proxy");
        });

        // Audit Log (#23) site→central ingest: a site forwards a batch of audit
        // events to the central cluster via ClusterClient. Ask the ingest proxy
        // and pipe the IngestAuditEventsReply back to the original Sender (the
        // site's ClusterClient path) so the site can flip its rows to Forwarded.
        Receive<IngestAuditEventsCommand>(HandleIngestAuditEvents);

        // Audit Log (#23 M3) combined-telemetry ingest: routes to the same proxy
        // the same way; the proxy replies with an IngestCachedTelemetryReply.
        Receive<IngestCachedTelemetryCommand>(HandleIngestCachedTelemetry);
    }

    private void HandleNotificationSubmit(NotificationSubmit msg)
    {
        if (_notificationOutboxProxy == null)
        {
            // No outbox proxy registered yet. A non-accepted ack makes the site's
            // Store-and-Forward forwarder treat this as transient and retry later.
            _log.Warning(
                "Cannot route NotificationSubmit {0} — notification outbox not available",
                msg.NotificationId);
            Sender.Tell(new NotificationSubmitAck(
                msg.NotificationId, Accepted: false, Error: "notification outbox not available"));
            return;
        }

        _log.Debug("Routing NotificationSubmit {0} to the notification outbox", msg.NotificationId);
        _notificationOutboxProxy.Forward(msg);
    }

    private void HandleNotificationStatusQuery(NotificationStatusQuery msg)
    {
        if (_notificationOutboxProxy == null)
        {
            // No outbox proxy registered yet. Reply Found: false so the querying site
            // falls back to its local Store-and-Forward buffer to resolve the status.
            _log.Warning(
                "Cannot route NotificationStatusQuery {0} — notification outbox not available",
                msg.NotificationId);
            Sender.Tell(new NotificationStatusResponse(
                msg.CorrelationId, Found: false, Status: "Unknown",
                RetryCount: 0, LastError: null, DeliveredAt: null));
            return;
        }

        _log.Debug("Routing NotificationStatusQuery {0} to the notification outbox", msg.NotificationId);
        _notificationOutboxProxy.Forward(msg);
    }

    private void HandleIngestAuditEvents(IngestAuditEventsCommand msg)
    {
        if (_auditIngestProxy == null)
        {
            // No ingest proxy registered yet (host startup race). Reply with an
            // empty IngestAuditEventsReply so the site keeps its rows Pending and
            // retries — the same behaviour as the gRPC handler's wiring-race path.
            _log.Warning(
                "Cannot route IngestAuditEventsCommand ({0} events) — audit ingest not available",
                msg.Events.Count);
            Sender.Tell(new IngestAuditEventsReply(Array.Empty<Guid>()));
            return;
        }

        // Capture Sender before the async/PipeTo — Akka resets Sender between
        // dispatches. The reply is piped straight back to the site's ClusterClient.
        // On an Ask timeout or a faulted reply, PipeTo delivers a Status.Failure to
        // replyTo: the fault propagates to the caller rather than being swallowed.
        // The site's own Ask through this path then faults, and the site drain loop
        // treats that as a transient failure — rows stay Pending and are retried on
        // the next tick. (The gRPC handler instead returns an empty ack on fault;
        // propagating the fault here is the cleaner transient signal.)
        var replyTo = Sender;
        _log.Debug("Routing IngestAuditEventsCommand ({0} events) to the audit ingest actor", msg.Events.Count);
        _auditIngestProxy.Ask<IngestAuditEventsReply>(msg, _auditIngestAskTimeout)
            .PipeTo(replyTo);
    }

    private void HandleIngestCachedTelemetry(IngestCachedTelemetryCommand msg)
    {
        if (_auditIngestProxy == null)
        {
            _log.Warning(
                "Cannot route IngestCachedTelemetryCommand ({0} entries) — audit ingest not available",
                msg.Entries.Count);
            Sender.Tell(new IngestCachedTelemetryReply(Array.Empty<Guid>()));
            return;
        }

        var replyTo = Sender;
        _log.Debug("Routing IngestCachedTelemetryCommand ({0} entries) to the audit ingest actor", msg.Entries.Count);
        _auditIngestProxy.Ask<IngestCachedTelemetryReply>(msg, _auditIngestAskTimeout)
            .PipeTo(replyTo);
    }

    private void HandleHeartbeat(HeartbeatMessage heartbeat)
    {
        var aggregator = _serviceProvider.GetService<ICentralHealthAggregator>();
        aggregator?.MarkHeartbeat(heartbeat.SiteId, heartbeat.Timestamp);
    }

    /// <summary>
    /// Handles a report delivered directly from a site (via ClusterClient):
    /// process locally, then fan out to the peer central node so its
    /// aggregator stays in sync.
    /// </summary>
    private void HandleSiteHealthReport(SiteHealthReport report)
    {
        ProcessLocally(report);

        try
        {
            DistributedPubSub.Get(Context.System).Mediator.Tell(
                new Publish(HealthReportTopic, new SiteHealthReportReplica(report)));
        }
        catch
        {
            // No-op in non-clustered hosts (TestKit).
        }
    }

    /// <summary>
    /// Applies a report to the local aggregator without re-broadcasting.
    /// Used for both site-originated reports and peer-replicated ones — the
    /// aggregator is idempotent via sequence-number comparison.
    /// </summary>
    private void ProcessLocally(SiteHealthReport report)
    {
        var aggregator = _serviceProvider.GetService<ICentralHealthAggregator>();
        if (aggregator != null)
        {
            aggregator.ProcessReport(report);
        }
        else
        {
            _log.Warning("ICentralHealthAggregator not available, dropping health report from site {0}", report.SiteId);
        }
    }

    // Communication-016: HandleConnectionStateChanged removed — no production
    // caller emitted ConnectionStateChanged, so the workflow ran only in tests.
    // Disconnect detection is owned by the transport layers (gRPC keepalive +
    // ClusterClient/Ask timeout).

    private void HandleSiteEnvelope(SiteEnvelope envelope)
    {
        if (!_siteClients.TryGetValue(envelope.SiteId, out var entry))
        {
            _log.Warning("No ClusterClient for site {0}, cannot route message {1}",
                envelope.SiteId, envelope.Message.GetType().Name);

            // The Ask will timeout on the caller side — no central buffering (WP-5)
            return;
        }

        // Route via ClusterClient — Sender is preserved for Ask response routing
        entry.Client.Tell(
            new ClusterClient.Send("/user/site-communication", envelope.Message),
            Sender);
    }

    private void LoadSiteAddressesFromDb()
    {
        var self = Self;
        // Communication-019: pass the actor's lifecycle CT into the repository
        // call so a hung database query is cancelled when the actor stops
        // rather than leaving the piped task to accumulate. Captured locally
        // because the lifecycle CTS may have been disposed by PostStop on a
        // racing late tick; treat that as "actor gone, give up".
        CancellationToken ct;
        try
        {
            ct = _lifecycleCts.Token;
        }
        catch (ObjectDisposedException)
        {
            return;
        }

        Task.Run(async () =>
        {
            using var scope = _serviceProvider.CreateScope();
            var repo = scope.ServiceProvider.GetRequiredService<ISiteRepository>();
            var sites = await repo.GetAllSitesAsync(ct).ConfigureAwait(false);

            var contacts = new Dictionary<string, List<string>>();
            foreach (var site in sites)
            {
                var addrs = new List<string>();
                if (!string.IsNullOrWhiteSpace(site.NodeAAddress))
                {
                    var addr = site.NodeAAddress;
                    // Strip actor path suffix if present (legacy format).
                    // Ordinal comparison: actor paths are not culture-sensitive text.
                    var idx = addr.IndexOf("/user/", StringComparison.Ordinal);
                    if (idx > 0) addr = addr.Substring(0, idx);
                    addrs.Add(addr);
                }
                if (!string.IsNullOrWhiteSpace(site.NodeBAddress))
                {
                    var addr = site.NodeBAddress;
                    var idx = addr.IndexOf("/user/", StringComparison.Ordinal);
                    if (idx > 0) addr = addr.Substring(0, idx);
                    addrs.Add(addr);
                }
                if (addrs.Count > 0)
                    contacts[site.SiteIdentifier] = addrs;
            }

            // Communication-020: freeze the cross-task payload before piping to
            // Self. The message record exposes read-only types
            // (IReadOnlyDictionary / IReadOnlyList) so the Akka.NET
            // message-immutability convention is enforced by type, not just
            // convention.
            var frozen = contacts.ToDictionary(
                kvp => kvp.Key,
                kvp => (IReadOnlyList<string>)kvp.Value.AsReadOnly());
            return new SiteAddressCacheLoaded(frozen);
        }).PipeTo(self);
    }

    private void HandleSiteAddressCacheLoaded(SiteAddressCacheLoaded msg)
    {
        var newSiteIds = msg.SiteContacts.Keys.ToHashSet();
        var existingSiteIds = _siteClients.Keys.ToHashSet();

        // Stop ClusterClients for removed sites
        foreach (var removed in existingSiteIds.Except(newSiteIds))
        {
            _log.Info("Stopping ClusterClient for removed site {0}", removed);
            Context.Stop(_siteClients[removed].Client);
            _siteClients.Remove(removed);
        }

        // Add or update
        foreach (var (siteId, addresses) in msg.SiteContacts)
        {
            // Communication-009: parse all addresses up front inside a try/catch so a
            // single malformed site row cannot abort the whole refresh loop and leave
            // the cache half-updated. A bad site is logged and skipped; others proceed.
            ImmutableHashSet<ActorPath> contactPaths;
            try
            {
                contactPaths = addresses
                    .Select(a => ActorPath.Parse($"{a}/system/receptionist"))
                    .ToImmutableHashSet();
            }
            catch (Exception ex)
            {
                _log.Warning(ex,
                    "Malformed contact address for site {0}; skipping this site in the refresh "
                    + "(other sites are unaffected)", siteId);
                continue;
            }

            var contactStrings = addresses.ToImmutableHashSet();

            // Skip if unchanged
            if (_siteClients.TryGetValue(siteId, out var existing) && existing.ContactAddresses.SetEquals(contactStrings))
                continue;

            // Stop old client if addresses changed
            if (_siteClients.ContainsKey(siteId))
            {
                _log.Info("Updating ClusterClient for site {0} (addresses changed)", siteId);
                Context.Stop(_siteClients[siteId].Client);
            }

            var client = _siteClientFactory.Create(Context.System, siteId, contactPaths);
            _siteClients[siteId] = (client, contactStrings);
            _log.Info("Created ClusterClient for site {0} with {1} contact(s)", siteId, addresses.Count);
        }

        _log.Info("Site ClusterClient cache refreshed with {0} site(s)", _siteClients.Count);
    }

    // Communication-016: TrackMessageForCleanup removed — the dicts it fed
    // existed solely to support the dead ConnectionStateChanged workflow.

    /// <inheritdoc />
    protected override SupervisorStrategy SupervisorStrategy()
    {
        return new OneForOneStrategy(
            maxNrOfRetries: -1,
            withinTimeRange: Timeout.InfiniteTimeSpan,
            decider: Decider.From(ex =>
            {
                _log.Warning(ex, "Child actor of CentralCommunicationActor faulted, resuming (state preserved)");
                return Directive.Resume;
            }));
    }

    /// <inheritdoc />
    protected override void PreStart()
    {
        _log.Info("CentralCommunicationActor started");

        // Subscribe to the peer-replication topic so we receive health reports
        // delivered to the other central node and keep our local aggregator
        // in sync (ClusterClient load-balances reports across nodes).
        // Tolerant of non-clustered hosts (TestKit) where the extension is absent.
        try
        {
            DistributedPubSub.Get(Context.System).Mediator.Tell(
                new Subscribe(HealthReportTopic, Self));
        }
        catch (Exception ex)
        {
            _log.Debug("DistributedPubSub not available — peer health replication disabled: {0}", ex.Message);
        }

        // Schedule periodic refresh of site addresses from the database
        _refreshSchedule = Context.System.Scheduler.ScheduleTellRepeatedlyCancelable(
            TimeSpan.Zero,
            TimeSpan.FromSeconds(60),
            Self,
            new RefreshSiteAddresses(),
            ActorRefs.NoSender);
    }

    /// <inheritdoc />
    protected override void PostStop()
    {
        _log.Info("CentralCommunicationActor stopped");
        _refreshSchedule?.Cancel();
        // Communication-019: cancel any in-flight LoadSiteAddressesFromDb so a
        // hung MS SQL query does not outlive the actor.
        try
        {
            _lifecycleCts.Cancel();
        }
        catch (ObjectDisposedException)
        {
            // Double-stop is benign.
        }
        _lifecycleCts.Dispose();
    }
}

/// <summary>
/// Command to trigger a refresh of site addresses from the database.
/// </summary>
public record RefreshSiteAddresses;

/// <summary>
/// Internal message carrying the loaded site contact data from the database.
/// ClusterClient creation happens on the actor thread in HandleSiteAddressCacheLoaded.
///
/// Communication-020: the payload is exposed as <see cref="IReadOnlyDictionary{TKey,TValue}"/>
/// of <see cref="IReadOnlyList{T}"/> so the Akka.NET "messages are immutable"
/// convention is enforced at the type level rather than relying on producer
/// discipline. The producer wraps the constructed buckets with
/// <c>List&lt;T&gt;.AsReadOnly()</c> before piping to Self.
/// </summary>
internal record SiteAddressCacheLoaded(IReadOnlyDictionary<string, IReadOnlyList<string>> SiteContacts);

/// <summary>
/// Notification sent to debug view subscribers when the stream is terminated
/// due to site disconnection (WP-5).
/// </summary>
public record DebugStreamTerminated(string SiteId, string CorrelationId);

/// <summary>
/// Registers the central NotificationOutboxActor singleton proxy with the
/// <see cref="CentralCommunicationActor"/> so site-forwarded <see cref="NotificationSubmit"/>
/// and <see cref="NotificationStatusQuery"/> messages can be routed to it. Sent by the Host
/// after the outbox singleton proxy is created.
/// </summary>
public record RegisterNotificationOutbox(IActorRef OutboxProxy);

/// <summary>
/// Registers the central AuditLogIngestActor singleton proxy with the
/// <see cref="CentralCommunicationActor"/> so site-forwarded
/// <see cref="IngestAuditEventsCommand"/> and <see cref="IngestCachedTelemetryCommand"/>
/// messages can be routed to it. Sent by the Host after the audit-ingest
/// singleton proxy is created. Lives here (not in Commons) because
/// <c>ScadaLink.Commons</c> has no Akka package reference and cannot hold an
/// <see cref="IActorRef"/> field.
/// </summary>
public sealed record RegisterAuditIngest(IActorRef AuditIngestActor);