Commit Graph

13 Commits

Author SHA1 Message Date
Joseph Doherty 8c78913503 fix(communication): correct audit-ingest timeout-path docs and add timeout test 2026-05-21 03:29:54 -04:00
Joseph Doherty 6d073046c6 feat(communication): route audit ingest commands through CentralCommunicationActor 2026-05-21 03:23:30 -04:00
Joseph Doherty 2ff62a2ceb feat(notification-outbox): route NotificationSubmit to the outbox actor 2026-05-19 02:38:04 -04:00
Joseph Doherty 0b4c1563aa fix(communication): resolve Communication-009,010,011 — atomic site-cache refresh, XML doc correction, test coverage 2026-05-16 22:04:21 -04:00
Joseph Doherty 31a6995d24 fix(communication): resolve Communication-004..008 — Resume supervision, gRPC option wiring, address-load logging, sync dispose, flap detection 2026-05-16 20:58:03 -04:00
Joseph Doherty f66dc031a4 fix(health): route site heartbeats into the aggregator
CentralCommunicationActor.HandleHeartbeat was forwarding each incoming
HeartbeatMessage to Context.Parent, which resolves to the /user
guardian — a non-actor. Every site heartbeat went straight to dead
letters (~1026 per central node per 30 minutes at the default ~2s
interval across three sites).

The aggregator now exposes MarkHeartbeat(siteId, receivedAt) which
bumps LastReportReceivedAt on already-known sites (and clears IsOnline
if it had flipped) without touching LatestReport. Heartbeats from
unregistered sites are dropped — first registration still happens on
the first full report. CentralCommunicationActor calls this in place
of the no-op Tell.

The result: heartbeats now serve their stated health-monitoring
purpose (per CLAUDE.md) by keeping a site marked online between the
30s full reports if a single report is briefly delayed, and the dead
letter noise disappears entirely.
2026-05-13 08:11:43 -04:00
Joseph Doherty 6f1f6b8467 fix(health): replicate site health reports between central nodes
CentralHealthAggregator is a per-node hosted singleton, but site health
reports flow through ClusterClient which round-robins each report to one
central node only. The other node's aggregator never saw those reports
and marked sites offline at the 60s threshold — sites constantly flapped
between online and offline on the monitoring page.

On receive, the active CentralCommunicationActor now republishes a
SiteHealthReportReplica wrapper on a DistributedPubSub topic. Both
central nodes subscribe to the topic and process replicas through a
dedicated path that updates the local aggregator without re-broadcasting
(avoids fan-out loops). The aggregator's existing sequence-number
idempotency makes self-delivery a cheap no-op.

DistributedPubSubExtensionProvider is now listed in the HOCON
`akka.extensions` block so the mediator is initialised at cluster
start, eliminating a race where the first Subscribe arrived before the
extension was loaded.
2026-05-13 06:20:07 -04:00
Joseph Doherty 49f042a937 refactor: remove ClusterClient streaming path (DebugStreamEvent), events flow via gRPC 2026-03-21 12:18:52 -04:00
Joseph Doherty 3efec91386 fix: route debug stream events through ClusterClient site→central path
ClusterClient Sender refs are temporary proxies — valid for immediate reply
but not durable for future Tells. Events now flow as DebugStreamEvent through
SiteCommunicationActor → ClusterClient → CentralCommunicationActor → bridge
actor (same pattern as health reports). Also fix DebugStreamHub to use
IHubContext for long-lived callbacks instead of transient hub instance.
2026-03-21 11:32:17 -04:00
Joseph Doherty 4f22ca2b1f feat: replace ActorSelection with ClusterClient for inter-cluster communication
Central and site clusters now communicate via ClusterClient/
ClusterClientReceptionist instead of direct ActorSelection. Both
CentralCommunicationActor and SiteCommunicationActor are registered
with their cluster's receptionist. Central creates one ClusterClient
per site using NodeA/NodeB contact points from the DB. Sites configure
multiple CentralContactPoints for automatic failover between central
nodes. ISiteClientFactory enables test injection.
2026-03-18 00:08:47 -04:00
Joseph Doherty e5eb871961 fix: wire up health report pipeline between sites and central aggregator
Sites now send SiteHealthReport via AkkaHealthReportTransport →
SiteCommunicationActor → CentralCommunicationActor → CentralHealthAggregator.
Added IHealthReportTransport impl, ISiteIdentityProvider impl, registered
HealthReportSender on site nodes, and added SiteHealthReport handler in
CentralCommunicationActor. Health Dashboard now shows all 3 sites online.
2026-03-17 23:46:17 -04:00
Joseph Doherty 9e97c1acd2 feat: replace site registration with database-driven site addressing
Central now resolves site Akka remoting addresses from the Sites DB table
(NodeAAddress/NodeBAddress) instead of relying on runtime RegisterSite
messages. Eliminates the race condition where sites starting before central
had their registration dead-lettered. Addresses are cached in
CentralCommunicationActor with 60s periodic refresh and on-demand refresh
when sites are added/edited/deleted via UI or CLI.
2026-03-17 23:13:10 -04:00
Joseph Doherty 389f5a0378 Phase 3B: Site I/O & Observability — Communication, DCL, Script/Alarm actors, Health, Event Logging
Communication Layer (WP-1–5):
- 8 message patterns with correlation IDs, per-pattern timeouts
- Central/Site communication actors, transport heartbeat config
- Connection failure handling (no central buffering, debug streams killed)

Data Connection Layer (WP-6–14, WP-34):
- Connection actor with Become/Stash lifecycle (Connecting/Connected/Reconnecting)
- OPC UA + LmxProxy adapters behind IDataConnection
- Auto-reconnect, bad quality propagation, transparent re-subscribe
- Write-back, tag path resolution with retry, health reporting
- Protocol extensibility via DataConnectionFactory

Site Runtime (WP-15–25, WP-32–33):
- ScriptActor/ScriptExecutionActor (triggers, concurrent execution, blocking I/O dispatcher)
- AlarmActor/AlarmExecutionActor (ValueMatch/RangeViolation/RateOfChange, in-memory state)
- SharedScriptLibrary (inline execution), ScriptRuntimeContext (API)
- ScriptCompilationService (Roslyn, forbidden API enforcement, execution timeout)
- Recursion limit (default 10), call direction enforcement
- SiteStreamManager (per-subscriber bounded buffers, fire-and-forget)
- Debug view backend (snapshot + stream), concurrency serialization
- Local artifact storage (4 SQLite tables)

Health Monitoring (WP-26–28):
- SiteHealthCollector (thread-safe counters, connection state)
- HealthReportSender (30s interval, monotonic sequence numbers)
- CentralHealthAggregator (offline detection 60s, online recovery)

Site Event Logging (WP-29–31):
- SiteEventLogger (SQLite, 6 event categories, ISO 8601 UTC)
- EventLogPurgeService (30-day retention, 1GB cap)
- EventLogQueryService (filters, keyword search, keyset pagination)

541 tests pass, zero warnings.
2026-03-16 20:57:25 -04:00