scadalink-design

Author	SHA1	Message	Date
Joseph Doherty	e93f655ce4	feat(health): SiteAuditBacklog metric (count + age + bytes) (#23 M6)	2026-05-20 19:02:01 -04:00
Joseph Doherty	23c0fd417e	feat(health): AuditRedactionFailure counter + bridge (#23 M5) Bundle C task M5-T7: surface DefaultAuditPayloadFilter redactor over-redactions as a Site Health metric so a misconfigured / catastrophic regex shows up on /monitoring/health rather than disappearing into a NoOp sink. - SiteHealthReport: new 'AuditRedactionFailure' int field (defaulted to 0 for back-compat with existing producers/tests). - ISiteHealthCollector / SiteHealthCollector: new IncrementAuditRedactionFailure(), a per-interval atomic counter with Interlocked, reset on CollectReport, mirroring the M2 Bundle G SiteAuditWriteFailures pattern. - HealthMetricsAuditRedactionFailureCounter: new bridge in ScadaLink.AuditLog.Site that forwards IAuditRedactionFailureCounter increments to ISiteHealthCollector; mirrors HealthMetricsAuditWriteFailureCounter one-for-one. - AddAuditLogHealthMetricsBridge: now ALSO Replaces the NoOpAuditRedactionFailureCounter binding with the health-metrics bridge, so a single AddAuditLogHealthMetricsBridge() call wires both the M2 Bundle G write-failure counter and the M5 Bundle C redaction-failure counter into the health report. Site-side only for M5: the filter also runs on CentralAuditWriter and AuditLogIngestActor (where it just keeps the NoOp default), but a central-side health-metric surface for AuditRedactionFailure is deferred to M6 alongside the rest of the central health collector work. Tests: - AuditRedactionFailureMetricTests (HealthMonitoring) covers the SiteHealthCollector increment/report/reset shape (3 tests). - HealthMetricsAuditRedactionFailureCounterTests (AuditLog) covers the AuditLog → HealthMonitoring bridge (3 tests). - Existing CountCapturingHealthCollector stub in DeploymentManagerRedeployTests extended with the new no-op interface method. Verified: dotnet build clean, all 24 test projects green (the only Failed at first ScadaLink.SiteRuntime.Tests run was the known-flaky InstanceActorChildAttributeRaceTests; passes on re-run in isolation and full suite, unrelated to these changes).	2026-05-20 17:28:33 -04:00
Joseph Doherty	dd3351da93	feat(health): SiteAuditWriteFailures counter + AuditLog bridge (#23 ) Bundle G of Audit Log #23 M2. Bridges the FallbackAuditWriter primary-failure counter into the Site Health Monitoring report payload so a sustained audit-write outage surfaces on /monitoring/health instead of disappearing into a NoOp sink. - SiteHealthReport: add SiteAuditWriteFailures (defaulted, additive). - ISiteHealthCollector + SiteHealthCollector: new IncrementSiteAuditWriteFailures() counter, per-interval reset semantics matching ScriptErrorCount / DeadLetterCount. - HealthMetricsAuditWriteFailureCounter: adapter forwarding IAuditWriteFailureCounter.Increment() to the collector. - AddAuditLogHealthMetricsBridge(): swaps the NoOp default registration for the real bridge; called from SiteServiceRegistration after AddSiteHealthMonitoring + AddAuditLog. - Existing host-wiring test updated: site composition now resolves HealthMetricsAuditWriteFailureCounter (not NoOp). Tests: HealthMonitoring 60 -> 63 (3 new), AuditLog 56 -> 59 (3 new), full solution green.	2026-05-20 13:22:25 -04:00
Joseph Doherty	02a7e8abc6	feat(health): show all cluster nodes (online/offline, primary/standby) in health dashboard Add NodeStatus record, IClusterNodeProvider interface, and AkkaClusterNodeProvider that queries Akka cluster membership for all site-role nodes. HealthReportSender populates ClusterNodes before each report. UI shows a row per node with hostname, Online/Offline badge, and Primary/Standby badge. Falls back to single-node display if ClusterNodes is not populated.	2026-03-24 16:19:39 -04:00
Joseph Doherty	e84a831a02	feat(health): redesign health dashboard with 4-column layout and new metrics New fields in SiteHealthReport: NodeHostname, DataConnectionEndpoints (primary/secondary), DataConnectionTagQuality (good/bad/uncertain), ParkedMessageCount. New collector methods to populate them. Health dashboard redesigned to match mockup: Nodes | Data Connections (with per-connection tag quality) | Instances + S&F Buffers | Error Counts + Parked Messages. Site names resolved from repository.	2026-03-24 16:19:39 -04:00
Joseph Doherty	8095c8efbe	fix: only active singleton node sends health reports Both nodes of a site cluster were sending health reports. The standby node (without the DeploymentManager singleton) reported 0 instances and no connections, overwriting the active node's data in the aggregator. Added IsActiveNode flag to ISiteHealthCollector, set by DeploymentManagerActor on PreStart/PostStop. HealthReportSender skips sending when the node is not active. Also ensured EnsureDclConnections is called during startup batch creation so data connections survive container restarts.	2026-03-18 01:44:57 -04:00
Joseph Doherty	f165ca2774	feat: wire all health metrics and add instance counts to dashboard Wired ISiteHealthCollector calls for script errors (ScriptExecutionActor), alarm eval errors (AlarmActor), dead letters (DeadLetterMonitorActor), and S&F buffer depth placeholder. Added instance count tracking (deployed/enabled/disabled) to SiteHealthReport via DeploymentManagerActor. Updated Health Dashboard UI to show instance counts per site. All metrics flow through the existing health report pipeline via ClusterClient.	2026-03-18 00:57:49 -04:00
Joseph Doherty	389f5a0378	Phase 3B: Site I/O & Observability — Communication, DCL, Script/Alarm actors, Health, Event Logging Communication Layer (WP-1–5): - 8 message patterns with correlation IDs, per-pattern timeouts - Central/Site communication actors, transport heartbeat config - Connection failure handling (no central buffering, debug streams killed) Data Connection Layer (WP-6–14, WP-34): - Connection actor with Become/Stash lifecycle (Connecting/Connected/Reconnecting) - OPC UA + LmxProxy adapters behind IDataConnection - Auto-reconnect, bad quality propagation, transparent re-subscribe - Write-back, tag path resolution with retry, health reporting - Protocol extensibility via DataConnectionFactory Site Runtime (WP-15–25, WP-32–33): - ScriptActor/ScriptExecutionActor (triggers, concurrent execution, blocking I/O dispatcher) - AlarmActor/AlarmExecutionActor (ValueMatch/RangeViolation/RateOfChange, in-memory state) - SharedScriptLibrary (inline execution), ScriptRuntimeContext (API) - ScriptCompilationService (Roslyn, forbidden API enforcement, execution timeout) - Recursion limit (default 10), call direction enforcement - SiteStreamManager (per-subscriber bounded buffers, fire-and-forget) - Debug view backend (snapshot + stream), concurrency serialization - Local artifact storage (4 SQLite tables) Health Monitoring (WP-26–28): - SiteHealthCollector (thread-safe counters, connection state) - HealthReportSender (30s interval, monotonic sequence numbers) - CentralHealthAggregator (offline detection 60s, online recovery) Site Event Logging (WP-29–31): - SiteEventLogger (SQLite, 6 event categories, ISO 8601 UTC) - EventLogPurgeService (30-day retention, 1GB cap) - EventLogQueryService (filters, keyword search, keyset pagination) 541 tests pass, zero warnings.	2026-03-16 20:57:25 -04:00

8 Commits