scadalink-design

Author	SHA1	Message	Date
Joseph Doherty	e93f655ce4	feat(health): SiteAuditBacklog metric (count + age + bytes) (#23 M6)	2026-05-20 19:02:01 -04:00
Joseph Doherty	23c0fd417e	feat(health): AuditRedactionFailure counter + bridge (#23 M5) Bundle C task M5-T7 — surface DefaultAuditPayloadFilter redactor over-redactions as a Site Health metric so a misconfigured / catastrophic regex shows up on /monitoring/health rather than disappearing into a NoOp sink. - SiteHealthReport: new 'AuditRedactionFailure' int field (defaulted to 0 for back-compat with existing producers/tests). - ISiteHealthCollector / SiteHealthCollector: new IncrementAuditRedactionFailure() — per-interval atomic counter with Interlocked, reset on CollectReport, mirroring the M2 Bundle G SiteAuditWriteFailures pattern. - HealthMetricsAuditRedactionFailureCounter: new bridge in ScadaLink.AuditLog.Site that forwards IAuditRedactionFailureCounter increments to ISiteHealthCollector — mirrors HealthMetricsAuditWriteFailureCounter one-for-one. - AddAuditLogHealthMetricsBridge: now ALSO Replaces the NoOpAuditRedactionFailureCounter binding with the health-metrics bridge, so a single AddAuditLogHealthMetricsBridge() call wires both the M2 Bundle G write-failure counter and the M5 Bundle C redaction-failure counter into the health report. Site-side only for M5 — the filter also runs on CentralAuditWriter and AuditLogIngestActor (where it just keeps the NoOp default), but a central-side health-metric surface for AuditRedactionFailure is deferred to M6 alongside the rest of the central health collector work. Tests: - AuditRedactionFailureMetricTests (HealthMonitoring) covers the SiteHealthCollector increment/report/reset shape (3 tests). - HealthMetricsAuditRedactionFailureCounterTests (AuditLog) covers the AuditLog → HealthMonitoring bridge (3 tests). - Existing CountCapturingHealthCollector stub in DeploymentManagerRedeployTests extended with the new no-op interface method. Verified: dotnet build clean, all 24 test projects green (the only Failed at first ScadaLink.SiteRuntime.Tests run was the known-flaky InstanceActorChildAttributeRaceTests; passes on re-run in isolation and full suite, unrelated to these changes).	2026-05-20 17:28:33 -04:00
Joseph Doherty	dd3351da93	feat(health): SiteAuditWriteFailures counter + AuditLog bridge (#23 ) Bundle G of Audit Log #23 M2. Bridges the FallbackAuditWriter primary-failure counter into the Site Health Monitoring report payload so a sustained audit-write outage surfaces on /monitoring/health instead of disappearing into a NoOp sink. - SiteHealthReport: add SiteAuditWriteFailures (defaulted, additive). - ISiteHealthCollector + SiteHealthCollector: new IncrementSiteAuditWriteFailures() counter, per-interval reset semantics matching ScriptErrorCount / DeadLetterCount. - HealthMetricsAuditWriteFailureCounter: adapter forwarding IAuditWriteFailureCounter.Increment() to the collector. - AddAuditLogHealthMetricsBridge(): swaps the NoOp default registration for the real bridge; called from SiteServiceRegistration after AddSiteHealthMonitoring + AddAuditLog. - Existing host-wiring test updated: site composition now resolves HealthMetricsAuditWriteFailureCounter (not NoOp). Tests: HealthMonitoring 60 -> 63 (3 new), AuditLog 56 -> 59 (3 new), full solution green.	2026-05-20 13:22:25 -04:00
Joseph Doherty	437fe154e7	feat(triggers): add WhileTrue fire mode for Conditional/Expression script triggers Conditional and Expression script triggers gain an optional `mode` field in their TriggerConfiguration JSON: - OnTrue (default): unchanged edge/per-change firing. An absent mode field parses as OnTrue, so every existing trigger config behaves identically. - WhileTrue: fires on the false->true edge, then re-fires on a periodic timer while the condition holds; stops on the true->false edge. The re-fire cadence is the script's MinTimeBetweenRuns; with none configured the trigger degrades to a single edge fire and logs a warning. ScriptActor tracks condition truth state and manages a dedicated "whiletrue-trigger" timer. ScriptTriggerConfigCodec and ScriptTriggerEditor round-trip the mode and expose an OnTrue/WhileTrue selector for the two trigger kinds. Design: docs/plans/2026-05-18-whiletrue-trigger-mode-design.md Tests: 7 ScriptActor runtime tests (edge fire, timer re-fire, stop, re-arm, no-MinTimeBetweenRuns degrade, OnTrue regressions) + 14 codec / editor tests. SiteRuntime suite 206 green, CentralUI suite 295 green.	2026-05-18 10:44:11 -04:00
Joseph Doherty	6139a65a7b	fix(site-runtime): fan tag updates out to every attribute sharing a tag path InstanceActor._tagPathToAttribute was a Dictionary<string,string> — one tag path mapped to a single attribute. When two attributes reference the same PLC node (e.g. two composed cooling-tank modules both reading ns=3;s=Tank.Level, or a pump's TempSensor and AlarmSensor both reading ns=3;s=Sensor.Reading), SubscribeToDcl's map assignment overwrote, so only the last-registered attribute ever received values — the rest stayed permanently Uncertain. The map is now Dictionary<string,List<string>>; HandleTagValueUpdate fans each update out to every attribute referencing the tag path, and each distinct tag path is still subscribed only once per connection.	2026-05-18 04:21:26 -04:00
Joseph Doherty	be274212f0	fix(site-runtime): resolve SiteRuntime-017..019 — isolated attribute snapshot for child actors, corrected dispatcher doc, remove dead lifecycle handlers	2026-05-17 03:18:41 -04:00
Joseph Doherty	dd7626da63	fix(site-runtime): resolve SiteRuntime-012,013,015,016 — doc accuracy, shared LoggerFactory, execution-actor coverage; SiteRuntime-014 deferred	2026-05-16 22:32:30 -04:00
Joseph Doherty	a88bec9376	fix(site-runtime): resolve SiteRuntime-004..011 — deploy-after-persist, remove reflection, deterministic IDs, non-blocking startup, dedicated script scheduler, config-change detection, semantic trust-model check	2026-05-16 21:44:10 -04:00
Joseph Doherty	bc548e1447	feat(deployment-manager): resolve DeploymentManager-006 — query site deployment state before redeploy and reconcile Adds DeploymentStateQuery request/response contracts (Commons), a site-side handler (SiteRuntime), a CommunicationService query method (Communication), and reconciliation in DeploymentService: when a prior record is InProgress or Failed-on-timeout, query the site; if it already holds the target revision hash mark the record Success without re-sending; on query failure fall through to a normal deploy (site-side stale-rejection is the safety net).	2026-05-16 20:12:24 -04:00
Joseph Doherty	09b4bd5dfa	fix(site-runtime): resolve SiteRuntime-001/002/003 — route data-sourced writes to DCL, real per-attribute API results, race-free redeploy	2026-05-16 19:57:28 -04:00
Joseph Doherty	d030153378	test(site-runtime): fix stale SetStaticAttribute tests HandleSetStaticAttribute was made fire-and-forget (commit `2951507`) — it no longer replies with SetStaticAttributeResponse — but three InstanceActor tests still ExpectMsg<SetStaticAttributeResponse> and timed out. Verify the mutation via the GetAttributeRequest round-trip instead, which the FIFO mailbox makes a sound sync point. Test intent (in-memory update, SQLite persistence, serialized ordering) is unchanged.	2026-05-16 14:33:09 -04:00
Joseph Doherty	751248feb6	feat(alarms): HiLo trigger type with per-band level, hysteresis, messages, overrides Adds a new HiLo alarm trigger type with four configurable setpoints (LoLo / Lo / Hi / HiHi). Each setpoint carries an optional priority, deadband (for hysteresis), and operator message. The site runtime emits AlarmStateChanged with an AlarmLevel field so consumers can differentiate warning vs critical bands. Plumbing: - new AlarmLevel enum + AlarmStateChanged.Level/Message init properties - AlarmTriggerEditor (Blazor) gets a HiLo render with severity tinting - AlarmTriggerConfigCodec extracted from the editor for testability - sitestream.proto carries level + message over gRPC - SemanticValidator enforces numeric attribute, setpoint ordering, non-negative deadband - on-trigger scripts get an Alarm global (Name/Level/Priority/Message) so notification routing can branch by severity - per-instance InstanceAlarmOverride entity + EF migration + flattening step + CLI commands; HiLo overrides merge setpoint-by-setpoint, binary types whole-replace - DebugView shows a Level badge + per-band message tooltip - App.razor auto-reloads on permanent Blazor circuit failure - docker/regen-proto.sh automates the proto regen workflow (the linux/arm64 protoc segfault means generated files are checked in for now)	2026-05-13 03:23:32 -04:00
Joseph Doherty	49f042a937	refactor: remove ClusterClient streaming path (DebugStreamEvent), events flow via gRPC	2026-03-21 12:18:52 -04:00
Joseph Doherty	3efec91386	fix: route debug stream events through ClusterClient site→central path ClusterClient Sender refs are temporary proxies — valid for immediate reply but not durable for future Tells. Events now flow as DebugStreamEvent through SiteCommunicationActor → ClusterClient → CentralCommunicationActor → bridge actor (same pattern as health reports). Also fix DebugStreamHub to use IHubContext for long-lived callbacks instead of transient hub instance.	2026-03-21 11:32:17 -04:00
Joseph Doherty	775cb8084f	feat: data-sourced attributes start with uncertain quality before first DCL value Attributes bound to data connections now initialize with "Uncertain" quality, distinguishing "never received a value" from "known good" or "connection lost." Quality is tracked per attribute and included in GetAttributeResponse.	2026-03-17 18:25:39 -04:00
Joseph Doherty	389f5a0378	Phase 3B: Site I/O & Observability — Communication, DCL, Script/Alarm actors, Health, Event Logging Communication Layer (WP-1–5): - 8 message patterns with correlation IDs, per-pattern timeouts - Central/Site communication actors, transport heartbeat config - Connection failure handling (no central buffering, debug streams killed) Data Connection Layer (WP-6–14, WP-34): - Connection actor with Become/Stash lifecycle (Connecting/Connected/Reconnecting) - OPC UA + LmxProxy adapters behind IDataConnection - Auto-reconnect, bad quality propagation, transparent re-subscribe - Write-back, tag path resolution with retry, health reporting - Protocol extensibility via DataConnectionFactory Site Runtime (WP-15–25, WP-32–33): - ScriptActor/ScriptExecutionActor (triggers, concurrent execution, blocking I/O dispatcher) - AlarmActor/AlarmExecutionActor (ValueMatch/RangeViolation/RateOfChange, in-memory state) - SharedScriptLibrary (inline execution), ScriptRuntimeContext (API) - ScriptCompilationService (Roslyn, forbidden API enforcement, execution timeout) - Recursion limit (default 10), call direction enforcement - SiteStreamManager (per-subscriber bounded buffers, fire-and-forget) - Debug view backend (snapshot + stream), concurrency serialization - Local artifact storage (4 SQLite tables) Health Monitoring (WP-26–28): - SiteHealthCollector (thread-safe counters, connection state) - HealthReportSender (30s interval, monotonic sequence numbers) - CentralHealthAggregator (offline detection 60s, online recovery) Site Event Logging (WP-29–31): - SiteEventLogger (SQLite, 6 event categories, ISO 8601 UTC) - EventLogPurgeService (30-day retention, 1GB cap) - EventLogQueryService (filters, keyword search, keyset pagination) 541 tests pass, zero warnings.	2026-03-16 20:57:25 -04:00
Joseph Doherty	e9e6165914	Phase 3A: Site runtime foundation — Akka cluster, SQLite persistence, Deployment Manager singleton, Instance Actor - WP-1: Site cluster config (keep-oldest SBR, down-if-alone, 2s/10s failure detection) - WP-2: Site-role host bootstrap (no Kestrel, SQLite paths) - WP-3: SiteStorageService with deployed_configurations + static_attribute_overrides tables - WP-4: DeploymentManagerActor as cluster singleton with staggered Instance Actor creation, OneForOneStrategy/Resume supervision, deploy/disable/enable/delete lifecycle - WP-5: InstanceActor with attribute state, GetAttribute/SetAttribute, SQLite override persistence - WP-6: CoordinatedShutdown verified for graceful singleton handover - WP-7: Dual-node recovery (both seed nodes, min-nr-of-members=1) - WP-8: 31 tests (storage CRUD, actor lifecycle, supervision, negative checks) 389 total tests pass, zero warnings.	2026-03-16 20:34:56 -04:00

17 Commits