Commit Graph

18 Commits

Author SHA1 Message Date
Joseph Doherty
6af2607a50 feat(siteruntime): thread ParentExecutionId into the routed script's ScriptRuntimeContext 2026-05-21 17:35:49 -04:00
Joseph Doherty
e93f655ce4 feat(health): SiteAuditBacklog metric (count + age + bytes) (#23 M6) 2026-05-20 19:02:01 -04:00
Joseph Doherty
23c0fd417e feat(health): AuditRedactionFailure counter + bridge (#23 M5)
Bundle C task M5-T7 — surface DefaultAuditPayloadFilter redactor
over-redactions as a Site Health metric so a misconfigured /
catastrophic regex shows up on /monitoring/health rather than
disappearing into a NoOp sink.

  - SiteHealthReport: new 'AuditRedactionFailure' int field
    (defaulted to 0 for back-compat with existing producers/tests).
  - ISiteHealthCollector / SiteHealthCollector:
    new IncrementAuditRedactionFailure() — per-interval atomic
    counter with Interlocked, reset on CollectReport, mirroring
    the M2 Bundle G SiteAuditWriteFailures pattern.
  - HealthMetricsAuditRedactionFailureCounter: new bridge in
    ScadaLink.AuditLog.Site that forwards IAuditRedactionFailureCounter
    increments to ISiteHealthCollector — mirrors
    HealthMetricsAuditWriteFailureCounter one-for-one.
  - AddAuditLogHealthMetricsBridge: now ALSO Replaces the
    NoOpAuditRedactionFailureCounter binding with the health-metrics
    bridge, so a single AddAuditLogHealthMetricsBridge() call wires
    both the M2 Bundle G write-failure counter and the M5 Bundle C
    redaction-failure counter into the health report.

Site-side only for M5 — the filter also runs on CentralAuditWriter
and AuditLogIngestActor (where it just keeps the NoOp default), but
a central-side health-metric surface for AuditRedactionFailure is
deferred to M6 alongside the rest of the central health collector
work.

Tests:
  - AuditRedactionFailureMetricTests (HealthMonitoring) covers the
    SiteHealthCollector increment/report/reset shape (3 tests).
  - HealthMetricsAuditRedactionFailureCounterTests (AuditLog) covers
    the AuditLog → HealthMonitoring bridge (3 tests).
  - Existing CountCapturingHealthCollector stub in
    DeploymentManagerRedeployTests extended with the new no-op
    interface method.

Verified: dotnet build clean, all 24 test projects green
(the only Failed at first ScadaLink.SiteRuntime.Tests run was the
known-flaky InstanceActorChildAttributeRaceTests; passes on re-run
in isolation and full suite, unrelated to these changes).
2026-05-20 17:28:33 -04:00
Joseph Doherty
dd3351da93 feat(health): SiteAuditWriteFailures counter + AuditLog bridge (#23)
Bundle G of Audit Log #23 M2. Bridges the FallbackAuditWriter primary-
failure counter into the Site Health Monitoring report payload so a
sustained audit-write outage surfaces on /monitoring/health instead of
disappearing into a NoOp sink.

- SiteHealthReport: add SiteAuditWriteFailures (defaulted, additive).
- ISiteHealthCollector + SiteHealthCollector: new
  IncrementSiteAuditWriteFailures() counter, per-interval reset
  semantics matching ScriptErrorCount / DeadLetterCount.
- HealthMetricsAuditWriteFailureCounter: adapter forwarding
  IAuditWriteFailureCounter.Increment() to the collector.
- AddAuditLogHealthMetricsBridge(): swaps the NoOp default
  registration for the real bridge; called from
  SiteServiceRegistration after AddSiteHealthMonitoring + AddAuditLog.
- Existing host-wiring test updated: site composition now resolves
  HealthMetricsAuditWriteFailureCounter (not NoOp).

Tests: HealthMonitoring 60 -> 63 (3 new), AuditLog 56 -> 59 (3 new),
full solution green.
2026-05-20 13:22:25 -04:00
Joseph Doherty
437fe154e7 feat(triggers): add WhileTrue fire mode for Conditional/Expression script triggers
Conditional and Expression script triggers gain an optional `mode` field
in their TriggerConfiguration JSON:

- OnTrue (default): unchanged edge/per-change firing. An absent mode field
  parses as OnTrue, so every existing trigger config behaves identically.
- WhileTrue: fires on the false->true edge, then re-fires on a periodic
  timer while the condition holds; stops on the true->false edge. The
  re-fire cadence is the script's MinTimeBetweenRuns; with none configured
  the trigger degrades to a single edge fire and logs a warning.

ScriptActor tracks condition truth state and manages a dedicated
"whiletrue-trigger" timer. ScriptTriggerConfigCodec and ScriptTriggerEditor
round-trip the mode and expose an OnTrue/WhileTrue selector for the two
trigger kinds. Design: docs/plans/2026-05-18-whiletrue-trigger-mode-design.md

Tests: 7 ScriptActor runtime tests (edge fire, timer re-fire, stop,
re-arm, no-MinTimeBetweenRuns degrade, OnTrue regressions) + 14 codec /
editor tests. SiteRuntime suite 206 green, CentralUI suite 295 green.
2026-05-18 10:44:11 -04:00
Joseph Doherty
6139a65a7b fix(site-runtime): fan tag updates out to every attribute sharing a tag path
InstanceActor._tagPathToAttribute was a Dictionary<string,string> — one tag
path mapped to a single attribute. When two attributes reference the same PLC
node (e.g. two composed cooling-tank modules both reading ns=3;s=Tank.Level,
or a pump's TempSensor and AlarmSensor both reading ns=3;s=Sensor.Reading),
SubscribeToDcl's map assignment overwrote, so only the last-registered
attribute ever received values — the rest stayed permanently Uncertain.

The map is now Dictionary<string,List<string>>; HandleTagValueUpdate fans each
update out to every attribute referencing the tag path, and each distinct tag
path is still subscribed only once per connection.
2026-05-18 04:21:26 -04:00
Joseph Doherty
be274212f0 fix(site-runtime): resolve SiteRuntime-017..019 — isolated attribute snapshot for child actors, corrected dispatcher doc, remove dead lifecycle handlers 2026-05-17 03:18:41 -04:00
Joseph Doherty
dd7626da63 fix(site-runtime): resolve SiteRuntime-012,013,015,016 — doc accuracy, shared LoggerFactory, execution-actor coverage; SiteRuntime-014 deferred 2026-05-16 22:32:30 -04:00
Joseph Doherty
a88bec9376 fix(site-runtime): resolve SiteRuntime-004..011 — deploy-after-persist, remove reflection, deterministic IDs, non-blocking startup, dedicated script scheduler, config-change detection, semantic trust-model check 2026-05-16 21:44:10 -04:00
Joseph Doherty
bc548e1447 feat(deployment-manager): resolve DeploymentManager-006 — query site deployment state before redeploy and reconcile
Adds DeploymentStateQuery request/response contracts (Commons), a site-side
handler (SiteRuntime), a CommunicationService query method (Communication), and
reconciliation in DeploymentService: when a prior record is InProgress or
Failed-on-timeout, query the site; if it already holds the target revision hash
mark the record Success without re-sending; on query failure fall through to a
normal deploy (site-side stale-rejection is the safety net).
2026-05-16 20:12:24 -04:00
Joseph Doherty
09b4bd5dfa fix(site-runtime): resolve SiteRuntime-001/002/003 — route data-sourced writes to DCL, real per-attribute API results, race-free redeploy 2026-05-16 19:57:28 -04:00
Joseph Doherty
d030153378 test(site-runtime): fix stale SetStaticAttribute tests
HandleSetStaticAttribute was made fire-and-forget (commit 2951507) — it no
longer replies with SetStaticAttributeResponse — but three InstanceActor
tests still ExpectMsg<SetStaticAttributeResponse> and timed out. Verify the
mutation via the GetAttributeRequest round-trip instead, which the FIFO
mailbox makes a sound sync point. Test intent (in-memory update, SQLite
persistence, serialized ordering) is unchanged.
2026-05-16 14:33:09 -04:00
Joseph Doherty
751248feb6 feat(alarms): HiLo trigger type with per-band level, hysteresis, messages, overrides
Adds a new HiLo alarm trigger type with four configurable setpoints
(LoLo / Lo / Hi / HiHi). Each setpoint carries an optional priority,
deadband (for hysteresis), and operator message. The site runtime emits
AlarmStateChanged with an AlarmLevel field so consumers can differentiate
warning vs critical bands.

Plumbing:
  - new AlarmLevel enum + AlarmStateChanged.Level/Message init properties
  - AlarmTriggerEditor (Blazor) gets a HiLo render with severity tinting
  - AlarmTriggerConfigCodec extracted from the editor for testability
  - sitestream.proto carries level + message over gRPC
  - SemanticValidator enforces numeric attribute, setpoint ordering,
    non-negative deadband
  - on-trigger scripts get an Alarm global (Name/Level/Priority/Message)
    so notification routing can branch by severity
  - per-instance InstanceAlarmOverride entity + EF migration + flattening
    step + CLI commands; HiLo overrides merge setpoint-by-setpoint, binary
    types whole-replace
  - DebugView shows a Level badge + per-band message tooltip
  - App.razor auto-reloads on permanent Blazor circuit failure
  - docker/regen-proto.sh automates the proto regen workflow (the linux/arm64
    protoc segfault means generated files are checked in for now)
2026-05-13 03:23:32 -04:00
Joseph Doherty
49f042a937 refactor: remove ClusterClient streaming path (DebugStreamEvent), events flow via gRPC 2026-03-21 12:18:52 -04:00
Joseph Doherty
3efec91386 fix: route debug stream events through ClusterClient site→central path
ClusterClient Sender refs are temporary proxies — valid for immediate reply
but not durable for future Tells. Events now flow as DebugStreamEvent through
SiteCommunicationActor → ClusterClient → CentralCommunicationActor → bridge
actor (same pattern as health reports). Also fix DebugStreamHub to use
IHubContext for long-lived callbacks instead of transient hub instance.
2026-03-21 11:32:17 -04:00
Joseph Doherty
775cb8084f feat: data-sourced attributes start with uncertain quality before first DCL value
Attributes bound to data connections now initialize with "Uncertain" quality,
distinguishing "never received a value" from "known good" or "connection lost."
Quality is tracked per attribute and included in GetAttributeResponse.
2026-03-17 18:25:39 -04:00
Joseph Doherty
389f5a0378 Phase 3B: Site I/O & Observability — Communication, DCL, Script/Alarm actors, Health, Event Logging
Communication Layer (WP-1–5):
- 8 message patterns with correlation IDs, per-pattern timeouts
- Central/Site communication actors, transport heartbeat config
- Connection failure handling (no central buffering, debug streams killed)

Data Connection Layer (WP-6–14, WP-34):
- Connection actor with Become/Stash lifecycle (Connecting/Connected/Reconnecting)
- OPC UA + LmxProxy adapters behind IDataConnection
- Auto-reconnect, bad quality propagation, transparent re-subscribe
- Write-back, tag path resolution with retry, health reporting
- Protocol extensibility via DataConnectionFactory

Site Runtime (WP-15–25, WP-32–33):
- ScriptActor/ScriptExecutionActor (triggers, concurrent execution, blocking I/O dispatcher)
- AlarmActor/AlarmExecutionActor (ValueMatch/RangeViolation/RateOfChange, in-memory state)
- SharedScriptLibrary (inline execution), ScriptRuntimeContext (API)
- ScriptCompilationService (Roslyn, forbidden API enforcement, execution timeout)
- Recursion limit (default 10), call direction enforcement
- SiteStreamManager (per-subscriber bounded buffers, fire-and-forget)
- Debug view backend (snapshot + stream), concurrency serialization
- Local artifact storage (4 SQLite tables)

Health Monitoring (WP-26–28):
- SiteHealthCollector (thread-safe counters, connection state)
- HealthReportSender (30s interval, monotonic sequence numbers)
- CentralHealthAggregator (offline detection 60s, online recovery)

Site Event Logging (WP-29–31):
- SiteEventLogger (SQLite, 6 event categories, ISO 8601 UTC)
- EventLogPurgeService (30-day retention, 1GB cap)
- EventLogQueryService (filters, keyword search, keyset pagination)

541 tests pass, zero warnings.
2026-03-16 20:57:25 -04:00
Joseph Doherty
e9e6165914 Phase 3A: Site runtime foundation — Akka cluster, SQLite persistence, Deployment Manager singleton, Instance Actor
- WP-1: Site cluster config (keep-oldest SBR, down-if-alone, 2s/10s failure detection)
- WP-2: Site-role host bootstrap (no Kestrel, SQLite paths)
- WP-3: SiteStorageService with deployed_configurations + static_attribute_overrides tables
- WP-4: DeploymentManagerActor as cluster singleton with staggered Instance Actor creation,
  OneForOneStrategy/Resume supervision, deploy/disable/enable/delete lifecycle
- WP-5: InstanceActor with attribute state, GetAttribute/SetAttribute, SQLite override persistence
- WP-6: CoordinatedShutdown verified for graceful singleton handover
- WP-7: Dual-node recovery (both seed nodes, min-nr-of-members=1)
- WP-8: 31 tests (storage CRUD, actor lifecycle, supervision, negative checks)
389 total tests pass, zero warnings.
2026-03-16 20:34:56 -04:00