Add two optional nullable fields (TlsMode, Credentials) to the
UpdateSmtpConfigCommand record. The handler applies preserve-if-null
semantics: an update that omits a field leaves the existing value
intact, so existing 5-arg callers remain non-breaking.
Bundle C task M5-T7 — surface DefaultAuditPayloadFilter redactor
over-redactions as a Site Health metric so a misconfigured /
catastrophic regex shows up on /monitoring/health rather than
disappearing into a NoOp sink.
- SiteHealthReport: new 'AuditRedactionFailure' int field
(defaulted to 0 for back-compat with existing producers/tests).
- ISiteHealthCollector / SiteHealthCollector:
new IncrementAuditRedactionFailure() — per-interval atomic
counter with Interlocked, reset on CollectReport, mirroring
the M2 Bundle G SiteAuditWriteFailures pattern.
- HealthMetricsAuditRedactionFailureCounter: new bridge in
ScadaLink.AuditLog.Site that forwards IAuditRedactionFailureCounter
increments to ISiteHealthCollector — mirrors
HealthMetricsAuditWriteFailureCounter one-for-one.
- AddAuditLogHealthMetricsBridge: now ALSO Replaces the
NoOpAuditRedactionFailureCounter binding with the health-metrics
bridge, so a single AddAuditLogHealthMetricsBridge() call wires
both the M2 Bundle G write-failure counter and the M5 Bundle C
redaction-failure counter into the health report.
Site-side only for M5 — the filter also runs on CentralAuditWriter
and AuditLogIngestActor (where it just keeps the NoOp default), but
a central-side health-metric surface for AuditRedactionFailure is
deferred to M6 alongside the rest of the central health collector
work.
Tests:
- AuditRedactionFailureMetricTests (HealthMonitoring) covers the
SiteHealthCollector increment/report/reset shape (3 tests).
- HealthMetricsAuditRedactionFailureCounterTests (AuditLog) covers
the AuditLog → HealthMonitoring bridge (3 tests).
- Existing CountCapturingHealthCollector stub in
DeploymentManagerRedeployTests extended with the new no-op
interface method.
Verified: dotnet build clean, all 24 test projects green
(the only Failed at first ScadaLink.SiteRuntime.Tests run was the
known-flaky InstanceActorChildAttributeRaceTests; passes on re-run
in isolation and full suite, unrelated to these changes).
Bundle C of Audit Log #23 M3.
Adds the ScadaLink.SiteCallAudit project + matching tests project, mirroring
the ScadaLink.AuditLog scaffolding pattern (net10.0, central package
management, InternalsVisibleTo to the tests assembly).
SiteCallAuditActor is the central singleton entry point for Site Call Audit
(#22): it receives UpsertSiteCallCommand and persists the SiteCall via
ISiteCallAuditRepository.UpsertAsync (monotonic, idempotent — out-of-order
or duplicate updates are silent no-ops at the repo). Audit-write failures
NEVER abort the user-facing action (CLAUDE.md): repository throws are
caught + logged, the actor replies Accepted=false, and the singleton stays
alive (Resume supervisor strategy as defence in depth).
Two constructors mirror AuditLogIngestActor:
- IServiceProvider production constructor resolves the scoped EF repository
from a fresh DI scope per message.
- ISiteCallAuditRepository test constructor injects a concrete repository so
the TestKit tests exercise the real monotonic-upsert SQL end to end.
UpsertSiteCallCommand + UpsertSiteCallReply live in ScadaLink.Commons (same
home as IngestAuditEventsCommand) so Bundle D's gRPC server can construct
them without taking a project reference on the actor's host project.
AddSiteCallAudit() is a placeholder for symmetry with AddAuditLog /
AddNotificationOutbox; Bundle F will populate it with the actor's Props
factory + options bindings.
Tests (Akka.TestKit.Xunit2 + MsSqlMigrationFixture via project ref to
ScadaLink.ConfigurationDatabase.Tests, mirroring Bundle D2):
- Receive_UpsertSiteCallCommand_Persists_Replies_Accepted
- Receive_DuplicateUpsert_OlderStatus_NoOp_StillRepliesAccepted (idempotency)
- Receive_RepoThrowsTransient_RepliesAccepted_False_ActorStaysAlive
Reconciliation, KPIs, and the central->site Retry/Discard relay are
deferred per CLAUDE.md scope discipline.
ScadaLink.slnx updated to include both new projects.
All 3 new tests pass against the running infra/mssql container; full suite
(2683 tests across 27 projects) passes with no regressions.
Bundle G of Audit Log #23 M2. Bridges the FallbackAuditWriter primary-
failure counter into the Site Health Monitoring report payload so a
sustained audit-write outage surfaces on /monitoring/health instead of
disappearing into a NoOp sink.
- SiteHealthReport: add SiteAuditWriteFailures (defaulted, additive).
- ISiteHealthCollector + SiteHealthCollector: new
IncrementSiteAuditWriteFailures() counter, per-interval reset
semantics matching ScriptErrorCount / DeadLetterCount.
- HealthMetricsAuditWriteFailureCounter: adapter forwarding
IAuditWriteFailureCounter.Increment() to the collector.
- AddAuditLogHealthMetricsBridge(): swaps the NoOp default
registration for the real bridge; called from
SiteServiceRegistration after AddSiteHealthMonitoring + AddAuditLog.
- Existing host-wiring test updated: site composition now resolves
HealthMetricsAuditWriteFailureCounter (not NoOp).
Tests: HealthMonitoring 60 -> 63 (3 new), AuditLog 56 -> 59 (3 new),
full solution green.
The script-analysis sandbox Notify surface was stale after the Notification
Outbox change: SandboxNotifyTarget.Send returned Task<NotificationResult> and
there was no Status method, while production NotifyTarget.Send returns
Task<string> (a NotificationId) plus NotifyHelper.Status. A script that
test-ran cleanly in the sandbox would not compile against the real site
runtime.
- Move the NotificationDeliveryStatus record from ScadaLink.SiteRuntime.Scripts
into ScadaLink.Commons.Messages.Notification so both production and the
CentralUI sandbox reference the exact same type (CentralUI does not, and
should not, reference SiteRuntime). Production NotifyHelper.Status is
otherwise untouched.
- Rewrite SandboxNotifyHelper/SandboxNotifyTarget to be a signature-faithful
no-op fake: Send returns Task<string> (a fake NotificationId), Status returns
Task<NotificationDeliveryStatus>. Production now enqueues into the site S&F
engine, which has no central-side equivalent in the sandbox, so the fake no
longer carries an INotificationDeliveryService.
- Add script-analysis tests proving a script using the new Notify shape both
diagnoses clean and runs in the sandbox.
Adds DeploymentStateQuery request/response contracts (Commons), a site-side
handler (SiteRuntime), a CommunicationService query method (Communication), and
reconciliation in DeploymentService: when a prior record is InProgress or
Failed-on-timeout, query the site; if it already holds the target revision hash
mark the record Success without re-sending; on query failure fall through to a
normal deploy (site-side stale-rejection is the safety net).
The Test Run sandbox and Monaco analysis modelled a script API that had
drifted from the site runtime's ScriptGlobals, so real scripts failed to
compile in Test Run. Realign both to the runtime surface
(Instance/Scripts/ExternalSystem/Attributes/Children/Parent) and drop the
duplicate ScriptHost stub so the two cannot diverge again.
- Script calls (Scripts.CallShared, Instance.CallScript, Route.To().Call)
accept an anonymous object instead of a hand-built dictionary, via a
shared ScriptArgs normalizer; existing dictionary calls still compile.
- Test Run can optionally bind to a deployed instance, so Instance/
Attributes/CallScript route to it cross-site; adds site-side
RouteToGetAttributes/RouteToSetAttributes handlers.
- Adds Test Run panels to the API method and template script editors.
- Fixes the TestDatabaseQuery seed script, which queried a table that
never existed.
Also commits unrelated in-progress work already in the tree: the health
monitoring report loop, site streaming changes, and the Admin/Design
data-connection and SMTP page reorganization.
Triage was painful on the old layout: a lone Site dropdown sat on a sparse
row, errors were truncated mid-sentence with a per-row View/Hide toggle
that on expand pushed an unwrapped <pre> through the table and shoved the
Actions column off-screen, all rows looked the same regardless of age or
attempt count, and OriginInstance — which tells you which instance
produced the failure — wasn't displayed at all even though the data was
on the entity.
This pass:
- Adds a real filter bar: Site, Category, Target system, Origin instance,
Age window, free-text search. Category/Target/Origin/Age/Search filter
the loaded page client-side; Site still drives the server query (and
changing site now auto-queries — one fewer click).
- Replaces the in-table expansion with an Offcanvas detail drawer.
Clicking a row slides in a side panel with full message ID + copy,
category label, origin, attempts, both timestamps in relative + absolute
form, the complete error (pre-wrap, scrollable), and big Retry / Discard
buttons. The table never overflows.
- Stacks Target + Method into one column (target in semibold, method
small/muted below) and surfaces Origin as a code-styled chip in a new
column ("—" muted when null).
- Severity left-border on each row, derived client-side from
AttemptCount/MaxAttempts and age of the last attempt: red when retries
are exhausted and last attempt was in the past hour, amber when
exhausted but stale, muted grey otherwise.
- Mini attempt progress bar under the n/max count, red when fully
exhausted and amber while partial.
- Relative timestamps ("5m ago", "1h ago", "2d ago") with absolute UTC on
hover via the title attribute — applies in both the table and the drawer.
- Bulk select: header checkbox selects the filtered set, per-row
checkboxes. When ≥1 selected, a sticky action strip slides in below the
filter bar offering Retry selected / Discard selected with the usual
confirm dialog. Toast reports per-item success/failure counts.
- Summary line next to the title: "N parked · K target systems · oldest
Xh ago" (and "(showing M of N)" when filters are active).
- ParkedMessageEntry contract extended additively with MaxAttempts,
Category, and OriginInstance so the UI has the data it needs for
severity, the category filter, and the new column.
- Bumped page size from 25 to 50 to better match the dense layout.
CentralHealthAggregator is a per-node hosted singleton, but site health
reports flow through ClusterClient which round-robins each report to one
central node only. The other node's aggregator never saw those reports
and marked sites offline at the 60s threshold — sites constantly flapped
between online and offline on the monitoring page.
On receive, the active CentralCommunicationActor now republishes a
SiteHealthReportReplica wrapper on a DistributedPubSub topic. Both
central nodes subscribe to the topic and process replicas through a
dedicated path that updates the local aggregator without re-broadcasting
(avoids fan-out loops). The aggregator's existing sequence-number
idempotency makes self-delivery a cheap no-op.
DistributedPubSubExtensionProvider is now listed in the HOCON
`akka.extensions` block so the mediator is initialised at cluster
start, eliminating a race where the first Subscribe arrived before the
extension was loaded.
Adds a new HiLo alarm trigger type with four configurable setpoints
(LoLo / Lo / Hi / HiHi). Each setpoint carries an optional priority,
deadband (for hysteresis), and operator message. The site runtime emits
AlarmStateChanged with an AlarmLevel field so consumers can differentiate
warning vs critical bands.
Plumbing:
- new AlarmLevel enum + AlarmStateChanged.Level/Message init properties
- AlarmTriggerEditor (Blazor) gets a HiLo render with severity tinting
- AlarmTriggerConfigCodec extracted from the editor for testability
- sitestream.proto carries level + message over gRPC
- SemanticValidator enforces numeric attribute, setpoint ordering,
non-negative deadband
- on-trigger scripts get an Alarm global (Name/Level/Priority/Message)
so notification routing can branch by severity
- per-instance InstanceAlarmOverride entity + EF migration + flattening
step + CLI commands; HiLo overrides merge setpoint-by-setpoint, binary
types whole-replace
- DebugView shows a Level badge + per-band message tooltip
- App.razor auto-reloads on permanent Blazor circuit failure
- docker/regen-proto.sh automates the proto regen workflow (the linux/arm64
protoc segfault means generated files are checked in for now)
Add NodeStatus record, IClusterNodeProvider interface, and AkkaClusterNodeProvider
that queries Akka cluster membership for all site-role nodes. HealthReportSender
populates ClusterNodes before each report. UI shows a row per node with
hostname, Online/Offline badge, and Primary/Standby badge. Falls back to
single-node display if ClusterNodes is not populated.
New fields in SiteHealthReport: NodeHostname, DataConnectionEndpoints
(primary/secondary), DataConnectionTagQuality (good/bad/uncertain),
ParkedMessageCount. New collector methods to populate them.
Health dashboard redesigned to match mockup: Nodes | Data Connections
(with per-connection tag quality) | Instances + S&F Buffers | Error
Counts + Parked Messages. Site names resolved from repository.
Thread backup data connection fields through management command messages,
ManagementActor handlers, SiteService, site-side SQLite storage, and
deployment/replication actors. The old --configuration CLI flag is kept
as a hidden alias for backwards compatibility.
Add ActiveEndpoint field to DataConnectionHealthReport showing which
endpoint is active (Primary, Backup, or Primary with no backup configured).
Log failover transitions and connection restoration events to the site
event log via ISiteEventLogger, passed as an optional parameter through
the actor hierarchy for backwards compatibility.
Update CreateConnectionCommand to carry PrimaryConnectionDetails,
BackupConnectionDetails, and FailoverRetryCount. Update all callers:
DataConnectionManagerActor, DataConnectionActor, DeploymentManagerActor,
FlatteningService, and ConnectionConfig. The actor stores both configs
but continues using primary only — failover logic comes in Task 3.
Replace SiteDataConnectionAssignment join table with a direct SiteId FK on DataConnection,
simplifying the data model, repositories, UI, CLI, and deployment service.
ClusterClient Sender refs are temporary proxies — valid for immediate reply
but not durable for future Tells. Events now flow as DebugStreamEvent through
SiteCommunicationActor → ClusterClient → CentralCommunicationActor → bridge
actor (same pattern as health reports). Also fix DebugStreamHub to use
IHubContext for long-lived callbacks instead of transient hub instance.
Replace the CLI's Akka.NET ClusterClient transport with a simple HTTP client
targeting a new POST /management endpoint on the Central Host. The endpoint
handles Basic Auth, LDAP authentication, role resolution, and ManagementActor
dispatch in a single round-trip — eliminating the CLI's Akka, LDAP, and
Security dependencies.
Also fixes DCL ReSubscribeAll losing subscriptions on repeated reconnect by
deriving the tag list from _subscriptionsByInstance instead of _subscriptionIds.
- Add JoeAppEngine folder to OPC UA nodes.json (BTCS, AlarmCntsBySeverity, Scheduler/ScanTime)
- Fix DataConnectionActor: capture Self in PreStart for use from non-actor threads,
preventing Self.Tell failure in Disconnected event handler
- Implement InstanceActor.HandleConnectionQualityChanged to mark attributes Bad on disconnect
- Fix LmxFakeProxy TagMapper to serialize arrays as JSON instead of "System.Int32[]"
- Allow DataType and DataSourceReference updates in TemplateService.UpdateAttributeAsync
- Update test_infra_opcua.md with JoeAppEngine documentation
Add SiteReplicationActor (runs on every site node) to replicate deployed
configs and store-and-forward buffer operations to the standby peer via
cluster member discovery and fire-and-forget Tell. Wire ReplicationService
handler and pass replication actor to DeploymentManagerActor singleton.
Fix 5 pre-existing ConfigurationDatabase test failures: RowVersion NOT NULL
on SQLite, stale migration name assertion, and seed data count mismatch.
Adds `debug snapshot --id <int>` to query a running instance's current
attribute values and alarm states without the subscribe/stream overhead
of the debug view. Routes through ManagementActor → CommunicationService
→ site DeploymentManager → InstanceActor using the existing remote query
pattern.
Add 33 new management message records, ManagementActor handlers, and CLI
commands to close all functionality gaps between the Central UI and the
Management CLI. New capabilities include:
- Template member CRUD (attributes, alarms, scripts, compositions)
- Shared script CRUD
- Database connection definition CRUD
- Inbound API method CRUD
- LDAP scope rule management
- API key enable/disable
- Area update
- Remote event log and parked message queries
- Missing get/update commands for templates, sites, instances, data
connections, external systems, notifications, and SMTP config
Includes 12 new ManagementActor unit tests covering authorization,
happy-path queries, and error handling. Updates CLI README and component
design documents (Component-CLI.md, Component-ManagementService.md).
Wired ISiteHealthCollector calls for script errors (ScriptExecutionActor),
alarm eval errors (AlarmActor), dead letters (DeadLetterMonitorActor), and
S&F buffer depth placeholder. Added instance count tracking (deployed/
enabled/disabled) to SiteHealthReport via DeploymentManagerActor. Updated
Health Dashboard UI to show instance counts per site. All metrics flow
through the existing health report pipeline via ClusterClient.
Central now resolves site Akka remoting addresses from the Sites DB table
(NodeAAddress/NodeBAddress) instead of relying on runtime RegisterSite
messages. Eliminates the race condition where sites starting before central
had their registration dead-lettered. Addresses are cached in
CentralCommunicationActor with 60s periodic refresh and on-demand refresh
when sites are added/edited/deleted via UI or CLI.
Attributes bound to data connections now initialize with "Uncertain" quality,
distinguishing "never received a value" from "known good" or "connection lost."
Quality is tracked per attribute and included in GetAttributeResponse.
Two Akka.NET deserialization bugs prevented CLI commands from reaching ManagementActor: IReadOnlyList<string> in AuthenticatedUser serialized as a compiler-generated internal type unknown to the server, and ManagementSuccess.Data carried server-side assembly types the CLI couldn't resolve on receipt. Fixed by using string[] for roles and pre-serializing response data to JSON in ManagementActor before sending. Adds full CLI reference documentation covering all 10 command groups.
- Add TagValueUpdate/ConnectionQualityChanged handlers to InstanceActor
- InstanceActor subscribes to DCL on PreStart based on DataSourceReference
- DeploymentManagerActor creates DCL connections on deploy and passes DCL ref
- AkkaHostedService creates DCL Manager Actor for tag subscriptions
- Move CreateConnectionCommand to Commons for cross-project access
- Add ConnectionConfig to FlattenedConfiguration for deployment packaging