- Add SecurityOptionsValidator (IValidateOptions<SecurityOptions>) enforcing
RoleRefreshThresholdMinutes < IdleTimeoutMinutes; registered with ValidateOnStart in
AddSecurity — startup FAILS if threshold >= idle, so the invariant cannot be silently
misconfigured away.
- Update SecurityOptions XML-docs: class-level summary distinguishes JWT Bearer path
(JwtSigningKey/JwtExpiryMinutes) from Blazor cookie session path (IdleTimeoutMinutes/
RoleRefreshThresholdMinutes); both time fields document the ~45-min effective idle window
and the new cross-field constraint.
- Remove dead jwtService variable from /auth/login lambda in AuthEndpoints.cs (resolved
but never used since login moved to SessionClaimBuilder).
- Extract ApplyValidationResultAsync helper from OnValidatePrincipalAsync (pure
decision-application step); add 3 adapter tests covering Reject → RejectPrincipal +
SignOutAsync; Replace → ReplacePrincipal + ShouldRenew; Keep → no-op.
- Fix inaccurate TryRefreshAsync comment (dropped "OR last-activity needs advancing" —
the code only returns non-null when roleRefreshDue).
- Add InternalsVisibleTo for Security.Tests in Security.csproj.
- Add IsRoleRefreshDue tests: missing claim → due; unparsable claim → due; plus integration
test covering the full ValidateAsync path for a principal missing zb:lastrolerefresh
(triggers refresh + re-stamps anchor rather than keeping stale principal forever).
- Add SecurityOptionsValidatorConfigGuardTests: default succeeds; equal fails; greater fails;
boundary (idle-1) succeeds; wiring confirmed via AddSecurity container.
Spike outcome: the shared ILdapAuthService (ZB.MOM.WW.Auth.Abstractions, an external
NuGet package) exposes ONLY AuthenticateAsync(username, password, ct) — no passwordless
service-account group-search. A live LDAP group re-query for an active session therefore
requires a new lib method and is OUT OF SCOPE (cannot modify the external package).
Implemented the always-achievable layers (cookie-only; no embedded JWT for cookie principals):
- /auth/login now stores the user's raw LDAP groups (one zb:group claim each) plus a
zb:lastrolerefresh anchor (login time, UTC), seeding the LastActivity idle anchor too.
- SessionClaimBuilder: single shared DRY claim-builder used by BOTH /auth/login AND the
refresh path, so the two claim shapes cannot drift (canonical identity/role/scope claims
with nameType/roleType pinned, plus the M2.19 group + refresh-anchor additions).
- CookieSessionValidator (TimeProvider-injected, unit-testable) + a thin
CookieAuthenticationEvents.OnValidatePrincipal adapter:
* idle-timeout: a session past IdleTimeoutMinutes (default 30) is RejectPrincipal+SignOut;
consistent with the cookie ExpireTimeSpan+SlidingExpiration window (same value).
* role refresh WITHOUT LDAP: when older than RoleRefreshThresholdMinutes (new option,
default 15) the DB-backed RoleMapper re-runs on the STORED groups, claims are rebuilt
via the shared builder, the anchor advances, principal is replaced + cookie renewed.
Revoked DB mappings drop the user's roles mid-session.
* fail-soft: any refresh error KEEPS the existing principal (no sign-out, never throws)
— mirrors the documented "LDAP failure: active sessions continue with current roles".
- Documented residual limitation in Component-Security.md: central role-mapping/scope
changes apply within ~15 min without LDAP; live directory group-membership changes are
picked up only at next login (needs a passwordless group-search on the external
ZB.MOM.WW.Auth.Ldap lib — tracked follow-up).
Tests (Security.Tests, all green): CookieSessionValidatorTests + SessionClaimBuilderParityTests
— idle reject/keep, LDAP-free remap-from-stored-groups, revoked-roles loss, sub-threshold
no-refresh, refresh-throws-keeps-session, and login/refresh claim-parity.
- MockSiteStreamGrpcClient.SubscribeCalls and UnsubscribedCorrelationIds
switched from bare List<T> to lock-guarded backing fields with snapshot
accessors, eliminating the actor-thread/test-thread data race (matches
the existing lock(events) pattern for ReceivedEvents)
- AttributeKey and AlarmKey null-guard each component with ?? string.Empty
so a null SourceReference/AlarmName/etc. cannot silently collide with an
empty-string component in the dedup dictionary
- On_Snapshot_Opens_GrpcStream renamed to
On_Snapshot_Does_Not_Open_Additional_GrpcStream; assertion updated to
confirm exactly one subscribe (the PreStart stream-first open) with no
second subscribe after snapshot delivery
- _stopped ordering in InstanceNotFound path moved after CleanupGrpc()
for consistency with DebugStreamTerminated and ReceiveTimeout handlers
Re-architect DebugStreamBridgeActor from snapshot-first to stream-first so no
attribute/alarm event occurring during the snapshot-build + network-transit
window is lost (#26).
Lifecycle change:
- PreStart now opens the gRPC subscription FIRST (alongside sending the
SubscribeDebugViewRequest), so live events start flowing immediately.
- Phase model via a single _snapshotDelivered flag (mutated only on the actor
thread). While buffering (snapshot not yet delivered), AttributeValueChanged/
AlarmStateChanged are appended to an ordered _preSnapshotBuffer instead of
being delivered. After snapshot+flush, the same handlers pass through directly.
- On DebugViewSnapshot: deliver snapshot, then flush the buffer in arrival order
with per-entity dedup, then set _snapshotDelivered=true (pass-through).
Dedup rule (exactly-once):
- Identity: attributes by (InstanceUniqueName, AttributePath, AttributeName);
alarms by (InstanceUniqueName, AlarmName, SourceReference) so native
per-condition alarms are not conflated. Keys joined with a NUL delimiter
(declared as an escaped char constant; no raw NUL in source) so distinct
identities never collide on a space within a name.
- Boundary: a buffered event whose timestamp is <= the snapshot's timestamp for
the same entity is already reflected -> DROP; strictly-newer (>) -> DELIVER;
entity absent from the snapshot -> DELIVER (genuine gap-window event).
Preserved paths:
- M2.11 InstanceNotFound: with stream-first the gRPC stream is already open, so
the not-found path now tears it down (CleanupGrpc) + clears the buffer, does
NOT enter pass-through, delivers the not-found snapshot, and stops cleanly.
- Reconnect (ReconnectGrpcStream -> OpenGrpcStream) does not touch the phase
flag: a mid-session reconnect resumes pass-through; a reconnect during the
buffering phase stays buffering until the snapshot arrives.
- Communication-008 retry/stability/stop/terminate + ReceiveTimeout orphan net
unchanged. Duplicate/late snapshot after delivery is ignored defensively.
Tests: 10 new M2.18 tests (stream-first ordering, gap-window buffering, dedup
drop/deliver for attrs + alarms, ordering, pass-through, InstanceNotFound
teardown, reconnect-during-buffering, reconnect-after-snapshot) + revised the
M2.11 not-found test to assert stream teardown. Full DebugStreamBridgeActor
class green: 23/23.
AddSiteEventLogHealthMetricsBridge registered via AddHostedService(factory-lambda),
which sets ImplementationFactory and leaves ImplementationType null. The prior
ImplementationType == guard was therefore silently dead — a second call would spin
up a second SiteEventLogFailureCountReporter. Fix: add a private
SiteEventLogHealthMetricsBridgeMarker singleton and guard on its ServiceType instead.
Also corrects the cycle-path comment in both ServiceCollectionExtensions.cs and
SiteEventLogFailureCountReporter.cs: StoreAndForward.csproj does reference
SiteEventLogging.csproj, so the transitive path HealthMonitoring → StoreAndForward →
SiteEventLogging is real, but adding a direct HealthMonitoring → SiteEventLogging
reference would NOT create a cycle (SiteEventLogging has no back-edge to HealthMonitoring).
The Func<long> seam is a coupling-avoidance measure, not a cycle-breaker.
Adds AddSiteEventLogHealthMetricsBridgeTests.AddSiteEventLogHealthMetricsBridge_IsIdempotent_DoesNotDoubleRegister_HostedService
as a regression test (builds provider and asserts exactly one reporter via GetServices<IHostedService>().OfType<T>()).
- Remove redundant linked CancellationTokenSource in ProbeAsync; pass the
framework cancellationToken and ProbeTimeout directly to Ask (the two-CTS
pattern was redundant — Ask already honours both the timeout and the token).
- Add EchoActor XML <remarks> explaining why no Receive<Identify> handler is
needed (ActorBase answers Identify automatically).
- Add PreCancelledToken_ReportsUnhealthy_DoesNotThrow test: verifies the
never-throws guarantee on the shutdown-race path (token already cancelled
before CheckHealthAsync is invoked).
REQ-HOST-4a lists "required cluster singletons running (if applicable)" as a
readiness criterion, but /health/ready only checked database + akka-cluster.
Add a third Ready-tagged check, RequiredSingletonsHealthCheck, registered in the
Central-role AddHealthChecks() chain (so it is naturally role-scoped — site nodes
never run it).
Probe: for each required central singleton, Ask its local ClusterSingletonProxy
an Identify with a short bounded per-singleton timeout (~2s, probes run
concurrently via Task.WhenAll). A non-null ActorIdentity.Subject within the
timeout means the singleton is running and reachable through the proxy; a null
subject or a timeout means unreachable → Unhealthy, naming the unreachable
singleton(s). The check never throws (catch-all → Unhealthy) and resolves
ActorSystem lazily from DI per probe (Unhealthy if Akka not yet up).
Required-always set = the five singleton proxies created unconditionally in
AkkaHostedService.RegisterCentralActors: notification-outbox, audit-log-ingest,
site-call-audit, audit-log-purge, site-audit-reconciliation. There are no
feature/config-gated central singletons today; any future gated singleton is the
"if applicable" case and must NOT be added to the required set.
Leadership-agnostic: the proxy reaches the singleton from either central node, so
a ready standby still reports ready (readiness must not require cluster
leadership — that is the Active tier's job). During a brief singleton handover the
probe may time out and the node flaps to not-ready, which is correct (a node
mid-handover is legitimately not fully ready); no retries, to keep the probe fast.
Tests (TDD): RequiredSingletonsHealthCheckTests exercises the probe against a
TestKit ActorSystem — all proxies present+reachable → Healthy; one missing →
Unhealthy naming it; ActorSystem absent → Unhealthy, no throw. HealthCheckTests
regression-guards the Ready tag + absence of the Active tag on the new check.
OPC UA (RealOpcUaClient):
- Append 5 new SelectClauses at indices 13–17 (never renumber 0–12):
- 13: AlarmConditionType/ActiveState/TransitionTime → OriginalRaiseTime
- 14–17: LimitAlarmType HighHighLimit/HighLimit/LowLimit/LowLowLimit → LimitValue
- New OpcUaAlarmMapper.PickLimitValue helper: first non-null in HiHi→Hi→Lo→LoLo
priority order, InvariantCulture-formatted; empty string for non-limit alarm types.
- HandleAlarmEvent reads new indices with fields.Count > N guards; hard minimum (6)
unchanged so base ConditionType events still process without the limit fields.
- Document unavailable-by-protocol fields (Category, Description, OperatorUser,
CurrentValue) inline in BuildAlarmEventFilter and HandleAlarmEvent.
MxGateway (MxGatewayAlarmMapper):
- MapTransition: CurrentValue and LimitValue now populated via MxValueToString
(uses MxValueExtensions.ToClrValue + InvariantCulture) from OnAlarmTransitionEvent
proto fields current_value/limit_value.
- MapSnapshot: same — populated from ActiveAlarmSnapshot.current_value/limit_value.
- MxValueToString helper (internal): null-safe MxValue → string conversion.
Tests (17 new, 40 total pass):
- OpcUaAlarmMapperTests: PickLimitValue priority, InvariantCulture, all-null case.
- MxGatewayAlarmMapperTests: CurrentValue/LimitValue populate from double/string
MxValue; absent fields yield empty strings.
- RealOpcUaClientAlarmFilterTests: index alignment assertions (count=18, per-index
TypeDefinitionId+BrowsePath), regression guard on existing indices 0–12.
Replace bare task-discard with ContinueWith(OnlyOnFaulted|ExecuteSynchronously) so a
faulted ISiteEventLogger is logged and swallowed rather than going to the unobserved-task
firehose. Replace the "ScriptRuntimeContext" class-name fallback with the meaningful
"InstanceScript:{instanceName}" identifier (matching the site-event-log source convention).
Update the method doc-comment to state the best-effort contract explicitly. Pin the new
fallback value in the shape-precision test.
Inject ISiteEventLogger into ScriptRuntimeContext (additive optional ctor
param, defaulted null, all existing callers source-compatible). Add a single
private EmitRecursionLimitEventAsync helper that fires-and-forgets a
"script"/Error site event; called at both recursion guard sites (CallScript
at ~:332 and ScriptCallHelper.CallShared at ~:499). ScriptExecutionActor
threads the already-resolved siteEventLogger singleton into the context;
AlarmExecutionActor leaves it null (no siteEventLogger wired there).
Existing _logger.LogError + throw behaviour unchanged.
Tests: RecursionLimitSiteEventTests — 5 tests covering both CallScript and
CallShared (ISiteEventLogger.LogEventAsync called once with category "script",
severity "Error"; null logger path does not throw).
- Add DebugStreamBridgeActorTests: On_InstanceNotFound_Snapshot_Forwards_To_OnEvent_Does_Not_Open_Stream_And_Terminates — asserts _onEvent receives the not-found snapshot, SubscribeCalls remains empty, and the actor terminates cleanly via Watch/ExpectTerminated.
- Add comment in DebugStreamBridgeActor near Context.Stop(Self) explaining that the subsequent StopDebugStream Tell from DebugStreamService.StopStream produces a benign expected dead-letter.
- Reword not-found toast in DebugView.razor to "Instance not found on the selected site — check the deployment target." (accurate when the instance may be deployed to a different site).
RouteDebugSnapshot and RouteDebugViewSubscribe on DeploymentManagerActor
previously returned an empty DebugViewSnapshot for unknown instances,
indistinguishable from a deployed-but-empty instance. Callers had no way
to differentiate "not deployed here" from "deployed, no data yet."
Approach — additive field on existing message contract:
Added `bool InstanceNotFound = false` as an optional trailing parameter
to DebugViewSnapshot (Commons). All existing positional constructor calls
and serialized wire frames are unaffected (default = false). A dedicated
new message type was considered but rejected: the ClusterClient channel
and DebugStreamService TCS are already typed on DebugViewSnapshot, and a
second reply union would require wider changes for zero additive-safety
gain.
Changes:
- Commons/DebugViewSnapshot: add InstanceNotFound = false (additive)
- DeploymentManagerActor: set InstanceNotFound=true in both unknown-
instance branches (RouteDebugViewSubscribe, RouteDebugSnapshot)
- DebugStreamBridgeActor: when snapshot.InstanceNotFound, forward it to
_onEvent (resolves the TCS) then stop cleanly; no gRPC stream opened
- DebugView.razor: check session.InitialSnapshot.InstanceNotFound after
connect and show a clear "not deployed on this site" error toast
- 3 new tests in DeploymentManagerActorTests covering: unknown→snapshot,
unknown→subscribe, known-empty→InstanceNotFound stays false
Extends ContainsAuditLogMutation regex to match T-SQL bracketed forms
([AuditLog], [dbo].[AuditLog]) that SSMS-generated SQL produces; the
prior optional-schema pattern only matched bare/dbo-prefixed names,
silently missing these real violation forms.
Changes:
- Schema sub-pattern (?:dbo\.)? → (?:\[?dbo\]?\.)? (matches dbo. and [dbo].)
- Table sub-pattern AuditLog\b → \[?AuditLog\]?\b (matches AuditLog and [AuditLog])
- Pattern compiled as static readonly Regex field for clarity/performance
- Adds 4 new planted-positive cases: UPDATE [dbo].[AuditLog], UPDATE [AuditLog],
DELETE FROM [dbo].[AuditLog], DELETE FROM [AuditLog]
- Retains all existing negatives; adds DELETE FROM [dbo].[Notifications] negative
- Fixes misleading "reverse order" comment on the comment-prefix positive case
- Documents scan limitations (EF Core bulk methods; multi-line DML) in class XML doc
Adds AuditLogAppendOnlyGuardTests.cs to
tests/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase.Tests/ — a code-level backstop
for the DB-role DENY UPDATE / DENY DELETE control established in migration
20260602174346_CollapseAuditLogToCanonical.
The guard scans every non-Designer, non-Snapshot *.cs file in the
ConfigurationDatabase source tree and fails the test run if any line matches the
DML-syntax pattern:
UPDATE\s+(?:dbo\.)?AuditLog\b
DELETE\s+(?:FROM\s+)?(?:dbo\.)?AuditLog\b
The tight DML-syntax pattern naturally excludes false positives without extra
exclusion checks: DENY UPDATE ON dbo.AuditLog is not matched (UPDATE is followed
by ON, not the table name); ALTER TABLE … SWITCH and TRUNCATE contain no UPDATE/
DELETE keyword; comments with UPDATE/AuditLog in separate clauses are not matched.
Self-verifying unit tests (ContainsAuditLogMutation_*) prove the helper:
- returns false on clean-source lines (INSERT, SELECT, DENY DDL, ALTER SWITCH,
TRUNCATE, DELETE FROM Notifications);
- returns TRUE on planted violations (UPDATE AuditLog SET …, DELETE FROM
dbo.AuditLog WHERE …, lower-case variants);
- returns false on the exact DENY/GRANT/partition-switch strings from the
production migration files.
All 256 ConfigurationDatabase.Tests pass; solution builds 0 W / 0 E.
REQ-HOST-3/REQ-HOST-4 require a MachineDataDb connection string for Central nodes.
The shipped docker appsettings (docker/central-node-a/appsettings.Central.json and
central-node-b) already carry the key. Host-008 had removed the fail-fast Require
because MachineDataDb had no consumer yet; this commit reverses that decision so a
misconfigured or missing connection string is caught at startup with a clear error.
Changes:
- DatabaseOptions: add MachineDataDb property with XML doc comment
- StartupValidator: add .Require for ScadaBridge:Database:MachineDataDb inside the
existing Central .When block, immediately after the ConfigurationDb Require
- StartupValidatorTests: rename Central_MissingMachineDataDb_PassesValidation ->
FailsValidation and flip to Assert.Throws; update comment to cite REQ-HOST-3/4,
shipped docker appsettings, and the Host-008 reversal; add MachineDataDb to
ValidCentralConfig() so all other Central tests remain green
- CentralDbTestEnvironment: supply ScadaBridge__Database__MachineDataDb env var
(mirrors ConfigurationDb pattern) so HostStartupTests, HealthCheckTests, and
MetricsEndpointTests pass through the new Require
- CompositionRootTests, AkkaHostedServiceAuditWiringTests, ActorPathTests: set
ScadaBridge__Database__MachineDataDb env var alongside the pepper env var and
clear it in Dispose, matching the existing pepper handling pattern
Build: 0 warnings, 0 errors. dotnet test Host.Tests: 233/233 passed.
Add code comments in ValidateConnectionBindingCompleteness explaining
that the unbound-attribute branch also covers the silently-dropped
stale-binding case (cross-reference FlatteningService.ApplyConnectionBindings),
and that the `continue` skips the exists-at-site check for unbound attrs.
Add two new tests:
- FlatteningPipelineConnectionBindingTests: stale DataConnectionId (999)
not present in site connections → flattener drops it silently →
validator reports ConnectionBinding Error, IsValid false.
- ValidationServiceTests: enforce:true + siteConnectionNames:null on a
properly-bound attribute → no ConnectionBinding error (exists-at-site
check stays inert when site set is not supplied).
Pre-deployment validation only WARNED when a data-sourced attribute had no
connection binding, so an instance with unresolved bindings still passed IsValid
and could deploy. There was also no check that a binding resolves to a connection
that actually exists at the target site.
- ValidationService.Validate gains an opt-in `enforceConnectionBindings` flag
(default false) plus a `siteConnectionNames` set. Default-false keeps the
template DESIGN-TIME path (ManagementActor.HandleValidateTemplate) non-blocking,
since bindings are legitimately set later at instance/deploy time. The DEPLOY
path (FlatteningPipeline) opts in (true) so:
* a data-sourced attribute with no binding is now a deploy-gating Error;
* a binding to a connection that does not exist on the target site is an Error.
Static (non-data-sourced) attributes are never flagged.
- FlatteningPipeline computes the site-connection-names set from the loaded site
data connections (mirroring M2.1's alarmCapableConnectionNames) and threads it in.
- Tests: TemplateEngine.Tests covers design-time warning / deploy-time error /
static-ok / exists-at-site / non-existent-connection. New
FlatteningPipelineConnectionBindingTests proves the deploy path enforces it.
Mark M2.7 + M2.8 completed in the plan task tracker.
SplitCallArguments now skips C# line (`//`) and block (`/* */`) comments when
tokenizing the argument list, so a comma inside a comment no longer produces a
spurious arg-count mismatch. IsNumericLiteral now explicitly rejects tokens
whose first non-sign character is `_` or a letter (e.g. `_2`), and restricts
underscore digit-separators to positions after at least one digit, preventing
identifier-shaped tokens from being inferred as Integer/Float.
#20 return-type: when a CallScript/CallShared result is assigned directly into
a typed local declaration (optionally awaited, optionally via an Instance./
Scripts./Parent./Children["x"]. receiver), compare the LHS declared type
against the target script's declared ReturnDefinition and flag clear
cross-category mismatches (ReturnTypeMismatch). Previously BuildReturnMap was
built but never read.
#21 argument-type: positional call arguments are now split (paren/brace/bracket
+ string-literal aware) and each literal-inferable argument is checked against
the target's declared parameter type (ParameterMismatch), not just the count.
Conservative — only CLEAR primitive mismatches (String/Integer/Float/Boolean)
are flagged; Integer<->Float widening is tolerated. Unknown/Object/List
declarations, var/untyped/unused/expression-embedded assignments, and
non-literal arguments (variables, member access, method/await chains, casts,
object/array initializers, compound or concatenated expressions, interpolated
strings) are never flagged. Inference limits documented in code.
Adds 16 SemanticValidatorTests covering mismatch detection, correct-call pass,
and the dynamic/unknown no-false-positive cases.
Object/List parameters and return values were shape-validated only (object vs
array), with no field-level/nested type checks — type-wrong nested data passed
inbound validation and failed only at script runtime. Add recursive type
validation (declared Object field types, List element type, scalars at any depth)
with path-qualified errors, symmetric across ParameterValidator and ReturnValueValidator.
Both validators now parse the canonical JSON Schema definition format (the
Central UI / MigrateParametersToJsonSchema output) via a shared recursive engine,
Commons.Types.InboundApi.InboundApiSchema, instead of the legacy flat
[{name,type}] array which they could not even deserialize from migrated rows.
The legacy flat-array form is still accepted on read for transition safety.
Undeclared fields are rejected at every level (consistent with the existing
top-level unexpected-parameter rejection); a present-but-null value satisfies
any type, only absence of a required field is an error.
The UI script editor has no ExecutionTimeoutSeconds control (authoring deferred),
so a body edit silently cleared a timeout set via Transport import. Round-trip the
loaded value so UI edits preserve it. Add the missing AlarmExecutionActor null/<=0
fallback tests for symmetry with ScriptExecutionActor.
Spec promised a per-script timeout but only the global ScriptExecutionTimeoutSeconds
existed. Add nullable TemplateScript.ExecutionTimeoutSeconds threaded through EF +
flattening (ResolvedScript) to ScriptExecutionActor/AlarmExecutionActor, which use
perScript ?? global for the execution CTS. Includes the EF migration for the new column.
HandleAlarmEvent set AlarmTypeName to the event-type NodeId string ("i=9341"),
but the client-side conditionFilter gate (and the OPC UA WhereClause) use friendly
type names — so a friendly-name filter built a correct server WhereClause yet the
client gate dropped every event (zero alarms delivered). Resolve the event-type
NodeId to its friendly name via an inverse of KnownConditionTypeIds (NodeId-string
fallback for custom types) so both sides agree. Also fix a dead-code ternary in
the SourceName derivation.
conditionFilter was plumbed end-to-end but applied nowhere — a filtered source
silently mirrored all conditions. Define the filter as a comma-separated,
case-insensitive list of condition type names (blank = all); enforce it
authoritatively client-side in DataConnectionActor routing (uniform across OPC UA
+ MxGateway) and, for OPC UA, additionally build a server-side EventFilter
WhereClause as a bandwidth optimization.
ExecuteWriteAsync only caught SqlException, so a live outage surfacing as
InvalidOperationException/SocketException/IOException/TimeoutException escaped
unclassified and crashed the script actor instead of buffering. Mirror the HTTP
path: propagate OperationCanceledException on cancellation, classify transport
exceptions as transient (buffer+retry), let unexpected exceptions propagate.
CachedWrite buffered ALL write failures and retried forever, never returning a
synchronous failure to the script — permanent SQL errors (constraint/syntax/
permission) were treated as transient. Mirror the External-System API path:
attempt immediately, return Failed synchronously on permanent SQL errors (no
buffering), buffer only transient errors; the S&F retry path parks permanent
failures instead of retrying forever. New SqlErrorClassifier + PermanentDatabaseException.
- FormatConnection now includes BackupConfigurationJson so a backup-only change
no longer renders identical Before/After cells (covers all 4 ConnectionsEqual fields)
- add ComputeConnectionsDiff(null, newConfig) first-deploy unit test
ComputeConnectionsDiff existed with tests but was never called and ConfigurationDiff
had no slot for it, so standalone connection endpoint/protocol/failover drift never
appeared in the deployment diff (only per-attribute binding drift did). Add a
ConnectionChanges slot, wire ComputeConnectionsDiff into ComputeDiff, and render the
connection section in the deployment diff UI.
- connection-name capable-set comparer kept as StringComparer.Ordinal:
FlatteningService and SemanticValidator use all-ordinal name-keyed
dictionaries throughout; OrdinalIgnoreCase would be inconsistent with
the rest of the binding-resolution path — added comment documenting this
- IsAlarmCapable protocol-match confirmed consistent with DataConnectionFactory
(both OrdinalIgnoreCase); added case-insensitive InlineData variants
(OPCUA, opcua, mxgateway, MXGATEWAY) to lock the contract
- clarified FlatteningPipeline comment: "filters connections by alarm-capable
protocol, then collects their names" (was "maps from the protocol string")
- added DataConnectionLayer/DataConnectionFactory.cs path reference to
AlarmCapableProtocols sync-risk comment
FlatteningPipeline loaded data connections but never passed the alarm-capable
connection set to SemanticValidator, so the native-alarm-source capability check
(built but inert) never ran — a source bound to a non-alarm-capable connection
deployed silently. Compute the capable set (IAlarmSubscribableConnection: OPC UA
+ MxGateway) and thread it through ValidationService to SemanticValidator.
ScriptExecutionActor previously emitted only an Error 'script' event on failure.
It now also fire-and-forgets an Info 'script' event when execution starts (right
before RunAsync) and when it completes successfully — giving the operational log
the full started/completed/failed lifecycle. Uses the already-resolved
siteEventLogger; fire-and-forget so the event log can never block or fault the
script's own run.
Extends the SingleServiceProvider test helper to also serve IServiceScopeFactory
(returning a self-scope) so ScriptExecutionActor's serviceProvider.CreateScope()
reaches the logging hot path in tests instead of throwing into the catch.
StoreAndForwardService gains an optional ISiteEventLogger? ctor param (default
null so the many direct-construction tests still compile) and, when wired,
mirrors its own buffer/retry/park activity onto site operational events via the
existing OnActivity hook (which already isolates a throwing subscriber, so a
failing event log can never be misclassified as a transient delivery failure):
- store_and_forward (ExternalSystem / CachedDbWrite): queued/retried/delivered/
parked. Warning on buffer/retry, Error on park, Info on retry-recovery; an
immediate-success delivery is the hot path and is not logged.
- notification (the site forward-to-central path): logged ONLY on forward
FAILURE (buffered after the immediate forward threw) and on park, per the
Component-SiteEventLogging spec — routine enqueue and forward-success are
deliberately not logged (central's Notifications table is the audit record).
Wired through AddStoreAndForward (resolves ISiteEventLogger optionally from DI);
StoreAndForward project now references SiteEventLogging (acyclic: SiteEventLogging
references only Commons). Also documents the 'notification' category on the
ISiteEventLogger.LogEventAsync eventType param (folds in M1.8 doc fix).
DeploymentManagerActor now fire-and-forgets a 'deployment' site operational
event on deploy/enable/disable/delete outcomes (Info on success, Error on
failure), source 'DeploymentManagerActor'. The disable/delete events are emitted
from the existing PipeTo continuations (safe: reads only the immutable
_serviceProvider and fire-and-forgets).
InstanceActor now emits an 'instance_lifecycle' Info event in PreStart (started)
and a new PostStop (stopped) — covering start/stop/enable/disable/redeploy/
failover transitions from the instance's own vantage point. Both actors already
hold _serviceProvider; no ctor change.
Resolution is optional and LogEventAsync is fire-and-forget so a logging failure
never affects the deployment pipeline or instance lifecycle.
AlarmActor (computed) and NativeAlarmActor (native mirror) now fire-and-forget
an 'alarm' site operational event on every state transition:
- raise/activate: Error (priority/severity >= 700) or Warning
- clear/return-to-normal, ack, inter-band transition: Info
Both actors take a new optional IServiceProvider? ctor param (default null so
existing direct-construction tests still compile); InstanceActor passes its
_serviceProvider at the two Props.Create sites. Resolution is optional and the
LogEventAsync call is fire-and-forget, so a logging failure never affects alarm
evaluation. Rehydration replays are not re-logged.
Adds a capturing FakeSiteEventLogger test helper + SingleServiceProvider.
Add a daily purge tick to SiteCallAuditActor that drops terminal SiteCalls
rows older than the retention window via ISiteCallAuditRepository.PurgeTerminalAsync.
The threshold is computed each tick as UtcNow - RetentionDays so an operator who
lowers RetentionDays sees it on the next purge without a restart. Mirrors
AuditLogPurgeActor's daily cadence + continue-on-error posture: a purge fault is
logged and swallowed so the central singleton stays alive and retries next tick.
The purge timer is started in PreStart alongside the reconciliation timer and
gates on the same collaborators (pull client + enumerator) being available — the
repo-only test ctor injects neither, so neither background timer runs there.
Options: PurgeInterval (default 24h, clamped >= 1 min so a zero config value
can't spin the scheduler) + RetentionDays (default 365), plus a test-only
override that bypasses the clamp for millisecond cadences.
Tests (all in-memory, no live MSSQL): purge tick calls PurgeTerminalAsync with a
UtcNow - RetentionDays threshold (non-default 30 days); default retention yields
a 365-day threshold; a throwing repo does not kill the singleton (a second tick
still arrives).
Add a periodic reconciliation tick to SiteCallAuditActor that, per site,
pulls changed SiteCall rows since a per-site UpdatedAtUtc cursor and upserts
them idempotently (monotonic UpsertAsync) — the documented self-heal for lost
best-effort gRPC telemetry. Mirrors SiteAuditReconciliationActor's structure
(per-site cursor, per-site try/catch failure isolation, advance cursor by max
observed UpdatedAtUtc) minus the stalled-detection EventStream machinery.
Dependency wiring: add an acyclic SiteCallAudit -> AuditLog project reference
and resolve IPullSiteCallsClient + ISiteEnumerator (central-only singletons
registered by AddAuditLogCentralReconciliationClient) from the IServiceProvider
the production ctor already holds — no Host Props.Create change needed. The
repo-only test ctor injects neither collaborator, so the tick is gated off
there. A new public test ctor injects fake client + enumerator + repo so the
tick is unit-testable in-memory (public, not internal: Akka's ActivatorProducer
uses public-only reflection binding).
Options: ReconciliationInterval (default 5 min, clamped >= 1s so a zero config
value can't spin the scheduler) + ReconciliationBatchSize (default 500), plus a
test-only override that bypasses the clamp for millisecond cadences.
Tests (all in-memory, no live MSSQL): absent row is upserted on a tick; second
tick advances the cursor past already-pulled rows; one failing site does not
sink other sites; repo-only ctor does not start the tick.
Site Call Audit (#22): build the documented periodic reconciliation PULL
self-heal path for the eventually-consistent central SiteCalls mirror, as a
dedicated PullSiteCalls gRPC RPC kept separate from the audit pull. This is the
pull PLUMBING only; the central reconciliation tick is a separate follow-up.
- IOperationTrackingStore.ReadChangedSinceAsync(sinceUtc, batchSize): inclusive
UpdatedAtUtc cursor, oldest-first, batch-capped; SQLite impl projects tracking
rows onto SiteCallOperational (Kind->Channel, TargetSummary->Target, SourceSite
left empty - the store has no site-id column).
- sitestream.proto: rpc PullSiteCalls + PullSiteCallsRequest/Response, mirroring
PullAuditEvents; regenerated checked-in SiteStreamGrpc/*.cs.
- SiteCallDtoMapper.ToDto(SiteCallOperational): inverse of FromDto for the handler.
- SiteStreamGrpcServer.PullSiteCalls handler + SetOperationTrackingStore seam;
Host wires the seam alongside SetSiteAuditQueue (site roles only).
- Central IPullSiteCallsClient + GrpcPullSiteCallsClient (home: AuditLog/Central to
reuse ISiteEnumerator; SiteCallAudit does not reference AuditLog). Re-stamps
SourceSite from the dialed siteId; no-throw on tolerable transport faults;
SpecifyKind (not ToUniversalTime) cursor handling. Central-only DI registration.
Tests: ReadChangedSinceAsync (4), PullSiteCalls handler (6), GrpcPullSiteCallsClient
(8). Full solution build 0 warnings/0 errors (TreatWarningsAsErrors).