Adds a "Dev Disable-Login Flag" subsection to Component-Security.md covering
ScadaBridge:Security:Auth:DisableLogin / User, the AutoLoginAuthenticationHandler
mechanism, and the no-environment-guard / startup-warning production risk.
Ships DisableLogin: false under ScadaBridge → Security → Auth in:
- src/.../Host/appsettings.json (canonical default)
- docker/central-node-a/appsettings.Central.json
- docker/central-node-b/appsettings.Central.json
Also records DL-3 commit SHAs in the plan tasks file.
Spike outcome: the shared ILdapAuthService (ZB.MOM.WW.Auth.Abstractions, an external
NuGet package) exposes ONLY AuthenticateAsync(username, password, ct) — no passwordless
service-account group-search. A live LDAP group re-query for an active session therefore
requires a new lib method and is OUT OF SCOPE (cannot modify the external package).
Implemented the always-achievable layers (cookie-only; no embedded JWT for cookie principals):
- /auth/login now stores the user's raw LDAP groups (one zb:group claim each) plus a
zb:lastrolerefresh anchor (login time, UTC), seeding the LastActivity idle anchor too.
- SessionClaimBuilder: single shared DRY claim-builder used by BOTH /auth/login AND the
refresh path, so the two claim shapes cannot drift (canonical identity/role/scope claims
with nameType/roleType pinned, plus the M2.19 group + refresh-anchor additions).
- CookieSessionValidator (TimeProvider-injected, unit-testable) + a thin
CookieAuthenticationEvents.OnValidatePrincipal adapter:
* idle-timeout: a session idle for longer than IdleTimeoutMinutes (default 30) is
rejected (RejectPrincipal + SignOut); the value matches the cookie
ExpireTimeSpan + SlidingExpiration window.
* role refresh WITHOUT LDAP: when older than RoleRefreshThresholdMinutes (new option,
default 15) the DB-backed RoleMapper re-runs on the STORED groups, claims are rebuilt
via the shared builder, the anchor advances, principal is replaced + cookie renewed.
Revoked DB mappings drop the user's roles mid-session.
* fail-soft: any refresh error KEEPS the existing principal (no sign-out, never throws)
— mirrors the documented "LDAP failure: active sessions continue with current roles".
- Documented residual limitation in Component-Security.md: central role-mapping/scope
changes apply within ~15 min without LDAP; live directory group-membership changes are
picked up only at next login (needs a passwordless group-search on the external
ZB.MOM.WW.Auth.Ldap lib — tracked follow-up).
Tests (Security.Tests, all green): CookieSessionValidatorTests + SessionClaimBuilderParityTests
— idle reject/keep, LDAP-free remap-from-stored-groups, revoked-roles loss, sub-threshold
no-refresh, refresh-throws-keeps-session, and login/refresh claim-parity.
git blame shows commit 1d5465f3 deliberately added NotDeployed to CanDelete so an
undeployed instance can have its orphan record fully removed. Code + tests already
permit it; the spec matrix said 'No'. Per M2.17, reconcile doc→code (not the reverse):
matrix now reads 'Delete from Not deployed = Yes (removes the orphan record)' with a
note, and CanDelete carries a remark citing the rationale + origin commit.
REQ-HOST-4a lists "required cluster singletons running (if applicable)" as a
readiness criterion, but /health/ready only checked database + akka-cluster.
Add a third Ready-tagged check, RequiredSingletonsHealthCheck, registered in the
Central-role AddHealthChecks() chain (so it is naturally role-scoped — site nodes
never run it).
Probe: for each required central singleton, Ask its local ClusterSingletonProxy
an Identify with a short bounded per-singleton timeout (~2s, probes run
concurrently via Task.WhenAll). A non-null ActorIdentity.Subject within the
timeout means the singleton is running and reachable through the proxy; a null
subject or a timeout means unreachable → Unhealthy, naming the unreachable
singleton(s). The check never throws (catch-all → Unhealthy) and resolves
ActorSystem lazily from DI per probe (Unhealthy if Akka not yet up).
Required-always set = the five singleton proxies created unconditionally in
AkkaHostedService.RegisterCentralActors: notification-outbox, audit-log-ingest,
site-call-audit, audit-log-purge, site-audit-reconciliation. There are no
feature/config-gated central singletons today; any future gated singleton is the
"if applicable" case and must NOT be added to the required set.
Leadership-agnostic: the proxy reaches the singleton from either central node, so
a ready standby still reports ready (readiness must not require cluster
leadership — that is the Active tier's job). During a brief singleton handover the
probe may time out and the node flaps to not-ready, which is correct (a node
mid-handover is legitimately not fully ready); no retries, to keep the probe fast.
Tests (TDD): RequiredSingletonsHealthCheckTests exercises the probe against a
TestKit ActorSystem — all proxies present+reachable → Healthy; one missing →
Unhealthy naming it; ActorSystem absent → Unhealthy, no throw. HealthCheckTests
regression-guards the Ready tag + absence of the Active tag on the new check.
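The probe shape (concurrent, bounded per-singleton timeout, never throws, names the unreachable singletons) can be illustrated with a small asyncio sketch. It is not the Akka.NET implementation: the `Probe` callable stands in for "Ask the local ClusterSingletonProxy an Identify", and the function names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

# Hypothetical probe: resolves to the singleton's identity, or None when the
# proxy cannot reach it (the analogue of a null ActorIdentity.Subject).
Probe = Callable[[], Awaitable[Optional[str]]]


async def _probe_one(probe: Probe, timeout: float) -> bool:
    """True only if the probe answers with a non-null subject within the timeout."""
    try:
        subject = await asyncio.wait_for(probe(), timeout)
        return subject is not None
    except Exception:
        # Timeout or any other failure counts as unreachable; never propagate.
        return False


async def check_required_singletons(
    probes: Dict[str, Probe], timeout: float = 2.0
) -> Tuple[bool, List[str]]:
    """Run all probes concurrently; return (healthy, unreachable singleton names)."""
    names = list(probes)
    results = await asyncio.gather(*(_probe_one(probes[n], timeout) for n in names))
    unreachable = [name for name, ok in zip(names, results) if not ok]
    return (not unreachable, unreachable)
```

Because the probes run concurrently, the worst-case check duration stays near one timeout rather than growing with the number of required singletons.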
Object/List parameters and return values were shape-validated only (object vs
array), with no field-level/nested type checks — type-wrong nested data passed
inbound validation and failed only at script runtime. Add recursive type
validation (declared Object field types, List element type, scalars at any depth)
with path-qualified errors, symmetric across ParameterValidator and ReturnValueValidator.
Both validators now parse the canonical JSON Schema definition format (the
Central UI / MigrateParametersToJsonSchema output) via a shared recursive engine,
Commons.Types.InboundApi.InboundApiSchema, instead of the legacy flat
[{name,type}] array which they could not even deserialize from migrated rows.
The legacy flat-array form is still accepted on read for transition safety.
Undeclared fields are rejected at every level (consistent with the existing
top-level unexpected-parameter rejection). A present-but-null value satisfies
any type; only the absence of a required field is an error.
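The rules above (recursive descent, path-qualified errors, undeclared fields rejected at every level, null satisfies any type) can be sketched as follows. This is an illustrative Python version over an assumed JSON Schema subset (`type`, `properties`, `required`, `items`); the real InboundApiSchema engine and its error wording are not shown in this log.

```python
from typing import Any, List

# Scalar checks for the assumed subset. bool is excluded from the numeric
# checks because Python treats bool as a subclass of int.
_SCALARS = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def validate(value: Any, schema: dict, path: str = "$") -> List[str]:
    """Return a list of path-qualified errors; an empty list means valid."""
    if value is None:
        return []  # present-but-null satisfies any declared type
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        props = schema.get("properties", {})
        errors = [f"{path}.{k}: undeclared field" for k in value if k not in props]
        errors += [
            f"{path}.{k}: required field missing"
            for k in schema.get("required", [])
            if k not in value
        ]
        for key, sub in props.items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
        return errors
    if kind == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        item_schema = schema.get("items")
        if item_schema is None:
            return []  # no element type declared: shape check only
        errors = []
        for index, element in enumerate(value):
            errors += validate(element, item_schema, f"{path}[{index}]")
        return errors
    check = _SCALARS.get(kind)
    if check is None:
        return [f"{path}: unknown type {kind!r}"]
    return [] if check(value) else [f"{path}: expected {kind}"]
```

Using one function for both inbound parameters and return values is what keeps the two validators symmetric: only the schema and the root value differ.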
Gitea renders mermaid inline, so the flow/state/hierarchy/DAG diagrams
move to text-in-markdown: auto-layout (removes the manual overlap-prone
draw.io step), diffable source, no committed binaries, and a dark-text
theme so labels stay legible. Keep draw.io PNGs only for the two complex
bespoke diagrams (logical architecture, env2 topology) where pixel
control still wins. All 24 mermaid blocks validated by rendering.
Add explicit dark text color (per-class color + base theme override) to
the store-and-forward mermaid diagram so node/edge labels read clearly
regardless of Gitea's page theme.
Swap the store-and-forward Message Lifecycle PNG embed for an inline
mermaid block to verify whether Gitea renders mermaid in markdown. If it
does, the standard flow/state/hierarchy diagrams can move to inline
mermaid (text-only, auto-layout) instead of draw.io source + PNG.
Replace ASCII-art diagrams across the README and docs/ with editable
.drawio sources plus exported PNGs, so the diagrams render clearly in
rendered markdown and can be maintained/regenerated instead of being
hand-edited as fragile text art. Non-diagram blocks (code, folder
trees, UI wireframes) were left as text.
Renames the 13 SCADALINK_* runtime env vars → SCADABRIDGE_*, the ScadaLink__
.NET config keys → ScadaBridge__, the stale ScadaLink.Host.exe assembly name
→ ZB.MOM.WW.ScadaBridge.Host.exe, the scadalink_app SQL login → scadabridge_app,
and residual identifiers/comments/docs. Migration records (prior rename
tooling/design, DB-rename helper, this scrub script) carved out.
Adds tools/scrub-scadalink-refs.sh.
The native alarms feature merged with 7 component docs updated, but the
spec layer drifted: HighLevelReqs, Commons, and ManagementService had no
native-alarm coverage and the README table flagged it on only one row.
Add HighLevelReqs §3.4.2 (+ validation), document the Commons
types/entities/messages and the 7 ManagementService commands, sync the
README rows + link the TreeView sub-component, fix 2 broken plan links,
and drop the one-off native-alarms RESUME scratchpad.
Expanding a Galaxy object in the tag picker hung on "loading…": the browse
reply inlined every child's full attribute set (~152 KB), exceeding Akka's
128 KB remote frame, and remoting silently discarded the oversized reply.
Browse path (DataConnectionLayer):
- RealMxGatewayClient: navigation now uses BrowseChildren(include_attributes=
false) — child objects only — and an object's own attributes load lazily via
DiscoverHierarchy(root, max_depth=0) when it's expanded. Payload drops from
~152 KB/level to a few KB. Seam contract unchanged.
- DataConnectionActor.CapBrowseChildren: protocol-agnostic byte-budget cap
(~100 KB) on every BrowseNodeResult before it crosses the site→central
frame, OR-ing the adapter's own Truncated flag. Byte budget, not a count —
the only bound that holds regardless of NodeId/attribute-name length.
- RealOpcUaClient: requestedMaxReferencesPerNode 1000 → 500 to narrow the
window before the byte budget applies.
- Graceful gRPC Unimplemented handling → NotSupportedException →
BrowseFailureKind.NotBrowsable with an actionable message (older gateway
builds lacking BrowseChildren).
Picker UI (CentralUI):
- NodeBrowserDialog: modal-lg → modal-xl; new scoped .razor.css caps the tree
at 55vh with its own scrollbar so manual entry + Select/Cancel stay visible.
- Protocol-agnostic failure messages (was hardcoded "OPC UA …"); renamed the
leftover opcua-browser-tree class to node-browser-tree.
Tests: new frame-budget cap test + NotSupported=>NotBrowsable mapping test;
DCL suite 88/88. Doc: Component-DataConnectionLayer.md records the lazy
attribute-light browse and the frame-size guard.
Adds MxGateway under Supported Protocols, an MxGateway Settings config table,
notes IBrowsableDataConnection now backs both protocols via BrowseNodeCommand/
BrowseService, and updates the README component table.
Final themed batch. 5 well-localised correctness fixes.
Serialisation precision:
- ESG-020: DatabaseGateway.JsonElementToParameterValue probes
TryGetInt64 → TryGetDecimal → GetDouble, so a script's high-precision
decimal SQL parameter survives the cached-write retry round-trip
without silent precision loss. 3 new regression tests.
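The probe order in ESG-020 (integer first, then decimal, then floating point) can be mimicked in Python to show why it preserves precision. This is a rough analogue, not the DatabaseGateway code: the range constants imitate .NET's Int64 and decimal limits, and the function name is hypothetical. Unlike .NET decimal, Python's Decimal does not round to 28 or 29 significant digits, so the middle step is only approximated.

```python
from decimal import Decimal, InvalidOperation
from typing import Union

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
DECIMAL_MAX = Decimal("79228162514264337593543950335")  # .NET decimal.MaxValue


def to_parameter_value(token: str) -> Union[int, Decimal, float]:
    """Convert a raw JSON number token using the int64 -> decimal -> double order."""
    # 1. Whole numbers that fit in a signed 64-bit integer.
    try:
        as_int = int(token)
        if INT64_MIN <= as_int <= INT64_MAX:
            return as_int
    except ValueError:
        pass
    # 2. Anything a decimal can hold keeps its digits exactly.
    try:
        as_decimal = Decimal(token)
        if as_decimal.is_finite() and abs(as_decimal) <= DECIMAL_MAX:
            return as_decimal
    except InvalidOperation:
        pass
    # 3. Last resort: binary floating point (may lose precision).
    return float(token)
```

Going straight to a double would turn a token such as `0.1234567890123456789` into the nearest binary float, which is the silent precision loss the fix removes.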
Template engine correctness:
- TE-018: DiffService gains ComputeConnectionsDiff over
FlattenedConfiguration.Connections, mirroring the existing entity-diff
shape and pairing with the Theme 1 TE-017 hash-coverage fix. A
ConfigurationDiff record extension in Commons is flagged as a follow-up.
- TE-019: TemplateResolver.BuildInheritanceChain now walks via the
int? ParentTemplateId directly — only null means "no parent". A real
Id of 0 (the prior special-cased sentinel) now walks the chain like
any other node, matching the TemplateEngine-013 CycleDetector fix.
Regression of TE-013 closed.
- TE-020: All 5 Create* paths in TemplateService + SharedScriptService
re-ordered to save-first → log-with-real-Id → save-audit (matching
the InstanceService pattern). Create* audit rows no longer carry a
literal "0" EntityId.
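The TE-019 point, that only null means "no parent" and an Id of 0 is an ordinary node, is easy to get wrong with a truthiness check. A minimal Python sketch follows, assuming a hypothetical `parent_of` lookup from template Id to nullable parent Id; the cycle guard is added for the example only and is not the CycleDetector mentioned above.

```python
from typing import Dict, List, Optional


def build_inheritance_chain(
    template_id: int, parent_of: Dict[int, Optional[int]]
) -> List[int]:
    """Walk from a template up to its root, child first."""
    chain: List[int] = []
    seen = set()
    current: Optional[int] = template_id
    # Compare against None explicitly: `while current:` would stop at Id 0.
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at template {current}")
        seen.add(current)
        chain.append(current)
        # An Id missing from the lookup is treated as a root (assumption).
        current = parent_of.get(current)
    return chain
```

With `parent_of = {5: 0, 0: 3, 3: None}`, the chain for template 5 passes through 0 and ends at 3; a sentinel-style check would have stopped at 0 and silently dropped template 3's inherited members.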
Doc deferral:
- Transport-012: Component-Transport.md §Audit Trail now spells out that
the BundleImportId repository filter IS wired (in CentralUiRepository),
but the Audit-Log-Viewer UI dropdown + summary-row hyperlink are a
deferred CentralUI follow-up. CLI workaround documented
(audit query --bundle-import-id).
11+ new regression tests (3 ESG, 4 DiffService, 3 TemplateResolver, 4
TemplateService, 1 SharedScriptService). Build clean; ESG 72/72,
TemplateEngine 324/324. README regenerated: 1 pending of 481 total.
Session-to-date: 135 of 136 originally-open Theme findings closed
across 10 themes in 10 commits.
The largest themed batch — small mechanical fixes across 11 modules.
API / message hygiene:
- Comm-020: SiteAddressCacheLoaded now carries IReadOnlyDictionary /
IReadOnlyList — Akka messages must be immutable.
- Commons-016: BundleSession.MaxUnlockAttempts named constant replaces
magic 3.
- Commons-018: IOperationTrackingStore + IPartitionMaintenance moved from
Interfaces/ root to Interfaces/Services/ (namespace preserved — 9
consumers exceeded the in-prompt move threshold).
- Commons-023: TrackingStatusSnapshot.SourceNode now consistent with the
trailing-optional-with-default pattern used elsewhere.
- SR-022: AuditingDbCommand.DbConnection.set no longer uses reflection —
exposes AuditingDbConnection.Inner via internal API surface.
Dead code / config cleanup:
- ClusterInfra-011: decorative SectionName constant deleted.
- ClusterInfra-014: dead AddClusterInfrastructureActors method + its
"throws-when-called" test deleted.
- Host-021: Microsoft Logging:LogLevel block deleted from appsettings.json
(dead under Serilog).
Fail-loud over fail-silent:
- DM-021: ResolveSiteIdentifierAsync throws on missing site (was silently
substituting a DB id).
- DM-022: dropped transient Pending write — record now lands directly in
InProgress (no UI flicker, one fewer DB write).
- Host-020: LoggerConfigurationFactory emits a Console.Error warning when
both Serilog:MinimumLevel and ScadaLink:Logging:MinimumLevel are set
(ScadaLink remains truth per Host-011).
- SnF-022: NotifyCachedCallObserverAsync logs Warning on unparseable
TrackedOperationId (was silently dropping).
- SnF-023: empty siteId default replaced with $unknown-site sentinel
+ constructor normalisation.
Correctness:
- SCA-001: SupervisorStrategy XML rewritten to match actual
DefaultDecider/Restart semantics (was claiming Resume).
- SCA-003: OnUpsertAsync now restamps IngestedAtUtc on every upsert.
- SR-021: HandleDeployArtifacts now dispatches an internal
ApplyArtifactDataConnectionsToDcl message after the SQLite write so
system-wide artifact-deploy data-connection changes go live
immediately (was requiring a site restart).
- SnF-020: RetryParkedMessageAsync captures the parked row BEFORE the
local write so a concurrent delete can't skip standby replication.
Sentinels / naming collisions:
- HM-021: CentralSiteId changed from "central" to "$central"
(cannot collide: a leading $ is forbidden in real SiteIdentifiers).
Doc / surface cleanups:
- SEL-018: FailedWriteCount promoted to ISiteEventLogger; XML softened
to "Available for future Health Monitoring integration".
- SnF-019: VERIFY outcome — documented parking-after-DefaultMaxRetries
in Component-StoreAndForward.md + DefaultMaxRetries XML (uniform
cap; maxRetries:0 is the unbounded escape hatch).
- SnF-021: Component-StoreAndForward.md no longer claims the tracking
table lives in SnF — it's in SiteRuntime, the interface is in Commons.
- CLI-020: bundle export response parse guarded with try/catch on
JsonException / KeyNotFoundException / FormatException — emits a
clean INVALID_RESPONSE exit instead of a stack trace.
Config:
- ClusterInfra-013: intent comment added to "catastrophic config" test.
- Host-016: appsettings.Site.json second CentralContactPoints entry
removed (was pointing at the SITE's own port); doc-key explains how
to extend.
- Host-018: NodeName added to both shipped per-role configs (was
causing SourceNode to be null on audit rows).
UI:
- CentralUI-029: replaced JS.InvokeAsync<int>("eval", …) with an ES
module import (new wwwroot/js/browser-time.js).
- CentralUI-032: AuditResultsGrid gains a Previous button backed by a
cursor stack.
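A cursor stack is a common way to add "Previous" to forward-only keyset pagination. The sketch below shows the idea in Python under stated assumptions: `fetch(cursor)` is a hypothetical callable returning `(rows, next_cursor)`, with `None` meaning "first page" on input and "no more pages" on output. It does not reflect AuditResultsGrid's actual code.

```python
from typing import Any, Callable, List, Optional, Tuple

Fetch = Callable[[Optional[Any]], Tuple[List[Any], Optional[Any]]]


class CursorPager:
    """Forward/backward paging over a forward-only cursor API."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._stack: List[Optional[Any]] = []  # cursors of the pages behind us
        self._current: Optional[Any] = None    # cursor that produced self.rows
        self._next: Optional[Any] = None       # cursor for the following page
        self.rows: List[Any] = []
        self._load(None)

    def _load(self, cursor: Optional[Any]) -> None:
        self.rows, self._next = self._fetch(cursor)
        self._current = cursor

    def next(self) -> bool:
        """Advance one page; False if already on the last page."""
        if self._next is None:
            return False
        self._stack.append(self._current)
        self._load(self._next)
        return True

    def previous(self) -> bool:
        """Go back one page by replaying the cursor popped from the stack."""
        if not self._stack:
            return False
        self._load(self._stack.pop())
        return True
```

Going back re-issues the query for the earlier cursor rather than caching rows, so a previous page reflects any rows that have changed since it was first shown.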
10+ new regression tests across the affected projects. Build clean;
all suites green. README regenerated: 6 open (was 33).
Session-to-date: 130 of 136 originally-open Theme findings closed.
Comm-016: delete dead HandleConnectionStateChanged + _debugSubscriptions /
_inProgressDeployments tracking + ConnectionStateChanged message record.
Disconnect detection is owned by the transport layers (gRPC keepalive PING
~25s; Ask-timeout at CommunicationService). Updates the
Component-Communication.md design doc to make that explicit.
SnF-018: NotificationForwarder.DeliverAsync now discards a corrupt buffered
payload (Warning log + return true) instead of returning false and parking
the row — honoring the design's "notifications do not park" invariant.
DM-018: reconciliation no longer force-sets Enabled, preserving an
intentional Disabled state after central failover.
ESG-018: DeliverBufferedAsync (both ExternalSystemClient + DatabaseGateway)
catches JsonException and returns false, turning a corrupt buffered row
into a parked operation instead of a retry-forever poison message.
InboundAPI-022: register ActiveNodeGate as IActiveNodeGate in the Central
DI branch so standby-node gating is actually wired up in production.
NS-019: remove orphaned NotificationDeliveryService /
INotificationDeliveryService / NotificationResult; central notification
delivery now lives entirely in NotificationOutbox.
SEL-016: normalise From/To filters to UTC before ISO-string compare so
non-UTC DateTimeOffset clients no longer get spuriously excluded events.
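Why the normalisation matters: ISO-8601 strings only sort chronologically when they share one offset. The Python sketch below reproduces the failure mode and the fix, assuming events are stored as UTC strings in a fixed-width format; the stored format and the function names are illustrative, not taken from the event-log code.

```python
from datetime import datetime, timedelta, timezone

_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed storage format, always UTC


def to_utc_iso(ts: datetime) -> str:
    """Render an offset-aware timestamp as a UTC ISO string."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must carry an offset")
    return ts.astimezone(timezone.utc).strftime(_FORMAT)


def in_range(event_iso_utc: str, start: datetime, end: datetime) -> bool:
    """String comparison is safe once both bounds are normalised to UTC."""
    return to_utc_iso(start) <= event_iso_utc <= to_utc_iso(end)


def in_range_naive(event_iso_utc: str, start: datetime, end: datetime) -> bool:
    """The buggy variant: compares the client's local-offset strings as-is."""
    return start.isoformat() <= event_iso_utc <= end.isoformat()
```

For an event stored at 12:00 UTC and a client filter of 13:30 to 14:30 at UTC+02:00 (11:30 to 12:30 UTC), the naive comparison sees "13:30" as later than "12:00" and excludes the event, while the normalised comparison includes it.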
TE-017: include Description on attributes/alarms and a HashableConnections
projection (protocol, endpoint JSON, failover count) in the revision hash
and DiffService; staleness detection now catches description-only and
connection-endpoint edits.
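The staleness fix boils down to "a field that is not in the hashed projection cannot trigger a new revision". A small Python sketch of that idea follows; the field names, the JSON canonicalisation, and the choice of SHA-256 are all assumptions for illustration, since the log does not show the real projection or hash algorithm.

```python
import hashlib
import json


def hashable_projection(config: dict) -> dict:
    """Select only the fields that should affect the revision hash."""
    return {
        "attributes": [
            {"name": a["name"], "value": a.get("value"),
             "description": a.get("description")}
            for a in config.get("attributes", [])
        ],
        "alarms": [
            {"name": a["name"], "description": a.get("description")}
            for a in config.get("alarms", [])
        ],
        "connections": [
            {"protocol": c["protocol"], "endpoint": c["endpoint"],
             "failoverCount": c.get("failoverCount", 0)}
            for c in config.get("connections", [])
        ],
    }


def revision_hash(config: dict) -> str:
    """Hash a canonical (sorted-key, compact) encoding of the projection."""
    canonical = json.dumps(
        hashable_projection(config), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Before a fix of this kind, a description-only edit would leave the projection, and therefore the hash, unchanged, so the deployed instance would never be flagged as stale.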
Transport-001 and Transport-002 (also High) remain Open. They are deferred to
a follow-up batch because both touch BundleImporter.cs, so the two fixes must
be applied one after the other.
Reflect this session's implementation work in the Transport (#24)
component spec:
- New 'CLI' section covering bundle export / preview / import
commands, the base64-over-JSON wire format, the 200 MB request-body
cap, and the 5-minute per-command timeout. Authorization table +
Interactions section updated to mention ManagementActor handlers.
- Import wizard nav placement corrected from Design to Admin (already
the case in code; the spec lagged).
- Blocker-scan heuristic boundaries documented under Import Flow:
the '.' skip, the DataSourceReference exclusion, and the
KnownNonReferenceNames denylist. Both DetectBlockersAsync and
RunSemanticValidationAsync Pass 1 share the filter.
- Adds SourceNode varchar(64) NULL to AuditLog, Notifications, and SiteCalls
tables with role-name semantics: node-a/node-b for site rows (qualified by
SourceSiteId), central-a/central-b for central direct-write rows.
- New IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc) index.
- Reframes CLAUDE.md from documentation-only to implementation project.
- Adds docs/plans/2026-05-23-audit-source-node.md + tasks.json companion.
The M1 implementation (Bundle A) committed concrete AuditChannel /
AuditKind / AuditStatus enums that reflect CLAUDE.md's locked
cached-call lifecycle decisions. The older alog.md and
Component-AuditLog.md narratives still used pre-M1 vocabulary
(Success / TransientFailure / PermanentFailure / Enqueued / Retrying /
SyncCall / CachedEnqueued / Attempt / Terminal / Completed). This
commit reconciles both docs to the M1 vocabulary:
AuditChannel : ApiOutbound, DbOutbound, Notification, ApiInbound
AuditKind (10): ApiCall, ApiCallCached, DbWrite, DbWriteCached,
NotifySend, NotifyDeliver, InboundRequest,
InboundAuthFailure, CachedSubmit, CachedResolve
AuditStatus(8): Submitted, Forwarded, Attempted, Delivered, Failed,
Parked, Discarded, Skipped
Updates:
- Status column description + worked examples use the new 8 values.
- Kind table flattened from per-channel groupings to a single flat
list of the 10 discriminators (no more SyncCall / Cached* /
Attempt / Terminal / Completed).
- Cached-call lifecycle examples rewritten to the
CachedSubmit -> Forwarded -> Attempted... -> CachedResolve shape.
- Notification lifecycle examples rewritten to
NotifySend(Submitted) -> NotifyDeliver(Attempted) ->
NotifyDeliver(Delivered/Parked/Discarded).
- Inbound API examples split into InboundRequest (success path) and
InboundAuthFailure (401 path).
- 'Errors only' UI toggle, audit-error-rate KPI, and payload-cap
decision (#6 in §16) all switched from 'non-Success' to
Status IN ('Failed', 'Parked', 'Discarded').
- Per-site event-rate table in §13.1 renamed to the new kinds.
Pure design correction; no operational behavior change. Per the
goal-prompt invariant #6, alog.md may change when a design correction
is committed before the affected code change — this commit is that
correction, landed ahead of the M1 merge so the merge order reads
design-first, code-second.
No code, test, or infra file changes.
Final cross-bundle reviewer identified 7 inconsistencies that the per-bundle
reviewers couldn't see; all fixed in one logical commit.
Critical:
- HighLevelReqs AL-3: drop 'then upsert-on-newer-status' — AuditLog is
strictly append-only (correct for SiteCalls/Notifications, wrong for
the immutable AuditLog shadow).
- Component-AuditLog Error rate KPI: align with HealthMonitoring's
exclusion list (Success/Delivered/Enqueued) rather than just non-Success;
otherwise every Delivered notification or Enqueued cached call would be
counted as an error.
Important:
- Component-AuditLog line 154: ISiteAuditWriter -> IAuditWriter (canonical
name per Commons and the rest of this doc).
- Component-AuditLog Central direct-write paragraph: convert remaining
slash notation (ApiInbound/Completed, Notification/Attempt,
Notification/Terminal) to dot notation used everywhere else.
- Component-ClusterInfrastructure: scope SiteCallAuditActor to
reconciliation + KPIs + Retry/Discard relay; cached-telemetry ingest is
AuditLogIngestActor's role per Combined Telemetry contract.
- Component-CentralUI Audit Log page: state the OperationalAudit read
permission and the read-vs-export split (matching CLI doc).
- Component-NotificationOutbox: add never-fail-the-action invariant for
dispatcher audit writes.
Minor:
- Component-InboundAPI: 'Non-blocking semantics' was ambiguous (could be
read as async); reword to 'Fail-soft' — the write is still synchronous
before flush, but failures are caught and don't change the response.
- Component-CLI: realign audit-query/audit-export flags to actually match
the Central UI Audit Log filter set (channel, kind, status, site,
instance, target, actor, correlation-id, errors-only); drop --user and
--entity-id which are IAuditService concepts, not Audit Log columns.
- Component-AuditLog KPI tile names: 'Volume/Error rate/Backlog' ->
'Audit volume/Audit error rate/Audit backlog' (matches Central UI and
Health Monitoring); drop the two orphan KPIs (Top inbound callers, Top
outbound 5xx) that were never surfaced anywhere.
- Component-AuditLog Interactions: re-attribute DbOutbound emissions to
ESG (where Database.* lives) with a note that Site Runtime is the API
surface for scripts.
- HighLevelReqs AL-12: drop 'and reconciliation operations' (CLI has no
reconcile command; reconciliation is an internal self-healing pull).
Add note that verify-chain becomes operational once AL-11's hash chain
ships.
Task 10's reviewer noted that Component-CentralUI.md renamed the
IAuditService page from 'Audit Log Viewer' to 'Configuration Audit Log
Viewer' to avoid collision with the new operational Audit Log page (#23).
Two stale lowercased refs in Component-ConfigurationDatabase.md needed
the same disambiguation.
Bundle D code-review feedback on 0ae1a25 and e6f7a7f:
- Audit error rate (HealthMonitoring tile) was described as a combined
view of CentralAuditWriteFailures + AuditRedactionFailure (writer
health). Per alog.md §10.3 / §14.1 it is the operational error rate
of audited operations: % of central AuditLog rows with Status not
in (Success/Delivered/Enqueued) over a rolling 5-min window. Audit
writer issues surface separately via the dedicated metrics.
- Audit volume description gains the spec-mandated 'events/min, global
+ per-site sparkline' shape.
- CLI: scadalink audit was described as requiring both OperationalAudit and
AuditExport for all three subcommands. Per alog.md §11.2 / §15.1, read
(query, verify-chain) needs OperationalAudit; bulk export
additionally requires AuditExport. Restored the spec's split.
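The corrected definition of the audit error rate (share of rows in a rolling 5-minute window whose Status is outside the Success/Delivered/Enqueued set) is simple enough to state as code. The Python sketch below assumes rows arrive as `(occurred_at_utc, status)` pairs and that an empty window reports 0%; both are choices made for the example, not details from alog.md.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple

# Statuses that do NOT count as errors, per the exclusion list above.
NON_ERROR_STATUSES = frozenset({"Success", "Delivered", "Enqueued"})


def audit_error_rate(
    rows: Iterable[Tuple[datetime, str]],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
) -> float:
    """Percentage of rows inside (now - window, now] with an error status."""
    recent = [status for occurred, status in rows if now - window < occurred <= now]
    if not recent:
        return 0.0  # assumption: an empty window reports 0%, not "unknown"
    errors = sum(1 for status in recent if status not in NON_ERROR_STATUSES)
    return 100.0 * errors / len(recent)
```

Writer-health problems (failed audit writes, redaction failures) never appear as rows here, which is why they are tracked by their own dedicated metrics instead of this tile.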
Reviewer flag on 1bbfad3: "per Component-AuditLog.md, §6.2" pointed at
alog.md numbering, not at any anchor in Component-AuditLog.md (which uses
prose subsection titles). Switch to the prose anchor (Ingestion Paths →
Telemetry forward) so the link resolves.