Add the system-completion design doc (risk-first milestones M1-M10): Phase 1 Stabilize (M1 runtime wiring, M2 correctness, M3 script trust boundary, M4 doc reconciliation) then Phase 2 Expand (M5-M10 feature epics). Scope = all Tier 1/2/4 + in-scope Tier 3 features; T12/T19 deferred to own brainstorm; deliberate anti-goals excluded. Also commit the source audit (stillpending.md).
ScadaBridge — Pending / Deferred / Partial / Missing Functionality Audit
Date: 2026-06-15
Scope: Full system — design specs (docs/requirements/), all of src/, the Central UI / CLI / Management Service surfaces, and the plan/checklist archive (docs/plans/).
Method: Five parallel read-only investigators, each verifying doc claims against actual code (file:line evidence). Top findings were independently corroborated by 2+ agents.
Executive summary
The codebase is unusually clean: zero real TODO/FIXME markers in src/, and all 11 implementation phases self-report complete. Consequently the unfinished work does not announce itself; it hides in four forms:
- Silent gaps (Tier 1) — documented as working, not marked deferred, but absent or inert in production.
- Partial / behaviorally-divergent functionality (Tier 2) — real, but narrower or different from the spec.
- Intentional deferrals (Tier 3) — knowingly punted, correctly documented, with extensible seams. Not defects.
- Doc↔code drift (Tier 4) — code is fine; the specs describe a superseded architecture.
Actionable risk is concentrated in Tier 1. Recommended starting point: #3 / #4 (wire the two never-started audit actors) — highest impact, smallest blast radius.
Tier 1 — Silent gaps: documented as working, but not actually running
These are the dangerous ones. Specs present them as live behavior, they are not marked deferred, yet the functionality is absent or inert.
| # | Gap | Where | Impact |
|---|---|---|---|
| 1 | Script "test-compilation" does no real compilation. The headline pre-deploy gate ("scripts must compile without errors") is a brace-balance + forbidden-API substring scan. No Roslyn reference in the project. | TemplateEngine/Validation/ScriptCompiler.cs:56-104; used by ValidationService.cs:128 | Semantically-broken C# passes validation and deploys. Found independently by 2 agents. |
| 2 | Script forbidden-API gate is bypassable. The script trust model's only design-time enforcement is the same substring scan, defeated by aliases / using static / global::. Self-documented as "SECURITY LIMITATION (TemplateEngine-006)". | ScriptCompiler.cs:14-22,61-72; ValidationService.cs:346 | Security boundary is advisory only. 2 agents. |
| 3 | Audit Log 365-day retention purge never starts. AuditLogPurgeActor exists but has zero ActorOf/Props.Create callers; only the roll-forward partition service runs. | AuditLog/.../AuditLogPurgeActor.cs:58; AkkaHostedService.cs:486 ("wired in a later bundle") | The documented purge does not run in production; AuditLog grows unbounded. |
| 4 | Audit Log reconciliation self-heal never wired. IPullAuditEventsClient has no implementation; SiteAuditReconciliationActor is never instantiated. | IPullAuditEventsClient.cs:31; SiteAuditReconciliationActor.cs:68 | The documented "lost-telemetry fallback" doesn't exist; forward telemetry is the only path. |
| 5 | Site Call Audit reconciliation pull + daily purge both missing. The actor's own docstring admits it. PurgeTerminalAsync is implemented but never invoked. | SiteCallAuditActor.cs:28-34; SiteCallAuditRepository.cs:213 | SiteCalls mirror has no self-heal and grows unbounded. |
| 6 | Site Event Logging emits only 2 of 7 documented categories. Only connection + script (error-path only). Alarm, Deployment, Store-and-Forward, Instance-Lifecycle, Notification events are never logged; ISiteEventLogger isn't injected into those subsystems. | DataConnectionActor.cs, ScriptExecutionActor.cs (only emitters); spec "Events Logged" §20-28 | Operational event log is materially incomplete vs spec. |
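To make gap #2 concrete, here is a minimal sketch of why a substring deny-list cannot serve as a trust boundary. It is written in Python purely for illustration (the audited code is C#), and the deny-list entries and the `substring_scan` helper are hypothetical; the actual list in ScriptCompiler.cs is not reproduced in this audit.

```python
# Hypothetical deny-list; the real entries in ScriptCompiler.cs are not shown in this audit.
FORBIDDEN_SUBSTRINGS = ["System.IO.File", "Process.Start"]


def substring_scan(script: str) -> list[str]:
    """Return every forbidden substring that appears verbatim in the script text."""
    return [s for s in FORBIDDEN_SUBSTRINGS if s in script]


# Direct use of the forbidden API: the scan catches it.
direct = 'System.IO.File.Delete("x");'

# Namespace alias: the text "System.IO.File" never appears, yet the same call runs.
aliased = 'using IO = System.IO;\nIO.File.Delete("x");'

# using static: the member is called without its type name in front.
static_import = 'using static System.Diagnostics.Process;\nStart("cmd.exe");'

for name, script in [("direct", direct), ("aliased", aliased), ("static", static_import)]:
    print(name, substring_scan(script))
```

Only the first script is flagged. A check that resolves symbols after parsing, such as the Roslyn gate discussed under the recommended next steps, would see all three as the same API call, which is why the audit treats the current scan as advisory.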
Tier 2 — Partial / behaviorally-divergent functionality
Real, but narrower than the spec — wrong in a way that could surprise an operator or script author.
| # | Gap | Where |
|---|---|---|
| 7 | Database.CachedWrite misclassifies permanent SQL errors as transient → retries forever instead of failing fast to the script. The API path does it right; the DB path does not. No immediate attempt, no synchronous permanent-Failed return. | DatabaseGateway.cs:78-204 (cf. ExternalSystemClient.cs:100-161) |
| 8 | Alarm conditionFilter is plumbed end-to-end but applied nowhere: set a filter on a native-alarm source and it silently mirrors all conditions. | DataConnectionActor.cs:1482,1540-1554; RealOpcUaClient.cs:242,295; MxGatewayDataConnection.cs:154-167 |
| 9 | Per-script execution timeout doesn't exist: spec promises per-script; only a global ScriptExecutionTimeoutSeconds. No field in the template/flattened model to carry it. | SiteRuntimeOptions.cs:31; ScriptExecutionActor.cs:100; AlarmExecutionActor.cs:66 |
| 10 | Connection-level diffs never surface in the deployment diff: ComputeConnectionsDiff is dead code (no callers); ConfigurationDiff has no slot for it. Per-attribute binding drift is caught; standalone connection endpoint (protocol/config/failover) diff is not. | DiffService.cs:158-204; Commons/Types/Flattening/ConfigurationDiff.cs:7-24 |
| 11 | Inbound API auth transport drift: code uses Authorization: Bearer sbk_<keyId>_<secret>; doc says X-API-Key header. | InboundAPI/.../EndpointExtensions.cs:83-90 |
| 12 | Inbound API audit write is fire-and-forget after response flush; doc says synchronous before flush. Row is still emitted (fail-soft), just non-blocking and after the body is forwarded. | AuditWriteMiddleware.cs:195-212,281-290 |
| 13 | Inbound Object/List extended types are shape-validated only: no nested/field-level type validation, despite spec implying typed/nested validation. | ParameterValidator.cs:109-145; ReturnValueValidator.cs:18 |
| 14 | JWT-in-cookie session design not implemented: /auth/login signs a plain ClaimsPrincipal; GenerateToken only used by the CLI /auth/token path; ValidateToken has no external callers. | AuthEndpoints.cs:38,75-112,152; ServiceCollectionExtensions.cs:99-118 |
| 15 | "Re-query LDAP every 15 min / roles never >15 min stale" not implemented for interactive sessions: JwtTokenService.RefreshToken/RecordActivity/ShouldRefresh/IsIdleTimedOut have zero call sites; roles fixed until cookie expiry. The 15-min sliding + 30-min idle layers are collapsed into a single 30-min sliding cookie window. | JwtTokenService.* (no callers); ServiceCollectionExtensions.cs:99-148 |
| 16 | Transport stale-instance enumeration always returns empty: BundleImporter returns Array.Empty<int>(); UI shows a generic warning with no count, link not filtered to stale instances. | BundleImporter.cs:733; TransportImport.razor:347-388 |
| 17 | MachineDataDb fail-fast requirement not enforced: spec (REQ-HOST-3/4) requires central nodes to validate a non-empty MachineDataDb connection string. DatabaseOptions has only ConfigurationDb/SiteDbPath; validator never checks it; 0 grep hits in src/. Key lives only in docker appsettings as dead config. | DatabaseOptions.cs:6-12; StartupValidator.cs:60-61 |
| 18 | CI grep-guard against UPDATE/DELETE … AuditLog not in the repo: spec claims a build-time grep that fails on data-layer mutations. DB-role DENY enforcement is present in migrations (so this is a backstop, not the only control), but the claimed code-level guard is absent. | spec Component-AuditLog.md:335-336, Component-ConfigurationDatabase.md:297 |
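For gap #18, the missing guard is small enough to sketch. The following is one possible shape for a build-time check that fails when source text contains an UPDATE or DELETE against AuditLog. It is an illustration in Python, not the project's tooling: the `MUTATION` pattern, the `find_violations` helper, and the sample file names are all assumptions, and the spec's exact matching rules are not quoted in this audit.

```python
import re

# Hypothetical pattern: an UPDATE or DELETE FROM statement whose target table is
# exactly AuditLog, optionally schema-qualified and/or bracket-quoted.
MUTATION = re.compile(
    r"\b(UPDATE|DELETE\s+FROM)\s+(\[?dbo\]?\.)?\[?AuditLog\]?\b",
    re.IGNORECASE,
)


def find_violations(files: dict[str, str]) -> list[tuple[str, int]]:
    """Return (path, line number) for every line that mutates AuditLog."""
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if MUTATION.search(line):
                hits.append((path, lineno))
    return hits


# In-memory stand-ins for source files; a CI step would read them from disk.
sample = {
    "Repo.cs": (
        'var a = "INSERT INTO AuditLog (Id) VALUES (@id)";\n'
        'var b = "DELETE FROM [dbo].[AuditLog] WHERE Id = @id";'
    ),
    "Ok.cs": 'var c = "UPDATE SiteCalls SET Status = 1";',
}

violations = find_violations(sample)
print(violations)
```

A CI wrapper would exit non-zero when `violations` is non-empty. As the table notes, the DB-role DENY remains the primary control, so a text-level check like this is only a backstop and will miss SQL assembled at runtime.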
Lower-severity Tier-2 / behavioral notes
| # | Gap | Where |
|---|---|---|
| 19 | Script "started"/"completed" events not logged (only failures, severity Error). | ScriptExecutionActor.cs:239,256; ScriptActor.cs:369 |
| 20 | Return-type compatibility check is dead scaffolding: BuildReturnMap builds maps never read; no return-type comparison runs. | SemanticValidator.cs:62-63,279-287 |
| 21 | Argument type compatibility not checked; only arg count (comma counting). | SemanticValidator.cs:251-266,390-425 |
| 22 | Native-alarm-source connection-capability validation never runs in deploy pipeline: no production caller supplies the alarmCapableConnectionNames param. | SemanticValidator.cs:30-33,239-245; FlatteningPipeline.cs:93,115 |
| 23 | Connection-binding completeness is a non-blocking Warning, not deploy-gating Error; "name exists at site" half missing. | ValidationService.cs:504-519; ValidationResult.cs:9 |
| 24 | Debug snapshot/subscribe for unknown instance returns empty snapshot, not error; caller can't distinguish "not deployed" from "deployed but empty." | DeploymentManagerActor.cs:845-866 |
| 25 | Recursion-limit error logged to .NET ILogger, not the site event log as spec requires. | ScriptRuntimeContext.cs:302-305,464-466 |
| 26 | Debug-stream snapshot/stream ordering reversed; no timestamp-dedup replay. PreStart sends snapshot first, opens stream after; gap-window events lost (spec wants stream-first + replay/dedup). | DebugStreamBridgeActor.cs:89-103,163-166 |
| 27 | OPC UA native-alarm transition leaves several display fields empty (Category/Description/OperatorUser/OriginalRaiseTime/CurrentValue/LimitValue), partly by design. | RealOpcUaClient.cs:395-403; MxGatewayAlarmMapper.cs:79-113 |
| 28 | Readiness gate omits "required cluster singletons running" criterion; covers membership + DB connectivity only (softened by spec's "(if applicable)"). | Program.cs:188-201,314-317; AkkaClusterHealthCheck.cs:54 |
| 29 | SiteEventLog active-node purge gate never registered: SiteEventLogActiveNodeCheck not added to DI; purge defaults to () => true, runs on standby too (harmless, but documented restriction unenforced). | SiteEventLogging/ServiceCollectionExtensions.cs:33-37; EventLogPurgeService.cs:61 |
| 30 | FailedWriteCount metric exposed "for future Health Monitoring" but never consumed; dangling metric. | ISiteEventLogger.cs:32-40 |
| 31 | StateTransitionValidator allows Delete from NotDeployed; spec matrix says No (deliberate per code comment, contradicts doc). | StateTransitionValidator.cs:38-39 |
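The ordering the spec asks for in gap #26 (open the stream first, then take the snapshot, then replay buffered events with timestamp dedup) can be sketched as follows. This is an assumption-laden illustration in Python rather than the actor code: the `Event` shape, the integer timestamps, and `merge_snapshot_and_stream` are hypothetical names, and the real bridge would do this inside DebugStreamBridgeActor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int  # hypothetical monotonically increasing tick
    tag: str
    value: float


def merge_snapshot_and_stream(
    snapshot: dict[str, float], snapshot_ts: int, buffered: list[Event]
) -> dict[str, float]:
    """Replay events buffered since the stream opened on top of the snapshot.

    Events at or before snapshot_ts are already reflected in the snapshot and
    are dropped as duplicates; newer ones are applied in timestamp order.
    """
    state = dict(snapshot)
    for ev in sorted(buffered, key=lambda e: e.timestamp):
        if ev.timestamp <= snapshot_ts:
            continue
        state[ev.tag] = ev.value
    return state


# Stream opened first, so both events were buffered; the snapshot was taken at tick 6.
snapshot = {"A": 1.0, "B": 2.0}
buffered = [Event(7, "B", 3.0), Event(5, "A", 1.0)]
result = merge_snapshot_and_stream(snapshot, 6, buffered)
print(result)
```

With the current snapshot-first ordering, the tick-7 update to B in this example would fall into the gap window if it arrived before the stream was opened. Buffering from a stream opened earlier closes that window, at the cost of needing a comparable timestamp on both the snapshot and the events.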
Tier 3 — Intentional deferrals (correctly documented — NOT defects)
Knowingly punted, with extensible seams and explicit doc notes. [PERM] = permanent / v-next; [SLICE] = deferred-to-a-later-slice with seam present.
Centralized Audit Log (#23)
- [PERM] Hash-chain tamper evidence (v1.x). verify-chain CLI is a no-op stub that prints "not enabled in this release". (AuditCommands.cs:243-246; AuditVerifyChainHelpers.cs:6-8)
- [PERM] Parquet export/archival. Server returns HTTP 501; CSV + JSONL implemented. (AuditEndpoints.cs:188-194; AuditExportHelpers.cs:139-148)
- [PERM] Per-channel retention overrides. (2026-05-20-audit-log-code-roadmap.md:16)
- [PERM] Tag-cascade for ParentExecutionId: only the inbound-API→routed-site bridge is built; trigger-driven runs pass parentExecutionId = null. (ScriptActor.cs:404,429; 2026-05-21-audit-parent-executionid-design.md:209)
- [PERM] ExecutionId/ParentExecutionId backfill on historical rows; SourceNode backfill on legacy rows; per-node stuck-count KPIs.
- [PERM] Structured/response-header response capture; inbound request-header capture; per-method opt-out; AuditInboundCeilingHits metric. (2026-05-23-inbound-api-full-response-audit-design.md:113-127)
- Uncertain: CLI audit tree command (doc "maybe", not found in CLI).
Notifications (#8 / #21)
- [SLICE] Teams (and all non-Email) notification types: INotificationDeliveryAdapter seam exists, only EmailNotificationDeliveryAdapter implemented; NotificationType enum is Email-only. Missing-adapter path parks gracefully. (NotificationType.cs:6-9; NotificationOutboxActor.cs:457-474)
- [SLICE] Central UI notification-list form has no Type selector (Email hard-coded). (NotificationListForm.razor)
- [PERM] Historical/trend KPI charts (no time-series store).
Native Alarms / MxGateway / OPC UA
- [PERM] Native-alarm ack/shelve/suppress write-back; central alarm tables/history/journal; alarm-driven notifications/scripts: read-only by design. (2026-05-29-native-alarms-design.md:201-206)
- [SLICE] Dedicated operator Alarm Summary page (DebugView only for now).
- [PERM] MxGateway secured writes (operator+verifier).
- [SLICE] OPC UA address-space search; BrowseNext paging. (RealOpcUaClient.cs:574)
- [PERM] OPC UA type-info surfacing; bulk override import/CSV.
- [SLICE] OPC UA "Verify endpoint" connectivity button; cert-management UI.
Transport (#24)
- [PERM] Site-scoped / instance-scoped artifact transport (needs name-mapping subsystem).
- [PERM] Direct cluster-to-cluster pull; asymmetric bundle signing; differential/incremental bundles.
- [PERM/SLICE] Per-line/Myers diff for Modified artifacts (coarse line-count delta only). (ArtifactDiff.cs:18-25)
TreeView
- [SLICE/PERM] R6 lazy-loading, R7 keyboard nav, R16 multi-select: spec marks all "(Deferred)". (Component-TreeView.md:87-93,288-295)
Templates / Data Connections / Triggers UI
- [SLICE] Template tree search/filter; [PERM] folder drag-drop, sibling reorder, root context menu.
- [PERM] Move data connection between sites; [SLICE] connection live-status indicators (blocked on DCL state surfacing).
- [SLICE] Base-template versioning "update-derived" flow; multiple inheritance levels; [PERM] promote-derived-to-base, cross-tenant libraries.
- [SLICE] Strict expression-trigger analysis kind; [PERM] WhileTrue trigger mode for alarms.
- [SLICE] Schema-driven value-entry forms; schema hover/completion; [PERM] JSON Schema $ref reuse / template-level schema library.
Cached-call tracking (#6 / #22)
- [SLICE] CLI surface for site-local Retry/Discard of cached calls; [PERM] unified notifications+site-calls outbox page.
UI audit backlog (2026-05-12-ui-audit.md:536-554)
- IDialogService modal abstraction; design-tokens/CSS-vars; dark-mode/theming; shared pagination+filter component; accessibility pass; replacing SignalR debug-view streaming.
Environment / tooling
- [PERM] True air-gapped second environment (env2 shares MSSQL/LDAP/SMTP); 3rd/4th env; --env flag on deploy.sh.
- [PERM] Repo/folder rename (kept as ScadaBridge to preserve context).
- [SLICE] Playwright alarm-override UI coverage.
Tier 4 — Doc↔code drift (code is fine; docs describe a superseded architecture)
Worth fixing for anyone relying on the docs as the spec.
Config DB / Commons re-architecture not reflected in specs (High doc-impact):
- AuditLog table collapsed to 10 canonical + DetailsJson + 6 PERSISTED JSON_VALUE computed cols; doc still lists ~24 typed columns (Kind, HttpStatus, RequestSummary, …). (migration 20260602174346_CollapseAuditLogToCanonical.cs; Entities/AuditLogRow.cs:54-136)
- AuditEvent moved out of Commons into the external ZB.MOM.WW.Audit NuGet package; doc (REQ-COM-1/3/5b) still describes it as a Commons type. (Commons.csproj:11)
- ApiKey entity / API-key persistence retired to shared ZB.MOM.WW.Auth.ApiKeys SQLite store; doc still lists ApiKey + ApprovedApiKeyIds. (migration 20260602092753_RetireInboundApiKeyStore.cs)
CLI docs drift (README is the stale doc; Component-CLI.md mostly matches code):
- Entire bundle (Transport #24) command group is shipped + registered but documented in neither Component-CLI.md nor CLI/README.md. (Program.cs:36; BundleCommands.cs:24-372)
- security api-key create requires undocumented --methods (Required); docs show only --name. (SecurityCommands.cs:41-45)
- security api-key update/delete use --key-id; docs document --id (and an unwired --name on update). (SecurityCommands.cs:60,71)
- security api-key set-methods subcommand exists in code, documented nowhere. (SecurityCommands.cs:91-102)
- api-method create uses required --script; docs document --code + --description (neither exists). README is internally inconsistent (create=--code, update=--script). (ApiMethodCommands.cs:57-62)
- db-connection create/update documented with --provider; code has no such option. (DbConnectionCommands.cs:56-72)
- Widespread README option-name drift where Component-CLI.md already matches code (scope-rule --mapping-id, health --site/--keyword, template attribute --value/--data-source, template alarm --trigger-type/--priority/--trigger-config, composition delete --id, etc.).
- audit query doc lists --page (code is keyset-only --all); undocumented --execution-id/--parent-execution-id filters exist.
Stale "deferred" markers for things that have actually SHIPPED:
- Transport CLI (bundle export/preview/import): design doc §13 said "deferred"; now implemented.
- SourceNode capture: .tasks.json shows all 21 tasks "pending"; fully implemented across Commons/AuditLog/NotificationOutbox/SiteCallOperational.
- Site Call Audit Retry/Discard relay: DI comment says deferred; implemented + wired (SiteCallAuditActor.cs:150-156,450-505; AkkaHostedService.cs:580-589).
- Bundle-import audit filter UI (Transport-012): doc says deferred follow-up; shipped (ConfigurationAuditLog.razor ?bundleImportId= filter).
- Redaction/payload-cap "deferred to M5" comments in Site Runtime: already shipped (ScadaBridgeAuditRedactor, AuditLogOptions.DefaultCapBytes/ErrorCapBytes).
- AuditLogPage.HandleRowSelected class comment says "no-op seam"; method is fully wired (opens drawer).
Other doc/spec inconsistencies (code richer/different than doc):
- Security role names: doc says Admin/Design/Deployment; code uses Administrator/Designer/Deployer/Viewer (canonicalized via migration).
- SiteCall entity field names diverge from doc (Channel not Kind, SourceSite not SourceSiteId, adds HttpStatus/IngestedAtUtc).
- ExecuteReader audited as DbWrite (read/write distinguished via Extra JSON op, not a distinct AuditKind).
- Inbound audit doc references ApiInbound.Completed; actual kinds are InboundRequest/InboundAuthFailure.
- Teams claimed present in NotificationType enum by Commons/ConfigDB docs; enum is Email-only.
- Commons under-documents shipped code: MxGateway endpoint serializer/validator/config, Observability/ScadaBridgeTelemetry.cs, IInboundApiKeyAdmin, IAuditActorAccessor; none in the doc folder map.
- IHealthMonitoringRepository listed in ConfigDB repo table but doesn't exist (doc annotated "future").
- requirements-traceability.md and many .md.tasks.json show "Pending" for shipped features; they track plan generation, not implementation, and are unreliable as a status source.
- ExternalSystemForm "Recent audit activity" drill-in omits channel=ApiOutbound and uses exact-match target instead of starts-with (sibling ApiKeyForm link is correct). (ExternalSystemForm.razor:20-24)
Code-level sweep — investigated and ruled out (false positives)
For completeness, items that look unfinished but are intentional:
- ~44 empty catch blocks: all have explanatory comments / intentional fallback (JSON parse → default; disposal-race ObjectDisposedException). None silently swallow real errors.
- SiteNotificationRepository/SiteExternalSystemRepository write methods throw NotSupportedException: by design (site config is read-only, managed via central deployment).
- StubOpcUaClient (canned data; BrowseChildrenAsync throws NotImplementedException): dev/test-only; production wires RealOpcUaClientFactory/RealMxGatewayClientFactory (DataConnectionFactory.cs:38-47).
- NoOpSiteStreamAuditClient, SandboxNotifyHelper, sandbox host fakes: legitimate DI-default / test composition seams.
- AddSecurityActors/AddTemplateEngineActors "Phase 0 placeholder" registrations: intentional empty seams (actor wiring lives in Host).
- Migration Down() NotSupportedException, MxGateway/Bundle version-rejection NotSupportedException, AuditWriteMiddleware write-only-stream NotSupportedException: intentional guards.
- Management Service: 113 handlers; all wire-registered Mgmt* commands dispatched. The three "unhandled" (ResolveRolesCommand retired; BrowseNodeCommand/ReadTagValuesCommand routed direct-to-site) are intentional.
- Central UI: no stub/placeholder pages, no NotImplementedException, no "coming soon" banners, no no-op @onclick. disabled=/placeholder= usages are legitimate (loading guards, edit locks, HTML hints).
Phase completeness (self-reported)
All 11 phases report Complete with passing verification gates:
- Phase 0 Solution Skeleton — Complete (gate 11/11, 57 tests).
- Phase 1 Central Foundations — Complete (gate 20/20, 186 pass + 1 live-LDAP skip).
- Phase 2 Modeling & Validation — Complete (gate 9/9, 359 tests).
- Phase 3A Runtime Foundation — Complete (gate 13/13, 389/389).
- Phase 3B Site I/O & Observability — Complete (gate 11/11, 541 cumulative).
- Phase 3C Deployment & Store-Forward — Complete (terse checklist).
- Phase 4 Operator UI — Complete (terse checklist).
- Phase 5 Authoring UI — Complete (terse checklist).
- Phase 6 Deployment & Ops UI — Complete (terse checklist; Codex external-review step skipped, best-effort).
- Phase 7 Integrations — Complete (terse checklist; Q12 SMTP-OAuth2 is a test-env dependency).
- Phase 8 Production Readiness — Complete (terse checklist).
The ~665 unchecked - [ ] items in phase plan docs are spec-traceability references (each dispositioned Pass / Out-of-scope in Forward/Reverse tables), a documentation style — not a TODO list.
Operational (not code): docs/deployment/production-checklist.md has ~60 unchecked install-time operator steps (env vars, connection strings, firewall ports 8081/636/587/1433, TLS certs, smoke tests).
Confidence & caveats
- High confidence on Tier 1 — each item verified by reading the code (class/interface existence + absence of callers via grep); top items corroborated by 2+ independent agents.
- Terse Phase 3C–8 checklists self-report "Complete / tests passing" with no per-gate breakdown; test counts for those phases were not independently re-run.
- Actual src/ artifacts were treated as truth over .tasks.json status fields, which are demonstrably stale.
- Items marked Uncertain (e.g. audit tree CLI, per-channel retention) rest on doc text only.
Recommended next steps
- Wire the two never-started audit actors (#3, #4): highest impact, smallest blast radius (DI/Host wiring + IPullAuditEventsClient impl).
- Site Call Audit reconciliation + purge (#5): same shape as #3/#4.
- Decide on script compilation/security (#1, #2): either implement the Roslyn gate or downgrade the spec's claims; currently the strongest functional + security gap.
- Site Event Logging categories (#6): inject ISiteEventLogger into the 5 missing subsystems.
- Reconcile Tier-4 doc drift: update Config DB / Commons specs for the audit/auth re-architecture and the CLI docs for the bundle group + option names.