Joseph Doherty f4707745bf docs(plans): completion roadmap for stillpending.md audit
Add the system-completion design doc (risk-first milestones M1-M10):
Phase 1 Stabilize (M1 runtime wiring, M2 correctness, M3 script trust
boundary, M4 doc reconciliation) then Phase 2 Expand (M5-M10 feature
epics). Scope = all Tier 1/2/4 + in-scope Tier 3 features; T12/T19
deferred to own brainstorm; deliberate anti-goals excluded. Also commit
the source audit (stillpending.md).
2026-06-15 09:27:00 -04:00


ScadaBridge — Pending / Deferred / Partial / Missing Functionality Audit

Date: 2026-06-15
Scope: Full system — design specs (docs/requirements/), all of src/, the Central UI / CLI / Management Service surfaces, and the plan/checklist archive (docs/plans/).
Method: Five parallel read-only investigators, each verifying doc claims against actual code (file:line evidence). Top findings were independently corroborated by 2+ agents.

Executive summary

The codebase is unusually clean: zero real TODO/FIXME markers in src/, and all 11 implementation phases self-report complete. Consequently the unfinished work does not announce itself; it hides in four forms:

  1. Silent gaps (Tier 1) — documented as working, not marked deferred, but absent or inert in production.
  2. Partial / behaviorally-divergent functionality (Tier 2) — real, but narrower or different from the spec.
  3. Intentional deferrals (Tier 3) — knowingly punted, correctly documented, with extensible seams. Not defects.
  4. Doc↔code drift (Tier 4) — code is fine; the specs describe a superseded architecture.

Actionable risk is concentrated in Tier 1. Recommended starting point: #3 / #4 (wire the two never-started audit actors) — highest impact, smallest blast radius.


Tier 1 — Silent gaps: documented as working, but not actually running

These are the dangerous ones. Specs present them as live behavior, they are not marked deferred, yet the functionality is absent or inert.

| # | Gap | Where | Impact |
|---|-----|-------|--------|
| 1 | Script "test-compilation" does no real compilation. The headline pre-deploy gate ("scripts must compile without errors") is a brace-balance + forbidden-API substring scan. No Roslyn reference in the project. | TemplateEngine/Validation/ScriptCompiler.cs:56-104; used by ValidationService.cs:128 | Semantically-broken C# passes validation and deploys. Found independently by 2 agents. |
| 2 | Script forbidden-API gate is bypassable. The script trust model's only design-time enforcement is the same substring scan — defeated by aliases / using static / global::. Self-documented as "SECURITY LIMITATION (TemplateEngine-006)". | ScriptCompiler.cs:14-22,61-72; ValidationService.cs:346 | Security boundary is advisory only. 2 agents. |
| 3 | Audit Log 365-day retention purge never starts. AuditLogPurgeActor exists but has zero ActorOf/Props.Create callers; only the roll-forward partition service runs. | AuditLog/.../AuditLogPurgeActor.cs:58; AkkaHostedService.cs:486 ("wired in a later bundle") | The documented purge does not run in production; AuditLog grows unbounded. |
| 4 | Audit Log reconciliation self-heal never wired. IPullAuditEventsClient has no implementation; SiteAuditReconciliationActor is never instantiated. | IPullAuditEventsClient.cs:31; SiteAuditReconciliationActor.cs:68 | The documented "lost-telemetry fallback" doesn't exist; forward telemetry is the only path. |
| 5 | Site Call Audit reconciliation pull + daily purge both missing. The actor's own docstring admits it. PurgeTerminalAsync is implemented but never invoked. | SiteCallAuditActor.cs:28-34; SiteCallAuditRepository.cs:213 | SiteCalls mirror has no self-heal and grows unbounded. |
| 6 | Site Event Logging emits only 2 of 7 documented categories. Only connection + script (error-path only). Alarm, Deployment, Store-and-Forward, Instance-Lifecycle, Notification events are never logged — ISiteEventLogger isn't injected into those subsystems. | DataConnectionActor.cs, ScriptExecutionActor.cs (only emitters); spec "Events Logged" §20-28 | Operational event log is materially incomplete vs spec. |
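
To make gaps #1 and #2 concrete, here is a minimal sketch of a gate that only checks brace balance and scans for forbidden substrings. It is written in Python over C# script text purely for illustration; it is not the project's ScriptCompiler, and the `FORBIDDEN` entries and sample scripts are assumptions, since the audit does not list the real deny-list.

```python
from typing import Tuple

# Hypothetical deny-list for illustration; the audit does not list the real entries.
FORBIDDEN: Tuple[str, ...] = ("Process.Start", "File.Delete")


def braces_balanced(source: str) -> bool:
    """Return True when every '{' is closed by a later '}'."""
    depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def naive_gate(source: str) -> bool:
    """Accept a script when braces balance and no forbidden substring appears."""
    return braces_balanced(source) and not any(api in source for api in FORBIDDEN)


# Direct call: the forbidden substring is present, so the gate rejects it.
DIRECT = 'System.Diagnostics.Process.Start("cmd");'

# Type alias: "Process.Start" never appears as one contiguous substring.
ALIASED = 'using P = System.Diagnostics.Process;\nP.Start("cmd");'

# using static: the member is called without its type name.
USING_STATIC = 'using static System.Diagnostics.Process;\nStart("cmd");'

# Type error and undefined name, but the braces balance.
BROKEN = 'int x = "not a number"; { return undefinedName; }'
```

Under these assumptions `naive_gate` rejects only `DIRECT`. The aliased and `using static` variants reach the same API, and `BROKEN` would fail a real compiler, yet all three pass. A gate based on actual compilation and symbol resolution would see through both problems, which is why the audit treats the current check as advisory.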

Tier 2 — Partial / behaviorally-divergent functionality

Real, but narrower than the spec — wrong in a way that could surprise an operator or script author.

| # | Gap | Where |
|---|-----|-------|
| 7 | Database.CachedWrite misclassifies permanent SQL errors as transient → retries forever instead of failing fast to the script. The API path does it right; the DB path does not. No immediate attempt, no synchronous permanent-Failed return. | DatabaseGateway.cs:78-204 (cf. ExternalSystemClient.cs:100-161) |
| 8 | Alarm conditionFilter is plumbed end-to-end but applied nowhere — set a filter on a native-alarm source and it silently mirrors all conditions. | DataConnectionActor.cs:1482,1540-1554; RealOpcUaClient.cs:242,295; MxGatewayDataConnection.cs:154-167 |
| 9 | Per-script execution timeout doesn't exist — spec promises per-script; only a global ScriptExecutionTimeoutSeconds. No field in the template/flattened model to carry it. | SiteRuntimeOptions.cs:31; ScriptExecutionActor.cs:100; AlarmExecutionActor.cs:66 |
| 10 | Connection-level diffs never surface in the deployment diff: ComputeConnectionsDiff is dead code (no callers); ConfigurationDiff has no slot for it. Per-attribute binding drift is caught; standalone connection endpoint (protocol/config/failover) diff is not. | DiffService.cs:158-204; Commons/Types/Flattening/ConfigurationDiff.cs:7-24 |
| 11 | Inbound API auth transport drift — code uses Authorization: Bearer sbk_<keyId>_<secret>; doc says X-API-Key header. | InboundAPI/.../EndpointExtensions.cs:83-90 |
| 12 | Inbound API audit write is fire-and-forget after response flush — doc says synchronous before flush. Row is still emitted (fail-soft), just non-blocking and after the body is forwarded. | AuditWriteMiddleware.cs:195-212,281-290 |
| 13 | Inbound Object/List extended types are shape-validated only — no nested/field-level type validation, despite spec implying typed/nested validation. | ParameterValidator.cs:109-145; ReturnValueValidator.cs:18 |
| 14 | JWT-in-cookie session design not implemented: /auth/login signs a plain ClaimsPrincipal; GenerateToken only used by the CLI /auth/token path; ValidateToken has no external callers. | AuthEndpoints.cs:38,75-112,152; ServiceCollectionExtensions.cs:99-118 |
| 15 | "Re-query LDAP every 15 min / roles never >15 min stale" not implemented for interactive sessions: JwtTokenService.RefreshToken/RecordActivity/ShouldRefresh/IsIdleTimedOut have zero call sites; roles fixed until cookie expiry. The 15-min sliding + 30-min idle layers are collapsed into a single 30-min sliding cookie window. | JwtTokenService.* (no callers); ServiceCollectionExtensions.cs:99-148 |
| 16 | Transport stale-instance enumeration always returns empty: BundleImporter returns Array.Empty<int>(); UI shows a generic warning with no count, link not filtered to stale instances. | BundleImporter.cs:733; TransportImport.razor:347-388 |
| 17 | MachineDataDb fail-fast requirement not enforced — spec (REQ-HOST-3/4) requires central nodes to validate a non-empty MachineDataDb connection string. DatabaseOptions has only ConfigurationDb/SiteDbPath; validator never checks it; 0 grep hits in src/. Key lives only in docker appsettings as dead config. | DatabaseOptions.cs:6-12; StartupValidator.cs:60-61 |
| 18 | CI grep-guard against UPDATE/DELETE … AuditLog not in the repo — spec claims a build-time grep that fails on data-layer mutations. DB-role DENY enforcement is present in migrations (so this is a backstop, not the only control), but the claimed code-level guard is absent. | spec Component-AuditLog.md:335-336, Component-ConfigurationDatabase.md:297 |
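
The behavior that gap #7 says is missing on the database path (an immediate attempt, a synchronous failure for permanent errors, buffering only for transient ones) can be sketched as follows. This is an illustrative Python model, not the DatabaseGateway code: the exception types, the `CachedWriter` class, and the "Delivered" / "Buffered" result strings are assumptions introduced here; only the "Failed" outcome for permanent errors comes from the audit text.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical: a failure worth retrying (timeout, dropped connection)."""


class PermanentError(Exception):
    """Hypothetical: a failure retrying cannot fix (bad SQL, constraint violation)."""


@dataclass
class CachedWriter:
    # Performs the write and raises TransientError or PermanentError on failure.
    execute: Callable[[str], None]
    # Store-and-forward queue holding statements to retry later.
    buffer: List[str] = field(default_factory=list)

    def write(self, statement: str) -> str:
        """Attempt the write once, right away, and classify any failure."""
        try:
            self.execute(statement)
            return "Delivered"
        except PermanentError:
            # Fail fast: the caller learns synchronously and nothing is queued.
            return "Failed"
        except TransientError:
            # Only transient failures enter the retry buffer.
            self.buffer.append(statement)
            return "Buffered"
```

The point of the split is that a permanent error never enters the retry buffer, so it cannot be retried forever, and the calling script gets a definite answer on the same call.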

Lower-severity Tier-2 / behavioral notes

| # | Gap | Where |
|---|-----|-------|
| 19 | Script "started"/"completed" events not logged (only failures, severity Error). | ScriptExecutionActor.cs:239,256; ScriptActor.cs:369 |
| 20 | Return-type compatibility check is dead scaffolding: BuildReturnMap builds maps never read; no return-type comparison runs. | SemanticValidator.cs:62-63,279-287 |
| 21 | Argument type compatibility not checked — only arg count (comma counting). | SemanticValidator.cs:251-266,390-425 |
| 22 | Native-alarm-source connection-capability validation never runs in deploy pipeline: alarmCapableConnectionNames param no production caller supplies. | SemanticValidator.cs:30-33,239-245; FlatteningPipeline.cs:93,115 |
| 23 | Connection-binding completeness is a non-blocking Warning, not deploy-gating Error; "name exists at site" half missing. | ValidationService.cs:504-519; ValidationResult.cs:9 |
| 24 | Debug snapshot/subscribe for unknown instance returns empty snapshot, not error — caller can't distinguish "not deployed" from "deployed but empty." | DeploymentManagerActor.cs:845-866 |
| 25 | Recursion-limit error logged to .NET ILogger, not the site event log as spec requires. | ScriptRuntimeContext.cs:302-305,464-466 |
| 26 | Debug-stream snapshot/stream ordering reversed; no timestamp-dedup replay: PreStart sends snapshot first, opens stream after; gap-window events lost (spec wants stream-first + replay/dedup). | DebugStreamBridgeActor.cs:89-103,163-166 |
| 27 | OPC UA native-alarm transition leaves several display fields empty (Category/Description/OperatorUser/OriginalRaiseTime/CurrentValue/LimitValue) — partly by design. | RealOpcUaClient.cs:395-403; MxGatewayAlarmMapper.cs:79-113 |
| 28 | Readiness gate omits "required cluster singletons running" criterion — covers membership + DB connectivity only (softened by spec's "(if applicable)"). | Program.cs:188-201,314-317; AkkaClusterHealthCheck.cs:54 |
| 29 | SiteEventLog active-node purge gate never registered: SiteEventLogActiveNodeCheck not added to DI; purge defaults to () => true, runs on standby too (harmless, but documented restriction unenforced). | SiteEventLogging/ServiceCollectionExtensions.cs:33-37; EventLogPurgeService.cs:61 |
| 30 | FailedWriteCount metric exposed "for future Health Monitoring" but never consumed — dangling metric. | ISiteEventLogger.cs:32-40 |
| 31 | StateTransitionValidator allows Delete from NotDeployed; spec matrix says No (deliberate per code comment, contradicts doc). | StateTransitionValidator.cs:38-39 |
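
The ordering that gap #26 says the spec wants (open the stream first, then take the snapshot, then replay buffered events with timestamp dedup) can be sketched like this. It is a simplified Python model, not the DebugStreamBridgeActor: the `Event` shape, the `attach` function, and the `subscribe` / `take_snapshot` callables are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[int, str, str]          # (timestamp, key, value); shape is assumed
State = Dict[str, Tuple[int, str]]    # key -> (timestamp, value)


def attach(
    subscribe: Callable[[Callable[[Event], None]], None],
    take_snapshot: Callable[[], State],
) -> State:
    """Build a consistent starting state without losing gap-window events."""
    buffered: List[Event] = []

    # 1. Open the stream first, so events raised while the snapshot is
    #    being taken are captured instead of lost.
    subscribe(buffered.append)

    # 2. Take the snapshot.
    state: State = dict(take_snapshot())

    # 3. Replay buffered events, keeping one only when it is newer than
    #    what the snapshot already holds for that key (timestamp dedup).
    for timestamp, key, value in buffered:
        current = state.get(key)
        if current is None or timestamp > current[0]:
            state[key] = (timestamp, value)
    return state
```

With the snapshot-first order described in the table, any event raised between the snapshot and the stream opening has nowhere to land. Reversing the order moves that window inside the buffer, and the timestamp comparison discards buffered events the snapshot already reflects.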

Tier 3 — Intentional deferrals (correctly documented — NOT defects)

Knowingly punted, with extensible seams and explicit doc notes. [PERM] = permanent / v-next; [SLICE] = deferred-to-a-later-slice with seam present.

Centralized Audit Log (#23)

  • [PERM] Hash-chain tamper evidence (v1.x). verify-chain CLI is a no-op stub that prints "not enabled in this release". — AuditCommands.cs:243-246; AuditVerifyChainHelpers.cs:6-8
  • [PERM] Parquet export/archival. Server returns HTTP 501; CSV + JSONL implemented. — AuditEndpoints.cs:188-194; AuditExportHelpers.cs:139-148
  • [PERM] Per-channel retention overrides. — 2026-05-20-audit-log-code-roadmap.md:16
  • [PERM] Tag-cascade for ParentExecutionId — only the inbound-API→routed-site bridge is built; trigger-driven runs pass parentExecutionId = null. — ScriptActor.cs:404,429; 2026-05-21-audit-parent-executionid-design.md:209
  • [PERM] ExecutionId/ParentExecutionId backfill on historical rows; SourceNode backfill on legacy rows; per-node stuck-count KPIs.
  • [PERM] Structured/response-header response capture; inbound request-header capture; per-method opt-out; AuditInboundCeilingHits metric. — 2026-05-23-inbound-api-full-response-audit-design.md:113-127
  • Uncertain: CLI audit tree command (doc "maybe", not found in CLI).

Notifications (#8 / #21)

  • [SLICE] Teams (and all non-Email) notification types — INotificationDeliveryAdapter seam exists, only EmailNotificationDeliveryAdapter implemented; NotificationType enum is Email-only. Missing-adapter path parks gracefully. — NotificationType.cs:6-9; NotificationOutboxActor.cs:457-474
  • [SLICE] Central UI notification-list form has no Type selector (Email hard-coded). — NotificationListForm.razor
  • [PERM] Historical/trend KPI charts (no time-series store).

Native Alarms / MxGateway / OPC UA

  • [PERM] Native-alarm ack/shelve/suppress write-back; central alarm tables/history/journal; alarm-driven notifications/scripts — read-only by design. — 2026-05-29-native-alarms-design.md:201-206
  • [SLICE] Dedicated operator Alarm Summary page (DebugView only for now).
  • [PERM] MxGateway secured writes (operator+verifier).
  • [SLICE] OPC UA address-space search; BrowseNext paging. — RealOpcUaClient.cs:574
  • [PERM] OPC UA type-info surfacing; bulk override import/CSV.
  • [SLICE] OPC UA "Verify endpoint" connectivity button; cert-management UI.

Transport (#24)

  • [PERM] Site-scoped / instance-scoped artifact transport (needs name-mapping subsystem).
  • [PERM] Direct cluster-to-cluster pull; asymmetric bundle signing; differential/incremental bundles.
  • [PERM/SLICE] Per-line/Myers diff for Modified artifacts (coarse line-count delta only). — ArtifactDiff.cs:18-25

TreeView

  • [SLICE/PERM] R6 lazy-loading, R7 keyboard nav, R16 multi-select — spec marks all "(Deferred)". — Component-TreeView.md:87-93,288-295

Templates / Data Connections / Triggers UI

  • [SLICE] Template tree search/filter; [PERM] folder drag-drop, sibling reorder, root context menu.
  • [PERM] Move data connection between sites; [SLICE] connection live-status indicators (blocked on DCL state surfacing).
  • [SLICE] Base-template versioning "update-derived" flow; multiple inheritance levels; [PERM] promote-derived-to-base, cross-tenant libraries.
  • [SLICE] Strict expression-trigger analysis kind; [PERM] WhileTrue trigger mode for alarms.
  • [SLICE] Schema-driven value-entry forms; schema hover/completion; [PERM] JSON Schema $ref reuse / template-level schema library.

Cached-call tracking (#6 / #22)

  • [SLICE] CLI surface for site-local Retry/Discard of cached calls; [PERM] unified notifications+site-calls outbox page.

UI audit backlog (2026-05-12-ui-audit.md:536-554)

  • IDialogService modal abstraction; design-tokens/CSS-vars; dark-mode/theming; shared pagination+filter component; accessibility pass; replacing SignalR debug-view streaming.

Environment / tooling

  • [PERM] True air-gapped second environment (env2 shares MSSQL/LDAP/SMTP); 3rd/4th env; --env flag on deploy.sh.
  • [PERM] Repo/folder rename (kept as ScadaBridge to preserve context).
  • [SLICE] Playwright alarm-override UI coverage.

Tier 4 — Doc↔code drift (code is fine; docs describe a superseded architecture)

Worth fixing for anyone relying on the docs as the spec.

Config DB / Commons re-architecture not reflected in specs (High doc-impact):

  • AuditLog table collapsed to 10 canonical + DetailsJson + 6 PERSISTED JSON_VALUE computed cols; doc still lists ~24 typed columns (Kind, HttpStatus, RequestSummary, …). — migration 20260602174346_CollapseAuditLogToCanonical.cs; Entities/AuditLogRow.cs:54-136
  • AuditEvent moved out of Commons into the external ZB.MOM.WW.Audit NuGet package; doc (REQ-COM-1/3/5b) still describes it as a Commons type. — Commons.csproj:11
  • ApiKey entity / API-key persistence retired to shared ZB.MOM.WW.Auth.ApiKeys SQLite store; doc still lists ApiKey + ApprovedApiKeyIds. — migration 20260602092753_RetireInboundApiKeyStore.cs

CLI docs drift (README is the stale doc; Component-CLI.md mostly matches code):

  • Entire bundle (Transport #24) command group is shipped + registered but documented in neither Component-CLI.md nor CLI/README.md. — Program.cs:36; BundleCommands.cs:24-372
  • security api-key create requires undocumented --methods (Required); docs show only --name. — SecurityCommands.cs:41-45
  • security api-key update/delete use --key-id; docs document --id (and an unwired --name on update). — SecurityCommands.cs:60,71
  • security api-key set-methods subcommand exists in code, documented nowhere. — SecurityCommands.cs:91-102
  • api-method create uses required --script; docs document --code + --description (neither exists). README is internally inconsistent (create=--code, update=--script). — ApiMethodCommands.cs:57-62
  • db-connection create/update documented with --provider; code has no such option. — DbConnectionCommands.cs:56-72
  • Widespread README option-name drift where Component-CLI.md already matches code (scope-rule --mapping-id, health --site/--keyword, template attribute --value/--data-source, template alarm --trigger-type/--priority/--trigger-config, composition delete --id, etc.).
  • audit query doc lists --page (code is keyset-only --all); undocumented --execution-id/--parent-execution-id filters exist.
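
For readers unfamiliar with the keyset style mentioned in the last bullet, the sketch below shows why a `--page` number has no meaning under it: each request carries the last key seen rather than an offset. This is a generic Python illustration, not the CLI's implementation; the `(occurred_at, id)` cursor and both function names are assumptions.

```python
from typing import Iterator, List, Optional, Tuple

Row = Tuple[int, int]  # (occurred_at, id); the real cursor columns are not stated


def fetch_page(rows: List[Row], after: Optional[Row], limit: int) -> List[Row]:
    """Return up to `limit` rows strictly after the cursor, in key order."""
    ordered = sorted(rows)
    if after is not None:
        ordered = [row for row in ordered if row > after]
    return ordered[:limit]


def fetch_all(rows: List[Row], limit: int = 2) -> Iterator[Row]:
    """Walk the whole result set by chaining cursors, as an --all flag might."""
    cursor: Optional[Row] = None
    while True:
        page = fetch_page(rows, cursor, limit)
        if not page:
            return
        yield from page
        # The last row of this page becomes the cursor for the next request.
        cursor = page[-1]
```

Because the cursor is a key rather than a position, rows inserted while paging do not shift later pages, which is a common reason to prefer keyset paging for append-heavy audit tables.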

Stale "deferred" markers for things that have actually SHIPPED:

  • Transport CLI (bundle export/preview/import) — design doc §13 said "deferred"; now implemented.
  • SourceNode capture — .tasks.json shows all 21 tasks "pending"; fully implemented across Commons/AuditLog/NotificationOutbox/SiteCallOperational.
  • Site Call Audit Retry/Discard relay — DI comment says deferred; implemented + wired (SiteCallAuditActor.cs:150-156,450-505; AkkaHostedService.cs:580-589).
  • Bundle-import audit filter UI (Transport-012) — doc says deferred follow-up; shipped (ConfigurationAuditLog.razor ?bundleImportId= filter).
  • Redaction/payload-cap "deferred to M5" comments in Site Runtime — already shipped (ScadaBridgeAuditRedactor, AuditLogOptions.DefaultCapBytes/ErrorCapBytes).
  • AuditLogPage.HandleRowSelected class comment says "no-op seam"; method is fully wired (opens drawer).

Other doc/spec inconsistencies (code richer/different than doc):

  • Security role names: doc says Admin/Design/Deployment; code uses Administrator/Designer/Deployer/Viewer (canonicalized via migration).
  • SiteCall entity field names diverge from doc (Channel not Kind, SourceSite not SourceSiteId, adds HttpStatus/IngestedAtUtc).
  • ExecuteReader audited as DbWrite (read/write distinguished via Extra JSON op, not a distinct AuditKind).
  • Inbound audit doc references ApiInbound.Completed; actual kinds are InboundRequest/InboundAuthFailure.
  • Teams claimed present in NotificationType enum by Commons/ConfigDB docs; enum is Email-only.
  • Commons under-documents shipped code: MxGateway endpoint serializer/validator/config, Observability/ScadaBridgeTelemetry.cs, IInboundApiKeyAdmin, IAuditActorAccessor — none in the doc folder map.
  • IHealthMonitoringRepository listed in ConfigDB repo table but doesn't exist (doc annotated "future").
  • requirements-traceability.md and many .md.tasks.json show "Pending" for shipped features — they track plan generation, not implementation; unreliable as a status source.
  • ExternalSystemForm "Recent audit activity" drill-in omits channel=ApiOutbound and uses exact-match target instead of starts-with (sibling ApiKeyForm link is correct). — ExternalSystemForm.razor:20-24

Code-level sweep — investigated and ruled out (false positives)

For completeness, items that look unfinished but are intentional:

  • ~44 empty catch blocks — all have explanatory comments / intentional fallback (JSON parse → default; disposal-race ObjectDisposedException). None silently swallow real errors.
  • SiteNotificationRepository / SiteExternalSystemRepository write methods throw NotSupportedException — by design (site config is read-only, managed via central deployment).
  • StubOpcUaClient (canned data; BrowseChildrenAsync throws NotImplementedException) — dev/test-only; production wires RealOpcUaClientFactory/RealMxGatewayClientFactory (DataConnectionFactory.cs:38-47).
  • NoOpSiteStreamAuditClient, SandboxNotifyHelper, sandbox host fakes — legitimate DI-default / test composition seams.
  • AddSecurityActors / AddTemplateEngineActors "Phase 0 placeholder" registrations — intentional empty seams (actor wiring lives in Host).
  • Migration Down() NotSupportedException, MxGateway/Bundle version-rejection NotSupportedException, AuditWriteMiddleware write-only-stream NotSupportedException — intentional guards.
  • Management Service: 113 handlers; all wire-registered Mgmt* commands dispatched. The three "unhandled" (ResolveRolesCommand retired; BrowseNodeCommand/ReadTagValuesCommand routed direct-to-site) are intentional.
  • Central UI: no stub/placeholder pages, no NotImplementedException, no "coming soon" banners, no no-op @onclick. disabled=/placeholder= usages are legitimate (loading guards, edit locks, HTML hints).

Phase completeness (self-reported)

All 11 phases report Complete with passing verification gates:

  • Phase 0 Solution Skeleton — Complete (gate 11/11, 57 tests).
  • Phase 1 Central Foundations — Complete (gate 20/20, 186 pass + 1 live-LDAP skip).
  • Phase 2 Modeling & Validation — Complete (gate 9/9, 359 tests).
  • Phase 3A Runtime Foundation — Complete (gate 13/13, 389/389).
  • Phase 3B Site I/O & Observability — Complete (gate 11/11, 541 cumulative).
  • Phase 3C Deployment & Store-Forward — Complete (terse checklist).
  • Phase 4 Operator UI — Complete (terse checklist).
  • Phase 5 Authoring UI — Complete (terse checklist).
  • Phase 6 Deployment & Ops UI — Complete (terse checklist; Codex external-review step skipped, best-effort).
  • Phase 7 Integrations — Complete (terse checklist; Q12 SMTP-OAuth2 is a test-env dependency).
  • Phase 8 Production Readiness — Complete (terse checklist).

The ~665 unchecked - [ ] items in phase plan docs are spec-traceability references (each dispositioned Pass / Out-of-scope in Forward/Reverse tables), a documentation style — not a TODO list.

Operational (not code): docs/deployment/production-checklist.md has ~60 unchecked install-time operator steps (env vars, connection strings, firewall ports 8081/636/587/1433, TLS certs, smoke tests).


Confidence & caveats

  • High confidence on Tier 1 — each item verified by reading the code (class/interface existence + absence of callers via grep); top items corroborated by 2+ independent agents.
  • Terse Phase 3C-8 checklists self-report "Complete / tests passing" with no per-gate breakdown; test counts for those phases were not independently re-run.
  • Actual src/ artifacts were treated as truth over .tasks.json status fields, which are demonstrably stale.
  • Items marked Uncertain (e.g. audit tree CLI, per-channel retention) rest on doc text only.
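
The "existence plus absence of callers" check described in the first bullet can be approximated with a short script. The sketch below is an assumption about how such a scan might look, not the tooling the investigators used: `load_sources` and `find_unreferenced` are hypothetical names, and a plain text match will count mentions in comments or tests as references.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List


def load_sources(root: str) -> Dict[str, str]:
    """Read every .cs file under `root` into a path -> text mapping."""
    return {
        str(path): path.read_text(encoding="utf-8", errors="ignore")
        for path in Path(root).rglob("*.cs")
    }


def find_unreferenced(sources: Dict[str, str], type_names: Iterable[str]) -> List[str]:
    """Return type names mentioned only in the file that declares them."""
    unreferenced = []
    for name in type_names:
        mention = re.compile(rf"\b{re.escape(name)}\b")
        declaration = re.compile(rf"\b(?:class|interface)\s+{re.escape(name)}\b")
        # A type counts as referenced when some file other than its
        # declaring file mentions it by name.
        referenced = any(
            mention.search(text) and not declaration.search(text)
            for text in sources.values()
        )
        if not referenced:
            unreferenced.append(name)
    return unreferenced
```

A call such as `find_unreferenced(load_sources("src"), ["AuditLogPurgeActor"])` would flag the type if no other file names it. A hit is a lead to confirm by reading the code, not proof that the type is unwired.
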

Recommended next steps

  1. Wire the two never-started audit actors (#3, #4) — highest impact, smallest blast radius (DI/Host wiring + IPullAuditEventsClient impl).
  2. Site Call Audit reconciliation + purge (#5) — same shape as #3/#4.
  3. Decide on script compilation/security (#1, #2) — either implement the Roslyn gate or downgrade the spec's claims; currently the strongest functional + security gap.
  4. Site Event Logging categories (#6) — inject ISiteEventLogger into the 5 missing subsystems.
  5. Reconcile Tier-4 doc drift — update Config DB / Commons specs for the audit/auth re-architecture and the CLI docs for the bundle group + option names.