Files
ScadaBridge/docs/plans/2026-06-15-stillpending-completion-design.md
T
Joseph Doherty f4707745bf docs(plans): completion roadmap for stillpending.md audit
Add the system-completion design doc (risk-first milestones M1-M10):
Phase 1 Stabilize (M1 runtime wiring, M2 correctness, M3 script trust
boundary, M4 doc reconciliation) then Phase 2 Expand (M5-M10 feature
epics). Scope = all Tier 1/2/4 + in-scope Tier 3 features; T12/T19
deferred to own brainstorm; deliberate anti-goals excluded. Also commit
the source audit (stillpending.md).
2026-06-15 09:27:00 -04:00

13 KiB
Raw Blame History

Design: Completing stillpending.md — System Completion Roadmap

Date: 2026-06-15 Status: Approved (brainstorming session) — ready for per-milestone implementation planning Source audit: stillpending.md (repo root) — full deferred/partial/unfinished/missing inventory Scope decision: Complete everything genuinely intended — all of Tier 1, Tier 2, Tier 4, and the in-scope Tier 3 features. Two large Tier 3 items deferred to their own brainstorm; a set of deliberate anti-goals excluded (see Scope).

Goal

Drive stillpending.md to zero for all genuinely-intended functionality: make the silent gaps real, correct the behavioral divergences, reconcile the docs with the shipped architecture, and build out the deferred features that have a real seam and a real reason to exist.

Scope

In scope

  • Tier 1 — all 6 silent gaps (documented as working, actually inert).
  • Tier 2 — all ~31 genuine code defects / partial behaviors.
  • Tier 4 — all doc↔code drift (specs + CLI docs + stale markers).
  • Tier 3 — features tagged for build: T1T11, T13T18, T20T26, T28, T30T36, T41.

Deferred to their own brainstorm (NOT in this plan)

  • T12 — Native-alarm ack/shelve/suppress write-back + central alarm tables/history/journal + alarm-driven notifications. Reverses the "native alarms are read-only by design" decision; large enough to warrant its own design session.
  • T19 — Direct cluster-to-cluster pull + asymmetric bundle signing + differential/incremental bundles. Three separable large features.

Excluded (deliberate anti-goals — leave as-is)

  • T27 Promote-derived-to-base + cross-tenant template libraries.
  • T29 WhileTrue trigger mode for alarms (alarms are already level-based).
  • T37 Replace SignalR debug-view streaming (working code; pure refactor).
  • T38 True air-gapped env2 / 3rd-4th env / --env flag (env2 deliberately shares infra).
  • T39 Repo/folder rename away from ScadaBridge (kept to preserve context).
  • T40 Rename legacy Transport bundle manifests.

Approach

Risk-first milestones (Approach A). Work is ordered by impact, not by component, and grouped into themed milestones using the repo's existing m1…mN implementation-doc convention (cf. auditlog-m1..m8). Phase 1 (M1M4) stabilizes — a finite, ship-soon body of work that makes the docs true and the silent gaps real. Phase 2 (M5M10) expands — an open-ended roadmap of independently-sized feature epics.

Rejected alternatives:

  • Component-grouped — cleaner per-component diffs, but spreads the high-risk Tier-1 work across late workstreams and ships no value early.
  • Two-track only — good mental model (folded into the Phase 1/2 boundary here), but too coarse for sequencing.

Milestones

Phase 1 — Stabilize

M1 — Runtime wiring (Tier 1: #3, #4, #5, #6)

Wire up behavior that exists in code but is never started, and fill the event-log categories.

  • Wire AuditLogPurgeActor into Host bootstrap; drive partition-switch purge (365-day retention). (AuditLogPurgeActor.cs, AkkaHostedService.cs:486)
  • Implement IPullAuditEventsClient and instantiate SiteAuditReconciliationActor (periodic PullAuditEvents fallback). (IPullAuditEventsClient.cs, SiteAuditReconciliationActor.cs)
  • Site Call Audit: implement the per-site reconciliation pull (changed-since cursor read + central insert-if-not-exists/upsert-on-newer) and schedule PurgeTerminalAsync daily. (SiteCallAuditActor.cs:28-34, SiteCallAuditRepository.cs:213)
  • Site Event Logging: inject ISiteEventLogger into AlarmActor/NativeAlarmActor, DeploymentManagerActor, Store-and-Forward, and NotificationOutboxActor; emit Alarm / Deployment / Data-Connection / S&F / Instance-Lifecycle / Notification events, and add script started/completed (Info) alongside the existing error events.
  • DoD: actors instantiated and DI-registered; integration tests prove the purge actually switches out partitions and reconciliation back-fills a row dropped from telemetry; event-log rows produced for all 7 categories.

M2 — Correctness & behavioral gaps (Tier 2)

  • Database.CachedWrite: attempt immediately; permanent SQL error returns synchronously as Failed (mirror the API path); only transient errors buffer. (DatabaseGateway.cs:78-204)
  • Alarm conditionFilter: build the OPC UA event WhereClause from the filter and honor it in DataConnectionActor routing; same plumbing for MxGateway. (DataConnectionActor.cs:1482, RealOpcUaClient.cs:242)
  • Per-script execution timeout: add a field to TemplateScript + flattened config; apply in ScriptExecutionActor/AlarmExecutionActor; fall back to the global default. (SiteRuntimeOptions.cs:31)
  • Connection-level diff: add a ConnectionChanges slot to ConfigurationDiff; call ComputeConnectionsDiff from ComputeDiff; surface in the deployment diff UI. (DiffService.cs:174)
  • MachineDataDb fail-fast: add the option + startup validation (+ DbContext if the connection is actually consumed). (DatabaseOptions.cs, StartupValidator.cs)
  • CI grep-guard: build step that fails on UPDATE/DELETE against AuditLog in the data layer. (Component-AuditLog.md:335)
  • LDAP periodic re-query: implement the session-refresh path that re-queries LDAP so interactive roles are never >15 min stale (wire JwtTokenService.RefreshToken/ShouldRefresh or an OnValidatePrincipal revalidation). (security-relevant; pulled out of the session-doc reconcile)
  • Low-severity batch (#19#31): return-type compatibility check; argument type compatibility; native-alarm-source capability validation wired into the deploy pipeline; binding-completeness as a deploy-gating Error (+ "name exists at site" check); debug snapshot/subscribe error response for unknown instance; recursion-limit error → site event log; debug-stream snapshot/stream ordering + timestamp-dedup replay; OPC UA native-alarm transition field population; readiness "required singletons running" probe; register the SiteEventLog active-node purge gate; consume FailedWriteCount in Health Monitoring; reconcile StateTransitionValidator delete-from-NotDeployed with the spec matrix.
  • DoD: unit + integration tests per behavior; where the fix corrects code, the spec already matched; where the spec was the divergence, it's updated in the same change.

M3 — Script trust boundary (Tier 1: #1, #2)

  • Wire the already-referenced Microsoft.CodeAnalysis.CSharp.Scripting into ScriptCompiler.TryCompile for a real semantic compile (errors block deploy). (ScriptCompiler.cs:56-104, ValidationService.cs:128)
  • Replace the advisory substring forbidden-API scan with Roslyn symbol/semantic analysis that resolves aliases, using static, and global::; coordinate design-time enforcement with the Site Runtime sandbox so the trust boundary is authoritative. (ScriptCompiler.cs:14-22)
  • Apply the same real compile to shared scripts. (SharedScriptService.cs:168-206)
  • DoD: semantically-invalid C# fails validation; adversarial bypass tests (alias / using static / global:: reaching a forbidden API) fail to deploy.

M4 — Doc reconciliation (Tier 4, parallelizable)

  • Update specs to the shipped re-architecture: Component-ConfigurationDatabase.md (collapsed AuditLog schema), Component-Commons.md (AuditEventZB.MOM.WW.Audit package, ApiKey retirement, undocumented types/interfaces), Component-InboundAPI.md (Bearer auth, audit write timing, type validation), Component-NotificationService.md/NotificationType (Teams status), Security role names, SiteCall field names, AuditKind vocabulary.
  • CLI docs: document the bundle group; fix README option-name drift (the README is the stale doc; Component-CLI.md mostly matches code); correct audit query options.
  • Clear stale "deferred" markers/comments for shipped features (Transport CLI, SourceNode, Site Call Audit relay, bundle-import audit filter, M5 redaction comments, AuditLogPage.HandleRowSelected).
  • Code-vs-doc dispositions: doc-update (code authoritative) for Bearer auth, fire-and-forget audit timing, JWT-in-cookie→cookie-only session model, ExecuteReader/DbWrite kind. Code-change (build it) for nested Object/List validation (#13) — done in M2's validator work — and the LDAP re-query (M2).
  • DoD: no remaining doc↔code contradiction for in-scope components; CLI docs match registered commands/options.

Phase 2 — Expand

M5 — Audit hardening (T1T8)

Hash-chain tamper evidence (off by default, verify-chain made real); Parquet export/archival (replace the 501); per-channel retention overrides; tag-cascade for ParentExecutionId (thread writing-execution id through trigger-driven runs); ExecutionId/ParentExecutionId + SourceNode backfill on historical rows; per-node stuck-count KPIs; structured response capture (headers/content-type, inbound request headers, per-method opt-out, AuditInboundCeilingHits metric); CLI audit tree.

M6 — Notifications (T9T11)

Teams + other non-Email delivery adapters behind the existing INotificationDeliveryAdapter seam; NotificationType enum values; Central UI notification-list Type selector; historical/trend KPI charts (introduce a time-series store).

M7 — OPC UA / MxGateway UX (T13T17)

Dedicated operator Alarm Summary page; MxGateway secured writes (operator+verifier); OPC UA address-space search + BrowseNext paging; type-info surfacing + bulk override CSV import; "Verify endpoint" connectivity button + cert-management UI.

M8 — Transport (T18, T20)

Site-scoped / instance-scoped artifact transport (name-mapping subsystem); per-line/Myers diff for Modified artifacts.

M9 — Templates & authoring (T22T26, T28, T30T32)

Template tree search/filter; folder drag-drop + sibling reorder + root context menu; move data connection between sites; connection live-status indicators; base-template versioning "update-derived" flow + multi-level inheritance; strict expression-trigger analysis kind; schema-driven value-entry forms + hover/completion + JSON Schema $ref/library; CLI Retry/Discard for cached calls; unified notifications+site-calls outbox page.

M10 — UI/UX platform (T33T36, T41)

IDialogService modal abstraction; design-tokens/CSS-vars + dark-mode/theming; shared pagination+filter component; accessibility pass; Playwright alarm-override UI coverage.

Dependencies & sequencing

  • M1 → M5 — audit hardening builds on the wired purge/reconciliation.
  • M6/T11 — depends on introducing a time-series store (new infra; size carefully).
  • M9/T26 — base-template versioning is the largest authoring item; may split.
  • M4 — runs anytime; cheap and high-clarity, good to interleave.
  • M3 — independent; can run in parallel with M1/M2.
  • Phase 1 (M1M4) should complete before Phase 2 work starts in earnest, so the foundation is true before features pile on.

Cross-cutting conventions (per CLAUDE.md)

  • Each milestone gets its own dated implementation plan in docs/plans/ and (where useful) a .tasks.json.
  • Design doc + code + entities/repos + actors/services + UI + tests + migrations + deploy config travel together in each slice.
  • Every milestone is independently shippable: dotnet build ZB.MOM.WW.ScadaBridge.slnx green + relevant unit/integration tests pass; cluster-runtime changes rebuilt via bash docker/deploy.sh.
  • M3 and the security items (M2 LDAP re-query) carry adversarial tests (bypass attempts), not just happy-path.
  • git diff review before each commit; related changes committed together with a design-summary message.

Testing strategy

  • M1: integration tests against the cluster proving purge + reconciliation actually run (not just unit-level actor tests); event-log row assertions per category.
  • M2: behavior-level unit tests per gap; CachedWrite + conditionFilter + per-script-timeout get integration coverage.
  • M3: golden invalid scripts must fail; adversarial forbidden-API bypass corpus must fail to deploy.
  • M4: doc-only — no test impact beyond keeping existing suites green; nested-type validation (#13) gets validator unit tests.
  • M5M10: standard unit + integration + Playwright (UI) coverage per feature; new infra (M6 time-series store) gets its own integration suite.

Open items / risks

  • M3 real-compile may surface latent invalid scripts in existing templates/fixtures — budget for fixture cleanup.
  • M6 time-series store is the one genuinely-new piece of infrastructure; scope it deliberately (could reuse MS SQL with a rollup table rather than a new dependency).
  • The Phase 2 roadmap is large; treat each milestone as a separate planning + implementation pass, not a single mega-effort.

Next step

Hand off to the writing-plans skill to produce the detailed, bite-sized implementation plan, starting with Phase 1 (M1M4). Phase 2 milestones are planned individually as they're picked up.