# Design: Completing `stillpending.md` — System Completion Roadmap **Date:** 2026-06-15 **Status:** Approved (brainstorming session) — ready for per-milestone implementation planning **Source audit:** `stillpending.md` (repo root) — full deferred/partial/unfinished/missing inventory **Scope decision:** Complete *everything* genuinely intended — all of Tier 1, Tier 2, Tier 4, and the in-scope Tier 3 features. Two large Tier 3 items deferred to their own brainstorm; a set of deliberate anti-goals excluded (see Scope). ## Goal Drive `stillpending.md` to zero for all genuinely-intended functionality: make the silent gaps real, correct the behavioral divergences, reconcile the docs with the shipped architecture, and build out the deferred features that have a real seam and a real reason to exist. ## Scope ### In scope - **Tier 1** — all 6 silent gaps (documented as working, actually inert). - **Tier 2** — all ~31 genuine code defects / partial behaviors. - **Tier 4** — all doc↔code drift (specs + CLI docs + stale markers). - **Tier 3** — features tagged for build: T1–T11, T13–T18, T20–T26, T28, T30–T36, T41. ### Deferred to their own brainstorm (NOT in this plan) - **T12** — Native-alarm ack/shelve/suppress write-back + central alarm tables/history/journal + alarm-driven notifications. Reverses the "native alarms are read-only by design" decision; large enough to warrant its own design session. - **T19** — Direct cluster-to-cluster pull + asymmetric bundle signing + differential/incremental bundles. Three separable large features. ### Excluded (deliberate anti-goals — leave as-is) - **T27** Promote-derived-to-base + cross-tenant template libraries. - **T29** WhileTrue trigger mode for alarms (alarms are already level-based). - **T37** Replace SignalR debug-view streaming (working code; pure refactor). - **T38** True air-gapped env2 / 3rd-4th env / `--env` flag (env2 deliberately shares infra). - **T39** Repo/folder rename away from ScadaBridge (kept to preserve context). - **T40** Rename legacy Transport bundle manifests. ## Approach **Risk-first milestones (Approach A).** Work is ordered by impact, not by component, and grouped into themed milestones using the repo's existing `m1…mN` implementation-doc convention (cf. `auditlog-m1..m8`). Phase 1 (M1–M4) *stabilizes* — a finite, ship-soon body of work that makes the docs true and the silent gaps real. Phase 2 (M5–M10) *expands* — an open-ended roadmap of independently-sized feature epics. Rejected alternatives: - **Component-grouped** — cleaner per-component diffs, but spreads the high-risk Tier-1 work across late workstreams and ships no value early. - **Two-track only** — good mental model (folded into the Phase 1/2 boundary here), but too coarse for sequencing. ## Milestones ### Phase 1 — Stabilize #### M1 — Runtime wiring (Tier 1: #3, #4, #5, #6) Wire up behavior that exists in code but is never started, and fill the event-log categories. - Wire `AuditLogPurgeActor` into Host bootstrap; drive partition-switch purge (365-day retention). (`AuditLogPurgeActor.cs`, `AkkaHostedService.cs:486`) - Implement `IPullAuditEventsClient` and instantiate `SiteAuditReconciliationActor` (periodic `PullAuditEvents` fallback). (`IPullAuditEventsClient.cs`, `SiteAuditReconciliationActor.cs`) - Site Call Audit: implement the per-site reconciliation pull (changed-since cursor read + central insert-if-not-exists/upsert-on-newer) and schedule `PurgeTerminalAsync` daily. (`SiteCallAuditActor.cs:28-34`, `SiteCallAuditRepository.cs:213`) - Site Event Logging: inject `ISiteEventLogger` into `AlarmActor`/`NativeAlarmActor`, `DeploymentManagerActor`, Store-and-Forward, and `NotificationOutboxActor`; emit Alarm / Deployment / Data-Connection / S&F / Instance-Lifecycle / Notification events, and add script *started*/*completed* (Info) alongside the existing error events. - **DoD:** actors instantiated and DI-registered; integration tests prove the purge actually switches out partitions and reconciliation back-fills a row dropped from telemetry; event-log rows produced for all 7 categories. #### M2 — Correctness & behavioral gaps (Tier 2) - `Database.CachedWrite`: attempt immediately; permanent SQL error returns synchronously as `Failed` (mirror the API path); only transient errors buffer. (`DatabaseGateway.cs:78-204`) - Alarm `conditionFilter`: build the OPC UA event `WhereClause` from the filter and honor it in `DataConnectionActor` routing; same plumbing for MxGateway. (`DataConnectionActor.cs:1482`, `RealOpcUaClient.cs:242`) - Per-script execution timeout: add a field to `TemplateScript` + flattened config; apply in `ScriptExecutionActor`/`AlarmExecutionActor`; fall back to the global default. (`SiteRuntimeOptions.cs:31`) - Connection-level diff: add a `ConnectionChanges` slot to `ConfigurationDiff`; call `ComputeConnectionsDiff` from `ComputeDiff`; surface in the deployment diff UI. (`DiffService.cs:174`) - `MachineDataDb` fail-fast: add the option + startup validation (+ DbContext if the connection is actually consumed). (`DatabaseOptions.cs`, `StartupValidator.cs`) - CI grep-guard: build step that fails on `UPDATE`/`DELETE` against `AuditLog` in the data layer. (`Component-AuditLog.md:335`) - LDAP periodic re-query: implement the session-refresh path that re-queries LDAP so interactive roles are never >15 min stale (wire `JwtTokenService.RefreshToken`/`ShouldRefresh` or an `OnValidatePrincipal` revalidation). (security-relevant; pulled out of the session-doc reconcile) - Low-severity batch (#19–#31): return-type compatibility check; argument *type* compatibility; native-alarm-source capability validation wired into the deploy pipeline; binding-completeness as a deploy-gating Error (+ "name exists at site" check); debug snapshot/subscribe error response for unknown instance; recursion-limit error → site event log; debug-stream snapshot/stream ordering + timestamp-dedup replay; OPC UA native-alarm transition field population; readiness "required singletons running" probe; register the SiteEventLog active-node purge gate; consume `FailedWriteCount` in Health Monitoring; reconcile `StateTransitionValidator` delete-from-`NotDeployed` with the spec matrix. - **DoD:** unit + integration tests per behavior; where the fix corrects code, the spec already matched; where the spec was the divergence, it's updated in the same change. #### M3 — Script trust boundary (Tier 1: #1, #2) - Wire the already-referenced `Microsoft.CodeAnalysis.CSharp.Scripting` into `ScriptCompiler.TryCompile` for a real semantic compile (errors block deploy). (`ScriptCompiler.cs:56-104`, `ValidationService.cs:128`) - Replace the advisory substring forbidden-API scan with Roslyn symbol/semantic analysis that resolves aliases, `using static`, and `global::`; coordinate design-time enforcement with the Site Runtime sandbox so the trust boundary is authoritative. (`ScriptCompiler.cs:14-22`) - Apply the same real compile to shared scripts. (`SharedScriptService.cs:168-206`) - **DoD:** semantically-invalid C# fails validation; adversarial bypass tests (alias / `using static` / `global::` reaching a forbidden API) fail to deploy. #### M4 — Doc reconciliation (Tier 4, parallelizable) - Update specs to the shipped re-architecture: `Component-ConfigurationDatabase.md` (collapsed `AuditLog` schema), `Component-Commons.md` (`AuditEvent`→`ZB.MOM.WW.Audit` package, `ApiKey` retirement, undocumented types/interfaces), `Component-InboundAPI.md` (Bearer auth, audit write timing, type validation), `Component-NotificationService.md`/`NotificationType` (Teams status), Security role names, `SiteCall` field names, `AuditKind` vocabulary. - CLI docs: document the `bundle` group; fix README option-name drift (the README is the stale doc; `Component-CLI.md` mostly matches code); correct `audit query` options. - Clear stale "deferred" markers/comments for shipped features (Transport CLI, `SourceNode`, Site Call Audit relay, bundle-import audit filter, M5 redaction comments, `AuditLogPage.HandleRowSelected`). - **Code-vs-doc dispositions:** doc-update (code authoritative) for Bearer auth, fire-and-forget audit timing, JWT-in-cookie→cookie-only session model, `ExecuteReader`/`DbWrite` kind. Code-change (build it) for nested `Object`/`List` validation (#13) — done in M2's validator work — and the LDAP re-query (M2). - **DoD:** no remaining doc↔code contradiction for in-scope components; CLI docs match registered commands/options. ### Phase 2 — Expand #### M5 — Audit hardening (T1–T8) Hash-chain tamper evidence (off by default, `verify-chain` made real); Parquet export/archival (replace the 501); per-channel retention overrides; tag-cascade for `ParentExecutionId` (thread writing-execution id through trigger-driven runs); ExecutionId/ParentExecutionId + SourceNode backfill on historical rows; per-node stuck-count KPIs; structured response capture (headers/content-type, inbound request headers, per-method opt-out, `AuditInboundCeilingHits` metric); CLI `audit tree`. #### M6 — Notifications (T9–T11) Teams + other non-Email delivery adapters behind the existing `INotificationDeliveryAdapter` seam; `NotificationType` enum values; Central UI notification-list `Type` selector; historical/trend KPI charts (introduce a time-series store). #### M7 — OPC UA / MxGateway UX (T13–T17) Dedicated operator Alarm Summary page; MxGateway secured writes (operator+verifier); OPC UA address-space search + `BrowseNext` paging; type-info surfacing + bulk override CSV import; "Verify endpoint" connectivity button + cert-management UI. #### M8 — Transport (T18, T20) Site-scoped / instance-scoped artifact transport (name-mapping subsystem); per-line/Myers diff for Modified artifacts. #### M9 — Templates & authoring (T22–T26, T28, T30–T32) Template tree search/filter; folder drag-drop + sibling reorder + root context menu; move data connection between sites; connection live-status indicators; base-template versioning "update-derived" flow + multi-level inheritance; strict expression-trigger analysis kind; schema-driven value-entry forms + hover/completion + JSON Schema `$ref`/library; CLI Retry/Discard for cached calls; unified notifications+site-calls outbox page. #### M10 — UI/UX platform (T33–T36, T41) `IDialogService` modal abstraction; design-tokens/CSS-vars + dark-mode/theming; shared pagination+filter component; accessibility pass; Playwright alarm-override UI coverage. ## Dependencies & sequencing - **M1 → M5** — audit hardening builds on the wired purge/reconciliation. - **M6/T11** — depends on introducing a time-series store (new infra; size carefully). - **M9/T26** — base-template versioning is the largest authoring item; may split. - **M4** — runs anytime; cheap and high-clarity, good to interleave. - **M3** — independent; can run in parallel with M1/M2. - Phase 1 (M1–M4) should complete before Phase 2 work starts in earnest, so the foundation is true before features pile on. ## Cross-cutting conventions (per CLAUDE.md) - Each milestone gets its own dated implementation plan in `docs/plans/` and (where useful) a `.tasks.json`. - Design doc + code + entities/repos + actors/services + UI + tests + migrations + deploy config travel together in each slice. - Every milestone is independently shippable: `dotnet build ZB.MOM.WW.ScadaBridge.slnx` green + relevant unit/integration tests pass; cluster-runtime changes rebuilt via `bash docker/deploy.sh`. - M3 and the security items (M2 LDAP re-query) carry adversarial tests (bypass attempts), not just happy-path. - `git diff` review before each commit; related changes committed together with a design-summary message. ## Testing strategy - **M1:** integration tests against the cluster proving purge + reconciliation actually run (not just unit-level actor tests); event-log row assertions per category. - **M2:** behavior-level unit tests per gap; CachedWrite + conditionFilter + per-script-timeout get integration coverage. - **M3:** golden invalid scripts must fail; adversarial forbidden-API bypass corpus must fail to deploy. - **M4:** doc-only — no test impact beyond keeping existing suites green; nested-type validation (#13) gets validator unit tests. - **M5–M10:** standard unit + integration + Playwright (UI) coverage per feature; new infra (M6 time-series store) gets its own integration suite. ## Open items / risks - M3 real-compile may surface latent invalid scripts in existing templates/fixtures — budget for fixture cleanup. - M6 time-series store is the one genuinely-new piece of infrastructure; scope it deliberately (could reuse MS SQL with a rollup table rather than a new dependency). - The Phase 2 roadmap is large; treat each milestone as a separate planning + implementation pass, not a single mega-effort. ## Next step Hand off to the writing-plans skill to produce the detailed, bite-sized implementation plan, starting with **Phase 1 (M1–M4)**. Phase 2 milestones are planned individually as they're picked up.