Add the system-completion design doc (risk-first milestones M1-M10): Phase 1 Stabilize (M1 runtime wiring, M2 correctness, M3 script trust boundary, M4 doc reconciliation) then Phase 2 Expand (M5-M10 feature epics). Scope = all Tier 1/2/4 + in-scope Tier 3 features; T12/T19 deferred to own brainstorm; deliberate anti-goals excluded. Also commit the source audit (stillpending.md).
Design: Completing stillpending.md — System Completion Roadmap
Date: 2026-06-15
Status: Approved (brainstorming session) — ready for per-milestone implementation planning
Source audit: stillpending.md (repo root) — full deferred/partial/unfinished/missing inventory
Scope decision: Complete everything genuinely intended — all of Tier 1, Tier 2, Tier 4, and the in-scope Tier 3 features. Two large Tier 3 items deferred to their own brainstorm; a set of deliberate anti-goals excluded (see Scope).
Goal
Drive stillpending.md to zero for all genuinely-intended functionality: make the silent gaps real, correct the behavioral divergences, reconcile the docs with the shipped architecture, and build out the deferred features that have a real seam and a real reason to exist.
Scope
In scope
- Tier 1 — all 6 silent gaps (documented as working, actually inert).
- Tier 2 — all ~31 genuine code defects / partial behaviors.
- Tier 4 — all doc↔code drift (specs + CLI docs + stale markers).
- Tier 3 — features tagged for build: T1–T11, T13–T18, T20–T26, T28, T30–T36, T41.
Deferred to their own brainstorm (NOT in this plan)
- T12 — Native-alarm ack/shelve/suppress write-back + central alarm tables/history/journal + alarm-driven notifications. Reverses the "native alarms are read-only by design" decision; large enough to warrant its own design session.
- T19 — Direct cluster-to-cluster pull + asymmetric bundle signing + differential/incremental bundles. Three separable large features.
Excluded (deliberate anti-goals — leave as-is)
- T27 Promote-derived-to-base + cross-tenant template libraries.
- T29 WhileTrue trigger mode for alarms (alarms are already level-based).
- T37 Replace SignalR debug-view streaming (working code; pure refactor).
- T38 True air-gapped env2 / 3rd-4th env / --env flag (env2 deliberately shares infra).
- T39 Repo/folder rename away from ScadaBridge (kept to preserve context).
- T40 Rename legacy Transport bundle manifests.
Approach
Risk-first milestones (Approach A). Work is ordered by impact, not by component, and grouped into themed milestones using the repo's existing m1…mN implementation-doc convention (cf. auditlog-m1..m8). Phase 1 (M1–M4) stabilizes — a finite, ship-soon body of work that makes the docs true and the silent gaps real. Phase 2 (M5–M10) expands — an open-ended roadmap of independently-sized feature epics.
Rejected alternatives:
- Component-grouped — cleaner per-component diffs, but spreads the high-risk Tier-1 work across late workstreams and ships no value early.
- Two-track only — good mental model (folded into the Phase 1/2 boundary here), but too coarse for sequencing.
Milestones
Phase 1 — Stabilize
M1 — Runtime wiring (Tier 1: #3, #4, #5, #6)
Wire up behavior that exists in code but is never started, and fill the event-log categories.
- Wire AuditLogPurgeActor into Host bootstrap; drive partition-switch purge (365-day retention). (AuditLogPurgeActor.cs, AkkaHostedService.cs:486)
- Implement IPullAuditEventsClient and instantiate SiteAuditReconciliationActor (periodic PullAuditEvents fallback). (IPullAuditEventsClient.cs, SiteAuditReconciliationActor.cs)
- Site Call Audit: implement the per-site reconciliation pull (changed-since cursor read + central insert-if-not-exists/upsert-on-newer) and schedule PurgeTerminalAsync daily. (SiteCallAuditActor.cs:28-34, SiteCallAuditRepository.cs:213)
- Site Event Logging: inject ISiteEventLogger into AlarmActor/NativeAlarmActor, DeploymentManagerActor, Store-and-Forward, and NotificationOutboxActor; emit Alarm / Deployment / Data-Connection / S&F / Instance-Lifecycle / Notification events, and add script started/completed (Info) alongside the existing error events.
- DoD: actors instantiated and DI-registered; integration tests prove the purge actually switches out partitions and reconciliation back-fills a row dropped from telemetry; event-log rows produced for all 7 categories.
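The Site Call Audit bullet combines two ideas: a changed-since cursor read at the site and an insert-if-not-exists/upsert-on-newer write at central. A minimal sketch of that shape follows. Python is used only to keep it short (the real implementation would live in the .NET actors and repository), and every name in it (CallRow, updated_seq, pull_changed_since, reconcile) is hypothetical rather than taken from the codebase. It assumes each site row carries a monotonically increasing change counter that can serve as the cursor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRow:
    call_id: str
    status: str
    updated_seq: int  # hypothetical per-site change counter used as the cursor


def pull_changed_since(site_rows, cursor):
    """Return rows changed after `cursor` (oldest first) and the advanced cursor."""
    changed = sorted(
        (r for r in site_rows if r.updated_seq > cursor),
        key=lambda r: r.updated_seq,
    )
    new_cursor = changed[-1].updated_seq if changed else cursor
    return changed, new_cursor


def reconcile(central, changed):
    """Insert rows central has never seen; overwrite only when the pulled row is newer."""
    inserted = updated = 0
    for row in changed:
        existing = central.get(row.call_id)
        if existing is None:
            central[row.call_id] = row      # insert-if-not-exists
            inserted += 1
        elif row.updated_seq > existing.updated_seq:
            central[row.call_id] = row      # upsert-on-newer
            updated += 1
        # otherwise central is already current: leave it untouched
    return inserted, updated
```

Because a row is written only when it is missing or strictly newer, replaying the same pull is a no-op, which is what lets reconciliation run as a periodic fallback alongside normal telemetry.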
M2 — Correctness & behavioral gaps (Tier 2)
- Database.CachedWrite: attempt immediately; permanent SQL error returns synchronously as Failed (mirror the API path); only transient errors buffer. (DatabaseGateway.cs:78-204)
- Alarm conditionFilter: build the OPC UA event WhereClause from the filter and honor it in DataConnectionActor routing; same plumbing for MxGateway. (DataConnectionActor.cs:1482, RealOpcUaClient.cs:242)
- Per-script execution timeout: add a field to TemplateScript + flattened config; apply in ScriptExecutionActor/AlarmExecutionActor; fall back to the global default. (SiteRuntimeOptions.cs:31)
- Connection-level diff: add a ConnectionChanges slot to ConfigurationDiff; call ComputeConnectionsDiff from ComputeDiff; surface in the deployment diff UI. (DiffService.cs:174)
- MachineDataDb fail-fast: add the option + startup validation (+ DbContext if the connection is actually consumed). (DatabaseOptions.cs, StartupValidator.cs)
- CI grep-guard: build step that fails on UPDATE/DELETE against AuditLog in the data layer. (Component-AuditLog.md:335)
- LDAP periodic re-query: implement the session-refresh path that re-queries LDAP so interactive roles are never >15 min stale (wire JwtTokenService.RefreshToken/ShouldRefresh or an OnValidatePrincipal revalidation). (security-relevant; pulled out of the session-doc reconcile)
- Low-severity batch (#19–#31): return-type compatibility check; argument type compatibility; native-alarm-source capability validation wired into the deploy pipeline; binding-completeness as a deploy-gating Error (+ "name exists at site" check); debug snapshot/subscribe error response for unknown instance; recursion-limit error → site event log; debug-stream snapshot/stream ordering + timestamp-dedup replay; OPC UA native-alarm transition field population; readiness "required singletons running" probe; register the SiteEventLog active-node purge gate; consume FailedWriteCount in Health Monitoring; reconcile StateTransitionValidator delete-from-NotDeployed with the spec matrix.
- DoD: unit + integration tests per behavior; where the fix corrects code, the spec already matched; where the spec was the divergence, it's updated in the same change.
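The Database.CachedWrite item describes a three-way outcome: try the write immediately, fail synchronously on a permanent error, and buffer only on a transient one. The sketch below illustrates that control flow in Python for brevity. It is not the DatabaseGateway code: the two exception types, the Succeeded and Buffered result names, and the way errors are classified are assumptions for illustration. Only the Failed outcome is named in the plan.

```python
from enum import Enum


class WriteResult(Enum):
    SUCCEEDED = "Succeeded"   # hypothetical name
    FAILED = "Failed"         # the synchronous outcome named in the plan
    BUFFERED = "Buffered"     # hypothetical name


class TransientDbError(Exception):
    """Hypothetical class of retryable errors (timeout, dropped connection)."""


class PermanentDbError(Exception):
    """Hypothetical class of non-retryable errors (constraint violation, bad SQL)."""


def cached_write(execute, statement, buffer):
    """Attempt the write now; buffer for retry only if the failure is transient."""
    try:
        execute(statement)
        return WriteResult.SUCCEEDED
    except PermanentDbError:
        # Retrying cannot help, so report the failure to the caller immediately.
        return WriteResult.FAILED
    except TransientDbError:
        # Worth retrying later: hand the statement to the store-and-forward buffer.
        buffer.append(statement)
        return WriteResult.BUFFERED
```

The design point is that a statement which can never succeed should not enter the buffer at all, so the caller sees the failure at once instead of after the retries are exhausted.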
M3 — Script trust boundary (Tier 1: #1, #2)
- Wire the already-referenced Microsoft.CodeAnalysis.CSharp.Scripting into ScriptCompiler.TryCompile for a real semantic compile (errors block deploy). (ScriptCompiler.cs:56-104, ValidationService.cs:128)
- Replace the advisory substring forbidden-API scan with Roslyn symbol/semantic analysis that resolves aliases, using static, and global::; coordinate design-time enforcement with the Site Runtime sandbox so the trust boundary is authoritative. (ScriptCompiler.cs:14-22)
- Apply the same real compile to shared scripts. (SharedScriptService.cs:168-206)
- DoD: semantically-invalid C# fails validation; adversarial bypass tests (alias / using static / global:: reaching a forbidden API) fail to deploy.
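To see why a substring scan is only advisory, it helps to compare it with a check that resolves names before matching. The sketch below is an analogy in Python, not the Roslyn implementation. It uses Python's ast module, with import aliases standing in for C# using aliases and `from x import y` standing in for using static. The deny-list and function names are hypothetical. It resolves import aliases only, which is far weaker than a full semantic model.

```python
import ast

FORBIDDEN = {"os.system", "subprocess.run"}  # hypothetical deny-list


def substring_scan(source):
    """Advisory check: flags the text only if a forbidden name appears literally."""
    return any(name in source for name in FORBIDDEN)


def _qualified(node, aliases):
    """Rebuild a dotted name from a Name/Attribute chain, expanding import aliases."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if not isinstance(node, ast.Name):
        return None
    parts.append(aliases.get(node.id, node.id))
    return ".".join(reversed(parts))


def resolved_scan(source):
    """Return the forbidden APIs actually referenced, after resolving import aliases."""
    tree = ast.parse(source)
    aliases = {}  # local name -> fully qualified name
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for a in node.names:
                if a.asname:
                    aliases[a.asname] = a.name          # import os as o
                else:
                    root = a.name.split(".")[0]
                    aliases[root] = root                # import os.path binds "os"
        elif isinstance(node, ast.ImportFrom) and node.module:
            for a in node.names:
                aliases[a.asname or a.name] = f"{node.module}.{a.name}"
    hits = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Name, ast.Attribute)):
            name = _qualified(node, aliases)
            if name in FORBIDDEN:
                hits.add(name)
    return hits
```

For example, `import os as o` followed by `o.system('x')` never contains the text `os.system`, so the substring scan passes it while the resolved scan reports it. A string literal that merely mentions `os.system` shows the opposite failure: the substring scan flags it although nothing is called. The adversarial corpus in the DoD targets the same two failure modes.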
M4 — Doc reconciliation (Tier 4, parallelizable)
- Update specs to the shipped re-architecture: Component-ConfigurationDatabase.md (collapsed AuditLog schema), Component-Commons.md (AuditEvent → ZB.MOM.WW.Audit package, ApiKey retirement, undocumented types/interfaces), Component-InboundAPI.md (Bearer auth, audit write timing, type validation), Component-NotificationService.md / NotificationType (Teams status), Security role names, SiteCall field names, AuditKind vocabulary.
- CLI docs: document the bundle group; fix README option-name drift (the README is the stale doc; Component-CLI.md mostly matches code); correct audit query options.
- Clear stale "deferred" markers/comments for shipped features (Transport CLI, SourceNode, Site Call Audit relay, bundle-import audit filter, M5 redaction comments, AuditLogPage.HandleRowSelected).
- Code-vs-doc dispositions: doc-update (code authoritative) for Bearer auth, fire-and-forget audit timing, JWT-in-cookie→cookie-only session model, ExecuteReader/DbWrite kind. Code-change (build it) for nested Object/List validation (#13), done in M2's validator work, and the LDAP re-query (M2).
- DoD: no remaining doc↔code contradiction for in-scope components; CLI docs match registered commands/options.
Phase 2 — Expand
M5 — Audit hardening (T1–T8)
Hash-chain tamper evidence (off by default, verify-chain made real); Parquet export/archival (replace the 501); per-channel retention overrides; tag-cascade for ParentExecutionId (thread writing-execution id through trigger-driven runs); ExecutionId/ParentExecutionId + SourceNode backfill on historical rows; per-node stuck-count KPIs; structured response capture (headers/content-type, inbound request headers, per-method opt-out, AuditInboundCeilingHits metric); CLI audit tree.
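Hash-chain tamper evidence generally means that each row stores a hash computed over its own content plus the previous row's hash, so verification can recompute the chain and report the first row that no longer matches. The sketch below shows that idea in Python. The row layout, the all-zero genesis value, the choice of SHA-256, and the JSON canonicalization are all assumptions for illustration, not the AuditLog design.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical "previous hash" for the first row


def row_hash(prev_hash, payload):
    """Hash the previous row's hash together with a canonical form of this row."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain, payload):
    """Append a row linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": row_hash(prev, payload)})


def verify_chain(chain):
    """Return the index of the first broken row, or None if the chain is intact."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return i
        prev = row["hash"]
    return None
```

Editing a row changes its recomputed hash, and deleting a row breaks the link stored in its successor, so both are caught by a single pass over the chain.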
M6 — Notifications (T9–T11)
Teams + other non-Email delivery adapters behind the existing INotificationDeliveryAdapter seam; NotificationType enum values; Central UI notification-list Type selector; historical/trend KPI charts (introduce a time-series store).
M7 — OPC UA / MxGateway UX (T13–T17)
Dedicated operator Alarm Summary page; MxGateway secured writes (operator+verifier); OPC UA address-space search + BrowseNext paging; type-info surfacing + bulk override CSV import; "Verify endpoint" connectivity button + cert-management UI.
M8 — Transport (T18, T20)
Site-scoped / instance-scoped artifact transport (name-mapping subsystem); per-line/Myers diff for Modified artifacts.
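As a rough picture of what a per-line diff for a Modified artifact could produce, the sketch below uses Python's standard difflib to emit a unified diff between two versions of a text artifact. difflib uses its own sequence-matching heuristic rather than the Myers algorithm named above, so this is a stand-in for the output shape only. The function name and the "(target)" / "(bundle)" labels are hypothetical.

```python
import difflib


def line_diff(old_text, new_text, name="artifact"):
    """Return unified-diff lines between two versions of a text artifact."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=f"{name} (target)",   # hypothetical label: version already on the target
            tofile=f"{name} (bundle)",     # hypothetical label: version carried by the bundle
            lineterm="",
        )
    )


if __name__ == "__main__":
    for line in line_diff("a\nb\nc", "a\nB\nc", name="script.cs"):
        print(line)
```

Unchanged lines are prefixed with a space, removals with `-`, and additions with `+`. Two identical inputs yield an empty list, which gives a cheap way to tell a genuinely Modified artifact from one that only looks different at the metadata level.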
M9 — Templates & authoring (T22–T26, T28, T30–T32)
Template tree search/filter; folder drag-drop + sibling reorder + root context menu; move data connection between sites; connection live-status indicators; base-template versioning "update-derived" flow + multi-level inheritance; strict expression-trigger analysis kind; schema-driven value-entry forms + hover/completion + JSON Schema $ref/library; CLI Retry/Discard for cached calls; unified notifications+site-calls outbox page.
M10 — UI/UX platform (T33–T36, T41)
IDialogService modal abstraction; design-tokens/CSS-vars + dark-mode/theming; shared pagination+filter component; accessibility pass; Playwright alarm-override UI coverage.
Dependencies & sequencing
- M1 → M5 — audit hardening builds on the wired purge/reconciliation.
- M6/T11 — depends on introducing a time-series store (new infra; size carefully).
- M9/T26 — base-template versioning is the largest authoring item; may split.
- M4 — runs anytime; cheap and high-clarity, good to interleave.
- M3 — independent; can run in parallel with M1/M2.
- Phase 1 (M1–M4) should complete before Phase 2 work starts in earnest, so the foundation is true before features pile on.
Cross-cutting conventions (per CLAUDE.md)
- Each milestone gets its own dated implementation plan in docs/plans/ and (where useful) a .tasks.json.
- Design doc + code + entities/repos + actors/services + UI + tests + migrations + deploy config travel together in each slice.
- Every milestone is independently shippable: dotnet build ZB.MOM.WW.ScadaBridge.slnx green + relevant unit/integration tests pass; cluster-runtime changes rebuilt via bash docker/deploy.sh.
- M3 and the security items (M2 LDAP re-query) carry adversarial tests (bypass attempts), not just happy-path.
- git diff review before each commit; related changes committed together with a design-summary message.
Testing strategy
- M1: integration tests against the cluster proving purge + reconciliation actually run (not just unit-level actor tests); event-log row assertions per category.
- M2: behavior-level unit tests per gap; CachedWrite + conditionFilter + per-script-timeout get integration coverage.
- M3: golden invalid scripts must fail; adversarial forbidden-API bypass corpus must fail to deploy.
- M4: doc-only — no test impact beyond keeping existing suites green; nested-type validation (#13) gets validator unit tests.
- M5–M10: standard unit + integration + Playwright (UI) coverage per feature; new infra (M6 time-series store) gets its own integration suite.
Open items / risks
- M3 real-compile may surface latent invalid scripts in existing templates/fixtures — budget for fixture cleanup.
- M6 time-series store is the one genuinely-new piece of infrastructure; scope it deliberately (could reuse MS SQL with a rollup table rather than a new dependency).
- The Phase 2 roadmap is large; treat each milestone as a separate planning + implementation pass, not a single mega-effort.
Next step
Hand off to the writing-plans skill to produce the detailed, bite-sized implementation plan, starting with Phase 1 (M1–M4). Phase 2 milestones are planned individually as they're picked up.