docs(plans): completion roadmap for stillpending.md audit
Add the system-completion design doc (risk-first milestones M1-M10): Phase 1 Stabilize (M1 runtime wiring, M2 correctness, M3 script trust boundary, M4 doc reconciliation) then Phase 2 Expand (M5-M10 feature epics). Scope = all Tier 1/2/4 + in-scope Tier 3 features; T12/T19 deferred to own brainstorm; deliberate anti-goals excluded. Also commit the source audit (stillpending.md).
This commit is contained in:
@@ -0,0 +1,129 @@
|
||||
# Design: Completing `stillpending.md` — System Completion Roadmap
|
||||
|
||||
**Date:** 2026-06-15
|
||||
**Status:** Approved (brainstorming session) — ready for per-milestone implementation planning
|
||||
**Source audit:** `stillpending.md` (repo root) — full deferred/partial/unfinished/missing inventory
|
||||
**Scope decision:** Complete *everything* genuinely intended — all of Tier 1, Tier 2, Tier 4, and the in-scope Tier 3 features. Two large Tier 3 items deferred to their own brainstorm; a set of deliberate anti-goals excluded (see Scope).
|
||||
|
||||
## Goal
|
||||
|
||||
Drive `stillpending.md` to zero for all genuinely-intended functionality: make the silent gaps real, correct the behavioral divergences, reconcile the docs with the shipped architecture, and build out the deferred features that have a real seam and a real reason to exist.
|
||||
|
||||
## Scope
|
||||
|
||||
### In scope
|
||||
- **Tier 1** — all 6 silent gaps (documented as working, actually inert).
|
||||
- **Tier 2** — all ~31 genuine code defects / partial behaviors.
|
||||
- **Tier 4** — all doc↔code drift (specs + CLI docs + stale markers).
|
||||
- **Tier 3** — features tagged for build: T1–T11, T13–T18, T20–T26, T28, T30–T36, T41.
|
||||
|
||||
### Deferred to their own brainstorm (NOT in this plan)
|
||||
- **T12** — Native-alarm ack/shelve/suppress write-back + central alarm tables/history/journal + alarm-driven notifications. Reverses the "native alarms are read-only by design" decision; large enough to warrant its own design session.
|
||||
- **T19** — Direct cluster-to-cluster pull + asymmetric bundle signing + differential/incremental bundles. Three separable large features.
|
||||
|
||||
### Excluded (deliberate anti-goals — leave as-is)
|
||||
- **T27** Promote-derived-to-base + cross-tenant template libraries.
|
||||
- **T29** WhileTrue trigger mode for alarms (alarms are already level-based).
|
||||
- **T37** Replace SignalR debug-view streaming (working code; pure refactor).
|
||||
- **T38** True air-gapped env2 / 3rd-4th env / `--env` flag (env2 deliberately shares infra).
|
||||
- **T39** Repo/folder rename away from ScadaBridge (kept to preserve context).
|
||||
- **T40** Rename legacy Transport bundle manifests.
|
||||
|
||||
## Approach
|
||||
|
||||
**Risk-first milestones (Approach A).** Work is ordered by impact, not by component, and grouped into themed milestones using the repo's existing `m1…mN` implementation-doc convention (cf. `auditlog-m1..m8`). Phase 1 (M1–M4) *stabilizes* — a finite, ship-soon body of work that makes the docs true and the silent gaps real. Phase 2 (M5–M10) *expands* — an open-ended roadmap of independently-sized feature epics.
|
||||
|
||||
Rejected alternatives:
|
||||
- **Component-grouped** — cleaner per-component diffs, but spreads the high-risk Tier-1 work across late workstreams and ships no value early.
|
||||
- **Two-track only** — good mental model (folded into the Phase 1/2 boundary here), but too coarse for sequencing.
|
||||
|
||||
## Milestones
|
||||
|
||||
### Phase 1 — Stabilize
|
||||
|
||||
#### M1 — Runtime wiring (Tier 1: #3, #4, #5, #6)
|
||||
Wire up behavior that exists in code but is never started, and fill the event-log categories.
|
||||
- Wire `AuditLogPurgeActor` into Host bootstrap; drive partition-switch purge (365-day retention). (`AuditLogPurgeActor.cs`, `AkkaHostedService.cs:486`)
|
||||
- Implement `IPullAuditEventsClient` and instantiate `SiteAuditReconciliationActor` (periodic `PullAuditEvents` fallback). (`IPullAuditEventsClient.cs`, `SiteAuditReconciliationActor.cs`)
|
||||
- Site Call Audit: implement the per-site reconciliation pull (changed-since cursor read + central insert-if-not-exists/upsert-on-newer) and schedule `PurgeTerminalAsync` daily. (`SiteCallAuditActor.cs:28-34`, `SiteCallAuditRepository.cs:213`)
|
||||
- Site Event Logging: inject `ISiteEventLogger` into `AlarmActor`/`NativeAlarmActor`, `DeploymentManagerActor`, Store-and-Forward, and `NotificationOutboxActor`; emit Alarm / Deployment / Data-Connection / S&F / Instance-Lifecycle / Notification events, and add script *started*/*completed* (Info) alongside the existing error events.
|
||||
- **DoD:** actors instantiated and DI-registered; integration tests prove the purge actually switches out partitions and reconciliation back-fills a row dropped from telemetry; event-log rows produced for all 7 categories.
|
||||
|
||||
#### M2 — Correctness & behavioral gaps (Tier 2)
|
||||
- `Database.CachedWrite`: attempt immediately; permanent SQL error returns synchronously as `Failed` (mirror the API path); only transient errors buffer. (`DatabaseGateway.cs:78-204`)
|
||||
- Alarm `conditionFilter`: build the OPC UA event `WhereClause` from the filter and honor it in `DataConnectionActor` routing; same plumbing for MxGateway. (`DataConnectionActor.cs:1482`, `RealOpcUaClient.cs:242`)
|
||||
- Per-script execution timeout: add a field to `TemplateScript` + flattened config; apply in `ScriptExecutionActor`/`AlarmExecutionActor`; fall back to the global default. (`SiteRuntimeOptions.cs:31`)
|
||||
- Connection-level diff: add a `ConnectionChanges` slot to `ConfigurationDiff`; call `ComputeConnectionsDiff` from `ComputeDiff`; surface in the deployment diff UI. (`DiffService.cs:174`)
|
||||
- `MachineDataDb` fail-fast: add the option + startup validation (+ DbContext if the connection is actually consumed). (`DatabaseOptions.cs`, `StartupValidator.cs`)
|
||||
- CI grep-guard: build step that fails on `UPDATE`/`DELETE` against `AuditLog` in the data layer. (`Component-AuditLog.md:335`)
|
||||
- LDAP periodic re-query: implement the session-refresh path that re-queries LDAP so interactive roles are never >15 min stale (wire `JwtTokenService.RefreshToken`/`ShouldRefresh` or an `OnValidatePrincipal` revalidation). (security-relevant; pulled out of the session-doc reconcile)
|
||||
- Low-severity batch (#19–#31): return-type compatibility check; argument *type* compatibility; native-alarm-source capability validation wired into the deploy pipeline; binding-completeness as a deploy-gating Error (+ "name exists at site" check); debug snapshot/subscribe error response for unknown instance; recursion-limit error → site event log; debug-stream snapshot/stream ordering + timestamp-dedup replay; OPC UA native-alarm transition field population; readiness "required singletons running" probe; register the SiteEventLog active-node purge gate; consume `FailedWriteCount` in Health Monitoring; reconcile `StateTransitionValidator` delete-from-`NotDeployed` with the spec matrix.
|
||||
- **DoD:** unit + integration tests per behavior; where the fix corrects code, the spec already matched; where the spec was the divergence, it's updated in the same change.
|
||||
|
||||
#### M3 — Script trust boundary (Tier 1: #1, #2)
|
||||
- Wire the already-referenced `Microsoft.CodeAnalysis.CSharp.Scripting` into `ScriptCompiler.TryCompile` for a real semantic compile (errors block deploy). (`ScriptCompiler.cs:56-104`, `ValidationService.cs:128`)
|
||||
- Replace the advisory substring forbidden-API scan with Roslyn symbol/semantic analysis that resolves aliases, `using static`, and `global::`; coordinate design-time enforcement with the Site Runtime sandbox so the trust boundary is authoritative. (`ScriptCompiler.cs:14-22`)
|
||||
- Apply the same real compile to shared scripts. (`SharedScriptService.cs:168-206`)
|
||||
- **DoD:** semantically-invalid C# fails validation; adversarial bypass tests (alias / `using static` / `global::` reaching a forbidden API) fail to deploy.
|
||||
|
||||
#### M4 — Doc reconciliation (Tier 4, parallelizable)
|
||||
- Update specs to the shipped re-architecture: `Component-ConfigurationDatabase.md` (collapsed `AuditLog` schema), `Component-Commons.md` (`AuditEvent`→`ZB.MOM.WW.Audit` package, `ApiKey` retirement, undocumented types/interfaces), `Component-InboundAPI.md` (Bearer auth, audit write timing, type validation), `Component-NotificationService.md`/`NotificationType` (Teams status), Security role names, `SiteCall` field names, `AuditKind` vocabulary.
|
||||
- CLI docs: document the `bundle` group; fix README option-name drift (the README is the stale doc; `Component-CLI.md` mostly matches code); correct `audit query` options.
|
||||
- Clear stale "deferred" markers/comments for shipped features (Transport CLI, `SourceNode`, Site Call Audit relay, bundle-import audit filter, M5 redaction comments, `AuditLogPage.HandleRowSelected`).
|
||||
- **Code-vs-doc dispositions:** doc-update (code authoritative) for Bearer auth, fire-and-forget audit timing, JWT-in-cookie→cookie-only session model, `ExecuteReader`/`DbWrite` kind. Code-change (build it) for nested `Object`/`List` validation (#13) — done in M2's validator work — and the LDAP re-query (M2).
|
||||
- **DoD:** no remaining doc↔code contradiction for in-scope components; CLI docs match registered commands/options.
|
||||
|
||||
### Phase 2 — Expand
|
||||
|
||||
#### M5 — Audit hardening (T1–T8)
|
||||
Hash-chain tamper evidence (off by default, `verify-chain` made real); Parquet export/archival (replace the 501); per-channel retention overrides; tag-cascade for `ParentExecutionId` (thread writing-execution id through trigger-driven runs); ExecutionId/ParentExecutionId + SourceNode backfill on historical rows; per-node stuck-count KPIs; structured response capture (headers/content-type, inbound request headers, per-method opt-out, `AuditInboundCeilingHits` metric); CLI `audit tree`.
|
||||
|
||||
#### M6 — Notifications (T9–T11)
|
||||
Teams + other non-Email delivery adapters behind the existing `INotificationDeliveryAdapter` seam; `NotificationType` enum values; Central UI notification-list `Type` selector; historical/trend KPI charts (introduce a time-series store).
|
||||
|
||||
#### M7 — OPC UA / MxGateway UX (T13–T17)
|
||||
Dedicated operator Alarm Summary page; MxGateway secured writes (operator+verifier); OPC UA address-space search + `BrowseNext` paging; type-info surfacing + bulk override CSV import; "Verify endpoint" connectivity button + cert-management UI.
|
||||
|
||||
#### M8 — Transport (T18, T20)
|
||||
Site-scoped / instance-scoped artifact transport (name-mapping subsystem); per-line/Myers diff for Modified artifacts.
|
||||
|
||||
#### M9 — Templates & authoring (T22–T26, T28, T30–T32)
|
||||
Template tree search/filter; folder drag-drop + sibling reorder + root context menu; move data connection between sites; connection live-status indicators; base-template versioning "update-derived" flow + multi-level inheritance; strict expression-trigger analysis kind; schema-driven value-entry forms + hover/completion + JSON Schema `$ref`/library; CLI Retry/Discard for cached calls; unified notifications+site-calls outbox page.
|
||||
|
||||
#### M10 — UI/UX platform (T33–T36, T41)
|
||||
`IDialogService` modal abstraction; design-tokens/CSS-vars + dark-mode/theming; shared pagination+filter component; accessibility pass; Playwright alarm-override UI coverage.
|
||||
|
||||
## Dependencies & sequencing
|
||||
|
||||
- **M1 → M5** — audit hardening builds on the wired purge/reconciliation.
|
||||
- **M6/T11** — depends on introducing a time-series store (new infra; size carefully).
|
||||
- **M9/T26** — base-template versioning is the largest authoring item; may split.
|
||||
- **M4** — runs anytime; cheap and high-clarity, good to interleave.
|
||||
- **M3** — independent; can run in parallel with M1/M2.
|
||||
- Phase 1 (M1–M4) should complete before Phase 2 work starts in earnest, so the foundation is true before features pile on.
|
||||
|
||||
## Cross-cutting conventions (per CLAUDE.md)
|
||||
|
||||
- Each milestone gets its own dated implementation plan in `docs/plans/` and (where useful) a `.tasks.json`.
|
||||
- Design doc + code + entities/repos + actors/services + UI + tests + migrations + deploy config travel together in each slice.
|
||||
- Every milestone is independently shippable: `dotnet build ZB.MOM.WW.ScadaBridge.slnx` green + relevant unit/integration tests pass; cluster-runtime changes rebuilt via `bash docker/deploy.sh`.
|
||||
- M3 and the security items (M2 LDAP re-query) carry adversarial tests (bypass attempts), not just happy-path.
|
||||
- `git diff` review before each commit; related changes committed together with a design-summary message.
|
||||
|
||||
## Testing strategy
|
||||
|
||||
- **M1:** integration tests against the cluster proving purge + reconciliation actually run (not just unit-level actor tests); event-log row assertions per category.
|
||||
- **M2:** behavior-level unit tests per gap; CachedWrite + conditionFilter + per-script-timeout get integration coverage.
|
||||
- **M3:** golden invalid scripts must fail; adversarial forbidden-API bypass corpus must fail to deploy.
|
||||
- **M4:** doc-only — no test impact beyond keeping existing suites green; nested-type validation (#13) gets validator unit tests.
|
||||
- **M5–M10:** standard unit + integration + Playwright (UI) coverage per feature; new infra (M6 time-series store) gets its own integration suite.
|
||||
|
||||
## Open items / risks
|
||||
|
||||
- M3 real-compile may surface latent invalid scripts in existing templates/fixtures — budget for fixture cleanup.
|
||||
- M6 time-series store is the one genuinely-new piece of infrastructure; scope it deliberately (could reuse MS SQL with a rollup table rather than a new dependency).
|
||||
- The Phase 2 roadmap is large; treat each milestone as a separate planning + implementation pass, not a single mega-effort.
|
||||
|
||||
## Next step
|
||||
|
||||
Hand off to the writing-plans skill to produce the detailed, bite-sized implementation plan, starting with **Phase 1 (M1–M4)**. Phase 2 milestones are planned individually as they're picked up.
|
||||
Reference in New Issue
Block a user