Files
ScadaBridge/docs/plans/2026-06-15-stillpending-completion-design.md
T

140 lines
15 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Design: Completing `stillpending.md` — System Completion Roadmap
**Date:** 2026-06-15
**Status:** Approved (brainstorming session) — ready for per-milestone implementation planning
**Source audit:** `stillpending.md` (repo root) — full deferred/partial/unfinished/missing inventory
**Scope decision:** Complete *everything* genuinely intended — all of Tier 1, Tier 2, Tier 4, and the in-scope Tier 3 features. Two large Tier 3 items deferred to their own brainstorm; a set of deliberate anti-goals excluded (see Scope).
## Goal
Drive `stillpending.md` to zero for all genuinely-intended functionality: make the silent gaps real, correct the behavioral divergences, reconcile the docs with the shipped architecture, and build out the deferred features that have a real seam and a real reason to exist.
## Scope
### In scope
- **Tier 1** — all 6 silent gaps (documented as working, actually inert).
- **Tier 2** — all ~31 genuine code defects / partial behaviors.
- **Tier 4** — all doc↔code drift (specs + CLI docs + stale markers).
- **Tier 3** — features tagged for build: T1T11, T13T18, T20T26, T28, T30T36, T41.
### Deferred to their own brainstorm (NOT in this plan)
- **T12** — Native-alarm ack/shelve/suppress write-back + central alarm tables/history/journal + alarm-driven notifications. Reverses the "native alarms are read-only by design" decision; large enough to warrant its own design session.
- **T19** — Direct cluster-to-cluster pull + asymmetric bundle signing + differential/incremental bundles. Three separable large features.
### Excluded (deliberate anti-goals — leave as-is)
- **T27** Promote-derived-to-base + cross-tenant template libraries.
- **T29** WhileTrue trigger mode for alarms (alarms are already level-based).
- **T37** Replace SignalR debug-view streaming (working code; pure refactor).
- **T38** True air-gapped env2 / 3rd-4th env / `--env` flag (env2 deliberately shares infra).
- **T39** Repo/folder rename away from ScadaBridge (kept to preserve context).
- **T40** Rename legacy Transport bundle manifests.
## Approach
**Risk-first milestones (Approach A).** Work is ordered by impact, not by component, and grouped into themed milestones using the repo's existing `m1…mN` implementation-doc convention (cf. `auditlog-m1..m8`). Phase 1 (M1M4) *stabilizes* — a finite, ship-soon body of work that makes the docs true and the silent gaps real. Phase 2 (M5M10) *expands* — an open-ended roadmap of independently-sized feature epics.
Rejected alternatives:
- **Component-grouped** — cleaner per-component diffs, but spreads the high-risk Tier-1 work across late workstreams and ships no value early.
- **Two-track only** — good mental model (folded into the Phase 1/2 boundary here), but too coarse for sequencing.
## Milestones
### Phase 1 — Stabilize
#### M1 — Runtime wiring (Tier 1: #3, #4, #5, #6)
Wire up behavior that exists in code but is never started, and fill the event-log categories.
- Wire `AuditLogPurgeActor` into Host bootstrap; drive partition-switch purge (365-day retention). (`AuditLogPurgeActor.cs`, `AkkaHostedService.cs:486`)
- Implement `IPullAuditEventsClient` and instantiate `SiteAuditReconciliationActor` (periodic `PullAuditEvents` fallback). (`IPullAuditEventsClient.cs`, `SiteAuditReconciliationActor.cs`)
- Site Call Audit: implement the per-site reconciliation pull (changed-since cursor read + central insert-if-not-exists/upsert-on-newer) and schedule `PurgeTerminalAsync` daily. (`SiteCallAuditActor.cs:28-34`, `SiteCallAuditRepository.cs:213`)
- Site Event Logging: inject `ISiteEventLogger` into `AlarmActor`/`NativeAlarmActor`, `DeploymentManagerActor`, Store-and-Forward, and `NotificationOutboxActor`; emit Alarm / Deployment / Data-Connection / S&F / Instance-Lifecycle / Notification events, and add script *started*/*completed* (Info) alongside the existing error events.
- **DoD:** actors instantiated and DI-registered; integration tests prove the purge actually switches out partitions and reconciliation back-fills a row dropped from telemetry; event-log rows produced for all 7 categories.
#### M2 — Correctness & behavioral gaps (Tier 2)
- `Database.CachedWrite`: attempt immediately; permanent SQL error returns synchronously as `Failed` (mirror the API path); only transient errors buffer. (`DatabaseGateway.cs:78-204`)
- Alarm `conditionFilter`: build the OPC UA event `WhereClause` from the filter and honor it in `DataConnectionActor` routing; same plumbing for MxGateway. (`DataConnectionActor.cs:1482`, `RealOpcUaClient.cs:242`)
- Per-script execution timeout: add a field to `TemplateScript` + flattened config; apply in `ScriptExecutionActor`/`AlarmExecutionActor`; fall back to the global default. (`SiteRuntimeOptions.cs:31`)
- Connection-level diff: add a `ConnectionChanges` slot to `ConfigurationDiff`; call `ComputeConnectionsDiff` from `ComputeDiff`; surface in the deployment diff UI. (`DiffService.cs:174`)
- `MachineDataDb` fail-fast: add the option + startup validation (+ DbContext if the connection is actually consumed). (`DatabaseOptions.cs`, `StartupValidator.cs`)
- CI grep-guard: build step that fails on `UPDATE`/`DELETE` against `AuditLog` in the data layer. (`Component-AuditLog.md:335`)
- LDAP periodic re-query: implement the session-refresh path that re-queries LDAP so interactive roles are never >15 min stale (wire `JwtTokenService.RefreshToken`/`ShouldRefresh` or an `OnValidatePrincipal` revalidation). (security-relevant; pulled out of the session-doc reconcile)
- Low-severity batch (#19#31): return-type compatibility check; argument *type* compatibility; native-alarm-source capability validation wired into the deploy pipeline; binding-completeness as a deploy-gating Error (+ "name exists at site" check); debug snapshot/subscribe error response for unknown instance; recursion-limit error → site event log; debug-stream snapshot/stream ordering + timestamp-dedup replay; OPC UA native-alarm transition field population; readiness "required singletons running" probe; register the SiteEventLog active-node purge gate; consume `FailedWriteCount` in Health Monitoring; reconcile `StateTransitionValidator` delete-from-`NotDeployed` with the spec matrix.
- **DoD:** unit + integration tests per behavior; where the fix corrects code, the spec already matched; where the spec was the divergence, it's updated in the same change.
#### M3 — Script trust boundary (Tier 1: #1, #2)
- Wire the already-referenced `Microsoft.CodeAnalysis.CSharp.Scripting` into `ScriptCompiler.TryCompile` for a real semantic compile (errors block deploy). (`ScriptCompiler.cs:56-104`, `ValidationService.cs:128`)
- Replace the advisory substring forbidden-API scan with Roslyn symbol/semantic analysis that resolves aliases, `using static`, and `global::`; coordinate design-time enforcement with the Site Runtime sandbox so the trust boundary is authoritative. (`ScriptCompiler.cs:14-22`)
- Apply the same real compile to shared scripts. (`SharedScriptService.cs:168-206`)
- **DoD:** semantically-invalid C# fails validation; adversarial bypass tests (alias / `using static` / `global::` reaching a forbidden API) fail to deploy.
#### M4 — Doc reconciliation (Tier 4, parallelizable)
- Update specs to the shipped re-architecture: `Component-ConfigurationDatabase.md` (collapsed `AuditLog` schema), `Component-Commons.md` (`AuditEvent``ZB.MOM.WW.Audit` package, `ApiKey` retirement, undocumented types/interfaces), `Component-InboundAPI.md` (Bearer auth, audit write timing, type validation), `Component-NotificationService.md`/`NotificationType` (Teams status), Security role names, `SiteCall` field names, `AuditKind` vocabulary.
- CLI docs: document the `bundle` group; fix README option-name drift (the README is the stale doc; `Component-CLI.md` mostly matches code); correct `audit query` options.
- Clear stale "deferred" markers/comments for shipped features (Transport CLI, `SourceNode`, Site Call Audit relay, bundle-import audit filter, M5 redaction comments, `AuditLogPage.HandleRowSelected`).
- **Code-vs-doc dispositions:** doc-update (code authoritative) for Bearer auth, fire-and-forget audit timing, JWT-in-cookie→cookie-only session model, `ExecuteReader`/`DbWrite` kind. Code-change (build it) for nested `Object`/`List` validation (#13) — done in M2's validator work — and the LDAP re-query (M2).
- **DoD:** no remaining doc↔code contradiction for in-scope components; CLI docs match registered commands/options.
### Phase 2 — Expand
#### M5 — Audit hardening (T1T8)
Hash-chain tamper evidence (off by default, `verify-chain` made real); Parquet export/archival (replace the 501); per-channel retention overrides; tag-cascade for `ParentExecutionId` (thread writing-execution id through trigger-driven runs); ExecutionId/ParentExecutionId + SourceNode backfill on historical rows; per-node stuck-count KPIs; structured response capture (headers/content-type, inbound request headers, per-method opt-out, `AuditInboundCeilingHits` metric); CLI `audit tree`.
#### M6 — KPI History & Trends (T11 delivered; T9/T10 deferred)
Reshaped during the 2026-06-17 brainstorm (see `docs/plans/2026-06-17-m6-kpi-history-design.md`):
- **T11 — DELIVERED** as the reusable **KPI-history backbone** (#26 KpiHistory), promoted from a notifications-only feature. A tall/EAV `KpiSample` store in **central MS SQL** (no new infra — supersedes the original "point-in-time only, no time-series store" stance), a `KpiHistoryRecorderActor` cluster singleton (`kpi-history-recorder`, not readiness-gated, best-effort with per-source isolation) sampling DI-registered `IKpiSampleSource`s every minute, a bucketed `GetRawSeriesAsync` + `KpiSeriesBucketer` query + scoped `KpiHistoryQueryService`, and a reusable custom-SVG `KpiTrendChart` (no third-party charting lib). Trends shipped for **all** current KPI sources — Notification Outbox, Site Call Audit, Audit Log, and Site Health — across four UI surfaces.
- **T9 (Teams + other non-Email delivery adapters behind `INotificationDeliveryAdapter`) — DEFERRED to the next major version.** The seam exists; no code now. Transport choice (Incoming Webhook vs Microsoft Graph) and the Teams list-targeting model remain to be designed.
- **T10 (`NotificationType` enum values + Central UI notification-list `Type` selector) — DEFERRED with T9.** A Type selector has no purpose until a second delivery type exists.
#### M7 — OPC UA / MxGateway UX (T13T17) — **DELIVERED**
Delivered per `docs/plans/2026-06-18-m7-opcua-mxgateway-ux-design.md` (full scope, all five features):
- **T13** — operator Alarm Summary page (`/monitoring/alarms`, read-only, `RequireDeployment`); per-instance `DebugViewSnapshot` fan-out (no central alarm store); shared `AlarmStateBadges`.
- **T14** — MxGateway secured writes: new global `Operator` + `Verifier` roles + `RequireOperator`/`RequireVerifier`; central `PendingSecuredWrite` table + migration; ManagementActor submit/approve/reject/list with **no-self-approval + CAS race guard**, MxGateway-protocol-only; approve relays a `WriteTagRequest` to the site; `SecuredWrite` audit channel/kinds (central direct-write, best-effort); Central UI `/operations/secured-writes`.
- **T15** — OPC UA `BrowseNext` paging + bounded recursive address-space search (`IAddressSpaceSearchable`).
- **T16** — browse type-info (DataType/ValueRank/Writable) + **attribute**-override CSV import (InstanceConfigure InputFile + CLI `instance import-overrides --file`). Native-alarm-source-override CSV import was **deferred** (attribute overrides only).
- **T17** — Verify-endpoint probe (captures-but-never-trusts an untrusted server cert) + **site-local** cert trust (per-node `CertStoreActor`, DeploymentManager broadcast to **both** site nodes; D6) + Admin-gated cert-management UI.
Small follow-ups logged (not blocking): stamp `SourceNode` on the `SecuredWrite` audit rows (currently NULL); an aggregated **live** alarm stream for the summary page (snapshot + poll today); central-persisted, auditable cert trust (site-local today).
#### M8 — Transport (T18, T20)
Site-scoped / instance-scoped artifact transport (name-mapping subsystem); per-line/Myers diff for Modified artifacts.
#### M9 — Templates & authoring (T22T26, T28, T30T32)
Template tree search/filter; folder drag-drop + sibling reorder + root context menu; move data connection between sites; connection live-status indicators; base-template versioning "update-derived" flow + multi-level inheritance; strict expression-trigger analysis kind; schema-driven value-entry forms + hover/completion + JSON Schema `$ref`/library; CLI Retry/Discard for cached calls; unified notifications+site-calls outbox page.
#### M10 — UI/UX platform (T33T36, T41)
`IDialogService` modal abstraction; design-tokens/CSS-vars + dark-mode/theming; shared pagination+filter component; accessibility pass; Playwright alarm-override UI coverage.
## Dependencies & sequencing
- **M1 → M5** — audit hardening builds on the wired purge/reconciliation.
- **M6/T11** — delivered as the #26 KpiHistory backbone; reused **central MS SQL** (a tall/EAV `KpiSample` table) rather than introducing new infra. T9/T10 deferred to the next major version.
- **M9/T26** — base-template versioning is the largest authoring item; may split.
- **M4** — runs anytime; cheap and high-clarity, good to interleave.
- **M3** — independent; can run in parallel with M1/M2.
- Phase 1 (M1M4) should complete before Phase 2 work starts in earnest, so the foundation is true before features pile on.
## Cross-cutting conventions (per CLAUDE.md)
- Each milestone gets its own dated implementation plan in `docs/plans/` and (where useful) a `.tasks.json`.
- Design doc + code + entities/repos + actors/services + UI + tests + migrations + deploy config travel together in each slice.
- Every milestone is independently shippable: `dotnet build ZB.MOM.WW.ScadaBridge.slnx` green + relevant unit/integration tests pass; cluster-runtime changes rebuilt via `bash docker/deploy.sh`.
- M3 and the security items (M2 LDAP re-query) carry adversarial tests (bypass attempts), not just happy-path.
- `git diff` review before each commit; related changes committed together with a design-summary message.
## Testing strategy
- **M1:** integration tests against the cluster proving purge + reconciliation actually run (not just unit-level actor tests); event-log row assertions per category.
- **M2:** behavior-level unit tests per gap; CachedWrite + conditionFilter + per-script-timeout get integration coverage.
- **M3:** golden invalid scripts must fail; adversarial forbidden-API bypass corpus must fail to deploy.
- **M4:** doc-only — no test impact beyond keeping existing suites green; nested-type validation (#13) gets validator unit tests.
- **M5M10:** standard unit + integration + Playwright (UI) coverage per feature; new infra (M6 time-series store) gets its own integration suite.
## Open items / risks
- M3 real-compile may surface latent invalid scripts in existing templates/fixtures — budget for fixture cleanup.
- M6 KPI history (resolved): reused **central MS SQL** with a tall/EAV `KpiSample` table rather than a new dependency, so no genuinely-new infrastructure was introduced.
- The Phase 2 roadmap is large; treat each milestone as a separate planning + implementation pass, not a single mega-effort.
## Next step
Hand off to the writing-plans skill to produce the detailed, bite-sized implementation plan, starting with **Phase 1 (M1M4)**. Phase 2 milestones are planned individually as they're picked up.