145 lines
18 KiB
Markdown
145 lines
18 KiB
Markdown
# Design: Completing `stillpending.md` — System Completion Roadmap
|
||
|
||
**Date:** 2026-06-15
|
||
**Status:** Approved (brainstorming session) — ready for per-milestone implementation planning
|
||
**Source audit:** `stillpending.md` (repo root) — full deferred/partial/unfinished/missing inventory
|
||
**Scope decision:** Complete *everything* genuinely intended — all of Tier 1, Tier 2, Tier 4, and the in-scope Tier 3 features. Two large Tier 3 items deferred to their own brainstorm; a set of deliberate anti-goals excluded (see Scope).
|
||
|
||
## Goal
|
||
|
||
Drive `stillpending.md` to zero for all genuinely-intended functionality: make the silent gaps real, correct the behavioral divergences, reconcile the docs with the shipped architecture, and build out the deferred features that have a real seam and a real reason to exist.
|
||
|
||
## Scope
|
||
|
||
### In scope
|
||
- **Tier 1** — all 6 silent gaps (documented as working, actually inert).
|
||
- **Tier 2** — all ~31 genuine code defects / partial behaviors.
|
||
- **Tier 4** — all doc↔code drift (specs + CLI docs + stale markers).
|
||
- **Tier 3** — features tagged for build: T1–T11, T13–T18, T20–T26, T28, T30–T36, T41.
|
||
|
||
### Deferred to their own brainstorm (NOT in this plan)
|
||
- **T12** — Native-alarm ack/shelve/suppress write-back + central alarm tables/history/journal + alarm-driven notifications. Reverses the "native alarms are read-only by design" decision; large enough to warrant its own design session.
|
||
- **T19** — Direct cluster-to-cluster pull + asymmetric bundle signing + differential/incremental bundles. Three separable large features.
|
||
|
||
### Excluded (deliberate anti-goals — leave as-is)
|
||
- **T27** Promote-derived-to-base + cross-tenant template libraries.
|
||
- **T29** WhileTrue trigger mode for alarms (alarms are already level-based).
|
||
- **T37** Replace SignalR debug-view streaming (working code; pure refactor).
|
||
- **T38** True air-gapped env2 / 3rd-4th env / `--env` flag (env2 deliberately shares infra).
|
||
- **T39** Repo/folder rename away from ScadaBridge (kept to preserve context).
|
||
- **T40** Rename legacy Transport bundle manifests.
|
||
|
||
## Approach
|
||
|
||
**Risk-first milestones (Approach A).** Work is ordered by impact, not by component, and grouped into themed milestones using the repo's existing `m1…mN` implementation-doc convention (cf. `auditlog-m1..m8`). Phase 1 (M1–M4) *stabilizes* — a finite, ship-soon body of work that makes the docs true and the silent gaps real. Phase 2 (M5–M10) *expands* — an open-ended roadmap of independently-sized feature epics.
|
||
|
||
Rejected alternatives:
|
||
- **Component-grouped** — cleaner per-component diffs, but spreads the high-risk Tier-1 work across late workstreams and ships no value early.
|
||
- **Two-track only** — good mental model (folded into the Phase 1/2 boundary here), but too coarse for sequencing.
|
||
|
||
## Milestones
|
||
|
||
### Phase 1 — Stabilize
|
||
|
||
#### M1 — Runtime wiring (Tier 1: #3, #4, #5, #6)
|
||
Wire up behavior that exists in code but is never started, and fill the event-log categories.
|
||
- Wire `AuditLogPurgeActor` into Host bootstrap; drive partition-switch purge (365-day retention). (`AuditLogPurgeActor.cs`, `AkkaHostedService.cs:486`)
|
||
- Implement `IPullAuditEventsClient` and instantiate `SiteAuditReconciliationActor` (periodic `PullAuditEvents` fallback). (`IPullAuditEventsClient.cs`, `SiteAuditReconciliationActor.cs`)
|
||
- Site Call Audit: implement the per-site reconciliation pull (changed-since cursor read + central insert-if-not-exists/upsert-on-newer) and schedule `PurgeTerminalAsync` daily. (`SiteCallAuditActor.cs:28-34`, `SiteCallAuditRepository.cs:213`)
|
||
- Site Event Logging: inject `ISiteEventLogger` into `AlarmActor`/`NativeAlarmActor`, `DeploymentManagerActor`, Store-and-Forward, and `NotificationOutboxActor`; emit Alarm / Deployment / Data-Connection / S&F / Instance-Lifecycle / Notification events, and add script *started*/*completed* (Info) alongside the existing error events.
|
||
- **DoD:** actors instantiated and DI-registered; integration tests prove the purge actually switches out partitions and reconciliation back-fills a row dropped from telemetry; event-log rows produced for all 7 categories.
|
||
|
||
#### M2 — Correctness & behavioral gaps (Tier 2)
|
||
- `Database.CachedWrite`: attempt immediately; permanent SQL error returns synchronously as `Failed` (mirror the API path); only transient errors buffer. (`DatabaseGateway.cs:78-204`)
|
||
- Alarm `conditionFilter`: build the OPC UA event `WhereClause` from the filter and honor it in `DataConnectionActor` routing; same plumbing for MxGateway. (`DataConnectionActor.cs:1482`, `RealOpcUaClient.cs:242`)
|
||
- Per-script execution timeout: add a field to `TemplateScript` + flattened config; apply in `ScriptExecutionActor`/`AlarmExecutionActor`; fall back to the global default. (`SiteRuntimeOptions.cs:31`)
|
||
- Connection-level diff: add a `ConnectionChanges` slot to `ConfigurationDiff`; call `ComputeConnectionsDiff` from `ComputeDiff`; surface in the deployment diff UI. (`DiffService.cs:174`)
|
||
- `MachineDataDb` fail-fast: add the option + startup validation (+ DbContext if the connection is actually consumed). (`DatabaseOptions.cs`, `StartupValidator.cs`)
|
||
- CI grep-guard: build step that fails on `UPDATE`/`DELETE` against `AuditLog` in the data layer. (`Component-AuditLog.md:335`)
|
||
- LDAP periodic re-query: implement the session-refresh path that re-queries LDAP so interactive roles are never >15 min stale (wire `JwtTokenService.RefreshToken`/`ShouldRefresh` or an `OnValidatePrincipal` revalidation). (security-relevant; pulled out of the session-doc reconcile)
|
||
- Low-severity batch (#19–#31): return-type compatibility check; argument *type* compatibility; native-alarm-source capability validation wired into the deploy pipeline; binding-completeness as a deploy-gating Error (+ "name exists at site" check); debug snapshot/subscribe error response for unknown instance; recursion-limit error → site event log; debug-stream snapshot/stream ordering + timestamp-dedup replay; OPC UA native-alarm transition field population; readiness "required singletons running" probe; register the SiteEventLog active-node purge gate; consume `FailedWriteCount` in Health Monitoring; reconcile `StateTransitionValidator` delete-from-`NotDeployed` with the spec matrix.
|
||
- **DoD:** unit + integration tests per behavior; where the fix corrects code, the spec already matched; where the spec was the divergence, it's updated in the same change.
|
||
|
||
#### M3 — Script trust boundary (Tier 1: #1, #2)
|
||
- Wire the already-referenced `Microsoft.CodeAnalysis.CSharp.Scripting` into `ScriptCompiler.TryCompile` for a real semantic compile (errors block deploy). (`ScriptCompiler.cs:56-104`, `ValidationService.cs:128`)
|
||
- Replace the advisory substring forbidden-API scan with Roslyn symbol/semantic analysis that resolves aliases, `using static`, and `global::`; coordinate design-time enforcement with the Site Runtime sandbox so the trust boundary is authoritative. (`ScriptCompiler.cs:14-22`)
|
||
- Apply the same real compile to shared scripts. (`SharedScriptService.cs:168-206`)
|
||
- **DoD:** semantically-invalid C# fails validation; adversarial bypass tests (alias / `using static` / `global::` reaching a forbidden API) fail to deploy.
|
||
|
||
#### M4 — Doc reconciliation (Tier 4, parallelizable)
|
||
- Update specs to the shipped re-architecture: `Component-ConfigurationDatabase.md` (collapsed `AuditLog` schema), `Component-Commons.md` (`AuditEvent`→`ZB.MOM.WW.Audit` package, `ApiKey` retirement, undocumented types/interfaces), `Component-InboundAPI.md` (Bearer auth, audit write timing, type validation), `Component-NotificationService.md`/`NotificationType` (Teams status), Security role names, `SiteCall` field names, `AuditKind` vocabulary.
|
||
- CLI docs: document the `bundle` group; fix README option-name drift (the README is the stale doc; `Component-CLI.md` mostly matches code); correct `audit query` options.
|
||
- Clear stale "deferred" markers/comments for shipped features (Transport CLI, `SourceNode`, Site Call Audit relay, bundle-import audit filter, M5 redaction comments, `AuditLogPage.HandleRowSelected`).
|
||
- **Code-vs-doc dispositions:** doc-update (code authoritative) for Bearer auth, fire-and-forget audit timing, JWT-in-cookie→cookie-only session model, `ExecuteReader`/`DbWrite` kind. Code-change (build it) for nested `Object`/`List` validation (#13) — done in M2's validator work — and the LDAP re-query (M2).
|
||
- **DoD:** no remaining doc↔code contradiction for in-scope components; CLI docs match registered commands/options.
|
||
|
||
### Phase 2 — Expand
|
||
|
||
#### M5 — Audit hardening (T1–T8)
|
||
Hash-chain tamper evidence (off by default, `verify-chain` made real); Parquet export/archival (replace the 501); per-channel retention overrides; tag-cascade for `ParentExecutionId` (thread writing-execution id through trigger-driven runs); ExecutionId/ParentExecutionId + SourceNode backfill on historical rows; per-node stuck-count KPIs; structured response capture (headers/content-type, inbound request headers, per-method opt-out, `AuditInboundCeilingHits` metric); CLI `audit tree`.
|
||
|
||
#### M6 — KPI History & Trends (T11 delivered; T9/T10 deferred)
|
||
Reshaped during the 2026-06-17 brainstorm (see `docs/plans/2026-06-17-m6-kpi-history-design.md`):
|
||
- **T11 — DELIVERED** as the reusable **KPI-history backbone** (#26 KpiHistory), promoted from a notifications-only feature. A tall/EAV `KpiSample` store in **central MS SQL** (no new infra — supersedes the original "point-in-time only, no time-series store" stance), a `KpiHistoryRecorderActor` cluster singleton (`kpi-history-recorder`, not readiness-gated, best-effort with per-source isolation) sampling DI-registered `IKpiSampleSource`s every minute, a bucketed `GetRawSeriesAsync` + `KpiSeriesBucketer` query + scoped `KpiHistoryQueryService`, and a reusable custom-SVG `KpiTrendChart` (no third-party charting lib). Trends shipped for **all** current KPI sources — Notification Outbox, Site Call Audit, Audit Log, and Site Health — across four UI surfaces.
|
||
- **T9 (Teams + other non-Email delivery adapters behind `INotificationDeliveryAdapter`) — DEFERRED to the next major version.** The seam exists; no code now. Transport choice (Incoming Webhook vs Microsoft Graph) and the Teams list-targeting model remain to be designed.
|
||
- **T10 (`NotificationType` enum values + Central UI notification-list `Type` selector) — DEFERRED with T9.** A Type selector has no purpose until a second delivery type exists.
|
||
|
||
#### M7 — OPC UA / MxGateway UX (T13–T17) — **DELIVERED**
|
||
Delivered per `docs/plans/2026-06-18-m7-opcua-mxgateway-ux-design.md` (full scope, all five features):
|
||
- **T13** — operator Alarm Summary page (`/monitoring/alarms`, read-only, `RequireDeployment`); per-instance `DebugViewSnapshot` fan-out (no central alarm store); shared `AlarmStateBadges`.
|
||
- **T14** — MxGateway secured writes: new global `Operator` + `Verifier` roles + `RequireOperator`/`RequireVerifier`; central `PendingSecuredWrite` table + migration; ManagementActor submit/approve/reject/list with **no-self-approval + CAS race guard**, MxGateway-protocol-only; approve relays a `WriteTagRequest` to the site; `SecuredWrite` audit channel/kinds (central direct-write, best-effort); Central UI `/operations/secured-writes`.
|
||
- **T15** — OPC UA `BrowseNext` paging + bounded recursive address-space search (`IAddressSpaceSearchable`).
|
||
- **T16** — browse type-info (DataType/ValueRank/Writable) + **attribute**-override CSV import (InstanceConfigure InputFile + CLI `instance import-overrides --file`). Native-alarm-source-override CSV import was **deferred** (attribute overrides only).
|
||
- **T17** — Verify-endpoint probe (captures-but-never-trusts an untrusted server cert) + **site-local** cert trust (per-node `CertStoreActor`, DeploymentManager broadcast to **both** site nodes; D6) + Admin-gated cert-management UI.
|
||
|
||
Small follow-ups logged (not blocking): stamp `SourceNode` on the `SecuredWrite` audit rows (currently NULL); an aggregated **live** alarm stream for the summary page (snapshot + poll today); central-persisted, auditable cert trust (site-local today).
|
||
|
||
#### M8 — Transport (T18, T20) — **DELIVERED**
|
||
Delivered both features, with the silent gap #16 folded in:
|
||
- **T18** — site/instance-scoped artifact transport. Transport now also moves `Site` definitions, site-scoped `DataConnection`s (protocol connections), and `Instance`s (+ `InstanceAttributeOverride`/`InstanceAlarmOverride`/`InstanceNativeAlarmSourceOverride`/`InstanceConnectionBinding` children + `Area` by name). New **name-mapping subsystem** (`BundleNameMap`/`SiteMapping`/`ConnectionMapping`/`MappingAction` in Commons): auto-match (sites by `SiteIdentifier`, connections by the site-qualified `{SiteIdentifier}/{Name}`, instances by `UniqueName`), operator override via the Central UI import **Map step** (between Passphrase and Diff) or CLI `--map-site`/`--map-connection`/`--create-missing-sites`/`--create-missing-connections`; export gains `--sites`/`--instances` + a UI Sites & Instances picker. D3 **carry-full-config (encrypted secrets)**: Site addresses travel; `DataConnection` `PrimaryConfiguration`/`BackupConfiguration` ride the encrypted `SecretsBlock` (presence-only in diffs). Apply resolves/creates target sites + connections in one EF transaction, upserts instances forced `NotDeployed`, and rewires FKs (`InstanceConnectionBinding.DataConnectionId` via the connection map with a Pass-2; `NativeAlarmSourceOverride.ConnectionNameOverride` to the mapped target). `schemaVersion` 1.0→1.1 (additive); `bundleFormatVersion` stays 1; **no new EF tables/columns**.
|
||
- **T20** — per-line **Myers** diff. New pure `LineDiffer` (custom O(ND), no third-party lib) replaces the coarse `<N lines>` marker for code fields; `ArtifactDiff` embeds a size-capped structured `lineDiff` (hunks + `truncated` + add/remove totals); the import wizard renders Modified code fields via the shared `LineDiffView` component.
|
||
- **#16 folded in** — `ImportResult.StaleInstanceIds` is no longer a stub: for each Overwritten template the importer enumerates deployed target instances whose freshly-flattened revision hash drifts from `DeployedConfigSnapshot.RevisionHash` via the new `IStaleInstanceProbe` seam (Commons; implemented in DeploymentManager), so Confirm shows a real count and Result deep-links the filtered Deployments page.
|
||
|
||
Small follow-ups logged (not blocking): DeploymentManagerRepository hydration for the stale probe; large-bundle/perf hardening. **T19** (direct cluster-to-cluster pull / asymmetric bundle signing / differential bundles) remains **deferred** to its own brainstorm.
|
||
|
||
#### M9 — Templates & authoring (T22–T26, T28, T30–T32)
|
||
Template tree search/filter; folder drag-drop + sibling reorder + root context menu; move data connection between sites; connection live-status indicators; base-template versioning "update-derived" flow + multi-level inheritance; strict expression-trigger analysis kind; schema-driven value-entry forms + hover/completion + JSON Schema `$ref`/library; CLI Retry/Discard for cached calls; unified notifications+site-calls outbox page.
|
||
|
||
#### M10 — UI/UX platform (T33–T36, T41)
|
||
`IDialogService` modal abstraction; design-tokens/CSS-vars + dark-mode/theming; shared pagination+filter component; accessibility pass; Playwright alarm-override UI coverage.
|
||
|
||
## Dependencies & sequencing
|
||
|
||
- **M1 → M5** — audit hardening builds on the wired purge/reconciliation.
|
||
- **M6/T11** — delivered as the #26 KpiHistory backbone; reused **central MS SQL** (a tall/EAV `KpiSample` table) rather than introducing new infra. T9/T10 deferred to the next major version.
|
||
- **M9/T26** — base-template versioning is the largest authoring item; may split.
|
||
- **M4** — runs anytime; cheap and high-clarity, good to interleave.
|
||
- **M3** — independent; can run in parallel with M1/M2.
|
||
- Phase 1 (M1–M4) should complete before Phase 2 work starts in earnest, so the foundation is true before features pile on.
|
||
|
||
## Cross-cutting conventions (per CLAUDE.md)
|
||
|
||
- Each milestone gets its own dated implementation plan in `docs/plans/` and (where useful) a `.tasks.json`.
|
||
- Design doc + code + entities/repos + actors/services + UI + tests + migrations + deploy config travel together in each slice.
|
||
- Every milestone is independently shippable: `dotnet build ZB.MOM.WW.ScadaBridge.slnx` green + relevant unit/integration tests pass; cluster-runtime changes rebuilt via `bash docker/deploy.sh`.
|
||
- M3 and the security items (M2 LDAP re-query) carry adversarial tests (bypass attempts), not just happy-path.
|
||
- `git diff` review before each commit; related changes committed together with a design-summary message.
|
||
|
||
## Testing strategy
|
||
|
||
- **M1:** integration tests against the cluster proving purge + reconciliation actually run (not just unit-level actor tests); event-log row assertions per category.
|
||
- **M2:** behavior-level unit tests per gap; CachedWrite + conditionFilter + per-script-timeout get integration coverage.
|
||
- **M3:** golden invalid scripts must fail; adversarial forbidden-API bypass corpus must fail to deploy.
|
||
- **M4:** doc-only — no test impact beyond keeping existing suites green; nested-type validation (#13) gets validator unit tests.
|
||
- **M5–M10:** standard unit + integration + Playwright (UI) coverage per feature; new infra (M6 time-series store) gets its own integration suite.
|
||
|
||
## Open items / risks
|
||
|
||
- M3 real-compile may surface latent invalid scripts in existing templates/fixtures — budget for fixture cleanup.
|
||
- M6 KPI history (resolved): reused **central MS SQL** with a tall/EAV `KpiSample` table rather than a new dependency, so no genuinely-new infrastructure was introduced.
|
||
- The Phase 2 roadmap is large; treat each milestone as a separate planning + implementation pass, not a single mega-effort.
|
||
|
||
## Next step
|
||
|
||
Hand off to the writing-plans skill to produce the detailed, bite-sized implementation plan, starting with **Phase 1 (M1–M4)**. Phase 2 milestones are planned individually as they're picked up.
|