docs(plans): completion roadmap for stillpending.md audit
Add the system-completion design doc (risk-first milestones M1-M10): Phase 1 Stabilize (M1 runtime wiring, M2 correctness, M3 script trust boundary, M4 doc reconciliation) then Phase 2 Expand (M5-M10 feature epics). Scope = all Tier 1/2/4 + in-scope Tier 3 features; T12/T19 deferred to own brainstorm; deliberate anti-goals excluded. Also commit the source audit (stillpending.md).
This commit is contained in:
@@ -0,0 +1,129 @@
|
||||
# Design: Completing `stillpending.md` — System Completion Roadmap
|
||||
|
||||
**Date:** 2026-06-15
|
||||
**Status:** Approved (brainstorming session) — ready for per-milestone implementation planning
|
||||
**Source audit:** `stillpending.md` (repo root) — full deferred/partial/unfinished/missing inventory
|
||||
**Scope decision:** Complete *everything* genuinely intended — all of Tier 1, Tier 2, Tier 4, and the in-scope Tier 3 features. Two large Tier 3 items deferred to their own brainstorm; a set of deliberate anti-goals excluded (see Scope).
|
||||
|
||||
## Goal
|
||||
|
||||
Drive `stillpending.md` to zero for all genuinely-intended functionality: make the silent gaps real, correct the behavioral divergences, reconcile the docs with the shipped architecture, and build out the deferred features that have a real seam and a real reason to exist.
|
||||
|
||||
## Scope
|
||||
|
||||
### In scope
|
||||
- **Tier 1** — all 6 silent gaps (documented as working, actually inert).
|
||||
- **Tier 2** — all ~31 genuine code defects / partial behaviors.
|
||||
- **Tier 4** — all doc↔code drift (specs + CLI docs + stale markers).
|
||||
- **Tier 3** — features tagged for build: T1–T11, T13–T18, T20–T26, T28, T30–T36, T41.
|
||||
|
||||
### Deferred to their own brainstorm (NOT in this plan)
|
||||
- **T12** — Native-alarm ack/shelve/suppress write-back + central alarm tables/history/journal + alarm-driven notifications. Reverses the "native alarms are read-only by design" decision; large enough to warrant its own design session.
|
||||
- **T19** — Direct cluster-to-cluster pull + asymmetric bundle signing + differential/incremental bundles. Three separable large features.
|
||||
|
||||
### Excluded (deliberate anti-goals — leave as-is)
|
||||
- **T27** Promote-derived-to-base + cross-tenant template libraries.
|
||||
- **T29** WhileTrue trigger mode for alarms (alarms are already level-based).
|
||||
- **T37** Replace SignalR debug-view streaming (working code; pure refactor).
|
||||
- **T38** True air-gapped env2 / 3rd-4th env / `--env` flag (env2 deliberately shares infra).
|
||||
- **T39** Repo/folder rename away from ScadaBridge (kept to preserve context).
|
||||
- **T40** Rename legacy Transport bundle manifests.
|
||||
|
||||
## Approach
|
||||
|
||||
**Risk-first milestones (Approach A).** Work is ordered by impact, not by component, and grouped into themed milestones using the repo's existing `m1…mN` implementation-doc convention (cf. `auditlog-m1..m8`). Phase 1 (M1–M4) *stabilizes* — a finite, ship-soon body of work that makes the docs true and the silent gaps real. Phase 2 (M5–M10) *expands* — an open-ended roadmap of independently-sized feature epics.
|
||||
|
||||
Rejected alternatives:
|
||||
- **Component-grouped** — cleaner per-component diffs, but spreads the high-risk Tier-1 work across late workstreams and ships no value early.
|
||||
- **Two-track only** — good mental model (folded into the Phase 1/2 boundary here), but too coarse for sequencing.
|
||||
|
||||
## Milestones
|
||||
|
||||
### Phase 1 — Stabilize
|
||||
|
||||
#### M1 — Runtime wiring (Tier 1: #3, #4, #5, #6)
|
||||
Wire up behavior that exists in code but is never started, and fill the event-log categories.
|
||||
- Wire `AuditLogPurgeActor` into Host bootstrap; drive partition-switch purge (365-day retention). (`AuditLogPurgeActor.cs`, `AkkaHostedService.cs:486`)
|
||||
- Implement `IPullAuditEventsClient` and instantiate `SiteAuditReconciliationActor` (periodic `PullAuditEvents` fallback). (`IPullAuditEventsClient.cs`, `SiteAuditReconciliationActor.cs`)
|
||||
- Site Call Audit: implement the per-site reconciliation pull (changed-since cursor read + central insert-if-not-exists/upsert-on-newer) and schedule `PurgeTerminalAsync` daily. (`SiteCallAuditActor.cs:28-34`, `SiteCallAuditRepository.cs:213`)
|
||||
- Site Event Logging: inject `ISiteEventLogger` into `AlarmActor`/`NativeAlarmActor`, `DeploymentManagerActor`, Store-and-Forward, and `NotificationOutboxActor`; emit Alarm / Deployment / Data-Connection / S&F / Instance-Lifecycle / Notification events, and add script *started*/*completed* (Info) alongside the existing error events.
|
||||
- **DoD:** actors instantiated and DI-registered; integration tests prove the purge actually switches out partitions and reconciliation back-fills a row dropped from telemetry; event-log rows produced for all 7 categories.
|
||||
|
||||
#### M2 — Correctness & behavioral gaps (Tier 2)
|
||||
- `Database.CachedWrite`: attempt immediately; permanent SQL error returns synchronously as `Failed` (mirror the API path); only transient errors buffer. (`DatabaseGateway.cs:78-204`)
|
||||
- Alarm `conditionFilter`: build the OPC UA event `WhereClause` from the filter and honor it in `DataConnectionActor` routing; same plumbing for MxGateway. (`DataConnectionActor.cs:1482`, `RealOpcUaClient.cs:242`)
|
||||
- Per-script execution timeout: add a field to `TemplateScript` + flattened config; apply in `ScriptExecutionActor`/`AlarmExecutionActor`; fall back to the global default. (`SiteRuntimeOptions.cs:31`)
|
||||
- Connection-level diff: add a `ConnectionChanges` slot to `ConfigurationDiff`; call `ComputeConnectionsDiff` from `ComputeDiff`; surface in the deployment diff UI. (`DiffService.cs:174`)
|
||||
- `MachineDataDb` fail-fast: add the option + startup validation (+ DbContext if the connection is actually consumed). (`DatabaseOptions.cs`, `StartupValidator.cs`)
|
||||
- CI grep-guard: build step that fails on `UPDATE`/`DELETE` against `AuditLog` in the data layer. (`Component-AuditLog.md:335`)
|
||||
- LDAP periodic re-query: implement the session-refresh path that re-queries LDAP so interactive roles are never >15 min stale (wire `JwtTokenService.RefreshToken`/`ShouldRefresh` or an `OnValidatePrincipal` revalidation). (security-relevant; pulled out of the session-doc reconcile)
|
||||
- Low-severity batch (#19–#31): return-type compatibility check; argument *type* compatibility; native-alarm-source capability validation wired into the deploy pipeline; binding-completeness as a deploy-gating Error (+ "name exists at site" check); debug snapshot/subscribe error response for unknown instance; recursion-limit error → site event log; debug-stream snapshot/stream ordering + timestamp-dedup replay; OPC UA native-alarm transition field population; readiness "required singletons running" probe; register the SiteEventLog active-node purge gate; consume `FailedWriteCount` in Health Monitoring; reconcile `StateTransitionValidator` delete-from-`NotDeployed` with the spec matrix.
|
||||
- **DoD:** unit + integration tests per behavior; where the fix corrects code, the spec already matched; where the spec was the divergence, it's updated in the same change.
|
||||
|
||||
#### M3 — Script trust boundary (Tier 1: #1, #2)
|
||||
- Wire the already-referenced `Microsoft.CodeAnalysis.CSharp.Scripting` into `ScriptCompiler.TryCompile` for a real semantic compile (errors block deploy). (`ScriptCompiler.cs:56-104`, `ValidationService.cs:128`)
|
||||
- Replace the advisory substring forbidden-API scan with Roslyn symbol/semantic analysis that resolves aliases, `using static`, and `global::`; coordinate design-time enforcement with the Site Runtime sandbox so the trust boundary is authoritative. (`ScriptCompiler.cs:14-22`)
|
||||
- Apply the same real compile to shared scripts. (`SharedScriptService.cs:168-206`)
|
||||
- **DoD:** semantically-invalid C# fails validation; adversarial bypass tests (alias / `using static` / `global::` reaching a forbidden API) fail to deploy.
|
||||
|
||||
#### M4 — Doc reconciliation (Tier 4, parallelizable)
|
||||
- Update specs to the shipped re-architecture: `Component-ConfigurationDatabase.md` (collapsed `AuditLog` schema), `Component-Commons.md` (`AuditEvent`→`ZB.MOM.WW.Audit` package, `ApiKey` retirement, undocumented types/interfaces), `Component-InboundAPI.md` (Bearer auth, audit write timing, type validation), `Component-NotificationService.md`/`NotificationType` (Teams status), Security role names, `SiteCall` field names, `AuditKind` vocabulary.
|
||||
- CLI docs: document the `bundle` group; fix README option-name drift (the README is the stale doc; `Component-CLI.md` mostly matches code); correct `audit query` options.
|
||||
- Clear stale "deferred" markers/comments for shipped features (Transport CLI, `SourceNode`, Site Call Audit relay, bundle-import audit filter, M5 redaction comments, `AuditLogPage.HandleRowSelected`).
|
||||
- **Code-vs-doc dispositions:** doc-update (code authoritative) for Bearer auth, fire-and-forget audit timing, JWT-in-cookie→cookie-only session model, `ExecuteReader`/`DbWrite` kind. Code-change (build it) for nested `Object`/`List` validation (#13) — done in M2's validator work — and the LDAP re-query (M2).
|
||||
- **DoD:** no remaining doc↔code contradiction for in-scope components; CLI docs match registered commands/options.
|
||||
|
||||
### Phase 2 — Expand
|
||||
|
||||
#### M5 — Audit hardening (T1–T8)
|
||||
Hash-chain tamper evidence (off by default, `verify-chain` made real); Parquet export/archival (replace the 501); per-channel retention overrides; tag-cascade for `ParentExecutionId` (thread writing-execution id through trigger-driven runs); ExecutionId/ParentExecutionId + SourceNode backfill on historical rows; per-node stuck-count KPIs; structured response capture (headers/content-type, inbound request headers, per-method opt-out, `AuditInboundCeilingHits` metric); CLI `audit tree`.
|
||||
|
||||
#### M6 — Notifications (T9–T11)
|
||||
Teams + other non-Email delivery adapters behind the existing `INotificationDeliveryAdapter` seam; `NotificationType` enum values; Central UI notification-list `Type` selector; historical/trend KPI charts (introduce a time-series store).
|
||||
|
||||
#### M7 — OPC UA / MxGateway UX (T13–T17)
|
||||
Dedicated operator Alarm Summary page; MxGateway secured writes (operator+verifier); OPC UA address-space search + `BrowseNext` paging; type-info surfacing + bulk override CSV import; "Verify endpoint" connectivity button + cert-management UI.
|
||||
|
||||
#### M8 — Transport (T18, T20)
|
||||
Site-scoped / instance-scoped artifact transport (name-mapping subsystem); per-line/Myers diff for Modified artifacts.
|
||||
|
||||
#### M9 — Templates & authoring (T22–T26, T28, T30–T32)
|
||||
Template tree search/filter; folder drag-drop + sibling reorder + root context menu; move data connection between sites; connection live-status indicators; base-template versioning "update-derived" flow + multi-level inheritance; strict expression-trigger analysis kind; schema-driven value-entry forms + hover/completion + JSON Schema `$ref`/library; CLI Retry/Discard for cached calls; unified notifications+site-calls outbox page.
|
||||
|
||||
#### M10 — UI/UX platform (T33–T36, T41)
|
||||
`IDialogService` modal abstraction; design-tokens/CSS-vars + dark-mode/theming; shared pagination+filter component; accessibility pass; Playwright alarm-override UI coverage.
|
||||
|
||||
## Dependencies & sequencing
|
||||
|
||||
- **M1 → M5** — audit hardening builds on the wired purge/reconciliation.
|
||||
- **M6/T11** — depends on introducing a time-series store (new infra; size carefully).
|
||||
- **M9/T26** — base-template versioning is the largest authoring item; may split.
|
||||
- **M4** — runs anytime; cheap and high-clarity, good to interleave.
|
||||
- **M3** — independent; can run in parallel with M1/M2.
|
||||
- Phase 1 (M1–M4) should complete before Phase 2 work starts in earnest, so the foundation is true before features pile on.
|
||||
|
||||
## Cross-cutting conventions (per CLAUDE.md)
|
||||
|
||||
- Each milestone gets its own dated implementation plan in `docs/plans/` and (where useful) a `.tasks.json`.
|
||||
- Design doc + code + entities/repos + actors/services + UI + tests + migrations + deploy config travel together in each slice.
|
||||
- Every milestone is independently shippable: `dotnet build ZB.MOM.WW.ScadaBridge.slnx` green + relevant unit/integration tests pass; cluster-runtime changes rebuilt via `bash docker/deploy.sh`.
|
||||
- M3 and the security items (M2 LDAP re-query) carry adversarial tests (bypass attempts), not just happy-path.
|
||||
- `git diff` review before each commit; related changes committed together with a design-summary message.
|
||||
|
||||
## Testing strategy
|
||||
|
||||
- **M1:** integration tests against the cluster proving purge + reconciliation actually run (not just unit-level actor tests); event-log row assertions per category.
|
||||
- **M2:** behavior-level unit tests per gap; CachedWrite + conditionFilter + per-script-timeout get integration coverage.
|
||||
- **M3:** golden invalid scripts must fail; adversarial forbidden-API bypass corpus must fail to deploy.
|
||||
- **M4:** doc-only — no test impact beyond keeping existing suites green; nested-type validation (#13) gets validator unit tests.
|
||||
- **M5–M10:** standard unit + integration + Playwright (UI) coverage per feature; new infra (M6 time-series store) gets its own integration suite.
|
||||
|
||||
## Open items / risks
|
||||
|
||||
- M3 real-compile may surface latent invalid scripts in existing templates/fixtures — budget for fixture cleanup.
|
||||
- M6 time-series store is the one genuinely-new piece of infrastructure; scope it deliberately (could reuse MS SQL with a rollup table rather than a new dependency).
|
||||
- The Phase 2 roadmap is large; treat each milestone as a separate planning + implementation pass, not a single mega-effort.
|
||||
|
||||
## Next step
|
||||
|
||||
Hand off to the writing-plans skill to produce the detailed, bite-sized implementation plan, starting with **Phase 1 (M1–M4)**. Phase 2 milestones are planned individually as they're picked up.
|
||||
+217
@@ -0,0 +1,217 @@
|
||||
# ScadaBridge — Pending / Deferred / Partial / Missing Functionality Audit
|
||||
|
||||
**Date:** 2026-06-15
|
||||
**Scope:** Full system — design specs (`docs/requirements/`), all of `src/`, the Central UI / CLI / Management Service surfaces, and the plan/checklist archive (`docs/plans/`).
|
||||
**Method:** Five parallel read-only investigators, each verifying doc claims against actual code (file:line evidence). Top findings were independently corroborated by 2+ agents.
|
||||
|
||||
## Executive summary
|
||||
|
||||
The codebase is unusually clean: **zero real `TODO`/`FIXME` markers in `src/`**, and all 11 implementation phases self-report complete. Consequently the unfinished work does **not** announce itself — it hides in three forms:
|
||||
|
||||
1. **Silent gaps (Tier 1)** — documented as working, not marked deferred, but absent or inert in production.
|
||||
2. **Partial / behaviorally-divergent functionality (Tier 2)** — real, but narrower or different from the spec.
|
||||
3. **Intentional deferrals (Tier 3)** — knowingly punted, correctly documented, with extensible seams. *Not defects.*
|
||||
4. **Doc↔code drift (Tier 4)** — code is fine; the specs describe a superseded architecture.
|
||||
|
||||
**Actionable risk is concentrated in Tier 1.** Recommended starting point: **#3 / #4 (wire the two never-started audit actors)** — highest impact, smallest blast radius.
|
||||
|
||||
---
|
||||
|
||||
## Tier 1 — Silent gaps: documented as working, but not actually running
|
||||
|
||||
These are the dangerous ones. Specs present them as live behavior, they are **not** marked deferred, yet the functionality is absent or inert.
|
||||
|
||||
| # | Gap | Where | Impact |
|
||||
|---|-----|-------|--------|
|
||||
| **1** | **Script "test-compilation" does no real compilation.** The headline pre-deploy gate ("scripts must compile without errors") is a brace-balance + forbidden-API *substring* scan. No Roslyn reference in the project. | `TemplateEngine/Validation/ScriptCompiler.cs:56-104`; used by `ValidationService.cs:128` | Semantically-broken C# passes validation and **deploys**. Found independently by 2 agents. |
|
||||
| **2** | **Script forbidden-API gate is bypassable.** The script trust model's only design-time enforcement is the same substring scan — defeated by aliases / `using static` / `global::`. Self-documented as "SECURITY LIMITATION (TemplateEngine-006)". | `ScriptCompiler.cs:14-22,61-72`; `ValidationService.cs:346` | Security boundary is advisory only. 2 agents. |
|
||||
| **3** | **Audit Log 365-day retention purge never starts.** `AuditLogPurgeActor` exists but has **zero** `ActorOf`/`Props.Create` callers; only the roll-*forward* partition service runs. | `AuditLog/.../AuditLogPurgeActor.cs:58`; `AkkaHostedService.cs:486` ("wired in a later bundle") | The documented purge does not run in production; `AuditLog` grows unbounded. |
|
||||
| **4** | **Audit Log reconciliation self-heal never wired.** `IPullAuditEventsClient` has no implementation; `SiteAuditReconciliationActor` is never instantiated. | `IPullAuditEventsClient.cs:31`; `SiteAuditReconciliationActor.cs:68` | The documented "lost-telemetry fallback" doesn't exist; forward telemetry is the only path. |
|
||||
| **5** | **Site Call Audit reconciliation pull + daily purge both missing.** The actor's own docstring admits it. `PurgeTerminalAsync` is implemented but never invoked. | `SiteCallAuditActor.cs:28-34`; `SiteCallAuditRepository.cs:213` | `SiteCalls` mirror has no self-heal and grows unbounded. |
|
||||
| **6** | **Site Event Logging emits only 2 of 7 documented categories.** Only `connection` + `script` (error-path only). Alarm, Deployment, Store-and-Forward, Instance-Lifecycle, Notification events are never logged — `ISiteEventLogger` isn't injected into those subsystems. | `DataConnectionActor.cs`, `ScriptExecutionActor.cs` (only emitters); spec "Events Logged" §20-28 | Operational event log is materially incomplete vs spec. |
|
||||
|
||||
---
|
||||
|
||||
## Tier 2 — Partial / behaviorally-divergent functionality
|
||||
|
||||
Real, but narrower than the spec — wrong in a way that could surprise an operator or script author.
|
||||
|
||||
| # | Gap | Where |
|
||||
|---|-----|-------|
|
||||
| **7** | **`Database.CachedWrite` misclassifies permanent SQL errors as transient** → retries forever instead of failing fast to the script. The API path does it right; the DB path does not. No immediate attempt, no synchronous permanent-`Failed` return. | `DatabaseGateway.cs:78-204` (cf. `ExternalSystemClient.cs:100-161`) |
|
||||
| **8** | **Alarm `conditionFilter` is plumbed end-to-end but applied nowhere** — set a filter on a native-alarm source and it silently mirrors *all* conditions. | `DataConnectionActor.cs:1482,1540-1554`; `RealOpcUaClient.cs:242,295`; `MxGatewayDataConnection.cs:154-167` |
|
||||
| **9** | **Per-script execution timeout doesn't exist** — spec promises per-script; only a global `ScriptExecutionTimeoutSeconds`. No field in the template/flattened model to carry it. | `SiteRuntimeOptions.cs:31`; `ScriptExecutionActor.cs:100`; `AlarmExecutionActor.cs:66` |
|
||||
| **10** | **Connection-level diffs never surface in the deployment diff** — `ComputeConnectionsDiff` is dead code (no callers); `ConfigurationDiff` has no slot for it. Per-attribute binding drift *is* caught; standalone connection endpoint (protocol/config/failover) diff is not. | `DiffService.cs:158-204`; `Commons/Types/Flattening/ConfigurationDiff.cs:7-24` |
|
||||
| **11** | **Inbound API auth transport drift** — code uses `Authorization: Bearer sbk_<keyId>_<secret>`; doc says `X-API-Key` header. | `InboundAPI/.../EndpointExtensions.cs:83-90` |
|
||||
| **12** | **Inbound API audit write is fire-and-forget *after* response flush** — doc says synchronous *before* flush. Row is still emitted (fail-soft), just non-blocking and after the body is forwarded. | `AuditWriteMiddleware.cs:195-212,281-290` |
|
||||
| **13** | **Inbound `Object`/`List` extended types are shape-validated only** — no nested/field-level type validation, despite spec implying typed/nested validation. | `ParameterValidator.cs:109-145`; `ReturnValueValidator.cs:18` |
|
||||
| **14** | **JWT-in-cookie session design not implemented** — `/auth/login` signs a plain `ClaimsPrincipal`; `GenerateToken` only used by the CLI `/auth/token` path; `ValidateToken` has no external callers. | `AuthEndpoints.cs:38,75-112,152`; `ServiceCollectionExtensions.cs:99-118` |
|
||||
| **15** | **"Re-query LDAP every 15 min / roles never >15 min stale" not implemented for interactive sessions** — `JwtTokenService.RefreshToken`/`RecordActivity`/`ShouldRefresh`/`IsIdleTimedOut` have **zero** call sites; roles fixed until cookie expiry. The 15-min sliding + 30-min idle layers are collapsed into a single 30-min sliding cookie window. | `JwtTokenService.*` (no callers); `ServiceCollectionExtensions.cs:99-148` |
|
||||
| **16** | **Transport stale-instance enumeration always returns empty** — `BundleImporter` returns `Array.Empty<int>()`; UI shows a generic warning with no count, link not filtered to stale instances. | `BundleImporter.cs:733`; `TransportImport.razor:347-388` |
|
||||
| **17** | **`MachineDataDb` fail-fast requirement not enforced** — spec (REQ-HOST-3/4) requires central nodes to validate a non-empty `MachineDataDb` connection string. `DatabaseOptions` has only `ConfigurationDb`/`SiteDbPath`; validator never checks it; 0 `grep` hits in `src/`. Key lives only in docker appsettings as dead config. | `DatabaseOptions.cs:6-12`; `StartupValidator.cs:60-61` |
|
||||
| **18** | **CI grep-guard against `UPDATE/DELETE … AuditLog` not in the repo** — spec claims a build-time grep that fails on data-layer mutations. DB-role DENY enforcement *is* present in migrations (so this is a backstop, not the only control), but the claimed code-level guard is absent. | spec `Component-AuditLog.md:335-336`, `Component-ConfigurationDatabase.md:297` |
|
||||
|
||||
### Lower-severity Tier-2 / behavioral notes
|
||||
|
||||
| # | Gap | Where |
|
||||
|---|-----|-------|
|
||||
| 19 | **Script "started"/"completed" events not logged** (only failures, severity `Error`). | `ScriptExecutionActor.cs:239,256`; `ScriptActor.cs:369` |
|
||||
| 20 | **Return-type compatibility check is dead scaffolding** — `BuildReturnMap` builds maps never read; no return-type comparison runs. | `SemanticValidator.cs:62-63,279-287` |
|
||||
| 21 | **Argument *type* compatibility not checked** — only arg *count* (comma counting). | `SemanticValidator.cs:251-266,390-425` |
|
||||
| 22 | **Native-alarm-source connection-capability validation never runs in deploy pipeline** — `alarmCapableConnectionNames` param no production caller supplies. | `SemanticValidator.cs:30-33,239-245`; `FlatteningPipeline.cs:93,115` |
|
||||
| 23 | **Connection-binding completeness is a non-blocking Warning, not deploy-gating Error**; "name exists at site" half missing. | `ValidationService.cs:504-519`; `ValidationResult.cs:9` |
|
||||
| 24 | **Debug snapshot/subscribe for unknown instance returns empty snapshot, not error** — caller can't distinguish "not deployed" from "deployed but empty." | `DeploymentManagerActor.cs:845-866` |
|
||||
| 25 | **Recursion-limit error logged to .NET `ILogger`, not the site event log** as spec requires. | `ScriptRuntimeContext.cs:302-305,464-466` |
|
||||
| 26 | **Debug-stream snapshot/stream ordering reversed; no timestamp-dedup replay** — `PreStart` sends snapshot first, opens stream after; gap-window events lost (spec wants stream-first + replay/dedup). | `DebugStreamBridgeActor.cs:89-103,163-166` |
|
||||
| 27 | **OPC UA native-alarm transition leaves several display fields empty** (Category/Description/OperatorUser/OriginalRaiseTime/CurrentValue/LimitValue) — partly by design. | `RealOpcUaClient.cs:395-403`; `MxGatewayAlarmMapper.cs:79-113` |
|
||||
| 28 | **Readiness gate omits "required cluster singletons running" criterion** — covers membership + DB connectivity only (softened by spec's "(if applicable)"). | `Program.cs:188-201,314-317`; `AkkaClusterHealthCheck.cs:54` |
|
||||
| 29 | **SiteEventLog active-node purge gate never registered** — `SiteEventLogActiveNodeCheck` not added to DI; purge defaults to `() => true`, runs on standby too (harmless, but documented restriction unenforced). | `SiteEventLogging/ServiceCollectionExtensions.cs:33-37`; `EventLogPurgeService.cs:61` |
|
||||
| 30 | **`FailedWriteCount` metric exposed "for future Health Monitoring" but never consumed** — dangling metric. | `ISiteEventLogger.cs:32-40` |
|
||||
| 31 | **`StateTransitionValidator` allows Delete from `NotDeployed`; spec matrix says No** (deliberate per code comment, contradicts doc). | `StateTransitionValidator.cs:38-39` |
|
||||
|
||||
---
|
||||
|
||||
## Tier 3 — Intentional deferrals (correctly documented — NOT defects)
|
||||
|
||||
Knowingly punted, with extensible seams and explicit doc notes. `[PERM]` = permanent / v-next; `[SLICE]` = deferred-to-a-later-slice with seam present.
|
||||
|
||||
**Centralized Audit Log (#23)**
|
||||
- `[PERM]` Hash-chain tamper evidence (v1.x). `verify-chain` CLI is a no-op stub that prints "not enabled in this release". — `AuditCommands.cs:243-246`; `AuditVerifyChainHelpers.cs:6-8`
|
||||
- `[PERM]` Parquet export/archival. Server returns HTTP `501`; CSV + JSONL implemented. — `AuditEndpoints.cs:188-194`; `AuditExportHelpers.cs:139-148`
|
||||
- `[PERM]` Per-channel retention overrides. — `2026-05-20-audit-log-code-roadmap.md:16`
|
||||
- `[PERM]` Tag-cascade for `ParentExecutionId` — only the inbound-API→routed-site bridge is built; trigger-driven runs pass `parentExecutionId = null`. — `ScriptActor.cs:404,429`; `2026-05-21-audit-parent-executionid-design.md:209`
|
||||
- `[PERM]` ExecutionId/ParentExecutionId backfill on historical rows; SourceNode backfill on legacy rows; per-node stuck-count KPIs.
|
||||
- `[PERM]` Structured/response-header response capture; inbound request-header capture; per-method opt-out; `AuditInboundCeilingHits` metric. — `2026-05-23-inbound-api-full-response-audit-design.md:113-127`
|
||||
- *Uncertain:* CLI `audit tree` command (doc "maybe", not found in CLI).
|
||||
|
||||
**Notifications (#8 / #21)**
|
||||
- `[SLICE]` Teams (and all non-Email) notification types — `INotificationDeliveryAdapter` seam exists, only `EmailNotificationDeliveryAdapter` implemented; `NotificationType` enum is Email-only. Missing-adapter path parks gracefully. — `NotificationType.cs:6-9`; `NotificationOutboxActor.cs:457-474`
|
||||
- `[SLICE]` Central UI notification-list form has no `Type` selector (Email hard-coded). — `NotificationListForm.razor`
|
||||
- `[PERM]` Historical/trend KPI charts (no time-series store).
|
||||
|
||||
**Native Alarms / MxGateway / OPC UA**
|
||||
- `[PERM]` Native-alarm ack/shelve/suppress write-back; central alarm tables/history/journal; alarm-driven notifications/scripts — read-only by design. — `2026-05-29-native-alarms-design.md:201-206`
|
||||
- `[SLICE]` Dedicated operator Alarm Summary page (DebugView only for now).
|
||||
- `[PERM]` MxGateway secured writes (operator+verifier).
|
||||
- `[SLICE]` OPC UA address-space search; `BrowseNext` paging. — `RealOpcUaClient.cs:574`
|
||||
- `[PERM]` OPC UA type-info surfacing; bulk override import/CSV.
|
||||
- `[SLICE]` OPC UA "Verify endpoint" connectivity button; cert-management UI.
|
||||
|
||||
**Transport (#24)**
|
||||
- `[PERM]` Site-scoped / instance-scoped artifact transport (needs name-mapping subsystem).
|
||||
- `[PERM]` Direct cluster-to-cluster pull; asymmetric bundle signing; differential/incremental bundles.
|
||||
- `[PERM/SLICE]` Per-line/Myers diff for Modified artifacts (coarse line-count delta only). — `ArtifactDiff.cs:18-25`
|
||||
|
||||
**TreeView**
|
||||
- `[SLICE/PERM]` R6 lazy-loading, R7 keyboard nav, R16 multi-select — spec marks all "(Deferred)". — `Component-TreeView.md:87-93,288-295`
|
||||
|
||||
**Templates / Data Connections / Triggers UI**
|
||||
- `[SLICE]` Template tree search/filter; `[PERM]` folder drag-drop, sibling reorder, root context menu.
|
||||
- `[PERM]` Move data connection between sites; `[SLICE]` connection live-status indicators (blocked on DCL state surfacing).
|
||||
- `[SLICE]` Base-template versioning "update-derived" flow; multiple inheritance levels; `[PERM]` promote-derived-to-base, cross-tenant libraries.
|
||||
- `[SLICE]` Strict expression-trigger analysis kind; `[PERM]` WhileTrue trigger mode for alarms.
|
||||
- `[SLICE]` Schema-driven value-entry forms; schema hover/completion; `[PERM]` JSON Schema `$ref` reuse / template-level schema library.
|
||||
|
||||
**Cached-call tracking (#6 / #22)**
|
||||
- `[SLICE]` CLI surface for site-local Retry/Discard of cached calls; `[PERM]` unified notifications+site-calls outbox page.
|
||||
|
||||
**UI audit backlog (`2026-05-12-ui-audit.md:536-554`)**
|
||||
- `IDialogService` modal abstraction; design-tokens/CSS-vars; dark-mode/theming; shared pagination+filter component; accessibility pass; replacing SignalR debug-view streaming.
|
||||
|
||||
**Environment / tooling**
|
||||
- `[PERM]` True air-gapped second environment (env2 shares MSSQL/LDAP/SMTP); 3rd/4th env; `--env` flag on `deploy.sh`.
|
||||
- `[PERM]` Repo/folder rename (kept as ScadaBridge to preserve context).
|
||||
- `[SLICE]` Playwright alarm-override UI coverage.
|
||||
|
||||
---
|
||||
|
||||
## Tier 4 — Doc↔code drift (code is fine; docs describe a superseded architecture)
|
||||
|
||||
Worth fixing for anyone relying on the docs as the spec.
|
||||
|
||||
**Config DB / Commons re-architecture not reflected in specs (High doc-impact):**
|
||||
- `AuditLog` table collapsed to 10 canonical + `DetailsJson` + 6 PERSISTED `JSON_VALUE` computed cols; doc still lists ~24 typed columns (`Kind`, `HttpStatus`, `RequestSummary`, …). — migration `20260602174346_CollapseAuditLogToCanonical.cs`; `Entities/AuditLogRow.cs:54-136`
|
||||
- `AuditEvent` moved out of Commons into the external `ZB.MOM.WW.Audit` NuGet package; doc (REQ-COM-1/3/5b) still describes it as a Commons type. — `Commons.csproj:11`
|
||||
- `ApiKey` entity / API-key persistence retired to shared `ZB.MOM.WW.Auth.ApiKeys` SQLite store; doc still lists `ApiKey` + `ApprovedApiKeyIds`. — migration `20260602092753_RetireInboundApiKeyStore.cs`
|
||||
|
||||
**CLI docs drift (README is the stale doc; `Component-CLI.md` mostly matches code):**
|
||||
- Entire `bundle` (Transport #24) command group is shipped + registered but documented in **neither** `Component-CLI.md` **nor** `CLI/README.md`. — `Program.cs:36`; `BundleCommands.cs:24-372`
|
||||
- `security api-key create` requires undocumented `--methods` (Required); docs show only `--name`. — `SecurityCommands.cs:41-45`
|
||||
- `security api-key update`/`delete` use `--key-id`; docs document `--id` (and an unwired `--name` on update). — `SecurityCommands.cs:60,71`
|
||||
- `security api-key set-methods` subcommand exists in code, documented nowhere. — `SecurityCommands.cs:91-102`
|
||||
- `api-method create` uses required `--script`; docs document `--code` + `--description` (neither exists). README is internally inconsistent (create=`--code`, update=`--script`). — `ApiMethodCommands.cs:57-62`
|
||||
- `db-connection create`/`update` documented with `--provider`; code has no such option. — `DbConnectionCommands.cs:56-72`
|
||||
- Widespread README option-name drift where `Component-CLI.md` already matches code (scope-rule `--mapping-id`, health `--site`/`--keyword`, template attribute `--value`/`--data-source`, template alarm `--trigger-type`/`--priority`/`--trigger-config`, composition delete `--id`, etc.).
|
||||
- `audit query` doc lists `--page` (code is keyset-only `--all`); undocumented `--execution-id`/`--parent-execution-id` filters exist.
|
||||
|
||||
**Stale "deferred" markers for things that have actually SHIPPED:**
|
||||
- Transport CLI (`bundle export/preview/import`) — design doc §13 said "deferred"; now implemented.
|
||||
- `SourceNode` capture — `.tasks.json` shows all 21 tasks "pending"; fully implemented across Commons/AuditLog/NotificationOutbox/SiteCallOperational.
|
||||
- Site Call Audit Retry/Discard relay — DI comment says deferred; implemented + wired (`SiteCallAuditActor.cs:150-156,450-505`; `AkkaHostedService.cs:580-589`).
|
||||
- Bundle-import audit filter UI (Transport-012) — doc says deferred follow-up; shipped (`ConfigurationAuditLog.razor` `?bundleImportId=` filter).
|
||||
- Redaction/payload-cap "deferred to M5" comments in Site Runtime — already shipped (`ScadaBridgeAuditRedactor`, `AuditLogOptions.DefaultCapBytes/ErrorCapBytes`).
|
||||
- `AuditLogPage.HandleRowSelected` class comment says "no-op seam"; method is fully wired (opens drawer).
|
||||
|
||||
**Other doc/spec inconsistencies (code richer/different than doc):**
|
||||
- Security role names: doc says Admin/Design/Deployment; code uses Administrator/Designer/Deployer/Viewer (canonicalized via migration).
|
||||
- `SiteCall` entity field names diverge from doc (`Channel` not `Kind`, `SourceSite` not `SourceSiteId`, adds `HttpStatus`/`IngestedAtUtc`).
|
||||
- `ExecuteReader` audited as `DbWrite` (read/write distinguished via `Extra` JSON `op`, not a distinct `AuditKind`).
|
||||
- Inbound audit doc references `ApiInbound.Completed`; actual kinds are `InboundRequest`/`InboundAuthFailure`.
|
||||
- `Teams` claimed present in `NotificationType` enum by Commons/ConfigDB docs; enum is Email-only.
|
||||
- Commons under-documents shipped code: MxGateway endpoint serializer/validator/config, `Observability/ScadaBridgeTelemetry.cs`, `IInboundApiKeyAdmin`, `IAuditActorAccessor` — none in the doc folder map.
|
||||
- `IHealthMonitoringRepository` listed in ConfigDB repo table but doesn't exist (doc annotated "future").
|
||||
- `requirements-traceability.md` and many `.md.tasks.json` show "Pending" for shipped features — they track *plan generation*, not implementation; unreliable as a status source.
|
||||
- `ExternalSystemForm` "Recent audit activity" drill-in omits `channel=ApiOutbound` and uses exact-match `target` instead of starts-with (sibling `ApiKeyForm` link is correct). — `ExternalSystemForm.razor:20-24`
|
||||
|
||||
---
|
||||
|
||||
## Code-level sweep — investigated and ruled out (false positives)
|
||||
|
||||
For completeness, items that *look* unfinished but are intentional:
|
||||
- ~44 empty `catch` blocks — all have explanatory comments / intentional fallback (JSON parse → default; disposal-race `ObjectDisposedException`). None silently swallow real errors.
|
||||
- `SiteNotificationRepository` / `SiteExternalSystemRepository` write methods throw `NotSupportedException` — by design (site config is read-only, managed via central deployment).
|
||||
- `StubOpcUaClient` (canned data; `BrowseChildrenAsync` throws `NotImplementedException`) — dev/test-only; production wires `RealOpcUaClientFactory`/`RealMxGatewayClientFactory` (`DataConnectionFactory.cs:38-47`).
|
||||
- `NoOpSiteStreamAuditClient`, `SandboxNotifyHelper`, sandbox host fakes — legitimate DI-default / test composition seams.
|
||||
- `AddSecurityActors` / `AddTemplateEngineActors` "Phase 0 placeholder" registrations — intentional empty seams (actor wiring lives in Host).
|
||||
- Migration `Down()` `NotSupportedException`, MxGateway/Bundle version-rejection `NotSupportedException`, `AuditWriteMiddleware` write-only-stream `NotSupportedException` — intentional guards.
|
||||
- Management Service: 113 handlers; all wire-registered `Mgmt*` commands dispatched. The three "unhandled" (`ResolveRolesCommand` retired; `BrowseNodeCommand`/`ReadTagValuesCommand` routed direct-to-site) are intentional.
|
||||
- Central UI: no stub/placeholder pages, no `NotImplementedException`, no "coming soon" banners, no no-op `@onclick`. `disabled=`/`placeholder=` usages are legitimate (loading guards, edit locks, HTML hints).
|
||||
|
||||
---
|
||||
|
||||
## Phase completeness (self-reported)
|
||||
|
||||
All 11 phases report Complete with passing verification gates:
|
||||
|
||||
- **Phase 0** Solution Skeleton — Complete (gate 11/11, 57 tests).
|
||||
- **Phase 1** Central Foundations — Complete (gate 20/20, 186 pass + 1 live-LDAP skip).
|
||||
- **Phase 2** Modeling & Validation — Complete (gate 9/9, 359 tests).
|
||||
- **Phase 3A** Runtime Foundation — Complete (gate 13/13, 389/389).
|
||||
- **Phase 3B** Site I/O & Observability — Complete (gate 11/11, 541 cumulative).
|
||||
- **Phase 3C** Deployment & Store-Forward — Complete (terse checklist).
|
||||
- **Phase 4** Operator UI — Complete (terse checklist).
|
||||
- **Phase 5** Authoring UI — Complete (terse checklist).
|
||||
- **Phase 6** Deployment & Ops UI — Complete (terse checklist; Codex external-review step skipped, best-effort).
|
||||
- **Phase 7** Integrations — Complete (terse checklist; Q12 SMTP-OAuth2 is a test-env dependency).
|
||||
- **Phase 8** Production Readiness — Complete (terse checklist).
|
||||
|
||||
The ~665 unchecked `- [ ]` items in phase *plan* docs are spec-traceability references (each dispositioned Pass / Out-of-scope in Forward/Reverse tables), a documentation style — not a TODO list.
|
||||
|
||||
**Operational (not code):** `docs/deployment/production-checklist.md` has ~60 unchecked install-time operator steps (env vars, connection strings, firewall ports 8081/636/587/1433, TLS certs, smoke tests).
|
||||
|
||||
---
|
||||
|
||||
## Confidence & caveats
|
||||
|
||||
- **High confidence** on Tier 1 — each item verified by reading the code (class/interface existence + absence of callers via grep); top items corroborated by 2+ independent agents.
|
||||
- Terse Phase 3C–8 checklists self-report "Complete / tests passing" with no per-gate breakdown; test counts for those phases were not independently re-run.
|
||||
- Actual `src/` artifacts were treated as truth over `.tasks.json` status fields, which are demonstrably stale.
|
||||
- Items marked *Uncertain* (e.g. `audit tree` CLI, per-channel retention) rest on doc text only.
|
||||
|
||||
## Recommended next steps
|
||||
|
||||
1. **Wire the two never-started audit actors** (#3, #4) — highest impact, smallest blast radius (DI/Host wiring + `IPullAuditEventsClient` impl).
|
||||
2. **Site Call Audit reconciliation + purge** (#5) — same shape as #3/#4.
|
||||
3. **Decide on script compilation/security** (#1, #2) — either implement the Roslyn gate or downgrade the spec's claims; currently the strongest functional + security gap.
|
||||
4. **Site Event Logging categories** (#6) — inject `ISiteEventLogger` into the 5 missing subsystems.
|
||||
5. **Reconcile Tier-4 doc drift** — update Config DB / Commons specs for the audit/auth re-architecture and the CLI docs for the `bundle` group + option names.
|
||||
Reference in New Issue
Block a user