From f4707745bf8fe48c6be66d371a1b016fe7a92b4f Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Mon, 15 Jun 2026 09:27:00 -0400 Subject: [PATCH] docs(plans): completion roadmap for stillpending.md audit Add the system-completion design doc (risk-first milestones M1-M10): Phase 1 Stabilize (M1 runtime wiring, M2 correctness, M3 script trust boundary, M4 doc reconciliation) then Phase 2 Expand (M5-M10 feature epics). Scope = all Tier 1/2/4 + in-scope Tier 3 features; T12/T19 deferred to own brainstorm; deliberate anti-goals excluded. Also commit the source audit (stillpending.md). --- ...26-06-15-stillpending-completion-design.md | 129 +++++++++++ stillpending.md | 217 ++++++++++++++++++ 2 files changed, 346 insertions(+) create mode 100644 docs/plans/2026-06-15-stillpending-completion-design.md create mode 100644 stillpending.md diff --git a/docs/plans/2026-06-15-stillpending-completion-design.md b/docs/plans/2026-06-15-stillpending-completion-design.md new file mode 100644 index 00000000..a40a7820 --- /dev/null +++ b/docs/plans/2026-06-15-stillpending-completion-design.md @@ -0,0 +1,129 @@ +# Design: Completing `stillpending.md` — System Completion Roadmap + +**Date:** 2026-06-15 +**Status:** Approved (brainstorming session) — ready for per-milestone implementation planning +**Source audit:** `stillpending.md` (repo root) — full deferred/partial/unfinished/missing inventory +**Scope decision:** Complete *everything* genuinely intended — all of Tier 1, Tier 2, Tier 4, and the in-scope Tier 3 features. Two large Tier 3 items deferred to their own brainstorm; a set of deliberate anti-goals excluded (see Scope). + +## Goal + +Drive `stillpending.md` to zero for all genuinely-intended functionality: make the silent gaps real, correct the behavioral divergences, reconcile the docs with the shipped architecture, and build out the deferred features that have a real seam and a real reason to exist. + +## Scope + +### In scope +- **Tier 1** — all 6 silent gaps (documented as working, actually inert). +- **Tier 2** — all ~31 genuine code defects / partial behaviors. +- **Tier 4** — all doc↔code drift (specs + CLI docs + stale markers). +- **Tier 3** — features tagged for build: T1–T11, T13–T18, T20–T26, T28, T30–T36, T41. + +### Deferred to their own brainstorm (NOT in this plan) +- **T12** — Native-alarm ack/shelve/suppress write-back + central alarm tables/history/journal + alarm-driven notifications. Reverses the "native alarms are read-only by design" decision; large enough to warrant its own design session. +- **T19** — Direct cluster-to-cluster pull + asymmetric bundle signing + differential/incremental bundles. Three separable large features. + +### Excluded (deliberate anti-goals — leave as-is) +- **T27** Promote-derived-to-base + cross-tenant template libraries. +- **T29** WhileTrue trigger mode for alarms (alarms are already level-based). +- **T37** Replace SignalR debug-view streaming (working code; pure refactor). +- **T38** True air-gapped env2 / 3rd-4th env / `--env` flag (env2 deliberately shares infra). +- **T39** Repo/folder rename away from ScadaBridge (kept to preserve context). +- **T40** Rename legacy Transport bundle manifests. + +## Approach + +**Risk-first milestones (Approach A).** Work is ordered by impact, not by component, and grouped into themed milestones using the repo's existing `m1…mN` implementation-doc convention (cf. `auditlog-m1..m8`). Phase 1 (M1–M4) *stabilizes* — a finite, ship-soon body of work that makes the docs true and the silent gaps real. Phase 2 (M5–M10) *expands* — an open-ended roadmap of independently-sized feature epics. + +Rejected alternatives: +- **Component-grouped** — cleaner per-component diffs, but spreads the high-risk Tier-1 work across late workstreams and ships no value early. +- **Two-track only** — good mental model (folded into the Phase 1/2 boundary here), but too coarse for sequencing. + +## Milestones + +### Phase 1 — Stabilize + +#### M1 — Runtime wiring (Tier 1: #3, #4, #5, #6) +Wire up behavior that exists in code but is never started, and fill the event-log categories. +- Wire `AuditLogPurgeActor` into Host bootstrap; drive partition-switch purge (365-day retention). (`AuditLogPurgeActor.cs`, `AkkaHostedService.cs:486`) +- Implement `IPullAuditEventsClient` and instantiate `SiteAuditReconciliationActor` (periodic `PullAuditEvents` fallback). (`IPullAuditEventsClient.cs`, `SiteAuditReconciliationActor.cs`) +- Site Call Audit: implement the per-site reconciliation pull (changed-since cursor read + central insert-if-not-exists/upsert-on-newer) and schedule `PurgeTerminalAsync` daily. (`SiteCallAuditActor.cs:28-34`, `SiteCallAuditRepository.cs:213`) +- Site Event Logging: inject `ISiteEventLogger` into `AlarmActor`/`NativeAlarmActor`, `DeploymentManagerActor`, Store-and-Forward, and `NotificationOutboxActor`; emit Alarm / Deployment / Data-Connection / S&F / Instance-Lifecycle / Notification events, and add script *started*/*completed* (Info) alongside the existing error events. +- **DoD:** actors instantiated and DI-registered; integration tests prove the purge actually switches out partitions and reconciliation back-fills a row dropped from telemetry; event-log rows produced for all 7 categories. + +#### M2 — Correctness & behavioral gaps (Tier 2) +- `Database.CachedWrite`: attempt immediately; permanent SQL error returns synchronously as `Failed` (mirror the API path); only transient errors buffer. (`DatabaseGateway.cs:78-204`) +- Alarm `conditionFilter`: build the OPC UA event `WhereClause` from the filter and honor it in `DataConnectionActor` routing; same plumbing for MxGateway. (`DataConnectionActor.cs:1482`, `RealOpcUaClient.cs:242`) +- Per-script execution timeout: add a field to `TemplateScript` + flattened config; apply in `ScriptExecutionActor`/`AlarmExecutionActor`; fall back to the global default. (`SiteRuntimeOptions.cs:31`) +- Connection-level diff: add a `ConnectionChanges` slot to `ConfigurationDiff`; call `ComputeConnectionsDiff` from `ComputeDiff`; surface in the deployment diff UI. (`DiffService.cs:174`) +- `MachineDataDb` fail-fast: add the option + startup validation (+ DbContext if the connection is actually consumed). (`DatabaseOptions.cs`, `StartupValidator.cs`) +- CI grep-guard: build step that fails on `UPDATE`/`DELETE` against `AuditLog` in the data layer. (`Component-AuditLog.md:335`) +- LDAP periodic re-query: implement the session-refresh path that re-queries LDAP so interactive roles are never >15 min stale (wire `JwtTokenService.RefreshToken`/`ShouldRefresh` or an `OnValidatePrincipal` revalidation). (security-relevant; pulled out of the session-doc reconcile) +- Low-severity batch (#19–#31): return-type compatibility check; argument *type* compatibility; native-alarm-source capability validation wired into the deploy pipeline; binding-completeness as a deploy-gating Error (+ "name exists at site" check); debug snapshot/subscribe error response for unknown instance; recursion-limit error → site event log; debug-stream snapshot/stream ordering + timestamp-dedup replay; OPC UA native-alarm transition field population; readiness "required singletons running" probe; register the SiteEventLog active-node purge gate; consume `FailedWriteCount` in Health Monitoring; reconcile `StateTransitionValidator` delete-from-`NotDeployed` with the spec matrix. +- **DoD:** unit + integration tests per behavior; where the fix corrects code, the spec already matched; where the spec was the divergence, it's updated in the same change. + +#### M3 — Script trust boundary (Tier 1: #1, #2) +- Wire the already-referenced `Microsoft.CodeAnalysis.CSharp.Scripting` into `ScriptCompiler.TryCompile` for a real semantic compile (errors block deploy). (`ScriptCompiler.cs:56-104`, `ValidationService.cs:128`) +- Replace the advisory substring forbidden-API scan with Roslyn symbol/semantic analysis that resolves aliases, `using static`, and `global::`; coordinate design-time enforcement with the Site Runtime sandbox so the trust boundary is authoritative. (`ScriptCompiler.cs:14-22`) +- Apply the same real compile to shared scripts. (`SharedScriptService.cs:168-206`) +- **DoD:** semantically-invalid C# fails validation; adversarial bypass tests (alias / `using static` / `global::` reaching a forbidden API) fail to deploy. + +#### M4 — Doc reconciliation (Tier 4, parallelizable) +- Update specs to the shipped re-architecture: `Component-ConfigurationDatabase.md` (collapsed `AuditLog` schema), `Component-Commons.md` (`AuditEvent`→`ZB.MOM.WW.Audit` package, `ApiKey` retirement, undocumented types/interfaces), `Component-InboundAPI.md` (Bearer auth, audit write timing, type validation), `Component-NotificationService.md`/`NotificationType` (Teams status), Security role names, `SiteCall` field names, `AuditKind` vocabulary. +- CLI docs: document the `bundle` group; fix README option-name drift (the README is the stale doc; `Component-CLI.md` mostly matches code); correct `audit query` options. +- Clear stale "deferred" markers/comments for shipped features (Transport CLI, `SourceNode`, Site Call Audit relay, bundle-import audit filter, M5 redaction comments, `AuditLogPage.HandleRowSelected`). +- **Code-vs-doc dispositions:** doc-update (code authoritative) for Bearer auth, fire-and-forget audit timing, JWT-in-cookie→cookie-only session model, `ExecuteReader`/`DbWrite` kind. Code-change (build it) for nested `Object`/`List` validation (#13) — done in M2's validator work — and the LDAP re-query (M2). +- **DoD:** no remaining doc↔code contradiction for in-scope components; CLI docs match registered commands/options. + +### Phase 2 — Expand + +#### M5 — Audit hardening (T1–T8) +Hash-chain tamper evidence (off by default, `verify-chain` made real); Parquet export/archival (replace the 501); per-channel retention overrides; tag-cascade for `ParentExecutionId` (thread writing-execution id through trigger-driven runs); ExecutionId/ParentExecutionId + SourceNode backfill on historical rows; per-node stuck-count KPIs; structured response capture (headers/content-type, inbound request headers, per-method opt-out, `AuditInboundCeilingHits` metric); CLI `audit tree`. + +#### M6 — Notifications (T9–T11) +Teams + other non-Email delivery adapters behind the existing `INotificationDeliveryAdapter` seam; `NotificationType` enum values; Central UI notification-list `Type` selector; historical/trend KPI charts (introduce a time-series store). + +#### M7 — OPC UA / MxGateway UX (T13–T17) +Dedicated operator Alarm Summary page; MxGateway secured writes (operator+verifier); OPC UA address-space search + `BrowseNext` paging; type-info surfacing + bulk override CSV import; "Verify endpoint" connectivity button + cert-management UI. + +#### M8 — Transport (T18, T20) +Site-scoped / instance-scoped artifact transport (name-mapping subsystem); per-line/Myers diff for Modified artifacts. + +#### M9 — Templates & authoring (T22–T26, T28, T30–T32) +Template tree search/filter; folder drag-drop + sibling reorder + root context menu; move data connection between sites; connection live-status indicators; base-template versioning "update-derived" flow + multi-level inheritance; strict expression-trigger analysis kind; schema-driven value-entry forms + hover/completion + JSON Schema `$ref`/library; CLI Retry/Discard for cached calls; unified notifications+site-calls outbox page. + +#### M10 — UI/UX platform (T33–T36, T41) +`IDialogService` modal abstraction; design-tokens/CSS-vars + dark-mode/theming; shared pagination+filter component; accessibility pass; Playwright alarm-override UI coverage. + +## Dependencies & sequencing + +- **M1 → M5** — audit hardening builds on the wired purge/reconciliation. +- **M6/T11** — depends on introducing a time-series store (new infra; size carefully). +- **M9/T26** — base-template versioning is the largest authoring item; may split. +- **M4** — runs anytime; cheap and high-clarity, good to interleave. +- **M3** — independent; can run in parallel with M1/M2. +- Phase 1 (M1–M4) should complete before Phase 2 work starts in earnest, so the foundation is true before features pile on. + +## Cross-cutting conventions (per CLAUDE.md) + +- Each milestone gets its own dated implementation plan in `docs/plans/` and (where useful) a `.tasks.json`. +- Design doc + code + entities/repos + actors/services + UI + tests + migrations + deploy config travel together in each slice. +- Every milestone is independently shippable: `dotnet build ZB.MOM.WW.ScadaBridge.slnx` green + relevant unit/integration tests pass; cluster-runtime changes rebuilt via `bash docker/deploy.sh`. +- M3 and the security items (M2 LDAP re-query) carry adversarial tests (bypass attempts), not just happy-path. +- `git diff` review before each commit; related changes committed together with a design-summary message. + +## Testing strategy + +- **M1:** integration tests against the cluster proving purge + reconciliation actually run (not just unit-level actor tests); event-log row assertions per category. +- **M2:** behavior-level unit tests per gap; CachedWrite + conditionFilter + per-script-timeout get integration coverage. +- **M3:** golden invalid scripts must fail; adversarial forbidden-API bypass corpus must fail to deploy. +- **M4:** doc-only — no test impact beyond keeping existing suites green; nested-type validation (#13) gets validator unit tests. +- **M5–M10:** standard unit + integration + Playwright (UI) coverage per feature; new infra (M6 time-series store) gets its own integration suite. + +## Open items / risks + +- M3 real-compile may surface latent invalid scripts in existing templates/fixtures — budget for fixture cleanup. +- M6 time-series store is the one genuinely-new piece of infrastructure; scope it deliberately (could reuse MS SQL with a rollup table rather than a new dependency). +- The Phase 2 roadmap is large; treat each milestone as a separate planning + implementation pass, not a single mega-effort. + +## Next step + +Hand off to the writing-plans skill to produce the detailed, bite-sized implementation plan, starting with **Phase 1 (M1–M4)**. Phase 2 milestones are planned individually as they're picked up. diff --git a/stillpending.md b/stillpending.md new file mode 100644 index 00000000..24a465f9 --- /dev/null +++ b/stillpending.md @@ -0,0 +1,217 @@ +# ScadaBridge — Pending / Deferred / Partial / Missing Functionality Audit + +**Date:** 2026-06-15 +**Scope:** Full system — design specs (`docs/requirements/`), all of `src/`, the Central UI / CLI / Management Service surfaces, and the plan/checklist archive (`docs/plans/`). +**Method:** Five parallel read-only investigators, each verifying doc claims against actual code (file:line evidence). Top findings were independently corroborated by 2+ agents. + +## Executive summary + +The codebase is unusually clean: **zero real `TODO`/`FIXME` markers in `src/`**, and all 11 implementation phases self-report complete. Consequently the unfinished work does **not** announce itself — it hides in three forms: + +1. **Silent gaps (Tier 1)** — documented as working, not marked deferred, but absent or inert in production. +2. **Partial / behaviorally-divergent functionality (Tier 2)** — real, but narrower or different from the spec. +3. **Intentional deferrals (Tier 3)** — knowingly punted, correctly documented, with extensible seams. *Not defects.* +4. **Doc↔code drift (Tier 4)** — code is fine; the specs describe a superseded architecture. + +**Actionable risk is concentrated in Tier 1.** Recommended starting point: **#3 / #4 (wire the two never-started audit actors)** — highest impact, smallest blast radius. + +--- + +## Tier 1 — Silent gaps: documented as working, but not actually running + +These are the dangerous ones. Specs present them as live behavior, they are **not** marked deferred, yet the functionality is absent or inert. + +| # | Gap | Where | Impact | +|---|-----|-------|--------| +| **1** | **Script "test-compilation" does no real compilation.** The headline pre-deploy gate ("scripts must compile without errors") is a brace-balance + forbidden-API *substring* scan. No Roslyn reference in the project. | `TemplateEngine/Validation/ScriptCompiler.cs:56-104`; used by `ValidationService.cs:128` | Semantically-broken C# passes validation and **deploys**. Found independently by 2 agents. | +| **2** | **Script forbidden-API gate is bypassable.** The script trust model's only design-time enforcement is the same substring scan — defeated by aliases / `using static` / `global::`. Self-documented as "SECURITY LIMITATION (TemplateEngine-006)". | `ScriptCompiler.cs:14-22,61-72`; `ValidationService.cs:346` | Security boundary is advisory only. 2 agents. | +| **3** | **Audit Log 365-day retention purge never starts.** `AuditLogPurgeActor` exists but has **zero** `ActorOf`/`Props.Create` callers; only the roll-*forward* partition service runs. | `AuditLog/.../AuditLogPurgeActor.cs:58`; `AkkaHostedService.cs:486` ("wired in a later bundle") | The documented purge does not run in production; `AuditLog` grows unbounded. | +| **4** | **Audit Log reconciliation self-heal never wired.** `IPullAuditEventsClient` has no implementation; `SiteAuditReconciliationActor` is never instantiated. | `IPullAuditEventsClient.cs:31`; `SiteAuditReconciliationActor.cs:68` | The documented "lost-telemetry fallback" doesn't exist; forward telemetry is the only path. | +| **5** | **Site Call Audit reconciliation pull + daily purge both missing.** The actor's own docstring admits it. `PurgeTerminalAsync` is implemented but never invoked. | `SiteCallAuditActor.cs:28-34`; `SiteCallAuditRepository.cs:213` | `SiteCalls` mirror has no self-heal and grows unbounded. | +| **6** | **Site Event Logging emits only 2 of 7 documented categories.** Only `connection` + `script` (error-path only). Alarm, Deployment, Store-and-Forward, Instance-Lifecycle, Notification events are never logged — `ISiteEventLogger` isn't injected into those subsystems. | `DataConnectionActor.cs`, `ScriptExecutionActor.cs` (only emitters); spec "Events Logged" §20-28 | Operational event log is materially incomplete vs spec. | + +--- + +## Tier 2 — Partial / behaviorally-divergent functionality + +Real, but narrower than the spec — wrong in a way that could surprise an operator or script author. + +| # | Gap | Where | +|---|-----|-------| +| **7** | **`Database.CachedWrite` misclassifies permanent SQL errors as transient** → retries forever instead of failing fast to the script. The API path does it right; the DB path does not. No immediate attempt, no synchronous permanent-`Failed` return. | `DatabaseGateway.cs:78-204` (cf. `ExternalSystemClient.cs:100-161`) | +| **8** | **Alarm `conditionFilter` is plumbed end-to-end but applied nowhere** — set a filter on a native-alarm source and it silently mirrors *all* conditions. | `DataConnectionActor.cs:1482,1540-1554`; `RealOpcUaClient.cs:242,295`; `MxGatewayDataConnection.cs:154-167` | +| **9** | **Per-script execution timeout doesn't exist** — spec promises per-script; only a global `ScriptExecutionTimeoutSeconds`. No field in the template/flattened model to carry it. | `SiteRuntimeOptions.cs:31`; `ScriptExecutionActor.cs:100`; `AlarmExecutionActor.cs:66` | +| **10** | **Connection-level diffs never surface in the deployment diff** — `ComputeConnectionsDiff` is dead code (no callers); `ConfigurationDiff` has no slot for it. Per-attribute binding drift *is* caught; standalone connection endpoint (protocol/config/failover) diff is not. | `DiffService.cs:158-204`; `Commons/Types/Flattening/ConfigurationDiff.cs:7-24` | +| **11** | **Inbound API auth transport drift** — code uses `Authorization: Bearer sbk__`; doc says `X-API-Key` header. | `InboundAPI/.../EndpointExtensions.cs:83-90` | +| **12** | **Inbound API audit write is fire-and-forget *after* response flush** — doc says synchronous *before* flush. Row is still emitted (fail-soft), just non-blocking and after the body is forwarded. | `AuditWriteMiddleware.cs:195-212,281-290` | +| **13** | **Inbound `Object`/`List` extended types are shape-validated only** — no nested/field-level type validation, despite spec implying typed/nested validation. | `ParameterValidator.cs:109-145`; `ReturnValueValidator.cs:18` | +| **14** | **JWT-in-cookie session design not implemented** — `/auth/login` signs a plain `ClaimsPrincipal`; `GenerateToken` only used by the CLI `/auth/token` path; `ValidateToken` has no external callers. | `AuthEndpoints.cs:38,75-112,152`; `ServiceCollectionExtensions.cs:99-118` | +| **15** | **"Re-query LDAP every 15 min / roles never >15 min stale" not implemented for interactive sessions** — `JwtTokenService.RefreshToken`/`RecordActivity`/`ShouldRefresh`/`IsIdleTimedOut` have **zero** call sites; roles fixed until cookie expiry. The 15-min sliding + 30-min idle layers are collapsed into a single 30-min sliding cookie window. | `JwtTokenService.*` (no callers); `ServiceCollectionExtensions.cs:99-148` | +| **16** | **Transport stale-instance enumeration always returns empty** — `BundleImporter` returns `Array.Empty()`; UI shows a generic warning with no count, link not filtered to stale instances. | `BundleImporter.cs:733`; `TransportImport.razor:347-388` | +| **17** | **`MachineDataDb` fail-fast requirement not enforced** — spec (REQ-HOST-3/4) requires central nodes to validate a non-empty `MachineDataDb` connection string. `DatabaseOptions` has only `ConfigurationDb`/`SiteDbPath`; validator never checks it; 0 `grep` hits in `src/`. Key lives only in docker appsettings as dead config. | `DatabaseOptions.cs:6-12`; `StartupValidator.cs:60-61` | +| **18** | **CI grep-guard against `UPDATE/DELETE … AuditLog` not in the repo** — spec claims a build-time grep that fails on data-layer mutations. DB-role DENY enforcement *is* present in migrations (so this is a backstop, not the only control), but the claimed code-level guard is absent. | spec `Component-AuditLog.md:335-336`, `Component-ConfigurationDatabase.md:297` | + +### Lower-severity Tier-2 / behavioral notes + +| # | Gap | Where | +|---|-----|-------| +| 19 | **Script "started"/"completed" events not logged** (only failures, severity `Error`). | `ScriptExecutionActor.cs:239,256`; `ScriptActor.cs:369` | +| 20 | **Return-type compatibility check is dead scaffolding** — `BuildReturnMap` builds maps never read; no return-type comparison runs. | `SemanticValidator.cs:62-63,279-287` | +| 21 | **Argument *type* compatibility not checked** — only arg *count* (comma counting). | `SemanticValidator.cs:251-266,390-425` | +| 22 | **Native-alarm-source connection-capability validation never runs in deploy pipeline** — `alarmCapableConnectionNames` param no production caller supplies. | `SemanticValidator.cs:30-33,239-245`; `FlatteningPipeline.cs:93,115` | +| 23 | **Connection-binding completeness is a non-blocking Warning, not deploy-gating Error**; "name exists at site" half missing. | `ValidationService.cs:504-519`; `ValidationResult.cs:9` | +| 24 | **Debug snapshot/subscribe for unknown instance returns empty snapshot, not error** — caller can't distinguish "not deployed" from "deployed but empty." | `DeploymentManagerActor.cs:845-866` | +| 25 | **Recursion-limit error logged to .NET `ILogger`, not the site event log** as spec requires. | `ScriptRuntimeContext.cs:302-305,464-466` | +| 26 | **Debug-stream snapshot/stream ordering reversed; no timestamp-dedup replay** — `PreStart` sends snapshot first, opens stream after; gap-window events lost (spec wants stream-first + replay/dedup). | `DebugStreamBridgeActor.cs:89-103,163-166` | +| 27 | **OPC UA native-alarm transition leaves several display fields empty** (Category/Description/OperatorUser/OriginalRaiseTime/CurrentValue/LimitValue) — partly by design. | `RealOpcUaClient.cs:395-403`; `MxGatewayAlarmMapper.cs:79-113` | +| 28 | **Readiness gate omits "required cluster singletons running" criterion** — covers membership + DB connectivity only (softened by spec's "(if applicable)"). | `Program.cs:188-201,314-317`; `AkkaClusterHealthCheck.cs:54` | +| 29 | **SiteEventLog active-node purge gate never registered** — `SiteEventLogActiveNodeCheck` not added to DI; purge defaults to `() => true`, runs on standby too (harmless, but documented restriction unenforced). | `SiteEventLogging/ServiceCollectionExtensions.cs:33-37`; `EventLogPurgeService.cs:61` | +| 30 | **`FailedWriteCount` metric exposed "for future Health Monitoring" but never consumed** — dangling metric. | `ISiteEventLogger.cs:32-40` | +| 31 | **`StateTransitionValidator` allows Delete from `NotDeployed`; spec matrix says No** (deliberate per code comment, contradicts doc). | `StateTransitionValidator.cs:38-39` | + +--- + +## Tier 3 — Intentional deferrals (correctly documented — NOT defects) + +Knowingly punted, with extensible seams and explicit doc notes. `[PERM]` = permanent / v-next; `[SLICE]` = deferred-to-a-later-slice with seam present. + +**Centralized Audit Log (#23)** +- `[PERM]` Hash-chain tamper evidence (v1.x). `verify-chain` CLI is a no-op stub that prints "not enabled in this release". — `AuditCommands.cs:243-246`; `AuditVerifyChainHelpers.cs:6-8` +- `[PERM]` Parquet export/archival. Server returns HTTP `501`; CSV + JSONL implemented. — `AuditEndpoints.cs:188-194`; `AuditExportHelpers.cs:139-148` +- `[PERM]` Per-channel retention overrides. — `2026-05-20-audit-log-code-roadmap.md:16` +- `[PERM]` Tag-cascade for `ParentExecutionId` — only the inbound-API→routed-site bridge is built; trigger-driven runs pass `parentExecutionId = null`. — `ScriptActor.cs:404,429`; `2026-05-21-audit-parent-executionid-design.md:209` +- `[PERM]` ExecutionId/ParentExecutionId backfill on historical rows; SourceNode backfill on legacy rows; per-node stuck-count KPIs. +- `[PERM]` Structured/response-header response capture; inbound request-header capture; per-method opt-out; `AuditInboundCeilingHits` metric. — `2026-05-23-inbound-api-full-response-audit-design.md:113-127` +- *Uncertain:* CLI `audit tree` command (doc "maybe", not found in CLI). + +**Notifications (#8 / #21)** +- `[SLICE]` Teams (and all non-Email) notification types — `INotificationDeliveryAdapter` seam exists, only `EmailNotificationDeliveryAdapter` implemented; `NotificationType` enum is Email-only. Missing-adapter path parks gracefully. — `NotificationType.cs:6-9`; `NotificationOutboxActor.cs:457-474` +- `[SLICE]` Central UI notification-list form has no `Type` selector (Email hard-coded). — `NotificationListForm.razor` +- `[PERM]` Historical/trend KPI charts (no time-series store). + +**Native Alarms / MxGateway / OPC UA** +- `[PERM]` Native-alarm ack/shelve/suppress write-back; central alarm tables/history/journal; alarm-driven notifications/scripts — read-only by design. — `2026-05-29-native-alarms-design.md:201-206` +- `[SLICE]` Dedicated operator Alarm Summary page (DebugView only for now). +- `[PERM]` MxGateway secured writes (operator+verifier). +- `[SLICE]` OPC UA address-space search; `BrowseNext` paging. — `RealOpcUaClient.cs:574` +- `[PERM]` OPC UA type-info surfacing; bulk override import/CSV. +- `[SLICE]` OPC UA "Verify endpoint" connectivity button; cert-management UI. + +**Transport (#24)** +- `[PERM]` Site-scoped / instance-scoped artifact transport (needs name-mapping subsystem). +- `[PERM]` Direct cluster-to-cluster pull; asymmetric bundle signing; differential/incremental bundles. +- `[PERM/SLICE]` Per-line/Myers diff for Modified artifacts (coarse line-count delta only). — `ArtifactDiff.cs:18-25` + +**TreeView** +- `[SLICE/PERM]` R6 lazy-loading, R7 keyboard nav, R16 multi-select — spec marks all "(Deferred)". — `Component-TreeView.md:87-93,288-295` + +**Templates / Data Connections / Triggers UI** +- `[SLICE]` Template tree search/filter; `[PERM]` folder drag-drop, sibling reorder, root context menu. +- `[PERM]` Move data connection between sites; `[SLICE]` connection live-status indicators (blocked on DCL state surfacing). +- `[SLICE]` Base-template versioning "update-derived" flow; multiple inheritance levels; `[PERM]` promote-derived-to-base, cross-tenant libraries. +- `[SLICE]` Strict expression-trigger analysis kind; `[PERM]` WhileTrue trigger mode for alarms. +- `[SLICE]` Schema-driven value-entry forms; schema hover/completion; `[PERM]` JSON Schema `$ref` reuse / template-level schema library. + +**Cached-call tracking (#6 / #22)** +- `[SLICE]` CLI surface for site-local Retry/Discard of cached calls; `[PERM]` unified notifications+site-calls outbox page. + +**UI audit backlog (`2026-05-12-ui-audit.md:536-554`)** +- `IDialogService` modal abstraction; design-tokens/CSS-vars; dark-mode/theming; shared pagination+filter component; accessibility pass; replacing SignalR debug-view streaming. + +**Environment / tooling** +- `[PERM]` True air-gapped second environment (env2 shares MSSQL/LDAP/SMTP); 3rd/4th env; `--env` flag on `deploy.sh`. +- `[PERM]` Repo/folder rename (kept as ScadaBridge to preserve context). +- `[SLICE]` Playwright alarm-override UI coverage. + +--- + +## Tier 4 — Doc↔code drift (code is fine; docs describe a superseded architecture) + +Worth fixing for anyone relying on the docs as the spec. + +**Config DB / Commons re-architecture not reflected in specs (High doc-impact):** +- `AuditLog` table collapsed to 10 canonical + `DetailsJson` + 6 PERSISTED `JSON_VALUE` computed cols; doc still lists ~24 typed columns (`Kind`, `HttpStatus`, `RequestSummary`, …). — migration `20260602174346_CollapseAuditLogToCanonical.cs`; `Entities/AuditLogRow.cs:54-136` +- `AuditEvent` moved out of Commons into the external `ZB.MOM.WW.Audit` NuGet package; doc (REQ-COM-1/3/5b) still describes it as a Commons type. — `Commons.csproj:11` +- `ApiKey` entity / API-key persistence retired to shared `ZB.MOM.WW.Auth.ApiKeys` SQLite store; doc still lists `ApiKey` + `ApprovedApiKeyIds`. — migration `20260602092753_RetireInboundApiKeyStore.cs` + +**CLI docs drift (README is the stale doc; `Component-CLI.md` mostly matches code):** +- Entire `bundle` (Transport #24) command group is shipped + registered but documented in **neither** `Component-CLI.md` **nor** `CLI/README.md`. — `Program.cs:36`; `BundleCommands.cs:24-372` +- `security api-key create` requires undocumented `--methods` (Required); docs show only `--name`. — `SecurityCommands.cs:41-45` +- `security api-key update`/`delete` use `--key-id`; docs document `--id` (and an unwired `--name` on update). — `SecurityCommands.cs:60,71` +- `security api-key set-methods` subcommand exists in code, documented nowhere. — `SecurityCommands.cs:91-102` +- `api-method create` uses required `--script`; docs document `--code` + `--description` (neither exists). README is internally inconsistent (create=`--code`, update=`--script`). — `ApiMethodCommands.cs:57-62` +- `db-connection create`/`update` documented with `--provider`; code has no such option. — `DbConnectionCommands.cs:56-72` +- Widespread README option-name drift where `Component-CLI.md` already matches code (scope-rule `--mapping-id`, health `--site`/`--keyword`, template attribute `--value`/`--data-source`, template alarm `--trigger-type`/`--priority`/`--trigger-config`, composition delete `--id`, etc.). +- `audit query` doc lists `--page` (code is keyset-only `--all`); undocumented `--execution-id`/`--parent-execution-id` filters exist. + +**Stale "deferred" markers for things that have actually SHIPPED:** +- Transport CLI (`bundle export/preview/import`) — design doc §13 said "deferred"; now implemented. +- `SourceNode` capture — `.tasks.json` shows all 21 tasks "pending"; fully implemented across Commons/AuditLog/NotificationOutbox/SiteCallOperational. +- Site Call Audit Retry/Discard relay — DI comment says deferred; implemented + wired (`SiteCallAuditActor.cs:150-156,450-505`; `AkkaHostedService.cs:580-589`). +- Bundle-import audit filter UI (Transport-012) — doc says deferred follow-up; shipped (`ConfigurationAuditLog.razor` `?bundleImportId=` filter). +- Redaction/payload-cap "deferred to M5" comments in Site Runtime — already shipped (`ScadaBridgeAuditRedactor`, `AuditLogOptions.DefaultCapBytes/ErrorCapBytes`). +- `AuditLogPage.HandleRowSelected` class comment says "no-op seam"; method is fully wired (opens drawer). + +**Other doc/spec inconsistencies (code richer/different than doc):** +- Security role names: doc says Admin/Design/Deployment; code uses Administrator/Designer/Deployer/Viewer (canonicalized via migration). +- `SiteCall` entity field names diverge from doc (`Channel` not `Kind`, `SourceSite` not `SourceSiteId`, adds `HttpStatus`/`IngestedAtUtc`). +- `ExecuteReader` audited as `DbWrite` (read/write distinguished via `Extra` JSON `op`, not a distinct `AuditKind`). +- Inbound audit doc references `ApiInbound.Completed`; actual kinds are `InboundRequest`/`InboundAuthFailure`. +- `Teams` claimed present in `NotificationType` enum by Commons/ConfigDB docs; enum is Email-only. +- Commons under-documents shipped code: MxGateway endpoint serializer/validator/config, `Observability/ScadaBridgeTelemetry.cs`, `IInboundApiKeyAdmin`, `IAuditActorAccessor` — none in the doc folder map. +- `IHealthMonitoringRepository` listed in ConfigDB repo table but doesn't exist (doc annotated "future"). +- `requirements-traceability.md` and many `.md.tasks.json` show "Pending" for shipped features — they track *plan generation*, not implementation; unreliable as a status source. +- `ExternalSystemForm` "Recent audit activity" drill-in omits `channel=ApiOutbound` and uses exact-match `target` instead of starts-with (sibling `ApiKeyForm` link is correct). — `ExternalSystemForm.razor:20-24` + +--- + +## Code-level sweep — investigated and ruled out (false positives) + +For completeness, items that *look* unfinished but are intentional: +- ~44 empty `catch` blocks — all have explanatory comments / intentional fallback (JSON parse → default; disposal-race `ObjectDisposedException`). None silently swallow real errors. +- `SiteNotificationRepository` / `SiteExternalSystemRepository` write methods throw `NotSupportedException` — by design (site config is read-only, managed via central deployment). +- `StubOpcUaClient` (canned data; `BrowseChildrenAsync` throws `NotImplementedException`) — dev/test-only; production wires `RealOpcUaClientFactory`/`RealMxGatewayClientFactory` (`DataConnectionFactory.cs:38-47`). +- `NoOpSiteStreamAuditClient`, `SandboxNotifyHelper`, sandbox host fakes — legitimate DI-default / test composition seams. +- `AddSecurityActors` / `AddTemplateEngineActors` "Phase 0 placeholder" registrations — intentional empty seams (actor wiring lives in Host). +- Migration `Down()` `NotSupportedException`, MxGateway/Bundle version-rejection `NotSupportedException`, `AuditWriteMiddleware` write-only-stream `NotSupportedException` — intentional guards. +- Management Service: 113 handlers; all wire-registered `Mgmt*` commands dispatched. The three "unhandled" (`ResolveRolesCommand` retired; `BrowseNodeCommand`/`ReadTagValuesCommand` routed direct-to-site) are intentional. +- Central UI: no stub/placeholder pages, no `NotImplementedException`, no "coming soon" banners, no no-op `@onclick`. `disabled=`/`placeholder=` usages are legitimate (loading guards, edit locks, HTML hints). + +--- + +## Phase completeness (self-reported) + +All 11 phases report Complete with passing verification gates: + +- **Phase 0** Solution Skeleton — Complete (gate 11/11, 57 tests). +- **Phase 1** Central Foundations — Complete (gate 20/20, 186 pass + 1 live-LDAP skip). +- **Phase 2** Modeling & Validation — Complete (gate 9/9, 359 tests). +- **Phase 3A** Runtime Foundation — Complete (gate 13/13, 389/389). +- **Phase 3B** Site I/O & Observability — Complete (gate 11/11, 541 cumulative). +- **Phase 3C** Deployment & Store-Forward — Complete (terse checklist). +- **Phase 4** Operator UI — Complete (terse checklist). +- **Phase 5** Authoring UI — Complete (terse checklist). +- **Phase 6** Deployment & Ops UI — Complete (terse checklist; Codex external-review step skipped, best-effort). +- **Phase 7** Integrations — Complete (terse checklist; Q12 SMTP-OAuth2 is a test-env dependency). +- **Phase 8** Production Readiness — Complete (terse checklist). + +The ~665 unchecked `- [ ]` items in phase *plan* docs are spec-traceability references (each dispositioned Pass / Out-of-scope in Forward/Reverse tables), a documentation style — not a TODO list. + +**Operational (not code):** `docs/deployment/production-checklist.md` has ~60 unchecked install-time operator steps (env vars, connection strings, firewall ports 8081/636/587/1433, TLS certs, smoke tests). + +--- + +## Confidence & caveats + +- **High confidence** on Tier 1 — each item verified by reading the code (class/interface existence + absence of callers via grep); top items corroborated by 2+ independent agents. +- Terse Phase 3C–8 checklists self-report "Complete / tests passing" with no per-gate breakdown; test counts for those phases were not independently re-run. +- Actual `src/` artifacts were treated as truth over `.tasks.json` status fields, which are demonstrably stale. +- Items marked *Uncertain* (e.g. `audit tree` CLI, per-channel retention) rest on doc text only. + +## Recommended next steps + +1. **Wire the two never-started audit actors** (#3, #4) — highest impact, smallest blast radius (DI/Host wiring + `IPullAuditEventsClient` impl). +2. **Site Call Audit reconciliation + purge** (#5) — same shape as #3/#4. +3. **Decide on script compilation/security** (#1, #2) — either implement the Roslyn gate or downgrade the spec's claims; currently the strongest functional + security gap. +4. **Site Event Logging categories** (#6) — inject `ISiteEventLogger` into the 5 missing subsystems. +5. **Reconcile Tier-4 doc drift** — update Config DB / Commons specs for the audit/auth re-architecture and the CLI docs for the `bundle` group + option names.