Phase 6 — Four implementation plans for unplanned v2 features, each with codex adversarial review #76

Merged
dohertj2 merged 1 commit from phase-6-plans-drafts into v2 2026-04-19 03:17:17 -04:00

Summary

After drivers were paused, audited plan.md, driver-stability.md, acl-design.md, and admin-ui.md for features that are documented but unshipped. Four coherent tracks had no implementation plan at all. This PR drafts a plan for each track, runs each plan through a Codex adversarial review, and bakes the findings into the plan.

Plans drafted

  • docs/v2/implementation/phase-6-1-resilience-and-observability.md — Polly pipelines, Tier A/B/C runtime enforcement, health endpoints, structured logging + correlation IDs, LiteDB fallback
  • docs/v2/implementation/phase-6-2-authorization-runtime.md — ACL permission-trie evaluator on Read/Write/Subscribe paths, LdapGroupRoleMapping, per-session cache
  • docs/v2/implementation/phase-6-3-redundancy-runtime.md — Dynamic ServiceLevel, ServerUriArray, mid-apply dip, operator-driven role transition
  • docs/v2/implementation/phase-6-4-admin-ui-completion.md — UNS drag-reorder + impact preview, CSV import, 5-identifier search, draft-diff enhancements, OPC 40010 Identification exposure

Each plan follows the existing phase-*.md template (Entry Gate, Streams A-E, Compliance Checks, Risks, Completion Checklist).
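
As an illustration of what the Phase 6.2 permission-trie evaluator might look like, here is a minimal sketch. The six level names come from this PR's commit message; everything else (the `Permission` flag names, the `PermissionTrie` class, and especially the additive top-down inheritance rule) is a hypothetical assumption for illustration, not the design recorded in the plan or in acl-design.md.

```python
from dataclasses import dataclass, field
from enum import Flag, auto

# Level names taken from the commit message; used here only to bound path depth.
LEVELS = ("Cluster", "Namespace", "UnsArea", "UnsLine", "Equipment", "Tag")


class Permission(Flag):
    # Hypothetical flag set; the real permission flags live in acl-design.md.
    NONE = 0
    BROWSE = auto()
    READ = auto()
    WRITE = auto()
    SUBSCRIBE = auto()


@dataclass
class TrieNode:
    grants: Permission = Permission.NONE
    children: dict = field(default_factory=dict)


class PermissionTrie:
    """Grants attached at any level apply to everything beneath that level.

    ASSUMPTION: inheritance is purely additive (no deny rules).
    """

    def __init__(self) -> None:
        self.root = TrieNode()

    def grant(self, path: tuple, perms: Permission) -> None:
        if len(path) > len(LEVELS):
            raise ValueError("path is deeper than the 6-level hierarchy")
        node = self.root
        for segment in path:
            node = node.children.setdefault(segment, TrieNode())
        node.grants |= perms

    def effective(self, path: tuple) -> Permission:
        # Walk from the root, accumulating grants until the path leaves the trie.
        node = self.root
        perms = node.grants
        for segment in path:
            node = node.children.get(segment)
            if node is None:
                break
            perms |= node.grants
        return perms

    def allows(self, path: tuple, needed: Permission) -> bool:
        return needed in self.effective(path)
```

A grant at the UnsArea level would then cover every tag under that area, while a Tag-level grant stays local to that tag. A per-session cache, as the plan describes, could memoize `effective()` results and be cleared on generation-apply or LDAP-cache expiry.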

Adversarial review

Codex ran read-only-sandbox reviews against each plan with explicit focus on decision-log conflicts, unbounded blast radius, under-specified state transitions, wrong primitives, and testing holes.

Real issues surfaced (each has an adjustment in the plan):

  • 6.1 — Auto-retry conflicting with decisions #44-45 no-auto-write-retry; per-instance pipeline breaking #35 per-device isolation; Tier A/B recycle breaching #73-74 Tier-C-only; watchdog formula ignoring #70
  • 6.2 — LdapGroupRoleMapping conflated with data-plane ACLs; Browse enforcement missing entirely; HistoryRead using wrong permission flag; subscription re-auth policy unresolved
  • 6.3 — ServiceLevel=0 colliding with OPC UA Part 5 Maintenance; ServerUriArray missing self; Kepware/Aveva cutover unverified hearsay; apply-window race on cancellation
  • 6.4 — Stale UNS impact preview overwriting concurrent drafts; identifier contract drifting from canonical decision #117 set; CSV atomicity contradictory; OPC 40010 fields not matching decision #139

Each finding is documented in the plan's Adversarial Review section with severity / verdict / adjustment, so the next session executes against the corrected plan rather than the original draft.
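
To make the first 6.1 finding concrete, the following sketch shows one way a resilience wrapper could honor a no-auto-write-retry rule: reads are retried on transient failure, writes get exactly one attempt. This is a language-neutral illustration, not Polly's API and not the adjusted plan's design; `OpKind`, `TransientError`, and `execute` are hypothetical names.

```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class OpKind(Enum):
    READ = "read"
    WRITE = "write"


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def execute(op: Callable[[], T], kind: OpKind,
            max_attempts: int = 3, base_delay: float = 0.0) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Writes are never retried automatically: a retried write could be
    # applied twice if the first attempt succeeded but its reply was lost.
    attempts = max_attempts if kind is OpKind.READ else 1
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == attempts:
                raise
            # Exponential backoff between read retries.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Calling `execute(read_tag, OpKind.READ)` retries up to three times, whereas `execute(write_tag, OpKind.WRITE)` surfaces the first failure to the caller, who decides whether re-issuing the write is safe.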

Validation

  • Pure documentation — no code
  • Codex thread IDs cited in each plan for reproducibility
  • Plans remain DRAFT status; each becomes its own implementation phase with Entry/Exit gates when prioritized

Test plan

  • Each plan has all required sections per the phase-*.md template
  • Every finding has an explicit verdict (ACCEPT / REJECT) + concrete adjustment
  • Cross-references to plan.md decisions use decision numbers
dohertj2 added 1 commit 2026-04-19 03:17:13 -04:00
Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments.

After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist).

docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36.

phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry.

phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope).

phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139.

Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted).

Review surfaced real substantive issues that the initial drafts glossed over:

  • Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule
  • Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation
  • Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint
  • Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants
  • Phase 6.2 missing Browse enforcement entirely
  • Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish
  • Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics
  • Phase 6.3 ServerUriArray excluding self (spec-bug)
  • Phase 6.3 apply-window counter race on cancellation
  • Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay
  • Phase 6.4 stale UNS impact preview overwriting concurrent draft edits
  • Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2)
  • Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts)
  • Phase 6.4 OPC 40010 field list not matching decision #139

Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility.

Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes.

4695a5c88e
dohertj2 merged commit 81a1f7f0f6 into v2 2026-04-19 03:17:17 -04:00
dohertj2 referenced this issue from a commit 2026-04-30 08:21:24 -04:00
Phase 2 Stream D Option B — archive v1 surface + new Driver.Galaxy.E2E parity suite.

Non-destructive intermediate state: the v1 OtOpcUa.Host + Historian.Aveva + Tests + IntegrationTests projects all still build (494 v1 unit + 6 v1 integration tests still pass when run explicitly), but solution-level `dotnet test ZB.MOM.WW.OtOpcUa.slnx` now skips them via IsTestProject=false on the test projects + archive-status PropertyGroup comments on the src projects. The destructive deletion is reserved for Phase 2 PR 3 with explicit operator review per CLAUDE.md "only use destructive operations when truly the best approach".

tests/ZB.MOM.WW.OtOpcUa.Tests/ renamed via git mv to tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive/; csproj <AssemblyName> kept as the original ZB.MOM.WW.OtOpcUa.Tests so v1 OtOpcUa.Host's [InternalsVisibleTo("ZB.MOM.WW.OtOpcUa.Tests")] still matches and the project rebuilds clean. tests/ZB.MOM.WW.OtOpcUa.IntegrationTests gets <IsTestProject>false</IsTestProject>. src/ZB.MOM.WW.OtOpcUa.Host + src/ZB.MOM.WW.OtOpcUa.Historian.Aveva get PropertyGroup archive-status comments documenting they're functionally superseded but kept in-build because cascading dependencies (Historian.Aveva → Host; IntegrationTests → Host) make a single-PR deletion high blast-radius.

New tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ project (.NET 10) with ParityFixture that spawns OtOpcUa.Driver.Galaxy.Host.exe (net48 x86) as a Process.Start subprocess with OTOPCUA_GALAXY_BACKEND=db env vars, awaits 2s for the PipeServer to bind, then exposes a connected GalaxyProxyDriver; skips on non-Windows / Administrator shells (PipeAcl denies admins per decision #76) / ZB unreachable / Host EXE not built — each skip carries a SkipReason string the test method reads via Assert.Skip(SkipReason). RecordingAddressSpaceBuilder captures every Folder/Variable/AddProperty registration so parity tests can assert on the same shape v1 LmxNodeManager produced.

HierarchyParityTests (3) — Discover returns gobjects with attributes; attribute full references match the tag.attribute Galaxy reference grammar; HistoryExtension flag flows through correctly.

StabilityFindingsRegressionTests (4) — one test per 2026-04-13 stability finding from commits c76ab8f and 7310925: phantom probe subscription doesn't corrupt unrelated host status; HostStatusChangedEventArgs structurally carries a specific HostName + OldState + NewState (event signature mathematically prevents the v1 cross-host quality-clear bug); all GalaxyProxyDriver capability methods return Task or Task<T> (sync-over-async would deadlock OPC UA stack thread); AcknowledgeAsync completes before returning (no fire-and-forget background work that could race shutdown).

Solution test count: 470 pass / 7 skip (E2E on admin shell) / 1 pre-existing Phase 0 baseline. Run archived suites explicitly: `dotnet test tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive` (494 pass) + `dotnet test tests/ZB.MOM.WW.OtOpcUa.IntegrationTests` (6 pass).

docs/v2/V1_ARCHIVE_STATUS.md inventories every archived surface with run-it-explicitly instructions + a 10-step deletion plan for PR 3 + rollback procedure (git revert restores all four projects). docs/v2/implementation/exit-gate-phase-2-final.md supersedes the two partial-exit docs with the per-stream status table (A/B/C/D/E all addressed, D split across PR 2/3 per safety protocol), the test count breakdown, fresh adversarial review of PR 2 deltas (4 new findings: medium IsTestProject=false safety net loss, medium structural-vs-behavioral stability tests, low backend=db default, low Process.Start env inheritance), the 8 carried-forward findings from exit-gate-phase-2.md, the recommended PR order (1 → 2 → 3 → 4). docs/v2/implementation/pr-2-body.md is the Gitea web-UI paste-in for opening PR 2 once pushed.