lmxopcua/docs/v2/implementation/phase-6-1-resilience-and-observability.md
Joseph Doherty 4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). 
Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes.
2026-04-19 03:15:00 -04:00


Phase 6.1 — Resilience & Observability Runtime

Status: DRAFT — implementation plan for a cross-cutting phase that was never formalised. The v2 plan.md specifies Polly, Tier A/B/C protections, structured logging, and local-cache fallback by decision; none are wired end-to-end.

Branch: v2/phase-6-1-resilience-observability
Estimated duration: 3 weeks
Predecessor: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
Successor: Phase 6.2 (Authorization runtime)

Phase Objective

Land the cross-cutting runtime protections + operability features that plan.md + driver-stability.md specify by decision but that no driver-phase actually wires. End-state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central DB outage via a LiteDB local cache.

Closes these gaps flagged in the 2026-04-19 audit:

  1. Polly v8 resilience pipelines wired to every IDriver capability (no-op per-driver today; Galaxy has a hand-rolled CircuitBreaker only).
  2. Tier A/B/C enforcement at runtime — driver-stability.md §24 and decisions #63-#73 define memory watchdog, bounded queues, scheduled recycle, wedge detection; MemoryWatchdog exists only inside Driver.Galaxy.Host.
  3. Health endpoints (/healthz, /readyz) on OtOpcUa.Server.
  4. Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
  5. LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).

Scope — What Changes

| Concern | Change |
| --- | --- |
| Core → new Core.Resilience sub-namespace | Shared Polly pipeline builder (DriverResiliencePipelines), per-capability policy (Read / Write / Subscribe / HistoryRead / Discover / Probe / Alarm). One pipeline per driver instance; driver-options decide tuning. |
| Every IDriver* consumer in the server | Wrap capability calls in the shared pipeline. Policy composition order: timeout → retry (with jitter, bounded by capability-specific MaxRetries) → circuit breaker (per driver instance, opens on N consecutive failures) → bulkhead (ceiling on in-flight requests per driver). |
| Core → new Core.Stability sub-namespace | Generalise MemoryWatchdog (Driver.Galaxy.Host) into DriverMemoryWatchdog consuming IDriver.GetMemoryFootprint(). Add ScheduledRecycleScheduler (decision #67) for weekly/time-of-day recycle. Add WedgeDetector that flips a driver to Faulted when no successful Read in N × PublishingInterval. |
| DriverTypeRegistry | Each driver type registers its DriverTier {A, B, C}. Tier C drivers must also advertise their out-of-process topology; the registry enforces invariants (Tier C has a Proxy + Host pair). |
| OtOpcUa.Server → new Minimal API endpoints | /healthz (liveness — process alive + config DB reachable or LiteDB cache warm), /readyz (readiness — every driver instance reports DriverState.Healthy). JSON bodies cite individual driver health per instance. |
| Serilog configuration | Centralize enrichers in OtOpcUa.Server/Observability/LogContextEnricher.cs. Every driver call runs inside a LogContext.PushProperty scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON-formatted output added alongside plain-text so SIEM ingestion works. |
| Configuration project | Add LiteDbConfigCache adapter. Wrap EF Core queries in a Polly pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-cache. Cache refresh on successful DB query + after sp_PublishGeneration. Cache lives at %ProgramData%/OtOpcUa/config-cache/<cluster-id>.db per node. |
| DriverHostStatus entity | Extend to carry LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc. Admin /hosts page reads these. |
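The composition order above can be sketched with Polly v8's ResiliencePipelineBuilder. This is illustrative only — option values are not the per-tier defaults from Stream A, the consecutive-failure semantics are approximated (the v8 breaker is failure-ratio based), and the bulkhead comes from the Polly.RateLimiting companion package:

```csharp
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

// Strategies added first are outermost, mirroring the plan's listed order:
// timeout → retry → circuit breaker → bulkhead. Values are illustrative.
ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    .AddTimeout(TimeSpan.FromSeconds(2))
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,                  // capability-specific MaxRetries
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
    })
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 1.0,                    // approximate "N consecutive failures"
        MinimumThroughput = 5,
        BreakDuration = TimeSpan.FromSeconds(30),
    })
    .AddConcurrencyLimiter(permitLimit: 16)    // bulkhead: in-flight ceiling per driver
    .Build();
```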

Scope — What Does NOT Change

| Item | Reason |
| --- | --- |
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers. |
| Config DB schema | LiteDB cache is a read-only mirror; no new central tables except DriverHostStatus column additions. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 generalises the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy SafeMxAccessHandle + Modbus TcpClient disposal stay — they're driver-internal, not part of the cross-cutting layer. |

Entry Gate Checklist

  • Phases 0-5 exit gates cleared (or explicitly deferred with task reference)
  • driver-stability.md §24 re-read; decisions #63-#73 + #34-#36 re-skimmed
  • Polly v8 NuGet available (Microsoft.Extensions.Resilience + Polly.Core) — verify package restore before task breakdown
  • LiteDB 5.x NuGet confirmed MIT + actively maintained
  • Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm test counts baseline so the resilience layer doesn't regress any
  • Serilog configuration inventory: locate every Log.ForContext call site that will need LogContext rewrap
  • Admin /hosts page's current DriverHostStatus consumption reviewed so the schema extensions don't break it

Task Breakdown

Stream A — Resilience layer (1 week)

  1. A.1 Add Polly.Core + Microsoft.Extensions.Resilience to Core. Build DriverResiliencePipelineBuilder that composes Timeout → Retry (exponential backoff + jitter, capability-specific max retries) → CircuitBreaker (consecutive-failure threshold; half-open probe) → Bulkhead (max in-flight per driver instance). Unit tests cover each policy in isolation + composed pipeline.
  2. A.2 DriverResilienceOptions record bound from DriverInstance.ResilienceConfig JSON column (new nullable). Defaults encoded per-tier: Tier A (OPC UA Client, S7) — 3 retries, 2 s timeout, 5-failure breaker; Tier B (Modbus) — same except 4 s timeout; Tier C (Galaxy) — 1 retry (inner supervisor handles restart), 10 s timeout, circuit-breaker trips but doesn't kill the driver (the Proxy supervisor already handles that).
  3. A.3 DriverCapabilityInvoker<T> wraps every IDriver* method call. Existing server-side dispatch (whatever currently calls driver.ReadAsync) routes through the invoker. Policy injection via DI.
  4. A.4 Remove the hand-rolled CircuitBreaker + Backoff from Driver.Galaxy.Proxy/Supervisor/ — replaced by the shared layer. Keep HeartbeatMonitor (different concern: IPC liveness, not data-path resilience).
  5. A.5 Unit tests: per-policy, per-composition. Integration test: Modbus driver under a FlakeyTransport that fails 5×, succeeds on 6th; invoker surfaces the eventual success. Bench: no-op overhead < 1% under nominal load.
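A.3's invoker can be sketched as a thin wrapper that resolves the right pipeline and executes the capability call inside it. The shape below is hypothetical — IDriverResiliencePipelines is a stand-in for the Stream A registry (which, per review finding 2, additionally keys pipelines by target host, omitted here for brevity):

```csharp
using Polly;

// Hypothetical registry interface from Stream A, stubbed so the shape compiles.
public interface IDriverResiliencePipelines
{
    ResiliencePipeline For(Guid driverInstanceId, string capability);
}

public sealed class DriverCapabilityInvoker
{
    private readonly IDriverResiliencePipelines _pipelines;

    public DriverCapabilityInvoker(IDriverResiliencePipelines pipelines)
        => _pipelines = pipelines;

    // All server-side IDriver* capability dispatch funnels through this method,
    // giving the Roslyn compliance analyzer a single sanctioned call path.
    public ValueTask<TResult> InvokeAsync<TResult>(
        Guid driverInstanceId,
        string capability, // "Read", "Write", "Subscribe", ...
        Func<CancellationToken, ValueTask<TResult>> call,
        CancellationToken ct)
        => _pipelines.For(driverInstanceId, capability).ExecuteAsync(call, ct);
}
```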

Stream B — Tier A/B/C stability runtime (1 week, can parallel with Stream A after A.1)

  1. B.1 Core.Abstractions gains a DriverTier enum {A, B, C}. Extend DriverTypeRegistry to require DriverTier at registration. Existing driver types get their tier stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
  2. B.2 Generalise DriverMemoryWatchdog (lift from Driver.Galaxy.Host/MemoryWatchdog.cs). Tier-specific thresholds: A = 256 MB RSS soft / 512 MB hard, B = 512 MB soft / 1 GB hard, C = 1 GB soft / 2 GB hard (decision #70 hybrid multiplier + floor). Soft threshold → log + metric; hard threshold → mark driver Faulted + trigger recycle.
  3. B.3 ScheduledRecycleScheduler (decision #67): each driver instance can opt in to a weekly recycle at a configured cron. Recycle = ShutdownAsync → InitializeAsync. Tier C drivers get the Proxy-side recycle; Tier A/B recycle in-process.
  4. B.4 WedgeDetector: polling thread per driver instance; if LastSuccessfulRead older than WedgeThreshold (default 5 × PublishingInterval, minimum 60 s) AND driver state is Healthy, flag as wedged → force ReinitializeAsync. Prevents silent dead-subscriptions.
  5. B.5 Tests: watchdog unit tests drive synthetic allocation; scheduler uses a virtual clock; wedge detector tests use a fake IClock + driver stub.
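B.4's wedge criterion, as adjusted by review finding 7, reduces to a pure predicate. A minimal sketch — the class and parameter names are hypothetical; the real detector would read in-flight depth from the Polly bulkhead and item counts from the subscription layer:

```csharp
public static class WedgeCriteria
{
    // A driver is "wedged" when it claims Healthy, has pending work, and has made
    // no read progress within the threshold window. Idle drivers (no pending work)
    // are never flagged — review finding 7's correction to the original B.4 rule.
    public static bool IsWedged(
        bool driverReportsHealthy,
        int inFlightRequests,       // from the Polly bulkhead depth
        int activeMonitoredItems,   // from the subscription layer
        TimeSpan sinceLastProgress,
        TimeSpan publishingInterval)
    {
        // Threshold: 5 × PublishingInterval, floored at 60 s.
        var threshold = TimeSpan.FromSeconds(
            Math.Max(5 * publishingInterval.TotalSeconds, 60));
        bool hasPendingWork = inFlightRequests > 0 || activeMonitoredItems > 0;
        return driverReportsHealthy && hasPendingWork && sinceLastProgress > threshold;
    }
}
```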

Stream C — Health endpoints + structured logging (4 days)

  1. C.1 OtOpcUa.Server/Observability/HealthEndpoints.cs — Minimal API on a second Kestrel binding (default http://+:4841). /healthz reports process uptime + config-DB reachability (or cache-warm). /readyz enumerates DriverInstance rows + reports each driver's DriverHealth.State; returns 503 if ANY driver is Faulted. JSON body per docs/v2/acl-design.md §"Operator Dashboards" shape.
  2. C.2 LogContextEnricher installed at Serilog config time. Every driver-capability call site wraps its body in using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId)). Correlation IDs: reuse OPC UA RequestHeader.RequestHandle when in-flight; otherwise generate Guid.NewGuid().ToString("N")[..12].
  3. C.3 Add JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via Serilog:WriteJson appsetting.
  4. C.4 Integration test: boot server, issue Modbus read, assert log line contains DriverInstanceId + CorrelationId structured fields.
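An illustrative /readyz body for a fleet with one degraded driver (a sketch only — field names and nesting are assumptions until the acl-design.md "Operator Dashboards" contract is pinned; the instance IDs are placeholders):

```json
{
  "status": "Degraded",
  "usingStaleConfig": false,
  "drivers": [
    {
      "driverInstanceId": "00000000-0000-0000-0000-000000000001",
      "driverType": "Modbus",
      "state": "Healthy"
    },
    {
      "driverInstanceId": "00000000-0000-0000-0000-000000000002",
      "driverType": "Galaxy",
      "state": "Degraded",
      "detail": "circuit breaker half-open"
    }
  ]
}
```

Per the review-adjusted state matrix, this response would still return HTTP 200 (Degraded); only Faulted, Initializing, or Unknown drivers flip /readyz to 503.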

Stream D — Config DB LiteDB fallback (1 week)

  1. D.1 LiteDbConfigCache adapter. Wraps ConfigurationDbContext queries that are safe to serve stale (cluster membership, generation metadata, driver instance definitions, LDAP role mapping). Write-path queries (draft save, publish) bypass the cache and fail hard on DB outage.
  2. D.2 Cache refresh strategy: refresh on every successful read (write-through-cache), full refresh after sp_PublishGeneration confirmation. Cache entries carry CachedAtUtc; served entries older than 24 h trigger a synthetic Warning log line so operators see stale data in effect.
  3. D.3 Polly pipeline in Configuration project: EF Core query → retry 3× → fallback to cache. On fallback, driver state stays Healthy but a UsingStaleConfig flag on the cluster's health report flips true.
  4. D.4 Tests: in-memory SQL Server failure injected via TestContainers-ish double; cache returns last-known values; Admin UI banners reflect UsingStaleConfig.
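D.3's read path might compose like the following Polly v8 sketch. Everything here is a stand-in: ClusterSnapshot, the cache-read delegate, and the stale-flag callback are hypothetical placeholders for the planned LiteDbConfigCache types. Note that in Polly v8 the fallback strategy is added first so it wraps the retry + timeout:

```csharp
using Polly;
using Polly.Fallback;
using Polly.Retry;

// Hypothetical planned type, stubbed for illustration.
public sealed record ClusterSnapshot(Guid GenerationId);

public static class ConfigReadResilience
{
    public static ResiliencePipeline<ClusterSnapshot> Build(
        Func<ClusterSnapshot> readLastSealedSnapshot, // LiteDbConfigCache read
        Action flagStaleConfig)                       // flips UsingStaleConfig
        => new ResiliencePipelineBuilder<ClusterSnapshot>()
            .AddFallback(new FallbackStrategyOptions<ClusterSnapshot>
            {
                FallbackAction = _ =>
                {
                    flagStaleConfig(); // operators see stale data in effect
                    return Outcome.FromResultAsValueTask(readLastSealedSnapshot());
                }
            })
            .AddRetry(new RetryStrategyOptions<ClusterSnapshot>
            {
                MaxRetryAttempts = 3,
                UseJitter = true,
            })
            .AddTimeout(TimeSpan.FromSeconds(2))
            .Build();
}
```

Write-path queries (draft save, publish) would not use this pipeline at all — per D.1 they bypass the cache and fail hard on DB outage.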

Stream E — Admin /hosts page refresh (3 days)

  1. E.1 Extend DriverHostStatus schema with Stream A resilience columns. Generate EF migration.
  2. E.2 Admin/FleetStatusHub SignalR hub pushes LastCircuitBreakerOpenUtc + CurrentBulkheadDepth + LastRecycleUtc on change.
  3. E.3 /hosts Blazor page renders new columns; red badge if ConsecutiveFailures > breakerThreshold / 2.

Compliance Checks (run at exit gate)

  • Polly coverage: every IDriver* method call in the server dispatch layer routes through DriverCapabilityInvoker. Enforce via a Roslyn analyzer added to Core.Abstractions build (error on direct IDriver.ReadAsync calls outside the invoker).
  • Tier registry: every driver type registered in DriverTypeRegistry has a non-null Tier. Unit test walks the registry + asserts no gaps.
  • Health contract: /healthz + /readyz respond within 500 ms even with one driver Faulted.
  • Structured log: CI grep on tests/ output asserts at least one log line contains "DriverInstanceId" + "CorrelationId" JSON fields.
  • Cache fallback: Integration test kills the SQL container mid-operation; driver health stays Healthy, UsingStaleConfig flips true.
  • No regression in existing test suites — dotnet test on ZB.MOM.WW.OtOpcUa.slnx reports a test count equal to or greater than the pre-Phase-6.1 baseline.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Polly pipeline adds per-request latency on hot path | Medium | Medium | Benchmark in Stream A.5 before merging; 1% overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; UsingStaleConfig flag surfaced on /readyz; cache refresh on every successful DB round-trip; 24-hour synthetic warning |
| Tier watchdog false-positive-kills a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle; thresholds configurable per instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if driver claims Healthy; circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Release analyzer at warning level initially; upgrade to error in Phase 6.1+1 after one release cycle |

Completion Checklist

  • Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
  • Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
  • Stream C: /healthz + /readyz + structured logging + JSON Serilog sink
  • Stream D: LiteDB cache + Polly fallback in Configuration
  • Stream E: Admin /hosts page refresh
  • Cross-cutting: phase-6-1-compliance.ps1 exits 0; full solution dotnet test passes; exit-gate doc recorded

Adversarial Review — 2026-04-19 (Codex, thread 019da489-e317-7aa1-ab1f-6335e0be2447)

Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.

  1. Crit · ACCEPT — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via WriteIdempotent + CAS). Pipeline now capability-specific: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; Write does not unless the tag metadata carries WriteIdempotent=true. New WriteIdempotentAttribute surfaces on ModbusTagDefinition / S7TagDefinition / etc.
  2. Crit · ACCEPT — "One pipeline per driver instance" breaks decision #35's per-device isolation. Change: pipeline key is (DriverInstanceId, HostName) not just DriverInstanceId. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
  3. Crit · ACCEPT — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). Change: Stream B splits into two — MemoryTracking (all tiers, soft/hard thresholds log + surface to Admin /hosts; never kills) and MemoryRecycle (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
  4. High · ACCEPT — Removing Galaxy's hand-rolled CircuitBreaker drops decision #68 host-supervision crash-loop protection. Change: keep Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs + Backoff.cs — they guard the IPC process re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
  5. High · ACCEPT — Roslyn analyzer targeting IDriver misses the hot paths (IReadable.ReadAsync, IWritable.WriteAsync, ISubscribable.SubscribeAsync etc.). Change: analyzer rule now matches every method on the capability interfaces; compliance doc enumerates the full call-site list.
  6. High · ACCEPT — /healthz + /readyz under-specified for degraded-running. Change: add a state-matrix sub-section explicitly covering Unknown (pre-init: /readyz 503), Initializing (503), Healthy (200), Degraded (200 with JSON body flagging the degraded driver; /readyz is OR across drivers), Faulted (503), plus cached-config-serving (/healthz returns 200 + UsingStaleConfig: true in JSON body).
  7. High · ACCEPT — WedgeDetector based on "no successful Read" false-fires on write-only subscriptions + idle systems. Change: wedge criteria now (hasPendingWork AND noProgressIn > threshold) where hasPendingWork comes from the Polly bulkhead depth + active MonitoredItem count. Idle driver stays Healthy.
  8. High · ACCEPT — LiteDB cache serving mixed-generation reads breaks publish atomicity. Change: cache is snapshot-per-generation. Each published generation writes a sealed snapshot into <cache-root>/<cluster>/<generationId>.db; reads serve the last-known-sealed generation and never mix. Central DB outage during a publish means that publish fails (write path doesn't use cache); reads continue from the prior sealed snapshot.
  9. Med · ACCEPT — DriverHostStatus schema conflates per-host connectivity with per-driver-instance resilience counters. Change: new DriverInstanceResilienceStatus table separate from DriverHostStatus. Admin /hosts joins both for display.
  10. Med · ACCEPT — Compliance says analyzer-error; risks say analyzer-warning. Change: phase 6.1 ships at error level (this phase is the gate); warning-mode option removed.
  11. Med · ACCEPT — Hardcoded per-tier MB bands ignore decision #70's max(multiplier × baseline, baseline + floor) formula with observed-baseline capture. Change: watchdog captures baseline at post-init plateau (median of first 5 min GetMemoryFootprint samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
  12. Med · ACCEPT — Tests mostly cover happy path. Change: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historic-backfill; Stream D.4 adds mixed-generation cache test + corrupt-first-boot cache test.
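Finding 11's hybrid formula can be stated directly. A minimal sketch — the class and parameter names are hypothetical; per the adjustment, baselineBytes would be the median GetMemoryFootprint() sample over the first five minutes post-init:

```csharp
public static class MemoryThresholds
{
    // Decision #70 hybrid: threshold = max(multiplier × baseline, baseline + floor).
    // Tier constants supply the multiplier + floor; the baseline is observed,
    // not hardcoded, so a naturally large driver isn't recycled at steady state.
    public static long HardLimitBytes(
        long baselineBytes, double tierMultiplier, long tierFloorBytes)
        => Math.Max((long)(tierMultiplier * baselineBytes),
                    baselineBytes + tierFloorBytes);
}
```

The floor term dominates for small baselines (so a tiny driver still gets headroom); the multiplier dominates once the baseline is large.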