Joseph Doherty f29043c66a Phase 6.1 exit gate — compliance script real-checks + phase doc status = SHIPPED
scripts/compliance/phase-6-1-compliance.ps1 replaces the stub TODOs with 34
real checks covering:
- Stream A: pipeline builder + CapabilityInvoker + WriteIdempotentAttribute
  present; pipeline key includes HostName (per-device isolation per decision
  #144); OnReadValue / OnWriteValue / HistoryRead route through invoker in
  DriverNodeManager; Galaxy supervisor CircuitBreaker + Backoff preserved.
- Stream B: DriverTier enum; DriverTypeMetadata requires Tier; MemoryTracking
  + MemoryRecycle (Tier C-gated) + ScheduledRecycleScheduler (rejects Tier
  A/B) + demand-aware WedgeDetector all present.
- Stream C: DriverHealthReport + HealthEndpointsHost; state matrix Healthy=200
  / Faulted=503 asserted in code; LogContextEnricher; JSON sink opt-in via
  Serilog:WriteJson.
- Stream D: GenerationSealedCache + ReadOnly marking + GenerationCacheUnavailable
  exception path; ResilientConfigReader + StaleConfigFlag.
- Stream E data layer: DriverInstanceResilienceStatus entity +
  DriverResilienceStatusTracker. SignalR/Blazor surface is Deferred per the
  visual-compliance follow-up pattern borrowed from Phase 6.4.
- Cross-cutting: full solution `dotnet test` runs; asserts 1042 >= 906
  baseline; tolerates the one pre-existing Client.CLI Subscribe flake and
  flags any new failure.

Running the script locally returns "Phase 6.1 compliance: PASS" — exit 0. Any
future regression that deletes a class or un-wires a dispatch path turns a
green check red + exit non-zero.

docs/v2/implementation/phase-6-1-resilience-and-observability.md status
updated from DRAFT to SHIPPED with the merged-PRs summary + test count delta +
the single deferred follow-up (visual review of the Admin /hosts columns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:53:47 -04:00


Phase 6.1 — Resilience & Observability Runtime

Status: SHIPPED 2026-04-19 — Streams A/B/C/D + E data layer merged to v2 across PRs #78-82. Final exit-gate PR #83 turns the compliance script into real checks (all pass) and records this status update. One deferred piece: Stream E.2/E.3 SignalR hub + Blazor /hosts column refresh lands in a visual-compliance follow-up PR on the Phase 6.4 Admin UI branch.

Baseline: 906 solution tests → post-Phase-6.1: 1042 passing (+136 net). One pre-existing Client.CLI Subscribe flake unchanged.

Branch: v2/phase-6-1-resilience-observability
Estimated duration: 3 weeks
Predecessor: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
Successor: Phase 6.2 (Authorization runtime)

Phase Objective

Land the cross-cutting runtime protections + operability features that plan.md + driver-stability.md specify by decision but that no driver-phase actually wires. End-state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central DB outage via a LiteDB local cache.

Closes these gaps flagged in the 2026-04-19 audit:

  1. Polly v8 resilience pipelines wired to every IDriver capability (no-op per-driver today; Galaxy has a hand-rolled CircuitBreaker only).
  2. Tier A/B/C enforcement at runtime — driver-stability.md §24 and decisions #63-73 define memory watchdog, bounded queues, scheduled recycle, wedge detection; MemoryWatchdog exists only inside Driver.Galaxy.Host.
  3. Health endpoints (/healthz, /readyz) on OtOpcUa.Server.
  4. Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
  5. LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).

Scope — What Changes

  • Core → new Core.Resilience sub-namespace: Shared Polly pipeline builder (DriverResiliencePipelines). Pipeline key = (DriverInstanceId, HostName) so one dead PLC behind a multi-device driver doesn't open the breaker for healthy siblings (decision #35 per-device isolation). Per-capability policy — Read / HistoryRead / Discover / Probe / Alarm get retries; Write does NOT unless [WriteIdempotent] on the tag definition (decisions #44-45).
  • Every capability-interface consumer in the server: Wrap IReadable.ReadAsync, IWritable.WriteAsync, ITagDiscovery.DiscoverAsync, ISubscribable.SubscribeAsync/UnsubscribeAsync, the IHostConnectivityProbe probe loop, IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync, IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync. Composition: timeout → (retry when capability supports) → circuit breaker → bulkhead.
  • Core.Abstractions → new WriteIdempotentAttribute: Marker on ModbusTagDefinition / S7TagDefinition / OpcUaClientDriver tag rows; opts that tag into auto-retry on Write. Absence = no retry, per spec.
  • Core → new Core.Stability sub-namespace, split in two: (a) MemoryTracking runs on all tiers; captures baseline (median of first 5 min of GetMemoryFootprint samples) + applies the hybrid rule soft = max(multiplier × baseline, baseline + floor); soft breach logs + surfaces to Admin; never kills. (b) MemoryRecycle (Tier C only — requires out-of-process topology) handles hard-breach recycle via the Proxy-side supervisor. Tier A/B overrun escalates to a Tier C promotion ticket, not auto-kill.
  • ScheduledRecycleScheduler: Tier C only per decisions #73-74. Weekly/time-of-day recycle via Proxy supervisor. Tier A/B opt-in recycle lands in a future phase together with a Tier-C-escalation workflow.
  • WedgeDetector: Demand-aware — flips a driver to Faulted only when (hasPendingWork AND noProgressIn > threshold). hasPendingWork derives from non-zero Polly bulkhead depth OR ≥1 active MonitoredItem OR ≥1 queued historian read. Idle + subscription-only drivers stay Healthy.
  • DriverTypeRegistry: Each driver type registers its DriverTier {A, B, C}. Tier C drivers must advertise their out-of-process topology; the registry enforces invariants (Tier C has a Proxy + Host pair).
  • Driver.Galaxy.Proxy/Supervisor/: Retains existing CircuitBreaker + Backoff — they guard IPC respawn (decision #68), a different concern from the per-call Polly layer. Only HeartbeatMonitor is referenced downstream (IPC liveness).
  • OtOpcUa.Server → new Minimal API endpoints on http://+:4841: /healthz = process alive + (config DB reachable OR UsingStaleConfig=true). /readyz = ANDed driver health; state machine per DriverState: Unknown/Initializing → 503, Healthy → 200, Degraded → 200 + {degradedDrivers: [...]} in body, Faulted → 503. JSON body always reports per-instance detail.
  • Serilog configuration: Centralize enrichers in OtOpcUa.Server/Observability/LogContextEnricher.cs. Every capability call runs inside a LogContext.PushProperty scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; a JSON sink is added alongside plain-text (switchable via the Serilog:WriteJson appsetting).
  • Configuration project: Add a LiteDbConfigCache adapter. Generation-sealed snapshots: sp_PublishGeneration writes <cache-root>/<cluster>/<generationId>.db as a read-only sealed file. Reads serve the last-known-sealed generation; mixed-generation reads are impossible. Write path bypasses the cache + fails hard on DB outage. Pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-sealed-snapshot.
  • DriverHostStatus vs. DriverInstanceResilienceStatus: New separate entity DriverInstanceResilienceStatus { DriverInstanceId, HostName, LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc, BaselineFootprintBytes }. DriverHostStatus keeps per-host connectivity only; Admin /hosts joins both for display.
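The per-device isolation rule (decision #35) is the load-bearing detail in the table above. A minimal illustrative sketch, in Python for brevity — the real layer is a Polly v8 resilience pipeline in C#, and the names Breaker, PipelineRegistry, and breaker_for are hypothetical:

```python
# Illustrative sketch only: one breaker per (DriverInstanceId, HostName) key,
# so a dead PLC never trips the breaker for a healthy sibling host.
class Breaker:
    """Trivial consecutive-failure circuit breaker."""
    def __init__(self, threshold=5):
        self.failures = 0
        self.threshold = threshold

    @property
    def open(self):
        return self.failures >= self.threshold

    def record(self, success):
        # Any success resets the consecutive-failure count.
        self.failures = 0 if success else self.failures + 1

class PipelineRegistry:
    """Keys on the (instance, host) pair, not the instance alone."""
    def __init__(self):
        self._breakers = {}

    def breaker_for(self, driver_instance_id, host_name):
        key = (driver_instance_id, host_name)
        return self._breakers.setdefault(key, Breaker())

reg = PipelineRegistry()
for _ in range(5):
    reg.breaker_for("modbus-1", "plc-a").record(success=False)  # dead PLC
assert reg.breaker_for("modbus-1", "plc-a").open        # plc-a breaker open
assert not reg.breaker_for("modbus-1", "plc-b").open    # sibling unaffected
```

Keying on DriverInstanceId alone would let the dead plc-a open the breaker for plc-b as well, which is exactly the failure mode decision #35 forbids.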

Scope — What Does NOT Change

  • Driver wire protocols: Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers.
  • Config DB schema: LiteDB cache is a read-only mirror; the only central additions are the new DriverInstanceResilienceStatus table + DriverHostStatus column additions.
  • OPC UA wire behavior visible to clients: Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected.
  • The four 2026-04-13 Galaxy stability findings: Already closed in Phase 2. Phase 6.1 generalises the pattern, doesn't re-fix Galaxy.
  • Driver-layer SafeHandle usage: Existing Galaxy SafeMxAccessHandle + Modbus TcpClient disposal stay — they're driver-internal, not part of the cross-cutting layer.

Entry Gate Checklist

  • Phases 0-5 exit gates cleared (or explicitly deferred with task reference)
  • driver-stability.md §24 re-read; decisions #63-73 + #34-36 re-skimmed
  • Polly v8 NuGet available (Microsoft.Extensions.Resilience + Polly.Core) — verify package restore before task breakdown
  • LiteDB 5.x NuGet confirmed MIT + actively maintained
  • Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm test counts baseline so the resilience layer doesn't regress any
  • Serilog configuration inventory: locate every Log.ForContext call site that will need LogContext rewrap
  • Admin /hosts page's current DriverHostStatus consumption reviewed so the schema extensions don't break it

Task Breakdown

Stream A — Resilience layer (1 week)

  1. A.1 Add Polly.Core + Microsoft.Extensions.Resilience to Core. Build DriverResiliencePipelineBuilder — key on (DriverInstanceId, HostName); composes Timeout → (Retry when the capability allows it; skipped for Write unless [WriteIdempotent]) → CircuitBreaker → Bulkhead. Per-capability policy map documented in DriverResilienceOptions.CapabilityPolicies.
  2. A.2 DriverResilienceOptions record bound from DriverInstance.ResilienceConfig JSON column (new nullable). Per-tier × per-capability defaults: Tier A (OpcUaClient) Read 3 retries/2 s/5-failure-breaker, Write 0 retries/2 s/5-failure-breaker; Tier B (Modbus, S7) Read 3/4 s/5, Write 0/4 s/5; Tier C (Galaxy) Read 1 retry/10 s/no-kill, Write 0/10 s/no-kill. Idempotent writes can opt into Read-shaped retry via the attribute.
  3. A.3 CapabilityInvoker<TCapability, TResult> wraps every method on the capability interfaces (IReadable.ReadAsync, IWritable.WriteAsync, ITagDiscovery.DiscoverAsync, ISubscribable.SubscribeAsync/UnsubscribeAsync, IHostConnectivityProbe probe loop, IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync, IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync). Existing server-side dispatch routes through it.
  4. A.4 Retain Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs + Backoff.cs — they guard IPC process respawn (decision #68), orthogonal to the per-call Polly layer. Only HeartbeatMonitor is consumed outside the supervisor.
  5. A.5 Unit tests: per-policy, per-composition. Negative integration tests: (a) Modbus FlakeyTransport fails 5× on Read, succeeds 6th — invoker surfaces success; (b) Modbus FlakeyTransport fails 1× on Write with [WriteIdempotent]=false — invoker surfaces failure without retry (no duplicate pulse); (c) Modbus FlakeyTransport fails 1× on Write with [WriteIdempotent]=true — invoker retries. Bench: no-op overhead < 1%.
  6. A.6 WriteIdempotentAttribute in Core.Abstractions. Modbus/S7/OpcUaClient tag-definition records pick it up; invoker reads via reflection once at driver init.
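The A.1/A.6 retry gating reduces to a small policy function. A language-agnostic Python sketch under assumed shapes — the real invoker is C# with Polly, the full composition also adds timeout, breaker, and bulkhead, and max_retries / invoke are hypothetical names:

```python
# Sketch of the capability policy gate: Read-shaped capabilities retry,
# Write never retries unless the tag carries [WriteIdempotent].
RETRYING_CAPABILITIES = {"Read", "HistoryRead", "Discover", "Probe", "Alarm"}

def max_retries(capability, write_idempotent=False, default_retries=3):
    if capability == "Write":
        # Writes are opt-in only: absence of the attribute means zero retries.
        return default_retries if write_idempotent else 0
    return default_retries if capability in RETRYING_CAPABILITIES else 0

def invoke(call, capability, write_idempotent=False):
    """Run `call`, retrying per the policy above; re-raise the last failure."""
    attempts = 1 + max_retries(capability, write_idempotent)
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except IOError as exc:
            last_error = exc
    raise last_error
```

A non-idempotent Write that fails once therefore surfaces the failure immediately, with no second attempt and no duplicate pulse, which is what negative test A.5(b) asserts.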

Stream B — Tier A/B/C stability runtime — split into MemoryTracking + MemoryRecycle (1 week)

  1. B.1 Add DriverTier enum {A, B, C} to Core.Abstractions. Extend DriverTypeRegistry to require DriverTier at registration. Existing driver types stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
  2. B.2 MemoryTracking (all tiers) lifted from Driver.Galaxy.Host/MemoryWatchdog.cs. Captures BaselineFootprintBytes as the median of first 5 min of IDriver.GetMemoryFootprint() samples post-InitializeAsync. Applies decision #70 hybrid formula: soft = max(multiplier × baseline, baseline + floor); Tier A multiplier=3, floor=50 MB; Tier B multiplier=3, floor=100 MB; Tier C multiplier=2, floor=500 MB. Soft breach → log + DriverInstanceResilienceStatus.CurrentFootprint tick; never kills. Hard = 2 × soft.
  3. B.3 MemoryRecycle (Tier C only per decisions #73-74). Hard-breach on a Tier C driver triggers ScheduledRecycleScheduler.RequestRecycleNow(driverInstanceId); scheduler proxies to Driver.Galaxy.Proxy/Supervisor/ which restarts the Host process. Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; never auto-kills the in-process driver.
  4. B.4 ScheduledRecycleScheduler per decision #67: Tier C driver instances opt-in to a weekly recycle at a configured cron. Tier A/B scheduled recycle deferred to a later phase paired with Tier-C escalation.
  5. B.5 WedgeDetector demand-aware: if (state==Healthy && hasPendingWork && noProgressIn > WedgeThreshold) → force ReinitializeAsync. hasPendingWork = (bulkhead depth > 0) OR (active monitored items > 0) OR (queued historian-read count > 0). WedgeThreshold default 5 × PublishingInterval, min 60 s. Idle driver stays Healthy.
  6. B.6 Tests: tracking unit tests drive synthetic allocation against a fake GetMemoryFootprint; recycle tests use a mock supervisor; wedge tests include the false-fault cases — idle subscriber, slow historian backfill, write-only burst.
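The B.2 threshold rule and the B.5 wedge predicate are compact enough to state executably. A Python sketch using the constants given above (the real code is C#; memory_thresholds and is_wedged are hypothetical names):

```python
from statistics import median

# Decision #70 hybrid rule with the B.2 tier constants:
# (multiplier, floor-in-bytes); soft = max(mult * baseline, baseline + floor).
TIER_LIMITS = {"A": (3, 50 * 2**20), "B": (3, 100 * 2**20), "C": (2, 500 * 2**20)}

def memory_thresholds(baseline_samples, tier):
    baseline = median(baseline_samples)  # median of first-5-min footprint samples
    multiplier, floor = TIER_LIMITS[tier]
    soft = max(multiplier * baseline, baseline + floor)
    return soft, 2 * soft                # hard breach = 2 x soft

def is_wedged(state, bulkhead_depth, monitored_items, queued_history_reads,
              secs_since_progress, wedge_threshold_secs):
    """B.5 predicate: wedged only if Healthy AND has pending work AND stalled."""
    has_pending_work = (bulkhead_depth > 0 or monitored_items > 0
                        or queued_history_reads > 0)
    return (state == "Healthy" and has_pending_work
            and secs_since_progress > wedge_threshold_secs)

# A 10 MiB Tier A baseline: the 50 MiB floor dominates the 3x multiplier.
soft, hard = memory_thresholds([10 * 2**20] * 5, "A")
assert (soft, hard) == (60 * 2**20, 120 * 2**20)
assert not is_wedged("Healthy", 0, 0, 0, 600, 60)  # idle driver stays Healthy
assert is_wedged("Healthy", 2, 0, 0, 120, 60)      # stalled with queued work
```

The floor term is what keeps small-footprint drivers from tripping soft breach on ordinary growth; for large baselines the multiplier term takes over.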

Stream C — Health endpoints + structured logging (4 days)

  1. C.1 OtOpcUa.Server/Observability/HealthEndpoints.cs — Minimal API on a second Kestrel binding (default http://+:4841). /healthz reports process uptime + config-DB reachability (or cache-warm). /readyz enumerates DriverInstance rows + reports each driver's DriverHealth.State; returns 503 if ANY driver is Faulted. JSON body per docs/v2/acl-design.md §"Operator Dashboards" shape.
  2. C.2 LogContextEnricher installed at Serilog config time. Every driver-capability call site wraps its body in using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId)). Correlation IDs: reuse OPC UA RequestHeader.RequestHandle when in-flight; otherwise generate Guid.NewGuid().ToString("N")[..12].
  3. C.3 Add JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via Serilog:WriteJson appsetting.
  4. C.4 Integration test: boot server, issue Modbus read, assert log line contains DriverInstanceId + CorrelationId structured fields.
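The C.1 state matrix reduces to a small pure function. A Python sketch — the real endpoint is an ASP.NET Core Minimal API returning the same status codes, and readyz is a hypothetical name:

```python
# /readyz per the C.1 matrix: any Unknown/Initializing/Faulted driver -> 503;
# Degraded drivers keep 200 but are flagged in the JSON body.
NOT_READY = {"Unknown", "Initializing", "Faulted"}

def readyz(driver_states):
    """driver_states: {driver_instance_id: DriverState string}."""
    body = {"drivers": dict(driver_states)}  # per-instance detail, always
    if any(s in NOT_READY for s in driver_states.values()):
        return 503, body
    degraded = [d for d, s in driver_states.items() if s == "Degraded"]
    if degraded:
        body["degradedDrivers"] = degraded
    return 200, body

assert readyz({"s7-1": "Healthy"})[0] == 200
assert readyz({"s7-1": "Healthy", "gx-1": "Faulted"})[0] == 503
code, body = readyz({"s7-1": "Degraded"})
assert code == 200 and body["degradedDrivers"] == ["s7-1"]
```

Driving the integration tests off a table shaped like this keeps the code and the doc's state matrix from drifting apart.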

Stream D — Config DB LiteDB fallback — generation-sealed snapshots (1 week)

  1. D.1 LiteDbConfigCache adapter backed by sealed generation snapshots: each successful sp_PublishGeneration writes <cache-root>/<clusterId>/<generationId>.db as read-only after commit. The adapter maintains a CurrentSealedGenerationId pointer updated atomically on successful publish. Mixed-generation reads are impossible — every read served from the cache serves one coherent sealed generation.
  2. D.2 Write-path queries (draft save, publish) bypass the cache entirely and fail hard on DB outage. Read-path queries (DriverInstance enumeration, LdapGroupRoleMapping, cluster + namespace metadata) go through the pipeline: timeout 2 s → retry 3× jittered → fallback to the current sealed snapshot.
  3. D.3 UsingStaleConfig flag flips true when a read fell back to the sealed snapshot; cleared on the next successful DB round-trip. Surfaced on /healthz body and Admin /hosts.
  4. D.4 Tests: (a) SQL-container kill mid-operation — read returns sealed snapshot, UsingStaleConfig=true, driver stays Healthy; (b) mixed-generation guard — attempt to serve partial generation by corrupting a snapshot file mid-read → adapter fails closed rather than serving mixed data; (c) first-boot-no-snapshot case — adapter refuses to start, driver fails InitializeAsync with a clear config-DB-required error.
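The D.2/D.3 read path can be summarized in a sketch. Python for illustration only — the class name echoes the ResilientConfigReader mentioned in the commit summary, but its shape here is assumed, and the real pipeline adds the 2 s timeout and jittered backoff:

```python
# Read-path sketch: try the central DB with retries; on exhaustion fall back
# to the last sealed generation snapshot and raise the UsingStaleConfig flag.
class ResilientConfigReader:
    def __init__(self, db_read, sealed_snapshot_read, retries=3):
        self.db_read = db_read                            # central-DB query fn
        self.sealed_snapshot_read = sealed_snapshot_read  # last sealed generation
        self.retries = retries
        self.using_stale_config = False

    def read(self, query):
        for _ in range(1 + self.retries):
            try:
                result = self.db_read(query)
                self.using_stale_config = False  # cleared on a good round-trip
                return result
            except ConnectionError:
                continue                         # real path: jittered backoff
        self.using_stale_config = True           # surfaced on /healthz + /hosts
        return self.sealed_snapshot_read(query)  # one coherent generation only

def dead_db(_query):
    raise ConnectionError("central DB down")

reader = ResilientConfigReader(dead_db, lambda q: {"generation": 41, "query": q})
assert reader.read("DriverInstance")["generation"] == 41
assert reader.using_stale_config
```

Because the fallback reads one sealed snapshot file, a reader can never observe rows from two generations at once, which is the property test D.4(b) defends.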

Stream E — Admin /hosts page refresh (3 days)

  1. E.1 Add the DriverInstanceResilienceStatus entity (a separate table from DriverHostStatus, per adversarial finding #9) carrying the Stream A resilience counters. Generate EF migration.
  2. E.2 Admin/FleetStatusHub SignalR hub pushes LastCircuitBreakerOpenUtc + CurrentBulkheadDepth + LastRecycleUtc on change.
  3. E.3 /hosts Blazor page renders new columns; red badge if ConsecutiveFailures > breakerThreshold / 2.

Compliance Checks (run at exit gate)

  • Invoker coverage: every method on IReadable / IWritable / ITagDiscovery / ISubscribable / IHostConnectivityProbe / IAlarmSource / IHistoryProvider in the server dispatch layer routes through CapabilityInvoker. Enforce via a Roslyn analyzer (error-level; warning-first is rejected — the compliance check is the gate).
  • Write-retry guard: writes without [WriteIdempotent] never get retried. Unit-test the invoker path asserts zero retry attempts.
  • Pipeline isolation: pipeline key is (DriverInstanceId, HostName). Integration test with two Modbus hosts under one instance — failing host A does not open the breaker for host B.
  • Tier registry: every driver type registered in DriverTypeRegistry has a non-null Tier. Unit test walks the registry + asserts no gaps. Tier C registrations must declare their out-of-process topology.
  • MemoryTracking never kills: soft/hard breach tests on a Tier A/B driver log + surface without terminating the process.
  • MemoryRecycle Tier C only: hard breach on a Tier A driver never invokes the supervisor; on Tier C it does.
  • Wedge demand-aware: test suite includes idle-subscription-only, slow-historian-backfill, and write-only-burst cases — driver stays Healthy.
  • Galaxy supervisor preserved: Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs + Backoff.cs still present + still invoked on Host crash.
  • Health state machine: /healthz + /readyz respond within 500 ms for every DriverState; state-machine table in this doc drives the test matrix.
  • Structured log: CI grep asserts at least one log line per capability call has "DriverInstanceId" + "CorrelationId" JSON fields.
  • Generation-sealed cache: integration tests cover (a) SQL-kill mid-operation serves last-sealed snapshot; (b) mixed-generation corruption fails closed; (c) first-boot no-snapshot + DB-down → InitializeAsync fails with clear error.
  • No regression in existing test suites — dotnet test ZB.MOM.WW.OtOpcUa.slnx count equal-or-greater than pre-Phase-6.1 baseline.
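The tier-registry check above, for example, is a straightforward invariant walk. A Python sketch — the shipped check lives in phase-6-1-compliance.ps1, and the registry-entry shape here is assumed:

```python
# Invariant walk: every driver type carries a Tier, and Tier C types must
# declare their out-of-process topology (Proxy + Host pair).
def tier_registry_gaps(registry):
    gaps = []
    for driver_type, meta in registry.items():
        tier = meta.get("tier")
        if tier not in ("A", "B", "C"):
            gaps.append(f"{driver_type}: missing or invalid Tier")
        elif tier == "C" and not meta.get("out_of_process"):
            gaps.append(f"{driver_type}: Tier C must declare out-of-process topology")
    return gaps

# The B.1 stamping from this phase yields no gaps:
registry = {
    "OpcUaClient": {"tier": "A"},
    "Modbus": {"tier": "B"},
    "S7": {"tier": "B"},
    "Galaxy": {"tier": "C", "out_of_process": True},
}
assert tier_registry_gaps(registry) == []
```

Returning the full gap list rather than failing on the first hit keeps the compliance output actionable when several driver types regress at once.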

Risks and Mitigations

  • Risk: Polly pipeline adds per-request latency on the hot path. Likelihood: Medium. Impact: Medium. Mitigation: benchmark in Stream A.5 before merging; 1 % overhead budget; the hot path short-circuits inline when retry count = 0.
  • Risk: LiteDB cache diverges from central DB. Likelihood: Medium. Impact: High. Mitigation: stale-data banner in Admin UI; UsingStaleConfig flag surfaced on /readyz; cache refresh on every successful DB round-trip; 24-hour synthetic warning.
  • Risk: Tier watchdog false-positive-kills a legitimate batch load. Likelihood: Low. Impact: High. Mitigation: soft/hard threshold split; soft only logs; hard triggers recycle (Tier C only); thresholds configurable per-instance.
  • Risk: Wedge detector races with slow-but-healthy drivers. Likelihood: Medium. Impact: High. Mitigation: minimum 60 s threshold; detector only activates while the driver claims Healthy; circuit-breaker feedback so rapid oscillation trips instead of thrashing.
  • Risk: Roslyn analyzer breaks external driver authors. Likelihood: Low. Impact: Medium. Mitigation: the analyzer targets the server dispatch layer rather than driver assemblies, and ships at error level (warning-first rejected per adversarial finding #10).

Completion Checklist

  • Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
  • Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
  • Stream C: /healthz + /readyz + structured logging + JSON Serilog sink
  • Stream D: LiteDB cache + Polly fallback in Configuration
  • Stream E: Admin /hosts page refresh
  • Cross-cutting: phase-6-1-compliance.ps1 exits 0; full solution dotnet test passes; exit-gate doc recorded

Adversarial Review — 2026-04-19 (Codex, thread 019da489-e317-7aa1-ab1f-6335e0be2447)

Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.

  1. Crit · ACCEPT — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via WriteIdempotent + CAS). Pipeline now capability-specific: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; Write does not unless the tag metadata carries WriteIdempotent=true. New WriteIdempotentAttribute surfaces on ModbusTagDefinition / S7TagDefinition / etc.
  2. Crit · ACCEPT — "One pipeline per driver instance" breaks decision #35's per-device isolation. Change: pipeline key is (DriverInstanceId, HostName) not just DriverInstanceId. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
  3. Crit · ACCEPT — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). Change: Stream B splits into two — MemoryTracking (all tiers, soft/hard thresholds log + surface to Admin /hosts; never kills) and MemoryRecycle (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
  4. High · ACCEPT — Removing Galaxy's hand-rolled CircuitBreaker drops decision #68 host-supervision crash-loop protection. Change: keep Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs + Backoff.cs — they guard the IPC process re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
  5. High · ACCEPT — Roslyn analyzer targeting IDriver misses the hot paths (IReadable.ReadAsync, IWritable.WriteAsync, ISubscribable.SubscribeAsync etc.). Change: analyzer rule now matches every method on the capability interfaces; compliance doc enumerates the full call-site list.
  6. High · ACCEPT — /healthz + /readyz under-specified for degraded-running. Change: add a state-matrix sub-section explicitly covering Unknown (pre-init: /readyz 503), Initializing (503), Healthy (200), Degraded (200 with JSON body flagging the degraded driver), Faulted (503; /readyz ANDs health across drivers, so any Faulted driver yields 503), plus cached-config-serving (/healthz returns 200 + UsingStaleConfig: true in JSON body).
  7. High · ACCEPT — WedgeDetector based on "no successful Read" false-fires on write-only subscriptions + idle systems. Change: wedge criteria now (hasPendingWork AND noProgressIn > threshold) where hasPendingWork comes from the Polly bulkhead depth + active MonitoredItem count. Idle driver stays Healthy.
  8. High · ACCEPT — LiteDB cache serving mixed-generation reads breaks publish atomicity. Change: cache is snapshot-per-generation. Each published generation writes a sealed snapshot into <cache-root>/<cluster>/<generationId>.db; reads serve the last-known-sealed generation and never mix. Central DB outage during a publish means that publish fails (write path doesn't use cache); reads continue from the prior sealed snapshot.
  9. Med · ACCEPT — DriverHostStatus schema conflates per-host connectivity with per-driver-instance resilience counters. Change: new DriverInstanceResilienceStatus table separate from DriverHostStatus. Admin /hosts joins both for display.
  10. Med · ACCEPT — Compliance says analyzer-error; risks say analyzer-warning. Change: phase 6.1 ships at error level (this phase is the gate); warning-mode option removed.
  11. Med · ACCEPT — Hardcoded per-tier MB bands ignore decision #70's max(multiplier × baseline, baseline + floor) formula with observed-baseline capture. Change: watchdog captures baseline at post-init plateau (median of first 5 min GetMemoryFootprint samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
  12. Med · ACCEPT — Tests mostly cover happy path. Change: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historic-backfill; Stream D.4 adds mixed-generation cache test + corrupt-first-boot cache test.