lmxopcua/docs/v2/implementation/phase-6-1-resilience-and-observability.md
Joseph Doherty 4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). 
Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes.
2026-04-19 03:15:00 -04:00


Phase 6.1 — Resilience & Observability Runtime

Status: DRAFT — implementation plan for a cross-cutting phase that was never formalised. The v2 plan.md specifies Polly, Tier A/B/C protections, structured logging, and local-cache fallback by decision; none are wired end-to-end.

Branch: v2/phase-6-1-resilience-observability
Estimated duration: 3 weeks
Predecessor: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
Successor: Phase 6.2 (Authorization runtime)

Phase Objective

Land the cross-cutting runtime protections + operability features that plan.md + driver-stability.md specify by decision but that no driver-phase actually wires. End-state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central DB outage via a LiteDB local cache.

Closes these gaps flagged in the 2026-04-19 audit:

  1. Polly v8 resilience pipelines wired to every IDriver capability (no-op per-driver today; Galaxy has a hand-rolled CircuitBreaker only).
  2. Tier A/B/C enforcement at runtime — driver-stability.md §24 and decisions #63-#73 define memory watchdog, bounded queues, scheduled recycle, wedge detection; MemoryWatchdog exists only inside Driver.Galaxy.Host.
  3. Health endpoints (/healthz, /readyz) on OtOpcUa.Server.
  4. Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
  5. LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).

Scope — What Changes

| Concern | Change |
| --- | --- |
| Core → new Core.Resilience sub-namespace | Shared Polly pipeline builder (DriverResiliencePipelines), per-capability policy (Read / Write / Subscribe / HistoryRead / Discover / Probe / Alarm). One pipeline per driver instance; driver-options decide tuning. |
| Every IDriver* consumer in the server | Wrap capability calls in the shared pipeline. Policy composition order: timeout → retry (with jitter, bounded by capability-specific MaxRetries) → circuit breaker (per driver instance, opens on N consecutive failures) → bulkhead (ceiling on in-flight requests per driver). |
| Core → new Core.Stability sub-namespace | Generalise MemoryWatchdog (Driver.Galaxy.Host) into DriverMemoryWatchdog consuming IDriver.GetMemoryFootprint(). Add ScheduledRecycleScheduler (decision #67) for weekly/time-of-day recycle. Add WedgeDetector that flips a driver to Faulted when no successful Read in N × PublishingInterval. |
| DriverTypeRegistry | Each driver type registers its DriverTier {A, B, C}. Tier C drivers must also advertise their out-of-process topology; the registry enforces invariants (Tier C has a Proxy + Host pair). |
| OtOpcUa.Server → new Minimal API endpoints | /healthz (liveness — process alive + config DB reachable or LiteDB cache warm), /readyz (readiness — every driver instance reports DriverState.Healthy). JSON bodies cite individual driver health per instance. |
| Serilog configuration | Centralize enrichers in OtOpcUa.Server/Observability/LogContextEnricher.cs. Every driver call runs inside a LogContext.PushProperty scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON-formatted output added alongside plain-text so SIEM ingestion works. |
| Configuration project | Add LiteDbConfigCache adapter. Wrap EF Core queries in a Polly pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-cache. Cache refresh on successful DB query + after sp_PublishGeneration. Cache lives at %ProgramData%/OtOpcUa/config-cache/<cluster-id>.db per node. |
| DriverHostStatus entity | Extend to carry LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc. Admin /hosts page reads these. |
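The composition order above can be sketched with Polly v8's ResiliencePipelineBuilder. This is illustrative only — option values are not the per-tier defaults from Stream A, the consecutive-failure semantics are approximated (the v8 breaker is failure-ratio based), and the bulkhead comes from the Polly.RateLimiting companion package:

```csharp
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

// Strategies added first are outermost, mirroring the plan's listed order:
// timeout → retry → circuit breaker → bulkhead. Values are illustrative.
ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    .AddTimeout(TimeSpan.FromSeconds(2))
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,                  // capability-specific MaxRetries
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
    })
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 1.0,                    // approximate "N consecutive failures"
        MinimumThroughput = 5,
        BreakDuration = TimeSpan.FromSeconds(30),
    })
    .AddConcurrencyLimiter(permitLimit: 16)    // bulkhead: in-flight ceiling per driver
    .Build();
```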

Scope — What Does NOT Change

| Item | Reason |
| --- | --- |
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers. |
| Config DB schema | LiteDB cache is a read-only mirror; no new central tables except DriverHostStatus column additions. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 generalises the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy SafeMxAccessHandle + Modbus TcpClient disposal stay — they're driver-internal, not part of the cross-cutting layer. |

Entry Gate Checklist

  • Phases 0-5 exit gates cleared (or explicitly deferred with task reference)
  • driver-stability.md §24 re-read; decisions #63-#73 + #34-#36 re-skimmed
  • Polly v8 NuGet available (Microsoft.Extensions.Resilience + Polly.Core) — verify package restore before task breakdown
  • LiteDB 5.x NuGet confirmed MIT + actively maintained
  • Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm test counts baseline so the resilience layer doesn't regress any
  • Serilog configuration inventory: locate every Log.ForContext call site that will need LogContext rewrap
  • Admin /hosts page's current DriverHostStatus consumption reviewed so the schema extensions don't break it

Task Breakdown

Stream A — Resilience layer (1 week)

  1. A.1 Add Polly.Core + Microsoft.Extensions.Resilience to Core. Build DriverResiliencePipelineBuilder that composes Timeout → Retry (exponential backoff + jitter, capability-specific max retries) → CircuitBreaker (consecutive-failure threshold; half-open probe) → Bulkhead (max in-flight per driver instance). Unit tests cover each policy in isolation + composed pipeline.
  2. A.2 DriverResilienceOptions record bound from DriverInstance.ResilienceConfig JSON column (new nullable). Defaults encoded per-tier: Tier A (OPC UA Client, S7) — 3 retries, 2 s timeout, 5-failure breaker; Tier B (Modbus) — same except 4 s timeout; Tier C (Galaxy) — 1 retry (inner supervisor handles restart), 10 s timeout, circuit-breaker trips but doesn't kill the driver (the Proxy supervisor already handles that).
  3. A.3 DriverCapabilityInvoker<T> wraps every IDriver* method call. Existing server-side dispatch (whatever currently calls driver.ReadAsync) routes through the invoker. Policy injection via DI.
  4. A.4 Remove the hand-rolled CircuitBreaker + Backoff from Driver.Galaxy.Proxy/Supervisor/ — replaced by the shared layer. Keep HeartbeatMonitor (different concern: IPC liveness, not data-path resilience).
  5. A.5 Unit tests: per-policy, per-composition. Integration test: Modbus driver under a FlakeyTransport that fails 5×, succeeds on 6th; invoker surfaces the eventual success. Bench: no-op overhead < 1% under nominal load.
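A.3's invoker can be sketched as a thin wrapper that resolves the right pipeline and executes the capability call inside it. The shape below is hypothetical — IDriverResiliencePipelines is a stand-in for the Stream A registry (which, per review finding 2, additionally keys pipelines by target host, omitted here for brevity):

```csharp
using Polly;

// Hypothetical registry interface from Stream A, stubbed so the shape compiles.
public interface IDriverResiliencePipelines
{
    ResiliencePipeline For(Guid driverInstanceId, string capability);
}

public sealed class DriverCapabilityInvoker
{
    private readonly IDriverResiliencePipelines _pipelines;

    public DriverCapabilityInvoker(IDriverResiliencePipelines pipelines)
        => _pipelines = pipelines;

    // All server-side IDriver* capability dispatch funnels through this method,
    // giving the Roslyn compliance analyzer a single sanctioned call path.
    public ValueTask<TResult> InvokeAsync<TResult>(
        Guid driverInstanceId,
        string capability, // "Read", "Write", "Subscribe", ...
        Func<CancellationToken, ValueTask<TResult>> call,
        CancellationToken ct)
        => _pipelines.For(driverInstanceId, capability).ExecuteAsync(call, ct);
}
```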

Stream B — Tier A/B/C stability runtime (1 week, can parallel with Stream A after A.1)

  1. B.1 Core.Abstractions gains a DriverTier enum {A, B, C}. Extend DriverTypeRegistry to require DriverTier at registration. Existing driver types get their tier stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
  2. B.2 Generalise DriverMemoryWatchdog (lift from Driver.Galaxy.Host/MemoryWatchdog.cs). Tier-specific thresholds: A = 256 MB RSS soft / 512 MB hard, B = 512 MB soft / 1 GB hard, C = 1 GB soft / 2 GB hard (decision #70 hybrid multiplier + floor). Soft threshold → log + metric; hard threshold → mark driver Faulted + trigger recycle.
  3. B.3 ScheduledRecycleScheduler (decision #67): each driver instance can opt in to a weekly recycle at a configured cron. Recycle = ShutdownAsync → InitializeAsync. Tier C drivers get the Proxy-side recycle; Tier A/B recycle in-process.
  4. B.4 WedgeDetector: polling thread per driver instance; if LastSuccessfulRead older than WedgeThreshold (default 5 × PublishingInterval, minimum 60 s) AND driver state is Healthy, flag as wedged → force ReinitializeAsync. Prevents silent dead-subscriptions.
  5. B.5 Tests: watchdog unit tests drive synthetic allocation; scheduler uses a virtual clock; wedge detector tests use a fake IClock + driver stub.
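B.4's wedge criterion, as adjusted by review finding 7, reduces to a pure predicate. A minimal sketch — the class and parameter names are hypothetical; the real detector would read in-flight depth from the Polly bulkhead and item counts from the subscription layer:

```csharp
public static class WedgeCriteria
{
    // A driver is "wedged" when it claims Healthy, has pending work, and has made
    // no read progress within the threshold window. Idle drivers (no pending work)
    // are never flagged — review finding 7's correction to the original B.4 rule.
    public static bool IsWedged(
        bool driverReportsHealthy,
        int inFlightRequests,       // from the Polly bulkhead depth
        int activeMonitoredItems,   // from the subscription layer
        TimeSpan sinceLastProgress,
        TimeSpan publishingInterval)
    {
        // Threshold: 5 × PublishingInterval, floored at 60 s.
        var threshold = TimeSpan.FromSeconds(
            Math.Max(5 * publishingInterval.TotalSeconds, 60));
        bool hasPendingWork = inFlightRequests > 0 || activeMonitoredItems > 0;
        return driverReportsHealthy && hasPendingWork && sinceLastProgress > threshold;
    }
}
```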

Stream C — Health endpoints + structured logging (4 days)

  1. C.1 OtOpcUa.Server/Observability/HealthEndpoints.cs — Minimal API on a second Kestrel binding (default http://+:4841). /healthz reports process uptime + config-DB reachability (or cache-warm). /readyz enumerates DriverInstance rows + reports each driver's DriverHealth.State; returns 503 if ANY driver is Faulted. JSON body per docs/v2/acl-design.md §"Operator Dashboards" shape.
  2. C.2 LogContextEnricher installed at Serilog config time. Every driver-capability call site wraps its body in using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId)). Correlation IDs: reuse OPC UA RequestHeader.RequestHandle when in-flight; otherwise generate Guid.NewGuid().ToString("N")[..12].
  3. C.3 Add JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via Serilog:WriteJson appsetting.
  4. C.4 Integration test: boot server, issue Modbus read, assert log line contains DriverInstanceId + CorrelationId structured fields.
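An illustrative /readyz body for a fleet with one degraded driver (a sketch only — field names and nesting are assumptions until the acl-design.md "Operator Dashboards" contract is pinned; the instance IDs are placeholders):

```json
{
  "status": "Degraded",
  "usingStaleConfig": false,
  "drivers": [
    {
      "driverInstanceId": "00000000-0000-0000-0000-000000000001",
      "driverType": "Modbus",
      "state": "Healthy"
    },
    {
      "driverInstanceId": "00000000-0000-0000-0000-000000000002",
      "driverType": "Galaxy",
      "state": "Degraded",
      "detail": "circuit breaker half-open"
    }
  ]
}
```

Per the review-adjusted state matrix, this response would still return HTTP 200 (Degraded); only Faulted, Initializing, or Unknown drivers flip /readyz to 503.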

Stream D — Config DB LiteDB fallback (1 week)

  1. D.1 LiteDbConfigCache adapter. Wraps ConfigurationDbContext queries that are safe to serve stale (cluster membership, generation metadata, driver instance definitions, LDAP role mapping). Write-path queries (draft save, publish) bypass the cache and fail hard on DB outage.
  2. D.2 Cache refresh strategy: refresh on every successful read (write-through-cache), full refresh after sp_PublishGeneration confirmation. Cache entries carry CachedAtUtc; served entries older than 24 h trigger a synthetic Warning log line so operators see stale data in effect.
  3. D.3 Polly pipeline in Configuration project: EF Core query → retry 3× → fallback to cache. On fallback, driver state stays Healthy but a UsingStaleConfig flag on the cluster's health report flips true.
  4. D.4 Tests: in-memory SQL Server failure injected via TestContainers-ish double; cache returns last-known values; Admin UI banners reflect UsingStaleConfig.
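D.3's read path might compose like the following Polly v8 sketch. Everything here is a stand-in: ClusterSnapshot, the cache-read delegate, and the stale-flag callback are hypothetical placeholders for the planned LiteDbConfigCache types. Note that in Polly v8 the fallback strategy is added first so it wraps the retry + timeout:

```csharp
using Polly;
using Polly.Fallback;
using Polly.Retry;

// Hypothetical planned type, stubbed for illustration.
public sealed record ClusterSnapshot(Guid GenerationId);

public static class ConfigReadResilience
{
    public static ResiliencePipeline<ClusterSnapshot> Build(
        Func<ClusterSnapshot> readLastSealedSnapshot, // LiteDbConfigCache read
        Action flagStaleConfig)                       // flips UsingStaleConfig
        => new ResiliencePipelineBuilder<ClusterSnapshot>()
            .AddFallback(new FallbackStrategyOptions<ClusterSnapshot>
            {
                FallbackAction = _ =>
                {
                    flagStaleConfig(); // operators see stale data in effect
                    return Outcome.FromResultAsValueTask(readLastSealedSnapshot());
                }
            })
            .AddRetry(new RetryStrategyOptions<ClusterSnapshot>
            {
                MaxRetryAttempts = 3,
                UseJitter = true,
            })
            .AddTimeout(TimeSpan.FromSeconds(2))
            .Build();
}
```

Write-path queries (draft save, publish) would not use this pipeline at all — per D.1 they bypass the cache and fail hard on DB outage.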

Stream E — Admin /hosts page refresh (3 days)

  1. E.1 Extend DriverHostStatus schema with Stream A resilience columns. Generate EF migration.
  2. E.2 Admin/FleetStatusHub SignalR hub pushes LastCircuitBreakerOpenUtc + CurrentBulkheadDepth + LastRecycleUtc on change.
  3. E.3 /hosts Blazor page renders new columns; red badge if ConsecutiveFailures > breakerThreshold / 2.

Compliance Checks (run at exit gate)

  • Polly coverage: every IDriver* method call in the server dispatch layer routes through DriverCapabilityInvoker. Enforce via a Roslyn analyzer added to Core.Abstractions build (error on direct IDriver.ReadAsync calls outside the invoker).
  • Tier registry: every driver type registered in DriverTypeRegistry has a non-null Tier. Unit test walks the registry + asserts no gaps.
  • Health contract: /healthz + /readyz respond within 500 ms even with one driver Faulted.
  • Structured log: CI grep on tests/ output asserts at least one log line contains "DriverInstanceId" + "CorrelationId" JSON fields.
  • Cache fallback: Integration test kills the SQL container mid-operation; driver health stays Healthy, UsingStaleConfig flips true.
  • No regression in existing test suites — dotnet test on ZB.MOM.WW.OtOpcUa.slnx reports a test count equal to or greater than the pre-Phase-6.1 baseline.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Polly pipeline adds per-request latency on hot path | Medium | Medium | Benchmark in Stream A.5 before merging; 1% overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; UsingStaleConfig flag surfaced on /readyz; cache refresh on every successful DB round-trip; 24-hour synthetic warning |
| Tier watchdog false-positive-kills a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle; thresholds configurable per instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if driver claims Healthy; circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Release analyzer at warning level initially; upgrade to error in Phase 6.1+1 after one release cycle |

Completion Checklist

  • Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
  • Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
  • Stream C: /healthz + /readyz + structured logging + JSON Serilog sink
  • Stream D: LiteDB cache + Polly fallback in Configuration
  • Stream E: Admin /hosts page refresh
  • Cross-cutting: phase-6-1-compliance.ps1 exits 0; full solution dotnet test passes; exit-gate doc recorded

Adversarial Review — 2026-04-19 (Codex, thread 019da489-e317-7aa1-ab1f-6335e0be2447)

Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.

  1. Crit · ACCEPT — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via WriteIdempotent + CAS). Pipeline now capability-specific: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; Write does not unless the tag metadata carries WriteIdempotent=true. New WriteIdempotentAttribute surfaces on ModbusTagDefinition / S7TagDefinition / etc.
  2. Crit · ACCEPT — "One pipeline per driver instance" breaks decision #35's per-device isolation. Change: pipeline key is (DriverInstanceId, HostName) not just DriverInstanceId. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
  3. Crit · ACCEPT — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). Change: Stream B splits into two — MemoryTracking (all tiers, soft/hard thresholds log + surface to Admin /hosts; never kills) and MemoryRecycle (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
  4. High · ACCEPT — Removing Galaxy's hand-rolled CircuitBreaker drops decision #68 host-supervision crash-loop protection. Change: keep Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs + Backoff.cs — they guard the IPC process re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
  5. High · ACCEPT — Roslyn analyzer targeting IDriver misses the hot paths (IReadable.ReadAsync, IWritable.WriteAsync, ISubscribable.SubscribeAsync etc.). Change: analyzer rule now matches every method on the capability interfaces; compliance doc enumerates the full call-site list.
  6. High · ACCEPT — /healthz + /readyz under-specified for degraded-running. Change: add a state-matrix sub-section explicitly covering Unknown (pre-init: /readyz 503), Initializing (503), Healthy (200), Degraded (200 with JSON body flagging the degraded driver; /readyz is OR across drivers), Faulted (503), plus cached-config-serving (/healthz returns 200 + UsingStaleConfig: true in JSON body).
  7. High · ACCEPT — WedgeDetector based on "no successful Read" false-fires on write-only subscriptions + idle systems. Change: wedge criteria now (hasPendingWork AND noProgressIn > threshold) where hasPendingWork comes from the Polly bulkhead depth + active MonitoredItem count. Idle driver stays Healthy.
  8. High · ACCEPT — LiteDB cache serving mixed-generation reads breaks publish atomicity. Change: cache is snapshot-per-generation. Each published generation writes a sealed snapshot into <cache-root>/<cluster>/<generationId>.db; reads serve the last-known-sealed generation and never mix. Central DB outage during a publish means that publish fails (write path doesn't use cache); reads continue from the prior sealed snapshot.
  9. Med · ACCEPT — DriverHostStatus schema conflates per-host connectivity with per-driver-instance resilience counters. Change: new DriverInstanceResilienceStatus table separate from DriverHostStatus. Admin /hosts joins both for display.
  10. Med · ACCEPT — Compliance says analyzer-error; risks say analyzer-warning. Change: phase 6.1 ships at error level (this phase is the gate); warning-mode option removed.
  11. Med · ACCEPT — Hardcoded per-tier MB bands ignore decision #70's max(multiplier × baseline, baseline + floor) formula with observed-baseline capture. Change: watchdog captures baseline at post-init plateau (median of first 5 min GetMemoryFootprint samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
  12. Med · ACCEPT — Tests mostly cover happy path. Change: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historic-backfill; Stream D.4 adds mixed-generation cache test + corrupt-first-boot cache test.
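Finding 11's hybrid formula can be stated directly. A minimal sketch — the class and parameter names are hypothetical; per the adjustment, baselineBytes would be the median GetMemoryFootprint() sample over the first five minutes post-init:

```csharp
public static class MemoryThresholds
{
    // Decision #70 hybrid: threshold = max(multiplier × baseline, baseline + floor).
    // Tier constants supply the multiplier + floor; the baseline is observed,
    // not hardcoded, so a naturally large driver isn't recycled at steady state.
    public static long HardLimitBytes(
        long baselineBytes, double tierMultiplier, long tierFloorBytes)
        => Math.Max((long)(tierMultiplier * baselineBytes),
                    baselineBytes + tierFloorBytes);
}
```

The floor term dominates for small baselines (so a tiny driver still gets headroom); the multiplier dominates once the baseline is large.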