# Phase 6.1 — Resilience & Observability Runtime

- Status: DRAFT — implementation plan for a cross-cutting phase that was never formalised. The v2 `plan.md` specifies Polly, Tier A/B/C protections, structured logging, and local-cache fallback by decision; none are wired end-to-end.
- Branch: `v2/phase-6-1-resilience-observability`
- Estimated duration: 3 weeks
- Predecessor: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
- Successor: Phase 6.2 (Authorization runtime)
## Phase Objective

Land the cross-cutting runtime protections + operability features that `plan.md` + `driver-stability.md` specify by decision but that no driver phase actually wires. End state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central-DB outage via a LiteDB local cache.
Closes these gaps flagged in the 2026-04-19 audit:
- Polly v8 resilience pipelines wired to every `IDriver` capability (no-op per-driver today; Galaxy has a hand-rolled `CircuitBreaker` only).
- Tier A/B/C enforcement at runtime — `driver-stability.md` §2–4 and decisions #63–73 define memory watchdog, bounded queues, scheduled recycle, wedge detection; `MemoryWatchdog` exists only inside `Driver.Galaxy.Host`.
- Health endpoints (`/healthz`, `/readyz`) on `OtOpcUa.Server`.
- Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
- LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).
## Scope — What Changes

| Concern | Change |
|---|---|
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`), per-capability policy (Read / Write / Subscribe / HistoryRead / Discover / Probe / Alarm). One pipeline per driver instance; driver options decide tuning. |
| Every `IDriver*` consumer in the server | Wrap capability calls in the shared pipeline. Policy composition order: timeout → retry (with jitter, bounded by capability-specific `MaxRetries`) → circuit breaker (per driver instance, opens on N consecutive failures) → bulkhead (ceiling on in-flight requests per driver). |
| `Core` → new `Core.Stability` sub-namespace | Generalise `MemoryWatchdog` (`Driver.Galaxy.Host`) into `DriverMemoryWatchdog` consuming `IDriver.GetMemoryFootprint()`. Add `ScheduledRecycleScheduler` (decision #67) for weekly/time-of-day recycle. Add `WedgeDetector` that flips a driver to Faulted when no successful Read in N × PublishingInterval. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must also advertise their out-of-process topology; the registry enforces invariants (Tier C has a Proxy + Host pair). |
| `OtOpcUa.Server` → new Minimal API endpoints | `/healthz` (liveness — process alive + config DB reachable or LiteDB cache warm), `/readyz` (readiness — every driver instance reports `DriverState.Healthy`). JSON bodies cite individual driver health per instance. |
| Serilog configuration | Centralise enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every driver call runs inside a `LogContext.PushProperty` scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON-formatted output added alongside plain-text so SIEM ingestion works. |
| `Configuration` project | Add `LiteDbConfigCache` adapter. Wrap EF Core queries in a Polly pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-cache. Cache refresh on successful DB query + after `sp_PublishGeneration`. Cache lives at `%ProgramData%/OtOpcUa/config-cache/<cluster-id>.db` per node. |
| `DriverHostStatus` entity | Extend to carry `LastCircuitBreakerOpenUtc`, `ConsecutiveFailures`, `CurrentBulkheadDepth`, `LastRecycleUtc`. Admin `/hosts` page reads these. |
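The composition order in the table can be sketched with Polly v8's `ResiliencePipelineBuilder`. This is an orientation sketch, not shipped code: `DriverResilienceOptions` and its properties are assumptions from this plan, and note that Polly v8's breaker is a failure-ratio window rather than a raw consecutive-failure count (approximated below), while the concurrency limiter lives in the `Polly.RateLimiting` package.

```csharp
// Sketch only — option record and thresholds are hypothetical names from this plan.
using System;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

public sealed record DriverResilienceOptions(
    int MaxRetries, TimeSpan Timeout, int BreakerFailureThreshold, int MaxInFlight);

public static class DriverResiliencePipelines
{
    // Order per the table: timeout → retry → circuit breaker → bulkhead.
    // In Polly v8 the first strategy added is the outermost.
    public static ResiliencePipeline Build(DriverResilienceOptions o) =>
        new ResiliencePipelineBuilder()
            .AddTimeout(o.Timeout)
            .AddRetry(new RetryStrategyOptions
            {
                MaxRetryAttempts = o.MaxRetries,
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
            })
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                // Approximates "opens on N consecutive failures" with v8's
                // ratio-over-window model.
                FailureRatio = 1.0,
                MinimumThroughput = o.BreakerFailureThreshold,
                BreakDuration = TimeSpan.FromSeconds(30),
            })
            .AddConcurrencyLimiter(permitLimit: o.MaxInFlight, queueLimit: 0)
            .Build();
}
```

One pipeline instance per driver keeps breaker and bulkhead state isolated per instance, which is what the table's "per driver instance" wording requires.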
## Scope — What Does NOT Change
| Item | Reason |
|---|---|
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers. |
| Config DB schema | LiteDB cache is a read-only mirror; no new central tables except DriverHostStatus column additions. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 generalises the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy SafeMxAccessHandle + Modbus TcpClient disposal stay — they're driver-internal, not part of the cross-cutting layer. |
## Entry Gate Checklist

- Phases 0–5 exit gates cleared (or explicitly deferred with task reference)
- `driver-stability.md` §2–4 re-read; decisions #63–73 + #34–36 re-skimmed
- Polly v8 NuGet available (`Microsoft.Extensions.Resilience` + `Polly.Core`) — verify package restore before task breakdown
- LiteDB 5.x NuGet confirmed MIT + actively maintained
- Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm test-count baseline so the resilience layer doesn't regress any
- Serilog configuration inventory: locate every `Log.ForContext` call site that will need `LogContext` rewrap
- Admin `/hosts` page's current `DriverHostStatus` consumption reviewed so the schema extensions don't break it
## Task Breakdown

### Stream A — Resilience layer (1 week)

- A.1 Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build a `DriverResiliencePipelineBuilder` that composes Timeout → Retry (exponential backoff + jitter, capability-specific max retries) → CircuitBreaker (consecutive-failure threshold; half-open probe) → Bulkhead (max in-flight per driver instance). Unit tests cover each policy in isolation + the composed pipeline.
- A.2 `DriverResilienceOptions` record bound from a new nullable `DriverInstance.ResilienceConfig` JSON column. Defaults encoded per tier: Tier A (OPC UA Client, S7) — 3 retries, 2 s timeout, 5-failure breaker; Tier B (Modbus) — same except 4 s timeout; Tier C (Galaxy) — 1 retry (inner supervisor handles restart), 10 s timeout, circuit breaker trips but doesn't kill the driver (the Proxy supervisor already handles that).
- A.3 `DriverCapabilityInvoker<T>` wraps every `IDriver*` method call. Existing server-side dispatch (whatever currently calls `driver.ReadAsync`) routes through the invoker. Policy injection via DI.
- A.4 Remove the hand-rolled `CircuitBreaker` + `Backoff` from `Driver.Galaxy.Proxy/Supervisor/` — replaced by the shared layer. Keep `HeartbeatMonitor` (different concern: IPC liveness, not data-path resilience).
- A.5 Unit tests: per-policy, per-composition. Integration test: Modbus driver under a `FlakeyTransport` that fails 5× and succeeds on the 6th; the invoker surfaces the eventual success. Bench: no-op overhead < 1 % under nominal load.
### Stream B — Tier A/B/C stability runtime (1 week; can run in parallel with Stream A after A.1)

- B.1 `Core.Abstractions` → `DriverTier` enum {A, B, C}. Extend `DriverTypeRegistry` to require a `DriverTier` at registration. Existing driver types get their tier stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
- B.2 Generalise `DriverMemoryWatchdog` (lift from `Driver.Galaxy.Host/MemoryWatchdog.cs`). Tier-specific thresholds: A = 256 MB RSS soft / 512 MB hard, B = 512 MB soft / 1 GB hard, C = 1 GB soft / 2 GB hard (decision #70 hybrid multiplier + floor). Soft threshold → log + metric; hard threshold → mark driver Faulted + trigger recycle.
- B.3 `ScheduledRecycleScheduler` (decision #67): each driver instance can opt in to a weekly recycle at a configured cron. Recycle = `ShutdownAsync` → `InitializeAsync`. Tier C drivers get the Proxy-side recycle; Tier A/B recycle in-process.
- B.4 `WedgeDetector`: polling thread per driver instance; if `LastSuccessfulRead` is older than `WedgeThreshold` (default 5 × PublishingInterval, minimum 60 s) AND driver state is `Healthy`, flag as wedged → force `ReinitializeAsync`. Prevents silent dead subscriptions.
- B.5 Tests: watchdog unit tests drive synthetic allocation; scheduler uses a virtual clock; wedge-detector tests use a fake `IClock` + driver stub.
### Stream C — Health endpoints + structured logging (4 days)

- C.1 `OtOpcUa.Server/Observability/HealthEndpoints.cs` — Minimal API on a second Kestrel binding (default `http://+:4841`). `/healthz` reports process uptime + config-DB reachability (or cache-warm). `/readyz` enumerates `DriverInstance` rows + reports each driver's `DriverHealth.State`; returns 503 if ANY driver is Faulted. JSON body per `docs/v2/acl-design.md` §"Operator Dashboards" shape.
- C.2 `LogContextEnricher` installed at Serilog config time. Every driver-capability call site wraps its body in `using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId))`. Correlation IDs: reuse OPC UA `RequestHeader.RequestHandle` when in flight; otherwise generate `Guid.NewGuid().ToString("N")[..12]`.
- C.3 Add a JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via the `Serilog:WriteJson` appsetting.
- C.4 Integration test: boot the server, issue a Modbus read, assert the log line contains `DriverInstanceId` + `CorrelationId` structured fields.
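C.2's enrichment scope could be packaged as a small helper so call sites stay one-liners. A sketch — `Serilog.Context.LogContext.PushProperty` is real Serilog API; the helper, its names, and the RequestHandle-or-GUID rule's packaging are assumptions from this plan:

```csharp
using System;
using Serilog.Context;

public static class DriverLogScope
{
    // Pushes the two structured properties the compliance check greps for.
    public static IDisposable Push(string driverInstanceId, uint? uaRequestHandle)
    {
        // Reuse the OPC UA RequestHandle when one is in flight, else a short GUID.
        string correlationId = uaRequestHandle?.ToString()
            ?? Guid.NewGuid().ToString("N")[..12];
        return new CompositeDisposable(
            LogContext.PushProperty("DriverInstanceId", driverInstanceId),
            LogContext.PushProperty("CorrelationId", correlationId));
    }

    private sealed class CompositeDisposable : IDisposable
    {
        private readonly IDisposable[] _inner;
        public CompositeDisposable(params IDisposable[] inner) => _inner = inner;
        public void Dispose() { foreach (var d in _inner) d.Dispose(); }
    }
}
```

Usage: `using (DriverLogScope.Push(id, requestHandle)) { await invoker.InvokeAsync(...); }` — every log event emitted inside the scope carries both fields.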
### Stream D — Config DB LiteDB fallback (1 week)

- D.1 `LiteDbConfigCache` adapter. Wraps `ConfigurationDbContext` queries that are safe to serve stale (cluster membership, generation metadata, driver instance definitions, LDAP role mapping). Write-path queries (draft save, publish) bypass the cache and fail hard on DB outage.
- D.2 Cache refresh strategy: refresh on every successful read (write-through cache), full refresh after `sp_PublishGeneration` confirmation. Cache entries carry `CachedAtUtc`; served entries older than 24 h trigger a synthetic `Warning` log line so operators see stale data in effect.
- D.3 Polly pipeline in the `Configuration` project: EF Core query → retry 3× → fallback to cache. On fallback, driver state stays `Healthy` but a `UsingStaleConfig` flag on the cluster's health report flips true.
- D.4 Tests: in-memory SQL Server failure injected via a TestContainers-style double; cache returns last-known values; Admin UI banners reflect `UsingStaleConfig`.
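D.3's read path maps onto Polly v8's generic fallback strategy. A sketch under this plan's semantics — `BuildConfigReadPipeline` and `readFromLiteDbCache` are hypothetical names:

```csharp
using System;
using Polly;
using Polly.Fallback;
using Polly.Retry;

public static class ConfigReadResilience
{
    // Fallback is added first (outermost) so it catches exhausted retries and
    // per-attempt timeouts alike and serves the last-known sealed snapshot.
    public static ResiliencePipeline<T> BuildConfigReadPipeline<T>(
        Func<T> readFromLiteDbCache)
        => new ResiliencePipelineBuilder<T>()
            .AddFallback(new FallbackStrategyOptions<T>
            {
                FallbackAction = _ =>
                    Outcome.FromResultAsValueTask(readFromLiteDbCache()),
            })
            .AddRetry(new RetryStrategyOptions<T>
            {
                MaxRetryAttempts = 3,
                UseJitter = true,
            })
            .AddTimeout(TimeSpan.FromSeconds(2))
            .Build();
}
```

The ordering matches the table's "timeout (2 s) → retry (3×, jittered) → fallback-to-cache": each attempt gets its own 2 s timeout, retries wrap the timeouts, and only when all attempts fail does the cache answer. Flipping `UsingStaleConfig` would hang off the fallback action.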
### Stream E — Admin `/hosts` page refresh (3 days)

- E.1 Extend the `DriverHostStatus` schema with the Stream A resilience columns. Generate an EF migration.
- E.2 `Admin/FleetStatusHub` SignalR hub pushes `LastCircuitBreakerOpenUtc` + `CurrentBulkheadDepth` + `LastRecycleUtc` on change.
- E.3 The `/hosts` Blazor page renders the new columns; red badge if `ConsecutiveFailures > breakerThreshold / 2`.
## Compliance Checks (run at exit gate)

- Polly coverage: every `IDriver*` method call in the server dispatch layer routes through `DriverCapabilityInvoker`. Enforce via a Roslyn analyzer added to the `Core.Abstractions` build (error on direct `IDriver.ReadAsync` calls outside the invoker).
- Tier registry: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. A unit test walks the registry + asserts no gaps.
- Health contract: `/healthz` + `/readyz` respond within 500 ms even with one driver Faulted.
- Structured log: CI grep on `tests/` output asserts at least one log line contains `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- Cache fallback: integration test kills the SQL container mid-operation; driver health stays `Healthy`, `UsingStaleConfig` flips true.
- No regression in existing test suites — `dotnet test ZB.MOM.WW.OtOpcUa.slnx` count equal-or-greater than the pre-Phase-6.1 baseline.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Polly pipeline adds per-request latency on hot path | Medium | Medium | Benchmark Stream A.5 before merging; 1 % overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; UsingStaleConfig flag surfaced on /readyz; cache refresh on every successful DB round-trip; 24-hour synthetic warning |
| Tier watchdog false-positive-kills a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle; thresholds configurable per-instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if driver claims Healthy; add circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Release analyzer as warning-level initially; upgrade to error in Phase 6.1+1 after one release cycle |
## Completion Checklist
- Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
- Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
- Stream C: `/healthz` + `/readyz` + structured logging + JSON Serilog sink
- Stream D: LiteDB cache + Polly fallback in `Configuration`
- Stream E: Admin `/hosts` page refresh
- Cross-cutting: `phase-6-1-compliance.ps1` exits 0; full solution `dotnet test` passes; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread 019da489-e317-7aa1-ab1f-6335e0be2447)
Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.
- Crit · ACCEPT — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via `WriteIdempotent` + CAS). Pipeline now capability-specific: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; Write does not unless the tag metadata carries `WriteIdempotent = true`. New `WriteIdempotentAttribute` surfaces on `ModbusTagDefinition` / `S7TagDefinition` / etc.
- Crit · ACCEPT — "One pipeline per driver instance" breaks decision #35's per-device isolation. Change: pipeline key is `(DriverInstanceId, HostName)`, not just `DriverInstanceId`. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
- Crit · ACCEPT — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). Change: Stream B splits into two — `MemoryTracking` (all tiers, soft/hard thresholds log + surface to Admin `/hosts`; never kills) and `MemoryRecycle` (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
- High · ACCEPT — Removing Galaxy's hand-rolled `CircuitBreaker` drops decision #68 host-supervision crash-loop protection. Change: keep `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard the IPC process re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
- High · ACCEPT — Roslyn analyzer targeting `IDriver` misses the hot paths (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ISubscribable.SubscribeAsync`, etc.). Change: analyzer rule now matches every method on the capability interfaces; compliance doc enumerates the full call-site list.
- High · ACCEPT — `/healthz` + `/readyz` under-specified for degraded-running. Change: add a state-matrix sub-section explicitly covering `Unknown` (pre-init: `/readyz` 503), `Initializing` (503), `Healthy` (200), `Degraded` (200 with JSON body flagging the degraded driver; `/readyz` is OR across drivers), `Faulted` (503), plus cached-config-serving (`/healthz` returns 200 + `UsingStaleConfig: true` in JSON body).
- High · ACCEPT — `WedgeDetector` based on "no successful Read" false-fires on write-only subscriptions + idle systems. Change: wedge criteria now `(hasPendingWork AND noProgressIn > threshold)`, where `hasPendingWork` comes from the Polly bulkhead depth + active MonitoredItem count. Idle driver stays Healthy.
- High · ACCEPT — LiteDB cache serving mixed-generation reads breaks publish atomicity. Change: cache is snapshot-per-generation. Each published generation writes a sealed snapshot into `<cache-root>/<cluster>/<generationId>.db`; reads serve the last-known sealed generation and never mix. Central DB outage during a publish means that publish fails (write path doesn't use cache); reads continue from the prior sealed snapshot.
- Med · ACCEPT — `DriverHostStatus` schema conflates per-host connectivity with per-driver-instance resilience counters. Change: new `DriverInstanceResilienceStatus` table separate from `DriverHostStatus`. Admin `/hosts` joins both for display.
- Med · ACCEPT — Compliance says analyzer-error; risks say analyzer-warning. Change: Phase 6.1 ships at error level (this phase is the gate); warning-mode option removed.
- Med · ACCEPT — Hardcoded per-tier MB bands ignore decision #70's `max(multiplier × baseline, baseline + floor)` formula with observed-baseline capture. Change: watchdog captures baseline at post-init plateau (median of first 5 min `GetMemoryFootprint` samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
- Med · ACCEPT — Tests mostly cover happy path. Change: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historic-backfill; Stream D.4 adds mixed-generation cache test + corrupt-first-boot cache test.
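The accepted `/readyz` state matrix reduces to a small mapping worth pinning down so the endpoint and its tests agree. A sketch — the enum values come from this plan, the method name is hypothetical:

```csharp
// /readyz status mapping per the accepted state matrix: readiness is an OR
// across drivers — any pre-init or Faulted driver makes the node not-ready,
// while Degraded still answers 200 and is only flagged in the JSON body.
public enum DriverState { Unknown, Initializing, Healthy, Degraded, Faulted }

public static class ReadinessMatrix
{
    public static int StatusCodeFor(DriverState s) => s switch
    {
        DriverState.Healthy or DriverState.Degraded => 200,
        _ => 503, // Unknown, Initializing, Faulted
    };
}
```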