# Phase 6.1 — Resilience & Observability Runtime
Status: SHIPPED 2026-04-19 — Streams A/B/C/D + the Stream E data layer merged to `v2` across PRs #78–82. The final exit-gate PR #83 turns the compliance script into real checks (all pass) and records this status update. One deferred piece: the Stream E.2/E.3 SignalR hub + Blazor `/hosts` column refresh lands in a visual-compliance follow-up PR on the Phase 6.4 Admin UI branch.

Baseline: 906 solution tests → post-Phase-6.1: 1042 passing (+136 net). The one pre-existing Client.CLI Subscribe flake is unchanged.

Branch: `v2/phase-6-1-resilience-observability`
Estimated duration: 3 weeks
Predecessor: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
Successor: Phase 6.2 (Authorization runtime)
## Phase Objective

Land the cross-cutting runtime protections + operability features that `plan.md` + `driver-stability.md` specify by decision but that no driver phase actually wires. End state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central-DB outage via a LiteDB local cache.
Closes these gaps flagged in the 2026-04-19 audit:
- Polly v8 resilience pipelines wired to every `IDriver` capability (no-op per driver today; Galaxy has a hand-rolled `CircuitBreaker` only).
- Tier A/B/C enforcement at runtime — `driver-stability.md` §2–4 and decisions #63–73 define memory watchdog, bounded queues, scheduled recycle, wedge detection; `MemoryWatchdog` exists only inside `Driver.Galaxy.Host`.
- Health endpoints (`/healthz`, `/readyz`) on `OtOpcUa.Server`.
- Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
- LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).
## Scope — What Changes
| Concern | Change |
|---|---|
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`). Pipeline key = `(DriverInstanceId, HostName)` so one dead PLC behind a multi-device driver doesn't open the breaker for healthy siblings (decision #35, per-device isolation). Per-capability policy — Read / HistoryRead / Discover / Probe / Alarm get retries; Write does NOT unless `[WriteIdempotent]` is on the tag definition (decisions #44–45). |
| Every capability-interface consumer in the server | Wrap `IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync`/`UnsubscribeAsync`, the `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync`/`AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync`/`ReadProcessedAsync`/`ReadAtTimeAsync`/`ReadEventsAsync`. Composition: timeout → (retry when the capability supports it) → circuit breaker → bulkhead. |
| `Core.Abstractions` → new `WriteIdempotentAttribute` | Marker on `ModbusTagDefinition` / `S7TagDefinition` / `OpcUaClientDriver` tag rows; opts that tag into auto-retry on Write. Absence = no retry, per spec. |
| `Core` → new `Core.Stability` sub-namespace — split | Two separate subsystems: (a) `MemoryTracking` runs on all tiers; captures a baseline (median of the first 5 min of `GetMemoryFootprint` samples) + applies the hybrid rule `soft = max(multiplier × baseline, baseline + floor)`; a soft breach logs + surfaces to Admin; never kills. (b) `MemoryRecycle` (Tier C only — requires out-of-process topology) handles hard-breach recycle via the Proxy-side supervisor. Tier A/B overrun escalates to a Tier C promotion ticket, not auto-kill. |
| `ScheduledRecycleScheduler` | Tier C only per decisions #73–74. Weekly/time-of-day recycle via the Proxy supervisor. Tier A/B opt-in recycle lands in a future phase together with a Tier-C-escalation workflow. |
| `WedgeDetector` | Demand-aware: flips a driver to Faulted only when `(hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` derives from non-zero Polly bulkhead depth OR ≥1 active MonitoredItem OR ≥1 queued historian read. Idle + subscription-only drivers stay Healthy. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must advertise their out-of-process topology; the registry enforces invariants (Tier C has a Proxy + Host pair). |
| `Driver.Galaxy.Proxy/Supervisor/` | Retains the existing `CircuitBreaker` + `Backoff` — they guard IPC respawn (decision #68), a different concern from the per-call Polly layer. Only `HeartbeatMonitor` is referenced downstream (IPC liveness). |
| `OtOpcUa.Server` → Minimal API endpoints on `http://+:4841` | `/healthz` = process alive + (config DB reachable OR `UsingStaleConfig=true`). `/readyz` = ANDed driver health; state machine per `DriverState`: Unknown/Initializing → 503, Healthy → 200, Degraded → 200 + `{degradedDrivers: [...]}` in the body, Faulted → 503. The JSON body always reports per-instance detail. |
| Serilog configuration | Centralize enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every capability call runs inside a `LogContext.PushProperty` scope with `{DriverInstanceId, DriverType, CapabilityName, CorrelationId}` (UA RequestHandle or internal GUID). Sink config stays rolling-file per CLAUDE.md; a JSON sink is added alongside plain text (switchable via the `Serilog:WriteJson` appsetting). |
| Configuration project | Add the `LiteDbConfigCache` adapter. Generation-sealed snapshots: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file. Reads serve the last-known-sealed generation; mixed-generation reads are impossible. The write path bypasses the cache + fails hard on DB outage. Pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-sealed-snapshot. |
| `DriverHostStatus` vs. `DriverInstanceResilienceStatus` | New separate entity `DriverInstanceResilienceStatus { DriverInstanceId, HostName, LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc, BaselineFootprintBytes }`. `DriverHostStatus` keeps per-host connectivity only; Admin `/hosts` joins both for display. |
## Scope — What Does NOT Change
| Item | Reason |
|---|---|
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (`ModbusTcpTransport` reconnect, `SessionReconnectHandler`) stays in place as inner layers. |
| Config DB schema | The LiteDB cache is a read-only mirror; the only central-schema addition is the new `DriverInstanceResilienceStatus` table. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 generalises the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy `SafeMxAccessHandle` + Modbus `TcpClient` disposal stay — they're driver-internal, not part of the cross-cutting layer. |
## Entry Gate Checklist

- Phases 0–5 exit gates cleared (or explicitly deferred with a task reference)
- `driver-stability.md` §2–4 re-read; decisions #63–73 + #34–36 re-skimmed
- Polly v8 NuGet available (`Microsoft.Extensions.Resilience` + `Polly.Core`) — verify package restore before task breakdown
- LiteDB 5.x NuGet confirmed MIT-licensed + actively maintained
- Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm the test-count baseline so the resilience layer doesn't regress any
- Serilog configuration inventory: locate every `Log.ForContext` call site that will need a `LogContext` rewrap
- Admin `/hosts` page's current `DriverHostStatus` consumption reviewed so the schema extensions don't break it
## Task Breakdown
### Stream A — Resilience layer (1 week)

- A.1 Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build `DriverResiliencePipelineBuilder` — keyed on `(DriverInstanceId, HostName)`; composes Timeout → (Retry when the capability allows it; skipped for Write unless `[WriteIdempotent]`) → CircuitBreaker → Bulkhead. The per-capability policy map is documented in `DriverResilienceOptions.CapabilityPolicies`.
- A.2 `DriverResilienceOptions` record bound from the `DriverInstance.ResilienceConfig` JSON column (new, nullable). Per-tier × per-capability defaults: Tier A (OpcUaClient, S7) Read 3 retries / 2 s / 5-failure breaker, Write 0 retries / 2 s / 5-failure breaker; Tier B (Modbus) Read 3 / 4 s / 5, Write 0 / 4 s / 5; Tier C (Galaxy) Read 1 retry / 10 s / no-kill, Write 0 / 10 s / no-kill. Idempotent writes can opt into Read-shaped retry via the attribute.
- A.3 `CapabilityInvoker<TCapability, TResult>` wraps every method on the capability interfaces (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync`/`UnsubscribeAsync`, the `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync`/`AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync`/`ReadProcessedAsync`/`ReadAtTimeAsync`/`ReadEventsAsync`). Existing server-side dispatch routes through it.
- A.4 Retain `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard IPC process respawn (decision #68), orthogonal to the per-call Polly layer. Only `HeartbeatMonitor` is consumed outside the supervisor.
- A.5 Unit tests: per policy, per composition. Negative integration tests: (a) Modbus FlakeyTransport fails 5× on Read, succeeds on the 6th — the invoker surfaces success; (b) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=false` — the invoker surfaces the failure without retry (no duplicate pulse); (c) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=true` — the invoker retries. Bench: no-op overhead < 1%.
- A.6 `WriteIdempotentAttribute` in `Core.Abstractions`. Modbus/S7/OpcUaClient tag-definition records pick it up; the invoker reads it via reflection once at driver init.
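The A.1 composition can be sketched against the Polly v8 API. This is a sketch only: it assumes the Polly v8 meta-package (`Polly`, which pulls in `Polly.RateLimiting` for the concurrency-limiter bulkhead), and `BuildCapabilityPipeline`, the option values, and the host name are illustrative stand-ins for the real `DriverResiliencePipelineBuilder`:

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;
using Polly.Registry;
using Polly.Retry;

static ResiliencePipeline BuildCapabilityPipeline(bool retryAllowed)
{
    var builder = new ResiliencePipelineBuilder();
    builder.AddTimeout(TimeSpan.FromSeconds(2));               // Tier A default budget
    if (retryAllowed)                                          // skipped for Write without [WriteIdempotent]
    {
        builder.AddRetry(new RetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
        });
    }
    builder.AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        MinimumThroughput = 5,                                 // ≈ the 5-failure breaker
        FailureRatio = 0.5,
    });
    builder.AddConcurrencyLimiter(permitLimit: 8, queueLimit: 16); // bulkhead
    return builder.Build();
}

// Per-device isolation: one pipeline per (DriverInstanceId, HostName) key, so a
// dead PLC never opens the breaker for a healthy sibling behind the same driver.
var registry = new ResiliencePipelineRegistry<(Guid DriverInstanceId, string HostName)>();
ResiliencePipeline readPipeline = registry.GetOrAddPipeline(
    (Guid.NewGuid(), "plc-a.local"),
    b => b.AddTimeout(TimeSpan.FromSeconds(2)));
```

Polly v8 executes strategies in the order they are added (first added = outermost), which matches the doc's timeout → retry → circuit breaker → bulkhead composition.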
### Stream B — Tier A/B/C stability runtime — split into MemoryTracking + MemoryRecycle (1 week)

- B.1 `Core.Abstractions` → `DriverTier` enum {A, B, C}. Extend `DriverTypeRegistry` to require a `DriverTier` at registration. Existing driver types stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
- B.2 `MemoryTracking` (all tiers), lifted from `Driver.Galaxy.Host/MemoryWatchdog.cs`. Captures `BaselineFootprintBytes` as the median of the first 5 min of `IDriver.GetMemoryFootprint()` samples post-`InitializeAsync`. Applies the decision #70 hybrid formula `soft = max(multiplier × baseline, baseline + floor)`: Tier A multiplier = 3, floor = 50 MB; Tier B multiplier = 3, floor = 100 MB; Tier C multiplier = 2, floor = 500 MB. Soft breach → log + `DriverInstanceResilienceStatus.CurrentFootprint` tick; never kills. Hard = 2 × soft.
- B.3 `MemoryRecycle` (Tier C only per decisions #73–74). A hard breach on a Tier C driver triggers `ScheduledRecycleScheduler.RequestRecycleNow(driverInstanceId)`; the scheduler proxies to `Driver.Galaxy.Proxy/Supervisor/`, which restarts the Host process. A Tier A/B hard breach logs a promotion-to-Tier-C recommendation; it never auto-kills the in-process driver.
- B.4 `ScheduledRecycleScheduler` per decision #67: Tier C driver instances opt in to a weekly recycle at a configured cron. Tier A/B scheduled recycle is deferred to a later phase, paired with Tier-C escalation.
- B.5 `WedgeDetector`, demand-aware: `if (state == Healthy && hasPendingWork && noProgressIn > WedgeThreshold) → force ReinitializeAsync`. `hasPendingWork` = (bulkhead depth > 0) OR (active monitored items > 0) OR (queued historian-read count > 0). `WedgeThreshold` default 5 × PublishingInterval, minimum 60 s. An idle driver stays Healthy.
- B.6 Tests: tracking unit tests drive synthetic allocation against a fake `GetMemoryFootprint`; recycle tests use a mock supervisor; wedge tests include the false-fault cases — idle subscriber, slow historian backfill, write-only burst.
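The B.2 hybrid rule is compact enough to state exactly. The constants below are the tier values from the table; `SoftLimitBytes` is a hypothetical helper name, not the shipped API:

```csharp
using System;

// soft = max(multiplier × baseline, baseline + floor); hard = 2 × soft.
// Tracking never kills — a breach only logs and surfaces to Admin /hosts.
const long MiB = 1024 * 1024;

static long SoftLimitBytes(long baselineBytes, double multiplier, long floorBytes) =>
    Math.Max((long)(multiplier * baselineBytes), baselineBytes + floorBytes);

// Tier B (Modbus): multiplier = 3, floor = 100 MB.
// Typical baseline (60 MB): the multiplier term dominates → soft = 180 MiB.
long soft = SoftLimitBytes(60 * MiB, 3, 100 * MiB);
long hard = 2 * soft;

// Tiny baseline (20 MB): the floor term dominates → soft = 120 MiB —
// which is why the rule is a max() and not a bare multiplier.
long softSmall = SoftLimitBytes(20 * MiB, 3, 100 * MiB);
```

The `max()` keeps small-footprint drivers from tripping on normal jitter while still bounding large ones proportionally.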
### Stream C — Health endpoints + structured logging (4 days)

- C.1 `OtOpcUa.Server/Observability/HealthEndpoints.cs` — Minimal API on a second Kestrel binding (default `http://+:4841`). `/healthz` reports process uptime + config-DB reachability (or cache-warm). `/readyz` enumerates `DriverInstance` rows + reports each driver's `DriverHealth.State`; returns 503 if ANY driver is Faulted. JSON body per the `docs/v2/acl-design.md` §"Operator Dashboards" shape.
- C.2 `LogContextEnricher` installed at Serilog config time. Every driver-capability call site wraps its body in `using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId))`. Correlation IDs: reuse the OPC UA `RequestHeader.RequestHandle` when in flight; otherwise generate `Guid.NewGuid().ToString("N")[..12]`.
- C.3 Add a JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. The sink is switchable via the `Serilog:WriteJson` appsetting.
- C.4 Integration test: boot the server, issue a Modbus read, assert the log line carries the `DriverInstanceId` + `CorrelationId` structured fields.
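The C.2/C.3 pattern can be sketched as follows (assumes the `Serilog`, `Serilog.Sinks.File`, and `Serilog.Formatting.Compact` packages; the file paths and the unconditionally enabled JSON sink are illustrative — the real config gates the JSON sink on `Serilog:WriteJson`):

```csharp
using System;
using Serilog;
using Serilog.Context;
using Serilog.Formatting.Compact;

Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()                       // required for PushProperty to reach sinks
    .WriteTo.File("logs/server-.log", rollingInterval: RollingInterval.Day)
    .WriteTo.File(new CompactJsonFormatter(),      // the opt-in JSON sink
        "logs/server-json-.log", rollingInterval: RollingInterval.Day)
    .CreateLogger();

var driverInstanceId = Guid.NewGuid();
// No UA RequestHandle in flight → fall back to a short random correlation id.
var correlationId = Guid.NewGuid().ToString("N")[..12];

using (LogContext.PushProperty("DriverInstanceId", driverInstanceId))
using (LogContext.PushProperty("CapabilityName", "Read"))
using (LogContext.PushProperty("CorrelationId", correlationId))
{
    Log.Information("Read dispatched");            // carries all three fields in both sinks
}

Log.CloseAndFlush();
```

Because the properties travel via `LogContext` rather than message templates, every log line emitted inside the scope — including ones from inner driver code — picks up the same correlation fields.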
### Stream D — Config DB LiteDB fallback — generation-sealed snapshots (1 week)

- D.1 `LiteDbConfigCache` adapter backed by sealed generation snapshots: each successful `sp_PublishGeneration` writes `<cache-root>/<clusterId>/<generationId>.db`, marked read-only after commit. The adapter maintains a `CurrentSealedGenerationId` pointer updated atomically on successful publish. Mixed-generation reads are impossible — every read served from the cache serves one coherent sealed generation.
- D.2 Write-path queries (draft save, publish) bypass the cache entirely and fail hard on DB outage. Read-path queries (DriverInstance enumeration, LdapGroupRoleMapping, cluster + namespace metadata) go through the pipeline: timeout 2 s → retry 3×, jittered → fallback to the current sealed snapshot.
- D.3 The `UsingStaleConfig` flag flips true when a read fell back to the sealed snapshot; it clears on the next successful DB round-trip. Surfaced in the `/healthz` body and on Admin `/hosts`.
- D.4 Tests: (a) SQL-container kill mid-operation — the read returns the sealed snapshot, `UsingStaleConfig=true`, the driver stays Healthy; (b) mixed-generation guard — attempt to serve a partial generation by corrupting a snapshot file mid-read → the adapter fails closed rather than serving mixed data; (c) first-boot no-snapshot case — the adapter refuses to start, the driver fails `InitializeAsync` with a clear config-DB-required error.
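The read-path behavior in D.1–D.3 reduces to a small core: an atomically swapped pointer to one sealed snapshot, a fallback read, and a fail-closed path when no snapshot exists. A sketch with the LiteDB file store faked as an in-memory tuple (all names hypothetical):

```csharp
using System;
using System.Collections.Generic;

// At most one current sealed snapshot; swapping the whole reference at publish
// time is what makes mixed-generation reads impossible.
(Guid GenerationId, Dictionary<string, string> Rows)? current = null;

// Called only after sp_PublishGeneration commits.
void Publish(Guid generationId, Dictionary<string, string> rows) =>
    current = (generationId, rows);

// Central DB first; on outage fall back to the sealed snapshot; fail closed
// when none exists (first boot + DB down → InitializeAsync must fail loudly).
(string value, bool usingStaleConfig) ReadConfig(Func<string> centralDbRead, string key)
{
    try
    {
        return (centralDbRead(), false);           // normal path (after timeout + retry)
    }
    catch (Exception)                              // DB outage
    {
        var snap = current ?? throw new InvalidOperationException(
            "GenerationCacheUnavailable: no sealed snapshot; config DB required for first boot.");
        return (snap.Rows[key], true);             // sets UsingStaleConfig
    }
}

// Publish generation 1, then lose the DB mid-read: the sealed snapshot serves.
Publish(Guid.NewGuid(), new Dictionary<string, string> { ["pollMs"] = "500" });
var (value, stale) = ReadConfig(() => throw new TimeoutException("db down"), "pollMs");
```

The real adapter does the same swap with the `CurrentSealedGenerationId` pointer over read-only `.db` files; the important invariant is that a read never observes two generations.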
### Stream E — Admin `/hosts` page refresh (3 days)

- E.1 Add the separate `DriverInstanceResilienceStatus` entity carrying the Stream A resilience counters (`DriverHostStatus` keeps per-host connectivity only, per the adversarial-review resolution). Generate the EF migration.
- E.2 `Admin/FleetStatusHub` SignalR hub pushes `LastCircuitBreakerOpenUtc` + `CurrentBulkheadDepth` + `LastRecycleUtc` on change.
- E.3 The `/hosts` Blazor page renders the new columns; red badge if `ConsecutiveFailures > breakerThreshold / 2`.
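The shipped data-layer piece can be pictured as an EF-style record. The property types are assumptions read off the scope table, and the helper is only an illustration of the E.3 badge rule, not shipped code:

```csharp
using System;

// Illustrative usage: 3 consecutive failures against a 5-failure breaker → red badge.
var status = new DriverInstanceResilienceStatus
{
    DriverInstanceId = Guid.NewGuid(),
    HostName = "plc-a.local",
    ConsecutiveFailures = 3,
};
Console.WriteLine(status.IsRedBadge(breakerThreshold: 5)); // prints "True" (3 > 5 / 2)

// Per-(instance, host) resilience counters — deliberately separate from
// DriverHostStatus, which keeps per-host connectivity only.
public sealed record DriverInstanceResilienceStatus
{
    public Guid DriverInstanceId { get; init; }
    public string HostName { get; init; } = "";   // second half of the pipeline key
    public DateTime? LastCircuitBreakerOpenUtc { get; init; }
    public int ConsecutiveFailures { get; init; }
    public int CurrentBulkheadDepth { get; init; }
    public DateTime? LastRecycleUtc { get; init; }
    public long BaselineFootprintBytes { get; init; }

    // E.3 badge rule: red once failures pass half the breaker threshold.
    public bool IsRedBadge(int breakerThreshold) => ConsecutiveFailures > breakerThreshold / 2;
}
```

Keying this entity on `(DriverInstanceId, HostName)` mirrors the pipeline key, so the Admin `/hosts` join stays one-row-per-breaker.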
## Compliance Checks (run at exit gate)

- Invoker coverage: every method on `IReadable`/`IWritable`/`ITagDiscovery`/`ISubscribable`/`IHostConnectivityProbe`/`IAlarmSource`/`IHistoryProvider` in the server dispatch layer routes through `CapabilityInvoker`. Enforced via a Roslyn analyzer (error-level; warning-first is rejected — the compliance check is the gate).
- Write-retry guard: writes without `[WriteIdempotent]` never get retried. A unit test on the invoker path asserts zero retry attempts.
- Pipeline isolation: the pipeline key is `(DriverInstanceId, HostName)`. Integration test with two Modbus hosts under one instance — failing host A does not open the breaker for host B.
- Tier registry: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. A unit test walks the registry + asserts no gaps. Tier C registrations must declare their out-of-process topology.
- MemoryTracking never kills: soft/hard-breach tests on a Tier A/B driver log + surface without terminating the process.
- MemoryRecycle Tier C only: a hard breach on a Tier A driver never invokes the supervisor; on Tier C it does.
- Wedge demand-aware: the test suite includes idle-subscription-only, slow-historian-backfill, and write-only-burst cases — the driver stays Healthy.
- Galaxy supervisor preserved: `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` still present + still invoked on Host crash.
- Health state machine: `/healthz` + `/readyz` respond within 500 ms for every `DriverState`; the state-machine table in this doc drives the test matrix.
- Structured log: a CI grep asserts at least one log line per capability call has the `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- Generation-sealed cache: integration tests cover (a) SQL kill mid-operation serves the last-sealed snapshot; (b) mixed-generation corruption fails closed; (c) first boot with no snapshot + DB down → `InitializeAsync` fails with a clear error.
- No regression in existing test suites — the `dotnet test ZB.MOM.WW.OtOpcUa.slnx` count is equal or greater than the pre-Phase-6.1 baseline.
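The write-retry-guard invariant is easy to demonstrate against a bare pipeline (a sketch assuming the Polly v8 package; the real compliance test asserts the same invariant through the `CapabilityInvoker` path rather than a raw pipeline):

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

// Count how many times a permanently failing callback is attempted.
static async Task<int> AttemptCountAsync(bool retryAllowed)
{
    var attempts = 0;
    var builder = new ResiliencePipelineBuilder();
    if (retryAllowed)
    {
        builder.AddRetry(new RetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.Zero,          // keep the sketch fast
        });
    }
    var pipeline = builder.Build();

    try
    {
        await pipeline.ExecuteAsync<int>(_ =>
        {
            attempts++;
            throw new TimeoutException("simulated transport failure");
        });
    }
    catch (TimeoutException) { /* expected — the failure still surfaces */ }
    return attempts;
}

// Non-idempotent Write: exactly one attempt; the failure surfaces un-retried.
int writeAttempts = await AttemptCountAsync(retryAllowed: false);   // 1
// [WriteIdempotent] (or Read): one initial attempt + three retries.
int readAttempts = await AttemptCountAsync(retryAllowed: true);     // 4
```

Counting attempts, rather than inspecting retry telemetry, keeps the check independent of Polly internals — any future pipeline change that silently re-enables write retries fails it immediately.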
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Polly pipeline adds per-request latency on hot path | Medium | Medium | Benchmark Stream A.5 before merging; 1 % overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; `UsingStaleConfig` flag surfaced in the `/healthz` body; cache refresh on every successful DB round-trip; synthetic warning after 24 h of continuous stale serving |
| Tier watchdog false-positives on a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle on Tier C and a promotion ticket on Tier A/B; thresholds configurable per instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if driver claims Healthy; add circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Analyzer ships at error level per the adversarial-review resolution (warning-mode removed); it checks the server dispatch layer, so external driver packages are not its target |
## Completion Checklist

- Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
- Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
- Stream C: `/healthz` + `/readyz` + structured logging + JSON Serilog sink
- Stream D: LiteDB cache + Polly fallback in Configuration
- Stream E: Admin `/hosts` page refresh
- Cross-cutting: `phase-6-1-compliance.ps1` exits 0; the full-solution `dotnet test` passes; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread 019da489-e317-7aa1-ab1f-6335e0be2447)
Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.
- Crit · ACCEPT — Auto-retry collides with decisions #44/#45 (no auto write retry; opt-in via `WriteIdempotent` + CAS). The pipeline is now capability-specific: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; Write does not unless the tag metadata carries `WriteIdempotent=true`. The new `WriteIdempotentAttribute` surfaces on `ModbusTagDefinition` / `S7TagDefinition` / etc.
- Crit · ACCEPT — "One pipeline per driver instance" breaks decision #35's per-device isolation. Change: the pipeline key is `(DriverInstanceId, HostName)`, not just `DriverInstanceId`. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
- Crit · ACCEPT — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). Change: Stream B splits into two — `MemoryTracking` (all tiers; soft/hard thresholds log + surface to Admin `/hosts`; never kills) and `MemoryRecycle` (Tier C only; requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
- High · ACCEPT — Removing Galaxy's hand-rolled `CircuitBreaker` drops decision #68's host-supervision crash-loop protection. Change: keep `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard the IPC process respawn, not the per-call data path. Data-path Polly is an orthogonal layer.
- High · ACCEPT — A Roslyn analyzer targeting `IDriver` misses the hot paths (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ISubscribable.SubscribeAsync`, etc.). Change: the analyzer rule now matches every method on the capability interfaces; the compliance doc enumerates the full call-site list.
- High · ACCEPT — `/healthz` + `/readyz` under-specified for degraded-running. Change: add a state-matrix sub-section explicitly covering Unknown (pre-init: `/readyz` 503), Initializing (503), Healthy (200), Degraded (200 with a JSON body flagging the degraded driver; `/readyz` is OR across drivers), Faulted (503), plus cached-config serving (`/healthz` returns 200 + `UsingStaleConfig: true` in the JSON body).
- High · ACCEPT — A `WedgeDetector` based on "no successful Read" false-fires on write-only subscriptions + idle systems. Change: the wedge criterion is now `(hasPendingWork AND noProgressIn > threshold)`, where `hasPendingWork` comes from the Polly bulkhead depth + active MonitoredItem count. An idle driver stays Healthy.
- High · ACCEPT — A LiteDB cache serving mixed-generation reads breaks publish atomicity. Change: the cache is snapshot-per-generation. Each published generation writes a sealed snapshot into `<cache-root>/<cluster>/<generationId>.db`; reads serve the last-known-sealed generation and never mix. A central-DB outage during a publish means that publish fails (the write path doesn't use the cache); reads continue from the prior sealed snapshot.
- Med · ACCEPT — The `DriverHostStatus` schema conflates per-host connectivity with per-driver-instance resilience counters. Change: new `DriverInstanceResilienceStatus` table, separate from `DriverHostStatus`. Admin `/hosts` joins both for display.
- Med · ACCEPT — Compliance says analyzer-error; risks say analyzer-warning. Change: Phase 6.1 ships at error level (this phase is the gate); the warning-mode option is removed.
- Med · ACCEPT — Hardcoded per-tier MB bands ignore decision #70's `max(multiplier × baseline, baseline + floor)` formula with observed-baseline capture. Change: the watchdog captures the baseline at the post-init plateau (median of the first 5 min of `GetMemoryFootprint` samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
- Med · ACCEPT — Tests mostly cover the happy path. Change: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historian-backfill; Stream D.4 adds a mixed-generation cache test + a corrupt-first-boot cache test.