# Phase 6.1 — Resilience & Observability Runtime

> **Status**: DRAFT — implementation plan for a cross-cutting phase that was never formalised. The v2 `plan.md` specifies Polly, Tier A/B/C protections, structured logging, and local-cache fallback by decision; none are wired end-to-end.
>
> **Branch**: `v2/phase-6-1-resilience-observability`
>
> **Estimated duration**: 3 weeks
>
> **Predecessor**: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
>
> **Successor**: Phase 6.2 (Authorization runtime)

## Phase Objective
Land the cross-cutting runtime protections + operability features that `plan.md` + `driver-stability.md` specify by decision but that no driver phase actually wires. End-state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central-DB outage via a LiteDB local cache.

Closes these gaps flagged in the 2026-04-19 audit:
1. Polly v8 resilience pipelines wired to every `IDriver` capability (nothing is wired per driver today; Galaxy has only a hand-rolled `CircuitBreaker`).
2. Tier A/B/C enforcement at runtime — `driver-stability.md` §2–4 and decisions #63–73 define memory watchdog, bounded queues, scheduled recycle, wedge detection; `MemoryWatchdog` exists only inside `Driver.Galaxy.Host`.
3. Health endpoints (`/healthz`, `/readyz`) on `OtOpcUa.Server`.
4. Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
5. LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`), per-capability policy (Read / Write / Subscribe / HistoryRead / Discover / Probe / Alarm). One pipeline per driver instance; driver options decide tuning. |
| Every `IDriver*` consumer in the server | Wrap capability calls in the shared pipeline. Policy composition order: timeout → retry (with jitter, bounded by capability-specific `MaxRetries`) → circuit breaker (per driver instance, opens on N consecutive failures) → bulkhead (ceiling on in-flight requests per driver). See the sketch after this table. |
| `Core` → new `Core.Stability` sub-namespace | Generalise `MemoryWatchdog` (`Driver.Galaxy.Host`) into `DriverMemoryWatchdog` consuming `IDriver.GetMemoryFootprint()`. Add `ScheduledRecycleScheduler` (decision #67) for weekly/time-of-day recycle. Add `WedgeDetector` that flips a driver to Faulted when there is no successful Read in N × PublishingInterval. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must also advertise their out-of-process topology; the registry enforces invariants (Tier C has a `Proxy` + `Host` pair). |
| `OtOpcUa.Server` → new Minimal API endpoints | `/healthz` (liveness — process alive + config DB reachable or LiteDB cache warm), `/readyz` (readiness — every driver instance reports `DriverState.Healthy`). JSON bodies report per-instance driver health. |
| Serilog configuration | Centralise enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every driver call runs inside a `LogContext.PushProperty` scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON-formatted output is added alongside plain-text so SIEM ingestion works. |
| `Configuration` project | Add `LiteDbConfigCache` adapter. Wrap EF Core queries in a Polly pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-cache. Cache refresh on successful DB query + after `sp_PublishGeneration`. Cache lives at `%ProgramData%/OtOpcUa/config-cache/<cluster-id>.db` per node. |
| `DriverHostStatus` entity | Extend to carry `LastCircuitBreakerOpenUtc`, `ConsecutiveFailures`, `CurrentBulkheadDepth`, `LastRecycleUtc`. Admin `/hosts` page reads these. |
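
A minimal sketch of the composition in the row above, assuming Polly v8 (`ResiliencePipelineBuilder` from `Polly.Core`, concurrency limiter from `Polly.RateLimiting`). Strategies added first sit outermost; whether the timeout is total or per-attempt is settled in Stream A.1, and Polly's built-in breaker trips on a failure ratio over a sampling window rather than on literally N consecutive failures. Parameter values are illustrative, not the final `DriverResilienceOptions` defaults.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

// Illustrative composition only: values would come from DriverResilienceOptions per tier/capability.
public static class DriverResiliencePipelines
{
    public static ResiliencePipeline Build(TimeSpan timeout, int maxRetries, int maxInFlight)
    {
        return new ResiliencePipelineBuilder()
            // First-added = outermost.
            .AddTimeout(timeout)
            .AddRetry(new RetryStrategyOptions
            {
                MaxRetryAttempts = maxRetries,              // capability-specific MaxRetries
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
                Delay = TimeSpan.FromMilliseconds(200),
            })
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                // Approximates "opens on N consecutive failures" via ratio + throughput floor.
                FailureRatio = 0.9,
                MinimumThroughput = 5,
                SamplingDuration = TimeSpan.FromSeconds(30),
                BreakDuration = TimeSpan.FromSeconds(15),
            })
            // Bulkhead: ceiling on in-flight requests per driver (Polly.RateLimiting).
            .AddConcurrencyLimiter(permitLimit: maxInFlight, queueLimit: 0)
            .Build();
    }
}
```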
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers. |
| Config DB schema | LiteDB cache is a read-only mirror; no new central tables except `DriverHostStatus` column additions. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 *generalises* the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy `SafeMxAccessHandle` + Modbus `TcpClient` disposal stay — they're driver-internal, not part of the cross-cutting layer. |
## Entry Gate Checklist
- [ ] Phases 0–5 exit gates cleared (or explicitly deferred with task reference)
- [ ] `driver-stability.md` §2–4 re-read; decisions #63–73 + #34–36 re-skimmed
- [ ] Polly v8 NuGet available (`Microsoft.Extensions.Resilience` + `Polly.Core`) — verify package restore before task breakdown
- [ ] LiteDB 5.x NuGet confirmed MIT-licensed + actively maintained
- [ ] Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm the test-count baseline so the resilience layer doesn't regress any of them
- [ ] Serilog configuration inventory: locate every `Log.ForContext` call site that will need a `LogContext` rewrap
- [ ] Admin `/hosts` page's current `DriverHostStatus` consumption reviewed so the schema extensions don't break it
## Task Breakdown
### Stream A — Resilience layer (1 week)
1. **A.1** Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build `DriverResiliencePipelineBuilder` that composes Timeout → Retry (exponential backoff + jitter, capability-specific max retries) → CircuitBreaker (consecutive-failure threshold; half-open probe) → Bulkhead (max in-flight per driver instance). Unit tests cover each policy in isolation + the composed pipeline.
2. **A.2** `DriverResilienceOptions` record bound from `DriverInstance.ResilienceConfig` JSON column (new, nullable). Defaults encoded per tier: Tier A (OPC UA Client, S7) — 3 retries, 2 s timeout, 5-failure breaker; Tier B (Modbus) — same except 4 s timeout; Tier C (Galaxy) — 1 retry (the inner supervisor handles restart), 10 s timeout, circuit-breaker trips but doesn't kill the driver (the Proxy supervisor already handles that).
3. **A.3** `DriverCapabilityInvoker<T>` wraps every `IDriver*` method call. Existing server-side dispatch (whatever currently calls `driver.ReadAsync`) routes through the invoker. Policy injection via DI. See the sketch after this list.
4. **A.4** Remove the hand-rolled `CircuitBreaker` + `Backoff` from `Driver.Galaxy.Proxy/Supervisor/` — replaced by the shared layer. Keep `HeartbeatMonitor` (different concern: IPC liveness, not data-path resilience).
5. **A.5** Unit tests: per-policy, per-composition. Integration test: Modbus driver under a FlakeyTransport that fails 5×, succeeds on the 6th; the invoker surfaces the eventual success. Bench: no-op overhead < 1% under nominal load.
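
One possible shape for the A.3 invoker, assuming per-instance pipelines registered in a Polly `ResiliencePipelineProvider<string>` (review finding #2 later refines the key to `(DriverInstanceId, HostName)`). `IReadable` and `TagValue` are stand-ins for the real `Core.Abstractions` capability types.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Registry;

// Stand-in capability surface; the real interfaces live in Core.Abstractions.
public interface IReadable
{
    Task<IReadOnlyList<TagValue>> ReadAsync(IReadOnlyList<string> tagIds, CancellationToken ct);
}

public sealed record TagValue(string TagId, object? Value);

public sealed class DriverCapabilityInvoker
{
    private readonly ResiliencePipelineProvider<string> _pipelines;

    public DriverCapabilityInvoker(ResiliencePipelineProvider<string> pipelines) => _pipelines = pipelines;

    public async Task<IReadOnlyList<TagValue>> ReadAsync(
        IReadable driver, string driverInstanceId, IReadOnlyList<string> tagIds, CancellationToken ct)
    {
        // Resolve the per-instance pipeline registered at driver start-up (DI-provided).
        var pipeline = _pipelines.GetPipeline(driverInstanceId);

        // Timeout/retry/breaker/bulkhead all apply around this single callback.
        return await pipeline.ExecuteAsync(
            async token => await driver.ReadAsync(tagIds, token),
            ct);
    }
}
```

Registration of the keyed pipelines would typically happen at driver start-up via Polly's DI extensions (for example `AddResiliencePipeline`), but the exact wiring is a Stream A.3 decision.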
### Stream B — Tier A/B/C stability runtime (1 week; can run in parallel with Stream A after A.1)
1. **B.1** `Core.Abstractions` → `DriverTier` enum {A, B, C}. Extend `DriverTypeRegistry` to require a `DriverTier` at registration. Existing driver types get their tier stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
2. **B.2** Generalise `DriverMemoryWatchdog` (lift from `Driver.Galaxy.Host/MemoryWatchdog.cs`). Tier-specific thresholds: A = 256 MB RSS soft / 512 MB hard, B = 512 MB soft / 1 GB hard, C = 1 GB soft / 2 GB hard (decision #70 hybrid multiplier + floor). Soft threshold → log + metric; hard threshold → mark driver Faulted + trigger recycle.
3. **B.3** `ScheduledRecycleScheduler` (decision #67): each driver instance can opt in to a weekly recycle on a configured cron schedule. Recycle = `ShutdownAsync` → `InitializeAsync`. Tier C drivers get the Proxy-side recycle; Tier A/B recycle in-process.
4. **B.4** `WedgeDetector`: polling thread per driver instance; if `LastSuccessfulRead` is older than `WedgeThreshold` (default 5 × PublishingInterval, minimum 60 s) AND the driver state is `Healthy`, flag it as wedged → force `ReinitializeAsync`. Prevents silent dead subscriptions. See the sketch after this list.
5. **B.5** Tests: watchdog unit tests drive synthetic allocation; the scheduler uses a virtual clock; wedge detector tests use a fake IClock + a driver stub.
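
A sketch of the B.4 wedge check, already folding in the idle guard from review finding #7; `DriverRuntimeSnapshot` and its members are assumptions standing in for whatever `Core.Stability` ends up exposing.

```csharp
using System;

// Assumed runtime snapshot: last successful read, bulkhead depth, monitored items, publishing interval.
public sealed record DriverRuntimeSnapshot(
    DateTimeOffset LastSuccessfulReadUtc,
    int InFlightRequests,          // current bulkhead depth
    int ActiveMonitoredItems,
    TimeSpan PublishingInterval,
    bool IsHealthy);

public static class WedgeDetector
{
    private static readonly TimeSpan MinimumThreshold = TimeSpan.FromSeconds(60);

    public static bool IsWedged(DriverRuntimeSnapshot s, DateTimeOffset nowUtc, int multiplier = 5)
    {
        // Threshold = max(5 × PublishingInterval, 60 s), per B.4.
        var threshold = TimeSpan.FromTicks(Math.Max(
            (s.PublishingInterval * multiplier).Ticks, MinimumThreshold.Ticks));

        // Review finding #7: only flag a driver that claims Healthy, has pending work,
        // and has made no progress within the threshold. Idle drivers stay Healthy.
        var hasPendingWork = s.InFlightRequests > 0 || s.ActiveMonitoredItems > 0;
        var noProgress = nowUtc - s.LastSuccessfulReadUtc > threshold;

        return s.IsHealthy && hasPendingWork && noProgress;
    }
}
```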
### Stream C — Health endpoints + structured logging (4 days)
1. **C.1** `OtOpcUa.Server/Observability/HealthEndpoints.cs` — Minimal API on a second Kestrel binding (default `http://+:4841`). `/healthz` reports process uptime + config-DB reachability (or cache-warm). `/readyz` enumerates `DriverInstance` rows + reports each driver's `DriverHealth.State`; returns 503 if ANY driver is Faulted. JSON body per the `docs/v2/acl-design.md` §"Operator Dashboards" shape. See the sketch after this list.
2. **C.2** `LogContextEnricher` installed at Serilog config time. Every driver-capability call site wraps its body in `using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId))`. Correlation IDs: reuse OPC UA `RequestHeader.RequestHandle` when in-flight; otherwise generate `Guid.NewGuid().ToString("N")[..12]`.
3. **C.3** Add a JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via the `Serilog:WriteJson` appsetting.
4. **C.4** Integration test: boot the server, issue a Modbus read, assert the log line contains `DriverInstanceId` + `CorrelationId` structured fields.
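
A minimal-API sketch of C.1, assuming a hypothetical `IDriverFleet` abstraction and a string-valued `State`; the real JSON shape follows `acl-design.md` and the state matrix added by review finding #6.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

// Assumed fleet abstraction: a stand-in for however the server tracks driver instances.
public interface IDriverFleet
{
    IReadOnlyList<DriverHealthSnapshot> Snapshot();
}

public sealed record DriverHealthSnapshot(string DriverInstanceId, string State, bool UsingStaleConfig);

public static class HealthEndpoints
{
    // Mapped on the secondary binding (http://+:4841) so the OPC UA endpoint on 4840 is untouched.
    public static void Map(WebApplication app)
    {
        // Liveness: the process is up; config comes from the DB or a warm LiteDB cache.
        app.MapGet("/healthz", (IDriverFleet fleet) =>
        {
            var usingStaleConfig = fleet.Snapshot().Any(d => d.UsingStaleConfig);
            return Results.Ok(new { status = "alive", usingStaleConfig });
        });

        // Readiness: 503 if ANY driver instance is Faulted; Degraded still answers 200 (review finding #6).
        app.MapGet("/readyz", (IDriverFleet fleet) =>
        {
            var drivers = fleet.Snapshot();
            var anyFaulted = drivers.Any(d => d.State == "Faulted");
            var body = new { ready = !anyFaulted, drivers };
            return anyFaulted ? Results.Json(body, statusCode: 503) : Results.Ok(body);
        });
    }
}
```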
### Stream D — Config DB LiteDB fallback (1 week)
1. **D.1** `LiteDbConfigCache` adapter. Wraps the `ConfigurationDbContext` queries that are safe to serve stale (cluster membership, generation metadata, driver instance definitions, LDAP role mapping). Write-path queries (draft save, publish) bypass the cache and fail hard on DB outage.
2. **D.2** Cache refresh strategy: refresh on every successful DB read, full refresh after `sp_PublishGeneration` confirmation. Cache entries carry `CachedAtUtc`; served entries older than 24 h trigger a synthetic `Warning` log line so operators see that stale data is in effect.
3. **D.3** Polly pipeline in the `Configuration` project: EF Core query → retry 3× → fallback to cache. On fallback, driver state stays `Healthy` but a `UsingStaleConfig` flag on the cluster's health report flips true. See the sketch after this list.
4. **D.4** Tests: SQL Server failure injected via a Testcontainers-style double; the cache returns last-known values; Admin UI banners reflect `UsingStaleConfig`.
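
A sketch of the D.1/D.3 read path under stated assumptions: the entity and collection names are invented, the DB read is injected as a delegate standing in for the `ConfigurationDbContext` query, and the retry/fallback strategies use Polly v8 (`Polly.Core`). The snapshot-per-generation layout from review finding #8 is omitted here for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using LiteDB;
using Polly;
using Polly.Fallback;
using Polly.Retry;

// Invented entity: a stand-in for whatever definition the Configuration project exposes.
public sealed class DriverInstanceDefinition
{
    public Guid Id { get; set; }
    public string? Name { get; set; }
}

public sealed class LiteDbConfigCache
{
    private readonly string _cachePath;   // e.g. %ProgramData%/OtOpcUa/config-cache/<cluster-id>.db
    private readonly Func<CancellationToken, Task<IReadOnlyList<DriverInstanceDefinition>>> _readFromDb;
    private readonly ResiliencePipeline<IReadOnlyList<DriverInstanceDefinition>> _pipeline;

    public LiteDbConfigCache(
        string cachePath,
        Func<CancellationToken, Task<IReadOnlyList<DriverInstanceDefinition>>> readFromDb)
    {
        _cachePath = cachePath;
        _readFromDb = readFromDb;
        _pipeline = new ResiliencePipelineBuilder<IReadOnlyList<DriverInstanceDefinition>>()
            // First-added = outermost: fallback wraps retry, retry wraps the 2 s per-query timeout.
            .AddFallback(new FallbackStrategyOptions<IReadOnlyList<DriverInstanceDefinition>>
            {
                FallbackAction = _ => Outcome.FromResultAsValueTask(ReadFromCache()),
            })
            .AddRetry(new RetryStrategyOptions<IReadOnlyList<DriverInstanceDefinition>>
            {
                MaxRetryAttempts = 3,
                UseJitter = true,
            })
            .AddTimeout(TimeSpan.FromSeconds(2))
            .Build();
    }

    public async Task<IReadOnlyList<DriverInstanceDefinition>> GetDriverInstancesAsync(CancellationToken ct)
    {
        return await _pipeline.ExecuteAsync(async token =>
        {
            var fresh = await _readFromDb(token);
            RefreshCache(fresh);              // read-through refresh on every successful DB query
            return fresh;
        }, ct);
    }

    private void RefreshCache(IReadOnlyList<DriverInstanceDefinition> fresh)
    {
        using var db = new LiteDatabase(_cachePath);
        var col = db.GetCollection<DriverInstanceDefinition>("driverInstances");
        col.DeleteAll();
        col.InsertBulk(fresh);
    }

    private IReadOnlyList<DriverInstanceDefinition> ReadFromCache()
    {
        using var db = new LiteDatabase(_cachePath);
        return db.GetCollection<DriverInstanceDefinition>("driverInstances").FindAll().ToList();
    }
}
```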
### Stream E — Admin `/hosts` page refresh (3 days)
1. **E.1** Extend the `DriverHostStatus` schema with the Stream A resilience columns. Generate an EF migration. See the sketch after this list.
2. **E.2** `Admin/FleetStatusHub` SignalR hub pushes `LastCircuitBreakerOpenUtc` + `CurrentBulkheadDepth` + `LastRecycleUtc` on change.
3. **E.3** The `/hosts` Blazor page renders the new columns; red badge if `ConsecutiveFailures > breakerThreshold / 2`.
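
A sketch of the counters involved; review finding #9 later moves them onto a separate `DriverInstanceResilienceStatus` table rather than widening `DriverHostStatus`, so the class name and nullability below are assumptions.

```csharp
using System;

// Resilience counters surfaced to the Admin /hosts page (E.1–E.3).
public sealed class DriverInstanceResilienceStatus
{
    public Guid DriverInstanceId { get; set; }
    public DateTime? LastCircuitBreakerOpenUtc { get; set; }
    public int ConsecutiveFailures { get; set; }
    public int CurrentBulkheadDepth { get; set; }
    public DateTime? LastRecycleUtc { get; set; }

    // /hosts shows a red badge once failures pass half the breaker threshold (E.3).
    public bool ShowRedBadge(int breakerThreshold) => ConsecutiveFailures > breakerThreshold / 2;
}
```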
## Compliance Checks (run at exit gate)
- [ ] **Polly coverage**: every `IDriver*` method call in the server dispatch layer routes through `DriverCapabilityInvoker`. Enforce via a Roslyn analyzer added to the `Core.Abstractions` build (error on direct `IDriver.ReadAsync` calls outside the invoker).
- [ ] **Tier registry**: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. A unit test walks the registry + asserts no gaps (see the sketch after this checklist).
- [ ] **Health contract**: `/healthz` + `/readyz` respond within 500 ms even with one driver Faulted.
- [ ] **Structured log**: CI grep on `tests/` output asserts at least one log line contains `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- [ ] **Cache fallback**: Integration test kills the SQL container mid-operation; driver health stays `Healthy`, `UsingStaleConfig` flips true.
- [ ] No regression in existing test suites — `dotnet test ZB.MOM.WW.OtOpcUa.slnx` count equal to or greater than the pre-Phase-6.1 baseline.
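
A sketch of that registry walk as an xUnit test; `DriverTypeRegistry.Default`, `RegisteredTypes`, and a nullable `Tier` are assumed member names, not the real registry API.

```csharp
using Xunit;

public class DriverTierComplianceTests
{
    [Fact]
    public void Every_registered_driver_type_declares_a_tier()
    {
        // Assumed surface: the registry enumerates its registered driver types
        // and exposes the tier each one declared at registration.
        var registry = DriverTypeRegistry.Default;

        Assert.All(registry.RegisteredTypes, t => Assert.NotNull(t.Tier));
    }
}
```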
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Polly pipeline adds per-request latency on the hot path | Medium | Medium | Benchmark in Stream A.5 before merging; 1 % overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; `UsingStaleConfig` flag surfaced on `/readyz`; cache refresh on every successful DB round-trip; 24-hour synthetic warning |
| Tier watchdog falsely kills a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle; thresholds configurable per instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if the driver claims `Healthy`; add circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Release the analyzer at warning level initially; upgrade to error in the phase after 6.1, once one release cycle has passed |
## Completion Checklist
- [ ] Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
- [ ] Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
- [ ] Stream C: `/healthz` + `/readyz` + structured logging + JSON Serilog sink
- [ ] Stream D: LiteDB cache + Polly fallback in Configuration
- [ ] Stream E: Admin `/hosts` page refresh
- [ ] Cross-cutting: `phase-6-1-compliance.ps1` exits 0; full solution `dotnet test` passes; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread `019da489-e317-7aa1-ab1f-6335e0be2447`)
Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.

1. **Crit · ACCEPT** — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via `WriteIdempotent` + CAS). Pipeline now **capability-specific**: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; **Write does not** unless the tag metadata carries `WriteIdempotent=true`. New `WriteIdempotentAttribute` surfaces on `ModbusTagDefinition` / `S7TagDefinition` / etc.
2. **Crit · ACCEPT** — "One pipeline per driver instance" breaks decision #35's per-device isolation. **Change**: pipeline key is `(DriverInstanceId, HostName)`, not just `DriverInstanceId`. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings. See the sketch after this list.
3. **Crit · ACCEPT** — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). **Change**: Stream B splits into two — `MemoryTracking` (all tiers, soft/hard thresholds log + surface to Admin `/hosts`; never kills) and `MemoryRecycle` (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
4. **High · ACCEPT** — Removing Galaxy's hand-rolled `CircuitBreaker` drops decision #68 host-supervision crash-loop protection. **Change**: keep `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard the IPC *process* re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
5. **High · ACCEPT** — A Roslyn analyzer targeting `IDriver` misses the hot paths (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ISubscribable.SubscribeAsync`, etc.). **Change**: the analyzer rule now matches every method on the capability interfaces; the compliance doc enumerates the full call-site list.
6. **High · ACCEPT** — `/healthz` + `/readyz` under-specified for degraded-running. **Change**: add a state-matrix sub-section explicitly covering `Unknown` (pre-init: `/readyz` 503), `Initializing` (503), `Healthy` (200), `Degraded` (200 with a JSON body flagging the degraded driver; `/readyz` is OR across drivers), `Faulted` (503), plus cached-config-serving (`/healthz` returns 200 + `UsingStaleConfig: true` in the JSON body).
7. **High · ACCEPT** — `WedgeDetector` based on "no successful Read" false-fires on write-only subscriptions + idle systems. **Change**: wedge criteria now `(hasPendingWork AND noProgressIn > threshold)` where `hasPendingWork` comes from the Polly bulkhead depth + active MonitoredItem count. An idle driver stays Healthy.
8. **High · ACCEPT** — LiteDB cache serving mixed-generation reads breaks publish atomicity. **Change**: the cache is snapshot-per-generation. Each published generation writes a sealed snapshot into `<cache-root>/<cluster>/<generationId>.db`; reads serve the last-known-sealed generation and never mix. A central-DB outage during a *publish* means that publish fails (the write path doesn't use the cache); reads continue from the prior sealed snapshot.
9. **Med · ACCEPT** — `DriverHostStatus` schema conflates per-host connectivity with per-driver-instance resilience counters. **Change**: new `DriverInstanceResilienceStatus` table separate from `DriverHostStatus`. Admin `/hosts` joins both for display.
10. **Med · ACCEPT** — Compliance says analyzer-error; risks say analyzer-warning. **Change**: Phase 6.1 ships at **error** level (this phase is the gate); the warning-mode option is removed.
11. **Med · ACCEPT** — Hardcoded per-tier MB bands ignore decision #70's `max(multiplier × baseline, baseline + floor)` formula with observed-baseline capture. **Change**: the watchdog captures a baseline at the post-init plateau (median of the first 5 min of GetMemoryFootprint samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
12. **Med · ACCEPT** — Tests mostly cover the happy path. **Change**: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historic-backfill; Stream D.4 adds a mixed-generation cache test + a corrupt-first-boot cache test.
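
A sketch combining findings #1 and #2, assuming Polly's `ResiliencePipelineRegistry<TKey>`; the key record, the capability-string convention, and the idea of dispatching idempotent writes under a separate capability name are illustrative choices rather than settled design.

```csharp
using System;
using Polly;
using Polly.Registry;
using Polly.Retry;

// Finding #2: one pipeline (and therefore one breaker) per (driver instance, target host),
// so a dead PLC behind a multi-device Modbus driver cannot trip healthy siblings.
// The Capability component keeps Write pipelines separate from Read pipelines.
public readonly record struct DevicePipelineKey(Guid DriverInstanceId, string HostName, string Capability);

public static class DevicePipelines
{
    private static readonly ResiliencePipelineRegistry<DevicePipelineKey> Registry = new();

    public static ResiliencePipeline For(DevicePipelineKey key)
    {
        return Registry.GetOrAddPipeline(key, builder =>
        {
            // Finding #1: Read/HistoryRead/Discover/Probe/Alarm-subscribe retry; plain Write never
            // retries (decisions #44/#45). Idempotent writes could be dispatched under a separate
            // "WriteIdempotent" capability so they pick up the retrying pipeline.
            if (key.Capability is not "Write")
            {
                builder.AddRetry(new RetryStrategyOptions { MaxRetryAttempts = 3, UseJitter = true });
            }

            builder.AddTimeout(TimeSpan.FromSeconds(2));
        });
    }
}
```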