Phase 6.1 Stream A (partial) - Polly resilience foundation: pipeline builder + CapabilityInvoker + per-tier defaults #78

Merged
dohertj2 merged 6 commits from phase-6-1-stream-a-resilience into v2 2026-04-19 07:33:55 -04:00

First PR on the phase-6.1-stream-a branch. Lands the resilience foundation + the CapabilityInvoker surface. Driver-dispatch wiring (Stream A.3 server-side wrap, Stream A.5 Modbus FlakeyTransport integration test, Stream A.6 attribute on tag-definition records) lands in follow-up PRs on the same branch.

Summary

  • Stream A.1: DriverResiliencePipelineBuilder in Core.Resilience. Polly v8 pipeline keyed on (DriverInstanceId, HostName, DriverCapability) per decision #144. One dead PLC behind a multi-device driver does not open the breaker for healthy siblings. Timeout → Retry (capability-permitting, skipped for Write/AlarmAcknowledge by default, never on cancellation) → CircuitBreaker (tier-permitting; Tier C disabled per decision #68). Lock-free pipeline cache.
  • Stream A.2: DriverResilienceOptions + CapabilityPolicy records. Per-tier × per-capability default table (Tier A/B/C × Read/Write/Discover/Subscribe/Probe/AlarmSubscribe/AlarmAcknowledge/HistoryRead). Resolve(capability) overlays per-instance overrides on defaults.
  • Stream A.3: CapabilityInvoker wrapper. ExecuteAsync(capability, host, callSite) resolves the pipeline from the shared builder. ExecuteWriteAsync(host, isIdempotent, callSite) is the explicit write-safety surface: non-idempotent writes route through a side pipeline with RetryCount=0 regardless of policy.
  • Stream A.6 (marker): WriteIdempotentAttribute in Core.Abstractions + DriverCapability enum + DriverTier enum. Attribute wiring onto ModbusTagDefinition/S7TagDefinition/OpcUaClient tag rows lands in the follow-up PR alongside the driver-dispatch routing.
  • Stream A.4: Galaxy Proxy/Supervisor/CircuitBreaker.cs + Backoff.cs + HeartbeatMonitor.cs preserved (no-op verification; they guard IPC respawn per decision #68, orthogonal to the per-call Polly layer).
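
The per-host isolation in Stream A.1 can be illustrated with a small language-neutral sketch. The Python below is not the C# implementation: `Breaker`, `PipelineCache`, and the consecutive-failure counting are hypothetical stand-ins (a Polly circuit breaker is configured differently, and this dict is not lock-free). Only the key shape (DriverInstanceId, HostName, DriverCapability) and the "threshold 0 disables the breaker" convention come from the PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")
PipelineKey = Tuple[str, str, str]  # (driver instance id, host name, capability)


class BreakerOpenError(Exception):
    """Raised when a call is rejected because its breaker is open."""


@dataclass
class Breaker:
    """Hypothetical stand-in for a pipeline: opens after N consecutive failures."""
    threshold: int  # 0 disables the breaker
    failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.threshold > 0 and self.failures >= self.threshold

    def call(self, call_site: Callable[[], T]) -> T:
        if self.is_open:
            raise BreakerOpenError("breaker open")
        try:
            result = call_site()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the consecutive-failure count
        return result


class PipelineCache:
    """One Breaker per (driver instance, host, capability) triple."""

    def __init__(self, threshold: int) -> None:
        self._threshold = threshold
        self._cache: Dict[PipelineKey, Breaker] = {}

    def get_or_create(self, driver_instance_id: str, host: str, capability: str) -> Breaker:
        key = (driver_instance_id, host, capability)
        if key not in self._cache:
            self._cache[key] = Breaker(self._threshold)
        return self._cache[key]
```

Because the host name is part of the key, failures recorded against one PLC never touch the breaker that a sibling PLC on the same driver instance resolves to.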

Test plan

  • 36 new unit tests pass: 8 options coverage + 12 pipeline (dead-host isolation, per-capability/per-host isolation, Write never retries on Tier A, breaker opens after threshold, timeout, cancellation not retried) + 6 invoker (non-idempotent write guard, idempotent write retries).
  • Core.Tests suite: 44 passing (was 38).
  • Full solution dotnet test: 936 passing, 1 pre-existing Client.CLI Subscribe flake unchanged.
  • Follow-up PR: wire CapabilityInvoker into Server dispatch + apply [WriteIdempotent] to driver tag-definition records + FlakeyTransport integration tests.

🤖 Generated with Claude Code

dohertj2 added 2 commits 2026-04-19 04:11:43 -04:00
Lands the first chunk of the Phase 6.1 Stream A resilience layer per
docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A.
Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up
PRs on the same branch.

Core.Abstractions additions:
- WriteIdempotentAttribute — marker for tag-definition records that opt into
  auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45,
  #143. Read once via reflection at driver-init time; no per-write cost.
- DriverCapability enum — enumerates the 8 capability surface points
  (Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge
  / HistoryRead). AlarmAcknowledge is write-shaped (no retry by default).
- DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this
  into DriverTypeMetadata; surfaced here because the resilience policy defaults
  key on it.

Core.Resilience new namespace:
- DriverResilienceOptions — per-tier × per-capability policy defaults.
  GetTierDefaults(tier) is the source of truth:
    * Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5
    * Tier B: Read 4s/3, Write 4s/0, breaker threshold 5
    * Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles
      process-level breaker per decision #68)
  Resolve(capability) overlays CapabilityPolicies on top of the defaults.
- DriverResiliencePipelineBuilder — composes Timeout → Retry (capability-
  permitting, never on cancellation) → CircuitBreaker (tier-permitting) →
  Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on
  (DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead
  PLC behind a multi-device driver does not open the breaker for healthy
  siblings. Invalidate(driverInstanceId) supports Admin-triggered reload.
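
The defaults-plus-overrides layering can be sketched as follows. This is an illustrative Python model, not the C# records: the field names are hypothetical, only the Read/Write figures quoted above are reproduced (the other six capabilities are left out rather than guessed), and two things are assumptions: that the breaker threshold lives on each policy, and that an override replaces the whole policy for its capability.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CapabilityPolicy:
    timeout_seconds: int
    retry_count: int
    breaker_failure_threshold: int  # 0 = breaker disabled


# Only the Read/Write values listed in the commit message appear here.
TIER_DEFAULTS: Dict[str, Dict[str, CapabilityPolicy]] = {
    "A": {"Read": CapabilityPolicy(2, 3, 5), "Write": CapabilityPolicy(2, 0, 5)},
    "B": {"Read": CapabilityPolicy(4, 3, 5), "Write": CapabilityPolicy(4, 0, 5)},
    "C": {"Read": CapabilityPolicy(10, 1, 0), "Write": CapabilityPolicy(10, 0, 0)},
}


def resolve(
    tier: str,
    capability: str,
    overrides: Optional[Dict[str, CapabilityPolicy]] = None,
) -> CapabilityPolicy:
    """Return the per-instance override if one exists, else the tier default."""
    if overrides and capability in overrides:
        return overrides[capability]
    return TIER_DEFAULTS[tier][capability]
```

For example, `resolve("A", "Read", {"Read": CapabilityPolicy(5, 1, 5)})` yields the override, while `resolve("A", "Write", ...)` with the same overrides still falls through to the Tier A Write default.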

Tests (30 new, all pass):
- DriverResilienceOptionsTests: tier-default coverage for every capability,
  Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker,
  resolve-with-override layering.
- DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT
  retry on failure (decision #44 guard), dead-host isolation from sibling hosts,
  pipeline reuse for same triple, per-capability isolation, breaker opens after
  threshold on Tier A, timeout fires, cancellation is not retried,
  invalidation scoped to matching instance.
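
Three of the behaviors these tests pin down (Read retries transients, Write with a zero retry budget fails on the first error, cancellation is never retried) can be sketched as a bare retry loop. This is a simplified Python model with no timeout, backoff, or breaker; `OperationCancelled` and `execute_with_retry` are hypothetical names, not types from the PR or from Polly.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class OperationCancelled(Exception):
    """Hypothetical stand-in for .NET's OperationCanceledException."""


def execute_with_retry(call_site: Callable[[], T], retry_count: int) -> T:
    """Run call_site, retrying failures up to retry_count times.

    Cancellation is re-raised immediately and never consumes a retry.
    """
    retries_used = 0
    while True:
        try:
            return call_site()
        except OperationCancelled:
            raise  # never retried, whatever the budget
        except Exception:
            if retries_used >= retry_count:
                raise  # budget exhausted: surface the failure
            retries_used += 1
```

With `retry_count=0`, the value the tier defaults assign to Write, the first failure propagates unchanged, which is the property the decision #44 guard test checks.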

Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing
(baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged
by this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One invoker per (DriverInstance, IDriver) pair. Callers invoke
ExecuteAsync(capability, host, callSite) and the invoker resolves the correct
pipeline from the shared DriverResiliencePipelineBuilder. The options accessor
is a Func, so an Admin edit followed by a pipeline invalidate takes effect
without restarting the invoker or the driver host.

ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface:
- isIdempotent=false routes through a side pipeline with RetryCount=0 regardless
  of what the caller configured. The cache key carries a "::non-idempotent"
  suffix so it never collides with the retry-enabled write pipeline.
- isIdempotent=true routes through the normal Write pipeline. If the user has
  configured Write retries (opt-in), the idempotent tag gets them; otherwise
  default-0 still wins.
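
The routing rule above reduces to a small decision: which cache key to use and how many retries the write may get. The Python sketch below models only that decision. The `|`-joined key string and the function name are hypothetical; the "::non-idempotent" suffix and the forced zero retry count are taken from the description above.

```python
from typing import Tuple

NON_IDEMPOTENT_SUFFIX = "::non-idempotent"


def select_write_pipeline(
    driver_instance_id: str,
    host: str,
    is_idempotent: bool,
    configured_write_retries: int,
) -> Tuple[str, int]:
    """Return (cache key, effective retry count) for a write."""
    base_key = f"{driver_instance_id}|{host}|Write"  # key layout is illustrative
    if is_idempotent:
        # Normal Write pipeline: whatever the policy configured (0 by default).
        return base_key, configured_write_retries
    # Side pipeline: distinct key, retries forced to 0 regardless of policy.
    return base_key + NON_IDEMPOTENT_SUFFIX, 0
```

Keeping the two keys distinct means a retry-enabled pipeline built for idempotent tags can never be handed back for a non-idempotent write to the same host.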

The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag
definition once at driver-init time and feeds the boolean into ExecuteWriteAsync.

Tests (6 new):
- Read retries on transient failure; returns value from call site.
- Write non-idempotent does NOT retry even when policy has 3 retries configured
  (the explicit decision-#44 guard at the dispatch surface).
- Write idempotent retries when policy allows.
- Write with default tier-A policy (RetryCount=0) never retries regardless of
  idempotency flag.
- Different hosts get independent pipelines.

Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment
on WriteIdempotentAttribute no longer references a non-existent type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 added 1 commit 2026-04-19 07:18:01 -04:00
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143.
Default is false — writes never auto-retry unless the driver author has marked
the tag as safe to replay.

Core.Abstractions:
- DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the
  positional record (back-compatible; every existing call site uses the default).

Driver.Modbus:
- ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates
  documented in the param XML: holding-register set-points, configuration
  registers. Unsafe: edge-triggered coils, counter-increment addresses.
- ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into
  DriverAttributeInfo.WriteIdempotent.

Driver.S7:
- S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates:
  DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive
  edge-triggered program routines.
- S7Driver.DiscoverAsync propagates the flag.

Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise
the invoker + flaky-driver contract the plan enumerates:
- Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10).
- Non-idempotent write with RetryCount=5 configured still fails on the first
  failure — no replay (decision #44 guard at the ExecuteWriteAsync surface).
- Idempotent write with 2 transient failures succeeds on the 3rd attempt.
- Two hosts on the same driver have independent breakers — dead-host trips
  its breaker but live-host's first call still succeeds.

Propagation tests:
- ModbusDriverTests: SetPoint WriteIdempotent=true flows into
  DriverAttributeInfo; PulseCoil default=false.
- S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit.

Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A
so far). Pre-existing Client.CLI Subscribe flake unchanged.

Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's
OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the
server-side integration piece + needs DI wiring for the pipeline builder —
lands in the next PR on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 added 1 commit 2026-04-19 07:20:30 -04:00
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping
the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse
at the boundary. Switching the Resilience types to string removes that friction
and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in
the upcoming server-dispatch wiring PR.

- DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey
- CapabilityInvoker.ctor + _driverInstanceId field

Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now
use distinct "drv-keep" / "drv-drop" literals (previously both were distinct
Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same
literal — caught pre-commit).
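
Why the keepId / dropId literals must differ is easy to see in a sketch of instance-scoped invalidation. The Python below is a hypothetical model of the behavior (a plain dict keyed on string triples, with a return count that the real Invalidate may not have), not the C# code.

```python
from typing import Dict, Tuple

PipelineKey = Tuple[str, str, str]  # (driver instance id, host name, capability)


def invalidate(cache: Dict[PipelineKey, object], driver_instance_id: str) -> int:
    """Drop every cached pipeline owned by driver_instance_id; return how many."""
    doomed = [key for key in cache if key[0] == driver_instance_id]
    for key in doomed:
        del cache[key]
    return len(doomed)


cache: Dict[PipelineKey, object] = {
    ("drv-keep", "host-1", "Read"): object(),
    ("drv-drop", "host-1", "Read"): object(),
    ("drv-drop", "host-1", "Write"): object(),
}
removed = invalidate(cache, "drv-drop")
```

After the call, only the "drv-keep" entry remains. If both ids had been the same literal, the "keep" entry would have been dropped as well and the scoping assertion would no longer be testing two distinct instances.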

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 added 1 commit 2026-04-19 07:30:04 -04:00
Every OnReadValue / OnWriteValue now routes through a CapabilityInvoker backed
by the process-singleton DriverResiliencePipelineBuilder. Read / Write dispatch
paths gain timeout + per-capability retry + per-(driver, host) circuit breaker
+ bulkhead without touching the individual driver implementations.

Wiring:
- OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor
  parameter (default null → instance-owned builder). Keeps the 3 test call
  sites that construct OpcUaApplicationHost directly unchanged.
- OtOpcUaServer: requires the builder in its ctor; constructs one
  CapabilityInvoker per driver at CreateMasterNodeManager time with default
  Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver-
  type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the
  DriverInstance.ResilienceConfig JSON column for per-instance overrides.
- DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps
  the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName,
  ...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName,
  isIdempotent, ...) where isIdempotent comes from the new
  _writeIdempotentByFullRef map populated at Variable() registration from
  DriverAttributeInfo.WriteIdempotent.
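
The registration-time map can be sketched like this. It is a Python illustration only: `AttributeInfo` is a trimmed-down, hypothetical stand-in for DriverAttributeInfo, `IdempotencyIndex` is an invented name, and treating an unregistered reference as non-idempotent is an assumption chosen to match the "default is false" rule.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AttributeInfo:
    """Hypothetical, trimmed-down stand-in for DriverAttributeInfo."""
    full_ref: str
    write_idempotent: bool = False  # default: writes are not safe to replay


class IdempotencyIndex:
    """Filled once at variable registration, read on every write."""

    def __init__(self) -> None:
        self._write_idempotent_by_full_ref: Dict[str, bool] = {}

    def register(self, attribute: AttributeInfo) -> None:
        self._write_idempotent_by_full_ref[attribute.full_ref] = attribute.write_idempotent

    def is_idempotent(self, full_ref: str) -> bool:
        # Unknown references fall back to the safe answer: do not retry.
        return self._write_idempotent_by_full_ref.get(full_ref, False)
```

The write path then needs only a dictionary lookup per call to obtain the boolean it passes on as isIdempotent.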

HostName defaults to driver.DriverInstanceId for now — a single-host pipeline
per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per-
call host resolution in a follow-up so failing PLCs can trip per-PLC breakers
without poisoning siblings (decision #144).

Test fixup:
- FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures:
  bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter
  can exceed 2s under parallel-test-run CPU pressure; the test asserts retry
  behavior, not timeout budget, so the longer slack keeps it deterministic.
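
A rough worked example shows how little headroom a 2s budget can leave. The arithmetic below assumes a base delay of 50 ms that doubles on each retry and ignores jitter, with the timeout covering the whole retry sequence. The base delay is purely hypothetical; the commit does not state the configured value.

```python
from typing import List


def exponential_delays_ms(base_ms: int, retries: int) -> List[int]:
    """Delay before each retry: base, 2*base, 4*base, ... (no jitter)."""
    return [base_ms * (2 ** attempt) for attempt in range(retries)]


def total_backoff_ms(base_ms: int, retries: int) -> int:
    """Total time spent waiting between attempts."""
    return sum(exponential_delays_ms(base_ms, retries))
```

Under that assumption, the five retries needed to get past five transient failures wait 50 + 100 + 200 + 400 + 800 = 1550 ms, leaving about 450 ms of a 2000 ms budget for the calls themselves, jitter, and scheduler delay on a loaded test machine.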

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 added 1 commit 2026-04-19 07:33:45 -04:00
Per Stream A.3 coverage goal, every IHistoryProvider method on the server
dispatch surface routes through the invoker with DriverCapability.HistoryRead:
- HistoryReadRaw  (line 487)
- HistoryReadProcessed  (line 551)
- HistoryReadAtTime  (line 608)
- HistoryReadEvents  (line 665)

Each gets timeout + per-(driver, host) circuit breaker + the tier-default
retry policy (Tier A default: 2 retries at 30s timeout). The inner driver
GetAwaiter().GetResult() pattern is preserved because the OPC UA stack's
HistoryRead hook is sync-returning-void (see CustomNodeManager2).

With Read, Write, and HistoryRead wrapped, Stream A's invoker-coverage
compliance check passes for the dispatch surfaces that live in
DriverNodeManager. Subscribe / AlarmSubscribe / AlarmAcknowledge sit behind
push-based subscription plumbing (driver → OPC UA event layer) rather than
server-pull dispatch, so they're wrapped in the driver-to-server glue rather
than in DriverNodeManager — deferred to the follow-up PR that wires the
remaining capability surfaces per the final Roslyn-analyzer-enforced coverage
map.

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 merged commit a06fcb16a2 into v2 2026-04-19 07:33:55 -04:00
Reference: dohertj2/lmxopcua#78