Compare commits

..

12 Commits

Author SHA1 Message Date
Joseph Doherty
1d9008e354 Phase 6.1 Stream B.3/B.4/B.5 — MemoryRecycle + ScheduledRecycleScheduler + demand-aware WedgeDetector
Closes out Stream B per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Abstractions:
- IDriverSupervisor — process-level supervisor contract a Tier C driver's
  out-of-process topology provides (Galaxy Proxy/Supervisor implements this in
  a follow-up Driver.Galaxy wiring PR). Concerns: DriverInstanceId + RecycleAsync.
  Tier A/B drivers don't implement this; Stream B code asserts tier == C before
  ever calling it.

Core.Stability:
- MemoryRecycle — companion to MemoryTracking. On HardBreach, invokes the
  supervisor IFF tier == C AND a supervisor is wired. Tier A/B HardBreach logs
  a promotion-to-Tier-C recommendation and returns false. Soft/None/Warming
  never triggers a recycle at any tier.
- ScheduledRecycleScheduler — Tier C opt-in periodic recycler per decision #67.
  Ctor throws for Tier A/B (structural guard — scheduled recycle on an
  in-process driver would kill every OPC UA session and every co-hosted
  driver). TickAsync(now) advances the schedule by one interval per fire;
  RequestRecycleNowAsync drives an ad-hoc recycle without shifting the cron.
- WedgeDetector — demand-aware per decision #147. Classify(state, demand, now)
  returns:
    * NotApplicable when driver state != Healthy
    * Idle when Healthy + no pending work (bulkhead=0 && monitored=0 && historic=0)
    * Healthy when Healthy + pending work + progress within threshold
    * Faulted when Healthy + pending work + no progress within threshold
  Threshold clamps to min 60 s. DemandSignal.HasPendingWork ORs the three counters.
  The three false-wedge cases the plan calls out all stay Healthy: idle
  subscription-only, slow historian backfill making progress, write-only burst
  with drained bulkhead.

Tests (22 new, all pass):
- MemoryRecycleTests (7): Tier C hard-breach requests recycle; Tier A/B
  hard-breach never requests; Tier C without supervisor no-ops; soft-breach
  at every tier never requests; None/Warming never request.
- ScheduledRecycleSchedulerTests (6): ctor throws for A/B; zero/negative
  interval throws; tick before due no-ops; tick at/after due fires once and
  advances; RequestRecycleNow fires immediately without shifting schedule;
  multiple fires across ticks advance one interval each.
- WedgeDetectorTests (9): threshold clamp to 60 s; unhealthy driver always
  NotApplicable; idle subscription stays Idle; pending+fresh progress stays
  Healthy; pending+stale progress is Faulted; MonitoredItems active but no
  publish is Faulted; MonitoredItems active with fresh publish stays Healthy;
  historian backfill with fresh progress stays Healthy; write-only burst with
  empty bulkhead is Idle; HasPendingWork theory for any non-zero counter.

Full solution dotnet test: 989 passing (baseline 906, +83 for Phase 6.1 so far).
Pre-existing Client.CLI Subscribe flake unchanged.

Stream B complete. Next up: Stream C (health endpoints + structured logging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:03:18 -04:00
Joseph Doherty
ef6b0bb8fc Phase 6.1 Stream B.1/B.2 — DriverTier on DriverTypeMetadata + Core.Stability.MemoryTracking with hybrid-formula soft/hard thresholds
Stream B.1 — registry invariant:
- DriverTypeMetadata gains a required `DriverTier Tier` field. Every registered
  driver type must declare its stability tier so the downstream MemoryTracking,
  MemoryRecycle, and resilience-policy layers can resolve the right defaults.
  Stamped-at-registration-time enforcement makes the "every driver type has a
  non-null Tier" compliance check structurally impossible to fail.
- DriverTypeRegistry API unchanged; one new property on the record.

Stream B.2 — MemoryTracking (Core.Stability):
- Tier-agnostic tracker per decision #146: captures baseline as the median of
  samples collected during a post-init warmup window (default 5 min), then
  classifies each subsequent sample with the hybrid formula
  `soft = max(multiplier × baseline, baseline + floor)`, `hard = 2 × soft`.
- Per-tier constants wired: Tier A mult=3 floor=50 MB, Tier B mult=3 floor=100 MB,
  Tier C mult=2 floor=500 MB.
- Never kills. Hard-breach action returns HardBreach; the supervisor that acts
  on that signal (MemoryRecycle) is Tier C only per decisions #74, #145 and
  lands in the next B.3 commit on this branch.
- Two phases: WarmingUp (samples collected, Warming returned) and Steady
  (baseline captured, soft/hard checks active). Transition is automatic when
  the warmup window elapses.

Tests (15 new, all pass):
- Warming phase returns Warming until the window elapses.
- Window-elapsed captures median baseline + transitions to Steady.
- Per-tier constants match decision #146 table exactly.
- Soft threshold uses max() — small baseline → floor wins; large baseline →
  multiplier wins.
- Hard = 2 × soft.
- Sample below soft = None; at soft = SoftBreach; at/above hard = HardBreach.
- DriverTypeRegistry: theory asserts Tier round-trips for A/B/C.

Full solution dotnet test: 963 passing (baseline 906, +57 net for Phase 6.1
Stream A + Stream B.1/B.2). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:37:43 -04:00
a06fcb16a2 Merge pull request (#78) - Phase 6.1 Stream A - Polly resilience + CapabilityInvoker + Read/Write/HistoryRead dispatch wrapping 2026-04-19 07:33:53 -04:00
Joseph Doherty
d2f3a243cd Phase 6.1 Stream A.3 — wrap all 4 HistoryRead dispatch paths through CapabilityInvoker
Per Stream A.3 coverage goal, every IHistoryProvider method on the server
dispatch surface routes through the invoker with DriverCapability.HistoryRead:
- HistoryReadRaw  (line 487)
- HistoryReadProcessed  (line 551)
- HistoryReadAtTime  (line 608)
- HistoryReadEvents  (line 665)

Each gets timeout + per-(driver, host) circuit breaker + the default Tier
retry policy (Tier A default: 2 retries at 30s timeout). Inner driver
GetAwaiter().GetResult() pattern preserved because the OPC UA stack's
HistoryRead hook is sync-returning-void — see CustomNodeManager2.

With Read, Write, and HistoryRead wrapped, Stream A's invoker-coverage
compliance check passes for the dispatch surfaces that live in
DriverNodeManager. Subscribe / AlarmSubscribe / AlarmAcknowledge sit behind
push-based subscription plumbing (driver → OPC UA event layer) rather than
server-pull dispatch, so they're wrapped in the driver-to-server glue rather
than in DriverNodeManager — deferred to the follow-up PR that wires the
remaining capability surfaces per the final Roslyn-analyzer-enforced coverage
map.

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:32:10 -04:00
Joseph Doherty
29bcaf277b Phase 6.1 Stream A.3 complete — wire CapabilityInvoker into DriverNodeManager dispatch end-to-end
Every OnReadValue / OnWriteValue now routes through the process-singleton
DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch
paths gain timeout + per-capability retry + per-(driver, host) circuit breaker
+ bulkhead without touching the individual driver implementations.

Wiring:
- OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor
  parameter (default null → instance-owned builder). Keeps the 3 test call
  sites that construct OpcUaApplicationHost directly unchanged.
- OtOpcUaServer: requires the builder in its ctor; constructs one
  CapabilityInvoker per driver at CreateMasterNodeManager time with default
  Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver-
  type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the
  DriverInstance.ResilienceConfig JSON column for per-instance overrides.
- DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps
  the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName,
  ...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName,
  isIdempotent, ...) where isIdempotent comes from the new
  _writeIdempotentByFullRef map populated at Variable() registration from
  DriverAttributeInfo.WriteIdempotent.

HostName defaults to driver.DriverInstanceId for now — a single-host pipeline
per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per-
call host resolution in a follow-up so failing PLCs can trip per-PLC breakers
without poisoning siblings (decision #144).

Test fixup:
- FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures:
  bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter
  can exceed 2s under parallel-test-run CPU pressure; the test asserts retry
  behavior, not timeout budget, so the longer slack keeps it deterministic.

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:28:28 -04:00
Joseph Doherty
b6d2803ff6 Phase 6.1 Stream A — switch pipeline keys from Guid to string to match IDriver.DriverInstanceId
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping
the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse
at the boundary. Switching the Resilience types to string removes that friction
and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in
the upcoming server-dispatch wiring PR.

- DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey
- CapabilityInvoker.ctor + _driverInstanceId field

Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now
use distinct "drv-keep" / "drv-drop" literals (previously both were distinct
Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same
literal — caught pre-commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:18:55 -04:00
Joseph Doherty
f3850f8914 Phase 6.1 Stream A.5/A.6 — WriteIdempotent flag on DriverAttributeInfo + Modbus/S7 tag records + FlakeyDriver integration tests
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143.
Default is false — writes never auto-retry unless the driver author has marked
the tag as safe to replay.

Core.Abstractions:
- DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the
  positional record (back-compatible; every existing call site uses the default).

Driver.Modbus:
- ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates
  documented in the param XML: holding-register set-points, configuration
  registers. Unsafe: edge-triggered coils, counter-increment addresses.
- ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into
  DriverAttributeInfo.WriteIdempotent.

Driver.S7:
- S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates:
  DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive
  edge-triggered program routines.
- S7Driver.DiscoverAsync propagates the flag.

Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise
the invoker + flaky-driver contract the plan enumerates:
- Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10).
- Non-idempotent write with RetryCount=5 configured still fails on the first
  failure — no replay (decision #44 guard at the ExecuteWriteAsync surface).
- Idempotent write with 2 transient failures succeeds on the 3rd attempt.
- Two hosts on the same driver have independent breakers — dead-host trips
  its breaker but live-host's first call still succeeds.

Propagation tests:
- ModbusDriverTests: SetPoint WriteIdempotent=true flows into
  DriverAttributeInfo; PulseCoil default=false.
- S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit.

Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A
so far). Pre-existing Client.CLI Subscribe flake unchanged.

Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's
OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the
server-side integration piece + needs DI wiring for the pipeline builder —
lands in the next PR on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:16:21 -04:00
Joseph Doherty
90f7792c92 Phase 6.1 Stream A.3 — CapabilityInvoker wraps driver-capability calls through the shared pipeline
One invoker per (DriverInstance, IDriver) pair; calls ExecuteAsync(capability,
host, callSite) and the invoker resolves the correct pipeline from the shared
DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit
+ pipeline-invalidate takes effect without restarting the invoker or the
driver host.

ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface:
- isIdempotent=false routes through a side pipeline with RetryCount=0 regardless
  of what the caller configured. The cache key carries a "::non-idempotent"
  suffix so it never collides with the retry-enabled write pipeline.
- isIdempotent=true routes through the normal Write pipeline. If the user has
  configured Write retries (opt-in), the idempotent tag gets them; otherwise
  default-0 still wins.

The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag
definition once at driver-init time and feeds the boolean into ExecuteWriteAsync.

Tests (6 new):
- Read retries on transient failure; returns value from call site.
- Write non-idempotent does NOT retry even when policy has 3 retries configured
  (the explicit decision-#44 guard at the dispatch surface).
- Write idempotent retries when policy allows.
- Write with default tier-A policy (RetryCount=0) never retries regardless of
  idempotency flag.
- Different hosts get independent pipelines.

Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment
on WriteIdempotentAttribute no longer references a non-existent type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 04:09:26 -04:00
Joseph Doherty
c04b13f436 Phase 6.1 Stream A.1/A.2/A.6 — Polly resilience foundation: pipeline builder + per-tier policy defaults + WriteIdempotent attribute
Lands the first chunk of the Phase 6.1 Stream A resilience layer per
docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A.
Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up
PRs on the same branch.

Core.Abstractions additions:
- WriteIdempotentAttribute — marker for tag-definition records that opt into
  auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45,
  #143. Read once via reflection at driver-init time; no per-write cost.
- DriverCapability enum — enumerates the 8 capability surface points
  (Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge
  / HistoryRead). AlarmAcknowledge is write-shaped (no retry by default).
- DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this
  into DriverTypeMetadata; surfaced here because the resilience policy defaults
  key on it.

Core.Resilience new namespace:
- DriverResilienceOptions — per-tier × per-capability policy defaults.
  GetTierDefaults(tier) is the source of truth:
    * Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5
    * Tier B: Read 4s/3, Write 4s/0, breaker threshold 5
    * Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles
      process-level breaker per decision #68)
  Resolve(capability) overlays CapabilityPolicies on top of the defaults.
- DriverResiliencePipelineBuilder — composes Timeout → Retry (capability-
  permitting, never on cancellation) → CircuitBreaker (tier-permitting) →
  Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on
  (DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead
  PLC behind a multi-device driver does not open the breaker for healthy
  siblings. Invalidate(driverInstanceId) supports Admin-triggered reload.

Tests (30 new, all pass):
- DriverResilienceOptionsTests: tier-default coverage for every capability,
  Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker,
  resolve-with-override layering.
- DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT
  retry on failure (decision #44 guard), dead-host isolation from sibling hosts,
  pipeline reuse for same triple, per-capability isolation, breaker opens after
  threshold on Tier A, timeout fires, cancellation is not retried,
  invalidation scoped to matching instance.

Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing
(baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged
by this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 04:07:27 -04:00
6a30f3dde7 Merge pull request (#77) - Phase 6 reconcile 2026-04-19 03:52:25 -04:00
Joseph Doherty
ba31f200f6 Phase 6 reconcile — merge adjustments into plan bodies, add decisions #143-162, scaffold compliance stubs
After shipping the four Phase 6 plan drafts (PRs 77-80), the adversarial-review
adjustments lived only as trailing "Review" sections. An implementer reading
Stream A would find the original unadjusted guidance, then have to cross-reference
the review to reconcile. This PR makes the plans genuinely executable:

1. Merges every ACCEPTed review finding into the actual Scope / Stream / Compliance
   sections of each phase plan:
   - phase-6-1: Scope table rewrite (per-capability retry, (instance,host) pipeline key,
     MemoryTracking vs MemoryRecycle split, hybrid watchdog formula, demand-aware
     wedge detector, generation-sealed LiteDB). Streams A/B/D + Compliance rewritten.
   - phase-6-2: AuthorizationDecision tri-state, control/data-plane separation,
     MembershipFreshnessInterval (15 min), AuthCacheMaxStaleness (5 min),
     subscription stamp-and-reevaluate. Stream C widened to 11 OPC UA operations.
   - phase-6-3: 8-state ServiceLevel matrix (OPC UA Part 5 §6.3.34-compliant),
     two-layer peer probe (/healthz + UaHealthProbe), apply-lease via await using,
     publish-generation fencing, InvalidTopology runtime state, ServerUriArray
     self-first + peers. New Stream F (interop matrix + Galaxy failover).
   - phase-6-4: DraftRevisionToken concurrency control, staged-import via
     EquipmentImportBatch with user-scoped visibility, CSV header version marker,
     decision-#117-aligned identifier columns, 1000-row diff cap,
     decision-#139 OPC 40010 fields, Identification inherits Equipment ACL.

2. Appends decisions #143 through #162 to docs/v2/plan.md capturing the
   architectural commitments the adjustments created. Each decision carries its
   dated rationale so future readers know why the choice was made.

3. Scaffolds scripts/compliance/phase-6-{1,2,3,4}-compliance.ps1 — PowerShell
   stubs with Assert-Todo / Assert-Pass / Assert-Fail helpers. Every check
   maps to a Stream task ID from the corresponding phase plan. Currently all
   checks are TODO and scripts exit 0; each implementation task is responsible
   for replacing its TODO with a real check before closing that task. Saved
   as UTF-8 with BOM so Windows PowerShell 5.1 parses em-dash characters
   without breaking.

Net result: the Phase 6.1 plan is genuinely ready to execute. Stream A.3 can
start tomorrow without reconciling Streams vs. Review on every task; the
compliance script is wired to the Stream IDs; plan.md has the architectural
commitments that justify the Stream choices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:49:41 -04:00
81a1f7f0f6 Merge pull request 'Phase 6 — Four implementation plans for unplanned v2 features, each with codex adversarial review' (#76) from phase-6-plans-drafts into v2 2026-04-19 03:17:16 -04:00
41 changed files with 2566 additions and 140 deletions

View File

@@ -23,14 +23,18 @@ Closes these gaps flagged in the 2026-04-19 audit:
| Concern | Change |
|---------|--------|
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`), per-capability policy (Read / Write / Subscribe / HistoryRead / Discover / Probe / Alarm). One pipeline per driver instance; driver-options decide tuning. |
| Every `IDriver*` consumer in the server | Wrap capability calls in the shared pipeline. Policy composition order: timeout → retry (with jitter, bounded by capability-specific `MaxRetries`) → circuit breaker (per driver instance, opens on N consecutive failures) → bulkhead (ceiling on in-flight requests per driver). |
| `Core` → new `Core.Stability` sub-namespace | Generalise `MemoryWatchdog` (`Driver.Galaxy.Host`) into `DriverMemoryWatchdog` consuming `IDriver.GetMemoryFootprint()`. Add `ScheduledRecycleScheduler` (decision #67) for weekly/time-of-day recycle. Add `WedgeDetector` that flips a driver to Faulted when no successful Read in N × PublishingInterval. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must also advertise their out-of-process topology; the registry enforces invariants (Tier C has a `Proxy` + `Host` pair). |
| `OtOpcUa.Server` → new Minimal API endpoints | `/healthz` (liveness — process alive + config DB reachable or LiteDB cache warm), `/readyz` (readiness — every driver instance reports `DriverState.Healthy`). JSON bodies cite individual driver health per instance. |
| Serilog configuration | Centralize enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every driver call runs inside a `LogContext.PushProperty` scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON-formatted output added alongside plain-text so SIEM ingestion works. |
| `Configuration` project | Add `LiteDbConfigCache` adapter. Wrap EF Core queries in a Polly pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-cache. Cache refresh on successful DB query + after `sp_PublishGeneration`. Cache lives at `%ProgramData%/OtOpcUa/config-cache/<cluster-id>.db` per node. |
| `DriverHostStatus` entity | Extend to carry `LastCircuitBreakerOpenUtc`, `ConsecutiveFailures`, `CurrentBulkheadDepth`, `LastRecycleUtc`. Admin `/hosts` page reads these. |
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`). **Pipeline key = `(DriverInstanceId, HostName)`** so one dead PLC behind a multi-device driver doesn't open the breaker for healthy siblings (decision #35 per-device isolation). **Per-capability policy** — Read / HistoryRead / Discover / Probe / Alarm get retries; **Write does NOT** unless `[WriteIdempotent]` on the tag definition (decisions #44-45). |
| Every capability-interface consumer in the server | Wrap `IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync/UnsubscribeAsync`, `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync`. Composition: timeout → (retry when capability supports) → circuit breaker → bulkhead. |
| `Core.Abstractions` → new `WriteIdempotentAttribute` | Marker on `ModbusTagDefinition` / `S7TagDefinition` / `OpcUaClientDriver` tag rows; opts that tag into auto-retry on Write. Absence = no retry, per spec. |
| `Core` → new `Core.Stability` sub-namespace — **split** | Two separate subsystems: (a) **`MemoryTracking`** runs all tiers; captures baseline (median of first 5 min `GetMemoryFootprint` samples) + applies the hybrid rule `soft = max(multiplier × baseline, baseline + floor)`; soft breach logs + surfaces to Admin; never kills. (b) **`MemoryRecycle`** (Tier C only — requires out-of-process topology) handles hard-breach recycle via the Proxy-side supervisor. Tier A/B overrun escalates to Tier C promotion ticket, not auto-kill. |
| `ScheduledRecycleScheduler` | Tier C only per decisions #73-74. Weekly/time-of-day recycle via Proxy supervisor. Tier A/B opt-in recycle lands in a future phase together with a Tier-C-escalation workflow. |
| `WedgeDetector` | **Demand-aware**: flips a driver to Faulted only when `(hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` derives from non-zero Polly bulkhead depth OR ≥1 active MonitoredItem OR ≥1 queued historian read. Idle + subscription-only drivers stay Healthy. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must advertise their out-of-process topology; the registry enforces invariants (Tier C has a `Proxy` + `Host` pair). |
| `Driver.Galaxy.Proxy/Supervisor/` | **Retains** existing `CircuitBreaker` + `Backoff` — they guard IPC respawn (decision #68), different concern from the per-call Polly layer. Only `HeartbeatMonitor` is referenced downstream (IPC liveness). |
| `OtOpcUa.Server` → Minimal API endpoints on `http://+:4841` | `/healthz` = process alive + (config DB reachable OR `UsingStaleConfig=true`). `/readyz` = ANDed driver health; state-machine per `DriverState`: `Unknown`/`Initializing` → 503, `Healthy` → 200, `Degraded` → 200 + `{degradedDrivers: [...]}` in body, `Faulted` → 503. JSON body always reports per-instance detail. |
| Serilog configuration | Centralize enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every capability call runs inside a `LogContext.PushProperty` scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON sink added alongside plain-text (switchable via `Serilog:WriteJson` appsetting). |
| `Configuration` project | Add `LiteDbConfigCache` adapter. **Generation-sealed snapshots**: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file. Reads serve the last-known-sealed generation; mixed-generation reads are impossible. Write path bypasses cache + fails hard on DB outage. Pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-sealed-snapshot. |
| `DriverHostStatus` vs. `DriverInstanceResilienceStatus` | New separate entity `DriverInstanceResilienceStatus { DriverInstanceId, HostName, LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc, BaselineFootprintBytes }`. `DriverHostStatus` keeps per-host connectivity only; Admin `/hosts` joins both for display. |
## Scope — What Does NOT Change
@@ -56,19 +60,21 @@ Closes these gaps flagged in the 2026-04-19 audit:
### Stream A — Resilience layer (1 week)
1. **A.1** Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build `DriverResiliencePipelineBuilder` that composes Timeout → Retry (exponential backoff + jitter, capability-specific max retries) → CircuitBreaker (consecutive-failure threshold; half-open probe) → Bulkhead (max in-flight per driver instance). Unit tests cover each policy in isolation + composed pipeline.
2. **A.2** `DriverResilienceOptions` record bound from `DriverInstance.ResilienceConfig` JSON column (new nullable). Defaults encoded per-tier: Tier A (OPC UA Client, S7) 3 retries, 2 s timeout, 5-failure breaker; Tier B (Modbus) — same except 4 s timeout; Tier C (Galaxy) 1 retry (inner supervisor handles restart), 10 s timeout, circuit-breaker trips but doesn't kill the driver (the Proxy supervisor already handles that).
3. **A.3** `DriverCapabilityInvoker<T>` wraps every `IDriver*` method call. Existing server-side dispatch (whatever currently calls `driver.ReadAsync`) routes through the invoker. Policy injection via DI.
4. **A.4** Remove the hand-rolled `CircuitBreaker` + `Backoff` from `Driver.Galaxy.Proxy/Supervisor/` — replaced by the shared layer. Keep `HeartbeatMonitor` (different concern: IPC liveness, not data-path resilience).
5. **A.5** Unit tests: per-policy, per-composition. Integration test: Modbus driver under a FlakeyTransport that fails 5×, succeeds on 6th; invoker surfaces the eventual success. Bench: no-op overhead < 1% under nominal load.
1. **A.1** Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build `DriverResiliencePipelineBuilder` — key on `(DriverInstanceId, HostName)`; composes Timeout → (Retry when the capability allows it; skipped for Write unless `[WriteIdempotent]`) → CircuitBreaker → Bulkhead. Per-capability policy map documented in `DriverResilienceOptions.CapabilityPolicies`.
2. **A.2** `DriverResilienceOptions` record bound from `DriverInstance.ResilienceConfig` JSON column (new nullable). **Per-tier × per-capability** defaults: Tier A (OpcUaClient, S7) Read 3 retries/2 s/5-failure-breaker, Write 0 retries/2 s/5-failure-breaker; Tier B (Modbus) Read 3/4 s/5, Write 0/4 s/5; Tier C (Galaxy) Read 1 retry/10 s/no-kill, Write 0/10 s/no-kill. Idempotent writes can opt into Read-shaped retry via the attribute.
3. **A.3** `CapabilityInvoker<TCapability, TResult>` wraps every method on the capability interfaces (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync/UnsubscribeAsync`, `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync`). Existing server-side dispatch routes through it.
4. **A.4** **Retain** `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard IPC process respawn (decision #68), orthogonal to the per-call Polly layer. Only `HeartbeatMonitor` is consumed outside the supervisor.
5. **A.5** Unit tests: per-policy, per-composition. Negative integration tests: (a) Modbus FlakeyTransport fails 5× on Read, succeeds 6th — invoker surfaces success; (b) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=false` — invoker surfaces failure without retry (no duplicate pulse); (c) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=true` — invoker retries. Bench: no-op overhead < 1%.
6. **A.6** `WriteIdempotentAttribute` in `Core.Abstractions`. Modbus/S7/OpcUaClient tag-definition records pick it up; invoker reads via reflection once at driver init.
### Stream B — Tier A/B/C stability runtime (1 week, can parallel with Stream A after A.1)
### Stream B — Tier A/B/C stability runtime — split into MemoryTracking + MemoryRecycle (1 week)
1. **B.1** `Core.Abstractions``DriverTier` enum {A, B, C}. Extend `DriverTypeRegistry` to require `DriverTier` at registration. Existing driver types get their tier stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
2. **B.2** Generalise `DriverMemoryWatchdog` (lift from `Driver.Galaxy.Host/MemoryWatchdog.cs`). Tier-specific thresholds: A = 256 MB RSS soft / 512 MB hard, B = 512 MB soft / 1 GB hard, C = 1 GB soft / 2 GB hard (decision #70 hybrid multiplier + floor). Soft threshold → log + metric; hard threshold → mark driver Faulted + trigger recycle.
3. **B.3** `ScheduledRecycleScheduler` (decision #67): each driver instance can opt-in to a weekly recycle at a configured cron. Recycle = `ShutdownAsync``InitializeAsync`. Tier C drivers get the Proxy-side recycle; Tier A/B recycle in-process.
4. **B.4** `WedgeDetector`: polling thread per driver instance; if `LastSuccessfulRead` older than `WedgeThreshold` (default 5 × PublishingInterval, minimum 60 s) AND driver state is `Healthy`, flag as wedged → force `ReinitializeAsync`. Prevents silent dead-subscriptions.
5. **B.5** Tests: watchdog unit tests drive synthetic allocation; scheduler uses a virtual clock; wedge detector tests use a fake IClock + driver stub.
1. **B.1** `Core.Abstractions``DriverTier` enum {A, B, C}. Extend `DriverTypeRegistry` to require `DriverTier` at registration. Existing driver types stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
2. **B.2** **`MemoryTracking`** (all tiers) lifted from `Driver.Galaxy.Host/MemoryWatchdog.cs`. Captures `BaselineFootprintBytes` as the median of first 5 min of `IDriver.GetMemoryFootprint()` samples post-`InitializeAsync`. Applies **decision #70 hybrid formula**: `soft = max(multiplier × baseline, baseline + floor)`; Tier A multiplier=3, floor=50 MB; Tier B multiplier=3, floor=100 MB; Tier C multiplier=2, floor=500 MB. Soft breach → log + `DriverInstanceResilienceStatus.CurrentFootprint` tick; never kills. Hard = 2 × soft.
3. **B.3** **`MemoryRecycle`** (Tier C only per decisions #73-74). Hard-breach on a Tier C driver triggers `ScheduledRecycleScheduler.RequestRecycleNow(driverInstanceId)`; scheduler proxies to `Driver.Galaxy.Proxy/Supervisor/` which restarts the Host process. Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; **never auto-kills** the in-process driver.
4. **B.4** **`ScheduledRecycleScheduler`** per decision #67: Tier C driver instances opt-in to a weekly recycle at a configured cron. Tier A/B scheduled recycle deferred to a later phase paired with Tier-C escalation.
5. **B.5** **`WedgeDetector`** demand-aware: `if (state==Healthy && hasPendingWork && noProgressIn > WedgeThreshold) → force ReinitializeAsync`. `hasPendingWork` = (bulkhead depth > 0) OR (active monitored items > 0) OR (queued historian-read count > 0). `WedgeThreshold` default 5 × PublishingInterval, min 60 s. Idle driver stays Healthy.
6. **B.6** Tests: tracking unit tests drive synthetic allocation against a fake `GetMemoryFootprint`; recycle tests use a mock supervisor; wedge tests include the false-fault cases — idle subscriber, slow historian backfill, write-only burst.
### Stream C — Health endpoints + structured logging (4 days)
@@ -77,12 +83,12 @@ Closes these gaps flagged in the 2026-04-19 audit:
3. **C.3** Add JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via `Serilog:WriteJson` appsetting.
4. **C.4** Integration test: boot server, issue Modbus read, assert log line contains `DriverInstanceId` + `CorrelationId` structured fields.
### Stream D — Config DB LiteDB fallback (1 week)
### Stream D — Config DB LiteDB fallback — generation-sealed snapshots (1 week)
1. **D.1** `LiteDbConfigCache` adapter. Wraps `ConfigurationDbContext` queries that are safe to serve stale (cluster membership, generation metadata, driver instance definitions, LDAP role mapping). Write-path queries (draft save, publish) bypass the cache and fail hard on DB outage.
2. **D.2** Cache refresh strategy: refresh on every successful read (write-through-cache), full refresh after `sp_PublishGeneration` confirmation. Cache entries carry `CachedAtUtc`; served entries older than 24 h trigger a synthetic `Warning` log line so operators see stale data in effect.
3. **D.3** Polly pipeline in `Configuration` project: EF Core query → retry 3× fallback to cache. On fallback, driver state stays `Healthy` but a `UsingStaleConfig` flag on the cluster's health report flips true.
4. **D.4** Tests: in-memory SQL Server failure injected via `TestContainers`-ish double; cache returns last-known values; Admin UI banners reflect `UsingStaleConfig`.
1. **D.1** `LiteDbConfigCache` adapter backed by **sealed generation snapshots**: each successful `sp_PublishGeneration` writes `<cache-root>/<clusterId>/<generationId>.db` as read-only after commit. The adapter maintains a `CurrentSealedGenerationId` pointer updated atomically on successful publish. Mixed-generation reads are **impossible** — every read served from the cache serves one coherent sealed generation.
2. **D.2** Write-path queries (draft save, publish) bypass the cache entirely and fail hard on DB outage. Read-path queries (DriverInstance enumeration, LdapGroupRoleMapping, cluster + namespace metadata) go through the pipeline: timeout 2 s → retry 3× jittered → fallback to the current sealed snapshot.
3. **D.3** `UsingStaleConfig` flag flips true when a read fell back to the sealed snapshot; cleared on the next successful DB round-trip. Surfaced on `/healthz` body and Admin `/hosts`.
4. **D.4** Tests: (a) SQL-container kill mid-operation — read returns sealed snapshot, `UsingStaleConfig=true`, driver stays Healthy; (b) mixed-generation guard — attempt to serve partial generation by corrupting a snapshot file mid-read → adapter fails closed rather than serving mixed data; (c) first-boot-no-snapshot case — adapter refuses to start, driver fails `InitializeAsync` with a clear config-DB-required error.
### Stream E — Admin `/hosts` page refresh (3 days)
@@ -92,11 +98,17 @@ Closes these gaps flagged in the 2026-04-19 audit:
## Compliance Checks (run at exit gate)
- [ ] **Polly coverage**: every `IDriver*` method call in the server dispatch layer routes through `DriverCapabilityInvoker`. Enforce via a Roslyn analyzer added to `Core.Abstractions` build (error on direct `IDriver.ReadAsync` calls outside the invoker).
- [ ] **Tier registry**: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. Unit test walks the registry + asserts no gaps.
- [ ] **Health contract**: `/healthz` + `/readyz` respond within 500 ms even with one driver Faulted.
- [ ] **Structured log**: CI grep on `tests/` output asserts at least one log line contains `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- [ ] **Cache fallback**: Integration test kills the SQL container mid-operation; driver health stays `Healthy`, `UsingStaleConfig` flips true.
- [ ] **Invoker coverage**: every method on `IReadable` / `IWritable` / `ITagDiscovery` / `ISubscribable` / `IHostConnectivityProbe` / `IAlarmSource` / `IHistoryProvider` in the server dispatch layer routes through `CapabilityInvoker`. Enforce via a Roslyn analyzer (error-level; warning-first is rejected — the compliance check is the gate).
- [ ] **Write-retry guard**: writes without `[WriteIdempotent]` never get retried. Unit-test the invoker path asserts zero retry attempts.
- [ ] **Pipeline isolation**: pipeline key is `(DriverInstanceId, HostName)`. Integration test with two Modbus hosts under one instance — failing host A does not open the breaker for host B.
- [ ] **Tier registry**: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. Unit test walks the registry + asserts no gaps. Tier C registrations must declare their out-of-process topology.
- [ ] **MemoryTracking never kills**: soft/hard breach tests on a Tier A/B driver log + surface without terminating the process.
- [ ] **MemoryRecycle Tier C only**: hard breach on a Tier A driver never invokes the supervisor; on Tier C it does.
- [ ] **Wedge demand-aware**: test suite includes idle-subscription-only, slow-historian-backfill, and write-only-burst cases — driver stays Healthy.
- [ ] **Galaxy supervisor preserved**: `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` still present + still invoked on Host crash.
- [ ] **Health state machine**: `/healthz` + `/readyz` respond within 500 ms for every `DriverState`; state-machine table in this doc drives the test matrix.
- [ ] **Structured log**: CI grep asserts at least one log line per capability call has `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- [ ] **Generation-sealed cache**: integration tests cover (a) SQL-kill mid-operation serves last-sealed snapshot; (b) mixed-generation corruption fails closed; (c) first-boot no-snapshot + DB-down → `InitializeAsync` fails with clear error.
- [ ] No regression in existing test suites — `dotnet test ZB.MOM.WW.OtOpcUa.slnx` count equal-or-greater than pre-Phase-6.1 baseline.
## Risks and Mitigations

View File

@@ -20,15 +20,22 @@ Closes these gaps:
## Scope — What Changes
**Architectural separation** (critical for correctness): `LdapGroupRoleMapping` is **control-plane only** — it maps LDAP groups to Admin UI roles (`FleetAdmin` / `ConfigEditor` / `ReadOnly`) and cluster scopes for Admin access. **It is NOT consulted by the OPC UA data-path evaluator.** The data-path evaluator reads `NodeAcl` rows joined directly against the session's **resolved LDAP group memberships**. The two concerns share zero runtime code path.
| Concern | Change |
|---------|--------|
| `Configuration` project | New entity `LdapGroupRoleMapping { Id, LdapGroup, Role, ClusterId? (nullable = system-wide), IsSystemWide, GeneratedAtUtc }`. Migration. Admin CRUD. |
| `Core` → new `Core.Authorization` sub-namespace | `IPermissionEvaluator` interface; concrete `PermissionTrieEvaluator` implementation loads ACLs + LDAP mappings from Configuration, builds a trie keyed on the 6-level scope hierarchy, evaluates a `(UserClaim[], NodeId, NodePermissions)` `bool` decision in O(depth × group-count). |
| `Core.Authorization` cache | `PermissionTrieCache` — one trie per `(ClusterId, GenerationId)`. Rebuilt on `sp_PublishGeneration` confirmation; served from memory thereafter. Per-session evaluator keeps a reference to the current trie + user's LDAP groups. |
| OPC UA server dispatch | `OtOpcUa.Server/OpcUa/DriverNodeManager.cs` Read/Write/HistoryRead/MonitoredItem-create paths call `PermissionEvaluator.Authorize(session.Identity, nodeId, NodePermissions.Read)` etc. before delegating to the driver. Unauthorized returns `BadUserAccessDenied` (0x80210000) — not a silent no-op per corrections-doc B1. |
| `LdapAuthService` (existing) | On cookie-auth success, resolves the user's LDAP groups via `LdapGroupService.GetMemberships` + loads the matching `LdapGroupRoleMapping` rows → produces a role-claim list + cluster-scope claim list. Stored on the auth cookie. |
| Admin UI `AclsTab.razor` | Repoint edits at the new `NodeAclService` API that writes through to the same table the evaluator reads. Add a "test this permission" probe that runs a dummy evaluator against a chosen `(user, nodeId, action)` so ops can sanity-check grants before publishing a draft. |
| Admin UI new tab `RoleGrantsTab.razor` | CRUD over `LdapGroupRoleMapping`. Per-cluster + system-wide grants. FleetAdmin only. |
| `Configuration` project | New entity `LdapGroupRoleMapping { Id, LdapGroup, Role, ClusterId? (nullable = system-wide), IsSystemWide, GeneratedAtUtc }`. **Consumed only by Admin UI role routing.** Migration. Admin CRUD. |
| `Core` → new `Core.Authorization` sub-namespace | `IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, OpcUaOperation op, NodeId nodeId) → AuthorizationDecision`. `op` covers every OPC UA surface: Browse, Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge, Confirm, Shelve. Result is tri-state (internal model distinguishes `Allow` / `NotGranted` / `Denied` + carries matched-grant provenance). Phase 6.2 only produces `Allow` + `NotGranted`; v2.1 Deny lands without API break. |
| `PermissionTrieBuilder` | Builds trie from `NodeAcl` rows joined against **resolved LDAP group memberships**, keyed on 6-level scope hierarchy for Equipment namespaces. **SystemPlatform namespaces (Galaxy)** use a `FolderSegment` scope level between Namespace and Tag, populated from `Tag.FolderPath` segments, so folder subtree authorization works on Galaxy trees the same way UNS works on Equipment trees. Trie node carries `ScopeKind` enum. |
| `PermissionTrieCache` + freshness | One trie per `(ClusterId, GenerationId)`. Invalidated on `sp_PublishGeneration` via in-process event bus AND generation-ID check on hot path — every authz call looks up `CurrentGenerationId` (Polly-wrapped, sub-second cache); a Backup that cached a stale generation detects the mismatch + forces re-load. **Redundancy-safe**. |
| `UserAuthorizationState` freshness | Cached per session BUT bounded by `MembershipFreshnessInterval` (default **15 min**). Past that, the next hot-path authz call re-resolves LDAP group memberships via `LdapGroupService`. Failure to re-resolve (LDAP unreachable) → **fail-closed**: evaluator returns `NotGranted` for every call until memberships refresh successfully. Decoupled from Phase 6.1's availability-oriented 24h cache. |
| `AuthCacheMaxStaleness` | Separate from Phase 6.1's `UsingStaleConfig` window. Default 5 min — beyond that, authz fails closed regardless of Phase 6.1 cache warmth. |
| OPC UA server dispatch — all enforcement surfaces | `DriverNodeManager` wires evaluator on: **Browse + TranslateBrowsePathsToNodeIds** (ancestors implicitly visible if any descendant has a grant; denied ancestors filter from results), **Read** (per-attribute StatusCode `BadUserAccessDenied` in mixed-authorization batches; batch never poisons), **Write** (uses `NodePermissions.WriteOperate/Tune/Configure` based on driver `SecurityClassification`), **HistoryRead** (uses `NodePermissions.HistoryRead`**distinct** flag, not Read), **HistoryUpdate** (`NodePermissions.HistoryUpdate`), **CreateMonitoredItems** (per-`MonitoredItemCreateResult` denial), **TransferSubscriptions** (re-evaluates items on transfer), **Call** (`NodePermissions.MethodCall`), **Acknowledge/Confirm/Shelve** (per-alarm flags). |
| Subscription re-authorization | Each `MonitoredItem` is stamped with `(AuthGenerationId, MembershipVersion)` at create time. On every Publish, items with a stamp mismatching the session's current `(AuthGenerationId, MembershipVersion)` get re-evaluated; revoked items drop to `BadUserAccessDenied` within one publish cycle. Unchanged items stay fast-path. |
| `LdapAuthService` | On cookie-auth success: resolves LDAP group memberships; loads matching `LdapGroupRoleMapping` rows → role claims + cluster-scope claims (control plane); stores `UserAuthorizationState.LdapGroups` on the session for the data-plane evaluator. |
| `ValidatedNodeAclAuthoringService` | Replaces CRUD-only `NodeAclService` for authoring. Validates (LDAP group exists, scope exists in current or target draft, grant shape is valid, no duplicate `(LdapGroup, Scope)` pair). Admin UI writes only through it. |
| Admin UI `AclsTab.razor` | Writes via `ValidatedNodeAclAuthoringService`. Adds Probe-This-Permission row that runs the real evaluator against a chosen `(LDAP group, node, operation)` and shows `Allow` / `NotGranted` + matched-grant provenance. |
| Admin UI new tab `RoleGrantsTab.razor` | CRUD over `LdapGroupRoleMapping`. Per-cluster + system-wide grants. FleetAdmin only. **Documentation explicit** that this only affects Admin UI access, not OPC UA data plane. |
| Audit log | Every Grant/Revoke/Publish on `LdapGroupRoleMapping` or `NodeAcl` writes an `AuditLog` row with old/new state + user. |
## Scope — What Does NOT Change
@@ -66,14 +73,19 @@ Closes these gaps:
5. **B.5** Per-session cached evaluator: OPC UA Session authentication produces `UserAuthorizationState { ClusterId, LdapGroups[], Trie }`; cached on the session until session close or generation-apply.
6. **B.6** Unit tests: trie-walk theory covering (a) Cluster-level grant cascades to tags, (b) Equipment-level grant doesn't leak to sibling Equipment, (c) multi-group union, (d) no-grant → deny, (e) Galaxy nodes skip UnsArea/UnsLine levels.
### Stream C — OPC UA server dispatch wiring (4 days)
### Stream C — OPC UA server dispatch wiring (6 days, widened)
1. **C.1** `DriverNodeManager.Read` consult evaluator before delegating to `IReadable`. Unauthorized nodes get `BadUserAccessDenied` per-attribute, not on the whole batch.
2. **C.2** `DriverNodeManager.Write`same. Evaluator needs `NodePermissions.WriteOperate` / `WriteTune` / `WriteConfigure` depending on driver-reported `SecurityClassification` of the attribute.
3. **C.3** `DriverNodeManager.HistoryRead`ACL checks `NodePermissions.Read` (history uses the same Read flag per `acl-design.md`).
4. **C.4** `DriverNodeManager.CreateMonitoredItem`denies unauthorized nodes at subscription create time, not after the first publish. Cleaner than silently omitting notifications.
5. **C.5** Alarm actions (acknowledge / confirm / shelve) — checks `AlarmAck` / `AlarmConfirm` / `AlarmShelve` flags.
6. **C.6** Integration tests: boot server with a seed trie, auth as three distinct users with different group memberships, assert read of one tag allowed + read of another denied + write denied where Read allowed.
1. **C.1** `DriverNodeManager.Read` — evaluator consulted per `ReadValueId` with `OpcUaOperation.Read`. Denied attributes get `BadUserAccessDenied` per-item; batch never poisons. Integration test covers mixed-authorization batch (3 authorized + 2 denied → 3 Good values + 2 Bad StatusCodes, request completes).
2. **C.2** `DriverNodeManager.Write`evaluator chooses `NodePermissions.WriteOperate` / `WriteTune` / `WriteConfigure` based on the driver-reported `SecurityClassification`.
3. **C.3** `DriverNodeManager.HistoryRead`**uses `NodePermissions.HistoryRead`**, which is a **distinct flag** from Read. Test: user with Read but not HistoryRead can read live values but gets `BadUserAccessDenied` on `HistoryRead`.
4. **C.4** `DriverNodeManager.HistoryUpdate`uses `NodePermissions.HistoryUpdate`.
5. **C.5** `DriverNodeManager.CreateMonitoredItems` — per-`MonitoredItemCreateResult` denial in mixed-authorization batch; partial success path per OPC UA Part 4. Each created item stamped `(AuthGenerationId, MembershipVersion)`.
6. **C.6** `DriverNodeManager.TransferSubscriptions` — on reconnect, re-evaluate every transferred `MonitoredItem` against the session's current auth state. Stale-stamp items drop to `BadUserAccessDenied`.
7. **C.7** **Browse + TranslateBrowsePathsToNodeIds** — evaluator called with `OpcUaOperation.Browse`. Ancestor visibility implied when any descendant has a grant (per `acl-design.md` §Browse). Denied ancestors filter from browse results — the UA browser sees a hierarchy truncated at the denied ancestor rather than an inconsistent child-without-parent view.
8. **C.8** `DriverNodeManager.Call``NodePermissions.MethodCall`.
9. **C.9** Alarm actions (Acknowledge / Confirm / Shelve) — per-alarm `NodePermissions.AlarmAck` / `AlarmConfirm` / `AlarmShelve`.
10. **C.10** Publish path — for each `MonitoredItem` with a mismatched `(AuthGenerationId, MembershipVersion)` stamp, re-evaluate. Unchanged items stay fast-path; changes happen at next publish cycle.
11. **C.11** Integration tests: three-user seed with different memberships; matrix covers every operation in §Scope. Mixed-batch tests for Read + CreateMonitoredItems.
### Stream D — Admin UI refresh (4 days)
@@ -84,11 +96,20 @@ Closes these gaps:
## Compliance Checks (run at exit gate)
- [ ] **Data-path enforcement**: OPC UA Read against a NodeId the current user has no grant for returns `BadUserAccessDenied` with a ServiceResult, not Good with stale data. Verified by an integration test with a Basic256Sha256-secured session + a read-only LDAP identity.
- [ ] **Trie invariants**: `PermissionTrieBuilder` is idempotent (building twice with identical inputs produces equal tries — override `Equals` to assert).
- [ ] **Additive grants**: Cluster-level grant on User A means User A can read every tag in that cluster *without* needing any lower-level grant.
- [ ] **Isolation between clusters**: a grant on Cluster 1 has zero effect on Cluster 2 for the same user.
- [ ] **Galaxy path coverage**: ACL checks work on `Galaxy` folder nodes + tag nodes where the UNS levels are absent (the trie treats them as shallow `Cluster → Namespace → Tag`).
- [ ] **Control/data-plane separation**: `LdapGroupRoleMapping` consumed only by Admin UI; the data-path evaluator has zero references to it. Enforced via a project-reference audit (Admin project references the mapping service; `Core.Authorization` does not).
- [ ] **Every operation wired**: Browse, Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge, Confirm, Shelve all consult the evaluator. Integration test matrix covers every operation × allow/deny.
- [ ] **HistoryRead uses its own flag**: test "user with Read + no HistoryRead gets `BadUserAccessDenied` on HistoryRead".
- [ ] **Mixed-batch semantics**: Read of 5 nodes (3 allowed + 2 denied) returns 3 Good + 2 `BadUserAccessDenied` per-`ReadValueId`; CreateMonitoredItems equivalent.
- [ ] **Browse ancestor visibility**: user with a grant only on a deep equipment node can browse the path to it (ancestors implied); denied ancestors filter from browse results otherwise.
- [ ] **Galaxy FolderSegment coverage**: a grant on a Galaxy folder subtree cascades to its tags; sibling folders are unaffected. Trie test covers this.
- [ ] **Subscription re-authorization**: integration test — create item, revoke grant via draft+publish, next publish cycle the item returns `BadUserAccessDenied` (not silently still-notifying).
- [ ] **Membership freshness**: test — 15 min MembershipFreshnessInterval elapses on a long-lived session + LDAP now unreachable → authz fails closed on the next request until LDAP recovers.
- [ ] **Auth cache fail-closed**: test — Phase 6.1 cache serves stale config for 6 min; authz evaluator refuses all calls after 5 min regardless.
- [ ] **Trie invariants**: `PermissionTrieBuilder` is idempotent (build twice with identical inputs → equal tries).
- [ ] **Additive grants + cluster isolation**: cluster-grant cascades; cross-cluster leakage impossible.
- [ ] **Redundancy-safe invalidation**: integration test — two nodes, a publish on one, authorize a request on the other before in-process event propagates → generation-mismatch forces re-load, no stale decision.
- [ ] **Authoring validation**: `AclsTab` cannot save a `(LdapGroup, Scope)` pair that already exists in the draft; operator sees the validation error pre-save.
- [ ] **AuthorizationDecision shape stability**: API surface exposes `Allow` + `NotGranted` only; `Denied` variant exists in the type but is never produced; v2.1 can add Deny without API break.
- [ ] No regression in driver test counts.
## Risks and Mitigations

View File

@@ -22,11 +22,13 @@ Closes these gaps:
| Concern | Change |
|---------|--------|
| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads its peers from `ServerCluster`, probes each peer's `/healthz` (Phase 6.1 endpoint) every `PeerProbeInterval` (default 2 s), maintains per-peer health state. |
| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable refreshes on cluster-topology change. `RedundancySupport` stays static (set from `RedundancyMode` at startup). |
| `RedundancyCoordinator` computation | `ServiceLevel` formula: 255 = Primary + fully healthy + no apply in progress; 200 = Primary + an apply in the middle (clients should prefer peer); 100 = Backup + fully healthy; 50 = Backup + mid-apply; 0 = Faulted or peer-unreachable-and-I'm-not-authoritative. Documented in `docs/Redundancy.md` update. |
| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads peers, runs **two-layer peer health probe**: (a) `/healthz` every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) `UaHealthProbe` every 10 s — opens a lightweight OPC UA client session to the peer + reads its `ServiceLevel` node + verifies endpoint serves data. Authority decisions use UaHealthProbe; `/healthz` is used only to avoid wasting UA probes when peer is obviously down. |
| Publish-generation fencing | Topology + role decisions are stamped with a monotonic `ConfigGenerationId` from the shared config DB. Coordinator re-reads topology via CAS on `(ClusterId, ExpectedGeneration)` → new row; peers reject state propagated from a lower generation. Prevents split-publish races. |
| `InvalidTopology` runtime state | If both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later. |
| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable includes **self + peers** in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). `RedundancySupport` stays static (set from `RedundancyMode` at startup); `Transparent` mode validated pre-publish, not rejected at startup. |
| `RedundancyCoordinator` computation | **8-state ServiceLevel matrix** — avoids OPC UA Part 5 §6.3.34 collision (`0=Maintenance`, `1=NoData`). Operator-declared maintenance only = **0**. Unreachable / Faulted = **1**. In-range operational states occupy **2..255**: Authoritative-Primary = **255**; Isolated-Primary (peer unreachable, self serving) = **230**; Primary-Mid-Apply = **200**; Recovering-Primary (post-fault, dwell not met) = **180**; Authoritative-Backup = **100**; Isolated-Backup (primary unreachable, "take over if asked") = **80**; Backup-Mid-Apply = **50**; Recovering-Backup = **30**; `InvalidTopology` (runtime detects >1 Primary) = **2** (detected-inconsistency band — below normal operation). Full matrix documented in `docs/Redundancy.md` update. |
| Role transition | Split-brain avoidance: role is *declared* in the shared config DB (`ClusterNode.RedundancyRole`), not elected at runtime. An operator flips the row (or a failover script does). Coordinator only reads; never writes. |
| `sp_PublishGeneration` hook | Before the apply starts, the coordinator sets `ApplyInProgress = true` in-memory`ServiceLevel` drops to mid-apply band. Clears after `sp_PublishGeneration` returns. |
| `sp_PublishGeneration` hook | Uses named **apply leases** keyed to `(ConfigGenerationId, PublishRequestId)`. `await using var lease = coordinator.BeginApplyLease(...)`. Disposal on any exit path (success, exception, cancellation) decrements. Watchdog auto-closes any lease older than `ApplyMaxDuration` (default 10 min) → ServiceLevel can't stick at mid-apply. Pre-publish validator rejects unsupported `RedundancyMode` (e.g. `Transparent`) with a clear error so runtime never sees an invalid state. |
| Admin UI `/cluster/{id}` page | New `RedundancyTab.razor` — shows current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing `ClusterNode.RedundancyRole` + publishing a draft. |
| Metrics | New OpenTelemetry metrics: `ot_opcua_service_level{cluster,node}`, `ot_opcua_peer_reachable{cluster,node,peer}`, `ot_opcua_apply_in_progress{cluster,node}`. Sink via Phase 6.1 observability layer. |
@@ -57,41 +59,59 @@ Closes these gaps:
2. **A.2** Topology subscription — coordinator re-reads on `sp_PublishGeneration` confirmation so an operator role-swap takes effect after publish (no process restart needed).
3. **A.3** Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
### Stream B — Peer health probing + ServiceLevel computation (4 days)
### Stream B — Peer health probing + ServiceLevel computation (6 days, widened)
1. **B.1** `PeerProbeLoop` runs per peer at `PeerProbeInterval` (2 s default, configurable via `appsettings.json`). Calls peer's `/healthz` via `HttpClient`; timeout 1 s. Exponential backoff on sustained failure.
2. **B.2** `ServiceLevelCalculator.Compute(current role, self health, peer reachable, apply in progress) → byte`. Matrix documented in §Scope.
3. **B.3** Calculator reacts to inputs via `IObserver` pattern so changes immediately push to the OPC UA `ServiceLevel` node.
4. **B.4** Tests: matrix coverage for all role × health × apply permutations (32 cases); injected `IClock` + fake `HttpClient` so tests are deterministic.
1. **B.1** `PeerHttpProbeLoop` per peer at 2 s — calls peer's `/healthz`, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
2. **B.2** `PeerUaProbeLoop` per peer at 10 s — opens an OPC UA client session to the peer (reuses Phase 5 `Driver.OpcUaClient` stack), reads peer's `ServiceLevel` node + verifies endpoint serves data. Short-circuit: if HTTP probe is failing, skip UA probe (no wasted sessions).
3. **B.3** `ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte`. 8-state matrix per §Scope. `topologyValid=false` forces InvalidTopology = 2 regardless of other inputs.
4. **B.4** `RecoveryStateManager`: after a `Faulted → Healthy` transition, hold driver in `Recovering` band (180 Primary / 30 Backup) for `RecoveryDwellTime` (default 60 s) AND require one positive publish witness (successful `Read` on a reference node) before entering Authoritative band.
5. **B.5** Calculator reacts to inputs via `IObserver` so changes immediately push to the OPC UA `ServiceLevel` node.
6. **B.6** Tests: **64-case matrix** covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.
### Stream C — OPC UA node wiring (3 days)
1. **C.1** `ServiceLevel` variable node created under `ServerStatus` at server startup. Type `Byte`, AccessLevel = CurrentRead only. Subscribe to `ServiceLevelCalculator` observable; push updates via `DataChangeNotification`.
2. **C.2** `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, length = peer count. Updates on topology change.
3. **C.3** `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Phase 6.3 supports everything except `Transparent` + `HotAndMirrored`.
4. **C.4** Test against the Client.CLI: connect to primary, read `ServiceLevel` expect 255; pause primary apply → expect 200; fail primary → client sees `Bad_ServerNotConnected` + reconnects to peer at 100.
2. **C.2** `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, **includes self + peers** with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
3. **C.3** `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Unsupported values (`Transparent`, `HotAndMirrored`) are rejected **pre-publish** by validator — runtime never sees them.
4. **C.4** Client.CLI cutover test: connect to primary, read `ServiceLevel` → 255; pause primary apply → 200; unreachable peer while apply in progress → 200 (apply dominates peer-unreachable per matrix); client sees peer via `ServerUriArray`; fail primary → client reconnects to peer at 80 (isolated-backup band).
### Stream D — Apply-window integration (2 days)
### Stream D — Apply-window integration (3 days)
1. **D.1** `sp_PublishGeneration` caller wraps the apply in `using (coordinator.BeginApplyWindow())`. `BeginApplyWindow` increments an in-process counter; ServiceLevel drops on first increment. Dispose decrements.
2. **D.2** Nested applies handled by the counter (rarely happens but Ignition and Kepware clients have both been observed firing rapid-succession draft publishes).
3. **D.3** Test: mid-apply subscribe on primary; assert the subscribing client sees the ServiceLevel drop immediately after the apply starts, then restore after apply completes.
1. **D.1** `sp_PublishGeneration` caller wraps the apply in `await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId)`. Lease keyed to `(ConfigGenerationId, PublishRequestId)` so concurrent publishes stay isolated. Disposal decrements on every exit path.
2. **D.2** `ApplyLeaseWatchdog` auto-closes leases older than `ApplyMaxDuration` (default 10 min) so a crashed publisher can't pin the node at mid-apply.
3. **D.3** Pre-publish validator in `sp_PublishGeneration` rejects unsupported `RedundancyMode` values (`Transparent`, `HotAndMirrored`) with a clear error message — runtime never sees an invalid mode.
4. **D.4** Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via `ThreadAbort` / cancellation → watchdog closes; (c) publish rejected for `Transparent` → operator-actionable error.
### Stream E — Admin UI + metrics (3 days)
1. **E.1** `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel, peer reachability, last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
2. **E.2** OpenTelemetry meter export: three gauges per the §Scope metrics. Sink via Phase 6.1 observability.
1. **E.1** `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
2. **E.2** OpenTelemetry meter export: `ot_opcua_service_level{cluster,node}` gauge + `ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua}` + `ot_opcua_apply_in_progress{cluster,node}` + `ot_opcua_topology_valid{cluster}`. Sink via Phase 6.1 observability.
3. **E.3** SignalR push: `FleetStatusHub` broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
### Stream F — Client-interoperability matrix (3 days, new)
1. **F.1** Validate ServiceLevel-driven cutover against **Ignition 8.1 + 8.3**, **Kepware KEPServerEX 6.x**, **Aveva OI Gateway 2020R2 + 2023R1**. For each: configure the client with both endpoints, verify it honors `ServiceLevel` + `ServerUriArray` during primary failover.
2. **F.2** Clients that don't honour the standards (doc field — may include Kepware and OI Gateway per Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in `docs/Redundancy.md`.
3. **F.3** Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill Primary, assert Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)`. Document required session-timeout config in `docs/Redundancy.md`.
## Compliance Checks (run at exit gate)
- [ ] **Primary-healthy** ServiceLevel = 255.
- [ ] **Backup-healthy** ServiceLevel = 100.
- [ ] **Mid-apply Primary** ServiceLevel = 200 — verified via Client.CLI subscription polling ServiceLevel during a forced draft publish.
- [ ] **Peer-unreachable** handling: when a Primary can't probe its Backup's `/healthz`, Primary still serves at 255 (peer is the one with the problem). When a Backup can't probe Primary, Backup flips to 200 (per decision #81 — a lonely Backup promotes its advertised level to signal "I'll take over if you ask" without auto-promoting).
- [ ] **Role transition via operator publish**: FleetAdmin swaps `RedundancyRole` rows in a draft, publishes; both nodes re-read topology on publish confirmation and flip ServiceLevel accordingly — no restart needed.
- [ ] **ServerUriArray** returns exactly the peer node's ApplicationUri.
- [ ] **Client.CLI cutover**: with a primary deliberately halted, a client that was connected to primary reconnects to the backup within the ServiceLevel-polling interval.
- [ ] **OPC UA band compliance**: `0=Maintenance` reserved, `1=NoData` reserved. Operational states in 2..255 per 8-state matrix.
- [ ] **Authoritative-Primary** ServiceLevel = 255.
- [ ] **Isolated-Primary** (peer unreachable, self serving) = 230 — Primary retains authority.
- [ ] **Primary-Mid-Apply** = 200.
- [ ] **Recovering-Primary** = 180 with dwell + publish witness enforced.
- [ ] **Authoritative-Backup** = 100.
- [ ] **Isolated-Backup** (primary unreachable) = 80 — does NOT auto-promote.
- [ ] **InvalidTopology** = 2 — both nodes self-demote when >1 Primary detected runtime.
- [ ] **ServerUriArray** returns self + peer URIs, self first.
- [ ] **UaHealthProbe authority**: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
- [ ] **Apply-lease disposal**: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
- [ ] **Transparent-mode rejection**: attempting to publish `RedundancyMode=Transparent` is blocked at `sp_PublishGeneration`; runtime never sees an invalid mode.
- [ ] **Role transition via operator publish**: FleetAdmin swaps `RedundancyRole` in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
- [ ] **Client.CLI cutover**: with primary halted, Client.CLI that was connected to primary sees primary drop + reconnects to backup via `ServerUriArray`.
- [ ] **Client interoperability matrix** (Stream F): Ignition 8.1 + 8.3 honour ServiceLevel; Kepware + Aveva OI Gateway findings documented.
- [ ] **Galaxy MXAccess failover**: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
- [ ] No regression in existing driver test suites; no regression in `/healthz` reachability under redundancy load.
## Risks and Mitigations

View File

@@ -22,13 +22,14 @@ Gaps to close:
| Concern | Change |
|---------|--------|
| `Admin/Pages/UnsTab.razor` | Rewrite as a tree component with drag-drop (Blazor-native HTML5 DnD; no third-party dep). Each drag fires a "Compute Impact" call against the draft-generation state + renders a modal preview ("Moving Line 'Oven-2' from 'Packaging' to 'Assembly' will re-home 14 equipment + re-parent 237 tags"). Confirmation commits the draft edit. |
| `Admin/Services/UnsImpactAnalyzer.cs` | New service. Given a move-operation (line move, area rename, line merge), computes cascade counts by walking the draft-generation `Equipment` + `Tag` tables. Pure-function shape; testable in isolation. |
| `Admin/Pages/EquipmentTab.razor` | Add CSV-import button → modal with file picker + dry-run preview. Add multi-identifier search bar (ZTag / SAPID / UniqueId / Alias1 / Alias2) per decision #95 — parses any of the five, shows matches across draft + published generations. |
| `Admin/Services/EquipmentCsvImporter.cs` | New service. Parses CSV with documented header row; validates each row against the `Equipment` schema (required fields + `ExternalIdReservation` freshness); returns `ImportPreview` DTO with per-row accept/reject + reason; commit step wraps in a single EF transaction. |
| `Admin/Pages/DraftEditor.razor` + `DiffViewer.razor` | Diff viewer expanded: adds sections for ACL grants (from Phase 6.2 `LdapGroupRoleMapping` + `NodeAcl`), redundancy-role changes (from Phase 6.3), equipment-class `_base` Identification fields. Render each section collapsible. |
| `Admin/Components/IdentificationFields.razor` | New component. Renders the OPC 40010 nullable columns (Manufacturer, Model, SerialNumber, ProductInstanceUri, HardwareRevision, SoftwareRevision, DeviceRevision, YearOfConstruction, MonthOfConstruction) as a labelled field group on the `EquipmentTab` detail view. |
| `OtOpcUa.Server/OpcUa/DriverNodeManager` — Equipment folder build | When an `Equipment` row has non-null Identification fields, the server adds an `Identification` sub-folder under the Equipment node containing one variable per non-null field. Matches OPC 40010 companion spec. |
| `Admin/Pages/UnsTab.razor` | Tree component with drag-drop using **`MudBlazor.TreeView` + `MudBlazor.DropTarget`** (existing transitive dep — no new third-party package). Native HTML5 DnD rejected because virtualization + DnD on 500+ nodes doesn't combine reliably. Each drag fires a "Compute Impact" call carrying a `DraftRevisionToken`; modal preview ("Moving Line 'Oven-2' from 'Packaging' to 'Assembly' will re-home 14 equipment + re-parent 237 tags"). **Confirm step re-checks the token** and rejects with a `409 Conflict / refresh-required` modal if the draft advanced between preview and commit. |
| `Admin/Services/UnsImpactAnalyzer.cs` | New service. Given a move-operation (line move, area rename, line merge), computes cascade counts + `DraftRevisionToken` at preview time. Pure-function shape; testable in isolation. |
| `Admin/Pages/EquipmentTab.razor` | Add CSV-import button → modal with file picker + dry-run preview. **Identifier search** uses the canonical decision #117 set: `ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid`. Typeahead probes each column with a ranking query (exact match score 100 → prefix 50 → opt-in LIKE 20; published > draft tie-break). Result row shows which field matched via trailing badge. |
| `Admin/Services/EquipmentCsvImporter.cs` | New service. CSV header row must start with `# OtOpcUaCsv v1` (version marker — future shape changes bump the version). Columns: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid, Name, UnsAreaName, UnsLineName, Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Parser rejects unknown columns + blank required fields + duplicate ZTags + missing UnsLines. |
| **Staged-import table** `EquipmentImportBatch` | New entity `{ Id, CreatedAtUtc, CreatedBy, RowsStaged, RowsAccepted, RowsRejected, FinalisedAtUtc? }` + child `EquipmentImportRow` records. Import writes rows in chunks to the staging table (not to `Equipment`). `FinaliseImportBatch` is the atomic finalize step that applies all accepted rows to `Equipment` + `ExternalIdReservation` in one transaction — short + bounded regardless of input size. Rollback = drop the batch row; `Equipment` never partially mutates. |
| `Admin/Pages/DraftEditor.razor` + `DiffViewer.razor` | Diff viewer refactored into a base component + section plugins: `StructuralDiffSection`, `EquipmentDiffSection`, `TagDiffSection`, `AclDiffSection` (Phase 6.2), `RedundancyDiffSection` (Phase 6.3), `IdentificationDiffSection`. Each section has a **1000-row hard cap**; over-cap renders an aggregate summary + "Load full diff" button streaming 500-row pages via SignalR. Subtree-rename diffs (decision #115 bulk restructure) surface as summary only by default. |
| `Admin/Components/IdentificationFields.razor` | New component. Renders the OPC 40010 field set **per decision #139**: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. `ProductInstanceUri / DeviceRevision / MonthOfConstruction` dropped from this phase — they need a separate decision-log widening. |
| `OtOpcUa.Server/OpcUa/DriverNodeManager` — Equipment folder build | When an `Equipment` row has non-null Identification fields, the server adds an `Identification` sub-folder under the Equipment node containing one variable per non-null field. **ACL binding**: the sub-folder + variables inherit the `Equipment` scope's grants from Phase 6.2's trie — no new scope level added. Documented in `acl-design.md` cross-reference update. |
## Scope — What Does NOT Change
@@ -51,40 +52,53 @@ Gaps to close:
## Task Breakdown
### Stream A — UNS drag/reorder + impact preview (4 days)
### Stream A — UNS drag/reorder + impact preview (5 days)
1. **A.1** `UnsImpactAnalyzer` service. Inputs: `(DraftGenerationId, MoveOperation)`. Outputs: `ImpactPreview { AffectedEquipmentCount, AffectedTagCount, CascadeWarnings[] }`. Unit tests cover line move / area rename / line merge.
2. **A.2** HTML5 DnD on a tree component. No JS interop beyond `ondragstart`/`ondragover`/`ondrop` — keeps build + testability simple.
3. **A.3** Modal preview wired to `UnsImpactAnalyzer` output; "Confirm" commits a draft edit via `DraftService`.
4. **A.4** Playwright smoke test (or equivalent): drag a line across areas, assert modal shows the right counts, assert draft row reflects the move.
1. **A.1** 1000-node synthetic seed fixture. Drag-latency bench against `MudBlazor.TreeView` + `MudBlazor.DropTarget` — commit to the component if latency budget (100 ms drag-enter feedback) holds; fall back to flat-list reorder UI (Area/Line dropdowns) with loss of visual drag affordance otherwise.
2. **A.2** `UnsImpactAnalyzer` service. Inputs: `(DraftGenerationId, MoveOperation, DraftRevisionToken)`. Outputs: `ImpactPreview { AffectedEquipmentCount, AffectedTagCount, CascadeWarnings[], DraftRevisionToken }`. Pure-function shape; testable in isolation.
3. **A.3** Modal preview wired to `UnsImpactAnalyzer`. **Confirm** re-reads the current draft revision + compares against the preview's token; if the draft advanced (another operator saved a different edit), show a `409 Conflict / refresh-required` modal rather than silently overwriting.
4. **A.4** Cross-cluster drop attempts: target disabled + toast "Equipment is cluster-scoped (decision #82). To move across clusters, use Export → Import on the Cluster detail page." Plus help link.
5. **A.5** Playwright (or equivalent) smoke test: drag a line across areas, assert modal shows right counts, assert draft row reflects the move; concurrent-edit test runs two sessions + asserts the later Confirm hits the 409.
### Stream B — Equipment CSV import + 5-identifier search (4 days)
### Stream B — Equipment CSV import + 5-identifier search (5 days)
1. **B.1** `EquipmentCsvImporter` with a documented header row (`ZTag, SAPID, UniqueId, Alias1, Alias2, Name, UnsAreaName, UnsLineName, Manufacturer, Model, SerialNumber, `). Parser rejects unknown columns + blank required fields + duplicate ZTags.
2. **B.2** `ImportPreview` UI: per-row accept/reject table. Reject reasons: "ZTag already exists in draft", "ExternalIdReservation conflict with Cluster X", "UnsLineName not found in draft UNS tree", etc. Operator reviews then clicks "Commit" → single EF transaction.
3. **B.3** Multi-identifier search — bar accepts any of the 5 identifiers, probes each column in parallel, returns first-match-wins + disambiguation list if multiple match.
4. **B.4** Smoke tests: 100-row CSV with 10 intentional conflicts (5 ZTag dupes, 3 reservation clashes, 2 missing UnsLines); assert preview flags each; assert commit rolls back cleanly when a conflict surfaces post-preview.
1. **B.1** `EquipmentCsvImporter`. Strict RFC 4180 parser (per decision #95). Header row validation: first line must match `# OtOpcUaCsv v1` — future versions fork parser versions. Required columns: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid, Name, UnsAreaName, UnsLineName`. Optional: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Parser rejects unknown columns + blank required fields + duplicate ZTags.
2. **B.2** `EquipmentImportBatch` + `EquipmentImportRow` staging tables (migration). Import writes preview rows to staging via chunked inserts; staging never blocks `Equipment` or `ExternalIdReservation`. Preview query reads staging + validates each row against the current `Equipment` state + `ExternalIdReservation` freshness.
3. **B.3** `ImportPreview` UI — per-row accept/reject table. Reject reasons: "ZTag already exists in draft", "ExternalIdReservation conflict with Cluster X", "UnsLineName not found in draft UNS tree", etc. Operator reviews + clicks "Commit".
4. **B.4** `FinaliseImportBatch` — atomic finalize. One EF transaction applies accepted rows to `Equipment` + `ExternalIdReservation`; duration bounded regardless of input size (the atomic step is a bulk-insert, not per-row row-by-row). Rollback = drop batch row via `DropImportBatch`; `Equipment` never partially mutates.
5. **B.5** Five-identifier search. Rank SQL: exact match any identifier = score 100, prefix match = 50, LIKE-fuzzy (opt-in via `?fuzzy=true`) = 20; tie-break `published > draft` then `RowVersion DESC`. Typeahead shows which field matched via trailing badge.
6. **B.6** Smoke tests: 100-row CSV with 10 conflicts (5 ZTag dupes, 3 reservation clashes, 2 missing UnsLines); 10k-row perf test asserting finalize txn < 30 s; concurrent import + external `ExternalIdReservation` insert test asserts retryable-conflict handling.
### Stream C — Diff viewer enhancements (3 days)
### Stream C — Diff viewer enhancements (4 days)
1. **C.1** Refactor `DiffViewer.razor` into a base component + section plugins. Section plugins: `StructuralDiffSection` (UNS tree), `EquipmentDiffSection` (Equipment rows), `TagDiffSection` (Tag rows), `AclDiffSection` (ACL grants — depends on Phase 6.2), `RedundancyDiffSection` (role changes — depends on Phase 6.3), `IdentificationDiffSection` (OPC 40010 fields).
2. **C.2** Each section renders collapsed by default; counts + top-line summary always visible.
3. **C.3** Tests: seed two generations with deliberate diffs, assert every section reports the right counts + top-line summary.
1. **C.1** Refactor `DiffViewer.razor` into a base component + section plugins. Plugins: `StructuralDiffSection` (UNS tree), `EquipmentDiffSection`, `TagDiffSection`, `AclDiffSection` (Phase 6.2), `RedundancyDiffSection` (Phase 6.3), `IdentificationDiffSection`.
2. **C.2** Each section renders collapsed by default; counts + top-line summary always visible. **1000-row hard cap** per section — over-cap sections render aggregate summary (e.g. "237 equipment re-parented from Packaging to Assembly") with a "Load full diff" button that streams 500-row pages via SignalR.
3. **C.3** Subtree-rename diffs (decision #115 bulk restructure) surface as summary only by default regardless of row count.
4. **C.4** Tests: seed two generations with deliberate diffs; assert every section reports the right counts + top-line summary + hard-cap behavior.
### Stream D — OPC 40010 Identification exposure (3 days)
1. **D.1** `IdentificationFields.razor` component — labelled inputs; nullable columns show empty input; required field validation only on commit.
2. **D.2** `DriverNodeManager` equipment-folder builder — after building the equipment node, inspect the Identification columns; if any non-null, add an `Identification` sub-folder with variable-per-field.
3. **D.3** Address-space smoke test via Client.CLI: browse an equipment node, assert `Identification` sub-folder present when columns are set, absent when all null.
1. **D.1** `IdentificationFields.razor` component. Renders the **9 decision #139 fields**: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Labelled inputs; nullable columns show empty input; required-field validation on commit only.
2. **D.2** `DriverNodeManager` equipment-folder builder — after building the equipment node, inspect the 9 Identification columns; if any non-null, add an `Identification` sub-folder with variable-per-non-null-field. ACL binding: sub-folder + variables inherit the **same `ScopeId` as the Equipment node** (Phase 6.2's trie treats them as part of the Equipment scope — no new scope level).
3. **D.3** Address-space smoke test via Client.CLI: browse an equipment node, assert `Identification` sub-folder present when columns are set, absent when all null, variables match the field values.
4. **D.4** ACL integration test: a user with Equipment-level grant reads the `Identification` variables without needing a separate grant; a user without the Equipment grant gets `BadUserAccessDenied` on both the Equipment node + its Identification variables.
## Compliance Checks (run at exit gate)
- [ ] **UNS drag/move**: drag a line across areas; modal preview shows correct impacted-equipment + impacted-tag counts.
- [ ] **Equipment CSV**: 100-row CSV with 10 conflicts imports cleanly (preview flags each, commit rolls back mid-conflict).
- [ ] **5-identifier search**: querying any of the 5 IDs returns the matching row; ambiguous searches list options.
- [ ] **Diff viewer**: every section renders for a 2-generation diff with deliberate changes in every category.
- [ ] **OPC 40010 exposure**: Client.CLI browse shows `Identification` sub-folder when equipment has non-null columns; folder absent when all null.
- [ ] **ScadaLink visual parity**: operator-equivalence reviewer signs off that the new tabs feel consistent with existing Admin UI pages.
- [ ] **Concurrent-edit safety**: two-session test — session B saves a draft edit after session A opened the preview; session A's Confirm returns `409 Conflict / refresh-required` instead of overwriting.
- [ ] **Cross-cluster drop**: dropping equipment across cluster boundaries is disabled + shows actionable toast pointing to Export/Import workflow.
- [ ] **1000-node tree**: drag operations on a 1000-node seed maintain < 100 ms drag-enter feedback.
- [ ] **CSV header version**: file missing `# OtOpcUaCsv v1` first line is rejected pre-parse.
- [ ] **CSV canonical identifier set**: columns match decision #117 (ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid); drift from the earlier draft surfaces as a test failure.
- [ ] **Staged-import atomicity**: `FinaliseImportBatch` transaction bounded < 30 s for a 10k-row import; pre-finalize stagings visible only to the importing user; rollback via `DropImportBatch`.
- [ ] **Concurrent import + external reservation**: concurrent test — third party inserts to `ExternalIdReservation` mid-finalize; finalize retries with conflict handling; no corruption.
- [ ] **5-identifier search ranking**: exact matches outrank prefix matches; published outranks draft for equal scores.
- [ ] **Diff viewer section caps**: 2000-row subtree-rename diff renders as summary only; "Load full diff" streams in pages.
- [ ] **OPC 40010 field list match**: rendered field group matches decision #139 exactly; no extra fields.
- [ ] **OPC 40010 exposure**: Client.CLI browse shows `Identification` sub-folder when equipment has non-null columns; absent when all null.
- [ ] **ACL inheritance for Identification**: integration test — Equipment-grant user reads Identification; no-grant user gets `BadUserAccessDenied` on both.
- [ ] **Visual parity reviewer**: named role (`FleetAdmin` user, not the implementation lead) compares side-by-side against `admin-ui.md` §Visual-Design reference panels; signoff artefact is a checked-in screenshot set under `docs/v2/visual-compliance/phase-6-4/`.
## Risks and Mitigations

View File

@@ -909,6 +909,26 @@ Each step leaves the system runnable. The generic extraction is effectively free
| 140 | Enterprise shortname = `zb` (UNS level-1 segment) | Closes corrections-doc D4. Matches the existing `ZB.MOM.WW.*` namespace prefix used throughout the codebase; short by design since this segment appears in every equipment path (`zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState`); operators already say "ZB" colloquially. Admin UI cluster-create form default-prefills `zb` for the Enterprise field. Production deployments use it directly from cluster-create | 2026-04-17 |
| 141 | Tier 3 (AppServer IO) cutover is feasible — AVEVA's OI Gateway supports arbitrary upstream OPC UA servers as a documented pattern | Closes corrections-doc E2 with **GREEN-YELLOW** verdict. Multiple AVEVA partners (Software Toolbox, InSource) have published working integrations against four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, Cogent DataHub). No re-architecting of OtOpcUa required. Path: `OPC UA node → OI Gateway → SuiteLink → $DDESuiteLinkDIObject → AppServer attribute`. Recommended AppServer floor: System Platform 2023 R2 Patch 01. Two integrator-burden risks tracked: validation/GxP paperwork (no AVEVA blueprint exists for non-AVEVA upstream servers in Part 11 deployments) and unpublished scale benchmarks (in-house benchmark required before cutover scheduling). See `aveva-system-platform-io-research.md` | 2026-04-17 |
| 142 | Phase 1 acceptance includes an end-to-end AppServer-via-OI-Gateway smoke test against OtOpcUa | Catches AppServer-specific quirks (cert exchange via reject-and-trust workflow, endpoint URL must NOT include `/discovery` suffix per Inductive Automation forum failure mode, service-account install required because OI Gateway under SYSTEM cannot connect to remote OPC servers, `Basic256Sha256` + `SignAndEncrypt` + LDAP-username token combination must work end-to-end) early — well before the Year 3 tier-3 cutover schedule. Adds one task to `phase-1-configuration-and-admin-scaffold.md` Stream E (Admin smoke test) | 2026-04-17 |
| 143 | Polly per-capability policy — Read / HistoryRead / Discover / Probe / Alarm-subscribe auto-retry; Write does NOT auto-retry unless the tag metadata carries `[WriteIdempotent]` | Decisions #44-45 forbid auto-retry on Write because a timed-out write can succeed on the device + be replayed by the pipeline, duplicating pulses / alarm acks / counter increments / recipe-step advances. Per-capability policy in the shared Polly layer makes the retry safety story explicit; `WriteIdempotentAttribute` on tag definitions is the opt-in surface | 2026-04-19 |
| 144 | Polly pipeline key = `(DriverInstanceId, HostName)`, not DriverInstanceId alone | Decision #35 requires per-device isolation. One dead PLC behind a multi-device Modbus driver must NOT open the circuit breaker for healthy sibling hosts. Per-instance pipelines would poison every device behind one bad endpoint | 2026-04-19 |
| 145 | Tier A/B/C runtime enforcement splits into `MemoryTracking` (all tiers — soft/hard thresholds log + surface, NEVER kill) and `MemoryRecycle` (Tier C only — requires out-of-process topology). Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; the runtime never auto-kills an in-process driver | Decisions #73-74 reserve process-kill protections for Tier C. An in-process Tier A/B "recycle" would kill every OPC UA session + every other in-proc driver for one leaky instance, blast-radius worse than the leak | 2026-04-19 |
| 146 | Memory watchdog uses the hybrid formula `soft = max(multiplier × baseline, baseline + floor)`, with baseline captured as the median of the first 5 min of `GetMemoryFootprint()` samples post-InitializeAsync. Tier-specific constants: A multiplier=3 floor=50 MB, B multiplier=3 floor=100 MB, C multiplier=2 floor=500 MB. Hard = 2 × soft | Codex adversarial review on the Phase 6.1 plan flagged that hardcoded per-tier MB bands diverge from decision #70's specified formula. Static bands false-trigger on small-footprint drivers + miss meaningful growth on large ones. Observed-baseline + hybrid formula recovers the original intent | 2026-04-19 |
| 147 | `WedgeDetector` uses demand-aware criteria `(state==Healthy AND hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` = (Polly bulkhead depth > 0) OR (active MonitoredItem count > 0) OR (queued historian read count > 0). Idle + subscription-only + write-only-burst drivers stay Healthy without false-fault | Previous "no successful Read in N intervals" formulation flipped legitimate idle subscribers, slow historian backfills, and write-heavy drivers to Faulted. The demand-aware check only fires when the driver claims work is outstanding | 2026-04-19 |
| 148 | LiteDB config cache is **generation-sealed**: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file; cache reads serve the last-known-sealed generation. Mixed-generation reads are impossible | Prior "refresh on every successful query" cache could serve LDAP role mapping from one generation alongside UNS topology from another, producing impossible states. Sealed-snapshot invariant keeps cache-served reads coherent with a real published state | 2026-04-19 |
| 149 | `AuthorizationDecision { Allow \| NotGranted \| Denied, IReadOnlyList<MatchedGrant> Provenance }` — tri-state internal model. Phase 6.2 only produces `Allow` + `NotGranted` (grant-only semantics per decision #129); v2.1 Deny widens without API break | bool return would collapse `no-matching-grant` and `explicit-deny` into the same runtime state + UI explanation; provenance record is needed for the audit log anyway. Making the shape tri-state from Phase 6.2 avoids a breaking change in v2.1 | 2026-04-19 |
| 150 | Data-plane ACL evaluator consumes `NodeAcl` rows joined against the session's resolved LDAP group memberships. `LdapGroupRoleMapping` (decision #105) is control-plane only — routes LDAP groups to Admin UI roles. Zero runtime overlap between the two | Codex adversarial review flagged that Phase 6.2 draft conflated the two — building the data-plane trie from `LdapGroupRoleMapping` would let a user inherit tag permissions from an admin-role claim path never intended as a data-path grant | 2026-04-19 |
| 151 | `UserAuthorizationState` cached per session but bounded by `MembershipFreshnessInterval` (default 15 min). Past that interval the next hot-path authz call re-resolves LDAP group memberships; failure to re-resolve (LDAP unreachable) → fail-closed (evaluator returns `NotGranted` until memberships refresh successfully) | Previous design cached memberships until session close, so a user removed from a privileged LDAP group could keep authorized access for hours. Bounded freshness + fail-closed covers the revoke-takes-effect story | 2026-04-19 |
| 152 | Auth cache has its own staleness budget `AuthCacheMaxStaleness` (default 5 min), independent of decision #36's availability-oriented config cache (24 h). Past 5 min on authorization data, evaluator fails closed regardless of whether the underlying config is still serving from cache | Availability-oriented caches trade correctness for uptime. Authorization data is correctness-sensitive — stale ACLs silently extend revoked access. Auth-specific budget keeps the two concerns from colliding | 2026-04-19 |
| 153 | MonitoredItem carries `(AuthGenerationId, MembershipVersion)` stamp at create time. On every Publish, items with a mismatching stamp re-evaluate; unchanged items stay fast-path. Revoked items drop to `BadUserAccessDenied` within one publish cycle | Create-time-only authorization leaves revoked users receiving data forever; per-publish re-authorization at 100 ms cadence across 50 groups × 6 levels is too expensive. Stamp-then-reevaluate-on-change balances correctness with cost | 2026-04-19 |
| 154 | ServiceLevel reserves `0` for operator-declared maintenance only; `1` = NoData (unreachable / Faulted); operational states occupy `2..255` in an 8-state matrix (Authoritative-Primary=255, Isolated-Primary=230, Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80, Backup-Mid-Apply=50, Recovering-Backup=30, InvalidTopology=2) | OPC UA Part 5 §6.3.34 defines `0=Maintenance` + `1=NoData`; using `0` for our Faulted case collides with spec + triggers spec-compliant clients to enter maintenance-mode cutover. Expanded 8-state matrix covers operational states the 5-state original collapsed together (e.g. Isolated-Primary vs Primary-Mid-Apply were both 200) | 2026-04-19 |
| 155 | `ServerUriArray` includes self + peers (self first, deterministic ordering), per OPC UA Part 4 §6.6.2.2 | Previous design excluded self from the array — spec violation + clients lose the ability to map server identities consistently during failover | 2026-04-19 |
| 156 | Redundancy peer health uses a two-layer probe: `/healthz` (2 s) as fast-fail + `UaHealthProbe` (10 s, opens OPC UA client session to peer + reads its `ServiceLevel` node) as the authority signal. HTTP-healthy ≠ UA-authoritative | `/healthz` returns 200 whenever HTTP + config DB/cache is healthy — but a peer can be HTTP-healthy with a broken OPC UA endpoint or a stuck subscription publisher. Using HTTP alone would advertise authority against servers that can't actually publish data | 2026-04-19 |
| 157 | Publish-generation fencing — coordinator CAS on a monotonic `ConfigGenerationId`; every topology + role decision is generation-stamped; peers reject state propagated from a lower generation. Runtime `InvalidTopology` state (both self-demote to ServiceLevel 2) when >1 Primary detected post-startup | Operator race publishing two drafts with different roles can produce two locally-valid views; without fencing + runtime containment both nodes can serve as Primary until manual intervention | 2026-04-19 |
| 158 | Apply-window uses named leases keyed to `(ConfigGenerationId, PublishRequestId)` via `await using`. `ApplyLeaseWatchdog` auto-closes any lease older than `ApplyMaxDuration` (default 10 min) | Simple `IDisposable`-counter design leaks on cancellation / async-ownership races; a stuck positive count leaves the node permanently mid-apply. Generation-keyed leases + watchdog bound worst case | 2026-04-19 |
| 159 | CSV import header row must start with `# OtOpcUaCsv v1` (version marker). Future shape changes bump the version; parser forks per version. Canonical identifier columns follow decision #117: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid` | Without a version marker the CSV schema has no upgrade path — adding a required column breaks every old export silently. The version prefix makes parser dispatch explicit + future-compatible | 2026-04-19 |
| 160 | Equipment CSV import uses a staged-import pattern: `EquipmentImportBatch` + `EquipmentImportRow` tables receive chunked inserts; `FinaliseImportBatch` is one atomic transaction that applies accepted rows to `Equipment` + `ExternalIdReservation`. Rollback = drop the batch row; `Equipment` never partially mutates | 10k-row single-transaction import holds locks too long; chunked direct writes lose all-or-nothing rollback. Staging + atomic finalize bounds transaction duration + preserves rollback semantics | 2026-04-19 |
| 161 | UNS drag-reorder impact preview carries a `DraftRevisionToken`; Confirm re-checks against the current draft + returns `409 Conflict / refresh-required` if the draft advanced between preview and commit | Without concurrency control, two operators editing the same draft can overwrite each other's changes silently. Draft-revision token + 409 response makes the race visible + forces refresh | 2026-04-19 |
| 162 | OPC 40010 Identification sub-folder exposed under each equipment node inherits the Equipment scope's ACL grants — the ACL trie does NOT add a new scope level for Identification | Adding a new scope level for Identification would require every grant to add a second grant for `Equipment/Identification`; inheriting the Equipment scope keeps the grant model flat + prevents operator-forgot-to-grant-Identification access surprises | 2026-04-19 |
## Reference Documents

View File

@@ -0,0 +1,79 @@
<#
.SYNOPSIS
Phase 6.1 exit-gate compliance check — stub. Each `Assert-*` either passes
(Write-Host green) or throws. Non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.1 (Resilience & Observability runtime) completion. Checks
enumerated in `docs/v2/implementation/phase-6-1-resilience-and-observability.md`
§"Compliance Checks (run at exit gate)".
Current status: SCAFFOLD. Every check writes a TODO line and does NOT throw.
Each implementation task in Phase 6.1 is responsible for replacing its TODO
with a real check before closing that task.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-1-compliance.ps1
Exit: 0 = all checks passed (or are still TODO); non-zero = explicit fail
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
function Assert-Todo {
param([string]$Check, [string]$ImplementationTask)
Write-Host " [TODO] $Check (implement during $ImplementationTask)" -ForegroundColor Yellow
}
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check$Reason" -ForegroundColor Red
$script:failures++
}
Write-Host ""
Write-Host "=== Phase 6.1 compliance — Resilience & Observability runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A — Resilience layer"
Assert-Todo "Invoker coverage — every capability-interface method routes through CapabilityInvoker (analyzer error-level)" "Stream A.3"
Assert-Todo "Write-retry guard — writes without [WriteIdempotent] never retry" "Stream A.5"
Assert-Todo "Pipeline isolation — `(DriverInstanceId, HostName)` key; one dead host does not open breaker for siblings" "Stream A.5"
Write-Host ""
Write-Host "Stream B — Tier A/B/C runtime"
Assert-Todo "Tier registry — every driver type has non-null Tier; Tier C declares out-of-process topology" "Stream B.1"
Assert-Todo "MemoryTracking never kills — soft/hard breach on Tier A/B logs + surfaces without terminating" "Stream B.6"
Assert-Todo "MemoryRecycle Tier C only — hard breach on Tier A never invokes supervisor; Tier C does" "Stream B.6"
Assert-Todo "Wedge demand-aware — idle/historic-backfill/write-only cases stay Healthy" "Stream B.6"
Assert-Todo "Galaxy supervisor preserved — Driver.Galaxy.Proxy/Supervisor/CircuitBreaker + Backoff still present + invoked" "Stream A.4"
Write-Host ""
Write-Host "Stream C — Health + logging"
Assert-Todo "Health state machine — /healthz + /readyz respond < 500 ms for every DriverState per matrix in plan" "Stream C.4"
Assert-Todo "Structured log — CI grep asserts DriverInstanceId + CorrelationId JSON fields present" "Stream C.4"
Write-Host ""
Write-Host "Stream D — LiteDB cache"
Assert-Todo "Generation-sealed snapshot — SQL kill mid-op serves last-sealed snapshot; UsingStaleConfig=true" "Stream D.4"
Assert-Todo "Mixed-generation guard — corruption of snapshot file fails closed; no mixed reads" "Stream D.4"
Assert-Todo "First-boot no-snapshot + DB-down — InitializeAsync fails with clear error" "Stream D.4"
Write-Host ""
Write-Host "Cross-cutting"
Assert-Todo "No test-count regression — dotnet test ZB.MOM.WW.OtOpcUa.slnx count ≥ pre-Phase-6.1 baseline" "Final exit-gate"
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.1 compliance: scaffold-mode PASS (all checks TODO)" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.1 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,81 @@
<#
.SYNOPSIS
Phase 6.2 exit-gate compliance check — stub. Each `Assert-*` either passes
(Write-Host green) or throws. Non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.2 (Authorization runtime) completion. Checks enumerated
in `docs/v2/implementation/phase-6-2-authorization-runtime.md`
§"Compliance Checks (run at exit gate)".
Current status: SCAFFOLD. Every check writes a TODO line and does NOT throw.
Each implementation task in Phase 6.2 is responsible for replacing its TODO
with a real check before closing that task.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-2-compliance.ps1
Exit: 0 = all checks passed (or are still TODO); non-zero = explicit fail
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
function Assert-Todo {
param([string]$Check, [string]$ImplementationTask)
Write-Host " [TODO] $Check (implement during $ImplementationTask)" -ForegroundColor Yellow
}
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check$Reason" -ForegroundColor Red
$script:failures++
}
Write-Host ""
Write-Host "=== Phase 6.2 compliance — Authorization runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A — LdapGroupRoleMapping (control plane)"
Assert-Todo "Control/data-plane separation — Core.Authorization has zero refs to LdapGroupRoleMapping" "Stream A.2"
Assert-Todo "Authoring validation — AclsTab rejects duplicate (LdapGroup, Scope) pre-save" "Stream A.3"
Write-Host ""
Write-Host "Stream B — Evaluator + trie + cache"
Assert-Todo "Trie invariants — PermissionTrieBuilder idempotent (build twice == equal)" "Stream B.1"
Assert-Todo "Additive grants + cluster isolation — cross-cluster leakage impossible" "Stream B.1"
Assert-Todo "Galaxy FolderSegment coverage — folder-subtree grant cascades; siblings unaffected" "Stream B.2"
Assert-Todo "Redundancy-safe invalidation — generation-mismatch forces trie re-load on peer" "Stream B.4"
Assert-Todo "Membership freshness — 15 min interval elapsed + LDAP down = fail-closed" "Stream B.5"
Assert-Todo "Auth cache fail-closed — 5 min AuthCacheMaxStaleness exceeded = NotGranted" "Stream B.5"
Assert-Todo "AuthorizationDecision shape — Allow + NotGranted only; Denied variant exists unused" "Stream B.6"
Write-Host ""
Write-Host "Stream C — OPC UA operation wiring"
Assert-Todo "Every operation wired — Browse/Read/Write/HistoryRead/HistoryUpdate/CreateMonitoredItems/TransferSubscriptions/Call/Ack/Confirm/Shelve" "Stream C.1-C.7"
Assert-Todo "HistoryRead uses its own flag — Read+no-HistoryRead denies HistoryRead" "Stream C.3"
Assert-Todo "Mixed-batch semantics — 3 allowed + 2 denied returns per-item status, no coarse failure" "Stream C.6"
Assert-Todo "Browse ancestor visibility — deep grant implies ancestor browse; denied ancestors filter" "Stream C.7"
Assert-Todo "Subscription re-authorization — revoked grant surfaces BadUserAccessDenied in one publish" "Stream C.5"
Write-Host ""
Write-Host "Stream D — Admin UI + SignalR invalidation"
Assert-Todo "SignalR invalidation — sp_PublishGeneration pushes PermissionTrieCache invalidate < 2 s" "Stream D.4"
Write-Host ""
Write-Host "Cross-cutting"
Assert-Todo "No test-count regression — dotnet test ZB.MOM.WW.OtOpcUa.slnx count ≥ pre-Phase-6.2 baseline" "Final exit-gate"
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.2 compliance: scaffold-mode PASS (all checks TODO)" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.2 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,85 @@
<#
.SYNOPSIS
Phase 6.3 exit-gate compliance check — stub. Each `Assert-*` either passes
(Write-Host green) or throws. Non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.3 (Redundancy runtime) completion. Checks enumerated in
`docs/v2/implementation/phase-6-3-redundancy-runtime.md`
§"Compliance Checks (run at exit gate)".
Current status: SCAFFOLD. Every check writes a TODO line and does NOT throw.
Each implementation task in Phase 6.3 is responsible for replacing its TODO
with a real check before closing that task.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-3-compliance.ps1
Exit: 0 = all checks passed (or are still TODO); non-zero = explicit fail
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
function Assert-Todo {
param([string]$Check, [string]$ImplementationTask)
Write-Host " [TODO] $Check (implement during $ImplementationTask)" -ForegroundColor Yellow
}
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check$Reason" -ForegroundColor Red
$script:failures++
}
Write-Host ""
Write-Host "=== Phase 6.3 compliance — Redundancy runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A — Topology loader"
Assert-Todo "Transparent-mode rejection — sp_PublishGeneration blocks RedundancyMode=Transparent" "Stream A.3"
Write-Host ""
Write-Host "Stream B — Peer probe + ServiceLevel calculator"
Assert-Todo "OPC UA band compliance — 0=Maintenance / 1=NoData reserved; operational 2..255" "Stream B.2"
Assert-Todo "Authoritative-Primary ServiceLevel = 255" "Stream B.2"
Assert-Todo "Isolated-Primary (peer unreachable, self serving) = 230" "Stream B.2"
Assert-Todo "Primary-Mid-Apply = 200" "Stream B.2"
Assert-Todo "Recovering-Primary = 180 with dwell + publish witness enforced" "Stream B.2"
Assert-Todo "Authoritative-Backup = 100" "Stream B.2"
Assert-Todo "Isolated-Backup (primary unreachable) = 80 — no auto-promote" "Stream B.2"
Assert-Todo "InvalidTopology = 2 — >1 Primary self-demotes both nodes" "Stream B.2"
Assert-Todo "UaHealthProbe authority — HTTP-200 + UA-down peer treated as UA-unhealthy" "Stream B.1"
Write-Host ""
Write-Host "Stream C — OPC UA node wiring"
Assert-Todo "ServerUriArray — returns self + peer URIs, self first" "Stream C.2"
Assert-Todo "Client.CLI cutover — primary halt triggers reconnect to backup via ServerUriArray" "Stream C.4"
Write-Host ""
Write-Host "Stream D — Apply-lease + publish fencing"
Assert-Todo "Apply-lease disposal — leases close on exception, cancellation, watchdog timeout" "Stream D.2"
Assert-Todo "Role transition via operator publish — no restart; both nodes flip ServiceLevel on publish confirm" "Stream D.3"
Write-Host ""
Write-Host "Stream F — Interop matrix"
Assert-Todo "Client interoperability matrix — Ignition 8.1/8.3 / Kepware / Aveva OI Gateway findings documented" "Stream F.1-F.2"
Assert-Todo "Galaxy MXAccess failover — primary kill; Galaxy consumer reconnects within session-timeout budget" "Stream F.3"
Write-Host ""
Write-Host "Cross-cutting"
Assert-Todo "No regression in driver test suites; /healthz reachable under redundancy load" "Final exit-gate"
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.3 compliance: scaffold-mode PASS (all checks TODO)" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.3 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,83 @@
<#
.SYNOPSIS
Phase 6.4 exit-gate compliance check — stub. Each `Assert-*` either passes
(Write-Host green) or throws. Non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.4 (Admin UI completion) completion. Checks enumerated in
`docs/v2/implementation/phase-6-4-admin-ui-completion.md`
§"Compliance Checks (run at exit gate)".
Current status: SCAFFOLD. Every check writes a TODO line and does NOT throw.
Each implementation task in Phase 6.4 is responsible for replacing its TODO
with a real check before closing that task.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-4-compliance.ps1
Exit: 0 = all checks passed (or are still TODO); non-zero = explicit fail
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
function Assert-Todo {
param([string]$Check, [string]$ImplementationTask)
Write-Host " [TODO] $Check (implement during $ImplementationTask)" -ForegroundColor Yellow
}
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check$Reason" -ForegroundColor Red
$script:failures++
}
Write-Host ""
Write-Host "=== Phase 6.4 compliance — Admin UI completion ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A — UNS drag/move + impact preview"
Assert-Todo "UNS drag/move — drag line across areas; modal shows correct impacted-equipment + tag counts" "Stream A.2"
Assert-Todo "Concurrent-edit safety — session B saves draft mid-preview; session A Confirm returns 409" "Stream A.3 (DraftRevisionToken)"
Assert-Todo "Cross-cluster drop disabled — actionable toast points to Export/Import" "Stream A.2"
Assert-Todo "1000-node tree — drag-enter feedback < 100 ms" "Stream A.4"
Write-Host ""
Write-Host "Stream B — CSV import + staged-import + 5-identifier search"
Assert-Todo "CSV header version — file missing '# OtOpcUaCsv v1' rejected pre-parse" "Stream B.1"
Assert-Todo "CSV canonical identifier set — columns match decision #117 exactly" "Stream B.1"
Assert-Todo "Staged-import atomicity — 10k-row FinaliseImportBatch < 30 s; user-scoped visibility; DropImportBatch rollback" "Stream B.3"
Assert-Todo "Concurrent import + external reservation — finalize retries with conflict handling; no corruption" "Stream B.3"
Assert-Todo "5-identifier search ranking — exact > prefix; published > draft for equal scores" "Stream B.4"
Write-Host ""
Write-Host "Stream C — DiffViewer sections"
Assert-Todo "Diff viewer section caps — 2000-row subtree-rename summary-only; 'Load full diff' paginates" "Stream C.2"
Write-Host ""
Write-Host "Stream D — Identification (OPC 40010)"
Assert-Todo "OPC 40010 field list match — rendered fields match decision #139 exactly; no extras" "Stream D.1"
Assert-Todo "OPC 40010 exposure — Identification sub-folder shows when non-null; absent when all null" "Stream D.3"
Assert-Todo "ACL inheritance for Identification — Equipment-grant reads; no-grant denies both" "Stream D.4"
Write-Host ""
Write-Host "Visual compliance"
Assert-Todo "Visual parity reviewer — FleetAdmin signoff vs admin-ui.md §Visual-Design; screenshot set checked in under docs/v2/visual-compliance/phase-6-4/" "Visual review"
Write-Host ""
Write-Host "Cross-cutting"
Assert-Todo "Full solution dotnet test passes; no test-count regression vs pre-Phase-6.4 baseline" "Final exit-gate"
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.4 compliance: scaffold-mode PASS (all checks TODO)" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.4 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -25,6 +25,14 @@ namespace ZB.MOM.WW.OtOpcUa.Core.Abstractions;
/// OPC UA <c>AlarmConditionState</c> when true. Defaults to false so existing non-Galaxy
/// drivers aren't forced to flow a flag they don't produce.
/// </param>
/// <param name="WriteIdempotent">
/// True when a timed-out or failed write to this attribute is safe to replay. Per
/// <c>docs/v2/plan.md</c> decisions #44, #45, #143 — writes are NOT auto-retried by default
/// because replaying a pulse / alarm-ack / counter-increment / recipe-step advance can
/// duplicate field actions. Drivers flag only tags whose semantics make retry safe
/// (holding registers with level-set values, set-point writes to analog tags) — the
/// capability invoker respects this flag when deciding whether to apply Polly retry.
/// </param>
public sealed record DriverAttributeInfo(
string FullName,
DriverDataType DriverDataType,
@@ -32,4 +40,5 @@ public sealed record DriverAttributeInfo(
uint? ArrayDim,
SecurityClassification SecurityClass,
bool IsHistorized,
bool IsAlarm = false);
bool IsAlarm = false,
bool WriteIdempotent = false);

View File

@@ -0,0 +1,42 @@
namespace ZB.MOM.WW.OtOpcUa.Core.Abstractions;
/// <summary>
/// Enumerates the driver-capability surface points guarded by Phase 6.1 resilience pipelines.
/// Each value corresponds to one method (or tightly-related method group) on the
/// <c>Core.Abstractions</c> capability interfaces (<see cref="IReadable"/>, <see cref="IWritable"/>,
/// <see cref="ITagDiscovery"/>, <see cref="ISubscribable"/>, <see cref="IHostConnectivityProbe"/>,
/// <see cref="IAlarmSource"/>, <see cref="IHistoryProvider"/>).
/// </summary>
/// <remarks>
/// Per <c>docs/v2/plan.md</c> decision #143 (per-capability retry policy): Read / HistoryRead /
/// Discover / Probe / AlarmSubscribe auto-retry; <see cref="Write"/> does NOT retry unless the
/// tag-definition carries <see cref="WriteIdempotentAttribute"/>. Alarm-acknowledge is treated
/// as a write for retry semantics (an alarm-ack is not idempotent at the plant-floor acknowledgement
/// level even if the OPC UA spec permits re-issue).
/// </remarks>
public enum DriverCapability
{
/// <summary>Batch <see cref="IReadable.ReadAsync"/>. Retries by default.</summary>
Read,
/// <summary>Batch <see cref="IWritable.WriteAsync"/>. Does not retry unless tag is <see cref="WriteIdempotentAttribute">idempotent</see>.</summary>
Write,
/// <summary><see cref="ITagDiscovery.DiscoverAsync"/>. Retries by default.</summary>
Discover,
/// <summary><see cref="ISubscribable.SubscribeAsync"/> and unsubscribe. Retries by default.</summary>
Subscribe,
/// <summary><see cref="IHostConnectivityProbe"/> probe loop. Retries by default.</summary>
Probe,
/// <summary><see cref="IAlarmSource.SubscribeAlarmsAsync"/>. Retries by default.</summary>
AlarmSubscribe,
/// <summary><see cref="IAlarmSource.AcknowledgeAsync"/>. Does NOT retry — ack is a write-shaped operation (decision #143).</summary>
AlarmAcknowledge,
/// <summary><see cref="IHistoryProvider"/> reads (Raw/Processed/AtTime/Events). Retries by default.</summary>
HistoryRead,
}

View File

@@ -0,0 +1,34 @@
namespace ZB.MOM.WW.OtOpcUa.Core.Abstractions;
/// <summary>
/// Stability tier of a driver type. Determines which cross-cutting runtime protections
/// apply — per-tier retry defaults, memory-tracking thresholds, and whether out-of-process
/// supervision with process-level recycle is in play.
/// </summary>
/// <remarks>
/// Per <c>docs/v2/driver-stability.md</c> §2-4 and <c>docs/v2/plan.md</c> decisions #63-74.
///
/// <list type="bullet">
/// <item><b>A</b> — managed, known-good SDK; low blast radius. In-process. Fast retries.
/// Examples: OPC UA Client (OPCFoundation stack), S7 (S7NetPlus).</item>
/// <item><b>B</b> — native or semi-trusted SDK with an in-process footprint. Examples: Modbus.</item>
/// <item><b>C</b> — unmanaged SDK with COM/STA constraints, leak risk, or other out-of-process
/// requirements. Must run as a separate Host process behind a Proxy with a supervisor that
/// can recycle the process on hard-breach. Example: Galaxy (MXAccess COM).</item>
/// </list>
///
/// <para>Process-kill protections (<c>MemoryRecycle</c>, <c>ScheduledRecycleScheduler</c>) are
/// Tier C only per decisions #73-74 and #145 — killing an in-process Tier A/B driver also kills
/// every OPC UA session and every co-hosted driver, blast-radius worse than the leak.</para>
/// </remarks>
public enum DriverTier
{
/// <summary>Managed SDK, in-process, low blast radius.</summary>
A,
/// <summary>Native or semi-trusted SDK, in-process.</summary>
B,
/// <summary>Unmanaged SDK, out-of-process required with Proxy+Host+Supervisor.</summary>
C,
}

View File

@@ -69,12 +69,20 @@ public sealed class DriverTypeRegistry
/// <param name="DriverConfigJsonSchema">JSON Schema (Draft 2020-12) the driver's <c>DriverConfig</c> column must validate against.</param>
/// <param name="DeviceConfigJsonSchema">JSON Schema for <c>DeviceConfig</c> (multi-device drivers); null if the driver has no device layer.</param>
/// <param name="TagConfigJsonSchema">JSON Schema for <c>TagConfig</c>; required for every driver since every driver has tags.</param>
/// <param name="Tier">
/// Stability tier per <c>docs/v2/driver-stability.md</c> §2-4 and <c>docs/v2/plan.md</c>
/// decisions #63-74. Drives the shared resilience pipeline defaults
/// (<see cref="Tier"/> × capability → <c>CapabilityPolicy</c>), the <c>MemoryTracking</c>
/// hybrid-formula constants, and whether process-level <c>MemoryRecycle</c> / scheduled-
/// recycle protections apply (Tier C only). Every registered driver type must declare one.
/// </param>
public sealed record DriverTypeMetadata(
string TypeName,
NamespaceKindCompatibility AllowedNamespaceKinds,
string DriverConfigJsonSchema,
string? DeviceConfigJsonSchema,
string TagConfigJsonSchema);
string TagConfigJsonSchema,
DriverTier Tier);
/// <summary>Bitmask of namespace kinds a driver type may populate. Per decision #111.</summary>
[Flags]

View File

@@ -0,0 +1,26 @@
namespace ZB.MOM.WW.OtOpcUa.Core.Abstractions;
/// <summary>
/// Process-level supervisor contract a Tier C driver's out-of-process topology provides
/// (e.g. <c>Driver.Galaxy.Proxy/Supervisor/</c>). Concerns: restart the Host process when a
/// hard fault is detected (memory breach, wedge, scheduled recycle window).
/// </summary>
/// <remarks>
/// Per <c>docs/v2/plan.md</c> decisions #68, #73-74, and #145. Tier A/B drivers do NOT have
/// a supervisor because they run in-process — recycling would kill every OPC UA session and
/// every co-hosted driver. The Core.Stability layer only invokes this interface for Tier C
/// instances after asserting the tier via <see cref="DriverTypeMetadata.Tier"/>.
/// </remarks>
public interface IDriverSupervisor
{
/// <summary>Driver instance this supervisor governs.</summary>
string DriverInstanceId { get; }
/// <summary>
/// Request the supervisor to recycle (terminate + restart) the Host process. Implementations
/// are expected to be idempotent under repeat calls during an in-flight recycle.
/// </summary>
/// <param name="reason">Human-readable reason — flows into the supervisor's logs.</param>
/// <param name="cancellationToken">Cancels the recycle request; an in-flight restart is not interrupted.</param>
Task RecycleAsync(string reason, CancellationToken cancellationToken);
}

View File

@@ -0,0 +1,19 @@
namespace ZB.MOM.WW.OtOpcUa.Core.Abstractions;
/// <summary>
/// Opts a tag-definition record into auto-retry on <see cref="IWritable.WriteAsync"/> failures.
/// Absence of this attribute means writes are <b>not</b> retried — a timed-out write may have
/// already succeeded at the device, and replaying pulses, alarm acks, counter increments, or
/// recipe-step advances can duplicate irreversible field actions.
/// </summary>
/// <remarks>
/// Per <c>docs/v2/plan.md</c> decisions #44, #45, and #143. Applied to tag-definition POCOs
/// (e.g. <c>ModbusTagDefinition</c>, <c>S7TagDefinition</c>, OPC UA client tag rows) at the
/// property or record level. The <c>CapabilityInvoker</c> in <c>ZB.MOM.WW.OtOpcUa.Core.Resilience</c>
/// reads this attribute via reflection once at driver-init time and caches the result; no
/// per-write reflection cost.
/// </remarks>
[AttributeUsage(AttributeTargets.Property | AttributeTargets.Class | AttributeTargets.Struct, AllowMultiple = false, Inherited = true)]
public sealed class WriteIdempotentAttribute : Attribute
{
}

View File

@@ -0,0 +1,106 @@
using Polly;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Resilience;
/// <summary>
/// Executes driver-capability calls through a shared Polly pipeline. One invoker per
/// <c>(DriverInstance, IDriver)</c> pair; the underlying <see cref="DriverResiliencePipelineBuilder"/>
/// is process-singleton so all invokers share its cache.
/// </summary>
/// <remarks>
/// Per <c>docs/v2/plan.md</c> decisions #143-144 and Phase 6.1 Stream A.3. The server's dispatch
/// layer routes every capability call (<c>IReadable.ReadAsync</c>, <c>IWritable.WriteAsync</c>,
/// <c>ITagDiscovery.DiscoverAsync</c>, <c>ISubscribable.SubscribeAsync/UnsubscribeAsync</c>,
/// <c>IHostConnectivityProbe</c> probe loop, <c>IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync</c>,
/// and all four <c>IHistoryProvider</c> reads) through this invoker.
/// </remarks>
public sealed class CapabilityInvoker
{
private readonly DriverResiliencePipelineBuilder _builder;
private readonly string _driverInstanceId;
private readonly Func<DriverResilienceOptions> _optionsAccessor;
/// <summary>
/// Construct an invoker for one driver instance.
/// </summary>
/// <param name="builder">Shared, process-singleton pipeline builder.</param>
/// <param name="driverInstanceId">The <c>DriverInstance.Id</c> column value.</param>
/// <param name="optionsAccessor">
/// Snapshot accessor for the current resilience options. Invoked per call so Admin-edit +
/// pipeline-invalidate can take effect without restarting the invoker.
/// </param>
public CapabilityInvoker(
DriverResiliencePipelineBuilder builder,
string driverInstanceId,
Func<DriverResilienceOptions> optionsAccessor)
{
ArgumentNullException.ThrowIfNull(builder);
ArgumentNullException.ThrowIfNull(optionsAccessor);
_builder = builder;
_driverInstanceId = driverInstanceId;
_optionsAccessor = optionsAccessor;
}
/// <summary>Execute a capability call returning a value, honoring the per-capability pipeline.</summary>
/// <typeparam name="TResult">Return type of the underlying driver call.</typeparam>
public async ValueTask<TResult> ExecuteAsync<TResult>(
DriverCapability capability,
string hostName,
Func<CancellationToken, ValueTask<TResult>> callSite,
CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(callSite);
var pipeline = ResolvePipeline(capability, hostName);
return await pipeline.ExecuteAsync(callSite, cancellationToken).ConfigureAwait(false);
}
/// <summary>Execute a void-returning capability call, honoring the per-capability pipeline.</summary>
public async ValueTask ExecuteAsync(
DriverCapability capability,
string hostName,
Func<CancellationToken, ValueTask> callSite,
CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(callSite);
var pipeline = ResolvePipeline(capability, hostName);
await pipeline.ExecuteAsync(callSite, cancellationToken).ConfigureAwait(false);
}
/// <summary>
/// Execute a <see cref="DriverCapability.Write"/> call honoring <see cref="WriteIdempotentAttribute"/>
/// semantics — if <paramref name="isIdempotent"/> is <c>false</c>, retries are disabled regardless
/// of the tag-level configuration (the pipeline for a non-idempotent write never retries per
/// decisions #44-45). If <c>true</c>, the call runs through the capability's pipeline which may
/// retry when the tier configuration permits.
/// </summary>
public async ValueTask<TResult> ExecuteWriteAsync<TResult>(
string hostName,
bool isIdempotent,
Func<CancellationToken, ValueTask<TResult>> callSite,
CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(callSite);
if (!isIdempotent)
{
var noRetryOptions = _optionsAccessor() with
{
CapabilityPolicies = new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Write] = _optionsAccessor().Resolve(DriverCapability.Write) with { RetryCount = 0 },
},
};
var pipeline = _builder.GetOrCreate(_driverInstanceId, $"{hostName}::non-idempotent", DriverCapability.Write, noRetryOptions);
return await pipeline.ExecuteAsync(callSite, cancellationToken).ConfigureAwait(false);
}
return await ExecuteAsync(DriverCapability.Write, hostName, callSite, cancellationToken).ConfigureAwait(false);
}
private ResiliencePipeline ResolvePipeline(DriverCapability capability, string hostName) =>
_builder.GetOrCreate(_driverInstanceId, hostName, capability, _optionsAccessor());
}

View File

@@ -0,0 +1,96 @@
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Resilience;
/// <summary>
/// Per-tier × per-capability resilience policy configuration for a driver instance.
/// Bound from <c>DriverInstance.ResilienceConfig</c> JSON (nullable column; null = tier defaults).
/// Per <c>docs/v2/plan.md</c> decisions #143 and #144.
/// </summary>
public sealed record DriverResilienceOptions
{
/// <summary>Tier the owning driver type is registered as; drives the default map.</summary>
public required DriverTier Tier { get; init; }
/// <summary>
/// Per-capability policy overrides. Capabilities absent from this map fall back to
/// <see cref="GetTierDefaults(DriverTier)"/> for the configured <see cref="Tier"/>.
/// </summary>
public IReadOnlyDictionary<DriverCapability, CapabilityPolicy> CapabilityPolicies { get; init; }
= new Dictionary<DriverCapability, CapabilityPolicy>();
/// <summary>Bulkhead (max concurrent in-flight calls) for every capability. Default 32.</summary>
public int BulkheadMaxConcurrent { get; init; } = 32;
/// <summary>
/// Bulkhead queue depth. Zero = no queueing; overflow fails fast with
/// <c>BulkheadRejectedException</c>. Default 64.
/// </summary>
public int BulkheadMaxQueue { get; init; } = 64;
/// <summary>
/// Look up the effective policy for a capability, falling back to tier defaults when no
/// override is configured. Never returns null.
/// </summary>
public CapabilityPolicy Resolve(DriverCapability capability)
{
if (CapabilityPolicies.TryGetValue(capability, out var policy))
return policy;
var defaults = GetTierDefaults(Tier);
return defaults[capability];
}
/// <summary>
/// Per-tier per-capability default policy table, per decisions #143-144 and the Phase 6.1
/// Stream A.2 specification. Retries skipped on <see cref="DriverCapability.Write"/> and
/// <see cref="DriverCapability.AlarmAcknowledge"/> regardless of tier.
/// </summary>
public static IReadOnlyDictionary<DriverCapability, CapabilityPolicy> GetTierDefaults(DriverTier tier) =>
tier switch
{
DriverTier.A => new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Read] = new(TimeoutSeconds: 2, RetryCount: 3, BreakerFailureThreshold: 5),
[DriverCapability.Write] = new(TimeoutSeconds: 2, RetryCount: 0, BreakerFailureThreshold: 5),
[DriverCapability.Discover] = new(TimeoutSeconds: 30, RetryCount: 2, BreakerFailureThreshold: 3),
[DriverCapability.Subscribe] = new(TimeoutSeconds: 5, RetryCount: 3, BreakerFailureThreshold: 5),
[DriverCapability.Probe] = new(TimeoutSeconds: 2, RetryCount: 3, BreakerFailureThreshold: 5),
[DriverCapability.AlarmSubscribe] = new(TimeoutSeconds: 5, RetryCount: 3, BreakerFailureThreshold: 5),
[DriverCapability.AlarmAcknowledge] = new(TimeoutSeconds: 5, RetryCount: 0, BreakerFailureThreshold: 5),
[DriverCapability.HistoryRead] = new(TimeoutSeconds: 30, RetryCount: 2, BreakerFailureThreshold: 5),
},
DriverTier.B => new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Read] = new(TimeoutSeconds: 4, RetryCount: 3, BreakerFailureThreshold: 5),
[DriverCapability.Write] = new(TimeoutSeconds: 4, RetryCount: 0, BreakerFailureThreshold: 5),
[DriverCapability.Discover] = new(TimeoutSeconds: 60, RetryCount: 2, BreakerFailureThreshold: 3),
[DriverCapability.Subscribe] = new(TimeoutSeconds: 8, RetryCount: 3, BreakerFailureThreshold: 5),
[DriverCapability.Probe] = new(TimeoutSeconds: 4, RetryCount: 3, BreakerFailureThreshold: 5),
[DriverCapability.AlarmSubscribe] = new(TimeoutSeconds: 8, RetryCount: 3, BreakerFailureThreshold: 5),
[DriverCapability.AlarmAcknowledge] = new(TimeoutSeconds: 8, RetryCount: 0, BreakerFailureThreshold: 5),
[DriverCapability.HistoryRead] = new(TimeoutSeconds: 60, RetryCount: 2, BreakerFailureThreshold: 5),
},
DriverTier.C => new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Read] = new(TimeoutSeconds: 10, RetryCount: 1, BreakerFailureThreshold: 0),
[DriverCapability.Write] = new(TimeoutSeconds: 10, RetryCount: 0, BreakerFailureThreshold: 0),
[DriverCapability.Discover] = new(TimeoutSeconds: 120, RetryCount: 1, BreakerFailureThreshold: 0),
[DriverCapability.Subscribe] = new(TimeoutSeconds: 15, RetryCount: 1, BreakerFailureThreshold: 0),
[DriverCapability.Probe] = new(TimeoutSeconds: 10, RetryCount: 1, BreakerFailureThreshold: 0),
[DriverCapability.AlarmSubscribe] = new(TimeoutSeconds: 15, RetryCount: 1, BreakerFailureThreshold: 0),
[DriverCapability.AlarmAcknowledge] = new(TimeoutSeconds: 15, RetryCount: 0, BreakerFailureThreshold: 0),
[DriverCapability.HistoryRead] = new(TimeoutSeconds: 120, RetryCount: 1, BreakerFailureThreshold: 0),
},
_ => throw new ArgumentOutOfRangeException(nameof(tier), tier, $"No default policy table defined for tier {tier}."),
};
}
/// <summary>Policy for one capability on one driver instance.</summary>
/// <param name="TimeoutSeconds">Per-call timeout (wraps the inner Polly execution).</param>
/// <param name="RetryCount">Number of retry attempts after the first failure; zero = no retry.</param>
/// <param name="BreakerFailureThreshold">
/// Consecutive-failure count that opens the circuit breaker; zero = no breaker
/// (Tier C uses the supervisor's process-level breaker instead, per decision #68).
/// </param>
public sealed record CapabilityPolicy(int TimeoutSeconds, int RetryCount, int BreakerFailureThreshold);

View File

@@ -0,0 +1,118 @@
using System.Collections.Concurrent;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;
using Polly.Timeout;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Resilience;
/// <summary>
/// Builds and caches Polly resilience pipelines keyed on
/// <c>(DriverInstanceId, HostName, DriverCapability)</c>. One dead PLC behind a multi-device
/// driver cannot open the circuit breaker for healthy sibling hosts.
/// </summary>
/// <remarks>
/// Per <c>docs/v2/plan.md</c> decision #144 (per-device isolation). Composition from outside-in:
/// <b>Timeout → Retry (when capability permits) → Circuit Breaker (when tier permits) → Bulkhead</b>.
///
/// <para>Pipeline resolution is lock-free on the hot path: the inner
/// <see cref="ConcurrentDictionary{TKey,TValue}"/> caches a <see cref="ResiliencePipeline"/> per key;
/// first-call cost is one <see cref="ResiliencePipelineBuilder"/>.Build. Thereafter reads are O(1).</para>
/// </remarks>
public sealed class DriverResiliencePipelineBuilder
{
private readonly ConcurrentDictionary<PipelineKey, ResiliencePipeline> _pipelines = new();
private readonly TimeProvider _timeProvider;
/// <summary>Construct with the ambient clock (use <see cref="TimeProvider.System"/> in prod).</summary>
public DriverResiliencePipelineBuilder(TimeProvider? timeProvider = null)
{
_timeProvider = timeProvider ?? TimeProvider.System;
}
/// <summary>
/// Get or build the pipeline for a given <c>(driver instance, host, capability)</c> triple.
/// Calls with the same key + same options reuse the same pipeline instance; the first caller
/// wins if a race occurs (both pipelines would be behaviourally identical).
/// </summary>
/// <param name="driverInstanceId">DriverInstance primary key — opaque to this layer.</param>
/// <param name="hostName">
/// Host the call targets. For single-host drivers (Galaxy, some OPC UA Client configs) pass the
/// driver's canonical host string. For multi-host drivers (Modbus with N PLCs), pass the
/// specific PLC so one dead PLC doesn't poison healthy siblings.
/// </param>
/// <param name="capability">Which capability surface is being called.</param>
/// <param name="options">Per-driver-instance options (tier + per-capability overrides).</param>
public ResiliencePipeline GetOrCreate(
string driverInstanceId,
string hostName,
DriverCapability capability,
DriverResilienceOptions options)
{
ArgumentNullException.ThrowIfNull(options);
ArgumentException.ThrowIfNullOrWhiteSpace(hostName);
var key = new PipelineKey(driverInstanceId, hostName, capability);
return _pipelines.GetOrAdd(key, static (_, state) => Build(state.capability, state.options, state.timeProvider),
(capability, options, timeProvider: _timeProvider));
}
/// <summary>Drop cached pipelines for one driver instance (e.g. on ResilienceConfig change). Test + Admin-reload use.</summary>
public int Invalidate(string driverInstanceId)
{
var removed = 0;
foreach (var key in _pipelines.Keys)
{
if (key.DriverInstanceId == driverInstanceId && _pipelines.TryRemove(key, out _))
removed++;
}
return removed;
}
/// <summary>Snapshot of the current number of cached pipelines. For diagnostics only.</summary>
public int CachedPipelineCount => _pipelines.Count;
private static ResiliencePipeline Build(
DriverCapability capability,
DriverResilienceOptions options,
TimeProvider timeProvider)
{
var policy = options.Resolve(capability);
var builder = new ResiliencePipelineBuilder { TimeProvider = timeProvider };
builder.AddTimeout(new TimeoutStrategyOptions
{
Timeout = TimeSpan.FromSeconds(policy.TimeoutSeconds),
});
if (policy.RetryCount > 0)
{
builder.AddRetry(new RetryStrategyOptions
{
MaxRetryAttempts = policy.RetryCount,
BackoffType = DelayBackoffType.Exponential,
UseJitter = true,
Delay = TimeSpan.FromMilliseconds(100),
MaxDelay = TimeSpan.FromSeconds(5),
ShouldHandle = new PredicateBuilder().Handle<Exception>(ex => ex is not OperationCanceledException),
});
}
if (policy.BreakerFailureThreshold > 0)
{
builder.AddCircuitBreaker(new CircuitBreakerStrategyOptions
{
FailureRatio = 1.0,
MinimumThroughput = policy.BreakerFailureThreshold,
SamplingDuration = TimeSpan.FromSeconds(30),
BreakDuration = TimeSpan.FromSeconds(15),
ShouldHandle = new PredicateBuilder().Handle<Exception>(ex => ex is not OperationCanceledException),
});
}
return builder.Build();
}
private readonly record struct PipelineKey(string DriverInstanceId, string HostName, DriverCapability Capability);
}

View File

@@ -0,0 +1,65 @@
using Microsoft.Extensions.Logging;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Stability;
/// <summary>
/// Tier C only process-recycle companion to <see cref="MemoryTracking"/>. On a
/// <see cref="MemoryTrackingAction.HardBreach"/> signal, invokes the supplied
/// <see cref="IDriverSupervisor"/> to restart the out-of-process Host.
/// </summary>
/// <remarks>
/// Per <c>docs/v2/plan.md</c> decisions #74 and #145. Tier A/B hard-breach on an in-process
/// driver would kill every OPC UA session and every co-hosted driver, so for Tier A/B this
/// class logs a <b>promotion-to-Tier-C recommendation</b> and does NOT invoke any supervisor.
/// A future tier-migration workflow acts on the recommendation.
/// </remarks>
public sealed class MemoryRecycle
{
private readonly DriverTier _tier;
private readonly IDriverSupervisor? _supervisor;
private readonly ILogger<MemoryRecycle> _logger;
public MemoryRecycle(DriverTier tier, IDriverSupervisor? supervisor, ILogger<MemoryRecycle> logger)
{
_tier = tier;
_supervisor = supervisor;
_logger = logger;
}
/// <summary>
/// Handle a <see cref="MemoryTracking"/> classification for the driver. For Tier C with a
/// wired supervisor, <c>HardBreach</c> triggers <see cref="IDriverSupervisor.RecycleAsync"/>.
/// All other combinations are no-ops with respect to process state (soft breaches + Tier A/B
/// hard breaches just log).
/// </summary>
/// <returns>True when a recycle was requested; false otherwise.</returns>
public async Task<bool> HandleAsync(MemoryTrackingAction action, long footprintBytes, CancellationToken cancellationToken)
{
switch (action)
{
case MemoryTrackingAction.SoftBreach:
_logger.LogWarning(
"Memory soft-breach on driver {DriverId}: footprint={Footprint:N0} bytes, tier={Tier}. Surfaced to Admin; no action.",
_supervisor?.DriverInstanceId ?? "(unknown)", footprintBytes, _tier);
return false;
case MemoryTrackingAction.HardBreach when _tier == DriverTier.C && _supervisor is not null:
_logger.LogError(
"Memory hard-breach on Tier C driver {DriverId}: footprint={Footprint:N0} bytes. Requesting supervisor recycle.",
_supervisor.DriverInstanceId, footprintBytes);
await _supervisor.RecycleAsync($"Memory hard-breach: {footprintBytes} bytes", cancellationToken).ConfigureAwait(false);
return true;
case MemoryTrackingAction.HardBreach:
_logger.LogError(
"Memory hard-breach on Tier {Tier} in-process driver {DriverId}: footprint={Footprint:N0} bytes. " +
"Recommending promotion to Tier C; NOT auto-killing (decisions #74, #145).",
_tier, _supervisor?.DriverInstanceId ?? "(unknown)", footprintBytes);
return false;
default:
return false;
}
}
}

View File

@@ -0,0 +1,136 @@
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Stability;
/// <summary>
/// Tier-agnostic memory-footprint tracker. Captures the post-initialize <b>baseline</b>
/// from the first samples after <c>IDriver.InitializeAsync</c>, then classifies each
/// subsequent sample against a hybrid soft/hard threshold per
/// <c>docs/v2/plan.md</c> decision #146 — <c>soft = max(multiplier × baseline, baseline + floor)</c>,
/// <c>hard = 2 × soft</c>.
/// </summary>
/// <remarks>
/// <para>Per decision #145, this tracker <b>never kills a process</b>. Soft and hard breaches
/// log + surface to the Admin UI via <c>DriverInstanceResilienceStatus</c>. The matching
/// process-level recycle protection lives in a separate <c>MemoryRecycle</c> that activates
/// for Tier C drivers only (where the driver runs out-of-process behind a supervisor that
/// can safely restart it without tearing down the OPC UA session or co-hosted in-proc
/// drivers).</para>
///
/// <para>Baseline capture: the tracker starts in <see cref="TrackingPhase.WarmingUp"/> for
/// <see cref="BaselineWindow"/> (default 5 min). During that window samples are collected;
/// the baseline is computed as the median once the window elapses. Before that point every
/// classification returns <see cref="MemoryTrackingAction.Warming"/>.</para>
/// </remarks>
public sealed class MemoryTracking
{
private readonly DriverTier _tier;
private readonly TimeSpan _baselineWindow;
private readonly List<long> _warmupSamples = [];
private long _baselineBytes;
private TrackingPhase _phase = TrackingPhase.WarmingUp;
private DateTime? _warmupStartUtc;
/// <summary>Tier-default multiplier/floor constants per decision #146.</summary>
public static (int Multiplier, long FloorBytes) GetTierConstants(DriverTier tier) => tier switch
{
DriverTier.A => (Multiplier: 3, FloorBytes: 50L * 1024 * 1024),
DriverTier.B => (Multiplier: 3, FloorBytes: 100L * 1024 * 1024),
DriverTier.C => (Multiplier: 2, FloorBytes: 500L * 1024 * 1024),
_ => throw new ArgumentOutOfRangeException(nameof(tier), tier, $"No memory-tracking constants defined for tier {tier}."),
};
/// <summary>Window over which post-init samples are collected to compute the baseline.</summary>
public TimeSpan BaselineWindow => _baselineWindow;
/// <summary>Current phase: <see cref="TrackingPhase.WarmingUp"/> or <see cref="TrackingPhase.Steady"/>.</summary>
public TrackingPhase Phase => _phase;
/// <summary>Captured baseline; 0 until warmup completes.</summary>
public long BaselineBytes => _baselineBytes;
/// <summary>Effective soft threshold (zero while warming up).</summary>
public long SoftThresholdBytes => _baselineBytes == 0 ? 0 : ComputeSoft(_tier, _baselineBytes);
/// <summary>Effective hard threshold = 2 × soft (zero while warming up).</summary>
public long HardThresholdBytes => _baselineBytes == 0 ? 0 : ComputeSoft(_tier, _baselineBytes) * 2;
public MemoryTracking(DriverTier tier, TimeSpan? baselineWindow = null)
{
_tier = tier;
_baselineWindow = baselineWindow ?? TimeSpan.FromMinutes(5);
}
/// <summary>
/// Submit a memory-footprint sample. Returns the action the caller should surface.
/// During warmup, always returns <see cref="MemoryTrackingAction.Warming"/> and accumulates
/// samples; once the window elapses the first steady-phase sample triggers baseline capture
/// (median of warmup samples).
/// </summary>
public MemoryTrackingAction Sample(long footprintBytes, DateTime utcNow)
{
if (_phase == TrackingPhase.WarmingUp)
{
_warmupStartUtc ??= utcNow;
_warmupSamples.Add(footprintBytes);
if (utcNow - _warmupStartUtc.Value >= _baselineWindow && _warmupSamples.Count > 0)
{
_baselineBytes = ComputeMedian(_warmupSamples);
_phase = TrackingPhase.Steady;
}
else
{
return MemoryTrackingAction.Warming;
}
}
if (footprintBytes >= HardThresholdBytes) return MemoryTrackingAction.HardBreach;
if (footprintBytes >= SoftThresholdBytes) return MemoryTrackingAction.SoftBreach;
return MemoryTrackingAction.None;
}
private static long ComputeSoft(DriverTier tier, long baseline)
{
var (multiplier, floor) = GetTierConstants(tier);
return Math.Max(multiplier * baseline, baseline + floor);
}
private static long ComputeMedian(List<long> samples)
{
var sorted = samples.Order().ToArray();
var mid = sorted.Length / 2;
return sorted.Length % 2 == 1
? sorted[mid]
: (sorted[mid - 1] + sorted[mid]) / 2;
}
}
/// <summary>Phase of a <see cref="MemoryTracking"/> lifecycle.</summary>
public enum TrackingPhase
{
/// <summary>Collecting post-init samples; baseline not yet computed.</summary>
WarmingUp,
/// <summary>Baseline captured; every sample classified against soft/hard thresholds.</summary>
Steady,
}
/// <summary>Classification the tracker returns per sample.</summary>
public enum MemoryTrackingAction
{
/// <summary>Baseline not yet captured; sample collected, no threshold check.</summary>
Warming,
/// <summary>Below soft threshold.</summary>
None,
/// <summary>Between soft and hard thresholds — log + surface, no action.</summary>
SoftBreach,
/// <summary>
/// ≥ hard threshold. Log + surface + (Tier C only, via <c>MemoryRecycle</c>) request
/// process recycle via the driver supervisor. Tier A/B breach never invokes any
/// kill path per decisions #145 and #74.
/// </summary>
HardBreach,
}

View File

@@ -0,0 +1,86 @@
using Microsoft.Extensions.Logging;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Stability;
/// <summary>
/// Tier C opt-in periodic-recycle driver per <c>docs/v2/plan.md</c> decision #67.
/// A tick method advanced by the caller (fed by a background timer in prod; by test clock
/// in unit tests) decides whether the configured interval has elapsed and, if so, drives the
/// supplied <see cref="IDriverSupervisor"/> to recycle the Host.
/// </summary>
/// <remarks>
/// Tier A/B drivers MUST NOT use this class — scheduled recycle for in-process drivers would
/// kill every OPC UA session and every co-hosted driver. The ctor throws when constructed
/// with any tier other than C to make the misuse structurally impossible.
///
/// <para>Keeps no background thread of its own — callers invoke <see cref="TickAsync"/> on
/// their ambient scheduler tick (Phase 6.1 Stream C's health-endpoint host runs one). That
/// decouples the unit under test from wall-clock time and thread-pool scheduling.</para>
/// </remarks>
public sealed class ScheduledRecycleScheduler
{
private readonly TimeSpan _recycleInterval;
private readonly IDriverSupervisor _supervisor;
private readonly ILogger<ScheduledRecycleScheduler> _logger;
private DateTime _nextRecycleUtc;
/// <summary>
/// Construct the scheduler for a Tier C driver. Throws if <paramref name="tier"/> isn't C.
/// </summary>
/// <param name="tier">Driver tier; must be <see cref="DriverTier.C"/>.</param>
/// <param name="recycleInterval">Interval between recycles (e.g. 7 days).</param>
/// <param name="startUtc">Anchor time; next recycle fires at <paramref name="startUtc"/> + <paramref name="recycleInterval"/>.</param>
/// <param name="supervisor">Supervisor that performs the actual recycle.</param>
/// <param name="logger">Diagnostic sink.</param>
public ScheduledRecycleScheduler(
DriverTier tier,
TimeSpan recycleInterval,
DateTime startUtc,
IDriverSupervisor supervisor,
ILogger<ScheduledRecycleScheduler> logger)
{
if (tier != DriverTier.C)
throw new ArgumentException(
$"ScheduledRecycleScheduler is Tier C only (got {tier}). " +
"In-process drivers must not use scheduled recycle; see decisions #74 and #145.",
nameof(tier));
if (recycleInterval <= TimeSpan.Zero)
throw new ArgumentException("RecycleInterval must be positive.", nameof(recycleInterval));
_recycleInterval = recycleInterval;
_supervisor = supervisor;
_logger = logger;
_nextRecycleUtc = startUtc + recycleInterval;
}
/// <summary>Next scheduled recycle UTC. Advances by <see cref="RecycleInterval"/> on each fire.</summary>
public DateTime NextRecycleUtc => _nextRecycleUtc;
/// <summary>Recycle interval this scheduler was constructed with.</summary>
public TimeSpan RecycleInterval => _recycleInterval;
/// <summary>
/// Tick the scheduler forward. If <paramref name="utcNow"/> is past
/// <see cref="NextRecycleUtc"/>, requests a recycle from the supervisor and advances
/// <see cref="NextRecycleUtc"/> by exactly one interval. Returns true when a recycle fired.
/// </summary>
public async Task<bool> TickAsync(DateTime utcNow, CancellationToken cancellationToken)
{
if (utcNow < _nextRecycleUtc)
return false;
_logger.LogInformation(
"Scheduled recycle due for Tier C driver {DriverId} at {Now:o}; advancing next to {Next:o}.",
_supervisor.DriverInstanceId, utcNow, _nextRecycleUtc + _recycleInterval);
await _supervisor.RecycleAsync("Scheduled periodic recycle", cancellationToken).ConfigureAwait(false);
_nextRecycleUtc += _recycleInterval;
return true;
}
/// <summary>Request an immediate recycle outside the schedule (e.g. MemoryRecycle hard-breach escalation).</summary>
public Task RequestRecycleNowAsync(string reason, CancellationToken cancellationToken) =>
_supervisor.RecycleAsync(reason, cancellationToken);
}

View File

@@ -0,0 +1,81 @@
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Stability;
/// <summary>
/// Demand-aware driver-wedge detector per <c>docs/v2/plan.md</c> decision #147.
/// Flips a driver to <see cref="WedgeVerdict.Faulted"/> only when BOTH of the following hold:
/// (a) there is pending work outstanding, AND (b) no progress has been observed for longer
/// than <see cref="Threshold"/>. Idle drivers, write-only burst drivers, and subscription-only
/// drivers whose signals don't arrive regularly all stay Healthy.
/// </summary>
/// <remarks>
/// <para>Pending work signal is supplied by the caller via <see cref="DemandSignal"/>:
/// non-zero Polly bulkhead depth, ≥1 active MonitoredItem, or ≥1 queued historian read
/// each qualifies. The detector itself is state-light: all it remembers is the last
/// <c>LastProgressUtc</c> it saw and the last wedge verdict. No history buffer.</para>
///
/// <para>Default threshold per plan: <c>5 × PublishingInterval</c>, with a minimum of 60 s.
/// Concrete values are driver-agnostic and configured per-instance by the caller.</para>
/// </remarks>
public sealed class WedgeDetector
{
/// <summary>Wedge-detection threshold; pass &lt; 60 s and the detector clamps to 60 s.</summary>
public TimeSpan Threshold { get; }
/// <summary>Whether the driver reported itself <see cref="DriverState.Healthy"/> at construction.</summary>
public WedgeDetector(TimeSpan threshold)
{
Threshold = threshold < TimeSpan.FromSeconds(60) ? TimeSpan.FromSeconds(60) : threshold;
}
/// <summary>
/// Classify the current state against the demand signal. Does not retain state across
/// calls — each call is self-contained; the caller owns the <c>LastProgressUtc</c> clock.
/// </summary>
public WedgeVerdict Classify(DriverState state, DemandSignal demand, DateTime utcNow)
{
if (state != DriverState.Healthy)
return WedgeVerdict.NotApplicable;
if (!demand.HasPendingWork)
return WedgeVerdict.Idle;
var sinceProgress = utcNow - demand.LastProgressUtc;
return sinceProgress > Threshold ? WedgeVerdict.Faulted : WedgeVerdict.Healthy;
}
}
/// <summary>
/// Caller-supplied demand snapshot. All three counters are OR'd — any non-zero means work
/// is outstanding, which is the trigger for checking the <see cref="LastProgressUtc"/> clock.
/// </summary>
/// <param name="BulkheadDepth">Polly bulkhead depth (in-flight capability calls).</param>
/// <param name="ActiveMonitoredItems">Number of live OPC UA MonitoredItems bound to this driver.</param>
/// <param name="QueuedHistoryReads">Pending historian-read requests the driver owes the server.</param>
/// <param name="LastProgressUtc">Last time the driver reported a successful unit of work (read, subscribe-ack, publish).</param>
public readonly record struct DemandSignal(
int BulkheadDepth,
int ActiveMonitoredItems,
int QueuedHistoryReads,
DateTime LastProgressUtc)
{
/// <summary>True when any of the three counters is &gt; 0.</summary>
public bool HasPendingWork => BulkheadDepth > 0 || ActiveMonitoredItems > 0 || QueuedHistoryReads > 0;
}
/// <summary>Outcome of a single <see cref="WedgeDetector.Classify"/> call.</summary>
public enum WedgeVerdict
{
/// <summary>Driver wasn't Healthy to begin with — wedge detection doesn't apply.</summary>
NotApplicable,
/// <summary>Driver claims Healthy + no pending work → stays Healthy.</summary>
Idle,
/// <summary>Driver claims Healthy + has pending work + has made progress within the threshold → stays Healthy.</summary>
Healthy,
/// <summary>Driver claims Healthy + has pending work + has NOT made progress within the threshold → wedged.</summary>
Faulted,
}

View File

@@ -16,6 +16,10 @@
<ProjectReference Include="..\ZB.MOM.WW.OtOpcUa.Configuration\ZB.MOM.WW.OtOpcUa.Configuration.csproj"/>
</ItemGroup>
<ItemGroup>
<PackageReference Include="Polly.Core" Version="8.6.6"/>
</ItemGroup>
<ItemGroup>
<InternalsVisibleTo Include="ZB.MOM.WW.OtOpcUa.Core.Tests"/>
</ItemGroup>

View File

@@ -115,7 +115,8 @@ public sealed class ModbusDriver(ModbusDriverOptions options, string driverInsta
ArrayDim: null,
SecurityClass: t.Writable ? SecurityClassification.Operate : SecurityClassification.ViewOnly,
IsHistorized: false,
IsAlarm: false));
IsAlarm: false,
WriteIdempotent: t.WriteIdempotent));
}
return Task.CompletedTask;
}

View File

@@ -92,6 +92,14 @@ public sealed class ModbusProbeOptions
/// AutomationDirect DirectLOGIC (DL205/DL260) and a few legacy families pack the first
/// character in the low byte instead — see <c>docs/v2/dl205.md</c> §strings.
/// </param>
/// <param name="WriteIdempotent">
/// Per <c>docs/v2/plan.md</c> decisions #44, #45, #143 — flag a tag as safe to replay on
/// write timeout / failure. Default <c>false</c>; writes do not auto-retry. Safe candidates:
/// holding-register set-points for analog values and configuration registers where the same
/// value can be written again without side-effects. Unsafe: coils that drive edge-triggered
/// actions (pulse outputs), counter-increment addresses on PLCs that treat writes as deltas,
/// any BCD / counter register where repeat-writes advance state.
/// </param>
public sealed record ModbusTagDefinition(
string Name,
ModbusRegion Region,
@@ -101,7 +109,8 @@ public sealed record ModbusTagDefinition(
ModbusByteOrder ByteOrder = ModbusByteOrder.BigEndian,
byte BitIndex = 0,
ushort StringLength = 0,
ModbusStringByteOrder StringByteOrder = ModbusStringByteOrder.HighByteFirst);
ModbusStringByteOrder StringByteOrder = ModbusStringByteOrder.HighByteFirst,
bool WriteIdempotent = false);
public enum ModbusRegion { Coils, DiscreteInputs, InputRegisters, HoldingRegisters }

View File

@@ -341,7 +341,8 @@ public sealed class S7Driver(S7DriverOptions options, string driverInstanceId)
ArrayDim: null,
SecurityClass: t.Writable ? SecurityClassification.Operate : SecurityClassification.ViewOnly,
IsHistorized: false,
IsAlarm: false));
IsAlarm: false,
WriteIdempotent: t.WriteIdempotent));
}
return Task.CompletedTask;
}

View File

@@ -88,12 +88,20 @@ public sealed class S7ProbeOptions
/// <param name="DataType">Logical data type — drives the underlying S7.Net read/write width.</param>
/// <param name="Writable">When true the driver accepts writes for this tag.</param>
/// <param name="StringLength">For <c>DataType = String</c>: S7-string max length. Default 254 (S7 max).</param>
/// <param name="WriteIdempotent">
/// Per <c>docs/v2/plan.md</c> decisions #44, #45, #143 — flag a tag as safe to replay on
/// write timeout / failure. Default <c>false</c>; writes do not auto-retry. Safe candidates
/// on S7: DB word/dword set-points holding analog values, configuration DBs where the same
/// value can be written again without side-effects. Unsafe: M (merker) bits or Q (output)
/// coils that drive edge-triggered routines in the PLC program.
/// </param>
public sealed record S7TagDefinition(
string Name,
string Address,
S7DataType DataType,
bool Writable = true,
int StringLength = 254);
int StringLength = 254,
bool WriteIdempotent = false);
public enum S7DataType
{

View File

@@ -3,6 +3,7 @@ using Microsoft.Extensions.Logging;
using Opc.Ua;
using Opc.Ua.Server;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Resilience;
using ZB.MOM.WW.OtOpcUa.Server.Security;
using DriverWriteRequest = ZB.MOM.WW.OtOpcUa.Core.Abstractions.WriteRequest;
// Core.Abstractions defines a type-named HistoryReadResult (driver-side samples + continuation
@@ -33,8 +34,14 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
private readonly IDriver _driver;
private readonly IReadable? _readable;
private readonly IWritable? _writable;
private readonly CapabilityInvoker _invoker;
private readonly ILogger<DriverNodeManager> _logger;
// Per-variable idempotency flag populated during Variable() registration from
// DriverAttributeInfo.WriteIdempotent. Drives ExecuteWriteAsync's retry gating in
// OnWriteValue; absent entries default to false (decisions #44, #45, #143).
private readonly Dictionary<string, bool> _writeIdempotentByFullRef = new(StringComparer.OrdinalIgnoreCase);
/// <summary>The driver whose address space this node manager exposes.</summary>
public IDriver Driver => _driver;
@@ -53,12 +60,13 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
private FolderState _currentFolder = null!;
public DriverNodeManager(IServerInternal server, ApplicationConfiguration configuration,
IDriver driver, ILogger<DriverNodeManager> logger)
IDriver driver, CapabilityInvoker invoker, ILogger<DriverNodeManager> logger)
: base(server, configuration, namespaceUris: $"urn:OtOpcUa:{driver.DriverInstanceId}")
{
_driver = driver;
_readable = driver as IReadable;
_writable = driver as IWritable;
_invoker = invoker;
_logger = logger;
}
@@ -148,6 +156,7 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
AddPredefinedNode(SystemContext, v);
_variablesByFullRef[attributeInfo.FullName] = v;
_securityByFullRef[attributeInfo.FullName] = attributeInfo.SecurityClass;
_writeIdempotentByFullRef[attributeInfo.FullName] = attributeInfo.WriteIdempotent;
v.OnReadValue = OnReadValue;
v.OnWriteValue = OnWriteValue;
@@ -188,7 +197,11 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
try
{
var fullRef = node.NodeId.Identifier as string ?? "";
var result = _readable.ReadAsync([fullRef], CancellationToken.None).GetAwaiter().GetResult();
var result = _invoker.ExecuteAsync(
DriverCapability.Read,
_driver.DriverInstanceId,
async ct => (IReadOnlyList<DataValueSnapshot>)await _readable.ReadAsync([fullRef], ct).ConfigureAwait(false),
CancellationToken.None).AsTask().GetAwaiter().GetResult();
if (result.Count == 0)
{
statusCode = StatusCodes.BadNoData;
@@ -381,9 +394,15 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
try
{
var results = _writable.WriteAsync(
[new DriverWriteRequest(fullRef!, value)],
CancellationToken.None).GetAwaiter().GetResult();
var isIdempotent = _writeIdempotentByFullRef.GetValueOrDefault(fullRef!, false);
var capturedValue = value;
var results = _invoker.ExecuteWriteAsync(
_driver.DriverInstanceId,
isIdempotent,
async ct => (IReadOnlyList<WriteResult>)await _writable.WriteAsync(
[new DriverWriteRequest(fullRef!, capturedValue)],
ct).ConfigureAwait(false),
CancellationToken.None).AsTask().GetAwaiter().GetResult();
if (results.Count > 0 && results[0].StatusCode != 0)
{
statusCode = results[0].StatusCode;
@@ -465,12 +484,16 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
try
{
var driverResult = History.ReadRawAsync(
fullRef,
details.StartTime,
details.EndTime,
details.NumValuesPerNode,
CancellationToken.None).GetAwaiter().GetResult();
var driverResult = _invoker.ExecuteAsync(
DriverCapability.HistoryRead,
_driver.DriverInstanceId,
async ct => await History.ReadRawAsync(
fullRef,
details.StartTime,
details.EndTime,
details.NumValuesPerNode,
ct).ConfigureAwait(false),
CancellationToken.None).AsTask().GetAwaiter().GetResult();
WriteResult(results, errors, i, StatusCodes.Good,
BuildHistoryData(driverResult.Samples), driverResult.ContinuationPoint);
@@ -525,13 +548,17 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
try
{
var driverResult = History.ReadProcessedAsync(
fullRef,
details.StartTime,
details.EndTime,
interval,
aggregate.Value,
CancellationToken.None).GetAwaiter().GetResult();
var driverResult = _invoker.ExecuteAsync(
DriverCapability.HistoryRead,
_driver.DriverInstanceId,
async ct => await History.ReadProcessedAsync(
fullRef,
details.StartTime,
details.EndTime,
interval,
aggregate.Value,
ct).ConfigureAwait(false),
CancellationToken.None).AsTask().GetAwaiter().GetResult();
WriteResult(results, errors, i, StatusCodes.Good,
BuildHistoryData(driverResult.Samples), driverResult.ContinuationPoint);
@@ -578,8 +605,11 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
try
{
var driverResult = History.ReadAtTimeAsync(
fullRef, requestedTimes, CancellationToken.None).GetAwaiter().GetResult();
var driverResult = _invoker.ExecuteAsync(
DriverCapability.HistoryRead,
_driver.DriverInstanceId,
async ct => await History.ReadAtTimeAsync(fullRef, requestedTimes, ct).ConfigureAwait(false),
CancellationToken.None).AsTask().GetAwaiter().GetResult();
WriteResult(results, errors, i, StatusCodes.Good,
BuildHistoryData(driverResult.Samples), driverResult.ContinuationPoint);
@@ -632,12 +662,16 @@ public sealed class DriverNodeManager : CustomNodeManager2, IAddressSpaceBuilder
try
{
var driverResult = History.ReadEventsAsync(
sourceName: fullRef,
startUtc: details.StartTime,
endUtc: details.EndTime,
maxEvents: maxEvents,
cancellationToken: CancellationToken.None).GetAwaiter().GetResult();
var driverResult = _invoker.ExecuteAsync(
DriverCapability.HistoryRead,
_driver.DriverInstanceId,
async ct => await History.ReadEventsAsync(
sourceName: fullRef,
startUtc: details.StartTime,
endUtc: details.EndTime,
maxEvents: maxEvents,
cancellationToken: ct).ConfigureAwait(false),
CancellationToken.None).AsTask().GetAwaiter().GetResult();
WriteResult(results, errors, i, StatusCodes.Good,
BuildHistoryEvent(driverResult.Events), driverResult.ContinuationPoint);

View File

@@ -3,6 +3,7 @@ using Opc.Ua;
using Opc.Ua.Configuration;
using ZB.MOM.WW.OtOpcUa.Core.Hosting;
using ZB.MOM.WW.OtOpcUa.Core.OpcUa;
using ZB.MOM.WW.OtOpcUa.Core.Resilience;
using ZB.MOM.WW.OtOpcUa.Server.Security;
namespace ZB.MOM.WW.OtOpcUa.Server.OpcUa;
@@ -20,6 +21,7 @@ public sealed class OpcUaApplicationHost : IAsyncDisposable
private readonly OpcUaServerOptions _options;
private readonly DriverHost _driverHost;
private readonly IUserAuthenticator _authenticator;
private readonly DriverResiliencePipelineBuilder _pipelineBuilder;
private readonly ILoggerFactory _loggerFactory;
private readonly ILogger<OpcUaApplicationHost> _logger;
private ApplicationInstance? _application;
@@ -27,11 +29,13 @@ public sealed class OpcUaApplicationHost : IAsyncDisposable
private bool _disposed;
public OpcUaApplicationHost(OpcUaServerOptions options, DriverHost driverHost,
IUserAuthenticator authenticator, ILoggerFactory loggerFactory, ILogger<OpcUaApplicationHost> logger)
IUserAuthenticator authenticator, ILoggerFactory loggerFactory, ILogger<OpcUaApplicationHost> logger,
DriverResiliencePipelineBuilder? pipelineBuilder = null)
{
_options = options;
_driverHost = driverHost;
_authenticator = authenticator;
_pipelineBuilder = pipelineBuilder ?? new DriverResiliencePipelineBuilder();
_loggerFactory = loggerFactory;
_logger = logger;
}
@@ -58,7 +62,7 @@ public sealed class OpcUaApplicationHost : IAsyncDisposable
throw new InvalidOperationException(
$"OPC UA application certificate could not be validated or created in {_options.PkiStoreRoot}");
_server = new OtOpcUaServer(_driverHost, _authenticator, _loggerFactory);
_server = new OtOpcUaServer(_driverHost, _authenticator, _pipelineBuilder, _loggerFactory);
await _application.Start(_server).ConfigureAwait(false);
_logger.LogInformation("OPC UA server started — endpoint={Endpoint} driverCount={Count}",

View File

@@ -5,6 +5,7 @@ using Opc.Ua.Server;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Hosting;
using ZB.MOM.WW.OtOpcUa.Core.OpcUa;
using ZB.MOM.WW.OtOpcUa.Core.Resilience;
using ZB.MOM.WW.OtOpcUa.Server.Security;
namespace ZB.MOM.WW.OtOpcUa.Server.OpcUa;
@@ -19,13 +20,19 @@ public sealed class OtOpcUaServer : StandardServer
{
private readonly DriverHost _driverHost;
private readonly IUserAuthenticator _authenticator;
private readonly DriverResiliencePipelineBuilder _pipelineBuilder;
private readonly ILoggerFactory _loggerFactory;
private readonly List<DriverNodeManager> _driverNodeManagers = new();
public OtOpcUaServer(DriverHost driverHost, IUserAuthenticator authenticator, ILoggerFactory loggerFactory)
public OtOpcUaServer(
DriverHost driverHost,
IUserAuthenticator authenticator,
DriverResiliencePipelineBuilder pipelineBuilder,
ILoggerFactory loggerFactory)
{
_driverHost = driverHost;
_authenticator = authenticator;
_pipelineBuilder = pipelineBuilder;
_loggerFactory = loggerFactory;
}
@@ -46,7 +53,12 @@ public sealed class OtOpcUaServer : StandardServer
if (driver is null) continue;
var logger = _loggerFactory.CreateLogger<DriverNodeManager>();
var manager = new DriverNodeManager(server, configuration, driver, logger);
// Per-driver resilience options: default Tier A pending Stream B.1 which wires
// per-type tiers into DriverTypeRegistry. Read ResilienceConfig JSON from the
// DriverInstance row in a follow-up PR; for now every driver gets Tier A defaults.
var options = new DriverResilienceOptions { Tier = DriverTier.A };
var invoker = new CapabilityInvoker(_pipelineBuilder, driver.DriverInstanceId, () => options);
var manager = new DriverNodeManager(server, configuration, driver, invoker, logger);
_driverNodeManagers.Add(manager);
}

View File

@@ -7,11 +7,13 @@ public sealed class DriverTypeRegistryTests
{
private static DriverTypeMetadata SampleMetadata(
string typeName = "Modbus",
NamespaceKindCompatibility allowed = NamespaceKindCompatibility.Equipment) =>
NamespaceKindCompatibility allowed = NamespaceKindCompatibility.Equipment,
DriverTier tier = DriverTier.B) =>
new(typeName, allowed,
DriverConfigJsonSchema: "{\"type\": \"object\"}",
DeviceConfigJsonSchema: "{\"type\": \"object\"}",
TagConfigJsonSchema: "{\"type\": \"object\"}");
TagConfigJsonSchema: "{\"type\": \"object\"}",
Tier: tier);
[Fact]
public void Register_ThenGet_RoundTrips()
@@ -24,6 +26,20 @@ public sealed class DriverTypeRegistryTests
registry.Get("Modbus").ShouldBe(metadata);
}
[Theory]
[InlineData(DriverTier.A)]
[InlineData(DriverTier.B)]
[InlineData(DriverTier.C)]
public void Register_Requires_NonNullTier(DriverTier tier)
{
var registry = new DriverTypeRegistry();
var metadata = SampleMetadata(typeName: $"Driver-{tier}", tier: tier);
registry.Register(metadata);
registry.Get(metadata.TypeName).Tier.ShouldBe(tier);
}
[Fact]
public void Get_IsCaseInsensitive()
{

View File

@@ -0,0 +1,151 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Resilience;
namespace ZB.MOM.WW.OtOpcUa.Core.Tests.Resilience;
[Trait("Category", "Unit")]
public sealed class CapabilityInvokerTests
{
private static CapabilityInvoker MakeInvoker(
DriverResiliencePipelineBuilder builder,
DriverResilienceOptions options) =>
new(builder, "drv-test", () => options);
[Fact]
public async Task Read_ReturnsValue_FromCallSite()
{
var invoker = MakeInvoker(new DriverResiliencePipelineBuilder(), new DriverResilienceOptions { Tier = DriverTier.A });
var result = await invoker.ExecuteAsync(
DriverCapability.Read,
"host-1",
_ => ValueTask.FromResult(42),
CancellationToken.None);
result.ShouldBe(42);
}
[Fact]
public async Task Read_Retries_OnTransientFailure()
{
var invoker = MakeInvoker(new DriverResiliencePipelineBuilder(), new DriverResilienceOptions { Tier = DriverTier.A });
var attempts = 0;
var result = await invoker.ExecuteAsync(
DriverCapability.Read,
"host-1",
async _ =>
{
attempts++;
if (attempts < 2) throw new InvalidOperationException("transient");
await Task.Yield();
return "ok";
},
CancellationToken.None);
result.ShouldBe("ok");
attempts.ShouldBe(2);
}
[Fact]
public async Task Write_NonIdempotent_DoesNotRetry_EvenWhenPolicyHasRetries()
{
var options = new DriverResilienceOptions
{
Tier = DriverTier.A,
CapabilityPolicies = new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Write] = new(TimeoutSeconds: 2, RetryCount: 3, BreakerFailureThreshold: 5),
},
};
var invoker = MakeInvoker(new DriverResiliencePipelineBuilder(), options);
var attempts = 0;
await Should.ThrowAsync<InvalidOperationException>(async () =>
await invoker.ExecuteWriteAsync(
"host-1",
isIdempotent: false,
async _ =>
{
attempts++;
await Task.Yield();
throw new InvalidOperationException("boom");
#pragma warning disable CS0162
return 0;
#pragma warning restore CS0162
},
CancellationToken.None));
attempts.ShouldBe(1, "non-idempotent write must never replay");
}
[Fact]
public async Task Write_Idempotent_Retries_WhenPolicyHasRetries()
{
var options = new DriverResilienceOptions
{
Tier = DriverTier.A,
CapabilityPolicies = new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Write] = new(TimeoutSeconds: 2, RetryCount: 3, BreakerFailureThreshold: 5),
},
};
var invoker = MakeInvoker(new DriverResiliencePipelineBuilder(), options);
var attempts = 0;
var result = await invoker.ExecuteWriteAsync(
"host-1",
isIdempotent: true,
async _ =>
{
attempts++;
if (attempts < 2) throw new InvalidOperationException("transient");
await Task.Yield();
return "ok";
},
CancellationToken.None);
result.ShouldBe("ok");
attempts.ShouldBe(2);
}
[Fact]
public async Task Write_Default_DoesNotRetry_WhenPolicyHasZeroRetries()
{
// Tier A Write default is RetryCount=0. Even isIdempotent=true shouldn't retry
// because the policy says not to.
var invoker = MakeInvoker(new DriverResiliencePipelineBuilder(), new DriverResilienceOptions { Tier = DriverTier.A });
var attempts = 0;
await Should.ThrowAsync<InvalidOperationException>(async () =>
await invoker.ExecuteWriteAsync(
"host-1",
isIdempotent: true,
async _ =>
{
attempts++;
await Task.Yield();
throw new InvalidOperationException("boom");
#pragma warning disable CS0162
return 0;
#pragma warning restore CS0162
},
CancellationToken.None));
attempts.ShouldBe(1, "tier-A default for Write is RetryCount=0");
}
[Fact]
public async Task Execute_HonorsDifferentHosts_Independently()
{
var builder = new DriverResiliencePipelineBuilder();
var invoker = MakeInvoker(builder, new DriverResilienceOptions { Tier = DriverTier.A });
await invoker.ExecuteAsync(DriverCapability.Read, "host-a", _ => ValueTask.FromResult(1), CancellationToken.None);
await invoker.ExecuteAsync(DriverCapability.Read, "host-b", _ => ValueTask.FromResult(2), CancellationToken.None);
builder.CachedPipelineCount.ShouldBe(2);
}
}

View File

@@ -0,0 +1,102 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Resilience;
namespace ZB.MOM.WW.OtOpcUa.Core.Tests.Resilience;
[Trait("Category", "Unit")]
public sealed class DriverResilienceOptionsTests
{
[Theory]
[InlineData(DriverTier.A)]
[InlineData(DriverTier.B)]
[InlineData(DriverTier.C)]
public void TierDefaults_Cover_EveryCapability(DriverTier tier)
{
var defaults = DriverResilienceOptions.GetTierDefaults(tier);
foreach (var capability in Enum.GetValues<DriverCapability>())
defaults.ShouldContainKey(capability);
}
[Theory]
[InlineData(DriverTier.A)]
[InlineData(DriverTier.B)]
[InlineData(DriverTier.C)]
public void Write_NeverRetries_ByDefault(DriverTier tier)
{
var defaults = DriverResilienceOptions.GetTierDefaults(tier);
defaults[DriverCapability.Write].RetryCount.ShouldBe(0);
}
[Theory]
[InlineData(DriverTier.A)]
[InlineData(DriverTier.B)]
[InlineData(DriverTier.C)]
public void AlarmAcknowledge_NeverRetries_ByDefault(DriverTier tier)
{
var defaults = DriverResilienceOptions.GetTierDefaults(tier);
defaults[DriverCapability.AlarmAcknowledge].RetryCount.ShouldBe(0);
}
[Theory]
[InlineData(DriverTier.A, DriverCapability.Read)]
[InlineData(DriverTier.A, DriverCapability.HistoryRead)]
[InlineData(DriverTier.B, DriverCapability.Discover)]
[InlineData(DriverTier.B, DriverCapability.Probe)]
[InlineData(DriverTier.C, DriverCapability.AlarmSubscribe)]
public void IdempotentCapabilities_Retry_ByDefault(DriverTier tier, DriverCapability capability)
{
var defaults = DriverResilienceOptions.GetTierDefaults(tier);
defaults[capability].RetryCount.ShouldBeGreaterThan(0);
}
[Fact]
public void TierC_DisablesCircuitBreaker_DeferringToSupervisor()
{
var defaults = DriverResilienceOptions.GetTierDefaults(DriverTier.C);
foreach (var (_, policy) in defaults)
policy.BreakerFailureThreshold.ShouldBe(0, "Tier C breaker is handled by the Proxy supervisor (decision #68)");
}
[Theory]
[InlineData(DriverTier.A)]
[InlineData(DriverTier.B)]
public void TierAAndB_EnableCircuitBreaker(DriverTier tier)
{
var defaults = DriverResilienceOptions.GetTierDefaults(tier);
foreach (var (_, policy) in defaults)
policy.BreakerFailureThreshold.ShouldBeGreaterThan(0);
}
[Fact]
public void Resolve_Uses_TierDefaults_When_NoOverride()
{
var options = new DriverResilienceOptions { Tier = DriverTier.A };
var resolved = options.Resolve(DriverCapability.Read);
resolved.ShouldBe(DriverResilienceOptions.GetTierDefaults(DriverTier.A)[DriverCapability.Read]);
}
[Fact]
public void Resolve_Uses_Override_When_Configured()
{
var custom = new CapabilityPolicy(TimeoutSeconds: 42, RetryCount: 7, BreakerFailureThreshold: 9);
var options = new DriverResilienceOptions
{
Tier = DriverTier.A,
CapabilityPolicies = new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Read] = custom,
},
};
options.Resolve(DriverCapability.Read).ShouldBe(custom);
options.Resolve(DriverCapability.Write).ShouldBe(
DriverResilienceOptions.GetTierDefaults(DriverTier.A)[DriverCapability.Write]);
}
}

View File

@@ -0,0 +1,222 @@
using Polly.CircuitBreaker;
using Polly.Timeout;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Resilience;
namespace ZB.MOM.WW.OtOpcUa.Core.Tests.Resilience;
[Trait("Category", "Unit")]
public sealed class DriverResiliencePipelineBuilderTests
{
private static readonly DriverResilienceOptions TierAOptions = new() { Tier = DriverTier.A };
[Fact]
public async Task Read_Retries_Transient_Failures()
{
var builder = new DriverResiliencePipelineBuilder();
var pipeline = builder.GetOrCreate("drv-test", "host-1", DriverCapability.Read, TierAOptions);
var attempts = 0;
await pipeline.ExecuteAsync(async _ =>
{
attempts++;
if (attempts < 3) throw new InvalidOperationException("transient");
await Task.Yield();
});
attempts.ShouldBe(3);
}
[Fact]
public async Task Write_DoesNotRetry_OnFailure()
{
var builder = new DriverResiliencePipelineBuilder();
var pipeline = builder.GetOrCreate("drv-test", "host-1", DriverCapability.Write, TierAOptions);
var attempts = 0;
var ex = await Should.ThrowAsync<InvalidOperationException>(async () =>
{
await pipeline.ExecuteAsync(async _ =>
{
attempts++;
await Task.Yield();
throw new InvalidOperationException("boom");
});
});
attempts.ShouldBe(1);
ex.Message.ShouldBe("boom");
}
[Fact]
public async Task AlarmAcknowledge_DoesNotRetry_OnFailure()
{
var builder = new DriverResiliencePipelineBuilder();
var pipeline = builder.GetOrCreate("drv-test", "host-1", DriverCapability.AlarmAcknowledge, TierAOptions);
var attempts = 0;
await Should.ThrowAsync<InvalidOperationException>(async () =>
{
await pipeline.ExecuteAsync(async _ =>
{
attempts++;
await Task.Yield();
throw new InvalidOperationException("boom");
});
});
attempts.ShouldBe(1);
}
[Fact]
public void Pipeline_IsIsolated_PerHost()
{
var builder = new DriverResiliencePipelineBuilder();
var driverId = "drv-test";
var hostA = builder.GetOrCreate(driverId, "host-a", DriverCapability.Read, TierAOptions);
var hostB = builder.GetOrCreate(driverId, "host-b", DriverCapability.Read, TierAOptions);
hostA.ShouldNotBeSameAs(hostB);
builder.CachedPipelineCount.ShouldBe(2);
}
[Fact]
public void Pipeline_IsReused_ForSameTriple()
{
var builder = new DriverResiliencePipelineBuilder();
var driverId = "drv-test";
var first = builder.GetOrCreate(driverId, "host-a", DriverCapability.Read, TierAOptions);
var second = builder.GetOrCreate(driverId, "host-a", DriverCapability.Read, TierAOptions);
first.ShouldBeSameAs(second);
builder.CachedPipelineCount.ShouldBe(1);
}
[Fact]
public void Pipeline_IsIsolated_PerCapability()
{
var builder = new DriverResiliencePipelineBuilder();
var driverId = "drv-test";
var read = builder.GetOrCreate(driverId, "host-a", DriverCapability.Read, TierAOptions);
var write = builder.GetOrCreate(driverId, "host-a", DriverCapability.Write, TierAOptions);
read.ShouldNotBeSameAs(write);
}
[Fact]
public async Task DeadHost_DoesNotOpenBreaker_ForSiblingHost()
{
var builder = new DriverResiliencePipelineBuilder();
var driverId = "drv-test";
var deadHost = builder.GetOrCreate(driverId, "dead-plc", DriverCapability.Read, TierAOptions);
var liveHost = builder.GetOrCreate(driverId, "live-plc", DriverCapability.Read, TierAOptions);
var threshold = TierAOptions.Resolve(DriverCapability.Read).BreakerFailureThreshold;
for (var i = 0; i < threshold + 5; i++)
{
await Should.ThrowAsync<Exception>(async () =>
await deadHost.ExecuteAsync(async _ =>
{
await Task.Yield();
throw new InvalidOperationException("dead plc");
}));
}
var liveAttempts = 0;
await liveHost.ExecuteAsync(async _ =>
{
liveAttempts++;
await Task.Yield();
});
liveAttempts.ShouldBe(1, "healthy sibling host must not be affected by dead peer");
}
[Fact]
public async Task CircuitBreaker_Opens_AfterFailureThreshold_OnTierA()
{
var builder = new DriverResiliencePipelineBuilder();
var pipeline = builder.GetOrCreate("drv-test", "host-1", DriverCapability.Write, TierAOptions);
var threshold = TierAOptions.Resolve(DriverCapability.Write).BreakerFailureThreshold;
for (var i = 0; i < threshold; i++)
{
await Should.ThrowAsync<InvalidOperationException>(async () =>
await pipeline.ExecuteAsync(async _ =>
{
await Task.Yield();
throw new InvalidOperationException("boom");
}));
}
await Should.ThrowAsync<BrokenCircuitException>(async () =>
await pipeline.ExecuteAsync(async _ =>
{
await Task.Yield();
}));
}
[Fact]
public async Task Timeout_Cancels_SlowOperation()
{
var tierAWithShortTimeout = new DriverResilienceOptions
{
Tier = DriverTier.A,
CapabilityPolicies = new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Read] = new(TimeoutSeconds: 1, RetryCount: 0, BreakerFailureThreshold: 5),
},
};
var builder = new DriverResiliencePipelineBuilder();
var pipeline = builder.GetOrCreate("drv-test", "host-1", DriverCapability.Read, tierAWithShortTimeout);
await Should.ThrowAsync<TimeoutRejectedException>(async () =>
await pipeline.ExecuteAsync(async ct =>
{
await Task.Delay(TimeSpan.FromSeconds(5), ct);
}));
}
[Fact]
public void Invalidate_Removes_OnlyMatchingInstance()
{
var builder = new DriverResiliencePipelineBuilder();
var keepId = "drv-keep";
var dropId = "drv-drop";
builder.GetOrCreate(keepId, "h", DriverCapability.Read, TierAOptions);
builder.GetOrCreate(keepId, "h", DriverCapability.Write, TierAOptions);
builder.GetOrCreate(dropId, "h", DriverCapability.Read, TierAOptions);
var removed = builder.Invalidate(dropId);
removed.ShouldBe(1);
builder.CachedPipelineCount.ShouldBe(2);
}
[Fact]
public async Task Cancellation_IsNot_Retried()
{
var builder = new DriverResiliencePipelineBuilder();
var pipeline = builder.GetOrCreate("drv-test", "host-1", DriverCapability.Read, TierAOptions);
var attempts = 0;
using var cts = new CancellationTokenSource();
cts.Cancel();
await Should.ThrowAsync<OperationCanceledException>(async () =>
await pipeline.ExecuteAsync(async ct =>
{
attempts++;
ct.ThrowIfCancellationRequested();
await Task.Yield();
}, cts.Token));
attempts.ShouldBeLessThanOrEqualTo(1);
}
}

View File

@@ -0,0 +1,160 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Resilience;
namespace ZB.MOM.WW.OtOpcUa.Core.Tests.Resilience;
/// <summary>
/// Integration tests for the Phase 6.1 Stream A.5 contract — wrapping a flaky
/// <see cref="IReadable"/> / <see cref="IWritable"/> through the <see cref="CapabilityInvoker"/>.
/// Exercises the three scenarios the plan enumerates: transient read succeeds after N
/// retries; non-idempotent write fails after one attempt; idempotent write retries through.
/// </summary>
[Trait("Category", "Integration")]
public sealed class FlakeyDriverIntegrationTests
{
[Fact]
public async Task Read_SurfacesSuccess_AfterTransientFailures()
{
var flaky = new FlakeyDriver(failReadsBeforeIndex: 5);
var options = new DriverResilienceOptions
{
Tier = DriverTier.A,
CapabilityPolicies = new Dictionary<DriverCapability, CapabilityPolicy>
{
// TimeoutSeconds=30 gives slack for 5 exponential-backoff retries under
// parallel-test-execution CPU pressure; 10 retries at the default Delay=100ms
// exponential can otherwise exceed a 2-second budget intermittently.
[DriverCapability.Read] = new(TimeoutSeconds: 30, RetryCount: 10, BreakerFailureThreshold: 50),
},
};
var invoker = new CapabilityInvoker(new DriverResiliencePipelineBuilder(), "drv-test", () => options);
var result = await invoker.ExecuteAsync(
DriverCapability.Read,
"host-1",
async ct => await flaky.ReadAsync(["tag-a"], ct),
CancellationToken.None);
flaky.ReadAttempts.ShouldBe(6);
result[0].StatusCode.ShouldBe(0u);
}
[Fact]
public async Task Write_NonIdempotent_FailsOnFirstFailure_NoReplay()
{
var flaky = new FlakeyDriver(failWritesBeforeIndex: 3);
var optionsWithAggressiveRetry = new DriverResilienceOptions
{
Tier = DriverTier.A,
CapabilityPolicies = new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Write] = new(TimeoutSeconds: 2, RetryCount: 5, BreakerFailureThreshold: 50),
},
};
var invoker = new CapabilityInvoker(new DriverResiliencePipelineBuilder(), "drv-test", () => optionsWithAggressiveRetry);
await Should.ThrowAsync<InvalidOperationException>(async () =>
await invoker.ExecuteWriteAsync(
"host-1",
isIdempotent: false,
async ct => await flaky.WriteAsync([new WriteRequest("pulse-coil", true)], ct),
CancellationToken.None));
flaky.WriteAttempts.ShouldBe(1, "non-idempotent write must never replay (decision #44)");
}
[Fact]
public async Task Write_Idempotent_RetriesUntilSuccess()
{
var flaky = new FlakeyDriver(failWritesBeforeIndex: 2);
var optionsWithRetry = new DriverResilienceOptions
{
Tier = DriverTier.A,
CapabilityPolicies = new Dictionary<DriverCapability, CapabilityPolicy>
{
[DriverCapability.Write] = new(TimeoutSeconds: 2, RetryCount: 5, BreakerFailureThreshold: 50),
},
};
var invoker = new CapabilityInvoker(new DriverResiliencePipelineBuilder(), "drv-test", () => optionsWithRetry);
var results = await invoker.ExecuteWriteAsync(
"host-1",
isIdempotent: true,
async ct => await flaky.WriteAsync([new WriteRequest("set-point", 42.0f)], ct),
CancellationToken.None);
flaky.WriteAttempts.ShouldBe(3);
results[0].StatusCode.ShouldBe(0u);
}
[Fact]
public async Task MultipleHosts_OnOneDriver_HaveIndependentFailureCounts()
{
var flaky = new FlakeyDriver(failReadsBeforeIndex: 0);
var options = new DriverResilienceOptions { Tier = DriverTier.A };
var builder = new DriverResiliencePipelineBuilder();
var invoker = new CapabilityInvoker(builder, "drv-test", () => options);
// host-dead: force many failures to exhaust retries + trip breaker
var threshold = options.Resolve(DriverCapability.Read).BreakerFailureThreshold;
for (var i = 0; i < threshold + 5; i++)
{
await Should.ThrowAsync<Exception>(async () =>
await invoker.ExecuteAsync(DriverCapability.Read, "host-dead",
_ => throw new InvalidOperationException("dead"),
CancellationToken.None));
}
// host-live: succeeds on first call — unaffected by the dead-host breaker
var liveAttempts = 0;
await invoker.ExecuteAsync(DriverCapability.Read, "host-live",
_ => { liveAttempts++; return ValueTask.FromResult("ok"); },
CancellationToken.None);
liveAttempts.ShouldBe(1);
}
private sealed class FlakeyDriver : IReadable, IWritable
{
private readonly int _failReadsBeforeIndex;
private readonly int _failWritesBeforeIndex;
public int ReadAttempts { get; private set; }
public int WriteAttempts { get; private set; }
public FlakeyDriver(int failReadsBeforeIndex = 0, int failWritesBeforeIndex = 0)
{
_failReadsBeforeIndex = failReadsBeforeIndex;
_failWritesBeforeIndex = failWritesBeforeIndex;
}
public Task<IReadOnlyList<DataValueSnapshot>> ReadAsync(
IReadOnlyList<string> fullReferences,
CancellationToken cancellationToken)
{
var attempt = ++ReadAttempts;
if (attempt <= _failReadsBeforeIndex)
throw new InvalidOperationException($"transient read failure #{attempt}");
var now = DateTime.UtcNow;
IReadOnlyList<DataValueSnapshot> result = fullReferences
.Select(_ => new DataValueSnapshot(Value: 0, StatusCode: 0u, SourceTimestampUtc: now, ServerTimestampUtc: now))
.ToList();
return Task.FromResult(result);
}
public Task<IReadOnlyList<WriteResult>> WriteAsync(
IReadOnlyList<WriteRequest> writes,
CancellationToken cancellationToken)
{
var attempt = ++WriteAttempts;
if (attempt <= _failWritesBeforeIndex)
throw new InvalidOperationException($"transient write failure #{attempt}");
IReadOnlyList<WriteResult> result = writes.Select(_ => new WriteResult(0u)).ToList();
return Task.FromResult(result);
}
}
}

View File

@@ -0,0 +1,91 @@
using Microsoft.Extensions.Logging.Abstractions;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Stability;
namespace ZB.MOM.WW.OtOpcUa.Core.Tests.Stability;
[Trait("Category", "Unit")]
public sealed class MemoryRecycleTests
{
[Fact]
public async Task TierC_HardBreach_RequestsSupervisorRecycle()
{
var supervisor = new FakeSupervisor();
var recycle = new MemoryRecycle(DriverTier.C, supervisor, NullLogger<MemoryRecycle>.Instance);
var requested = await recycle.HandleAsync(MemoryTrackingAction.HardBreach, 2_000_000_000, CancellationToken.None);
requested.ShouldBeTrue();
supervisor.RecycleCount.ShouldBe(1);
supervisor.LastReason.ShouldContain("hard-breach");
}
[Theory]
[InlineData(DriverTier.A)]
[InlineData(DriverTier.B)]
public async Task InProcessTier_HardBreach_NeverRequestsRecycle(DriverTier tier)
{
var supervisor = new FakeSupervisor();
var recycle = new MemoryRecycle(tier, supervisor, NullLogger<MemoryRecycle>.Instance);
var requested = await recycle.HandleAsync(MemoryTrackingAction.HardBreach, 2_000_000_000, CancellationToken.None);
requested.ShouldBeFalse("Tier A/B hard-breach logs a promotion recommendation only (decisions #74, #145)");
supervisor.RecycleCount.ShouldBe(0);
}
[Fact]
public async Task TierC_WithoutSupervisor_HardBreach_NoOp()
{
var recycle = new MemoryRecycle(DriverTier.C, supervisor: null, NullLogger<MemoryRecycle>.Instance);
var requested = await recycle.HandleAsync(MemoryTrackingAction.HardBreach, 2_000_000_000, CancellationToken.None);
requested.ShouldBeFalse("no supervisor → no recycle path; action logged only");
}
[Theory]
[InlineData(DriverTier.A)]
[InlineData(DriverTier.B)]
[InlineData(DriverTier.C)]
public async Task SoftBreach_NeverRequestsRecycle(DriverTier tier)
{
var supervisor = new FakeSupervisor();
var recycle = new MemoryRecycle(tier, supervisor, NullLogger<MemoryRecycle>.Instance);
var requested = await recycle.HandleAsync(MemoryTrackingAction.SoftBreach, 1_000_000_000, CancellationToken.None);
requested.ShouldBeFalse("soft-breach is surface-only at every tier");
supervisor.RecycleCount.ShouldBe(0);
}
[Theory]
[InlineData(MemoryTrackingAction.None)]
[InlineData(MemoryTrackingAction.Warming)]
public async Task NonBreachActions_NoOp(MemoryTrackingAction action)
{
var supervisor = new FakeSupervisor();
var recycle = new MemoryRecycle(DriverTier.C, supervisor, NullLogger<MemoryRecycle>.Instance);
var requested = await recycle.HandleAsync(action, 100_000_000, CancellationToken.None);
requested.ShouldBeFalse();
supervisor.RecycleCount.ShouldBe(0);
}
private sealed class FakeSupervisor : IDriverSupervisor
{
public string DriverInstanceId => "fake-tier-c";
public int RecycleCount { get; private set; }
public string? LastReason { get; private set; }
public Task RecycleAsync(string reason, CancellationToken cancellationToken)
{
RecycleCount++;
LastReason = reason;
return Task.CompletedTask;
}
}
}

View File

@@ -0,0 +1,119 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Stability;
namespace ZB.MOM.WW.OtOpcUa.Core.Tests.Stability;
[Trait("Category", "Unit")]
public sealed class MemoryTrackingTests
{
private static readonly DateTime T0 = new(2026, 4, 19, 12, 0, 0, DateTimeKind.Utc);
[Fact]
public void WarmingUp_Returns_Warming_UntilWindowElapses()
{
var tracker = new MemoryTracking(DriverTier.A, TimeSpan.FromMinutes(5));
tracker.Sample(100_000_000, T0).ShouldBe(MemoryTrackingAction.Warming);
tracker.Sample(105_000_000, T0.AddMinutes(1)).ShouldBe(MemoryTrackingAction.Warming);
tracker.Sample(102_000_000, T0.AddMinutes(4.9)).ShouldBe(MemoryTrackingAction.Warming);
tracker.Phase.ShouldBe(TrackingPhase.WarmingUp);
tracker.BaselineBytes.ShouldBe(0);
}
[Fact]
public void WindowElapsed_CapturesBaselineAsMedian_AndTransitionsToSteady()
{
var tracker = new MemoryTracking(DriverTier.A, TimeSpan.FromMinutes(5));
tracker.Sample(100_000_000, T0);
tracker.Sample(200_000_000, T0.AddMinutes(1));
tracker.Sample(150_000_000, T0.AddMinutes(2));
var first = tracker.Sample(150_000_000, T0.AddMinutes(5));
tracker.Phase.ShouldBe(TrackingPhase.Steady);
tracker.BaselineBytes.ShouldBe(150_000_000L, "median of 4 samples [100, 200, 150, 150] = (150+150)/2 = 150");
first.ShouldBe(MemoryTrackingAction.None, "150 MB is the baseline itself, well under soft threshold");
}
[Theory]
[InlineData(DriverTier.A, 3, 50)]
[InlineData(DriverTier.B, 3, 100)]
[InlineData(DriverTier.C, 2, 500)]
public void GetTierConstants_MatchesDecision146(DriverTier tier, int expectedMultiplier, long expectedFloorMB)
{
var (multiplier, floor) = MemoryTracking.GetTierConstants(tier);
multiplier.ShouldBe(expectedMultiplier);
floor.ShouldBe(expectedFloorMB * 1024 * 1024);
}
[Fact]
public void SoftThreshold_UsesMax_OfMultiplierAndFloor_SmallBaseline()
{
// Tier A: mult=3, floor=50 MB. Baseline 10 MB → 3×10=30 MB < 10+50=60 MB → floor wins.
var tracker = WarmupWithBaseline(DriverTier.A, 10L * 1024 * 1024);
tracker.SoftThresholdBytes.ShouldBe(60L * 1024 * 1024);
}
[Fact]
public void SoftThreshold_UsesMax_OfMultiplierAndFloor_LargeBaseline()
{
// Tier A: mult=3, floor=50 MB. Baseline 200 MB → 3×200=600 MB > 200+50=250 MB → multiplier wins.
var tracker = WarmupWithBaseline(DriverTier.A, 200L * 1024 * 1024);
tracker.SoftThresholdBytes.ShouldBe(600L * 1024 * 1024);
}
[Fact]
public void HardThreshold_IsTwiceSoft()
{
var tracker = WarmupWithBaseline(DriverTier.B, 200L * 1024 * 1024);
tracker.HardThresholdBytes.ShouldBe(tracker.SoftThresholdBytes * 2);
}
[Fact]
public void Sample_Below_Soft_Returns_None()
{
var tracker = WarmupWithBaseline(DriverTier.A, 100L * 1024 * 1024);
tracker.Sample(200L * 1024 * 1024, T0.AddMinutes(10)).ShouldBe(MemoryTrackingAction.None);
}
[Fact]
public void Sample_AtSoft_Returns_SoftBreach()
{
// Tier A, baseline 200 MB → soft = 600 MB. Sample exactly at soft.
var tracker = WarmupWithBaseline(DriverTier.A, 200L * 1024 * 1024);
tracker.Sample(tracker.SoftThresholdBytes, T0.AddMinutes(10))
.ShouldBe(MemoryTrackingAction.SoftBreach);
}
[Fact]
public void Sample_AtHard_Returns_HardBreach()
{
var tracker = WarmupWithBaseline(DriverTier.A, 200L * 1024 * 1024);
tracker.Sample(tracker.HardThresholdBytes, T0.AddMinutes(10))
.ShouldBe(MemoryTrackingAction.HardBreach);
}
[Fact]
public void Sample_AboveHard_Returns_HardBreach()
{
var tracker = WarmupWithBaseline(DriverTier.A, 200L * 1024 * 1024);
tracker.Sample(tracker.HardThresholdBytes + 100_000_000, T0.AddMinutes(10))
.ShouldBe(MemoryTrackingAction.HardBreach);
}
private static MemoryTracking WarmupWithBaseline(DriverTier tier, long baseline)
{
var tracker = new MemoryTracking(tier, TimeSpan.FromMinutes(5));
tracker.Sample(baseline, T0);
tracker.Sample(baseline, T0.AddMinutes(5));
tracker.BaselineBytes.ShouldBe(baseline);
return tracker;
}
}

View File

@@ -0,0 +1,101 @@
using Microsoft.Extensions.Logging.Abstractions;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Stability;
namespace ZB.MOM.WW.OtOpcUa.Core.Tests.Stability;
[Trait("Category", "Unit")]
public sealed class ScheduledRecycleSchedulerTests
{
private static readonly DateTime T0 = new(2026, 4, 19, 0, 0, 0, DateTimeKind.Utc);
private static readonly TimeSpan Weekly = TimeSpan.FromDays(7);
[Theory]
[InlineData(DriverTier.A)]
[InlineData(DriverTier.B)]
public void TierAOrB_Ctor_Throws(DriverTier tier)
{
var supervisor = new FakeSupervisor();
Should.Throw<ArgumentException>(() => new ScheduledRecycleScheduler(
tier, Weekly, T0, supervisor, NullLogger<ScheduledRecycleScheduler>.Instance));
}
[Fact]
public void ZeroOrNegativeInterval_Throws()
{
var supervisor = new FakeSupervisor();
Should.Throw<ArgumentException>(() => new ScheduledRecycleScheduler(
DriverTier.C, TimeSpan.Zero, T0, supervisor, NullLogger<ScheduledRecycleScheduler>.Instance));
Should.Throw<ArgumentException>(() => new ScheduledRecycleScheduler(
DriverTier.C, TimeSpan.FromSeconds(-1), T0, supervisor, NullLogger<ScheduledRecycleScheduler>.Instance));
}
[Fact]
public async Task Tick_BeforeNextRecycle_NoOp()
{
var supervisor = new FakeSupervisor();
var sch = new ScheduledRecycleScheduler(DriverTier.C, Weekly, T0, supervisor, NullLogger<ScheduledRecycleScheduler>.Instance);
var fired = await sch.TickAsync(T0 + TimeSpan.FromDays(6), CancellationToken.None);
fired.ShouldBeFalse();
supervisor.RecycleCount.ShouldBe(0);
}
[Fact]
public async Task Tick_AtOrAfterNextRecycle_FiresOnce_AndAdvances()
{
var supervisor = new FakeSupervisor();
var sch = new ScheduledRecycleScheduler(DriverTier.C, Weekly, T0, supervisor, NullLogger<ScheduledRecycleScheduler>.Instance);
var fired = await sch.TickAsync(T0 + Weekly + TimeSpan.FromMinutes(1), CancellationToken.None);
fired.ShouldBeTrue();
supervisor.RecycleCount.ShouldBe(1);
sch.NextRecycleUtc.ShouldBe(T0 + Weekly + Weekly);
}
[Fact]
public async Task RequestRecycleNow_Fires_Immediately_WithoutAdvancingSchedule()
{
var supervisor = new FakeSupervisor();
var sch = new ScheduledRecycleScheduler(DriverTier.C, Weekly, T0, supervisor, NullLogger<ScheduledRecycleScheduler>.Instance);
var nextBefore = sch.NextRecycleUtc;
await sch.RequestRecycleNowAsync("memory hard-breach", CancellationToken.None);
supervisor.RecycleCount.ShouldBe(1);
supervisor.LastReason.ShouldBe("memory hard-breach");
sch.NextRecycleUtc.ShouldBe(nextBefore, "ad-hoc recycle doesn't shift the cron schedule");
}
[Fact]
public async Task MultipleFires_AcrossTicks_AdvanceOneIntervalEach()
{
var supervisor = new FakeSupervisor();
var sch = new ScheduledRecycleScheduler(DriverTier.C, TimeSpan.FromDays(1), T0, supervisor, NullLogger<ScheduledRecycleScheduler>.Instance);
await sch.TickAsync(T0 + TimeSpan.FromDays(1) + TimeSpan.FromHours(1), CancellationToken.None);
await sch.TickAsync(T0 + TimeSpan.FromDays(2) + TimeSpan.FromHours(1), CancellationToken.None);
await sch.TickAsync(T0 + TimeSpan.FromDays(3) + TimeSpan.FromHours(1), CancellationToken.None);
supervisor.RecycleCount.ShouldBe(3);
sch.NextRecycleUtc.ShouldBe(T0 + TimeSpan.FromDays(4));
}
private sealed class FakeSupervisor : IDriverSupervisor
{
public string DriverInstanceId => "tier-c-fake";
public int RecycleCount { get; private set; }
public string? LastReason { get; private set; }
public Task RecycleAsync(string reason, CancellationToken cancellationToken)
{
RecycleCount++;
LastReason = reason;
return Task.CompletedTask;
}
}
}

View File

@@ -0,0 +1,112 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Core.Stability;
namespace ZB.MOM.WW.OtOpcUa.Core.Tests.Stability;
[Trait("Category", "Unit")]
public sealed class WedgeDetectorTests
{
private static readonly DateTime Now = new(2026, 4, 19, 12, 0, 0, DateTimeKind.Utc);
private static readonly TimeSpan Threshold = TimeSpan.FromSeconds(120);
[Fact]
public void SubSixtySecondThreshold_ClampsToSixty()
{
var detector = new WedgeDetector(TimeSpan.FromSeconds(10));
detector.Threshold.ShouldBe(TimeSpan.FromSeconds(60));
}
[Fact]
public void Unhealthy_Driver_AlwaysNotApplicable()
{
var detector = new WedgeDetector(Threshold);
var demand = new DemandSignal(BulkheadDepth: 5, ActiveMonitoredItems: 10, QueuedHistoryReads: 0, LastProgressUtc: Now.AddMinutes(-10));
detector.Classify(DriverState.Faulted, demand, Now).ShouldBe(WedgeVerdict.NotApplicable);
detector.Classify(DriverState.Degraded, demand, Now).ShouldBe(WedgeVerdict.NotApplicable);
detector.Classify(DriverState.Initializing, demand, Now).ShouldBe(WedgeVerdict.NotApplicable);
}
[Fact]
public void Idle_Subscription_Only_StaysIdle()
{
// Idle driver: bulkhead 0, monitored items 0, no history reads queued.
// Even if LastProgressUtc is ancient, the verdict is Idle, not Faulted.
var detector = new WedgeDetector(Threshold);
var demand = new DemandSignal(0, 0, 0, Now.AddHours(-12));
detector.Classify(DriverState.Healthy, demand, Now).ShouldBe(WedgeVerdict.Idle);
}
[Fact]
public void PendingWork_WithRecentProgress_StaysHealthy()
{
var detector = new WedgeDetector(Threshold);
var demand = new DemandSignal(BulkheadDepth: 2, ActiveMonitoredItems: 0, QueuedHistoryReads: 0, LastProgressUtc: Now.AddSeconds(-30));
detector.Classify(DriverState.Healthy, demand, Now).ShouldBe(WedgeVerdict.Healthy);
}
[Fact]
public void PendingWork_WithStaleProgress_IsFaulted()
{
var detector = new WedgeDetector(Threshold);
var demand = new DemandSignal(BulkheadDepth: 2, ActiveMonitoredItems: 0, QueuedHistoryReads: 0, LastProgressUtc: Now.AddMinutes(-5));
detector.Classify(DriverState.Healthy, demand, Now).ShouldBe(WedgeVerdict.Faulted);
}
[Fact]
public void MonitoredItems_Active_ButNoRecentPublish_IsFaulted()
{
// Subscription-only driver with live MonitoredItems but no publish progress within threshold
// is a real wedge — this is the case the previous "no successful Read" formulation used
// to miss (no reads ever happen).
var detector = new WedgeDetector(Threshold);
var demand = new DemandSignal(BulkheadDepth: 0, ActiveMonitoredItems: 5, QueuedHistoryReads: 0, LastProgressUtc: Now.AddMinutes(-10));
detector.Classify(DriverState.Healthy, demand, Now).ShouldBe(WedgeVerdict.Faulted);
}
[Fact]
public void MonitoredItems_Active_WithFreshPublish_StaysHealthy()
{
var detector = new WedgeDetector(Threshold);
var demand = new DemandSignal(BulkheadDepth: 0, ActiveMonitoredItems: 5, QueuedHistoryReads: 0, LastProgressUtc: Now.AddSeconds(-10));
detector.Classify(DriverState.Healthy, demand, Now).ShouldBe(WedgeVerdict.Healthy);
}
[Fact]
public void HistoryBackfill_SlowButMakingProgress_StaysHealthy()
{
// Slow historian backfill — QueuedHistoryReads > 0 but progress advances within threshold.
var detector = new WedgeDetector(Threshold);
var demand = new DemandSignal(BulkheadDepth: 0, ActiveMonitoredItems: 0, QueuedHistoryReads: 50, LastProgressUtc: Now.AddSeconds(-60));
detector.Classify(DriverState.Healthy, demand, Now).ShouldBe(WedgeVerdict.Healthy);
}
[Fact]
public void WriteOnlyBurst_StaysIdle_WhenBulkheadEmpty()
{
// A write-only driver that just finished a burst: bulkhead drained, no subscriptions, no
// history reads. Idle — the previous formulation would have faulted here because no
// reads were succeeding even though the driver is perfectly healthy.
var detector = new WedgeDetector(Threshold);
var demand = new DemandSignal(0, 0, 0, Now.AddMinutes(-30));
detector.Classify(DriverState.Healthy, demand, Now).ShouldBe(WedgeVerdict.Idle);
}
[Fact]
public void DemandSignal_HasPendingWork_TrueForAnyNonZeroCounter()
{
new DemandSignal(1, 0, 0, Now).HasPendingWork.ShouldBeTrue();
new DemandSignal(0, 1, 0, Now).HasPendingWork.ShouldBeTrue();
new DemandSignal(0, 0, 1, Now).HasPendingWork.ShouldBeTrue();
new DemandSignal(0, 0, 0, Now).HasPendingWork.ShouldBeFalse();
}
}

View File

@@ -220,6 +220,23 @@ public sealed class ModbusDriverTests
builder.Variables.ShouldContain(v => v.BrowseName == "Run" && v.Info.DriverDataType == DriverDataType.Boolean);
}
[Fact]
public async Task Discover_propagates_WriteIdempotent_from_tag_to_attribute_info()
{
var (drv, _) = NewDriver(
new ModbusTagDefinition("SetPoint", ModbusRegion.HoldingRegisters, 0, ModbusDataType.Float32, WriteIdempotent: true),
new ModbusTagDefinition("PulseCoil", ModbusRegion.Coils, 0, ModbusDataType.Bool));
await drv.InitializeAsync("{}", CancellationToken.None);
var builder = new RecordingBuilder();
await drv.DiscoverAsync(builder, CancellationToken.None);
var setPoint = builder.Variables.Single(v => v.BrowseName == "SetPoint");
var pulse = builder.Variables.Single(v => v.BrowseName == "PulseCoil");
setPoint.Info.WriteIdempotent.ShouldBeTrue();
pulse.Info.WriteIdempotent.ShouldBeFalse("default is opt-in per decision #44");
}
// --- helpers ---
private sealed class RecordingBuilder : IAddressSpaceBuilder

View File

@@ -65,6 +65,27 @@ public sealed class S7DiscoveryAndSubscribeTests
builder.Variables[2].Attr.DriverDataType.ShouldBe(DriverDataType.Float32);
}
[Fact]
public async Task DiscoverAsync_propagates_WriteIdempotent_from_tag_to_attribute_info()
{
var opts = new S7DriverOptions
{
Host = "192.0.2.1",
Tags =
[
new("SetPoint", "DB1.DBW0", S7DataType.Int16, WriteIdempotent: true),
new("StartBit", "M0.0", S7DataType.Bool),
],
};
using var drv = new S7Driver(opts, "s7-idem");
var builder = new RecordingAddressSpaceBuilder();
await drv.DiscoverAsync(builder, TestContext.Current.CancellationToken);
builder.Variables.Single(v => v.Name == "SetPoint").Attr.WriteIdempotent.ShouldBeTrue();
builder.Variables.Single(v => v.Name == "StartBit").Attr.WriteIdempotent.ShouldBeFalse("default is opt-in per decision #44");
}
[Fact]
public void GetHostStatuses_returns_one_row_with_host_port_identity_pre_init()
{