3e0452e8a4e5e964a540aedd4306cc73f9260617
9 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ae8f226e45 |
Phase 6.1 Stream E.3 partial — in-flight counter feeds CurrentBulkheadDepth
Closes the observer half of #162 that was flagged as "persisted as 0 today" in PR #105. The Admin /hosts column refresh + FleetStatusHub SignalR push + red-badge visual still belong to the visual-compliance pass. Core.Resilience: - DriverResilienceStatusTracker gains RecordCallStart + RecordCallComplete + CurrentInFlight field on the snapshot record. Concurrent-safe via the same ConcurrentDictionary.AddOrUpdate pattern as the other recorder methods. Clamps to zero on over-decrement so a stray Complete-without-Start can't drive the counter negative. - CapabilityInvoker gains an optional statusTracker ctor parameter. When wired, every ExecuteAsync / ExecuteAsync(void) wraps the pipeline call in try / finally that records start/complete — so the counter advances cleanly whether the call succeeds, cancels, or throws. Null tracker keeps the pre-Phase-6.1 Stream E.3 behaviour exactly. Server.Hosting: - ResilienceStatusPublisherHostedService persists CurrentInFlight as the DriverInstanceResilienceStatus.CurrentBulkheadDepth column (was 0 before this PR). One-line fix on both the insert + update branches. The in-flight counter is a pragmatic proxy for Polly's internal bulkhead depth — a future PR wiring Polly telemetry would replace it with the real value. The shape of the column + the publisher + the Admin /hosts query doesn't change, so the follow-up is invisible to consumers. Tests (8 new InFlightCounterTests, all pass): - Start+Complete nets to zero. - Nested starts sum; Complete decrements. - Complete-without-Start clamps to zero. - Different hosts track independently. - Concurrent starts (500 parallel) don't lose count. - CapabilityInvoker observed-mid-call depth == 1 during a pending call. - CapabilityInvoker exception path still decrements (try/finally). - CapabilityInvoker without tracker doesn't throw. Full solution dotnet test: 1243 passing (was 1235, +8). Pre-existing Client.CLI Subscribe flake unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
4de94fab0d |
Phase 6.1 Stream A remaining — IPerCallHostResolver + DriverNodeManager per-call host dispatch (decision #144)
Closes the per-device isolation gap flagged at the Phase 6.1 Stream A wire-up (PR #78 used driver.DriverInstanceId as the pipeline host for every call, so multi-host drivers like Modbus with N PLCs shared one pipeline — one dead PLC poisoned sibling breakers). Decision #144 requires per-device isolation; this PR wires it without breaking single-host drivers. Core.Abstractions: - IPerCallHostResolver interface. Optional driver capability. Drivers with multi-host topology (Modbus across N PLCs, AB CIP across a rack, etc.) implement this; single-host drivers (Galaxy, S7 against one PLC, OpcUaClient against one remote server) leave it alone. Must be fast + allocation-free — called once per tag on the hot path. Unknown refs return empty so dispatch falls back to single-host without throwing. Server/OpcUa/DriverNodeManager: - Captures `driver as IPerCallHostResolver` at construction alongside the existing capability casts. - New `ResolveHostFor(fullReference)` helper returns either the resolver's answer or the driver's DriverInstanceId (single-host fallback). Empty / whitespace resolver output also falls back to DriverInstanceId. - Every dispatch site now passes `ResolveHostFor(fullRef)` to the invoker instead of `_driver.DriverInstanceId` — OnReadValue, OnWriteValue, all four HistoryRead paths. The HistoryRead Events path tolerates fullRef=null and falls back to DriverInstanceId for those cluster-wide event queries. - Drivers without IPerCallHostResolver observe zero behavioural change: every call still keys on DriverInstanceId, same as before. Tests (4 new PerCallHostResolverDispatchTests, all pass): - DeadPlc_DoesNotOpenBreaker_For_HealthyPlc_With_Resolver — 2 PLCs behind one driver; hammer the dead PLC past its breaker threshold; assert the healthy PLC's first call succeeds on its first attempt (decision #144). - EmptyString / unknown-ref fallback behaviour documented via test. - WithoutResolver_SameHost_Shares_One_Pipeline — regression guard for the single-host pre-existing behaviour. - WithResolver_TwoHosts_Get_Two_Pipelines — builds the CachedPipelineCount assertion to confirm the shared-builder cache keys correctly. Full solution dotnet test: 1219 passing (was 1215, +4). Pre-existing Client.CLI Subscribe flake unchanged. Adoption: Modbus driver (#120 follow-up), AB CIP / AB Legacy / TwinCAT drivers (also #120) implement the interface and return the per-tag PLC host string. Single-host drivers stay silent and pay zero cost. Remaining sub-items of #160 still deferred: - IAlarmSource.SubscribeAlarmsAsync + AcknowledgeAsync invoker wrapping. Non-trivial because alarm subscription is push-based from driver through IAlarmConditionSink — the wrap has to happen at the driver-to-server glue rather than a synchronous dispatch site. - Roslyn analyzer asserting every capability-interface call routes through CapabilityInvoker. Substantial (separate analyzer project + test harness); noise-value ratio favors shipping this post-v2-GA once the coverage is known-stable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7b50118b68 |
Phase 6.1 Stream A follow-up — DriverInstance.ResilienceConfig JSON column + parser + OtOpcUaServer wire-in
Closes the Phase 6.1 Stream A.2 "per-instance overrides bound from DriverInstance.ResilienceConfig JSON column" work flagged as a follow-up when Stream A.1 shipped in PR #78. Every driver can now override its Polly pipeline policy per instance instead of inheriting pure tier defaults. Configuration: - DriverInstance entity gains a nullable `ResilienceConfig` string column (nvarchar(max)) + SQL check constraint `CK_DriverInstance_ResilienceConfig_IsJson` that enforces ISJSON when not null. Null = use tier defaults (decision #143 / unchanged from pre-Phase-6.1). - EF migration `20260419161008_AddDriverInstanceResilienceConfig`. - SchemaComplianceTests expected-constraint list gains the new CK name. Core.Resilience.DriverResilienceOptionsParser: - Pure-function parser. ParseOrDefaults(tier, json, out diag) returns the effective DriverResilienceOptions — tier defaults with per-capability / bulkhead overrides layered on top when the JSON payload supplies them. Partial policies (e.g. Read { retryCount: 10 }) fill missing fields from the tier default for that capability. - Malformed JSON falls back to pure tier defaults + surfaces a human-readable diagnostic via the out parameter. Callers log the diag but don't fail startup — a misconfigured ResilienceConfig must not brick a working driver. - Property names + capability keys are case-insensitive; unrecognised capability names are logged-and-skipped; unrecognised shape-level keys are ignored so future shapes land without a migration. Server wire-in: - OtOpcUaServer gains two optional ctor params: `tierLookup` (driverType → DriverTier) + `resilienceConfigLookup` (driverInstanceId → JSON string). CreateMasterNodeManager now resolves tier + JSON for each driver, parses via DriverResilienceOptionsParser, logs the diagnostic if any, and constructs CapabilityInvoker with the merged options instead of pure Tier A defaults. - OpcUaApplicationHost threads both lookups through. Default null keeps existing tests constructing without either Func unchanged (falls back to Tier A + tier defaults exactly as before). Tests (13 new DriverResilienceOptionsParserTests): - null / whitespace / empty-object JSON returns pure tier defaults. - Malformed JSON falls back + surfaces diagnostic. - Read override merged into tier defaults; other capabilities untouched. - Partial policy fills missing fields from tier default. - Bulkhead overrides honored. - Unknown capability skipped + surfaced in diagnostic. - Property names + capability keys are case-insensitive. - Every tier × every capability × empty-JSON round-trips tier defaults exactly (theory). Full solution dotnet test: 1215 passing (was 1202, +13). Pre-existing Client.CLI Subscribe flake unchanged. Production wiring (Program.cs) example: Func<string, DriverTier> tierLookup = type => type switch { "Galaxy" => DriverTier.C, "Modbus" or "S7" => DriverTier.B, "OpcUaClient" => DriverTier.A, _ => DriverTier.A, }; Func<string, string?> cfgLookup = id => db.DriverInstances.AsNoTracking().FirstOrDefault(x => x.DriverInstanceId == id)?.ResilienceConfig; var host = new OpcUaApplicationHost(..., tierLookup: tierLookup, resilienceConfigLookup: cfgLookup); Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
cbcaf6593a |
Phase 6.1 Stream E (data layer) — DriverInstanceResilienceStatus entity + DriverResilienceStatusTracker + EF migration
Ships the data + runtime layer of Stream E. The SignalR hub and Blazor /hosts page refresh (E.2-E.3) are follow-up work paired with the visual-compliance review per Phase 6.4 patterns — documented as a deferred follow-up below. Configuration: - New entity DriverInstanceResilienceStatus with: DriverInstanceId, HostName (composite PK), LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc, BaselineFootprintBytes, CurrentFootprintBytes, LastSampledUtc. - Separate from DriverHostStatus (per-host connectivity view) so a Running host that has tripped its breaker or is nearing its memory ceiling shows up distinctly on Admin /hosts. Admin page left-joins both for display. - OtOpcUaConfigDbContext + Fluent-API config + IX_DriverResilience_LastSampled index for the stale-sample filter query. - EF migration: 20260419124034_AddDriverInstanceResilienceStatus. Core.Resilience: - DriverResilienceStatusTracker — process-singleton in-memory tracker keyed on (DriverInstanceId, HostName). CapabilityInvoker + MemoryTracking + MemoryRecycle callers record failure/success/breaker-open/recycle/footprint events; a HostedService (Stream E.2 follow-up) samples this tracker every 5 s and persists to the DB. Pure in-memory keeps tests fast + the core free of EF/SQL dependencies. Tests: - DriverResilienceStatusTrackerTests (9 new, all pass): tryget-before-write returns null; failures accumulate; success resets; breaker/recycle/footprint fields populate; per-host isolation; snapshot returns all pairs; concurrent writes don't lose counts. - SchemaComplianceTests: expected-tables list updated to include the new DriverInstanceResilienceStatus table. Full solution dotnet test: 1042 passing (baseline 906, +136 for Phase 6.1 so far across Streams A/B/C/D/E.1). Pre-existing Client.CLI Subscribe flake unchanged. Deferred to follow-up PR (E.2/E.3): - ResilienceStatusPublisher HostedService that samples DriverResilienceStatusTracker every 5 s + upserts DriverInstanceResilienceStatus rows. - Admin FleetStatusHub SignalR hub pushing LastCircuitBreakerOpenUtc / CurrentBulkheadDepth / LastRecycleUtc on change. - Admin /hosts Blazor column additions (red badge when ConsecutiveFailures > breakerThreshold / 2). Visual-compliance reviewer signoff alongside Phase 6.4 admin-ui patterns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
29bcaf277b |
Phase 6.1 Stream A.3 complete — wire CapabilityInvoker into DriverNodeManager dispatch end-to-end
Every OnReadValue / OnWriteValue now routes through the process-singleton DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch paths gain timeout + per-capability retry + per-(driver, host) circuit breaker + bulkhead without touching the individual driver implementations. Wiring: - OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor parameter (default null → instance-owned builder). Keeps the 3 test call sites that construct OpcUaApplicationHost directly unchanged. - OtOpcUaServer: requires the builder in its ctor; constructs one CapabilityInvoker per driver at CreateMasterNodeManager time with default Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver- type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the DriverInstance.ResilienceConfig JSON column for per-instance overrides. - DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName, ...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName, isIdempotent, ...) where isIdempotent comes from the new _writeIdempotentByFullRef map populated at Variable() registration from DriverAttributeInfo.WriteIdempotent. HostName defaults to driver.DriverInstanceId for now — a single-host pipeline per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per- call host resolution in a follow-up so failing PLCs can trip per-PLC breakers without poisoning siblings (decision #144). Test fixup: - FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures: bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter can exceed 2s under parallel-test-run CPU pressure; the test asserts retry behavior, not timeout budget, so the longer slack keeps it deterministic. Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe flake unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b6d2803ff6 |
Phase 6.1 Stream A — switch pipeline keys from Guid to string to match IDriver.DriverInstanceId
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse at the boundary. Switching the Resilience types to string removes that friction and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in the upcoming server-dispatch wiring PR. - DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey - CapabilityInvoker.ctor + _driverInstanceId field Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now use distinct "drv-keep" / "drv-drop" literals (previously both were distinct Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same literal — caught pre-commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f3850f8914 |
Phase 6.1 Stream A.5/A.6 — WriteIdempotent flag on DriverAttributeInfo + Modbus/S7 tag records + FlakeyDriver integration tests
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143. Default is false — writes never auto-retry unless the driver author has marked the tag as safe to replay. Core.Abstractions: - DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the positional record (back-compatible; every existing call site uses the default). Driver.Modbus: - ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates documented in the param XML: holding-register set-points, configuration registers. Unsafe: edge-triggered coils, counter-increment addresses. - ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into DriverAttributeInfo.WriteIdempotent. Driver.S7: - S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates: DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive edge-triggered program routines. - S7Driver.DiscoverAsync propagates the flag. Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise the invoker + flaky-driver contract the plan enumerates: - Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10). - Non-idempotent write with RetryCount=5 configured still fails on the first failure — no replay (decision #44 guard at the ExecuteWriteAsync surface). - Idempotent write with 2 transient failures succeeds on the 3rd attempt. - Two hosts on the same driver have independent breakers — dead-host trips its breaker but live-host's first call still succeeds. Propagation tests: - ModbusDriverTests: SetPoint WriteIdempotent=true flows into DriverAttributeInfo; PulseCoil default=false. - S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit. Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A so far). Pre-existing Client.CLI Subscribe flake unchanged. Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the server-side integration piece + needs DI wiring for the pipeline builder — lands in the next PR on this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
90f7792c92 |
Phase 6.1 Stream A.3 — CapabilityInvoker wraps driver-capability calls through the shared pipeline
One invoker per (DriverInstance, IDriver) pair; calls ExecuteAsync(capability, host, callSite) and the invoker resolves the correct pipeline from the shared DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit + pipeline-invalidate takes effect without restarting the invoker or the driver host. ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface: - isIdempotent=false routes through a side pipeline with RetryCount=0 regardless of what the caller configured. The cache key carries a "::non-idempotent" suffix so it never collides with the retry-enabled write pipeline. - isIdempotent=true routes through the normal Write pipeline. If the user has configured Write retries (opt-in), the idempotent tag gets them; otherwise default-0 still wins. The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag definition once at driver-init time and feeds the boolean into ExecuteWriteAsync. Tests (6 new): - Read retries on transient failure; returns value from call site. - Write non-idempotent does NOT retry even when policy has 3 retries configured (the explicit decision-#44 guard at the dispatch surface). - Write idempotent retries when policy allows. - Write with default tier-A policy (RetryCount=0) never retries regardless of idempotency flag. - Different hosts get independent pipelines. Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment on WriteIdempotentAttribute no longer references a non-existent type). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
c04b13f436 |
Phase 6.1 Stream A.1/A.2/A.6 — Polly resilience foundation: pipeline builder + per-tier policy defaults + WriteIdempotent attribute
Lands the first chunk of the Phase 6.1 Stream A resilience layer per docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A. Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up PRs on the same branch. Core.Abstractions additions: - WriteIdempotentAttribute — marker for tag-definition records that opt into auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45, #143. Read once via reflection at driver-init time; no per-write cost. - DriverCapability enum — enumerates the 8 capability surface points (Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge / HistoryRead). AlarmAcknowledge is write-shaped (no retry by default). - DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this into DriverTypeMetadata; surfaced here because the resilience policy defaults key on it. Core.Resilience new namespace: - DriverResilienceOptions — per-tier × per-capability policy defaults. GetTierDefaults(tier) is the source of truth: * Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5 * Tier B: Read 4s/3, Write 4s/0, breaker threshold 5 * Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles process-level breaker per decision #68) Resolve(capability) overlays CapabilityPolicies on top of the defaults. - DriverResiliencePipelineBuilder — composes Timeout → Retry (capability- permitting, never on cancellation) → CircuitBreaker (tier-permitting) → Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on (DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead PLC behind a multi-device driver does not open the breaker for healthy siblings. Invalidate(driverInstanceId) supports Admin-triggered reload. Tests (30 new, all pass): - DriverResilienceOptionsTests: tier-default coverage for every capability, Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker, resolve-with-override layering. - DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT retry on failure (decision #44 guard), dead-host isolation from sibling hosts, pipeline reuse for same triple, per-capability isolation, breaker opens after threshold on Tier A, timeout fires, cancellation is not retried, invalidation scoped to matching instance. Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing (baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged by this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |