d2f3a243cd76e8e91ca4808711cb1ec223b5eaa6
7 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
29bcaf277b |
Phase 6.1 Stream A.3 complete — wire CapabilityInvoker into DriverNodeManager dispatch end-to-end
Every OnReadValue / OnWriteValue now routes through the process-singleton DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch paths gain timeout + per-capability retry + per-(driver, host) circuit breaker + bulkhead without touching the individual driver implementations. Wiring: - OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor parameter (default null → instance-owned builder). Keeps the 3 test call sites that construct OpcUaApplicationHost directly unchanged. - OtOpcUaServer: requires the builder in its ctor; constructs one CapabilityInvoker per driver at CreateMasterNodeManager time with default Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver- type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the DriverInstance.ResilienceConfig JSON column for per-instance overrides. - DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName, ...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName, isIdempotent, ...) where isIdempotent comes from the new _writeIdempotentByFullRef map populated at Variable() registration from DriverAttributeInfo.WriteIdempotent. HostName defaults to driver.DriverInstanceId for now — a single-host pipeline per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per- call host resolution in a follow-up so failing PLCs can trip per-PLC breakers without poisoning siblings (decision #144). Test fixup: - FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures: bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter can exceed 2s under parallel-test-run CPU pressure; the test asserts retry behavior, not timeout budget, so the longer slack keeps it deterministic. Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe flake unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b6d2803ff6 |
Phase 6.1 Stream A — switch pipeline keys from Guid to string to match IDriver.DriverInstanceId
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse at the boundary. Switching the Resilience types to string removes that friction and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in the upcoming server-dispatch wiring PR. - DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey - CapabilityInvoker.ctor + _driverInstanceId field Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now use distinct "drv-keep" / "drv-drop" literals (previously both were distinct Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same literal — caught pre-commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f3850f8914 |
Phase 6.1 Stream A.5/A.6 — WriteIdempotent flag on DriverAttributeInfo + Modbus/S7 tag records + FlakeyDriver integration tests
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143. Default is false — writes never auto-retry unless the driver author has marked the tag as safe to replay. Core.Abstractions: - DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the positional record (back-compatible; every existing call site uses the default). Driver.Modbus: - ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates documented in the param XML: holding-register set-points, configuration registers. Unsafe: edge-triggered coils, counter-increment addresses. - ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into DriverAttributeInfo.WriteIdempotent. Driver.S7: - S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates: DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive edge-triggered program routines. - S7Driver.DiscoverAsync propagates the flag. Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise the invoker + flaky-driver contract the plan enumerates: - Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10). - Non-idempotent write with RetryCount=5 configured still fails on the first failure — no replay (decision #44 guard at the ExecuteWriteAsync surface). - Idempotent write with 2 transient failures succeeds on the 3rd attempt. - Two hosts on the same driver have independent breakers — dead-host trips its breaker but live-host's first call still succeeds. Propagation tests: - ModbusDriverTests: SetPoint WriteIdempotent=true flows into DriverAttributeInfo; PulseCoil default=false. - S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit. Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A so far). Pre-existing Client.CLI Subscribe flake unchanged. Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the server-side integration piece + needs DI wiring for the pipeline builder — lands in the next PR on this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
90f7792c92 |
Phase 6.1 Stream A.3 — CapabilityInvoker wraps driver-capability calls through the shared pipeline
One invoker per (DriverInstance, IDriver) pair; calls ExecuteAsync(capability, host, callSite) and the invoker resolves the correct pipeline from the shared DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit + pipeline-invalidate takes effect without restarting the invoker or the driver host. ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface: - isIdempotent=false routes through a side pipeline with RetryCount=0 regardless of what the caller configured. The cache key carries a "::non-idempotent" suffix so it never collides with the retry-enabled write pipeline. - isIdempotent=true routes through the normal Write pipeline. If the user has configured Write retries (opt-in), the idempotent tag gets them; otherwise default-0 still wins. The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag definition once at driver-init time and feeds the boolean into ExecuteWriteAsync. Tests (6 new): - Read retries on transient failure; returns value from call site. - Write non-idempotent does NOT retry even when policy has 3 retries configured (the explicit decision-#44 guard at the dispatch surface). - Write idempotent retries when policy allows. - Write with default tier-A policy (RetryCount=0) never retries regardless of idempotency flag. - Different hosts get independent pipelines. Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment on WriteIdempotentAttribute no longer references a non-existent type). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
c04b13f436 |
Phase 6.1 Stream A.1/A.2/A.6 — Polly resilience foundation: pipeline builder + per-tier policy defaults + WriteIdempotent attribute
Lands the first chunk of the Phase 6.1 Stream A resilience layer per docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A. Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up PRs on the same branch. Core.Abstractions additions: - WriteIdempotentAttribute — marker for tag-definition records that opt into auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45, #143. Read once via reflection at driver-init time; no per-write cost. - DriverCapability enum — enumerates the 8 capability surface points (Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge / HistoryRead). AlarmAcknowledge is write-shaped (no retry by default). - DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this into DriverTypeMetadata; surfaced here because the resilience policy defaults key on it. Core.Resilience new namespace: - DriverResilienceOptions — per-tier × per-capability policy defaults. GetTierDefaults(tier) is the source of truth: * Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5 * Tier B: Read 4s/3, Write 4s/0, breaker threshold 5 * Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles process-level breaker per decision #68) Resolve(capability) overlays CapabilityPolicies on top of the defaults. - DriverResiliencePipelineBuilder — composes Timeout → Retry (capability- permitting, never on cancellation) → CircuitBreaker (tier-permitting) → Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on (DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead PLC behind a multi-device driver does not open the breaker for healthy siblings. Invalidate(driverInstanceId) supports Admin-triggered reload. Tests (30 new, all pass): - DriverResilienceOptionsTests: tier-default coverage for every capability, Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker, resolve-with-override layering. - DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT retry on failure (decision #44 guard), dead-host isolation from sibling hosts, pipeline reuse for same triple, per-capability isolation, breaker opens after threshold on Tier A, timeout fires, cancellation is not retried, invalidation scoped to matching instance. Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing (baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged by this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
190d09cdeb |
Phase 3 PR 15 — alarm-condition contract in IAddressSpaceBuilder + wire OnAlarmEvent through GenericDriverNodeManager. IAddressSpaceBuilder.IVariableHandle gains MarkAsAlarmCondition(AlarmConditionInfo) which returns an IAlarmConditionSink. AlarmConditionInfo carries SourceName/InitialSeverity/InitialDescription. Concrete address-space builders (the upcoming PR 16 OPC UA server backend) materialize a sibling AlarmConditionState node on the first call; the sink receives every lifecycle transition the generic node manager forwards. GenericDriverNodeManager gains a CapturingBuilder wrapper that transparently wraps every Folder/Variable call — the wrapper observes MarkAsAlarmCondition calls without participating in materialization, captures the resulting IAlarmConditionSink into an internal source-node-id → sink ConcurrentDictionary keyed by IVariableHandle.FullReference. After DiscoverAsync completes, if the driver implements IAlarmSource the node manager subscribes to OnAlarmEvent and routes every AlarmEventArgs to the sink registered for args.SourceNodeId — unknown source ids are dropped silently (may belong to another driver or to a variable the builder chose not to flag). Dispose unsubscribes the forwarder to prevent dangling invocation-list references across node-manager rebuilds. GalaxyProxyDriver.DiscoverAsync now calls handle.MarkAsAlarmCondition(new AlarmConditionInfo(fullName, AlarmSeverity.Medium, null)) on every attr.IsAlarm=true variable — severity seed is Medium because the live Priority byte arrives through the subsequent GalaxyAlarmEvent stream (which PR 14's GalaxyAlarmTracker now emits); the Admin UI sees the severity update on the first transition. RecordingAddressSpaceBuilder in Driver.Galaxy.E2E gains a RecordedAlarmCondition list + a RecordingSink implementation that captures AlarmEventArgs for test assertion — the E2E parity suite can now verify alarm-condition registration shape in addition to folder/variable shape. Tests (4 new GenericDriverNodeManagerTests): Alarm_events_are_routed_to_the_sink_registered_for_the_matching_source_node_id — 2 alarms registered (Tank.HiHi + Heater.OverTemp), driver raises an event for Tank.HiHi, the Tank.HiHi sink captures the payload, the Heater.OverTemp sink does not (tag-scoped fan-out, not broadcast); Non_alarm_variables_do_not_register_sinks — plain Tank.Level in the same discover is not in TrackedAlarmSources; Unknown_source_node_id_is_dropped_silently — a transition for Unknown.Source doesn't reach any sink + no exception; Dispose_unsubscribes_from_OnAlarmEvent — post-dispose, a transition for a previously-registered tag is no-op because the forwarder detached. InternalsVisibleTo('ZB.MOM.WW.OtOpcUa.Core.Tests') added to Core csproj so TrackedAlarmSources internal property is visible to the test. Full solution: 0 errors, 152 unit tests pass (8 Core + 14 Proxy + 14 Admin + 24 Configuration + 6 Shared + 84 Galaxy.Host + 2 Server). PR 16 will implement the concrete OPC UA address-space builder that materializes AlarmConditionState from this contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
01fd90c178 |
Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |