a2c7fda5f5e20085457873bc006ce3c1e1b2b468
15 Commits
ae8f226e45
Phase 6.1 Stream E.3 partial — in-flight counter feeds CurrentBulkheadDepth

Closes the observer half of #162 that was flagged as "persisted as 0 today" in PR #105. The Admin /hosts column refresh + FleetStatusHub SignalR push + red-badge visual still belong to the visual-compliance pass.

Core.Resilience:
- DriverResilienceStatusTracker gains RecordCallStart + RecordCallComplete + a CurrentInFlight field on the snapshot record. Concurrent-safe via the same ConcurrentDictionary.AddOrUpdate pattern as the other recorder methods. Clamps to zero on over-decrement so a stray Complete-without-Start can't drive the counter negative (see the sketch after this entry).
- CapabilityInvoker gains an optional statusTracker ctor parameter. When wired, every ExecuteAsync / ExecuteAsync(void) wraps the pipeline call in try / finally that records start/complete — so the counter advances cleanly whether the call succeeds, cancels, or throws. A null tracker keeps the pre-Phase-6.1 Stream E.3 behaviour exactly.

Server.Hosting:
- ResilienceStatusPublisherHostedService persists CurrentInFlight as the DriverInstanceResilienceStatus.CurrentBulkheadDepth column (was 0 before this PR). One-line fix on both the insert + update branches.

The in-flight counter is a pragmatic proxy for Polly's internal bulkhead depth — a future PR wiring Polly telemetry would replace it with the real value. The shape of the column + the publisher + the Admin /hosts query doesn't change, so the follow-up is invisible to consumers.

Tests (8 new InFlightCounterTests, all pass):
- Start+Complete nets to zero.
- Nested starts sum; Complete decrements.
- Complete-without-Start clamps to zero.
- Different hosts track independently.
- Concurrent starts (500 parallel) don't lose count.
- CapabilityInvoker observed mid-call depth == 1 during a pending call.
- CapabilityInvoker exception path still decrements (try/finally).
- CapabilityInvoker without tracker doesn't throw.

Full solution dotnet test: 1243 passing (was 1235, +8). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
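
A minimal sketch of the clamped in-flight counter this entry describes. RecordCallStart, RecordCallComplete, the ConcurrentDictionary.AddOrUpdate pattern, and the clamp-to-zero rule come from the commit message; the key shape and class name are illustrative stand-ins, not the shipped DriverResilienceStatusTracker:

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative counter only; the real tracker keeps a richer snapshot record
// keyed on (DriverInstanceId, HostName).
public sealed class InFlightCounter
{
    private readonly ConcurrentDictionary<(string InstanceId, string Host), int> _inFlight = new();

    public void RecordCallStart(string instanceId, string host) =>
        _inFlight.AddOrUpdate((instanceId, host), 1, (_, current) => current + 1);

    // Clamped at zero: a stray Complete-without-Start cannot drive it negative.
    public void RecordCallComplete(string instanceId, string host) =>
        _inFlight.AddOrUpdate((instanceId, host), 0, (_, current) => Math.Max(0, current - 1));

    public int CurrentInFlight(string instanceId, string host) =>
        _inFlight.TryGetValue((instanceId, host), out var depth) ? depth : 0;
}
```

In the invoker, RecordCallStart runs before the pipeline call and RecordCallComplete sits in the finally block, which is why success, cancellation, and exceptions all decrement.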

---

7b50118b68
Phase 6.1 Stream A follow-up — DriverInstance.ResilienceConfig JSON column + parser + OtOpcUaServer wire-in

Closes the Phase 6.1 Stream A.2 "per-instance overrides bound from DriverInstance.ResilienceConfig JSON column" work flagged as a follow-up when Stream A.1 shipped in PR #78. Every driver can now override its Polly pipeline policy per instance instead of inheriting pure tier defaults.

Configuration:
- DriverInstance entity gains a nullable `ResilienceConfig` string column (nvarchar(max)) + SQL check constraint `CK_DriverInstance_ResilienceConfig_IsJson` that enforces ISJSON when not null. Null = use tier defaults (decision #143 / unchanged from pre-Phase-6.1).
- EF migration `20260419161008_AddDriverInstanceResilienceConfig`.
- SchemaComplianceTests expected-constraint list gains the new CK name.

Core.Resilience.DriverResilienceOptionsParser:
- Pure-function parser. ParseOrDefaults(tier, json, out diag) returns the effective DriverResilienceOptions — tier defaults with per-capability / bulkhead overrides layered on top when the JSON payload supplies them. Partial policies (e.g. Read { retryCount: 10 }) fill missing fields from the tier default for that capability (usage sketched after this entry).
- Malformed JSON falls back to pure tier defaults + surfaces a human-readable diagnostic via the out parameter. Callers log the diag but don't fail startup — a misconfigured ResilienceConfig must not brick a working driver.
- Property names + capability keys are case-insensitive; unrecognised capability names are logged-and-skipped; unrecognised shape-level keys are ignored so future shapes land without a migration.

Server wire-in:
- OtOpcUaServer gains two optional ctor params: `tierLookup` (driverType → DriverTier) + `resilienceConfigLookup` (driverInstanceId → JSON string). CreateMasterNodeManager now resolves tier + JSON for each driver, parses via DriverResilienceOptionsParser, logs the diagnostic if any, and constructs CapabilityInvoker with the merged options instead of pure Tier A defaults.
- OpcUaApplicationHost threads both lookups through. Default null keeps existing tests constructing without either Func unchanged (falls back to Tier A + tier defaults exactly as before).

Tests (13 new DriverResilienceOptionsParserTests):
- null / whitespace / empty-object JSON returns pure tier defaults.
- Malformed JSON falls back + surfaces diagnostic.
- Read override merged into tier defaults; other capabilities untouched.
- Partial policy fills missing fields from tier default.
- Bulkhead overrides honored.
- Unknown capability skipped + surfaced in diagnostic.
- Property names + capability keys are case-insensitive.
- Every tier × every capability × empty-JSON round-trips tier defaults exactly (theory).

Full solution dotnet test: 1215 passing (was 1202, +13). Pre-existing Client.CLI Subscribe flake unchanged.

Production wiring (Program.cs) example:

```csharp
Func<string, DriverTier> tierLookup = type => type switch
{
    "Galaxy" => DriverTier.C,
    "Modbus" or "S7" => DriverTier.B,
    "OpcUaClient" => DriverTier.A,
    _ => DriverTier.A,
};
Func<string, string?> cfgLookup = id =>
    db.DriverInstances.AsNoTracking().FirstOrDefault(x => x.DriverInstanceId == id)?.ResilienceConfig;
var host = new OpcUaApplicationHost(..., tierLookup: tierLookup, resilienceConfigLookup: cfgLookup);
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
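
A usage sketch of the parser contract above. Everything beyond the commit's `Read { retryCount: 10 }` fragment (the exact JSON property names, the `logger` and `instanceId` locals, and the Tier B choice) is an assumption for illustration:

```csharp
// Hypothetical payload: a partial Read override plus a bulkhead override.
// Property names are case-insensitive; unknown keys would be skipped.
const string json = """
{
  "read": { "retryCount": 10 },
  "bulkhead": { "maxConcurrency": 4 }
}
""";

var options = DriverResilienceOptionsParser.ParseOrDefaults(DriverTier.B, json, out var diagnostic);
if (diagnostic is not null)
    logger.LogWarning("ResilienceConfig diagnostic for {Id}: {Diagnostic}", instanceId, diagnostic);

// options now carries Tier B defaults everywhere except the overridden fields;
// a malformed payload would instead return pure Tier B defaults plus a
// diagnostic, never a startup failure.
```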

---

c1cab33e38
Phase 6.4 Stream D server-side — IdentificationFolderBuilder materializes OPC 40010 Machinery Identification sub-folder

Closes the server-side / non-UI piece of Phase 6.4 Stream D. The Razor `IdentificationFields.razor` component for Admin-UI editing ships separately when the Admin UI pass lands (still tracked under the #157 UI follow-up).

Core.OpcUa additions:
- IdentificationFolderBuilder — pure-function builder that materializes the OPC 40010 Machinery companion-spec Identification sub-folder per decision #139. Reads the nine nullable columns off an Equipment row: Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction (short → OPC UA Int32), AssetLocation, ManufacturerUri, DeviceManualUri. Emits one AddProperty call per non-null field; skips the sub-folder entirely when all nine are null so browse trees don't carry pointless empty folders.
- HasAnyFields(equipment) — cheap short-circuit so callers can decide whether to invoke Folder() at all.
- FolderName constant ("Identification") + FieldNames list exposed so downstream tools / tests can cross-reference without duplicating the decision-#139 field set.

ACL binding: the sub-folder + variables live under the Equipment node, so Phase 6.2's PermissionTrie treats them as part of the Equipment ScopeId — no new scope level. A user with an Equipment-level grant reads the Identification fields; a user without gets BadUserAccessDenied on both the Equipment node + its Identification variables. Documented in the class remarks; a cross-reference update to acl-design.md is a follow-up.

Tests (9 new IdentificationFolderBuilderTests):
- HasAnyFields all-null false / any-non-null true.
- Build all-null returns null + doesn't emit Folder.
- Build fully-populated emits all 9 fields in decision #139 order.
- Only non-null fields are emitted (3-of-9 case).
- YearOfConstruction short widens to DriverDataType.Int32 with int value.
- String values round-trip through AddProperty.
- FieldNames constant matches decision #139 exactly.
- FolderName is "Identification".

Full solution dotnet test: 1202 passing (was 1193, +9). Pre-existing Client.CLI Subscribe flake unchanged.

Production integration: the consumer of this is the address-space-build flow that walks the live Equipment table + calls IdentificationFolderBuilder.Build(equipmentFolder, equipment) under each Equipment node (sketched after this entry). That integration is the remaining Stream D follow-up alongside the Razor UI component.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
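
A sketch of the deferred address-space integration named above, assuming a `builder` handle for the parent nodes and a loaded `equipmentRows` list; only `HasAnyFields`, `Build`, and the skip-when-all-null rule come from the commit:

```csharp
foreach (var equipment in equipmentRows)
{
    // Hypothetical parent-node handle for this Equipment row.
    var equipmentFolder = builder.Folder(equipment.Name);

    // Cheap short-circuit: when all nine decision-#139 fields are null,
    // no Identification sub-folder is materialized at all.
    if (IdentificationFolderBuilder.HasAnyFields(equipment))
        IdentificationFolderBuilder.Build(equipmentFolder, equipment);
}
```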

---

bd53ebd192
Phase 6.2 exit gate — compliance script real-checks + phase doc = SHIPPED (core)

scripts/compliance/phase-6-2-compliance.ps1 replaces the stub TODOs with 23 real checks spanning:
- Stream A: LdapGroupRoleMapping entity + AdminRole enum + ILdapGroupRoleMappingService + impl + write-time invariant + EF migration all present.
- Stream B: OpcUaOperation enum + NodeScope + AuthorizationDecision tri-state + IPermissionEvaluator + PermissionTrie + Builder + Cache keyed on GenerationId + UserAuthorizationState with MembershipFreshnessInterval=15m and AuthCacheMaxStaleness=5m + TriePermissionEvaluator + HistoryRead uses its own flag.
- Control/data-plane separation: the evaluator + trie + cache + builder + interface all have zero references to LdapGroupRoleMapping (decision #150).
- Stream C foundation: ILdapGroupsBearer + AuthorizationGate with StrictMode knob. DriverNodeManager dispatch-path wiring (11 surfaces) is Deferred, tracked as task #143.
- Stream D data layer: ValidatedNodeAclAuthoringService + exception type + rejects None permissions. Blazor UI pieces (RoleGrantsTab, AclsTab, SignalR invalidation, draft diff) are Deferred, tracked as task #144.
- Cross-cutting: full solution dotnet test runs; 1097 >= 1042 baseline; tolerates the one pre-existing Client.CLI Subscribe flake.

The IPermissionEvaluator doc-comment is reworded to avoid mentioning the literal type name "LdapGroupRoleMapping" — the compliance check does a text-absence sweep for that identifier across the data-plane files.

docs/v2/implementation/phase-6-2-authorization-runtime.md status updated from DRAFT to SHIPPED (core). Two deferred follow-ups are explicitly called out so operators see what's still pending for the "Phase 6.2 fully wired end-to-end" milestone.

`Phase 6.2 compliance: PASS` — exit 0. Any regression that deletes a class or re-introduces an LdapGroupRoleMapping reference into the data-plane evaluator turns a green check red + exits non-zero.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---

40fb459040
Phase 6.2 Stream B — permission-trie evaluator in Core.Authorization
Ships Stream B.1-B.6 — the data-plane authorization engine Phase 6.2 runs on.
Integration into OPC UA dispatch (Stream C — Read / Write / HistoryRead /
Subscribe / Browse / Call etc.) is the next PR on this branch.
New Core.Abstractions:
- OpcUaOperation enum enumerates every OPC UA surface the evaluator gates:
Browse, Read, WriteOperate/Tune/Configure (split by SecurityClassification),
HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions,
Call, AlarmAcknowledge/Confirm/Shelve. Stream C maps each one back to its
dispatch call site.
New Core.Authorization namespace:
- NodeScope record + NodeHierarchyKind — 6-level scope addressing for
Equipment-kind (UNS) namespaces, folder-segment walk for SystemPlatform-kind
(Galaxy). NodeScope carries a Kind selector so the evaluator knows which
hierarchy to descend.
- AuthorizationDecision { Verdict, Provenance } + AuthorizationVerdict
{Allow, NotGranted, Denied} + MatchedGrant. Tri-state per decision #149;
Phase 6.2 only produces Allow + NotGranted, Denied stays reserved for v2.1
Explicit Deny without API break.
- IPermissionEvaluator.Authorize(session, operation, scope).
- PermissionTrie + PermissionTrieNode + TrieGrant. In-memory trie keyed on
the ACL scope hierarchy. CollectMatches walks Cluster → Namespace →
UnsArea → UnsLine → Equipment → Tag (or → FolderSegment(s) → Tag on
Galaxy). Pure additive union — matches that share an LDAP group with the
session contribute flags; OR across levels.
- PermissionTrieBuilder static factory. Build(clusterId, generationId, rows,
scopePaths?) returns a trie for one generation. Cross-cluster rows are
filtered out so the trie is cluster-coherent. Stream C follow-up wires a
real scopePaths lookup from the live DB; tests supply hand-built paths.
- PermissionTrieCache — process-singleton, keyed on (ClusterId, GenerationId).
Install(trie) adds a generation + promotes to "current" when the id is
highest-known (handles out-of-order installs gracefully). Prior generations
retained so an in-flight request against a prior trie still succeeds; GC
via Prune(cluster, keepLatest).
- UserAuthorizationState — per-session cache of resolved LDAP groups +
AuthGenerationId + MembershipVersion + MembershipResolvedUtc. Bounded by
MembershipFreshnessInterval (default 15 min per decision #151) +
AuthCacheMaxStaleness (default 5 min per decision #152).
- TriePermissionEvaluator — default IPermissionEvaluator. Fails closed on
stale sessions (IsStale check short-circuits to NotGranted), on cross-
cluster requests, on empty trie cache. Maps OpcUaOperation → NodePermissions
via MapOperationToPermission (total — every enum value has a mapping; tested).
Tests (27 new, all pass):
- PermissionTrieTests (7): cluster-level grant cascades to every tag;
equipment-level grant doesn't leak to sibling equipment; multi-group union
ORs flags; no-matching-group returns empty; Galaxy folder-segment grant
doesn't leak to sibling folder; cross-cluster rows don't land in this
cluster's trie; build is idempotent (B.6 invariants).
- TriePermissionEvaluatorTests (8): allow when flag matches; NotGranted when
no matching group; NotGranted when flags insufficient; HistoryRead requires
its own bit (decision-level requirement); cross-cluster session denied;
stale session fails closed; no cached trie denied; MapOperationToPermission
is total across every OpcUaOperation.
- PermissionTrieCacheTests (8): empty cache returns null; install-then-get
round-trips; new generation becomes current; out-of-order install doesn't
downgrade current; invalidate drops one cluster; prune retains most recent;
prune no-op when fewer than keep; cluster isolation.
- UserAuthorizationStateTests (4): fresh is not stale; IsStale after 5 min
default; NeedsRefresh true between freshness + staleness windows.
Full solution dotnet test: 1078 passing (baseline 906, Phase 6.1 = 1042,
Phase 6.2 Stream A = +9, Stream B = +27 = 1078). Pre-existing Client.CLI
Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
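
A self-contained sketch of the pure additive union that CollectMatches performs: every grant on the Cluster → ... → Tag walk that shares an LDAP group with the session ORs its flags into the result. The flag values and type shapes here are illustrative, not the shipped PermissionTrie API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

[Flags]
public enum NodePermissions { None = 0, Browse = 1, Read = 2, Write = 4, HistoryRead = 8 }

public static class GrantUnion
{
    // matchesAlongPath: every grant encountered on the scope walk, paired
    // with the LDAP group it was issued to. Matching grants OR together;
    // nothing ever subtracts (Denied stays reserved for v2.1).
    public static NodePermissions Collect(
        IEnumerable<(string LdapGroup, NodePermissions Flags)> matchesAlongPath,
        IReadOnlySet<string> sessionGroups) =>
        matchesAlongPath
            .Where(grant => sessionGroups.Contains(grant.LdapGroup))
            .Aggregate(NodePermissions.None, (acc, grant) => acc | grant.Flags);
}
```

TriePermissionEvaluator then checks that union against the single flag MapOperationToPermission yields for the requested OpcUaOperation, after the fail-closed short-circuits (stale session, cross-cluster request, empty trie cache) have run.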

---

cbcaf6593a
Phase 6.1 Stream E (data layer) — DriverInstanceResilienceStatus entity + DriverResilienceStatusTracker + EF migration

Ships the data + runtime layer of Stream E. The SignalR hub and Blazor /hosts page refresh (E.2-E.3) are follow-up work paired with the visual-compliance review per Phase 6.4 patterns — documented as a deferred follow-up below.

Configuration:
- New entity DriverInstanceResilienceStatus with: DriverInstanceId, HostName (composite PK), LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc, BaselineFootprintBytes, CurrentFootprintBytes, LastSampledUtc.
- Separate from DriverHostStatus (the per-host connectivity view) so a Running host that has tripped its breaker or is nearing its memory ceiling shows up distinctly on Admin /hosts. The Admin page left-joins both for display.
- OtOpcUaConfigDbContext + Fluent-API config + IX_DriverResilience_LastSampled index for the stale-sample filter query (see the sketch after this entry).
- EF migration: 20260419124034_AddDriverInstanceResilienceStatus.

Core.Resilience:
- DriverResilienceStatusTracker — process-singleton in-memory tracker keyed on (DriverInstanceId, HostName). CapabilityInvoker + MemoryTracking + MemoryRecycle callers record failure/success/breaker-open/recycle/footprint events; a HostedService (Stream E.2 follow-up) samples this tracker every 5 s and persists to the DB. Pure in-memory keeps tests fast + the core free of EF/SQL dependencies.

Tests:
- DriverResilienceStatusTrackerTests (9 new, all pass): tryget-before-write returns null; failures accumulate; success resets; breaker/recycle/footprint fields populate; per-host isolation; snapshot returns all pairs; concurrent writes don't lose counts.
- SchemaComplianceTests: expected-tables list updated to include the new DriverInstanceResilienceStatus table.

Full solution dotnet test: 1042 passing (baseline 906, +136 for Phase 6.1 so far across Streams A/B/C/D/E.1). Pre-existing Client.CLI Subscribe flake unchanged.

Deferred to follow-up PR (E.2/E.3):
- ResilienceStatusPublisher HostedService that samples DriverResilienceStatusTracker every 5 s + upserts DriverInstanceResilienceStatus rows.
- Admin FleetStatusHub SignalR hub pushing LastCircuitBreakerOpenUtc / CurrentBulkheadDepth / LastRecycleUtc on change.
- Admin /hosts Blazor column additions (red badge when ConsecutiveFailures > breakerThreshold / 2). Visual-compliance reviewer signoff alongside Phase 6.4 admin-ui patterns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
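
A sketch of the Fluent-API configuration the entity implies, assuming standard EF Core; the composite key columns and the index name are from the commit, the rest of the context is elided:

```csharp
using Microsoft.EntityFrameworkCore;

public partial class OtOpcUaConfigDbContext : DbContext
{
    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity<DriverInstanceResilienceStatus>(entity =>
        {
            // Composite PK: one status row per (driver instance, host) pair.
            entity.HasKey(x => new { x.DriverInstanceId, x.HostName });

            // Backs the stale-sample filter query on Admin /hosts.
            entity.HasIndex(x => x.LastSampledUtc)
                  .HasDatabaseName("IX_DriverResilience_LastSampled");
        });
    }
}
```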

---

9dd5e4e745
Phase 6.1 Stream C — health endpoints on :4841 + LogContextEnricher + Serilog JSON sink + CapabilityInvoker enrichment

Closes Stream C per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Observability (new namespace):
- DriverHealthReport — pure-function aggregation over a DriverHealthSnapshot list. Empty fleet = Healthy. Any Faulted = Faulted. Any Unknown/Initializing (no Faulted) = NotReady. Any Degraded or Reconnecting (no Faulted, no NotReady) = Degraded. Else Healthy. HttpStatus(verdict) maps to the Stream C.1 state matrix: Healthy/Degraded → 200, NotReady/Faulted → 503 (sketched after this entry).
- LogContextEnricher — Serilog LogContext wrapper. Push(id, type, capability, correlationId) returns an IDisposable scope; inner log calls carry DriverInstanceId / DriverType / CapabilityName / CorrelationId structured properties automatically. NewCorrelationId = 12-hex-char GUID slice for cases where no OPC UA RequestHeader.RequestHandle is in flight.

CapabilityInvoker now threads LogContextEnricher around every ExecuteAsync / ExecuteWriteAsync call site. OtOpcUaServer passes driver.DriverType through so logs correlate to the driver type too. Every capability call emits structured fields per the Stream C.4 compliance check.

Server.Observability:
- HealthEndpointsHost — standalone HttpListener on http://localhost:4841/ (loopback avoids Windows URL-ACL elevation; remote probing via reverse proxy or an explicit netsh urlacl grant). Routes:
  * /healthz → 200 when (configDbReachable OR usingStaleConfig); 503 otherwise. Body: status, uptimeSeconds, configDbReachable, usingStaleConfig.
  * /readyz → DriverHealthReport.Aggregate + HttpStatus mapping. Body: verdict, drivers[], degradedDrivers[], uptimeSeconds.
  * anything else → 404.
  Disposal is cooperative with the HttpListener shutdown.
- OpcUaApplicationHost starts the health host after the OPC UA server comes up and disposes it on shutdown. New OpcUaServerOptions knobs: HealthEndpointsEnabled (default true), HealthEndpointsPrefix (default http://localhost:4841/).

Program.cs:
- Serilog pipeline adds Enrich.FromLogContext + an opt-in JSON file sink via the `Serilog:WriteJson = true` appsetting. Uses Serilog.Formatting.Compact's CompactJsonFormatter (one JSON object per line — SIEMs like Splunk, Datadog, Graylog ingest without a regex parser).

Server.Tests:
- Existing 3 OpcUaApplicationHost integration tests now set HealthEndpointsEnabled=false to avoid port :4841 collisions under parallel execution.
- New HealthEndpointsHostTests (9): /healthz healthy empty fleet; stale-config returns 200 with flag; unreachable+no-cache returns 503; /readyz empty/Healthy/Faulted/Degraded/Initializing drivers return correct status and bodies; unknown path → 404. Uses ephemeral ports via Interlocked counter.

Core.Tests:
- DriverHealthReportTests (8): empty fleet, all-healthy, any-Faulted trumps, any-NotReady without Faulted, Degraded without Faulted/NotReady, HttpStatus per-verdict theory.
- LogContextEnricherTests (8): all 4 properties attach; scope disposes cleanly; NewCorrelationId shape; null/whitespace driverInstanceId throws.
- CapabilityInvokerEnrichmentTests (2): inner logs carry structured properties; no context leak outside the call site.

Full solution dotnet test: 1016 passing (baseline 906, +110 for Phase 6.1 so far across Streams A+B+C). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
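
A self-contained sketch of the verdict precedence and status mapping described above; the condensed enums stand in for the real DriverHealthSnapshot types, but the ordering and numbers are the commit's:

```csharp
using System.Collections.Generic;
using System.Linq;

public enum DriverState { Healthy, Degraded, Reconnecting, Initializing, Unknown, Faulted }
public enum FleetVerdict { Healthy, Degraded, NotReady, Faulted }

public static class FleetHealth
{
    public static FleetVerdict Aggregate(IReadOnlyList<DriverState> drivers)
    {
        if (drivers.Count == 0) return FleetVerdict.Healthy;   // empty fleet is healthy
        if (drivers.Contains(DriverState.Faulted)) return FleetVerdict.Faulted;
        if (drivers.Any(d => d is DriverState.Unknown or DriverState.Initializing))
            return FleetVerdict.NotReady;
        if (drivers.Any(d => d is DriverState.Degraded or DriverState.Reconnecting))
            return FleetVerdict.Degraded;
        return FleetVerdict.Healthy;
    }

    // Stream C.1 state matrix: Healthy/Degraded -> 200, NotReady/Faulted -> 503.
    public static int HttpStatus(FleetVerdict verdict) =>
        verdict is FleetVerdict.Healthy or FleetVerdict.Degraded ? 200 : 503;
}
```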

---

1d9008e354
Phase 6.1 Stream B.3/B.4/B.5 — MemoryRecycle + ScheduledRecycleScheduler + demand-aware WedgeDetector

Closes out Stream B per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Abstractions:
- IDriverSupervisor — the process-level supervisor contract a Tier C driver's out-of-process topology provides (Galaxy Proxy/Supervisor implements this in a follow-up Driver.Galaxy wiring PR). Concerns: DriverInstanceId + RecycleAsync. Tier A/B drivers don't implement this; Stream B code asserts tier == C before ever calling it.

Core.Stability:
- MemoryRecycle — companion to MemoryTracking. On HardBreach, invokes the supervisor IFF tier == C AND a supervisor is wired. A Tier A/B HardBreach logs a promotion-to-Tier-C recommendation and returns false. Soft/None/Warming never triggers a recycle at any tier.
- ScheduledRecycleScheduler — Tier C opt-in periodic recycler per decision #67. Ctor throws for Tier A/B (structural guard — a scheduled recycle on an in-process driver would kill every OPC UA session and every co-hosted driver). TickAsync(now) advances the schedule by one interval per fire; RequestRecycleNowAsync drives an ad-hoc recycle without shifting the cron.
- WedgeDetector — demand-aware per decision #147. Classify(state, demand, now) returns (see the sketch after this entry):
  * NotApplicable when driver state != Healthy
  * Idle when Healthy + no pending work (bulkhead=0 && monitored=0 && historic=0)
  * Healthy when Healthy + pending work + progress within threshold
  * Faulted when Healthy + pending work + no progress within threshold
  The threshold clamps to a 60 s minimum. DemandSignal.HasPendingWork ORs the three counters. The three false-wedge cases the plan calls out all stay non-Faulted: idle subscription-only, slow historian backfill making progress, write-only burst with a drained bulkhead.

Tests (22 new, all pass):
- MemoryRecycleTests (7): Tier C hard-breach requests recycle; Tier A/B hard-breach never requests; Tier C without supervisor no-ops; soft-breach at every tier never requests; None/Warming never request.
- ScheduledRecycleSchedulerTests (6): ctor throws for A/B; zero/negative interval throws; tick before due no-ops; tick at/after due fires once and advances; RequestRecycleNow fires immediately without shifting schedule; multiple fires across ticks advance one interval each.
- WedgeDetectorTests (9): threshold clamp to 60 s; unhealthy driver always NotApplicable; idle subscription stays Idle; pending+fresh progress stays Healthy; pending+stale progress is Faulted; MonitoredItems active but no publish is Faulted; MonitoredItems active with fresh publish stays Healthy; historian backfill with fresh progress stays Healthy; write-only burst with empty bulkhead is Idle; HasPendingWork theory for any non-zero counter.

Full solution dotnet test: 989 passing (baseline 906, +83 for Phase 6.1 so far). Pre-existing Client.CLI Subscribe flake unchanged.

Stream B complete. Next up: Stream C (health endpoints + structured logging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
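
A sketch of the demand-aware classification; the four verdicts, the counter OR, and the 60 s clamp are from the commit, while the flattened parameters (a bare health flag plus a last-progress timestamp) are simplifying assumptions:

```csharp
using System;

public enum WedgeVerdict { NotApplicable, Idle, Healthy, Faulted }

public readonly record struct DemandSignal(int BulkheadDepth, int MonitoredItems, int HistoricReads)
{
    // Pending work is the OR of the three counters.
    public bool HasPendingWork => BulkheadDepth > 0 || MonitoredItems > 0 || HistoricReads > 0;
}

public static class WedgeClassifier
{
    public static WedgeVerdict Classify(
        bool driverHealthy,
        DemandSignal demand,
        DateTime lastProgressUtc,
        DateTime nowUtc,
        TimeSpan threshold)
    {
        if (!driverHealthy) return WedgeVerdict.NotApplicable;
        if (!demand.HasPendingWork) return WedgeVerdict.Idle;   // idle is never a wedge

        var effective = threshold < TimeSpan.FromSeconds(60)
            ? TimeSpan.FromSeconds(60)                          // clamp to the 60 s minimum
            : threshold;
        return nowUtc - lastProgressUtc <= effective
            ? WedgeVerdict.Healthy                              // pending work, fresh progress
            : WedgeVerdict.Faulted;                             // pending work, no progress
    }
}
```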

---

ef6b0bb8fc
Phase 6.1 Stream B.1/B.2 — DriverTier on DriverTypeMetadata + Core.Stability.MemoryTracking with hybrid-formula soft/hard thresholds

Stream B.1 — registry invariant:
- DriverTypeMetadata gains a required `DriverTier Tier` field. Every registered driver type must declare its stability tier so the downstream MemoryTracking, MemoryRecycle, and resilience-policy layers can resolve the right defaults. Stamped-at-registration-time enforcement makes the "every driver type has a non-null Tier" compliance check structurally impossible to fail.
- DriverTypeRegistry API unchanged; one new property on the record.

Stream B.2 — MemoryTracking (Core.Stability):
- Tier-agnostic tracker per decision #146: captures the baseline as the median of samples collected during a post-init warmup window (default 5 min), then classifies each subsequent sample with the hybrid formula `soft = max(multiplier × baseline, baseline + floor)`, `hard = 2 × soft` (sketched after this entry).
- Per-tier constants wired: Tier A mult=3 floor=50 MB, Tier B mult=3 floor=100 MB, Tier C mult=2 floor=500 MB.
- Never kills. The hard-breach action returns HardBreach; the supervisor that acts on that signal (MemoryRecycle) is Tier C only per decisions #74, #145 and lands in the next B.3 commit on this branch.
- Two phases: WarmingUp (samples collected, Warming returned) and Steady (baseline captured, soft/hard checks active). The transition is automatic when the warmup window elapses.

Tests (15 new, all pass):
- Warming phase returns Warming until the window elapses.
- Window-elapsed captures the median baseline + transitions to Steady.
- Per-tier constants match the decision #146 table exactly.
- Soft threshold uses max() — small baseline → floor wins; large baseline → multiplier wins.
- Hard = 2 × soft.
- Sample below soft = None; at soft = SoftBreach; at/above hard = HardBreach.
- DriverTypeRegistry: theory asserts Tier round-trips for A/B/C.

Full solution dotnet test: 963 passing (baseline 906, +57 net for Phase 6.1 Stream A + Stream B.1/B.2). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
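
The hybrid formula and decision-#146 constants as a worked sketch; the numbers are quoted from the commit, while the surrounding type is illustrative rather than the shipped MemoryTracking:

```csharp
using System;

public enum DriverTier { A, B, C }

public static class MemoryThresholds
{
    public static (long Soft, long Hard) For(DriverTier tier, long baselineBytes)
    {
        var (multiplier, floorBytes) = tier switch
        {
            DriverTier.A => (3, 50L * 1024 * 1024),
            DriverTier.B => (3, 100L * 1024 * 1024),
            DriverTier.C => (2, 500L * 1024 * 1024),
            _ => throw new ArgumentOutOfRangeException(nameof(tier)),
        };

        // soft = max(multiplier * baseline, baseline + floor); hard = 2 * soft
        var soft = Math.Max(multiplier * baselineBytes, baselineBytes + floorBytes);
        return (soft, 2 * soft);
    }
}
```

Worked through for Tier A: a 20 MB baseline gives soft = max(60 MB, 70 MB) = 70 MB (the floor wins), while a 200 MB baseline gives soft = max(600 MB, 250 MB) = 600 MB (the multiplier wins); hard is double either way.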

---

b6d2803ff6
Phase 6.1 Stream A — switch pipeline keys from Guid to string to match IDriver.DriverInstanceId

IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse at the boundary. Switching the Resilience types to string removes that friction and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in the upcoming server-dispatch wiring PR.
- DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey
- CapabilityInvoker.ctor + _driverInstanceId field

Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now use distinct "drv-keep" / "drv-drop" literals (previously both were distinct Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same literal — caught pre-commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---

90f7792c92
Phase 6.1 Stream A.3 — CapabilityInvoker wraps driver-capability calls through the shared pipeline

One invoker per (DriverInstance, IDriver) pair; callers invoke ExecuteAsync(capability, host, callSite) and the invoker resolves the correct pipeline from the shared DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit + pipeline-invalidate takes effect without restarting the invoker or the driver host.

ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface (routing rule sketched after this entry):
- isIdempotent=false routes through a side pipeline with RetryCount=0 regardless of what the caller configured. The cache key carries a "::non-idempotent" suffix so it never collides with the retry-enabled write pipeline.
- isIdempotent=true routes through the normal Write pipeline. If the user has configured Write retries (opt-in), the idempotent tag gets them; otherwise default-0 still wins.

The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag definition once at driver-init time and feeds the boolean into ExecuteWriteAsync.

Tests (6 new):
- Read retries on transient failure; returns value from call site.
- Write non-idempotent does NOT retry even when the policy has 3 retries configured (the explicit decision-#44 guard at the dispatch surface).
- Write idempotent retries when policy allows.
- Write with the default tier-A policy (RetryCount=0) never retries regardless of idempotency flag.
- Different hosts get independent pipelines.

Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment on WriteIdempotentAttribute no longer references a non-existent type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
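
The write-routing rule condensed into a pure function; the "::non-idempotent" suffix and the forced RetryCount=0 are from the commit, while the key format and function shape are assumptions (the real invoker feeds these into DriverResiliencePipelineBuilder):

```csharp
public static class WriteRouting
{
    // Returns the pipeline cache key and the retry count that pipeline is
    // allowed to use for a write against one (driverInstance, host) pair.
    public static (string CacheKey, int EffectiveRetryCount) RouteWrite(
        string driverInstanceId, string host, bool isIdempotent, int configuredWriteRetries)
    {
        if (!isIdempotent)
            // Decision-#44 guard: never retry a non-idempotent write, whatever
            // the caller configured. The suffix keeps this side pipeline from
            // ever colliding with the retry-enabled write pipeline.
            return ($"{driverInstanceId}::{host}::Write::non-idempotent", 0);

        // Idempotent writes use the normal Write pipeline; retries only happen
        // if the user opted in (tier defaults keep Write at 0 retries).
        return ($"{driverInstanceId}::{host}::Write", configuredWriteRetries);
    }
}
```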

---

c04b13f436
Phase 6.1 Stream A.1/A.2/A.6 — Polly resilience foundation: pipeline builder + per-tier policy defaults + WriteIdempotent attribute

Lands the first chunk of the Phase 6.1 Stream A resilience layer per docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A. Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up PRs on the same branch.

Core.Abstractions additions:
- WriteIdempotentAttribute — marker for tag-definition records that opt into auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45, #143. Read once via reflection at driver-init time; no per-write cost.
- DriverCapability enum — enumerates the 8 capability surface points (Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge / HistoryRead). AlarmAcknowledge is write-shaped (no retry by default).
- DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this into DriverTypeMetadata; surfaced here because the resilience policy defaults key on it.

Core.Resilience new namespace:
- DriverResilienceOptions — per-tier × per-capability policy defaults. GetTierDefaults(tier) is the source of truth (sketched after this entry):
  * Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5
  * Tier B: Read 4s/3, Write 4s/0, breaker threshold 5
  * Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (the supervisor handles the process-level breaker per decision #68)
  Resolve(capability) overlays CapabilityPolicies on top of the defaults.
- DriverResiliencePipelineBuilder — composes Timeout → Retry (capability-permitting, never on cancellation) → CircuitBreaker (tier-permitting) → Bulkhead. Pipelines are cached in a lock-free ConcurrentDictionary keyed on (DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead PLC behind a multi-device driver does not open the breaker for healthy siblings. Invalidate(driverInstanceId) supports Admin-triggered reload.

Tests (30 new, all pass):
- DriverResilienceOptionsTests: tier-default coverage for every capability; Write + AlarmAcknowledge never retry at any tier; Tier C disables the breaker; resolve-with-override layering.
- DriverResiliencePipelineBuilderTests: Read retries transients; Write does NOT retry on failure (decision #44 guard); dead-host isolation from sibling hosts; pipeline reuse for the same triple; per-capability isolation; breaker opens after threshold on Tier A; timeout fires; cancellation is not retried; invalidation scoped to the matching instance.

Polly.Core 8.6.6 added to Core.csproj.

Full solution dotnet test: 936 passing (baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged by this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
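
The GetTierDefaults table as a sketch; timeouts, retry counts, and breaker thresholds are copied from the commit, while the record shape is an assumption (the real DriverResilienceOptions carries a policy per capability, with Write fixed at 0 retries everywhere):

```csharp
using System;

public enum DriverTier { A, B, C }

public sealed record TierPolicy(TimeSpan Timeout, int ReadRetries, int BreakerThreshold)
{
    public static TierPolicy GetTierDefaults(DriverTier tier) => tier switch
    {
        DriverTier.A => new TierPolicy(TimeSpan.FromSeconds(2), ReadRetries: 3, BreakerThreshold: 5),
        DriverTier.B => new TierPolicy(TimeSpan.FromSeconds(4), ReadRetries: 3, BreakerThreshold: 5),
        // Tier C turns the in-process breaker off: the out-of-process
        // supervisor owns the process-level breaker per decision #68.
        DriverTier.C => new TierPolicy(TimeSpan.FromSeconds(10), ReadRetries: 1, BreakerThreshold: 0),
        _ => throw new ArgumentOutOfRangeException(nameof(tier)),
    };
}
```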

---

f53c39a598
Phase 3 PR 16 — concrete OPC UA server scaffolding with AlarmConditionState materialization

Introduces the OPCFoundation.NetStandard.Opc.Ua.Server package (v1.5.374.126, the same version the v1 stack already uses) and two new server-side classes.

DriverNodeManager : CustomNodeManager2 is the concrete realization of PR 15's IAddressSpaceBuilder contract:
- Folder() creates FolderState nodes under an Organizes hierarchy rooted at ObjectsFolder > DriverInstanceId.
- Variable() creates BaseDataVariableState with DataType mapped from DriverDataType (Boolean/Int32/Float/Double/String/DateTime) + ValueRank (Scalar or OneDimension) + AccessLevel CurrentReadOrWrite.
- AddProperty() creates PropertyState with a HasProperty reference.
- The Read hook wires OnReadValue per variable to route to IReadable.ReadAsync; the Write hook wires OnWriteValue to route to IWritable.WriteAsync and surface a per-tag StatusCode.
- MarkAsAlarmCondition() materializes an OPC UA AlarmConditionState child of the variable, seeded from AlarmConditionInfo (SourceName, InitialSeverity → UA severity via Low=250/Medium=500/High=700/Critical=900, mapping sketched after this entry, InitialDescription), initial state Enabled + Acknowledged + Inactive + Retain=false. It returns an IAlarmConditionSink whose OnTransition updates alarm.Severity/Time/Message and switches state per AlarmType string ('Active' → SetActiveState(true) + SetAcknowledgedState(false) + Retain=true; 'Acknowledged' → SetAcknowledgedState(true); 'Inactive' → SetActiveState(false) + Retain=false if already Acked), then calls alarm.ReportEvent to emit the OPC UA event to subscribed clients. Galaxy's GalaxyAlarmTracker (PR 14) now lands at a concrete AlarmConditionState node instead of just raising an unobserved C# event.

OtOpcUaServer : StandardServer wires one DriverNodeManager per DriverHost.GetDriver during CreateMasterNodeManager — anonymous endpoint, no security profile (minimum-viable; LDAP + security-profile wire-up is the next PR). DriverHost gains a public GetDriver(instanceId) so the server can enumerate drivers at startup.

The NestedBuilder inner class in DriverNodeManager implements IAddressSpaceBuilder by temporarily retargeting the parent's _currentFolder during each call so Folder→Variable→AddProperty land under the correct subtree — not thread-safe if discovery ran concurrently, but GenericDriverNodeManager.BuildAddressSpaceAsync is sequential per driver, so this is safe by construction.

NuGet audit suppress for GHSA-h958-fxgg-g7w3 (moderate severity in OPCFoundation.NetStandard.Opc.Ua.Core 1.5.374.126; the v1 stack already accepts this risk on the same package version).

PR 16 is scoped as scaffolding — the actual server startup (ApplicationInstance, certificate config, endpoint binding, session-management wiring into OpcUaServerService.ExecuteAsync) is deferred to a follow-up PR because it needs ApplicationConfiguration XML + optional-cert-store logic that depends on per-deployment policy decisions. The materialization shape is complete: a subsequent PR adds 100 LOC to start the server, and all the already-written IAddressSpaceBuilder + alarm-condition + read/write wire-up activates end-to-end.

Full solution: 0 errors, 152 unit tests pass (no new tests this PR — DriverNodeManager unit testing needs an IServerInternal mock, which is heavyweight; live-endpoint integration tests land alongside the server-startup PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
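
The severity seed mapping as a sketch; the four numeric values are the commit's, while the enum definition and fallback are assumptions:

```csharp
public enum AlarmSeverity { Low, Medium, High, Critical }

public static class UaSeverityMap
{
    // InitialSeverity -> OPC UA event severity.
    public static ushort From(AlarmSeverity severity) => severity switch
    {
        AlarmSeverity.Low => 250,
        AlarmSeverity.Medium => 500,
        AlarmSeverity.High => 700,
        AlarmSeverity.Critical => 900,
        _ => 500, // unknown values seed as Medium (assumption, not in the commit)
    };
}
```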

---

190d09cdeb
Phase 3 PR 15 — alarm-condition contract in IAddressSpaceBuilder + wire OnAlarmEvent through GenericDriverNodeManager

IAddressSpaceBuilder.IVariableHandle gains MarkAsAlarmCondition(AlarmConditionInfo), which returns an IAlarmConditionSink. AlarmConditionInfo carries SourceName/InitialSeverity/InitialDescription. Concrete address-space builders (the upcoming PR 16 OPC UA server backend) materialize a sibling AlarmConditionState node on the first call; the sink receives every lifecycle transition the generic node manager forwards.

GenericDriverNodeManager gains a CapturingBuilder wrapper that transparently wraps every Folder/Variable call — the wrapper observes MarkAsAlarmCondition calls without participating in materialization and captures the resulting IAlarmConditionSink into an internal source-node-id → sink ConcurrentDictionary keyed by IVariableHandle.FullReference. After DiscoverAsync completes, if the driver implements IAlarmSource the node manager subscribes to OnAlarmEvent and routes every AlarmEventArgs to the sink registered for args.SourceNodeId — unknown source ids are dropped silently (they may belong to another driver or to a variable the builder chose not to flag). Dispose unsubscribes the forwarder to prevent dangling invocation-list references across node-manager rebuilds (routing sketched after this entry).

GalaxyProxyDriver.DiscoverAsync now calls handle.MarkAsAlarmCondition(new AlarmConditionInfo(fullName, AlarmSeverity.Medium, null)) on every attr.IsAlarm=true variable — the severity seed is Medium because the live Priority byte arrives through the subsequent GalaxyAlarmEvent stream (which PR 14's GalaxyAlarmTracker now emits); the Admin UI sees the severity update on the first transition.

RecordingAddressSpaceBuilder in Driver.Galaxy.E2E gains a RecordedAlarmCondition list + a RecordingSink implementation that captures AlarmEventArgs for test assertion — the E2E parity suite can now verify alarm-condition registration shape in addition to folder/variable shape.

Tests (4 new GenericDriverNodeManagerTests):
- Alarm_events_are_routed_to_the_sink_registered_for_the_matching_source_node_id — 2 alarms registered (Tank.HiHi + Heater.OverTemp); the driver raises an event for Tank.HiHi; the Tank.HiHi sink captures the payload, the Heater.OverTemp sink does not (tag-scoped fan-out, not broadcast).
- Non_alarm_variables_do_not_register_sinks — a plain Tank.Level in the same discover is not in TrackedAlarmSources.
- Unknown_source_node_id_is_dropped_silently — a transition for Unknown.Source doesn't reach any sink + no exception.
- Dispose_unsubscribes_from_OnAlarmEvent — post-dispose, a transition for a previously-registered tag is a no-op because the forwarder detached.

InternalsVisibleTo('ZB.MOM.WW.OtOpcUa.Core.Tests') added to the Core csproj so the TrackedAlarmSources internal property is visible to the test.

Full solution: 0 errors, 152 unit tests pass (8 Core + 14 Proxy + 14 Admin + 24 Configuration + 6 Shared + 84 Galaxy.Host + 2 Server). PR 16 will implement the concrete OPC UA address-space builder that materializes AlarmConditionState from this contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
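
A condensed sketch of the tag-scoped fan-out: sinks are captured per FullReference during discovery, and the OnAlarmEvent forwarder routes by SourceNodeId, dropping unknown ids silently. The reduced interfaces here stand in for the real contract types:

```csharp
using System.Collections.Concurrent;

public sealed record AlarmEventArgs(string SourceNodeId, string AlarmType, byte Priority, string? Message);

public interface IAlarmConditionSink { void OnTransition(AlarmEventArgs args); }

public sealed class AlarmSinkRouter
{
    // Populated while CapturingBuilder observes MarkAsAlarmCondition calls,
    // keyed by the variable handle's FullReference.
    private readonly ConcurrentDictionary<string, IAlarmConditionSink> _sinks = new();

    public void Register(string fullReference, IAlarmConditionSink sink) =>
        _sinks[fullReference] = sink;

    // The OnAlarmEvent forwarder: only the matching sink fires; unknown
    // source ids are dropped silently (they may belong to another driver).
    public void Route(AlarmEventArgs args)
    {
        if (_sinks.TryGetValue(args.SourceNodeId, out var sink))
            sink.OnTransition(args);
    }
}
```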

---

01fd90c178
Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline

Parity preserved: no v1 tests broken; the legacy OtOpcUa.Host is untouched.

Phase 1 finish:
- Configuration project: 16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, and sp_ComputeGenerationDiff with a CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on the dbo schema.
- Managed DraftValidator covering the UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), and driver↔namespace compatibility — returning every failing rule in one pass.
- LiteDB local cache with round-trip + ring pruning + corruption-fast-fail.
- GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added).
- Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and the DriverHost lifecycle registry.
- Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap falling back to the LiteDB cache when the central DB is unreachable (decision #79).
- Admin project scaffolded as Blazor Server with a Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), and Cluster + Generation services fronting the stored procs.

Phase 2 scaffold:
- Driver.Galaxy.Shared (netstandard2.0) with the full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap (sketched after this entry) and thread-safe FrameWriter/FrameReader.
- Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow the configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and a per-process shared-secret Hello, a Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over a 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0, then calls an optional unregister callback), StaPump with a responsiveness probe (BlockingCollection dispatcher for Phase 1 — a real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), and an IsExternalInit shim for init setters on .NET 4.8.
- Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus the Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; the sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md).

Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations.

Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md:
- Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142).
- Phase 2 Stream D (Galaxy MXAccess code lift out of the legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity).
- Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
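
A sketch of the decision-#28 length-prefixed framing with its 16 MiB cap, assuming a little-endian Int32 prefix and a .NET 7+ Stream for ReadExactly; the netstandard2.0 shared project would need a manual read loop, and the real FrameWriter/FrameReader add thread-safety plus the MessagePack payloads:

```csharp
using System;
using System.IO;

public static class Framing
{
    private const int MaxFrameBytes = 16 * 1024 * 1024; // 16 MiB cap per decision #28

    public static void WriteFrame(Stream stream, ReadOnlySpan<byte> payload)
    {
        if (payload.Length > MaxFrameBytes)
            throw new InvalidDataException($"frame of {payload.Length} bytes exceeds the 16 MiB cap");
        Span<byte> prefix = stackalloc byte[4];
        BitConverter.TryWriteBytes(prefix, payload.Length); // prefix endianness is an assumption
        stream.Write(prefix);
        stream.Write(payload);
    }

    public static byte[] ReadFrame(Stream stream)
    {
        Span<byte> prefix = stackalloc byte[4];
        stream.ReadExactly(prefix);
        var length = BitConverter.ToInt32(prefix);
        if (length is < 0 or > MaxFrameBytes)
            throw new InvalidDataException($"declared frame length {length} violates the 16 MiB cap");
        var payload = new byte[length];
        stream.ReadExactly(payload);
        return payload;
    }
}
```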