Commit Graph

16 Commits

Author SHA1 Message Date
Joseph Doherty
ae8f226e45 Phase 6.1 Stream E.3 partial — in-flight counter feeds CurrentBulkheadDepth
Closes the observer half of #162 that was flagged as "persisted as 0 today"
in PR #105. The Admin /hosts column refresh + FleetStatusHub SignalR push
+ red-badge visual still belong to the visual-compliance pass.

Core.Resilience:
- DriverResilienceStatusTracker gains RecordCallStart + RecordCallComplete
  + CurrentInFlight field on the snapshot record. Concurrent-safe via the
  same ConcurrentDictionary.AddOrUpdate pattern as the other recorder methods.
  Clamps to zero on over-decrement so a stray Complete-without-Start can't
  drive the counter negative.
- CapabilityInvoker gains an optional statusTracker ctor parameter. When
  wired, every ExecuteAsync / ExecuteAsync(void) wraps the pipeline call
  in try / finally that records start/complete — so the counter advances
  cleanly whether the call succeeds, cancels, or throws. Null tracker keeps
  the pre-Phase-6.1 Stream E.3 behaviour exactly.

Server.Hosting:
- ResilienceStatusPublisherHostedService persists CurrentInFlight as the
  DriverInstanceResilienceStatus.CurrentBulkheadDepth column (was 0 before
  this PR). One-line fix on both the insert + update branches.

The in-flight counter is a pragmatic proxy for Polly's internal bulkhead
depth — a future PR wiring Polly telemetry would replace it with the real
value. The shape of the column + the publisher + the Admin /hosts query
doesn't change, so the follow-up is invisible to consumers.

Tests (8 new InFlightCounterTests, all pass):
- Start+Complete nets to zero.
- Nested starts sum; Complete decrements.
- Complete-without-Start clamps to zero.
- Different hosts track independently.
- Concurrent starts (500 parallel) don't lose count.
- CapabilityInvoker observed-mid-call depth == 1 during a pending call.
- CapabilityInvoker exception path still decrements (try/finally).
- CapabilityInvoker without tracker doesn't throw.

Full solution dotnet test: 1243 passing (was 1235, +8). Pre-existing
Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:02:34 -04:00
Joseph Doherty
4de94fab0d Phase 6.1 Stream A remaining — IPerCallHostResolver + DriverNodeManager per-call host dispatch (decision #144)
Closes the per-device isolation gap flagged at the Phase 6.1 Stream A wire-up
(PR #78 used driver.DriverInstanceId as the pipeline host for every call, so
multi-host drivers like Modbus with N PLCs shared one pipeline — one dead PLC
poisoned sibling breakers). Decision #144 requires per-device isolation; this
PR wires it without breaking single-host drivers.

Core.Abstractions:
- IPerCallHostResolver interface. Optional driver capability. Drivers with
  multi-host topology (Modbus across N PLCs, AB CIP across a rack, etc.)
  implement this; single-host drivers (Galaxy, S7 against one PLC, OpcUaClient
  against one remote server) leave it alone. Must be fast + allocation-free
  — called once per tag on the hot path. Unknown refs return empty so dispatch
  falls back to single-host without throwing.

Server/OpcUa/DriverNodeManager:
- Captures `driver as IPerCallHostResolver` at construction alongside the
  existing capability casts.
- New `ResolveHostFor(fullReference)` helper returns either the resolver's
  answer or the driver's DriverInstanceId (single-host fallback). Empty /
  whitespace resolver output also falls back to DriverInstanceId.
- Every dispatch site now passes `ResolveHostFor(fullRef)` to the invoker
  instead of `_driver.DriverInstanceId` — OnReadValue, OnWriteValue, all four
  HistoryRead paths. The HistoryRead Events path tolerates fullRef=null and
  falls back to DriverInstanceId for those cluster-wide event queries.
- Drivers without IPerCallHostResolver observe zero behavioural change:
  every call still keys on DriverInstanceId, same as before.

Tests (4 new PerCallHostResolverDispatchTests, all pass):
- DeadPlc_DoesNotOpenBreaker_For_HealthyPlc_With_Resolver — 2 PLCs behind
  one driver; hammer the dead PLC past its breaker threshold; assert the
  healthy PLC's first call succeeds on its first attempt (decision #144).
- EmptyString / unknown-ref fallback behaviour documented via test.
- WithoutResolver_SameHost_Shares_One_Pipeline — regression guard for the
  single-host pre-existing behaviour.
- WithResolver_TwoHosts_Get_Two_Pipelines — builds the CachedPipelineCount
  assertion to confirm the shared-builder cache keys correctly.

Full solution dotnet test: 1219 passing (was 1215, +4). Pre-existing
Client.CLI Subscribe flake unchanged.

Adoption: Modbus driver (#120 follow-up), AB CIP / AB Legacy / TwinCAT
drivers (also #120) implement the interface and return the per-tag PLC host
string. Single-host drivers stay silent and pay zero cost.

Remaining sub-items of #160 still deferred:
- IAlarmSource.SubscribeAlarmsAsync + AcknowledgeAsync invoker wrapping.
  Non-trivial because alarm subscription is push-based from driver through
  IAlarmConditionSink — the wrap has to happen at the driver-to-server glue
  rather than a synchronous dispatch site.
- Roslyn analyzer asserting every capability-interface call routes through
  CapabilityInvoker. Substantial (separate analyzer project + test harness);
  noise-value ratio favors shipping this post-v2-GA once the coverage is
  known-stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:31:24 -04:00
Joseph Doherty
7b50118b68 Phase 6.1 Stream A follow-up — DriverInstance.ResilienceConfig JSON column + parser + OtOpcUaServer wire-in
Closes the Phase 6.1 Stream A.2 "per-instance overrides bound from
DriverInstance.ResilienceConfig JSON column" work flagged as a follow-up
when Stream A.1 shipped in PR #78. Every driver can now override its Polly
pipeline policy per instance instead of inheriting pure tier defaults.

Configuration:
- DriverInstance entity gains a nullable `ResilienceConfig` string column
  (nvarchar(max)) + SQL check constraint `CK_DriverInstance_ResilienceConfig_IsJson`
  that enforces ISJSON when not null. Null = use tier defaults (decision
  #143 / unchanged from pre-Phase-6.1).
- EF migration `20260419161008_AddDriverInstanceResilienceConfig`.
- SchemaComplianceTests expected-constraint list gains the new CK name.

Core.Resilience.DriverResilienceOptionsParser:
- Pure-function parser. ParseOrDefaults(tier, json, out diag) returns the
  effective DriverResilienceOptions — tier defaults with per-capability /
  bulkhead overrides layered on top when the JSON payload supplies them.
  Partial policies (e.g. Read { retryCount: 10 }) fill missing fields from
  the tier default for that capability.
- Malformed JSON falls back to pure tier defaults + surfaces a human-readable
  diagnostic via the out parameter. Callers log the diag but don't fail
  startup — a misconfigured ResilienceConfig must not brick a working
  driver.
- Property names + capability keys are case-insensitive; unrecognised
  capability names are logged-and-skipped; unrecognised shape-level keys
  are ignored so future shapes land without a migration.

Server wire-in:
- OtOpcUaServer gains two optional ctor params: `tierLookup` (driverType →
  DriverTier) + `resilienceConfigLookup` (driverInstanceId → JSON string).
  CreateMasterNodeManager now resolves tier + JSON for each driver, parses
  via DriverResilienceOptionsParser, logs the diagnostic if any, and
  constructs CapabilityInvoker with the merged options instead of pure
  Tier A defaults.
- OpcUaApplicationHost threads both lookups through. Default null keeps
  existing tests constructing without either Func unchanged (falls back
  to Tier A + tier defaults exactly as before).

Tests (13 new DriverResilienceOptionsParserTests):
- null / whitespace / empty-object JSON returns pure tier defaults.
- Malformed JSON falls back + surfaces diagnostic.
- Read override merged into tier defaults; other capabilities untouched.
- Partial policy fills missing fields from tier default.
- Bulkhead overrides honored.
- Unknown capability skipped + surfaced in diagnostic.
- Property names + capability keys are case-insensitive.
- Every tier × every capability × empty-JSON round-trips tier defaults
  exactly (theory).

Full solution dotnet test: 1215 passing (was 1202, +13). Pre-existing
Client.CLI Subscribe flake unchanged.

Production wiring (Program.cs) example:
  Func<string, DriverTier> tierLookup = type => type switch
  {
      "Galaxy" => DriverTier.C,
      "Modbus" or "S7" => DriverTier.B,
      "OpcUaClient" => DriverTier.A,
      _ => DriverTier.A,
  };
  Func<string, string?> cfgLookup = id =>
      db.DriverInstances.AsNoTracking().FirstOrDefault(x => x.DriverInstanceId == id)?.ResilienceConfig;
  var host = new OpcUaApplicationHost(..., tierLookup: tierLookup, resilienceConfigLookup: cfgLookup);

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:21:42 -04:00
Joseph Doherty
c1cab33e38 Phase 6.4 Stream D server-side — IdentificationFolderBuilder materializes OPC 40010 Machinery Identification sub-folder
Closes the server-side / non-UI piece of Phase 6.4 Stream D. The Razor
`IdentificationFields.razor` component for Admin-UI editing ships separately
when the Admin UI pass lands (still tracked under #157 UI follow-up).

Core.OpcUa additions:
- IdentificationFolderBuilder — pure-function builder that materializes the
  OPC 40010 Machinery companion-spec Identification sub-folder per decision
  #139. Reads the nine nullable columns off an Equipment row:
  Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision,
  YearOfConstruction (short → OPC UA Int32), AssetLocation, ManufacturerUri,
  DeviceManualUri. Emits one AddProperty call per non-null field; skips the
  sub-folder entirely when all nine are null so browse trees don't carry
  pointless empty folders.
- HasAnyFields(equipment) — cheap short-circuit so callers can decide
  whether to invoke Folder() at all.
- FolderName constant ("Identification") + FieldNames list exposed so
  downstream tools / tests can cross-reference without duplicating the
  decision-#139 field set.

ACL binding: the sub-folder + variables live under the Equipment node so
Phase 6.2's PermissionTrie treats them as part of the Equipment ScopeId —
no new scope level. A user with Equipment-level grant reads the
Identification fields; a user without gets BadUserAccessDenied on both the
Equipment node + its Identification variables. Documented in the class
remarks; cross-reference update to acl-design.md is a follow-up.

Tests (9 new IdentificationFolderBuilderTests):
- HasAnyFields all-null false / any-non-null true.
- Build all-null returns null + doesn't emit Folder.
- Build fully-populated emits all 9 fields in decision #139 order.
- Only non-null fields are emitted (3-of-9 case).
- YearOfConstruction short widens to DriverDataType.Int32 with int value.
- String values round-trip through AddProperty.
- FieldNames constant matches decision #139 exactly.
- FolderName is "Identification".

Full solution dotnet test: 1202 passing (was 1193, +9). Pre-existing
Client.CLI Subscribe flake unchanged.

Production integration: the component that consumes this is the
address-space-build flow that walks the live Equipment table + calls
IdentificationFolderBuilder.Build(equipmentFolder, equipment) under each
Equipment node. That integration is the remaining Stream D follow-up
alongside the Razor UI component.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:57:39 -04:00
Joseph Doherty
40fb459040 Phase 6.2 Stream B — permission-trie evaluator in Core.Authorization
Ships Stream B.1-B.6 — the data-plane authorization engine Phase 6.2 runs on.
Integration into OPC UA dispatch (Stream C — Read / Write / HistoryRead /
Subscribe / Browse / Call etc.) is the next PR on this branch.

New Core.Abstractions:
- OpcUaOperation enum enumerates every OPC UA surface the evaluator gates:
  Browse, Read, WriteOperate/Tune/Configure (split by SecurityClassification),
  HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions,
  Call, AlarmAcknowledge/Confirm/Shelve. Stream C maps each one back to its
  dispatch call site.

New Core.Authorization namespace:
- NodeScope record + NodeHierarchyKind — 6-level scope addressing for
  Equipment-kind (UNS) namespaces, folder-segment walk for SystemPlatform-kind
  (Galaxy). NodeScope carries a Kind selector so the evaluator knows which
  hierarchy to descend.
- AuthorizationDecision { Verdict, Provenance } + AuthorizationVerdict
  {Allow, NotGranted, Denied} + MatchedGrant. Tri-state per decision #149;
  Phase 6.2 only produces Allow + NotGranted, Denied stays reserved for v2.1
  Explicit Deny without API break.
- IPermissionEvaluator.Authorize(session, operation, scope).
- PermissionTrie + PermissionTrieNode + TrieGrant. In-memory trie keyed on
  the ACL scope hierarchy. CollectMatches walks Cluster → Namespace →
  UnsArea → UnsLine → Equipment → Tag (or → FolderSegment(s) → Tag on
  Galaxy). Pure additive union — matches that share an LDAP group with the
  session contribute flags; OR across levels.
- PermissionTrieBuilder static factory. Build(clusterId, generationId, rows,
  scopePaths?) returns a trie for one generation. Cross-cluster rows are
  filtered out so the trie is cluster-coherent. Stream C follow-up wires a
  real scopePaths lookup from the live DB; tests supply hand-built paths.
- PermissionTrieCache — process-singleton, keyed on (ClusterId, GenerationId).
  Install(trie) adds a generation + promotes to "current" when the id is
  highest-known (handles out-of-order installs gracefully). Prior generations
  retained so an in-flight request against a prior trie still succeeds; GC
  via Prune(cluster, keepLatest).
- UserAuthorizationState — per-session cache of resolved LDAP groups +
  AuthGenerationId + MembershipVersion + MembershipResolvedUtc. Bounded by
  MembershipFreshnessInterval (default 15 min per decision #151) +
  AuthCacheMaxStaleness (default 5 min per decision #152).
- TriePermissionEvaluator — default IPermissionEvaluator. Fails closed on
  stale sessions (IsStale check short-circuits to NotGranted), on cross-
  cluster requests, on empty trie cache. Maps OpcUaOperation → NodePermissions
  via MapOperationToPermission (total — every enum value has a mapping; tested).

Tests (27 new, all pass):
- PermissionTrieTests (7): cluster-level grant cascades to every tag;
  equipment-level grant doesn't leak to sibling equipment; multi-group union
  ORs flags; no-matching-group returns empty; Galaxy folder-segment grant
  doesn't leak to sibling folder; cross-cluster rows don't land in this
  cluster's trie; build is idempotent (B.6 invariants).
- TriePermissionEvaluatorTests (8): allow when flag matches; NotGranted when
  no matching group; NotGranted when flags insufficient; HistoryRead requires
  its own bit (decision-level requirement); cross-cluster session denied;
  stale session fails closed; no cached trie denied; MapOperationToPermission
  is total across every OpcUaOperation.
- PermissionTrieCacheTests (8): empty cache returns null; install-then-get
  round-trips; new generation becomes current; out-of-order install doesn't
  downgrade current; invalidate drops one cluster; prune retains most recent;
  prune no-op when fewer than keep; cluster isolation.
- UserAuthorizationStateTests (4): fresh is not stale; IsStale after 5 min
  default; NeedsRefresh true between freshness + staleness windows.

Full solution dotnet test: 1078 passing (baseline 906, Phase 6.1 = 1042,
Phase 6.2 Stream A = +9, Stream B = +27 = 1078). Pre-existing Client.CLI
Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:27:44 -04:00
Joseph Doherty
cbcaf6593a Phase 6.1 Stream E (data layer) — DriverInstanceResilienceStatus entity + DriverResilienceStatusTracker + EF migration
Ships the data + runtime layer of Stream E. The SignalR hub and Blazor /hosts
page refresh (E.2-E.3) are follow-up work paired with the visual-compliance
review per Phase 6.4 patterns — documented as a deferred follow-up below.

Configuration:
- New entity DriverInstanceResilienceStatus with:
  DriverInstanceId, HostName (composite PK),
  LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth,
  LastRecycleUtc, BaselineFootprintBytes, CurrentFootprintBytes,
  LastSampledUtc.
- Separate from DriverHostStatus (per-host connectivity view) so a Running
  host that has tripped its breaker or is nearing its memory ceiling shows up
  distinctly on Admin /hosts. Admin page left-joins both for display.
- OtOpcUaConfigDbContext + Fluent-API config + IX_DriverResilience_LastSampled
  index for the stale-sample filter query.
- EF migration: 20260419124034_AddDriverInstanceResilienceStatus.

Core.Resilience:
- DriverResilienceStatusTracker — process-singleton in-memory tracker keyed on
  (DriverInstanceId, HostName). CapabilityInvoker + MemoryTracking +
  MemoryRecycle callers record failure/success/breaker-open/recycle/footprint
  events; a HostedService (Stream E.2 follow-up) samples this tracker every
  5 s and persists to the DB. Pure in-memory keeps tests fast + the core
  free of EF/SQL dependencies.

Tests:
- DriverResilienceStatusTrackerTests (9 new, all pass): tryget-before-write
  returns null; failures accumulate; success resets; breaker/recycle/footprint
  fields populate; per-host isolation; snapshot returns all pairs; concurrent
  writes don't lose counts.
- SchemaComplianceTests: expected-tables list updated to include the new
  DriverInstanceResilienceStatus table.

Full solution dotnet test: 1042 passing (baseline 906, +136 for Phase 6.1 so
far across Streams A/B/C/D/E.1). Pre-existing Client.CLI Subscribe flake
unchanged.

Deferred to follow-up PR (E.2/E.3):
- ResilienceStatusPublisher HostedService that samples DriverResilienceStatusTracker
  every 5 s + upserts DriverInstanceResilienceStatus rows.
- Admin FleetStatusHub SignalR hub pushing LastCircuitBreakerOpenUtc /
  CurrentBulkheadDepth / LastRecycleUtc on change.
- Admin /hosts Blazor column additions (red badge when
  ConsecutiveFailures > breakerThreshold / 2). Visual-compliance reviewer
  signoff alongside Phase 6.4 admin-ui patterns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:47:43 -04:00
Joseph Doherty
9dd5e4e745 Phase 6.1 Stream C — health endpoints on :4841 + LogContextEnricher + Serilog JSON sink + CapabilityInvoker enrichment
Closes Stream C per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Observability (new namespace):
- DriverHealthReport — pure-function aggregation over DriverHealthSnapshot list.
  Empty fleet = Healthy. Any Faulted = Faulted. Any Unknown/Initializing (no
  Faulted) = NotReady. Any Degraded or Reconnecting (no Faulted, no NotReady)
  = Degraded. Else Healthy. HttpStatus(verdict) maps to the Stream C.1 state
  matrix: Healthy/Degraded → 200, NotReady/Faulted → 503.
- LogContextEnricher — Serilog LogContext wrapper. Push(id, type, capability,
  correlationId) returns an IDisposable scope; inner log calls carry
  DriverInstanceId / DriverType / CapabilityName / CorrelationId structured
  properties automatically. NewCorrelationId = 12-hex-char GUID slice for
  cases where no OPC UA RequestHeader.RequestHandle is in flight.

CapabilityInvoker — now threads LogContextEnricher around every ExecuteAsync /
ExecuteWriteAsync call site. OtOpcUaServer passes driver.DriverType through
so logs correlate to the driver type too. Every capability call emits
structured fields per the Stream C.4 compliance check.

Server.Observability:
- HealthEndpointsHost — standalone HttpListener on http://localhost:4841/
  (loopback avoids Windows URL-ACL elevation; remote probing via reverse
  proxy or explicit netsh urlacl grant). Routes:
    /healthz → 200 when (configDbReachable OR usingStaleConfig); 503 otherwise.
      Body: status, uptimeSeconds, configDbReachable, usingStaleConfig.
    /readyz  → DriverHealthReport.Aggregate + HttpStatus mapping.
      Body: verdict, drivers[], degradedDrivers[], uptimeSeconds.
    anything else → 404.
  Disposal cooperative with the HttpListener shutdown.
- OpcUaApplicationHost starts the health host after the OPC UA server comes up
  and disposes it on shutdown. New OpcUaServerOptions knobs:
  HealthEndpointsEnabled (default true), HealthEndpointsPrefix (default
  http://localhost:4841/).

Program.cs:
- Serilog pipeline adds Enrich.FromLogContext + opt-in JSON file sink via
  `Serilog:WriteJson = true` appsetting. Uses Serilog.Formatting.Compact's
  CompactJsonFormatter (one JSON object per line — SIEMs like Splunk,
  Datadog, Graylog ingest without a regex parser).

Server.Tests:
- Existing 3 OpcUaApplicationHost integration tests now set
  HealthEndpointsEnabled=false to avoid port :4841 collisions under parallel
  execution.
- New HealthEndpointsHostTests (9): /healthz healthy empty fleet; stale-config
  returns 200 with flag; unreachable+no-cache returns 503; /readyz empty/
  Healthy/Faulted/Degraded/Initializing drivers return correct status and
  bodies; unknown path → 404. Uses ephemeral ports via Interlocked counter.

Core.Tests:
- DriverHealthReportTests (8): empty fleet, all-healthy, any-Faulted trumps,
  any-NotReady without Faulted, Degraded without Faulted/NotReady, HttpStatus
  per-verdict theory.
- LogContextEnricherTests (8): all 4 properties attach; scope disposes cleanly;
  NewCorrelationId shape; null/whitespace driverInstanceId throws.
- CapabilityInvokerEnrichmentTests (2): inner logs carry structured
  properties; no context leak outside the call site.

Full solution dotnet test: 1016 passing (baseline 906, +110 for Phase 6.1 so
far across Streams A+B+C). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:15:44 -04:00
Joseph Doherty
1d9008e354 Phase 6.1 Stream B.3/B.4/B.5 — MemoryRecycle + ScheduledRecycleScheduler + demand-aware WedgeDetector
Closes out Stream B per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Abstractions:
- IDriverSupervisor — process-level supervisor contract a Tier C driver's
  out-of-process topology provides (Galaxy Proxy/Supervisor implements this in
  a follow-up Driver.Galaxy wiring PR). Concerns: DriverInstanceId + RecycleAsync.
  Tier A/B drivers don't implement this; Stream B code asserts tier == C before
  ever calling it.

Core.Stability:
- MemoryRecycle — companion to MemoryTracking. On HardBreach, invokes the
  supervisor IFF tier == C AND a supervisor is wired. Tier A/B HardBreach logs
  a promotion-to-Tier-C recommendation and returns false. Soft/None/Warming
  never triggers a recycle at any tier.
- ScheduledRecycleScheduler — Tier C opt-in periodic recycler per decision #67.
  Ctor throws for Tier A/B (structural guard — scheduled recycle on an
  in-process driver would kill every OPC UA session and every co-hosted
  driver). TickAsync(now) advances the schedule by one interval per fire;
  RequestRecycleNowAsync drives an ad-hoc recycle without shifting the cron.
- WedgeDetector — demand-aware per decision #147. Classify(state, demand, now)
  returns:
    * NotApplicable when driver state != Healthy
    * Idle when Healthy + no pending work (bulkhead=0 && monitored=0 && historic=0)
    * Healthy when Healthy + pending work + progress within threshold
    * Faulted when Healthy + pending work + no progress within threshold
  Threshold clamps to min 60 s. DemandSignal.HasPendingWork ORs the three counters.
  The three false-wedge cases the plan calls out all stay Healthy: idle
  subscription-only, slow historian backfill making progress, write-only burst
  with drained bulkhead.

Tests (22 new, all pass):
- MemoryRecycleTests (7): Tier C hard-breach requests recycle; Tier A/B
  hard-breach never requests; Tier C without supervisor no-ops; soft-breach
  at every tier never requests; None/Warming never request.
- ScheduledRecycleSchedulerTests (6): ctor throws for A/B; zero/negative
  interval throws; tick before due no-ops; tick at/after due fires once and
  advances; RequestRecycleNow fires immediately without shifting schedule;
  multiple fires across ticks advance one interval each.
- WedgeDetectorTests (9): threshold clamp to 60 s; unhealthy driver always
  NotApplicable; idle subscription stays Idle; pending+fresh progress stays
  Healthy; pending+stale progress is Faulted; MonitoredItems active but no
  publish is Faulted; MonitoredItems active with fresh publish stays Healthy;
  historian backfill with fresh progress stays Healthy; write-only burst with
  empty bulkhead is Idle; HasPendingWork theory for any non-zero counter.

Full solution dotnet test: 989 passing (baseline 906, +83 for Phase 6.1 so far).
Pre-existing Client.CLI Subscribe flake unchanged.

Stream B complete. Next up: Stream C (health endpoints + structured logging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:03:18 -04:00
Joseph Doherty
ef6b0bb8fc Phase 6.1 Stream B.1/B.2 — DriverTier on DriverTypeMetadata + Core.Stability.MemoryTracking with hybrid-formula soft/hard thresholds
Stream B.1 — registry invariant:
- DriverTypeMetadata gains a required `DriverTier Tier` field. Every registered
  driver type must declare its stability tier so the downstream MemoryTracking,
  MemoryRecycle, and resilience-policy layers can resolve the right defaults.
  Stamped-at-registration-time enforcement makes the "every driver type has a
  non-null Tier" compliance check structurally impossible to fail.
- DriverTypeRegistry API unchanged; one new property on the record.

Stream B.2 — MemoryTracking (Core.Stability):
- Tier-agnostic tracker per decision #146: captures baseline as the median of
  samples collected during a post-init warmup window (default 5 min), then
  classifies each subsequent sample with the hybrid formula
  `soft = max(multiplier × baseline, baseline + floor)`, `hard = 2 × soft`.
- Per-tier constants wired: Tier A mult=3 floor=50 MB, Tier B mult=3 floor=100 MB,
  Tier C mult=2 floor=500 MB.
- Never kills. Hard-breach action returns HardBreach; the supervisor that acts
  on that signal (MemoryRecycle) is Tier C only per decisions #74, #145 and
  lands in the next B.3 commit on this branch.
- Two phases: WarmingUp (samples collected, Warming returned) and Steady
  (baseline captured, soft/hard checks active). Transition is automatic when
  the warmup window elapses.

Tests (15 new, all pass):
- Warming phase returns Warming until the window elapses.
- Window-elapsed captures median baseline + transitions to Steady.
- Per-tier constants match decision #146 table exactly.
- Soft threshold uses max() — small baseline → floor wins; large baseline →
  multiplier wins.
- Hard = 2 × soft.
- Sample below soft = None; at soft = SoftBreach; at/above hard = HardBreach.
- DriverTypeRegistry: theory asserts Tier round-trips for A/B/C.

Full solution dotnet test: 963 passing (baseline 906, +57 net for Phase 6.1
Stream A + Stream B.1/B.2). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:37:43 -04:00
Joseph Doherty
29bcaf277b Phase 6.1 Stream A.3 complete — wire CapabilityInvoker into DriverNodeManager dispatch end-to-end
Every OnReadValue / OnWriteValue now routes through the process-singleton
DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch
paths gain timeout + per-capability retry + per-(driver, host) circuit breaker
+ bulkhead without touching the individual driver implementations.

Wiring:
- OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor
  parameter (default null → instance-owned builder). Keeps the 3 test call
  sites that construct OpcUaApplicationHost directly unchanged.
- OtOpcUaServer: requires the builder in its ctor; constructs one
  CapabilityInvoker per driver at CreateMasterNodeManager time with default
  Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver-
  type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the
  DriverInstance.ResilienceConfig JSON column for per-instance overrides.
- DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps
  the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName,
  ...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName,
  isIdempotent, ...) where isIdempotent comes from the new
  _writeIdempotentByFullRef map populated at Variable() registration from
  DriverAttributeInfo.WriteIdempotent.

HostName defaults to driver.DriverInstanceId for now — a single-host pipeline
per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per-
call host resolution in a follow-up so failing PLCs can trip per-PLC breakers
without poisoning siblings (decision #144).

Test fixup:
- FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures:
  bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter
  can exceed 2s under parallel-test-run CPU pressure; the test asserts retry
  behavior, not timeout budget, so the longer slack keeps it deterministic.

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:28:28 -04:00
Joseph Doherty
b6d2803ff6 Phase 6.1 Stream A — switch pipeline keys from Guid to string to match IDriver.DriverInstanceId
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping
the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse
at the boundary. Switching the Resilience types to string removes that friction
and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in
the upcoming server-dispatch wiring PR.

- DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey
- CapabilityInvoker.ctor + _driverInstanceId field

Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now
use distinct "drv-keep" / "drv-drop" literals (previously both were distinct
Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same
literal — caught pre-commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:18:55 -04:00
Joseph Doherty
f3850f8914 Phase 6.1 Stream A.5/A.6 — WriteIdempotent flag on DriverAttributeInfo + Modbus/S7 tag records + FlakeyDriver integration tests
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143.
Default is false — writes never auto-retry unless the driver author has marked
the tag as safe to replay.

Core.Abstractions:
- DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the
  positional record (back-compatible; every existing call site uses the default).

Driver.Modbus:
- ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates
  documented in the param XML: holding-register set-points, configuration
  registers. Unsafe: edge-triggered coils, counter-increment addresses.
- ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into
  DriverAttributeInfo.WriteIdempotent.

Driver.S7:
- S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates:
  DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive
  edge-triggered program routines.
- S7Driver.DiscoverAsync propagates the flag.

Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise
the invoker + flaky-driver contract the plan enumerates:
- Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10).
- Non-idempotent write with RetryCount=5 configured still fails on the first
  failure — no replay (decision #44 guard at the ExecuteWriteAsync surface).
- Idempotent write with 2 transient failures succeeds on the 3rd attempt.
- Two hosts on the same driver have independent breakers — dead-host trips
  its breaker but live-host's first call still succeeds.

Propagation tests:
- ModbusDriverTests: SetPoint WriteIdempotent=true flows into
  DriverAttributeInfo; PulseCoil default=false.
- S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit.

Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A
so far). Pre-existing Client.CLI Subscribe flake unchanged.

Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's
OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the
server-side integration piece + needs DI wiring for the pipeline builder —
lands in the next PR on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:16:21 -04:00
Joseph Doherty
90f7792c92 Phase 6.1 Stream A.3 — CapabilityInvoker wraps driver-capability calls through the shared pipeline
One invoker per (DriverInstance, IDriver) pair; calls ExecuteAsync(capability,
host, callSite) and the invoker resolves the correct pipeline from the shared
DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit
+ pipeline-invalidate takes effect without restarting the invoker or the
driver host.

ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface:
- isIdempotent=false routes through a side pipeline with RetryCount=0 regardless
  of what the caller configured. The cache key carries a "::non-idempotent"
  suffix so it never collides with the retry-enabled write pipeline.
- isIdempotent=true routes through the normal Write pipeline. If the user has
  configured Write retries (opt-in), the idempotent tag gets them; otherwise
  default-0 still wins.

The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag
definition once at driver-init time and feeds the boolean into ExecuteWriteAsync.

Tests (6 new):
- Read retries on transient failure; returns value from call site.
- Write non-idempotent does NOT retry even when policy has 3 retries configured
  (the explicit decision-#44 guard at the dispatch surface).
- Write idempotent retries when policy allows.
- Write with default tier-A policy (RetryCount=0) never retries regardless of
  idempotency flag.
- Different hosts get independent pipelines.

Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment
on WriteIdempotentAttribute no longer references a non-existent type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 04:09:26 -04:00
Joseph Doherty
c04b13f436 Phase 6.1 Stream A.1/A.2/A.6 — Polly resilience foundation: pipeline builder + per-tier policy defaults + WriteIdempotent attribute
Lands the first chunk of the Phase 6.1 Stream A resilience layer per
docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A.
Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up
PRs on the same branch.

Core.Abstractions additions:
- WriteIdempotentAttribute — marker for tag-definition records that opt into
  auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45,
  #143. Read once via reflection at driver-init time; no per-write cost.
- DriverCapability enum — enumerates the 8 capability surface points
  (Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge
  / HistoryRead). AlarmAcknowledge is write-shaped (no retry by default).
- DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this
  into DriverTypeMetadata; surfaced here because the resilience policy defaults
  key on it.

Core.Resilience new namespace:
- DriverResilienceOptions — per-tier × per-capability policy defaults.
  GetTierDefaults(tier) is the source of truth:
    * Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5
    * Tier B: Read 4s/3, Write 4s/0, breaker threshold 5
    * Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles
      process-level breaker per decision #68)
  Resolve(capability) overlays CapabilityPolicies on top of the defaults.
- DriverResiliencePipelineBuilder — composes Timeout → Retry (capability-
  permitting, never on cancellation) → CircuitBreaker (tier-permitting) →
  Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on
  (DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead
  PLC behind a multi-device driver does not open the breaker for healthy
  siblings. Invalidate(driverInstanceId) supports Admin-triggered reload.

Tests (30 new, all pass):
- DriverResilienceOptionsTests: tier-default coverage for every capability,
  Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker,
  resolve-with-override layering.
- DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT
  retry on failure (decision #44 guard), dead-host isolation from sibling hosts,
  pipeline reuse for same triple, per-capability isolation, breaker opens after
  threshold on Tier A, timeout fires, cancellation is not retried,
  invalidation scoped to matching instance.

Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing
(baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged
by this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 04:07:27 -04:00
Joseph Doherty
190d09cdeb Phase 3 PR 15 — alarm-condition contract in IAddressSpaceBuilder + wire OnAlarmEvent through GenericDriverNodeManager. IAddressSpaceBuilder.IVariableHandle gains MarkAsAlarmCondition(AlarmConditionInfo) which returns an IAlarmConditionSink. AlarmConditionInfo carries SourceName/InitialSeverity/InitialDescription. Concrete address-space builders (the upcoming PR 16 OPC UA server backend) materialize a sibling AlarmConditionState node on the first call; the sink receives every lifecycle transition the generic node manager forwards. GenericDriverNodeManager gains a CapturingBuilder wrapper that transparently wraps every Folder/Variable call — the wrapper observes MarkAsAlarmCondition calls without participating in materialization, captures the resulting IAlarmConditionSink into an internal source-node-id → sink ConcurrentDictionary keyed by IVariableHandle.FullReference. After DiscoverAsync completes, if the driver implements IAlarmSource the node manager subscribes to OnAlarmEvent and routes every AlarmEventArgs to the sink registered for args.SourceNodeId — unknown source ids are dropped silently (may belong to another driver or to a variable the builder chose not to flag). Dispose unsubscribes the forwarder to prevent dangling invocation-list references across node-manager rebuilds. GalaxyProxyDriver.DiscoverAsync now calls handle.MarkAsAlarmCondition(new AlarmConditionInfo(fullName, AlarmSeverity.Medium, null)) on every attr.IsAlarm=true variable — severity seed is Medium because the live Priority byte arrives through the subsequent GalaxyAlarmEvent stream (which PR 14's GalaxyAlarmTracker now emits); the Admin UI sees the severity update on the first transition. RecordingAddressSpaceBuilder in Driver.Galaxy.E2E gains a RecordedAlarmCondition list + a RecordingSink implementation that captures AlarmEventArgs for test assertion — the E2E parity suite can now verify alarm-condition registration shape in addition to folder/variable shape. Tests (4 new GenericDriverNodeManagerTests): Alarm_events_are_routed_to_the_sink_registered_for_the_matching_source_node_id — 2 alarms registered (Tank.HiHi + Heater.OverTemp), driver raises an event for Tank.HiHi, the Tank.HiHi sink captures the payload, the Heater.OverTemp sink does not (tag-scoped fan-out, not broadcast); Non_alarm_variables_do_not_register_sinks — plain Tank.Level in the same discover is not in TrackedAlarmSources; Unknown_source_node_id_is_dropped_silently — a transition for Unknown.Source doesn't reach any sink + no exception; Dispose_unsubscribes_from_OnAlarmEvent — post-dispose, a transition for a previously-registered tag is no-op because the forwarder detached. InternalsVisibleTo('ZB.MOM.WW.OtOpcUa.Core.Tests') added to Core csproj so TrackedAlarmSources internal property is visible to the test. Full solution: 0 errors, 152 unit tests pass (8 Core + 14 Proxy + 14 Admin + 24 Configuration + 6 Shared + 84 Galaxy.Host + 2 Server). PR 16 will implement the concrete OPC UA address-space builder that materializes AlarmConditionState from this contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 07:51:35 -04:00
Joseph Doherty
01fd90c178 Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:35:25 -04:00