Closes the observer half of #162 that was flagged as "persisted as 0 today"
in PR #105. The Admin /hosts column refresh + FleetStatusHub SignalR push
+ red-badge visual still belong to the visual-compliance pass.
Core.Resilience:
- DriverResilienceStatusTracker gains RecordCallStart + RecordCallComplete
+ CurrentInFlight field on the snapshot record. Concurrent-safe via the
same ConcurrentDictionary.AddOrUpdate pattern as the other recorder methods.
Clamps to zero on over-decrement so a stray Complete-without-Start can't
drive the counter negative.
- CapabilityInvoker gains an optional statusTracker ctor parameter. When
wired, every ExecuteAsync / ExecuteAsync(void) wraps the pipeline call
in try / finally that records start/complete — so the counter advances
cleanly whether the call succeeds, cancels, or throws. Null tracker keeps
the pre-Phase-6.1 Stream E.3 behaviour exactly.
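Illustrative shape of the counter + try/finally pairing above (a sketch; the real
tracker keys on (DriverInstanceId, HostName) and everything beyond the
RecordCallStart / RecordCallComplete names is an assumption):
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class InFlightCounterSketch
{
    private readonly ConcurrentDictionary<string, int> _inFlight = new();

    public void RecordCallStart(string host) =>
        _inFlight.AddOrUpdate(host, 1, (_, current) => current + 1);

    public void RecordCallComplete(string host) =>
        // Clamp to zero so a stray Complete-without-Start can't go negative.
        _inFlight.AddOrUpdate(host, 0, (_, current) => Math.Max(0, current - 1));

    // Invoker-side shape: try/finally means success, cancellation, and
    // exceptions all decrement the counter.
    public async Task<T> ExecuteAsync<T>(string host, Func<Task<T>> call)
    {
        RecordCallStart(host);
        try { return await call(); }
        finally { RecordCallComplete(host); }
    }
}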
Server.Hosting:
- ResilienceStatusPublisherHostedService persists CurrentInFlight as the
DriverInstanceResilienceStatus.CurrentBulkheadDepth column (was 0 before
this PR). One-line fix on both the insert + update branches.
The in-flight counter is a pragmatic proxy for Polly's internal bulkhead
depth — a future PR wiring Polly telemetry would replace it with the real
value. The column shape, the publisher, and the Admin /hosts query
don't change, so the follow-up is invisible to consumers.
Tests (8 new InFlightCounterTests, all pass):
- Start+Complete nets to zero.
- Nested starts sum; Complete decrements.
- Complete-without-Start clamps to zero.
- Different hosts track independently.
- Concurrent starts (500 parallel) don't lose count.
- CapabilityInvoker observed-mid-call depth == 1 during a pending call.
- CapabilityInvoker exception path still decrements (try/finally).
- CapabilityInvoker without tracker doesn't throw.
Full solution dotnet test: 1243 passing (was 1235, +8). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the server-side/data-layer piece of Phase 6.4 Stream B.2-B.4. The
CSV-import preview + modal UI (Stream B.3/B.5) still belongs to the Admin
UI follow-up — this PR owns the staging tables + atomic finalise alone.
Configuration:
- New EquipmentImportBatch entity (Id, ClusterId, CreatedBy, CreatedAtUtc,
RowsStaged/Accepted/Rejected, FinalisedAtUtc?). Composite index on
(CreatedBy, FinalisedAtUtc) powers the Admin preview modal's "my open
batches" query.
- New EquipmentImportRow entity — one row per CSV row, 8 required columns
from decision #117 + 9 optional from decision #139 + IsAccepted flag +
RejectReason. FK to EquipmentImportBatch with cascade delete so
DropBatch collapses the whole tree.
- EF migration 20260419_..._AddEquipmentImportBatch.
- SchemaComplianceTests expected tables list gains the two new tables.
Admin.Services.EquipmentImportBatchService:
- CreateBatchAsync — new header row, caller-supplied ClusterId + CreatedBy.
- StageRowsAsync(batchId, acceptedRows, rejectedRows) — bulk-inserts the
parsed CSV rows into staging. Rejected rows carry LineNumberInFile +
RejectReason for the preview modal. Throws when the batch is finalised.
- DropBatchAsync — removes batch + cascaded rows. Throws when the batch
was already finalised (rollback via staging is not a time machine).
- FinaliseBatchAsync(batchId, generationId, driverInstanceId, unsLineId) —
atomic apply. Opens an EF transaction when the provider supports it
(SQL Server in prod; InMemory in tests skips the tx), bulk-inserts
every accepted staging row into Equipment, stamps
EquipmentImportBatch.FinalisedAtUtc, commits. Failure rolls back so
Equipment never partially mutates. Double-finalise is guarded: a
second finalise throws ImportBatchAlreadyFinalisedException.
- ListByUserAsync(createdBy, includeFinalised) — the Admin preview modal's
backing query. OrderByDescending on CreatedAtUtc so the most-recent
batch shows first.
- Two exception types: ImportBatchNotFoundException +
ImportBatchAlreadyFinalisedException.
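Rough sketch of the finalise transaction flow (assumes the service's injected
config DbContext as _db with the usual Microsoft.EntityFrameworkCore usings,
Guid keys, a BatchId FK on the row entity, and a hypothetical MapToEquipment
projection helper; the real signatures may differ):
public async Task FinaliseBatchAsync(Guid batchId, Guid generationId,
    Guid driverInstanceId, Guid unsLineId, CancellationToken ct = default)
{
    var batch = await _db.EquipmentImportBatches.FindAsync(new object[] { batchId }, ct)
        ?? throw new ImportBatchNotFoundException(batchId);
    if (batch.FinalisedAtUtc is not null)
        throw new ImportBatchAlreadyFinalisedException(batchId);

    // SQL Server gets a real transaction; the InMemory test provider skips it.
    await using var tx = _db.Database.IsRelational()
        ? await _db.Database.BeginTransactionAsync(ct)
        : null;

    var accepted = await _db.EquipmentImportRows
        .Where(r => r.BatchId == batchId && r.IsAccepted)
        .ToListAsync(ct);
    foreach (var row in accepted)
        _db.Equipment.Add(MapToEquipment(row, generationId, driverInstanceId, unsLineId));

    batch.FinalisedAtUtc = DateTime.UtcNow;
    await _db.SaveChangesAsync(ct);            // a failure here leaves the tx uncommitted
    if (tx is not null)
        await tx.CommitAsync(ct);
}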
ExternalIdReservation merging (ZTag + SAPID fleet-wide uniqueness) is NOT
done here — a narrower follow-up wires it once the concurrent-insert test
matrix is green.
Tests (10 new EquipmentImportBatchServiceTests, all pass):
- CreateBatch populates Id + CreatedAtUtc + zero-ed counters.
- StageRows accepted + rejected both persist; counters advance.
- DropBatch cascades row delete.
- DropBatch after finalise throws.
- Finalise translates accepted staging rows → Equipment under the target
GenerationId + DriverInstanceId + UnsLineId.
- Finalise twice throws.
- Finalise of unknown batch throws.
- Stage after finalise throws.
- ListByUserAsync filters by creator + finalised flag.
- Drop of unknown batch is a no-op (idempotent rollback).
Full solution dotnet test: 1235 passing (was 1225, +10). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the HostedService half of Phase 6.1 Stream E.2 flagged as a follow-up
when the DriverResilienceStatusTracker shipped in PR #82. The Admin /hosts
column refresh + SignalR push + red-badge visual (Stream E.3) remain
deferred to the visual-compliance pass — this PR owns the persistence
story alone.
Server.Hosting:
- ResilienceStatusPublisherHostedService : BackgroundService. Samples the
DriverResilienceStatusTracker every TickInterval (default 5 s) and upserts
each (DriverInstanceId, HostName) counter pair into
DriverInstanceResilienceStatus via EF. New rows on first sight; in-place
updates on subsequent ticks.
- PersistOnceAsync is extracted as a public method so tests can drive one tick
directly, matching the ScheduledRecycleHostedService pattern for deterministic
timing.
- Best-effort persistence: a DB outage logs a warning + continues; the next
tick retries. Never crashes the app on sample failure. Cancellation
propagates through cleanly.
- Tracks the bulkhead depth / recycle / footprint columns the entity was
designed for. CurrentBulkheadDepth currently persisted as 0 — the tracker
doesn't yet expose live bulkhead depth; a narrower follow-up wires the
Polly bulkhead-depth observer into the tracker.
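Shape of the best-effort loop (a sketch; _logger, TickInterval and
PersistOnceAsync stand for the service's own members):
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        try
        {
            await PersistOnceAsync(stoppingToken);     // one sample + upsert pass
        }
        catch (Exception ex) when (!stoppingToken.IsCancellationRequested)
        {
            // Best-effort: a DB outage logs a warning; the next tick retries.
            _logger.LogWarning(ex, "Resilience status persist failed; retrying next tick");
        }

        try { await Task.Delay(TickInterval, stoppingToken); }
        catch (OperationCanceledException) { /* shutting down */ }
    }
}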
Tests (6 new in ResilienceStatusPublisherHostedServiceTests):
- Empty tracker → tick is a no-op, zero rows written.
- Single-host counters → upsert a new row with ConsecutiveFailures + breaker
timestamp + sampled timestamp.
- Second tick updates the existing row in place (not a second insert).
- Multi-host pairs persist independently.
- Footprint counters (Baseline + Current) round-trip.
- TickCount advances on every PersistOnceAsync call.
Full solution dotnet test: 1225 passing (was 1219, +6). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs) example:
builder.Services.AddSingleton<DriverResilienceStatusTracker>();
builder.Services.AddHostedService<ResilienceStatusPublisherHostedService>();
// Tracker gets wired into CapabilityInvoker via OtOpcUaServer resolution
// + the existing Phase 6.1 layer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the per-device isolation gap flagged at the Phase 6.1 Stream A wire-up
(PR #78 used driver.DriverInstanceId as the pipeline host for every call, so
multi-host drivers like Modbus with N PLCs shared one pipeline — one dead PLC
poisoned sibling breakers). Decision #144 requires per-device isolation; this
PR wires it without breaking single-host drivers.
Core.Abstractions:
- IPerCallHostResolver interface. Optional driver capability. Drivers with
multi-host topology (Modbus across N PLCs, AB CIP across a rack, etc.)
implement this; single-host drivers (Galaxy, S7 against one PLC, OpcUaClient
against one remote server) leave it alone. Must be fast + allocation-free
— called once per tag on the hot path. Unknown refs return empty so dispatch
falls back to single-host without throwing.
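One plausible shape for the capability interface (the member name and
nullability are assumptions):
public interface IPerCallHostResolver
{
    // Returns the per-device host key for one tag reference, or null/empty to
    // fall back to the driver-wide DriverInstanceId pipeline. Called on the
    // read/write hot path, so it must be fast and allocation-free.
    string? ResolveHost(string fullReference);
}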
Server/OpcUa/DriverNodeManager:
- Captures `driver as IPerCallHostResolver` at construction alongside the
existing capability casts.
- New `ResolveHostFor(fullReference)` helper returns either the resolver's
answer or the driver's DriverInstanceId (single-host fallback). Empty /
whitespace resolver output also falls back to DriverInstanceId.
- Every dispatch site now passes `ResolveHostFor(fullRef)` to the invoker
instead of `_driver.DriverInstanceId` — OnReadValue, OnWriteValue, all four
HistoryRead paths. The HistoryRead Events path tolerates fullRef=null and
falls back to DriverInstanceId for those cluster-wide event queries.
- Drivers without IPerCallHostResolver observe zero behavioural change:
every call still keys on DriverInstanceId, same as before.
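The fallback helper could look roughly like this (a sketch; _perCallHostResolver
and _driver are the node manager's captured fields, and ResolveHost follows the
interface sketch above):
private string ResolveHostFor(string? fullReference)
{
    if (_perCallHostResolver is null || string.IsNullOrEmpty(fullReference))
        return _driver.DriverInstanceId;

    var host = _perCallHostResolver.ResolveHost(fullReference);
    return string.IsNullOrWhiteSpace(host) ? _driver.DriverInstanceId : host;
}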
Tests (4 new PerCallHostResolverDispatchTests, all pass):
- DeadPlc_DoesNotOpenBreaker_For_HealthyPlc_With_Resolver — 2 PLCs behind
one driver; hammer the dead PLC past its breaker threshold; assert the
healthy PLC's first call succeeds on its first attempt (decision #144).
- EmptyString / unknown-ref fallback behaviour documented via test.
- WithoutResolver_SameHost_Shares_One_Pipeline — regression guard for the
single-host pre-existing behaviour.
- WithResolver_TwoHosts_Get_Two_Pipelines — uses a CachedPipelineCount
assertion to confirm the shared-builder cache is keyed per host.
Full solution dotnet test: 1219 passing (was 1215, +4). Pre-existing
Client.CLI Subscribe flake unchanged.
Adoption: Modbus driver (#120 follow-up), AB CIP / AB Legacy / TwinCAT
drivers (also #120) implement the interface and return the per-tag PLC host
string. Single-host drivers stay silent and pay zero cost.
Remaining sub-items of #160 still deferred:
- IAlarmSource.SubscribeAlarmsAsync + AcknowledgeAsync invoker wrapping.
Non-trivial because alarm subscription is push-based from driver through
IAlarmConditionSink — the wrap has to happen at the driver-to-server glue
rather than a synchronous dispatch site.
- Roslyn analyzer asserting every capability-interface call routes through
CapabilityInvoker. Substantial (separate analyzer project + test harness);
noise-value ratio favors shipping this post-v2-GA once the coverage is
known-stable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the Phase 6.1 Stream A.2 "per-instance overrides bound from
DriverInstance.ResilienceConfig JSON column" work flagged as a follow-up
when Stream A.1 shipped in PR #78. Every driver can now override its Polly
pipeline policy per instance instead of inheriting pure tier defaults.
Configuration:
- DriverInstance entity gains a nullable `ResilienceConfig` string column
(nvarchar(max)) + SQL check constraint `CK_DriverInstance_ResilienceConfig_IsJson`
that enforces ISJSON when not null. Null = use tier defaults (decision
#143 / unchanged from pre-Phase-6.1).
- EF migration `20260419161008_AddDriverInstanceResilienceConfig`.
- SchemaComplianceTests expected-constraint list gains the new CK name.
Core.Resilience.DriverResilienceOptionsParser:
- Pure-function parser. ParseOrDefaults(tier, json, out diag) returns the
effective DriverResilienceOptions — tier defaults with per-capability /
bulkhead overrides layered on top when the JSON payload supplies them.
Partial policies (e.g. Read { retryCount: 10 }) fill missing fields from
the tier default for that capability.
- Malformed JSON falls back to pure tier defaults + surfaces a human-readable
diagnostic via the out parameter. Callers log the diag but don't fail
startup — a misconfigured ResilienceConfig must not brick a working
driver.
- Property names + capability keys are case-insensitive; unrecognised
capability names are logged-and-skipped; unrecognised shape-level keys
are ignored so future shapes land without a migration.
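Intended call pattern (ParseOrDefaults per the text above; the surrounding
diagnostic handling is illustrative):
var effective = DriverResilienceOptionsParser.ParseOrDefaults(
    DriverTier.B,                         // tier resolved for this driver type
    driverInstance.ResilienceConfig,      // null => pure tier defaults
    out var diagnostic);

if (diagnostic is not null)
    logger.LogWarning("ResilienceConfig for {Id} partially ignored: {Diagnostic}",
        driverInstance.DriverInstanceId, diagnostic);
// Startup continues either way: a malformed payload never bricks the driver.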
Server wire-in:
- OtOpcUaServer gains two optional ctor params: `tierLookup` (driverType →
DriverTier) + `resilienceConfigLookup` (driverInstanceId → JSON string).
CreateMasterNodeManager now resolves tier + JSON for each driver, parses
via DriverResilienceOptionsParser, logs the diagnostic if any, and
constructs CapabilityInvoker with the merged options instead of pure
Tier A defaults.
- OpcUaApplicationHost threads both lookups through. Default null keeps
existing tests constructing without either Func unchanged (falls back
to Tier A + tier defaults exactly as before).
Tests (13 new DriverResilienceOptionsParserTests):
- null / whitespace / empty-object JSON returns pure tier defaults.
- Malformed JSON falls back + surfaces diagnostic.
- Read override merged into tier defaults; other capabilities untouched.
- Partial policy fills missing fields from tier default.
- Bulkhead overrides honored.
- Unknown capability skipped + surfaced in diagnostic.
- Property names + capability keys are case-insensitive.
- Every tier × every capability × empty-JSON round-trips tier defaults
exactly (theory).
Full solution dotnet test: 1215 passing (was 1202, +13). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs) example:
Func<string, DriverTier> tierLookup = type => type switch
{
    "Galaxy" => DriverTier.C,
    "Modbus" or "S7" => DriverTier.B,
    "OpcUaClient" => DriverTier.A,
    _ => DriverTier.A,
};
Func<string, string?> cfgLookup = id =>
    db.DriverInstances.AsNoTracking().FirstOrDefault(x => x.DriverInstanceId == id)?.ResilienceConfig;
var host = new OpcUaApplicationHost(..., tierLookup: tierLookup, resilienceConfigLookup: cfgLookup);
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the server-side / non-UI piece of Phase 6.4 Stream D. The Razor
`IdentificationFields.razor` component for Admin-UI editing ships separately
when the Admin UI pass lands (still tracked under #157 UI follow-up).
Core.OpcUa additions:
- IdentificationFolderBuilder — pure-function builder that materializes the
OPC 40010 Machinery companion-spec Identification sub-folder per decision
#139. Reads the nine nullable columns off an Equipment row:
Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision,
YearOfConstruction (short → OPC UA Int32), AssetLocation, ManufacturerUri,
DeviceManualUri. Emits one AddProperty call per non-null field; skips the
sub-folder entirely when all nine are null so browse trees don't carry
pointless empty folders.
- HasAnyFields(equipment) — cheap short-circuit so callers can decide
whether to invoke Folder() at all.
- FolderName constant ("Identification") + FieldNames list exposed so
downstream tools / tests can cross-reference without duplicating the
decision-#139 field set.
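Intended consumption shape (a sketch; this is the address-space build flow the
Production integration note below describes):
if (IdentificationFolderBuilder.HasAnyFields(equipment))
{
    // One AddProperty per non-null field under the Equipment node;
    // YearOfConstruction (short) widens to an Int32 property.
    IdentificationFolderBuilder.Build(equipmentFolder, equipment);
}
// All-nine-null equipment skips the sub-folder entirely.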
ACL binding: the sub-folder + variables live under the Equipment node so
Phase 6.2's PermissionTrie treats them as part of the Equipment ScopeId —
no new scope level. A user with Equipment-level grant reads the
Identification fields; a user without gets BadUserAccessDenied on both the
Equipment node + its Identification variables. Documented in the class
remarks; cross-reference update to acl-design.md is a follow-up.
Tests (9 new IdentificationFolderBuilderTests):
- HasAnyFields all-null false / any-non-null true.
- Build all-null returns null + doesn't emit Folder.
- Build fully-populated emits all 9 fields in decision #139 order.
- Only non-null fields are emitted (3-of-9 case).
- YearOfConstruction short widens to DriverDataType.Int32 with int value.
- String values round-trip through AddProperty.
- FieldNames constant matches decision #139 exactly.
- FolderName is "Identification".
Full solution dotnet test: 1202 passing (was 1193, +9). Pre-existing
Client.CLI Subscribe flake unchanged.
Production integration: the component that consumes this is the
address-space-build flow that walks the live Equipment table + calls
IdentificationFolderBuilder.Build(equipmentFolder, equipment) under each
Equipment node. That integration is the remaining Stream D follow-up
alongside the Razor UI component.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Turns the Phase 6.1 Stream B.4 pure-logic ScheduledRecycleScheduler (shipped
in PR #79) into a running background feature. A Tier C driver registers its
scheduler at startup; the hosted service ticks every TickInterval (default
1 min) and invokes TickAsync on each registered scheduler.
Server.Hosting:
- ScheduledRecycleHostedService : BackgroundService. AddScheduler(s) must be
called before StartAsync — registering post-start throws
InvalidOperationException to avoid "some ticks saw my scheduler, some
didn't" races. ExecuteAsync loops on Task.Delay(TickInterval, _timeProvider,
stoppingToken) + delegates to a public TickOnceAsync method for one tick.
- TickOnceAsync extracted as the unit-of-work so tests drive it directly
without needing to synchronize with FakeTimeProvider + BackgroundService
timing semantics.
- Exception isolation: if one scheduler throws, the loop logs + continues
to the next scheduler. A flaky supervisor can't take down the tick for
every other Tier C driver.
- Diagnostics: TickCount + SchedulerCount properties for tests + logs.
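One tick with the per-scheduler isolation described above (a sketch; _schedulers,
_timeProvider, _logger and TickCount stand for the service's own members):
public async Task TickOnceAsync(CancellationToken ct = default)
{
    var now = _timeProvider.GetUtcNow();
    foreach (var scheduler in _schedulers)
    {
        ct.ThrowIfCancellationRequested();
        try
        {
            await scheduler.TickAsync(now);
        }
        catch (Exception ex)
        {
            // A flaky supervisor must not take down the tick for its neighbours.
            _logger.LogError(ex, "Scheduled recycle tick failed for one scheduler");
        }
    }
    TickCount++;
}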
Tests (7 new ScheduledRecycleHostedServiceTests, all pass):
- TickOnce before interval doesn't fire; TickCount still advances.
- TickOnce at/after interval fires the underlying scheduler exactly once.
- Multiple ticks accumulate count.
- AddScheduler after StartAsync throws.
- Throwing scheduler doesn't poison its neighbours (logs + continues).
- SchedulerCount matches registrations.
- Empty scheduler list ticks cleanly (no-op + counter advances).
Full solution dotnet test: 1193 passing (was 1186, +7). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs):
builder.Services.AddSingleton<ScheduledRecycleHostedService>();
builder.Services.AddHostedService(sp => sp.GetRequiredService<ScheduledRecycleHostedService>());
// During DI configuration, once Tier C drivers + their ScheduledRecycleSchedulers
// are resolved, call host.AddScheduler(scheduler) for each.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the Phase 6.3 Stream B pure-logic pieces (ServiceLevelCalculator,
RecoveryStateManager, ApplyLeaseRegistry) + Stream A topology loader
(RedundancyCoordinator) into one orchestrator the runtime + OPC UA node
surface consume. The actual OPC UA variable-node plumbing (mapping
ServiceLevel Byte + ServerUriArray String[] onto the Opc.Ua.Server stack)
is a narrower follow-up on top of this — the publisher emits change events
the OPC UA layer subscribes to.
Server.Redundancy additions:
- PeerReachability record + PeerReachabilityTracker — thread-safe
per-peer-NodeId holder of the latest (HttpHealthy, UaHealthy) tuple. Probe
loops (Stream B.1/B.2 runtime follow-up) write via Update; the publisher
reads via Get. PeerReachability.FullyHealthy / Unknown sentinels for the
two most-common states.
- RedundancyStatePublisher — pure orchestrator, no background timer, no OPC
UA stack dep. ComputeAndPublish reads the inputs below + calls the calculator:
* role (from coordinator.Current.SelfRole)
* selfHealthy (caller-supplied Func<bool>)
* peerHttpHealthy + peerUaHealthy (aggregate across all peers in
coordinator.Current.Peers)
* applyInProgress (ApplyLeaseRegistry.IsApplyInProgress)
* recoveryDwellMet (RecoveryStateManager.IsDwellMet)
* topologyValid (coordinator.IsTopologyValid)
* operatorMaintenance (caller-supplied Func<bool>)
Before-coordinator-init returns NoData=1 so clients never see an
authoritative value from an un-bootstrapped server.
OnStateChanged event fires edge-triggered when the byte changes;
OnServerUriArrayChanged fires edge-triggered when the topology's self-first
peer-sorted URI array content changes.
- ServiceLevelSnapshot record — per-tick output with Value + Band +
Topology. The OPC UA layer's ServiceLevel Byte node subscribes to
OnStateChanged; the ServerUriArray node subscribes to OnServerUriArrayChanged.
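Edge-triggered publish shape (a sketch; the _last* fields, event types and
ComputeSnapshot are assumptions around the names quoted above):
public void ComputeAndPublish()
{
    var snapshot = ComputeSnapshot();             // reads the inputs listed above
    if (snapshot.Value != _lastValue)
    {
        _lastValue = snapshot.Value;
        OnStateChanged?.Invoke(snapshot);         // fires only on byte transitions
    }

    var uris = snapshot.Topology.ServerUriArray();
    if (_lastUris is null || !uris.SequenceEqual(_lastUris))
    {
        _lastUris = uris;
        OnServerUriArrayChanged?.Invoke(uris);    // fires only on content changes
    }
}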
Tests (8 new RedundancyStatePublisherTests, all pass):
- Before-init returns NoData (Value=1, Band=NoData).
- Authoritative-Primary when healthy + peer fully reachable.
- Isolated-Primary (230) retains authority when peer unreachable — matches
decision #154 non-promotion semantics.
- Mid-apply band dominates: open lease → Value=200 even with peer healthy.
- Self-unhealthy → NoData regardless of other inputs.
- OnStateChanged fires only on value transitions (edge-triggered).
- OnServerUriArrayChanged fires once per topology content change; repeat
ticks with same topology don't re-emit.
- Standalone cluster treats healthy as AuthoritativePrimary=255.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Server.Tests for the
coordinator-backed publisher tests.
Full solution dotnet test: 1186 passing (was 1178, +8). Pre-existing
Client.CLI Subscribe flake unchanged.
Closes the core of release blocker #3 — the pure-logic + orchestration
layer now exists + is unit-tested. Remaining Stream C surfaces: OPC UA
ServiceLevel Byte variable wiring (binds to OnStateChanged), ServerUriArray
String[] wiring (binds to OnServerUriArrayChanged), RedundancySupport
static from RedundancyMode. Those touch the OPC UA stack directly + land
as Stream C.2 follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the data path that feeds the Phase 6.3 ServiceLevelCalculator shipped in
PR #89. OPC UA node wiring (ServiceLevel variable + ServerUriArray +
RedundancySupport) still deferred to task #147; peer-probe loops (Stream B.1/B.2
runtime layer beyond the calculator logic) deferred.
Server.Redundancy additions:
- RedundancyTopology record — immutable snapshot (ClusterId, SelfNodeId,
SelfRole, Mode, Peers[], SelfApplicationUri). ServerUriArray() emits the
OPC UA Part 4 §6.6.2.2 shape (self first, peers lexicographically by
NodeId). RedundancyPeer record with per-peer Host/OpcUaPort/DashboardPort/
ApplicationUri so the follow-up peer-probe loops know where to probe.
- ClusterTopologyLoader — pure fn from ServerCluster + ClusterNode[] to
RedundancyTopology. Enforces Phase 6.3 Stream A.1 invariants:
* At least one node per cluster.
* At most 2 nodes (decision #83, v2.0 cap).
* Every node belongs to the target cluster.
* Unique ApplicationUri across the cluster (OPC UA Part 4 trust pin,
decision #86).
* At most 1 Primary per cluster in Warm/Hot modes (decision #84).
* Self NodeId must be a member of the cluster.
Violations throw InvalidTopologyException with a decision-ID-tagged message
so operators know which invariant + what to fix.
- RedundancyCoordinator singleton — holds the current topology + IsTopologyValid
flag. InitializeAsync throws on invariant violation (startup fails fast).
RefreshAsync logs + flips IsTopologyValid=false (runtime won't tear down a
running server; ServiceLevelCalculator falls to InvalidTopology band = 2
which surfaces the problem to clients without crashing). Atomic snapshot swap
via Volatile.Write so readers always see a coherent snapshot.
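The self-first ordering reduces to roughly this (a sketch assuming string-typed
NodeId / ApplicationUri members on RedundancyPeer):
public string[] ServerUriArray() =>
    new[] { SelfApplicationUri }
        .Concat(Peers
            .OrderBy(p => p.NodeId, StringComparer.Ordinal)   // lexicographic peer order
            .Select(p => p.ApplicationUri))
        .ToArray();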
Tests (10 new ClusterTopologyLoaderTests):
- Single-node standalone loads + empty peer list.
- Two-node cluster loads self + peer.
- ServerUriArray puts self first + peers sort lexicographically.
- Empty-nodes throws.
- Self-not-in-cluster throws.
- Three-node cluster rejected with decision #83 message.
- Duplicate ApplicationUri rejected with decision #86 shape reference.
- Two Primaries in Warm mode rejected (decision #84 + runtime-band reference).
- Cross-cluster node rejected.
- None-mode allows any role mix (standalone clusters don't enforce Primary count).
Full solution dotnet test: 1178 passing (was 1168, +10). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes release blocker #2 from docs/v2/v2-release-readiness.md — the
generation-sealed cache + resilient reader + stale-config flag shipped as
unit-tested primitives in PR #81, but no production path consumed them until
now. This PR wires them end-to-end.
Server additions:
- SealedBootstrap — Phase 6.1 Stream D consumption hook. Resolves the node's
current generation through ResilientConfigReader's timeout → retry →
fallback-to-sealed pipeline. On every successful central-DB fetch it seals
a fresh snapshot to <cache-root>/<cluster>/<generationId>.db so a future
cache-miss has a known-good fallback. It sits alongside the original
NodeBootstrap (which still uses the single-file ILocalConfigCache);
Program.cs can switch between them once operators are ready for the
generation-sealed semantics.
- OpcUaApplicationHost: new optional staleConfigFlag ctor parameter. When
wired, HealthEndpointsHost consumes `flag.IsStale` via the existing
usingStaleConfig Func<bool> hook. This means `/healthz` actually reports
`usingStaleConfig: true` whenever a read fell back to the sealed cache,
closing the loop between Stream D's flag + Stream C's /healthz body shape.
Tests (4 new SealedBootstrapIntegrationTests, all pass):
- Central-DB success path seals snapshot + flag stays fresh.
- Central-DB failure falls back to sealed snapshot + flag flips stale (the
SQL-kill scenario from Phase 6.1 Stream D.4.a).
- No-snapshot + central-down throws GenerationCacheUnavailableException
with a clear error (the first-boot scenario from D.4.c).
- Next successful bootstrap after a fallback clears the stale flag.
Full solution dotnet test: 1168 passing (was 1164, +4). Pre-existing
Client.CLI Subscribe flake unchanged.
Production activation: Program.cs wires SealedBootstrap (instead of
NodeBootstrap), constructs OpcUaApplicationHost with the staleConfigFlag,
and a HostedService polls sp_GetCurrentGenerationForCluster periodically so
peer-published generations land in this node's sealed cache. The poller
itself is Stream D.1.b follow-up.
The sp_PublishGeneration SQL-side hook (where the publish commit itself
could also write to a shared sealed cache) stays deferred — the per-node
seal pattern shipped here is the correct v2 GA model: each Server node
owns its own on-disk cache and refreshes from its own DB reads, matching
the Phase 6.1 scope-table description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the Phase 6.2 security gap the v2 release-readiness dashboard flagged:
the evaluator + trie + gate shipped as code in PRs #84-88 but no dispatch
path called them. This PR threads the gate end-to-end from
OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager and calls it on
every Read / Write path and all four HistoryRead paths.
Server.Security additions:
- NodeScopeResolver — maps driver fullRef → Core.Authorization NodeScope.
Phase 1 shape: populates ClusterId + TagId; leaves NamespaceId / UnsArea /
UnsLine / Equipment null. The cluster-level ACL cascade covers this
configuration (decision #129 additive grants). Finer-grained scope
resolution (joining against the live Configuration DB for UnsArea / UnsLine
path) lands as Stream C.12 follow-up.
- WriteAuthzPolicy.ToOpcUaOperation — maps SecurityClassification → the
OpcUaOperation the gate evaluator consults (Operate/SecuredWrite →
WriteOperate; Tune → WriteTune; Configure/VerifiedWrite → WriteConfigure).
DriverNodeManager wiring:
- Ctor gains optional AuthorizationGate + NodeScopeResolver; both null means
the pre-Phase-6.2 dispatch runs unchanged (backwards-compat for every
integration test that constructs DriverNodeManager directly).
- OnReadValue: ahead of the invoker call, builds NodeScope + calls
gate.IsAllowed(identity, Read, scope). Denied reads return
BadUserAccessDenied without hitting the driver.
- OnWriteValue: preserves the existing WriteAuthzPolicy check (classification
vs session roles) + adds an additive gate check using
WriteAuthzPolicy.ToOpcUaOperation(classification) to pick the right
WriteOperate/Tune/Configure surface. Lax mode falls through for identities
without LDAP groups.
- Four HistoryRead paths (Raw / Processed / AtTime / Events): gate check
runs per-node before the invoker. Events path tolerates fullRef=null
(event-history queries can target a notifier / driver-root; those are
cluster-wide reads that need a different scope shape — deferred).
- New WriteAccessDenied helper surfaces BadUserAccessDenied in the
OpcHistoryReadResult slot + errors list, matching the shape of the
existing WriteUnsupported / WriteInternalError helpers.
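The read-path pre-check could look roughly like this (a sketch; the field names
and the identity accessor are assumptions, BadUserAccessDenied is the stack's
standard status code):
// Inside OnReadValue, ahead of the invoker dispatch:
if (_authorizationGate is not null && _nodeScopeResolver is not null)
{
    var scope = _nodeScopeResolver.Resolve(fullReference);
    if (!_authorizationGate.IsAllowed(context.UserIdentity, OpcUaOperation.Read, scope))
        return StatusCodes.BadUserAccessDenied;    // the driver is never called
}
// ...the existing CapabilityInvoker dispatch continues unchanged.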
OtOpcUaServer + OpcUaApplicationHost: gate + resolver thread through as
optional constructor parameters (same pattern as DriverResiliencePipelineBuilder
in Phase 6.1). Null defaults keep the existing 3 OpcUaApplicationHost
integration tests constructing without them unchanged.
Tests (5 new in NodeScopeResolverTests):
- Resolve populates ClusterId + TagId + Equipment Kind.
- Resolve leaves finer path null per Phase 1 shape (doc'd as follow-up).
- Empty fullReference throws.
- Empty clusterId throws at ctor.
- Resolver is stateless across calls.
The existing 9 AuthorizationGate tests (shipped in PR #86) continue to
cover the gate's allow/deny semantics under strict + lax mode.
Full solution dotnet test: 1164 passing (was 1159, +5). Pre-existing
Client.CLI Subscribe flake unchanged. Existing OpcUaApplicationHost +
HealthEndpointsHost + driver integration tests continue to pass because the
gate defaults to null → no enforcement, and the lax-mode fallback returns
true for identities without LDAP groups (the anonymous test path).
Production deployments flip the gate on by constructing it via
OpcUaApplicationHost's new authzGate parameter + setting
`Authorization:StrictMode = true` once ACL data is populated. Flipping the
switch post-seed turns the evaluator + trie from scaffolded code into
actual enforcement.
This closes release blocker #1 listed in docs/v2/v2-release-readiness.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the non-UI piece of Stream D: a draft-aware write surface over NodeAcl
that enforces the Phase 6.2 plan's scope-uniqueness + grant-shape invariants.
Blazor UI pieces (RoleGrantsTab + AclsTab refresh + SignalR invalidation +
visual-compliance reviewer signoff) are deferred to the Phase 6.1-style
follow-up task.
Admin.Services:
- ValidatedNodeAclAuthoringService — alongside existing NodeAclService (raw
CRUD, kept for read + revoke paths). GrantAsync enforces:
* Permissions != None (decision #129 — additive only, no empty grants).
* Cluster scope has null ScopeId.
* Sub-cluster scope requires a populated ScopeId.
* No duplicate (GenerationId, ClusterId, LdapGroup, ScopeKind, ScopeId)
tuple — operator updates the row instead of inserting a duplicate.
UpdatePermissionsAsync also rejects None (operator revokes via NodeAclService).
Violations throw InvalidNodeAclGrantException.
Tests (10 new in Admin.Tests/ValidatedNodeAclAuthoringServiceTests):
- Grant rejects None permissions.
- Grant rejects Cluster-scope with ScopeId / sub-cluster without ScopeId.
- Grant succeeds on well-formed row.
- Grant rejects duplicate (group, scope) in same draft.
- Grant allows same group at different scope.
- Grant allows same (group, scope) in different draft.
- UpdatePermissions rejects None.
- UpdatePermissions round-trips new flags + notes.
- UpdatePermissions on unknown rowid throws.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Admin.Tests csproj.
Full solution dotnet test: 1097 passing (was 1087, +10). Phase 6.2 total is
now 1087+10 = 1097; baseline 906 → +191 net across Phase 6.1 (all streams) +
Phase 6.2 (Streams A, B, C foundation, D data layer).
Stream D follow-up task tracks: RoleGrantsTab CRUD over LdapGroupRoleMapping,
AclsTab write-through + Probe-this-permission diagnostic, draft-diff ACL
section, SignalR PermissionTrieCache invalidation push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the integration seam between the Server project's OPC UA stack and the
Core.Authorization evaluator. Actual DriverNodeManager dispatch-path wiring
(Read/Write/HistoryRead/Browse/Call/Subscribe/Alarm surfaces) lands in the
follow-up PR on this branch — covered by Task #143 below.
Server.Security additions:
- ILdapGroupsBearer — marker interface a custom IUserIdentity implements to
expose its resolved LDAP group DNs. Parallel to the existing IRoleBearer
(admin roles) — control/data-plane separation per decision #150.
- AuthorizationGate — stateless bridge between Opc.Ua.IUserIdentity and
IPermissionEvaluator. IsAllowed(identity, operation, scope) materializes a
UserAuthorizationState from the identity's LDAP groups, delegates to the
evaluator, and returns a single bool the dispatch paths use to decide
whether to surface BadUserAccessDenied.
- StrictMode knob controls fail-open-during-transition vs fail-closed:
- strict=false (default during rollout) — null identity, identity without
ILdapGroupsBearer, or NotGranted outcome all return true so older
deployments without ACL data keep working.
- strict=true (production target) — any of the above returns false.
The appsetting `Authorization:StrictMode = true` flips deployments over
once their ACL data is populated.
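Decision shape under both modes (a sketch; the ILdapGroupsBearer member name and
BuildSessionState plumbing are assumptions layered on the behaviour described
above):
public bool IsAllowed(IUserIdentity? identity, OpcUaOperation operation, NodeScope scope)
{
    // Null identity or an identity without LDAP groups: lax allows, strict denies.
    if (identity is not ILdapGroupsBearer bearer || bearer.LdapGroupDns.Count == 0)
        return !StrictMode;

    var session = BuildSessionState(bearer);
    var decision = _evaluator.Authorize(session, operation, scope);
    return decision.Verdict == AuthorizationVerdict.Allow
        || (!StrictMode && decision.Verdict == AuthorizationVerdict.NotGranted);
}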
Tests (9 new in Server.Tests/AuthorizationGateTests):
- Null identity — strict denies, lax allows.
- Identity without LDAP groups — strict denies, lax allows.
- LDAP group with matching grant allows.
- LDAP group without grant — strict denies.
- Wrong operation denied (Read-only grant, WriteOperate requested).
- BuildSessionState returns materialized state with LDAP groups + null when
identity doesn't carry them.
Full solution dotnet test: 1087 passing (Phase 6.1 = 1042, Phase 6.2 A = +9,
B = +27, C foundation = +9 = 1087). Pre-existing Client.CLI Subscribe flake
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships Stream B.1-B.6 — the data-plane authorization engine Phase 6.2 runs on.
Integration into OPC UA dispatch (Stream C — Read / Write / HistoryRead /
Subscribe / Browse / Call etc.) is the next PR on this branch.
New Core.Abstractions:
- OpcUaOperation enum enumerates every OPC UA surface the evaluator gates:
Browse, Read, WriteOperate/Tune/Configure (split by SecurityClassification),
HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions,
Call, AlarmAcknowledge/Confirm/Shelve. Stream C maps each one back to its
dispatch call site.
New Core.Authorization namespace:
- NodeScope record + NodeHierarchyKind — 6-level scope addressing for
Equipment-kind (UNS) namespaces, folder-segment walk for SystemPlatform-kind
(Galaxy). NodeScope carries a Kind selector so the evaluator knows which
hierarchy to descend.
- AuthorizationDecision { Verdict, Provenance } + AuthorizationVerdict
{Allow, NotGranted, Denied} + MatchedGrant. Tri-state per decision #149;
Phase 6.2 only produces Allow + NotGranted, Denied stays reserved for v2.1
Explicit Deny without API break.
- IPermissionEvaluator.Authorize(session, operation, scope).
- PermissionTrie + PermissionTrieNode + TrieGrant. In-memory trie keyed on
the ACL scope hierarchy. CollectMatches walks Cluster → Namespace →
UnsArea → UnsLine → Equipment → Tag (or → FolderSegment(s) → Tag on
Galaxy). Pure additive union — matches that share an LDAP group with the
session contribute flags; OR across levels.
- PermissionTrieBuilder static factory. Build(clusterId, generationId, rows,
scopePaths?) returns a trie for one generation. Cross-cluster rows are
filtered out so the trie is cluster-coherent. Stream C follow-up wires a
real scopePaths lookup from the live DB; tests supply hand-built paths.
- PermissionTrieCache — process-singleton, keyed on (ClusterId, GenerationId).
Install(trie) adds a generation + promotes to "current" when the id is
highest-known (handles out-of-order installs gracefully). Prior generations
retained so an in-flight request against a prior trie still succeeds; GC
via Prune(cluster, keepLatest).
- UserAuthorizationState — per-session cache of resolved LDAP groups +
AuthGenerationId + MembershipVersion + MembershipResolvedUtc. Bounded by
MembershipFreshnessInterval (default 15 min per decision #151) +
AuthCacheMaxStaleness (default 5 min per decision #152).
- TriePermissionEvaluator — default IPermissionEvaluator. Fails closed on
stale sessions (IsStale check short-circuits to NotGranted), on cross-
cluster requests, on empty trie cache. Maps OpcUaOperation → NodePermissions
via MapOperationToPermission (total — every enum value has a mapping; tested).
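The additive union at the heart of the evaluator, roughly (a sketch; sessionGroups
is a HashSet<string> of the session's LDAP group DNs, and the grant member names
are assumptions):
// Grants that share an LDAP group with the session OR their flags together;
// no level can subtract permission (decision #129, additive only).
var effective = NodePermissions.None;
foreach (var grant in trie.CollectMatches(scope))
    if (sessionGroups.Contains(grant.LdapGroup))
        effective |= grant.Permissions;

var required = MapOperationToPermission(operation);
var verdict = (effective & required) == required
    ? AuthorizationVerdict.Allow
    : AuthorizationVerdict.NotGranted;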
Tests (27 new, all pass):
- PermissionTrieTests (7): cluster-level grant cascades to every tag;
equipment-level grant doesn't leak to sibling equipment; multi-group union
ORs flags; no-matching-group returns empty; Galaxy folder-segment grant
doesn't leak to sibling folder; cross-cluster rows don't land in this
cluster's trie; build is idempotent (B.6 invariants).
- TriePermissionEvaluatorTests (8): allow when flag matches; NotGranted when
no matching group; NotGranted when flags insufficient; HistoryRead requires
its own bit (decision-level requirement); cross-cluster session denied;
stale session fails closed; no cached trie denied; MapOperationToPermission
is total across every OpcUaOperation.
- PermissionTrieCacheTests (8): empty cache returns null; install-then-get
round-trips; new generation becomes current; out-of-order install doesn't
downgrade current; invalidate drops one cluster; prune retains most recent;
prune no-op when fewer than keep; cluster isolation.
- UserAuthorizationStateTests (4): fresh is not stale; IsStale after 5 min
default; NeedsRefresh true between freshness + staleness windows.
Full solution dotnet test: 1078 passing (baseline 906, Phase 6.1 = 1042,
Phase 6.2 Stream A = +9, Stream B = +27 = 1078). Pre-existing Client.CLI
Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stream A.1-A.2 per docs/v2/implementation/phase-6-2-authorization-runtime.md.
Seed-data migration (A.3) is a separate follow-up once production LDAP group
DNs are finalised; until then CRUD via the Admin UI handles the fleet setup.
Configuration:
- New AdminRole enum {ConfigViewer, ConfigEditor, FleetAdmin} — string-stored.
- New LdapGroupRoleMapping entity with Id (surrogate PK), LdapGroup (512 chars),
Role (AdminRole enum), ClusterId (nullable, FK to ServerCluster), IsSystemWide,
CreatedAtUtc, Notes.
- EF config: UX_LdapGroupRoleMapping_Group_Cluster unique index on
(LdapGroup, ClusterId) + IX_LdapGroupRoleMapping_Group hot-path index on
LdapGroup for sign-in lookups. Cluster FK cascades on cluster delete.
- Migration 20260419_..._AddLdapGroupRoleMapping generated via `dotnet ef`.
Configuration.Services:
- ILdapGroupRoleMappingService — CRUD surface. Declared as control-plane only
per decision #150; the OPC UA data-path evaluator must NOT depend on this
interface (Phase 6.2 compliance check on control/data-plane separation).
GetByGroupsAsync is the hot-path sign-in lookup.
- LdapGroupRoleMappingService (EF Core impl) enforces the write-time invariant
"exactly one of (ClusterId populated, IsSystemWide=true)" and surfaces
InvalidLdapGroupRoleMappingException on violation. Create auto-populates Id
+ CreatedAtUtc when omitted.
Tests (9 new, all pass) in Configuration.Tests:
- Create sets Id + CreatedAtUtc.
- Create rejects empty LdapGroup.
- Create rejects IsSystemWide=true with populated ClusterId.
- Create rejects IsSystemWide=false with null ClusterId.
- GetByGroupsAsync returns matching rows only.
- GetByGroupsAsync with empty input returns empty (no full-table scan).
- ListAllAsync orders by group then cluster.
- Delete removes the target row.
- Delete of unknown id is a no-op.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Configuration.Tests for
the service-level tests (schema-compliance tests still use the live SQL
fixture).
SchemaComplianceTests updated to expect the new LdapGroupRoleMapping table.
Full solution dotnet test: 1051 passing (baseline 906, Phase 6.1 shipped at
1042, Phase 6.2 Stream A adds 9 = 1051). Pre-existing Client.CLI Subscribe
flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the data + runtime layer of Stream E. The SignalR hub and Blazor /hosts
page refresh (E.2-E.3) are follow-up work paired with the visual-compliance
review per Phase 6.4 patterns — documented as a deferred follow-up below.
Configuration:
- New entity DriverInstanceResilienceStatus with:
DriverInstanceId, HostName (composite PK),
LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth,
LastRecycleUtc, BaselineFootprintBytes, CurrentFootprintBytes,
LastSampledUtc.
- Separate from DriverHostStatus (per-host connectivity view) so a Running
host that has tripped its breaker or is nearing its memory ceiling shows up
distinctly on Admin /hosts. Admin page left-joins both for display.
- OtOpcUaConfigDbContext + Fluent-API config + IX_DriverResilience_LastSampled
index for the stale-sample filter query.
- EF migration: 20260419124034_AddDriverInstanceResilienceStatus.
Core.Resilience:
- DriverResilienceStatusTracker — process-singleton in-memory tracker keyed on
(DriverInstanceId, HostName). CapabilityInvoker + MemoryTracking +
MemoryRecycle callers record failure/success/breaker-open/recycle/footprint
events; a HostedService (Stream E.2 follow-up) samples this tracker every
5 s and persists to the DB. Pure in-memory keeps tests fast + the core
free of EF/SQL dependencies.
Tests:
- DriverResilienceStatusTrackerTests (9 new, all pass): TryGet-before-write
returns null; failures accumulate; success resets; breaker/recycle/footprint
fields populate; per-host isolation; snapshot returns all pairs; concurrent
writes don't lose counts.
- SchemaComplianceTests: expected-tables list updated to include the new
DriverInstanceResilienceStatus table.
Full solution dotnet test: 1042 passing (baseline 906, +136 for Phase 6.1 so
far across Streams A/B/C/D/E.1). Pre-existing Client.CLI Subscribe flake
unchanged.
Deferred to follow-up PR (E.2/E.3):
- ResilienceStatusPublisher HostedService that samples DriverResilienceStatusTracker
every 5 s + upserts DriverInstanceResilienceStatus rows.
- Admin FleetStatusHub SignalR hub pushing LastCircuitBreakerOpenUtc /
CurrentBulkheadDepth / LastRecycleUtc on change.
- Admin /hosts Blazor column additions (red badge when
ConsecutiveFailures > breakerThreshold / 2). Visual-compliance reviewer
signoff alongside Phase 6.4 admin-ui patterns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Stream D per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
New Configuration.LocalCache types (alongside the existing single-file
LiteDbConfigCache):
- GenerationSealedCache — file-per-generation sealed snapshots per decision
#148. Each SealAsync writes <cache-root>/<clusterId>/<generationId>.db as a
read-only LiteDB file, then atomically publishes the CURRENT pointer via
temp-file + File.Replace. Prior-generation files stay on disk for audit.
Mixed-generation reads are structurally impossible: ReadCurrentAsync opens
the single file named by CURRENT. Corruption of the pointer or the sealed
file raises GenerationCacheUnavailableException — fails closed, never falls
back silently to an older generation. TryGetCurrentGenerationId returns the
pointer value or null for diagnostics.
- StaleConfigFlag — thread-safe (Volatile.Read/Write) bool. MarkStale when a
read fell back to the cache; MarkFresh when a central-DB read succeeded.
Surfaced on /healthz body and Admin /hosts (Stream C wiring already in
place).
- ResilientConfigReader — wraps a central-DB fetch function with the Stream
D.2 pipeline: timeout 2 s → retry N× jittered (skipped when retryCount=0) →
fallback to the sealed cache. Toggles StaleConfigFlag per outcome. Read path
only — the write path is expected to bypass this wrapper and fail hard on
DB outage so inconsistent writes never land. Cancellation passes through
and is NOT retried.
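Rough shape of the read path (a sketch; the shipped reader composes this via
Polly, and Jitter / _retryCount / _staleFlag stand in for its actual members):
public async Task<T> ReadAsync<T>(
    Func<CancellationToken, Task<T>> centralFetch,
    Func<Task<T>> sealedFallback,
    CancellationToken ct)
{
    for (var attempt = 0; attempt <= _retryCount; attempt++)
    {
        try
        {
            using var timeout = CancellationTokenSource.CreateLinkedTokenSource(ct);
            timeout.CancelAfter(TimeSpan.FromSeconds(2));        // Stream D.2 timeout
            var value = await centralFetch(timeout.Token);
            _staleFlag.MarkFresh();
            return value;
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            throw;                                 // caller cancellation is never retried
        }
        catch (Exception) when (attempt < _retryCount)
        {
            await Task.Delay(Jitter(attempt), ct); // jittered backoff between attempts
        }
        catch (Exception)
        {
            break;                                 // retries exhausted: fall back
        }
    }

    _staleFlag.MarkStale();
    return await sealedFallback();                 // throws when no sealed snapshot exists
}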
Configuration.csproj:
- Polly.Core 8.6.6 + Microsoft.Extensions.Logging.Abstractions added.
Tests (17 new, all pass):
- GenerationSealedCacheTests (10): first-boot-no-snapshot throws
GenerationCacheUnavailableException (D.4 scenario C), seal-then-read round
trip, sealed file is ReadOnly on disk, pointer advances to latest, prior
generation file preserved, corrupt sealed file fails closed, missing sealed
file fails closed, corrupt pointer fails closed (D.4 scenario B), same
generation sealed twice is idempotent, independent clusters don't
interfere.
- ResilientConfigReaderTests (4): central-DB success returns value + marks
fresh; central-DB failure exhausts retries + falls back to cache + marks
stale (D.4 scenario A); central-DB + cache both unavailable throws;
cancellation not retried.
- StaleConfigFlagTests (3): default is fresh; toggles; concurrent writes
converge.
Full solution dotnet test: 1033 passing (baseline 906, +127 net across Phase
6.1 Streams A/B/C/D). Pre-existing Client.CLI Subscribe flake unchanged.
Integration into Configuration read paths (DriverInstance enumeration,
LdapGroupRoleMapping fetches, etc.) + the sp_PublishGeneration hook that
writes sealed files lands in the Phase 6.1 Stream E / Admin-refresh PR where
the DB integration surfaces are already touched. Existing LiteDbConfigCache
continues serving its single-file role for the NodeBootstrap path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Stream C per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
Core.Observability (new namespace):
- DriverHealthReport — pure-function aggregation over DriverHealthSnapshot list.
Empty fleet = Healthy. Any Faulted = Faulted. Any Unknown/Initializing (no
Faulted) = NotReady. Any Degraded or Reconnecting (no Faulted, no NotReady)
= Degraded. Else Healthy. HttpStatus(verdict) maps to the Stream C.1 state
matrix: Healthy/Degraded → 200, NotReady/Faulted → 503.
- LogContextEnricher — Serilog LogContext wrapper. Push(id, type, capability,
correlationId) returns an IDisposable scope; inner log calls carry
DriverInstanceId / DriverType / CapabilityName / CorrelationId structured
properties automatically. NewCorrelationId = 12-hex-char GUID slice for
cases where no OPC UA RequestHeader.RequestHandle is in flight.
CapabilityInvoker — now threads LogContextEnricher around every ExecuteAsync /
ExecuteWriteAsync call site. OtOpcUaServer passes driver.DriverType through
so logs correlate to the driver type too. Every capability call emits
structured fields per the Stream C.4 compliance check.
Server.Observability:
- HealthEndpointsHost — standalone HttpListener on http://localhost:4841/
(loopback avoids Windows URL-ACL elevation; remote probing via reverse
proxy or explicit netsh urlacl grant). Routes:
/healthz → 200 when (configDbReachable OR usingStaleConfig); 503 otherwise.
Body: status, uptimeSeconds, configDbReachable, usingStaleConfig.
/readyz → DriverHealthReport.Aggregate + HttpStatus mapping.
Body: verdict, drivers[], degradedDrivers[], uptimeSeconds.
anything else → 404.
Disposal cooperative with the HttpListener shutdown.
- OpcUaApplicationHost starts the health host after the OPC UA server comes up
and disposes it on shutdown. New OpcUaServerOptions knobs:
HealthEndpointsEnabled (default true), HealthEndpointsPrefix (default
http://localhost:4841/).
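The /healthz decision reduces to (a sketch; the hook fields and status strings
are illustrative):
// 200 when the central DB is reachable OR the node is knowingly running on
// the sealed cache; 503 only when neither holds.
bool configDbReachable = _configDbReachable();
bool usingStaleConfig  = _usingStaleConfig();
int statusCode = (configDbReachable || usingStaleConfig) ? 200 : 503;

var body = JsonSerializer.Serialize(new
{
    status = statusCode == 200 ? "healthy" : "unhealthy",
    uptimeSeconds = (long)_uptime.Elapsed.TotalSeconds,
    configDbReachable,
    usingStaleConfig,
});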
Program.cs:
- Serilog pipeline adds Enrich.FromLogContext + opt-in JSON file sink via
`Serilog:WriteJson = true` appsetting. Uses Serilog.Formatting.Compact's
CompactJsonFormatter (one JSON object per line — SIEMs like Splunk,
Datadog, Graylog ingest without a regex parser).
Server.Tests:
- Existing 3 OpcUaApplicationHost integration tests now set
HealthEndpointsEnabled=false to avoid port :4841 collisions under parallel
execution.
- New HealthEndpointsHostTests (9): /healthz healthy empty fleet; stale-config
returns 200 with flag; unreachable+no-cache returns 503; /readyz empty/
Healthy/Faulted/Degraded/Initializing drivers return correct status and
bodies; unknown path → 404. Uses ephemeral ports via Interlocked counter.
Core.Tests:
- DriverHealthReportTests (8): empty fleet, all-healthy, any-Faulted trumps,
any-NotReady without Faulted, Degraded without Faulted/NotReady, HttpStatus
per-verdict theory.
- LogContextEnricherTests (8): all 4 properties attach; scope disposes cleanly;
NewCorrelationId shape; null/whitespace driverInstanceId throws.
- CapabilityInvokerEnrichmentTests (2): inner logs carry structured
properties; no context leak outside the call site.
Full solution dotnet test: 1016 passing (baseline 906, +110 for Phase 6.1 so
far across Streams A+B+C). Pre-existing Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes out Stream B per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
Core.Abstractions:
- IDriverSupervisor — process-level supervisor contract a Tier C driver's
out-of-process topology provides (Galaxy Proxy/Supervisor implements this in
a follow-up Driver.Galaxy wiring PR). Concerns: DriverInstanceId + RecycleAsync.
Tier A/B drivers don't implement this; Stream B code asserts tier == C before
ever calling it.
Core.Stability:
- MemoryRecycle — companion to MemoryTracking. On HardBreach, invokes the
supervisor IFF tier == C AND a supervisor is wired. Tier A/B HardBreach logs
a promotion-to-Tier-C recommendation and returns false. Soft/None/Warming
never triggers a recycle at any tier.
- ScheduledRecycleScheduler — Tier C opt-in periodic recycler per decision #67.
Ctor throws for Tier A/B (structural guard — scheduled recycle on an
in-process driver would kill every OPC UA session and every co-hosted
driver). TickAsync(now) advances the schedule by one interval per fire;
RequestRecycleNowAsync drives an ad-hoc recycle without shifting the cron.
- WedgeDetector — demand-aware per decision #147. Classify(state, demand, now)
returns:
* NotApplicable when driver state != Healthy
* Idle when Healthy + no pending work (bulkhead=0 && monitored=0 && historic=0)
* Healthy when Healthy + pending work + progress within threshold
* Faulted when Healthy + pending work + no progress within threshold
Threshold clamps to min 60 s. DemandSignal.HasPendingWork ORs the three counters.
The three false-wedge cases the plan calls out all stay Healthy: idle
subscription-only, slow historian backfill making progress, write-only burst
with drained bulkhead.
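Classification order, roughly (a sketch; the verdict/state enum names and the
progress field are assumptions around the Classify signature above):
public WedgeVerdict Classify(DriverState state, DemandSignal demand, DateTimeOffset now)
{
    if (state != DriverState.Healthy)
        return WedgeVerdict.NotApplicable;        // non-Healthy drivers are out of scope

    if (!demand.HasPendingWork)                   // bulkhead=0 && monitored=0 && historic=0
        return WedgeVerdict.Idle;

    var threshold = _threshold >= MinThreshold ? _threshold : MinThreshold;   // clamp to the 60 s minimum
    return now - demand.LastProgressUtc <= threshold
        ? WedgeVerdict.Healthy                    // pending work, progress within threshold
        : WedgeVerdict.Faulted;                   // pending work, no recent progress
}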
Tests (22 new, all pass):
- MemoryRecycleTests (7): Tier C hard-breach requests recycle; Tier A/B
hard-breach never requests; Tier C without supervisor no-ops; soft-breach
at every tier never requests; None/Warming never request.
- ScheduledRecycleSchedulerTests (6): ctor throws for A/B; zero/negative
interval throws; tick before due no-ops; tick at/after due fires once and
advances; RequestRecycleNow fires immediately without shifting schedule;
multiple fires across ticks advance one interval each.
- WedgeDetectorTests (9): threshold clamp to 60 s; unhealthy driver always
NotApplicable; idle subscription stays Idle; pending+fresh progress stays
Healthy; pending+stale progress is Faulted; MonitoredItems active but no
publish is Faulted; MonitoredItems active with fresh publish stays Healthy;
historian backfill with fresh progress stays Healthy; write-only burst with
empty bulkhead is Idle; HasPendingWork theory for any non-zero counter.
Full solution dotnet test: 989 passing (baseline 906, +83 for Phase 6.1 so far).
Pre-existing Client.CLI Subscribe flake unchanged.
Stream B complete. Next up: Stream C (health endpoints + structured logging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stream B.1 — registry invariant:
- DriverTypeMetadata gains a required `DriverTier Tier` field. Every registered
driver type must declare its stability tier so the downstream MemoryTracking,
MemoryRecycle, and resilience-policy layers can resolve the right defaults.
Stamped-at-registration-time enforcement makes the "every driver type has a
non-null Tier" compliance check structurally impossible to fail.
- DriverTypeRegistry API unchanged; one new property on the record.
Stream B.2 — MemoryTracking (Core.Stability):
- Tier-agnostic tracker per decision #146: captures baseline as the median of
samples collected during a post-init warmup window (default 5 min), then
classifies each subsequent sample with the hybrid formula
`soft = max(multiplier × baseline, baseline + floor)`, `hard = 2 × soft`.
- Per-tier constants wired: Tier A mult=3 floor=50 MB, Tier B mult=3 floor=100 MB,
Tier C mult=2 floor=500 MB.
- Never kills. Hard-breach action returns HardBreach; the supervisor that acts
on that signal (MemoryRecycle) is Tier C only per decisions #74, #145 and
lands in the next B.3 commit on this branch.
- Two phases: WarmingUp (samples collected, Warming returned) and Steady
(baseline captured, soft/hard checks active). Transition is automatic when
the warmup window elapses.
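Worked examples of the hybrid formula with the constants above:
Tier A, 10 MB baseline: soft = max(3 × 10, 10 + 50) = 60 MB (floor wins), hard = 120 MB.
Tier A, 40 MB baseline: soft = max(3 × 40, 40 + 50) = 120 MB (multiplier wins), hard = 240 MB.
Tier C, 400 MB baseline: soft = max(2 × 400, 400 + 500) = 900 MB (floor wins), hard = 1800 MB.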
Tests (15 new, all pass):
- Warming phase returns Warming until the window elapses.
- Window-elapsed captures median baseline + transitions to Steady.
- Per-tier constants match decision #146 table exactly.
- Soft threshold uses max() — small baseline → floor wins; large baseline →
multiplier wins.
- Hard = 2 × soft.
- Sample below soft = None; at soft = SoftBreach; at/above hard = HardBreach.
- DriverTypeRegistry: theory asserts Tier round-trips for A/B/C.
Full solution dotnet test: 963 passing (baseline 906, +57 net for Phase 6.1
Stream A + Stream B.1/B.2). Pre-existing Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every OnReadValue / OnWriteValue now routes through the process-singleton
DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch
paths gain timeout + per-capability retry + per-(driver, host) circuit breaker
+ bulkhead without touching the individual driver implementations.
Wiring:
- OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor
parameter (default null → instance-owned builder). Keeps the 3 test call
sites that construct OpcUaApplicationHost directly unchanged.
- OtOpcUaServer: requires the builder in its ctor; constructs one
CapabilityInvoker per driver at CreateMasterNodeManager time with default
Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver-
type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the
DriverInstance.ResilienceConfig JSON column for per-instance overrides.
- DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps
the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName,
...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName,
isIdempotent, ...) where isIdempotent comes from the new
_writeIdempotentByFullRef map populated at Variable() registration from
DriverAttributeInfo.WriteIdempotent.
HostName defaults to driver.DriverInstanceId for now — a single-host pipeline
per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per-
call host resolution in a follow-up so failing PLCs can trip per-PLC breakers
without poisoning siblings (decision #144).
Test fixup:
- FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures:
bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter
can exceed 2s under parallel-test-run CPU pressure; the test asserts retry
behavior, not timeout budget, so the longer slack keeps it deterministic.
Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping
the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse
at the boundary. Switching the Resilience types to string removes that friction
and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in
the upcoming server-dispatch wiring PR.
- DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey
- CapabilityInvoker.ctor + _driverInstanceId field
Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now
use distinct "drv-keep" / "drv-drop" literals (previously both were distinct
Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same
literal — caught pre-commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143.
Default is false — writes never auto-retry unless the driver author has marked
the tag as safe to replay.
Core.Abstractions:
- DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the
positional record (back-compatible; every existing call site uses the default).
Driver.Modbus:
- ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates
documented in the param XML: holding-register set-points, configuration
registers. Unsafe: edge-triggered coils, counter-increment addresses.
- ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into
DriverAttributeInfo.WriteIdempotent.
Driver.S7:
- S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates:
DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive
edge-triggered program routines.
- S7Driver.DiscoverAsync propagates the flag.
Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise
the invoker + flaky-driver contract the plan enumerates:
- Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10).
- Non-idempotent write with RetryCount=5 configured still fails on the first
failure — no replay (decision #44 guard at the ExecuteWriteAsync surface).
- Idempotent write with 2 transient failures succeeds on the 3rd attempt.
- Two hosts on the same driver have independent breakers — dead-host trips
its breaker but live-host's first call still succeeds.
Propagation tests:
- ModbusDriverTests: SetPoint WriteIdempotent=true flows into
DriverAttributeInfo; PulseCoil default=false.
- S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit.
Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A
so far). Pre-existing Client.CLI Subscribe flake unchanged.
Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's
OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the
server-side integration piece + needs DI wiring for the pipeline builder —
lands in the next PR on this branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One invoker per (DriverInstance, IDriver) pair; callers invoke
ExecuteAsync(capability, host, callSite) and the invoker resolves the correct
pipeline from the shared
DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit
+ pipeline-invalidate takes effect without restarting the invoker or the
driver host.
ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface:
- isIdempotent=false routes through a side pipeline with RetryCount=0 regardless
of what the caller configured. The cache key carries a "::non-idempotent"
suffix so it never collides with the retry-enabled write pipeline.
- isIdempotent=true routes through the normal Write pipeline. If the user has
configured Write retries (opt-in), the idempotent tag gets them; otherwise
default-0 still wins.
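Routing sketch at the write surface (the "::non-idempotent" key suffix is per the
text above; the builder call shape and WithWriteRetriesDisabled are hypothetical
stand-ins):
// Key + policy selection for one write (a snippet, not the full method).
var options = _optionsAccessor();                    // Func: Admin edits apply without a restart
var pipelineHost = isIdempotent ? host : host + "::non-idempotent";
if (!isIdempotent)
    options = options.WithWriteRetriesDisabled();    // hypothetical helper: forces RetryCount=0

var pipeline = _builder.GetOrCreate(_driverInstanceId, pipelineHost,
    DriverCapability.Write, options);
// The write delegate then runs through this pipeline: idempotent tags pick up
// any opt-in Write retries, everything else keeps the default RetryCount=0.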
The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag
definition once at driver-init time and feeds the boolean into ExecuteWriteAsync.
Tests (6 new):
- Read retries on transient failure; returns value from call site.
- Write non-idempotent does NOT retry even when policy has 3 retries configured
(the explicit decision-#44 guard at the dispatch surface).
- Write idempotent retries when policy allows.
- Write with default tier-A policy (RetryCount=0) never retries regardless of
idempotency flag.
- Different hosts get independent pipelines.
Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment
on WriteIdempotentAttribute no longer references a non-existent type).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the first chunk of the Phase 6.1 Stream A resilience layer per
docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A.
Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up
PRs on the same branch.
Core.Abstractions additions:
- WriteIdempotentAttribute — marker for tag-definition records that opt into
auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45,
#143. Read once via reflection at driver-init time; no per-write cost.
- DriverCapability enum — enumerates the 8 capability surface points
(Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge
/ HistoryRead). AlarmAcknowledge is write-shaped (no retry by default).
- DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this
into DriverTypeMetadata; surfaced here because the resilience policy defaults
key on it.
Core.Resilience new namespace:
- DriverResilienceOptions — per-tier × per-capability policy defaults.
GetTierDefaults(tier) is the source of truth:
* Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5
* Tier B: Read 4s/3, Write 4s/0, breaker threshold 5
* Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles
process-level breaker per decision #68)
Resolve(capability) overlays CapabilityPolicies on top of the defaults.
- DriverResiliencePipelineBuilder — composes Timeout → Retry (capability-
permitting, never on cancellation) → CircuitBreaker (tier-permitting) →
Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on
(DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead
PLC behind a multi-device driver does not open the breaker for healthy
siblings. Invalidate(driverInstanceId) supports Admin-triggered reload.
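Cache shape per decision #144, roughly (a sketch; BuildPipeline stands in for the
Timeout → Retry → CircuitBreaker → Bulkhead composition):
private readonly ConcurrentDictionary<
    (string DriverInstanceId, string HostName, DriverCapability Capability),
    ResiliencePipeline> _pipelines = new();

public ResiliencePipeline GetOrCreate(string driverInstanceId, string hostName,
    DriverCapability capability, DriverResilienceOptions options) =>
    _pipelines.GetOrAdd((driverInstanceId, hostName, capability),
        key => BuildPipeline(key.Capability, options.Resolve(key.Capability)));

public void Invalidate(string driverInstanceId)
{
    // Admin-triggered reload: drop every cached pipeline for this instance.
    foreach (var key in _pipelines.Keys)
        if (key.DriverInstanceId == driverInstanceId)
            _pipelines.TryRemove(key, out _);
}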
Tests (30 new, all pass):
- DriverResilienceOptionsTests: tier-default coverage for every capability,
Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker,
resolve-with-override layering.
- DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT
retry on failure (decision #44 guard), dead-host isolation from sibling hosts,
pipeline reuse for same triple, per-capability isolation, breaker opens after
threshold on Tier A, timeout fires, cancellation is not retried,
invalidation scoped to matching instance.
Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing
(baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged
by this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>