Closes the observer half of #162 that was flagged as "persisted as 0 today"
in PR #105. The Admin /hosts column refresh + FleetStatusHub SignalR push
+ red-badge visual still belong to the visual-compliance pass.
Core.Resilience:
- DriverResilienceStatusTracker gains RecordCallStart + RecordCallComplete
+ CurrentInFlight field on the snapshot record. Concurrency-safe via the
same ConcurrentDictionary.AddOrUpdate pattern as the other recorder methods.
Clamps to zero on over-decrement so a stray Complete-without-Start can't
drive the counter negative.
- CapabilityInvoker gains an optional statusTracker ctor parameter. When
wired, every ExecuteAsync / ExecuteAsync(void) wraps the pipeline call
in try / finally that records start/complete — so the counter advances
cleanly whether the call succeeds, cancels, or throws. A null tracker keeps
the pre-Phase-6.1-Stream-E.3 behaviour exactly.
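Both pieces can be sketched together (a minimal sketch only; the type names, signatures, and single-string host key are illustrative, not the actual Core.Resilience source):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch of the in-flight recorder: clamp-to-zero on over-decrement, via the
// same AddOrUpdate pattern as the other recorder methods.
public sealed class InFlightTrackerSketch
{
    private readonly ConcurrentDictionary<string, int> _inFlight = new();

    public void RecordCallStart(string host) =>
        _inFlight.AddOrUpdate(host, 1, (_, n) => n + 1);

    public void RecordCallComplete(string host) =>
        _inFlight.AddOrUpdate(host, 0, (_, n) => Math.Max(0, n - 1)); // clamp at zero

    public int CurrentInFlight(string host) =>
        _inFlight.TryGetValue(host, out var n) ? n : 0;
}

// Sketch of the optional-tracker wrap: the counter unwinds in finally whether
// the pipeline call succeeds, cancels, or throws; a null tracker keeps the
// old code path untouched.
public sealed class InvokerSketch(InFlightTrackerSketch? tracker)
{
    public async Task<T> ExecuteAsync<T>(string host, Func<Task<T>> pipelineCall)
    {
        if (tracker is null)
            return await pipelineCall();

        tracker.RecordCallStart(host);
        try
        {
            return await pipelineCall();
        }
        finally
        {
            tracker.RecordCallComplete(host);
        }
    }
}
```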
Server.Hosting:
- ResilienceStatusPublisherHostedService persists CurrentInFlight as the
DriverInstanceResilienceStatus.CurrentBulkheadDepth column (was 0 before
this PR). One-line fix on both the insert + update branches.
The in-flight counter is a pragmatic proxy for Polly's internal bulkhead
depth — a future PR wiring Polly telemetry would replace it with the real
value. The shape of the column + the publisher + the Admin /hosts query
doesn't change, so the follow-up is invisible to consumers.
Tests (8 new InFlightCounterTests, all pass):
- Start+Complete nets to zero.
- Nested starts sum; Complete decrements.
- Complete-without-Start clamps to zero.
- Different hosts track independently.
- Concurrent starts (500 parallel) don't lose count.
- CapabilityInvoker observed-mid-call depth == 1 during a pending call.
- CapabilityInvoker exception path still decrements (try/finally).
- CapabilityInvoker without tracker doesn't throw.
Full solution dotnet test: 1243 passing (was 1235, +8). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the server-side/data-layer piece of Phase 6.4 Stream B.2-B.4. The
CSV-import preview + modal UI (Stream B.3/B.5) still belongs to the Admin
UI follow-up — this PR owns the staging tables + atomic finalise alone.
Configuration:
- New EquipmentImportBatch entity (Id, ClusterId, CreatedBy, CreatedAtUtc,
RowsStaged/Accepted/Rejected, FinalisedAtUtc?). Composite index on
(CreatedBy, FinalisedAtUtc) powers the Admin preview modal's "my open
batches" query.
- New EquipmentImportRow entity — one row per CSV row, 8 required columns
from decision #117 + 9 optional from decision #139 + IsAccepted flag +
RejectReason. FK to EquipmentImportBatch with cascade delete so
DropBatch collapses the whole tree.
- EF migration 20260419_..._AddEquipmentImportBatch.
- SchemaComplianceTests expected tables list gains the two new tables.
Admin.Services.EquipmentImportBatchService:
- CreateBatchAsync — new header row, caller-supplied ClusterId + CreatedBy.
- StageRowsAsync(batchId, acceptedRows, rejectedRows) — bulk-inserts the
parsed CSV rows into staging. Rejected rows carry LineNumberInFile +
RejectReason for the preview modal. Throws when the batch is finalised.
- DropBatchAsync — removes batch + cascaded rows. Throws when the batch
was already finalised (rollback via staging is not a time machine).
- FinaliseBatchAsync(batchId, generationId, driverInstanceId, unsLineId) —
atomic apply. Opens an EF transaction when the provider supports it
(SQL Server in prod; InMemory in tests skips the tx), bulk-inserts
every accepted staging row into Equipment, stamps
EquipmentImportBatch.FinalisedAtUtc, commits. Failure rolls back so
Equipment never partially mutates. Idempotent-under-double-call:
second finalise throws ImportBatchAlreadyFinalisedException.
- ListByUserAsync(createdBy, includeFinalised) — the Admin preview modal's
backing query. OrderByDescending on CreatedAtUtc so the most-recent
batch shows first.
- Two exception types: ImportBatchNotFoundException +
ImportBatchAlreadyFinalisedException.
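The finalise flow reduces to roughly this fragment (a sketch under stated assumptions: ImportDbContext, its DbSet names, and the MapToEquipment helper are all hypothetical, and the real service differs in detail):

```csharp
// Sketch only: assumes an ImportDbContext with Batches/Rows/Equipment DbSets
// and a hypothetical MapToEquipment helper (not shown).
public async Task FinaliseBatchSketchAsync(ImportDbContext db, Guid batchId, CancellationToken ct)
{
    var batch = await db.Batches.FindAsync(new object[] { batchId }, ct)
                ?? throw new ImportBatchNotFoundException(batchId);
    if (batch.FinalisedAtUtc is not null)
        throw new ImportBatchAlreadyFinalisedException(batchId);

    // InMemory (tests) is not relational, so skip the transaction there;
    // SQL Server in prod gets a real one.
    await using var tx = db.Database.IsRelational()
        ? await db.Database.BeginTransactionAsync(ct)
        : null;

    var accepted = await db.Rows
        .Where(r => r.BatchId == batchId && r.IsAccepted)
        .ToListAsync(ct);
    db.Equipment.AddRange(accepted.Select(MapToEquipment));

    batch.FinalisedAtUtc = DateTime.UtcNow;
    await db.SaveChangesAsync(ct);   // a failure here rolls the tx back
    if (tx is not null)
        await tx.CommitAsync(ct);
}
```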
ExternalIdReservation merging (ZTag + SAPID fleet-wide uniqueness) is NOT
done here — a narrower follow-up wires it once the concurrent-insert test
matrix is green.
Tests (10 new EquipmentImportBatchServiceTests, all pass):
- CreateBatch populates Id + CreatedAtUtc + zeroed counters.
- StageRows accepted + rejected both persist; counters advance.
- DropBatch cascades row delete.
- DropBatch after finalise throws.
- Finalise translates accepted staging rows → Equipment under the target
GenerationId + DriverInstanceId + UnsLineId.
- Finalise twice throws.
- Finalise of unknown batch throws.
- Stage after finalise throws.
- ListByUserAsync filters by creator + finalised flag.
- Drop of unknown batch is a no-op (idempotent rollback).
Full solution dotnet test: 1235 passing (was 1225, +10). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the HostedService half of Phase 6.1 Stream E.2 flagged as a follow-up
when the DriverResilienceStatusTracker shipped in PR #82. The Admin /hosts
column refresh + SignalR push + red-badge visual (Stream E.3) remain
deferred to the visual-compliance pass — this PR owns the persistence
story alone.
Server.Hosting:
- ResilienceStatusPublisherHostedService : BackgroundService. Samples the
DriverResilienceStatusTracker every TickInterval (default 5 s) and upserts
each (DriverInstanceId, HostName) counter pair into
DriverInstanceResilienceStatus via EF. New rows on first sight; in-place
updates on subsequent ticks.
- PersistOnceAsync extracted public so tests drive one tick directly —
matches the ScheduledRecycleHostedService pattern for deterministic
timing.
- Best-effort persistence: a DB outage logs a warning + continues; the next
tick retries. Never crashes the app on sample failure. Cancellation
propagates through cleanly.
- Tracks the bulkhead depth / recycle / footprint columns the entity was
designed for. CurrentBulkheadDepth currently persisted as 0 — the tracker
doesn't yet expose live bulkhead depth; a narrower follow-up wires the
Polly bulkhead-depth observer into the tracker.
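The best-effort loop amounts to something like this (a sketch; class and member names are illustrative, and the real PersistOnceAsync samples the tracker and upserts via EF):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Sketch of the best-effort tick loop: a failed persist logs a warning and the
// next tick retries; cancellation exits cleanly instead of crashing the app.
public sealed class PublisherLoopSketch(ILogger<PublisherLoopSketch> logger) : BackgroundService
{
    public TimeSpan TickInterval { get; init; } = TimeSpan.FromSeconds(5);

    public Task PersistOnceAsync(CancellationToken ct) => Task.CompletedTask; // real work elided

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await PersistOnceAsync(stoppingToken);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                logger.LogWarning(ex, "Resilience status persist failed; retrying next tick");
            }

            try
            {
                await Task.Delay(TickInterval, stoppingToken);
            }
            catch (OperationCanceledException)
            {
                break; // clean shutdown
            }
        }
    }
}
```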
Tests (6 new in ResilienceStatusPublisherHostedServiceTests):
- Empty tracker → tick is a no-op, zero rows written.
- Single-host counters → upsert a new row with ConsecutiveFailures + breaker
timestamp + sampled timestamp.
- Second tick updates the existing row in place (not a second insert).
- Multi-host pairs persist independently.
- Footprint counters (Baseline + Current) round-trip.
- TickCount advances on every PersistOnceAsync call.
Full solution dotnet test: 1225 passing (was 1219, +6). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs) example:
builder.Services.AddSingleton<DriverResilienceStatusTracker>();
builder.Services.AddHostedService<ResilienceStatusPublisherHostedService>();
// Tracker gets wired into CapabilityInvoker via OtOpcUaServer resolution
// + the existing Phase 6.1 layer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the per-device isolation gap flagged at the Phase 6.1 Stream A wire-up
(PR #78 used driver.DriverInstanceId as the pipeline host for every call, so
multi-host drivers like Modbus with N PLCs shared one pipeline — one dead PLC
poisoned sibling breakers). Decision #144 requires per-device isolation; this
PR wires it without breaking single-host drivers.
Core.Abstractions:
- IPerCallHostResolver interface. Optional driver capability. Drivers with
multi-host topology (Modbus across N PLCs, AB CIP across a rack, etc.)
implement this; single-host drivers (Galaxy, S7 against one PLC, OpcUaClient
against one remote server) leave it alone. Must be fast + allocation-free
— called once per tag on the hot path. Unknown refs return empty so dispatch
falls back to single-host without throwing.
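The capability shape, sketched (the interface name is from this PR; the member signature here is an assumption):

```csharp
// Optional driver capability: map a fully-qualified tag reference to the
// physical host/device key the resilience pipeline should isolate on.
// Must be fast and allocation-free; it runs once per tag on the hot path.
public interface IPerCallHostResolver
{
    // Return null/empty for unknown references; dispatch then falls back to
    // the driver's single-host DriverInstanceId key instead of throwing.
    string? ResolveHost(string fullReference);
}
```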
Server/OpcUa/DriverNodeManager:
- Captures `driver as IPerCallHostResolver` at construction alongside the
existing capability casts.
- New `ResolveHostFor(fullReference)` helper returns either the resolver's
answer or the driver's DriverInstanceId (single-host fallback). Empty /
whitespace resolver output also falls back to DriverInstanceId.
- Every dispatch site now passes `ResolveHostFor(fullRef)` to the invoker
instead of `_driver.DriverInstanceId` — OnReadValue, OnWriteValue, all four
HistoryRead paths. The HistoryRead Events path tolerates fullRef=null and
falls back to DriverInstanceId for those cluster-wide event queries.
- Drivers without IPerCallHostResolver observe zero behavioural change:
every call still keys on DriverInstanceId, same as before.
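The fallback helper reduces to something like this sketch (a Func stands in for the captured `driver as IPerCallHostResolver`; names are assumed):

```csharp
using System;

// Sketch of the dispatch-side fallback: use the resolver's answer when it is
// present and non-blank, otherwise the driver's single-host DriverInstanceId.
public sealed class HostResolutionSketch(Func<string, string?>? resolveHost, string driverInstanceId)
{
    public string ResolveHostFor(string? fullReference)
    {
        if (resolveHost is null || fullReference is null)
            return driverInstanceId;   // no resolver, or a cluster-wide query

        var host = resolveHost(fullReference);
        return string.IsNullOrWhiteSpace(host) ? driverInstanceId : host;
    }
}
```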
Tests (4 new PerCallHostResolverDispatchTests, all pass):
- DeadPlc_DoesNotOpenBreaker_For_HealthyPlc_With_Resolver — 2 PLCs behind
one driver; hammer the dead PLC past its breaker threshold; assert the
healthy PLC's first call succeeds on its first attempt (decision #144).
- EmptyString / unknown-ref fallback behaviour documented via test.
- WithoutResolver_SameHost_Shares_One_Pipeline — regression guard for the
single-host pre-existing behaviour.
- WithResolver_TwoHosts_Get_Two_Pipelines — builds the CachedPipelineCount
assertion to confirm the shared-builder cache keys correctly.
Full solution dotnet test: 1219 passing (was 1215, +4). Pre-existing
Client.CLI Subscribe flake unchanged.
Adoption: Modbus driver (#120 follow-up), AB CIP / AB Legacy / TwinCAT
drivers (also #120) implement the interface and return the per-tag PLC host
string. Single-host drivers stay silent and pay zero cost.
Remaining sub-items of #160 still deferred:
- IAlarmSource.SubscribeAlarmsAsync + AcknowledgeAsync invoker wrapping.
Non-trivial because alarm subscription is push-based from driver through
IAlarmConditionSink — the wrap has to happen at the driver-to-server glue
rather than a synchronous dispatch site.
- Roslyn analyzer asserting every capability-interface call routes through
CapabilityInvoker. Substantial (separate analyzer project + test harness);
the noise-to-value ratio favors shipping this post-v2-GA once the coverage is
known-stable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the Phase 6.1 Stream A.2 "per-instance overrides bound from
DriverInstance.ResilienceConfig JSON column" work flagged as a follow-up
when Stream A.1 shipped in PR #78. Every driver can now override its Polly
pipeline policy per instance instead of inheriting pure tier defaults.
Configuration:
- DriverInstance entity gains a nullable `ResilienceConfig` string column
(nvarchar(max)) + SQL check constraint `CK_DriverInstance_ResilienceConfig_IsJson`
that enforces ISJSON when not null. Null = use tier defaults (decision
#143 / unchanged from pre-Phase-6.1).
- EF migration `20260419161008_AddDriverInstanceResilienceConfig`.
- SchemaComplianceTests expected-constraint list gains the new CK name.
Core.Resilience.DriverResilienceOptionsParser:
- Pure-function parser. ParseOrDefaults(tier, json, out diag) returns the
effective DriverResilienceOptions — tier defaults with per-capability /
bulkhead overrides layered on top when the JSON payload supplies them.
Partial policies (e.g. Read { retryCount: 10 }) fill missing fields from
the tier default for that capability.
- Malformed JSON falls back to pure tier defaults + surfaces a human-readable
diagnostic via the out parameter. Callers log the diag but don't fail
startup — a misconfigured ResilienceConfig must not brick a working
driver.
- Property names + capability keys are case-insensitive; unrecognised
capability names are logged-and-skipped; unrecognised shape-level keys
are ignored so future shapes land without a migration.
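The layering rule for partial policies can be illustrated like this (the record shapes and field names are assumptions, not the real DriverResilienceOptions):

```csharp
// Sketch: a per-capability override replaces only the fields it supplies;
// anything it leaves out comes from the tier default for that capability.
public sealed record PolicySketch(int RetryCount, int BreakerFailureThreshold);

public sealed record PolicyOverrideSketch(int? RetryCount = null, int? BreakerFailureThreshold = null);

public static class PolicyMergeSketch
{
    public static PolicySketch Merge(PolicySketch tierDefault, PolicyOverrideSketch? over) =>
        over is null
            ? tierDefault
            : new PolicySketch(
                over.RetryCount ?? tierDefault.RetryCount,
                over.BreakerFailureThreshold ?? tierDefault.BreakerFailureThreshold);
}
```

Under this rule, a payload such as { "read": { "retryCount": 10 } } changes only the Read retry count and inherits every other field from the tier default.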
Server wire-in:
- OtOpcUaServer gains two optional ctor params: `tierLookup` (driverType →
DriverTier) + `resilienceConfigLookup` (driverInstanceId → JSON string).
CreateMasterNodeManager now resolves tier + JSON for each driver, parses
via DriverResilienceOptionsParser, logs the diagnostic if any, and
constructs CapabilityInvoker with the merged options instead of pure
Tier A defaults.
- OpcUaApplicationHost threads both lookups through. Default null keeps
existing tests constructing without either Func unchanged (falls back
to Tier A + tier defaults exactly as before).
Tests (13 new DriverResilienceOptionsParserTests):
- null / whitespace / empty-object JSON returns pure tier defaults.
- Malformed JSON falls back + surfaces diagnostic.
- Read override merged into tier defaults; other capabilities untouched.
- Partial policy fills missing fields from tier default.
- Bulkhead overrides honored.
- Unknown capability skipped + surfaced in diagnostic.
- Property names + capability keys are case-insensitive.
- Every tier × every capability × empty-JSON round-trips tier defaults
exactly (theory).
Full solution dotnet test: 1215 passing (was 1202, +13). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs) example:
Func<string, DriverTier> tierLookup = type => type switch
{
    "Galaxy" => DriverTier.C,
    "Modbus" or "S7" => DriverTier.B,
    "OpcUaClient" => DriverTier.A,
    _ => DriverTier.A,
};
Func<string, string?> cfgLookup = id =>
    db.DriverInstances.AsNoTracking().FirstOrDefault(x => x.DriverInstanceId == id)?.ResilienceConfig;
var host = new OpcUaApplicationHost(..., tierLookup: tierLookup, resilienceConfigLookup: cfgLookup);
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the server-side / non-UI piece of Phase 6.4 Stream D. The Razor
`IdentificationFields.razor` component for Admin-UI editing ships separately
when the Admin UI pass lands (still tracked under #157 UI follow-up).
Core.OpcUa additions:
- IdentificationFolderBuilder — pure-function builder that materializes the
OPC 40010 Machinery companion-spec Identification sub-folder per decision
#139. Reads the nine nullable columns off an Equipment row:
Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision,
YearOfConstruction (short → OPC UA Int32), AssetLocation, ManufacturerUri,
DeviceManualUri. Emits one AddProperty call per non-null field; skips the
sub-folder entirely when all nine are null so browse trees don't carry
pointless empty folders.
- HasAnyFields(equipment) — cheap short-circuit so callers can decide
whether to invoke Folder() at all.
- FolderName constant ("Identification") + FieldNames list exposed so
downstream tools / tests can cross-reference without duplicating the
decision-#139 field set.
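The emit-per-non-null rule looks roughly like this (a sketch trimmed to three of the nine columns; the sketch type names and the tuple-based output are illustrative):

```csharp
using System.Collections.Generic;

// Sketch: one property per non-null column; callers use HasAnyFields to skip
// creating the Identification sub-folder when every column is null.
public sealed record EquipmentSketch(string? Manufacturer, string? Model, short? YearOfConstruction);

public static class IdentificationSketch
{
    public const string FolderName = "Identification";

    public static bool HasAnyFields(EquipmentSketch e) =>
        e.Manufacturer is not null || e.Model is not null || e.YearOfConstruction is not null;

    public static IReadOnlyList<(string Name, object Value)> Build(EquipmentSketch e)
    {
        var props = new List<(string, object)>();
        if (e.Manufacturer is not null) props.Add(("Manufacturer", e.Manufacturer));
        if (e.Model is not null) props.Add(("Model", e.Model));
        if (e.YearOfConstruction is not null)
            props.Add(("YearOfConstruction", (int)e.YearOfConstruction.Value)); // short widens to Int32
        return props;
    }
}
```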
ACL binding: the sub-folder + variables live under the Equipment node so
Phase 6.2's PermissionTrie treats them as part of the Equipment ScopeId —
no new scope level. A user with Equipment-level grant reads the
Identification fields; a user without one gets BadUserAccessDenied on both the
Equipment node + its Identification variables. Documented in the class
remarks; cross-reference update to acl-design.md is a follow-up.
Tests (9 new IdentificationFolderBuilderTests):
- HasAnyFields all-null false / any-non-null true.
- Build all-null returns null + doesn't emit Folder.
- Build fully-populated emits all 9 fields in decision #139 order.
- Only non-null fields are emitted (3-of-9 case).
- YearOfConstruction short widens to DriverDataType.Int32 with int value.
- String values round-trip through AddProperty.
- FieldNames constant matches decision #139 exactly.
- FolderName is "Identification".
Full solution dotnet test: 1202 passing (was 1193, +9). Pre-existing
Client.CLI Subscribe flake unchanged.
Production integration: the component that consumes this is the
address-space-build flow that walks the live Equipment table + calls
IdentificationFolderBuilder.Build(equipmentFolder, equipment) under each
Equipment node. That integration is the remaining Stream D follow-up
alongside the Razor UI component.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Turns the Phase 6.1 Stream B.4 pure-logic ScheduledRecycleScheduler (shipped
in PR #79) into a running background feature. A Tier C driver registers its
scheduler at startup; the hosted service ticks every TickInterval (default
1 min) and invokes TickAsync on each registered scheduler.
Server.Hosting:
- ScheduledRecycleHostedService : BackgroundService. AddScheduler(s) must be
called before StartAsync — registering post-start throws
InvalidOperationException to avoid "some ticks saw my scheduler, some
didn't" races. ExecuteAsync loops on Task.Delay(TickInterval, _timeProvider,
stoppingToken) + delegates to a public TickOnceAsync method for one tick.
- TickOnceAsync extracted as the unit-of-work so tests drive it directly
without needing to synchronize with FakeTimeProvider + BackgroundService
timing semantics.
- Exception isolation: if one scheduler throws, the loop logs + continues
to the next scheduler. A flaky supervisor can't take down the tick for
every other Tier C driver.
- Diagnostics: TickCount + SchedulerCount properties for tests + logs.
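The isolation rule in sketch form (interface and member names here are assumed, not the real ScheduledRecycleScheduler surface):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public interface ITickableSketch
{
    Task TickAsync(CancellationToken ct);
}

// Sketch: one throwing scheduler logs and the loop moves on, so a flaky
// supervisor cannot suppress the tick for the other Tier C drivers.
public sealed class TickLoopSketch(IReadOnlyList<ITickableSketch> schedulers, ILogger logger)
{
    public int TickCount { get; private set; }

    public int SchedulerCount => schedulers.Count;

    public async Task TickOnceAsync(CancellationToken ct)
    {
        TickCount++;   // advances even when the list is empty
        foreach (var scheduler in schedulers)
        {
            try
            {
                await scheduler.TickAsync(ct);
            }
            catch (Exception ex)
            {
                logger.LogError(ex, "Scheduler tick failed; continuing with the rest");
            }
        }
    }
}
```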
Tests (7 new ScheduledRecycleHostedServiceTests, all pass):
- TickOnce before interval doesn't fire; TickCount still advances.
- TickOnce at/after interval fires the underlying scheduler exactly once.
- Multiple ticks accumulate count.
- AddScheduler after StartAsync throws.
- Throwing scheduler doesn't poison its neighbours (logs + continues).
- SchedulerCount matches registrations.
- Empty scheduler list ticks cleanly (no-op + counter advances).
Full solution dotnet test: 1193 passing (was 1186, +7). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs):
builder.Services.AddSingleton<ScheduledRecycleHostedService>();
builder.Services.AddHostedService(sp => sp.GetRequiredService<ScheduledRecycleHostedService>());
// During DI configuration, once Tier C drivers + their ScheduledRecycleSchedulers
// are resolved, call host.AddScheduler(scheduler) for each.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the Phase 6.3 Stream B pure-logic pieces (ServiceLevelCalculator,
RecoveryStateManager, ApplyLeaseRegistry) + Stream A topology loader
(RedundancyCoordinator) into one orchestrator the runtime + OPC UA node
surface consume. The actual OPC UA variable-node plumbing (mapping
ServiceLevel Byte + ServerUriArray String[] onto the Opc.Ua.Server stack)
is a narrower follow-up on top of this — the publisher emits change events
the OPC UA layer subscribes to.
Server.Redundancy additions:
- PeerReachability record + PeerReachabilityTracker — thread-safe
per-peer-NodeId holder of the latest (HttpHealthy, UaHealthy) tuple. Probe
loops (Stream B.1/B.2 runtime follow-up) write via Update; the publisher
reads via Get. PeerReachability.FullyHealthy / Unknown sentinels for the
two most-common states.
- RedundancyStatePublisher — pure orchestrator, no background timer, no OPC
UA stack dep. ComputeAndPublish reads the eight inputs + calls the calculator:
* role (from coordinator.Current.SelfRole)
* selfHealthy (caller-supplied Func<bool>)
* peerHttpHealthy + peerUaHealthy (aggregate across all peers in
coordinator.Current.Peers)
* applyInProgress (ApplyLeaseRegistry.IsApplyInProgress)
* recoveryDwellMet (RecoveryStateManager.IsDwellMet)
* topologyValid (coordinator.IsTopologyValid)
* operatorMaintenance (caller-supplied Func<bool>)
Before-coordinator-init returns NoData=1 so clients never see an
authoritative value from an un-bootstrapped server.
OnStateChanged event fires edge-triggered when the byte changes;
OnServerUriArrayChanged fires edge-triggered when the topology's self-first
peer-sorted URI array content changes.
- ServiceLevelSnapshot record — per-tick output with Value + Band +
Topology. The OPC UA layer's ServiceLevel Byte node subscribes to
OnStateChanged; the ServerUriArray node subscribes to OnServerUriArrayChanged.
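The edge-triggered contract in isolation (a minimal sketch; the real publisher compares full snapshots, not a bare byte):

```csharp
using System;

// Sketch: the event fires only when the computed byte differs from the last
// published value, so repeat ticks with an unchanged value stay silent.
public sealed class EdgeTriggeredByteSketch
{
    private byte? _last;

    public event Action<byte>? OnStateChanged;

    public void Publish(byte value)
    {
        if (_last == value)
            return;               // no edge, no event
        _last = value;
        OnStateChanged?.Invoke(value);
    }
}
```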
Tests (8 new RedundancyStatePublisherTests, all pass):
- Before-init returns NoData (Value=1, Band=NoData).
- Authoritative-Primary when healthy + peer fully reachable.
- Isolated-Primary (230) retains authority when peer unreachable — matches
decision #154 non-promotion semantics.
- Mid-apply band dominates: open lease → Value=200 even with peer healthy.
- Self-unhealthy → NoData regardless of other inputs.
- OnStateChanged fires only on value transitions (edge-triggered).
- OnServerUriArrayChanged fires once per topology content change; repeat
ticks with same topology don't re-emit.
- Standalone cluster treats healthy as AuthoritativePrimary=255.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Server.Tests for the
coordinator-backed publisher tests.
Full solution dotnet test: 1186 passing (was 1178, +8). Pre-existing
Client.CLI Subscribe flake unchanged.
Closes the core of release blocker #3 — the pure-logic + orchestration
layer now exists + is unit-tested. Remaining Stream C surfaces: OPC UA
ServiceLevel Byte variable wiring (binds to OnStateChanged), ServerUriArray
String[] wiring (binds to OnServerUriArrayChanged), RedundancySupport
static from RedundancyMode. Those touch the OPC UA stack directly + land
as Stream C.2 follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>