Closes the observer half of #162 that was flagged as "persisted as 0 today"
in PR #105. The Admin /hosts column refresh + FleetStatusHub SignalR push
+ red-badge visual still belong to the visual-compliance pass.
Core.Resilience:
- DriverResilienceStatusTracker gains RecordCallStart + RecordCallComplete
+ CurrentInFlight field on the snapshot record. Concurrent-safe via the
same ConcurrentDictionary.AddOrUpdate pattern as the other recorder methods.
Clamps to zero on over-decrement so a stray Complete-without-Start can't
drive the counter negative.
- CapabilityInvoker gains an optional statusTracker ctor parameter. When
wired, every ExecuteAsync / ExecuteAsync(void) wraps the pipeline call
in try / finally that records start/complete — so the counter advances
cleanly whether the call succeeds, cancels, or throws. Null tracker keeps
the pre-Phase-6.1 Stream E.3 behaviour exactly.
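A hedged sketch of both pieces, the clamped counter + the try/finally wrap
(type and member names here are illustrative, not this PR's exact shapes):
using System.Collections.Concurrent;

sealed class InFlightTrackerSketch
{
    private readonly ConcurrentDictionary<(string DriverInstanceId, string HostName), int> _inFlight = new();

    public void RecordCallStart(string driverInstanceId, string hostName) =>
        _inFlight.AddOrUpdate((driverInstanceId, hostName), 1, (_, n) => n + 1);

    // Clamps at zero so a stray Complete-without-Start can't go negative.
    public void RecordCallComplete(string driverInstanceId, string hostName) =>
        _inFlight.AddOrUpdate((driverInstanceId, hostName), 0, (_, n) => Math.Max(0, n - 1));
}

sealed class InvokerSketch
{
    private readonly InFlightTrackerSketch? _tracker;   // null = pre-existing behaviour

    public InvokerSketch(InFlightTrackerSketch? tracker = null) => _tracker = tracker;

    public async Task<T> ExecuteAsync<T>(
        string driverInstanceId, string hostName,
        Func<CancellationToken, Task<T>> pipelineCall, CancellationToken ct)
    {
        _tracker?.RecordCallStart(driverInstanceId, hostName);
        try
        {
            return await pipelineCall(ct);
        }
        finally
        {
            // Runs on success, cancellation, and exception alike.
            _tracker?.RecordCallComplete(driverInstanceId, hostName);
        }
    }
}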
Server.Hosting:
- ResilienceStatusPublisherHostedService persists CurrentInFlight as the
DriverInstanceResilienceStatus.CurrentBulkheadDepth column (was 0 before
this PR). One-line fix on both the insert + update branches.
The in-flight counter is a pragmatic proxy for Polly's internal bulkhead
depth — a future PR wiring Polly telemetry would replace it with the real
value. The column shape, the publisher, and the Admin /hosts query don't
change, so the follow-up is invisible to consumers.
Tests (8 new InFlightCounterTests, all pass):
- Start+Complete nets to zero.
- Nested starts sum; Complete decrements.
- Complete-without-Start clamps to zero.
- Different hosts track independently.
- Concurrent starts (500 parallel) don't lose count.
- CapabilityInvoker observed-mid-call depth == 1 during a pending call.
- CapabilityInvoker exception path still decrements (try/finally).
- CapabilityInvoker without tracker doesn't throw.
Full solution dotnet test: 1243 passing (was 1235, +8). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the server-side/data-layer piece of Phase 6.4 Stream B.2-B.4. The
CSV-import preview + modal UI (Stream B.3/B.5) still belong to the Admin
UI follow-up — this PR owns the staging tables + atomic finalise alone.
Configuration:
- New EquipmentImportBatch entity (Id, ClusterId, CreatedBy, CreatedAtUtc,
RowsStaged/Accepted/Rejected, FinalisedAtUtc?). Composite index on
(CreatedBy, FinalisedAtUtc) powers the Admin preview modal's "my open
batches" query.
- New EquipmentImportRow entity — one row per CSV row, 8 required columns
from decision #117 + 9 optional from decision #139 + IsAccepted flag +
RejectReason. FK to EquipmentImportBatch with cascade delete so
DropBatch collapses the whole tree.
- EF migration 20260419_..._AddEquipmentImportBatch.
- SchemaComplianceTests expected tables list gains the two new tables.
Admin.Services.EquipmentImportBatchService:
- CreateBatchAsync — new header row, caller-supplied ClusterId + CreatedBy.
- StageRowsAsync(batchId, acceptedRows, rejectedRows) — bulk-inserts the
parsed CSV rows into staging. Rejected rows carry LineNumberInFile +
RejectReason for the preview modal. Throws when the batch is finalised.
- DropBatchAsync — removes batch + cascaded rows. Throws when the batch
was already finalised (rollback via staging is not a time machine).
- FinaliseBatchAsync(batchId, generationId, driverInstanceId, unsLineId) —
atomic apply. Opens an EF transaction when the provider supports it
(SQL Server in prod; InMemory in tests skips the tx), bulk-inserts
every accepted staging row into Equipment, stamps
EquipmentImportBatch.FinalisedAtUtc, commits. Failure rolls back so
Equipment never partially mutates. Double-apply is guarded: a second
finalise throws ImportBatchAlreadyFinalisedException. (Transaction
handling sketched after this list.)
- ListByUserAsync(createdBy, includeFinalised) — the Admin preview modal's
backing query. OrderByDescending on CreatedAtUtc so the most-recent
batch shows first.
- Two exception types: ImportBatchNotFoundException +
ImportBatchAlreadyFinalisedException.
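Transaction sketch for FinaliseBatchAsync (hedged; EF's IsRelational check
is how the provider gate is assumed to work here):
await using var tx = db.Database.IsRelational()
    ? await db.Database.BeginTransactionAsync(ct)
    : null;                                   // InMemory tests: no tx

// ... bulk-insert accepted staging rows into Equipment, stamp FinalisedAtUtc ...

if (tx is not null)
    await tx.CommitAsync(ct);                 // any failure before this rolls back on dispose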
ExternalIdReservation merging (ZTag + SAPID fleet-wide uniqueness) is NOT
done here — a narrower follow-up wires it once the concurrent-insert test
matrix is green.
Tests (10 new EquipmentImportBatchServiceTests, all pass):
- CreateBatch populates Id + CreatedAtUtc + zero-ed counters.
- StageRows accepted + rejected both persist; counters advance.
- DropBatch cascades row delete.
- DropBatch after finalise throws.
- Finalise translates accepted staging rows → Equipment under the target
GenerationId + DriverInstanceId + UnsLineId.
- Finalise twice throws.
- Finalise of unknown batch throws.
- Stage after finalise throws.
- ListByUserAsync filters by creator + finalised flag.
- Drop of unknown batch is a no-op (idempotent rollback).
Full solution dotnet test: 1235 passing (was 1225, +10). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the HostedService half of Phase 6.1 Stream E.2 flagged as a follow-up
when the DriverResilienceStatusTracker shipped in PR #82. The Admin /hosts
column refresh + SignalR push + red-badge visual (Stream E.3) remain
deferred to the visual-compliance pass — this PR owns the persistence
story alone.
Server.Hosting:
- ResilienceStatusPublisherHostedService : BackgroundService. Samples the
DriverResilienceStatusTracker every TickInterval (default 5 s) and upserts
each (DriverInstanceId, HostName) counter pair into
DriverInstanceResilienceStatus via EF. New rows on first sight; in-place
updates on subsequent ticks.
- PersistOnceAsync extracted as a public method so tests drive one tick directly —
matches the ScheduledRecycleHostedService pattern for deterministic
timing.
- Best-effort persistence: on a DB outage the service logs a warning + continues; the next
tick retries. Never crashes the app on sample failure. Cancellation
propagates through cleanly.
- Tracks the bulkhead depth / recycle / footprint columns the entity was
designed for. CurrentBulkheadDepth currently persisted as 0 — the tracker
doesn't yet expose live bulkhead depth; a narrower follow-up wires the
Polly bulkhead-depth observer into the tracker.
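One tick in sketch form (hedged; the DbSet accessor + snapshot member names
are assumptions layered on the entity described in this PR):
foreach (var snap in _tracker.Snapshot())
{
    var row = await db.Set<DriverInstanceResilienceStatus>()
        .FindAsync(new object[] { snap.DriverInstanceId, snap.HostName }, ct);
    if (row is null)
    {
        row = new DriverInstanceResilienceStatus
        {
            DriverInstanceId = snap.DriverInstanceId,
            HostName = snap.HostName,
        };
        db.Add(row);                              // first sight: insert
    }
    row.ConsecutiveFailures = snap.ConsecutiveFailures;
    row.LastCircuitBreakerOpenUtc = snap.LastCircuitBreakerOpenUtc;
    row.CurrentBulkheadDepth = 0;                 // tracker doesn't expose live depth yet
    row.LastSampledUtc = DateTime.UtcNow;
}
await db.SaveChangesAsync(ct);                    // caller catches + logs + retries next tick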
Tests (6 new in ResilienceStatusPublisherHostedServiceTests):
- Empty tracker → tick is a no-op, zero rows written.
- Single-host counters → upsert a new row with ConsecutiveFailures + breaker
timestamp + sampled timestamp.
- Second tick updates the existing row in place (not a second insert).
- Multi-host pairs persist independently.
- Footprint counters (Baseline + Current) round-trip.
- TickCount advances on every PersistOnceAsync call.
Full solution dotnet test: 1225 passing (was 1219, +6). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs) example:
builder.Services.AddSingleton<DriverResilienceStatusTracker>();
builder.Services.AddHostedService<ResilienceStatusPublisherHostedService>();
// Tracker gets wired into CapabilityInvoker via OtOpcUaServer resolution
// + the existing Phase 6.1 layer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the per-device isolation gap flagged at the Phase 6.1 Stream A wire-up
(PR #78 used driver.DriverInstanceId as the pipeline host for every call, so
multi-host drivers like Modbus with N PLCs shared one pipeline — one dead PLC
poisoned sibling breakers). Decision #144 requires per-device isolation; this
PR wires it without breaking single-host drivers.
Core.Abstractions:
- IPerCallHostResolver interface. Optional driver capability. Drivers with
multi-host topology (Modbus across N PLCs, AB CIP across a rack, etc.)
implement this; single-host drivers (Galaxy, S7 against one PLC, OpcUaClient
against one remote server) leave it alone. Must be fast + allocation-free
— called once per tag on the hot path. Unknown refs return empty so dispatch
falls back to single-host without throwing.
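Shape sketch (the single-method form is an assumption; the fallback
contract is from this PR):
public interface IPerCallHostResolver
{
    // Return the per-device host key for this tag reference, or
    // empty/whitespace for unknown refs so dispatch falls back to
    // the driver's DriverInstanceId.
    string ResolveHost(string fullReference);
}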
Server/OpcUa/DriverNodeManager:
- Captures `driver as IPerCallHostResolver` at construction alongside the
existing capability casts.
- New `ResolveHostFor(fullReference)` helper returns either the resolver's
answer or the driver's DriverInstanceId (single-host fallback). Empty /
whitespace resolver output also falls back to DriverInstanceId.
- Every dispatch site now passes `ResolveHostFor(fullRef)` to the invoker
instead of `_driver.DriverInstanceId` — OnReadValue, OnWriteValue, all four
HistoryRead paths. The HistoryRead Events path tolerates fullRef=null and
falls back to DriverInstanceId for those cluster-wide event queries.
- Drivers without IPerCallHostResolver observe zero behavioural change:
every call still keys on DriverInstanceId, same as before.
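The helper in sketch form (field names assumed):
private string ResolveHostFor(string? fullReference)
{
    // No resolver wired, or no ref (cluster-wide event queries): single-host key.
    if (_hostResolver is null || fullReference is null)
        return _driver.DriverInstanceId;

    var host = _hostResolver.ResolveHost(fullReference);
    return string.IsNullOrWhiteSpace(host) ? _driver.DriverInstanceId : host;
}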
Tests (4 new PerCallHostResolverDispatchTests, all pass):
- DeadPlc_DoesNotOpenBreaker_For_HealthyPlc_With_Resolver — 2 PLCs behind
one driver; hammer the dead PLC past its breaker threshold; assert the
healthy PLC's first call succeeds on its first attempt (decision #144).
- EmptyString / unknown-ref fallback behaviour documented via test.
- WithoutResolver_SameHost_Shares_One_Pipeline — regression guard for the
single-host pre-existing behaviour.
- WithResolver_TwoHosts_Get_Two_Pipelines — uses a CachedPipelineCount
assertion to confirm the shared-builder cache keys correctly.
Full solution dotnet test: 1219 passing (was 1215, +4). Pre-existing
Client.CLI Subscribe flake unchanged.
Adoption: Modbus driver (#120 follow-up), AB CIP / AB Legacy / TwinCAT
drivers (also #120) implement the interface and return the per-tag PLC host
string. Single-host drivers stay silent and pay zero cost.
Remaining sub-items of #160 still deferred:
- IAlarmSource.SubscribeAlarmsAsync + AcknowledgeAsync invoker wrapping.
Non-trivial because alarm subscription is push-based from driver through
IAlarmConditionSink — the wrap has to happen at the driver-to-server glue
rather than a synchronous dispatch site.
- Roslyn analyzer asserting every capability-interface call routes through
CapabilityInvoker. Substantial (separate analyzer project + test harness);
the noise-to-value ratio favors shipping this post-v2-GA once the coverage is
known-stable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the Phase 6.1 Stream A.2 "per-instance overrides bound from
DriverInstance.ResilienceConfig JSON column" work flagged as a follow-up
when Stream A.1 shipped in PR #78. Every driver can now override its Polly
pipeline policy per instance instead of inheriting pure tier defaults.
Configuration:
- DriverInstance entity gains a nullable `ResilienceConfig` string column
(nvarchar(max)) + SQL check constraint `CK_DriverInstance_ResilienceConfig_IsJson`
that enforces ISJSON when not null. Null = use tier defaults (decision
#143 / unchanged from pre-Phase-6.1).
- EF migration `20260419161008_AddDriverInstanceResilienceConfig`.
- SchemaComplianceTests expected-constraint list gains the new CK name.
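Constraint wiring sketch (assumes EF Core 7+'s table-builder API; the SQL
body is inferred from the CK name, not copied from the migration):
modelBuilder.Entity<DriverInstance>().ToTable(t => t.HasCheckConstraint(
    "CK_DriverInstance_ResilienceConfig_IsJson",
    "[ResilienceConfig] IS NULL OR ISJSON([ResilienceConfig]) = 1"));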
Core.Resilience.DriverResilienceOptionsParser:
- Pure-function parser. ParseOrDefaults(tier, json, out diag) returns the
effective DriverResilienceOptions — tier defaults with per-capability /
bulkhead overrides layered on top when the JSON payload supplies them.
Partial policies (e.g. Read { retryCount: 10 }) fill missing fields from
the tier default for that capability.
- Malformed JSON falls back to pure tier defaults + surfaces a human-readable
diagnostic via the out parameter. Callers log the diag but don't fail
startup — a misconfigured ResilienceConfig must not brick a working
driver.
- Property names + capability keys are case-insensitive; unrecognised
capability names are logged-and-skipped; unrecognised shape-level keys
are ignored so future shapes land without a migration.
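Usage sketch against the ParseOrDefaults surface above (the diagnostic's
exact type is an assumption):
var options = DriverResilienceOptionsParser.ParseOrDefaults(
    DriverTier.B,
    instance.ResilienceConfig,          // null = pure tier defaults
    out string? diagnostic);
if (diagnostic is not null)
    logger.LogWarning("ResilienceConfig for {Id}: {Diag}",
        instance.DriverInstanceId, diagnostic);
// options is always usable: malformed JSON never bricks the driver.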
Server wire-in:
- OtOpcUaServer gains two optional ctor params: `tierLookup` (driverType →
DriverTier) + `resilienceConfigLookup` (driverInstanceId → JSON string).
CreateMasterNodeManager now resolves tier + JSON for each driver, parses
via DriverResilienceOptionsParser, logs the diagnostic if any, and
constructs CapabilityInvoker with the merged options instead of pure
Tier A defaults.
- OpcUaApplicationHost threads both lookups through. Both default to null,
so existing tests that construct without either Func are unchanged (falls
back to Tier A + tier defaults exactly as before).
Tests (13 new DriverResilienceOptionsParserTests):
- null / whitespace / empty-object JSON returns pure tier defaults.
- Malformed JSON falls back + surfaces diagnostic.
- Read override merged into tier defaults; other capabilities untouched.
- Partial policy fills missing fields from tier default.
- Bulkhead overrides honored.
- Unknown capability skipped + surfaced in diagnostic.
- Property names + capability keys are case-insensitive.
- Every tier × every capability × empty-JSON round-trips tier defaults
exactly (theory).
Full solution dotnet test: 1215 passing (was 1202, +13). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs) example:
Func<string, DriverTier> tierLookup = type => type switch
{
    "Galaxy" => DriverTier.C,
    "Modbus" or "S7" => DriverTier.B,
    "OpcUaClient" => DriverTier.A,
    _ => DriverTier.A,
};
Func<string, string?> cfgLookup = id =>
    db.DriverInstances.AsNoTracking().FirstOrDefault(x => x.DriverInstanceId == id)?.ResilienceConfig;
var host = new OpcUaApplicationHost(..., tierLookup: tierLookup, resilienceConfigLookup: cfgLookup);
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the server-side / non-UI piece of Phase 6.4 Stream D. The Razor
`IdentificationFields.razor` component for Admin-UI editing ships separately
when the Admin UI pass lands (still tracked under #157 UI follow-up).
Core.OpcUa additions:
- IdentificationFolderBuilder — pure-function builder that materializes the
OPC 40010 Machinery companion-spec Identification sub-folder per decision
#139. Reads the nine nullable columns off an Equipment row:
Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision,
YearOfConstruction (short → OPC UA Int32), AssetLocation, ManufacturerUri,
DeviceManualUri. Emits one AddProperty call per non-null field; skips the
sub-folder entirely when all nine are null so browse trees don't carry
pointless empty folders.
- HasAnyFields(equipment) — cheap short-circuit so callers can decide
whether to invoke Folder() at all.
- FolderName constant ("Identification") + FieldNames list exposed so
downstream tools / tests can cross-reference without duplicating the
decision-#139 field set.
ACL binding: the sub-folder + variables live under the Equipment node so
Phase 6.2's PermissionTrie treats them as part of the Equipment ScopeId —
no new scope level. A user with Equipment-level grant reads the
Identification fields; a user without gets BadUserAccessDenied on both the
Equipment node + its Identification variables. Documented in the class
remarks; cross-reference update to acl-design.md is a follow-up.
Tests (9 new IdentificationFolderBuilderTests):
- HasAnyFields all-null false / any-non-null true.
- Build all-null returns null + doesn't emit Folder.
- Build fully-populated emits all 9 fields in decision #139 order.
- Only non-null fields are emitted (3-of-9 case).
- YearOfConstruction short widens to DriverDataType.Int32 with int value.
- String values round-trip through AddProperty.
- FieldNames constant matches decision #139 exactly.
- FolderName is "Identification".
Full solution dotnet test: 1202 passing (was 1193, +9). Pre-existing
Client.CLI Subscribe flake unchanged.
Production integration: the component that consumes this is the
address-space-build flow that walks the live Equipment table + calls
IdentificationFolderBuilder.Build(equipmentFolder, equipment) under each
Equipment node. That integration is the remaining Stream D follow-up
alongside the Razor UI component.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Turns the Phase 6.1 Stream B.4 pure-logic ScheduledRecycleScheduler (shipped
in PR #79) into a running background feature. A Tier C driver registers its
scheduler at startup; the hosted service ticks every TickInterval (default
1 min) and invokes TickAsync on each registered scheduler.
Server.Hosting:
- ScheduledRecycleHostedService : BackgroundService. AddScheduler(s) must be
called before StartAsync — registering post-start throws
InvalidOperationException to avoid "some ticks saw my scheduler, some
didn't" races. ExecuteAsync loops on Task.Delay(TickInterval, _timeProvider,
stoppingToken) + delegates to a public TickOnceAsync method for one tick.
- TickOnceAsync extracted as the unit-of-work so tests drive it directly
without needing to synchronize with FakeTimeProvider + BackgroundService
timing semantics.
- Exception isolation: if one scheduler throws, the loop logs + continues
to the next scheduler. A flaky supervisor can't take down the tick for
every other Tier C driver.
- Diagnostics: TickCount + SchedulerCount properties for tests + logs.
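The loop + isolation in sketch form (TickAsync's exact signature is an
assumption; the TimeProvider-aware Task.Delay overload is .NET 8):
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        await Task.Delay(TickInterval, _timeProvider, stoppingToken);
        await TickOnceAsync(stoppingToken);
    }
}

public async Task TickOnceAsync(CancellationToken ct)
{
    TickCount++;
    foreach (var scheduler in _schedulers)
    {
        try
        {
            await scheduler.TickAsync(_timeProvider.GetUtcNow(), ct);
        }
        catch (Exception ex)
        {
            // One flaky supervisor can't poison the tick for its neighbours.
            _logger.LogError(ex, "Scheduled recycle tick failed; continuing.");
        }
    }
}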
Tests (7 new ScheduledRecycleHostedServiceTests, all pass):
- TickOnce before interval doesn't fire; TickCount still advances.
- TickOnce at/after interval fires the underlying scheduler exactly once.
- Multiple ticks accumulate count.
- AddScheduler after StartAsync throws.
- Throwing scheduler doesn't poison its neighbours (logs + continues).
- SchedulerCount matches registrations.
- Empty scheduler list ticks cleanly (no-op + counter advances).
Full solution dotnet test: 1193 passing (was 1186, +7). Pre-existing
Client.CLI Subscribe flake unchanged.
Production wiring (Program.cs):
builder.Services.AddSingleton<ScheduledRecycleHostedService>();
builder.Services.AddHostedService(sp => sp.GetRequiredService<ScheduledRecycleHostedService>());
// During DI configuration, once Tier C drivers + their ScheduledRecycleSchedulers
// are resolved, call host.AddScheduler(scheduler) for each.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the Phase 6.3 Stream B pure-logic pieces (ServiceLevelCalculator,
RecoveryStateManager, ApplyLeaseRegistry) + Stream A topology loader
(RedundancyCoordinator) into one orchestrator the runtime + OPC UA node
surface consume. The actual OPC UA variable-node plumbing (mapping
ServiceLevel Byte + ServerUriArray String[] onto the Opc.Ua.Server stack)
is a narrower follow-up on top of this — the publisher emits change events
the OPC UA layer subscribes to.
Server.Redundancy additions:
- PeerReachability record + PeerReachabilityTracker — thread-safe
per-peer-NodeId holder of the latest (HttpHealthy, UaHealthy) tuple. Probe
loops (Stream B.1/B.2 runtime follow-up) write via Update; the publisher
reads via Get. PeerReachability.FullyHealthy / Unknown sentinels for the
two most-common states.
- RedundancyStatePublisher — pure orchestrator, no background timer, no OPC
UA stack dep. ComputeAndPublish reads the inputs below + calls the calculator:
* role (from coordinator.Current.SelfRole)
* selfHealthy (caller-supplied Func<bool>)
* peerHttpHealthy + peerUaHealthy (aggregate across all peers in
coordinator.Current.Peers)
* applyInProgress (ApplyLeaseRegistry.IsApplyInProgress)
* recoveryDwellMet (RecoveryStateManager.IsDwellMet)
* topologyValid (coordinator.IsTopologyValid)
* operatorMaintenance (caller-supplied Func<bool>)
Before-coordinator-init returns NoData=1 so clients never see an
authoritative value from an un-bootstrapped server.
OnStateChanged event fires edge-triggered when the byte changes;
OnServerUriArrayChanged fires edge-triggered when the topology's self-first
peer-sorted URI array content changes.
- ServiceLevelSnapshot record — per-tick output with Value + Band +
Topology. The OPC UA layer's ServiceLevel Byte node subscribes to
OnStateChanged; the ServerUriArray node subscribes to OnServerUriArrayChanged.
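Edge-trigger sketch (the last-published fields are assumptions):
public void ComputeAndPublish()
{
    var snapshot = ComputeSnapshot();              // reads the inputs listed above

    if (snapshot.Value != _lastPublishedValue)     // byte changed: fire once
    {
        _lastPublishedValue = snapshot.Value;
        OnStateChanged?.Invoke(snapshot);
    }

    var uris = snapshot.Topology.ServerUriArray();
    if (!uris.SequenceEqual(_lastPublishedUris))   // content change, not reference
    {
        _lastPublishedUris = uris;
        OnServerUriArrayChanged?.Invoke(uris);
    }
}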
Tests (8 new RedundancyStatePublisherTests, all pass):
- Before-init returns NoData (Value=1, Band=NoData).
- Authoritative-Primary when healthy + peer fully reachable.
- Isolated-Primary (230) retains authority when peer unreachable — matches
decision #154 non-promotion semantics.
- Mid-apply band dominates: open lease → Value=200 even with peer healthy.
- Self-unhealthy → NoData regardless of other inputs.
- OnStateChanged fires only on value transitions (edge-triggered).
- OnServerUriArrayChanged fires once per topology content change; repeat
ticks with same topology don't re-emit.
- Standalone cluster treats healthy as AuthoritativePrimary=255.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Server.Tests for the
coordinator-backed publisher tests.
Full solution dotnet test: 1186 passing (was 1178, +8). Pre-existing
Client.CLI Subscribe flake unchanged.
Closes the core of release blocker #3 — the pure-logic + orchestration
layer now exists + is unit-tested. Remaining Stream C surfaces: OPC UA
ServiceLevel Byte variable wiring (binds to OnStateChanged), ServerUriArray
String[] wiring (binds to OnServerUriArrayChanged), RedundancySupport
static from RedundancyMode. Those touch the OPC UA stack directly + land
as Stream C.2 follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the data path that feeds the Phase 6.3 ServiceLevelCalculator shipped in
PR #89. OPC UA node wiring (ServiceLevel variable + ServerUriArray +
RedundancySupport) still deferred to task #147; peer-probe loops (Stream B.1/B.2
runtime layer beyond the calculator logic) deferred.
Server.Redundancy additions:
- RedundancyTopology record — immutable snapshot (ClusterId, SelfNodeId,
SelfRole, Mode, Peers[], SelfApplicationUri). ServerUriArray() emits the
OPC UA Part 4 §6.6.2.2 shape (self first, peers lexicographically by
NodeId). RedundancyPeer record with per-peer Host/OpcUaPort/DashboardPort/
ApplicationUri so the follow-up peer-probe loops know where to probe.
- ClusterTopologyLoader — pure fn from ServerCluster + ClusterNode[] to
RedundancyTopology. Enforces Phase 6.3 Stream A.1 invariants:
* At least one node per cluster.
* At most 2 nodes (decision #83, v2.0 cap).
* Every node belongs to the target cluster.
* Unique ApplicationUri across the cluster (OPC UA Part 4 trust pin,
decision #86).
* At most 1 Primary per cluster in Warm/Hot modes (decision #84).
* Self NodeId must be a member of the cluster.
Violations throw InvalidTopologyException with a decision-ID-tagged message
so operators know which invariant + what to fix.
- RedundancyCoordinator singleton — holds the current topology + IsTopologyValid
flag. InitializeAsync throws on invariant violation (startup fails fast).
RefreshAsync logs + flips IsTopologyValid=false (runtime won't tear down a
running server; ServiceLevelCalculator falls to InvalidTopology band = 2
which surfaces the problem to clients without crashing). CAS-style swap
via Volatile.Write so readers always see a coherent snapshot.
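ServerUriArray() ordering from the first bullet above, in sketch form
(NodeId assumed string-comparable):
public string[] ServerUriArray() =>
    new[] { SelfApplicationUri }
        .Concat(Peers
            .OrderBy(p => p.NodeId, StringComparer.Ordinal)   // lexicographic by NodeId
            .Select(p => p.ApplicationUri))
        .ToArray();                                           // self always first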
Tests (10 new ClusterTopologyLoaderTests):
- Single-node standalone loads + empty peer list.
- Two-node cluster loads self + peer.
- ServerUriArray puts self first + peers sort lexicographically.
- Empty-nodes throws.
- Self-not-in-cluster throws.
- Three-node cluster rejected with decision #83 message.
- Duplicate ApplicationUri rejected with decision #86 shape reference.
- Two Primaries in Warm mode rejected (decision #84 + runtime-band reference).
- Cross-cluster node rejected.
- None-mode allows any role mix (standalone clusters don't enforce Primary count).
Full solution dotnet test: 1178 passing (was 1168, +10). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes release blocker #2 from docs/v2/v2-release-readiness.md — the
generation-sealed cache + resilient reader + stale-config flag shipped as
unit-tested primitives in PR #81, but no production path consumed them until
now. This PR wires them end-to-end.
Server additions:
- SealedBootstrap — Phase 6.1 Stream D consumption hook. Resolves the node's
current generation through ResilientConfigReader's timeout → retry →
fallback-to-sealed pipeline. On every successful central-DB fetch it seals
a fresh snapshot to <cache-root>/<cluster>/<generationId>.db so a future
cache-miss has a known-good fallback. It sits alongside the original NodeBootstrap
(which still uses the single-file ILocalConfigCache); Program.cs can
switch between them once operators are ready for the generation-sealed
semantics.
- OpcUaApplicationHost: new optional staleConfigFlag ctor parameter. When
wired, HealthEndpointsHost consumes `flag.IsStale` via the existing
usingStaleConfig Func<bool> hook, so `/healthz` actually reports
`usingStaleConfig: true` whenever a read fell back to the sealed cache —
closes the loop between Stream D's flag + Stream C's /healthz body shape.
Tests (4 new SealedBootstrapIntegrationTests, all pass):
- Central-DB success path seals snapshot + flag stays fresh.
- Central-DB failure falls back to sealed snapshot + flag flips stale (the
SQL-kill scenario from Phase 6.1 Stream D.4.a).
- No-snapshot + central-down throws GenerationCacheUnavailableException
with a clear error (the first-boot scenario from D.4.c).
- Next successful bootstrap after a fallback clears the stale flag.
Full solution dotnet test: 1168 passing (was 1164, +4). Pre-existing
Client.CLI Subscribe flake unchanged.
Production activation: Program.cs wires SealedBootstrap (instead of
NodeBootstrap), constructs OpcUaApplicationHost with the staleConfigFlag,
and a HostedService polls sp_GetCurrentGenerationForCluster periodically so
peer-published generations land in this node's sealed cache. The poller
itself is Stream D.1.b follow-up.
The sp_PublishGeneration SQL-side hook (where the publish commit itself
could also write to a shared sealed cache) stays deferred — the per-node
seal pattern shipped here is the correct v2 GA model: each Server node
owns its own on-disk cache and refreshes from its own DB reads, matching
the Phase 6.1 scope-table description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the Phase 6.2 security gap the v2 release-readiness dashboard flagged:
the evaluator + trie + gate shipped as code in PRs #84-88 but no dispatch
path called them. This PR threads the gate end-to-end from
OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager and calls it on
every Read / Write / 4 HistoryRead paths.
Server.Security additions:
- NodeScopeResolver — maps driver fullRef → Core.Authorization NodeScope.
Phase 1 shape: populates ClusterId + TagId; leaves NamespaceId / UnsArea /
UnsLine / Equipment null. The cluster-level ACL cascade covers this
configuration (decision #129 additive grants). Finer-grained scope
resolution (joining against the live Configuration DB for UnsArea / UnsLine
path) lands as Stream C.12 follow-up.
- WriteAuthzPolicy.ToOpcUaOperation — maps SecurityClassification → the
OpcUaOperation the gate evaluator consults (Operate/SecuredWrite →
WriteOperate; Tune → WriteTune; Configure/VerifiedWrite → WriteConfigure).
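The mapping as a switch (enum members per the bullet above; the default arm
is an assumption since unmapped values aren't described):
public static OpcUaOperation ToOpcUaOperation(SecurityClassification classification) =>
    classification switch
    {
        SecurityClassification.Operate or SecurityClassification.SecuredWrite
            => OpcUaOperation.WriteOperate,
        SecurityClassification.Tune => OpcUaOperation.WriteTune,
        SecurityClassification.Configure or SecurityClassification.VerifiedWrite
            => OpcUaOperation.WriteConfigure,
        _ => OpcUaOperation.WriteConfigure,   // assumed fail-toward-strictest default
    };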
DriverNodeManager wiring:
- Ctor gains optional AuthorizationGate + NodeScopeResolver; both null means
the pre-Phase-6.2 dispatch runs unchanged (backwards-compat for every
integration test that constructs DriverNodeManager directly).
- OnReadValue: ahead of the invoker call, builds NodeScope + calls
gate.IsAllowed(identity, Read, scope). Denied reads return
BadUserAccessDenied without hitting the driver.
- OnWriteValue: preserves the existing WriteAuthzPolicy check (classification
vs session roles) + adds an additive gate check using
WriteAuthzPolicy.ToOpcUaOperation(classification) to pick the right
WriteOperate/Tune/Configure surface. Lax mode falls through for identities
without LDAP groups.
- Four HistoryRead paths (Raw / Processed / AtTime / Events): gate check
runs per-node before the invoker. Events path tolerates fullRef=null
(event-history queries can target a notifier / driver-root; those are
cluster-wide reads that need a different scope shape — deferred).
- New WriteAccessDenied helper surfaces BadUserAccessDenied in the
OpcHistoryReadResult slot + errors list, matching the shape of the
existing WriteUnsupported / WriteInternalError helpers.
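The gate-check shape those dispatch sites share, in sketch form (helper name
+ plumbing are assumptions):
private bool IsGateAllowed(IUserIdentity identity, OpcUaOperation operation, string fullReference)
{
    if (_gate is null || _scopeResolver is null)
        return true;   // both null: pre-Phase-6.2 dispatch, unchanged

    var scope = _scopeResolver.Resolve(fullReference);
    return _gate.IsAllowed(identity, operation, scope);
}
// Denied calls surface BadUserAccessDenied without ever hitting the driver.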
OtOpcUaServer + OpcUaApplicationHost: gate + resolver thread through as
optional constructor parameters (same pattern as DriverResiliencePipelineBuilder
in Phase 6.1). Null defaults keep the existing 3 OpcUaApplicationHost
integration tests constructing without them unchanged.
Tests (5 new in NodeScopeResolverTests):
- Resolve populates ClusterId + TagId + Equipment Kind.
- Resolve leaves finer path null per Phase 1 shape (doc'd as follow-up).
- Empty fullReference throws.
- Empty clusterId throws at ctor.
- Resolver is stateless across calls.
The existing 9 AuthorizationGate tests (shipped in PR #86) continue to
cover the gate's allow/deny semantics under strict + lax mode.
Full solution dotnet test: 1164 passing (was 1159, +5). Pre-existing
Client.CLI Subscribe flake unchanged. Existing OpcUaApplicationHost +
HealthEndpointsHost + driver integration tests continue to pass because the
gate defaults to null → no enforcement, and the lax-mode fallback returns
true for identities without LDAP groups (the anonymous test path).
Production deployments flip the gate on by constructing it via
OpcUaApplicationHost's new authzGate parameter + setting
`Authorization:StrictMode = true` once ACL data is populated. Flipping the
switch post-seed turns the evaluator + trie from scaffolded code into
actual enforcement.
This closes release blocker #1 listed in docs/v2/v2-release-readiness.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes out Phase 6 with the two pieces a release engineer needs before
tagging v2 GA:
1. scripts/compliance/phase-6-all.ps1 — meta-runner that invokes every
per-phase Phase 6.N compliance script in sequence + aggregates results.
Each sub-script runs in its own powershell.exe child process so per-script
$ErrorActionPreference + exit semantics can't interfere with the parent.
Exit 0 = every phase passes; exit 1 = one or more phases failed. Prints a
PASS/FAIL summary matrix at the end.
2. docs/v2/v2-release-readiness.md — single-view dashboard of everything
shipped + everything still deferred + release exit criteria. Called out
explicitly:
- Three release BLOCKERS (must close before v2 GA):
* Phase 6.2 Stream C dispatch wiring — AuthorizationGate exists but no
DriverNodeManager Read/Write/etc. path calls it (task #143).
* Phase 6.1 Stream D follow-up — ResilientConfigReader + sealed-cache
hook not yet consumed by any read path (task #136).
* Phase 6.3 Streams A/C/F — coordinator + UA-node wiring + client
interop still deferred (tasks #145, #147, #150).
- Three nice-to-haves (not release-blocking) — Admin UI polish, background
services, multi-host dispatch.
- Release exit criteria: all 4 compliance scripts exit 0, dotnet test ≤ 1
known flake, blockers closed or v2.1-deferred with written decision,
Fleet Admin signoff on deployment checklist, live-Galaxy smoke test,
OPC UA CTT pass, redundancy cutover validated with at least one
production client.
- Change log at the bottom so future ships of deferred follow-ups just
append dates + close out dashboard rows.
Meta-runner verified locally:
Phase 6.1 — PASS
Phase 6.2 — PASS
Phase 6.3 — PASS
Phase 6.4 — PASS
Aggregate: PASS (elapsed 340 s — most of that is the full solution
`dotnet test` each phase runs).
Net counts at capstone time: 906 baseline → 1159 passing across Phase 6
(+253). 15 deferred follow-up tasks tracked with IDs (#134-137, #143-144,
#145, #147, #149-150, #153, #155-157). v2 is NOT YET release-ready —
capstone makes that explicit rather than letting the "shipped" label on
each phase imply full readiness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/compliance/phase-6-2-compliance.ps1 replaces the stub TODOs with 23
real checks spanning:
- Stream A: LdapGroupRoleMapping entity + AdminRole enum + ILdapGroupRoleMappingService
+ impl + write-time invariant + EF migration all present.
- Stream B: OpcUaOperation enum + NodeScope + AuthorizationDecision tri-state
+ IPermissionEvaluator + PermissionTrie + Builder + Cache keyed on
GenerationId + UserAuthorizationState with MembershipFreshnessInterval=15m
and AuthCacheMaxStaleness=5m + TriePermissionEvaluator + HistoryRead uses
its own flag.
- Control/data-plane separation: the evaluator + trie + cache + builder +
interface all have zero references to LdapGroupRoleMapping (decision #150).
- Stream C foundation: ILdapGroupsBearer + AuthorizationGate with StrictMode
knob. DriverNodeManager dispatch-path wiring (11 surfaces) is Deferred,
tracked as task #143.
- Stream D data layer: ValidatedNodeAclAuthoringService + exception type +
rejects None permissions. Blazor UI pieces (RoleGrantsTab, AclsTab,
SignalR invalidation, draft diff) are Deferred, tracked as task #144.
- Cross-cutting: full solution dotnet test runs; 1097 >= 1042 baseline;
tolerates the one pre-existing Client.CLI Subscribe flake.
IPermissionEvaluator doc-comment reworded to avoid mentioning the literal
type name "LdapGroupRoleMapping" — the compliance check does a text-absence
sweep for that identifier across the data-plane files.
docs/v2/implementation/phase-6-2-authorization-runtime.md status updated from
DRAFT to SHIPPED (core). Two deferred follow-ups explicitly called out so
operators see what's still pending for the "Phase 6.2 fully wired end-to-end"
milestone.
`Phase 6.2 compliance: PASS` — exit 0. Any regression that deletes a class
or re-introduces an LdapGroupRoleMapping reference into the data-plane
evaluator turns a green check red + exit non-zero.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the non-UI piece of Stream D: a draft-aware write surface over NodeAcl
that enforces the Phase 6.2 plan's scope-uniqueness + grant-shape invariants.
Blazor UI pieces (RoleGrantsTab + AclsTab refresh + SignalR invalidation +
visual-compliance reviewer signoff) are deferred to the Phase 6.1-style
follow-up task.
Admin.Services:
- ValidatedNodeAclAuthoringService — alongside existing NodeAclService (raw
CRUD, kept for read + revoke paths). GrantAsync enforces:
* Permissions != None (decision #129 — additive only, no empty grants).
* Cluster scope has null ScopeId.
* Sub-cluster scope requires a populated ScopeId.
* No duplicate (GenerationId, ClusterId, LdapGroup, ScopeKind, ScopeId)
tuple — operator updates the row instead of inserting a duplicate.
UpdatePermissionsAsync also rejects None (operator revokes via NodeAclService).
Violations throw InvalidNodeAclGrantException.
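The guards as straight-line checks (a sketch; the DbSet name is an
assumption):
if (permissions == NodePermissions.None)
    throw new InvalidNodeAclGrantException(
        "Grants are additive (decision #129); None is a revoke, not a grant.");

if (scopeKind == ScopeKind.Cluster && scopeId is not null)
    throw new InvalidNodeAclGrantException("Cluster scope must carry a null ScopeId.");

if (scopeKind != ScopeKind.Cluster && string.IsNullOrWhiteSpace(scopeId))
    throw new InvalidNodeAclGrantException("Sub-cluster scope requires a populated ScopeId.");

bool duplicate = await db.NodeAcls.AnyAsync(a =>
    a.GenerationId == generationId && a.ClusterId == clusterId &&
    a.LdapGroup == ldapGroup && a.ScopeKind == scopeKind && a.ScopeId == scopeId, ct);
if (duplicate)
    throw new InvalidNodeAclGrantException("Duplicate grant; update the existing row instead.");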
Tests (10 new in Admin.Tests/ValidatedNodeAclAuthoringServiceTests):
- Grant rejects None permissions.
- Grant rejects Cluster-scope with ScopeId / sub-cluster without ScopeId.
- Grant succeeds on well-formed row.
- Grant rejects duplicate (group, scope) in same draft.
- Grant allows same group at different scope.
- Grant allows same (group, scope) in different draft.
- UpdatePermissions rejects None.
- UpdatePermissions round-trips new flags + notes.
- UpdatePermissions on unknown rowid throws.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Admin.Tests csproj.
Full solution dotnet test: 1097 passing (was 1087, +10). Phase 6.2 total is
now 1087+10 = 1097; baseline 906 → +191 net across Phase 6.1 (all streams) +
Phase 6.2 (Streams A, B, C foundation, D data layer).
Stream D follow-up task tracks: RoleGrantsTab CRUD over LdapGroupRoleMapping,
AclsTab write-through + Probe-this-permission diagnostic, draft-diff ACL
section, SignalR PermissionTrieCache invalidation push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the integration seam between the Server project's OPC UA stack and the
Core.Authorization evaluator. Actual DriverNodeManager dispatch-path wiring
(Read/Write/HistoryRead/Browse/Call/Subscribe/Alarm surfaces) lands in the
follow-up PR on this branch — covered by Task #143 below.
Server.Security additions:
- ILdapGroupsBearer — marker interface a custom IUserIdentity implements to
expose its resolved LDAP group DNs. Parallel to the existing IRoleBearer
(admin roles) — control/data-plane separation per decision #150.
- AuthorizationGate — stateless bridge between Opc.Ua.IUserIdentity and
IPermissionEvaluator. IsAllowed(identity, operation, scope) materializes a
UserAuthorizationState from the identity's LDAP groups, delegates to the
evaluator, and returns a single bool the dispatch paths use to decide
whether to surface BadUserAccessDenied.
- StrictMode knob controls fail-open-during-transition vs fail-closed:
- strict=false (default during rollout) — null identity, identity without
ILdapGroupsBearer, or NotGranted outcome all return true so older
deployments without ACL data keep working.
- strict=true (production target) — any of the above returns false.
The appsetting `Authorization:StrictMode = true` flips deployments over
once their ACL data is populated.
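Decision-flow sketch (verdict handling inferred from the strict/lax
behaviour above):
public bool IsAllowed(IUserIdentity? identity, OpcUaOperation operation, NodeScope scope)
{
    if (identity is not ILdapGroupsBearer bearer)
        return !StrictMode;   // lax: fail open during rollout; strict: fail closed

    var state = BuildSessionState(bearer);
    if (state is null)
        return !StrictMode;   // identity carries no LDAP groups

    var decision = _evaluator.Authorize(state, operation, scope);
    return decision.Verdict switch
    {
        AuthorizationVerdict.Allow      => true,
        AuthorizationVerdict.NotGranted => !StrictMode,
        _                               => false,   // Denied: reserved for v2.1
    };
}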
Tests (9 new in Server.Tests/AuthorizationGateTests):
- Null identity — strict denies, lax allows.
- Identity without LDAP groups — strict denies, lax allows.
- LDAP group with matching grant allows.
- LDAP group without grant — strict denies.
- Wrong operation denied (Read-only grant, WriteOperate requested).
- BuildSessionState returns materialized state with LDAP groups + null when
identity doesn't carry them.
Full solution dotnet test: 1087 passing (Phase 6.1 = 1042, Phase 6.2 A = +9,
B = +27, C foundation = +9 = 1087). Pre-existing Client.CLI Subscribe flake
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships Stream B.1-B.6 — the data-plane authorization engine Phase 6.2 runs on.
Integration into OPC UA dispatch (Stream C — Read / Write / HistoryRead /
Subscribe / Browse / Call etc.) is the next PR on this branch.
New Core.Abstractions:
- OpcUaOperation enum enumerates every OPC UA surface the evaluator gates:
Browse, Read, WriteOperate/Tune/Configure (split by SecurityClassification),
HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions,
Call, AlarmAcknowledge/Confirm/Shelve. Stream C maps each one back to its
dispatch call site.
New Core.Authorization namespace:
- NodeScope record + NodeHierarchyKind — 6-level scope addressing for
Equipment-kind (UNS) namespaces, folder-segment walk for SystemPlatform-kind
(Galaxy). NodeScope carries a Kind selector so the evaluator knows which
hierarchy to descend.
- AuthorizationDecision { Verdict, Provenance } + AuthorizationVerdict
{Allow, NotGranted, Denied} + MatchedGrant. Tri-state per decision #149;
Phase 6.2 only produces Allow + NotGranted, Denied stays reserved for v2.1
Explicit Deny without API break.
- IPermissionEvaluator.Authorize(session, operation, scope).
- PermissionTrie + PermissionTrieNode + TrieGrant. In-memory trie keyed on
the ACL scope hierarchy. CollectMatches walks Cluster → Namespace →
UnsArea → UnsLine → Equipment → Tag (or → FolderSegment(s) → Tag on
Galaxy). Pure additive union — matches that share an LDAP group with the
session contribute flags; OR across levels.
- PermissionTrieBuilder static factory. Build(clusterId, generationId, rows,
scopePaths?) returns a trie for one generation. Cross-cluster rows are
filtered out so the trie is cluster-coherent. Stream C follow-up wires a
real scopePaths lookup from the live DB; tests supply hand-built paths.
- PermissionTrieCache — process-singleton, keyed on (ClusterId, GenerationId).
Install(trie) adds a generation + promotes to "current" when the id is
highest-known (handles out-of-order installs gracefully). Prior generations
retained so an in-flight request against a prior trie still succeeds; GC
via Prune(cluster, keepLatest).
- UserAuthorizationState — per-session cache of resolved LDAP groups +
AuthGenerationId + MembershipVersion + MembershipResolvedUtc. Bounded by
MembershipFreshnessInterval (default 15 min per decision #151) +
AuthCacheMaxStaleness (default 5 min per decision #152).
- TriePermissionEvaluator — default IPermissionEvaluator. Fails closed on
stale sessions (IsStale check short-circuits to NotGranted), on cross-
cluster requests, on empty trie cache. Maps OpcUaOperation → NodePermissions
via MapOperationToPermission (total — every enum value has a mapping; tested).
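The additive union at the core, in sketch form (TrieGrant member names are
assumptions):
// Every match sharing an LDAP group with the session ORs its flags in;
// no level can subtract permission.
NodePermissions effective = NodePermissions.None;
foreach (TrieGrant grant in trie.CollectMatches(scope))
{
    if (session.LdapGroups.Contains(grant.LdapGroup))
        effective |= grant.Permissions;
}
bool allowed = (effective & required) == required;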
Tests (27 new, all pass):
- PermissionTrieTests (7): cluster-level grant cascades to every tag;
equipment-level grant doesn't leak to sibling equipment; multi-group union
ORs flags; no-matching-group returns empty; Galaxy folder-segment grant
doesn't leak to sibling folder; cross-cluster rows don't land in this
cluster's trie; build is idempotent (B.6 invariants).
- TriePermissionEvaluatorTests (8): allow when flag matches; NotGranted when
no matching group; NotGranted when flags insufficient; HistoryRead requires
its own bit (decision-level requirement); cross-cluster session denied;
stale session fails closed; no cached trie denied; MapOperationToPermission
is total across every OpcUaOperation.
- PermissionTrieCacheTests (8): empty cache returns null; install-then-get
round-trips; new generation becomes current; out-of-order install doesn't
downgrade current; invalidate drops one cluster; prune retains most recent;
prune no-op when fewer than keep; cluster isolation.
- UserAuthorizationStateTests (4): fresh is not stale; IsStale after 5 min
default; NeedsRefresh true between freshness + staleness windows.
Full solution dotnet test: 1078 passing (baseline 906, Phase 6.1 = 1042,
Phase 6.2 Stream A = +9, Stream B = +27 = 1078). Pre-existing Client.CLI
Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stream A.1-A.2 per docs/v2/implementation/phase-6-2-authorization-runtime.md.
Seed-data migration (A.3) is a separate follow-up once production LDAP group
DNs are finalised; until then, CRUD via the Admin UI handles the fleet setup.
Configuration:
- New AdminRole enum {ConfigViewer, ConfigEditor, FleetAdmin} — string-stored.
- New LdapGroupRoleMapping entity with Id (surrogate PK), LdapGroup (512 chars),
Role (AdminRole enum), ClusterId (nullable, FK to ServerCluster), IsSystemWide,
CreatedAtUtc, Notes.
- EF config: UX_LdapGroupRoleMapping_Group_Cluster unique index on
(LdapGroup, ClusterId) + IX_LdapGroupRoleMapping_Group hot-path index on
LdapGroup for sign-in lookups. Cluster FK cascades on cluster delete.
- Migration 20260419_..._AddLdapGroupRoleMapping generated via `dotnet ef`.
Configuration.Services:
- ILdapGroupRoleMappingService — CRUD surface. Declared as control-plane only
per decision #150; the OPC UA data-path evaluator must NOT depend on this
interface (Phase 6.2 compliance check on control/data-plane separation).
GetByGroupsAsync is the hot-path sign-in lookup.
- LdapGroupRoleMappingService (EF Core impl) enforces the write-time invariant
"exactly one of (ClusterId populated, IsSystemWide=true)" and surfaces
InvalidLdapGroupRoleMappingException on violation. Create auto-populates Id
+ CreatedAtUtc when omitted.
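The invariant reduces to an XOR (sketch; the message text is illustrative):
bool clusterScoped = mapping.ClusterId is not null;
if (clusterScoped == mapping.IsSystemWide)   // both set, or neither
    throw new InvalidLdapGroupRoleMappingException(
        "Exactly one of ClusterId (populated) or IsSystemWide=true must hold.");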
Tests (9 new, all pass) in Configuration.Tests:
- Create sets Id + CreatedAtUtc.
- Create rejects empty LdapGroup.
- Create rejects IsSystemWide=true with populated ClusterId.
- Create rejects IsSystemWide=false with null ClusterId.
- GetByGroupsAsync returns matching rows only.
- GetByGroupsAsync with empty input returns empty (no full-table scan).
- ListAllAsync orders by group then cluster.
- Delete removes the target row.
- Delete of unknown id is a no-op.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Configuration.Tests for
the service-level tests (schema-compliance tests still use the live SQL
fixture).
SchemaComplianceTests updated to expect the new LdapGroupRoleMapping table.
Full solution dotnet test: 1051 passing (baseline 906, Phase 6.1 shipped at
1042, Phase 6.2 Stream A adds 9 = 1051). Pre-existing Client.CLI Subscribe
flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/compliance/phase-6-1-compliance.ps1 replaces the stub TODOs with 34
real checks covering:
- Stream A: pipeline builder + CapabilityInvoker + WriteIdempotentAttribute
present; pipeline key includes HostName (per-device isolation per decision
#144); OnReadValue / OnWriteValue / HistoryRead route through invoker in
DriverNodeManager; Galaxy supervisor CircuitBreaker + Backoff preserved.
- Stream B: DriverTier enum; DriverTypeMetadata requires Tier; MemoryTracking
+ MemoryRecycle (Tier C-gated) + ScheduledRecycleScheduler (rejects Tier
A/B) + demand-aware WedgeDetector all present.
- Stream C: DriverHealthReport + HealthEndpointsHost; state matrix Healthy=200
/ Faulted=503 asserted in code; LogContextEnricher; JSON sink opt-in via
Serilog:WriteJson.
- Stream D: GenerationSealedCache + ReadOnly marking + GenerationCacheUnavailable
exception path; ResilientConfigReader + StaleConfigFlag.
- Stream E data layer: DriverInstanceResilienceStatus entity +
DriverResilienceStatusTracker. SignalR/Blazor surface is Deferred per the
visual-compliance follow-up pattern borrowed from Phase 6.4.
- Cross-cutting: full solution `dotnet test` runs; asserts 1042 >= 906
baseline; tolerates the one pre-existing Client.CLI Subscribe flake and
flags any new failure.
Running the script locally returns "Phase 6.1 compliance: PASS" — exit 0. Any
future regression that deletes a class or un-wires a dispatch path turns a
green check red + exit non-zero.
docs/v2/implementation/phase-6-1-resilience-and-observability.md status
updated from DRAFT to SHIPPED with the merged-PRs summary + test count delta +
the single deferred follow-up (visual review of the Admin /hosts columns).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the data + runtime layer of Stream E. The SignalR hub and Blazor /hosts
page refresh (E.2-E.3) are follow-up work paired with the visual-compliance
review per Phase 6.4 patterns — documented as a deferred follow-up below.
Configuration:
- New entity DriverInstanceResilienceStatus with:
DriverInstanceId, HostName (composite PK),
LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth,
LastRecycleUtc, BaselineFootprintBytes, CurrentFootprintBytes,
LastSampledUtc.
- Separate from DriverHostStatus (per-host connectivity view) so a Running
host that has tripped its breaker or is nearing its memory ceiling shows up
distinctly on Admin /hosts. Admin page left-joins both for display.
- OtOpcUaConfigDbContext + Fluent-API config + IX_DriverResilience_LastSampled
index for the stale-sample filter query.
- EF migration: 20260419124034_AddDriverInstanceResilienceStatus.
Core.Resilience:
- DriverResilienceStatusTracker — process-singleton in-memory tracker keyed on
(DriverInstanceId, HostName). CapabilityInvoker + MemoryTracking +
MemoryRecycle callers record failure/success/breaker-open/recycle/footprint
events; a HostedService (Stream E.2 follow-up) samples this tracker every
5 s and persists to the DB. Pure in-memory keeps tests fast + the core
free of EF/SQL dependencies.
Tests:
- DriverResilienceStatusTrackerTests (9 new, all pass): tryget-before-write
returns null; failures accumulate; success resets; breaker/recycle/footprint
fields populate; per-host isolation; snapshot returns all pairs; concurrent
writes don't lose counts.
- SchemaComplianceTests: expected-tables list updated to include the new
DriverInstanceResilienceStatus table.
Full solution dotnet test: 1042 passing (baseline 906, +136 for Phase 6.1 so
far across Streams A/B/C/D/E.1). Pre-existing Client.CLI Subscribe flake
unchanged.
Deferred to follow-up PR (E.2/E.3):
- ResilienceStatusPublisher HostedService that samples DriverResilienceStatusTracker
every 5 s + upserts DriverInstanceResilienceStatus rows.
- Admin FleetStatusHub SignalR hub pushing LastCircuitBreakerOpenUtc /
CurrentBulkheadDepth / LastRecycleUtc on change.
- Admin /hosts Blazor column additions (red badge when
ConsecutiveFailures > breakerThreshold / 2). Visual-compliance reviewer
signoff alongside Phase 6.4 admin-ui patterns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Stream D per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
New Configuration.LocalCache types (alongside the existing single-file
LiteDbConfigCache):
- GenerationSealedCache — file-per-generation sealed snapshots per decision
#148. Each SealAsync writes <cache-root>/<clusterId>/<generationId>.db as a
read-only LiteDB file, then atomically publishes the CURRENT pointer via
temp-file + File.Replace. Prior-generation files stay on disk for audit.
Mixed-generation reads are structurally impossible: ReadCurrentAsync opens
the single file named by CURRENT. Corruption of the pointer or the sealed
file raises GenerationCacheUnavailableException — fails closed, never falls
back silently to an older generation. TryGetCurrentGenerationId returns the
pointer value or null for diagnostics.
- StaleConfigFlag — thread-safe (Volatile.Read/Write) bool. MarkStale when a
read fell back to the cache; MarkFresh when a central-DB read succeeded.
Surfaced on /healthz body and Admin /hosts (Stream C wiring already in
place).
- ResilientConfigReader — wraps a central-DB fetch function with the Stream
D.2 pipeline: timeout 2 s → retry N× jittered (skipped when retryCount=0) →
fallback to the sealed cache. Toggles StaleConfigFlag per outcome. Read path
only — the write path is expected to bypass this wrapper and fail hard on
DB outage so inconsistent writes never land. Cancellation passes through
and is NOT retried.
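Pointer-publish sketch (assumes a plain-text CURRENT file; the sealed LiteDB
write itself is elided):
var tmpPath = Path.Combine(clusterDir, "CURRENT.tmp");
await File.WriteAllTextAsync(tmpPath, generationId.ToString(), ct);

var currentPath = Path.Combine(clusterDir, "CURRENT");
if (File.Exists(currentPath))
    File.Replace(tmpPath, currentPath, destinationBackupFileName: null);
else
    File.Move(tmpPath, currentPath);   // first publish: nothing to replace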
Configuration.csproj:
- Polly.Core 8.6.6 + Microsoft.Extensions.Logging.Abstractions added.
Tests (17 new, all pass):
- GenerationSealedCacheTests (10): first-boot-no-snapshot throws
GenerationCacheUnavailableException (D.4 scenario C), seal-then-read round
trip, sealed file is ReadOnly on disk, pointer advances to latest, prior
generation file preserved, corrupt sealed file fails closed, missing sealed
file fails closed, corrupt pointer fails closed (D.4 scenario B), same
generation sealed twice is idempotent, independent clusters don't
interfere.
- ResilientConfigReaderTests (4): central-DB success returns value + marks
fresh; central-DB failure exhausts retries + falls back to cache + marks
stale (D.4 scenario A); central-DB + cache both unavailable throws;
cancellation not retried.
- StaleConfigFlagTests (3): default is fresh; toggles; concurrent writes
converge.
Full solution dotnet test: 1033 passing (baseline 906, +127 net across Phase
6.1 Streams A/B/C/D). Pre-existing Client.CLI Subscribe flake unchanged.
Integration into Configuration read paths (DriverInstance enumeration,
LdapGroupRoleMapping fetches, etc.) + the sp_PublishGeneration hook that
writes sealed files lands in the Phase 6.1 Stream E / Admin-refresh PR where
the DB integration surfaces are already touched. Existing LiteDbConfigCache
continues serving its single-file role for the NodeBootstrap path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Stream C per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
Core.Observability (new namespace):
- DriverHealthReport — pure-function aggregation over DriverHealthSnapshot list.
Empty fleet = Healthy. Any Faulted = Faulted. Any Unknown/Initializing (no
Faulted) = NotReady. Any Degraded or Reconnecting (no Faulted, no NotReady)
= Degraded. Else Healthy. HttpStatus(verdict) maps to the Stream C.1 state
matrix: Healthy/Degraded → 200, NotReady/Faulted → 503.
- LogContextEnricher — Serilog LogContext wrapper. Push(id, type, capability,
correlationId) returns an IDisposable scope; inner log calls carry
DriverInstanceId / DriverType / CapabilityName / CorrelationId structured
properties automatically. NewCorrelationId = 12-hex-char GUID slice for
cases where no OPC UA RequestHeader.RequestHandle is in flight.
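The aggregation precedence from the first bullet above, as code (verdict and
state enum names are assumptions):
public static FleetVerdict Aggregate(IReadOnlyList<DriverHealthSnapshot> fleet)
{
    if (fleet.Count == 0) return FleetVerdict.Healthy;   // empty fleet = Healthy
    if (fleet.Any(d => d.State == DriverState.Faulted)) return FleetVerdict.Faulted;
    if (fleet.Any(d => d.State is DriverState.Unknown or DriverState.Initializing))
        return FleetVerdict.NotReady;
    if (fleet.Any(d => d.State is DriverState.Degraded or DriverState.Reconnecting))
        return FleetVerdict.Degraded;
    return FleetVerdict.Healthy;
}

public static int HttpStatus(FleetVerdict verdict) =>
    verdict is FleetVerdict.Healthy or FleetVerdict.Degraded ? 200 : 503;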
CapabilityInvoker — now threads LogContextEnricher around every ExecuteAsync /
ExecuteWriteAsync call site. OtOpcUaServer passes driver.DriverType through
so logs correlate to the driver type too. Every capability call emits
structured fields per the Stream C.4 compliance check.
Server.Observability:
- HealthEndpointsHost — standalone HttpListener on http://localhost:4841/
(loopback avoids Windows URL-ACL elevation; remote probing via reverse
proxy or explicit netsh urlacl grant). Routes:
/healthz → 200 when (configDbReachable OR usingStaleConfig); 503 otherwise.
Body: status, uptimeSeconds, configDbReachable, usingStaleConfig.
/readyz → DriverHealthReport.Aggregate + HttpStatus mapping.
Body: verdict, drivers[], degradedDrivers[], uptimeSeconds.
anything else → 404.
Disposal cooperative with the HttpListener shutdown.
- OpcUaApplicationHost starts the health host after the OPC UA server comes up
and disposes it on shutdown. New OpcUaServerOptions knobs:
HealthEndpointsEnabled (default true), HealthEndpointsPrefix (default
http://localhost:4841/).
Program.cs:
- Serilog pipeline adds Enrich.FromLogContext + opt-in JSON file sink via
`Serilog:WriteJson = true` appsetting. Uses Serilog.Formatting.Compact's
CompactJsonFormatter (one JSON object per line — SIEMs like Splunk,
Datadog, Graylog ingest without a regex parser).
Server.Tests:
- Existing 3 OpcUaApplicationHost integration tests now set
HealthEndpointsEnabled=false to avoid port :4841 collisions under parallel
execution.
- New HealthEndpointsHostTests (9): /healthz healthy empty fleet; stale-config
returns 200 with flag; unreachable+no-cache returns 503; /readyz empty/
Healthy/Faulted/Degraded/Initializing drivers return correct status and
bodies; unknown path → 404. Uses ephemeral ports via Interlocked counter.
Core.Tests:
- DriverHealthReportTests (8): empty fleet, all-healthy, any-Faulted trumps,
any-NotReady without Faulted, Degraded without Faulted/NotReady, HttpStatus
per-verdict theory.
- LogContextEnricherTests (8): all 4 properties attach; scope disposes cleanly;
NewCorrelationId shape; null/whitespace driverInstanceId throws.
- CapabilityInvokerEnrichmentTests (2): inner logs carry structured
properties; no context leak outside the call site.
Full solution dotnet test: 1016 passing (baseline 906, +110 for Phase 6.1 so
far across Streams A+B+C). Pre-existing Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes out Stream B per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
Core.Abstractions:
- IDriverSupervisor — process-level supervisor contract a Tier C driver's
out-of-process topology provides (Galaxy Proxy/Supervisor implements this in
a follow-up Driver.Galaxy wiring PR). Concerns: DriverInstanceId + RecycleAsync.
Tier A/B drivers don't implement this; Stream B code asserts tier == C before
ever calling it.
Core.Stability:
- MemoryRecycle — companion to MemoryTracking. On HardBreach, invokes the
supervisor IFF tier == C AND a supervisor is wired. Tier A/B HardBreach logs
a promotion-to-Tier-C recommendation and returns false. Soft/None/Warming
never triggers a recycle at any tier.
- ScheduledRecycleScheduler — Tier C opt-in periodic recycler per decision #67.
Ctor throws for Tier A/B (structural guard — scheduled recycle on an
in-process driver would kill every OPC UA session and every co-hosted
driver). TickAsync(now) advances the schedule by one interval per fire;
RequestRecycleNowAsync drives an ad-hoc recycle without shifting the cron.
- WedgeDetector — demand-aware per decision #147. Classify(state, demand, now)
returns:
* NotApplicable when driver state != Healthy
* Idle when Healthy + no pending work (bulkhead=0 && monitored=0 && historic=0)
* Healthy when Healthy + pending work + progress within threshold
* Faulted when Healthy + pending work + no progress within threshold
Threshold clamps to min 60 s. DemandSignal.HasPendingWork ORs the three counters.
The three false-wedge cases the plan calls out all stay Healthy: idle
subscription-only, slow historian backfill making progress, write-only burst
with drained bulkhead.
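Classification sketch (the progress-timestamp member is an assumption):
public WedgeVerdict Classify(DriverState state, DemandSignal demand, DateTimeOffset now)
{
    if (state != DriverState.Healthy)
        return WedgeVerdict.NotApplicable;        // non-Healthy states have their own story

    if (!demand.HasPendingWork)                   // bulkhead=0 && monitored=0 && historic=0
        return WedgeVerdict.Idle;                 // idle is never a wedge

    return now - demand.LastProgressUtc <= _threshold   // threshold clamped to >= 60 s
        ? WedgeVerdict.Healthy
        : WedgeVerdict.Faulted;
}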
Tests (22 new, all pass):
- MemoryRecycleTests (7): Tier C hard-breach requests recycle; Tier A/B
hard-breach never requests; Tier C without supervisor no-ops; soft-breach
at every tier never requests; None/Warming never request.
- ScheduledRecycleSchedulerTests (6): ctor throws for A/B; zero/negative
interval throws; tick before due no-ops; tick at/after due fires once and
advances; RequestRecycleNow fires immediately without shifting schedule;
multiple fires across ticks advance one interval each.
- WedgeDetectorTests (9): threshold clamp to 60 s; unhealthy driver always
NotApplicable; idle subscription stays Idle; pending+fresh progress stays
Healthy; pending+stale progress is Faulted; MonitoredItems active but no
publish is Faulted; MonitoredItems active with fresh publish stays Healthy;
historian backfill with fresh progress stays Healthy; write-only burst with
empty bulkhead is Idle; HasPendingWork theory for any non-zero counter.
Full solution dotnet test: 989 passing (baseline 906, +83 for Phase 6.1 so far).
Pre-existing Client.CLI Subscribe flake unchanged.
Stream B complete. Next up: Stream C (health endpoints + structured logging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stream B.1 — registry invariant:
- DriverTypeMetadata gains a required `DriverTier Tier` field. Every registered
driver type must declare its stability tier so the downstream MemoryTracking,
MemoryRecycle, and resilience-policy layers can resolve the right defaults.
Stamped-at-registration-time enforcement makes the "every driver type has a
non-null Tier" compliance check structurally impossible to fail.
- DriverTypeRegistry API unchanged; one new property on the record.
Stream B.2 — MemoryTracking (Core.Stability):
- Tier-agnostic tracker per decision #146: captures baseline as the median of
samples collected during a post-init warmup window (default 5 min), then
classifies each subsequent sample with the hybrid formula
`soft = max(multiplier × baseline, baseline + floor)`, `hard = 2 × soft`.
- Per-tier constants wired: Tier A mult=3 floor=50 MB, Tier B mult=3 floor=100 MB,
Tier C mult=2 floor=500 MB.
- Never kills. Hard-breach action returns HardBreach; the supervisor that acts
on that signal (MemoryRecycle) is Tier C only per decisions #74, #145 and
lands in the next B.3 commit on this branch.
- Two phases: WarmingUp (samples collected, Warming returned) and Steady
(baseline captured, soft/hard checks active). Transition is automatic when
the warmup window elapses.
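The formula in sketch form (DriverTier is the Core.Abstractions enum; the helper names here are illustrative):

```csharp
using System;

// DriverTier as declared in Core.Abstractions; constants per decision #146.
public enum DriverTier { A, B, C }

public static class MemoryThresholdSketch
{
    static (double Multiplier, long FloorBytes) TierConstants(DriverTier tier) => tier switch
    {
        DriverTier.A => (3.0,  50L * 1024 * 1024),  // mult=3, floor=50 MB
        DriverTier.B => (3.0, 100L * 1024 * 1024),  // mult=3, floor=100 MB
        DriverTier.C => (2.0, 500L * 1024 * 1024),  // mult=2, floor=500 MB
        _ => throw new ArgumentOutOfRangeException(nameof(tier)),
    };

    // soft = max(multiplier × baseline, baseline + floor); hard = 2 × soft.
    // The max() means the floor protects small-baseline drivers while the
    // multiplier governs large-baseline ones.
    public static (long Soft, long Hard) Thresholds(long baselineBytes, DriverTier tier)
    {
        var (mult, floor) = TierConstants(tier);
        long soft = Math.Max((long)(mult * baselineBytes), baselineBytes + floor);
        return (soft, 2 * soft);
    }
}
```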
Tests (15 new, all pass):
- Warming phase returns Warming until the window elapses.
- Window-elapsed captures median baseline + transitions to Steady.
- Per-tier constants match decision #146 table exactly.
- Soft threshold uses max() — small baseline → floor wins; large baseline →
multiplier wins.
- Hard = 2 × soft.
- Sample below soft = None; at soft = SoftBreach; at/above hard = HardBreach.
- DriverTypeRegistry: theory asserts Tier round-trips for A/B/C.
Full solution dotnet test: 963 passing (baseline 906, +57 net for Phase 6.1
Stream A + Stream B.1/B.2). Pre-existing Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Stream A.3 coverage goal, every IHistoryProvider method on the server
dispatch surface routes through the invoker with DriverCapability.HistoryRead:
- HistoryReadRaw (line 487)
- HistoryReadProcessed (line 551)
- HistoryReadAtTime (line 608)
- HistoryReadEvents (line 665)
Each gets timeout + per-(driver, host) circuit breaker + the default Tier
retry policy (Tier A default: 2 retries at 30s timeout). Inner driver
GetAwaiter().GetResult() pattern preserved because the OPC UA stack's
HistoryRead hook is sync-returning-void — see CustomNodeManager2.
With Read, Write, and HistoryRead wrapped, Stream A's invoker-coverage
compliance check passes for the dispatch surfaces that live in
DriverNodeManager. Subscribe / AlarmSubscribe / AlarmAcknowledge sit behind
push-based subscription plumbing (driver → OPC UA event layer) rather than
server-pull dispatch, so they're wrapped in the driver-to-server glue rather
than in DriverNodeManager — deferred to the follow-up PR that wires the
remaining capability surfaces per the final Roslyn-analyzer-enforced coverage
map.
Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every OnReadValue / OnWriteValue now routes through the process-singleton
DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch
paths gain timeout + per-capability retry + per-(driver, host) circuit breaker
+ bulkhead without touching the individual driver implementations.
Wiring:
- OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor
parameter (default null → instance-owned builder). Keeps the 3 test call
sites that construct OpcUaApplicationHost directly unchanged.
- OtOpcUaServer: requires the builder in its ctor; constructs one
CapabilityInvoker per driver at CreateMasterNodeManager time with default
Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver-
type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the
DriverInstance.ResilienceConfig JSON column for per-instance overrides.
- DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps
the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName,
...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName,
isIdempotent, ...) where isIdempotent comes from the new
_writeIdempotentByFullRef map populated at Variable() registration from
DriverAttributeInfo.WriteIdempotent.
HostName defaults to driver.DriverInstanceId for now — a single-host pipeline
per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per-
call host resolution in a follow-up so failing PLCs can trip per-PLC breakers
without poisoning siblings (decision #144).
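Approximate shape of the wrap — a fragment, not the real dispatch code; `_invoker`, `_driver`, and the idempotency map are the fields described above, with exact OPC UA stack signatures elided:

```csharp
// Fragment; not compilable standalone.
private object? ReadThroughPipeline(string fullReference, string hostName, CancellationToken ct)
    => _invoker.ExecuteAsync(
           DriverCapability.Read,
           hostName,  // driver.DriverInstanceId for now — one single-host pipeline per driver
           token => _driver.ReadAsync(fullReference, token),
           ct)
       .GetAwaiter().GetResult(); // OnReadValue is a synchronous stack hook

private void WriteThroughPipeline(string fullReference, string hostName, object? value, CancellationToken ct)
{
    // Idempotency resolved once at Variable() registration; per-write cost is one map lookup.
    bool isIdempotent = _writeIdempotentByFullRef.TryGetValue(fullReference, out var idem) && idem;
    _invoker.ExecuteWriteAsync(
            hostName,
            isIdempotent,
            token => _driver.WriteAsync(fullReference, value, token),
            ct)
        .GetAwaiter().GetResult();
}
```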
Test fixup:
- FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures:
bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter
can exceed 2s under parallel-test-run CPU pressure; the test asserts retry
behavior, not timeout budget, so the longer slack keeps it deterministic.
Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping
the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse
at the boundary. Switching the Resilience types to string removes that friction
and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in
the upcoming server-dispatch wiring PR.
- DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey
- CapabilityInvoker.ctor + _driverInstanceId field
Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now
use the distinct "drv-keep" / "drv-drop" literals (they had been distinct
Guid.NewGuid() values until the sed-driven refactor collapsed both to the same
literal — caught pre-commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143.
Default is false — writes never auto-retry unless the driver author has marked
the tag as safe to replay.
Core.Abstractions:
- DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the
positional record (back-compatible; every existing call site uses the default).
Driver.Modbus:
- ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates
documented in the param XML: holding-register set-points, configuration
registers. Unsafe: edge-triggered coils, counter-increment addresses.
- ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into
DriverAttributeInfo.WriteIdempotent.
Driver.S7:
- S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates:
DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive
edge-triggered program routines.
- S7Driver.DiscoverAsync propagates the flag.
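Sketch of the record change — only the trailing parameter is from this PR; the leading fields are placeholders for the actual DriverAttributeInfo shape:

```csharp
// Positional-record shape assumed for illustration.
public sealed record DriverAttributeInfo(
    string FullReference,
    string DisplayName,
    bool IsWritable,
    bool WriteIdempotent = false); // appended last: every existing call site keeps compiling

// Propagation during discovery (Modbus shown; S7 mirrors it):
// yield return new DriverAttributeInfo(t.FullReference, t.Name, t.Writable, t.WriteIdempotent);
```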
Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise
the invoker + flaky-driver contract the plan enumerates:
- Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10).
- Non-idempotent write with RetryCount=5 configured still fails on the first
failure — no replay (decision #44 guard at the ExecuteWriteAsync surface).
- Idempotent write with 2 transient failures succeeds on the 3rd attempt.
- Two hosts on the same driver have independent breakers — dead-host trips
its breaker but live-host's first call still succeeds.
Propagation tests:
- ModbusDriverTests: SetPoint WriteIdempotent=true flows into
DriverAttributeInfo; PulseCoil default=false.
- S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit.
Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A
so far). Pre-existing Client.CLI Subscribe flake unchanged.
Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's
OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the
server-side integration piece + needs DI wiring for the pipeline builder —
lands in the next PR on this branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One invoker per (DriverInstance, IDriver) pair; callers invoke
ExecuteAsync(capability, host, callSite) and the invoker resolves the correct pipeline from the shared
DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit
+ pipeline-invalidate takes effect without restarting the invoker or the
driver host.
ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface:
- isIdempotent=false routes through a side pipeline with RetryCount=0 regardless
of what the caller configured. The cache key carries a "::non-idempotent"
suffix so it never collides with the retry-enabled write pipeline.
- isIdempotent=true routes through the normal Write pipeline. If the user has
configured Write retries (opt-in), the idempotent tag gets them; otherwise
default-0 still wins.
The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag
definition once at driver-init time and feeds the boolean into ExecuteWriteAsync.
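A minimal sketch of the routing — the pipeline interface here is a stand-in for the real Polly-backed type; the cache-key suffix is the load-bearing detail:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Minimal stand-in for the real Polly pipeline so the routing logic is visible.
public interface IResiliencePipelineSketch
{
    Task<T> ExecuteAsync<T>(Func<CancellationToken, Task<T>> callSite, CancellationToken ct);
}

public sealed class CapabilityInvokerWriteSketch
{
    private readonly ConcurrentDictionary<string, IResiliencePipelineSketch> _cache = new();

    public Task<T> ExecuteWriteAsync<T>(
        string hostName, bool isIdempotent,
        Func<CancellationToken, Task<T>> callSite, CancellationToken ct)
    {
        // Non-idempotent writes pin RetryCount to 0 regardless of configuration;
        // the "::non-idempotent" suffix keeps the side pipeline from ever
        // colliding with the retry-enabled write pipeline in the cache.
        var key = isIdempotent ? hostName : hostName + "::non-idempotent";
        var pipeline = _cache.GetOrAdd(key, _ => BuildPipeline(forceZeroRetries: !isIdempotent));
        return pipeline.ExecuteAsync(callSite, ct);
    }

    private static IResiliencePipelineSketch BuildPipeline(bool forceZeroRetries)
        => new Passthrough(); // real code composes Timeout → Retry(0 when forced) → Breaker → Bulkhead

    private sealed class Passthrough : IResiliencePipelineSketch
    {
        public Task<T> ExecuteAsync<T>(Func<CancellationToken, Task<T>> callSite, CancellationToken ct)
            => callSite(ct);
    }
}
```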
Tests (6 new):
- Read retries on transient failure; returns value from call site.
- Write non-idempotent does NOT retry even when policy has 3 retries configured
(the explicit decision-#44 guard at the dispatch surface).
- Write idempotent retries when policy allows.
- Write with default tier-A policy (RetryCount=0) never retries regardless of
idempotency flag.
- Different hosts get independent pipelines.
Core.Tests now 44 passing (was 38). Invoker doc-refs corrected (the XML comment
on WriteIdempotentAttribute no longer references a non-existent type).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the first chunk of the Phase 6.1 Stream A resilience layer per
docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A.
Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up
PRs on the same branch.
Core.Abstractions additions:
- WriteIdempotentAttribute — marker for tag-definition records that opt into
auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45,
#143. Read once via reflection at driver-init time; no per-write cost.
- DriverCapability enum — enumerates the 8 capability surface points
(Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge
/ HistoryRead). AlarmAcknowledge is write-shaped (no retry by default).
- DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this
into DriverTypeMetadata; surfaced here because the resilience policy defaults
key on it.
Core.Resilience new namespace:
- DriverResilienceOptions — per-tier × per-capability policy defaults.
GetTierDefaults(tier) is the source of truth:
* Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5
* Tier B: Read 4s/3, Write 4s/0, breaker threshold 5
* Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles
process-level breaker per decision #68)
Resolve(capability) overlays CapabilityPolicies on top of the defaults.
- DriverResiliencePipelineBuilder — composes Timeout → Retry (capability-
permitting, never on cancellation) → CircuitBreaker (tier-permitting) →
Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on
(DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead
PLC behind a multi-device driver does not open the breaker for healthy
siblings. Invalidate(driverInstanceId) supports Admin-triggered reload.
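Sketch of the cache + scoped invalidation — shown with the string keys the Guid→string commit above settles on; DriverCapability is the enum introduced in this PR, and pipeline construction is elided to a bare Build():

```csharp
using System.Collections.Concurrent;
using Polly;

public enum DriverCapability { Read, Write, Discover, Subscribe, Probe, AlarmSubscribe, AlarmAcknowledge, HistoryRead }

// Cache shape per decision #144 — one pipeline per (instance, host, capability)
// triple, so one dead PLC behind a multi-device driver can't open the breaker
// for healthy siblings.
public sealed class PipelineCacheSketch
{
    private readonly ConcurrentDictionary<(string Instance, string Host, DriverCapability Cap),
        ResiliencePipeline> _pipelines = new();

    public ResiliencePipeline GetOrCreate(string instance, string host, DriverCapability cap)
        => _pipelines.GetOrAdd((instance, host, cap),
            _ => new ResiliencePipelineBuilder().Build()); // real builder composes
                                                           // Timeout → Retry → Breaker → Bulkhead

    public void Invalidate(string driverInstanceId)
    {
        // Scoped to one instance: an Admin options edit must not reset breaker
        // state for unrelated drivers sharing the process.
        foreach (var key in _pipelines.Keys)
            if (key.Instance == driverInstanceId)
                _pipelines.TryRemove(key, out _);
    }
}
```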
Tests (30 new, all pass):
- DriverResilienceOptionsTests: tier-default coverage for every capability,
Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker,
resolve-with-override layering.
- DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT
retry on failure (decision #44 guard), dead-host isolation from sibling hosts,
pipeline reuse for same triple, per-capability isolation, breaker opens after
threshold on Tier A, timeout fires, cancellation is not retried,
invalidation scoped to matching instance.
Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing
(baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged
by this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After shipping the four Phase 6 plan drafts (PRs 77-80), the adversarial-review
adjustments lived only as trailing "Review" sections. An implementer reading
Stream A would find the original unadjusted guidance, then have to cross-reference
the review to reconcile. This PR makes the plans genuinely executable:
1. Merges every ACCEPTed review finding into the actual Scope / Stream / Compliance
sections of each phase plan:
- phase-6-1: Scope table rewrite (per-capability retry, (instance,host) pipeline key,
MemoryTracking vs MemoryRecycle split, hybrid watchdog formula, demand-aware
wedge detector, generation-sealed LiteDB). Streams A/B/D + Compliance rewritten.
- phase-6-2: AuthorizationDecision tri-state, control/data-plane separation,
MembershipFreshnessInterval (15 min), AuthCacheMaxStaleness (5 min),
subscription stamp-and-reevaluate. Stream C widened to 11 OPC UA operations.
- phase-6-3: 8-state ServiceLevel matrix (OPC UA Part 5 §6.3.34-compliant),
two-layer peer probe (/healthz + UaHealthProbe), apply-lease via await using,
publish-generation fencing, InvalidTopology runtime state, ServerUriArray
self-first + peers. New Stream F (interop matrix + Galaxy failover).
- phase-6-4: DraftRevisionToken concurrency control, staged-import via
EquipmentImportBatch with user-scoped visibility, CSV header version marker,
decision-#117-aligned identifier columns, 1000-row diff cap,
decision-#139 OPC 40010 fields, Identification inherits Equipment ACL.
2. Appends decisions #143 through #162 to docs/v2/plan.md capturing the
architectural commitments the adjustments created. Each decision carries its
dated rationale so future readers know why the choice was made.
3. Scaffolds scripts/compliance/phase-6-{1,2,3,4}-compliance.ps1 — PowerShell
stubs with Assert-Todo / Assert-Pass / Assert-Fail helpers. Every check
maps to a Stream task ID from the corresponding phase plan. Currently all
checks are TODO and scripts exit 0; each implementation task is responsible
for replacing its TODO with a real check before closing that task. Saved
as UTF-8 with BOM so Windows PowerShell 5.1 parses em-dash characters
without breaking.
Net result: the Phase 6.1 plan is ready to execute as written. Stream A.3 can
start tomorrow without reconciling Streams vs. Review on every task; the
compliance script is wired to the Stream IDs; plan.md has the architectural
commitments that justify the Stream choices.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three substantive issues caught + fixed during the validation pass:
1. pymodbus rejects unknown keys at device-list / setup level. My PR 43 commit had `_layout_note`, `_uint16_layout`, `_bits_layout`, `_write_note` device-level JSON-comment fields that crashed pymodbus startup with `INVALID key in setup`. Removed all device-level _* fields. Inline `_quirk` keys WITHIN individual register entries are tolerated by pymodbus 3.13.0 — kept those in dl205.json since they document the byte math per quirk and the README + git history aren't enough context for a hand-author reading raw integer values. Documented the constraint in the top-level _comment of each profile.
2. pymodbus rejects sweeping `write` ranges that include any cell not assigned a type. My initial standard.json had `write: [[0, 2047]]` but only seeded HR[0..31] + HR[100] + HR[200..209] + bits[1024..1109] — pymodbus blew up on cell 32 (gap between HR[31] and HR[100]). Fixed by listing per-block write ranges that exactly mirror the seeded ranges. Same fix in dl205.json (was `[[0, 16383]]`).
3. pymodbus simulator stores all 4 standard Modbus tables in ONE underlying cell array — each cell can only be typed once (BITS or UINT16, not both). My initial standard.json had `bits[0..31]` AND `uint16[0..31]` overlapping at the same addresses; pymodbus crashed with `ERROR "uint16" <Cell> used`. Fixed by relocating coils to address 1024+, well clear of the uint16 entries at 0..209. Documented the layout constraint in the standard.json top-level _comment.
Substantive driver bug fixed: ModbusTcpTransport.ConnectAsync was using `new TcpClient()` (default constructor — dual-stack, IPv6 first) then `ConnectAsync(host, port)` with the user's hostname. .NET's TcpClient default-resolves "localhost" to ::1 first, fails to connect to pymodbus (which binds 0.0.0.0 IPv4-only), and only then retries IPv4 — the failure surfaces as the entire ConnectAsync timeout (2s by default) before the IPv4 attempt even starts. PR 30's smoke test silently SKIPPED because the fixture's TCP probe hit the same dual-stack ordering and timed out. Both fixed: ModbusSimulatorFixture probe now resolves Dns.GetHostAddresses, prefers AddressFamily.InterNetwork, dials IPv4 explicitly. ModbusTcpTransport does the same — resolves first, prefers IPv4, falls back to whatever Dns returns (handles IPv6-only hosts in the future). This is a real production-readiness fix because most Modbus PLCs are IPv4-only — a generic dual-stack TcpClient would burn the entire connect timeout against any IPv4-only PLC, masquerading as a connection failure when the PLC is actually fine.
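The connect pattern in sketch form — method name hypothetical, logic as described:

```csharp
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

static async Task<TcpClient> ConnectPreferringIPv4Async(string host, int port, CancellationToken ct)
{
    // Resolve explicitly instead of letting a default dual-stack TcpClient dial
    // ::1 first and burn the whole connect timeout against an IPv4-only device.
    IPAddress[] addresses = await Dns.GetHostAddressesAsync(host, ct);
    IPAddress target =
        addresses.FirstOrDefault(a => a.AddressFamily == AddressFamily.InterNetwork)
        ?? addresses[0]; // IPv6-only host: fall back to whatever DNS returned

    var client = new TcpClient(target.AddressFamily); // bind the right family up front
    await client.ConnectAsync(target, port, ct);
    return client;
}
```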
Smoke-test address shifted HR[100] -> HR[200]. Standard.json's HR[100] is the auto-incrementing register that drives subscribe-and-receive tests, so write-then-read against it would race the increment. HR[200] is the first cell of a writable scratch range present in BOTH simulator profiles. DL205Profile.cs xml-doc updated to explain the shift; tag name "DL205_Smoke_HReg100" -> "Smoke_HReg200" + smoke test references updated. dl205.json gains a matching scratch HR[200..209] range so the smoke test runs identically against either profile.
Validation matrix:
- standard.json boot: clean (TCP 5020 listening within ~3s of pymodbus.simulator launch).
- dl205.json boot: clean.
- pymodbus client direct FC06 to HR[200]=1234 + FC03 read: round-trip OK.
- raw-bytes PowerShell TcpClient FC06 + 12-byte response: matches FC06 spec (echo of address + value).
- DL205SmokeTest against standard.json: 1/1 pass (was failing as 'BadInternalError' due to the dual-stack timeout + tag-name typo — both fixed).
- DL205SmokeTest against dl205.json: 1/1 pass.
- Modbus.Tests Unit suite: 52/52 pass — dual-stack transport fix is non-breaking.
- Solution build clean.
Memory + future-PR setup: pymodbus install + activation pattern is now bullet-pointed at the top of Pymodbus/README.md so future PRs (the per-quirk DL205_<behavior> tests in PR 44+) don't have to repeat the trial-and-error of getting the simulator + integration tests cooperating. The three bugs above are documented inline in the JSON profiles + ModbusTcpTransport so they don't bite again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ModbusSimulatorFixture default port bumped from 502 to 5020 to match the pymodbus convention. Override via MODBUS_SIM_ENDPOINT for a real PLC on its native 502. Skip-message updated to point at the new Pymodbus\serve.ps1 wrapper instead of 'start ModbusPal'. csproj <None Update> rule swapped from ModbusPal/** to Pymodbus/** so the new JSON profiles + serve.ps1 + README copy to test-output as PreserveNewest.
standard.json — generic Modbus TCP server, slave id 1, port 5020, shared blocks=false (independent coils + HR address spaces, more textbook-PLC-like). HR[0..31] seeded with address-as-value via per-register uint16 entries, HR[100] auto-increments via the built-in increment action with parameters minval=0/maxval=65535 (drives subscribe-and-receive integration tests so they have a register that ticks without a write — pymodbus's increment ticks per-access not wall-clock, which is good enough for a 250ms-poll test), HR[200..209] scratch range left at 0 for write tests, coils 0..31 alternating, coils 100..109 scratch. write list covers 0..1023 so any test address is mutable.
dl205.json — AutomationDirect DirectLOGIC DL205/DL260 quirk simulator, slave id 1, port 5020, shared blocks=true (matches DL series memory model where coils/DI/HR overlay the same word address space). Each quirky register seeded with the pre-computed raw uint16 value documented in docs/v2/dl205.md, with an inline _quirk JSON-comment naming the behavior so future-me reading the file knows why HR[1040]=25928 means 'H' lo / 'e' hi (the user's headline string-byte-order finding). Encoded quirks: V0 marker at HR[0]=0xCAFE; V2000 at HR[1024]=0x2000; V40400 at HR[8448]=0x4040; 'Hello' string at HR[1040..1042] first-char-low-byte; Float32 1.5f at HR[1056..1057] in CDAB word order (low word first); BCD register at HR[1072]=0x1234; FC03-128-cap block at HR[1280..1407]; Y0/C0 coil markers at 2048/3072; scratch C-relays at 4000..4007.
serve.ps1 wrapper — pwsh script with a -Profile {standard|dl205} parameter switch. Validates pymodbus.simulator is on PATH (clearer message than the raw CommandNotFoundException), validates the profile JSON exists, builds the right --modbus_server/--modbus_device/--json_file/--http_port arg list, and execs pymodbus.simulator in the foreground. -HttpPort 0 disables the web UI. Foreground exec lets the operator Ctrl+C to stop without an extra control script.
README.md fully rewritten for pymodbus: install command (pip install 'pymodbus[simulator]==3.13.0' — pinned for reproducibility, [simulator] extra pulls aiohttp), per-profile reference tables, the same DL205 quirk → register table from PR 42 but adjusted for pymodbus paths, what's-NEW-vs-ModbusPal section (all four tables, raw uint16 seeding, declarative actions, custom Python action modules, headless, web UI, maintained), trade-offs section (float32-as-two-uint16s for explicit CDAB control, increment ticks per-access not wall-clock, shared-blocks mode for DL205 vs separate for Standard), file-format quick reference for hand-authoring more profiles. References pinned to the pymodbus readthedocs simulator/config + REST API pages.
docs/v2/modbus-test-plan.md harness section rewritten with the swap rationale; PR-history list updated to mark PR 42 SUPERSEDED by PR 43 and call out PR 44+ as the per-quirk implementation track. Test-conventions bullet about 'don't depend on ModbusPal state between tests' generalized to 'don't depend on simulator state' and a note added that pymodbus's REST API can reset state between facts if a test ever needs it.
DL205Profile.cs and DL205SmokeTests.cs xml-doc updated to reference pymodbus / dl205.json instead of ModbusPal / DL205.xmpp.
Functional validation deferred — Python isn't installed on this dev box (winget search returned no matches for Python.Python.3 exact). JSON parses structurally (PowerShell ConvertFrom-Json clean on both files), build clean, .json + serve.ps1 + README all copy to test-output as expected. User installs pymodbus when they want to actually run the simulator end-to-end; if pymodbus rejects the config the README's reference link to pymodbus's simulator/config schema doc is the right next stop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standard.xmpp — generic Modbus TCP server on port 502, slave id 1. HR[0..31] seeded with address-as-value (HR[5]=5 — easy mental map for diagnostics), HR[100] auto-incrementing via a 1Hz LinearGenerator binding (drives subscribe-and-receive integration tests so they have a register that actually changes without a write), HR[200..209] scratch range for write-roundtrip tests, coils 0..31 alternating on/off, coils 100..109 scratch. The Tick automation runs 0..65535 over 60s looping; bound to HR[100] via Binding_SINT16 — slow enough that a 250ms-poll integration test sees discrete jumps, fast enough that a 5s subscribe test sees several change notifications.
DL205.xmpp — AutomationDirect DirectLOGIC DL205/DL260 quirk simulator on port 502, slave id 1, modeling the behaviors documented in docs/v2/dl205.md as concrete register values so DL205 integration tests can assert each quirk WITHOUT a live PLC. Per-quirk encoding: V0 marker at HR[0]=0xCAFE proves register 0 is valid (rejects-register-0 rumour disproved); V2000 marker at HR[1024]=0x2000 proves V-memory octal-to-decimal mapping; V40400 marker at HR[8448]=0x4040 proves V40400→PDU 0x2100 (NOT register 0, contrary to the widespread shorthand); 'Hello' string at HR[1040..1042] packed first-char-low-byte (HR[1040]=0x6548 = 'H' lo + 'e' hi, HR[1041]=0x6C6C, HR[1042]=0x006F) — the headline string-byte-order quirk the user flagged; Float32 1.5f at HR[1056..1057] in CDAB word order (low word first: 0, then 0x3FC0); BCD register at HR[1072]=0x1234 representing decimal 1234 in BCD nibbles (NOT binary 0x04D2); 128-register block at HR[1280..1407] for FC03-128-cap testing; Y0 marker at coil 2048, C0 marker at coil 3072, scratch C-coils at 4000..4007 for write tests.
Critical limitation flagged inline + in README: ModbusPal 1.6b CANNOT represent the DL205 quirks semantically — it has no string binding, no BCD binding, no arbitrary-byte-layout binding (only SINT16/SINT32/FLOAT32 with word-order). So every DL205 quirk is encoded as a pre-computed raw 16-bit integer with the math worked out in inline comments above each register. Becomes unreadable past ~50 quirky registers; the README's 'alternatives' section recommends switching to pymodbus when that threshold approaches (pymodbus's ModbusSimulatorServer has first-class headless + scriptable callbacks for byte-level layouts).
Other ModbusPal 1.6b limitations called out in README: only holding_registers + coils sections in the official build (no input_registers / discrete_inputs — DL260 X-input markers can't be encoded faithfully here, FC02/FC04 tests wait for a fork or pymodbus); abandoned project (last release 1.6b, active forks at SCADA-LTS/ModbusPal, ControlThings-io/modbuspal, mrhenrike/ModbusPalEnhanced); no headless mode in the official JAR (-loadFile / -hide flags only in source-built forks); CVE-2018-10832 XXE on .xmpp import (don't import untrusted profiles — the in-repo ones are author-controlled).
README.md updated with: per-profile description tables, getting-started (download jar + java -jar + GUI File>Load>Run), MODBUS_SIM_ENDPOINT env-var override doc, two reference tables documenting which HR / coil address encodes which DL205 quirk + which test name asserts it (the same DL205_<behavior> naming convention from docs/v2/modbus-test-plan.md), 4-row alternatives comparison (pymodbus / diagslave / ModbusMechanic / ModRSsim2) for when ModbusPal can no longer carry the load, and a quick-reference XML format table at the bottom for future-me hand-authoring more profiles.
Pure documentation + test-asset PR — no code changes. The integration tests that consume these profiles (the actual DL205_<behavior> facts) land one at a time in PR 43+ as user validates each quirk via ModbusPal on the bench.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user explicitly flagged that DL205/DL260 strings don't follow Modbus convention; research turned up that and a lot more. Headline findings:
String packing — TWO chars per V-memory register but the FIRST char is in the LOW byte (opposite of the big-endian Modbus convention generic drivers default to). 'Hello' in V2000 reads back as 'eHll\0o' on a textbook decoder. Kepware's DirectLogic driver exposes a per-tag 'String Byte Order = Low/High' toggle specifically for this; we'll need the same. Null-terminated, no length prefix, no dedicated KSTR address space — strings live wherever ladder allocates them in V-memory.
V-memory addressing — DirectLOGIC's native V-memory is OCTAL (V2000, V40400) but Modbus is decimal. The CPU translates: V2000 octal = decimal 1024 = Modbus PDU 0x0400. The widespread 'V40400 = register 0' shorthand is wrong on modern firmware (that was DL05/DL06 relative mode); on H2-ECOM100 absolute mode (factory default) V40400 = PDU 0x2100. We'd surface this with an address-format helper in the device profile so operators write V2000 instead of computing 1024 by hand.
Word order CDAB for all 32-bit values — DL205 and DL260 agree, ECOM modules don't re-swap. Already supported via ModbusByteOrder.WordSwap; just needs to be the default in the DL205 profile.
BCD-as-default numeric storage — the one I didn't expect. DirectLOGIC stores 'V2000 = 1234' as 0x1234 on the wire (BCD nibbles), not as 0x04D2 (decimal 1234). IEEE 754 Float32 only works when ladder used the explicit R type (LDR/OUTR instructions). We need a new decoder mode for BCD-encoded registers — current code assumes binary integers.
FC quantity caps — FC03/04 cap at 128 (above spec's 125 — bonus territory; current code already respects 125), FC16 caps at 100 (BELOW spec's 123 — an important bulk-write batching gotcha). Quantity overrun returns exception 03 IllegalDataValue.
Coil/discrete mappings — DL260: X0->discrete input 0, Y0->coil 2048, C0->coil 3072. SP specials at discrete input 1024-1535 RO. These are CPU-wired constants and cannot be remapped; need to be hardcoded in the DL205/DL260 device profile.
Register 0 — accepted on DL205/DL260 with ECOM in absolute mode, contrary to the widespread internet claim that 'DirectLOGIC rejects register 0'. That rumour was an older DL05/DL06 relative-mode artefact. Our ModbusProbeOptions.ProbeAddress default of 0 is therefore safe for DL205/DL260.
Exception codes — only the standard 01-04. Write-to-protected-bit returns 02 on newer firmware, 04 on older (firmware-transition revision unconfirmed); driver should map both to BadNotWritable. No proprietary exception codes.
Behavioral oddities — H2-ECOM100 accepts MAX 4 simultaneous TCP connections (5th refused at TCP accept). No TCP keepalive (intermediate NAT/firewall drops idle sockets after 2-5 min — periodic probe required). No mid-stream resync on malformed MBAP — driver must reconnect + replay. TxId-drop-under-load forum rumour is unconfirmed; our single-flight + TxId-match guard handles it either way.
Each H2 section ends with the integration-test names we'd ship per the modbus-test-plan.md DL205_<behavior> convention — twelve named test slots ready for PR 42+ to fill in one at a time. References (8) cited inline, primarily D2-USER-M, HA-ECOM-M, and the Kepware DirectLogic Ethernet driver manual which documents these vendor quirks explicitly because they have to cope with them.
modbus-test-plan.md DL205 section rewritten as a priority-ordered table with three columns (quirk / driver impact / test name), pointing the reader at dl205.md for the full reference. Operator-reported items separated into a tail subsection so future-me knows which behaviors are documented vs reproduced-on-hardware.
Pure documentation PR — no code changes. The actual driver work (string-byte-order option, BCD decoder mode, V-memory address helper, FC16 cap-per-device-family, multi-client TCP handling) lands one PR per quirk in PR 42+ as ModbusPal validation completes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new facts target DelmiaReceiver_001.TestAttribute — the writable Boolean UDA on the TestMachine_001 hierarchy in this dev Galaxy. The user nominated TestMachine_001 (the deployed test-target object) as a scratch surface for live testing; ZB query showed DelmiaReceiver_001 carries one dynamic_attribute named TestAttribute (mx_data_type=1=Boolean, lock_type=0=writable, security_classification=1=Operate). Naming makes the intent obvious — the attribute exists for exactly this kind of integration testing — and Boolean keeps the assertions simple (invert, write, read back).
Write_then_read_roundtrips_a_writable_Boolean_attribute_on_TestMachine_001: reads the current value as the baseline (Galaxy may return Uncertain quality until the Engine has scanned the attribute at least once — we don't read into a typed bool until Status is Good), inverts it, writes via IWritable, then polls reads in a 5s loop until either the new value comes back or the budget expires. The scan-window poll (rather than a single read after a fixed delay) accommodates Galaxy's variable scan latency on a fresh service start. Restore-on-finally writes the original value back so re-running the test doesn't accumulate a flipped TestAttribute on the dev box (Galaxy holds UDA values across runs since they're deployed). Best-effort restore — swallows exceptions so a failure in restore doesn't mask the primary assertion.
Subscribe_fires_OnDataChange_with_initial_value_then_again_after_a_write: subscribes to the same attribute with a 250ms publishing interval, captures every OnDataChange notification onto a thread-safe ConcurrentQueue (MXAccess advisory fires on its own thread per Galaxy's COM apartment model — must not block it), waits up to 5s for the initial-value callback (per ISubscribable's contract: 'driver MAY fire OnDataChange immediately with the current value'), records the queue depth as a baseline, writes the toggled value, waits up to 8s for at least one MORE notification, then searches the queue tail for the notification carrying the toggled value (initial value may appear multiple times before the write commits — looking at the tail finds the post-write delta even if the queue grew during the wait window). Unsubscribes on finally + restores baseline.
Both tests use Convert.ToBoolean(value ?? false) to defensively handle the boxed-vs-typed quirk in MessagePack-deserialized Galaxy values — depending on the wire encoding the Boolean might come back as System.Boolean or System.Object boxing one. Convert.ToBoolean handles both. Same pattern in OnReadValue's existing usage.
WaitForAsync helper does the loop+budget pattern shared by both tests.
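Likely shape of that helper (signature assumed):

```csharp
// Poll `probe` until it returns true or the budget expires; the poll interval
// keeps pressure off the Galaxy scan engine while staying responsive.
static async Task<bool> WaitForAsync(
    Func<Task<bool>> probe, TimeSpan budget, TimeSpan? interval = null)
{
    var poll = interval ?? TimeSpan.FromMilliseconds(250);
    var deadline = DateTime.UtcNow + budget;
    while (DateTime.UtcNow < deadline)
    {
        if (await probe()) return true;
        await Task.Delay(poll);
    }
    return false;
}
```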
PR 40 is the code side of LMX #5's final two deferred facts. To actually run them green requires re-executing from a normal (non-admin) PowerShell — the elevated-shell skip from PR 39 fires correctly under bash + sc.exe-context (verified). lmx-followups.md #5 updated to note the new facts + the run command + the one remaining genuine follow-up (alarm-condition fact when an alarm-flagged attribute is deployed on TestMachine_001).
Test posture from elevated bash: 7 LiveStackSmokeTests facts discovered (was 5; +2 new), all skip cleanly with the elevation message. Build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 39 closes the gap. New IsElevatedAdministratorOnWindows static helper (Windows-only via RuntimeInformation.IsOSPlatform; non-Windows hosts return false and let the prerequisite probe own the skip-with-reason path) checks WindowsPrincipal.IsInRole(WindowsBuiltInRole.Administrator) on the current process token. When true, InitializeAsync short-circuits to a SkipReason that names the cause directly: 'elevated token's Admins group membership trumps the allow rule — re-run from a NORMAL (non-admin) PowerShell window'. Catches and swallows any probe-side exception so a Win32 oddity can't crash the test fixture; failed probe falls through to the regular prerequisite path.
The check fires BEFORE AvevaPrerequisites.CheckAllAsync runs because the prereq probe's own pipe connect hits the same admin-deny and surfaces UnauthorizedAccessException with no context. Short-circuiting earlier saves the 10-second probe + produces a single actionable line.
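The helper is roughly this shape — a sketch matching the behavior described, not necessarily the exact code:

```csharp
using System.Runtime.InteropServices;
using System.Security.Principal;

// Windows-only elevation probe; any probe-side failure deliberately reads as
// "not elevated" so a Win32 oddity can't crash the fixture.
internal static bool IsElevatedAdministratorOnWindows()
{
    if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        return false; // non-Windows: let the prerequisite probe own the skip path

    try
    {
        using var identity = WindowsIdentity.GetCurrent();
        return new WindowsPrincipal(identity).IsInRole(WindowsBuiltInRole.Administrator);
    }
    catch
    {
        return false; // fall through to the regular prerequisite path
    }
}
```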
Tests — verified manually from an elevated bash session against the just-installed OtOpcUaGalaxyHost service: skip message reads 'Test host is running with elevated (Administrators) privileges, but the OtOpcUaGalaxyHost named-pipe ACL explicitly denies Administrators per the IPC security design (decision #76 / PipeAcl.cs). Re-run from a NORMAL (non-admin) PowerShell window — even when your user is already in the pipe's allow list, the elevated token's Admins group membership trumps the allow rule.' Proxy.Tests Unit: 17 pass / 0 fail (unchanged — fixture change is non-breaking; existing tests don't run as admin in normal CI flow). Build clean.
Bonus: gitignored .local/ directory (a previous direct commit on local v2 that I'm now landing here) so per-install secrets like the Galaxy.Host shared-secret file don't leak into the repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-kind override shape: each hook receives the pre-filtered nodesToProcess list (NodeHandles for nodes this manager claimed), iterates them, resolves handle.NodeId.Identifier to the driver-side full reference string, and dispatches to the right IHistoryProvider method. Write back into the outer results + errors slots at handle.Index (not the local loop counter — nodesToProcess is a filtered subset of nodesToRead, so indexing by the loop counter lands in the wrong slot for mixed-manager batches). WriteResult helper sets both results[i] AND errors[i]; this matters because MasterNodeManager merges them and leaving errors[i] at its default (BadHistoryOperationUnsupported) overrides a Good result with Unsupported on the wire — this was the subtle failure mode that masked a correctly-constructed HistoryData response during debugging. Failure-isolation per node: NotSupportedException from a driver that doesn't implement a particular HistoryProvider method translates to BadHistoryOperationUnsupported in that slot; generic exceptions log and surface BadInternalError; unresolvable NodeIds get BadNodeIdUnknown. The batch continues unconditionally.
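The slot discipline in sketch form — a fragment with stack types abbreviated; OpcHistoryReadResult is the alias described below:

```csharp
// Setting BOTH arrays matters: errors[i] left at its
// BadHistoryOperationUnsupported default overrides a Good result in
// MasterNodeManager's merge.
void WriteResult(IList<OpcHistoryReadResult?> results, IList<ServiceResult?> errors,
                 int index, OpcHistoryReadResult? result, ServiceResult status)
{
    results[index] = result;
    errors[index]  = status;
}

// Per-node dispatch indexes by handle.Index — the node's slot in the ORIGINAL
// request — because nodesToProcess is a filtered subset of nodesToRead.
foreach (var handle in nodesToProcess)
{
    try { /* resolve NodeId → full reference, dispatch to IHistoryProvider */ }
    catch (NotSupportedException)
    { WriteResult(results, errors, handle.Index, null, new ServiceResult(StatusCodes.BadHistoryOperationUnsupported)); }
    catch (Exception ex)
    { /* log ex */ WriteResult(results, errors, handle.Index, null, new ServiceResult(StatusCodes.BadInternalError)); }
}
```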
Aggregate mapping: MapAggregate translates ObjectIds.AggregateFunction_Average / Minimum / Maximum / Total / Count to the driver's HistoryAggregateType enum. Null for anything else (e.g. TimeAverage, Interpolative) so the handler surfaces BadAggregateNotSupported at the batch level — per Part 13, one unsupported aggregate means the whole request fails since ReadProcessedDetails carries one aggregate list for all nodes. BuildHistoryData wraps driver DataValueSnapshots as Opc.Ua.HistoryData in an ExtensionObject; BuildHistoryEvent wraps HistoricalEvents as Opc.Ua.HistoryEvent with the canonical BaseEventType field list (EventId, SourceName, Message, Severity, Time, ReceiveTime — the order OPC UA clients that didn't customize the SelectClause expect). ToDataValue preserves null SourceTimestamp (Galaxy historian rows often carry only ServerTimestamp) — synthesizing a SourceTimestamp would lie about actual sample time.
Two address-space changes were required to make the stack dispatch reach the per-kind hooks at all: (1) historized variables get AccessLevels.HistoryRead added to their AccessLevel byte — the base's early-gate check on (variable.AccessLevel & HistoryRead != 0) was rejecting requests before our override ever ran; (2) the driver-root folder gets EventNotifiers.HistoryRead | SubscribeToEvents so HistoryReadEvents can target it (the conventional pattern for alarm-history browse against a driver-owned object). The 'set both bits' requirement is documented inline since it's not obvious from the surface API.
OpcHistoryReadResult alias: Opc.Ua.HistoryReadResult (service-layer per-node result) collides with Core.Abstractions.HistoryReadResult (driver-side samples + continuation point) by type name; the alias 'using OpcHistoryReadResult = Opc.Ua.HistoryReadResult' keeps the override signatures unambiguous and the test project applies the mirror pattern for its stub driver impl.
Tests — DriverNodeManagerHistoryMappingTests (12 new Category=Unit cases): MapAggregate translates each supported aggregate NodeId via reflection-backed theory (guards against the stack renaming AggregateFunction_* constants); returns null for unsupported NodeIds (TimeAverage) and null input; BuildHistoryData wraps samples with correct DataValues + SourceTimestamp preservation; BuildHistoryEvent emits the 6-element BaseEventType field list in canonical order (regression guard for a future 'respect the client's SelectClauses' change); null SourceName / Message translate to empty-string Variants (nullable-Variant refactor trap); ToDataValue preserves StatusCode + both timestamps; ToDataValue leaves SourceTimestamp at default when the snapshot omits it. HistoryReadIntegrationTests (5 new Category=Integration): drives a real OPC UA client Session.HistoryRead against a fake HistoryDriver through the running server. Covers raw round-trip (verifies per-node DataValue ordering + values); processed with Average aggregate (captures the driver's received aggregate + interval, asserting MapAggregate routed correctly); unsupported aggregate (TimeAverage → BadAggregateNotSupported); at-time (forwards the per-timestamp list); events (BaseEventType field list shape, SelectClauses populated to satisfy the stack's filter validator). Server.Tests Unit: 55 pass / 0 fail (43 prior + 12 new mapping). Server.Tests Integration: 14 pass / 0 fail (9 prior + 5 new history). Full solution build clean, 0 errors.
lmx-followups.md #1 updated to 'DONE (PRs 35 + 38)' with two explicit deferred items: continuation-point plumbing (driver returns null today so pass-through is fine) and per-SelectClause evaluation in HistoryReadEvents (clients with custom field selections get the canonical BaseEventType layout today).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LiveStackConfig resolves the pipe name + per-install shared secret from two sources in order: OTOPCUA_GALAXY_PIPE + OTOPCUA_GALAXY_SECRET env vars first (for CI / benchwork overrides), then the service's per-process Environment registry values under HKLM\SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost (what Install-Services.ps1 writes at install time). Registry read requires the test host to run elevated on most boxes — the skip message says so explicitly so operators see the right remediation. Hard-coded secrets are deliberately avoided: the installer generates 32 fresh random bytes per install, a committed secret would diverge from production the moment the service is re-installed.
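Sketch of the resolution order — the REG_MULTI_SZ 'Environment' value layout is the standard Windows per-service convention, an assumption here, as is the parsing:

```csharp
using System;
using System.Linq;
using Microsoft.Win32;

static string? ResolveSecret()
{
    // Env var first — CI / benchwork override.
    var fromEnv = Environment.GetEnvironmentVariable("OTOPCUA_GALAXY_SECRET");
    if (!string.IsNullOrEmpty(fromEnv)) return fromEnv;

    // Then the service's per-process Environment block (NAME=value lines).
    // Reading HKLM usually needs elevation — which is why the skip message
    // calls it out.
    using var key = Registry.LocalMachine.OpenSubKey(
        @"SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost");
    if (key?.GetValue("Environment") is not string[] lines) return null;

    return lines
        .Where(l => l.StartsWith("OTOPCUA_GALAXY_SECRET=", StringComparison.Ordinal))
        .Select(l => l.Split('=', 2)[1])
        .FirstOrDefault();
}
```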
LiveStackFixture is an IAsyncLifetime that (1) runs AvevaPrerequisites.CheckAllAsync with CheckGalaxyHostPipe=true + CheckHistorian=false — produces a structured PrerequisiteReport whose SkipReason is the exact operator-facing 'here's what you need to fix' text, (2) resolves LiveStackConfig and surfaces a clear skip when the secret isn't discoverable, (3) instantiates GalaxyProxyDriver + calls InitializeAsync (the IPC handshake), capturing a skip with the exception detail + common-cause hints (secret mismatch, SID not in pipe ACL, Host's backend couldn't connect to ZB) rather than letting a NullRef cascade through every subsequent test. SkipIfUnavailable() translates the captured SkipReason into Assert.Skip at the top of every fact so tests read as cleanly-skipped with a visible reason, not silently-passed or crashed.
LiveStackSmokeTests (5 facts, Collection=LiveStack, Category=LiveGalaxy): Fixture_initialized_successfully (cheapest possible end-to-end assertion — if this passes, the IPC handshake worked); Driver_reports_Healthy_after_IPC_handshake (DriverHealth.State post-connect); DiscoverAsync_returns_at_least_one_variable_from_live_galaxy (captures every Variable() call from DiscoverAsync via CapturingAddressSpaceBuilder and asserts > 0 — zero here usually means the Host couldn't read ZB, the skip message names OTOPCUA_GALAXY_ZB_CONN to check); GetHostStatuses_reports_at_least_one_platform (IHostConnectivityProbe surface — zero means the probe loop hasn't fired or no Platform is deployed locally); Can_read_a_discovered_variable_from_live_galaxy (reads the first discovered attribute's full reference, asserts status != BadInternalError — Galaxy's Uncertain-quality-until-first-Engine-scan is intentionally NOT treated as failure since it depends on runtime state that varies across test runs). Read-only by design; writes need an agreed scratch tag to avoid mutating a process-critical attribute — deferred to a follow-up PR that reuses this fixture.
CapturingAddressSpaceBuilder is a minimal IAddressSpaceBuilder that flattens every Variable() call into a list so tests can inspect what discovery produced without booting the full OPC UA node-manager stack; alarm annotation + property calls are no-ops. Scoped private to the test class.
Galaxy.Proxy.Tests csproj gains a ProjectReference to Driver.Galaxy.TestSupport (PR 36) for AvevaPrerequisites. The NU1702 warning about the Host project being net48-referenced-by-net10 is pre-existing from the HostSubprocessParityTests — Proxy.Tests only needs the Host EXE path for that parity scenario, not type surface.
Test run on THIS machine (OtOpcUaGalaxyHost not yet installed): Skipped! Failed 0, Passed 0, Skipped 5 — each skip message includes the full prerequisites report pointing at the missing service. Once the service is installed + started (scripts\install\Install-Services.ps1), the 5 facts will execute against live Galaxy. Proxy.Tests Unit: 17 pass / 0 fail (unchanged — new tests are Category=LiveGalaxy, separate suite). Full Proxy build clean. Memory already captures the 'live tests run via already-running service, don't spawn' convention (project_galaxy_host_service.md).
lmx-followups.md #5 updated: status is 'IN PROGRESS' across PRs 36 + 37 with the explicit remaining work (install + start services, subscribe-and-receive, write round-trip).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AvevaPrerequisites.CheckAllAsync walks eight probe categories producing PrerequisiteCheck rows each with Name (e.g. 'service:aaBootstrap', 'sql:ZB', 'com:LMXProxy', 'registry:ArchestrA.Framework'), Category (AvevaCoreService / AvevaSoftService / AvevaInstall / MxAccessCom / GalaxyRepository / AvevaHistorian / OtOpcUaService / Environment), Status (Pass / Warn / Fail / Skip), and operator-facing Detail message. Report aggregates them: IsLivetestReady (no Fails anywhere) and IsAvevaSideReady (AVEVA-side categories pass, our v2 services can be absent while still considering the environment AVEVA-ready) so different test tiers can use the right threshold.
Individual probes:
- ServiceProbe.Check queries the Windows Service Control Manager via System.ServiceProcess.ServiceController — treats DemandStart+Stopped as Warn (NmxSvc is DemandStart by design; master pulls it up) but AutoStart+Stopped as Fail; not-installed is Fail for hard-required services, Warn for soft ones; non-Windows hosts get Skip; transitional states like StartPending get Warn with a 'try again' hint.
- RegistryProbe reads HKLM\SOFTWARE\WOW6432Node\ArchestrA\{Framework,Framework\Platform,MSIInstall} — Framework key presence + populated InstallPath/RootPath values mean System Platform is installed; PfeConfigOptions in the Platform subkey (format 'PlatformId=N,EngineId=N,...') indicates a Platform has been deployed from the IDE (PlatformId=0 means never deployed — MXAccess will connect but every subscription will be Bad quality); RebootRequired='True' under MSIInstall surfaces as a loud Warn since post-patch behavior is undefined.
- MxAccessComProbe resolves the LMXProxy.LMXProxyServer ProgID → CLSID → HKLM\SOFTWARE\Classes\WOW6432Node\CLSID\{guid}\InprocServer32, verifying the registered file exists on disk (catches the orphan-registry case where a previous uninstall left the ProgID registered but the DLL is gone — distinguished from the 'totally not installed' case by message). Also emits a Warn when the test process is 64-bit: MXAccess COM activation fails with REGDB_E_CLASSNOTREG 0x80040154 regardless of registration, so the warning tells operators why activation would fail even on a fully-installed machine.
- SqlProbe tests the Galaxy Repository via Microsoft.Data.SqlClient using the Windows-auth localhost connection string the repo code defaults to — distinguishes 'SQL Server unreachable' (connection fails) from 'ZB database does not exist' (SELECT DB_ID('ZB') returns null) because they have different remediation paths (sc.exe start MSSQLSERVER vs. restore from .cab backup). A secondary CheckDeployedObjectCountAsync query on 'gobject WHERE deployed_version > 0' warns when the count is zero because discovery smoke tests would return empty hierarchies.
- NamedPipeProbe opens a 2s NamedPipeClientStream against OtOpcUaGalaxyHost's pipe ('OtOpcUaGalaxy' per the installer default) — the pipe accepting a connection proves the Host service is listening; disconnects immediately so we don't consume a session slot.
Service lists kept as internal static data so tests can inspect + override:
- CoreServices (aaBootstrap + aaGR + NmxSvc + MSSQLSERVER) — hard Fail if missing.
- SoftServices (aaLogger + aaUserValidator + aaGlobalDataCacheMonitorSvr) — Warn only; the stack runs without them but diagnostics/auth are degraded.
- HistorianServices (aahClientAccessPoint + aahGateway) — opt-in via Options.CheckHistorian; only matters for HistoryRead IPC paths.
- OtOpcUaServices — OtOpcUaGalaxyHost hard-required for end-to-end live tests, OtOpcUa + GLAuth Warn.
Narrower entry points CheckRepositoryOnlyAsync and CheckGalaxyHostPipeOnlyAsync serve tests that only care about specific subsystems — avoids paying the full probe cost on every GalaxyRepositoryLiveSmokeTests fact.
Multi-targeting mechanics: System.ServiceProcess.ServiceController + Microsoft.Win32.Registry are NuGet packages on net10 but in-box BCL references on net48; csproj conditions Package vs Reference by TargetFramework. Microsoft.Data.SqlClient v6 supports both frameworks so single PackageReference. Net48Polyfills.cs provides IsExternalInit shim (records/init-only setters) and SupportedOSPlatformAttribute stub so the same Probe sources compile on both frameworks without per-callsite preprocessor guards — lets Roslyn's platform-compatibility analyzer stay useful on net10 without breaking net48 builds.
Existing GalaxyRepositoryLiveSmokeTests updated to delegate its skip decision to AvevaPrerequisites.CheckRepositoryOnlyAsync (legacy ZbReachableAsync kept as a compatibility adapter so the in-test 'if (!await ZbReachableAsync()) return;' pattern keeps working while the surrounding fixtures gradually migrate to Assert.Skip-with-reason). Slnx file registers the new project.
Tests — AvevaPrerequisitesLiveTests (8 new Integration cases, Category=LiveGalaxy): the helper correctly reports Framework install (registry pass), aaBootstrap Running (service pass), aaGR Running (service pass), MxAccess COM registered (com pass), ZB database reachable (sql pass), deployed-object count > 0 (warn-upgraded-to-pass because this box has 49 objects deployed), the AVEVA side is ready even when our own services (OtOpcUaGalaxyHost) aren't installed yet (IsAvevaSideReady=true), and the helper emits rows for OtOpcUaGalaxyHost + OtOpcUa + GLAuth even when not installed (regression guard — nobody can accidentally ship a check that omits our own services). Full Galaxy.Host.Tests Category=LiveGalaxy suite: 13 pass (5 prior smoke + 8 new prerequisites). Full solution build clean, 0 errors.
What's NOT in this PR: end-to-end Galaxy stack smoke (Proxy → Host pipe → MXAccess → real Galaxy tag). That's the next PR — this one is the gate the end-to-end smoke will call first to produce actionable skip messages instead of silent returns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Interface additions use C# default interface implementations that throw NotSupportedException — existing IHistoryProvider implementations keep compiling, only drivers whose backend carries the relevant capability override. This matches the 'capabilities are optional per driver' design already used by IHistoryProvider.ReadProcessedAsync's docs (Modbus / OPC UA Client drivers never had an event historian and the default-throw path lets callers see BadHistoryOperationUnsupported naturally). New HistoricalEvent record models one historian row (EventId, SourceName, EventTimeUtc + ReceivedTimeUtc — process vs historian-persist timestamps, Message, Severity mapped to OPC UA's 1-1000 range); HistoricalEventsResult pairs the event list with a continuation-point token for future batching. Both live in Core.Abstractions so downstream (Proxy, Host, Server) reference a single domain shape — no Shared-contract leak into the driver-facing interface.
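Shape of the additions, with illustrative parameter lists (the real signatures live in Core.Abstractions):

```csharp
// Default bodies keep every existing IHistoryProvider implementation
// compiling; only drivers whose backend carries the capability override.
public interface IHistoryProvider
{
    // ...existing raw/processed members elided...

    Task<HistoryReadResult> ReadAtTimeAsync(
        string fullReference, DateTime[] timestampsUtc, CancellationToken ct)
        => throw new NotSupportedException("HistoryReadAtTime not supported by this driver.");

    Task<HistoricalEventsResult> ReadEventsAsync(
        string fullReference, DateTime startUtc, DateTime endUtc, CancellationToken ct)
        => throw new NotSupportedException("HistoryReadEvents not supported by this driver.");
}
```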
GalaxyProxyDriver.ReadAtTimeAsync maps the domain DateTime[] to Unix-ms longs, calls CallAsync on the existing MessageKind.HistoryReadAtTimeRequest, and trusts the Host's one-sample-per-requested-timestamp contract (the Host pads with bad-quality snapshots for timestamps it can't interpolate; re-aligning on the Proxy side would duplicate the Host's interpolation policy logic). ReadEventsAsync does the same for HistoryReadEventsRequest; ToHistoricalEvent translates GalaxyHistoricalEvent (MessagePack-annotated, Unix-ms) to the domain record, explicitly tagging DateTimeKind.Utc on both timestamp fields so downstream serializers (JSON, OPC UA types) don't apply an unexpected local-time offset.
Tests — HistoricalEventMappingTests (3 new Proxy.Tests unit cases): every field maps correctly from wire to domain; null SourceName and null DisplayText preserve through the mapping (system events without a source come out with null so callers can distinguish them from alarm events); both timestamps come out as DateTimeKind.Utc (regression guard against a future refactor using DateTime.FromFileTimeUtc or similar that defaults to Unspecified). Driver.Galaxy.Proxy.Tests Unit suite: 17 pass / 0 fail (14 prior + 3 new). Full solution build clean, 0 errors.
Scope exclusions — DriverNodeManager HistoryRead service-handler wiring (on the OPC UA Server side, where HistoryReadAtTime and HistoryReadEvents service requests land) and the full-loop integration test (OPC UA client → server → IPC → Host → HistorianDataSource → back) are deferred to a focused follow-up PR. The capability surface is the load-bearing change; wiring the service handlers is mechanical in comparison and worth its own PR for reviewability. docs/v2/lmx-followups.md #1 updated with the split.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Polling chosen over event-driven for initial scope: simpler, matches Admin UI consumer cadence, avoids DriverHost lifecycle-event plumbing that doesn't exist today. Event-driven push for sub-heartbeat latency is a straightforward follow-up.
- Admin.Services.HostStatusService left-joins DriverHostStatus against ClusterNode on NodeId so rows still render even when the ClusterNode entry doesn't exist yet (first-boot bootstrap case). StaleThreshold = 30s — covers one missed publisher heartbeat plus a generous buffer for clock skew and GC pauses.
- Admin Components/Pages/Hosts.razor — FleetAdmin-visible page grouped by cluster (handles the '(unassigned)' case for rows without a matching ClusterNode). Four summary cards (Hosts / Running / Stale / Faulted); per-cluster table with Node / Driver / Host / State + Stale-badge / Last-transition / Last-seen / Detail columns; 10s auto-refresh via the IServiceScopeFactory timer pattern matching FleetStatusPoller + the Fleet dashboard (PR 27). Row-class highlighting: Faulted → table-danger, Stale → table-warning, else default. The State badge maps the DriverHostState enum to Bootstrap color classes. Sidebar link added between 'Fleet status' and 'Clusters'.
Server csproj adds Microsoft.EntityFrameworkCore.SqlServer 10.0.0 + registers OtOpcUaConfigDbContext in Program.cs scoped via NodeOptions.ConfigDbConnectionString (no Admin-style manual SQL raw — the DbContext is the only access path, keeps migrations owner-of-record).
Tests — HostStatusPublisherTests (4 new Integration cases, uses per-run throwaway DB matching the FleetStatusPollerTests pattern): publisher upserts one row per host from each probe-capable driver and skips non-probe drivers; second tick advances LastSeenUtc without creating duplicate rows (upsert pattern verified end-to-end); state change between ticks updates State AND StateChangedUtc (datetime2(3) rounds to millisecond precision so comparison uses 1ms tolerance — documented inline); MapState translates every HostState enum member. Server.Tests Integration: 4 new tests pass. Admin build clean, Admin.Tests Unit still 23 / 0. docs/v2/lmx-followups.md item #7 marked DONE with three explicit deferred items (event-driven push, failure-count column, SignalR fan-out).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DriverHostState enum lives in Configuration.Enums/ rather than reusing Core.Abstractions.HostState so the Configuration project stays free of driver-runtime dependencies (it's referenced by both the Admin process and the Server process, so pulling in the driver-abstractions assembly to every Admin build would be unnecessary weight). The server-side publisher hosted service (follow-up PR 34) will translate HostStatusChangedEventArgs.NewState to this enum on every transition.
No foreign key to ClusterNode — a Server may start reporting host status before its ClusterNode row exists (first-boot bootstrap), and we'd rather keep the status row than drop it. The Admin-side service that renders the dashboard will left-join on NodeId when presenting. Two indexes declared: IX_DriverHostStatus_Node drives the per-cluster drill-down (Admin UI joins ClusterNode on ClusterId to pick which NodeIds to fetch), IX_DriverHostStatus_LastSeen drives the stale-row query (now - LastSeen > threshold).
EF migration AddDriverHostStatus creates the table + PK + both indexes. Model snapshot updated. SchemaComplianceTests expected-tables list extended. DriverHostStatusTests (3 new cases, category SchemaCompliance, uses the shared fixture DB): composite key allows same (host, driver) across different nodes AND same (node, host) across different drivers — both real-world cases the publisher needs to support; upsert-in-place pattern (fetch-by-composite-PK, mutate, save) produces one row not two — the pattern the publisher will use; State enum persists as string not int — reading the DB via ADO.NET returns 'Faulted' not '3'.
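The upsert-in-place pattern those tests pin down, sketched; the composite-key order and the `newState` plumbing are assumptions:

```csharp
// Fetch by composite PK; mutate in place; save — one row, never two.
var row = await db.DriverHostStatus.FindAsync(
    new object[] { nodeId, hostName, driverType }, ct);
if (row is null)
{
    row = new DriverHostStatus { NodeId = nodeId, HostName = hostName, DriverType = driverType };
    db.DriverHostStatus.Add(row);
}
if (row.State != newState)
{
    row.State = newState;                  // persists as the enum name, not an int
    row.StateChangedUtc = DateTime.UtcNow; // only moves on an actual transition
}
row.LastSeenUtc = DateTime.UtcNow;         // advances every tick
await db.SaveChangesAsync(ct);
```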
Configuration.Tests SchemaCompliance suite: 10 pass / 0 fail (7 prior + 3 new). Configuration build clean. No Server or Admin code changes yet — publisher + /hosts page are PR 34.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No production code changes — pure additive test. Server.Tests Integration: 3 new tests pass; existing OpcUaServerIntegrationTests stays green (single-driver case still exercised there). Full Server.Tests Unit still 43 / 0. Deferred: multi-driver alarm-event case (two drivers each raising a GalaxyAlarmEvent, assert each condition lands on its owning instance's condition node) — needs a stub IAlarmSource and is worth its own focused PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Active Directory compatibility. LdapOptions xml-doc expanded with a cheat-sheet covering Server (DC FQDN), Port 389 vs 636, UseTls=true under AD LDAP-signing enforcement, dedicated read-only service account DN, sAMAccountName vs userPrincipalName vs cn trade-offs, memberOf DN shape (CN=Group,OU=...,DC=... with the CN= RDN stripped to become the GroupToRole key), and the explicit 'nested groups NOT expanded' call-out (LDAP_MATCHING_RULE_IN_CHAIN / tokenGroups is a future authenticator enhancement, not a config change). docs/security.md §'Active Directory configuration' adds a complete appsettings.json snippet with realistic AD group names (OPCUA-Operators → WriteOperate, OPCUA-Engineers → WriteConfigure, OPCUA-AlarmAck → AlarmAck, OPCUA-Tuners → WriteTune), LDAPS port 636, TLS on, insecure-LDAP off, and operator-facing notes on each field. LdapUserAuthenticatorAdCompatTests (5 unit guards): ExtractFirstRdnValue parses AD-style 'CN=OPCUA-Operators,OU=...,DC=...' DNs correctly (case-preserving — operators' GroupToRole keys stay readable); also handles mixed case and spaces in group names ('Domain Users'); also works against the OpenLDAP ou=<group>,ou=groups shape (GLAuth) so one extractor tolerates both memberOf formats common in the field; EscapeLdapFilter escapes the RFC 4515 injection set (\, *, (, ), \0) so a malicious login like 'admin)(cn=*' can't break out of the filter; default UserNameAttribute regression guard.
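A sketch of the two helpers as the tests describe them; the class name and exact bodies are illustrative, not the shipped code:

```csharp
using System.Text;

internal static class LdapNameHelpers
{
    // "CN=OPCUA-Operators,OU=Groups,DC=corp,DC=example,DC=com" -> "OPCUA-Operators".
    // Case-preserving, tolerates "ou=opcua-operators,ou=groups" (GLAuth) too.
    // Sketch only: a memberOf value with an escaped comma in the RDN would need real DN parsing.
    public static string ExtractFirstRdnValue(string dn)
    {
        var firstRdn = dn.Split(',')[0];
        var eq = firstRdn.IndexOf('=');
        return (eq >= 0 ? firstRdn[(eq + 1)..] : firstRdn).Trim();
    }

    // RFC 4515 escape set — stops a login like "admin)(cn=*" breaking out of the filter.
    public static string EscapeLdapFilter(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (var c in value)
            sb.Append(c switch
            {
                '\\' => @"\5c",
                '*'  => @"\2a",
                '('  => @"\28",
                ')'  => @"\29",
                '\0' => @"\00",
                _    => c.ToString(),
            });
        return sb.ToString();
    }
}
```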
Test posture — Server.Tests Unit: 43 pass / 0 fail (38 prior + 5 new AD-compat guards). Server.Tests LiveLdap category: 6 pass / 0 fail against running GLAuth (would skip cleanly without). Server build clean, 0 errors, 0 warnings.
Deferred: the session-identity end-to-end check (drive a full OPC UA UserName session, then read a 'whoami' node to verify the role landed on RoleBasedIdentity). That needs a test-only address-space node and is scoped for a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ModbusSimulatorFixture is a collection fixture so the 2s TCP probe runs once per run, not per test; SkipReason gets a clear operator-facing message ('start ModbusPal or override MODBUS_SIM_ENDPOINT'). Tests call Assert.Skip(sim.SkipReason) rather than silently returning — matches the test-plan convention and reads cleanly in CI logs. DL205Profile.BuildOptions deliberately disables the background probe loop since integration tests drive reads explicitly and the probe would race with assertions. Tag naming uses the DL205_ prefix so filter 'DisplayName~DL205' surfaces device-specific failures at a glance.
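A sketch of the fixture wiring under those conventions (xunit.v3 `Assert.Skip`); type names beyond `ModbusSimulatorFixture` are illustrative:

```csharp
using System.Net;
using Xunit;

// Collection fixture: the 2 s TCP probe runs once per test run, not per test.
public sealed class ModbusSimulatorFixture
{
    public string? SkipReason { get; }   // null when the simulator answered the probe
    public IPEndPoint? Endpoint { get; }
    // ctor: resolve MODBUS_SIM_ENDPOINT (or the ModbusPal default), probe with a
    // 2 s timeout, set SkipReason to the operator-facing message on failure.
}

[CollectionDefinition("modbus-simulator")]
public class ModbusSimulatorCollection : ICollectionFixture<ModbusSimulatorFixture> { }

[Collection("modbus-simulator")]
public sealed class DL205_ReadTests
{
    private readonly ModbusSimulatorFixture _sim;
    public DL205_ReadTests(ModbusSimulatorFixture sim) => _sim = sim;

    [Fact]
    public void DL205_HoldingRegister_RoundTrips()
    {
        if (_sim.SkipReason is not null)
            Assert.Skip(_sim.SkipReason);   // reads cleanly in CI logs, never silently passes
        // ... explicit read against _sim.Endpoint (background probe loop disabled)
    }
}
```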
Project references: xunit.v3 + Shouldly + Microsoft.NET.Test.Sdk + xunit.runner.visualstudio (matches the existing Driver.Modbus.Tests unit project), project ref to src/Driver.Modbus. Registered in ZB.MOM.WW.OtOpcUa.slnx under tests/. ModbusPal/README.md documents the dev loop (install ModbusPal jar, load profile, start simulator, dotnet test), explains MODBUS_SIM_ENDPOINT override for real-PLC benchwork, and flags DL205.xmpp as the first profile to add in a follow-up PR.
dotnet test run against the scaffold (no simulator running) skips cleanly: 0 failed, 0 passed, 1 skipped, with the SkipReason surfaced. dotnet build clean (0 warnings, 0 errors). Updated docs/v2/modbus-test-plan.md to mark the scaffold PR done and renumbered future PRs from 'PR 27+' to 'PR 31+' to stay in sync with the actual PR chain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sidebar's 'Signed in as' line now wraps the display name in a link to /account so the existing sidebar-compact view becomes the entry point for the fuller page — keeps the sign-out button where it was for muscle memory, just adds the detail page one click away. Page is gated with [Authorize] (any authenticated admin) rather than a specific role — the capability table deliberately works for every signed-in user so they can see what they DON'T have access to, which helps them file the right ticket with their LDAP admin instead of getting a plain Access Denied when navigating blindly.
Capability → required-role table is defined as a private readonly record list in the page rather than pulled from a service because it's a UI-presentation concern, not runtime policy state — the runtime policy IS Program.cs's AddAuthorizationBuilder + each page's [Authorize] attribute, and this table just mirrors it for operator readability. Comment on the list reminds future-me to extend it when a new policy or [Authorize] page lands. No behavior change if roles are empty, but the page surfaces a hint ('Sign-in would have been blocked, so if you're seeing this, the session claim is likely stale') that nudges the operator toward signing out + back in.
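A sketch of that shape; the rows and role names are illustrative, and the real page reads the principal from the cascading `AuthenticationState`:

```csharp
using System.Linq;
using System.Security.Claims;

// UI-presentation mirror of Program.cs's AddAuthorizationBuilder policies — not
// runtime policy state. Extend when a new policy or [Authorize] page lands.
internal sealed record CapabilityRow(string Action, string[] AnyOfRoles);

internal static class AccountCapabilities
{
    public static readonly CapabilityRow[] Table =
    {
        new("View fleet status",        new[] { "ReadOnly", "ConfigEditor", "FleetAdmin" }),
        new("Edit draft configuration", new[] { "ConfigEditor", "FleetAdmin" }),
        new("Manage certificate trust", new[] { "FleetAdmin" }),
    };

    // The page's only logic: the same Any-overlap check [Authorize(Roles = "...")] performs.
    public static bool Has(CapabilityRow row, ClaimsPrincipal user) =>
        row.AnyOfRoles.Any(user.IsInRole);
}
```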
No new tests added — the page is pure display over claims; its only logic is the 'has-capability' Any-overlap check which is exactly what ASP.NET's [Authorize(Roles=...)] does in-framework, and duplicating that in a unit test would test the framework rather than our code. Admin.Tests Unit stays 23 pass / 0 fail. Admin build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Razor page layout: two tables (Rejected / Trusted) with Subject / Issuer / Thumbprint / Valid-window / Actions columns, status banner after each action with success or warning kind ('file missing' = another admin handled it), FleetAdmin-only via [Authorize(Roles=AdminRoles.FleetAdmin)]. Each action invokes LogActionAsync which Serilog-logs the authenticated admin user + thumbprint + action for an audit trail — DB-level ConfigAuditLog persistence is deferred because its schema is cluster-scoped and cert actions are cluster-agnostic; Serilog + CertTrustService's filesystem-op info logs give the forensic trail in the meantime. Sidebar link added to MainLayout between Reservations and the future Account page.
Tests — CertTrustServiceTests (9 new unit cases): ListRejected parses Subject + Thumbprint + store kind from a self-signed test cert written into rejected/certs/; rejected and trusted stores are kept separate; TrustRejected moves the file and the Rejected list is empty afterwards; TrustRejected with a thumbprint not in rejected returns false without touching trusted; DeleteRejected removes the file; UntrustCert removes from trusted only; thumbprint match is case-insensitive (operator UX); missing store directories produce empty lists instead of throwing DirectoryNotFoundException (pristine-install tolerance); a junk .der in the store is logged + skipped and the valid certs still surface (one bad file doesn't break the page). Full Admin.Tests Unit suite: 23 pass / 0 fail (14 prior + 9 new). Full Admin build clean — 0 errors, 0 warnings.
lmx-followups.md #3 marked DONE with a cross-reference to this PR and a note that flipping AutoAcceptUntrustedClientCertificates to false as the production default is a deployment-config follow-up, not a code gap — the Admin UI is now ready to be the trust gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
{"title":"Phase 3 PR 54 -- Siemens S7 Modbus TCP quirks research doc","body":"## Summary\n\nAdds `docs/v2/s7.md` (485 lines) covering Siemens SIMATIC S7 family Modbus TCP behavior. Mirrors the `docs/v2/dl205.md` template for future per-quirk implementation PRs.\n\n## Key findings for the implementation track\n\n- **No fixed memory map** — every S7 Modbus server is user-wired via `MB_SERVER`/`MODBUSCP`/`MODBUSPN` library blocks. Driver must accept per-site config, not assume a vendor layout.\n- **MB_SERVER requires non-optimized DBs** (STATUS `0x8383` if optimized). Most common field bug.\n- **Word order default = ABCD** (opposite of DL260). Driver's S7 profile default must be `ByteOrder.BigEndian`, not `WordSwap`.\n- **One port per MB_SERVER instance** — multi-client requires parallel FBs on 503/504/… Most clients assume port 502 multiplexes (wrong on S7).\n- **CP 343-1 Lean is server-only**, requires the `2XV9450-1MB00` license.\n- **FC20/21/22/23/43 all return Illegal Function** on every S7 variant — driver must not attempt FC23 bulk-read optimization for S7.\n- **STOP-mode behavior non-deterministic** across firmware bands — treat both read/write STOP-mode responses as unavailable.\n\nTwo items flagged as unconfirmed rumour (V2.0+ float byte-order claim, STOP-mode caching location).\n\nNo code, no tests — implementation lands in PRs 56+.\n\n## Test plan\n- [x] Doc renders as markdown\n- [x] 31 citations present\n- [x] Section structure matches dl205.md template","head":"phase-3-pr54-s7-research-doc","base":"v2"}
{"title":"Phase 3 PR 55 -- Mitsubishi MELSEC Modbus TCP quirks research doc","body":"## Summary\n\nAdds `docs/v2/mitsubishi.md` (451 lines) covering MELSEC Q/L/iQ-R/iQ-F/FX3U Modbus TCP behavior. Mirrors `docs/v2/dl205.md` template for per-quirk implementation PRs.\n\n## Key findings for the implementation track\n\n- **Module naming trap** — `QJ71MB91` is SERIAL RTU, not TCP. TCP module is `QJ71MT91`. Surface clearly in driver docs.\n- **No canonical mapping** — per-site 'Modbus Device Assignment Parameter' block (up to 16 entries). Treat mapping as runtime config.\n- **X/Y hex vs octal depends on family** — Q/L/iQ-R use HEX (X20 = decimal 32); FX/iQ-F use OCTAL (X20 = decimal 16). Helper must take a family selector.\n- **Word order CDAB default** across all MELSEC families (opposite of Siemens S7). Driver Mitsubishi profile default: `ByteOrder.WordSwap`.\n- **D-registers binary by default** (opposite of DL205's BCD default). Caller opts in to `Bcd16`/`Bcd32` when ladder uses BCD.\n- **FX5U needs firmware ≥ 1.060** for Modbus TCP server — older is client-only.\n- **FX3U-ENET vs FX3U-ENET-P502 vs FX3U-ENET-ADP** — only the middle one binds port 502; the last has no Modbus at all. Common operator mis-purchase.\n- **QJ71MT91 does NOT support FC22 / FC23** — iQ-R / iQ-F do. Bulk-read optimization must gate on capability.\n- **STOP-mode writes configurable** on Q/L/iQ-R/iQ-F (default accept), always rejected on FX3U-ENET.\n\nThree unconfirmed rumours flagged separately.\n\nNo code, no tests — implementation lands in PRs 58+.\n\n## Test plan\n- [x] Doc renders as markdown\n- [x] 17 citations present\n- [x] Per-model test naming matrix included (`Mitsubishi_QJ71MT91_*`, `Mitsubishi_FX5U_*`, `Mitsubishi_FX3U_ENET_*`, shared `Mitsubishi_Common_*`)","head":"phase-3-pr55-mitsubishi-research-doc","base":"v2"}
The project uses [GLAuth](https://github.com/glauth/glauth) v2.4.0 as the LDAP server for local development.
Enable LDAP in `appsettings.json` under `Authentication.Ldap`. See [Configuration Guide](Configuration.md) for the full property reference.
### Active Directory configuration
Production deployments typically point at Active Directory instead of GLAuth. Only four properties differ from the dev defaults: `Server`, `Port`, `UserNameAttribute`, and `ServiceAccountDn`. The same `GroupToRole` mechanism works — map your AD security groups to OPC UA roles.
"ServiceAccountPassword":"<from your secret store>",
"DisplayNameAttribute":"displayName",
"GroupAttribute":"memberOf",
"UserNameAttribute":"sAMAccountName",
"GroupToRole":{
"OPCUA-Operators":"WriteOperate",
"OPCUA-Engineers":"WriteConfigure",
"OPCUA-AlarmAck":"AlarmAck",
"OPCUA-Tuners":"WriteTune"
}
}
}
}
```
Notes:
- `UserNameAttribute: "sAMAccountName"` is the critical AD override — the default `uid` is not populated on AD user entries, so the user-DN lookup returns no results without it. Use `userPrincipalName` instead if operators log in with `user@corp.example.com` form.
- `Port: 636` + `UseTls: true` is required under AD's LDAP-signing enforcement. AD increasingly rejects plain-LDAP bind; set `AllowInsecureLdap: false` to refuse fallback.
- `ServiceAccountDn` should name a dedicated read-only service principal — not a privileged admin. The account needs read access to user and group entries in the search base.
- `memberOf` values come back as full DNs like `CN=OPCUA-Operators,OU=OPC UA Security Groups,OU=Groups,DC=corp,DC=example,DC=com`. The authenticator strips the leading `CN=` RDN value so operators configure `GroupToRole` with readable group common-names.
- Nested group membership is **not** expanded — assign users directly to the role-mapped groups, or pre-flatten membership in AD. `LDAP_MATCHING_RULE_IN_CHAIN` / `tokenGroups` expansion is an authenticator enhancement, not a config change.
### Security Considerations
- LDAP credentials are transmitted in plaintext over the OPC UA channel unless transport security is enabled. Use `Basic256Sha256-SignAndEncrypt` for production deployments.
# V1 Archive Status — CLOSED (Phase 2 Streams D + E complete)
This document inventories every v1 surface that's been **functionally superseded** by v2 but
**physically retained** in the build until the deletion PR (Phase 2 PR 3). Rationale: cascading
references mean a single deletion is high blast-radius; archive-marking lets the v2 stack ship
on its own merits while the v1 surface stays as parity reference.
> **Status as of 2026-04-18: the v1 archive has been fully removed from the tree.**
> This document is retained as historical record of the Phase 2 Stream D / E closure.
## Archived projects (final state)
| Path | Status | Replaced by | Build behavior |
|---|---|---|---|
| `src/ZB.MOM.WW.OtOpcUa.Host/` | Archive (executable in build) | `OtOpcUa.Server` + `Driver.Galaxy.Host` + `Driver.Galaxy.Proxy` | Builds; not deployed by v2 install scripts |
| `src/ZB.MOM.WW.OtOpcUa.Historian.Aveva/` | Archive (plugin in build) | TODO: port into `Driver.Galaxy.Host/Backend/Historian/` (Task B.1.h follow-up) | Builds; loaded only by archived Host |
| `tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive/` | Archive | `Driver.Galaxy.E2E` + per-component test projects | `<IsTestProject>false</IsTestProject>` — `dotnet test slnx` skips |
> **Status**: **SHIPPED** 2026-04-19 — Streams A/B/C/D + E data layer merged to `v2` across PRs #78-82. Final exit-gate PR #83 turns the compliance script into real checks (all pass) and records this status update. One deferred piece: Stream E.2/E.3 SignalR hub + Blazor `/hosts` column refresh lands in a visual-compliance follow-up PR on the Phase 6.4 Admin UI branch.
## Phase Objective
Land the cross-cutting runtime protections + operability features that `plan.md` + `driver-stability.md` specify by decision but that no driver-phase actually wires. End-state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central DB outage via a LiteDB local cache.
Closes these gaps flagged in the 2026-04-19 audit:
1. Polly v8 resilience pipelines wired to every `IDriver` capability (no-op per-driver today; Galaxy has a hand-rolled `CircuitBreaker` only).
2. Tier A/B/C enforcement at runtime — `driver-stability.md` §2–4 and decisions #63–73 define memory watchdog, bounded queues, scheduled recycle, wedge detection; `MemoryWatchdog` exists only inside `Driver.Galaxy.Host`.
3. Health endpoints (`/healthz`, `/readyz`) on `OtOpcUa.Server`.
4. Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
5. LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`). **Pipeline key = `(DriverInstanceId, HostName)`** so one dead PLC behind a multi-device driver doesn't open the breaker for healthy siblings (decision #35 per-device isolation). **Per-capability policy** — Read / HistoryRead / Discover / Probe / Alarm get retries; **Write does NOT** unless `[WriteIdempotent]` on the tag definition (decisions #44-45). |
| Every capability-interface consumer in the server | Wrap `IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync/UnsubscribeAsync`, `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync`. Composition: timeout → (retry when capability supports) → circuit breaker → bulkhead. |
| `Core.Abstractions` → new `WriteIdempotentAttribute` | Marker on `ModbusTagDefinition` / `S7TagDefinition` / `OpcUaClientDriver` tag rows; opts that tag into auto-retry on Write. Absence = no retry, per spec. |
| `Core` → new `Core.Stability` sub-namespace — **split** | Two separate subsystems: (a) **`MemoryTracking`** runs all tiers; captures baseline (median of first 5 min `GetMemoryFootprint` samples) + applies the hybrid rule `soft = max(multiplier × baseline, baseline + floor)`; soft breach logs + surfaces to Admin; never kills. (b) **`MemoryRecycle`** (Tier C only — requires out-of-process topology) handles hard-breach recycle via the Proxy-side supervisor. Tier A/B overrun escalates to Tier C promotion ticket, not auto-kill. |
| `ScheduledRecycleScheduler` | Tier C only per decisions #73-74. Weekly/time-of-day recycle via Proxy supervisor. Tier A/B opt-in recycle lands in a future phase together with a Tier-C-escalation workflow. |
| `WedgeDetector` | **Demand-aware**: flips a driver to Faulted only when `(hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` derives from non-zero Polly bulkhead depth OR ≥1 active MonitoredItem OR ≥1 queued historian read. Idle + subscription-only drivers stay Healthy. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must advertise their out-of-process topology; the registry enforces invariants (Tier C has a `Proxy` + `Host` pair). |
| `Driver.Galaxy.Proxy/Supervisor/` | **Retains** existing `CircuitBreaker` + `Backoff` — they guard IPC respawn (decision #68), different concern from the per-call Polly layer. Only `HeartbeatMonitor` is referenced downstream (IPC liveness). |
| `OtOpcUa.Server` → Minimal API endpoints on `http://+:4841` | `/healthz` = process alive + (config DB reachable OR `UsingStaleConfig=true`). `/readyz` = ANDed driver health; state-machine per `DriverState`: `Unknown`/`Initializing` → 503, `Healthy` → 200, `Degraded` → 200 + `{degradedDrivers: [...]}` in body, `Faulted` → 503. JSON body always reports per-instance detail. |
| Serilog configuration | Centralize enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every capability call runs inside a `LogContext.PushProperty` scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON sink added alongside plain-text (switchable via `Serilog:WriteJson` appsetting). |
| `Configuration` project | Add `LiteDbConfigCache` adapter. **Generation-sealed snapshots**: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file. Reads serve the last-known-sealed generation; mixed-generation reads are impossible. Write path bypasses cache + fails hard on DB outage. Pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-sealed-snapshot. |
| `DriverHostStatus` vs. `DriverInstanceResilienceStatus` | New separate entity `DriverInstanceResilienceStatus { DriverInstanceId, HostName, LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc, BaselineFootprintBytes }`. `DriverHostStatus` keeps per-host connectivity only; Admin `/hosts` joins both for display. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers. |
| Config DB schema | LiteDB cache is a read-only mirror; no new central tables except `DriverHostStatus` column additions. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 *generalises* the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy `SafeMxAccessHandle` + Modbus `TcpClient` disposal stay — they're driver-internal, not part of the cross-cutting layer. |
- [ ] Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm test counts baseline so the resilience layer doesn't regress any
- [ ] Serilog configuration inventory: locate every `Log.ForContext` call site that will need `LogContext` rewrap
- [ ] Admin `/hosts` page's current `DriverHostStatus` consumption reviewed so the schema extensions don't break it
## Task Breakdown
### Stream A — Resilience layer (1 week)
1. **A.1** Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build `DriverResiliencePipelineBuilder` — key on `(DriverInstanceId, HostName)`; composes Timeout → (Retry when the capability allows it; skipped for Write unless `[WriteIdempotent]`) → CircuitBreaker → Bulkhead (sketch after this list). Per-capability policy map documented in `DriverResilienceOptions.CapabilityPolicies`.
2. **A.2** `DriverResilienceOptions` record bound from `DriverInstance.ResilienceConfig` JSON column (new nullable). **Per-tier × per-capability** defaults: Tier A (OpcUaClient, S7) Read 3 retries/2 s/5-failure-breaker, Write 0 retries/2 s/5-failure-breaker; Tier B (Modbus) Read 3/4 s/5, Write 0/4 s/5; Tier C (Galaxy) Read 1 retry/10 s/no-kill, Write 0/10 s/no-kill. Idempotent writes can opt into Read-shaped retry via the attribute.
3. **A.3** `CapabilityInvoker<TCapability, TResult>` wraps every method on the capability interfaces (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync/UnsubscribeAsync`, `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync`). Existing server-side dispatch routes through it.
4. **A.4** **Retain** `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard IPC process respawn (decision #68), orthogonal to the per-call Polly layer. Only `HeartbeatMonitor` is consumed outside the supervisor.
5. **A.5** Unit tests: per-policy, per-composition. Negative integration tests: (a) Modbus FlakeyTransport fails 5× on Read, succeeds 6th — invoker surfaces success; (b) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=false` — invoker surfaces failure without retry (no duplicate pulse); (c) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=true` — invoker retries. Bench: no-op overhead < 1%.
6. **A.6** `WriteIdempotentAttribute` in `Core.Abstractions`. Modbus/S7/OpcUaClient tag-definition records pick it up; invoker reads via reflection once at driver init.
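A sketch of the A.1 composition on Polly v8's `ResiliencePipelineBuilder` (`Add*` order is outermost-first; option values and the builder signature here are assumptions):

```csharp
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

public static class DriverResiliencePipelines
{
    // One pipeline per (DriverInstanceId, HostName) key; retryAllowed is false
    // for Write unless the tag carries [WriteIdempotent].
    public static ResiliencePipeline Build(TimeSpan timeout, bool retryAllowed, int maxRetries)
    {
        var builder = new ResiliencePipelineBuilder().AddTimeout(timeout); // outermost: call budget

        if (retryAllowed && maxRetries > 0)
            builder.AddRetry(new RetryStrategyOptions
            {
                MaxRetryAttempts = maxRetries,
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
            });

        builder.AddCircuitBreaker(new CircuitBreakerStrategyOptions
        {
            FailureRatio = 1.0,          // trip only on consecutive failures
            MinimumThroughput = 5,       // the 5-failure-breaker default from A.2
        });
        builder.AddConcurrencyLimiter(permitLimit: 16); // bulkhead, innermost (Polly.RateLimiting package)

        return builder.Build();
    }
}
```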
### Stream B — Tier A/B/C stability runtime — split into MemoryTracking + MemoryRecycle (1 week)
2. **B.2** **`MemoryTracking`** (all tiers) lifted from `Driver.Galaxy.Host/MemoryWatchdog.cs`. Captures `BaselineFootprintBytes` as the median of first 5 min of `IDriver.GetMemoryFootprint()` samples post-`InitializeAsync`. Applies **decision #70 hybrid formula**: `soft = max(multiplier × baseline, baseline + floor)`; Tier A multiplier=3, floor=50 MB; Tier B multiplier=3, floor=100 MB; Tier C multiplier=2, floor=500 MB. Soft breach → log + `DriverInstanceResilienceStatus.CurrentFootprint` tick; never kills. Hard = 2 × soft (worked sketch after this list).
3. **B.3** **`MemoryRecycle`** (Tier C only per decisions #73-74). Hard-breach on a Tier C driver triggers `ScheduledRecycleScheduler.RequestRecycleNow(driverInstanceId)`; scheduler proxies to `Driver.Galaxy.Proxy/Supervisor/` which restarts the Host process. Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; **never auto-kills** the in-process driver.
4. **B.4** **`ScheduledRecycleScheduler`** per decision #67: Tier C driver instances opt-in to a weekly recycle at a configured cron. Tier A/B scheduled recycle deferred to a later phase paired with Tier-C escalation.
5. **B.5** **`WedgeDetector`** demand-aware: `if (state==Healthy && hasPendingWork && noProgressIn > WedgeThreshold) → force ReinitializeAsync`. `hasPendingWork` = (bulkhead depth > 0) OR (active monitored items > 0) OR (queued historian-read count > 0). `WedgeThreshold` default 5 × PublishingInterval, min 60 s. Idle driver stays Healthy.
6. **B.6** Tests: tracking unit tests drive synthetic allocation against a fake `GetMemoryFootprint`; recycle tests use a mock supervisor; wedge tests include the false-fault cases — idle subscriber, slow historian backfill, write-only burst.
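Worked sketch of the B.2 formula; the helper is illustrative:

```csharp
// soft = max(multiplier × baseline, baseline + floor); hard = 2 × soft.
static (long SoftBytes, long HardBytes) MemoryThresholds(
    long baselineBytes, double multiplier, long floorBytes)
{
    var soft = Math.Max((long)(multiplier * baselineBytes), baselineBytes + floorBytes);
    return (soft, 2 * soft);
}

// Tier B (multiplier 3, floor 100 MB) with a 60 MB observed baseline:
// soft = max(180 MB, 160 MB) = 180 MB; hard = 360 MB. A tiny Tier A driver
// with a 10 MB baseline gets soft = max(30 MB, 60 MB) = 60 MB — the floor dominates.
```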
### Stream C — Health endpoints + structured logging (4 days)
1. **C.1** `OtOpcUa.Server/Observability/HealthEndpoints.cs` — Minimal API on a second Kestrel binding (default `http://+:4841`). `/healthz` reports process uptime + config-DB reachability (or cache-warm). `/readyz` enumerates `DriverInstance` rows + reports each driver's `DriverHealth.State`; returns 503 if ANY driver is Faulted (sketch after this list). JSON body per `docs/v2/acl-design.md` §"Operator Dashboards" shape.
2. **C.2** `LogContextEnricher` installed at Serilog config time. Every driver-capability call site wraps its body in `using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId))`. Correlation IDs: reuse OPC UA `RequestHeader.RequestHandle` when in-flight; otherwise generate `Guid.NewGuid().ToString("N")[..12]`.
3. **C.3** Add JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via `Serilog:WriteJson` appsetting.
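Sketch of the C.1 `/readyz` state machine as a minimal API handler; `IDriverFleet` and its `Snapshot()` are hypothetical stand-ins for the real fleet accessor:

```csharp
app.MapGet("/readyz", (IDriverFleet fleet) =>
{
    var drivers = fleet.Snapshot();   // per-instance { InstanceId, State, ... }

    // Unknown / Initializing / Faulted → 503; Healthy / Degraded → 200.
    var blocked = drivers.Any(d => d.State is DriverState.Unknown
                                or DriverState.Initializing
                                or DriverState.Faulted);
    var body = new
    {
        drivers,
        degradedDrivers = drivers.Where(d => d.State == DriverState.Degraded)
                                 .Select(d => d.InstanceId)
                                 .ToArray(),
    };
    return blocked
        ? Results.Json(body, statusCode: StatusCodes.Status503ServiceUnavailable)
        : Results.Json(body);         // 200; degraded detail rides in the body
});
```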
### Stream D — Config DB LiteDB fallback — generation-sealed snapshots (1 week)
1. **D.1** `LiteDbConfigCache` adapter backed by **sealed generation snapshots**: each successful `sp_PublishGeneration` writes `<cache-root>/<clusterId>/<generationId>.db` as read-only after commit. The adapter maintains a `CurrentSealedGenerationId` pointer updated atomically on successful publish. Mixed-generation reads are **impossible** — every read served from the cache serves one coherent sealed generation.
2. **D.2** Write-path queries (draft save, publish) bypass the cache entirely and fail hard on DB outage. Read-path queries (DriverInstance enumeration, LdapGroupRoleMapping, cluster + namespace metadata) go through the pipeline: timeout 2 s → retry 3× jittered → fallback to the current sealed snapshot (sketch after this list).
3. **D.3** `UsingStaleConfig` flag flips true when a read fell back to the sealed snapshot; cleared on the next successful DB round-trip. Surfaced on `/healthz` body and Admin `/hosts`.
4. **D.4** Tests: (a) SQL-container kill mid-operation — read returns sealed snapshot, `UsingStaleConfig=true`, driver stays Healthy; (b) mixed-generation guard — attempt to serve partial generation by corrupting a snapshot file mid-read → adapter fails closed rather than serving mixed data; (c) first-boot-no-snapshot case — adapter refuses to start, driver fails `InitializeAsync` with a clear config-DB-required error.
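Sketch of the D.2/D.3 read path; `_dbPipeline`, `_repository`, and `_cache` member names are assumptions around the plan's pipeline and cache:

```csharp
// Try the central DB through the Polly pipeline (2 s timeout, 3 jittered
// retries); on failure, serve the last sealed generation snapshot.
public async Task<IReadOnlyList<DriverInstance>> GetDriverInstancesAsync(CancellationToken ct)
{
    try
    {
        var rows = await _dbPipeline.ExecuteAsync(
            async token => await _repository.QueryDriverInstancesAsync(token), ct);
        UsingStaleConfig = false;                 // cleared on any successful round-trip
        return rows;
    }
    catch (Exception) when (!ct.IsCancellationRequested)
    {
        UsingStaleConfig = true;                  // surfaced on /healthz and Admin /hosts
        var generation = _cache.CurrentSealedGenerationId;
        return _cache.ReadDriverInstances(generation);  // one coherent sealed generation
    }
}
```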
### Stream E — Admin `/hosts` page refresh (3 days)
1. **E.1** Extend `DriverHostStatus` schema with Stream A resilience columns. Generate EF migration.
3. **E.3** `/hosts` Blazor page renders new columns; red badge if `ConsecutiveFailures > breakerThreshold / 2`.
## Compliance Checks (run at exit gate)
- [ ] **Invoker coverage**: every method on `IReadable` / `IWritable` / `ITagDiscovery` / `ISubscribable` / `IHostConnectivityProbe` / `IAlarmSource` / `IHistoryProvider` in the server dispatch layer routes through `CapabilityInvoker`. Enforce via a Roslyn analyzer (error-level; warning-first is rejected — the compliance check is the gate).
- [ ] **Write-retry guard**: writes without `[WriteIdempotent]` never get retried. A unit test on the invoker path asserts zero retry attempts.
- [ ] **Pipeline isolation**: pipeline key is `(DriverInstanceId, HostName)`. Integration test with two Modbus hosts under one instance — failing host A does not open the breaker for host B.
- [ ] **Tier registry**: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. Unit test walks the registry + asserts no gaps. Tier C registrations must declare their out-of-process topology.
- [ ] **MemoryTracking never kills**: soft/hard breach tests on a Tier A/B driver log + surface without terminating the process.
- [ ] **MemoryRecycle Tier C only**: hard breach on a Tier A driver never invokes the supervisor; on Tier C it does.
- [ ] **Wedge demand-aware**: test suite includes idle-subscription-only, slow-historian-backfill, and write-only-burst cases — driver stays Healthy.
- [ ] **Galaxy supervisor preserved**: `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` still present + still invoked on Host crash.
- [ ] **Health state machine**: `/healthz` + `/readyz` respond within 500 ms for every `DriverState`; state-machine table in this doc drives the test matrix.
- [ ] **Structured log**: CI grep asserts at least one log line per capability call has `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- [ ] No regression in existing test suites — `dotnet test ZB.MOM.WW.OtOpcUa.slnx` count equal-or-greater than pre-Phase-6.1 baseline.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Polly pipeline adds per-request latency on hot path | Medium | Medium | Benchmark Stream A.5 before merging; 1 % overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; `UsingStaleConfig` flag surfaced on `/readyz`; cache refresh on every successful DB round-trip; 24-hour synthetic warning |
| Tier watchdog false-positive-kills a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle; thresholds configurable per-instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if driver claims `Healthy`; add circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Analyzer ships at error level (the compliance gate requires it; warning-mode removed per review finding 10); the compliance doc's full capability-interface call-site list gives authors the exact surface to fix |
Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.
1. **Crit · ACCEPT** — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via `WriteIdempotent` + CAS). Pipeline now **capability-specific**: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; **Write does not** unless the tag metadata carries `WriteIdempotent=true`. New `WriteIdempotentAttribute` surfaces on `ModbusTagDefinition` / `S7TagDefinition` / etc.
2. **Crit · ACCEPT** — "One pipeline per driver instance" breaks decision #35's per-device isolation. **Change**: pipeline key is `(DriverInstanceId, HostName)` not just `DriverInstanceId`. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
3. **Crit · ACCEPT** — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). **Change**: Stream B splits into two — `MemoryTracking` (all tiers, soft/hard thresholds log + surface to Admin `/hosts`; never kills) and `MemoryRecycle` (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
4. **High · ACCEPT** — Removing Galaxy's hand-rolled `CircuitBreaker` drops decision #68 host-supervision crash-loop protection. **Change**: keep `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard the IPC *process* re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
5. **High · ACCEPT** — Roslyn analyzer targeting `IDriver` misses the hot paths (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ISubscribable.SubscribeAsync` etc.). **Change**: analyzer rule now matches every method on the capability interfaces; compliance doc enumerates the full call-site list.
6. **High · ACCEPT** — `/healthz` + `/readyz` under-specified for degraded-running. **Change**: add a state-matrix sub-section explicitly covering `Unknown` (pre-init: `/readyz` 503), `Initializing` (503), `Healthy` (200), `Degraded` (200 with JSON body flagging the degraded driver; `/readyz` is OR across drivers), `Faulted` (503), plus cached-config-serving (`/healthz` returns 200 + `UsingStaleConfig: true` in JSON body).
7. **High · ACCEPT** — `WedgeDetector` based on "no successful Read" false-fires on write-only subscriptions + idle systems. **Change**: wedge criteria now `(hasPendingWork AND noProgressIn > threshold)` where `hasPendingWork` comes from the Polly bulkhead depth + active MonitoredItem count. Idle driver stays Healthy.
8. **High · ACCEPT** — LiteDB cache serving mixed-generation reads breaks publish atomicity. **Change**: cache is snapshot-per-generation. Each published generation writes a sealed snapshot into `<cache-root>/<cluster>/<generationId>.db`; reads serve the last-known-sealed generation and never mix. Central DB outage during a *publish* means that publish fails (write path doesn't use cache); reads continue from the prior sealed snapshot.
9. **Med · ACCEPT** — `DriverHostStatus` schema conflates per-host connectivity with per-driver-instance resilience counters. **Change**: new `DriverInstanceResilienceStatus` table separate from `DriverHostStatus`. Admin `/hosts` joins both for display.
10. **Med · ACCEPT** — Compliance says analyzer-error; risks say analyzer-warning. **Change**: phase 6.1 ships at **error** level (this phase is the gate); warning-mode option removed.
11. **Med · ACCEPT** — Hardcoded per-tier MB bands ignore decision #70's `max(multiplier × baseline, baseline + floor)` formula with observed-baseline capture. **Change**: watchdog captures baseline at post-init plateau (median of first 5 min GetMemoryFootprint samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
> **Status**: **SHIPPED (core)** 2026-04-19 — Streams A, B, C (foundation), D (data layer) merged to `v2` across PRs #84-87. Final exit-gate PR #88 turns the compliance stub into real checks (all pass, 2 deferred surfaces tracked).
>
> Deferred follow-ups (tracked separately):
> - Stream C dispatch wiring on the 11 OPC UA operation surfaces (task #143).
> **Predecessor**: Phase 6.1 (Resilience & Observability) — reuses the Polly pipeline for ACL-cache refresh retries
> **Successor**: Phase 6.3 (Redundancy)
## Phase Objective
Wire ACL enforcement on every OPC UA Read / Write / Subscribe / Call path + LDAP group → admin role grants that the v2 plan specified but never ran. End-state: a user's effective permissions resolve through a per-session permission-trie over the 6-level `Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag` hierarchy, cached per session, invalidated on generation-apply + LDAP group expiry.
Closes these gaps:
1. **Data-path ACL enforcement** — `NodeAcl` table + `NodePermissions` flags shipped; `NodeAclService.cs` present as a CRUD surface; no code consults ACLs at `Read`/`Write` time. OPC UA server answers everything to everyone.
2. **`LdapGroupRoleMapping` for cluster-scoped admin grants** — decision #105 shipped as the *design*; admin roles are hardcoded (`FleetAdmin` / `ConfigEditor` / `ReadOnly`) with no cluster-scoping and no LDAP-to-grant table. Decision #105 explicitly lifts this from v2.1 into v2.0.
4. **Admin UI ACL grant editor** — `AclsTab.razor` exists but edits the now-unused `NodeAcl` table; needs to wire to the runtime evaluator + the new `LdapGroupRoleMapping` table.
## Scope — What Changes
**Architectural separation** (critical for correctness): `LdapGroupRoleMapping` is **control-plane only** — it maps LDAP groups to Admin UI roles (`FleetAdmin` / `ConfigEditor` / `ReadOnly`) and cluster scopes for Admin access. **It is NOT consulted by the OPC UA data-path evaluator.** The data-path evaluator reads `NodeAcl` rows joined directly against the session's **resolved LDAP group memberships**. The two concerns share zero runtime code path.
| Concern | Change |
|---------|--------|
| `Configuration` project | New entity `LdapGroupRoleMapping { Id, LdapGroup, Role, ClusterId? (nullable = system-wide), IsSystemWide, GeneratedAtUtc }`. **Consumed only by Admin UI role routing.** Migration. Admin CRUD. |
| `Core` → new `Core.Authorization` sub-namespace | `IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, OpcUaOperation op, NodeId nodeId) → AuthorizationDecision`. `op` covers every OPC UA surface: Browse, Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge, Confirm, Shelve. Result is tri-state (internal model distinguishes `Allow` / `NotGranted` / `Denied` + carries matched-grant provenance). Phase 6.2 only produces `Allow` + `NotGranted`; v2.1 Deny lands without API break. |
| `PermissionTrieBuilder` | Builds trie from `NodeAcl` rows joined against **resolved LDAP group memberships**, keyed on 6-level scope hierarchy for Equipment namespaces. **SystemPlatform namespaces (Galaxy)** use a `FolderSegment` scope level between Namespace and Tag, populated from `Tag.FolderPath` segments, so folder subtree authorization works on Galaxy trees the same way UNS works on Equipment trees. Trie node carries `ScopeKind` enum. |
| `PermissionTrieCache` + freshness | One trie per `(ClusterId, GenerationId)`. Invalidated on `sp_PublishGeneration` via in-process event bus AND generation-ID check on hot path — every authz call looks up `CurrentGenerationId` (Polly-wrapped, sub-second cache); a Backup that cached a stale generation detects the mismatch + forces re-load. **Redundancy-safe**. |
| `UserAuthorizationState` freshness | Cached per session BUT bounded by `MembershipFreshnessInterval` (default **15 min**). Past that, the next hot-path authz call re-resolves LDAP group memberships via `LdapGroupService`. Failure to re-resolve (LDAP unreachable) → **fail-closed**: evaluator returns `NotGranted` for every call until memberships refresh successfully. Decoupled from Phase 6.1's availability-oriented 24h cache. |
| `AuthCacheMaxStaleness` | Separate from Phase 6.1's `UsingStaleConfig` window. Default 5 min — beyond that, authz fails closed regardless of Phase 6.1 cache warmth. |
| OPC UA server dispatch — all enforcement surfaces | `DriverNodeManager` wires evaluator on: **Browse + TranslateBrowsePathsToNodeIds** (ancestors implicitly visible if any descendant has a grant; denied ancestors filter from results), **Read** (per-attribute StatusCode `BadUserAccessDenied` in mixed-authorization batches; batch never poisons), **Write** (uses `NodePermissions.WriteOperate/Tune/Configure` based on driver `SecurityClassification`), **HistoryRead** (uses `NodePermissions.HistoryRead` — **distinct** flag, not Read), **HistoryUpdate** (`NodePermissions.HistoryUpdate`), **CreateMonitoredItems** (per-`MonitoredItemCreateResult` denial), **TransferSubscriptions** (re-evaluates items on transfer), **Call** (`NodePermissions.MethodCall`), **Acknowledge/Confirm/Shelve** (per-alarm flags). |
| Subscription re-authorization | Each `MonitoredItem` is stamped with `(AuthGenerationId, MembershipVersion)` at create time. On every Publish, items with a stamp mismatching the session's current `(AuthGenerationId, MembershipVersion)` get re-evaluated; revoked items drop to `BadUserAccessDenied` within one publish cycle. Unchanged items stay fast-path. |
| `LdapAuthService` | On cookie-auth success: resolves LDAP group memberships; loads matching `LdapGroupRoleMapping` rows → role claims + cluster-scope claims (control plane); stores `UserAuthorizationState.LdapGroups` on the session for the data-plane evaluator. |
| `ValidatedNodeAclAuthoringService` | Replaces CRUD-only `NodeAclService` for authoring. Validates (LDAP group exists, scope exists in current or target draft, grant shape is valid, no duplicate `(LdapGroup, Scope)` pair). Admin UI writes only through it. |
| Admin UI `AclsTab.razor` | Writes via `ValidatedNodeAclAuthoringService`. Adds Probe-This-Permission row that runs the real evaluator against a chosen `(LDAP group, node, operation)` and shows `Allow` / `NotGranted` + matched-grant provenance. |
| Admin UI new tab `RoleGrantsTab.razor` | CRUD over `LdapGroupRoleMapping`. Per-cluster + system-wide grants. FleetAdmin only. **Documentation explicit** that this only affects Admin UI access, not OPC UA data plane. |
| Audit log | Every Grant/Revoke/Publish on `LdapGroupRoleMapping` or `NodeAcl` writes an `AuditLog` row with old/new state + user. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA authn | Already done (PR 19 LDAP user identity + Basic256Sha256 profile). Phase 6.2 is authorization only. |
| Driver-side `SecurityClassification` metadata | Drivers keep reporting `Operate` / `ViewOnly` / etc. — the evaluator uses them as *part* of the decision but doesn't replace them. |
- [ ] Existing `NodeAcl` + `NodePermissions` flag enum audited; confirm bitmask flags match `acl-design.md` table
- [ ] Existing `LdapAuthService` group-resolution code path traced end-to-end — confirm it already queries group memberships (we only need the caller to consume the result)
- [ ] Test DB scenarios catalogued: two clusters, three LDAP groups per cluster, mixed grant shapes; captured as seed-data fixtures
## Task Breakdown
### Stream A — `LdapGroupRoleMapping` table + migration (3 days)
1. **A.1** Entity + EF Core migration. Columns per §Scope table. Unique constraint on `(LdapGroup, ClusterId)` with null-tolerant comparer for the system-wide case. Index on `LdapGroup` for the hot-path lookup on auth (index sketch after this list).
2. **A.2** `ILdapGroupRoleMappingService` CRUD. Wrap in the Phase 6.1 Polly pipeline (timeout → retry → fallback-to-cache).
3. **A.3** Seed-data migration: preserve the current hardcoded `FleetAdmin` / `ConfigEditor` / `ReadOnly` mappings by seeding rows for the existing LDAP groups the dev box uses (`cn=fleet-admin,…`, `cn=config-editor,…`, `cn=read-only,…`). No-op migration for existing deployments.
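One way to express A.1's null-tolerant uniqueness on SQL Server via filtered indexes (the model-builder shape is an assumption):

```csharp
modelBuilder.Entity<LdapGroupRoleMapping>(e =>
{
    // Cluster-scoped rows: unique per (LdapGroup, ClusterId).
    e.HasIndex(m => new { m.LdapGroup, m.ClusterId })
     .IsUnique()
     .HasFilter("[ClusterId] IS NOT NULL");

    // System-wide rows (ClusterId = null): at most one per group. Doubles as
    // the hot-path LdapGroup lookup at auth time.
    e.HasIndex(m => m.LdapGroup)
     .IsUnique()
     .HasFilter("[ClusterId] IS NULL");
});
```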
### Stream B — Permission-trie evaluator (1 week)
1. **B.1** `IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, OpcUaOperation op, NodeId nodeId)` — returns `AuthorizationDecision`. Phase 6.2 produces only `Allow` / `NotGranted`; v2.1 widens to `Denied` without an API break (see §Scope).
2. **B.2** `PermissionTrieBuilder` builds the trie from `NodeAcl` rows joined against the session's resolved LDAP group memberships plus the current generation's `UnsArea` + `UnsLine` + `Equipment` + `Tag` tables (control-plane `LdapGroupRoleMapping` is never consulted here). One trie per `(ClusterId, GenerationId)` so rollback doesn't smear permissions across generations.
3. **B.3** Trie node structure: `{ Level: enum, ScopeId: Guid, AllowedPermissions: NodePermissions, ChildrenByLevel: Dictionary<Guid, TrieNode> }`. Evaluation walks from Cluster → Namespace → UnsArea → UnsLine → Equipment → Tag, ORing allowed permissions at each level. Additive semantics: a grant at Cluster level cascades to every descendant tag (walk sketch after this list).
4. **B.4** `PermissionTrieCache` registered as a singleton; exposes `GetTrieAsync(ClusterId, ct)` that returns the current-generation trie. Invalidated on `sp_PublishGeneration` via an in-process event bus; also on TTL expiry (24 h safety net).
5. **B.5** Per-session cached evaluator: OPC UA Session authentication produces `UserAuthorizationState { ClusterId, LdapGroups[], Trie }`; cached on the session until session close or generation-apply, bounded by `MembershipFreshnessInterval` per §Scope.
6. **B.6** Unit tests: trie-walk theory covering (a) Cluster-level grant cascades to tags, (b) Equipment-level grant doesn't leak to sibling Equipment, (c) multi-group union, (d) no-grant → deny, (e) Galaxy nodes skip UnsArea/UnsLine levels.
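Walk sketch for B.3's additive semantics; `NodePermissions` here is a stand-in for the existing flags enum:

```csharp
[Flags]
enum NodePermissions { None = 0, Read = 1, WriteOperate = 2, HistoryRead = 4 /* ... */ }

sealed class TrieNode
{
    public NodePermissions AllowedPermissions;
    public readonly Dictionary<Guid, TrieNode> Children = new();
}

// Walk Cluster → Namespace → UnsArea → UnsLine → Equipment → Tag, ORing
// grants at each level: a Cluster-level grant cascades to every descendant.
static NodePermissions Evaluate(TrieNode clusterRoot, IReadOnlyList<Guid> scopePath)
{
    var allowed = clusterRoot.AllowedPermissions;
    var node = clusterRoot;
    foreach (var scopeId in scopePath)
    {
        if (!node.Children.TryGetValue(scopeId, out var next))
            break;                              // no deeper grants on this path
        node = next;
        allowed |= node.AllowedPermissions;     // additive union; no Deny in 6.2
    }
    return allowed;
}
```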
### Stream C — OPC UA server dispatch wiring (6 days, widened)
1. **C.1** `DriverNodeManager.Read` — evaluator consulted per `ReadValueId` with `OpcUaOperation.Read`. Denied attributes get `BadUserAccessDenied` per-item; batch never poisons. Integration test covers mixed-authorization batch (3 authorized + 2 denied → 3 Good values + 2 Bad StatusCodes, request completes).
2. **C.2** `DriverNodeManager.Write` — evaluator chooses `NodePermissions.WriteOperate` / `WriteTune` / `WriteConfigure` based on the driver-reported `SecurityClassification`.
3. **C.3** `DriverNodeManager.HistoryRead` — **uses `NodePermissions.HistoryRead`**, which is a **distinct flag** from Read. Test: user with Read but not HistoryRead can read live values but gets `BadUserAccessDenied` on `HistoryRead`.
5. **C.5** `DriverNodeManager.CreateMonitoredItems` — per-`MonitoredItemCreateResult` denial in mixed-authorization batch; partial success path per OPC UA Part 4. Each created item stamped `(AuthGenerationId, MembershipVersion)`.
6. **C.6** `DriverNodeManager.TransferSubscriptions` — on reconnect, re-evaluate every transferred `MonitoredItem` against the session's current auth state. Stale-stamp items drop to `BadUserAccessDenied`.
7. **C.7** **Browse + TranslateBrowsePathsToNodeIds** — evaluator called with `OpcUaOperation.Browse`. Ancestor visibility implied when any descendant has a grant (per `acl-design.md` §Browse). Denied ancestors filter from browse results — the UA browser sees a hierarchy truncated at the denied ancestor rather than an inconsistent child-without-parent view.
10. **C.10** Publish path — for each `MonitoredItem` with a mismatched `(AuthGenerationId, MembershipVersion)` stamp, re-evaluate. Unchanged items stay fast-path; changes happen at next publish cycle (stamp-check sketch after this list).
11. **C.11** Integration tests: three-user seed with different memberships; matrix covers every operation in §Scope. Mixed-batch tests for Read + CreateMonitoredItems.
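The C.10 stamp check reduced to its decision logic; the types are minimal stand-ins, not the OPC UA SDK's `MonitoredItem`:

```csharp
sealed record AuthStamp(Guid AuthGenerationId, int MembershipVersion);

// Fast path: equal stamps mean the grant basis hasn't moved — skip re-evaluation.
// Record equality compares both components.
static bool NeedsReauthorization(AuthStamp item, AuthStamp session) => item != session;

// Publish-path usage (illustrative):
//   if (NeedsReauthorization(item.Stamp, session.Stamp))
//   {
//       var d = evaluator.Authorize(session.Identity, OpcUaOperation.CreateMonitoredItems, item.NodeId);
//       item.StatusCode = d.IsAllow ? Good : BadUserAccessDenied; // revoked within one cycle
//       item.Stamp = session.Stamp;                               // re-stamp so next publish is fast
//   }
```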
### Stream D — Admin UI refresh (4 days)
1. **D.1** `RoleGrantsTab.razor` — FleetAdmin-gated CRUD on `LdapGroupRoleMapping`. Per-cluster dropdown + system-wide checkbox. Validation: LDAP group must exist in the dev LDAP (GLAuth) before saving — best-effort probe with graceful degradation.
2. **D.2** `AclsTab.razor` rewrites its edit path to write through the new `ValidatedNodeAclAuthoringService`. Adds a "Probe this permission" row: choose `(LDAP group, node, action)` → shows `Allow` / `NotGranted` + the matched-grant provenance.
3. **D.3** Draft-generation diff viewer now includes an ACL section: "X grants added, Y grants removed, Z grants changed."
4. **D.4** SignalR notification: `PermissionTrieCache` invalidation on `sp_PublishGeneration` pushes to Admin UI so operators see "this cluster's permissions were just updated" within 2 s.
## Compliance Checks (run at exit gate)
- [ ] **Control/data-plane separation**: `LdapGroupRoleMapping` consumed only by Admin UI; the data-path evaluator has zero references to it. Enforced via a project-reference audit (Admin project references the mapping service; `Core.Authorization` does not).
- [ ] **Every operation wired**: Browse, Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge, Confirm, Shelve all consult the evaluator. Integration test matrix covers every operation × allow/deny.
- [ ] **HistoryRead uses its own flag**: test "user with Read + no HistoryRead gets `BadUserAccessDenied` on HistoryRead".
- [ ] **Browse ancestor visibility**: user with a grant only on a deep equipment node can browse the path to it (ancestors implied); denied ancestors filter from browse results otherwise.
- [ ] **Galaxy FolderSegment coverage**: a grant on a Galaxy folder subtree cascades to its tags; sibling folders are unaffected. Trie test covers this.
- [ ] **Subscription re-authorization**: integration test — create item, revoke grant via draft+publish, next publish cycle the item returns `BadUserAccessDenied` (not silently still-notifying).
- [ ] **Membership freshness**: test — 15 min MembershipFreshnessInterval elapses on a long-lived session + LDAP now unreachable → authz fails closed on the next request until LDAP recovers.
- [ ] **Auth cache fail-closed**: test — Phase 6.1 cache serves stale config for 6 min; authz evaluator refuses all calls after 5 min regardless.
- [ ] **Trie invariants**: `PermissionTrieBuilder` is idempotent (build twice with identical inputs → equal tries).
- [ ] **Redundancy-safe invalidation**: integration test — two nodes, a publish on one, authorize a request on the other before in-process event propagates → generation-mismatch forces re-load, no stale decision.
- [ ] **Authoring validation**: `AclsTab` cannot save a `(LdapGroup, Scope)` pair that already exists in the draft; operator sees the validation error pre-save.
- [ ] **AuthorizationDecision shape stability**: API surface exposes `Allow` + `NotGranted` only; `Denied` variant exists in the type but is never produced; v2.1 can add Deny without API break.
- [ ] No regression in driver test counts.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| ACL evaluator latency on per-read hot path | Medium | High | Trie lookup is O(depth) = O(6); session-cached UserAuthorizationState avoids per-Read trie rebuild; benchmark in Stream B.6 |
| Trie cache stale after a rollback | Medium | High | `sp_PublishGeneration` + `sp_RollbackGeneration` both emit the invalidation event; trie keyed on `(ClusterId, GenerationId)` so rollback fetches the prior trie cleanly |
| `BadUserAccessDenied` returns expose sensitive browse-name metadata | Low | Medium | Server returns only the status code + NodeId; no message leak per OPC UA Part 4 §7.34 guidance |
| LdapGroupRoleMapping migration breaks existing deployments | Low | High | Seed-migration preserves the hardcoded groups' effective grants verbatim; smoke test exercises the post-migration fleet admin login |
| Deny semantics accidentally ship (would break `acl-design.md` defer) | Low | Medium | `AuthorizationDecision` exposes `Allow` + `NotGranted` only through Phase 6.2; the `Denied` variant exists in the type but is never produced; widening to real Deny is a v2.1 ticket |
1. **Crit · ACCEPT** — Trie must not conflate `LdapGroupRoleMapping` (control-plane admin claims per decision #105) with data-plane ACLs (decision #129). **Change**: `LdapGroupRoleMapping` is consumed only by the Admin UI role router. Data-plane trie reads `NodeAcl` rows joined against the session's **resolved LDAP groups**, never admin roles. Stream B.2 updated.
2. **Crit · ACCEPT** — Cached `UserAuthorizationState` survives LDAP group changes because memberships only refresh at cookie-auth. **Change**: add `MembershipFreshnessInterval` (default 15 min); past that, next hot-path authz call forces group re-resolution (fail-closed if LDAP unreachable). Session-close-wins on config-rollback.
3. **High · ACCEPT** — Node-local invalidation doesn't extend across redundant pair. **Change**: trie keyed on `(ClusterId, GenerationId)`; hot-path authz looks up `CurrentGenerationId` from the shared config DB (Polly-wrapped + sub-second cache). A Backup that read stale generation gets a mismatched trie → forces re-load. Implementation note added to Stream B.4.
4. **High · ACCEPT** — Browse enforcement missing. **Change**: new Stream C.7 (`Browse + TranslateBrowsePathsToNodeIds` enforcement). Ancestor visibility implied when any descendant has a grant; denied ancestors filter from browse results per `acl-design.md` §Browse.
5. **High · ACCEPT** — `HistoryRead` should use `NodePermissions.HistoryRead` bit, not `Read`. **Change**: Stream C.3 revised; separate unit test asserts `Read+no-HistoryRead` denies HistoryRead while allowing current-value reads.
6. **High · ACCEPT** — Galaxy shallow-path (Cluster→Namespace→Tag) loses folder hierarchy authorization. **Change**: SystemPlatform namespaces use a `FolderSegment` scope-level between Namespace and Tag, populated from `Tag.FolderPath`; UNS-kind namespaces keep the 6-level hierarchy. Trie supports both via `ScopeKind` on each node.
7. **High · ACCEPT** — Subscription re-authorization policy unresolved between create-time-only (fast, wrong on revoke) and per-publish (slow). **Change**: stamp each `MonitoredItem` with `(AuthGenerationId, MembershipVersion)`; re-evaluate on Publish only when either version changed. Revoked items drop to `BadUserAccessDenied` within one publish cycle.
8. **Med · ACCEPT** — Mixed-authorization batch `Read` / `CreateMonitoredItems` service-result semantics underspecified. **Change**: Stream C.6 explicitly tests per-`ReadValueId` + per-`MonitoredItemCreateResult` denial in mixed batches; batch never collapses to a coarse failure.
9. **Med · ACCEPT** — Missing surfaces: `Method.Call`, `HistoryUpdate`, event filter on subscriptions, subscription-transfer on reconnect, alarm-ack. **Change**: scope expanded — every OPC UA authorization surface enumerated in Stream C: Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge/Confirm/Shelve, Browse, TranslateBrowsePathsToNodeIds.
10. **Med · ACCEPT** — `bool` evaluator bakes in grant-only semantics; collides with v2.1 Deny. **Change**: internal model uses `AuthorizationDecision { Allow | NotGranted | Denied, IReadOnlyList<MatchedGrant> Provenance }`. Phase 6.2 maps `Denied` → never produced; UI + audit log use the full record so v2.1 Deny lands without API break.
11. **Med · ACCEPT** — 6.1 cache fallback is availability-oriented; applying it to auth is correctness-dangerous. **Change**: auth-specific staleness budget `AuthCacheMaxStaleness` (default 5 min, not 24 h). Past that, hot-path evaluator fails closed on cached reads; all authorization calls return `NotGranted` until fresh data lands. Documented in risks + compliance.
12. **Low · ACCEPT** — Existing `NodeAclService` is raw CRUD. **Change**: new `ValidatedNodeAclAuthoringService` enforces scope-uniqueness + draft/publish invariants + rejects invalid (LDAP group, scope) pairs; Admin UI writes through it only. Stream D.2 adjusted.
> **Status**: **SHIPPED (core)** 2026-04-19 — Streams B (ServiceLevelCalculator + RecoveryStateManager) and D core (ApplyLeaseRegistry) merged to `v2` in PR #89. Exit gate in PR #90.
>
> Deferred follow-ups (tracked separately):
> - Stream A — RedundancyCoordinator cluster-topology loader (task #145).
> - Stream C — OPC UA node wiring: ServiceLevel + ServerUriArray + RedundancySupport (task #147).
> **Predecessor**: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
> **Successor**: Phase 6.4 (Admin UI completion)
## Phase Objective
Land the non-transparent redundancy protocol end-to-end: two `OtOpcUa.Server` instances in a `ServerCluster` each expose a live `ServiceLevel` node whose value reflects that instance's suitability to serve traffic, advertise each other via `ServerUriArray`, and transition role (Primary ↔ Backup) based on health + operator intent.
Closes these gaps:
1. **Dynamic `ServiceLevel`** — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
2. **`ServerUriArray` broadcast** — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
3. **Primary / Backup role coordination** — entities carry `RedundancyRole` but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
4. **Mid-apply dip** — decision-level expectation that a server mid-generation-apply should report a *lower* ServiceLevel so clients cut over to the peer during the apply window. Not implemented.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads peers, runs **two-layer peer health probe**: (a) `/healthz` every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) `UaHealthProbe` every 10 s — opens a lightweight OPC UA client session to the peer + reads its `ServiceLevel` node + verifies endpoint serves data. Authority decisions use UaHealthProbe; `/healthz` is used only to avoid wasting UA probes when peer is obviously down. |
| Publish-generation fencing | Topology + role decisions are stamped with a monotonic `ConfigGenerationId` from the shared config DB. Coordinator re-reads topology via CAS on `(ClusterId, ExpectedGeneration)` → new row; peers reject state propagated from a lower generation. Prevents split-publish races. |
| `InvalidTopology` runtime state | If both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later. |
| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable includes **self + peers** in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). `RedundancySupport` stays static (set from `RedundancyMode` at startup); `Transparent` mode validated pre-publish, not rejected at startup. |
| `RedundancyCoordinator` computation | **8-state ServiceLevel matrix** — avoids OPC UA Part 5 §6.3.34 collision (`0=Maintenance`, `1=NoData`). Operator-declared maintenance only = **0**. Unreachable / Faulted = **1**. In-range operational states occupy **2..255**: Authoritative-Primary = **255**; Isolated-Primary (peer unreachable, self serving) = **230**; Primary-Mid-Apply = **200**; Recovering-Primary (post-fault, dwell not met) = **180**; Authoritative-Backup = **100**; Isolated-Backup (primary unreachable, "take over if asked") = **80**; Backup-Mid-Apply = **50**; Recovering-Backup = **30**; `InvalidTopology` (runtime detects >1 Primary) = **2** (detected-inconsistency band — below normal operation). Full matrix documented in `docs/Redundancy.md` update. |
| Role transition | Split-brain avoidance: role is *declared* in the shared config DB (`ClusterNode.RedundancyRole`), not elected at runtime. An operator flips the row (or a failover script does). Coordinator only reads; never writes. |
| `sp_PublishGeneration` hook | Uses named **apply leases** keyed to `(ConfigGenerationId, PublishRequestId)`. `await using var lease = coordinator.BeginApplyLease(...)`. Disposal on any exit path (success, exception, cancellation) decrements. Watchdog auto-closes any lease older than `ApplyMaxDuration` (default 10 min) → ServiceLevel can't stick at mid-apply. Pre-publish validator rejects unsupported `RedundancyMode` (e.g. `Transparent`) with a clear error so runtime never sees an invalid state. |
| Admin UI `/cluster/{id}` page | New `RedundancyTab.razor` — shows current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing `ClusterNode.RedundancyRole` + publishing a draft. |
| Metrics | New OpenTelemetry metrics: `ot_opcua_service_level{cluster,node}`, `ot_opcua_peer_reachable{cluster,node,peer}`, `ot_opcua_apply_in_progress{cluster,node}`. Sink via Phase 6.1 observability layer. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA authn / authz | Phases 6.2 + prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decision #79–85. |
| Transparent redundancy (`RedundancySupport=Transparent`) | Not supported. If the operator asks for it the server fails startup with a clear error. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |
## Entry Gate Checklist
- [ ] OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
- [ ] Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing
## Task Breakdown
### Stream A — Cluster topology loader (3 days)
1.**A.1**`RedundancyCoordinator` startup path: reads the `ClusterNode` row for the current node (identified by the `Cluster:NodeId` setting in `appsettings.json`), reads the cluster's peer list, validates invariants (no duplicate `ApplicationUri`, at most one `Primary` per cluster if `RedundancyMode.WarmActive`, at most two nodes total in v2.0 per decision #83).
2.**A.2** Topology subscription — coordinator re-reads on `sp_PublishGeneration` confirmation so an operator role-swap takes effect after publish (no process restart needed).
### Stream B — Peer health probing + ServiceLevel computation (6 days, widened)
1.**B.1**`PeerHttpProbeLoop` per peer at 2 s — calls peer's `/healthz`, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
2.**B.2**`PeerUaProbeLoop` per peer at 10 s — opens an OPC UA client session to the peer (reuses Phase 5 `Driver.OpcUaClient` stack), reads peer's `ServiceLevel` node + verifies endpoint serves data. Short-circuit: if HTTP probe is failing, skip UA probe (no wasted sessions).
3.**B.3**`ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte`. 8-state matrix per §Scope. `topologyValid=false` forces InvalidTopology = 2 regardless of other inputs. A calculator sketch follows this list.
4.**B.4**`RecoveryStateManager`: after a `Faulted → Healthy` transition, hold the node in the `Recovering` band (180 Primary / 30 Backup) for `RecoveryDwellTime` (default 60 s) AND require one positive publish witness (successful `Read` on a reference node) before entering the Authoritative band.
5.**B.5** Calculator reacts to inputs via `IObserver` so changes immediately push to the OPC UA `ServiceLevel` node.
6.**B.6** Tests: **64-case matrix** covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.
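A minimal sketch of the B.3 computation, assuming invented `Role` / `SelfHealth` enums. The band values come from the Scope table; the precedence order among degraded bands (topology > maintenance > fault > apply > recovery > isolation) is inferred, with only apply-dominates-peer-unreachable confirmed by C.4.

```csharp
public enum Role { Primary, Backup }
public enum SelfHealth { Healthy, Faulted, Maintenance }

public static class ServiceLevelCalculator
{
    public static byte Compute(
        Role role, SelfHealth selfHealth,
        bool peerHttpHealthy, bool peerUaHealthy,
        bool applyInProgress, bool recoveryDwellMet, bool topologyValid)
    {
        // peerHttpHealthy only gates probing economics; authority comes from the UA probe.
        if (!topologyValid) return 2;                        // InvalidTopology, regardless of other inputs
        if (selfHealth == SelfHealth.Maintenance) return 0;  // operator-declared maintenance only
        if (selfHealth == SelfHealth.Faulted) return 1;      // NoData band

        bool primary = role == Role.Primary;
        if (applyInProgress) return primary ? (byte)200 : (byte)50;   // mid-apply dominates peer state (C.4)
        if (!recoveryDwellMet) return primary ? (byte)180 : (byte)30; // recovering band
        if (!peerUaHealthy) return primary ? (byte)230 : (byte)80;    // isolated band
        return primary ? (byte)255 : (byte)100;                       // authoritative band
    }
}
```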
### Stream C — OPC UA node wiring (3 days)
1.**C.1**`ServiceLevel` variable node created under `ServerStatus` at server startup. Type `Byte`, AccessLevel = CurrentRead only. Subscribe to `ServiceLevelCalculator` observable; push updates via `DataChangeNotification`.
2.**C.2**`ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, **includes self + peers** with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
3.**C.3**`RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Unsupported values (`Transparent`, `HotAndMirrored`) are rejected **pre-publish** by validator — runtime never sees them.
4.**C.4** Client.CLI cutover test: connect to primary, read `ServiceLevel` → 255; pause primary apply → 200; unreachable peer while apply in progress → 200 (apply dominates peer-unreachable per matrix); client sees peer via `ServerUriArray`; fail primary → client reconnects to peer at 80 (isolated-backup band).
### Stream D — Apply-window integration (3 days)
1.**D.1**`sp_PublishGeneration` caller wraps the apply in `await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId)`. Lease keyed to `(ConfigGenerationId, PublishRequestId)` so concurrent publishes stay isolated. Disposal decrements on every exit path. A lease sketch follows this stream's list.
2.**D.2**`ApplyLeaseWatchdog` auto-closes leases older than `ApplyMaxDuration` (default 10 min) so a crashed publisher can't pin the node at mid-apply.
3.**D.3** Pre-publish validator in `sp_PublishGeneration` rejects unsupported `RedundancyMode` values (`Transparent`, `HotAndMirrored`) with a clear error message — runtime never sees an invalid mode.
4.**D.4** Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via `ThreadAbort` / cancellation → watchdog closes; (c) publish rejected for `Transparent` → operator-actionable error.
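A compact sketch of the lease mechanics from D.1/D.2, assuming a `ConcurrentDictionary`-backed registry; the shipped `ApplyLeaseRegistry` may differ internally. The two properties the plan requires are shown: `await using` releases on every exit path, and a watchdog sweep bounds any leaked lease at `ApplyMaxDuration`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class ApplyLeaseRegistry
{
    private readonly ConcurrentDictionary<(long Generation, Guid PublishRequest), DateTimeOffset> _open = new();

    // Caller: await using var lease = registry.BeginApplyLease(gen, req);
    public IAsyncDisposable BeginApplyLease(long generationId, Guid publishRequestId)
    {
        var key = (generationId, publishRequestId);
        _open[key] = DateTimeOffset.UtcNow;                    // node is now mid-apply
        return new Lease(() => _open.TryRemove(key, out _));   // released on success, exception, or cancel
    }

    public bool ApplyInProgress => !_open.IsEmpty;

    // Watchdog tick: a crashed publisher can't pin the node at the mid-apply band.
    public void CloseExpired(TimeSpan applyMaxDuration)
    {
        foreach (var (key, openedAt) in _open)
            if (DateTimeOffset.UtcNow - openedAt > applyMaxDuration)
                _open.TryRemove(key, out _);
    }

    private sealed class Lease(Action release) : IAsyncDisposable
    {
        public ValueTask DisposeAsync()
        {
            release();
            return ValueTask.CompletedTask;
        }
    }
}
```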
### Stream E — Admin UI + metrics (3 days)
1.**E.1**`RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
2.**E.2** OpenTelemetry meter export: `ot_opcua_service_level{cluster,node}` gauge + `ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua}` + `ot_opcua_apply_in_progress{cluster,node}` + `ot_opcua_topology_valid{cluster}`. Sink via Phase 6.1 observability.
3.**E.3** SignalR push: `FleetStatusHub` broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
### Stream F — Client-interoperability matrix (3 days, new)
1.**F.1** Validate ServiceLevel-driven cutover against **Ignition 8.1 + 8.3**, **Kepware KEPServerEX 6.x**, **Aveva OI Gateway 2020R2 + 2023R1**. For each: configure the client with both endpoints, verify it honors `ServiceLevel` + `ServerUriArray` during primary failover.
2.**F.2** Clients that don't honour the standards (a documentation finding — may include Kepware and OI Gateway per the Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in `docs/Redundancy.md`.
3.**F.3** Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill Primary, assert Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)`. Document required session-timeout config in `docs/Redundancy.md`.
## Compliance Checks (run at exit gate)
- [ ]**OPC UA band compliance**: `0=Maintenance` reserved, `1=NoData` reserved. Operational states in 2..255 per 8-state matrix.
- [ ]**UaHealthProbe authority**: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
- [ ]**Apply-lease disposal**: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
- [ ]**Transparent-mode rejection**: attempting to publish `RedundancyMode=Transparent` is blocked at `sp_PublishGeneration`; runtime never sees an invalid mode.
- [ ]**Role transition via operator publish**: FleetAdmin swaps `RedundancyRole` in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
- [ ]**Client.CLI cutover**: with primary halted, Client.CLI that was connected to primary sees primary drop + reconnects to backup via `ServerUriArray`.
- [ ]**Client interoperability matrix** (Stream F): Ignition 8.1 + 8.3 honour ServiceLevel; Kepware + Aveva OI Gateway findings documented.
- [ ]**Galaxy MXAccess failover**: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
- [ ] No regression in existing driver test suites; no regression in `/healthz` reachability under redundancy load.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in `sp_PublishGeneration`. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; only declares a peer unreachable after 3 consecutive failed probes (sketch after this table) |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in `docs/Redundancy.md` — non-transparent redundancy requires client cooperation; per-client compliance is unverified until the Stream F interop matrix runs (Ignition, Kepware, Aveva OI Gateway). Unit-test the advertised values; field behavior is client-responsibility |
| Apply-window lease leaks on exception | Low | High | `BeginApplyLease` returns an async-disposable lease; `await using` enforces release on every exit path; `ApplyLeaseWatchdog` force-closes abandoned leases after `ApplyMaxDuration`; unit test for the exception-in-apply path |
| `HttpClient` probe leaks sockets | Low | Medium | Single shared `HttpClient` per coordinator (not per-probe); timeouts tight to avoid keeping connections open during peer downtime |
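A sketch of the smoothing mitigation above, assuming a per-peer consecutive-failure counter; the threshold of 3 is from the table, the class shape is invented.

```csharp
public sealed class PeerReachabilitySmoother
{
    private const int Threshold = 3;   // consecutive failures before declaring the peer unreachable
    private int _consecutiveFailures;

    public bool PeerReachable { get; private set; } = true;

    public void RecordProbe(bool success)
    {
        if (success)
        {
            _consecutiveFailures = 0;
            PeerReachable = true;      // a single good probe restores reachability
        }
        else if (++_consecutiveFailures >= Threshold)
        {
            PeerReachable = false;     // flips only on the third failed probe in a row
        }
    }
}
```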
## Adversarial Review (Codex)
1.**Crit · ACCEPT** — No publish-generation fencing enables a split publish that advertises both nodes as authoritative. **Change**: coordinator CAS on a monotonic `ConfigGenerationId`; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
2.**Crit · ACCEPT** — `>1 Primary` at startup covered but runtime containment missing when invalid topology appears later (mid-apply race). **Change**: add runtime `InvalidTopology` state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
3.**High · ACCEPT** — `0 = Faulted` collides with OPC UA Part 5 §6.3.34 semantics where 0 means **Maintenance** and 1 means NoData. **Change**: reserve **0** for operator-declared maintenance-mode only; Faulted/unreachable uses **1** (NoData); in-range degraded states occupy 2..199.
4.**High · ACCEPT** — Matrix collapses distinct operational states onto the same value. **Change**: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
5.**High · ACCEPT** — `/healthz` from 6.1 is HTTP-healthy but doesn't guarantee OPC UA data plane. **Change**: add a redundancy-specific probe `UaHealthProbe` — issues a `ReadAsync(ServiceLevel)` against the peer's OPC UA endpoint via a lightweight client session. `/healthz` remains the fast-fail; the UA probe is the authority signal.
6.**High · ACCEPT** — `ServerUriArray` must include self + peers, not peers only. **Change**: array contains `[self.ApplicationUri, peer.ApplicationUri]` in stable deterministic ordering; compliance test asserts local-plus-peer membership.
7.**Med · ACCEPT** — No `Faulted → Recovering → Healthy` path. **Change**: add `Recovering` state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
8.**Med · ACCEPT** — Topology change during in-flight probe undefined. **Change**: every probe task tagged with `ConfigGenerationId` at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
9.**Med · ACCEPT** — Apply-window counter race on exception/cancellation/async ownership. **Change**: apply-window is a named lease keyed to `(ConfigGenerationId, PublishRequestId)` with disposal enforced via `await using`; watchdog detects leased-but-abandoned and force-closes after `ApplyMaxDuration` (default 10 min).
10.**High · ACCEPT** — Ignition + Kepware + Aveva OI Gateway `ServiceLevel` compliance is unverified. **Change**: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel documented as requiring explicit backup-endpoint config.
11.**Med · ACCEPT** — Galaxy MXAccess re-session on Primary death not in acceptance. **Change**: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, asserts Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)` budget. `docs/Redundancy.md` updated with required session timeouts.
12.**Med · ACCEPT** — Transparent-mode startup rejection is outage-prone. **Change**: `sp_PublishGeneration` validates `RedundancyMode` pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.
Close the Admin UI feature-completeness checklist that the Phase 1 Stream E exit gate left open. Each item below is an existing `phase-1-configuration-and-admin-scaffold.md` completion-checklist entry that is currently unchecked.
Gaps to close:
1.**UNS Structure tab drag/move with impact preview** — decision #115 + `admin-ui.md` §"UNS". Current state: list-only render; no drag reorder; no "X lines / Y equipment impacted" preview.
2.**Equipment CSV import + 5-identifier search** — decision #95 + #117. Current state: basic form; no CSV parser; search indexes only ZTag.
3.**Draft-generation diff viewer** — enhance existing `DiffViewer.razor` to show generation-diff not just staged-edit diff; highlight ACL grant changes (lands after Phase 6.2).
4.**`_base` equipment-class Identification fields exposure** — decision #138–139. Columns exist on `Equipment`; no Admin UI field group; no address-space exposure of the OPC 40010 sub-folder.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Admin/Pages/UnsTab.razor` | Tree component with drag-drop using **`MudBlazor.TreeView` + `MudBlazor.DropTarget`** (existing transitive dep — no new third-party package). Native HTML5 DnD rejected because virtualization + DnD on 500+ nodes doesn't combine reliably. Each drag fires a "Compute Impact" call carrying a `DraftRevisionToken`; modal preview ("Moving Line 'Oven-2' from 'Packaging' to 'Assembly' will re-home 14 equipment + re-parent 237 tags"). **Confirm step re-checks the token** and rejects with a `409 Conflict / refresh-required` modal if the draft advanced between preview and commit. |
| `Admin/Services/UnsImpactAnalyzer.cs` | New service. Given a move-operation (line move, area rename, line merge), computes cascade counts + `DraftRevisionToken` at preview time. Pure-function shape; testable in isolation. |
| `Admin/Pages/EquipmentTab.razor` | Add CSV-import button → modal with file picker + dry-run preview. **Identifier search** uses the canonical decision #117 set: `ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid`. Typeahead probes each column with a ranking query (exact match score 100 → prefix 50 → opt-in LIKE 20; published > draft tie-break). Result row shows which field matched via trailing badge. |
| **Staged-import table**`EquipmentImportBatch` | New entity `{ Id, CreatedAtUtc, CreatedBy, RowsStaged, RowsAccepted, RowsRejected, FinalisedAtUtc? }` + child `EquipmentImportRow` records. Import writes rows in chunks to the staging table (not to `Equipment`). `FinaliseImportBatch` is the atomic finalise step that applies all accepted rows to `Equipment` + `ExternalIdReservation` in one transaction — short + bounded regardless of input size. Rollback = drop the batch row; `Equipment` never partially mutates. (Transaction sketch after this table.) |
| `Admin/Pages/DraftEditor.razor` + `DiffViewer.razor` | Diff viewer refactored into a base component + section plugins: `StructuralDiffSection`, `EquipmentDiffSection`, `TagDiffSection`, `AclDiffSection` (Phase 6.2), `RedundancyDiffSection` (Phase 6.3), `IdentificationDiffSection`. Each section has a **1000-row hard cap**; over-cap renders an aggregate summary + "Load full diff" button streaming 500-row pages via SignalR. Subtree-rename diffs (decision #115 bulk restructure) surface as summary only by default. |
| `Admin/Components/IdentificationFields.razor` | New component. Renders the OPC 40010 field set **per decision #139**: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. `ProductInstanceUri / DeviceRevision / MonthOfConstruction` dropped from this phase — they need a separate decision-log widening. |
| `OtOpcUa.Server/OpcUa/DriverNodeManager` — Equipment folder build | When an `Equipment` row has non-null Identification fields, the server adds an `Identification` sub-folder under the Equipment node containing one variable per non-null field. **ACL binding**: the sub-folder + variables inherit the `Equipment` scope's grants from Phase 6.2's trie — no new scope level added. Documented in `acl-design.md` cross-reference update. |
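A hedged sketch of the finalise step from the staged-import row, assuming an EF Core `AppDbContext` exposing the two staging sets plus elided `ToEquipment` / `ToReservation` mappers. Only the shape is asserted by the plan: one short transaction, bulk apply, all-or-nothing.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// AppDbContext, ToEquipment and ToReservation are assumed stand-ins.
public async Task FinaliseImportBatchAsync(AppDbContext db, Guid batchId, CancellationToken ct)
{
    await using var tx = await db.Database.BeginTransactionAsync(ct);

    var accepted = await db.EquipmentImportRows
        .Where(r => r.BatchId == batchId && r.IsAccepted)
        .ToListAsync(ct);

    // The atomic step is a bulk apply, so the transaction stays short
    // no matter how long the chunked staging inserts took.
    db.Equipment.AddRange(accepted.Select(ToEquipment));
    db.ExternalIdReservations.AddRange(accepted.Select(ToReservation));

    await db.SaveChangesAsync(ct);
    await tx.CommitAsync(ct);   // rollback path is DropImportBatch deleting the batch row instead
}
```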
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Admin UI visual language | Bootstrap 5 / cookie auth / sidebar layout unchanged — consistency with ScadaLink design reference. |
| LDAP auth flow | Already shipped in Phase 1. Phase 6.4 is additive UI only. |
| Equipment-class *template schema validation* | Still deferred (decision #112 — schemas repo not landed). We expose the Identification fields but don't validate against a template hierarchy. |
| Drag/move to *other clusters* | Out of scope — equipment is cluster-scoped per decision #82. Cross-cluster migration is a different workflow. |
## Entry Gate Checklist
- [ ] Phase 6.2 merged (ACL grants are part of the new diff viewer sections)
- [ ] Phase 6.3 merged (redundancy-role changes are part of the diff viewer)
- [ ]`phase-1-configuration-and-admin-scaffold.md` §Stream E completion checklist re-read — confirm these are the remaining items
- [ ]`admin-ui.md` re-skimmed for screen layouts
- [ ] Existing `EquipmentTab.razor` / `UnsTab.razor` / `DraftEditor.razor` diff'd against what ships today so the edits are additive not destructive
- [ ] Dev Galaxy available for OPC 40010 exposure smoke testing
## Task Breakdown
### Stream A — UNS drag/reorder + impact preview (5 days)
1.**A.1** 1000-node synthetic seed fixture. Drag-latency bench against `MudBlazor.TreeView` + `MudBlazor.DropTarget` — commit to the component if latency budget (100 ms drag-enter feedback) holds; fall back to flat-list reorder UI (Area/Line dropdowns) with loss of visual drag affordance otherwise.
3.**A.3** Modal preview wired to `UnsImpactAnalyzer`. **Confirm** re-reads the current draft revision + compares against the preview's token; if the draft advanced (another operator saved a different edit), show a `409 Conflict / refresh-required` modal rather than silently overwriting. A confirm-check sketch follows this list.
4.**A.4** Cross-cluster drop attempts: target disabled + toast "Equipment is cluster-scoped (decision #82). To move across clusters, use Export → Import on the Cluster detail page." Plus help link.
5.**A.5** Playwright (or equivalent) smoke test: drag a line across areas, assert modal shows right counts, assert draft row reflects the move; concurrent-edit test runs two sessions + asserts the later Confirm hits the 409.
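A sketch of the A.3 confirm re-check, assuming a minimal-API style handler and an invented `_drafts` service; `DraftRevisionToken` and the 409 semantics are the plan's, the rest is illustrative.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public sealed record MovePreview(Guid DraftRevisionToken, int ImpactedEquipment, int ImpactedTags);

// _drafts is an assumed service exposing the current draft revision + move apply.
public async Task<IResult> ConfirmMoveAsync(Guid draftId, MovePreview preview)
{
    Guid current = await _drafts.GetRevisionTokenAsync(draftId);

    // Another operator saved an edit between preview and Confirm: reject, never overwrite.
    if (current != preview.DraftRevisionToken)
        return Results.Conflict("refresh-required: draft advanced since the preview was generated");

    await _drafts.ApplyMoveAsync(draftId, preview);
    return Results.Ok();
}
```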
### Stream B — Equipment CSV import + 5-identifier search
2.**B.2**`EquipmentImportBatch` + `EquipmentImportRow` staging tables (migration). Import writes preview rows to staging via chunked inserts; staging never blocks `Equipment` or `ExternalIdReservation`. Preview query reads staging + validates each row against the current `Equipment` state + `ExternalIdReservation` freshness.
3.**B.3**`ImportPreview` UI — per-row accept/reject table. Reject reasons: "ZTag already exists in draft", "ExternalIdReservation conflict with Cluster X", "UnsLineName not found in draft UNS tree", etc. Operator reviews + clicks "Commit".
4.**B.4**`FinaliseImportBatch` — atomic finalise. One EF transaction applies accepted rows to `Equipment` + `ExternalIdReservation`; duration bounded regardless of input size (the atomic step is a bulk-insert, not per-row). Rollback = drop batch row via `DropImportBatch`; `Equipment` never partially mutates.
5.**B.5** Five-identifier search. Rank SQL: exact match any identifier = score 100, prefix match = 50, LIKE-fuzzy (opt-in via `?fuzzy=true`) = 20; tie-break `published > draft` then `RowVersion DESC`. Typeahead shows which field matched via trailing badge. A query sketch follows this list.
6.**B.6** Smoke tests: 100-row CSV with 10 conflicts (5 ZTag dupes, 3 reservation clashes, 2 missing UnsLines); 10k-row perf test asserting finalize txn < 30 s; concurrent import + external `ExternalIdReservation` insert test asserts retryable-conflict handling.
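The B.5 ranking expressed as a single parameterized T-SQL probe embedded in C#. The identifier columns are the decision #117 set; `IsPublished`, the `@fuzzy` flag, the abbreviated prefix probe, and the overall SQL shape are assumptions (the risks table suggests the production query is a UNION of per-index lookups).

```csharp
// @q is the search term; @fuzzy is the explicit operator opt-in for LIKE.
const string SearchSql = """
    SELECT TOP 20 e.*, s.Score
    FROM Equipment e
    CROSS APPLY (SELECT MAX(v.Score) AS Score FROM (VALUES
        (CASE WHEN e.ZTag = @q OR e.MachineCode = @q OR e.SAPID = @q
                   OR e.EquipmentId = @q THEN 100 ELSE 0 END),     -- exact match on any identifier
        (CASE WHEN e.ZTag LIKE @q + '%' OR e.MachineCode LIKE @q + '%'
                   THEN 50 ELSE 0 END),                            -- prefix (two columns shown for brevity)
        (CASE WHEN @fuzzy = 1 AND e.ZTag LIKE '%' + @q + '%'
                   THEN 20 ELSE 0 END)) AS v(Score)) s             -- opt-in fuzzy
    WHERE s.Score > 0
    ORDER BY s.Score DESC, e.IsPublished DESC, e.RowVersion DESC;  -- published > draft, then RowVersion
    """;
```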
### Stream C — Diff viewer enhancements (4 days)
1.**C.1** Refactor `DiffViewer.razor` into a base component + section plugins. Plugins: `StructuralDiffSection` (UNS tree), `EquipmentDiffSection`, `TagDiffSection`, `AclDiffSection` (Phase 6.2), `RedundancyDiffSection` (Phase 6.3), `IdentificationDiffSection`.
2.**C.2** Each section renders collapsed by default; counts + top-line summary always visible. **1000-row hard cap** per section — over-cap sections render aggregate summary (e.g. "237 equipment re-parented from Packaging to Assembly") with a "Load full diff" button that streams 500-row pages via SignalR.
3.**C.3** Subtree-rename diffs (decision #115 bulk restructure) surface as summary only by default regardless of row count.
4.**C.4** Tests: seed two generations with deliberate diffs; assert every section reports the right counts + top-line summary + hard-cap behavior.
### Stream D — OPC 40010 Identification exposure (3 days)
1.**D.1**`IdentificationFields.razor` component. Renders the **9 decision #139 fields**: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Labelled inputs; nullable columns show empty input; required-field validation on commit only.
2.**D.2**`DriverNodeManager` equipment-folder builder — after building the equipment node, inspect the 9 Identification columns; if any non-null, add an `Identification` sub-folder with variable-per-non-null-field. ACL binding: sub-folder + variables inherit the **same `ScopeId` as the Equipment node** (Phase 6.2's trie treats them as part of the Equipment scope — no new scope level). A builder sketch follows this list.
3.**D.3** Address-space smoke test via Client.CLI: browse an equipment node, assert `Identification` sub-folder present when columns are set, absent when all null, variables match the field values.
4.**D.4** ACL integration test: a user with Equipment-level grant reads the `Identification` variables without needing a separate grant; a user without the Equipment grant gets `BadUserAccessDenied` on both the Equipment node + its Identification variables.
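A sketch of the D.2 builder step. `AddFolder` / `AddVariable` are assumed node-manager helpers (not Opc.Ua SDK calls) and the column types are simplified to strings; the two load-bearing details, null-gating and the inherited `ScopeId`, are D.2's.

```csharp
using System;
using System.Linq;

// Equipment, AddFolder and AddVariable are assumed stand-ins for the real
// DriverNodeManager types and helpers.
private void BuildIdentificationFolder(object equipmentNode, Equipment row, Guid equipmentScopeId)
{
    var fields = new (string Name, string? Value)[]
    {
        ("Manufacturer", row.Manufacturer), ("Model", row.Model),
        ("SerialNumber", row.SerialNumber), ("HardwareRevision", row.HardwareRevision),
        ("SoftwareRevision", row.SoftwareRevision), ("YearOfConstruction", row.YearOfConstruction),
        ("AssetLocation", row.AssetLocation), ("ManufacturerUri", row.ManufacturerUri),
        ("DeviceManualUri", row.DeviceManualUri),
    };

    // All columns null => no Identification sub-folder at all (D.3 compliance check).
    if (fields.All(f => f.Value is null)) return;

    // Same ScopeId as the Equipment node: the ACL trie adds no new scope level.
    var folder = AddFolder(equipmentNode, "Identification", equipmentScopeId);
    foreach (var (name, value) in fields)
        if (value is not null)   // null columns never surface as empty-string variables
            AddVariable(folder, name, value, equipmentScopeId);
}
```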
## Compliance Checks (run at exit gate)
- [ ]**UNS drag/move**: drag a line across areas; modal preview shows correct impacted-equipment + impacted-tag counts.
- [ ]**Concurrent-edit safety**: two-session test — session B saves a draft edit after session A opened the preview; session A's Confirm returns `409 Conflict / refresh-required` instead of overwriting.
- [ ]**Cross-cluster drop**: dropping equipment across cluster boundaries is disabled + shows actionable toast pointing to Export/Import workflow.
- [ ]**1000-node tree**: drag operations on a 1000-node seed maintain < 100 ms drag-enter feedback.
- [ ]**CSV header version**: file missing `# OtOpcUaCsv v1` first line is rejected pre-parse.
- [ ]**CSV canonical identifier set**: columns match decision #117 (ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid); drift from the earlier draft surfaces as a test failure.
- [ ]**Staged-import atomicity**: `FinaliseImportBatch` transaction bounded < 30 s for a 10k-row import; pre-finalize stagings visible only to the importing user; rollback via `DropImportBatch`.
- [ ]**Concurrent import + external reservation**: concurrent test — third party inserts to `ExternalIdReservation` mid-finalize; finalize retries with conflict handling; no corruption.
- [ ]**5-identifier search ranking**: exact matches outrank prefix matches; published outranks draft for equal scores.
- [ ]**Diff viewer section caps**: 2000-row subtree-rename diff renders as summary only; "Load full diff" streams in pages.
- [ ]**OPC 40010 field list match**: rendered field group matches decision #139 exactly; no extra fields.
- [ ]**OPC 40010 exposure**: Client.CLI browse shows `Identification` sub-folder when equipment has non-null columns; absent when all null.
- [ ]**ACL inheritance for Identification**: integration test — Equipment-grant user reads Identification; no-grant user gets `BadUserAccessDenied` on both.
- [ ]**Visual parity reviewer**: named role (`FleetAdmin` user, not the implementation lead) compares side-by-side against `admin-ui.md` §Visual-Design reference panels; signoff artefact is a checked-in screenshot set under `docs/v2/visual-compliance/phase-6-4/`.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| UNS drag-drop janky on large trees (>500 nodes) | Medium | Medium | Virtualize the tree component; default-collapse nested areas; test with a synthetic 1000-equipment seed |
| CSV import performance on 10k-row imports | Medium | Medium | Stream-parse rather than load-into-memory; preview renders in batches of 100; commit is chunked-EF-insert with progress bar |
| Diff viewer becomes unwieldy with many sections | Low | Medium | Each section collapsed by default; top-line summary row always shown; Phase 6.4 caps at 6 sections |
| OPC 40010 sub-folder accidentally exposes NULL/empty identification columns as empty-string variables | Low | Low | Column null-check in the builder; drop variables whose DB value is null |
| 5-identifier search pulls full table | Medium | Medium | Indexes on each of ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid; search query uses a UNION of 5 indexed lookups; falls back to LIKE only on explicit operator opt-in |
## Adversarial Review — 2026-04-19 (Codex, via `codex-rescue` subagent)
1.**Crit · ACCEPT** — Stale UNS impact preview can overwrite concurrent draft edits. **Change**: each preview carries a `DraftRevisionToken`; `Confirm` compares against the current draft + rejects with a `409 Conflict / refresh-required` modal if any draft edit landed since the preview was generated. Stream A.3 updated.
2.**High · ACCEPT** — CSV import atomicity is internally contradictory (single EF transaction vs. chunked inserts). **Change**: one explicit model — staged-import table (`EquipmentImportBatch { Id, CreatedAtUtc, RowsStaged, RowsAccepted, RowsRejected }`) receives rows in chunks; final `FinaliseImportBatch` is atomic over `Equipment` + `ExternalIdReservation`. Rollback is "drop the batch row" — the real Equipment table is never partially mutated.
3.**Crit · ACCEPT** — Identifier contract rewrite mis-cites decisions. **Change**: revert to the `admin-ui.md` + decision #117 canonical set — `ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid`. CSV header follows that set verbatim. Introduce a separate decision entry for versioned CSV header shape before adding any new column; CSV header row must start with `# OtOpcUaCsv v1` so future shape changes are unambiguous. A parser-dispatch sketch follows this review list.
4.**Med · ACCEPT** — Search ordering undefined. **Change**: rank SQL — exact match on any identifier scores 100; prefix match 50; LIKE-fuzzy 20; published > draft tie-breaker; `ORDER BY score DESC, RowVersion DESC`. Typeahead shows which field matched via trailing badge.
5.**High · ACCEPT** — HTML5 DnD on virtualized tree is aspirational. **Change**: Stream A.2 rewritten — commits to **`MudBlazor.TreeView` + `MudBlazor.DropTarget`** (already a transitive dep via the existing Admin UI). Build a 1000-node synthetic seed in A.1 + validate drag-latency budget before implementing impact preview. If MudBlazor can't hit the budget, fall back to a flat-list reorder UI with Area/Line dropdowns (loss of visual drag affordance but unblocks the feature).
6.**Med · ACCEPT** — Collapsed-by-default doesn't handle generation-sized diffs. **Change**: each diff section has a hard row cap (1000 by default). Over-cap sections render an aggregate summary + "Load full diff" button that streams via SignalR in 500-row pages. Decision #115 subtree renames surface as a "N equipment re-parented under X → Y" summary instead of row-by-row.
7.**High · ACCEPT** — OPC 40010 field list doesn't match decision #139. **Change**: field group realigned to `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. `ProductInstanceUri / DeviceRevision / MonthOfConstruction` dropped from Phase 6.4 — they belong to a future OPC 40010 widening decision.
8.**High · ACCEPT** — `Identification` subtree unreconciled with ACL hierarchy (Phase 6.2 6-level scope). **Change**: address-space builder creates the Identification sub-folder under the Equipment node **with the same ScopeId as Equipment** — no new scope level. ACL evaluator treats `…/Equipment/Identification/X` as inheriting the `Equipment` scope's grants. Documented in Phase 6.2's `acl-design.md` cross-reference update.
9.**Low · ACCEPT** — Visual-review gate names nonexistent reviewer role. **Change**: rubric defined — a named "Admin UX reviewer" (role `FleetAdmin` user, not the implementation lead) compares side-by-side screenshots against the `admin-ui.md` §Visual-Design reference panels; signoff artefact is a checked-in screenshot set under `docs/v2/visual-compliance/phase-6-4/`.
10.**Med · ACCEPT** — Cross-cluster drag/drop lacks loud failure path. **Change**: on drop across cluster boundary, disable the drop target + show a toast "Equipment is cluster-scoped (decision #82). To move across clusters, use the Export → Import workflow on the Cluster detail page." Plus a help link. Tested in Stream A.4.
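A sketch of the version-marker gate from item 3. `ParseV1`, `EquipmentImportRow`, and the method shape are assumptions; the literal header string is the decided contract.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static IReadOnlyList<EquipmentImportRow> ParseCsv(TextReader reader)
{
    // Reject before any column parsing: the first line carries the versioned contract.
    string? header = reader.ReadLine();
    if (header is null || !header.StartsWith("# OtOpcUaCsv v1", StringComparison.Ordinal))
        throw new FormatException(
            $"Unsupported CSV header '{header ?? "<empty>"}'; expected '# OtOpcUaCsv v1'.");

    return ParseV1(reader);   // a future shape change bumps the version and forks here
}
```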
3.`dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests` —
tests auto-skip when the endpoint is unreachable. Default endpoint is
`localhost:5020`; override via `MODBUS_SIM_ENDPOINT` for a real PLC on its
native port 502.
## Per-device quirk catalog
### AutomationDirect DL205 / DL260
First known target device family. **Full quirk catalog with primary-source citations
and per-quirk integration-test names lives at [`dl205.md`](dl205.md)** — that doc is
the reference; this section is the testing roadmap.
Confirmed quirks (priority order — top items are highest-impact for our driver and ship first as PR 41+; a decode/addressing sketch follows the table):
| Quirk | Driver impact | Integration-test name |
|---|---|---|
| **String packing**: 2 chars/register, **first char in low byte** (opposite of generic Modbus) | `ModbusDataType.String` decoder must be configurable per-device family — current code assumes high-byte-first | `DL205_String_low_byte_first_within_register` |
| **Word order CDAB** for Int32/UInt32/Float32 | Already configurable via `ModbusByteOrder.WordSwap`; default per device profile | `DL205_Int32_word_order_is_CDAB` |
| **BCD-as-default** numeric storage (only IEEE 754 when ladder uses `R` type) | New decoder mode — register reads as `0x1234` for ladder value `1234`, not as decimal `4660` | `DL205_BCD_register_decodes_as_hex_nibbles` |
| **FC16 capped at 100 registers** (below the spec's 123) | Bulk-write batching must cap per-device-family | `DL205_FC16_101_registers_returns_IllegalDataValue` |
| **FC03/04 capped at 128** (above the spec's 125) | Less impactful — clients that respect the spec's 125 stay safe | `DL205_FC03_129_registers_returns_IllegalDataValue` |
| **V-memory octal-to-decimal addressing** (V2000 octal → 0x0400 decimal) | New address-format helper in profile config so operators can write `V2000` instead of computing `1024` themselves | `DL205_Vmem_V2000_maps_to_PDU_0x0400` |
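A worked C# sketch of two of the quirks above, the BCD decode and the octal V-memory mapping. Helper names are inventions; the expected values (`0x1234` decodes to 1234, `V2000` maps to PDU `0x0400`) come straight from the table rows.

```csharp
using System;

public static class Dl205Quirks
{
    // BCD: each nibble holds a decimal digit, so 0x1234 means ladder value 1234
    // (a generic binary decoder would read 4660).
    public static int DecodeBcd(ushort register)
    {
        int value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (register >> shift) & 0xF;
            if (nibble > 9) throw new ArgumentException($"Not BCD: 0x{register:X4}");
            value = value * 10 + nibble;
        }
        return value;
    }

    // V-memory addresses are octal: "V2000" is 2000 octal = 1024 decimal = PDU 0x0400.
    public static int VmemToPduAddress(string vAddress) =>
        Convert.ToInt32(vAddress.TrimStart('V', 'v'), fromBase: 8);
}
```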
Each step leaves the system runnable. The generic extraction is effectively free.
| # | Decision | Rationale | Date |
|---|----------|-----------|------|
| 140 | Enterprise shortname = `zb` (UNS level-1 segment) | Closes corrections-doc D4. Matches the existing `ZB.MOM.WW.*` namespace prefix used throughout the codebase; short by design since this segment appears in every equipment path (`zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState`); operators already say "ZB" colloquially. Admin UI cluster-create form default-prefills `zb` for the Enterprise field. Production deployments use it directly from cluster-create | 2026-04-17 |
| 141 | Tier 3 (AppServer IO) cutover is feasible — AVEVA's OI Gateway supports arbitrary upstream OPC UA servers as a documented pattern | Closes corrections-doc E2 with **GREEN-YELLOW** verdict. Multiple AVEVA partners (Software Toolbox, InSource) have published working integrations against four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, Cogent DataHub). No re-architecting of OtOpcUa required. Path: `OPC UA node → OI Gateway → SuiteLink → $DDESuiteLinkDIObject → AppServer attribute`. Recommended AppServer floor: System Platform 2023 R2 Patch 01. Two integrator-burden risks tracked: validation/GxP paperwork (no AVEVA blueprint exists for non-AVEVA upstream servers in Part 11 deployments) and unpublished scale benchmarks (in-house benchmark required before cutover scheduling). See `aveva-system-platform-io-research.md` | 2026-04-17 |
| 142 | Phase 1 acceptance includes an end-to-end AppServer-via-OI-Gateway smoke test against OtOpcUa | Catches AppServer-specific quirks (cert exchange via reject-and-trust workflow, endpoint URL must NOT include `/discovery` suffix per Inductive Automation forum failure mode, service-account install required because OI Gateway under SYSTEM cannot connect to remote OPC servers, `Basic256Sha256` + `SignAndEncrypt` + LDAP-username token combination must work end-to-end) early — well before the Year 3 tier-3 cutover schedule. Adds one task to `phase-1-configuration-and-admin-scaffold.md` Stream E (Admin smoke test) | 2026-04-17 |
| 143 | Polly per-capability policy — Read / HistoryRead / Discover / Probe / Alarm-subscribe auto-retry; Write does NOT auto-retry unless the tag metadata carries `[WriteIdempotent]` | Decisions #44-45 forbid auto-retry on Write because a timed-out write can succeed on the device + be replayed by the pipeline, duplicating pulses / alarm acks / counter increments / recipe-step advances. Per-capability policy in the shared Polly layer makes the retry safety story explicit; `WriteIdempotentAttribute` on tag definitions is the opt-in surface | 2026-04-19 |
| 144 | Polly pipeline key = `(DriverInstanceId, HostName)`, not DriverInstanceId alone | Decision #35 requires per-device isolation. One dead PLC behind a multi-device Modbus driver must NOT open the circuit breaker for healthy sibling hosts. Per-instance pipelines would poison every device behind one bad endpoint | 2026-04-19 |
| 145 | Tier A/B/C runtime enforcement splits into `MemoryTracking` (all tiers — soft/hard thresholds log + surface, NEVER kill) and `MemoryRecycle` (Tier C only — requires out-of-process topology). Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; the runtime never auto-kills an in-process driver | Decisions #73-74 reserve process-kill protections for Tier C. An in-process Tier A/B "recycle" would kill every OPC UA session + every other in-proc driver for one leaky instance, blast-radius worse than the leak | 2026-04-19 |
| 146 | Memory watchdog uses the hybrid formula `soft = max(multiplier × baseline, baseline + floor)`, with baseline captured as the median of the first 5 min of `GetMemoryFootprint()` samples post-InitializeAsync. Tier-specific constants: A multiplier=3 floor=50 MB, B multiplier=3 floor=100 MB, C multiplier=2 floor=500 MB. Hard = 2 × soft (worked sketch after this table) | Codex adversarial review on the Phase 6.1 plan flagged that hardcoded per-tier MB bands diverge from decision #70's specified formula. Static bands false-trigger on small-footprint drivers + miss meaningful growth on large ones. Observed-baseline + hybrid formula recovers the original intent | 2026-04-19 |
| 147 | `WedgeDetector` uses demand-aware criteria `(state==Healthy AND hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` = (Polly bulkhead depth > 0) OR (active MonitoredItem count > 0) OR (queued historian read count > 0). Idle + subscription-only + write-only-burst drivers stay Healthy without false-fault | Previous "no successful Read in N intervals" formulation flipped legitimate idle subscribers, slow historian backfills, and write-heavy drivers to Faulted. The demand-aware check only fires when the driver claims work is outstanding | 2026-04-19 |
| 148 | LiteDB config cache is **generation-sealed**: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file; cache reads serve the last-known-sealed generation. Mixed-generation reads are impossible | Prior "refresh on every successful query" cache could serve LDAP role mapping from one generation alongside UNS topology from another, producing impossible states. Sealed-snapshot invariant keeps cache-served reads coherent with a real published state | 2026-04-19 |
| 149 | `AuthorizationDecision { Allow \| NotGranted \| Denied, IReadOnlyList<MatchedGrant> Provenance }` — tri-state internal model. Phase 6.2 only produces `Allow` + `NotGranted` (grant-only semantics per decision #129); v2.1 Deny widens without API break | bool return would collapse `no-matching-grant` and `explicit-deny` into the same runtime state + UI explanation; provenance record is needed for the audit log anyway. Making the shape tri-state from Phase 6.2 avoids a breaking change in v2.1 | 2026-04-19 |
| 150 | Data-plane ACL evaluator consumes `NodeAcl` rows joined against the session's resolved LDAP group memberships. `LdapGroupRoleMapping` (decision #105) is control-plane only — routes LDAP groups to Admin UI roles. Zero runtime overlap between the two | Codex adversarial review flagged that Phase 6.2 draft conflated the two — building the data-plane trie from `LdapGroupRoleMapping` would let a user inherit tag permissions from an admin-role claim path never intended as a data-path grant | 2026-04-19 |
| 151 | `UserAuthorizationState` cached per session but bounded by `MembershipFreshnessInterval` (default 15 min). Past that interval the next hot-path authz call re-resolves LDAP group memberships; failure to re-resolve (LDAP unreachable) → fail-closed (evaluator returns `NotGranted` until memberships refresh successfully) | Previous design cached memberships until session close, so a user removed from a privileged LDAP group could keep authorized access for hours. Bounded freshness + fail-closed covers the revoke-takes-effect story | 2026-04-19 |
| 152 | Auth cache has its own staleness budget `AuthCacheMaxStaleness` (default 5 min), independent of decision #36's availability-oriented config cache (24 h). Past 5 min on authorization data, evaluator fails closed regardless of whether the underlying config is still serving from cache | Availability-oriented caches trade correctness for uptime. Authorization data is correctness-sensitive — stale ACLs silently extend revoked access. Auth-specific budget keeps the two concerns from colliding | 2026-04-19 |
| 153 | MonitoredItem carries `(AuthGenerationId, MembershipVersion)` stamp at create time. On every Publish, items with a mismatching stamp re-evaluate; unchanged items stay fast-path. Revoked items drop to `BadUserAccessDenied` within one publish cycle | Create-time-only authorization leaves revoked users receiving data forever; per-publish re-authorization at 100 ms cadence across 50 groups × 6 levels is too expensive. Stamp-then-reevaluate-on-change balances correctness with cost | 2026-04-19 |
| 154 | ServiceLevel reserves `0` for operator-declared maintenance only; `1` = NoData (unreachable / Faulted); operational states occupy `2..255` in an 8-state matrix (Authoritative-Primary=255, Isolated-Primary=230, Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80, Backup-Mid-Apply=50, Recovering-Backup=30, InvalidTopology=2) | OPC UA Part 5 §6.3.34 defines `0=Maintenance` + `1=NoData`; using `0` for our Faulted case collides with spec + triggers spec-compliant clients to enter maintenance-mode cutover. Expanded 8-state matrix covers operational states the 5-state original collapsed together (e.g. Isolated-Primary vs Primary-Mid-Apply were both 200) | 2026-04-19 |
| 155 | `ServerUriArray` includes self + peers (self first, deterministic ordering), per OPC UA Part 4 §6.6.2.2 | Previous design excluded self from the array — spec violation + clients lose the ability to map server identities consistently during failover | 2026-04-19 |
| 156 | Redundancy peer health uses a two-layer probe: `/healthz` (2 s) as fast-fail + `UaHealthProbe` (10 s, opens OPC UA client session to peer + reads its `ServiceLevel` node) as the authority signal. HTTP-healthy ≠ UA-authoritative | `/healthz` returns 200 whenever HTTP + config DB/cache is healthy — but a peer can be HTTP-healthy with a broken OPC UA endpoint or a stuck subscription publisher. Using HTTP alone would advertise authority against servers that can't actually publish data | 2026-04-19 |
| 157 | Publish-generation fencing — coordinator CAS on a monotonic `ConfigGenerationId`; every topology + role decision is generation-stamped; peers reject state propagated from a lower generation. Runtime `InvalidTopology` state (both self-demote to ServiceLevel 2) when >1 Primary detected post-startup | Operator race publishing two drafts with different roles can produce two locally-valid views; without fencing + runtime containment both nodes can serve as Primary until manual intervention | 2026-04-19 |
| 158 | Apply-window uses named leases keyed to `(ConfigGenerationId, PublishRequestId)` via `await using`. `ApplyLeaseWatchdog` auto-closes any lease older than `ApplyMaxDuration` (default 10 min) | Simple `IDisposable`-counter design leaks on cancellation / async-ownership races; a stuck positive count leaves the node permanently mid-apply. Generation-keyed leases + watchdog bound worst case | 2026-04-19 |
| 159 | CSV import header row must start with `# OtOpcUaCsv v1` (version marker). Future shape changes bump the version; parser forks per version. Canonical identifier columns follow decision #117:`ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid` | Without a version marker the CSV schema has no upgrade path — adding a required column breaks every old export silently. The version prefix makes parser dispatch explicit + future-compatible | 2026-04-19 |
| 160 | Equipment CSV import uses a staged-import pattern: `EquipmentImportBatch` + `EquipmentImportRow` tables receive chunked inserts; `FinaliseImportBatch` is one atomic transaction that applies accepted rows to `Equipment` + `ExternalIdReservation`. Rollback = drop the batch row; `Equipment` never partially mutates | 10k-row single-transaction import holds locks too long; chunked direct writes lose all-or-nothing rollback. Staging + atomic finalize bounds transaction duration + preserves rollback semantics | 2026-04-19 |
| 161 | UNS drag-reorder impact preview carries a `DraftRevisionToken`; Confirm re-checks against the current draft + returns `409 Conflict / refresh-required` if the draft advanced between preview and commit | Without concurrency control, two operators editing the same draft can overwrite each other's changes silently. Draft-revision token + 409 response makes the race visible + forces refresh | 2026-04-19 |
| 162 | OPC 40010 Identification sub-folder exposed under each equipment node inherits the Equipment scope's ACL grants — the ACL trie does NOT add a new scope level for Identification | Adding a new scope level for Identification would require every grant to add a second grant for `Equipment/Identification`; inheriting the Equipment scope keeps the grant model flat + prevents operator-forgot-to-grant-Identification access surprises | 2026-04-19 |
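A worked sketch of the decision #146 formula; the type and method names are inventions. Example under the Tier A constants: a driver with a 10 MB baseline gets soft = max(3 × 10, 10 + 50) = 60 MB and hard = 120 MB.

```csharp
using System;

public readonly record struct TierConstants(double Multiplier, double FloorMb)
{
    public static readonly TierConstants TierA = new(3, 50);
    public static readonly TierConstants TierB = new(3, 100);
    public static readonly TierConstants TierC = new(2, 500);
}

public static class MemoryThresholds
{
    // soft = max(multiplier * baseline, baseline + floor); hard = 2 * soft.
    // Baseline = median of the first 5 min of GetMemoryFootprint() samples.
    public static (double SoftMb, double HardMb) Compute(double baselineMb, TierConstants tier)
    {
        double soft = Math.Max(tier.Multiplier * baselineMb, baselineMb + tier.FloorMb);
        return (soft, 2 * soft);
    }
}
```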
| Modbus area | Function codes | S7-side mapping | Notes |
|---|---|---|---|
| Discrete Inputs (1x)| FC02 | *implicit* — I process image | `%I0.0`–`%I1023.7` (→ discrete addresses 0–8191) [1][9] |
| Input Registers (3x)| FC04 | *implicit* — M memory or DB (version-dependent) | Some firmware routes FC04 through the same MB_HOLD_REG buffer [1][8] |
| Holding Registers (4x)| FC03, FC06, FC16 | `MB_HOLD_REG` pointer | User DB (e.g. `DB10.DBW0`) or `%MW` area [1][2][8] |
- **`MB_HOLD_REG` is a pointer (VARIANT / ANY) into a user-defined DB** whose
first byte is holding-register 0 (`40001` in 1-based Modicon form). Byte
offset 2 is register 1, byte offset 4 is register 2, etc. [1][2]. (An offset
helper is sketched after this list.)
- **The DB *must* have "Optimized block access" UNCHECKED.** Optimized DBs let
the compiler reorder fields for alignment; Modbus requires fixed byte
offsets. With optimized access on, the compiler accepts the project but
`MB_SERVER` returns STATUS `0x8383` (misaligned access) or silently reads
zeros [8][10][11]. This is the #1 support-forum complaint.
- **FC01/FC02/FC05/FC15 hit the Q and I process images directly — not the `MB_HOLD_REG` DB.**
- **CP 443-1 in S7-400H**: uses `MODBUSCP_REDUNDANT` which presents two
Ethernet endpoints that fail over. Our driver's redundancy support should
recognize the S7-400H profile as "two IP addresses, same server state,
advertise via `ServerUriArray`" [6].
- **ET 200SP CPU (1510SP / 1512SP)**: behaves as S7-1500 from `MB_SERVER`
perspective. No known deltas [3].
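A small sketch of the offset arithmetic from the `MB_HOLD_REG` bullet above; the helper name is an invention. Register 0 is Modicon `40001` at byte offset 0 of the target DB; register N sits at byte offset 2N.

```csharp
using System;

public static class MbHoldRegMapping
{
    // 1-based Modicon form (40001..49999) -> byte offset into the MB_HOLD_REG target DB.
    public static int HoldingRegisterToDbByteOffset(int modiconAddress)
    {
        if (modiconAddress is < 40001 or > 49999)
            throw new ArgumentOutOfRangeException(nameof(modiconAddress));
        int register = modiconAddress - 40001;   // 40001 is holding register 0
        return register * 2;                     // each 16-bit register spans 2 bytes
    }
}
```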
## References
1. Siemens Industry Online Support, *Modbus/TCP Communication between SIMATIC S7-1500 / S7-1200 and Modbus/TCP Controllers with Instructions `MB_CLIENT` and `MB_SERVER`*, Entry ID 102020340, V6 (Feb 2021). https://cache.industry.siemens.com/dl/files/340/102020340/att_118119/v6/net_modbus_tcp_s7-1500_s7-1200_en.pdf
2. Siemens TIA Portal Online Docs, *MB_SERVER instruction*. https://docs.tia.siemens.cloud/r/simatic_s7_1200_manual_collection_eses_20/communication-processor-and-modbus-tcp/modbus-communication/modbus-tcp/modbus-tcp-instructions/mb_server-communicate-using-profinet-as-modbus-tcp-server-instruction
3. Siemens, *SIMATIC S7-1500 Communication Function Manual* (covers ET 200SP CPU). http://public.eandm.com/Public_Docs/s71500_communication_function_manual_en-US_en-US.pdf
4. Siemens Industry Online Support, *SIMATIC Modbus/TCP communication using CP 343-1 and CP 443-1 — Programming Manual*, Entry ID 103447617. https://cache.industry.siemens.com/dl/files/617/103447617/att_106971/v1/simatic_modbus_tcp_cp_en-US_en-US.pdf
5. Siemens Industry Online Support FAQ *"Which technical data applies for the SIMATIC Modbus/TCP software for CP 343-1 / CP 443-1?"*, Entry ID 104946406. https://www.industry-mobile-support.siemens-info.com/en/article/detail/104946406
6. Siemens Industry Online Support, *Redundant Modbus/TCP communication via CP 443-1 in S7-400H systems*, Entry ID 109739212. https://cache.industry.siemens.com/dl/files/212/109739212/att_887886/v1/SIMATIC_modbus_tcp_cp_red_e_en-US.pdf
7. Siemens Industry Online Support, *SIMATIC MODBUS (TCP) PN CPU Library — Programming and Operating Manual 06/2014*, Entry ID 75330636. https://support.industry.siemens.com/cs/attachments/75330636/ModbusTCPPNCPUen.pdf
8. DMC Inc., *Using an S7-1200 PLC as a Modbus TCP Slave*. https://www.dmcinfo.com/blog/27313/using-an-s7-1200-plc-as-a-modbus-tcp-slave/
10. lamaPLC, *Simatic Modbus S7 error- and statuscodes*. https://www.lamaplc.com/doku.php?id=simatic:errorcodes
11. ScadaProtocols, *How to Configure Modbus TCP on Siemens S7-1200 (TIA Portal Step-by-Step)*. https://scadaprotocols.com/modbus-tcp-siemens-s7-1200-tia-portal/
12. Industrial Monitor Direct, *Reading and Writing Memory Bits via Modbus TCP on S7-1200*. https://industrialmonitordirect.com/blogs/knowledgebase/reading-and-writing-memory-bits-via-modbus-tcp-on-s7-1200
13. PLCtalk forum *"Siemens S7-1200 modbus understanding"*. https://www.plctalk.net/forums/threads/siemens-s7-1200-modbus-understanding.104119/
15. Chipkin, *How Real (Floating Point) and 32-bit Data is Encoded in Modbus*. https://store.chipkin.com/articles/how-real-floating-point-and-32-bit-data-is-encoded-in-modbus-rtu-messages
16. Siemens Industry Online Support forum, *MODBUS DATA conversion in S7-1200 CPU*, Entry ID 97287. https://support.industry.siemens.com/forum/WW/en/posts/modbus-data-converson-in-s7-1200-cpu/97287
18. Siemens TIA Portal, *Data types in SIMATIC S7-1200/1500 — String/WString header layout* (system manual, "Elementary Data Types").
19. Kepware / PTC, *Siemens TCP/IP Ethernet Driver Help*, "Byte / Word Order" tag property. https://www.opcturkey.com/uploads/siemens-tcp-ip-ethernet-manual.pdf
20. Siemens SiePortal forum, *Transfer float out of words*, Entry ID 187811. https://sieportal.siemens.com/en-ww/support/forum/posts/transfer-float-out-of-words/187811 _(operator-reported "S7 swaps float" claim — traced to remote-device issue; **unconfirmed**.)_
21. Siemens SiePortal forum, *S7-1200 communication with Modbus TCP*, Entry ID 133086. https://support.industry.siemens.com/forum/WW/en/posts/s7-1200-communication-with-modbus-tcp/133086
22. Siemens SiePortal forum, *S7-1500 MB Server Holding Register Max Word*, Entry ID 224636. https://support.industry.siemens.com/forum/WW/en/posts/s7-1500-mb-server-holding-register-max-word/224636
23. Siemens, *SIMATIC S7-1500 Technical Specifications* — CPU-specific DB size limits in each CPU manual's "Memory" table.
26. Siemens SiePortal forum, *How many TCP connections can the S7-1200 make?*, Entry ID 275570. https://support.industry.siemens.com/forum/WW/en/posts/how-many-tcp-connections-can-the-s7-1200-make/275570
27. Siemens SiePortal forum, *Simultaneous connections of Modbus TCP*, Entry ID 189626. https://support.industry.siemens.com/forum/ww/en/posts/simultaneous-connections-of-modbus-tcp/189626
28. Siemens SiePortal forum, *How many Modbus TCP IP clients can read simultaneously from S7-1517*, Entry ID 261569. https://support.industry.siemens.com/forum/WW/en/posts/how-many-modbus-tcp-ip-client-can-read-simultaneously-in-s7-1517/261569
> **Status**: **RELEASE-READY (code-path)** for v2 GA — all three code-path release blockers are closed. Remaining work is manual (client interop matrix, deployment checklist signoff, OPC UA CTT pass) + hardening follow-ups; see exit-criteria checklist below.
This doc is the single view of where v2 stands against its release criteria. Update it whenever a deferred follow-up closes or a new release blocker is discovered.
**Closed**. `AuthorizationGate` + `NodeScopeResolver` now thread through `OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager`. `OnReadValue` + `OnWriteValue` + all four HistoryRead paths call `gate.IsAllowed(identity, operation, scope)` before the invoker. Production deployments activate enforcement by constructing `OpcUaApplicationHost` with an `AuthorizationGate(StrictMode: true)` + populating the `NodeAcl` table.
Additional Stream C surfaces (not release-blocking, hardening only):
- Browse + TranslateBrowsePathsToNodeIds gating with ancestor-visibility logic per `acl-design.md` §Browse.
- CreateMonitoredItems + TransferSubscriptions gating with per-item `(AuthGenerationId, MembershipVersion)` stamp so revoked grants surface `BadUserAccessDenied` within one publish cycle (decision #153).
- Alarm Acknowledge / Confirm / Shelve gating.
- Call (method invocation) gating.
- Finer-grained scope resolution — current `NodeScopeResolver` returns a flat cluster-level scope. Joining against the live Configuration DB to populate UnsArea / UnsLine / Equipment path is tracked as Stream C.12.
- 3-user integration matrix covering every operation × allow/deny.
These are additional hardening — the three highest-value surfaces (Read / Write / HistoryRead) are now gated, which covers the base-security gap for v2 GA.
**Closed**. `SealedBootstrap` consumes `ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag` end-to-end: bootstrap calls go through the timeout → retry → fallback-to-sealed pipeline; every central-DB success writes a fresh sealed snapshot so the next cache-miss has a known-good fallback; `StaleConfigFlag.IsStale` is now consumed by `HealthEndpointsHost.usingStaleConfig` so `/healthz` body reports reality.
Production activation: Program.cs switches `NodeBootstrap → SealedBootstrap` + constructs `OpcUaApplicationHost` with the `StaleConfigFlag` as an optional ctor parameter.
Remaining follow-ups (hardening, not release-blocking):
- A `HostedService` that polls `sp_GetCurrentGenerationForCluster` periodically so peer-published generations land in this node's cache without a restart.
- Richer snapshot payload via `sp_GetGenerationContent` so fallback can serve the full generation content (DriverInstance enumeration, ACL rows, etc.) from the sealed cache alone.
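A hypothetical shape for the poller follow-up above; the class name, 30-second cadence, and delegate signatures are all assumptions, with only `sp_GetCurrentGenerationForCluster` taken from the doc:

```csharp
using Microsoft.Extensions.Hosting;

// Hypothetical poller sketch. On each tick it asks the central DB for the
// current generation id and refreshes the local cache when a peer published.
public sealed class GenerationPollLoop : BackgroundService
{
    private readonly Func<CancellationToken, Task<long>> _getCurrentGeneration; // wraps sp_GetCurrentGenerationForCluster
    private readonly Func<long, CancellationToken, Task> _refreshCache;         // re-pull + reseal on change
    private long _lastSeen = -1;

    public GenerationPollLoop(
        Func<CancellationToken, Task<long>> getCurrentGeneration,
        Func<long, CancellationToken, Task> refreshCache)
    {
        _getCurrentGeneration = getCurrentGeneration;
        _refreshCache = refreshCache;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromSeconds(30)); // assumed cadence
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            var current = await _getCurrentGeneration(stoppingToken);
            if (current != _lastSeen)   // a peer published a new generation
            {
                await _refreshCache(current, stoppingToken);
                _lastSeen = current;
            }
        }
    }
}
```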
**Closed**. The runtime orchestration layer now exists end-to-end:
- `RedundancyCoordinator` reads `ClusterNode` + peer list at startup (Stream A shipped in PR #98). Invariants enforced: 1-2 nodes (decision #83), unique ApplicationUri (#86), ≤1 Primary in Warm/Hot (#84). Startup fails fast on violation; runtime refresh logs + flips `IsTopologyValid=false` so the calculator falls to band 2 without tearing down.
- `RedundancyStatePublisher` orchestrates topology + apply lease + recovery state + peer reachability through `ServiceLevelCalculator` + emits `OnStateChanged` / `OnServerUriArrayChanged` edge-triggered events (Stream C core shipped in PR #99). The OPC UA `ServiceLevel` Byte variable + `ServerUriArray` String[] variable subscribe to these events. A band-selection sketch follows this list.
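Sketch of the band selection those pieces feed. Only the bands this doc names are grounded (2 on invalid topology, 200 PrimaryMidApply, 230 Isolated-Primary); the input record, check precedence, and the healthy-band value are illustrative assumptions:

```csharp
// Illustrative band selection. Input shape, precedence, and the 255 healthy
// value are assumptions; only bands 2 / 200 / 230 come from this doc.
public readonly record struct RedundancyInputs(
    bool IsTopologyValid,
    bool IsApplyLeaseHeld,
    bool PeerReachable);

public static class ServiceLevelBands
{
    public static byte Calculate(RedundancyInputs inputs)
    {
        if (!inputs.IsTopologyValid)
            return 2;     // invalid topology: degrade without tearing down
        if (inputs.IsApplyLeaseHeld)
            return 200;   // PrimaryMidApply: a generation apply is in flight
        if (!inputs.PeerReachable)
            return 230;   // Isolated-Primary: peer Unknown/down, retain authority
        return 255;       // healthy primary (illustrative top band)
    }
}
```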
Remaining Phase 6.3 surfaces (hardening, not release-blocking):
- `PeerHttpProbeLoop` + `PeerUaProbeLoop` HostedServices that poll the peer + write to `PeerReachabilityTracker` on each tick. Without these the publisher sees `PeerReachability.Unknown` for every peer → Isolated-Primary band (230) even when the peer is up. Safe default (retains authority) but not the full non-transparent-redundancy UX.
- OPC UA variable-node wiring layer: bind the `ServiceLevel` Byte node + `ServerUriArray` String[] node to the publisher's events via `BaseDataVariable.OnReadValue` / direct value push. Scoped follow-up on the Opc.Ua.Server stack integration.
- `sp_PublishGeneration` wraps its apply in `await using var lease = coordinator.BeginApplyLease(...)` so the `PrimaryMidApply` band (200) fires during actual publishes (task #148 part 2); a lease sketch follows this list.
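A sketch of the lease shape that wrap implies, assuming `BeginApplyLease` hands back an `IAsyncDisposable` whose disposal drops the mid-apply flag the calculator reads (all names and internals here are illustrative):

```csharp
// Illustrative apply-lease: held => band 200, released on scope exit.
public sealed class ApplyLease : IAsyncDisposable
{
    private readonly Action _onReleased;
    public ApplyLease(Action onReleased) => _onReleased = onReleased;

    public ValueTask DisposeAsync()
    {
        _onReleased();                 // drop PrimaryMidApply when the scope exits
        return ValueTask.CompletedTask;
    }
}

public sealed class ApplyLeaseSource
{
    private int _inApply;              // read by the band calculator

    public bool IsApplyLeaseHeld => Volatile.Read(ref _inApply) == 1;

    public ApplyLease BeginApplyLease()
    {
        Volatile.Write(ref _inApply, 1);
        return new ApplyLease(() => Volatile.Write(ref _inApply, 0));
    }
}

// Usage, mirroring the wrap described above (args elided in this sketch):
// await using var lease = coordinator.BeginApplyLease();
// ... apply the generation ...
```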
- **Drivers** — AB CIP, AB Legacy, TwinCAT ADS, FOCAS drivers are planned but unshipped. Decision pending on whether these are release-blocking for v2 GA or can slip to a v2.1 follow-up.
- **Background services** — Phase 6.1 Stream B.4 `ScheduledRecycleScheduler` HostedService (task #137), Phase 6.1 Stream A analyzer (task #135 — Roslyn analyzer asserting every capability surface routes through `CapabilityInvoker`).
- **Multi-host dispatch** — Phase 6.1 Stream A follow-up (task #135). Currently every driver gets a single pipeline keyed on `driver.DriverInstanceId`; multi-host drivers (Modbus with N PLCs) need per-PLC host resolution so failing PLCs trip per-PLC breakers without poisoning siblings. Decision #144 requires this but we haven't wired it yet. A per-host keying sketch follows.
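A sketch of per-PLC pipeline keying under those constraints; the registry type, the composite key, and the `"default"` host id are hypothetical names for illustration:

```csharp
using System.Collections.Concurrent;

// Hypothetical per-host registry. One pipeline per (driver, host) key means a
// tripped breaker on PLC A never poisons calls to PLC B on the same driver.
public sealed class PerHostPipelineRegistry<TPipeline>
{
    private readonly ConcurrentDictionary<(Guid DriverInstanceId, string HostId), TPipeline> _pipelines = new();
    private readonly Func<Guid, string, TPipeline> _factory;

    public PerHostPipelineRegistry(Func<Guid, string, TPipeline> factory) => _factory = factory;

    // Today's behaviour == every call passing the same host id per driver.
    // Multi-host drivers would resolve hostId per target PLC instead.
    public TPipeline GetPipeline(Guid driverInstanceId, string hostId = "default")
        => _pipelines.GetOrAdd(
            (driverInstanceId, hostId),
            key => _factory(key.DriverInstanceId, key.HostId));
}
```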
## Running the release-readiness check
```bash
pwsh ./scripts/compliance/phase-6-all.ps1
```
This meta-runner invokes each `phase-6-N-compliance.ps1` script in sequence and reports an aggregate PASS/FAIL. It is the single-command verification that everything we claim as shipped still compiles, the tests pass, and the plan-level invariants still hold.
Exit 0 = every phase passes its compliance checks + no test-count regression.
## Release-readiness exit criteria
v2 GA requires all of the following:
- [ ] All four Phase 6.N compliance scripts exit 0.
- [ ] `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes with ≤ 1 known-flake failure.
- [ ] Release blockers listed above all closed (or consciously deferred to v2.1 with a written decision).
- [ ] Production deployment checklist (separate doc) signed off by Fleet Admin.
- [ ] At least one end-to-end integration run against the live Galaxy on the dev box succeeds.
- [ ] OPC UA conformance test (CTT or UA Compliance Test Tool) passes against the live endpoint.
- [ ] Non-transparent redundancy cutover validated with at least one production client (Ignition 8.3 recommended — see decision #85).
## Change log
- **2026-04-19** — Release blocker #3 **closed** (PRs #98–99). Phase 6.3 Streams A + C core shipped: `ClusterTopologyLoader` + `RedundancyCoordinator` + `RedundancyStatePublisher` + `PeerReachabilityTracker`. Code-path release blockers all closed; remaining Phase 6.3 surfaces (peer-probe HostedServices, OPC UA variable-node binding, sp_PublishGeneration lease wrap, client interop matrix) are hardening follow-ups.
- **2026-04-19** — Release blocker #2 **closed** (PR #96). `SealedBootstrap` consumes `ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag`; `/healthz` now surfaces the stale flag. Remaining follow-ups (periodic poller + richer snapshot payload) downgraded to hardening.
Assert-TextFound"Corruption fails closed with GenerationCacheUnavailableException""GenerationCacheUnavailableException"@("src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs")
Assert-TextFound"HistoryRead maps to NodePermissions.HistoryRead""HistoryRead.+NodePermissions\.HistoryRead"@("src/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs")
Assert-TextFound"Maintenance = 0 (reserved per OPC UA Part 5)""Maintenance\s*=\s*0"@("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound"NoData = 1 (reserved per OPC UA Part 5)""NoData\s*=\s*1"@("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound"CSV header version marker '# OtOpcUaCsv v1'""OtOpcUaCsv v1"@("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound"Required columns match decision #117""ZTag.+MachineCode.+SAPID.+EquipmentId.+EquipmentUuid"@("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound"Optional columns match decision #139 (Manufacturer)""Manufacturer"@("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound"Optional columns include DeviceManualUri""DeviceManualUri"@("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound"Rejects duplicate ZTag within file""Duplicate ZTag"@("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
PKI store root <code>@Certs.PkiStoreRoot</code>. Trusting a rejected cert moves the file into the trusted store — the OPC UA server picks up the change on the next client handshake, so operators should retry the rejected client's connection after trusting.
<p class="text-muted">No client certs have been explicitly trusted. The server's own application cert lives in <code>own/</code> and is not listed here.</p>
Auto-refresh every @RefreshIntervalSeconds s. Last updated: @(_lastRefreshUtc?.ToString("HH:mm:ss 'UTC'") ?? "—")
</span>
</div>
<div class="alert alert-info small mb-4">
Each row is one host reported by a driver instance on a server node. Galaxy drivers report
per-Platform / per-AppEngine entries; Modbus drivers report the PLC endpoint. Rows age out
of the Server's publisher on every 10-second heartbeat — rows whose LastSeen is older than
30s are flagged Stale, which usually means the owning Server process has crashed or lost
its DB connection.
</div>
@if (_rows is null)
{
<p>Loading…</p>
}
else if (_rows.Count == 0)
{
<div class="alert alert-secondary">
No host-status rows yet. The Server publishes its first tick 2s after startup; if this list stays empty, check that the Server is running and the driver implements <code>IHostConnectivityProbe</code>.
$"No sealed generation for cluster '{clusterId}' at '{clusterDir}'. First-boot case: the central DB must be reachable at least once before cache fallback is possible.");
/// Default implementation throws. Only drivers with an event historian (Galaxy via the
/// Wonderware Alarm & Events log) override. Modbus / the OPC UA Client driver stay
/// with the default and let callers see <c>BadHistoryOperationUnsupported</c>.
/// </remarks>
Task<HistoricalEventsResult>ReadEventsAsync(
string?sourceName,
DateTimestartUtc,
DateTimeendUtc,
intmaxEvents,
CancellationTokencancellationToken)
=>thrownewNotSupportedException(
$"{GetType().Name} does not implement ReadEventsAsync. "+
"Drivers whose backends have an event historian override this method.");
}
/// <summary>Result of a HistoryRead call.</summary>
@@ -48,3 +94,29 @@ public enum HistoryAggregateType
Total,
Count,
}
/// <summary>
/// One row returned by <see cref="IHistoryProvider.ReadEventsAsync"/> — a historical
/// alarm/event record, not the OPC UA live-event stream. Fields match the minimum set the
/// Server needs to populate a <c>HistoryEventFieldList</c> for HistoryReadEvents responses.
/// </summary>
/// <param name="EventId">Stable unique id for the event — driver-specific format.</param>
/// <param name="SourceName">Source object that emitted the event. May differ from the <c>sourceName</c> filter the caller passed (fuzzy matches).</param>
/// <param name="EventTimeUtc">Process-side timestamp — when the event actually occurred.</param>
/// <param name="ReceivedTimeUtc">Historian-side timestamp — when the historian persisted the row; may lag <paramref name="EventTimeUtc"/> by the historian's buffer flush cadence.</param>
/// when any descendant has a grant; denied ancestors filter from browse results.
/// </summary>
Browse,
/// <summary><c>Read</c> on a variable node.</summary>
Read,
/// <summary><c>Write</c> when the target has <see cref="SecurityClassification.Operate"/> / <see cref="SecurityClassification.FreeAccess"/>.</summary>
WriteOperate,
/// <summary><c>Write</c> when the target has <see cref="SecurityClassification.Tune"/>.</summary>
WriteTune,
/// <summary><c>Write</c> when the target has <see cref="SecurityClassification.Configure"/>.</summary>
WriteConfigure,
/// <summary><c>HistoryRead</c> — uses its own <c>NodePermissions.HistoryRead</c> bit; Read alone is NOT sufficient (decision in Phase 6.2 Compliance).</summary>
HistoryRead,
/// <summary><c>HistoryUpdate</c> — annotation / insert / delete on historian.</summary>
HistoryUpdate,
/// <summary><c>CreateMonitoredItems</c>. Per-item denial in mixed-authorization batches.</summary>
CreateMonitoredItems,
/// <summary><c>TransferSubscriptions</c>. Re-evaluates transferred items against current auth state.</summary>
TransferSubscriptions,
/// <summary><c>Call</c> on a Method node.</summary>
/// Snapshot accessor for the current resilience options. Invoked per call so Admin-edit +
/// pipeline-invalidate can take effect without restarting the invoker.
/// </param>
/// <param name="driverType">Driver type name for structured-log enrichment (e.g. <c>"Modbus"</c>).</param>
/// <param name="statusTracker">Optional resilience-status tracker. When wired, every capability call records start/complete so Admin <c>/hosts</c> can surface <see cref="ResilienceStatusSnapshot.CurrentInFlight"/> as the bulkhead-depth proxy.</param>
/// <summary>Outcome of a single <see cref="WedgeDetector.Classify"/> call.</summary>
public enum WedgeVerdict
{
/// <summary>Driver wasn't Healthy to begin with — wedge detection doesn't apply.</summary>
NotApplicable,
/// <summary>Driver claims Healthy + no pending work → stays Healthy.</summary>
Idle,
/// <summary>Driver claims Healthy + has pending work + has made progress within the threshold → stays Healthy.</summary>
Healthy,
/// <summary>Driver claims Healthy + has pending work + has NOT made progress within the threshold → wedged.</summary>
Faulted,
}
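For readers of this enum in isolation, a sketch of the classification it implies. The input record, the threshold parameter, and the class name are illustrative, with the branch logic taken directly from the summaries above:

```csharp
// Illustrative classifier over the WedgeVerdict docs above. WedgeInputs and
// the threshold plumbing are assumptions, not the shipped WedgeDetector API.
public readonly record struct WedgeInputs(
    bool ClaimsHealthy,
    bool HasPendingWork,
    DateTime LastProgressUtc);

public static class WedgeClassifier
{
    public static WedgeVerdict Classify(WedgeInputs inputs, DateTime nowUtc, TimeSpan threshold)
    {
        if (!inputs.ClaimsHealthy)
            return WedgeVerdict.NotApplicable;   // wedge detection only audits Healthy claims
        if (!inputs.HasPendingWork)
            return WedgeVerdict.Idle;            // nothing queued, nothing to stall on
        return nowUtc - inputs.LastProgressUtc <= threshold
            ? WedgeVerdict.Healthy               // progressing within the window
            : WedgeVerdict.Faulted;              // pending work + no progress: wedged
    }
}
```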