Closes release blocker #2 from docs/v2/v2-release-readiness.md — the
generation-sealed cache + resilient reader + stale-config flag shipped as
unit-tested primitives in PR #81, but no production path consumed them until
now. This PR wires them end-to-end.
Server additions:
- SealedBootstrap — Phase 6.1 Stream D consumption hook. Resolves the node's
current generation through ResilientConfigReader's timeout → retry →
fallback-to-sealed pipeline. On every successful central-DB fetch it seals
a fresh snapshot to <cache-root>/<cluster>/<generationId>.db so a future
cache-miss has a known-good fallback. Alongside the original NodeBootstrap
(which still uses the single-file ILocalConfigCache); Program.cs can
switch between them once operators are ready for the generation-sealed
semantics.
- OpcUaApplicationHost: new optional staleConfigFlag ctor parameter. When
wired, HealthEndpointsHost consumes `flag.IsStale` via the existing
usingStaleConfig Func<bool> hook. Means `/healthz` actually reports
`usingStaleConfig: true` whenever a read fell back to the sealed cache —
closes the loop between Stream D's flag + Stream C's /healthz body shape.
Tests (4 new SealedBootstrapIntegrationTests, all pass):
- Central-DB success path seals snapshot + flag stays fresh.
- Central-DB failure falls back to sealed snapshot + flag flips stale (the
SQL-kill scenario from Phase 6.1 Stream D.4.a).
- No-snapshot + central-down throws GenerationCacheUnavailableException
with a clear error (the first-boot scenario from D.4.c).
- Next successful bootstrap after a fallback clears the stale flag.
Full solution dotnet test: 1168 passing (was 1164, +4). Pre-existing
Client.CLI Subscribe flake unchanged.
Production activation: Program.cs wires SealedBootstrap (instead of
NodeBootstrap), constructs OpcUaApplicationHost with the staleConfigFlag,
and a HostedService polls sp_GetCurrentGenerationForCluster periodically so
peer-published generations land in this node's sealed cache. The poller
itself is Stream D.1.b follow-up.
The sp_PublishGeneration SQL-side hook (where the publish commit itself
could also write to a shared sealed cache) stays deferred — the per-node
seal pattern shipped here is the correct v2 GA model: each Server node
owns its own on-disk cache and refreshes from its own DB reads, matching
the Phase 6.1 scope-table description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the Phase 6.2 security gap the v2 release-readiness dashboard flagged:
the evaluator + trie + gate shipped as code in PRs #84-88 but no dispatch
path called them. This PR threads the gate end-to-end from
OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager and calls it on
every Read / Write / 4 HistoryRead paths.
Server.Security additions:
- NodeScopeResolver — maps driver fullRef → Core.Authorization NodeScope.
Phase 1 shape: populates ClusterId + TagId; leaves NamespaceId / UnsArea /
UnsLine / Equipment null. The cluster-level ACL cascade covers this
configuration (decision #129 additive grants). Finer-grained scope
resolution (joining against the live Configuration DB for UnsArea / UnsLine
path) lands as Stream C.12 follow-up.
- WriteAuthzPolicy.ToOpcUaOperation — maps SecurityClassification → the
OpcUaOperation the gate evaluator consults (Operate/SecuredWrite →
WriteOperate; Tune → WriteTune; Configure/VerifiedWrite → WriteConfigure).
DriverNodeManager wiring:
- Ctor gains optional AuthorizationGate + NodeScopeResolver; both null means
the pre-Phase-6.2 dispatch runs unchanged (backwards-compat for every
integration test that constructs DriverNodeManager directly).
- OnReadValue: ahead of the invoker call, builds NodeScope + calls
gate.IsAllowed(identity, Read, scope). Denied reads return
BadUserAccessDenied without hitting the driver.
- OnWriteValue: preserves the existing WriteAuthzPolicy check (classification
vs session roles) + adds an additive gate check using
WriteAuthzPolicy.ToOpcUaOperation(classification) to pick the right
WriteOperate/Tune/Configure surface. Lax mode falls through for identities
without LDAP groups.
- Four HistoryRead paths (Raw / Processed / AtTime / Events): gate check
runs per-node before the invoker. Events path tolerates fullRef=null
(event-history queries can target a notifier / driver-root; those are
cluster-wide reads that need a different scope shape — deferred).
- New WriteAccessDenied helper surfaces BadUserAccessDenied in the
OpcHistoryReadResult slot + errors list, matching the shape of the
existing WriteUnsupported / WriteInternalError helpers.
OtOpcUaServer + OpcUaApplicationHost: gate + resolver thread through as
optional constructor parameters (same pattern as DriverResiliencePipelineBuilder
in Phase 6.1). Null defaults keep the existing 3 OpcUaApplicationHost
integration tests constructing without them unchanged.
Tests (5 new in NodeScopeResolverTests):
- Resolve populates ClusterId + TagId + Equipment Kind.
- Resolve leaves finer path null per Phase 1 shape (doc'd as follow-up).
- Empty fullReference throws.
- Empty clusterId throws at ctor.
- Resolver is stateless across calls.
The existing 9 AuthorizationGate tests (shipped in PR #86) continue to
cover the gate's allow/deny semantics under strict + lax mode.
Full solution dotnet test: 1164 passing (was 1159, +5). Pre-existing
Client.CLI Subscribe flake unchanged. Existing OpcUaApplicationHost +
HealthEndpointsHost + driver integration tests continue to pass because the
gate defaults to null → no enforcement, and the lax-mode fallback returns
true for identities without LDAP groups (the anonymous test path).
Production deployments flip the gate on by constructing it via
OpcUaApplicationHost's new authzGate parameter + setting
`Authorization:StrictMode = true` once ACL data is populated. Flipping the
switch post-seed turns the evaluator + trie from scaffolded code into
actual enforcement.
This closes release blocker #1 listed in docs/v2/v2-release-readiness.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/compliance/phase-6-2-compliance.ps1 replaces the stub TODOs with 23
real checks spanning:
- Stream A: LdapGroupRoleMapping entity + AdminRole enum + ILdapGroupRoleMappingService
+ impl + write-time invariant + EF migration all present.
- Stream B: OpcUaOperation enum + NodeScope + AuthorizationDecision tri-state
+ IPermissionEvaluator + PermissionTrie + Builder + Cache keyed on
GenerationId + UserAuthorizationState with MembershipFreshnessInterval=15m
and AuthCacheMaxStaleness=5m + TriePermissionEvaluator + HistoryRead uses
its own flag.
- Control/data-plane separation: the evaluator + trie + cache + builder +
interface all have zero references to LdapGroupRoleMapping (decision #150).
- Stream C foundation: ILdapGroupsBearer + AuthorizationGate with StrictMode
knob. DriverNodeManager dispatch-path wiring (11 surfaces) is Deferred,
tracked as task #143.
- Stream D data layer: ValidatedNodeAclAuthoringService + exception type +
rejects None permissions. Blazor UI pieces (RoleGrantsTab, AclsTab,
SignalR invalidation, draft diff) are Deferred, tracked as task #144.
- Cross-cutting: full solution dotnet test runs; 1097 >= 1042 baseline;
tolerates the one pre-existing Client.CLI Subscribe flake.
IPermissionEvaluator doc-comment reworded to avoid mentioning the literal
type name "LdapGroupRoleMapping" — the compliance check does a text-absence
sweep for that identifier across the data-plane files.
docs/v2/implementation/phase-6-2-authorization-runtime.md status updated from
DRAFT to SHIPPED (core). Two deferred follow-ups explicitly called out so
operators see what's still pending for the "Phase 6.2 fully wired end-to-end"
milestone.
`Phase 6.2 compliance: PASS` — exit 0. Any regression that deletes a class
or re-introduces an LdapGroupRoleMapping reference into the data-plane
evaluator turns a green check red + exit non-zero.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the non-UI piece of Stream D: a draft-aware write surface over NodeAcl
that enforces the Phase 6.2 plan's scope-uniqueness + grant-shape invariants.
Blazor UI pieces (RoleGrantsTab + AclsTab refresh + SignalR invalidation +
visual-compliance reviewer signoff) are deferred to the Phase 6.1-style
follow-up task.
Admin.Services:
- ValidatedNodeAclAuthoringService — alongside existing NodeAclService (raw
CRUD, kept for read + revoke paths). GrantAsync enforces:
* Permissions != None (decision #129 — additive only, no empty grants).
* Cluster scope has null ScopeId.
* Sub-cluster scope requires a populated ScopeId.
* No duplicate (GenerationId, ClusterId, LdapGroup, ScopeKind, ScopeId)
tuple — operator updates the row instead of inserting a duplicate.
UpdatePermissionsAsync also rejects None (operator revokes via NodeAclService).
Violations throw InvalidNodeAclGrantException.
Tests (10 new in Admin.Tests/ValidatedNodeAclAuthoringServiceTests):
- Grant rejects None permissions.
- Grant rejects Cluster-scope with ScopeId / sub-cluster without ScopeId.
- Grant succeeds on well-formed row.
- Grant rejects duplicate (group, scope) in same draft.
- Grant allows same group at different scope.
- Grant allows same (group, scope) in different draft.
- UpdatePermissions rejects None.
- UpdatePermissions round-trips new flags + notes.
- UpdatePermissions on unknown rowid throws.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Admin.Tests csproj.
Full solution dotnet test: 1097 passing (was 1087, +10). Phase 6.2 total is
now 1087+10 = 1097; baseline 906 → +191 net across Phase 6.1 (all streams) +
Phase 6.2 (Streams A, B, C foundation, D data layer).
Stream D follow-up task tracks: RoleGrantsTab CRUD over LdapGroupRoleMapping,
AclsTab write-through + Probe-this-permission diagnostic, draft-diff ACL
section, SignalR PermissionTrieCache invalidation push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the integration seam between the Server project's OPC UA stack and the
Core.Authorization evaluator. Actual DriverNodeManager dispatch-path wiring
(Read/Write/HistoryRead/Browse/Call/Subscribe/Alarm surfaces) lands in the
follow-up PR on this branch — covered by Task #143 below.
Server.Security additions:
- ILdapGroupsBearer — marker interface a custom IUserIdentity implements to
expose its resolved LDAP group DNs. Parallel to the existing IRoleBearer
(admin roles) — control/data-plane separation per decision #150.
- AuthorizationGate — stateless bridge between Opc.Ua.IUserIdentity and
IPermissionEvaluator. IsAllowed(identity, operation, scope) materializes a
UserAuthorizationState from the identity's LDAP groups, delegates to the
evaluator, and returns a single bool the dispatch paths use to decide
whether to surface BadUserAccessDenied.
- StrictMode knob controls fail-open-during-transition vs fail-closed:
- strict=false (default during rollout) — null identity, identity without
ILdapGroupsBearer, or NotGranted outcome all return true so older
deployments without ACL data keep working.
- strict=true (production target) — any of the above returns false.
The appsetting `Authorization:StrictMode = true` flips deployments over
once their ACL data is populated.
Tests (9 new in Server.Tests/AuthorizationGateTests):
- Null identity — strict denies, lax allows.
- Identity without LDAP groups — strict denies, lax allows.
- LDAP group with matching grant allows.
- LDAP group without grant — strict denies.
- Wrong operation denied (Read-only grant, WriteOperate requested).
- BuildSessionState returns materialized state with LDAP groups + null when
identity doesn't carry them.
Full solution dotnet test: 1087 passing (Phase 6.1 = 1042, Phase 6.2 A = +9,
B = +27, C foundation = +9 = 1087). Pre-existing Client.CLI Subscribe flake
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships Stream B.1-B.6 — the data-plane authorization engine Phase 6.2 runs on.
Integration into OPC UA dispatch (Stream C — Read / Write / HistoryRead /
Subscribe / Browse / Call etc.) is the next PR on this branch.
New Core.Abstractions:
- OpcUaOperation enum enumerates every OPC UA surface the evaluator gates:
Browse, Read, WriteOperate/Tune/Configure (split by SecurityClassification),
HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions,
Call, AlarmAcknowledge/Confirm/Shelve. Stream C maps each one back to its
dispatch call site.
New Core.Authorization namespace:
- NodeScope record + NodeHierarchyKind — 6-level scope addressing for
Equipment-kind (UNS) namespaces, folder-segment walk for SystemPlatform-kind
(Galaxy). NodeScope carries a Kind selector so the evaluator knows which
hierarchy to descend.
- AuthorizationDecision { Verdict, Provenance } + AuthorizationVerdict
{Allow, NotGranted, Denied} + MatchedGrant. Tri-state per decision #149;
Phase 6.2 only produces Allow + NotGranted, Denied stays reserved for v2.1
Explicit Deny without API break.
- IPermissionEvaluator.Authorize(session, operation, scope).
- PermissionTrie + PermissionTrieNode + TrieGrant. In-memory trie keyed on
the ACL scope hierarchy. CollectMatches walks Cluster → Namespace →
UnsArea → UnsLine → Equipment → Tag (or → FolderSegment(s) → Tag on
Galaxy). Pure additive union — matches that share an LDAP group with the
session contribute flags; OR across levels.
- PermissionTrieBuilder static factory. Build(clusterId, generationId, rows,
scopePaths?) returns a trie for one generation. Cross-cluster rows are
filtered out so the trie is cluster-coherent. Stream C follow-up wires a
real scopePaths lookup from the live DB; tests supply hand-built paths.
- PermissionTrieCache — process-singleton, keyed on (ClusterId, GenerationId).
Install(trie) adds a generation + promotes to "current" when the id is
highest-known (handles out-of-order installs gracefully). Prior generations
retained so an in-flight request against a prior trie still succeeds; GC
via Prune(cluster, keepLatest).
- UserAuthorizationState — per-session cache of resolved LDAP groups +
AuthGenerationId + MembershipVersion + MembershipResolvedUtc. Bounded by
MembershipFreshnessInterval (default 15 min per decision #151) +
AuthCacheMaxStaleness (default 5 min per decision #152).
- TriePermissionEvaluator — default IPermissionEvaluator. Fails closed on
stale sessions (IsStale check short-circuits to NotGranted), on cross-
cluster requests, on empty trie cache. Maps OpcUaOperation → NodePermissions
via MapOperationToPermission (total — every enum value has a mapping; tested).
Tests (27 new, all pass):
- PermissionTrieTests (7): cluster-level grant cascades to every tag;
equipment-level grant doesn't leak to sibling equipment; multi-group union
ORs flags; no-matching-group returns empty; Galaxy folder-segment grant
doesn't leak to sibling folder; cross-cluster rows don't land in this
cluster's trie; build is idempotent (B.6 invariants).
- TriePermissionEvaluatorTests (8): allow when flag matches; NotGranted when
no matching group; NotGranted when flags insufficient; HistoryRead requires
its own bit (decision-level requirement); cross-cluster session denied;
stale session fails closed; no cached trie denied; MapOperationToPermission
is total across every OpcUaOperation.
- PermissionTrieCacheTests (8): empty cache returns null; install-then-get
round-trips; new generation becomes current; out-of-order install doesn't
downgrade current; invalidate drops one cluster; prune retains most recent;
prune no-op when fewer than keep; cluster isolation.
- UserAuthorizationStateTests (4): fresh is not stale; IsStale after 5 min
default; NeedsRefresh true between freshness + staleness windows.
Full solution dotnet test: 1078 passing (baseline 906, Phase 6.1 = 1042,
Phase 6.2 Stream A = +9, Stream B = +27 = 1078). Pre-existing Client.CLI
Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stream A.1-A.2 per docs/v2/implementation/phase-6-2-authorization-runtime.md.
Seed-data migration (A.3) is a separate follow-up once production LDAP group
DNs are finalised; until then CRUD via the Admin UI handles the fleet set up.
Configuration:
- New AdminRole enum {ConfigViewer, ConfigEditor, FleetAdmin} — string-stored.
- New LdapGroupRoleMapping entity with Id (surrogate PK), LdapGroup (512 chars),
Role (AdminRole enum), ClusterId (nullable, FK to ServerCluster), IsSystemWide,
CreatedAtUtc, Notes.
- EF config: UX_LdapGroupRoleMapping_Group_Cluster unique index on
(LdapGroup, ClusterId) + IX_LdapGroupRoleMapping_Group hot-path index on
LdapGroup for sign-in lookups. Cluster FK cascades on cluster delete.
- Migration 20260419_..._AddLdapGroupRoleMapping generated via `dotnet ef`.
Configuration.Services:
- ILdapGroupRoleMappingService — CRUD surface. Declared as control-plane only
per decision #150; the OPC UA data-path evaluator must NOT depend on this
interface (Phase 6.2 compliance check on control/data-plane separation).
GetByGroupsAsync is the hot-path sign-in lookup.
- LdapGroupRoleMappingService (EF Core impl) enforces the write-time invariant
"exactly one of (ClusterId populated, IsSystemWide=true)" and surfaces
InvalidLdapGroupRoleMappingException on violation. Create auto-populates Id
+ CreatedAtUtc when omitted.
Tests (9 new, all pass) in Configuration.Tests:
- Create sets Id + CreatedAtUtc.
- Create rejects empty LdapGroup.
- Create rejects IsSystemWide=true with populated ClusterId.
- Create rejects IsSystemWide=false with null ClusterId.
- GetByGroupsAsync returns matching rows only.
- GetByGroupsAsync with empty input returns empty (no full-table scan).
- ListAllAsync orders by group then cluster.
- Delete removes the target row.
- Delete of unknown id is a no-op.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Configuration.Tests for
the service-level tests (schema-compliance tests still use the live SQL
fixture).
SchemaComplianceTests updated to expect the new LdapGroupRoleMapping table.
Full solution dotnet test: 1051 passing (baseline 906, Phase 6.1 shipped at
1042, Phase 6.2 Stream A adds 9 = 1051). Pre-existing Client.CLI Subscribe
flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the data + runtime layer of Stream E. The SignalR hub and Blazor /hosts
page refresh (E.2-E.3) are follow-up work paired with the visual-compliance
review per Phase 6.4 patterns — documented as a deferred follow-up below.
Configuration:
- New entity DriverInstanceResilienceStatus with:
DriverInstanceId, HostName (composite PK),
LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth,
LastRecycleUtc, BaselineFootprintBytes, CurrentFootprintBytes,
LastSampledUtc.
- Separate from DriverHostStatus (per-host connectivity view) so a Running
host that has tripped its breaker or is nearing its memory ceiling shows up
distinctly on Admin /hosts. Admin page left-joins both for display.
- OtOpcUaConfigDbContext + Fluent-API config + IX_DriverResilience_LastSampled
index for the stale-sample filter query.
- EF migration: 20260419124034_AddDriverInstanceResilienceStatus.
Core.Resilience:
- DriverResilienceStatusTracker — process-singleton in-memory tracker keyed on
(DriverInstanceId, HostName). CapabilityInvoker + MemoryTracking +
MemoryRecycle callers record failure/success/breaker-open/recycle/footprint
events; a HostedService (Stream E.2 follow-up) samples this tracker every
5 s and persists to the DB. Pure in-memory keeps tests fast + the core
free of EF/SQL dependencies.
Tests:
- DriverResilienceStatusTrackerTests (9 new, all pass): tryget-before-write
returns null; failures accumulate; success resets; breaker/recycle/footprint
fields populate; per-host isolation; snapshot returns all pairs; concurrent
writes don't lose counts.
- SchemaComplianceTests: expected-tables list updated to include the new
DriverInstanceResilienceStatus table.
Full solution dotnet test: 1042 passing (baseline 906, +136 for Phase 6.1 so
far across Streams A/B/C/D/E.1). Pre-existing Client.CLI Subscribe flake
unchanged.
Deferred to follow-up PR (E.2/E.3):
- ResilienceStatusPublisher HostedService that samples DriverResilienceStatusTracker
every 5 s + upserts DriverInstanceResilienceStatus rows.
- Admin FleetStatusHub SignalR hub pushing LastCircuitBreakerOpenUtc /
CurrentBulkheadDepth / LastRecycleUtc on change.
- Admin /hosts Blazor column additions (red badge when
ConsecutiveFailures > breakerThreshold / 2). Visual-compliance reviewer
signoff alongside Phase 6.4 admin-ui patterns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Stream D per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
New Configuration.LocalCache types (alongside the existing single-file
LiteDbConfigCache):
- GenerationSealedCache — file-per-generation sealed snapshots per decision
#148. Each SealAsync writes <cache-root>/<clusterId>/<generationId>.db as a
read-only LiteDB file, then atomically publishes the CURRENT pointer via
temp-file + File.Replace. Prior-generation files stay on disk for audit.
Mixed-generation reads are structurally impossible: ReadCurrentAsync opens
the single file named by CURRENT. Corruption of the pointer or the sealed
file raises GenerationCacheUnavailableException — fails closed, never falls
back silently to an older generation. TryGetCurrentGenerationId returns the
pointer value or null for diagnostics.
- StaleConfigFlag — thread-safe (Volatile.Read/Write) bool. MarkStale when a
read fell back to the cache; MarkFresh when a central-DB read succeeded.
Surfaced on /healthz body and Admin /hosts (Stream C wiring already in
place).
- ResilientConfigReader — wraps a central-DB fetch function with the Stream
D.2 pipeline: timeout 2 s → retry N× jittered (skipped when retryCount=0) →
fallback to the sealed cache. Toggles StaleConfigFlag per outcome. Read path
only — the write path is expected to bypass this wrapper and fail hard on
DB outage so inconsistent writes never land. Cancellation passes through
and is NOT retried.
Configuration.csproj:
- Polly.Core 8.6.6 + Microsoft.Extensions.Logging.Abstractions added.
Tests (17 new, all pass):
- GenerationSealedCacheTests (10): first-boot-no-snapshot throws
GenerationCacheUnavailableException (D.4 scenario C), seal-then-read round
trip, sealed file is ReadOnly on disk, pointer advances to latest, prior
generation file preserved, corrupt sealed file fails closed, missing sealed
file fails closed, corrupt pointer fails closed (D.4 scenario B), same
generation sealed twice is idempotent, independent clusters don't
interfere.
- ResilientConfigReaderTests (4): central-DB success returns value + marks
fresh; central-DB failure exhausts retries + falls back to cache + marks
stale (D.4 scenario A); central-DB + cache both unavailable throws;
cancellation not retried.
- StaleConfigFlagTests (3): default is fresh; toggles; concurrent writes
converge.
Full solution dotnet test: 1033 passing (baseline 906, +127 net across Phase
6.1 Streams A/B/C/D). Pre-existing Client.CLI Subscribe flake unchanged.
Integration into Configuration read paths (DriverInstance enumeration,
LdapGroupRoleMapping fetches, etc.) + the sp_PublishGeneration hook that
writes sealed files lands in the Phase 6.1 Stream E / Admin-refresh PR where
the DB integration surfaces are already touched. Existing LiteDbConfigCache
continues serving its single-file role for the NodeBootstrap path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Stream C per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
Core.Observability (new namespace):
- DriverHealthReport — pure-function aggregation over DriverHealthSnapshot list.
Empty fleet = Healthy. Any Faulted = Faulted. Any Unknown/Initializing (no
Faulted) = NotReady. Any Degraded or Reconnecting (no Faulted, no NotReady)
= Degraded. Else Healthy. HttpStatus(verdict) maps to the Stream C.1 state
matrix: Healthy/Degraded → 200, NotReady/Faulted → 503.
- LogContextEnricher — Serilog LogContext wrapper. Push(id, type, capability,
correlationId) returns an IDisposable scope; inner log calls carry
DriverInstanceId / DriverType / CapabilityName / CorrelationId structured
properties automatically. NewCorrelationId = 12-hex-char GUID slice for
cases where no OPC UA RequestHeader.RequestHandle is in flight.
CapabilityInvoker — now threads LogContextEnricher around every ExecuteAsync /
ExecuteWriteAsync call site. OtOpcUaServer passes driver.DriverType through
so logs correlate to the driver type too. Every capability call emits
structured fields per the Stream C.4 compliance check.
Server.Observability:
- HealthEndpointsHost — standalone HttpListener on http://localhost:4841/
(loopback avoids Windows URL-ACL elevation; remote probing via reverse
proxy or explicit netsh urlacl grant). Routes:
/healthz → 200 when (configDbReachable OR usingStaleConfig); 503 otherwise.
Body: status, uptimeSeconds, configDbReachable, usingStaleConfig.
/readyz → DriverHealthReport.Aggregate + HttpStatus mapping.
Body: verdict, drivers[], degradedDrivers[], uptimeSeconds.
anything else → 404.
Disposal cooperative with the HttpListener shutdown.
- OpcUaApplicationHost starts the health host after the OPC UA server comes up
and disposes it on shutdown. New OpcUaServerOptions knobs:
HealthEndpointsEnabled (default true), HealthEndpointsPrefix (default
http://localhost:4841/).
Program.cs:
- Serilog pipeline adds Enrich.FromLogContext + opt-in JSON file sink via
`Serilog:WriteJson = true` appsetting. Uses Serilog.Formatting.Compact's
CompactJsonFormatter (one JSON object per line — SIEMs like Splunk,
Datadog, Graylog ingest without a regex parser).
Server.Tests:
- Existing 3 OpcUaApplicationHost integration tests now set
HealthEndpointsEnabled=false to avoid port :4841 collisions under parallel
execution.
- New HealthEndpointsHostTests (9): /healthz healthy empty fleet; stale-config
returns 200 with flag; unreachable+no-cache returns 503; /readyz empty/
Healthy/Faulted/Degraded/Initializing drivers return correct status and
bodies; unknown path → 404. Uses ephemeral ports via Interlocked counter.
Core.Tests:
- DriverHealthReportTests (8): empty fleet, all-healthy, any-Faulted trumps,
any-NotReady without Faulted, Degraded without Faulted/NotReady, HttpStatus
per-verdict theory.
- LogContextEnricherTests (8): all 4 properties attach; scope disposes cleanly;
NewCorrelationId shape; null/whitespace driverInstanceId throws.
- CapabilityInvokerEnrichmentTests (2): inner logs carry structured
properties; no context leak outside the call site.
Full solution dotnet test: 1016 passing (baseline 906, +110 for Phase 6.1 so
far across Streams A+B+C). Pre-existing Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes out Stream B per docs/v2/implementation/phase-6-1-resilience-and-observability.md.
Core.Abstractions:
- IDriverSupervisor — process-level supervisor contract a Tier C driver's
out-of-process topology provides (Galaxy Proxy/Supervisor implements this in
a follow-up Driver.Galaxy wiring PR). Concerns: DriverInstanceId + RecycleAsync.
Tier A/B drivers don't implement this; Stream B code asserts tier == C before
ever calling it.
Core.Stability:
- MemoryRecycle — companion to MemoryTracking. On HardBreach, invokes the
supervisor IFF tier == C AND a supervisor is wired. Tier A/B HardBreach logs
a promotion-to-Tier-C recommendation and returns false. Soft/None/Warming
never triggers a recycle at any tier.
- ScheduledRecycleScheduler — Tier C opt-in periodic recycler per decision #67.
Ctor throws for Tier A/B (structural guard — scheduled recycle on an
in-process driver would kill every OPC UA session and every co-hosted
driver). TickAsync(now) advances the schedule by one interval per fire;
RequestRecycleNowAsync drives an ad-hoc recycle without shifting the cron.
- WedgeDetector — demand-aware per decision #147. Classify(state, demand, now)
returns:
* NotApplicable when driver state != Healthy
* Idle when Healthy + no pending work (bulkhead=0 && monitored=0 && historic=0)
* Healthy when Healthy + pending work + progress within threshold
* Faulted when Healthy + pending work + no progress within threshold
Threshold clamps to min 60 s. DemandSignal.HasPendingWork ORs the three counters.
The three false-wedge cases the plan calls out all stay Healthy: idle
subscription-only, slow historian backfill making progress, write-only burst
with drained bulkhead.
Tests (22 new, all pass):
- MemoryRecycleTests (7): Tier C hard-breach requests recycle; Tier A/B
hard-breach never requests; Tier C without supervisor no-ops; soft-breach
at every tier never requests; None/Warming never request.
- ScheduledRecycleSchedulerTests (6): ctor throws for A/B; zero/negative
interval throws; tick before due no-ops; tick at/after due fires once and
advances; RequestRecycleNow fires immediately without shifting schedule;
multiple fires across ticks advance one interval each.
- WedgeDetectorTests (9): threshold clamp to 60 s; unhealthy driver always
NotApplicable; idle subscription stays Idle; pending+fresh progress stays
Healthy; pending+stale progress is Faulted; MonitoredItems active but no
publish is Faulted; MonitoredItems active with fresh publish stays Healthy;
historian backfill with fresh progress stays Healthy; write-only burst with
empty bulkhead is Idle; HasPendingWork theory for any non-zero counter.
Full solution dotnet test: 989 passing (baseline 906, +83 for Phase 6.1 so far).
Pre-existing Client.CLI Subscribe flake unchanged.
Stream B complete. Next up: Stream C (health endpoints + structured logging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stream B.1 — registry invariant:
- DriverTypeMetadata gains a required `DriverTier Tier` field. Every registered
driver type must declare its stability tier so the downstream MemoryTracking,
MemoryRecycle, and resilience-policy layers can resolve the right defaults.
Stamped-at-registration-time enforcement makes the "every driver type has a
non-null Tier" compliance check structurally impossible to fail.
- DriverTypeRegistry API unchanged; one new property on the record.
Stream B.2 — MemoryTracking (Core.Stability):
- Tier-agnostic tracker per decision #146: captures baseline as the median of
samples collected during a post-init warmup window (default 5 min), then
classifies each subsequent sample with the hybrid formula
`soft = max(multiplier × baseline, baseline + floor)`, `hard = 2 × soft`.
- Per-tier constants wired: Tier A mult=3 floor=50 MB, Tier B mult=3 floor=100 MB,
Tier C mult=2 floor=500 MB.
- Never kills. Hard-breach action returns HardBreach; the supervisor that acts
on that signal (MemoryRecycle) is Tier C only per decisions #74, #145 and
lands in the next B.3 commit on this branch.
- Two phases: WarmingUp (samples collected, Warming returned) and Steady
(baseline captured, soft/hard checks active). Transition is automatic when
the warmup window elapses.
Tests (15 new, all pass):
- Warming phase returns Warming until the window elapses.
- Window-elapsed captures median baseline + transitions to Steady.
- Per-tier constants match decision #146 table exactly.
- Soft threshold uses max() — small baseline → floor wins; large baseline →
multiplier wins.
- Hard = 2 × soft.
- Sample below soft = None; at soft = SoftBreach; at/above hard = HardBreach.
- DriverTypeRegistry: theory asserts Tier round-trips for A/B/C.
Full solution dotnet test: 963 passing (baseline 906, +57 net for Phase 6.1
Stream A + Stream B.1/B.2). Pre-existing Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Stream A.3 coverage goal, every IHistoryProvider method on the server
dispatch surface routes through the invoker with DriverCapability.HistoryRead:
- HistoryReadRaw (line 487)
- HistoryReadProcessed (line 551)
- HistoryReadAtTime (line 608)
- HistoryReadEvents (line 665)
Each gets timeout + per-(driver, host) circuit breaker + the default Tier
retry policy (Tier A default: 2 retries at 30s timeout). Inner driver
GetAwaiter().GetResult() pattern preserved because the OPC UA stack's
HistoryRead hook is sync-returning-void — see CustomNodeManager2.
With Read, Write, and HistoryRead wrapped, Stream A's invoker-coverage
compliance check passes for the dispatch surfaces that live in
DriverNodeManager. Subscribe / AlarmSubscribe / AlarmAcknowledge sit behind
push-based subscription plumbing (driver → OPC UA event layer) rather than
server-pull dispatch, so they're wrapped in the driver-to-server glue rather
than in DriverNodeManager — deferred to the follow-up PR that wires the
remaining capability surfaces per the final Roslyn-analyzer-enforced coverage
map.
Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every OnReadValue / OnWriteValue now routes through the process-singleton
DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch
paths gain timeout + per-capability retry + per-(driver, host) circuit breaker
+ bulkhead without touching the individual driver implementations.
Wiring:
- OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor
parameter (default null → instance-owned builder). Keeps the 3 test call
sites that construct OpcUaApplicationHost directly unchanged.
- OtOpcUaServer: requires the builder in its ctor; constructs one
CapabilityInvoker per driver at CreateMasterNodeManager time with default
Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver-
type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the
DriverInstance.ResilienceConfig JSON column for per-instance overrides.
- DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps
the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName,
...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName,
isIdempotent, ...) where isIdempotent comes from the new
_writeIdempotentByFullRef map populated at Variable() registration from
DriverAttributeInfo.WriteIdempotent.
HostName defaults to driver.DriverInstanceId for now — a single-host pipeline
per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per-
call host resolution in a follow-up so failing PLCs can trip per-PLC breakers
without poisoning siblings (decision #144).
Test fixup:
- FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures:
bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter
can exceed 2s under parallel-test-run CPU pressure; the test asserts retry
behavior, not timeout budget, so the longer slack keeps it deterministic.
Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping
the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse
at the boundary. Switching the Resilience types to string removes that friction
and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in
the upcoming server-dispatch wiring PR.
- DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey
- CapabilityInvoker.ctor + _driverInstanceId field
Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now
use distinct "drv-keep" / "drv-drop" literals (previously both were distinct
Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same
literal — caught pre-commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143.
Default is false — writes never auto-retry unless the driver author has marked
the tag as safe to replay.
Core.Abstractions:
- DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the
positional record (back-compatible; every existing call site uses the default).
Driver.Modbus:
- ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates
documented in the param XML: holding-register set-points, configuration
registers. Unsafe: edge-triggered coils, counter-increment addresses.
- ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into
DriverAttributeInfo.WriteIdempotent.
Driver.S7:
- S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates:
DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive
edge-triggered program routines.
- S7Driver.DiscoverAsync propagates the flag.
Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise
the invoker + flaky-driver contract the plan enumerates:
- Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10).
- Non-idempotent write with RetryCount=5 configured still fails on the first
failure — no replay (decision #44 guard at the ExecuteWriteAsync surface).
- Idempotent write with 2 transient failures succeeds on the 3rd attempt.
- Two hosts on the same driver have independent breakers — dead-host trips
its breaker but live-host's first call still succeeds.
Propagation tests:
- ModbusDriverTests: SetPoint WriteIdempotent=true flows into
DriverAttributeInfo; PulseCoil default=false.
- S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit.
Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A
so far). Pre-existing Client.CLI Subscribe flake unchanged.
Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's
OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the
server-side integration piece + needs DI wiring for the pipeline builder —
lands in the next PR on this branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One invoker per (DriverInstance, IDriver) pair; calls ExecuteAsync(capability,
host, callSite) and the invoker resolves the correct pipeline from the shared
DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit
+ pipeline-invalidate takes effect without restarting the invoker or the
driver host.
ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface:
- isIdempotent=false routes through a side pipeline with RetryCount=0 regardless
of what the caller configured. The cache key carries a "::non-idempotent"
suffix so it never collides with the retry-enabled write pipeline.
- isIdempotent=true routes through the normal Write pipeline. If the user has
configured Write retries (opt-in), the idempotent tag gets them; otherwise
default-0 still wins.
The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag
definition once at driver-init time and feeds the boolean into ExecuteWriteAsync.
Tests (6 new):
- Read retries on transient failure; returns value from call site.
- Write non-idempotent does NOT retry even when policy has 3 retries configured
(the explicit decision-#44 guard at the dispatch surface).
- Write idempotent retries when policy allows.
- Write with default tier-A policy (RetryCount=0) never retries regardless of
idempotency flag.
- Different hosts get independent pipelines.
Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment
on WriteIdempotentAttribute no longer references a non-existent type).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the first chunk of the Phase 6.1 Stream A resilience layer per
docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A.
Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up
PRs on the same branch.
Core.Abstractions additions:
- WriteIdempotentAttribute — marker for tag-definition records that opt into
auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45,
#143. Read once via reflection at driver-init time; no per-write cost.
- DriverCapability enum — enumerates the 8 capability surface points
(Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge
/ HistoryRead). AlarmAcknowledge is write-shaped (no retry by default).
- DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this
into DriverTypeMetadata; surfaced here because the resilience policy defaults
key on it.
Core.Resilience new namespace:
- DriverResilienceOptions — per-tier × per-capability policy defaults.
GetTierDefaults(tier) is the source of truth:
* Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5
* Tier B: Read 4s/3, Write 4s/0, breaker threshold 5
* Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles
process-level breaker per decision #68)
Resolve(capability) overlays CapabilityPolicies on top of the defaults.
- DriverResiliencePipelineBuilder — composes Timeout → Retry (capability-
permitting, never on cancellation) → CircuitBreaker (tier-permitting) →
Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on
(DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead
PLC behind a multi-device driver does not open the breaker for healthy
siblings. Invalidate(driverInstanceId) supports Admin-triggered reload.
Tests (30 new, all pass):
- DriverResilienceOptionsTests: tier-default coverage for every capability,
Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker,
resolve-with-override layering.
- DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT
retry on failure (decision #44 guard), dead-host isolation from sibling hosts,
pipeline reuse for same triple, per-capability isolation, breaker opens after
threshold on Tier A, timeout fires, cancellation is not retried,
invalidation scoped to matching instance.
Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing
(baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged
by this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three substantive issues caught + fixed during the validation pass:
1. pymodbus rejects unknown keys at device-list / setup level. My PR 43 commit had `_layout_note`, `_uint16_layout`, `_bits_layout`, `_write_note` device-level JSON-comment fields that crashed pymodbus startup with `INVALID key in setup`. Removed all device-level _* fields. Inline `_quirk` keys WITHIN individual register entries are tolerated by pymodbus 3.13.0 — kept those in dl205.json since they document the byte math per quirk and the README + git history aren't enough context for a hand-author reading raw integer values. Documented the constraint in the top-level _comment of each profile.
2. pymodbus rejects sweeping `write` ranges that include any cell not assigned a type. My initial standard.json had `write: [[0, 2047]]` but only seeded HR[0..31] + HR[100] + HR[200..209] + bits[1024..1109] — pymodbus blew up on cell 32 (gap between HR[31] and HR[100]). Fixed by listing per-block write ranges that exactly mirror the seeded ranges. Same fix in dl205.json (was `[[0, 16383]]`).
3. pymodbus simulator stores all 4 standard Modbus tables in ONE underlying cell array — each cell can only be typed once (BITS or UINT16, not both). My initial standard.json had `bits[0..31]` AND `uint16[0..31]` overlapping at the same addresses; pymodbus crashed with `ERROR "uint16" <Cell> used`. Fixed by relocating coils to address 1024+, well clear of the uint16 entries at 0..209. Documented the layout constraint in the standard.json top-level _comment.
Substantive driver bug fixed: ModbusTcpTransport.ConnectAsync was using `new TcpClient()` (default constructor — dual-stack, IPv6 first) then `ConnectAsync(host, port)` with the user's hostname. .NET's TcpClient default-resolves "localhost" to ::1 first, fails to connect to pymodbus (which binds 0.0.0.0 IPv4-only), and only then retries IPv4 — the failure surfaces as the entire ConnectAsync timeout (2s by default) before the IPv4 attempt even starts. PR 30's smoke test silently SKIPPED because the fixture's TCP probe hit the same dual-stack ordering and timed out. Both fixed: ModbusSimulatorFixture probe now resolves Dns.GetHostAddresses, prefers AddressFamily.InterNetwork, dials IPv4 explicitly. ModbusTcpTransport does the same — resolves first, prefers IPv4, falls back to whatever Dns returns (handles IPv6-only hosts in the future). This is a real production-readiness fix because most Modbus PLCs are IPv4-only — a generic dual-stack TcpClient would burn the entire connect timeout against any IPv4-only PLC, masquerading as a connection failure when the PLC is actually fine.
Smoke-test address shifted HR[100] -> HR[200]. Standard.json's HR[100] is the auto-incrementing register that drives subscribe-and-receive tests, so write-then-read against it would race the increment. HR[200] is the first cell of a writable scratch range present in BOTH simulator profiles. DL205Profile.cs xml-doc updated to explain the shift; tag name "DL205_Smoke_HReg100" -> "Smoke_HReg200" + smoke test references updated. dl205.json gains a matching scratch HR[200..209] range so the smoke test runs identically against either profile.
Validation matrix:
- standard.json boot: clean (TCP 5020 listening within ~3s of pymodbus.simulator launch).
- dl205.json boot: clean.
- pymodbus client direct FC06 to HR[200]=1234 + FC03 read: round-trip OK.
- raw-bytes PowerShell TcpClient FC06 + 12-byte response: matches FC06 spec (echo of address + value).
- DL205SmokeTest against standard.json: 1/1 pass (was failing as 'BadInternalError' due to the dual-stack timeout + tag-name typo — both fixed).
- DL205SmokeTest against dl205.json: 1/1 pass.
- Modbus.Tests Unit suite: 52/52 pass — dual-stack transport fix is non-breaking.
- Solution build clean.
Memory + future-PR setup: pymodbus install + activation pattern is now bullet-pointed at the top of Pymodbus/README.md so future PRs (the per-quirk DL205_<behavior> tests in PR 44+) don't have to repeat the trial-and-error of getting the simulator + integration tests cooperating. The three bugs above are documented inline in the JSON profiles + ModbusTcpTransport so they don't bite again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-kind override shape: each hook receives the pre-filtered nodesToProcess list (NodeHandles for nodes this manager claimed), iterates them, resolves handle.NodeId.Identifier to the driver-side full reference string, and dispatches to the right IHistoryProvider method. Write back into the outer results + errors slots at handle.Index (not the local loop counter — nodesToProcess is a filtered subset of nodesToRead, so indexing by the loop counter lands in the wrong slot for mixed-manager batches). WriteResult helper sets both results[i] AND errors[i]; this matters because MasterNodeManager merges them and leaving errors[i] at its default (BadHistoryOperationUnsupported) overrides a Good result with Unsupported on the wire — this was the subtle failure mode that masked a correctly-constructed HistoryData response during debugging. Failure-isolation per node: NotSupportedException from a driver that doesn't implement a particular HistoryProvider method translates to BadHistoryOperationUnsupported in that slot; generic exceptions log and surface BadInternalError; unresolvable NodeIds get BadNodeIdUnknown. The batch continues unconditionally.
Aggregate mapping: MapAggregate translates ObjectIds.AggregateFunction_Average / Minimum / Maximum / Total / Count to the driver's HistoryAggregateType enum. Null for anything else (e.g. TimeAverage, Interpolative) so the handler surfaces BadAggregateNotSupported at the batch level — per Part 13, one unsupported aggregate means the whole request fails since ReadProcessedDetails carries one aggregate list for all nodes. BuildHistoryData wraps driver DataValueSnapshots as Opc.Ua.HistoryData in an ExtensionObject; BuildHistoryEvent wraps HistoricalEvents as Opc.Ua.HistoryEvent with the canonical BaseEventType field list (EventId, SourceName, Message, Severity, Time, ReceiveTime — the order OPC UA clients that didn't customize the SelectClause expect). ToDataValue preserves null SourceTimestamp (Galaxy historian rows often carry only ServerTimestamp) — synthesizing a SourceTimestamp would lie about actual sample time.
Two address-space changes were required to make the stack dispatch reach the per-kind hooks at all: (1) historized variables get AccessLevels.HistoryRead added to their AccessLevel byte — the base's early-gate check on (variable.AccessLevel & HistoryRead != 0) was rejecting requests before our override ever ran; (2) the driver-root folder gets EventNotifiers.HistoryRead | SubscribeToEvents so HistoryReadEvents can target it (the conventional pattern for alarm-history browse against a driver-owned object). Document the 'set both bits' requirement inline since it's not obvious from the surface API.
OpcHistoryReadResult alias: Opc.Ua.HistoryReadResult (service-layer per-node result) collides with Core.Abstractions.HistoryReadResult (driver-side samples + continuation point) by type name; the alias 'using OpcHistoryReadResult = Opc.Ua.HistoryReadResult' keeps the override signatures unambiguous and the test project applies the mirror pattern for its stub driver impl.
Tests — DriverNodeManagerHistoryMappingTests (12 new Category=Unit cases): MapAggregate translates each supported aggregate NodeId via reflection-backed theory (guards against the stack renaming AggregateFunction_* constants); returns null for unsupported NodeIds (TimeAverage) and null input; BuildHistoryData wraps samples with correct DataValues + SourceTimestamp preservation; BuildHistoryEvent emits the 6-element BaseEventType field list in canonical order (regression guard for a future 'respect the client's SelectClauses' change); null SourceName / Message translate to empty-string Variants (nullable-Variant refactor trap); ToDataValue preserves StatusCode + both timestamps; ToDataValue leaves SourceTimestamp at default when the snapshot omits it. HistoryReadIntegrationTests (5 new Category=Integration): drives a real OPC UA client Session.HistoryRead against a fake HistoryDriver through the running server. Covers raw round-trip (verifies per-node DataValue ordering + values); processed with Average aggregate (captures the driver's received aggregate + interval, asserting MapAggregate routed correctly); unsupported aggregate (TimeAverage → BadAggregateNotSupported); at-time (forwards the per-timestamp list); events (BaseEventType field list shape, SelectClauses populated to satisfy the stack's filter validator). Server.Tests Unit: 55 pass / 0 fail (43 prior + 12 new mapping). Server.Tests Integration: 14 pass / 0 fail (9 prior + 5 new history). Full solution build clean, 0 errors.
lmx-followups.md #1 updated to 'DONE (PRs 35 + 38)' with two explicit deferred items: continuation-point plumbing (driver returns null today so pass-through is fine) and per-SelectClause evaluation in HistoryReadEvents (clients with custom field selections get the canonical BaseEventType layout today).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Interface additions use C# default interface implementations that throw NotSupportedException — existing IHistoryProvider implementations keep compiling, only drivers whose backend carries the relevant capability override. This matches the 'capabilities are optional per driver' design already used by IHistoryProvider.ReadProcessedAsync's docs (Modbus / OPC UA Client drivers never had an event historian and the default-throw path lets callers see BadHistoryOperationUnsupported naturally). New HistoricalEvent record models one historian row (EventId, SourceName, EventTimeUtc + ReceivedTimeUtc — process vs historian-persist timestamps, Message, Severity mapped to OPC UA's 1-1000 range); HistoricalEventsResult pairs the event list with a continuation-point token for future batching. Both live in Core.Abstractions so downstream (Proxy, Host, Server) reference a single domain shape — no Shared-contract leak into the driver-facing interface.
GalaxyProxyDriver.ReadAtTimeAsync maps the domain DateTime[] to Unix-ms longs, calls CallAsync on the existing MessageKind.HistoryReadAtTimeRequest, and trusts the Host's one-sample-per-requested-timestamp contract (the Host pads with bad-quality snapshots for timestamps it can't interpolate; re-aligning on the Proxy side would duplicate the Host's interpolation policy logic). ReadEventsAsync does the same for HistoryReadEventsRequest; ToHistoricalEvent translates GalaxyHistoricalEvent (MessagePack-annotated, Unix-ms) to the domain record, explicitly tagging DateTimeKind.Utc on both timestamp fields so downstream serializers (JSON, OPC UA types) don't apply an unexpected local-time offset.
Tests — HistoricalEventMappingTests (3 new Proxy.Tests unit cases): every field maps correctly from wire to domain; null SourceName and null DisplayText preserve through the mapping (system events without a source come out with null so callers can distinguish them from alarm events); both timestamps come out as DateTimeKind.Utc (regression guard against a future refactor using DateTime.FromFileTimeUtc or similar that defaults to Unspecified). Driver.Galaxy.Proxy.Tests Unit suite: 17 pass / 0 fail (14 prior + 3 new). Full solution build clean, 0 errors.
Scope exclusions — DriverNodeManager HistoryRead service-handler wiring (on the OPC UA Server side, where HistoryReadAtTime and HistoryReadEvents service requests land) and the full-loop integration test (OPC UA client → server → IPC → Host → HistorianDataSource → back) are deferred to a focused follow-up PR. The capability surface is the load-bearing change; wiring the service handlers is mechanical in comparison and worth its own PR for reviewability. docs/v2/lmx-followups.md #1 updated with the split.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Polling chosen over event-driven for initial scope: simpler, matches Admin UI consumer cadence, avoids DriverHost lifecycle-event plumbing that doesn't exist today. Event-driven push for sub-heartbeat latency is a straightforward follow-up.
Admin.Services.HostStatusService left-joins DriverHostStatus against ClusterNode on NodeId so rows persist even when the ClusterNode entry doesn't exist yet (first-boot bootstrap case). StaleThreshold = 30s — covers one missed publisher heartbeat plus a generous buffer for clock skew and GC pauses. Admin Components/Pages/Hosts.razor — FleetAdmin-visible page grouped by cluster (handles the '(unassigned)' case for rows without a matching ClusterNode). Four summary cards (Hosts / Running / Stale / Faulted); per-cluster table with Node / Driver / Host / State + Stale-badge / Last-transition / Last-seen / Detail columns; 10s auto-refresh via IServiceScopeFactory timer pattern matching FleetStatusPoller + Fleet dashboard (PR 27). Row-class highlighting: Faulted → table-danger, Stale → table-warning, else default. State badge maps DriverHostState enum to bootstrap color classes. Sidebar link added between 'Fleet status' and 'Clusters'.
Server csproj adds Microsoft.EntityFrameworkCore.SqlServer 10.0.0 + registers OtOpcUaConfigDbContext in Program.cs scoped via NodeOptions.ConfigDbConnectionString (no Admin-style manual SQL raw — the DbContext is the only access path, keeps migrations owner-of-record).
Tests — HostStatusPublisherTests (4 new Integration cases, uses per-run throwaway DB matching the FleetStatusPollerTests pattern): publisher upserts one row per host from each probe-capable driver and skips non-probe drivers; second tick advances LastSeenUtc without creating duplicate rows (upsert pattern verified end-to-end); state change between ticks updates State AND StateChangedUtc (datetime2(3) rounds to millisecond precision so comparison uses 1ms tolerance — documented inline); MapState translates every HostState enum member. Server.Tests Integration: 4 new tests pass. Admin build clean, Admin.Tests Unit still 23 / 0. docs/v2/lmx-followups.md item #7 marked DONE with three explicit deferred items (event-driven push, failure-count column, SignalR fan-out).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DriverHostState enum lives in Configuration.Enums/ rather than reusing Core.Abstractions.HostState so the Configuration project stays free of driver-runtime dependencies (it's referenced by both the Admin process and the Server process, so pulling in the driver-abstractions assembly to every Admin build would be unnecessary weight). The server-side publisher hosted service (follow-up PR 34) will translate HostStatusChangedEventArgs.NewState to this enum on every transition.
No foreign key to ClusterNode — a Server may start reporting host status before its ClusterNode row exists (first-boot bootstrap), and we'd rather keep the status row than drop it. The Admin-side service that renders the dashboard will left-join on NodeId when presenting. Two indexes declared: IX_DriverHostStatus_Node drives the per-cluster drill-down (Admin UI joins ClusterNode on ClusterId to pick which NodeIds to fetch), IX_DriverHostStatus_LastSeen drives the stale-row query (now - LastSeen > threshold).
EF migration AddDriverHostStatus creates the table + PK + both indexes. Model snapshot updated. SchemaComplianceTests expected-tables list extended. DriverHostStatusTests (3 new cases, category SchemaCompliance, uses the shared fixture DB): composite key allows same (host, driver) across different nodes AND same (node, host) across different drivers — both real-world cases the publisher needs to support; upsert-in-place pattern (fetch-by-composite-PK, mutate, save) produces one row not two — the pattern the publisher will use; State enum persists as string not int — reading the DB via ADO.NET returns 'Faulted' not '3'.
Configuration.Tests SchemaCompliance suite: 10 pass / 0 fail (7 prior + 3 new). Configuration build clean. No Server or Admin code changes yet — publisher + /hosts page are PR 34.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Active Directory compatibility. LdapOptions xml-doc expanded with a cheat-sheet covering Server (DC FQDN), Port 389 vs 636, UseTls=true under AD LDAP-signing enforcement, dedicated read-only service account DN, sAMAccountName vs userPrincipalName vs cn trade-offs, memberOf DN shape (CN=Group,OU=...,DC=... with the CN= RDN stripped to become the GroupToRole key), and the explicit 'nested groups NOT expanded' call-out (LDAP_MATCHING_RULE_IN_CHAIN / tokenGroups is a future authenticator enhancement, not a config change). docs/security.md §'Active Directory configuration' adds a complete appsettings.json snippet with realistic AD group names (OPCUA-Operators → WriteOperate, OPCUA-Engineers → WriteConfigure, OPCUA-AlarmAck → AlarmAck, OPCUA-Tuners → WriteTune), LDAPS port 636, TLS on, insecure-LDAP off, and operator-facing notes on each field. LdapUserAuthenticatorAdCompatTests (5 unit guards): ExtractFirstRdnValue parses AD-style 'CN=OPCUA-Operators,OU=...,DC=...' DNs correctly (case-preserving — operators' GroupToRole keys stay readable); also handles mixed case and spaces in group names ('Domain Users'); also works against the OpenLDAP ou=<group>,ou=groups shape (GLAuth) so one extractor tolerates both memberOf formats common in the field; EscapeLdapFilter escapes the RFC 4515 injection set (\, *, (, ), \0) so a malicious login like 'admin)(cn=*' can't break out of the filter; default UserNameAttribute regression guard.
Test posture — Server.Tests Unit: 43 pass / 0 fail (38 prior + 5 new AD-compat guards). Server.Tests LiveLdap category: 6 pass / 0 fail against running GLAuth (would skip cleanly without). Server build clean, 0 errors, 0 warnings.
Deferred: the session-identity end-to-end check (drive a full OPC UA UserName session, then read a 'whoami' node to verify the role landed on RoleBasedIdentity). That needs a test-only address-space node and is scoped for a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sidebar's 'Signed in as' line now wraps the display name in a link to /account so the existing sidebar-compact view becomes the entry point for the fuller page — keeps the sign-out button where it was for muscle memory, just adds the detail page one click away. Page is gated with [Authorize] (any authenticated admin) rather than a specific role — the capability table deliberately works for every signed-in user so they can see what they DON'T have access to, which helps them file the right ticket with their LDAP admin instead of getting a plain Access Denied when navigating blindly.
Capability → required-role table is defined as a private readonly record list in the page rather than pulled from a service because it's a UI-presentation concern, not runtime policy state — the runtime policy IS Program.cs's AddAuthorizationBuilder + each page's [Authorize] attribute, and this table just mirrors it for operator readability. Comment on the list reminds future-me to extend it when a new policy or [Authorize] page lands. No behavior change if roles are empty, but the page surfaces a hint ('Sign-in would have been blocked, so if you're seeing this, the session claim is likely stale') that nudges the operator toward signing out + back in.
No new tests added — the page is pure display over claims; its only logic is the 'has-capability' Any-overlap check which is exactly what ASP.NET's [Authorize(Roles=...)] does in-framework, and duplicating that in a unit test would test the framework rather than our code. Admin.Tests Unit stays 23 pass / 0 fail. Admin build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>