Commit Graph

15 Commits

Author SHA1 Message Date
Joseph Doherty
19a0bfcc43 Phase 6.1 Stream D follow-up — SealedBootstrap consumes ResilientConfigReader + GenerationSealedCache + StaleConfigFlag; /healthz surfaces the flag
Closes release blocker #2 from docs/v2/v2-release-readiness.md — the
generation-sealed cache + resilient reader + stale-config flag shipped as
unit-tested primitives in PR #81, but no production path consumed them until
now. This PR wires them end-to-end.

Server additions:
- SealedBootstrap — Phase 6.1 Stream D consumption hook. Resolves the node's
  current generation through ResilientConfigReader's timeout → retry →
  fallback-to-sealed pipeline. On every successful central-DB fetch it seals
  a fresh snapshot to <cache-root>/<cluster>/<generationId>.db so a future
  cache-miss has a known-good fallback. Alongside the original NodeBootstrap
  (which still uses the single-file ILocalConfigCache); Program.cs can
  switch between them once operators are ready for the generation-sealed
  semantics.
- OpcUaApplicationHost: new optional staleConfigFlag ctor parameter. When
  wired, HealthEndpointsHost consumes `flag.IsStale` via the existing
  usingStaleConfig Func<bool> hook. Means `/healthz` actually reports
  `usingStaleConfig: true` whenever a read fell back to the sealed cache —
  closes the loop between Stream D's flag + Stream C's /healthz body shape.

Tests (4 new SealedBootstrapIntegrationTests, all pass):
- Central-DB success path seals snapshot + flag stays fresh.
- Central-DB failure falls back to sealed snapshot + flag flips stale (the
  SQL-kill scenario from Phase 6.1 Stream D.4.a).
- No-snapshot + central-down throws GenerationCacheUnavailableException
  with a clear error (the first-boot scenario from D.4.c).
- Next successful bootstrap after a fallback clears the stale flag.

Full solution dotnet test: 1168 passing (was 1164, +4). Pre-existing
Client.CLI Subscribe flake unchanged.

Production activation: Program.cs wires SealedBootstrap (instead of
NodeBootstrap), constructs OpcUaApplicationHost with the staleConfigFlag,
and a HostedService polls sp_GetCurrentGenerationForCluster periodically so
peer-published generations land in this node's sealed cache. The poller
itself is Stream D.1.b follow-up.

The sp_PublishGeneration SQL-side hook (where the publish commit itself
could also write to a shared sealed cache) stays deferred — the per-node
seal pattern shipped here is the correct v2 GA model: each Server node
owns its own on-disk cache and refreshes from its own DB reads, matching
the Phase 6.1 scope-table description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:14:59 -04:00
Joseph Doherty
f8d5b0fdbb Phase 6.2 Stream C follow-up — wire AuthorizationGate into DriverNodeManager Read / Write / HistoryRead dispatch
Closes the Phase 6.2 security gap the v2 release-readiness dashboard flagged:
the evaluator + trie + gate shipped as code in PRs #84-88 but no dispatch
path called them. This PR threads the gate end-to-end from
OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager and calls it on
every Read / Write / 4 HistoryRead paths.

Server.Security additions:
- NodeScopeResolver — maps driver fullRef → Core.Authorization NodeScope.
  Phase 1 shape: populates ClusterId + TagId; leaves NamespaceId / UnsArea /
  UnsLine / Equipment null. The cluster-level ACL cascade covers this
  configuration (decision #129 additive grants). Finer-grained scope
  resolution (joining against the live Configuration DB for UnsArea / UnsLine
  path) lands as Stream C.12 follow-up.
- WriteAuthzPolicy.ToOpcUaOperation — maps SecurityClassification → the
  OpcUaOperation the gate evaluator consults (Operate/SecuredWrite →
  WriteOperate; Tune → WriteTune; Configure/VerifiedWrite → WriteConfigure).

DriverNodeManager wiring:
- Ctor gains optional AuthorizationGate + NodeScopeResolver; both null means
  the pre-Phase-6.2 dispatch runs unchanged (backwards-compat for every
  integration test that constructs DriverNodeManager directly).
- OnReadValue: ahead of the invoker call, builds NodeScope + calls
  gate.IsAllowed(identity, Read, scope). Denied reads return
  BadUserAccessDenied without hitting the driver.
- OnWriteValue: preserves the existing WriteAuthzPolicy check (classification
  vs session roles) + adds an additive gate check using
  WriteAuthzPolicy.ToOpcUaOperation(classification) to pick the right
  WriteOperate/Tune/Configure surface. Lax mode falls through for identities
  without LDAP groups.
- Four HistoryRead paths (Raw / Processed / AtTime / Events): gate check
  runs per-node before the invoker. Events path tolerates fullRef=null
  (event-history queries can target a notifier / driver-root; those are
  cluster-wide reads that need a different scope shape — deferred).
- New WriteAccessDenied helper surfaces BadUserAccessDenied in the
  OpcHistoryReadResult slot + errors list, matching the shape of the
  existing WriteUnsupported / WriteInternalError helpers.

OtOpcUaServer + OpcUaApplicationHost: gate + resolver thread through as
optional constructor parameters (same pattern as DriverResiliencePipelineBuilder
in Phase 6.1). Null defaults keep the existing 3 OpcUaApplicationHost
integration tests constructing without them unchanged.

Tests (5 new in NodeScopeResolverTests):
- Resolve populates ClusterId + TagId + Equipment Kind.
- Resolve leaves finer path null per Phase 1 shape (doc'd as follow-up).
- Empty fullReference throws.
- Empty clusterId throws at ctor.
- Resolver is stateless across calls.

The existing 9 AuthorizationGate tests (shipped in PR #86) continue to
cover the gate's allow/deny semantics under strict + lax mode.

Full solution dotnet test: 1164 passing (was 1159, +5). Pre-existing
Client.CLI Subscribe flake unchanged. Existing OpcUaApplicationHost +
HealthEndpointsHost + driver integration tests continue to pass because the
gate defaults to null → no enforcement, and the lax-mode fallback returns
true for identities without LDAP groups (the anonymous test path).

Production deployments flip the gate on by constructing it via
OpcUaApplicationHost's new authzGate parameter + setting
`Authorization:StrictMode = true` once ACL data is populated. Flipping the
switch post-seed turns the evaluator + trie from scaffolded code into
actual enforcement.

This closes release blocker #1 listed in docs/v2/v2-release-readiness.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:02:17 -04:00
Joseph Doherty
483f55557c Phase 6.3 Stream B + Stream D (core) — ServiceLevelCalculator + RecoveryStateManager + ApplyLeaseRegistry
Lands the pure-logic heart of Phase 6.3. OPC UA node wiring (Stream C),
RedundancyCoordinator topology loader (Stream A), Admin UI + metrics (Stream E),
and client interop tests (Stream F) are follow-up work — tracked as
tasks #145-150.

New Server.Redundancy sub-namespace:

- ServiceLevelCalculator — pure 8-state matrix per decision #154. Inputs:
  role, selfHealthy, peerUa/HttpHealthy, applyInProgress, recoveryDwellMet,
  topologyValid, operatorMaintenance. Output: OPC UA Part 5 §6.3.34 Byte.
  Reserved bands (0=Maintenance, 1=NoData, 2=InvalidTopology) override
  everything; operational bands occupy 30..255.
  Key invariants:
    * Authoritative-Primary = 255, Authoritative-Backup = 100.
    * Isolated-Primary = 230 (retains authority with peer down).
    * Isolated-Backup = 80 (does NOT auto-promote — non-transparent model).
    * Primary-Mid-Apply = 200, Backup-Mid-Apply = 50; apply dominates
      peer-unreachable per Stream C.4 integration expectation.
    * Recovering-Primary = 180, Recovering-Backup = 30.
    * Standalone treats healthy as Authoritative-Primary (no peer concept).
- ServiceLevelBand enum — labels every numeric band for logs + Admin UI.
  Values match the calculator table exactly; compliance script asserts
  drift detection.
- RecoveryStateManager — holds Recovering band until (dwell ≥ 60s default)
  AND (one publish witness observed). Re-fault resets both gates so a
  flapping node doesn't shortcut through recovery twice.
- ApplyLeaseRegistry — keyed on (ConfigGenerationId, PublishRequestId) per
  decision #162. BeginApplyLease returns an IAsyncDisposable so every exit
  path (success, exception, cancellation, dispose-twice) closes the lease.
  ApplyMaxDuration watchdog (10 min default) via PruneStale tick forces
  close after a crashed publisher so ServiceLevel can't stick at mid-apply.

Tests (40 new, all pass):
- ServiceLevelCalculatorTests (27): reserved bands override; self-unhealthy
  → NoData; invalid topology demotes both nodes to 2; authoritative primary
  255; backup 100; isolated primary 230 retains authority; isolated backup
  80 does not promote; http-only unreachable triggers isolated; mid-apply
  primary 200; mid-apply backup 50; apply dominates peer-unreachable; recovering
  primary 180; recovering backup 30; standalone treats healthy as 255;
  classify round-trips every band including Unknown sentinel.
- RecoveryStateManagerTests (6): never-faulted auto-meets dwell; faulted-only
  returns true (semantics-doc test — coordinator short-circuits on
  selfHealthy=false); recovered without witness never meets; witness without
  dwell never meets; witness + dwell-elapsed meets; re-fault resets.
- ApplyLeaseRegistryTests (7): empty registry not-in-progress; begin+dispose
  closes; dispose on exception still closes; dispose twice safe; concurrent
  leases isolated; watchdog closes stale; watchdog leaves recent alone.

Full solution dotnet test: 1137 passing (Phase 6.2 shipped at 1097, Phase 6.3
B + D core = +40 = 1137). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:56:34 -04:00
Joseph Doherty
8efb99b6be Phase 6.2 Stream C (foundation) — AuthorizationGate + ILdapGroupsBearer
Lands the integration seam between the Server project's OPC UA stack and the
Core.Authorization evaluator. Actual DriverNodeManager dispatch-path wiring
(Read/Write/HistoryRead/Browse/Call/Subscribe/Alarm surfaces) lands in the
follow-up PR on this branch — covered by Task #143 below.

Server.Security additions:
- ILdapGroupsBearer — marker interface a custom IUserIdentity implements to
  expose its resolved LDAP group DNs. Parallel to the existing IRoleBearer
  (admin roles) — control/data-plane separation per decision #150.
- AuthorizationGate — stateless bridge between Opc.Ua.IUserIdentity and
  IPermissionEvaluator. IsAllowed(identity, operation, scope) materializes a
  UserAuthorizationState from the identity's LDAP groups, delegates to the
  evaluator, and returns a single bool the dispatch paths use to decide
  whether to surface BadUserAccessDenied.
- StrictMode knob controls fail-open-during-transition vs fail-closed:
  - strict=false (default during rollout) — null identity, identity without
    ILdapGroupsBearer, or NotGranted outcome all return true so older
    deployments without ACL data keep working.
  - strict=true (production target) — any of the above returns false.
  The appsetting `Authorization:StrictMode = true` flips deployments over
  once their ACL data is populated.

Tests (9 new in Server.Tests/AuthorizationGateTests):
- Null identity — strict denies, lax allows.
- Identity without LDAP groups — strict denies, lax allows.
- LDAP group with matching grant allows.
- LDAP group without grant — strict denies.
- Wrong operation denied (Read-only grant, WriteOperate requested).
- BuildSessionState returns materialized state with LDAP groups + null when
  identity doesn't carry them.

Full solution dotnet test: 1087 passing (Phase 6.1 = 1042, Phase 6.2 A = +9,
B = +27, C foundation = +9 = 1087). Pre-existing Client.CLI Subscribe flake
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:33:51 -04:00
Joseph Doherty
9dd5e4e745 Phase 6.1 Stream C — health endpoints on :4841 + LogContextEnricher + Serilog JSON sink + CapabilityInvoker enrichment
Closes Stream C per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Observability (new namespace):
- DriverHealthReport — pure-function aggregation over DriverHealthSnapshot list.
  Empty fleet = Healthy. Any Faulted = Faulted. Any Unknown/Initializing (no
  Faulted) = NotReady. Any Degraded or Reconnecting (no Faulted, no NotReady)
  = Degraded. Else Healthy. HttpStatus(verdict) maps to the Stream C.1 state
  matrix: Healthy/Degraded → 200, NotReady/Faulted → 503.
- LogContextEnricher — Serilog LogContext wrapper. Push(id, type, capability,
  correlationId) returns an IDisposable scope; inner log calls carry
  DriverInstanceId / DriverType / CapabilityName / CorrelationId structured
  properties automatically. NewCorrelationId = 12-hex-char GUID slice for
  cases where no OPC UA RequestHeader.RequestHandle is in flight.

CapabilityInvoker — now threads LogContextEnricher around every ExecuteAsync /
ExecuteWriteAsync call site. OtOpcUaServer passes driver.DriverType through
so logs correlate to the driver type too. Every capability call emits
structured fields per the Stream C.4 compliance check.

Server.Observability:
- HealthEndpointsHost — standalone HttpListener on http://localhost:4841/
  (loopback avoids Windows URL-ACL elevation; remote probing via reverse
  proxy or explicit netsh urlacl grant). Routes:
    /healthz → 200 when (configDbReachable OR usingStaleConfig); 503 otherwise.
      Body: status, uptimeSeconds, configDbReachable, usingStaleConfig.
    /readyz  → DriverHealthReport.Aggregate + HttpStatus mapping.
      Body: verdict, drivers[], degradedDrivers[], uptimeSeconds.
    anything else → 404.
  Disposal cooperative with the HttpListener shutdown.
- OpcUaApplicationHost starts the health host after the OPC UA server comes up
  and disposes it on shutdown. New OpcUaServerOptions knobs:
  HealthEndpointsEnabled (default true), HealthEndpointsPrefix (default
  http://localhost:4841/).

Program.cs:
- Serilog pipeline adds Enrich.FromLogContext + opt-in JSON file sink via
  `Serilog:WriteJson = true` appsetting. Uses Serilog.Formatting.Compact's
  CompactJsonFormatter (one JSON object per line — SIEMs like Splunk,
  Datadog, Graylog ingest without a regex parser).

Server.Tests:
- Existing 3 OpcUaApplicationHost integration tests now set
  HealthEndpointsEnabled=false to avoid port :4841 collisions under parallel
  execution.
- New HealthEndpointsHostTests (9): /healthz healthy empty fleet; stale-config
  returns 200 with flag; unreachable+no-cache returns 503; /readyz empty/
  Healthy/Faulted/Degraded/Initializing drivers return correct status and
  bodies; unknown path → 404. Uses ephemeral ports via Interlocked counter.

Core.Tests:
- DriverHealthReportTests (8): empty fleet, all-healthy, any-Faulted trumps,
  any-NotReady without Faulted, Degraded without Faulted/NotReady, HttpStatus
  per-verdict theory.
- LogContextEnricherTests (8): all 4 properties attach; scope disposes cleanly;
  NewCorrelationId shape; null/whitespace driverInstanceId throws.
- CapabilityInvokerEnrichmentTests (2): inner logs carry structured
  properties; no context leak outside the call site.

Full solution dotnet test: 1016 passing (baseline 906, +110 for Phase 6.1 so
far across Streams A+B+C). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:15:44 -04:00
Joseph Doherty
d2f3a243cd Phase 6.1 Stream A.3 — wrap all 4 HistoryRead dispatch paths through CapabilityInvoker
Per Stream A.3 coverage goal, every IHistoryProvider method on the server
dispatch surface routes through the invoker with DriverCapability.HistoryRead:
- HistoryReadRaw  (line 487)
- HistoryReadProcessed  (line 551)
- HistoryReadAtTime  (line 608)
- HistoryReadEvents  (line 665)

Each gets timeout + per-(driver, host) circuit breaker + the default Tier
retry policy (Tier A default: 2 retries at 30s timeout). Inner driver
GetAwaiter().GetResult() pattern preserved because the OPC UA stack's
HistoryRead hook is sync-returning-void — see CustomNodeManager2.

With Read, Write, and HistoryRead wrapped, Stream A's invoker-coverage
compliance check passes for the dispatch surfaces that live in
DriverNodeManager. Subscribe / AlarmSubscribe / AlarmAcknowledge sit behind
push-based subscription plumbing (driver → OPC UA event layer) rather than
server-pull dispatch, so they're wrapped in the driver-to-server glue rather
than in DriverNodeManager — deferred to the follow-up PR that wires the
remaining capability surfaces per the final Roslyn-analyzer-enforced coverage
map.

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:32:10 -04:00
Joseph Doherty
29bcaf277b Phase 6.1 Stream A.3 complete — wire CapabilityInvoker into DriverNodeManager dispatch end-to-end
Every OnReadValue / OnWriteValue now routes through the process-singleton
DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch
paths gain timeout + per-capability retry + per-(driver, host) circuit breaker
+ bulkhead without touching the individual driver implementations.

Wiring:
- OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor
  parameter (default null → instance-owned builder). Keeps the 3 test call
  sites that construct OpcUaApplicationHost directly unchanged.
- OtOpcUaServer: requires the builder in its ctor; constructs one
  CapabilityInvoker per driver at CreateMasterNodeManager time with default
  Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver-
  type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the
  DriverInstance.ResilienceConfig JSON column for per-instance overrides.
- DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps
  the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName,
  ...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName,
  isIdempotent, ...) where isIdempotent comes from the new
  _writeIdempotentByFullRef map populated at Variable() registration from
  DriverAttributeInfo.WriteIdempotent.

HostName defaults to driver.DriverInstanceId for now — a single-host pipeline
per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per-
call host resolution in a follow-up so failing PLCs can trip per-PLC breakers
without poisoning siblings (decision #144).

Test fixup:
- FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures:
  bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter
  can exceed 2s under parallel-test-run CPU pressure; the test asserts retry
  behavior, not timeout budget, so the longer slack keeps it deterministic.

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:28:28 -04:00
Joseph Doherty
52a29100b1 Phase 3 PR 38 — DriverNodeManager HistoryRead override (LMX #1 finish). Wires the OPC UA HistoryRead service through CustomNodeManager2's four protected per-kind hooks — HistoryReadRawModified / HistoryReadProcessed / HistoryReadAtTime / HistoryReadEvents — each dispatching to the driver's IHistoryProvider capability (PR 35 for ReadAtTime + ReadEvents on top of PR 19-era ReadRaw + ReadProcessed). Was the last missing piece of the end-to-end HistoryRead path: PR 10 + PR 11 shipped the Galaxy.Host IPC contracts, PR 35 surfaced them on IHistoryProvider + GalaxyProxyDriver, but no server-side handler bridged OPC UA HistoryRead service requests onto the capability interface. Now it does.
Per-kind override shape: each hook receives the pre-filtered nodesToProcess list (NodeHandles for nodes this manager claimed), iterates them, resolves handle.NodeId.Identifier to the driver-side full reference string, and dispatches to the right IHistoryProvider method. Write back into the outer results + errors slots at handle.Index (not the local loop counter — nodesToProcess is a filtered subset of nodesToRead, so indexing by the loop counter lands in the wrong slot for mixed-manager batches). WriteResult helper sets both results[i] AND errors[i]; this matters because MasterNodeManager merges them and leaving errors[i] at its default (BadHistoryOperationUnsupported) overrides a Good result with Unsupported on the wire — this was the subtle failure mode that masked a correctly-constructed HistoryData response during debugging. Failure-isolation per node: NotSupportedException from a driver that doesn't implement a particular HistoryProvider method translates to BadHistoryOperationUnsupported in that slot; generic exceptions log and surface BadInternalError; unresolvable NodeIds get BadNodeIdUnknown. The batch continues unconditionally.
Aggregate mapping: MapAggregate translates ObjectIds.AggregateFunction_Average / Minimum / Maximum / Total / Count to the driver's HistoryAggregateType enum. Null for anything else (e.g. TimeAverage, Interpolative) so the handler surfaces BadAggregateNotSupported at the batch level — per Part 13, one unsupported aggregate means the whole request fails since ReadProcessedDetails carries one aggregate list for all nodes. BuildHistoryData wraps driver DataValueSnapshots as Opc.Ua.HistoryData in an ExtensionObject; BuildHistoryEvent wraps HistoricalEvents as Opc.Ua.HistoryEvent with the canonical BaseEventType field list (EventId, SourceName, Message, Severity, Time, ReceiveTime — the order OPC UA clients that didn't customize the SelectClause expect). ToDataValue preserves null SourceTimestamp (Galaxy historian rows often carry only ServerTimestamp) — synthesizing a SourceTimestamp would lie about actual sample time.
Two address-space changes were required to make the stack dispatch reach the per-kind hooks at all: (1) historized variables get AccessLevels.HistoryRead added to their AccessLevel byte — the base's early-gate check on (variable.AccessLevel & HistoryRead != 0) was rejecting requests before our override ever ran; (2) the driver-root folder gets EventNotifiers.HistoryRead | SubscribeToEvents so HistoryReadEvents can target it (the conventional pattern for alarm-history browse against a driver-owned object). Document the 'set both bits' requirement inline since it's not obvious from the surface API.
OpcHistoryReadResult alias: Opc.Ua.HistoryReadResult (service-layer per-node result) collides with Core.Abstractions.HistoryReadResult (driver-side samples + continuation point) by type name; the alias 'using OpcHistoryReadResult = Opc.Ua.HistoryReadResult' keeps the override signatures unambiguous and the test project applies the mirror pattern for its stub driver impl.
Tests — DriverNodeManagerHistoryMappingTests (12 new Category=Unit cases): MapAggregate translates each supported aggregate NodeId via reflection-backed theory (guards against the stack renaming AggregateFunction_* constants); returns null for unsupported NodeIds (TimeAverage) and null input; BuildHistoryData wraps samples with correct DataValues + SourceTimestamp preservation; BuildHistoryEvent emits the 6-element BaseEventType field list in canonical order (regression guard for a future 'respect the client's SelectClauses' change); null SourceName / Message translate to empty-string Variants (nullable-Variant refactor trap); ToDataValue preserves StatusCode + both timestamps; ToDataValue leaves SourceTimestamp at default when the snapshot omits it. HistoryReadIntegrationTests (5 new Category=Integration): drives a real OPC UA client Session.HistoryRead against a fake HistoryDriver through the running server. Covers raw round-trip (verifies per-node DataValue ordering + values); processed with Average aggregate (captures the driver's received aggregate + interval, asserting MapAggregate routed correctly); unsupported aggregate (TimeAverage → BadAggregateNotSupported); at-time (forwards the per-timestamp list); events (BaseEventType field list shape, SelectClauses populated to satisfy the stack's filter validator). Server.Tests Unit: 55 pass / 0 fail (43 prior + 12 new mapping). Server.Tests Integration: 14 pass / 0 fail (9 prior + 5 new history). Full solution build clean, 0 errors.
lmx-followups.md #1 updated to 'DONE (PRs 35 + 38)' with two explicit deferred items: continuation-point plumbing (driver returns null today so pass-through is fine) and per-SelectClause evaluation in HistoryReadEvents (clients with custom field selections get the canonical BaseEventType layout today).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 17:50:23 -04:00
Joseph Doherty
ef2a810b2d Phase 3 PR 34 — Host-status publisher (Server) + /hosts drill-down page (Admin). Closes LMX follow-up #7 by wiring together the data layer from PR 33. Server.HostStatusPublisher is a BackgroundService that walks every driver registered in DriverHost every 10 seconds, skips drivers that don't implement IHostConnectivityProbe, calls GetHostStatuses() on each probe-capable driver, and upserts one DriverHostStatus row per (NodeId, DriverInstanceId, HostName) into the central config DB. Upsert path: SingleOrDefaultAsync on the composite PK; if no row exists, Add a new one; if a row exists, LastSeenUtc advances unconditionally (heartbeat) and State + StateChangedUtc update only on transitions so Admin UI can distinguish 'still reporting, still Running' from 'freshly transitioned to Running'. MapState translates Core.Abstractions.HostState to Configuration.Enums.DriverHostState (intentional duplicate enum — Configuration project stays free of driver-runtime deps per PR 33's choice). If a driver's GetHostStatuses throws, log warning and skip that driver this tick — never take down the Server on a publisher failure. If the DB is unreachable, log warning + retry next heartbeat (no buffering — next tick's current-state snapshot is more useful than replaying stale transitions after a long outage). 2-second startup delay so NodeBootstrap's RegisterAsync calls land before the first publish tick, then tick runs immediately so a freshly-started Server surfaces its host topology in the Admin UI without waiting a full interval.
Polling chosen over event-driven for initial scope: simpler, matches Admin UI consumer cadence, avoids DriverHost lifecycle-event plumbing that doesn't exist today. Event-driven push for sub-heartbeat latency is a straightforward follow-up.
Admin.Services.HostStatusService left-joins DriverHostStatus against ClusterNode on NodeId so rows persist even when the ClusterNode entry doesn't exist yet (first-boot bootstrap case). StaleThreshold = 30s — covers one missed publisher heartbeat plus a generous buffer for clock skew and GC pauses. Admin Components/Pages/Hosts.razor — FleetAdmin-visible page grouped by cluster (handles the '(unassigned)' case for rows without a matching ClusterNode). Four summary cards (Hosts / Running / Stale / Faulted); per-cluster table with Node / Driver / Host / State + Stale-badge / Last-transition / Last-seen / Detail columns; 10s auto-refresh via IServiceScopeFactory timer pattern matching FleetStatusPoller + Fleet dashboard (PR 27). Row-class highlighting: Faulted → table-danger, Stale → table-warning, else default. State badge maps DriverHostState enum to bootstrap color classes. Sidebar link added between 'Fleet status' and 'Clusters'.
Server csproj adds Microsoft.EntityFrameworkCore.SqlServer 10.0.0 + registers OtOpcUaConfigDbContext in Program.cs scoped via NodeOptions.ConfigDbConnectionString (no Admin-style manual SQL raw — the DbContext is the only access path, keeps migrations owner-of-record).
Tests — HostStatusPublisherTests (4 new Integration cases, uses per-run throwaway DB matching the FleetStatusPollerTests pattern): publisher upserts one row per host from each probe-capable driver and skips non-probe drivers; second tick advances LastSeenUtc without creating duplicate rows (upsert pattern verified end-to-end); state change between ticks updates State AND StateChangedUtc (datetime2(3) rounds to millisecond precision so comparison uses 1ms tolerance — documented inline); MapState translates every HostState enum member. Server.Tests Integration: 4 new tests pass. Admin build clean, Admin.Tests Unit still 23 / 0. docs/v2/lmx-followups.md item #7 marked DONE with three explicit deferred items (event-driven push, failure-count column, SignalR fan-out).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:51:55 -04:00
Joseph Doherty
4886a5783f Phase 3 PR 31 — Live-LDAP integration test + Active Directory compatibility. Closes LMX follow-up #4 with 6 live-bind tests in Server.Tests/LdapUserAuthenticatorLiveTests.cs against the dev GLAuth instance at localhost:3893 (skipped cleanly when unreachable via Assert.Skip + a clear SkipReason — matches the GalaxyRepositoryLiveSmokeTests pattern). Coverage: valid credentials bind + surface DisplayName; wrong password fails; unknown user fails; empty credentials fail pre-flight without touching the directory; writeop user's memberOf maps through GroupToRole to WriteOperate (the exact string WriteAuthzPolicy.IsAllowed expects); admin user surfaces all four mapped roles (WriteOperate + WriteTune + WriteConfigure + AlarmAck) proving memberOf parsing doesn't stop after the first match. While wiring this up, the authenticator's hard-coded user-lookup filter 'uid=<name>' didn't match GLAuth (which keys users by cn and doesn't populate uid) — AND it doesn't match Active Directory either, which uses sAMAccountName. Added UserNameAttribute to LdapOptions (default 'uid' for RFC 2307 backcompat) so deployments override to 'cn' / 'sAMAccountName' / 'userPrincipalName' as the directory requires; authenticator filter now interpolates the configured attribute. The default stays 'uid' so existing test fixtures and OpenLDAP installs keep working without a config change — a regression guard in LdapUserAuthenticatorAdCompatTests.LdapOptions_default_UserNameAttribute_is_uid_for_rfc2307_compat pins this so a future 'helpful' default change can't silently break anyone.
Active Directory compatibility. LdapOptions xml-doc expanded with a cheat-sheet covering Server (DC FQDN), Port 389 vs 636, UseTls=true under AD LDAP-signing enforcement, dedicated read-only service account DN, sAMAccountName vs userPrincipalName vs cn trade-offs, memberOf DN shape (CN=Group,OU=...,DC=... with the CN= RDN stripped to become the GroupToRole key), and the explicit 'nested groups NOT expanded' call-out (LDAP_MATCHING_RULE_IN_CHAIN / tokenGroups is a future authenticator enhancement, not a config change). docs/security.md §'Active Directory configuration' adds a complete appsettings.json snippet with realistic AD group names (OPCUA-Operators → WriteOperate, OPCUA-Engineers → WriteConfigure, OPCUA-AlarmAck → AlarmAck, OPCUA-Tuners → WriteTune), LDAPS port 636, TLS on, insecure-LDAP off, and operator-facing notes on each field. LdapUserAuthenticatorAdCompatTests (5 unit guards): ExtractFirstRdnValue parses AD-style 'CN=OPCUA-Operators,OU=...,DC=...' DNs correctly (case-preserving — operators' GroupToRole keys stay readable); also handles mixed case and spaces in group names ('Domain Users'); also works against the OpenLDAP ou=<group>,ou=groups shape (GLAuth) so one extractor tolerates both memberOf formats common in the field; EscapeLdapFilter escapes the RFC 4515 injection set (\, *, (, ), \0) so a malicious login like 'admin)(cn=*' can't break out of the filter; default UserNameAttribute regression guard.
Test posture — Server.Tests Unit: 43 pass / 0 fail (38 prior + 5 new AD-compat guards). Server.Tests LiveLdap category: 6 pass / 0 fail against running GLAuth (would skip cleanly without). Server build clean, 0 errors, 0 warnings.
Deferred: the session-identity end-to-end check (drive a full OPC UA UserName session, then read a 'whoami' node to verify the role landed on RoleBasedIdentity). That needs a test-only address-space node and is scoped for a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:23:22 -04:00
Joseph Doherty
6b04a85f86 Phase 3 PR 26 — server-layer write authorization gating by role. Per the user's ACL-at-server-layer directive (saved as feedback_acl_at_server_layer.md in memory), write authorization is enforced in DriverNodeManager.OnWriteValue and never delegated to the driver or to driver-specific auth (the v1 Galaxy-provided security path is explicitly not part of v2 — drivers report SecurityClassification as discovery metadata only). New WriteAuthzPolicy static class in Server/Security/ maps SecurityClassification → required role per the table documented in docs/Configuration.md: FreeAccess = no role required (anonymous sessions can write), Operate + SecuredWrite = WriteOperate, Tune = WriteTune, VerifiedWrite + Configure = WriteConfigure, ViewOnly = deny regardless of roles. Role matching is case-insensitive and role requirements do NOT cascade — a session with WriteConfigure can write Configure attributes but needs WriteOperate separately to write Operate attributes; this is deliberate so escalation is an explicit LDAP group assignment, not a hierarchy the policy silently grants. DriverNodeManager gains a _securityByFullRef Dictionary populated during Variable() registration (parallel to the existing _variablesByFullRef) so OnWriteValue can look up the classification in O(1) on the hot path. OnWriteValue casts the session's context.UserIdentity to the new IRoleBearer interface (implemented by OtOpcUaServer.RoleBasedIdentity from PR 19) — empty Roles collection when the session is anonymous; the same WriteAuthzPolicy.IsAllowed check then either short-circuits true (FreeAccess), false (ViewOnly), or walks the roles list looking for the required one. On deny, OnWriteValue logs 'Write denied for {FullRef}: classification=X userRoles=[...]' at Information level (readable trail for operator complaints) and returns BadUserAccessDenied without touching IWritable.WriteAsync — drivers never see a request we'd have refused. IRoleBearer kept as a minimal server-side interface rather than reusing some abstraction from Core.Abstractions because the concept is OPC-UA-session-scoped and doesn't generalize (the driver side has no notion of a user session). Tests — WriteAuthzPolicyTests (17 new cases): FreeAccess allows write with empty role set + arbitrary roles; ViewOnly denies write even with every role; Operate requires WriteOperate; role match is case-insensitive; Operate denies empty role set + wrong role; SecuredWrite shares Operate's requirement; Tune requires WriteTune; Tune denies WriteOperate-only (asserts roles don't cascade — this is the test that catches a future regression where someone 'helpfully' adds a role-escalation table); Configure requires WriteConfigure; VerifiedWrite shares Configure's requirement; multi-role session allowed when any role matches; unrelated roles denied; RequiredRole theory covering all 5 classified-and-mapped rows + null for FreeAccess/ViewOnly special cases. lmx-followups.md follow-up #2 marked DONE with a back-reference to this PR and the memory note. Full Server.Tests Unit suite: 38 pass / 0 fail (17 new WriteAuthz + 14 SecurityConfiguration from PR 19 + 2 NodeBootstrap + 5 others). Server.Tests Integration (Category=Integration) 2 pass — existing PR 17 anonymous-endpoint smoke tests stay green since the read path doesn't hit OnWriteValue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:01:01 -04:00
Joseph Doherty
22d3b0d23c Phase 3 PR 19 — LDAP user identity + Basic256Sha256 security profile. Replaces the anonymous-only endpoint with a configurable security profile and an LDAP-backed UserName token validator. New IUserAuthenticator abstraction in Backend/Security/: LdapUserAuthenticator binds to the configured directory (reuses the pattern from Admin.Security.LdapAuthService without the cross-app dependency — Novell.Directory.Ldap.NETStandard 3.6.0 package ref added to Server alongside the existing OPCFoundation packages) and maps group membership to OPC UA roles via LdapOptions.GroupToRole (case-insensitive). DenyAllUserAuthenticator is the default when Ldap.Enabled=false so UserName token attempts return a clean BadUserAccessDenied rather than hanging on a localhost:3893 bind attempt. OpcUaSecurityProfile enum + LdapOptions nested record on OpcUaServerOptions. Profile=None keeps the PR 17 shape (SecurityPolicies.None + Anonymous token only) so existing integration tests stay green; Profile=Basic256Sha256SignAndEncrypt adds a second ServerSecurityPolicy (Basic256Sha256 + SignAndEncrypt) to the collection and, when Ldap.Enabled=true, adds a UserName token policy scoped to SecurityPolicies.Basic256Sha256 only — passwords must ride an encrypted channel, the stack rejects UserName over None. OtOpcUaServer.OnServerStarted hooks SessionManager.ImpersonateUser: AnonymousIdentityToken passes through; UserNameIdentityToken delegates to IUserAuthenticator.AuthenticateAsync — rejected identities throw ServiceResultException(BadUserAccessDenied); accepted identities get a RoleBasedIdentity that carries the resolved roles through session.Identity so future PRs can gate writes by role. OpcUaApplicationHost + OtOpcUaServer constructors take IUserAuthenticator as a dependency. Program.cs binds the new OpcUaServer:Ldap section from appsettings (Enabled defaults false, GroupToRole parsed as Dictionary<string,string>), registers IUserAuthenticator as LdapUserAuthenticator when enabled or DenyAllUserAuthenticator otherwise. PR 17 integration test updated to pass DenyAllUserAuthenticator so it keeps exercising the anonymous-only path unchanged. Tests — SecurityConfigurationTests (new, 13 cases): DenyAllAuthenticator rejects every credential; LdapAuthenticator rejects blank creds without hitting the server; rejects when Enabled=false; rejects plaintext when both UseTls=false AND AllowInsecureLdap=false (safety guard matching the Admin service); EscapeLdapFilter theory (4 rows: plain passthrough, parens/asterisk/backslash → hex escape) — regression guard against LDAP injection; ExtractOuSegment theory (3 rows: finds ou=, returns null when absent, handles multiple ou segments by returning first); ExtractFirstRdnValue theory (3 rows: strips cn= prefix, handles single-segment DN, returns plain string unchanged when no =). OpcUaServerOptions_default_is_anonymous_only asserts the default posture preserves PR 17 behavior. InternalsVisibleTo('ZB.MOM.WW.OtOpcUa.Server.Tests') added to Server csproj so ExtractOuSegment and siblings are reachable from the tests. Full solution: 0 errors, 180 tests pass (8 Core + 14 Proxy + 24 Configuration + 6 Shared + 91 Galaxy.Host + 19 Server (17 unit + 2 integration) + 18 Admin). Live-LDAP integration test (connect via Basic256Sha256 endpoint with a real user from GLAuth, assert the session.Identity carries the mapped role) is deferred to a follow-up — it requires the GLAuth dev instance to be running at localhost:3893 which is dev-machine-specific, and the test harness for that also needs a fresh client-side certificate provisioned by the live server's trusted store.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 08:49:46 -04:00
Joseph Doherty
46834a43bd Phase 3 PR 17 — complete OPC UA server startup end-to-end + integration test. PR 16 shipped the materialization shape (DriverNodeManager / OtOpcUaServer) without the activation glue; this PR finishes the scope so an external OPC UA client can actually connect, browse, and read. New OpcUaServerOptions DTO bound from the OpcUaServer section of appsettings.json (EndpointUrl default opc.tcp://0.0.0.0:4840/OtOpcUa, ApplicationName, ApplicationUri, PkiStoreRoot default %ProgramData%\OtOpcUa\pki, AutoAcceptUntrustedClientCertificates default true for dev — production flips via config). OpcUaApplicationHost wraps Opc.Ua.Configuration.ApplicationInstance: BuildConfiguration constructs the ApplicationConfiguration programmatically (no external XML) with SecurityConfiguration pointing at <PkiStoreRoot>/own, /issuers, /trusted, /rejected directories — stack auto-creates the cert folders on first run and generates a self-signed application certificate via CheckApplicationInstanceCertificate, ServerConfiguration.BaseAddresses set to the endpoint URL + SecurityPolicies just None + UserTokenPolicies just Anonymous with PolicyId='Anonymous' + SecurityPolicyUri=None so the client's UserTokenPolicy lookup succeeds at OpenSession, TransportQuotas.OperationTimeout=15s + MinRequestThreadCount=5 / MaxRequestThreadCount=100 / MaxQueuedRequestCount=200, CertificateValidator auto-accepts untrusted when configured. StartAsync creates the OtOpcUaServer (passes DriverHost + ILoggerFactory so one DriverNodeManager is created per registered driver in CreateMasterNodeManager from PR 16), calls ApplicationInstance.Start(server) to bind the endpoint, then walks each DriverNodeManager and drives a fresh GenericDriverNodeManager.BuildAddressSpaceAsync against it so the driver's discovery streams into the address space that's already serving clients. Per-driver discovery is isolated per decision #12: a discovery exception marks the driver's subtree faulted but the server stays up serving the other drivers' subtrees. DriverHost.GetDriver(instanceId) public accessor added alongside the existing GetHealth so OtOpcUaServer can enumerate drivers during CreateMasterNodeManager. DriverNodeManager.Driver property made public so OpcUaApplicationHost can identify which driver each node manager wraps during the discovery loop. OpcUaServerService constructor takes OpcUaApplicationHost — ExecuteAsync sequence now: bootstrap.LoadCurrentGenerationAsync → applicationHost.StartAsync → infinite Task.Delay until stop. StopAsync disposes the application host (which stops the server via OtOpcUaServer.Stop) before disposing DriverHost. Program.cs binds OpcUaServerOptions from appsettings + registers OpcUaApplicationHost + OpcUaServerOptions as singletons. Integration test (OpcUaServerIntegrationTests, Category=Integration): IAsyncLifetime spins up the server on a random non-default port (48400+random for test isolation) with a per-test-run PKI store root (%temp%/otopcua-test-<guid>) + a FakeDriver registered in DriverHost that has ITagDiscovery + IReadable implementations — DiscoverAsync registers TestFolder>Var1, ReadAsync returns 42. Client_can_connect_and_browse_driver_subtree creates an in-process OPC UA client session via CoreClientUtils.SelectEndpoint (which talks to the running server's GetEndpoints and fetches the live EndpointDescription with the actual PolicyId), browses the fake driver's root, asserts TestFolder appears in the returned references. Client_can_read_a_driver_variable_through_the_node_manager constructs the variable NodeId using the namespace index the server registered (urn:OtOpcUa:fake), calls Session.ReadValue, asserts the DataValue.Value is 42 — the whole pipeline (client → server endpoint → DriverNodeManager.OnReadValue → FakeDriver.ReadAsync → back through the node manager → response to client) round-trips correctly. Dispose tears down the session, server, driver host, and PKI store directory. Full solution: 0 errors, 165 tests pass (8 Core unit + 14 Proxy unit + 24 Configuration unit + 6 Shared unit + 91 Galaxy.Host unit + 4 Server (2 unit NodeBootstrap + 2 new integration) + 18 Admin). End-to-end outcome: PR 14's GalaxyAlarmTracker alarm events now flow through PR 15's GenericDriverNodeManager event forwarder → PR 16's ConditionSink → OPC UA AlarmConditionState.ReportEvent → out to every OPC UA client subscribed to the alarm condition. The full alarm subsystem (driver-side subscription of the Galaxy 4-attribute quartet, Core-side routing by source node id, Server-side AlarmConditionState materialization with ReportEvent dispatch) is now complete and observable through any compliant OPC UA client. LDAP / security-profile wire-up (replacing the anonymous-only endpoint with BasicSignAndEncrypt + user identity mapping to NodePermissions role) is the next layer — it reuses the same ApplicationConfiguration plumbing this PR introduces but needs a deployment-policy source (central config DB) for the cert trust decisions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 08:18:37 -04:00
Joseph Doherty
f53c39a598 Phase 3 PR 16 — concrete OPC UA server scaffolding with AlarmConditionState materialization. Introduces the OPCFoundation.NetStandard.Opc.Ua.Server package (v1.5.374.126, same version the v1 stack already uses) and two new server-side classes: DriverNodeManager : CustomNodeManager2 is the concrete realization of PR 15's IAddressSpaceBuilder contract — Folder() creates FolderState nodes under an Organizes hierarchy rooted at ObjectsFolder > DriverInstanceId; Variable() creates BaseDataVariableState with DataType mapped from DriverDataType (Boolean/Int32/Float/Double/String/DateTime) + ValueRank (Scalar or OneDimension) + AccessLevel CurrentReadOrWrite; AddProperty() creates PropertyState with HasProperty reference. Read hook wires OnReadValue per variable to route to IReadable.ReadAsync; Write hook wires OnWriteValue to route to IWritable.WriteAsync and surface per-tag StatusCode. MarkAsAlarmCondition() materializes an OPC UA AlarmConditionState child of the variable, seeded from AlarmConditionInfo (SourceName, InitialSeverity → UA severity via Low=250/Medium=500/High=700/Critical=900, InitialDescription), initial state Enabled + Acknowledged + Inactive + Retain=false. Returns an IAlarmConditionSink whose OnTransition updates alarm.Severity/Time/Message and switches state per AlarmType string ('Active' → SetActiveState(true) + SetAcknowledgedState(false) + Retain=true; 'Acknowledged' → SetAcknowledgedState(true); 'Inactive' → SetActiveState(false) + Retain=false if already Acked) then calls alarm.ReportEvent to emit the OPC UA event to subscribed clients. Galaxy's GalaxyAlarmTracker (PR 14) now lands at a concrete AlarmConditionState node instead of just raising an unobserved C# event. OtOpcUaServer : StandardServer wires one DriverNodeManager per DriverHost.GetDriver during CreateMasterNodeManager — anonymous endpoint, no security profile (minimum-viable; LDAP + security-profile wire-up is the next PR). DriverHost gains public GetDriver(instanceId) so the server can enumerate drivers at startup. NestedBuilder inner class in DriverNodeManager implements IAddressSpaceBuilder by temporarily retargeting the parent's _currentFolder during each call so Folder→Variable→AddProperty land under the correct subtree — not thread-safe if discovery ran concurrently, but GenericDriverNodeManager.BuildAddressSpaceAsync is sequential per driver so this is safe by construction. NuGet audit suppress for GHSA-h958-fxgg-g7w3 (moderate-severity in OPCFoundation.NetStandard.Opc.Ua.Core 1.5.374.126; v1 stack already accepts this risk on the same package version). PR 16 is scoped as scaffolding — the actual server startup (ApplicationInstance, certificate config, endpoint binding, session management wiring into OpcUaServerService.ExecuteAsync) is deferred to a follow-up PR because it needs ApplicationConfiguration XML + optional-cert-store logic that depends on per-deployment policy decisions. The materialization shape is complete: a subsequent PR adds 100 LOC to start the server and all the already-written IAddressSpaceBuilder + alarm-condition + read/write wire-up activates end-to-end. Full solution: 0 errors, 152 unit tests pass (no new tests this PR — DriverNodeManager unit testing needs an IServerInternal mock which is heavyweight; live-endpoint integration tests land alongside the server-startup PR).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 08:00:36 -04:00
Joseph Doherty
01fd90c178 Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:35:25 -04:00