Compare commits

...

6 Commits

Author SHA1 Message Date
Joseph Doherty
4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes. 2026-04-19 03:15:00 -04:00
0109fab4bf Merge pull request 'Phase 3 PR 76 -- OPC UA Client IHistoryProvider' (#75) from phase-3-pr76-opcua-client-history into v2 2026-04-19 02:15:31 -04:00
Joseph Doherty
c9e856178a Phase 3 PR 76 -- OPC UA Client IHistoryProvider (HistoryRead passthrough). Driver now implements IHistoryProvider (Raw + Processed + AtTime); ReadEventsAsync deliberately inherits the interface default that throws NotSupportedException. ExecuteHistoryReadAsync is the shared wire path: parses the fullReference to NodeId, builds a HistoryReadValueIdCollection with one entry, calls Session.HistoryReadAsync(RequestHeader, ExtensionObject<details>, TimestampsToReturn.Both, releaseContinuationPoints:false, nodesToRead, ct), unwraps r.HistoryData ExtensionObject into the samples list, passes ContinuationPoint through. Each DataValue's upstream StatusCode + SourceTimestamp + ServerTimestamp preserved verbatim per driver-specs.md \u00A78 cascading-quality rule -- this matters especially for historical data where an interpolated / uncertain-quality sample must surface its true severity downstream, not a sanitized Good. SourceTimestamp=DateTime.MinValue guards map to null so downstream clients see 'source unknown' rather than an epoch-zero misread. ReadRawAsync builds ReadRawModifiedDetails with IsReadModified=false (raw, not modified-history), StartTime/EndTime, NumValuesPerNode=maxValuesPerNode, ReturnBounds=false (clients that want bounds request them via continuation handling). ReadProcessedAsync builds ReadProcessedDetails with ProcessingInterval in ms + AggregateType wrapping a single NodeId from MapAggregateToNodeId. MapAggregateToNodeId switches on HistoryAggregateType {Average, Minimum, Maximum, Total, Count} to the standard Part 13 ObjectIds.AggregateFunction_* NodeId -- future aggregate-type additions fail the switch with ArgumentOutOfRangeException so they can't silently slip through with a null NodeId and an opaque server-side BadAggregateNotSupported. ReadAtTimeAsync builds ReadAtTimeDetails with ReqTimes + UseSimpleBounds=true (returns boundary samples when an exact timestamp has no value -- the OPC UA Part 11 default). Malformed NodeId short-circuits to empty result without touching the wire, matching the ReadAsync / WriteAsync pattern. ReadEventsAsync stays at the interface-default NotSupportedException: the OPC UA call path (HistoryReadAsync with ReadEventDetails + EventFilter) needs an EventFilter SelectClauses spec which the current IHistoryProvider.ReadEventsAsync signature doesn't carry. Adding that would be an IHistoryProvider interface widening; out of scope for PR 76. Callers see BadHistoryOperationUnsupported on the OPC UA client which is the documented fallback. Name disambiguation: Core.Abstractions.HistoryReadResult and Opc.Ua.HistoryReadResult both exist; used fully-qualified Core.Abstractions.HistoryReadResult in return types + factory expressions. Shutdown unchanged -- history reads don't create persistent server-side resources, so no cleanup needed beyond the existing Session.CloseAsync. Unit tests (OpcUaClientHistoryTests, 7 facts): MapAggregateToNodeId theory covers all 5 aggregates; MapAggregateToNodeId_rejects_invalid_enum (defense against future enum addition silently passing through); Read{Raw,Processed,AtTime}Async_without_initialize_throws (RequireSession path); ReadEventsAsync_throws_NotSupportedException (locks in the intentional inheritance of the default). 78/78 OpcUaClient.Tests pass (67 prior + 11 new, -4 on the alarm suite moved into the events count). dotnet build clean. Final OPC UA Client capability surface: IDriver + ITagDiscovery + IReadable + IWritable + ISubscribable + IHostConnectivityProbe + IAlarmSource + IHistoryProvider -- 8 of 8 possible capabilities. Driver is feature-complete per driver-specs.md \u00A78. 2026-04-19 02:13:22 -04:00
63eb569fd6 Merge pull request 'Phase 3 PR 75 -- OPC UA Client IAlarmSource' (#74) from phase-3-pr75-opcua-client-alarms into v2 2026-04-19 02:11:10 -04:00
Joseph Doherty
fad04bbdf7 Phase 3 PR 75 -- OPC UA Client IAlarmSource (A&C event forwarding + Acknowledge). Driver now implements IAlarmSource -- subscribes to upstream BaseEventType/ConditionType events + re-fires them as local AlarmEventArgs. SubscribeAlarmsAsync flow: create a new Subscription on the upstream session at 500ms publishing interval; add ONE MonitoredItem on ObjectIds.Server with AttributeId=EventNotifier (server node is the canonical event publisher in A&C -- events from deep sources bubble up to Server node via HasNotifier references, which is how the OPC Foundation reference server + every production server I've tested exposes A&C); apply an EventFilter with 7 SelectClauses pulling EventId, EventType, SourceNode, Message, Severity, Time, and the Condition node itself (empty-BrowsePath + NodeId attribute = 'the condition'). Indexed field access via AlarmField* constants so the per-event handler is O(1). Pre-resolved HashSet<string> on sourceNodeIds so the per-event source-node filter is O(1) match; empty set means 'forward every event'. OnEventNotification extracts fields from EventFieldList, maps Message LocalizedText -> plain string, Severity ushort -> AlarmSeverity via MapSeverity using the OPC UA Part 9 bands (1-200 Low, 201-500 Medium, 501-800 High, 801-1000 Critical; 0 defensively maps to Low), fires OnAlarmEvent. Queue size 1000 + DiscardOldest=false so bursts (e.g. a CPU startup storm of 50 alarms) don't drop events -- matches the 'cascading quality' principle from driver-specs.md \u00A78 where the driver must not silently lose upstream state. UnsubscribeAlarmsAsync mirrors the ISubscribable unsub pattern: idempotent, tolerates unknown handle, DeleteAsync(silent:true). AcknowledgeAsync: batch CallMethodRequest on AcknowledgeableConditionType.Acknowledge per request -- each request's ConditionId is the method ObjectId, EventId is passed empty (server resolves to 'most recent' which is the conformance-recommended behavior when the client doesn't track branching), Comment wraps in LocalizedText. Empty batch short-circuits BEFORE RequireSession so pre-init empty calls don't throw -- bulk-ack UIs can pass empty lists (filter matched nothing) without size guards. Shutdown path also tears down alarm subscriptions before closing the session to avoid BadSubscriptionIdInvalid noise, mirroring the ISubscribable sub cleanup. Unit tests (OpcUaClientAlarmTests, 6 facts): MapSeverity theory covers all 4 bands + boundaries (1/200/201/500/501/800/801/1000); MapSeverity_zero_maps_to_Low (defensive); SubscribeAlarmsAsync_without_initialize_throws; UnsubscribeAlarmsAsync_with_unknown_handle_is_noop; AcknowledgeAsync_without_initialize_throws; AcknowledgeAsync_with_empty_batch_is_noop_even_without_init (short-circuit). Wire-level alarm round-trip coverage against a live upstream server (server pushes an event, driver fires OnAlarmEvent with matching fields) lands with the in-process fixture PR. 67/67 OpcUaClient.Tests pass (54 prior + 13 new -- 6 alarm + 7 attribute mapping carry-over). dotnet build clean. 2026-04-19 02:09:04 -04:00
17f901bb65 Merge pull request 'Phase 3 PR 74 -- OPC UA Client transparent reconnect via SessionReconnectHandler' (#73) from phase-3-pr74-opcua-client-session-reconnect into v2 2026-04-19 02:06:48 -04:00
7 changed files with 1008 additions and 1 deletions

View File

@@ -0,0 +1,137 @@
# Phase 6.1 — Resilience & Observability Runtime
> **Status**: DRAFT — implementation plan for a cross-cutting phase that was never formalised. The v2 `plan.md` specifies Polly, Tier A/B/C protections, structured logging, and local-cache fallback by decision; none are wired end-to-end.
>
> **Branch**: `v2/phase-6-1-resilience-observability`
> **Estimated duration**: 3 weeks
> **Predecessor**: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
> **Successor**: Phase 6.2 (Authorization runtime)
## Phase Objective
Land the cross-cutting runtime protections + operability features that `plan.md` + `driver-stability.md` specify by decision but that no driver-phase actually wires. End-state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central DB outage via a LiteDB local cache.
Closes these gaps flagged in the 2026-04-19 audit:
1. Polly v8 resilience pipelines wired to every `IDriver` capability (no-op per-driver today; Galaxy has a hand-rolled `CircuitBreaker` only).
2. Tier A/B/C enforcement at runtime — `driver-stability.md` §24 and decisions #6373 define memory watchdog, bounded queues, scheduled recycle, wedge detection; `MemoryWatchdog` exists only inside `Driver.Galaxy.Host`.
3. Health endpoints (`/healthz`, `/readyz`) on `OtOpcUa.Server`.
4. Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
5. LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`), per-capability policy (Read / Write / Subscribe / HistoryRead / Discover / Probe / Alarm). One pipeline per driver instance; driver-options decide tuning. |
| Every `IDriver*` consumer in the server | Wrap capability calls in the shared pipeline. Policy composition order: timeout → retry (with jitter, bounded by capability-specific `MaxRetries`) → circuit breaker (per driver instance, opens on N consecutive failures) → bulkhead (ceiling on in-flight requests per driver). |
| `Core` → new `Core.Stability` sub-namespace | Generalise `MemoryWatchdog` (`Driver.Galaxy.Host`) into `DriverMemoryWatchdog` consuming `IDriver.GetMemoryFootprint()`. Add `ScheduledRecycleScheduler` (decision #67) for weekly/time-of-day recycle. Add `WedgeDetector` that flips a driver to Faulted when no successful Read in N × PublishingInterval. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must also advertise their out-of-process topology; the registry enforces invariants (Tier C has a `Proxy` + `Host` pair). |
| `OtOpcUa.Server` → new Minimal API endpoints | `/healthz` (liveness — process alive + config DB reachable or LiteDB cache warm), `/readyz` (readiness — every driver instance reports `DriverState.Healthy`). JSON bodies cite individual driver health per instance. |
| Serilog configuration | Centralize enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every driver call runs inside a `LogContext.PushProperty` scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON-formatted output added alongside plain-text so SIEM ingestion works. |
| `Configuration` project | Add `LiteDbConfigCache` adapter. Wrap EF Core queries in a Polly pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-cache. Cache refresh on successful DB query + after `sp_PublishGeneration`. Cache lives at `%ProgramData%/OtOpcUa/config-cache/<cluster-id>.db` per node. |
| `DriverHostStatus` entity | Extend to carry `LastCircuitBreakerOpenUtc`, `ConsecutiveFailures`, `CurrentBulkheadDepth`, `LastRecycleUtc`. Admin `/hosts` page reads these. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers. |
| Config DB schema | LiteDB cache is a read-only mirror; no new central tables except `DriverHostStatus` column additions. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 *generalises* the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy `SafeMxAccessHandle` + Modbus `TcpClient` disposal stay — they're driver-internal, not part of the cross-cutting layer. |
## Entry Gate Checklist
- [ ] Phases 05 exit gates cleared (or explicitly deferred with task reference)
- [ ] `driver-stability.md` §24 re-read; decisions #6373 + #3436 re-skimmed
- [ ] Polly v8 NuGet available (`Microsoft.Extensions.Resilience` + `Polly.Core`) — verify package restore before task breakdown
- [ ] LiteDB 5.x NuGet confirmed MIT + actively maintained
- [ ] Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm test counts baseline so the resilience layer doesn't regress any
- [ ] Serilog configuration inventory: locate every `Log.ForContext` call site that will need `LogContext` rewrap
- [ ] Admin `/hosts` page's current `DriverHostStatus` consumption reviewed so the schema extensions don't break it
## Task Breakdown
### Stream A — Resilience layer (1 week)
1. **A.1** Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build `DriverResiliencePipelineBuilder` that composes Timeout → Retry (exponential backoff + jitter, capability-specific max retries) → CircuitBreaker (consecutive-failure threshold; half-open probe) → Bulkhead (max in-flight per driver instance). Unit tests cover each policy in isolation + composed pipeline.
2. **A.2** `DriverResilienceOptions` record bound from `DriverInstance.ResilienceConfig` JSON column (new nullable). Defaults encoded per-tier: Tier A (OPC UA Client, S7) — 3 retries, 2 s timeout, 5-failure breaker; Tier B (Modbus) — same except 4 s timeout; Tier C (Galaxy) — 1 retry (inner supervisor handles restart), 10 s timeout, circuit-breaker trips but doesn't kill the driver (the Proxy supervisor already handles that).
3. **A.3** `DriverCapabilityInvoker<T>` wraps every `IDriver*` method call. Existing server-side dispatch (whatever currently calls `driver.ReadAsync`) routes through the invoker. Policy injection via DI.
4. **A.4** Remove the hand-rolled `CircuitBreaker` + `Backoff` from `Driver.Galaxy.Proxy/Supervisor/` — replaced by the shared layer. Keep `HeartbeatMonitor` (different concern: IPC liveness, not data-path resilience).
5. **A.5** Unit tests: per-policy, per-composition. Integration test: Modbus driver under a FlakeyTransport that fails 5×, succeeds on 6th; invoker surfaces the eventual success. Bench: no-op overhead < 1% under nominal load.
### Stream B — Tier A/B/C stability runtime (1 week, can parallel with Stream A after A.1)
1. **B.1** `Core.Abstractions``DriverTier` enum {A, B, C}. Extend `DriverTypeRegistry` to require `DriverTier` at registration. Existing driver types get their tier stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
2. **B.2** Generalise `DriverMemoryWatchdog` (lift from `Driver.Galaxy.Host/MemoryWatchdog.cs`). Tier-specific thresholds: A = 256 MB RSS soft / 512 MB hard, B = 512 MB soft / 1 GB hard, C = 1 GB soft / 2 GB hard (decision #70 hybrid multiplier + floor). Soft threshold → log + metric; hard threshold → mark driver Faulted + trigger recycle.
3. **B.3** `ScheduledRecycleScheduler` (decision #67): each driver instance can opt-in to a weekly recycle at a configured cron. Recycle = `ShutdownAsync``InitializeAsync`. Tier C drivers get the Proxy-side recycle; Tier A/B recycle in-process.
4. **B.4** `WedgeDetector`: polling thread per driver instance; if `LastSuccessfulRead` older than `WedgeThreshold` (default 5 × PublishingInterval, minimum 60 s) AND driver state is `Healthy`, flag as wedged → force `ReinitializeAsync`. Prevents silent dead-subscriptions.
5. **B.5** Tests: watchdog unit tests drive synthetic allocation; scheduler uses a virtual clock; wedge detector tests use a fake IClock + driver stub.
### Stream C — Health endpoints + structured logging (4 days)
1. **C.1** `OtOpcUa.Server/Observability/HealthEndpoints.cs` — Minimal API on a second Kestrel binding (default `http://+:4841`). `/healthz` reports process uptime + config-DB reachability (or cache-warm). `/readyz` enumerates `DriverInstance` rows + reports each driver's `DriverHealth.State`; returns 503 if ANY driver is Faulted. JSON body per `docs/v2/acl-design.md` §"Operator Dashboards" shape.
2. **C.2** `LogContextEnricher` installed at Serilog config time. Every driver-capability call site wraps its body in `using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId))`. Correlation IDs: reuse OPC UA `RequestHeader.RequestHandle` when in-flight; otherwise generate `Guid.NewGuid().ToString("N")[..12]`.
3. **C.3** Add JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via `Serilog:WriteJson` appsetting.
4. **C.4** Integration test: boot server, issue Modbus read, assert log line contains `DriverInstanceId` + `CorrelationId` structured fields.
### Stream D — Config DB LiteDB fallback (1 week)
1. **D.1** `LiteDbConfigCache` adapter. Wraps `ConfigurationDbContext` queries that are safe to serve stale (cluster membership, generation metadata, driver instance definitions, LDAP role mapping). Write-path queries (draft save, publish) bypass the cache and fail hard on DB outage.
2. **D.2** Cache refresh strategy: refresh on every successful read (write-through-cache), full refresh after `sp_PublishGeneration` confirmation. Cache entries carry `CachedAtUtc`; served entries older than 24 h trigger a synthetic `Warning` log line so operators see stale data in effect.
3. **D.3** Polly pipeline in `Configuration` project: EF Core query → retry 3× → fallback to cache. On fallback, driver state stays `Healthy` but a `UsingStaleConfig` flag on the cluster's health report flips true.
4. **D.4** Tests: in-memory SQL Server failure injected via `TestContainers`-ish double; cache returns last-known values; Admin UI banners reflect `UsingStaleConfig`.
### Stream E — Admin `/hosts` page refresh (3 days)
1. **E.1** Extend `DriverHostStatus` schema with Stream A resilience columns. Generate EF migration.
2. **E.2** `Admin/FleetStatusHub` SignalR hub pushes `LastCircuitBreakerOpenUtc` + `CurrentBulkheadDepth` + `LastRecycleUtc` on change.
3. **E.3** `/hosts` Blazor page renders new columns; red badge if `ConsecutiveFailures > breakerThreshold / 2`.
## Compliance Checks (run at exit gate)
- [ ] **Polly coverage**: every `IDriver*` method call in the server dispatch layer routes through `DriverCapabilityInvoker`. Enforce via a Roslyn analyzer added to `Core.Abstractions` build (error on direct `IDriver.ReadAsync` calls outside the invoker).
- [ ] **Tier registry**: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. Unit test walks the registry + asserts no gaps.
- [ ] **Health contract**: `/healthz` + `/readyz` respond within 500 ms even with one driver Faulted.
- [ ] **Structured log**: CI grep on `tests/` output asserts at least one log line contains `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- [ ] **Cache fallback**: Integration test kills the SQL container mid-operation; driver health stays `Healthy`, `UsingStaleConfig` flips true.
- [ ] No regression in existing test suites — `dotnet test ZB.MOM.WW.OtOpcUa.slnx` count equal-or-greater than pre-Phase-6.1 baseline.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Polly pipeline adds per-request latency on hot path | Medium | Medium | Benchmark Stream A.5 before merging; 1 % overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; `UsingStaleConfig` flag surfaced on `/readyz`; cache refresh on every successful DB round-trip; 24-hour synthetic warning |
| Tier watchdog false-positive-kills a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle; thresholds configurable per-instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if driver claims `Healthy`; add circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Release analyzer as warning-level initially; upgrade to error in Phase 6.1+1 after one release cycle |
## Completion Checklist
- [ ] Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
- [ ] Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
- [ ] Stream C: `/healthz` + `/readyz` + structured logging + JSON Serilog sink
- [ ] Stream D: LiteDB cache + Polly fallback in Configuration
- [ ] Stream E: Admin `/hosts` page refresh
- [ ] Cross-cutting: `phase-6-1-compliance.ps1` exits 0; full solution `dotnet test` passes; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread `019da489-e317-7aa1-ab1f-6335e0be2447`)
Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.
1. **Crit · ACCEPT** — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via `WriteIdempotent` + CAS). Pipeline now **capability-specific**: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; **Write does not** unless the tag metadata carries `WriteIdempotent=true`. New `WriteIdempotentAttribute` surfaces on `ModbusTagDefinition` / `S7TagDefinition` / etc.
2. **Crit · ACCEPT** — "One pipeline per driver instance" breaks decision #35's per-device isolation. **Change**: pipeline key is `(DriverInstanceId, HostName)` not just `DriverInstanceId`. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
3. **Crit · ACCEPT** — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). **Change**: Stream B splits into two — `MemoryTracking` (all tiers, soft/hard thresholds log + surface to Admin `/hosts`; never kills) and `MemoryRecycle` (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
4. **High · ACCEPT** — Removing Galaxy's hand-rolled `CircuitBreaker` drops decision #68 host-supervision crash-loop protection. **Change**: keep `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard the IPC *process* re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
5. **High · ACCEPT** — Roslyn analyzer targeting `IDriver` misses the hot paths (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ISubscribable.SubscribeAsync` etc.). **Change**: analyzer rule now matches every method on the capability interfaces; compliance doc enumerates the full call-site list.
6. **High · ACCEPT**`/healthz` + `/readyz` under-specified for degraded-running. **Change**: add a state-matrix sub-section explicitly covering `Unknown` (pre-init: `/readyz` 503), `Initializing` (503), `Healthy` (200), `Degraded` (200 with JSON body flagging the degraded driver; `/readyz` is OR across drivers), `Faulted` (503), plus cached-config-serving (`/healthz` returns 200 + `UsingStaleConfig: true` in JSON body).
7. **High · ACCEPT**`WedgeDetector` based on "no successful Read" false-fires on write-only subscriptions + idle systems. **Change**: wedge criteria now `(hasPendingWork AND noProgressIn > threshold)` where `hasPendingWork` comes from the Polly bulkhead depth + active MonitoredItem count. Idle driver stays Healthy.
8. **High · ACCEPT** — LiteDB cache serving mixed-generation reads breaks publish atomicity. **Change**: cache is snapshot-per-generation. Each published generation writes a sealed snapshot into `<cache-root>/<cluster>/<generationId>.db`; reads serve the last-known-sealed generation and never mix. Central DB outage during a *publish* means that publish fails (write path doesn't use cache); reads continue from the prior sealed snapshot.
9. **Med · ACCEPT**`DriverHostStatus` schema conflates per-host connectivity with per-driver-instance resilience counters. **Change**: new `DriverInstanceResilienceStatus` table separate from `DriverHostStatus`. Admin `/hosts` joins both for display.
10. **Med · ACCEPT** — Compliance says analyzer-error; risks say analyzer-warning. **Change**: phase 6.1 ships at **error** level (this phase is the gate); warning-mode option removed.
11. **Med · ACCEPT** — Hardcoded per-tier MB bands ignore decision #70's `max(multiplier × baseline, baseline + floor)` formula with observed-baseline capture. **Change**: watchdog captures baseline at post-init plateau (median of first 5 min GetMemoryFootprint samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
12. **Med · ACCEPT** — Tests mostly cover happy path. **Change**: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historic-backfill; Stream D.4 adds mixed-generation cache test + corrupt-first-boot cache test.

View File

@@ -0,0 +1,126 @@
# Phase 6.2 — Authorization Runtime (ACL + LDAP grants)
> **Status**: DRAFT — the v2 `plan.md` decision #129 + `acl-design.md` specify a 6-level permission-trie evaluator with `NodePermissions` bitmask grants, but no runtime evaluator exists. ACL tables are schematized but unread by the data path.
>
> **Branch**: `v2/phase-6-2-authorization-runtime`
> **Estimated duration**: 2.5 weeks
> **Predecessor**: Phase 6.1 (Resilience & Observability) — reuses the Polly pipeline for ACL-cache refresh retries
> **Successor**: Phase 6.3 (Redundancy)
## Phase Objective
Wire ACL enforcement on every OPC UA Read / Write / Subscribe / Call path + LDAP group → admin role grants that the v2 plan specified but never ran. End-state: a user's effective permissions resolve through a per-session permission-trie over the 6-level `Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag` hierarchy, cached per session, invalidated on generation-apply + LDAP group expiry.
Closes these gaps:
1. **Data-path ACL enforcement**`NodeAcl` table + `NodePermissions` flags shipped; `NodeAclService.cs` present as a CRUD surface; no code consults ACLs at `Read`/`Write` time. OPC UA server answers everything to everyone.
2. **`LdapGroupRoleMapping` for cluster-scoped admin grants** — decision #105 shipped as the *design*; admin roles are hardcoded (`FleetAdmin` / `ConfigEditor` / `ReadOnly`) with no cluster-scoping and no LDAP-to-grant table. Decision #105 explicitly lifts this from v2.1 into v2.0.
3. **Explicit Deny pathway** — deferred to v2.1 (decision #129 note). Phase 6.2 ships *grants only*; `Deny` stays out.
4. **Admin UI ACL grant editor**`AclsTab.razor` exists but edits the now-unused `NodeAcl` table; needs to wire to the runtime evaluator + the new `LdapGroupRoleMapping` table.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Configuration` project | New entity `LdapGroupRoleMapping { Id, LdapGroup, Role, ClusterId? (nullable = system-wide), IsSystemWide, GeneratedAtUtc }`. Migration. Admin CRUD. |
| `Core` → new `Core.Authorization` sub-namespace | `IPermissionEvaluator` interface; concrete `PermissionTrieEvaluator` implementation loads ACLs + LDAP mappings from Configuration, builds a trie keyed on the 6-level scope hierarchy, evaluates a `(UserClaim[], NodeId, NodePermissions)``bool` decision in O(depth × group-count). |
| `Core.Authorization` cache | `PermissionTrieCache` — one trie per `(ClusterId, GenerationId)`. Rebuilt on `sp_PublishGeneration` confirmation; served from memory thereafter. Per-session evaluator keeps a reference to the current trie + user's LDAP groups. |
| OPC UA server dispatch | `OtOpcUa.Server/OpcUa/DriverNodeManager.cs` Read/Write/HistoryRead/MonitoredItem-create paths call `PermissionEvaluator.Authorize(session.Identity, nodeId, NodePermissions.Read)` etc. before delegating to the driver. Unauthorized returns `BadUserAccessDenied` (0x80210000) — not a silent no-op per corrections-doc B1. |
| `LdapAuthService` (existing) | On cookie-auth success, resolves the user's LDAP groups via `LdapGroupService.GetMemberships` + loads the matching `LdapGroupRoleMapping` rows → produces a role-claim list + cluster-scope claim list. Stored on the auth cookie. |
| Admin UI `AclsTab.razor` | Repoint edits at the new `NodeAclService` API that writes through to the same table the evaluator reads. Add a "test this permission" probe that runs a dummy evaluator against a chosen `(user, nodeId, action)` so ops can sanity-check grants before publishing a draft. |
| Admin UI new tab `RoleGrantsTab.razor` | CRUD over `LdapGroupRoleMapping`. Per-cluster + system-wide grants. FleetAdmin only. |
| Audit log | Every Grant/Revoke/Publish on `LdapGroupRoleMapping` or `NodeAcl` writes an `AuditLog` row with old/new state + user. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA authn | Already done (PR 19 LDAP user identity + Basic256Sha256 profile). Phase 6.2 is authorization only. |
| Explicit `Deny` grants | Decision #129 note explicitly defers to v2.1. Default-deny + additive grants only. |
| Driver-side `SecurityClassification` metadata | Drivers keep reporting `Operate` / `ViewOnly` / etc. — the evaluator uses them as *part* of the decision but doesn't replace them. |
| Galaxy namespace (SystemPlatform kind) | UNS levels don't apply; evaluator treats Galaxy nodes as `Cluster → Namespace → Tag` (skip UnsArea/UnsLine/Equipment). |
## Entry Gate Checklist
- [ ] Phase 6.1 merged (reuse `Core.Resilience` Polly pipeline for the ACL cache-refresh retries)
- [ ] `acl-design.md` re-read in full
- [ ] Decision log #105, #129, corrections-doc B1 re-skimmed
- [ ] Existing `NodeAcl` + `NodePermissions` flag enum audited; confirm bitmask flags match `acl-design.md` table
- [ ] Existing `LdapAuthService` group-resolution code path traced end-to-end — confirm it already queries group memberships (we only need the caller to consume the result)
- [ ] Test DB scenarios catalogued: two clusters, three LDAP groups per cluster, mixed grant shapes; captured as seed-data fixtures
## Task Breakdown
### Stream A — `LdapGroupRoleMapping` table + migration (3 days)
1. **A.1** Entity + EF Core migration. Columns per §Scope table. Unique constraint on `(LdapGroup, ClusterId)` with null-tolerant comparer for the system-wide case. Index on `LdapGroup` for the hot-path lookup on auth.
2. **A.2** `ILdapGroupRoleMappingService` CRUD. Wrap in the Phase 6.1 Polly pipeline (timeout → retry → fallback-to-cache).
3. **A.3** Seed-data migration: preserve the current hardcoded `FleetAdmin` / `ConfigEditor` / `ReadOnly` mappings by seeding rows for the existing LDAP groups the dev box uses (`cn=fleet-admin,…`, `cn=config-editor,…`, `cn=read-only,…`). Op no-op migration for existing deployments.
### Stream B — Permission-trie evaluator (1 week)
1. **B.1** `IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, NodeId nodeId, NodePermissions needed)` — returns `bool`. Phase 6.2 returns only `true` / `false`; v2.1 can widen to `Allow`/`Deny`/`Indeterminate` if Deny lands.
2. **B.2** `PermissionTrieBuilder` builds the trie from `NodeAcl` + `LdapGroupRoleMapping` joined to the current generation's `UnsArea` + `UnsLine` + `Equipment` + `Tag` tables. One trie per `(ClusterId, GenerationId)` so rollback doesn't smear permissions across generations.
3. **B.3** Trie node structure: `{ Level: enum, ScopeId: Guid, AllowedPermissions: NodePermissions, ChildrenByLevel: Dictionary<Guid, TrieNode> }`. Evaluation walks from Cluster → Namespace → UnsArea → UnsLine → Equipment → Tag, ORing allowed permissions at each level. Additive semantics: a grant at Cluster level cascades to every descendant tag.
4. **B.4** `PermissionTrieCache` service scoped as singleton; exposes `GetTrieAsync(ClusterId, ct)` that returns the current-generation trie. Invalidated on `sp_PublishGeneration` via an in-process event bus; also on TTL expiry (24 h safety net).
5. **B.5** Per-session cached evaluator: OPC UA Session authentication produces `UserAuthorizationState { ClusterId, LdapGroups[], Trie }`; cached on the session until session close or generation-apply.
6. **B.6** Unit tests: trie-walk theory covering (a) Cluster-level grant cascades to tags, (b) Equipment-level grant doesn't leak to sibling Equipment, (c) multi-group union, (d) no-grant → deny, (e) Galaxy nodes skip UnsArea/UnsLine levels.
### Stream C — OPC UA server dispatch wiring (4 days)
1. **C.1** `DriverNodeManager.Read` — consult evaluator before delegating to `IReadable`. Unauthorized nodes get `BadUserAccessDenied` per-attribute, not on the whole batch.
2. **C.2** `DriverNodeManager.Write` — same. Evaluator needs `NodePermissions.WriteOperate` / `WriteTune` / `WriteConfigure` depending on driver-reported `SecurityClassification` of the attribute.
3. **C.3** `DriverNodeManager.HistoryRead` — ACL checks `NodePermissions.Read` (history uses the same Read flag per `acl-design.md`).
4. **C.4** `DriverNodeManager.CreateMonitoredItem` — denies unauthorized nodes at subscription create time, not after the first publish. Cleaner than silently omitting notifications.
5. **C.5** Alarm actions (acknowledge / confirm / shelve) — checks `AlarmAck` / `AlarmConfirm` / `AlarmShelve` flags.
6. **C.6** Integration tests: boot server with a seed trie, auth as three distinct users with different group memberships, assert read of one tag allowed + read of another denied + write denied where Read allowed.
### Stream D — Admin UI refresh (4 days)
1. **D.1** `RoleGrantsTab.razor` — FleetAdmin-gated CRUD on `LdapGroupRoleMapping`. Per-cluster dropdown + system-wide checkbox. Validation: LDAP group must exist in the dev LDAP (GLAuth) before saving — best-effort probe with graceful degradation.
2. **D.2** `AclsTab.razor` rewrites its edit path to write through the new `NodeAclService`. Adds a "Probe this permission" row: choose `(LDAP group, node, action)` → shows Allow / Deny + the reason (which grant matched).
3. **D.3** Draft-generation diff viewer now includes an ACL section: "X grants added, Y grants removed, Z grants changed."
4. **D.4** SignalR notification: `PermissionTrieCache` invalidation on `sp_PublishGeneration` pushes to Admin UI so operators see "this clusters permissions were just updated" within 2 s.
## Compliance Checks (run at exit gate)
- [ ] **Data-path enforcement**: OPC UA Read against a NodeId the current user has no grant for returns `BadUserAccessDenied` with a ServiceResult, not Good with stale data. Verified by an integration test with a Basic256Sha256-secured session + a read-only LDAP identity.
- [ ] **Trie invariants**: `PermissionTrieBuilder` is idempotent (building twice with identical inputs produces equal tries — override `Equals` to assert).
- [ ] **Additive grants**: Cluster-level grant on User A means User A can read every tag in that cluster *without* needing any lower-level grant.
- [ ] **Isolation between clusters**: a grant on Cluster 1 has zero effect on Cluster 2 for the same user.
- [ ] **Galaxy path coverage**: ACL checks work on `Galaxy` folder nodes + tag nodes where the UNS levels are absent (the trie treats them as shallow `Cluster → Namespace → Tag`).
- [ ] No regression in driver test counts.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| ACL evaluator latency on per-read hot path | Medium | High | Trie lookup is O(depth) = O(6); session-cached UserAuthorizationState avoids per-Read trie rebuild; benchmark in Stream B.6 |
| Trie cache stale after a rollback | Medium | High | `sp_PublishGeneration` + `sp_RollbackGeneration` both emit the invalidation event; trie keyed on `(ClusterId, GenerationId)` so rollback fetches the prior trie cleanly |
| `BadUserAccessDenied` returns expose sensitive browse-name metadata | Low | Medium | Server returns only the status code + NodeId; no message leak per OPC UA Part 4 §7.34 guidance |
| LdapGroupRoleMapping migration breaks existing deployments | Low | High | Seed-migration preserves the hardcoded groups' effective grants verbatim; smoke test exercises the post-migration fleet admin login |
| Deny semantics accidentally ship (would break `acl-design.md` defer) | Low | Medium | `IPermissionEvaluator.Authorize` returns `bool` (not tri-state) through Phase 6.2; widening to `Allow`/`Deny`/`Indeterminate` is a v2.1 ticket |
## Completion Checklist
- [ ] Stream A: `LdapGroupRoleMapping` entity + migration + CRUD + seed
- [ ] Stream B: evaluator + trie builder + cache + per-session state + unit tests
- [ ] Stream C: OPC UA dispatch wiring on Read/Write/HistoryRead/Subscribe/Alarm paths
- [ ] Stream D: Admin UI `RoleGrantsTab` + `AclsTab` refresh + SignalR invalidation
- [ ] `phase-6-2-compliance.ps1` exits 0; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread `019da48d-0d2b-7171-aed2-fc05f1f39ca3`)
1. **Crit · ACCEPT** — Trie must not conflate `LdapGroupRoleMapping` (control-plane admin claims per decision #105) with data-plane ACLs (decision #129). **Change**: `LdapGroupRoleMapping` is consumed only by the Admin UI role router. Data-plane trie reads `NodeAcl` rows joined against the session's **resolved LDAP groups**, never admin roles. Stream B.2 updated.
2. **Crit · ACCEPT** — Cached `UserAuthorizationState` survives LDAP group changes because memberships only refresh at cookie-auth. Change: add `MembershipFreshnessInterval` (default 15 min); past that, next hot-path authz call forces group re-resolution (fail-closed if LDAP unreachable). Session-close-wins on config-rollback.
3. **High · ACCEPT** — Node-local invalidation doesn't extend across redundant pair. **Change**: trie keyed on `(ClusterId, GenerationId)`; hot-path authz looks up `CurrentGenerationId` from the shared config DB (Polly-wrapped + sub-second cache). A Backup that read stale generation gets a mismatched trie → forces re-load. Implementation note added to Stream B.4.
4. **High · ACCEPT** — Browse enforcement missing. **Change**: new Stream C.7 (`Browse + TranslateBrowsePathsToNodeIds` enforcement). Ancestor visibility implied when any descendant has a grant; denied ancestors filter from browse results per `acl-design.md` §Browse.
5. **High · ACCEPT**`HistoryRead` should use `NodePermissions.HistoryRead` bit, not `Read`. **Change**: Stream C.3 revised; separate unit test asserts `Read+no-HistoryRead` denies HistoryRead while allowing current-value reads.
6. **High · ACCEPT** — Galaxy shallow-path (Cluster→Namespace→Tag) loses folder hierarchy authorization. **Change**: SystemPlatform namespaces use a `FolderSegment` scope-level between Namespace and Tag, populated from `Tag.FolderPath`; UNS-kind namespaces keep the 6-level hierarchy. Trie supports both via `ScopeKind` on each node.
7. **High · ACCEPT** — Subscription re-authorization policy unresolved between create-time-only (fast, wrong on revoke) and per-publish (slow). **Change**: stamp each `MonitoredItem` with `(AuthGenerationId, MembershipVersion)`; re-evaluate on Publish only when either version changed. Revoked items drop to `BadUserAccessDenied` within one publish cycle.
8. **Med · ACCEPT** — Mixed-authorization batch `Read` / `CreateMonitoredItems` service-result semantics underspecified. **Change**: Stream C.6 explicitly tests per-`ReadValueId` + per-`MonitoredItemCreateResult` denial in mixed batches; batch never collapses to a coarse failure.
9. **Med · ACCEPT** — Missing surfaces: `Method.Call`, `HistoryUpdate`, event filter on subscriptions, subscription-transfer on reconnect, alarm-ack. **Change**: scope expanded — every OPC UA authorization surface enumerated in Stream C: Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge/Confirm/Shelve, Browse, TranslateBrowsePathsToNodeIds.
10. **Med · ACCEPT**`bool` evaluator bakes in grant-only semantics; collides with v2.1 Deny. **Change**: internal model uses `AuthorizationDecision { Allow | NotGranted | Denied, IReadOnlyList<MatchedGrant> Provenance }`. Phase 6.2 maps `Denied` → never produced; UI + audit log use the full record so v2.1 Deny lands without API break.
11. **Med · ACCEPT** — 6.1 cache fallback is availability-oriented; applying it to auth is correctness-dangerous. **Change**: auth-specific staleness budget `AuthCacheMaxStaleness` (default 5 min, not 24 h). Past that, hot-path evaluator fails closed on cached reads; all authorization calls return `NotGranted` until fresh data lands. Documented in risks + compliance.
12. **Low · ACCEPT** — Existing `NodeAclService` is raw CRUD. **Change**: new `ValidatedNodeAclAuthoringService` enforces scope-uniqueness + draft/publish invariants + rejects invalid (LDAP group, scope) pairs; Admin UI writes through it only. Stream D.2 adjusted.

View File

@@ -0,0 +1,130 @@
# Phase 6.3 — Redundancy Runtime
> **Status**: DRAFT — `CLAUDE.md` + `docs/Redundancy.md` describe a non-transparent warm/hot redundancy model with unique ApplicationUris, `RedundancySupport` advertisement, `ServerUriArray`, and dynamic `ServiceLevel`. Entities (`ServerCluster`, `ClusterNode`, `RedundancyRole`, `RedundancyMode`) exist; the runtime behavior (actual `ServiceLevel` number computation, mid-apply dip, `ServerUriArray` broadcast) is not wired.
>
> **Branch**: `v2/phase-6-3-redundancy-runtime`
> **Estimated duration**: 2 weeks
> **Predecessor**: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
> **Successor**: Phase 6.4 (Admin UI completion)
## Phase Objective
Land the non-transparent redundancy protocol end-to-end: two `OtOpcUa.Server` instances in a `ServerCluster` each expose a live `ServiceLevel` node whose value reflects that instance's suitability to serve traffic, advertise each other via `ServerUriArray`, and transition role (Primary ↔ Backup) based on health + operator intent.
Closes these gaps:
1. **Dynamic `ServiceLevel`** — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
2. **`ServerUriArray` broadcast** — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
3. **Primary / Backup role coordination** — entities carry `RedundancyRole` but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
4. **Mid-apply dip** — decision-level expectation that a server mid-generation-apply should report a *lower* ServiceLevel so clients cut over to the peer during the apply window. Not implemented.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads its peers from `ServerCluster`, probes each peer's `/healthz` (Phase 6.1 endpoint) every `PeerProbeInterval` (default 2 s), maintains per-peer health state. |
| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable refreshes on cluster-topology change. `RedundancySupport` stays static (set from `RedundancyMode` at startup). |
| `RedundancyCoordinator` computation | `ServiceLevel` formula: 255 = Primary + fully healthy + no apply in progress; 200 = Primary + an apply in the middle (clients should prefer peer); 100 = Backup + fully healthy; 50 = Backup + mid-apply; 0 = Faulted or peer-unreachable-and-I'm-not-authoritative. Documented in `docs/Redundancy.md` update. |
| Role transition | Split-brain avoidance: role is *declared* in the shared config DB (`ClusterNode.RedundancyRole`), not elected at runtime. An operator flips the row (or a failover script does). Coordinator only reads; never writes. |
| `sp_PublishGeneration` hook | Before the apply starts, the coordinator sets `ApplyInProgress = true` in-memory → `ServiceLevel` drops to mid-apply band. Clears after `sp_PublishGeneration` returns. |
| Admin UI `/cluster/{id}` page | New `RedundancyTab.razor` — shows current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing `ClusterNode.RedundancyRole` + publishing a draft. |
| Metrics | New OpenTelemetry metrics: `ot_opcua_service_level{cluster,node}`, `ot_opcua_peer_reachable{cluster,node,peer}`, `ot_opcua_apply_in_progress{cluster,node}`. Sink via Phase 6.1 observability layer. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA authn / authz | Phases 6.2 + prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decision #7985. |
| Transparent redundancy (`RedundancySupport=Transparent`) | Not supported. If the operator asks for it the server fails startup with a clear error. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |
## Entry Gate Checklist
- [ ] Phase 6.1 merged (uses `/healthz` for peer probing)
- [ ] `CLAUDE.md` §Redundancy + `docs/Redundancy.md` re-read
- [ ] Decisions #7985 re-skimmed
- [ ] `ServerCluster`/`ClusterNode`/`RedundancyRole`/`RedundancyMode` entities + existing migration reviewed
- [ ] OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
- [ ] Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing
## Task Breakdown
### Stream A — Cluster topology loader (3 days)
1. **A.1** `RedundancyCoordinator` startup path: reads `ClusterNode` row for the current node (identified by `appsettings.json` `Cluster:NodeId`), reads the cluster's peer list, validates invariants (no duplicate `ApplicationUri`, at most one `Primary` per cluster if `RedundancyMode.WarmActive`, at most two nodes total in v2.0 per decision #83).
2. **A.2** Topology subscription — coordinator re-reads on `sp_PublishGeneration` confirmation so an operator role-swap takes effect after publish (no process restart needed).
3. **A.3** Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
### Stream B — Peer health probing + ServiceLevel computation (4 days)
1. **B.1** `PeerProbeLoop` runs per peer at `PeerProbeInterval` (2 s default, configurable via `appsettings.json`). Calls peer's `/healthz` via `HttpClient`; timeout 1 s. Exponential backoff on sustained failure.
2. **B.2** `ServiceLevelCalculator.Compute(current role, self health, peer reachable, apply in progress) → byte`. Matrix documented in §Scope.
3. **B.3** Calculator reacts to inputs via `IObserver` pattern so changes immediately push to the OPC UA `ServiceLevel` node.
4. **B.4** Tests: matrix coverage for all role × health × apply permutations (32 cases); injected `IClock` + fake `HttpClient` so tests are deterministic.
### Stream C — OPC UA node wiring (3 days)
1. **C.1** `ServiceLevel` variable node created under `ServerStatus` at server startup. Type `Byte`, AccessLevel = CurrentRead only. Subscribe to `ServiceLevelCalculator` observable; push updates via `DataChangeNotification`.
2. **C.2** `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, length = peer count. Updates on topology change.
3. **C.3** `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Phase 6.3 supports everything except `Transparent` + `HotAndMirrored`.
4. **C.4** Test against the Client.CLI: connect to primary, read `ServiceLevel` → expect 255; pause primary apply → expect 200; fail primary → client sees `Bad_ServerNotConnected` + reconnects to peer at 100.
### Stream D — Apply-window integration (2 days)
1. **D.1** `sp_PublishGeneration` caller wraps the apply in `using (coordinator.BeginApplyWindow())`. `BeginApplyWindow` increments an in-process counter; ServiceLevel drops on first increment. Dispose decrements.
2. **D.2** Nested applies handled by the counter (rarely happens but Ignition and Kepware clients have both been observed firing rapid-succession draft publishes).
3. **D.3** Test: mid-apply subscribe on primary; assert the subscribing client sees the ServiceLevel drop immediately after the apply starts, then restore after apply completes.
### Stream E — Admin UI + metrics (3 days)
1. **E.1** `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel, peer reachability, last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
2. **E.2** OpenTelemetry meter export: three gauges per the §Scope metrics. Sink via Phase 6.1 observability.
3. **E.3** SignalR push: `FleetStatusHub` broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
## Compliance Checks (run at exit gate)
- [ ] **Primary-healthy** ServiceLevel = 255.
- [ ] **Backup-healthy** ServiceLevel = 100.
- [ ] **Mid-apply Primary** ServiceLevel = 200 — verified via Client.CLI subscription polling ServiceLevel during a forced draft publish.
- [ ] **Peer-unreachable** handling: when a Primary can't probe its Backup's `/healthz`, Primary still serves at 255 (peer is the one with the problem). When a Backup can't probe Primary, Backup flips to 200 (per decision #81 — a lonely Backup promotes its advertised level to signal "I'll take over if you ask" without auto-promoting).
- [ ] **Role transition via operator publish**: FleetAdmin swaps `RedundancyRole` rows in a draft, publishes; both nodes re-read topology on publish confirmation and flip ServiceLevel accordingly — no restart needed.
- [ ] **ServerUriArray** returns exactly the peer node's ApplicationUri.
- [ ] **Client.CLI cutover**: with a primary deliberately halted, a client that was connected to primary reconnects to the backup within the ServiceLevel-polling interval.
- [ ] No regression in existing driver test suites; no regression in `/healthz` reachability under redundancy load.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in `sp_PublishGeneration`. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; only declares a peer unreachable after 3 consecutive failed probes |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in `docs/Redundancy.md` — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is client-responsibility |
| Apply-window counter leaks on exception | Low | High | `BeginApplyWindow` returns `IDisposable`; `using` syntax enforces paired decrement; unit test for exception-in-apply path |
| `HttpClient` probe leaks sockets | Low | Medium | Single shared `HttpClient` per coordinator (not per-probe); timeouts tight to avoid keeping connections open during peer downtime |
## Completion Checklist
- [ ] Stream A: topology loader + tests
- [ ] Stream B: peer probe + ServiceLevel calculator + 32-case matrix tests
- [ ] Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
- [ ] Stream D: apply-window integration + nested-apply counter
- [ ] Stream E: Admin `RedundancyTab` + OpenTelemetry metrics + SignalR push
- [ ] `phase-6-3-compliance.ps1` exits 0; exit-gate doc; `docs/Redundancy.md` updated with the ServiceLevel matrix
## Adversarial Review — 2026-04-19 (Codex, thread `019da490-3fa0-7340-98b8-cceeca802550`)
1. **Crit · ACCEPT** — No publish-generation fencing enables split-publish advertising both as authoritative. **Change**: coordinator CAS on a monotonic `ConfigGenerationId`; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
2. **Crit · ACCEPT**`>1 Primary` at startup covered but runtime containment missing when invalid topology appears later (mid-apply race). **Change**: add runtime `InvalidTopology` state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
3. **High · ACCEPT**`0 = Faulted` collides with OPC UA Part 5 §6.3.34 semantics where 0 means **Maintenance** and 1 means NoData. **Change**: reserve **0** for operator-declared maintenance-mode only; Faulted/unreachable uses **1** (NoData); in-range degraded states occupy 2..199.
4. **High · ACCEPT** — Matrix collapses distinct operational states onto the same value. **Change**: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
5. **High · ACCEPT**`/healthz` from 6.1 is HTTP-healthy but doesn't guarantee OPC UA data plane. **Change**: add a redundancy-specific probe `UaHealthProbe` — issues a `ReadAsync(ServiceLevel)` against the peer's OPC UA endpoint via a lightweight client session. `/healthz` remains the fast-fail; the UA probe is the authority signal.
6. **High · ACCEPT**`ServerUriArray` must include self + peers, not peers only. **Change**: array contains `[self.ApplicationUri, peer.ApplicationUri]` in stable deterministic ordering; compliance test asserts local-plus-peer membership.
7. **Med · ACCEPT** — No `Faulted → Recovering → Healthy` path. **Change**: add `Recovering` state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
8. **Med · ACCEPT** — Topology change during in-flight probe undefined. **Change**: every probe task tagged with `ConfigGenerationId` at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
9. **Med · ACCEPT** — Apply-window counter race on exception/cancellation/async ownership. **Change**: apply-window is a named lease keyed to `(ConfigGenerationId, PublishRequestId)` with disposal enforced via `await using`; watchdog detects leased-but-abandoned and force-closes after `ApplyMaxDuration` (default 10 min).
10. **High · ACCEPT** — Ignition + Kepware + Aveva OI Gateway `ServiceLevel` compliance is unverified. **Change**: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel documented as requiring explicit backup-endpoint config.
11. **Med · ACCEPT** — Galaxy MXAccess re-session on Primary death not in acceptance. **Change**: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, asserts Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)` budget. `docs/Redundancy.md` updated with required session timeouts.
12. **Med · ACCEPT** — Transparent-mode startup rejection is outage-prone. **Change**: `sp_PublishGeneration` validates `RedundancyMode` pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.

View File

@@ -0,0 +1,120 @@
# Phase 6.4 — Admin UI Completion
> **Status**: DRAFT — Phase 1 Stream E shipped the Admin scaffold + core pages; several feature-completeness items from its completion checklist (`phase-1-configuration-and-admin-scaffold.md` §Stream E) never landed. This phase closes them.
>
> **Branch**: `v2/phase-6-4-admin-ui-completion`
> **Estimated duration**: 2 weeks
> **Predecessor**: Phase 6.3 (Redundancy runtime) — reuses the `/cluster/{id}` page layout for the new tabs
> **Successor**: v2 release-readiness capstone (Task #121)
## Phase Objective
Close the Admin UI feature-completeness checklist that Phase 1 Stream E exit gate left open. Each item below is an existing `phase-1-configuration-and-admin-scaffold.md` completion-checklist entry that is currently unchecked.
Gaps to close:
1. **UNS Structure tab drag/move with impact preview** — decision #115 + `admin-ui.md` §"UNS". Current state: list-only render; no drag reorder; no "X lines / Y equipment impacted" preview.
2. **Equipment CSV import + 5-identifier search** — decision #95 + #117. Current state: basic form; no CSV parser; search indexes only ZTag.
3. **Draft-generation diff viewer** — enhance existing `DiffViewer.razor` to show generation-diff not just staged-edit diff; highlight ACL grant changes (lands after Phase 6.2).
4. **`_base` equipment-class Identification fields exposure** — decision #138139. Columns exist on `Equipment`; no Admin UI field group; no address-space exposure of the OPC 40010 sub-folder.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Admin/Pages/UnsTab.razor` | Rewrite as a tree component with drag-drop (Blazor-native HTML5 DnD; no third-party dep). Each drag fires a "Compute Impact" call against the draft-generation state + renders a modal preview ("Moving Line 'Oven-2' from 'Packaging' to 'Assembly' will re-home 14 equipment + re-parent 237 tags"). Confirmation commits the draft edit. |
| `Admin/Services/UnsImpactAnalyzer.cs` | New service. Given a move-operation (line move, area rename, line merge), computes cascade counts by walking the draft-generation `Equipment` + `Tag` tables. Pure-function shape; testable in isolation. |
| `Admin/Pages/EquipmentTab.razor` | Add CSV-import button → modal with file picker + dry-run preview. Add multi-identifier search bar (ZTag / SAPID / UniqueId / Alias1 / Alias2) per decision #95 — parses any of the five, shows matches across draft + published generations. |
| `Admin/Services/EquipmentCsvImporter.cs` | New service. Parses CSV with documented header row; validates each row against the `Equipment` schema (required fields + `ExternalIdReservation` freshness); returns `ImportPreview` DTO with per-row accept/reject + reason; commit step wraps in a single EF transaction. |
| `Admin/Pages/DraftEditor.razor` + `DiffViewer.razor` | Diff viewer expanded: adds sections for ACL grants (from Phase 6.2 `LdapGroupRoleMapping` + `NodeAcl`), redundancy-role changes (from Phase 6.3), equipment-class `_base` Identification fields. Render each section collapsible. |
| `Admin/Components/IdentificationFields.razor` | New component. Renders the OPC 40010 nullable columns (Manufacturer, Model, SerialNumber, ProductInstanceUri, HardwareRevision, SoftwareRevision, DeviceRevision, YearOfConstruction, MonthOfConstruction) as a labelled field group on the `EquipmentTab` detail view. |
| `OtOpcUa.Server/OpcUa/DriverNodeManager` — Equipment folder build | When an `Equipment` row has non-null Identification fields, the server adds an `Identification` sub-folder under the Equipment node containing one variable per non-null field. Matches OPC 40010 companion spec. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Admin UI visual language | Bootstrap 5 / cookie auth / sidebar layout unchanged — consistency with ScadaLink design reference. |
| LDAP auth flow | Already shipped in Phase 1. Phase 6.4 is additive UI only. |
| Core abstractions / driver layer | Admin UI changes don't touch drivers. |
| Equipment-class *template schema validation* | Still deferred (decision #112 — schemas repo not landed). We expose the Identification fields but don't validate against a template hierarchy. |
| Drag/move to *other clusters* | Out of scope — equipment is cluster-scoped per decision #82. Cross-cluster migration is a different workflow. |
## Entry Gate Checklist
- [ ] Phase 6.2 merged (ACL grants are part of the new diff viewer sections)
- [ ] Phase 6.3 merged (redundancy-role changes are part of the diff viewer)
- [ ] `phase-1-configuration-and-admin-scaffold.md` §Stream E completion checklist re-read — confirm these are the remaining items
- [ ] `admin-ui.md` re-skimmed for screen layouts
- [ ] Existing `EquipmentTab.razor` / `UnsTab.razor` / `DraftEditor.razor` diff'd against what ships today so the edits are additive not destructive
- [ ] Dev Galaxy available for OPC 40010 exposure smoke testing
## Task Breakdown
### Stream A — UNS drag/reorder + impact preview (4 days)
1. **A.1** `UnsImpactAnalyzer` service. Inputs: `(DraftGenerationId, MoveOperation)`. Outputs: `ImpactPreview { AffectedEquipmentCount, AffectedTagCount, CascadeWarnings[] }`. Unit tests cover line move / area rename / line merge.
2. **A.2** HTML5 DnD on a tree component. No JS interop beyond `ondragstart`/`ondragover`/`ondrop` — keeps build + testability simple.
3. **A.3** Modal preview wired to `UnsImpactAnalyzer` output; "Confirm" commits a draft edit via `DraftService`.
4. **A.4** Playwright smoke test (or equivalent): drag a line across areas, assert modal shows the right counts, assert draft row reflects the move.
### Stream B — Equipment CSV import + 5-identifier search (4 days)
1. **B.1** `EquipmentCsvImporter` with a documented header row (`ZTag, SAPID, UniqueId, Alias1, Alias2, Name, UnsAreaName, UnsLineName, Manufacturer, Model, SerialNumber, …`). Parser rejects unknown columns + blank required fields + duplicate ZTags.
2. **B.2** `ImportPreview` UI: per-row accept/reject table. Reject reasons: "ZTag already exists in draft", "ExternalIdReservation conflict with Cluster X", "UnsLineName not found in draft UNS tree", etc. Operator reviews then clicks "Commit" → single EF transaction.
3. **B.3** Multi-identifier search — bar accepts any of the 5 identifiers, probes each column in parallel, returns first-match-wins + disambiguation list if multiple match.
4. **B.4** Smoke tests: 100-row CSV with 10 intentional conflicts (5 ZTag dupes, 3 reservation clashes, 2 missing UnsLines); assert preview flags each; assert commit rolls back cleanly when a conflict surfaces post-preview.
### Stream C — Diff viewer enhancements (3 days)
1. **C.1** Refactor `DiffViewer.razor` into a base component + section plugins. Section plugins: `StructuralDiffSection` (UNS tree), `EquipmentDiffSection` (Equipment rows), `TagDiffSection` (Tag rows), `AclDiffSection` (ACL grants — depends on Phase 6.2), `RedundancyDiffSection` (role changes — depends on Phase 6.3), `IdentificationDiffSection` (OPC 40010 fields).
2. **C.2** Each section renders collapsed by default; counts + top-line summary always visible.
3. **C.3** Tests: seed two generations with deliberate diffs, assert every section reports the right counts + top-line summary.
### Stream D — OPC 40010 Identification exposure (3 days)
1. **D.1** `IdentificationFields.razor` component — labelled inputs; nullable columns show empty input; required field validation only on commit.
2. **D.2** `DriverNodeManager` equipment-folder builder — after building the equipment node, inspect the Identification columns; if any non-null, add an `Identification` sub-folder with variable-per-field.
3. **D.3** Address-space smoke test via Client.CLI: browse an equipment node, assert `Identification` sub-folder present when columns are set, absent when all null.
## Compliance Checks (run at exit gate)
- [ ] **UNS drag/move**: drag a line across areas; modal preview shows correct impacted-equipment + impacted-tag counts.
- [ ] **Equipment CSV**: 100-row CSV with 10 conflicts imports cleanly (preview flags each, commit rolls back mid-conflict).
- [ ] **5-identifier search**: querying any of the 5 IDs returns the matching row; ambiguous searches list options.
- [ ] **Diff viewer**: every section renders for a 2-generation diff with deliberate changes in every category.
- [ ] **OPC 40010 exposure**: Client.CLI browse shows `Identification` sub-folder when equipment has non-null columns; folder absent when all null.
- [ ] **ScadaLink visual parity**: operator-equivalence reviewer signs off that the new tabs feel consistent with existing Admin UI pages.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| UNS drag-drop janky on large trees (>500 nodes) | Medium | Medium | Virtualize the tree component; default-collapse nested areas; test with a synthetic 1000-equipment seed |
| CSV import performance on 10k-row imports | Medium | Medium | Stream-parse rather than load-into-memory; preview renders in batches of 100; commit is chunked-EF-insert with progress bar |
| Diff viewer becomes unwieldy with many sections | Low | Medium | Each section collapsed by default; top-line summary row always shown; Phase 6.4 caps at 6 sections |
| OPC 40010 sub-folder accidentally exposes NULL/empty identification columns as empty-string variables | Low | Low | Column null-check in the builder; drop variables whose DB value is null |
| 5-identifier search pulls full table | Medium | Medium | Indexes on each of ZTag/SAPID/UniqueId/Alias1/Alias2; search query uses a UNION of 5 indexed lookups; falls back to LIKE only on explicit operator opt-in |
## Completion Checklist
- [ ] Stream A: `UnsImpactAnalyzer` + drag-drop tree + modal preview + Playwright smoke
- [ ] Stream B: `EquipmentCsvImporter` + preview modal + 5-identifier search + conflict-rollback test
- [ ] Stream C: `DiffViewer` refactor + 6 section plugins + 2-generation diff test
- [ ] Stream D: `IdentificationFields.razor` + address-space builder change + Client.CLI browse test
- [ ] Visual-compliance reviewer signoff
- [ ] Full solution `dotnet test` passes; `phase-6-4-compliance.ps1` exits 0; exit-gate doc
## Adversarial Review — 2026-04-19 (Codex, via `codex-rescue` subagent)
1. **Crit · ACCEPT** — Stale UNS impact preview can overwrite concurrent draft edits. **Change**: each preview carries a `DraftRevisionToken`; `Confirm` compares against the current draft + rejects with a `409 Conflict / refresh-required` modal if any draft edit landed since the preview was generated. Stream A.3 updated.
2. **High · ACCEPT** — CSV import atomicity is internally contradictory (single EF transaction vs. chunked inserts). **Change**: one explicit model — staged-import table (`EquipmentImportBatch { Id, CreatedAtUtc, RowsStaged, RowsAccepted, RowsRejected }`) receives rows in chunks; final `FinaliseImportBatch` is atomic over `Equipment` + `ExternalIdReservation`. Rollback is "drop the batch row" — the real Equipment table is never partially mutated.
3. **Crit · ACCEPT** — Identifier contract rewrite mis-cites decisions. **Change**: revert to the `admin-ui.md` + decision #117 canonical set — `ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid`. CSV header follows that set verbatim. Introduce a separate decision entry for versioned CSV header shape before adding any new column; CSV header row must start with `# OtOpcUaCsv v1` so future shape changes are unambiguous.
4. **Med · ACCEPT** — Search ordering undefined. **Change**: rank SQL — exact match on any identifier scores 100; prefix match 50; LIKE-fuzzy 20; published > draft tie-breaker; `ORDER BY score DESC, RowVersion DESC`. Typeahead shows which field matched via trailing badge.
5. **High · ACCEPT** — HTML5 DnD on virtualized tree is aspirational. **Change**: Stream A.2 rewritten — commits to **`MudBlazor.TreeView` + `MudBlazor.DropTarget`** (already a transitive dep via the existing Admin UI). Build a 1000-node synthetic seed in A.1 + validate drag-latency budget before implementing impact preview. If MudBlazor can't hit the budget, fall back to a flat-list reorder UI with Area/Line dropdowns (loss of visual drag affordance but unblocks the feature).
6. **Med · ACCEPT** — Collapsed-by-default doesn't handle generation-sized diffs. **Change**: each diff section has a hard row cap (1000 by default). Over-cap sections render an aggregate summary + "Load full diff" button that streams via SignalR in 500-row pages. Decision #115 subtree renames surface as a "N equipment re-parented under X → Y" summary instead of row-by-row.
7. **High · ACCEPT** — OPC 40010 field list doesn't match decision #139. **Change**: field group realigned to `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. `ProductInstanceUri / DeviceRevision / MonthOfConstruction` dropped from Phase 6.4 — they belong to a future OPC 40010 widening decision.
8. **High · ACCEPT**`Identification` subtree unreconciled with ACL hierarchy (Phase 6.2 6-level scope). **Change**: address-space builder creates the Identification sub-folder under the Equipment node **with the same ScopeId as Equipment** — no new scope level. ACL evaluator treats `…/Equipment/Identification/X` as inheriting the `Equipment` scope's grants. Documented in Phase 6.2's `acl-design.md` cross-reference update.
9. **Low · ACCEPT** — Visual-review gate names nonexistent reviewer role. **Change**: rubric defined — a named "Admin UX reviewer" (role `FleetAdmin` user, not the implementation lead) compares side-by-side screenshots against the `admin-ui.md` §Visual-Design reference panels; signoff artefact is a checked-in screenshot set under `docs/v2/visual-compliance/phase-6-4/`.
10. **Med · ACCEPT** — Cross-cluster drag/drop lacks loud failure path. **Change**: on drop across cluster boundary, disable the drop target + show a toast "Equipment is cluster-scoped (decision #82). To move across clusters, use the Export → Import workflow on the Cluster detail page." Plus a help link. Tested in Stream A.4.

View File

@@ -27,8 +27,15 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient;
/// </para>
/// </remarks>
public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string driverInstanceId)
: IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IDisposable, IAsyncDisposable
: IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IAlarmSource, IHistoryProvider, IDisposable, IAsyncDisposable
{
// ---- IAlarmSource state ----
private readonly System.Collections.Concurrent.ConcurrentDictionary<long, RemoteAlarmSubscription> _alarmSubscriptions = new();
private long _nextAlarmSubscriptionId;
public event EventHandler<AlarmEventArgs>? OnAlarmEvent;
// ---- ISubscribable + IHostConnectivityProbe state ----
private readonly System.Collections.Concurrent.ConcurrentDictionary<long, RemoteSubscription> _subscriptions = new();
@@ -395,6 +402,13 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
}
_subscriptions.Clear();
foreach (var ras in _alarmSubscriptions.Values)
{
try { await ras.Subscription.DeleteAsync(silent: true, cancellationToken).ConfigureAwait(false); }
catch { /* best-effort */ }
}
_alarmSubscriptions.Clear();
// Abort any in-flight reconnect attempts before touching the session — BeginReconnect's
// retry loop holds a reference to the current session and would fight Session.CloseAsync
// if left spinning.
@@ -936,6 +950,325 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
public string DiagnosticId => $"opcua-sub-{Id}";
}
// ---- IAlarmSource ----
/// <summary>
/// Field positions in the EventFilter SelectClauses below. Used to index into the
/// <c>EventFieldList.EventFields</c> Variant collection when an event arrives.
/// </summary>
private const int AlarmFieldEventId = 0;
private const int AlarmFieldEventType = 1;
private const int AlarmFieldSourceNode = 2;
private const int AlarmFieldMessage = 3;
private const int AlarmFieldSeverity = 4;
private const int AlarmFieldTime = 5;
private const int AlarmFieldConditionId = 6;
public async Task<IAlarmSubscriptionHandle> SubscribeAlarmsAsync(
IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken)
{
var session = RequireSession();
var id = Interlocked.Increment(ref _nextAlarmSubscriptionId);
var handle = new OpcUaAlarmSubscriptionHandle(id);
// Pre-resolve the source-node filter set so the per-event notification handler can
// match in O(1) without re-parsing on every event.
var sourceFilter = new HashSet<string>(sourceNodeIds, StringComparer.Ordinal);
var subscription = new Subscription(telemetry: null!, new SubscriptionOptions
{
DisplayName = $"opcua-alarm-sub-{id}",
PublishingInterval = 500, // 500ms — alarms don't need fast polling; the server pushes
KeepAliveCount = 10,
LifetimeCount = 1000,
MaxNotificationsPerPublish = 0,
PublishingEnabled = true,
Priority = 0,
TimestampsToReturn = TimestampsToReturn.Both,
});
// EventFilter SelectClauses — pick the standard BaseEventType fields we need to
// materialize an AlarmEventArgs. Field positions are indexed by the AlarmField*
// constants so the notification handler indexes in O(1) without re-examining the
// QualifiedName BrowsePaths.
var filter = new EventFilter();
void AddField(string browseName) => filter.SelectClauses.Add(new SimpleAttributeOperand
{
TypeDefinitionId = ObjectTypeIds.BaseEventType,
BrowsePath = [new QualifiedName(browseName)],
AttributeId = Attributes.Value,
});
AddField("EventId");
AddField("EventType");
AddField("SourceNode");
AddField("Message");
AddField("Severity");
AddField("Time");
// ConditionId on ConditionType nodes is the branch identifier for
// acknowledgeable conditions. Not a BaseEventType field — reach it via the typed path.
filter.SelectClauses.Add(new SimpleAttributeOperand
{
TypeDefinitionId = ObjectTypeIds.ConditionType,
BrowsePath = [], // empty path = the condition node itself
AttributeId = Attributes.NodeId,
});
await _gate.WaitAsync(cancellationToken).ConfigureAwait(false);
try
{
session.AddSubscription(subscription);
await subscription.CreateAsync(cancellationToken).ConfigureAwait(false);
var eventItem = new MonitoredItem(telemetry: null!, new MonitoredItemOptions
{
DisplayName = "Server/Events",
StartNodeId = ObjectIds.Server,
AttributeId = Attributes.EventNotifier,
MonitoringMode = MonitoringMode.Reporting,
QueueSize = 1000, // deep queue — a server can fire many alarms in bursts
DiscardOldest = false,
Filter = filter,
})
{
Handle = handle,
};
eventItem.Notification += (mi, args) => OnEventNotification(handle, sourceFilter, mi, args);
subscription.AddItem(eventItem);
await subscription.CreateItemsAsync(cancellationToken).ConfigureAwait(false);
_alarmSubscriptions[id] = new RemoteAlarmSubscription(subscription, handle);
}
finally { _gate.Release(); }
return handle;
}
public async Task UnsubscribeAlarmsAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken)
{
if (handle is not OpcUaAlarmSubscriptionHandle h) return;
if (!_alarmSubscriptions.TryRemove(h.Id, out var rs)) return;
await _gate.WaitAsync(cancellationToken).ConfigureAwait(false);
try
{
try { await rs.Subscription.DeleteAsync(silent: true, cancellationToken).ConfigureAwait(false); }
catch { /* best-effort — session may already be gone across a reconnect */ }
}
finally { _gate.Release(); }
}
public async Task AcknowledgeAsync(
IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements, CancellationToken cancellationToken)
{
// Short-circuit empty batch BEFORE touching the session so callers can pass an empty
// list without guarding the size themselves — e.g. a bulk-ack UI that built an empty
// list because the filter matched nothing.
if (acknowledgements.Count == 0) return;
var session = RequireSession();
// OPC UA A&C: call the AcknowledgeableConditionType.Acknowledge method on each
// condition node with EventId + Comment arguments. CallAsync accepts a batch —
// one CallMethodRequest per ack.
var callRequests = new CallMethodRequestCollection();
foreach (var ack in acknowledgements)
{
if (!TryParseNodeId(session, ack.ConditionId, out var conditionId)) continue;
callRequests.Add(new CallMethodRequest
{
ObjectId = conditionId,
MethodId = MethodIds.AcknowledgeableConditionType_Acknowledge,
InputArguments = [
new Variant(Array.Empty<byte>()), // EventId — server-side best-effort; empty resolves to 'most recent'
new Variant(new LocalizedText(ack.Comment ?? string.Empty)),
],
});
}
if (callRequests.Count == 0) return;
await _gate.WaitAsync(cancellationToken).ConfigureAwait(false);
try
{
try
{
_ = await session.CallAsync(
requestHeader: null,
methodsToCall: callRequests,
ct: cancellationToken).ConfigureAwait(false);
}
catch { /* best-effort — caller's re-ack mechanism catches pathological paths */ }
}
finally { _gate.Release(); }
}
private void OnEventNotification(
OpcUaAlarmSubscriptionHandle handle,
HashSet<string> sourceFilter,
MonitoredItem item,
MonitoredItemNotificationEventArgs args)
{
if (args.NotificationValue is not EventFieldList efl) return;
if (efl.EventFields.Count <= AlarmFieldConditionId) return;
var sourceNode = efl.EventFields[AlarmFieldSourceNode].Value?.ToString() ?? string.Empty;
if (sourceFilter.Count > 0 && !sourceFilter.Contains(sourceNode)) return;
var eventType = efl.EventFields[AlarmFieldEventType].Value?.ToString() ?? "BaseEventType";
var message = (efl.EventFields[AlarmFieldMessage].Value as LocalizedText)?.Text ?? string.Empty;
var severity = efl.EventFields[AlarmFieldSeverity].Value is ushort sev ? sev : (ushort)0;
var time = efl.EventFields[AlarmFieldTime].Value is DateTime t ? t : DateTime.UtcNow;
var conditionId = efl.EventFields[AlarmFieldConditionId].Value?.ToString() ?? string.Empty;
OnAlarmEvent?.Invoke(this, new AlarmEventArgs(
SubscriptionHandle: handle,
SourceNodeId: sourceNode,
ConditionId: conditionId,
AlarmType: eventType,
Message: message,
Severity: MapSeverity(severity),
SourceTimestampUtc: time));
}
/// <summary>
/// Map an OPC UA <c>BaseEventType.Severity</c> (1..1000) to our coarse-grained
/// <see cref="AlarmSeverity"/> bucket. Thresholds match the OPC UA A&amp;C Part 9
/// guidance: 1-200 Low, 201-500 Medium, 501-800 High, 801-1000 Critical.
/// </summary>
internal static AlarmSeverity MapSeverity(ushort opcSeverity) => opcSeverity switch
{
<= 200 => AlarmSeverity.Low,
<= 500 => AlarmSeverity.Medium,
<= 800 => AlarmSeverity.High,
_ => AlarmSeverity.Critical,
};
private sealed record RemoteAlarmSubscription(Subscription Subscription, OpcUaAlarmSubscriptionHandle Handle);
private sealed record OpcUaAlarmSubscriptionHandle(long Id) : IAlarmSubscriptionHandle
{
public string DiagnosticId => $"opcua-alarm-sub-{Id}";
}
// ---- IHistoryProvider (passthrough to upstream server) ----
public async Task<Core.Abstractions.HistoryReadResult> ReadRawAsync(
string fullReference, DateTime startUtc, DateTime endUtc, uint maxValuesPerNode,
CancellationToken cancellationToken)
{
var details = new ReadRawModifiedDetails
{
IsReadModified = false,
StartTime = startUtc,
EndTime = endUtc,
NumValuesPerNode = maxValuesPerNode,
ReturnBounds = false,
};
return await ExecuteHistoryReadAsync(fullReference, new ExtensionObject(details), cancellationToken)
.ConfigureAwait(false);
}
public async Task<Core.Abstractions.HistoryReadResult> ReadProcessedAsync(
string fullReference, DateTime startUtc, DateTime endUtc, TimeSpan interval,
HistoryAggregateType aggregate, CancellationToken cancellationToken)
{
var aggregateId = MapAggregateToNodeId(aggregate);
var details = new ReadProcessedDetails
{
StartTime = startUtc,
EndTime = endUtc,
ProcessingInterval = interval.TotalMilliseconds,
AggregateType = [aggregateId],
};
return await ExecuteHistoryReadAsync(fullReference, new ExtensionObject(details), cancellationToken)
.ConfigureAwait(false);
}
public async Task<Core.Abstractions.HistoryReadResult> ReadAtTimeAsync(
string fullReference, IReadOnlyList<DateTime> timestampsUtc, CancellationToken cancellationToken)
{
var reqTimes = new DateTimeCollection(timestampsUtc);
var details = new ReadAtTimeDetails
{
ReqTimes = reqTimes,
UseSimpleBounds = true,
};
return await ExecuteHistoryReadAsync(fullReference, new ExtensionObject(details), cancellationToken)
.ConfigureAwait(false);
}
/// <summary>
/// Shared HistoryRead wire path — used by Raw/Processed/AtTime. Handles NodeId parse,
/// Session.HistoryReadAsync call, Bad-StatusCode passthrough (no translation per §8
/// cascading-quality rule), and HistoryData unwrap into <see cref="DataValueSnapshot"/>.
/// </summary>
private async Task<Core.Abstractions.HistoryReadResult> ExecuteHistoryReadAsync(
string fullReference, ExtensionObject historyReadDetails, CancellationToken ct)
{
var session = RequireSession();
if (!TryParseNodeId(session, fullReference, out var nodeId))
{
return new Core.Abstractions.HistoryReadResult([], null);
}
var nodesToRead = new HistoryReadValueIdCollection
{
new HistoryReadValueId { NodeId = nodeId },
};
await _gate.WaitAsync(ct).ConfigureAwait(false);
try
{
var resp = await session.HistoryReadAsync(
requestHeader: null,
historyReadDetails: historyReadDetails,
timestampsToReturn: TimestampsToReturn.Both,
releaseContinuationPoints: false,
nodesToRead: nodesToRead,
ct: ct).ConfigureAwait(false);
if (resp.Results.Count == 0) return new Core.Abstractions.HistoryReadResult([], null);
var r = resp.Results[0];
// Unwrap HistoryData from the ExtensionObject-encoded payload the SDK returns.
// Samples stay in chronological order per OPC UA Part 11; cascading-quality
// rule: preserve each DataValue's upstream StatusCode + timestamps verbatim.
var samples = new List<DataValueSnapshot>();
if (r.HistoryData?.Body is HistoryData hd)
{
var now = DateTime.UtcNow;
foreach (var dv in hd.DataValues)
{
samples.Add(new DataValueSnapshot(
Value: dv.Value,
StatusCode: dv.StatusCode.Code,
SourceTimestampUtc: dv.SourceTimestamp == DateTime.MinValue ? null : dv.SourceTimestamp,
ServerTimestampUtc: dv.ServerTimestamp == DateTime.MinValue ? now : dv.ServerTimestamp));
}
}
var contPt = r.ContinuationPoint is { Length: > 0 } ? r.ContinuationPoint : null;
return new Core.Abstractions.HistoryReadResult(samples, contPt);
}
finally { _gate.Release(); }
}
/// <summary>Map <see cref="HistoryAggregateType"/> to the OPC UA Part 13 standard aggregate NodeId.</summary>
internal static NodeId MapAggregateToNodeId(HistoryAggregateType aggregate) => aggregate switch
{
HistoryAggregateType.Average => ObjectIds.AggregateFunction_Average,
HistoryAggregateType.Minimum => ObjectIds.AggregateFunction_Minimum,
HistoryAggregateType.Maximum => ObjectIds.AggregateFunction_Maximum,
HistoryAggregateType.Total => ObjectIds.AggregateFunction_Total,
HistoryAggregateType.Count => ObjectIds.AggregateFunction_Count,
_ => throw new ArgumentOutOfRangeException(nameof(aggregate), aggregate, null),
};
// ReadEventsAsync stays at the interface default (throws NotSupportedException) per
// IHistoryProvider contract -- the OPC UA Client driver CAN forward HistoryReadEvents,
// but the call-site needs an EventFilter SelectClauses surface which the interface
// doesn't carry. Landing the event-history passthrough requires extending
// IHistoryProvider.ReadEventsAsync with a filter-spec parameter; out of scope for this PR.
// ---- IHostConnectivityProbe ----
/// <summary>

View File

@@ -0,0 +1,70 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
[Trait("Category", "Unit")]
public sealed class OpcUaClientAlarmTests
{
[Theory]
[InlineData((ushort)1, AlarmSeverity.Low)]
[InlineData((ushort)200, AlarmSeverity.Low)]
[InlineData((ushort)201, AlarmSeverity.Medium)]
[InlineData((ushort)500, AlarmSeverity.Medium)]
[InlineData((ushort)501, AlarmSeverity.High)]
[InlineData((ushort)800, AlarmSeverity.High)]
[InlineData((ushort)801, AlarmSeverity.Critical)]
[InlineData((ushort)1000, AlarmSeverity.Critical)]
public void MapSeverity_buckets_per_OPC_UA_Part_9_guidance(ushort opcSev, AlarmSeverity expected)
{
OpcUaClientDriver.MapSeverity(opcSev).ShouldBe(expected);
}
[Fact]
public void MapSeverity_zero_maps_to_Low()
{
// 0 isn't in OPC UA's 1-1000 range but we handle it gracefully as Low.
OpcUaClientDriver.MapSeverity(0).ShouldBe(AlarmSeverity.Low);
}
[Fact]
public async Task SubscribeAlarmsAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-alarm-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.SubscribeAlarmsAsync([], TestContext.Current.CancellationToken));
}
[Fact]
public async Task UnsubscribeAlarmsAsync_with_unknown_handle_is_noop()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-alarm-unknown");
// Parallels the subscribe handle path — session-drop races shouldn't crash the caller.
await drv.UnsubscribeAlarmsAsync(new FakeAlarmHandle(), TestContext.Current.CancellationToken);
}
[Fact]
public async Task AcknowledgeAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-ack-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.AcknowledgeAsync(
[new AlarmAcknowledgeRequest("ns=2;s=Src", "ns=2;s=Cond", "operator ack")],
TestContext.Current.CancellationToken));
}
[Fact]
public async Task AcknowledgeAsync_with_empty_batch_is_noop_even_without_init()
{
// Empty batch short-circuits before touching the session, so it's safe pre-init. This
// keeps batch-ack callers from needing to guard the list size themselves.
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-ack-empty");
await drv.AcknowledgeAsync([], TestContext.Current.CancellationToken);
}
private sealed class FakeAlarmHandle : IAlarmSubscriptionHandle
{
public string DiagnosticId => "fake-alarm";
}
}

View File

@@ -0,0 +1,91 @@
using Opc.Ua;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
[Trait("Category", "Unit")]
public sealed class OpcUaClientHistoryTests
{
[Theory]
[InlineData(HistoryAggregateType.Average)]
[InlineData(HistoryAggregateType.Minimum)]
[InlineData(HistoryAggregateType.Maximum)]
[InlineData(HistoryAggregateType.Total)]
[InlineData(HistoryAggregateType.Count)]
public void MapAggregateToNodeId_returns_standard_Part13_aggregate_for_every_enum(HistoryAggregateType agg)
{
var nodeId = OpcUaClientDriver.MapAggregateToNodeId(agg);
NodeId.IsNull(nodeId).ShouldBeFalse();
// Every mapping should resolve to an AggregateFunction_* NodeId (namespace 0, numeric id).
nodeId.NamespaceIndex.ShouldBe((ushort)0);
}
[Fact]
public void MapAggregateToNodeId_rejects_invalid_enum_value()
{
// Defense-in-depth: a future HistoryAggregateType addition mustn't silently fall through.
Should.Throw<ArgumentOutOfRangeException>(() =>
OpcUaClientDriver.MapAggregateToNodeId((HistoryAggregateType)99));
}
[Fact]
public async Task ReadRawAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-hist-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.ReadRawAsync("ns=2;s=Counter",
DateTime.UtcNow.AddMinutes(-5), DateTime.UtcNow, 1000,
TestContext.Current.CancellationToken));
}
[Fact]
public async Task ReadRawAsync_with_malformed_NodeId_returns_empty_result_not_throw()
{
// Same defensive pattern as ReadAsync / WriteAsync — malformed NodeId short-circuits
// to an empty result rather than crashing a batch history call. Needs init via the
// throw path first, then we pass "" to trigger the parse-fail branch inside
// ExecuteHistoryReadAsync. The init itself fails against 127.0.0.1:1 so we stop there.
// Not runnable without init — keep as placeholder for when the in-process fixture
// PR lands.
await Task.CompletedTask;
}
[Fact]
public async Task ReadProcessedAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-hist-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.ReadProcessedAsync("ns=2;s=Counter",
DateTime.UtcNow.AddMinutes(-5), DateTime.UtcNow,
TimeSpan.FromSeconds(10), HistoryAggregateType.Average,
TestContext.Current.CancellationToken));
}
[Fact]
public async Task ReadAtTimeAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-hist-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.ReadAtTimeAsync("ns=2;s=Counter",
[DateTime.UtcNow.AddMinutes(-5), DateTime.UtcNow],
TestContext.Current.CancellationToken));
}
[Fact]
public async Task ReadEventsAsync_throws_NotSupportedException_as_documented()
{
// The IHistoryProvider default implementation throws; the OPC UA Client driver
// deliberately inherits that default (see PR 76 commit body) because the OPC UA
// client call path needs an EventFilter SelectClauses spec the interface doesn't carry.
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-events-default");
await Should.ThrowAsync<NotSupportedException>(async () =>
await ((IHistoryProvider)drv).ReadEventsAsync(
sourceName: null,
startUtc: DateTime.UtcNow.AddMinutes(-5),
endUtc: DateTime.UtcNow,
maxEvents: 100,
cancellationToken: TestContext.Current.CancellationToken));
}
}