Compare commits

...

29 Commits

Author SHA1 Message Date
Joseph Doherty
ba31f200f6 Phase 6 reconcile — merge adjustments into plan bodies, add decisions #143-162, scaffold compliance stubs
After shipping the four Phase 6 plan drafts (PRs 77-80), the adversarial-review
adjustments lived only as trailing "Review" sections. An implementer reading
Stream A would find the original unadjusted guidance, then have to cross-reference
the review to reconcile. This PR makes the plans genuinely executable:

1. Merges every ACCEPTed review finding into the actual Scope / Stream / Compliance
   sections of each phase plan:
   - phase-6-1: Scope table rewrite (per-capability retry, (instance,host) pipeline key,
     MemoryTracking vs MemoryRecycle split, hybrid watchdog formula, demand-aware
     wedge detector, generation-sealed LiteDB). Streams A/B/D + Compliance rewritten.
   - phase-6-2: AuthorizationDecision tri-state, control/data-plane separation,
     MembershipFreshnessInterval (15 min), AuthCacheMaxStaleness (5 min),
     subscription stamp-and-reevaluate. Stream C widened to 11 OPC UA operations.
   - phase-6-3: 8-state ServiceLevel matrix (OPC UA Part 5 §6.3.34-compliant),
     two-layer peer probe (/healthz + UaHealthProbe), apply-lease via await using,
     publish-generation fencing, InvalidTopology runtime state, ServerUriArray
     self-first + peers. New Stream F (interop matrix + Galaxy failover).
   - phase-6-4: DraftRevisionToken concurrency control, staged-import via
     EquipmentImportBatch with user-scoped visibility, CSV header version marker,
     decision-#117-aligned identifier columns, 1000-row diff cap,
     decision-#139 OPC 40010 fields, Identification inherits Equipment ACL.

2. Appends decisions #143 through #162 to docs/v2/plan.md capturing the
   architectural commitments the adjustments created. Each decision carries its
   dated rationale so future readers know why the choice was made.

3. Scaffolds scripts/compliance/phase-6-{1,2,3,4}-compliance.ps1 — PowerShell
   stubs with Assert-Todo / Assert-Pass / Assert-Fail helpers. Every check
   maps to a Stream task ID from the corresponding phase plan. Currently all
   checks are TODO and scripts exit 0; each implementation task is responsible
   for replacing its TODO with a real check before closing that task. Saved
   as UTF-8 with BOM so Windows PowerShell 5.1 parses em-dash characters
   without breaking.

Net result: the Phase 6.1 plan is genuinely ready to execute. Stream A.3 can
start tomorrow without reconciling Streams vs. Review on every task; the
compliance script is wired to the Stream IDs; plan.md has the architectural
commitments that justify the Stream choices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:49:41 -04:00
81a1f7f0f6 Merge pull request 'Phase 6 — Four implementation plans for unplanned v2 features, each with codex adversarial review' (#76) from phase-6-plans-drafts into v2 2026-04-19 03:17:16 -04:00
Joseph Doherty
4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes. 2026-04-19 03:15:00 -04:00
0109fab4bf Merge pull request 'Phase 3 PR 76 -- OPC UA Client IHistoryProvider' (#75) from phase-3-pr76-opcua-client-history into v2 2026-04-19 02:15:31 -04:00
Joseph Doherty
c9e856178a Phase 3 PR 76 -- OPC UA Client IHistoryProvider (HistoryRead passthrough). Driver now implements IHistoryProvider (Raw + Processed + AtTime); ReadEventsAsync deliberately inherits the interface default that throws NotSupportedException. ExecuteHistoryReadAsync is the shared wire path: parses the fullReference to NodeId, builds a HistoryReadValueIdCollection with one entry, calls Session.HistoryReadAsync(RequestHeader, ExtensionObject<details>, TimestampsToReturn.Both, releaseContinuationPoints:false, nodesToRead, ct), unwraps r.HistoryData ExtensionObject into the samples list, passes ContinuationPoint through. Each DataValue's upstream StatusCode + SourceTimestamp + ServerTimestamp preserved verbatim per driver-specs.md \u00A78 cascading-quality rule -- this matters especially for historical data where an interpolated / uncertain-quality sample must surface its true severity downstream, not a sanitized Good. SourceTimestamp=DateTime.MinValue guards map to null so downstream clients see 'source unknown' rather than an epoch-zero misread. ReadRawAsync builds ReadRawModifiedDetails with IsReadModified=false (raw, not modified-history), StartTime/EndTime, NumValuesPerNode=maxValuesPerNode, ReturnBounds=false (clients that want bounds request them via continuation handling). ReadProcessedAsync builds ReadProcessedDetails with ProcessingInterval in ms + AggregateType wrapping a single NodeId from MapAggregateToNodeId. MapAggregateToNodeId switches on HistoryAggregateType {Average, Minimum, Maximum, Total, Count} to the standard Part 13 ObjectIds.AggregateFunction_* NodeId -- future aggregate-type additions fail the switch with ArgumentOutOfRangeException so they can't silently slip through with a null NodeId and an opaque server-side BadAggregateNotSupported. ReadAtTimeAsync builds ReadAtTimeDetails with ReqTimes + UseSimpleBounds=true (returns boundary samples when an exact timestamp has no value -- the OPC UA Part 11 default). Malformed NodeId short-circuits to empty result without touching the wire, matching the ReadAsync / WriteAsync pattern. ReadEventsAsync stays at the interface-default NotSupportedException: the OPC UA call path (HistoryReadAsync with ReadEventDetails + EventFilter) needs an EventFilter SelectClauses spec which the current IHistoryProvider.ReadEventsAsync signature doesn't carry. Adding that would be an IHistoryProvider interface widening; out of scope for PR 76. Callers see BadHistoryOperationUnsupported on the OPC UA client which is the documented fallback. Name disambiguation: Core.Abstractions.HistoryReadResult and Opc.Ua.HistoryReadResult both exist; used fully-qualified Core.Abstractions.HistoryReadResult in return types + factory expressions. Shutdown unchanged -- history reads don't create persistent server-side resources, so no cleanup needed beyond the existing Session.CloseAsync. Unit tests (OpcUaClientHistoryTests, 7 facts): MapAggregateToNodeId theory covers all 5 aggregates; MapAggregateToNodeId_rejects_invalid_enum (defense against future enum addition silently passing through); Read{Raw,Processed,AtTime}Async_without_initialize_throws (RequireSession path); ReadEventsAsync_throws_NotSupportedException (locks in the intentional inheritance of the default). 78/78 OpcUaClient.Tests pass (67 prior + 11 new, -4 on the alarm suite moved into the events count). dotnet build clean. Final OPC UA Client capability surface: IDriver + ITagDiscovery + IReadable + IWritable + ISubscribable + IHostConnectivityProbe + IAlarmSource + IHistoryProvider -- 8 of 8 possible capabilities. Driver is feature-complete per driver-specs.md \u00A78. 2026-04-19 02:13:22 -04:00
63eb569fd6 Merge pull request 'Phase 3 PR 75 -- OPC UA Client IAlarmSource' (#74) from phase-3-pr75-opcua-client-alarms into v2 2026-04-19 02:11:10 -04:00
Joseph Doherty
fad04bbdf7 Phase 3 PR 75 -- OPC UA Client IAlarmSource (A&C event forwarding + Acknowledge). Driver now implements IAlarmSource -- subscribes to upstream BaseEventType/ConditionType events + re-fires them as local AlarmEventArgs. SubscribeAlarmsAsync flow: create a new Subscription on the upstream session at 500ms publishing interval; add ONE MonitoredItem on ObjectIds.Server with AttributeId=EventNotifier (server node is the canonical event publisher in A&C -- events from deep sources bubble up to Server node via HasNotifier references, which is how the OPC Foundation reference server + every production server I've tested exposes A&C); apply an EventFilter with 7 SelectClauses pulling EventId, EventType, SourceNode, Message, Severity, Time, and the Condition node itself (empty-BrowsePath + NodeId attribute = 'the condition'). Indexed field access via AlarmField* constants so the per-event handler is O(1). Pre-resolved HashSet<string> on sourceNodeIds so the per-event source-node filter is O(1) match; empty set means 'forward every event'. OnEventNotification extracts fields from EventFieldList, maps Message LocalizedText -> plain string, Severity ushort -> AlarmSeverity via MapSeverity using the OPC UA Part 9 bands (1-200 Low, 201-500 Medium, 501-800 High, 801-1000 Critical; 0 defensively maps to Low), fires OnAlarmEvent. Queue size 1000 + DiscardOldest=false so bursts (e.g. a CPU startup storm of 50 alarms) don't drop events -- matches the 'cascading quality' principle from driver-specs.md \u00A78 where the driver must not silently lose upstream state. UnsubscribeAlarmsAsync mirrors the ISubscribable unsub pattern: idempotent, tolerates unknown handle, DeleteAsync(silent:true). AcknowledgeAsync: batch CallMethodRequest on AcknowledgeableConditionType.Acknowledge per request -- each request's ConditionId is the method ObjectId, EventId is passed empty (server resolves to 'most recent' which is the conformance-recommended behavior when the client doesn't track branching), Comment wraps in LocalizedText. Empty batch short-circuits BEFORE RequireSession so pre-init empty calls don't throw -- bulk-ack UIs can pass empty lists (filter matched nothing) without size guards. Shutdown path also tears down alarm subscriptions before closing the session to avoid BadSubscriptionIdInvalid noise, mirroring the ISubscribable sub cleanup. Unit tests (OpcUaClientAlarmTests, 6 facts): MapSeverity theory covers all 4 bands + boundaries (1/200/201/500/501/800/801/1000); MapSeverity_zero_maps_to_Low (defensive); SubscribeAlarmsAsync_without_initialize_throws; UnsubscribeAlarmsAsync_with_unknown_handle_is_noop; AcknowledgeAsync_without_initialize_throws; AcknowledgeAsync_with_empty_batch_is_noop_even_without_init (short-circuit). Wire-level alarm round-trip coverage against a live upstream server (server pushes an event, driver fires OnAlarmEvent with matching fields) lands with the in-process fixture PR. 67/67 OpcUaClient.Tests pass (54 prior + 13 new -- 6 alarm + 7 attribute mapping carry-over). dotnet build clean. 2026-04-19 02:09:04 -04:00
17f901bb65 Merge pull request 'Phase 3 PR 74 -- OPC UA Client transparent reconnect via SessionReconnectHandler' (#73) from phase-3-pr74-opcua-client-session-reconnect into v2 2026-04-19 02:06:48 -04:00
Joseph Doherty
ba3a5598e1 Phase 3 PR 74 -- OPC UA Client transparent reconnect via SessionReconnectHandler. Before this PR a session keep-alive failure flipped HostState to Stopped and stayed there until operator intervention. PR 74 wires the SDK's SessionReconnectHandler so the driver automatically retries + swaps in a new session when the upstream server comes back. New _reconnectHandler field lazily instantiated inside OnKeepAlive on a bad status; subsequent bad keep-alives during the same outage no-op (null-check prevents stacked handlers). Constructor uses (telemetry:null, reconnectAbort:false, maxReconnectPeriod:2min) -- reconnectAbort=false so the handler keeps trying across many retry cycles; 2min cap prevents pathological back-off from starving operator visibility. BeginReconnect takes the current ISession + ReconnectPeriod (from OpcUaClientDriverOptions, default 5s per driver-specs.md \u00A78) + our OnReconnectComplete callback. OnReconnectComplete reads handler.Session for the new session, unwires keepalive from the dead session, rewires to the new session (without this the NEXT drop wouldn't trigger another reconnect -- subtle and critical), swaps Session, disposes the handler. The SDK's Session.TransferSubscriptionsOnReconnect default=true handles subscription migration internally so local MonitoredItem handles stay live across the reconnect; no driver-side manual transfer needed. Shutdown path now aborts any in-flight reconnect via _reconnectHandler.CancelReconnect() + Dispose BEFORE touching Session.CloseAsync -- without this the handler's retry loop holds a reference to the about-to-close session and fights the close, producing BadSessionIdInvalid noise in the upstream log and potential disposal-race exceptions. Cancel-first is the documented SDK pattern. Kept the driver's own HostState/OnHostStatusChanged flow: bad keep-alive -> Stopped transition + reconnect kicks off; OnReconnectComplete -> Running transition + Healthy status. Downstream consumers see the bounce as Stopped->Running without needing to know about the reconnect handler internals. Unit tests (OpcUaClientReconnectTests, 3 facts): Default_ReconnectPeriod_matches_driver_specs_5_seconds (sanity check on the options default), Options_ReconnectPeriod_is_configurable_for_aggressive_or_relaxed_retry (500ms override works), Driver_starts_with_no_reconnect_handler_active_pre_init (lazy instantiation -- indirectly via lifecycle). Wire-level disconnect-reconnect-resume coverage against a live upstream server is deferred to the in-process-fixture PR -- testing the reconnect path needs a server we can kill + revive mid-test, non-trivial to scaffold in xUnit. 54/54 OpcUaClient.Tests pass (51 prior + 3 reconnect). dotnet build clean. 2026-04-19 02:04:42 -04:00
8cd932e7c9 Merge pull request 'Phase 3 PR 73 -- OPC UA Client browse enrichment' (#72) from phase-3-pr73-opcua-client-browse-enrichment into v2 2026-04-19 02:02:39 -04:00
Joseph Doherty
28328def5d Phase 3 PR 73 -- OPC UA Client browse enrichment (DataType + AccessLevel + ValueRank + Historizing). Before this PR discovered variables always registered with DriverDataType.Int32 + SecurityClassification.ViewOnly + IsArray=false as conservative placeholders -- correct wire-format NodeId but useless downstream metadata. PR 73 adds a two-pass browse. Pass 1 unchanged shape but now collects (ParentFolder, BrowseName, DisplayName, NodeId) tuples into a pendingVariables list instead of registering each variable inline; folders still register inline. Pass 2 calls Session.ReadAsync once with (variableCount * 4) ReadValueId entries reading DataType + ValueRank + UserAccessLevel + Historizing for every variable. Server-side chunking via the SDK keeps the request shape within the server's per-request limits automatically. Attribute mapping: MapUpstreamDataType maps every standard DataTypeIds.* to a DriverDataType -- Boolean, SByte+Byte widened to Int16 (DriverDataType has no 8-bit, flagged in comment for future Core.Abstractions widening), Int16/32/64, UInt16/32/64, Float->Float32, Double->Float64, String, DateTime+UtcTime->DateTime. Unknown/vendor-custom NodeIds fall back to String -- safest passthrough for Variant-wrapped structs/enums/extension objects since the cascading-quality path preserves upstream StatusCode+timestamps regardless. MapAccessLevelToSecurityClass reads AccessLevels.CurrentWrite bit (0x02) -- when set, the variable is writable-for-this-user so it surfaces as Operate; otherwise ViewOnly. Uses UserAccessLevel not AccessLevel because UserAccessLevel is post-ACL-filter -- reflects what THIS session can actually do, not the server's default. IsArray derived from ValueRank (-1 = scalar, 0 = 1-D array, 1+ = multi-dim). IsHistorized reflects the server's Historizing flag directly so PR 76's IHistoryProvider routing can gate on it. Graceful degradation: (a) individual attribute failures (Bad StatusCode on DataType read) fall through to the type defaults, variable still registers; (b) wholesale enrichment-read failure (e.g. session dropped mid-browse) catches the exception, registers every pending variable with fallback defaults via RegisterFallback, browse completes. Either way the downstream address space is never empty when browse succeeded the first pass -- partial metadata is strictly better than missing variables. Unit tests (OpcUaClientAttributeMappingTests, 20 facts): MapUpstreamDataType theory covers 11 standard types including Boolean/Int16/UInt16/Int32/UInt32/Int64/UInt64/Float/Double/String/DateTime; separate facts for SByte+Byte (widened to Int16), UtcTime (DateTime), custom NodeId (String fallback); MapAccessLevelToSecurityClass theory covers 6 access-level bitmasks including CurrentRead-only (ViewOnly), CurrentWrite-only (Operate), read+write (Operate), HistoryRead-only (ViewOnly -- no Write bit). 51/51 OpcUaClient.Tests pass (31 prior + 20 new). dotnet build clean. Pending variables structured as a private readonly record struct so the ref-type allocation is stack-local for typical browse sizes. Paves the way for PR 74 SessionReconnectHandler (same enrichment path is re-runnable on reconnect) + PR 76 IHistoryProvider (gates on IsHistorized). 2026-04-19 02:00:31 -04:00
d3bf544abc Merge pull request 'Phase 3 PR 72 -- Multi-endpoint failover for OPC UA Client' (#71) from phase-3-pr72-opcua-client-failover into v2 2026-04-19 01:54:36 -04:00
Joseph Doherty
24435712c4 Phase 3 PR 72 -- Multi-endpoint failover for OPC UA Client driver. Adds OpcUaClientDriverOptions.EndpointUrls ordered list + PerEndpointConnectTimeout knob. On InitializeAsync the driver walks the candidate list in order via ResolveEndpointCandidates and returns the session from the first endpoint that successfully connects. Captures per-URL failure reasons in a List<string> and, if every candidate fails, throws AggregateException whose message names every URL + its failure class (e.g. 'opc.tcp://primary:4840 -> TimeoutException: ...'). That's critical diag for field debugging -- without it 'failover picked the wrong one' surfaces as a mystery. Single-URL backwards compat: EndpointUrl field retained as a one-URL shortcut. When EndpointUrls is null or empty the driver falls through to a single-candidate list of [EndpointUrl], so every existing single-endpoint config keeps working without migration. When both are provided, EndpointUrls wins + EndpointUrl is ignored -- documented on the field xml-doc. Per-endpoint connect budget: PerEndpointConnectTimeout (default 3s) caps each attempt so a sweep over several dead servers can't blow the overall init budget. Applied via CancellationTokenSource.CreateLinkedTokenSource + CancelAfter inside OpenSessionOnEndpointAsync (the extracted single-endpoint connect helper) so the cap is independent of the outer Options.Timeout which governs steady-state ops. BuildUserIdentity extracted out of InitializeAsync so the failover loop builds the UserIdentity ONCE and reuses it across every endpoint attempt -- generating it N times would re-unlock the user cert's private key N times, wasteful + keeps the password in memory longer. HostName now reflects the endpoint that actually connected via _connectedEndpointUrl instead of always returning opts.EndpointUrl -- so the Admin /hosts dashboard shows which of the configured endpoints is currently serving traffic (primary vs backup). Falls back to the first candidate pre-connect so the dashboard has a sensible identity before the first connect, and resets to null on ShutdownAsync. Use case: an OPC UA hot-standby server pair (primary 4840 + backup 4841) where either can serve the same address space. Operator configures EndpointUrls=[primary, backup]; driver tries primary first, falls over to backup on primary failure with a clean AggregateException describing both attempts if both are down. Unit tests (OpcUaClientFailoverTests, 5 facts): ResolveEndpointCandidates_prefers_EndpointUrls_when_provided (list trumps single), ResolveEndpointCandidates_falls_back_to_single_EndpointUrl_when_list_empty (legacy config compat), ResolveEndpointCandidates_empty_list_treated_as_fallback (explicit empty list also falls back -- otherwise we'd produce a zero-candidate sweep that throws with nothing tried), HostName_uses_first_candidate_before_connect (dashboard rendering pre-connect), Initialize_against_all_unreachable_endpoints_throws_AggregateException_listing_each (three loopback dead ports, asserts each URL appears in the aggregate message + driver flips to Faulted). 31/31 OpcUaClient.Tests pass. dotnet build clean. OPC UA Client driver security/auth/availability feature set now complete per driver-specs.md \u00A78: policy-filtered endpoint selection (PR 70), Anonymous+Username+Certificate auth (PR 71), multi-endpoint failover (this PR). 2026-04-19 01:52:31 -04:00
3f7b4d05e6 Merge pull request 'Phase 3 PR 71 -- OpcUaAuthType.Certificate user authentication' (#70) from phase-3-pr71-opcua-client-cert-auth into v2 2026-04-19 01:49:29 -04:00
Joseph Doherty
a79c5f3008 Phase 3 PR 71 -- OpcUaAuthType.Certificate user authentication. Implements the third user-token type in the OPC UA spec (Anonymous + UserName + Certificate). Before this PR the Certificate branch threw NotSupportedException. Adds OpcUaClientDriverOptions.UserCertificatePath + UserCertificatePassword knobs for the PFX on disk. The InitializeAsync user-identity switch now calls BuildCertificateIdentity for AuthType=Certificate. Load path uses X509CertificateLoader.LoadPkcs12FromFile -- the non-obsolete .NET 9+ API; the legacy X509Certificate2 PFX ctors are deprecated on net10. Validation up-front: empty UserCertificatePath throws InvalidOperationException naming the missing field; non-existent file throws FileNotFoundException with path; private-key-missing throws InvalidOperationException explaining the private key is required to sign the OPC UA user-token challenge at session activation. Each failure mode is an operator-actionable config problem rather than a mysterious ServiceResultException during session open. UserIdentity(X509Certificate2) ctor carries the cert directly; the SDK sets TokenType=Certificate + wires the cert's public key into the activate-session payload. Private key stays in-memory on the OpenSSL / .NET crypto boundary. Unit tests (OpcUaClientCertAuthTests, 3 facts): BuildCertificateIdentity_rejects_missing_path (error message mentions UserCertificatePath so the fix is obvious); BuildCertificateIdentity_rejects_nonexistent_file (FileNotFoundException); BuildCertificateIdentity_loads_a_valid_PFX_with_private_key -- generates a self-signed RSA-2048 cert on the fly with CertificateRequest.CreateSelfSigned, exports to temp PFX with a password, loads it through the helper, asserts TokenType=Certificate. Test cleans up the temp file in a finally block (best-effort; Windows file locking can leave orphans which is acceptable for %TEMP%). Self-signed cert-on-the-fly avoids shipping a static test PFX that could be flagged by secret-scanners and keeps the test hermetic across dev boxes. 26/26 OpcUaClient.Tests pass (23 prior + 3 cert auth). dotnet build clean. Feature: Anonymous + Username + Certificate all work -- driver-specs.md \u00A78 auth story complete. 2026-04-19 01:47:18 -04:00
a5299a2fee Merge pull request 'Phase 3 PR 70 -- Apply SecurityPolicy + expand to standard OPC UA policies' (#69) from phase-3-pr70-opcua-client-security-policy into v2 2026-04-19 01:46:13 -04:00
Joseph Doherty
a65215684c Phase 3 PR 70 -- Apply SecurityPolicy explicitly + expand to standard OPC UA policy list. Before this PR SecurityPolicy was a string field that got ignored -- the driver only passed useSecurity=SecurityMode!=None to SelectEndpointAsync, so an operator asking for Basic256Sha256 on a server that also advertised Basic128Rsa15 could silently end up on the weaker cipher (the SDK's SelectEndpoint returns whichever matching endpoint the server listed first). PR 70 makes policy matching explicit. SecurityPolicy is now an OpcUaSecurityPolicy enum covering the six standard policies documented in OPC UA 1.04: None, Basic128Rsa15 (deprecated, brownfield interop only), Basic256 (deprecated), Basic256Sha256 (recommended baseline), Aes128_Sha256_RsaOaep, Aes256_Sha256_RsaPss. Each maps through MapSecurityPolicy to the SecurityPolicies URI constant the SDK uses for endpoint matching. New SelectMatchingEndpointAsync replaces CoreClientUtils.SelectEndpointAsync. Flow: opens a DiscoveryClient via the non-obsolete DiscoveryClient.CreateAsync(ApplicationConfiguration, Uri, DiagnosticsMasks, ct) path, calls GetEndpointsAsync to enumerate every endpoint the server advertises, filters client-side by policy URI AND mode. When no endpoint matches, throws InvalidOperationException with the full list of what the server DID advertise formatted as 'Policy/Mode' pairs so the operator sees exactly what to fix in their config without a Wireshark trace. Fail-loud behaviour intentional -- a silent fall-through to weaker crypto is worse than a clear config error. MapSecurityPolicy is internal-visible to tests via InternalsVisibleTo from PR 66. Unit tests (OpcUaClientSecurityPolicyTests, 5 facts): MapSecurityPolicy_returns_known_non_empty_uri_for_every_enum_value theory covers all 6 policies; URI contains the enum name for non-None so operators can grep logs back to the config value; MapSecurityPolicy_None_matches_SDK_None_URI, MapSecurityPolicy_Basic256Sha256_matches_SDK_URI, MapSecurityPolicy_Aes256_Sha256_RsaPss_matches_SDK_URI all cross-check against the SDK's SecurityPolicies.* constants to catch a future enum-vs-URI drift; Every_enum_value_has_a_mapping walks Enum.GetValues to ensure adding a new case doesn't silently fall through the switch. Scaffold test updated to assert SecurityPolicy default = None (was previously unchecked). 23/23 OpcUaClient.Tests pass (13 prior + 5 scaffold + 5 new policy). dotnet build clean. Note on DiscoveryClient: the synchronous DiscoveryClient.Create(...) overloads are all [Obsolete] in SDK 1.5.378; must use DiscoveryClient.CreateAsync. GetEndpointsAsync(null, ct) returns EndpointDescriptionCollection directly (not a wrapper). 2026-04-19 01:44:07 -04:00
82f2dfcfa3 Merge pull request 'Phase 3 PR 69 -- OPC UA Client ISubscribable + IHostConnectivityProbe' (#68) from phase-3-pr69-opcua-client-subscribe-probe into v2 2026-04-19 01:24:21 -04:00
Joseph Doherty
0433d3a35e Phase 3 PR 69 -- OPC UA Client ISubscribable + IHostConnectivityProbe. Completes the OpcUaClientDriver capability surface — now matches the Galaxy + Modbus + S7 driver coverage. ISubscribable: SubscribeAsync creates a new upstream Subscription via the non-obsolete Subscription(ITelemetryContext, SubscriptionOptions) ctor + AddItem/CreateItemsAsync flow, wires each MonitoredItem's Notification event into OnDataChange. Tag strings round-trip through MonitoredItem.Handle so the notification handler can identify which tag changed without a second lookup. Publishing interval floored at 50ms (servers negotiate up anyway; sub-50ms wastes round-trip). SubscriptionOptions uses KeepAliveCount=10, LifetimeCount=1000, TimestampsToReturn=Both so SourceTimestamp passthrough for the cascading-quality rule works through subscription paths too. UnsubscribeAsync calls Subscription.DeleteAsync(silent:true) and tolerates unknown handles (returns cleanly) because the caller's race with server-side cleanup after a session drop shouldn't crash either side. Session shutdown explicitly deletes every remote subscription before closing — avoids BadSubscriptionIdInvalid noise in the upstream server's log on Close. IHostConnectivityProbe: HostName surfaced as the EndpointUrl (not host:port like the Modbus/S7 drivers) so the Admin /hosts dashboard can render the full opc.tcp:// URL as a clickable target back at the remote server. HostState tracked via session.KeepAlive event — OPC UA's built-in keep-alive is authoritative for session liveness (the SDK pings on KeepAliveInterval, sets KeepAliveStopped after N missed pings), strictly better than a driver-side polling probe: no extra wire round-trip, no duplicate semantic with the native protocol. Handler transitions Running on healthy keep-alives and Stopped on any Bad service-result. Initial Running raised at end of InitializeAsync once the session is up; Shutdown transitions back to Unknown + unwires the handler. Unit tests (OpcUaClientSubscribeAndProbeTests, 3 facts): SubscribeAsync_without_initialize_throws_InvalidOperationException, UnsubscribeAsync_with_unknown_handle_is_noop (session-drop-race safety), GetHostStatuses_returns_endpoint_url_row_pre_init (asserts EndpointUrl as the host identity -- the full opc.tcp://plc.example:4840 URL). Live-session subscribe/unsubscribe round-trip + keep-alive state transition coverage lands in a follow-up PR once we scaffold the in-process OPC UA server fixture. 13/13 OpcUaClient.Tests pass. dotnet build clean. All six capability interfaces (IDriver / ITagDiscovery / IReadable / IWritable / ISubscribable / IHostConnectivityProbe) implemented — OPC UA Client driver surface complete. 2026-04-19 01:22:14 -04:00
141673fc80 Merge pull request 'Phase 3 PR 68 -- OPC UA Client ITagDiscovery (Full browse)' (#67) from phase-3-pr68-opcua-client-discovery into v2 2026-04-19 01:19:27 -04:00
Joseph Doherty
db56a95819 Phase 3 PR 68 -- OPC UA Client ITagDiscovery via recursive browse (Full strategy). Adds ITagDiscovery to OpcUaClientDriver. DiscoverAsync opens a single Remote folder on the IAddressSpaceBuilder and recursively browses from the configured root (default: ObjectsFolder i=85; override via OpcUaClientDriverOptions.BrowseRoot for scoped discovery). Browse uses non-obsolete Session.BrowseAsync(RequestHeader, ViewDescription, uint maxReferences, BrowseDescriptionCollection, ct) with HierarchicalReferences forward, subtypes included, NodeClassMask Object+Variable, ResultMask pulling BrowseName + DisplayName + NodeClass + TypeDefinition. Objects become sub-folders via builder.Folder; Variables become builder.Variable entries with FullName set to the NodeId.ToString() serialization so IReadable/IWritable can round-trip without re-resolving. Three safety caps added to OpcUaClientDriverOptions to bound runaway discovery: (1) MaxBrowseDepth default 10 -- deep enough for realistic OPC UA information models, shallow enough that cyclic graphs can't spin the browse forever. (2) MaxDiscoveredNodes default 10_000 -- caps memory on pathological remote servers. Once the cap is hit, recursion short-circuits and the partially-discovered tree is still projected into the local address space (graceful degradation rather than all-or-nothing). (3) BrowseRoot as an opt-in scope restriction string per driver-specs.md \u00A78 -- defaults to ObjectsFolder but operators with 100k-node servers can point it at a single subtree. Visited-set tracks NodeIds already visited to prevent infinite cycles on graphs with non-strict hierarchy (OPC UA models can have back-references). Transient browse failures on a subtree are swallowed -- the sub-branch stops but the rest of discovery continues, matching the Modbus driver's 'transient poll errors don't kill the loop' pattern. The driver's health surface reflects the network-level cascade via the probe loop (PR 69). Deferred to a follow-up PR: DataType resolution via a batch Session.ReadAsync(Attributes.DataType) after the browse so DriverAttributeInfo.DriverDataType is accurate instead of the current conservative DriverDataType.Int32 default; AccessLevel-derived SecurityClass instead of the current ViewOnly default; array-type detection via Attributes.ValueRank + ArrayDimensions. These need an extra wire round-trip per batch of variables + a NodeId -> DriverDataType mapping table; out of scope for PR 68 to keep browse path landable. Unit tests (OpcUaClientDiscoveryTests, 3 facts): DiscoverAsync_without_initialize_throws_InvalidOperationException (pre-init hits RequireSession); DiscoverAsync_rejects_null_builder (ArgumentNullException); Discovery_caps_are_sensible_defaults (asserts 10000 / 10 / null defaults documented above). NullAddressSpaceBuilder stub implements the full IAddressSpaceBuilder shape including IVariableHandle.MarkAsAlarmCondition (throws NotSupportedException since this PR doesn't wire alarms). Live-browse coverage against a real remote server is deferred to the in-process-server-fixture PR. 10/10 OpcUaClient.Tests pass. dotnet build clean. 2026-04-19 01:17:21 -04:00
89bd726fa8 Merge pull request 'Phase 3 PR 67 -- OPC UA Client IReadable + IWritable' (#66) from phase-3-pr67-opcua-client-read-write into v2 2026-04-19 01:15:42 -04:00
Joseph Doherty
238748bc98 Phase 3 PR 67 -- OPC UA Client IReadable + IWritable via Session.ReadAsync/WriteAsync. Adds IReadable + IWritable capabilities to OpcUaClientDriver, routing reads/writes through the session's non-obsolete ReadAsync(RequestHeader, maxAge, TimestampsToReturn, ReadValueIdCollection, ct) and WriteAsync(RequestHeader, WriteValueCollection, ct) overloads (the sync and BeginXxx/EndXxx patterns are all [Obsolete] in SDK 1.5.378). Serializes on the shared Gate from PR 66 so reads + writes + future subscribe + probe don't race on the single session. NodeId parsing: fullReferences use OPC UA's standard serialized NodeId form -- ns=2;s=Demo.Counter, i=2253, ns=4;g=... for GUID, ns=3;b=... for opaque. TryParseNodeId calls NodeId.Parse with the session's MessageContext which honours the server-negotiated namespace URI table. Malformed input surfaces as BadNodeIdInvalid (0x80330000) WITHOUT a wire round-trip -- saves a request for a fault the driver can detect locally. Cascading-quality implementation per driver-specs.md \u00A78: upstream StatusCode, SourceTimestamp, and ServerTimestamp pass through VERBATIM. Bad codes from the remote server stay as the same Bad code (not translated to generic BadInternalError) so downstream clients can distinguish 'upstream value unavailable' from 'local driver bug'. SourceTimestamp is preserved verbatim (null on MinValue guard) so staleness is visible; ServerTimestamp falls back to DateTime.UtcNow if the upstream omitted it, never overwriting a non-zero value. Wire-level exceptions in the Read batch -- transport / timeout / session-dropped -- fan out BadCommunicationError (0x80050000) across every tag in the batch, not BadInternalError, so operators distinguish network reachability from driver faults. Write-side same pattern: successful WriteAsync maps each upstream StatusCode.Code verbatim into the local WriteResult.StatusCode; transport-layer failure fans out BadCommunicationError across the whole batch. WriteValue carries AttributeId=Value + DataValue wrapping Variant(writeValue) -- the SDK handles the type-to-Variant mapping for common CLR types (bool, int, float, string, etc.) so the driver doesn't need a per-type switch. Name disambiguation: the SDK has its own Opc.Ua.WriteRequest type which collides with ZB.MOM.WW.OtOpcUa.Core.Abstractions.WriteRequest; method signature uses the fully-qualified Core.Abstractions.WriteRequest. Unit tests (OpcUaClientReadWriteTests, 2 facts): ReadAsync_without_initialize_throws_InvalidOperationException + WriteAsync_without_initialize_throws_InvalidOperationException -- pre-init calls hit RequireSession and fail uniformly. Wire-level round-trip coverage against a live remote server lands in a follow-up PR once we scaffold an in-process OPC UA server fixture (the existing Server project in the solution is a candidate host). 7/7 OpcUaClient.Tests pass (5 scaffold + 2 read/write). dotnet build clean. Scope: ITagDiscovery (browse) + ISubscribable + IHostConnectivityProbe remain deferred to PRs 68-69 which also need namespace-index remapping and reference-counted MonitoredItem forwarding per driver-specs.md \u00A78. 2026-04-19 01:13:34 -04:00
b21d550836 Merge pull request 'Phase 3 PR 66 -- OPC UA Client (gateway) driver scaffold' (#65) from phase-3-pr66-opcua-client-scaffold into v2 2026-04-19 01:10:07 -04:00
Joseph Doherty
91eaf534c8 Phase 3 PR 66 -- OPC UA Client (gateway) driver project scaffold + IDriver session lifecycle. First driver that CONSUMES OPC UA rather than PUBLISHES it -- connects to a remote server and re-exposes its address space through the local OtOpcUa server per driver-specs.md \u00A78. Uses the same OPCFoundation.NetStandard.Opc.Ua.Client package the existing Client.Shared ships (bumped to 1.5.378.106 to match). Builds its own ApplicationConfiguration (cert stores under %LocalAppData%/OtOpcUa/pki so multiple driver instances in one OtOpcUa server process share a trust anchor) rather than reusing Client.Shared -- Client.Shared is oriented at the interactive CLI with different session-lifetime needs (this driver is always-on, needs keep-alive + session transfer on reconnect + multi-year uptime). Navigated the post-refactor 1.5.378 SDK surface: every Session.Create* static is now [Obsolete] in favour of DefaultSessionFactory; CoreClientUtils.SelectEndpoint got the sync overloads deprecated in favour of SelectEndpointAsync with a required ITelemetryContext parameter. Driver passes telemetry: null! to both SelectEndpointAsync + new DefaultSessionFactory(telemetry: null!) -- the SDK's internal default sink handles null gracefully and plumbing a telemetry context through the driver options surface is out of scope (the driver emits its own logs via the DriverHealth surface anyway). ApplicationInstance default ctor is also obsolete; wrapped in #pragma warning disable CS0618 rather than migrate to the ITelemetryContext overload for the same reason. OpcUaClientDriverOptions models driver-specs.md \u00A78 settings: EndpointUrl (default opc.tcp://localhost:4840 IANA-assigned port), SecurityPolicy/SecurityMode/AuthType enums, Username/Password, SessionTimeout=120s + KeepAliveInterval=5s + ReconnectPeriod=5s (defaults from spec), AutoAcceptCertificates=false (production default; dev turns on for self-signed servers), ApplicationUri + SessionName knobs for certificate SAN matching and remote-server session-list identification. OpcUaClientDriver : IDriver: InitializeAsync builds the ApplicationConfiguration, resolves + creates cert if missing via app.CheckApplicationInstanceCertificatesAsync, selects endpoint via CoreClientUtils.SelectEndpointAsync, builds UserIdentity (Anonymous or Username with UTF-8-encoded password bytes -- the legacy string-password ctor went away; Certificate auth deferred), creates session via DefaultSessionFactory.CreateAsync. Health transitions Unknown -> Initializing -> Healthy on success or -> Faulted on failure with best-effort Session.CloseAsync cleanup. ShutdownAsync (async now, not Task.CompletedTask) closes the session + disposes. Internal Session + Gate expose to the test project via InternalsVisibleTo so PRs 67-69 can stack read/write/discovery/subscribe on the same serialization. Scaffold tests (OpcUaClientDriverScaffoldTests, 5 facts): Default_options_target_standard_opcua_port_and_anonymous_auth (4840 + None mode + Anonymous + AutoAccept=false production default), Default_timeouts_match_driver_specs_section_8 (120s/5s/5s), Driver_reports_type_and_id_before_connect (DriverType=OpcUaClient, DriverInstanceId round-trip, pre-init Unknown health), Initialize_against_unreachable_endpoint_transitions_to_Faulted_and_throws, Reinitialize_against_unreachable_endpoint_re_throws. Uses opc.tcp://127.0.0.1:1 as the 'guaranteed-unreachable' target -- RFC 5737 reserved IPs get black-holed and time out only after the SDK's internal retry/backoff fully elapses (~60s), while port 1 on loopback refuses immediately with TCP RST which keeps the test suite snappy (5 tests / 8s). 5/5 pass. dotnet build clean. Scope boundary: ITagDiscovery / IReadable / IWritable / ISubscribable / IHostConnectivityProbe deliberately NOT in this PR -- they need browse + namespace remapping + reference-counted MonitoredItem forwarding + keep-alive probing and land in PRs 67-69. 2026-04-19 01:07:57 -04:00
d33e38e059 Merge pull request 'Phase 3 PR 65 -- S7 ITagDiscovery + ISubscribable + IHostConnectivityProbe' (#64) from phase-3-pr65-s7-discovery-subscribe-probe into v2 2026-04-19 00:18:17 -04:00
Joseph Doherty
d8ef35d5bd Phase 3 PR 65 -- S7 ITagDiscovery + ISubscribable polling overlay + IHostConnectivityProbe. Three more capability interfaces on S7Driver, matching the Modbus driver's capability coverage. ITagDiscovery: DiscoverAsync streams every configured tag into IAddressSpaceBuilder under a single 'S7' folder; builder.Variable gets a DriverAttributeInfo carrying DriverDataType (MapDataType: Bool->Boolean, Byte/Int/UInt sizes->Int32 (until Core.Abstractions adds widths), Float32/Float64 direct, String + DateTime direct), SecurityClass (Operate if tag.Writable else ViewOnly -- matches the Modbus pattern so DriverNodeManager's ACL layer can gate writes per role without S7-specific logic), IsHistorized=false (S7 has no native historian surface), IsAlarm=false (S7 alarms land through TIA Portal's alarm-in-DB pattern which is per-site and out of scope for PR 65). ISubscribable polling overlay: same pattern Modbus established in PR 22. SubscribeAsync spawns a Task.Run loop that polls every tag, diffs against LastValues, raises OnDataChange on changes plus a force-raise on initial-data push per OPC UA Part 4 convention. Interval floored at 100ms -- S7 CPUs scan 2-10ms but process the comms mailbox at most once per scan, so sub-scan polling just queues wire-side with worse latency per S7netplus documented pattern. Poll errors tolerated: first-read fault doesn't kill the loop (caller can't receive initial values but subsequent polls try again); transient poll errors also swallowed so the loop survives a power-cycle + reconnect through the health surface. UnsubscribeAsync cancels the CTS + removes the subscription -- unknown handle is a no-op, not a throw, because the caller's race with server-side cleanup shouldn't crash either side. Shutdown tears down every subscription before disposing the Plc. IHostConnectivityProbe: HostName surfaced as host:port to match Modbus driver convention (Admin /hosts dashboard renders both families uniformly). GetHostStatuses returns one row (single-endpoint driver). ProbeLoopAsync serializes on the shared Gate + calls Plc.ReadStatusAsync (cheap Get-CPU-Status PDU that doubles as an 'is PLC up' check) every Probe.Interval with a Probe.Timeout cap, transitions HostState Unknown/Stopped -> Running on success and -> Stopped on any failure, raises OnHostStatusChanged only on actual transitions (no noise for steady-state probes). Probe loop starts at end of InitializeAsync when Probe.Enabled=true (default); Shutdown cancels the probe CTS. Initial state stays Unknown until first successful probe -- avoids broadcasting a premature Running before any PDU round-trip has happened. Unit tests (S7DiscoveryAndSubscribeTests, 4 facts): DiscoverAsync_projects_every_tag_into_the_address_space (3 tags + mixed writable/read-only -> Operate vs ViewOnly asserted), GetHostStatuses_returns_one_row_with_host_port_identity_pre_init, SubscribeAsync_returns_unique_handles_and_UnsubscribeAsync_accepts_them (diagnosticId uniqueness + idempotent double-unsubscribe), Subscribe_publishing_interval_is_floored_at_100ms (accepts 50ms request without throwing -- floor is applied internally). Uses a RecordingAddressSpaceBuilder stub that implements IVariableHandle.FullReference + MarkAsAlarmCondition (throws NotImplementedException since the S7 driver never calls it -- alarms out of scope). 57/57 S7 unit tests pass. dotnet build clean. All 5 capability interfaces (IDriver/ITagDiscovery/IReadable/IWritable/ISubscribable/IHostConnectivityProbe) now implemented -- the S7 driver surface is on par with the Modbus driver, minus the extended data types (Int64/UInt64/Float64/String/DateTime deferred per PR 64). 2026-04-19 00:16:10 -04:00
5e318a1ab6 Merge pull request 'Phase 3 PR 64 -- S7 IReadable + IWritable via S7.Net' (#63) from phase-3-pr64-s7-read-write into v2 2026-04-19 00:12:59 -04:00
0eab1271be Merge pull request 'Phase 3 PR 63 -- S7AddressParser (DB/M/I/Q/T/C grammar)' (#62) from phase-3-pr63-s7-address-parser into v2 2026-04-19 00:08:27 -04:00
27 changed files with 3558 additions and 1 deletions

View File

@@ -10,6 +10,7 @@
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.Modbus/ZB.MOM.WW.OtOpcUa.Driver.Modbus.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.S7/ZB.MOM.WW.OtOpcUa.Driver.S7.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.Shared/ZB.MOM.WW.OtOpcUa.Client.Shared.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.CLI/ZB.MOM.WW.OtOpcUa.Client.CLI.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.UI/ZB.MOM.WW.OtOpcUa.Client.UI.csproj"/>
@@ -28,6 +29,7 @@
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.UI.Tests/ZB.MOM.WW.OtOpcUa.Client.UI.Tests.csproj"/>

View File

@@ -0,0 +1,149 @@
# Phase 6.1 — Resilience & Observability Runtime
> **Status**: DRAFT — implementation plan for a cross-cutting phase that was never formalised. The v2 `plan.md` specifies Polly, Tier A/B/C protections, structured logging, and local-cache fallback by decision; none are wired end-to-end.
>
> **Branch**: `v2/phase-6-1-resilience-observability`
> **Estimated duration**: 3 weeks
> **Predecessor**: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
> **Successor**: Phase 6.2 (Authorization runtime)
## Phase Objective
Land the cross-cutting runtime protections + operability features that `plan.md` + `driver-stability.md` specify by decision but that no driver-phase actually wires. End-state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central DB outage via a LiteDB local cache.
Closes these gaps flagged in the 2026-04-19 audit:
1. Polly v8 resilience pipelines wired to every `IDriver` capability (no-op per-driver today; Galaxy has a hand-rolled `CircuitBreaker` only).
2. Tier A/B/C enforcement at runtime — `driver-stability.md` §24 and decisions #6373 define memory watchdog, bounded queues, scheduled recycle, wedge detection; `MemoryWatchdog` exists only inside `Driver.Galaxy.Host`.
3. Health endpoints (`/healthz`, `/readyz`) on `OtOpcUa.Server`.
4. Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
5. LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`). **Pipeline key = `(DriverInstanceId, HostName)`** so one dead PLC behind a multi-device driver doesn't open the breaker for healthy siblings (decision #35 per-device isolation). **Per-capability policy** — Read / HistoryRead / Discover / Probe / Alarm get retries; **Write does NOT** unless `[WriteIdempotent]` on the tag definition (decisions #44-45). |
| Every capability-interface consumer in the server | Wrap `IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync/UnsubscribeAsync`, `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync`. Composition: timeout → (retry when capability supports) → circuit breaker → bulkhead. |
| `Core.Abstractions` → new `WriteIdempotentAttribute` | Marker on `ModbusTagDefinition` / `S7TagDefinition` / `OpcUaClientDriver` tag rows; opts that tag into auto-retry on Write. Absence = no retry, per spec. |
| `Core` → new `Core.Stability` sub-namespace — **split** | Two separate subsystems: (a) **`MemoryTracking`** runs all tiers; captures baseline (median of first 5 min `GetMemoryFootprint` samples) + applies the hybrid rule `soft = max(multiplier × baseline, baseline + floor)`; soft breach logs + surfaces to Admin; never kills. (b) **`MemoryRecycle`** (Tier C only — requires out-of-process topology) handles hard-breach recycle via the Proxy-side supervisor. Tier A/B overrun escalates to Tier C promotion ticket, not auto-kill. |
| `ScheduledRecycleScheduler` | Tier C only per decisions #73-74. Weekly/time-of-day recycle via Proxy supervisor. Tier A/B opt-in recycle lands in a future phase together with a Tier-C-escalation workflow. |
| `WedgeDetector` | **Demand-aware**: flips a driver to Faulted only when `(hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` derives from non-zero Polly bulkhead depth OR ≥1 active MonitoredItem OR ≥1 queued historian read. Idle + subscription-only drivers stay Healthy. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must advertise their out-of-process topology; the registry enforces invariants (Tier C has a `Proxy` + `Host` pair). |
| `Driver.Galaxy.Proxy/Supervisor/` | **Retains** existing `CircuitBreaker` + `Backoff` — they guard IPC respawn (decision #68), different concern from the per-call Polly layer. Only `HeartbeatMonitor` is referenced downstream (IPC liveness). |
| `OtOpcUa.Server` → Minimal API endpoints on `http://+:4841` | `/healthz` = process alive + (config DB reachable OR `UsingStaleConfig=true`). `/readyz` = ANDed driver health; state-machine per `DriverState`: `Unknown`/`Initializing` → 503, `Healthy` → 200, `Degraded` → 200 + `{degradedDrivers: [...]}` in body, `Faulted` → 503. JSON body always reports per-instance detail. |
| Serilog configuration | Centralize enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every capability call runs inside a `LogContext.PushProperty` scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON sink added alongside plain-text (switchable via `Serilog:WriteJson` appsetting). |
| `Configuration` project | Add `LiteDbConfigCache` adapter. **Generation-sealed snapshots**: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file. Reads serve the last-known-sealed generation; mixed-generation reads are impossible. Write path bypasses cache + fails hard on DB outage. Pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-sealed-snapshot. |
| `DriverHostStatus` vs. `DriverInstanceResilienceStatus` | New separate entity `DriverInstanceResilienceStatus { DriverInstanceId, HostName, LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc, BaselineFootprintBytes }`. `DriverHostStatus` keeps per-host connectivity only; Admin `/hosts` joins both for display. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers. |
| Config DB schema | LiteDB cache is a read-only mirror; no new central tables except `DriverHostStatus` column additions. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 *generalises* the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy `SafeMxAccessHandle` + Modbus `TcpClient` disposal stay — they're driver-internal, not part of the cross-cutting layer. |
## Entry Gate Checklist
- [ ] Phases 05 exit gates cleared (or explicitly deferred with task reference)
- [ ] `driver-stability.md` §24 re-read; decisions #6373 + #3436 re-skimmed
- [ ] Polly v8 NuGet available (`Microsoft.Extensions.Resilience` + `Polly.Core`) — verify package restore before task breakdown
- [ ] LiteDB 5.x NuGet confirmed MIT + actively maintained
- [ ] Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm test counts baseline so the resilience layer doesn't regress any
- [ ] Serilog configuration inventory: locate every `Log.ForContext` call site that will need `LogContext` rewrap
- [ ] Admin `/hosts` page's current `DriverHostStatus` consumption reviewed so the schema extensions don't break it
## Task Breakdown
### Stream A — Resilience layer (1 week)
1. **A.1** Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build `DriverResiliencePipelineBuilder` — key on `(DriverInstanceId, HostName)`; composes Timeout → (Retry when the capability allows it; skipped for Write unless `[WriteIdempotent]`) → CircuitBreaker → Bulkhead. Per-capability policy map documented in `DriverResilienceOptions.CapabilityPolicies`.
2. **A.2** `DriverResilienceOptions` record bound from `DriverInstance.ResilienceConfig` JSON column (new nullable). **Per-tier × per-capability** defaults: Tier A (OpcUaClient, S7) Read 3 retries/2 s/5-failure-breaker, Write 0 retries/2 s/5-failure-breaker; Tier B (Modbus) Read 3/4 s/5, Write 0/4 s/5; Tier C (Galaxy) Read 1 retry/10 s/no-kill, Write 0/10 s/no-kill. Idempotent writes can opt into Read-shaped retry via the attribute.
3. **A.3** `CapabilityInvoker<TCapability, TResult>` wraps every method on the capability interfaces (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync/UnsubscribeAsync`, `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync`). Existing server-side dispatch routes through it.
4. **A.4** **Retain** `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard IPC process respawn (decision #68), orthogonal to the per-call Polly layer. Only `HeartbeatMonitor` is consumed outside the supervisor.
5. **A.5** Unit tests: per-policy, per-composition. Negative integration tests: (a) Modbus FlakeyTransport fails 5× on Read, succeeds 6th — invoker surfaces success; (b) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=false` — invoker surfaces failure without retry (no duplicate pulse); (c) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=true` — invoker retries. Bench: no-op overhead < 1%.
6. **A.6** `WriteIdempotentAttribute` in `Core.Abstractions`. Modbus/S7/OpcUaClient tag-definition records pick it up; invoker reads via reflection once at driver init.
### Stream B — Tier A/B/C stability runtime — split into MemoryTracking + MemoryRecycle (1 week)
1. **B.1** `Core.Abstractions``DriverTier` enum {A, B, C}. Extend `DriverTypeRegistry` to require `DriverTier` at registration. Existing driver types stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
2. **B.2** **`MemoryTracking`** (all tiers) lifted from `Driver.Galaxy.Host/MemoryWatchdog.cs`. Captures `BaselineFootprintBytes` as the median of first 5 min of `IDriver.GetMemoryFootprint()` samples post-`InitializeAsync`. Applies **decision #70 hybrid formula**: `soft = max(multiplier × baseline, baseline + floor)`; Tier A multiplier=3, floor=50 MB; Tier B multiplier=3, floor=100 MB; Tier C multiplier=2, floor=500 MB. Soft breach → log + `DriverInstanceResilienceStatus.CurrentFootprint` tick; never kills. Hard = 2 × soft.
3. **B.3** **`MemoryRecycle`** (Tier C only per decisions #73-74). Hard-breach on a Tier C driver triggers `ScheduledRecycleScheduler.RequestRecycleNow(driverInstanceId)`; scheduler proxies to `Driver.Galaxy.Proxy/Supervisor/` which restarts the Host process. Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; **never auto-kills** the in-process driver.
4. **B.4** **`ScheduledRecycleScheduler`** per decision #67: Tier C driver instances opt-in to a weekly recycle at a configured cron. Tier A/B scheduled recycle deferred to a later phase paired with Tier-C escalation.
5. **B.5** **`WedgeDetector`** demand-aware: `if (state==Healthy && hasPendingWork && noProgressIn > WedgeThreshold) → force ReinitializeAsync`. `hasPendingWork` = (bulkhead depth > 0) OR (active monitored items > 0) OR (queued historian-read count > 0). `WedgeThreshold` default 5 × PublishingInterval, min 60 s. Idle driver stays Healthy.
6. **B.6** Tests: tracking unit tests drive synthetic allocation against a fake `GetMemoryFootprint`; recycle tests use a mock supervisor; wedge tests include the false-fault cases — idle subscriber, slow historian backfill, write-only burst.
### Stream C — Health endpoints + structured logging (4 days)
1. **C.1** `OtOpcUa.Server/Observability/HealthEndpoints.cs` — Minimal API on a second Kestrel binding (default `http://+:4841`). `/healthz` reports process uptime + config-DB reachability (or cache-warm). `/readyz` enumerates `DriverInstance` rows + reports each driver's `DriverHealth.State`; returns 503 if ANY driver is Faulted. JSON body per `docs/v2/acl-design.md` §"Operator Dashboards" shape.
2. **C.2** `LogContextEnricher` installed at Serilog config time. Every driver-capability call site wraps its body in `using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId))`. Correlation IDs: reuse OPC UA `RequestHeader.RequestHandle` when in-flight; otherwise generate `Guid.NewGuid().ToString("N")[..12]`.
3. **C.3** Add JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via `Serilog:WriteJson` appsetting.
4. **C.4** Integration test: boot server, issue Modbus read, assert log line contains `DriverInstanceId` + `CorrelationId` structured fields.
### Stream D — Config DB LiteDB fallback — generation-sealed snapshots (1 week)
1. **D.1** `LiteDbConfigCache` adapter backed by **sealed generation snapshots**: each successful `sp_PublishGeneration` writes `<cache-root>/<clusterId>/<generationId>.db` as read-only after commit. The adapter maintains a `CurrentSealedGenerationId` pointer updated atomically on successful publish. Mixed-generation reads are **impossible** — every read served from the cache serves one coherent sealed generation.
2. **D.2** Write-path queries (draft save, publish) bypass the cache entirely and fail hard on DB outage. Read-path queries (DriverInstance enumeration, LdapGroupRoleMapping, cluster + namespace metadata) go through the pipeline: timeout 2 s → retry 3× jittered → fallback to the current sealed snapshot.
3. **D.3** `UsingStaleConfig` flag flips true when a read fell back to the sealed snapshot; cleared on the next successful DB round-trip. Surfaced on `/healthz` body and Admin `/hosts`.
4. **D.4** Tests: (a) SQL-container kill mid-operation — read returns sealed snapshot, `UsingStaleConfig=true`, driver stays Healthy; (b) mixed-generation guard — attempt to serve partial generation by corrupting a snapshot file mid-read → adapter fails closed rather than serving mixed data; (c) first-boot-no-snapshot case — adapter refuses to start, driver fails `InitializeAsync` with a clear config-DB-required error.
### Stream E — Admin `/hosts` page refresh (3 days)
1. **E.1** Extend `DriverHostStatus` schema with Stream A resilience columns. Generate EF migration.
2. **E.2** `Admin/FleetStatusHub` SignalR hub pushes `LastCircuitBreakerOpenUtc` + `CurrentBulkheadDepth` + `LastRecycleUtc` on change.
3. **E.3** `/hosts` Blazor page renders new columns; red badge if `ConsecutiveFailures > breakerThreshold / 2`.
## Compliance Checks (run at exit gate)
- [ ] **Invoker coverage**: every method on `IReadable` / `IWritable` / `ITagDiscovery` / `ISubscribable` / `IHostConnectivityProbe` / `IAlarmSource` / `IHistoryProvider` in the server dispatch layer routes through `CapabilityInvoker`. Enforce via a Roslyn analyzer (error-level; warning-first is rejected — the compliance check is the gate).
- [ ] **Write-retry guard**: writes without `[WriteIdempotent]` never get retried. Unit-test the invoker path asserts zero retry attempts.
- [ ] **Pipeline isolation**: pipeline key is `(DriverInstanceId, HostName)`. Integration test with two Modbus hosts under one instance — failing host A does not open the breaker for host B.
- [ ] **Tier registry**: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. Unit test walks the registry + asserts no gaps. Tier C registrations must declare their out-of-process topology.
- [ ] **MemoryTracking never kills**: soft/hard breach tests on a Tier A/B driver log + surface without terminating the process.
- [ ] **MemoryRecycle Tier C only**: hard breach on a Tier A driver never invokes the supervisor; on Tier C it does.
- [ ] **Wedge demand-aware**: test suite includes idle-subscription-only, slow-historian-backfill, and write-only-burst cases — driver stays Healthy.
- [ ] **Galaxy supervisor preserved**: `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` still present + still invoked on Host crash.
- [ ] **Health state machine**: `/healthz` + `/readyz` respond within 500 ms for every `DriverState`; state-machine table in this doc drives the test matrix.
- [ ] **Structured log**: CI grep asserts at least one log line per capability call has `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- [ ] **Generation-sealed cache**: integration tests cover (a) SQL-kill mid-operation serves last-sealed snapshot; (b) mixed-generation corruption fails closed; (c) first-boot no-snapshot + DB-down → `InitializeAsync` fails with clear error.
- [ ] No regression in existing test suites — `dotnet test ZB.MOM.WW.OtOpcUa.slnx` count equal-or-greater than pre-Phase-6.1 baseline.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Polly pipeline adds per-request latency on hot path | Medium | Medium | Benchmark Stream A.5 before merging; 1 % overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; `UsingStaleConfig` flag surfaced on `/readyz`; cache refresh on every successful DB round-trip; 24-hour synthetic warning |
| Tier watchdog false-positive-kills a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle; thresholds configurable per-instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if driver claims `Healthy`; add circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Release analyzer as warning-level initially; upgrade to error in Phase 6.1+1 after one release cycle |
## Completion Checklist
- [ ] Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
- [ ] Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
- [ ] Stream C: `/healthz` + `/readyz` + structured logging + JSON Serilog sink
- [ ] Stream D: LiteDB cache + Polly fallback in Configuration
- [ ] Stream E: Admin `/hosts` page refresh
- [ ] Cross-cutting: `phase-6-1-compliance.ps1` exits 0; full solution `dotnet test` passes; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread `019da489-e317-7aa1-ab1f-6335e0be2447`)
Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.
1. **Crit · ACCEPT** — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via `WriteIdempotent` + CAS). Pipeline now **capability-specific**: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; **Write does not** unless the tag metadata carries `WriteIdempotent=true`. New `WriteIdempotentAttribute` surfaces on `ModbusTagDefinition` / `S7TagDefinition` / etc.
2. **Crit · ACCEPT** — "One pipeline per driver instance" breaks decision #35's per-device isolation. **Change**: pipeline key is `(DriverInstanceId, HostName)` not just `DriverInstanceId`. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
3. **Crit · ACCEPT** — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). **Change**: Stream B splits into two — `MemoryTracking` (all tiers, soft/hard thresholds log + surface to Admin `/hosts`; never kills) and `MemoryRecycle` (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
4. **High · ACCEPT** — Removing Galaxy's hand-rolled `CircuitBreaker` drops decision #68 host-supervision crash-loop protection. **Change**: keep `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard the IPC *process* re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
5. **High · ACCEPT** — Roslyn analyzer targeting `IDriver` misses the hot paths (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ISubscribable.SubscribeAsync` etc.). **Change**: analyzer rule now matches every method on the capability interfaces; compliance doc enumerates the full call-site list.
6. **High · ACCEPT**`/healthz` + `/readyz` under-specified for degraded-running. **Change**: add a state-matrix sub-section explicitly covering `Unknown` (pre-init: `/readyz` 503), `Initializing` (503), `Healthy` (200), `Degraded` (200 with JSON body flagging the degraded driver; `/readyz` is OR across drivers), `Faulted` (503), plus cached-config-serving (`/healthz` returns 200 + `UsingStaleConfig: true` in JSON body).
7. **High · ACCEPT**`WedgeDetector` based on "no successful Read" false-fires on write-only subscriptions + idle systems. **Change**: wedge criteria now `(hasPendingWork AND noProgressIn > threshold)` where `hasPendingWork` comes from the Polly bulkhead depth + active MonitoredItem count. Idle driver stays Healthy.
8. **High · ACCEPT** — LiteDB cache serving mixed-generation reads breaks publish atomicity. **Change**: cache is snapshot-per-generation. Each published generation writes a sealed snapshot into `<cache-root>/<cluster>/<generationId>.db`; reads serve the last-known-sealed generation and never mix. Central DB outage during a *publish* means that publish fails (write path doesn't use cache); reads continue from the prior sealed snapshot.
9. **Med · ACCEPT**`DriverHostStatus` schema conflates per-host connectivity with per-driver-instance resilience counters. **Change**: new `DriverInstanceResilienceStatus` table separate from `DriverHostStatus`. Admin `/hosts` joins both for display.
10. **Med · ACCEPT** — Compliance says analyzer-error; risks say analyzer-warning. **Change**: phase 6.1 ships at **error** level (this phase is the gate); warning-mode option removed.
11. **Med · ACCEPT** — Hardcoded per-tier MB bands ignore decision #70's `max(multiplier × baseline, baseline + floor)` formula with observed-baseline capture. **Change**: watchdog captures baseline at post-init plateau (median of first 5 min GetMemoryFootprint samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
12. **Med · ACCEPT** — Tests mostly cover happy path. **Change**: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historic-backfill; Stream D.4 adds mixed-generation cache test + corrupt-first-boot cache test.

View File

@@ -0,0 +1,147 @@
# Phase 6.2 — Authorization Runtime (ACL + LDAP grants)
> **Status**: DRAFT — the v2 `plan.md` decision #129 + `acl-design.md` specify a 6-level permission-trie evaluator with `NodePermissions` bitmask grants, but no runtime evaluator exists. ACL tables are schematized but unread by the data path.
>
> **Branch**: `v2/phase-6-2-authorization-runtime`
> **Estimated duration**: 2.5 weeks
> **Predecessor**: Phase 6.1 (Resilience & Observability) — reuses the Polly pipeline for ACL-cache refresh retries
> **Successor**: Phase 6.3 (Redundancy)
## Phase Objective
Wire ACL enforcement on every OPC UA Read / Write / Subscribe / Call path + LDAP group → admin role grants that the v2 plan specified but never ran. End-state: a user's effective permissions resolve through a per-session permission-trie over the 6-level `Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag` hierarchy, cached per session, invalidated on generation-apply + LDAP group expiry.
Closes these gaps:
1. **Data-path ACL enforcement**`NodeAcl` table + `NodePermissions` flags shipped; `NodeAclService.cs` present as a CRUD surface; no code consults ACLs at `Read`/`Write` time. OPC UA server answers everything to everyone.
2. **`LdapGroupRoleMapping` for cluster-scoped admin grants** — decision #105 shipped as the *design*; admin roles are hardcoded (`FleetAdmin` / `ConfigEditor` / `ReadOnly`) with no cluster-scoping and no LDAP-to-grant table. Decision #105 explicitly lifts this from v2.1 into v2.0.
3. **Explicit Deny pathway** — deferred to v2.1 (decision #129 note). Phase 6.2 ships *grants only*; `Deny` stays out.
4. **Admin UI ACL grant editor**`AclsTab.razor` exists but edits the now-unused `NodeAcl` table; needs to wire to the runtime evaluator + the new `LdapGroupRoleMapping` table.
## Scope — What Changes
**Architectural separation** (critical for correctness): `LdapGroupRoleMapping` is **control-plane only** — it maps LDAP groups to Admin UI roles (`FleetAdmin` / `ConfigEditor` / `ReadOnly`) and cluster scopes for Admin access. **It is NOT consulted by the OPC UA data-path evaluator.** The data-path evaluator reads `NodeAcl` rows joined directly against the session's **resolved LDAP group memberships**. The two concerns share zero runtime code path.
| Concern | Change |
|---------|--------|
| `Configuration` project | New entity `LdapGroupRoleMapping { Id, LdapGroup, Role, ClusterId? (nullable = system-wide), IsSystemWide, GeneratedAtUtc }`. **Consumed only by Admin UI role routing.** Migration. Admin CRUD. |
| `Core` → new `Core.Authorization` sub-namespace | `IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, OpcUaOperation op, NodeId nodeId) → AuthorizationDecision`. `op` covers every OPC UA surface: Browse, Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge, Confirm, Shelve. Result is tri-state (internal model distinguishes `Allow` / `NotGranted` / `Denied` + carries matched-grant provenance). Phase 6.2 only produces `Allow` + `NotGranted`; v2.1 Deny lands without API break. |
| `PermissionTrieBuilder` | Builds trie from `NodeAcl` rows joined against **resolved LDAP group memberships**, keyed on 6-level scope hierarchy for Equipment namespaces. **SystemPlatform namespaces (Galaxy)** use a `FolderSegment` scope level between Namespace and Tag, populated from `Tag.FolderPath` segments, so folder subtree authorization works on Galaxy trees the same way UNS works on Equipment trees. Trie node carries `ScopeKind` enum. |
| `PermissionTrieCache` + freshness | One trie per `(ClusterId, GenerationId)`. Invalidated on `sp_PublishGeneration` via in-process event bus AND generation-ID check on hot path — every authz call looks up `CurrentGenerationId` (Polly-wrapped, sub-second cache); a Backup that cached a stale generation detects the mismatch + forces re-load. **Redundancy-safe**. |
| `UserAuthorizationState` freshness | Cached per session BUT bounded by `MembershipFreshnessInterval` (default **15 min**). Past that, the next hot-path authz call re-resolves LDAP group memberships via `LdapGroupService`. Failure to re-resolve (LDAP unreachable) → **fail-closed**: evaluator returns `NotGranted` for every call until memberships refresh successfully. Decoupled from Phase 6.1's availability-oriented 24h cache. |
| `AuthCacheMaxStaleness` | Separate from Phase 6.1's `UsingStaleConfig` window. Default 5 min — beyond that, authz fails closed regardless of Phase 6.1 cache warmth. |
| OPC UA server dispatch — all enforcement surfaces | `DriverNodeManager` wires evaluator on: **Browse + TranslateBrowsePathsToNodeIds** (ancestors implicitly visible if any descendant has a grant; denied ancestors filter from results), **Read** (per-attribute StatusCode `BadUserAccessDenied` in mixed-authorization batches; batch never poisons), **Write** (uses `NodePermissions.WriteOperate/Tune/Configure` based on driver `SecurityClassification`), **HistoryRead** (uses `NodePermissions.HistoryRead`**distinct** flag, not Read), **HistoryUpdate** (`NodePermissions.HistoryUpdate`), **CreateMonitoredItems** (per-`MonitoredItemCreateResult` denial), **TransferSubscriptions** (re-evaluates items on transfer), **Call** (`NodePermissions.MethodCall`), **Acknowledge/Confirm/Shelve** (per-alarm flags). |
| Subscription re-authorization | Each `MonitoredItem` is stamped with `(AuthGenerationId, MembershipVersion)` at create time. On every Publish, items with a stamp mismatching the session's current `(AuthGenerationId, MembershipVersion)` get re-evaluated; revoked items drop to `BadUserAccessDenied` within one publish cycle. Unchanged items stay fast-path. |
| `LdapAuthService` | On cookie-auth success: resolves LDAP group memberships; loads matching `LdapGroupRoleMapping` rows → role claims + cluster-scope claims (control plane); stores `UserAuthorizationState.LdapGroups` on the session for the data-plane evaluator. |
| `ValidatedNodeAclAuthoringService` | Replaces CRUD-only `NodeAclService` for authoring. Validates (LDAP group exists, scope exists in current or target draft, grant shape is valid, no duplicate `(LdapGroup, Scope)` pair). Admin UI writes only through it. |
| Admin UI `AclsTab.razor` | Writes via `ValidatedNodeAclAuthoringService`. Adds Probe-This-Permission row that runs the real evaluator against a chosen `(LDAP group, node, operation)` and shows `Allow` / `NotGranted` + matched-grant provenance. |
| Admin UI new tab `RoleGrantsTab.razor` | CRUD over `LdapGroupRoleMapping`. Per-cluster + system-wide grants. FleetAdmin only. **Documentation explicit** that this only affects Admin UI access, not OPC UA data plane. |
| Audit log | Every Grant/Revoke/Publish on `LdapGroupRoleMapping` or `NodeAcl` writes an `AuditLog` row with old/new state + user. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA authn | Already done (PR 19 LDAP user identity + Basic256Sha256 profile). Phase 6.2 is authorization only. |
| Explicit `Deny` grants | Decision #129 note explicitly defers to v2.1. Default-deny + additive grants only. |
| Driver-side `SecurityClassification` metadata | Drivers keep reporting `Operate` / `ViewOnly` / etc. — the evaluator uses them as *part* of the decision but doesn't replace them. |
| Galaxy namespace (SystemPlatform kind) | UNS levels don't apply; evaluator treats Galaxy nodes as `Cluster → Namespace → Tag` (skip UnsArea/UnsLine/Equipment). |
## Entry Gate Checklist
- [ ] Phase 6.1 merged (reuse `Core.Resilience` Polly pipeline for the ACL cache-refresh retries)
- [ ] `acl-design.md` re-read in full
- [ ] Decision log #105, #129, corrections-doc B1 re-skimmed
- [ ] Existing `NodeAcl` + `NodePermissions` flag enum audited; confirm bitmask flags match `acl-design.md` table
- [ ] Existing `LdapAuthService` group-resolution code path traced end-to-end — confirm it already queries group memberships (we only need the caller to consume the result)
- [ ] Test DB scenarios catalogued: two clusters, three LDAP groups per cluster, mixed grant shapes; captured as seed-data fixtures
## Task Breakdown
### Stream A — `LdapGroupRoleMapping` table + migration (3 days)
1. **A.1** Entity + EF Core migration. Columns per §Scope table. Unique constraint on `(LdapGroup, ClusterId)` with null-tolerant comparer for the system-wide case. Index on `LdapGroup` for the hot-path lookup on auth.
2. **A.2** `ILdapGroupRoleMappingService` CRUD. Wrap in the Phase 6.1 Polly pipeline (timeout → retry → fallback-to-cache).
3. **A.3** Seed-data migration: preserve the current hardcoded `FleetAdmin` / `ConfigEditor` / `ReadOnly` mappings by seeding rows for the existing LDAP groups the dev box uses (`cn=fleet-admin,…`, `cn=config-editor,…`, `cn=read-only,…`). Op no-op migration for existing deployments.
### Stream B — Permission-trie evaluator (1 week)
1. **B.1** `IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, NodeId nodeId, NodePermissions needed)` — returns `bool`. Phase 6.2 returns only `true` / `false`; v2.1 can widen to `Allow`/`Deny`/`Indeterminate` if Deny lands.
2. **B.2** `PermissionTrieBuilder` builds the trie from `NodeAcl` + `LdapGroupRoleMapping` joined to the current generation's `UnsArea` + `UnsLine` + `Equipment` + `Tag` tables. One trie per `(ClusterId, GenerationId)` so rollback doesn't smear permissions across generations.
3. **B.3** Trie node structure: `{ Level: enum, ScopeId: Guid, AllowedPermissions: NodePermissions, ChildrenByLevel: Dictionary<Guid, TrieNode> }`. Evaluation walks from Cluster → Namespace → UnsArea → UnsLine → Equipment → Tag, ORing allowed permissions at each level. Additive semantics: a grant at Cluster level cascades to every descendant tag.
4. **B.4** `PermissionTrieCache` service scoped as singleton; exposes `GetTrieAsync(ClusterId, ct)` that returns the current-generation trie. Invalidated on `sp_PublishGeneration` via an in-process event bus; also on TTL expiry (24 h safety net).
5. **B.5** Per-session cached evaluator: OPC UA Session authentication produces `UserAuthorizationState { ClusterId, LdapGroups[], Trie }`; cached on the session until session close or generation-apply.
6. **B.6** Unit tests: trie-walk theory covering (a) Cluster-level grant cascades to tags, (b) Equipment-level grant doesn't leak to sibling Equipment, (c) multi-group union, (d) no-grant → deny, (e) Galaxy nodes skip UnsArea/UnsLine levels.
### Stream C — OPC UA server dispatch wiring (6 days, widened)
1. **C.1** `DriverNodeManager.Read` — evaluator consulted per `ReadValueId` with `OpcUaOperation.Read`. Denied attributes get `BadUserAccessDenied` per-item; batch never poisons. Integration test covers mixed-authorization batch (3 authorized + 2 denied → 3 Good values + 2 Bad StatusCodes, request completes).
2. **C.2** `DriverNodeManager.Write` — evaluator chooses `NodePermissions.WriteOperate` / `WriteTune` / `WriteConfigure` based on the driver-reported `SecurityClassification`.
3. **C.3** `DriverNodeManager.HistoryRead`**uses `NodePermissions.HistoryRead`**, which is a **distinct flag** from Read. Test: user with Read but not HistoryRead can read live values but gets `BadUserAccessDenied` on `HistoryRead`.
4. **C.4** `DriverNodeManager.HistoryUpdate` — uses `NodePermissions.HistoryUpdate`.
5. **C.5** `DriverNodeManager.CreateMonitoredItems` — per-`MonitoredItemCreateResult` denial in mixed-authorization batch; partial success path per OPC UA Part 4. Each created item stamped `(AuthGenerationId, MembershipVersion)`.
6. **C.6** `DriverNodeManager.TransferSubscriptions` — on reconnect, re-evaluate every transferred `MonitoredItem` against the session's current auth state. Stale-stamp items drop to `BadUserAccessDenied`.
7. **C.7** **Browse + TranslateBrowsePathsToNodeIds** — evaluator called with `OpcUaOperation.Browse`. Ancestor visibility implied when any descendant has a grant (per `acl-design.md` §Browse). Denied ancestors filter from browse results — the UA browser sees a hierarchy truncated at the denied ancestor rather than an inconsistent child-without-parent view.
8. **C.8** `DriverNodeManager.Call``NodePermissions.MethodCall`.
9. **C.9** Alarm actions (Acknowledge / Confirm / Shelve) — per-alarm `NodePermissions.AlarmAck` / `AlarmConfirm` / `AlarmShelve`.
10. **C.10** Publish path — for each `MonitoredItem` with a mismatched `(AuthGenerationId, MembershipVersion)` stamp, re-evaluate. Unchanged items stay fast-path; changes happen at next publish cycle.
11. **C.11** Integration tests: three-user seed with different memberships; matrix covers every operation in §Scope. Mixed-batch tests for Read + CreateMonitoredItems.
### Stream D — Admin UI refresh (4 days)
1. **D.1** `RoleGrantsTab.razor` — FleetAdmin-gated CRUD on `LdapGroupRoleMapping`. Per-cluster dropdown + system-wide checkbox. Validation: LDAP group must exist in the dev LDAP (GLAuth) before saving — best-effort probe with graceful degradation.
2. **D.2** `AclsTab.razor` rewrites its edit path to write through the new `NodeAclService`. Adds a "Probe this permission" row: choose `(LDAP group, node, action)` → shows Allow / Deny + the reason (which grant matched).
3. **D.3** Draft-generation diff viewer now includes an ACL section: "X grants added, Y grants removed, Z grants changed."
4. **D.4** SignalR notification: `PermissionTrieCache` invalidation on `sp_PublishGeneration` pushes to Admin UI so operators see "this clusters permissions were just updated" within 2 s.
## Compliance Checks (run at exit gate)
- [ ] **Control/data-plane separation**: `LdapGroupRoleMapping` consumed only by Admin UI; the data-path evaluator has zero references to it. Enforced via a project-reference audit (Admin project references the mapping service; `Core.Authorization` does not).
- [ ] **Every operation wired**: Browse, Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge, Confirm, Shelve all consult the evaluator. Integration test matrix covers every operation × allow/deny.
- [ ] **HistoryRead uses its own flag**: test "user with Read + no HistoryRead gets `BadUserAccessDenied` on HistoryRead".
- [ ] **Mixed-batch semantics**: Read of 5 nodes (3 allowed + 2 denied) returns 3 Good + 2 `BadUserAccessDenied` per-`ReadValueId`; CreateMonitoredItems equivalent.
- [ ] **Browse ancestor visibility**: user with a grant only on a deep equipment node can browse the path to it (ancestors implied); denied ancestors filter from browse results otherwise.
- [ ] **Galaxy FolderSegment coverage**: a grant on a Galaxy folder subtree cascades to its tags; sibling folders are unaffected. Trie test covers this.
- [ ] **Subscription re-authorization**: integration test — create item, revoke grant via draft+publish, next publish cycle the item returns `BadUserAccessDenied` (not silently still-notifying).
- [ ] **Membership freshness**: test — 15 min MembershipFreshnessInterval elapses on a long-lived session + LDAP now unreachable → authz fails closed on the next request until LDAP recovers.
- [ ] **Auth cache fail-closed**: test — Phase 6.1 cache serves stale config for 6 min; authz evaluator refuses all calls after 5 min regardless.
- [ ] **Trie invariants**: `PermissionTrieBuilder` is idempotent (build twice with identical inputs → equal tries).
- [ ] **Additive grants + cluster isolation**: cluster-grant cascades; cross-cluster leakage impossible.
- [ ] **Redundancy-safe invalidation**: integration test — two nodes, a publish on one, authorize a request on the other before in-process event propagates → generation-mismatch forces re-load, no stale decision.
- [ ] **Authoring validation**: `AclsTab` cannot save a `(LdapGroup, Scope)` pair that already exists in the draft; operator sees the validation error pre-save.
- [ ] **AuthorizationDecision shape stability**: API surface exposes `Allow` + `NotGranted` only; `Denied` variant exists in the type but is never produced; v2.1 can add Deny without API break.
- [ ] No regression in driver test counts.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| ACL evaluator latency on per-read hot path | Medium | High | Trie lookup is O(depth) = O(6); session-cached UserAuthorizationState avoids per-Read trie rebuild; benchmark in Stream B.6 |
| Trie cache stale after a rollback | Medium | High | `sp_PublishGeneration` + `sp_RollbackGeneration` both emit the invalidation event; trie keyed on `(ClusterId, GenerationId)` so rollback fetches the prior trie cleanly |
| `BadUserAccessDenied` returns expose sensitive browse-name metadata | Low | Medium | Server returns only the status code + NodeId; no message leak per OPC UA Part 4 §7.34 guidance |
| LdapGroupRoleMapping migration breaks existing deployments | Low | High | Seed-migration preserves the hardcoded groups' effective grants verbatim; smoke test exercises the post-migration fleet admin login |
| Deny semantics accidentally ship (would break `acl-design.md` defer) | Low | Medium | `IPermissionEvaluator.Authorize` returns `bool` (not tri-state) through Phase 6.2; widening to `Allow`/`Deny`/`Indeterminate` is a v2.1 ticket |
## Completion Checklist
- [ ] Stream A: `LdapGroupRoleMapping` entity + migration + CRUD + seed
- [ ] Stream B: evaluator + trie builder + cache + per-session state + unit tests
- [ ] Stream C: OPC UA dispatch wiring on Read/Write/HistoryRead/Subscribe/Alarm paths
- [ ] Stream D: Admin UI `RoleGrantsTab` + `AclsTab` refresh + SignalR invalidation
- [ ] `phase-6-2-compliance.ps1` exits 0; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread `019da48d-0d2b-7171-aed2-fc05f1f39ca3`)
1. **Crit · ACCEPT** — Trie must not conflate `LdapGroupRoleMapping` (control-plane admin claims per decision #105) with data-plane ACLs (decision #129). **Change**: `LdapGroupRoleMapping` is consumed only by the Admin UI role router. Data-plane trie reads `NodeAcl` rows joined against the session's **resolved LDAP groups**, never admin roles. Stream B.2 updated.
2. **Crit · ACCEPT** — Cached `UserAuthorizationState` survives LDAP group changes because memberships only refresh at cookie-auth. Change: add `MembershipFreshnessInterval` (default 15 min); past that, next hot-path authz call forces group re-resolution (fail-closed if LDAP unreachable). Session-close-wins on config-rollback.
3. **High · ACCEPT** — Node-local invalidation doesn't extend across redundant pair. **Change**: trie keyed on `(ClusterId, GenerationId)`; hot-path authz looks up `CurrentGenerationId` from the shared config DB (Polly-wrapped + sub-second cache). A Backup that read stale generation gets a mismatched trie → forces re-load. Implementation note added to Stream B.4.
4. **High · ACCEPT** — Browse enforcement missing. **Change**: new Stream C.7 (`Browse + TranslateBrowsePathsToNodeIds` enforcement). Ancestor visibility implied when any descendant has a grant; denied ancestors filter from browse results per `acl-design.md` §Browse.
5. **High · ACCEPT**`HistoryRead` should use `NodePermissions.HistoryRead` bit, not `Read`. **Change**: Stream C.3 revised; separate unit test asserts `Read+no-HistoryRead` denies HistoryRead while allowing current-value reads.
6. **High · ACCEPT** — Galaxy shallow-path (Cluster→Namespace→Tag) loses folder hierarchy authorization. **Change**: SystemPlatform namespaces use a `FolderSegment` scope-level between Namespace and Tag, populated from `Tag.FolderPath`; UNS-kind namespaces keep the 6-level hierarchy. Trie supports both via `ScopeKind` on each node.
7. **High · ACCEPT** — Subscription re-authorization policy unresolved between create-time-only (fast, wrong on revoke) and per-publish (slow). **Change**: stamp each `MonitoredItem` with `(AuthGenerationId, MembershipVersion)`; re-evaluate on Publish only when either version changed. Revoked items drop to `BadUserAccessDenied` within one publish cycle.
8. **Med · ACCEPT** — Mixed-authorization batch `Read` / `CreateMonitoredItems` service-result semantics underspecified. **Change**: Stream C.6 explicitly tests per-`ReadValueId` + per-`MonitoredItemCreateResult` denial in mixed batches; batch never collapses to a coarse failure.
9. **Med · ACCEPT** — Missing surfaces: `Method.Call`, `HistoryUpdate`, event filter on subscriptions, subscription-transfer on reconnect, alarm-ack. **Change**: scope expanded — every OPC UA authorization surface enumerated in Stream C: Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge/Confirm/Shelve, Browse, TranslateBrowsePathsToNodeIds.
10. **Med · ACCEPT**`bool` evaluator bakes in grant-only semantics; collides with v2.1 Deny. **Change**: internal model uses `AuthorizationDecision { Allow | NotGranted | Denied, IReadOnlyList<MatchedGrant> Provenance }`. Phase 6.2 maps `Denied` → never produced; UI + audit log use the full record so v2.1 Deny lands without API break.
11. **Med · ACCEPT** — 6.1 cache fallback is availability-oriented; applying it to auth is correctness-dangerous. **Change**: auth-specific staleness budget `AuthCacheMaxStaleness` (default 5 min, not 24 h). Past that, hot-path evaluator fails closed on cached reads; all authorization calls return `NotGranted` until fresh data lands. Documented in risks + compliance.
12. **Low · ACCEPT** — Existing `NodeAclService` is raw CRUD. **Change**: new `ValidatedNodeAclAuthoringService` enforces scope-uniqueness + draft/publish invariants + rejects invalid (LDAP group, scope) pairs; Admin UI writes through it only. Stream D.2 adjusted.

View File

@@ -0,0 +1,150 @@
# Phase 6.3 — Redundancy Runtime
> **Status**: DRAFT — `CLAUDE.md` + `docs/Redundancy.md` describe a non-transparent warm/hot redundancy model with unique ApplicationUris, `RedundancySupport` advertisement, `ServerUriArray`, and dynamic `ServiceLevel`. Entities (`ServerCluster`, `ClusterNode`, `RedundancyRole`, `RedundancyMode`) exist; the runtime behavior (actual `ServiceLevel` number computation, mid-apply dip, `ServerUriArray` broadcast) is not wired.
>
> **Branch**: `v2/phase-6-3-redundancy-runtime`
> **Estimated duration**: 2 weeks
> **Predecessor**: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
> **Successor**: Phase 6.4 (Admin UI completion)
## Phase Objective
Land the non-transparent redundancy protocol end-to-end: two `OtOpcUa.Server` instances in a `ServerCluster` each expose a live `ServiceLevel` node whose value reflects that instance's suitability to serve traffic, advertise each other via `ServerUriArray`, and transition role (Primary ↔ Backup) based on health + operator intent.
Closes these gaps:
1. **Dynamic `ServiceLevel`** — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
2. **`ServerUriArray` broadcast** — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
3. **Primary / Backup role coordination** — entities carry `RedundancyRole` but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
4. **Mid-apply dip** — decision-level expectation that a server mid-generation-apply should report a *lower* ServiceLevel so clients cut over to the peer during the apply window. Not implemented.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads peers, runs **two-layer peer health probe**: (a) `/healthz` every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) `UaHealthProbe` every 10 s — opens a lightweight OPC UA client session to the peer + reads its `ServiceLevel` node + verifies endpoint serves data. Authority decisions use UaHealthProbe; `/healthz` is used only to avoid wasting UA probes when peer is obviously down. |
| Publish-generation fencing | Topology + role decisions are stamped with a monotonic `ConfigGenerationId` from the shared config DB. Coordinator re-reads topology via CAS on `(ClusterId, ExpectedGeneration)` → new row; peers reject state propagated from a lower generation. Prevents split-publish races. |
| `InvalidTopology` runtime state | If both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later. |
| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable includes **self + peers** in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). `RedundancySupport` stays static (set from `RedundancyMode` at startup); `Transparent` mode validated pre-publish, not rejected at startup. |
| `RedundancyCoordinator` computation | **8-state ServiceLevel matrix** — avoids OPC UA Part 5 §6.3.34 collision (`0=Maintenance`, `1=NoData`). Operator-declared maintenance only = **0**. Unreachable / Faulted = **1**. In-range operational states occupy **2..255**: Authoritative-Primary = **255**; Isolated-Primary (peer unreachable, self serving) = **230**; Primary-Mid-Apply = **200**; Recovering-Primary (post-fault, dwell not met) = **180**; Authoritative-Backup = **100**; Isolated-Backup (primary unreachable, "take over if asked") = **80**; Backup-Mid-Apply = **50**; Recovering-Backup = **30**; `InvalidTopology` (runtime detects >1 Primary) = **2** (detected-inconsistency band — below normal operation). Full matrix documented in `docs/Redundancy.md` update. |
| Role transition | Split-brain avoidance: role is *declared* in the shared config DB (`ClusterNode.RedundancyRole`), not elected at runtime. An operator flips the row (or a failover script does). Coordinator only reads; never writes. |
| `sp_PublishGeneration` hook | Uses named **apply leases** keyed to `(ConfigGenerationId, PublishRequestId)`. `await using var lease = coordinator.BeginApplyLease(...)`. Disposal on any exit path (success, exception, cancellation) decrements. Watchdog auto-closes any lease older than `ApplyMaxDuration` (default 10 min) → ServiceLevel can't stick at mid-apply. Pre-publish validator rejects unsupported `RedundancyMode` (e.g. `Transparent`) with a clear error so runtime never sees an invalid state. |
| Admin UI `/cluster/{id}` page | New `RedundancyTab.razor` — shows current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing `ClusterNode.RedundancyRole` + publishing a draft. |
| Metrics | New OpenTelemetry metrics: `ot_opcua_service_level{cluster,node}`, `ot_opcua_peer_reachable{cluster,node,peer}`, `ot_opcua_apply_in_progress{cluster,node}`. Sink via Phase 6.1 observability layer. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA authn / authz | Phases 6.2 + prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decision #7985. |
| Transparent redundancy (`RedundancySupport=Transparent`) | Not supported. If the operator asks for it the server fails startup with a clear error. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |
## Entry Gate Checklist
- [ ] Phase 6.1 merged (uses `/healthz` for peer probing)
- [ ] `CLAUDE.md` §Redundancy + `docs/Redundancy.md` re-read
- [ ] Decisions #7985 re-skimmed
- [ ] `ServerCluster`/`ClusterNode`/`RedundancyRole`/`RedundancyMode` entities + existing migration reviewed
- [ ] OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
- [ ] Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing
## Task Breakdown
### Stream A — Cluster topology loader (3 days)
1. **A.1** `RedundancyCoordinator` startup path: reads `ClusterNode` row for the current node (identified by `appsettings.json` `Cluster:NodeId`), reads the cluster's peer list, validates invariants (no duplicate `ApplicationUri`, at most one `Primary` per cluster if `RedundancyMode.WarmActive`, at most two nodes total in v2.0 per decision #83).
2. **A.2** Topology subscription — coordinator re-reads on `sp_PublishGeneration` confirmation so an operator role-swap takes effect after publish (no process restart needed).
3. **A.3** Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
### Stream B — Peer health probing + ServiceLevel computation (6 days, widened)
1. **B.1** `PeerHttpProbeLoop` per peer at 2 s — calls peer's `/healthz`, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
2. **B.2** `PeerUaProbeLoop` per peer at 10 s — opens an OPC UA client session to the peer (reuses Phase 5 `Driver.OpcUaClient` stack), reads peer's `ServiceLevel` node + verifies endpoint serves data. Short-circuit: if HTTP probe is failing, skip UA probe (no wasted sessions).
3. **B.3** `ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte`. 8-state matrix per §Scope. `topologyValid=false` forces InvalidTopology = 2 regardless of other inputs.
4. **B.4** `RecoveryStateManager`: after a `Faulted → Healthy` transition, hold driver in `Recovering` band (180 Primary / 30 Backup) for `RecoveryDwellTime` (default 60 s) AND require one positive publish witness (successful `Read` on a reference node) before entering Authoritative band.
5. **B.5** Calculator reacts to inputs via `IObserver` so changes immediately push to the OPC UA `ServiceLevel` node.
6. **B.6** Tests: **64-case matrix** covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.
### Stream C — OPC UA node wiring (3 days)
1. **C.1** `ServiceLevel` variable node created under `ServerStatus` at server startup. Type `Byte`, AccessLevel = CurrentRead only. Subscribe to `ServiceLevelCalculator` observable; push updates via `DataChangeNotification`.
2. **C.2** `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, **includes self + peers** with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
3. **C.3** `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Unsupported values (`Transparent`, `HotAndMirrored`) are rejected **pre-publish** by validator — runtime never sees them.
4. **C.4** Client.CLI cutover test: connect to primary, read `ServiceLevel` → 255; pause primary apply → 200; unreachable peer while apply in progress → 200 (apply dominates peer-unreachable per matrix); client sees peer via `ServerUriArray`; fail primary → client reconnects to peer at 80 (isolated-backup band).
### Stream D — Apply-window integration (3 days)
1. **D.1** `sp_PublishGeneration` caller wraps the apply in `await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId)`. Lease keyed to `(ConfigGenerationId, PublishRequestId)` so concurrent publishes stay isolated. Disposal decrements on every exit path.
2. **D.2** `ApplyLeaseWatchdog` auto-closes leases older than `ApplyMaxDuration` (default 10 min) so a crashed publisher can't pin the node at mid-apply.
3. **D.3** Pre-publish validator in `sp_PublishGeneration` rejects unsupported `RedundancyMode` values (`Transparent`, `HotAndMirrored`) with a clear error message — runtime never sees an invalid mode.
4. **D.4** Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via `ThreadAbort` / cancellation → watchdog closes; (c) publish rejected for `Transparent` → operator-actionable error.
### Stream E — Admin UI + metrics (3 days)
1. **E.1** `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
2. **E.2** OpenTelemetry meter export: `ot_opcua_service_level{cluster,node}` gauge + `ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua}` + `ot_opcua_apply_in_progress{cluster,node}` + `ot_opcua_topology_valid{cluster}`. Sink via Phase 6.1 observability.
3. **E.3** SignalR push: `FleetStatusHub` broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
### Stream F — Client-interoperability matrix (3 days, new)
1. **F.1** Validate ServiceLevel-driven cutover against **Ignition 8.1 + 8.3**, **Kepware KEPServerEX 6.x**, **Aveva OI Gateway 2020R2 + 2023R1**. For each: configure the client with both endpoints, verify it honors `ServiceLevel` + `ServerUriArray` during primary failover.
2. **F.2** Clients that don't honour the standards (doc field — may include Kepware and OI Gateway per Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in `docs/Redundancy.md`.
3. **F.3** Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill Primary, assert Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)`. Document required session-timeout config in `docs/Redundancy.md`.
## Compliance Checks (run at exit gate)
- [ ] **OPC UA band compliance**: `0=Maintenance` reserved, `1=NoData` reserved. Operational states in 2..255 per 8-state matrix.
- [ ] **Authoritative-Primary** ServiceLevel = 255.
- [ ] **Isolated-Primary** (peer unreachable, self serving) = 230 — Primary retains authority.
- [ ] **Primary-Mid-Apply** = 200.
- [ ] **Recovering-Primary** = 180 with dwell + publish witness enforced.
- [ ] **Authoritative-Backup** = 100.
- [ ] **Isolated-Backup** (primary unreachable) = 80 — does NOT auto-promote.
- [ ] **InvalidTopology** = 2 — both nodes self-demote when >1 Primary detected runtime.
- [ ] **ServerUriArray** returns self + peer URIs, self first.
- [ ] **UaHealthProbe authority**: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
- [ ] **Apply-lease disposal**: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
- [ ] **Transparent-mode rejection**: attempting to publish `RedundancyMode=Transparent` is blocked at `sp_PublishGeneration`; runtime never sees an invalid mode.
- [ ] **Role transition via operator publish**: FleetAdmin swaps `RedundancyRole` in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
- [ ] **Client.CLI cutover**: with primary halted, Client.CLI that was connected to primary sees primary drop + reconnects to backup via `ServerUriArray`.
- [ ] **Client interoperability matrix** (Stream F): Ignition 8.1 + 8.3 honour ServiceLevel; Kepware + Aveva OI Gateway findings documented.
- [ ] **Galaxy MXAccess failover**: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
- [ ] No regression in existing driver test suites; no regression in `/healthz` reachability under redundancy load.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in `sp_PublishGeneration`. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; only declares a peer unreachable after 3 consecutive failed probes |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in `docs/Redundancy.md` — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is client-responsibility |
| Apply-window counter leaks on exception | Low | High | `BeginApplyWindow` returns `IDisposable`; `using` syntax enforces paired decrement; unit test for exception-in-apply path |
| `HttpClient` probe leaks sockets | Low | Medium | Single shared `HttpClient` per coordinator (not per-probe); timeouts tight to avoid keeping connections open during peer downtime |
## Completion Checklist
- [ ] Stream A: topology loader + tests
- [ ] Stream B: peer probe + ServiceLevel calculator + 32-case matrix tests
- [ ] Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
- [ ] Stream D: apply-window integration + nested-apply counter
- [ ] Stream E: Admin `RedundancyTab` + OpenTelemetry metrics + SignalR push
- [ ] `phase-6-3-compliance.ps1` exits 0; exit-gate doc; `docs/Redundancy.md` updated with the ServiceLevel matrix
## Adversarial Review — 2026-04-19 (Codex, thread `019da490-3fa0-7340-98b8-cceeca802550`)
1. **Crit · ACCEPT** — No publish-generation fencing enables split-publish advertising both as authoritative. **Change**: coordinator CAS on a monotonic `ConfigGenerationId`; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
2. **Crit · ACCEPT**`>1 Primary` at startup covered but runtime containment missing when invalid topology appears later (mid-apply race). **Change**: add runtime `InvalidTopology` state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
3. **High · ACCEPT**`0 = Faulted` collides with OPC UA Part 5 §6.3.34 semantics where 0 means **Maintenance** and 1 means NoData. **Change**: reserve **0** for operator-declared maintenance-mode only; Faulted/unreachable uses **1** (NoData); in-range degraded states occupy 2..199.
4. **High · ACCEPT** — Matrix collapses distinct operational states onto the same value. **Change**: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
5. **High · ACCEPT**`/healthz` from 6.1 is HTTP-healthy but doesn't guarantee OPC UA data plane. **Change**: add a redundancy-specific probe `UaHealthProbe` — issues a `ReadAsync(ServiceLevel)` against the peer's OPC UA endpoint via a lightweight client session. `/healthz` remains the fast-fail; the UA probe is the authority signal.
6. **High · ACCEPT**`ServerUriArray` must include self + peers, not peers only. **Change**: array contains `[self.ApplicationUri, peer.ApplicationUri]` in stable deterministic ordering; compliance test asserts local-plus-peer membership.
7. **Med · ACCEPT** — No `Faulted → Recovering → Healthy` path. **Change**: add `Recovering` state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
8. **Med · ACCEPT** — Topology change during in-flight probe undefined. **Change**: every probe task tagged with `ConfigGenerationId` at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
9. **Med · ACCEPT** — Apply-window counter race on exception/cancellation/async ownership. **Change**: apply-window is a named lease keyed to `(ConfigGenerationId, PublishRequestId)` with disposal enforced via `await using`; watchdog detects leased-but-abandoned and force-closes after `ApplyMaxDuration` (default 10 min).
10. **High · ACCEPT** — Ignition + Kepware + Aveva OI Gateway `ServiceLevel` compliance is unverified. **Change**: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel documented as requiring explicit backup-endpoint config.
11. **Med · ACCEPT** — Galaxy MXAccess re-session on Primary death not in acceptance. **Change**: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, asserts Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)` budget. `docs/Redundancy.md` updated with required session timeouts.
12. **Med · ACCEPT** — Transparent-mode startup rejection is outage-prone. **Change**: `sp_PublishGeneration` validates `RedundancyMode` pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.

View File

@@ -0,0 +1,134 @@
# Phase 6.4 — Admin UI Completion
> **Status**: DRAFT — Phase 1 Stream E shipped the Admin scaffold + core pages; several feature-completeness items from its completion checklist (`phase-1-configuration-and-admin-scaffold.md` §Stream E) never landed. This phase closes them.
>
> **Branch**: `v2/phase-6-4-admin-ui-completion`
> **Estimated duration**: 2 weeks
> **Predecessor**: Phase 6.3 (Redundancy runtime) — reuses the `/cluster/{id}` page layout for the new tabs
> **Successor**: v2 release-readiness capstone (Task #121)
## Phase Objective
Close the Admin UI feature-completeness checklist that Phase 1 Stream E exit gate left open. Each item below is an existing `phase-1-configuration-and-admin-scaffold.md` completion-checklist entry that is currently unchecked.
Gaps to close:
1. **UNS Structure tab drag/move with impact preview** — decision #115 + `admin-ui.md` §"UNS". Current state: list-only render; no drag reorder; no "X lines / Y equipment impacted" preview.
2. **Equipment CSV import + 5-identifier search** — decision #95 + #117. Current state: basic form; no CSV parser; search indexes only ZTag.
3. **Draft-generation diff viewer** — enhance existing `DiffViewer.razor` to show generation-diff not just staged-edit diff; highlight ACL grant changes (lands after Phase 6.2).
4. **`_base` equipment-class Identification fields exposure** — decision #138139. Columns exist on `Equipment`; no Admin UI field group; no address-space exposure of the OPC 40010 sub-folder.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Admin/Pages/UnsTab.razor` | Tree component with drag-drop using **`MudBlazor.TreeView` + `MudBlazor.DropTarget`** (existing transitive dep — no new third-party package). Native HTML5 DnD rejected because virtualization + DnD on 500+ nodes doesn't combine reliably. Each drag fires a "Compute Impact" call carrying a `DraftRevisionToken`; modal preview ("Moving Line 'Oven-2' from 'Packaging' to 'Assembly' will re-home 14 equipment + re-parent 237 tags"). **Confirm step re-checks the token** and rejects with a `409 Conflict / refresh-required` modal if the draft advanced between preview and commit. |
| `Admin/Services/UnsImpactAnalyzer.cs` | New service. Given a move-operation (line move, area rename, line merge), computes cascade counts + `DraftRevisionToken` at preview time. Pure-function shape; testable in isolation. |
| `Admin/Pages/EquipmentTab.razor` | Add CSV-import button → modal with file picker + dry-run preview. **Identifier search** uses the canonical decision #117 set: `ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid`. Typeahead probes each column with a ranking query (exact match score 100 → prefix 50 → opt-in LIKE 20; published > draft tie-break). Result row shows which field matched via trailing badge. |
| `Admin/Services/EquipmentCsvImporter.cs` | New service. CSV header row must start with `# OtOpcUaCsv v1` (version marker — future shape changes bump the version). Columns: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid, Name, UnsAreaName, UnsLineName, Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Parser rejects unknown columns + blank required fields + duplicate ZTags + missing UnsLines. |
| **Staged-import table** `EquipmentImportBatch` | New entity `{ Id, CreatedAtUtc, CreatedBy, RowsStaged, RowsAccepted, RowsRejected, FinalisedAtUtc? }` + child `EquipmentImportRow` records. Import writes rows in chunks to the staging table (not to `Equipment`). `FinaliseImportBatch` is the atomic finalize step that applies all accepted rows to `Equipment` + `ExternalIdReservation` in one transaction — short + bounded regardless of input size. Rollback = drop the batch row; `Equipment` never partially mutates. |
| `Admin/Pages/DraftEditor.razor` + `DiffViewer.razor` | Diff viewer refactored into a base component + section plugins: `StructuralDiffSection`, `EquipmentDiffSection`, `TagDiffSection`, `AclDiffSection` (Phase 6.2), `RedundancyDiffSection` (Phase 6.3), `IdentificationDiffSection`. Each section has a **1000-row hard cap**; over-cap renders an aggregate summary + "Load full diff" button streaming 500-row pages via SignalR. Subtree-rename diffs (decision #115 bulk restructure) surface as summary only by default. |
| `Admin/Components/IdentificationFields.razor` | New component. Renders the OPC 40010 field set **per decision #139**: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. `ProductInstanceUri / DeviceRevision / MonthOfConstruction` dropped from this phase — they need a separate decision-log widening. |
| `OtOpcUa.Server/OpcUa/DriverNodeManager` — Equipment folder build | When an `Equipment` row has non-null Identification fields, the server adds an `Identification` sub-folder under the Equipment node containing one variable per non-null field. **ACL binding**: the sub-folder + variables inherit the `Equipment` scope's grants from Phase 6.2's trie — no new scope level added. Documented in `acl-design.md` cross-reference update. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Admin UI visual language | Bootstrap 5 / cookie auth / sidebar layout unchanged — consistency with ScadaLink design reference. |
| LDAP auth flow | Already shipped in Phase 1. Phase 6.4 is additive UI only. |
| Core abstractions / driver layer | Admin UI changes don't touch drivers. |
| Equipment-class *template schema validation* | Still deferred (decision #112 — schemas repo not landed). We expose the Identification fields but don't validate against a template hierarchy. |
| Drag/move to *other clusters* | Out of scope — equipment is cluster-scoped per decision #82. Cross-cluster migration is a different workflow. |
## Entry Gate Checklist
- [ ] Phase 6.2 merged (ACL grants are part of the new diff viewer sections)
- [ ] Phase 6.3 merged (redundancy-role changes are part of the diff viewer)
- [ ] `phase-1-configuration-and-admin-scaffold.md` §Stream E completion checklist re-read — confirm these are the remaining items
- [ ] `admin-ui.md` re-skimmed for screen layouts
- [ ] Existing `EquipmentTab.razor` / `UnsTab.razor` / `DraftEditor.razor` diff'd against what ships today so the edits are additive not destructive
- [ ] Dev Galaxy available for OPC 40010 exposure smoke testing
## Task Breakdown
### Stream A — UNS drag/reorder + impact preview (5 days)
1. **A.1** 1000-node synthetic seed fixture. Drag-latency bench against `MudBlazor.TreeView` + `MudBlazor.DropTarget` — commit to the component if latency budget (100 ms drag-enter feedback) holds; fall back to flat-list reorder UI (Area/Line dropdowns) with loss of visual drag affordance otherwise.
2. **A.2** `UnsImpactAnalyzer` service. Inputs: `(DraftGenerationId, MoveOperation, DraftRevisionToken)`. Outputs: `ImpactPreview { AffectedEquipmentCount, AffectedTagCount, CascadeWarnings[], DraftRevisionToken }`. Pure-function shape; testable in isolation.
3. **A.3** Modal preview wired to `UnsImpactAnalyzer`. **Confirm** re-reads the current draft revision + compares against the preview's token; if the draft advanced (another operator saved a different edit), show a `409 Conflict / refresh-required` modal rather than silently overwriting.
4. **A.4** Cross-cluster drop attempts: target disabled + toast "Equipment is cluster-scoped (decision #82). To move across clusters, use Export → Import on the Cluster detail page." Plus help link.
5. **A.5** Playwright (or equivalent) smoke test: drag a line across areas, assert modal shows right counts, assert draft row reflects the move; concurrent-edit test runs two sessions + asserts the later Confirm hits the 409.
### Stream B — Equipment CSV import + 5-identifier search (5 days)
1. **B.1** `EquipmentCsvImporter`. Strict RFC 4180 parser (per decision #95). Header row validation: first line must match `# OtOpcUaCsv v1` — future versions fork parser versions. Required columns: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid, Name, UnsAreaName, UnsLineName`. Optional: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Parser rejects unknown columns + blank required fields + duplicate ZTags.
2. **B.2** `EquipmentImportBatch` + `EquipmentImportRow` staging tables (migration). Import writes preview rows to staging via chunked inserts; staging never blocks `Equipment` or `ExternalIdReservation`. Preview query reads staging + validates each row against the current `Equipment` state + `ExternalIdReservation` freshness.
3. **B.3** `ImportPreview` UI — per-row accept/reject table. Reject reasons: "ZTag already exists in draft", "ExternalIdReservation conflict with Cluster X", "UnsLineName not found in draft UNS tree", etc. Operator reviews + clicks "Commit".
4. **B.4** `FinaliseImportBatch` — atomic finalize. One EF transaction applies accepted rows to `Equipment` + `ExternalIdReservation`; duration bounded regardless of input size (the atomic step is a bulk-insert, not per-row row-by-row). Rollback = drop batch row via `DropImportBatch`; `Equipment` never partially mutates.
5. **B.5** Five-identifier search. Rank SQL: exact match any identifier = score 100, prefix match = 50, LIKE-fuzzy (opt-in via `?fuzzy=true`) = 20; tie-break `published > draft` then `RowVersion DESC`. Typeahead shows which field matched via trailing badge.
6. **B.6** Smoke tests: 100-row CSV with 10 conflicts (5 ZTag dupes, 3 reservation clashes, 2 missing UnsLines); 10k-row perf test asserting finalize txn < 30 s; concurrent import + external `ExternalIdReservation` insert test asserts retryable-conflict handling.
### Stream C — Diff viewer enhancements (4 days)
1. **C.1** Refactor `DiffViewer.razor` into a base component + section plugins. Plugins: `StructuralDiffSection` (UNS tree), `EquipmentDiffSection`, `TagDiffSection`, `AclDiffSection` (Phase 6.2), `RedundancyDiffSection` (Phase 6.3), `IdentificationDiffSection`.
2. **C.2** Each section renders collapsed by default; counts + top-line summary always visible. **1000-row hard cap** per section — over-cap sections render aggregate summary (e.g. "237 equipment re-parented from Packaging to Assembly") with a "Load full diff" button that streams 500-row pages via SignalR.
3. **C.3** Subtree-rename diffs (decision #115 bulk restructure) surface as summary only by default regardless of row count.
4. **C.4** Tests: seed two generations with deliberate diffs; assert every section reports the right counts + top-line summary + hard-cap behavior.
### Stream D — OPC 40010 Identification exposure (3 days)
1. **D.1** `IdentificationFields.razor` component. Renders the **9 decision #139 fields**: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Labelled inputs; nullable columns show empty input; required-field validation on commit only.
2. **D.2** `DriverNodeManager` equipment-folder builder — after building the equipment node, inspect the 9 Identification columns; if any non-null, add an `Identification` sub-folder with variable-per-non-null-field. ACL binding: sub-folder + variables inherit the **same `ScopeId` as the Equipment node** (Phase 6.2's trie treats them as part of the Equipment scope — no new scope level).
3. **D.3** Address-space smoke test via Client.CLI: browse an equipment node, assert `Identification` sub-folder present when columns are set, absent when all null, variables match the field values.
4. **D.4** ACL integration test: a user with Equipment-level grant reads the `Identification` variables without needing a separate grant; a user without the Equipment grant gets `BadUserAccessDenied` on both the Equipment node + its Identification variables.
## Compliance Checks (run at exit gate)
- [ ] **UNS drag/move**: drag a line across areas; modal preview shows correct impacted-equipment + impacted-tag counts.
- [ ] **Concurrent-edit safety**: two-session test — session B saves a draft edit after session A opened the preview; session A's Confirm returns `409 Conflict / refresh-required` instead of overwriting.
- [ ] **Cross-cluster drop**: dropping equipment across cluster boundaries is disabled + shows actionable toast pointing to Export/Import workflow.
- [ ] **1000-node tree**: drag operations on a 1000-node seed maintain < 100 ms drag-enter feedback.
- [ ] **CSV header version**: file missing `# OtOpcUaCsv v1` first line is rejected pre-parse.
- [ ] **CSV canonical identifier set**: columns match decision #117 (ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid); drift from the earlier draft surfaces as a test failure.
- [ ] **Staged-import atomicity**: `FinaliseImportBatch` transaction bounded < 30 s for a 10k-row import; pre-finalize stagings visible only to the importing user; rollback via `DropImportBatch`.
- [ ] **Concurrent import + external reservation**: concurrent test — third party inserts to `ExternalIdReservation` mid-finalize; finalize retries with conflict handling; no corruption.
- [ ] **5-identifier search ranking**: exact matches outrank prefix matches; published outranks draft for equal scores.
- [ ] **Diff viewer section caps**: 2000-row subtree-rename diff renders as summary only; "Load full diff" streams in pages.
- [ ] **OPC 40010 field list match**: rendered field group matches decision #139 exactly; no extra fields.
- [ ] **OPC 40010 exposure**: Client.CLI browse shows `Identification` sub-folder when equipment has non-null columns; absent when all null.
- [ ] **ACL inheritance for Identification**: integration test — Equipment-grant user reads Identification; no-grant user gets `BadUserAccessDenied` on both.
- [ ] **Visual parity reviewer**: named role (`FleetAdmin` user, not the implementation lead) compares side-by-side against `admin-ui.md` §Visual-Design reference panels; signoff artefact is a checked-in screenshot set under `docs/v2/visual-compliance/phase-6-4/`.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| UNS drag-drop janky on large trees (>500 nodes) | Medium | Medium | Virtualize the tree component; default-collapse nested areas; test with a synthetic 1000-equipment seed |
| CSV import performance on 10k-row imports | Medium | Medium | Stream-parse rather than load-into-memory; preview renders in batches of 100; commit is chunked-EF-insert with progress bar |
| Diff viewer becomes unwieldy with many sections | Low | Medium | Each section collapsed by default; top-line summary row always shown; Phase 6.4 caps at 6 sections |
| OPC 40010 sub-folder accidentally exposes NULL/empty identification columns as empty-string variables | Low | Low | Column null-check in the builder; drop variables whose DB value is null |
| 5-identifier search pulls full table | Medium | Medium | Indexes on each of ZTag/SAPID/UniqueId/Alias1/Alias2; search query uses a UNION of 5 indexed lookups; falls back to LIKE only on explicit operator opt-in |
## Completion Checklist
- [ ] Stream A: `UnsImpactAnalyzer` + drag-drop tree + modal preview + Playwright smoke
- [ ] Stream B: `EquipmentCsvImporter` + preview modal + 5-identifier search + conflict-rollback test
- [ ] Stream C: `DiffViewer` refactor + 6 section plugins + 2-generation diff test
- [ ] Stream D: `IdentificationFields.razor` + address-space builder change + Client.CLI browse test
- [ ] Visual-compliance reviewer signoff
- [ ] Full solution `dotnet test` passes; `phase-6-4-compliance.ps1` exits 0; exit-gate doc
## Adversarial Review — 2026-04-19 (Codex, via `codex-rescue` subagent)
1. **Crit · ACCEPT** — Stale UNS impact preview can overwrite concurrent draft edits. **Change**: each preview carries a `DraftRevisionToken`; `Confirm` compares against the current draft + rejects with a `409 Conflict / refresh-required` modal if any draft edit landed since the preview was generated. Stream A.3 updated.
2. **High · ACCEPT** — CSV import atomicity is internally contradictory (single EF transaction vs. chunked inserts). **Change**: one explicit model — staged-import table (`EquipmentImportBatch { Id, CreatedAtUtc, RowsStaged, RowsAccepted, RowsRejected }`) receives rows in chunks; final `FinaliseImportBatch` is atomic over `Equipment` + `ExternalIdReservation`. Rollback is "drop the batch row" — the real Equipment table is never partially mutated.
3. **Crit · ACCEPT** — Identifier contract rewrite mis-cites decisions. **Change**: revert to the `admin-ui.md` + decision #117 canonical set — `ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid`. CSV header follows that set verbatim. Introduce a separate decision entry for versioned CSV header shape before adding any new column; CSV header row must start with `# OtOpcUaCsv v1` so future shape changes are unambiguous.
4. **Med · ACCEPT** — Search ordering undefined. **Change**: rank SQL — exact match on any identifier scores 100; prefix match 50; LIKE-fuzzy 20; published > draft tie-breaker; `ORDER BY score DESC, RowVersion DESC`. Typeahead shows which field matched via trailing badge.
5. **High · ACCEPT** — HTML5 DnD on virtualized tree is aspirational. **Change**: Stream A.2 rewritten — commits to **`MudBlazor.TreeView` + `MudBlazor.DropTarget`** (already a transitive dep via the existing Admin UI). Build a 1000-node synthetic seed in A.1 + validate drag-latency budget before implementing impact preview. If MudBlazor can't hit the budget, fall back to a flat-list reorder UI with Area/Line dropdowns (loss of visual drag affordance but unblocks the feature).
6. **Med · ACCEPT** — Collapsed-by-default doesn't handle generation-sized diffs. **Change**: each diff section has a hard row cap (1000 by default). Over-cap sections render an aggregate summary + "Load full diff" button that streams via SignalR in 500-row pages. Decision #115 subtree renames surface as a "N equipment re-parented under X → Y" summary instead of row-by-row.
7. **High · ACCEPT** — OPC 40010 field list doesn't match decision #139. **Change**: field group realigned to `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. `ProductInstanceUri / DeviceRevision / MonthOfConstruction` dropped from Phase 6.4 — they belong to a future OPC 40010 widening decision.
8. **High · ACCEPT**`Identification` subtree unreconciled with ACL hierarchy (Phase 6.2 6-level scope). **Change**: address-space builder creates the Identification sub-folder under the Equipment node **with the same ScopeId as Equipment** — no new scope level. ACL evaluator treats `…/Equipment/Identification/X` as inheriting the `Equipment` scope's grants. Documented in Phase 6.2's `acl-design.md` cross-reference update.
9. **Low · ACCEPT** — Visual-review gate names nonexistent reviewer role. **Change**: rubric defined — a named "Admin UX reviewer" (role `FleetAdmin` user, not the implementation lead) compares side-by-side screenshots against the `admin-ui.md` §Visual-Design reference panels; signoff artefact is a checked-in screenshot set under `docs/v2/visual-compliance/phase-6-4/`.
10. **Med · ACCEPT** — Cross-cluster drag/drop lacks loud failure path. **Change**: on drop across cluster boundary, disable the drop target + show a toast "Equipment is cluster-scoped (decision #82). To move across clusters, use the Export → Import workflow on the Cluster detail page." Plus a help link. Tested in Stream A.4.

View File

@@ -909,6 +909,26 @@ Each step leaves the system runnable. The generic extraction is effectively free
| 140 | Enterprise shortname = `zb` (UNS level-1 segment) | Closes corrections-doc D4. Matches the existing `ZB.MOM.WW.*` namespace prefix used throughout the codebase; short by design since this segment appears in every equipment path (`zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState`); operators already say "ZB" colloquially. Admin UI cluster-create form default-prefills `zb` for the Enterprise field. Production deployments use it directly from cluster-create | 2026-04-17 |
| 141 | Tier 3 (AppServer IO) cutover is feasible — AVEVA's OI Gateway supports arbitrary upstream OPC UA servers as a documented pattern | Closes corrections-doc E2 with **GREEN-YELLOW** verdict. Multiple AVEVA partners (Software Toolbox, InSource) have published working integrations against four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, Cogent DataHub). No re-architecting of OtOpcUa required. Path: `OPC UA node → OI Gateway → SuiteLink → $DDESuiteLinkDIObject → AppServer attribute`. Recommended AppServer floor: System Platform 2023 R2 Patch 01. Two integrator-burden risks tracked: validation/GxP paperwork (no AVEVA blueprint exists for non-AVEVA upstream servers in Part 11 deployments) and unpublished scale benchmarks (in-house benchmark required before cutover scheduling). See `aveva-system-platform-io-research.md` | 2026-04-17 |
| 142 | Phase 1 acceptance includes an end-to-end AppServer-via-OI-Gateway smoke test against OtOpcUa | Catches AppServer-specific quirks (cert exchange via reject-and-trust workflow, endpoint URL must NOT include `/discovery` suffix per Inductive Automation forum failure mode, service-account install required because OI Gateway under SYSTEM cannot connect to remote OPC servers, `Basic256Sha256` + `SignAndEncrypt` + LDAP-username token combination must work end-to-end) early — well before the Year 3 tier-3 cutover schedule. Adds one task to `phase-1-configuration-and-admin-scaffold.md` Stream E (Admin smoke test) | 2026-04-17 |
| 143 | Polly per-capability policy — Read / HistoryRead / Discover / Probe / Alarm-subscribe auto-retry; Write does NOT auto-retry unless the tag metadata carries `[WriteIdempotent]` | Decisions #44-45 forbid auto-retry on Write because a timed-out write can succeed on the device + be replayed by the pipeline, duplicating pulses / alarm acks / counter increments / recipe-step advances. Per-capability policy in the shared Polly layer makes the retry safety story explicit; `WriteIdempotentAttribute` on tag definitions is the opt-in surface | 2026-04-19 |
| 144 | Polly pipeline key = `(DriverInstanceId, HostName)`, not DriverInstanceId alone | Decision #35 requires per-device isolation. One dead PLC behind a multi-device Modbus driver must NOT open the circuit breaker for healthy sibling hosts. Per-instance pipelines would poison every device behind one bad endpoint | 2026-04-19 |
| 145 | Tier A/B/C runtime enforcement splits into `MemoryTracking` (all tiers — soft/hard thresholds log + surface, NEVER kill) and `MemoryRecycle` (Tier C only — requires out-of-process topology). Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; the runtime never auto-kills an in-process driver | Decisions #73-74 reserve process-kill protections for Tier C. An in-process Tier A/B "recycle" would kill every OPC UA session + every other in-proc driver for one leaky instance, blast-radius worse than the leak | 2026-04-19 |
| 146 | Memory watchdog uses the hybrid formula `soft = max(multiplier × baseline, baseline + floor)`, with baseline captured as the median of the first 5 min of `GetMemoryFootprint()` samples post-InitializeAsync. Tier-specific constants: A multiplier=3 floor=50 MB, B multiplier=3 floor=100 MB, C multiplier=2 floor=500 MB. Hard = 2 × soft | Codex adversarial review on the Phase 6.1 plan flagged that hardcoded per-tier MB bands diverge from decision #70's specified formula. Static bands false-trigger on small-footprint drivers + miss meaningful growth on large ones. Observed-baseline + hybrid formula recovers the original intent | 2026-04-19 |
| 147 | `WedgeDetector` uses demand-aware criteria `(state==Healthy AND hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` = (Polly bulkhead depth > 0) OR (active MonitoredItem count > 0) OR (queued historian read count > 0). Idle + subscription-only + write-only-burst drivers stay Healthy without false-fault | Previous "no successful Read in N intervals" formulation flipped legitimate idle subscribers, slow historian backfills, and write-heavy drivers to Faulted. The demand-aware check only fires when the driver claims work is outstanding | 2026-04-19 |
| 148 | LiteDB config cache is **generation-sealed**: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file; cache reads serve the last-known-sealed generation. Mixed-generation reads are impossible | Prior "refresh on every successful query" cache could serve LDAP role mapping from one generation alongside UNS topology from another, producing impossible states. Sealed-snapshot invariant keeps cache-served reads coherent with a real published state | 2026-04-19 |
| 149 | `AuthorizationDecision { Allow \| NotGranted \| Denied, IReadOnlyList<MatchedGrant> Provenance }` — tri-state internal model. Phase 6.2 only produces `Allow` + `NotGranted` (grant-only semantics per decision #129); v2.1 Deny widens without API break | bool return would collapse `no-matching-grant` and `explicit-deny` into the same runtime state + UI explanation; provenance record is needed for the audit log anyway. Making the shape tri-state from Phase 6.2 avoids a breaking change in v2.1 | 2026-04-19 |
| 150 | Data-plane ACL evaluator consumes `NodeAcl` rows joined against the session's resolved LDAP group memberships. `LdapGroupRoleMapping` (decision #105) is control-plane only — routes LDAP groups to Admin UI roles. Zero runtime overlap between the two | Codex adversarial review flagged that Phase 6.2 draft conflated the two — building the data-plane trie from `LdapGroupRoleMapping` would let a user inherit tag permissions from an admin-role claim path never intended as a data-path grant | 2026-04-19 |
| 151 | `UserAuthorizationState` cached per session but bounded by `MembershipFreshnessInterval` (default 15 min). Past that interval the next hot-path authz call re-resolves LDAP group memberships; failure to re-resolve (LDAP unreachable) → fail-closed (evaluator returns `NotGranted` until memberships refresh successfully) | Previous design cached memberships until session close, so a user removed from a privileged LDAP group could keep authorized access for hours. Bounded freshness + fail-closed covers the revoke-takes-effect story | 2026-04-19 |
| 152 | Auth cache has its own staleness budget `AuthCacheMaxStaleness` (default 5 min), independent of decision #36's availability-oriented config cache (24 h). Past 5 min on authorization data, evaluator fails closed regardless of whether the underlying config is still serving from cache | Availability-oriented caches trade correctness for uptime. Authorization data is correctness-sensitive — stale ACLs silently extend revoked access. Auth-specific budget keeps the two concerns from colliding | 2026-04-19 |
| 153 | MonitoredItem carries `(AuthGenerationId, MembershipVersion)` stamp at create time. On every Publish, items with a mismatching stamp re-evaluate; unchanged items stay fast-path. Revoked items drop to `BadUserAccessDenied` within one publish cycle | Create-time-only authorization leaves revoked users receiving data forever; per-publish re-authorization at 100 ms cadence across 50 groups × 6 levels is too expensive. Stamp-then-reevaluate-on-change balances correctness with cost | 2026-04-19 |
| 154 | ServiceLevel reserves `0` for operator-declared maintenance only; `1` = NoData (unreachable / Faulted); operational states occupy `2..255` in an 8-state matrix (Authoritative-Primary=255, Isolated-Primary=230, Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80, Backup-Mid-Apply=50, Recovering-Backup=30, InvalidTopology=2) | OPC UA Part 5 §6.3.34 defines `0=Maintenance` + `1=NoData`; using `0` for our Faulted case collides with spec + triggers spec-compliant clients to enter maintenance-mode cutover. Expanded 8-state matrix covers operational states the 5-state original collapsed together (e.g. Isolated-Primary vs Primary-Mid-Apply were both 200) | 2026-04-19 |
| 155 | `ServerUriArray` includes self + peers (self first, deterministic ordering), per OPC UA Part 4 §6.6.2.2 | Previous design excluded self from the array — spec violation + clients lose the ability to map server identities consistently during failover | 2026-04-19 |
| 156 | Redundancy peer health uses a two-layer probe: `/healthz` (2 s) as fast-fail + `UaHealthProbe` (10 s, opens OPC UA client session to peer + reads its `ServiceLevel` node) as the authority signal. HTTP-healthy ≠ UA-authoritative | `/healthz` returns 200 whenever HTTP + config DB/cache is healthy — but a peer can be HTTP-healthy with a broken OPC UA endpoint or a stuck subscription publisher. Using HTTP alone would advertise authority against servers that can't actually publish data | 2026-04-19 |
| 157 | Publish-generation fencing — coordinator CAS on a monotonic `ConfigGenerationId`; every topology + role decision is generation-stamped; peers reject state propagated from a lower generation. Runtime `InvalidTopology` state (both self-demote to ServiceLevel 2) when >1 Primary detected post-startup | Operator race publishing two drafts with different roles can produce two locally-valid views; without fencing + runtime containment both nodes can serve as Primary until manual intervention | 2026-04-19 |
| 158 | Apply-window uses named leases keyed to `(ConfigGenerationId, PublishRequestId)` via `await using`. `ApplyLeaseWatchdog` auto-closes any lease older than `ApplyMaxDuration` (default 10 min) | Simple `IDisposable`-counter design leaks on cancellation / async-ownership races; a stuck positive count leaves the node permanently mid-apply. Generation-keyed leases + watchdog bound worst case | 2026-04-19 |
| 159 | CSV import header row must start with `# OtOpcUaCsv v1` (version marker). Future shape changes bump the version; parser forks per version. Canonical identifier columns follow decision #117: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid` | Without a version marker the CSV schema has no upgrade path — adding a required column breaks every old export silently. The version prefix makes parser dispatch explicit + future-compatible | 2026-04-19 |
| 160 | Equipment CSV import uses a staged-import pattern: `EquipmentImportBatch` + `EquipmentImportRow` tables receive chunked inserts; `FinaliseImportBatch` is one atomic transaction that applies accepted rows to `Equipment` + `ExternalIdReservation`. Rollback = drop the batch row; `Equipment` never partially mutates | 10k-row single-transaction import holds locks too long; chunked direct writes lose all-or-nothing rollback. Staging + atomic finalize bounds transaction duration + preserves rollback semantics | 2026-04-19 |
| 161 | UNS drag-reorder impact preview carries a `DraftRevisionToken`; Confirm re-checks against the current draft + returns `409 Conflict / refresh-required` if the draft advanced between preview and commit | Without concurrency control, two operators editing the same draft can overwrite each other's changes silently. Draft-revision token + 409 response makes the race visible + forces refresh | 2026-04-19 |
| 162 | OPC 40010 Identification sub-folder exposed under each equipment node inherits the Equipment scope's ACL grants — the ACL trie does NOT add a new scope level for Identification | Adding a new scope level for Identification would require every grant to add a second grant for `Equipment/Identification`; inheriting the Equipment scope keeps the grant model flat + prevents operator-forgot-to-grant-Identification access surprises | 2026-04-19 |
## Reference Documents

View File

@@ -0,0 +1,79 @@
<#
.SYNOPSIS
Phase 6.1 exit-gate compliance check — stub. Each `Assert-*` either passes
(Write-Host green) or throws. Non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.1 (Resilience & Observability runtime) completion. Checks
enumerated in `docs/v2/implementation/phase-6-1-resilience-and-observability.md`
§"Compliance Checks (run at exit gate)".
Current status: SCAFFOLD. Every check writes a TODO line and does NOT throw.
Each implementation task in Phase 6.1 is responsible for replacing its TODO
with a real check before closing that task.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-1-compliance.ps1
Exit: 0 = all checks passed (or are still TODO); non-zero = explicit fail
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
function Assert-Todo {
param([string]$Check, [string]$ImplementationTask)
Write-Host " [TODO] $Check (implement during $ImplementationTask)" -ForegroundColor Yellow
}
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check$Reason" -ForegroundColor Red
$script:failures++
}
Write-Host ""
Write-Host "=== Phase 6.1 compliance — Resilience & Observability runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A — Resilience layer"
Assert-Todo "Invoker coverage — every capability-interface method routes through CapabilityInvoker (analyzer error-level)" "Stream A.3"
Assert-Todo "Write-retry guard — writes without [WriteIdempotent] never retry" "Stream A.5"
Assert-Todo "Pipeline isolation — `(DriverInstanceId, HostName)` key; one dead host does not open breaker for siblings" "Stream A.5"
Write-Host ""
Write-Host "Stream B — Tier A/B/C runtime"
Assert-Todo "Tier registry — every driver type has non-null Tier; Tier C declares out-of-process topology" "Stream B.1"
Assert-Todo "MemoryTracking never kills — soft/hard breach on Tier A/B logs + surfaces without terminating" "Stream B.6"
Assert-Todo "MemoryRecycle Tier C only — hard breach on Tier A never invokes supervisor; Tier C does" "Stream B.6"
Assert-Todo "Wedge demand-aware — idle/historic-backfill/write-only cases stay Healthy" "Stream B.6"
Assert-Todo "Galaxy supervisor preserved — Driver.Galaxy.Proxy/Supervisor/CircuitBreaker + Backoff still present + invoked" "Stream A.4"
Write-Host ""
Write-Host "Stream C — Health + logging"
Assert-Todo "Health state machine — /healthz + /readyz respond < 500 ms for every DriverState per matrix in plan" "Stream C.4"
Assert-Todo "Structured log — CI grep asserts DriverInstanceId + CorrelationId JSON fields present" "Stream C.4"
Write-Host ""
Write-Host "Stream D — LiteDB cache"
Assert-Todo "Generation-sealed snapshot — SQL kill mid-op serves last-sealed snapshot; UsingStaleConfig=true" "Stream D.4"
Assert-Todo "Mixed-generation guard — corruption of snapshot file fails closed; no mixed reads" "Stream D.4"
Assert-Todo "First-boot no-snapshot + DB-down — InitializeAsync fails with clear error" "Stream D.4"
Write-Host ""
Write-Host "Cross-cutting"
Assert-Todo "No test-count regression — dotnet test ZB.MOM.WW.OtOpcUa.slnx count ≥ pre-Phase-6.1 baseline" "Final exit-gate"
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.1 compliance: scaffold-mode PASS (all checks TODO)" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.1 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,81 @@
<#
.SYNOPSIS
Phase 6.2 exit-gate compliance check — stub. Each `Assert-*` either passes
(Write-Host green) or throws. Non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.2 (Authorization runtime) completion. Checks enumerated
in `docs/v2/implementation/phase-6-2-authorization-runtime.md`
§"Compliance Checks (run at exit gate)".
Current status: SCAFFOLD. Every check writes a TODO line and does NOT throw.
Each implementation task in Phase 6.2 is responsible for replacing its TODO
with a real check before closing that task.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-2-compliance.ps1
Exit: 0 = all checks passed (or are still TODO); non-zero = explicit fail
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
function Assert-Todo {
param([string]$Check, [string]$ImplementationTask)
Write-Host " [TODO] $Check (implement during $ImplementationTask)" -ForegroundColor Yellow
}
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check$Reason" -ForegroundColor Red
$script:failures++
}
Write-Host ""
Write-Host "=== Phase 6.2 compliance — Authorization runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A — LdapGroupRoleMapping (control plane)"
Assert-Todo "Control/data-plane separation — Core.Authorization has zero refs to LdapGroupRoleMapping" "Stream A.2"
Assert-Todo "Authoring validation — AclsTab rejects duplicate (LdapGroup, Scope) pre-save" "Stream A.3"
Write-Host ""
Write-Host "Stream B — Evaluator + trie + cache"
Assert-Todo "Trie invariants — PermissionTrieBuilder idempotent (build twice == equal)" "Stream B.1"
Assert-Todo "Additive grants + cluster isolation — cross-cluster leakage impossible" "Stream B.1"
Assert-Todo "Galaxy FolderSegment coverage — folder-subtree grant cascades; siblings unaffected" "Stream B.2"
Assert-Todo "Redundancy-safe invalidation — generation-mismatch forces trie re-load on peer" "Stream B.4"
Assert-Todo "Membership freshness — 15 min interval elapsed + LDAP down = fail-closed" "Stream B.5"
Assert-Todo "Auth cache fail-closed — 5 min AuthCacheMaxStaleness exceeded = NotGranted" "Stream B.5"
Assert-Todo "AuthorizationDecision shape — Allow + NotGranted only; Denied variant exists unused" "Stream B.6"
Write-Host ""
Write-Host "Stream C — OPC UA operation wiring"
Assert-Todo "Every operation wired — Browse/Read/Write/HistoryRead/HistoryUpdate/CreateMonitoredItems/TransferSubscriptions/Call/Ack/Confirm/Shelve" "Stream C.1-C.7"
Assert-Todo "HistoryRead uses its own flag — Read+no-HistoryRead denies HistoryRead" "Stream C.3"
Assert-Todo "Mixed-batch semantics — 3 allowed + 2 denied returns per-item status, no coarse failure" "Stream C.6"
Assert-Todo "Browse ancestor visibility — deep grant implies ancestor browse; denied ancestors filter" "Stream C.7"
Assert-Todo "Subscription re-authorization — revoked grant surfaces BadUserAccessDenied in one publish" "Stream C.5"
Write-Host ""
Write-Host "Stream D — Admin UI + SignalR invalidation"
Assert-Todo "SignalR invalidation — sp_PublishGeneration pushes PermissionTrieCache invalidate < 2 s" "Stream D.4"
Write-Host ""
Write-Host "Cross-cutting"
Assert-Todo "No test-count regression — dotnet test ZB.MOM.WW.OtOpcUa.slnx count ≥ pre-Phase-6.2 baseline" "Final exit-gate"
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.2 compliance: scaffold-mode PASS (all checks TODO)" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.2 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,85 @@
<#
.SYNOPSIS
Phase 6.3 exit-gate compliance check — stub. Each `Assert-*` either passes
(Write-Host green) or throws. Non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.3 (Redundancy runtime) completion. Checks enumerated in
`docs/v2/implementation/phase-6-3-redundancy-runtime.md`
§"Compliance Checks (run at exit gate)".
Current status: SCAFFOLD. Every check writes a TODO line and does NOT throw.
Each implementation task in Phase 6.3 is responsible for replacing its TODO
with a real check before closing that task.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-3-compliance.ps1
Exit: 0 = all checks passed (or are still TODO); non-zero = explicit fail
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
function Assert-Todo {
param([string]$Check, [string]$ImplementationTask)
Write-Host " [TODO] $Check (implement during $ImplementationTask)" -ForegroundColor Yellow
}
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check$Reason" -ForegroundColor Red
$script:failures++
}
Write-Host ""
Write-Host "=== Phase 6.3 compliance — Redundancy runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A — Topology loader"
Assert-Todo "Transparent-mode rejection — sp_PublishGeneration blocks RedundancyMode=Transparent" "Stream A.3"
Write-Host ""
Write-Host "Stream B — Peer probe + ServiceLevel calculator"
Assert-Todo "OPC UA band compliance — 0=Maintenance / 1=NoData reserved; operational 2..255" "Stream B.2"
Assert-Todo "Authoritative-Primary ServiceLevel = 255" "Stream B.2"
Assert-Todo "Isolated-Primary (peer unreachable, self serving) = 230" "Stream B.2"
Assert-Todo "Primary-Mid-Apply = 200" "Stream B.2"
Assert-Todo "Recovering-Primary = 180 with dwell + publish witness enforced" "Stream B.2"
Assert-Todo "Authoritative-Backup = 100" "Stream B.2"
Assert-Todo "Isolated-Backup (primary unreachable) = 80 — no auto-promote" "Stream B.2"
Assert-Todo "InvalidTopology = 2 — >1 Primary self-demotes both nodes" "Stream B.2"
Assert-Todo "UaHealthProbe authority — HTTP-200 + UA-down peer treated as UA-unhealthy" "Stream B.1"
Write-Host ""
Write-Host "Stream C — OPC UA node wiring"
Assert-Todo "ServerUriArray — returns self + peer URIs, self first" "Stream C.2"
Assert-Todo "Client.CLI cutover — primary halt triggers reconnect to backup via ServerUriArray" "Stream C.4"
Write-Host ""
Write-Host "Stream D — Apply-lease + publish fencing"
Assert-Todo "Apply-lease disposal — leases close on exception, cancellation, watchdog timeout" "Stream D.2"
Assert-Todo "Role transition via operator publish — no restart; both nodes flip ServiceLevel on publish confirm" "Stream D.3"
Write-Host ""
Write-Host "Stream F — Interop matrix"
Assert-Todo "Client interoperability matrix — Ignition 8.1/8.3 / Kepware / Aveva OI Gateway findings documented" "Stream F.1-F.2"
Assert-Todo "Galaxy MXAccess failover — primary kill; Galaxy consumer reconnects within session-timeout budget" "Stream F.3"
Write-Host ""
Write-Host "Cross-cutting"
Assert-Todo "No regression in driver test suites; /healthz reachable under redundancy load" "Final exit-gate"
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.3 compliance: scaffold-mode PASS (all checks TODO)" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.3 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,83 @@
<#
.SYNOPSIS
Phase 6.4 exit-gate compliance check — stub. Each `Assert-*` either passes
(Write-Host green) or throws. Non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.4 (Admin UI completion) completion. Checks enumerated in
`docs/v2/implementation/phase-6-4-admin-ui-completion.md`
§"Compliance Checks (run at exit gate)".
Current status: SCAFFOLD. Every check writes a TODO line and does NOT throw.
Each implementation task in Phase 6.4 is responsible for replacing its TODO
with a real check before closing that task.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-4-compliance.ps1
Exit: 0 = all checks passed (or are still TODO); non-zero = explicit fail
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
function Assert-Todo {
param([string]$Check, [string]$ImplementationTask)
Write-Host " [TODO] $Check (implement during $ImplementationTask)" -ForegroundColor Yellow
}
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check$Reason" -ForegroundColor Red
$script:failures++
}
Write-Host ""
Write-Host "=== Phase 6.4 compliance — Admin UI completion ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A — UNS drag/move + impact preview"
Assert-Todo "UNS drag/move — drag line across areas; modal shows correct impacted-equipment + tag counts" "Stream A.2"
Assert-Todo "Concurrent-edit safety — session B saves draft mid-preview; session A Confirm returns 409" "Stream A.3 (DraftRevisionToken)"
Assert-Todo "Cross-cluster drop disabled — actionable toast points to Export/Import" "Stream A.2"
Assert-Todo "1000-node tree — drag-enter feedback < 100 ms" "Stream A.4"
Write-Host ""
Write-Host "Stream B — CSV import + staged-import + 5-identifier search"
Assert-Todo "CSV header version — file missing '# OtOpcUaCsv v1' rejected pre-parse" "Stream B.1"
Assert-Todo "CSV canonical identifier set — columns match decision #117 exactly" "Stream B.1"
Assert-Todo "Staged-import atomicity — 10k-row FinaliseImportBatch < 30 s; user-scoped visibility; DropImportBatch rollback" "Stream B.3"
Assert-Todo "Concurrent import + external reservation — finalize retries with conflict handling; no corruption" "Stream B.3"
Assert-Todo "5-identifier search ranking — exact > prefix; published > draft for equal scores" "Stream B.4"
Write-Host ""
Write-Host "Stream C — DiffViewer sections"
Assert-Todo "Diff viewer section caps — 2000-row subtree-rename summary-only; 'Load full diff' paginates" "Stream C.2"
Write-Host ""
Write-Host "Stream D — Identification (OPC 40010)"
Assert-Todo "OPC 40010 field list match — rendered fields match decision #139 exactly; no extras" "Stream D.1"
Assert-Todo "OPC 40010 exposure — Identification sub-folder shows when non-null; absent when all null" "Stream D.3"
Assert-Todo "ACL inheritance for Identification — Equipment-grant reads; no-grant denies both" "Stream D.4"
Write-Host ""
Write-Host "Visual compliance"
Assert-Todo "Visual parity reviewer — FleetAdmin signoff vs admin-ui.md §Visual-Design; screenshot set checked in under docs/v2/visual-compliance/phase-6-4/" "Visual review"
Write-Host ""
Write-Host "Cross-cutting"
Assert-Todo "Full solution dotnet test passes; no test-count regression vs pre-Phase-6.4 baseline" "Final exit-gate"
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.4 compliance: scaffold-mode PASS (all checks TODO)" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.4 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,180 @@
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient;
/// <summary>
/// OPC UA Client (gateway) driver configuration. Bound from <c>DriverConfig</c> JSON at
/// driver-host registration time. Models the settings documented in
/// <c>docs/v2/driver-specs.md</c> §8.
/// </summary>
/// <remarks>
/// This driver connects to a REMOTE OPC UA server and re-exposes its address space
/// through the local OtOpcUa server — the opposite direction from the usual "server
/// exposes PLC data" flow. Tier A (pure managed, OPC Foundation reference SDK); universal
/// protections cover it.
/// </remarks>
public sealed class OpcUaClientDriverOptions
{
/// <summary>
/// Remote OPC UA endpoint URL, e.g. <c>opc.tcp://plc.internal:4840</c>. Convenience
/// shortcut for a single-endpoint deployment — equivalent to setting
/// <see cref="EndpointUrls"/> to a list with this one URL. When both are provided,
/// the list wins and <see cref="EndpointUrl"/> is ignored.
/// </summary>
public string EndpointUrl { get; init; } = "opc.tcp://localhost:4840";
/// <summary>
/// Ordered list of candidate endpoint URLs for failover. The driver tries each in
/// order at <see cref="OpcUaClientDriver.InitializeAsync"/> and on session drop;
/// the first URL that successfully connects wins. Typical use-case: an OPC UA server
/// pair running in hot-standby (primary 4840 + backup 4841) where either can serve
/// the same address space. Leave unset (or empty) to use <see cref="EndpointUrl"/>
/// as a single-URL shortcut.
/// </summary>
public IReadOnlyList<string> EndpointUrls { get; init; } = [];
/// <summary>
/// Per-endpoint connect-attempt timeout during the failover sweep. Short enough that
/// cycling through several dead servers doesn't blow the overall init budget, long
/// enough to tolerate a slow TLS handshake on a healthy server. Applied independently
/// of <see cref="Timeout"/> which governs steady-state operations.
/// </summary>
public TimeSpan PerEndpointConnectTimeout { get; init; } = TimeSpan.FromSeconds(3);
/// <summary>
/// Security policy to require when selecting an endpoint. Either a
/// <see cref="OpcUaSecurityPolicy"/> enum constant or a free-form string (for
/// forward-compatibility with future OPC UA policies not yet in the enum).
/// Matched against <c>EndpointDescription.SecurityPolicyUri</c> suffix — the driver
/// connects to the first endpoint whose policy name matches AND whose mode matches
/// <see cref="SecurityMode"/>. When set to <see cref="OpcUaSecurityPolicy.None"/>
/// the driver picks any unsecured endpoint regardless of policy string.
/// </summary>
public OpcUaSecurityPolicy SecurityPolicy { get; init; } = OpcUaSecurityPolicy.None;
/// <summary>Security mode.</summary>
public OpcUaSecurityMode SecurityMode { get; init; } = OpcUaSecurityMode.None;
/// <summary>Authentication type.</summary>
public OpcUaAuthType AuthType { get; init; } = OpcUaAuthType.Anonymous;
/// <summary>User name (required only for <see cref="OpcUaAuthType.Username"/>).</summary>
public string? Username { get; init; }
/// <summary>Password (required only for <see cref="OpcUaAuthType.Username"/>).</summary>
public string? Password { get; init; }
/// <summary>
/// Filesystem path to the user-identity certificate (PFX/PEM). Required when
/// <see cref="AuthType"/> is <see cref="OpcUaAuthType.Certificate"/>. The driver
/// loads the cert + private key, which the remote server validates against its
/// <c>TrustedUserCertificates</c> store to authenticate the session's user token.
/// Leave unset to use the driver's application-instance certificate as the user
/// token (not typical — most deployments have a separate user cert).
/// </summary>
public string? UserCertificatePath { get; init; }
/// <summary>
/// Optional password that unlocks <see cref="UserCertificatePath"/> when the PFX is
/// protected. PEM files generally have their password on the adjacent key file; this
/// knob only applies to password-locked PFX.
/// </summary>
public string? UserCertificatePassword { get; init; }
/// <summary>Server-negotiated session timeout. Default 120s per driver-specs.md §8.</summary>
public TimeSpan SessionTimeout { get; init; } = TimeSpan.FromSeconds(120);
/// <summary>Client-side keep-alive interval.</summary>
public TimeSpan KeepAliveInterval { get; init; } = TimeSpan.FromSeconds(5);
/// <summary>Initial reconnect delay after a session drop.</summary>
public TimeSpan ReconnectPeriod { get; init; } = TimeSpan.FromSeconds(5);
/// <summary>
/// When <c>true</c>, the driver accepts any self-signed / untrusted server certificate.
/// Dev-only — must be <c>false</c> in production so MITM attacks against the opc.tcp
/// channel fail closed.
/// </summary>
public bool AutoAcceptCertificates { get; init; } = false;
/// <summary>
/// Application URI the driver reports during session creation. Must match the
/// subject-alt-name on the client certificate if one is used, which is why it's a
/// config knob rather than hard-coded.
/// </summary>
public string ApplicationUri { get; init; } = "urn:localhost:OtOpcUa:GatewayClient";
/// <summary>
/// Friendly name sent to the remote server for diagnostics. Shows up in the remote
/// server's session-list so operators can identify which gateway instance is calling.
/// </summary>
public string SessionName { get; init; } = "OtOpcUa-Gateway";
/// <summary>Connect + per-operation timeout.</summary>
public TimeSpan Timeout { get; init; } = TimeSpan.FromSeconds(10);
/// <summary>
/// Root NodeId to mirror. Default <c>null</c> = <c>ObjectsFolder</c> (i=85). Set to
/// a scoped root to restrict the address space the driver exposes locally — useful
/// when the remote server has tens of thousands of nodes and only a subset is
/// needed downstream.
/// </summary>
public string? BrowseRoot { get; init; }
/// <summary>
/// Cap on total nodes discovered during <c>DiscoverAsync</c>. Default 10_000 —
/// bounds memory on runaway remote servers without being so low that normal
/// deployments hit it. When the cap is reached discovery stops and a warning is
/// written to the driver health surface; the partially-discovered tree is still
/// projected into the local address space.
/// </summary>
public int MaxDiscoveredNodes { get; init; } = 10_000;
/// <summary>
/// Max hierarchical depth of the browse. Default 10 — deep enough for realistic
/// OPC UA information models, shallow enough that cyclic graphs can't spin the
/// browse forever.
/// </summary>
public int MaxBrowseDepth { get; init; } = 10;
}
/// <summary>OPC UA message security mode.</summary>
public enum OpcUaSecurityMode
{
None,
Sign,
SignAndEncrypt,
}
/// <summary>
/// OPC UA security policies recognized by the driver. Maps to the standard
/// <c>http://opcfoundation.org/UA/SecurityPolicy#</c> URI suffixes the SDK uses for
/// endpoint matching.
/// </summary>
/// <remarks>
/// <see cref="Basic128Rsa15"/> and <see cref="Basic256"/> are <b>deprecated</b> per OPC UA
/// spec v1.04 — they remain in the enum only for brownfield interop with older servers.
/// Prefer <see cref="Basic256Sha256"/>, <see cref="Aes128_Sha256_RsaOaep"/>, or
/// <see cref="Aes256_Sha256_RsaPss"/> for new deployments.
/// </remarks>
public enum OpcUaSecurityPolicy
{
/// <summary>No security. Unsigned, unencrypted wire.</summary>
None,
/// <summary>Deprecated (OPC UA 1.04). Retained for legacy server interop.</summary>
Basic128Rsa15,
/// <summary>Deprecated (OPC UA 1.04). Retained for legacy server interop.</summary>
Basic256,
/// <summary>Recommended baseline for current deployments.</summary>
Basic256Sha256,
/// <summary>Current OPC UA policy; AES-128 + SHA-256 + RSA-OAEP.</summary>
Aes128_Sha256_RsaOaep,
/// <summary>Current OPC UA policy; AES-256 + SHA-256 + RSA-PSS.</summary>
Aes256_Sha256_RsaPss,
}
/// <summary>User authentication type sent to the remote server.</summary>
public enum OpcUaAuthType
{
Anonymous,
Username,
Certificate,
}

View File

@@ -0,0 +1,28 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<TargetFramework>net10.0</TargetFramework>
<Nullable>enable</Nullable>
<ImplicitUsings>enable</ImplicitUsings>
<LangVersion>latest</LangVersion>
<TreatWarningsAsErrors>true</TreatWarningsAsErrors>
<GenerateDocumentationFile>true</GenerateDocumentationFile>
<NoWarn>$(NoWarn);CS1591</NoWarn>
<RootNamespace>ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient</RootNamespace>
<AssemblyName>ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient</AssemblyName>
</PropertyGroup>
<ItemGroup>
<ProjectReference Include="..\ZB.MOM.WW.OtOpcUa.Core.Abstractions\ZB.MOM.WW.OtOpcUa.Core.Abstractions.csproj"/>
</ItemGroup>
<ItemGroup>
<PackageReference Include="OPCFoundation.NetStandard.Opc.Ua.Client" Version="1.5.378.106"/>
<PackageReference Include="OPCFoundation.NetStandard.Opc.Ua.Configuration" Version="1.5.378.106"/>
</ItemGroup>
<ItemGroup>
<InternalsVisibleTo Include="ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests"/>
</ItemGroup>
</Project>

View File

@@ -26,8 +26,20 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.S7;
/// </para>
/// </remarks>
public sealed class S7Driver(S7DriverOptions options, string driverInstanceId)
: IDriver, IReadable, IWritable, IDisposable, IAsyncDisposable
: IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IDisposable, IAsyncDisposable
{
// ---- ISubscribable + IHostConnectivityProbe state ----
private readonly System.Collections.Concurrent.ConcurrentDictionary<long, SubscriptionState> _subscriptions = new();
private long _nextSubscriptionId;
private readonly object _probeLock = new();
private HostState _hostState = HostState.Unknown;
private DateTime _hostStateChangedUtc = DateTime.UtcNow;
private CancellationTokenSource? _probeCts;
public event EventHandler<DataChangeEventArgs>? OnDataChange;
public event EventHandler<HostStatusChangedEventArgs>? OnHostStatusChanged;
/// <summary>OPC UA StatusCode used when the tag name isn't in the driver's tag map.</summary>
private const uint StatusBadNodeIdUnknown = 0x80340000u;
/// <summary>OPC UA StatusCode used when the tag's data type isn't implemented yet.</summary>
@@ -98,6 +110,15 @@ public sealed class S7Driver(S7DriverOptions options, string driverInstanceId)
}
_health = new DriverHealth(DriverState.Healthy, DateTime.UtcNow, null);
// Kick off the probe loop once the connection is up. Initial HostState stays
// Unknown until the first probe tick succeeds — avoids broadcasting a premature
// Running transition before any PDU round-trip has happened.
if (_options.Probe.Enabled)
{
_probeCts = new CancellationTokenSource();
_ = Task.Run(() => ProbeLoopAsync(_probeCts.Token), _probeCts.Token);
}
}
catch (Exception ex)
{
@@ -118,6 +139,17 @@ public sealed class S7Driver(S7DriverOptions options, string driverInstanceId)
public Task ShutdownAsync(CancellationToken cancellationToken)
{
try { _probeCts?.Cancel(); } catch { }
_probeCts?.Dispose();
_probeCts = null;
foreach (var state in _subscriptions.Values)
{
try { state.Cts.Cancel(); } catch { }
state.Cts.Dispose();
}
_subscriptions.Clear();
try { Plc?.Close(); } catch { /* best-effort — tearing down anyway */ }
Plc = null;
_health = new DriverHealth(DriverState.Unknown, _health.LastSuccessfulRead, null);
@@ -294,6 +326,180 @@ public sealed class S7Driver(S7DriverOptions options, string driverInstanceId)
private global::S7.Net.Plc RequirePlc() =>
Plc ?? throw new InvalidOperationException("S7Driver not initialized");
// ---- ITagDiscovery ----
public Task DiscoverAsync(IAddressSpaceBuilder builder, CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(builder);
var folder = builder.Folder("S7", "S7");
foreach (var t in _options.Tags)
{
folder.Variable(t.Name, t.Name, new DriverAttributeInfo(
FullName: t.Name,
DriverDataType: MapDataType(t.DataType),
IsArray: false,
ArrayDim: null,
SecurityClass: t.Writable ? SecurityClassification.Operate : SecurityClassification.ViewOnly,
IsHistorized: false,
IsAlarm: false));
}
return Task.CompletedTask;
}
private static DriverDataType MapDataType(S7DataType t) => t switch
{
S7DataType.Bool => DriverDataType.Boolean,
S7DataType.Byte => DriverDataType.Int32, // no 8-bit in DriverDataType yet
S7DataType.Int16 or S7DataType.UInt16 or S7DataType.Int32 or S7DataType.UInt32 => DriverDataType.Int32,
S7DataType.Int64 or S7DataType.UInt64 => DriverDataType.Int32, // widens; lossy for >2^31-1
S7DataType.Float32 => DriverDataType.Float32,
S7DataType.Float64 => DriverDataType.Float64,
S7DataType.String => DriverDataType.String,
S7DataType.DateTime => DriverDataType.DateTime,
_ => DriverDataType.Int32,
};
// ---- ISubscribable (polling overlay) ----
public Task<ISubscriptionHandle> SubscribeAsync(
IReadOnlyList<string> fullReferences, TimeSpan publishingInterval, CancellationToken cancellationToken)
{
var id = Interlocked.Increment(ref _nextSubscriptionId);
var cts = new CancellationTokenSource();
// Floor at 100 ms — S7 CPUs scan 2-10 ms but the comms mailbox is processed at most
// once per scan; sub-100 ms polling just queues wire-side with worse latency.
var interval = publishingInterval < TimeSpan.FromMilliseconds(100)
? TimeSpan.FromMilliseconds(100)
: publishingInterval;
var handle = new S7SubscriptionHandle(id);
var state = new SubscriptionState(handle, [.. fullReferences], interval, cts);
_subscriptions[id] = state;
_ = Task.Run(() => PollLoopAsync(state, cts.Token), cts.Token);
return Task.FromResult<ISubscriptionHandle>(handle);
}
public Task UnsubscribeAsync(ISubscriptionHandle handle, CancellationToken cancellationToken)
{
if (handle is S7SubscriptionHandle h && _subscriptions.TryRemove(h.Id, out var state))
{
state.Cts.Cancel();
state.Cts.Dispose();
}
return Task.CompletedTask;
}
private async Task PollLoopAsync(SubscriptionState state, CancellationToken ct)
{
// Initial-data push per OPC UA Part 4 convention.
try { await PollOnceAsync(state, forceRaise: true, ct).ConfigureAwait(false); }
catch (OperationCanceledException) { return; }
catch { /* first-read error — polling continues */ }
while (!ct.IsCancellationRequested)
{
try { await Task.Delay(state.Interval, ct).ConfigureAwait(false); }
catch (OperationCanceledException) { return; }
try { await PollOnceAsync(state, forceRaise: false, ct).ConfigureAwait(false); }
catch (OperationCanceledException) { return; }
catch { /* transient polling error — loop continues, health surface reflects it */ }
}
}
private async Task PollOnceAsync(SubscriptionState state, bool forceRaise, CancellationToken ct)
{
var snapshots = await ReadAsync(state.TagReferences, ct).ConfigureAwait(false);
for (var i = 0; i < state.TagReferences.Count; i++)
{
var tagRef = state.TagReferences[i];
var current = snapshots[i];
var lastSeen = state.LastValues.TryGetValue(tagRef, out var prev) ? prev : default;
if (forceRaise || !Equals(lastSeen?.Value, current.Value) || lastSeen?.StatusCode != current.StatusCode)
{
state.LastValues[tagRef] = current;
OnDataChange?.Invoke(this, new DataChangeEventArgs(state.Handle, tagRef, current));
}
}
}
private sealed record SubscriptionState(
S7SubscriptionHandle Handle,
IReadOnlyList<string> TagReferences,
TimeSpan Interval,
CancellationTokenSource Cts)
{
public System.Collections.Concurrent.ConcurrentDictionary<string, DataValueSnapshot> LastValues { get; }
= new(StringComparer.OrdinalIgnoreCase);
}
private sealed record S7SubscriptionHandle(long Id) : ISubscriptionHandle
{
public string DiagnosticId => $"s7-sub-{Id}";
}
// ---- IHostConnectivityProbe ----
/// <summary>
/// Host identifier surfaced in <see cref="GetHostStatuses"/>. <c>host:port</c> format
/// matches the Modbus driver's convention so the Admin UI dashboard renders both
/// family's rows uniformly.
/// </summary>
public string HostName => $"{_options.Host}:{_options.Port}";
public IReadOnlyList<HostConnectivityStatus> GetHostStatuses()
{
lock (_probeLock)
return [new HostConnectivityStatus(HostName, _hostState, _hostStateChangedUtc)];
}
private async Task ProbeLoopAsync(CancellationToken ct)
{
while (!ct.IsCancellationRequested)
{
var success = false;
try
{
// Probe via S7.Net's low-cost GetCpuStatus — returns the CPU state (Run/Stop)
// and is intentionally light on the comms mailbox. Single-word Plc.ReadAsync
// would also work but GetCpuStatus doubles as a "PLC actually up" check.
using var probeCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
probeCts.CancelAfter(_options.Probe.Timeout);
var plc = Plc;
if (plc is null) throw new InvalidOperationException("Plc dropped during probe");
await _gate.WaitAsync(probeCts.Token).ConfigureAwait(false);
try
{
_ = await plc.ReadStatusAsync(probeCts.Token).ConfigureAwait(false);
success = true;
}
finally { _gate.Release(); }
}
catch (OperationCanceledException) when (ct.IsCancellationRequested) { return; }
catch { /* transport/timeout/exception — treated as Stopped below */ }
TransitionTo(success ? HostState.Running : HostState.Stopped);
try { await Task.Delay(_options.Probe.Interval, ct).ConfigureAwait(false); }
catch (OperationCanceledException) { return; }
}
}
private void TransitionTo(HostState newState)
{
HostState old;
lock (_probeLock)
{
old = _hostState;
if (old == newState) return;
_hostState = newState;
_hostStateChangedUtc = DateTime.UtcNow;
}
OnHostStatusChanged?.Invoke(this, new HostStatusChangedEventArgs(HostName, old, newState));
}
public void Dispose() => DisposeAsync().AsTask().GetAwaiter().GetResult();
public async ValueTask DisposeAsync()

View File

@@ -0,0 +1,70 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
[Trait("Category", "Unit")]
public sealed class OpcUaClientAlarmTests
{
[Theory]
[InlineData((ushort)1, AlarmSeverity.Low)]
[InlineData((ushort)200, AlarmSeverity.Low)]
[InlineData((ushort)201, AlarmSeverity.Medium)]
[InlineData((ushort)500, AlarmSeverity.Medium)]
[InlineData((ushort)501, AlarmSeverity.High)]
[InlineData((ushort)800, AlarmSeverity.High)]
[InlineData((ushort)801, AlarmSeverity.Critical)]
[InlineData((ushort)1000, AlarmSeverity.Critical)]
public void MapSeverity_buckets_per_OPC_UA_Part_9_guidance(ushort opcSev, AlarmSeverity expected)
{
OpcUaClientDriver.MapSeverity(opcSev).ShouldBe(expected);
}
[Fact]
public void MapSeverity_zero_maps_to_Low()
{
// 0 isn't in OPC UA's 1-1000 range but we handle it gracefully as Low.
OpcUaClientDriver.MapSeverity(0).ShouldBe(AlarmSeverity.Low);
}
[Fact]
public async Task SubscribeAlarmsAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-alarm-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.SubscribeAlarmsAsync([], TestContext.Current.CancellationToken));
}
[Fact]
public async Task UnsubscribeAlarmsAsync_with_unknown_handle_is_noop()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-alarm-unknown");
// Parallels the subscribe handle path — session-drop races shouldn't crash the caller.
await drv.UnsubscribeAlarmsAsync(new FakeAlarmHandle(), TestContext.Current.CancellationToken);
}
[Fact]
public async Task AcknowledgeAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-ack-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.AcknowledgeAsync(
[new AlarmAcknowledgeRequest("ns=2;s=Src", "ns=2;s=Cond", "operator ack")],
TestContext.Current.CancellationToken));
}
[Fact]
public async Task AcknowledgeAsync_with_empty_batch_is_noop_even_without_init()
{
// Empty batch short-circuits before touching the session, so it's safe pre-init. This
// keeps batch-ack callers from needing to guard the list size themselves.
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-ack-empty");
await drv.AcknowledgeAsync([], TestContext.Current.CancellationToken);
}
private sealed class FakeAlarmHandle : IAlarmSubscriptionHandle
{
public string DiagnosticId => "fake-alarm";
}
}

View File

@@ -0,0 +1,62 @@
using Opc.Ua;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
[Trait("Category", "Unit")]
public sealed class OpcUaClientAttributeMappingTests
{
[Theory]
[InlineData((uint)DataTypes.Boolean, DriverDataType.Boolean)]
[InlineData((uint)DataTypes.Int16, DriverDataType.Int16)]
[InlineData((uint)DataTypes.UInt16, DriverDataType.UInt16)]
[InlineData((uint)DataTypes.Int32, DriverDataType.Int32)]
[InlineData((uint)DataTypes.UInt32, DriverDataType.UInt32)]
[InlineData((uint)DataTypes.Int64, DriverDataType.Int64)]
[InlineData((uint)DataTypes.UInt64, DriverDataType.UInt64)]
[InlineData((uint)DataTypes.Float, DriverDataType.Float32)]
[InlineData((uint)DataTypes.Double, DriverDataType.Float64)]
[InlineData((uint)DataTypes.String, DriverDataType.String)]
[InlineData((uint)DataTypes.DateTime, DriverDataType.DateTime)]
public void MapUpstreamDataType_recognizes_standard_builtin_types(uint typeId, DriverDataType expected)
{
var nodeId = new NodeId(typeId);
OpcUaClientDriver.MapUpstreamDataType(nodeId).ShouldBe(expected);
}
[Fact]
public void MapUpstreamDataType_maps_SByte_and_Byte_to_Int16_since_DriverDataType_lacks_8bit()
{
// DriverDataType has no 8-bit type; conservative widen to Int16. Documented so a
// future Core.Abstractions PR that adds Int8/Byte can find this call site.
OpcUaClientDriver.MapUpstreamDataType(new NodeId((uint)DataTypes.SByte)).ShouldBe(DriverDataType.Int16);
OpcUaClientDriver.MapUpstreamDataType(new NodeId((uint)DataTypes.Byte)).ShouldBe(DriverDataType.Int16);
}
[Fact]
public void MapUpstreamDataType_falls_back_to_String_for_unknown_custom_types()
{
// Custom vendor extension object — NodeId in namespace 2 that isn't a standard type.
OpcUaClientDriver.MapUpstreamDataType(new NodeId("CustomStruct", 2)).ShouldBe(DriverDataType.String);
}
[Fact]
public void MapUpstreamDataType_handles_UtcTime_as_DateTime()
{
OpcUaClientDriver.MapUpstreamDataType(new NodeId((uint)DataTypes.UtcTime)).ShouldBe(DriverDataType.DateTime);
}
[Theory]
[InlineData((byte)0, SecurityClassification.ViewOnly)] // no access flags set
[InlineData((byte)1, SecurityClassification.ViewOnly)] // CurrentRead only
[InlineData((byte)2, SecurityClassification.Operate)] // CurrentWrite only
[InlineData((byte)3, SecurityClassification.Operate)] // CurrentRead + CurrentWrite
[InlineData((byte)0x0F, SecurityClassification.Operate)] // read+write+historyRead+historyWrite
[InlineData((byte)0x04, SecurityClassification.ViewOnly)] // HistoryRead only — no Write bit
public void MapAccessLevelToSecurityClass_respects_CurrentWrite_bit(byte accessLevel, SecurityClassification expected)
{
OpcUaClientDriver.MapAccessLevelToSecurityClass(accessLevel).ShouldBe(expected);
}
}

View File

@@ -0,0 +1,59 @@
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using Shouldly;
using Xunit;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
[Trait("Category", "Unit")]
public sealed class OpcUaClientCertAuthTests
{
[Fact]
public void BuildCertificateIdentity_rejects_missing_path()
{
var opts = new OpcUaClientDriverOptions { AuthType = OpcUaAuthType.Certificate };
Should.Throw<InvalidOperationException>(() => OpcUaClientDriver.BuildCertificateIdentity(opts))
.Message.ShouldContain("UserCertificatePath");
}
[Fact]
public void BuildCertificateIdentity_rejects_nonexistent_file()
{
var opts = new OpcUaClientDriverOptions
{
AuthType = OpcUaAuthType.Certificate,
UserCertificatePath = Path.Combine(Path.GetTempPath(), $"does-not-exist-{Guid.NewGuid():N}.pfx"),
};
Should.Throw<FileNotFoundException>(() => OpcUaClientDriver.BuildCertificateIdentity(opts));
}
[Fact]
public void BuildCertificateIdentity_loads_a_valid_PFX_with_private_key()
{
// Generate a self-signed cert on the fly so the test doesn't ship a static PFX.
// The driver doesn't care about the issuer — just needs a cert with a private key.
using var rsa = RSA.Create(2048);
var req = new CertificateRequest("CN=OpcUaClientCertAuthTests", rsa,
HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
var cert = req.CreateSelfSigned(DateTimeOffset.UtcNow.AddMinutes(-5), DateTimeOffset.UtcNow.AddHours(1));
var tmpPath = Path.Combine(Path.GetTempPath(), $"opcua-cert-test-{Guid.NewGuid():N}.pfx");
File.WriteAllBytes(tmpPath, cert.Export(X509ContentType.Pfx, "testpw"));
try
{
var opts = new OpcUaClientDriverOptions
{
AuthType = OpcUaAuthType.Certificate,
UserCertificatePath = tmpPath,
UserCertificatePassword = "testpw",
};
var identity = OpcUaClientDriver.BuildCertificateIdentity(opts);
identity.ShouldNotBeNull();
identity.TokenType.ShouldBe(Opc.Ua.UserTokenType.Certificate);
}
finally
{
try { File.Delete(tmpPath); } catch { /* best-effort */ }
}
}
}

View File

@@ -0,0 +1,55 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
/// <summary>
/// Scaffold tests for <see cref="OpcUaClientDriver"/>'s <see cref="ITagDiscovery"/>
/// surface that don't require a live remote server. Live-browse coverage lands in a
/// follow-up PR once the in-process OPC UA server fixture is scaffolded.
/// </summary>
[Trait("Category", "Unit")]
public sealed class OpcUaClientDiscoveryTests
{
[Fact]
public async Task DiscoverAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-disco");
var builder = new NullAddressSpaceBuilder();
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.DiscoverAsync(builder, TestContext.Current.CancellationToken));
}
[Fact]
public void DiscoverAsync_rejects_null_builder()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-disco");
Should.ThrowAsync<ArgumentNullException>(async () =>
await drv.DiscoverAsync(null!, TestContext.Current.CancellationToken));
}
[Fact]
public void Discovery_caps_are_sensible_defaults()
{
var opts = new OpcUaClientDriverOptions();
opts.MaxDiscoveredNodes.ShouldBe(10_000, "bounds memory on runaway servers without clipping normal models");
opts.MaxBrowseDepth.ShouldBe(10, "deep enough for realistic info models; shallow enough for cycle safety");
opts.BrowseRoot.ShouldBeNull("null = default to ObjectsFolder i=85");
}
private sealed class NullAddressSpaceBuilder : IAddressSpaceBuilder
{
public IAddressSpaceBuilder Folder(string browseName, string displayName) => this;
public IVariableHandle Variable(string browseName, string displayName, DriverAttributeInfo attributeInfo)
=> new StubHandle();
public void AddProperty(string browseName, DriverDataType dataType, object? value) { }
public void AttachAlarmCondition(IVariableHandle sourceVariable, string alarmName, DriverAttributeInfo alarmInfo) { }
private sealed class StubHandle : IVariableHandle
{
public string FullReference => "stub";
public IAlarmConditionSink MarkAsAlarmCondition(AlarmConditionInfo info) => throw new NotSupportedException();
}
}
}

View File

@@ -0,0 +1,91 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
/// <summary>
/// Scaffold-level tests for <see cref="OpcUaClientDriver"/> that don't require a live
/// remote OPC UA server. PR 67+ adds IReadable/IWritable/ITagDiscovery/ISubscribable
/// tests against a local in-process OPC UA server fixture.
/// </summary>
[Trait("Category", "Unit")]
public sealed class OpcUaClientDriverScaffoldTests
{
[Fact]
public void Default_options_target_standard_opcua_port_and_anonymous_auth()
{
var opts = new OpcUaClientDriverOptions();
opts.EndpointUrl.ShouldBe("opc.tcp://localhost:4840", "4840 is the IANA-assigned OPC UA port");
opts.SecurityMode.ShouldBe(OpcUaSecurityMode.None);
opts.SecurityPolicy.ShouldBe(OpcUaSecurityPolicy.None);
opts.AuthType.ShouldBe(OpcUaAuthType.Anonymous);
opts.AutoAcceptCertificates.ShouldBeFalse("production default must reject untrusted server certs");
}
[Fact]
public void Default_timeouts_match_driver_specs_section_8()
{
var opts = new OpcUaClientDriverOptions();
opts.SessionTimeout.ShouldBe(TimeSpan.FromSeconds(120));
opts.KeepAliveInterval.ShouldBe(TimeSpan.FromSeconds(5));
opts.ReconnectPeriod.ShouldBe(TimeSpan.FromSeconds(5));
}
[Fact]
public void Driver_reports_type_and_id_before_connect()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-test");
drv.DriverType.ShouldBe("OpcUaClient");
drv.DriverInstanceId.ShouldBe("opcua-test");
drv.GetHealth().State.ShouldBe(DriverState.Unknown);
}
[Fact]
public async Task Initialize_against_unreachable_endpoint_transitions_to_Faulted_and_throws()
{
// RFC 5737 reserved-for-documentation IP; won't route anywhere. Pick opc.tcp:// so
// endpoint selection hits the transport-layer connection rather than a DNS lookup.
var opts = new OpcUaClientDriverOptions
{
// Port 1 on loopback is effectively guaranteed to be closed — the OS responds
// with TCP RST immediately instead of hanging on connect, which keeps the
// unreachable-host tests snappy. Don't use an RFC 5737 reserved IP; those get
// routed to a black-hole + time out only after the SDK's internal retry/backoff
// fully elapses (~60s even with Options.Timeout=500ms).
EndpointUrl = "opc.tcp://127.0.0.1:1",
Timeout = TimeSpan.FromMilliseconds(500),
AutoAcceptCertificates = true, // dev-mode to bypass cert validation in the test
};
using var drv = new OpcUaClientDriver(opts, "opcua-unreach");
await Should.ThrowAsync<Exception>(async () =>
await drv.InitializeAsync("{}", TestContext.Current.CancellationToken));
var health = drv.GetHealth();
health.State.ShouldBe(DriverState.Faulted);
health.LastError.ShouldNotBeNull();
}
[Fact]
public async Task Reinitialize_against_unreachable_endpoint_re_throws()
{
var opts = new OpcUaClientDriverOptions
{
// Port 1 on loopback is effectively guaranteed to be closed — the OS responds
// with TCP RST immediately instead of hanging on connect, which keeps the
// unreachable-host tests snappy. Don't use an RFC 5737 reserved IP; those get
// routed to a black-hole + time out only after the SDK's internal retry/backoff
// fully elapses (~60s even with Options.Timeout=500ms).
EndpointUrl = "opc.tcp://127.0.0.1:1",
Timeout = TimeSpan.FromMilliseconds(500),
AutoAcceptCertificates = true,
};
using var drv = new OpcUaClientDriver(opts, "opcua-reinit");
await Should.ThrowAsync<Exception>(async () =>
await drv.InitializeAsync("{}", TestContext.Current.CancellationToken));
await Should.ThrowAsync<Exception>(async () =>
await drv.ReinitializeAsync("{}", TestContext.Current.CancellationToken));
}
}

View File

@@ -0,0 +1,81 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
[Trait("Category", "Unit")]
public sealed class OpcUaClientFailoverTests
{
[Fact]
public void ResolveEndpointCandidates_prefers_EndpointUrls_when_provided()
{
var opts = new OpcUaClientDriverOptions
{
EndpointUrl = "opc.tcp://fallback:4840",
EndpointUrls = ["opc.tcp://primary:4840", "opc.tcp://backup:4841"],
};
var list = OpcUaClientDriver.ResolveEndpointCandidates(opts);
list.Count.ShouldBe(2);
list[0].ShouldBe("opc.tcp://primary:4840");
list[1].ShouldBe("opc.tcp://backup:4841");
}
[Fact]
public void ResolveEndpointCandidates_falls_back_to_single_EndpointUrl_when_list_empty()
{
var opts = new OpcUaClientDriverOptions { EndpointUrl = "opc.tcp://only:4840" };
var list = OpcUaClientDriver.ResolveEndpointCandidates(opts);
list.Count.ShouldBe(1);
list[0].ShouldBe("opc.tcp://only:4840");
}
[Fact]
public void ResolveEndpointCandidates_empty_list_treated_as_fallback_to_EndpointUrl()
{
// Explicit empty list should still fall back to the single-URL shortcut rather than
// producing a zero-candidate sweep that would immediately throw with no URLs tried.
var opts = new OpcUaClientDriverOptions
{
EndpointUrl = "opc.tcp://single:4840",
EndpointUrls = [],
};
OpcUaClientDriver.ResolveEndpointCandidates(opts).Count.ShouldBe(1);
}
[Fact]
public void HostName_uses_first_candidate_before_connect()
{
var opts = new OpcUaClientDriverOptions
{
EndpointUrls = ["opc.tcp://primary:4840", "opc.tcp://backup:4841"],
};
using var drv = new OpcUaClientDriver(opts, "opcua-host");
drv.HostName.ShouldBe("opc.tcp://primary:4840",
"pre-connect the dashboard should show the first candidate URL so operators can link back");
}
[Fact]
public async Task Initialize_against_all_unreachable_endpoints_throws_AggregateException_listing_each()
{
// Port 1 + port 2 + port 3 on loopback are all guaranteed closed (TCP RST immediate).
// Failover sweep should attempt all three and throw AggregateException naming each URL
// so operators see exactly which candidates were tried.
var opts = new OpcUaClientDriverOptions
{
EndpointUrls = ["opc.tcp://127.0.0.1:1", "opc.tcp://127.0.0.1:2", "opc.tcp://127.0.0.1:3"],
PerEndpointConnectTimeout = TimeSpan.FromMilliseconds(500),
Timeout = TimeSpan.FromMilliseconds(500),
AutoAcceptCertificates = true,
};
using var drv = new OpcUaClientDriver(opts, "opcua-failover");
var ex = await Should.ThrowAsync<AggregateException>(async () =>
await drv.InitializeAsync("{}", TestContext.Current.CancellationToken));
ex.Message.ShouldContain("127.0.0.1:1");
ex.Message.ShouldContain("127.0.0.1:2");
ex.Message.ShouldContain("127.0.0.1:3");
drv.GetHealth().State.ShouldBe(DriverState.Faulted);
}
}

View File

@@ -0,0 +1,91 @@
using Opc.Ua;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
[Trait("Category", "Unit")]
public sealed class OpcUaClientHistoryTests
{
[Theory]
[InlineData(HistoryAggregateType.Average)]
[InlineData(HistoryAggregateType.Minimum)]
[InlineData(HistoryAggregateType.Maximum)]
[InlineData(HistoryAggregateType.Total)]
[InlineData(HistoryAggregateType.Count)]
public void MapAggregateToNodeId_returns_standard_Part13_aggregate_for_every_enum(HistoryAggregateType agg)
{
var nodeId = OpcUaClientDriver.MapAggregateToNodeId(agg);
NodeId.IsNull(nodeId).ShouldBeFalse();
// Every mapping should resolve to an AggregateFunction_* NodeId (namespace 0, numeric id).
nodeId.NamespaceIndex.ShouldBe((ushort)0);
}
[Fact]
public void MapAggregateToNodeId_rejects_invalid_enum_value()
{
// Defense-in-depth: a future HistoryAggregateType addition mustn't silently fall through.
Should.Throw<ArgumentOutOfRangeException>(() =>
OpcUaClientDriver.MapAggregateToNodeId((HistoryAggregateType)99));
}
[Fact]
public async Task ReadRawAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-hist-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.ReadRawAsync("ns=2;s=Counter",
DateTime.UtcNow.AddMinutes(-5), DateTime.UtcNow, 1000,
TestContext.Current.CancellationToken));
}
[Fact]
public async Task ReadRawAsync_with_malformed_NodeId_returns_empty_result_not_throw()
{
// Same defensive pattern as ReadAsync / WriteAsync — malformed NodeId short-circuits
// to an empty result rather than crashing a batch history call. Needs init via the
// throw path first, then we pass "" to trigger the parse-fail branch inside
// ExecuteHistoryReadAsync. The init itself fails against 127.0.0.1:1 so we stop there.
// Not runnable without init — keep as placeholder for when the in-process fixture
// PR lands.
await Task.CompletedTask;
}
[Fact]
public async Task ReadProcessedAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-hist-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.ReadProcessedAsync("ns=2;s=Counter",
DateTime.UtcNow.AddMinutes(-5), DateTime.UtcNow,
TimeSpan.FromSeconds(10), HistoryAggregateType.Average,
TestContext.Current.CancellationToken));
}
[Fact]
public async Task ReadAtTimeAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-hist-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.ReadAtTimeAsync("ns=2;s=Counter",
[DateTime.UtcNow.AddMinutes(-5), DateTime.UtcNow],
TestContext.Current.CancellationToken));
}
[Fact]
public async Task ReadEventsAsync_throws_NotSupportedException_as_documented()
{
// The IHistoryProvider default implementation throws; the OPC UA Client driver
// deliberately inherits that default (see PR 76 commit body) because the OPC UA
// client call path needs an EventFilter SelectClauses spec the interface doesn't carry.
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-events-default");
await Should.ThrowAsync<NotSupportedException>(async () =>
await ((IHistoryProvider)drv).ReadEventsAsync(
sourceName: null,
startUtc: DateTime.UtcNow.AddMinutes(-5),
endUtc: DateTime.UtcNow,
maxEvents: 100,
cancellationToken: TestContext.Current.CancellationToken));
}
}

View File

@@ -0,0 +1,32 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
/// <summary>
/// Unit tests for the IReadable/IWritable surface that don't need a live remote OPC UA
/// server. Wire-level round-trips against a local in-process server fixture land in a
/// follow-up PR once we have one scaffolded.
/// </summary>
[Trait("Category", "Unit")]
public sealed class OpcUaClientReadWriteTests
{
[Fact]
public async Task ReadAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.ReadAsync(["ns=2;s=Demo"], TestContext.Current.CancellationToken));
}
[Fact]
public async Task WriteAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.WriteAsync(
[new WriteRequest("ns=2;s=Demo", 42)],
TestContext.Current.CancellationToken));
}
}

View File

@@ -0,0 +1,36 @@
using Shouldly;
using Xunit;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
/// <summary>
/// Scaffold tests for <see cref="SessionReconnectHandler"/> wiring. Wire-level
/// disconnect-reconnect-resume coverage against a live upstream server lands with the
/// in-process fixture — too much machinery for a unit-test-only lane.
/// </summary>
[Trait("Category", "Unit")]
public sealed class OpcUaClientReconnectTests
{
[Fact]
public void Default_ReconnectPeriod_matches_driver_specs_5_seconds()
{
new OpcUaClientDriverOptions().ReconnectPeriod.ShouldBe(TimeSpan.FromSeconds(5));
}
[Fact]
public void Options_ReconnectPeriod_is_configurable_for_aggressive_or_relaxed_retry()
{
var opts = new OpcUaClientDriverOptions { ReconnectPeriod = TimeSpan.FromMilliseconds(500) };
opts.ReconnectPeriod.ShouldBe(TimeSpan.FromMilliseconds(500));
}
[Fact]
public void Driver_starts_with_no_reconnect_handler_active_pre_init()
{
// The reconnect handler is lazy — spun up only when a bad keep-alive fires. Pre-init
// there's no session to reconnect, so the field must be null (indirectly verified by
// the lifecycle-shape test suite catching any accidental construction).
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-reconnect");
drv.GetHealth().State.ShouldBe(Core.Abstractions.DriverState.Unknown);
}
}

View File

@@ -0,0 +1,54 @@
using Opc.Ua;
using Shouldly;
using Xunit;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
[Trait("Category", "Unit")]
public sealed class OpcUaClientSecurityPolicyTests
{
[Theory]
[InlineData(OpcUaSecurityPolicy.None)]
[InlineData(OpcUaSecurityPolicy.Basic128Rsa15)]
[InlineData(OpcUaSecurityPolicy.Basic256)]
[InlineData(OpcUaSecurityPolicy.Basic256Sha256)]
[InlineData(OpcUaSecurityPolicy.Aes128_Sha256_RsaOaep)]
[InlineData(OpcUaSecurityPolicy.Aes256_Sha256_RsaPss)]
public void MapSecurityPolicy_returns_known_non_empty_uri_for_every_enum_value(OpcUaSecurityPolicy policy)
{
var uri = OpcUaClientDriver.MapSecurityPolicy(policy);
uri.ShouldNotBeNullOrEmpty();
// Each URI should end in the enum name (for the non-None policies) so a driver
// operator reading logs can correlate the URI back to the config value.
if (policy != OpcUaSecurityPolicy.None)
uri.ShouldContain(policy.ToString());
}
[Fact]
public void MapSecurityPolicy_None_matches_SDK_None_URI()
{
OpcUaClientDriver.MapSecurityPolicy(OpcUaSecurityPolicy.None)
.ShouldBe(SecurityPolicies.None);
}
[Fact]
public void MapSecurityPolicy_Basic256Sha256_matches_SDK_URI()
{
OpcUaClientDriver.MapSecurityPolicy(OpcUaSecurityPolicy.Basic256Sha256)
.ShouldBe(SecurityPolicies.Basic256Sha256);
}
[Fact]
public void MapSecurityPolicy_Aes256_Sha256_RsaPss_matches_SDK_URI()
{
OpcUaClientDriver.MapSecurityPolicy(OpcUaSecurityPolicy.Aes256_Sha256_RsaPss)
.ShouldBe(SecurityPolicies.Aes256_Sha256_RsaPss);
}
[Fact]
public void Every_enum_value_has_a_mapping()
{
foreach (OpcUaSecurityPolicy p in Enum.GetValues<OpcUaSecurityPolicy>())
Should.NotThrow(() => OpcUaClientDriver.MapSecurityPolicy(p));
}
}

View File

@@ -0,0 +1,50 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests;
/// <summary>
/// Scaffold tests for <c>ISubscribable</c> + <c>IHostConnectivityProbe</c> that don't
/// need a live remote server. Live-session tests (subscribe/unsubscribe round-trip,
/// keep-alive transitions) land in a follow-up PR once the in-process OPC UA server
/// fixture is scaffolded.
/// </summary>
[Trait("Category", "Unit")]
public sealed class OpcUaClientSubscribeAndProbeTests
{
[Fact]
public async Task SubscribeAsync_without_initialize_throws_InvalidOperationException()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-sub-uninit");
await Should.ThrowAsync<InvalidOperationException>(async () =>
await drv.SubscribeAsync(["ns=2;s=Demo"], TimeSpan.FromMilliseconds(100), TestContext.Current.CancellationToken));
}
[Fact]
public async Task UnsubscribeAsync_with_unknown_handle_is_noop()
{
using var drv = new OpcUaClientDriver(new OpcUaClientDriverOptions(), "opcua-sub-unknown");
// UnsubscribeAsync returns cleanly for handles it doesn't recognise — protects against
// the caller's race with server-side cleanup after a session drop.
await drv.UnsubscribeAsync(new FakeHandle(), TestContext.Current.CancellationToken);
}
[Fact]
public void GetHostStatuses_returns_endpoint_url_row_pre_init()
{
using var drv = new OpcUaClientDriver(
new OpcUaClientDriverOptions { EndpointUrl = "opc.tcp://plc.example:4840" },
"opcua-hosts");
var rows = drv.GetHostStatuses();
rows.Count.ShouldBe(1);
rows[0].HostName.ShouldBe("opc.tcp://plc.example:4840",
"host identity mirrors the endpoint URL so the Admin /hosts dashboard can link back to the remote server");
rows[0].State.ShouldBe(HostState.Unknown);
}
private sealed class FakeHandle : ISubscriptionHandle
{
public string DiagnosticId => "fake";
}
}

View File

@@ -0,0 +1,31 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<TargetFramework>net10.0</TargetFramework>
<Nullable>enable</Nullable>
<ImplicitUsings>enable</ImplicitUsings>
<IsPackable>false</IsPackable>
<IsTestProject>true</IsTestProject>
<RootNamespace>ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests</RootNamespace>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="xunit.v3" Version="1.1.0"/>
<PackageReference Include="Shouldly" Version="4.3.0"/>
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="17.12.0"/>
<PackageReference Include="xunit.runner.visualstudio" Version="3.0.2">
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\..\src\ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient\ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.csproj"/>
</ItemGroup>
<ItemGroup>
<NuGetAuditSuppress Include="https://github.com/advisories/GHSA-37gx-xxp4-5rgx"/>
<NuGetAuditSuppress Include="https://github.com/advisories/GHSA-w3x6-4m5h-cxqf"/>
</ItemGroup>
</Project>

View File

@@ -0,0 +1,117 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.S7.Tests;
/// <summary>
/// Shape tests for <see cref="S7Driver"/>'s <see cref="ITagDiscovery"/>,
/// <see cref="ISubscribable"/>, and <see cref="IHostConnectivityProbe"/> surfaces that
/// don't need a live PLC. Wire-level polling round-trips and probe transitions land in a
/// follow-up PR once we have a mock S7 server.
/// </summary>
[Trait("Category", "Unit")]
public sealed class S7DiscoveryAndSubscribeTests
{
private sealed class RecordingAddressSpaceBuilder : IAddressSpaceBuilder
{
public readonly List<string> Folders = new();
public readonly List<(string Name, DriverAttributeInfo Attr)> Variables = new();
public IAddressSpaceBuilder Folder(string browseName, string displayName)
{
Folders.Add(browseName);
return this;
}
public IVariableHandle Variable(string browseName, string displayName, DriverAttributeInfo attributeInfo)
{
Variables.Add((browseName, attributeInfo));
return new StubHandle();
}
public void AddProperty(string browseName, DriverDataType dataType, object? value) { }
public void AttachAlarmCondition(IVariableHandle sourceVariable, string alarmName, DriverAttributeInfo alarmInfo) { }
private sealed class StubHandle : IVariableHandle
{
public string FullReference => "stub";
public IAlarmConditionSink MarkAsAlarmCondition(AlarmConditionInfo info)
=> throw new NotImplementedException("S7 driver never calls this — no alarm surfacing");
}
}
[Fact]
public async Task DiscoverAsync_projects_every_tag_into_the_address_space()
{
var opts = new S7DriverOptions
{
Host = "192.0.2.1",
Tags =
[
new("TempSetpoint", "DB1.DBW0", S7DataType.Int16, Writable: true),
new("FaultBit", "M0.0", S7DataType.Bool, Writable: false),
new("PIDOutput", "DB5.DBD12", S7DataType.Float32, Writable: true),
],
};
using var drv = new S7Driver(opts, "s7-disco");
var builder = new RecordingAddressSpaceBuilder();
await drv.DiscoverAsync(builder, TestContext.Current.CancellationToken);
builder.Folders.ShouldContain("S7");
builder.Variables.Count.ShouldBe(3);
builder.Variables[0].Name.ShouldBe("TempSetpoint");
builder.Variables[0].Attr.SecurityClass.ShouldBe(SecurityClassification.Operate, "writable tags get Operate security class");
builder.Variables[1].Attr.SecurityClass.ShouldBe(SecurityClassification.ViewOnly, "read-only tags get ViewOnly");
builder.Variables[2].Attr.DriverDataType.ShouldBe(DriverDataType.Float32);
}
[Fact]
public void GetHostStatuses_returns_one_row_with_host_port_identity_pre_init()
{
var opts = new S7DriverOptions { Host = "plc1.internal", Port = 102 };
using var drv = new S7Driver(opts, "s7-host");
var rows = drv.GetHostStatuses();
rows.Count.ShouldBe(1);
rows[0].HostName.ShouldBe("plc1.internal:102");
rows[0].State.ShouldBe(HostState.Unknown, "pre-init / pre-probe state is Unknown");
}
[Fact]
public async Task SubscribeAsync_returns_unique_handles_and_UnsubscribeAsync_accepts_them()
{
var opts = new S7DriverOptions { Host = "192.0.2.1" };
using var drv = new S7Driver(opts, "s7-sub");
// SubscribeAsync does not itself call ReadAsync (the poll task does), so this works
// even though the driver isn't initialized. The poll task catches the resulting
// InvalidOperationException and the loop quietly continues — same pattern as the
// Modbus driver's poll loop tolerating transient transport failures.
var h1 = await drv.SubscribeAsync(["T1"], TimeSpan.FromMilliseconds(200), TestContext.Current.CancellationToken);
var h2 = await drv.SubscribeAsync(["T2"], TimeSpan.FromMilliseconds(200), TestContext.Current.CancellationToken);
h1.DiagnosticId.ShouldStartWith("s7-sub-");
h2.DiagnosticId.ShouldStartWith("s7-sub-");
h1.DiagnosticId.ShouldNotBe(h2.DiagnosticId);
await drv.UnsubscribeAsync(h1, TestContext.Current.CancellationToken);
await drv.UnsubscribeAsync(h2, TestContext.Current.CancellationToken);
// UnsubscribeAsync with an unknown handle must be a no-op, not throw.
await drv.UnsubscribeAsync(h1, TestContext.Current.CancellationToken);
}
[Fact]
public async Task Subscribe_publishing_interval_is_floored_at_100ms()
{
var opts = new S7DriverOptions { Host = "192.0.2.1", Probe = new S7ProbeOptions { Enabled = false } };
using var drv = new S7Driver(opts, "s7-floor");
// 50 ms requested — the floor protects the S7 CPU from sub-scan polling that would
// just queue wire-side. Test that the subscription is accepted (the floor is applied
// internally; the floor value isn't exposed, so we're really just asserting that the
// driver doesn't reject small intervals).
var h = await drv.SubscribeAsync(["T"], TimeSpan.FromMilliseconds(50), TestContext.Current.CancellationToken);
h.ShouldNotBeNull();
await drv.UnsubscribeAsync(h, TestContext.Current.CancellationToken);
}
}