Phase 3 PR 72 -- Multi-endpoint failover for OPC UA Client #71

Merged
dohertj2 merged 1 commit from phase-3-pr72-opcua-client-failover into v2 2026-04-19 01:54:37 -04:00

Summary

Adds an ordered EndpointUrls list + PerEndpointConnectTimeout (default 3s). InitializeAsync walks the candidates in order; the first one that successfully connects wins. If all fail, throws AggregateException whose message names every URL + its failure class — clean diag for hot-standby debugging.

  • Backward-compat: EndpointUrl retained as a one-URL shortcut. Empty/unset EndpointUrls falls back to [EndpointUrl].
  • User-identity hoisted: built once before the sweep so the user-cert password only unlocks the PFX once (shorter memory residency, no redundant work).
  • HostName now reflects the endpoint that actually connected (via _connectedEndpointUrl) instead of always returning opts.EndpointUrl — the Admin /hosts dashboard shows primary vs backup as they change.
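The fallback rules in the bullets above can be sketched as follows. This is only an illustration under stated assumptions, not the driver's code: `OptionsSketch` and `CandidateSketch` are hypothetical stand-ins, the empty-string default for `EndpointUrl` is a placeholder, and only the member names `EndpointUrl`, `EndpointUrls`, `PerEndpointConnectTimeout`, `ResolveEndpointCandidates` and `HostName` come from this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for OpcUaClientDriverOptions; models only the fields named in this PR.
public sealed class OptionsSketch
{
    // Placeholder default; the real default is not stated in this PR.
    public string EndpointUrl { get; init; } = string.Empty;
    public IReadOnlyList<string>? EndpointUrls { get; init; }
    public TimeSpan PerEndpointConnectTimeout { get; init; } = TimeSpan.FromSeconds(3);
}

public static class CandidateSketch
{
    // A non-empty EndpointUrls list wins; null or empty falls back to [EndpointUrl].
    public static IReadOnlyList<string> ResolveEndpointCandidates(OptionsSketch opts) =>
        opts.EndpointUrls is { Count: > 0 } list ? list : new[] { opts.EndpointUrl };

    // Connected endpoint if there is one, otherwise the first candidate (pre-connect).
    public static string HostName(OptionsSketch opts, string? connectedEndpointUrl) =>
        connectedEndpointUrl ?? ResolveEndpointCandidates(opts)[0];
}
```

Treating an explicit empty list the same as an unset one avoids a zero-candidate sweep that would fail without having tried anything.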

Validation

  • 31/31 OpcUaClient.Tests pass (5 new failover)
  • dotnet build: 0 errors

Use case

Hot-standby OPC UA pair: EndpointUrls=[primary, backup]. The driver tries primary first and fails over to backup if that attempt fails; if both are down, it throws an AggregateException that lists each attempt's failure mode.
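A minimal sketch of that sweep, assuming a caller-supplied `connect` delegate in place of the real session-opening code. `FailoverSketch`, `ConnectFirstAsync` and the "All endpoints failed" wording are hypothetical; only the `url -> ExceptionType: message` shape of each failure entry is taken from the PR. Attaching the inner exceptions is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class FailoverSketch
{
    // Tries each candidate in order and returns the first session that connects.
    public static async Task<(string Url, TSession Session)> ConnectFirstAsync<TSession>(
        IReadOnlyList<string> candidates,
        Func<string, CancellationToken, Task<TSession>> connect,
        CancellationToken ct)
    {
        var failures = new List<string>();   // one human-readable line per failed URL
        var errors = new List<Exception>();  // originals, kept as inner exceptions

        foreach (var url in candidates)
        {
            try
            {
                return (url, await connect(url, ct).ConfigureAwait(false));
            }
            catch (Exception ex) when (!ct.IsCancellationRequested)
            {
                // Caller-requested cancellation is not swallowed; anything else is
                // recorded and the sweep moves on to the next candidate.
                failures.Add($"{url} -> {ex.GetType().Name}: {ex.Message}");
                errors.Add(ex);
            }
        }

        throw new AggregateException(
            "All endpoints failed: " + string.Join("; ", failures), errors);
    }
}
```

With `EndpointUrls=[primary, backup]`, a dead primary yields the backup session, and two dead servers yield one exception whose message names both URLs.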

Test plan

  • EndpointUrls trumps EndpointUrl
  • Empty list falls back gracefully
  • HostName renders first candidate pre-connect
  • All-unreachable → AggregateException naming each URL + Faulted state
dohertj2 added 1 commit 2026-04-19 01:54:33 -04:00
Phase 3 PR 72 -- Multi-endpoint failover for OPC UA Client driver.

Adds OpcUaClientDriverOptions.EndpointUrls ordered list + PerEndpointConnectTimeout knob. On InitializeAsync the driver walks the candidate list in order via ResolveEndpointCandidates and returns the session from the first endpoint that successfully connects. Captures per-URL failure reasons in a List<string> and, if every candidate fails, throws AggregateException whose message names every URL + its failure class (e.g. 'opc.tcp://primary:4840 -> TimeoutException: ...'). That's critical diag for field debugging -- without it 'failover picked the wrong one' surfaces as a mystery.

Single-URL backwards compat: EndpointUrl field retained as a one-URL shortcut. When EndpointUrls is null or empty the driver falls through to a single-candidate list of [EndpointUrl], so every existing single-endpoint config keeps working without migration. When both are provided, EndpointUrls wins + EndpointUrl is ignored -- documented on the field xml-doc.

Per-endpoint connect budget: PerEndpointConnectTimeout (default 3s) caps each attempt so a sweep over several dead servers can't blow the overall init budget. Applied via CancellationTokenSource.CreateLinkedTokenSource + CancelAfter inside OpenSessionOnEndpointAsync (the extracted single-endpoint connect helper) so the cap is independent of the outer Options.Timeout which governs steady-state ops.

BuildUserIdentity extracted out of InitializeAsync so the failover loop builds the UserIdentity ONCE and reuses it across every endpoint attempt -- generating it N times would re-unlock the user cert's private key N times, wasteful + keeps the password in memory longer.

HostName now reflects the endpoint that actually connected via _connectedEndpointUrl instead of always returning opts.EndpointUrl -- so the Admin /hosts dashboard shows which of the configured endpoints is currently serving traffic (primary vs backup). Falls back to the first candidate pre-connect so the dashboard has a sensible identity before the first connect, and resets to null on ShutdownAsync.

Use case: an OPC UA hot-standby server pair (primary 4840 + backup 4841) where either can serve the same address space. Operator configures EndpointUrls=[primary, backup]; driver tries primary first, falls over to backup on primary failure with a clean AggregateException describing both attempts if both are down.

Unit tests (OpcUaClientFailoverTests, 5 facts):
  • ResolveEndpointCandidates_prefers_EndpointUrls_when_provided (list trumps single)
  • ResolveEndpointCandidates_falls_back_to_single_EndpointUrl_when_list_empty (legacy config compat)
  • ResolveEndpointCandidates_empty_list_treated_as_fallback (explicit empty list also falls back -- otherwise we'd produce a zero-candidate sweep that throws with nothing tried)
  • HostName_uses_first_candidate_before_connect (dashboard rendering pre-connect)
  • Initialize_against_all_unreachable_endpoints_throws_AggregateException_listing_each (three loopback dead ports, asserts each URL appears in the aggregate message + driver flips to Faulted)

31/31 OpcUaClient.Tests pass. dotnet build clean.

OPC UA Client driver security/auth/availability feature set now complete per driver-specs.md §8: policy-filtered endpoint selection (PR 70), Anonymous+Username+Certificate auth (PR 71), multi-endpoint failover (this PR).

24435712c4
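The per-endpoint connect budget from the commit message can be sketched with the two BCL calls it names, `CancellationTokenSource.CreateLinkedTokenSource` and `CancelAfter`. This is a hedged illustration rather than the body of OpenSessionOnEndpointAsync: `TimeoutSketch`, `WithPerAttemptTimeoutAsync` and the `attempt` delegate are hypothetical, and converting the cancellation into a `TimeoutException` is an assumption suggested only by the example failure string in the commit message.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutSketch
{
    // Runs one connect attempt under its own time cap, linked to the caller's token.
    public static async Task<T> WithPerAttemptTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> attempt,
        TimeSpan perAttempt,
        CancellationToken outer)
    {
        // The linked source cancels when the caller cancels OR when the cap elapses.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
        cts.CancelAfter(perAttempt);

        try
        {
            return await attempt(cts.Token).ConfigureAwait(false);
        }
        catch (OperationCanceledException) when (!outer.IsCancellationRequested)
        {
            // Only the per-attempt cap fired, so report it as a timeout for this endpoint.
            throw new TimeoutException($"Connect attempt exceeded {perAttempt}.");
        }
    }
}
```

Because the cap lives on a per-attempt token source, a sweep over N dead endpoints costs roughly N times the cap at most, independent of any longer timeout used for steady-state operations.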
dohertj2 merged commit d3bf544abc into v2 2026-04-19 01:54:37 -04:00
Reference: dohertj2/lmxopcua#71