diff --git a/docs/drivers/AbCip-HSBY.md b/docs/drivers/AbCip-HSBY.md index 436cdfb..9a972f4 100644 --- a/docs/drivers/AbCip-HSBY.md +++ b/docs/drivers/AbCip-HSBY.md @@ -1,14 +1,18 @@ # AbCip — ControlLogix HSBY paired-IP support -PR abcip-5.1 adds **non-transparent** HSBY (Hot-Standby) awareness to the AB -CIP driver. Each device may declare a partner gateway; when both gateways are -up the driver concurrently probes a role tag on each chassis and reports -which one is currently Active. +PR abcip-5.1 + 5.2 ship **non-transparent** HSBY (Hot-Standby) awareness +to the AB CIP driver. Each device may declare a partner gateway; when both +gateways are up the driver concurrently probes a role tag on each chassis, +reports which one is currently Active, and routes reads / writes through +that chassis automatically. -PR abcip-5.1 only **gathers + reports** the role. PR abcip-5.2 is the -follow-up that wires the resolved active address into -`AbCipDriver.ResolveHost` so reads and writes route to whichever chassis is -Active without operator intervention. +- **PR abcip-5.1** — gathers + reports the role of each chassis through + driver diagnostics. See [Role-tag detection matrix](#role-tag-detection-matrix) + + [Active-resolution rules](#active-resolution-rules). +- **PR abcip-5.2** — wires the resolved active address into + `AbCipDriver.ResolveHost` and the runtime-cache lifecycle. See + [Failover behaviour](#failover-behaviour-pr-52) + + [Failure-mode walkthrough](#failure-mode-walkthrough). ## When to use HSBY paired IPs @@ -24,7 +28,8 @@ edited the config. PR abcip-5.1 closes the visibility half of that gap by reading the role tag on both chassis. PR abcip-5.2 closes the routing half by re-pointing -`ResolveHost` at the Active address each tick. +`ResolveHost` at the Active address each tick + invalidating the per-tag +runtime cache + write-coalescer state on every flip. 
## Configuration @@ -88,14 +93,17 @@ The driver surfaces three diagnostics counters per HSBY-enabled device | `AbCip.HsbyActive` | `1` if primary is Active, `2` if partner is Active, `0` if neither (or HSBY off) | | `AbCip.HsbyPrimaryRole` | `(int)HsbyRole` — `0` = Unknown, `1` = Active, `2` = Standby, `3` = Disqualified | | `AbCip.HsbyPartnerRole` | Same encoding as `HsbyPrimaryRole`, observed on the partner chassis | +| `AbCip.HsbyFailoverCount` (PR 5.2) | Total number of `ActiveAddress` transitions the probe loop has observed across every HSBY-enabled device on this driver. Each increment maps to one runtime-cache invalidation + write-coalescer reset. | When more than one HSBY pair is configured on the same driver instance the flat keys are scoped per primary host: `AbCip.HsbyActive[ab://10.0.0.5/1,0]`, etc. The `DeviceState.ActiveAddress` field (internal; surfaced via -`HsbyActive` diagnostics) is the address PR 5.2 will route through -`ResolveHost`. +`HsbyActive` diagnostics) is the address PR 5.2 routes through +`ResolveHost` + uses to scope the per-host bulkhead / breaker key. +See [Failover behaviour](#failover-behaviour-pr-52) for the runtime +implications. ### Active-resolution rules @@ -104,8 +112,8 @@ The `DeviceState.ActiveAddress` field (internal; surfaced via | Active | Standby / Disqualified / Unknown | primary | | Standby / Disqualified / Unknown | Active | partner | | Active | Active (split-brain) | **primary wins**, warning logged | -| Standby + Standby | Standby + Standby | `null` (PR 5.2 will surface as `BadCommunicationError`) | -| Unknown + Unknown | Unknown + Unknown | `null` | +| Standby + Standby | Standby + Standby | `null` — PR 5.2's `ResolveHost` falls back to the configured primary; the existing dial flow surfaces `BadCommunicationError` if the primary is also down. See [Both-stuck](#both-stuck-no-chassis-active). 
| +| Unknown + Unknown | Unknown + Unknown | `null` (same fallback as Standby + Standby) | Split-brain (both chassis claim Active simultaneously) is a real production failure mode — typically a redundancy-module misconfiguration or @@ -150,28 +158,167 @@ otopcua-abcip-cli subscribe -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0 \ RoleTagAddress, ProbeIntervalMs}` survive deserialise → driver → `DeviceState`). - `Hsby.Enabled = false` → no role probing. -- **Integration** (`tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbCipHsbyRoleProberTests.cs`): - - **Skipped by default** (`Assert.Skip`) — `ab_server` cannot emulate - a ControlLogix HSBY pair (no `WallClockTime.SyncStatus`, no second - chassis concept). The Docker `paired` profile (PR 5.1) brings up two - `ab_server` instances + a stub `hsby-mux` sidecar so the topology is - documented, but PR 5.2 follow-up needs a patched `ab_server` image - that actually serves the role tag before the integration test can - assert anything against the wire. - - Trait `Category=Hsby` so `dotnet test --filter Category=Hsby` finds - this test once it's promoted. +- **Integration** (`tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/`): + - `AbCipHsbyRoleProberTests.cs` (PR 5.1) and + `AbCipHsbyFailoverTests.cs` (PR 5.2) — both **skipped by default** + (`Assert.Skip`). `ab_server` cannot emulate a ControlLogix HSBY + pair (no `WallClockTime.SyncStatus`, no second chassis concept). + The Docker `paired` profile (PR 5.1) brings up two `ab_server` + instances + a stub `hsby-mux` sidecar so the topology is + documented, but a patched `ab_server` image that actually serves + the role tag is still on the follow-up list. + - Trait `Category=Hsby` so `dotnet test --filter Category=Hsby` + finds them once they're promoted. +- **End-to-end** (`scripts/e2e/test-abcip-hsby.ps1`, PR 5.2): + - Paired-fixture variant of `test-abcip.ps1`. 
Subscribes to a tag + through the OPC UA server, flips the active chassis mid-stream + via the `hsby-mux` sidecar's `POST /flip` endpoint, asserts the + stream survives + `AbCip.HsbyFailoverCount` increments. Gated + on operator-supplied `BridgeNodeId` + a running paired fixture; + ships unwired into `test-all.ps1` until the patched `ab_server` + lands. -## Follow-ups (PR 5.2 + beyond) +## Failover behaviour (PR 5.2) + +PR 5.2 wires `DeviceState.ActiveAddress` into the read / write hot path +through `AbCipDriver.ResolveHost` and the runtime-cache lifecycle. After +the role-probe loop (PR 5.1) detects an active-address transition the +driver re-points every wire-level operation at the now-Active chassis +without operator intervention. + +### What flips on a failover + +| Aspect | Pre-flip | Post-flip | +|---|---|---| +| `ResolveHost(tag)` return | primary `HostAddress` | the partner address (when partner is now Active) | +| Per-tag libplctag handles in `DeviceState.Runtimes` | created against primary gateway | dropped on flip; lazily re-created against the partner gateway on next read / write | +| Parent-DINT RMW handles in `DeviceState.ParentRuntimes` | primary gateway | dropped on flip; same re-create-on-demand path | +| `AbCipWriteCoalescer` per-device cache | last-known-written values from the primary | reset; the first write of any value to the partner pays the full round-trip | +| `LogicalInstanceMap` (Logical-mode `@tags` walk) | populated for primary | cleared; the next read on a Logical-mode device re-walks `@tags` against the partner | +| Per-host bulkhead key (Polly bulkhead + breaker, plan decision #144) | keyed on primary `HostAddress` | keyed on the new active address — the partner gets its own fresh breaker state instead of inheriting a tripped breaker from the now-standby | +| `AbCip.HsbyFailoverCount` diagnostic | 0 | incremented by 1 on every transition observed by the probe loop | + +### How the invalidation runs + +PR 5.2 introduces an internal 
`OnActiveAddressChanged` event raised by +`HsbyProbeLoopAsync` on every `DeviceState.ActiveAddress` transition. The +driver subscribes to it from its own constructor; the handler +(`HandleActiveAddressChanged`) does the cache invalidation in one place: + +1. Disposes every entry in `DeviceState.Runtimes` and + `DeviceState.ParentRuntimes`, then clears both dicts. Disposed + `IAbCipTagRuntime` instances release their underlying libplctag + handles so the native heap doesn't leak. +2. Clears `DeviceState.LogicalInstanceMap` and resets + `LogicalWalkComplete = false` so the next read on a Logical-mode + device re-fires the `@tags` symbol walk against the new chassis. +3. Calls `AbCipWriteCoalescer.Reset(deviceHostAddress)` so cached + "we already wrote 42" decisions don't stale-suppress the first + partner-side write. +4. Resets `DeviceState.RuntimesAddress = null` so subsequent + diagnostics observers see a fresh stamp on the next runtime + creation. +5. `Interlocked.Increment` on the driver-wide + `AbCip.HsbyFailoverCount` counter. + +The cache invalidation is idempotent: a second event for the same address +change finds the dicts already empty + the coalescer reset is itself +idempotent. Only the `AbCip.HsbyFailoverCount` increment repeats. + +### Bulkhead key semantics + +The per-host resilience pipeline (Polly bulkhead + circuit breaker, plan +decision #144) keys on whatever `IPerCallHostResolver.ResolveHost` +returns. PR 5.2 changes that resolver so an HSBY-failed-over device +returns the partner's address, which means: + +- The **device-state lookup** (`_devices.TryGetValue`) keeps using the + configured primary `HostAddress` as the dictionary key — that key + never changes for the lifetime of a device, so multi-device + configurations stay routable. +- The **resilience pipeline** (Polly bulkhead, breaker, retry policies) + receives the active address as the host-name dimension. 
The standby + chassis's tripped breaker (if its primary went away) doesn't bleed + over to the partner; the partner gets fresh limits + a closed + breaker. + +When HSBY is disabled (`Hsby.Enabled = false`) `ResolveHost` returns the +configured primary `HostAddress` exactly as it always has — pre-5.2 +behaviour, no double-key risk. + +## Failure-mode walkthrough + +PR 5.2 adds three failover surface areas to reason about. The tables +below summarise the behaviour the driver reports + how an operator +can inspect it. + +### Primary-stuck (primary unreachable, partner Active) + +The primary chassis goes away (network partition, power loss, a stuck +Forward Open). The role-probe loop reads `HsbyRole.Unknown` for the +primary and `HsbyRole.Active` for the partner. + +| Surface | Behaviour | +|---|---| +| `DeviceState.ActiveAddress` | partner address | +| `DeviceState.PrimaryRole` | `Unknown` | +| `DeviceState.PartnerRole` | `Active` | +| `ResolveHost(tag)` | partner address | +| Reads / writes | route through partner gateway transparently | +| `AbCip.HsbyFailoverCount` | incremented when the address transitioned away from the primary | +| `AbCip.HsbyActive` | `2` (partner is the active chassis) | +| Operator action | none required for routing; investigate why the primary is unreachable through the connectivity-probe loop's `_System/_ConnectionStatus` for the device | + +### Secondary-stuck (partner unreachable, primary Active) + +The partner chassis goes away (its OPC UA server is down, its IP is +unreachable, the redundancy module unhitched it). The probe loop reads +`HsbyRole.Active` for the primary and `HsbyRole.Unknown` for the partner. 
+ +| Surface | Behaviour | +|---|---| +| `DeviceState.ActiveAddress` | primary address (no transition; this is the steady state) | +| `DeviceState.PrimaryRole` | `Active` | +| `DeviceState.PartnerRole` | `Unknown` | +| `ResolveHost(tag)` | primary address | +| Reads / writes | route through primary gateway exactly as in a non-HSBY deployment | +| `AbCip.HsbyFailoverCount` | unchanged — no flip happened | +| `AbCip.HsbyActive` | `1` (primary is the active chassis) | +| Operator action | investigate why the partner is unreachable; the operational risk is that a future primary-side outage has no fall-back | + +### Both-stuck (no chassis Active) + +Both chassis report `Standby` / `Disqualified` / `Unknown` (a +redundancy-module misconfiguration, both controllers in Program mode, +or both unreachable). + +| Surface | Behaviour | +|---|---| +| `DeviceState.ActiveAddress` | `null` | +| `ResolveHost(tag)` | falls back to the configured primary `HostAddress` | +| Reads / writes | dispatched to the configured primary; a stuck primary surfaces `BadCommunicationError` per the existing dial flow | +| `AbCip.HsbyActive` | `0` (no chassis Active) | +| `AbCip.HsbyFailoverCount` | incremented when the transition `Active → null` happened | +| Operator action | investigate the redundancy module / mode keys; the SCADA layer sees stuck-or-bad-quality reads, not incorrect routing | + +The "fall back to primary on null Active" choice is deliberate. Routing +all reads to a deterministic chassis (the configured primary) keeps the +breaker key + bulkhead state stable while the operator diagnoses the +double-down outage; the alternative (round-robin / partner) would just +trip both breakers in turn and obscure which chassis is the real +problem. + +## Follow-ups (beyond PR 5.2) -- **PR 5.2** — wire `ActiveAddress` into `ResolveHost` so reads/writes - route to the live chassis automatically. Today's PR only **gathers** the - role. 
- **Patched `ab_server` image** — add a writable `WallClockTime.SyncStatus` tag (or a separate Python shim) so the Docker `paired` profile can - exercise the wire-level role probe. + exercise the wire-level role probe + the + `tests/.../IntegrationTests/AbCipHsbyFailoverTests.cs` scaffold can + flip its `Assert.Skip` for a real integration assertion. - **`hsby-mux` REST endpoint** — `POST /flip {"active": "primary"}` writes - `1` to the chosen chassis + `0` to the other so integration tests can - drive switch-overs deterministically. + `1` to the chosen chassis + `0` to the other so integration tests + + `scripts/e2e/test-abcip-hsby.ps1` can drive switch-overs + deterministically. - **GuardLogix HSBY** — same role-tag plumbing applies; verify against a real 1756-L8xS pair when one is on-site. diff --git a/scripts/e2e/test-abcip-hsby.ps1 b/scripts/e2e/test-abcip-hsby.ps1 new file mode 100644 index 0000000..bb1d23a --- /dev/null +++ b/scripts/e2e/test-abcip-hsby.ps1 @@ -0,0 +1,210 @@ +#Requires -Version 7.0 +<# +.SYNOPSIS + End-to-end CLI test for AB CIP HSBY failover routing (PR abcip-5.2). Subscribes to + a tag through the OtOpcUa OPC UA server, flips the active chassis mid-stream via + the paired-fixture's hsby-mux sidecar HTTP endpoint, and asserts the subscribe + stream survives the failover (no permanent loss of notifications + the post-flip + data carries the partner-side update). + +.DESCRIPTION + Paired-fixture variant of test-abcip.ps1. Where test-abcip.ps1 runs against a + single ab_server instance, this script assumes a paired fixture with two + ab_server instances (primary + partner) and an hsby-mux sidecar exposing + /flip {"active": "primary" | "partner"} over HTTP. 
+ + Five assertions: + - HsbyInitialActive — primary is Active at start (hsby-mux primes it) + - HsbyResolveActive — driver-diagnostics surfaces AbCip.HsbyActive == 1 + - HsbyFailoverFlip — POST {"active": "partner"} → AbCip.HsbyActive == 2 + - HsbySubscribeSurvives — subscribe stream stays open across the flip + sees + an updated value from the partner side + - HsbyFailoverCount — AbCip.HsbyFailoverCount increments by ≥ 1 + +.PARAMETER PrimaryGateway + ab://host[:port]/cip-path of the primary chassis. Default ab://127.0.0.1/1,0. + +.PARAMETER PartnerGateway + ab://host[:port]/cip-path of the partner chassis. Default ab://127.0.0.2/1,0. + +.PARAMETER HsbyMuxUrl + Base URL of the paired-fixture's hsby-mux sidecar. Default http://localhost:7080. + Endpoints used: + GET /role → returns {"primary":"Active","partner":"Standby"} + POST /flip {"active":"primary"|"partner"} → flips role tag values on each chassis + +.PARAMETER OpcUaUrl + OtOpcUa server endpoint. Default opc.tcp://localhost:4840. + +.PARAMETER BridgeNodeId + NodeId at which the server publishes the tag exercised by the subscribe assertion. + Required. + +.PARAMETER TagPath + Logix symbolic path the bridge tag points at. Default 'TestDINT'. + +.PARAMETER DriverInstanceId + DriverInstance ID for the AB CIP driver under test. Used to scope the + driver-diagnostics RPC. Default 'abcip-hsby'. + +.EXAMPLE + ./test-abcip-hsby.ps1 -BridgeNodeId 'ns=2;s=AbCip/Bridge/TestDINT' +#> + +param( + [string]$PrimaryGateway = "ab://127.0.0.1/1,0", + [string]$PartnerGateway = "ab://127.0.0.2/1,0", + [string]$HsbyMuxUrl = "http://localhost:7080", + [string]$OpcUaUrl = "opc.tcp://localhost:4840", + [Parameter(Mandatory)] [string]$BridgeNodeId, + [string]$TagPath = "TestDINT", + [string]$DriverInstanceId = "abcip-hsby" +) + +$ErrorActionPreference = "Stop" +. 
"$PSScriptRoot/_common.ps1" + +$abcipCli = Get-CliInvocation ` + -ProjectFolder "src/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli" ` + -ExeName "otopcua-abcip-cli" +$opcUaCli = Get-CliInvocation ` + -ProjectFolder "src/ZB.MOM.WW.OtOpcUa.Client.CLI" ` + -ExeName "otopcua-cli" + +$results = @() + +function Invoke-HsbyFlip { + param([string]$Active) + $body = @{ active = $Active } | ConvertTo-Json -Compress + try { + Invoke-RestMethod -Uri "$HsbyMuxUrl/flip" -Method Post -Body $body -ContentType 'application/json' + } catch { + throw "hsby-mux at $HsbyMuxUrl/flip rejected the request: $($_.Exception.Message)" + } +} + +function Get-HsbyDiagnosticValue { + param([string]$Counter) + # Pull driver-diagnostics through the OPC UA Admin RPC surface. The CLI returns + # a raw JSON blob; we grep out the named counter so the assertion is robust to + # other counters the driver surfaces. + $diagArgs = @($opcUaCli.PrefixArgs) + @( + "driver-diagnostics", "-u", $OpcUaUrl, "-d", $DriverInstanceId) + $diagOut = & $opcUaCli.File @diagArgs 2>&1 + $joined = ($diagOut -join "`n") + if ($joined -match "${Counter}.*?:\s*([\d\.]+)") { + return [double]$matches[1] + } + return $null +} + +# ---- HsbyInitialActive — hsby-mux primes primary as Active ---- +Write-Header "HsbyInitialActive (POST $HsbyMuxUrl/flip {active=primary})" +try { + Invoke-HsbyFlip -Active "primary" | Out-Null + Start-Sleep -Seconds 3 # role-probe loop default tick is 2s + $active = Get-HsbyDiagnosticValue -Counter "AbCip.HsbyActive" + $passed = ($active -eq 1.0) + $results += [PSCustomObject]@{ + Name = "HsbyInitialActive" + Passed = $passed + Detail = if ($passed) { "AbCip.HsbyActive=1 after priming primary" } else { "AbCip.HsbyActive=$active (expected 1)" } + } +} catch { + $results += [PSCustomObject]@{ + Name = "HsbyInitialActive"; Passed = $false; Detail = $_.Exception.Message + } +} + +# ---- HsbyResolveActive — driver routing reads through the primary ---- +Write-Header "HsbyResolveActive (read $TagPath via primary)" 
+$readArgs = @("read") + @("-g", $PrimaryGateway, "-f", "ControlLogix") + @("-t", $TagPath, "--type", "DInt") +$readOut = & $abcipCli.File @($abcipCli.PrefixArgs + $readArgs) 2>&1 +$readOk = ($readOut -join "`n") -notmatch "(error|fail)" +$results += [PSCustomObject]@{ + Name = "HsbyResolveActive" + Passed = $readOk + Detail = if ($readOk) { "primary read completed without error" } else { "read failed: $($readOut -join ' ')" } +} + +# ---- HsbySubscribeSurvives + HsbyFailoverFlip + HsbyFailoverCount ---- +Write-Header "HsbyFailoverFlip + HsbySubscribeSurvives (subscribe across flip)" +$failoverBaseline = Get-HsbyDiagnosticValue -Counter "AbCip.HsbyFailoverCount" +if ($null -eq $failoverBaseline) { $failoverBaseline = 0 } + +$duration = 12 +$subOut = New-TemporaryFile +$subErr = New-TemporaryFile +$subArgs = @($opcUaCli.PrefixArgs) + @( + "subscribe", "-u", $OpcUaUrl, "-n", $BridgeNodeId, "-i", "200", "--duration", "$duration") +$subProc = Start-Process -FilePath $opcUaCli.File -ArgumentList $subArgs ` + -NoNewWindow -PassThru ` + -RedirectStandardOutput $subOut.FullName ` + -RedirectStandardError $subErr.FullName + +# Let the subscribe settle + accumulate primary-side notifications. +Start-Sleep -Seconds 3 + +# Mid-stream flip — primary→Standby, partner→Active. +try { + Invoke-HsbyFlip -Active "partner" | Out-Null +} catch { + Stop-Process -Id $subProc.Id -Force -ErrorAction SilentlyContinue + $results += [PSCustomObject]@{ + Name = "HsbyFailoverFlip"; Passed = $false; Detail = "hsby-mux flip rejected: $($_.Exception.Message)" + } +} + +# Wait for the role-probe loop to catch up (default tick 2s + ProbeIntervalMs slack). +Start-Sleep -Seconds 4 + +# Drive a write through the partner so the subscribe sees a fresh value. 
+$flipValue = Get-Random -Minimum 70000 -Maximum 79999 +$writeArgs = @("write") + @("-g", $PartnerGateway, "-f", "ControlLogix") + @("-t", $TagPath, "--type", "DInt", "-v", $flipValue) +& $abcipCli.File @($abcipCli.PrefixArgs + $writeArgs) | Out-Null + +$activeAfter = Get-HsbyDiagnosticValue -Counter "AbCip.HsbyActive" +$flipPassed = ($activeAfter -eq 2.0) +$results += [PSCustomObject]@{ + Name = "HsbyFailoverFlip" + Passed = $flipPassed + Detail = if ($flipPassed) { "AbCip.HsbyActive=2 after flip" } else { "AbCip.HsbyActive=$activeAfter (expected 2)" } +} + +# Stop the subscribe + harvest the stream. +$subProc.WaitForExit(($duration + 5) * 1000) | Out-Null +if (-not $subProc.HasExited) { Stop-Process -Id $subProc.Id -Force } + +$subText = (Get-Content $subOut.FullName -Raw) + (Get-Content $subErr.FullName -Raw) +Remove-Item $subOut.FullName, $subErr.FullName -ErrorAction SilentlyContinue + +# Stream survival = at least one notification *after* the flip carries the new +# partner-side value. The post-flip write of $flipValue is the canary. 
+$saw = $subText -match "$flipValue" +$results += [PSCustomObject]@{ + Name = "HsbySubscribeSurvives" + Passed = $saw + Detail = if ($saw) { + "subscribe stream surfaced post-flip value $flipValue from partner chassis" + } else { + "subscribe stream did not see the post-flip canary $flipValue — output: $subText" + } +} + +# ---- HsbyFailoverCount — counter incremented by ≥ 1 ---- +Write-Header "HsbyFailoverCount" +$failoverAfter = Get-HsbyDiagnosticValue -Counter "AbCip.HsbyFailoverCount" +if ($null -eq $failoverAfter) { $failoverAfter = 0 } +$counterOk = ($failoverAfter - $failoverBaseline) -ge 1 +$results += [PSCustomObject]@{ + Name = "HsbyFailoverCount" + Passed = $counterOk + Detail = if ($counterOk) { + "AbCip.HsbyFailoverCount went from $failoverBaseline → $failoverAfter" + } else { + "AbCip.HsbyFailoverCount unchanged ($failoverBaseline → $failoverAfter); expected at least 1 increment" + } +} + +Write-Summary -Title "AB CIP HSBY failover e2e" -Results $results +if ($results | Where-Object { -not $_.Passed }) { exit 1 } diff --git a/src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDriver.cs b/src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDriver.cs index 9de6e76..e35152f 100644 --- a/src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDriver.cs +++ b/src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDriver.cs @@ -44,6 +44,24 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, private IAddressSpaceBuilder? _cachedBuilder; private DriverHealth _health = new(DriverState.Unknown, null, null); + // PR abcip-5.2 — failover bookkeeping. Counter is surfaced through driver-diagnostics + // as AbCip.HsbyFailoverCount; the event lets internal subscribers react to an + // ActiveAddress flip without HsbyProbeLoopAsync calling deep into the runtime cache + // directly. The driver subscribes itself in the constructor so cache invalidation + + // write-coalescer reset run inline with the address-change observation. 
+ private long _hsbyFailoverCount; + + /// <summary> + /// PR abcip-5.2 — raised by <see cref="HsbyProbeLoopAsync"/> whenever a device's + /// <see cref="DeviceState.ActiveAddress"/> transitions to a value different from + /// the one observed on the previous tick. Args carry the device + the + /// (oldAddress, newAddress) pair so subscribers can decide whether the change + /// matters for them. Internal seam — the driver wires its own runtime-cache / + /// write-coalescer invalidation through this event so the bookkeeping runs in + /// one place + tests can assert via the public diagnostics counter. + /// </summary> + internal event EventHandler<HsbyActiveAddressChangedEventArgs>? OnActiveAddressChanged; + public event EventHandler? OnDataChange; public event EventHandler? OnHostStatusChanged; public event EventHandler? OnAlarmEvent; @@ -67,6 +85,12 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, onChange: (handle, tagRef, snapshot) => OnDataChange?.Invoke(this, new DataChangeEventArgs(handle, tagRef, snapshot))); _alarmProjection = new AbCipAlarmProjection(this, _options.AlarmPollInterval); + // PR abcip-5.2 — wire the failover-handling subscriber. Drops every cached per-tag + // / parent-DINT runtime against the now-standby gateway, resets the write-coalescer + // (the prior known-written values were against the standby chassis), clears the + // logical-walk state so the @tags walk reruns against the new active gateway, and + // bumps the diagnostics counter that BuildDiagnostics surfaces. + OnActiveAddressChanged += HandleActiveAddressChanged; } /// @@ -258,6 +282,13 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, && !string.IsNullOrWhiteSpace(state.Options.PartnerHostAddress)) { state.PartnerAddress = state.Options.PartnerHostAddress; + // PR abcip-5.2 — pre-parse the partner address once so the runtime hot + // path can swap (Gateway, Port, CipPath) without re-parsing on every + // ResolveHost / EnsureTagRuntimeAsync call. 
A bad partner address is a + // hard config error already flagged by HsbyProbeLoopAsync's TryParse + + // OnWarning path, so a TryParse miss here is non-fatal — the runtime + // never resolves to it because PartnerParsedAddress stays null. + state.PartnerParsedAddress = AbCipHostAddress.TryParse(state.Options.PartnerHostAddress!); state.HsbyCts = new CancellationTokenSource(); var ct = state.HsbyCts.Token; _ = Task.Run(() => HsbyProbeLoopAsync(state, hsby, ct), ct); @@ -784,7 +815,28 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, // No chassis Active — clear so PR abcip-5.2's ResolveHost can fault writes. newActive = null; } + // PR abcip-5.2 — fire OnActiveAddressChanged on every transition so the + // runtime-cache invalidation handler runs exactly once per flip. We compare + // before assigning so a steady-state tick (Active didn't change) is a no-op. + var prevActive = state.ActiveAddress; state.ActiveAddress = newActive; + if (!string.Equals(prevActive, newActive, StringComparison.OrdinalIgnoreCase)) + { + try + { + OnActiveAddressChanged?.Invoke(this, + new HsbyActiveAddressChangedEventArgs(state, prevActive, newActive)); + } + catch (Exception ex) + { + // A handler that throws must never tear the probe loop down. Surface + // the failure through the warning sink + keep ticking; the next flip + // gets another shot at invalidation. + _options.OnWarning?.Invoke( + $"AbCip HSBY active-address-changed handler threw on " + + $"primary='{state.Options.HostAddress}' partner='{partnerAddress}': {ex.Message}"); + } + } try { await Task.Delay(hsby.ProbeInterval, ct).ConfigureAwait(false); } catch (OperationCanceledException) { break; } @@ -836,6 +888,46 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, } } + /// + /// PR abcip-5.2 — invalidation hook for an HSBY failover. 
Disposes every cached + /// per-tag / parent-DINT runtime on the device so the next read / write re-creates + /// against the new Active gateway, resets the write-coalescer's per-device cache + /// (the prior known-written values were against the now-standby chassis), wipes + /// the Logical-mode @tags walk so the new chassis gets a fresh symbol-table + /// resolution, and bumps the AbCip.HsbyFailoverCount diagnostic. Idempotent — a + /// re-fire against the same address (e.g. an event handler that races the assign) + /// short-circuits on the RuntimesAddress equality check inside + /// . + /// + private void HandleActiveAddressChanged(object? sender, HsbyActiveAddressChangedEventArgs e) + { + var state = e.Device; + // Drop the runtime cache. The runtime creators repopulate against the new active + // gateway on next read/write; the disposed handles' libplctag pointers are + // released so the native heap doesn't leak. + foreach (var rt in state.Runtimes.Values) + { + try { rt.Dispose(); } catch { } + } + state.Runtimes.Clear(); + foreach (var rt in state.ParentRuntimes.Values) + { + try { rt.Dispose(); } catch { } + } + state.ParentRuntimes.Clear(); + // Reset the @tags symbol-table walk so the new chassis re-fires it on next read; + // the standby chassis's instance IDs don't transfer to the now-Active partner. + state.LogicalInstanceMap.Clear(); + state.LogicalWalkComplete = false; + // Reset the write-coalescer so the first post-flip write of any value pays the + // full round-trip and the cache rebuilds from the new baseline. + _writeCoalescer.Reset(state.Options.HostAddress); + // Clear the per-device runtimes-address marker so the next runtime creator stamps + // it with whatever the new ActiveParsedAddress resolves to. 
+ state.RuntimesAddress = null; + Interlocked.Increment(ref _hsbyFailoverCount); + } + private void TransitionDeviceState(DeviceState state, HostState newState) { HostState old; @@ -911,11 +1003,34 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, if (AbCipSystemTagSource.IsSystemReference(fullReference)) { var host = ExtractSystemDeviceHost(fullReference); - if (host is not null) return host; + if (host is not null) return ResolveActiveHostFor(host); } if (_tagsByName.TryGetValue(fullReference, out var def)) - return def.DeviceHostAddress; - return _options.Devices.FirstOrDefault()?.HostAddress ?? DriverInstanceId; + return ResolveActiveHostFor(def.DeviceHostAddress); + return ResolveActiveHostFor(_options.Devices.FirstOrDefault()?.HostAddress ?? DriverInstanceId); + } + + /// + /// PR abcip-5.2 — failover-aware bulkhead-key resolver. The configured primary + /// HostAddress stays the device-state lookup key (it never changes for a + /// given device), but the resilience pipeline (Polly bulkhead + breaker per plan + /// decision #144) keys on whatever this method returns. When HSBY is enabled and + /// resolves to the partner, we route the + /// bulkhead through the partner's address so the new active partner gets its own + /// fresh breaker state instead of inheriting the now-standby's tripped breaker. + /// + /// When HSBY isn't enabled or no chassis is Active, returns the original + /// primary host address — that's the legacy pre-5.2 behaviour and keeps the + /// bulkhead state stable for the dial flow's BadCommunicationError surface. 
+ /// + /// + internal string ResolveActiveHostFor(string deviceHostAddress) + { + if (!_devices.TryGetValue(deviceHostAddress, out var state)) return deviceHostAddress; + if (state.Options.Hsby is not { Enabled: true }) return deviceHostAddress; + var active = state.ActiveAddress; + if (string.IsNullOrEmpty(active)) return deviceHostAddress; + return active; } /// @@ -1367,10 +1482,12 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, { sliceLogicalId = sliceId; } + // PR abcip-5.2 — slice handles also follow the active address. + var sliceActive = device.ActiveParsedAddress; var baseParams = new AbCipTagCreateParams( - Gateway: device.ParsedAddress.Gateway, - Port: device.ParsedAddress.Port, - CipPath: device.ParsedAddress.CipPath, + Gateway: sliceActive.Gateway, + Port: sliceActive.Port, + CipPath: sliceActive.CipPath, LibplctagPlcAttribute: device.Profile.LibplctagPlcAttribute, TagName: parsedPath.ToLibplctagName(), Timeout: _options.Timeout, @@ -1439,6 +1556,13 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, throw; } device.Runtimes[tagName] = runtime; + // PR abcip-5.2 — keep the slice path's runtime cache lifecycle in lockstep with + // the per-tag handles. The failover handler clears Runtimes wholesale, so the + // address stamp here matches whatever ActiveAddress resolved to when the slice + // params were built (the caller passed createParams pre-resolved). + device.RuntimesAddress = device.Options.Hsby is { Enabled: true } + ? device.ActiveAddress ?? device.Options.HostAddress + : device.Options.HostAddress; return runtime; } @@ -1859,10 +1983,13 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, { parentLogicalId = pid; } + // PR abcip-5.2 — same active-address routing as EnsureTagRuntimeAsync so + // BOOL-in-DINT RMW handles follow the failover. 
+ var active = device.ActiveParsedAddress; var runtime = _tagFactory.Create(new AbCipTagCreateParams( - Gateway: device.ParsedAddress.Gateway, - Port: device.ParsedAddress.Port, - CipPath: device.ParsedAddress.CipPath, + Gateway: active.Gateway, + Port: active.Port, + CipPath: active.CipPath, LibplctagPlcAttribute: device.Profile.LibplctagPlcAttribute, TagName: parentTagName, Timeout: _options.Timeout, @@ -1879,6 +2006,9 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, throw; } device.ParentRuntimes[parentTagName] = runtime; + device.RuntimesAddress = device.Options.Hsby is { Enabled: true } + ? device.ActiveAddress ?? device.Options.HostAddress + : device.Options.HostAddress; return runtime; } @@ -1906,10 +2036,15 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, logicalId = resolvedId; } + // PR abcip-5.2 — route through the resolved active address so an HSBY pair that + // failed-over to the partner targets the partner's gateway / port / cip-path. + // When HSBY is off or no chassis is Active the getter returns ParsedAddress and + // behaviour is identical to pre-5.2 builds. + var active = device.ActiveParsedAddress; var runtime = _tagFactory.Create(new AbCipTagCreateParams( - Gateway: device.ParsedAddress.Gateway, - Port: device.ParsedAddress.Port, - CipPath: device.ParsedAddress.CipPath, + Gateway: active.Gateway, + Port: active.Port, + CipPath: active.CipPath, LibplctagPlcAttribute: device.Profile.LibplctagPlcAttribute, TagName: parsed.ToLibplctagName(), Timeout: _options.Timeout, @@ -1927,6 +2062,12 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, throw; } device.Runtimes[def.Name] = runtime; + // Stamp the per-device runtimes-address marker so the failover handler can detect + // a stale cache. Compared in DEBUG builds + diagnostics; production code routes + // invalidation through OnActiveAddressChanged. 
+ device.RuntimesAddress = device.Options.Hsby is { Enabled: true } + ? device.ActiveAddress ?? device.Options.HostAddress + : device.Options.HostAddress; return runtime; } @@ -1951,6 +2092,11 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, ["AbCip.WritesPassedThrough"] = _writeCoalescer.TotalWritesPassedThrough, // PR abcip-4.4 — total _RefreshTagDb truthy writes that dispatched to RebrowseAsync. ["AbCip.RefreshTriggers"] = _systemTagSource.TotalRefreshTriggers, + // PR abcip-5.2 — count of HSBY active-address transitions the probe loop has + // observed. Aggregated across every HSBY-enabled device on this driver + // instance; the per-device breakdown is observable via the per-pair role + // counters below. + ["AbCip.HsbyFailoverCount"] = Interlocked.Read(ref _hsbyFailoverCount), }; // PR abcip-5.1 — HSBY role surface. One per HSBY-enabled device: // AbCip.HsbyActive — 1 if ActiveAddress == primary, 2 if == partner, 0 otherwise. @@ -2368,6 +2514,49 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, /// public string? PartnerAddress { get; set; } + /// + /// PR abcip-5.2 — parsed form of , populated at init + /// when HSBY is configured. ResolveHost's caller side keeps using the + /// opaque ; the **runtime hot path** + /// consults so libplctag handles target the + /// currently Active gateway / port / cip-path. + /// + public AbCipHostAddress? PartnerParsedAddress { get; set; } + + /// + /// PR abcip-5.2 — parsed wire address that per-tag / per-slice / parent-DINT + /// runtimes should be created against right now. Returns + /// (the configured primary) when (a) HSBY isn't enabled, (b) + /// is null (no chassis Active — fall through to the dial flow which will fault + /// with BadCommunicationError on the next wire op), or (c) the active address + /// equals the configured primary host. Returns + /// when the partner is the live chassis. Cheap getter — every tag-runtime + /// creation calls it. 
+ /// + public AbCipHostAddress ActiveParsedAddress + { + get + { + if (Options.Hsby is not { Enabled: true } || ActiveAddress is null) + return ParsedAddress; + if (PartnerParsedAddress is not null + && string.Equals(ActiveAddress, PartnerAddress, StringComparison.OrdinalIgnoreCase)) + return PartnerParsedAddress; + return ParsedAddress; + } + } + + /// + /// PR abcip-5.2 — address every entry in + + /// was created against. null until the first + /// read / write materialises a runtime; set to the resolved active address each + /// time a runtime is created. 's + /// active-address-changed callback compares this against the new active and + /// drops every cached handle on mismatch so the next read / write re-creates + /// against the new gateway. + /// + public string? RuntimesAddress { get; set; } + /// PR abcip-5.1 — most-recent role observed on the primary chassis. public HsbyRole PrimaryRole { get; set; } = HsbyRole.Unknown; @@ -2420,3 +2609,26 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, } } } + +/// +/// PR abcip-5.2 — event payload raised by when the HSBY +/// probe loop observes a transition in . +/// Subscribers consume / to decide +/// whether to invalidate cached state. is null on the +/// first transition (driver freshly initialised) and is +/// null when neither chassis is Active (both Standby / Disqualified / Unknown). +/// +internal sealed class HsbyActiveAddressChangedEventArgs : EventArgs +{ + public AbCipDriver.DeviceState Device { get; } + public string? OldAddress { get; } + public string? NewAddress { get; } + + public HsbyActiveAddressChangedEventArgs( + AbCipDriver.DeviceState device, string? oldAddress, string? 
newAddress) + { + Device = device; + OldAddress = oldAddress; + NewAddress = newAddress; + } +} diff --git a/tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbCipHsbyFailoverTests.cs b/tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbCipHsbyFailoverTests.cs new file mode 100644 index 0000000..b44ac0f --- /dev/null +++ b/tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbCipHsbyFailoverTests.cs @@ -0,0 +1,47 @@ +using Shouldly; +using Xunit; +using ZB.MOM.WW.OtOpcUa.Driver.AbCip; + +namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests; + +/// +/// PR abcip-5.2 — integration scaffold for HSBY failover routing through +/// . Skipped by default because the paired +/// fixture (controllogix-secondary ab_server instance + hsby-mux +/// sidecar that flips the role tag on demand) is not yet stable in the Docker +/// compose layout. The scaffold lives here so: +/// +/// The trait is discoverable by dotnet test --filter "Category=Hsby". +/// The companion E2E script (scripts/e2e/test-abcip-hsby.ps1) has a +/// paired surface already wired in tests when an operator stands up the fixture +/// manually. +/// A future PR can flip the skip into a real assertion without restructuring +/// the test layout. +/// +/// The unit-level coverage in AbCipHsbyFailoverTests (in the unit tests +/// project) exercises the active-address-routing + cache-invalidation contract in +/// full against the FakeAbCipTagFactory; this scaffold is just the wire-level shape. +/// +[Trait("Category", "Hsby")] +[Trait("Requires", "AbServer")] +public sealed class AbCipHsbyFailoverTests +{ + [AbServerFact] + public Task ResolveHost_routes_to_partner_after_role_flip_through_hsby_mux() + { + // The paired-fixture compose service (controllogix + controllogix-secondary + + // hsby-mux sidecar at http://localhost:7080) is not yet wired. When it ships, + // the test body will: + // 1. POST {"active": "primary"} to hsby-mux → assert ResolveHost = primary + // gateway via a CLI read. + // 2. 
POST {"active": "partner"} → wait for the probe loop to catch up → + // assert ResolveHost = partner gateway via a second CLI read. + // 3. Assert AbCip.HsbyFailoverCount on the driver's diagnostics + // ≥ 1 by reading the driver-diagnostics RPC through the OPC UA Admin + // surface. + Assert.Skip("HSBY paired fixture (controllogix-secondary + hsby-mux sidecar) " + + "not yet promoted out of scaffold. Run scripts/e2e/test-abcip-hsby.ps1 against a " + + "manually-stood-up paired fixture when verifying this PR end-to-end."); + return Task.CompletedTask; + } +} diff --git a/tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyFailoverTests.cs b/tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyFailoverTests.cs new file mode 100644 index 0000000..27ce06d --- /dev/null +++ b/tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyFailoverTests.cs @@ -0,0 +1,373 @@ +using System.Collections.Concurrent; +using Shouldly; +using Xunit; +using ZB.MOM.WW.OtOpcUa.Core.Abstractions; +using ZB.MOM.WW.OtOpcUa.Driver.AbCip; + +namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests; + +/// +/// PR abcip-5.2 — unit tests for HSBY failover routing in +/// . Drives a paired-IP HSBY device through +/// primary→partner role flips via the FakeAbCipTagFactory's Customise hook + +/// asserts: +/// +/// returns the address of the +/// currently-Active chassis (and the configured primary when HSBY is off / +/// both Standby). +/// The per-device runtime cache is invalidated on flip — disposed handles +/// prove the failover handler ran. +/// drops cached values for the device so +/// the partner pays the full round-trip on next write. +/// AbCip.HsbyFailoverCount in driver-diagnostics increments per flip. +/// Multiple flips count correctly. 
+/// +/// +[Trait("Category", "Unit")] +public sealed class AbCipHsbyFailoverTests +{ + private const string Primary = "ab://10.0.0.5/1,0"; + private const string Partner = "ab://10.0.0.6/1,0"; + + // ---- ResolveHost routing ---- + + [Fact] + public async Task ResolveHost_returns_partner_when_partner_active() + { + var (drv, _) = await BuildHsbyDriverAsync(primaryRoleValue: 0, partnerRoleValue: 1); + try + { + await WaitForActiveAsync(drv, Partner); + var resolved = drv.ResolveHost("Motor01_Speed"); + // Tag isn't registered; resolver still falls through ResolveActiveHostFor on + // the first configured device, which has the partner active. + resolved.ShouldBe(Partner); + } + finally + { + await drv.ShutdownAsync(CancellationToken.None); + } + } + + [Fact] + public async Task ResolveHost_returns_primary_when_primary_active() + { + var (drv, _) = await BuildHsbyDriverAsync(primaryRoleValue: 1, partnerRoleValue: 0); + try + { + await WaitForActiveAsync(drv, Primary); + drv.ResolveHost("Motor01_Speed").ShouldBe(Primary); + } + finally + { + await drv.ShutdownAsync(CancellationToken.None); + } + } + + [Fact] + public async Task Toggling_role_flips_ResolveHost_output() + { + var (factory, tracker) = BuildTrackingFactory(initialPrimary: 1, initialPartner: 0); + var drv = BuildDriver(factory); + await drv.InitializeAsync("{}", CancellationToken.None); + try + { + await WaitForActiveAsync(drv, Primary); + drv.ResolveHost("anything").ShouldBe(Primary); + + FlipRoles(tracker, newPrimary: 0, newPartner: 1); + + await WaitForActiveAsync(drv, Partner); + drv.ResolveHost("anything").ShouldBe(Partner); + } + finally + { + await drv.ShutdownAsync(CancellationToken.None); + } + } + + [Fact] + public async Task ResolveHost_falls_back_to_primary_when_both_standby() + { + var (drv, _) = await BuildHsbyDriverAsync(primaryRoleValue: 0, partnerRoleValue: 0); + try + { + // Wait for the role state to settle so we know the loop ticked at least once. 
+ await WaitForAsync(() => drv.GetDeviceState(Primary)?.PrimaryRole != HsbyRole.Unknown); + drv.ResolveHost("anything").ShouldBe(Primary, + "neither chassis Active means ActiveAddress is null; ResolveHost falls back to the configured primary"); + } + finally + { + await drv.ShutdownAsync(CancellationToken.None); + } + } + + [Fact] + public async Task ResolveHost_ignores_ActiveAddress_when_Hsby_disabled() + { + var factory = new FakeAbCipTagFactory(); + var drv = new AbCipDriver(new AbCipDriverOptions + { + Devices = + [ + new AbCipDeviceOptions( + Primary, + PartnerHostAddress: Partner, + Hsby: new AbCipHsbyOptions { Enabled = false }), + ], + Probe = new AbCipProbeOptions { Enabled = false }, + }, "drv-hsby-off-resolve", factory); + try + { + await drv.InitializeAsync("{}", CancellationToken.None); + // Manually plant an ActiveAddress that conflicts with the primary; ResolveHost + // must still pick the primary because Hsby is disabled. + var state = drv.GetDeviceState(Primary).ShouldNotBeNull(); + state.ActiveAddress = Partner; + drv.ResolveHost("anything").ShouldBe(Primary); + } + finally + { + await drv.ShutdownAsync(CancellationToken.None); + } + } + + // ---- Cache invalidation on flip ---- + + [Fact] + public async Task Failover_invalidates_runtime_cache_and_increments_counter() + { + var (factory, tracker) = BuildTrackingFactory(initialPrimary: 1, initialPartner: 0); + var drv = BuildDriverWithTag(factory, "Motor01_Speed"); + await drv.InitializeAsync("{}", CancellationToken.None); + try + { + await WaitForActiveAsync(drv, Primary); + + // Force a per-tag runtime to be created against the primary. 
+ var initialReads = await drv.ReadAsync(["Motor01_Speed"], CancellationToken.None); + initialReads.Count.ShouldBe(1); + var state = drv.GetDeviceState(Primary).ShouldNotBeNull(); + state.Runtimes.ShouldContainKey("Motor01_Speed"); + var runtimeBeforeFlip = (FakeAbCipTag)state.Runtimes["Motor01_Speed"]; + runtimeBeforeFlip.CreationParams.Gateway.ShouldBe("10.0.0.5"); + + // Flip — primary→Standby, partner→Active. + FlipRoles(tracker, newPrimary: 0, newPartner: 1); + await WaitForActiveAsync(drv, Partner); + + // The pre-flip runtime should have been disposed by the failover handler. + runtimeBeforeFlip.Disposed.ShouldBeTrue(); + // Cache should be empty until the next read repopulates it. + state.Runtimes.ShouldNotContainKey("Motor01_Speed"); + + // Diagnostics counter ticked. + var diag = drv.GetHealth().Diagnostics.ShouldNotBeNull(); + diag.ShouldContainKey("AbCip.HsbyFailoverCount"); + diag["AbCip.HsbyFailoverCount"].ShouldBeGreaterThanOrEqualTo(1); + + // Next read recreates against the partner gateway. + var afterReads = await drv.ReadAsync(["Motor01_Speed"], CancellationToken.None); + afterReads.Count.ShouldBe(1); + state.Runtimes.ShouldContainKey("Motor01_Speed"); + var runtimeAfterFlip = (FakeAbCipTag)state.Runtimes["Motor01_Speed"]; + runtimeAfterFlip.CreationParams.Gateway.ShouldBe("10.0.0.6", + "post-flip runtime must target the partner's gateway"); + runtimeAfterFlip.ShouldNotBeSameAs(runtimeBeforeFlip); + } + finally + { + await drv.ShutdownAsync(CancellationToken.None); + } + } + + [Fact] + public async Task Failover_resets_write_coalescer_for_device() + { + var (factory, tracker) = BuildTrackingFactory(initialPrimary: 1, initialPartner: 0); + var drv = BuildDriverWithTag(factory, "Motor01_Speed"); + await drv.InitializeAsync("{}", CancellationToken.None); + try + { + await WaitForActiveAsync(drv, Primary); + // Seed the coalescer cache for this device + tag. 
We poke it directly via + // the test seam so we don't depend on the multi-write planner accepting our + // synthetic Motor01_Speed definition. + var def = new AbCipTagDefinition( + Name: "Motor01_Speed", + DeviceHostAddress: Primary, + TagPath: "Motor01_Speed", + DataType: AbCipDataType.DInt, + Writable: true, + WriteOnChange: true); + drv.WriteCoalescer.Record(Primary, def, 42); + drv.WriteCoalescer.ShouldSuppress(Primary, def, 42).ShouldBeTrue( + "baseline: identical re-write must be suppressed pre-failover"); + + FlipRoles(tracker, newPrimary: 0, newPartner: 1); + await WaitForActiveAsync(drv, Partner); + + // The cache for this device was cleared so the same write is no longer suppressed. + drv.WriteCoalescer.ShouldSuppress(Primary, def, 42).ShouldBeFalse( + "failover must drop cached known-written values; partner needs the wire round-trip"); + } + finally + { + await drv.ShutdownAsync(CancellationToken.None); + } + } + + [Fact] + public async Task Multiple_flips_each_increment_HsbyFailoverCount() + { + var (factory, tracker) = BuildTrackingFactory(initialPrimary: 1, initialPartner: 0); + var drv = BuildDriver(factory); + await drv.InitializeAsync("{}", CancellationToken.None); + try + { + await WaitForActiveAsync(drv, Primary); + var diagBaseline = drv.GetHealth().Diagnostics.ShouldNotBeNull(); + var startCount = diagBaseline.TryGetValue("AbCip.HsbyFailoverCount", out var v) ? 
v : 0; + + // Flip 1: primary→partner + FlipRoles(tracker, newPrimary: 0, newPartner: 1); + await WaitForActiveAsync(drv, Partner); + + // Flip 2: partner→primary + FlipRoles(tracker, newPrimary: 1, newPartner: 0); + await WaitForActiveAsync(drv, Primary); + + // Flip 3: primary→partner again + FlipRoles(tracker, newPrimary: 0, newPartner: 1); + await WaitForActiveAsync(drv, Partner); + + var diag = drv.GetHealth().Diagnostics.ShouldNotBeNull(); + diag["AbCip.HsbyFailoverCount"].ShouldBeGreaterThanOrEqualTo(startCount + 3); + } + finally + { + await drv.ShutdownAsync(CancellationToken.None); + } + } + + // ---- Helpers ---- + + private static AbCipDriver BuildDriver(FakeAbCipTagFactory factory) => + new AbCipDriver(new AbCipDriverOptions + { + Devices = + [ + new AbCipDeviceOptions( + Primary, + PartnerHostAddress: Partner, + Hsby: new AbCipHsbyOptions + { + Enabled = true, + RoleTagAddress = "WallClockTime.SyncStatus", + ProbeInterval = TimeSpan.FromMilliseconds(40), + }), + ], + Probe = new AbCipProbeOptions { Enabled = false }, + }, "drv-hsby-failover", factory); + + private static AbCipDriver BuildDriverWithTag(FakeAbCipTagFactory factory, string tagName) => + new AbCipDriver(new AbCipDriverOptions + { + Devices = + [ + new AbCipDeviceOptions( + Primary, + PartnerHostAddress: Partner, + Hsby: new AbCipHsbyOptions + { + Enabled = true, + RoleTagAddress = "WallClockTime.SyncStatus", + ProbeInterval = TimeSpan.FromMilliseconds(40), + }), + ], + Tags = + [ + new AbCipTagDefinition( + Name: tagName, + DeviceHostAddress: Primary, + TagPath: tagName, + DataType: AbCipDataType.DInt, + Writable: true), + ], + Probe = new AbCipProbeOptions { Enabled = false }, + }, "drv-hsby-failover-tag", factory); + + private static async Task<(AbCipDriver Driver, FakeAbCipTagFactory Factory)> + BuildHsbyDriverAsync(int primaryRoleValue, int partnerRoleValue) + { + var factory = new FakeAbCipTagFactory + { + Customise = p => p.Gateway == "10.0.0.5" + ? 
new FakeAbCipTag(p) { Value = primaryRoleValue } + : new FakeAbCipTag(p) { Value = partnerRoleValue }, + }; + var drv = BuildDriver(factory); + await drv.InitializeAsync("{}", CancellationToken.None); + return (drv, factory); + } + + /// + /// Snapshot of the live primary + partner role-tag fakes the factory has handed + /// out, keyed by gateway. Populated by the Customise hook on the + /// via a side-effecting lambda; the + /// dict alone is insufficient because both + /// chassis use the same role-tag TagName + the dict overwrites on the second + /// create. + /// + private sealed class HsbyRoleTagTracker + { + public FakeAbCipTag? Primary { get; set; } + public FakeAbCipTag? Partner { get; set; } + } + + private static (FakeAbCipTagFactory Factory, HsbyRoleTagTracker Tracker) + BuildTrackingFactory(int initialPrimary, int initialPartner) + { + var tracker = new HsbyRoleTagTracker(); + var factory = new FakeAbCipTagFactory(); + factory.Customise = p => + { + if (p.TagName == "WallClockTime.SyncStatus") + { + var fake = new FakeAbCipTag(p) + { + Value = p.Gateway == "10.0.0.5" ? initialPrimary : initialPartner, + }; + if (p.Gateway == "10.0.0.5") tracker.Primary = fake; + else tracker.Partner = fake; + return fake; + } + // Non-role-tag handles (e.g. per-tag runtimes) — return a default fake. + return new FakeAbCipTag(p) { Value = 0 }; + }; + return (factory, tracker); + } + + /// + /// Mutate the live primary / partner role-tag fakes' Value so the next + /// probe-loop tick reads the new role. Probe loop reuses one runtime per chassis + /// once initialised, so direct mutation of is + /// sufficient — no re-create required. 
+ /// + private static void FlipRoles(HsbyRoleTagTracker tracker, int newPrimary, int newPartner) + { + if (tracker.Primary is not null) tracker.Primary.Value = newPrimary; + if (tracker.Partner is not null) tracker.Partner.Value = newPartner; + } + + private static Task WaitForActiveAsync(AbCipDriver drv, string expectedActive) => + WaitForAsync(() => drv.GetDeviceState(Primary)?.ActiveAddress == expectedActive); + + private static async Task WaitForAsync(Func<bool> condition, TimeSpan? timeout = null) + { + var deadline = DateTime.UtcNow + (timeout ?? TimeSpan.FromSeconds(2)); + while (!condition() && DateTime.UtcNow < deadline) + await Task.Delay(20); + } +}
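Reviewer note: the contract the unit tests above pin down (fall back to the configured primary when HSBY is off or no chassis is Active, drop cached handles and bump the counter on every active-address change) can be sketched in isolation as below. This is a minimal illustration, not the driver's implementation: `SketchDevice`, `Handles`, `SetActive`, and `Check` are hypothetical names, and counting the very first `null` to address change as a transition is an assumption the diff does not confirm.

```csharp
using System;
using System.Collections.Generic;

const string primary = "ab://10.0.0.5/1,0";
const string partner = "ab://10.0.0.6/1,0";

var device = new SketchDevice(primary, partner, hsbyEnabled: true);

// No chassis Active yet: resolution falls back to the configured primary.
Check(device.ResolveActiveHost() == primary);

// Cache a handle "created against" the primary, then let the partner go Active.
device.Handles["Motor01_Speed"] = "handle@10.0.0.5";
device.SetActive(partner);

Check(device.ResolveActiveHost() == partner); // routing follows the Active chassis
Check(device.Handles.Count == 0);             // stale handles were dropped
Check(device.FailoverCount == 1);             // one transition recorded

// Re-reporting the same address is not a transition.
device.SetActive(partner);
Check(device.FailoverCount == 1);

// With HSBY disabled, a planted active address is ignored.
var plain = new SketchDevice(primary, partner, hsbyEnabled: false);
plain.SetActive(partner);
Check(plain.ResolveActiveHost() == primary);

Console.WriteLine("ok");

static void Check(bool condition)
{
    if (!condition) throw new InvalidOperationException("sketch invariant violated");
}

// Hypothetical stand-in for the driver's per-device state; not the real DeviceState.
sealed class SketchDevice
{
    private readonly string _primary;
    private readonly string _partner;
    private readonly bool _hsbyEnabled;
    private string? _active;

    // Stand-in for the per-tag runtime cache (tag name -> handle description).
    public Dictionary<string, string> Handles { get; } = new Dictionary<string, string>();
    public int FailoverCount { get; private set; }

    public SketchDevice(string primary, string partner, bool hsbyEnabled)
    {
        _primary = primary;
        _partner = partner;
        _hsbyEnabled = hsbyEnabled;
    }

    // Same decision order as ResolveActiveHostFor in the diff:
    // HSBY off or no Active chassis means the configured primary.
    public string ResolveActiveHost()
    {
        if (!_hsbyEnabled) return _primary;
        return string.IsNullOrEmpty(_active) ? _primary : _active;
    }

    // Assumption: every change of the active address, including the first
    // one from null, clears the cache and counts as a transition.
    public void SetActive(string? newActive)
    {
        if (string.Equals(_active, newActive, StringComparison.OrdinalIgnoreCase)) return;
        _active = newActive;
        Handles.Clear();
        FailoverCount++;
    }
}
```

The sketch keeps the primary address as the stable lookup key and treats the active address as derived state, which is the same split the PR describes for `HostAddress` versus `ActiveAddress`. The real driver additionally disposes handles, resets the write coalescer, and uses `Interlocked.Increment` for the counter, none of which is modelled here.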