Auto: abcip-5.2 — HSBY failover routing in ResolveHost

Closes #243
2026-04-26 08:13:41 -04:00
parent 258ce8e937
commit 9e157fc8a4
5 changed files with 1031 additions and 42 deletions
@@ -1,14 +1,18 @@
 # AbCip — ControlLogix HSBY paired-IP support

-PR abcip-5.1 adds **non-transparent** HSBY (Hot-Standby) awareness to the AB
-CIP driver. Each device may declare a partner gateway; when both gateways are
-up the driver concurrently probes a role tag on each chassis and reports
-which one is currently Active.
+PR abcip-5.1 + 5.2 ship **non-transparent** HSBY (Hot-Standby) awareness
+to the AB CIP driver. Each device may declare a partner gateway; when both
+gateways are up the driver concurrently probes a role tag on each chassis,
+reports which one is currently Active, and routes reads / writes through
+that chassis automatically.

-PR abcip-5.1 only **gathers + reports** the role. PR abcip-5.2 is the
-follow-up that wires the resolved active address into
-`AbCipDriver.ResolveHost` so reads and writes route to whichever chassis is
-Active without operator intervention.
+- **PR abcip-5.1** — gathers + reports the role of each chassis through
+  driver diagnostics. See [Role-tag detection matrix](#role-tag-detection-matrix)
+  + [Active-resolution rules](#active-resolution-rules).
+- **PR abcip-5.2** — wires the resolved active address into
+  `AbCipDriver.ResolveHost` and the runtime-cache lifecycle. See
+  [Failover behaviour](#failover-behaviour-pr-52) +
+  [Failure-mode walkthrough](#failure-mode-walkthrough).

 ## When to use HSBY paired IPs

@@ -24,7 +28,8 @@ edited the config.

 PR abcip-5.1 closes the visibility half of that gap by reading the role tag
 on both chassis. PR abcip-5.2 closes the routing half by re-pointing
-`ResolveHost` at the Active address each tick.
+`ResolveHost` at the Active address each tick + invalidating the per-tag
+runtime cache + write-coalescer state on every flip.

 ## Configuration

@@ -88,14 +93,17 @@ The driver surfaces three diagnostics counters per HSBY-enabled device
 | `AbCip.HsbyActive` | `1` if primary is Active, `2` if partner is Active, `0` if neither (or HSBY off) |
 | `AbCip.HsbyPrimaryRole` | `(int)HsbyRole` — `0` = Unknown, `1` = Active, `2` = Standby, `3` = Disqualified |
 | `AbCip.HsbyPartnerRole` | Same encoding as `HsbyPrimaryRole`, observed on the partner chassis |
+| `AbCip.HsbyFailoverCount` (PR 5.2) | Total number of `ActiveAddress` transitions the probe loop has observed across every HSBY-enabled device on this driver. Each increment maps to one runtime-cache invalidation + write-coalescer reset. |

 When more than one HSBY pair is configured on the same driver instance the
 flat keys are scoped per primary host: `AbCip.HsbyActive[ab://10.0.0.5/1,0]`,
 etc.

 The `DeviceState.ActiveAddress` field (internal; surfaced via
-`HsbyActive` diagnostics) is the address PR 5.2 will route through
-`ResolveHost`.
+`HsbyActive` diagnostics) is the address PR 5.2 routes through
+`ResolveHost` + uses to scope the per-host bulkhead / breaker key.
+See [Failover behaviour](#failover-behaviour-pr-52) for the runtime
+implications.

 ### Active-resolution rules

@@ -104,8 +112,8 @@ The `DeviceState.ActiveAddress` field (internal; surfaced via
 | Active | Standby / Disqualified / Unknown | primary |
 | Standby / Disqualified / Unknown | Active | partner |
 | Active | Active (split-brain) | **primary wins**, warning logged |
-| Standby + Standby | Standby + Standby | `null` (PR 5.2 will surface as `BadCommunicationError`) |
-| Unknown + Unknown | Unknown + Unknown | `null` |
+| Standby + Standby | Standby + Standby | `null` — PR 5.2's `ResolveHost` falls back to the configured primary; the existing dial flow surfaces `BadCommunicationError` if the primary is also down. See [Both-stuck](#both-stuck-no-chassis-active). |
+| Unknown + Unknown | Unknown + Unknown | `null` (same fallback as Standby + Standby) |
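
The resolution rules in the table above, together with the PR 5.2 fallback, can be modelled in a few lines. This is a hedged Python sketch rather than the driver's C# code: `HsbyRole` mirrors the documented encoding, while `resolve_active` and `resolve_host` are hypothetical names used only for illustration.

```python
from enum import IntEnum
from typing import Optional


class HsbyRole(IntEnum):
    # Encoding as documented for AbCip.HsbyPrimaryRole / AbCip.HsbyPartnerRole.
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


def resolve_active(primary: str, partner: str,
                   primary_role: HsbyRole, partner_role: HsbyRole) -> Optional[str]:
    """Hypothetical model of the active-resolution table."""
    if primary_role is HsbyRole.ACTIVE:
        # Also covers the split-brain row: primary wins (the driver logs a warning).
        return primary
    if partner_role is HsbyRole.ACTIVE:
        return partner
    return None  # neither chassis claims Active


def resolve_host(primary: str, active: Optional[str]) -> str:
    """Hypothetical model of the PR 5.2 fallback: a null Active routes to the primary."""
    return active if active is not None else primary


PRIMARY, PARTNER = "ab://10.0.0.5/1,0", "ab://10.0.0.6/1,0"
split_brain = resolve_active(PRIMARY, PARTNER, HsbyRole.ACTIVE, HsbyRole.ACTIVE)
failed_over = resolve_active(PRIMARY, PARTNER, HsbyRole.UNKNOWN, HsbyRole.ACTIVE)
both_stuck = resolve_active(PRIMARY, PARTNER, HsbyRole.STANDBY, HsbyRole.STANDBY)
```

Checking the primary first is what makes "primary wins" fall out of the split-brain case without a dedicated branch.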

 Split-brain (both chassis claim Active simultaneously) is a real
 production failure mode — typically a redundancy-module misconfiguration or
@@ -150,28 +158,167 @@ otopcua-abcip-cli subscribe -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0 \
      RoleTagAddress, ProbeIntervalMs}` survive deserialise → driver →
      `DeviceState`).
    - `Hsby.Enabled = false` → no role probing.
-- **Integration** (`tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbCipHsbyRoleProberTests.cs`):
-    - **Skipped by default** (`Assert.Skip`) — `ab_server` cannot emulate
-      a ControlLogix HSBY pair (no `WallClockTime.SyncStatus`, no second
-      chassis concept). The Docker `paired` profile (PR 5.1) brings up two
-      `ab_server` instances + a stub `hsby-mux` sidecar so the topology is
-      documented, but PR 5.2 follow-up needs a patched `ab_server` image
-      that actually serves the role tag before the integration test can
-      assert anything against the wire.
-    - Trait `Category=Hsby` so `dotnet test --filter Category=Hsby` finds
-      this test once it's promoted.
+- **Integration** (`tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/`):
+    - `AbCipHsbyRoleProberTests.cs` (PR 5.1) and
+      `AbCipHsbyFailoverTests.cs` (PR 5.2) — both **skipped by default**
+      (`Assert.Skip`). `ab_server` cannot emulate a ControlLogix HSBY
+      pair (no `WallClockTime.SyncStatus`, no second chassis concept).
+      The Docker `paired` profile (PR 5.1) brings up two `ab_server`
+      instances + a stub `hsby-mux` sidecar so the topology is
+      documented, but a patched `ab_server` image that actually serves
+      the role tag is still on the follow-up list.
+    - Trait `Category=Hsby` so `dotnet test --filter Category=Hsby`
+      finds them once they're promoted.
+- **End-to-end** (`scripts/e2e/test-abcip-hsby.ps1`, PR 5.2):
+    - Paired-fixture variant of `test-abcip.ps1`. Subscribes to a tag
+      through the OPC UA server, flips the active chassis mid-stream
+      via the `hsby-mux` sidecar's `POST /flip` endpoint, asserts the
+      stream survives + `AbCip.HsbyFailoverCount` increments. Gated
+      on operator-supplied `BridgeNodeId` + a running paired fixture;
+      ships unwired into `test-all.ps1` until the patched `ab_server`
+      lands.

-## Follow-ups (PR 5.2 + beyond)
+## Failover behaviour (PR 5.2)
+
+PR 5.2 wires `DeviceState.ActiveAddress` into the read / write hot path
+through `AbCipDriver.ResolveHost` and the runtime-cache lifecycle. After
+the role-probe loop (PR 5.1) detects an active-address transition the
+driver re-points every wire-level operation at the now-Active chassis
+without operator intervention.
+
+### What flips on a failover
+
+| Aspect | Pre-flip | Post-flip |
+|---|---|---|
+| `ResolveHost(tag)` return | primary `HostAddress` | the partner address (when partner is now Active) |
+| Per-tag libplctag handles in `DeviceState.Runtimes` | created against primary gateway | dropped on flip; lazily re-created against the partner gateway on next read / write |
+| Parent-DINT RMW handles in `DeviceState.ParentRuntimes` | primary gateway | dropped on flip; same re-create-on-demand path |
+| `AbCipWriteCoalescer` per-device cache | last-known-written values from the primary | reset; the first write of any value to the partner pays the full round-trip |
+| `LogicalInstanceMap` (Logical-mode `@tags` walk) | populated for primary | cleared; the next read on a Logical-mode device re-walks `@tags` against the partner |
+| Per-host bulkhead key (Polly bulkhead + breaker, plan decision #144) | keyed on primary `HostAddress` | keyed on the new active address — the partner gets its own fresh breaker state instead of inheriting a tripped breaker from the now-standby |
+| `AbCip.HsbyFailoverCount` diagnostic | previous count (`0` before the first flip) | incremented by 1 on every transition observed by the probe loop |
+
+### How the invalidation runs
+
+PR 5.2 introduces an internal `OnActiveAddressChanged` event raised by
+`HsbyProbeLoopAsync` on every `DeviceState.ActiveAddress` transition. The
+driver subscribes to it from its own constructor; the handler
+(`HandleActiveAddressChanged`) does the cache invalidation in one place:
+
+1. Disposes every entry in `DeviceState.Runtimes` and
+   `DeviceState.ParentRuntimes`, then clears both dicts. Disposed
+   `IAbCipTagRuntime` instances release their underlying libplctag
+   handles so the native heap doesn't leak.
+2. Clears `DeviceState.LogicalInstanceMap` and resets
+   `LogicalWalkComplete = false` so the next read on a Logical-mode
+   device re-fires the `@tags` symbol walk against the new chassis.
+3. Calls `AbCipWriteCoalescer.Reset(deviceHostAddress)` so cached
+   "we already wrote 42" decisions don't stale-suppress the first
+   partner-side write.
+4. Resets `DeviceState.RuntimesAddress = null` so subsequent
+   diagnostics observers see a fresh stamp on the next runtime
+   creation.
+5. `Interlocked.Increment` on the driver-wide
+   `AbCip.HsbyFailoverCount` counter.
+
+The invalidation steps (1 to 4) are idempotent: a second event for the
+same address change is harmless because the dicts are already empty +
+the coalescer reset is itself idempotent.
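
The five handler steps can be sketched as a single-threaded Python model. Everything here is an illustrative assumption: the snake_case fields stand in for the documented `DeviceState` members, `FakeRuntime` stands in for `IAbCipTagRuntime`, the coalescer is reduced to a plain dict, and the tag name is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class FakeRuntime:
    """Stand-in for an IAbCipTagRuntime; it only records disposal."""

    def __init__(self) -> None:
        self.disposed = False

    def dispose(self) -> None:
        self.disposed = True


@dataclass
class DeviceState:
    host_address: str  # configured primary; never changes
    runtimes: Dict[str, FakeRuntime] = field(default_factory=dict)
    parent_runtimes: Dict[str, FakeRuntime] = field(default_factory=dict)
    logical_instance_map: Dict[str, int] = field(default_factory=dict)
    logical_walk_complete: bool = False
    runtimes_address: Optional[str] = None


class Driver:
    def __init__(self) -> None:
        self.failover_count = 0
        # Hypothetical coalescer cache: device host -> {tag: last written value}.
        self.coalescer: Dict[str, Dict[str, object]] = {}

    def handle_active_address_changed(self, device: DeviceState) -> None:
        # 1. Dispose and drop the per-tag and parent handles.
        for cache in (device.runtimes, device.parent_runtimes):
            for runtime in cache.values():
                runtime.dispose()
            cache.clear()
        # 2. Force a fresh @tags walk on the next Logical-mode read.
        device.logical_instance_map.clear()
        device.logical_walk_complete = False
        # 3. Forget last-written values for this device.
        self.coalescer.pop(device.host_address, None)
        # 4. Clear the address stamp.
        device.runtimes_address = None
        # 5. Count the transition (plain += here; the doc names Interlocked.Increment).
        self.failover_count += 1


device = DeviceState("ab://10.0.0.5/1,0", logical_walk_complete=True)
runtime = FakeRuntime()
device.runtimes["Motor1.Speed"] = runtime  # hypothetical tag name
driver = Driver()
driver.coalescer[device.host_address] = {"Motor1.Speed": 42}
driver.handle_active_address_changed(device)
```

After the call the handle is disposed, both caches are empty, and the next read has to rebuild its state against whichever chassis `ResolveHost` now returns.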
+
+### Bulkhead key semantics
+
+The per-host resilience pipeline (Polly bulkhead + circuit breaker, plan
+decision #144) keys on whatever `IPerCallHostResolver.ResolveHost`
+returns. PR 5.2 changes that resolver so an HSBY-failed-over device
+returns the partner's address, which means:
+
+- The **device-state lookup** (`_devices.TryGetValue`) keeps using the
+  configured primary `HostAddress` as the dictionary key — that key
+  never changes for the lifetime of a device, so multi-device
+  configurations stay routable.
+- The **resilience pipeline** (Polly bulkhead, breaker, retry policies)
+  receives the active address as the host-name dimension. A breaker
+  that tripped against the primary (for example, because the primary
+  went away) doesn't bleed over to the partner; the partner gets fresh
+  limits + a closed breaker.
+
+When HSBY is disabled (`Hsby.Enabled = false`) `ResolveHost` returns the
+configured primary `HostAddress` exactly as it always has — pre-5.2
+behaviour, no double-key risk.
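
The split between the device-lookup key and the resilience key can be illustrated with a toy model. This is a sketch under stated assumptions: `Breaker` is a deliberately simplified stand-in (not Polly), and `Device`, `resolve_host`, and `breaker_for` are hypothetical names chosen for the example.

```python
from typing import Dict, Optional


class Breaker:
    """Toy circuit breaker: open once `threshold` failures have been recorded."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record_failure(self) -> None:
        self.failures += 1


class Device:
    def __init__(self, host_address: str, hsby_enabled: bool) -> None:
        self.host_address = host_address  # dictionary key; never changes
        self.hsby_enabled = hsby_enabled
        self.active_address: Optional[str] = None


devices: Dict[str, Device] = {}
breakers: Dict[str, Breaker] = {}


def resolve_host(device_key: str) -> str:
    device = devices[device_key]  # lookup stays on the configured primary key
    if device.hsby_enabled and device.active_address is not None:
        return device.active_address
    return device.host_address  # HSBY off, or no chassis Active


def breaker_for(device_key: str) -> Breaker:
    # Resilience state is keyed on the resolved host, not on the device key.
    return breakers.setdefault(resolve_host(device_key), Breaker())


PRIMARY, PARTNER = "ab://10.0.0.5/1,0", "ab://10.0.0.6/1,0"
devices[PRIMARY] = Device(PRIMARY, hsby_enabled=True)
devices[PRIMARY].active_address = PRIMARY
for _ in range(3):
    breaker_for(PRIMARY).record_failure()  # primary goes away; its breaker opens
primary_open = breaker_for(PRIMARY).is_open
devices[PRIMARY].active_address = PARTNER  # probe loop observes the flip
partner_open = breaker_for(PRIMARY).is_open
```

The device is still found under its primary key after the flip, but the breaker consulted is a fresh one keyed on the partner address.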
+
+## Failure-mode walkthrough
+
+PR 5.2 adds three failover surface areas to reason about. The tables
+below summarise, per scenario, the behaviour the driver reports + how
+an operator can inspect it.
+
+### Primary-stuck (primary unreachable, partner Active)
+
+The primary chassis goes away (network partition, power loss, a stuck
+Forward Open). The role-probe loop reads `HsbyRole.Unknown` for the
+primary and `HsbyRole.Active` for the partner.
+
+| Surface | Behaviour |
+|---|---|
+| `DeviceState.ActiveAddress` | partner address |
+| `DeviceState.PrimaryRole` | `Unknown` |
+| `DeviceState.PartnerRole` | `Active` |
+| `ResolveHost(tag)` | partner address |
+| Reads / writes | route through partner gateway transparently |
+| `AbCip.HsbyFailoverCount` | incremented when the address transitioned away from the primary |
+| `AbCip.HsbyActive` | `2` (partner is the active chassis) |
+| Operator action | none required for routing; use the connectivity-probe loop's `_System/_ConnectionStatus` for the device to investigate why the primary is unreachable |
+
+### Secondary-stuck (partner unreachable, primary Active)
+
+The partner chassis goes away (its OPC UA server is down, its IP is
+unreachable, the redundancy module unhitched it). The probe loop reads
+`HsbyRole.Active` for the primary and `HsbyRole.Unknown` for the partner.
+
+| Surface | Behaviour |
+|---|---|
+| `DeviceState.ActiveAddress` | primary address (no transition; this is the steady state) |
+| `DeviceState.PrimaryRole` | `Active` |
+| `DeviceState.PartnerRole` | `Unknown` |
+| `ResolveHost(tag)` | primary address |
+| Reads / writes | route through primary gateway exactly as in a non-HSBY deployment |
+| `AbCip.HsbyFailoverCount` | unchanged — no flip happened |
+| `AbCip.HsbyActive` | `1` (primary is the active chassis) |
+| Operator action | investigate why the partner is unreachable; the operational risk is that a future primary-side outage has no fall-back |
+
+### Both-stuck (no chassis Active)
+
+Both chassis report `Standby` / `Disqualified` / `Unknown` (a
+redundancy-module misconfiguration, both controllers in Program mode,
+or both unreachable).
+
+| Surface | Behaviour |
+|---|---|
+| `DeviceState.ActiveAddress` | `null` |
+| `ResolveHost(tag)` | falls back to the configured primary `HostAddress` |
+| Reads / writes | dispatched to the configured primary; a stuck primary surfaces `BadCommunicationError` per the existing dial flow |
+| `AbCip.HsbyActive` | `0` (no chassis Active) |
+| `AbCip.HsbyFailoverCount` | incremented when the transition `Active → null` happened |
+| Operator action | investigate the redundancy module / mode keys; the SCADA layer sees stuck-or-bad-quality reads, not incorrect routing |
+
+The "fall back to primary on null Active" choice is deliberate. Routing
+all reads to a deterministic chassis (the configured primary) keeps the
+breaker key + bulkhead state stable while the operator diagnoses the
+double outage; the alternative (round-robin or routing to the partner) would just
+trip both breakers in turn and obscure which chassis is the real
+problem.
+
+## Follow-ups (beyond PR 5.2)

-- **PR 5.2** — wire `ActiveAddress` into `ResolveHost` so reads/writes
-  route to the live chassis automatically. Today's PR only **gathers** the
-  role.
 - **Patched `ab_server` image** — add a writable `WallClockTime.SyncStatus`
  tag (or a separate Python shim) so the Docker `paired` profile can
-  exercise the wire-level role probe.
+  exercise the wire-level role probe + the
+  `tests/.../IntegrationTests/AbCipHsbyFailoverTests.cs` scaffold can
+  flip its `Assert.Skip` for a real integration assertion.
 - **`hsby-mux` REST endpoint** — `POST /flip {"active": "primary"}` writes
-  `1` to the chosen chassis + `0` to the other so integration tests can
-  drive switch-overs deterministically.
+  `1` to the chosen chassis + `0` to the other so integration tests +
+  `scripts/e2e/test-abcip-hsby.ps1` can drive switch-overs
+  deterministically.
 - **GuardLogix HSBY** — same role-tag plumbing applies; verify against a
  real 1756-L8xS pair when one is on-site.