# AbCip: ControlLogix HSBY paired-IP support

PR abcip-5.1 + 5.2 ship **non-transparent** HSBY (Hot-Standby) awareness to the AB CIP driver. Each device may declare a partner gateway; when both gateways are up the driver concurrently probes a role tag on each chassis, reports which one is currently Active, and routes reads / writes through that chassis automatically.

- **PR abcip-5.1**: gathers + reports the role of each chassis through driver diagnostics. See [Role-tag detection matrix](#role-tag-detection-matrix) + [Active-resolution rules](#active-resolution-rules).
- **PR abcip-5.2**: wires the resolved active address into `AbCipDriver.ResolveHost` and the runtime-cache lifecycle. See [Failover behaviour](#failover-behaviour-pr-52) + [Failure-mode walkthrough](#failure-mode-walkthrough).

## When to use HSBY paired IPs

You have a redundant **ControlLogix** chassis pair (1756-RM redundancy module, two CPUs, one active + one standby) and the SCADA / OPC UA layer needs to keep talking to *whichever chassis is currently Active* without an operator manually re-pointing the connection.

Pre-5.1 the driver only knew about a single `HostAddress`. After a hot-standby switch-over, the standby (now Active) carried a **different IP** and the driver kept probing the dead-but-was-Active address until someone edited the config.

PR abcip-5.1 closes the visibility half of that gap by reading the role tag on both chassis. PR abcip-5.2 closes the routing half by re-pointing `ResolveHost` at the Active address each tick + invalidating the per-tag runtime cache + write-coalescer state on every flip.

## Configuration

```jsonc
{
  "Devices": [
    {
      "HostAddress": "ab://10.0.0.5/1,0",
      "PartnerHostAddress": "ab://10.0.0.6/1,0",
      "Hsby": {
        "Enabled": true,
        "RoleTagAddress": "WallClockTime.SyncStatus",
        "ProbeIntervalMs": 2000
      }
    }
  ]
}
```

| Field | Default | Notes |
|---|---|---|
| `PartnerHostAddress` | `null` | Canonical `ab://gateway[:port]/cip-path` of the partner chassis. `null` = no HSBY pair; the driver behaves exactly like every pre-5.1 build. |
| `Hsby.Enabled` | `false` | Master switch. When `false` (or `Hsby` omitted) no role probing happens, even if `PartnerHostAddress` is set. |
| `Hsby.RoleTagAddress` | `WallClockTime.SyncStatus` | Address of the role tag on each chassis. See [role-tag detection matrix](#role-tag-detection-matrix). |
| `Hsby.ProbeIntervalMs` | `2000` | How often each chassis is sampled, in milliseconds. 2 s is a good default: tight enough to detect a switch-over within one Admin-UI refresh, loose enough to leave headroom for the regular probe loop. |

## Feature-flag gate (`Redundancy.Hsby.Enabled`)

`Hsby.Enabled = false` (the default) is the off-switch for the entire feature. The role-probe loop never starts, the diagnostics keys are not emitted, and the driver behaves identically to a pre-5.1 build. This is the gate to flip when an operator wants to roll the feature out cautiously across a fleet: set `Hsby.Enabled = true` per-device in driver config (no build flag, no env var).

When the gate is on but the partner gateway is unreachable, the role-probe loop reports `HsbyRole.Unknown` for the partner each tick. The primary's role still drives the active-chassis resolution; the operator sees the partner's role as Unknown in the Admin UI / driver diagnostics, which is the correct surface for "we can't reach the standby chassis right now."

## Role-tag detection matrix

| Firmware / fronts | Address | Decode |
|---|---|---|
| **v20 / v24 / v32+ ControlLogix HSBY** | `WallClockTime.SyncStatus` (DINT) | `0` = Standby, `1` = Synchronized / Active, `2` = Disqualified, anything else = Unknown |
| **PLC-5 / SLC500 status-byte fallback** | `S:34` Module Status word | bit 0 = "this chassis is Active". Bit set → `Active`; clear → `Standby` |
| **Custom user role tag** | any DINT-typed CIP path | Same matrix as `WallClockTime.SyncStatus` (0 / 1 / 2). Out-of-range values → Unknown. |

`AbCipHsbyRoleProber.MapValueToRole` is the value-to-role mapper; unit tests in `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs` pin every row of the matrix.

## What gets reported

The driver surfaces three diagnostics counters per HSBY-enabled device, plus a driver-wide failover counter added in PR 5.2 (all visible via `driver-diagnostics` RPC + the Admin UI):

| Counter | Value |
|---|---|
| `AbCip.HsbyActive` | `1` if primary is Active, `2` if partner is Active, `0` if neither (or HSBY off) |
| `AbCip.HsbyPrimaryRole` | `(int)HsbyRole`: `0` = Unknown, `1` = Active, `2` = Standby, `3` = Disqualified |
| `AbCip.HsbyPartnerRole` | Same encoding as `HsbyPrimaryRole`, observed on the partner chassis |
| `AbCip.HsbyFailoverCount` (PR 5.2) | Total number of `ActiveAddress` transitions the probe loop has observed across every HSBY-enabled device on this driver. Each increment maps to one runtime-cache invalidation + write-coalescer reset. |

When more than one HSBY pair is configured on the same driver instance the flat keys are scoped per primary host: `AbCip.HsbyActive[ab://10.0.0.5/1,0]`, etc.

The `DeviceState.ActiveAddress` field (internal; surfaced via `HsbyActive` diagnostics) is the address PR 5.2 routes through `ResolveHost` + uses to scope the per-host bulkhead / breaker key. See [Failover behaviour](#failover-behaviour-pr-52) for the runtime implications.

### Active-resolution rules

| Primary role | Partner role | `ActiveAddress` resolution |
|---|---|---|
| Active | Standby / Disqualified / Unknown | primary |
| Standby / Disqualified / Unknown | Active | partner |
| Active | Active (split-brain) | **primary wins**, warning logged |
| Standby | Standby | `null`; PR 5.2's `ResolveHost` falls back to the configured primary, and the existing dial flow surfaces `BadCommunicationError` if the primary is also down. See [Both-stuck](#both-stuck-no-chassis-active). |
| Unknown | Unknown | `null` (same fallback as Standby + Standby) |

Split-brain (both chassis claim Active simultaneously) is a real production failure mode, typically a redundancy-module misconfiguration or a partial network split. The driver picks primary deterministically + emits a warning through `AbCipDriverOptions.OnWarning` so operators see it in the log.

## CLI flags

The `otopcua-abcip-cli` tool exposes the HSBY plumbing through two surfaces (see [Driver.AbCip.Cli.md](../Driver.AbCip.Cli.md) for the full CLI guide):

- `--partner`: global flag on every command. Sets `PartnerHostAddress` + auto-enables `Hsby.Enabled = true` so the role probe runs alongside any read / write / subscribe.
- `hsby-status`: dedicated command that prints which chassis is currently Active. Reads the role tag on both gateways for a few ticks + prints the `(primary, partner, active)` tuple.

```powershell
# Print which chassis is Active right now.
# Addresses are quoted because PowerShell parses an unquoted comma as the array operator.
otopcua-abcip-cli hsby-status -g 'ab://10.0.0.5/1,0' --partner 'ab://10.0.0.6/1,0'

# Subscribe through the active chassis (PR 5.2 follow-up: today the
# subscribe stays pointed at the primary; the role probe runs alongside).
# The backtick is PowerShell's line-continuation character.
otopcua-abcip-cli subscribe -g 'ab://10.0.0.5/1,0' --partner 'ab://10.0.0.6/1,0' `
  -t Motor01_Speed --type Real -i 500
```

## Test coverage

- **Unit** (`tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs`):
  - Pure `MapValueToRole` matrix (WallClockTime.SyncStatus + S:34 bit mask + Unknown values).
  - End-to-end driver loop: primary Active / partner Standby resolves to primary; both Active resolves to primary with a warning; both Standby clears `ActiveAddress`; primary read failure routes to partner.
  - Diagnostics surface (`AbCip.HsbyActive` / `HsbyPrimaryRole` / `HsbyPartnerRole`).
  - DTO JSON round-trip (`PartnerHostAddress` + `Hsby.{Enabled, RoleTagAddress, ProbeIntervalMs}` survive deserialise → driver → `DeviceState`).
  - `Hsby.Enabled = false` → no role probing.
- **Integration** (`tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/`):
  - `AbCipHsbyRoleProberTests.cs` (PR 5.1) and `AbCipHsbyFailoverTests.cs` (PR 5.2): both **skipped by default** (`Assert.Skip`). `ab_server` cannot emulate a ControlLogix HSBY pair (no `WallClockTime.SyncStatus`, no second chassis concept). The Docker `paired` profile (PR 5.1) brings up two `ab_server` instances + a stub `hsby-mux` sidecar so the topology is documented, but a patched `ab_server` image that actually serves the role tag is still on the follow-up list.
  - Trait `Category=Hsby` so `dotnet test --filter Category=Hsby` finds them once they're promoted.
- **End-to-end** (`scripts/e2e/test-abcip-hsby.ps1`, PR 5.2):
  - Paired-fixture variant of `test-abcip.ps1`. Subscribes to a tag through the OPC UA server, flips the active chassis mid-stream via the `hsby-mux` sidecar's `POST /flip` endpoint, asserts the stream survives + `AbCip.HsbyFailoverCount` increments. Gated on operator-supplied `BridgeNodeId` + a running paired fixture; ships unwired into `test-all.ps1` until the patched `ab_server` lands.

## Failover behaviour (PR 5.2)

PR 5.2 wires `DeviceState.ActiveAddress` into the read / write hot path through `AbCipDriver.ResolveHost` and the runtime-cache lifecycle. After the role-probe loop (PR 5.1) detects an active-address transition the driver re-points every wire-level operation at the now-Active chassis without operator intervention.
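The routing change can be pictured with a small model. The following is a minimal Python sketch, not the driver's C# code: `DeviceState`, `resolve_host`, and the field names are hypothetical stand-ins that mirror identifiers mentioned in this document, and the fallback to the configured primary when no chassis is Active follows the [Active-resolution rules](#active-resolution-rules).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceState:
    """Hypothetical stand-in for the driver's per-device state."""
    host_address: str                           # configured primary; never changes
    partner_host_address: Optional[str] = None
    hsby_enabled: bool = False
    active_address: Optional[str] = None        # maintained by the role-probe loop


def resolve_host(device: DeviceState) -> str:
    """Return the address that wire-level operations should target."""
    if not device.hsby_enabled:
        # HSBY off: always the configured primary.
        return device.host_address
    # HSBY on: follow the Active chassis, falling back to the configured
    # primary when the probe loop resolved no Active chassis (None).
    return device.active_address or device.host_address


# Devices stay keyed on the configured primary address, even after a flip.
devices = {
    "ab://10.0.0.5/1,0": DeviceState(
        host_address="ab://10.0.0.5/1,0",
        partner_host_address="ab://10.0.0.6/1,0",
        hsby_enabled=True,
        active_address="ab://10.0.0.5/1,0",
    )
}

device = devices["ab://10.0.0.5/1,0"]
before_flip = resolve_host(device)

device.active_address = device.partner_host_address  # probe loop observed a flip
after_flip = resolve_host(device)

device.active_address = None                          # no chassis is Active
fallback = resolve_host(device)
```

In this model the lookup key in `devices` is still the configured primary after the flip; only the value returned by `resolve_host` changes.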
### What flips on a failover

| Aspect | Pre-flip | Post-flip |
|---|---|---|
| `ResolveHost(tag)` return | primary `HostAddress` | the partner address (when partner is now Active) |
| Per-tag libplctag handles in `DeviceState.Runtimes` | created against primary gateway | dropped on flip; lazily re-created against the partner gateway on next read / write |
| Parent-DINT RMW handles in `DeviceState.ParentRuntimes` | primary gateway | dropped on flip; same re-create-on-demand path |
| `AbCipWriteCoalescer` per-device cache | last-known-written values from the primary | reset; the first write of any value to the partner pays the full round-trip |
| `LogicalInstanceMap` (Logical-mode `@tags` walk) | populated for primary | cleared; the next read on a Logical-mode device re-walks `@tags` against the partner |
| Per-host bulkhead key (Polly bulkhead + breaker, plan decision #144) | keyed on primary `HostAddress` | keyed on the new active address; the partner gets its own fresh breaker state instead of inheriting a tripped breaker from the now-standby |
| `AbCip.HsbyFailoverCount` diagnostic | 0 | incremented by 1 on every transition observed by the probe loop |

### How the invalidation runs

PR 5.2 introduces an internal `OnActiveAddressChanged` event raised by `HsbyProbeLoopAsync` on every `DeviceState.ActiveAddress` transition. The driver subscribes to it from its own constructor; the handler (`HandleActiveAddressChanged`) does the cache invalidation in one place:

1. Disposes every entry in `DeviceState.Runtimes` and `DeviceState.ParentRuntimes`, then clears both dicts. Disposed `IAbCipTagRuntime` instances release their underlying libplctag handles so the native heap doesn't leak.
2. Clears `DeviceState.LogicalInstanceMap` and resets `LogicalWalkComplete = false` so the next read on a Logical-mode device re-fires the `@tags` symbol walk against the new chassis.
3. Calls `AbCipWriteCoalescer.Reset(deviceHostAddress)` so cached "we already wrote 42" decisions don't stale-suppress the first partner-side write.
4. Resets `DeviceState.RuntimesAddress = null` so subsequent diagnostics observers see a fresh stamp on the next runtime creation.
5. `Interlocked.Increment` on the driver-wide `AbCip.HsbyFailoverCount` counter.

The handler is idempotent: a second event for the same address change is harmless because the dicts are already empty + the coalescer reset is itself idempotent.

### Bulkhead key semantics

The per-host resilience pipeline (Polly bulkhead + circuit breaker, plan decision #144) keys on whatever `IPerCallHostResolver.ResolveHost` returns. PR 5.2 changes that resolver so an HSBY-failed-over device returns the partner's address, which means:

- The **device-state lookup** (`_devices.TryGetValue`) keeps using the configured primary `HostAddress` as the dictionary key; that key never changes for the lifetime of a device, so multi-device configurations stay routable.
- The **resilience pipeline** (Polly bulkhead, breaker, retry policies) receives the active address as the host-name dimension. A breaker that tripped against the now-standby chassis (for example, because the primary went away) doesn't bleed over to the partner; the partner gets fresh limits + a closed breaker.

When HSBY is disabled (`Hsby.Enabled = false`) `ResolveHost` returns the configured primary `HostAddress` exactly as it always has (pre-5.2 behaviour, no double-key risk).

## Failure-mode walkthrough

PR 5.2 adds three failover surface areas to reason about. The tables below summarise the behaviour the driver reports + how an operator can inspect it.

### Primary-stuck (primary unreachable, partner Active)

The primary chassis goes away (network partition, power loss, a stuck Forward Open). The role-probe loop reads `HsbyRole.Unknown` for the primary and `HsbyRole.Active` for the partner.
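To make the role-to-address mapping for this scenario concrete, here is a minimal Python sketch of the active-resolution rules. It is illustrative only and not the driver's C# code: `resolve_active` and `hsby_active_code` are hypothetical helper names, while the role values follow the `(int)HsbyRole` encoding and the `AbCip.HsbyActive` counter values documented in [What gets reported](#what-gets-reported).

```python
from enum import IntEnum
from typing import Optional, Tuple


class HsbyRole(IntEnum):
    # Same integer encoding as the HsbyPrimaryRole / HsbyPartnerRole counters.
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


def resolve_active(
    primary_role: HsbyRole,
    partner_role: HsbyRole,
    primary_addr: str,
    partner_addr: str,
) -> Tuple[Optional[str], bool]:
    """Return (active_address, split_brain) per the active-resolution rules."""
    if primary_role == HsbyRole.ACTIVE:
        # Primary wins, including the split-brain case where both claim Active.
        return primary_addr, partner_role == HsbyRole.ACTIVE
    if partner_role == HsbyRole.ACTIVE:
        return partner_addr, False
    # Neither chassis is Active: no active address is resolved.
    return None, False


def hsby_active_code(active: Optional[str], primary_addr: str, partner_addr: str) -> int:
    """Map the resolved address to the AbCip.HsbyActive counter value."""
    if active == primary_addr:
        return 1
    if active == partner_addr:
        return 2
    return 0


PRIMARY = "ab://10.0.0.5/1,0"
PARTNER = "ab://10.0.0.6/1,0"

# Primary-stuck: primary unreachable (Unknown), partner Active.
primary_stuck, _ = resolve_active(HsbyRole.UNKNOWN, HsbyRole.ACTIVE, PRIMARY, PARTNER)
# Secondary-stuck: primary Active, partner unreachable (Unknown).
secondary_stuck, _ = resolve_active(HsbyRole.ACTIVE, HsbyRole.UNKNOWN, PRIMARY, PARTNER)
# Both-stuck: neither chassis reports Active.
both_stuck, _ = resolve_active(HsbyRole.STANDBY, HsbyRole.UNKNOWN, PRIMARY, PARTNER)
# Split-brain: both claim Active; primary wins and a warning would be raised.
split_addr, split_warning = resolve_active(HsbyRole.ACTIVE, HsbyRole.ACTIVE, PRIMARY, PARTNER)
```

Under these assumptions the primary-stuck role pair resolves to the partner address, which corresponds to `AbCip.HsbyActive = 2`.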
| Surface | Behaviour |
|---|---|
| `DeviceState.ActiveAddress` | partner address |
| `DeviceState.PrimaryRole` | `Unknown` |
| `DeviceState.PartnerRole` | `Active` |
| `ResolveHost(tag)` | partner address |
| Reads / writes | route through partner gateway transparently |
| `AbCip.HsbyFailoverCount` | incremented when the address transitioned away from the primary |
| `AbCip.HsbyActive` | `2` (partner is the active chassis) |
| Operator action | none required for routing; investigate why the primary is unreachable through the connectivity-probe loop's `_System/_ConnectionStatus` for the device |

### Secondary-stuck (partner unreachable, primary Active)

The partner chassis goes away (its OPC UA server is down, its IP is unreachable, the redundancy module unhitched it). The probe loop reads `HsbyRole.Active` for the primary and `HsbyRole.Unknown` for the partner.

| Surface | Behaviour |
|---|---|
| `DeviceState.ActiveAddress` | primary address (no transition; this is the steady state) |
| `DeviceState.PrimaryRole` | `Active` |
| `DeviceState.PartnerRole` | `Unknown` |
| `ResolveHost(tag)` | primary address |
| Reads / writes | route through primary gateway exactly as in a non-HSBY deployment |
| `AbCip.HsbyFailoverCount` | unchanged; no flip happened |
| `AbCip.HsbyActive` | `1` (primary is the active chassis) |
| Operator action | investigate why the partner is unreachable; the operational risk is that a future primary-side outage has no fall-back |

### Both-stuck (no chassis Active)

Both chassis report `Standby` / `Disqualified` / `Unknown` (a redundancy-module misconfiguration, both controllers in Program mode, or both unreachable).
| Surface | Behaviour |
|---|---|
| `DeviceState.ActiveAddress` | `null` |
| `ResolveHost(tag)` | falls back to the configured primary `HostAddress` |
| Reads / writes | dispatched to the configured primary; a stuck primary surfaces `BadCommunicationError` per the existing dial flow |
| `AbCip.HsbyActive` | `0` (no chassis Active) |
| `AbCip.HsbyFailoverCount` | incremented when the transition `Active → null` happened |
| Operator action | investigate the redundancy module / mode keys; the SCADA layer sees stuck-or-bad-quality reads, not incorrect routing |

The "fall back to primary on null Active" choice is deliberate. Routing all reads to a deterministic chassis (the configured primary) keeps the breaker key + bulkhead state stable while the operator diagnoses the double-down outage; the alternative (round-robin / partner) would just trip both breakers in turn and obscure which chassis is the real problem.

## Follow-ups (beyond PR 5.2)

- **Patched `ab_server` image**: add a writable `WallClockTime.SyncStatus` tag (or a separate Python shim) so the Docker `paired` profile can exercise the wire-level role probe + the `tests/.../IntegrationTests/AbCipHsbyFailoverTests.cs` scaffold can flip its `Assert.Skip` for a real integration assertion.
- **`hsby-mux` REST endpoint**: `POST /flip {"active": "primary"}` writes `1` to the chosen chassis + `0` to the other so integration tests + `scripts/e2e/test-abcip-hsby.ps1` can drive switch-overs deterministically.
- **GuardLogix HSBY**: same role-tag plumbing applies; verify against a real 1756-L8xS pair when one is on-site.

## See also

- [`docs/Driver.AbCip.Cli.md`](../Driver.AbCip.Cli.md): `--partner` flag + `hsby-status` command reference
- [`docs/drivers/AbServer-Test-Fixture.md`](AbServer-Test-Fixture.md) §"What it does NOT cover": HSBY entry
- [`docs/Redundancy.md`](../Redundancy.md): server-level (OPC UA-stack) redundancy; HSBY is the **driver-level** companion