
Joseph Doherty 9e157fc8a4 Auto: abcip-5.2 — HSBY failover routing in ResolveHost

Closes #243

2026-04-26 08:13:41 -04:00


AbCip — ControlLogix HSBY paired-IP support

PR abcip-5.1 + 5.2 ship non-transparent HSBY (Hot-Standby) awareness to the AB CIP driver. Each device may declare a partner gateway; when both gateways are up the driver concurrently probes a role tag on each chassis, reports which one is currently Active, and routes reads / writes through that chassis automatically.

- PR abcip-5.1: gathers + reports the role of each chassis through driver diagnostics. See Role-tag detection matrix + Active-resolution rules.
- PR abcip-5.2: wires the resolved active address into AbCipDriver.ResolveHost and the runtime-cache lifecycle. See Failover behaviour + Failure-mode walkthrough.

When to use HSBY paired IPs

You have a redundant ControlLogix chassis pair (1756-RM redundancy module, two CPUs, one Active + one Standby) and the SCADA / OPC UA layer needs to keep talking to whichever chassis is currently Active without an operator manually re-pointing the connection.

Pre-5.1 the driver only knew about a single HostAddress. After a hot-standby switch-over, the standby (now Active) carried a different IP and the driver kept probing the dead-but-was-Active address until someone edited the config.

PR abcip-5.1 closes the visibility half of that gap by reading the role tag on both chassis. PR abcip-5.2 closes the routing half by re-pointing ResolveHost at the Active address each tick + invalidating the per-tag runtime cache + write-coalescer state on every flip.

Configuration

{
    "Devices": [
        {
            "HostAddress": "ab://10.0.0.5/1,0",
            "PartnerHostAddress": "ab://10.0.0.6/1,0",
            "Hsby": {
                "Enabled": true,
                "RoleTagAddress": "WallClockTime.SyncStatus",
                "ProbeIntervalMs": 2000
            }
        }
    ]
}

Field	Default	Notes
`PartnerHostAddress`	`null`	Canonical `ab://gateway[:port]/cip-path` of the partner chassis. `null` = no HSBY pair; the driver behaves exactly like every pre-5.1 build.
`Hsby.Enabled`	`false`	Master switch. When `false` (or `Hsby` omitted) no role probing happens, even if `PartnerHostAddress` is set.
`Hsby.RoleTagAddress`	`WallClockTime.SyncStatus`	Address of the role tag on each chassis. See role-tag detection matrix.
`Hsby.ProbeIntervalMs`	`2000`	How often each chassis is sampled. 2 s is a good default — tight enough to detect a switch-over within one Admin-UI refresh, loose enough to leave headroom for the regular probe loop.
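
As a rough illustration of how the defaults in the table apply, the Python sketch below reads one device entry and fills in any missing fields. `HsbySettings` and `load_hsby_settings` are hypothetical names for this example and are not part of the driver. The `probing_active` rule (master switch on and a partner address present) is an assumption drawn from the notes above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HsbySettings:
    """Hypothetical view of one device's HSBY settings (illustration only)."""
    partner_host_address: Optional[str]
    enabled: bool
    role_tag_address: str
    probe_interval_ms: int

    @property
    def probing_active(self) -> bool:
        # Assumption: role probing needs the master switch AND a partner address.
        return self.enabled and self.partner_host_address is not None


def load_hsby_settings(device: dict) -> HsbySettings:
    """Apply the documented defaults to a single entry of the Devices array."""
    hsby = device.get("Hsby") or {}
    return HsbySettings(
        partner_host_address=device.get("PartnerHostAddress"),
        enabled=hsby.get("Enabled", False),
        role_tag_address=hsby.get("RoleTagAddress", "WallClockTime.SyncStatus"),
        probe_interval_ms=hsby.get("ProbeIntervalMs", 2000),
    )


config = json.loads("""
{"Devices": [{"HostAddress": "ab://10.0.0.5/1,0",
              "PartnerHostAddress": "ab://10.0.0.6/1,0",
              "Hsby": {"Enabled": true}}]}
""")
settings = load_hsby_settings(config["Devices"][0])
```

Here `settings` picks up the default role tag and the 2000 ms interval because the entry only sets `Enabled`.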

Feature-flag gate (`Redundancy.Hsby.Enabled`)

Hsby.Enabled = false (the default) is the off-switch for the entire feature. The role-probe loop never starts, the diagnostics keys are not emitted, and the driver behaves identically to a pre-5.1 build. This is the gate to flip when an operator wants to roll the feature out cautiously across a fleet — set Hsby.Enabled = true per-device in driver config (no build flag, no env var).

When the gate is on but the partner gateway is unreachable, the role-probe loop reports HsbyRole.Unknown for the partner each tick. The primary's role still drives the active-chassis resolution; the operator sees the partner's role as Unknown in the Admin UI / driver diagnostics, which is the correct surface for "we can't reach the standby chassis right now."
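
The unreachable-partner case can be sketched as one probe tick that reads both chassis concurrently and turns any failure into Unknown. This is a Python illustration only; `probe_one`, `probe_tick`, and the `read_role` callback are hypothetical stand-ins for the driver's role-tag read, and the one-second timeout is an assumed value.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

# Hypothetical callback: takes a gateway address, returns a role name.
ReadRole = Callable[[str], Awaitable[str]]


async def probe_one(read_role: ReadRole, gateway: str, timeout_s: float) -> str:
    """Read one chassis's role; any I/O failure or timeout becomes Unknown."""
    try:
        return await asyncio.wait_for(read_role(gateway), timeout=timeout_s)
    except (OSError, asyncio.TimeoutError):
        return "Unknown"


async def probe_tick(read_role: ReadRole, primary: str, partner: str,
                     timeout_s: float = 1.0) -> Tuple[str, str]:
    """Probe both chassis concurrently; returns (primary_role, partner_role)."""
    primary_role, partner_role = await asyncio.gather(
        probe_one(read_role, primary, timeout_s),
        probe_one(read_role, partner, timeout_s),
    )
    return primary_role, partner_role


async def fake_read_role(gateway: str) -> str:
    # Stand-in for a real tag read: the partner gateway is unreachable.
    if gateway == "ab://10.0.0.6/1,0":
        raise ConnectionError("partner unreachable")
    return "Active"


roles = asyncio.run(
    probe_tick(fake_read_role, "ab://10.0.0.5/1,0", "ab://10.0.0.6/1,0")
)
```

Because each read is wrapped individually, a failure on the partner does not disturb the primary's result for the same tick.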

Role-tag detection matrix

Firmware / fronts	Address	Decode
v20 / v24 / v32+ ControlLogix HSBY	`WallClockTime.SyncStatus` (DINT)	`0` = Standby, `1` = Synchronized / Active, `2` = Disqualified, anything else = Unknown
PLC-5 / SLC500 status-byte fallback	`S:34` Module Status word	bit 0 = "this chassis is Active". Bit set → `Active`; clear → `Standby`
Custom user role tag	any DINT-typed CIP path	Same matrix as `WallClockTime.SyncStatus` (0 / 1 / 2). Out-of-range values → Unknown.

AbCipHsbyRoleProber.MapValueToRole is the value-to-role mapper; unit tests in tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs pin every row of the matrix.
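
To make the matrix concrete, here is a minimal Python sketch of the two decode rules. It is not the driver's `MapValueToRole` code. The function names are hypothetical, and the enum values are borrowed from the `(int)HsbyRole` encoding listed under What gets reported.

```python
from enum import IntEnum


class HsbyRole(IntEnum):
    # Numeric values follow the documented (int)HsbyRole diagnostics encoding.
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


# WallClockTime.SyncStatus (and custom DINT role tags): 0 / 1 / 2.
_SYNC_STATUS_ROLES = {
    0: HsbyRole.STANDBY,
    1: HsbyRole.ACTIVE,
    2: HsbyRole.DISQUALIFIED,
}


def map_sync_status(value: int) -> HsbyRole:
    """Decode a DINT role tag; anything outside 0..2 is Unknown."""
    return _SYNC_STATUS_ROLES.get(value, HsbyRole.UNKNOWN)


def map_status_word(word: int) -> HsbyRole:
    """Decode the S:34 fallback: bit 0 set means Active, clear means Standby."""
    return HsbyRole.ACTIVE if word & 0x1 else HsbyRole.STANDBY
```

Note the asymmetry: the status-word fallback can only ever yield Active or Standby, while the DINT decode can also produce Disqualified and Unknown.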

What gets reported

The driver surfaces three diagnostics counters per HSBY-enabled device, plus a driver-wide failover counter added in PR 5.2 (all visible via driver-diagnostics RPC + the Admin UI):

Counter	Value
`AbCip.HsbyActive`	`1` if primary is Active, `2` if partner is Active, `0` if neither (or HSBY off)
`AbCip.HsbyPrimaryRole`	`(int)HsbyRole` — `0` = Unknown, `1` = Active, `2` = Standby, `3` = Disqualified
`AbCip.HsbyPartnerRole`	Same encoding as `HsbyPrimaryRole`, observed on the partner chassis
`AbCip.HsbyFailoverCount` (PR 5.2)	Total number of `ActiveAddress` transitions the probe loop has observed across every HSBY-enabled device on this driver. Each increment maps to one runtime-cache invalidation + write-coalescer reset.

When more than one HSBY pair is configured on the same driver instance the flat keys are scoped per primary host: AbCip.HsbyActive[ab://10.0.0.5/1,0], etc.
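
The key layout can be sketched as follows. This is a Python illustration under assumptions: `hsby_diagnostics` and the dict field names are hypothetical, and the sketch assumes the driver-wide `AbCip.HsbyFailoverCount` stays unscoped because it is described as a total across devices.

```python
def hsby_diagnostics(devices, failover_count):
    """Build a diagnostics map for a list of HSBY-enabled devices.

    Each device is a dict with 'primary', 'partner', 'active' (an address or
    None), 'primary_role' and 'partner_role' (ints in the HsbyRole encoding).
    """
    scoped = len(devices) > 1  # flat keys for one pair, per-host keys otherwise
    # Assumption: the failover count is driver-wide, so it is never scoped.
    diagnostics = {"AbCip.HsbyFailoverCount": failover_count}
    for device in devices:
        suffix = f"[{device['primary']}]" if scoped else ""
        if device["active"] is None:
            active = 0          # neither chassis is Active
        elif device["active"] == device["primary"]:
            active = 1          # primary is Active
        else:
            active = 2          # partner is Active
        diagnostics[f"AbCip.HsbyActive{suffix}"] = active
        diagnostics[f"AbCip.HsbyPrimaryRole{suffix}"] = device["primary_role"]
        diagnostics[f"AbCip.HsbyPartnerRole{suffix}"] = device["partner_role"]
    return diagnostics


pair_a = {"primary": "ab://10.0.0.5/1,0", "partner": "ab://10.0.0.6/1,0",
          "active": "ab://10.0.0.6/1,0", "primary_role": 0, "partner_role": 1}
pair_b = {"primary": "ab://10.0.1.5/1,0", "partner": "ab://10.0.1.6/1,0",
          "active": None, "primary_role": 2, "partner_role": 2}

single = hsby_diagnostics([pair_a], failover_count=1)
multi = hsby_diagnostics([pair_a, pair_b], failover_count=1)
```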

The DeviceState.ActiveAddress field (internal; surfaced via HsbyActive diagnostics) is the address PR 5.2 routes through ResolveHost + uses to scope the per-host bulkhead / breaker key. See Failover behaviour for the runtime implications.

Active-resolution rules

Primary role	Partner role	`ActiveAddress` resolution
Active	Standby / Disqualified / Unknown	primary
Standby / Disqualified / Unknown	Active	partner
Active	Active (split-brain)	primary wins, warning logged
Standby	Standby	`null`. PR 5.2's `ResolveHost` falls back to the configured primary; the existing dial flow surfaces `BadCommunicationError` if the primary is also down. See Both-stuck.
Unknown	Unknown	`null` (same fallback as Standby + Standby)

Split-brain (both chassis claim Active simultaneously) is a real production failure mode — typically a redundancy-module misconfiguration or a partial network split. The driver picks primary deterministically + emits a warning through AbCipDriverOptions.OnWarning so operators see it in the log.
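
The resolution rules reduce to a short decision function. The Python sketch below is an illustration of the table, not the driver's code; `resolve_active` and the `warn` callback are hypothetical names standing in for the real resolver and `AbCipDriverOptions.OnWarning`.

```python
from typing import Callable, Optional


def resolve_active(primary_role: str, partner_role: str,
                   primary: str, partner: str,
                   warn: Callable[[str], None]) -> Optional[str]:
    """Return the address to treat as Active, or None when neither is Active."""
    if primary_role == "Active":
        if partner_role == "Active":
            # Split-brain: pick primary deterministically and tell the operator.
            warn(f"HSBY split-brain: {primary} and {partner} both report Active")
        return primary
    if partner_role == "Active":
        return partner
    return None  # both Standby / Disqualified / Unknown


PRIMARY, PARTNER = "ab://10.0.0.5/1,0", "ab://10.0.0.6/1,0"
warnings = []

normal = resolve_active("Active", "Standby", PRIMARY, PARTNER, warnings.append)
failed_over = resolve_active("Unknown", "Active", PRIMARY, PARTNER, warnings.append)
split_brain = resolve_active("Active", "Active", PRIMARY, PARTNER, warnings.append)
nobody = resolve_active("Standby", "Standby", PRIMARY, PARTNER, warnings.append)
```

Only the split-brain call appends to `warnings`; the other three resolve silently.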

CLI flags

The otopcua-abcip-cli tool exposes the HSBY plumbing through two surfaces (see Driver.AbCip.Cli.md for the full CLI guide):

- --partner <gateway>: global flag on every command. Sets PartnerHostAddress + auto-enables Hsby.Enabled = true so the role probe runs alongside any read / write / subscribe.
- hsby-status: dedicated command that prints which chassis is currently Active. Reads the role tag on both gateways for a few ticks + prints the (primary, partner, active) tuple.

# Print which chassis is Active right now
otopcua-abcip-cli hsby-status -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0

# Subscribe through the active chassis (PR 5.2 follow-up — today the
# subscribe stays pointed at the primary; the role probe runs alongside).
otopcua-abcip-cli subscribe -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0 \
    -t Motor01_Speed --type Real -i 500

Test coverage

Unit (tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs):
- Pure MapValueToRole matrix (WallClockTime.SyncStatus + S:34 bit mask + Unknown values).
- End-to-end driver loop: primary Active / partner Standby resolves to primary; both Active resolves to primary with a warning; both Standby clears ActiveAddress; primary read failure routes to partner.
- Diagnostics surface (AbCip.HsbyActive / HsbyPrimaryRole / HsbyPartnerRole).
- DTO JSON round-trip (PartnerHostAddress + Hsby.{Enabled, RoleTagAddress, ProbeIntervalMs} survive deserialise → driver → DeviceState).
- Hsby.Enabled = false → no role probing.
Integration (tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/):
- AbCipHsbyRoleProberTests.cs (PR 5.1) and AbCipHsbyFailoverTests.cs (PR 5.2) — both skipped by default (Assert.Skip). ab_server cannot emulate a ControlLogix HSBY pair (no WallClockTime.SyncStatus, no second chassis concept). The Docker paired profile (PR 5.1) brings up two ab_server instances + a stub hsby-mux sidecar so the topology is documented, but a patched ab_server image that actually serves the role tag is still on the follow-up list.
- Trait Category=Hsby so dotnet test --filter Category=Hsby finds them once they're promoted.
End-to-end (scripts/e2e/test-abcip-hsby.ps1, PR 5.2):
- Paired-fixture variant of test-abcip.ps1. Subscribes to a tag through the OPC UA server, flips the active chassis mid-stream via the hsby-mux sidecar's POST /flip endpoint, asserts the stream survives + AbCip.HsbyFailoverCount increments. Gated on operator-supplied BridgeNodeId + a running paired fixture; ships unwired into test-all.ps1 until the patched ab_server lands.

Failover behaviour (PR 5.2)

PR 5.2 wires DeviceState.ActiveAddress into the read / write hot path through AbCipDriver.ResolveHost and the runtime-cache lifecycle. After the role-probe loop (PR 5.1) detects an active-address transition the driver re-points every wire-level operation at the now-Active chassis without operator intervention.

What flips on a failover

Aspect	Pre-flip	Post-flip
`ResolveHost(tag)` return	primary `HostAddress`	the partner address (when partner is now Active)
Per-tag libplctag handles in `DeviceState.Runtimes`	created against primary gateway	dropped on flip; lazily re-created against the partner gateway on next read / write
Parent-DINT RMW handles in `DeviceState.ParentRuntimes`	primary gateway	dropped on flip; same re-create-on-demand path
`AbCipWriteCoalescer` per-device cache	last-known-written values from the primary	reset; the first write of any value to the partner pays the full round-trip
`LogicalInstanceMap` (Logical-mode `@tags` walk)	populated for primary	cleared; the next read on a Logical-mode device re-walks `@tags` against the partner
Per-host bulkhead key (Polly bulkhead + breaker, plan decision #144)	keyed on primary `HostAddress`	keyed on the new active address — the partner gets its own fresh breaker state instead of inheriting a tripped breaker from the now-standby
`AbCip.HsbyFailoverCount` diagnostic	0	incremented by 1 on every transition observed by the probe loop

How the invalidation runs

PR 5.2 introduces an internal OnActiveAddressChanged event raised by HsbyProbeLoopAsync on every DeviceState.ActiveAddress transition. The driver subscribes to it from its own constructor; the handler (HandleActiveAddressChanged) does the cache invalidation in one place:

- Disposes every entry in DeviceState.Runtimes and DeviceState.ParentRuntimes, then clears both dicts. Disposed IAbCipTagRuntime instances release their underlying libplctag handles so the native heap doesn't leak.
- Clears DeviceState.LogicalInstanceMap and resets LogicalWalkComplete = false so the next read on a Logical-mode device re-fires the @tags symbol walk against the new chassis.
- Calls AbCipWriteCoalescer.Reset(deviceHostAddress) so cached "we already wrote 42" decisions don't stale-suppress the first partner-side write.
- Resets DeviceState.RuntimesAddress = null so subsequent diagnostics observers see a fresh stamp on the next runtime creation.
- Increments the driver-wide AbCip.HsbyFailoverCount counter via Interlocked.Increment.

The handler is idempotent — a second event for the same address change is harmless because the dicts are already empty + the coalescer reset is itself idempotent.
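
The invalidation sequence can be sketched in Python as below. Every type here (`FakeRuntime`, `FakeCoalescer`, `DeviceState`, `Driver`) is a hypothetical stand-in written for this example and only mirrors the field names mentioned above; a lock plays the role of `Interlocked.Increment`. The sketch counts every call, since this document does not say whether the real handler de-duplicates repeated events before counting.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


class FakeRuntime:
    """Stand-in for a per-tag runtime that owns a native handle."""
    def __init__(self):
        self.disposed = False

    def dispose(self):
        self.disposed = True


class FakeCoalescer:
    """Stand-in for the write coalescer: last-written values per device."""
    def __init__(self):
        self.last_written = {}

    def reset(self, device_host_address):
        # Popping a missing key is a no-op, so reset is idempotent.
        self.last_written.pop(device_host_address, None)


@dataclass
class DeviceState:
    host_address: str
    runtimes: dict = field(default_factory=dict)
    parent_runtimes: dict = field(default_factory=dict)
    logical_instance_map: dict = field(default_factory=dict)
    logical_walk_complete: bool = False
    runtimes_address: Optional[str] = None


class Driver:
    def __init__(self, coalescer):
        self.coalescer = coalescer
        self.failover_count = 0
        self._lock = threading.Lock()

    def handle_active_address_changed(self, state):
        # Dispose and drop both runtime caches so handles are released.
        for cache in (state.runtimes, state.parent_runtimes):
            for runtime in cache.values():
                runtime.dispose()
            cache.clear()
        # Force a fresh symbol walk against the new chassis.
        state.logical_instance_map.clear()
        state.logical_walk_complete = False
        # Forget "already wrote this value" decisions for the device.
        self.coalescer.reset(state.host_address)
        state.runtimes_address = None
        with self._lock:
            self.failover_count += 1


coalescer = FakeCoalescer()
coalescer.last_written["ab://10.0.0.5/1,0"] = {"Motor01_Speed": 42}
runtime = FakeRuntime()
state = DeviceState("ab://10.0.0.5/1,0", runtimes={"Motor01_Speed": runtime},
                    logical_walk_complete=True,
                    runtimes_address="ab://10.0.0.5/1,0")
driver = Driver(coalescer)
driver.handle_active_address_changed(state)
```

Calling the handler a second time on the same `state` finds empty caches and an already-reset coalescer, so the cache side effects do not change.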

Bulkhead key semantics

The per-host resilience pipeline (Polly bulkhead + circuit breaker, plan decision #144) keys on whatever IPerCallHostResolver.ResolveHost returns. PR 5.2 changes that resolver so an HSBY-failed-over device returns the partner's address, which means:

- The device-state lookup (_devices.TryGetValue) keeps using the configured primary HostAddress as the dictionary key. That key never changes for the lifetime of a device, so multi-device configurations stay routable.
- The resilience pipeline (Polly bulkhead, breaker, retry policies) receives the active address as the host-name dimension. The standby chassis's tripped breaker (if its primary went away) doesn't bleed over to the partner; the partner gets fresh limits + a closed breaker.

When HSBY is disabled (Hsby.Enabled = false) ResolveHost returns the configured primary HostAddress exactly as it always has — pre-5.2 behaviour, no double-key risk.
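
The split between the lookup key and the resilience key can be sketched like this. It is a Python illustration, not the driver's `ResolveHost`; the `Device` type, `resolve_host`, and the string-valued breaker table are hypothetical simplifications of the Polly pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    host_address: str                     # configured primary; never changes
    hsby_enabled: bool = False
    active_address: Optional[str] = None  # set by the role-probe loop


def resolve_host(devices: dict, configured_host: str) -> str:
    """Return the address the resilience pipeline should key on."""
    device = devices[configured_host]     # lookup always uses the primary key
    if not device.hsby_enabled:
        return device.host_address        # pre-5.2 behaviour
    # A null active address falls back to the configured primary.
    return device.active_address or device.host_address


PRIMARY, PARTNER = "ab://10.0.0.5/1,0", "ab://10.0.0.6/1,0"
devices = {PRIMARY: Device(PRIMARY, hsby_enabled=True)}
breakers = {}  # resilience key -> breaker state ("closed" or "open")

# The primary fails and its breaker trips.
breakers[resolve_host(devices, PRIMARY)] = "open"

# The probe loop observes the partner as Active.
devices[PRIMARY].active_address = PARTNER
key_after_flip = resolve_host(devices, PRIMARY)
partner_breaker = breakers.setdefault(key_after_flip, "closed")
```

After the flip the device is still found under its primary key, but the breaker consulted is the partner's fresh one rather than the tripped primary breaker.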

Failure-mode walkthrough

PR 5.2 adds three failure modes to reason about. The tables below summarise, for each mode, the behaviour the driver reports + how an operator can inspect it.

Primary-stuck (primary unreachable, partner Active)

The primary chassis goes away (network partition, power loss, a stuck Forward Open). The role-probe loop reads HsbyRole.Unknown for the primary and HsbyRole.Active for the partner.

Surface	Behaviour
`DeviceState.ActiveAddress`	partner address
`DeviceState.PrimaryRole`	`Unknown`
`DeviceState.PartnerRole`	`Active`
`ResolveHost(tag)`	partner address
Reads / writes	route through partner gateway transparently
`AbCip.HsbyFailoverCount`	incremented when the address transitioned away from the primary
`AbCip.HsbyActive`	`2` (partner is the active chassis)
Operator action	none required for routing; investigate why the primary is unreachable through the connectivity-probe loop's `_System/_ConnectionStatus` for the device

Secondary-stuck (partner unreachable, primary Active)

The partner chassis goes away (its OPC UA server is down, its IP is unreachable, the redundancy module unhitched it). The probe loop reads HsbyRole.Active for the primary and HsbyRole.Unknown for the partner.

Surface	Behaviour
`DeviceState.ActiveAddress`	primary address (no transition; this is the steady state)
`DeviceState.PrimaryRole`	`Active`
`DeviceState.PartnerRole`	`Unknown`
`ResolveHost(tag)`	primary address
Reads / writes	route through primary gateway exactly as in a non-HSBY deployment
`AbCip.HsbyFailoverCount`	unchanged — no flip happened
`AbCip.HsbyActive`	`1` (primary is the active chassis)
Operator action	investigate why the partner is unreachable; the operational risk is that a future primary-side outage has no fall-back

Both-stuck (no chassis Active)

Both chassis report Standby / Disqualified / Unknown (a redundancy-module misconfiguration, both controllers in Program mode, or both unreachable).

Surface	Behaviour
`DeviceState.ActiveAddress`	`null`
`ResolveHost(tag)`	falls back to the configured primary `HostAddress`
Reads / writes	dispatched to the configured primary; a stuck primary surfaces `BadCommunicationError` per the existing dial flow
`AbCip.HsbyActive`	`0` (no chassis Active)
`AbCip.HsbyFailoverCount`	incremented when the transition `Active → null` happened
Operator action	investigate the redundancy module / mode keys; the SCADA layer sees stuck-or-bad-quality reads, not incorrect routing

The "fall back to primary on null Active" choice is deliberate. Routing all reads to a deterministic chassis (the configured primary) keeps the breaker key + bulkhead state stable while the operator diagnoses the double-down outage; the alternative (round-robin / partner) would just trip both breakers in turn and obscure which chassis is the real problem.
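
The three modes can be replayed as a sequence of probe ticks to see where reads are routed and when the failover count moves. This Python sketch is illustrative only; `replay` is a hypothetical helper, and it assumes the first tick merely seeds the active address without counting as a transition.

```python
def replay(ticks, primary, partner):
    """Replay (primary_role, partner_role) ticks.

    Returns (routed_hosts, failover_count): the host each tick routes to and
    the number of active-address transitions observed.
    """
    def resolve(primary_role, partner_role):
        if primary_role == "Active":
            return primary
        if partner_role == "Active":
            return partner
        return None

    routed, count, previous = [], 0, None
    for index, (primary_role, partner_role) in enumerate(ticks):
        active = resolve(primary_role, partner_role)
        # Assumption: the first tick only seeds the state.
        if index > 0 and active != previous:
            count += 1
        previous = active
        # A null active address routes to the configured primary.
        routed.append(active or primary)
    return routed, count


PRIMARY, PARTNER = "ab://10.0.0.5/1,0", "ab://10.0.0.6/1,0"
routed, failovers = replay(
    [
        ("Active", "Standby"),   # steady state
        ("Unknown", "Active"),   # primary-stuck: flips to the partner
        ("Active", "Unknown"),   # secondary-stuck: back on the primary
        ("Standby", "Unknown"),  # both-stuck: Active -> null
    ],
    PRIMARY, PARTNER,
)
```

The last tick shows the deliberate fallback: the active address is null, the transition is counted, and reads still go to the configured primary.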

Follow-ups (beyond PR 5.2)

- Patched ab_server image: add a writable WallClockTime.SyncStatus tag (or a separate Python shim) so the Docker paired profile can exercise the wire-level role probe + the tests/.../IntegrationTests/AbCipHsbyFailoverTests.cs scaffold can flip its Assert.Skip for a real integration assertion.
- hsby-mux REST endpoint: POST /flip {"active": "primary"} writes 1 to the chosen chassis + 0 to the other so integration tests + scripts/e2e/test-abcip-hsby.ps1 can drive switch-overs deterministically.
- GuardLogix HSBY: same role-tag plumbing applies; verify against a real 1756-L8xS pair when one is on-site.
