AbCip — ControlLogix HSBY paired-IP support
PR abcip-5.1 + 5.2 ship non-transparent HSBY (Hot-Standby) awareness to the AB CIP driver. Each device may declare a partner gateway; when both gateways are up the driver concurrently probes a role tag on each chassis, reports which one is currently Active, and routes reads / writes through that chassis automatically.
- PR abcip-5.1: gathers + reports the role of each chassis through driver diagnostics. See Role-tag detection matrix.
- PR abcip-5.2: wires the resolved active address into AbCipDriver.ResolveHost and the runtime-cache lifecycle. See Failover behaviour + Failure-mode walkthrough.
When to use HSBY paired IPs
You have a redundant ControlLogix chassis pair (1756-RM redundancy module, two CPUs, one acting + one standby) and the SCADA / OPC UA layer needs to keep talking to whichever chassis is currently Active without an operator manually re-pointing the connection.
Pre-5.1 the driver only knew about a single HostAddress. After a
hot-standby switch-over, the standby (now Active) carried a different IP
and the driver kept probing the dead-but-was-Active address until someone
edited the config.
PR abcip-5.1 closes the visibility half of that gap by reading the role tag
on both chassis. PR abcip-5.2 closes the routing half by re-pointing
ResolveHost at the Active address each tick + invalidating the per-tag
runtime cache + write-coalescer state on every flip.
Configuration
```json
{
  "Devices": [
    {
      "HostAddress": "ab://10.0.0.5/1,0",
      "PartnerHostAddress": "ab://10.0.0.6/1,0",
      "Hsby": {
        "Enabled": true,
        "RoleTagAddress": "WallClockTime.SyncStatus",
        "ProbeIntervalMs": 2000
      }
    }
  ]
}
```
| Field | Default | Notes |
|---|---|---|
| PartnerHostAddress | null | Canonical ab://gateway[:port]/cip-path of the partner chassis. null = no HSBY pair; the driver behaves exactly like every pre-5.1 build. |
| Hsby.Enabled | false | Master switch. When false (or Hsby omitted) no role probing happens, even if PartnerHostAddress is set. |
| Hsby.RoleTagAddress | WallClockTime.SyncStatus | Address of the role tag on each chassis. See role-tag detection matrix. |
| Hsby.ProbeIntervalMs | 2000 | How often each chassis is sampled. 2 s is a good default: tight enough to detect a switch-over within one Admin-UI refresh, loose enough to leave headroom for the regular probe loop. |
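The defaulting rules in the table can be sketched as follows. This is an illustrative Python model, not the driver's DTO code: the names HsbySettings, parse_device and role_probing_active are hypothetical, and the rule that probing needs both Hsby.Enabled = true and a non-null PartnerHostAddress is inferred from the notes column.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HsbySettings:
    # Defaults mirror the "Default" column of the table above.
    enabled: bool = False
    role_tag_address: str = "WallClockTime.SyncStatus"
    probe_interval_ms: int = 2000


def parse_device(device: dict) -> Tuple[Optional[str], HsbySettings]:
    """Return (PartnerHostAddress, Hsby settings) with defaults applied."""
    hsby = device.get("Hsby") or {}  # an omitted Hsby block means all defaults
    settings = HsbySettings(
        enabled=bool(hsby.get("Enabled", False)),
        role_tag_address=hsby.get("RoleTagAddress", "WallClockTime.SyncStatus"),
        probe_interval_ms=int(hsby.get("ProbeIntervalMs", 2000)),
    )
    return device.get("PartnerHostAddress"), settings


def role_probing_active(device: dict) -> bool:
    """Assumption: probing runs only when the gate is on AND a partner is set."""
    partner, settings = parse_device(device)
    return settings.enabled and partner is not None
```

Under this model a device that sets PartnerHostAddress but omits the Hsby block still behaves like a pre-5.1 device, which matches the Hsby.Enabled note.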
Feature-flag gate (Redundancy.Hsby.Enabled)
Hsby.Enabled = false (the default) is the off-switch for the entire
feature. The role-probe loop never starts, the diagnostics keys are not
emitted, and the driver behaves identically to a pre-5.1 build. This is the
gate to flip when an operator wants to roll the feature out cautiously
across a fleet — set Hsby.Enabled = true per-device in driver config (no
build flag, no env var).
When the gate is on but the partner gateway is unreachable, the role-probe
loop reports HsbyRole.Unknown for the partner each tick. The primary's
role still drives the active-chassis resolution; the operator sees the
partner's role as Unknown in the Admin UI / driver diagnostics, which is the
correct surface for "we can't reach the standby chassis right now."
Role-tag detection matrix
| Firmware / fronts | Address | Decode |
|---|---|---|
| v20 / v24 / v32+ ControlLogix HSBY | WallClockTime.SyncStatus (DINT) | 0 = Standby, 1 = Synchronized / Active, 2 = Disqualified, anything else = Unknown |
| PLC-5 / SLC500 status-byte fallback | S:34 Module Status word | bit 0 = "this chassis is Active". Bit set → Active; clear → Standby |
| Custom user role tag | any DINT-typed CIP path | Same matrix as WallClockTime.SyncStatus (0 / 1 / 2). Out-of-range values → Unknown. |
AbCipHsbyRoleProber.MapValueToRole is the value-to-role mapper; unit tests
in tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs pin every
row of the matrix.
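The matrix can be read as two small decode functions. The sketch below is a Python illustration of the table, not the source of AbCipHsbyRoleProber.MapValueToRole; the function names are hypothetical, the enum values follow the (int)HsbyRole encoding used by the diagnostics counters, and treating a failed read (None) as Unknown is an assumption.

```python
from enum import IntEnum
from typing import Optional


class HsbyRole(IntEnum):
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


# Raw DINT value -> role, per the WallClockTime.SyncStatus / custom-tag rows.
_SYNC_STATUS = {
    0: HsbyRole.STANDBY,
    1: HsbyRole.ACTIVE,
    2: HsbyRole.DISQUALIFIED,
}


def map_sync_status(raw: Optional[int]) -> HsbyRole:
    """Decode a DINT role tag; out-of-range or missing values are Unknown."""
    if raw is None:  # assumption: a failed read carries no value
        return HsbyRole.UNKNOWN
    return _SYNC_STATUS.get(raw, HsbyRole.UNKNOWN)


def map_status_word(word: Optional[int]) -> HsbyRole:
    """Decode the S:34 fallback: bit 0 set means this chassis is Active."""
    if word is None:
        return HsbyRole.UNKNOWN
    return HsbyRole.ACTIVE if word & 0x1 else HsbyRole.STANDBY
```

Note that the raw tag encoding (0 = Standby) differs from the HsbyRole encoding (0 = Unknown), so the two must not be compared directly.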
What gets reported
The driver surfaces four diagnostics counters (three role counters per
HSBY-enabled device from PR 5.1, plus the failover count added in PR 5.2),
visible via driver-diagnostics RPC + the Admin UI:
| Counter | Value |
|---|---|
| AbCip.HsbyActive | 1 if primary is Active, 2 if partner is Active, 0 if neither (or HSBY off) |
| AbCip.HsbyPrimaryRole | (int)HsbyRole: 0 = Unknown, 1 = Active, 2 = Standby, 3 = Disqualified |
| AbCip.HsbyPartnerRole | Same encoding as HsbyPrimaryRole, observed on the partner chassis |
| AbCip.HsbyFailoverCount (PR 5.2) | Total number of ActiveAddress transitions the probe loop has observed across every HSBY-enabled device on this driver. Each increment maps to one runtime-cache invalidation + write-coalescer reset. |
When more than one HSBY pair is configured on the same driver instance the
flat keys are scoped per primary host: AbCip.HsbyActive[ab://10.0.0.5/1,0],
etc.
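As a rough illustration of the key naming, the Python sketch below builds the diagnostics dictionary for one or several pairs. The function names and the input dict shape are hypothetical, and keeping AbCip.HsbyFailoverCount unscoped (because it is described as driver-wide) is an assumption rather than confirmed behaviour.

```python
from typing import Dict, List, Optional


def hsby_active_value(primary: str, partner: Optional[str], active: Optional[str]) -> int:
    """1 = primary is Active, 2 = partner is Active, 0 = neither."""
    if active is None:
        return 0
    if active == primary:
        return 1
    if active == partner:
        return 2
    return 0


def hsby_diagnostics(devices: List[dict], failover_count: int) -> Dict[str, int]:
    """Flat keys for a single pair; keys scoped by primary host for several."""
    scoped = len(devices) > 1
    # Assumption: the failover counter is driver-wide, so it stays unscoped.
    out = {"AbCip.HsbyFailoverCount": failover_count}
    for d in devices:
        suffix = "[{}]".format(d["primary"]) if scoped else ""
        out["AbCip.HsbyActive" + suffix] = hsby_active_value(
            d["primary"], d.get("partner"), d.get("active")
        )
        out["AbCip.HsbyPrimaryRole" + suffix] = int(d["primary_role"])
        out["AbCip.HsbyPartnerRole" + suffix] = int(d["partner_role"])
    return out
```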
The DeviceState.ActiveAddress field (internal; surfaced via
HsbyActive diagnostics) is the address PR 5.2 routes through
ResolveHost + uses to scope the per-host bulkhead / breaker key.
See Failover behaviour for the runtime
implications.
Active-resolution rules
| Primary role | Partner role | ActiveAddress resolution |
|---|---|---|
| Active | Standby / Disqualified / Unknown | primary |
| Standby / Disqualified / Unknown | Active | partner |
| Active | Active (split-brain) | primary wins, warning logged |
| Standby | Standby | null. PR 5.2's ResolveHost falls back to the configured primary; the existing dial flow surfaces BadCommunicationError if the primary is also down. See Both-stuck. |
| Unknown | Unknown | null (same fallback as Standby + Standby) |
Split-brain (both chassis claim Active simultaneously) is a real
production failure mode — typically a redundancy-module misconfiguration or
a partial network split. The driver picks primary deterministically + emits
a warning through AbCipDriverOptions.OnWarning so operators see it in the
log.
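The resolution rules reduce to a short decision function. The Python sketch below is an illustration of the table, not the driver's code: resolve_active_address and the on_warning callback are hypothetical stand-ins (the real hook is AbCipDriverOptions.OnWarning), and mixed non-Active combinations such as Standby + Unknown are assumed to resolve to null like the listed both-Standby and both-Unknown rows.

```python
from enum import Enum
from typing import Callable, Optional


class Role(Enum):
    UNKNOWN = "Unknown"
    ACTIVE = "Active"
    STANDBY = "Standby"
    DISQUALIFIED = "Disqualified"


def resolve_active_address(
    primary: str,
    partner: str,
    primary_role: Role,
    partner_role: Role,
    on_warning: Optional[Callable[[str], None]] = None,
) -> Optional[str]:
    """Return the address of the Active chassis, or None when neither is Active."""
    if primary_role is Role.ACTIVE:
        if partner_role is Role.ACTIVE and on_warning is not None:
            # Split-brain: both claim Active; primary wins deterministically.
            on_warning("HSBY split-brain: {} and {} both report Active".format(primary, partner))
        return primary
    if partner_role is Role.ACTIVE:
        return partner
    return None  # caller falls back to the configured primary
```

Checking the primary first is what makes the split-brain outcome deterministic: the order of the two branches encodes the "primary wins" rule.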
CLI flags
The otopcua-abcip-cli tool exposes the HSBY plumbing through two surfaces
(see Driver.AbCip.Cli.md for the full CLI guide):
- --partner <gateway>: global flag on every command. Sets PartnerHostAddress + auto-enables Hsby.Enabled = true so the role probe runs alongside any read / write / subscribe.
- hsby-status: dedicated command that prints which chassis is currently Active. Reads the role tag on both gateways for a few ticks + prints the (primary, partner, active) tuple.
```bash
# Print which chassis is Active right now
otopcua-abcip-cli hsby-status -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0

# Subscribe through the active chassis (PR 5.2 follow-up: today the
# subscribe stays pointed at the primary; the role probe runs alongside).
otopcua-abcip-cli subscribe -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0 \
  -t Motor01_Speed --type Real -i 500
```
Test coverage
- Unit (tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs):
  - Pure MapValueToRole matrix (WallClockTime.SyncStatus + S:34 bit mask + Unknown values).
  - End-to-end driver loop: primary Active / partner Standby resolves to primary; both Active resolves to primary with a warning; both Standby clears ActiveAddress; primary read failure routes to partner.
  - Diagnostics surface (AbCip.HsbyActive / HsbyPrimaryRole / HsbyPartnerRole).
  - DTO JSON round-trip (PartnerHostAddress + Hsby.{Enabled, RoleTagAddress, ProbeIntervalMs} survive deserialise → driver → DeviceState).
  - Hsby.Enabled = false → no role probing.
- Integration (tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/):
  - AbCipHsbyRoleProberTests.cs (PR 5.1) and AbCipHsbyFailoverTests.cs (PR 5.2), both skipped by default (Assert.Skip). ab_server cannot emulate a ControlLogix HSBY pair (no WallClockTime.SyncStatus, no second chassis concept). The Docker paired profile (PR 5.1) brings up two ab_server instances + a stub hsby-mux sidecar so the topology is documented, but a patched ab_server image that actually serves the role tag is still on the follow-up list.
  - Trait Category=Hsby so dotnet test --filter Category=Hsby finds them once they're promoted.
- End-to-end (scripts/e2e/test-abcip-hsby.ps1, PR 5.2):
  - Paired-fixture variant of test-abcip.ps1. Subscribes to a tag through the OPC UA server, flips the active chassis mid-stream via the hsby-mux sidecar's POST /flip endpoint, asserts the stream survives + AbCip.HsbyFailoverCount increments. Gated on operator-supplied BridgeNodeId + a running paired fixture; ships unwired into test-all.ps1 until the patched ab_server lands.
Failover behaviour (PR 5.2)
PR 5.2 wires DeviceState.ActiveAddress into the read / write hot path
through AbCipDriver.ResolveHost and the runtime-cache lifecycle. After
the role-probe loop (PR 5.1) detects an active-address transition the
driver re-points every wire-level operation at the now-Active chassis
without operator intervention.
What flips on a failover
| Aspect | Pre-flip | Post-flip |
|---|---|---|
| ResolveHost(tag) return | primary HostAddress | the partner address (when partner is now Active) |
| Per-tag libplctag handles in DeviceState.Runtimes | created against primary gateway | dropped on flip; lazily re-created against the partner gateway on next read / write |
| Parent-DINT RMW handles in DeviceState.ParentRuntimes | primary gateway | dropped on flip; same re-create-on-demand path |
| AbCipWriteCoalescer per-device cache | last-known-written values from the primary | reset; the first write of any value to the partner pays the full round-trip |
| LogicalInstanceMap (Logical-mode @tags walk) | populated for primary | cleared; the next read on a Logical-mode device re-walks @tags against the partner |
| Per-host bulkhead key (Polly bulkhead + breaker, plan decision #144) | keyed on primary HostAddress | keyed on the new active address, so the partner gets its own fresh breaker state instead of inheriting a tripped breaker from the now-standby |
| AbCip.HsbyFailoverCount diagnostic | 0 | incremented by 1 on every transition observed by the probe loop |
How the invalidation runs
PR 5.2 introduces an internal OnActiveAddressChanged event raised by
HsbyProbeLoopAsync on every DeviceState.ActiveAddress transition. The
driver subscribes to it from its own constructor; the handler
(HandleActiveAddressChanged) does the cache invalidation in one place:
- Disposes every entry in DeviceState.Runtimes and DeviceState.ParentRuntimes, then clears both dicts. Disposed IAbCipTagRuntime instances release their underlying libplctag handles so the native heap doesn't leak.
- Clears DeviceState.LogicalInstanceMap and resets LogicalWalkComplete = false so the next read on a Logical-mode device re-fires the @tags symbol walk against the new chassis.
- Calls AbCipWriteCoalescer.Reset(deviceHostAddress) so cached "we already wrote 42" decisions don't stale-suppress the first partner-side write.
- Resets DeviceState.RuntimesAddress = null so subsequent diagnostics observers see a fresh stamp on the next runtime creation.
- Interlocked.Increment on the driver-wide AbCip.HsbyFailoverCount counter.
The handler is idempotent — a second event for the same address change is harmless because the dicts are already empty + the coalescer reset is itself idempotent.
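The invalidation sequence can be modelled in a few lines. The Python sketch below mirrors the order of the steps listed above using stand-in types: StubRuntime, StubCoalescer and FailoverHandler are hypothetical, the DeviceState fields are Python-cased analogues of the documented ones, and a lock-guarded counter stands in for Interlocked.Increment. It is not the driver's C# implementation.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


class StubRuntime:
    """Stand-in for an IAbCipTagRuntime holding a native handle."""
    def __init__(self) -> None:
        self.disposed = False

    def dispose(self) -> None:
        self.disposed = True


class StubCoalescer:
    """Stand-in for AbCipWriteCoalescer: last-written values per device."""
    def __init__(self) -> None:
        self.last_written = {}

    def reset(self, device_host_address: str) -> None:
        self.last_written.pop(device_host_address, None)


@dataclass
class DeviceState:
    host_address: str
    runtimes: dict = field(default_factory=dict)
    parent_runtimes: dict = field(default_factory=dict)
    logical_instance_map: dict = field(default_factory=dict)
    logical_walk_complete: bool = False
    runtimes_address: Optional[str] = None


class FailoverHandler:
    def __init__(self, coalescer: StubCoalescer) -> None:
        self.coalescer = coalescer
        self.failover_count = 0
        self._lock = threading.Lock()

    def handle_active_address_changed(self, state: DeviceState) -> None:
        # 1. Dispose and drop per-tag and parent-DINT runtimes.
        for cache in (state.runtimes, state.parent_runtimes):
            for runtime in cache.values():
                runtime.dispose()
            cache.clear()
        # 2. Force a fresh @tags walk against the new chassis.
        state.logical_instance_map.clear()
        state.logical_walk_complete = False
        # 3. Forget "already wrote this value" decisions for the device.
        self.coalescer.reset(state.host_address)
        # 4. Clear the stamp so the next runtime creation re-stamps it.
        state.runtimes_address = None
        # 5. Count the transition (lock-guarded increment).
        with self._lock:
            self.failover_count += 1
```

In this model the cache-clearing steps are safe to repeat, but the counter advances on every call, so any de-duplication of repeated events would have to happen before the handler runs.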
Bulkhead key semantics
The per-host resilience pipeline (Polly bulkhead + circuit breaker, plan
decision #144) keys on whatever IPerCallHostResolver.ResolveHost
returns. PR 5.2 changes that resolver so an HSBY-failed-over device
returns the partner's address, which means:
- The device-state lookup (_devices.TryGetValue) keeps using the configured primary HostAddress as the dictionary key. That key never changes for the lifetime of a device, so multi-device configurations stay routable.
- The resilience pipeline (Polly bulkhead, breaker, retry policies) receives the active address as the host-name dimension. A breaker tripped against the now-standby chassis (for example because the primary went away) doesn't bleed over to the partner; the partner gets fresh limits + a closed breaker.
When HSBY is disabled (Hsby.Enabled = false) ResolveHost returns the
configured primary HostAddress exactly as it always has — pre-5.2
behaviour, no double-key risk.
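The split between the stable lookup key and the resolved host can be sketched as below. This Python model is illustrative only: the Device dataclass and the resolve_host signature are hypothetical, and the fallback to the configured primary when no chassis is Active follows the active-resolution rules described earlier.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Device:
    host_address: str                      # configured primary; also the dict key
    hsby_enabled: bool = False
    active_address: Optional[str] = None   # set by the role-probe loop


def resolve_host(devices: Dict[str, Device], device_key: str) -> str:
    """Return the address used as the bulkhead / breaker key for this call."""
    device = devices[device_key]           # lookup key never changes
    if not device.hsby_enabled:
        return device.host_address         # pre-5.2 behaviour
    # No Active chassis resolved: fall back to the configured primary.
    return device.active_address or device.host_address
```

After a failover the lookup still uses the primary's address as the key, while the returned value (and therefore the breaker state) switches to the partner.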
Failure-mode walkthrough
PR 5.2 adds three failover scenarios to reason about. The tables below summarise the behaviour the driver reports in each one + how an operator can inspect it.
Primary-stuck (primary unreachable, partner Active)
The primary chassis goes away (network partition, power loss, a stuck
Forward Open). The role-probe loop reads HsbyRole.Unknown for the
primary and HsbyRole.Active for the partner.
| Surface | Behaviour |
|---|---|
| DeviceState.ActiveAddress | partner address |
| DeviceState.PrimaryRole | Unknown |
| DeviceState.PartnerRole | Active |
| ResolveHost(tag) | partner address |
| Reads / writes | route through partner gateway transparently |
| AbCip.HsbyFailoverCount | incremented when the address transitioned away from the primary |
| AbCip.HsbyActive | 2 (partner is the active chassis) |
| Operator action | none required for routing; investigate why the primary is unreachable through the connectivity-probe loop's _System/_ConnectionStatus for the device |
Secondary-stuck (partner unreachable, primary Active)
The partner chassis goes away (its OPC UA server is down, its IP is
unreachable, the redundancy module unhitched it). The probe loop reads
HsbyRole.Active for the primary and HsbyRole.Unknown for the partner.
| Surface | Behaviour |
|---|---|
| DeviceState.ActiveAddress | primary address (no transition; this is the steady state) |
| DeviceState.PrimaryRole | Active |
| DeviceState.PartnerRole | Unknown |
| ResolveHost(tag) | primary address |
| Reads / writes | route through primary gateway exactly as in a non-HSBY deployment |
| AbCip.HsbyFailoverCount | unchanged, since no flip happened |
| AbCip.HsbyActive | 1 (primary is the active chassis) |
| Operator action | investigate why the partner is unreachable; the operational risk is that a future primary-side outage has no fall-back |
Both-stuck (no chassis Active)
Both chassis report Standby / Disqualified / Unknown (a
redundancy-module misconfiguration, both controllers in Program mode,
or both unreachable).
| Surface | Behaviour |
|---|---|
| DeviceState.ActiveAddress | null |
| ResolveHost(tag) | falls back to the configured primary HostAddress |
| Reads / writes | dispatched to the configured primary; a stuck primary surfaces BadCommunicationError per the existing dial flow |
| AbCip.HsbyActive | 0 (no chassis Active) |
| AbCip.HsbyFailoverCount | incremented when the transition Active → null happened |
| Operator action | investigate the redundancy module / mode keys; the SCADA layer sees stuck-or-bad-quality reads, not incorrect routing |
The "fall back to primary on null Active" choice is deliberate. Routing all reads to a deterministic chassis (the configured primary) keeps the breaker key + bulkhead state stable while the operator diagnoses the double-down outage; the alternative (round-robin / partner) would just trip both breakers in turn and obscure which chassis is the real problem.
Follow-ups (beyond PR 5.2)
- Patched ab_server image: add a writable WallClockTime.SyncStatus tag (or a separate Python shim) so the Docker paired profile can exercise the wire-level role probe + the tests/.../IntegrationTests/AbCipHsbyFailoverTests.cs scaffold can flip its Assert.Skip for a real integration assertion.
- hsby-mux REST endpoint: POST /flip {"active": "primary"} writes 1 to the chosen chassis + 0 to the other so integration tests + scripts/e2e/test-abcip-hsby.ps1 can drive switch-overs deterministically.
- GuardLogix HSBY: same role-tag plumbing applies; verify against a real 1756-L8xS pair when one is on-site.
See also
- docs/Driver.AbCip.Cli.md: --partner flag + hsby-status command reference
- docs/drivers/AbServer-Test-Fixture.md §"What it does NOT cover": HSBY entry
- docs/Redundancy.md: server-level (OPC UA-stack) redundancy; HSBY is the driver-level companion