lmxopcua/docs/drivers/AbCip-HSBY.md
2026-04-26 08:13:41 -04:00

AbCip — ControlLogix HSBY paired-IP support

PR abcip-5.1 + 5.2 ship non-transparent HSBY (Hot-Standby) awareness to the AB CIP driver. Each device may declare a partner gateway; when both gateways are up the driver concurrently probes a role tag on each chassis, reports which one is currently Active, and routes reads / writes through that chassis automatically.

When to use HSBY paired IPs

You have a redundant ControlLogix chassis pair (1756-RM redundancy module, two CPUs, one acting + one standby) and the SCADA / OPC UA layer needs to keep talking to whichever chassis is currently Active without an operator manually re-pointing the connection.

Pre-5.1 the driver only knew about a single HostAddress. After a hot-standby switch-over, the standby (now Active) carried a different IP and the driver kept probing the dead-but-was-Active address until someone edited the config.

PR abcip-5.1 closes the visibility half of that gap by reading the role tag on both chassis. PR abcip-5.2 closes the routing half by re-pointing ResolveHost at the Active address each tick + invalidating the per-tag runtime cache + write-coalescer state on every flip.

Configuration

{
    "Devices": [
        {
            "HostAddress": "ab://10.0.0.5/1,0",
            "PartnerHostAddress": "ab://10.0.0.6/1,0",
            "Hsby": {
                "Enabled": true,
                "RoleTagAddress": "WallClockTime.SyncStatus",
                "ProbeIntervalMs": 2000
            }
        }
    ]
}
| Field | Default | Notes |
| --- | --- | --- |
| PartnerHostAddress | null | Canonical ab://gateway[:port]/cip-path of the partner chassis. null = no HSBY pair; the driver behaves exactly like every pre-5.1 build. |
| Hsby.Enabled | false | Master switch. When false (or Hsby omitted) no role probing happens, even if PartnerHostAddress is set. |
| Hsby.RoleTagAddress | WallClockTime.SyncStatus | Address of the role tag on each chassis. See role-tag detection matrix. |
| Hsby.ProbeIntervalMs | 2000 | How often each chassis is sampled. 2 s is a good default: tight enough to detect a switch-over within one Admin-UI refresh, loose enough to leave headroom for the regular probe loop. |
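
As an illustration of how the defaults in the table combine, here is a minimal Python sketch. It is not the driver's own DTO code: the `DeviceOptions`, `HsbyOptions`, `parse_device`, and `role_probing_active` names are hypothetical, and the rule that probing needs both `Hsby.Enabled = true` and a non-null `PartnerHostAddress` is one reading of the table above.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class HsbyOptions:
    # Defaults copied from the configuration table.
    enabled: bool = False
    role_tag_address: str = "WallClockTime.SyncStatus"
    probe_interval_ms: int = 2000


@dataclass(frozen=True)
class DeviceOptions:
    host_address: str
    partner_host_address: Optional[str] = None
    hsby: HsbyOptions = field(default_factory=HsbyOptions)


def parse_device(raw: dict) -> DeviceOptions:
    """Build options from one entry of the "Devices" array (hypothetical helper)."""
    hsby_raw = raw.get("Hsby") or {}
    defaults = HsbyOptions()
    return DeviceOptions(
        host_address=raw["HostAddress"],
        partner_host_address=raw.get("PartnerHostAddress"),
        hsby=HsbyOptions(
            enabled=hsby_raw.get("Enabled", defaults.enabled),
            role_tag_address=hsby_raw.get("RoleTagAddress", defaults.role_tag_address),
            probe_interval_ms=hsby_raw.get("ProbeIntervalMs", defaults.probe_interval_ms),
        ),
    )


def role_probing_active(device: DeviceOptions) -> bool:
    # Assumption: probing needs the master switch AND a configured partner.
    return device.hsby.enabled and device.partner_host_address is not None


config = json.loads("""
{ "Devices": [
    { "HostAddress": "ab://10.0.0.5/1,0",
      "PartnerHostAddress": "ab://10.0.0.6/1,0",
      "Hsby": { "Enabled": true } },
    { "HostAddress": "ab://10.0.0.9/1,0" }
] }
""")
devices = [parse_device(d) for d in config["Devices"]]
```

The second device omits both `PartnerHostAddress` and `Hsby`, so under these assumptions it falls back to the pre-5.1 single-host behaviour.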

Feature-flag gate (Redundancy.Hsby.Enabled)

Hsby.Enabled = false (the default) is the off-switch for the entire feature. The role-probe loop never starts, the diagnostics keys are not emitted, and the driver behaves identically to a pre-5.1 build. This is the gate to flip when an operator wants to roll the feature out cautiously across a fleet — set Hsby.Enabled = true per-device in driver config (no build flag, no env var).

When the gate is on but the partner gateway is unreachable, the role-probe loop reports HsbyRole.Unknown for the partner each tick. The primary's role still drives the active-chassis resolution; the operator sees the partner's role as Unknown in the Admin UI / driver diagnostics, which is the correct surface for "we can't reach the standby chassis right now."
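
The unreachable-partner behaviour can be sketched as follows. This is a hypothetical Python model, not the driver's probe loop: `read_role_tag` stands in for whatever wire-level read the driver performs, and treating connection errors and timeouts as the "unreachable" signal is an assumption.

```python
import asyncio
from enum import IntEnum
from typing import Awaitable, Callable, Tuple


class HsbyRole(IntEnum):
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


# Raw WallClockTime.SyncStatus value -> role, per the role-tag matrix.
_SYNC_STATUS = {0: HsbyRole.STANDBY, 1: HsbyRole.ACTIVE, 2: HsbyRole.DISQUALIFIED}


async def probe_role(read_role_tag: Callable[[str], Awaitable[int]],
                     address: str) -> HsbyRole:
    """Read the role tag on one chassis; any connectivity failure maps to Unknown."""
    try:
        raw = await read_role_tag(address)
    except (OSError, asyncio.TimeoutError):
        return HsbyRole.UNKNOWN
    return _SYNC_STATUS.get(raw, HsbyRole.UNKNOWN)


async def probe_pair(read_role_tag: Callable[[str], Awaitable[int]],
                     primary: str, partner: str) -> Tuple[HsbyRole, HsbyRole]:
    """Probe both chassis concurrently, so one dead gateway does not delay the other."""
    primary_role, partner_role = await asyncio.gather(
        probe_role(read_role_tag, primary),
        probe_role(read_role_tag, partner),
    )
    return primary_role, partner_role
```

Because each probe swallows its own failure, an unreachable partner yields `(Active, Unknown)` rather than failing the whole tick, which matches the behaviour described above.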

Role-tag detection matrix

| Firmware / fronts | Address | Decode |
| --- | --- | --- |
| v20 / v24 / v32+ ControlLogix HSBY | WallClockTime.SyncStatus (DINT) | 0 = Standby, 1 = Synchronized / Active, 2 = Disqualified, anything else = Unknown |
| PLC-5 / SLC500 status-byte fallback | S:34 | Module Status word bit 0 = "this chassis is Active". Bit set → Active; clear → Standby |
| Custom user role tag | any DINT-typed CIP path | Same matrix as WallClockTime.SyncStatus (0 / 1 / 2). Out-of-range values → Unknown. |

AbCipHsbyRoleProber.MapValueToRole is the value-to-role mapper; unit tests in tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs pin every row of the matrix.
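
The matrix is small enough to express directly. The following Python sketch mirrors the rows above; it is an illustration, not the source of AbCipHsbyRoleProber.MapValueToRole, and the function names `map_sync_status` and `map_status_word` are hypothetical. The enum values follow the (int)HsbyRole encoding listed under "What gets reported".

```python
from enum import IntEnum


class HsbyRole(IntEnum):
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


def map_sync_status(raw: int) -> HsbyRole:
    """WallClockTime.SyncStatus or a custom DINT role tag: 0 / 1 / 2, else Unknown."""
    table = {
        0: HsbyRole.STANDBY,
        1: HsbyRole.ACTIVE,        # "Synchronized / Active"
        2: HsbyRole.DISQUALIFIED,
    }
    return table.get(raw, HsbyRole.UNKNOWN)


def map_status_word(raw: int) -> HsbyRole:
    """S:34 fallback: only bit 0 matters; set means Active, clear means Standby."""
    return HsbyRole.ACTIVE if raw & 0x1 else HsbyRole.STANDBY
```

Note that the two decoders are not interchangeable: a raw value of 0 is Standby in both, but 2 is Disqualified for a DINT role tag and Standby (bit 0 clear) for the status word.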

What gets reported

The driver surfaces three diagnostics counters per HSBY-enabled device, plus a driver-wide failover counter added in PR 5.2 (all visible via driver-diagnostics RPC + the Admin UI):

| Counter | Value |
| --- | --- |
| AbCip.HsbyActive | 1 if primary is Active, 2 if partner is Active, 0 if neither (or HSBY off) |
| AbCip.HsbyPrimaryRole | (int)HsbyRole: 0 = Unknown, 1 = Active, 2 = Standby, 3 = Disqualified |
| AbCip.HsbyPartnerRole | Same encoding as HsbyPrimaryRole, observed on the partner chassis |
| AbCip.HsbyFailoverCount (PR 5.2) | Total number of ActiveAddress transitions the probe loop has observed across every HSBY-enabled device on this driver. Each increment maps to one runtime-cache invalidation + write-coalescer reset. |

When more than one HSBY pair is configured on the same driver instance the flat keys are scoped per primary host: AbCip.HsbyActive[ab://10.0.0.5/1,0], etc.
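
A short Python sketch of the two encodings described here, assuming the scoping rule is exactly "bracket the primary host when more than one pair exists". The helper names `hsby_active_value` and `diagnostics_key` are hypothetical and do not come from the driver.

```python
from typing import Optional


def hsby_active_value(active_address: Optional[str],
                      primary: str, partner: str) -> int:
    """AbCip.HsbyActive: 1 = primary Active, 2 = partner Active, 0 = neither."""
    if active_address == primary:
        return 1
    if active_address == partner:
        return 2
    return 0  # covers ActiveAddress = null and HSBY switched off


def diagnostics_key(name: str, primary_host: str, hsby_pair_count: int) -> str:
    """Flat key for a single pair; scoped per primary host when several pairs exist."""
    if hsby_pair_count > 1:
        return f"{name}[{primary_host}]"
    return name


PRIMARY = "ab://10.0.0.5/1,0"
PARTNER = "ab://10.0.0.6/1,0"
single = diagnostics_key("AbCip.HsbyActive", PRIMARY, hsby_pair_count=1)
scoped = diagnostics_key("AbCip.HsbyActive", PRIMARY, hsby_pair_count=2)
```

Here `single` is the flat key `AbCip.HsbyActive`, while `scoped` carries the primary host in brackets as in the example above.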

The DeviceState.ActiveAddress field (internal; surfaced via HsbyActive diagnostics) is the address PR 5.2 routes through ResolveHost + uses to scope the per-host bulkhead / breaker key. See Failover behaviour for the runtime implications.

Active-resolution rules

| Primary role | Partner role | ActiveAddress resolution |
| --- | --- | --- |
| Active | Standby / Disqualified / Unknown | primary |
| Standby / Disqualified / Unknown | Active | partner |
| Active | Active (split-brain) | primary wins, warning logged |
| Standby | Standby | null. PR 5.2's ResolveHost falls back to the configured primary; the existing dial flow surfaces BadCommunicationError if the primary is also down. See Both-stuck. |
| Unknown | Unknown | null (same fallback as Standby + Standby) |

Split-brain (both chassis claim Active simultaneously) is a real production failure mode — typically a redundancy-module misconfiguration or a partial network split. The driver picks primary deterministically + emits a warning through AbCipDriverOptions.OnWarning so operators see it in the log.
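
The resolution rules reduce to a short decision function. The Python sketch below is an illustration of the table, not the driver's code: `resolve_active` and its `on_warning` parameter are hypothetical stand-ins (the latter for AbCipDriverOptions.OnWarning).

```python
from enum import IntEnum
from typing import Callable, Optional


class HsbyRole(IntEnum):
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


def resolve_active(primary_role: HsbyRole, partner_role: HsbyRole,
                   primary_addr: str, partner_addr: str,
                   on_warning: Optional[Callable[[str], None]] = None) -> Optional[str]:
    """Return the address to treat as Active, or None when neither chassis is Active."""
    if primary_role is HsbyRole.ACTIVE:
        if partner_role is HsbyRole.ACTIVE and on_warning is not None:
            # Split-brain: pick the primary deterministically and tell the operator.
            on_warning(f"HSBY split-brain: {primary_addr} and {partner_addr} both Active")
        return primary_addr
    if partner_role is HsbyRole.ACTIVE:
        return partner_addr
    return None  # Standby / Disqualified / Unknown on both sides


PRIMARY = "ab://10.0.0.5/1,0"
PARTNER = "ab://10.0.0.6/1,0"
warnings = []
split_brain = resolve_active(HsbyRole.ACTIVE, HsbyRole.ACTIVE,
                             PRIMARY, PARTNER, warnings.append)
```

Checking the primary first is what makes split-brain deterministic: the partner is only ever chosen when the primary is not Active.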

CLI flags

The otopcua-abcip-cli tool exposes the HSBY plumbing through two surfaces (see Driver.AbCip.Cli.md for the full CLI guide):

  • --partner <gateway> — global flag on every command. Sets PartnerHostAddress + auto-enables Hsby.Enabled = true so the role probe runs alongside any read / write / subscribe.
  • hsby-status — dedicated command that prints which chassis is currently Active. Reads the role tag on both gateways for a few ticks + prints the (primary, partner, active) tuple.
# Print which chassis is Active right now
otopcua-abcip-cli hsby-status -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0

# Subscribe through the active chassis (PR 5.2 follow-up — today the
# subscribe stays pointed at the primary; the role probe runs alongside).
otopcua-abcip-cli subscribe -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0 \
    -t Motor01_Speed --type Real -i 500

Test coverage

  • Unit (tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs):
    • Pure MapValueToRole matrix (WallClockTime.SyncStatus + S:34 bit mask + Unknown values).
    • End-to-end driver loop: primary Active / partner Standby resolves to primary; both Active resolves to primary with a warning; both Standby clears ActiveAddress; primary read failure routes to partner.
    • Diagnostics surface (AbCip.HsbyActive / HsbyPrimaryRole / HsbyPartnerRole).
    • DTO JSON round-trip (PartnerHostAddress + Hsby.{Enabled, RoleTagAddress, ProbeIntervalMs} survive deserialise → driver → DeviceState).
    • Hsby.Enabled = false → no role probing.
  • Integration (tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/):
    • AbCipHsbyRoleProberTests.cs (PR 5.1) and AbCipHsbyFailoverTests.cs (PR 5.2) — both skipped by default (Assert.Skip). ab_server cannot emulate a ControlLogix HSBY pair (no WallClockTime.SyncStatus, no second chassis concept). The Docker paired profile (PR 5.1) brings up two ab_server instances + a stub hsby-mux sidecar so the topology is documented, but a patched ab_server image that actually serves the role tag is still on the follow-up list.
    • Trait Category=Hsby so dotnet test --filter Category=Hsby finds them once they're promoted.
  • End-to-end (scripts/e2e/test-abcip-hsby.ps1, PR 5.2):
    • Paired-fixture variant of test-abcip.ps1. Subscribes to a tag through the OPC UA server, flips the active chassis mid-stream via the hsby-mux sidecar's POST /flip endpoint, asserts the stream survives + AbCip.HsbyFailoverCount increments. Gated on operator-supplied BridgeNodeId + a running paired fixture; ships unwired into test-all.ps1 until the patched ab_server lands.

Failover behaviour (PR 5.2)

PR 5.2 wires DeviceState.ActiveAddress into the read / write hot path through AbCipDriver.ResolveHost and the runtime-cache lifecycle. After the role-probe loop (PR 5.1) detects an active-address transition the driver re-points every wire-level operation at the now-Active chassis without operator intervention.

What flips on a failover

| Aspect | Pre-flip | Post-flip |
| --- | --- | --- |
| ResolveHost(tag) return | primary HostAddress | the partner address (when partner is now Active) |
| Per-tag libplctag handles in DeviceState.Runtimes | created against primary gateway | dropped on flip; lazily re-created against the partner gateway on next read / write |
| Parent-DINT RMW handles in DeviceState.ParentRuntimes | primary gateway | dropped on flip; same re-create-on-demand path |
| AbCipWriteCoalescer per-device cache | last-known-written values from the primary | reset; the first write of any value to the partner pays the full round-trip |
| LogicalInstanceMap (Logical-mode @tags walk) | populated for primary | cleared; the next read on a Logical-mode device re-walks @tags against the partner |
| Per-host bulkhead key (Polly bulkhead + breaker, plan decision #144) | keyed on primary HostAddress | keyed on the new active address, so the partner gets its own fresh breaker state instead of inheriting a tripped breaker from the now-standby |
| AbCip.HsbyFailoverCount diagnostic | 0 | incremented by 1 on every transition observed by the probe loop |

How the invalidation runs

PR 5.2 introduces an internal OnActiveAddressChanged event raised by HsbyProbeLoopAsync on every DeviceState.ActiveAddress transition. The driver subscribes to it from its own constructor; the handler (HandleActiveAddressChanged) does the cache invalidation in one place:

  1. Disposes every entry in DeviceState.Runtimes and DeviceState.ParentRuntimes, then clears both dicts. Disposed IAbCipTagRuntime instances release their underlying libplctag handles so the native heap doesn't leak.
  2. Clears DeviceState.LogicalInstanceMap and resets LogicalWalkComplete = false so the next read on a Logical-mode device re-fires the @tags symbol walk against the new chassis.
  3. Calls AbCipWriteCoalescer.Reset(deviceHostAddress) so cached "we already wrote 42" decisions don't stale-suppress the first partner-side write.
  4. Resets DeviceState.RuntimesAddress = null so subsequent diagnostics observers see a fresh stamp on the next runtime creation.
  5. Interlocked.Increment on the driver-wide AbCip.HsbyFailoverCount counter.

The handler is idempotent — a second event for the same address change is harmless because the dicts are already empty + the coalescer reset is itself idempotent.
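
The five steps can be modelled in a few lines. This Python sketch is a simplified stand-in for HandleActiveAddressChanged, not its actual source: `FakeRuntime`, `FakeCoalescer`, and the field names are hypothetical, and the real handler works on libplctag-backed runtimes rather than plain objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class FakeRuntime:
    """Stand-in for an IAbCipTagRuntime holding a native handle."""
    def __init__(self) -> None:
        self.disposed = False

    def dispose(self) -> None:
        self.disposed = True


class FakeCoalescer:
    """Stand-in for AbCipWriteCoalescer: last written value per device and tag."""
    def __init__(self) -> None:
        self.last_written: Dict[str, Dict[str, int]] = {}

    def reset(self, device_host: str) -> None:
        self.last_written.pop(device_host, None)  # safe to call repeatedly


@dataclass
class DeviceState:
    host_address: str
    runtimes: Dict[str, FakeRuntime] = field(default_factory=dict)
    parent_runtimes: Dict[str, FakeRuntime] = field(default_factory=dict)
    logical_instance_map: Dict[str, int] = field(default_factory=dict)
    logical_walk_complete: bool = False
    runtimes_address: Optional[str] = None


class Driver:
    def __init__(self, coalescer: FakeCoalescer) -> None:
        self.coalescer = coalescer
        self.failover_count = 0

    def handle_active_address_changed(self, state: DeviceState) -> None:
        # 1. Dispose and drop every per-tag and parent runtime.
        for runtime in [*state.runtimes.values(), *state.parent_runtimes.values()]:
            runtime.dispose()
        state.runtimes.clear()
        state.parent_runtimes.clear()
        # 2. Force the next Logical-mode read to re-walk @tags.
        state.logical_instance_map.clear()
        state.logical_walk_complete = False
        # 3. Forget "already wrote this value" decisions for the device.
        self.coalescer.reset(state.host_address)
        # 4. Clear the stamp so the next runtime creation records a fresh one.
        state.runtimes_address = None
        # 5. Count the transition.
        self.failover_count += 1
```

Disposing before clearing matters in this model: clearing the dicts first would drop the only references to the runtimes without releasing what they hold.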

Bulkhead key semantics

The per-host resilience pipeline (Polly bulkhead + circuit breaker, plan decision #144) keys on whatever IPerCallHostResolver.ResolveHost returns. PR 5.2 changes that resolver so an HSBY-failed-over device returns the partner's address, which means:

  • The device-state lookup (_devices.TryGetValue) keeps using the configured primary HostAddress as the dictionary key — that key never changes for the lifetime of a device, so multi-device configurations stay routable.
  • The resilience pipeline (Polly bulkhead, breaker, retry policies) receives the active address as the host-name dimension. A breaker that tripped against the now-standby chassis (for example, because the primary went away) doesn't bleed over to the partner; the partner gets fresh limits + a closed breaker.

When HSBY is disabled (Hsby.Enabled = false) ResolveHost returns the configured primary HostAddress exactly as it always has — pre-5.2 behaviour, no double-key risk.
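
The split between the two keys can be illustrated with a small Python model. It is a sketch under stated assumptions, not the driver's ResolveHost: `DeviceState` is reduced to three hypothetical fields, and a plain dict of breaker states stands in for the Polly pipeline.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DeviceState:
    host_address: str                     # configured primary, never changes
    hsby_enabled: bool = False
    active_address: Optional[str] = None  # maintained by the role-probe loop


def resolve_host(devices: Dict[str, DeviceState], configured_host: str) -> str:
    """Look up by the configured primary; return the address to use as resilience key."""
    state = devices[configured_host]
    if state.hsby_enabled and state.active_address is not None:
        return state.active_address
    # HSBY off, or no chassis Active: fall back to the configured primary.
    return state.host_address


PRIMARY = "ab://10.0.0.5/1,0"
PARTNER = "ab://10.0.0.6/1,0"
devices = {PRIMARY: DeviceState(PRIMARY, hsby_enabled=True, active_address=PARTNER)}

# Breaker state per resolved host; the primary's breaker tripped before the flip.
breakers: Dict[str, str] = {PRIMARY: "open"}
key_after_flip = resolve_host(devices, PRIMARY)
partner_breaker = breakers.setdefault(key_after_flip, "closed")
```

After the flip the device is still found under `PRIMARY` in `devices`, but the breaker lookup uses `PARTNER`, so `partner_breaker` starts closed while the primary's open breaker is left untouched.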

Failure-mode walkthrough

PR 5.2 adds three failover surface areas to reason about. The tables below summarise, for each one, the behaviour the driver reports + how an operator can inspect it.

Primary-stuck (primary unreachable, partner Active)

The primary chassis goes away (network partition, power loss, a stuck Forward Open). The role-probe loop reads HsbyRole.Unknown for the primary and HsbyRole.Active for the partner.

| Surface | Behaviour |
| --- | --- |
| DeviceState.ActiveAddress | partner address |
| DeviceState.PrimaryRole | Unknown |
| DeviceState.PartnerRole | Active |
| ResolveHost(tag) | partner address |
| Reads / writes | route through partner gateway transparently |
| AbCip.HsbyFailoverCount | incremented when the address transitioned away from the primary |
| AbCip.HsbyActive | 2 (partner is the active chassis) |
| Operator action | none required for routing; investigate why the primary is unreachable through the connectivity-probe loop's _System/_ConnectionStatus for the device |

Secondary-stuck (partner unreachable, primary Active)

The partner chassis goes away (its OPC UA server is down, its IP is unreachable, the redundancy module unhitched it). The probe loop reads HsbyRole.Active for the primary and HsbyRole.Unknown for the partner.

| Surface | Behaviour |
| --- | --- |
| DeviceState.ActiveAddress | primary address (no transition; this is the steady state) |
| DeviceState.PrimaryRole | Active |
| DeviceState.PartnerRole | Unknown |
| ResolveHost(tag) | primary address |
| Reads / writes | route through primary gateway exactly as in a non-HSBY deployment |
| AbCip.HsbyFailoverCount | unchanged (no flip happened) |
| AbCip.HsbyActive | 1 (primary is the active chassis) |
| Operator action | investigate why the partner is unreachable; the operational risk is that a future primary-side outage has no fall-back |

Both-stuck (no chassis Active)

Both chassis report Standby / Disqualified / Unknown (a redundancy-module misconfiguration, both controllers in Program mode, or both unreachable).

| Surface | Behaviour |
| --- | --- |
| DeviceState.ActiveAddress | null |
| ResolveHost(tag) | falls back to the configured primary HostAddress |
| Reads / writes | dispatched to the configured primary; a stuck primary surfaces BadCommunicationError per the existing dial flow |
| AbCip.HsbyActive | 0 (no chassis Active) |
| AbCip.HsbyFailoverCount | incremented when the transition Active → null happened |
| Operator action | investigate the redundancy module / mode keys; the SCADA layer sees stuck-or-bad-quality reads, not incorrect routing |

The "fall back to primary on null Active" choice is deliberate. Routing all reads to a deterministic chassis (the configured primary) keeps the breaker key + bulkhead state stable while the operator diagnoses the double-down outage; the alternative (round-robin / partner) would just trip both breakers in turn and obscure which chassis is the real problem.

Follow-ups (beyond PR 5.2)

  • Patched ab_server image — add a writable WallClockTime.SyncStatus tag (or a separate Python shim) so the Docker paired profile can exercise the wire-level role probe + the tests/.../IntegrationTests/AbCipHsbyFailoverTests.cs scaffold can flip its Assert.Skip for a real integration assertion.
  • hsby-mux REST endpoint: POST /flip {"active": "primary"} writes 1 to the chosen chassis + 0 to the other so integration tests + scripts/e2e/test-abcip-hsby.ps1 can drive switch-overs deterministically.
  • GuardLogix HSBY — same role-tag plumbing applies; verify against a real 1756-L8xS pair when one is on-site.

See also