# AbCip: ControlLogix HSBY paired-IP support

PR abcip-5.1 + 5.2 ship **non-transparent** HSBY (Hot-Standby) awareness
to the AB CIP driver. Each device may declare a partner gateway; when both
gateways are up the driver concurrently probes a role tag on each chassis,
reports which one is currently Active, and routes reads / writes through
that chassis automatically.

- **PR abcip-5.1**: gathers + reports the role of each chassis through
  driver diagnostics. See
  [Role-tag detection matrix](#role-tag-detection-matrix) and
  [Active-resolution rules](#active-resolution-rules).
- **PR abcip-5.2**: wires the resolved active address into
  `AbCipDriver.ResolveHost` and the runtime-cache lifecycle. See
  [Failover behaviour](#failover-behaviour-pr-52) and
  [Failure-mode walkthrough](#failure-mode-walkthrough).

## When to use HSBY paired IPs

Use paired IPs when you have a redundant **ControlLogix** chassis pair
(1756-RM redundancy module, two CPUs, one Active + one Standby) and the
SCADA / OPC UA layer needs to keep talking to *whichever chassis is
currently Active* without an operator manually re-pointing the connection.

Pre-5.1 the driver only knew about a single `HostAddress`. After a
hot-standby switch-over, the standby (now Active) carried a **different IP**
and the driver kept probing the dead-but-was-Active address until someone
edited the config.

PR abcip-5.1 closes the visibility half of that gap by reading the role tag
on both chassis. PR abcip-5.2 closes the routing half by re-pointing
`ResolveHost` at the Active address each tick + invalidating the per-tag
runtime cache + write-coalescer state on every flip.

## Configuration

```jsonc
{
  "Devices": [
    {
      "HostAddress": "ab://10.0.0.5/1,0",
      "PartnerHostAddress": "ab://10.0.0.6/1,0",
      "Hsby": {
        "Enabled": true,
        "RoleTagAddress": "WallClockTime.SyncStatus",
        "ProbeIntervalMs": 2000
      }
    }
  ]
}
```

| Field | Default | Notes |
|---|---|---|
| `PartnerHostAddress` | `null` | Canonical `ab://gateway[:port]/cip-path` of the partner chassis. `null` = no HSBY pair; the driver behaves exactly like every pre-5.1 build. |
| `Hsby.Enabled` | `false` | Master switch. When `false` (or `Hsby` omitted) no role probing happens, even if `PartnerHostAddress` is set. |
| `Hsby.RoleTagAddress` | `WallClockTime.SyncStatus` | Address of the role tag on each chassis. See [role-tag detection matrix](#role-tag-detection-matrix). |
| `Hsby.ProbeIntervalMs` | `2000` | How often each chassis is sampled. 2 s is a good default: tight enough to detect a switch-over within one Admin-UI refresh, loose enough to leave headroom for the regular probe loop. |

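
The defaults in the table can be illustrated with a small, language-neutral sketch. It is written in Python rather than the driver's own .NET code, and `HsbyOptions`, `DeviceOptions`, `parse_device` and `role_probing_enabled` are hypothetical names invented for this example. The rule that probing needs both `Hsby.Enabled` and a non-null `PartnerHostAddress` is a reading of the table, not a statement about the real option binder.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class HsbyOptions:
    # Defaults copied from the field table above.
    enabled: bool = False
    role_tag_address: str = "WallClockTime.SyncStatus"
    probe_interval_ms: int = 2000


@dataclass(frozen=True)
class DeviceOptions:
    host_address: str
    partner_host_address: Optional[str] = None
    hsby: HsbyOptions = field(default_factory=HsbyOptions)


def parse_device(raw: Mapping[str, Any]) -> DeviceOptions:
    """Hypothetical helper: apply the documented defaults to one Devices[] entry."""
    hsby_raw = raw.get("Hsby") or {}
    defaults = HsbyOptions()
    hsby = HsbyOptions(
        enabled=hsby_raw.get("Enabled", defaults.enabled),
        role_tag_address=hsby_raw.get("RoleTagAddress", defaults.role_tag_address),
        probe_interval_ms=hsby_raw.get("ProbeIntervalMs", defaults.probe_interval_ms),
    )
    return DeviceOptions(
        host_address=raw["HostAddress"],
        partner_host_address=raw.get("PartnerHostAddress"),
        hsby=hsby,
    )


def role_probing_enabled(device: DeviceOptions) -> bool:
    # Assumption: probing needs the master switch AND a partner address.
    return device.hsby.enabled and device.partner_host_address is not None
```

A device entry that omits `Hsby` entirely parses to `enabled=False`, which matches the "behaves exactly like every pre-5.1 build" row.
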
## Feature-flag gate (`Redundancy.Hsby.Enabled`)

`Hsby.Enabled = false` (the default) is the off-switch for the entire
feature. The role-probe loop never starts, the diagnostics keys are not
emitted, and the driver behaves identically to a pre-5.1 build. This is the
gate to flip when an operator wants to roll the feature out cautiously
across a fleet: set `Hsby.Enabled = true` per-device in driver config (no
build flag, no env var).

When the gate is on but the partner gateway is unreachable, the role-probe
loop reports `HsbyRole.Unknown` for the partner each tick. The primary's
role still drives the active-chassis resolution; the operator sees the
partner's role as Unknown in the Admin UI / driver diagnostics, which is the
correct surface for "we can't reach the standby chassis right now."

## Role-tag detection matrix

| Firmware / fronts | Address | Decode |
|---|---|---|
| **v20 / v24 / v32+ ControlLogix HSBY** | `WallClockTime.SyncStatus` (DINT) | `0` = Standby, `1` = Synchronized / Active, `2` = Disqualified, anything else = Unknown |
| **PLC-5 / SLC500 status-byte fallback** | `S:34` Module Status word | bit 0 = "this chassis is Active". Bit set → `Active`; clear → `Standby` |
| **Custom user role tag** | any DINT-typed CIP path | Same matrix as `WallClockTime.SyncStatus` (0 / 1 / 2). Out-of-range values → Unknown. |

`AbCipHsbyRoleProber.MapValueToRole` is the value-to-role mapper; unit tests
in `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs` pin every
row of the matrix.

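
A minimal sketch of the decode rules in the matrix, written in Python for illustration only. It is not the source of `AbCipHsbyRoleProber.MapValueToRole`; the function names `map_sync_status` and `map_status_word` are hypothetical. The enum numbering follows the `(int)HsbyRole` encoding listed under "What gets reported", which is why the raw tag value `0` (Standby) and the enum value `0` (Unknown) must not be confused.

```python
from enum import IntEnum


class HsbyRole(IntEnum):
    # Encoding as reported by the diagnostics counters.
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


# Raw DINT value of the role tag -> role (first and third rows of the matrix).
_SYNC_STATUS_TO_ROLE = {
    0: HsbyRole.STANDBY,
    1: HsbyRole.ACTIVE,
    2: HsbyRole.DISQUALIFIED,
}


def map_sync_status(value: int) -> HsbyRole:
    """Decode a WallClockTime.SyncStatus-style DINT; anything else is Unknown."""
    return _SYNC_STATUS_TO_ROLE.get(value, HsbyRole.UNKNOWN)


def map_status_word(word: int) -> HsbyRole:
    """Decode the S:34 fallback: bit 0 set means this chassis is Active."""
    return HsbyRole.ACTIVE if word & 0x1 else HsbyRole.STANDBY
```

Note that the status-word path can never yield `Unknown` or `Disqualified` from a successful read in this sketch; an unreachable chassis would have to be mapped to `Unknown` by the caller.
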
## What gets reported

The driver surfaces four diagnostics counters (visible via the
`driver-diagnostics` RPC + the Admin UI): three per HSBY-enabled device,
plus the driver-wide `AbCip.HsbyFailoverCount` added in PR 5.2.

| Counter | Value |
|---|---|
| `AbCip.HsbyActive` | `1` if primary is Active, `2` if partner is Active, `0` if neither (or HSBY off) |
| `AbCip.HsbyPrimaryRole` | `(int)HsbyRole`: `0` = Unknown, `1` = Active, `2` = Standby, `3` = Disqualified |
| `AbCip.HsbyPartnerRole` | Same encoding as `HsbyPrimaryRole`, observed on the partner chassis |
| `AbCip.HsbyFailoverCount` (PR 5.2) | Total number of `ActiveAddress` transitions the probe loop has observed across every HSBY-enabled device on this driver. Each increment maps to one runtime-cache invalidation + write-coalescer reset. |

When more than one HSBY pair is configured on the same driver instance the
flat keys are scoped per primary host: `AbCip.HsbyActive[ab://10.0.0.5/1,0]`,
etc.

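
The key-scoping rule can be sketched as follows. This is an illustrative Python model, not the driver's diagnostics code: `hsby_active_code` and `diagnostics_keys` are hypothetical helpers, the device records are plain dicts invented for the example, and the driver-wide `AbCip.HsbyFailoverCount` is deliberately left out because the text above does not say whether it is ever scoped.

```python
from typing import Dict, List, Optional


def hsby_active_code(active: Optional[str], primary: str, partner: str) -> int:
    """1 = primary is Active, 2 = partner is Active, 0 = neither."""
    if active == primary:
        return 1
    if active == partner:
        return 2
    return 0


def diagnostics_keys(devices: List[dict]) -> Dict[str, int]:
    """Build flat keys; add a [primary-host] suffix when several pairs exist."""
    scoped = len(devices) > 1
    out: Dict[str, int] = {}
    for dev in devices:
        suffix = f"[{dev['primary']}]" if scoped else ""
        out[f"AbCip.HsbyActive{suffix}"] = hsby_active_code(
            dev["active"], dev["primary"], dev["partner"]
        )
        # Roles use the (int)HsbyRole encoding: 0..3.
        out[f"AbCip.HsbyPrimaryRole{suffix}"] = dev["primary_role"]
        out[f"AbCip.HsbyPartnerRole{suffix}"] = dev["partner_role"]
    return out
```

With a single pair the keys stay flat (`AbCip.HsbyActive`); with two pairs each key carries the configured primary host in brackets.
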
The `DeviceState.ActiveAddress` field (internal; surfaced via
`HsbyActive` diagnostics) is the address PR 5.2 routes through
`ResolveHost` + uses to scope the per-host bulkhead / breaker key.
See [Failover behaviour](#failover-behaviour-pr-52) for the runtime
implications.

### Active-resolution rules

| Primary role | Partner role | `ActiveAddress` resolution |
|---|---|---|
| Active | Standby / Disqualified / Unknown | primary |
| Standby / Disqualified / Unknown | Active | partner |
| Active | Active (split-brain) | **primary wins**, warning logged |
| Standby | Standby | `null`. PR 5.2's `ResolveHost` falls back to the configured primary; the existing dial flow surfaces `BadCommunicationError` if the primary is also down. See [Both-stuck](#both-stuck-no-chassis-active). |
| Unknown | Unknown | `null` (same fallback as Standby + Standby) |

Split-brain (both chassis claim Active simultaneously) is a real
production failure mode, typically a redundancy-module misconfiguration or
a partial network split. The driver picks primary deterministically + emits
a warning through `AbCipDriverOptions.OnWarning` so operators see it in the
log.

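
The resolution table reduces to a short decision function. The following Python sketch is an illustration under stated assumptions, not the driver's implementation: `resolve_active` and its `warn` callback are hypothetical stand-ins (the real warning path is `AbCipDriverOptions.OnWarning`), and every combination without an Active chassis is treated as `None`, in line with the Both-stuck walkthrough.

```python
from enum import IntEnum
from typing import Callable, Optional


class HsbyRole(IntEnum):
    UNKNOWN = 0
    ACTIVE = 1
    STANDBY = 2
    DISQUALIFIED = 3


def resolve_active(
    primary_role: HsbyRole,
    partner_role: HsbyRole,
    primary: str,
    partner: str,
    warn: Callable[[str], None] = print,
) -> Optional[str]:
    """Return the address to treat as Active, or None when nobody is Active."""
    if primary_role == HsbyRole.ACTIVE:
        if partner_role == HsbyRole.ACTIVE:
            # Split-brain: pick primary deterministically and surface a warning.
            warn(f"HSBY split-brain: {primary} and {partner} both report Active")
        return primary
    if partner_role == HsbyRole.ACTIVE:
        return partner
    # No chassis claims Active; the caller falls back to the configured primary.
    return None
```

Checking the primary first is what makes the split-brain row deterministic: the partner is only consulted when the primary does not claim Active.
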
## CLI flags

The `otopcua-abcip-cli` tool exposes the HSBY plumbing through two surfaces
(see [Driver.AbCip.Cli.md](../Driver.AbCip.Cli.md) for the full CLI guide):

- `--partner <gateway>`: global flag on every command. Sets
  `PartnerHostAddress` + auto-enables `Hsby.Enabled = true` so the role
  probe runs alongside any read / write / subscribe.
- `hsby-status`: dedicated command that prints which chassis is
  currently Active. Reads the role tag on both gateways for a few ticks +
  prints the `(primary, partner, active)` tuple.

```powershell
# Print which chassis is Active right now
otopcua-abcip-cli hsby-status -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0

# Subscribe through the active chassis (PR 5.2 follow-up: today the
# subscribe stays pointed at the primary; the role probe runs alongside).
# The trailing backtick is PowerShell's line-continuation character.
otopcua-abcip-cli subscribe -g ab://10.0.0.5/1,0 --partner ab://10.0.0.6/1,0 `
    -t Motor01_Speed --type Real -i 500
```

## Test coverage

- **Unit** (`tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipHsbyTests.cs`):
  - Pure `MapValueToRole` matrix (WallClockTime.SyncStatus + S:34 bit
    mask + Unknown values).
  - End-to-end driver loop: primary Active / partner Standby resolves to
    primary; both Active resolves to primary with a warning; both
    Standby clears `ActiveAddress`; primary read failure routes to
    partner.
  - Diagnostics surface (`AbCip.HsbyActive` / `HsbyPrimaryRole` /
    `HsbyPartnerRole`).
  - DTO JSON round-trip (`PartnerHostAddress` + `Hsby.{Enabled,
    RoleTagAddress, ProbeIntervalMs}` survive deserialise → driver →
    `DeviceState`).
  - `Hsby.Enabled = false` → no role probing.
- **Integration** (`tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/`):
  - `AbCipHsbyRoleProberTests.cs` (PR 5.1) and
    `AbCipHsbyFailoverTests.cs` (PR 5.2): both **skipped by default**
    (`Assert.Skip`). `ab_server` cannot emulate a ControlLogix HSBY
    pair (no `WallClockTime.SyncStatus`, no second chassis concept).
    The Docker `paired` profile (PR 5.1) brings up two `ab_server`
    instances + a stub `hsby-mux` sidecar so the topology is
    documented, but a patched `ab_server` image that actually serves
    the role tag is still on the follow-up list.
  - Trait `Category=Hsby` so `dotnet test --filter Category=Hsby`
    finds them once they're promoted.
- **End-to-end** (`scripts/e2e/test-abcip-hsby.ps1`, PR 5.2):
  - Paired-fixture variant of `test-abcip.ps1`. Subscribes to a tag
    through the OPC UA server, flips the active chassis mid-stream
    via the `hsby-mux` sidecar's `POST /flip` endpoint, asserts the
    stream survives + `AbCip.HsbyFailoverCount` increments. Gated
    on operator-supplied `BridgeNodeId` + a running paired fixture;
    it ships without being wired into `test-all.ps1` until the
    patched `ab_server` lands.

## Failover behaviour (PR 5.2)

PR 5.2 wires `DeviceState.ActiveAddress` into the read / write hot path
through `AbCipDriver.ResolveHost` and the runtime-cache lifecycle. After
the role-probe loop (PR 5.1) detects an active-address transition the
driver re-points every wire-level operation at the now-Active chassis
without operator intervention.

### What flips on a failover

| Aspect | Pre-flip | Post-flip |
|---|---|---|
| `ResolveHost(tag)` return | primary `HostAddress` | the partner address (when partner is now Active) |
| Per-tag libplctag handles in `DeviceState.Runtimes` | created against primary gateway | dropped on flip; lazily re-created against the partner gateway on next read / write |
| Parent-DINT RMW handles in `DeviceState.ParentRuntimes` | primary gateway | dropped on flip; same re-create-on-demand path |
| `AbCipWriteCoalescer` per-device cache | last-known-written values from the primary | reset; the first write of any value to the partner pays the full round-trip |
| `LogicalInstanceMap` (Logical-mode `@tags` walk) | populated for primary | cleared; the next read on a Logical-mode device re-walks `@tags` against the partner |
| Per-host bulkhead key (Polly bulkhead + breaker, plan decision #144) | keyed on primary `HostAddress` | keyed on the new active address, so the partner gets its own fresh breaker state instead of inheriting a tripped breaker from the now-standby |
| `AbCip.HsbyFailoverCount` diagnostic | 0 | incremented by 1 on every transition observed by the probe loop |

### How the invalidation runs

PR 5.2 introduces an internal `OnActiveAddressChanged` event raised by
`HsbyProbeLoopAsync` on every `DeviceState.ActiveAddress` transition. The
driver subscribes to it from its own constructor; the handler
(`HandleActiveAddressChanged`) does the cache invalidation in one place:

1. Disposes every entry in `DeviceState.Runtimes` and
   `DeviceState.ParentRuntimes`, then clears both dicts. Disposed
   `IAbCipTagRuntime` instances release their underlying libplctag
   handles so the native heap doesn't leak.
2. Clears `DeviceState.LogicalInstanceMap` and resets
   `LogicalWalkComplete = false` so the next read on a Logical-mode
   device re-fires the `@tags` symbol walk against the new chassis.
3. Calls `AbCipWriteCoalescer.Reset(deviceHostAddress)` so cached
   "we already wrote 42" decisions don't stale-suppress the first
   partner-side write.
4. Resets `DeviceState.RuntimesAddress = null` so subsequent
   diagnostics observers see a fresh stamp on the next runtime
   creation.
5. Calls `Interlocked.Increment` on the driver-wide
   `AbCip.HsbyFailoverCount` counter.

The handler is idempotent: a second event for the same address change
is harmless because the dicts are already empty + the coalescer reset
is itself idempotent.

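
The five steps above can be modelled in a few lines. This is a simplified Python sketch, not the .NET handler: `FakeRuntime`, `DeviceState` and `DriverSketch` are hypothetical stand-ins, the coalescer is reduced to a dict keyed by the configured primary host, and a plain integer replaces the `Interlocked.Increment` call. In this sketch the counter is bumped once per call, so it assumes the caller only raises the event on a genuine transition.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class FakeRuntime:
    """Stand-in for a tag runtime that owns a native handle."""

    def __init__(self) -> None:
        self.disposed = False

    def dispose(self) -> None:
        self.disposed = True


@dataclass
class DeviceState:
    runtimes: Dict[str, FakeRuntime] = field(default_factory=dict)
    parent_runtimes: Dict[str, FakeRuntime] = field(default_factory=dict)
    logical_instance_map: Dict[str, int] = field(default_factory=dict)
    logical_walk_complete: bool = False
    runtimes_address: Optional[str] = None


class DriverSketch:
    def __init__(self) -> None:
        self.failover_count = 0
        # Last-written values per device, keyed by configured primary host.
        self.coalescer: Dict[str, Dict[str, object]] = {}

    def handle_active_address_changed(self, device_host: str, state: DeviceState) -> None:
        # 1. Dispose every runtime first, then clear both dicts.
        for runtime in list(state.runtimes.values()) + list(state.parent_runtimes.values()):
            runtime.dispose()
        state.runtimes.clear()
        state.parent_runtimes.clear()
        # 2. Force the next Logical-mode read to re-walk the symbol table.
        state.logical_instance_map.clear()
        state.logical_walk_complete = False
        # 3. Forget "already wrote this value" decisions for the device.
        self.coalescer.pop(device_host, None)
        # 4. Clear the stamp; it is set again on the next runtime creation.
        state.runtimes_address = None
        # 5. Count the transition (single-threaded stand-in for an atomic add).
        self.failover_count += 1
```

Disposing before clearing matters: dropping the dict entries first would lose the only references to the handles that need releasing.
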
### Bulkhead key semantics

The per-host resilience pipeline (Polly bulkhead + circuit breaker, plan
decision #144) keys on whatever `IPerCallHostResolver.ResolveHost`
returns. PR 5.2 changes that resolver so an HSBY-failed-over device
returns the partner's address, which means:

- The **device-state lookup** (`_devices.TryGetValue`) keeps using the
  configured primary `HostAddress` as the dictionary key; that key
  never changes for the lifetime of a device, so multi-device
  configurations stay routable.
- The **resilience pipeline** (Polly bulkhead, breaker, retry policies)
  receives the active address as the host-name dimension. A breaker
  that tripped against the now-standby chassis (for example, because
  the primary went away) doesn't bleed over to the partner; the partner
  gets fresh limits + a closed breaker.

When HSBY is disabled (`Hsby.Enabled = false`) `ResolveHost` returns the
configured primary `HostAddress` exactly as it always has: pre-5.2
behaviour, no double-key risk.

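
The split between the two keys can be shown with a toy model. The Python below is a sketch under assumptions, not the Polly pipeline or the real resolver: `resolve_host` and `BreakerRegistry` are hypothetical, and the breaker is reduced to an open/closed flag per host string.

```python
from typing import Dict, Optional


def resolve_host(primary: str, hsby_enabled: bool, active_address: Optional[str]) -> str:
    """Pick the host string used as the resilience-pipeline key."""
    if not hsby_enabled or active_address is None:
        # HSBY off, or no chassis Active: fall back to the configured primary.
        return primary
    return active_address


class BreakerRegistry:
    """Toy per-host breaker table: one open/closed flag per host key."""

    def __init__(self) -> None:
        self._open: Dict[str, bool] = {}

    def trip(self, host: str) -> None:
        self._open[host] = True

    def is_open(self, host: str) -> bool:
        # A host that was never seen starts with a closed breaker.
        return self._open.get(host, False)
```

Because the breaker table is keyed by the resolved host, tripping the primary's breaker and then failing over yields a key the table has never seen, which is the "fresh limits + a closed breaker" behaviour described above. The device dictionary would still be indexed by the configured primary.
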
## Failure-mode walkthrough

PR 5.2 adds three failover surface areas to reason about. The tables
below summarise the behaviour the driver reports + how an operator
can inspect it.

### Primary-stuck (primary unreachable, partner Active)

The primary chassis goes away (network partition, power loss, a stuck
Forward Open). The role-probe loop reads `HsbyRole.Unknown` for the
primary and `HsbyRole.Active` for the partner.

| Surface | Behaviour |
|---|---|
| `DeviceState.ActiveAddress` | partner address |
| `DeviceState.PrimaryRole` | `Unknown` |
| `DeviceState.PartnerRole` | `Active` |
| `ResolveHost(tag)` | partner address |
| Reads / writes | route through partner gateway transparently |
| `AbCip.HsbyFailoverCount` | incremented when the address transitioned away from the primary |
| `AbCip.HsbyActive` | `2` (partner is the active chassis) |
| Operator action | none required for routing; investigate why the primary is unreachable through the connectivity-probe loop's `_System/_ConnectionStatus` for the device |

### Secondary-stuck (partner unreachable, primary Active)

The partner chassis goes away (its OPC UA server is down, its IP is
unreachable, the redundancy module unhitched it). The probe loop reads
`HsbyRole.Active` for the primary and `HsbyRole.Unknown` for the partner.

| Surface | Behaviour |
|---|---|
| `DeviceState.ActiveAddress` | primary address (no transition; this is the steady state) |
| `DeviceState.PrimaryRole` | `Active` |
| `DeviceState.PartnerRole` | `Unknown` |
| `ResolveHost(tag)` | primary address |
| Reads / writes | route through primary gateway exactly as in a non-HSBY deployment |
| `AbCip.HsbyFailoverCount` | unchanged (no flip happened) |
| `AbCip.HsbyActive` | `1` (primary is the active chassis) |
| Operator action | investigate why the partner is unreachable; the operational risk is that a future primary-side outage has no fall-back |

### Both-stuck (no chassis Active)

Both chassis report `Standby` / `Disqualified` / `Unknown` (a
redundancy-module misconfiguration, both controllers in Program mode,
or both unreachable).

| Surface | Behaviour |
|---|---|
| `DeviceState.ActiveAddress` | `null` |
| `ResolveHost(tag)` | falls back to the configured primary `HostAddress` |
| Reads / writes | dispatched to the configured primary; a stuck primary surfaces `BadCommunicationError` per the existing dial flow |
| `AbCip.HsbyActive` | `0` (no chassis Active) |
| `AbCip.HsbyFailoverCount` | incremented when the transition `Active → null` happened |
| Operator action | investigate the redundancy module / mode keys; the SCADA layer sees stuck-or-bad-quality reads, not incorrect routing |

The "fall back to primary on null Active" choice is deliberate. Routing
all reads to a deterministic chassis (the configured primary) keeps the
breaker key + bulkhead state stable while the operator diagnoses the
double-down outage; the alternative (round-robin / partner) would just
trip both breakers in turn and obscure which chassis is the real
problem.

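
To tie the three scenarios together, the sketch below replays a sequence of per-tick `ActiveAddress` values and reports where reads would be routed plus how many transitions were seen. It is a hedged Python illustration, not the probe loop: `replay_ticks` is a hypothetical helper, and treating the very first observation as a baseline rather than a failover is an assumption of this sketch.

```python
from typing import Iterable, List, Optional, Tuple


def replay_ticks(primary: str, actives: Iterable[Optional[str]]) -> Tuple[List[str], int]:
    """Return (routed host per tick, number of ActiveAddress transitions)."""
    routed: List[str] = []
    failovers = 0
    previous: Optional[str] = None
    first = True
    for active in actives:
        # Assumption: the first observation sets the baseline and is not counted.
        if not first and active != previous:
            failovers += 1  # includes transitions to and from None
        first = False
        previous = active
        # A null Active falls back to the configured primary.
        routed.append(active if active is not None else primary)
    return routed, failovers
```

Feeding it a primary-stuck tick, then a both-stuck stretch, then a recovery shows the routed host returning to the configured primary as soon as no chassis is Active, while the counter still records the `Active → null` step.
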
## Follow-ups (beyond PR 5.2)

- **Patched `ab_server` image**: add a writable `WallClockTime.SyncStatus`
  tag (or a separate Python shim) so the Docker `paired` profile can
  exercise the wire-level role probe + the
  `tests/.../IntegrationTests/AbCipHsbyFailoverTests.cs` scaffold can
  flip its `Assert.Skip` for a real integration assertion.
- **`hsby-mux` REST endpoint**: `POST /flip {"active": "primary"}` writes
  `1` to the chosen chassis + `0` to the other so integration tests +
  `scripts/e2e/test-abcip-hsby.ps1` can drive switch-overs
  deterministically.
- **GuardLogix HSBY**: same role-tag plumbing applies; verify against a
  real 1756-L8xS pair when one is on-site.

## See also

- [`docs/Driver.AbCip.Cli.md`](../Driver.AbCip.Cli.md): `--partner` flag +
  `hsby-status` command reference
- [`docs/drivers/AbServer-Test-Fixture.md`](AbServer-Test-Fixture.md) §"What
  it does NOT cover": HSBY entry
- [`docs/Redundancy.md`](../Redundancy.md): server-level (OPC UA-stack)
  redundancy; HSBY is the **driver-level** companion