Auto: ablegacy-12 — auto-demote on comm failure

Closes #255
Joseph Doherty
2026-04-26 08:44:53 -04:00
parent 8ee65a75d2
commit 1e3053c0d8
18 changed files with 1160 additions and 31 deletions

@@ -7,10 +7,12 @@ directly without going through a separate diagnostics RPC. Mirrors the AB CIP
Closes #253 (PR ablegacy-10).
## The seven counters
## The nine counters
Each device managed by the `AbLegacyDriver` exposes seven read-only nodes under
`AbLegacy/<host>/_Diagnostics/<name>`:
Each device managed by the `AbLegacyDriver` exposes nine read-only nodes under
`AbLegacy/<host>/_Diagnostics/<name>`. The first seven shipped in PR ablegacy-10;
`DemoteCount` + `LastDemotedUtc` arrived with PR ablegacy-12 / #255 (auto-demote
on comm failure).
| Name | Type | Semantics |
|---|---|---|
@@ -21,6 +23,8 @@ Each device managed by the `AbLegacyDriver` exposes seven read-only nodes under
| `LastErrorCode` | Int32 | Most recent libplctag status code on a failed read; `0` when no error has been seen since the last reset. |
| `LastErrorMessage` | String | Most recent libplctag error message on a failed read; empty when no error has been seen since the last reset. |
| `CommFailures` | Int64 | Count of read failures mapped to `BadCommunicationError`. Spans transient libplctag throws + retried-out chains so operators see a single "wire fell off" counter. |
| `DemoteCount` | Int64 | **PR ablegacy-12** — cumulative auto-demote events for this device. Bumps every time the driver crosses the consecutive-failure threshold and arms a fresh cool-down window. Cumulative across `ReinitializeAsync` (preserved through redeploys) so a flapping link surfaces as a steadily climbing counter. |
| `LastDemotedUtc` | String | **PR ablegacy-12** — ISO-8601 UTC timestamp of the most recent auto-demotion. Empty string when this device has never been demoted. |
**Address shape**: `_Diagnostics/<deviceHostAddress>/<name>`
e.g. `_Diagnostics/ab://10.0.0.5/1,0/RequestCount`.
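Because the device host address itself contains `/` characters, a consumer that wants to split such an address back into host and counter name cannot split on the first slash. The following is a minimal Python sketch of one way to do it, assuming the counter name never contains a slash; the helper name `split_diagnostic_address` is hypothetical and is not part of the driver.

```python
RESERVED_PREFIX = "_Diagnostics/"


def split_diagnostic_address(address: str) -> tuple[str, str]:
    """Split '_Diagnostics/<deviceHostAddress>/<name>' into (host, name).

    Hypothetical helper for illustration only; assumes <name> has no '/'.
    """
    if not address.startswith(RESERVED_PREFIX):
        raise ValueError(f"not a diagnostics address: {address!r}")
    rest = address[len(RESERVED_PREFIX):]
    # The host part (e.g. 'ab://10.0.0.5/1,0') contains slashes,
    # so split on the LAST slash only.
    host, sep, name = rest.rpartition("/")
    if not sep or not host or not name:
        raise ValueError(f"malformed diagnostics address: {address!r}")
    return host, name
```

Splitting from the right keeps the full `ab://…` host string intact regardless of how many path segments it carries.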
@@ -34,10 +38,11 @@ user-config tag node, just under a reserved sibling folder.
| Trigger | Effect |
|---|---|
| `ReinitializeAsync` | Every counter for every device resets to zero, plus `LastErrorMessage` clears to empty. |
| `ShutdownAsync` | Same as Reinitialize — counters drop with the device map. |
| `ReinitializeAsync` | Every counter for every device resets to zero, plus `LastErrorMessage` clears to empty. **PR ablegacy-12 exception:** `DemoteCount` + `LastDemotedUtc` survive the reinit so an operator redeploying mid-incident doesn't lose the flapping-link history. |
| `ShutdownAsync` | All counters drop with the device map (including `DemoteCount`). |
| Driver process restart | Counters start at zero. |
| Probe transition Stopped→Running | **No automatic reset** — counters are cumulative across reconnect events so operators can spot intermittent links by watching `CommFailures` keep climbing. |
| Probe transition Demoted→Running | **PR ablegacy-12** — early-clear of the active demote window, but the cumulative `DemoteCount` stays put. |
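The reset rules in the table can be modelled in a few lines. This is a Python sketch under stated assumptions, not the driver's code: `DeviceDiagnostics`, `reinitialize`, and `shutdown` are hypothetical names, and only a subset of the counters is shown.

```python
from dataclasses import dataclass


@dataclass
class DeviceDiagnostics:
    # Subset of the per-device counters, for illustration.
    request_count: int = 0
    comm_failures: int = 0
    last_error_message: str = ""
    demote_count: int = 0
    last_demoted_utc: str = ""


def reinitialize(devices: dict[str, DeviceDiagnostics]) -> dict[str, DeviceDiagnostics]:
    """Reset every counter except the demote history, per device."""
    return {
        host: DeviceDiagnostics(
            demote_count=d.demote_count,
            last_demoted_utc=d.last_demoted_utc,
        )
        for host, d in devices.items()
    }


def shutdown(devices: dict[str, DeviceDiagnostics]) -> dict[str, DeviceDiagnostics]:
    """Drop the whole device map, demote history included."""
    return {}
```

A probe transition back to `Running` touches none of these fields in this model, which matches the "no automatic reset" rows.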
There is no in-process "reset" RPC at the time of writing. If you need to
clear counters without a redeploy, kick a `ReinitializeAsync` from the Admin
@@ -99,14 +104,85 @@ overview dashboard, plus a faster rate (1 s) on `LastErrorMessage` /
short-circuit makes every read O(1) — there's no penalty for fast polling
of the counter itself, only the OPC UA subscription bookkeeping.
## Auto-demote on comm failure (PR ablegacy-12 / #255)
When a device fails `FailureThreshold` consecutive reads or probes, the driver marks it
**Demoted** for a configurable cool-down window. Reads against a demoted
device short-circuit with `BadCommunicationError` *without invoking
libplctag* — that's the whole point of the feature: one slow PLC sharing
the driver thread can't starve faster peers reading from healthy hosts on
the same `AbLegacyDriver` instance.
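The short-circuit can be illustrated with a small model. This is a Python sketch, not the driver's C# implementation: `DemoteGate` and the `backend_read` callable are hypothetical stand-ins for the driver's bookkeeping and the libplctag call, and `ConnectionError` stands in for whatever the real read path classifies as a comm failure.

```python
from datetime import datetime, timedelta, timezone


class DemoteGate:
    """Hypothetical per-device gate: threshold, cool-down, short-circuit."""

    def __init__(self, failure_threshold=3, demote_for=timedelta(seconds=30)):
        self.failure_threshold = failure_threshold
        self.demote_for = demote_for
        self.consecutive_failures = 0
        self.demoted_until = None  # None means "not demoted"

    def read(self, backend_read, now):
        # While the cool-down is active, fail fast without touching the wire.
        if self.demoted_until is not None and now < self.demoted_until:
            return "BadCommunicationError"
        try:
            value = backend_read()
        except ConnectionError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.demoted_until = now + self.demote_for
            return "BadCommunicationError"
        # A successful read resets the tally and clears any stale marker.
        self.consecutive_failures = 0
        self.demoted_until = None
        return value
```

The point of the model is the first branch: once the window is armed, `backend_read` is never invoked, so a slow or dead device costs its peers nothing until the window lapses.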
### Configuration
Per-device and optional. Leaving `Demote` as `null` keeps the documented defaults
(auto-demote **enabled**, 3 failures / 30 s).
```jsonc
{
"Devices": [
{
"HostAddress": "ab://10.0.0.5/1,0",
"PlcFamily": "Slc500",
"Demote": {
"FailureThreshold": 3, // default 3
"DemoteForMs": 30000, // default 30s
"Enabled": true // default true
}
}
]
}
```
| Knob | Default | Notes |
|---|---|---|
| `FailureThreshold` | `3` | Consecutive comm failures before the device is demoted. A successful read or probe resets the tally. Terminal failures (`BadNodeIdUnknown`, `BadTypeMismatch`, …) **do not count** — they're config / decoder mismatches, not field outages. |
| `DemoteForMs` | `30000` (30s) | Cool-down window. Reads while this is active short-circuit; a successful probe clears it early. |
| `Enabled` | `true` | Set to `false` to keep the diagnostic counters but skip the auto-throttle. The failure tally still ticks but never arms the cool-down. |
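As an illustration of how the defaults in the table resolve, here is a Python sketch that reads the `Demote` block from a device entry. `DemoteOptions` and `demote_options_from` are hypothetical names, and the field-by-field fallback for a partially specified block is an assumption of this sketch; the source only states that a `null` block keeps the defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoteOptions:
    failure_threshold: int = 3      # documented default
    demote_for_ms: int = 30_000     # documented default (30 s)
    enabled: bool = True            # documented default


def demote_options_from(device: dict) -> DemoteOptions:
    """Resolve the per-device Demote block, falling back to defaults."""
    raw = device.get("Demote")
    if raw is None:
        return DemoteOptions()
    # Assumption: missing keys inside the block fall back individually.
    return DemoteOptions(
        failure_threshold=raw.get("FailureThreshold", 3),
        demote_for_ms=raw.get("DemoteForMs", 30_000),
        enabled=raw.get("Enabled", True),
    )
```

For example, a device entry with `"Demote": {"Enabled": false}` would resolve to a threshold of 3 and a 30 s window with the auto-throttle switched off.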
### Recovery
Three ways out of Demoted, in order of likelihood:
1. **Probe success** — the per-device probe loop (`Probe.Enabled = true`,
default address `S:0`) is the fast path. The next probe iteration after
demotion will exercise the wire; on success it clears
`DemotedUntilUtc` immediately and transitions the host to `Running`.
2. **Window expiry** — once `DemoteForMs` elapses the demote marker
clears on the next read attempt. The read goes through; if it fails,
the failure tally keeps counting from where it left off (so a
permanently-down device re-arms the window after one more consecutive
failure rather than having to repeat the full threshold).
3. **`ReinitializeAsync`** — clears `ConsecutiveFailures` +
`DemotedUntilUtc` outright. Cumulative `DemoteCount` survives.
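The three exits can be traced against a small state model. This is a Python sketch under stated assumptions: `DeviceDemoteState` and its method names are hypothetical, and the real bookkeeping lives in the driver's C# code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class DeviceDemoteState:
    """Hypothetical model of the per-device demote bookkeeping."""

    def __init__(self, threshold=3, demote_for=timedelta(seconds=30)):
        self.threshold = threshold
        self.demote_for = demote_for
        self.consecutive_failures = 0
        self.demoted_until: Optional[datetime] = None
        self.demote_count = 0  # cumulative, survives reinitialize()

    def is_demoted(self, now: datetime) -> bool:
        return self.demoted_until is not None and now < self.demoted_until

    def record_failure(self, now: datetime) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.is_demoted(now):
            self.demoted_until = now + self.demote_for
            self.demote_count += 1

    def record_probe_success(self) -> None:
        # Exit 1: a successful probe clears the window early.
        self.consecutive_failures = 0
        self.demoted_until = None

    def expire_if_due(self, now: datetime) -> None:
        # Exit 2: the marker clears on the next read attempt after expiry.
        # The failure tally is deliberately left untouched.
        if self.demoted_until is not None and now >= self.demoted_until:
            self.demoted_until = None

    def reinitialize(self) -> None:
        # Exit 3: clears the tally and the window, keeps demote_count.
        self.consecutive_failures = 0
        self.demoted_until = None
```

Because `expire_if_due` leaves `consecutive_failures` at or above the threshold, a single further failure after expiry re-arms the window in this model, which is the behaviour described for a permanently-down device.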
### Observability
`DemoteCount` is the headline counter — it bumps once per demotion event,
not per short-circuited read. A device that flaps every hour for a week
shows a `DemoteCount` of about 168 (24 × 7) at the end of it, which is the operator
signal you actually want.
`LastDemotedUtc` is the ISO-8601 UTC timestamp of the most recent
demotion. Bind it on a per-device tile alongside `DemoteCount` for
"flapping link" alerting.
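One way to turn two `DemoteCount` samples into a flapping-link signal is a simple rate calculation. This is a Python sketch, not part of the driver: `demotions_per_hour` is a hypothetical helper, and treating a lower second sample as a counter restart is an assumption based on the counters dropping on `ShutdownAsync` or a process restart.

```python
from datetime import datetime, timedelta, timezone


def demotions_per_hour(prev_count: int, prev_at: datetime,
                       cur_count: int, cur_at: datetime) -> float:
    """Average demotion rate between two samples of DemoteCount."""
    elapsed_hours = (cur_at - prev_at).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("samples must be in increasing time order")
    # Assumption: a lower value means the counter restarted from zero,
    # so only the demotions since the restart are counted.
    delta = cur_count - prev_count if cur_count >= prev_count else cur_count
    return delta / elapsed_hours
```

With samples one week apart and a count that grew from 0 to 168, the rate works out to one demotion per hour; an alert threshold on that rate is a site-specific choice.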
### Host-state surface
A demoted device reports `HostState.Demoted` (new in PR ablegacy-12
on `Core.Abstractions/IHostConnectivityProbe.cs`). Consumers that
predate the new value (the central `HostStatusPublisher`) safely treat
it as `Stopped` — no schema migration needed.
## Cross-references
- [`AbLegacyDiagnosticTags.cs`](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDiagnosticTags.cs)
— counter store + read short-circuit
- [`AbLegacyDriver.cs`](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDriver.cs)
— increment sites in `ReadAsync`, discovery emission in `DiscoverAsync`
— increment sites in `ReadAsync`, discovery emission in `DiscoverAsync`,
auto-demote bookkeeping in `RecordFailureAndMaybeDemote` + `ProbeLoopAsync`
- [`AbLegacy-Test-Fixture.md`](AbLegacy-Test-Fixture.md) — `AbLegacyDiagnosticsTests`
+ collision-rejection contract
+ `AbLegacyAutoDemoteTests` + collision-rejection contract
- [AB CIP `_System/` parallel](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipSystemTagSource.cs)
— same pattern with the CIP-specific six entries (incl. writeable
`_RefreshTagDb` trigger)

@@ -53,12 +53,31 @@ supplies a `FakeAbLegacyTag`.
counters: 5 reads (3 ok / 2 fail) → `RequestCount=5`, `ResponseCount=3`,
`ErrorCount=2`; `LastErrorCode` reflects the most recent libplctag status;
`RetryCount` increments per retry attempt beyond the first; counters reset
on `ReinitializeAsync`; discovery emits exactly 7 diagnostic variables per
device under `_Diagnostics/`; collision rejection at `InitializeAsync` for
user tags shadowing reserved names or `_Diagnostics/` addresses; the
`_Diagnostics/<host>/<name>` short-circuit returns the live snapshot through
`ReadAsync` without bumping `RequestCount`; two devices keep counters
independent.
on `ReinitializeAsync`; discovery emits the canonical diagnostic variables
per device under `_Diagnostics/` (now 9 with PR ablegacy-12); collision
rejection at `InitializeAsync` for user tags shadowing reserved names or
`_Diagnostics/` addresses; the `_Diagnostics/<host>/<name>` short-circuit
returns the live snapshot through `ReadAsync` without bumping
`RequestCount`; two devices keep counters independent.
- `AbLegacyAutoDemoteTests` — **PR ablegacy-12 / #255** auto-demote on comm
failure: 3 consecutive failures arm the demote window and surface
`HostState.Demoted`; subsequent reads short-circuit with
`BadCommunicationError` *without invoking libplctag* (verified via
`factory.Tags["N7:0"].ReadCount` not advancing); successful read resets
the consecutive-failure counter; failure-success-failure pattern doesn't
cross the threshold; `DemoteCount` + `LastDemotedUtc` surface via
`_Diagnostics/`; `Enabled=false` opts out (failures still count, demotion
never fires); `ReinitializeAsync` clears the active window but preserves
cumulative `DemoteCount`; cool-down expiry allows the next read through;
two devices in one driver — one faulty, one healthy — proves the faulty
side's demotion doesn't starve the healthy side; `BadNodeIdUnknown`
(terminal) does not count toward the comm-failure tally; DTO JSON
round-trip preserves `FailureThreshold` / `DemoteForMs` / `Enabled` at
the per-device level; `HostState.Demoted` enum value is wired through
`Core.Abstractions`. Companion integration test in
`tests/.../IntegrationTests/AbLegacyAutoDemoteTests.cs` runs the
two-device-one-unreachable scenario against a live ab_server fixture
using `127.0.0.1:1` as the unreachable peer.
- `RsLogixSymbolImportTests` — ablegacy-11 / #254 RSLogix CSV symbol-import parser:
canonical 8-row CSV (one row per N/F/B/L/ST/T/C/R) → 8 typed
`AbLegacyTagDefinition`s with the right `DataType`; header + comment-line