# AB Legacy diagnostic counters

Per-device diagnostic counters surface as auto-generated read-only OPC UA
variables under each device's synthetic `_Diagnostics/` folder. HMIs can bind
to them directly, without going through a separate diagnostics RPC. This
mirrors the AB CIP `_System/` pattern from PR abcip-4.3.

Closes #253 (PR ablegacy-10).

## The nine counters

Each device managed by the `AbLegacyDriver` exposes nine read-only nodes under
`AbLegacy/<host>/_Diagnostics/<name>`. The first seven shipped in PR ablegacy-10;
`DemoteCount` and `LastDemotedUtc` arrived with PR ablegacy-12 / #255 (auto-demote
on comm failure).

| Name | Type | Semantics |
|---|---|---|
| `RequestCount` | Int64 | Total `ReadAsync` requests issued against this device. One increment per non-diagnostic reference per call, success or failure. |
| `ResponseCount` | Int64 | Successful read responses. |
| `ErrorCount` | Int64 | Failed read responses (any non-Good status). |
| `RetryCount` | Int64 | Retry attempts beyond the first attempt, per the PR 9 retry loop. A single read with two retries adds two. |
| `LastErrorCode` | Int32 | Most recent libplctag status code on a failed read; `0` when no error has been seen since the last reset. |
| `LastErrorMessage` | String | Most recent libplctag error message on a failed read; empty when no error has been seen since the last reset. |
| `CommFailures` | Int64 | Count of read failures mapped to `BadCommunicationError`. Covers both transient libplctag exceptions and retry chains that ran out of attempts, so operators see a single "wire fell off" counter. |
| `DemoteCount` | Int64 | **PR ablegacy-12**: cumulative auto-demote events for this device. Bumps every time the driver crosses the consecutive-failure threshold and arms a fresh cool-down window. Cumulative across `ReinitializeAsync` (preserved through redeploys), so a flapping link surfaces as a steadily climbing counter. |
| `LastDemotedUtc` | String | **PR ablegacy-12**: ISO-8601 UTC timestamp of the most recent auto-demotion. Empty string when this device has never been demoted. |

**Address shape**: `_Diagnostics/<deviceHostAddress>/<name>`,
e.g. `_Diagnostics/ab://10.0.0.5/1,0/RequestCount`.

The `<deviceHostAddress>` segment is the canonical `ab://host[:port]/cip-path`
string from `AbLegacyDeviceOptions.HostAddress`. The browse path looks like
`AbLegacy/<deviceHostAddress>/_Diagnostics/<name>`: the same shape as a
user-config tag node, just under a reserved sibling folder.

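The two shapes above (driver address and browse path) are easy to mix up, so here is a minimal Python sketch that builds both from a host address and a counter name. The helper names are hypothetical and exist only for illustration; the namespace index `2` is taken from the CLI example in this document and may differ on your server.

```python
# Illustrative helpers only; these functions are not part of the driver.
DIAGNOSTIC_NAMES = (
    "RequestCount", "ResponseCount", "ErrorCount", "RetryCount",
    "LastErrorCode", "LastErrorMessage", "CommFailures",
    "DemoteCount", "LastDemotedUtc",
)


def diagnostic_address(host_address: str, name: str) -> str:
    """Driver-level address: _Diagnostics/<deviceHostAddress>/<name>."""
    if name not in DIAGNOSTIC_NAMES:
        raise ValueError(f"unknown diagnostic name: {name}")
    return f"_Diagnostics/{host_address}/{name}"


def diagnostic_browse_path(host_address: str, name: str) -> str:
    """OPC UA browse path: AbLegacy/<deviceHostAddress>/_Diagnostics/<name>."""
    if name not in DIAGNOSTIC_NAMES:
        raise ValueError(f"unknown diagnostic name: {name}")
    return f"AbLegacy/{host_address}/_Diagnostics/{name}"


def diagnostic_node_id(host_address: str, name: str, namespace_index: int = 2) -> str:
    """String NodeId in the form used by the CLI example (ns index assumed)."""
    return f"ns={namespace_index};s={diagnostic_browse_path(host_address, name)}"


if __name__ == "__main__":
    host = "ab://10.0.0.5/1,0"
    print(diagnostic_address(host, "RequestCount"))
    print(diagnostic_node_id(host, "RequestCount"))
```
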
## Reset behaviour

| Trigger | Effect |
|---|---|
| `ReinitializeAsync` | Every counter for every device resets to zero, and `LastErrorMessage` clears to empty. **PR ablegacy-12 exception:** `DemoteCount` and `LastDemotedUtc` survive the reinit so an operator redeploying mid-incident doesn't lose the flapping-link history. |
| `ShutdownAsync` | All counters drop with the device map (including `DemoteCount`). |
| Driver process restart | Counters start at zero. |
| Probe transition Stopped→Running | **No automatic reset.** Counters are cumulative across reconnect events so operators can spot intermittent links by watching `CommFailures` keep climbing. |
| Probe transition Demoted→Running | **PR ablegacy-12**: early-clear of the active demote window, but the cumulative `DemoteCount` stays put. |

There is no in-process "reset" RPC at the time of writing. If you need to
clear counters without a redeploy, kick a `ReinitializeAsync` from the Admin
RPC surface. The driver re-runs `EnsureDevice` for each host, so the freshly
registered counters start at zero (except `DemoteCount` and `LastDemotedUtc`,
which survive as noted in the table).

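The reset rules in the table can be modelled in a few lines of Python. This is a sketch of the documented behaviour, not the driver's own code; `DeviceCounters`, `reinitialize`, and `shutdown` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DeviceCounters:
    """Toy model of one device's nine diagnostic values."""
    request_count: int = 0
    response_count: int = 0
    error_count: int = 0
    retry_count: int = 0
    last_error_code: int = 0
    last_error_message: str = ""
    comm_failures: int = 0
    demote_count: int = 0
    last_demoted_utc: str = ""


def reinitialize(old: DeviceCounters) -> DeviceCounters:
    """Model of ReinitializeAsync: everything resets except the demote history."""
    return DeviceCounters(
        demote_count=old.demote_count,
        last_demoted_utc=old.last_demoted_utc,
    )


def shutdown(devices: Dict[str, DeviceCounters]) -> None:
    """Model of ShutdownAsync: the whole device map is dropped."""
    devices.clear()


if __name__ == "__main__":
    before = DeviceCounters(request_count=120, comm_failures=4, demote_count=1,
                            last_demoted_utc="2024-01-01T00:00:00Z")
    print(reinitialize(before))
```
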
## What does *not* increment counters

Reads against `_Diagnostics/<host>/<name>` are **driver-local observability**,
not field traffic: they short-circuit before the libplctag dispatch and do
**not** increment `RequestCount` or any other counter. Otherwise a 1 Hz HMI poll
of `RequestCount` would make the counter chase its own tail.

Writes against `_Diagnostics/*` are rejected with `BadNotWritable` because
every diagnostic node is `SecurityClassification.ViewOnly`, so a misbehaving
SCADA template can't accidentally clobber the diagnostic surface.

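To make the short-circuit concrete, the following Python sketch models a device whose read path checks for the `_Diagnostics/` prefix before touching the (simulated) field layer. `DeviceModel` and its in-memory `field_values` dictionary are stand-ins invented for this example; the real driver dispatches to libplctag and returns OPC UA status codes rather than plain strings.

```python
DIAGNOSTICS_PREFIX = "_Diagnostics/"


class DeviceModel:
    """Toy device: a dict plays the role of the PLC, counters are plain ints."""

    def __init__(self, field_values):
        self.field_values = dict(field_values)
        self.counters = {"RequestCount": 0, "ResponseCount": 0, "ErrorCount": 0}

    def read(self, address):
        # Diagnostic reads are answered locally and never touch the counters.
        if address.startswith(DIAGNOSTICS_PREFIX):
            name = address.rsplit("/", 1)[-1]
            if name not in self.counters:
                return ("BadNodeIdUnknown", None)
            return ("Good", self.counters[name])
        # Field reads count as one request, success or failure.
        self.counters["RequestCount"] += 1
        if address in self.field_values:
            self.counters["ResponseCount"] += 1
            return ("Good", self.field_values[address])
        self.counters["ErrorCount"] += 1
        return ("Bad", None)

    def write(self, address, value):
        # Every diagnostic node is view-only, so writes are refused outright.
        if address.startswith(DIAGNOSTICS_PREFIX):
            return "BadNotWritable"
        self.field_values[address] = value
        return "Good"


if __name__ == "__main__":
    dev = DeviceModel({"N7:0": 42})
    dev.read("N7:0")
    diag = "_Diagnostics/ab://10.0.0.5/1,0/RequestCount"
    print(dev.read(diag), dev.read(diag))  # the value does not grow between polls
```
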
## Collision with user tags

User-config tags must not shadow the nine reserved diagnostic names and
must not live under the synthetic `_Diagnostics/` folder. Both shapes are
rejected at `InitializeAsync` time with a clear `InvalidOperationException`:

- A tag named `RequestCount` (or any of the other eight reserved names) is
  rejected because it would silently never resolve at read time: the
  diagnostics short-circuit wins.
- A tag whose `Address` starts with `_Diagnostics/` is rejected because the
  whole prefix is owned by the auto-emitted counters.

Pick a different name (`SiteRequestCount`, `MachineRequestCount`) or a
different address path (real PCCC files like `N7:0`).

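If you generate tag configuration from a script, a pre-flight check along these lines can catch collisions before the driver rejects them. This is a sketch under assumptions: the reserved set is simply the names from the counter table, matching is assumed to be exact and case-sensitive, and `validate_user_tags` is a hypothetical helper that raises `ValueError` where the driver would throw `InvalidOperationException`.

```python
# Names taken from the counter table in this document; the driver's own
# reserved list is authoritative.
RESERVED_NAMES = frozenset({
    "RequestCount", "ResponseCount", "ErrorCount", "RetryCount",
    "LastErrorCode", "LastErrorMessage", "CommFailures",
    "DemoteCount", "LastDemotedUtc",
})
DIAGNOSTICS_PREFIX = "_Diagnostics/"


def validate_user_tags(tags):
    """Raise ValueError for the first (name, address) pair that collides."""
    for name, address in tags:
        if name in RESERVED_NAMES:
            raise ValueError(f"tag name '{name}' shadows a reserved diagnostic name")
        if address.startswith(DIAGNOSTICS_PREFIX):
            raise ValueError(f"tag address '{address}' lives under {DIAGNOSTICS_PREFIX}")


if __name__ == "__main__":
    validate_user_tags([("SiteRequestCount", "N7:0")])  # passes silently
    try:
        validate_user_tags([("RequestCount", "N7:0")])
    except ValueError as exc:
        print(exc)
```
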
## HMI binding examples

### OPC UA Client CLI

```powershell
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
  -u opc.tcp://localhost:4840 `
  -n "ns=2;s=AbLegacy/ab://10.0.0.5/1,0/_Diagnostics/RequestCount"
```

### AB Legacy CLI (driver-direct, no OPC UA layer)

```powershell
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Cli -- read `
  -g "ab://10.0.0.5/1,0" -P Slc500 `
  --address "_Diagnostics/RequestCount"
```

The driver-direct path lets you sanity-check the counter without standing up
an OPC UA server, which is useful when triaging a wire-level issue on the bench.

### Subscription pattern

Subscribe to all nine diagnostic nodes at a slow rate (e.g. 5–10 s) on a
long-lived overview dashboard, plus a faster rate (1 s) on `LastErrorMessage` /
`LastErrorCode` when actively debugging a flapping link. The diagnostics
short-circuit makes every read O(1), so fast polling of a counter costs
nothing beyond the OPC UA subscription bookkeeping.

## Auto-demote on comm failure (PR ablegacy-12 / #255)

When a device fails `FailureThreshold` consecutive reads or probes (3 by
default), the driver marks it **Demoted** for a configurable cool-down window.
Reads against a demoted device short-circuit with `BadCommunicationError`
*without invoking libplctag*. That is the whole point of the feature: one slow
PLC sharing the driver thread can't starve faster peers reading from healthy
hosts on the same `AbLegacyDriver` instance.

### Configuration

Per-device, optional. Leaving `Demote` unset (`null`) keeps the documented
defaults (auto-demote **enabled** with 3 failures / 30 s).

```jsonc
{
  "Devices": [
    {
      "HostAddress": "ab://10.0.0.5/1,0",
      "PlcFamily": "Slc500",
      "Demote": {
        "FailureThreshold": 3,   // default 3
        "DemoteForMs": 30000,    // default 30 s
        "Enabled": true          // default true
      }
    }
  ]
}
```

| Knob | Default | Notes |
|---|---|---|
| `FailureThreshold` | `3` | Consecutive comm failures before the device is demoted. A successful read or probe resets the tally. Terminal failures (`BadNodeIdUnknown`, `BadTypeMismatch`, …) **do not count** because they're config / decoder mismatches, not field outages. |
| `DemoteForMs` | `30000` (30 s) | Cool-down window. Reads while this is active short-circuit; a successful probe clears it early. |
| `Enabled` | `true` | Set to `false` to keep the diagnostic counters but skip the auto-throttle. The failure tally still ticks but never arms the cool-down. |

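As an illustration of how the defaults in the table apply, here is a small Python sketch that resolves the effective settings from a device entry shaped like the JSON above. `DemoteOptions` and `resolve_demote_options` are hypothetical names; in particular, falling back per knob when `Demote` is only partially specified is an assumption of this sketch, not something this document confirms about the driver.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class DemoteOptions:
    failure_threshold: int = 3      # documented default
    demote_for_ms: int = 30_000     # documented default (30 s)
    enabled: bool = True            # documented default


def resolve_demote_options(device_config: Mapping[str, Any]) -> DemoteOptions:
    """Return the effective demote settings for one device entry."""
    raw = device_config.get("Demote")
    if raw is None:
        # Missing or null section: all documented defaults apply.
        return DemoteOptions()
    # Assumption: omitted knobs fall back to their individual defaults.
    return DemoteOptions(
        failure_threshold=raw.get("FailureThreshold", 3),
        demote_for_ms=raw.get("DemoteForMs", 30_000),
        enabled=raw.get("Enabled", True),
    )


if __name__ == "__main__":
    device = {"HostAddress": "ab://10.0.0.5/1,0", "PlcFamily": "Slc500"}
    print(resolve_demote_options(device))
```
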
### Recovery

Three ways out of Demoted, in order of likelihood:

1. **Probe success**: the per-device probe loop (`Probe.Enabled = true`,
   default address `S:0`) is the fast path. The next probe iteration after
   demotion will exercise the wire; on success it clears
   `DemotedUntilUtc` immediately and transitions the host to `Running`.
2. **Window expiry**: once `DemoteForMs` elapses the demote marker
   clears on the next read attempt. The read goes through; if it fails,
   the failure tally keeps counting from where it left off (so a
   permanently-down device re-arms the window after one more consecutive
   failure rather than having to repeat the full threshold).
3. **`ReinitializeAsync`**: clears `ConsecutiveFailures` and
   `DemotedUntilUtc` outright. Cumulative `DemoteCount` survives.

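The demote and recovery rules above amount to a small state machine, sketched here in Python. This is a model built from the prose, not the driver's implementation: `DemoteTracker` and its method names are invented for the example, time is passed in as integer milliseconds so the behaviour is deterministic, and the choice not to re-arm while a window is still active is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoteTracker:
    failure_threshold: int = 3
    demote_for_ms: int = 30_000
    enabled: bool = True
    consecutive_failures: int = 0
    demoted_until_ms: Optional[int] = None
    demote_count: int = 0

    def should_short_circuit(self, now_ms: int) -> bool:
        """True while the cool-down is active; expiry clears the marker only."""
        if self.demoted_until_ms is None:
            return False
        if now_ms >= self.demoted_until_ms:
            self.demoted_until_ms = None  # window expiry: failure tally is kept
            return False
        return True

    def record_failure(self, now_ms: int) -> None:
        """Count a comm failure and arm a fresh window at or past the threshold."""
        self.consecutive_failures += 1
        active = self.demoted_until_ms is not None and now_ms < self.demoted_until_ms
        if (self.enabled and not active
                and self.consecutive_failures >= self.failure_threshold):
            self.demoted_until_ms = now_ms + self.demote_for_ms
            self.demote_count += 1

    def record_success(self) -> None:
        """A successful read or probe resets the tally and clears the window."""
        self.consecutive_failures = 0
        self.demoted_until_ms = None

    def reinitialize(self) -> None:
        """Model of ReinitializeAsync: transient state clears, demote_count stays."""
        self.consecutive_failures = 0
        self.demoted_until_ms = None


if __name__ == "__main__":
    t = DemoteTracker()
    for now in (0, 10, 20):
        t.record_failure(now)
    print(t.should_short_circuit(1_000), t.demote_count)
```

Because the tally is not reset on window expiry, a device that is still down is demoted again by its very next failed read, which matches the "one more consecutive failure" behaviour described in item 2.
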
### Observability

`DemoteCount` is the headline counter: it bumps once per demotion event,
not per short-circuited read. A device that flaps every hour for a week
shows `DemoteCount` of roughly 168 (24 × 7) by the end of that week, which is
the operator signal you actually want.

`LastDemotedUtc` is the ISO-8601 UTC timestamp of the most recent
demotion. Bind it on a per-device tile alongside `DemoteCount` for
"flapping link" alerting.

### Host-state surface

A demoted device reports `HostState.Demoted` (new in PR ablegacy-12,
in `Core.Abstractions/IHostConnectivityProbe.cs`). Consumers that
predate the new value (the central `HostStatusPublisher`) safely treat
it as `Stopped`, so no schema migration is needed.

## Cross-references

- [`AbLegacyDiagnosticTags.cs`](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDiagnosticTags.cs):
  counter store + read short-circuit
- [`AbLegacyDriver.cs`](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDriver.cs):
  increment sites in `ReadAsync`, discovery emission in `DiscoverAsync`,
  auto-demote bookkeeping in `RecordFailureAndMaybeDemote` + `ProbeLoopAsync`
- [`AbLegacy-Test-Fixture.md`](AbLegacy-Test-Fixture.md): `AbLegacyDiagnosticsTests`
  + `AbLegacyAutoDemoteTests` + collision-rejection contract
- [AB CIP `_System/` parallel](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipSystemTagSource.cs):
  same pattern with the six CIP-specific entries (including the writable
  `_RefreshTagDb` trigger)