
# AB Legacy diagnostic counters
Per-device diagnostic counters surface as auto-generated read-only OPC UA
variables under each device's synthetic `_Diagnostics/` folder. HMIs can bind
directly without going through a separate diagnostics RPC. Mirrors the AB CIP
`_System/` pattern from PR abcip-4.3.
Closes #253 (PR ablegacy-10).
## The nine counters
Each device managed by the `AbLegacyDriver` exposes nine read-only nodes under
`AbLegacy/<host>/_Diagnostics/<name>`. The first seven shipped in PR ablegacy-10;
`DemoteCount` + `LastDemotedUtc` arrived with PR ablegacy-12 / #255 (auto-demote
on comm failure).
| Name | Type | Semantics |
|---|---|---|
| `RequestCount` | Int64 | Total `ReadAsync` requests issued against this device. One increment per non-diagnostic reference per call, success or failure. |
| `ResponseCount` | Int64 | Successful read responses. |
| `ErrorCount` | Int64 | Failed read responses (any non-Good status). |
| `RetryCount` | Int64 | Retry attempts beyond the first per the PR 9 retry loop. A single read with two retries adds two. |
| `LastErrorCode` | Int32 | Most recent libplctag status code on a failed read; `0` when no error has been seen since the last reset. |
| `LastErrorMessage` | String | Most recent libplctag error message on a failed read; empty when no error has been seen since the last reset. |
| `CommFailures` | Int64 | Count of read failures mapped to `BadCommunicationError`. Spans transient libplctag throws + retried-out chains so operators see a single "wire fell off" counter. |
| `DemoteCount` | Int64 | **PR ablegacy-12** — cumulative auto-demote events for this device. Bumps every time the driver crosses the consecutive-failure threshold and arms a fresh cool-down window. Cumulative across `ReinitializeAsync` (preserved through redeploys) so a flapping link surfaces as a steadily climbing counter. |
| `LastDemotedUtc` | String | **PR ablegacy-12** — ISO-8601 UTC timestamp of the most recent auto-demotion. Empty string when this device has never been demoted. |
**Address shape**: `_Diagnostics/<deviceHostAddress>/<name>`
e.g. `_Diagnostics/ab://10.0.0.5/1,0/RequestCount`.
The `<deviceHostAddress>` segment is the canonical `ab://host[:port]/cip-path`
string from `AbLegacyDeviceOptions.HostAddress`. The browse path looks like
`AbLegacy/<deviceHostAddress>/_Diagnostics/<name>` — the same shape as a
user-config tag node, just under a reserved sibling folder.
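
The two shapes above can be sketched as string builders. This is an illustrative Python model, not the driver's C# code; the function names are hypothetical. Because `<deviceHostAddress>` itself contains `/` characters, the reverse lookup matches known hosts by prefix instead of splitting on `/`, which is an assumption about how such an address could be resolved.

```python
DIAGNOSTICS_FOLDER = "_Diagnostics"
DRIVER_ROOT = "AbLegacy"  # root folder name as it appears in the browse paths above


def diagnostic_address(device_host_address: str, name: str) -> str:
    """Driver-side address: _Diagnostics/<deviceHostAddress>/<name>."""
    return f"{DIAGNOSTICS_FOLDER}/{device_host_address}/{name}"


def diagnostic_browse_path(device_host_address: str, name: str) -> str:
    """Browse path: AbLegacy/<deviceHostAddress>/_Diagnostics/<name>."""
    return f"{DRIVER_ROOT}/{device_host_address}/{DIAGNOSTICS_FOLDER}/{name}"


def split_diagnostic_address(address: str, known_hosts):
    """Return (host, counter_name), or None if the address is not a diagnostic one.

    The host segment contains slashes (ab://host/cip-path), so each known
    host is tried as a prefix rather than splitting the string on '/'.
    """
    prefix = DIAGNOSTICS_FOLDER + "/"
    if not address.startswith(prefix):
        return None
    rest = address[len(prefix):]
    for host in known_hosts:
        if rest.startswith(host + "/"):
            return host, rest[len(host) + 1:]
    return None
```

For example, `diagnostic_address("ab://10.0.0.5/1,0", "RequestCount")` yields the address shown in the example above.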
## Reset behaviour
| Trigger | Effect |
|---|---|
| `ReinitializeAsync` | Every counter for every device resets to zero, plus `LastErrorMessage` clears to empty. **PR ablegacy-12 exception:** `DemoteCount` + `LastDemotedUtc` survive the reinit so an operator redeploying mid-incident doesn't lose the flapping-link history. |
| `ShutdownAsync` | All counters drop with the device map (including `DemoteCount`). |
| Driver process restart | Counters start at zero. |
| Probe transition Stopped→Running | **No automatic reset** — counters are cumulative across reconnect events so operators can spot intermittent links by watching `CommFailures` keep climbing. |
| Probe transition Demoted→Running | **PR ablegacy-12** — early-clear of the active demote window, but the cumulative `DemoteCount` stays put. |
There is no in-process "reset" RPC at the time of writing. If you need to
clear counters without a redeploy, trigger a `ReinitializeAsync` from the Admin
RPC surface: the driver re-runs `EnsureDevice` for each host, so the freshly
registered counters start at zero. `DemoteCount` and `LastDemotedUtc` are the
exception; they survive the reinit, as noted in the table above.
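
The reset rules can be modelled in a few lines. This is an illustrative Python sketch, not the driver's implementation; `DeviceCounters`, `reinitialize` and `shutdown` are hypothetical names standing in for the per-device counter store and the two lifecycle calls.

```python
from dataclasses import dataclass


@dataclass
class DeviceCounters:
    request_count: int = 0
    response_count: int = 0
    error_count: int = 0
    retry_count: int = 0
    last_error_code: int = 0
    last_error_message: str = ""
    comm_failures: int = 0
    demote_count: int = 0
    last_demoted_utc: str = ""


def reinitialize(devices: dict) -> dict:
    """Model of ReinitializeAsync: fresh counters per host, demote history kept."""
    return {
        host: DeviceCounters(
            demote_count=old.demote_count,
            last_demoted_utc=old.last_demoted_utc,
        )
        for host, old in devices.items()
    }


def shutdown(devices: dict) -> None:
    """Model of ShutdownAsync: the whole device map is dropped."""
    devices.clear()
```

Carrying only the two demote fields across `reinitialize` mirrors the PR ablegacy-12 exception in the table; everything else starts from its default.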
## What does *not* increment counters
Reads against `_Diagnostics/<host>/<name>` are **driver-local observability**,
not field traffic — they short-circuit before the libplctag dispatch and do
NOT increment `RequestCount` or any other counter. Otherwise a 1 Hz HMI poll
of `RequestCount` would make the counter chase its own tail.
Writes against `_Diagnostics/*` are rejected with `BadNotWritable` because
every diagnostic node is `SecurityClassification.ViewOnly` — a misbehaving
SCADA template can't accidentally clobber the diagnostic surface.
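
A minimal model of the short-circuit, assuming the check is a simple prefix test on the address. The class and method names are hypothetical and the field read is injected as a plain callable, so this is a sketch of the behaviour described above rather than the driver's dispatch code.

```python
DIAG_PREFIX = "_Diagnostics/"


class DeviceModel:
    def __init__(self, field_read):
        self._field_read = field_read  # stands in for the libplctag dispatch
        self.counters = {"RequestCount": 0, "ResponseCount": 0, "ErrorCount": 0}

    def read(self, address):
        if address.startswith(DIAG_PREFIX):
            # Driver-local read: answer from the counter store and return
            # before any counter is touched.
            name = address.rsplit("/", 1)[-1]
            return self.counters[name]
        self.counters["RequestCount"] += 1
        try:
            value = self._field_read(address)
        except OSError:
            self.counters["ErrorCount"] += 1
            raise
        self.counters["ResponseCount"] += 1
        return value

    def write(self, address, value):
        if address.startswith(DIAG_PREFIX):
            return "BadNotWritable"  # every diagnostic node is view-only
        raise NotImplementedError("field writes are outside this sketch")
```

Polling `RequestCount` any number of times leaves it unchanged in this model; only reads that reach `field_read` move the counters.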
## Collision with user tags
User-config tags must not shadow the seven reserved diagnostic names and
must not live under the synthetic `_Diagnostics/` folder. Both shapes are
rejected at `InitializeAsync` time with a clear `InvalidOperationException`:
- A tag named `RequestCount` (or any of the other six reserved names) is
rejected because it would silently never resolve at read time — the
diagnostics short-circuit wins.
- A tag whose `Address` starts with `_Diagnostics/` is rejected because the
whole prefix is owned by the auto-emitted counters.
Pick a different name (`SiteRequestCount`, `MachineRequestCount`) or a
different address path (real PCCC files like `N7:0`).
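
The two rejection rules can be sketched as a validation pass. This Python model is illustrative only: `ValueError` stands in for the driver's `InvalidOperationException`, the `Name` key and exact-match comparison are assumptions, and the reserved set holds the seven names from PR ablegacy-10 because this section does not say whether `DemoteCount` and `LastDemotedUtc` are also reserved.

```python
# Assumption: the seven names shipped in PR ablegacy-10.
RESERVED_NAMES = frozenset({
    "RequestCount", "ResponseCount", "ErrorCount", "RetryCount",
    "LastErrorCode", "LastErrorMessage", "CommFailures",
})
DIAG_PREFIX = "_Diagnostics/"


def validate_user_tags(tags):
    """Raise ValueError on the first tag that collides with the diagnostics."""
    for tag in tags:
        if tag["Name"] in RESERVED_NAMES:
            raise ValueError(
                f"tag name '{tag['Name']}' shadows a reserved diagnostic name"
            )
        if tag["Address"].startswith(DIAG_PREFIX):
            raise ValueError(
                f"tag address '{tag['Address']}' is under the reserved "
                f"{DIAG_PREFIX} prefix"
            )
```

A tag such as `{"Name": "SiteRequestCount", "Address": "N7:0"}` passes both checks, matching the renaming advice above.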
## HMI binding examples
### OPC UA Client CLI
```powershell
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://localhost:4840 `
-n "ns=2;s=AbLegacy/ab://10.0.0.5/1,0/_Diagnostics/RequestCount"
```
### AB Legacy CLI (driver-direct, no OPC UA layer)
```powershell
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Cli -- read `
-g "ab://10.0.0.5/1,0" -P Slc500 `
--address "_Diagnostics/RequestCount"
```
The driver-direct path lets you sanity-check the counter without standing up
an OPC UA server — useful when triaging a wire-level issue on the bench.
### Subscription pattern
Subscribe to all nine diagnostic nodes at a slow rate (e.g. 5 to 10 s) on a
long-lived overview dashboard, plus a faster rate (1 s) on `LastErrorMessage` /
`LastErrorCode` when actively debugging a flapping link. The diagnostics
short-circuit makes every read O(1), so fast polling of a counter costs
nothing on the wire; the only overhead is the OPC UA subscription bookkeeping.
## Auto-demote on comm failure (PR ablegacy-12 / #255)
When a device fails N consecutive reads or probes the driver marks it
**Demoted** for a configurable cool-down window. Reads against a demoted
device short-circuit with `BadCommunicationError` *without invoking
libplctag* — that's the whole point of the feature: one slow PLC sharing
the driver thread can't starve faster peers reading from healthy hosts on
the same `AbLegacyDriver` instance.
### Configuration
Per-device, optional. `null` keeps the documented defaults (auto-demote
**enabled** with 3 failures / 30 s).
```jsonc
{
"Devices": [
{
"HostAddress": "ab://10.0.0.5/1,0",
"PlcFamily": "Slc500",
"Demote": {
"FailureThreshold": 3, // default 3
"DemoteForMs": 30000, // default 30s
"Enabled": true // default true
}
}
]
}
```
| Knob | Default | Notes |
|---|---|---|
| `FailureThreshold` | `3` | Consecutive comm failures before the device is demoted. A successful read or probe resets the tally. Terminal failures (`BadNodeIdUnknown`, `BadTypeMismatch`, …) **do not count** — they're config / decoder mismatches, not field outages. |
| `DemoteForMs` | `30000` (30s) | Cool-down window. Reads while this is active short-circuit; a successful probe clears it early. |
| `Enabled` | `true` | Set to `false` to keep the diagnostic counters but skip the auto-throttle. The failure tally still ticks but never arms the cool-down. |
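
How the defaults could be resolved from a device entry is sketched below. This is an illustrative Python model with hypothetical names; it assumes that a partially filled `Demote` object falls back per key to the defaults in the table, which the text above states only for the `null` case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoteOptions:
    failure_threshold: int = 3
    demote_for_ms: int = 30000
    enabled: bool = True


def demote_options_from_config(device: dict) -> DemoteOptions:
    """Resolve the Demote block of one device entry (already parsed JSON)."""
    raw = device.get("Demote")
    if raw is None:
        # A null or missing block keeps the documented defaults.
        return DemoteOptions()
    defaults = DemoteOptions()
    return DemoteOptions(
        failure_threshold=raw.get("FailureThreshold", defaults.failure_threshold),
        demote_for_ms=raw.get("DemoteForMs", defaults.demote_for_ms),
        enabled=raw.get("Enabled", defaults.enabled),
    )
```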
### Recovery
Three ways out of Demoted, in order of likelihood:
1. **Probe success** — the per-device probe loop (`Probe.Enabled = true`,
default address `S:0`) is the fast path. The next probe iteration after
demotion will exercise the wire; on success it clears
`DemotedUntilUtc` immediately and transitions the host to `Running`.
2. **Window expiry** — once `DemoteForMs` elapses the demote marker
clears on the next read attempt. The read goes through; if it fails,
the failure tally keeps counting from where it left off (so a
permanently-down device re-arms the window after one more consecutive
failure rather than having to repeat the full threshold).
3. **`ReinitializeAsync`** — clears `ConsecutiveFailures` +
`DemotedUntilUtc` outright. Cumulative `DemoteCount` survives.
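
The demote and recovery rules can be read as a small state machine. The sketch below is an illustrative Python model, not the driver's `RecordFailureAndMaybeDemote` code; the class and method names are hypothetical. One labelled assumption: a failure recorded while a window is still active (for example a failed probe) does not re-arm the window or bump `DemoteCount`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DemoteState:
    failure_threshold: int = 3
    demote_for: timedelta = timedelta(milliseconds=30000)
    enabled: bool = True
    consecutive_failures: int = 0
    demoted_until_utc: Optional[datetime] = None
    demote_count: int = 0
    last_demoted_utc: str = ""

    def should_short_circuit(self, now: datetime) -> bool:
        """Called before a read; True means answer BadCommunicationError."""
        if self.demoted_until_utc is None:
            return False
        if now >= self.demoted_until_utc:
            self.demoted_until_utc = None  # window expiry; the tally is kept
            return False
        return True

    def record_success(self) -> None:
        """Successful read or probe: reset the tally, clear any window early."""
        self.consecutive_failures = 0
        self.demoted_until_utc = None

    def record_comm_failure(self, now: datetime) -> None:
        """Comm failure only; terminal failures must not call this."""
        self.consecutive_failures += 1
        if (self.enabled
                and self.demoted_until_utc is None  # assumption: no re-arm mid-window
                and self.consecutive_failures >= self.failure_threshold):
            self.demoted_until_utc = now + self.demote_for
            self.demote_count += 1
            self.last_demoted_utc = now.isoformat()

    def reinitialize(self) -> None:
        """Clears the tally and the window; the cumulative count survives."""
        self.consecutive_failures = 0
        self.demoted_until_utc = None
```

Because window expiry leaves `consecutive_failures` at or above the threshold, the first failed read after expiry re-arms the window immediately, which is the behaviour described in recovery path 2.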
### Observability
`DemoteCount` is the headline counter — it bumps once per demotion event,
not per short-circuited read. A device that flaps every hour for a week
shows `DemoteCount = ~168` on Friday afternoon, which is the operator
signal you actually want.
`LastDemotedUtc` is the ISO-8601 UTC timestamp of the most recent
demotion. Bind it on a per-device tile alongside `DemoteCount` for
"flapping link" alerting.
### Host-state surface
A demoted device reports `HostState.Demoted` (new in PR ablegacy-12
on `Core.Abstractions/IHostConnectivityProbe.cs`). Consumers that
predate the new value (the central `HostStatusPublisher`) safely treat
it as `Stopped` — no schema migration needed.
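
One way a consumer written before the new value could end up treating `Demoted` as `Stopped` is a catch-all branch. The Python sketch below is purely hypothetical: it does not reproduce `HostStatusPublisher`, and the enum lists only the three states named in this document.

```python
from enum import Enum


class HostState(Enum):
    RUNNING = "Running"
    STOPPED = "Stopped"
    DEMOTED = "Demoted"  # added in PR ablegacy-12


def legacy_status(state: HostState) -> str:
    """A consumer that only distinguishes Running from everything else."""
    if state is HostState.RUNNING:
        return "Running"
    # Any state the consumer does not know about falls through to Stopped.
    return "Stopped"
```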
## Cross-references
- [`AbLegacyDiagnosticTags.cs`](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDiagnosticTags.cs)
— counter store + read short-circuit
- [`AbLegacyDriver.cs`](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDriver.cs)
— increment sites in `ReadAsync`, discovery emission in `DiscoverAsync`,
auto-demote bookkeeping in `RecordFailureAndMaybeDemote` + `ProbeLoopAsync`
- [`AbLegacy-Test-Fixture.md`](AbLegacy-Test-Fixture.md) — `AbLegacyDiagnosticsTests`
+ `AbLegacyAutoDemoteTests` + collision-rejection contract
- [AB CIP `_System/` parallel](../../src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipSystemTagSource.cs)
— same pattern with the CIP-specific six entries (incl. writeable
`_RefreshTagDb` trigger)