scadaproj/components/observability/spec/METRIC-CONVENTIONS.md
Joseph Doherty 544a6ddb77 Fix all baseline code-review findings across the six shared libraries
Resolves the 35 findings from the 2026-06-01 baseline (commit 26ba1c7),
test-first for every behavioral change. +51 tests (331 -> 382 passing, 0 failed).

- Telemetry-001 (HIGH): RedactionEnricher now honours property removal, so a
  redactor that drops a key actually scrubs the secret from the event.
- Auth: LDAP validator ValidateOnStart; API-key verify no longer fails on a
  best-effort MarkUsed write or a corrupt scopes column (fail-closed); LDAP cert
  validation hook; KeyPrefix persistence aligned; README algorithm corrected.
- Health: Akka checks return Degraded (not throw) when the cluster isn't up yet;
  GrpcDependencyHealthCheck catch-all; null 'description' rendered; composite
  endpoint builder; XML docs shipped.
- Audit: CompositeAuditWriter no longer re-throws OperationCanceledException;
  TruncatingAuditRedactor over-redact scrubs Target + safe negative max; options
  record; XML docs shipped.
- Configuration: TryAddEnumerable idempotent registration; consistent port
  quoting; strict invariant port parsing; XML docs + README packaged.
- Theme: mobile toggle is now CSS-only (no Bootstrap JS); token/CSS hygiene;
  XML docs on the public parameter surface.

Shared-contract/spec docs updated where the code was the source of truth
(observability service.instance.id, MapZbMetrics, redactor reach). All changes
additive/back-compatible at v0.1.0. code-reviews bookkeeping follows separately.
2026-06-01 11:22:14 -04:00


# Observability — Metric conventions (standardized)
Status: **Standardized**. The naming and unit rules every sister project's instruments must
follow. Analogous to [`../../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md)
for auth and [`../../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md)
for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).
The per-project instrument tables below (§6) document the **existing bespoke surface**: the
instruments each app currently defines or intends to define. These stay per-project; they are
not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must
be named and measured.
---
## 1. Meter name
Each app owns exactly **one primary Meter**, named after its root namespace:
| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target — rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |
`MxGateway.Server` is the single convergence item for meter naming. It predates the
`ZB.MOM.WW.*` namespace convention; rename when adopting `AddZbTelemetry`. Instruments
emitted under the old name will require a `recording_rule` or relabel in any Prometheus
config that already scrapes the snapshot — coordinate before renaming in production.
If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: `ZB.MOM.WW.<App>.<Component>`.
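
The meter-name pattern can be checked mechanically. The following is a minimal sketch, not part of the shared library: the helper name `is_conforming_meter_name` is hypothetical, and the regular expression assumes that `<App>` and `<Component>` are single PascalCase identifiers, which this spec implies but does not state.

```python
import re

# Assumption: <App> and <Component> are single PascalCase identifiers.
# The spec only gives the shape ZB.MOM.WW.<App>[.<Component>].
METER_NAME = re.compile(r"ZB\.MOM\.WW\.[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)?")


def is_conforming_meter_name(name: str) -> bool:
    """Return True when the name follows ZB.MOM.WW.<App>[.<Component>]."""
    return METER_NAME.fullmatch(name) is not None


# The three meter names from the table above.
for candidate in ("ZB.MOM.WW.OtOpcUa", "MxGateway.Server", "ZB.MOM.WW.ScadaBridge"):
    print(candidate, is_conforming_meter_name(candidate))
```

Under these assumptions, `MxGateway.Server` is the only name in the table that fails the check, which matches its status as the convergence item.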
---
## 2. Instrument name
Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case,
dot-separated:
```
<app> := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event> := what happened or is measured — applied | count | duration | errors | active | ...
```
**Examples:**
| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.deploy.apply.duration` | OtOpcUa | End-to-end deploy apply duration |
| `mxgateway.sessions.open` | MxGateway | Currently open MXAccess sessions |
| `mxgateway.commands.duration` | MxGateway | End-to-end MXAccess command latency |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |
**Rules:**
1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
subsystem warrants a sub-area (e.g. `otopcua.deploy.apply.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
`duration`), not implementation details (`method_called`, `loop_iteration`).
4. Counters: past-tense or noun (`received`, `errors`, `applied`).
UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`).
Histograms: `duration` or a measured quantity noun (`size`, `lag`).
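
Rules 1 and 2 lend themselves to an automated lint. The sketch below is illustrative only: `instrument_name_problems` is a hypothetical helper, and the per-segment character set (`a-z`, digits, underscore) is an assumption inferred from existing names such as `otopcua.redundancy.service_level_change`. Rules 3 and 4 concern meaning and are left to review.

```python
import re

# Assumption: a segment starts with a letter and may contain digits and
# underscores; the spec itself only forbids upper-case letters and hyphens.
SEGMENT = re.compile(r"[a-z][a-z0-9_]*")


def instrument_name_problems(name: str) -> list[str]:
    """Return the naming-rule violations found in an instrument name."""
    problems = []
    if name != name.lower():
        problems.append("must be all lower-case")
    if "-" in name:
        problems.append("must not contain hyphens")
    segments = name.split(".")
    if len(segments) < 3:
        problems.append("needs at least three dot-separated segments")
    # Case is reported separately above, so compare lower-cased segments here.
    if any(SEGMENT.fullmatch(s.lower()) is None for s in segments):
        problems.append("each segment must start with a letter and use only a-z, 0-9, _")
    return problems
```

An empty list means the name passes the mechanical checks. A name with only two segments, such as `mxgateway.faults`, is reported as too short.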
---
## 3. Units
### Duration — seconds (mandatory)
**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic
convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.
> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated
> to `"s"` on adoption, and the recorded values converted with them (divide by 1 000 at the
> call site). Track the matching changes to existing Prometheus recording rules and
> dashboards: any panel that reads these histograms in `ms` will need updating. Until
> migration is complete, annotate the instruments with `// CONVERGENCE: ms→s pending`.
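
The call-site conversion is a plain division by 1 000. The sketch below shows it in Python for illustration only; the real change belongs in `GatewayMetrics.cs`. `ms_to_seconds` and `timed_seconds` are hypothetical helper names. The second helper shows the alternative of measuring in seconds from the start, so that no conversion is needed.

```python
import time


def ms_to_seconds(milliseconds: float) -> float:
    """Convert a duration in milliseconds to seconds before recording it."""
    return milliseconds / 1000.0


def timed_seconds(operation):
    """Run operation() and return (result, elapsed seconds)."""
    start = time.perf_counter()  # perf_counter() already reports seconds
    result = operation()
    return result, time.perf_counter() - start


# A value previously recorded as 250 (ms) must now be recorded as 0.25 (s).
print(ms_to_seconds(250))
```

If a histogram has explicitly configured bucket boundaries, those are expressed in the same unit as the recorded values and would need the same division.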
### Other units
| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory — see above |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |
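
The unit table can be turned into a small check that runs alongside a name lint. This is a sketch under stated assumptions: `unit_problems` is a hypothetical helper, it treats any instrument whose name ends in `.duration` as a duration instrument, and it accepts only the unit strings listed in the table, whereas a real project may allow further UCUM units.

```python
import re
from typing import Optional

# UCUM annotation form for dimensioned counts, e.g. "{message}", "{request}".
ANNOTATION = re.compile(r"\{[a-z_]+\}")


def unit_problems(instrument_name: str, unit: Optional[str]) -> list[str]:
    """Return unit-rule violations for one instrument (empty list if none)."""
    problems = []
    if instrument_name.endswith(".duration"):
        # Durations are mandatory seconds.
        if unit != "s":
            problems.append('duration instruments must use unit "s"')
    elif unit not in (None, "", "1", "By") and ANNOTATION.fullmatch(unit) is None:
        # Assumption: only the units listed in the table are recognised.
        problems.append(f"unrecognised unit {unit!r}")
    return problems
```

Applied to the MxGateway histograms as they stand today, the check reports the `"ms"` unit on each of them.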
---
## 4. Resource attribute set (shared across all three signals)
The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and
attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.
| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"` — do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `service.instance.id` | string | Auto | Always `ZbResource.InstanceId` = deterministic `MachineName:ProcessId`. The OTel SDK random-GUID default is disabled so every signal from one process shares one restart-stable instance id (cross-signal correlation); never override |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |
**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. `host.name` tells the hosts apart but says nothing
about which site a node belongs to or whether it is a site or central node; without `site.id`
and `node.role`, metrics cannot be grouped or filtered along those dimensions.
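
The attribute set in the table can be pictured as a plain key/value map. The sketch below is a Python analogue for illustration, not the `ZbResource` implementation: `build_resource_attributes` is a hypothetical function, `socket.gethostname()` and `os.getpid()` stand in for `Environment.MachineName` and the process id, and `plant-7` is an invented site id.

```python
import os
import socket


def build_resource_attributes(service_name, service_version=None,
                              site_id=None, node_role=None) -> dict:
    """Assemble the shared resource attributes as a flat dict."""
    host = socket.gethostname()  # stand-in for Environment.MachineName
    attrs = {
        "service.name": service_name,
        "service.namespace": "ZB.MOM.WW",  # fixed; never overridden
        "service.instance.id": f"{host}:{os.getpid()}",  # MachineName:ProcessId
        "host.name": host,
    }
    # Recommended attributes are omitted when unset ("absent is better than wrong").
    optional = {
        "service.version": service_version,
        "site.id": site_id,
        "node.role": node_role,
    }
    attrs.update({key: value for key, value in optional.items() if value})
    return attrs


# Hypothetical site node in a ScadaBridge fleet.
print(build_resource_attributes("scadabridge", site_id="plant-7", node_role="site"))
```

Because the map is built once and attached to every signal, a backend can join a metric, a span, and a log line on `service.instance.id` and then group by `site.id` or `node.role`.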
---
## 5. Standard instrumentation baseline
Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are
community-standard instrumentation packages; the overhead is negligible and the benefit
(correlated HTTP / gRPC request traces across the fleet) is high.
| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| HttpClient | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client | Traces | ⛔ not added | ⛔ not added | n/a |
| .NET runtime | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Process | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
All three projects lack standard instrumentation today — it is added automatically when each
project calls `AddZbTelemetry` (Gap S1 in `GAPS.md`). No project removes any of these once
wired.
---
## 6. Per-app instrument surface (bespoke — stays per project)
These instruments are **not part of the shared library**. They document the existing bespoke
surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`.
### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter
Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
(Code-verified 2026-06-01 — see `current-state/otopcua/CURRENT-STATE.md`.)
**Counters (6):**
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | — | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.driver.lifecycle` | Counter | — | Driver lifecycle events (start / stop / restart) |
| `otopcua.virtualtag.eval` | Counter | — | Virtual tag evaluations |
| `otopcua.scriptedalarm.transition` | Counter | — | Scripted alarm state transitions |
| `otopcua.opcua.sink.write` | Counter | — | OPC UA sink write operations |
| `otopcua.redundancy.service_level_change` | Counter | — | Redundancy service-level changes |
**Histograms (1):**
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.apply.duration` | Histogram | `s` | End-to-end deploy apply duration |
**ActivitySources (spans):**
| Source name | Spans |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild` |
All durations use `"s"` — no unit convergence item for OtOpcUa.
### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)
Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
(Code-verified 2026-06-01 — see `current-state/mxaccessgw/CURRENT-STATE.md`.)
**Counters (13):**
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.sessions.opened` | `"1"` | New session requests |
| `mxgateway.sessions.closed` | `"1"` | Sessions torn down |
| `mxgateway.commands.started` | `"1"` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | `"1"` | Command completed OK |
| `mxgateway.commands.failed` | `"1"` | Command error |
| `mxgateway.events.received` | `"1"` | MXAccess events received from worker |
| `mxgateway.queues.overflows` | `"1"` | Queue overflow (backpressure) |
| `mxgateway.faults` | `"1"` | Unhandled gateway faults |
| `mxgateway.workers.killed` | `"1"` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | `"1"` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | `"1"` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | `"1"` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | `"1"` | Retry attempts (any subsystem) |
**Histograms (3) — current unit `ms` (convergence target `s`):**
| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.workers.startup.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.commands.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.events.stream_send.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
**Observable gauges (4):**
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.sessions.open` | `"1"` | Currently open sessions (live count) |
| `mxgateway.workers.running` | `"1"` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | `"1"` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | `"1"` | Per-stream gRPC send queue depth |
No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource
is left per-project (deferred to GAPS backlog).
### 6.3 ScadaBridge — `ZB.MOM.WW.ScadaBridge` meter
No meter or instruments exist today (`OpenTelemetry.Api` is a dangling ref). The target
meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the
ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).
---
## Consequences and convergence items (accepted)
| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge; define from scratch |
All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.