scadaproj/components/observability/spec/METRIC-CONVENTIONS.md
Joseph Doherty 7d243890ed docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract
Author the three normalization docs for the observability component:
- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project),
  AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline,
  exporter conventions, Serilog two-stage bootstrap with identity enrichers and
  TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and
  acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app
  namespace; MxGateway.Server flagged as convergence target), instrument naming pattern
  (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms
  flagged), Resource attribute set table, standard instrumentation baseline, and per-app
  instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms
  / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two
  packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder +
  IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog,
  ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher.
  Consumer matrix and open contract questions included.
2026-06-01 07:19:38 -04:00

# Observability — Metric conventions (standardized)
Status: **Standardized**. The naming and unit rules every sister project's instruments must
follow. Analogous to [`../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md)
for auth and [`../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md)
for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).
The per-project instrument tables below (§6) document the **existing bespoke surface**: the
instruments each app currently defines or intends to define. These stay per-project; they are
not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must
be named and measured.
---
## 1. Meter name
Each app owns exactly **one primary Meter**, named after its root namespace:
| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target — rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |
`MxGateway.Server` is the single convergence item for meter naming. It predates the
`ZB.MOM.WW.*` namespace convention; rename when adopting `AddZbTelemetry`. Instruments
emitted under the old name will require a `recording_rule` or relabel in any Prometheus
config that already scrapes the snapshot — coordinate before renaming in production.
If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: `ZB.MOM.WW.<App>.<Component>`.
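
As a minimal sketch of the one-primary-Meter rule, an app could keep its meters in a single
static class. The class name `AppTelemetry` and the `Browse` component are hypothetical and
only illustrate the naming pattern; the meter API is the standard `System.Diagnostics.Metrics`.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical holder class; the real projects keep their meters in their own types.
public static class AppTelemetry
{
    // Primary meter: named after the app's root namespace.
    public const string MeterName = "ZB.MOM.WW.MxGateway";
    public static readonly Meter Primary = new(MeterName);

    // Secondary meter for a hypothetical library component:
    // ZB.MOM.WW.<App>.<Component>.
    public static readonly Meter Browse = new(MeterName + ".Browse");
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(AppTelemetry.Primary.Name); // ZB.MOM.WW.MxGateway
        Console.WriteLine(AppTelemetry.Browse.Name);  // ZB.MOM.WW.MxGateway.Browse
    }
}
```

Exposing the name as a constant lets the same string be passed to whatever registers the
meter with the telemetry pipeline, so the two cannot drift apart.
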
---
## 2. Instrument name
Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case,
dot-separated:
```
<app> := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event> := what happened or is measured — applied | count | duration | errors | active | ...
```
**Examples:**
| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions |
| `mxgateway.session.active` | MxGateway | Active MxAccess sessions |
| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |
**Rules:**
1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
`duration`), not implementation details (`method_called`, `loop_iteration`).
4. Counters: past-tense or noun (`received`, `errors`, `applied`).
UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`).
Histograms: `duration` or a measured quantity noun (`size`, `lag`).
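
Rules 1 and 2 are mechanical enough to check in a unit test. The sketch below is one possible
check, not part of the shared library. `InstrumentNames` is a hypothetical helper, and two
details are assumptions of this sketch rather than rules stated above: digits are allowed
after the first letter of a segment, and names are capped at four segments.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for checking instrument names in tests.
public static class InstrumentNames
{
    // Three or four dot-separated segments, each starting with a lower-case
    // letter. Assumption: digits are allowed, underscores and hyphens are not.
    private static readonly Regex Pattern =
        new(@"^[a-z][a-z0-9]*(\.[a-z][a-z0-9]*){2,3}\z", RegexOptions.Compiled);

    public static bool IsValid(string name) => Pattern.IsMatch(name);
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(InstrumentNames.IsValid("otopcua.deploy.applied"));         // True
        Console.WriteLine(InstrumentNames.IsValid("mxgateway.worker.call.duration")); // True
        Console.WriteLine(InstrumentNames.IsValid("MxGateway.Session.Active"));       // False (PascalCase)
        Console.WriteLine(InstrumentNames.IsValid("mxgateway.session-active"));       // False (hyphen)
        Console.WriteLine(InstrumentNames.IsValid("mxgateway.active"));               // False (two segments)
    }
}
```

Rules 3 and 4 concern meaning rather than syntax, so they still need review by a person.
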
---
## 3. Units
### Duration — seconds (mandatory)
**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic
convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.
> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated
> to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site).
> Track existing Prometheus `recording_rule`/dashboard changes — any dashboard panel that
> reads these histograms in `ms` will need updating. Until migration is complete, annotate
> the instruments with `// CONVERGENCE: ms→s pending`.
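
The migration touches two places: the unit string on the instrument and the value passed to
`Record`. The following is a sketch of what a converted call site might look like, not the
actual contents of `GatewayMetrics.cs`. The class and field names are hypothetical, and
`Stopwatch.GetElapsedTime` assumes .NET 7 or later.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical class; illustrates the ms -> s change only.
public static class DurationSketch
{
    private static readonly Meter GatewayMeter = new("ZB.MOM.WW.MxGateway");

    // The unit string is "s", so every recorded value must be in seconds.
    private static readonly Histogram<double> CommandDuration =
        GatewayMeter.CreateHistogram<double>("mxgateway.command.duration", unit: "s");

    public static void Main()
    {
        long start = Stopwatch.GetTimestamp();
        // ... the measured work would run here ...
        TimeSpan elapsed = Stopwatch.GetElapsedTime(start);

        // Before the migration the call site would have recorded
        // elapsed.TotalMilliseconds against a histogram with unit "ms".
        CommandDuration.Record(elapsed.TotalSeconds);

        // If only a millisecond value is available, divide by 1000.
        double legacyMilliseconds = 250.0;
        CommandDuration.Record(legacyMilliseconds / 1000.0); // records 0.25 s
    }
}
```

Changing the unit string without converting the value (or the reverse) produces data that
is wrong by a factor of 1 000, so both changes belong in the same commit.
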
### Other units
| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory — see above |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |
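
The unit strings in the table are passed as the `unit` argument when the instrument is
created. A small sketch follows; `scadabridge.alarm.received` comes from the examples in §2,
while the other two instrument names are hypothetical and exist only to show the `"By"` and
`"{request}"` forms.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class UnitSketch
{
    public static void Main()
    {
        using var meter = new Meter("ZB.MOM.WW.ScadaBridge");

        // Pure event count: "1" is preferred over omitting the unit.
        Counter<long> received =
            meter.CreateCounter<long>("scadabridge.alarm.received", unit: "1");

        // Hypothetical size histogram, measured in UCUM bytes.
        Histogram<long> payloadSize =
            meter.CreateHistogram<long>("scadabridge.alarm.payload.size", unit: "By");

        // Hypothetical dimensioned count, using the UCUM annotation form.
        Counter<long> requests =
            meter.CreateCounter<long>("scadabridge.api.requests", unit: "{request}");

        Console.WriteLine($"{received.Name} [{received.Unit}]");       // scadabridge.alarm.received [1]
        Console.WriteLine($"{payloadSize.Name} [{payloadSize.Unit}]"); // scadabridge.alarm.payload.size [By]
        Console.WriteLine($"{requests.Name} [{requests.Unit}]");       // scadabridge.api.requests [{request}]
    }
}
```
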
---
## 4. Resource attribute set (shared across all three signals)
The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and
attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.
| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"` — do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |
**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. Without `site.id` and `node.role`, metrics from a
site node and the central node cannot be grouped or filtered by site or role: `host.name`
tells the hosts apart but says nothing about which site or role a host belongs to.
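
To make the attribute table concrete, here is a sketch of how such a Resource could be
assembled with the OpenTelemetry .NET SDK's `ResourceBuilder`. It is not the implementation
of `AddZbTelemetry` or `ZbResource.Build`; the class name and parameter names are
hypothetical, and the sketch assumes the `OpenTelemetry` NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using OpenTelemetry.Resources;

// Hypothetical helper; shows the attribute set only.
public static class ResourceSketch
{
    public static ResourceBuilder Build(
        string serviceName, string? serviceVersion, string? siteId, string? nodeRole)
    {
        // service.name, service.namespace and (when non-null) service.version.
        ResourceBuilder builder = ResourceBuilder.CreateDefault()
            .AddService(
                serviceName: serviceName,
                serviceNamespace: "ZB.MOM.WW",
                serviceVersion: serviceVersion);

        var attributes = new List<KeyValuePair<string, object>>
        {
            new("host.name", Environment.MachineName),
        };

        // Recommended attributes are omitted rather than emitted empty.
        if (!string.IsNullOrEmpty(siteId)) attributes.Add(new("site.id", siteId));
        if (!string.IsNullOrEmpty(nodeRole)) attributes.Add(new("node.role", nodeRole));

        return builder.AddAttributes(attributes);
    }
}
```

A caller might use it as `ResourceSketch.Build("scadabridge", null, "site-a", "site")`,
where `"site-a"` is a made-up site identifier.
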
---
## 5. Standard instrumentation baseline
Every app enables this baseline via `AddZbTelemetry`; there is no opt-out. These are
community-standard instrumentation packages; the overhead is negligible and the benefit
(correlated HTTP / gRPC request traces across the fleet) is high.
| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — |
| HttpClient | Traces + Metrics | ✅ | ✅ | — |
| gRPC client | Traces | ✅ | — | — |
| .NET runtime | Metrics | ✅ | — | — |
| Process | Metrics | ✅ | — | — |
OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through
`AddZbTelemetry`. No project removes any of these.
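
For orientation, the five rows of the table correspond to the following registrations in the
upstream OpenTelemetry .NET packages. This is a sketch of the equivalent hand-written wiring,
assuming the `OpenTelemetry.Extensions.Hosting` package and the matching
`OpenTelemetry.Instrumentation.*` packages are referenced. It says nothing about how
`AddZbTelemetry` is implemented internally, and exporters are left out.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // ASP.NET Core traces
        .AddHttpClientInstrumentation()   // HttpClient traces
        .AddGrpcClientInstrumentation())  // gRPC client traces
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()   // ASP.NET Core metrics
        .AddHttpClientInstrumentation()   // HttpClient metrics
        .AddRuntimeInstrumentation()      // .NET runtime metrics
        .AddProcessInstrumentation());    // process metrics

var app = builder.Build();
app.Run();
```
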
---
## 6. Per-app instrument surface (bespoke — stays per project)
These instruments are **not part of the shared library**. They document the existing bespoke
surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`.
### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter
Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing |
| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions |
| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations |
| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations |
| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions |
| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway |
**ActivitySources (spans):**
| Source name | Span(s) |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` |
All durations already use `"s"` — no convergence item for OtOpcUa.
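
The span names in the ActivitySources table are the operation names passed to
`StartActivity`. A minimal sketch of one such span follows; the class name, method name, and
tag are hypothetical and not taken from `OtOpcUaTelemetry.cs`.

```csharp
using System;
using System.Diagnostics;

// Hypothetical class; shows how a named span is started from the shared source.
public static class TracingSketch
{
    private static readonly ActivitySource Source = new("ZB.MOM.WW.OtOpcUa");

    public static void ApplyDeploy()
    {
        // StartActivity returns null when no listener samples this source,
        // so the null-conditional operator keeps the call site safe.
        using Activity? activity = Source.StartActivity("DeployWatcher.Apply");
        activity?.SetTag("deploy.example", "value"); // hypothetical tag

        // ... apply the deploy ...
    }

    public static void Main() => ApplyDeploy();
}
```
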
### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)
Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
**Counters (13):**
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.created` | `"1"` | MxAccess sessions opened |
| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed |
| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors |
| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations |
| `mxgateway.command.errors` | `"1"` | Command invocation errors |
| `mxgateway.event.received` | `"1"` | MxAccess events received from worker |
| `mxgateway.event.errors` | `"1"` | Event processing errors |
| `mxgateway.worker.started` | `"1"` | x86 worker processes started |
| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped |
| `mxgateway.worker.errors` | `"1"` | Worker communication errors |
| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs |
| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors |
| `mxgateway.auth.failures` | `"1"` | Authentication failures |
**Histograms (3):**
| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
**Gauges (4):**
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions |
| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes |
| `mxgateway.worker.memory` | `"By"` | Worker process RSS |
| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache |
No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource
is left per-project (deferred to GAPS backlog).
### 6.3 ScadaBridge — `ZB.MOM.WW.ScadaBridge` meter
No meter or instruments exist today (`OpenTelemetry.Api` is a dangling ref). The target
meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the
ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).
---
## Consequences and convergence items (accepted)
| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking: values change by a factor of 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge — define from scratch |
All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.