# Observability: Metric conventions (standardized)

Status: **Standardized**. The naming and unit rules every sister project's instruments must follow. Analogous to [`../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md) for auth and [`../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md) for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).

The per-project instrument tables below (§6) document the **existing bespoke surface**: the instruments each app currently defines or intends to define. These stay per-project; they are not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must be named and measured.

---

## 1. Meter name

Each app owns exactly **one primary Meter**, named after its root namespace:

| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target: rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |

`MxGateway.Server` is the single convergence item for meter naming. It predates the `ZB.MOM.WW.*` namespace convention; rename it when adopting `AddZbTelemetry`. Instruments emitted under the old name will require a `recording_rule` or relabel in any Prometheus config that already scrapes the snapshot, so coordinate before renaming in production.

If an app has secondary meters (e.g. a library component with its own meter), those follow the same pattern: `ZB.MOM.WW.<App>.<Component>`.

---

## 2. Instrument name

Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case, dot-separated:

```
<app>        := short app identifier (otopcua | mxgateway | scadabridge)
<subsystem>  := functional area (deploy | session | tag | alarm | gateway | worker | ...)
<event>      := what happened or is measured (applied | count | duration | errors | active | ...)
```

**Examples:**

| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.deploy.apply.duration` | OtOpcUa | End-to-end deploy apply duration |
| `mxgateway.sessions.open` | MxGateway | Currently open MXAccess sessions |
| `mxgateway.commands.duration` | MxGateway | End-to-end MXAccess command latency |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |

**Rules:**

1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the subsystem warrants a sub-area (e.g. `otopcua.deploy.apply.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`, `duration`), not implementation details (`method_called`, `loop_iteration`).
4. Counters: past-tense or noun (`received`, `errors`, `applied`). UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`). Histograms: `duration` or a measured quantity noun (`size`, `lag`).

---

## 3. Units

### Duration: seconds (mandatory)

**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks aggregations across apps.

> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated
> to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site).
> Track existing Prometheus `recording_rule`/dashboard changes: any dashboard panel that
> reads these histograms in `ms` will need updating. Until migration is complete, annotate
> the instruments with `// CONVERGENCE: ms→s pending`.

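
The divide-by-1 000 conversion above can be sketched as follows. This is an illustration only, written in Python for brevity rather than in the apps' own .NET code; `ms_to_seconds`, `convert_thresholds`, and the threshold list are hypothetical names and values, not taken from `GatewayMetrics.cs` or any existing dashboard.

```python
MS_PER_SECOND = 1_000


def ms_to_seconds(value_ms: float) -> float:
    """Convert a millisecond measurement to seconds before it is recorded."""
    return value_ms / MS_PER_SECOND


def convert_thresholds(thresholds_ms: list[float]) -> list[float]:
    """Convert millisecond thresholds (e.g. dashboard alert levels) to seconds."""
    return [ms_to_seconds(t) for t in thresholds_ms]


if __name__ == "__main__":
    # Hypothetical command latency measured as 250 ms at the call site.
    print(ms_to_seconds(250))

    # Hypothetical dashboard thresholds that were expressed in milliseconds.
    print(convert_thresholds([5, 50, 500, 1000]))
```

The same factor applies to anything that consumes the histogram: a panel or alert that compared against `500` (ms) would compare against `0.5` (s) after the migration.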
### Other units

| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory (see above) |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |

---

## 4. Resource attribute set (shared across all three signals)

The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values populate Serilog enrichers, making a metric, a span, and a log line from the same node joinable in any OTel-compatible backend.

| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"`; do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `service.instance.id` | string | Auto | Always `ZbResource.InstanceId` = deterministic `MachineName:ProcessId`. The OTel SDK random-GUID default is disabled so every signal from one process shares one restart-stable instance id (cross-signal correlation); never override |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |

**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one central cluster, each on different hosts. Without `site.id` and `node.role`, metrics from a site node and the central node are indistinguishable even if `host.name` differs.

---

## 5. Standard instrumentation baseline

Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are community-standard instrumentation packages; the overhead is negligible and the benefit (correlated HTTP / gRPC request traces across the fleet) is high.

| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| HttpClient | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client | Traces | ⛔ not added | ⛔ not added | n/a |
| .NET runtime | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Process | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |

All three projects lack standard instrumentation today. It is added automatically when each project calls `AddZbTelemetry` (Gap S1 in `GAPS.md`). No project removes any of these once wired.

---

## 6. Per-app instrument surface (bespoke, stays per project)

These instruments are **not part of the shared library**. They document the existing bespoke surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`.

### 6.1 OtOpcUa: `ZB.MOM.WW.OtOpcUa` meter

Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
(Code-verified 2026-06-01; see `current-state/otopcua/CURRENT-STATE.md`.)

**Counters (7):**

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | - | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.driver.lifecycle` | Counter | - | Driver lifecycle events (start / stop / restart) |
| `otopcua.virtualtag.eval` | Counter | - | Virtual tag evaluations |
| `otopcua.scriptedalarm.transition` | Counter | - | Scripted alarm state transitions |
| `otopcua.opcua.sink.write` | Counter | - | OPC UA sink write operations |
| `otopcua.redundancy.service_level_change` | Counter | - | Redundancy service-level changes |

**Histograms (1):**

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.apply.duration` | Histogram | `s` | End-to-end deploy apply duration |

**ActivitySources (spans):**

| Source name | Spans |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild` |

All durations use `"s"`, so there is no unit convergence item for OtOpcUa.

### 6.2 MxGateway: `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)

Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
(Code-verified 2026-06-01; see `current-state/mxaccessgw/CURRENT-STATE.md`.)

**Counters (13):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.sessions.opened` | `"1"` | New session requests |
| `mxgateway.sessions.closed` | `"1"` | Sessions torn down |
| `mxgateway.commands.started` | `"1"` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | `"1"` | Command completed OK |
| `mxgateway.commands.failed` | `"1"` | Command error |
| `mxgateway.events.received` | `"1"` | MXAccess events received from worker |
| `mxgateway.queues.overflows` | `"1"` | Queue overflow (backpressure) |
| `mxgateway.faults` | `"1"` | Unhandled gateway faults |
| `mxgateway.workers.killed` | `"1"` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | `"1"` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | `"1"` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | `"1"` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | `"1"` | Retry attempts (any subsystem) |

**Histograms (3), current unit `ms` (convergence target `s`):**

| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.workers.startup.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.commands.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.events.stream_send.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |

**Observable gauges (4):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.sessions.open` | `"1"` | Currently open sessions (live count) |
| `mxgateway.workers.running` | `"1"` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | `"1"` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | `"1"` | Per-stream gRPC send queue depth |

No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource is left per-project (deferred to GAPS backlog).

### 6.3 ScadaBridge: `ZB.MOM.WW.ScadaBridge` meter

No meter or instruments exist today (`OpenTelemetry.Api` is a dangling ref). The target meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).

---

## Consequences and convergence items (accepted)

| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge; define from scratch |

All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s migration is the highest-priority convergence item because leaving it unresolved means MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana workspace.
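
The naming rules in §2 and the seconds-only rule in §3 are mechanical enough to check automatically, which may help when the ScadaBridge instrument set is defined from scratch. The following is a minimal sketch of such a check, written in Python as a standalone lint script. `check_instrument`, `KNOWN_APPS`, and the decision to allow underscores inside a segment (as in `service_level_change`) are assumptions made for this illustration, not part of `AddZbTelemetry` or any existing tooling.

```python
import re

# Assumption: the three app identifiers listed in §2 are the only valid prefixes.
KNOWN_APPS = {"otopcua", "mxgateway", "scadabridge"}

# One lower-case segment; underscores are allowed inside a segment.
_SEGMENT = r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*"
# At least three dot-separated segments: <app>.<subsystem>.<event>[.<more>]
_NAME_RE = re.compile(rf"{_SEGMENT}(?:\.{_SEGMENT}){{2,}}")


def check_instrument(name: str, kind: str, unit: str) -> list[str]:
    """Return a list of rule violations for one instrument (empty if none)."""
    problems = []
    if _NAME_RE.fullmatch(name) is None:
        problems.append(
            "name must be lower-case, dot-separated, with at least three segments"
        )
    elif name.split(".")[0] not in KNOWN_APPS:
        problems.append(f"unknown app prefix: {name.split('.')[0]}")
    # §3: every duration histogram is recorded in seconds.
    if kind == "histogram" and name.endswith(".duration") and unit != "s":
        problems.append('duration histograms must use unit "s"')
    return problems


if __name__ == "__main__":
    samples = [
        ("otopcua.deploy.apply.duration", "histogram", "s"),
        ("mxgateway.commands.duration", "histogram", "ms"),
        ("MxGateway.Commands.Duration", "histogram", "s"),
    ]
    for name, kind, unit in samples:
        print(name, check_instrument(name, kind, unit))
```

The check enforces only the three-segment minimum, because §2 permits a fourth segment for sub-areas. It does not attempt to judge rule 3 (event nouns versus implementation details), which still needs human review.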