scadaproj/components/observability/spec/METRIC-CONVENTIONS.md
Joseph Doherty 7d243890ed docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract
Author the three normalization docs for the observability component:
- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project),
  AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline,
  exporter conventions, Serilog two-stage bootstrap with identity enrichers and
  TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and
  acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app
  namespace; MxGateway.Server flagged as convergence target), instrument naming pattern
  (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms
  flagged), Resource attribute set table, standard instrumentation baseline, and per-app
  instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms
  / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two
  packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder +
  IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog,
  ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher.
  Consumer matrix and open contract questions included.
2026-06-01 07:19:38 -04:00

Observability — Metric conventions (standardized)

Status: Standardized. The naming and unit rules every sister project's instruments must follow. Analogous to ../auth/spec/CANONICAL-ROLES.md for auth and ../ui-theme/spec/DESIGN-TOKENS.md for the UI kit. Authoritative alongside SPEC.md.

The per-project instrument tables below (§6) document the existing bespoke surface: the instruments each app currently defines or intends to define. These stay per-project; they are not candidates for the shared library. The rules in §1–§3 govern how those instruments must be named and measured.


1. Meter name

Each app owns exactly one primary Meter, named after its root namespace:

| App | Meter name | Status |
|---|---|---|
| OtOpcUa | ZB.MOM.WW.OtOpcUa | Correct today |
| MxGateway | MxGateway.Server | ⚠ Convergence target: rename to ZB.MOM.WW.MxGateway on adoption |
| ScadaBridge | ZB.MOM.WW.ScadaBridge | Target (no meter exists today) |

MxGateway.Server is the single convergence item for meter naming. It predates the ZB.MOM.WW.* namespace convention; rename when adopting AddZbTelemetry. Instruments emitted under the old name will require a recording_rule or relabel in any Prometheus config that already scrapes the snapshot — coordinate before renaming in production.

If an app has secondary meters (e.g. a library component with its own meter), those follow the same pattern: ZB.MOM.WW.<App>.<Component>.
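
The naming pattern above can be checked mechanically. The following is a minimal sketch, not part of any project: `meter_name` and `is_conforming_meter_name` are hypothetical helper names, the `Worker` component in the usage lines is invented for illustration, and the regular expression assumes app and component segments are plain identifiers made of letters and digits.

```python
import re

NAMESPACE = "ZB.MOM.WW"  # root namespace prefix used by the convention

def meter_name(app, component=None):
    """Build the conventional meter name for an app, or for a secondary meter."""
    parts = [NAMESPACE, app]
    if component is not None:
        parts.append(component)
    return ".".join(parts)

# Assumption: each segment after the prefix is a letters-and-digits identifier,
# with at most one <Component> segment after <App>.
_METER_RE = re.compile(r"ZB\.MOM\.WW\.[A-Za-z][A-Za-z0-9]*(\.[A-Za-z][A-Za-z0-9]*)?")

def is_conforming_meter_name(name):
    """Return True when the name matches ZB.MOM.WW.<App>[.<Component>]."""
    return _METER_RE.fullmatch(name) is not None

print(meter_name("OtOpcUa"))                         # ZB.MOM.WW.OtOpcUa
print(meter_name("MxGateway", "Worker"))             # ZB.MOM.WW.MxGateway.Worker (hypothetical component)
print(is_conforming_meter_name("MxGateway.Server"))  # False: the convergence item
```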


2. Instrument name

Instrument names follow the pattern <app>.<subsystem>.<event>, all lower-case, dot-separated:

<app>       := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event>     := what happened or is measured — applied | count | duration | errors | active | ...

Examples:

| Instrument name | App | Meaning |
|---|---|---|
| otopcua.deploy.applied | OtOpcUa | Galaxy deploy events applied to the address space |
| otopcua.tag.subscriptions | OtOpcUa | Active OPC UA tag subscriptions |
| mxgateway.session.active | MxGateway | Active MxAccess sessions |
| mxgateway.worker.call.duration | MxGateway | gRPC call duration to the x86 worker |
| scadabridge.alarm.received | ScadaBridge | Alarms received by the DCL |

Rules:

  1. All lower-case. No camelCase, no PascalCase, no hyphens.
  2. Three segments minimum (<app>.<subsystem>.<event>). Four are permitted when the subsystem warrants a sub-area (e.g. mxgateway.worker.call.duration).
  3. Event nouns describe what is counted or measured (applied, errors, active, duration), not implementation details (method_called, loop_iteration).
  4. Counters: past-tense or noun (received, errors, applied). UpDownCounters / gauges: present-state noun or adjective (active, connected). Histograms: duration or a measured quantity noun (size, lag).
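
Rules 1 and 2 are mechanical enough to lint. The sketch below is one possible checker, not an existing tool; `instrument_name_problems` is a hypothetical name. It reads "three minimum, four permitted" as an upper bound of four segments, and it assumes segments contain only `a-z` and `0-9` (the rules do not say whether underscores are allowed). Rules 3 and 4 need human judgment and are not covered.

```python
import re

KNOWN_APPS = {"otopcua", "mxgateway", "scadabridge"}
# Assumption: a segment is lower-case letters and digits, starting with a letter.
_SEGMENT_RE = re.compile(r"[a-z][a-z0-9]*")

def instrument_name_problems(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if name != name.lower():
        problems.append("not all lower-case")
    if "-" in name:
        problems.append("contains a hyphen")
    segments = name.split(".")
    if not 3 <= len(segments) <= 4:
        problems.append(f"{len(segments)} segments (expected 3 or 4)")
    if segments[0] not in KNOWN_APPS:
        problems.append(f"unknown app prefix {segments[0]!r}")
    if any(not _SEGMENT_RE.fullmatch(s) for s in segments):
        problems.append("a segment is empty or has characters outside a-z, 0-9")
    return problems

for candidate in ("otopcua.deploy.applied",
                  "mxgateway.worker.call.duration",
                  "MxGateway.Session.Active",
                  "otopcua.deploy"):
    print(candidate, instrument_name_problems(candidate))
```

A check like this could run in a unit test over each app's instrument list, so that a non-conforming name fails the build rather than reaching a dashboard.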

3. Units

Duration — seconds (mandatory)

All duration histograms MUST use seconds ("s"). This is the OpenTelemetry semantic convention (UCUM: s). Backends and dashboards assume seconds; mixing units breaks aggregations across apps.

MxGateway convergence item: GatewayMetrics.cs defines three histograms with unit "ms" (CommandDuration, EventDuration, WorkerCallDuration). These must be migrated to "s" on adoption. Values must also be converted (divide by 1 000 at the call site). Track existing Prometheus recording_rule/dashboard changes — any dashboard panel that reads these histograms in ms will need updating. Until migration is complete, annotate the instruments with // CONVERGENCE: ms→s pending.
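
As a worked illustration of the ms→s conversion, the sketch below shows the factor-of-1 000 arithmetic in Python. The helper names are hypothetical and this is not MxGateway code. Whether the three histograms configure explicit bucket boundaries is not stated in the source; if they do, the boundaries need the same conversion as the recorded values, which `convert_bucket_bounds` illustrates with invented boundary values.

```python
import time

def ms_to_seconds(milliseconds):
    """Convert a duration in milliseconds to seconds (divide by 1 000)."""
    return milliseconds / 1000.0

def convert_bucket_bounds(bounds_ms):
    """Convert explicit histogram bucket boundaries from ms to s."""
    return [ms_to_seconds(b) for b in bounds_ms]

def timed_seconds(operation):
    """Run operation() and return (result, elapsed seconds).

    Measuring in seconds at the source avoids a separate conversion step.
    """
    start = time.perf_counter()
    result = operation()
    return result, time.perf_counter() - start

print(ms_to_seconds(250))                         # 0.25
print(convert_bucket_bounds([5, 50, 500, 5000]))  # [0.005, 0.05, 0.5, 5.0]
```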

Other units

| Quantity | Unit string | Notes |
|---|---|---|
| Duration | "s" | Mandatory (see above) |
| Size / bytes | "By" | UCUM bytes |
| Count (dimensionless) | "1" or omit | For pure event counts; "1" preferred |
| Messages, requests | "{message}", "{request}" | UCUM annotation form for dimensioned counts |
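
The unit table can be turned into a simple check. This is a sketch under stated assumptions rather than a definitive validator: `unit_problems` is a hypothetical name, a duration instrument is recognised only by its final segment being `duration`, and the annotation pattern accepts any lower-case `{word}` form, not just the two shown in the table.

```python
import re

DURATION_UNIT = "s"
BYTES_UNIT = "By"
# Assumption: annotation units look like "{message}" or "{request}".
_ANNOTATION_RE = re.compile(r"\{[a-z_]+\}")

def unit_problems(instrument_name, unit):
    """Return a list of unit violations for one instrument (empty = OK)."""
    event = instrument_name.rsplit(".", 1)[-1]
    if event == "duration":
        if unit == DURATION_UNIT:
            return []
        return [f'duration must use "s", got {unit!r}']
    # A missing unit is accepted because the table allows omitting it for counts.
    if unit in (None, "", "1", BYTES_UNIT, DURATION_UNIT):
        return []
    if _ANNOTATION_RE.fullmatch(unit):
        return []
    return [f"unit {unit!r} is not in the conventions table"]

print(unit_problems("mxgateway.command.duration", "ms"))
print(unit_problems("mxgateway.worker.memory", "By"))
```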

4. Resource attribute set (shared across all three signals)

The OTel Resource is built once by AddZbTelemetry (see SPEC.md §2) and attached to metrics, traces, and OTel-exported logs. The same SiteId and NodeRole values populate Serilog enrichers, making a metric, a span, and a log line from the same node joinable in any OTel-compatible backend.

| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| service.name | string | Yes | Short lower-case app id: otopcua, mxgateway, scadabridge |
| service.namespace | string | Yes | Always "ZB.MOM.WW"; do not override |
| service.version | string | Recommended | Populate from AssemblyInformationalVersion; absent is better than wrong |
| site.id | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| node.role | string | Recommended | Node function: "central", "site", "hub", "standalone" |
| host.name | string | Auto | Always Environment.MachineName; never override |

Why site.id and node.role matter: a ScadaBridge fleet runs N site clusters plus one central cluster, each on different hosts. host.name tells individual nodes apart but says nothing about which site or cluster a node belongs to; without site.id and node.role, metrics cannot be grouped or filtered by site or by role.
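
To make the attribute rules concrete, here is a minimal sketch of how the attribute set could be assembled. It is written in Python for illustration and is not the ZbResource.Build implementation: `build_resource_attributes` is a hypothetical name, `socket.gethostname()` stands in for Environment.MachineName, the site id `plant-07` is invented, and treating the four node.role values as a closed set is an assumption.

```python
import socket

SERVICE_NAMESPACE = "ZB.MOM.WW"  # fixed; callers cannot override it
KNOWN_SERVICES = {"otopcua", "mxgateway", "scadabridge"}
KNOWN_ROLES = {"central", "site", "hub", "standalone"}  # assumed closed set

def build_resource_attributes(service_name, service_version=None,
                              site_id=None, node_role=None):
    """Assemble the shared Resource attributes as a plain dict."""
    if service_name not in KNOWN_SERVICES:
        raise ValueError(f"unknown service.name {service_name!r}")
    if node_role is not None and node_role not in KNOWN_ROLES:
        raise ValueError(f"unknown node.role {node_role!r}")
    attrs = {
        "service.name": service_name,
        "service.namespace": SERVICE_NAMESPACE,
        "host.name": socket.gethostname(),  # always taken from the machine
    }
    optional = {
        "service.version": service_version,
        "site.id": site_id,
        "node.role": node_role,
    }
    # Recommended attributes are omitted when unknown: absent is better than wrong.
    attrs.update({key: value for key, value in optional.items() if value})
    return attrs

print(build_resource_attributes("scadabridge", site_id="plant-07", node_role="site"))
```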


5. Standard instrumentation baseline

Every app enables this baseline via AddZbTelemetry. No opt-out. These are community-standard instrumentation packages; the overhead is negligible and the benefit (correlated HTTP / gRPC request traces across the fleet) is high.

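| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | Yes | | |
| HttpClient | Traces + Metrics | Yes | | |
| gRPC client | Traces | Yes | | |
| .NET runtime | Metrics | Yes | | |
| Process | Metrics | Yes | | |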

OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through AddZbTelemetry. No project removes any of these.
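
An adoption check against the baseline reduces to a set difference. The sketch below is illustrative only: `missing_baseline` is a hypothetical helper, and the partial list in the usage line is an invented example, not the actual state of any project.

```python
# The five baseline instrumentations, in the order of the table above.
BASELINE = ("ASP.NET Core", "HttpClient", "gRPC client", ".NET runtime", "Process")

def missing_baseline(enabled):
    """Return the baseline instrumentations not present in `enabled`, in table order."""
    enabled_set = set(enabled)
    return [name for name in BASELINE if name not in enabled_set]

print(missing_baseline(BASELINE))                          # []
print(missing_baseline(["ASP.NET Core", ".NET runtime"]))  # ['HttpClient', 'gRPC client', 'Process']
```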


6. Per-app instrument surface (bespoke — stays per project)

These instruments are not part of the shared library. They document the existing bespoke surface that each project registers through o.Meters / o.ActivitySources in AddZbTelemetry.

6.1 OtOpcUa — ZB.MOM.WW.OtOpcUa meter

Source: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| otopcua.deploy.applied | Counter | "1" | Galaxy deploy events applied to the OPC UA address space |
| otopcua.deploy.failed | Counter | "1" | Deploy events that failed processing |
| otopcua.tag.subscriptions | UpDownCounter | "1" | Active OPC UA tag subscriptions |
| otopcua.tag.reads | Counter | "1" | Tag read operations |
| otopcua.tag.writes | Counter | "1" | Tag write operations |
| otopcua.session.active | UpDownCounter | "1" | Active OPC UA sessions |
| otopcua.connection.gateway | UpDownCounter | "1" | Active gRPC channels to MxAccessGateway |

ActivitySources (spans):

| Source name | Span(s) |
|---|---|
| ZB.MOM.WW.OtOpcUa | DeployWatcher.Apply, GalaxyDriver.BrowseHierarchy |

All durations already use "s" — no convergence item for OtOpcUa.

6.2 MxGateway — MxGateway.Server meter (→ target: ZB.MOM.WW.MxGateway)

Source: src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs

Counters (13):

| Instrument | Unit | Description |
|---|---|---|
| mxgateway.session.created | "1" | MxAccess sessions opened |
| mxgateway.session.closed | "1" | MxAccess sessions closed |
| mxgateway.session.errors | "1" | Session creation/teardown errors |
| mxgateway.command.invoked | "1" | MxAccess command invocations |
| mxgateway.command.errors | "1" | Command invocation errors |
| mxgateway.event.received | "1" | MxAccess events received from worker |
| mxgateway.event.errors | "1" | Event processing errors |
| mxgateway.worker.started | "1" | x86 worker processes started |
| mxgateway.worker.stopped | "1" | x86 worker processes stopped |
| mxgateway.worker.errors | "1" | Worker communication errors |
| mxgateway.galaxy.browse.requests | "1" | Galaxy Repository browse RPCs |
| mxgateway.galaxy.browse.errors | "1" | Galaxy browse errors |
| mxgateway.auth.failures | "1" | Authentication failures |

Histograms (3):

| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| mxgateway.command.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |
| mxgateway.event.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |
| mxgateway.worker.call.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |

Gauges (4):

| Instrument | Unit | Description |
|---|---|---|
| mxgateway.session.active | "1" | Current active MxAccess sessions |
| mxgateway.worker.active | "1" | Current running x86 worker processes |
| mxgateway.worker.memory | "By" | Worker process RSS |
| mxgateway.galaxy.nodes.cached | "1" | Galaxy Repository nodes in browse cache |

No ActivitySources today (no tracing). Adding ZB.MOM.WW.MxGateway as an ActivitySource is left per-project (deferred to GAPS backlog).

6.3 ScadaBridge — ZB.MOM.WW.ScadaBridge meter

No meter or instruments exist today (OpenTelemetry.Api is a dangling ref). The target meter name ZB.MOM.WW.ScadaBridge is reserved. Instruments are defined as part of the ScadaBridge adoption tracked in ../GAPS.md.


Consequences and convergence items (accepted)

| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename MxGateway.Server → ZB.MOM.WW.MxGateway | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit ms → s (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge; define from scratch |

All three items are tracked as backlog entries in ../GAPS.md. The ms→s migration is the highest-priority convergence item because leaving it unresolved means MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana workspace.