Observability — Metric conventions (standardized)

Status: Standardized. The naming and unit rules every sister project's instruments must follow. Analogous to ../auth/spec/CANONICAL-ROLES.md for auth and ../ui-theme/spec/DESIGN-TOKENS.md for the UI kit. Authoritative alongside SPEC.md.

The per-project instrument tables below (§6) document the existing bespoke surface: the instruments each app currently defines or intends to define. These stay per-project; they are not candidates for the shared library. The rules in §1–§3 govern how those instruments must be named and measured.


1. Meter name

Each app owns exactly one primary Meter, named after its root namespace:

| App | Meter name | Status |
| --- | --- | --- |
| OtOpcUa | ZB.MOM.WW.OtOpcUa | Correct today |
| MxGateway | MxGateway.Server | ⚠ Convergence target: rename to ZB.MOM.WW.MxGateway on adoption |
| ScadaBridge | ZB.MOM.WW.ScadaBridge | Target (no meter exists today) |

MxGateway.Server is the single convergence item for meter naming. It predates the ZB.MOM.WW.* namespace convention; rename when adopting AddZbTelemetry. Instruments emitted under the old name will require a recording_rule or relabel in any Prometheus config that already scrapes the snapshot — coordinate before renaming in production.

If an app has secondary meters (e.g. a library component with its own meter), those follow the same pattern: ZB.MOM.WW.<App>.<Component>.
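
The naming pattern above can be checked mechanically. The following is a minimal sketch in Python, assuming a small lint helper outside the .NET apps themselves; `meter_name`, `is_conforming_meter`, and the rule that each segment starts with an upper-case letter are illustrative assumptions, not part of the shared library.

```python
import re
from typing import Optional

NAMESPACE = "ZB.MOM.WW"  # prefix shared by every conforming meter

# Assumption: each segment is an identifier starting with an upper-case letter.
_SEGMENT = re.compile(r"[A-Z][A-Za-z0-9]*")


def meter_name(app: str, component: Optional[str] = None) -> str:
    """Return ZB.MOM.WW.<App>, or ZB.MOM.WW.<App>.<Component> for a secondary meter."""
    parts = [app] if component is None else [app, component]
    for part in parts:
        if not _SEGMENT.fullmatch(part):
            raise ValueError(f"invalid meter name segment: {part!r}")
    return ".".join([NAMESPACE, *parts])


def is_conforming_meter(name: str) -> bool:
    """True when name is ZB.MOM.WW.<App> with an optional .<Component> suffix."""
    prefix = NAMESPACE + "."
    if not name.startswith(prefix):
        return False
    parts = name[len(prefix):].split(".")
    return 1 <= len(parts) <= 2 and all(_SEGMENT.fullmatch(p) for p in parts)


print(meter_name("OtOpcUa"))
print(is_conforming_meter("MxGateway.Server"))
```

Under these assumptions the legacy name MxGateway.Server is reported as non-conforming, which matches its status as the convergence item.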


2. Instrument name

Instrument names follow the pattern <app>.<subsystem>.<event>, all lower-case, dot-separated:

<app>       := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event>     := what happened or is measured — applied | count | duration | errors | active | ...

Examples:

| Instrument name | App | Meaning |
| --- | --- | --- |
| otopcua.deploy.applied | OtOpcUa | Galaxy deploy events applied to the address space |
| otopcua.deploy.apply.duration | OtOpcUa | End-to-end deploy apply duration |
| mxgateway.sessions.open | MxGateway | Currently open MXAccess sessions |
| mxgateway.commands.duration | MxGateway | End-to-end MXAccess command latency |
| scadabridge.alarm.received | ScadaBridge | Alarms received by the DCL |

Rules:

  1. All lower-case. No camelCase, no PascalCase, no hyphens.
  2. Three segments minimum (<app>.<subsystem>.<event>). Four are permitted when the subsystem warrants a sub-area (e.g. mxgateway.workers.startup.duration).
  3. Event nouns describe what is counted or measured (applied, errors, active, duration), not implementation details (method_called, loop_iteration).
  4. Counters: past-tense or noun (received, errors, applied). UpDownCounters / gauges: present-state noun or adjective (active, connected). Histograms: duration or a measured quantity noun (size, lag).
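
Rules 1 and 2 lend themselves to an automated check. The sketch below is one possible lint in Python, not an existing tool: `check_instrument_name` is a hypothetical helper, and allowing underscores and digits inside a segment is an assumption drawn from existing names such as otopcua.redundancy.service_level_change.

```python
import re
from typing import List

KNOWN_APPS = {"otopcua", "mxgateway", "scadabridge"}


def check_instrument_name(name: str) -> List[str]:
    """Return a list of rule violations; an empty list means the name conforms."""
    problems = []
    if name != name.lower():
        problems.append("must be all lower-case")
    # Assumption: underscores and digits are allowed inside a segment.
    if re.search(r"[^A-Za-z0-9_.]", name):
        problems.append("only letters, digits, '_' and '.' are allowed")
    segments = name.split(".")
    if any(not segment for segment in segments):
        problems.append("empty segment")
    if not 3 <= len(segments) <= 4:
        problems.append(f"expected 3 or 4 segments, got {len(segments)}")
    if segments[0] not in KNOWN_APPS:
        problems.append(f"unknown app identifier: {segments[0]!r}")
    return problems


for candidate in ("otopcua.deploy.applied", "mxgateway.faults", "MxGateway.Commands.Duration"):
    print(candidate, check_instrument_name(candidate))
```

Run against the names listed in §6, a check like this would flag mxgateway.faults, which has only two segments.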

3. Units

Duration — seconds (mandatory)

All duration histograms MUST use seconds ("s"). This is the OpenTelemetry semantic convention (UCUM: s). Backends and dashboards assume seconds; mixing units breaks aggregations across apps.

MxGateway convergence item: GatewayMetrics.cs defines three histograms with unit "ms" (CommandDuration, EventDuration, WorkerCallDuration). These must be migrated to "s" on adoption. Values must also be converted (divide by 1 000 at the call site). Track existing Prometheus recording_rule/dashboard changes — any dashboard panel that reads these histograms in ms will need updating. Until migration is complete, annotate the instruments with // CONVERGENCE: ms→s pending.
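
The conversion itself is a single division, but it has to be applied consistently to recorded values and to any thresholds that were expressed in milliseconds. A minimal arithmetic sketch in Python follows; `ms_to_s` and the sample threshold values are hypothetical and are not taken from GatewayMetrics.cs.

```python
MS_PER_S = 1_000  # milliseconds per second


def ms_to_s(elapsed_ms: float) -> float:
    """Convert a duration measured in milliseconds to seconds."""
    return elapsed_ms / MS_PER_S


# A value previously recorded as 250 (ms) is recorded as 0.25 (s) after migration.
recorded_ms = 250.0
print(ms_to_s(recorded_ms))

# Hypothetical dashboard or alert thresholds need the same factor applied.
thresholds_ms = [5, 50, 500, 5000]
thresholds_s = [ms_to_s(t) for t in thresholds_ms]
print(thresholds_s)
```

Converting at the call site keeps the instrument's declared unit and its recorded values in agreement; changing only the unit string would leave values that are 1 000 times too large.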

Other units

| Quantity | Unit string | Notes |
| --- | --- | --- |
| Duration | "s" | Mandatory (see above) |
| Size / bytes | "By" | UCUM bytes |
| Count (dimensionless) | "1" or omit | For pure event counts; "1" preferred |
| Messages, requests | "{message}", "{request}" | UCUM annotation form for dimensioned counts |
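
The unit table can be turned into a simple lint as well. This Python sketch assumes that duration instruments are recognizable by a `.duration` suffix and that an omitted unit is represented as an empty string; `check_unit` is a hypothetical helper, not part of any shared package.

```python
import re
from typing import Optional

# Unit strings from the table above; "" stands for an omitted unit.
FIXED_UNITS = {"s", "By", "1", ""}
# UCUM annotation form such as "{message}" or "{request}".
_ANNOTATION = re.compile(r"\{[a-z_]+\}")


def check_unit(instrument_name: str, unit: str) -> Optional[str]:
    """Return a problem description, or None when the unit is acceptable."""
    # Assumption: duration instruments end in ".duration".
    if instrument_name.endswith(".duration") and unit != "s":
        return 'duration instruments must use "s"'
    if unit in FIXED_UNITS or _ANNOTATION.fullmatch(unit):
        return None
    return f"unit {unit!r} is not in the conventions table"


print(check_unit("otopcua.deploy.apply.duration", "s"))
print(check_unit("mxgateway.commands.duration", "ms"))
```

With these assumptions, the three MxGateway histograms that still declare "ms" would be reported until the migration is complete.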

4. Resource attribute set (shared across all three signals)

The OTel Resource is built once by AddZbTelemetry (see SPEC.md §2) and attached to metrics, traces, and OTel-exported logs. The same SiteId and NodeRole values populate Serilog enrichers, making a metric, a span, and a log line from the same node joinable in any OTel-compatible backend.

| OTel attribute | Type | Required | Notes |
| --- | --- | --- | --- |
| service.name | string | Yes | Short lower-case app id: otopcua, mxgateway, scadabridge |
| service.namespace | string | Yes | Always "ZB.MOM.WW"; do not override |
| service.version | string | Recommended | Populate from AssemblyInformationalVersion; absent is better than wrong |
| service.instance.id | string | Auto | Always ZbResource.InstanceId = deterministic MachineName:ProcessId. The OTel SDK random-GUID default is disabled so every signal from one process shares one restart-stable instance id (cross-signal correlation); never override |
| site.id | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| node.role | string | Recommended | Node function: "central", "site", "hub", "standalone" |
| host.name | string | Auto | Always Environment.MachineName; never override |

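Why site.id and node.role matter: a ScadaBridge fleet runs N site clusters plus one central cluster, each on different hosts. Without site.id and node.role, metrics from a site node and from the central node can be told apart only by host.name, which says nothing about which site a node belongs to or what function it serves, so they cannot be grouped or filtered by site or role.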
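
The attribute set in the table can be illustrated as a plain dictionary. This is a language-neutral sketch in Python, not the AddZbTelemetry implementation: `build_resource_attributes` is a hypothetical function, `socket.gethostname()` stands in for Environment.MachineName, and the example site id is invented.

```python
import os
import socket
from typing import Dict, Optional

ALLOWED_NODE_ROLES = {"central", "site", "hub", "standalone"}


def build_resource_attributes(
    service_name: str,
    service_version: Optional[str] = None,
    site_id: Optional[str] = None,
    node_role: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble the shared resource attributes; recommended ones are omitted when unknown."""
    if node_role is not None and node_role not in ALLOWED_NODE_ROLES:
        raise ValueError(f"unknown node.role: {node_role!r}")
    host = socket.gethostname()  # stand-in for Environment.MachineName
    attributes = {
        "service.name": service_name,
        "service.namespace": "ZB.MOM.WW",  # fixed, never overridden
        "service.instance.id": f"{host}:{os.getpid()}",  # MachineName:ProcessId
        "host.name": host,
    }
    if service_version:
        attributes["service.version"] = service_version
    if site_id:
        attributes["site.id"] = site_id
    if node_role:
        attributes["node.role"] = node_role
    return attributes


# Hypothetical site node in a multi-site ScadaBridge fleet.
print(build_resource_attributes("scadabridge", site_id="site-07", node_role="site"))
```

Because the same dictionary would be attached to metrics, traces, and logs, a query can filter all three signals by site.id and node.role with identical label values.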


5. Standard instrumentation baseline

Every app enables this baseline via AddZbTelemetry. No opt-out. These are community-standard instrumentation packages; the overhead is negligible and the benefit (correlated HTTP / gRPC request traces across the fleet) is high.

| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
| --- | --- | --- | --- | --- |
| ASP.NET Core | Traces + Metrics | not added | not added | not added |
| HttpClient | Traces + Metrics | not added | not added | not added |
| gRPC client | Traces | not added | not added | n/a |
| .NET runtime | Metrics | not added | not added | not added |
| Process | Metrics | not added | not added | not added |

All three projects lack standard instrumentation today — it is added automatically when each project calls AddZbTelemetry (Gap S1 in GAPS.md). No project removes any of these once wired.


6. Per-app instrument surface (bespoke — stays per project)

These instruments are not part of the shared library. They document the existing bespoke surface that each project registers through o.Meters / o.ActivitySources in AddZbTelemetry.

6.1 OtOpcUa — ZB.MOM.WW.OtOpcUa meter

Source: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs (Code-verified 2026-06-01 — see current-state/otopcua/CURRENT-STATE.md.)

Counters (7):

| Instrument | Kind | Unit | Description |
| --- | --- | --- | --- |
| otopcua.deploy.applied | Counter | | Galaxy deploy events applied to the OPC UA address space |
| otopcua.driver.lifecycle | Counter | | Driver lifecycle events (start / stop / restart) |
| otopcua.virtualtag.eval | Counter | | Virtual tag evaluations |
| otopcua.scriptedalarm.transition | Counter | | Scripted alarm state transitions |
| otopcua.opcua.sink.write | Counter | | OPC UA sink write operations |
| otopcua.redundancy.service_level_change | Counter | | Redundancy service-level changes |

Histograms (1):

| Instrument | Kind | Unit | Description |
| --- | --- | --- | --- |
| otopcua.deploy.apply.duration | Histogram | s | End-to-end deploy apply duration |

ActivitySources (spans):

| Source name | Spans |
| --- | --- |
| ZB.MOM.WW.OtOpcUa | otopcua.deploy.apply, otopcua.opcua.address_space_rebuild |

All durations use "s" — no unit convergence item for OtOpcUa.

6.2 MxGateway — MxGateway.Server meter (→ target: ZB.MOM.WW.MxGateway)

Source: src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs (Code-verified 2026-06-01 — see current-state/mxaccessgw/CURRENT-STATE.md.)

Counters (13):

| Instrument | Unit | Description |
| --- | --- | --- |
| mxgateway.sessions.opened | "1" | New session requests |
| mxgateway.sessions.closed | "1" | Sessions torn down |
| mxgateway.commands.started | "1" | MXAccess command dispatched |
| mxgateway.commands.succeeded | "1" | Command completed OK |
| mxgateway.commands.failed | "1" | Command error |
| mxgateway.events.received | "1" | MXAccess events received from worker |
| mxgateway.queues.overflows | "1" | Queue overflow (backpressure) |
| mxgateway.faults | "1" | Unhandled gateway faults |
| mxgateway.workers.killed | "1" | Worker process forcibly terminated |
| mxgateway.workers.exited | "1" | Worker process exited cleanly |
| mxgateway.heartbeats.failed | "1" | Worker heartbeat timeouts |
| mxgateway.grpc.streams.disconnected | "1" | gRPC event stream disconnects |
| mxgateway.retries.attempted | "1" | Retry attempts (any subsystem) |

Histograms (3) — current unit ms (convergence target s):

| Instrument | Target unit | Current unit | Convergence |
| --- | --- | --- | --- |
| mxgateway.workers.startup.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |
| mxgateway.commands.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |
| mxgateway.events.stream_send.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |

Observable gauges (4):

| Instrument | Unit | Description |
| --- | --- | --- |
| mxgateway.sessions.open | "1" | Currently open sessions (live count) |
| mxgateway.workers.running | "1" | Currently running worker processes |
| mxgateway.events.worker_queue.depth | "1" | Per-worker event queue depth |
| mxgateway.events.grpc_stream_queue.depth | "1" | Per-stream gRPC send queue depth |

No ActivitySources today (no tracing). Adding ZB.MOM.WW.MxGateway as an ActivitySource is left per-project (deferred to GAPS backlog).

6.3 ScadaBridge — ZB.MOM.WW.ScadaBridge meter

No meter or instruments exist today (OpenTelemetry.Api is a dangling ref). The target meter name ZB.MOM.WW.ScadaBridge is reserved. Instruments are defined as part of the ScadaBridge adoption tracked in ../GAPS.md.


Consequences and convergence items (accepted)

| Item | Scope | Severity |
| --- | --- | --- |
| MxGateway meter rename MxGateway.Server → ZB.MOM.WW.MxGateway | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit ms → s (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge; define from scratch |

All three items are tracked as backlog entries in ../GAPS.md. The ms→s migration is the highest-priority convergence item because leaving it unresolved means MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana workspace.