Observability — Metric conventions (standardized)
Status: Standardized. The naming and unit rules every sister project's instruments must
follow. Analogous to ../auth/spec/CANONICAL-ROLES.md
for auth and ../ui-theme/spec/DESIGN-TOKENS.md
for the UI kit. Authoritative alongside SPEC.md.
The per-project instrument tables below (§6) document the existing bespoke surface: the instruments each app currently defines or intends to define. These stay per-project; they are not candidates for the shared library. The rules in §1–§3 govern how those instruments must be named and measured.
1. Meter name
Each app owns exactly one primary Meter, named after its root namespace:
| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target: rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |
MxGateway.Server is the single convergence item for meter naming. It predates the
ZB.MOM.WW.* namespace convention; rename when adopting AddZbTelemetry. Instruments
emitted under the old name will require a recording_rule or relabel in any Prometheus
config that already scrapes the snapshot — coordinate before renaming in production.
If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: ZB.MOM.WW.<App>.<Component>.
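A minimal sketch of the meter naming rule, using the in-box `System.Diagnostics.Metrics.Meter` type. The holder class `OtOpcUaMeters` and the `GalaxyDriver` secondary meter are hypothetical names chosen for illustration; only the name strings follow the rule above.

```csharp
using System.Diagnostics.Metrics;

// Hypothetical holder class; not a type from any of the three projects.
public static class OtOpcUaMeters
{
    // Primary meter: named after the app's root namespace.
    public const string PrimaryName = "ZB.MOM.WW.OtOpcUa";

    // Secondary meter: ZB.MOM.WW.<App>.<Component>.
    // "GalaxyDriver" is an illustrative component, not a meter that exists today.
    public const string GalaxyDriverName = PrimaryName + ".GalaxyDriver";

    public static readonly Meter Primary = new(PrimaryName);
    public static readonly Meter GalaxyDriver = new(GalaxyDriverName);
}
```

Keeping the names in constants lets the same strings be reused wherever the meter is registered, so the meter and its registration cannot drift apart.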
2. Instrument name
Instrument names follow the pattern <app>.<subsystem>.<event>, all lower-case,
dot-separated:
<app> := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event> := what happened or is measured — applied | count | duration | errors | active | ...
Examples:
| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions |
| `mxgateway.session.active` | MxGateway | Active MxAccess sessions |
| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |
Rules:
- All lower-case. No camelCase, no PascalCase, no hyphens.
- Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`).
- Event nouns describe what is counted or measured (`applied`, `errors`, `active`, `duration`), not implementation details (`method_called`, `loop_iteration`).
- Counters: past-tense or noun (`received`, `errors`, `applied`). UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`). Histograms: `duration` or a measured quantity noun (`size`, `lag`).
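The mechanical parts of the rules above (known app prefix, lower-case, dot-separated, three or four segments) can be checked in a unit test. The sketch below is one possible check, not part of the shared library: `InstrumentNameRules` is a hypothetical name, and allowing digits while rejecting underscores inside a segment is an assumption the rules do not state. The semantic rules (event nouns, tense per instrument kind) still need human review.

```csharp
using System.Text.RegularExpressions;

// Hypothetical lint helper for instrument names.
public static class InstrumentNameRules
{
    // <app> must be one of the three known ids, followed by two or three
    // further segments (three or four segments in total).
    // Assumption: a segment is a lower-case letter followed by lower-case
    // letters or digits; hyphens, underscores and upper-case are rejected.
    private static readonly Regex Pattern = new(
        @"\A(otopcua|mxgateway|scadabridge)(\.[a-z][a-z0-9]*){2,3}\z",
        RegexOptions.CultureInvariant);

    public static bool IsValid(string name) => Pattern.IsMatch(name);
}
```

For example, `InstrumentNameRules.IsValid("mxgateway.worker.call.duration")` is accepted, while `"MxGateway.Session.Active"` (mixed case) and `"mxgateway.session"` (two segments) are rejected.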
3. Units
Duration — seconds (mandatory)
All duration histograms MUST use seconds ("s"). This is the OpenTelemetry semantic
convention (UCUM: s). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.
⚠ MxGateway convergence item: `GatewayMetrics.cs` defines three histograms with unit `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site). Track existing Prometheus `recording_rule` / dashboard changes: any dashboard panel that reads these histograms in `ms` will need updating. Until migration is complete, annotate the instruments with `// CONVERGENCE: ms→s pending`.
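A sketch of what recording a duration in seconds can look like after migration. It does not reproduce `GatewayMetrics.cs`: the class `WorkerCallTiming` and its `Measure` and `ToSeconds` helpers are hypothetical. Only the instrument name, the `"s"` unit, and the divide-by-1 000 conversion come from the text above.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical helper; illustrates the unit rule only.
public static class WorkerCallTiming
{
    private static readonly Meter Meter = new("ZB.MOM.WW.MxGateway");

    // Unit is declared as seconds, and every recorded value must be in seconds.
    private static readonly Histogram<double> WorkerCallDuration =
        Meter.CreateHistogram<double>(
            "mxgateway.worker.call.duration",
            unit: "s",
            description: "gRPC call duration to the x86 worker");

    // For call sites that still hold a millisecond reading.
    public static double ToSeconds(double milliseconds) => milliseconds / 1000.0;

    // Times the call and records the elapsed time in seconds,
    // including when the call throws.
    public static T Measure<T>(Func<T> call)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            return call();
        }
        finally
        {
            WorkerCallDuration.Record(stopwatch.Elapsed.TotalSeconds);
        }
    }
}
```

Reading `TimeSpan.TotalSeconds` directly avoids a separate conversion step; `ToSeconds` covers call sites that already compute milliseconds, so a 250 ms reading is recorded as 0.25.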
Other units
| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory; see above |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |
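The unit strings in the table are passed through the `unit` parameter when an instrument is created. A short sketch, assuming a hypothetical holder class `UnitStringSketch`: the first two instrument names appear in §6.2, while `mxgateway.example.requests` is invented purely to show the annotation form, and `Environment.WorkingSet` (the current process) stands in for the worker's memory reading.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical holder class showing one instrument per unit style.
public static class UnitStringSketch
{
    private static readonly Meter Meter = new("ZB.MOM.WW.MxGateway");

    // Dimensionless event count: "1".
    public static readonly Counter<long> AuthFailures =
        Meter.CreateCounter<long>(
            "mxgateway.auth.failures", unit: "1",
            description: "Authentication failures");

    // Size in bytes: "By". The callback is a stand-in for the real
    // worker-process reading.
    public static readonly ObservableGauge<long> WorkerMemory =
        Meter.CreateObservableGauge<long>(
            "mxgateway.worker.memory", () => Environment.WorkingSet, unit: "By",
            description: "Worker process RSS");

    // Annotation form for a dimensioned count. This instrument is
    // illustrative only and is not part of the MxGateway surface.
    public static readonly Counter<long> ExampleRequests =
        Meter.CreateCounter<long>(
            "mxgateway.example.requests", unit: "{request}",
            description: "Illustrative request counter");
}
```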
4. Resource attribute set (shared across all three signals)
The OTel Resource is built once by AddZbTelemetry (see SPEC.md §2) and
attached to metrics, traces, and OTel-exported logs. The same SiteId and NodeRole values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.
| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"`; do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |
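Why site.id and node.role matter: a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. host.name tells individual machines apart but
says nothing about which site a node belongs to or whether it is a site node or the
central node. site.id and node.role carry that information, so metrics can be grouped
and filtered by site and by role.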
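The attribute table can be read as a small builder: required keys are always set, recommended keys are added only when a value is known, and `host.name` is never supplied by the caller. The sketch below uses only the base class library and is not the `ZbResource.Build` contract; `ResourceAttributeSketch` and its parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that mirrors the attribute table; not the shared API.
public static class ResourceAttributeSketch
{
    public static IReadOnlyDictionary<string, object> Build(
        string serviceName,
        string? serviceVersion = null,
        string? siteId = null,
        string? nodeRole = null)
    {
        var attributes = new Dictionary<string, object>
        {
            ["service.name"] = serviceName,
            ["service.namespace"] = "ZB.MOM.WW",       // fixed, never overridden
            ["host.name"] = Environment.MachineName,   // automatic, never overridden
        };

        // Recommended attributes: absent is better than wrong, so skip blanks.
        if (!string.IsNullOrWhiteSpace(serviceVersion)) attributes["service.version"] = serviceVersion;
        if (!string.IsNullOrWhiteSpace(siteId)) attributes["site.id"] = siteId;
        if (!string.IsNullOrWhiteSpace(nodeRole)) attributes["node.role"] = nodeRole;

        return attributes;
    }
}
```

A call such as `ResourceAttributeSketch.Build("scadabridge", siteId: "site-07", nodeRole: "site")` yields five attributes and leaves `service.version` out. The site id value here is made up for the example. In an OpenTelemetry .NET setup, a dictionary of this shape could be handed to `ResourceBuilder.AddAttributes`.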
5. Standard instrumentation baseline
Every app enables this baseline via AddZbTelemetry. No opt-out. These are
community-standard instrumentation packages; the overhead is negligible and the benefit
(correlated HTTP / gRPC request traces across the fleet) is high.
| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — |
| HttpClient | Traces + Metrics | ✅ | ✅ | — |
| gRPC client | Traces | ✅ | — | — |
| .NET runtime | Metrics | ✅ | — | — |
| Process | Metrics | ✅ | — | — |
OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through
AddZbTelemetry. No project removes any of these.
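For orientation, the five baseline rows correspond to the following registrations in the OpenTelemetry .NET SDK. This is a sketch of the equivalent raw calls, not the `AddZbTelemetry` implementation: `BaselineInstrumentationSketch` and `AddBaselineInstrumentation` are hypothetical names, and the code assumes the `OpenTelemetry.Extensions.Hosting` package plus the `OpenTelemetry.Instrumentation.AspNetCore`, `.Http`, `.GrpcNetClient`, `.Runtime` and `.Process` packages are referenced.

```csharp
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

// Hypothetical extension; shows the baseline table as SDK calls.
public static class BaselineInstrumentationSketch
{
    public static IServiceCollection AddBaselineInstrumentation(this IServiceCollection services)
    {
        services.AddOpenTelemetry()
            .WithTracing(tracing => tracing
                .AddAspNetCoreInstrumentation()    // ASP.NET Core: traces
                .AddHttpClientInstrumentation()    // HttpClient: traces
                .AddGrpcClientInstrumentation())   // gRPC client: traces
            .WithMetrics(metrics => metrics
                .AddAspNetCoreInstrumentation()    // ASP.NET Core: metrics
                .AddHttpClientInstrumentation()    // HttpClient: metrics
                .AddRuntimeInstrumentation()       // .NET runtime: metrics
                .AddProcessInstrumentation());     // Process: metrics

        return services;
    }
}
```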
6. Per-app instrument surface (bespoke — stays per project)
These instruments are not part of the shared library. They document the existing bespoke
surface that each project registers through o.Meters / o.ActivitySources in AddZbTelemetry.
6.1 OtOpcUa — ZB.MOM.WW.OtOpcUa meter
Source: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing |
| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions |
| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations |
| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations |
| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions |
| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway |
ActivitySources (spans):
| Source name | Span(s) |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` |
All durations already use "s" — no convergence item for OtOpcUa.
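To show how a row of the table maps to code, here is a sketch that declares two of the instruments and one span with the in-box diagnostics APIs. It is not a copy of `OtOpcUaTelemetry.cs`: the class `OtOpcUaTelemetrySketch` and the `ApplyDeploy` helper are hypothetical, and `UpDownCounter<T>` assumes .NET 7 or later.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical sketch; names and units are taken from the tables above.
public static class OtOpcUaTelemetrySketch
{
    public const string Name = "ZB.MOM.WW.OtOpcUa";

    private static readonly Meter Meter = new(Name);
    public static readonly ActivitySource Source = new(Name);

    public static readonly Counter<long> DeployApplied =
        Meter.CreateCounter<long>(
            "otopcua.deploy.applied", unit: "1",
            description: "Galaxy deploy events applied to the OPC UA address space");

    public static readonly UpDownCounter<long> SessionActive =
        Meter.CreateUpDownCounter<long>(
            "otopcua.session.active", unit: "1",
            description: "Active OPC UA sessions");

    // Wraps a deploy in the DeployWatcher.Apply span and counts it on success.
    public static void ApplyDeploy(Action apply)
    {
        // StartActivity returns null when no listener is attached; the
        // using declaration handles that case.
        using Activity? activity = Source.StartActivity("DeployWatcher.Apply");
        apply();
        DeployApplied.Add(1);
    }
}
```

A session open/close pair would call `SessionActive.Add(1)` and `SessionActive.Add(-1)`, which is why the active-session instrument is an UpDownCounter and not a Counter.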
6.2 MxGateway — MxGateway.Server meter (→ target: ZB.MOM.WW.MxGateway)
Source: src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs
Counters (13):
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.created` | `"1"` | MxAccess sessions opened |
| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed |
| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors |
| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations |
| `mxgateway.command.errors` | `"1"` | Command invocation errors |
| `mxgateway.event.received` | `"1"` | MxAccess events received from worker |
| `mxgateway.event.errors` | `"1"` | Event processing errors |
| `mxgateway.worker.started` | `"1"` | x86 worker processes started |
| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped |
| `mxgateway.worker.errors` | `"1"` | Worker communication errors |
| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs |
| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors |
| `mxgateway.auth.failures` | `"1"` | Authentication failures |
Histograms (3):
| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
Gauges (4):
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions |
| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes |
| `mxgateway.worker.memory` | `"By"` | Worker process RSS |
| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache |
No ActivitySources today (no tracing). Adding ZB.MOM.WW.MxGateway as an ActivitySource
is left per-project (deferred to GAPS backlog).
6.3 ScadaBridge — ZB.MOM.WW.ScadaBridge meter
No meter or instruments exist today (OpenTelemetry.Api is a dangling ref). The target
meter name ZB.MOM.WW.ScadaBridge is reserved. Instruments are defined as part of the
ScadaBridge adoption tracked in ../GAPS.md.
Consequences and convergence items (accepted)
| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit ms → s (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge — define from scratch |
All three items are tracked as backlog entries in ../GAPS.md. The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.