215a646e35
C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).
C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).
C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.
m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.
I4: §5 standard instrumentation table corrected — OtOpcUa now shows ⛔ not added for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.
I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).
I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.
I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.
m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
235 lines
11 KiB
Markdown
# Observability — Metric conventions (standardized)

Status: **Standardized**. This document defines the naming and unit rules that every sister
project's instruments must follow. Analogous to
[`../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md)
for auth and [`../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md)
for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).

The per-project instrument tables below (§6) document the **existing bespoke surface**: the
instruments each app currently defines or intends to define. These stay per-project; they are
not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must
be named and measured.

---

## 1. Meter name

Each app owns exactly **one primary Meter**, named after its root namespace:

| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target: rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |

`MxGateway.Server` is the single convergence item for meter naming. It predates the
`ZB.MOM.WW.*` namespace convention; rename it when adopting `AddZbTelemetry`. Instruments
emitted under the old name will require a recording rule or relabeling in any Prometheus
config that already scrapes the snapshot, so coordinate before renaming in production.

If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: `ZB.MOM.WW.<App>.<Component>`.

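
As a minimal sketch of how the meter naming pattern could be checked (for example in a CI lint step), the Python function below tests a name against `ZB.MOM.WW.<App>` with an optional `.<Component>` suffix. The function name, the regular expression, and the assumption that `<App>` and `<Component>` are PascalCase identifiers are illustrative only and do not come from any sister project; `ZB.MOM.WW.OtOpcUa.Commons` is a hypothetical secondary meter name used purely as an example.

```python
import re

# Assumed shape, derived from the table above: ZB.MOM.WW.<App>[.<Component>]
# PascalCase segments are an assumption of this sketch, not a stated rule.
METER_NAME = re.compile(r"ZB\.MOM\.WW\.[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)?")


def is_conforming_meter_name(name: str) -> bool:
    """Return True when the whole name matches ZB.MOM.WW.<App>[.<Component>]."""
    return METER_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in (
        "ZB.MOM.WW.OtOpcUa",          # primary meter, conforming
        "ZB.MOM.WW.OtOpcUa.Commons",  # hypothetical secondary meter
        "MxGateway.Server",           # current MxGateway name, pre-convention
    ):
        print(candidate, is_conforming_meter_name(candidate))
```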

---

## 2. Instrument name

Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case,
dot-separated:

```
<app>       := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event>     := what happened or is measured — applied | count | duration | errors | active | ...
```

**Examples:**

| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.deploy.apply.duration` | OtOpcUa | End-to-end deploy apply duration |
| `mxgateway.sessions.open` | MxGateway | Currently open MxAccess sessions |
| `mxgateway.commands.duration` | MxGateway | End-to-end MXAccess command latency |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |

**Rules:**

1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
   subsystem warrants a sub-area (e.g. `mxgateway.workers.startup.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
   `duration`), not implementation details (`method_called`, `loop_iteration`).
4. Counters: past-tense or noun (`received`, `errors`, `applied`).
   UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`).
   Histograms: `duration` or a measured quantity noun (`size`, `lag`).

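
The rules above lend themselves to a mechanical check. The following Python sketch is one possible way to lint instrument names; the function name, the set of known app prefixes, and the per-segment regular expression are assumptions made for illustration. Underscores are allowed inside a segment because existing names such as `otopcua.redundancy.service_level_change` use them. Rules 3 and 4 are about meaning and are not covered by this check.

```python
import re

# App identifiers listed in the pattern definition above.
KNOWN_APPS = {"otopcua", "mxgateway", "scadabridge"}

# One segment: starts with a lower-case letter, then lower-case letters,
# digits, or underscores. No upper-case and no hyphens (rule 1).
SEGMENT = re.compile(r"[a-z][a-z0-9_]*")


def check_instrument_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    segments = name.split(".")
    if len(segments) < 3:
        problems.append("fewer than three segments (rule 2)")
    if segments[0] not in KNOWN_APPS:
        problems.append(f"unknown app prefix '{segments[0]}'")
    for segment in segments:
        if SEGMENT.fullmatch(segment) is None:
            problems.append(f"segment '{segment}' violates rule 1")
    return problems


if __name__ == "__main__":
    # The second name is a hypothetical bad example, not a real instrument.
    for candidate in ("otopcua.deploy.applied", "MxGateway.worker-calls"):
        print(candidate, check_instrument_name(candidate))
```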

---

## 3. Units
### Duration — seconds (mandatory)

**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic
convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.

> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
> `"ms"` (`mxgateway.workers.startup.duration`, `mxgateway.commands.duration`,
> `mxgateway.events.stream_send.duration`; see §6.2). These must be migrated
> to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site).
> Track existing Prometheus recording-rule and dashboard changes: any dashboard panel that
> reads these histograms in `ms` will need updating. Until migration is complete, annotate
> the instruments with `// CONVERGENCE: ms→s pending`.

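
To make the arithmetic concrete, the sketch below shows the divide-by-1 000 conversion in Python. It is an illustration of the unit change only, not the MxGateway call-site code. The helper names and the list of bucket boundaries are hypothetical; the boundaries stand in for any millisecond thresholds (histogram buckets, alert limits, dashboard axes) that would need the same conversion.

```python
def ms_to_seconds(value_ms: float) -> float:
    """Convert a millisecond measurement to seconds before it is recorded."""
    return value_ms / 1000.0


def convert_thresholds(thresholds_ms: list[float]) -> list[float]:
    """Apply the same conversion to millisecond thresholds or bucket bounds."""
    return [ms_to_seconds(t) for t in thresholds_ms]


if __name__ == "__main__":
    # Hypothetical values chosen for illustration only.
    elapsed_ms = 250.0
    legacy_bounds_ms = [5, 10, 25, 50, 100, 250, 500, 1000]

    print(ms_to_seconds(elapsed_ms))
    print(convert_thresholds(legacy_bounds_ms))
```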
### Other units

| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory; see above |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |

---

## 4. Resource attribute set (shared across all three signals)

The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and
attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.

| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"`; do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |

**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. `host.name` tells the machines apart but says
nothing about which site a node belongs to or what role it plays. Without `site.id` and
`node.role`, metrics from a site node and the central node cannot be grouped or compared
by site or role.

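
The attribute rules in the table can be sketched as a small builder. The Python function below is illustrative only and is not how `AddZbTelemetry` is implemented: the function name and parameters are assumptions, and `socket.gethostname()` stands in for `Environment.MachineName`. It shows the three behaviours the table implies: `service.namespace` is fixed, `host.name` is always derived from the machine and cannot be passed in, and recommended attributes are omitted rather than filled with placeholders.

```python
import socket
from typing import Optional

SERVICE_NAMESPACE = "ZB.MOM.WW"  # fixed value from the table; never overridden


def build_resource_attributes(
    service_name: str,
    service_version: Optional[str] = None,
    site_id: Optional[str] = None,
    node_role: Optional[str] = None,
) -> dict[str, str]:
    """Build the shared resource attribute set, omitting unset optional values."""
    attributes = {
        "service.name": service_name,
        "service.namespace": SERVICE_NAMESPACE,
        # Stand-in for Environment.MachineName; deliberately not a parameter.
        "host.name": socket.gethostname(),
    }
    optional = {
        "service.version": service_version,
        "site.id": site_id,
        "node.role": node_role,
    }
    # "Absent is better than wrong": drop anything the caller did not supply.
    attributes.update({k: v for k, v in optional.items() if v is not None})
    return attributes


if __name__ == "__main__":
    # "site-07" is a hypothetical site identifier.
    print(build_resource_attributes("scadabridge", site_id="site-07", node_role="site"))
```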

---

## 5. Standard instrumentation baseline

Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are
community-standard instrumentation packages; the overhead is negligible and the benefit
(correlated HTTP / gRPC request traces across the fleet) is high.

| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| HttpClient | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client | Traces | ⛔ not added | ⛔ not added | n/a |
| .NET runtime | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Process | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |

All three projects lack standard instrumentation today; it is added automatically when each
project calls `AddZbTelemetry` (Gap S1 in `GAPS.md`). No project removes any of these once
wired.

---

## 6. Per-app instrument surface (bespoke — stays per project)

These instruments are **not part of the shared library**. The tables below document the
existing bespoke surface that each project registers through `o.Meters` /
`o.ActivitySources` in `AddZbTelemetry`.

### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter

Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
(Code-verified 2026-06-01; see `current-state/otopcua/CURRENT-STATE.md`.)

**Counters (6):**

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | — | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.driver.lifecycle` | Counter | — | Driver lifecycle events (start / stop / restart) |
| `otopcua.virtualtag.eval` | Counter | — | Virtual tag evaluations |
| `otopcua.scriptedalarm.transition` | Counter | — | Scripted alarm state transitions |
| `otopcua.opcua.sink.write` | Counter | — | OPC UA sink write operations |
| `otopcua.redundancy.service_level_change` | Counter | — | Redundancy service-level changes |

**Histograms (1):**

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.apply.duration` | Histogram | `s` | End-to-end deploy apply duration |

**ActivitySources (spans):**

| Source name | Spans |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild` |

All durations use `"s"`, so there is no unit convergence item for OtOpcUa.

### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)

Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
(Code-verified 2026-06-01; see `current-state/mxaccessgw/CURRENT-STATE.md`.)

**Counters (13):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.sessions.opened` | `"1"` | New session requests |
| `mxgateway.sessions.closed` | `"1"` | Sessions torn down |
| `mxgateway.commands.started` | `"1"` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | `"1"` | Command completed OK |
| `mxgateway.commands.failed` | `"1"` | Command error |
| `mxgateway.events.received` | `"1"` | MXAccess events received from worker |
| `mxgateway.queues.overflows` | `"1"` | Queue overflow (backpressure) |
| `mxgateway.faults` | `"1"` | Unhandled gateway faults |
| `mxgateway.workers.killed` | `"1"` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | `"1"` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | `"1"` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | `"1"` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | `"1"` | Retry attempts (any subsystem) |

**Histograms (3), current unit `ms` (convergence target `s`):**

| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.workers.startup.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.commands.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.events.stream_send.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |

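
The histogram units documented in §6.1 and §6.2 can be checked against the seconds rule from §3 with a few lines of code. The Python sketch below transcribes the four duration histograms and their current units from the tables in this document and reports those not yet in `"s"`. The dictionary layout and function name are assumptions for illustration; a real check would read the units from the code rather than from a hand-copied table.

```python
# Instrument name -> unit currently declared, transcribed from §6.1 and §6.2.
HISTOGRAM_UNITS = {
    "otopcua.deploy.apply.duration": "s",
    "mxgateway.workers.startup.duration": "ms",
    "mxgateway.commands.duration": "ms",
    "mxgateway.events.stream_send.duration": "ms",
}


def non_conforming_durations(units: dict[str, str]) -> list[str]:
    """Return duration instruments whose unit is not seconds, sorted by name."""
    return sorted(
        name
        for name, unit in units.items()
        if name.endswith(".duration") and unit != "s"
    )


if __name__ == "__main__":
    for name in non_conforming_durations(HISTOGRAM_UNITS):
        print(f"CONVERGENCE: ms->s pending for {name}")
```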

**Observable gauges (4):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.sessions.open` | `"1"` | Currently open sessions (live count) |
| `mxgateway.workers.running` | `"1"` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | `"1"` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | `"1"` | Per-stream gRPC send queue depth |

MxGateway defines no ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an
ActivitySource is left per-project (deferred to GAPS backlog).

### 6.3 ScadaBridge — `ZB.MOM.WW.ScadaBridge` meter

No meter or instruments exist today (`OpenTelemetry.Api` is a dangling reference). The target
meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments will be defined as part of the
ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).

---

## Consequences and convergence items (accepted)

| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge; define from scratch |

All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.