7d243890ed
Author the three normalization docs for the observability component:
- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project), AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline, exporter conventions, Serilog two-stage bootstrap with identity enrichers and TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app namespace; MxGateway.Server flagged as convergence target), instrument naming pattern (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms flagged), Resource attribute set table, standard instrumentation baseline, and per-app instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder + IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog, ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher. Consumer matrix and open contract questions included.
225 lines
10 KiB
Markdown
# Observability — Metric conventions (standardized)

Status: **Standardized**. The naming and unit rules every sister project's instruments must
follow. Analogous to [`../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md)
for auth and [`../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md)
for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).

The per-project instrument tables below (§6) document the **existing bespoke surface** — the
instruments each app currently defines or intends to define. These stay per-project; they are
not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must
be named and measured.

---

## 1. Meter name

Each app owns exactly **one primary Meter**, named after its root namespace:

| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target — rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |

`MxGateway.Server` is the single convergence item for meter naming. It predates the
`ZB.MOM.WW.*` namespace convention; rename it when adopting `AddZbTelemetry`. Instruments
emitted under the old name will require a recording rule or relabeling in any Prometheus
config that already scrapes the snapshot — coordinate before renaming in production.

If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: `ZB.MOM.WW.<App>.<Component>`.
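
The meter-naming rule can be sketched in C# with `System.Diagnostics.Metrics` from the .NET base class library. This is a minimal illustration, not code from any sister project: `ExampleApp`, the `Cache` component, and the `ExampleAppTelemetry` class are hypothetical names.

```csharp
using System.Diagnostics.Metrics;

// Hypothetical app "ExampleApp"; a real app substitutes its own root namespace.
public static class ExampleAppTelemetry
{
    // Primary meter: named after the app's root namespace (ZB.MOM.WW.<App>).
    public const string MeterName = "ZB.MOM.WW.ExampleApp";

    // Secondary meter for a library component: ZB.MOM.WW.<App>.<Component>.
    public const string CacheMeterName = MeterName + ".Cache";

    public static readonly Meter Primary = new(MeterName);
    public static readonly Meter Cache = new(CacheMeterName);
}
```

Keeping the names as constants means the host can register the same strings with the telemetry pipeline without repeating literals.
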

---

## 2. Instrument name

Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case,
dot-separated:

```
<app>       := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event>     := what happened or is measured — applied | count | duration | errors | active | ...
```

**Examples:**

| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions |
| `mxgateway.session.active` | MxGateway | Active MxAccess sessions |
| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |

**Rules:**

1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
   subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
   `duration`), not implementation details (`method_called`, `loop_iteration`).
4. The form of the event noun depends on the instrument kind:
   - Counters: past-tense or noun (`received`, `errors`, `applied`).
   - UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`).
   - Histograms: `duration` or a measured quantity noun (`size`, `lag`).
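
The rules above lend themselves to a mechanical check, for example in a unit test over each app's instrument list. The sketch below is one possible reading of them, not an official validator. `InstrumentNameRules` is a hypothetical helper, and two points are assumptions that the rules do not state: segments are limited to `[a-z][a-z0-9]*` (so underscores are rejected), and names with five or more segments are rejected.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: checks a name against <app>.<subsystem>.<event>.
public static class InstrumentNameRules
{
    // <app> must be one of the known short ids, followed by two or three
    // lower-case segments (three or four segments in total).
    // Assumption: no underscores, and no more than four segments.
    private static readonly Regex Pattern = new(
        @"\A(otopcua|mxgateway|scadabridge)(\.[a-z][a-z0-9]*){2,3}\z",
        RegexOptions.CultureInvariant);

    public static bool IsValid(string name) =>
        name is not null && Pattern.IsMatch(name);
}
```

Rules 3 and 4 (choice of event noun) are semantic and still need human review; the pattern only covers casing, separators, and segment count.
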

---

## 3. Units

### Duration — seconds (mandatory)

**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic
convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.

> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated
> to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site).
> Track the resulting Prometheus recording-rule and dashboard changes — any dashboard panel
> that reads these histograms in `ms` will need updating. Until migration is complete,
> annotate the instruments with `// CONVERGENCE: ms→s pending`.
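
A minimal sketch of recording a duration in seconds, assuming .NET 7 or later (for `Stopwatch.GetElapsedTime`). The meter name, the instrument name `exampleapp.worker.call.duration`, and the `DurationRecording` class are hypothetical; this is not the code in `GatewayMetrics.cs`.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical example; names are illustrative only.
public static class DurationRecording
{
    private static readonly Meter AppMeter = new("ZB.MOM.WW.ExampleApp");

    // Unit is "s": the histogram receives seconds as a double.
    private static readonly Histogram<double> CallDuration =
        AppMeter.CreateHistogram<double>(
            "exampleapp.worker.call.duration",
            unit: "s",
            description: "Duration of a worker call");

    // Conversion for call sites that still hold a millisecond value.
    public static double MillisecondsToSeconds(double milliseconds) =>
        milliseconds / 1000.0;

    // Times the call and records the elapsed time in seconds,
    // whether the call returns or throws.
    public static T Measure<T>(Func<T> call)
    {
        long start = Stopwatch.GetTimestamp();
        try
        {
            return call();
        }
        finally
        {
            CallDuration.Record(Stopwatch.GetElapsedTime(start).TotalSeconds);
        }
    }
}
```

Measuring with `TotalSeconds` at the source avoids a separate divide-by-1 000 step; `MillisecondsToSeconds` is only needed where an existing call site already holds a millisecond value.
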
### Other units

| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory — see above |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |

---

## 4. Resource attribute set (shared across all three signals)

The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and
attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.

| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"` — do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |

**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. Without `site.id` and `node.role`, metrics from a
site node and the central node cannot be told apart by site or role; `host.name` only shows
that they came from different machines.
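
To make the Required / Recommended / Auto column concrete, the sketch below builds the attribute set as plain key-value pairs. It is an assumption-laden illustration and not the `ZbResource.Build` contract: `ResourceAttributeSketch`, its parameter names, and the sample values in the usage note are hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper; not the shared-library API.
public static class ResourceAttributeSketch
{
    public static IReadOnlyDictionary<string, object> Build(
        string serviceName,
        string? serviceVersion = null,
        string? siteId = null,
        string? nodeRole = null)
    {
        var attributes = new Dictionary<string, object>
        {
            ["service.name"] = serviceName,           // Required
            ["service.namespace"] = "ZB.MOM.WW",      // Required, fixed value
            ["host.name"] = Environment.MachineName,  // Auto, never overridden
        };

        // Recommended attributes are omitted when unknown:
        // absent is better than wrong.
        if (!string.IsNullOrWhiteSpace(serviceVersion)) attributes["service.version"] = serviceVersion;
        if (!string.IsNullOrWhiteSpace(siteId)) attributes["site.id"] = siteId;
        if (!string.IsNullOrWhiteSpace(nodeRole)) attributes["node.role"] = nodeRole;

        return attributes;
    }
}
```

For example, `ResourceAttributeSketch.Build("scadabridge", "1.2.3", "site-07", "site")` yields six attributes, while `Build("otopcua")` yields only the three that are always present. In a real host, pairs like these would typically be handed to the OpenTelemetry SDK's `ResourceBuilder.AddAttributes`.
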

---

## 5. Standard instrumentation baseline

Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are
community-standard instrumentation packages; the overhead is negligible and the benefit
(correlated HTTP / gRPC request traces across the fleet) is high.

| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — |
| HttpClient | Traces + Metrics | ✅ | ✅ | — |
| gRPC client | Traces | ✅ | — | — |
| .NET runtime | Metrics | ✅ | — | — |
| Process | Metrics | ✅ | — | — |

OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through
`AddZbTelemetry`. No project removes any of these.

---

## 6. Per-app instrument surface (bespoke — stays per project)

These instruments are **not part of the shared library**. They document the existing bespoke
surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`.

### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter

Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing |
| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions |
| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations |
| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations |
| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions |
| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway |

**ActivitySources (spans):**

| Source name | Span(s) |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` |

All durations already use `"s"` — no convergence item for OtOpcUa.

### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)

Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`

**Counters (13):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.created` | `"1"` | MxAccess sessions opened |
| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed |
| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors |
| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations |
| `mxgateway.command.errors` | `"1"` | Command invocation errors |
| `mxgateway.event.received` | `"1"` | MxAccess events received from worker |
| `mxgateway.event.errors` | `"1"` | Event processing errors |
| `mxgateway.worker.started` | `"1"` | x86 worker processes started |
| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped |
| `mxgateway.worker.errors` | `"1"` | Worker communication errors |
| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs |
| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors |
| `mxgateway.auth.failures` | `"1"` | Authentication failures |

**Histograms (3):**

| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |

**Gauges (4):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions |
| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes |
| `mxgateway.worker.memory` | `"By"` | Worker process RSS |
| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache |

No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource
is left per-project (deferred to GAPS backlog).

### 6.3 ScadaBridge — `ZB.MOM.WW.ScadaBridge` meter

No meter or instruments exist today (`OpenTelemetry.Api` is a dangling reference). The target
meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the
ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).

---

## Consequences and convergence items (accepted)

| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking — requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking — values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge — define from scratch |

All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.