docs(observability): fix metric-convention instrument names + NodeHostname-auto + resolve settled questions

C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).

C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).

C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.

m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.

I4: §5 standard instrumentation table corrected: OtOpcUa now shows "⛔ not added" for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.

I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).

I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.

I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.

m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
Joseph Doherty
2026-06-01 07:32:58 -04:00
parent 645388b1f1
commit 215a646e35
6 changed files with 94 additions and 58 deletions
+10 -4
@@ -168,10 +168,16 @@ app is opt-in and tracked here, not forced.
## Decisions still open
- Whether `AddZbTelemetry` enables OTLP by default (simplest for new setups) or Prometheus by
default (matches OtOpcUa's current posture). Design doc says Prometheus default; OTLP opt-in.
- Whether the `ms`→`s` conversion and Meter rename are bundled with the initial MxGateway
adoption (Task #9) or deferred. Deferring avoids dashboard breaks during the migration window.
- Canonical `SiteId` and `NodeRole` config binding path — ScadaBridge reads from its own config
hierarchy; `AddZbSerilog` must accept the value directly (caller-supplied) rather than reading
from a fixed config section, to remain project-agnostic.
## Decisions settled (no longer open)
- **Prometheus vs OTLP default (SETTLED):** `AddZbTelemetry` defaults to Prometheus (matching
OtOpcUa's existing `/metrics` posture). OTLP is opt-in via `ZbTelemetryOptions.Exporter =
ZbExporter.Otlp`. See `spec/SPEC.md` §4 and shared-contract `ZbTelemetryOptions.Exporter`.
- **`ms`→`s` conversion and Meter rename bundling (SETTLED — deferred):** Both the histogram
unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway
adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and
are tracked as separate backlog items #6 and #7 in the adoption backlog above.
@@ -166,7 +166,7 @@ the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library buil
**Wire metrics export via `AddZbTelemetry`:**
- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; })`.
- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; /* temporary — update to "ZB.MOM.WW.MxGateway" when the Meter-rename gap (Gap N1) is closed */ })`.
This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus
exporter. The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time.
`GatewayMetrics.cs` itself is unchanged — only the SDK layer is added around it.
@@ -127,9 +127,10 @@ renaming needed on adoption.
**Adopt `AddZbSerilog`:**
- Replace the `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` call in
`Program.cs:2754` with `builder.AddZbSerilog(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.NodeHostname = cfg.NodeHostname; })`.
`Program.cs:2754` with `builder.AddZbSerilog(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; })`.
The three enrichers (`SiteId`, `NodeRole`, `NodeHostname`) are now provided by the shared
`AddZbSerilog` path; `LoggerConfigurationFactory` can be deleted.
`AddZbSerilog` path (`SiteId`/`NodeRole` from options; `NodeHostname` auto from
`Environment.MachineName`); `LoggerConfigurationFactory` can be deleted.
- `ReadFrom.Configuration` for sinks and `MinimumLevel.Is` override from config are preserved
inside `AddZbSerilog` — behavior is unchanged.
- The `TraceContextEnricher` is wired automatically by `AddZbSerilog`; once application instruments
@@ -147,10 +147,16 @@ public static class ZbSerilogExtensions
/// Two-stage Serilog bootstrap:
/// Stage 1 — minimal console-only bootstrap logger (for startup errors before IConfiguration).
/// Stage 2 — application logger wired from IConfiguration (ReadFrom.Configuration reads
/// Serilog:WriteTo sinks + Serilog:MinimumLevel overrides) with fixed enrichers:
/// SiteId, NodeRole, NodeHostname (from ZbTelemetryOptions), TraceContextEnricher,
/// Serilog:WriteTo sinks + Serilog:MinimumLevel from "Serilog:MinimumLevel") with
/// fixed enrichers: SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from
/// Environment.MachineName (auto — not a caller-supplied option); TraceContextEnricher;
/// and RedactionEnricher (applied only when ILogRedactor is registered).
///
/// MinimumLevel: AddZbSerilog reads "Serilog:MinimumLevel" from IConfiguration. Callers that
/// bind MinimumLevel from a different config key (e.g. ScadaBridge's
/// "ScadaBridge:Logging:MinimumLevel") apply that override themselves before or after
/// calling AddZbSerilog — this remains per-project and AddZbSerilog does not read it.
///
/// OTel log export is wired automatically: logs flow through the OTel pipeline with the same
/// Resource as the metrics and traces (all three signals correlated in a backend).
///
@@ -48,16 +48,16 @@ dot-separated:
| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions |
| `mxgateway.session.active` | MxGateway | Active MxAccess sessions |
| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker |
| `otopcua.deploy.apply.duration` | OtOpcUa | End-to-end deploy apply duration |
| `mxgateway.sessions.open` | MxGateway | Currently open MxAccess sessions |
| `mxgateway.commands.duration` | MxGateway | End-to-end MXAccess command latency |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |
**Rules:**
1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`).
subsystem warrants a sub-area (e.g. `mxgateway.commands.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
`duration`), not implementation details (`method_called`, `loop_iteration`).
4. Counters: past-tense or noun (`received`, `errors`, `applied`).
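Rules 1 and 2 above can be checked mechanically. The following is a minimal sketch only: `MetricNameRules` is a hypothetical helper, not something in the repository, and allowing underscores and digits inside a segment is an assumption based on names such as `service_level_change` that appear in the instrument tables.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of the repository): checks an instrument name
// against rule 1 (lower-case, no hyphens) and rule 2 (three segments minimum).
public static class MetricNameRules
{
    // A segment is non-empty and contains only a-z, 0-9 or '_'.
    // Allowing '_' and digits is an assumption, see the lead-in.
    private static bool IsValidSegment(string segment) =>
        segment.Length > 0 &&
        segment.All(c => (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_');

    // True when the name has at least three dot-separated valid segments.
    public static bool IsValid(string name)
    {
        if (string.IsNullOrEmpty(name)) return false;
        string[] segments = name.Split('.');
        return segments.Length >= 3 && segments.All(IsValidSegment);
    }
}
```

Under this reading, `MetricNameRules.IsValid("mxgateway.commands.duration")` is true, while a hyphenated or PascalCase name is rejected. A two-segment name is rejected as well, so a real check would have to decide how to treat such names.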
@@ -122,14 +122,15 @@ HTTP / gRPC request traces across the fleet) is high.
| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — |
| HttpClient | Traces + Metrics | ✅ | ✅ | — |
| gRPC client | Traces | ✅ | — | — |
| .NET runtime | Metrics | ✅ | — | — |
| Process | Metrics | ✅ | — | — |
| ASP.NET Core | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| HttpClient | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client | Traces | ⛔ not added | ⛔ not added | n/a |
| .NET runtime | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Process | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through
`AddZbTelemetry`. No project removes any of these.
All three projects lack standard instrumentation today — it is added automatically when each
project calls `AddZbTelemetry` (Gap S1 in `GAPS.md`). No project removes any of these once
wired.
---
@@ -141,63 +142,72 @@ surface that each project registers through `o.Meters` / `o.ActivitySources` in
### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter
Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
(Code-verified 2026-06-01 — see `current-state/otopcua/CURRENT-STATE.md`.)
**Counters (7):**
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing |
| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions |
| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations |
| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations |
| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions |
| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway |
| `otopcua.deploy.applied` | Counter | — | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.driver.lifecycle` | Counter | — | Driver lifecycle events (start / stop / restart) |
| `otopcua.virtualtag.eval` | Counter | — | Virtual tag evaluations |
| `otopcua.scriptedalarm.transition` | Counter | — | Scripted alarm state transitions |
| `otopcua.opcua.sink.write` | Counter | — | OPC UA sink write operations |
| `otopcua.redundancy.service_level_change` | Counter | — | Redundancy service-level changes |
**Histograms (1):**
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.apply.duration` | Histogram | `s` | End-to-end deploy apply duration |
**ActivitySources (spans):**
| Source name | Span(s) |
| Source name | Spans |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` |
| `ZB.MOM.WW.OtOpcUa` | `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild` |
All durations already use `"s"` — no convergence item for OtOpcUa.
All durations use `"s"` — no unit convergence item for OtOpcUa.
### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)
Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
(Code-verified 2026-06-01 — see `current-state/mxaccessgw/CURRENT-STATE.md`.)
**Counters (13):**
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.created` | `"1"` | MxAccess sessions opened |
| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed |
| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors |
| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations |
| `mxgateway.command.errors` | `"1"` | Command invocation errors |
| `mxgateway.event.received` | `"1"` | MxAccess events received from worker |
| `mxgateway.event.errors` | `"1"` | Event processing errors |
| `mxgateway.worker.started` | `"1"` | x86 worker processes started |
| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped |
| `mxgateway.worker.errors` | `"1"` | Worker communication errors |
| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs |
| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors |
| `mxgateway.auth.failures` | `"1"` | Authentication failures |
| `mxgateway.sessions.opened` | `"1"` | New session requests |
| `mxgateway.sessions.closed` | `"1"` | Sessions torn down |
| `mxgateway.commands.started` | `"1"` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | `"1"` | Command completed OK |
| `mxgateway.commands.failed` | `"1"` | Command error |
| `mxgateway.events.received` | `"1"` | MXAccess events received from worker |
| `mxgateway.queues.overflows` | `"1"` | Queue overflow (backpressure) |
| `mxgateway.faults` | `"1"` | Unhandled gateway faults |
| `mxgateway.workers.killed` | `"1"` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | `"1"` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | `"1"` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | `"1"` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | `"1"` | Retry attempts (any subsystem) |
**Histograms (3):**
**Histograms (3) — current unit `ms` (convergence target `s`):**
| Instrument | Unit | Current unit | Convergence |
| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.workers.startup.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.commands.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.events.stream_send.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
**Gauges (4):**
**Observable gauges (4):**
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions |
| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes |
| `mxgateway.worker.memory` | `"By"` | Worker process RSS |
| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache |
| `mxgateway.sessions.open` | `"1"` | Currently open sessions (live count) |
| `mxgateway.workers.running` | `"1"` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | `"1"` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | `"1"` | Per-stream gRPC send queue depth |
No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource
is left per-project (deferred to GAPS backlog).
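To illustrate the `ms`→`s` convergence target for the histograms, here is a rough sketch of what recording a duration in seconds could look like with `System.Diagnostics.Metrics`. It is not the contents of `GatewayMetrics.cs`: the `DurationSketch` class and its methods are hypothetical, and only the meter name and the instrument name are taken from the tables above.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical sketch, not the real GatewayMetrics.cs.
public static class DurationSketch
{
    // Meter name as listed in the section heading (current name, before the rename).
    private static readonly Meter GatewayMeter = new("MxGateway.Server");

    // Histogram declared with the target unit "s" instead of the current "ms".
    private static readonly Histogram<double> CommandDuration =
        GatewayMeter.CreateHistogram<double>(
            "mxgateway.commands.duration",
            unit: "s",
            description: "End-to-end MXAccess command latency");

    // Converts an existing millisecond measurement to seconds.
    public static double MillisecondsToSeconds(double milliseconds) => milliseconds / 1000.0;

    // Times the given action, records the elapsed time in seconds, and returns it.
    public static double RecordCommand(Action command)
    {
        var stopwatch = Stopwatch.StartNew();
        command();
        stopwatch.Stop();
        double seconds = stopwatch.Elapsed.TotalSeconds;
        CommandDuration.Record(seconds);
        return seconds;
    }
}
```

Changing the declared unit also changes the scale of every recorded value, so dashboards and alert thresholds built on the millisecond values would need to be updated in the same step.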
+18 -5
@@ -13,8 +13,9 @@ logs) via a single `AddZbTelemetry` extension; the shared `Resource` attribute s
`host.name`) that makes every node distinguishable in a collector; standard instrumentation
everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter
conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap
with identity enrichers (`SiteId`, `NodeRole`, `NodeHostname`) bound from the same options
object as the OTel Resource (metrics and logs therefore carry identical dimensions); a
with identity enrichers (`SiteId`, `NodeRole` from `ZbTelemetryOptions`; `NodeHostname` auto
from `Environment.MachineName`) matching the OTel Resource dimensions (metrics and logs
therefore carry identical dimensions); a
`TraceContextEnricher` that stamps `trace_id`/`span_id` from `Activity.Current` onto every
Serilog event, enabling log↔trace correlation; an `ILogRedactor` redaction seam.
@@ -53,6 +54,11 @@ This is the headline fix: nobody in the fleet sets a `Resource` or `service.name
making every node indistinguishable in a collector. Every project must call `AddZbTelemetry`
to be observable.
> **`IServiceCollection` overload:** `AddZbTelemetry` also has an `IServiceCollection`-based
> overload for host configurations where `IHostApplicationBuilder` is not available (detailed in
> the shared-contract). The `IHostApplicationBuilder` overload is the primary path for all three
> apps on .NET 10.
## 2. Shared Resource
The OTel `Resource` attached to all three signals is built from `ZbTelemetryOptions`:
@@ -119,15 +125,22 @@ project's bespoke logging bootstrap with a shared two-stage pattern:
| `TraceContextEnricher` | `trace_id`, `span_id` | `Activity.Current` |
| `RedactionEnricher` | _(project-defined fields)_ | `ILogRedactor` implementation |
The three identity properties (`SiteId`, `NodeRole`, `NodeHostname`) are bound from the
same `ZbTelemetryOptions` object as the OTel `Resource`, so logs and metrics/traces carry
identical dimensions. When no `Activity.Current` is present (e.g. background services,
`SiteId` and `NodeRole` are bound from the same `ZbTelemetryOptions` object as the OTel
`Resource`; `NodeHostname` is populated automatically from `Environment.MachineName` (not a
caller-supplied option). All three identity properties appear on logs and metrics/traces alike,
so signals from the same node carry identical dimensions. When no `Activity.Current` is present (e.g. background services,
startup), `TraceContextEnricher` emits nothing — it does not inject empty or zero values.
`MinimumLevel` is set explicitly in code (default `Information`) and can be overridden via
`IConfiguration` (`Serilog:MinimumLevel`). Sinks are fully config-driven:
`ReadFrom.Configuration` reads `Serilog:WriteTo` from `appsettings.json` / environment.
> **Per-project config paths:** `AddZbSerilog` reads `Serilog:MinimumLevel` from `IConfiguration`.
> Callers that bind MinimumLevel from a different key (e.g. ScadaBridge's
> `ScadaBridge:Logging:MinimumLevel`) apply that override themselves before or after calling
> `AddZbSerilog`. The config key for MinimumLevel remains per-project; `AddZbSerilog` is not
> parameterized on it.
OTel log export is wired in the same call: logs flow through the OTel pipeline with the
same `Resource` attached, making all three signals (metrics / traces / logs) available in a
single backend.
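The enricher behaviour described above (hostname taken from the machine, trace fields omitted when there is no current activity) can be sketched with base class library types alone. This is an assumption-laden illustration, not the shared-contract code: `LogContextSketch` and its methods are hypothetical, and the real enrichers would plug into Serilog rather than return a dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical sketch of the values the enrichers would attach to a log event.
public static class LogContextSketch
{
    // NodeHostname is read from the machine, not supplied by the caller.
    public static string NodeHostname() => Environment.MachineName;

    // Returns trace_id and span_id when an Activity is current,
    // and an empty dictionary otherwise (no empty or zero placeholders).
    public static IReadOnlyDictionary<string, string> TraceProperties()
    {
        var properties = new Dictionary<string, string>();
        var current = Activity.Current;
        if (current is null)
        {
            return properties;
        }

        properties["trace_id"] = current.TraceId.ToString();
        properties["span_id"] = current.SpanId.ToString();
        return properties;
    }
}
```

Called from a background service with no started `Activity`, `TraceProperties()` returns an empty dictionary. Inside a started activity it returns both identifiers as hex strings.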