docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract

Author the three normalization docs for the observability component:
- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project),
  AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline,
  exporter conventions, Serilog two-stage bootstrap with identity enrichers and
  TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and
  acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app
  namespace; MxGateway.Server flagged as convergence target), instrument naming pattern
  (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms
  flagged), Resource attribute set table, standard instrumentation baseline, and per-app
  instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms
  / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two
  packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder +
  IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog,
  ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher.
  Consumer matrix and open contract questions included.
Joseph Doherty
2026-06-01 07:19:38 -04:00
parent 76295695ee
commit 7d243890ed
6 changed files with 1149 additions and 0 deletions
@@ -0,0 +1,191 @@
# Observability — current state: MxAccessGateway
Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (**x86**);
solution `src/MxGateway.sln`. Telemetry code is concentrated in
`src/ZB.MOM.WW.MxGateway.Server/Metrics/` (instruments) and
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/` (logging correlation + redaction).
All paths relative to repo root. Verified 2026-06-01.
The most unusual observability posture in the family: **13 counters, 3 histograms, and 4 observable
gauges** all fully hand-rolled using `System.Diagnostics.Metrics` directly — but **never exported**
(no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory
`GetSnapshot()`. Logging is `Microsoft.Extensions.Logging` exclusively (no Serilog), with a bespoke
correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of
scope — its `IWorkerLogger` (stderr key=value) is not addressed here.
## 1. Metrics (hand-rolled, unexported)
### `GatewayMetrics.cs`
`src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`:
Meter name: `"MxGateway.Server"` (does not follow the project namespace `ZB.MOM.WW.MxGateway`).
All instruments are instance members of `GatewayMetrics`. The class is registered as a **singleton**
at `GatewayApplication.cs:62`. There is **no `OpenTelemetry.Extensions.Hosting`**,
**no `AddOpenTelemetry()` call**, and **no exporter** — the `Meter` is created with
`new Meter("MxGateway.Server")` and `GetSnapshot()` is the only read path.
**Counters (13):**
| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.opened` | New session requests |
| `mxgateway.sessions.closed` | Sessions torn down |
| `mxgateway.commands.started` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | Command completed OK |
| `mxgateway.commands.failed` | Command error |
| `mxgateway.events.received` | MXAccess events from worker |
| `mxgateway.queues.overflows` | Queue overflow (backpressure) |
| `mxgateway.faults` | Unhandled gateway faults |
| `mxgateway.workers.killed` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | Retry attempts (any subsystem) |
**Histograms (3) — unit `ms` (diverges from OTel semconv `s`):**
| Instrument name | Tracks |
|---|---|
| `mxgateway.workers.startup.duration` | Time from worker spawn to ready |
| `mxgateway.commands.duration` | End-to-end MXAccess command latency |
| `mxgateway.events.stream_send.duration` | gRPC event stream send latency |
**Observable gauges (4):**
| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.open` | Currently open sessions (live count) |
| `mxgateway.workers.running` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | Per-stream gRPC send queue depth |
All 20 instruments share the `mxgateway.*` prefix and `<category>.<event>` naming — consistent
with the family convention. Duration histograms record in **milliseconds** (`ms`); OTel semantic
conventions require seconds (`s`). This is the only project with `ms` histograms.
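The "hand-rolled, unexported" posture described above can be sketched with nothing but the BCL `System.Diagnostics.Metrics` API. This is an illustrative sketch, not the actual `GatewayMetrics` code: only the meter and instrument names come from the tables above, while the variable names and the in-process `MeterListener` that stands in for `GetSnapshot()` are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Meter and instrument names are taken from the tables above; everything else is illustrative.
using var meter = new Meter("MxGateway.Server");
Counter<long> sessionsOpened = meter.CreateCounter<long>("mxgateway.sessions.opened");
Histogram<double> commandDuration =
    meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "ms");

// Without an OTel SDK, a MeterListener in the same process is the only way to observe values.
var totals = new Dictionary<string, double>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "MxGateway.Server")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
    totals[instrument.Name] = totals.GetValueOrDefault(instrument.Name) + value);
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    totals[instrument.Name] = totals.GetValueOrDefault(instrument.Name) + value);
listener.Start();

sessionsOpened.Add(1);
sessionsOpened.Add(1);
commandDuration.Record(12.5); // milliseconds, matching the current unit

foreach (var (name, total) in totals)
    Console.WriteLine($"{name} = {total}");
```

Nothing in this sketch leaves the process: the totals exist only in the local dictionary, which is the situation the export wiring in the adoption plan is meant to change.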
### Singleton wiring
`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:
- `:62`: `services.AddSingleton<GatewayMetrics>()` registers the metrics singleton.
There is no `AddOpenTelemetry()` call anywhere in the gateway. The `GatewayMetrics` `Meter` is
created independently of any OTel SDK — it participates in `MeterListener` / `GetSnapshot()` only.
Without the OTel SDK, this data is **invisible to Prometheus, OTLP, or any backend**.
### No tracing
No `ActivitySource` is defined. No spans are created. Tracing is entirely absent.
## 2. Logging (Microsoft.Extensions.Logging)
All logging in the gateway server uses `Microsoft.Extensions.Logging` (MEL) exclusively. There is
no Serilog dependency. Sink configuration lives in `appsettings.json` (Console, with structured
logging via the default host builder).
### Correlation scope
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs`:
Defines the per-request/per-session correlation property bag.
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs`:
- `:22-41`: `UseGatewayRequestLogging()` middleware reads the following HTTP headers from each
incoming request: `x-session-id`, `x-worker-process-id`, `x-correlation-id`, `x-command-method`,
`authorization` (for redaction, not logging).
- Registered at `GatewayApplication.cs:34`.
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs`:
- `:11-18`: `BeginGatewayScope(ILogger, GatewayLogScope)` calls `logger.BeginScope(scope)`,
MEL's `ILogger.BeginScope` mechanism, which pushes properties as a scoped dictionary.
The correlation tuple (`SessionId` / `WorkerProcessId` / `CorrelationId` / `CommandMethod`) is
injected into log lines produced within the scope. No `trace_id` / `span_id` enrichment — there
is no ActivitySource, so this is consistent but leaves no path to trace correlation.
### Log redaction — `GatewayLogRedactor.cs`
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs`:
- Masks sensitive data in log lines for three categories:
  - **`AuthenticateUser`** commands: the password argument is replaced.
  - **`WriteSecured`** commands: the value argument is replaced.
  - **`mxgw_` bearer tokens**: the token body is masked, keeping only the key-id prefix.
- Redaction is applied before the log event is emitted — no sensitive data reaches the sink.
This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and
ScadaBridge have no equivalent.
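The redaction policy described above can be sketched as two small pure functions. This is a hedged illustration only: the token layout `mxgw_<keyId>_<secret>`, the `***` mask, and the helper names `RedactTokens` / `RedactArgument` are assumptions made for the example and are not taken from `GatewayLogRedactor.cs`.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: tokens look like "mxgw_<keyId>_<secret>"; the real format is not given here.
var tokenPattern = new Regex(@"\bmxgw_(?<keyId>[A-Za-z0-9]+)_[A-Za-z0-9\-]+", RegexOptions.Compiled);

// Masks the token body but keeps the key-id prefix so operators can still tell which key was used.
string RedactTokens(string message) =>
    tokenPattern.Replace(message, m => "mxgw_" + m.Groups["keyId"].Value + "_***");

// Replaces a sensitive command argument before the message reaches any logger.
string RedactArgument(string commandMethod, string argument) =>
    commandMethod is "AuthenticateUser" or "WriteSecured" ? "***" : argument;

Console.WriteLine(RedactTokens("authorization: Bearer mxgw_k42_s3cr3tBody"));
Console.WriteLine(RedactArgument("AuthenticateUser", "hunter2"));
Console.WriteLine(RedactArgument("Read", "Tank1.Level"));
```

Keeping the functions free of any logging dependency is what lets the policy run before the log event is emitted rather than inside a sink.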
## 3. Signal summary
| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | `System.Diagnostics.Metrics` (`Meter` direct) | ⛔ none (`GetSnapshot()` only) | ⛔ none |
| Traces | — | ⛔ none | ⛔ none |
| Logs | MEL (`Microsoft.Extensions.Logging`) | Console via `appsettings.json` | ⛔ none |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource exists) |
## 4. Notable design choices
- **`GatewayMetrics` singleton** — all counter/gauge increments are lock-free atomic operations on
the underlying `Meter` instruments; the singleton is intentional.
- **`ms` histogram unit** — `workers.startup.duration`, `commands.duration`, and
`events.stream_send.duration` all record in milliseconds. This is non-standard (OTel semconv
requires `s`) and means raw values differ from OtOpcUa's `s` histograms by a factor of 1000.
- **MEL correlation via `BeginScope`** — MEL scopes are supported by structured logging providers
(e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The
scope properties may not appear in all sink configurations, unlike Serilog's `LogContext` which
is sink-agnostic.
- **Redaction placement** — `GatewayLogRedactor` sits between the caller and the log emission point,
not inside a sink. This is the correct placement; the shared `ILogRedactor` seam preserves this.
---
## Adoption plan → `ZB.MOM.WW.Telemetry`
**This is the one in-pass adoption.** The MxGateway MEL → Serilog migration is executed as part of
the `ZB.MOM.WW.Telemetry` library build, not deferred as a follow-on. The changes below land in
the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library build).
**Migrate logging MEL → `AddZbSerilog`:**
- Replace `WebApplicationBuilder` default logging with `builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; })`.
Gains structured `SiteId` / `NodeRole` / `NodeHostname` enrichers on every log event, plus
`TraceContextEnricher` (currently moot — no spans — but ready for when tracing is added).
- Re-express the `GatewayLogScope` / `BeginGatewayScope` / `UseGatewayRequestLogging` correlation
mechanism as a Serilog `LogContext.PushProperty` scope. The middleware at
`GatewayRequestLoggingMiddlewareExtensions.cs:22-41` is refactored to push the same four
properties (`SessionId`, `WorkerProcessId`, `CorrelationId`, `CommandMethod`) via Serilog's
`LogContext` rather than MEL `BeginScope`. Behavior is identical; portability improves.
- Move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The redaction policy (which
commands/tokens to scrub and how) stays per-project in a `MxGatewayLogRedactor : ILogRedactor`
implementation; the seam is shared.
- Console + file sinks configured via `ReadFrom.Configuration` in `appsettings.json` — consistent
with OtOpcUa and ScadaBridge's Serilog approach.
**Wire metrics export via `AddZbTelemetry`:**
- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; })`.
This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus
exporter. The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time.
`GatewayMetrics.cs` itself is unchanged — only the SDK layer is added around it.
- Add `app.MapZbMetrics()` to expose `/metrics`.
**Convert histogram unit `ms` → `s`:**
- Re-create the three histograms with unit `s` and record values in seconds (for example by
multiplying the existing millisecond values by `0.001` at the call site). Both steps are needed,
because an instrument's unit is fixed when it is created. This is a breaking change to existing
dashboards/alerts but required for OTel semconv compliance. Tagged as a convergence item in `GAPS.md`.
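A minimal sketch of the `ms` → `s` conversion, assuming only the BCL `System.Diagnostics.Metrics` and `Stopwatch` APIs. The instrument name comes from the histogram table; the variable names and the sample value of 250 ms are illustrative.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

using var meter = new Meter("MxGateway.Server");

// The unit is fixed when the instrument is created, so switching to "s" means re-creating it.
Histogram<double> commandDuration =
    meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "s");

// Option 1: measure directly in seconds.
long start = Stopwatch.GetTimestamp();
// ... dispatch the command here ...
double elapsedSeconds = Stopwatch.GetElapsedTime(start).TotalSeconds;
commandDuration.Record(elapsedSeconds);

// Option 2: convert an existing millisecond value at the call site.
double legacyMilliseconds = 250.0;
double convertedSeconds = legacyMilliseconds * 0.001;
commandDuration.Record(convertedSeconds);

Console.WriteLine($"{legacyMilliseconds} ms -> {convertedSeconds} s");
```

Measuring in seconds at the source (option 1) avoids leaving a unit conversion scattered across call sites, but either option yields values that match the declared unit.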
**Keep bespoke:**
- `GatewayMetrics.cs` — all 20 instruments (`mxgateway.*` counters, histograms, gauges) stay
per-project. `AddZbTelemetry` registers the Meter name; it does not own or replace the instruments.
- Meter name `"MxGateway.Server"` — a follow-on rename to `"ZB.MOM.WW.MxGateway"` is tracked in
`GAPS.md` but is not required for the initial adoption (it is a Prometheus label change that
breaks existing dashboards).
- `GatewayApplication.cs:62` singleton registration — unchanged; `GatewayMetrics` remains a
singleton; `AddZbTelemetry` simply hooks the OTel SDK to it.
- The net48 x86 worker's `IWorkerLogger` (stderr key=value) — out of process and out of scope.
No changes.
@@ -0,0 +1,158 @@
# Observability — current state: OtOpcUa
Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Telemetry code lives in two places: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/` (host-side
bootstrap) and `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/` (instruments + enricher).
All paths relative to repo root. Verified 2026-06-01.
The most complete observability implementation in the family: OpenTelemetry SDK with both metrics and
tracing signals, Prometheus export, Serilog structured logging with a per-session correlation enricher,
and a dedicated instrument vocabulary. The one significant gap: **no OTel Resource / `service.name`**,
so all signals are indistinguishable from one another and from other fleet members in a backend.
## 1. Metrics (OpenTelemetry SDK)
### Bootstrap — `ObservabilityExtensions.cs`
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`:
- `:18`: `AddOtOpcUaObservability(IServiceCollection)` is the service-registration entry point.
- `:20`: `AddOpenTelemetry()` wires the OTel SDK.
- `:21-23`: `.WithMetrics(b => b.AddMeter(OtOpcUaTelemetry.MeterName).AddPrometheusExporter())`
registers the application meter and attaches the Prometheus scrape exporter.
- `:24-25`: `.WithTracing(b => b.AddSource(OtOpcUaTelemetry.ActivitySourceName))`
registers the application activity source for trace data.
- **No `ResourceBuilder` call anywhere**: `service.name`, `service.namespace`, `service.version`,
`site.id`, and `node.role` are not set. The OTel SDK falls back to its default Resource, in which
`service.name` is only an `unknown_service` placeholder.
- `:36`: `MapOtOpcUaMetrics(IEndpointRouteBuilder)` maps the Prometheus endpoint.
- `:38`: endpoint path is `/metrics`.
`Program.cs`:
- `:138`: `builder.Services.AddOtOpcUaObservability()`
- `:160`: `app.MapOtOpcUaMetrics()`
Package refs in csproj: `OpenTelemetry.Extensions.Hosting`, `OpenTelemetry.Exporter.Prometheus.AspNetCore`.
**No `OpenTelemetry.Exporter.OpenTelemetryProtocol`** — OTLP is not available; Prometheus is the
only export path.
### Instruments — `OtOpcUaTelemetry.cs`
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`:
- `:19`: `MeterName = "ZB.MOM.WW.OtOpcUa"` (the `Meter` the SDK will collect).
- `:20`: `ActivitySourceName = "ZB.MOM.WW.OtOpcUa"` (the `ActivitySource` for spans).
Instruments defined (all `static readonly` on `OtOpcUaTelemetry`):
| Instrument | Kind | Unit | Subsystem |
|---|---|---|---|
| `otopcua.deploy.applied` | `Counter<long>` | — | deploy |
| `otopcua.deploy.apply.duration` | `Histogram<double>` | `s` | deploy |
| `otopcua.driver.lifecycle` | `Counter<long>` | — | driver |
| `otopcua.virtualtag.eval` | `Counter<long>` | — | virtual-tag |
| `otopcua.scriptedalarm.transition` | `Counter<long>` | — | scripted-alarm |
| `otopcua.opcua.sink.write` | `Counter<long>` | — | opc-ua sink |
| `otopcua.redundancy.service_level_change` | `Counter<long>` | — | redundancy |
Two activity spans: `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`.
Naming convention: `otopcua.<subsystem>.<event>`. Duration histogram correctly uses unit `s`
(OTel semantic conventions). **No standard instrumentation** (ASP.NET Core, HttpClient, runtime,
gRPC client meters) is wired — only the bespoke application instruments.
## 2. Logging (Serilog)
### Bootstrap
`Program.cs`:
- `:49-52`: two-stage Serilog bootstrap, with an initial logger for startup and then the full
`UseSerilog(ReadFrom.Configuration)`. Sinks: Console + rolling file `logs/otopcua-.log`.
- `:141`: `UseSerilogRequestLogging()` on the `WebApplication`.
### Correlation enricher — `LogContextEnricher.cs`
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/LogContextEnricher.cs`:
- `:18-36`: `Push(driverInstanceId, driverType, capability, correlationId)` calls
`LogContext.PushProperty` for four properties:
  - `DriverInstanceId`: Galaxy driver instance GUID.
  - `DriverType`: driver type discriminator.
  - `CapabilityName`: OPC UA capability being exercised.
  - `CorrelationId`: caller-supplied correlation token.
This enricher is driver-lifecycle-scoped, not request-scoped: it pushes the properties when a
driver operation begins, and the result is disposed to pop them on completion.
**No `trace_id` / `span_id` enricher.** Although OtOpcUa creates `ActivitySource` spans, the
active `Activity.Current` trace context is never pushed onto Serilog's `LogContext`. A log line
emitted during a span cannot be correlated to the span in a backend.
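The missing piece can be sketched with the BCL `System.Diagnostics` API alone. In this illustration the `ActivityListener` stands in for the OTel SDK (which normally registers the listener), and `CurrentTraceContext` is a hypothetical helper showing the two values a trace-context enricher would copy from `Activity.Current` onto each log event; it is not the shared library's `TraceContextEnricher`.

```csharp
using System;
using System.Diagnostics;

// The source name comes from OtOpcUaTelemetry above; the listener stands in for the OTel SDK.
using var source = new ActivitySource("ZB.MOM.WW.OtOpcUa");
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "ZB.MOM.WW.OtOpcUa",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

// Hypothetical helper: the two values an enricher would attach to each log event.
(string? TraceId, string? SpanId) CurrentTraceContext()
{
    Activity? activity = Activity.Current;
    return activity is null
        ? (null, null)
        : (activity.TraceId.ToHexString(), activity.SpanId.ToHexString());
}

// Outside a span there is nothing to correlate with.
Console.WriteLine(CurrentTraceContext().TraceId ?? "(no active span)");

using (Activity? span = source.StartActivity("otopcua.deploy.apply"))
{
    // Inside the span, the same call yields ids a backend can join on.
    var (traceId, spanId) = CurrentTraceContext();
    Console.WriteLine($"trace_id={traceId} span_id={spanId}");
}
```

Because the helper reads ambient state, an enricher built this way needs no changes at the call sites that create spans or write logs.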
**No structural enrichers for `service.name` / `site.id` / `node.role`** — these dimensions are
absent from every log line. ScadaBridge has these; OtOpcUa does not.
## 3. Signal summary
| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | OTel SDK (`Meter` + `WithMetrics`) | Prometheus `/metrics` | ⛔ none |
| Traces | OTel SDK (`ActivitySource` + `WithTracing`) | ⛔ none (no exporter configured) | ⛔ none |
| Logs | Serilog | Console + rolling file | ⛔ none (no `service.name` property) |
| Trace↔log correlation | — | — | ⛔ absent (`trace_id`/`span_id` not pushed) |
Note: `WithTracing` registers the `ActivitySource` for collection, but no exporter (OTLP or
otherwise) is attached to the tracing pipeline. Spans are created and recorded by the SDK but never
shipped anywhere — effectively a no-op in production.
## 4. Notable design choices
- **Instrument naming** follows `<app>.<subsystem>.<event>` (here `otopcua.<subsystem>.<event>`)
cleanly and consistently; this is the pattern the shared spec codifies as the fleet convention.
- **Duration unit** correctly uses `s` on `otopcua.deploy.apply.duration` — no conversion needed on
adoption; this contrasts with MxAccessGateway's `ms` histograms.
- **LogContextEnricher is bespoke but valuable** — the `DriverInstanceId`/`DriverType`/`CapabilityName`
correlation is OtOpcUa-specific domain context; it should survive adoption behind the shared
enricher layer.
- **No OTLP path** — with no OTLP exporter, OtOpcUa cannot send metrics or traces to a collector
(Prometheus is scrape-pull only). This limits operational flexibility.
---
## Adoption plan → `ZB.MOM.WW.Telemetry`
**Replace with shared bootstrap:**
- `AddOtOpcUaObservability()` → `builder.AddZbTelemetry(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; o.Meters = [OtOpcUaTelemetry.MeterName]; o.ActivitySources = [OtOpcUaTelemetry.ActivitySourceName]; })`.
This adds the missing `Resource` (gains `service.name` / `service.namespace` / `service.version` /
`site.id` / `node.role` / `host.name` on every metric and span). Prometheus `/metrics` stays the
default exporter; OTLP becomes opt-in via options.
- Add standard instrumentation through `AddZbTelemetry` options: ASP.NET Core meters, HttpClient,
runtime + process meters — none wired today.
- Fix the tracing no-op: wire an OTLP exporter (or at minimum note that tracing is recorded but not
exported); `AddZbTelemetry` provides OTLP as the opt-in path.
- `MapOtOpcUaMetrics` → `app.MapZbMetrics()` (same `/metrics` path; shared convention).
**Replace with shared Serilog bootstrap:**
- Serilog bootstrap in `Program.cs:49-52` → `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; })`.
This adds structural `SiteId` / `NodeRole` / `NodeHostname` properties to every log line
(currently absent) and wires the `TraceContextEnricher` so `trace_id`/`span_id` appear on log
lines emitted during active spans.
- Console + file sinks continue via `ReadFrom.Configuration` in `appsettings.json` — no sink changes
needed.
- `UseSerilogRequestLogging()` stays.
**Keep bespoke:**
- `OtOpcUaTelemetry.cs` — the application `Meter`, `ActivitySource`, and all instrument definitions
(`otopcua.*` counters, histograms, spans). These are domain instruments; `AddZbTelemetry` registers
them by name but does not own them.
- `LogContextEnricher.cs` — driver-lifecycle correlation properties (`DriverInstanceId`,
`DriverType`, `CapabilityName`, `CorrelationId`) are OtOpcUa-specific. The enricher continues to
push via `LogContext.PushProperty` alongside the shared enrichers.
- `ObservabilityExtensions.cs` itself can be simplified or removed — it becomes a thin wrapper that
calls `AddZbTelemetry` with OtOpcUa-specific options. The per-project entry point remains; only
the implementation body is delegated to the shared library.
**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry`
library build. The library build delivers the shared bootstrap and enrichers; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.
@@ -0,0 +1,151 @@
# Observability — current state: ScadaBridge
Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET, Docker; solution
`ZB.MOM.WW.ScadaBridge.slnx`. The telemetry posture is split across a dangling OTel package ref
(metrics/traces) and a substantive Serilog setup (logs). All paths relative to repo root.
Verified 2026-06-01.
Structurally the cleanest logging enricher set in the family — `SiteId` / `NodeRole` /
`NodeHostname` are already first-class Serilog enricher properties — but the weakest on
metrics/tracing: zero instrumentation. The `OpenTelemetry.Api` package reference is a CVE-patch
artefact, not instrumentation.
## 1. Metrics and traces (absent)
### `OpenTelemetry.Api` — CVE-patch ref, not instrumentation
`src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj`:
- `:31`: `<PackageReference Include="OpenTelemetry.Api" />`, a **direct version override** added
to satisfy GHSA-g94r-2vxg-569j / GHSA-8785-wc3w-h8q6 (OpenTelemetry 1.9.0 CVEs introduced via
`Akka.Hosting`'s pinned transitive dependency).
There is **no `AddOpenTelemetry()` call** in the solution. No `Meter` is created. No
`ActivitySource` is declared. No exporter is configured. The package reference solely overrides the
transitive version — it has no runtime effect on observability.
### Instrument coverage
Zero application instruments. There is no custom `Meter`, no counter, no histogram, no gauge, and
no span in the ScadaBridge codebase. This is the largest gap in the family.
## 2. Logging (Serilog — strongest enricher set)
### Two-stage bootstrap
`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`:
- `:27-54`: two-stage Serilog bootstrap. An initial logger is created for startup messages before
the host is built; the full logger replaces it during `UseSerilog`.
### `LoggerConfigurationFactory.cs`
`src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs`:
Full factory method signature: `Build(IConfiguration config, string nodeRole, string siteId, string nodeHostname)`.
- `:62`: reads `ScadaBridge:Logging:MinimumLevel` from configuration.
- `:84`: `ReadFrom.Configuration(config)` pulls sink configuration from `appsettings.json`.
- `:85`: explicit `MinimumLevel.Is(...)` override from the typed option.
- `:86-88`: three structural enrichers:
  - `.Enrich.WithProperty("SiteId", siteId)`: site identifier (e.g. `"site-a"`).
  - `.Enrich.WithProperty("NodeHostname", nodeHostname)`: node hostname.
  - `.Enrich.WithProperty("NodeRole", nodeRole)`: Akka cluster role (e.g. `"central"`, `"site"`).
These three properties are the cleanest and most complete set in the family. ScadaBridge's property
names (`SiteId` / `NodeRole` / `NodeHostname`) are also the ones the shared `AddZbTelemetry`
options object maps onto `site.id` / `node.role` / `host.name` OTel Resource attributes — no
renaming needed on adoption.
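The property-to-attribute mapping described above can be sketched as a plain function over the BCL. `BuildResourceAttributes` is a hypothetical helper written for this illustration, not the shared library's API; the attribute keys come from the text, and the choice to skip `null` values follows the contract note that null attributes are omitted from the Resource.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the real mapping would live inside the shared library.
List<KeyValuePair<string, object>> BuildResourceAttributes(
    string serviceName, string? siteId, string? nodeRole, string? nodeHostname)
{
    var candidates = new (string Key, string? Value)[]
    {
        ("service.name", serviceName),
        ("site.id", siteId),         // Serilog property SiteId
        ("node.role", nodeRole),     // Serilog property NodeRole
        ("host.name", nodeHostname), // Serilog property NodeHostname
    };

    var attributes = new List<KeyValuePair<string, object>>();
    foreach (var (key, value) in candidates)
    {
        // Unset dimensions are left out rather than emitted as empty strings.
        if (value is not null)
            attributes.Add(new KeyValuePair<string, object>(key, value));
    }
    return attributes;
}

var attrs = BuildResourceAttributes("scadabridge", "site-a", "central", Environment.MachineName);
foreach (var (key, value) in attrs)
    Console.WriteLine($"{key} = {value}");
```

Feeding the log enrichers and the Resource from one set of values is what keeps a log line and a metric from the same node joinable in a backend.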
### Sink configuration
`appsettings.json:3-23`: Serilog sinks configured via `ReadFrom.Configuration`:
- Console sink with output template that includes `[{NodeRole}/{NodeHostname}]`.
- File sink (path in config; rolling interval).
### `LoggingOptions.cs`
`src/ZB.MOM.WW.ScadaBridge.Host/LoggingOptions.cs`:
- `MinimumLevel` — config-bound minimum level; default `Information`.
### Missing elements
- **No custom enrichers** beyond the three structural properties. `LogContextEnricher` (OtOpcUa's
driver-correlation enricher) has no equivalent; MxGateway's per-session correlation scope has no
equivalent. Per-request/per-operation correlation is not present.
- **No `trace_id` / `span_id` enricher.** As with the other two projects, log lines do not carry
trace context. Because ScadaBridge has zero `ActivitySource` instrumentation, this is consistent —
but it means no trace↔log correlation path exists even hypothetically.
## 3. Signal summary
| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | ⛔ none | ⛔ none | ⛔ none |
| Traces | ⛔ none | ⛔ none | ⛔ none |
| Logs | Serilog | Console + file (`appsettings.json`) | ⛔ none (no `service.name` property) |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource; no enricher) |
## 4. Notable design choices
- **`SiteId` / `NodeRole` / `NodeHostname` as first-class enrichers** — unlike OtOpcUa's driver-
scoped `LogContextEnricher`, ScadaBridge's structural enrichers are attached at logger creation and
appear on every log line from the process. This is the target pattern for the shared bootstrap.
- **`nodeRole` + `siteId` passed into the factory**: ScadaBridge's `LoggerConfigurationFactory.Build`
takes these as method arguments rather than reading them from a registered options object.
The shared `AddZbSerilog` approach binds them from the same `ZbTelemetryOptions` used for the OTel
Resource, unifying the source.
- **Config-driven `MinimumLevel`**: `ScadaBridge:Logging:MinimumLevel` is a typed config path, and
sinks come from `ReadFrom.Configuration`. The shared bootstrap's `AddZbSerilog` must support the
same pattern.
- **No custom enrichers** — ScadaBridge's logging is intentionally minimal on operation-scoped
context. Correlation in the distributed model is provided by structured log fields from Akka
actor context, not a log enricher pipeline.
- **CVE-patch ref discipline** — the `OpenTelemetry.Api` pin is a responsible CVE response but
leaves the telemetry story incomplete. On adoption, the CVE pin is superseded by the full OTel SDK
pulled in by `AddZbTelemetry`; the explicit `<PackageReference>` override can be removed.
---
## Adoption plan → `ZB.MOM.WW.Telemetry`
**Replace CVE-patch ref with full OTel SDK via `AddZbTelemetry`:**
- Remove the lone `OpenTelemetry.Api` override from
`src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31`.
- Add `builder.AddZbTelemetry(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.Meters = ["ZB.MOM.WW.ScadaBridge"]; })`.
The full OTel SDK supersedes the transitive version override; the CVE is resolved transitively
via the SDK's current dependency.
**Add first application instruments:**
- Define a `ScadaBridgeTelemetry` class (mirror `OtOpcUaTelemetry`) with a `Meter` named
`"ZB.MOM.WW.ScadaBridge"` and an initial set of instruments covering the most observable
operations: site connection lifecycle, alarm received, data-change received, actor supervision
events. Naming convention: `scadabridge.<subsystem>.<event>`.
- Register the meter name in `AddZbTelemetry` options. Expose `/metrics` via `app.MapZbMetrics()`.
ScadaBridge goes from zero instrumentation to a baseline exportable set.
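A possible starting shape for `ScadaBridgeTelemetry`, sketched against the BCL `System.Diagnostics.Metrics` API. The meter name and the `scadabridge.<subsystem>.<event>` pattern come from the plan above; the specific instrument names, field names, and the duration histogram are hypothetical placeholders, since the actual instrument set is still to be decided.

```csharp
using System;
using System.Diagnostics.Metrics;

// Illustrative usage: record one event per instrument.
ScadaBridgeTelemetry.SiteConnectionOpened.Add(1);
ScadaBridgeTelemetry.AlarmReceived.Add(1);
ScadaBridgeTelemetry.SiteConnectDuration.Record(0.42); // seconds
Console.WriteLine($"Meter: {ScadaBridgeTelemetry.MeterName}");

// Sketch only: instrument names follow scadabridge.<subsystem>.<event> but are hypothetical.
public static class ScadaBridgeTelemetry
{
    public const string MeterName = "ZB.MOM.WW.ScadaBridge";

    // Declared first so it is initialized before the instruments that depend on it.
    private static readonly Meter Meter = new(MeterName);

    public static readonly Counter<long> SiteConnectionOpened =
        Meter.CreateCounter<long>("scadabridge.site.connection_opened");

    public static readonly Counter<long> AlarmReceived =
        Meter.CreateCounter<long>("scadabridge.alarm.received");

    // Durations use seconds from the start, so no later unit conversion is needed.
    public static readonly Histogram<double> SiteConnectDuration =
        Meter.CreateHistogram<double>("scadabridge.site.connect.duration", unit: "s");
}
```

Exposing `MeterName` as a constant lets the host pass the same string to the telemetry options without duplicating the literal.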
**Adopt `AddZbSerilog`:**
- Replace the `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` call in
`Program.cs:27-54` with `builder.AddZbSerilog(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.NodeHostname = cfg.NodeHostname; })`.
The three enrichers (`SiteId`, `NodeRole`, `NodeHostname`) are now provided by the shared
`AddZbSerilog` path; `LoggerConfigurationFactory` can be deleted.
- `ReadFrom.Configuration` for sinks and `MinimumLevel.Is` override from config are preserved
inside `AddZbSerilog` — behavior is unchanged.
- The `TraceContextEnricher` is wired automatically by `AddZbSerilog`; once the application also
declares an `ActivitySource` and creates spans (the meter above does not produce any),
`trace_id` / `span_id` will appear on log lines emitted during those spans.
**Keep bespoke:**
- `LoggingOptions.cs` — the `MinimumLevel` typed option and its config path
(`ScadaBridge:Logging:MinimumLevel`) remain; `AddZbSerilog` must accept the minimum-level
override from configuration. The config path stays ScadaBridge's own.
- Console output template including `[{NodeRole}/{NodeHostname}]` — driven by `appsettings.json`;
no change.
- Akka actor-context log fields — per-operation context emitted by Akka infrastructure; not an
enricher concern.
- `ZB.MOM.WW.ScadaBridge.Host.csproj` package set otherwise — no other changes to the project file.
**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry`
library build. Adding instruments and adopting `AddZbSerilog`/`AddZbTelemetry` lands in the
ScadaBridge repo as a separate commit once the nupkg is available.
@@ -0,0 +1,248 @@
# Proposed shared library: `ZB.MOM.WW.Telemetry`
A contract on paper — the public surface to extract so the three projects stop implementing
observability separately. Realizes [`../spec/SPEC.md`](../spec/SPEC.md) and
[`../spec/METRIC-CONVENTIONS.md`](../spec/METRIC-CONVENTIONS.md). **Not yet created.**
Reference implementations already exist: OtOpcUa `ObservabilityExtensions.cs` (OTel + Serilog),
ScadaBridge `LoggerConfigurationFactory.cs` (Serilog enrichers), MxGateway
`GatewayMetrics.cs` + `GatewayLogRedactor.cs`.
## Packages (.NET 10)
```
ZB.MOM.WW.Telemetry # OTel bootstrap: Resource, metrics, traces, exporters
ZB.MOM.WW.Telemetry.Serilog # Serilog bootstrap: enrichers, TraceContextEnricher, ILogRedactor
```
Both packages are .NET 10 — all three logging-bearing processes are .NET 10 (OtOpcUa server,
mxaccessgw gateway, ScadaBridge central). The x86 net48 mxaccessgw worker uses a bespoke
`IWorkerLogger` (stderr key=value); net48 multi-targeting is **not** required. Published to
the Gitea NuGet feed; SemVer; lockstep to start.
## Packaging & distribution
**Two NuGet packages, one DLL each**, on the Gitea NuGet feed. Libraries linked into each
app — there is no central telemetry service. Both packages are consumed by all three apps
after adoption:
| Package (→ DLL) | Transitive deps | OtOpcUa | MxGateway | ScadaBridge |
|---|---|---|---|---|
| `…Telemetry` | OpenTelemetry SDK, `OpenTelemetry.Exporter.Prometheus.AspNetCore`, `OpenTelemetry.Exporter.OpenTelemetryProtocol`, standard instrumentation packages | ✅ | ✅ | ✅ |
| `…Telemetry.Serilog` | Serilog, `Serilog.Extensions.Hosting`, `Serilog.AspNetCore` (version note below) | ✅ | ✅ | ✅ |
> **`Serilog.AspNetCore` version split (open convergence note):** OtOpcUa and ScadaBridge
> target .NET 10 and may use `Serilog.AspNetCore` 9.x; MxGateway's adoption starts from
> `Serilog.AspNetCore` 9.x as well. If a project remains on .NET 8 ASP.NET Core for any
> reason, the compatible version is `Serilog.AspNetCore` 8.x. Coordinate the version floor
> when the first app takes a dependency and pin it in `Directory.Packages.props`.
---
## `ZB.MOM.WW.Telemetry`
```csharp
namespace ZB.MOM.WW.Telemetry;

/// Selects how instrumentation data is exported.
public enum ZbExporter
{
    /// Prometheus scrape endpoint (default). Call app.MapZbMetrics() to mount /metrics.
    Prometheus,

    /// OTLP gRPC export. Set OtlpEndpoint (e.g. "http://collector:4317").
    /// Coexists with Prometheus when both endpoints are desired.
    Otlp,
}

/// Options for AddZbTelemetry. The identity properties feed the shared OTel Resource;
/// SiteId and NodeRole also feed the Serilog enrichers (via AddZbSerilog in the .Serilog package).
public sealed class ZbTelemetryOptions
{
    /// Required. Short lower-case app identifier, e.g. "otopcua", "mxgateway", "scadabridge".
    /// Populates OTel Resource service.name.
    public string ServiceName { get; set; } = "";

    /// Fleet-wide namespace. Default "ZB.MOM.WW". Do not override per-app.
    /// Populates OTel Resource service.namespace.
    public string ServiceNamespace { get; set; } = "ZB.MOM.WW";

    /// Optional. Populate from AssemblyInformationalVersion.
    /// Populates OTel Resource service.version.
    public string? ServiceVersion { get; set; }

    /// Optional. Physical or logical site identifier.
    /// Populates OTel Resource site.id and Serilog property SiteId.
    public string? SiteId { get; set; }

    /// Optional. Node function: "central", "site", "hub", "standalone".
    /// Populates OTel Resource node.role and Serilog property NodeRole.
    public string? NodeRole { get; set; }

    /// App-specific Meter names to register with the OTel MeterProvider.
    /// Always register the app's primary Meter here. Standard instrumentation meters are
    /// added automatically (ASP.NET Core, HttpClient, runtime, process).
    public string[] Meters { get; set; } = [];

    /// App-specific ActivitySource names to register with the OTel TracerProvider.
    public string[] ActivitySources { get; set; } = [];

    /// Export path. Default Prometheus; use Otlp for a real collector.
    public ZbExporter Exporter { get; set; } = ZbExporter.Prometheus;

    /// Required when Exporter = ZbExporter.Otlp.
    /// OTLP gRPC endpoint, e.g. "http://collector:4317".
    public string? OtlpEndpoint { get; set; }
}

/// Extension point for configuring the OTel bootstrap on an IHostApplicationBuilder.
public static class ZbTelemetryExtensions
{
    /// Configures the OpenTelemetry MeterProvider and TracerProvider with the shared Resource,
    /// standard instrumentation (ASP.NET Core, HttpClient, gRPC client, runtime, process),
    /// the app's own Meters and ActivitySources, and the selected exporter.
    /// Does NOT configure Serilog; call AddZbSerilog() in the .Serilog package for that.
    public static IHostApplicationBuilder AddZbTelemetry(
        this IHostApplicationBuilder builder,
        Action<ZbTelemetryOptions> configure);

    /// IServiceCollection overload for contexts where IHostApplicationBuilder is not available.
    /// Takes a pre-built ZbTelemetryOptions rather than a configure delegate, because Resource
    /// attributes must be populated before DI composition. Prefer the IHostApplicationBuilder
    /// overload wherever the host allows it.
    public static IServiceCollection AddZbTelemetry(
        this IServiceCollection services,
        ZbTelemetryOptions options);
}

/// Builds the shared OTel ResourceBuilder from ZbTelemetryOptions.
/// Used internally by AddZbTelemetry. Exposed for tests and custom pipelines.
public static class ZbResource
{
    /// Returns a ResourceBuilder pre-populated with service.name, service.namespace,
    /// service.version, site.id, node.role, and host.name (always Environment.MachineName).
    /// Attributes with null values are omitted from the Resource.
    public static ResourceBuilder Build(ZbTelemetryOptions options);
}

/// Endpoint extension for mounting the Prometheus /metrics scrape endpoint.
public static class ZbMetricsEndpointExtensions
{
    /// Mounts the Prometheus /metrics endpoint.
    /// Only valid when ZbTelemetryOptions.Exporter = ZbExporter.Prometheus (or both).
    /// Call after app.UseRouting().
    public static IEndpointConventionBuilder MapZbMetrics(
        this IEndpointRouteBuilder endpoints);
}
```
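
The null-omission rule stated for `ZbResource.Build` can be illustrated without the OpenTelemetry package. The following is a minimal sketch under stated assumptions, not the library's implementation: `ResourceInputs`, `ResourceAttributeSketch`, and `Collect` are hypothetical names, and the sketch only assembles the attribute list that a real `Build` would presumably pass on to a `ResourceBuilder`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed stand-in for the identity properties of ZbTelemetryOptions.
public sealed record ResourceInputs(
    string ServiceName,
    string ServiceNamespace = "ZB.MOM.WW",
    string? ServiceVersion = null,
    string? SiteId = null,
    string? NodeRole = null);

public static class ResourceAttributeSketch
{
    // Collects the Resource attributes named in the contract, omitting null values.
    public static IReadOnlyList<KeyValuePair<string, object>> Collect(ResourceInputs o)
    {
        var attributes = new List<KeyValuePair<string, object>>();

        void AddIfPresent(string key, string? value)
        {
            if (value is not null)
                attributes.Add(new KeyValuePair<string, object>(key, value));
        }

        AddIfPresent("service.name", o.ServiceName);
        AddIfPresent("service.namespace", o.ServiceNamespace);
        AddIfPresent("service.version", o.ServiceVersion);
        AddIfPresent("site.id", o.SiteId);
        AddIfPresent("node.role", o.NodeRole);

        // host.name always comes from the machine, never from the options object.
        attributes.Add(new KeyValuePair<string, object>("host.name", Environment.MachineName));
        return attributes;
    }
}
```

Calling `ResourceAttributeSketch.Collect(new ResourceInputs("mxgateway", SiteId: "site-a"))` yields four attributes (`service.name`, `service.namespace`, `site.id`, `host.name`); `service.version` and `node.role` are absent rather than empty.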
---
## `ZB.MOM.WW.Telemetry.Serilog`
```csharp
namespace ZB.MOM.WW.Telemetry.Serilog;

/// Extension point for configuring the Serilog two-stage bootstrap on an IHostApplicationBuilder.
public static class ZbSerilogExtensions
{
    /// Two-stage Serilog bootstrap:
    ///   Stage 1: minimal console-only bootstrap logger (for startup errors before IConfiguration).
    ///   Stage 2: application logger wired from IConfiguration (ReadFrom.Configuration reads
    ///            Serilog:WriteTo sinks + Serilog:MinimumLevel overrides) with fixed enrichers:
    ///            SiteId, NodeRole, NodeHostname (from ZbTelemetryOptions), TraceContextEnricher,
    ///            and RedactionEnricher (applied only when ILogRedactor is registered).
    ///
    /// OTel log export is wired automatically: logs flow through the OTel pipeline with the same
    /// Resource as the metrics and traces (all three signals correlated in a backend).
    ///
    /// The configure delegate receives the same ZbTelemetryOptions used by AddZbTelemetry.
    /// Typically share a single options-population lambda across both calls.
    public static IHostApplicationBuilder AddZbSerilog(
        this IHostApplicationBuilder builder,
        Action<ZbTelemetryOptions> configure);
}

/// Canonical Serilog property name constants for the identity enrichers.
/// Use these constants, not literal strings, when querying properties in sinks or tests.
public static class ZbLogEnricherNames
{
    /// Serilog property: physical or logical site identifier. Matches OTel Resource site.id.
    public const string SiteId = "SiteId";

    /// Serilog property: node function (central, site, hub, standalone). Matches OTel node.role.
    public const string NodeRole = "NodeRole";

    /// Serilog property: machine name (Environment.MachineName). Matches OTel host.name.
    public const string NodeHostname = "NodeHostname";
}

/// Stamps trace_id and span_id from Activity.Current onto every Serilog log event.
/// When Activity.Current is null (no active span: background services, startup, non-traced paths)
/// the enricher emits nothing; it does NOT inject empty strings or zero values.
/// This enables a log line to be clicked through to its originating trace in a backend.
public sealed class TraceContextEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory);
}

/// Seam for project-specific log-event redaction.
/// The shared library applies this via RedactionEnricher; each project provides its own
/// implementation that knows which fields (by property name) or which command payloads
/// must not leave the process in log events.
/// If no ILogRedactor is registered in DI, RedactionEnricher is a no-op.
public interface ILogRedactor
{
    /// Inspect and mutate properties in-place. Remove or replace any sensitive values.
    /// Called on every log event before it reaches any sink.
    void Redact(IDictionary<string, object?> properties);
}

/// Applies a registered ILogRedactor to every Serilog log event.
/// Registered automatically by AddZbSerilog. The enricher resolves ILogRedactor from DI
/// on first use; if none is registered it is permanently inert (no DI call per event).
public sealed class RedactionEnricher : ILogEventEnricher
{
    public RedactionEnricher(IServiceProvider serviceProvider);
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory);
}
```
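
The "emit nothing when there is no active span" behaviour of `TraceContextEnricher` is the part most easily gotten wrong, so here is a minimal sketch of just that decision using only `System.Diagnostics`. It is an illustration under assumptions, not the shipped enricher: `TraceContextSketch` and `Capture` are hypothetical names, and treating a non-W3C activity as "nothing to stamp" is an assumption the contract does not spell out.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceContextSketch
{
    // Returns the properties to stamp for the given activity; empty when there is nothing to stamp.
    public static IReadOnlyDictionary<string, string> Capture(Activity? activity)
    {
        var properties = new Dictionary<string, string>();

        // No active span, or an id format that carries no W3C trace id: emit nothing.
        if (activity is null || activity.IdFormat != ActivityIdFormat.W3C)
            return properties;

        properties["trace_id"] = activity.TraceId.ToHexString(); // 32 lower-case hex characters
        properties["span_id"] = activity.SpanId.ToHexString();   // 16 lower-case hex characters
        return properties;
    }
}
```

A real enricher would presumably call `Capture(Activity.Current)` inside `Enrich` and add each pair to the event through the supplied `ILogEventPropertyFactory`; that wiring is left out here so the sketch has no Serilog dependency.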
---
## Consumer matrix
| Consumer | Packages | Notes |
|---|---|---|
| **MxGateway** | Both | MEL → Serilog migration: `GatewayLogScope`/`BeginScope` → `LogContext.PushProperty`; `GatewayLogRedactor` → `ILogRedactor` impl; `GatewayMetrics` stays, wired through `o.Meters`. **Done in this release.** |
| **OtOpcUa** | Both | Consolidate existing Serilog bootstrap; add `TraceContextEnricher` + `SiteId`/`NodeRole` enrichers; add Resource to existing OTel pipeline. Deferred to GAPS backlog. |
| **ScadaBridge** | Both | Add full OTel SDK (metrics + traces + export); consolidate `LoggerConfigurationFactory`; add `TraceContextEnricher`. Deferred to GAPS backlog. |
The net48 x86 mxaccessgw worker is excluded from both packages. Its `IWorkerLogger`
(stderr key=value format) is an out-of-process concern and remains bespoke.
---
## Open contract questions
1. **`IServiceCollection` overload completeness:** the `IHostApplicationBuilder`-based
overload is the primary path (available in all three apps on .NET 10). The
`IServiceCollection` overload is a fallback for unusual host configurations. Validate
that both overloads wire OTel log export identically (same Resource, same enrichers).
2. **OTel log export channel:** `AddZbSerilog` uses `Serilog.Sinks.OpenTelemetry` to push
logs into the OTel pipeline (sharing the Resource). Confirm the sink version is
compatible with the OpenTelemetry SDK version pinned in `ZB.MOM.WW.Telemetry`
(`Directory.Packages.props`).
3. **`RedactionEnricher` DI timing:** `RedactionEnricher` resolves `ILogRedactor` from
`IServiceProvider` on first use (lazy, to avoid a circular-DI problem during Serilog's
two-stage bootstrap). Validate that the service provider is fully built by the time the
first post-startup log event fires. If MxGateway's `GatewayLogRedactor` has dependencies
that are not available at stage-1 bootstrap time, the lazy-resolve pattern protects it.
4. **`SiteId` / `NodeRole` null handling:** `AddZbTelemetry` and `AddZbSerilog` silently
omit null `SiteId`/`NodeRole` from the Resource and enricher set. Confirm this is the
correct behavior for OtOpcUa, which may run in a single-site configuration where neither
field is meaningful, versus ScadaBridge, where `SiteId` is essential for multi-cluster
fleet visibility.
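
The lazy-resolve pattern raised in question 3 can be sketched with BCL types only. This is a hedged illustration, not the `RedactionEnricher` implementation: `LazyRedactorResolver` and `Apply` are hypothetical names, and the `ILogRedactor` interface is repeated here only to keep the block self-contained.

```csharp
using System;
using System.Collections.Generic;

public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

// Hypothetical helper showing only the lazy-resolve step described in question 3.
public sealed class LazyRedactorResolver
{
    private readonly Lazy<ILogRedactor?> _redactor;

    public LazyRedactorResolver(IServiceProvider serviceProvider)
    {
        // The factory runs once, on first access; the result (including null) is cached.
        _redactor = new Lazy<ILogRedactor?>(
            () => (ILogRedactor?)serviceProvider.GetService(typeof(ILogRedactor)));
    }

    // Applies the redactor if one was registered; otherwise does nothing.
    public void Apply(IDictionary<string, object?> properties) =>
        _redactor.Value?.Redact(properties);
}
```

The sketch also makes the risk in question 3 concrete: because `Lazy<T>` caches a null result, a log event that fires before the redactor is resolvable would leave redaction switched off for the lifetime of the process, which is why the timing needs validating.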
See [`../GAPS.md`](../GAPS.md) for the adoption order and effort/risk.
@@ -0,0 +1,224 @@
# Observability — Metric conventions (standardized)
Status: **Standardized**. The naming and unit rules every sister project's instruments must
follow. Analogous to [`../../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md)
for auth and [`../../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md)
for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).
The per-project instrument tables below (§6) document the **existing bespoke surface**: the
instruments each app currently defines or intends to define. These stay per-project; they are
not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must
be named and measured.
---
## 1. Meter name
Each app owns exactly **one primary Meter**, named after its root namespace:
| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target — rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |
`MxGateway.Server` is the single convergence item for meter naming. It predates the
`ZB.MOM.WW.*` namespace convention; rename when adopting `AddZbTelemetry`. Instruments
emitted under the old name will need a recording rule or relabeling in any Prometheus
configuration that already consumes them; coordinate before renaming in production.
If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: `ZB.MOM.WW.<App>.<Component>`.
---
## 2. Instrument name
Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case,
dot-separated:
```
<app> := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event> := what happened or is measured — applied | count | duration | errors | active | ...
```
**Examples:**
| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions |
| `mxgateway.session.active` | MxGateway | Active MxAccess sessions |
| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |
**Rules:**
1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
`duration`), not implementation details (`method_called`, `loop_iteration`).
4. Counters: past-tense or noun (`received`, `errors`, `applied`).
UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`).
Histograms: `duration` or a measured quantity noun (`size`, `lag`).
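
Rules 1 and 2 are mechanical enough to check in a unit test; rules 3 and 4 need human review. The sketch below is one possible check, not part of the shared library: `InstrumentNameSketch` and `IsValid` are hypothetical names, capping names at four segments is a reading of rule 2, and permitting digits and underscores inside a segment is an assumption the rules do not state.

```csharp
using System.Text.RegularExpressions;

public static class InstrumentNameSketch
{
    // Three or four dot-separated segments, each starting with a lower-case letter.
    // Digits and underscores inside a segment are allowed here by assumption.
    private static readonly Regex Pattern = new(
        @"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2,3}\z",
        RegexOptions.CultureInvariant);

    // True when the name satisfies the lower-case and segment-count rules.
    public static bool IsValid(string name) => Pattern.IsMatch(name);
}
```

With this pattern `otopcua.deploy.applied` and `mxgateway.worker.call.duration` pass, while `MxGateway.Session.Active` (upper case), `mxgateway.session` (two segments), and `mxgateway.worker-call.duration` (hyphen) are rejected.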
---
## 3. Units
### Duration — seconds (mandatory)
**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic
convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.
> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated
> to `"s"` on adoption, and the recorded values must be converted as well (divide by 1 000
> at the call site). Audit existing Prometheus recording rules and dashboards first: any
> panel that reads these histograms in `ms` will need updating. Until migration is complete,
> annotate the instruments with `// CONVERGENCE: ms→s pending`.
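
The call-site conversion described in the convergence note might look like the sketch below. It assumes .NET 7 or later for `Stopwatch.GetElapsedTime`; `DurationSketch`, `RecordCommandDuration`, and `MillisecondsToSeconds` are hypothetical names, and the meter and instrument names are the target names from this document rather than the current contents of `GatewayMetrics.cs`.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class DurationSketch
{
    private static readonly Meter GatewayMeter = new("ZB.MOM.WW.MxGateway");

    // Unit is declared as seconds, so every recorded value must be in seconds too.
    private static readonly Histogram<double> CommandDuration =
        GatewayMeter.CreateHistogram<double>(
            "mxgateway.command.duration", "s", "Duration of MxAccess command invocations");

    // Preferred path: measure in seconds directly from a Stopwatch timestamp.
    public static double RecordCommandDuration(long startTimestamp)
    {
        double seconds = Stopwatch.GetElapsedTime(startTimestamp).TotalSeconds;
        CommandDuration.Record(seconds);
        return seconds;
    }

    // Bridge for call sites that still hold a millisecond value during the migration.
    public static double MillisecondsToSeconds(double milliseconds) => milliseconds / 1000.0;
}
```

A call site would capture `long start = Stopwatch.GetTimestamp();` before the command and call `DurationSketch.RecordCommandDuration(start)` afterwards. Measuring in seconds at the source avoids a second conversion step that could be forgotten on one of the three histograms.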
### Other units
| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory — see above |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |
---
## 4. Resource attribute set (shared across all three signals)
The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and
attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.
| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"` — do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |
**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. Without `site.id` and `node.role`, `host.name` is
the only distinguishing dimension, so metrics cannot be grouped or filtered by site or by
role without maintaining a separate host-to-site mapping.
---
## 5. Standard instrumentation baseline
Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are community-
standard instrumentation packages; the overhead is negligible and the benefit (correlated
HTTP / gRPC request traces across the fleet) is high.
| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — |
| HttpClient | Traces + Metrics | ✅ | ✅ | — |
| gRPC client | Traces | ✅ | — | — |
| .NET runtime | Metrics | ✅ | — | — |
| Process | Metrics | ✅ | — | — |
OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through
`AddZbTelemetry`. No project removes any of these.
---
## 6. Per-app instrument surface (bespoke — stays per project)
These instruments are **not part of the shared library**. They document the existing bespoke
surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`.
### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter
Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing |
| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions |
| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations |
| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations |
| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions |
| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway |
**ActivitySources (spans):**
| Source name | Span(s) |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` |
OtOpcUa has no unit convergence item: none of the seven instruments above is a duration histogram.
### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)
Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
**Counters (13):**
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.created` | `"1"` | MxAccess sessions opened |
| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed |
| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors |
| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations |
| `mxgateway.command.errors` | `"1"` | Command invocation errors |
| `mxgateway.event.received` | `"1"` | MxAccess events received from worker |
| `mxgateway.event.errors` | `"1"` | Event processing errors |
| `mxgateway.worker.started` | `"1"` | x86 worker processes started |
| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped |
| `mxgateway.worker.errors` | `"1"` | Worker communication errors |
| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs |
| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors |
| `mxgateway.auth.failures` | `"1"` | Authentication failures |
**Histograms (3):**
| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
**Gauges (4):**
| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions |
| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes |
| `mxgateway.worker.memory` | `"By"` | Worker process RSS |
| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache |
No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource
is left per-project (deferred to GAPS backlog).
### 6.3 ScadaBridge — `ZB.MOM.WW.ScadaBridge` meter
No meter or instruments exist today (`OpenTelemetry.Api` is a dangling ref). The target
meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the
ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).
---
## Consequences and convergence items (accepted)
| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge — define from scratch |
All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.
@@ -0,0 +1,177 @@
# Observability — normalized target spec
Status: **Draft**. The single design the sister projects converge on. Derived from the
three code-verified current-state docs (`../current-state/`). Goal is *path to shared code*
(`../shared-contract/ZB.MOM.WW.Telemetry.md`), so each normalized section maps to a shared
library seam.
## 0. Scope
**Normalized here:** one OpenTelemetry bootstrap across all three signals (metrics + traces +
logs) via a single `AddZbTelemetry` extension; the shared `Resource` attribute set
(`service.name` / `service.namespace` / `service.version` / `site.id` / `node.role` /
`host.name`) that makes every node distinguishable in a collector; standard instrumentation
everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter
conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap
with identity enrichers (`SiteId`, `NodeRole`, `NodeHostname`) bound from the same options
object as the OTel Resource (metrics and logs therefore carry identical dimensions); a
`TraceContextEnricher` that stamps `trace_id`/`span_id` from `Activity.Current` onto every
Serilog event, enabling log↔trace correlation; an `ILogRedactor` redaction seam.
**Explicitly NOT normalized** (domain-specific — keep per project): each app's actual
instruments — `otopcua.*` meters and spans, `mxgateway.*` counters/histograms/gauges — they
are registered *through* the shared bootstrap but their names and semantics remain
bespoke (see [`METRIC-CONVENTIONS.md`](METRIC-CONVENTIONS.md) §6); the redaction *policy*
(which field names, which command types) — only the `ILogRedactor` seam is shared, each
project supplies its own implementation; the MxGateway net48 x86 worker's `IWorkerLogger`
(stderr key=value format, out-of-process, out of scope).
## 1. OpenTelemetry pipeline — `AddZbTelemetry`
A single `IHostApplicationBuilder` extension is the front door for all three OTel signals.
It wires the shared `Resource`, registers standard instrumentation, and configures the
selected exporter:
```csharp
builder.AddZbTelemetry(o =>
{
    o.ServiceName      = "mxgateway";            // populates Resource service.name
    o.ServiceNamespace = "ZB.MOM.WW";            // constant across the fleet (default)
    o.ServiceVersion   = "1.0.0";                // placeholder; populate from AssemblyInformationalVersion
    o.SiteId           = cfg.SiteId;             // Resource site.id + Serilog SiteId property
    o.NodeRole         = cfg.NodeRole;           // Resource node.role + Serilog NodeRole property
    o.Meters           = ["MxGateway.Server"];   // app's own Meter name(s); current name, see METRIC-CONVENTIONS.md §1
    o.ActivitySources  = ["MxGateway.Server"];   // app's own ActivitySource name(s)
    o.Exporter         = ZbExporter.Prometheus;  // default; ZbExporter.Otlp opt-in
    // o.OtlpEndpoint  = "http://collector:4317"; // required when Exporter = Otlp
});

var app = builder.Build();
app.MapZbMetrics();   // mounts Prometheus /metrics scrape endpoint
```
This is the headline fix: nobody in the fleet sets a `Resource` or `service.name` today,
making every node indistinguishable in a collector. Every project must call `AddZbTelemetry`
to be observable.
## 2. Shared Resource
The OTel `Resource` attached to all three signals is built from `ZbTelemetryOptions`:
| OTel attribute | Options property | Notes |
|---|---|---|
| `service.name` | `ServiceName` | Required. Lower-case short identifier (`otopcua`, `mxgateway`, `scadabridge`) |
| `service.namespace` | `ServiceNamespace` | Default `"ZB.MOM.WW"` — constant across the fleet |
| `service.version` | `ServiceVersion` | Optional; recommend populating from `AssemblyInformationalVersion` |
| `site.id` | `SiteId` | Optional; identifies the physical/logical site |
| `node.role` | `NodeRole` | Optional; e.g. `"central"`, `"site"`, `"hub"` |
| `host.name` | _(auto)_ | Always populated from `Environment.MachineName` |
The same `SiteId` and `NodeRole` values are passed to the Serilog enrichers (§5) so a
metric, a span, and a log line from the same node carry identical dimensions and join up in
any OTel-compatible backend.
## 3. Standard instrumentation
`AddZbTelemetry` enables the following instrumentation for all projects. Any project that
already enables a subset gets it consolidated; no project may skip this baseline:
| Instrumentation | Package | Signal |
|---|---|---|
| ASP.NET Core | `OpenTelemetry.Instrumentation.AspNetCore` | Traces + Metrics |
| HttpClient | `OpenTelemetry.Instrumentation.Http` | Traces + Metrics |
| gRPC client | `OpenTelemetry.Instrumentation.GrpcNetClient` | Traces |
| .NET runtime | `OpenTelemetry.Instrumentation.Runtime` | Metrics |
| Process | `OpenTelemetry.Instrumentation.Process` | Metrics |
App-specific `Meter` names and `ActivitySource` names are registered via `o.Meters` and
`o.ActivitySources`. This is how MxGateway's hand-rolled `GatewayMetrics` finally gets an
export path instead of dying in an in-memory `GetSnapshot()`.
## 4. Exporter conventions
`ZbTelemetryOptions.Exporter` selects the export path:
| Value | Behaviour |
|---|---|
| `ZbExporter.Prometheus` | Mounts a Prometheus `/metrics` scrape endpoint via `app.MapZbMetrics()`. Default for all three apps — consistent with OtOpcUa's existing `/metrics`. |
| `ZbExporter.Otlp` | Exports to an OTLP endpoint specified by `o.OtlpEndpoint` (gRPC, `http://collector:4317`). Opt-in path to a real OTel Collector; coexists with Prometheus. |
Both exporters carry the shared `Resource`. OTLP is the path to a real backend (Tempo,
Prometheus-remote-write, Loki); Prometheus covers the "scrape from the node" case that all
three apps currently use or aspire to.
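
The two "Required" rules on the options object (`ServiceName` always, `OtlpEndpoint` when the exporter is OTLP) lend themselves to a startup check. The sketch below shows one way to express it; the types are trimmed, hypothetical copies of the contract types, `ZbTelemetryOptionsValidator` is not part of the paper API, and failing fast with an exception is an assumption, since the spec does not say how a violation is reported.

```csharp
using System;

// Trimmed, hypothetical copies of the contract types, just enough for the check below.
public enum ZbExporter { Prometheus, Otlp }

public sealed class ZbTelemetryOptions
{
    public string ServiceName { get; set; } = "";
    public ZbExporter Exporter { get; set; } = ZbExporter.Prometheus;
    public string? OtlpEndpoint { get; set; }
}

public static class ZbTelemetryOptionsValidator
{
    // Throws when a documented "Required" rule is violated.
    public static void Validate(ZbTelemetryOptions options)
    {
        if (string.IsNullOrWhiteSpace(options.ServiceName))
            throw new InvalidOperationException("ServiceName is required.");

        // OtlpEndpoint only matters when the OTLP exporter is selected.
        if (options.Exporter == ZbExporter.Otlp &&
            !Uri.TryCreate(options.OtlpEndpoint, UriKind.Absolute, out _))
        {
            throw new InvalidOperationException(
                "OtlpEndpoint must be an absolute URI when Exporter is Otlp.");
        }
    }
}
```

Under these assumptions, an options object with `Exporter = ZbExporter.Otlp` and no endpoint fails at startup instead of silently exporting nowhere.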
## 5. Serilog logging stack
`AddZbSerilog` is a companion extension in the `.Serilog` package. It replaces each
project's bespoke logging bootstrap with a shared two-stage pattern:
**Stage 1 (bootstrap logger):** a minimal `Log.Logger` for startup errors before the
`IConfiguration` is available. Writes to console only.
**Stage 2 (application logger):** reads sinks and overrides from `IConfiguration`
(`ReadFrom.Configuration`) and applies a set of fixed enrichers:
| Enricher | Property name | Source |
|---|---|---|
| `ZbLogEnricherNames.SiteId` | `"SiteId"` | `ZbTelemetryOptions.SiteId` |
| `ZbLogEnricherNames.NodeRole` | `"NodeRole"` | `ZbTelemetryOptions.NodeRole` |
| `ZbLogEnricherNames.NodeHostname` | `"NodeHostname"` | `Environment.MachineName` |
| `TraceContextEnricher` | `trace_id`, `span_id` | `Activity.Current` |
| `RedactionEnricher` | _(project-defined fields)_ | `ILogRedactor` implementation |
`SiteId` and `NodeRole` are bound from the same `ZbTelemetryOptions` object as the OTel
`Resource`, and `NodeHostname` reads `Environment.MachineName`, the same source as
`host.name`, so logs and metrics/traces carry identical dimensions. When no
`Activity.Current` is present (e.g. background services, startup), `TraceContextEnricher`
emits nothing; it does not inject empty or zero values.
`MinimumLevel` is set explicitly in code (default `Information`) and can be overridden via
`IConfiguration` (`Serilog:MinimumLevel`). Sinks are fully config-driven:
`ReadFrom.Configuration` reads `Serilog:WriteTo` from `appsettings.json` / environment.
OTel log export is wired in the same call: logs flow through the OTel pipeline with the
same `Resource` attached, making all three signals (metrics / traces / logs) available in a
single backend.
## 6. Redaction seam — `ILogRedactor`
`ILogRedactor` is a single-method interface that receives the mutable log-event property
dictionary and scrubs any fields that must not leave the process:
```csharp
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}
```
`RedactionEnricher` applies a registered `ILogRedactor` on every log event. The seam is
shared; the **policy** is per-project (which field names, which command types, which
classification levels). MxGateway's existing `GatewayLogRedactor` is the reference
implementation; it migrates to this seam during adoption. If no `ILogRedactor` is
registered, `RedactionEnricher` is a no-op.
This preserves the operational property MxGateway already has (secrets never leave the
process in log events) while making the plumbing reusable.
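
To make the seam-versus-policy split concrete, here is a minimal sketch of what a project-side policy could look like. It is not `GatewayLogRedactor`, whose rules are not reproduced in this spec: `DenyListRedactor`, the `"***"` mask, and matching on property name alone are all hypothetical choices, and the interface is repeated only to keep the block self-contained.

```csharp
using System;
using System.Collections.Generic;

public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

// Hypothetical policy: mask the value of any property whose name is on a deny-list.
public sealed class DenyListRedactor : ILogRedactor
{
    private const string Mask = "***";
    private readonly HashSet<string> _sensitiveNames;

    public DenyListRedactor(IEnumerable<string> sensitiveNames) =>
        _sensitiveNames = new HashSet<string>(sensitiveNames, StringComparer.OrdinalIgnoreCase);

    public void Redact(IDictionary<string, object?> properties)
    {
        // Snapshot the keys first so the dictionary is not modified while it is enumerated.
        foreach (var key in new List<string>(properties.Keys))
        {
            if (_sensitiveNames.Contains(key))
                properties[key] = Mask;
        }
    }
}
```

A project would register its own implementation in DI as `ILogRedactor`; the shared `RedactionEnricher` then applies it without knowing which names are sensitive.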
## 7. Per-project migration
| Project | Current state | Primary gaps | What normalizes |
|---|---|---|---|
| **OtOpcUa** | Full OTel SDK (`WithMetrics` + `WithTracing`); Prometheus `/metrics`; Serilog bootstrap; 7 instruments + 2 spans. | No `Resource` / `service.name` anywhere; no trace↔log correlation; no `SiteId`/`NodeRole` enrichers. | Call `AddZbTelemetry` (adds Resource; consolidates standard instrumentation); call `AddZbSerilog` (adds `TraceContextEnricher` + identity enrichers); migrate existing Serilog bootstrap to shared two-stage pattern. |
| **MxGateway** | Hand-rolled `GatewayMetrics` (13 counters / 3 histograms `ms` / 4 gauges); in-memory snapshot only, no export; MEL logging with `GatewayLogScope` correlation + `GatewayLogRedactor`; no OTel SDK. | No OTel SDK; no export; `ms` histograms diverge from OTel semconv (`s`); MEL → Serilog migration; no Resource. | Call `AddZbTelemetry` (wires OTel SDK around existing `GatewayMetrics`, which finally exports); call `AddZbSerilog` (replaces MEL; re-expresses `GatewayLogScope` as `LogContext.PushProperty`; moves `GatewayLogRedactor` behind `ILogRedactor`). Duration unit convergence (`ms` → `s`) tracked in GAPS. **This is the one adoption done now.** |
| **ScadaBridge** | `OpenTelemetry.Api` ref only (dangling — CVE-patch origin, zero usage); Serilog bootstrap (`LoggerConfigurationFactory`) with `SiteId`/`NodeRole`/`NodeHostname` enrichers. | No OTel SDK; no metrics; no tracing; no export; no trace↔log correlation. ScadaBridge's enricher property names are already the target names — migration is additive. | Call `AddZbTelemetry` (adds OTel SDK + metrics + traces + export); call `AddZbSerilog` (consolidates `LoggerConfigurationFactory`; adds `TraceContextEnricher`). |
> The MxGateway logging migration (`MEL → Serilog`, re-expressing `GatewayLogRedactor`
> behind `ILogRedactor`) is the **only sister-repo touch in scope for this release**. OtOpcUa
> and ScadaBridge adoption is deferred to the follow-on tracked in
> [`../GAPS.md`](../GAPS.md).
## 8. Acceptance (what "converged" means)
A project is converged when: (a) it calls `builder.AddZbTelemetry(o => ...)` with all
required Resource attributes populated; (b) it calls `app.MapZbMetrics()` (or configures
OTLP); (c) it calls `builder.AddZbSerilog(...)` and the `TraceContextEnricher` stamps
`trace_id`/`span_id` on every log event emitted under an active `Activity`; (d) its
`ILogRedactor` implementation (if applicable) is registered and applied by `RedactionEnricher`;
(e) every node in the fleet is distinguishable by `service.name` + `site.id` + `node.role`
in a collector or log aggregator.