scadaproj/components/observability/spec/SPEC.md
Joseph Doherty 7d243890ed docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract
Author the three normalization docs for the observability component:
- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project),
  AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline,
  exporter conventions, Serilog two-stage bootstrap with identity enrichers and
  TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and
  acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app
  namespace; MxGateway.Server flagged as convergence target), instrument naming pattern
  (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms
  flagged), Resource attribute set table, standard instrumentation baseline, and per-app
  instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms
  / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two
  packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder +
  IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog,
  ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher.
  Consumer matrix and open contract questions included.
2026-06-01 07:19:38 -04:00


Observability — normalized target spec

Status: Draft. This is the single design the sister projects converge on, derived from the three code-verified current-state docs (../current-state/). The goal is a path to shared code (../shared-contract/ZB.MOM.WW.Telemetry.md), so each normalized section maps to a shared library seam.

0. Scope

Normalized here: one OpenTelemetry bootstrap across all three signals (metrics + traces + logs) via a single AddZbTelemetry extension; the shared Resource attribute set (service.name / service.namespace / service.version / site.id / node.role / host.name) that makes every node distinguishable in a collector; standard instrumentation everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap with identity enrichers (SiteId, NodeRole, NodeHostname) bound from the same options object as the OTel Resource (metrics and logs therefore carry identical dimensions); a TraceContextEnricher that stamps trace_id/span_id from Activity.Current onto every Serilog event, enabling log↔trace correlation; an ILogRedactor redaction seam.

Explicitly NOT normalized (domain-specific — keep per project): each app's actual instruments — otopcua.* meters and spans, mxgateway.* counters/histograms/gauges — they are registered through the shared bootstrap but their names and semantics remain bespoke (see METRIC-CONVENTIONS.md §4); the redaction policy (which field names, which command types) — only the ILogRedactor seam is shared, each project supplies its own implementation; the MxGateway net48 x86 worker's IWorkerLogger (stderr key=value format, out-of-process, out of scope).

1. OpenTelemetry pipeline — AddZbTelemetry

A single IHostApplicationBuilder extension is the front door for all three OTel signals. It wires the shared Resource, registers standard instrumentation, and configures the selected exporter:

builder.AddZbTelemetry(o =>
{
    o.ServiceName      = "mxgateway";          // populates Resource service.name
    o.ServiceNamespace = "ZB.MOM.WW";          // constant across the fleet (default)
    o.ServiceVersion   = "1.0.0";              // literal for illustration; populate from AssemblyInformationalVersion
    o.SiteId           = cfg.SiteId;           // Resource site.id + Serilog SiteId property
    o.NodeRole         = cfg.NodeRole;         // Resource node.role + Serilog NodeRole property
    o.Meters           = ["MxGateway.Server"]; // app's own Meter name(s)
    o.ActivitySources  = ["MxGateway.Server"]; // app's own ActivitySource name(s)
    o.Exporter         = ZbExporter.Prometheus; // default; ZbExporter.Otlp opt-in
    // o.OtlpEndpoint  = "http://collector:4317"; // required when Exporter = Otlp
});

var app = builder.Build();
app.MapZbMetrics(); // mounts the Prometheus /metrics scrape endpoint

This is the headline fix: nobody in the fleet sets a Resource or service.name today, making every node indistinguishable in a collector. Every project must call AddZbTelemetry to be observable.

2. Shared Resource

The OTel Resource attached to all three signals is built from ZbTelemetryOptions:

| OTel attribute | Options property | Notes |
|---|---|---|
| service.name | ServiceName | Required. Lower-case short identifier (otopcua, mxgateway, scadabridge) |
| service.namespace | ServiceNamespace | Default "ZB.MOM.WW"; constant across the fleet |
| service.version | ServiceVersion | Optional; recommend populating from AssemblyInformationalVersion |
| site.id | SiteId | Optional; identifies the physical/logical site |
| node.role | NodeRole | Optional; e.g. "central", "site", "hub" |
| host.name | (auto) | Always populated from Environment.MachineName |

The same SiteId and NodeRole values are passed to the Serilog enrichers (§5) so a metric, a span, and a log line from the same node carry identical dimensions and join up in any OTel-compatible backend.
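The mapping in the table above can be sketched without any OpenTelemetry dependency. This is a minimal illustration, not the contract's ZbResource.Build: `TelemetryOptionsSketch` and `ResourceSketch` are hypothetical stand-ins, the site value is made up, and the choice to omit unset optional attributes (rather than emit empty strings) is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Demo: build the attribute set for an illustrative MxGateway site node.
var attrs = ResourceSketch.BuildAttributes(new TelemetryOptionsSketch
{
    ServiceName = "mxgateway",
    SiteId = "site-01",   // illustrative value, not from the spec
    NodeRole = "site",
});

foreach (var (key, value) in attrs)
    Console.WriteLine($"{key} = {value}");

// Hypothetical stand-in for ZbTelemetryOptions; only the Resource-related
// properties named in the table are modelled.
public sealed class TelemetryOptionsSketch
{
    public string ServiceName { get; init; } = "";
    public string ServiceNamespace { get; init; } = "ZB.MOM.WW";
    public string? ServiceVersion { get; init; }
    public string? SiteId { get; init; }
    public string? NodeRole { get; init; }
}

public static class ResourceSketch
{
    public static IReadOnlyDictionary<string, object> BuildAttributes(TelemetryOptionsSketch o)
    {
        // service.name is the only attribute the table marks as required.
        if (string.IsNullOrWhiteSpace(o.ServiceName))
            throw new ArgumentException("ServiceName is required.", nameof(o));

        var attributes = new Dictionary<string, object>
        {
            ["service.name"] = o.ServiceName,
            ["service.namespace"] = o.ServiceNamespace,
            ["host.name"] = Environment.MachineName, // always populated
        };

        // Assumption: unset optional attributes are left out entirely.
        if (!string.IsNullOrEmpty(o.ServiceVersion)) attributes["service.version"] = o.ServiceVersion;
        if (!string.IsNullOrEmpty(o.SiteId)) attributes["site.id"] = o.SiteId;
        if (!string.IsNullOrEmpty(o.NodeRole)) attributes["node.role"] = o.NodeRole;

        return attributes;
    }
}
```

In a real implementation the same key/value pairs would presumably be handed to the OTel SDK's resource builder; the dictionary form here only makes the attribute names and optionality explicit.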

3. Standard instrumentation

AddZbTelemetry enables the following instrumentation for all projects. Any project that already enables a subset gets it consolidated; no project may skip this baseline:

| Instrumentation | Package | Signal |
|---|---|---|
| ASP.NET Core | OpenTelemetry.Instrumentation.AspNetCore | Traces + Metrics |
| HttpClient | OpenTelemetry.Instrumentation.Http | Traces + Metrics |
| gRPC client | OpenTelemetry.Instrumentation.GrpcNetClient | Traces |
| .NET runtime | OpenTelemetry.Instrumentation.Runtime | Metrics |
| Process | OpenTelemetry.Instrumentation.Process | Metrics |

App-specific Meter names and ActivitySource names are registered via o.Meters and o.ActivitySources. This is how MxGateway's hand-rolled GatewayMetrics finally gets an export path instead of dying in an in-memory GetSnapshot().

4. Exporter conventions

ZbTelemetryOptions.Exporter selects the export path:

| Value | Behaviour |
|---|---|
| ZbExporter.Prometheus | Mounts a Prometheus /metrics scrape endpoint via app.MapZbMetrics(). Default for all three apps, consistent with OtOpcUa's existing /metrics. |
| ZbExporter.Otlp | Exports to an OTLP endpoint specified by o.OtlpEndpoint (gRPC, http://collector:4317). Opt-in path to a real OTel Collector; coexists with Prometheus. |

Both exporters carry the shared Resource. OTLP is the path to a real backend (Tempo, Prometheus-remote-write, Loki); Prometheus covers the "scrape from the node" case that all three apps currently use or aspire to.
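The rule that OtlpEndpoint is required when the exporter is OTLP lends itself to a startup check. The sketch below is one possible shape for it, assuming nothing about how the shared package validates options: `ZbExporterSketch` and `ExporterValidation` are hypothetical names, and the http(s) scheme restriction is an assumption.

```csharp
using System;

// Demo: Prometheus needs no endpoint; Otlp without one is rejected.
Console.WriteLine(ExporterValidation.Validate(ZbExporterSketch.Prometheus, null) ?? "ok");
Console.WriteLine(ExporterValidation.Validate(ZbExporterSketch.Otlp, null) ?? "ok");
Console.WriteLine(ExporterValidation.Validate(ZbExporterSketch.Otlp, "http://collector:4317") ?? "ok");

// Hypothetical mirror of the ZbExporter enum described in this section.
public enum ZbExporterSketch { Prometheus, Otlp }

public static class ExporterValidation
{
    // Returns null when the combination is valid, otherwise an error message.
    public static string? Validate(ZbExporterSketch exporter, string? otlpEndpoint)
    {
        // The endpoint only matters on the OTLP path.
        if (exporter != ZbExporterSketch.Otlp) return null;

        if (string.IsNullOrWhiteSpace(otlpEndpoint))
            return "OtlpEndpoint is required when Exporter = Otlp.";

        // Assumption: only absolute http/https URIs are accepted.
        if (!Uri.TryCreate(otlpEndpoint, UriKind.Absolute, out var uri)
            || (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
            return $"OtlpEndpoint '{otlpEndpoint}' is not an absolute http(s) URI.";

        return null;
    }
}
```

Failing at startup rather than at first export keeps a misconfigured node from running silently without telemetry.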

5. Serilog logging stack

AddZbSerilog is a companion extension in the .Serilog package. It replaces each project's bespoke logging bootstrap with a shared two-stage pattern:

Stage 1 (bootstrap logger): a minimal Log.Logger that captures startup errors before IConfiguration is available. Writes to console only.

Stage 2 (application logger): reads sinks and overrides from IConfiguration (ReadFrom.Configuration) and applies a set of fixed enrichers:

| Enricher | Property name | Source |
|---|---|---|
| ZbLogEnricherNames.SiteId | "SiteId" | ZbTelemetryOptions.SiteId |
| ZbLogEnricherNames.NodeRole | "NodeRole" | ZbTelemetryOptions.NodeRole |
| ZbLogEnricherNames.NodeHostname | "NodeHostname" | Environment.MachineName |
| TraceContextEnricher | trace_id, span_id | Activity.Current |
| RedactionEnricher | (project-defined fields) | ILogRedactor implementation |

The three identity properties (SiteId, NodeRole, NodeHostname) are bound from the same ZbTelemetryOptions object as the OTel Resource, so logs and metrics/traces carry identical dimensions. When no Activity.Current is present (e.g. background services, startup), TraceContextEnricher emits nothing — it does not inject empty or zero values.
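The "emit nothing when there is no activity" behaviour can be illustrated with System.Diagnostics alone. This is a dependency-free sketch, not the shared package's TraceContextEnricher: a real enricher would plug into Serilog's enricher interface, whereas the hypothetical `TraceContextSketch.Stamp` below writes into a plain dictionary that stands in for the log event's properties.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Outside any activity: the property bag stays empty.
var outside = new Dictionary<string, object?>();
TraceContextSketch.Stamp(outside);
Console.WriteLine($"properties without an activity: {outside.Count}");

// Inside a started activity: both identifiers are stamped.
using (var activity = new Activity("demo-operation"))
{
    activity.SetIdFormat(ActivityIdFormat.W3C); // W3C ids: 32-hex trace id, 16-hex span id
    activity.Start();                            // makes this Activity.Current

    var inside = new Dictionary<string, object?>();
    TraceContextSketch.Stamp(inside);
    Console.WriteLine($"trace_id={inside["trace_id"]} span_id={inside["span_id"]}");
}

public static class TraceContextSketch
{
    // Adds trace_id/span_id from the ambient activity, or does nothing at all.
    public static void Stamp(IDictionary<string, object?> properties)
    {
        var activity = Activity.Current;
        if (activity is null) return; // no empty or all-zero placeholders

        properties["trace_id"] = activity.TraceId.ToString();
        properties["span_id"] = activity.SpanId.ToString();
    }
}
```

Leaving the properties absent, instead of writing zero ids, means a backend query for events that have a trace_id returns only events that can actually be joined to a span.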

MinimumLevel is set explicitly in code (default Information) and can be overridden via IConfiguration (Serilog:MinimumLevel). Sinks are fully config-driven: ReadFrom.Configuration reads Serilog:WriteTo from appsettings.json / environment.

OTel log export is wired in the same call: logs flow through the OTel pipeline with the same Resource attached, making all three signals (metrics / traces / logs) available in a single backend.

6. Redaction seam — ILogRedactor

ILogRedactor is a single-method interface that receives the mutable log-event property dictionary and scrubs any fields that must not leave the process:

public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

RedactionEnricher applies a registered ILogRedactor on every log event. The seam is shared; the policy is per-project (which field names, which command types, which classification levels). MxGateway's existing GatewayLogRedactor is the reference implementation; it migrates to this seam during adoption. If no ILogRedactor is registered, RedactionEnricher is a no-op.

This preserves the operational property MxGateway already has (secrets never leave the process in log events) while making the plumbing reusable.
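As an illustration of a per-project policy behind the seam, the sketch below masks properties by field name. It is not MxGateway's GatewayLogRedactor: `FieldNameRedactor`, the mask string, the case-insensitive matching, and the sample field names are all assumptions. The interface is repeated only so the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Demo: two sensitive fields are masked, the rest pass through untouched.
var properties = new Dictionary<string, object?>
{
    ["CommandType"] = "WriteTag",
    ["Password"] = "hunter2",
    ["apiKey"] = "abc123",
};

ILogRedactor redactor = new FieldNameRedactor(new[] { "Password", "ApiKey" });
redactor.Redact(properties);

foreach (var (name, value) in properties)
    Console.WriteLine($"{name} = {value}");

// Same shape as the interface shown in this section.
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

// Hypothetical policy: mask any property whose name is on a deny list.
public sealed class FieldNameRedactor : ILogRedactor
{
    private const string Mask = "***";
    private readonly HashSet<string> _sensitive;

    public FieldNameRedactor(IEnumerable<string> sensitiveFieldNames) =>
        _sensitive = new HashSet<string>(sensitiveFieldNames, StringComparer.OrdinalIgnoreCase);

    public void Redact(IDictionary<string, object?> properties)
    {
        // Snapshot the keys first: writing through the indexer while enumerating
        // can invalidate the enumerator on some dictionary implementations.
        foreach (var key in new List<string>(properties.Keys))
        {
            if (_sensitive.Contains(key))
                properties[key] = Mask;
        }
    }
}
```

Masking in place keeps the field name visible in the log event, so operators can still see that a value was present without seeing the value itself.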

7. Per-project migration

| Project | Current state | Primary gaps | What normalizes |
|---|---|---|---|
| OtOpcUa | Full OTel SDK (WithMetrics + WithTracing); Prometheus /metrics; Serilog bootstrap; 7 instruments + 2 spans. | No Resource / service.name anywhere; no trace↔log correlation; no SiteId/NodeRole enrichers. | Call AddZbTelemetry (adds Resource; consolidates standard instrumentation); call AddZbSerilog (adds TraceContextEnricher + identity enrichers); migrate existing Serilog bootstrap to shared two-stage pattern. |
| MxGateway | Hand-rolled GatewayMetrics (13 counters / 3 histograms ms / 4 gauges); in-memory snapshot only, no export; MEL logging with GatewayLogScope correlation + GatewayLogRedactor; no OTel SDK. | No OTel SDK; no export; ms histograms diverge from OTel semconv (s); MEL → Serilog migration; no Resource. | Call AddZbTelemetry (wires OTel SDK around existing GatewayMetrics, which finally exports); call AddZbSerilog (replaces MEL; re-expresses GatewayLogScope as LogContext.PushProperty; moves GatewayLogRedactor behind ILogRedactor). Duration unit convergence (ms → s) tracked in GAPS. This is the one adoption done now. |
| ScadaBridge | OpenTelemetry.Api ref only (dangling; CVE-patch origin, zero usage); Serilog bootstrap (LoggerConfigurationFactory) with SiteId/NodeRole/NodeHostname enrichers. | No OTel SDK; no metrics; no tracing; no export; no trace↔log correlation. | ScadaBridge's enricher property names are already the target names, so migration is additive. Call AddZbTelemetry (adds OTel SDK + metrics + traces + export); call AddZbSerilog (consolidates LoggerConfigurationFactory; adds TraceContextEnricher). |

The MxGateway logging migration (MEL → Serilog, re-expressing GatewayLogRedactor behind ILogRedactor) is the only sister-repo touch in scope for this release. OtOpcUa and ScadaBridge adoption is deferred to the follow-on tracked in ../GAPS.md.

8. Acceptance (what "converged" means)

A project is converged when: (a) it calls builder.AddZbTelemetry(o => ...) with all required Resource attributes populated; (b) it calls app.MapZbMetrics() (or configures OTLP); (c) it calls builder.AddZbSerilog(...) and the TraceContextEnricher stamps trace_id/span_id on every log event emitted under an active Activity; (d) its ILogRedactor implementation (if applicable) is registered and applied by RedactionEnricher; (e) every node in the fleet is distinguishable by service.name + site.id + node.role in a collector or log aggregator.