C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).
C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).
C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.
m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.
I4: §5 standard instrumentation table corrected — OtOpcUa now shows ⛔ not added for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.
I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).
I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.
I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.
m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
Observability — normalized target spec
Status: Draft. The single design the sister projects converge on. Derived from the
three code-verified current-state docs (../current-state/). The goal is a path to shared code
(../shared-contract/ZB.MOM.WW.Telemetry.md), so each normalized section maps to a shared
library seam.
0. Scope
Normalized here: one OpenTelemetry bootstrap across all three signals (metrics + traces +
logs) via a single AddZbTelemetry extension; the shared Resource attribute set
(service.name / service.namespace / service.version / site.id / node.role /
host.name) that makes every node distinguishable in a collector; standard instrumentation
everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter
conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap
with identity enrichers (SiteId, NodeRole from ZbTelemetryOptions; NodeHostname auto
from Environment.MachineName) matching the OTel Resource dimensions (metrics and logs
therefore carry identical dimensions); a
TraceContextEnricher that stamps trace_id/span_id from Activity.Current onto every
Serilog event, enabling log↔trace correlation; an ILogRedactor redaction seam.
Explicitly NOT normalized (domain-specific — keep per project): each app's actual
instruments — otopcua.* meters and spans, mxgateway.* counters/histograms/gauges — they
are registered through the shared bootstrap but their names and semantics remain
bespoke (see METRIC-CONVENTIONS.md §4); the redaction policy
(which field names, which command types) — only the ILogRedactor seam is shared, each
project supplies its own implementation; the MxGateway net48 x86 worker's IWorkerLogger
(stderr key=value format, out-of-process, out of scope).
1. OpenTelemetry pipeline — AddZbTelemetry
A single IHostApplicationBuilder extension is the front door for all three OTel signals.
It wires the shared Resource, registers standard instrumentation, and configures the
selected exporter:
```csharp
builder.AddZbTelemetry(o =>
{
    o.ServiceName      = "mxgateway";           // populates Resource service.name
    o.ServiceNamespace = "ZB.MOM.WW";           // constant across the fleet (default)
    o.ServiceVersion   = "1.0.0";               // populated from AssemblyInformationalVersion
    o.SiteId           = cfg.SiteId;            // Resource site.id + Serilog SiteId property
    o.NodeRole         = cfg.NodeRole;          // Resource node.role + Serilog NodeRole property
    o.Meters           = ["MxGateway.Server"];  // app's own Meter name(s)
    o.ActivitySources  = ["MxGateway.Server"];  // app's own ActivitySource name(s)
    o.Exporter         = ZbExporter.Prometheus; // default; ZbExporter.Otlp opt-in
    // o.OtlpEndpoint  = "http://collector:4317"; // required when Exporter = Otlp
});

app.MapZbMetrics(); // mounts Prometheus /metrics scrape endpoint
```
This is the headline fix: nobody in the fleet sets a Resource or service.name today,
making every node indistinguishable in a collector. Every project must call AddZbTelemetry
to be observable.
IServiceCollection overload: AddZbTelemetry also has an IServiceCollection-based overload for host configurations where IHostApplicationBuilder is not available (detailed in the shared-contract). The IHostApplicationBuilder overload is the primary path for all three apps on .NET 10.
2. Shared Resource
The OTel Resource attached to all three signals is built from ZbTelemetryOptions:
| OTel attribute | Options property | Notes |
|---|---|---|
| service.name | ServiceName | Required. Lower-case short identifier (otopcua, mxgateway, scadabridge) |
| service.namespace | ServiceNamespace | Default "ZB.MOM.WW"; constant across the fleet |
| service.version | ServiceVersion | Optional; recommend populating from AssemblyInformationalVersion |
| site.id | SiteId | Optional; identifies the physical/logical site |
| node.role | NodeRole | Optional; e.g. "central", "site", "hub" |
| host.name | (auto) | Always populated from Environment.MachineName |
The same SiteId and NodeRole values are passed to the Serilog enrichers (§4) so a
metric, a span, and a log line from the same node carry identical dimensions and join up in
any OTel-compatible backend.
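As a rough sketch of how the table above could map onto OpenTelemetry .NET SDK calls, the snippet below builds a ResourceBuilder from an options object. ZbTelemetryOptionsSketch and ZbResourceSketch.Build are hypothetical stand-ins, since the real ZbTelemetryOptions shape is defined in the shared contract; ResourceBuilder.CreateDefault, AddService and AddAttributes come from the OpenTelemetry package. Omitting unset optional attributes (rather than emitting empty strings) is an assumption, not something this spec states.

```csharp
using System;
using System.Collections.Generic;
using OpenTelemetry.Resources;

// Hypothetical stand-in for ZbTelemetryOptions; only the Resource-related properties.
public sealed record ZbTelemetryOptionsSketch(
    string ServiceName,
    string ServiceNamespace = "ZB.MOM.WW",
    string? ServiceVersion = null,
    string? SiteId = null,
    string? NodeRole = null);

public static class ZbResourceSketch
{
    // Maps the options onto the attribute names listed in the table.
    public static ResourceBuilder Build(ZbTelemetryOptionsSketch o)
    {
        var attributes = new List<KeyValuePair<string, object>>
        {
            // host.name is always set and never caller-supplied.
            new("host.name", Environment.MachineName),
        };

        // Assumption: optional attributes are left out when not configured.
        if (!string.IsNullOrWhiteSpace(o.SiteId))
            attributes.Add(new("site.id", o.SiteId));
        if (!string.IsNullOrWhiteSpace(o.NodeRole))
            attributes.Add(new("node.role", o.NodeRole));

        return ResourceBuilder.CreateDefault()
            .AddService(
                serviceName: o.ServiceName,
                serviceNamespace: o.ServiceNamespace,
                serviceVersion: o.ServiceVersion)
            .AddAttributes(attributes);
    }
}
```

The same builder instance would then be handed to the metrics, tracing and logging providers so that all three signals share one Resource.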
3. Standard instrumentation
AddZbTelemetry enables the following instrumentation for all projects. Any project that
already enables a subset gets it consolidated; no project may skip this baseline:
| Instrumentation | Package | Signal |
|---|---|---|
| ASP.NET Core | OpenTelemetry.Instrumentation.AspNetCore | Traces + Metrics |
| HttpClient | OpenTelemetry.Instrumentation.Http | Traces + Metrics |
| gRPC client | OpenTelemetry.Instrumentation.GrpcNetClient | Traces |
| .NET runtime | OpenTelemetry.Instrumentation.Runtime | Metrics |
| Process | OpenTelemetry.Instrumentation.Process | Metrics |
App-specific Meter names and ActivitySource names are registered via o.Meters and
o.ActivitySources. This is how MxGateway's hand-rolled GatewayMetrics finally gets an
export path instead of dying in an in-memory GetSnapshot().
4. Exporter conventions
ZbTelemetryOptions.Exporter selects the export path:
| Value | Behaviour |
|---|---|
| ZbExporter.Prometheus | Mounts a Prometheus /metrics scrape endpoint via app.MapZbMetrics(). Default for all three apps, consistent with OtOpcUa's existing /metrics. |
| ZbExporter.Otlp | Exports to an OTLP endpoint specified by o.OtlpEndpoint (gRPC, http://collector:4317). Opt-in path to a real OTel Collector; coexists with Prometheus. |
Both exporters carry the shared Resource. OTLP is the path to a real backend (Tempo,
Prometheus-remote-write, Loki); Prometheus covers the "scrape from the node" case that all
three apps currently use or aspire to.
5. Serilog logging stack
AddZbSerilog is a companion extension in the .Serilog package. It replaces each
project's bespoke logging bootstrap with a shared two-stage pattern:
Stage 1 (bootstrap logger): a minimal Log.Logger for startup errors before the
IConfiguration is available. Writes to console only.
Stage 2 (application logger): reads sinks and overrides from IConfiguration
(ReadFrom.Configuration) and applies a set of fixed enrichers:
| Enricher | Property name | Source |
|---|---|---|
| ZbLogEnricherNames.SiteId | "SiteId" | ZbTelemetryOptions.SiteId |
| ZbLogEnricherNames.NodeRole | "NodeRole" | ZbTelemetryOptions.NodeRole |
| ZbLogEnricherNames.NodeHostname | "NodeHostname" | Environment.MachineName |
| TraceContextEnricher | trace_id, span_id | Activity.Current |
| RedactionEnricher | (project-defined fields) | ILogRedactor implementation |
SiteId and NodeRole are bound from the same ZbTelemetryOptions object as the OTel
Resource; NodeHostname is populated automatically from Environment.MachineName (not a
caller-supplied option). All three identity properties appear on logs and metrics/traces alike,
so signals from the same node carry identical dimensions. When no Activity.Current is present (e.g. background services,
startup), TraceContextEnricher emits nothing — it does not inject empty or zero values.
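The behaviour described above can be illustrated with a minimal Serilog enricher. This is a sketch only: the class name TraceContextEnricherSketch is hypothetical and the shared library's actual TraceContextEnricher may differ. It assumes activities use the W3C id format (the default on modern .NET), so TraceId and SpanId are meaningful hex values. ILogEventEnricher, LogEvent.AddPropertyIfAbsent and ILogEventPropertyFactory.CreateProperty are standard Serilog APIs.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Illustrative enricher: stamps trace_id/span_id only when an Activity is ambient.
public sealed class TraceContextEnricherSketch : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            // No ambient Activity: add nothing rather than empty or all-zero ids.
            return;
        }

        // W3C hex ids: 32 characters for the trace id, 16 for the span id.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```

An enricher like this would be attached with Enrich.With<TraceContextEnricherSketch>() on the LoggerConfiguration; AddPropertyIfAbsent means a trace_id already pushed by the caller is left untouched.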
MinimumLevel is set explicitly in code (default Information) and can be overridden via
IConfiguration (Serilog:MinimumLevel). Sinks are fully config-driven:
ReadFrom.Configuration reads Serilog:WriteTo from appsettings.json / environment.
Per-project config paths:
AddZbSerilog reads Serilog:MinimumLevel from IConfiguration. Callers that bind MinimumLevel from a different key (e.g. ScadaBridge's ScadaBridge:Logging:MinimumLevel) apply that override themselves before or after calling AddZbSerilog. The config key for MinimumLevel remains per-project; AddZbSerilog is not parameterized on it.
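For orientation, the two-stage pattern might look roughly like the following when written directly against Serilog, without the shared extension. This is a hedged sketch, not the AddZbSerilog implementation: it assumes the Serilog.Extensions.Hosting, Serilog.Settings.Configuration and Serilog.Sinks.Console packages, and the hard-coded enricher is a placeholder for the identity enrichers listed above.

```csharp
using System;
using Microsoft.Extensions.Hosting;
using Serilog;

// Stage 1: console-only bootstrap logger, usable before configuration is loaded.
Log.Logger = new LoggerConfiguration()
    .WriteTo.Console()
    .CreateBootstrapLogger();

var builder = Host.CreateApplicationBuilder(args);

// Stage 2: application logger. The code default comes first so that a
// Serilog:MinimumLevel value in configuration, applied afterwards, overrides it.
builder.Services.AddSerilog((services, lc) => lc
    .MinimumLevel.Information()
    .ReadFrom.Configuration(builder.Configuration)
    // Placeholder for the fixed identity enrichers (SiteId, NodeRole, NodeHostname).
    .Enrich.WithProperty("NodeHostname", Environment.MachineName));

var host = builder.Build();
```

A project that keeps its level under a different key would read that key itself and apply it with MinimumLevel.Is(...) around this call, which is why the key stays per-project.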
OTel log export is wired in the same call: logs flow through the OTel pipeline with the
same Resource attached, making all three signals (metrics / traces / logs) available in a
single backend.
6. Redaction seam — ILogRedactor
ILogRedactor is a single-method interface that receives the mutable log-event property
dictionary and scrubs any fields that must not leave the process:
```csharp
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}
```
RedactionEnricher applies a registered ILogRedactor on every log event. The seam is
shared; the policy is per-project (which field names, which command types, which
classification levels). MxGateway's existing GatewayLogRedactor is the reference
implementation; it migrates to this seam during adoption. If no ILogRedactor is
registered, RedactionEnricher is a no-op.
This preserves the operational property MxGateway already has (secrets never leave the process in log events) while making the plumbing reusable.
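To make the seam concrete, here is a small example of what a per-project implementation could look like. FieldNameRedactor and its deny-list policy are hypothetical and are not MxGateway's GatewayLogRedactor; only the ILogRedactor interface is taken from this section. The sketch uses the base class library alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Interface as given in this section.
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

// Hypothetical policy: mask the value of any property whose name is on a deny list.
public sealed class FieldNameRedactor : ILogRedactor
{
    private const string Mask = "***";
    private readonly HashSet<string> _sensitive;

    public FieldNameRedactor(IEnumerable<string> sensitiveFieldNames) =>
        _sensitive = new HashSet<string>(sensitiveFieldNames, StringComparer.OrdinalIgnoreCase);

    public void Redact(IDictionary<string, object?> properties)
    {
        // Snapshot the matching keys first so the dictionary is not modified while enumerating.
        foreach (var key in properties.Keys.Where(_sensitive.Contains).ToList())
        {
            properties[key] = Mask;
        }
    }
}
```

Given a property dictionary containing Password and TagName, a redactor built with the deny list { "password" } replaces the Password value with the mask and leaves TagName unchanged; name matching is case-insensitive in this sketch.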
7. Per-project migration
| Project | Current state | Primary gaps | What normalizes |
|---|---|---|---|
| OtOpcUa | Full OTel SDK (WithMetrics + WithTracing); Prometheus /metrics; Serilog bootstrap; 7 instruments + 2 spans. | No Resource / service.name anywhere; no trace↔log correlation; no SiteId/NodeRole enrichers. | Call AddZbTelemetry (adds Resource; consolidates standard instrumentation); call AddZbSerilog (adds TraceContextEnricher + identity enrichers); migrate existing Serilog bootstrap to shared two-stage pattern. |
| MxGateway | Hand-rolled GatewayMetrics (13 counters / 3 histograms ms / 4 gauges); in-memory snapshot only, no export; MEL logging with GatewayLogScope correlation + GatewayLogRedactor; no OTel SDK. | No OTel SDK; no export; ms histograms diverge from OTel semconv (s); MEL → Serilog migration; no Resource. | Call AddZbTelemetry (wires OTel SDK around existing GatewayMetrics, so it finally exports); call AddZbSerilog (replaces MEL; re-expresses GatewayLogScope as LogContext.PushProperty; moves GatewayLogRedactor behind ILogRedactor). Duration unit convergence (ms→s) tracked in GAPS. This is the one adoption done now. |
| ScadaBridge | OpenTelemetry.Api ref only (dangling: CVE-patch origin, zero usage); Serilog bootstrap (LoggerConfigurationFactory) with SiteId/NodeRole/NodeHostname enrichers. | No OTel SDK; no metrics; no tracing; no export; no trace↔log correlation. ScadaBridge's enricher property names are already the target names, so migration is additive. | Call AddZbTelemetry (adds OTel SDK + metrics + traces + export); call AddZbSerilog (consolidates LoggerConfigurationFactory; adds TraceContextEnricher). |
The MxGateway logging migration (MEL → Serilog, re-expressing GatewayLogRedactor behind ILogRedactor) is the only sister-repo touch in scope for this release. OtOpcUa and ScadaBridge adoption is deferred to the follow-on tracked in ../GAPS.md.
8. Acceptance (what "converged" means)
A project is converged when: (a) it calls builder.AddZbTelemetry(o => ...) with all
required Resource attributes populated; (b) it calls app.MapZbMetrics() (or configures
OTLP); (c) it calls builder.AddZbSerilog(...) and the TraceContextEnricher stamps
trace_id/span_id on every log event emitted under an active Activity; (d) its
ILogRedactor implementation (if applicable) is registered and applied by RedactionEnricher;
(e) every node in the fleet is distinguishable by service.name + site.id + node.role
in a collector or log aggregator.