scadaproj/components/observability/current-state/otopcua/CURRENT-STATE.md
Joseph Doherty 7d243890ed docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract
Author the three normalization docs for the observability component:
- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project),
  AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline,
  exporter conventions, Serilog two-stage bootstrap with identity enrichers and
  TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and
  acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app
  namespace; MxGateway.Server flagged as convergence target), instrument naming pattern
  (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms
  flagged), Resource attribute set table, standard instrumentation baseline, and per-app
  instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms
  / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two
  packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder +
  IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog,
  ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher.
  Consumer matrix and open contract questions included.
2026-06-01 07:19:38 -04:00


Observability — current state: OtOpcUa

Repo: ~/Desktop/OtOpcUa. Stack: .NET 10, Akka.NET, OPC UA; solution ZB.MOM.WW.OtOpcUa.slnx. Telemetry code lives in two places: src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ (host-side bootstrap) and src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/ (instruments + enricher). All paths relative to repo root. Verified 2026-06-01.

The most complete observability implementation in the family: OpenTelemetry SDK with both metrics and tracing signals, Prometheus export, Serilog structured logging with a per-session correlation enricher, and a dedicated instrument vocabulary. The one significant gap: no OTel Resource / service.name, so no signal carries a service identity, and a backend cannot tell OtOpcUa's signals apart from one another or from those of other fleet members.

1. Metrics (OpenTelemetry SDK)

Bootstrap — ObservabilityExtensions.cs

src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs:

  • :18 - AddOtOpcUaObservability(IServiceCollection) is the service-registration entry point.
  • :20 - AddOpenTelemetry() wires the OTel SDK.
  • :21-23 - .WithMetrics(b => b.AddMeter(OtOpcUaTelemetry.MeterName).AddPrometheusExporter()): registers the application meter and attaches the Prometheus scrape exporter.
  • :24-25 - .WithTracing(b => b.AddSource(OtOpcUaTelemetry.ActivitySourceName)): registers the application activity source for trace data.
  • No ResourceBuilder call anywhere: service.name, service.namespace, service.version, site.id, and node.role are not set. The OTel SDK falls back to its default Resource, which reports service.name only as an unknown_service placeholder.
  • :36 - MapOtOpcUaMetrics(IEndpointRouteBuilder) maps the Prometheus endpoint.
  • :38 — endpoint path is /metrics.

Program.cs:

  • :138 - builder.Services.AddOtOpcUaObservability()
  • :160 - app.MapOtOpcUaMetrics()

Package refs in csproj: OpenTelemetry.Extensions.Hosting, OpenTelemetry.Exporter.Prometheus.AspNetCore. No OpenTelemetry.Exporter.OpenTelemetryProtocol — OTLP is not available; Prometheus is the only export path.
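The registration described above can be sketched roughly as follows. This is a reconstruction from the bullet points, not the contents of ObservabilityExtensions.cs: the class name, return types, and inlined constants are assumptions, and only the OpenTelemetry calls named in the text (AddOpenTelemetry, WithMetrics, AddMeter, AddPrometheusExporter, WithTracing, AddSource) come from the source. MapPrometheusScrapingEndpoint is the exporter package's standard mapping call and is assumed to be what MapOtOpcUaMetrics wraps.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

// Hypothetical stand-in for ObservabilityExtensions; names and shapes are assumed.
public static class ObservabilityExtensionsSketch
{
    // Stand-ins for OtOpcUaTelemetry.MeterName / ActivitySourceName.
    private const string MeterName = "ZB.MOM.WW.OtOpcUa";
    private const string ActivitySourceName = "ZB.MOM.WW.OtOpcUa";

    public static IServiceCollection AddOtOpcUaObservability(this IServiceCollection services)
    {
        services.AddOpenTelemetry()
            // No ConfigureResource(...) call, so the SDK-default Resource applies.
            .WithMetrics(b => b
                .AddMeter(MeterName)
                .AddPrometheusExporter())
            // The source is registered, but no exporter is attached to tracing.
            .WithTracing(b => b
                .AddSource(ActivitySourceName));
        return services;
    }

    public static IEndpointRouteBuilder MapOtOpcUaMetrics(this IEndpointRouteBuilder endpoints)
    {
        // Serves the Prometheus scrape endpoint at /metrics.
        endpoints.MapPrometheusScrapingEndpoint("/metrics");
        return endpoints;
    }
}
```

The sketch makes both gaps visible in one place: nothing configures a Resource, and the tracing pipeline has a source but no exporter.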

Instruments — OtOpcUaTelemetry.cs

src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs:

  • :19 - MeterName = "ZB.MOM.WW.OtOpcUa" (the Meter the SDK will collect).
  • :20 - ActivitySourceName = "ZB.MOM.WW.OtOpcUa" (the ActivitySource for spans).

Instruments defined (all static readonly on OtOpcUaTelemetry):

| Instrument | Kind | Unit | Subsystem |
| --- | --- | --- | --- |
| otopcua.deploy.applied | Counter<long> | | deploy |
| otopcua.deploy.apply.duration | Histogram<double> | s | deploy |
| otopcua.driver.lifecycle | Counter<long> | | driver |
| otopcua.virtualtag.eval | Counter<long> | | virtual-tag |
| otopcua.scriptedalarm.transition | Counter<long> | | scripted-alarm |
| otopcua.opcua.sink.write | Counter<long> | | opc-ua sink |
| otopcua.redundancy.service_level_change | Counter<long> | | redundancy |

Two activity spans: otopcua.deploy.apply, otopcua.opcua.address_space_rebuild.

Naming convention: otopcua.<subsystem>.<event>. Duration histogram correctly uses unit s (OTel semantic conventions). No standard instrumentation (ASP.NET Core, HttpClient, runtime, gRPC client meters) is wired — only the bespoke application instruments.
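As an illustration of the instrument vocabulary listed above, the following is a minimal sketch that uses only System.Diagnostics types. The instrument and span names come from the list; the class name TelemetrySketch and the RecordDeploy helper are hypothetical and are not taken from OtOpcUaTelemetry.cs.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical stand-in for OtOpcUaTelemetry; only two of the seven instruments are shown.
public static class TelemetrySketch
{
    public const string MeterName = "ZB.MOM.WW.OtOpcUa";
    public const string ActivitySourceName = "ZB.MOM.WW.OtOpcUa";

    private static readonly Meter AppMeter = new(MeterName);
    public static readonly ActivitySource Source = new(ActivitySourceName);

    // Counter without a unit, matching the deploy counter listed above.
    public static readonly Counter<long> DeployApplied =
        AppMeter.CreateCounter<long>("otopcua.deploy.applied");

    // Duration histogram recorded in seconds (unit "s").
    public static readonly Histogram<double> DeployApplyDuration =
        AppMeter.CreateHistogram<double>("otopcua.deploy.apply.duration", unit: "s");

    // Hypothetical helper: times one deploy and records the span, histogram, and counter.
    public static void RecordDeploy(Action apply)
    {
        // StartActivity returns null when no listener is subscribed; 'using' tolerates that.
        using var span = Source.StartActivity("otopcua.deploy.apply");
        long start = Stopwatch.GetTimestamp();
        apply();
        DeployApplyDuration.Record(Stopwatch.GetElapsedTime(start).TotalSeconds);
        DeployApplied.Add(1);
    }
}
```

Recording TotalSeconds as a double is what keeps the histogram aligned with the unit s; a millisecond value would need conversion before Record.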

2. Logging (Serilog)

Bootstrap

Program.cs:

  • :49-52 - two-stage Serilog bootstrap: initial logger for startup, then full UseSerilog(ReadFrom.Configuration). Sinks: Console + rolling file logs/otopcua-.log.
  • :141 - UseSerilogRequestLogging() on the WebApplication.
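A two-stage bootstrap of the kind described above usually looks like the sketch below. It assumes the Serilog.AspNetCore and Serilog.Settings.Configuration packages and is not a copy of Program.cs; the exact sink configuration is assumed to live in appsettings.json.

```csharp
using Microsoft.AspNetCore.Builder;
using Serilog;

// Stage 1: minimal logger so startup failures are captured before configuration loads.
Log.Logger = new LoggerConfiguration()
    .WriteTo.Console()
    .CreateBootstrapLogger();

var builder = WebApplication.CreateBuilder(args);

// Stage 2: replace the bootstrap logger with the fully configured one.
// Sinks (Console, rolling file) are read from configuration rather than code.
builder.Host.UseSerilog((context, services, configuration) => configuration
    .ReadFrom.Configuration(context.Configuration)
    .ReadFrom.Services(services));

var app = builder.Build();

// Emits one summary log event per HTTP request.
app.UseSerilogRequestLogging();

app.Run();
```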

Correlation enricher — LogContextEnricher.cs

src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/LogContextEnricher.cs:

  • :18-36 - Push(driverInstanceId, driverType, capability, correlationId) calls LogContext.PushProperty for four properties:
    • DriverInstanceId — Galaxy driver instance GUID.
    • DriverType — driver type discriminator.
    • CapabilityName — OPC UA capability being exercised.
    • CorrelationId — caller-supplied correlation token.

This enricher is driver-lifecycle-scoped, not request-scoped: it pushes the properties when a driver operation begins and is disposed to pop them when the operation completes.
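The push-then-dispose pattern can be sketched as follows. The class name DriverLogScope, the string parameter types, and the combined disposable are assumptions for illustration; only the four property names and the use of LogContext.PushProperty come from the description above.

```csharp
using System;
using Serilog.Context;

// Hypothetical stand-in for LogContextEnricher.Push; parameter types are assumed.
public static class DriverLogScope
{
    // Pushes the four properties and returns one handle that pops them all.
    public static IDisposable Push(string driverInstanceId, string driverType,
                                   string capability, string correlationId)
    {
        var handles = new[]
        {
            LogContext.PushProperty("DriverInstanceId", driverInstanceId),
            LogContext.PushProperty("DriverType", driverType),
            LogContext.PushProperty("CapabilityName", capability),
            LogContext.PushProperty("CorrelationId", correlationId),
        };
        return new Scope(handles);
    }

    private sealed class Scope(IDisposable[] handles) : IDisposable
    {
        public void Dispose()
        {
            // Pop in reverse order of pushing.
            for (int i = handles.Length - 1; i >= 0; i--) handles[i].Dispose();
        }
    }
}
```

A caller would wrap a driver operation in using (DriverLogScope.Push(...)) { ... }. The pushed properties only reach log events if the logger is configured with Enrich.FromLogContext().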

No trace_id / span_id enricher. Although OtOpcUa creates ActivitySource spans, the active Activity.Current trace context is never pushed onto Serilog's LogContext. A log line emitted during a span cannot be correlated to the span in a backend.
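For illustration, a minimal enricher that would close this gap could look like the sketch below. It is a hypothetical example built on Serilog's ILogEventEnricher interface, not the implementation of the shared library's TraceContextEnricher, and the class name ActivityTraceEnricher is invented here.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Hypothetical enricher; copies the ambient trace context onto each log event.
public sealed class ActivityTraceEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null) return; // no active span: leave the event unchanged

        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```

It would be registered with Enrich.With<ActivityTraceEnricher>() on the LoggerConfiguration. Note that Activity.Current is only set while a started Activity is in scope, which in turn requires a listener (such as the OTel SDK) to be subscribed to the ActivitySource.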

No structural enrichers for service.name / site.id / node.role — these dimensions are absent from every log line. ScadaBridge has these; OtOpcUa does not.

3. Signal summary

| Signal | Provider | Export | Resource / service.name |
| --- | --- | --- | --- |
| Metrics | OTel SDK (Meter + WithMetrics) | Prometheus /metrics | none |
| Traces | OTel SDK (ActivitySource + WithTracing) | none (no exporter configured) | none |
| Logs | Serilog | Console + rolling file | none (no service.name property) |
| Trace↔log correlation | absent (trace_id/span_id not pushed) | | |

Note: WithTracing registers the ActivitySource for collection, but no exporter (OTLP or otherwise) is attached to the tracing pipeline. Spans are created and recorded by the SDK but never shipped anywhere — effectively a no-op in production.

4. Notable design choices

  • Instrument naming follows <app>.<subsystem>.<event> (here otopcua.<subsystem>.<event>) cleanly and consistently; this is the pattern the shared spec codifies as the fleet convention.
  • Duration unit correctly uses s on otopcua.deploy.apply.duration — no conversion needed on adoption; this contrasts with MxAccessGateway's ms histograms.
  • LogContextEnricher is bespoke but valuable — the DriverInstanceId/DriverType/CapabilityName correlation is OtOpcUa-specific domain context; it should survive adoption behind the shared enricher layer.
  • No OTLP path — with no OTLP exporter, OtOpcUa cannot send metrics or traces to a collector (Prometheus is scrape-pull only). This limits operational flexibility.

Adoption plan → ZB.MOM.WW.Telemetry

Replace with shared bootstrap:

  • AddOtOpcUaObservability() → builder.AddZbTelemetry(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; o.Meters = [OtOpcUaTelemetry.MeterName]; o.ActivitySources = [OtOpcUaTelemetry.ActivitySourceName]; }). This adds the missing Resource (gains service.name / service.namespace / service.version / site.id / node.role / host.name on every metric and span). Prometheus /metrics stays the default exporter; OTLP becomes opt-in via options.
  • Add standard instrumentation through AddZbTelemetry options: ASP.NET Core meters, HttpClient, runtime + process meters — none wired today.
  • Fix the tracing no-op: wire an OTLP exporter (or at minimum note that tracing is recorded but not exported); AddZbTelemetry provides OTLP as the opt-in path.
  • MapOtOpcUaMetrics → app.MapZbMetrics() (same /metrics path; shared convention).
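Since AddZbTelemetry is still a paper API, the sketch below shows the plain OpenTelemetry calls that would add the missing pieces listed above: a Resource, standard instrumentation, and an OTLP exporter on the tracing pipeline. It is not the shared library's implementation. The method name AddTelemetryWithResource, its parameters, and the serviceNamespace value are assumptions, and the instrumentation and OTLP calls require packages the project does not reference today (OpenTelemetry.Instrumentation.AspNetCore, .Http, .Runtime, and OpenTelemetry.Exporter.OpenTelemetryProtocol).

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Hypothetical illustration of the raw SDK calls a shared bootstrap might wrap.
public static class ResourceSketch
{
    public static IServiceCollection AddTelemetryWithResource(
        this IServiceCollection services, string siteId, string nodeRole, string version)
    {
        services.AddOpenTelemetry()
            // Resource attributes are stamped on every metric and span.
            .ConfigureResource(r => r
                .AddService(serviceName: "otopcua",
                            serviceNamespace: "ZB.MOM.WW", // assumed value
                            serviceVersion: version)
                .AddAttributes(new KeyValuePair<string, object>[]
                {
                    new("site.id", siteId),
                    new("node.role", nodeRole),
                }))
            .WithMetrics(b => b
                .AddMeter("ZB.MOM.WW.OtOpcUa")
                .AddAspNetCoreInstrumentation()   // standard ASP.NET Core meters
                .AddHttpClientInstrumentation()   // outbound HttpClient meters
                .AddRuntimeInstrumentation()      // .NET runtime meters
                .AddPrometheusExporter())
            .WithTracing(b => b
                .AddSource("ZB.MOM.WW.OtOpcUa")
                .AddOtlpExporter());              // spans are now exported, not dropped
        return services;
    }
}
```

With no arguments, AddOtlpExporter reads its endpoint from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable or falls back to the local default, which fits the opt-in model described above.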

Replace with shared Serilog bootstrap:

  • Serilog bootstrap in Program.cs:49-52 → builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; }). This adds structural SiteId / NodeRole / NodeHostname properties to every log line (currently absent) and wires the TraceContextEnricher so trace_id/span_id appear on log lines emitted during active spans.
  • Console + file sinks continue via ReadFrom.Configuration in appsettings.json — no sink changes needed.
  • UseSerilogRequestLogging() stays.

Keep bespoke:

  • OtOpcUaTelemetry.cs — the application Meter, ActivitySource, and all instrument definitions (otopcua.* counters, histograms, spans). These are domain instruments; AddZbTelemetry registers them by name but does not own them.
  • LogContextEnricher.cs — driver-lifecycle correlation properties (DriverInstanceId, DriverType, CapabilityName, CorrelationId) are OtOpcUa-specific. The enricher continues to push via LogContext.PushProperty alongside the shared enrichers.
  • ObservabilityExtensions.cs itself can be simplified or removed — it becomes a thin wrapper that calls AddZbTelemetry with OtOpcUa-specific options. The per-project entry point remains; only the implementation body is delegated to the shared library.

Adoption is a follow-on task (tracked in GAPS.md), not part of the ZB.MOM.WW.Telemetry library build. The library build delivers the shared bootstrap and enrichers; adoption lands in the OtOpcUa repo as a separate commit once the nupkg is available.