scadaproj/components/observability/GAPS.md
Joseph Doherty fba3d09eed docs(observability): current-state x3 + GAPS + README
Complete the observability normalization component docs:

- components/observability/current-state/otopcua/CURRENT-STATE.md — full
  OTel SDK (metrics + tracing) + Prometheus; 7 otopcua.* instruments + 2
  spans; Serilog with driver-scope LogContextEnricher; no Resource/service.name
  anywhere; tracing pipeline wired but no exporter; adoption plan: AddZbTelemetry
  gains shared Resource + trace↔log correlation; LogContextEnricher kept bespoke.

- components/observability/current-state/mxaccessgw/CURRENT-STATE.md — 20
  hand-rolled instruments (13 counters, 3 histograms ms-unit, 4 gauges) in
  GatewayMetrics.cs; no OTel SDK → metrics never export; MEL logging with
  GatewayLogScope correlation and GatewayLogRedactor; adoption plan: in-pass
  MEL → AddZbSerilog migration (LogContext correlation, ILogRedactor seam) +
  AddZbTelemetry wires OTel SDK so GatewayMetrics finally exports.

- components/observability/current-state/scadabridge/CURRENT-STATE.md —
  OpenTelemetry.Api is a CVE-patch override only (zero instrumentation); Serilog
  with SiteId/NodeRole/NodeHostname enrichers (strongest set in family); adoption
  plan: replace CVE ref with AddZbTelemetry; adopt AddZbSerilog (LoggerConfigurationFactory
  deleted); add first scadabridge.* instruments.

- components/observability/GAPS.md — divergence table across §1 Resource (P1,
  nobody), §2 metrics export (P1, MxGateway invisible), §3 MxGateway MEL→Serilog
  (P1, in-pass done), §4 trace↔log correlation, §5 ms→s unit, §6 Meter naming,
  §7 standard instrumentation, §8 Serilog version, §9 ScadaBridge zero
  instrumentation; 11-item prioritized backlog.

- components/observability/README.md — overview, per-project status table
  (OTel today / metrics / tracing / logging / enrichers / adoption status),
  normalized vs. left-per-project boundary, 2-package structure, component status.
2026-06-01 07:23:08 -04:00


Observability — gaps & adoption backlog

Divergence of each project from spec/SPEC.md, and the ordered backlog to reach the shared ZB.MOM.WW.Telemetry library. Status legend: 🟡 marks a partial match; cells without a marker describe the gap or the match in words.

Divergence vs spec

§1 OTel Resource / service.name (P1 — nobody has it)

| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| service.name | not set | not set | not set |
| service.namespace | not set | not set | not set |
| service.version | not set | not set | not set |
| site.id | not set | not set | not set |
| node.role | not set | not set | not set |
| host.name | not set | not set | not set |

No project configures a ResourceBuilder or equivalent. Every metric and span from every node is indistinguishable in a backend — no service identity, no topology (site/role), no version label. This is the single highest-value gap across the fleet; closing it requires only adding AddZbTelemetry with options.

Gap R1 (P1): All three projects must call AddZbTelemetry with the ServiceName, SiteId, and NodeRole options to populate the shared Resource. None can do so until the library is available.
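As a rough illustration of what populating the shared Resource involves, the sketch below maps caller-supplied options onto the six spec attributes from the table above. `ZbResourceOptions`, `ResourceAttributesSketch`, and the `"ZB.MOM.WW"` namespace default are hypothetical names chosen for this example; the real AddZbTelemetry options type is not shown in this document. The resulting pairs are the shape that OpenTelemetry .NET's `ResourceBuilder.AddAttributes` accepts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical options shape (assumption): only ServiceName, SiteId and
// NodeRole are named in this document; the rest is illustrative.
public sealed record ZbResourceOptions(
    string ServiceName,
    string SiteId,
    string NodeRole,
    string ServiceNamespace = "ZB.MOM.WW", // assumed default, not from the spec
    string? ServiceVersion = null);

public static class ResourceAttributesSketch
{
    // Builds the attribute set listed in the §1 table from caller-supplied values.
    public static IReadOnlyDictionary<string, object> Build(ZbResourceOptions options)
    {
        if (string.IsNullOrWhiteSpace(options.ServiceName))
            throw new ArgumentException("ServiceName is required.", nameof(options));

        var attributes = new Dictionary<string, object>
        {
            ["service.name"] = options.ServiceName,
            ["service.namespace"] = options.ServiceNamespace,
            ["site.id"] = options.SiteId,
            ["node.role"] = options.NodeRole,
            ["host.name"] = Environment.MachineName,
        };

        // service.version is optional here; omit the key rather than emit an empty value.
        if (options.ServiceVersion is not null)
            attributes["service.version"] = options.ServiceVersion;

        return attributes;
    }
}
```

Taking SiteId and NodeRole as plain arguments, rather than reading them from a fixed config section, keeps the sketch project-agnostic: each application resolves the values from its own configuration and passes them in.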

§2 Metrics export (P1 — MxGateway metrics are invisible)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK present | yes (AddOpenTelemetry) | none | none (OpenTelemetry.Api only) |
| Meter registered | ZB.MOM.WW.OtOpcUa | 🟡 MxGateway.Server (not via OTel SDK) | no Meter |
| Prometheus export | /metrics | GetSnapshot() only | none |
| OTLP export | not available | not available | not available |

MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data that is never exported. The data lives only in an in-memory GetSnapshot() read path. Adding AddZbTelemetry with Meters = ["MxGateway.Server"] closes this gap without touching GatewayMetrics.cs.

Gap M1 (P1): MxGateway: wire OTel SDK via AddZbTelemetry so GatewayMetrics exports. This is one half of the in-pass adoption (logging migration is the other).

Gap M2: ScadaBridge: define a ScadaBridgeTelemetry class and first application instruments (scadabridge.*); register via AddZbTelemetry. Currently zero instrumentation.
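The reason the gap can close without touching GatewayMetrics.cs is that instruments on a `Meter` are decoupled from whatever consumes them: measurements recorded while nothing subscribes to the meter name are simply dropped. The sketch below shows that behaviour with the BCL `MeterListener`, which is the same subscription mechanism an SDK relies on. `MeterListenerSketch` and the `demo.commands` counter used in the usage note are hypothetical names for this example, not the gateway's real code.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading;

public static class MeterListenerSketch
{
    // Sums every long measurement published by `meterName` while `work` runs.
    // Anything recorded before the listener starts is never seen.
    public static long CountWhileListening(string meterName, Action work)
    {
        long total = 0;
        using var listener = new MeterListener();

        // Subscribe by meter name only; the instrument definitions stay untouched.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName)
                l.EnableMeasurementEvents(instrument);
        };

        // Callbacks may run concurrently, so accumulate atomically.
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => Interlocked.Add(ref total, value));

        listener.Start();
        work();
        return total;
    }
}
```

For example, a counter created on a `Meter` named `"MxGateway.Server"` that adds 1 before the call and then 2 and 3 inside `work` yields a total of 5: the first measurement had no listener and was lost.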

§3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | Serilog | MEL only | Serilog |
| Structural enrichers | 🟡 driver-scope only | MEL scope (provider-dependent) | SiteId/NodeRole/NodeHostname |
| Correlation mechanism | LogContextEnricher.Push | GatewayLogScope + BeginScope | structural enrichers |
| Log redaction | none | GatewayLogRedactor | none |
| ILogRedactor seam | none | bespoke | none |

Status: done in this task (Task #9). The MxGateway logging migration (MEL → AddZbSerilog, GatewayLogScope → LogContext.PushProperty, GatewayLogRedactor → ILogRedactor seam) is executed as part of the ZB.MOM.WW.Telemetry library build. See current-state/mxaccessgw/CURRENT-STATE.md adoption plan for the exact changes.

Gap L1 (in-pass, done): MxGateway MEL → AddZbSerilog + LogContext correlation + ILogRedactor.

§4 Trace↔log correlation (nobody has it)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| trace_id/span_id in logs | absent | absent (no spans) | absent (no spans) |
| Mechanism | TraceContextEnricher not wired | n/a | n/a |

OtOpcUa creates spans (otopcua.deploy.apply, otopcua.opcua.address_space_rebuild) but never pushes Activity.Current's trace context onto Serilog's LogContext. A log emitted inside a span cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.

The shared TraceContextEnricher (part of ZB.MOM.WW.Telemetry.Serilog) closes this for all three projects on AddZbSerilog adoption — it is wired automatically.

Gap C1: OtOpcUa: adopt AddZbSerilog to wire TraceContextEnricher; spans and logs become joinable once the enricher is active.

Gap C2: ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
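The correlation step itself is small: read the ambient `Activity.Current` and copy its trace and span identifiers into the log event. The sketch below shows only that read, using BCL types; `TraceContextSketch` is a hypothetical name, and the actual TraceContextEnricher in ZB.MOM.WW.Telemetry.Serilog is not shown in this document, so treat this as an assumption about its core logic rather than its implementation.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceContextSketch
{
    // Returns the properties a log enricher would attach to an event.
    // Outside any span the result is empty, so the log line is left unchanged.
    public static IReadOnlyDictionary<string, string> Capture()
    {
        var properties = new Dictionary<string, string>();
        Activity? activity = Activity.Current;
        if (activity is null)
            return properties;

        // W3C trace ids render as 32 hex characters, span ids as 16.
        properties["trace_id"] = activity.TraceId.ToHexString();
        properties["span_id"] = activity.SpanId.ToHexString();
        return properties;
    }
}
```

A Serilog enricher built on this would call `Capture()` once per log event and add each pair as a property, which is what lets a backend join a log line to the span it was emitted in.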

§5 Duration unit: ms vs s

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | s | ms (workers.startup.duration, commands.duration, events.stream_send.duration) | n/a (no histograms) |

MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate scaling rules without normalization.

Gap U1: MxGateway: convert the three histogram call sites to record in seconds (multiply by 0.001 or redefine the instruments). This is a breaking change to existing dashboards and is tagged as a convergence item separate from the initial adoption.
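A minimal sketch of the two options named in Gap U1, assuming a histogram shaped like commands.duration: either scale an existing millisecond value at the call site, or measure and record in seconds directly with the instrument's unit declared as `s`. `DurationSketch` and its members are hypothetical; the real definitions live in GatewayMetrics.cs and are not reproduced here.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class DurationSketch
{
    // Meter name as it exists today; the rename in §6 is a separate change.
    private static readonly Meter GatewayMeter = new("MxGateway.Server");

    // Redefined instrument: same name, unit declared as seconds.
    public static readonly Histogram<double> CommandDuration =
        GatewayMeter.CreateHistogram<double>(
            "commands.duration", unit: "s", description: "Command duration in seconds.");

    // Option 1: convert a value that was already measured in milliseconds.
    public static double MillisecondsToSeconds(double milliseconds) =>
        milliseconds / 1000.0;

    // Option 2: time the work and record seconds directly.
    public static double Measure(Action work)
    {
        var stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();

        double seconds = stopwatch.Elapsed.TotalSeconds;
        CommandDuration.Record(seconds);
        return seconds;
    }
}
```

Recording from `TimeSpan.TotalSeconds` avoids a separate scaling step at each of the three call sites, at the cost of changing what existing dashboards read from the same metric name.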

§6 Meter naming: MxGateway.Server vs namespace convention

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ZB.MOM.WW.OtOpcUa | MxGateway.Server | n/a |

The fleet convention (per spec/SPEC.md) is <project-namespace> — OtOpcUa uses ZB.MOM.WW.OtOpcUa; the gateway's meter should be ZB.MOM.WW.MxGateway. MxGateway.Server is the assembly name, not the namespace, and does not carry the ZB.MOM.WW product prefix.

Gap N1: MxGateway: rename Meter from "MxGateway.Server" to "ZB.MOM.WW.MxGateway". This is a Prometheus metric label change that breaks dashboards/alerts and is tracked separately from the initial adoption.

§7 Standard instrumentation (nobody has the full set)

| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | not added | not added | not added |
| HttpClient metrics | not added | not added | not added |
| Runtime / process metrics | not added | not added | not added |
| gRPC client metrics | not added | not added | n/a |

AddZbTelemetry enables these via AddAspNetCoreInstrumentation, AddHttpClientInstrumentation, AddRuntimeInstrumentation, and AddProcessInstrumentation by default — all three projects gain them on AddZbTelemetry adoption without code changes.

Gap S1: all three projects lack standard instrumentation; closed automatically by AddZbTelemetry adoption.

§8 Serilog version split (Serilog.AspNetCore 8 vs 9)

OtOpcUa and ScadaBridge use Serilog.AspNetCore but may be on different versions due to their independent csproj updates. The shared ZB.MOM.WW.Telemetry.Serilog package must declare a Serilog.AspNetCore version floor that works for both. Verify version alignment on adoption.

Gap V1: confirm Serilog.AspNetCore version compatibility across all three projects on AddZbSerilog adoption; align if diverged.

§9 ScadaBridge: zero instrumentation

ScadaBridge has no application instruments today. The OpenTelemetry.Api ref is a CVE patch, not instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.

Gap I1: ScadaBridge: define ScadaBridgeTelemetry with a ZB.MOM.WW.ScadaBridge Meter and a first set of scadabridge.* instruments. Tracked as a follow-on (application-specific work, not shared-library work).

Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → AddZbSerilog: LogContext correlation + ILogRedactor (Gap L1) | MxGateway | P1 | M | Low | In-pass, done in Task #9; unblocked by library build |
| 2 | MxGateway: wire OTel SDK via AddZbTelemetry; GatewayMetrics begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: AddZbTelemetry with ServiceName/SiteId/NodeRole → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt AddZbSerilog + TraceContextEnricher (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep LogContextEnricher; add shared enrichers alongside |
| 5 | ScadaBridge: adopt AddZbSerilog; replace LoggerConfigurationFactory (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; LoggerConfigurationFactory deleted |
| 6 | MxGateway: histogram ms → s conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter "MxGateway.Server" → "ZB.MOM.WW.MxGateway" (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via AddZbTelemetry options (Gap S1) | all 3 | P2 | XS | Low | Automatic on AddZbTelemetry adoption; no extra code |
| 9 | ScadaBridge: define ScadaBridgeTelemetry + first scadabridge.* instruments (Gap I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (Gap M2) | OtOpcUa | P3 | S | Low | Opt-in via AddZbTelemetry options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |

Sequencing: Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5 are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land in either order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops coordination; defer until dashboards can be updated. Items #8–#11 are cleanups that bundle naturally with #3–#5.

This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each app is opt-in and tracked here, not forced.

Decisions still open

  • Whether AddZbTelemetry enables OTLP by default (simplest for new setups) or Prometheus by default (matches OtOpcUa's current posture). Design doc says Prometheus default; OTLP opt-in.
  • Whether the ms → s conversion and Meter rename are bundled with the initial MxGateway adoption (Task #9) or deferred. Deferring avoids dashboard breaks during the migration window.
  • Canonical SiteId and NodeRole config binding path — ScadaBridge reads from its own config hierarchy; AddZbSerilog must accept the value directly (caller-supplied) rather than reading from a fixed config section, to remain project-agnostic.