scadaproj/components/observability/GAPS.md
Joseph Doherty 215a646e35 docs(observability): fix metric-convention instrument names + NodeHostname-auto + resolve settled questions
C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).

C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).

C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.

m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.

I4: §5 standard instrumentation table corrected — OtOpcUa now shows "not added" for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.

I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).

I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.

I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.

m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
2026-06-01 07:32:58 -04:00


Observability — gaps & adoption backlog

Divergence of each project from spec/SPEC.md, and the ordered backlog to reach the shared ZB.MOM.WW.Telemetry library. Status legend: gap · 🟡 partial · matches.

Divergence vs spec

§1 OTel Resource / service.name (P1 — nobody has it)

| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| service.name | not set | not set | not set |
| service.namespace | not set | not set | not set |
| service.version | not set | not set | not set |
| site.id | not set | not set | not set |
| node.role | not set | not set | not set |
| host.name | not set | not set | not set |

No project configures a ResourceBuilder or equivalent. Every metric and span from every node is indistinguishable in a backend — no service identity, no topology (site/role), no version label. This is the single highest-value gap across the fleet; closing it requires only adding AddZbTelemetry with options.

Gap R1 (P1): All three projects must call AddZbTelemetry with the ServiceName, SiteId, and NodeRole options to populate the shared Resource. None can do so until the library is available.
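As a rough illustration of what closing Gap R1 produces, the sketch below builds the six Resource attributes from the table above as a plain dictionary. The `TelemetryIdentity` record and `ResourceSketch` class are hypothetical stand-ins, not the real `ZbTelemetryOptions` shape, and the `service.namespace` value is passed in because this document does not state it. `host.name` is taken from `Environment.MachineName`, matching the NodeHostname-auto behaviour described for the shared contract.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical options shape; the real ZbTelemetryOptions lives in the shared library.
public sealed record TelemetryIdentity(
    string ServiceName, string ServiceVersion, string SiteId, string NodeRole);

public static class ResourceSketch
{
    // Builds the six Resource attributes listed in the divergence table.
    public static IReadOnlyDictionary<string, object> Build(
        TelemetryIdentity id, string serviceNamespace) =>
        new Dictionary<string, object>
        {
            ["service.name"] = id.ServiceName,
            ["service.namespace"] = serviceNamespace,
            ["service.version"] = id.ServiceVersion,
            ["site.id"] = id.SiteId,
            ["node.role"] = id.NodeRole,
            // Host name is derived automatically, not supplied by the caller.
            ["host.name"] = Environment.MachineName,
        };
}
```

With the OpenTelemetry .NET SDK, a map like this could be handed to `ResourceBuilder.AddAttributes`; whether AddZbTelemetry does so internally is an assumption here.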

§2 Metrics export (P1 — MxGateway metrics are invisible)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK | present (AddOpenTelemetry) | none | none (OpenTelemetry.Api only) |
| Meter registered | ZB.MOM.WW.OtOpcUa | 🟡 MxGateway.Server (not via OTel SDK) | no Meter |
| Prometheus export | /metrics | GetSnapshot() only | none |
| OTLP export | not available | not available | not available |

MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data that is never exported. The data lives only in an in-memory GetSnapshot() read path. Adding AddZbTelemetry with Meters = ["MxGateway.Server"] closes this gap without touching GatewayMetrics.cs.

Gap M1 (P1): MxGateway: wire OTel SDK via AddZbTelemetry so GatewayMetrics exports. This is one half of the in-pass adoption (logging migration is the other).

Gap M2: ScadaBridge: define a ScadaBridgeTelemetry class and first application instruments (scadabridge.*); register via AddZbTelemetry. Currently zero instrumentation.
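The reason instruments can record without ever being exported is that a .NET `Meter` only delivers measurements to listeners that subscribe to it by name. The sketch below shows that mechanism with the BCL `MeterListener`, which is the same hook an exporting SDK relies on. The class name and the `demo.sessions.opened` instrument are placeholders, not names from GatewayMetrics.cs.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterVisibilitySketch
{
    // Records one measurement on a Meter and returns how many measurements
    // a listener observed, with or without a subscription to that Meter name.
    public static int Observe(string meterName, bool subscribe)
    {
        int observed = 0;
        using var meter = new Meter(meterName);
        // Placeholder instrument name, for illustration only.
        Counter<long> counter = meter.CreateCounter<long>("demo.sessions.opened");

        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            // Subscribing by Meter name is what makes the data visible.
            if (subscribe && instrument.Meter.Name == meterName)
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => observed++);
        listener.Start();

        counter.Add(1);
        return observed;
    }
}
```

Calling `Observe("MxGateway.Server", subscribe: false)` models today's state: the counter increments, but nothing receives it. Listing the Meter name in the telemetry options plays the role of `subscribe: true`, which is why the instrument definitions themselves need no changes.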

§3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | Serilog | MEL only | Serilog |
| Structural enrichers | 🟡 driver-scope only | MEL scope (provider-dependent) | SiteId/NodeRole/NodeHostname |
| Correlation mechanism | LogContextEnricher.Push | GatewayLogScope + BeginScope | structural enrichers |
| Log redaction | none | GatewayLogRedactor | none |
| ILogRedactor seam | none | bespoke | none |

Status: done in this task (Task #9). The MxGateway logging migration (MEL → AddZbSerilog, GatewayLogScope → LogContext.PushProperty, GatewayLogRedactor → ILogRedactor seam) is executed as part of the ZB.MOM.WW.Telemetry library build. See current-state/mxaccessgw/CURRENT-STATE.md adoption plan for the exact changes.

Gap L1 (in-pass, done): MxGateway MEL → AddZbSerilog + LogContext correlation + ILogRedactor.

§4 Trace↔log correlation (nobody has it)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| trace_id/span_id in logs | absent | absent (no spans) | absent (no spans) |
| Mechanism | TraceContextEnricher not wired | n/a | n/a |

OtOpcUa creates spans (otopcua.deploy.apply, otopcua.opcua.address_space_rebuild) but never pushes Activity.Current's trace context onto Serilog's LogContext. A log emitted inside a span cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.

The shared TraceContextEnricher (part of ZB.MOM.WW.Telemetry.Serilog) closes this for all three projects on AddZbSerilog adoption — it is wired automatically.

Gap C1: OtOpcUa: adopt AddZbSerilog to wire TraceContextEnricher; spans and logs become joinable once the enricher is active.

Gap C2: ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
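To make the missing join concrete, the sketch below shows the core of what a trace-context enricher has to do: read `Activity.Current` and surface its trace and span identifiers as log properties. It uses only `System.Diagnostics` and returns a dictionary instead of touching Serilog, so it is an illustration of the idea, not the shared TraceContextEnricher itself; the class name is hypothetical.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceContextSketch
{
    // Returns the properties a trace-context enricher would attach to a log
    // event, or an empty map when no Activity is current.
    public static IReadOnlyDictionary<string, string> Capture()
    {
        var props = new Dictionary<string, string>();
        var activity = Activity.Current;
        if (activity is not null)
        {
            // W3C identifiers: 32 hex characters for the trace, 16 for the span.
            props["trace_id"] = activity.TraceId.ToHexString();
            props["span_id"] = activity.SpanId.ToHexString();
        }
        return props;
    }
}
```

A log event carrying these two properties can be matched to its span in a backend. Outside any span the map is empty, which is the situation for MxGateway and ScadaBridge until they create spans.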

§5 Duration unit: ms vs s

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | s | ms (workers.startup.duration, commands.duration, events.stream_send.duration) | n/a (no histograms) |

MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate scaling rules without normalization.

Gap U1: MxGateway: convert the three histogram call sites to record in seconds (multiply by 0.001 or redefine the instruments). This is a breaking change to existing dashboards and is tagged as a convergence item separate from the initial adoption.
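The two options named in Gap U1 can be sketched as follows, using only `System.Diagnostics.Metrics`. The helper names are hypothetical and the real call sites in GatewayMetrics.cs are not shown in this document, so treat this as an outline of the conversion, not the actual change.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class DurationUnitSketch
{
    // Option 1: keep the existing millisecond measurement and scale it.
    // Dividing by 1000 is equivalent to the "multiply by 0.001" approach.
    public static double ToSeconds(double milliseconds) => milliseconds / 1000.0;

    // Option 2: redefine the instrument with unit "s" ...
    public static Histogram<double> CreateSecondsHistogram(Meter meter, string name) =>
        meter.CreateHistogram<double>(name, unit: "s");

    // ... and record the elapsed time directly in seconds.
    public static void RecordSeconds(Histogram<double> histogram, TimeSpan elapsed) =>
        histogram.Record(elapsed.TotalSeconds);
}
```

Either way, a 250 ms command is recorded as 0.25 instead of 250, which is why existing dashboards and alert thresholds need to be rescaled when the change lands.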

§6 Meter naming: MxGateway.Server vs namespace convention

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ZB.MOM.WW.OtOpcUa | MxGateway.Server | n/a |

The fleet convention (per spec/SPEC.md) is <project-namespace> — OtOpcUa uses ZB.MOM.WW.OtOpcUa; the gateway's meter should be ZB.MOM.WW.MxGateway. MxGateway.Server is the assembly name, not the namespace, and does not carry the ZB.MOM.WW product prefix.

Gap N1: MxGateway: rename Meter from "MxGateway.Server" to "ZB.MOM.WW.MxGateway". This is a Prometheus metric label change that breaks dashboards/alerts and is tracked separately from the initial adoption.

§7 Standard instrumentation (nobody has the full set)

| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | not added | not added | not added |
| HttpClient metrics | not added | not added | not added |
| Runtime / process metrics | not added | not added | not added |
| gRPC client metrics | not added | not added | n/a |

AddZbTelemetry enables these via AddAspNetCoreInstrumentation, AddHttpClientInstrumentation, AddRuntimeInstrumentation, and AddProcessInstrumentation by default — all three projects gain them on AddZbTelemetry adoption without code changes.

Gap S1: all three projects lack standard instrumentation; closed automatically by AddZbTelemetry adoption.

§8 Serilog version split (Serilog.AspNetCore 8 vs 9)

OtOpcUa and ScadaBridge use Serilog.AspNetCore but may be on different versions due to their independent csproj updates. The shared ZB.MOM.WW.Telemetry.Serilog package must declare a Serilog.AspNetCore version floor that works for both. Verify version alignment on adoption.

Gap V1: confirm Serilog.AspNetCore version compatibility across all three projects on AddZbSerilog adoption; align if diverged.

§9 ScadaBridge: zero instrumentation

ScadaBridge has no application instruments today. The OpenTelemetry.Api ref is a CVE patch, not instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.

Gap I1: ScadaBridge: define ScadaBridgeTelemetry with a ZB.MOM.WW.ScadaBridge Meter and a first set of scadabridge.* instruments. Tracked as a follow-on (application-specific work, not shared-library work).

Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → AddZbSerilog: LogContext correlation + ILogRedactor (Gap L1) | MxGateway | P1 | M | Low | In-pass, done in Task #9; unblocked by library build |
| 2 | MxGateway: wire OTel SDK via AddZbTelemetry; GatewayMetrics begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: AddZbTelemetry with ServiceName/SiteId/NodeRole → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt AddZbSerilog + TraceContextEnricher (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep LogContextEnricher; add shared enrichers alongside |
| 5 | ScadaBridge: adopt AddZbSerilog; replace LoggerConfigurationFactory (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; LoggerConfigurationFactory deleted |
| 6 | MxGateway: histogram ms → s conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter "MxGateway.Server" → "ZB.MOM.WW.MxGateway" (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via AddZbTelemetry options (Gap S1) | all 3 | P2 | XS | Low | Automatic on AddZbTelemetry adoption; no extra code |
| 9 | ScadaBridge: define ScadaBridgeTelemetry + first scadabridge.* instruments (Gap I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (Gap M2) | OtOpcUa | P3 | S | Low | Opt-in via AddZbTelemetry options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |

Sequencing: Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3 through #5 are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land in any order. Items #6 and #7 (unit/naming convergence) are breaking changes requiring ops coordination; defer until dashboards can be updated. Items #8 through #11 are cleanups that bundle naturally with #3 through #5.

This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each app is opt-in and tracked here, not forced.

Decisions still open

  • Canonical SiteId and NodeRole config binding path — ScadaBridge reads from its own config hierarchy; AddZbSerilog must accept the value directly (caller-supplied) rather than reading from a fixed config section, to remain project-agnostic.

Decisions settled (no longer open)

  • Prometheus vs OTLP default (SETTLED): AddZbTelemetry defaults to Prometheus (matching OtOpcUa's existing /metrics posture). OTLP is opt-in via ZbTelemetryOptions.Exporter = ZbExporter.Otlp. See spec/SPEC.md §4 and shared-contract ZbTelemetryOptions.Exporter.
  • ms → s conversion and Meter rename bundling (SETTLED, deferred): Both the histogram unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and are tracked as separate backlog items #6 and #7 in the adoption backlog above.