Observability — gaps & adoption backlog

Divergence of each project from spec/SPEC.md, and the ordered backlog to reach the shared ZB.MOM.WW.Telemetry library. Status legend: gap · 🟡 partial · matches.

Divergence vs spec

§1 OTel Resource / service.name (P1 — nobody has it)

| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| service.name | not set | not set | not set |
| service.namespace | not set | not set | not set |
| service.version | not set | not set | not set |
| site.id | not set | not set | not set |
| node.role | not set | not set | not set |
| host.name | not set | not set | not set |

No project configures a ResourceBuilder or equivalent. Every metric and span from every node is indistinguishable in a backend — no service identity, no topology (site/role), no version label. This is the single highest-value gap across the fleet; closing it requires only adding AddZbTelemetry with options.

Gap R1 (P1): All three projects must call AddZbTelemetry with the ServiceName, SiteId, and NodeRole options to populate the shared Resource. None can do so until the library is available.
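As a rough illustration of what populating the shared Resource involves, the sketch below builds the six spec attributes with the plain OpenTelemetry .NET `ResourceBuilder`. It does not show the real AddZbTelemetry signature, which this document does not give; the `ResourceSketch` class, its parameters, and the example values are hypothetical, and the `OpenTelemetry` NuGet package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using OpenTelemetry.Resources;

// Hypothetical helper, for illustration only: not the ZB.MOM.WW.Telemetry API.
public static class ResourceSketch
{
    public static ResourceBuilder Build(
        string serviceName,
        string serviceNamespace,
        string serviceVersion,
        string siteId,
        string nodeRole)
    {
        return ResourceBuilder.CreateDefault()
            // Sets service.name, service.namespace and service.version.
            .AddService(
                serviceName: serviceName,
                serviceNamespace: serviceNamespace,
                serviceVersion: serviceVersion)
            // Topology attributes from the spec table above.
            .AddAttributes(new KeyValuePair<string, object>[]
            {
                new("site.id", siteId),
                new("node.role", nodeRole),
                new("host.name", Environment.MachineName),
            });
    }
}
```

A caller would pass its own identity, for example `ResourceSketch.Build("OtOpcUa", "ZB.MOM.WW", "0.1.0", "site-01", "central")`, where the site and role values are placeholders.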

§2 Metrics export (P1 — MxGateway metrics are invisible)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK | present (AddOpenTelemetry) | none | none (OpenTelemetry.Api only) |
| Meter registered | ZB.MOM.WW.OtOpcUa | 🟡 MxGateway.Server (not via OTel SDK) | no Meter |
| Prometheus export | /metrics | GetSnapshot() only | none |
| OTLP export | not available | not available | not available |

MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data that is never exported. The data lives only in an in-memory GetSnapshot() read path. Adding AddZbTelemetry with Meters = ["MxGateway.Server"] closes this gap without touching GatewayMetrics.cs.

Gap M1 (P1): MxGateway: wire OTel SDK via AddZbTelemetry so GatewayMetrics exports. This is one half of the in-pass adoption (logging migration is the other).

Gap M2: ScadaBridge: define a ScadaBridgeTelemetry class and first application instruments (scadabridge.*); register via AddZbTelemetry. Currently zero instrumentation.
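For orientation, here is a minimal sketch of subscribing the OTel SDK to an existing Meter by name and exposing it for Prometheus scraping, using the stock OpenTelemetry .NET extensions rather than AddZbTelemetry (whose internals are not shown in this document). It assumes the `OpenTelemetry.Extensions.Hosting` and `OpenTelemetry.Exporter.Prometheus.AspNetCore` packages; the point is that the code creating the instruments is not touched.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        // Subscribe to the meter by name; instruments are picked up as-is.
        .AddMeter("MxGateway.Server")
        // Keep the collected metrics available for HTTP scraping.
        .AddPrometheusExporter());

var app = builder.Build();

// Serves the scrape endpoint (default path: /metrics).
app.MapPrometheusScrapingEndpoint();

app.Run();
```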

§3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | Serilog | MEL only | Serilog |
| Structural enrichers | 🟡 driver-scope only | MEL scope (provider-dependent) | SiteId/NodeRole/NodeHostname |
| Correlation mechanism | LogContextEnricher.Push | GatewayLogScope + BeginScope | structural enrichers |
| Log redaction | none | GatewayLogRedactor | none |
| ILogRedactor seam | none | bespoke | none |

Status: done in this task (Task #9). The MxGateway logging migration (MEL → AddZbSerilog, GatewayLogScope → LogContext.PushProperty, GatewayLogRedactor → ILogRedactor seam) is executed as part of the ZB.MOM.WW.Telemetry library build. See the adoption plan in current-state/mxaccessgw/CURRENT-STATE.md for the exact changes.

Gap L1 (in-pass, done): MxGateway MEL → AddZbSerilog + LogContext correlation + ILogRedactor.

§4 Trace↔log correlation (nobody has it)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| trace_id/span_id in logs | absent | absent (no spans) | absent (no spans) |
| Mechanism | TraceContextEnricher not wired | n/a | n/a |

OtOpcUa creates spans (otopcua.deploy.apply, otopcua.opcua.address_space_rebuild) but never pushes Activity.Current's trace context onto Serilog's LogContext. A log emitted inside a span cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.

The shared TraceContextEnricher (part of ZB.MOM.WW.Telemetry.Serilog) closes this for all three projects on AddZbSerilog adoption — it is wired automatically.

Gap C1: OtOpcUa: adopt AddZbSerilog to wire TraceContextEnricher; spans and logs become joinable once the enricher is active.

Gap C2: ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
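To make the mechanism concrete, the following is a minimal sketch of an enricher that copies the ambient `Activity.Current` trace context onto each Serilog event. It is not the source of the shared TraceContextEnricher, which this document does not show; the class name `SketchTraceContextEnricher` is hypothetical, the property names follow the trace_id/span_id row above, and the `Serilog` package is assumed.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Hypothetical stand-in for a trace-context enricher, for illustration only.
public sealed class SketchTraceContextEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        // No span is active: leave the event untouched.
        var activity = Activity.Current;
        if (activity is null)
        {
            return;
        }

        // Attach the W3C trace and span ids so a backend can join log and span.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```

Such an enricher would be registered with `new LoggerConfiguration().Enrich.With(new SketchTraceContextEnricher())`. Events logged outside any span simply carry no trace properties, which matches the "effective once spans are created" note for ScadaBridge and MxGateway.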

§5 Duration unit: ms vs s

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | s | ms (workers.startup.duration, commands.duration, events.stream_send.duration) | n/a (no histograms) |

MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate scaling rules without normalization.

Gap U1: MxGateway: convert the three histogram call sites to record in seconds (multiply by 0.001 or redefine the instruments). This is a breaking change to existing dashboards and is tagged as a convergence item separate from the initial adoption.
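The call-site change is small. The sketch below shows one way a histogram could declare unit "s" and record `TimeSpan.TotalSeconds`, using only `System.Diagnostics.Metrics`. The meter and instrument names are hypothetical placeholders, not the actual GatewayMetrics definitions.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument, for illustration only.
public static class DurationUnitSketch
{
    private static readonly Meter SketchMeter = new("Sketch.DurationUnits");

    private static readonly Histogram<double> CommandDuration =
        SketchMeter.CreateHistogram<double>(
            "sketch.commands.duration",
            unit: "s",
            description: "Command duration in seconds");

    // Records the elapsed time in seconds and returns the recorded value.
    public static double Record(TimeSpan elapsed)
    {
        // Previously this would have been elapsed.TotalMilliseconds.
        double seconds = elapsed.TotalSeconds;
        CommandDuration.Record(seconds);
        return seconds;
    }
}
```

A 250 ms command is then recorded as 0.25 rather than 250, so any existing bucket boundaries and alert thresholds expressed in milliseconds need the same division by 1000.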

§6 Meter naming: MxGateway.Server vs namespace convention

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ZB.MOM.WW.OtOpcUa | MxGateway.Server | n/a |

The fleet convention (per spec/SPEC.md) is <project-namespace> — OtOpcUa uses ZB.MOM.WW.OtOpcUa; the gateway's meter should be ZB.MOM.WW.MxGateway. MxGateway.Server is the assembly name, not the namespace, and does not carry the ZB.MOM.WW product prefix.

Gap N1: MxGateway: rename Meter from "MxGateway.Server" → "ZB.MOM.WW.MxGateway". This is a Prometheus metric label change that breaks dashboards/alerts and is tracked separately from the initial adoption.

§7 Standard instrumentation (nobody has the full set)

| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | not added | not added | not added |
| HttpClient metrics | not added | not added | not added |
| Runtime / process metrics | not added | not added | not added |
| gRPC client metrics | not added | not added | n/a |

AddZbTelemetry enables these via AddAspNetCoreInstrumentation, AddHttpClientInstrumentation, AddRuntimeInstrumentation, and AddProcessInstrumentation by default — all three projects gain them on AddZbTelemetry adoption without code changes.

Gap S1: all three projects lack standard instrumentation; closed automatically by AddZbTelemetry adoption.

§8 Serilog version split (Serilog.AspNetCore 8 vs 9)

OtOpcUa and ScadaBridge use Serilog.AspNetCore but may be on different versions due to their independent csproj updates. The shared ZB.MOM.WW.Telemetry.Serilog package must declare a Serilog.AspNetCore version floor that works for both. Verify version alignment on adoption.

Gap V1: confirm Serilog.AspNetCore version compatibility across all three projects on AddZbSerilog adoption; align if diverged.

§9 ScadaBridge: zero instrumentation

ScadaBridge has no application instruments today. The OpenTelemetry.Api ref is a CVE patch, not instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.

Gap I1: ScadaBridge: define ScadaBridgeTelemetry with a ZB.MOM.WW.ScadaBridge Meter and a first set of scadabridge.* instruments. Tracked as a follow-on (application-specific work, not shared-library work).
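As a starting point, here is a minimal sketch of what such a class could look like, using only `System.Diagnostics.Metrics`. The class name `ScadaBridgeTelemetrySketch`, its methods, and the two instrument names are assumptions chosen to fit the scadabridge.* prefix; they are not the final ScadaBridge definitions.

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

// Hypothetical shape for an application telemetry class, for illustration only.
public sealed class ScadaBridgeTelemetrySketch : IDisposable
{
    public const string MeterName = "ZB.MOM.WW.ScadaBridge";

    private readonly Meter _meter = new(MeterName);
    private readonly Counter<long> _deploymentsApplied;
    private long _queueDepth;

    public ScadaBridgeTelemetrySketch()
    {
        // Assumed instrument names following the scadabridge.* prefix.
        _deploymentsApplied =
            _meter.CreateCounter<long>("scadabridge.deployments.applied");

        // The gauge callback reads a cached value, so a scrape never blocks on the queue.
        _meter.CreateObservableGauge<long>(
            "scadabridge.store_and_forward.queue.depth",
            () => Interlocked.Read(ref _queueDepth));
    }

    public long CurrentQueueDepth => Interlocked.Read(ref _queueDepth);

    public void DeploymentApplied() => _deploymentsApplied.Add(1);

    public void SetQueueDepth(long depth) => Interlocked.Exchange(ref _queueDepth, depth);

    public void Dispose() => _meter.Dispose();
}
```

Registering the class as a singleton and listing `ScadaBridgeTelemetrySketch.MeterName` among the exported meters would be enough for these instruments to appear on the scrape endpoint.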

Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → AddZbSerilog: LogContext correlation + ILogRedactor (Gap L1) | MxGateway | P1 | M | Low | In-pass, done in Task #9; unblocked by library build |
| 2 | MxGateway: wire OTel SDK via AddZbTelemetry; GatewayMetrics begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: AddZbTelemetry with ServiceName/SiteId/NodeRole → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt AddZbSerilog + TraceContextEnricher (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep LogContextEnricher; add shared enrichers alongside |
| 5 | ScadaBridge: adopt AddZbSerilog; replace LoggerConfigurationFactory (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; LoggerConfigurationFactory deleted |
| 6 | MxGateway: histogram ms → s conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter "MxGateway.Server" → "ZB.MOM.WW.MxGateway" (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via AddZbTelemetry options (Gap S1) | all 3 | P2 | XS | Low | Automatic on AddZbTelemetry adoption; no extra code |
| 9 | ScadaBridge: define ScadaBridgeTelemetry + first scadabridge.* instruments (Gap I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (Gap M2) | OtOpcUa | P3 | S | Low | Opt-in via AddZbTelemetry options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |

Sequencing: Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5 are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land in either order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops coordination; defer until dashboards can be updated. Items #8–#11 are cleanups that bundle naturally with #3–#5.

This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each app is opt-in and tracked here, not forced.

Decisions still open

  • Canonical SiteId and NodeRole config binding path — ScadaBridge reads from its own config hierarchy; AddZbSerilog must accept the value directly (caller-supplied) rather than reading from a fixed config section, to remain project-agnostic.

Decisions settled (no longer open)

  • Prometheus vs OTLP default (SETTLED): AddZbTelemetry defaults to Prometheus (matching OtOpcUa's existing /metrics posture). OTLP is opt-in via ZbTelemetryOptions.Exporter = ZbExporter.Otlp. See spec/SPEC.md §4 and shared-contract ZbTelemetryOptions.Exporter.
  • ms → s conversion and Meter rename bundling (SETTLED, deferred): Both the histogram unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and are tracked as separate backlog items #6 and #7 in the adoption backlog above.
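The "Prometheus by default, OTLP on explicit opt-in" rule can be sketched as a small resolver over the configured string. `ExporterKind` and `ExporterSelectionSketch` are hypothetical stand-ins, not the real ZbExporter or ZbTelemetryOptions types, and falling back to Prometheus on an unrecognized value is an assumption rather than documented library behaviour.

```csharp
using System;

// Hypothetical stand-in for the exporter choice, for illustration only.
public enum ExporterKind
{
    Prometheus,
    Otlp,
}

public static class ExporterSelectionSketch
{
    // Returns Otlp only when explicitly requested; anything else yields Prometheus.
    public static ExporterKind Resolve(string? configured)
    {
        bool parsed = Enum.TryParse<ExporterKind>(configured, ignoreCase: true, out var kind);

        // IsDefined rejects numeric strings such as "5" that TryParse would accept.
        return parsed && Enum.IsDefined(typeof(ExporterKind), kind)
            ? kind
            : ExporterKind.Prometheus;
    }
}
```

With this shape, a missing setting and a typo both keep the existing /metrics posture, and only an explicit "Otlp" value switches exporters. A stricter variant could throw on unrecognized values instead of falling back.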

Adoption status — 2026-06-01 (DONE)

ZB.MOM.WW.Telemetry + ZB.MOM.WW.Telemetry.Serilog (0.1.0) were adopted across all three sister apps in one pass, behaviour-preserving. Each adoption landed on a per-repo branch feat/adopt-zb-telemetry (one commit per task). Plan + design: docs/plans/2026-06-01-telemetry-library-adoption.md.

Correction: the prior claim that "MxAccessGateway logging was adopted (MEL → Serilog) on its own branch" was false on main — MxGateway was still MEL-only, and its MxGateway.Server meter was never exported. The full MEL→Serilog migration and the metrics export both landed in this 2026-06-01 pass.

| Repo | AddZbTelemetry (Resource + std instrumentation + Prometheus) | /metrics | Logging | Meter (unchanged) |
|---|---|---|---|---|
| OtOpcUa | replaced hand-rolled ObservabilityExtensions | /metrics (path unchanged) | AddZbSerilog (sinks moved to appsettings; LogContextEnricher kept) | ZB.MOM.WW.OtOpcUa |
| ScadaBridge | added in BindSharedOptions (both Central + Site roots) | Central; mapped on Site too (see follow-on) | ⚠️ kept LoggerConfigurationFactory + added shared TraceContextEnricher; did not adopt AddZbSerilog | (none yet; #9) |
| MxAccessGateway | exports existing GatewayMetrics | new /metrics | MEL → AddZbSerilog; GatewayLogRedactor exposed via ILogRedactor seam (GatewayLogRedactorSeam); GatewayLogScope/middleware kept as-is | MxGateway.Server (name + ms units unchanged) |

Accepted scope decisions (deviations from the original backlog)

  • ScadaBridge keeps LoggerConfigurationFactory (backlog #5 revised). The factory implements a documented governance contract (REQ-HOST-8 / Host-011/014/020/022): ScadaBridge:Logging:MinimumLevel is the floor and overrides Serilog:MinimumLevel, with operator warnings. AddZbSerilog hard-codes MinimumLevel.Is(Information) before ReadFrom.Configuration, which would invert that precedence and silently drop the knob. So ScadaBridge keeps the factory and only adds the shared TraceContextEnricher to it — gaining trace↔log correlation without regressing the contract. Full AddZbSerilog adoption for ScadaBridge would first require teaching the shared bootstrap to accept a caller-supplied minimum-level governance hook.
  • MxGateway keeps GatewayLogScope + request-logging middleware as-is. The Serilog MEL provider captures MEL BeginScope dictionaries as structured properties, so the scope/correlation code keeps producing the same properties under Serilog. Only the provider swap + the ILogRedactor adapter were needed.
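The scope behaviour described in the second decision can be illustrated with a minimal MEL sketch. The helper name and the `CorrelationId` key are hypothetical (the actual GatewayLogScope properties are not listed in this document), and the `Microsoft.Extensions.Logging.Abstractions` package is assumed.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

// Hypothetical helper showing a dictionary-based MEL scope, for illustration only.
public static class ScopeSketch
{
    public static void HandleCommand(ILogger logger, string correlationId, string commandName)
    {
        // A key/value scope state is what a structured provider can surface
        // as properties on every event logged inside the using block.
        using (logger.BeginScope(new Dictionary<string, object>
        {
            ["CorrelationId"] = correlationId,
        }))
        {
            logger.LogInformation("Handling command {CommandName}", commandName);
        }
    }
}
```

Because the calling code only depends on `ILogger`, swapping the underlying provider does not require changing call sites like this one.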

Follow-ons — DONE 2026-06-01

All the deferred follow-ons were then executed (branch feat/telemetry-followons per repo, behaviour-preserving except the intentional, no-consumer-yet metric-shape change in #6/#7). Plan: docs/plans/2026-06-01-telemetry-followons.md.

| Item | Status | What landed |
|---|---|---|
| #6 MxGateway histogram ms → s | done | 3 histograms record .TotalSeconds, unit "s". Safe: never Prometheus-exported before, so no dashboards broke. |
| #7 Meter rename → ZB.MOM.WW.MxGateway | done | GatewayMetrics.MeterName renamed; docs/Metrics.md synced. |
| #9 ScadaBridge app instruments | done | ScadaBridgeTelemetry meter (ZB.MOM.WW.ScadaBridge) + first 4: deployments.applied (counter), store_and_forward.queue.depth (sync-safe cached gauge), inbound_api.requests (counter, bounded method tag), site.connection.up (balanced open/close gauge). |
| #10/#11 OTLP opt-in | done | All 3 apps read <App>:Telemetry:Exporter (Prometheus\|Otlp) + :OtlpEndpoint, default Prometheus. Setting OTLP also exports OtOpcUa's spans (resolves the trace no-op) once a collector endpoint is configured. |
| Site-node /metrics scrape | done | ScadaBridge NodeOptions.MetricsPort (default 8084, avoids the site RemotingPort=8082 collision) + a second Http1AndHttp2 Kestrel listener on the Site role; StartupValidator enforces MetricsPort ≠ Remoting/Grpc. |
| Serilog version drift | done | OtOpcUa Serilog.AspNetCore/.Extensions.Hosting/.Settings.Configuration aligned to 10.0.0 (family-consistent). |
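The port-collision rule from the site-node row can be sketched as a plain startup check. This is not the actual StartupValidator code; the class name, parameter names, and exception choices are hypothetical, and only the default MetricsPort 8084 and the site RemotingPort 8082 come from this document.

```csharp
using System;

// Hypothetical startup check, for illustration only.
public static class MetricsPortValidationSketch
{
    // Throws when the metrics listener would collide with another listener.
    public static void Validate(int metricsPort, int remotingPort, int grpcPort)
    {
        if (metricsPort is < 1 or > 65535)
        {
            throw new ArgumentOutOfRangeException(
                nameof(metricsPort), metricsPort, "MetricsPort must be a valid TCP port.");
        }

        if (metricsPort == remotingPort || metricsPort == grpcPort)
        {
            throw new InvalidOperationException(
                $"MetricsPort {metricsPort} must differ from the remoting and gRPC ports.");
        }
    }
}
```

Failing at startup rather than at bind time gives the operator a message that names the conflicting setting instead of a generic address-in-use error.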

Still open (not code — operational/future):

  • OTLP is opt-in but unexercised until an OTel collector endpoint is deployed and the <App>:Telemetry:Exporter=Otlp + :OtlpEndpoint config is set. The wiring is in place; only a collector is missing.
  • Further ScadaBridge instruments beyond the first 4 are additive future work (not blocking).