Observability — gaps & adoption backlog
Divergence of each project from spec/SPEC.md, and the ordered backlog to
reach the shared ZB.MOM.WW.Telemetry library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
Divergence vs spec
§1 OTel Resource / service.name (P1 — nobody has it)
| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| service.name | ⛔ not set | ⛔ not set | ⛔ not set |
| service.namespace | ⛔ not set | ⛔ not set | ⛔ not set |
| service.version | ⛔ not set | ⛔ not set | ⛔ not set |
| site.id | ⛔ not set | ⛔ not set | ⛔ not set |
| node.role | ⛔ not set | ⛔ not set | ⛔ not set |
| host.name | ⛔ not set | ⛔ not set | ⛔ not set |
No project configures a ResourceBuilder or equivalent. Every metric and span from every node is
indistinguishable in a backend — no service identity, no topology (site/role), no version label.
This is the single highest-value gap across the fleet; closing it requires only adding
AddZbTelemetry with options.
→ Gap R1 (P1): All three projects must call AddZbTelemetry with ServiceName, SiteId, and
NodeRole options to populate the shared Resource. None can do so until the library is available.
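
As a rough illustration of what populating the Resource involves underneath, the sketch below uses the OpenTelemetry .NET SDK's ResourceBuilder directly. It is not the AddZbTelemetry implementation; the attribute values (service name, version, site, role) are hypothetical placeholders, and it assumes the OpenTelemetry NuGet package is referenced.

```csharp
// Requires the OpenTelemetry NuGet package.
using System;
using System.Collections.Generic;
using OpenTelemetry.Resources;

// Hypothetical values; real ones would come from each project's options.
var resource = ResourceBuilder.CreateDefault()
    .AddService(
        serviceName: "otopcua",
        serviceNamespace: "ZB.MOM.WW",
        serviceVersion: "1.2.3")
    .AddAttributes(new KeyValuePair<string, object>[]
    {
        // Topology attributes from the spec table above.
        new("site.id", "site-01"),
        new("node.role", "central"),
        new("host.name", Environment.MachineName),
    })
    .Build();

// Print every attribute that would be attached to metrics and spans.
foreach (var attribute in resource.Attributes)
{
    Console.WriteLine($"{attribute.Key} = {attribute.Value}");
}
```

With attributes like these attached once at startup, a backend can separate series by service, site, and role without per-instrument tags.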
§2 Metrics export (P1 — MxGateway metrics are invisible)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK present | ✅ (AddOpenTelemetry) | ⛔ none | ⛔ none (OpenTelemetry.Api only) |
| Meter registered | ✅ ZB.MOM.WW.OtOpcUa | 🟡 MxGateway.Server (not via OTel SDK) | ⛔ no Meter |
| Prometheus export | ✅ /metrics | ⛔ GetSnapshot() only | ⛔ none |
| OTLP export | ⛔ not available | ⛔ not available | ⛔ not available |
MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data
that is never exported. The data lives only in an in-memory GetSnapshot() read path. Adding
AddZbTelemetry with Meters = ["MxGateway.Server"] closes this gap without touching
GatewayMetrics.cs.
→ Gap M1 (P1): MxGateway: wire OTel SDK via AddZbTelemetry so GatewayMetrics exports.
This is one half of the in-pass adoption (logging migration is the other).
→ Gap M2: ScadaBridge: define a ScadaBridgeTelemetry class and first application instruments
(scadabridge.*); register via AddZbTelemetry. Currently zero instrumentation.
§3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | ✅ Serilog | ⛔ MEL only | ✅ Serilog |
| Structural enrichers | 🟡 driver-scope only | ⛔ MEL scope (provider-dependent) | ✅ SiteId/NodeRole/NodeHostname |
| Correlation mechanism | LogContextEnricher.Push | GatewayLogScope + BeginScope | structural enrichers |
| Log redaction | ⛔ none | ✅ GatewayLogRedactor | ⛔ none |
| ILogRedactor seam | ⛔ none | ⛔ bespoke | ⛔ none |
Status: done in this task (Task #9). The MxGateway logging migration — MEL → AddZbSerilog,
GatewayLogScope → LogContext.PushProperty, GatewayLogRedactor → ILogRedactor seam — is
executed as part of the ZB.MOM.WW.Telemetry library build. See
current-state/mxaccessgw/CURRENT-STATE.md adoption
plan for the exact changes.
→ Gap L1 (in-pass, done): MxGateway MEL → AddZbSerilog + LogContext correlation + ILogRedactor.
§4 Trace↔log correlation (nobody has it)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| trace_id/span_id in logs | ⛔ absent | ⛔ absent (no spans) | ⛔ absent (no spans) |
| Mechanism | TraceContextEnricher not wired | n/a | n/a |
OtOpcUa creates spans (otopcua.deploy.apply, otopcua.opcua.address_space_rebuild) but never
pushes Activity.Current's trace context onto Serilog's LogContext. A log emitted inside a span
cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.
The shared TraceContextEnricher (part of ZB.MOM.WW.Telemetry.Serilog) closes this for all
three projects on AddZbSerilog adoption — it is wired automatically.
→ Gap C1: OtOpcUa: adopt AddZbSerilog to wire TraceContextEnricher; spans and logs become
joinable once the enricher is active.
→ Gap C2: ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
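
To make the correlation mechanism concrete, here is a minimal sketch of a Serilog enricher that copies the current Activity's identifiers onto each log event. This is not the source of the shared TraceContextEnricher; the class name ActivityTraceEnricher is hypothetical, and the property names trace_id and span_id are taken from the table above. It assumes the Serilog NuGet package is referenced.

```csharp
// Requires the Serilog NuGet package.
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Hypothetical enricher for illustration; not the shared library's implementation.
public sealed class ActivityTraceEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            return; // No active span, so there is nothing to correlate.
        }

        // Attach the W3C trace and span identifiers as structured properties.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToString()));
    }
}
```

An enricher of this shape would typically be registered with `new LoggerConfiguration().Enrich.With(new ActivityTraceEnricher())`. It only adds properties while a span is active, which is why the gap stays open for projects that create no spans.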
§5 Duration unit: ms vs s
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | ✅ s | ⛔ ms (workers.startup.duration, commands.duration, events.stream_send.duration) | n/a (no histograms) |
MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate scaling rules without normalization.
→ Gap U1: MxGateway: convert the three histogram call sites to record in seconds (multiply by
0.001 or redefine the instruments). This is a breaking change to existing dashboards and is
tagged as a convergence item separate from the initial adoption.
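
The call-site change can be sketched with the built-in System.Diagnostics.Metrics API. This is an illustration only, not the GatewayMetrics code: the meter name Example.Gateway and the description text are hypothetical, and only the instrument name commands.duration comes from the table above.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical meter name, for illustration only.
using var meter = new Meter("Example.Gateway");

// Declare the unit as seconds so backends can interpret the values.
var commandDuration = meter.CreateHistogram<double>(
    "commands.duration", unit: "s", description: "Command execution time");

var stopwatch = Stopwatch.StartNew();
// ... the timed work would run here ...
stopwatch.Stop();

// Before the conversion the call site would have recorded milliseconds:
//   commandDuration.Record(stopwatch.Elapsed.TotalMilliseconds);
// After the conversion it records seconds, 1000 times smaller for the same interval.
commandDuration.Record(stopwatch.Elapsed.TotalSeconds);
```

Recording TimeSpan.TotalSeconds directly avoids a manual multiply by 0.001 and keeps the declared unit and the recorded value in agreement.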
§6 Meter naming: MxGateway.Server vs namespace convention
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ✅ ZB.MOM.WW.OtOpcUa | ⛔ MxGateway.Server | n/a |
The fleet convention (per spec/SPEC.md) is <project-namespace> — OtOpcUa uses
ZB.MOM.WW.OtOpcUa; the gateway's meter should be ZB.MOM.WW.MxGateway. MxGateway.Server is
the assembly name, not the namespace, and does not carry the ZB.MOM.WW product prefix.
→ Gap N1: MxGateway: rename Meter from "MxGateway.Server" → "ZB.MOM.WW.MxGateway".
This is a Prometheus metric label change that breaks dashboards/alerts and is tracked
separately from the initial adoption.
§7 Standard instrumentation (nobody has the full set)
| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| HttpClient metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Runtime / process metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client metrics | ⛔ not added | ⛔ not added | n/a |
AddZbTelemetry enables these via AddAspNetCoreInstrumentation, AddHttpClientInstrumentation,
AddRuntimeInstrumentation, and AddProcessInstrumentation by default — all three projects gain
them on AddZbTelemetry adoption without code changes.
→ Gap S1: all three projects lack standard instrumentation; closed automatically by
AddZbTelemetry adoption.
§8 Serilog version split (Serilog.AspNetCore 8 vs 9)
OtOpcUa and ScadaBridge use Serilog.AspNetCore but may be on different versions due to their
independent csproj updates. The shared ZB.MOM.WW.Telemetry.Serilog package must declare a
Serilog.AspNetCore version floor that works for both. Verify version alignment on adoption.
→ Gap V1: confirm Serilog.AspNetCore version compatibility across all three projects on
AddZbSerilog adoption; align if diverged.
§9 ScadaBridge: zero instrumentation
ScadaBridge has no application instruments today. The OpenTelemetry.Api ref is a CVE patch, not
instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.
→ Gap I1: ScadaBridge: define ScadaBridgeTelemetry with a ZB.MOM.WW.ScadaBridge Meter and
a first set of scadabridge.* instruments. Tracked as a follow-on (application-specific work,
not shared-library work).
Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → AddZbSerilog: LogContext correlation + ILogRedactor (Gap L1) | MxGateway | P1 | M | Low | In-pass, done in Task #9; unblocked by library build |
| 2 | MxGateway: wire OTel SDK via AddZbTelemetry; GatewayMetrics begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: AddZbTelemetry with ServiceName/SiteId/NodeRole → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt AddZbSerilog + TraceContextEnricher (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep LogContextEnricher; add shared enrichers alongside |
| 5 | ScadaBridge: adopt AddZbSerilog; replace LoggerConfigurationFactory (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; LoggerConfigurationFactory deleted |
| 6 | MxGateway: histogram ms → s conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter "MxGateway.Server" → "ZB.MOM.WW.MxGateway" (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via AddZbTelemetry options (Gap S1) | all 3 | P2 | XS | Low | Automatic on AddZbTelemetry adoption; no extra code |
| 9 | ScadaBridge: define ScadaBridgeTelemetry + first scadabridge.* instruments (Gap I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (§2, OTLP export row) | OtOpcUa | P3 | S | Low | Opt-in via AddZbTelemetry options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |
Sequencing: Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5 are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land in any order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops coordination; defer them until dashboards can be updated. Items #8–#11 are cleanups that bundle naturally with #3–#5.
This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each app is opt-in and tracked here, not forced.
Decisions still open
- Canonical SiteId and NodeRole config binding path: ScadaBridge reads from its own config hierarchy; AddZbSerilog must accept the value directly (caller-supplied) rather than reading from a fixed config section, to remain project-agnostic.
Decisions settled (no longer open)
- Prometheus vs OTLP default (SETTLED): AddZbTelemetry defaults to Prometheus (matching OtOpcUa's existing /metrics posture). OTLP is opt-in via ZbTelemetryOptions.Exporter = ZbExporter.Otlp. See spec/SPEC.md §4 and shared-contract ZbTelemetryOptions.Exporter.
- ms→s conversion and Meter rename bundling (SETTLED, deferred): Both the histogram unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and are tracked as separate backlog items #6 and #7 in the adoption backlog above.
Adoption status — 2026-06-01 (DONE)
ZB.MOM.WW.Telemetry + ZB.MOM.WW.Telemetry.Serilog (0.1.0) were adopted across all three
sister apps in one pass, behaviour-preserving. Each adoption landed on a per-repo branch
feat/adopt-zb-telemetry (one commit per task). Plan + design:
docs/plans/2026-06-01-telemetry-library-adoption.md.
Correction: the prior claim that "MxAccessGateway logging was adopted (MEL → Serilog) on its own branch" was false on main: MxGateway was still MEL-only, and its MxGateway.Server meter was never exported. The full MEL→Serilog migration and the metrics export both landed in this 2026-06-01 pass.
| Repo | AddZbTelemetry (Resource + std instrumentation + Prometheus) | /metrics | Logging | Meter (unchanged) |
|---|---|---|---|---|
| OtOpcUa | ✅ replaced hand-rolled ObservabilityExtensions | ✅ /metrics (path unchanged) | ✅ AddZbSerilog (sinks moved to appsettings; LogContextEnricher kept) | ZB.MOM.WW.OtOpcUa |
| ScadaBridge | ✅ added in BindSharedOptions (both Central + Site roots) | ✅ Central; mapped on Site too (see follow-on) | ⚠️ kept LoggerConfigurationFactory + added shared TraceContextEnricher; did not adopt AddZbSerilog | (none yet; #9) |
| MxAccessGateway | ✅ exports existing GatewayMetrics | ✅ new /metrics | ✅ MEL→AddZbSerilog; GatewayLogRedactor exposed via ILogRedactor seam (GatewayLogRedactorSeam); GatewayLogScope/middleware kept as-is | MxGateway.Server (name + ms units unchanged) |
Accepted scope decisions (deviations from the original backlog)
- ScadaBridge keeps LoggerConfigurationFactory (backlog #5 revised). The factory implements a documented governance contract (REQ-HOST-8 / Host-011/014/020/022): ScadaBridge:Logging:MinimumLevel is the floor and overrides Serilog:MinimumLevel, with operator warnings. AddZbSerilog hard-codes MinimumLevel.Is(Information) before ReadFrom.Configuration, which would invert that precedence and silently drop the knob. So ScadaBridge keeps the factory and only adds the shared TraceContextEnricher to it, gaining trace↔log correlation without regressing the contract. Full AddZbSerilog adoption for ScadaBridge would first require teaching the shared bootstrap to accept a caller-supplied minimum-level governance hook.
- MxGateway keeps GatewayLogScope + request-logging middleware as-is. The Serilog MEL provider captures MEL BeginScope dictionaries as structured properties, so the scope/correlation code keeps producing the same properties under Serilog. Only the provider swap + the ILogRedactor adapter were needed.
Follow-ons — DONE 2026-06-01
All the deferred follow-ons were then executed (branch feat/telemetry-followons per repo,
behaviour-preserving except the intentional, no-consumer-yet metric-shape change in #6/#7). Plan:
docs/plans/2026-06-01-telemetry-followons.md.
| Item | Status | What landed |
|---|---|---|
| #6 MxGateway histogram ms→s | ✅ | 3 histograms record .TotalSeconds, unit "s". Safe: never Prometheus-exported before, so no dashboards broke. |
| #7 Meter rename → ZB.MOM.WW.MxGateway | ✅ | GatewayMetrics.MeterName renamed; docs/Metrics.md synced. |
| #9 ScadaBridge app instruments | ✅ | ScadaBridgeTelemetry meter (ZB.MOM.WW.ScadaBridge) + first 4: deployments.applied (counter), store_and_forward.queue.depth (sync-safe cached gauge), inbound_api.requests (counter, bounded method tag), site.connection.up (balanced open/close gauge). |
| #10/#11 OTLP opt-in | ✅ | All 3 apps read <App>:Telemetry:Exporter (Prometheus\|Otlp) + :OtlpEndpoint, default Prometheus. Setting OTLP also exports OtOpcUa's spans (resolves the trace no-op), once a collector endpoint is configured. |
| Site-node /metrics scrape | ✅ | ScadaBridge NodeOptions.MetricsPort (default 8084, avoids the site RemotingPort=8082 collision) + a second Http1AndHttp2 Kestrel listener on the Site role; StartupValidator enforces MetricsPort ≠ Remoting/Grpc. |
| Serilog version drift | ✅ | OtOpcUa Serilog.AspNetCore/.Extensions.Hosting/.Settings.Configuration aligned to 10.0.0 (family-consistent). |
Still open (not code — operational/future):
- OTLP is opt-in but unexercised until an OTel collector endpoint is deployed and the <App>:Telemetry:Exporter=Otlp + :OtlpEndpoint config is set. The wiring is in place; only a collector is missing.
- Further ScadaBridge instruments beyond the first 4 are additive future work (not blocking).