# Observability: gaps & adoption backlog

Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to reach the shared `ZB.MOM.WW.Telemetry` library.

Status legend: ⛔ gap · 🟡 partial · ✅ matches.

## Divergence vs spec

### §1 OTel Resource / `service.name` (P1: nobody has it)

| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `service.name` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.namespace` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.version` | ⛔ not set | ⛔ not set | ⛔ not set |
| `site.id` | ⛔ not set | ⛔ not set | ⛔ not set |
| `node.role` | ⛔ not set | ⛔ not set | ⛔ not set |
| `host.name` | ⛔ not set | ⛔ not set | ⛔ not set |

No project configures a `ResourceBuilder` or equivalent. Every metric and span from every node is **indistinguishable** in a backend: no service identity, no topology (site/role), no version label. This is the single highest-value gap across the fleet; closing it requires only a call to `AddZbTelemetry` with options.

→ **Gap R1 (P1):** All three projects must call `AddZbTelemetry` with `ServiceName`, `SiteId`, `NodeRole` options to populate the shared Resource. None can do so until the library is available.

### §2 Metrics export (P1: MxGateway metrics are invisible)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK present | ✅ (`AddOpenTelemetry`) | ⛔ none | ⛔ none (`OpenTelemetry.Api` only) |
| Meter registered | ✅ `ZB.MOM.WW.OtOpcUa` | 🟡 `MxGateway.Server` (not via OTel SDK) | ⛔ no Meter |
| Prometheus export | ✅ `/metrics` | ⛔ `GetSnapshot()` only | ⛔ none |
| OTLP export | ⛔ not available | ⛔ not available | ⛔ not available |

MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data that is **never exported**. The data lives only in an in-memory `GetSnapshot()` read path.
Adding `AddZbTelemetry` with `Meters = ["MxGateway.Server"]` closes this gap without touching `GatewayMetrics.cs`.

→ **Gap M1 (P1):** MxGateway: wire OTel SDK via `AddZbTelemetry` so `GatewayMetrics` exports. This is one half of the in-pass adoption (logging migration is the other).

→ **Gap M2:** ScadaBridge: define a `ScadaBridgeTelemetry` class and first application instruments (`scadabridge.*`); register via `AddZbTelemetry`. Currently zero instrumentation.

### §3 MxGateway logging: MEL → Serilog (P1: in-pass adoption, done in this task)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | ✅ Serilog | ⛔ MEL only | ✅ Serilog |
| Structural enrichers | 🟡 driver-scope only | ⛔ MEL scope (provider-dependent) | ✅ `SiteId`/`NodeRole`/`NodeHostname` |
| Correlation mechanism | `LogContextEnricher.Push` | `GatewayLogScope` + `BeginScope` | structural enrichers |
| Log redaction | ⛔ none | ✅ `GatewayLogRedactor` | ⛔ none |
| `ILogRedactor` seam | ⛔ none | ⛔ bespoke | ⛔ none |

**Status: done in this task (Task #9).** The MxGateway logging migration (MEL → `AddZbSerilog`, `GatewayLogScope` → `LogContext.PushProperty`, `GatewayLogRedactor` → `ILogRedactor` seam) is executed as part of the `ZB.MOM.WW.Telemetry` library build. See [`current-state/mxaccessgw/CURRENT-STATE.md`](current-state/mxaccessgw/CURRENT-STATE.md) adoption plan for the exact changes.

→ **Gap L1 (in-pass, done):** MxGateway MEL → `AddZbSerilog` + `LogContext` correlation + `ILogRedactor`.

### §4 Trace↔log correlation (nobody has it)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `trace_id`/`span_id` in logs | ⛔ absent | ⛔ absent (no spans) | ⛔ absent (no spans) |
| Mechanism | `TraceContextEnricher` not wired | n/a | n/a |

OtOpcUa creates spans (`otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`) but never pushes `Activity.Current`'s trace context onto Serilog's `LogContext`.
A log emitted inside a span cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.

The shared `TraceContextEnricher` (part of `ZB.MOM.WW.Telemetry.Serilog`) closes this for all three projects on `AddZbSerilog` adoption; it is wired automatically.

→ **Gap C1:** OtOpcUa: adopt `AddZbSerilog` to wire `TraceContextEnricher`; spans and logs become joinable once the enricher is active.

→ **Gap C2:** ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.

### §5 Duration unit: `ms` vs `s`

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | ✅ `s` | ⛔ `ms` (`workers.startup.duration`, `commands.duration`, `events.stream_send.duration`) | n/a (no histograms) |

MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use seconds. Raw values differ by a factor of 1000, so without normalization dashboards and SLO alerts would need separate scaling rules per project.

→ **Gap U1:** MxGateway: convert the three histogram call sites to record in seconds (multiply by `0.001` or redefine the instruments). This is a **breaking change to existing dashboards** and is tagged as a convergence item separate from the initial adoption.

### §6 Meter naming: `MxGateway.Server` vs namespace convention

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ✅ `ZB.MOM.WW.OtOpcUa` | ⛔ `MxGateway.Server` | n/a |

The fleet convention (per `spec/SPEC.md`) is to name the Meter after the product namespace, including the `ZB.MOM.WW` prefix. OtOpcUa uses `ZB.MOM.WW.OtOpcUa`; the gateway's meter should be `ZB.MOM.WW.MxGateway`. `MxGateway.Server` is the assembly name, not the namespace, and does not carry the `ZB.MOM.WW` product prefix.

→ **Gap N1:** MxGateway: rename Meter from `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"`. This is a **Prometheus metric label change** that breaks dashboards/alerts and is tracked separately from the initial adoption.
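
The two convergence items (Gaps U1 and N1) could look roughly like the following once applied together. This is a minimal sketch using only `System.Diagnostics.Metrics` from the .NET base library. `GatewayMetricsSketch`, `RecordCommand`, and `MillisecondsToSeconds` are hypothetical names for illustration, not the contents of `GatewayMetrics.cs`; only the instrument name `commands.duration` and the target Meter name come from the tables above.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical stand-in for the gateway's metrics class; not the real GatewayMetrics.cs.
public sealed class GatewayMetricsSketch : IDisposable
{
    // Gap N1: product-namespace Meter name instead of "MxGateway.Server".
    public const string MeterName = "ZB.MOM.WW.MxGateway";

    private readonly Meter _meter = new(MeterName);
    private readonly Histogram<double> _commandDuration;

    public GatewayMetricsSketch()
    {
        // Gap U1: the instrument declares seconds ("s") as its unit rather than "ms".
        _commandDuration = _meter.CreateHistogram<double>(
            "commands.duration",
            unit: "s",
            description: "Command duration in seconds");
    }

    // Records an elapsed time, converted to fractional seconds.
    public void RecordCommand(TimeSpan elapsed) =>
        _commandDuration.Record(elapsed.TotalSeconds);

    // Helper for call sites that currently hold a raw millisecond value.
    public static double MillisecondsToSeconds(double milliseconds) =>
        milliseconds * 0.001;

    public void Dispose() => _meter.Dispose();
}
```

Taking a `TimeSpan` at the call site keeps the unit decision in one place, so a call site that passes a `Stopwatch.Elapsed` value needs no arithmetic of its own. Whether the real instruments are redefined or the call sites are scaled is left open by Gap U1.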
### §7 Standard instrumentation (nobody has the full set)

| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| `HttpClient` metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Runtime / process metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client metrics | ⛔ not added | ⛔ not added | n/a |

`AddZbTelemetry` enables these via `AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, `AddRuntimeInstrumentation`, and `AddProcessInstrumentation` by default, so all three projects gain them on `AddZbTelemetry` adoption without code changes.

→ **Gap S1:** all three projects lack standard instrumentation; closed automatically by `AddZbTelemetry` adoption.

### §8 Serilog version split (`Serilog.AspNetCore` 8 vs 9)

OtOpcUa and ScadaBridge use `Serilog.AspNetCore` but may be on different versions due to their independent csproj updates. The shared `ZB.MOM.WW.Telemetry.Serilog` package must declare a `Serilog.AspNetCore` version floor that works for both. Verify version alignment on adoption.

→ **Gap V1:** confirm `Serilog.AspNetCore` version compatibility across all three projects on `AddZbSerilog` adoption; align if diverged.

### §9 ScadaBridge: zero instrumentation

ScadaBridge has no application instruments today. The `OpenTelemetry.Api` ref is a CVE patch, not instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.

→ **Gap I1:** ScadaBridge: define `ScadaBridgeTelemetry` with a `ZB.MOM.WW.ScadaBridge` Meter and a first set of `scadabridge.*` instruments. Tracked as a follow-on (application-specific work, not shared-library work).
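
As a starting point for Gap I1, a first `ScadaBridgeTelemetry` class might be shaped like the sketch below. Only the class name, the Meter name `ZB.MOM.WW.ScadaBridge`, and the `scadabridge.*` prefix come from this document. The two instrument names, the `outcome` tag, and the `MessageForwarded` method are hypothetical placeholders, since the real set of instruments has not been decided.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: the instruments below are illustrative, not an agreed instrument set.
public sealed class ScadaBridgeTelemetry : IDisposable
{
    public const string MeterName = "ZB.MOM.WW.ScadaBridge";

    private readonly Meter _meter = new(MeterName);
    private readonly Counter<long> _messagesForwarded;
    private readonly Histogram<double> _forwardDuration;

    public ScadaBridgeTelemetry()
    {
        // Hypothetical counter: number of messages the bridge has forwarded.
        _messagesForwarded = _meter.CreateCounter<long>(
            "scadabridge.messages.forwarded",
            unit: "{message}",
            description: "Messages forwarded by the bridge");

        // Hypothetical histogram, recorded in seconds to match the fleet unit convention.
        _forwardDuration = _meter.CreateHistogram<double>(
            "scadabridge.forward.duration",
            unit: "s",
            description: "Time taken to forward one message");
    }

    // Records one forwarded message and how long it took, tagged by outcome.
    public void MessageForwarded(bool succeeded, TimeSpan elapsed)
    {
        var outcome = new KeyValuePair<string, object?>("outcome", succeeded ? "ok" : "error");
        _messagesForwarded.Add(1, outcome);
        _forwardDuration.Record(elapsed.TotalSeconds, outcome);
    }

    public void Dispose() => _meter.Dispose();
}
```

Site and role are deliberately not tags here, because they are meant to arrive through the shared Resource (Gap R1). Export would presumably follow the pattern described for MxGateway in §2, by listing the Meter name in the `Meters` option of `AddZbTelemetry`.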
## Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → `AddZbSerilog`: `LogContext` correlation + `ILogRedactor` (Gap L1) | MxGateway | P1 | M | Low | **In-pass, done in Task #9**; unblocked by library build |
| 2 | MxGateway: wire OTel SDK via `AddZbTelemetry`; `GatewayMetrics` begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: `AddZbTelemetry` with `ServiceName`/`SiteId`/`NodeRole` → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt `AddZbSerilog` + `TraceContextEnricher` (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep `LogContextEnricher`; add shared enrichers alongside |
| 5 | ScadaBridge: adopt `AddZbSerilog`; replace `LoggerConfigurationFactory` (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; `LoggerConfigurationFactory` deleted |
| 6 | MxGateway: histogram `ms` → `s` conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"` (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via `AddZbTelemetry` options (Gap S1) | all 3 | P2 | XS | Low | Automatic on `AddZbTelemetry` adoption; no extra code |
| 9 | ScadaBridge: define `ScadaBridgeTelemetry` + first `scadabridge.*` instruments (Gaps M2, I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (§2 OTLP export row) | OtOpcUa | P3 | S | Low | Opt-in via `AddZbTelemetry` options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |

**Sequencing:** Items #1 and #2 are the in-pass adoption
(Task #9, MxGateway only). Items #3–#5 are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land in any order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops coordination; defer until dashboards can be updated. Items #8–#11 are cleanups that bundle naturally with #3–#5.

This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each app is opt-in and tracked here, not forced.

## Decisions still open

- Canonical `SiteId` and `NodeRole` config binding path: ScadaBridge reads from its own config hierarchy; `AddZbSerilog` must accept the value directly (caller-supplied) rather than reading from a fixed config section, to remain project-agnostic.

## Decisions settled (no longer open)

- **Prometheus vs OTLP default (SETTLED):** `AddZbTelemetry` defaults to Prometheus (matching OtOpcUa's existing `/metrics` posture). OTLP is opt-in via `ZbTelemetryOptions.Exporter = ZbExporter.Otlp`. See `spec/SPEC.md` §4 and shared-contract `ZbTelemetryOptions.Exporter`.
- **`ms`→`s` conversion and Meter rename bundling (SETTLED, deferred):** Both the histogram unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and are tracked as separate backlog items #6 and #7 in the adoption backlog above.