scadaproj/components/observability/GAPS.md
Joseph Doherty fba3d09eed docs(observability): current-state x3 + GAPS + README
Complete the observability normalization component docs:

- components/observability/current-state/otopcua/CURRENT-STATE.md — full
  OTel SDK (metrics + tracing) + Prometheus; 7 otopcua.* instruments + 2
  spans; Serilog with driver-scope LogContextEnricher; no Resource/service.name
  anywhere; tracing pipeline wired but no exporter; adoption plan: AddZbTelemetry
  gains shared Resource + trace↔log correlation; LogContextEnricher kept bespoke.

- components/observability/current-state/mxaccessgw/CURRENT-STATE.md — 20
  hand-rolled instruments (13 counters, 3 histograms ms-unit, 4 gauges) in
  GatewayMetrics.cs; no OTel SDK → metrics never export; MEL logging with
  GatewayLogScope correlation and GatewayLogRedactor; adoption plan: in-pass
  MEL → AddZbSerilog migration (LogContext correlation, ILogRedactor seam) +
  AddZbTelemetry wires OTel SDK so GatewayMetrics finally exports.

- components/observability/current-state/scadabridge/CURRENT-STATE.md —
  OpenTelemetry.Api is a CVE-patch override only (zero instrumentation); Serilog
  with SiteId/NodeRole/NodeHostname enrichers (strongest set in family); adoption
  plan: replace CVE ref with AddZbTelemetry; adopt AddZbSerilog (LoggerConfigurationFactory
  deleted); add first scadabridge.* instruments.

- components/observability/GAPS.md — divergence table across §1 Resource (P1,
  nobody), §2 metrics export (P1, MxGateway invisible), §3 MxGateway MEL→Serilog
  (P1, in-pass done), §4 trace↔log correlation, §5 ms→s unit, §6 Meter naming,
  §7 standard instrumentation, §8 Serilog version, §9 ScadaBridge zero
  instrumentation; 11-item prioritized backlog.

- components/observability/README.md — overview, per-project status table
  (OTel today / metrics / tracing / logging / enrichers / adoption status),
  normalized vs. left-per-project boundary, 2-package structure, component status.
2026-06-01 07:23:08 -04:00

# Observability — gaps & adoption backlog
Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Telemetry` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
## Divergence vs spec
### §1 OTel Resource / `service.name` (P1 — nobody has it)
| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `service.name` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.namespace` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.version` | ⛔ not set | ⛔ not set | ⛔ not set |
| `site.id` | ⛔ not set | ⛔ not set | ⛔ not set |
| `node.role` | ⛔ not set | ⛔ not set | ⛔ not set |
| `host.name` | ⛔ not set | ⛔ not set | ⛔ not set |
No project configures a `ResourceBuilder` or equivalent. Every metric and span from every node is
**indistinguishable** in a backend — no service identity, no topology (site/role), no version label.
This is the single highest-value gap across the fleet; closing it requires only adding
`AddZbTelemetry` with options.
**Gap R1 (P1):** All three projects must call `AddZbTelemetry` with `ServiceName`, `SiteId`,
`NodeRole` options to populate the shared Resource. None can do so until the library is available.
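As a rough illustration of what populating the Resource involves, the sketch below sets the six spec attributes directly against the OpenTelemetry .NET SDK. The `AddZbTelemetry` signature is not shown in this document, so the helper name `AddSharedResource`, its parameters, the `"ZB.MOM.WW"` namespace value, and the use of `Environment.MachineName` for `host.name` are all assumptions. Only `AddOpenTelemetry`, `ConfigureResource`, `AddService`, and `AddAttributes` are existing SDK calls (packages `OpenTelemetry` and `OpenTelemetry.Extensions.Hosting`).

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Resources;

public static class ResourceSketch
{
    // Hypothetical helper: shows the attributes from the table above,
    // not the actual AddZbTelemetry implementation.
    public static IServiceCollection AddSharedResource(
        this IServiceCollection services,
        string serviceName,
        string serviceVersion,
        string siteId,
        string nodeRole)
    {
        services.AddOpenTelemetry()
            .ConfigureResource(resource => resource
                // Sets service.name, service.namespace, service.version.
                .AddService(
                    serviceName: serviceName,
                    serviceNamespace: "ZB.MOM.WW", // assumed value
                    serviceVersion: serviceVersion)
                // Topology attributes; keys follow the spec table.
                .AddAttributes(new KeyValuePair<string, object>[]
                {
                    new("site.id", siteId),
                    new("node.role", nodeRole),
                    new("host.name", Environment.MachineName), // assumed source
                }));

        return services;
    }
}
```

Because the Resource is attached at the provider level, every metric and span emitted by the process would carry these attributes without changes to individual instruments.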
### §2 Metrics export (P1 — MxGateway metrics are invisible)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK present | ✅ (`AddOpenTelemetry`) | ⛔ none | ⛔ none (`OpenTelemetry.Api` only) |
| Meter registered | ✅ `ZB.MOM.WW.OtOpcUa` | 🟡 `MxGateway.Server` (not via OTel SDK) | ⛔ no Meter |
| Prometheus export | ✅ `/metrics` | ⛔ `GetSnapshot()` only | ⛔ none |
| OTLP export | ⛔ not available | ⛔ not available | ⛔ not available |
MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data
that is **never exported**. The data lives only in an in-memory `GetSnapshot()` read path. Adding
`AddZbTelemetry` with `Meters = ["MxGateway.Server"]` closes this gap without touching
`GatewayMetrics.cs`.
**Gap M1 (P1):** MxGateway: wire OTel SDK via `AddZbTelemetry` so `GatewayMetrics` exports.
This is one half of the in-pass adoption (logging migration is the other).
**Gap M2:** ScadaBridge: define a `ScadaBridgeTelemetry` class and first application instruments
(`scadabridge.*`); register via `AddZbTelemetry`. Currently zero instrumentation.
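The reason instruments can record real data that is never exported comes from how `System.Diagnostics.Metrics` works in general: a measurement is delivered only to listeners that have subscribed to the instrument, and the OTel SDK is one such listener. The sketch below demonstrates this with a plain `MeterListener` standing in for the SDK. The instrument name `demo.commands` is hypothetical, and nothing here is taken from `GatewayMetrics.cs`.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterVisibilityDemo
{
    // Returns the total observed before and after a listener is attached.
    public static (long Before, long After) Run()
    {
        using var meter = new Meter("MxGateway.Server");
        Counter<long> commands = meter.CreateCounter<long>("demo.commands");

        long observed = 0;

        // No listener is subscribed yet, so this measurement goes nowhere.
        commands.Add(1);
        long before = observed;

        // Stand-in for the OTel SDK: subscribe to the meter by name.
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "MxGateway.Server")
            {
                l.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => observed += value);
        listener.Start();

        // Now that a listener is attached, this measurement is delivered.
        commands.Add(2);
        return (before, observed);
    }
}
```

Only the second `Add` reaches the listener, which is the same situation a meter is in before and after an SDK subscribes to it by name.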
### §3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | ✅ Serilog | ⛔ MEL only | ✅ Serilog |
| Structural enrichers | 🟡 driver-scope only | ⛔ MEL scope (provider-dependent) | ✅ `SiteId`/`NodeRole`/`NodeHostname` |
| Correlation mechanism | `LogContextEnricher.Push` | `GatewayLogScope` + `BeginScope` | structural enrichers |
| Log redaction | ⛔ none | ✅ `GatewayLogRedactor` | ⛔ none |
| `ILogRedactor` seam | ⛔ none | ⛔ bespoke | ⛔ none |
**Status: done in this task (Task #9).** The MxGateway logging migration (MEL → `AddZbSerilog`,
`GatewayLogScope` → `LogContext.PushProperty`, `GatewayLogRedactor` → `ILogRedactor` seam) is
executed as part of the `ZB.MOM.WW.Telemetry` library build. See the adoption plan in
[`current-state/mxaccessgw/CURRENT-STATE.md`](current-state/mxaccessgw/CURRENT-STATE.md)
for the exact changes.
**Gap L1 (in-pass, done):** MxGateway MEL → `AddZbSerilog` + `LogContext` correlation + `ILogRedactor`.
### §4 Trace↔log correlation (nobody has it)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `trace_id`/`span_id` in logs | ⛔ absent | ⛔ absent (no spans) | ⛔ absent (no spans) |
| Mechanism | `TraceContextEnricher` not wired | n/a | n/a |
OtOpcUa creates spans (`otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`) but never
pushes `Activity.Current`'s trace context onto Serilog's `LogContext`. A log emitted inside a span
cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.
The shared `TraceContextEnricher` (part of `ZB.MOM.WW.Telemetry.Serilog`) closes this for all
three projects on `AddZbSerilog` adoption — it is wired automatically.
**Gap C1:** OtOpcUa: adopt `AddZbSerilog` to wire `TraceContextEnricher`; spans and logs become
joinable once the enricher is active.
**Gap C2:** ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
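For orientation, a trace-context enricher of this kind can be quite small. The class below is a hypothetical stand-in, not the shared `TraceContextEnricher` itself, whose source is not shown here. It assumes the property names `trace_id` and `span_id` from the table above and uses only the Serilog `ILogEventEnricher` contract plus `System.Diagnostics.Activity`.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Hypothetical sketch; the real TraceContextEnricher may differ.
public sealed class TraceContextEnricherSketch : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            // Log emitted outside any span: nothing to correlate.
            return;
        }

        // W3C-format ids as lowercase hex, matching what a trace backend stores.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```

Registered with `new LoggerConfiguration().Enrich.With(new TraceContextEnricherSketch())`, an enricher like this adds the ids only while a span is active, which is why it has no visible effect in MxGateway and ScadaBridge until they create spans.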
### §5 Duration unit: `ms` vs `s`
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | ✅ `s` | ⛔ `ms` (`workers.startup.duration`, `commands.duration`, `events.stream_send.duration`) | n/a (no histograms) |
MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use
seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate
scaling rules without normalization.
**Gap U1:** MxGateway: convert the three histogram call sites to record in seconds (multiply by
`0.001` or redefine the instruments). This is a **breaking change to existing dashboards** and is
tagged as a convergence item separate from the initial adoption.
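A minimal sketch of the two options named in Gap U1, assuming nothing about the real call sites in `GatewayMetrics.cs`: either scale an existing millisecond value by `0.001`, or measure in seconds from the start and declare the instrument unit as `s`. The class and method names are hypothetical; `commands.duration` is taken from the table above.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class DurationUnitSketch
{
    private static readonly Meter GatewayMeter = new("MxGateway.Server");

    // Redefined instrument: the unit is declared as seconds.
    private static readonly Histogram<double> CommandDuration =
        GatewayMeter.CreateHistogram<double>("commands.duration", unit: "s");

    // Option 1: keep the existing millisecond measurement and scale it.
    public static double MillisecondsToSeconds(double milliseconds) =>
        milliseconds * 0.001;

    // Option 2: measure in seconds directly at the call site.
    public static void RecordCommand(Action command)
    {
        var stopwatch = Stopwatch.StartNew();
        command();
        stopwatch.Stop();
        CommandDuration.Record(stopwatch.Elapsed.TotalSeconds);
    }
}
```

Either way the recorded values shrink by a factor of 1000, so any dashboard threshold written against the millisecond series has to change in the same release.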
### §6 Meter naming: `MxGateway.Server` vs namespace convention
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ✅ `ZB.MOM.WW.OtOpcUa` | ⛔ `MxGateway.Server` | n/a |
The fleet convention (per `spec/SPEC.md`) is `<project-namespace>` — OtOpcUa uses
`ZB.MOM.WW.OtOpcUa`; the gateway's meter should be `ZB.MOM.WW.MxGateway`. `MxGateway.Server` is
the assembly name, not the namespace, and does not carry the `ZB.MOM.WW` product prefix.
**Gap N1:** MxGateway: rename Meter from `"MxGateway.Server"` to `"ZB.MOM.WW.MxGateway"`.
This is a **Prometheus metric label change** that breaks dashboards/alerts and is tracked
separately from the initial adoption.
### §7 Standard instrumentation (nobody has the full set)
| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| `HttpClient` metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Runtime / process metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client metrics | ⛔ not added | ⛔ not added | n/a |
`AddZbTelemetry` enables these via `AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`,
`AddRuntimeInstrumentation`, and `AddProcessInstrumentation` by default — all three projects gain
them on `AddZbTelemetry` adoption without code changes.
**Gap S1:** all three projects lack standard instrumentation; closed automatically by
`AddZbTelemetry` adoption.
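The four calls named above are existing extension methods from the OpenTelemetry .NET instrumentation packages (`OpenTelemetry.Instrumentation.AspNetCore`, `.Http`, `.Runtime`, and `.Process`). A sketch of the wiring they imply is below; the helper name `AddStandardMetrics` is hypothetical, and how `AddZbTelemetry` actually composes these calls is an assumption.

```csharp
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Metrics;

public static class StandardInstrumentationSketch
{
    // Hypothetical helper; not the actual AddZbTelemetry implementation.
    public static IServiceCollection AddStandardMetrics(
        this IServiceCollection services,
        params string[] applicationMeters)
    {
        services.AddOpenTelemetry().WithMetrics(metrics =>
        {
            metrics
                .AddAspNetCoreInstrumentation()  // inbound request metrics
                .AddHttpClientInstrumentation()  // outbound HttpClient metrics
                .AddRuntimeInstrumentation()     // GC, thread pool, JIT
                .AddProcessInstrumentation();    // CPU and memory of the process

            // Application meters, for example "ZB.MOM.WW.OtOpcUa".
            foreach (var meterName in applicationMeters)
            {
                metrics.AddMeter(meterName);
            }
        });

        return services;
    }
}
```

An exporter still has to be added to the same builder before any of these metrics leave the process.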
### §8 Serilog version split (`Serilog.AspNetCore` 8 vs 9)
OtOpcUa and ScadaBridge use `Serilog.AspNetCore` but may be on different versions due to their
independent csproj updates, and MxGateway joins them once it migrates (§3). The shared
`ZB.MOM.WW.Telemetry.Serilog` package must declare a `Serilog.AspNetCore` version floor that works
for every adopter. Verify version alignment on adoption.
**Gap V1:** confirm `Serilog.AspNetCore` version compatibility across all three projects on
`AddZbSerilog` adoption; align if diverged.
### §9 ScadaBridge: zero instrumentation
ScadaBridge has no application instruments today. The `OpenTelemetry.Api` ref is a CVE patch, not
instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.
**Gap I1:** ScadaBridge: define `ScadaBridgeTelemetry` with a `ZB.MOM.WW.ScadaBridge` Meter and
a first set of `scadabridge.*` instruments. Tracked as a follow-on (application-specific work,
not shared-library work).
## Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → `AddZbSerilog`: `LogContext` correlation + `ILogRedactor` (Gap L1) | MxGateway | P1 | M | Low | **In-pass, done in Task #9** — unblocked by library build |
| 2 | MxGateway: wire OTel SDK via `AddZbTelemetry`; `GatewayMetrics` begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: `AddZbTelemetry` with `ServiceName`/`SiteId`/`NodeRole` → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt `AddZbSerilog` + `TraceContextEnricher` (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep `LogContextEnricher`; add shared enrichers alongside |
| 5 | ScadaBridge: adopt `AddZbSerilog`; replace `LoggerConfigurationFactory` (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; `LoggerConfigurationFactory` deleted |
| 6 | MxGateway: histogram `ms` → `s` conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"` (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via `AddZbTelemetry` options (Gap S1) | all 3 | P2 | XS | Low | Automatic on `AddZbTelemetry` adoption; no extra code |
| 9 | ScadaBridge: define `ScadaBridgeTelemetry` + first `scadabridge.*` instruments (Gap I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (§2, OTLP export row) | OtOpcUa | P3 | S | Low | Opt-in via `AddZbTelemetry` options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |
**Sequencing:** Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5
are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land
in either order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops
coordination; defer until dashboards can be updated. Items #8–#11 are cleanups that bundle
naturally with #3–#5.
This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each
app is opt-in and tracked here, not forced.
## Decisions still open
- Whether `AddZbTelemetry` enables OTLP by default (simplest for new setups) or Prometheus by
default (matches OtOpcUa's current posture). Design doc says Prometheus default; OTLP opt-in.
- Whether the `ms` → `s` conversion and Meter rename are bundled with the initial MxGateway
adoption (Task #9) or deferred. Deferring avoids dashboard breaks during the migration window.
- Canonical `SiteId` and `NodeRole` config binding path — ScadaBridge reads from its own config
hierarchy; `AddZbSerilog` must accept the value directly (caller-supplied) rather than reading
from a fixed config section, to remain project-agnostic.