scadaproj/components/observability/GAPS.md
Joseph Doherty 215a646e35 docs(observability): fix metric-convention instrument names + NodeHostname-auto + resolve settled questions
C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).

C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).

C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.

m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.

I4: §5 standard instrumentation table corrected: OtOpcUa now shows "not added" for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.

I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).

I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.

I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.

m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
2026-06-01 07:32:58 -04:00

# Observability — gaps & adoption backlog
Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Telemetry` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
## Divergence vs spec
### §1 OTel Resource / `service.name` (P1 — nobody has it)
| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `service.name` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.namespace` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.version` | ⛔ not set | ⛔ not set | ⛔ not set |
| `site.id` | ⛔ not set | ⛔ not set | ⛔ not set |
| `node.role` | ⛔ not set | ⛔ not set | ⛔ not set |
| `host.name` | ⛔ not set | ⛔ not set | ⛔ not set |
No project configures a `ResourceBuilder` or equivalent. Every metric and span from every node is
**indistinguishable** in a backend — no service identity, no topology (site/role), no version label.
This is the single highest-value gap across the fleet; closing it requires only adding
`AddZbTelemetry` with options.
**Gap R1 (P1):** All three projects must call `AddZbTelemetry` with `ServiceName`, `SiteId`,
`NodeRole` options to populate the shared Resource. None can do so until the library is available.
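
The attribute set in the table above can be sketched as a plain dictionary built from caller-supplied options. This is an illustration only: `ResourceOptionsSketch` and `ResourceSketch` are hypothetical names (the real `ZbTelemetryOptions` is defined in `spec/SPEC.md`, not here), the mapping of the machine name to `host.name` is an assumption, and `service.namespace` / `service.version` are left out because their source is not stated in this document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical options shape holding only the three values named in Gap R1.
public sealed record ResourceOptionsSketch(string ServiceName, string SiteId, string NodeRole);

public static class ResourceSketch
{
    // Builds the per-node identity attributes that every metric and span would carry.
    public static IReadOnlyDictionary<string, object> Build(ResourceOptionsSketch options) =>
        new Dictionary<string, object>
        {
            ["service.name"] = options.ServiceName,
            ["site.id"] = options.SiteId,
            ["node.role"] = options.NodeRole,
            // Assumption: the hostname is taken from the machine, not from the caller.
            ["host.name"] = Environment.MachineName,
        };
}
```

With the OpenTelemetry .NET SDK, a dictionary like this can be handed to `ResourceBuilder.AddAttributes`; whether `AddZbTelemetry` does so internally is not confirmed here.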
### §2 Metrics export (P1 — MxGateway metrics are invisible)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK present | ✅ (`AddOpenTelemetry`) | ⛔ none | ⛔ none (`OpenTelemetry.Api` only) |
| Meter registered | ✅ `ZB.MOM.WW.OtOpcUa` | 🟡 `MxGateway.Server` (not via OTel SDK) | ⛔ no Meter |
| Prometheus export | ✅ `/metrics` | ⛔ `GetSnapshot()` only | ⛔ none |
| OTLP export | ⛔ not available | ⛔ not available | ⛔ not available |
MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data
that is **never exported**. The data lives only in an in-memory `GetSnapshot()` read path. Adding
`AddZbTelemetry` with `Meters = ["MxGateway.Server"]` closes this gap without touching
`GatewayMetrics.cs`.
**Gap M1 (P1):** MxGateway: wire OTel SDK via `AddZbTelemetry` so `GatewayMetrics` exports.
This is one half of the in-pass adoption (logging migration is the other).
**Gap M2:** ScadaBridge: define a `ScadaBridgeTelemetry` class and first application instruments
(`scadabridge.*`); register via `AddZbTelemetry`. Currently zero instrumentation.
### §3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | ✅ Serilog | ⛔ MEL only | ✅ Serilog |
| Structural enrichers | 🟡 driver-scope only | ⛔ MEL scope (provider-dependent) | ✅ `SiteId`/`NodeRole`/`NodeHostname` |
| Correlation mechanism | `LogContextEnricher.Push` | `GatewayLogScope` + `BeginScope` | structural enrichers |
| Log redaction | ⛔ none | ✅ `GatewayLogRedactor` | ⛔ none |
| `ILogRedactor` seam | ⛔ none | ⛔ bespoke | ⛔ none |
**Status: done in this task (Task #9).** The MxGateway logging migration (MEL → `AddZbSerilog`,
`GatewayLogScope` → `LogContext.PushProperty`, `GatewayLogRedactor` → `ILogRedactor` seam) is
executed as part of the `ZB.MOM.WW.Telemetry` library build. See
[`current-state/mxaccessgw/CURRENT-STATE.md`](current-state/mxaccessgw/CURRENT-STATE.md) adoption
plan for the exact changes.
**Gap L1 (in-pass, done):** MxGateway MEL → `AddZbSerilog` + `LogContext` correlation + `ILogRedactor`.
### §4 Trace↔log correlation (nobody has it)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `trace_id`/`span_id` in logs | ⛔ absent | ⛔ absent (no spans) | ⛔ absent (no spans) |
| Mechanism | `TraceContextEnricher` not wired | n/a | n/a |
OtOpcUa creates spans (`otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`) but never
pushes `Activity.Current`'s trace context onto Serilog's `LogContext`. A log emitted inside a span
cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.
The shared `TraceContextEnricher` (part of `ZB.MOM.WW.Telemetry.Serilog`) closes this for all
three projects on `AddZbSerilog` adoption — it is wired automatically.
**Gap C1:** OtOpcUa: adopt `AddZbSerilog` to wire `TraceContextEnricher`; spans and logs become
joinable once the enricher is active.
**Gap C2:** ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
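
To make the correlation gap concrete, the sketch below reads the two ids from `Activity.Current` that an enricher would copy onto each log event. It uses only `System.Diagnostics`; the class and method names are hypothetical, and the actual `TraceContextEnricher` implementation is not shown in this document. The `ActivityListener` is there because, without any listener, `StartActivity` returns `null` and no span exists to correlate with.

```csharp
using System.Diagnostics;

public static class TraceContextSketch
{
    // Returns the ids to attach to a log event, or null when no span is active.
    public static (string TraceId, string SpanId)? Read()
    {
        Activity? activity = Activity.Current;
        if (activity is null) return null;
        return (activity.TraceId.ToHexString(), activity.SpanId.ToHexString());
    }

    public static (string TraceId, string SpanId)? Demo()
    {
        // Subscribe to the source so that StartActivity actually creates a span.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "ZB.MOM.WW.OtOpcUa",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
        };
        ActivitySource.AddActivityListener(listener);

        using var source = new ActivitySource("ZB.MOM.WW.OtOpcUa");
        using Activity? span = source.StartActivity("otopcua.deploy.apply");

        // Inside the span, Activity.Current is set and both ids are available.
        return Read();
    }
}
```

A Serilog enricher would call something like `Read()` on every event and add the result as `trace_id` / `span_id` properties; outside a span it would add nothing.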
### §5 Duration unit: `ms` vs `s`
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | ✅ `s` | ⛔ `ms` (`workers.startup.duration`, `commands.duration`, `events.stream_send.duration`) | n/a (no histograms) |
MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use
seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate
scaling rules without normalization.
**Gap U1:** MxGateway: convert the three histogram call sites to record in seconds (multiply by
`0.001` or redefine the instruments). This is a **breaking change to existing dashboards** and is
tagged as a convergence item separate from the initial adoption.
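
A minimal sketch of the "redefine the instruments" option for one histogram, using only `System.Diagnostics.Metrics`. The class and method names are hypothetical stand-ins, since the real shape of `GatewayMetrics.cs` is not shown in this document; the meter and instrument names are the ones quoted above. The `MeterListener` exists only to show the recorded value.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical stand-in for one GatewayMetrics histogram.
public static class DurationUnitSketch
{
    private static readonly Meter Meter = new("MxGateway.Server");

    // Unit declared as "s" (previously "ms"); values are fractional seconds.
    private static readonly Histogram<double> CommandDuration =
        Meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "s");

    // Call site records TotalSeconds where it would previously record TotalMilliseconds.
    public static void RecordCommand(TimeSpan elapsed) =>
        CommandDuration.Record(elapsed.TotalSeconds);

    // Records one 250 ms command and returns the measurements a listener observed.
    public static List<double> Demo()
    {
        var seen = new List<double>();
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "MxGateway.Server")
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<double>(
            (instrument, value, tags, state) => seen.Add(value));
        listener.Start();

        RecordCommand(TimeSpan.FromMilliseconds(250));
        return seen;
    }
}
```

Here a 250 ms command is recorded as `0.25`. Any dashboard or alert threshold written against the old millisecond values would be off by a factor of 1000 after this change, which is why the item is tracked separately.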
### §6 Meter naming: `MxGateway.Server` vs namespace convention
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ✅ `ZB.MOM.WW.OtOpcUa` | ⛔ `MxGateway.Server` | n/a |
The fleet convention (per `spec/SPEC.md`) is `<project-namespace>` — OtOpcUa uses
`ZB.MOM.WW.OtOpcUa`; the gateway's meter should be `ZB.MOM.WW.MxGateway`. `MxGateway.Server` is
the assembly name, not the namespace, and does not carry the `ZB.MOM.WW` product prefix.
**Gap N1:** MxGateway: rename Meter from `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"`.
This is a **Prometheus metric label change** that breaks dashboards/alerts and is tracked
separately from the initial adoption.
### §7 Standard instrumentation (nobody has the full set)
| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| `HttpClient` metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Runtime / process metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client metrics | ⛔ not added | ⛔ not added | n/a |
`AddZbTelemetry` enables these via `AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`,
`AddRuntimeInstrumentation`, and `AddProcessInstrumentation` by default — all three projects gain
them on `AddZbTelemetry` adoption without code changes.
**Gap S1:** all three projects lack standard instrumentation; closed automatically by
`AddZbTelemetry` adoption.
### §8 Serilog version split (`Serilog.AspNetCore` 8 vs 9)
OtOpcUa and ScadaBridge use `Serilog.AspNetCore` but may be on different versions due to their
independent csproj updates. The shared `ZB.MOM.WW.Telemetry.Serilog` package must declare a
`Serilog.AspNetCore` version floor that works for both. Verify version alignment on adoption.
**Gap V1:** confirm `Serilog.AspNetCore` version compatibility across all three projects on
`AddZbSerilog` adoption; align if diverged.
### §9 ScadaBridge: zero instrumentation
ScadaBridge has no application instruments today. The `OpenTelemetry.Api` ref is a CVE patch, not
instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.
**Gap I1:** ScadaBridge: define `ScadaBridgeTelemetry` with a `ZB.MOM.WW.ScadaBridge` Meter and
a first set of `scadabridge.*` instruments. Tracked as a follow-on (application-specific work,
not shared-library work).
## Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → `AddZbSerilog`: `LogContext` correlation + `ILogRedactor` (Gap L1) | MxGateway | P1 | M | Low | **In-pass, done in Task #9** — unblocked by library build |
| 2 | MxGateway: wire OTel SDK via `AddZbTelemetry`; `GatewayMetrics` begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: `AddZbTelemetry` with `ServiceName`/`SiteId`/`NodeRole` → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt `AddZbSerilog` + `TraceContextEnricher` (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep `LogContextEnricher`; add shared enrichers alongside |
| 5 | ScadaBridge: adopt `AddZbSerilog`; replace `LoggerConfigurationFactory` (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; `LoggerConfigurationFactory` deleted |
| 6 | MxGateway: histogram `ms` → `s` conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"` (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via `AddZbTelemetry` options (Gap S1) | all 3 | P2 | XS | Low | Automatic on `AddZbTelemetry` adoption; no extra code |
| 9 | ScadaBridge: define `ScadaBridgeTelemetry` + first `scadabridge.*` instruments (Gap I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (§2, OTLP export row) | OtOpcUa | P3 | S | Low | Opt-in via `AddZbTelemetry` options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |
**Sequencing:** Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5
are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land
in either order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops
coordination; defer until dashboards can be updated. Items #8–#11 are cleanups that bundle
naturally with #3–#5.
This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each
app is opt-in and tracked here, not forced.
## Decisions still open
- Canonical `SiteId` and `NodeRole` config binding path — ScadaBridge reads from its own config
hierarchy; `AddZbSerilog` must accept the value directly (caller-supplied) rather than reading
from a fixed config section, to remain project-agnostic.
## Decisions settled (no longer open)
- **Prometheus vs OTLP default (SETTLED):** `AddZbTelemetry` defaults to Prometheus (matching
OtOpcUa's existing `/metrics` posture). OTLP is opt-in via `ZbTelemetryOptions.Exporter =
ZbExporter.Otlp`. See `spec/SPEC.md` §4 and shared-contract `ZbTelemetryOptions.Exporter`.
- **`ms` → `s` conversion and Meter rename bundling (SETTLED, deferred):** Both the histogram
unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway
adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and
are tracked as separate backlog items #6 and #7 in the adoption backlog above.