docs(observability): current-state x3 + GAPS + README

Complete the observability normalization component docs:

- components/observability/current-state/otopcua/CURRENT-STATE.md — full
  OTel SDK (metrics + tracing) + Prometheus; 7 otopcua.* instruments + 2
  spans; Serilog with driver-scope LogContextEnricher; no Resource/service.name
  anywhere; tracing pipeline wired but no exporter; adoption plan: AddZbTelemetry
  gains shared Resource + trace↔log correlation; LogContextEnricher kept bespoke.

- components/observability/current-state/mxaccessgw/CURRENT-STATE.md — 20
  hand-rolled instruments (13 counters, 3 histograms ms-unit, 4 gauges) in
  GatewayMetrics.cs; no OTel SDK → metrics never export; MEL logging with
  GatewayLogScope correlation and GatewayLogRedactor; adoption plan: in-pass
  MEL → AddZbSerilog migration (LogContext correlation, ILogRedactor seam) +
  AddZbTelemetry wires OTel SDK so GatewayMetrics finally exports.

- components/observability/current-state/scadabridge/CURRENT-STATE.md —
  OpenTelemetry.Api is a CVE-patch override only (zero instrumentation); Serilog
  with SiteId/NodeRole/NodeHostname enrichers (strongest set in family); adoption
  plan: replace CVE ref with AddZbTelemetry; adopt AddZbSerilog (LoggerConfigurationFactory
  deleted); add first scadabridge.* instruments.

- components/observability/GAPS.md — divergence table across §1 Resource (P1,
  nobody), §2 metrics export (P1, MxGateway invisible), §3 MxGateway MEL→Serilog
  (P1, in-pass done), §4 trace↔log correlation, §5 ms→s unit, §6 Meter naming,
  §7 standard instrumentation, §8 Serilog version, §9 ScadaBridge zero
  instrumentation; 11-item prioritized backlog.

- components/observability/README.md — overview, per-project status table
  (OTel today / metrics / tracing / logging / enrichers / adoption status),
  normalized vs. left-per-project boundary, 2-package structure, component status.
Joseph Doherty
2026-06-01 07:23:08 -04:00
parent 7d243890ed
commit fba3d09eed
2 changed files with 283 additions and 0 deletions
@@ -0,0 +1,177 @@
# Observability — gaps & adoption backlog
Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Telemetry` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
## Divergence vs spec
### §1 OTel Resource / `service.name` (P1 — nobody has it)
| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `service.name` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.namespace` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.version` | ⛔ not set | ⛔ not set | ⛔ not set |
| `site.id` | ⛔ not set | ⛔ not set | ⛔ not set |
| `node.role` | ⛔ not set | ⛔ not set | ⛔ not set |
| `host.name` | ⛔ not set | ⛔ not set | ⛔ not set |
No project configures a `ResourceBuilder` or equivalent. Every metric and span from every node is
**indistinguishable** in a backend — no service identity, no topology (site/role), no version label.
This is the single highest-value gap across the fleet; closing it requires only adding
`AddZbTelemetry` with options.
**Gap R1 (P1):** All three projects must call `AddZbTelemetry` with `ServiceName`, `SiteId`,
`NodeRole` options to populate the shared Resource. None can do so until the library is available.
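
As an illustration of what closing R1 involves, the sketch below sets the six spec attributes with the stock OpenTelemetry .NET SDK. It is not the `AddZbTelemetry` implementation, whose internals are not shown in this document. The literal values (`otopcua`, `site-01`, `primary`, `1.0.0`, and `ZB.MOM.WW` as the namespace) are placeholders, and the `OpenTelemetry.Extensions.Hosting` package is assumed.

```csharp
// Sketch only: plain OpenTelemetry SDK calls that a helper such as
// AddZbTelemetry might wrap. All literal values are placeholders.
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Resources;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        // service.name / service.namespace / service.version
        .AddService(
            serviceName: "otopcua",
            serviceNamespace: "ZB.MOM.WW",
            serviceVersion: "1.0.0")
        // Topology attributes from the spec table above.
        .AddAttributes(new Dictionary<string, object>
        {
            ["site.id"] = "site-01",
            ["node.role"] = "primary",
            ["host.name"] = Environment.MachineName,
        }));

var app = builder.Build();
app.Run();
```

The Resource is attached once per process, so every metric and span the SDK exports from that node carries the same identity without per-instrument tags.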
### §2 Metrics export (P1 — MxGateway metrics are invisible)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK present | ✅ (`AddOpenTelemetry`) | ⛔ none | ⛔ none (`OpenTelemetry.Api` only) |
| Meter registered | ✅ `ZB.MOM.WW.OtOpcUa` | 🟡 `MxGateway.Server` (not via OTel SDK) | ⛔ no Meter |
| Prometheus export | ✅ `/metrics` | ⛔ `GetSnapshot()` only | ⛔ none |
| OTLP export | ⛔ not available | ⛔ not available | ⛔ not available |
MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data
that is **never exported**. The data lives only in an in-memory `GetSnapshot()` read path. Adding
`AddZbTelemetry` with `Meters = ["MxGateway.Server"]` closes this gap without touching
`GatewayMetrics.cs`.
**Gap M1 (P1):** MxGateway: wire OTel SDK via `AddZbTelemetry` so `GatewayMetrics` exports.
This is one half of the in-pass adoption (logging migration is the other).
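
A minimal sketch of the SDK wiring this gap calls for, assuming `AddZbTelemetry` ends up wrapping calls like these (its real body may differ) and assuming the `OpenTelemetry.Extensions.Hosting` and `OpenTelemetry.Exporter.Prometheus.AspNetCore` packages. The SDK subscribes to the existing Meter by name, which is why `GatewayMetrics.cs` can stay untouched.

```csharp
// Sketch only: subscribe the OTel SDK to the gateway's existing Meter
// and expose a Prometheus scrape endpoint.
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("MxGateway.Server")   // name of the Meter the gateway already creates
        .AddPrometheusExporter());

var app = builder.Build();
app.MapPrometheusScrapingEndpoint();    // serves /metrics by default
app.Run();
```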
**Gap M2:** ScadaBridge: define a `ScadaBridgeTelemetry` class and first application instruments
(`scadabridge.*`); register via `AddZbTelemetry`. Currently zero instrumentation.
### §3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | ✅ Serilog | ⛔ MEL only | ✅ Serilog |
| Structural enrichers | 🟡 driver-scope only | ⛔ MEL scope (provider-dependent) | ✅ `SiteId`/`NodeRole`/`NodeHostname` |
| Correlation mechanism | `LogContextEnricher.Push` | `GatewayLogScope` + `BeginScope` | structural enrichers |
| Log redaction | ⛔ none | ✅ `GatewayLogRedactor` | ⛔ none |
| `ILogRedactor` seam | ⛔ none | ⛔ bespoke | ⛔ none |
**Status: done in this task (Task #9).** The MxGateway logging migration (MEL → `AddZbSerilog`,
`GatewayLogScope` → `LogContext.PushProperty`, `GatewayLogRedactor` → `ILogRedactor` seam) is
executed as part of the `ZB.MOM.WW.Telemetry` library build. See
[`current-state/mxaccessgw/CURRENT-STATE.md`](current-state/mxaccessgw/CURRENT-STATE.md) adoption
plan for the exact changes.
**Gap L1 (in-pass, done):** MxGateway MEL → `AddZbSerilog` + `LogContext` correlation + `ILogRedactor`.
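
To make the `LogContext` correlation concrete, here is a small sketch using Serilog's own `LogContext.PushProperty`. It does not reproduce `GatewayLogScope` or `AddZbSerilog`; the property values are invented, the property names are taken from the gateway's current scope fields, and the `Serilog` and `Serilog.Sinks.Console` packages are assumed.

```csharp
// Sketch only: scoped correlation properties with Serilog's LogContext.
using System;
using Serilog;
using Serilog.Context;

Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()   // required, or pushed properties never reach events
    .WriteTo.Console(outputTemplate: "{Message:lj} {Properties:j}{NewLine}")
    .CreateLogger();

using (LogContext.PushProperty("SessionId", "session-42"))
using (LogContext.PushProperty("CorrelationId", Guid.NewGuid()))
{
    // Carries SessionId and CorrelationId.
    Log.Information("Command dispatched");
}

// Outside the using blocks the properties are gone.
Log.Information("Idle");

Log.CloseAndFlush();
```

Unlike MEL's `BeginScope`, whose rendering depends on the logging provider, pushed properties are attached to the event itself, so every sink sees them.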
### §4 Trace↔log correlation (nobody has it)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `trace_id`/`span_id` in logs | ⛔ absent | ⛔ absent (no spans) | ⛔ absent (no spans) |
| Mechanism | `TraceContextEnricher` not wired | n/a | n/a |
OtOpcUa creates spans (`otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`) but never
pushes `Activity.Current`'s trace context onto Serilog's `LogContext`. A log emitted inside a span
cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.
The shared `TraceContextEnricher` (part of `ZB.MOM.WW.Telemetry.Serilog`) closes this for all
three projects on `AddZbSerilog` adoption — it is wired automatically.
**Gap C1:** OtOpcUa: adopt `AddZbSerilog` to wire `TraceContextEnricher`; spans and logs become
joinable once the enricher is active.
**Gap C2:** ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
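
The mechanism can be sketched with a small Serilog enricher that copies the ambient `Activity` identifiers onto each event. `TraceContextEnricherSketch` is a hypothetical stand-in written for this illustration; the shared `TraceContextEnricher` may be implemented differently. The `Serilog` and `Serilog.Sinks.Console` packages are assumed.

```csharp
// Sketch only: a stand-in enricher, not the shared library's type.
using System.Diagnostics;
using Serilog;
using Serilog.Core;
using Serilog.Events;

Log.Logger = new LoggerConfiguration()
    .Enrich.With(new TraceContextEnricherSketch())
    .WriteTo.Console(outputTemplate: "{Message:lj} {Properties:j}{NewLine}")
    .CreateLogger();

// A listener is needed, otherwise StartActivity returns null.
using var source = new ActivitySource("Demo");
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

using (source.StartActivity("otopcua.deploy.apply"))
{
    Log.Information("Inside the span");   // gets trace_id and span_id
}
Log.Information("Outside any span");      // gets neither

Log.CloseAndFlush();

public sealed class TraceContextEnricherSketch : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null) return;     // no ambient span, nothing to add

        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```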
### §5 Duration unit: `ms` vs `s`
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | ✅ `s` | ⛔ `ms` (`workers.startup.duration`, `commands.duration`, `events.stream_send.duration`) | n/a (no histograms) |
MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use
seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate
scaling rules without normalization.
**Gap U1:** MxGateway: convert the three histogram call sites to record in seconds (multiply by
`0.001` or redefine the instruments). This is a **breaking change to existing dashboards** and is
tagged as a convergence item separate from the initial adoption.
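
A sketch of the "redefine the instruments" option, using only `System.Diagnostics.Metrics`. The full instrument name `mxgateway.commands.duration` and the description text are assumptions made for illustration; `Stopwatch.GetElapsedTime` needs .NET 7 or later.

```csharp
// Sketch only: a duration histogram declared and recorded in seconds.
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

using var meter = new Meter("MxGateway.Server");

// Unit "s" instead of "ms"; the instrument name is assumed.
var commandDuration = meter.CreateHistogram<double>(
    name: "mxgateway.commands.duration",
    unit: "s",
    description: "Command execution time");

long start = Stopwatch.GetTimestamp();
// ... execute the command ...
TimeSpan elapsed = Stopwatch.GetElapsedTime(start);

// Previously the call site would have passed elapsed.TotalMilliseconds.
commandDuration.Record(elapsed.TotalSeconds);
```

Changing the unit together with the recorded value keeps the instrument self-describing, so a backend is not handed seconds under a `ms` label.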
### §6 Meter naming: `MxGateway.Server` vs namespace convention
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ✅ `ZB.MOM.WW.OtOpcUa` | ⛔ `MxGateway.Server` | n/a |
The fleet convention (per `spec/SPEC.md`) is `<project-namespace>` — OtOpcUa uses
`ZB.MOM.WW.OtOpcUa`; the gateway's meter should be `ZB.MOM.WW.MxGateway`. `MxGateway.Server` is
the assembly name, not the namespace, and does not carry the `ZB.MOM.WW` product prefix.
**Gap N1:** MxGateway: rename Meter from `"MxGateway.Server"` to `"ZB.MOM.WW.MxGateway"`.
This is a **Prometheus metric label change** that breaks dashboards/alerts and is tracked
separately from the initial adoption.
### §7 Standard instrumentation (nobody has the full set)
| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| `HttpClient` metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Runtime / process metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client metrics | ⛔ not added | ⛔ not added | n/a |
`AddZbTelemetry` enables these via `AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`,
`AddRuntimeInstrumentation`, and `AddProcessInstrumentation` by default — all three projects gain
them on `AddZbTelemetry` adoption without code changes.
**Gap S1:** all three projects lack standard instrumentation; closed automatically by
`AddZbTelemetry` adoption.
### §8 Serilog version split (`Serilog.AspNetCore` 8 vs 9)
OtOpcUa and ScadaBridge use `Serilog.AspNetCore` but may be on different versions due to their
independent csproj updates. The shared `ZB.MOM.WW.Telemetry.Serilog` package must declare a
`Serilog.AspNetCore` version floor that works for both. Verify version alignment on adoption.
**Gap V1:** confirm `Serilog.AspNetCore` version compatibility across all three projects on
`AddZbSerilog` adoption; align if diverged.
### §9 ScadaBridge: zero instrumentation
ScadaBridge has no application instruments today. The `OpenTelemetry.Api` ref is a CVE patch, not
instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.
**Gap I1:** ScadaBridge: define `ScadaBridgeTelemetry` with a `ZB.MOM.WW.ScadaBridge` Meter and
a first set of `scadabridge.*` instruments. Tracked as a follow-on (application-specific work,
not shared-library work).
## Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → `AddZbSerilog`: `LogContext` correlation + `ILogRedactor` (Gap L1) | MxGateway | P1 | M | Low | **In-pass, done in Task #9** — unblocked by library build |
| 2 | MxGateway: wire OTel SDK via `AddZbTelemetry`; `GatewayMetrics` begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: `AddZbTelemetry` with `ServiceName`/`SiteId`/`NodeRole` → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt `AddZbSerilog` + `TraceContextEnricher` (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep `LogContextEnricher`; add shared enrichers alongside |
| 5 | ScadaBridge: adopt `AddZbSerilog`; replace `LoggerConfigurationFactory` (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; `LoggerConfigurationFactory` deleted |
| 6 | MxGateway: histogram `ms` → `s` conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"` (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via `AddZbTelemetry` options (Gap S1) | all 3 | P2 | XS | Low | Automatic on `AddZbTelemetry` adoption; no extra code |
| 9 | ScadaBridge: define `ScadaBridgeTelemetry` + first `scadabridge.*` instruments (Gap I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (§2, OTLP export row) | OtOpcUa | P3 | S | Low | Opt-in via `AddZbTelemetry` options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |
**Sequencing:** Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5
are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land
in any order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops
coordination; defer until dashboards can be updated. Items #8–#11 are cleanups that bundle
naturally with #3–#5.
This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each
app is opt-in and tracked here, not forced.
## Decisions still open
- Whether `AddZbTelemetry` enables OTLP by default (simplest for new setups) or Prometheus by
default (matches OtOpcUa's current posture). Design doc says Prometheus default; OTLP opt-in.
- Whether the `ms` → `s` conversion and Meter rename are bundled with the initial MxGateway
adoption (Task #9) or deferred. Deferring avoids dashboard breaks during the migration window.
- Canonical `SiteId` and `NodeRole` config binding path — ScadaBridge reads from its own config
hierarchy; `AddZbSerilog` must accept the value directly (caller-supplied) rather than reading
from a fixed config section, to remain project-agnostic.
@@ -0,0 +1,106 @@
# Observability (metrics / traces / logs)
Third normalized component under the operability cluster. **Goal: path to shared code** — converge
the three sister projects onto a common OpenTelemetry Resource, a shared Serilog bootstrap with
unified enrichers, and a trace↔log correlation bridge, proposed as the `ZB.MOM.WW.Telemetry`
library set (2 packages), while each project keeps its own application instruments and sink
configuration.
- The one target: [`spec/SPEC.md`](spec/SPEC.md)
- Metric naming reference: [`spec/METRIC-CONVENTIONS.md`](spec/METRIC-CONVENTIONS.md)
- The proposed shared library: [`shared-contract/ZB.MOM.WW.Telemetry.md`](shared-contract/ZB.MOM.WW.Telemetry.md)
- Divergences + backlog: [`GAPS.md`](GAPS.md)
- Current state, per project: [`current-state/`](current-state/)
## Why observability is a strong normalization candidate
All three projects instrument something — but in three completely different ways and at three very
different levels of completeness. The divergences are structural:
- **OtOpcUa** has the full OpenTelemetry SDK (metrics + tracing), Prometheus export, and a bespoke
Serilog enricher for driver-lifecycle correlation — but no Resource (`service.name` is never set)
and no trace↔log bridge.
- **MxAccessGateway** has 20 hand-rolled instruments (counters, histograms, gauges) recording real
production data — that never leave the process. No OTel SDK, no exporter, no tracing. Logging
uses Microsoft.Extensions.Logging rather than Serilog, with a bespoke correlation-scope and
redaction pipeline.
- **ScadaBridge** has zero application instruments. Its `OpenTelemetry.Api` reference is a CVE
patch, not instrumentation. It does have the cleanest structured logging enricher set
(`SiteId`/`NodeRole`/`NodeHostname`) — but those properties exist only in Serilog, not in the
OTel Resource, so logs and metrics cannot join in a backend.
Nobody sets a Resource. Nobody does trace↔log correlation. MxGateway's metrics are invisible.
ScadaBridge has no metrics at all.
The common fix is a single `AddZbTelemetry(options)` call that: creates a shared Resource from a
`service.name`/`site.id`/`node.role` options object; registers the project's own Meter/ActivitySource
names with the OTel SDK; and exposes Prometheus `/metrics`. A companion `AddZbSerilog(options)` wires
Serilog with the same options as enricher properties and adds `TraceContextEnricher` so logs carry
`trace_id`/`span_id`. The unifying hinge: the same identity triple (`service.name`/`site.id`/
`node.role`) populates both the OTel Resource and the Serilog enrichers, so a metric, a span, and
a log line from the same node carry identical dimensions and join up in a backend.
One adoption happens **in this task**: MxAccessGateway migrates off MEL onto `AddZbSerilog`. All
other app wiring is follow-on, consistent with how Auth and UI-Theme are structured.
## Status by project
| Project | OTel SDK today | Metrics today | Tracing today | Logging today | Enrichers today | Adoption status |
|---|---|---|---|---|---|---|
| **OtOpcUa** | ✅ full SDK (`WithMetrics`+`WithTracing`) | ✅ 7 instruments (`otopcua.*`); Prometheus `/metrics` | 🟡 2 spans defined; no exporter | Serilog (Console+File) | `DriverInstanceId`/`DriverType`/`CapabilityName`/`CorrelationId` (driver-scope) | Not started (follow-on) |
| **MxAccessGateway** | ⛔ none (hand-rolled `Meter`) | 🟡 20 instruments (`mxgateway.*`); **never exported** | ⛔ none | MEL → **migrating to Serilog in this task** | `SessionId`/`WorkerProcessId`/`CorrelationId`/`CommandMethod` (MEL scope) | **In progress (Task #9)** |
| **ScadaBridge** | ⛔ (`OpenTelemetry.Api` CVE-patch only) | ⛔ zero instruments | ⛔ none | Serilog (Console+File) | `SiteId`/`NodeRole`/`NodeHostname` (process-level; strongest set) | Not started (follow-on) |
See each project's [`current-state/<project>/CURRENT-STATE.md`](current-state/) for the
code-verified detail and its adoption plan.
## Normalized vs. left per-project
**Normalized (the shared target):**
- `AddZbTelemetry(ZbTelemetryOptions)` — front door for the OTel SDK. Populates the shared
Resource (`service.name`, `service.namespace`, `service.version`, `site.id`, `node.role`,
`host.name`). Registers the caller-supplied Meter and ActivitySource name(s). Wires standard
instrumentation (ASP.NET Core, HttpClient, runtime, process). Prometheus default; OTLP opt-in.
- `app.MapZbMetrics()` — maps the Prometheus `/metrics` endpoint (shared path + shared exporter).
- `AddZbSerilog(ZbTelemetryOptions)` — shared Serilog two-stage bootstrap generalizing
ScadaBridge's `LoggerConfigurationFactory`. Wires `SiteId`/`NodeRole`/`NodeHostname` enrichers
from the same options object as the OTel Resource. Wires `TraceContextEnricher`
(`trace_id`/`span_id` from `Activity.Current`). Preserves `ReadFrom.Configuration` for sinks
and explicit `MinimumLevel.Is` override.
- `ILogRedactor` seam — generalized from MxGateway's `GatewayLogRedactor`. The seam is shared;
the redaction policy (which fields/commands) stays per-project.
- Metric naming convention: `<meter>.<subsystem>.<event>`; Meter name = project namespace
(`ZB.MOM.WW.<ProjectName>`); duration unit = `s` (OTel semconv).
**Left per-project (not forced together):**
- Application `Meter`, `ActivitySource`, and all instrument definitions — `otopcua.*`,
`mxgateway.*`, `scadabridge.*` instruments are owned by each repo.
- Serilog sink configuration (`appsettings.json` Console/File templates, rolling intervals).
- Per-operation/per-session correlation enrichers (`LogContextEnricher` in OtOpcUa;
`LogContext.PushProperty` scope in MxGateway after migration).
- Redaction policies (`GatewayLogRedactor` implements `ILogRedactor` with gateway-specific
command/field rules).
- Config section paths for `SiteId`/`NodeRole`/`NodeHostname` — each project binds these from
its own config hierarchy and passes the resolved values to `AddZbTelemetry`/`AddZbSerilog`.
## Package structure
`ZB.MOM.WW.Telemetry` ships as two dependency-split packages:
| Package | Contents | Consumers |
|---|---|---|
| `ZB.MOM.WW.Telemetry` | `AddZbTelemetry`, `ZbTelemetryOptions`, Resource builder, standard instrumentation, Prometheus/OTLP exporters, `app.MapZbMetrics()` | All three |
| `ZB.MOM.WW.Telemetry.Serilog` | `AddZbSerilog`, shared enrichers (`SiteId`/`NodeRole`/`NodeHostname`/`TraceContextEnricher`), `ILogRedactor` seam | All three (Serilog users); MxGateway on migration |
Both packages share `ZbTelemetryOptions` as the single options object that drives Resource
attributes, Serilog enrichers, Meter/ActivitySource names, and exporter selection — the unifying
hinge that makes a metric, a span, and a log line from the same node carry identical dimensions.
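
The idea of one options object feeding both sides can be sketched in plain C#. `TelemetryIdentitySketch` is a hypothetical stand-in, not `ZbTelemetryOptions`, whose real shape lives in the shared contract. The attribute and property names come from this document; the values are placeholders.

```csharp
// Sketch only: one identity object projected into OTel Resource
// attributes and Serilog enricher properties.
using System;
using System.Collections.Generic;

var identity = new TelemetryIdentitySketch("mxgateway", "site-01", "gateway");

foreach (var (key, value) in identity.ToResourceAttributes())
    Console.WriteLine($"resource  {key} = {value}");

foreach (var (key, value) in identity.ToLogProperties())
    Console.WriteLine($"log       {key} = {value}");

public sealed record TelemetryIdentitySketch(string ServiceName, string SiteId, string NodeRole)
{
    // Keys follow the spec's Resource attribute names.
    public IReadOnlyDictionary<string, object> ToResourceAttributes() =>
        new Dictionary<string, object>
        {
            ["service.name"] = ServiceName,
            ["site.id"] = SiteId,
            ["node.role"] = NodeRole,
            ["host.name"] = Environment.MachineName,
        };

    // Keys follow the existing ScadaBridge enricher names.
    public IReadOnlyDictionary<string, object> ToLogProperties() =>
        new Dictionary<string, object>
        {
            ["SiteId"] = SiteId,
            ["NodeRole"] = NodeRole,
            ["NodeHostname"] = Environment.MachineName,
        };
}
```

Because both projections read the same fields, the site and role on a log line cannot drift from the ones on the Resource.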
## Component status
**Status: Draft.** Spec and shared-contract written; current-state docs verified; GAPS backlog
populated. Library implementation in progress (`ZB.MOM.WW.Telemetry` — Task #8). MxAccessGateway
MEL → Serilog migration in progress (Task #9, blocked by library build). Adoption by OtOpcUa and
ScadaBridge is follow-on, tracked in [`GAPS.md`](GAPS.md).