From fba3d09eed2707a467bd779725420ad5042f0df6 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 1 Jun 2026 07:23:08 -0400
Subject: [PATCH] docs(observability): current-state x3 + GAPS + README
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Complete the observability normalization component docs:

- components/observability/current-state/otopcua/CURRENT-STATE.md — full OTel
  SDK (metrics + tracing) + Prometheus; 7 otopcua.* instruments + 2 spans;
  Serilog with driver-scope LogContextEnricher; no Resource/service.name
  anywhere; tracing pipeline wired but no exporter; adoption plan:
  AddZbTelemetry gains shared Resource + trace↔log correlation;
  LogContextEnricher kept bespoke.
- components/observability/current-state/mxaccessgw/CURRENT-STATE.md — 20
  hand-rolled instruments (13 counters, 3 histograms ms-unit, 4 gauges) in
  GatewayMetrics.cs; no OTel SDK → metrics never export; MEL logging with
  GatewayLogScope correlation and GatewayLogRedactor; adoption plan: in-pass
  MEL → AddZbSerilog migration (LogContext correlation, ILogRedactor seam) +
  AddZbTelemetry wires OTel SDK so GatewayMetrics finally exports.
- components/observability/current-state/scadabridge/CURRENT-STATE.md —
  OpenTelemetry.Api is a CVE-patch override only (zero instrumentation);
  Serilog with SiteId/NodeRole/NodeHostname enrichers (strongest set in
  family); adoption plan: replace CVE ref with AddZbTelemetry; adopt
  AddZbSerilog (LoggerConfigurationFactory deleted); add first scadabridge.*
  instruments.
- components/observability/GAPS.md — divergence table across §1 Resource (P1,
  nobody), §2 metrics export (P1, MxGateway invisible), §3 MxGateway
  MEL→Serilog (P1, in-pass done), §4 trace↔log correlation, §5 ms→s unit, §6
  Meter naming, §7 standard instrumentation, §8 Serilog version, §9
  ScadaBridge zero instrumentation; 11-item prioritized backlog.
- components/observability/README.md — overview, per-project status table
  (OTel today / metrics / tracing / logging / enrichers / adoption status),
  normalized vs. left-per-project boundary, 2-package structure, component
  status.
---
 components/observability/GAPS.md   | 177 +++++++++++++++++++++++++++++
 components/observability/README.md | 106 +++++++++++++++++
 2 files changed, 283 insertions(+)
 create mode 100644 components/observability/GAPS.md
 create mode 100644 components/observability/README.md

diff --git a/components/observability/GAPS.md b/components/observability/GAPS.md
new file mode 100644
index 0000000..8150ccc
--- /dev/null
+++ b/components/observability/GAPS.md
@@ -0,0 +1,177 @@
+# Observability — gaps & adoption backlog
+
+Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
+reach the shared `ZB.MOM.WW.Telemetry` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
+
+## Divergence vs spec
+
+### §1 OTel Resource / `service.name` (P1 — nobody has it)
+
+| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| `service.name` | ⛔ not set | ⛔ not set | ⛔ not set |
+| `service.namespace` | ⛔ not set | ⛔ not set | ⛔ not set |
+| `service.version` | ⛔ not set | ⛔ not set | ⛔ not set |
+| `site.id` | ⛔ not set | ⛔ not set | ⛔ not set |
+| `node.role` | ⛔ not set | ⛔ not set | ⛔ not set |
+| `host.name` | ⛔ not set | ⛔ not set | ⛔ not set |
+
+No project configures a `ResourceBuilder` or equivalent. Every metric and span from every node is
+**indistinguishable** in a backend — no service identity, no topology (site/role), no version label.
+This is the single highest-value gap across the fleet; closing it requires only adding
+`AddZbTelemetry` with options.
+
+→ **Gap R1 (P1):** All three projects must call `AddZbTelemetry` with `ServiceName`, `SiteId`,
+  `NodeRole` options to populate the shared Resource. None may do so before the library is available.
+
+### §2 Metrics export (P1 — MxGateway metrics are invisible)
+
+| | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| OTel SDK present | ✅ (`AddOpenTelemetry`) | ⛔ none | ⛔ none (`OpenTelemetry.Api` only) |
+| Meter registered | ✅ `ZB.MOM.WW.OtOpcUa` | 🟡 `MxGateway.Server` (not via OTel SDK) | ⛔ no Meter |
+| Prometheus export | ✅ `/metrics` | ⛔ `GetSnapshot()` only | ⛔ none |
+| OTLP export | ⛔ not available | ⛔ not available | ⛔ not available |
+
+MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data
+that is **never exported**. The data lives only in an in-memory `GetSnapshot()` read path. Adding
+`AddZbTelemetry` with `Meters = ["MxGateway.Server"]` closes this gap without touching
+`GatewayMetrics.cs`.
+
+→ **Gap M1 (P1):** MxGateway: wire OTel SDK via `AddZbTelemetry` so `GatewayMetrics` exports.
+  This is one half of the in-pass adoption (logging migration is the other).
+→ **Gap M2:** ScadaBridge: define a `ScadaBridgeTelemetry` class and first application instruments
+  (`scadabridge.*`); register via `AddZbTelemetry`. Currently zero instrumentation.
+
+### §3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)
+
+| | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| Logging framework | ✅ Serilog | ⛔ MEL only | ✅ Serilog |
+| Structural enrichers | 🟡 driver-scope only | ⛔ MEL scope (provider-dependent) | ✅ `SiteId`/`NodeRole`/`NodeHostname` |
+| Correlation mechanism | `LogContextEnricher.Push` | `GatewayLogScope` + `BeginScope` | structural enrichers |
+| Log redaction | ⛔ none | ✅ `GatewayLogRedactor` | ⛔ none |
+| `ILogRedactor` seam | ⛔ none | ⛔ bespoke | ⛔ none |
+
+**Status: done in this task (Task #9).** The MxGateway logging migration — MEL → `AddZbSerilog`,
+`GatewayLogScope` → `LogContext.PushProperty`, `GatewayLogRedactor` → `ILogRedactor` seam — is
+executed as part of the `ZB.MOM.WW.Telemetry` library build. See
+[`current-state/mxaccessgw/CURRENT-STATE.md`](current-state/mxaccessgw/CURRENT-STATE.md) adoption
+plan for the exact changes.
+
+→ **Gap L1 (in-pass, done):** MxGateway MEL → `AddZbSerilog` + `LogContext` correlation + `ILogRedactor`.
+
+### §4 Trace↔log correlation (nobody has it)
+
+| | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| `trace_id`/`span_id` in logs | ⛔ absent | ⛔ absent (no spans) | ⛔ absent (no spans) |
+| Mechanism | `TraceContextEnricher` not wired | n/a | n/a |
+
+OtOpcUa creates spans (`otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`) but never
+pushes `Activity.Current`'s trace context onto Serilog's `LogContext`. A log emitted inside a span
+cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.
+
+The shared `TraceContextEnricher` (part of `ZB.MOM.WW.Telemetry.Serilog`) closes this for all
+three projects on `AddZbSerilog` adoption — it is wired automatically.
+
+→ **Gap C1:** OtOpcUa: adopt `AddZbSerilog` to wire `TraceContextEnricher`; spans and logs become
+  joinable once the enricher is active.
+→ **Gap C2:** ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
+
+### §5 Duration unit: `ms` vs `s`
+
+| | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| Histogram unit | ✅ `s` | ⛔ `ms` (`workers.startup.duration`, `commands.duration`, `events.stream_send.duration`) | n/a (no histograms) |
+
+MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use
+seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate
+scaling rules without normalization.
+
+→ **Gap U1:** MxGateway: convert the three histogram call sites to record in seconds (multiply by
+  `0.001` or redefine the instruments). This is a **breaking change to existing dashboards** and is
+  tagged as a convergence item separate from the initial adoption.
+
+### §6 Meter naming: `MxGateway.Server` vs namespace convention
+
+| | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| Meter name | ✅ `ZB.MOM.WW.OtOpcUa` | ⛔ `MxGateway.Server` | n/a |
+
+The fleet convention (per `spec/SPEC.md`) is `` — OtOpcUa uses
+`ZB.MOM.WW.OtOpcUa`; the gateway's meter should be `ZB.MOM.WW.MxGateway`. `MxGateway.Server` is
+the assembly name, not the namespace, and does not carry the `ZB.MOM.WW` product prefix.
+
+→ **Gap N1:** MxGateway: rename Meter from `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"`.
+  This is a **Prometheus metric label change** that breaks dashboards/alerts and is tracked
+  separately from the initial adoption.
+
+### §7 Standard instrumentation (nobody has the full set)
+
+| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| ASP.NET Core request metrics | ⛔ not added | ⛔ not added | ⛔ not added |
+| `HttpClient` metrics | ⛔ not added | ⛔ not added | ⛔ not added |
+| Runtime / process metrics | ⛔ not added | ⛔ not added | ⛔ not added |
+| gRPC client metrics | ⛔ not added | ⛔ not added | n/a |
+
+`AddZbTelemetry` enables these via `AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`,
+`AddRuntimeInstrumentation`, and `AddProcessInstrumentation` by default — all three projects gain
+them on `AddZbTelemetry` adoption without code changes.
+
+→ **Gap S1:** all three projects lack standard instrumentation; closed automatically by
+  `AddZbTelemetry` adoption.
+
+### §8 Serilog version split (`Serilog.AspNetCore` 8 vs 9)
+
+OtOpcUa and ScadaBridge use `Serilog.AspNetCore` but may be on different versions due to their
+independent csproj updates. The shared `ZB.MOM.WW.Telemetry.Serilog` package must declare a
+`Serilog.AspNetCore` version floor that works for both. Verify version alignment on adoption.
+
+→ **Gap V1:** confirm `Serilog.AspNetCore` version compatibility across all three projects on
+  `AddZbSerilog` adoption; align if diverged.
+
+### §9 ScadaBridge: zero instrumentation
+
+ScadaBridge has no application instruments today. The `OpenTelemetry.Api` ref is a CVE patch, not
+instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.
+
+→ **Gap I1:** ScadaBridge: define `ScadaBridgeTelemetry` with a `ZB.MOM.WW.ScadaBridge` Meter and
+  a first set of `scadabridge.*` instruments. Tracked as a follow-on (application-specific work,
+  not shared-library work).
+
+## Adoption backlog (ordered)
+
+| # | Item | Projects | Priority | Effort | Risk | Notes |
+|---|---|---|---|---|---|---|
+| 1 | MxGateway MEL → `AddZbSerilog`: `LogContext` correlation + `ILogRedactor` (Gap L1) | MxGateway | P1 | M | Low | **In-pass, done in Task #9** — unblocked by library build |
+| 2 | MxGateway: wire OTel SDK via `AddZbTelemetry`; `GatewayMetrics` begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
+| 3 | All: `AddZbTelemetry` with `ServiceName`/`SiteId`/`NodeRole` → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
+| 4 | OtOpcUa: adopt `AddZbSerilog` + `TraceContextEnricher` (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep `LogContextEnricher`; add shared enrichers alongside |
+| 5 | ScadaBridge: adopt `AddZbSerilog`; replace `LoggerConfigurationFactory` (Gap C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; `LoggerConfigurationFactory` deleted |
+| 6 | MxGateway: histogram `ms` → `s` conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
+| 7 | MxGateway: rename Meter `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"` (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
+| 8 | All: standard instrumentation via `AddZbTelemetry` options (Gap S1) | all 3 | P2 | XS | Low | Automatic on `AddZbTelemetry` adoption; no extra code |
+| 9 | ScadaBridge: define `ScadaBridgeTelemetry` + first `scadabridge.*` instruments (Gap I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
+| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (Gap M2) | OtOpcUa | P3 | S | Low | Opt-in via `AddZbTelemetry` options; no code rewrite |
+| 11 | All: fix the tracing no-op in OtOpcUa (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |
+
+**Sequencing:** Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5
+are the follow-on OtOpcUa/ScadaBridge adoptions — they are independent of each other and can land
+in either order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops
+coordination; defer until dashboards can be updated. Items #8–#11 are cleanups that bundle
+naturally with #3–#5.
+
+This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each
+app is opt-in and tracked here, not forced.
+
+## Decisions still open
+
+- Whether `AddZbTelemetry` enables OTLP by default (simplest for new setups) or Prometheus by
+  default (matches OtOpcUa's current posture). Design doc says Prometheus default; OTLP opt-in.
+- Whether the `ms` → `s` conversion and Meter rename are bundled with the initial MxGateway
+  adoption (Task #9) or deferred. Deferring avoids dashboard breaks during the migration window.
+- Canonical `SiteId` and `NodeRole` config binding path — ScadaBridge reads from its own config
+  hierarchy; `AddZbSerilog` must accept the value directly (caller-supplied) rather than reading
+  from a fixed config section, to remain project-agnostic.
diff --git a/components/observability/README.md b/components/observability/README.md
new file mode 100644
index 0000000..8b9e648
--- /dev/null
+++ b/components/observability/README.md
@@ -0,0 +1,106 @@
+# Observability (metrics / traces / logs)
+
+Third normalized component under the operability cluster. **Goal: path to shared code** — converge
+the three sister projects onto a common OpenTelemetry Resource, a shared Serilog bootstrap with
+unified enrichers, and a trace↔log correlation bridge, proposed as the `ZB.MOM.WW.Telemetry`
+library set (2 packages), while each project keeps its own application instruments and sink
+configuration.
+
+- The one target: [`spec/SPEC.md`](spec/SPEC.md)
+- Metric naming reference: [`spec/METRIC-CONVENTIONS.md`](spec/METRIC-CONVENTIONS.md)
+- The proposed shared library: [`shared-contract/ZB.MOM.WW.Telemetry.md`](shared-contract/ZB.MOM.WW.Telemetry.md)
+- Divergences + backlog: [`GAPS.md`](GAPS.md)
+- Current state, per project: [`current-state/`](current-state/)
+
+## Why observability is a strong normalization candidate
+
+All three projects instrument something — but in three completely different ways and at three very
+different levels of completeness. The divergences are structural:
+
+- **OtOpcUa** has the full OpenTelemetry SDK (metrics + tracing), Prometheus export, and a bespoke
+  Serilog enricher for driver-lifecycle correlation — but no Resource (`service.name` is never set)
+  and no trace↔log bridge.
+- **MxAccessGateway** has 20 hand-rolled instruments (counters, histograms, gauges) recording real
+  production data — that never leave the process. No OTel SDK, no exporter, no tracing. Logging
+  uses Microsoft.Extensions.Logging rather than Serilog, with a bespoke correlation-scope and
+  redaction pipeline.
+- **ScadaBridge** has zero application instruments. Its `OpenTelemetry.Api` reference is a CVE
+  patch, not instrumentation. It does have the cleanest structured logging enricher set
+  (`SiteId`/`NodeRole`/`NodeHostname`) — but those properties exist only in Serilog, not in the
+  OTel Resource, so logs and metrics cannot join in a backend.
+
+Nobody sets a Resource. Nobody does trace↔log correlation. MxGateway's metrics are invisible.
+ScadaBridge has no metrics at all.
+
+The common fix is a single `AddZbTelemetry(options)` call that: creates a shared Resource from a
+`service.name`/`site.id`/`node.role` options object; registers the project's own Meter/ActivitySource
+names with the OTel SDK; and exposes Prometheus `/metrics`. A companion `AddZbSerilog(options)` wires
+Serilog with the same options as enricher properties and adds `TraceContextEnricher` so logs carry
+`trace_id`/`span_id`. The unifying hinge: the same identity triple (`service.name`/`site.id`/
+`node.role`) populates both the OTel Resource and the Serilog enrichers, so a metric, a span, and
+a log line from the same node carry identical dimensions and join up in a backend.
+
+One adoption happens **in this task**: MxAccessGateway migrates off MEL onto `AddZbSerilog`. All
+other app wiring is follow-on, consistent with how Auth and UI-Theme are structured.
+
+## Status by project
+
+| Project | OTel SDK today | Metrics today | Tracing today | Logging today | Enrichers today | Adoption status |
+|---|---|---|---|---|---|---|
+| **OtOpcUa** | ✅ full SDK (`WithMetrics`+`WithTracing`) | ✅ 7 instruments (`otopcua.*`); Prometheus `/metrics` | 🟡 2 spans defined; no exporter | Serilog (Console+File) | `DriverInstanceId`/`DriverType`/`CapabilityName`/`CorrelationId` (driver-scope) | Not started (follow-on) |
+| **MxAccessGateway** | ⛔ none (hand-rolled `Meter`) | 🟡 20 instruments (`mxgateway.*`); **never exported** | ⛔ none | MEL → **migrating to Serilog in this task** | `SessionId`/`WorkerProcessId`/`CorrelationId`/`CommandMethod` (MEL scope) | **In progress (Task #9)** |
+| **ScadaBridge** | ⛔ (`OpenTelemetry.Api` CVE-patch only) | ⛔ zero instruments | ⛔ none | Serilog (Console+File) | `SiteId`/`NodeRole`/`NodeHostname` (process-level; strongest set) | Not started (follow-on) |
+
+See each project's [`current-state//CURRENT-STATE.md`](current-state/) for the
+code-verified detail and its adoption plan.
+
+## Normalized vs. left per-project
+
+**Normalized (the shared target):**
+
+- `AddZbTelemetry(ZbTelemetryOptions)` — front door for the OTel SDK. Populates the shared
+  Resource (`service.name`, `service.namespace`, `service.version`, `site.id`, `node.role`,
+  `host.name`). Registers the caller-supplied Meter and ActivitySource name(s). Wires standard
+  instrumentation (ASP.NET Core, HttpClient, runtime, process). Prometheus default; OTLP opt-in.
+- `app.MapZbMetrics()` — maps the Prometheus `/metrics` endpoint (shared path + shared exporter).
+- `AddZbSerilog(ZbTelemetryOptions)` — shared Serilog two-stage bootstrap generalizing
+  ScadaBridge's `LoggerConfigurationFactory`. Wires `SiteId`/`NodeRole`/`NodeHostname` enrichers
+  from the same options object as the OTel Resource. Wires `TraceContextEnricher`
+  (`trace_id`/`span_id` from `Activity.Current`). Preserves `ReadFrom.Configuration` for sinks
+  and explicit `MinimumLevel.Is` override.
+- `ILogRedactor` seam — generalized from MxGateway's `GatewayLogRedactor`. The seam is shared;
+  the redaction policy (which fields/commands) stays per-project.
+- Metric naming convention: `..`; Meter name = project namespace
+  (`ZB.MOM.WW.`); duration unit = `s` (OTel semconv).
+
+**Left per-project (not forced together):**
+
+- Application `Meter`, `ActivitySource`, and all instrument definitions — `otopcua.*`,
+  `mxgateway.*`, `scadabridge.*` instruments are owned by each repo.
+- Serilog sink configuration (`appsettings.json` Console/File templates, rolling intervals).
+- Per-operation/per-session correlation enrichers (`LogContextEnricher` in OtOpcUa;
+  `LogContext.PushProperty` scope in MxGateway after migration).
+- Redaction policies (`MxGatewayLogRedactor` implements `ILogRedactor` with gateway-specific
+  command/field rules).
+- Config section paths for `SiteId`/`NodeRole`/`NodeHostname` — each project binds these from
+  its own config hierarchy and passes the resolved values to `AddZbTelemetry`/`AddZbSerilog`.
+
+## Package structure
+
+`ZB.MOM.WW.Telemetry` ships as two dependency-split packages:
+
+| Package | Contents | Consumers |
+|---|---|---|
+| `ZB.MOM.WW.Telemetry` | `AddZbTelemetry`, `ZbTelemetryOptions`, Resource builder, standard instrumentation, Prometheus/OTLP exporters, `app.MapZbMetrics()` | All three |
+| `ZB.MOM.WW.Telemetry.Serilog` | `AddZbSerilog`, shared enrichers (`SiteId`/`NodeRole`/`NodeHostname`/`TraceContextEnricher`), `ILogRedactor` seam | All three (Serilog users); MxGateway on migration |
+
+Both packages share `ZbTelemetryOptions` as the single options object that drives Resource
+attributes, Serilog enrichers, Meter/ActivitySource names, and exporter selection — the unifying
+hinge that makes a metric, a span, and a log line from the same node carry identical dimensions.
+
+## Component status
+
+**Status: Draft.** Spec and shared-contract written; current-state docs verified; GAPS backlog
+populated. Library implementation in progress (`ZB.MOM.WW.Telemetry` — Task #8). MxAccessGateway
+MEL → Serilog migration in progress (Task #9, blocked by library build). Adoption by OtOpcUa and
+ScadaBridge is follow-on, tracked in [`GAPS.md`](GAPS.md).
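
Gap U1 and the README's `s` duration convention come down to recording elapsed time as seconds on a histogram whose declared unit is `s`. The following is a minimal sketch using only `System.Diagnostics.Metrics` from the .NET base library. The meter name `Example.Gateway` and the instrument name `example.commands.duration` are hypothetical placeholders, not the names used in `GatewayMetrics.cs`, and the sleep stands in for real work:

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Threading;

// Hypothetical meter and instrument names, for illustration only.
using var meter = new Meter("Example.Gateway");

// Declare the unit as seconds so the instrument metadata matches the recorded values.
Histogram<double> commandDuration = meter.CreateHistogram<double>(
    "example.commands.duration",
    unit: "s",
    description: "Command duration in seconds");

var stopwatch = Stopwatch.StartNew();
Thread.Sleep(25); // stand-in for the timed operation
stopwatch.Stop();

// Record seconds directly instead of recording milliseconds and scaling later.
commandDuration.Record(stopwatch.Elapsed.TotalSeconds);

Console.WriteLine($"recorded {stopwatch.Elapsed.TotalSeconds:F3} s");
```

Recording `Elapsed.TotalSeconds` at the call site avoids a separate `* 0.001` step. Whether the existing instruments are converted in place or redefined, and when, remains the open decision listed in GAPS.md.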