# Observability — gaps & adoption backlog
Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Telemetry` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
## Divergence vs spec
### §1 OTel Resource / `service.name` (P1 — nobody has it)
| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `service.name` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.namespace` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.version` | ⛔ not set | ⛔ not set | ⛔ not set |
| `site.id` | ⛔ not set | ⛔ not set | ⛔ not set |
| `node.role` | ⛔ not set | ⛔ not set | ⛔ not set |
| `host.name` | ⛔ not set | ⛔ not set | ⛔ not set |
No project configures a `ResourceBuilder` or equivalent. Every metric and span from every node is
**indistinguishable** in a backend — no service identity, no topology (site/role), no version label.
This is the single highest-value gap across the fleet; closing it requires only adding
`AddZbTelemetry` with options.
**Gap R1 (P1):** All three projects must call `AddZbTelemetry` with the `ServiceName`, `SiteId`, and
`NodeRole` options to populate the shared Resource. None can do so until the library is available.
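
For orientation, the Resource described in the table can be sketched directly against the OpenTelemetry .NET SDK. This is a minimal sketch, not the shared library's implementation: the `FleetResource.Build` helper, the `"ZB.MOM.WW"` namespace value, and the use of `Environment.MachineName` for `host.name` are assumptions made for illustration. Only `ResourceBuilder`, `AddService`, and `AddAttributes` are the SDK's own calls.

```csharp
using System;
using System.Collections.Generic;
using OpenTelemetry.Resources;

// Hypothetical helper (not part of ZB.MOM.WW.Telemetry): builds the six
// spec attributes from caller-supplied values.
public static class FleetResource
{
    public static Resource Build(
        string serviceName, string serviceVersion, string siteId, string nodeRole)
    {
        return ResourceBuilder.CreateDefault()
            // Sets service.name, service.namespace and service.version.
            .AddService(
                serviceName: serviceName,
                serviceNamespace: "ZB.MOM.WW", // assumed value
                serviceVersion: serviceVersion)
            // Topology attributes from the spec table.
            .AddAttributes(new KeyValuePair<string, object>[]
            {
                new("site.id", siteId),
                new("node.role", nodeRole),
                new("host.name", Environment.MachineName), // assumed source
            })
            .Build();
    }
}
```

With a Resource like this attached, every metric and span carries the service identity and topology labels that are missing today.
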
### §2 Metrics export (P1 — MxGateway metrics are invisible)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK present | ✅ (`AddOpenTelemetry`) | ⛔ none | ⛔ none (`OpenTelemetry.Api` only) |
| Meter registered | ✅ `ZB.MOM.WW.OtOpcUa` | 🟡 `MxGateway.Server` (not via OTel SDK) | ⛔ no Meter |
| Prometheus export | ✅ `/metrics` | ⛔ `GetSnapshot()` only | ⛔ none |
| OTLP export | ⛔ not available | ⛔ not available | ⛔ not available |
MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data
that is **never exported**. The data lives only in an in-memory `GetSnapshot()` read path. Adding
`AddZbTelemetry` with `Meters = ["MxGateway.Server"]` closes this gap without touching
`GatewayMetrics.cs`.
**Gap M1 (P1):** MxGateway: wire OTel SDK via `AddZbTelemetry` so `GatewayMetrics` exports.
This is one half of the in-pass adoption (logging migration is the other).
**Gap M2:** ScadaBridge: define a `ScadaBridgeTelemetry` class and first application instruments
(`scadabridge.*`); register via `AddZbTelemetry`. Currently zero instrumentation.
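
For Gap M1, the reason `GatewayMetrics.cs` can stay untouched is that the OTel SDK subscribes to an existing `Meter` by name. The following is a rough sketch using the plain OpenTelemetry .NET hosting and Prometheus ASP.NET Core packages. It is an assumption about what `AddZbTelemetry` does internally, not a description of it.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        // Subscribe the SDK to the Meter that GatewayMetrics already creates.
        .AddMeter("MxGateway.Server")
        // Make the collected instruments available for Prometheus scraping.
        .AddPrometheusExporter());

var app = builder.Build();

// Serves the scrape endpoint (path defaults to /metrics).
app.MapPrometheusScrapingEndpoint();

app.Run();
```

Nothing in this sketch creates instruments; it only exports the ones that already record data.
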
### §3 MxGateway logging: MEL → Serilog (P1 — in-pass adoption, done in this task)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | ✅ Serilog | ⛔ MEL only | ✅ Serilog |
| Structural enrichers | 🟡 driver-scope only | ⛔ MEL scope (provider-dependent) | ✅ `SiteId`/`NodeRole`/`NodeHostname` |
| Correlation mechanism | `LogContextEnricher.Push` | `GatewayLogScope` + `BeginScope` | structural enrichers |
| Log redaction | ⛔ none | ✅ `GatewayLogRedactor` | ⛔ none |
| `ILogRedactor` seam | ⛔ none | ⛔ bespoke | ⛔ none |
**Status: done in this task (Task #9).** The MxGateway logging migration (MEL → `AddZbSerilog`,
`GatewayLogScope` → `LogContext.PushProperty`, `GatewayLogRedactor` → `ILogRedactor` seam) is
executed as part of the `ZB.MOM.WW.Telemetry` library build. See the adoption plan in
[`current-state/mxaccessgw/CURRENT-STATE.md`](current-state/mxaccessgw/CURRENT-STATE.md)
for the exact changes.
**Gap L1 (in-pass, done):** MxGateway MEL → `AddZbSerilog` + `LogContext` correlation + `ILogRedactor`.
### §4 Trace↔log correlation (nobody has it)
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `trace_id`/`span_id` in logs | ⛔ absent | ⛔ absent (no spans) | ⛔ absent (no spans) |
| Mechanism | `TraceContextEnricher` not wired | n/a | n/a |
OtOpcUa creates spans (`otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`) but never
pushes `Activity.Current`'s trace context onto Serilog's `LogContext`. A log emitted inside a span
cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.
The shared `TraceContextEnricher` (part of `ZB.MOM.WW.Telemetry.Serilog`) closes this for all
three projects on `AddZbSerilog` adoption — it is wired automatically.
**Gap C1:** OtOpcUa: adopt `AddZbSerilog` to wire `TraceContextEnricher`; spans and logs become
joinable once the enricher is active.
**Gap C2:** ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.
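
The correlation mechanism described in this section can be sketched as a Serilog enricher that copies the ambient `Activity`'s identifiers onto each log event. The class below is a hypothetical stand-in, named `ActivityTraceEnricher` to keep it distinct from the shared `TraceContextEnricher`, whose actual implementation may differ. It assumes the `Serilog` package is referenced.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Hypothetical stand-in for the shared TraceContextEnricher.
public sealed class ActivityTraceEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            return; // No ambient span: leave the event untouched.
        }

        // Property names follow the trace_id/span_id row in the table above.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToString()));
    }
}
```

An enricher of this shape is registered with `new LoggerConfiguration().Enrich.With(new ActivityTraceEnricher())`. It has no effect in projects that create no spans, which is why Gap C2 only becomes effective once spans exist.
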
### §5 Duration unit: `ms` vs `s`
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | ✅ `s` | ⛔ `ms` (`workers.startup.duration`, `commands.duration`, `events.stream_send.duration`) | n/a (no histograms) |
MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use
seconds. Raw values differ by a factor of 1000 — dashboards and SLO alerts would need separate
scaling rules without normalization.
**Gap U1:** MxGateway: convert the three histogram call sites to record in seconds (multiply by
`0.001` or redefine the instruments). This is a **breaking change to existing dashboards** and is
tagged as a convergence item separate from the initial adoption.
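
The unit change in Gap U1 can be illustrated with `System.Diagnostics.Metrics` alone. This is a sketch under stated assumptions: the meter and instrument names (`Demo.Meter`, `demo.commands.duration`) and the `DurationExample` class are invented for illustration and are not the gateway's real call sites.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class DurationExample
{
    // Hypothetical meter and instrument names, for illustration only.
    private static readonly Meter DemoMeter = new("Demo.Meter");

    private static readonly Histogram<double> CommandDuration =
        DemoMeter.CreateHistogram<double>("demo.commands.duration", unit: "s");

    // Converts a millisecond reading to seconds (the 0.001 factor from Gap U1).
    public static double MillisecondsToSeconds(double milliseconds) =>
        milliseconds * 0.001;

    public static void RecordCommand(Action command)
    {
        var stopwatch = Stopwatch.StartNew();
        command();
        stopwatch.Stop();

        // Record elapsed time in seconds instead of milliseconds.
        CommandDuration.Record(stopwatch.Elapsed.TotalSeconds);
    }
}
```

Reading `Elapsed.TotalSeconds` at the call site avoids a separate multiplication step. The `0.001` conversion is only needed where the value is already held in milliseconds.
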
### §6 Meter naming: `MxGateway.Server` vs namespace convention
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ✅ `ZB.MOM.WW.OtOpcUa` | ⛔ `MxGateway.Server` | n/a |
The fleet convention (per `spec/SPEC.md`) is `<project-namespace>` — OtOpcUa uses
`ZB.MOM.WW.OtOpcUa`; the gateway's meter should be `ZB.MOM.WW.MxGateway`. `MxGateway.Server` is
the assembly name, not the namespace, and does not carry the `ZB.MOM.WW` product prefix.
**Gap N1:** MxGateway: rename Meter from `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"`.
This is a **Prometheus metric label change** that breaks dashboards/alerts and is tracked
separately from the initial adoption.
### §7 Standard instrumentation (nobody has the full set)
| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| `HttpClient` metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Runtime / process metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client metrics | ⛔ not added | ⛔ not added | n/a |
`AddZbTelemetry` enables these via `AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`,
`AddRuntimeInstrumentation`, and `AddProcessInstrumentation` by default — all three projects gain
them on `AddZbTelemetry` adoption without code changes.
**Gap S1:** all three projects lack standard instrumentation; closed automatically by
`AddZbTelemetry` adoption.
### §8 Serilog version split (`Serilog.AspNetCore` 8 vs 9)
OtOpcUa and ScadaBridge use `Serilog.AspNetCore` but may be on different versions due to their
independent csproj updates. The shared `ZB.MOM.WW.Telemetry.Serilog` package must declare a
`Serilog.AspNetCore` version floor that works for both. Verify version alignment on adoption.
**Gap V1:** confirm `Serilog.AspNetCore` version compatibility across all three projects on
`AddZbSerilog` adoption; align if diverged.
### §9 ScadaBridge: zero instrumentation
ScadaBridge has no application instruments today. The `OpenTelemetry.Api` ref is a CVE patch, not
instrumentation. No counter, histogram, gauge, or span exists anywhere in the solution.
**Gap I1:** ScadaBridge: define `ScadaBridgeTelemetry` with a `ZB.MOM.WW.ScadaBridge` Meter and
a first set of `scadabridge.*` instruments. Tracked as a follow-on (application-specific work,
not shared-library work).
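
As a rough illustration of what Gap I1 asks for, the sketch below defines one counter and one observable gauge on a `ZB.MOM.WW.ScadaBridge` Meter using only `System.Diagnostics.Metrics`. The class name `ScadaBridgeTelemetrySketch`, the two instrument names, and the method names are assumptions for illustration. The real `ScadaBridgeTelemetry` may be shaped differently.

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

// Illustrative sketch only; not the real ScadaBridgeTelemetry class.
public sealed class ScadaBridgeTelemetrySketch : IDisposable
{
    public const string MeterName = "ZB.MOM.WW.ScadaBridge";

    private readonly Meter _meter = new(MeterName);
    private readonly Counter<long> _deploymentsApplied;
    private long _queueDepth; // Cached value read by the gauge callback.

    public ScadaBridgeTelemetrySketch()
    {
        // Hypothetical instrument names under the scadabridge.* prefix.
        _deploymentsApplied =
            _meter.CreateCounter<long>("scadabridge.deployments.applied");

        // The callback reads a cached field, so a metrics collection
        // never has to touch the underlying queue.
        _meter.CreateObservableGauge<long>(
            "scadabridge.store_and_forward.queue.depth",
            () => Interlocked.Read(ref _queueDepth));
    }

    public long QueueDepth => Interlocked.Read(ref _queueDepth);

    public void DeploymentApplied() => _deploymentsApplied.Add(1);

    public void SetQueueDepth(long depth) =>
        Interlocked.Exchange(ref _queueDepth, depth);

    public void Dispose() => _meter.Dispose();
}
```

Application code calls `DeploymentApplied()` and `SetQueueDepth(...)`. The SDK picks the instruments up once the Meter name is registered for export.
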
## Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → `AddZbSerilog`: `LogContext` correlation + `ILogRedactor` (Gap L1) | MxGateway | P1 | M | Low | **In-pass, done in Task #9** — unblocked by library build |
| 2 | MxGateway: wire OTel SDK via `AddZbTelemetry`; `GatewayMetrics` begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: `AddZbTelemetry` with `ServiceName`/`SiteId`/`NodeRole` → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt `AddZbSerilog` + `TraceContextEnricher` (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep `LogContextEnricher`; add shared enrichers alongside |
| 5 | ScadaBridge: adopt `AddZbSerilog`; replace `LoggerConfigurationFactory` (Gap C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; `LoggerConfigurationFactory` deleted |
| 6 | MxGateway: histogram `ms` → `s` conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"` (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via `AddZbTelemetry` options (Gap S1) | all 3 | P2 | XS | Low | Automatic on `AddZbTelemetry` adoption; no extra code |
| 9 | ScadaBridge: define `ScadaBridgeTelemetry` + first `scadabridge.*` instruments (Gaps M2, I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (§2 OTLP export) | OtOpcUa | P3 | S | Low | Opt-in via `AddZbTelemetry` options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |
**Sequencing:** Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5
are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land
in any order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops
coordination; defer them until dashboards can be updated. Items #8–#11 are cleanups that bundle
naturally with #3–#5.
This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each
app is opt-in and tracked here, not forced.
## Decisions still open
- Canonical `SiteId` and `NodeRole` config binding path — ScadaBridge reads from its own config
hierarchy; `AddZbSerilog` must accept the value directly (caller-supplied) rather than reading
from a fixed config section, to remain project-agnostic.
## Decisions settled (no longer open)
- **Prometheus vs OTLP default (SETTLED):** `AddZbTelemetry` defaults to Prometheus (matching
OtOpcUa's existing `/metrics` posture). OTLP is opt-in via `ZbTelemetryOptions.Exporter =
ZbExporter.Otlp`. See `spec/SPEC.md` §4 and shared-contract `ZbTelemetryOptions.Exporter`.
- **`ms` → `s` conversion and Meter rename bundling (SETTLED, deferred):** Both the histogram
unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway
adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and
are tracked as separate backlog items #6 and #7 in the adoption backlog above.
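
The settled default can be sketched as a small resolution rule: an unset or unrecognised value falls back to Prometheus, and OTLP must be requested explicitly. The enum below is a local stand-in for the library's `ZbExporter` (its member order is assumed), and `ExporterSelection.Resolve` is a hypothetical helper, not a library API.

```csharp
using System;

// Local stand-in for the library's ZbExporter; member order is assumed.
public enum ZbExporter { Prometheus, Otlp }

public static class ExporterSelection
{
    // Hypothetical helper: maps a raw config value to an exporter,
    // falling back to Prometheus when the value is missing or unrecognised.
    public static ZbExporter Resolve(string? configured)
    {
        return Enum.TryParse<ZbExporter>(configured, ignoreCase: true, out var parsed)
               && Enum.IsDefined(typeof(ZbExporter), parsed)
            ? parsed
            : ZbExporter.Prometheus;
    }
}
```

The `Enum.IsDefined` check matters because `Enum.TryParse` also accepts numeric strings that do not correspond to any named member.
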
## Adoption status — 2026-06-01 (DONE)
`ZB.MOM.WW.Telemetry` + `ZB.MOM.WW.Telemetry.Serilog` (`0.1.0`) were adopted across **all three**
sister apps in one pass, behaviour-preserving. Each adoption landed on a per-repo branch
`feat/adopt-zb-telemetry` (one commit per task). Plan + design:
[`docs/plans/2026-06-01-telemetry-library-adoption.md`](../../docs/plans/2026-06-01-telemetry-library-adoption.md).
> **Correction:** the prior claim that *"MxAccessGateway logging was adopted (MEL → Serilog) on its
> own branch"* was **false on `main`** — MxGateway was still MEL-only, and its `MxGateway.Server`
> meter was never exported. The full MEL→Serilog migration **and** the metrics export both landed
> in this 2026-06-01 pass.
| Repo | `AddZbTelemetry` (Resource + std instrumentation + Prometheus) | `/metrics` | Logging | Meter (unchanged) |
|---|---|---|---|---|
| **OtOpcUa** | ✅ replaced hand-rolled `ObservabilityExtensions` | ✅ `/metrics` (path unchanged) | ✅ `AddZbSerilog` (sinks moved to `appsettings`; `LogContextEnricher` kept) | `ZB.MOM.WW.OtOpcUa` |
| **ScadaBridge** | ✅ added in `BindSharedOptions` (both Central + Site roots) | ✅ Central; mapped on Site too (see follow-on) | ⚠️ **kept `LoggerConfigurationFactory`** + added shared `TraceContextEnricher` — did **not** adopt `AddZbSerilog` | (none yet; #9) |
| **MxAccessGateway** | ✅ exports existing `GatewayMetrics` | ✅ new `/metrics` | ✅ MEL→`AddZbSerilog`; `GatewayLogRedactor` exposed via `ILogRedactor` seam (`GatewayLogRedactorSeam`); `GatewayLogScope`/middleware kept as-is | `MxGateway.Server` (name + `ms` units unchanged) |
### Accepted scope decisions (deviations from the original backlog)
- **ScadaBridge keeps `LoggerConfigurationFactory` (backlog #5 revised).** The factory implements a
documented governance contract (REQ-HOST-8 / Host-011/014/020/022): `ScadaBridge:Logging:MinimumLevel`
is the floor and **overrides** `Serilog:MinimumLevel`, with operator warnings. `AddZbSerilog`
hard-codes `MinimumLevel.Is(Information)` before `ReadFrom.Configuration`, which would invert that
precedence and silently drop the knob. So ScadaBridge keeps the factory and only **adds the shared
`TraceContextEnricher`** to it — gaining trace↔log correlation without regressing the contract. Full
`AddZbSerilog` adoption for ScadaBridge would first require teaching the shared bootstrap to accept a
caller-supplied minimum-level governance hook.
- **MxGateway keeps `GatewayLogScope` + request-logging middleware as-is.** The Serilog MEL provider
captures MEL `BeginScope` dictionaries as structured properties, so the scope/correlation code keeps
producing the same properties under Serilog. Only the provider swap + the `ILogRedactor` adapter were
needed.
## Follow-ons — DONE 2026-06-01
All the deferred follow-ons were then executed (branch `feat/telemetry-followons` per repo,
behaviour-preserving except the intentional, no-consumer-yet metric-shape change in #6/#7). Plan:
[`docs/plans/2026-06-01-telemetry-followons.md`](../../docs/plans/2026-06-01-telemetry-followons.md).
| Item | Status | What landed |
|---|---|---|
| **#6** MxGateway histogram `ms` → `s` | ✅ | 3 histograms record `.TotalSeconds`, unit `"s"`. Safe: the histograms were never Prometheus-exported before, so no dashboards broke. |
| **#7** Meter rename → `ZB.MOM.WW.MxGateway` | ✅ | `GatewayMetrics.MeterName` renamed; `docs/Metrics.md` synced. |
| **#9** ScadaBridge app instruments | ✅ | `ScadaBridgeTelemetry` meter (`ZB.MOM.WW.ScadaBridge`) + first 4: `deployments.applied` (counter), `store_and_forward.queue.depth` (sync-safe cached gauge), `inbound_api.requests` (counter, bounded `method` tag), `site.connection.up` (balanced open/close gauge). |
| **#10/#11** OTLP opt-in | ✅ | All 3 apps read `<App>:Telemetry:Exporter` (`Prometheus`\|`Otlp`) + `:OtlpEndpoint`, default Prometheus. Setting OTLP also exports OtOpcUa's spans (resolves the trace no-op) — once a collector endpoint is configured. |
| **Site-node `/metrics` scrape** | ✅ | ScadaBridge `NodeOptions.MetricsPort` (default **8084**, avoids the site `RemotingPort=8082` collision) + a second `Http1AndHttp2` Kestrel listener on the Site role; `StartupValidator` enforces MetricsPort ≠ Remoting/Grpc. |
| Serilog version drift | ✅ | OtOpcUa `Serilog.AspNetCore`/`.Extensions.Hosting`/`.Settings.Configuration` aligned to `10.0.0` (family-consistent). |
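
The port rule in the site-node row above (MetricsPort ≠ Remoting/Grpc) can be sketched as a fail-fast check. Everything here is illustrative: `NodePorts` is a hypothetical stand-in for the relevant part of `NodeOptions`, the `GrpcPort` property name is assumed, and the real `StartupValidator` may report errors differently.

```csharp
using System;

// Hypothetical stand-in for the port settings on NodeOptions.
public sealed record NodePorts(int MetricsPort, int RemotingPort, int GrpcPort);

public static class PortValidation
{
    // Throws when the metrics listener would collide with another listener.
    public static void Validate(NodePorts ports)
    {
        if (ports.MetricsPort == ports.RemotingPort)
        {
            throw new InvalidOperationException(
                $"MetricsPort {ports.MetricsPort} collides with RemotingPort.");
        }

        if (ports.MetricsPort == ports.GrpcPort)
        {
            throw new InvalidOperationException(
                $"MetricsPort {ports.MetricsPort} collides with GrpcPort.");
        }
    }
}
```

Running a check like this at startup turns a port collision into an immediate configuration error instead of a listener that fails to bind later.
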
**Still open (not code — operational/future):**
- **OTLP is opt-in but unexercised** until an OTel collector endpoint is deployed and the
`<App>:Telemetry:Exporter=Otlp` + `:OtlpEndpoint` config is set. The wiring is in place; only a
collector is missing.
- **Further ScadaBridge instruments** beyond the first 4 are additive future work (not blocking).