diff --git a/docs/plans/2026-06-01-telemetry-library-adoption-design.md b/docs/plans/2026-06-01-telemetry-library-adoption-design.md new file mode 100644 index 0000000..b223929 --- /dev/null +++ b/docs/plans/2026-06-01-telemetry-library-adoption-design.md @@ -0,0 +1,234 @@ +# Adopt `ZB.MOM.WW.Telemetry` across the three sister apps — design + +**Date:** 2026-06-01 +**Status:** Approved (design); implementation plan to follow via writing-plans. +**Scope:** Integrate the built-but-unadopted `ZB.MOM.WW.Telemetry` (+ `.Serilog`) shared library +into all three sister apps — **OtOpcUa**, **MxAccessGateway**, **ScadaBridge** — wiring the shared +OpenTelemetry Resource, standard instrumentation, Prometheus `/metrics`, and the shared Serilog +bootstrap with identity enrichers and trace↔log correlation. + +This is the second full cross-fleet adoption of one of the six shared `ZB.MOM.WW.*` libraries +(after `ZB.MOM.WW.Health`). It follows the adoption backlog in +[`components/observability/GAPS.md`](../../components/observability/GAPS.md), re-verified against +current code on 2026-06-01. + +> **Correction recorded during design:** the library CLAUDE.md and +> [`components/observability/README.md`](../../components/observability/README.md) claim +> *"MxAccessGateway logging adopted (MEL → Serilog migration done on its own branch)."* This is +> **false on `main`** — MxGateway is still MEL-only (no Serilog packages, `GatewayLogScope` / +> `GatewayLogRedactor` still bespoke), and its `MxGateway.Server` meter is **not exported at all** +> (no `AddOpenTelemetry`, no `/metrics`). That branch never landed. This design therefore includes +> the full MxGateway MEL→Serilog migration, and the bookkeeping task corrects the false claim. + +--- + +## 1. Goal & scope + +Wire the two shared packages into all three apps: + +- **`ZB.MOM.WW.Telemetry`** — `AddZbTelemetry(options)`: shared OTel Resource (the identity triple + `service.name` / `site.id` / `node.role` + `service.namespace` / `service.version` / `host.name`), + caller-supplied Meters/ActivitySources, standard instrumentation (ASP.NET Core, HttpClient, gRPC + client, runtime, process), Prometheus always-on exporter (OTLP opt-in), and `app.MapZbMetrics()` + to mount `/metrics`. +- **`ZB.MOM.WW.Telemetry.Serilog`** — `AddZbSerilog(options)`: two-stage Serilog bootstrap, + `ReadFrom.Configuration` sinks, `SiteId`/`NodeRole`/`NodeHostname` enrichers, `TraceContextEnricher` + (writes `trace_id`/`span_id` from `Activity.Current`), and the `ILogRedactor` seam via + `RedactionEnricher`. Uses `preserveStaticLogger: true` so it is test-safe. + +**The headline gap (§1 of GAPS):** *no* app sets a single OTel Resource attribute today, so every +metric and span from every node is indistinguishable in a backend — no service identity, no +site/role topology, no version label. `AddZbTelemetry` closes this for all three at once. This is +the single highest-value observability gap across the fleet. + +**Behaviour-preserving bar** (same as the Health adoption): same log messages at the same levels, +same metric series with the same names and units, same `/metrics` path. New series produced by +standard instrumentation are *additive*. All genuinely breaking items are **deferred** (see §6). + +--- + +## 2. Distribution + +- **Feed:** Gitea NuGet registry `dohertj2-gitea` + (`https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json`). Credentials live + **creds-only at the user level** (`~/.nuget/NuGet/NuGet.Config` ``), + matched by source name — **never committed to any repo**. Already configured during the Health + round; no change needed here. +- **Source-mapping — the two-pattern gotcha (carried from Health):** under + `packageSourceMapping`, the glob `ZB.MOM.WW.Telemetry.*` matches `ZB.MOM.WW.Telemetry.Serilog` + but **not** the bare core id `ZB.MOM.WW.Telemetry`. Each repo therefore needs **both**: + ```xml + + + ``` +- **Per-repo wiring:** + | Repo | CPM? | Change | + |---|---|---| + | OtOpcUa | yes (`Directory.Packages.props`) | add 2 `` @ `0.1.0`; extend existing `NuGet.config` mapping with both Telemetry patterns; add 2 versionless `` to the Host csproj | + | ScadaBridge | yes | add 2 `` @ `0.1.0`; extend existing `nuget.config` mapping; add 2 versionless `` to the Host csproj | + | MxAccessGateway | **no CPM** | add 2 direct versioned `` to the Server csproj; extend its `nuget.config` mapping (the file created during the Health round) | +- **Task 0 (gating, like Health):** the library docs claim these two packages are already on the + feed. **Verify first; pack + push the two `.nupkg`s if missing** — the Health round proved this + claim cannot be trusted. +- **Serilog version floor (Gap V1):** OtOpcUa pins `Serilog.AspNetCore` 9.0.0, ScadaBridge 10.0.0. + Confirm the `.Serilog` package's Serilog dependency floor is satisfied by both (bump if not), and + pick MxGateway's fresh `Serilog.AspNetCore` version to align. + +--- + +## 3. Per-app adoption surface + +### OtOpcUa (`master`) — moderate + +Already has Serilog (inline `UseSerilog`), full OTel, and Prometheus `/metrics`. + +- **Metrics/traces:** replace the hand-rolled + `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs` + (`AddOpenTelemetry().WithMetrics(...AddPrometheusExporter()).WithTracing(...)` + + `MapPrometheusScrapingEndpoint("/metrics")`) with + ```csharp + builder.AddZbTelemetry(o => + { + o.ServiceName = "otopcua"; + o.ServiceVersion = /* AssemblyInformationalVersion */; + o.Meters = ["ZB.MOM.WW.OtOpcUa"]; + o.ActivitySources = ["ZB.MOM.WW.OtOpcUa"]; + // Exporter defaults to Prometheus + }); + // ... + app.MapZbMetrics(); + ``` + **Same meter/source names and same `/metrics` path** → behaviour-preserving; *gains* the Resource + identity + standard instrumentation. (OtOpcUa records spans but has no trace exporter today; + Prometheus is metrics-only, so traces remain a no-op exporter-wise — unchanged. OTLP trace wiring + is deferred, §6.) +- **Logging:** replace the inline + `builder.Host.UseSerilog((ctx, lc) => lc.ReadFrom.Configuration(...).WriteTo.Console().WriteTo.File(...))` + with `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; })`, moving the Console/File sinks + into `appsettings` `Serilog:WriteTo` so `ReadFrom.Configuration` reproduces them. Keep the + existing driver-scope `LogContextEnricher` alongside the shared enrichers. +- **Identity:** `ServiceName="otopcua"`; `SiteId`/`NodeRole` omitted (none in config). + +### ScadaBridge (`main`) — moderate, two composition roots + +Serilog already (via `LoggerConfigurationFactory`); **no OTel at all**; `SiteId` + `NodeRole` +already read from config (`ScadaBridge:Node:*`, `NodeOptions`). + +- **Metrics:** add `builder.AddZbTelemetry(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })` + + `app.MapZbMetrics()` in **both** composition roots — the Central block and the Site block of + `Program.cs` (the same two-root pattern the Health adoption used). `Meters=[]` for now (app + instruments are deferred, §6). Purely additive — no metrics exist today to break. +- **Logging:** replace `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` + + `builder.Host.UseSerilog()` with + `builder.AddZbSerilog(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })` + — its enrichers reproduce the factory's `SiteId`/`NodeRole`/`NodeHostname`. Keep a minimal + `CreateBootstrapLogger()` line for early-startup capture per the library's documented pattern, + then delete `LoggerConfigurationFactory`. Verify the existing sinks are config-driven (`Serilog` + section in `appsettings`) so the swap is byte-equivalent; mirror any code-side sinks into config. + +### MxAccessGateway (`main`) — heaviest (the MEL→Serilog migration) + +MEL-only; custom `MxGateway.Server` meter **not exported**; no `/metrics`. The x86 net48 worker is +a separate process and **out of scope** — telemetry is for the Server. + +- **Logging (MEL → Serilog):** + - Add Serilog packages (`Serilog.AspNetCore` + sinks) to the Server csproj (direct versioned ref). + - Replace the temporary `LoggerFactory.Create(...)` MEL bootstrap in `GatewayApplication.cs` + (and `builder.Logging` config) with `builder.AddZbSerilog(o => { o.ServiceName="mxgateway"; })` + + a `CreateBootstrapLogger()` line. + - `GatewayLogScope` → `Serilog.Context.LogContext.PushProperty(...)`. + - `GatewayLogRedactor` → implement the `ILogRedactor` seam, register in DI (picked up by + `RedactionEnricher`). + - Request-logging middleware → `UseSerilogRequestLogging()` (or keep the middleware but emit via + a Serilog `ILogger`). Sinks to `appsettings`. +- **Metrics:** `builder.AddZbTelemetry(o => { o.ServiceName="mxgateway"; o.Meters=["MxGateway.Server"]; })` + + `app.MapZbMetrics()` → the 20 existing instruments (13 counters, 3 histograms, 4 gauges) finally + export. **Keep the `MxGateway.Server` meter name and the `ms` histogram units** (rename and unit + conversion are deferred, §6). `GetSnapshot()` in-memory read path stays untouched. + +--- + +## 4. Shared seam + +``` + ZbTelemetryOptions (ServiceName / SiteId / NodeRole / Meters / ActivitySources / Exporter) + │ + ┌─────────────────┴──────────────────┐ +AddZbTelemetry (core) AddZbSerilog (.Serilog) +• ZbResource (identity triple) • ReadFrom.Configuration sinks +• app Meters + ActivitySources • SiteId / NodeRole / NodeHostname enrichers +• standard instrumentation • TraceContextEnricher (trace_id / span_id) +• Prometheus always + OTLP opt-in • ILogRedactor seam (RedactionEnricher) + │ │ +app.MapZbMetrics() → /metrics preserveStaticLogger: true (test-safe) +``` + +Both packages share the single `ZbTelemetryOptions`. The Serilog OTLP log sink derives its Resource +attributes from `ZbResource.BuildAttributes` (single source of truth), so logs can never drift from +metrics and traces in a backend. + +--- + +## 5. Sequencing & execution + +Subagent-driven, classification-driven review chain. **Task 0 gates everything** (verify/publish the +feed). Then three **independent** per-repo phases — each its own git repo, branch +**`feat/adopt-zb-telemetry`**, commit per task, **never skip hooks, never force-push**: + +1. **Task 0 (gating):** verify the two Telemetry `.nupkg`s are on the Gitea feed; pack + push if + missing (creds-only user config, already set). +2. **OtOpcUa:** source-mapping + package refs → `AddZbTelemetry` swap → `AddZbSerilog` swap → tests. +3. **ScadaBridge:** source-mapping + package refs → `AddZbTelemetry` (both roots) → `AddZbSerilog` + (replace `LoggerConfigurationFactory`) → tests. +4. **MxAccessGateway:** source-mapping + package refs → **MEL→Serilog** (sub-tasked, `high-risk`) + → `AddZbTelemetry` metrics export → tests. +5. **scadaproj bookkeeping:** add an "Adoption status — DONE" section to + `components/observability/GAPS.md` (per-repo table + deferred items), **and correct the false + "MxGateway logging already adopted" claim** in CLAUDE.md, the library CLAUDE.md, and + `components/observability/README.md`. + +The MxGateway MEL→Serilog migration is the one `high-risk` change (logging behaviour on the most +operational app) and gets the full spec→code serial review chain. The other per-app swaps are +`standard`. + +--- + +## 6. Deferred (out of scope this round; recorded in GAPS) + +| # | Item | Why deferred | +|---|---|---| +| #6 | MxGateway histogram `ms` → `s` | Breaking dashboard/alert change — needs ops coordination | +| #7 | MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | Breaking Prometheus label change — needs ops coordination | +| #9 | ScadaBridge app instruments (`ScadaBridgeTelemetry` + `scadabridge.*`) | Application-specific work, not shared-library adoption | +| #10 | OtOpcUa OTLP exporter alongside Prometheus | Opt-in; no consumer for OTLP yet | +| #11 | OtOpcUa trace-export no-op (spans recorded, no exporter) | Resolved by #10 / OTLP; or document | + +None of these block the behaviour-preserving initial adoption. + +--- + +## 7. Testing + +All tests run **offline** — Prometheus is in-process, no OTLP collector required, and the library's +own test suites are network-free. + +- **OtOpcUa:** assert `/metrics` is still served, the `ZB.MOM.WW.OtOpcUa` meter is present, the + Resource carries `service.name`, and the shared Serilog enrichers are wired. +- **ScadaBridge:** assert `/metrics` is served in **both** roles, the logger carries + `SiteId`/`NodeRole` enrichers, and startup is clean after `LoggerConfigurationFactory` removal. +- **MxAccessGateway** (the careful one): assert log messages are still emitted at the same levels, + redaction still applies, request logging still fires, `/metrics` is now served, and the + `GetSnapshot()` path is unchanged — using the existing fake-worker test harness (no MXAccess + needed). + +--- + +## 8. Acceptance bar + +- Each app builds and its test suite is green. +- `/metrics` serves the same existing series (plus additive standard-instrumentation series); meter + names and units unchanged. +- Logs carry the same messages at the same levels, plus the shared identity enrichers and + `trace_id`/`span_id` correlation. +- No secrets committed to any repo (the Gitea token stays creds-only at the user level). +- `components/observability/GAPS.md` updated; the false "MxGateway logging adopted" claim corrected.