docs: design for ZB.MOM.WW.Telemetry adoption across the 3 sister apps

Second cross-fleet shared-library adoption (after Health). Full scope:
AddZbTelemetry (OTel Resource identity triple + standard instrumentation +
Prometheus /metrics) on all 3, plus shared Serilog on all 3 — including the
MxGateway MEL->Serilog migration. Records the correction that MxGateway's
logging was NOT actually adopted on main despite the docs' claim. Behaviour-
preserving bar; breaking items (#6 unit, #7 rename) deferred.
This commit is contained in:
Joseph Doherty
2026-06-01 15:11:50 -04:00
parent 19f7ea5eeb
commit 3729ff2152
@@ -0,0 +1,234 @@
# Adopt `ZB.MOM.WW.Telemetry` across the three sister apps — design
**Date:** 2026-06-01
**Status:** Approved (design); implementation plan to follow via writing-plans.
**Scope:** Integrate the built-but-unadopted `ZB.MOM.WW.Telemetry` (+ `.Serilog`) shared library
into all three sister apps — **OtOpcUa**, **MxAccessGateway**, **ScadaBridge** — wiring the shared
OpenTelemetry Resource, standard instrumentation, Prometheus `/metrics`, and the shared Serilog
bootstrap with identity enrichers and trace↔log correlation.
This is the second full cross-fleet adoption of one of the six shared `ZB.MOM.WW.*` libraries
(after `ZB.MOM.WW.Health`). It follows the adoption backlog in
[`components/observability/GAPS.md`](../../components/observability/GAPS.md), re-verified against
current code on 2026-06-01.
> **Correction recorded during design:** the library CLAUDE.md and
> [`components/observability/README.md`](../../components/observability/README.md) claim
> *"MxAccessGateway logging adopted (MEL → Serilog migration done on its own branch)."* This is
> **false on `main`** — MxGateway is still MEL-only (no Serilog packages, `GatewayLogScope` /
> `GatewayLogRedactor` still bespoke), and its `MxGateway.Server` meter is **not exported at all**
> (no `AddOpenTelemetry`, no `/metrics`). That branch never landed. This design therefore includes
> the full MxGateway MEL→Serilog migration, and the bookkeeping task corrects the false claim.
---
## 1. Goal & scope
Wire the two shared packages into all three apps:
- **`ZB.MOM.WW.Telemetry`** — `AddZbTelemetry(options)`: shared OTel Resource (the identity triple
`service.name` / `site.id` / `node.role` + `service.namespace` / `service.version` / `host.name`),
caller-supplied Meters/ActivitySources, standard instrumentation (ASP.NET Core, HttpClient, gRPC
client, runtime, process), Prometheus always-on exporter (OTLP opt-in), and `app.MapZbMetrics()`
to mount `/metrics`.
- **`ZB.MOM.WW.Telemetry.Serilog`** — `AddZbSerilog(options)`: two-stage Serilog bootstrap,
`ReadFrom.Configuration` sinks, `SiteId`/`NodeRole`/`NodeHostname` enrichers, `TraceContextEnricher`
(writes `trace_id`/`span_id` from `Activity.Current`), and the `ILogRedactor` seam via
`RedactionEnricher`. Uses `preserveStaticLogger: true` so it is test-safe.
**The headline gap (§1 of GAPS):** *no* app sets a single OTel Resource attribute today, so every
metric and span from every node is indistinguishable in a backend — no service identity, no
site/role topology, no version label. `AddZbTelemetry` closes this for all three at once. This is
the single highest-value observability gap across the fleet.
**Behaviour-preserving bar** (same as the Health adoption): same log messages at the same levels,
same metric series with the same names and units, same `/metrics` path. New series produced by
standard instrumentation are *additive*. All genuinely breaking items are **deferred** (see §6).
---
## 2. Distribution
- **Feed:** Gitea NuGet registry `dohertj2-gitea`
(`https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json`). Credentials live
**creds-only at the user level** (`~/.nuget/NuGet/NuGet.Config` `<packageSourceCredentials>`),
matched by source name — **never committed to any repo**. Already configured during the Health
round; no change needed here.
- **Source-mapping — the two-pattern gotcha (carried from Health):** under
`packageSourceMapping`, the glob `ZB.MOM.WW.Telemetry.*` matches `ZB.MOM.WW.Telemetry.Serilog`
but **not** the bare core id `ZB.MOM.WW.Telemetry`. Each repo therefore needs **both**:
```xml
<package pattern="ZB.MOM.WW.Telemetry" />
<package pattern="ZB.MOM.WW.Telemetry.*" />
```
- **Per-repo wiring:**
| Repo | CPM? | Change |
|---|---|---|
| OtOpcUa | yes (`Directory.Packages.props`) | add 2 `<PackageVersion>` @ `0.1.0`; extend existing `NuGet.config` mapping with both Telemetry patterns; add 2 versionless `<PackageReference>` to the Host csproj |
| ScadaBridge | yes | add 2 `<PackageVersion>` @ `0.1.0`; extend existing `nuget.config` mapping; add 2 versionless `<PackageReference>` to the Host csproj |
| MxAccessGateway | **no CPM** | add 2 direct versioned `<PackageReference>` to the Server csproj; extend its `nuget.config` mapping (the file created during the Health round) |
- **Task 0 (gating, like Health):** the library docs claim these two packages are already on the
feed. **Verify first; pack + push the two `.nupkg`s if missing** — the Health round proved this
claim cannot be trusted.
- **Serilog version floor (Gap V1):** OtOpcUa pins `Serilog.AspNetCore` 9.0.0, ScadaBridge 10.0.0.
Confirm the `.Serilog` package's Serilog dependency floor is satisfied by both (bump if not), and
pick MxGateway's fresh `Serilog.AspNetCore` version to align.
---
## 3. Per-app adoption surface
### OtOpcUa (`master`) — moderate
Already has Serilog (inline `UseSerilog`), full OTel, and Prometheus `/metrics`.
- **Metrics/traces:** replace the hand-rolled
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`
(`AddOpenTelemetry().WithMetrics(...AddPrometheusExporter()).WithTracing(...)` +
`MapPrometheusScrapingEndpoint("/metrics")`) with
```csharp
builder.AddZbTelemetry(o =>
{
o.ServiceName = "otopcua";
o.ServiceVersion = /* AssemblyInformationalVersion */;
o.Meters = ["ZB.MOM.WW.OtOpcUa"];
o.ActivitySources = ["ZB.MOM.WW.OtOpcUa"];
// Exporter defaults to Prometheus
});
// ...
app.MapZbMetrics();
```
**Same meter/source names and same `/metrics` path** → behaviour-preserving; *gains* the Resource
identity + standard instrumentation. (OtOpcUa records spans but has no trace exporter today;
Prometheus is metrics-only, so traces remain a no-op exporter-wise — unchanged. OTLP trace wiring
is deferred, §6.)
- **Logging:** replace the inline
`builder.Host.UseSerilog((ctx, lc) => lc.ReadFrom.Configuration(...).WriteTo.Console().WriteTo.File(...))`
with `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; })`, moving the Console/File sinks
into `appsettings` `Serilog:WriteTo` so `ReadFrom.Configuration` reproduces them. Keep the
existing driver-scope `LogContextEnricher` alongside the shared enrichers.
- **Identity:** `ServiceName="otopcua"`; `SiteId`/`NodeRole` omitted (none in config).
### ScadaBridge (`main`) — moderate, two composition roots
Serilog already (via `LoggerConfigurationFactory`); **no OTel at all**; `SiteId` + `NodeRole`
already read from config (`ScadaBridge:Node:*`, `NodeOptions`).
- **Metrics:** add `builder.AddZbTelemetry(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })`
+ `app.MapZbMetrics()` in **both** composition roots — the Central block and the Site block of
`Program.cs` (the same two-root pattern the Health adoption used). `Meters=[]` for now (app
instruments are deferred, §6). Purely additive — no metrics exist today to break.
- **Logging:** replace `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` +
`builder.Host.UseSerilog()` with
`builder.AddZbSerilog(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })`
— its enrichers reproduce the factory's `SiteId`/`NodeRole`/`NodeHostname`. Keep a minimal
`CreateBootstrapLogger()` line for early-startup capture per the library's documented pattern,
then delete `LoggerConfigurationFactory`. Verify the existing sinks are config-driven (`Serilog`
section in `appsettings`) so the swap is byte-equivalent; mirror any code-side sinks into config.
### MxAccessGateway (`main`) — heaviest (the MEL→Serilog migration)
MEL-only; custom `MxGateway.Server` meter **not exported**; no `/metrics`. The x86 net48 worker is
a separate process and **out of scope** — telemetry is for the Server.
- **Logging (MEL → Serilog):**
- Add Serilog packages (`Serilog.AspNetCore` + sinks) to the Server csproj (direct versioned ref).
- Replace the temporary `LoggerFactory.Create(...)` MEL bootstrap in `GatewayApplication.cs`
(and `builder.Logging` config) with `builder.AddZbSerilog(o => { o.ServiceName="mxgateway"; })`
+ a `CreateBootstrapLogger()` line.
- `GatewayLogScope` → `Serilog.Context.LogContext.PushProperty(...)`.
- `GatewayLogRedactor` → implement the `ILogRedactor` seam, register in DI (picked up by
`RedactionEnricher`).
- Request-logging middleware → `UseSerilogRequestLogging()` (or keep the middleware but emit via
a Serilog `ILogger`). Sinks to `appsettings`.
- **Metrics:** `builder.AddZbTelemetry(o => { o.ServiceName="mxgateway"; o.Meters=["MxGateway.Server"]; })`
+ `app.MapZbMetrics()` → the 20 existing instruments (13 counters, 3 histograms, 4 gauges) finally
export. **Keep the `MxGateway.Server` meter name and the `ms` histogram units** (rename and unit
conversion are deferred, §6). `GetSnapshot()` in-memory read path stays untouched.
---
## 4. Shared seam
```
ZbTelemetryOptions (ServiceName / SiteId / NodeRole / Meters / ActivitySources / Exporter)
┌─────────────────┴──────────────────┐
AddZbTelemetry (core) AddZbSerilog (.Serilog)
• ZbResource (identity triple) • ReadFrom.Configuration sinks
• app Meters + ActivitySources • SiteId / NodeRole / NodeHostname enrichers
• standard instrumentation • TraceContextEnricher (trace_id / span_id)
• Prometheus always + OTLP opt-in • ILogRedactor seam (RedactionEnricher)
│ │
app.MapZbMetrics() → /metrics preserveStaticLogger: true (test-safe)
```
Both packages share the single `ZbTelemetryOptions`. The Serilog OTLP log sink derives its Resource
attributes from `ZbResource.BuildAttributes` (single source of truth), so logs can never drift from
metrics and traces in a backend.
---
## 5. Sequencing & execution
Subagent-driven, classification-driven review chain. **Task 0 gates everything** (verify/publish the
feed). Then three **independent** per-repo phases — each its own git repo, branch
**`feat/adopt-zb-telemetry`**, commit per task, **never skip hooks, never force-push**:
1. **Task 0 (gating):** verify the two Telemetry `.nupkg`s are on the Gitea feed; pack + push if
missing (creds-only user config, already set).
2. **OtOpcUa:** source-mapping + package refs → `AddZbTelemetry` swap → `AddZbSerilog` swap → tests.
3. **ScadaBridge:** source-mapping + package refs → `AddZbTelemetry` (both roots) → `AddZbSerilog`
(replace `LoggerConfigurationFactory`) → tests.
4. **MxAccessGateway:** source-mapping + package refs → **MEL→Serilog** (sub-tasked, `high-risk`)
→ `AddZbTelemetry` metrics export → tests.
5. **scadaproj bookkeeping:** add an "Adoption status — DONE" section to
`components/observability/GAPS.md` (per-repo table + deferred items), **and correct the false
"MxGateway logging already adopted" claim** in CLAUDE.md, the library CLAUDE.md, and
`components/observability/README.md`.
The MxGateway MEL→Serilog migration is the one `high-risk` change (logging behaviour on the most
operational app) and gets the full spec→code serial review chain. The other per-app swaps are
`standard`.
---
## 6. Deferred (out of scope this round; recorded in GAPS)
| # | Item | Why deferred |
|---|---|---|
| #6 | MxGateway histogram `ms` → `s` | Breaking dashboard/alert change — needs ops coordination |
| #7 | MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | Breaking Prometheus label change — needs ops coordination |
| #9 | ScadaBridge app instruments (`ScadaBridgeTelemetry` + `scadabridge.*`) | Application-specific work, not shared-library adoption |
| #10 | OtOpcUa OTLP exporter alongside Prometheus | Opt-in; no consumer for OTLP yet |
| #11 | OtOpcUa trace-export no-op (spans recorded, no exporter) | Resolved by #10 / OTLP; or document |
None of these block the behaviour-preserving initial adoption.
---
## 7. Testing
All tests run **offline** — Prometheus is in-process, no OTLP collector required, and the library's
own test suites are network-free.
- **OtOpcUa:** assert `/metrics` is still served, the `ZB.MOM.WW.OtOpcUa` meter is present, the
Resource carries `service.name`, and the shared Serilog enrichers are wired.
- **ScadaBridge:** assert `/metrics` is served in **both** roles, the logger carries
`SiteId`/`NodeRole` enrichers, and startup is clean after `LoggerConfigurationFactory` removal.
- **MxAccessGateway** (the careful one): assert log messages are still emitted at the same levels,
redaction still applies, request logging still fires, `/metrics` is now served, and the
`GetSnapshot()` path is unchanged — using the existing fake-worker test harness (no MXAccess
needed).
---
## 8. Acceptance bar
- Each app builds and its test suite is green.
- `/metrics` serves the same existing series (plus additive standard-instrumentation series); meter
names and units unchanged.
- Logs carry the same messages at the same levels, plus the shared identity enrichers and
`trace_id`/`span_id` correlation.
- No secrets committed to any repo (the Gitea token stays creds-only at the user level).
- `components/observability/GAPS.md` updated; the false "MxGateway logging adopted" claim corrected.