Second cross-fleet shared-library adoption (after Health). Full scope: AddZbTelemetry (OTel Resource identity triple + standard instrumentation + Prometheus /metrics) on all 3, plus shared Serilog on all 3, including the MxGateway MEL→Serilog migration. Records the correction that MxGateway's logging was NOT actually adopted on main despite the docs' claim. Behaviour-preserving bar; breaking items (#6 unit, #7 rename) deferred.
Adopt ZB.MOM.WW.Telemetry across the three sister apps — design
Date: 2026-06-01
Status: Approved (design); implementation plan to follow via writing-plans.
Scope: Integrate the built-but-unadopted ZB.MOM.WW.Telemetry (+ .Serilog) shared library
into all three sister apps — OtOpcUa, MxAccessGateway, ScadaBridge — wiring the shared
OpenTelemetry Resource, standard instrumentation, Prometheus /metrics, and the shared Serilog
bootstrap with identity enrichers and trace↔log correlation.
This is the second full cross-fleet adoption of one of the six shared ZB.MOM.WW.* libraries
(after ZB.MOM.WW.Health). It follows the adoption backlog in
components/observability/GAPS.md, re-verified against
current code on 2026-06-01.
Correction recorded during design: the library CLAUDE.md and
components/observability/README.md claim "MxAccessGateway logging adopted (MEL → Serilog migration done on its own branch)." This is false on main: MxGateway is still MEL-only (no Serilog packages, GatewayLogScope / GatewayLogRedactor still bespoke), and its MxGateway.Server meter is not exported at all (no AddOpenTelemetry, no /metrics). That branch never landed. This design therefore includes the full MxGateway MEL→Serilog migration, and the bookkeeping task corrects the false claim.
1. Goal & scope
Wire the two shared packages into all three apps:
- ZB.MOM.WW.Telemetry, via AddZbTelemetry(options): shared OTel Resource (the identity triple service.name / site.id / node.role, plus service.namespace / service.version / host.name), caller-supplied Meters/ActivitySources, standard instrumentation (ASP.NET Core, HttpClient, gRPC client, runtime, process), Prometheus always-on exporter (OTLP opt-in), and app.MapZbMetrics() to mount /metrics.
- ZB.MOM.WW.Telemetry.Serilog, via AddZbSerilog(options): two-stage Serilog bootstrap, ReadFrom.Configuration sinks, SiteId / NodeRole / NodeHostname enrichers, TraceContextEnricher (writes trace_id / span_id from Activity.Current), and the ILogRedactor seam via RedactionEnricher. Uses preserveStaticLogger: true so it is test-safe.
The headline gap (§1 of GAPS): no app sets a single OTel Resource attribute today, so every
metric and span from every node is indistinguishable in a backend — no service identity, no
site/role topology, no version label. AddZbTelemetry closes this for all three at once. This is
the single highest-value observability gap across the fleet.
Behaviour-preserving bar (same as the Health adoption): same log messages at the same levels,
same metric series with the same names and units, same /metrics path. New series produced by
standard instrumentation are additive. All genuinely breaking items are deferred (see §6).
2. Distribution
- Feed: Gitea NuGet registry dohertj2-gitea (https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json). Credentials live creds-only at the user level (~/.nuget/NuGet/NuGet.Config <packageSourceCredentials>), matched by source name, and are never committed to any repo. Already configured during the Health round; no change needed here.
- Source-mapping, the two-pattern gotcha (carried from Health): under packageSourceMapping, the glob ZB.MOM.WW.Telemetry.* matches ZB.MOM.WW.Telemetry.Serilog but not the bare core id ZB.MOM.WW.Telemetry. Each repo therefore needs both:

  ```xml
  <package pattern="ZB.MOM.WW.Telemetry" />
  <package pattern="ZB.MOM.WW.Telemetry.*" />
  ```

- Per-repo wiring:

  | Repo | CPM? | Change |
  |---|---|---|
  | OtOpcUa | yes (Directory.Packages.props) | add 2 <PackageVersion> @ 0.1.0; extend existing NuGet.config mapping with both Telemetry patterns; add 2 versionless <PackageReference> to the Host csproj |
  | ScadaBridge | yes | add 2 <PackageVersion> @ 0.1.0; extend existing nuget.config mapping; add 2 versionless <PackageReference> to the Host csproj |
  | MxAccessGateway | no CPM | add 2 direct versioned <PackageReference> to the Server csproj; extend its nuget.config mapping (the file created during the Health round) |

- Task 0 (gating, like Health): the library docs claim these two packages are already on the feed. Verify first; pack + push the two .nupkgs if missing. The Health round proved this claim cannot be trusted.
- Serilog version floor (Gap V1): OtOpcUa pins Serilog.AspNetCore 9.0.0, ScadaBridge 10.0.0. Confirm the .Serilog package's Serilog dependency floor is satisfied by both (bump if not), and pick MxGateway's fresh Serilog.AspNetCore version to align.
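To make the two-pattern rule and the CPM wiring concrete, here is a minimal sketch of a repo-level nuget.config. The feed name and URL come from this document; the nuget.org source with its catch-all `*` pattern is an assumption about what each repo already maps, and any Health patterns added in the previous round are omitted for brevity.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <!-- Assumption: the public feed is already declared in each repo. -->
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" />
    <add key="dohertj2-gitea" value="https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json" />
  </packageSources>
  <packageSourceMapping>
    <packageSource key="nuget.org">
      <package pattern="*" />
    </packageSource>
    <packageSource key="dohertj2-gitea">
      <!-- Exact id for the core package; the glob below does not match it. -->
      <package pattern="ZB.MOM.WW.Telemetry" />
      <!-- Glob for sub-packages such as ZB.MOM.WW.Telemetry.Serilog. -->
      <package pattern="ZB.MOM.WW.Telemetry.*" />
    </packageSource>
  </packageSourceMapping>
</configuration>
```

For the two CPM repos, the version pin and the versionless reference would look roughly like this (standard MSBuild central package management items; the 0.1.0 version is the one named in the table above):

```xml
<!-- Directory.Packages.props -->
<ItemGroup>
  <PackageVersion Include="ZB.MOM.WW.Telemetry" Version="0.1.0" />
  <PackageVersion Include="ZB.MOM.WW.Telemetry.Serilog" Version="0.1.0" />
</ItemGroup>

<!-- Host csproj: no Version attribute under CPM -->
<ItemGroup>
  <PackageReference Include="ZB.MOM.WW.Telemetry" />
  <PackageReference Include="ZB.MOM.WW.Telemetry.Serilog" />
</ItemGroup>
```

No credentials appear in either file; they stay in the user-level config and are matched by the dohertj2-gitea source name.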
3. Per-app adoption surface
OtOpcUa (master) — moderate
Already has Serilog (inline UseSerilog), full OTel, and Prometheus /metrics.
- Metrics/traces: replace the hand-rolled src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs (AddOpenTelemetry().WithMetrics(...AddPrometheusExporter()).WithTracing(...) + MapPrometheusScrapingEndpoint("/metrics")) with

  ```csharp
  builder.AddZbTelemetry(o =>
  {
      o.ServiceName = "otopcua";
      o.ServiceVersion = /* AssemblyInformationalVersion */;
      o.Meters = ["ZB.MOM.WW.OtOpcUa"];
      o.ActivitySources = ["ZB.MOM.WW.OtOpcUa"];
      // Exporter defaults to Prometheus
  });
  // ...
  app.MapZbMetrics();
  ```

  Same meter/source names and same /metrics path → behaviour-preserving; gains the Resource identity + standard instrumentation. (OtOpcUa records spans but has no trace exporter today; Prometheus is metrics-only, so traces remain a no-op exporter-wise, unchanged. OTLP trace wiring is deferred, §6.)
- Logging: replace the inline builder.Host.UseSerilog((ctx, lc) => lc.ReadFrom.Configuration(...).WriteTo.Console().WriteTo.File(...)) with builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; }), moving the Console/File sinks into appsettings Serilog:WriteTo so ReadFrom.Configuration reproduces them. Keep the existing driver-scope LogContextEnricher alongside the shared enrichers.
- Identity: ServiceName="otopcua"; SiteId / NodeRole omitted (none in config).
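As an illustration of moving the OtOpcUa Console/File sinks into configuration, the appsettings section could look roughly like the sketch below. It uses the standard Serilog.Settings.Configuration layout; the file path, rolling interval, and minimum level are hypothetical placeholders, and the real values must mirror whatever arguments the current inline WriteTo.File(...) call passes so that output stays unchanged.

```json
{
  "Serilog": {
    "Using": [ "Serilog.Sinks.Console", "Serilog.Sinks.File" ],
    "MinimumLevel": { "Default": "Information" },
    "WriteTo": [
      { "Name": "Console" },
      {
        "Name": "File",
        "Args": {
          "path": "logs/otopcua-.log",
          "rollingInterval": "Day"
        }
      }
    ]
  }
}
```

Any output template or other argument set in code today needs a matching entry under Args. A sink left only in code would be dropped once the inline UseSerilog call is removed.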
ScadaBridge (main) — moderate, two composition roots
Serilog already (via LoggerConfigurationFactory); no OTel at all; SiteId + NodeRole
already read from config (ScadaBridge:Node:*, NodeOptions).
- Metrics: add builder.AddZbTelemetry(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; }) + app.MapZbMetrics() in both composition roots: the Central block and the Site block of Program.cs (the same two-root pattern the Health adoption used). Meters=[] for now (app instruments are deferred, §6). Purely additive: no metrics exist today to break.
- Logging: replace LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname) + builder.Host.UseSerilog() with builder.AddZbSerilog(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; }); its enrichers reproduce the factory's SiteId / NodeRole / NodeHostname. Keep a minimal CreateBootstrapLogger() line for early-startup capture per the library's documented pattern, then delete LoggerConfigurationFactory. Verify the existing sinks are config-driven (Serilog section in appsettings) so the swap is byte-equivalent; mirror any code-side sinks into config.
MxAccessGateway (main) — heaviest (the MEL→Serilog migration)
MEL-only; custom MxGateway.Server meter not exported; no /metrics. The x86 net48 worker is
a separate process and out of scope — telemetry is for the Server.
- Logging (MEL → Serilog):
  - Add Serilog packages (Serilog.AspNetCore + sinks) to the Server csproj (direct versioned ref).
  - Replace the temporary LoggerFactory.Create(...) MEL bootstrap in GatewayApplication.cs (and builder.Logging config) with builder.AddZbSerilog(o => { o.ServiceName="mxgateway"; }) + a CreateBootstrapLogger() line.
  - GatewayLogScope → Serilog.Context.LogContext.PushProperty(...).
  - GatewayLogRedactor → implement the ILogRedactor seam, register in DI (picked up by RedactionEnricher).
  - Request-logging middleware → UseSerilogRequestLogging() (or keep the middleware but emit via a Serilog ILogger). Sinks to appsettings.
- Metrics: builder.AddZbTelemetry(o => { o.ServiceName="mxgateway"; o.Meters=["MxGateway.Server"]; }) + app.MapZbMetrics() → the 20 existing instruments (13 counters, 3 histograms, 4 gauges) finally export. Keep the MxGateway.Server meter name and the ms histogram units (rename and unit conversion are deferred, §6). The GetSnapshot() in-memory read path stays untouched.
4. Shared seam
```
ZbTelemetryOptions (ServiceName / SiteId / NodeRole / Meters / ActivitySources / Exporter)
                           │
         ┌─────────────────┴──────────────────┐
AddZbTelemetry (core)                AddZbSerilog (.Serilog)
• ZbResource (identity triple)       • ReadFrom.Configuration sinks
• app Meters + ActivitySources       • SiteId / NodeRole / NodeHostname enrichers
• standard instrumentation           • TraceContextEnricher (trace_id / span_id)
• Prometheus always + OTLP opt-in    • ILogRedactor seam (RedactionEnricher)
         │                                    │
app.MapZbMetrics() → /metrics        preserveStaticLogger: true (test-safe)
```
Both packages share the single ZbTelemetryOptions. The Serilog OTLP log sink derives its Resource
attributes from ZbResource.BuildAttributes (single source of truth), so logs can never drift from
metrics and traces in a backend.
5. Sequencing & execution
Subagent-driven, classification-driven review chain. Task 0 gates everything (verify/publish the
feed). Then three independent per-repo phases — each its own git repo, branch
feat/adopt-zb-telemetry, commit per task, never skip hooks, never force-push:
- Task 0 (gating): verify the two Telemetry .nupkgs are on the Gitea feed; pack + push if missing (creds-only user config, already set).
- OtOpcUa: source-mapping + package refs → AddZbTelemetry swap → AddZbSerilog swap → tests.
- ScadaBridge: source-mapping + package refs → AddZbTelemetry (both roots) → AddZbSerilog (replace LoggerConfigurationFactory) → tests.
- MxAccessGateway: source-mapping + package refs → MEL→Serilog (sub-tasked, high-risk) → AddZbTelemetry metrics export → tests.
- scadaproj bookkeeping: add an "Adoption status — DONE" section to components/observability/GAPS.md (per-repo table + deferred items), and correct the false "MxGateway logging already adopted" claim in CLAUDE.md, the library CLAUDE.md, and components/observability/README.md.
The MxGateway MEL→Serilog migration is the one high-risk change (logging behaviour on the most
operational app) and gets the full spec→code serial review chain. The other per-app swaps are
standard.
6. Deferred (out of scope this round; recorded in GAPS)
| # | Item | Why deferred |
|---|---|---|
| #6 | MxGateway histogram ms → s | Breaking dashboard/alert change; needs ops coordination |
| #7 | MxGateway meter rename MxGateway.Server → ZB.MOM.WW.MxGateway | Breaking Prometheus label change; needs ops coordination |
| #9 | ScadaBridge app instruments (ScadaBridgeTelemetry + scadabridge.*) | Application-specific work, not shared-library adoption |
| #10 | OtOpcUa OTLP exporter alongside Prometheus | Opt-in; no consumer for OTLP yet |
| #11 | OtOpcUa trace-export no-op (spans recorded, no exporter) | Resolved by #10 / OTLP; or document |
None of these block the behaviour-preserving initial adoption.
7. Testing
All tests run offline — Prometheus is in-process, no OTLP collector required, and the library's own test suites are network-free.
- OtOpcUa: assert /metrics is still served, the ZB.MOM.WW.OtOpcUa meter is present, the Resource carries service.name, and the shared Serilog enrichers are wired.
- ScadaBridge: assert /metrics is served in both roles, the logger carries SiteId / NodeRole enrichers, and startup is clean after LoggerConfigurationFactory removal.
- MxAccessGateway (the careful one): assert log messages are still emitted at the same levels, redaction still applies, request logging still fires, /metrics is now served, and the GetSnapshot() path is unchanged, using the existing fake-worker test harness (no MXAccess needed).
8. Acceptance bar
- Each app builds and its test suite is green.
- /metrics serves the same existing series (plus additive standard-instrumentation series); meter names and units unchanged.
- Logs carry the same messages at the same levels, plus the shared identity enrichers and trace_id / span_id correlation.
- No secrets committed to any repo (the Gitea token stays creds-only at the user level).
- components/observability/GAPS.md updated; the false "MxGateway logging adopted" claim corrected.