Files
scadaproj/docs/plans/2026-06-01-telemetry-library-adoption-design.md
T
Joseph Doherty 3729ff2152 docs: design for ZB.MOM.WW.Telemetry adoption across the 3 sister apps
Second cross-fleet shared-library adoption (after Health). Full scope:
AddZbTelemetry (OTel Resource identity triple + standard instrumentation +
Prometheus /metrics) on all 3, plus shared Serilog on all 3 — including the
MxGateway MEL->Serilog migration. Records the correction that MxGateway's
logging was NOT actually adopted on main despite the docs' claim. Behaviour-
preserving bar; breaking items (#6 unit, #7 rename) deferred.
2026-06-01 15:11:50 -04:00

13 KiB

Adopt ZB.MOM.WW.Telemetry across the three sister apps — design

Date: 2026-06-01 Status: Approved (design); implementation plan to follow via writing-plans. Scope: Integrate the built-but-unadopted ZB.MOM.WW.Telemetry (+ .Serilog) shared library into all three sister apps — OtOpcUa, MxAccessGateway, ScadaBridge — wiring the shared OpenTelemetry Resource, standard instrumentation, Prometheus /metrics, and the shared Serilog bootstrap with identity enrichers and trace↔log correlation.

This is the second full cross-fleet adoption of one of the six shared ZB.MOM.WW.* libraries (after ZB.MOM.WW.Health). It follows the adoption backlog in components/observability/GAPS.md, re-verified against current code on 2026-06-01.

Correction recorded during design: the library CLAUDE.md and components/observability/README.md claim "MxAccessGateway logging adopted (MEL → Serilog migration done on its own branch)." This is false on main — MxGateway is still MEL-only (no Serilog packages, GatewayLogScope / GatewayLogRedactor still bespoke), and its MxGateway.Server meter is not exported at all (no AddOpenTelemetry, no /metrics). That branch never landed. This design therefore includes the full MxGateway MEL→Serilog migration, and the bookkeeping task corrects the false claim.


1. Goal & scope

Wire the two shared packages into all three apps:

  • ZB.MOM.WW.TelemetryAddZbTelemetry(options): shared OTel Resource (the identity triple service.name / site.id / node.role + service.namespace / service.version / host.name), caller-supplied Meters/ActivitySources, standard instrumentation (ASP.NET Core, HttpClient, gRPC client, runtime, process), Prometheus always-on exporter (OTLP opt-in), and app.MapZbMetrics() to mount /metrics.
  • ZB.MOM.WW.Telemetry.SerilogAddZbSerilog(options): two-stage Serilog bootstrap, ReadFrom.Configuration sinks, SiteId/NodeRole/NodeHostname enrichers, TraceContextEnricher (writes trace_id/span_id from Activity.Current), and the ILogRedactor seam via RedactionEnricher. Uses preserveStaticLogger: true so it is test-safe.

The headline gap (§1 of GAPS): no app sets a single OTel Resource attribute today, so every metric and span from every node is indistinguishable in a backend — no service identity, no site/role topology, no version label. AddZbTelemetry closes this for all three at once. This is the single highest-value observability gap across the fleet.

Behaviour-preserving bar (same as the Health adoption): same log messages at the same levels, same metric series with the same names and units, same /metrics path. New series produced by standard instrumentation are additive. All genuinely breaking items are deferred (see §6).


2. Distribution

  • Feed: Gitea NuGet registry dohertj2-gitea (https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json). Credentials live creds-only at the user level (~/.nuget/NuGet/NuGet.Config <packageSourceCredentials>), matched by source name — never committed to any repo. Already configured during the Health round; no change needed here.
  • Source-mapping — the two-pattern gotcha (carried from Health): under packageSourceMapping, the glob ZB.MOM.WW.Telemetry.* matches ZB.MOM.WW.Telemetry.Serilog but not the bare core id ZB.MOM.WW.Telemetry. Each repo therefore needs both:
    <package pattern="ZB.MOM.WW.Telemetry" />
    <package pattern="ZB.MOM.WW.Telemetry.*" />
    
  • Per-repo wiring:
    Repo CPM? Change
    OtOpcUa yes (Directory.Packages.props) add 2 <PackageVersion> @ 0.1.0; extend existing NuGet.config mapping with both Telemetry patterns; add 2 versionless <PackageReference> to the Host csproj
    ScadaBridge yes add 2 <PackageVersion> @ 0.1.0; extend existing nuget.config mapping; add 2 versionless <PackageReference> to the Host csproj
    MxAccessGateway no CPM add 2 direct versioned <PackageReference> to the Server csproj; extend its nuget.config mapping (the file created during the Health round)
  • Task 0 (gating, like Health): the library docs claim these two packages are already on the feed. Verify first; pack + push the two .nupkgs if missing — the Health round proved this claim cannot be trusted.
  • Serilog version floor (Gap V1): OtOpcUa pins Serilog.AspNetCore 9.0.0, ScadaBridge 10.0.0. Confirm the .Serilog package's Serilog dependency floor is satisfied by both (bump if not), and pick MxGateway's fresh Serilog.AspNetCore version to align.

3. Per-app adoption surface

OtOpcUa (master) — moderate

Already has Serilog (inline UseSerilog), full OTel, and Prometheus /metrics.

  • Metrics/traces: replace the hand-rolled src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs (AddOpenTelemetry().WithMetrics(...AddPrometheusExporter()).WithTracing(...) + MapPrometheusScrapingEndpoint("/metrics")) with
    builder.AddZbTelemetry(o =>
    {
        o.ServiceName = "otopcua";
        o.ServiceVersion = /* AssemblyInformationalVersion */;
        o.Meters = ["ZB.MOM.WW.OtOpcUa"];
        o.ActivitySources = ["ZB.MOM.WW.OtOpcUa"];
        // Exporter defaults to Prometheus
    });
    // ...
    app.MapZbMetrics();
    
    Same meter/source names and same /metrics path → behaviour-preserving; gains the Resource identity + standard instrumentation. (OtOpcUa records spans but has no trace exporter today; Prometheus is metrics-only, so traces remain a no-op exporter-wise — unchanged. OTLP trace wiring is deferred, §6.)
  • Logging: replace the inline builder.Host.UseSerilog((ctx, lc) => lc.ReadFrom.Configuration(...).WriteTo.Console().WriteTo.File(...)) with builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; }), moving the Console/File sinks into appsettings Serilog:WriteTo so ReadFrom.Configuration reproduces them. Keep the existing driver-scope LogContextEnricher alongside the shared enrichers.
  • Identity: ServiceName="otopcua"; SiteId/NodeRole omitted (none in config).

ScadaBridge (main) — moderate, two composition roots

Serilog already (via LoggerConfigurationFactory); no OTel at all; SiteId + NodeRole already read from config (ScadaBridge:Node:*, NodeOptions).

  • Metrics: add builder.AddZbTelemetry(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })
    • app.MapZbMetrics() in both composition roots — the Central block and the Site block of Program.cs (the same two-root pattern the Health adoption used). Meters=[] for now (app instruments are deferred, §6). Purely additive — no metrics exist today to break.
  • Logging: replace LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname) + builder.Host.UseSerilog() with builder.AddZbSerilog(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; }) — its enrichers reproduce the factory's SiteId/NodeRole/NodeHostname. Keep a minimal CreateBootstrapLogger() line for early-startup capture per the library's documented pattern, then delete LoggerConfigurationFactory. Verify the existing sinks are config-driven (Serilog section in appsettings) so the swap is byte-equivalent; mirror any code-side sinks into config.

MxAccessGateway (main) — heaviest (the MEL→Serilog migration)

MEL-only; custom MxGateway.Server meter not exported; no /metrics. The x86 net48 worker is a separate process and out of scope — telemetry is for the Server.

  • Logging (MEL → Serilog):
    • Add Serilog packages (Serilog.AspNetCore + sinks) to the Server csproj (direct versioned ref).
    • Replace the temporary LoggerFactory.Create(...) MEL bootstrap in GatewayApplication.cs (and builder.Logging config) with builder.AddZbSerilog(o => { o.ServiceName="mxgateway"; })
      • a CreateBootstrapLogger() line.
    • GatewayLogScopeSerilog.Context.LogContext.PushProperty(...).
    • GatewayLogRedactor → implement the ILogRedactor seam, register in DI (picked up by RedactionEnricher).
    • Request-logging middleware → UseSerilogRequestLogging() (or keep the middleware but emit via a Serilog ILogger). Sinks to appsettings.
  • Metrics: builder.AddZbTelemetry(o => { o.ServiceName="mxgateway"; o.Meters=["MxGateway.Server"]; })
    • app.MapZbMetrics() → the 20 existing instruments (13 counters, 3 histograms, 4 gauges) finally export. Keep the MxGateway.Server meter name and the ms histogram units (rename and unit conversion are deferred, §6). GetSnapshot() in-memory read path stays untouched.

4. Shared seam

            ZbTelemetryOptions  (ServiceName / SiteId / NodeRole / Meters / ActivitySources / Exporter)
                      │
    ┌─────────────────┴──────────────────┐
AddZbTelemetry (core)             AddZbSerilog (.Serilog)
• ZbResource (identity triple)    • ReadFrom.Configuration sinks
• app Meters + ActivitySources    • SiteId / NodeRole / NodeHostname enrichers
• standard instrumentation        • TraceContextEnricher (trace_id / span_id)
• Prometheus always + OTLP opt-in  • ILogRedactor seam (RedactionEnricher)
    │                                  │
app.MapZbMetrics() → /metrics     preserveStaticLogger: true (test-safe)

Both packages share the single ZbTelemetryOptions. The Serilog OTLP log sink derives its Resource attributes from ZbResource.BuildAttributes (single source of truth), so logs can never drift from metrics and traces in a backend.


5. Sequencing & execution

Subagent-driven, classification-driven review chain. Task 0 gates everything (verify/publish the feed). Then three independent per-repo phases — each its own git repo, branch feat/adopt-zb-telemetry, commit per task, never skip hooks, never force-push:

  1. Task 0 (gating): verify the two Telemetry .nupkgs are on the Gitea feed; pack + push if missing (creds-only user config, already set).
  2. OtOpcUa: source-mapping + package refs → AddZbTelemetry swap → AddZbSerilog swap → tests.
  3. ScadaBridge: source-mapping + package refs → AddZbTelemetry (both roots) → AddZbSerilog (replace LoggerConfigurationFactory) → tests.
  4. MxAccessGateway: source-mapping + package refs → MEL→Serilog (sub-tasked, high-risk) → AddZbTelemetry metrics export → tests.
  5. scadaproj bookkeeping: add an "Adoption status — DONE" section to components/observability/GAPS.md (per-repo table + deferred items), and correct the false "MxGateway logging already adopted" claim in CLAUDE.md, the library CLAUDE.md, and components/observability/README.md.

The MxGateway MEL→Serilog migration is the one high-risk change (logging behaviour on the most operational app) and gets the full spec→code serial review chain. The other per-app swaps are standard.


6. Deferred (out of scope this round; recorded in GAPS)

# Item Why deferred
#6 MxGateway histogram mss Breaking dashboard/alert change — needs ops coordination
#7 MxGateway meter rename MxGateway.ServerZB.MOM.WW.MxGateway Breaking Prometheus label change — needs ops coordination
#9 ScadaBridge app instruments (ScadaBridgeTelemetry + scadabridge.*) Application-specific work, not shared-library adoption
#10 OtOpcUa OTLP exporter alongside Prometheus Opt-in; no consumer for OTLP yet
#11 OtOpcUa trace-export no-op (spans recorded, no exporter) Resolved by #10 / OTLP; or document

None of these block the behaviour-preserving initial adoption.


7. Testing

All tests run offline — Prometheus is in-process, no OTLP collector required, and the library's own test suites are network-free.

  • OtOpcUa: assert /metrics is still served, the ZB.MOM.WW.OtOpcUa meter is present, the Resource carries service.name, and the shared Serilog enrichers are wired.
  • ScadaBridge: assert /metrics is served in both roles, the logger carries SiteId/NodeRole enrichers, and startup is clean after LoggerConfigurationFactory removal.
  • MxAccessGateway (the careful one): assert log messages are still emitted at the same levels, redaction still applies, request logging still fires, /metrics is now served, and the GetSnapshot() path is unchanged — using the existing fake-worker test harness (no MXAccess needed).

8. Acceptance bar

  • Each app builds and its test suite is green.
  • /metrics serves the same existing series (plus additive standard-instrumentation series); meter names and units unchanged.
  • Logs carry the same messages at the same levels, plus the shared identity enrichers and trace_id/span_id correlation.
  • No secrets committed to any repo (the Gitea token stays creds-only at the user level).
  • components/observability/GAPS.md updated; the false "MxGateway logging adopted" claim corrected.