3729ff2152
Second cross-fleet shared-library adoption (after Health). Full scope: AddZbTelemetry (OTel Resource identity triple + standard instrumentation + Prometheus /metrics) on all 3, plus shared Serilog on all 3 — including the MxGateway MEL->Serilog migration. Records the correction that MxGateway's logging was NOT actually adopted on main despite the docs' claim. Behaviour- preserving bar; breaking items (#6 unit, #7 rename) deferred.
235 lines
13 KiB
Markdown
235 lines
13 KiB
Markdown
# Adopt `ZB.MOM.WW.Telemetry` across the three sister apps — design
|
|
|
|
**Date:** 2026-06-01
|
|
**Status:** Approved (design); implementation plan to follow via writing-plans.
|
|
**Scope:** Integrate the built-but-unadopted `ZB.MOM.WW.Telemetry` (+ `.Serilog`) shared library
|
|
into all three sister apps — **OtOpcUa**, **MxAccessGateway**, **ScadaBridge** — wiring the shared
|
|
OpenTelemetry Resource, standard instrumentation, Prometheus `/metrics`, and the shared Serilog
|
|
bootstrap with identity enrichers and trace↔log correlation.
|
|
|
|
This is the second full cross-fleet adoption of one of the six shared `ZB.MOM.WW.*` libraries
|
|
(after `ZB.MOM.WW.Health`). It follows the adoption backlog in
|
|
[`components/observability/GAPS.md`](../../components/observability/GAPS.md), re-verified against
|
|
current code on 2026-06-01.
|
|
|
|
> **Correction recorded during design:** the library CLAUDE.md and
|
|
> [`components/observability/README.md`](../../components/observability/README.md) claim
|
|
> *"MxAccessGateway logging adopted (MEL → Serilog migration done on its own branch)."* This is
|
|
> **false on `main`** — MxGateway is still MEL-only (no Serilog packages, `GatewayLogScope` /
|
|
> `GatewayLogRedactor` still bespoke), and its `MxGateway.Server` meter is **not exported at all**
|
|
> (no `AddOpenTelemetry`, no `/metrics`). That branch never landed. This design therefore includes
|
|
> the full MxGateway MEL→Serilog migration, and the bookkeeping task corrects the false claim.
|
|
|
|
---
|
|
|
|
## 1. Goal & scope
|
|
|
|
Wire the two shared packages into all three apps:
|
|
|
|
- **`ZB.MOM.WW.Telemetry`** — `AddZbTelemetry(options)`: shared OTel Resource (the identity triple
|
|
`service.name` / `site.id` / `node.role` + `service.namespace` / `service.version` / `host.name`),
|
|
caller-supplied Meters/ActivitySources, standard instrumentation (ASP.NET Core, HttpClient, gRPC
|
|
client, runtime, process), Prometheus always-on exporter (OTLP opt-in), and `app.MapZbMetrics()`
|
|
to mount `/metrics`.
|
|
- **`ZB.MOM.WW.Telemetry.Serilog`** — `AddZbSerilog(options)`: two-stage Serilog bootstrap,
|
|
`ReadFrom.Configuration` sinks, `SiteId`/`NodeRole`/`NodeHostname` enrichers, `TraceContextEnricher`
|
|
(writes `trace_id`/`span_id` from `Activity.Current`), and the `ILogRedactor` seam via
|
|
`RedactionEnricher`. Uses `preserveStaticLogger: true` so it is test-safe.
|
|
|
|
**The headline gap (§1 of GAPS):** *no* app sets a single OTel Resource attribute today, so every
|
|
metric and span from every node is indistinguishable in a backend — no service identity, no
|
|
site/role topology, no version label. `AddZbTelemetry` closes this for all three at once. This is
|
|
the single highest-value observability gap across the fleet.
|
|
|
|
**Behaviour-preserving bar** (same as the Health adoption): same log messages at the same levels,
|
|
same metric series with the same names and units, same `/metrics` path. New series produced by
|
|
standard instrumentation are *additive*. All genuinely breaking items are **deferred** (see §6).
|
|
|
|
---
|
|
|
|
## 2. Distribution
|
|
|
|
- **Feed:** Gitea NuGet registry `dohertj2-gitea`
|
|
(`https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json`). Credentials live
|
|
**creds-only at the user level** (`~/.nuget/NuGet/NuGet.Config` `<packageSourceCredentials>`),
|
|
matched by source name — **never committed to any repo**. Already configured during the Health
|
|
round; no change needed here.
|
|
- **Source-mapping — the two-pattern gotcha (carried from Health):** under
|
|
`packageSourceMapping`, the glob `ZB.MOM.WW.Telemetry.*` matches `ZB.MOM.WW.Telemetry.Serilog`
|
|
but **not** the bare core id `ZB.MOM.WW.Telemetry`. Each repo therefore needs **both**:
|
|
```xml
|
|
<package pattern="ZB.MOM.WW.Telemetry" />
|
|
<package pattern="ZB.MOM.WW.Telemetry.*" />
|
|
```
|
|
- **Per-repo wiring:**
|
|
| Repo | CPM? | Change |
|
|
|---|---|---|
|
|
| OtOpcUa | yes (`Directory.Packages.props`) | add 2 `<PackageVersion>` @ `0.1.0`; extend existing `NuGet.config` mapping with both Telemetry patterns; add 2 versionless `<PackageReference>` to the Host csproj |
|
|
| ScadaBridge | yes | add 2 `<PackageVersion>` @ `0.1.0`; extend existing `nuget.config` mapping; add 2 versionless `<PackageReference>` to the Host csproj |
|
|
| MxAccessGateway | **no CPM** | add 2 direct versioned `<PackageReference>` to the Server csproj; extend its `nuget.config` mapping (the file created during the Health round) |
|
|
- **Task 0 (gating, like Health):** the library docs claim these two packages are already on the
|
|
feed. **Verify first; pack + push the two `.nupkg`s if missing** — the Health round proved this
|
|
claim cannot be trusted.
|
|
- **Serilog version floor (Gap V1):** OtOpcUa pins `Serilog.AspNetCore` 9.0.0, ScadaBridge 10.0.0.
|
|
Confirm the `.Serilog` package's Serilog dependency floor is satisfied by both (bump if not), and
|
|
pick MxGateway's fresh `Serilog.AspNetCore` version to align.
|
|
|
|
---
|
|
|
|
## 3. Per-app adoption surface
|
|
|
|
### OtOpcUa (`master`) — moderate
|
|
|
|
Already has Serilog (inline `UseSerilog`), full OTel, and Prometheus `/metrics`.
|
|
|
|
- **Metrics/traces:** replace the hand-rolled
|
|
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`
|
|
(`AddOpenTelemetry().WithMetrics(...AddPrometheusExporter()).WithTracing(...)` +
|
|
`MapPrometheusScrapingEndpoint("/metrics")`) with
|
|
```csharp
|
|
builder.AddZbTelemetry(o =>
|
|
{
|
|
o.ServiceName = "otopcua";
|
|
o.ServiceVersion = /* AssemblyInformationalVersion */;
|
|
o.Meters = ["ZB.MOM.WW.OtOpcUa"];
|
|
o.ActivitySources = ["ZB.MOM.WW.OtOpcUa"];
|
|
// Exporter defaults to Prometheus
|
|
});
|
|
// ...
|
|
app.MapZbMetrics();
|
|
```
|
|
**Same meter/source names and same `/metrics` path** → behaviour-preserving; *gains* the Resource
|
|
identity + standard instrumentation. (OtOpcUa records spans but has no trace exporter today;
|
|
Prometheus is metrics-only, so traces remain a no-op exporter-wise — unchanged. OTLP trace wiring
|
|
is deferred, §6.)
|
|
- **Logging:** replace the inline
|
|
`builder.Host.UseSerilog((ctx, lc) => lc.ReadFrom.Configuration(...).WriteTo.Console().WriteTo.File(...))`
|
|
with `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; })`, moving the Console/File sinks
|
|
into `appsettings` `Serilog:WriteTo` so `ReadFrom.Configuration` reproduces them. Keep the
|
|
existing driver-scope `LogContextEnricher` alongside the shared enrichers.
|
|
- **Identity:** `ServiceName="otopcua"`; `SiteId`/`NodeRole` omitted (none in config).
|
|
|
|
### ScadaBridge (`main`) — moderate, two composition roots
|
|
|
|
Serilog already (via `LoggerConfigurationFactory`); **no OTel at all**; `SiteId` + `NodeRole`
|
|
already read from config (`ScadaBridge:Node:*`, `NodeOptions`).
|
|
|
|
- **Metrics:** add `builder.AddZbTelemetry(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })`
|
|
+ `app.MapZbMetrics()` in **both** composition roots — the Central block and the Site block of
|
|
`Program.cs` (the same two-root pattern the Health adoption used). `Meters=[]` for now (app
|
|
instruments are deferred, §6). Purely additive — no metrics exist today to break.
|
|
- **Logging:** replace `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` +
|
|
`builder.Host.UseSerilog()` with
|
|
`builder.AddZbSerilog(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })`
|
|
— its enrichers reproduce the factory's `SiteId`/`NodeRole`/`NodeHostname`. Keep a minimal
|
|
`CreateBootstrapLogger()` line for early-startup capture per the library's documented pattern,
|
|
then delete `LoggerConfigurationFactory`. Verify the existing sinks are config-driven (`Serilog`
|
|
section in `appsettings`) so the swap is byte-equivalent; mirror any code-side sinks into config.
|
|
|
|
### MxAccessGateway (`main`) — heaviest (the MEL→Serilog migration)
|
|
|
|
MEL-only; custom `MxGateway.Server` meter **not exported**; no `/metrics`. The x86 net48 worker is
|
|
a separate process and **out of scope** — telemetry is for the Server.
|
|
|
|
- **Logging (MEL → Serilog):**
|
|
- Add Serilog packages (`Serilog.AspNetCore` + sinks) to the Server csproj (direct versioned ref).
|
|
- Replace the temporary `LoggerFactory.Create(...)` MEL bootstrap in `GatewayApplication.cs`
|
|
(and `builder.Logging` config) with `builder.AddZbSerilog(o => { o.ServiceName="mxgateway"; })`
|
|
+ a `CreateBootstrapLogger()` line.
|
|
- `GatewayLogScope` → `Serilog.Context.LogContext.PushProperty(...)`.
|
|
- `GatewayLogRedactor` → implement the `ILogRedactor` seam, register in DI (picked up by
|
|
`RedactionEnricher`).
|
|
- Request-logging middleware → `UseSerilogRequestLogging()` (or keep the middleware but emit via
|
|
a Serilog `ILogger`). Sinks to `appsettings`.
|
|
- **Metrics:** `builder.AddZbTelemetry(o => { o.ServiceName="mxgateway"; o.Meters=["MxGateway.Server"]; })`
|
|
+ `app.MapZbMetrics()` → the 20 existing instruments (13 counters, 3 histograms, 4 gauges) finally
|
|
export. **Keep the `MxGateway.Server` meter name and the `ms` histogram units** (rename and unit
|
|
conversion are deferred, §6). `GetSnapshot()` in-memory read path stays untouched.
|
|
|
|
---
|
|
|
|
## 4. Shared seam
|
|
|
|
```
|
|
ZbTelemetryOptions (ServiceName / SiteId / NodeRole / Meters / ActivitySources / Exporter)
|
|
│
|
|
┌─────────────────┴──────────────────┐
|
|
AddZbTelemetry (core) AddZbSerilog (.Serilog)
|
|
• ZbResource (identity triple) • ReadFrom.Configuration sinks
|
|
• app Meters + ActivitySources • SiteId / NodeRole / NodeHostname enrichers
|
|
• standard instrumentation • TraceContextEnricher (trace_id / span_id)
|
|
• Prometheus always + OTLP opt-in • ILogRedactor seam (RedactionEnricher)
|
|
│ │
|
|
app.MapZbMetrics() → /metrics preserveStaticLogger: true (test-safe)
|
|
```
|
|
|
|
Both packages share the single `ZbTelemetryOptions`. The Serilog OTLP log sink derives its Resource
|
|
attributes from `ZbResource.BuildAttributes` (single source of truth), so logs can never drift from
|
|
metrics and traces in a backend.
|
|
|
|
---
|
|
|
|
## 5. Sequencing & execution
|
|
|
|
Subagent-driven, classification-driven review chain. **Task 0 gates everything** (verify/publish the
|
|
feed). Then three **independent** per-repo phases — each its own git repo, branch
|
|
**`feat/adopt-zb-telemetry`**, commit per task, **never skip hooks, never force-push**:
|
|
|
|
1. **Task 0 (gating):** verify the two Telemetry `.nupkg`s are on the Gitea feed; pack + push if
|
|
missing (creds-only user config, already set).
|
|
2. **OtOpcUa:** source-mapping + package refs → `AddZbTelemetry` swap → `AddZbSerilog` swap → tests.
|
|
3. **ScadaBridge:** source-mapping + package refs → `AddZbTelemetry` (both roots) → `AddZbSerilog`
|
|
(replace `LoggerConfigurationFactory`) → tests.
|
|
4. **MxAccessGateway:** source-mapping + package refs → **MEL→Serilog** (sub-tasked, `high-risk`)
|
|
→ `AddZbTelemetry` metrics export → tests.
|
|
5. **scadaproj bookkeeping:** add an "Adoption status — DONE" section to
|
|
`components/observability/GAPS.md` (per-repo table + deferred items), **and correct the false
|
|
"MxGateway logging already adopted" claim** in CLAUDE.md, the library CLAUDE.md, and
|
|
`components/observability/README.md`.
|
|
|
|
The MxGateway MEL→Serilog migration is the one `high-risk` change (logging behaviour on the most
|
|
operational app) and gets the full spec→code serial review chain. The other per-app swaps are
|
|
`standard`.
|
|
|
|
---
|
|
|
|
## 6. Deferred (out of scope this round; recorded in GAPS)
|
|
|
|
| # | Item | Why deferred |
|
|
|---|---|---|
|
|
| #6 | MxGateway histogram `ms` → `s` | Breaking dashboard/alert change — needs ops coordination |
|
|
| #7 | MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | Breaking Prometheus label change — needs ops coordination |
|
|
| #9 | ScadaBridge app instruments (`ScadaBridgeTelemetry` + `scadabridge.*`) | Application-specific work, not shared-library adoption |
|
|
| #10 | OtOpcUa OTLP exporter alongside Prometheus | Opt-in; no consumer for OTLP yet |
|
|
| #11 | OtOpcUa trace-export no-op (spans recorded, no exporter) | Resolved by #10 / OTLP; or document |
|
|
|
|
None of these block the behaviour-preserving initial adoption.
|
|
|
|
---
|
|
|
|
## 7. Testing
|
|
|
|
All tests run **offline** — Prometheus is in-process, no OTLP collector required, and the library's
|
|
own test suites are network-free.
|
|
|
|
- **OtOpcUa:** assert `/metrics` is still served, the `ZB.MOM.WW.OtOpcUa` meter is present, the
|
|
Resource carries `service.name`, and the shared Serilog enrichers are wired.
|
|
- **ScadaBridge:** assert `/metrics` is served in **both** roles, the logger carries
|
|
`SiteId`/`NodeRole` enrichers, and startup is clean after `LoggerConfigurationFactory` removal.
|
|
- **MxAccessGateway** (the careful one): assert log messages are still emitted at the same levels,
|
|
redaction still applies, request logging still fires, `/metrics` is now served, and the
|
|
`GetSnapshot()` path is unchanged — using the existing fake-worker test harness (no MXAccess
|
|
needed).
|
|
|
|
---
|
|
|
|
## 8. Acceptance bar
|
|
|
|
- Each app builds and its test suite is green.
|
|
- `/metrics` serves the same existing series (plus additive standard-instrumentation series); meter
|
|
names and units unchanged.
|
|
- Logs carry the same messages at the same levels, plus the shared identity enrichers and
|
|
`trace_id`/`span_id` correlation.
|
|
- No secrets committed to any repo (the Gitea token stays creds-only at the user level).
|
|
- `components/observability/GAPS.md` updated; the false "MxGateway logging adopted" claim corrected.
|