docs: design for ZB.MOM.WW.Telemetry adoption across the 3 sister apps
Second cross-fleet shared-library adoption (after Health). Full scope: AddZbTelemetry (OTel Resource identity triple + standard instrumentation + Prometheus /metrics) on all 3, plus shared Serilog on all 3 — including the MxGateway MEL->Serilog migration. Records the correction that MxGateway's logging was NOT actually adopted on main despite the docs' claim. Behaviour- preserving bar; breaking items (#6 unit, #7 rename) deferred.
This commit is contained in:
@@ -0,0 +1,234 @@
|
||||
# Adopt `ZB.MOM.WW.Telemetry` across the three sister apps — design
|
||||
|
||||
**Date:** 2026-06-01
|
||||
**Status:** Approved (design); implementation plan to follow via writing-plans.
|
||||
**Scope:** Integrate the built-but-unadopted `ZB.MOM.WW.Telemetry` (+ `.Serilog`) shared library
|
||||
into all three sister apps — **OtOpcUa**, **MxAccessGateway**, **ScadaBridge** — wiring the shared
|
||||
OpenTelemetry Resource, standard instrumentation, Prometheus `/metrics`, and the shared Serilog
|
||||
bootstrap with identity enrichers and trace↔log correlation.
|
||||
|
||||
This is the second full cross-fleet adoption of one of the six shared `ZB.MOM.WW.*` libraries
|
||||
(after `ZB.MOM.WW.Health`). It follows the adoption backlog in
|
||||
[`components/observability/GAPS.md`](../../components/observability/GAPS.md), re-verified against
|
||||
current code on 2026-06-01.
|
||||
|
||||
> **Correction recorded during design:** the library CLAUDE.md and
|
||||
> [`components/observability/README.md`](../../components/observability/README.md) claim
|
||||
> *"MxAccessGateway logging adopted (MEL → Serilog migration done on its own branch)."* This is
|
||||
> **false on `main`** — MxGateway is still MEL-only (no Serilog packages, `GatewayLogScope` /
|
||||
> `GatewayLogRedactor` still bespoke), and its `MxGateway.Server` meter is **not exported at all**
|
||||
> (no `AddOpenTelemetry`, no `/metrics`). That branch never landed. This design therefore includes
|
||||
> the full MxGateway MEL→Serilog migration, and the bookkeeping task corrects the false claim.
|
||||
|
||||
---
|
||||
|
||||
## 1. Goal & scope
|
||||
|
||||
Wire the two shared packages into all three apps:
|
||||
|
||||
- **`ZB.MOM.WW.Telemetry`** — `AddZbTelemetry(options)`: shared OTel Resource (the identity triple
|
||||
`service.name` / `site.id` / `node.role` + `service.namespace` / `service.version` / `host.name`),
|
||||
caller-supplied Meters/ActivitySources, standard instrumentation (ASP.NET Core, HttpClient, gRPC
|
||||
client, runtime, process), Prometheus always-on exporter (OTLP opt-in), and `app.MapZbMetrics()`
|
||||
to mount `/metrics`.
|
||||
- **`ZB.MOM.WW.Telemetry.Serilog`** — `AddZbSerilog(options)`: two-stage Serilog bootstrap,
|
||||
`ReadFrom.Configuration` sinks, `SiteId`/`NodeRole`/`NodeHostname` enrichers, `TraceContextEnricher`
|
||||
(writes `trace_id`/`span_id` from `Activity.Current`), and the `ILogRedactor` seam via
|
||||
`RedactionEnricher`. Uses `preserveStaticLogger: true` so it is test-safe.
|
||||
|
||||
**The headline gap (§1 of GAPS):** *no* app sets a single OTel Resource attribute today, so every
|
||||
metric and span from every node is indistinguishable in a backend — no service identity, no
|
||||
site/role topology, no version label. `AddZbTelemetry` closes this for all three at once. This is
|
||||
the single highest-value observability gap across the fleet.
|
||||
|
||||
**Behaviour-preserving bar** (same as the Health adoption): same log messages at the same levels,
|
||||
same metric series with the same names and units, same `/metrics` path. New series produced by
|
||||
standard instrumentation are *additive*. All genuinely breaking items are **deferred** (see §6).
|
||||
|
||||
---
|
||||
|
||||
## 2. Distribution
|
||||
|
||||
- **Feed:** Gitea NuGet registry `dohertj2-gitea`
|
||||
(`https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json`). Credentials live
|
||||
**creds-only at the user level** (`~/.nuget/NuGet/NuGet.Config` `<packageSourceCredentials>`),
|
||||
matched by source name — **never committed to any repo**. Already configured during the Health
|
||||
round; no change needed here.
|
||||
- **Source-mapping — the two-pattern gotcha (carried from Health):** under
|
||||
`packageSourceMapping`, the glob `ZB.MOM.WW.Telemetry.*` matches `ZB.MOM.WW.Telemetry.Serilog`
|
||||
but **not** the bare core id `ZB.MOM.WW.Telemetry`. Each repo therefore needs **both**:
|
||||
```xml
|
||||
<package pattern="ZB.MOM.WW.Telemetry" />
|
||||
<package pattern="ZB.MOM.WW.Telemetry.*" />
|
||||
```
|
||||
- **Per-repo wiring:**
|
||||
| Repo | CPM? | Change |
|
||||
|---|---|---|
|
||||
| OtOpcUa | yes (`Directory.Packages.props`) | add 2 `<PackageVersion>` @ `0.1.0`; extend existing `NuGet.config` mapping with both Telemetry patterns; add 2 versionless `<PackageReference>` to the Host csproj |
|
||||
| ScadaBridge | yes | add 2 `<PackageVersion>` @ `0.1.0`; extend existing `nuget.config` mapping; add 2 versionless `<PackageReference>` to the Host csproj |
|
||||
| MxAccessGateway | **no CPM** | add 2 direct versioned `<PackageReference>` to the Server csproj; extend its `nuget.config` mapping (the file created during the Health round) |
|
||||
- **Task 0 (gating, like Health):** the library docs claim these two packages are already on the
|
||||
feed. **Verify first; pack + push the two `.nupkg`s if missing** — the Health round proved this
|
||||
claim cannot be trusted.
|
||||
- **Serilog version floor (Gap V1):** OtOpcUa pins `Serilog.AspNetCore` 9.0.0, ScadaBridge 10.0.0.
|
||||
Confirm the `.Serilog` package's Serilog dependency floor is satisfied by both (bump if not), and
|
||||
pick MxGateway's fresh `Serilog.AspNetCore` version to align.
|
||||
|
||||
---
|
||||
|
||||
## 3. Per-app adoption surface
|
||||
|
||||
### OtOpcUa (`master`) — moderate
|
||||
|
||||
Already has Serilog (inline `UseSerilog`), full OTel, and Prometheus `/metrics`.
|
||||
|
||||
- **Metrics/traces:** replace the hand-rolled
|
||||
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`
|
||||
(`AddOpenTelemetry().WithMetrics(...AddPrometheusExporter()).WithTracing(...)` +
|
||||
`MapPrometheusScrapingEndpoint("/metrics")`) with
|
||||
```csharp
|
||||
builder.AddZbTelemetry(o =>
|
||||
{
|
||||
o.ServiceName = "otopcua";
|
||||
o.ServiceVersion = /* AssemblyInformationalVersion */;
|
||||
o.Meters = ["ZB.MOM.WW.OtOpcUa"];
|
||||
o.ActivitySources = ["ZB.MOM.WW.OtOpcUa"];
|
||||
// Exporter defaults to Prometheus
|
||||
});
|
||||
// ...
|
||||
app.MapZbMetrics();
|
||||
```
|
||||
**Same meter/source names and same `/metrics` path** → behaviour-preserving; *gains* the Resource
|
||||
identity + standard instrumentation. (OtOpcUa records spans but has no trace exporter today;
|
||||
Prometheus is metrics-only, so traces remain a no-op exporter-wise — unchanged. OTLP trace wiring
|
||||
is deferred, §6.)
|
||||
- **Logging:** replace the inline
|
||||
`builder.Host.UseSerilog((ctx, lc) => lc.ReadFrom.Configuration(...).WriteTo.Console().WriteTo.File(...))`
|
||||
with `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; })`, moving the Console/File sinks
|
||||
into `appsettings` `Serilog:WriteTo` so `ReadFrom.Configuration` reproduces them. Keep the
|
||||
existing driver-scope `LogContextEnricher` alongside the shared enrichers.
|
||||
- **Identity:** `ServiceName="otopcua"`; `SiteId`/`NodeRole` omitted (none in config).
|
||||
|
||||
### ScadaBridge (`main`) — moderate, two composition roots
|
||||
|
||||
Serilog already (via `LoggerConfigurationFactory`); **no OTel at all**; `SiteId` + `NodeRole`
|
||||
already read from config (`ScadaBridge:Node:*`, `NodeOptions`).
|
||||
|
||||
- **Metrics:** add `builder.AddZbTelemetry(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })`
|
||||
+ `app.MapZbMetrics()` in **both** composition roots — the Central block and the Site block of
|
||||
`Program.cs` (the same two-root pattern the Health adoption used). `Meters=[]` for now (app
|
||||
instruments are deferred, §6). Purely additive — no metrics exist today to break.
|
||||
- **Logging:** replace `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` +
|
||||
`builder.Host.UseSerilog()` with
|
||||
`builder.AddZbSerilog(o => { o.ServiceName="scadabridge"; o.SiteId=siteId; o.NodeRole=nodeRole; })`
|
||||
— its enrichers reproduce the factory's `SiteId`/`NodeRole`/`NodeHostname`. Keep a minimal
|
||||
`CreateBootstrapLogger()` line for early-startup capture per the library's documented pattern,
|
||||
then delete `LoggerConfigurationFactory`. Verify the existing sinks are config-driven (`Serilog`
|
||||
section in `appsettings`) so the swap is byte-equivalent; mirror any code-side sinks into config.
|
||||
|
||||
### MxAccessGateway (`main`) — heaviest (the MEL→Serilog migration)
|
||||
|
||||
MEL-only; custom `MxGateway.Server` meter **not exported**; no `/metrics`. The x86 net48 worker is
|
||||
a separate process and **out of scope** — telemetry is for the Server.
|
||||
|
||||
- **Logging (MEL → Serilog):**
|
||||
- Add Serilog packages (`Serilog.AspNetCore` + sinks) to the Server csproj (direct versioned ref).
|
||||
- Replace the temporary `LoggerFactory.Create(...)` MEL bootstrap in `GatewayApplication.cs`
|
||||
(and `builder.Logging` config) with `builder.AddZbSerilog(o => { o.ServiceName="mxgateway"; })`
|
||||
+ a `CreateBootstrapLogger()` line.
|
||||
- `GatewayLogScope` → `Serilog.Context.LogContext.PushProperty(...)`.
|
||||
- `GatewayLogRedactor` → implement the `ILogRedactor` seam, register in DI (picked up by
|
||||
`RedactionEnricher`).
|
||||
- Request-logging middleware → `UseSerilogRequestLogging()` (or keep the middleware but emit via
|
||||
a Serilog `ILogger`). Sinks to `appsettings`.
|
||||
- **Metrics:** `builder.AddZbTelemetry(o => { o.ServiceName="mxgateway"; o.Meters=["MxGateway.Server"]; })`
|
||||
+ `app.MapZbMetrics()` → the 20 existing instruments (13 counters, 3 histograms, 4 gauges) finally
|
||||
export. **Keep the `MxGateway.Server` meter name and the `ms` histogram units** (rename and unit
|
||||
conversion are deferred, §6). `GetSnapshot()` in-memory read path stays untouched.
|
||||
|
||||
---
|
||||
|
||||
## 4. Shared seam
|
||||
|
||||
```
|
||||
ZbTelemetryOptions (ServiceName / SiteId / NodeRole / Meters / ActivitySources / Exporter)
|
||||
│
|
||||
┌─────────────────┴──────────────────┐
|
||||
AddZbTelemetry (core) AddZbSerilog (.Serilog)
|
||||
• ZbResource (identity triple) • ReadFrom.Configuration sinks
|
||||
• app Meters + ActivitySources • SiteId / NodeRole / NodeHostname enrichers
|
||||
• standard instrumentation • TraceContextEnricher (trace_id / span_id)
|
||||
• Prometheus always + OTLP opt-in • ILogRedactor seam (RedactionEnricher)
|
||||
│ │
|
||||
app.MapZbMetrics() → /metrics preserveStaticLogger: true (test-safe)
|
||||
```
|
||||
|
||||
Both packages share the single `ZbTelemetryOptions`. The Serilog OTLP log sink derives its Resource
|
||||
attributes from `ZbResource.BuildAttributes` (single source of truth), so logs can never drift from
|
||||
metrics and traces in a backend.
|
||||
|
||||
---
|
||||
|
||||
## 5. Sequencing & execution
|
||||
|
||||
Subagent-driven, classification-driven review chain. **Task 0 gates everything** (verify/publish the
|
||||
feed). Then three **independent** per-repo phases — each its own git repo, branch
|
||||
**`feat/adopt-zb-telemetry`**, commit per task, **never skip hooks, never force-push**:
|
||||
|
||||
1. **Task 0 (gating):** verify the two Telemetry `.nupkg`s are on the Gitea feed; pack + push if
|
||||
missing (creds-only user config, already set).
|
||||
2. **OtOpcUa:** source-mapping + package refs → `AddZbTelemetry` swap → `AddZbSerilog` swap → tests.
|
||||
3. **ScadaBridge:** source-mapping + package refs → `AddZbTelemetry` (both roots) → `AddZbSerilog`
|
||||
(replace `LoggerConfigurationFactory`) → tests.
|
||||
4. **MxAccessGateway:** source-mapping + package refs → **MEL→Serilog** (sub-tasked, `high-risk`)
|
||||
→ `AddZbTelemetry` metrics export → tests.
|
||||
5. **scadaproj bookkeeping:** add an "Adoption status — DONE" section to
|
||||
`components/observability/GAPS.md` (per-repo table + deferred items), **and correct the false
|
||||
"MxGateway logging already adopted" claim** in CLAUDE.md, the library CLAUDE.md, and
|
||||
`components/observability/README.md`.
|
||||
|
||||
The MxGateway MEL→Serilog migration is the one `high-risk` change (logging behaviour on the most
|
||||
operational app) and gets the full spec→code serial review chain. The other per-app swaps are
|
||||
`standard`.
|
||||
|
||||
---
|
||||
|
||||
## 6. Deferred (out of scope this round; recorded in GAPS)
|
||||
|
||||
| # | Item | Why deferred |
|
||||
|---|---|---|
|
||||
| #6 | MxGateway histogram `ms` → `s` | Breaking dashboard/alert change — needs ops coordination |
|
||||
| #7 | MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | Breaking Prometheus label change — needs ops coordination |
|
||||
| #9 | ScadaBridge app instruments (`ScadaBridgeTelemetry` + `scadabridge.*`) | Application-specific work, not shared-library adoption |
|
||||
| #10 | OtOpcUa OTLP exporter alongside Prometheus | Opt-in; no consumer for OTLP yet |
|
||||
| #11 | OtOpcUa trace-export no-op (spans recorded, no exporter) | Resolved by #10 / OTLP; or document |
|
||||
|
||||
None of these block the behaviour-preserving initial adoption.
|
||||
|
||||
---
|
||||
|
||||
## 7. Testing
|
||||
|
||||
All tests run **offline** — Prometheus is in-process, no OTLP collector required, and the library's
|
||||
own test suites are network-free.
|
||||
|
||||
- **OtOpcUa:** assert `/metrics` is still served, the `ZB.MOM.WW.OtOpcUa` meter is present, the
|
||||
Resource carries `service.name`, and the shared Serilog enrichers are wired.
|
||||
- **ScadaBridge:** assert `/metrics` is served in **both** roles, the logger carries
|
||||
`SiteId`/`NodeRole` enrichers, and startup is clean after `LoggerConfigurationFactory` removal.
|
||||
- **MxAccessGateway** (the careful one): assert log messages are still emitted at the same levels,
|
||||
redaction still applies, request logging still fires, `/metrics` is now served, and the
|
||||
`GetSnapshot()` path is unchanged — using the existing fake-worker test harness (no MXAccess
|
||||
needed).
|
||||
|
||||
---
|
||||
|
||||
## 8. Acceptance bar
|
||||
|
||||
- Each app builds and its test suite is green.
|
||||
- `/metrics` serves the same existing series (plus additive standard-instrumentation series); meter
|
||||
names and units unchanged.
|
||||
- Logs carry the same messages at the same levels, plus the shared identity enrichers and
|
||||
`trace_id`/`span_id` correlation.
|
||||
- No secrets committed to any repo (the Gitea token stays creds-only at the user level).
|
||||
- `components/observability/GAPS.md` updated; the false "MxGateway logging adopted" claim corrected.
|
||||
Reference in New Issue
Block a user