docs: plan for ZB.MOM.WW.Telemetry follow-ons (A additive/hygiene, B metric normalization, C ScadaBridge instruments, D OTLP opt-in)

Joseph Doherty
2026-06-01 16:32:57 -04:00
parent dee55aadc6
commit 6c2a43a238
@@ -0,0 +1,117 @@
# ZB.MOM.WW.Telemetry — Follow-ons Implementation Plan
> Continuation of [`2026-06-01-telemetry-library-adoption.md`](2026-06-01-telemetry-library-adoption.md).
> Executes the deferred follow-ons recorded in `components/observability/GAPS.md`, all four groups
> selected by the user.
**Goal:** Close the recorded telemetry follow-ons across the three apps — additive/hygiene fixes,
MxGateway metric normalization, ScadaBridge first application instruments, and OTLP opt-in.
**Branches:** new `feat/telemetry-followons` per repo (off the now-updated default). Commit per task,
never skip hooks, never force-push. The three repo phases are independent (parallel); within a repo,
sequential.
**Behaviour bar:** additive/opt-in by default (Prometheus stays the default exporter; new instruments
are new series; the MxGateway `ms`→`s` + rename are the *one* intentional metric-shape change, safe
because those series were never Prometheus-exported before the adoption).
---
## OtOpcUa (branch `feat/telemetry-followons` off `master`)
### Task O-A2: align Serilog to the 10.x line
**Classification:** small · **Files:** `Directory.Packages.props`
Bump `Serilog.AspNetCore`, `Serilog.Extensions.Hosting`, `Serilog.Settings.Configuration` from
`9.0.0`→`10.0.0` (ScadaBridge already runs `10.0.0` with `Serilog 4.x`, so 10.x is 4.x-compatible —
no Serilog 5 needed). Keep `Serilog 4.3.0` (or bump to `4.3.1` to match ScadaBridge). Restore + build
`ZB.MOM.WW.OtOpcUa.slnx`; run `--filter LogContextEnricherTests`. Commit.
### Task O-D: OTLP exporter opt-in (config-driven)
**Classification:** standard · **Parallelizable with:** O-A2 (disjoint files)
**Files:** `src/Server/.../Observability/ObservabilityExtensions.cs`, `src/Server/.../Program.cs:138`
Refactor `AddOtOpcUaObservability` to accept `IConfiguration` and read
`OtOpcUa:Telemetry:Exporter` (`Prometheus`|`Otlp`, default Prometheus) + `OtOpcUa:Telemetry:OtlpEndpoint`;
set `o.Exporter`/`o.OtlpEndpoint` accordingly. Update the call site to
`builder.Services.AddOtOpcUaObservability(builder.Configuration)`. Default (no config) stays Prometheus.
This also makes OtOpcUa's recorded spans exportable when OTLP is configured (resolves the trace no-op).
Build; run `OtOpcUaTelemetryHookTests`. Commit.
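The opt-in rule above (Prometheus unless `Exporter` is explicitly `Otlp`) can be sketched as follows. This is a minimal illustration, not the library's API: `ResolveOtlpEndpoint` is a hypothetical helper, a dictionary stands in for `IConfiguration` so the sketch needs no packages, and treating a missing or malformed endpoint as a startup error is an assumption the plan does not state.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns null for the Prometheus default,
// or the OTLP endpoint when the exporter is explicitly set to "Otlp".
Uri? ResolveOtlpEndpoint(IReadOnlyDictionary<string, string?> config, string prefix)
{
    config.TryGetValue($"{prefix}:Exporter", out var exporter);

    // Anything other than an explicit "Otlp" keeps the Prometheus default.
    if (!string.Equals(exporter, "Otlp", StringComparison.OrdinalIgnoreCase))
        return null;

    config.TryGetValue($"{prefix}:OtlpEndpoint", out var endpoint);

    // Assumption: fail fast instead of silently falling back to Prometheus.
    if (!Uri.TryCreate(endpoint, UriKind.Absolute, out var uri))
        throw new InvalidOperationException(
            $"{prefix}:OtlpEndpoint must be an absolute URI when Exporter is Otlp.");

    return uri;
}

// No configuration at all: stays on Prometheus (null endpoint).
var defaultEndpoint = ResolveOtlpEndpoint(
    new Dictionary<string, string?>(), "OtOpcUa:Telemetry");

// Explicit opt-in; the endpoint value here is only an example.
var otlpEndpoint = ResolveOtlpEndpoint(
    new Dictionary<string, string?>
    {
        ["OtOpcUa:Telemetry:Exporter"] = "Otlp",
        ["OtOpcUa:Telemetry:OtlpEndpoint"] = "http://localhost:4317",
    },
    "OtOpcUa:Telemetry");

Console.WriteLine(defaultEndpoint is null ? "Prometheus" : "Otlp");
Console.WriteLine(otlpEndpoint is null ? "Prometheus" : $"Otlp -> {otlpEndpoint}");
```

Taking the key prefix as a parameter would let the same rule serve the `MxGateway:Telemetry` and `ScadaBridge:Telemetry` sections; whether the three apps share such a helper is left open by the plan.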
---
## MxAccessGateway (branch `feat/telemetry-followons` off `main`)
### Task M-A3: gitignore stray doc artifacts
**Classification:** trivial · **Files:** `.gitignore`
Append a `# Documentation review artifacts` block ignoring `*-docs-issues.md`, `*-docs-fixed.md`,
`*-docs-final.md` (the 5 untracked `*-docs-*.md` files are CommentChecker "Documentation Analysis
Report" output). Commit. (Do NOT delete the files — just ignore.)
### Task M-B: metric normalization (`ms`→`s` + meter rename)
**Classification:** standard · **Files:** `src/.../Metrics/GatewayMetrics.cs`, test if needed
- Rename `MeterName` const `"MxGateway.Server"`→`"ZB.MOM.WW.MxGateway"`. (AddZbTelemetry uses the
const, so it follows automatically; no test asserts the literal; `GatewayMetricsTests` filter by
meter *instance*, not name.)
- Change the 3 histograms' unit `"ms"`→`"s"` (CreateHistogram lines) and their 4 record sites
`.TotalMilliseconds`→`.TotalSeconds`. The snapshot/dashboard do NOT read these histograms, so no
read-path impact. Check `GatewayMetricsTests` for any histogram-value assertion in ms and update.
Build the Server project; run `--filter "GatewayMetricsTests|GatewayApplicationTests"`. Commit.
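A minimal sketch of the unit change, assuming only `System.Diagnostics.Metrics` from the base class library. The instrument name `mxgateway.request.duration` is hypothetical (the plan does not name the three histograms), and the `MeterListener` here only stands in for a test that filters by meter instance.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Meter name is the renamed value from the plan; the instrument name is made up.
using var meter = new Meter("ZB.MOM.WW.MxGateway");
var duration = meter.CreateHistogram<double>(
    "mxgateway.request.duration", unit: "s", description: "Request duration in seconds");

// Capture measurements in-process, matching on the meter instance rather than its name.
var recorded = new List<double>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (ReferenceEquals(instrument.Meter, meter))
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<double>(
    (instrument, value, tags, state) => recorded.Add(value));
listener.Start();

var elapsed = TimeSpan.FromMilliseconds(250);

// Before the change this site would have recorded elapsed.TotalMilliseconds with unit "ms".
duration.Record(elapsed.TotalSeconds);

Console.WriteLine($"{recorded[0]} {duration.Unit}");
```

Because the unit string and the recorded value change together, any assertion that compares a recorded value against a millisecond figure has to be rescaled by 1000 in the same commit.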
### Task M-D: OTLP exporter opt-in
**Classification:** small · **Files:** `src/.../GatewayApplication.cs` (the `AddZbTelemetry` lambda)
In the `AddZbTelemetry` lambda, read `MxGateway:Telemetry:Exporter` + `MxGateway:Telemetry:OtlpEndpoint`
from `builder.Configuration` (in scope) and set `o.Exporter`/`o.OtlpEndpoint`. Default Prometheus. Build.
Commit. (Sequential after M-B — both touch GatewayApplication.cs / metrics area.)
---
## ScadaBridge (branch `feat/telemetry-followons` off `main`)
### Task S-A1: site-node HTTP/1.1 `/metrics` listener
**Classification:** standard · **Files:** `src/.../NodeOptions.cs`, `src/.../Program.cs` (Site Kestrel)
Add `MetricsPort` (default `8082`) to `NodeOptions`. In the Site block's `ConfigureKestrel`, add a
second `ListenAnyIP(metricsPort, lo => lo.Protocols = Http1AndHttp2)` alongside the existing HTTP/2-only
gRPC-port listener, so the already-mapped `/metrics` becomes scrapable over HTTP/1.1 on site nodes.
Read the port from `ScadaBridge:Node:MetricsPort` (default 8082). Build; existing Host.Tests stay green.
Commit.
### Task S-D: OTLP exporter opt-in
**Classification:** small · **Files:** `src/.../SiteServiceRegistration.cs` (the `AddZbTelemetry` lambda)
In `BindSharedOptions`, read `ScadaBridge:Telemetry:Exporter` + `ScadaBridge:Telemetry:OtlpEndpoint`
from `config` (in scope) and set `o.Exporter`/`o.OtlpEndpoint`. Default Prometheus. Build. Commit.
(Sequential after S-C0 — both edit the `AddZbTelemetry` call.)
### Task S-C0: `ScadaBridgeTelemetry` meter + registration
**Classification:** standard · **Files:** Create `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs`; edit `SiteServiceRegistration.cs` (`AddZbTelemetry` Meters)
Create a `ScadaBridgeTelemetry` static class: `Meter "ZB.MOM.WW.ScadaBridge"` + the four instruments
(`scadabridge.deployments.applied` counter; `scadabridge.store_and_forward.queue.depth` observable
gauge; `scadabridge.inbound_api.requests` counter; `scadabridge.site.connection.up` up/down gauge) with
thin static emit helpers. Register `o.Meters = ["ZB.MOM.WW.ScadaBridge"]` in the `AddZbTelemetry` call.
Build. Commit. (Precedes S-C1…S-C4.)
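The four instruments could be declared roughly as below. This is a sketch under assumptions: it uses top-level statements and local functions for brevity where the plan calls for a `ScadaBridgeTelemetry` static class, it maps the "up/down gauge" to `UpDownCounter<long>` (available from .NET 7), and the `queueDepth` variable is a stand-in for the real StoreAndForward depth accessor. The helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading;

using var meter = new Meter("ZB.MOM.WW.ScadaBridge");

long queueDepth = 0; // stand-in for the StoreAndForward depth accessor

var deploymentsApplied = meter.CreateCounter<long>("scadabridge.deployments.applied");
var inboundRequests = meter.CreateCounter<long>("scadabridge.inbound_api.requests");
var siteConnectionUp = meter.CreateUpDownCounter<long>("scadabridge.site.connection.up");

// Observable gauge: the callback is polled by the collector, nothing is pushed.
meter.CreateObservableGauge(
    "scadabridge.store_and_forward.queue.depth",
    () => Interlocked.Read(ref queueDepth));

// Thin emit helpers (hypothetical names).
void DeploymentApplied() => deploymentsApplied.Add(1);
void InboundRequest(string method) =>
    inboundRequests.Add(1, new KeyValuePair<string, object?>("method", method));
void SiteConnected() => siteConnectionUp.Add(1);
void SiteDisconnected() => siteConnectionUp.Add(-1);

// In-process listener so the sketch can show what each instrument reports.
var totals = new Dictionary<string, long>();
using var listener = new MeterListener();
listener.InstrumentPublished = (inst, l) =>
{
    if (ReferenceEquals(inst.Meter, meter)) l.EnableMeasurementEvents(inst);
};
listener.SetMeasurementEventCallback<long>((inst, value, tags, state) =>
    totals[inst.Name] = inst.IsObservable
        ? value                                        // gauges report the current value
        : totals.GetValueOrDefault(inst.Name) + value); // counters report deltas
listener.Start();

DeploymentApplied();
InboundRequest("GET");
SiteConnected();
SiteConnected();
SiteDisconnected();
Interlocked.Exchange(ref queueDepth, 7);
listener.RecordObservableInstruments(); // polls the observable gauge once

foreach (var (name, value) in totals)
    Console.WriteLine($"{name} = {value}");
```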
### Tasks S-C1…S-C4: wire the four emit points
**Classification:** standard each · depend on S-C0
- **S-C1 `deployments.applied`** — increment on the DeploymentManager/DeploymentService success path.
- **S-C2 `store_and_forward.queue.depth`** — observable-gauge callback reading the StoreAndForward depth
(SQLite `COUNT`/existing depth accessor).
- **S-C3 `inbound_api.requests`** — increment (tag = method) in the InboundAPI endpoint filter/middleware.
- **S-C4 `site.connection.up`** — +1 on site-stream open, -1 on close in the Communication/SiteStream
gRPC server.
Each implementer finds the cleanest emit point and **STOPs + reports** if no clean point exists rather
than forcing a fragile edit. Add a focused test where practical. Build; commit per instrument.
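For S-C4, one way to keep the up/down value from drifting is to pair the increment and decrement with `try`/`finally`, so a faulted stream still counts as closed. The sketch below assumes that shape. `HandleSiteStreamAsync` and its `pumpMessages` delegate are hypothetical stand-ins for the real gRPC stream handler, which the plan leaves to the implementer to locate.

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading.Tasks;

using var meter = new Meter("ZB.MOM.WW.ScadaBridge");
var siteConnectionUp = meter.CreateUpDownCounter<long>("scadabridge.site.connection.up");

// Track the net value in-process so the sketch can show it returns to zero.
long net = 0;
using var listener = new MeterListener();
listener.InstrumentPublished = (inst, l) =>
{
    if (ReferenceEquals(inst.Meter, meter)) l.EnableMeasurementEvents(inst);
};
listener.SetMeasurementEventCallback<long>((inst, value, tags, state) => net += value);
listener.Start();

// Hypothetical stream handler: +1 on open, -1 on close, even when the stream faults.
async Task HandleSiteStreamAsync(Func<Task> pumpMessages)
{
    siteConnectionUp.Add(1);
    try
    {
        await pumpMessages();
    }
    finally
    {
        siteConnectionUp.Add(-1);
    }
}

// A stream that completes normally.
await HandleSiteStreamAsync(() => Task.CompletedTask);

// A stream that faults: the finally block still records the close.
try
{
    await HandleSiteStreamAsync(() => throw new InvalidOperationException("stream faulted"));
}
catch (InvalidOperationException)
{
    // Expected in this sketch; the real handler would log and propagate.
}

Console.WriteLine($"net connections = {net}");
```

If the real handler has no single scope that brackets the stream lifetime, this pairing does not apply cleanly, which is the situation the STOP-and-report rule above covers.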
---
## scadaproj bookkeeping
### Task Z: update GAPS.md
**Classification:** trivial · **Files:** `components/observability/GAPS.md`
Move the handled follow-ons (#6/#7 done; A1 site-listener done; #9 first instruments done; #10/#11 OTLP
opt-in done) from "Deferred" to a "Follow-ons — DONE 2026-06-01" subsection; note what each app now does.
Commit + (on user request) push all branches/merges.
---
## Sequencing
After each repo branch is cut: OtOpcUa {O-A2 ∥ O-D}; MxGateway {M-A3 → M-B → M-D}; ScadaBridge
{S-A1 ∥ (S-C0 → {S-C1, S-C2, S-C3, S-C4} → S-D)}. Repos run in parallel. Z + merge/push last.