# ZB.MOM.WW.Telemetry — Follow-ons Implementation Plan
> Continuation of [`2026-06-01-telemetry-library-adoption.md`](2026-06-01-telemetry-library-adoption.md).
> Executes the deferred follow-ons recorded in `components/observability/GAPS.md`, all four groups
> selected by the user.

**Goal:** Close the recorded telemetry follow-ons across the three apps: additive/hygiene fixes,
MxGateway metric normalization, ScadaBridge first application instruments, and OTLP opt-in.

**Branches:** new `feat/telemetry-followons` per repo (off the now-updated default). Commit per task,
never skip hooks, never force-push. The three repo phases are independent (parallel); within a repo,
tasks run sequentially unless marked parallelizable (see Sequencing).

**Behaviour bar:** additive/opt-in by default (Prometheus stays the default exporter; new instruments
are new series; the MxGateway `ms`→`s` unit change + meter rename are the *one* intentional
metric-shape change, safe because those series were never Prometheus-exported before the adoption).

---
## OtOpcUa (branch `feat/telemetry-followons` off `master`)
### Task O-A2: align Serilog to the 10.x line
**Classification:** small · **Files:** `Directory.Packages.props`
Bump `Serilog.AspNetCore`, `Serilog.Extensions.Hosting`, `Serilog.Settings.Configuration` from
`9.0.0`→`10.0.0` (ScadaBridge already runs `10.0.0` with `Serilog 4.x`, so 10.x is 4.x-compatible;
no Serilog 5 needed). Keep `Serilog 4.3.0` (or bump to `4.3.1` to match ScadaBridge). Restore + build
`ZB.MOM.WW.OtOpcUa.slnx`; run `--filter LogContextEnricherTests`. Commit.
### Task O-D: OTLP exporter opt-in (config-driven)
**Classification:** standard · **Parallelizable with:** O-A2 (disjoint files)
**Files:** `src/Server/.../Observability/ObservabilityExtensions.cs`, `src/Server/.../Program.cs:138`
Refactor `AddOtOpcUaObservability` to accept `IConfiguration` and read
`OtOpcUa:Telemetry:Exporter` (`Prometheus`|`Otlp`, default Prometheus) + `OtOpcUa:Telemetry:OtlpEndpoint`;
set `o.Exporter`/`o.OtlpEndpoint` accordingly. Update the call site to
`builder.Services.AddOtOpcUaObservability(builder.Configuration)`. Default (no config) stays Prometheus.
This also makes OtOpcUa's recorded spans exportable when OTLP is configured (resolves the trace no-op).
Build; run `OtOpcUaTelemetryHookTests`. Commit.
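A minimal sketch of the config-driven exporter selection described above. The `TelemetryExporter` enum, `TelemetryOptionsSketch` and `TelemetryConfigSketch.Apply` are hypothetical stand-ins, since the real options type belongs to ZB.MOM.WW.Telemetry and may differ; only the two configuration keys come from this plan. A missing or unrecognised `Exporter` value leaves the Prometheus default in place, and calling `Apply(config, "OtOpcUa", o)` inside the options lambda is one possible way to wire it.

```csharp
using System;
using Microsoft.Extensions.Configuration;

// Hypothetical stand-ins: the real options type comes from ZB.MOM.WW.Telemetry
// and may use different names or types.
public enum TelemetryExporter { Prometheus, Otlp }

public sealed class TelemetryOptionsSketch
{
    public TelemetryExporter Exporter { get; set; } = TelemetryExporter.Prometheus;
    public string? OtlpEndpoint { get; set; }
}

public static class TelemetryConfigSketch
{
    // Reads "<prefix>:Telemetry:Exporter" and "<prefix>:Telemetry:OtlpEndpoint".
    // A missing or unrecognised exporter value keeps the Prometheus default.
    public static void Apply(IConfiguration config, string prefix, TelemetryOptionsSketch o)
    {
        var raw = config[$"{prefix}:Telemetry:Exporter"];
        if (Enum.TryParse<TelemetryExporter>(raw, ignoreCase: true, out var exporter)
            && Enum.IsDefined(exporter))
        {
            o.Exporter = exporter;
        }

        var endpoint = config[$"{prefix}:Telemetry:OtlpEndpoint"];
        if (!string.IsNullOrWhiteSpace(endpoint))
        {
            o.OtlpEndpoint = endpoint;
        }
    }
}
```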

---
## MxAccessGateway (branch `feat/telemetry-followons` off `main`)
### Task M-A3: gitignore stray doc artifacts
**Classification:** trivial · **Files:** `.gitignore`
Append a `# Documentation review artifacts` block ignoring `*-docs-issues.md`, `*-docs-fixed.md`,
`*-docs-final.md` (the 5 untracked `*-docs-*.md` files are CommentChecker "Documentation Analysis
Report" output). Commit. (Do NOT delete the files — just ignore.)
### Task M-B: metric normalization (`ms`→`s` + meter rename)
**Classification:** standard · **Files:** `src/.../Metrics/GatewayMetrics.cs`, test if needed
- Rename `MeterName` const `"MxGateway.Server"`→`"ZB.MOM.WW.MxGateway"`. (`AddZbTelemetry` uses the
const, so it follows automatically; no test asserts the literal; `GatewayMetricsTests` filters by
meter *instance*, not name.)
- Change the 3 histograms' unit `"ms"`→`"s"` (`CreateHistogram` lines) and their 4 record sites
`.TotalMilliseconds`→`.TotalSeconds`. The snapshot/dashboard do NOT read these histograms, so no
read-path impact. Check `GatewayMetricsTests` for any histogram-value assertion in ms and update.
Build the Server project; run `--filter "GatewayMetricsTests|GatewayApplicationTests"`. Commit.
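For orientation, the shape of the unit change might look like the sketch below. It uses the standard `System.Diagnostics.Metrics` API; the class name `GatewayMetricsSketch` and the instrument name `mxgateway.request.duration` are hypothetical, since the three real histograms are defined in `GatewayMetrics.cs`. Under this sketch a 250 ms request records `0.25`.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class GatewayMetricsSketch
{
    // Renamed from "MxGateway.Server".
    public const string MeterName = "ZB.MOM.WW.MxGateway";

    private static readonly Meter GatewayMeter = new(MeterName);

    // Hypothetical instrument name; the unit is now seconds instead of "ms".
    private static readonly Histogram<double> RequestDuration =
        GatewayMeter.CreateHistogram<double>("mxgateway.request.duration", unit: "s");

    // Record site: TotalSeconds replaces TotalMilliseconds so the value matches the unit.
    public static void RecordRequest(TimeSpan elapsed) =>
        RequestDuration.Record(elapsed.TotalSeconds);
}
```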
### Task M-D: OTLP exporter opt-in
**Classification:** small · **Files:** `src/.../GatewayApplication.cs` (the `AddZbTelemetry` lambda)
In the `AddZbTelemetry` lambda, read `MxGateway:Telemetry:Exporter` + `MxGateway:Telemetry:OtlpEndpoint`
from `builder.Configuration` (in scope) and set `o.Exporter`/`o.OtlpEndpoint`. Default Prometheus. Build.
Commit. (Sequential after M-B: both touch the `GatewayApplication.cs` / metrics area.)

---
## ScadaBridge (branch `feat/telemetry-followons` off `main`)
### Task S-A1: site-node HTTP/1.1 `/metrics` listener
**Classification:** standard · **Files:** `src/.../NodeOptions.cs`, `src/.../Program.cs` (Site Kestrel)
Add `MetricsPort` (default `8082`) to `NodeOptions`. In the Site block's `ConfigureKestrel`, add a
second `ListenAnyIP(metricsPort, lo => lo.Protocols = Http1AndHttp2)` alongside the existing HTTP/2-only
gRPC-port listener, so the already-mapped `/metrics` becomes scrapable over HTTP/1.1 on site nodes.
Read the port from `ScadaBridge:Node:MetricsPort` (default 8082). Build; existing Host.Tests stay green.
Commit.
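A sketch of the two-listener Kestrel setup, assuming the options are applied through a helper like the one below. `NodeOptionsSketch`, its `GrpcPort` property and `SiteKestrelSketch.Configure` are hypothetical names; only `MetricsPort` and its `8082` default come from this plan. On a cleartext endpoint Kestrel negotiates HTTP/2 only through TLS ALPN, so the `Http1AndHttp2` listener should serve scrapers over HTTP/1.1.

```csharp
using Microsoft.AspNetCore.Server.Kestrel.Core;

public sealed class NodeOptionsSketch
{
    // Hypothetical name for the existing gRPC port setting.
    public int GrpcPort { get; set; }

    // New setting, bound from ScadaBridge:Node:MetricsPort.
    public int MetricsPort { get; set; } = 8082;
}

public static class SiteKestrelSketch
{
    public static void Configure(KestrelServerOptions kestrel, NodeOptionsSketch node)
    {
        // Existing listener: HTTP/2-only, used by gRPC.
        kestrel.ListenAnyIP(node.GrpcPort, lo => lo.Protocols = HttpProtocols.Http2);

        // New listener: also accepts HTTP/1.1 so a Prometheus scraper can reach /metrics.
        kestrel.ListenAnyIP(node.MetricsPort, lo => lo.Protocols = HttpProtocols.Http1AndHttp2);
    }
}
```

It could be called from the Site block as `ConfigureKestrel(k => SiteKestrelSketch.Configure(k, nodeOptions))`, where `nodeOptions` is whatever instance the host already binds.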
### Task S-D: OTLP exporter opt-in
**Classification:** small · **Files:** `src/.../SiteServiceRegistration.cs` (the `AddZbTelemetry` lambda)
In `BindSharedOptions`, read `ScadaBridge:Telemetry:Exporter` + `ScadaBridge:Telemetry:OtlpEndpoint`
from `config` (in scope) and set `o.Exporter`/`o.OtlpEndpoint`. Default Prometheus. Build. Commit.
(Sequential after S-C0 — both edit the `AddZbTelemetry` call.)
### Task S-C0: `ScadaBridgeTelemetry` meter + registration
**Classification:** standard · **Files:** Create `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs`; edit `SiteServiceRegistration.cs` (`AddZbTelemetry` Meters)
Create a `ScadaBridgeTelemetry` static class: `Meter "ZB.MOM.WW.ScadaBridge"` + the four instruments
(`scadabridge.deployments.applied` counter; `scadabridge.store_and_forward.queue.depth` observable
gauge; `scadabridge.inbound_api.requests` counter; `scadabridge.site.connection.up` up/down gauge) with
thin static emit helpers. Register `o.Meters = ["ZB.MOM.WW.ScadaBridge"]` in the `AddZbTelemetry` call.
Build. Commit. (Precedes S-C1…S-C4.)
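One possible shape for the class, sketched with the standard `System.Diagnostics.Metrics` API. The meter and instrument names are the ones listed above; the helper method names, the `method` tag key and the swappable queue-depth provider are assumptions. The "up/down gauge" is modelled here as an `UpDownCounter<long>`, which requires .NET 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading;

public static class ScadaBridgeTelemetry
{
    public const string MeterName = "ZB.MOM.WW.ScadaBridge";

    private static readonly Meter BridgeMeter = new(MeterName);

    private static readonly Counter<long> DeploymentsApplied =
        BridgeMeter.CreateCounter<long>("scadabridge.deployments.applied");

    private static readonly Counter<long> InboundApiRequests =
        BridgeMeter.CreateCounter<long>("scadabridge.inbound_api.requests");

    private static readonly UpDownCounter<long> SiteConnectionUp =
        BridgeMeter.CreateUpDownCounter<long>("scadabridge.site.connection.up");

    // Assumption: the store-and-forward component swaps in a real depth reader (S-C2).
    // Until then the gauge reports 0.
    private static Func<long> _queueDepth = () => 0;

    private static readonly ObservableGauge<long> QueueDepth =
        BridgeMeter.CreateObservableGauge<long>(
            "scadabridge.store_and_forward.queue.depth",
            () => Volatile.Read(ref _queueDepth)());

    // Thin static emit helpers.
    public static void DeploymentApplied() => DeploymentsApplied.Add(1);

    public static void InboundApiRequest(string method) =>
        InboundApiRequests.Add(1, new KeyValuePair<string, object?>("method", method));

    public static void SiteStreamOpened() => SiteConnectionUp.Add(1);

    public static void SiteStreamClosed() => SiteConnectionUp.Add(-1);

    public static void SetQueueDepthProvider(Func<long> provider) =>
        Volatile.Write(ref _queueDepth, provider);
}
```

Keeping the observable gauge behind a provider delegate lets the Commons assembly own the instrument without referencing the StoreAndForward project directly; whether that dependency direction matters here is for the implementer to judge.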
### Tasks S-C1…S-C4: wire the four emit points
**Classification:** standard each · depend on S-C0
- **S-C1 `deployments.applied`**: increment on the DeploymentManager/DeploymentService success path.
- **S-C2 `store_and_forward.queue.depth`**: observable-gauge callback reading the StoreAndForward depth
(SQLite `COUNT`/existing depth accessor).
- **S-C3 `inbound_api.requests`**: increment (tag = method) in the InboundAPI endpoint filter/middleware.
- **S-C4 `site.connection.up`**: +1 on site-stream open, -1 on close in the Communication/SiteStream
gRPC server.
Each implementer finds the cleanest emit point and **STOPs + reports** if no clean point exists rather
than forcing a fragile edit. Add a focused test where practical. Build; commit per instrument.
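For the focused tests, one option is to observe the instruments in-process with a `MeterListener` rather than through an exporter. The `MeasurementProbe` helper below is hypothetical and sums `long` measurements per instrument name, which suits the counters and the up/down instrument (for an observable gauge a last-value probe would fit better). A test would construct it with the meter name before exercising the code path, then assert on `Total("scadabridge.deployments.applied")` or similar.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical test helper: sums long-valued measurements from one meter,
// keyed by instrument name.
public sealed class MeasurementProbe : IDisposable
{
    private readonly MeterListener _listener = new();
    private readonly object _gate = new();
    private readonly Dictionary<string, long> _totals = new();

    public MeasurementProbe(string meterName)
    {
        // Subscribe only to instruments published by the named meter.
        _listener.InstrumentPublished = (instrument, listener) =>
        {
            if (instrument.Meter.Name == meterName)
            {
                listener.EnableMeasurementEvents(instrument);
            }
        };

        _listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            lock (_gate)
            {
                _totals.TryGetValue(instrument.Name, out var current);
                _totals[instrument.Name] = current + value;
            }
        });

        _listener.Start();
    }

    // Returns 0 for an instrument that has not recorded anything yet.
    public long Total(string instrumentName)
    {
        lock (_gate)
        {
            return _totals.TryGetValue(instrumentName, out var total) ? total : 0;
        }
    }

    public void Dispose() => _listener.Dispose();
}
```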

---
## scadaproj bookkeeping
### Task Z: update GAPS.md
**Classification:** trivial · **Files:** `components/observability/GAPS.md`
Move the handled follow-ons (#6/#7 done; A1 site-listener done; #9 first instruments done; #10/#11 OTLP
opt-in done) from "Deferred" to a "Follow-ons — DONE 2026-06-01" subsection; note what each app now does.
Commit + (on user request) push all branches/merges.

---
## Sequencing
After each repo branch is cut: OtOpcUa {O-A2 ∥ O-D}; MxGateway {M-A3 → M-B → M-D}; ScadaBridge
{S-A1 ∥ (S-C0 → {S-C1, S-C2, S-C3, S-C4} → S-D)}. Repos run in parallel. Z + merge/push last.