scadaproj/docs/plans/2026-06-01-health-observability-components-design.md
Joseph Doherty 29b309c6c1 docs: design for health + observability normalization components
Adds the approved brainstorm design for the next two component-normalization
entries (Health #1, Observability #2 from upcoming.md):

- components/health/ -> ZB.MOM.WW.Health (3 dependency-split packages)
- components/observability/ -> ZB.MOM.WW.Telemetry (2 packages, 3 OTel signals
  + shared Serilog bootstrap)

Scope: normalization docs + build both libraries (.NET 10, tested, packed);
one sister-repo touch (MxGateway MEL->Serilog migration); no other app adoption.
Unifying hinge: one identity triple (service.name/site.id/node.role) feeds both
the OTel Resource and the Serilog enrichers.
2026-06-01 06:08:51 -04:00

# Design — Health & Observability normalization components + shared libraries
Date: 2026-06-01

Status: **Approved design** (brainstorm output). Implementation plans follow separately
(one per library) via the writing-plans workflow.

This design adds the next two entries to the [component-normalization](../../components/README.md)
program, following the exact arc already used for **Auth** (`ZB.MOM.WW.Auth`) and **UI-Theme**
(`ZB.MOM.WW.Theme`): normalize the concern in `components/`, then build the shared library in this
repo. The two concerns are the top-ranked candidates in [`upcoming.md`](../../upcoming.md) (Health #1,
Observability #2 — the "operability cluster").
## Scope decisions (locked during brainstorm)
1. **Deliverable depth** — normalization docs **+ build both shared libraries** (.NET 10, tested,
`dotnet pack`). *Not* a docs-only pass.
2. **Structure** — two separate components → two separate libraries (one component = one library,
per house precedent): `components/health/` → `ZB.MOM.WW.Health`; `components/observability/` →
`ZB.MOM.WW.Telemetry`. A future `ZB.MOM.WW.Hosting` aggregator can bundle both behind one call.
3. **Telemetry reach** — all three OpenTelemetry signals (metrics + traces + logs), including a shared
Serilog bootstrap, enrichers, and trace↔log correlation.
4. **Sister-repo touch** — exactly one: migrate **MxAccessGateway** off `Microsoft.Extensions.Logging`
onto the shared Serilog bootstrap. No other app adoption — wiring Health/Telemetry into the three
apps stays a future `GAPS.md` item, identical to where Auth and UI-Theme sit today.
5. **Packaging** — dependency-split packages (mirrors Auth's 4-package split and the
`AspNetCore.HealthChecks.*` ecosystem). Heavy probes live in opt-in satellites so MxGateway never
transitively pulls Akka or EF.
6. **Current-state docs** — full code-verified depth with `file:line` refs, per
`components/README.md`'s mandate (matching auth's current-state docs).
## The unifying hinge
A single identity triple — `service.name` / `site.id` / `node.role` (+ host) — populates **both** the
OpenTelemetry `Resource` **and** the Serilog enrichers. A metric, a span, and a log line from the same
node therefore carry identical dimensions and join up in a backend. This symmetry is the reason
Health and Telemetry are designed together even though they ship as separate libraries.
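A minimal sketch of that hinge, assuming a hypothetical `NodeIdentity` record (the design names the triple but no concrete type). One value produces both the Resource attribute set and the log-enricher property set, so the two cannot drift apart. The log property names follow ScadaBridge's existing enrichers.

```csharp
using System.Collections.Generic;

// Hypothetical type for illustration only; not part of the approved API surface.
public sealed record NodeIdentity(string ServiceName, string SiteId, string NodeRole, string HostName)
{
    // Attributes destined for the OpenTelemetry Resource.
    public IReadOnlyDictionary<string, object> ToResourceAttributes() =>
        new Dictionary<string, object>
        {
            ["service.name"] = ServiceName,
            ["site.id"] = SiteId,
            ["node.role"] = NodeRole,
            ["host.name"] = HostName,
        };

    // Properties destined for the Serilog enrichers (ScadaBridge's property names).
    public IReadOnlyDictionary<string, object> ToLogProperties() =>
        new Dictionary<string, object>
        {
            ["SiteId"] = SiteId,
            ["NodeRole"] = NodeRole,
            ["NodeHostname"] = HostName,
        };
}
```

Because both projections read the same fields, a metric labelled `site.id=site-01` and a log event with `SiteId=site-01` are guaranteed to come from the same configuration value.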
## Repo layout
```
scadaproj/
├─ components/
│  ├─ health/                        NEW normalization component (docs)
│  │  ├─ README.md
│  │  ├─ spec/SPEC.md
│  │  ├─ shared-contract/ZB.MOM.WW.Health.md
│  │  ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│  │  └─ GAPS.md
│  └─ observability/                 NEW normalization component (docs)
│     ├─ README.md
│     ├─ spec/SPEC.md
│     ├─ spec/METRIC-CONVENTIONS.md  (mirrors auth CANONICAL-ROLES / theme DESIGN-TOKENS)
│     ├─ shared-contract/ZB.MOM.WW.Telemetry.md
│     ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│     └─ GAPS.md
├─ ZB.MOM.WW.Health/                 NEW built library (nested git repo, .NET 10) → 3 nupkgs @ 0.1.0
├─ ZB.MOM.WW.Telemetry/              NEW built library (nested git repo, .NET 10) → 2 nupkgs @ 0.1.0
└─ docs/plans/
   ├─ 2026-06-01-zb-mom-ww-health-shared-library.md     (impl plan, from writing-plans)
   └─ 2026-06-01-zb-mom-ww-telemetry-shared-library.md  (impl plan, from writing-plans)
```
Index updates (same discipline as the prior two components): add both rows to
`components/README.md`, the `CLAUDE.md` Component-normalization table, and check off Health +
Observability in `upcoming.md`.
## Code-verified current state (2026-06-01 scan)
### Health
| | OtOpcUa | ScadaBridge | MxGateway |
|---|---|---|---|
| Endpoints | `/health/ready`, `/health/active`, `/healthz` | `/health/ready`, `/health/active` (no `/healthz`) | `/health/live` only (custom `GatewayHealthReply`) |
| Probes | Database, AkkaCluster, AdminRoleLeader | Database, AkkaCluster, ActiveNode | **none** (`AddHealthChecks()` called but unused) |
| Tagging | tags on the check | named + predicate, `HealthChecks.UI.Client` JSON | — |
| Extra | — | `IActiveNodeGate` route gate + `HealthMonitoring/` domain pipeline | net48 x86 worker has no endpoint |

OtOpcUa and ScadaBridge both descend from the same "ScadaLink three-tier pattern" (OtOpcUa's
`HealthEndpoints.cs:13` says so), but the Akka/leader probe logic and the DB probe technique already
differ. MxGateway is **not** Akka-based, which is a hard dependency-hygiene constraint.

Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/{HealthEndpoints,DatabaseHealthCheck,AkkaClusterHealthCheck,AdminRoleLeaderHealthCheck}.cs`;
ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/Health/{DatabaseHealthCheck,AkkaClusterHealthCheck,ActiveNodeHealthCheck,ActiveNodeGate}.cs` + `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`;
MxGateway `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:61,139-145`.
### Telemetry
| | OtOpcUa | MxGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK | full (`WithMetrics`+`WithTracing`) | **none** (hand-rolled `System.Diagnostics.Metrics`, no export) | **none** (`OpenTelemetry.Api` is a dangling CVE-patch ref) |
| Exporter | Prometheus `/metrics` | in-memory snapshot only (`GetSnapshot()`) | — |
| Meter | `ZB.MOM.WW.OtOpcUa` | `MxGateway.Server` (13 ctr / 3 hist `ms` / 4 gauge) | — |
| Tracing | ActivitySource (2 spans) | none | none |
| Resource / `service.name` | **none anywhere** | none | none |

Nobody sets a resource/`service.name`, so the fleet is indistinguishable in a collector. Duration
units are split between `s` (OtOpcUa, OTel-correct) and `ms` (MxGateway).

Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs` +
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`;
MxGateway `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`;
ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31` (Api only, zero usage).
### Logging
Serilog in OtOpcUa (`Program.cs:49`) + ScadaBridge (`LoggerConfigurationFactory.cs:28-126`,
enrichers `SiteId`/`NodeRole`/`NodeHostname`); MEL in MxGateway (`appsettings.json`, correlation via
`GatewayLogScope`/`BeginScope` middleware + `GatewayLogRedactor`). ScadaBridge's enricher set is the
cleanest and its property names match the Resource attributes Telemetry needs. Nobody enriches logs
with `trace_id`/`span_id`.
## Library design — `ZB.MOM.WW.Health` (3 packages)
**`ZB.MOM.WW.Health`** (core; deps: `Microsoft.Extensions.Diagnostics.HealthChecks` + ASP.NET Core abstractions)
- Tier convention: canonical tags `ready` / `active` / `live`; `app.MapZbHealth()` maps all three —
`/health/ready` (tag `ready` → can this node serve?), `/health/active` (tag `active` → is this the
leader/active node?), `/healthz` (predicate `_ => false` → bare process liveness).
- Canonical JSON response writer (lifts ScadaBridge's `HealthChecks.UI.Client` style to the default).
- `IActiveNodeGate` seam (generalized from ScadaBridge's `ActiveNodeGate`) + `MapZbHealth` integration.
- `GrpcDependencyHealthCheck` — "is my downstream gRPC dependency reachable" (MxGateway → worker;
OtOpcUa → gateway channel).
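The tier convention above can be sketched with the stock ASP.NET Core health-check endpoint API. This is an illustration under assumptions, not the library's final shape: it omits the JSON response writer and the `IActiveNodeGate` integration, and the class name is hypothetical.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;

// Hypothetical sketch of the three-tier mapping; requires services.AddHealthChecks().
public static class ZbHealthEndpointSketch
{
    public static IEndpointRouteBuilder MapZbHealth(this IEndpointRouteBuilder endpoints)
    {
        // Readiness: only checks tagged "ready" are evaluated.
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("ready"),
        });

        // Active/leader tier: only checks tagged "active" are evaluated.
        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("active"),
        });

        // Liveness: the predicate matches no check, so the report is Healthy
        // whenever the process is able to answer the request at all.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        });

        return endpoints;
    }
}
```

An app opts a probe into a tier purely by tagging it at registration time, which is what lets MxGateway use the core package without referencing any Akka or EF probe.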
**`ZB.MOM.WW.Health.Akka`** (dep: Akka.Cluster)
- `AkkaClusterHealthCheck` with a configurable status policy. Default = ScadaBridge's
(`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy); OtOpcUa's (self-`Up`-among-
members → Healthy/Degraded) ships as a preset.
- `ActiveNodeHealthCheck` with an optional role filter — role-less default gives ScadaBridge's
`ActiveNode` (Up && leader); passing a role gives OtOpcUa's `AdminRoleLeader` behavior.
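The default status policy and the role filter reduce to two small pure functions. The sketch below deliberately avoids the Akka dependency: `NodeStatus` and `Verdict` are hypothetical stand-ins for Akka's member status and the ASP.NET `HealthStatus`, and `isLeaderFor` is an assumed callback, so that the mapping itself can be table-tested.

```csharp
#nullable enable
using System;

// Hypothetical stand-ins so the policy can be shown without referencing Akka.Cluster.
public enum NodeStatus { Joining, WeaklyUp, Up, Leaving, Exiting, Down, Removed }
public enum Verdict { Healthy, Degraded, Unhealthy }

public static class ClusterStatusPolicySketch
{
    // Default policy as described for ScadaBridge:
    // Up/Joining = Healthy, Leaving/Exiting = Degraded, anything else = Unhealthy.
    public static readonly Func<NodeStatus, Verdict> ScadaBridgeDefault = status => status switch
    {
        NodeStatus.Up or NodeStatus.Joining => Verdict.Healthy,
        NodeStatus.Leaving or NodeStatus.Exiting => Verdict.Degraded,
        _ => Verdict.Unhealthy,
    };

    // Active-node rule: the node must be Up and leader. With role == null the
    // callback is asked about cluster leadership; with a role it is asked about
    // leadership within that role (the AdminRoleLeader style).
    public static bool IsActiveNode(NodeStatus self, Func<string?, bool> isLeaderFor, string? role = null)
        => self == NodeStatus.Up && isLeaderFor(role);
}
```

Keeping the policy a plain delegate is one way to make "configurable status policy" concrete: OtOpcUa's preset would simply be a second `Func<NodeStatus, Verdict>` (or a richer input type) shipped alongside the default.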
**`ZB.MOM.WW.Health.EntityFrameworkCore`** (dep: EF Core)
- `DatabaseHealthCheck<TContext>` — default probe `CanConnectAsync()` (ScadaBridge), optional
probe-query delegate for OtOpcUa's "query `Deployments`" style.
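A hedged sketch of what such a check could look like, assuming the optional probe is passed as a constructor delegate (the design does not fix how the delegate is supplied). It uses only `IHealthCheck` and EF Core's `Database.CanConnectAsync`.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Illustrative only; the constructor shape is an assumption.
public sealed class DatabaseHealthCheck<TContext> : IHealthCheck where TContext : DbContext
{
    private readonly TContext _db;
    private readonly Func<TContext, CancellationToken, Task<bool>>? _probe;

    public DatabaseHealthCheck(TContext db, Func<TContext, CancellationToken, Task<bool>>? probe = null)
    {
        _db = db;
        _probe = probe;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Default probe: can a connection be opened? Otherwise run the custom query.
            var ok = _probe is null
                ? await _db.Database.CanConnectAsync(cancellationToken)
                : await _probe(_db, cancellationToken);

            return ok
                ? HealthCheckResult.Healthy()
                : new HealthCheckResult(context.Registration.FailureStatus, "Database probe returned false.");
        }
        catch (Exception ex)
        {
            // A throwing probe is reported with the registration's failure status.
            return new HealthCheckResult(context.Registration.FailureStatus, "Database probe threw.", ex);
        }
    }
}
```

An OtOpcUa-style probe would pass a delegate that runs a cheap query against its `Deployments` set instead of relying on connectivity alone.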
**Stays per-project:** which probes each app registers; orchestrator/Traefik wiring; ScadaBridge's
`HealthMonitoring/` domain aggregation pipeline (distributed domain health, not an ASP.NET probe).
**Consumer matrix:** MxGateway → core only (+ gRPC-dep probe; no Akka/EF); OtOpcUa & ScadaBridge → all three.
## Library design — `ZB.MOM.WW.Telemetry` (2 packages)
**`ZB.MOM.WW.Telemetry`** (OTel metrics + traces; deps: OpenTelemetry SDK + hosting/exporter)
- `builder.AddZbTelemetry(options)` — the missing front door:
```csharp
builder.AddZbTelemetry(o => {
    o.ServiceName = "mxgateway";        // → Resource service.name
    o.ServiceNamespace = "ZB.MOM.WW";   // constant across the fleet
    o.SiteId = cfg.SiteId;              // → resource attr site.id
    o.NodeRole = cfg.NodeRole;          // → resource attr node.role
    o.Meters = ["MxGateway.Server"];    // app's own Meter name(s)
    o.ActivitySources = [...];          // app's own ActivitySource name(s)
    o.Exporter = Prometheus;            // default; OTLP opt-in
});
app.MapZbMetrics(); // Prometheus /metrics
```
- Shared `Resource`: `service.name` + `service.namespace` + `service.version` + `site.id` +
`node.role` + `host.name`. **The headline fix** — nobody sets this today.
- Standard instrumentation everyone should have (only OtOpcUa has it now): ASP.NET Core, gRPC client,
HttpClient, runtime + process meters.
- Exporter: Prometheus `/metrics` default; **OTLP opt-in** via options (path to a real collector).
- App instruments stay per-project. MxGateway's hand-rolled `GatewayMetrics` keeps its 13/3/4
instruments but its `Meter` is registered through `AddZbTelemetry` so it finally **exports** instead
of dying in an in-memory snapshot.
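Why registering the `Meter` name matters can be shown with the base class library alone. In the sketch below a `MeterListener` stands in for the OTel SDK (an assumption for illustration; the real pipeline subscribes through the SDK's meter provider), and `demo.requests` is a hypothetical instrument. Measurements are only observed when something subscribes to the meter by name.

```csharp
using System.Diagnostics.Metrics;

public static class MeterSubscriptionSketch
{
    // Returns the sum of measurements seen by a listener subscribed to one meter name.
    public static long CountMeasurements(string subscribedMeterName)
    {
        using var meter = new Meter("MxGateway.Server");
        var requests = meter.CreateCounter<long>("demo.requests"); // hypothetical instrument

        long seen = 0;
        using var listener = new MeterListener();

        // Subscribe only to instruments whose Meter name matches, which is the
        // role the Meters option plays in the design.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == subscribedMeterName)
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) => seen += value);
        listener.Start();

        requests.Add(1);
        requests.Add(2);
        return seen;
    }
}
```

Subscribing to `"MxGateway.Server"` observes both increments, while subscribing to any other name observes nothing: the instruments keep working, but their values never leave the process.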
**`ZB.MOM.WW.Telemetry.Serilog`** (logs signal + Serilog convergence; deps: Serilog + the core package)
- `AddZbSerilog()` — shared two-stage bootstrap generalizing ScadaBridge's `LoggerConfigurationFactory`
(`ReadFrom.Configuration` for sinks + explicit `MinimumLevel.Is` override).
- Shared enrichers `SiteId` / `NodeRole` / `NodeHostname`, **bound from the same options object as the
OTel Resource** so logs and metrics carry identical dimensions.
- **NEW `TraceContextEnricher`** — stamps `trace_id`/`span_id` from `Activity.Current` onto every log
event (makes a log line clickable from a trace; nobody has this today).
- OTel log export — logs flow through the OTel pipeline with the same Resource (all three signals
correlated in a backend).
- `ILogRedactor` seam — generalized from MxGateway's `GatewayLogRedactor` (the only app with real
secret redaction). The seam is shared; the policy (which fields/commands) stays per-project.
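A minimal sketch of the `TraceContextEnricher` idea, assuming the standard Serilog `ILogEventEnricher` contract and W3C-format activities (the .NET default). The body is an illustration, not the library's implementation.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

public sealed class TraceContextEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        // No ambient Activity: add nothing, so events outside a trace stay clean.
        var activity = Activity.Current;
        if (activity is null) return;

        // Lowercase hex ids, matching what trace backends display.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```

It would be attached with `Enrich.With(new TraceContextEnricher())` on the logger configuration. `AddPropertyIfAbsent` leaves any explicitly logged `trace_id` untouched.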
**Convergence the spec pins down:** Meter name = `<app>` namespace; instrument name =
`<app>.<subsystem>.<event>`; duration unit = **seconds** (OTel semconv) — flags MxGateway's `ms`
histograms as a convergence item.
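The seconds convention is easy to get wrong when existing code records milliseconds. The sketch below shows one way to record a duration histogram with unit `s` using only `System.Diagnostics.Metrics`; the instrument name `mxgateway.worker.call` is a hypothetical example of the `<app>.<subsystem>.<event>` shape, not a name taken from `GatewayMetrics`.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class DurationConventionSketch
{
    private static readonly Meter Meter = new("MxGateway.Server");

    // Hypothetical instrument; unit "s" follows the OTel semantic conventions.
    private static readonly Histogram<double> CallDuration =
        Meter.CreateHistogram<double>("mxgateway.worker.call", unit: "s");

    // Conversion needed wherever an existing code path still measures milliseconds.
    public static double MillisecondsToSeconds(double milliseconds) => milliseconds / 1000.0;

    // Times an operation and records the elapsed time in seconds.
    public static T Measure<T>(Func<T> operation)
    {
        var start = Stopwatch.GetTimestamp();
        try
        {
            return operation();
        }
        finally
        {
            CallDuration.Record(Stopwatch.GetElapsedTime(start).TotalSeconds);
        }
    }
}
```

Recording `TotalSeconds` at the call site keeps the unit string and the recorded value in agreement, which is the property the convergence item is after.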
### The one adoption — MxGateway MEL → Serilog
Replace `WebApplicationBuilder` default logging with `AddZbSerilog()`; re-express the
`GatewayLogScope`/`BeginScope` correlation middleware as a Serilog `LogContext.PushProperty` scope;
move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The net48 x86 worker's `IWorkerLogger`
(stderr key=value) stays bespoke — out of process and out of scope.
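Re-expressing a `BeginScope` correlation middleware with Serilog could look roughly like the following. This is a sketch under assumptions: the property name `CorrelationId` and the use of `HttpContext.TraceIdentifier` are placeholders, since the design does not say which values `GatewayLogScope` carries, and the logger must be configured with `Enrich.FromLogContext()` for the pushed property to appear on events.

```csharp
using Microsoft.AspNetCore.Builder;
using Serilog.Context;

public static class CorrelationScopeSketch
{
    // Hypothetical middleware: pushes a correlation property for the duration of one request.
    public static IApplicationBuilder UseCorrelationScope(this IApplicationBuilder app) =>
        app.Use(async (context, next) =>
        {
            // The property is attached to every event logged inside the using block
            // and removed again when the request completes.
            using (LogContext.PushProperty("CorrelationId", context.TraceIdentifier))
            {
                await next(context);
            }
        });
}
```

The scope has the same lifetime semantics as `BeginScope`, so the existing "correlation scope still emits" test expectation carries over unchanged.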
## Normalization component docs
Both trees follow `components/README.md`'s six-part layout (matching auth + ui-theme). Each `spec`
opens with a Section 0 stating normalized vs. left-per-project explicitly. `observability/` adds one
reference doc — `spec/METRIC-CONVENTIONS.md` — mirroring auth's `CANONICAL-ROLES.md` / theme's
`DESIGN-TOKENS.md`. Three `current-state/<project>/CURRENT-STATE.md` per component at full
code-verified depth, each ending in an Adoption plan. `GAPS.md` turns deltas into a prioritized
backlog (MxGateway "no probes" + "MEL→Serilog" are top entries; `ms`→`s` and the missing Resource are
convergence items). Both register at status **Draft** (`Draft → Reviewed → Adopting → Converged`).
## Testing & verification
Every package ships tests (mirrors auth's 172 / theme's 32; `dotnet test` from each library root):
- **Health** — `WebApplicationFactory` tests for the three tiers + JSON shape; `IActiveNodeGate`
gates a route (200 active / 503 standby); `GrpcDependencyHealthCheck` on a stub channel.
- **Health.Akka** — table-driven status-policy + role-filter unit tests over faked cluster state.
- **Health.EntityFrameworkCore** — `DatabaseHealthCheck<T>` against SQLite in-memory (healthy / broken
context / custom probe delegate).
- **Telemetry** — Resource carries every options attribute; in-memory exporter sees a registered app
Meter's instrument; `MapZbMetrics` serves Prometheus text.
- **Telemetry.Serilog** — in-memory/TestCorrelator sink asserts enricher properties present;
`TraceContextEnricher` stamps `trace_id`/`span_id` under an active `Activity` and omits cleanly
otherwise; `ILogRedactor` scrubs a policy-marked secret.
- **MxGateway migration** — existing `MxGateway.Tests` (fake worker) still green + correlation scope
still emits + secrets still redacted.

Verification gates (evidence, not assertions): each library `dotnet test` green + `dotnet pack`
produces nupkgs @ 0.1.0; MxGateway `dotnet build src/MxGateway.sln` + `dotnet test` green.
## Build order
```
1. components/health/ + components/observability/ docs (spec first — drives the APIs)
2. ZB.MOM.WW.Health (3 pkgs) ─┐ parallelizable
3. ZB.MOM.WW.Telemetry (core: metrics+traces) ─┘
4. ZB.MOM.WW.Telemetry.Serilog (needs #3)
5. MxGateway MEL→Serilog migration (needs #4) ← the one sister-repo touch
6. Index/registry updates + GAPS cross-check
```
## Implementation tasks (native task IDs)
- #7 Build `ZB.MOM.WW.Health` library (3 packages)
- #8 Build `ZB.MOM.WW.Telemetry` library (2 packages)
- #9 Migrate MxGateway logging MEL → shared Serilog (sister-repo) — blocked by #8
- #10 Author `components/health/` normalization docs
- #11 Author `components/observability/` normalization docs

Dependency: #9 blocked by #8 (needs `ZB.MOM.WW.Telemetry.Serilog`). Docs (#10/#11) precede the
libraries logically (spec drives API) but can be drafted in parallel from the captured current-state.