docs: design for health + observability normalization components

Adds the approved brainstorm design for the next two component-normalization
entries (Health #1, Observability #2 from upcoming.md):

- components/health/ -> ZB.MOM.WW.Health (3 dependency-split packages)
- components/observability/ -> ZB.MOM.WW.Telemetry (2 packages, 3 OTel signals
  + shared Serilog bootstrap)

Scope: normalization docs + build both libraries (.NET 10, tested, packed);
one sister-repo touch (MxGateway MEL->Serilog migration); no other app adoption.
Unifying hinge: one identity triple (service.name/site.id/node.role) feeds both
the OTel Resource and the Serilog enrichers.
Joseph Doherty
2026-06-01 06:08:51 -04:00
parent b95c413c08
commit 29b309c6c1
@@ -0,0 +1,232 @@
# Design — Health & Observability normalization components + shared libraries
Date: 2026-06-01
Status: **Approved design** (brainstorm output). Implementation plans follow separately
(one per library) via the writing-plans workflow.
This design adds the next two entries to the [component-normalization](../../components/README.md)
program, following the exact arc already used for **Auth** (`ZB.MOM.WW.Auth`) and **UI-Theme**
(`ZB.MOM.WW.Theme`): normalize the concern in `components/`, then build the shared library in this
repo. The two concerns are the top-ranked candidates in [`upcoming.md`](../../upcoming.md) (Health #1,
Observability #2 — the "operability cluster").
## Scope decisions (locked during brainstorm)
1. **Deliverable depth** — normalization docs **+ build both shared libraries** (.NET 10, tested,
`dotnet pack`). *Not* a docs-only pass.
2. **Structure** — two separate components → two separate libraries (one component = one library,
per house precedent): `components/health/` → `ZB.MOM.WW.Health`; `components/observability/` →
`ZB.MOM.WW.Telemetry`. A future `ZB.MOM.WW.Hosting` aggregator can bundle both behind one call.
3. **Telemetry reach** — all three OpenTelemetry signals (metrics + traces + logs), including a shared
Serilog bootstrap, enrichers, and trace↔log correlation.
4. **Sister-repo touch** — exactly one: migrate **MxAccessGateway** off `Microsoft.Extensions.Logging`
onto the shared Serilog bootstrap. No other app adoption — wiring Health/Telemetry into the three
apps stays a future `GAPS.md` item, identical to where Auth and UI-Theme sit today.
5. **Packaging** — dependency-split packages (mirrors Auth's 4-package split and the
`AspNetCore.HealthChecks.*` ecosystem). Heavy probes live in opt-in satellites so MxGateway never
transitively pulls Akka or EF.
6. **Current-state docs** — full code-verified depth with `file:line` refs, per
`components/README.md`'s mandate (matching auth's current-state docs).
## The unifying hinge
A single identity triple — `service.name` / `site.id` / `node.role` (+ host) — populates **both** the
OpenTelemetry `Resource` **and** the Serilog enrichers. A metric, a span, and a log line from the same
node therefore carry identical dimensions and join up in a backend. This symmetry is the reason
Health and Telemetry are designed together even though they ship as separate libraries.
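A minimal sketch of the hinge, assuming a hypothetical `ZbIdentity` record (the type and its method names are illustrative, not part of the approved design). One value yields both the OTel resource attributes and the Serilog property set, using the attribute names above and ScadaBridge's existing enricher names:

```csharp
using System.Collections.Generic;

// Hypothetical type for illustration; the design only fixes the attribute names.
public sealed record ZbIdentity(string ServiceName, string SiteId, string NodeRole, string HostName)
{
    // Attributes destined for the OpenTelemetry Resource.
    public IReadOnlyDictionary<string, object> ToResourceAttributes() => new Dictionary<string, object>
    {
        ["service.name"] = ServiceName,
        ["site.id"] = SiteId,
        ["node.role"] = NodeRole,
        ["host.name"] = HostName,
    };

    // Properties destined for the Serilog enrichers (names as used by ScadaBridge today).
    public IReadOnlyDictionary<string, object> ToLogProperties() => new Dictionary<string, object>
    {
        ["SiteId"] = SiteId,
        ["NodeRole"] = NodeRole,
        ["NodeHostname"] = HostName,
    };
}
```

Because both projections read the same record, a log line and a metric from one node cannot drift apart in their site or role values.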
## Repo layout
```
scadaproj/
├─ components/
│  ├─ health/                          NEW normalization component (docs)
│  │  ├─ README.md
│  │  ├─ spec/SPEC.md
│  │  ├─ shared-contract/ZB.MOM.WW.Health.md
│  │  ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│  │  └─ GAPS.md
│  └─ observability/                   NEW normalization component (docs)
│     ├─ README.md
│     ├─ spec/SPEC.md
│     ├─ spec/METRIC-CONVENTIONS.md    (mirrors auth CANONICAL-ROLES / theme DESIGN-TOKENS)
│     ├─ shared-contract/ZB.MOM.WW.Telemetry.md
│     ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│     └─ GAPS.md
├─ ZB.MOM.WW.Health/                   NEW built library (nested git repo, .NET 10) → 3 nupkgs @ 0.1.0
├─ ZB.MOM.WW.Telemetry/                NEW built library (nested git repo, .NET 10) → 2 nupkgs @ 0.1.0
└─ docs/plans/
   ├─ 2026-06-01-zb-mom-ww-health-shared-library.md      (impl plan from writing-plans)
   └─ 2026-06-01-zb-mom-ww-telemetry-shared-library.md   (impl plan from writing-plans)
```
Index updates (same discipline as the prior two components): add both rows to
`components/README.md`, the `CLAUDE.md` Component-normalization table, and check off Health +
Observability in `upcoming.md`.
## Code-verified current state (2026-06-01 scan)
### Health
| | OtOpcUa | ScadaBridge | MxGateway |
|---|---|---|---|
| Endpoints | `/health/ready`, `/health/active`, `/healthz` | `/health/ready`, `/health/active` (no `/healthz`) | `/health/live` only (custom `GatewayHealthReply`) |
| Probes | Database, AkkaCluster, AdminRoleLeader | Database, AkkaCluster, ActiveNode | **none** (`AddHealthChecks()` called but unused) |
| Tagging | tags on the check | named + predicate, `HealthChecks.UI.Client` JSON | — |
| Extra | — | `IActiveNodeGate` route gate + `HealthMonitoring/` domain pipeline | net48 x86 worker has no endpoint |
OtOpcUa and ScadaBridge both descend from the same "ScadaLink three-tier pattern" (OtOpcUa's
`HealthEndpoints.cs:13` says so), but their Akka/leader probe logic and DB probe technique already
differ. MxGateway is **not** Akka-based, which is a hard dependency-hygiene constraint on the shared library.
Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/{HealthEndpoints,DatabaseHealthCheck,AkkaClusterHealthCheck,AdminRoleLeaderHealthCheck}.cs`;
ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/Health/{DatabaseHealthCheck,AkkaClusterHealthCheck,ActiveNodeHealthCheck,ActiveNodeGate}.cs` + `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`;
MxGateway `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:61,139-145`.
### Telemetry
| | OtOpcUa | MxGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK | full (`WithMetrics`+`WithTracing`) | **none** (hand-rolled `System.Diagnostics.Metrics`, no export) | **none** (`OpenTelemetry.Api` is a dangling CVE-patch ref) |
| Exporter | Prometheus `/metrics` | in-memory snapshot only (`GetSnapshot()`) | — |
| Meter | `ZB.MOM.WW.OtOpcUa` | `MxGateway.Server` (13 ctr / 3 hist `ms` / 4 gauge) | — |
| Tracing | ActivitySource (2 spans) | none | none |
| Resource / `service.name` | **none anywhere** | none | none |
No app sets a resource or `service.name`, so the fleet is indistinguishable in a collector. Duration
units are split: `s` (OtOpcUa, OTel-correct) vs `ms` (MxGateway).
Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs` +
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`;
MxGateway `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`;
ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31` (Api only, zero usage).
### Logging
Serilog in OtOpcUa (`Program.cs:49`) + ScadaBridge (`LoggerConfigurationFactory.cs:28-126`,
enrichers `SiteId`/`NodeRole`/`NodeHostname`); MEL in MxGateway (`appsettings.json`, correlation via
`GatewayLogScope`/`BeginScope` middleware + `GatewayLogRedactor`). ScadaBridge's enricher set is the
cleanest and its property names match the Resource attributes Telemetry needs. Nobody enriches logs
with `trace_id`/`span_id`.
## Library design — `ZB.MOM.WW.Health` (3 packages)
**`ZB.MOM.WW.Health`** (core; deps: `Microsoft.Extensions.Diagnostics.HealthChecks` + ASP.NET Core abstractions)
- Tier convention: canonical tags `ready` / `active` / `live`; `app.MapZbHealth()` maps all three —
`/health/ready` (tag `ready` → can this node serve?), `/health/active` (tag `active` → is this the
leader/active node?), `/healthz` (predicate `_ => false` → no checks run, bare process liveness).
- Canonical JSON response writer (lifts ScadaBridge's `HealthChecks.UI.Client` style to the default).
- `IActiveNodeGate` seam (generalized from ScadaBridge's `ActiveNodeGate`) + `MapZbHealth` integration.
- `GrpcDependencyHealthCheck` — "is my downstream gRPC dependency reachable" (MxGateway → worker;
OtOpcUa → gateway channel).
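The tier mapping can be sketched with the stock ASP.NET Core health-check endpoints. This is an illustration under assumptions: the real `MapZbHealth` signature is not fixed by this design, and the canonical JSON writer and the `IActiveNodeGate` integration are left out here.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;

public static class ZbHealthEndpointExtensions
{
    // Sketch only: signature and return type are assumptions.
    public static IEndpointRouteBuilder MapZbHealth(this IEndpointRouteBuilder endpoints)
    {
        // Readiness: run only the checks tagged "ready".
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("ready"),
        });

        // Active/leader: run only the checks tagged "active".
        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("active"),
        });

        // Liveness: the predicate rejects every check, so a responding process reports Healthy.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        });

        return endpoints;
    }
}
```

An app would opt a probe into a tier at registration time, for example `AddCheck<T>("database", tags: new[] { "ready" })`.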
**`ZB.MOM.WW.Health.Akka`** (dep: Akka.Cluster)
- `AkkaClusterHealthCheck` with a configurable status policy. Default = ScadaBridge's
(`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy); OtOpcUa's (self-`Up`-among-
members → Healthy/Degraded) ships as a preset.
- `ActiveNodeHealthCheck` with an optional role filter — role-less default gives ScadaBridge's
`ActiveNode` (Up && leader); passing a role gives OtOpcUa's `AdminRoleLeader` behavior.
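The default status policy and the role filter might look like the following. The class and method names are hypothetical; only the status mapping and the leader rule come from the text above.

```csharp
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class ClusterStatusPolicies
{
    // Default policy (ScadaBridge's mapping): Up/Joining healthy, Leaving/Exiting degraded, else unhealthy.
    public static HealthStatus Default(MemberStatus self) => self switch
    {
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy,
    };

    // Role-less: cluster leader (ScadaBridge's ActiveNode). With a role: leader of that role
    // (OtOpcUa's AdminRoleLeader behavior).
    public static bool IsActive(Cluster cluster, string? role = null)
    {
        var leader = role is null ? cluster.State.Leader : cluster.State.RoleLeader(role);
        return cluster.SelfMember.Status == MemberStatus.Up
            && leader is not null
            && leader.Equals(cluster.SelfAddress);
    }
}
```

Keeping the policy a pure function of `MemberStatus` is what makes the table-driven tests over faked cluster state straightforward.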
**`ZB.MOM.WW.Health.EntityFrameworkCore`** (dep: EF Core)
- `DatabaseHealthCheck<TContext>` — default probe `CanConnectAsync()` (ScadaBridge), optional
probe-query delegate for OtOpcUa's "query `Deployments`" style.
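A possible shape for the check, assuming the probe delegate is passed through the constructor (the delegate signature is an assumption, not a decided API):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class DatabaseHealthCheck<TContext> : IHealthCheck where TContext : DbContext
{
    private readonly TContext _db;
    private readonly Func<TContext, CancellationToken, Task<bool>> _probe;

    // When no probe is supplied, fall back to CanConnectAsync (the ScadaBridge technique).
    public DatabaseHealthCheck(TContext db, Func<TContext, CancellationToken, Task<bool>>? probe = null)
    {
        _db = db;
        _probe = probe ?? ((ctx, ct) => ctx.Database.CanConnectAsync(ct));
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            return await _probe(_db, cancellationToken)
                ? HealthCheckResult.Healthy("database reachable")
                : new HealthCheckResult(context.Registration.FailureStatus, "database probe returned false");
        }
        catch (Exception ex)
        {
            // Report with the registration's configured failure status instead of throwing.
            return new HealthCheckResult(context.Registration.FailureStatus, "database probe threw", ex);
        }
    }
}
```

OtOpcUa's style would then be a delegate that runs a cheap query against its `Deployments` set and returns `true` on success.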
**Stays per-project:** which probes each app registers; orchestrator/Traefik wiring; ScadaBridge's
`HealthMonitoring/` domain aggregation pipeline (distributed domain health, not an ASP.NET probe).
**Consumer matrix:** MxGateway → core only (+ gRPC-dep probe; no Akka/EF); OtOpcUa & ScadaBridge → all three.
## Library design — `ZB.MOM.WW.Telemetry` (2 packages)
**① `ZB.MOM.WW.Telemetry`** (OTel metrics + traces; deps: OpenTelemetry SDK + hosting/exporter)
- `builder.AddZbTelemetry(options)` — the missing front door:
```csharp
builder.AddZbTelemetry(o => {
    o.ServiceName      = "mxgateway";           // → Resource service.name
    o.ServiceNamespace = "ZB.MOM.WW";           // constant across the fleet
    o.SiteId           = cfg.SiteId;            // → resource attr site.id
    o.NodeRole         = cfg.NodeRole;          // → resource attr node.role
    o.Meters           = ["MxGateway.Server"];  // app's own Meter name(s)
    o.ActivitySources  = [...];                 // app's own ActivitySource name(s)
    o.Exporter         = Prometheus;            // default; OTLP opt-in
});
app.MapZbMetrics();                             // Prometheus /metrics
```
- Shared `Resource`: `service.name` + `service.namespace` + `service.version` + `site.id` +
`node.role` + `host.name`. **The headline fix** — nobody sets this today.
- Standard instrumentation everyone should have (only OtOpcUa has it now): ASP.NET Core, gRPC client,
HttpClient, runtime + process meters.
- Exporter: Prometheus `/metrics` default; **OTLP opt-in** via options (path to a real collector).
- App instruments stay per-project. MxGateway's hand-rolled `GatewayMetrics` keeps its 13 counters,
3 histograms, and 4 gauges, but its `Meter` is registered through `AddZbTelemetry` so it finally
**exports** instead of dying in an in-memory snapshot.
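Internally, the front door could reduce to the standard OpenTelemetry .NET builder calls. The sketch below is an assumption about the wiring, not the decided implementation: `ZbTelemetryOptions` and `AddZbTelemetryCore` are hypothetical names, and `service.version`, `host.name`, the standard instrumentation, and the OTLP switch are omitted for brevity.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Hypothetical options shape, reduced to the fields used below.
public sealed class ZbTelemetryOptions
{
    public string ServiceName { get; set; } = "";
    public string ServiceNamespace { get; set; } = "ZB.MOM.WW";
    public string SiteId { get; set; } = "";
    public string NodeRole { get; set; } = "";
    public string[] Meters { get; set; } = [];
    public string[] ActivitySources { get; set; } = [];
}

public static class ZbTelemetryWiring
{
    public static IServiceCollection AddZbTelemetryCore(this IServiceCollection services, ZbTelemetryOptions o)
    {
        services.AddOpenTelemetry()
            // One Resource shared by every signal registered on this builder.
            .ConfigureResource(r => r
                .AddService(serviceName: o.ServiceName, serviceNamespace: o.ServiceNamespace)
                .AddAttributes(new KeyValuePair<string, object>[]
                {
                    new("site.id", o.SiteId),
                    new("node.role", o.NodeRole),
                }))
            // Registering the app's Meter names is what makes hand-rolled instruments export.
            .WithMetrics(m => m.AddMeter(o.Meters).AddPrometheusExporter())
            .WithTracing(t => t.AddSource(o.ActivitySources));
        return services;
    }
}
```

With the Prometheus ASP.NET Core exporter package, `MapZbMetrics()` would plausibly wrap `app.MapPrometheusScrapingEndpoint()`.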
**② `ZB.MOM.WW.Telemetry.Serilog`** (logs signal + Serilog convergence; deps: Serilog + the core package)
- `AddZbSerilog()` — shared two-stage bootstrap generalizing ScadaBridge's `LoggerConfigurationFactory`
(`ReadFrom.Configuration` for sinks + explicit `MinimumLevel.Is` override).
- Shared enrichers `SiteId` / `NodeRole` / `NodeHostname`, **bound from the same options object as the
OTel Resource** so logs and metrics carry identical dimensions.
- **NEW `TraceContextEnricher`** — stamps `trace_id`/`span_id` from `Activity.Current` onto every log
event (makes a log line clickable from a trace; nobody has this today).
- OTel log export — logs flow through the OTel pipeline with the same Resource (all three signals
correlated in a backend).
- `ILogRedactor` seam — generalized from MxGateway's `GatewayLogRedactor` (the only app with real
secret redaction). The seam is shared; the policy (which fields/commands) stays per-project.
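A minimal sketch of the `TraceContextEnricher` described above, written against Serilog's `ILogEventEnricher` interface. The property names come from this design; the rest is one plausible implementation, not the final one.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

public sealed class TraceContextEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            return; // no active span: leave the event untouched
        }

        // Hex-encoded W3C ids, matching what a trace backend displays.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```

It would be attached with `.Enrich.With(new TraceContextEnricher())` inside the shared bootstrap.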
**Convergence the spec pins down:** Meter name = `<app>` namespace; instrument name =
`<app>.<subsystem>.<event>`; duration unit = **seconds** (OTel semconv) — flags MxGateway's `ms`
histograms as a convergence item.
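To make the unit convention concrete, here is a sketch using only `System.Diagnostics.Metrics`. The meter name is MxGateway's existing one; the instrument name, description, and the `Timed` helper are hypothetical examples of the `<app>.<subsystem>.<event>` pattern.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class GatewayInstruments
{
    public static readonly Meter Meter = new("MxGateway.Server");

    // Hypothetical instrument: unit is declared as seconds per OTel semantic conventions.
    public static readonly Histogram<double> WorkerCallDuration =
        Meter.CreateHistogram<double>(
            "mxgateway.worker.call.duration",
            unit: "s",
            description: "Duration of a worker call.");

    public static T Timed<T>(Func<T> call)
    {
        long start = Stopwatch.GetTimestamp();
        try
        {
            return call();
        }
        finally
        {
            // Record seconds, not milliseconds.
            WorkerCallDuration.Record(Stopwatch.GetElapsedTime(start).TotalSeconds);
        }
    }
}
```

Migrating an existing `ms` histogram then means changing both the declared unit and the recorded value, since a backend trusts the unit string.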
### The one adoption — MxGateway MEL → Serilog
Replace `WebApplicationBuilder` default logging with `AddZbSerilog()`; re-express the
`GatewayLogScope`/`BeginScope` correlation middleware as a Serilog `LogContext.PushProperty` scope;
move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The net48 x86 worker's `IWorkerLogger`
(stderr key=value) stays bespoke — out of process and out of scope.
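The correlation-scope rewrite might look like the middleware below. The property name `CorrelationId` and the `X-Correlation-Id` header are placeholders, since the fields `GatewayLogScope` carries are not listed in this design.

```csharp
using Microsoft.AspNetCore.Builder;
using Serilog.Context;

public static class CorrelationScopeMiddleware
{
    // Placeholder names throughout; substitute the real GatewayLogScope fields.
    public static IApplicationBuilder UseCorrelationScope(this IApplicationBuilder app) =>
        app.Use(async (context, next) =>
        {
            var correlationId = context.Request.Headers["X-Correlation-Id"].ToString();
            if (string.IsNullOrEmpty(correlationId))
            {
                correlationId = context.TraceIdentifier;
            }

            // Every event logged while the scope is open carries the property,
            // provided the logger is configured with Enrich.FromLogContext().
            using (LogContext.PushProperty("CorrelationId", correlationId))
            {
                await next(context);
            }
        });
}
```

Unlike `BeginScope`, the pushed property only appears on events if the shared bootstrap enables `Enrich.FromLogContext()`, so that call belongs in `AddZbSerilog()`.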
## Normalization component docs
Both trees follow `components/README.md`'s six-part layout (matching auth + ui-theme). Each `spec`
opens with a Section 0 stating normalized vs. left-per-project explicitly. `observability/` adds one
reference doc — `spec/METRIC-CONVENTIONS.md` — mirroring auth's `CANONICAL-ROLES.md` / theme's
`DESIGN-TOKENS.md`. Three `current-state/<project>/CURRENT-STATE.md` per component at full
code-verified depth, each ending in an Adoption plan. `GAPS.md` turns deltas into a prioritized
backlog (MxGateway "no probes" + "MEL→Serilog" are top entries; `ms`→`s` and the missing Resource are
convergence items). Both register at status **Draft** (`Draft → Reviewed → Adopting → Converged`).
## Testing & verification
Every package ships tests (mirrors auth's 172 / theme's 32; `dotnet test` from each library root):
- **Health** — `WebApplicationFactory` tests for the three tiers + JSON shape; `IActiveNodeGate`
gates a route (200 active / 503 standby); `GrpcDependencyHealthCheck` on a stub channel.
- **Health.Akka** — table-driven status-policy + role-filter unit tests over faked cluster state.
- **Health.EntityFrameworkCore** — `DatabaseHealthCheck<T>` against SQLite in-memory (healthy / broken
context / custom probe delegate).
- **Telemetry** — Resource carries every options attribute; in-memory exporter sees a registered app
Meter's instrument; `MapZbMetrics` serves Prometheus text.
- **Telemetry.Serilog** — in-memory/TestCorrelator sink asserts enricher properties present;
`TraceContextEnricher` stamps `trace_id`/`span_id` under an active `Activity` and omits cleanly
otherwise; `ILogRedactor` scrubs a policy-marked secret.
- **MxGateway migration** — existing `MxGateway.Tests` (fake worker) still green + correlation scope
still emits + secrets still redacted.
Verification gates (evidence, not assertions): each library `dotnet test` green + `dotnet pack`
produces nupkgs @ 0.1.0; MxGateway `dotnet build src/MxGateway.sln` + `dotnet test` green.
## Build order
```
1. components/health/ + components/observability/ docs (spec first — drives the APIs)
2. ZB.MOM.WW.Health (3 pkgs) ─┐ parallelizable
3. ZB.MOM.WW.Telemetry (core: metrics+traces) ─┘
4. ZB.MOM.WW.Telemetry.Serilog (needs #3)
5. MxGateway MEL→Serilog migration (needs #4) ← the one sister-repo touch
6. Index/registry updates + GAPS cross-check
```
## Implementation tasks (native task IDs)
- #7 Build `ZB.MOM.WW.Health` library (3 packages)
- #8 Build `ZB.MOM.WW.Telemetry` library (2 packages)
- #9 Migrate MxGateway logging MEL → shared Serilog (sister-repo) — blocked by #8
- #10 Author `components/health/` normalization docs
- #11 Author `components/observability/` normalization docs
Dependency: #9 blocked by #8 (needs `ZB.MOM.WW.Telemetry.Serilog`). Docs (#10/#11) precede the
libraries logically (spec drives API) but can be drafted in parallel from the captured current-state.