# Design: Health & Observability normalization components + shared libraries

Date: 2026-06-01

Status: **Approved design** (brainstorm output). Implementation plans follow separately (one per library) via the writing-plans workflow.

This design adds the next two entries to the [component-normalization](../../components/README.md) program, following the exact arc already used for **Auth** (`ZB.MOM.WW.Auth`) and **UI-Theme** (`ZB.MOM.WW.Theme`): normalize the concern in `components/`, then build the shared library in this repo. The two concerns are the top-ranked candidates in [`upcoming.md`](../../upcoming.md) (Health #1, Observability #2: the "operability cluster").

## Scope decisions (locked during brainstorm)

1. **Deliverable depth**: normalization docs **+ build both shared libraries** (.NET 10, tested, `dotnet pack`). *Not* a docs-only pass.
2. **Structure**: two separate components → two separate libraries (one component = one library, per house precedent): `components/health/` → `ZB.MOM.WW.Health`; `components/observability/` → `ZB.MOM.WW.Telemetry`. A future `ZB.MOM.WW.Hosting` aggregator can bundle both behind one call.
3. **Telemetry reach**: all three OpenTelemetry signals (metrics + traces + logs), including a shared Serilog bootstrap, enrichers, and trace↔log correlation.
4. **Sister-repo touch**: exactly one, migrating **MxAccessGateway** off `Microsoft.Extensions.Logging` onto the shared Serilog bootstrap. No other app adoption: wiring Health/Telemetry into the three apps stays a future `GAPS.md` item, identical to where Auth and UI-Theme sit today.
5. **Packaging**: dependency-split packages (mirrors Auth's 4-package split and the `AspNetCore.HealthChecks.*` ecosystem). Heavy probes live in opt-in satellites so MxGateway never transitively pulls Akka or EF.
6. **Current-state docs**: full code-verified depth with `file:line` refs, per `components/README.md`'s mandate (matching auth's current-state docs).

## The unifying hinge

A single identity triple, `service.name` / `site.id` / `node.role` (+ host), populates **both** the OpenTelemetry `Resource` **and** the Serilog enrichers. A metric, a span, and a log line from the same node therefore carry identical dimensions and join up in a backend. This symmetry is the reason Health and Telemetry are designed together even though they ship as separate libraries.

## Repo layout

```
scadaproj/
├─ components/
│  ├─ health/                         NEW normalization component (docs)
│  │  ├─ README.md
│  │  ├─ spec/SPEC.md
│  │  ├─ shared-contract/ZB.MOM.WW.Health.md
│  │  ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│  │  └─ GAPS.md
│  └─ observability/                  NEW normalization component (docs)
│     ├─ README.md
│     ├─ spec/SPEC.md
│     ├─ spec/METRIC-CONVENTIONS.md   (mirrors auth CANONICAL-ROLES / theme DESIGN-TOKENS)
│     ├─ shared-contract/ZB.MOM.WW.Telemetry.md
│     ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│     └─ GAPS.md
├─ ZB.MOM.WW.Health/                  NEW built library (nested git repo, .NET 10) → 3 nupkgs @ 0.1.0
├─ ZB.MOM.WW.Telemetry/               NEW built library (nested git repo, .NET 10) → 2 nupkgs @ 0.1.0
└─ docs/plans/
   ├─ 2026-06-01-zb-mom-ww-health-shared-library.md      (impl plan, from writing-plans)
   └─ 2026-06-01-zb-mom-ww-telemetry-shared-library.md   (impl plan, from writing-plans)
```

Index updates (same discipline as the prior two components): add both rows to `components/README.md`, the `CLAUDE.md` Component-normalization table, and check off Health + Observability in `upcoming.md`.

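
To make the identity triple from "The unifying hinge" concrete, here is a minimal sketch of one identity object projected into both the Resource attribute keys and the Serilog enricher property names used in this design. The `NodeIdentity` type and its method names are hypothetical and are not part of either library's contract; only the attribute keys and property names come from this document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type (illustration only): one identity, two projections.
public sealed record NodeIdentity(string ServiceName, string SiteId, string NodeRole, string HostName)
{
    // Keys follow the Resource attributes named in this design.
    public IReadOnlyDictionary<string, object> ToResourceAttributes() =>
        new Dictionary<string, object>
        {
            ["service.name"] = ServiceName,
            ["site.id"] = SiteId,
            ["node.role"] = NodeRole,
            ["host.name"] = HostName,
        };

    // Property names follow ScadaBridge's existing enricher set.
    public IReadOnlyDictionary<string, object> ToLogProperties() =>
        new Dictionary<string, object>
        {
            ["SiteId"] = SiteId,
            ["NodeRole"] = NodeRole,
            ["NodeHostname"] = HostName,
        };
}
```

Because both projections read the same record, a metric and a log line from one node cannot drift apart in `site.id` / `SiteId` or `node.role` / `NodeRole`.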
## Code-verified current state (2026-06-01 scan)

### Health

| | OtOpcUa | ScadaBridge | MxGateway |
|---|---|---|---|
| Endpoints | `/health/ready`, `/health/active`, `/healthz` | `/health/ready`, `/health/active` (no `/healthz`) | `/health/live` only (custom `GatewayHealthReply`) |
| Probes | Database, AkkaCluster, AdminRoleLeader | Database, AkkaCluster, ActiveNode | **none** (`AddHealthChecks()` called but unused) |
| Tagging | tags on the check | named + predicate, `HealthChecks.UI.Client` JSON | none |
| Extra | none | `IActiveNodeGate` route gate + `HealthMonitoring/` domain pipeline | net48 x86 worker has no endpoint |

Both descend from the same "ScadaLink three-tier pattern" (OtOpcUa's `HealthEndpoints.cs:13` says so) but the Akka/leader probe logic and the DB probe technique already differ. MxGateway is **not** Akka-based, which is a hard dependency-hygiene constraint.

Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/{HealthEndpoints,DatabaseHealthCheck,AkkaClusterHealthCheck,AdminRoleLeaderHealthCheck}.cs`; ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/Health/{DatabaseHealthCheck,AkkaClusterHealthCheck,ActiveNodeHealthCheck,ActiveNodeGate}.cs` + `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`; MxGateway `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:61,139–145`.

### Telemetry

| | OtOpcUa | MxGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK | full (`WithMetrics`+`WithTracing`) | **none** (hand-rolled `System.Diagnostics.Metrics`, no export) | **none** (`OpenTelemetry.Api` is a dangling CVE-patch ref) |
| Exporter | Prometheus `/metrics` | in-memory snapshot only (`GetSnapshot()`) | none |
| Meter | `ZB.MOM.WW.OtOpcUa` | `MxGateway.Server` (13 ctr / 3 hist `ms` / 4 gauge) | none |
| Tracing | ActivitySource (2 spans) | none | none |
| Resource / `service.name` | **none anywhere** | none | none |

Nobody sets a resource/`service.name`, so the fleet is indistinguishable in a collector. Durations split `s` (OtOpcUa, OTel-correct) vs `ms` (MxGateway).

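
The `s` vs `ms` split can be illustrated with a small sketch of a duration histogram declared in seconds using `System.Diagnostics.Metrics`. The meter name `Example.Gateway`, the instrument name `example.call.duration`, and the `DurationUnitSketch` class are placeholders chosen for this example; they are not names from any of the three apps.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical names throughout; illustrates recording durations in seconds.
public static class DurationUnitSketch
{
    private static readonly Meter ExampleMeter = new("Example.Gateway");

    // Unit "s" declares the histogram in seconds.
    private static readonly Histogram<double> CallDuration =
        ExampleMeter.CreateHistogram<double>(
            "example.call.duration", unit: "s", description: "Duration of one call");

    // Converts a Stopwatch timestamp delta to seconds and records it.
    public static double RecordSeconds(long startTimestamp, long endTimestamp)
    {
        double seconds = (endTimestamp - startTimestamp) / (double)Stopwatch.Frequency;
        CallDuration.Record(seconds);
        return seconds;
    }

    // A value already measured in milliseconds is scaled before recording.
    public static double FromMilliseconds(double milliseconds) => milliseconds / 1000.0;
}
```

A caller would take `Stopwatch.GetTimestamp()` before and after the operation and pass both values to `RecordSeconds`. An instrument that currently records milliseconds would need both its unit string and its recorded values changed together, otherwise dashboards would silently be off by a factor of 1000.
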
Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs` + `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`; MxGateway `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`; ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31` (Api only, zero usage).

### Logging

Serilog in OtOpcUa (`Program.cs:49`) + ScadaBridge (`LoggerConfigurationFactory.cs:28–126`, enrichers `SiteId`/`NodeRole`/`NodeHostname`); MEL (`Microsoft.Extensions.Logging`) in MxGateway (`appsettings.json`, correlation via `GatewayLogScope`/`BeginScope` middleware + `GatewayLogRedactor`). ScadaBridge's enricher set is the cleanest and its property names match the Resource attributes Telemetry needs. Nobody enriches logs with `trace_id`/`span_id`.

## Library design: `ZB.MOM.WW.Health` (3 packages)

**① `ZB.MOM.WW.Health`** (core; deps: `Microsoft.Extensions.Diagnostics.HealthChecks` + ASP.NET Core abstractions)

- Tier convention: canonical tags `ready` / `active` / `live`; `app.MapZbHealth()` maps all three: `/health/ready` (tag `ready` → can this node serve?), `/health/active` (tag `active` → is this the leader/active node?), `/healthz` (predicate `_ => false` → bare process liveness).
- Canonical JSON response writer (lifts ScadaBridge's `HealthChecks.UI.Client` style to the default).
- `IActiveNodeGate` seam (generalized from ScadaBridge's `ActiveNodeGate`) + `MapZbHealth` integration.
- `GrpcDependencyHealthCheck`: "is my downstream gRPC dependency reachable" (MxGateway → worker; OtOpcUa → gateway channel).

**② `ZB.MOM.WW.Health.Akka`** (dep: Akka.Cluster)

- `AkkaClusterHealthCheck` with a configurable status policy. Default = ScadaBridge's (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy); OtOpcUa's (self-`Up`-among-members → Healthy/Degraded) ships as a preset.
- `ActiveNodeHealthCheck` with an optional role filter: the role-less default gives ScadaBridge's `ActiveNode` (Up && leader); passing a role gives OtOpcUa's `AdminRoleLeader` behavior.

**③ `ZB.MOM.WW.Health.EntityFrameworkCore`** (dep: EF Core)

- `DatabaseHealthCheck`: default probe `CanConnectAsync()` (ScadaBridge), optional probe-query delegate for OtOpcUa's "query `Deployments`" style.

**Stays per-project:** which probes each app registers; orchestrator/Traefik wiring; ScadaBridge's `HealthMonitoring/` domain aggregation pipeline (distributed domain health, not an ASP.NET probe).

**Consumer matrix:** MxGateway → core only (+ gRPC-dep probe; no Akka/EF); OtOpcUa & ScadaBridge → all three.

## Library design: `ZB.MOM.WW.Telemetry` (2 packages)

**① `ZB.MOM.WW.Telemetry`** (OTel metrics + traces; deps: OpenTelemetry SDK + hosting/exporter)

- `builder.AddZbTelemetry(options)`, the missing front door:

  ```csharp
  builder.AddZbTelemetry(o =>
  {
      o.ServiceName      = "mxgateway";          // → Resource service.name
      o.ServiceNamespace = "ZB.MOM.WW";          // constant across the fleet
      o.SiteId           = cfg.SiteId;           // → resource attr site.id
      o.NodeRole         = cfg.NodeRole;         // → resource attr node.role
      o.Meters           = ["MxGateway.Server"]; // app's own Meter name(s)
      o.ActivitySources  = [...];                // app's own ActivitySource name(s)
      o.Exporter         = Prometheus;           // default; OTLP opt-in
  });
  app.MapZbMetrics();                            // Prometheus /metrics
  ```

- Shared `Resource`: `service.name` + `service.namespace` + `service.version` + `site.id` + `node.role` + `host.name`. **The headline fix**: nobody sets this today.
- Standard instrumentation everyone should have (only OtOpcUa has it now): ASP.NET Core, gRPC client, HttpClient, runtime + process meters.
- Exporter: Prometheus `/metrics` default; **OTLP opt-in** via options (path to a real collector).
- App instruments stay per-project.
  MxGateway's hand-rolled `GatewayMetrics` keeps its 13/3/4 instruments but its `Meter` is registered through `AddZbTelemetry` so it finally **exports** instead of dying in an in-memory snapshot.

**② `ZB.MOM.WW.Telemetry.Serilog`** (logs signal + Serilog convergence; deps: Serilog + the core package)

- `AddZbSerilog()`: shared two-stage bootstrap generalizing ScadaBridge's `LoggerConfigurationFactory` (`ReadFrom.Configuration` for sinks + explicit `MinimumLevel.Is` override).
- Shared enrichers `SiteId` / `NodeRole` / `NodeHostname`, **bound from the same options object as the OTel Resource** so logs and metrics carry identical dimensions.
- **NEW `TraceContextEnricher`**: stamps `trace_id`/`span_id` from `Activity.Current` onto every log event (makes a log line clickable from a trace; nobody has this today).
- OTel log export: logs flow through the OTel pipeline with the same Resource (all three signals correlated in a backend).
- `ILogRedactor` seam: generalized from MxGateway's `GatewayLogRedactor` (the only app with real secret redaction). The seam is shared; the policy (which fields/commands) stays per-project.

**Convergence the spec pins down:** Meter name = `` namespace; instrument name = `..`; duration unit = **seconds** (OTel semconv), which flags MxGateway's `ms` histograms as a convergence item.

### The one adoption: MxGateway MEL → Serilog

Replace `WebApplicationBuilder` default logging with `AddZbSerilog()`; re-express the `GatewayLogScope`/`BeginScope` correlation middleware as a Serilog `LogContext.PushProperty` scope; move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The net48 x86 worker's `IWorkerLogger` (stderr key=value) stays bespoke: out of process and out of scope.

## Normalization component docs

Both trees follow `components/README.md`'s six-part layout (matching auth + ui-theme). Each `spec` opens with a Section 0 stating normalized vs. left-per-project explicitly.
`observability/` adds one reference doc, `spec/METRIC-CONVENTIONS.md`, mirroring auth's `CANONICAL-ROLES.md` / theme's `DESIGN-TOKENS.md`. Each component gets three `current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md` docs at full code-verified depth, each ending in an Adoption plan. `GAPS.md` turns deltas into a prioritized backlog (MxGateway "no probes" + "MEL→Serilog" are top entries; `ms`→`s` and the missing Resource are convergence items). Both register at status **Draft** (`Draft → Reviewed → Adopting → Converged`).

## Testing & verification

Every package ships tests (mirrors auth's 172 / theme's 32; `dotnet test` from each library root):

- **Health**: `WebApplicationFactory` tests for the three tiers + JSON shape; `IActiveNodeGate` gates a route (200 active / 503 standby); `GrpcDependencyHealthCheck` on a stub channel.
- **Health.Akka**: table-driven status-policy + role-filter unit tests over faked cluster state.
- **Health.EntityFrameworkCore**: `DatabaseHealthCheck` against SQLite in-memory (healthy / broken context / custom probe delegate).
- **Telemetry**: Resource carries every options attribute; in-memory exporter sees a registered app Meter's instrument; `MapZbMetrics` serves Prometheus text.
- **Telemetry.Serilog**: in-memory/TestCorrelator sink asserts enricher properties present; `TraceContextEnricher` stamps `trace_id`/`span_id` under an active `Activity` and omits cleanly otherwise; `ILogRedactor` scrubs a policy-marked secret.
- **MxGateway migration**: existing `MxGateway.Tests` (fake worker) still green + correlation scope still emits + secrets still redacted.

Verification gates (evidence, not assertions): each library `dotnet test` green + `dotnet pack` produces nupkgs @ 0.1.0; MxGateway `dotnet build src/MxGateway.sln` + `dotnet test` green.

## Build order

```
1. components/health/ + components/observability/ docs   (spec first; drives the APIs)
2. ZB.MOM.WW.Health (3 pkgs)                     ─┐ parallelizable
3. ZB.MOM.WW.Telemetry (core: metrics+traces)    ─┘
4. ZB.MOM.WW.Telemetry.Serilog                   (needs #3)
5. MxGateway MEL→Serilog migration               (needs #4)  ← the one sister-repo touch
6. Index/registry updates + GAPS cross-check
```

## Implementation tasks (native task IDs)

- #7 Build `ZB.MOM.WW.Health` library (3 packages)
- #8 Build `ZB.MOM.WW.Telemetry` library (2 packages)
- #9 Migrate MxGateway logging MEL → shared Serilog (sister-repo); blocked by #8
- #10 Author `components/health/` normalization docs
- #11 Author `components/observability/` normalization docs

Dependency: #9 blocked by #8 (needs `ZB.MOM.WW.Telemetry.Serilog`). Docs (#10/#11) precede the libraries logically (spec drives API) but can be drafted in parallel from the captured current-state.
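
As a starting point for task #8, the `TraceContextEnricher` behavior described in the library design and in the test plan (stamp `trace_id`/`span_id` under an active `Activity`, omit them otherwise) can be sketched with the base class library alone. The `TraceContextSketch` class and its `Capture` method are hypothetical and deliberately avoid Serilog types, so this shows only the capture logic, not the enricher itself.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper (illustration only): reads the ambient trace context.
public static class TraceContextSketch
{
    private static readonly IReadOnlyDictionary<string, string> Empty =
        new Dictionary<string, string>();

    public static IReadOnlyDictionary<string, string> Capture()
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            // No active Activity: return nothing so the log event stays clean.
            return Empty;
        }

        // TraceId and SpanId render as lowercase hex strings.
        return new Dictionary<string, string>
        {
            ["trace_id"] = activity.TraceId.ToString(),
            ["span_id"] = activity.SpanId.ToString(),
        };
    }
}
```

In a real enricher, the returned pairs would presumably be attached inside Serilog's `ILogEventEnricher.Enrich` via `LogEvent.AddPropertyIfAbsent`; that wiring is left out here because the final shape belongs to the implementation plan.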