diff --git a/docs/plans/2026-06-01-health-observability-components-design.md b/docs/plans/2026-06-01-health-observability-components-design.md
new file mode 100644
index 0000000..16b09ea
--- /dev/null
+++ b/docs/plans/2026-06-01-health-observability-components-design.md
@@ -0,0 +1,232 @@
+# Design — Health & Observability normalization components + shared libraries
+
+Date: 2026-06-01
+Status: **Approved design** (brainstorm output). Implementation plans follow separately
+(one per library) via the writing-plans workflow.
+
+This design adds the next two entries to the [component-normalization](../../components/README.md)
+program, following the exact arc already used for **Auth** (`ZB.MOM.WW.Auth`) and **UI-Theme**
+(`ZB.MOM.WW.Theme`): normalize the concern in `components/`, then build the shared library in this
+repo. The two concerns are the top-ranked candidates in [`upcoming.md`](../../upcoming.md) (Health #1,
+Observability #2 — the "operability cluster").
+
+## Scope decisions (locked during brainstorm)
+
+1. **Deliverable depth** — normalization docs **+ build both shared libraries** (.NET 10, tested,
+   `dotnet pack`). *Not* a docs-only pass.
+2. **Structure** — two separate components → two separate libraries (one component = one library,
+   per house precedent): `components/health/` → `ZB.MOM.WW.Health`; `components/observability/` →
+   `ZB.MOM.WW.Telemetry`. A future `ZB.MOM.WW.Hosting` aggregator can bundle both behind one call.
+3. **Telemetry reach** — all three OpenTelemetry signals (metrics + traces + logs), including a shared
+   Serilog bootstrap, enrichers, and trace↔log correlation.
+4. **Sister-repo touch** — exactly one: migrate **MxAccessGateway** off `Microsoft.Extensions.Logging`
+   onto the shared Serilog bootstrap. No other app adoption — wiring Health/Telemetry into the three
+   apps stays a future `GAPS.md` item, identical to where Auth and UI-Theme sit today.
+5. **Packaging** — dependency-split packages (mirrors Auth's 4-package split and the
+   `AspNetCore.HealthChecks.*` ecosystem). Heavy probes live in opt-in satellites so MxGateway never
+   transitively pulls Akka or EF.
+6. **Current-state docs** — full code-verified depth with `file:line` refs, per
+   `components/README.md`'s mandate (matching auth's current-state docs).
+
+## The unifying hinge
+
+A single identity triple — `service.name` / `site.id` / `node.role` (+ host) — populates **both** the
+OpenTelemetry `Resource` **and** the Serilog enrichers. A metric, a span, and a log line from the same
+node therefore carry identical dimensions and join up in a backend. This symmetry is the reason
+Health and Telemetry are designed together even though they ship as separate libraries.
+
+## Repo layout
+
+```
+scadaproj/
+├─ components/
+│  ├─ health/                 NEW normalization component (docs)
+│  │  ├─ README.md
+│  │  ├─ spec/SPEC.md
+│  │  ├─ shared-contract/ZB.MOM.WW.Health.md
+│  │  ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
+│  │  └─ GAPS.md
+│  └─ observability/          NEW normalization component (docs)
+│     ├─ README.md
+│     ├─ spec/SPEC.md
+│     ├─ spec/METRIC-CONVENTIONS.md   (mirrors auth CANONICAL-ROLES / theme DESIGN-TOKENS)
+│     ├─ shared-contract/ZB.MOM.WW.Telemetry.md
+│     ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
+│     └─ GAPS.md
+├─ ZB.MOM.WW.Health/          NEW built library (nested git repo, .NET 10) → 3 nupkgs @ 0.1.0
+├─ ZB.MOM.WW.Telemetry/       NEW built library (nested git repo, .NET 10) → 2 nupkgs @ 0.1.0
+└─ docs/plans/
+   ├─ 2026-06-01-zb-mom-ww-health-shared-library.md      (impl plan — from writing-plans)
+   └─ 2026-06-01-zb-mom-ww-telemetry-shared-library.md   (impl plan — from writing-plans)
+```
+
+Index updates (same discipline as the prior two components): add both rows to
+`components/README.md`, the `CLAUDE.md` Component-normalization table, and check off Health +
+Observability in `upcoming.md`.
+
+## Code-verified current state (2026-06-01 scan)
+
+### Health
+| | OtOpcUa | ScadaBridge | MxGateway |
+|---|---|---|---|
+| Endpoints | `/health/ready`, `/health/active`, `/healthz` | `/health/ready`, `/health/active` (no `/healthz`) | `/health/live` only (custom `GatewayHealthReply`) |
+| Probes | Database, AkkaCluster, AdminRoleLeader | Database, AkkaCluster, ActiveNode | **none** (`AddHealthChecks()` called but unused) |
+| Tagging | tags on the check | named + predicate, `HealthChecks.UI.Client` JSON | — |
+| Extra | — | `IActiveNodeGate` route gate + `HealthMonitoring/` domain pipeline | net48 x86 worker has no endpoint |
+
+Both descend from the same "ScadaLink three-tier pattern" (OtOpcUa's `HealthEndpoints.cs:13` says
+so) but the Akka/leader probe logic and the DB probe technique already differ. MxGateway is **not**
+Akka-based — a hard dependency-hygiene constraint.
+
+Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/{HealthEndpoints,DatabaseHealthCheck,AkkaClusterHealthCheck,AdminRoleLeaderHealthCheck}.cs`;
+ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/Health/{DatabaseHealthCheck,AkkaClusterHealthCheck,ActiveNodeHealthCheck,ActiveNodeGate}.cs` + `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`;
+MxGateway `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:61,139–145`.
+
+### Telemetry
+| | OtOpcUa | MxGateway | ScadaBridge |
+|---|---|---|---|
+| OTel SDK | full (`WithMetrics`+`WithTracing`) | **none** (hand-rolled `System.Diagnostics.Metrics`, no export) | **none** (`OpenTelemetry.Api` is a dangling CVE-patch ref) |
+| Exporter | Prometheus `/metrics` | in-memory snapshot only (`GetSnapshot()`) | — |
+| Meter | `ZB.MOM.WW.OtOpcUa` | `MxGateway.Server` (13 ctr / 3 hist `ms` / 4 gauge) | — |
+| Tracing | ActivitySource (2 spans) | none | none |
+| Resource / `service.name` | **none anywhere** | none | none |
+
+Nobody sets a resource/`service.name` — the fleet is indistinguishable in a collector. Durations
+split `s` (OtOpcUa, OTel-correct) vs `ms` (MxGateway).
+
+Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs` +
+`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`;
+MxGateway `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`;
+ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31` (Api only, zero usage).
+
+### Logging
+Serilog in OtOpcUa (`Program.cs:49`) + ScadaBridge (`LoggerConfigurationFactory.cs:28–126`,
+enrichers `SiteId`/`NodeRole`/`NodeHostname`); MEL in MxGateway (`appsettings.json`, correlation via
+`GatewayLogScope`/`BeginScope` middleware + `GatewayLogRedactor`). ScadaBridge's enricher set is the
+cleanest and its property names match the Resource attributes Telemetry needs. Nobody enriches logs
+with `trace_id`/`span_id`.
+
+## Library design — `ZB.MOM.WW.Health` (3 packages)
+
+**① `ZB.MOM.WW.Health`** (core; deps: `Microsoft.Extensions.Diagnostics.HealthChecks` + ASP.NET Core abstractions)
+- Tier convention: canonical tags `ready` / `active` / `live`; `app.MapZbHealth()` maps all three —
+  `/health/ready` (tag `ready` → can this node serve?), `/health/active` (tag `active` → is this the
+  leader/active node?), `/healthz` (predicate `_ => false` → bare process liveness).
+- Canonical JSON response writer (lifts ScadaBridge's `HealthChecks.UI.Client` style to the default).
+- `IActiveNodeGate` seam (generalized from ScadaBridge's `ActiveNodeGate`) + `MapZbHealth` integration.
+- `GrpcDependencyHealthCheck` — "is my downstream gRPC dependency reachable" (MxGateway → worker;
+  OtOpcUa → gateway channel).
+
+**② `ZB.MOM.WW.Health.Akka`** (dep: Akka.Cluster)
+- `AkkaClusterHealthCheck` with a configurable status policy. Default = ScadaBridge's
+  (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy); OtOpcUa's
+  (self-`Up`-among-members → Healthy/Degraded) ships as a preset.
+- `ActiveNodeHealthCheck` with an optional role filter — role-less default gives ScadaBridge's
+  `ActiveNode` (Up && leader); passing a role gives OtOpcUa's `AdminRoleLeader` behavior.
+
+**③ `ZB.MOM.WW.Health.EntityFrameworkCore`** (dep: EF Core)
+- `DatabaseHealthCheck` — default probe `CanConnectAsync()` (ScadaBridge), optional
+  probe-query delegate for OtOpcUa's "query `Deployments`" style.
+
+**Stays per-project:** which probes each app registers; orchestrator/Traefik wiring; ScadaBridge's
+`HealthMonitoring/` domain aggregation pipeline (distributed domain health, not an ASP.NET probe).
+
+**Consumer matrix:** MxGateway → core only (+ gRPC-dep probe; no Akka/EF); OtOpcUa & ScadaBridge → all three.
+
+## Library design — `ZB.MOM.WW.Telemetry` (2 packages)
+
+**① `ZB.MOM.WW.Telemetry`** (OTel metrics + traces; deps: OpenTelemetry SDK + hosting/exporter)
+- `builder.AddZbTelemetry(options)` — the missing front door:
+  ```csharp
+  builder.AddZbTelemetry(o => {
+      o.ServiceName = "mxgateway";        // → Resource service.name
+      o.ServiceNamespace = "ZB.MOM.WW";   // constant across the fleet
+      o.SiteId = cfg.SiteId;              // → resource attr site.id
+      o.NodeRole = cfg.NodeRole;          // → resource attr node.role
+      o.Meters = ["MxGateway.Server"];    // app's own Meter name(s)
+      o.ActivitySources = [...];          // app's own ActivitySource name(s)
+      o.Exporter = Prometheus;            // default; OTLP opt-in
+  });
+  app.MapZbMetrics();                     // Prometheus /metrics
+  ```
+- Shared `Resource`: `service.name` + `service.namespace` + `service.version` + `site.id` +
+  `node.role` + `host.name`. **The headline fix** — nobody sets this today.
+- Standard instrumentation everyone should have (only OtOpcUa has it now): ASP.NET Core, gRPC client,
+  HttpClient, runtime + process meters.
+- Exporter: Prometheus `/metrics` default; **OTLP opt-in** via options (path to a real collector).
+- App instruments stay per-project. MxGateway's hand-rolled `GatewayMetrics` keeps its 13/3/4
+  instruments but its `Meter` is registered through `AddZbTelemetry` so it finally **exports** instead
+  of dying in an in-memory snapshot.
+
+**② `ZB.MOM.WW.Telemetry.Serilog`** (logs signal + Serilog convergence; deps: Serilog + the core package)
+- `AddZbSerilog()` — shared two-stage bootstrap generalizing ScadaBridge's `LoggerConfigurationFactory`
+  (`ReadFrom.Configuration` for sinks + explicit `MinimumLevel.Is` override).
+- Shared enrichers `SiteId` / `NodeRole` / `NodeHostname`, **bound from the same options object as the
+  OTel Resource** so logs and metrics carry identical dimensions.
+- **NEW `TraceContextEnricher`** — stamps `trace_id`/`span_id` from `Activity.Current` onto every log
+  event (makes a log line clickable from a trace; nobody has this today).
+- OTel log export — logs flow through the OTel pipeline with the same Resource (all three signals
+  correlated in a backend).
+- `ILogRedactor` seam — generalized from MxGateway's `GatewayLogRedactor` (the only app with real
+  secret redaction). The seam is shared; the policy (which fields/commands) stays per-project.
+
+**Convergence the spec pins down:** Meter name = `` namespace; instrument name =
+`..`; duration unit = **seconds** (OTel semconv) — flags MxGateway's `ms`
+histograms as a convergence item.
+
+### The one adoption — MxGateway MEL → Serilog
+Replace `WebApplicationBuilder` default logging with `AddZbSerilog()`; re-express the
+`GatewayLogScope`/`BeginScope` correlation middleware as a Serilog `LogContext.PushProperty` scope;
+move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The net48 x86 worker's `IWorkerLogger`
+(stderr key=value) stays bespoke — out of process and out of scope.
+
+## Normalization component docs
+
+Both trees follow `components/README.md`'s six-part layout (matching auth + ui-theme). Each `spec`
+opens with a Section 0 stating normalized vs. left-per-project explicitly. `observability/` adds one
+reference doc — `spec/METRIC-CONVENTIONS.md` — mirroring auth's `CANONICAL-ROLES.md` / theme's
+`DESIGN-TOKENS.md`. Three `current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md` per component at full
+code-verified depth, each ending in an Adoption plan. `GAPS.md` turns deltas into a prioritized
+backlog (MxGateway "no probes" + "MEL→Serilog" are top entries; `ms`→`s` and the missing Resource are
+convergence items). Both register at status **Draft** (`Draft → Reviewed → Adopting → Converged`).
+
+## Testing & verification
+
+Every package ships tests (mirrors auth's 172 / theme's 32; `dotnet test` from each library root):
+- **Health** — `WebApplicationFactory` tests for the three tiers + JSON shape; `IActiveNodeGate`
+  gates a route (200 active / 503 standby); `GrpcDependencyHealthCheck` on a stub channel.
+- **Health.Akka** — table-driven status-policy + role-filter unit tests over faked cluster state.
+- **Health.EntityFrameworkCore** — `DatabaseHealthCheck` against SQLite in-memory (healthy / broken
+  context / custom probe delegate).
+- **Telemetry** — Resource carries every options attribute; in-memory exporter sees a registered app
+  Meter's instrument; `MapZbMetrics` serves Prometheus text.
+- **Telemetry.Serilog** — in-memory/TestCorrelator sink asserts enricher properties present;
+  `TraceContextEnricher` stamps `trace_id`/`span_id` under an active `Activity` and omits cleanly
+  otherwise; `ILogRedactor` scrubs a policy-marked secret.
+- **MxGateway migration** — existing `MxGateway.Tests` (fake worker) still green + correlation scope
+  still emits + secrets still redacted.
+
+Verification gates (evidence, not assertions): each library `dotnet test` green + `dotnet pack`
+produces nupkgs @ 0.1.0; MxGateway `dotnet build src/MxGateway.sln` + `dotnet test` green.
+
+## Build order
+
+```
+1. components/health/ + components/observability/ docs   (spec first — drives the APIs)
+2. ZB.MOM.WW.Health (3 pkgs)                     ─┐ parallelizable
+3. ZB.MOM.WW.Telemetry (core: metrics+traces)    ─┘
+4. ZB.MOM.WW.Telemetry.Serilog                   (needs #3)
+5. MxGateway MEL→Serilog migration               (needs #4)   ← the one sister-repo touch
+6. Index/registry updates + GAPS cross-check
+```
+
+## Implementation tasks (native task IDs)
+
+- #7 Build `ZB.MOM.WW.Health` library (3 packages)
+- #8 Build `ZB.MOM.WW.Telemetry` library (2 packages)
+- #9 Migrate MxGateway logging MEL → shared Serilog (sister-repo) — blocked by #8
+- #10 Author `components/health/` normalization docs
+- #11 Author `components/observability/` normalization docs
+
+Dependency: #9 blocked by #8 (needs `ZB.MOM.WW.Telemetry.Serilog`). Docs (#10/#11) precede the
+libraries logically (spec drives API) but can be drafted in parallel from the captured current-state.
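
The design describes `TraceContextEnricher` as stamping `trace_id`/`span_id` from `Activity.Current` and omitting them cleanly when no `Activity` is active. The core of that behavior might look roughly like the sketch below. It is an illustration only, not the library's implementation: `TraceContextSketch` and `Capture` are hypothetical names, and the sketch returns a plain dictionary using only `System.Diagnostics` from the base class library. A real enricher would presumably implement Serilog's `ILogEventEnricher` and add these values as log event properties instead.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// With no Activity started, nothing should be stamped.
Console.WriteLine($"properties without an Activity: {TraceContextSketch.Capture().Count}");

// Force W3C ids so TraceId/SpanId are populated regardless of runtime defaults.
var demoActivity = new Activity("demo-request");
demoActivity.SetIdFormat(ActivityIdFormat.W3C);
demoActivity.Start(); // Start() makes this Activity.Current
try
{
    foreach (var pair in TraceContextSketch.Capture())
        Console.WriteLine($"{pair.Key}={pair.Value}");
}
finally
{
    demoActivity.Stop(); // restores the previous Activity.Current (none here)
}

// Hypothetical helper (assumption, not the ZB.MOM.WW.Telemetry.Serilog API):
// collects the correlation properties an enricher would attach to a log event.
public static class TraceContextSketch
{
    public static IReadOnlyDictionary<string, string> Capture()
    {
        var properties = new Dictionary<string, string>();

        var current = Activity.Current;
        if (current is null)
            return properties; // no active span: omit both properties

        properties["trace_id"] = current.TraceId.ToHexString();
        properties["span_id"] = current.SpanId.ToHexString();
        return properties;
    }
}
```

Returning an empty set rather than placeholder values keeps log events free of meaningless ids, which lines up with the "omits cleanly otherwise" test case listed under Testing & verification.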