scadaproj/docs/plans/2026-06-01-health-observability-components-design.md
Joseph Doherty 29b309c6c1 docs: design for health + observability normalization components
Adds the approved brainstorm design for the next two component-normalization
entries (Health #1, Observability #2 from upcoming.md):

- components/health/ -> ZB.MOM.WW.Health (3 dependency-split packages)
- components/observability/ -> ZB.MOM.WW.Telemetry (2 packages, 3 OTel signals
  + shared Serilog bootstrap)

Scope: normalization docs + build both libraries (.NET 10, tested, packed);
one sister-repo touch (MxGateway MEL->Serilog migration); no other app adoption.
Unifying hinge: one identity triple (service.name/site.id/node.role) feeds both
the OTel Resource and the Serilog enrichers.
2026-06-01 06:08:51 -04:00


Design — Health & Observability normalization components + shared libraries

Date: 2026-06-01 Status: Approved design (brainstorm output). Implementation plans follow separately (one per library) via the writing-plans workflow.

This design adds the next two entries to the component-normalization program, following the exact arc already used for Auth (ZB.MOM.WW.Auth) and UI-Theme (ZB.MOM.WW.Theme): normalize the concern in components/, then build the shared library in this repo. The two concerns are the top-ranked candidates in upcoming.md (Health #1, Observability #2 — the "operability cluster").

Scope decisions (locked during brainstorm)

  1. Deliverable depth — normalization docs + build both shared libraries (.NET 10, tested, dotnet pack). Not a docs-only pass.
  2. Structure — two separate components → two separate libraries (one component = one library, per house precedent): components/health/ → ZB.MOM.WW.Health; components/observability/ → ZB.MOM.WW.Telemetry. A future ZB.MOM.WW.Hosting aggregator can bundle both behind one call.
  3. Telemetry reach — all three OpenTelemetry signals (metrics + traces + logs), including a shared Serilog bootstrap, enrichers, and trace↔log correlation.
  4. Sister-repo touch — exactly one: migrate MxAccessGateway off Microsoft.Extensions.Logging onto the shared Serilog bootstrap. No other app adoption — wiring Health/Telemetry into the three apps stays a future GAPS.md item, identical to where Auth and UI-Theme sit today.
  5. Packaging — dependency-split packages (mirrors Auth's 4-package split and the AspNetCore.HealthChecks.* ecosystem). Heavy probes live in opt-in satellites so MxGateway never transitively pulls Akka or EF.
  6. Current-state docs — full code-verified depth with file:line refs, per components/README.md's mandate (matching auth's current-state docs).

The unifying hinge

A single identity triple — service.name / site.id / node.role (+ host) — populates both the OpenTelemetry Resource and the Serilog enrichers. A metric, a span, and a log line from the same node therefore carry identical dimensions and join up in a backend. This symmetry is the reason Health and Telemetry are designed together even though they ship as separate libraries.
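
As a rough illustration of that hinge, the sketch below projects one identity object into both a Resource attribute set and a log-property set. `NodeIdentity`, its two methods, the `ServiceName` log property, and the sample values are hypothetical names chosen for this example; only the attribute keys and the SiteId / NodeRole / NodeHostname property names come from this design.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical values; in a real app these would come from configuration.
var identity = new NodeIdentity("mxgateway", "site-01", "gateway", Environment.MachineName);

// One source object, two projections: OTel Resource attributes and log properties.
foreach (var (key, value) in identity.ToResourceAttributes())
    Console.WriteLine($"resource  {key} = {value}");
foreach (var (key, value) in identity.ToLogProperties())
    Console.WriteLine($"log       {key} = {value}");

// Hypothetical type: the single identity shared by telemetry and logging.
public sealed record NodeIdentity(string ServiceName, string SiteId, string NodeRole, string HostName)
{
    // Keys as named in the design: service.name / site.id / node.role / host.name.
    public IReadOnlyDictionary<string, object> ToResourceAttributes() => new Dictionary<string, object>
    {
        ["service.name"] = ServiceName,
        ["site.id"] = SiteId,
        ["node.role"] = NodeRole,
        ["host.name"] = HostName,
    };

    // Property names follow the ScadaBridge enricher set; "ServiceName" is an assumption.
    public IReadOnlyDictionary<string, object> ToLogProperties() => new Dictionary<string, object>
    {
        ["ServiceName"] = ServiceName,
        ["SiteId"] = SiteId,
        ["NodeRole"] = NodeRole,
        ["NodeHostname"] = HostName,
    };
}
```

Because both dictionaries are derived from the same record, a change to the site or role cannot reach one signal without reaching the other.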

Repo layout

scadaproj/
├─ components/
│   ├─ health/                       NEW normalization component (docs)
│   │   ├─ README.md
│   │   ├─ spec/SPEC.md
│   │   ├─ shared-contract/ZB.MOM.WW.Health.md
│   │   ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│   │   └─ GAPS.md
│   └─ observability/                NEW normalization component (docs)
│       ├─ README.md
│       ├─ spec/SPEC.md
│       ├─ spec/METRIC-CONVENTIONS.md   (mirrors auth CANONICAL-ROLES / theme DESIGN-TOKENS)
│       ├─ shared-contract/ZB.MOM.WW.Telemetry.md
│       ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│       └─ GAPS.md
├─ ZB.MOM.WW.Health/                 NEW built library (nested git repo, .NET 10) → 3 nupkgs @ 0.1.0
├─ ZB.MOM.WW.Telemetry/              NEW built library (nested git repo, .NET 10) → 2 nupkgs @ 0.1.0
└─ docs/plans/
    ├─ 2026-06-01-zb-mom-ww-health-shared-library.md       (impl plan — from writing-plans)
    └─ 2026-06-01-zb-mom-ww-telemetry-shared-library.md     (impl plan — from writing-plans)

Index updates (same discipline as the prior two components): add both rows to components/README.md, the CLAUDE.md Component-normalization table, and check off Health + Observability in upcoming.md.

Code-verified current state (2026-06-01 scan)

Health

| | OtOpcUa | ScadaBridge | MxGateway |
|---|---|---|---|
| Endpoints | /health/ready, /health/active, /healthz | /health/ready, /health/active (no /healthz) | /health/live only (custom GatewayHealthReply) |
| Probes | Database, AkkaCluster, AdminRoleLeader | Database, AkkaCluster, ActiveNode | none (AddHealthChecks() called but unused) |
| Tagging | tags on the check | named + predicate, HealthChecks.UI.Client JSON | |
| Extra | | IActiveNodeGate route gate + HealthMonitoring/ domain pipeline | net48 x86 worker has no endpoint |

OtOpcUa and ScadaBridge both descend from the same "ScadaLink three-tier pattern" (OtOpcUa's HealthEndpoints.cs:13 says so), but their Akka/leader probe logic and DB probe technique already differ. MxGateway is not Akka-based, which makes keeping Akka out of its dependency graph a hard dependency-hygiene constraint.

Key refs: OtOpcUa src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/{HealthEndpoints,DatabaseHealthCheck,AkkaClusterHealthCheck,AdminRoleLeaderHealthCheck}.cs; ScadaBridge src/ZB.MOM.WW.ScadaBridge.Host/Health/{DatabaseHealthCheck,AkkaClusterHealthCheck,ActiveNodeHealthCheck,ActiveNodeGate}.cs + src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/; MxGateway src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:61,139–145.

Telemetry

| | OtOpcUa | MxGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK | full (WithMetrics+WithTracing) | none (hand-rolled System.Diagnostics.Metrics, no export) | none (OpenTelemetry.Api is a dangling CVE-patch ref) |
| Exporter | Prometheus /metrics | in-memory snapshot only (GetSnapshot()) | |
| Meter | ZB.MOM.WW.OtOpcUa | MxGateway.Server (13 ctr / 3 hist ms / 4 gauge) | |
| Tracing | ActivitySource (2 spans) | none | none |
| Resource / service.name | none anywhere | none | none |

Nobody sets a resource/service.name — the fleet is indistinguishable in a collector. Durations split s (OtOpcUa, OTel-correct) vs ms (MxGateway).

Key refs: OtOpcUa src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs + src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs; MxGateway src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs; ScadaBridge src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31 (Api only, zero usage).

Logging

Serilog in OtOpcUa (Program.cs:49) + ScadaBridge (LoggerConfigurationFactory.cs:28–126, enrichers SiteId/NodeRole/NodeHostname); MEL in MxGateway (appsettings.json, correlation via GatewayLogScope/BeginScope middleware + GatewayLogRedactor). ScadaBridge's enricher set is the cleanest and its property names match the Resource attributes Telemetry needs. Nobody enriches logs with trace_id/span_id.

Library design — ZB.MOM.WW.Health (3 packages)

ZB.MOM.WW.Health (core; deps: Microsoft.Extensions.Diagnostics.HealthChecks + ASP.NET Core abstractions)

  • Tier convention: canonical tags ready / active / live; app.MapZbHealth() maps all three — /health/ready (tag ready → can this node serve?), /health/active (tag active → is this the leader/active node?), /healthz (predicate _ => false → bare process liveness).
  • Canonical JSON response writer (lifts ScadaBridge's HealthChecks.UI.Client style to the default).
  • IActiveNodeGate seam (generalized from ScadaBridge's ActiveNodeGate) + MapZbHealth integration.
  • GrpcDependencyHealthCheck — "is my downstream gRPC dependency reachable" (MxGateway → worker; OtOpcUa → gateway channel).
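
The tier convention can be approximated with stock ASP.NET Core health-check endpoints. The following is a minimal sketch of the kind of wiring a helper such as MapZbHealth might wrap, not the library's implementation; the check names and the placeholder delegates are assumptions, and the canonical JSON writer and IActiveNodeGate integration are left out.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Placeholder checks (hypothetical names); real apps would register
// Database / Akka / gRPC probes here and tag them the same way.
builder.Services.AddHealthChecks()
    .AddCheck("dependencies", () => HealthCheckResult.Healthy(), tags: new[] { "ready" })
    .AddCheck("active-node", () => HealthCheckResult.Healthy(), tags: new[] { "active" });

var app = builder.Build();

// Readiness: run only the checks tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("ready"),
});

// Active: run only the checks tagged "active".
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("active"),
});

// Liveness: the predicate excludes every check, so any responding process reports Healthy.
app.MapHealthChecks("/healthz", new HealthCheckOptions { Predicate = _ => false });

app.Run();
```

Selecting by tag rather than by check name means an app can add or rename probes without touching the endpoint mapping, which is what lets the mapping live in a shared library.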

ZB.MOM.WW.Health.Akka (dep: Akka.Cluster)

  • AkkaClusterHealthCheck with a configurable status policy. Default = ScadaBridge's (Up/Joining=Healthy, Leaving/Exiting=Degraded, else Unhealthy); OtOpcUa's (self-Up-among-members → Healthy/Degraded) ships as a preset.
  • ActiveNodeHealthCheck with an optional role filter — role-less default gives ScadaBridge's ActiveNode (Up && leader); passing a role gives OtOpcUa's AdminRoleLeader behavior.
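
A status policy of this kind reduces to a small pure function, which is also what makes it easy to test table-driven. The sketch below encodes the default policy and the optional role filter using stand-in enums so it compiles without Akka.Cluster; `MemberState`, `Health`, `ClusterStatusPolicies`, and the delegate-based leader lookup are hypothetical names, and the OtOpcUa preset is not modeled.

```csharp
#nullable enable
using System;

Console.WriteLine(ClusterStatusPolicies.Default(MemberState.Leaving));   // prints "Degraded"
Console.WriteLine(ClusterStatusPolicies.IsActive(
    MemberState.Up, isClusterLeader: false,
    isRoleLeader: role => role == "admin", role: "admin"));              // prints "True"

// Stand-in enums so the sketch needs neither Akka.Cluster nor the health-check packages.
public enum MemberState { Joining, Up, Leaving, Exiting, Down, Removed }
public enum Health { Healthy, Degraded, Unhealthy }

public static class ClusterStatusPolicies
{
    // Default policy from the design: Up/Joining = Healthy, Leaving/Exiting = Degraded, else Unhealthy.
    public static Health Default(MemberState self) => self switch
    {
        MemberState.Up or MemberState.Joining => Health.Healthy,
        MemberState.Leaving or MemberState.Exiting => Health.Degraded,
        _ => Health.Unhealthy,
    };

    // Active-node rule: self is Up and is the leader. With no role, the cluster
    // leader is used; with a role, the leader for that role is used instead.
    public static bool IsActive(
        MemberState self, bool isClusterLeader, Func<string, bool> isRoleLeader, string? role = null)
        => self == MemberState.Up && (role is null ? isClusterLeader : isRoleLeader(role));
}
```

In the real package the inputs would come from Akka's cluster state; keeping the decision separate from that lookup is one way to support the faked-cluster-state tests listed under Testing & verification.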

ZB.MOM.WW.Health.EntityFrameworkCore (dep: EF Core)

  • DatabaseHealthCheck<TContext> — default probe CanConnectAsync() (ScadaBridge), optional probe-query delegate for OtOpcUa's "query Deployments" style.

Stays per-project: which probes each app registers; orchestrator/Traefik wiring; ScadaBridge's HealthMonitoring/ domain aggregation pipeline (distributed domain health, not an ASP.NET probe).

Consumer matrix: MxGateway → core only (+ gRPC-dep probe; no Akka/EF); OtOpcUa & ScadaBridge → all three.

Library design — ZB.MOM.WW.Telemetry (2 packages)

ZB.MOM.WW.Telemetry (OTel metrics + traces; deps: OpenTelemetry SDK + hosting/exporter)

  • builder.AddZbTelemetry(options) — the missing front door:
    builder.AddZbTelemetry(o => {
        o.ServiceName = "mxgateway";        // → Resource service.name
        o.ServiceNamespace = "ZB.MOM.WW";   // constant across the fleet
        o.SiteId = cfg.SiteId;              // → resource attr site.id
        o.NodeRole = cfg.NodeRole;          // → resource attr node.role
        o.Meters = ["MxGateway.Server"];    // app's own Meter name(s)
        o.ActivitySources = [...];          // app's own ActivitySource name(s)
        o.Exporter = Prometheus;            // default; OTLP opt-in
    });
    app.MapZbMetrics();                     // Prometheus /metrics
    
  • Shared Resource: service.name + service.namespace + service.version + site.id + node.role + host.name. The headline fix — nobody sets this today.
  • Standard instrumentation everyone should have (only OtOpcUa has it now): ASP.NET Core, gRPC client, HttpClient, runtime + process meters.
  • Exporter: Prometheus /metrics default; OTLP opt-in via options (path to a real collector).
  • App instruments stay per-project. MxGateway's hand-rolled GatewayMetrics keeps its 13/3/4 instruments but its Meter is registered through AddZbTelemetry so it finally exports instead of dying in an in-memory snapshot.

ZB.MOM.WW.Telemetry.Serilog (logs signal + Serilog convergence; deps: Serilog + the core package)

  • AddZbSerilog() — shared two-stage bootstrap generalizing ScadaBridge's LoggerConfigurationFactory (ReadFrom.Configuration for sinks + explicit MinimumLevel.Is override).
  • Shared enrichers SiteId / NodeRole / NodeHostname, bound from the same options object as the OTel Resource so logs and metrics carry identical dimensions.
  • NEW TraceContextEnricher — stamps trace_id/span_id from Activity.Current onto every log event (makes a log line clickable from a trace; nobody has this today).
  • OTel log export — logs flow through the OTel pipeline with the same Resource (all three signals correlated in a backend).
  • ILogRedactor seam — generalized from MxGateway's GatewayLogRedactor (the only app with real secret redaction). The seam is shared; the policy (which fields/commands) stays per-project.
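
The trace-context enricher is small enough to sketch against Serilog's public enricher interface. This is one possible shape, assuming the Serilog package is referenced; the class name mirrors the design, but the body is an assumption rather than the library's code.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Sketch only: stamps the ambient Activity's identifiers onto each log event.
public sealed class TraceContextEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            return; // No active span: add nothing rather than emit empty IDs.
        }

        // For W3C-format activities these render as lowercase hex (32 and 16 characters).
        logEvent.AddPropertyIfAbsent(
            new LogEventProperty("trace_id", new ScalarValue(activity.TraceId.ToString())));
        logEvent.AddPropertyIfAbsent(
            new LogEventProperty("span_id", new ScalarValue(activity.SpanId.ToString())));
    }
}
```

It would be registered with `new LoggerConfiguration().Enrich.With<TraceContextEnricher>()`. Using AddPropertyIfAbsent leaves any trace_id that a caller has already attached untouched.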

Convergence the spec pins down: Meter name = <app> namespace; instrument name = <app>.<subsystem>.<event>; duration unit = seconds (OTel semconv) — flags MxGateway's ms histograms as a convergence item.
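
To make the seconds convention concrete, here is a minimal sketch that records a duration histogram in seconds and observes it in-process with a MeterListener. `GatewayTimings` and the instrument name `mxgateway.worker.call.duration` are hypothetical examples of the `<app>.<subsystem>.<event>` pattern, not names taken from MxGateway.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Listen in-process so the sketch can print what was recorded.
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == GatewayTimings.MeterName)
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    Console.WriteLine($"{instrument.Name} [{instrument.Unit}] = {value}"));
listener.Start();

long start = Stopwatch.GetTimestamp();
// ... the timed operation would run here ...
GatewayTimings.RecordCall(Stopwatch.GetElapsedTime(start));

public static class GatewayTimings
{
    public const string MeterName = "MxGateway.Server";

    private static readonly Meter AppMeter = new(MeterName);

    // Hypothetical instrument name; unit "s" follows the OTel semantic conventions.
    private static readonly Histogram<double> CallDuration =
        AppMeter.CreateHistogram<double>("mxgateway.worker.call.duration", unit: "s");

    // Converting at the recording boundary keeps callers free to measure however they like.
    public static void RecordCall(TimeSpan elapsed) => CallDuration.Record(elapsed.TotalSeconds);
}
```

Taking a TimeSpan rather than a raw number at the API boundary removes the ms-versus-s ambiguity from call sites, so migrating an existing millisecond histogram becomes a change in one place.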

The one adoption — MxGateway MEL → Serilog

Replace WebApplicationBuilder default logging with AddZbSerilog(); re-express the GatewayLogScope/BeginScope correlation middleware as a Serilog LogContext.PushProperty scope; move GatewayLogRedactor behind the shared ILogRedactor seam. The net48 x86 worker's IWorkerLogger (stderr key=value) stays bespoke — out of process and out of scope.

Normalization component docs

Both trees follow components/README.md's six-part layout (matching auth + ui-theme). Each spec opens with a Section 0 stating normalized vs. left-per-project explicitly. observability/ adds one reference doc — spec/METRIC-CONVENTIONS.md — mirroring auth's CANONICAL-ROLES.md / theme's DESIGN-TOKENS.md. Three current-state/<project>/CURRENT-STATE.md per component at full code-verified depth, each ending in an Adoption plan. GAPS.md turns deltas into a prioritized backlog (MxGateway "no probes" + "MEL→Serilog" are top entries; ms→s and the missing Resource are convergence items). Both register at status Draft (Draft → Reviewed → Adopting → Converged).

Testing & verification

Every package ships tests (mirrors auth's 172 / theme's 32; dotnet test from each library root):

  • Health — WebApplicationFactory tests for the three tiers + JSON shape; IActiveNodeGate gates a route (200 active / 503 standby); GrpcDependencyHealthCheck on a stub channel.
  • Health.Akka — table-driven status-policy + role-filter unit tests over faked cluster state.
  • Health.EntityFrameworkCore — DatabaseHealthCheck<T> against SQLite in-memory (healthy / broken context / custom probe delegate).
  • Telemetry — Resource carries every options attribute; in-memory exporter sees a registered app Meter's instrument; MapZbMetrics serves Prometheus text.
  • Telemetry.Serilog — in-memory/TestCorrelator sink asserts enricher properties present; TraceContextEnricher stamps trace_id/span_id under an active Activity and omits cleanly otherwise; ILogRedactor scrubs a policy-marked secret.
  • MxGateway migration — existing MxGateway.Tests (fake worker) still green + correlation scope still emits + secrets still redacted.

Verification gates (evidence, not assertions): each library dotnet test green + dotnet pack produces nupkgs @ 0.1.0; MxGateway dotnet build src/MxGateway.sln + dotnet test green.

Build order

1. components/health/ + components/observability/ docs   (spec first — drives the APIs)
2. ZB.MOM.WW.Health        (3 pkgs)                       ─┐ parallelizable
3. ZB.MOM.WW.Telemetry     (core: metrics+traces)         ─┘
4. ZB.MOM.WW.Telemetry.Serilog   (needs #3)
5. MxGateway MEL→Serilog migration  (needs #4)            ← the one sister-repo touch
6. Index/registry updates + GAPS cross-check

Implementation tasks (native task IDs)

  • #7 Build ZB.MOM.WW.Health library (3 packages)
  • #8 Build ZB.MOM.WW.Telemetry library (2 packages)
  • #9 Migrate MxGateway logging MEL → shared Serilog (sister-repo) — blocked by #8
  • #10 Author components/health/ normalization docs
  • #11 Author components/observability/ normalization docs

Dependency: #9 blocked by #8 (needs ZB.MOM.WW.Telemetry.Serilog). Docs (#10/#11) precede the libraries logically (spec drives API) but can be drafted in parallel from the captured current-state.