Adds the approved brainstorm design for the next two component-normalization entries (Health #1, Observability #2 from upcoming.md):
- components/health/ -> ZB.MOM.WW.Health (3 dependency-split packages)
- components/observability/ -> ZB.MOM.WW.Telemetry (2 packages, 3 OTel signals + shared Serilog bootstrap)

Scope: normalization docs + build both libraries (.NET 10, tested, packed); one sister-repo touch (MxGateway MEL->Serilog migration); no other app adoption.

Unifying hinge: one identity triple (service.name/site.id/node.role) feeds both the OTel Resource and the Serilog enrichers.
Design — Health & Observability normalization components + shared libraries
Date: 2026-06-01
Status: Approved design (brainstorm output). Implementation plans follow separately (one per library) via the writing-plans workflow.
This design adds the next two entries to the component-normalization
program, following the exact arc already used for Auth (ZB.MOM.WW.Auth) and UI-Theme
(ZB.MOM.WW.Theme): normalize the concern in components/, then build the shared library in this
repo. The two concerns are the top-ranked candidates in upcoming.md (Health #1,
Observability #2 — the "operability cluster").
Scope decisions (locked during brainstorm)
- Deliverable depth: normalization docs + build both shared libraries (.NET 10, tested,
  dotnet pack). Not a docs-only pass.
- Structure: two separate components → two separate libraries (one component = one library,
  per house precedent): components/health/ → ZB.MOM.WW.Health; components/observability/ →
  ZB.MOM.WW.Telemetry. A future ZB.MOM.WW.Hosting aggregator can bundle both behind one call.
- Telemetry reach: all three OpenTelemetry signals (metrics + traces + logs), including a
  shared Serilog bootstrap, enrichers, and trace↔log correlation.
- Sister-repo touch: exactly one. Migrate MxAccessGateway off Microsoft.Extensions.Logging
  onto the shared Serilog bootstrap. No other app adoption: wiring Health/Telemetry into the
  three apps stays a future GAPS.md item, identical to where Auth and UI-Theme sit today.
- Packaging: dependency-split packages (mirrors Auth's 4-package split and the
  AspNetCore.HealthChecks.* ecosystem). Heavy probes live in opt-in satellites so MxGateway
  never transitively pulls Akka or EF.
- Current-state docs: full code-verified depth with file:line refs, per
  components/README.md's mandate (matching auth's current-state docs).
The unifying hinge
A single identity triple — service.name / site.id / node.role (+ host) — populates both the
OpenTelemetry Resource and the Serilog enrichers. A metric, a span, and a log line from the same
node therefore carry identical dimensions and join up in a backend. This symmetry is the reason
Health and Telemetry are designed together even though they ship as separate libraries.
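
To make the hinge concrete, here is a minimal sketch of one identity object projected two ways. The NodeIdentity record and its two methods are hypothetical names chosen for illustration; only the attribute keys (service.name, site.id, node.role, host.name) and the enricher property names (SiteId, NodeRole, NodeHostname) come from this design.

```csharp
using System.Collections.Generic;

// Hypothetical type: the single source of truth for a node's identity.
public sealed record NodeIdentity(string ServiceName, string SiteId, string NodeRole, string HostName)
{
    // Projection 1: attributes destined for the OpenTelemetry Resource.
    public IReadOnlyDictionary<string, object> ToResourceAttributes() =>
        new Dictionary<string, object>
        {
            ["service.name"] = ServiceName,
            ["site.id"] = SiteId,
            ["node.role"] = NodeRole,
            ["host.name"] = HostName,
        };

    // Projection 2: properties destined for the Serilog enrichers.
    // The values are the same objects as above, so both signals agree by construction.
    public IReadOnlyDictionary<string, object> ToLogProperties() =>
        new Dictionary<string, object>
        {
            ["SiteId"] = SiteId,
            ["NodeRole"] = NodeRole,
            ["NodeHostname"] = HostName,
        };
}
```

Because both projections read the same fields, a metric and a log line emitted by one node cannot drift apart in their dimensions; how the real libraries bind these values is left to the implementation plans.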
Repo layout
scadaproj/
├─ components/
│ ├─ health/ NEW normalization component (docs)
│ │ ├─ README.md
│ │ ├─ spec/SPEC.md
│ │ ├─ shared-contract/ZB.MOM.WW.Health.md
│ │ ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│ │ └─ GAPS.md
│ └─ observability/ NEW normalization component (docs)
│ ├─ README.md
│ ├─ spec/SPEC.md
│ ├─ spec/METRIC-CONVENTIONS.md (mirrors auth CANONICAL-ROLES / theme DESIGN-TOKENS)
│ ├─ shared-contract/ZB.MOM.WW.Telemetry.md
│ ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│ └─ GAPS.md
├─ ZB.MOM.WW.Health/ NEW built library (nested git repo, .NET 10) → 3 nupkgs @ 0.1.0
├─ ZB.MOM.WW.Telemetry/ NEW built library (nested git repo, .NET 10) → 2 nupkgs @ 0.1.0
└─ docs/plans/
├─ 2026-06-01-zb-mom-ww-health-shared-library.md (impl plan — from writing-plans)
└─ 2026-06-01-zb-mom-ww-telemetry-shared-library.md (impl plan — from writing-plans)
Index updates (same discipline as the prior two components): add both rows to
components/README.md, the CLAUDE.md Component-normalization table, and check off Health +
Observability in upcoming.md.
Code-verified current state (2026-06-01 scan)
Health
| | OtOpcUa | ScadaBridge | MxGateway |
|---|---|---|---|
| Endpoints | /health/ready, /health/active, /healthz | /health/ready, /health/active (no /healthz) | /health/live only (custom GatewayHealthReply) |
| Probes | Database, AkkaCluster, AdminRoleLeader | Database, AkkaCluster, ActiveNode | none (AddHealthChecks() called but unused) |
| Tagging | tags on the check | named + predicate, HealthChecks.UI.Client JSON | n/a |
| Extra | n/a | IActiveNodeGate route gate + HealthMonitoring/ domain pipeline | net48 x86 worker has no endpoint |
OtOpcUa and ScadaBridge both descend from the same "ScadaLink three-tier pattern" (OtOpcUa's
HealthEndpoints.cs:13 says so), but the Akka/leader probe logic and the DB probe technique already
differ. MxGateway is not Akka-based, so keeping Akka out of its dependency graph is a hard
dependency-hygiene constraint.
Key refs: OtOpcUa src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/{HealthEndpoints,DatabaseHealthCheck,AkkaClusterHealthCheck,AdminRoleLeaderHealthCheck}.cs;
ScadaBridge src/ZB.MOM.WW.ScadaBridge.Host/Health/{DatabaseHealthCheck,AkkaClusterHealthCheck,ActiveNodeHealthCheck,ActiveNodeGate}.cs + src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/;
MxGateway src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:61,139–145.
Telemetry
| | OtOpcUa | MxGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK | full (WithMetrics+WithTracing) | none (hand-rolled System.Diagnostics.Metrics, no export) | none (OpenTelemetry.Api is a dangling CVE-patch ref) |
| Exporter | Prometheus /metrics | in-memory snapshot only (GetSnapshot()) | n/a |
| Meter | ZB.MOM.WW.OtOpcUa | MxGateway.Server (13 ctr / 3 hist ms / 4 gauge) | n/a |
| Tracing | ActivitySource (2 spans) | none | none |
| Resource / service.name | none anywhere | none | none |
Nobody sets a resource/service.name, so the fleet is indistinguishable in a collector. Duration
units are also split: seconds (OtOpcUa, OTel-correct) versus milliseconds (MxGateway).
Key refs: OtOpcUa src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs +
src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs;
MxGateway src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs;
ScadaBridge src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31 (Api only, zero usage).
Logging
Serilog in OtOpcUa (Program.cs:49) + ScadaBridge (LoggerConfigurationFactory.cs:28–126,
enrichers SiteId/NodeRole/NodeHostname); MEL in MxGateway (appsettings.json, correlation via
GatewayLogScope/BeginScope middleware + GatewayLogRedactor). ScadaBridge's enricher set is the
cleanest and its property names match the Resource attributes Telemetry needs. Nobody enriches logs
with trace_id/span_id.
Library design — ZB.MOM.WW.Health (3 packages)
① ZB.MOM.WW.Health (core; deps: Microsoft.Extensions.Diagnostics.HealthChecks + ASP.NET Core abstractions)
- Tier convention: canonical tags ready / active / live; app.MapZbHealth() maps all three:
  /health/ready (tag ready → can this node serve?), /health/active (tag active → is this the
  leader/active node?), /healthz (predicate _ => false → bare process liveness).
- Canonical JSON response writer (lifts ScadaBridge's HealthChecks.UI.Client style to the default).
- IActiveNodeGate seam (generalized from ScadaBridge's ActiveNodeGate) + MapZbHealth integration.
- GrpcDependencyHealthCheck: "is my downstream gRPC dependency reachable" (MxGateway → worker;
  OtOpcUa → gateway channel).
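
The tier convention above could be expressed with the stock ASP.NET Core health-check endpoints roughly as follows. This is a sketch, not the library's implementation: MapZbHealthSketch is a hypothetical name, and the JSON response writer and IActiveNodeGate integration are deliberately left out. MapHealthChecks and HealthCheckOptions.Predicate are the existing framework APIs.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;

public static class ZbHealthEndpointSketch
{
    // Hypothetical stand-in for MapZbHealth(): one endpoint per tier.
    public static IEndpointRouteBuilder MapZbHealthSketch(this IEndpointRouteBuilder endpoints)
    {
        // Readiness: run only the checks tagged "ready".
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("ready"),
        });

        // Active/leader: run only the checks tagged "active".
        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("active"),
        });

        // Liveness: the predicate rejects every check, so the endpoint
        // reports healthy whenever the process can answer HTTP at all.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        });

        return endpoints;
    }
}
```

An app would still register its own probes with the matching tags, which is the part the design keeps per-project.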
② ZB.MOM.WW.Health.Akka (dep: Akka.Cluster)
- AkkaClusterHealthCheck with a configurable status policy. Default = ScadaBridge's
  (Up/Joining = Healthy, Leaving/Exiting = Degraded, else Unhealthy); OtOpcUa's
  (self-Up-among-members → Healthy/Degraded) ships as a preset.
- ActiveNodeHealthCheck with an optional role filter: the role-less default gives ScadaBridge's
  ActiveNode (Up && leader); passing a role gives OtOpcUa's AdminRoleLeader behavior.
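
A dependency-free sketch of the two ideas in this package, a pluggable status policy and an optional role filter, might look like the following. NodeStatus and Health are stand-in enums introduced here so the example compiles without Akka.Cluster or the health-check package; the real checks would use the library types, and every name below is hypothetical.

```csharp
using System;

// Stand-in for the cluster member status (hypothetical, keeps the sketch dependency-free).
public enum NodeStatus { Joining, Up, Leaving, Exiting, Down, Removed }

// Stand-in for the health-check result levels.
public enum Health { Unhealthy, Degraded, Healthy }

public static class ClusterStatusPolicies
{
    // The default policy described above: Up/Joining healthy,
    // Leaving/Exiting degraded, anything else unhealthy.
    public static readonly Func<NodeStatus, Health> ScadaBridgeDefault = status => status switch
    {
        NodeStatus.Up or NodeStatus.Joining => Health.Healthy,
        NodeStatus.Leaving or NodeStatus.Exiting => Health.Degraded,
        _ => Health.Unhealthy,
    };

    // Role filter: leaderOf(null) answers "who is the cluster leader?",
    // leaderOf("admin") answers "who leads the admin role?".
    public static bool IsActiveNode(
        NodeStatus self,
        string selfAddress,
        Func<string?, string?> leaderOf,
        string? role = null)
        => self == NodeStatus.Up && leaderOf(role) == selfAddress;
}
```

Modeling the policy as a delegate is one way to let the OtOpcUa preset ship alongside the default without subclassing; the actual shape of the option is for the implementation plan to decide.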
③ ZB.MOM.WW.Health.EntityFrameworkCore (dep: EF Core)
- DatabaseHealthCheck<TContext>: default probe CanConnectAsync() (ScadaBridge), optional
  probe-query delegate for OtOpcUa's "query Deployments" style.
Stays per-project: which probes each app registers; orchestrator/Traefik wiring; ScadaBridge's
HealthMonitoring/ domain aggregation pipeline (distributed domain health, not an ASP.NET probe).
Consumer matrix: MxGateway → core only (+ gRPC-dep probe; no Akka/EF); OtOpcUa & ScadaBridge → all three.
Library design — ZB.MOM.WW.Telemetry (2 packages)
① ZB.MOM.WW.Telemetry (OTel metrics + traces; deps: OpenTelemetry SDK + hosting/exporter)
- builder.AddZbTelemetry(options): the missing front door:

      builder.AddZbTelemetry(o =>
      {
          o.ServiceName      = "mxgateway";          // → Resource service.name
          o.ServiceNamespace = "ZB.MOM.WW";          // constant across the fleet
          o.SiteId           = cfg.SiteId;           // → resource attr site.id
          o.NodeRole         = cfg.NodeRole;         // → resource attr node.role
          o.Meters           = ["MxGateway.Server"]; // app's own Meter name(s)
          o.ActivitySources  = [...];                // app's own ActivitySource name(s)
          o.Exporter         = Prometheus;           // default; OTLP opt-in
      });
      app.MapZbMetrics();                            // Prometheus /metrics

- Shared Resource: service.name + service.namespace + service.version + site.id + node.role +
  host.name. The headline fix: nobody sets this today.
- Standard instrumentation everyone should have (only OtOpcUa has it now): ASP.NET Core, gRPC
  client, HttpClient, runtime + process meters.
- Exporter: Prometheus /metrics default; OTLP opt-in via options (path to a real collector).
- App instruments stay per-project. MxGateway's hand-rolled GatewayMetrics keeps its 13/3/4
  instruments but its Meter is registered through AddZbTelemetry so it finally exports instead of
  dying in an in-memory snapshot.
② ZB.MOM.WW.Telemetry.Serilog (logs signal + Serilog convergence; deps: Serilog + the core package)
- AddZbSerilog(): shared two-stage bootstrap generalizing ScadaBridge's
  LoggerConfigurationFactory (ReadFrom.Configuration for sinks + explicit MinimumLevel.Is override).
- Shared enrichers SiteId / NodeRole / NodeHostname, bound from the same options object as the
  OTel Resource so logs and metrics carry identical dimensions.
- NEW TraceContextEnricher: stamps trace_id / span_id from Activity.Current onto every log event
  (makes a log line clickable from a trace; nobody has this today).
- OTel log export: logs flow through the OTel pipeline with the same Resource (all three signals
  correlated in a backend).
- ILogRedactor seam: generalized from MxGateway's GatewayLogRedactor (the only app with real
  secret redaction). The seam is shared; the policy (which fields/commands) stays per-project.
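
The TraceContextEnricher described above could be as small as the sketch below, written against Serilog's ILogEventEnricher interface. The class name carries a Sketch suffix because the shipped type may differ; the property names trace_id and span_id are the ones this design specifies. It assumes the default W3C activity ID format, under which Activity.TraceId and Activity.SpanId are populated.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Sketch only: stamps the ambient trace context onto each log event.
public sealed class TraceContextEnricherSketch : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            // No active span: add nothing rather than emit empty IDs.
            return;
        }

        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToString()));
    }
}
```

It would be attached with the usual Serilog call, for example `new LoggerConfiguration().Enrich.With(new TraceContextEnricherSketch())`. Using AddPropertyIfAbsent means an application that already sets these properties explicitly is not overwritten.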
Convergence the spec pins down: Meter name = <app> namespace; instrument name =
<app>.<subsystem>.<event>; duration unit = seconds (OTel semconv) — flags MxGateway's ms
histograms as a convergence item.
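
As an illustration of those conventions, a duration histogram following them might be declared as below using System.Diagnostics.Metrics. The Meter name ZB.MOM.WW.MxGateway and the instrument name mxgateway.worker.call.duration are assumptions made up for this example, not names taken from GatewayMetrics.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class GatewayMetricsSketch
{
    // Meter name = the app's namespace (hypothetical value).
    public static readonly Meter Meter = new("ZB.MOM.WW.MxGateway");

    // Instrument name = <app>.<subsystem>.<event>; unit is seconds, not milliseconds.
    public static readonly Histogram<double> WorkerCallDuration =
        Meter.CreateHistogram<double>(
            "mxgateway.worker.call.duration",
            unit: "s",
            description: "Duration of a call to the worker, in seconds.");

    // Times a call and records the elapsed time in seconds, even if the call throws.
    public static T Time<T>(Func<T> call)
    {
        var start = Stopwatch.GetTimestamp();
        try
        {
            return call();
        }
        finally
        {
            WorkerCallDuration.Record(Stopwatch.GetElapsedTime(start).TotalSeconds);
        }
    }
}
```

Recording TotalSeconds at the call site is what converging the existing ms histograms would amount to: the instrument keeps its meaning while the unit and recorded value change together.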
The one adoption — MxGateway MEL → Serilog
Replace WebApplicationBuilder default logging with AddZbSerilog(); re-express the
GatewayLogScope/BeginScope correlation middleware as a Serilog LogContext.PushProperty scope;
move GatewayLogRedactor behind the shared ILogRedactor seam. The net48 x86 worker's IWorkerLogger
(stderr key=value) stays bespoke — out of process and out of scope.
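
For the correlation-scope step, a Serilog equivalent of a BeginScope middleware could look like this sketch. The property name CorrelationId and the use of HttpContext.TraceIdentifier as its value are assumptions; what GatewayLogScope actually carries is not stated here, so the real migration would push whatever that scope pushes today.

```csharp
using Microsoft.AspNetCore.Builder;
using Serilog.Context;

public static class CorrelationScopeSketch
{
    // Hypothetical middleware: every log event written while the request
    // is in flight carries the pushed property.
    public static IApplicationBuilder UseCorrelationScopeSketch(this IApplicationBuilder app) =>
        app.Use(async (context, next) =>
        {
            // PushProperty returns an IDisposable; disposing it pops the property,
            // so the scope ends exactly when the request does.
            using (LogContext.PushProperty("CorrelationId", context.TraceIdentifier))
            {
                await next(context);
            }
        });
}
```

Pushed properties only appear on events if the logger is configured with Enrich.FromLogContext(), so the shared bootstrap would need to include that call for this pattern to work.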
Normalization component docs
Both trees follow components/README.md's six-part layout (matching auth + ui-theme). Each spec
opens with a Section 0 stating normalized vs. left-per-project explicitly. observability/ adds one
reference doc — spec/METRIC-CONVENTIONS.md — mirroring auth's CANONICAL-ROLES.md / theme's
DESIGN-TOKENS.md. Three current-state/<project>/CURRENT-STATE.md per component at full
code-verified depth, each ending in an Adoption plan. GAPS.md turns deltas into a prioritized
backlog (MxGateway "no probes" + "MEL→Serilog" are top entries; ms→s and the missing Resource are
convergence items). Both register at status Draft (Draft → Reviewed → Adopting → Converged).
Testing & verification
Every package ships tests (mirrors auth's 172 / theme's 32; dotnet test from each library root):
- Health: WebApplicationFactory tests for the three tiers + JSON shape; IActiveNodeGate gates a
  route (200 active / 503 standby); GrpcDependencyHealthCheck on a stub channel.
- Health.Akka: table-driven status-policy + role-filter unit tests over faked cluster state.
- Health.EntityFrameworkCore: DatabaseHealthCheck<T> against SQLite in-memory (healthy / broken
  context / custom probe delegate).
- Telemetry: Resource carries every options attribute; in-memory exporter sees a registered app
  Meter's instrument; MapZbMetrics serves Prometheus text.
- Telemetry.Serilog: in-memory/TestCorrelator sink asserts enricher properties present;
  TraceContextEnricher stamps trace_id / span_id under an active Activity and omits cleanly
  otherwise; ILogRedactor scrubs a policy-marked secret.
- MxGateway migration: existing MxGateway.Tests (fake worker) still green + correlation scope
  still emits + secrets still redacted.
Verification gates (evidence, not assertions): each library dotnet test green + dotnet pack
produces nupkgs @ 0.1.0; MxGateway dotnet build src/MxGateway.sln + dotnet test green.
Build order
1. components/health/ + components/observability/ docs (spec first — drives the APIs)
2. ZB.MOM.WW.Health (3 pkgs) ─┐ parallelizable
3. ZB.MOM.WW.Telemetry (core: metrics+traces) ─┘
4. ZB.MOM.WW.Telemetry.Serilog (needs #3)
5. MxGateway MEL→Serilog migration (needs #4) ← the one sister-repo touch
6. Index/registry updates + GAPS cross-check
Implementation tasks (native task IDs)
- #7 Build ZB.MOM.WW.Health library (3 packages)
- #8 Build ZB.MOM.WW.Telemetry library (2 packages)
- #9 Migrate MxGateway logging MEL → shared Serilog (sister-repo), blocked by #8
- #10 Author components/health/ normalization docs
- #11 Author components/observability/ normalization docs
Dependency: #9 blocked by #8 (needs ZB.MOM.WW.Telemetry.Serilog). Docs (#10/#11) precede the
libraries logically (spec drives API) but can be drafted in parallel from the captured current-state.