docs: design for health + observability normalization components
Adds the approved brainstorm design for the next two component-normalization
entries (Health #1, Observability #2 from upcoming.md):

- components/health/ -> ZB.MOM.WW.Health (3 dependency-split packages)
- components/observability/ -> ZB.MOM.WW.Telemetry (2 packages, 3 OTel signals + shared Serilog bootstrap)

Scope: normalization docs + build both libraries (.NET 10, tested, packed);
one sister-repo touch (MxGateway MEL->Serilog migration); no other app adoption.

Unifying hinge: one identity triple (service.name/site.id/node.role) feeds
both the OTel Resource and the Serilog enrichers.
@@ -0,0 +1,232 @@
# Design — Health & Observability normalization components + shared libraries

Date: 2026-06-01

Status: **Approved design** (brainstorm output). Implementation plans follow separately
(one per library) via the writing-plans workflow.

This design adds the next two entries to the [component-normalization](../../components/README.md)
program, following the exact arc already used for **Auth** (`ZB.MOM.WW.Auth`) and **UI-Theme**
(`ZB.MOM.WW.Theme`): normalize the concern in `components/`, then build the shared library in this
repo. The two concerns are the top-ranked candidates in [`upcoming.md`](../../upcoming.md) (Health #1,
Observability #2 — the "operability cluster").

## Scope decisions (locked during brainstorm)

1. **Deliverable depth** — normalization docs **+ build both shared libraries** (.NET 10, tested,
   `dotnet pack`). *Not* a docs-only pass.
2. **Structure** — two separate components → two separate libraries (one component = one library,
   per house precedent): `components/health/` → `ZB.MOM.WW.Health`; `components/observability/` →
   `ZB.MOM.WW.Telemetry`. A future `ZB.MOM.WW.Hosting` aggregator can bundle both behind one call.
3. **Telemetry reach** — all three OpenTelemetry signals (metrics + traces + logs), including a shared
   Serilog bootstrap, enrichers, and trace↔log correlation.
4. **Sister-repo touch** — exactly one: migrate **MxAccessGateway** off `Microsoft.Extensions.Logging`
   onto the shared Serilog bootstrap. No other app adoption — wiring Health/Telemetry into the three
   apps stays a future `GAPS.md` item, identical to where Auth and UI-Theme sit today.
5. **Packaging** — dependency-split packages (mirrors Auth's 4-package split and the
   `AspNetCore.HealthChecks.*` ecosystem). Heavy probes live in opt-in satellites so MxGateway never
   transitively pulls Akka or EF.
6. **Current-state docs** — full code-verified depth with `file:line` refs, per
   `components/README.md`'s mandate (matching auth's current-state docs).

## The unifying hinge

A single identity triple — `service.name` / `site.id` / `node.role` (+ host) — populates **both** the
OpenTelemetry `Resource` **and** the Serilog enrichers. A metric, a span, and a log line from the same
node therefore carry identical dimensions and join up in a backend. This symmetry is the reason
Health and Telemetry are designed together even though they ship as separate libraries.

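The hinge can be pictured as one immutable identity object with two projections. This is a minimal sketch only: `NodeIdentity` and its two methods are hypothetical names introduced for illustration, not the library's API. The attribute keys and enricher property names are the ones this design already mentions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type (not part of the design): a single source of identity
// that both the OTel Resource and the Serilog enrichers would read from.
public sealed record NodeIdentity(string ServiceName, string SiteId, string NodeRole, string HostName)
{
    // Resource attributes in the dotted form used by this design.
    public IReadOnlyDictionary<string, object> ToResourceAttributes() =>
        new Dictionary<string, object>
        {
            ["service.name"] = ServiceName,
            ["site.id"] = SiteId,
            ["node.role"] = NodeRole,
            ["host.name"] = HostName,
        };

    // Log properties using the enricher names ScadaBridge already has.
    public IReadOnlyDictionary<string, object> ToLogProperties() =>
        new Dictionary<string, object>
        {
            ["SiteId"] = SiteId,
            ["NodeRole"] = NodeRole,
            ["NodeHostname"] = HostName,
        };
}
```

Because both projections read the same fields, a value such as `site.id` cannot drift between the metrics and the logs of one node.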
## Repo layout

```
scadaproj/
├─ components/
│  ├─ health/                  NEW normalization component (docs)
│  │  ├─ README.md
│  │  ├─ spec/SPEC.md
│  │  ├─ shared-contract/ZB.MOM.WW.Health.md
│  │  ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│  │  └─ GAPS.md
│  └─ observability/           NEW normalization component (docs)
│     ├─ README.md
│     ├─ spec/SPEC.md
│     ├─ spec/METRIC-CONVENTIONS.md   (mirrors auth CANONICAL-ROLES / theme DESIGN-TOKENS)
│     ├─ shared-contract/ZB.MOM.WW.Telemetry.md
│     ├─ current-state/{otopcua,mxaccessgw,scadabridge}/CURRENT-STATE.md
│     └─ GAPS.md
├─ ZB.MOM.WW.Health/           NEW built library (nested git repo, .NET 10) → 3 nupkgs @ 0.1.0
├─ ZB.MOM.WW.Telemetry/        NEW built library (nested git repo, .NET 10) → 2 nupkgs @ 0.1.0
└─ docs/plans/
   ├─ 2026-06-01-zb-mom-ww-health-shared-library.md      (impl plan — from writing-plans)
   └─ 2026-06-01-zb-mom-ww-telemetry-shared-library.md   (impl plan — from writing-plans)
```

Index updates (same discipline as the prior two components): add both rows to
`components/README.md`, the `CLAUDE.md` Component-normalization table, and check off Health +
Observability in `upcoming.md`.

## Code-verified current state (2026-06-01 scan)

### Health
| Aspect | OtOpcUa | ScadaBridge | MxGateway |
|---|---|---|---|
| Endpoints | `/health/ready`, `/health/active`, `/healthz` | `/health/ready`, `/health/active` (no `/healthz`) | `/health/live` only (custom `GatewayHealthReply`) |
| Probes | Database, AkkaCluster, AdminRoleLeader | Database, AkkaCluster, ActiveNode | **none** (`AddHealthChecks()` called but unused) |
| Tagging | tags on the check | named + predicate, `HealthChecks.UI.Client` JSON | — |
| Extra | — | `IActiveNodeGate` route gate + `HealthMonitoring/` domain pipeline | net48 x86 worker has no endpoint |

OtOpcUa and ScadaBridge both descend from the same "ScadaLink three-tier pattern" (OtOpcUa's
`HealthEndpoints.cs:13` says so), but their Akka/leader probe logic and DB probe technique already
differ. MxGateway is **not** Akka-based — a hard dependency-hygiene constraint.

Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/{HealthEndpoints,DatabaseHealthCheck,AkkaClusterHealthCheck,AdminRoleLeaderHealthCheck}.cs`;
ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/Health/{DatabaseHealthCheck,AkkaClusterHealthCheck,ActiveNodeHealthCheck,ActiveNodeGate}.cs` + `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`;
MxGateway `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:61,139–145`.

### Telemetry
| Aspect | OtOpcUa | MxGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK | full (`WithMetrics`+`WithTracing`) | **none** (hand-rolled `System.Diagnostics.Metrics`, no export) | **none** (`OpenTelemetry.Api` is a dangling CVE-patch ref) |
| Exporter | Prometheus `/metrics` | in-memory snapshot only (`GetSnapshot()`) | — |
| Meter | `ZB.MOM.WW.OtOpcUa` | `MxGateway.Server` (13 counters / 3 histograms in `ms` / 4 gauges) | — |
| Tracing | ActivitySource (2 spans) | none | none |
| Resource / `service.name` | **none anywhere** | none | none |

Nobody sets a resource/`service.name` — the fleet is indistinguishable in a collector. Durations
split `s` (OtOpcUa, OTel-correct) vs `ms` (MxGateway).

Key refs: OtOpcUa `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs` +
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`;
MxGateway `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`;
ScadaBridge `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31` (Api only, zero usage).

### Logging
Serilog in OtOpcUa (`Program.cs:49`) + ScadaBridge (`LoggerConfigurationFactory.cs:28–126`,
enrichers `SiteId`/`NodeRole`/`NodeHostname`); MEL in MxGateway (`appsettings.json`, correlation via
`GatewayLogScope`/`BeginScope` middleware + `GatewayLogRedactor`). ScadaBridge's enricher set is the
cleanest and its property names match the Resource attributes Telemetry needs. Nobody enriches logs
with `trace_id`/`span_id`.

## Library design — `ZB.MOM.WW.Health` (3 packages)

**① `ZB.MOM.WW.Health`** (core; deps: `Microsoft.Extensions.Diagnostics.HealthChecks` + ASP.NET Core abstractions)
- Tier convention: canonical tags `ready` / `active` / `live`; `app.MapZbHealth()` maps all three —
  `/health/ready` (tag `ready` → can this node serve?), `/health/active` (tag `active` → is this the
  leader/active node?), `/healthz` (predicate `_ => false` → bare process liveness).
- Canonical JSON response writer (lifts ScadaBridge's `HealthChecks.UI.Client` style to the default).
- `IActiveNodeGate` seam (generalized from ScadaBridge's `ActiveNodeGate`) + `MapZbHealth` integration.
- `GrpcDependencyHealthCheck` — "is my downstream gRPC dependency reachable" (MxGateway → worker;
  OtOpcUa → gateway channel).

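The tier convention above maps naturally onto ASP.NET Core's tag predicates. The following is a rough sketch of what a `MapZbHealth`-style extension might do internally; the class and method names are hypothetical, and the canonical JSON writer and `IActiveNodeGate` integration are left out.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;

// Hypothetical sketch, not the shipped API.
public static class ZbHealthEndpointSketch
{
    public static IEndpointRouteBuilder MapZbHealthSketch(this IEndpointRouteBuilder endpoints)
    {
        // Readiness: run only the checks tagged "ready".
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("ready"),
        });

        // Active/leader: run only the checks tagged "active".
        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("active"),
        });

        // Liveness: the predicate rejects every check, so an empty report
        // (Healthy) is returned whenever the process can answer at all.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        });

        return endpoints;
    }
}
```

Under this sketch an app would opt a probe into a tier at registration time by passing `tags: new[] { "ready" }` to `AddCheck`.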
**② `ZB.MOM.WW.Health.Akka`** (dep: Akka.Cluster)
- `AkkaClusterHealthCheck` with a configurable status policy. Default = ScadaBridge's
  (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy); OtOpcUa's (self-`Up`-among-
  members → Healthy/Degraded) ships as a preset.
- `ActiveNodeHealthCheck` with an optional role filter — role-less default gives ScadaBridge's
  `ActiveNode` (Up && leader); passing a role gives OtOpcUa's `AdminRoleLeader` behavior.

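A configurable status policy can be as small as a function from member status to health status. The sketch below encodes only the default mapping quoted above. `MemberState`, `Health`, and `ClusterStatusPolicies` are stand-in names so the example compiles without Akka.Cluster; the real check would presumably use Akka's member status and `HealthStatus` instead.

```csharp
using System;

// Stand-in enums (assumptions) so the sketch has no package dependencies.
public enum MemberState { Joining, Up, Leaving, Exiting, Down, Removed }
public enum Health { Healthy, Degraded, Unhealthy }

public static class ClusterStatusPolicies
{
    // Default policy as described for ScadaBridge:
    // Up/Joining are healthy, Leaving/Exiting are degraded, anything else is unhealthy.
    public static readonly Func<MemberState, Health> Default = state => state switch
    {
        MemberState.Up or MemberState.Joining => Health.Healthy,
        MemberState.Leaving or MemberState.Exiting => Health.Degraded,
        _ => Health.Unhealthy,
    };
}
```

Modelling the policy as a delegate keeps the table-driven unit tests planned for this package straightforward: each row is a status and an expected result.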
**③ `ZB.MOM.WW.Health.EntityFrameworkCore`** (dep: EF Core)
- `DatabaseHealthCheck<TContext>` — default probe `CanConnectAsync()` (ScadaBridge), optional
  probe-query delegate for OtOpcUa's "query `Deployments`" style.

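One possible shape for the database check with an optional probe delegate is sketched below. The class name, constructor, and delegate signature are assumptions made for illustration; only `CanConnectAsync` as the default probe comes from the design.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch of DatabaseHealthCheck<TContext>; not the shipped type.
public sealed class DatabaseHealthCheckSketch<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly TContext _db;
    private readonly Func<TContext, CancellationToken, Task<bool>> _probe;

    public DatabaseHealthCheckSketch(
        TContext db,
        Func<TContext, CancellationToken, Task<bool>>? probe = null)
    {
        _db = db;
        // Default probe: a plain connectivity test. A caller can pass a
        // query-based probe instead (for example, reading one row from a table).
        _probe = probe ?? ((TContext ctx, CancellationToken ct) => ctx.Database.CanConnectAsync(ct));
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            bool ok = await _probe(_db, cancellationToken);
            return ok
                ? HealthCheckResult.Healthy()
                : new HealthCheckResult(context.Registration.FailureStatus, "database probe returned false");
        }
        catch (Exception ex)
        {
            // A throwing probe is reported with the registration's failure status.
            return new HealthCheckResult(context.Registration.FailureStatus, exception: ex);
        }
    }
}
```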
**Stays per-project:** which probes each app registers; orchestrator/Traefik wiring; ScadaBridge's
`HealthMonitoring/` domain aggregation pipeline (distributed domain health, not an ASP.NET probe).

**Consumer matrix:** MxGateway → core only (+ gRPC-dep probe; no Akka/EF); OtOpcUa & ScadaBridge → all three.

## Library design — `ZB.MOM.WW.Telemetry` (2 packages)

**① `ZB.MOM.WW.Telemetry`** (OTel metrics + traces; deps: OpenTelemetry SDK + hosting/exporter)
- `builder.AddZbTelemetry(options)` — the missing front door:
  ```csharp
  builder.AddZbTelemetry(o => {
      o.ServiceName      = "mxgateway";          // → Resource service.name
      o.ServiceNamespace = "ZB.MOM.WW";          // constant across the fleet
      o.SiteId           = cfg.SiteId;           // → resource attr site.id
      o.NodeRole         = cfg.NodeRole;         // → resource attr node.role
      o.Meters           = ["MxGateway.Server"]; // app's own Meter name(s)
      o.ActivitySources  = [...];                // app's own ActivitySource name(s)
      o.Exporter         = Prometheus;           // default; OTLP opt-in
  });
  app.MapZbMetrics();                            // Prometheus /metrics
  ```
- Shared `Resource`: `service.name` + `service.namespace` + `service.version` + `site.id` +
  `node.role` + `host.name`. **The headline fix** — nobody sets this today.
- Standard instrumentation everyone should have (only OtOpcUa has it now): ASP.NET Core, gRPC client,
  HttpClient, runtime + process meters.
- Exporter: Prometheus `/metrics` default; **OTLP opt-in** via options (path to a real collector).
- App instruments stay per-project. MxGateway's hand-rolled `GatewayMetrics` keeps its 13/3/4
  instruments but its `Meter` is registered through `AddZbTelemetry` so it finally **exports** instead
  of dying in an in-memory snapshot.

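Why a `Meter` has to be registered by name can be shown with the BCL alone: measurements go nowhere until something subscribes to that Meter. The sketch below uses `MeterListener` as a stand-in for the OTel SDK subscription that `AddZbTelemetry` would set up from `o.Meters`. The instrument name `mxgateway.worker.calls` is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterExportSketch
{
    public static long Run()
    {
        // App-owned Meter, as GatewayMetrics would keep it. Instrument name is illustrative.
        using var meter = new Meter("MxGateway.Server");
        Counter<long> calls = meter.CreateCounter<long>("mxgateway.worker.calls");

        long observed = 0;
        using var listener = new MeterListener();

        // Subscribe by Meter name: the same role the Meters option plays in the design.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "MxGateway.Server")
            {
                l.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<long>(
            (Instrument instrument, long value, ReadOnlySpan<KeyValuePair<string, object?>> tags, object? state) =>
                observed += value);
        listener.Start();

        // Only measurements recorded while a listener is attached are seen.
        calls.Add(1);
        calls.Add(2);
        return observed;
    }
}
```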
**② `ZB.MOM.WW.Telemetry.Serilog`** (logs signal + Serilog convergence; deps: Serilog + the core package)
- `AddZbSerilog()` — shared two-stage bootstrap generalizing ScadaBridge's `LoggerConfigurationFactory`
  (`ReadFrom.Configuration` for sinks + explicit `MinimumLevel.Is` override).
- Shared enrichers `SiteId` / `NodeRole` / `NodeHostname`, **bound from the same options object as the
  OTel Resource** so logs and metrics carry identical dimensions.
- **NEW `TraceContextEnricher`** — stamps `trace_id`/`span_id` from `Activity.Current` onto every log
  event (makes a log line clickable from a trace; nobody has this today).
- OTel log export — logs flow through the OTel pipeline with the same Resource (all three signals
  correlated in a backend).
- `ILogRedactor` seam — generalized from MxGateway's `GatewayLogRedactor` (the only app with real
  secret redaction). The seam is shared; the policy (which fields/commands) stays per-project.

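The trace-context enricher described above is small enough to sketch in full. This is an illustration under assumptions, not the planned implementation: the class name is hypothetical, and it relies only on Serilog's `ILogEventEnricher` contract and `System.Diagnostics.Activity`.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Hypothetical sketch of a TraceContextEnricher.
public sealed class TraceContextEnricherSketch : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        Activity? activity = Activity.Current;
        if (activity is null)
        {
            return; // no active span: leave the event untouched
        }

        // Property names follow the trace_id / span_id spelling used in this design.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```

It would be attached with `Enrich.With(new TraceContextEnricherSketch())` on the `LoggerConfiguration`, and it matches the planned test: properties appear under an active `Activity` and are omitted otherwise.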
**Convergence the spec pins down:** Meter name = `<app>` namespace; instrument name =
`<app>.<subsystem>.<event>`; duration unit = **seconds** (OTel semconv) — flags MxGateway's `ms`
histograms as a convergence item.

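As a concrete reading of the naming and unit rules, a converged duration histogram might look like the sketch below. The instrument name `mxgateway.worker.call.duration` and the class name are hypothetical; the Meter name is MxGateway's current one from the scan above.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class DurationConventionSketch
{
    private static readonly Meter AppMeter = new("MxGateway.Server");

    // Name in <app>.<subsystem>.<event> form; unit declared as seconds ("s").
    private static readonly Histogram<double> CallDuration =
        AppMeter.CreateHistogram<double>("mxgateway.worker.call.duration", unit: "s");

    // Records the elapsed time in seconds rather than milliseconds.
    public static double Record(Stopwatch stopwatch)
    {
        double seconds = stopwatch.Elapsed.TotalSeconds;
        CallDuration.Record(seconds);
        return seconds;
    }
}
```

Migrating an existing `ms` histogram then comes down to changing the declared unit and recording `TotalSeconds` instead of `TotalMilliseconds`.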
### The one adoption — MxGateway MEL → Serilog
Replace `WebApplicationBuilder` default logging with `AddZbSerilog()`; re-express the
`GatewayLogScope`/`BeginScope` correlation middleware as a Serilog `LogContext.PushProperty` scope;
move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The net48 x86 worker's `IWorkerLogger`
(stderr key=value) stays bespoke — out of process and out of scope.

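Re-expressing a `BeginScope` correlation middleware with `LogContext.PushProperty` could look roughly like this. It is a sketch under assumptions: the property name `CorrelationId`, the use of `HttpContext.TraceIdentifier` as its value, and the extension method name are all illustrative, since the actual `GatewayLogScope` fields are not shown here. The logger must also be configured with `Enrich.FromLogContext()` for the pushed property to appear.

```csharp
using Microsoft.AspNetCore.Builder;
using Serilog.Context;

// Hypothetical sketch of the migrated correlation middleware.
public static class CorrelationScopeSketch
{
    public static IApplicationBuilder UseCorrelationScopeSketch(this IApplicationBuilder app) =>
        app.Use(async (context, next) =>
        {
            // The property is attached to every log event written inside this
            // using block and removed again when the request completes.
            using (LogContext.PushProperty("CorrelationId", context.TraceIdentifier))
            {
                await next(context);
            }
        });
}
```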
## Normalization component docs

Both trees follow `components/README.md`'s six-part layout (matching auth + ui-theme). Each `spec`
opens with a Section 0 stating normalized vs. left-per-project explicitly. `observability/` adds one
reference doc — `spec/METRIC-CONVENTIONS.md` — mirroring auth's `CANONICAL-ROLES.md` / theme's
`DESIGN-TOKENS.md`. Three `current-state/<project>/CURRENT-STATE.md` per component at full
code-verified depth, each ending in an Adoption plan. `GAPS.md` turns deltas into a prioritized
backlog (MxGateway "no probes" + "MEL→Serilog" are top entries; `ms`→`s` and the missing Resource are
convergence items). Both register at status **Draft** (`Draft → Reviewed → Adopting → Converged`).

## Testing & verification

Every package ships tests (mirrors auth's 172 / theme's 32; `dotnet test` from each library root):
- **Health** — `WebApplicationFactory` tests for the three tiers + JSON shape; `IActiveNodeGate`
  gates a route (200 active / 503 standby); `GrpcDependencyHealthCheck` on a stub channel.
- **Health.Akka** — table-driven status-policy + role-filter unit tests over faked cluster state.
- **Health.EntityFrameworkCore** — `DatabaseHealthCheck<T>` against SQLite in-memory (healthy / broken
  context / custom probe delegate).
- **Telemetry** — Resource carries every options attribute; in-memory exporter sees a registered app
  Meter's instrument; `MapZbMetrics` serves Prometheus text.
- **Telemetry.Serilog** — in-memory/TestCorrelator sink asserts enricher properties present;
  `TraceContextEnricher` stamps `trace_id`/`span_id` under an active `Activity` and omits cleanly
  otherwise; `ILogRedactor` scrubs a policy-marked secret.
- **MxGateway migration** — existing `MxGateway.Tests` (fake worker) still green + correlation scope
  still emits + secrets still redacted.

Verification gates (evidence, not assertions): each library `dotnet test` green + `dotnet pack`
produces nupkgs @ 0.1.0; MxGateway `dotnet build src/MxGateway.sln` + `dotnet test` green.

## Build order

```
1. components/health/ + components/observability/ docs   (spec first — drives the APIs)
2. ZB.MOM.WW.Health (3 pkgs)                  ─┐ parallelizable
3. ZB.MOM.WW.Telemetry (core: metrics+traces) ─┘
4. ZB.MOM.WW.Telemetry.Serilog                (needs #3)
5. MxGateway MEL→Serilog migration            (needs #4)  ← the one sister-repo touch
6. Index/registry updates + GAPS cross-check
```

## Implementation tasks (native task IDs)

- #7 Build `ZB.MOM.WW.Health` library (3 packages)
- #8 Build `ZB.MOM.WW.Telemetry` library (2 packages)
- #9 Migrate MxGateway logging MEL → shared Serilog (sister-repo) — blocked by #8
- #10 Author `components/health/` normalization docs
- #11 Author `components/observability/` normalization docs

Dependency: #9 blocked by #8 (needs `ZB.MOM.WW.Telemetry.Serilog`). Docs (#10/#11) precede the
libraries logically (spec drives API) but can be drafted in parallel from the captured current-state.