docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract
Author the three normalization docs for the observability component:

- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project), AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline, exporter conventions, Serilog two-stage bootstrap with identity enrichers and TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app namespace; MxGateway.Server flagged as convergence target), instrument naming pattern (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms flagged), Resource attribute set table, standard instrumentation baseline, and per-app instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder + IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog, ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher. Consumer matrix and open contract questions included.
# Observability — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Telemetry code lives in two places: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/` (host-side
bootstrap) and `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/` (instruments + enricher).
All paths are relative to the repo root. Verified 2026-06-01.

The most complete observability implementation in the family: OpenTelemetry SDK with both metrics and
tracing signals, Prometheus export, Serilog structured logging with a per-session correlation enricher,
and a dedicated instrument vocabulary. The one significant gap: **no OTel Resource / `service.name`**,
so a backend cannot attribute any signal to this service, or tell OtOpcUa instances apart from one
another and from other fleet members.

## 1. Metrics (OpenTelemetry SDK)

### Bootstrap — `ObservabilityExtensions.cs`

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`:

- `:18` — `AddOtOpcUaObservability(IServiceCollection)` is the service-registration entry point.
- `:20` — `AddOpenTelemetry()` wires the OTel SDK.
- `:21–23` — `.WithMetrics(b => b.AddMeter(OtOpcUaTelemetry.MeterName).AddPrometheusExporter())`:
  registers the application meter and attaches the Prometheus scrape exporter.
- `:24–25` — `.WithTracing(b => b.AddSource(OtOpcUaTelemetry.ActivitySourceName))`:
  registers the application activity source for trace data.
- **No `ResourceBuilder` call anywhere** — `service.name`, `service.namespace`, `service.version`,
  `site.id`, and `node.role` are not set. The OTel SDK falls back to its default Resource, whose
  `service.name` is only an `unknown_service` placeholder.
- `:36` — `MapOtOpcUaMetrics(IEndpointRouteBuilder)` maps the Prometheus endpoint.
- `:38` — endpoint path is `/metrics`.

`Program.cs`:

- `:138` — `builder.Services.AddOtOpcUaObservability()`
- `:160` — `app.MapOtOpcUaMetrics()`

Package refs in csproj: `OpenTelemetry.Extensions.Hosting`, `OpenTelemetry.Exporter.Prometheus.AspNetCore`.
**No `OpenTelemetry.Exporter.OpenTelemetryProtocol`** — OTLP is not available; Prometheus is the
only export path.

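The wiring described by the line references above can be sketched roughly as follows. This is a reconstruction from those references, not the file's actual contents: the method bodies, the return types, and the use of `MapPrometheusScrapingEndpoint` are assumptions, and the `OtOpcUaTelemetry` class here is a stand-in so the sketch compiles on its own. It assumes the two packages listed above are referenced.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

// Stand-in for the constants defined in Commons/Observability/OtOpcUaTelemetry.cs.
public static class OtOpcUaTelemetry
{
    public const string MeterName = "ZB.MOM.WW.OtOpcUa";
    public const string ActivitySourceName = "ZB.MOM.WW.OtOpcUa";
}

public static class ObservabilityExtensions
{
    // Assumed shape of the service-registration entry point.
    public static IServiceCollection AddOtOpcUaObservability(this IServiceCollection services)
    {
        services.AddOpenTelemetry()
            // Note what is missing: no ConfigureResource(...) call, so the
            // SDK default Resource is attached to every metric and span.
            .WithMetrics(b => b
                .AddMeter(OtOpcUaTelemetry.MeterName)
                .AddPrometheusExporter())
            // The source is registered for collection, but no exporter follows.
            .WithTracing(b => b
                .AddSource(OtOpcUaTelemetry.ActivitySourceName));

        return services;
    }

    // Assumed shape of the endpoint mapping; the path comes from the text above.
    public static IEndpointRouteBuilder MapOtOpcUaMetrics(this IEndpointRouteBuilder endpoints)
    {
        endpoints.MapPrometheusScrapingEndpoint("/metrics");
        return endpoints;
    }
}
```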
### Instruments — `OtOpcUaTelemetry.cs`

`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`:

- `:19` — `MeterName = "ZB.MOM.WW.OtOpcUa"` (the `Meter` the SDK will collect).
- `:20` — `ActivitySourceName = "ZB.MOM.WW.OtOpcUa"` (the `ActivitySource` for spans).

Instruments defined (all `static readonly` on `OtOpcUaTelemetry`):

| Instrument | Kind | Unit | Subsystem |
|---|---|---|---|
| `otopcua.deploy.applied` | `Counter<long>` | — | deploy |
| `otopcua.deploy.apply.duration` | `Histogram<double>` | `s` | deploy |
| `otopcua.driver.lifecycle` | `Counter<long>` | — | driver |
| `otopcua.virtualtag.eval` | `Counter<long>` | — | virtual-tag |
| `otopcua.scriptedalarm.transition` | `Counter<long>` | — | scripted-alarm |
| `otopcua.opcua.sink.write` | `Counter<long>` | — | opc-ua sink |
| `otopcua.redundancy.service_level_change` | `Counter<long>` | — | redundancy |

Two activity spans: `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`.

Naming convention: `otopcua.<subsystem>.<event>`. Duration histogram correctly uses unit `s`
(OTel semantic conventions). **No standard instrumentation** (ASP.NET Core, HttpClient, runtime,
gRPC client meters) is wired — only the bespoke application instruments.

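As an illustration of the naming and unit conventions in the table, here is a minimal sketch that uses only `System.Diagnostics.Metrics` from the base class library. The class names, field names, and the `RecordDeploy` helper are hypothetical and are not taken from `OtOpcUaTelemetry.cs`; only the meter name, the two instrument names, and the `s` unit come from the text above. A `MeterListener` stands in for the OTel SDK so the recorded measurements can be observed without any exporter.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical slice of an instrument vocabulary (names from the table above).
public static class DeployTelemetrySketch
{
    public const string MeterName = "ZB.MOM.WW.OtOpcUa";
    private static readonly Meter AppMeter = new(MeterName);

    public static readonly Counter<long> DeployApplied =
        AppMeter.CreateCounter<long>("otopcua.deploy.applied");

    // Durations are recorded in seconds, so the unit is "s".
    public static readonly Histogram<double> DeployApplyDuration =
        AppMeter.CreateHistogram<double>("otopcua.deploy.apply.duration", unit: "s");

    // Times the supplied action and records one duration and one count.
    public static void RecordDeploy(Action apply)
    {
        long start = Stopwatch.GetTimestamp();
        apply();
        DeployApplyDuration.Record(Stopwatch.GetElapsedTime(start).TotalSeconds);
        DeployApplied.Add(1);
    }
}

public static class MeterListenerDemo
{
    // Subscribes to the meter, records one deploy, and returns what was seen.
    public static List<string> CaptureOnce()
    {
        var seen = new List<string>();
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == DeployTelemetrySketch.MeterName)
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<double>(
            (inst, value, tags, state) => seen.Add($"{inst.Name} unit={inst.Unit}"));
        listener.SetMeasurementEventCallback<long>(
            (inst, value, tags, state) => seen.Add(inst.Name));
        listener.Start();

        DeployTelemetrySketch.RecordDeploy(() => { });
        return seen;
    }

    public static void Main() =>
        Console.WriteLine(string.Join(", ", CaptureOnce()));
}
```

Recording `TotalSeconds` directly means no conversion is needed at the call site or in the backend, which is the property the `s` unit is meant to guarantee.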
## 2. Logging (Serilog)

### Bootstrap

`Program.cs`:

- `:49–52` — two-stage Serilog bootstrap: initial logger for startup, then full
  `UseSerilog(ReadFrom.Configuration)`. Sinks: Console + rolling file `logs/otopcua-.log`.
- `:141` — `UseSerilogRequestLogging()` on the `WebApplication`.

### Correlation enricher — `LogContextEnricher.cs`

`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/LogContextEnricher.cs`:

- `:18–36` — `Push(driverInstanceId, driverType, capability, correlationId)` calls
  `LogContext.PushProperty` for four properties:
  - `DriverInstanceId` — Galaxy driver instance GUID.
  - `DriverType` — driver type discriminator.
  - `CapabilityName` — OPC UA capability being exercised.
  - `CorrelationId` — caller-supplied correlation token.

This enricher is driver-lifecycle-scoped, not request-scoped: it pushes the properties when a driver
operation begins and is disposed on completion to pop them.

**No `trace_id` / `span_id` enricher.** Although OtOpcUa creates `ActivitySource` spans, the
active `Activity.Current` trace context is never pushed onto Serilog's `LogContext`. A log line
emitted during a span cannot be correlated to the span in a backend.

**No structural enrichers for `service.name` / `site.id` / `node.role`** — these dimensions are
absent from every log line. ScadaBridge has these; OtOpcUa does not.

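To make the two ideas in this subsection concrete, here is a hedged sketch assuming the `Serilog` package. Both types are hypothetical: `DriverLogScopeSketch.Push` shows one way a scoped push of the four properties could return a single disposable (the parameter types are assumed to be strings), and `ActivityTraceEnricherSketch` shows one way the missing `trace_id` / `span_id` properties could be copied from `Activity.Current`. Neither is the actual `LogContextEnricher.cs` nor the shared `TraceContextEnricher`.

```csharp
using System;
using System.Diagnostics;
using Serilog.Context;
using Serilog.Core;
using Serilog.Events;

// Hypothetical: one IDisposable that pops all four pushed properties.
public static class DriverLogScopeSketch
{
    public static IDisposable Push(string driverInstanceId, string driverType,
                                   string capability, string correlationId)
    {
        var scopes = new[]
        {
            LogContext.PushProperty("DriverInstanceId", driverInstanceId),
            LogContext.PushProperty("DriverType", driverType),
            LogContext.PushProperty("CapabilityName", capability),
            LogContext.PushProperty("CorrelationId", correlationId),
        };
        return new CompositeScope(scopes);
    }

    private sealed class CompositeScope(IDisposable[] scopes) : IDisposable
    {
        public void Dispose()
        {
            // Pop in reverse order of the pushes.
            for (int i = scopes.Length - 1; i >= 0; i--)
                scopes[i].Dispose();
        }
    }
}

// Hypothetical: copies the ambient Activity's IDs onto each log event.
public sealed class ActivityTraceEnricherSketch : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity is null) return; // no active span: leave the event untouched

        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```

Both pieces only take effect once they are registered on the logger, for example with `.Enrich.FromLogContext()` for the pushed properties and `.Enrich.With<ActivityTraceEnricherSketch>()` for the trace IDs.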
## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | OTel SDK (`Meter` + `WithMetrics`) | Prometheus `/metrics` | ⛔ none |
| Traces | OTel SDK (`ActivitySource` + `WithTracing`) | ⛔ none (no exporter configured) | ⛔ none |
| Logs | Serilog | Console + rolling file | ⛔ none (no `service.name` property) |
| Trace↔log correlation | — | — | ⛔ absent (`trace_id`/`span_id` not pushed) |

Note: `WithTracing` registers the `ActivitySource` for collection, but no exporter (OTLP or
otherwise) is attached to the tracing pipeline. Spans are created and recorded by the SDK but never
shipped anywhere — effectively a no-op in production.

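The "recorded but never shipped" behaviour can be illustrated with the base class library alone. In this sketch an `ActivityListener` plays the role of the SDK (it samples the source, so spans are created), and an optional `ActivityStopped` callback plays the role of an exporter. This is an analogy under those assumptions, not how the OTel SDK is implemented; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class TracingNoOpSketch
{
    private static readonly ActivitySource Source = new("ZB.MOM.WW.OtOpcUa");

    // Runs one span; reports whether it was created and how many spans
    // reached the stand-in "exporter".
    public static (bool Created, int Exported) RunOnce(bool attachExporter)
    {
        var exported = new List<string>();
        using var listener = new ActivityListener
        {
            ShouldListenTo = s => s.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
                ActivitySamplingResult.AllDataAndRecorded,
        };
        if (attachExporter)
            listener.ActivityStopped = a => exported.Add(a.OperationName);
        ActivitySource.AddActivityListener(listener);

        bool created;
        using (var activity = Source.StartActivity("otopcua.deploy.apply"))
        {
            // Non-null because a listener sampled the source.
            created = activity is not null;
        }
        return (created, exported.Count);
    }

    public static void Main()
    {
        Console.WriteLine(RunOnce(attachExporter: false)); // prints "(True, 0)"
        Console.WriteLine(RunOnce(attachExporter: true));  // prints "(True, 1)"
    }
}
```

In the first call the span exists and is fully recorded in-process, yet nothing leaves the process, which mirrors the Traces row of the table.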
## 4. Notable design choices

- **Instrument naming** follows `<app>.<subsystem>.<event>` (here `otopcua.<subsystem>.<event>`)
  cleanly and consistently — this is the pattern the shared spec codifies as the fleet convention.
- **Duration unit** correctly uses `s` on `otopcua.deploy.apply.duration` — no conversion needed on
  adoption; this contrasts with MxAccessGateway's `ms` histograms.
- **LogContextEnricher is bespoke but valuable** — the `DriverInstanceId`/`DriverType`/`CapabilityName`
  correlation is OtOpcUa-specific domain context; it should survive adoption behind the shared
  enricher layer.
- **No OTLP path** — with no OTLP exporter, OtOpcUa cannot push metrics or traces to a collector
  (Prometheus is scrape-pull only). This limits operational flexibility.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**Replace with shared bootstrap:**

- `AddOtOpcUaObservability()` → `builder.AddZbTelemetry(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; o.Meters = [OtOpcUaTelemetry.MeterName]; o.ActivitySources = [OtOpcUaTelemetry.ActivitySourceName]; })`.
  This adds the missing `Resource` (gains `service.name` / `service.namespace` / `service.version` /
  `site.id` / `node.role` / `host.name` on every metric and span). Prometheus `/metrics` stays the
  default exporter; OTLP becomes opt-in via options.
- Add standard instrumentation through `AddZbTelemetry` options: ASP.NET Core meters, HttpClient,
  runtime + process meters — none wired today.
- Fix the tracing no-op: wire an OTLP exporter (or at minimum document that spans are recorded but
  not exported); `AddZbTelemetry` provides OTLP as the opt-in path.
- `MapOtOpcUaMetrics` → `app.MapZbMetrics()` (same `/metrics` path; shared convention).

**Replace with shared Serilog bootstrap:**

- Serilog bootstrap in `Program.cs:49–52` → `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; })`.
  This adds structural `SiteId` / `NodeRole` / `NodeHostname` properties to every log line
  (currently absent) and wires the `TraceContextEnricher` so `trace_id`/`span_id` appear on log
  lines emitted during active spans.
- Console + file sinks continue via `ReadFrom.Configuration` in `appsettings.json` — no sink changes
  needed.
- `UseSerilogRequestLogging()` stays.

**Keep bespoke:**

- `OtOpcUaTelemetry.cs` — the application `Meter`, `ActivitySource`, and all instrument definitions
  (`otopcua.*` counters, histograms, spans). These are domain instruments; `AddZbTelemetry` registers
  them by name but does not own them.
- `LogContextEnricher.cs` — driver-lifecycle correlation properties (`DriverInstanceId`,
  `DriverType`, `CapabilityName`, `CorrelationId`) are OtOpcUa-specific. The enricher continues to
  push via `LogContext.PushProperty` alongside the shared enrichers.
- `ObservabilityExtensions.cs` itself is simplified rather than removed: it becomes a thin wrapper
  that calls `AddZbTelemetry` with OtOpcUa-specific options. The per-project entry point remains; only
  the implementation body is delegated to the shared library.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry`
library build. The library build delivers the shared bootstrap and enrichers; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.

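For orientation, the shared-bootstrap step of the plan could translate into OpenTelemetry SDK calls along the following lines. This is a sketch under stated assumptions, not the `ZB.MOM.WW.Telemetry` API: the `ZbTelemetryOptionsSketch` type, the `UseOtlp` switch, and the `AddTelemetrySketch` method are hypothetical, and only the property names `ServiceName`, `SiteId`, `NodeRole`, `Meters`, and `ActivitySources` and the attribute keys come from the plan above. It also assumes the `OpenTelemetry.Exporter.OpenTelemetryProtocol` package is added, which OtOpcUa does not reference today.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Hypothetical options type; not the real ZbTelemetryOptions.
public sealed class ZbTelemetryOptionsSketch
{
    public string ServiceName { get; set; } = "";
    public string SiteId { get; set; } = "";
    public string NodeRole { get; set; } = "";
    public string[] Meters { get; set; } = [];
    public string[] ActivitySources { get; set; } = [];
    public bool UseOtlp { get; set; } // assumed opt-in switch for OTLP export
}

public static class ZbTelemetrySketch
{
    public static IServiceCollection AddTelemetrySketch(
        this IServiceCollection services, ZbTelemetryOptionsSketch o)
    {
        services.AddOpenTelemetry()
            // The step missing today: a Resource shared by metrics and spans.
            .ConfigureResource(r => r
                .AddService(serviceName: o.ServiceName)
                .AddAttributes(new KeyValuePair<string, object>[]
                {
                    new("site.id", o.SiteId),
                    new("node.role", o.NodeRole),
                    new("host.name", Environment.MachineName),
                }))
            .WithMetrics(b =>
            {
                // Prometheus stays the default; OTLP is added only on opt-in.
                b.AddMeter(o.Meters).AddPrometheusExporter();
                if (o.UseOtlp) b.AddOtlpExporter();
            })
            .WithTracing(b =>
            {
                b.AddSource(o.ActivitySources);
                // Without this exporter, spans are recorded but never shipped.
                if (o.UseOtlp) b.AddOtlpExporter();
            });

        return services;
    }
}
```

Standard instrumentation (ASP.NET Core, HttpClient, runtime, process) would be added inside the same `WithMetrics` / `WithTracing` callbacks through the corresponding `OpenTelemetry.Instrumentation.*` packages; it is omitted here to keep the sketch short.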