# Observability — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Telemetry code lives in two places: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/` (host-side
bootstrap) and `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/` (instruments + enricher).
All paths relative to repo root. Verified 2026-06-01.

The most complete observability implementation in the family: OpenTelemetry SDK with both metrics and
tracing signals, Prometheus export, Serilog structured logging with a per-session correlation enricher,
and a dedicated instrument vocabulary. The one significant gap: **no OTel Resource / `service.name`**,
so a backend cannot attribute the signals to OtOpcUa or tell them apart from those of other fleet members.

## 1. Metrics (OpenTelemetry SDK)

### Bootstrap — `ObservabilityExtensions.cs`

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`:

- `:18` — `AddOtOpcUaObservability(IServiceCollection)` is the service-registration entry point.
- `:20` — `AddOpenTelemetry()` wires the OTel SDK.
- `:21–23` — `.WithMetrics(b => b.AddMeter(OtOpcUaTelemetry.MeterName).AddPrometheusExporter())`:
  registers the application meter and attaches the Prometheus scrape exporter.
- `:24–25` — `.WithTracing(b => b.AddSource(OtOpcUaTelemetry.ActivitySourceName))`:
  registers the application activity source for trace data.
- **No `ResourceBuilder` call anywhere** — `service.name`, `service.namespace`, `service.version`,
  `site.id`, and `node.role` are not set. The OTel SDK falls back to its default Resource, whose
  `service.name` is only an `unknown_service` placeholder.
- `:36` — `MapOtOpcUaMetrics(IEndpointRouteBuilder)` maps the Prometheus endpoint.
- `:38` — endpoint path is `/metrics`.

`Program.cs`:

- `:138` — `builder.Services.AddOtOpcUaObservability()`
- `:160` — `app.MapOtOpcUaMetrics()`

Package refs in csproj: `OpenTelemetry.Extensions.Hosting`, `OpenTelemetry.Exporter.Prometheus.AspNetCore`.
**No `OpenTelemetry.Exporter.OpenTelemetryProtocol`** — OTLP is not available; Prometheus is the
only export path.

### Instruments — `OtOpcUaTelemetry.cs`

`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`:

- `:19` — `MeterName = "ZB.MOM.WW.OtOpcUa"` (the `Meter` the SDK will collect).
- `:20` — `ActivitySourceName = "ZB.MOM.WW.OtOpcUa"` (the `ActivitySource` for spans).

Instruments defined (all `static readonly` on `OtOpcUaTelemetry`):

| Instrument | Kind | Unit | Subsystem |
|---|---|---|---|
| `otopcua.deploy.applied` | `Counter<long>` | — | deploy |
| `otopcua.deploy.apply.duration` | `Histogram<double>` | `s` | deploy |
| `otopcua.driver.lifecycle` | `Counter<long>` | — | driver |
| `otopcua.virtualtag.eval` | `Counter<long>` | — | virtual-tag |
| `otopcua.scriptedalarm.transition` | `Counter<long>` | — | scripted-alarm |
| `otopcua.opcua.sink.write` | `Counter<long>` | — | opc-ua sink |
| `otopcua.redundancy.service_level_change` | `Counter<long>` | — | redundancy |

Two activity spans: `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`.

Naming convention: `otopcua.<subsystem>.<event>`. Duration histogram correctly uses unit `s`
(OTel semantic conventions). **No standard instrumentation** (ASP.NET Core, HttpClient, runtime,
gRPC client meters) is wired — only the bespoke application instruments.

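
The naming and unit conventions above can be sketched with the BCL metrics API alone. This is a minimal illustration, not the contents of `OtOpcUaTelemetry.cs`: the `TelemetrySketch` class and its `RecordDeploy` helper are hypothetical, only two of the seven instruments are shown, and a `MeterListener` stands in for the OTel SDK so the measurements can be observed in-process.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Collect measurements in-process so the sketch is observable without an exporter.
var seen = new List<string>();

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments that belong to the sketch's meter.
    if (instrument.Meter.Name == TelemetrySketch.MeterName)
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
    seen.Add($"{instrument.Name}={value}"));
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    seen.Add($"{instrument.Name} unit={instrument.Unit}"));
listener.Start();

TelemetrySketch.RecordDeploy(() => { /* the deployment work would run here */ });

foreach (var line in seen)
    Console.WriteLine(line);

// Hypothetical stand-in for OtOpcUaTelemetry; only the meter and instrument
// names are taken from the table above.
public static class TelemetrySketch
{
    public const string MeterName = "ZB.MOM.WW.OtOpcUa";

    private static readonly Meter Meter = new(MeterName);

    // Counter named <app>.<subsystem>.<event>, no unit.
    public static readonly Counter<long> DeployApplied =
        Meter.CreateCounter<long>("otopcua.deploy.applied");

    // Duration histogram recorded in seconds, declared with unit "s".
    public static readonly Histogram<double> DeployApplyDuration =
        Meter.CreateHistogram<double>("otopcua.deploy.apply.duration", unit: "s");

    // Times the supplied action and records one duration sample and one count.
    public static void RecordDeploy(Action apply)
    {
        var stopwatch = Stopwatch.StartNew();
        apply();
        DeployApplyDuration.Record(stopwatch.Elapsed.TotalSeconds);
        DeployApplied.Add(1);
    }
}
```

Recording `Elapsed.TotalSeconds` directly keeps the value consistent with the declared unit, so no conversion is needed downstream.
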
## 2. Logging (Serilog)

### Bootstrap

`Program.cs`:

- `:49–52` — two-stage Serilog bootstrap: initial logger for startup, then full
  `UseSerilog(ReadFrom.Configuration)`. Sinks: Console + rolling file `logs/otopcua-.log`.
- `:141` — `UseSerilogRequestLogging()` on the `WebApplication`.

### Correlation enricher — `LogContextEnricher.cs`

`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/LogContextEnricher.cs`:

- `:18–36` — `Push(driverInstanceId, driverType, capability, correlationId)` calls
  `LogContext.PushProperty` for four properties:
  - `DriverInstanceId` — Galaxy driver instance GUID.
  - `DriverType` — driver type discriminator.
  - `CapabilityName` — OPC UA capability being exercised.
  - `CorrelationId` — caller-supplied correlation token.

This enricher is driver-lifecycle-scoped, not request-scoped — it pushes when a driver operation
begins and is disposable to pop on completion.

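
The push-then-dispose shape described above might look like the following. This is a sketch under assumptions: an `AsyncLocal` dictionary stands in for Serilog's `LogContext`, `LogScopeSketch` is a hypothetical name, and the argument values are placeholders. The real enricher calls `LogContext.PushProperty` instead.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Threading;

Console.WriteLine($"before: {LogScopeSketch.Current.Count} properties");
using (LogScopeSketch.Push("instance-1", "ExampleDriver", "Read", "corr-42")) // placeholder values
{
    Console.WriteLine($"during: {LogScopeSketch.Current.Count} properties");
}
Console.WriteLine($"after: {LogScopeSketch.Current.Count} properties");

// Hypothetical stand-in for LogContextEnricher; the property names are the
// four listed above, everything else is illustrative.
public static class LogScopeSketch
{
    private static readonly ImmutableDictionary<string, string> Empty =
        ImmutableDictionary<string, string>.Empty;

    // Ambient properties flow with the async execution context.
    private static readonly AsyncLocal<ImmutableDictionary<string, string>> Ambient = new();

    public static IReadOnlyDictionary<string, string> Current => Ambient.Value ?? Empty;

    // Pushes the four driver properties and returns a handle that restores
    // the previous ambient state when disposed.
    public static IDisposable Push(
        string driverInstanceId, string driverType, string capability, string correlationId)
    {
        var previous = Ambient.Value ?? Empty;
        Ambient.Value = previous
            .SetItem("DriverInstanceId", driverInstanceId)
            .SetItem("DriverType", driverType)
            .SetItem("CapabilityName", capability)
            .SetItem("CorrelationId", correlationId);
        return new Scope(previous);
    }

    private sealed class Scope : IDisposable
    {
        private readonly ImmutableDictionary<string, string> _previous;

        public Scope(ImmutableDictionary<string, string> previous) => _previous = previous;

        public void Dispose() => Ambient.Value = _previous;
    }
}
```

Tying the pop to `Dispose` means a `using` block bounds the properties to the driver operation, which is what makes the enricher lifecycle-scoped rather than request-scoped.
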

**No `trace_id` / `span_id` enricher.** Although OtOpcUa creates `ActivitySource` spans, the
active `Activity.Current` trace context is never pushed onto Serilog's `LogContext`. A log line
emitted during a span cannot be correlated to the span in a backend.

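
Closing this gap amounts to reading `Activity.Current` at the moment a log event is written. A minimal sketch, assuming W3C-format activity IDs (the .NET default): `TraceContextSketch` is a hypothetical helper that returns the two properties an enricher would attach, and the `ActivityListener` stands in for the OTel SDK's `AddSource` so that the span is actually created.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Without a listener that samples the source, StartActivity returns null.
// In the real host the OTel SDK registration plays this role.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "ZB.MOM.WW.OtOpcUa",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

using var activitySource = new ActivitySource("ZB.MOM.WW.OtOpcUa");

Console.WriteLine($"outside span: {TraceContextSketch.Current().Count} properties");
using (activitySource.StartActivity("otopcua.deploy.apply"))
{
    foreach (var (key, value) in TraceContextSketch.Current())
        Console.WriteLine($"inside span: {key}={value}");
}

// Hypothetical helper; a Serilog enricher would add these values to each log event.
public static class TraceContextSketch
{
    // Returns trace_id and span_id for the ambient span, or an empty map when none is active.
    public static IReadOnlyDictionary<string, string> Current()
    {
        var properties = new Dictionary<string, string>();
        var activity = Activity.Current;
        if (activity is null)
            return properties;

        properties["trace_id"] = activity.TraceId.ToHexString();
        properties["span_id"] = activity.SpanId.ToHexString();
        return properties;
    }
}
```

Because the lookup happens per log event, lines written outside any span simply carry no trace properties.
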

**No structural enrichers for `service.name` / `site.id` / `node.role`** — these dimensions are
absent from every log line. ScadaBridge has these; OtOpcUa does not.

## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | OTel SDK (`Meter` + `WithMetrics`) | Prometheus `/metrics` | ⛔ none |
| Traces | OTel SDK (`ActivitySource` + `WithTracing`) | ⛔ none (no exporter configured) | ⛔ none |
| Logs | Serilog | Console + rolling file | ⛔ none (no `service.name` property) |
| Trace↔log correlation | — | — | ⛔ absent (`trace_id`/`span_id` not pushed) |

Note: `WithTracing` registers the `ActivitySource` for collection, but no exporter (OTLP or
otherwise) is attached to the tracing pipeline. Spans are created and recorded by the SDK but never
shipped anywhere — effectively a no-op in production.

## 4. Notable design choices

- **Instrument naming** follows `<app>.<subsystem>.<event>` (here `otopcua.<subsystem>.<event>`)
  cleanly and consistently — this is the pattern the shared spec codifies as the fleet convention.
- **Duration unit** correctly uses `s` on `otopcua.deploy.apply.duration` — no conversion needed on
  adoption; this contrasts with MxAccessGateway's `ms` histograms.
- **LogContextEnricher is bespoke but valuable** — the `DriverInstanceId`/`DriverType`/`CapabilityName`
  correlation is OtOpcUa-specific domain context; it should survive adoption behind the shared
  enricher layer.
- **No OTLP path** — with no OTLP exporter, OtOpcUa cannot push metrics or traces to a collector
  (Prometheus is pull-based scraping only). This limits operational flexibility.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**Replace with shared bootstrap:**

- `AddOtOpcUaObservability()` → `builder.AddZbTelemetry(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; o.Meters = [OtOpcUaTelemetry.MeterName]; o.ActivitySources = [OtOpcUaTelemetry.ActivitySourceName]; })`.
  This adds the missing `Resource` (gains `service.name` / `service.namespace` / `service.version` /
  `site.id` / `node.role` / `host.name` on every metric and span). Prometheus `/metrics` stays the
  default exporter; OTLP becomes opt-in via options.
- Add standard instrumentation through `AddZbTelemetry` options: ASP.NET Core meters, HttpClient,
  runtime + process meters — none wired today.
- Fix the tracing no-op: wire an OTLP exporter (or at minimum note that tracing is recorded but not
  exported); `AddZbTelemetry` provides OTLP as the opt-in path.
- `MapOtOpcUaMetrics` → `app.MapZbMetrics()` (same `/metrics` path; shared convention).

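
How the options might map onto the Resource attribute set listed above can be sketched without the shared library, which does not exist yet. Everything here is hypothetical: the `ZbTelemetryOptionsSketch` and `ZbResourceSketch` names, the default namespace and version values, and the placeholder site and role values are assumptions, and only the resource-related options are modelled.

```csharp
using System;
using System.Collections.Generic;

var options = new ZbTelemetryOptionsSketch
{
    ServiceName = "otopcua",
    SiteId = "site-a",     // placeholder value
    NodeRole = "primary",  // placeholder value
};

foreach (var (key, value) in ZbResourceSketch.Build(options))
    Console.WriteLine($"{key} = {value}");

// Hypothetical options shape; property names follow the adoption bullet above,
// defaults are assumptions made for this sketch.
public sealed class ZbTelemetryOptionsSketch
{
    public string ServiceName { get; set; } = "";
    public string ServiceNamespace { get; set; } = "ZB.MOM.WW"; // assumed default
    public string ServiceVersion { get; set; } = "unknown";     // assumed default
    public string SiteId { get; set; } = "";
    public string NodeRole { get; set; } = "";
}

public static class ZbResourceSketch
{
    // Builds the attribute set that would be attached to every metric and span.
    public static IReadOnlyDictionary<string, object> Build(ZbTelemetryOptionsSketch options)
    {
        if (string.IsNullOrWhiteSpace(options.ServiceName))
            throw new ArgumentException("ServiceName is required.", nameof(options));

        return new Dictionary<string, object>
        {
            ["service.name"] = options.ServiceName,
            ["service.namespace"] = options.ServiceNamespace,
            ["service.version"] = options.ServiceVersion,
            ["site.id"] = options.SiteId,
            ["node.role"] = options.NodeRole,
            ["host.name"] = Environment.MachineName,
        };
    }
}
```

Rejecting an empty `ServiceName` up front is one way to keep the current gap from reappearing silently; whether the shared library enforces this is an open design choice.
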

**Replace with shared Serilog bootstrap:**

- Serilog bootstrap in `Program.cs:49–52` → `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; })`.
  This adds structural `SiteId` / `NodeRole` / `NodeHostname` properties to every log line
  (currently absent) and wires the `TraceContextEnricher` so `trace_id`/`span_id` appear on log
  lines emitted during active spans.
- Console + file sinks continue via `ReadFrom.Configuration` in `appsettings.json` — no sink changes
  needed.
- `UseSerilogRequestLogging()` stays.

**Keep bespoke:**

- `OtOpcUaTelemetry.cs` — the application `Meter`, `ActivitySource`, and all instrument definitions
  (`otopcua.*` counters, histograms, spans). These are domain instruments; `AddZbTelemetry` registers
  them by name but does not own them.
- `LogContextEnricher.cs` — driver-lifecycle correlation properties (`DriverInstanceId`,
  `DriverType`, `CapabilityName`, `CorrelationId`) are OtOpcUa-specific. The enricher continues to
  push via `LogContext.PushProperty` alongside the shared enrichers.
- `ObservabilityExtensions.cs` can be reduced to a thin wrapper that calls `AddZbTelemetry` with
  OtOpcUa-specific options. The per-project entry point remains; only the implementation body is
  delegated to the shared library.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry`
library build. The library build delivers the shared bootstrap and enrichers; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.