docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract
Author the three normalization docs for the observability component:

- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project), AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline, exporter conventions, Serilog two-stage bootstrap with identity enrichers and TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app namespace; MxGateway.Server flagged as convergence target), instrument naming pattern (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms flagged), Resource attribute set table, standard instrumentation baseline, and per-app instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder + IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog, ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher. Consumer matrix and open contract questions included.
@@ -0,0 +1,191 @@
# Observability — current state: MxAccessGateway

Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (**x86**);
solution `src/MxGateway.sln`. Telemetry code is concentrated in
`src/ZB.MOM.WW.MxGateway.Server/Metrics/` (instruments) and
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/` (logging correlation + redaction).
All paths relative to repo root. Verified 2026-06-01.

The most unusual observability posture in the family: **13 counters, 3 histograms, and 4 observable
gauges** all fully hand-rolled using `System.Diagnostics.Metrics` directly — but **never exported**
(no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory
`GetSnapshot()`. Logging is `Microsoft.Extensions.Logging` exclusively (no Serilog), with a bespoke
correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of
scope — its `IWorkerLogger` (stderr key=value) is not addressed here.

## 1. Metrics (hand-rolled, unexported)

### `GatewayMetrics.cs`

`src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`:

Meter name: `"MxGateway.Server"` (does not follow the project namespace `ZB.MOM.WW.MxGateway`).

All instruments are instance members of `GatewayMetrics`. The class is registered as a **singleton**
at `GatewayApplication.cs:62`. There is **no `OpenTelemetry.Extensions.Hosting`**,
**no `AddOpenTelemetry()` call**, and **no exporter** — the `Meter` is created with
`new Meter("MxGateway.Server")` and `GetSnapshot()` is the only read path.

**Counters (13):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.opened` | New session requests |
| `mxgateway.sessions.closed` | Sessions torn down |
| `mxgateway.commands.started` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | Command completed OK |
| `mxgateway.commands.failed` | Command error |
| `mxgateway.events.received` | MXAccess events from worker |
| `mxgateway.queues.overflows` | Queue overflow (backpressure) |
| `mxgateway.faults` | Unhandled gateway faults |
| `mxgateway.workers.killed` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | Retry attempts (any subsystem) |

**Histograms (3) — unit `ms` (diverges from OTel semconv `s`):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.workers.startup.duration` | Time from worker spawn to ready |
| `mxgateway.commands.duration` | End-to-end MXAccess command latency |
| `mxgateway.events.stream_send.duration` | gRPC event stream send latency |

**Observable gauges (4):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.open` | Currently open sessions (live count) |
| `mxgateway.workers.running` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | Per-stream gRPC send queue depth |

All 20 instruments share the `mxgateway.*` prefix and `<category>.<event>` naming — consistent
with the family convention. Duration histograms record in **milliseconds** (`ms`); OTel semantic
conventions require seconds (`s`). This is the only project with `ms` histograms.

### Singleton wiring

`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:
- `:62` — `services.AddSingleton<GatewayMetrics>()` registers the metrics singleton.

There is no `AddOpenTelemetry()` call anywhere in the gateway. The `GatewayMetrics` `Meter` is
created independently of any OTel SDK — it participates in `MeterListener` / `GetSnapshot()` only.
Without the OTel SDK, this data is **invisible to Prometheus, OTLP, or any backend**.
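
To make the "in-process only" point concrete, here is a minimal sketch of how a hand-rolled `Meter` can be read with a `MeterListener` and nothing else. `DemoGatewayMetrics` and `SnapshotDemo` are hypothetical stand-ins, not the real `GatewayMetrics` or its `GetSnapshot()`; only the meter and counter names are taken from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical stand-in for GatewayMetrics: a single counter on the "MxGateway.Server" meter.
public sealed class DemoGatewayMetrics : IDisposable
{
    private readonly Meter _meter = new("MxGateway.Server");
    private readonly Counter<long> _sessionsOpened;

    public DemoGatewayMetrics() =>
        _sessionsOpened = _meter.CreateCounter<long>("mxgateway.sessions.opened");

    public void SessionOpened() => _sessionsOpened.Add(1);

    public void Dispose() => _meter.Dispose();
}

public static class SnapshotDemo
{
    // Sums counter measurements in memory. Nothing here leaves the process:
    // there is no exporter, only a listener living next to the instruments.
    public static Dictionary<string, long> Run()
    {
        var totals = new Dictionary<string, long>();

        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            // Subscribe only to instruments from the gateway meter.
            if (instrument.Meter.Name == "MxGateway.Server")
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            totals.TryGetValue(instrument.Name, out long current);
            totals[instrument.Name] = current + value;
        });
        listener.Start();

        using var metrics = new DemoGatewayMetrics();
        metrics.SessionOpened();
        metrics.SessionOpened();
        return totals;
    }
}
```

Calling `SnapshotDemo.Run()` yields a dictionary whose `mxgateway.sessions.opened` entry is 2. The values exist only in that dictionary, which is the same limitation the text describes for the gateway's snapshot path.
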

### No tracing

No `ActivitySource` is defined. No spans are created. Tracing is entirely absent.

## 2. Logging (Microsoft.Extensions.Logging)

All logging in the gateway server uses `Microsoft.Extensions.Logging` (MEL) exclusively. There is
no Serilog dependency. Sink configuration lives in `appsettings.json` (Console, with structured
logging via the default host builder).

### Correlation scope

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs`:

Defines the per-request/per-session correlation property bag.

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs`:
- `:22–41` — `UseGatewayRequestLogging()` middleware reads the following HTTP headers from each
  incoming request: `x-session-id`, `x-worker-process-id`, `x-correlation-id`, `x-command-method`,
  `authorization` (for redaction, not logging).
- Registered at `GatewayApplication.cs:34`.

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs`:
- `:11–18` — `BeginGatewayScope(ILogger, GatewayLogScope)` calls `logger.BeginScope(scope)` —
  MEL's `ILogger.BeginScope` mechanism, which pushes properties as a scoped dictionary.

The correlation tuple (`SessionId` / `WorkerProcessId` / `CorrelationId` / `CommandMethod`) is
injected into log lines produced within the scope. No `trace_id` / `span_id` enrichment — there
is no ActivitySource, so this is consistent but leaves no path to trace correlation.

### Log redaction — `GatewayLogRedactor.cs`

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs`:

- Masks sensitive data in log lines for three categories:
  - **`AuthenticateUser`** commands: the password argument is replaced.
  - **`WriteSecured`** commands: the value argument is replaced.
  - **`mxgw_` bearer tokens**: the token body is masked, keeping only the key-id prefix.
- Redaction is applied before the log event is emitted — no sensitive data reaches the sink.

This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and
ScadaBridge have no equivalent.
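
The redaction rules above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class name `DemoRedactor`, the argument names `password` and `value`, the `***` mask, and in particular the token layout `mxgw_<keyId>_<secret>`, which the source does not specify. It is not the actual `GatewayLogRedactor`.

```csharp
using System.Text.RegularExpressions;

public static class DemoRedactor
{
    // ASSUMPTION: tokens look like "mxgw_<keyId>_<secret>". The real format is not documented here.
    private static readonly Regex Token = new(
        @"\bmxgw_(?<keyId>[A-Za-z0-9]+)_[A-Za-z0-9_\-]+",
        RegexOptions.Compiled);

    // Masks the token body and keeps only the key-id prefix.
    public static string MaskToken(string text) =>
        Token.Replace(text, m => $"mxgw_{m.Groups["keyId"].Value}_***");

    // Replaces the sensitive argument of the two commands before the value is logged.
    // The argument names are hypothetical; other values only get token masking.
    public static string RedactArgument(string commandMethod, string argumentName, string value) =>
        (commandMethod, argumentName) switch
        {
            ("AuthenticateUser", "password") => "***",
            ("WriteSecured", "value") => "***",
            _ => MaskToken(value),
        };
}
```

Because the functions return the redacted string, they would be called before the log call is made, which matches the placement the text describes: redaction happens between the caller and the log emission point, not inside a sink.
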

## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | `System.Diagnostics.Metrics` (`Meter` direct) | ⛔ none (`GetSnapshot()` only) | ⛔ none |
| Traces | — | ⛔ none | ⛔ none |
| Logs | MEL (`Microsoft.Extensions.Logging`) | Console via `appsettings.json` | ⛔ none |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource exists) |

## 4. Notable design choices

- **`GatewayMetrics` singleton** — all counter/gauge increments are lock-free atomic operations on
  the underlying `Meter` instruments; the singleton is intentional.
- **`ms` histogram unit** — `workers.startup.duration`, `commands.duration`, and
  `events.stream_send.duration` all record in milliseconds. This is non-standard (OTel semconv
  requires `s`) and means raw values differ from OtOpcUa's `s` histograms by a factor of 1000.
- **MEL correlation via `BeginScope`** — MEL scopes are supported by structured logging providers
  (e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The
  scope properties may not appear in all sink configurations, unlike Serilog's `LogContext` which
  is sink-agnostic.
- **Redaction placement** — `GatewayLogRedactor` sits between the caller and the log emission point,
  not inside a sink. This is the correct placement; the shared `ILogRedactor` seam preserves this.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**This is the one in-pass adoption.** The MxGateway MEL → Serilog migration is executed as part of
the `ZB.MOM.WW.Telemetry` library build, not deferred as a follow-on. The changes below land in
the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library build).

**Migrate logging MEL → `AddZbSerilog`:**

- Replace `WebApplicationBuilder` default logging with `builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; })`.
  Gains structured `SiteId` / `NodeRole` / `NodeHostname` enrichers on every log event, plus
  `TraceContextEnricher` (currently moot — no spans — but ready for when tracing is added).
- Re-express the `GatewayLogScope` / `BeginGatewayScope` / `UseGatewayRequestLogging` correlation
  mechanism as a Serilog `LogContext.PushProperty` scope. The middleware at
  `GatewayRequestLoggingMiddlewareExtensions.cs:22–41` is refactored to push the same four
  properties (`SessionId`, `WorkerProcessId`, `CorrelationId`, `CommandMethod`) via Serilog's
  `LogContext` rather than MEL `BeginScope`. Behavior is identical; portability improves.
- Move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The redaction policy (which
  commands/tokens to scrub and how) stays per-project in a `MxGatewayLogRedactor : ILogRedactor`
  implementation; the seam is shared.
- Console + file sinks configured via `ReadFrom.Configuration` in `appsettings.json` — consistent
  with OtOpcUa and ScadaBridge's Serilog approach.

**Wire metrics export via `AddZbTelemetry`:**

- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; })`.
  This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus
  exporter. The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time.
  `GatewayMetrics.cs` itself is unchanged — only the SDK layer is added around it.
- Add `app.MapZbMetrics()` to expose `/metrics`.

**Convert histogram unit `ms` → `s`:**

- Convert the three histograms to seconds: re-create the instruments with unit `s` and record
  values in seconds (existing millisecond values are multiplied by `0.001` at the call site). The
  declared unit and the recorded values must change together. This is a breaking change to existing
  dashboards/alerts but required for OTel semconv compliance. Tagged as a convergence item in `GAPS.md`.
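
A minimal sketch of the `ms` to `s` conversion, using only `System.Diagnostics.Metrics`. The class `DurationDemo` and the meter name `Demo.MxGateway` are hypothetical; in the gateway the histogram is an instance member of `GatewayMetrics`. The sketch shows that the unit is declared once at instrument creation, while the value conversion happens where the measurement is recorded.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class DurationDemo
{
    // Hypothetical meter used only for this sketch.
    private static readonly Meter Meter = new("Demo.MxGateway");

    // The unit is metadata fixed at creation time; it cannot be changed afterwards.
    public static readonly Histogram<double> CommandDuration =
        Meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "s");

    // For call sites that already hold a millisecond value.
    public static double MillisecondsToSeconds(double milliseconds) => milliseconds / 1000.0;

    // Preferred path: record straight from a TimeSpan so no manual scaling is needed.
    public static void RecordCommand(TimeSpan elapsed) =>
        CommandDuration.Record(elapsed.TotalSeconds);
}
```

Recording from a `TimeSpan` avoids scattering scale factors across call sites, which is one way to keep the declared unit and the recorded values from drifting apart.
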

**Keep bespoke:**

- `GatewayMetrics.cs` — all 20 instruments (`mxgateway.*` counters, histograms, gauges) stay
  per-project. `AddZbTelemetry` registers the Meter name; it does not own or replace the instruments.
- Meter name `"MxGateway.Server"` — a follow-on rename to `"ZB.MOM.WW.MxGateway"` is tracked in
  `GAPS.md` but is not required for the initial adoption (it is a Prometheus label change that
  breaks existing dashboards).
- `GatewayApplication.cs:62` singleton registration — unchanged; `GatewayMetrics` remains a
  singleton; `AddZbTelemetry` simply hooks the OTel SDK to it.
- The net48 x86 worker's `IWorkerLogger` (stderr key=value) — out of process and out of scope.
  No changes.
@@ -0,0 +1,158 @@
# Observability — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Telemetry code lives in two places: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/` (host-side
bootstrap) and `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/` (instruments + enricher).
All paths relative to repo root. Verified 2026-06-01.

The most complete observability implementation in the family: OpenTelemetry SDK with both metrics and
tracing signals, Prometheus export, Serilog structured logging with a driver-scoped correlation enricher,
and a dedicated instrument vocabulary. The one significant gap: **no OTel Resource / `service.name`**,
so its signals carry no service identity and cannot be told apart from those of other fleet members
in a backend.

## 1. Metrics (OpenTelemetry SDK)

### Bootstrap — `ObservabilityExtensions.cs`

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`:

- `:18` — `AddOtOpcUaObservability(IServiceCollection)` is the service-registration entry point.
- `:20` — `AddOpenTelemetry()` wires the OTel SDK.
- `:21–23` — `.WithMetrics(b => b.AddMeter(OtOpcUaTelemetry.MeterName).AddPrometheusExporter())`:
  registers the application meter and attaches the Prometheus scrape exporter.
- `:24–25` — `.WithTracing(b => b.AddSource(OtOpcUaTelemetry.ActivitySourceName))`:
  registers the application activity source for trace data.
- **No `ResourceBuilder` call anywhere** — `service.name`, `service.namespace`, `service.version`,
  `site.id`, and `node.role` are not set. The OTel SDK falls back to its default Resource
  (`service.name` reported as `unknown_service`, typically suffixed with the process name).
- `:36` — `MapOtOpcUaMetrics(IEndpointRouteBuilder)` maps the Prometheus endpoint.
- `:38` — endpoint path is `/metrics`.

`Program.cs`:
- `:138` — `builder.Services.AddOtOpcUaObservability()`
- `:160` — `app.MapOtOpcUaMetrics()`

Package refs in csproj: `OpenTelemetry.Extensions.Hosting`, `OpenTelemetry.Exporter.Prometheus.AspNetCore`.
**No `OpenTelemetry.Exporter.OpenTelemetryProtocol`** — OTLP is not available; Prometheus is the
only export path.

### Instruments — `OtOpcUaTelemetry.cs`

`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`:

- `:19` — `MeterName = "ZB.MOM.WW.OtOpcUa"` (the `Meter` the SDK will collect).
- `:20` — `ActivitySourceName = "ZB.MOM.WW.OtOpcUa"` (the `ActivitySource` for spans).

Instruments defined (all `static readonly` on `OtOpcUaTelemetry`):

| Instrument | Kind | Unit | Subsystem |
|---|---|---|---|
| `otopcua.deploy.applied` | `Counter<long>` | — | deploy |
| `otopcua.deploy.apply.duration` | `Histogram<double>` | `s` | deploy |
| `otopcua.driver.lifecycle` | `Counter<long>` | — | driver |
| `otopcua.virtualtag.eval` | `Counter<long>` | — | virtual-tag |
| `otopcua.scriptedalarm.transition` | `Counter<long>` | — | scripted-alarm |
| `otopcua.opcua.sink.write` | `Counter<long>` | — | opc-ua sink |
| `otopcua.redundancy.service_level_change` | `Counter<long>` | — | redundancy |

Two activity spans: `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`.

Naming convention: `otopcua.<subsystem>.<event>`. Duration histogram correctly uses unit `s`
(OTel semantic conventions). **No standard instrumentation** (ASP.NET Core, HttpClient, runtime,
gRPC client meters) is wired — only the bespoke application instruments.
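
The layout described above (a static class holding one `Meter`, one `ActivitySource`, and `static readonly` instruments) might look roughly like the sketch below. It reuses the names from the table, but `OtOpcUaTelemetrySketch` and `ApplyDeployment` are hypothetical and this is not the actual `OtOpcUaTelemetry.cs`.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical mirror of the described layout, covering two of the seven instruments.
public static class OtOpcUaTelemetrySketch
{
    public const string MeterName = "ZB.MOM.WW.OtOpcUa";
    public const string ActivitySourceName = "ZB.MOM.WW.OtOpcUa";

    private static readonly Meter Meter = new(MeterName);
    public static readonly ActivitySource Source = new(ActivitySourceName);

    public static readonly Counter<long> DeployApplied =
        Meter.CreateCounter<long>("otopcua.deploy.applied");

    public static readonly Histogram<double> DeployApplyDuration =
        Meter.CreateHistogram<double>("otopcua.deploy.apply.duration", unit: "s");

    // Hypothetical call site: wraps an operation in a span and records its duration in seconds.
    public static void ApplyDeployment(Action apply)
    {
        // StartActivity returns null when nothing listens to the source.
        using Activity? activity = Source.StartActivity("otopcua.deploy.apply");
        long start = Stopwatch.GetTimestamp();
        apply();
        DeployApplyDuration.Record(Stopwatch.GetElapsedTime(start).TotalSeconds);
        DeployApplied.Add(1);
    }
}
```

Keeping the meter and source names as constants is what lets a bootstrap register them by name (`AddMeter`, `AddSource`) without referencing the instruments themselves.
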

## 2. Logging (Serilog)

### Bootstrap

`Program.cs`:
- `:49–52` — two-stage Serilog bootstrap: initial logger for startup, then full
  `UseSerilog(ReadFrom.Configuration)`. Sinks: Console + rolling file `logs/otopcua-.log`.
- `:141` — `UseSerilogRequestLogging()` on the `WebApplication`.

### Correlation enricher — `LogContextEnricher.cs`

`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/LogContextEnricher.cs`:

- `:18–36` — `Push(driverInstanceId, driverType, capability, correlationId)` calls
  `LogContext.PushProperty` for four properties:
  - `DriverInstanceId` — Galaxy driver instance GUID.
  - `DriverType` — driver type discriminator.
  - `CapabilityName` — OPC UA capability being exercised.
  - `CorrelationId` — caller-supplied correlation token.

This enricher is driver-lifecycle-scoped, not request-scoped: it pushes the properties when a driver
operation begins and returns a disposable that pops them on completion.
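
A push-and-pop helper of this kind could be sketched as follows, assuming the Serilog package. `LogContextSketch` is hypothetical and the parameter types are guesses (all strings here); the real `LogContextEnricher.Push` may differ. Only `LogContext.PushProperty` is an actual Serilog API.

```csharp
using System;
using Serilog.Context;

public static class LogContextSketch
{
    // Pushes the four correlation properties and returns a handle that pops them again.
    public static IDisposable Push(
        string driverInstanceId, string driverType, string capability, string correlationId)
    {
        var pushed = new[]
        {
            LogContext.PushProperty("DriverInstanceId", driverInstanceId),
            LogContext.PushProperty("DriverType", driverType),
            LogContext.PushProperty("CapabilityName", capability),
            LogContext.PushProperty("CorrelationId", correlationId),
        };
        return new Scope(pushed);
    }

    private sealed class Scope : IDisposable
    {
        private readonly IDisposable[] _pushed;

        public Scope(IDisposable[] pushed) => _pushed = pushed;

        public void Dispose()
        {
            // Pop in reverse order so the context unwinds like a stack.
            for (int i = _pushed.Length - 1; i >= 0; i--)
                _pushed[i].Dispose();
        }
    }
}
```

A caller would wrap a driver operation in `using (LogContextSketch.Push(...)) { ... }`. The properties only show up in output when the logger is configured with `Enrich.FromLogContext()`.
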

**No `trace_id` / `span_id` enricher.** Although OtOpcUa creates `ActivitySource` spans, the
active `Activity.Current` trace context is never pushed onto Serilog's `LogContext`. A log line
emitted during a span cannot be correlated to the span in a backend.
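
The missing piece is small: the trace context is available from `Activity.Current` at the moment a log event is written. The sketch below shows only that lookup, using the base class library; `TraceContextSketch` is a hypothetical name, and a real enricher would attach these values to the log event rather than return a dictionary.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceContextSketch
{
    // Returns the properties a trace-context enricher would attach,
    // or an empty map when no span is active on the current execution context.
    public static IReadOnlyDictionary<string, string> Capture()
    {
        var props = new Dictionary<string, string>();
        Activity? current = Activity.Current;
        if (current is not null)
        {
            props["trace_id"] = current.TraceId.ToHexString();
            props["span_id"] = current.SpanId.ToHexString();
        }
        return props;
    }
}
```

With the default W3C ID format, `trace_id` is a 32-character hex string and `span_id` a 16-character one, which is the form most backends expect for log-to-trace linking.
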

**No structural enrichers for `service.name` / `site.id` / `node.role`** — these dimensions are
absent from every log line. ScadaBridge has these; OtOpcUa does not.

## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | OTel SDK (`Meter` + `WithMetrics`) | Prometheus `/metrics` | ⛔ none |
| Traces | OTel SDK (`ActivitySource` + `WithTracing`) | ⛔ none (no exporter configured) | ⛔ none |
| Logs | Serilog | Console + rolling file | ⛔ none (no `service.name` property) |
| Trace↔log correlation | — | — | ⛔ absent (`trace_id`/`span_id` not pushed) |

Note: `WithTracing` registers the `ActivitySource` for collection, but no exporter (OTLP or
otherwise) is attached to the tracing pipeline. Spans are created and recorded by the SDK but never
shipped anywhere — effectively a no-op in production.
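
The difference between "a span is created" and "a span is exported" can be seen with a plain `ActivityListener`. This is a sketch using only `System.Diagnostics`; it does not reproduce the OTel SDK pipeline, and `SpanSinkSketch` is a hypothetical name. The listener plays the role the SDK plays today, and the `ActivityStopped` callback marks the point where an exporter would normally hand the span off.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class SpanSinkSketch
{
    public static List<string> Run()
    {
        var observed = new List<string>();
        using var source = new ActivitySource("ZB.MOM.WW.OtOpcUa");

        // With no listener attached, StartActivity returns null: no span object exists at all.
        Activity? before = source.StartActivity("otopcua.deploy.apply");
        observed.Add(before is null ? "no-listener:null" : "no-listener:created");

        using var listener = new ActivityListener
        {
            ShouldListenTo = s => s.Name == "ZB.MOM.WW.OtOpcUa",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
            // An exporter would ship the span from here; this sketch only notes it in memory.
            ActivityStopped = activity => observed.Add(activity.OperationName),
        };
        ActivitySource.AddActivityListener(listener);

        // Now a span is created, and stopping it triggers the callback above.
        using (source.StartActivity("otopcua.deploy.apply")) { }
        return observed;
    }
}
```

In a process with no other listeners, `Run()` returns the two entries `no-listener:null` and `otopcua.deploy.apply`. The second span is fully recorded yet goes nowhere beyond the in-memory list, which is the situation the note above describes.
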

## 4. Notable design choices

- **Instrument naming** follows `<app>.<subsystem>.<event>` cleanly and consistently — this is the
  pattern the shared spec codifies as the fleet convention.
- **Duration unit** correctly uses `s` on `otopcua.deploy.apply.duration` — no conversion needed on
  adoption; this contrasts with MxAccessGateway's `ms` histograms.
- **LogContextEnricher is bespoke but valuable** — the `DriverInstanceId`/`DriverType`/`CapabilityName`
  correlation is OtOpcUa-specific domain context; it should survive adoption behind the shared
  enricher layer.
- **No OTLP path** — with no OTLP exporter, OtOpcUa cannot send metrics or traces to a collector
  (Prometheus is pull-based scraping only). This limits operational flexibility.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**Replace with shared bootstrap:**

- `AddOtOpcUaObservability()` → `builder.AddZbTelemetry(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; o.Meters = [OtOpcUaTelemetry.MeterName]; o.ActivitySources = [OtOpcUaTelemetry.ActivitySourceName]; })`.
  This adds the missing `Resource` (gains `service.name` / `service.namespace` / `service.version` /
  `site.id` / `node.role` / `host.name` on every metric and span). Prometheus `/metrics` stays the
  default exporter; OTLP becomes opt-in via options.
- Add standard instrumentation through `AddZbTelemetry` options: ASP.NET Core meters, HttpClient,
  runtime + process meters — none wired today.
- Fix the tracing no-op: wire an OTLP exporter (or at minimum note that tracing is recorded but not
  exported); `AddZbTelemetry` provides OTLP as the opt-in path.
- `MapOtOpcUaMetrics` → `app.MapZbMetrics()` (same `/metrics` path; shared convention).
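
As an illustration of what a shared bootstrap would need to do underneath, the sketch below wires a Resource, a meter with the Prometheus exporter, and an activity source with an OTLP exporter. It assumes the `OpenTelemetry.Extensions.Hosting`, `OpenTelemetry.Exporter.Prometheus.AspNetCore`, and `OpenTelemetry.Exporter.OpenTelemetryProtocol` packages. `AddDemoTelemetry` and its parameters are hypothetical; this is not the `AddZbTelemetry` contract, which is defined in the shared-contract doc.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

public static class TelemetryWiringSketch
{
    // Hypothetical helper; the real shared bootstrap takes an options object instead.
    public static IServiceCollection AddDemoTelemetry(
        this IServiceCollection services, string serviceName, string siteId, string nodeRole)
    {
        services.AddOpenTelemetry()
            // The Resource is what gives every metric and span its service identity.
            .ConfigureResource(resource => resource
                .AddService(serviceName: serviceName)
                .AddAttributes(new KeyValuePair<string, object>[]
                {
                    new("site.id", siteId),
                    new("node.role", nodeRole),
                }))
            .WithMetrics(metrics => metrics
                .AddMeter("ZB.MOM.WW.OtOpcUa")
                .AddPrometheusExporter())
            .WithTracing(tracing => tracing
                .AddSource("ZB.MOM.WW.OtOpcUa")
                // Attaching an exporter is what turns recorded spans into shipped spans.
                .AddOtlpExporter());
        return services;
    }
}
```

The per-project code keeps owning the `Meter` and `ActivitySource`; the wiring only refers to them by name, which is why the adoption can leave `OtOpcUaTelemetry.cs` untouched.
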

**Replace with shared Serilog bootstrap:**

- Serilog bootstrap in `Program.cs:49–52` → `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; })`.
  This adds structural `SiteId` / `NodeRole` / `NodeHostname` properties to every log line
  (currently absent) and wires the `TraceContextEnricher` so `trace_id`/`span_id` appear on log
  lines emitted during active spans.
- Console + file sinks continue via `ReadFrom.Configuration` in `appsettings.json` — no sink changes
  needed.
- `UseSerilogRequestLogging()` stays.

**Keep bespoke:**

- `OtOpcUaTelemetry.cs` — the application `Meter`, `ActivitySource`, and all instrument definitions
  (`otopcua.*` counters, histograms, spans). These are domain instruments; `AddZbTelemetry` registers
  them by name but does not own them.
- `LogContextEnricher.cs` — driver-lifecycle correlation properties (`DriverInstanceId`,
  `DriverType`, `CapabilityName`, `CorrelationId`) are OtOpcUa-specific. The enricher continues to
  push via `LogContext.PushProperty` alongside the shared enrichers.
- `ObservabilityExtensions.cs` stays but is reduced to a thin wrapper that calls `AddZbTelemetry`
  with OtOpcUa-specific options. The per-project entry point remains; only the implementation body
  is delegated to the shared library.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry`
library build. The library build delivers the shared bootstrap and enrichers; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.
@@ -0,0 +1,151 @@
# Observability — current state: ScadaBridge

Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET, Docker; solution
`ZB.MOM.WW.ScadaBridge.slnx`. The telemetry posture is split across a dangling OTel package ref
(metrics/traces) and a substantive Serilog setup (logs). All paths relative to repo root.
Verified 2026-06-01.

Structurally the cleanest logging enricher set in the family — `SiteId` / `NodeRole` /
`NodeHostname` are already first-class Serilog enricher properties — but the weakest on
metrics/tracing: zero instrumentation. The `OpenTelemetry.Api` package reference is a CVE-patch
artefact, not instrumentation.

## 1. Metrics and traces (absent)

### `OpenTelemetry.Api` — CVE-patch ref, not instrumentation

`src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj`:
- `:31` — `<PackageReference Include="OpenTelemetry.Api" />` — a **direct version override** added
  to satisfy GHSA-g94r-2vxg-569j / GHSA-8785-wc3w-h8q6 (OpenTelemetry 1.9.0 CVEs introduced via
  `Akka.Hosting`'s pinned transitive dependency).

There is **no `AddOpenTelemetry()` call** in the solution. No `Meter` is created. No
`ActivitySource` is declared. No exporter is configured. The package reference solely overrides the
transitive version — it has no runtime effect on observability.

### Instrument coverage

Zero application instruments. There is no custom `Meter`, no counter, no histogram, no gauge, and
no span in the ScadaBridge codebase. This is the largest gap in the family.

## 2. Logging (Serilog — strongest enricher set)

### Two-stage bootstrap

`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`:
- `:27–54` — two-stage Serilog bootstrap: an initial logger is created for startup messages before
  the host is built; the full logger replaces it during `UseSerilog`.

### `LoggerConfigurationFactory.cs`

`src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs`:

Full factory method signature: `Build(IConfiguration config, string nodeRole, string siteId, string nodeHostname)`.

- `:62` — reads `ScadaBridge:Logging:MinimumLevel` from configuration.
- `:84` — `ReadFrom.Configuration(config)` pulls sink configuration from `appsettings.json`.
- `:85` — explicit `MinimumLevel.Is(...)` override from the typed option.
- `:86–88` — three structural enrichers:
  - `.Enrich.WithProperty("SiteId", siteId)` — site identifier (e.g. `"site-a"`).
  - `.Enrich.WithProperty("NodeHostname", nodeHostname)` — node hostname.
  - `.Enrich.WithProperty("NodeRole", nodeRole)` — Akka cluster role (e.g. `"central"`, `"site"`).

These three properties are the cleanest and most complete set in the family. ScadaBridge's property
names (`SiteId` / `NodeRole` / `NodeHostname`) are also the ones the shared `AddZbTelemetry`
options object maps onto `site.id` / `node.role` / `host.name` OTel Resource attributes — no
renaming needed on adoption.
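
Pieced together from the line references above, the factory could look roughly like this sketch. It assumes the `Serilog` and `Serilog.Settings.Configuration` packages. The class name `LoggerFactorySketch`, the `LoggerConfiguration` return type, and the level-parsing fallback are assumptions; the source only gives the parameter list and the calls made.

```csharp
using System;
using Microsoft.Extensions.Configuration;
using Serilog;
using Serilog.Events;

public static class LoggerFactorySketch
{
    public static LoggerConfiguration Build(
        IConfiguration config, string nodeRole, string siteId, string nodeHostname)
    {
        // ASSUMPTION: fall back to Information when the key is missing or unparsable.
        LogEventLevel level = Enum.TryParse<LogEventLevel>(
            config["ScadaBridge:Logging:MinimumLevel"], ignoreCase: true, out var parsed)
            ? parsed
            : LogEventLevel.Information;

        return new LoggerConfiguration()
            .ReadFrom.Configuration(config)      // sinks come from appsettings.json
            .MinimumLevel.Is(level)              // explicit override of the configured level
            .Enrich.WithProperty("SiteId", siteId)
            .Enrich.WithProperty("NodeHostname", nodeHostname)
            .Enrich.WithProperty("NodeRole", nodeRole);
    }
}
```

Because the three properties are attached with `Enrich.WithProperty` at logger creation, they appear on every event from the process without any per-call scope.
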

### Sink configuration

`appsettings.json:3–23` — Serilog sinks configured via `ReadFrom.Configuration`:
- Console sink with output template that includes `[{NodeRole}/{NodeHostname}]`.
- File sink (path in config; rolling interval).

### `LoggingOptions.cs`

`src/ZB.MOM.WW.ScadaBridge.Host/LoggingOptions.cs`:
- `MinimumLevel` — config-bound minimum level; default `Information`.

### Missing elements

- **No custom enrichers** beyond the three structural properties. `LogContextEnricher` (OtOpcUa's
  driver-correlation enricher) has no equivalent; MxGateway's per-session correlation scope has no
  equivalent. Per-request/per-operation correlation is not present.
- **No `trace_id` / `span_id` enricher.** As with the other two projects, log lines do not carry
  trace context. Because ScadaBridge has zero `ActivitySource` instrumentation, this is consistent —
  but it means no trace↔log correlation path exists even hypothetically.

## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | ⛔ none | ⛔ none | ⛔ none |
| Traces | ⛔ none | ⛔ none | ⛔ none |
| Logs | Serilog | Console + file (`appsettings.json`) | ⛔ none (no `service.name` property) |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource; no enricher) |

## 4. Notable design choices

- **`SiteId` / `NodeRole` / `NodeHostname` as first-class enrichers** — unlike OtOpcUa's
  driver-scoped `LogContextEnricher`, ScadaBridge's structural enrichers are attached at logger
  creation and appear on every log line from the process. This is the target pattern for the shared
  bootstrap.
- **`nodeRole` + `siteId` passed into the factory** — ScadaBridge's `LoggerConfigurationFactory.Build`
  takes these as method arguments rather than reading them from a registered options object.
  The shared `AddZbSerilog` approach binds them from the same `ZbTelemetryOptions` used for the OTel
  Resource, unifying the source.
- **Config-driven `MinimumLevel`** — `ScadaBridge:Logging:MinimumLevel` is a typed config path;
  `ReadFrom.Configuration` for sinks. The shared bootstrap's `AddZbSerilog` must support the same
  pattern.
- **No custom enrichers** — ScadaBridge's logging is intentionally minimal on operation-scoped
  context. Correlation in the distributed model is provided by structured log fields from Akka
  actor context, not a log enricher pipeline.
- **CVE-patch ref discipline** — the `OpenTelemetry.Api` pin is a responsible CVE response but
  leaves the telemetry story incomplete. On adoption, the CVE pin is superseded by the full OTel SDK
  pulled in by `AddZbTelemetry`; the explicit `<PackageReference>` override can be removed.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**Replace CVE-patch ref with full OTel SDK via `AddZbTelemetry`:**

- Remove the lone `OpenTelemetry.Api` override from
  `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31`.
- Add `builder.AddZbTelemetry(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.Meters = ["ZB.MOM.WW.ScadaBridge"]; })`.
  The full OTel SDK supersedes the transitive version override; the CVE is resolved transitively
  via the SDK's current dependency.

**Add first application instruments:**

- Define a `ScadaBridgeTelemetry` class (mirror `OtOpcUaTelemetry`) with a `Meter` named
  `"ZB.MOM.WW.ScadaBridge"` and an initial set of instruments covering the most observable
  operations: site connection lifecycle, alarm received, data-change received, actor supervision
  events. Naming convention: `scadabridge.<subsystem>.<event>`.
- Register the meter name in `AddZbTelemetry` options. Expose `/metrics` via `app.MapZbMetrics()`.
  ScadaBridge goes from zero instrumentation to a baseline exportable set.

**Adopt `AddZbSerilog`:**

- Replace the `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` call in
  `Program.cs:27–54` with `builder.AddZbSerilog(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.NodeHostname = cfg.NodeHostname; })`.
  The three enrichers (`SiteId`, `NodeRole`, `NodeHostname`) are now provided by the shared
  `AddZbSerilog` path; `LoggerConfigurationFactory` can be deleted.
- `ReadFrom.Configuration` for sinks and `MinimumLevel.Is` override from config are preserved
  inside `AddZbSerilog` — behavior is unchanged.
- The `TraceContextEnricher` is wired automatically by `AddZbSerilog`; `trace_id` / `span_id` will
  appear on log lines once the application also creates spans. That requires an `ActivitySource`,
  which the metrics-only instrument plan above does not yet include.

**Keep bespoke:**

- `LoggingOptions.cs` — the `MinimumLevel` typed option and its config path
  (`ScadaBridge:Logging:MinimumLevel`) remain; `AddZbSerilog` must accept the minimum-level
  override from configuration. The config path stays ScadaBridge's own.
- Console output template including `[{NodeRole}/{NodeHostname}]` — driven by `appsettings.json`;
  no change.
- Akka actor-context log fields — per-operation context emitted by Akka infrastructure; not an
  enricher concern.
- `ZB.MOM.WW.ScadaBridge.Host.csproj` package set otherwise — no other changes to the project file.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry`
library build. Adding instruments and adopting `AddZbSerilog`/`AddZbTelemetry` lands in the
ScadaBridge repo as a separate commit once the nupkg is available.