docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract
Author the three normalization docs for the observability component:

- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project), AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline, exporter conventions, Serilog two-stage bootstrap with identity enrichers and TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app namespace; MxGateway.Server flagged as convergence target), instrument naming pattern (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms flagged), Resource attribute set table, standard instrumentation baseline, and per-app instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder + IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog, ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher. Consumer matrix and open contract questions included.
@@ -0,0 +1,191 @@
# Observability — current state: MxAccessGateway

Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET Framework 4.8 worker (**x86**);
solution `src/MxGateway.sln`. Telemetry code is concentrated in
`src/ZB.MOM.WW.MxGateway.Server/Metrics/` (instruments) and
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/` (logging correlation + redaction).
All paths relative to repo root. Verified 2026-06-01.

The most unusual observability posture in the family: **13 counters, 3 histograms, and 4 observable
gauges** all fully hand-rolled using `System.Diagnostics.Metrics` directly — but **never exported**
(no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory
`GetSnapshot()`. Logging is `Microsoft.Extensions.Logging` exclusively (no Serilog), with a bespoke
correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of
scope — its `IWorkerLogger` (stderr key=value) is not addressed here.

## 1. Metrics (hand-rolled, unexported)

### `GatewayMetrics.cs`

`src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`:

Meter name: `"MxGateway.Server"` (does not follow the project namespace `ZB.MOM.WW.MxGateway`).

All instruments are instance members of `GatewayMetrics`. The class is registered as a **singleton**
at `GatewayApplication.cs:62`. There is **no `OpenTelemetry.Extensions.Hosting`**,
**no `AddOpenTelemetry()` call**, and **no exporter** — the `Meter` is created with
`new Meter("MxGateway.Server")` and `GetSnapshot()` is the only read path.

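As a rough illustration of the posture described above, the sketch below shows a `Meter` whose values are only ever read in process. It assumes `GetSnapshot()` is backed by a `MeterListener`; the class name, the dictionary shape, and the single counter are hypothetical and are not taken from `GatewayMetrics.cs`. Nothing in it reaches Prometheus or OTLP, which is the gap the adoption plan closes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical stand-in for GatewayMetrics; not the gateway's actual code.
public sealed class SketchGatewayMetrics : IDisposable
{
    private readonly Meter _meter = new("MxGateway.Server");
    private readonly MeterListener _listener = new();
    private readonly ConcurrentDictionary<string, long> _totals = new();
    private readonly Counter<long> _sessionsOpened;

    public SketchGatewayMetrics()
    {
        _sessionsOpened = _meter.CreateCounter<long>("mxgateway.sessions.opened");

        // Subscribe only to instruments published by this meter instance.
        _listener.InstrumentPublished = (instrument, listener) =>
        {
            if (ReferenceEquals(instrument.Meter, _meter))
                listener.EnableMeasurementEvents(instrument);
        };

        // Accumulate every long measurement into an in-memory total per instrument.
        _listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
            _totals.AddOrUpdate(instrument.Name, value, (_, current) => current + value));

        _listener.Start();
    }

    public void SessionOpened() => _sessionsOpened.Add(1);

    // The only read path: a point-in-time copy of the in-memory totals.
    public IReadOnlyDictionary<string, long> GetSnapshot() =>
        new Dictionary<string, long>(_totals);

    public void Dispose()
    {
        _listener.Dispose();
        _meter.Dispose();
    }
}
```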
**Counters (13):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.opened` | New session requests |
| `mxgateway.sessions.closed` | Sessions torn down |
| `mxgateway.commands.started` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | Command completed OK |
| `mxgateway.commands.failed` | Command error |
| `mxgateway.events.received` | MXAccess events from worker |
| `mxgateway.queues.overflows` | Queue overflow (backpressure) |
| `mxgateway.faults` | Unhandled gateway faults |
| `mxgateway.workers.killed` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | Retry attempts (any subsystem) |

**Histograms (3) — unit `ms` (diverges from OTel semconv `s`):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.workers.startup.duration` | Time from worker spawn to ready |
| `mxgateway.commands.duration` | End-to-end MXAccess command latency |
| `mxgateway.events.stream_send.duration` | gRPC event stream send latency |

**Observable gauges (4):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.open` | Currently open sessions (live count) |
| `mxgateway.workers.running` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | Per-stream gRPC send queue depth |

All 20 instruments share the `mxgateway.*` prefix and `<category>.<event>` naming — consistent
with the family convention. Duration histograms record in **milliseconds** (`ms`); OTel semantic
conventions require seconds (`s`). This is the only project with `ms` histograms.

### Singleton wiring

`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:
- `:62` — `services.AddSingleton<GatewayMetrics>()` registers the metrics singleton.

There is no `AddOpenTelemetry()` call anywhere in the gateway. The `GatewayMetrics` `Meter` is
created independently of any OTel SDK, so its measurements are observed only through
`MeterListener` / `GetSnapshot()`.
Without the OTel SDK, this data is **invisible to Prometheus, OTLP, or any backend**.

### No tracing

No `ActivitySource` is defined. No spans are created. Tracing is entirely absent.

## 2. Logging (Microsoft.Extensions.Logging)

All logging in the gateway server uses `Microsoft.Extensions.Logging` (MEL) exclusively. There is
no Serilog dependency. Sink configuration lives in `appsettings.json` (Console, with structured
logging via the default host builder).

### Correlation scope

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs`:

Defines the per-request/per-session correlation property bag.

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs`:
- `:22–41` — `UseGatewayRequestLogging()` middleware reads the following HTTP headers from each
  incoming request: `x-session-id`, `x-worker-process-id`, `x-correlation-id`, `x-command-method`,
  `authorization` (for redaction, not logging).
- Registered at `GatewayApplication.cs:34`.

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs`:
- `:11–18` — `BeginGatewayScope(ILogger, GatewayLogScope)` calls `logger.BeginScope(scope)` —
  MEL's `ILogger.BeginScope` mechanism, which pushes properties as a scoped dictionary.

The correlation tuple (`SessionId` / `WorkerProcessId` / `CorrelationId` / `CommandMethod`) is
injected into log lines produced within the scope. No `trace_id` / `span_id` enrichment — there
is no ActivitySource, so this is consistent but leaves no path to trace correlation.

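The scope mechanism above can be sketched as follows. This is a minimal illustration under assumptions: `SketchLogScope` and `BeginSketchScope` are hypothetical names, and the real `GatewayLogScope` is passed to `BeginScope` directly rather than converted to a dictionary as shown here. The dictionary form is used because structured MEL providers generally treat key/value scope state as log properties.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

// Hypothetical shape; the real GatewayLogScope may differ.
public sealed record SketchLogScope(
    string? SessionId, string? WorkerProcessId, string? CorrelationId, string? CommandMethod);

public static class SketchLoggerExtensions
{
    // Flattens the correlation tuple into key/value scope state.
    public static Dictionary<string, object?> ToProperties(SketchLogScope scope) => new()
    {
        ["SessionId"] = scope.SessionId,
        ["WorkerProcessId"] = scope.WorkerProcessId,
        ["CorrelationId"] = scope.CorrelationId,
        ["CommandMethod"] = scope.CommandMethod,
    };

    // Every log call made before the returned scope is disposed carries the four properties,
    // provided the configured logging provider includes scopes in its output.
    public static IDisposable? BeginSketchScope(this ILogger logger, SketchLogScope scope) =>
        logger.BeginScope(ToProperties(scope));
}
```

A caller would wrap the request in `using (logger.BeginSketchScope(scope)) { ... }` so the properties are popped when the request completes.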
### Log redaction — `GatewayLogRedactor.cs`

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs`:

- Masks sensitive arguments in log lines for two command categories:
  - **`AuthenticateUser`** commands: the password argument is replaced.
  - **`WriteSecured`** commands: the value argument is replaced.
- **`mxgw_` bearer tokens**: the token body is masked, keeping only the key-id prefix.
- Redaction is applied before the log event is emitted — no sensitive data reaches the sink.

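The redaction rules listed above could look roughly like the sketch below. It is not the contents of `GatewayLogRedactor.cs`: the class name, the `***` mask, and above all the token layout `mxgw_<keyId>_<secret>` are assumptions made for illustration, since the source does not state the token format. The argument helper is also simplified to a single value per command.

```csharp
using System.Text.RegularExpressions;

// Hypothetical redactor; names and token format are assumptions.
public static class SketchLogRedactor
{
    // ASSUMPTION: tokens look like "mxgw_<keyId>_<secret>".
    private static readonly Regex Token = new(
        @"\bmxgw_(?<keyId>[A-Za-z0-9]+)_[A-Za-z0-9\-]+",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    // Keeps the key-id prefix and masks the token body.
    public static string RedactTokens(string text) =>
        Token.Replace(text, m => $"mxgw_{m.Groups["keyId"].Value}_***");

    // Masks the sensitive argument of the two commands named in the list above;
    // arguments of any other command pass through unchanged.
    public static string RedactArgument(string commandMethod, string argument) =>
        commandMethod is "AuthenticateUser" or "WriteSecured" ? "***" : argument;
}
```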
This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and
ScadaBridge have no equivalent.

## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | `System.Diagnostics.Metrics` (`Meter` direct) | ⛔ none (`GetSnapshot()` only) | ⛔ none |
| Traces | — | ⛔ none | ⛔ none |
| Logs | MEL (`Microsoft.Extensions.Logging`) | Console via `appsettings.json` | ⛔ none |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource exists) |

## 4. Notable design choices

- **`GatewayMetrics` singleton** — all counter/gauge increments are lock-free atomic operations on
  the underlying `Meter` instruments; the singleton is intentional.
- **`ms` histogram unit** — `workers.startup.duration`, `commands.duration`, and
  `events.stream_send.duration` all record in milliseconds. This is non-standard (OTel semconv
  requires `s`) and means raw values differ from OtOpcUa's `s` histograms by a factor of 1000.
- **MEL correlation via `BeginScope`** — MEL scopes are supported by structured logging providers
  (e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The
  scope properties may not appear in all sink configurations (the built-in console provider, for
  example, omits scopes unless `IncludeScopes` is enabled), unlike Serilog's `LogContext` which
  is sink-agnostic.
- **Redaction placement** — `GatewayLogRedactor` sits between the caller and the log emission point,
  not inside a sink. This is the correct placement; the shared `ILogRedactor` seam preserves this.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**This is the one in-pass adoption.** The MxGateway MEL → Serilog migration is executed as part of
the `ZB.MOM.WW.Telemetry` library build, not deferred as a follow-on. The changes below land in
the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library build).

**Migrate logging MEL → `AddZbSerilog`:**

- Replace `WebApplicationBuilder` default logging with `builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; })`.
  Gains structured `SiteId` / `NodeRole` / `NodeHostname` enrichers on every log event, plus
  `TraceContextEnricher` (currently moot — no spans — but ready for when tracing is added).
- Re-express the `GatewayLogScope` / `BeginGatewayScope` / `UseGatewayRequestLogging` correlation
  mechanism as a Serilog `LogContext.PushProperty` scope. The middleware at
  `GatewayRequestLoggingMiddlewareExtensions.cs:22–41` is refactored to push the same four
  properties (`SessionId`, `WorkerProcessId`, `CorrelationId`, `CommandMethod`) via Serilog's
  `LogContext` rather than MEL `BeginScope`. Behavior is identical; portability improves.
- Move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The redaction policy (which
  commands/tokens to scrub and how) stays per-project in a `MxGatewayLogRedactor : ILogRedactor`
  implementation; the seam is shared.
- Console + file sinks configured via `ReadFrom.Configuration` in `appsettings.json` — consistent
  with OtOpcUa and ScadaBridge's Serilog approach.

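The `LogContext` refactor in the second bullet above might take a shape like this. It is a sketch, not the planned middleware: `UseSketchRequestLogging` is a hypothetical name, header handling is reduced to the four logged headers, and the properties only show up in output when the Serilog logger is configured with `Enrich.FromLogContext()`.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Serilog.Context;

// Hypothetical replacement for the MEL BeginScope middleware.
public static class SketchRequestLoggingExtensions
{
    public static IApplicationBuilder UseSketchRequestLogging(this IApplicationBuilder app) =>
        app.Use(async (HttpContext context, RequestDelegate next) =>
        {
            IHeaderDictionary headers = context.Request.Headers;

            // Each PushProperty returns an IDisposable that pops the property when
            // disposed, so the values cover exactly this request.
            using (LogContext.PushProperty("SessionId", headers["x-session-id"].ToString()))
            using (LogContext.PushProperty("WorkerProcessId", headers["x-worker-process-id"].ToString()))
            using (LogContext.PushProperty("CorrelationId", headers["x-correlation-id"].ToString()))
            using (LogContext.PushProperty("CommandMethod", headers["x-command-method"].ToString()))
            {
                await next(context);
            }
        });
}
```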
**Wire metrics export via `AddZbTelemetry`:**

- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; })`.
  This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus
  exporter. The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time.
  `GatewayMetrics.cs` itself is unchanged — only the SDK layer is added around it.
- Add `app.MapZbMetrics()` to expose `/metrics`.

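Since `AddZbTelemetry` and `MapZbMetrics` do not exist yet, the sketch below shows the plain OpenTelemetry calls they would presumably wrap for the metrics signal. That mapping is an assumption; only the meter name comes from the text above. The calls assume the `OpenTelemetry.Extensions.Hosting` and `OpenTelemetry.Exporter.Prometheus.AspNetCore` packages are referenced.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

// Subscribe the OTel SDK to the existing hand-rolled meter by name;
// the GatewayMetrics class itself needs no changes for this step.
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("MxGateway.Server")
        .AddPrometheusExporter());

var app = builder.Build();

// Exposes the Prometheus scrape endpoint (default path: /metrics).
app.MapPrometheusScrapingEndpoint();

app.Run();
```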
**Convert histogram unit `ms` → `s`:**

- Convert the three histograms to seconds: re-create the instruments with unit `s` and record
  values in seconds (multiply by `0.001` at the call site). Both halves are needed; changing only
  the unit string would mislabel millisecond values. This is a breaking change to existing dashboards/alerts
  but required for OTel semconv compliance. Tagged as a convergence item in `GAPS.md`.

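A minimal sketch of the unit conversion, assuming call sites currently hold either a millisecond value or a `Stopwatch` timestamp. `SketchDurations` and its members are hypothetical names; only the instrument name and the target unit `s` come from the text above.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical helper; not part of GatewayMetrics.cs.
public static class SketchDurations
{
    private static readonly Meter Meter = new("MxGateway.Server");

    // Same instrument name as today, but declared with unit "s".
    public static readonly Histogram<double> CommandDuration =
        Meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "s");

    // For call sites that already hold a millisecond value.
    public static double ToSeconds(double milliseconds) => milliseconds / 1000.0;

    // For call sites that hold a Stopwatch.GetTimestamp() value:
    // record the elapsed time directly in seconds.
    public static void RecordCommand(long startTimestamp) =>
        CommandDuration.Record(Stopwatch.GetElapsedTime(startTimestamp).TotalSeconds);
}
```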
**Keep bespoke:**

- `GatewayMetrics.cs` — all 20 instruments (`mxgateway.*` counters, histograms, gauges) stay
  per-project. `AddZbTelemetry` registers the Meter name; it does not own or replace the instruments.
- Meter name `"MxGateway.Server"` — a follow-on rename to `"ZB.MOM.WW.MxGateway"` is tracked in
  `GAPS.md` but is not required for the initial adoption (it is a Prometheus label change that
  breaks existing dashboards).
- `GatewayApplication.cs:62` singleton registration — unchanged; `GatewayMetrics` remains a
  singleton; `AddZbTelemetry` simply hooks the OTel SDK to it.
- The net48 x86 worker's `IWorkerLogger` (stderr key=value) — out of process and out of scope.
  No changes.
@@ -0,0 +1,158 @@
# Observability — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Telemetry code lives in two places: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/` (host-side
bootstrap) and `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/` (instruments + enricher).
All paths relative to repo root. Verified 2026-06-01.

The most complete observability implementation in the family: OpenTelemetry SDK with both metrics and
tracing signals, Prometheus export, Serilog structured logging with a driver-scoped correlation enricher,
and a dedicated instrument vocabulary. The one significant gap: **no OTel Resource / `service.name`**,
so signals from one OtOpcUa node cannot be told apart from another node's, or from those of other
fleet members, in a backend.

## 1. Metrics (OpenTelemetry SDK)

### Bootstrap — `ObservabilityExtensions.cs`

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`:

- `:18` — `AddOtOpcUaObservability(IServiceCollection)` is the service-registration entry point.
- `:20` — `AddOpenTelemetry()` wires the OTel SDK.
- `:21–23` — `.WithMetrics(b => b.AddMeter(OtOpcUaTelemetry.MeterName).AddPrometheusExporter())`:
  registers the application meter and attaches the Prometheus scrape exporter.
- `:24–25` — `.WithTracing(b => b.AddSource(OtOpcUaTelemetry.ActivitySourceName))`:
  registers the application activity source for trace data.
- **No `ResourceBuilder` call anywhere** — `service.name`, `service.namespace`, `service.version`,
  `site.id`, and `node.role` are not set. The OTel SDK falls back to its default Resource, whose
  `service.name` is a generic `unknown_service` placeholder.
- `:36` — `MapOtOpcUaMetrics(IEndpointRouteBuilder)` maps the Prometheus endpoint.
- `:38` — endpoint path is `/metrics`.

`Program.cs`:
- `:138` — `builder.Services.AddOtOpcUaObservability()`
- `:160` — `app.MapOtOpcUaMetrics()`

Package refs in csproj: `OpenTelemetry.Extensions.Hosting`, `OpenTelemetry.Exporter.Prometheus.AspNetCore`.
**No `OpenTelemetry.Exporter.OpenTelemetryProtocol`** — OTLP is not available; Prometheus is the
only export path.

### Instruments — `OtOpcUaTelemetry.cs`

`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`:

- `:19` — `MeterName = "ZB.MOM.WW.OtOpcUa"` (the `Meter` the SDK will collect).
- `:20` — `ActivitySourceName = "ZB.MOM.WW.OtOpcUa"` (the `ActivitySource` for spans).

Instruments defined (all `static readonly` on `OtOpcUaTelemetry`):

| Instrument | Kind | Unit | Subsystem |
|---|---|---|---|
| `otopcua.deploy.applied` | `Counter<long>` | — | deploy |
| `otopcua.deploy.apply.duration` | `Histogram<double>` | `s` | deploy |
| `otopcua.driver.lifecycle` | `Counter<long>` | — | driver |
| `otopcua.virtualtag.eval` | `Counter<long>` | — | virtual-tag |
| `otopcua.scriptedalarm.transition` | `Counter<long>` | — | scripted-alarm |
| `otopcua.opcua.sink.write` | `Counter<long>` | — | opc-ua sink |
| `otopcua.redundancy.service_level_change` | `Counter<long>` | — | redundancy |

Two activity spans: `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`.

Naming convention: `otopcua.<subsystem>.<event>`. Duration histogram correctly uses unit `s`
(OTel semantic conventions). **No standard instrumentation** (ASP.NET Core, HttpClient, runtime,
gRPC client meters) is wired — only the bespoke application instruments.

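The layout described above (one static class holding the `Meter`, the `ActivitySource`, and `static readonly` instruments) can be sketched as follows. This mirrors the description only; `SketchTelemetry` and `RecordDeploy` are hypothetical and the actual file may be organised differently. Two of the seven instruments and one of the two spans are shown.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical mirror of the described layout; not the contents of OtOpcUaTelemetry.cs.
public static class SketchTelemetry
{
    public const string MeterName = "ZB.MOM.WW.OtOpcUa";
    public const string ActivitySourceName = "ZB.MOM.WW.OtOpcUa";

    public static readonly Meter Meter = new(MeterName);
    public static readonly ActivitySource ActivitySource = new(ActivitySourceName);

    public static readonly Counter<long> DeployApplied =
        Meter.CreateCounter<long>("otopcua.deploy.applied");

    public static readonly Histogram<double> DeployApplyDuration =
        Meter.CreateHistogram<double>("otopcua.deploy.apply.duration", unit: "s");

    // Wraps one deploy in a span and records the count and the duration in seconds.
    public static void RecordDeploy(Action apply)
    {
        // StartActivity returns null when nothing is listening to this source.
        using Activity? activity = ActivitySource.StartActivity("otopcua.deploy.apply");
        long start = Stopwatch.GetTimestamp();

        apply();

        DeployApplied.Add(1);
        DeployApplyDuration.Record(Stopwatch.GetElapsedTime(start).TotalSeconds);
    }
}
```

Because the instruments are created against a named `Meter`, the host only needs the name (`AddMeter(SketchTelemetry.MeterName)`) to collect them; it never touches the instrument fields.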
## 2. Logging (Serilog)

### Bootstrap

`Program.cs`:
- `:49–52` — two-stage Serilog bootstrap: initial logger for startup, then full
  `UseSerilog(ReadFrom.Configuration)`. Sinks: Console + rolling file `logs/otopcua-.log`.
- `:141` — `UseSerilogRequestLogging()` on the `WebApplication`.

### Correlation enricher — `LogContextEnricher.cs`

`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/LogContextEnricher.cs`:

- `:18–36` — `Push(driverInstanceId, driverType, capability, correlationId)` calls
  `LogContext.PushProperty` for four properties:
  - `DriverInstanceId` — Galaxy driver instance GUID.
  - `DriverType` — driver type discriminator.
  - `CapabilityName` — OPC UA capability being exercised.
  - `CorrelationId` — caller-supplied correlation token.

This enricher is driver-lifecycle-scoped, not request-scoped: it pushes the properties when a
driver operation begins and returns a disposable that pops them on completion.

**No `trace_id` / `span_id` enricher.** Although OtOpcUa creates `ActivitySource` spans, the
active `Activity.Current` trace context is never pushed onto Serilog's `LogContext`. A log line
emitted during a span cannot be correlated to the span in a backend.

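One way to close the `trace_id` / `span_id` gap is a Serilog enricher that reads `Activity.Current`. The sketch below is an illustration only: `SketchTraceContextEnricher` is a hypothetical name and is not the planned `TraceContextEnricher`, whose design is not specified here. It would be registered with `.Enrich.With<SketchTraceContextEnricher>()` on the `LoggerConfiguration`.

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Hypothetical enricher; not the shared library's implementation.
public sealed class SketchTraceContextEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        // No ambient span: leave the event untouched.
        Activity? activity = Activity.Current;
        if (activity is null)
            return;

        // Property names follow the trace_id / span_id wording used in this document.
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("trace_id", activity.TraceId.ToHexString()));
        logEvent.AddPropertyIfAbsent(
            propertyFactory.CreateProperty("span_id", activity.SpanId.ToHexString()));
    }
}
```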
**No structural enrichers for `service.name` / `site.id` / `node.role`** — these dimensions are
absent from every log line. ScadaBridge has these; OtOpcUa does not.

## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | OTel SDK (`Meter` + `WithMetrics`) | Prometheus `/metrics` | ⛔ none |
| Traces | OTel SDK (`ActivitySource` + `WithTracing`) | ⛔ none (no exporter configured) | ⛔ none |
| Logs | Serilog | Console + rolling file | ⛔ none (no `service.name` property) |
| Trace↔log correlation | — | — | ⛔ absent (`trace_id`/`span_id` not pushed) |

Note: `WithTracing` registers the `ActivitySource` for collection, but no exporter (OTLP or
otherwise) is attached to the tracing pipeline. Spans are created and recorded by the SDK but never
shipped anywhere — effectively a no-op in production.

## 4. Notable design choices

- **Instrument naming** follows `<app>.<subsystem>.<event>` cleanly and consistently — this is the
  pattern the shared spec codifies as the fleet convention.
- **Duration unit** correctly uses `s` on `otopcua.deploy.apply.duration` — no conversion needed on
  adoption; this contrasts with MxAccessGateway's `ms` histograms.
- **LogContextEnricher is bespoke but valuable** — the `DriverInstanceId`/`DriverType`/`CapabilityName`
  correlation is OtOpcUa-specific domain context; it should survive adoption behind the shared
  enricher layer.
- **No OTLP path** — with no OTLP exporter, OtOpcUa cannot send metrics or traces to a collector
  (Prometheus is scrape-pull only). This limits operational flexibility.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**Replace with shared bootstrap:**

- `AddOtOpcUaObservability()` → `builder.AddZbTelemetry(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; o.Meters = [OtOpcUaTelemetry.MeterName]; o.ActivitySources = [OtOpcUaTelemetry.ActivitySourceName]; })`.
  This adds the missing `Resource` (gains `service.name` / `service.namespace` / `service.version` /
  `site.id` / `node.role` / `host.name` on every metric and span). Prometheus `/metrics` stays the
  default exporter; OTLP becomes opt-in via options.
- Add standard instrumentation through `AddZbTelemetry` options: ASP.NET Core meters, HttpClient,
  runtime + process meters — none wired today.
- Fix the tracing no-op: wire an OTLP exporter (or at minimum note that tracing is recorded but not
  exported); `AddZbTelemetry` provides OTLP as the opt-in path.
- `MapOtOpcUaMetrics` → `app.MapZbMetrics()` (same `/metrics` path; shared convention).

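The two fixes named above, a populated Resource and an exporter on the tracing pipeline, can be sketched with plain OpenTelemetry calls. This is an assumption about what `AddZbTelemetry` would do internally, not its implementation: `AddSketchObservability` is a hypothetical name, the `serviceNamespace` value is a placeholder, and the OTLP exporter requires the `OpenTelemetry.Exporter.OpenTelemetryProtocol` package that OtOpcUa does not reference today.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Hypothetical bootstrap; stands in for what the shared library might wrap.
public static class SketchObservabilityExtensions
{
    public static IServiceCollection AddSketchObservability(
        this IServiceCollection services, string siteId, string nodeRole)
    {
        services.AddOpenTelemetry()
            // Resource attributes are attached to every metric and span.
            .ConfigureResource(resource => resource
                .AddService(serviceName: "otopcua", serviceNamespace: "ZB.MOM.WW") // namespace value is a placeholder
                .AddAttributes(new KeyValuePair<string, object>[]
                {
                    new("site.id", siteId),
                    new("node.role", nodeRole),
                }))
            .WithMetrics(metrics => metrics
                .AddMeter("ZB.MOM.WW.OtOpcUa")
                .AddPrometheusExporter())
            .WithTracing(tracing => tracing
                .AddSource("ZB.MOM.WW.OtOpcUa")
                // Without an exporter here, spans are recorded but never leave the process.
                .AddOtlpExporter());

        return services;
    }
}
```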
**Replace with shared Serilog bootstrap:**

- Serilog bootstrap in `Program.cs:49–52` → `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; })`.
  This adds structural `SiteId` / `NodeRole` / `NodeHostname` properties to every log line
  (currently absent) and wires the `TraceContextEnricher` so `trace_id`/`span_id` appear on log
  lines emitted during active spans.
- Console + file sinks continue via `ReadFrom.Configuration` in `appsettings.json` — no sink changes
  needed.
- `UseSerilogRequestLogging()` stays.

**Keep bespoke:**

- `OtOpcUaTelemetry.cs` — the application `Meter`, `ActivitySource`, and all instrument definitions
  (`otopcua.*` counters, histograms, spans). These are domain instruments; `AddZbTelemetry` registers
  them by name but does not own them.
- `LogContextEnricher.cs` — driver-lifecycle correlation properties (`DriverInstanceId`,
  `DriverType`, `CapabilityName`, `CorrelationId`) are OtOpcUa-specific. The enricher continues to
  push via `LogContext.PushProperty` alongside the shared enrichers.
- `ObservabilityExtensions.cs` is kept but reduced to a thin wrapper that calls `AddZbTelemetry`
  with OtOpcUa-specific options. The per-project entry point remains; only
  the implementation body is delegated to the shared library.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry`
library build. The library build delivers the shared bootstrap and enrichers; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.
@@ -0,0 +1,151 @@
# Observability — current state: ScadaBridge

Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET, Docker; solution
`ZB.MOM.WW.ScadaBridge.slnx`. The telemetry posture is split across a dangling OTel package ref
(metrics/traces) and a substantive Serilog setup (logs). All paths relative to repo root.
Verified 2026-06-01.

Structurally the cleanest logging enricher set in the family — `SiteId` / `NodeRole` /
`NodeHostname` are already first-class Serilog enricher properties — but the weakest on
metrics/tracing: zero instrumentation. The `OpenTelemetry.Api` package reference is a CVE-patch
artefact, not instrumentation.

## 1. Metrics and traces (absent)

### `OpenTelemetry.Api` — CVE-patch ref, not instrumentation

`src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj`:
- `:31` — `<PackageReference Include="OpenTelemetry.Api" />` — a **direct version override** added
  to satisfy GHSA-g94r-2vxg-569j / GHSA-8785-wc3w-h8q6 (OpenTelemetry 1.9.0 CVEs introduced via
  `Akka.Hosting`'s pinned transitive dependency).

There is **no `AddOpenTelemetry()` call** in the solution. No `Meter` is created. No
`ActivitySource` is declared. No exporter is configured. The package reference solely overrides the
transitive version — it has no runtime effect on observability.

### Instrument coverage

Zero application instruments. There is no custom `Meter`, no counter, no histogram, no gauge, and
no span in the ScadaBridge codebase. This is the largest gap in the family.

## 2. Logging (Serilog — strongest enricher set)

### Two-stage bootstrap

`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`:
- `:27–54` — two-stage Serilog bootstrap: an initial logger is created for startup messages before
  the host is built; the full logger replaces it during `UseSerilog`.

### `LoggerConfigurationFactory.cs`

`src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs`:

Full factory method signature: `Build(IConfiguration config, string nodeRole, string siteId, string nodeHostname)`.

- `:62` — reads `ScadaBridge:Logging:MinimumLevel` from configuration.
- `:84` — `ReadFrom.Configuration(config)` pulls sink configuration from `appsettings.json`.
- `:85` — explicit `MinimumLevel.Is(...)` override from the typed option.
- `:86–88` — three structural enrichers:
  - `.Enrich.WithProperty("SiteId", siteId)` — site identifier (e.g. `"site-a"`).
  - `.Enrich.WithProperty("NodeHostname", nodeHostname)` — node hostname.
  - `.Enrich.WithProperty("NodeRole", nodeRole)` — Akka cluster role (e.g. `"central"`, `"site"`).

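Put together, the factory steps listed above amount to something like the sketch below. It follows the listed calls but is not the file's actual code: the class name, the `LoggerConfiguration` return type, and the fallback to `Information` when the key is missing or unparsable are assumptions. `ReadFrom.Configuration` comes from the `Serilog.Settings.Configuration` package.

```csharp
using System;
using Microsoft.Extensions.Configuration;
using Serilog;
using Serilog.Events;

// Hypothetical reconstruction of the described factory.
public static class SketchLoggerConfigurationFactory
{
    public static LoggerConfiguration Build(
        IConfiguration config, string nodeRole, string siteId, string nodeHostname)
    {
        // ASSUMPTION: fall back to Information when the key is absent or invalid.
        LogEventLevel level = Enum.TryParse(
            config["ScadaBridge:Logging:MinimumLevel"], ignoreCase: true, out LogEventLevel parsed)
            ? parsed
            : LogEventLevel.Information;

        return new LoggerConfiguration()
            .ReadFrom.Configuration(config)                     // sinks from appsettings.json
            .MinimumLevel.Is(level)                             // explicit override
            .Enrich.WithProperty("SiteId", siteId)              // attached to every event
            .Enrich.WithProperty("NodeHostname", nodeHostname)
            .Enrich.WithProperty("NodeRole", nodeRole);
    }
}
```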
These three properties are the cleanest and most complete set in the family. ScadaBridge's property
names (`SiteId` / `NodeRole` / `NodeHostname`) are also the ones the shared `AddZbTelemetry`
options object maps onto `site.id` / `node.role` / `host.name` OTel Resource attributes — no
renaming needed on adoption.

### Sink configuration

`appsettings.json:3–23` — Serilog sinks configured via `ReadFrom.Configuration`:
- Console sink with output template that includes `[{NodeRole}/{NodeHostname}]`.
- File sink (path in config; rolling interval).

### `LoggingOptions.cs`

`src/ZB.MOM.WW.ScadaBridge.Host/LoggingOptions.cs`:
- `MinimumLevel` — config-bound minimum level; default `Information`.

### Missing elements

- **No custom enrichers** beyond the three structural properties. `LogContextEnricher` (OtOpcUa's
  driver-correlation enricher) has no equivalent; MxGateway's per-session correlation scope has no
  equivalent. Per-request/per-operation correlation is not present.
- **No `trace_id` / `span_id` enricher.** As with the other two projects, log lines do not carry
  trace context. Because ScadaBridge has zero `ActivitySource` instrumentation, this is consistent —
  but it means no trace↔log correlation path exists even hypothetically.

## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | ⛔ none | ⛔ none | ⛔ none |
| Traces | ⛔ none | ⛔ none | ⛔ none |
| Logs | Serilog | Console + file (`appsettings.json`) | ⛔ none (no `service.name` property) |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource; no enricher) |

## 4. Notable design choices

- **`SiteId` / `NodeRole` / `NodeHostname` as first-class enrichers** — unlike OtOpcUa's
  driver-scoped `LogContextEnricher`, ScadaBridge's structural enrichers are attached at logger creation and
  appear on every log line from the process. This is the target pattern for the shared bootstrap.
- **`nodeRole` + `siteId` passed into the factory** — ScadaBridge's `LoggerConfigurationFactory.Build`
  takes these as method arguments rather than reading them from a registered options object.
  The shared `AddZbSerilog` approach binds them from the same `ZbTelemetryOptions` used for the OTel
  Resource, unifying the source.
- **Config-driven `MinimumLevel`** — `ScadaBridge:Logging:MinimumLevel` is a typed config path;
  `ReadFrom.Configuration` for sinks. The shared bootstrap's `AddZbSerilog` must support the same
  pattern.
- **No custom enrichers** — ScadaBridge's logging is intentionally minimal on operation-scoped
  context. Correlation in the distributed model is provided by structured log fields from Akka
  actor context, not a log enricher pipeline.
- **CVE-patch ref discipline** — the `OpenTelemetry.Api` pin is a responsible CVE response but
  leaves the telemetry story incomplete. On adoption, the CVE pin is superseded by the full OTel SDK
  pulled in by `AddZbTelemetry`; the explicit `<PackageReference>` override can be removed.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**Replace CVE-patch ref with full OTel SDK via `AddZbTelemetry`:**

- Remove the lone `OpenTelemetry.Api` override from
  `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31`.
- Add `builder.AddZbTelemetry(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.Meters = ["ZB.MOM.WW.ScadaBridge"]; })`.
  The full OTel SDK supersedes the transitive version override; the CVE is resolved transitively
  via the SDK's current dependency.

**Add first application instruments:**

- Define a `ScadaBridgeTelemetry` class (mirror `OtOpcUaTelemetry`) with a `Meter` named
  `"ZB.MOM.WW.ScadaBridge"` and an initial set of instruments covering the most observable
  operations: site connection lifecycle, alarm received, data-change received, actor supervision
  events. Naming convention: `scadabridge.<subsystem>.<event>`.
- Register the meter name in `AddZbTelemetry` options. Expose `/metrics` via `app.MapZbMetrics()`.
  ScadaBridge goes from zero instrumentation to a baseline exportable set.

**Adopt `AddZbSerilog`:**

- Replace the `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` call in
  `Program.cs:27–54` with `builder.AddZbSerilog(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.NodeHostname = cfg.NodeHostname; })`.
  The three enrichers (`SiteId`, `NodeRole`, `NodeHostname`) are now provided by the shared
  `AddZbSerilog` path; `LoggerConfigurationFactory` can be deleted.
- `ReadFrom.Configuration` for sinks and `MinimumLevel.Is` override from config are preserved
  inside `AddZbSerilog` — behavior is unchanged.
- The `TraceContextEnricher` is wired automatically by `AddZbSerilog`; once the application defines
  an `ActivitySource` and starts spans, `trace_id` / `span_id` will appear on log lines emitted
  during those spans. The metric instruments added above do not create spans by themselves.

**Keep bespoke:**

- `LoggingOptions.cs` — the `MinimumLevel` typed option and its config path
  (`ScadaBridge:Logging:MinimumLevel`) remain; `AddZbSerilog` must accept the minimum-level
  override from configuration. The config path stays ScadaBridge's own.
- Console output template including `[{NodeRole}/{NodeHostname}]` — driven by `appsettings.json`;
  no change.
- Akka actor-context log fields — per-operation context emitted by Akka infrastructure; not an
  enricher concern.
- `ZB.MOM.WW.ScadaBridge.Host.csproj` package set otherwise — no other changes to the project file.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry`
library build. Adding instruments and adopting `AddZbSerilog`/`AddZbTelemetry` lands in the
ScadaBridge repo as a separate commit once the nupkg is available.
@@ -0,0 +1,248 @@
# Proposed shared library: `ZB.MOM.WW.Telemetry`

A contract on paper — the public surface to extract so the three projects stop implementing
observability separately. Realizes [`../spec/SPEC.md`](../spec/SPEC.md) and
[`../spec/METRIC-CONVENTIONS.md`](../spec/METRIC-CONVENTIONS.md). **Not yet created.**
Reference implementations already exist: OtOpcUa `ObservabilityExtensions.cs` (OTel + Serilog),
ScadaBridge `LoggerConfigurationFactory.cs` (Serilog enrichers), MxGateway
`GatewayMetrics.cs` + `GatewayLogRedactor.cs`.

## Packages (.NET 10)

```
ZB.MOM.WW.Telemetry # OTel bootstrap: Resource, metrics, traces, exporters
ZB.MOM.WW.Telemetry.Serilog # Serilog bootstrap: enrichers, TraceContextEnricher, ILogRedactor
```

Both packages are .NET 10 — all three logging-bearing processes are .NET 10 (OtOpcUa server,
mxaccessgw gateway, ScadaBridge central). The x86 net48 mxaccessgw worker uses a bespoke
`IWorkerLogger` (stderr key=value); net48 multi-targeting is **not** required. Published to
the Gitea NuGet feed; SemVer; lockstep to start.

## Packaging & distribution
|
||||
|
||||
**Two NuGet packages, one DLL each**, on the Gitea NuGet feed. Libraries linked into each
|
||||
app — there is no central telemetry service. Both packages are consumed by all three apps
|
||||
after adoption:
|
||||
|
||||
| Package (→ DLL) | Transitive deps | OtOpcUa | MxGateway | ScadaBridge |
|
||||
|---|---|---|---|---|
|
||||
| `…Telemetry` | OpenTelemetry SDK, `OpenTelemetry.Exporter.Prometheus.AspNetCore`, `OpenTelemetry.Exporter.OpenTelemetryProtocol`, standard instrumentation packages | ✅ | ✅ | ✅ |
|
||||
| `…Telemetry.Serilog` | Serilog, `Serilog.Extensions.Hosting`, `Serilog.AspNetCore` (version note below) | ✅ | ✅ | ✅ |
|
||||
|
||||
> **`Serilog.AspNetCore` version split (open convergence note):** OtOpcUa and ScadaBridge
|
||||
> target .NET 10 and may use `Serilog.AspNetCore` 9.x; MxGateway's adoption starts from
|
||||
> `Serilog.AspNetCore` 9.x as well. If a project remains on .NET 8 ASP.NET Core for any
|
||||
> reason, the compatible version is `Serilog.AspNetCore` 8.x. Coordinate the version floor
|
||||
> when the first app takes a dependency and pin it in `Directory.Packages.props`.
|
||||
|
||||
---
|
||||
|
||||
## `ZB.MOM.WW.Telemetry`
|
||||
|
||||
```csharp
|
||||
namespace ZB.MOM.WW.Telemetry;
|
||||
|
||||
/// Selects how instrumentation data is exported.
|
||||
public enum ZbExporter
|
||||
{
|
||||
/// Prometheus scrape endpoint (default). Call app.MapZbMetrics() to mount /metrics.
|
||||
Prometheus,
|
||||
|
||||
/// OTLP gRPC export. Set OtlpEndpoint (e.g. "http://collector:4317").
|
||||
/// Coexists with Prometheus when both endpoints are desired.
|
||||
Otlp,
|
||||
}
|
||||
|
||||
/// Options for AddZbTelemetry. All properties feed the shared OTel Resource and
|
||||
/// Serilog enrichers (via AddZbSerilog in the .Serilog package).
|
||||
public sealed class ZbTelemetryOptions
|
||||
{
|
||||
/// Required. Short lower-case app identifier — e.g. "otopcua", "mxgateway", "scadabridge".
|
||||
/// Populates OTel Resource service.name.
|
||||
public string ServiceName { get; set; } = "";
|
||||
|
||||
/// Fleet-wide namespace. Default "ZB.MOM.WW". Do not override per-app.
|
||||
/// Populates OTel Resource service.namespace.
|
||||
public string ServiceNamespace { get; set; } = "ZB.MOM.WW";
|
||||
|
||||
/// Optional. Populate from AssemblyInformationalVersion.
|
||||
/// Populates OTel Resource service.version.
|
||||
public string? ServiceVersion { get; set; }
|
||||
|
||||
/// Optional. Physical or logical site identifier.
|
||||
/// Populates OTel Resource site.id and Serilog property SiteId.
|
||||
public string? SiteId { get; set; }
|
||||
|
||||
/// Optional. Node function: "central", "site", "hub", "standalone".
|
||||
/// Populates OTel Resource node.role and Serilog property NodeRole.
|
||||
public string? NodeRole { get; set; }
|
||||
|
||||
/// App-specific Meter names to register with the OTel MeterProvider.
|
||||
/// Always register the app's primary Meter here. Standard instrumentation meters are
|
||||
/// added automatically (ASP.NET Core, HttpClient, runtime, process).
|
||||
public string[] Meters { get; set; } = [];
|
||||
|
||||
/// App-specific ActivitySource names to register with the OTel TracerProvider.
|
||||
public string[] ActivitySources { get; set; } = [];
|
||||
|
||||
/// Export path. Default Prometheus; use Otlp for a real collector.
|
||||
public ZbExporter Exporter { get; set; } = ZbExporter.Prometheus;
|
||||
|
||||
/// Required when Exporter = ZbExporter.Otlp.
|
||||
/// OTLP gRPC endpoint, e.g. "http://collector:4317".
|
||||
public string? OtlpEndpoint { get; set; }
|
||||
}
|
||||
|
||||
/// Extension point for configuring the OTel bootstrap on an IHostApplicationBuilder.
|
||||
public static class ZbTelemetryExtensions
|
||||
{
|
||||
/// Configures the OpenTelemetry MeterProvider and TracerProvider with the shared Resource,
|
||||
/// standard instrumentation (ASP.NET Core, HttpClient, gRPC client, runtime, process),
|
||||
/// the app's own Meters and ActivitySources, and the selected exporter.
|
||||
/// Does NOT configure Serilog — call AddZbSerilog() in the .Serilog package for that.
|
||||
public static IHostApplicationBuilder AddZbTelemetry(
|
||||
this IHostApplicationBuilder builder,
|
||||
Action<ZbTelemetryOptions> configure);
|
||||
|
||||
/// IServiceCollection overload for contexts where IHostApplicationBuilder is not available.
|
||||
/// Requires the caller to supply a pre-built ZbTelemetryOptions (Resource attributes must
|
||||
/// be populated before DI composition, so this overload takes an options object rather than a delegate).
|
||||
public static IServiceCollection AddZbTelemetry(
|
||||
this IServiceCollection services,
|
||||
ZbTelemetryOptions options);
|
||||
}
|
||||
|
||||
/// Builds the shared OTel ResourceBuilder from ZbTelemetryOptions.
|
||||
/// Used internally by AddZbTelemetry. Exposed for tests and custom pipelines.
|
||||
public static class ZbResource
|
||||
{
|
||||
/// Returns a ResourceBuilder pre-populated with service.name, service.namespace,
|
||||
/// service.version, site.id, node.role, and host.name (always Environment.MachineName).
|
||||
/// Attributes with null values are omitted from the Resource.
|
||||
public static ResourceBuilder Build(ZbTelemetryOptions options);
|
||||
}
|
||||
|
||||
/// Endpoint extension for mounting the Prometheus /metrics scrape endpoint.
|
||||
public static class ZbMetricsEndpointExtensions
|
||||
{
|
||||
/// Mounts the Prometheus /metrics endpoint.
|
||||
/// Only valid when ZbTelemetryOptions.Exporter = ZbExporter.Prometheus (or both).
|
||||
/// Call after app.UseRouting().
|
||||
public static IEndpointConventionBuilder MapZbMetrics(
|
||||
this IEndpointRouteBuilder endpoints);
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## `ZB.MOM.WW.Telemetry.Serilog`
|
||||
|
||||
```csharp
|
||||
namespace ZB.MOM.WW.Telemetry.Serilog;
|
||||
|
||||
/// Extension point for configuring the Serilog two-stage bootstrap on an IHostApplicationBuilder.
|
||||
public static class ZbSerilogExtensions
|
||||
{
|
||||
/// Two-stage Serilog bootstrap:
|
||||
/// Stage 1 — minimal console-only bootstrap logger (for startup errors before IConfiguration).
|
||||
/// Stage 2 — application logger wired from IConfiguration (ReadFrom.Configuration reads
|
||||
/// Serilog:WriteTo sinks + Serilog:MinimumLevel overrides) with fixed enrichers:
|
||||
/// SiteId, NodeRole, NodeHostname (from ZbTelemetryOptions), TraceContextEnricher,
|
||||
/// and RedactionEnricher (applied only when ILogRedactor is registered).
|
||||
///
|
||||
/// OTel log export is wired automatically: logs flow through the OTel pipeline with the same
|
||||
/// Resource as the metrics and traces (all three signals correlated in a backend).
|
||||
///
|
||||
/// The configure delegate receives the same ZbTelemetryOptions used by AddZbTelemetry.
|
||||
/// Typically share a single options-population lambda across both calls.
|
||||
public static IHostApplicationBuilder AddZbSerilog(
|
||||
this IHostApplicationBuilder builder,
|
||||
Action<ZbTelemetryOptions> configure);
|
||||
}
|
||||
|
||||
/// Canonical Serilog property name constants for the identity enrichers.
|
||||
/// Use these constants — not literal strings — when querying properties in sinks or tests.
|
||||
public static class ZbLogEnricherNames
|
||||
{
|
||||
/// Serilog property: physical or logical site identifier. Matches OTel Resource site.id.
|
||||
public const string SiteId = "SiteId";
|
||||
|
||||
/// Serilog property: node function (central, site, hub, standalone). Matches OTel node.role.
|
||||
public const string NodeRole = "NodeRole";
|
||||
|
||||
/// Serilog property: machine name (Environment.MachineName). Matches OTel host.name.
|
||||
public const string NodeHostname = "NodeHostname";
|
||||
}
|
||||
|
||||
/// Stamps trace_id and span_id from Activity.Current onto every Serilog log event.
|
||||
/// When Activity.Current is null (no active span — background services, startup, non-traced paths)
|
||||
/// the enricher emits nothing; it does NOT inject empty strings or zero values.
|
||||
/// This enables a log line to be clicked through to its originating trace in a backend.
|
||||
public sealed class TraceContextEnricher : ILogEventEnricher
|
||||
{
|
||||
public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory);
|
||||
}
|
||||
|
||||
/// Seam for project-specific log-event redaction.
|
||||
/// The shared library applies this via RedactionEnricher; each project provides its own
|
||||
/// implementation that knows which fields (by property name) or which command payloads
|
||||
/// must not leave the process in log events.
|
||||
/// If no ILogRedactor is registered in DI, RedactionEnricher is a no-op.
|
||||
public interface ILogRedactor
|
||||
{
|
||||
/// Inspect and mutate properties in-place. Remove or replace any sensitive values.
|
||||
/// Called on every log event before it reaches any sink.
|
||||
void Redact(IDictionary<string, object?> properties);
|
||||
}
|
||||
|
||||
/// Applies a registered ILogRedactor to every Serilog log event.
|
||||
/// Registered automatically by AddZbSerilog. The enricher resolves ILogRedactor from DI
|
||||
/// on first use; if none is registered it is permanently inert (no DI call per event).
|
||||
public sealed class RedactionEnricher : ILogEventEnricher
|
||||
{
|
||||
public RedactionEnricher(IServiceProvider serviceProvider);
|
||||
public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory);
|
||||
}
|
||||
```
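The null-activity rule documented on `TraceContextEnricher` can be sketched without a Serilog dependency. The helper below is hypothetical (`TraceContext.TryGet` is not part of the proposed contract) and shows only the lookup against `Activity.Current`. It assumes W3C-format activity IDs, which is the default on current .NET.

```csharp
using System.Diagnostics;

/// Hypothetical helper: the lookup a TraceContextEnricher would perform per event.
public static class TraceContext
{
    /// Returns false (and null ids) when there is no usable current Activity,
    /// so the caller adds no properties at all instead of empty or zero values.
    public static bool TryGet(out string? traceId, out string? spanId)
    {
        Activity? activity = Activity.Current;

        // Non-W3C (hierarchical) activities carry an all-zero TraceId; skip them too.
        if (activity is null || activity.IdFormat != ActivityIdFormat.W3C)
        {
            traceId = null;
            spanId = null;
            return false;
        }

        traceId = activity.TraceId.ToHexString(); // 32 lower-case hex characters
        spanId = activity.SpanId.ToHexString();   // 16 lower-case hex characters
        return true;
    }
}
```

Inside `Enrich`, an implementation would presumably call this helper and, only when it returns `true`, attach the values with Serilog's `logEvent.AddPropertyIfAbsent(propertyFactory.CreateProperty("trace_id", traceId))` and the equivalent call for `span_id`.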
|
||||
|
||||
---
|
||||
|
||||
## Consumer matrix
|
||||
|
||||
| Consumer | Packages | Notes |
|
||||
|---|---|---|
|
||||
| **MxGateway** | Both | MEL → Serilog migration: `GatewayLogScope`/`BeginScope` → `LogContext.PushProperty`; `GatewayLogRedactor` → `ILogRedactor` impl; `GatewayMetrics` stays, wired through `o.Meters`. **Done in this release.** |
|
||||
| **OtOpcUa** | Both | Consolidate existing Serilog bootstrap; add `TraceContextEnricher` + `SiteId`/`NodeRole` enrichers; add Resource to existing OTel pipeline. Deferred to GAPS backlog. |
|
||||
| **ScadaBridge** | Both | Add full OTel SDK (metrics + traces + export); consolidate `LoggerConfigurationFactory`; add `TraceContextEnricher`. Deferred to GAPS backlog. |
|
||||
|
||||
The net48 x86 mxaccessgw worker is excluded from both packages. Its `IWorkerLogger`
|
||||
(stderr key=value format) is an out-of-process concern and remains bespoke.
|
||||
|
||||
---
|
||||
|
||||
## Open contract questions
|
||||
|
||||
1. **`IServiceCollection` overload completeness:** the `IHostApplicationBuilder`-based
|
||||
overload is the primary path (available in all three apps on .NET 10). The
|
||||
`IServiceCollection` overload is a fallback for unusual host configurations. Validate
|
||||
that both overloads wire OTel log export identically (same Resource, same enrichers).
|
||||
|
||||
2. **OTel log export channel:** `AddZbSerilog` uses `Serilog.Sinks.OpenTelemetry` to push
|
||||
logs into the OTel pipeline (sharing the Resource). Confirm the sink version is
|
||||
compatible with the OpenTelemetry SDK version pinned in `ZB.MOM.WW.Telemetry`
|
||||
(`Directory.Packages.props`).
|
||||
|
||||
3. **`RedactionEnricher` DI timing:** `RedactionEnricher` resolves `ILogRedactor` from
|
||||
`IServiceProvider` on first use (lazy, to avoid a circular-DI problem during Serilog's
|
||||
two-stage bootstrap). Validate that the service provider is fully built by the time the
|
||||
first post-startup log event fires. If MxGateway's `GatewayLogRedactor` has dependencies
|
||||
that are not available at stage-1 bootstrap time, the lazy-resolve pattern protects it.
|
||||
|
||||
4. **`SiteId` / `NodeRole` null handling:** `AddZbTelemetry` and `AddZbSerilog` silently
|
||||
omit null `SiteId`/`NodeRole` from the Resource and enricher set. Confirm this is the
|
||||
correct behavior for OtOpcUa, which may run in a single-site configuration where neither
|
||||
field is meaningful, versus ScadaBridge, where `SiteId` is essential for multi-cluster
|
||||
fleet visibility.
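Question 3 is easier to reason about with the lazy-resolve pattern written out. This is a sketch under assumptions, not the library's implementation: `LazyRedactor` and `CountingProvider` are hypothetical names, and the interface is repeated only so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Same shape as the contract's ILogRedactor, repeated for self-containment.
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

/// Hypothetical core of RedactionEnricher: resolves ILogRedactor once, on first use.
public sealed class LazyRedactor
{
    private readonly Lazy<ILogRedactor?> _redactor;

    public LazyRedactor(IServiceProvider serviceProvider)
    {
        // GetService returns null when nothing is registered. Lazy<T> caches that
        // null, so later events never touch the container again.
        _redactor = new Lazy<ILogRedactor?>(
            () => (ILogRedactor?)serviceProvider.GetService(typeof(ILogRedactor)),
            LazyThreadSafetyMode.ExecutionAndPublication);
    }

    /// No-op when no redactor is registered.
    public void Apply(IDictionary<string, object?> properties) =>
        _redactor.Value?.Redact(properties);
}

/// Minimal provider used only to exercise the sketch; it counts lookups.
public sealed class CountingProvider : IServiceProvider
{
    private readonly ILogRedactor? _redactor;

    public CountingProvider(ILogRedactor? redactor) => _redactor = redactor;

    public int Lookups { get; private set; }

    public object? GetService(Type serviceType)
    {
        Lookups++;
        return serviceType == typeof(ILogRedactor) ? _redactor : null;
    }
}
```

One caveat that bears on the timing question: in this mode `Lazy<T>` also caches an exception thrown by the factory, so if the first event fires while the provider is not yet usable, every later event would rethrow. A real implementation may want to catch and retry instead.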
|
||||
|
||||
See [`../GAPS.md`](../GAPS.md) for the adoption order and effort/risk.
|
||||
@@ -0,0 +1,224 @@
|
||||
# Observability — Metric conventions (standardized)
|
||||
|
||||
Status: **Standardized**. The naming and unit rules every sister project's instruments must
|
||||
follow. Analogous to [`../../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md)
|
||||
for auth and [`../../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md)
|
||||
for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).
|
||||
|
||||
The per-project instrument tables below (§6) document the **existing bespoke surface**: the
|
||||
instruments each app currently defines or intends to define. These stay per-project; they are
|
||||
not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must
|
||||
be named and measured.
|
||||
|
||||
---
|
||||
|
||||
## 1. Meter name
|
||||
|
||||
Each app owns exactly **one primary Meter**, named after its root namespace:
|
||||
|
||||
| App | Meter name | Status |
|
||||
|---|---|---|
|
||||
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
|
||||
| MxGateway | `MxGateway.Server` | ⚠ Convergence target — rename to `ZB.MOM.WW.MxGateway` on adoption |
|
||||
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |
|
||||
|
||||
`MxGateway.Server` is the single convergence item for meter naming. It predates the
|
||||
`ZB.MOM.WW.*` namespace convention; rename when adopting `AddZbTelemetry`. Instruments
|
||||
emitted under the old name will require a `recording_rule` or relabel in any Prometheus
|
||||
config that already scrapes the snapshot — coordinate before renaming in production.
|
||||
|
||||
If an app has secondary meters (e.g. a library component with its own meter), those follow
|
||||
the same pattern: `ZB.MOM.WW.<App>.<Component>`.
|
||||
|
||||
---
|
||||
|
||||
## 2. Instrument name
|
||||
|
||||
Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case,
|
||||
dot-separated:
|
||||
|
||||
```
|
||||
<app> := short app identifier — otopcua | mxgateway | scadabridge
|
||||
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
|
||||
<event> := what happened or is measured — applied | count | duration | errors | active | ...
|
||||
```
|
||||
|
||||
**Examples:**
|
||||
|
||||
| Instrument name | App | Meaning |
|
||||
|---|---|---|
|
||||
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
|
||||
| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions |
|
||||
| `mxgateway.session.active` | MxGateway | Active MxAccess sessions |
|
||||
| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker |
|
||||
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |
|
||||
|
||||
**Rules:**
|
||||
|
||||
1. All lower-case. No camelCase, no PascalCase, no hyphens.
|
||||
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
|
||||
subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`).
|
||||
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
|
||||
`duration`), not implementation details (`method_called`, `loop_iteration`).
|
||||
4. Counters: past-tense or noun (`received`, `errors`, `applied`).
|
||||
UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`).
|
||||
Histograms: `duration` or a measured quantity noun (`size`, `lag`).
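Rules 1 and 2 are mechanical enough to check in a unit test. The validator below is a hypothetical helper, not part of the proposed library. It assumes that underscores inside a segment are acceptable (the rules do not forbid them) and that names never exceed four segments.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

/// Hypothetical helper: checks an instrument name against rules 1 and 2.
public static class InstrumentNameRules
{
    // Short app identifiers from the <app> segment definition.
    private static readonly string[] Apps = { "otopcua", "mxgateway", "scadabridge" };

    // One segment: lower-case ASCII letter first, then letters, digits, or underscores.
    private static readonly Regex Segment =
        new Regex(@"^[a-z][a-z0-9_]*\z", RegexOptions.CultureInvariant);

    public static bool IsValid(string name)
    {
        if (string.IsNullOrEmpty(name)) return false;

        string[] parts = name.Split('.');

        // Rule 2: three segments minimum, four permitted (upper bound is an assumption).
        if (parts.Length < 3 || parts.Length > 4) return false;

        // Rule 1: all lower-case, no hyphens, no camelCase or PascalCase.
        if (!parts.All(p => Segment.IsMatch(p))) return false;

        // The first segment must be a known app identifier.
        return Apps.Contains(parts[0]);
    }
}
```

Rules 3 and 4 are about meaning rather than shape, so they stay a review concern.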
|
||||
|
||||
---
|
||||
|
||||
## 3. Units
|
||||
|
||||
### Duration — seconds (mandatory)
|
||||
|
||||
**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic
|
||||
convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks
|
||||
aggregations across apps.
|
||||
|
||||
> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
|
||||
> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated
|
||||
> to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site).
|
||||
> Track existing Prometheus `recording_rule`/dashboard changes — any dashboard panel that
|
||||
> reads these histograms in `ms` will need updating. Until migration is complete, annotate
|
||||
> the instruments with `// CONVERGENCE: ms→s pending`.
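As an illustration of the ms→s migration, the sketch below declares one histogram with unit `"s"` and records elapsed time in seconds. It is not the actual `GatewayMetrics.cs` code: the class name `DurationInSeconds`, the description string, and the `RecordCommand` wrapper are assumptions. Only `System.Diagnostics.Metrics` and `Stopwatch` APIs are used.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

/// Hypothetical sketch of a duration histogram that follows the seconds rule.
public static class DurationInSeconds
{
    // Meter and instrument names use the convergence targets from this document.
    private static readonly Meter Meter = new Meter("ZB.MOM.WW.MxGateway");

    private static readonly Histogram<double> CommandDuration =
        Meter.CreateHistogram<double>(
            "mxgateway.command.duration",
            unit: "s",
            description: "MxAccess command duration");

    /// The divide-by-1000 step for call sites that still hold a millisecond value.
    public static double ToSeconds(double milliseconds) => milliseconds / 1000.0;

    /// Times an action and records the elapsed time in seconds.
    public static double RecordCommand(Action command)
    {
        long start = Stopwatch.GetTimestamp();
        command();

        // TotalSeconds avoids a manual unit conversion at the call site.
        double seconds = Stopwatch.GetElapsedTime(start).TotalSeconds;
        CommandDuration.Record(seconds);
        return seconds;
    }
}
```

Measuring with `TimeSpan.TotalSeconds` from the start is probably less error-prone than dividing a stored millisecond value, since it removes the conversion constant from the call site.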
|
||||
|
||||
### Other units
|
||||
|
||||
| Quantity | Unit string | Notes |
|
||||
|---|---|---|
|
||||
| Duration | `"s"` | Mandatory — see above |
|
||||
| Size / bytes | `"By"` | UCUM bytes |
|
||||
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
|
||||
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |
|
||||
|
||||
---
|
||||
|
||||
## 4. Resource attribute set (shared across all three signals)
|
||||
|
||||
The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and
|
||||
attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values
|
||||
populate Serilog enrichers, making a metric, a span, and a log line from the same node
|
||||
joinable in any OTel-compatible backend.
|
||||
|
||||
| OTel attribute | Type | Required | Notes |
|
||||
|---|---|---|---|
|
||||
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
|
||||
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"` — do not override |
|
||||
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
|
||||
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
|
||||
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
|
||||
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |
|
||||
|
||||
**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one
|
||||
central cluster, each on different hosts. Without `site.id` and `node.role`, metrics from a
|
||||
site node and the central node cannot be grouped or filtered by site or role; `host.name` identifies only the machine.
|
||||
|
||||
---
|
||||
|
||||
## 5. Standard instrumentation baseline
|
||||
|
||||
Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are community-
|
||||
standard instrumentation packages; the overhead is negligible and the benefit (correlated
|
||||
HTTP / gRPC request traces across the fleet) is high.
|
||||
|
||||
| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|
||||
|---|---|---|---|---|
|
||||
| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — |
|
||||
| HttpClient | Traces + Metrics | ✅ | ✅ | — |
|
||||
| gRPC client | Traces | ✅ | — | — |
|
||||
| .NET runtime | Metrics | ✅ | — | — |
|
||||
| Process | Metrics | ✅ | — | — |
|
||||
|
||||
OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through
|
||||
`AddZbTelemetry`. No project removes any of these.
|
||||
|
||||
---
|
||||
|
||||
## 6. Per-app instrument surface (bespoke — stays per project)
|
||||
|
||||
These instruments are **not part of the shared library**. They document the existing bespoke
|
||||
surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`.
|
||||
|
||||
### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter
|
||||
|
||||
Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
|
||||
|
||||
| Instrument | Kind | Unit | Description |
|
||||
|---|---|---|---|
|
||||
| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space |
|
||||
| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing |
|
||||
| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions |
|
||||
| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations |
|
||||
| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations |
|
||||
| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions |
|
||||
| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway |
|
||||
|
||||
**ActivitySources (spans):**
|
||||
|
||||
| Source name | Span(s) |
|
||||
|---|---|
|
||||
| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` |
|
||||
|
||||
None of these instruments records a duration, so OtOpcUa has no unit convergence item.
|
||||
|
||||
### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)
|
||||
|
||||
Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
|
||||
|
||||
**Counters (13):**
|
||||
|
||||
| Instrument | Unit | Description |
|
||||
|---|---|---|
|
||||
| `mxgateway.session.created` | `"1"` | MxAccess sessions opened |
|
||||
| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed |
|
||||
| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors |
|
||||
| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations |
|
||||
| `mxgateway.command.errors` | `"1"` | Command invocation errors |
|
||||
| `mxgateway.event.received` | `"1"` | MxAccess events received from worker |
|
||||
| `mxgateway.event.errors` | `"1"` | Event processing errors |
|
||||
| `mxgateway.worker.started` | `"1"` | x86 worker processes started |
|
||||
| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped |
|
||||
| `mxgateway.worker.errors` | `"1"` | Worker communication errors |
|
||||
| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs |
|
||||
| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors |
|
||||
| `mxgateway.auth.failures` | `"1"` | Authentication failures |
|
||||
|
||||
**Histograms (3):**
|
||||
|
||||
| Instrument | Target unit | Current unit | Convergence |
|
||||
|---|---|---|---|
|
||||
| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
|
||||
| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
|
||||
| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
|
||||
|
||||
**Gauges (4):**
|
||||
|
||||
| Instrument | Unit | Description |
|
||||
|---|---|---|
|
||||
| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions |
|
||||
| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes |
|
||||
| `mxgateway.worker.memory` | `"By"` | Worker process RSS |
|
||||
| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache |
|
||||
|
||||
No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource
|
||||
is left per-project (deferred to GAPS backlog).
|
||||
|
||||
### 6.3 ScadaBridge — `ZB.MOM.WW.ScadaBridge` meter
|
||||
|
||||
No meter or instruments exist today (`OpenTelemetry.Api` is a dangling ref). The target
|
||||
meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the
|
||||
ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).
|
||||
|
||||
---
|
||||
|
||||
## Consequences and convergence items (accepted)
|
||||
|
||||
| Item | Scope | Severity |
|
||||
|---|---|---|
|
||||
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking — requires relabeling in Prometheus config and dashboards |
|
||||
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking — values change by factor 1 000; dashboards need updating |
|
||||
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge — define from scratch |
|
||||
|
||||
All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s
|
||||
migration is the highest-priority convergence item because leaving it unresolved means
|
||||
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
|
||||
workspace.
|
||||
@@ -0,0 +1,177 @@
|
||||
# Observability — normalized target spec
|
||||
|
||||
Status: **Draft**. The single design the sister projects converge on. Derived from the
|
||||
three code-verified current-state docs (`../current-state/`). Goal is *path to shared code*
|
||||
(`../shared-contract/ZB.MOM.WW.Telemetry.md`), so each normalized section maps to a shared
|
||||
library seam.
|
||||
|
||||
## 0. Scope
|
||||
|
||||
**Normalized here:** one OpenTelemetry bootstrap across all three signals (metrics + traces +
|
||||
logs) via a single `AddZbTelemetry` extension; the shared `Resource` attribute set
|
||||
(`service.name` / `service.namespace` / `service.version` / `site.id` / `node.role` /
|
||||
`host.name`) that makes every node distinguishable in a collector; standard instrumentation
|
||||
everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter
|
||||
conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap
|
||||
with identity enrichers (`SiteId`, `NodeRole`, `NodeHostname`) bound from the same options
|
||||
object as the OTel Resource (metrics and logs therefore carry identical dimensions); a
|
||||
`TraceContextEnricher` that stamps `trace_id`/`span_id` from `Activity.Current` onto every
|
||||
Serilog event, enabling log↔trace correlation; an `ILogRedactor` redaction seam.
|
||||
|
||||
**Explicitly NOT normalized** (domain-specific — keep per project): each app's actual
|
||||
instruments — `otopcua.*` meters and spans, `mxgateway.*` counters/histograms/gauges — they
|
||||
are registered *through* the shared bootstrap but their names and semantics remain
|
||||
bespoke (see [`METRIC-CONVENTIONS.md`](METRIC-CONVENTIONS.md) §6); the redaction *policy*
|
||||
(which field names, which command types) — only the `ILogRedactor` seam is shared, each
|
||||
project supplies its own implementation; the MxGateway net48 x86 worker's `IWorkerLogger`
|
||||
(stderr key=value format, out-of-process, out of scope).
|
||||
|
||||
## 1. OpenTelemetry pipeline — `AddZbTelemetry`
|
||||
|
||||
A single `IHostApplicationBuilder` extension is the front door for all three OTel signals.
|
||||
It wires the shared `Resource`, registers standard instrumentation, and configures the
|
||||
selected exporter:
|
||||
|
||||
```csharp
|
||||
builder.AddZbTelemetry(o =>
|
||||
{
|
||||
o.ServiceName = "mxgateway"; // populates Resource service.name
|
||||
o.ServiceNamespace = "ZB.MOM.WW"; // constant across the fleet (default)
|
||||
o.ServiceVersion = "1.0.0"; // illustrative literal; populate from AssemblyInformationalVersion
|
||||
o.SiteId = cfg.SiteId; // Resource site.id + Serilog SiteId property
|
||||
o.NodeRole = cfg.NodeRole; // Resource node.role + Serilog NodeRole property
|
||||
o.Meters = ["MxGateway.Server"]; // app's own Meter name(s)
|
||||
o.ActivitySources = ["MxGateway.Server"]; // app's own ActivitySource name(s)
|
||||
o.Exporter = ZbExporter.Prometheus; // default; ZbExporter.Otlp opt-in
|
||||
// o.OtlpEndpoint = "http://collector:4317"; // required when Exporter = Otlp
|
||||
});
|
||||
|
||||
app.MapZbMetrics(); // mounts Prometheus /metrics scrape endpoint
|
||||
```
|
||||
|
||||
This is the headline fix: nobody in the fleet sets a `Resource` or `service.name` today,
|
||||
making every node indistinguishable in a collector. Every project must call `AddZbTelemetry`
|
||||
to be observable.
|
||||
|
||||
## 2. Shared Resource
|
||||
|
||||
The OTel `Resource` attached to all three signals is built from `ZbTelemetryOptions`:
|
||||
|
||||
| OTel attribute | Options property | Notes |
|
||||
|---|---|---|
|
||||
| `service.name` | `ServiceName` | Required. Lower-case short identifier (`otopcua`, `mxgateway`, `scadabridge`) |
|
||||
| `service.namespace` | `ServiceNamespace` | Default `"ZB.MOM.WW"` — constant across the fleet |
|
||||
| `service.version` | `ServiceVersion` | Optional; recommend populating from `AssemblyInformationalVersion` |
|
||||
| `site.id` | `SiteId` | Optional; identifies the physical/logical site |
|
||||
| `node.role` | `NodeRole` | Optional; e.g. `"central"`, `"site"`, `"hub"` |
|
||||
| `host.name` | _(auto)_ | Always populated from `Environment.MachineName` |
|
||||
|
||||
The same `SiteId` and `NodeRole` values are passed to the Serilog enrichers (§5) so a
|
||||
metric, a span, and a log line from the same node carry identical dimensions and join up in
|
||||
any OTel-compatible backend.
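The mapping in the table above, including the rule that null optional values are left out, might be sketched as follows. `ResourceIdentity` is a hypothetical stand-in for the identity subset of `ZbTelemetryOptions`, and `ResourceAttributes.Build` is an illustrative name rather than the contract's `ZbResource.Build`.

```csharp
using System;
using System.Collections.Generic;

/// Hypothetical stand-in for the identity properties of ZbTelemetryOptions.
public sealed record ResourceIdentity(
    string ServiceName,
    string ServiceNamespace = "ZB.MOM.WW",
    string? ServiceVersion = null,
    string? SiteId = null,
    string? NodeRole = null);

public static class ResourceAttributes
{
    /// Maps identity values to OTel attribute keys; null optional values are omitted.
    public static IReadOnlyDictionary<string, object> Build(ResourceIdentity id)
    {
        if (string.IsNullOrWhiteSpace(id.ServiceName))
            throw new ArgumentException("ServiceName is required.", nameof(id));

        var attributes = new Dictionary<string, object>
        {
            ["service.name"] = id.ServiceName,
            ["service.namespace"] = id.ServiceNamespace,
            ["host.name"] = Environment.MachineName, // always populated automatically
        };

        AddIfPresent(attributes, "service.version", id.ServiceVersion);
        AddIfPresent(attributes, "site.id", id.SiteId);
        AddIfPresent(attributes, "node.role", id.NodeRole);
        return attributes;
    }

    private static void AddIfPresent(Dictionary<string, object> target, string key, string? value)
    {
        if (value is not null) target[key] = value;
    }
}
```

Because the result enumerates as key/value pairs, it could be passed to the OpenTelemetry SDK's `ResourceBuilder.AddAttributes`; whether the shared library builds the Resource this way is an open implementation choice.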
|
||||
|
||||
## 3. Standard instrumentation
|
||||
|
||||
`AddZbTelemetry` enables the following instrumentation for all projects. Any project that
|
||||
already enables a subset gets it consolidated; no project may skip this baseline:
|
||||
|
||||
| Instrumentation | Package | Signal |
|
||||
|---|---|---|
|
||||
| ASP.NET Core | `OpenTelemetry.Instrumentation.AspNetCore` | Traces + Metrics |
|
||||
| HttpClient | `OpenTelemetry.Instrumentation.Http` | Traces + Metrics |
|
||||
| gRPC client | `OpenTelemetry.Instrumentation.GrpcNetClient` | Traces |
|
||||
| .NET runtime | `OpenTelemetry.Instrumentation.Runtime` | Metrics |
|
||||
| Process | `OpenTelemetry.Instrumentation.Process` | Metrics |
|
||||
|
||||
App-specific `Meter` names and `ActivitySource` names are registered via `o.Meters` and
|
||||
`o.ActivitySources`. This is how MxGateway's hand-rolled `GatewayMetrics` finally gets an
|
||||
export path instead of dying in an in-memory `GetSnapshot()`.
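Why the meter name has to be listed in `o.Meters` can be shown with the BCL `MeterListener`, which is the same name-based subscription mechanism the OpenTelemetry SDK builds on. The sketch below is illustrative only; `MeterSubscriptionDemo` and the second meter name are made up for the demonstration.

```csharp
using System;
using System.Diagnostics.Metrics;

/// Hypothetical demo: measurements are seen only for meters subscribed by name.
public static class MeterSubscriptionDemo
{
    /// Adds 1 to a counter on two meters and returns the total the listener observed.
    public static long CountObserved(string subscribedMeterName)
    {
        long observed = 0;

        using var registered = new Meter(subscribedMeterName);
        using var unregistered = new Meter("Some.Other.Meter");
        using var listener = new MeterListener();

        // Subscribe by meter name, much like listing the name in o.Meters.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == subscribedMeterName)
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => observed += value);
        listener.Start();

        registered.CreateCounter<long>("mxgateway.command.invoked").Add(1);
        unregistered.CreateCounter<long>("other.thing.happened").Add(1); // never observed

        return observed;
    }
}
```

An instrument on a meter nobody subscribes to still accepts measurements, but nothing receives them, which matches the current situation of `GatewayMetrics`.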
|
||||
|
||||
## 4. Exporter conventions
|
||||
|
||||
`ZbTelemetryOptions.Exporter` selects the export path:
|
||||
|
||||
| Value | Behaviour |
|
||||
|---|---|
|
||||
| `ZbExporter.Prometheus` | Mounts a Prometheus `/metrics` scrape endpoint via `app.MapZbMetrics()`. Default for all three apps — consistent with OtOpcUa's existing `/metrics`. |
|
||||
| `ZbExporter.Otlp` | Exports to an OTLP endpoint specified by `o.OtlpEndpoint` (gRPC, `http://collector:4317`). Opt-in path to a real OTel Collector; coexists with Prometheus. |
|
||||
|
||||
Both exporters carry the shared `Resource`. OTLP is the path to a real backend (Tempo,
|
||||
Prometheus-remote-write, Loki); Prometheus covers the "scrape from the node" case that all
|
||||
three apps currently use or aspire to.
|
||||
|
||||
## 5. Serilog logging stack

`AddZbSerilog` is a companion extension in the `.Serilog` package. It replaces each
project's bespoke logging bootstrap with a shared two-stage pattern:

**Stage 1 (bootstrap logger):** a minimal `Log.Logger` for startup errors before the
`IConfiguration` is available. Writes to console only.

**Stage 2 (application logger):** reads sinks and overrides from `IConfiguration`
(`ReadFrom.Configuration`) and applies a set of fixed enrichers:

| Enricher | Property name | Source |
|---|---|---|
| `ZbLogEnricherNames.SiteId` | `"SiteId"` | `ZbTelemetryOptions.SiteId` |
| `ZbLogEnricherNames.NodeRole` | `"NodeRole"` | `ZbTelemetryOptions.NodeRole` |
| `ZbLogEnricherNames.NodeHostname` | `"NodeHostname"` | `Environment.MachineName` |
| `TraceContextEnricher` | `trace_id`, `span_id` | `Activity.Current` |
| `RedactionEnricher` | _(project-defined fields)_ | `ILogRedactor` implementation |

The three identity properties (`SiteId`, `NodeRole`, `NodeHostname`) are bound from the
same `ZbTelemetryOptions` object as the OTel `Resource`, so logs and metrics/traces carry
identical dimensions. When no `Activity.Current` is present (e.g. background services,
startup), `TraceContextEnricher` emits nothing — it does not inject empty or zero values.

`MinimumLevel` is set explicitly in code (default `Information`) and can be overridden via
`IConfiguration` (`Serilog:MinimumLevel`). Sinks are fully config-driven:
`ReadFrom.Configuration` reads `Serilog:WriteTo` from `appsettings.json` / environment.

OTel log export is wired in the same call: logs flow through the OTel pipeline with the
same `Resource` attached, making all three signals (metrics / traces / logs) available in a
single backend.

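The "emit nothing when there is no `Activity.Current`" rule for `TraceContextEnricher` can be sketched as follows. This is a simplified stand-in, not the shared package's code: the real enricher would presumably implement Serilog's enricher interface, whereas this version writes into a plain dictionary so that it runs with the BCL alone. The class name `TraceContextSketch` is hypothetical.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceContextSketch
{
    // Adds trace_id / span_id when a W3C-format Activity is current; otherwise
    // leaves the property bag untouched (no empty or all-zero placeholders).
    public static void Enrich(IDictionary<string, object?> properties)
    {
        Activity? activity = Activity.Current;
        if (activity is null || activity.IdFormat != ActivityIdFormat.W3C)
        {
            return;
        }

        properties["trace_id"] = activity.TraceId.ToHexString(); // 32 hex characters
        properties["span_id"] = activity.SpanId.ToHexString();   // 16 hex characters
    }
}
```

Called from a background service with no ambient `Activity`, `Enrich` leaves the dictionary unchanged. Called inside a started `Activity`, it adds both identifiers, which is what lets a log line be joined to its trace in the backend.
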
## 6. Redaction seam — `ILogRedactor`

`ILogRedactor` is a single-method interface that receives the mutable log-event property
dictionary and scrubs any fields that must not leave the process:

```csharp
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}
```

`RedactionEnricher` applies a registered `ILogRedactor` on every log event. The seam is
shared; the **policy** is per-project (which field names, which command types, which
classification levels). MxGateway's existing `GatewayLogRedactor` is the reference
implementation; it migrates to this seam during adoption. If no `ILogRedactor` is
registered, `RedactionEnricher` is a no-op.

This preserves the operational property MxGateway already has (secrets never leave the
process in log events) while making the plumbing reusable.

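As an illustration of what a per-project policy behind this seam might look like, the sketch below masks values by property name. It is a deliberately simple, hypothetical policy: the class name `FieldNameRedactor`, the mask string, and the field names in the usage note are assumptions, and it does not reproduce what `GatewayLogRedactor` actually does. The interface is repeated only so the block is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

// Hypothetical policy: mask the value of any property whose name is on a deny list.
public sealed class FieldNameRedactor : ILogRedactor
{
    private const string Mask = "***";
    private readonly HashSet<string> _sensitive;

    public FieldNameRedactor(IEnumerable<string> sensitiveFieldNames)
    {
        // Case-insensitive match so "Password" and "password" are both caught.
        _sensitive = new HashSet<string>(sensitiveFieldNames, StringComparer.OrdinalIgnoreCase);
    }

    public void Redact(IDictionary<string, object?> properties)
    {
        // Snapshot the matching keys first so the dictionary is not modified
        // while it is being enumerated.
        List<string> keys = properties.Keys.Where(k => _sensitive.Contains(k)).ToList();
        foreach (string key in keys)
        {
            properties[key] = Mask;
        }
    }
}
```

Given a property bag with `Password` and `TagName` and a deny list containing `password`, only the `Password` value is replaced with the mask. Masking rather than removing the key keeps the event shape stable for downstream queries; a project could equally choose to drop the property.
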
## 7. Per-project migration

| Project | Current state | Primary gaps | What normalizes |
|---|---|---|---|
| **OtOpcUa** | Full OTel SDK (`WithMetrics` + `WithTracing`); Prometheus `/metrics`; Serilog bootstrap; 7 instruments + 2 spans. | No `Resource` / `service.name` anywhere; no trace↔log correlation; no `SiteId`/`NodeRole` enrichers. | Call `AddZbTelemetry` (adds Resource; consolidates standard instrumentation); call `AddZbSerilog` (adds `TraceContextEnricher` + identity enrichers); migrate existing Serilog bootstrap to shared two-stage pattern. |
| **MxGateway** | Hand-rolled `GatewayMetrics` (13 counters / 3 histograms `ms` / 4 gauges); in-memory snapshot only — no export; MEL logging with `GatewayLogScope` correlation + `GatewayLogRedactor`; no OTel SDK. | No OTel SDK; no export; `ms` histograms diverge from OTel semconv (`s`); MEL → Serilog migration; no Resource. | Call `AddZbTelemetry` (wires OTel SDK around existing `GatewayMetrics` — finally exports); call `AddZbSerilog` (replaces MEL; re-expresses `GatewayLogScope` as `LogContext.PushProperty`; moves `GatewayLogRedactor` behind `ILogRedactor`). Duration unit convergence (`ms`→`s`) tracked in GAPS. **This is the one adoption done now.** |
| **ScadaBridge** | `OpenTelemetry.Api` ref only (dangling — CVE-patch origin, zero usage); Serilog bootstrap (`LoggerConfigurationFactory`) with `SiteId`/`NodeRole`/`NodeHostname` enrichers. | No OTel SDK; no metrics; no tracing; no export; no trace↔log correlation. ScadaBridge's enricher property names are already the target names — migration is additive. | Call `AddZbTelemetry` (adds OTel SDK + metrics + traces + export); call `AddZbSerilog` (consolidates `LoggerConfigurationFactory`; adds `TraceContextEnricher`). |

> The MxGateway logging migration (`MEL → Serilog`, re-expressing `GatewayLogRedactor`
> behind `ILogRedactor`) is the **only sister-repo touch in scope for this release**. OtOpcUa
> and ScadaBridge adoption is deferred to the follow-on tracked in
> [`../GAPS.md`](../GAPS.md).

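The `ms`→`s` duration convergence noted for MxGateway amounts to declaring the histogram unit as `s` and recording fractional seconds. A minimal sketch under stated assumptions: the class name `DurationRecorder` is hypothetical, the meter and instrument names are supplied by the caller, and nothing here reflects how `GatewayMetrics` is actually structured.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public sealed class DurationRecorder : IDisposable
{
    private readonly Meter _meter;
    private readonly Histogram<double> _duration;

    public DurationRecorder(string meterName, string instrumentName)
    {
        _meter = new Meter(meterName);
        // Unit "s": durations are recorded as fractional seconds, not milliseconds.
        _duration = _meter.CreateHistogram<double>(instrumentName, unit: "s");
    }

    // Records an already-measured duration, converted to seconds.
    public void Record(TimeSpan elapsed) => _duration.Record(elapsed.TotalSeconds);

    // Times a synchronous operation and records its duration even if it throws.
    public void Measure(Action operation)
    {
        long start = Stopwatch.GetTimestamp();
        try
        {
            operation();
        }
        finally
        {
            Record(Stopwatch.GetElapsedTime(start));
        }
    }

    public void Dispose() => _meter.Dispose();
}
```

Recording `TimeSpan.FromMilliseconds(250)` through this class yields a measurement of `0.25` on the histogram. Existing dashboards that assume milliseconds would need their queries adjusted when the unit changes, which is presumably why the convergence is tracked separately.
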
## 8. Acceptance (what "converged" means)

A project is converged when: (a) it calls `builder.AddZbTelemetry(o => ...)` with all
required Resource attributes populated; (b) it calls `app.MapZbMetrics()` (or configures
OTLP); (c) it calls `builder.AddZbSerilog(...)` and the `TraceContextEnricher` stamps
`trace_id`/`span_id` on every log event emitted under an active `Activity`; (d) its
`ILogRedactor` implementation (if applicable) is registered and applied by `RedactionEnricher`;
(e) every node in the fleet is distinguishable by `service.name` + `site.id` + `node.role`
in a collector or log aggregator.