docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract

Author the three normalization docs for the observability component:
- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project),
  AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline,
  exporter conventions, Serilog two-stage bootstrap with identity enrichers and
  TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and
  acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app
  namespace; MxGateway.Server flagged as convergence target), instrument naming pattern
  (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms
  flagged), Resource attribute set table, standard instrumentation baseline, and per-app
  instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms
  / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two
  packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder +
  IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog,
  ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher.
  Consumer matrix and open contract questions included.
Joseph Doherty
2026-06-01 07:19:38 -04:00
parent 76295695ee
commit 7d243890ed
6 changed files with 1149 additions and 0 deletions
@@ -0,0 +1,191 @@
# Observability — current state: MxAccessGateway
Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (**x86**);
solution `src/MxGateway.sln`. Telemetry code is concentrated in
`src/ZB.MOM.WW.MxGateway.Server/Metrics/` (instruments) and
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/` (logging correlation + redaction).
All paths relative to repo root. Verified 2026-06-01.
The most unusual observability posture in the family: **13 counters, 3 histograms, and 4 observable
gauges** all fully hand-rolled using `System.Diagnostics.Metrics` directly — but **never exported**
(no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory
`GetSnapshot()`. Logging is `Microsoft.Extensions.Logging` exclusively (no Serilog), with a bespoke
correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of
scope — its `IWorkerLogger` (stderr key=value) is not addressed here.
## 1. Metrics (hand-rolled, unexported)
### `GatewayMetrics.cs`
`src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`:
Meter name: `"MxGateway.Server"` (does not follow the project namespace `ZB.MOM.WW.MxGateway`).
All instruments are instance members of `GatewayMetrics`. The class is registered as a **singleton**
at `GatewayApplication.cs:62`. There is **no `OpenTelemetry.Extensions.Hosting`**,
**no `AddOpenTelemetry()` call**, and **no exporter** — the `Meter` is created with
`new Meter("MxGateway.Server")` and `GetSnapshot()` is the only read path.
**Counters (13):**
| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.opened` | New session requests |
| `mxgateway.sessions.closed` | Sessions torn down |
| `mxgateway.commands.started` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | Command completed OK |
| `mxgateway.commands.failed` | Command error |
| `mxgateway.events.received` | MXAccess events from worker |
| `mxgateway.queues.overflows` | Queue overflow (backpressure) |
| `mxgateway.faults` | Unhandled gateway faults |
| `mxgateway.workers.killed` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | Retry attempts (any subsystem) |
**Histograms (3) — unit `ms` (diverges from OTel semconv `s`):**
| Instrument name | Tracks |
|---|---|
| `mxgateway.workers.startup.duration` | Time from worker spawn to ready |
| `mxgateway.commands.duration` | End-to-end MXAccess command latency |
| `mxgateway.events.stream_send.duration` | gRPC event stream send latency |
**Observable gauges (4):**
| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.open` | Currently open sessions (live count) |
| `mxgateway.workers.running` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | Per-stream gRPC send queue depth |
All 20 instruments share the `mxgateway.*` prefix and `<category>.<event>` naming — consistent
with the family convention. Duration histograms record in **milliseconds** (`ms`); OTel semantic
conventions require seconds (`s`). This is the only project with `ms` histograms.
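A minimal sketch of what "hand-rolled and unexported" looks like in practice. It assumes nothing about the real `GatewayMetrics` class beyond the meter name and two instrument names from the tables above. The `totals` dictionary and the listener wiring are illustrative stand-ins, since the source does not show how `GetSnapshot()` is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Stand-in for GatewayMetrics: same meter name, two instrument names from the tables above.
using var meter = new Meter("MxGateway.Server");
Counter<long> commandsStarted = meter.CreateCounter<long>("mxgateway.commands.started");
Histogram<double> commandDuration =
    meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "ms");

// Hypothetical in-process read path: a MeterListener sums measurements per instrument.
var totals = new Dictionary<string, double>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments that belong to the gateway meter.
    if (instrument.Meter.Name == "MxGateway.Server")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
    totals[instrument.Name] = totals.GetValueOrDefault(instrument.Name) + value);
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    totals[instrument.Name] = totals.GetValueOrDefault(instrument.Name) + value);
listener.Start();

commandsStarted.Add(1);
commandsStarted.Add(1);
commandDuration.Record(12.5); // milliseconds, matching the current unit

foreach (var (name, total) in totals)
    Console.WriteLine($"{name} = {total}");
```

Nothing in this sketch leaves the process. Unless an OTel SDK `MeterProvider` or another listener subscribes to the meter, the measurements are visible only to the in-memory listener.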
### Singleton wiring
`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:
- `:62`: `services.AddSingleton<GatewayMetrics>()` registers the metrics singleton.
There is no `AddOpenTelemetry()` call anywhere in the gateway. The `GatewayMetrics` `Meter` is
created independently of any OTel SDK — it participates in `MeterListener` / `GetSnapshot()` only.
Without the OTel SDK, this data is **invisible to Prometheus, OTLP, or any backend**.
### No tracing
No `ActivitySource` is defined. No spans are created. Tracing is entirely absent.
## 2. Logging (Microsoft.Extensions.Logging)
All logging in the gateway server uses `Microsoft.Extensions.Logging` (MEL) exclusively. There is
no Serilog dependency. Sink configuration lives in `appsettings.json` (Console, with structured
logging via the default host builder).
### Correlation scope
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs`:
Defines the per-request/per-session correlation property bag.
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs`:
- `:22-41`: `UseGatewayRequestLogging()` middleware reads the following HTTP headers from each
incoming request: `x-session-id`, `x-worker-process-id`, `x-correlation-id`, `x-command-method`,
`authorization` (for redaction, not logging).
- Registered at `GatewayApplication.cs:34`.
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs`:
- `:11-18`: `BeginGatewayScope(ILogger, GatewayLogScope)` calls `logger.BeginScope(scope)`,
MEL's `ILogger.BeginScope` mechanism, which pushes properties as a scoped dictionary.
The correlation tuple (`SessionId` / `WorkerProcessId` / `CorrelationId` / `CommandMethod`) is
injected into log lines produced within the scope. No `trace_id` / `span_id` enrichment — there
is no ActivitySource, so this is consistent but leaves no path to trace correlation.
### Log redaction — `GatewayLogRedactor.cs`
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs`:
- Masks sensitive data in log lines for three categories:
  - **`AuthenticateUser`** commands: the password argument is replaced.
  - **`WriteSecured`** commands: the value argument is replaced.
  - **`mxgw_` bearer tokens**: the token body is masked, keeping only the key-id prefix.
- Redaction is applied before the log event is emitted — no sensitive data reaches the sink.
This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and
ScadaBridge have no equivalent.
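The three redaction categories can be sketched as below. This is an illustration, not the code in `GatewayLogRedactor.cs`: the token layout `mxgw_<keyId>_<secret>`, the `***` mask, the argument names `password` and `value`, and both helper functions are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// ASSUMPTION: tokens look like "mxgw_<keyId>_<secret>". The real layout is not given in the source.
var tokenPattern = new Regex(@"\b(mxgw_[A-Za-z0-9]+)_[A-Za-z0-9\-]+", RegexOptions.Compiled);

// Hypothetical helper: keep the key-id prefix, mask the token body.
string RedactToken(string text) => tokenPattern.Replace(text, "${1}_***");

// Hypothetical helper: mask the sensitive argument of the two commands named above.
string RedactArgument(string command, string argumentName, string value) =>
    (command, argumentName) switch
    {
        ("AuthenticateUser", "password") => "***",
        ("WriteSecured", "value") => "***",
        _ => value, // everything else passes through unchanged
    };

Console.WriteLine(RedactToken("authorization: Bearer mxgw_k1_s3cr3tBody"));
Console.WriteLine(RedactArgument("AuthenticateUser", "password", "hunter2"));
Console.WriteLine(RedactArgument("Read", "value", "42"));
```

Applying the helpers before calling the logger, rather than inside a sink, mirrors the placement described above: the sensitive value never enters the log event.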
## 3. Signal summary
| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | `System.Diagnostics.Metrics` (`Meter` direct) | ⛔ none (`GetSnapshot()` only) | ⛔ none |
| Traces | — | ⛔ none | ⛔ none |
| Logs | MEL (`Microsoft.Extensions.Logging`) | Console via `appsettings.json` | ⛔ none |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource exists) |
## 4. Notable design choices
- **`GatewayMetrics` singleton** — all counter/gauge increments are lock-free atomic operations on
the underlying `Meter` instruments; the singleton is intentional.
- **`ms` histogram unit** — `workers.startup.duration`, `commands.duration`, and
`events.stream_send.duration` all record in milliseconds. This is non-standard (OTel semconv
requires `s`) and means raw values differ from OtOpcUa's `s` histograms by a factor of 1000.
- **MEL correlation via `BeginScope`** — MEL scopes are supported by structured logging providers
(e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The
scope properties may not appear in all sink configurations, unlike Serilog's `LogContext` which
is sink-agnostic.
- **Redaction placement** — `GatewayLogRedactor` sits between the caller and the log emission point,
not inside a sink. This is the correct placement; the shared `ILogRedactor` seam preserves this.
---
## Adoption plan → `ZB.MOM.WW.Telemetry`
**This is the one in-pass adoption.** The MxGateway MEL → Serilog migration is executed as part of
the `ZB.MOM.WW.Telemetry` library build, not deferred as a follow-on. The changes below land in
the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library build).
**Migrate logging MEL → `AddZbSerilog`:**
- Replace `WebApplicationBuilder` default logging with `builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; })`.
Gains structured `SiteId` / `NodeRole` / `NodeHostname` enrichers on every log event, plus
`TraceContextEnricher` (currently moot — no spans — but ready for when tracing is added).
- Re-express the `GatewayLogScope` / `BeginGatewayScope` / `UseGatewayRequestLogging` correlation
mechanism as a Serilog `LogContext.PushProperty` scope. The middleware at
`GatewayRequestLoggingMiddlewareExtensions.cs:22-41` is refactored to push the same four
properties (`SessionId`, `WorkerProcessId`, `CorrelationId`, `CommandMethod`) via Serilog's
`LogContext` rather than MEL `BeginScope`. Behavior is identical; portability improves.
- Move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The redaction policy (which
commands/tokens to scrub and how) stays per-project in a `MxGatewayLogRedactor : ILogRedactor`
implementation; the seam is shared.
- Console + file sinks configured via `ReadFrom.Configuration` in `appsettings.json` — consistent
with OtOpcUa and ScadaBridge's Serilog approach.
**Wire metrics export via `AddZbTelemetry`:**
- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; })`.
This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus
exporter. The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time.
`GatewayMetrics.cs` itself is unchanged — only the SDK layer is added around it.
- Add `app.MapZbMetrics()` to expose `/metrics`.
**Convert histogram unit `ms` → `s`:**
- Rescale the three histograms' recorded values: multiply them by `0.001` at the call site, or
re-create the instruments with unit `s` and record seconds directly. Either way, the declared unit
and the recorded values must agree. This is a breaking change to existing dashboards/alerts
but required for OTel semconv compliance. Tagged as a convergence item in `GAPS.md`.
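The two conversion routes above might look like the following. This is a sketch under stated assumptions: `ToSeconds` is a hypothetical helper, and the stopwatch stands in for whatever timing the real call sites use.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

using var meter = new Meter("MxGateway.Server");

// Re-created instrument: same name as in the histogram table, unit declared as seconds.
Histogram<double> commandDuration =
    meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "s");

// Hypothetical helper for call sites that still hold a millisecond value.
double ToSeconds(double milliseconds) => milliseconds * 0.001;

var stopwatch = Stopwatch.StartNew();
// ... the timed MXAccess command would run here ...
stopwatch.Stop();

// Route 1: record seconds directly from the elapsed time.
commandDuration.Record(stopwatch.Elapsed.TotalSeconds);

// Route 2: scale an existing millisecond measurement at the call site.
commandDuration.Record(ToSeconds(250.0));
```

Recording `Elapsed.TotalSeconds` avoids a separate scaling step, which may make it harder to end up with an instrument declared in `s` that still receives millisecond values.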
**Keep bespoke:**
- `GatewayMetrics.cs` — all 20 instruments (`mxgateway.*` counters, histograms, gauges) stay
per-project. `AddZbTelemetry` registers the Meter name; it does not own or replace the instruments.
- Meter name `"MxGateway.Server"` — a follow-on rename to `"ZB.MOM.WW.MxGateway"` is tracked in
`GAPS.md` but is not required for the initial adoption (it is a Prometheus label change that
breaks existing dashboards).
- `GatewayApplication.cs:62` singleton registration — unchanged; `GatewayMetrics` remains a
singleton; `AddZbTelemetry` simply hooks the OTel SDK to it.
- The net48 x86 worker's `IWorkerLogger` (stderr key=value) — out of process and out of scope.
No changes.