# Observability — current state: MxAccessGateway

Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (**x86**);
solution `src/MxGateway.sln`. Telemetry code is concentrated in
`src/ZB.MOM.WW.MxGateway.Server/Metrics/` (instruments) and
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/` (logging correlation + redaction).
All paths relative to repo root. Verified 2026-06-01.

The most unusual observability posture in the family: **13 counters, 3 histograms, and 4 observable
gauges**, all fully hand-rolled using `System.Diagnostics.Metrics` directly, but **never exported**
(no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory
`GetSnapshot()`. Logging is `Microsoft.Extensions.Logging` exclusively (no Serilog), with a bespoke
correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of
scope; its `IWorkerLogger` (stderr key=value) is not addressed here.

## 1. Metrics (hand-rolled, unexported)

### `GatewayMetrics.cs`

`src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`:

Meter name: `"MxGateway.Server"` (does not follow the project namespace `ZB.MOM.WW.MxGateway`).

All instruments are instance members of `GatewayMetrics`. The class is registered as a **singleton**
at `GatewayApplication.cs:62`. There is **no `OpenTelemetry.Extensions.Hosting`**,
**no `AddOpenTelemetry()` call**, and **no exporter**: the `Meter` is created with
`new Meter("MxGateway.Server")` and `GetSnapshot()` is the only read path.

**Counters (13):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.opened` | New session requests |
| `mxgateway.sessions.closed` | Sessions torn down |
| `mxgateway.commands.started` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | Command completed OK |
| `mxgateway.commands.failed` | Command error |
| `mxgateway.events.received` | MXAccess events from worker |
| `mxgateway.queues.overflows` | Queue overflow (backpressure) |
| `mxgateway.faults` | Unhandled gateway faults |
| `mxgateway.workers.killed` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | Retry attempts (any subsystem) |

**Histograms (3), unit `ms` (diverges from OTel semconv `s`):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.workers.startup.duration` | Time from worker spawn to ready |
| `mxgateway.commands.duration` | End-to-end MXAccess command latency |
| `mxgateway.events.stream_send.duration` | gRPC event stream send latency |

**Observable gauges (4):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.open` | Currently open sessions (live count) |
| `mxgateway.workers.running` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | Per-stream gRPC send queue depth |

All 20 instruments share the `mxgateway.*` prefix and `<category>.<event>` naming, consistent
with the family convention. Duration histograms record in **milliseconds** (`ms`); OTel semantic
conventions require seconds (`s`). This is the only project with `ms` histograms.
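
For orientation, a hand-rolled instrument set of this kind can be sketched as below, with one instrument of each kind. `GatewayMetricsSketch` and its method names are hypothetical stand-ins and do not reproduce `GatewayMetrics.cs`; only the meter name, the three instrument names, and the `ms` unit come from the tables above.

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

// Hypothetical, trimmed-down stand-in for GatewayMetrics (not the real class).
public sealed class GatewayMetricsSketch : IDisposable
{
    private readonly Meter _meter = new Meter("MxGateway.Server");
    private readonly Counter<long> _sessionsOpened;
    private readonly Histogram<double> _commandDuration;
    private int _openSessions; // backing value read by the observable gauge callback

    public GatewayMetricsSketch()
    {
        _sessionsOpened = _meter.CreateCounter<long>("mxgateway.sessions.opened");
        _commandDuration = _meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "ms");
        // Observable gauges are polled: the callback runs only when something collects.
        _meter.CreateObservableGauge<int>("mxgateway.sessions.open",
            () => Volatile.Read(ref _openSessions));
    }

    public int OpenSessions => Volatile.Read(ref _openSessions);

    public void SessionOpened()
    {
        _sessionsOpened.Add(1);
        Interlocked.Increment(ref _openSessions);
    }

    public void SessionClosed() => Interlocked.Decrement(ref _openSessions);

    // Records in milliseconds, mirroring the current (non-semconv) unit.
    public void CommandCompleted(TimeSpan elapsed) =>
        _commandDuration.Record(elapsed.TotalMilliseconds);

    public void Dispose() => _meter.Dispose();
}
```

Nothing in this sketch exports anything: the instruments only publish measurements to whatever is listening in-process.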

### Singleton wiring

`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:
- `:62`: `services.AddSingleton<GatewayMetrics>()` registers the metrics singleton.

There is no `AddOpenTelemetry()` call anywhere in the gateway. The `GatewayMetrics` `Meter` is
created independently of any OTel SDK; it participates in `MeterListener` / `GetSnapshot()` only.
Without the OTel SDK, this data is **invisible to Prometheus, OTLP, or any backend**.
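
To illustrate what "in-process only" means, the sketch below attaches a `MeterListener` to a meter by name and sums the `long` measurements it sees. How `GetSnapshot()` is actually implemented is not shown in this document; `InProcessListenerSketch` and `SumLongMeasurements` are hypothetical names used only for this example.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class InProcessListenerSketch
{
    // Runs `emit` against a fresh Meter and returns the sum of all long measurements recorded.
    public static long SumLongMeasurements(string meterName, Action<Meter> emit)
    {
        long total = 0;
        using var listener = new MeterListener();

        // Subscribe only to instruments that belong to the named meter.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName)
                l.EnableMeasurementEvents(instrument);
        };

        // Called synchronously on every Add() of a long-valued instrument.
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => total += value);
        listener.Start();

        using var meter = new Meter(meterName);
        emit(meter);
        return total;
    }
}
```

A listener like this lives and dies with the process, which is why a backend never sees the data until an SDK with an exporter subscribes to the same meter name.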

### No tracing

No `ActivitySource` is defined. No spans are created. Tracing is entirely absent.

## 2. Logging (Microsoft.Extensions.Logging)

All logging in the gateway server uses `Microsoft.Extensions.Logging` (MEL) exclusively. There is
no Serilog dependency. Sink configuration lives in `appsettings.json` (Console, with structured
logging via the default host builder).

### Correlation scope

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs`:

Defines the per-request/per-session correlation property bag.

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs`:
- `:22–41`: `UseGatewayRequestLogging()` middleware reads the following HTTP headers from each
  incoming request: `x-session-id`, `x-worker-process-id`, `x-correlation-id`, `x-command-method`,
  `authorization` (for redaction, not logging).
- Registered at `GatewayApplication.cs:34`.

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs`:
- `:11–18`: `BeginGatewayScope(ILogger, GatewayLogScope)` calls `logger.BeginScope(scope)`,
  MEL's `ILogger.BeginScope` mechanism, which pushes properties as a scoped dictionary.

The correlation tuple (`SessionId` / `WorkerProcessId` / `CorrelationId` / `CommandMethod`) is
injected into log lines produced within the scope. No `trace_id` / `span_id` enrichment: there
is no ActivitySource, so this is consistent but leaves no path to trace correlation.
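
The scope mechanism described above can be sketched roughly as follows. This is an assumption-laden illustration, not the repository's code: `GatewayLogScopeSketch`, `ToProperties`, and the string-typed fields are hypothetical, and the real `GatewayLogScope` may expose its properties differently. It assumes the `Microsoft.Extensions.Logging` abstractions that ASP.NET Core already references.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

// Hypothetical stand-in for GatewayLogScope: the four correlation values as a property bag.
public sealed record GatewayLogScopeSketch(
    string SessionId, string WorkerProcessId, string CorrelationId, string CommandMethod)
{
    public IReadOnlyList<KeyValuePair<string, object>> ToProperties() => new[]
    {
        new KeyValuePair<string, object>("SessionId", SessionId),
        new KeyValuePair<string, object>("WorkerProcessId", WorkerProcessId),
        new KeyValuePair<string, object>("CorrelationId", CorrelationId),
        new KeyValuePair<string, object>("CommandMethod", CommandMethod),
    };
}

public static class GatewayLoggerExtensionsSketch
{
    // Providers that honor scopes attach these pairs to every event logged until disposal.
    public static IDisposable? BeginGatewayScope(this ILogger logger, GatewayLogScopeSketch scope) =>
        logger.BeginScope(scope.ToProperties());
}

// Usage, given some ILogger `logger`:
//   using (logger.BeginGatewayScope(new GatewayLogScopeSketch("s-1", "4242", "c-9", "Read")))
//   {
//       logger.LogInformation("Command dispatched");
//   }
```

Passing key/value pairs rather than a formatted string is what lets structured providers surface each value as its own property.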

### Log redaction: `GatewayLogRedactor.cs`

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs`:

- Masks sensitive data in log lines for two command categories:
  - **`AuthenticateUser`** commands: the password argument is replaced.
  - **`WriteSecured`** commands: the value argument is replaced.
- **`mxgw_` bearer tokens**: the token body is masked, keeping only the key-id prefix.
- Redaction is applied before the log event is emitted, so no sensitive data reaches the sink.

This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and
ScadaBridge have no equivalent.
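
A minimal sketch of redaction rules of this shape follows. Several details are assumptions made for illustration and are not stated in this document: the token layout `mxgw_<keyId>_<secret>`, the sensitive argument being the last one, and the `***` mask. `LogRedactorSketch` is a hypothetical name, not the real `GatewayLogRedactor`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogRedactorSketch
{
    private const string Mask = "***";

    // Assumed token layout: mxgw_<keyId>_<secret>. The real format is not given here.
    private static readonly Regex Token =
        new Regex(@"\bmxgw_(?<keyId>[A-Za-z0-9]+)_[A-Za-z0-9\-]+", RegexOptions.Compiled);

    // Keeps the key-id prefix so a log line can still be tied to a key; masks the rest.
    public static string RedactTokens(string text) =>
        Token.Replace(text, m => "mxgw_" + m.Groups["keyId"].Value + "_" + Mask);

    // Returns a copy of the arguments that is safe to log for the given command.
    public static string[] RedactArguments(string commandMethod, string[] args)
    {
        var copy = (string[])args.Clone();
        // Assumption: the password / secured value is the last argument.
        bool sensitive = commandMethod == "AuthenticateUser" || commandMethod == "WriteSecured";
        if (sensitive && copy.Length > 0)
            copy[copy.Length - 1] = Mask;
        return copy;
    }
}
```

Both methods return new values rather than mutating their input, so the caller logs the redacted copy and the original never reaches the logging call.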

## 3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | `System.Diagnostics.Metrics` (`Meter` direct) | ⛔ none (`GetSnapshot()` only) | ⛔ none |
| Traces | n/a | ⛔ none | ⛔ none |
| Logs | MEL (`Microsoft.Extensions.Logging`) | Console via `appsettings.json` | ⛔ none |
| Trace↔log correlation | n/a | n/a | ⛔ absent (no ActivitySource exists) |

## 4. Notable design choices

- **`GatewayMetrics` singleton**: all counter/gauge increments are lock-free atomic operations on
  the underlying `Meter` instruments; the singleton is intentional.
- **`ms` histogram unit**: `workers.startup.duration`, `commands.duration`, and
  `events.stream_send.duration` all record in milliseconds. This is non-standard (OTel semconv
  requires `s`) and means raw values differ from OtOpcUa's `s` histograms by a factor of 1000.
- **MEL correlation via `BeginScope`**: MEL scopes are supported by structured logging providers
  (e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The
  scope properties may not appear in all sink configurations, unlike Serilog's `LogContext` which
  is sink-agnostic.
- **Redaction placement**: `GatewayLogRedactor` sits between the caller and the log emission point,
  not inside a sink. This is the correct placement; the shared `ILogRedactor` seam preserves this.

---

## Adoption plan → `ZB.MOM.WW.Telemetry`

**This is the one in-pass adoption.** The MxGateway MEL → Serilog migration is executed as part of
the `ZB.MOM.WW.Telemetry` library build, not deferred as a follow-on. The changes below land in
the MxAccessGateway repo as part of Task #9 (blocked by Task #8, the library build).

**Migrate logging MEL → `AddZbSerilog`:**

- Replace `WebApplicationBuilder` default logging with `builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; })`.
  Gains structured `SiteId` / `NodeRole` / `NodeHostname` enrichers on every log event, plus
  `TraceContextEnricher` (currently moot because there are no spans, but ready for when tracing
  is added).
- Re-express the `GatewayLogScope` / `BeginGatewayScope` / `UseGatewayRequestLogging` correlation
  mechanism as a Serilog `LogContext.PushProperty` scope. The middleware at
  `GatewayRequestLoggingMiddlewareExtensions.cs:22–41` is refactored to push the same four
  properties (`SessionId`, `WorkerProcessId`, `CorrelationId`, `CommandMethod`) via Serilog's
  `LogContext` rather than MEL `BeginScope`. Behavior is identical; portability improves.
- Move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The redaction policy (which
  commands/tokens to scrub and how) stays per-project in a `MxGatewayLogRedactor : ILogRedactor`
  implementation; the seam is shared.
- Console + file sinks configured via `ReadFrom.Configuration` in `appsettings.json`, consistent
  with OtOpcUa and ScadaBridge's Serilog approach.
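
The `LogContext` re-expression of the correlation scope might look roughly like this. It is a sketch under assumptions: `GatewayLogContextSketch` and `Push` are hypothetical names, the values are treated as strings, and the Serilog package is assumed to be referenced. The properties only appear on events if the logger is configured with `Enrich.FromLogContext()`.

```csharp
using System;
using Serilog.Context;

public static class GatewayLogContextSketch
{
    // Pushes the four correlation properties; disposing the result pops them again.
    public static IDisposable Push(
        string sessionId, string workerProcessId, string correlationId, string commandMethod)
    {
        var pushed = new[]
        {
            LogContext.PushProperty("SessionId", sessionId),
            LogContext.PushProperty("WorkerProcessId", workerProcessId),
            LogContext.PushProperty("CorrelationId", correlationId),
            LogContext.PushProperty("CommandMethod", commandMethod),
        };
        return new Popper(pushed);
    }

    private sealed class Popper : IDisposable
    {
        private readonly IDisposable[] _pushed;

        public Popper(IDisposable[] pushed) => _pushed = pushed;

        public void Dispose()
        {
            // Pop in reverse order so the context unwinds like a stack.
            for (var i = _pushed.Length - 1; i >= 0; i--)
                _pushed[i].Dispose();
        }
    }
}
```

In a middleware, the returned handle would wrap the call to the next delegate in a `using` block so the properties cover everything logged while the request is handled.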

**Wire metrics export via `AddZbTelemetry`:**

- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; /* temporary: update to "ZB.MOM.WW.MxGateway" when the Meter-rename gap (Gap N1) is closed */ })`.
  This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus
  exporter. The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time.
  `GatewayMetrics.cs` itself is unchanged by this step; only the SDK layer is added around it.
- Add `app.MapZbMetrics()` to expose `/metrics`.
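
`AddZbTelemetry` and `MapZbMetrics` belong to the shared library, and their internals are not shown in this document. As an assumption only, the stock OpenTelemetry .NET wiring that such helpers would need to perform looks roughly like the sketch below (packages `OpenTelemetry.Extensions.Hosting` and `OpenTelemetry.Exporter.Prometheus.AspNetCore`).

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // Sets the service.name resource attribute.
    .ConfigureResource(resource => resource.AddService("mxgateway"))
    .WithMetrics(metrics => metrics
        // Must match the name passed to new Meter(...) in GatewayMetrics.
        .AddMeter("MxGateway.Server")
        .AddPrometheusExporter());

var app = builder.Build();

// Serves the scrape endpoint; the default path is /metrics.
app.MapPrometheusScrapingEndpoint();

app.Run();
```

The SDK subscribes by meter name, which is why the existing instruments can start exporting without edits to the class that owns them.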

**Convert histogram unit `ms` → `s`:**

- Convert the three histograms to seconds: re-create the instruments with unit `s` and record
  seconds at the call site (for example, multiply the millisecond value by `0.001`). This is a
  breaking change for any dashboards or alerts built on the millisecond values, but it is
  required for OTel semconv compliance. Tagged as a convergence item in `GAPS.md`.
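
A minimal sketch of the seconds-based recording, assuming a hypothetical `CommandTimingSketch` wrapper (the real call sites are not shown in this document). `Stopwatch.GetElapsedTime` requires .NET 7 or later.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public sealed class CommandTimingSketch
{
    private readonly Histogram<double> _duration;

    public CommandTimingSketch(Meter meter) =>
        // Unit metadata and recorded values must agree: both are seconds here.
        _duration = meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "s");

    // For call sites that already hold a millisecond value.
    public static double MillisecondsToSeconds(double milliseconds) => milliseconds * 0.001;

    // Times the command, records the elapsed seconds, and returns them.
    public double Time(Action command)
    {
        long start = Stopwatch.GetTimestamp();
        command();
        double seconds = Stopwatch.GetElapsedTime(start).TotalSeconds;
        _duration.Record(seconds);
        return seconds;
    }
}
```

Changing only the unit string or only the recorded value would leave the metadata and the data out of step, so the two changes belong in the same commit.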

**Keep bespoke:**

- `GatewayMetrics.cs`: all 20 instruments (`mxgateway.*` counters, histograms, gauges) stay
  per-project. `AddZbTelemetry` registers the Meter name; it does not own or replace the instruments.
- Meter name `"MxGateway.Server"`: a follow-on rename to `"ZB.MOM.WW.MxGateway"` is tracked in
  `GAPS.md` but is not required for the initial adoption (it is a Prometheus label change that
  breaks existing dashboards).
- `GatewayApplication.cs:62` singleton registration: unchanged; `GatewayMetrics` remains a
  singleton; `AddZbTelemetry` simply hooks the OTel SDK to it.
- The net48 x86 worker's `IWorkerLogger` (stderr key=value): out of process and out of scope.
  No changes.