scadaproj/components/observability/current-state/mxaccessgw/CURRENT-STATE.md
Joseph Doherty 215a646e35 docs(observability): fix metric-convention instrument names + NodeHostname-auto + resolve settled questions
C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).

C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).

C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.

m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.

I4: §5 standard instrumentation table corrected: OtOpcUa now shows "not added" for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.

I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).

I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.

I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.

m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
2026-06-01 07:32:58 -04:00

# Observability — current state: MxAccessGateway
Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (**x86**);
solution `src/MxGateway.sln`. Telemetry code is concentrated in
`src/ZB.MOM.WW.MxGateway.Server/Metrics/` (instruments) and
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/` (logging correlation + redaction).
All paths relative to repo root. Verified 2026-06-01.

The most unusual observability posture in the family: **13 counters, 3 histograms, and 4 observable
gauges** all fully hand-rolled using `System.Diagnostics.Metrics` directly — but **never exported**
(no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory
`GetSnapshot()`. Logging is `Microsoft.Extensions.Logging` exclusively (no Serilog), with a bespoke
correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of
scope — its `IWorkerLogger` (stderr key=value) is not addressed here.
## 1. Metrics (hand-rolled, unexported)
### `GatewayMetrics.cs`
`src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`:
Meter name: `"MxGateway.Server"` (does not follow the project namespace `ZB.MOM.WW.MxGateway`).
All instruments are instance members of `GatewayMetrics`. The class is registered as a **singleton**
at `GatewayApplication.cs:62`. There is **no `OpenTelemetry.Extensions.Hosting`**,
**no `AddOpenTelemetry()` call**, and **no exporter** — the `Meter` is created with
`new Meter("MxGateway.Server")` and `GetSnapshot()` is the only read path.

**Counters (13):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.opened` | New session requests |
| `mxgateway.sessions.closed` | Sessions torn down |
| `mxgateway.commands.started` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | Command completed OK |
| `mxgateway.commands.failed` | Command error |
| `mxgateway.events.received` | MXAccess events from worker |
| `mxgateway.queues.overflows` | Queue overflow (backpressure) |
| `mxgateway.faults` | Unhandled gateway faults |
| `mxgateway.workers.killed` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | Retry attempts (any subsystem) |

**Histograms (3), unit `ms` (diverges from OTel semconv `s`):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.workers.startup.duration` | Time from worker spawn to ready |
| `mxgateway.commands.duration` | End-to-end MXAccess command latency |
| `mxgateway.events.stream_send.duration` | gRPC event stream send latency |

**Observable gauges (4):**

| Instrument name | Tracks |
|---|---|
| `mxgateway.sessions.open` | Currently open sessions (live count) |
| `mxgateway.workers.running` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | Per-stream gRPC send queue depth |

All 20 instruments share the `mxgateway.*` prefix and `<category>.<event>` naming, consistent
with the family convention. Duration histograms record in **milliseconds** (`ms`); OTel semantic
conventions require seconds (`s`). This is the only project with `ms` histograms.
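
For orientation, a minimal sketch of how a class of this shape can declare one instrument of each kind with `System.Diagnostics.Metrics`. The class name, method names, and the `OpenSessions` property are hypothetical; only the meter name, the three instrument names, and the `ms` unit come from the tables above. The real `GatewayMetrics.cs` may be organised differently.

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

// Hypothetical, trimmed-down shape of a GatewayMetrics-style class.
// Not the real GatewayMetrics.cs; member names are illustrative only.
public sealed class GatewayMetricsSketch : IDisposable
{
    private readonly Meter _meter = new("MxGateway.Server");
    private readonly Counter<long> _sessionsOpened;
    private readonly Histogram<double> _commandDuration;
    private long _openSessions;

    public GatewayMetricsSketch()
    {
        _sessionsOpened = _meter.CreateCounter<long>("mxgateway.sessions.opened");

        // Unit is declared as "ms", matching the current (non-semconv) behaviour.
        _commandDuration = _meter.CreateHistogram<double>(
            "mxgateway.commands.duration", unit: "ms");

        // Observable gauge: the callback is polled by whoever collects the meter.
        _meter.CreateObservableGauge(
            "mxgateway.sessions.open", () => Interlocked.Read(ref _openSessions));
    }

    public long OpenSessions => Interlocked.Read(ref _openSessions);

    public void SessionOpened()
    {
        _sessionsOpened.Add(1);
        Interlocked.Increment(ref _openSessions);
    }

    public void SessionClosed() => Interlocked.Decrement(ref _openSessions);

    public void CommandCompleted(double elapsedMs) => _commandDuration.Record(elapsedMs);

    public void Dispose() => _meter.Dispose();
}
```

Nothing in this sketch exports data: the instruments only publish to whatever listener is attached in-process.
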
### Singleton wiring
`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:
- `:62`: `services.AddSingleton<GatewayMetrics>()` registers the metrics singleton.

There is no `AddOpenTelemetry()` call anywhere in the gateway. The `GatewayMetrics` `Meter` is
created independently of any OTel SDK — it participates in `MeterListener` / `GetSnapshot()` only.
Without the OTel SDK, this data is **invisible to Prometheus, OTLP, or any backend**.
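
To illustrate what "participates in `MeterListener` only" means, the sketch below attaches an in-process `MeterListener` to a meter with the same name and counts the measurements it sees. The class and method names are hypothetical, and this is not how `GetSnapshot()` is implemented (its internals are not described here); it only shows that measurements are observable inside the process and nowhere else.

```csharp
using System;
using System.Diagnostics.Metrics;

// Illustrative only: an in-process listener is the sole way to observe a Meter
// when no OpenTelemetry SDK or exporter is registered.
public static class InProcessListenerSketch
{
    public static int CountObservedMeasurements()
    {
        var observed = 0;

        using var meter = new Meter("MxGateway.Server");
        var opened = meter.CreateCounter<long>("mxgateway.sessions.opened");

        using var listener = new MeterListener();

        // Subscribe to every instrument published by the gateway meter.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "MxGateway.Server")
            {
                l.EnableMeasurementEvents(instrument);
            }
        };

        // Invoked synchronously on each Counter<long>.Add call.
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => observed++);

        listener.Start();

        opened.Add(1);
        opened.Add(1);

        return observed;
    }
}
```
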
### No tracing
No `ActivitySource` is defined. No spans are created. Tracing is entirely absent.
## 2. Logging (Microsoft.Extensions.Logging)
All logging in the gateway server uses `Microsoft.Extensions.Logging` (MEL) exclusively. There is
no Serilog dependency. Sink configuration lives in `appsettings.json` (Console, with structured
logging via the default host builder).
### Correlation scope
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs`:
Defines the per-request/per-session correlation property bag.

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs`:
- `:2241`: `UseGatewayRequestLogging()` middleware reads the following HTTP headers from each
  incoming request: `x-session-id`, `x-worker-process-id`, `x-correlation-id`, `x-command-method`,
  `authorization` (for redaction, not logging).
- Registered at `GatewayApplication.cs:34`.

`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs`:
- `:1118`: `BeginGatewayScope(ILogger, GatewayLogScope)` calls `logger.BeginScope(scope)`,
  MEL's `ILogger.BeginScope` mechanism, which pushes properties as a scoped dictionary.

The correlation tuple (`SessionId` / `WorkerProcessId` / `CorrelationId` / `CommandMethod`) is
injected into log lines produced within the scope. No `trace_id` / `span_id` enrichment — there
is no ActivitySource, so this is consistent but leaves no path to trace correlation.
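
A minimal sketch of the `BeginScope` pattern described above, assuming the scope is pushed as a dictionary of the four correlation properties. The helper name and its parameters are hypothetical; the real `BeginGatewayScope` takes a `GatewayLogScope`, whose shape is not shown in this document.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

public static class GatewayScopeSketch
{
    // Hypothetical helper, not the real BeginGatewayScope signature.
    public static IDisposable? BeginCorrelationScope(
        this ILogger logger,
        string sessionId,
        string workerProcessId,
        string correlationId,
        string commandMethod)
    {
        // Providers that support scopes attach these key/value pairs to every
        // log event written until the returned IDisposable is disposed.
        return logger.BeginScope(new Dictionary<string, object?>
        {
            ["SessionId"] = sessionId,
            ["WorkerProcessId"] = workerProcessId,
            ["CorrelationId"] = correlationId,
            ["CommandMethod"] = commandMethod,
        });
    }
}
```

Typical use wraps the request handling in `using (logger.BeginCorrelationScope(...)) { ... }` so the properties are removed when the request ends.
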
### Log redaction — `GatewayLogRedactor.cs`
`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs`:
- Masks sensitive data in log lines for two categories:
  - **`AuthenticateUser`** commands: the password argument is replaced.
  - **`WriteSecured`** commands: the value argument is replaced.
- **`mxgw_` bearer tokens**: the token body is masked, keeping only the key-id prefix.
- Redaction is applied before the log event is emitted, so no sensitive data reaches the sink.

This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and
ScadaBridge have no equivalent.
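
As a rough illustration of the token-masking rule, the sketch below keeps a key-id prefix and replaces the rest. The token layout `mxgw_<keyId>_<secret>`, the `***` placeholder, and the class name are all assumptions made for the example; the actual format and replacement text used by `GatewayLogRedactor` are not given in this document.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenMaskSketch
{
    // ASSUMPTION: tokens look like "mxgw_<keyId>_<secret>". The real layout may differ.
    private static readonly Regex Token = new(
        @"\bmxgw_(?<keyId>[A-Za-z0-9]+)_[A-Za-z0-9\-._~+/=]+",
        RegexOptions.Compiled);

    // Replaces the secret part of every matching token, keeping the key id.
    public static string Mask(string message) =>
        Token.Replace(message, m => $"mxgw_{m.Groups["keyId"].Value}_***");
}
```

Because the mask runs on the message before it is handed to the logger, the same function works regardless of which sink is configured.
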
## 3. Signal summary
| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | `System.Diagnostics.Metrics` (`Meter` direct) | ⛔ none (`GetSnapshot()` only) | ⛔ none |
| Traces | — | ⛔ none | ⛔ none |
| Logs | MEL (`Microsoft.Extensions.Logging`) | Console via `appsettings.json` | ⛔ none |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource exists) |
## 4. Notable design choices
- **`GatewayMetrics` singleton** — all counter/gauge increments are lock-free atomic operations on
the underlying `Meter` instruments; the singleton is intentional.
- **`ms` histogram unit** — `workers.startup.duration`, `commands.duration`, and
`events.stream_send.duration` all record in milliseconds. This is non-standard (OTel semconv
requires `s`) and means raw values differ from OtOpcUa's `s` histograms by a factor of 1000.
- **MEL correlation via `BeginScope`** — MEL scopes are supported by structured logging providers
(e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The
scope properties may not appear in all sink configurations, unlike Serilog's `LogContext` which
is sink-agnostic.
- **Redaction placement** — `GatewayLogRedactor` sits between the caller and the log emission point,
not inside a sink. This is the correct placement; the shared `ILogRedactor` seam preserves this.
---
## Adoption plan → `ZB.MOM.WW.Telemetry`
**This is the one in-pass adoption.** The MxGateway MEL → Serilog migration is executed as part of
the `ZB.MOM.WW.Telemetry` library build, not deferred as a follow-on. The changes below land in
the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library build).
**Migrate logging MEL → `AddZbSerilog`:**
- Replace `WebApplicationBuilder` default logging with `builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; })`.
Gains structured `SiteId` / `NodeRole` / `NodeHostname` enrichers on every log event, plus
`TraceContextEnricher` (currently moot — no spans — but ready for when tracing is added).
- Re-express the `GatewayLogScope` / `BeginGatewayScope` / `UseGatewayRequestLogging` correlation
mechanism as a Serilog `LogContext.PushProperty` scope. The middleware at
`GatewayRequestLoggingMiddlewareExtensions.cs:2241` is refactored to push the same four
properties (`SessionId`, `WorkerProcessId`, `CorrelationId`, `CommandMethod`) via Serilog's
`LogContext` rather than MEL `BeginScope`. Behavior is identical; portability improves.
- Move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The redaction policy (which
commands/tokens to scrub and how) stays per-project in a `MxGatewayLogRedactor : ILogRedactor`
implementation; the seam is shared.
- Console + file sinks configured via `ReadFrom.Configuration` in `appsettings.json` — consistent
with OtOpcUa and ScadaBridge's Serilog approach.
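
A sketch of what the refactored middleware could look like with Serilog's `LogContext`, assuming the four properties are read straight from the headers listed in the correlation-scope section. The extension-method name is hypothetical and the real middleware may validate or normalise the header values first.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Serilog.Context;

public static class GatewayRequestLoggingSketch
{
    // Hypothetical name; stands in for the refactored UseGatewayRequestLogging().
    public static IApplicationBuilder UseGatewayRequestLoggingSketch(this IApplicationBuilder app) =>
        app.Use(async (HttpContext context, RequestDelegate next) =>
        {
            var headers = context.Request.Headers;

            // Each PushProperty returns an IDisposable that pops the property
            // when the request finishes.
            using (LogContext.PushProperty("SessionId", headers["x-session-id"].ToString()))
            using (LogContext.PushProperty("WorkerProcessId", headers["x-worker-process-id"].ToString()))
            using (LogContext.PushProperty("CorrelationId", headers["x-correlation-id"].ToString()))
            using (LogContext.PushProperty("CommandMethod", headers["x-command-method"].ToString()))
            {
                await next(context);
            }
        });
}
```

The pushed properties only reach log events if the Serilog pipeline is configured with `Enrich.FromLogContext()`; whether `AddZbSerilog` enables that by default is an assumption to confirm against the library.
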
**Wire metrics export via `AddZbTelemetry`:**
- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; /* temporary — update to "ZB.MOM.WW.MxGateway" when the Meter-rename gap (Gap N1) is closed */ })`.
This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus
exporter. The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time.
`GatewayMetrics.cs` itself is unchanged — only the SDK layer is added around it.
- Add `app.MapZbMetrics()` to expose `/metrics`.
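
The internals of `AddZbTelemetry` and `MapZbMetrics` are not shown in this document. As a point of reference only, the equivalent wiring with the stock OpenTelemetry .NET packages (`OpenTelemetry.Extensions.Hosting` and `OpenTelemetry.Exporter.Prometheus.AspNetCore`) might look like the sketch below; treat it as an assumption about what the shared library wraps, not as its implementation.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // service.name resource attribute; SiteId / NodeRole would be added similarly.
    .ConfigureResource(resource => resource.AddService("mxgateway"))
    .WithMetrics(metrics => metrics
        // Subscribes the SDK to the existing hand-rolled Meter by name.
        .AddMeter("MxGateway.Server")
        .AddPrometheusExporter());

var app = builder.Build();

// Exposes the Prometheus scrape endpoint (defaults to /metrics).
app.MapPrometheusScrapingEndpoint();

app.Run();
```

The key point is `AddMeter("MxGateway.Server")`: the SDK subscribes by meter name, which is why `GatewayMetrics.cs` needs no change.
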
**Convert histogram unit `ms` → `s`:**
- Convert the three histograms to seconds: re-create the instruments with unit `s` and scale the
recorded values to match (multiply the millisecond values by `0.001` at the call site). This is a
breaking change to existing dashboards/alerts but required for OTel semconv compliance. Tagged as
a convergence item in `GAPS.md`.
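
A minimal sketch of the converted instrument, assuming the call site has a `TimeSpan` (or a millisecond value) to record. The class and method names are hypothetical; only the instrument name and the target unit `s` come from this document.

```csharp
using System;
using System.Diagnostics.Metrics;

public sealed class CommandDurationSketch
{
    private readonly Histogram<double> _duration;

    public CommandDurationSketch(Meter meter)
    {
        // Declared unit and recorded values must change together.
        _duration = meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "s");
    }

    // Preferred: record directly from a TimeSpan, avoiding manual scaling.
    public void Record(TimeSpan elapsed) => _duration.Record(elapsed.TotalSeconds);

    // For call sites that still hold a millisecond value.
    public void RecordMilliseconds(double elapsedMs) => _duration.Record(ToSeconds(elapsedMs));

    public static double ToSeconds(double elapsedMs) => elapsedMs / 1000.0;
}
```

Changing only the unit string or only the scaling would leave the exported values off by a factor of 1000, so both changes belong in the same commit.
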
**Keep bespoke:**
- `GatewayMetrics.cs` — all 20 instruments (`mxgateway.*` counters, histograms, gauges) stay
per-project. `AddZbTelemetry` registers the Meter name; it does not own or replace the instruments.
- Meter name `"MxGateway.Server"` — a follow-on rename to `"ZB.MOM.WW.MxGateway"` is tracked in
`GAPS.md` but is not required for the initial adoption (it is a Prometheus label change that
breaks existing dashboards).
- `GatewayApplication.cs:62` singleton registration — unchanged; `GatewayMetrics` remains a
singleton; `AddZbTelemetry` simply hooks the OTel SDK to it.
- The net48 x86 worker's `IWorkerLogger` (stderr key=value) — out of process and out of scope.
No changes.