scadaproj/components/observability/current-state/mxaccessgw/CURRENT-STATE.md
Joseph Doherty 215a646e35 docs(observability): fix metric-convention instrument names + NodeHostname-auto + resolve settled questions
C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).

C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).

C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.

m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.

I4: §5 standard instrumentation table corrected — OtOpcUa now shows "not added" for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.

I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).

I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.

I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.

m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
2026-06-01 07:32:58 -04:00


Observability — current state: MxAccessGateway

Repo: ~/Desktop/MxAccessGateway. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (x86); solution src/MxGateway.sln. Telemetry code is concentrated in src/ZB.MOM.WW.MxGateway.Server/Metrics/ (instruments) and src/ZB.MOM.WW.MxGateway.Server/Diagnostics/ (logging correlation + redaction). All paths relative to repo root. Verified 2026-06-01.

The most unusual observability posture in the family: 13 counters, 3 histograms, and 4 observable gauges all fully hand-rolled using System.Diagnostics.Metrics directly — but never exported (no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory GetSnapshot(). Logging is Microsoft.Extensions.Logging exclusively (no Serilog), with a bespoke correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of scope — its IWorkerLogger (stderr key=value) is not addressed here.

1. Metrics (hand-rolled, unexported)

GatewayMetrics.cs

src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:

Meter name: "MxGateway.Server" (does not follow the project namespace ZB.MOM.WW.MxGateway).

All instruments are instance members of GatewayMetrics. The class is registered as a singleton at GatewayApplication.cs:62. There is no OpenTelemetry.Extensions.Hosting, no AddOpenTelemetry() call, and no exporter — the Meter is created with new Meter("MxGateway.Server") and GetSnapshot() is the only read path.

Counters (13):

| Instrument name | Tracks |
|---|---|
| mxgateway.sessions.opened | New session requests |
| mxgateway.sessions.closed | Sessions torn down |
| mxgateway.commands.started | MXAccess command dispatched |
| mxgateway.commands.succeeded | Command completed OK |
| mxgateway.commands.failed | Command error |
| mxgateway.events.received | MXAccess events from worker |
| mxgateway.queues.overflows | Queue overflow (backpressure) |
| mxgateway.faults | Unhandled gateway faults |
| mxgateway.workers.killed | Worker process forcibly terminated |
| mxgateway.workers.exited | Worker process exited cleanly |
| mxgateway.heartbeats.failed | Worker heartbeat timeouts |
| mxgateway.grpc.streams.disconnected | gRPC event stream disconnects |
| mxgateway.retries.attempted | Retry attempts (any subsystem) |

Histograms (3) — unit ms (diverges from OTel semconv s):

| Instrument name | Tracks |
|---|---|
| mxgateway.workers.startup.duration | Time from worker spawn to ready |
| mxgateway.commands.duration | End-to-end MXAccess command latency |
| mxgateway.events.stream_send.duration | gRPC event stream send latency |

Observable gauges (4):

| Instrument name | Tracks |
|---|---|
| mxgateway.sessions.open | Currently open sessions (live count) |
| mxgateway.workers.running | Currently running worker processes |
| mxgateway.events.worker_queue.depth | Per-worker event queue depth |
| mxgateway.events.grpc_stream_queue.depth | Per-stream gRPC send queue depth |

All 20 instruments share the mxgateway.* prefix and <category>.<event> naming — consistent with the family convention. Duration histograms record in milliseconds (ms); OTel semantic conventions require seconds (s). This is the only project with ms histograms.
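To make "hand-rolled" concrete, the following is a minimal sketch of how instruments like these can be declared directly on a Meter. It is not the real GatewayMetrics class: the type name GatewayMetricsSketch, its method names, and the choice of long/double value types are assumptions for illustration. Only the Meter name and the three instrument names come from the tables above.

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

// Illustrative stand-in for GatewayMetrics (hypothetical shape, not the real class).
public sealed class GatewayMetricsSketch : IDisposable
{
    private readonly Meter _meter = new("MxGateway.Server");
    private readonly Counter<long> _sessionsOpened;
    private readonly Counter<long> _sessionsClosed;
    private readonly Histogram<double> _commandDuration;
    private long _openSessions;

    public GatewayMetricsSketch()
    {
        _sessionsOpened = _meter.CreateCounter<long>("mxgateway.sessions.opened");
        _sessionsClosed = _meter.CreateCounter<long>("mxgateway.sessions.closed");
        // Unit "ms" mirrors the current state described in this section.
        _commandDuration = _meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "ms");
        // Observable gauge: the callback runs only when a listener or SDK polls it.
        _meter.CreateObservableGauge("mxgateway.sessions.open", () => Interlocked.Read(ref _openSessions));
    }

    // Live count backing the observable gauge.
    public long OpenSessions => Interlocked.Read(ref _openSessions);

    public void SessionOpened()
    {
        Interlocked.Increment(ref _openSessions);
        _sessionsOpened.Add(1);
    }

    public void SessionClosed()
    {
        Interlocked.Decrement(ref _openSessions);
        _sessionsClosed.Add(1);
    }

    public void CommandCompleted(double elapsedMs) => _commandDuration.Record(elapsedMs);

    public void Dispose() => _meter.Dispose();
}
```

With no listener attached, the Add and Record calls above are effectively discarded, which is why the instruments produce no externally visible data until something subscribes to the Meter.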

Singleton wiring

src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:

  • :62 services.AddSingleton<GatewayMetrics>() registers the metrics singleton.

There is no AddOpenTelemetry() call anywhere in the gateway. The GatewayMetrics Meter is created independently of any OTel SDK — it participates in MeterListener / GetSnapshot() only. Without the OTel SDK, this data is invisible to Prometheus, OTLP, or any backend.
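The in-process read path can be illustrated with a MeterListener. The sketch below is an assumption about the general technique, not a description of how GetSnapshot() is actually implemented; the InProcessMetricReader type and its Snapshot method are hypothetical names. It sums Counter<long> measurements for one Meter name and exposes the totals as a dictionary.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical in-process reader: sums Counter<long> measurements for one Meter.
public sealed class InProcessMetricReader : IDisposable
{
    private readonly MeterListener _listener = new();
    private readonly ConcurrentDictionary<string, long> _totals = new();

    public InProcessMetricReader(string meterName)
    {
        // Subscribe only to long counters published by the named Meter.
        _listener.InstrumentPublished = (instrument, listener) =>
        {
            if (instrument.Meter.Name == meterName && instrument is Counter<long>)
                listener.EnableMeasurementEvents(instrument);
        };

        // Invoked synchronously on every Counter<long>.Add call.
        _listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) =>
                _totals.AddOrUpdate(instrument.Name, value, (_, current) => current + value));

        _listener.Start();
    }

    // Point-in-time copy of the running totals.
    public IReadOnlyDictionary<string, long> Snapshot() => new Dictionary<string, long>(_totals);

    public void Dispose() => _listener.Dispose();
}
```

A reader like this lives and dies with the process, which matches the observation above: without an exporter, nothing outside the gateway can see the values.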

No tracing

No ActivitySource is defined. No spans are created. Tracing is entirely absent.

2. Logging (Microsoft.Extensions.Logging)

All logging in the gateway server uses Microsoft.Extensions.Logging (MEL) exclusively. There is no Serilog dependency. Sink configuration lives in appsettings.json (Console, with structured logging via the default host builder).

Correlation scope

src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs:

Defines the per-request/per-session correlation property bag.

src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs:

  • :2241 UseGatewayRequestLogging() middleware reads the following HTTP headers from each incoming request: x-session-id, x-worker-process-id, x-correlation-id, x-command-method, authorization (for redaction, not logging).
  • Registered at GatewayApplication.cs:34.

src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs:

  • :1118 BeginGatewayScope(ILogger, GatewayLogScope) calls logger.BeginScope(scope), MEL's ILogger.BeginScope mechanism, which pushes properties as a scoped dictionary.

The correlation tuple (SessionId / WorkerProcessId / CorrelationId / CommandMethod) is injected into log lines produced within the scope. No trace_id / span_id enrichment — there is no ActivitySource, so this is consistent but leaves no path to trace correlation.
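As a rough illustration of the scope mechanism, the sketch below pushes the four correlation properties through ILogger.BeginScope as a dictionary. The real GatewayLogScope and BeginGatewayScope are not reproduced in this document, so the record shape, the string-typed fields, and the names GatewayLogScopeSketch, ToState, and BeginGatewayScopeSketch are all assumptions. It requires the Microsoft.Extensions.Logging.Abstractions package.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

// Hypothetical stand-in for GatewayLogScope; field types are assumed to be strings.
public sealed record GatewayLogScopeSketch(
    string SessionId, string WorkerProcessId, string CorrelationId, string CommandMethod)
{
    // Providers that support scopes read key/value state like this as structured properties.
    public IReadOnlyDictionary<string, object> ToState() => new Dictionary<string, object>
    {
        ["SessionId"] = SessionId,
        ["WorkerProcessId"] = WorkerProcessId,
        ["CorrelationId"] = CorrelationId,
        ["CommandMethod"] = CommandMethod,
    };
}

public static class GatewayLoggerSketchExtensions
{
    // Every log call made before the returned scope is disposed carries the four properties,
    // provided the configured logging provider honors scopes.
    public static IDisposable? BeginGatewayScopeSketch(this ILogger logger, GatewayLogScopeSketch scope) =>
        logger.BeginScope(scope.ToState());
}
```

A caller would wrap the work in `using (logger.BeginGatewayScopeSketch(scope)) { ... }` so the scope ends when the request or command completes.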

Log redaction — GatewayLogRedactor.cs

src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs:

  • Masks sensitive data in log lines for three categories:
    • AuthenticateUser commands: the password argument is replaced.
    • WriteSecured commands: the value argument is replaced.
    • mxgw_ bearer tokens: the token body is masked, keeping only the key-id prefix.
  • Redaction is applied before the log event is emitted — no sensitive data reaches the sink.

This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and ScadaBridge have no equivalent.
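The redaction rules can be sketched as follows. This is not the GatewayLogRedactor source: the token shape "mxgw_<keyId>_<secret>", the "***" mask, and the argument positions for AuthenticateUser and WriteSecured are assumptions chosen only to make the example concrete.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative redaction helper (hypothetical; not the real GatewayLogRedactor).
public static class LogRedactionSketch
{
    public const string Mask = "***";

    // ASSUMED token shape: "mxgw_<keyId>_<secret>". Group 1 keeps the key-id prefix.
    private static readonly Regex BearerToken =
        new(@"\b(mxgw_[A-Za-z0-9]+)_[A-Za-z0-9_-]+", RegexOptions.Compiled);

    // Masks the token body and keeps only the "mxgw_<keyId>" prefix.
    public static string RedactToken(string text) =>
        BearerToken.Replace(text, match => match.Groups[1].Value + "_" + Mask);

    // Returns a copy of the arguments with the sensitive one replaced before logging.
    public static string[] RedactArguments(string commandMethod, string[] args)
    {
        var copy = (string[])args.Clone();
        int sensitiveIndex = commandMethod switch
        {
            "AuthenticateUser" => 1, // ASSUMED order: (user, password)
            "WriteSecured" => 1,     // ASSUMED order: (tag, value)
            _ => -1,
        };
        if (sensitiveIndex >= 0 && sensitiveIndex < copy.Length)
            copy[sensitiveIndex] = Mask;
        return copy;
    }
}
```

Because the helpers return new values instead of mutating a log event inside a sink, they can run before the logger is called, which is the placement this section describes.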

3. Signal summary

| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | System.Diagnostics.Metrics (Meter direct) | none (GetSnapshot() only) | none |
| Traces | none | none | |
| Logs | MEL (Microsoft.Extensions.Logging) | Console via appsettings.json | none |
| Trace↔log correlation | absent (no ActivitySource exists) | | |

4. Notable design choices

  • GatewayMetrics singleton — all counter/gauge increments are lock-free atomic operations on the underlying Meter instruments; the singleton is intentional.
  • ms histogram unit: workers.startup.duration, commands.duration, and events.stream_send.duration all record in milliseconds. This is non-standard (OTel semconv requires s) and means raw values differ from OtOpcUa's s histograms by a factor of 1000.
  • MEL correlation via BeginScope — MEL scopes are supported by structured logging providers (e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The scope properties may not appear in all sink configurations, unlike Serilog's LogContext which is sink-agnostic.
  • Redaction placement: GatewayLogRedactor sits between the caller and the log emission point, not inside a sink. This is the correct placement; the shared ILogRedactor seam preserves this.

Adoption plan → ZB.MOM.WW.Telemetry

This is the one in-pass adoption. The MxGateway MEL → Serilog migration is executed as part of the ZB.MOM.WW.Telemetry library build, not deferred as a follow-on. The changes below land in the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library build).

Migrate logging MEL → AddZbSerilog:

  • Replace WebApplicationBuilder default logging with builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; }). Gains structured SiteId / NodeRole / NodeHostname enrichers on every log event, plus TraceContextEnricher (currently moot — no spans — but ready for when tracing is added).
  • Re-express the GatewayLogScope / BeginGatewayScope / UseGatewayRequestLogging correlation mechanism as a Serilog LogContext.PushProperty scope. The middleware at GatewayRequestLoggingMiddlewareExtensions.cs:2241 is refactored to push the same four properties (SessionId, WorkerProcessId, CorrelationId, CommandMethod) via Serilog's LogContext rather than MEL BeginScope. Behavior is identical; portability improves.
  • Move GatewayLogRedactor behind the shared ILogRedactor seam. The redaction policy (which commands/tokens to scrub and how) stays per-project in a MxGatewayLogRedactor : ILogRedactor implementation; the seam is shared.
  • Console + file sinks configured via ReadFrom.Configuration in appsettings.json — consistent with OtOpcUa and ScadaBridge's Serilog approach.

Wire metrics export via AddZbTelemetry:

  • Add builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; /* temporary — update to "ZB.MOM.WW.MxGateway" when the Meter-rename gap (Gap N1) is closed */ }). This registers the OTel SDK and connects GatewayMetrics's existing Meter to the Prometheus exporter. The 13 counters, 3 histograms, and 4 gauges begin exporting for the first time. GatewayMetrics.cs itself is unchanged — only the SDK layer is added around it.
  • Add app.MapZbMetrics() to expose /metrics.
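For orientation, the two bullets above correspond roughly to the plain OpenTelemetry wiring below. This is an assumption about what AddZbTelemetry and MapZbMetrics wrap, not their actual implementation. It uses the OpenTelemetry.Extensions.Hosting and OpenTelemetry.Exporter.Prometheus.AspNetCore packages; the SiteId and NodeRole options are left out because their mapping is not specified here.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

var builder = WebApplication.CreateBuilder(args);

// Register the OTel SDK and subscribe it to the existing Meter by name.
// GatewayMetrics itself is not touched; the SDK listens to "MxGateway.Server".
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("mxgateway"))
    .WithMetrics(metrics => metrics
        .AddMeter("MxGateway.Server") // temporary name, per the Meter-rename gap (Gap N1)
        .AddPrometheusExporter());

var app = builder.Build();

// Expose the Prometheus scrape endpoint (assumed equivalent of app.MapZbMetrics()).
app.MapPrometheusScrapingEndpoint();

app.Run();
```

The SDK subscribes by Meter name, so the instruments declared in GatewayMetrics.cs start exporting without any change to that file.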

Convert histogram unit ms → s:

  • Convert the three histograms to seconds: multiply recorded values by 0.001 at the call site and re-create the instruments with unit s, so that the declared unit and the recorded values agree. This is a breaking change to existing dashboards/alerts but required for OTel semconv compliance. Tagged as a convergence item in GAPS.md.
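A minimal sketch of the unit conversion, assuming the histograms are Histogram<double> and that call sites time operations with Stopwatch timestamps (neither detail is confirmed by this document). The DurationUnitSketch type and its members are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical helpers for moving duration histograms from ms to s.
public static class DurationUnitSketch
{
    // Same conversion as multiplying by 0.001.
    public static double ToSeconds(double milliseconds) => milliseconds / 1000.0;

    // Re-creates a duration histogram with the unit declared as seconds.
    public static Histogram<double> CreateSecondsHistogram(Meter meter, string name) =>
        meter.CreateHistogram<double>(name, unit: "s");

    // Records elapsed time since a Stopwatch.GetTimestamp() value, directly in seconds.
    public static void RecordElapsed(Histogram<double> histogram, long startTimestamp) =>
        histogram.Record(Stopwatch.GetElapsedTime(startTimestamp).TotalSeconds);
}
```

Recording TotalSeconds at the source avoids a separate multiply step and keeps the declared unit and the values in agreement.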

Keep bespoke:

  • GatewayMetrics.cs — all 20 instruments (mxgateway.* counters, histograms, gauges) stay per-project. AddZbTelemetry registers the Meter name; it does not own or replace the instruments.
  • Meter name "MxGateway.Server" — a follow-on rename to "ZB.MOM.WW.MxGateway" is tracked in GAPS.md but is not required for the initial adoption (it is a Prometheus label change that breaks existing dashboards).
  • GatewayApplication.cs:62 singleton registration — unchanged; GatewayMetrics remains a singleton; AddZbTelemetry simply hooks the OTel SDK to it.
  • The net48 x86 worker's IWorkerLogger (stderr key=value) — out of process and out of scope. No changes.