C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).
C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).
C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.
m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.
I4: §5 standard instrumentation table corrected — OtOpcUa now shows ⛔ not added for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.
I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).
I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.
I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.
m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
Observability — current state: MxAccessGateway
Repo: ~/Desktop/MxAccessGateway. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (x86);
solution src/MxGateway.sln. Telemetry code is concentrated in
src/ZB.MOM.WW.MxGateway.Server/Metrics/ (instruments) and
src/ZB.MOM.WW.MxGateway.Server/Diagnostics/ (logging correlation + redaction).
All paths relative to repo root. Verified 2026-06-01.
The most unusual observability posture in the family: 13 counters, 3 histograms, and 4 observable
gauges all fully hand-rolled using System.Diagnostics.Metrics directly — but never exported
(no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory
GetSnapshot(). Logging is Microsoft.Extensions.Logging exclusively (no Serilog), with a bespoke
correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of
scope — its IWorkerLogger (stderr key=value) is not addressed here.
1. Metrics (hand-rolled, unexported)
GatewayMetrics.cs
src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:
Meter name: "MxGateway.Server" (does not follow the project namespace ZB.MOM.WW.MxGateway).
All instruments are instance members of GatewayMetrics. The class is registered as a singleton
at GatewayApplication.cs:62. There is no OpenTelemetry.Extensions.Hosting,
no AddOpenTelemetry() call, and no exporter — the Meter is created with
new Meter("MxGateway.Server") and GetSnapshot() is the only read path.
Counters (13):
| Instrument name | Tracks |
|---|---|
| mxgateway.sessions.opened | New session requests |
| mxgateway.sessions.closed | Sessions torn down |
| mxgateway.commands.started | MXAccess command dispatched |
| mxgateway.commands.succeeded | Command completed OK |
| mxgateway.commands.failed | Command error |
| mxgateway.events.received | MXAccess events from worker |
| mxgateway.queues.overflows | Queue overflow (backpressure) |
| mxgateway.faults | Unhandled gateway faults |
| mxgateway.workers.killed | Worker process forcibly terminated |
| mxgateway.workers.exited | Worker process exited cleanly |
| mxgateway.heartbeats.failed | Worker heartbeat timeouts |
| mxgateway.grpc.streams.disconnected | gRPC event stream disconnects |
| mxgateway.retries.attempted | Retry attempts (any subsystem) |
Histograms (3) — unit ms (diverges from OTel semconv s):
| Instrument name | Tracks |
|---|---|
| mxgateway.workers.startup.duration | Time from worker spawn to ready |
| mxgateway.commands.duration | End-to-end MXAccess command latency |
| mxgateway.events.stream_send.duration | gRPC event stream send latency |
Observable gauges (4):
| Instrument name | Tracks |
|---|---|
| mxgateway.sessions.open | Currently open sessions (live count) |
| mxgateway.workers.running | Currently running worker processes |
| mxgateway.events.worker_queue.depth | Per-worker event queue depth |
| mxgateway.events.grpc_stream_queue.depth | Per-stream gRPC send queue depth |
All 20 instruments share the mxgateway.* prefix and <category>.<event> naming — consistent
with the family convention. Duration histograms record in milliseconds (ms); OTel semantic
conventions require seconds (s). This is the only project with ms histograms.
Singleton wiring
src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs:
- :62: services.AddSingleton<GatewayMetrics>() registers the metrics singleton.
There is no AddOpenTelemetry() call anywhere in the gateway. The GatewayMetrics Meter is
created independently of any OTel SDK — it participates in MeterListener / GetSnapshot() only.
Without the OTel SDK, this data is invisible to Prometheus, OTLP, or any backend.
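To make the "MeterListener only" point concrete, here is a minimal sketch of how a Meter created without any OTel SDK can still be read in-process. It is not the GatewayMetrics implementation (the body of GetSnapshot() is not shown in this document); the class name MeterListenerSketch and the dictionary-based aggregation are assumptions for illustration. Only the meter name and one counter name are taken from the tables above.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterListenerSketch
{
    // Creates a Meter the same way the gateway is described as doing,
    // then reads its measurements with a MeterListener (no OTel SDK involved).
    public static IReadOnlyDictionary<string, long> Run()
    {
        var totals = new Dictionary<string, long>();

        using var meter = new Meter("MxGateway.Server");
        Counter<long> opened = meter.CreateCounter<long>("mxgateway.sessions.opened");

        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            // Subscribe only to instruments that belong to the gateway meter.
            if (instrument.Meter.Name == "MxGateway.Server")
            {
                l.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            // Hypothetical aggregation: keep a running sum per instrument name.
            totals.TryGetValue(instrument.Name, out long current);
            totals[instrument.Name] = current + value;
        });
        listener.Start();

        opened.Add(1);
        opened.Add(2);
        return totals;
    }
}
```

Nothing in this path leaves the process: unless a listener (or the OTel SDK, which installs its own) subscribes, measurements recorded on the instruments are simply dropped.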
No tracing
No ActivitySource is defined. No spans are created. Tracing is entirely absent.
2. Logging (Microsoft.Extensions.Logging)
All logging in the gateway server uses Microsoft.Extensions.Logging (MEL) exclusively. There is
no Serilog dependency. Sink configuration lives in appsettings.json (Console, with structured
logging via the default host builder).
Correlation scope
src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs:
Defines the per-request/per-session correlation property bag.
src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs:
- :22–41: UseGatewayRequestLogging() middleware reads the following HTTP headers from each incoming request: x-session-id, x-worker-process-id, x-correlation-id, x-command-method, authorization (for redaction, not logging).
- Registered at GatewayApplication.cs:34.
src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs:
- :11–18: BeginGatewayScope(ILogger, GatewayLogScope) calls logger.BeginScope(scope), MEL's ILogger.BeginScope mechanism, which pushes properties as a scoped dictionary.
The correlation tuple (SessionId / WorkerProcessId / CorrelationId / CommandMethod) is
injected into log lines produced within the scope. No trace_id / span_id enrichment — there
is no ActivitySource, so this is consistent but leaves no path to trace correlation.
Log redaction — GatewayLogRedactor.cs
src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs:
- Masks sensitive data in log lines for three categories:
  - AuthenticateUser commands: the password argument is replaced.
  - WriteSecured commands: the value argument is replaced.
  - mxgw_ bearer tokens: the token body is masked, keeping only the key-id prefix.
- Redaction is applied before the log event is emitted — no sensitive data reaches the sink.
This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and ScadaBridge have no equivalent.
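The token-masking rule can be illustrated with a short sketch. The real token layout and the real GatewayLogRedactor API are not given in this document, so the shape mxgw_<keyId>_<secret>, the class name BearerTokenRedactionSketch, and the *** mask are all assumptions made for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class BearerTokenRedactionSketch
{
    // Hypothetical token shape: "mxgw_<keyId>_<secret>".
    // Group 1 captures the key-id prefix that is kept; the secret part is masked.
    private static readonly Regex TokenPattern =
        new Regex(@"\b(mxgw_[A-Za-z0-9]+)_[A-Za-z0-9\-]+", RegexOptions.Compiled);

    // Applied to the message before it is handed to the logger,
    // so the unmasked token never reaches a sink.
    public static string Redact(string message) =>
        TokenPattern.Replace(message, "${1}_***");
}
```

Under these assumptions, Redact("auth mxgw_k1_s3cr3t ok") yields "auth mxgw_k1_*** ok": the key id stays available for diagnostics while the secret is dropped.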
3. Signal summary
| Signal | Provider | Export | Resource / service.name |
|---|---|---|---|
| Metrics | System.Diagnostics.Metrics (Meter direct) | ⛔ none (GetSnapshot() only) | ⛔ none |
| Traces | — | ⛔ none | ⛔ none |
| Logs | MEL (Microsoft.Extensions.Logging) | Console via appsettings.json | ⛔ none |
| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource exists) |
4. Notable design choices
- GatewayMetrics singleton: all counter/gauge increments are lock-free atomic operations on the underlying Meter instruments; the singleton is intentional.
- ms histogram unit: workers.startup.duration, commands.duration, and events.stream_send.duration all record in milliseconds. This is non-standard (OTel semconv requires s) and means raw values differ from OtOpcUa's s histograms by a factor of 1000.
- MEL correlation via BeginScope: MEL scopes are supported by structured logging providers (e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The scope properties may not appear in all sink configurations, unlike Serilog's LogContext which is sink-agnostic.
- Redaction placement: GatewayLogRedactor sits between the caller and the log emission point, not inside a sink. This is the correct placement; the shared ILogRedactor seam preserves this.
Adoption plan → ZB.MOM.WW.Telemetry
This is the one in-pass adoption. The MxGateway MEL → Serilog migration is executed as part of
the ZB.MOM.WW.Telemetry library build, not deferred as a follow-on. The changes below land in
the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library build).
Migrate logging MEL → AddZbSerilog:
- Replace WebApplicationBuilder default logging with builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; }). Gains structured SiteId / NodeRole / NodeHostname enrichers on every log event, plus TraceContextEnricher (currently moot because there are no spans, but ready for when tracing is added).
- Re-express the GatewayLogScope / BeginGatewayScope / UseGatewayRequestLogging correlation mechanism as a Serilog LogContext.PushProperty scope. The middleware at GatewayRequestLoggingMiddlewareExtensions.cs:22–41 is refactored to push the same four properties (SessionId, WorkerProcessId, CorrelationId, CommandMethod) via Serilog's LogContext rather than MEL BeginScope. Behavior is identical; portability improves.
- Move GatewayLogRedactor behind the shared ILogRedactor seam. The redaction policy (which commands/tokens to scrub and how) stays per-project in a MxGatewayLogRedactor : ILogRedactor implementation; the seam is shared.
- Console + file sinks configured via ReadFrom.Configuration in appsettings.json, consistent with OtOpcUa and ScadaBridge's Serilog approach.
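The LogContext.PushProperty re-expression described in the list above might look roughly like the following. This is a sketch rather than the planned middleware: CaptureSink and LogContextScopeSketch are hypothetical names, only two of the four correlation properties are pushed for brevity, and an in-memory sink stands in for the real console/file sinks. It assumes the Serilog package is referenced.

```csharp
using System.Collections.Generic;
using Serilog;
using Serilog.Context;
using Serilog.Core;
using Serilog.Events;

// Hypothetical in-memory sink so the enriched event can be inspected.
public sealed class CaptureSink : ILogEventSink
{
    public List<LogEvent> Events { get; } = new List<LogEvent>();
    public void Emit(LogEvent logEvent) => Events.Add(logEvent);
}

public static class LogContextScopeSketch
{
    public static LogEvent LogWithinScope(string sessionId, string correlationId)
    {
        var sink = new CaptureSink();
        using var logger = new LoggerConfiguration()
            .Enrich.FromLogContext()   // required for pushed properties to be attached
            .WriteTo.Sink(sink)
            .CreateLogger();

        // Each PushProperty returns an IDisposable; disposing it pops the property.
        using (LogContext.PushProperty("SessionId", sessionId))
        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            logger.Information("Command dispatched");
        }

        return sink.Events[0];
    }
}
```

Because the properties are attached by the Enrich.FromLogContext() enricher before the event reaches any sink, every configured sink sees the same correlation fields.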
Wire metrics export via AddZbTelemetry:
- Add builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; /* temporary: update to "ZB.MOM.WW.MxGateway" when the Meter-rename gap (Gap N1) is closed */ }). This registers the OTel SDK and connects GatewayMetrics's existing Meter to the Prometheus exporter. The 13 counters, 3 histograms, and 4 gauges begin exporting for the first time. GatewayMetrics.cs itself is unchanged; only the SDK layer is added around it.
- Add app.MapZbMetrics() to expose /metrics.
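For orientation, the two steps above correspond roughly to the following plain OpenTelemetry wiring. This is an assumption about what AddZbTelemetry and MapZbMetrics wrap, not their actual implementation, which is not shown in this document. It assumes the OpenTelemetry.Extensions.Hosting and OpenTelemetry.Exporter.Prometheus.AspNetCore packages are referenced.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // Sets the service.name resource attribute.
    .ConfigureResource(resource => resource.AddService("mxgateway"))
    .WithMetrics(metrics => metrics
        // Subscribes the SDK to the existing Meter by name; instruments are untouched.
        .AddMeter("MxGateway.Server")
        .AddPrometheusExporter());

var app = builder.Build();

// Exposes the Prometheus scrape endpoint (defaults to /metrics).
app.MapPrometheusScrapingEndpoint();

app.Run();
```

The SDK finds the instruments through the Meter name alone, which is why GatewayMetrics.cs needs no code change for export to start.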
Convert histogram unit ms → s:
- Convert the three histograms' recorded values to seconds: multiply recorded values by 0.001 at the call site, or re-create the instruments with unit s. This is a breaking change to existing dashboards/alerts but required for OTel semconv compliance. Tagged as a convergence item in GAPS.md.
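A minimal sketch of the seconds-based variant, assuming the histogram is a Histogram<double> (the actual value type in GatewayMetrics.cs is not stated here). SecondsHistogramSketch is a hypothetical name, and the listener exists only so the recorded value can be observed.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class SecondsHistogramSketch
{
    // Records an elapsed time in seconds and returns the value a listener observes.
    public static double RecordAndObserve(TimeSpan elapsed)
    {
        using var meter = new Meter("MxGateway.Server");

        // Same instrument name as today, but declared with unit "s" instead of "ms".
        Histogram<double> duration =
            meter.CreateHistogram<double>("mxgateway.commands.duration", unit: "s");

        double observed = double.NaN;
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (ReferenceEquals(instrument, duration))
            {
                l.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<double>(
            (instrument, value, tags, state) => observed = value);
        listener.Start();

        // TotalSeconds replaces TotalMilliseconds at the call site (the x0.001 step).
        duration.Record(elapsed.TotalSeconds);
        return observed;
    }
}
```

With this change a 250 ms command is recorded as 0.25, so any existing dashboard threshold expressed in milliseconds has to be divided by 1000.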
Keep bespoke:
- GatewayMetrics.cs: all 20 instruments (mxgateway.* counters, histograms, gauges) stay per-project. AddZbTelemetry registers the Meter name; it does not own or replace the instruments.
- Meter name "MxGateway.Server": a follow-on rename to "ZB.MOM.WW.MxGateway" is tracked in GAPS.md but is not required for the initial adoption (it is a Prometheus label change that breaks existing dashboards).
- GatewayApplication.cs:62 singleton registration: unchanged; GatewayMetrics remains a singleton; AddZbTelemetry simply hooks the OTel SDK to it.
- The net48 x86 worker's IWorkerLogger (stderr key=value): out of process and out of scope. No changes.