mxaccessgw/docs/Metrics.md
Joseph Doherty dc9c0c950c rename: prefix gateway projects/namespaces with ZB.MOM.WW + sln→slnx
Apply the ZB.MOM.WW. prefix to all gateway-side projects, folders,
.csproj/.sln contents, C# namespaces, using directives, generated proto
C# (csharp_namespace + checked-in generated files), InternalsVisibleTo
attributes, project-name string literals (LoadProject, .sln lookups,
worker exe paths, staticwebassets manifest), and the install/script/doc
references that point at any of the above. Migrate the solution from
.sln to .slnx via `dotnet sln migrate` and delete the old file.

External-runtime identifiers are intentionally NOT prefixed so external
configuration keeps working:
- GatewayMetrics.cs MeterName ("MxGateway.Server")
- DashboardAuthenticationDefaults Scheme/Policy ("MxGateway.Dashboard")
- GatewayRequestLoggingMiddleware logger category ("MxGateway.Request")
- StaRuntime thread name ("MxGateway.Worker.STA")
- appsettings.json root section "MxGateway" + env-var prefix
  MxGateway__... and secret-name MxGateway:ApiKeyPepper
- C:\ProgramData\MxGateway\ data dir paths

Also fixes two tests that were not rename-related but became visible
while validating the rename:

- WorkerLiveMxAccessSmokeTests.ShutDownAsync: cancellation that the
  gateway service correctly maps to RpcException(Cancelled) per gRPC
  convention was being misclassified as a stream fault. Added a sibling
  catch on RpcException with StatusCode.Cancelled.

- IntegrationTestEnvironment.ResolveRepositoryRoot: extracted IsRepositoryRoot
  and made it accept either a .git marker OR a .sln/.slnx next to src/
  so the worker-exe walker works in non-git working copies.

clients/proto/proto-inputs.json's protoRoot updated to point at
src/ZB.MOM.WW.MxGateway.Contracts/Protos.

Verified by `dotnet build` and a full `dotnet test` of the .slnx with
MXGATEWAY_RUN_LIVE_{MXACCESS,LDAP,GALAXY}_TESTS=1:
  Tests: 472/472 pass
  Worker.Tests: 280/280 pass (4 dev-rig [Fact(Skip=...)] skipped)
  IntegrationTests: 18/18 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 16:22:23 -04:00


Gateway Metrics

The metrics subsystem exposes counters, histograms, and observable gauges that describe gateway throughput, queue health, and worker lifecycle. Both the System.Diagnostics.Metrics pipeline and the in-memory GatewayMetricsSnapshot read the same underlying state, so external collectors and the dashboard report consistent numbers.

Overview

GatewayMetrics is a singleton (registered in GatewayApplication.cs) that owns a single Meter named MxGateway.Server and a set of synchronised counters, histograms, and observable gauges. The meter name is deliberately left without the ZB.MOM.WW. project prefix so that existing external collector configuration keeps working. Subsystems call typed mutator methods (SessionOpened, CommandFailed, EventReceived, etc.) rather than touching the Meter directly, which keeps the OpenTelemetry instrument names and tag conventions in one place. A lock on _syncRoot guards the scalar fields read by GetSnapshot, while per-event maps use ConcurrentDictionary<string, long> so the hot event path avoids the lock.

Meter and OpenTelemetry Compatibility

The meter name is exposed as a constant so that hosting code can register it with an OpenTelemetry exporter:

public sealed class GatewayMetrics : IDisposable
{
    public const string MeterName = "MxGateway.Server";

    public GatewayMetrics()
    {
        _meter = new Meter(MeterName, typeof(GatewayMetrics).Assembly.GetName().Version?.ToString());
        _sessionsOpenedCounter = _meter.CreateCounter<long>("mxgateway.sessions.opened");
        ...
    }
}

The meter version is the gateway assembly version, which gives exporters a stable identifier per build. All instrument names use the dotted mxgateway.<area>.<event> convention so they group cleanly under a single namespace in tools such as Prometheus, OTLP collectors, or dotnet-counters.
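As a rough illustration of how a consumer can subscribe to a meter by name without any exporter package, the sketch below uses the built-in MeterListener from System.Diagnostics.Metrics. The meter name Example.Meter and the MeterListenerSketch type are hypothetical and exist only for this example; a real host would pass GatewayMetrics.MeterName instead.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterListenerSketch
{
    // Subscribes to every instrument published by the named meter, runs
    // the supplied action, and returns the summed long measurements per
    // instrument name. Single-threaded by design: the dictionary is not
    // protected against concurrent recordings.
    public static Dictionary<string, long> Collect(string meterName, Action emit)
    {
        var totals = new Dictionary<string, long>();
        using var listener = new MeterListener();

        // Called for each existing instrument on Start() and for each
        // instrument created afterwards.
        listener.InstrumentPublished = (instrument, source) =>
        {
            if (instrument.Meter.Name == meterName)
            {
                source.EnableMeasurementEvents(instrument);
            }
        };

        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            totals.TryGetValue(instrument.Name, out long current);
            totals[instrument.Name] = current + value;
        });

        listener.Start();
        emit();
        return totals;
    }

    // Hypothetical demo meter; the instrument name follows the
    // mxgateway.<area>.<event> convention described above.
    public static Dictionary<string, long> Demo()
    {
        using var meter = new Meter("Example.Meter", "1.0.0");
        Counter<long> opened = meter.CreateCounter<long>("mxgateway.sessions.opened");
        return Collect("Example.Meter", () =>
        {
            opened.Add(1);
            opened.Add(1);
        });
    }
}
```

Calling MeterListenerSketch.Demo() yields a dictionary whose mxgateway.sessions.opened entry holds the sum of the two increments. Filtering on the meter name is what lets one listener ignore instruments from unrelated libraries in the same process.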

Instrument Inventory

Counters

All counters are Counter<long>. Tag values come from the call sites listed under Recording Sites.

| Instrument | Tags | What it measures |
| --- | --- | --- |
| mxgateway.sessions.opened | none | Successful SessionManager.OpenSession completions. |
| mxgateway.sessions.closed | none | Sessions closed cleanly via SessionManager. |
| mxgateway.commands.started | method | Command dispatches initiated by WorkerClient. |
| mxgateway.commands.succeeded | method | Commands acknowledged with success by the worker. |
| mxgateway.commands.failed | method, category | Command failures, where category is the WorkerClientErrorCode or exception type name. |
| mxgateway.events.received | family | Worker events accepted into the event pipeline. |
| mxgateway.queues.overflows | queue | Drops when a bounded queue rejects a message (e.g. grpc-event-stream). |
| mxgateway.faults | category | Faults reported by session, event, or worker code paths. The category is a SessionManagerErrorCode or WorkerClientErrorCode name. |
| mxgateway.workers.killed | reason | Forced terminations of worker processes. |
| mxgateway.workers.exited | reason | Clean or fault-driven worker exits. |
| mxgateway.heartbeats.failed | session_id | Worker heartbeat misses tracked per session. |
| mxgateway.grpc.streams.disconnected | reason | Detachments of the dashboard or client gRPC event stream. |
| mxgateway.retries.attempted | area | Resilience retries executed by gateway components. |

Histograms

Histograms record durations in milliseconds (the unit argument on CreateHistogram):

_workerStartupLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.workers.startup.duration", "ms");
_commandLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.commands.duration", "ms");
_eventStreamSendLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.events.stream_send.duration", "ms");

| Instrument | Tags | What it measures |
| --- | --- | --- |
| mxgateway.workers.startup.duration | none | Time from WorkerClient launch to worker-ready. |
| mxgateway.commands.duration | method, optional category | Command round-trip time. The category tag is added on failure so success and failure latencies stay distinguishable. |
| mxgateway.events.stream_send.duration | family | Time spent writing each public event to the gRPC response stream in MxAccessGatewayService.StreamEvents. |
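
The optional category tag on mxgateway.commands.duration can be sketched as follows. This is a minimal illustration under stated assumptions, not the gateway's implementation: the CommandLatencySketch and CommandLatencyDemo types, the Example.CommandLatency meter name, and the Read, Write, and Timeout values are all hypothetical. Only the instrument name, the ms unit, and the CommandSucceeded/CommandFailed parameter shapes come from this document.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public sealed class CommandLatencySketch : IDisposable
{
    // Hypothetical meter name used only by this sketch.
    public const string SketchMeterName = "Example.CommandLatency";

    private readonly Meter _meter = new Meter(SketchMeterName);
    private readonly Histogram<double> _commandLatency;

    public CommandLatencySketch()
    {
        _commandLatency = _meter.CreateHistogram<double>("mxgateway.commands.duration", "ms");
    }

    // Success path: only the method tag is attached.
    public void CommandSucceeded(string method, TimeSpan duration)
    {
        _commandLatency.Record(
            duration.TotalMilliseconds,
            new KeyValuePair<string, object?>("method", method));
    }

    // Failure path: the extra category tag keeps failure latencies
    // separable from success latencies in the same histogram.
    public void CommandFailed(string method, string category, TimeSpan duration)
    {
        _commandLatency.Record(
            duration.TotalMilliseconds,
            new KeyValuePair<string, object?>("method", method),
            new KeyValuePair<string, object?>("category", category));
    }

    public void Dispose() => _meter.Dispose();
}

public static class CommandLatencyDemo
{
    // Returns the number of tags seen on each recorded measurement.
    public static List<int> Run()
    {
        var tagCounts = new List<int>();
        using var metrics = new CommandLatencySketch();
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, source) =>
        {
            if (instrument.Meter.Name == CommandLatencySketch.SketchMeterName)
            {
                source.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<double>(
            (instrument, value, tags, state) => tagCounts.Add(tags.Length));
        listener.Start();

        Stopwatch stopwatch = Stopwatch.StartNew();
        metrics.CommandSucceeded("Read", stopwatch.Elapsed);
        metrics.CommandFailed("Write", "Timeout", stopwatch.Elapsed);
        return tagCounts;
    }
}
```

In this sketch the success measurement carries one tag and the failure measurement carries two, which is the property that lets a backend split the same histogram by outcome.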

Observable gauges

Observable gauges are pull-based; the Meter invokes the supplied callback whenever a listener samples it. Each callback re-acquires _syncRoot so the gauge value matches the snapshot taken at the same instant.

| Instrument | Source field | Description |
| --- | --- | --- |
| mxgateway.sessions.open | _openSessions | Currently open sessions tracked by SessionManager. |
| mxgateway.workers.running | _workersRunning | Worker clients in a running state. |
| mxgateway.events.worker_queue.depth | _workerEventQueueDepth | Last reported depth of the worker-side event queue. |
| mxgateway.events.grpc_stream_queue.depth | _grpcEventStreamQueueDepth | Backlog held by EventStreamService for the active gRPC stream consumer. |
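
The pull-based callback pattern described above might look roughly like the sketch below. It assumes a single gauge and a simplified pair of mutators; the GaugeSketch and GaugeDemo types and the Example.Gauges meter name are hypothetical, and the real GatewayMetrics class may organise its callbacks differently.

```csharp
#nullable enable
using System;
using System.Diagnostics.Metrics;

public sealed class GaugeSketch : IDisposable
{
    // Hypothetical meter name used only by this sketch.
    public const string SketchMeterName = "Example.Gauges";

    private readonly object _syncRoot = new object();
    private readonly Meter _meter = new Meter(SketchMeterName);
    private int _openSessions;

    public GaugeSketch()
    {
        // The callback runs only when a listener samples the gauge.
        // It takes the same lock as the mutators, so it never reads a
        // value that is in the middle of being updated.
        _meter.CreateObservableGauge<int>("mxgateway.sessions.open", () =>
        {
            lock (_syncRoot)
            {
                return _openSessions;
            }
        });
    }

    public void SessionOpened()
    {
        lock (_syncRoot) { _openSessions++; }
    }

    public void SessionRemoved()
    {
        lock (_syncRoot) { _openSessions--; }
    }

    public void Dispose() => _meter.Dispose();
}

public static class GaugeDemo
{
    // Samples the gauge once through a MeterListener and returns the value.
    public static int Sample(GaugeSketch metrics)
    {
        int observed = -1;
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, source) =>
        {
            if (instrument.Meter.Name == GaugeSketch.SketchMeterName)
            {
                source.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<int>(
            (instrument, value, tags, state) => observed = value);
        listener.Start();

        // Triggers every enabled observable instrument's callback.
        listener.RecordObservableInstruments();
        return observed;
    }
}
```

Nothing is pushed when SessionOpened or SessionRemoved runs; the value only leaves the class when RecordObservableInstruments (or an exporter's collection cycle) asks for it.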

Snapshot Shape

GatewayMetricsSnapshot is an immutable view of the same state. GatewayMetrics.GetSnapshot() builds it while holding _syncRoot and copies the dictionaries, so the caller can iterate without further synchronisation. The dashboard service is the primary consumer.

public sealed record GatewayMetricsSnapshot(
    int OpenSessions,
    int WorkersRunning,
    int WorkerEventQueueDepth,
    int GrpcEventStreamQueueDepth,
    long SessionsOpened,
    long SessionsClosed,
    long CommandsStarted,
    long CommandsSucceeded,
    long CommandsFailed,
    long EventsReceived,
    long QueueOverflows,
    long Faults,
    long WorkerKills,
    long WorkerExits,
    long HeartbeatFailures,
    long StreamDisconnects,
    long RetryAttempts,
    IReadOnlyDictionary<string, long> CommandFailuresByMethod,
    IReadOnlyDictionary<string, long> EventsByFamily,
    IReadOnlyDictionary<string, long> EventsBySession,
    IReadOnlyDictionary<string, long> RetryAttemptsByArea);

The scalar fields mirror the counters and gauges. The four dictionaries provide the breakdowns that counter tags would otherwise require an exporter to aggregate:

  • CommandFailuresByMethod keys by gRPC method name.
  • EventsByFamily keys by event family (the Family enum on a worker event).
  • EventsBySession keys by sessionId; entries are removed via RemoveSessionEvents when a session closes so the map does not grow without bound.
  • RetryAttemptsByArea keys by the resilience area tag, e.g. worker_startup.

EventsReceived is read with Interlocked.Read(ref _eventsReceived) because EventReceived increments it via Interlocked.Increment outside the lock to keep the event-ingestion path non-blocking.
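
The mix of locked scalars, an interlocked event counter, and concurrent per-session maps can be sketched in a reduced form. The types below (SnapshotSketch, SnapshotStateSketch) are hypothetical and carry only three of the snapshot's fields; how the real class updates _commandsFailed is assumed here to be a plain increment under the lock.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Reduced, hypothetical stand-in for GatewayMetricsSnapshot.
public sealed record SnapshotSketch(
    long CommandsFailed,
    long EventsReceived,
    IReadOnlyDictionary<string, long> EventsBySession);

public sealed class SnapshotStateSketch
{
    private readonly object _syncRoot = new object();
    private readonly ConcurrentDictionary<string, long> _eventsBySession =
        new ConcurrentDictionary<string, long>();
    private long _commandsFailed;
    private long _eventsReceived;

    // Scalar guarded by the lock (assumption for this sketch).
    public void CommandFailed()
    {
        lock (_syncRoot) { _commandsFailed++; }
    }

    // Hot path: no lock, only an interlocked increment and a
    // concurrent-dictionary update.
    public void EventReceived(string sessionId)
    {
        Interlocked.Increment(ref _eventsReceived);
        _eventsBySession.AddOrUpdate(sessionId, 1L, (_, count) => count + 1);
    }

    // Called when a session closes so the map does not grow without bound.
    public void RemoveSessionEvents(string sessionId)
    {
        _eventsBySession.TryRemove(sessionId, out _);
    }

    public SnapshotSketch GetSnapshot()
    {
        lock (_syncRoot)
        {
            return new SnapshotSketch(
                _commandsFailed,
                // Written outside the lock, so read it atomically.
                Interlocked.Read(ref _eventsReceived),
                // Copy so the caller iterates a detached dictionary.
                new Dictionary<string, long>(_eventsBySession));
        }
    }
}
```

One consequence of this design, at least in the sketch, is that EventsReceived and EventsBySession can be momentarily out of step with each other during a burst of events, since neither update happens under the lock.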

Recording Sites

This section maps each instrument to the code paths that write to it, which makes it easier to trace an unexpected counter reading back to a subsystem.

Session manager

Sessions/SessionManager.cs emits session lifecycle and fault counters:

_metrics.SessionOpened();
...
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
...
_metrics.SessionClosed();
...
_metrics.SessionRemoved();
...
_metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString());
...
_metrics.RemoveSessionEvents(session.SessionId);

SessionRemoved decrements the open-session gauge without incrementing the closed counter, which covers cases where a session is evicted rather than closed by the client.

Worker client

Workers/WorkerClient.cs records command throughput, worker lifecycle, heartbeat failures, and the worker-side event queue depth:

  • CommandStarted(method) and CommandSucceeded(method, duration) / CommandFailed(method, category, duration) around the worker request/response pair.
  • WorkerStarted(startupDuration) once the worker reports ready.
  • RecordWorkerStoppedOnce calls WorkerStopped(reason) exactly once per worker, guarding against double-counting on simultaneous fault and exit signals.
  • WorkerKilled(reason) when the client forcibly terminates the worker.
  • HeartbeatFailed(SessionId) per missed heartbeat.
  • SetWorkerEventQueueDepth(queueDepth) after each event ingest.
  • EventReceived(SessionId, workerEvent.Event.Family.ToString()) for each worker event.
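
One common way to get the exactly-once behaviour described for RecordWorkerStoppedOnce is an Interlocked.Exchange flag. The sketch below shows that technique. It is an assumption about how such a guard could be written, not a copy of WorkerClient. The WorkerStopGuardSketch and WorkerStopGuardDemo types and the Faulted and Exited reason strings are hypothetical.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class WorkerStopGuardSketch
{
    private readonly Action<string> _workerStopped;
    private int _stopRecorded; // 0 = not yet recorded, 1 = recorded

    public WorkerStopGuardSketch(Action<string> workerStopped)
    {
        _workerStopped = workerStopped;
    }

    // Returns true only for the single caller that wins the race.
    public bool RecordWorkerStoppedOnce(string reason)
    {
        // Exchange returns the previous value atomically, so only the
        // first caller observes 0 and goes on to record the metric.
        if (Interlocked.Exchange(ref _stopRecorded, 1) != 0)
        {
            return false;
        }

        _workerStopped(reason);
        return true;
    }
}

public static class WorkerStopGuardDemo
{
    // Fires simulated fault and exit signals in parallel and returns how
    // many times the stop metric was actually recorded.
    public static int Run()
    {
        int recorded = 0;
        var guard = new WorkerStopGuardSketch(_ => Interlocked.Increment(ref recorded));
        Parallel.For(0, 8, i =>
            guard.RecordWorkerStoppedOnce(i % 2 == 0 ? "Faulted" : "Exited"));
        return recorded;
    }
}
```

Whichever signal arrives first determines the reason tag; later signals are dropped rather than merged, which is the trade-off of a first-wins guard.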

Worker process launcher

Workers/WorkerProcessLauncher.cs records process-level kills and startup retries:

_metrics.WorkerKilled(reason);
...
_metrics.RetryAttempted("worker_startup");

The worker_startup tag is hard-coded so the RetryAttemptsByArea snapshot reports launcher retries distinctly from other resilience areas.

Session worker client factory

Sessions/SessionWorkerClientFactory.cs records the worker kill that follows a failed OpenSession handshake:

_metrics.WorkerKilled("OpenSessionFailed");

This is the only fault path where the factory itself owns the kill decision; once the worker is bound to a session, the WorkerClient becomes responsible for lifecycle metrics.

gRPC event stream service

Grpc/EventStreamService.cs records the dashboard/client event-stream backlog and disconnect counters:

metrics.AdjustGrpcEventStreamQueueDepth(1);
...
metrics.AdjustGrpcEventStreamQueueDepth(-1);
...
metrics.AdjustGrpcEventStreamQueueDepth(-remainingDepth);
metrics.StreamDisconnected("Detached");
...
metrics.QueueOverflow("grpc-event-stream");
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
...
metrics.Fault(WorkerClientErrorCode.WorkerFaulted.ToString());

The service adjusts the depth by +1 on each enqueue and -1 on each dequeue, and subtracts whatever remains when a stream detaches, so AdjustGrpcEventStreamQueueDepth keeps the aggregate stream backlog current. The Math.Max(0, ...) clamp inside the adjuster prevents a negative depth if the bookkeeping ever drifts.
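
A minimal sketch of a clamped adjuster is shown below. The QueueDepthSketch type and its Depth property are hypothetical; only the method name, the field name, and the Math.Max(0, ...) clamp come from this document, and the use of _syncRoot here is assumed from the gauge description.

```csharp
#nullable enable
using System;

public sealed class QueueDepthSketch
{
    private readonly object _syncRoot = new object();
    private int _grpcEventStreamQueueDepth;

    // Applies a signed delta and never lets the depth fall below zero,
    // even if a dequeue or detach is reported more often than it should be.
    public void AdjustGrpcEventStreamQueueDepth(int delta)
    {
        lock (_syncRoot)
        {
            _grpcEventStreamQueueDepth = Math.Max(0, _grpcEventStreamQueueDepth + delta);
        }
    }

    // Hypothetical read accessor for the sketch; the real value is
    // surfaced through the observable gauge and the snapshot.
    public int Depth
    {
        get
        {
            lock (_syncRoot)
            {
                return _grpcEventStreamQueueDepth;
            }
        }
    }
}
```

With two enqueues and one dequeue the depth is 1; an over-large negative adjustment after that leaves it at 0 rather than going negative. The clamp hides bookkeeping drift from the gauge, so it does not replace fixing the underlying mismatch.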

Grpc/MxAccessGatewayService.cs records gRPC event send latency around each response-stream write:

Stopwatch stopwatch = Stopwatch.StartNew();
await responseStream.WriteAsync(publicEvent).ConfigureAwait(false);
metrics.RecordEventStreamSend(publicEvent.Family.ToString(), stopwatch.Elapsed);

Dashboard Consumption

Dashboard/DashboardSnapshotService.cs calls _metrics.GetSnapshot() once per GetSnapshot invocation and projects it into the dashboard transport types together with the session registry view. The dashboard receives a single, internally consistent snapshot per tick rather than reading individual counters at separate times. See Gateway Dashboard Design and Dashboard Interface Design for the projection rules and wire format.