Gateway Metrics
The metrics subsystem exposes counters, histograms, and observable gauges that describe gateway throughput, queue health, and worker lifecycle. Both the System.Diagnostics.Metrics pipeline and the in-memory GatewayMetricsSnapshot consume the same underlying state, so external collectors and the dashboard see consistent numbers.
Overview
GatewayMetrics is a singleton (registered in GatewayApplication.cs) that owns a single Meter named MxGateway.Server and a set of synchronised counters, histograms, and observable gauges. Subsystems call typed mutator methods (SessionOpened, CommandFailed, EventReceived, etc.) rather than touching the Meter directly, which keeps the OpenTelemetry instrument names and tag conventions in one place. A lock on _syncRoot guards the scalar fields read by GetSnapshot, while the per-key breakdown maps use ConcurrentDictionary<string, long> so the hot event path avoids the lock.
Meter and OpenTelemetry Compatibility
The meter name is exposed as a constant so that hosting code can register it with an OpenTelemetry exporter:
public sealed class GatewayMetrics : IDisposable
{
    public const string MeterName = "MxGateway.Server";

    public GatewayMetrics()
    {
        _meter = new Meter(MeterName, typeof(GatewayMetrics).Assembly.GetName().Version?.ToString());
        _sessionsOpenedCounter = _meter.CreateCounter<long>("mxgateway.sessions.opened");
        ...
    }
}
The meter version is the gateway assembly version, which gives exporters a stable identifier per build. All instrument names use the dotted mxgateway.<area>.<event> convention so they group cleanly under a single namespace in tools such as Prometheus, OTLP collectors, or dotnet-counters.
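As a rough illustration of how a consumer selects the gateway's instruments by meter name, the sketch below uses the in-box MeterListener rather than an OpenTelemetry exporter. The stand-in Meter, its "1.0.0" version string, and the observedTotal variable are assumptions for the example; in the gateway the Meter is owned by GatewayMetrics.

```csharp
using System;
using System.Diagnostics.Metrics;

// Stand-in for the gateway's meter (assumption: the real one lives in GatewayMetrics).
using var meter = new Meter("MxGateway.Server", "1.0.0");
Counter<long> sessionsOpened = meter.CreateCounter<long>("mxgateway.sessions.opened");

long observedTotal = 0;

using var listener = new MeterListener();

// Subscribe only to instruments published by the gateway meter.
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "MxGateway.Server")
    {
        l.EnableMeasurementEvents(instrument);
    }
};

// Called synchronously for each long-valued measurement on an enabled instrument.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    observedTotal += value;
    Console.WriteLine($"{instrument.Name} += {value}");
});
listener.Start();

sessionsOpened.Add(1);
sessionsOpened.Add(1);
Console.WriteLine($"total = {observedTotal}");
```

An OpenTelemetry host would typically pass the same meter name to its metrics builder instead of creating a listener by hand.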
Instrument Inventory
Counters
All counters are Counter<long>. Tag values come from the call sites listed under Recording Sites.
| Instrument | Tags | What it measures |
|---|---|---|
| mxgateway.sessions.opened | none | Successful SessionManager.OpenSession completions. |
| mxgateway.sessions.closed | none | Sessions closed cleanly via SessionManager. |
| mxgateway.commands.started | method | Command dispatches initiated by WorkerClient. |
| mxgateway.commands.succeeded | method | Commands acknowledged with success by the worker. |
| mxgateway.commands.failed | method, category | Command failures, where category is the WorkerClientErrorCode or exception type name. |
| mxgateway.events.received | family | Worker events accepted into the event pipeline. |
| mxgateway.queues.overflows | queue | Drops when a bounded queue rejects a message (e.g. grpc-event-stream). |
| mxgateway.faults | category | Faults reported by session, event, or worker code paths. The category is a SessionManagerErrorCode or WorkerClientErrorCode name. |
| mxgateway.workers.killed | reason | Forced terminations of worker processes. |
| mxgateway.workers.exited | reason | Clean or fault-driven worker exits. |
| mxgateway.heartbeats.failed | session_id | Worker heartbeat misses tracked per session. |
| mxgateway.grpc.streams.disconnected | reason | Detachments of the dashboard or client gRPC event stream. |
| mxgateway.retries.attempted | area | Resilience retries executed by gateway components. |
Histograms
Histograms record durations in milliseconds (the unit argument on CreateHistogram):
_workerStartupLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.workers.startup.duration", "ms");
_commandLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.commands.duration", "ms");
_eventStreamSendLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.events.stream_send.duration", "ms");
| Instrument | Tags | What it measures |
|---|---|---|
| mxgateway.workers.startup.duration | none | Time from WorkerClient launch to worker-ready. |
| mxgateway.commands.duration | method, optional category | Command round-trip time. The category tag is added on failure so success and failure latencies stay distinguishable. |
| mxgateway.events.stream_send.duration | family | Time spent writing each public event to the gRPC response stream in MxAccessGatewayService.StreamEvents. |
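The optional category tag on mxgateway.commands.duration could be attached along the following lines. This is a minimal sketch, not the gateway's code: the CommandLatencyRecorder class, its Record method, and the Example.CommandLatency meter name are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical helper; the gateway records through GatewayMetrics instead.
public sealed class CommandLatencyRecorder : IDisposable
{
    private readonly Meter _meter = new("Example.CommandLatency");
    private readonly Histogram<double> _histogram;

    public CommandLatencyRecorder()
    {
        // Same instrument name and unit as the histogram described above.
        _histogram = _meter.CreateHistogram<double>("mxgateway.commands.duration", "ms");
    }

    // Pass a category only for failures so the two latency series stay separable.
    public void Record(string method, TimeSpan duration, string? category = null)
    {
        var tags = new TagList();
        tags.Add("method", method);
        if (category is not null)
        {
            tags.Add("category", category);
        }

        _histogram.Record(duration.TotalMilliseconds, tags);
    }

    public void Dispose() => _meter.Dispose();
}
```

A success would be recorded as Record("Read", elapsed) and a failure as Record("Read", elapsed, "Timeout"), where both the method and category values are placeholders.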
Observable gauges
Observable gauges are pull-based; the Meter invokes the supplied callback whenever a listener samples it. Each callback re-acquires _syncRoot so the gauge value matches the snapshot taken at the same instant.
| Instrument | Source field | Description |
|---|---|---|
| mxgateway.sessions.open | _openSessions | Currently open sessions tracked by SessionManager. |
| mxgateway.workers.running | _workersRunning | Worker clients in a running state. |
| mxgateway.events.worker_queue.depth | _workerEventQueueDepth | Last reported depth of the worker-side event queue. |
| mxgateway.events.grpc_stream_queue.depth | _grpcEventStreamQueueDepth | Backlog held by EventStreamService for the active gRPC stream consumer. |
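A pull-based gauge that reads its backing field under the lock might look like the sketch below. The OpenSessionGauge class, its method bodies, and the Example.Gauges meter name are assumptions made for illustration; only the instrument name and the _syncRoot and _openSessions field names come from this document.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical, reduced version of the gauge wiring described above.
public sealed class OpenSessionGauge : IDisposable
{
    private readonly object _syncRoot = new();
    private readonly Meter _meter = new("Example.Gauges");
    private int _openSessions;

    public OpenSessionGauge()
    {
        // The callback runs only when a listener samples the instrument.
        _meter.CreateObservableGauge<int>("mxgateway.sessions.open", () => ObserveOpenSessions());
    }

    public void SessionOpened()
    {
        lock (_syncRoot) { _openSessions++; }
    }

    public void SessionRemoved()
    {
        lock (_syncRoot) { _openSessions--; }
    }

    // Takes the same lock as the mutators so a sample never sees a torn update.
    public int ObserveOpenSessions()
    {
        lock (_syncRoot) { return _openSessions; }
    }

    public void Dispose() => _meter.Dispose();
}
```

With a MeterListener attached, calling RecordObservableInstruments() on the listener is what triggers the callback.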
Snapshot Shape
GatewayMetricsSnapshot is the immutable view of the same state, returned by GatewayMetrics.GetSnapshot() while holding _syncRoot. The dictionaries are copied so the caller can iterate without further synchronisation. The dashboard service is the primary consumer.
public sealed record GatewayMetricsSnapshot(
int OpenSessions,
int WorkersRunning,
int WorkerEventQueueDepth,
int GrpcEventStreamQueueDepth,
long SessionsOpened,
long SessionsClosed,
long CommandsStarted,
long CommandsSucceeded,
long CommandsFailed,
long EventsReceived,
long QueueOverflows,
long Faults,
long WorkerKills,
long WorkerExits,
long HeartbeatFailures,
long StreamDisconnects,
long RetryAttempts,
IReadOnlyDictionary<string, long> CommandFailuresByMethod,
IReadOnlyDictionary<string, long> EventsByFamily,
IReadOnlyDictionary<string, long> EventsBySession,
IReadOnlyDictionary<string, long> RetryAttemptsByArea);
The scalar fields mirror the counters and gauges. The four dictionaries provide the breakdowns that counter tags would otherwise require an exporter to aggregate:
- CommandFailuresByMethod keys by gRPC method name.
- EventsByFamily keys by event family (the Family enum on a worker event).
- EventsBySession keys by sessionId; entries are removed via RemoveSessionEvents when a session closes so the map does not grow without bound.
- RetryAttemptsByArea keys by the resilience area tag, e.g. worker_startup.
EventsReceived is read with Interlocked.Read(ref _eventsReceived) because EventReceived increments it via Interlocked.Increment outside the lock to keep the event-ingestion path non-blocking.
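The lock-free event path and the copied dictionary can be sketched as follows. The EventCounters class and EventSnapshot record are hypothetical and far smaller than GatewayMetrics and GatewayMetricsSnapshot; the sketch assumes the breakdown map is updated with AddOrUpdate, which the source does not confirm.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical two-field snapshot, standing in for GatewayMetricsSnapshot.
public sealed record EventSnapshot(
    long EventsReceived,
    IReadOnlyDictionary<string, long> EventsByFamily);

public sealed class EventCounters
{
    private readonly ConcurrentDictionary<string, long> _eventsByFamily = new();
    private long _eventsReceived;

    // Hot path: an atomic increment plus a concurrent map update, no lock taken.
    public void EventReceived(string family)
    {
        Interlocked.Increment(ref _eventsReceived);
        _eventsByFamily.AddOrUpdate(family, 1, static (_, count) => count + 1);
    }

    // Interlocked.Read gives an atomic 64-bit read even on 32-bit platforms.
    // ToArray() captures the map at a single point, and the copy lets callers
    // iterate without further synchronisation.
    public EventSnapshot GetSnapshot() =>
        new(
            Interlocked.Read(ref _eventsReceived),
            new Dictionary<string, long>(_eventsByFamily.ToArray()));
}
```

Because the copy is detached from the live map, events that arrive after GetSnapshot returns do not change a snapshot the caller already holds.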
Recording Sites
The recording call sites describe the code paths that write into each instrument. This mapping makes it easier to trace an unexpected counter reading back to a subsystem.
Session manager
Sessions/SessionManager.cs emits session lifecycle and fault counters:
_metrics.SessionOpened();
...
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
...
_metrics.SessionClosed();
...
_metrics.SessionRemoved();
...
_metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString());
...
_metrics.RemoveSessionEvents(session.SessionId);
SessionRemoved decrements the open-session gauge without incrementing the closed counter, which covers cases where a session is evicted rather than closed by the client.
Worker client
Workers/WorkerClient.cs records command throughput, worker lifecycle, heartbeat failures, and the worker-side event queue depth:
- CommandStarted(method) and CommandSucceeded(method, duration) / CommandFailed(method, category, duration) around the worker request/response pair.
- WorkerStarted(startupDuration) once the worker reports ready.
- RecordWorkerStoppedOnce calls WorkerStopped(reason) exactly once per worker, guarding against double-counting on simultaneous fault and exit signals.
- WorkerKilled(reason) when the client forcibly terminates the worker.
- HeartbeatFailed(SessionId) per missed heartbeat.
- SetWorkerEventQueueDepth(queueDepth) after each event ingest.
- EventReceived(SessionId, workerEvent.Event.Family.ToString()) for each worker event.
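One common way to get the exactly-once behaviour described for RecordWorkerStoppedOnce is an atomic flag. The sketch below assumes that approach; the source does not show how WorkerClient implements the guard, and the WorkerStopRecorder class and its delegate parameter are hypothetical.

```csharp
using System;
using System.Threading;

// Hypothetical once-guard; WorkerClient's actual implementation may differ.
public sealed class WorkerStopRecorder
{
    private readonly Action<string> _workerStopped;
    private int _stopRecorded; // 0 = not yet recorded, 1 = recorded

    public WorkerStopRecorder(Action<string> workerStopped)
    {
        _workerStopped = workerStopped;
    }

    // Returns true only for the caller that wins the race.
    public bool RecordWorkerStoppedOnce(string reason)
    {
        // Exchange atomically sets the flag and returns the previous value,
        // so concurrent fault and exit signals cannot both observe 0.
        if (Interlocked.Exchange(ref _stopRecorded, 1) != 0)
        {
            return false;
        }

        _workerStopped(reason);
        return true;
    }
}
```

With this shape, whichever signal arrives first supplies the reason tag and later signals are ignored.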
Worker process launcher
Workers/WorkerProcessLauncher.cs records process-level kills and startup retries:
_metrics.WorkerKilled(reason);
...
_metrics.RetryAttempted("worker_startup");
The worker_startup tag is hard-coded so the RetryAttemptsByArea snapshot reports launcher retries distinctly from other resilience areas.
Session worker client factory
Sessions/SessionWorkerClientFactory.cs records the worker kill that follows a failed OpenSession handshake:
_metrics.WorkerKilled("OpenSessionFailed");
This is the only fault path where the factory itself owns the kill decision; once the worker is bound to a session, the WorkerClient becomes responsible for lifecycle metrics.
gRPC event stream service
Grpc/EventStreamService.cs records the dashboard/client event-stream backlog and disconnect counters:
metrics.AdjustGrpcEventStreamQueueDepth(1);
...
metrics.AdjustGrpcEventStreamQueueDepth(-1);
...
metrics.AdjustGrpcEventStreamQueueDepth(-remainingDepth);
metrics.StreamDisconnected("Detached");
...
metrics.QueueOverflow("grpc-event-stream");
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
...
metrics.Fault(WorkerClientErrorCode.WorkerFaulted.ToString());
The service tracks per-message enqueues and dequeues, so each AdjustGrpcEventStreamQueueDepth call applies a signed delta to the aggregate stream backlog: +1 on enqueue, -1 on dequeue, and the negated remaining depth when a stream detaches. The Math.Max(0, ...) clamp inside the adjuster prevents a negative depth if the bookkeeping ever drifts.
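The clamped delta update can be illustrated with a small stand-alone class. StreamQueueDepth, Adjust, and Current are hypothetical names, and the lock is an assumption carried over from the _syncRoot description earlier in this document.

```csharp
using System;

// Hypothetical stand-in for the depth bookkeeping inside GatewayMetrics.
public sealed class StreamQueueDepth
{
    private readonly object _syncRoot = new();
    private int _depth;

    // Applies a signed delta and never lets the stored depth fall below zero.
    public void Adjust(int delta)
    {
        lock (_syncRoot)
        {
            _depth = Math.Max(0, _depth + delta);
        }
    }

    public int Current
    {
        get
        {
            lock (_syncRoot) { return _depth; }
        }
    }
}
```

If a detach subtracts more than was ever enqueued, the clamp leaves the depth at zero rather than reporting a negative backlog.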
Grpc/MxAccessGatewayService.cs records gRPC event send latency around each response-stream write:
Stopwatch stopwatch = Stopwatch.StartNew();
await responseStream.WriteAsync(publicEvent).ConfigureAwait(false);
metrics.RecordEventStreamSend(publicEvent.Family.ToString(), stopwatch.Elapsed);
Dashboard Consumption
Dashboard/DashboardSnapshotService.cs calls _metrics.GetSnapshot() once per GetSnapshot invocation and projects it into the dashboard transport types together with the session registry view. The dashboard receives a single, internally consistent snapshot per tick rather than reading individual counters at separate times. See Gateway Dashboard Design and Dashboard Interface Design for the projection rules and wire format.