# Gateway Metrics

The metrics subsystem exposes counters, histograms, and observable gauges that describe gateway throughput, queue health, and worker lifecycle. Both the `System.Diagnostics.Metrics` pipeline and the in-memory `GatewayMetricsSnapshot` consume the same underlying state, so external collectors and the dashboard see consistent numbers.

## Overview

`GatewayMetrics` is a singleton (registered in `GatewayApplication.cs`) that owns a single `Meter` named `MxGateway.Server` and a set of synchronised counters, histograms, and observable gauges. Subsystems call typed mutator methods (`SessionOpened`, `CommandFailed`, `EventReceived`, etc.) rather than touching the `Meter` directly, which keeps the OpenTelemetry instrument names and tag conventions in one place. A `lock (_syncRoot)` block guards the scalar fields used by `GetSnapshot`, while per-event maps use `ConcurrentDictionary` so the hot event path avoids the lock.

## Meter and OpenTelemetry Compatibility

The meter name is exposed as a constant so that hosting code can register it with an OpenTelemetry exporter:

```csharp
public sealed class GatewayMetrics : IDisposable
{
    public const string MeterName = "MxGateway.Server";

    public GatewayMetrics()
    {
        _meter = new Meter(MeterName, typeof(GatewayMetrics).Assembly.GetName().Version?.ToString());
        _sessionsOpenedCounter = _meter.CreateCounter<long>("mxgateway.sessions.opened");
        ...
    }
}
```

The meter version is the gateway assembly version, which gives exporters a stable identifier per build. All instrument names share the dotted `mxgateway.` prefix, so they group cleanly under a single namespace in tools such as Prometheus, OTLP collectors, or `dotnet-counters`.

## Instrument Inventory

### Counters

All counters are `Counter<long>`. Tag values come from the call sites listed under [Recording Sites](#recording-sites).

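
To make the tag conventions concrete, the following is a minimal sketch of recording a tagged `Counter<long>` and observing it in-process with a `MeterListener`. It is not gateway code: the meter name `Example.Gateway`, the tag values `Read` and `Timeout`, and the `totals` dictionary are assumptions made for illustration, and only the instrument name `mxgateway.commands.failed` and its `method`/`category` tags come from this document.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical stand-in meter so the sketch does not depend on gateway code.
const string ExampleMeterName = "Example.Gateway";
using var meter = new Meter(ExampleMeterName);
Counter<long> commandsFailed = meter.CreateCounter<long>("mxgateway.commands.failed");

// Running totals keyed by "instrument|tag=value,tag=value".
var totals = new Dictionary<string, long>();

using var listener = new MeterListener();

// Subscribe only to instruments published by the example meter.
listener.InstrumentPublished = (instrument, meterListener) =>
{
    if (instrument.Meter.Name == ExampleMeterName)
    {
        meterListener.EnableMeasurementEvents(instrument);
    }
};

// Counter<long> measurements are delivered to this callback as they are recorded.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    var parts = new List<string>();
    foreach (KeyValuePair<string, object?> tag in tags)
    {
        parts.Add($"{tag.Key}={tag.Value}");
    }

    // Sort so the key does not depend on the order the tags were supplied in.
    parts.Sort(StringComparer.Ordinal);
    string key = $"{instrument.Name}|{string.Join(",", parts)}";
    totals[key] = totals.GetValueOrDefault(key) + value;
});
listener.Start();

// "Read" and "Timeout" are made-up tag values for illustration.
commandsFailed.Add(1,
    new KeyValuePair<string, object?>("method", "Read"),
    new KeyValuePair<string, object?>("category", "Timeout"));
commandsFailed.Add(1,
    new KeyValuePair<string, object?>("method", "Read"),
    new KeyValuePair<string, object?>("category", "Timeout"));

foreach (KeyValuePair<string, long> entry in totals)
{
    Console.WriteLine($"{entry.Key} = {entry.Value}");
}
```

Each distinct tag combination forms its own series, which is why an exporter has to aggregate across tags to recover a total per instrument.
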
| Instrument | Tags | What it measures |
|------------|------|------------------|
| `mxgateway.sessions.opened` | none | Successful `SessionManager.OpenSession` completions. |
| `mxgateway.sessions.closed` | none | Sessions closed cleanly via `SessionManager`. |
| `mxgateway.commands.started` | `method` | Command dispatches initiated by `WorkerClient`. |
| `mxgateway.commands.succeeded` | `method` | Commands acknowledged with success by the worker. |
| `mxgateway.commands.failed` | `method`, `category` | Command failures, where `category` is the `WorkerClientErrorCode` or exception type name. |
| `mxgateway.events.received` | `family` | Worker events accepted into the event pipeline. |
| `mxgateway.queues.overflows` | `queue` | Drops when a bounded queue rejects a message (e.g. `grpc-event-stream`). |
| `mxgateway.faults` | `category` | Faults reported by session, event, or worker code paths. The category is a `SessionManagerErrorCode` or `WorkerClientErrorCode` name. |
| `mxgateway.workers.killed` | `reason` | Forced terminations of worker processes. |
| `mxgateway.workers.exited` | `reason` | Clean or fault-driven worker exits. |
| `mxgateway.heartbeats.failed` | `session_id` | Worker heartbeat misses tracked per session. |
| `mxgateway.grpc.streams.disconnected` | `reason` | Detachments of the dashboard or client gRPC event stream. |
| `mxgateway.retries.attempted` | `area` | Resilience retries executed by gateway components. |

### Histograms

Histograms record durations in milliseconds (the `unit` argument on `CreateHistogram`):

```csharp
_workerStartupLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.workers.startup.duration", "ms");
_commandLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.commands.duration", "ms");
_eventStreamSendLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.events.stream_send.duration", "ms");
```

| Instrument | Tags | What it measures |
|------------|------|------------------|
| `mxgateway.workers.startup.duration` | none | Time from `WorkerClient` launch to worker-ready. |
| `mxgateway.commands.duration` | `method`, optional `category` | Command round-trip time. The `category` tag is added on failure so success and failure latencies stay distinguishable. |
| `mxgateway.events.stream_send.duration` | `family` | Time spent writing each public event to the gRPC response stream in `MxAccessGatewayService.StreamEvents`. |

### Observable Gauges

Observable gauges are pull-based; the `Meter` invokes the supplied callback whenever a listener samples it. Each callback re-acquires `_syncRoot` so the gauge value matches the snapshot taken at the same instant.

| Instrument | Source field | Description |
|------------|--------------|-------------|
| `mxgateway.sessions.open` | `_openSessions` | Currently open sessions tracked by `SessionManager`. |
| `mxgateway.workers.running` | `_workersRunning` | Worker clients in a running state. |
| `mxgateway.events.worker_queue.depth` | `_workerEventQueueDepth` | Last reported depth of the worker-side event queue. |
| `mxgateway.events.grpc_stream_queue.depth` | `_grpcEventStreamQueueDepth` | Backlog held by `EventStreamService` for the active gRPC stream consumer. |

## Snapshot Shape

`GatewayMetricsSnapshot` is the immutable view of the same state, returned by `GatewayMetrics.GetSnapshot()` while holding `_syncRoot`. The dictionaries are copied so the caller can iterate without further synchronisation.
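
The copy-under-lock idea can be sketched as follows. This is an illustrative reduction, not the gateway implementation: the local names (`syncRoot`, `openSessions`, `eventsByFamily`), the tuple return shape, and the `OnDataChange` family name are all assumptions chosen to keep the example self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// All names below are illustrative; they only mimic the shape described above.
object syncRoot = new();
int openSessions = 0;
long sessionsOpened = 0;
var eventsByFamily = new ConcurrentDictionary<string, long>();

// Scalar state changes under the lock.
void SessionOpened()
{
    lock (syncRoot)
    {
        openSessions++;
        sessionsOpened++;
    }
}

// Per-event map updates bypass the lock.
void EventReceived(string family) =>
    eventsByFamily.AddOrUpdate(family, 1, (_, count) => count + 1);

// Read the scalars and copy the map while holding the lock, so the caller
// gets a detached view it can iterate without further synchronisation.
(int OpenSessions, long SessionsOpened, IReadOnlyDictionary<string, long> EventsByFamily) GetSnapshot()
{
    lock (syncRoot)
    {
        var copy = new Dictionary<string, long>(eventsByFamily.ToArray());
        return (openSessions, sessionsOpened, copy);
    }
}

SessionOpened();
EventReceived("OnDataChange"); // hypothetical family name
EventReceived("OnDataChange");

var snapshot = GetSnapshot();

// Later activity does not leak into the snapshot that was already taken.
EventReceived("OnDataChange");
SessionOpened();

Console.WriteLine($"snapshot: open={snapshot.OpenSessions}, events={snapshot.EventsByFamily["OnDataChange"]}");
Console.WriteLine($"live:     events={eventsByFamily["OnDataChange"]}");
```

Because the returned dictionary is a separate copy, a consumer can enumerate it at its own pace while recording continues on other threads.
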
The dashboard service is the primary consumer.

```csharp
public sealed record GatewayMetricsSnapshot(
    int OpenSessions,
    int WorkersRunning,
    int WorkerEventQueueDepth,
    int GrpcEventStreamQueueDepth,
    long SessionsOpened,
    long SessionsClosed,
    long CommandsStarted,
    long CommandsSucceeded,
    long CommandsFailed,
    long EventsReceived,
    long QueueOverflows,
    long Faults,
    long WorkerKills,
    long WorkerExits,
    long HeartbeatFailures,
    long StreamDisconnects,
    long RetryAttempts,
    IReadOnlyDictionary<string, long> CommandFailuresByMethod,
    IReadOnlyDictionary<string, long> EventsByFamily,
    IReadOnlyDictionary<string, long> EventsBySession,
    IReadOnlyDictionary<string, long> RetryAttemptsByArea);
```

The scalar fields mirror the counters and gauges. The four dictionaries provide the breakdowns that counter tags would otherwise require an exporter to aggregate:

- `CommandFailuresByMethod` keys by gRPC method name.
- `EventsByFamily` keys by event family (the `Family` enum on a worker event).
- `EventsBySession` keys by `sessionId`; entries are removed via `RemoveSessionEvents` when a session closes so the map does not grow without bound.
- `RetryAttemptsByArea` keys by the resilience `area` tag, e.g. `worker_startup`.

`EventsReceived` is read with `Interlocked.Read(ref _eventsReceived)` because `EventReceived` increments it via `Interlocked.Increment` outside the lock to keep the event-ingestion path non-blocking.

## Recording Sites

The recording call sites describe the code paths that write into each instrument. This mapping makes it easier to trace an unexpected counter reading back to a subsystem.

### Session manager

`Sessions/SessionManager.cs` emits session lifecycle and fault counters:

```csharp
_metrics.SessionOpened();
...
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
...
_metrics.SessionClosed();
...
_metrics.SessionRemoved();
...
_metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString());
...
_metrics.RemoveSessionEvents(session.SessionId);
```

`SessionRemoved` decrements the open-session gauge without incrementing the closed counter, which covers cases where a session is evicted rather than closed by the client.

### Worker client

`Workers/WorkerClient.cs` records command throughput, worker lifecycle, heartbeat failures, and the worker-side event queue depth:

- `CommandStarted(method)` and `CommandSucceeded(method, duration)` / `CommandFailed(method, category, duration)` around the worker request/response pair.
- `WorkerStarted(startupDuration)` once the worker reports ready.
- `RecordWorkerStoppedOnce` calls `WorkerStopped(reason)` exactly once per worker, guarding against double-counting on simultaneous fault and exit signals.
- `WorkerKilled(reason)` when the client forcibly terminates the worker.
- `HeartbeatFailed(SessionId)` per missed heartbeat.
- `SetWorkerEventQueueDepth(queueDepth)` after each event ingest.
- `EventReceived(SessionId, workerEvent.Event.Family.ToString())` for each worker event.

### Worker process launcher

`Workers/WorkerProcessLauncher.cs` records process-level kills and startup retries:

```csharp
_metrics.WorkerKilled(reason);
...
_metrics.RetryAttempted("worker_startup");
```

The `worker_startup` tag is hard-coded so the `RetryAttemptsByArea` snapshot reports launcher retries distinctly from other resilience areas.

### Session worker client factory

`Sessions/SessionWorkerClientFactory.cs` records the worker kill that follows a failed `OpenSession` handshake:

```csharp
_metrics.WorkerKilled("OpenSessionFailed");
```

This is the only fault path where the factory itself owns the kill decision; once the worker is bound to a session, the `WorkerClient` becomes responsible for lifecycle metrics.

### gRPC event stream service

`Grpc/EventStreamService.cs` records the dashboard/client event-stream backlog and disconnect counters:

```csharp
metrics.AdjustGrpcEventStreamQueueDepth(1);
...
metrics.AdjustGrpcEventStreamQueueDepth(-1);
...
metrics.AdjustGrpcEventStreamQueueDepth(-remainingDepth);
metrics.StreamDisconnected("Detached");
...
metrics.QueueOverflow("grpc-event-stream");
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
...
metrics.Fault(WorkerClientErrorCode.WorkerFaulted.ToString());
```

The service tracks per-message enqueues and dequeues, so `AdjustGrpcEventStreamQueueDepth` updates the aggregate stream backlog. The `Math.Max(0, ...)` clamp inside the adjuster prevents a negative depth if the bookkeeping ever drifts.

`Grpc/MxAccessGatewayService.cs` records gRPC event send latency around each response-stream write:

```csharp
Stopwatch stopwatch = Stopwatch.StartNew();
await responseStream.WriteAsync(publicEvent).ConfigureAwait(false);
metrics.RecordEventStreamSend(publicEvent.Family.ToString(), stopwatch.Elapsed);
```

## Dashboard Consumption

`Dashboard/DashboardSnapshotService.cs` calls `_metrics.GetSnapshot()` once per `GetSnapshot` invocation and projects it into the dashboard transport types together with the session registry view. The dashboard receives a single, internally consistent snapshot per tick rather than reading individual counters at separate times. See [Gateway Dashboard Design](./gateway-dashboard-design.md) and [Dashboard Interface Design](./DashboardInterfaceDesign.md) for the projection rules and wire format.

## Related Documentation

- [Gateway Dashboard Design](./gateway-dashboard-design.md)
- [Dashboard Interface Design](./DashboardInterfaceDesign.md)
- [Sessions](./Sessions.md)