# Gateway Metrics

The metrics subsystem exposes counters, histograms, and observable gauges that describe gateway throughput, queue health, and worker lifecycle. Both the `System.Diagnostics.Metrics` pipeline and the in-memory `GatewayMetricsSnapshot` consume the same underlying state, so external collectors and the dashboard see consistent numbers.

## Overview

`GatewayMetrics` is a singleton (registered in `GatewayApplication.cs`) that owns a single `Meter` named `MxGateway.Server` and a set of synchronised counters, histograms, and observable gauges. Subsystems call typed mutator methods (`SessionOpened`, `CommandFailed`, `EventReceived`, etc.) rather than touching the `Meter` directly, which keeps the OpenTelemetry instrument names and tag conventions in one place. A `lock (_syncRoot)` block guards the scalar fields used by `GetSnapshot`, while per-event maps use `ConcurrentDictionary<string, long>` so the hot event path avoids the lock.

## Meter and OpenTelemetry Compatibility

The meter name is exposed as a constant so that hosting code can register it with an OpenTelemetry exporter:

```csharp
public sealed class GatewayMetrics : IDisposable
{
    public const string MeterName = "MxGateway.Server";

    public GatewayMetrics()
    {
        _meter = new Meter(MeterName, typeof(GatewayMetrics).Assembly.GetName().Version?.ToString());
        _sessionsOpenedCounter = _meter.CreateCounter<long>("mxgateway.sessions.opened");
        ...
    }
}
```

The meter version is the gateway assembly version, which gives exporters a stable identifier per build. All instrument names use the dotted `mxgateway.<area>.<event>` convention so they group cleanly under a single namespace in tools such as Prometheus, OTLP collectors, or `dotnet-counters`.
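
Code that does not go through an OpenTelemetry exporter can still subscribe to the meter by name using the BCL `MeterListener`. The following is a minimal sketch under stated assumptions: the `Meter` and counter created here are stand-ins for the instance owned by `GatewayMetrics`, and the `totals` dictionary is purely illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Stand-in for the gateway meter; only the meter name and the instrument
// name are taken from this document.
using var meter = new Meter("MxGateway.Server");
Counter<long> sessionsOpened = meter.CreateCounter<long>("mxgateway.sessions.opened");

// Illustrative in-memory tally keyed by instrument name.
var totals = new Dictionary<string, long>();

using var listener = new MeterListener();

// Subscribe only to instruments published by the gateway meter.
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "MxGateway.Server")
    {
        l.EnableMeasurementEvents(instrument);
    }
};

// Accumulate every long-valued measurement per instrument.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    totals.TryGetValue(instrument.Name, out long current);
    totals[instrument.Name] = current + value;
});

listener.Start();

sessionsOpened.Add(1);
sessionsOpened.Add(1);

Console.WriteLine(totals["mxgateway.sessions.opened"]); // prints 2
```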

## Instrument Inventory

### Counters

All counters are `Counter<long>`. Tag values come from the call sites listed under [Recording Sites](#recording-sites).

| Instrument | Tags | What it measures |
|------------|------|------------------|
| `mxgateway.sessions.opened` | none | Successful `SessionManager.OpenSession` completions. |
| `mxgateway.sessions.closed` | none | Sessions closed cleanly via `SessionManager`. |
| `mxgateway.commands.started` | `method` | Command dispatches initiated by `WorkerClient`. |
| `mxgateway.commands.succeeded` | `method` | Commands acknowledged with success by the worker. |
| `mxgateway.commands.failed` | `method`, `category` | Command failures, where `category` is the `WorkerClientErrorCode` or exception type name. |
| `mxgateway.events.received` | `family` | Worker events accepted into the event pipeline. |
| `mxgateway.queues.overflows` | `queue` | Drops when a bounded queue rejects a message (e.g. `grpc-event-stream`). |
| `mxgateway.faults` | `category` | Faults reported by session, event, or worker code paths. The category is a `SessionManagerErrorCode` or `WorkerClientErrorCode` name. |
| `mxgateway.workers.killed` | `reason` | Forced terminations of worker processes. |
| `mxgateway.workers.exited` | `reason` | Clean or fault-driven worker exits. |
| `mxgateway.heartbeats.failed` | `session_id` | Worker heartbeat misses tracked per session. |
| `mxgateway.grpc.streams.disconnected` | `reason` | Detachments of the dashboard or client gRPC event stream. |
| `mxgateway.retries.attempted` | `area` | Resilience retries executed by gateway components. |

### Histograms

Histograms record durations in milliseconds (the `unit` argument on `CreateHistogram`):

```csharp
_workerStartupLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.workers.startup.duration", "ms");
_commandLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.commands.duration", "ms");
_eventStreamSendLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.events.stream_send.duration", "ms");
```

| Instrument | Tags | What it measures |
|------------|------|------------------|
| `mxgateway.workers.startup.duration` | none | Time from `WorkerClient` launch to worker-ready. |
| `mxgateway.commands.duration` | `method`, optional `category` | Command round-trip time. The `category` tag is added on failure so success and failure latencies stay distinguishable. |
| `mxgateway.events.stream_send.duration` | `family` | Time spent writing each public event to the gRPC response stream in `MxAccessGatewayService.StreamEvents`. |
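
The optional `category` tag on `mxgateway.commands.duration` can be illustrated with a small sketch. It assumes the tags are built with `TagList`; the helper `RecordCommand`, the meter name `Sketch.Gateway.Latency`, and the method and category values are hypothetical and are not taken from the gateway source.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical meter; only the instrument name and unit come from this document.
using var meter = new Meter("Sketch.Gateway.Latency");
Histogram<double> commandDuration = meter.CreateHistogram<double>("mxgateway.commands.duration", "ms");

// Capture the tag keys attached to each recorded measurement.
var seen = new List<string>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == meter.Name)
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    var names = new List<string>();
    foreach (KeyValuePair<string, object?> tag in tags)
    {
        names.Add(tag.Key);
    }
    seen.Add(string.Join(",", names));
});
listener.Start();

RecordCommand("Read", TimeSpan.FromMilliseconds(12), category: null);
RecordCommand("Write", TimeSpan.FromMilliseconds(40), category: "Timeout");

Console.WriteLine(string.Join(" | ", seen)); // prints: method | method,category

// Hypothetical helper: success records only `method`, failure adds `category`.
void RecordCommand(string method, TimeSpan duration, string? category)
{
    var tags = new TagList { { "method", method } };
    if (category is not null)
    {
        tags.Add("category", category);
    }
    commandDuration.Record(duration.TotalMilliseconds, in tags);
}
```

Keeping success and failure on one instrument, separated only by the presence of `category`, lets a collector compare the two latency distributions without joining separate metrics.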

### Observable gauges

Observable gauges are pull-based: the `Meter` invokes the supplied callback whenever a listener samples the instrument. Each callback acquires `_syncRoot`, so a gauge sample agrees with a snapshot taken at the same instant.

| Instrument | Source field | Description |
|------------|--------------|-------------|
| `mxgateway.sessions.open` | `_openSessions` | Currently open sessions tracked by `SessionManager`. |
| `mxgateway.workers.running` | `_workersRunning` | Worker clients in a running state. |
| `mxgateway.events.worker_queue.depth` | `_workerEventQueueDepth` | Last reported depth of the worker-side event queue. |
| `mxgateway.events.grpc_stream_queue.depth` | `_grpcEventStreamQueueDepth` | Backlog held by `EventStreamService` for the active gRPC stream consumer. |
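
The pull model can be sketched as follows. This is an illustration rather than the gateway's code: `OpenSessionGaugeSketch` and the meter name `Sketch.Gateway` are hypothetical, and only the instrument name, the `_openSessions` field, and the `_syncRoot` lock are taken from this document.

```csharp
using System;
using System.Diagnostics.Metrics;

using var sketch = new OpenSessionGaugeSketch();
sketch.SessionOpened();
sketch.SessionOpened();
sketch.SessionRemoved();

int sampled = -1;
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == OpenSessionGaugeSketch.MeterName)
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<int>((instrument, value, tags, state) => sampled = value);
listener.Start();

// The gauge reports nothing until a listener explicitly asks for a sample.
listener.RecordObservableInstruments();
Console.WriteLine(sampled); // prints 1

public sealed class OpenSessionGaugeSketch : IDisposable
{
    public const string MeterName = "Sketch.Gateway"; // hypothetical meter name

    private readonly object _syncRoot = new();
    private readonly Meter _meter = new(MeterName);
    private int _openSessions;

    public OpenSessionGaugeSketch()
    {
        // The callback runs on the sampling thread each time the gauge is observed.
        _meter.CreateObservableGauge<int>("mxgateway.sessions.open", ObserveOpenSessions);
    }

    public void SessionOpened()
    {
        lock (_syncRoot) { _openSessions++; }
    }

    public void SessionRemoved()
    {
        lock (_syncRoot) { _openSessions--; }
    }

    // Takes the same lock as the mutators so the sample is never torn.
    private int ObserveOpenSessions()
    {
        lock (_syncRoot) { return _openSessions; }
    }

    public void Dispose() => _meter.Dispose();
}
```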

## Snapshot Shape

`GatewayMetricsSnapshot` is an immutable view of the same state. `GatewayMetrics.GetSnapshot()` builds it while holding `_syncRoot` and copies the dictionaries, so the caller can iterate without further synchronisation. The dashboard service is the primary consumer.

```csharp
public sealed record GatewayMetricsSnapshot(
    int OpenSessions,
    int WorkersRunning,
    int WorkerEventQueueDepth,
    int GrpcEventStreamQueueDepth,
    long SessionsOpened,
    long SessionsClosed,
    long CommandsStarted,
    long CommandsSucceeded,
    long CommandsFailed,
    long EventsReceived,
    long QueueOverflows,
    long Faults,
    long WorkerKills,
    long WorkerExits,
    long HeartbeatFailures,
    long StreamDisconnects,
    long RetryAttempts,
    IReadOnlyDictionary<string, long> CommandFailuresByMethod,
    IReadOnlyDictionary<string, long> EventsByFamily,
    IReadOnlyDictionary<string, long> EventsBySession,
    IReadOnlyDictionary<string, long> RetryAttemptsByArea);
```

The scalar fields mirror the counters and gauges. The four dictionaries provide per-key breakdowns that an exporter would otherwise have to derive by aggregating counter tags:

- `CommandFailuresByMethod` keys by gRPC method name.
- `EventsByFamily` keys by event family (the `Family` enum on a worker event).
- `EventsBySession` keys by `sessionId`; entries are removed via `RemoveSessionEvents` when a session closes so the map does not grow without bound.
- `RetryAttemptsByArea` keys by the resilience `area` tag, e.g. `worker_startup`.

`EventReceived` increments `_eventsReceived` with `Interlocked.Increment` outside the lock, which keeps the event-ingestion path non-blocking. `GetSnapshot` therefore reads the field with `Interlocked.Read(ref _eventsReceived)`.
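
The combination of a lock for scalar fields, interlocked operations for the event total, and copied dictionaries can be sketched in a reduced form. The types `SnapshotSketch` and `MiniSnapshot` are hypothetical, the snapshot is cut down to three members, and `EventReceived` here takes only the family; the real `GatewayMetrics` is assumed to follow the same pattern but is not reproduced here.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

var metrics = new SnapshotSketch();
metrics.SessionOpened();
metrics.EventReceived("DataChange");
metrics.EventReceived("DataChange");

MiniSnapshot snapshot = metrics.GetSnapshot();

// A later event does not change the copy already handed out.
metrics.EventReceived("DataChange");

Console.WriteLine($"{snapshot.OpenSessions} {snapshot.EventsReceived} {snapshot.EventsByFamily["DataChange"]}");
// prints: 1 2 2

// Hypothetical three-member reduction of GatewayMetricsSnapshot.
public sealed record MiniSnapshot(
    int OpenSessions,
    long EventsReceived,
    IReadOnlyDictionary<string, long> EventsByFamily);

public sealed class SnapshotSketch
{
    private readonly object _syncRoot = new();
    private readonly ConcurrentDictionary<string, long> _eventsByFamily = new();
    private int _openSessions;
    private long _eventsReceived;

    public void SessionOpened()
    {
        lock (_syncRoot) { _openSessions++; }
    }

    // Hot path: no lock, only an interlocked increment and a concurrent map update.
    public void EventReceived(string family)
    {
        Interlocked.Increment(ref _eventsReceived);
        _eventsByFamily.AddOrUpdate(family, 1, (_, count) => count + 1);
    }

    public MiniSnapshot GetSnapshot()
    {
        lock (_syncRoot)
        {
            return new MiniSnapshot(
                _openSessions,
                Interlocked.Read(ref _eventsReceived),
                // ToArray takes a consistent copy of the concurrent map.
                new Dictionary<string, long>(_eventsByFamily.ToArray()));
        }
    }
}
```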

## Recording Sites

This section maps each instrument to the code paths that write to it, which makes it easier to trace an unexpected counter reading back to a subsystem.

### Session manager

`Sessions/SessionManager.cs` emits session lifecycle and fault counters:

```csharp
_metrics.SessionOpened();
...
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
...
_metrics.SessionClosed();
...
_metrics.SessionRemoved();
...
_metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString());
...
_metrics.RemoveSessionEvents(session.SessionId);
```

`SessionRemoved` decrements the open-session gauge without incrementing the closed counter, which covers cases where a session is evicted rather than closed by the client.

### Worker client

`Workers/WorkerClient.cs` records command throughput, worker lifecycle, heartbeat failures, and the worker-side event queue depth:

- `CommandStarted(method)`, followed by either `CommandSucceeded(method, duration)` or `CommandFailed(method, category, duration)`, bracketing the worker request/response pair.
- `WorkerStarted(startupDuration)` once the worker reports ready.
- `RecordWorkerStoppedOnce` calls `WorkerStopped(reason)` exactly once per worker, guarding against double-counting on simultaneous fault and exit signals.
- `WorkerKilled(reason)` when the client forcibly terminates the worker.
- `HeartbeatFailed(SessionId)` per missed heartbeat.
- `SetWorkerEventQueueDepth(queueDepth)` after each event ingest.
- `EventReceived(SessionId, workerEvent.Event.Family.ToString())` for each worker event.
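
The exactly-once behaviour described for `RecordWorkerStoppedOnce` can be obtained with an atomic flag. The sketch below is an assumption about one way to do it, not the gateway's actual implementation: `StopOnceSketch`, its `_stopRecorded` field, and the callback standing in for `WorkerStopped(reason)` are all hypothetical.

```csharp
using System;
using System.Threading;

int exits = 0;
var guard = new StopOnceSketch(reason => exits++);

// A fault signal and an exit signal both arrive for the same worker.
guard.RecordWorkerStoppedOnce("Faulted");
guard.RecordWorkerStoppedOnce("Exited");

Console.WriteLine(exits); // prints 1

public sealed class StopOnceSketch
{
    private readonly Action<string> _workerStopped; // stand-in for the metrics call
    private int _stopRecorded; // 0 = not yet recorded, 1 = recorded

    public StopOnceSketch(Action<string> workerStopped) => _workerStopped = workerStopped;

    public void RecordWorkerStoppedOnce(string reason)
    {
        // Only the first caller observes 0 and records the stop.
        if (Interlocked.Exchange(ref _stopRecorded, 1) == 0)
        {
            _workerStopped(reason);
        }
    }
}
```

With this shape the first signal to arrive decides the `reason` tag, and later signals for the same worker are ignored.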

### Worker process launcher

`Workers/WorkerProcessLauncher.cs` records process-level kills and startup retries:

```csharp
_metrics.WorkerKilled(reason);
...
_metrics.RetryAttempted("worker_startup");
```

The `worker_startup` tag is hard-coded so the `RetryAttemptsByArea` snapshot reports launcher retries distinctly from other resilience areas.

### Session worker client factory

`Sessions/SessionWorkerClientFactory.cs` records the worker kill that follows a failed `OpenSession` handshake:

```csharp
_metrics.WorkerKilled("OpenSessionFailed");
```

This is the only fault path where the factory itself owns the kill decision; once the worker is bound to a session, the `WorkerClient` becomes responsible for lifecycle metrics.

### gRPC event stream service

`Grpc/EventStreamService.cs` records the dashboard/client event-stream backlog and disconnect counters:

```csharp
metrics.AdjustGrpcEventStreamQueueDepth(1);
...
metrics.AdjustGrpcEventStreamQueueDepth(-1);
...
metrics.AdjustGrpcEventStreamQueueDepth(-remainingDepth);
metrics.StreamDisconnected("Detached");
...
metrics.QueueOverflow("grpc-event-stream");
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
...
metrics.Fault(WorkerClientErrorCode.WorkerFaulted.ToString());
```

The service reports each enqueue and dequeue as a signed delta, and `AdjustGrpcEventStreamQueueDepth` applies that delta to the aggregate stream backlog. The `Math.Max(0, ...)` clamp inside the adjuster prevents a negative depth if the bookkeeping ever drifts.
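
A minimal sketch of such a clamped adjuster follows. It assumes the depth is a plain `int` guarded by a lock; `QueueDepthSketch` and its members are hypothetical names, and only the `Math.Max(0, ...)` clamp is taken from this document.

```csharp
using System;

var depth = new QueueDepthSketch();
depth.Adjust(1);
depth.Adjust(1);

// Simulated bookkeeping drift: more removals reported than additions.
depth.Adjust(-5);

Console.WriteLine(depth.Current); // prints 0

public sealed class QueueDepthSketch
{
    private readonly object _syncRoot = new();
    private int _depth;

    public int Current
    {
        get { lock (_syncRoot) { return _depth; } }
    }

    public void Adjust(int delta)
    {
        lock (_syncRoot)
        {
            // Clamp so the reported backlog never goes below zero.
            _depth = Math.Max(0, _depth + delta);
        }
    }
}
```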

`Grpc/MxAccessGatewayService.cs` records gRPC event send latency around each response-stream write:

```csharp
Stopwatch stopwatch = Stopwatch.StartNew();
await responseStream.WriteAsync(publicEvent).ConfigureAwait(false);
metrics.RecordEventStreamSend(publicEvent.Family.ToString(), stopwatch.Elapsed);
```

## Dashboard Consumption

`Dashboard/DashboardSnapshotService.cs` calls `_metrics.GetSnapshot()` once per `GetSnapshot` invocation and projects it into the dashboard transport types together with the session registry view. The dashboard receives a single, internally consistent snapshot per tick rather than reading individual counters at separate times. See [Gateway Dashboard Design](./GatewayDashboardDesign.md) and [Dashboard Interface Design](./DashboardInterfaceDesign.md) for the projection rules and wire format.

## Related Documentation

- [Gateway Dashboard Design](./GatewayDashboardDesign.md)
- [Dashboard Interface Design](./DashboardInterfaceDesign.md)
- [Sessions](./Sessions.md)