Files
mxaccessgw/docs/Grpc.md
T
Joseph Doherty ee959e46e6 Resolve Contracts-001/004/005/006/007/008 code-review findings
Contracts-001: docs/Grpc.md still described "four MxAccessGateway RPCs" —
updated to the actual six (adding AcknowledgeAlarm and QueryActiveAlarms to
the handler and validation-rule sections).

Contracts-003 (Won't Fix): the finding is factually wrong — the <Protobuf>
item for mxaccess_worker.proto already sets ProtoRoot="Protos"; all three
items are consistent (confirmed back to the reviewed commit).

Contracts-004: corrected the stale GatewayContractInfo XML summary
("before generated protobuf contracts are introduced").

Contracts-005: no proto field/enum value was ever removed, so no reserved
ranges were invented. Added a wire-compatibility policy comment to all three
.proto files instructing future editors to reserve removed numbers.

Contracts-006: documented MxStatusProxy.success — it mirrors the COM
MXSTATUS_PROXY numeric success member, is not a boolean, and clients should
branch on category.

Contracts-007: added 13 round-trip tests covering galaxy_repository.proto
messages, bulk-subscribe payloads, and raw-value/IPC worker bodies.

Contracts-008: WorkerAlarmRpcDispatcher never assigns AcknowledgeAlarmReply.
status, so the old "native status" proto comment was misleading. Corrected
the hresult/status proto comments and documented the worker native_status →
public reply mapping in AlarmClientDiscovery.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:12:00 -04:00

252 lines
16 KiB
Markdown

# Gateway gRPC Service Layer
The gRPC service layer is the public entry point for client traffic. It is intentionally thin: handlers validate the incoming request, look up or open a session, dispatch to the worker through the session manager, and translate worker replies and events back into public proto types.
## Layer Responsibilities
The architecture rule (from `gateway.md`) is that the gRPC layer must "validate request, find session, call the session worker client, map worker replies to public replies, and stream events". Anything else — caching, retries, worker process lifetime, event ordering — lives behind `ISessionManager` and the worker client. Keeping the layer thin lets the same session/worker code be reused by future transports (for example, an in-process host or an alternate IPC) without having to re-derive validation or mapping rules.
The layer is composed of four collaborators:
| Type | Lifetime | Role |
|------|----------|------|
| `MxAccessGatewayService` | scoped (gRPC) | Implements the six `MxAccessGateway` RPCs, performs exception mapping. |
| `MxAccessGrpcRequestValidator` | singleton | Rejects malformed requests before any session work runs. |
| `MxAccessGrpcMapper` | singleton | Converts public proto types to internal `WorkerCommand`/`WorkerEvent` types and back. |
| `IEventStreamService` (`EventStreamService`) | singleton | Owns the event stream pipeline, including bounded queue and backpressure handling. |
Registration happens in `GatewayApplication`:
```csharp
builder.Services.AddSingleton<MxAccessGrpcMapper>();
builder.Services.AddSingleton<MxAccessGrpcRequestValidator>();
builder.Services.AddSingleton<IEventStreamService, EventStreamService>();
```
The service itself is mapped as a normal gRPC endpoint via `endpoints.MapGrpcService<MxAccessGatewayService>()`.
A second gRPC service, `GalaxyRepositoryGrpcService`, is mapped alongside it. It exposes the read-only Galaxy Repository browse surface and is documented separately in [Galaxy Repository Browse](./GalaxyRepository.md). It shares the authorization interceptor and authentication policy used by `MxAccessGatewayService`, but it does not go through the session manager or worker — it talks to SQL Server directly.
## RPC Handlers
`MxAccessGatewayService` derives from the generated `MxAccessGateway.MxAccessGatewayBase` and implements every RPC declared in `mxaccess_gateway.proto` — six in total: `OpenSession`, `CloseSession`, `Invoke`, `StreamEvents`, `AcknowledgeAlarm`, and `QueryActiveAlarms`. The proto contract itself is documented in [Contracts](./Contracts.md); this section covers only what the server-side handler does on top of that contract.
Public gRPC send and receive message sizes are configured from
`MxGateway:Protocol:MaxGrpcMessageBytes` (default 16 MiB). Official clients use
the same default so paged Galaxy browse replies and larger MXAccess payloads
fail consistently instead of depending on language-specific gRPC defaults.
### `OpenSession`
`OpenSession` validates the request, asks `ISessionManager` to open a session under the caller's identity, and returns a reply that advertises both protocol versions and the capabilities the gateway supports. Capability strings are static because the gateway has a fixed feature set per build; clients use them as a forward-compatibility hint rather than runtime negotiation.
```csharp
GatewaySession session = await sessionManager
.OpenSessionAsync(
SessionOpenRequest.FromContract(request),
ResolveClientIdentity(),
context.CancellationToken)
.ConfigureAwait(false);
OpenSessionReply reply = new()
{
SessionId = session.SessionId,
BackendName = session.BackendName,
WorkerProcessId = session.WorkerProcessId ?? 0,
WorkerProtocolVersion = GatewayContractInfo.WorkerProtocolVersion,
GatewayProtocolVersion = GatewayContractInfo.GatewayProtocolVersion,
DefaultCommandTimeout = Google.Protobuf.WellKnownTypes.Duration.FromTimeSpan(session.CommandTimeout),
ProtocolStatus = MxAccessGrpcMapper.Ok(),
};
```
`ResolveClientIdentity()` reads `IGatewayRequestIdentityAccessor.Current` and prefers `DisplayName`, falling back to `KeyId`. The accessor is populated by the authorization interceptor before the handler runs (see [Authorization](./Authorization.md)).
### `CloseSession`
The handler delegates to `sessionManager.CloseSessionAsync` and converts the resulting `SessionCloseResult` into a `CloseSessionReply`. The handler distinguishes the two outcomes via the protocol status message — `"Session was already closed."` versus `"Session closed."` — so callers do not need to reason about idempotency from a status code alone.
### `Invoke`
`Invoke` is the unary command path. It runs the validator (which enforces payload-vs-kind matching), uses the mapper to wrap the public `MxCommand` in a `WorkerCommand` with an enqueue timestamp, calls `sessionManager.InvokeAsync`, and unwraps the worker reply.
```csharp
requestValidator.ValidateInvoke(request);
WorkerCommand workerCommand = mapper.MapCommand(request);
WorkerCommandReply workerReply = await sessionManager
.InvokeAsync(request.SessionId, workerCommand, context.CancellationToken)
.ConfigureAwait(false);
return mapper.MapCommandReply(workerReply);
```
Carrying the enqueue timestamp into the worker layer is what lets queue-wait time be measured separately from worker-side execution time when troubleshooting timeouts.
### `StreamEvents`
`StreamEvents` is a server-streaming RPC. The handler delegates the full pipeline to `IEventStreamService` and just forwards each `MxEvent` onto the response stream. Keeping the channel and producer/consumer machinery out of the handler means cancellation, exception mapping, and metric bookkeeping live in one place.
### `AcknowledgeAlarm`
`AcknowledgeAlarm` is a unary RPC that acknowledges a single alarm. The handler validates `session_id` and `alarm_full_reference` inline (it does not run through `MxAccessGrpcRequestValidator`, because the alarm surface routes through `IAlarmRpcDispatcher` rather than the generic `Invoke` path), resolves the session, then delegates to the registered `IAlarmRpcDispatcher`. The production `WorkerAlarmRpcDispatcher` routes the ack over the worker IPC by GUID (`AcknowledgeAlarmCommand`) when the reference parses as a canonical GUID, or by `Provider!Group.Tag` reference (`AcknowledgeAlarmByNameCommand`) otherwise. The handler-level RPC behaviour and the alarm contract itself are documented in [Alarm Client Discovery](./AlarmClientDiscovery.md).
### `QueryActiveAlarms`
`QueryActiveAlarms` is a server-streaming RPC that returns an `ActiveAlarmSnapshot` per currently active alarm. The handler validates `session_id` inline, resolves the session, and delegates to `IAlarmRpcDispatcher`; `WorkerAlarmRpcDispatcher` issues a `QueryActiveAlarmsCommand` over the worker IPC and streams each snapshot from the worker reply.
## Validation Rules
`MxAccessGrpcRequestValidator` rejects requests with `StatusCode.InvalidArgument` before any session work happens. The rules are intentionally narrow — anything that requires session state (for example, "session does not exist") is left for `ISessionManager` so the validator can stay synchronous and side-effect free.
| RPC | Rule | Status |
|-----|------|--------|
| `OpenSession` | `command_timeout`, when set, must be `> 0`. | `InvalidArgument` |
| `CloseSession` | `session_id` must be non-empty. | `InvalidArgument` |
| `StreamEvents` | `session_id` must be non-empty. | `InvalidArgument` |
| `Invoke` | `session_id` non-empty, `command` present, `kind` not `Unspecified`, payload oneof must match `kind`. | `InvalidArgument` |
| `AcknowledgeAlarm` | `session_id` and `alarm_full_reference` must be non-empty. Validated inline in the handler, not by `MxAccessGrpcRequestValidator`. | `InvalidArgument` |
| `QueryActiveAlarms` | `session_id` must be non-empty. Validated inline in the handler, not by `MxAccessGrpcRequestValidator`. | `InvalidArgument` |
The payload-vs-kind check matters because the `MxCommand.payload` oneof is non-discriminated on the wire — a misaligned client could send `kind = Write` with a `Register` payload and silently confuse the worker. The validator turns that into a clear client error:
```csharp
private static void ValidateCommandPayload(MxCommand command)
{
MxCommand.PayloadOneofCase expectedPayload = ExpectedPayload(command.Kind);
if (command.PayloadCase != expectedPayload)
{
throw InvalidArgument(
$"Command kind {command.Kind} requires payload {expectedPayload} but received {command.PayloadCase}.");
}
}
```
`ExpectedPayload` enumerates every `MxCommandKind` that has a matching payload case; unknown kinds map to `PayloadOneofCase.None`, which forces a mismatch and therefore a rejection.
## Mapping Rules
`MxAccessGrpcMapper` is the only place that translates between public proto types and internal worker types. Two design choices are worth calling out.
The mapper clones the inbound `MxCommand` rather than reusing the reference. This isolates the worker pipeline from any later mutation of the request graph by the gRPC framework or interceptors:
```csharp
public WorkerCommand MapCommand(MxCommandRequest request)
{
ArgumentNullException.ThrowIfNull(request);
ArgumentNullException.ThrowIfNull(request.Command);
return new WorkerCommand
{
Command = request.Command.Clone(),
EnqueueTimestamp = Timestamp.FromDateTimeOffset(_timeProvider.GetUtcNow()),
};
}
```
When the worker reply or event payload is missing, the mapper returns a synthetic public message with `ProtocolStatusCode.ProtocolViolation` (for replies) or a sentinel `MxEvent` with `MxEventFamily.Unspecified` (for events). The gateway never relays a partial frame to clients — anything missing is reported as a protocol violation against the worker, not a transport error against the client.
The mapper also exposes static factory methods for every `ProtocolStatusCode` (`Ok`, `InvalidRequest`, `SessionNotFound`, `SessionNotReady`, `WorkerUnavailable`, `Timeout`, `Canceled`, `ProtocolViolation`) so that handlers and tests can produce status payloads without duplicating the enum-to-string mapping.
## Exception to Status Mapping
Every handler wraps its body in `try { ... } catch (Exception exception) when (exception is not RpcException) { throw MapException(exception); }`. `RpcException` is allowed to propagate untouched (the validator already produces them with the right code). All other exceptions are translated by `MapException`, which knows three categories:
- `OperationCanceledException` becomes `StatusCode.Cancelled`.
- `SessionManagerException` is mapped by its `ErrorCode`.
- `WorkerClientException` is mapped by its `ErrorCode`.
Anything else is logged at warning and surfaced as `Unavailable` with a generic message — clients see a retryable status, and the unexpected exception is captured in gateway logs rather than leaked over the wire.
```csharp
StatusCode statusCode = exception.ErrorCode switch
{
SessionManagerErrorCode.SessionNotFound => StatusCode.NotFound,
SessionManagerErrorCode.SessionNotReady => StatusCode.FailedPrecondition,
SessionManagerErrorCode.EventSubscriberAlreadyActive => StatusCode.ResourceExhausted,
SessionManagerErrorCode.EventQueueOverflow => StatusCode.ResourceExhausted,
SessionManagerErrorCode.SessionLimitExceeded => StatusCode.ResourceExhausted,
SessionManagerErrorCode.OpenFailed => StatusCode.Unavailable,
SessionManagerErrorCode.CloseFailed => StatusCode.Unavailable,
_ => StatusCode.Unavailable,
};
```
`WorkerClientException` follows the same pattern: `CommandTimeout` becomes `DeadlineExceeded`, `GatewayShutdown` becomes `Cancelled`, `InvalidState` becomes `FailedPrecondition`, `ProtocolViolation` becomes `Internal`, and unmapped codes fall through to `Unavailable`.
## Event Streaming Model
`EventStreamService` implements the `StreamEvents` pipeline. It exists as a separate service (rather than living inside `MxAccessGatewayService`) so that the channel, backpressure policy, and metric bookkeeping can be unit-tested without spinning up a Kestrel host.
The pipeline has three stages:
1. The session is resolved via `sessionManager.TryGetSession`. A miss raises `SessionManagerException(SessionNotFound)` so the handler reports `StatusCode.NotFound`.
2. `session.AttachEventSubscriber(...)` enforces the single-subscriber-per-session rule (or allows multiple subscribers if `Sessions:AllowMultipleEventSubscribers` is enabled). The returned `IDisposable` is released in the `finally` block, ensuring the subscriber slot is freed even when the client cancels mid-stream.
3. A `Channel<MxEvent>` decouples the worker-side producer from the gRPC writer. The channel is bounded by `Events:QueueCapacity` and configured for a single reader and writer:
```csharp
Channel<MxEvent> eventQueue = Channel.CreateBounded<MxEvent>(
new BoundedChannelOptions(options.Value.Events.QueueCapacity)
{
SingleReader = true,
SingleWriter = true,
FullMode = BoundedChannelFullMode.Wait,
AllowSynchronousContinuations = false,
});
```
The producer reads `WorkerEvent`s from the session, maps them, and writes to the channel. Per-session ordering is preserved end-to-end: events arrive from the worker in `WorkerSequence` order, the producer is the single writer, and the consumer drains FIFO. The producer also honours the request's `AfterWorkerSequence` cursor by skipping any event whose `WorkerSequence` is at or before the requested cutoff, which lets clients resume after a disconnect without server-side replay state.
### Backpressure
When `TryWrite` fails the queue is full. The handling depends on `Events:BackpressurePolicy`:
```csharp
if (!writer.TryWrite(publicEvent))
{
string message = $"Session {session.SessionId} event stream queue overflowed.";
metrics.QueueOverflow("grpc-event-stream");
if (options.Value.Events.BackpressurePolicy == EventBackpressurePolicy.FailFast)
{
session.MarkFaulted(message);
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
}
else
{
logger.LogDebug(
"Disconnecting event stream for session {SessionId} after queue overflow.",
session.SessionId);
}
writer.TryComplete(new SessionManagerException(
SessionManagerErrorCode.EventQueueOverflow,
message));
return;
}
```
Under `FailFast` the session is faulted so subsequent commands return `FailedPrecondition`; the client must reopen. Under the default policy only the stream is dropped and the session continues to accept commands, leaving recovery to the client (typically a fresh `StreamEvents` call with an updated `AfterWorkerSequence`). Either way, the consumer side observes `StatusCode.ResourceExhausted` via the `EventQueueOverflow` mapping above.
### Cancellation and cleanup
The handler creates a linked cancellation token (`streamCts`) so that completing the consumer (client disconnect, error, or graceful end-of-stream) also cancels the producer. The `finally` block cancels the source, disposes the subscriber slot, awaits the producer (swallowing the expected cancellation), and emits `StreamDisconnected("Detached")` so dashboards see the disconnection regardless of cause.
`WorkerClientException` thrown by the producer marks the session as faulted before completing the channel — the worker is presumed gone, and any subsequent command on that session must observe the fault rather than silently retry.
## Authorization Interceptor Integration
Authorization is applied as a gRPC interceptor, registered in `GrpcAuthorizationServiceCollectionExtensions`:
```csharp
services.AddSingleton<GatewayGrpcAuthorizationInterceptor>();
services.AddGrpc(options => options.Interceptors.Add<GatewayGrpcAuthorizationInterceptor>());
```
Because the interceptor runs before any handler, `MxAccessGatewayService` can safely assume the call has been authorized and that `IGatewayRequestIdentityAccessor.Current` is populated. The handler's only responsibility is to read the identity for `OpenSession` so the session is owned by the authenticated principal; it does not perform any authorization checks of its own. See [Authorization](./Authorization.md) for the policy and identity model.
## Related Documentation
- [Contracts](./Contracts.md)
- [Sessions](./Sessions.md)
- [Authorization](./Authorization.md)
- [Gateway Process Design](./GatewayProcessDesign.md)