Add Galaxy repository API and clients
@@ -0,0 +1,266 @@
# Gateway Authentication

The gateway authentication subsystem verifies inbound API key credentials against a SQLite-backed key store, hashes secrets with a configurable pepper, and records administrative and verification events to an audit trail.

## Token Format

API keys travel in the HTTP `Authorization` header as a bearer token shaped `mxgw_<keyId>_<secret>`. The `mxgw_` prefix scopes parsing to gateway tokens, the `<keyId>` segment is the public identifier used for lookup, and `<secret>` is the high-entropy portion that the gateway verifies against a stored hash.

`ApiKeyParser` enforces the format and rejects malformed tokens before any database round-trip:

```csharp
public bool TryParseAuthorizationHeader(string? authorizationHeader, out ParsedApiKey? apiKey)
{
    apiKey = null;

    if (string.IsNullOrWhiteSpace(authorizationHeader)
        || !authorizationHeader.StartsWith(BearerPrefix, StringComparison.OrdinalIgnoreCase))
    {
        return false;
    }

    string token = authorizationHeader[BearerPrefix.Length..].Trim();

    if (!token.StartsWith(TokenPrefix, StringComparison.OrdinalIgnoreCase))
    {
        return false;
    }
```

A successful parse produces a `ParsedApiKey(KeyId, Secret)` record. The `IApiKeyParser` interface exists so verification consumers can be tested without depending on header-format details.
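
The parser excerpt above stops before the key id and secret are separated. A minimal sketch of that step follows, assuming the token is split on the first underscore after the `mxgw_` prefix. `TrySplitToken` is a hypothetical helper written for illustration, not the gateway's `ApiKeyParser`. Splitting on the first separator matters because secrets are URL-safe base64 and may contain `_`, while key ids may not.

```csharp
using System;

// Hypothetical helper (not the real ApiKeyParser): splits "mxgw_<keyId>_<secret>".
static bool TrySplitToken(string token, out string keyId, out string secret)
{
    keyId = string.Empty;
    secret = string.Empty;

    const string tokenPrefix = "mxgw_";
    if (!token.StartsWith(tokenPrefix, StringComparison.OrdinalIgnoreCase))
    {
        return false;
    }

    // Split only on the first underscore after the prefix: the secret is
    // URL-safe base64 and may itself contain '_' characters.
    string remainder = token[tokenPrefix.Length..];
    int separator = remainder.IndexOf('_');
    if (separator <= 0 || separator == remainder.Length - 1)
    {
        return false; // missing separator, empty key id, or empty secret
    }

    keyId = remainder[..separator];
    secret = remainder[(separator + 1)..];
    return true;
}

bool parsed = TrySplitToken("mxgw_ops.alice_abc_DEF-123", out string parsedKeyId, out string parsedSecret);
bool rejected = !TrySplitToken("mxgw_onlykeyid", out _, out _);
Console.WriteLine($"{parsed} {parsedKeyId} {parsedSecret} {rejected}");
```

The second call shows the reject path: a token with no secret segment fails before any lookup would be attempted.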

## Parsing and Secrets

### Secret generation

`ApiKeySecretGenerator.Generate()` is the single source of new secret material. It draws 32 bytes from `RandomNumberGenerator.Fill` and encodes them with URL-safe base64 (no padding) so secrets can be embedded in headers without escaping:

```csharp
public static string Generate()
{
    Span<byte> bytes = stackalloc byte[32];
    RandomNumberGenerator.Fill(bytes);

    return Convert.ToBase64String(bytes)
        .TrimEnd('=')
        .Replace('+', '-')
        .Replace('/', '_');
}
```

### Peppered hashing

`ApiKeySecretHasher` (registered behind `IApiKeySecretHasher`) hashes secrets with `HMACSHA256` keyed by a server-side pepper. The pepper lives outside the database and is resolved by `IConfiguration` lookup against the configured `PepperSecretName`:

```csharp
public byte[] HashSecret(string secret)
{
    string pepper = GetPepper();
    byte[] pepperBytes = Encoding.UTF8.GetBytes(pepper);
    byte[] secretBytes = Encoding.UTF8.GetBytes(secret);

    using HMACSHA256 hmac = new(pepperBytes);

    return hmac.ComputeHash(secretBytes);
}
```

The pepper is intentionally not stored alongside the hash: an attacker who exfiltrates only the SQLite file holds the hashes but lacks the keying material needed to brute-force candidate secrets, even if the hash algorithm is known. If the pepper is missing, the hasher throws `ApiKeyPepperUnavailableException`, which the verifier converts to a distinct failure code rather than treating it as a credential mismatch.
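
The effect of the pepper can be seen in a small sketch. It reuses the documented HMAC-SHA256 shape, but `HashWithPepper` and both pepper strings are made up for illustration. The same secret hashed under a different pepper produces an unrelated digest, so a stored hash cannot be checked without the configured pepper.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Illustrative helper mirroring the documented HashSecret shape; the pepper
// is passed in directly instead of being resolved from configuration.
static byte[] HashWithPepper(string pepper, string secret)
{
    using HMACSHA256 hmac = new(Encoding.UTF8.GetBytes(pepper));
    return hmac.ComputeHash(Encoding.UTF8.GetBytes(secret));
}

byte[] stored = HashWithPepper("pepper-from-config", "example-secret");
byte[] samePepper = HashWithPepper("pepper-from-config", "example-secret");
byte[] wrongPepper = HashWithPepper("attacker-guess", "example-secret");

// FixedTimeEquals compares in time that depends on length, not on content.
bool matchesWithPepper = CryptographicOperations.FixedTimeEquals(stored, samePepper);
bool matchesWithoutPepper = CryptographicOperations.FixedTimeEquals(stored, wrongPepper);

Console.WriteLine($"same pepper matches: {matchesWithPepper}");
Console.WriteLine($"wrong pepper matches: {matchesWithoutPepper}");
```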

## Verification

`ApiKeyVerifier` (`IApiKeyVerifier`) implements the verification flow:

1. Parse the `Authorization` header into a `ParsedApiKey`.
2. Look up the `ApiKeyRecord` by `KeyId` through `IApiKeyStore.FindByKeyIdAsync`.
3. Reject revoked records (`RevokedUtc is not null`).
4. Hash the presented secret with the configured pepper.
5. Compare hashes with `CryptographicOperations.FixedTimeEquals` to avoid timing oracles.
6. Record a `LastUsedUtc` timestamp via `MarkKeyUsedAsync` and return an `ApiKeyIdentity`.

```csharp
if (!CryptographicOperations.FixedTimeEquals(presentedHash, storedKey.SecretHash))
{
    return ApiKeyVerificationResult.Fail(ApiKeyVerificationFailure.SecretMismatch);
}

await keyStore.MarkKeyUsedAsync(storedKey.KeyId, DateTimeOffset.UtcNow, cancellationToken)
    .ConfigureAwait(false);

return ApiKeyVerificationResult.Success(new ApiKeyIdentity(
    KeyId: storedKey.KeyId,
    KeyPrefix: storedKey.KeyPrefix,
    DisplayName: storedKey.DisplayName,
    Scopes: storedKey.Scopes));
```

`ApiKeyVerificationResult` carries either an `ApiKeyIdentity` or a discriminated `ApiKeyVerificationFailure` value. The failure enum distinguishes parse errors, missing pepper, missing or revoked keys, and secret mismatch so the calling middleware can emit precise audit detail without leaking which check failed to the client.

`ApiKeyIdentity` exposes only non-secret fields (`KeyId`, `KeyPrefix`, `DisplayName`, `Scopes`) and is the type downstream authorization code consumes.

## Storage

The gateway keeps API key state in a dedicated SQLite database. SQLite is sufficient because credential volume is small, the gateway runs as a single process, and the file is straightforward to back up and rotate independently of the main application data.

### Connection factory

`AuthSqliteConnectionFactory` reads `GatewayOptions.Authentication.SqlitePath`, ensures the parent directory exists, and opens the connection in `ReadWriteCreate` mode so first-run installations can create the file without manual provisioning:

```csharp
public SqliteConnection CreateConnection()
{
    string sqlitePath = options.Value.Authentication.SqlitePath;
    string? directory = Path.GetDirectoryName(sqlitePath);

    if (!string.IsNullOrWhiteSpace(directory))
    {
        Directory.CreateDirectory(directory);
    }

    SqliteConnectionStringBuilder builder = new()
    {
        DataSource = sqlitePath,
        Mode = SqliteOpenMode.ReadWriteCreate
    };

    return new SqliteConnection(builder.ToString());
}
```

### Schema

`SqliteAuthSchema` declares table names and the current schema version as constants. Three tables are involved:

- `api_keys` stores `key_id`, `key_prefix`, the `secret_hash` blob, `display_name`, serialized `scopes`, and the `created_utc`, `last_used_utc`, and `revoked_utc` timestamps.
- `api_key_audit` is an append-only log keyed by an autoincrement `audit_id` with `key_id`, `event_type`, `remote_address`, `created_utc`, and `details` columns.
- `schema_version` carries a single row whose `version` column is matched against `SqliteAuthSchema.CurrentVersion`.

### Read paths

`SqliteApiKeyStore` (`IApiKeyStore`) handles the two reads needed at request time: `FindByKeyIdAsync` returns any record (so revoked keys can be reported distinctly) and `FindActiveByKeyIdAsync` filters to non-revoked rows. `MarkKeyUsedAsync` updates `last_used_utc` only for non-revoked rows so a freshly revoked key cannot have its timestamp refreshed by a racing verification.

`ApiKeyRecord` is the in-memory projection. `ApiKeyRecordReader.Read` is shared by every read path so column ordering is defined in one place:

```csharp
public static ApiKeyRecord Read(SqliteDataReader reader)
{
    return new ApiKeyRecord(
        KeyId: reader.GetString(0),
        KeyPrefix: reader.GetString(1),
        SecretHash: (byte[])reader["secret_hash"],
        DisplayName: reader.GetString(3),
        Scopes: ApiKeyScopeSerializer.Deserialize(reader.GetString(4)),
        CreatedUtc: DateTimeOffset.Parse(reader.GetString(5), System.Globalization.CultureInfo.InvariantCulture),
        LastUsedUtc: ReadNullableDateTimeOffset(reader, 6),
        RevokedUtc: ReadNullableDateTimeOffset(reader, 7));
}
```

### Write paths

`SqliteApiKeyAdminStore` (`IApiKeyAdminStore`) implements administrative mutations: `CreateAsync` accepts an `ApiKeyCreateRequest`, `RevokeAsync` sets `revoked_utc` only when not already revoked, and `RotateAsync` replaces `secret_hash`, clears `last_used_utc`, and clears `revoked_utc` so a rotated key is immediately usable.

### Audit trail

`SqliteApiKeyAuditStore` (`IApiKeyAuditStore`) appends `ApiKeyAuditEntry` values to the `api_key_audit` table and stamps each row with a UTC timestamp inside the store rather than trusting the caller. `ListRecentAsync` returns the most recent rows ordered by `audit_id` descending and projects them into `ApiKeyAuditRecord`. Rows are kept even after the referenced key is revoked because the audit history is the durable record of administrative action; the `key_id` column is nullable to accommodate non-key-scoped events such as `init-db`.

## Migration

Schema bring-up is centralised behind `IAuthStoreMigrator`. `SqliteAuthStoreMigrator` executes the migration inside a single transaction so a partial failure leaves the database untouched, refuses to start when the on-disk schema version is newer than the binary supports, and idempotently creates the v1 schema:

```csharp
if (existingVersion > SqliteAuthSchema.CurrentVersion)
{
    throw new AuthStoreMigrationException(
        $"Auth database schema version {existingVersion} is newer than supported version {SqliteAuthSchema.CurrentVersion}.");
}

await ApplyVersionOneAsync(connection, transaction, cancellationToken).ConfigureAwait(false);

await transaction.CommitAsync(cancellationToken).ConfigureAwait(false);
```

`AuthStoreMigrationHostedService` runs the migrator at startup, but only when API-key authentication is enabled and `RunMigrationsOnStartup` is true. Operators who manage schema out-of-band can disable the hosted run and use the admin CLI's `init-db` command instead.

`AuthStoreMigrationException` is a sealed `InvalidOperationException` so it can be caught precisely without swallowing unrelated failures.

## Admin CLI

`ApiKeyAdminCommandLineParser.Parse` recognises a leading `apikey` argument and dispatches to one of the subcommands declared by `ApiKeyAdminCommandKind`. Each parsed invocation produces an `ApiKeyAdminCommand` (or an `ApiKeyAdminParseResult` carrying an error). `ApiKeyAdminCliRunner` then executes the command: it runs the migrator first, calls the relevant store method, appends an audit row, and writes either text or JSON output via `ApiKeyAdminOutput`. The returned `ApiKeyAdminListedKey` projection deliberately omits the `secret_hash` so listing a database does not surface hash material.

The supported subcommands match `ApiKeyAdminCommandKind` exactly:

| Subcommand | Required options | Behaviour |
|------------|------------------|-----------|
| `init-db` | none | Runs the migrator and records an audit entry. |
| `create-key` | `--key-id`, `--display-name` | Generates a new secret, stores its peppered hash, and prints the assembled `mxgw_<keyId>_<secret>` token. |
| `list-keys` | none | Lists every stored key with its scopes and revocation state. |
| `revoke-key` | `--key-id` | Sets `revoked_utc` if the key is currently active. |
| `rotate-key` | `--key-id` | Replaces the secret hash and prints the new token. |

Examples:

```bash
mxgateway apikey init-db
mxgateway apikey create-key --key-id ops.alice --display-name "Alice (ops)" --scopes read,write
mxgateway apikey list-keys --json
mxgateway apikey revoke-key --key-id ops.alice
mxgateway apikey rotate-key --key-id ops.alice
```

Key ids are restricted by the parser to ASCII letters, digits, periods, and hyphens so they remain safe to embed in the token format and in URL paths used by administrative tooling.
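
The character rule can be sketched as a small predicate. `IsValidKeyId` is a hypothetical helper; the parser's real validation code is not shown in this document, and rejecting empty ids is an assumption. Note that the allowed set excludes `_`, which keeps the underscore separators in `mxgw_<keyId>_<secret>` unambiguous.

```csharp
using System;
using System.Linq;

// Hypothetical validator for the documented rule: ASCII letters, digits,
// periods, and hyphens only. Rejecting empty ids is an assumption.
static bool IsValidKeyId(string? keyId)
{
    return !string.IsNullOrEmpty(keyId)
        && keyId.All(c =>
            c is (>= 'a' and <= 'z')
              or (>= 'A' and <= 'Z')
              or (>= '0' and <= '9')
              or '.'
              or '-');
}

bool acceptsDotted = IsValidKeyId("ops.alice");
bool rejectsUnderscore = !IsValidKeyId("ops_alice");
bool rejectsSpace = !IsValidKeyId("ops alice");
bool rejectsEmpty = !IsValidKeyId("");

Console.WriteLine($"{acceptsDotted} {rejectsUnderscore} {rejectsSpace} {rejectsEmpty}");
```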

## Scope Serialization

Scopes are persisted as a single TEXT column rather than a join table because the set is small, never queried by membership at the database level, and changes atomically with the owning row. `ApiKeyScopeSerializer.Serialize` writes a JSON array sorted with `StringComparer.Ordinal` so equivalent scope sets produce byte-identical column values, which makes audit diffing and database comparisons deterministic:

```csharp
public static string Serialize(IReadOnlySet<string> scopes)
{
    return JsonSerializer.Serialize(scopes.Order(StringComparer.Ordinal));
}

public static IReadOnlySet<string> Deserialize(string value)
{
    if (string.IsNullOrWhiteSpace(value))
    {
        return new HashSet<string>(StringComparer.Ordinal);
    }

    string[]? scopes = JsonSerializer.Deserialize<string[]>(value);

    return new HashSet<string>(scopes ?? [], StringComparer.Ordinal);
}
```

`Deserialize` tolerates an empty column by returning an empty set so older rows or hand-edited records do not crash the verifier.
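
The determinism claim can be checked with a short sketch. It is not the gateway's serializer: `SerializeScopes` is an illustrative stand-in that uses `OrderBy` with an ordinal comparer in place of the documented `Order(...)` call, and the scope names are only examples. Two sets built in different insertion orders serialize to the same string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Illustrative stand-in for ApiKeyScopeSerializer.Serialize.
static string SerializeScopes(IEnumerable<string> scopes)
{
    return JsonSerializer.Serialize(scopes.OrderBy(s => s, StringComparer.Ordinal).ToArray());
}

HashSet<string> first = new(StringComparer.Ordinal) { "invoke:write", "events:read", "admin" };
HashSet<string> second = new(StringComparer.Ordinal) { "admin", "invoke:write", "events:read" };

string firstJson = SerializeScopes(first);
string secondJson = SerializeScopes(second);

// Round-trip back into a set to confirm no scope is lost or duplicated.
string[] decoded = JsonSerializer.Deserialize<string[]>(firstJson) ?? Array.Empty<string>();
bool roundTrips = new HashSet<string>(decoded, StringComparer.Ordinal).SetEquals(first);

Console.WriteLine(firstJson);
Console.WriteLine($"identical: {firstJson == secondJson}, round-trips: {roundTrips}");
```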

## Registration

`AuthStoreServiceCollectionExtensions.AddSqliteAuthStore` wires every service in this subsystem as a singleton and registers the migration hosted service:

```csharp
public static IServiceCollection AddSqliteAuthStore(this IServiceCollection services)
{
    services.AddSingleton<IApiKeyParser, ApiKeyParser>();
    services.AddSingleton<IApiKeySecretHasher, ApiKeySecretHasher>();
    services.AddSingleton<IApiKeyVerifier, ApiKeyVerifier>();
    services.AddSingleton<ApiKeyAdminCliRunner>();
    services.AddSingleton<AuthSqliteConnectionFactory>();
    services.AddSingleton<IAuthStoreMigrator, SqliteAuthStoreMigrator>();
    services.AddSingleton<IApiKeyStore, SqliteApiKeyStore>();
    services.AddSingleton<IApiKeyAdminStore, SqliteApiKeyAdminStore>();
    services.AddSingleton<IApiKeyAuditStore, SqliteApiKeyAuditStore>();
    services.AddHostedService<AuthStoreMigrationHostedService>();

    return services;
}
```

Singletons are safe because each operation opens its own short-lived `SqliteConnection` through the factory; there is no shared mutable state inside the services.

## Related Documentation

- [Gateway Configuration](./GatewayConfiguration.md)
- [Authorization](./Authorization.md)
- [Diagnostics](./Diagnostics.md)
@@ -0,0 +1,214 @@
# Gateway gRPC Authorization

The authorization subsystem enforces per-RPC scope checks against the authenticated `ApiKeyIdentity` produced by the authentication layer, so service implementations never need to repeat permission logic.

## Overview

Authorization runs as a single gRPC server interceptor registered for every call on the gateway. It pulls the authenticated identity for the current request, derives the scope that the request type requires, and either lets the call continue or fails the call with a gRPC status. The pipeline keeps service classes free of cross-cutting checks, which matches the AGENTS.md "thin gRPC layer" rule that service handlers translate between contracts and domain code without owning policy.

The participating types live under `src/MxGateway.Server/Security/Authorization/`:

- `GatewayGrpcAuthorizationInterceptor` runs the authenticate-then-authorize pipeline for unary and server-streaming calls.
- `GatewayGrpcScopeResolver` maps a request message (and, for `MxCommandRequest`, the inner `MxCommandKind`) to the scope string that must be present on the caller.
- `GatewayScopes` exposes the canonical scope constants used by the resolver and any downstream consumer.
- `GatewayRequestIdentityAccessor` and `IGatewayRequestIdentityAccessor` expose the verified identity to handlers and any service code that runs inside the call.
- `GrpcAuthorizationServiceCollectionExtensions` wires the components into the DI container and the gRPC pipeline.

The `ApiKeyIdentity` consumed here is produced by the authentication layer; see [Authentication](./Authentication.md) for how it is built and how scopes are persisted.

## Why an Interceptor

Centralizing the policy in `GatewayGrpcAuthorizationInterceptor` produces three concrete benefits:

1. Every RPC defined in `MxAccessGatewayService` is covered by construction. Until a new RPC's request type is added to `GatewayGrpcScopeResolver`, the resolver's fallback requires the `admin` scope, so coverage never depends on each service method remembering to call an authorization helper.
2. The service class stays a thin translator between proto contracts and domain calls. RPC methods do not branch on identity or scope, which upholds the AGENTS.md guideline that gRPC handlers contain no policy.
3. Authentication and authorization happen in one place, so the gRPC `Status` mapping is consistent. A failed key check always returns `Unauthenticated`, and a missing scope always returns `PermissionDenied` with the offending scope name.

## Interceptor Flow

`GatewayGrpcAuthorizationInterceptor` overrides both `UnaryServerHandler` and `ServerStreamingServerHandler`. Both call the same private `AuthenticateAndAuthorizeAsync` helper, push the resolved identity onto the accessor, and only then invoke the continuation, so the identity stays available for the duration of the call.

```csharp
public override async Task<TResponse> UnaryServerHandler<TRequest, TResponse>(
    TRequest request,
    ServerCallContext context,
    UnaryServerMethod<TRequest, TResponse> continuation)
{
    ApiKeyIdentity? identity = await AuthenticateAndAuthorizeAsync(request, context).ConfigureAwait(false);
    IDisposable? identityScope = identity is null ? null : identityAccessor.Push(identity);
    using (identityScope)
    {
        return await continuation(request, context).ConfigureAwait(false);
    }
}
```

The shared helper performs the actual decision:

```csharp
if (options.Value.Authentication.Mode == AuthenticationMode.Disabled)
{
    return null;
}

string? authorizationHeader = context.RequestHeaders.GetValue("authorization");
ApiKeyVerificationResult verificationResult = await apiKeyVerifier
    .VerifyAsync(authorizationHeader, context.CancellationToken)
    .ConfigureAwait(false);

if (!verificationResult.Succeeded || verificationResult.Identity is null)
{
    throw new RpcException(new Status(
        StatusCode.Unauthenticated,
        "Missing or invalid API key."));
}

string requiredScope = scopeResolver.ResolveRequiredScope(request);
if (!verificationResult.Identity.Scopes.Contains(requiredScope))
{
    throw new RpcException(new Status(
        StatusCode.PermissionDenied,
        $"API key is missing required scope '{requiredScope}'."));
}

return verificationResult.Identity;
```

The flow is:

1. If `GatewayOptions.Authentication.Mode` is `AuthenticationMode.Disabled`, the helper returns `null` immediately. No identity is pushed onto the accessor and the continuation runs without scope enforcement. This matches the `AuthenticationMode` enum, which only defines `ApiKey` and `Disabled`.
2. Otherwise, the `authorization` request header is read directly off `ServerCallContext.RequestHeaders` and handed to `IApiKeyVerifier.VerifyAsync`. A failed verification or a missing identity throws `RpcException` with `StatusCode.Unauthenticated`.
3. `GatewayGrpcScopeResolver.ResolveRequiredScope(request)` produces the scope string. If the identity's `Scopes` set does not contain it, the helper throws `RpcException` with `StatusCode.PermissionDenied` and embeds the missing scope name in `Status.Detail` so callers can diagnose the failure.
4. On success, the verified `ApiKeyIdentity` is returned and pushed onto `IGatewayRequestIdentityAccessor` for the lifetime of the call.

The status codes are deliberately distinct: `Unauthenticated` signals "we do not know who you are," and `PermissionDenied` signals "we know who you are, but you cannot do this." Treating the two as the same code would make troubleshooting harder for client implementations.
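
The order of the checks can be sketched as a pure function. This is a minimal sketch rather than the interceptor itself: `Decide` is hypothetical, its string results stand in for `RpcException` and the gRPC `StatusCode` values so the example needs no gRPC reference, and a `null` scope set stands in for a failed key verification.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the interceptor's decision order.
// verifiedScopes == null models a failed or missing key verification.
static string Decide(bool authenticationDisabled, IReadOnlySet<string>? verifiedScopes, string requiredScope)
{
    if (authenticationDisabled)
    {
        return "Ok"; // no identity is produced and no scope is enforced
    }

    if (verifiedScopes is null)
    {
        return "Unauthenticated"; // caller identity is unknown
    }

    // Caller is known; only the permission is in question.
    return verifiedScopes.Contains(requiredScope) ? "Ok" : "PermissionDenied";
}

IReadOnlySet<string> readerScopes = new HashSet<string>(StringComparer.Ordinal) { "invoke:read" };

string noKey = Decide(false, null, "invoke:read");
string allowed = Decide(false, readerScopes, "invoke:read");
string denied = Decide(false, readerScopes, "invoke:write");
string disabled = Decide(true, null, "admin");

Console.WriteLine($"{noKey}, {allowed}, {denied}, {disabled}");
```

Keeping the authentication check ahead of the scope check is what lets a client tell a bad key apart from a key that is merely under-scoped.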

## Scope Resolution

`GatewayGrpcScopeResolver` is a stateless singleton that switches on the runtime request type. Top-level RPC requests map directly:

```csharp
public string ResolveRequiredScope(object request)
{
    return request switch
    {
        OpenSessionRequest => GatewayScopes.SessionOpen,
        CloseSessionRequest => GatewayScopes.SessionClose,
        StreamEventsRequest => GatewayScopes.EventsRead,
        MxCommandRequest commandRequest => ResolveCommandScope(commandRequest.Command?.Kind ?? MxCommandKind.Unspecified),
        _ => GatewayScopes.Admin
    };
}
```

The `_ => GatewayScopes.Admin` fallback is intentional: any future request type that the resolver does not recognize fails closed, requiring the strongest scope until the resolver is updated.

`MxCommandRequest` is special because it multiplexes many MxAccess operations through a single RPC. The resolver inspects the embedded `MxCommandKind` so each operation gets its own scope:

```csharp
private static string ResolveCommandScope(MxCommandKind kind)
{
    return kind switch
    {
        MxCommandKind.Write or
        MxCommandKind.Write2 => GatewayScopes.InvokeWrite,

        MxCommandKind.WriteSecured or
        MxCommandKind.WriteSecured2 or
        MxCommandKind.AuthenticateUser => GatewayScopes.InvokeSecure,

        MxCommandKind.ArchestraUserToId or
        MxCommandKind.GetSessionState or
        MxCommandKind.GetWorkerInfo => GatewayScopes.MetadataRead,

        MxCommandKind.DrainEvents => GatewayScopes.EventsRead,
        MxCommandKind.ShutdownWorker => GatewayScopes.Admin,

        _ => GatewayScopes.InvokeRead
    };
}
```

Read-style kinds (`Register`, `AddItem`, `Advise`, and any kind not listed above, including `MxCommandKind.Unspecified`) fall through to `InvokeRead`, which keeps the matrix small while still separating reads from writes, secured writes, metadata lookups, event drains, and worker shutdown.

## Scope Catalog

`GatewayScopes` is the single source of truth for scope strings. Every entry is currently mapped by either the resolver or another security component:

| Constant | Value | Required For |
|----------|-------|--------------|
| `SessionOpen` | `session:open` | `OpenSessionRequest` |
| `SessionClose` | `session:close` | `CloseSessionRequest` |
| `EventsRead` | `events:read` | `StreamEventsRequest`, `MxCommandKind.DrainEvents` |
| `InvokeRead` | `invoke:read` | `MxCommandRequest` for read-style command kinds (`Register`, `AddItem`, `Advise`, and any kind not otherwise mapped) |
| `InvokeWrite` | `invoke:write` | `MxCommandKind.Write`, `MxCommandKind.Write2` |
| `InvokeSecure` | `invoke:secure` | `MxCommandKind.WriteSecured`, `MxCommandKind.WriteSecured2`, `MxCommandKind.AuthenticateUser` |
| `MetadataRead` | `metadata:read` | `MxCommandKind.ArchestraUserToId`, `MxCommandKind.GetSessionState`, `MxCommandKind.GetWorkerInfo`, `GalaxyRepository.TestConnection`, `GalaxyRepository.GetLastDeployTime`, `GalaxyRepository.DiscoverHierarchy`, `GalaxyRepository.WatchDeployEvents` |
| `Admin` | `admin` | `MxCommandKind.ShutdownWorker`, the default for any unrecognized request type, and the dashboard authorization policy |

The `Admin` constant is also referenced by `DashboardAuthenticator` and `DashboardAuthorizationHandler` so that the dashboard and the gRPC layer agree on what "admin" means.

## Identity Access for Downstream Layers

Once authorization passes, `GatewayGrpcAuthorizationInterceptor` calls `identityAccessor.Push(identity)` and disposes the returned scope when the continuation completes. `GatewayRequestIdentityAccessor` stores the active identity in an `AsyncLocal<ApiKeyIdentity?>`, so the value flows across `await` boundaries and child tasks belonging to the same request.

```csharp
public sealed class GatewayRequestIdentityAccessor : IGatewayRequestIdentityAccessor
{
    private readonly AsyncLocal<ApiKeyIdentity?> currentIdentity = new();

    public ApiKeyIdentity? Current => currentIdentity.Value;

    public IDisposable Push(ApiKeyIdentity identity)
    {
        ArgumentNullException.ThrowIfNull(identity);

        ApiKeyIdentity? previousIdentity = currentIdentity.Value;
        currentIdentity.Value = identity;

        return new IdentityScope(this, previousIdentity);
    }
}
```

The returned `IdentityScope` restores the previous value on dispose rather than clearing it. This makes the accessor safe for nested pushes, even though the current interceptor only pushes once per call. Disposing twice is a no-op because of the `disposed` guard inside `IdentityScope`.
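
The restore-on-dispose behaviour can be illustrated without the gateway types. In this sketch a plain string stands in for `ApiKeyIdentity`, and the hypothetical local `Push` returns a restore delegate in place of the `IdentityScope` class, whose source is not shown here. The point is only the nesting semantics: each restore puts back whatever was current before its own push, and a repeated restore does nothing.

```csharp
using System;
using System.Threading;

AsyncLocal<string?> current = new();

// Hypothetical stand-in for Push/IdentityScope: the returned delegate plays
// the role of Dispose and restores the previous value instead of clearing it.
Action Push(string identity)
{
    string? previous = current.Value;
    current.Value = identity;
    bool disposed = false;

    return () =>
    {
        if (disposed)
        {
            return; // repeated restore is ignored, like the disposed guard
        }

        disposed = true;
        current.Value = previous;
    };
}

Action outer = Push("key-a");
Action inner = Push("key-b");
string? duringInner = current.Value; // innermost push is visible

inner();
string? afterInner = current.Value;  // outer identity restored, not cleared

inner();                             // second restore has no effect
outer();
string? afterOuter = current.Value;  // back to the initial empty state

string shownAfterOuter = afterOuter ?? "(none)";
Console.WriteLine($"{duringInner} -> {afterInner} -> {shownAfterOuter}");
```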

Downstream code consumes the accessor through the `IGatewayRequestIdentityAccessor` interface:

```csharp
public interface IGatewayRequestIdentityAccessor
{
    ApiKeyIdentity? Current { get; }

    IDisposable Push(ApiKeyIdentity identity);
}
```

`MxAccessGatewayService` takes `IGatewayRequestIdentityAccessor` as a constructor dependency and reads `Current` whenever it needs to attach the calling identity to a domain operation, which keeps the service free of header parsing or scope checks.

When `AuthenticationMode.Disabled` is configured, no identity is pushed, so `Current` returns `null`. Downstream code must tolerate that, just as it tolerates the absence of a scope check.

## Registration

`GrpcAuthorizationServiceCollectionExtensions.AddGatewayGrpcAuthorization` is the single entry point that registers every component and inserts the interceptor into the gRPC pipeline:

```csharp
public static IServiceCollection AddGatewayGrpcAuthorization(this IServiceCollection services)
{
    services.AddSingleton<GatewayGrpcScopeResolver>();
    services.AddSingleton<IGatewayRequestIdentityAccessor, GatewayRequestIdentityAccessor>();
    services.AddSingleton<GatewayGrpcAuthorizationInterceptor>();
    services.AddGrpc(options => options.Interceptors.Add<GatewayGrpcAuthorizationInterceptor>());

    return services;
}
```

Singleton lifetimes are appropriate because none of the three classes hold per-request state on instance fields; the request-scoped value lives inside the `AsyncLocal` on `GatewayRequestIdentityAccessor`. `GatewayApplication` calls `builder.Services.AddGatewayGrpcAuthorization()` during startup, and the call also performs `AddGrpc`, so the gateway never registers gRPC without the interceptor attached.

## Related Documentation

- [Authentication](./Authentication.md)
- [Grpc](./Grpc.md)
- [GatewayConfiguration](./GatewayConfiguration.md)
- [Galaxy Repository Browse](./GalaxyRepository.md)
@@ -28,6 +28,13 @@ worker IPC envelope and control messages. It imports
`mxaccess_gateway.proto` so the worker and gateway use the same command, reply,
event, value, and status shapes.

`src/MxGateway.Contracts/Protos/galaxy_repository.proto` defines the
`GalaxyRepository` service used by clients to browse the Galaxy Repository
(deployed object hierarchy and dynamic attributes). The service is
metadata-only and does not share types with `mxaccess_gateway.proto`. See
[Galaxy Repository Browse](./GalaxyRepository.md) for the RPC catalog and
behavior.

Generated C# output is written to `src/MxGateway.Contracts/Generated/`. Do not
hand-edit generated files.

@@ -0,0 +1,222 @@
# Gateway Diagnostics

The diagnostics subsystem provides structured logging, credential redaction, and request-scoped log enrichment for the gateway. It lives under `src/MxGateway.Server/Diagnostics/` and is wired into the ASP.NET Core pipeline so every gRPC and HTTP request carries the same correlation fields.

## Goals

The subsystem exists to satisfy two security rules from `AGENTS.md`: never log passwords or raw credential values for `AuthenticateUser`, `WriteSecured`, or related secured operations, and never log full MXAccess values by default. Code paths that touch credentials or tag values must therefore route through `GatewayLogRedactor` rather than emitting them directly.

A second goal is parity-test diagnosability. Because MXAccess sessions, workers, correlation ids, and command methods are the units of comparison, every log entry produced inside a request scope must carry those identifiers without each call site having to format them.

## Log Scopes

`GatewayLogScope` is a record that captures the fields attached to a logger scope. It only emits keys whose values are non-null, so callers can supply just the identifiers they know about:

```csharp
public sealed record GatewayLogScope(
    string? SessionId = null,
    int? WorkerProcessId = null,
    ulong? CorrelationId = null,
    string? CommandMethod = null,
    string? ClientIdentity = null)
{
    public IReadOnlyDictionary<string, object?> ToDictionary()
    {
        Dictionary<string, object?> values = [];

        AddIfPresent(values, "SessionId", SessionId);
        AddIfPresent(values, "WorkerProcessId", WorkerProcessId);
        AddIfPresent(values, "CorrelationId", CorrelationId);
        AddIfPresent(values, "CommandMethod", CommandMethod);
        AddIfPresent(values, "ClientIdentity", GatewayLogRedactor.RedactClientIdentity(ClientIdentity));

        return values;
    }
```

`ClientIdentity` is passed through `GatewayLogRedactor.RedactClientIdentity` inside `ToDictionary` rather than at the call site. This guarantees that any logger scope built from a `GatewayLogScope` cannot accidentally surface a raw API key, even when a caller forgets to redact before constructing the scope.
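
The excerpt calls an `AddIfPresent` helper whose body is not shown. One plausible shape, offered only as an assumption about how null values might be skipped, is sketched below with made-up field values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical version of AddIfPresent; the real helper is not shown in
// this document. Null values (including null nullable value types, which
// box to null) are skipped so the key never appears in the scope.
static void AddIfPresent(Dictionary<string, object?> values, string key, object? value)
{
    if (value is not null)
    {
        values[key] = value;
    }
}

Dictionary<string, object?> scopeValues = new();

int? workerProcessId = null;   // unknown at this call site
ulong? correlationId = 7UL;

AddIfPresent(scopeValues, "SessionId", "session-42");
AddIfPresent(scopeValues, "WorkerProcessId", workerProcessId);
AddIfPresent(scopeValues, "CorrelationId", correlationId);

Console.WriteLine($"keys emitted: {scopeValues.Count}");
```

Omitting absent keys, rather than writing them with null values, keeps structured log output free of empty fields that downstream queries would otherwise have to filter out.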

### How scopes are pushed

`GatewayLoggerExtensions` exposes a single method that converts a `GatewayLogScope` into the dictionary form expected by `ILogger.BeginScope`:

```csharp
public static class GatewayLoggerExtensions
{
    public static IDisposable? BeginGatewayScope(
        this ILogger logger,
        GatewayLogScope scope)
    {
        ArgumentNullException.ThrowIfNull(logger);
        ArgumentNullException.ThrowIfNull(scope);

        return logger.BeginScope(scope.ToDictionary());
    }
}
```

The returned `IDisposable?` follows the standard `BeginScope` contract: callers wrap it in a `using` to bound the scope to a request, command, or worker interaction.
|
||||
|
||||
## Redaction Rules
|
||||
|
||||
`GatewayLogRedactor` centralizes every redaction decision so that policy changes live in one file. Three categories of input are handled differently because each has different "safe to log" prefixes.
|
||||
|
||||
### Sensitive command methods
|
||||
|
||||
A static set names the MXAccess commands that are known to carry credentials in their payloads:
|
||||
|
||||
```csharp
private static readonly HashSet<string> SensitiveCommandMethods = new(StringComparer.OrdinalIgnoreCase)
{
    "AuthenticateUser",
    "WriteSecured",
    "WriteSecured2"
};

public static bool IsCredentialBearingCommand(string? commandMethod)
{
    return commandMethod is not null
        && SensitiveCommandMethods.Contains(commandMethod);
}
```
|
||||
|
||||
The names match the MXAccess command list in `AGENTS.md` exactly. `Write` and `Write2` are not in the set because their payloads are tag values, not credentials, and are governed by the `valueLoggingEnabled` flag described below.
|
||||
|
||||
### API key redaction
|
||||
|
||||
`RedactApiKey` is built around the `mxgw_` API key format issued by the gateway. It preserves the bearer scheme and the key id segment so that operators can correlate a log entry to a specific principal, but always strips the secret tail:
|
||||
|
||||
```csharp
public static string? RedactApiKey(string? authorizationHeader)
{
    if (string.IsNullOrWhiteSpace(authorizationHeader))
    {
        return authorizationHeader;
    }

    const string bearerPrefix = "Bearer ";
    if (!authorizationHeader.StartsWith(bearerPrefix, StringComparison.OrdinalIgnoreCase))
    {
        return RedactedValue;
    }

    string token = authorizationHeader[bearerPrefix.Length..].Trim();

    if (!token.StartsWith("mxgw_", StringComparison.OrdinalIgnoreCase))
    {
        return $"{bearerPrefix}{RedactedValue}";
    }

    string[] tokenParts = token.Split('_', 3, StringSplitOptions.RemoveEmptyEntries);
    if (tokenParts.Length < 2)
    {
        return $"{bearerPrefix}mxgw_{RedactedValue}";
    }

    return $"{bearerPrefix}mxgw_{tokenParts[1]}_{RedactedValue}";
}
```
|
||||
|
||||
The split uses `count: 3` because the secret portion may itself contain underscores; only the first two segments (`mxgw` and the key id) are kept verbatim. Authorization headers that are not bearer tokens are reduced to `[redacted]` rather than passed through, since the gateway cannot reason about their structure.
|
||||
|
||||
`RedactClientIdentity` is the entry point used by `GatewayLogScope` and `DashboardRedactor`. It only invokes `RedactApiKey` when the input contains the `mxgw_` marker, leaving non-key identities (for example, Windows account names) untouched.
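
To make the branches concrete, the following standalone sketch copies the `RedactApiKey` branch structure into a top-level program and runs it over a few sample inputs. The sample key ids, secrets, and account name are made up. The `RedactClientIdentity` body is an assumption based on the description above, not the actual implementation.

```csharp
#nullable enable
using System;

const string RedactedValue = "[redacted]";
const string BearerPrefix = "Bearer ";

// Same branch structure as the RedactApiKey method documented in this section.
string? RedactApiKey(string? authorizationHeader)
{
    if (string.IsNullOrWhiteSpace(authorizationHeader))
    {
        return authorizationHeader;
    }

    if (!authorizationHeader.StartsWith(BearerPrefix, StringComparison.OrdinalIgnoreCase))
    {
        return RedactedValue;
    }

    string token = authorizationHeader[BearerPrefix.Length..].Trim();
    if (!token.StartsWith("mxgw_", StringComparison.OrdinalIgnoreCase))
    {
        return $"{BearerPrefix}{RedactedValue}";
    }

    // count: 3 keeps any underscores inside the secret in the third segment.
    string[] parts = token.Split('_', 3, StringSplitOptions.RemoveEmptyEntries);
    return parts.Length < 2
        ? $"{BearerPrefix}mxgw_{RedactedValue}"
        : $"{BearerPrefix}mxgw_{parts[1]}_{RedactedValue}";
}

// Assumption: identities without the mxgw_ marker pass through unchanged.
string? RedactClientIdentity(string? clientIdentity)
{
    return clientIdentity is not null
        && clientIdentity.Contains("mxgw_", StringComparison.OrdinalIgnoreCase)
            ? RedactApiKey(clientIdentity)
            : clientIdentity;
}

Console.WriteLine(RedactApiKey("Bearer mxgw_key1_abc_def"));  // Bearer mxgw_key1_[redacted]
Console.WriteLine(RedactApiKey("Bearer mxgw_"));              // Bearer mxgw_[redacted]
Console.WriteLine(RedactApiKey("Bearer opaque-token"));       // Bearer [redacted]
Console.WriteLine(RedactApiKey("Basic dXNlcjpwYXNz"));        // [redacted]
Console.WriteLine(RedactClientIdentity(@"PLANT\operator"));   // PLANT\operator
```

The first call shows why the split count matters: the secret `abc_def` contains an underscore, yet only the key id `key1` survives.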
|
||||
|
||||
### Command value redaction
|
||||
|
||||
`RedactCommandValue` enforces the "values are opt-in and redacted by default" rule:
|
||||
|
||||
```csharp
public static object? RedactCommandValue(
    string? commandMethod,
    object? value,
    bool valueLoggingEnabled = false)
{
    if (value is null)
    {
        return null;
    }

    if (!valueLoggingEnabled || IsCredentialBearingCommand(commandMethod))
    {
        return RedactedValue;
    }

    return value;
}
```
|
||||
|
||||
Two rules combine here. First, when `valueLoggingEnabled` is `false` (the default), every value is replaced with `[redacted]`. Second, even when value logging is enabled, credential-bearing commands still redact. The credential check is therefore unconditional and cannot be overridden by configuration.
|
||||
|
||||
The shared `RedactedValue` constant is `"[redacted]"`. `DashboardRedactor` reuses it so that gateway logs and dashboard renders use the same placeholder.
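
The two rules can be checked against each combination of flag and command. This is a standalone sketch that mirrors the logic described in this section as a top-level program; the sample values are made up.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

const string RedactedValue = "[redacted]";
HashSet<string> sensitiveCommandMethods = new(StringComparer.OrdinalIgnoreCase)
{
    "AuthenticateUser",
    "WriteSecured",
    "WriteSecured2"
};

// Mirrors the documented rule: redact unless logging is enabled AND the
// command is not credential-bearing.
object? RedactCommandValue(string? commandMethod, object? value, bool valueLoggingEnabled = false)
{
    if (value is null)
    {
        return null;
    }

    bool credentialBearing = commandMethod is not null
        && sensitiveCommandMethods.Contains(commandMethod);

    return !valueLoggingEnabled || credentialBearing ? RedactedValue : value;
}

Console.WriteLine(RedactCommandValue("Write", 42));                                       // [redacted]
Console.WriteLine(RedactCommandValue("Write", 42, valueLoggingEnabled: true));            // 42
Console.WriteLine(RedactCommandValue("WriteSecured", 42, valueLoggingEnabled: true));     // [redacted]
Console.WriteLine(RedactCommandValue("writesecured2", "pw", valueLoggingEnabled: true));  // [redacted]
```

The last call relies on the case-insensitive comparer: a differently cased method name is still treated as credential-bearing.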
|
||||
|
||||
## Request Logging Middleware
|
||||
|
||||
`GatewayRequestLoggingMiddlewareExtensions.UseGatewayRequestLoggingScope` registers the middleware that pushes a `GatewayLogScope` for the duration of every request:
|
||||
|
||||
```csharp
public static IApplicationBuilder UseGatewayRequestLoggingScope(this IApplicationBuilder app)
{
    ArgumentNullException.ThrowIfNull(app);

    return app.Use(async (context, next) =>
    {
        ILogger logger = context.RequestServices
            .GetRequiredService<ILoggerFactory>()
            .CreateLogger("MxGateway.Request");

        using IDisposable? scope = logger.BeginGatewayScope(new GatewayLogScope(
            SessionId: ReadHeader(context, SessionIdHeaderName),
            WorkerProcessId: ReadInt32Header(context, WorkerProcessIdHeaderName),
            CorrelationId: ReadUInt64Header(context, CorrelationIdHeaderName),
            CommandMethod: ReadHeader(context, CommandMethodHeaderName),
            ClientIdentity: ReadHeader(context, "authorization")));

        await next(context);
    });
}
```
|
||||
|
||||
The scope is populated from four custom headers and the standard `authorization` header:
|
||||
|
||||
| Header | Scope field | Type |
|--------|-------------|------|
| `x-session-id` | `SessionId` | string |
| `x-worker-process-id` | `WorkerProcessId` | int |
| `x-correlation-id` | `CorrelationId` | ulong |
| `x-command-method` | `CommandMethod` | string |
| `authorization` | `ClientIdentity` | string (redacted) |
|
||||
|
||||
The numeric headers use `int.TryParse` and `ulong.TryParse`; missing or unparseable values become `null` and are dropped by `GatewayLogScope.ToDictionary`. This keeps the middleware tolerant of clients that do not yet emit every header, which matters because the earliest call in a session (`OpenSession`) has no `SessionId` to send.
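
The tolerant parsing can be sketched without ASP.NET. The helper names below are hypothetical stand-ins for `ReadInt32Header` and `ReadUInt64Header`, and they take the raw header text instead of an `HttpContext` so the sketch runs on its own.

```csharp
#nullable enable
using System;

// Hypothetical stand-ins: a missing (null) or unparseable header yields null.
int? ParseInt32Header(string? raw) =>
    int.TryParse(raw, out int value) ? value : null;

ulong? ParseUInt64Header(string? raw) =>
    ulong.TryParse(raw, out ulong value) ? value : null;

Console.WriteLine(ParseInt32Header("4312")?.ToString() ?? "null");     // 4312
Console.WriteLine(ParseInt32Header("not-a-pid")?.ToString() ?? "null"); // null
Console.WriteLine(ParseInt32Header(null)?.ToString() ?? "null");        // null
Console.WriteLine(ParseUInt64Header("-1")?.ToString() ?? "null");       // null
```

A negative correlation id fails `ulong.TryParse`, so it is treated the same way as a missing header rather than causing the request to fail.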
|
||||
|
||||
The logger category is `MxGateway.Request`, which lets operators filter the request scope events independently from per-component categories.
|
||||
|
||||
### Pipeline ordering
|
||||
|
||||
`GatewayApplication.Build` registers the middleware before authentication, authorization, and endpoint mapping:
|
||||
|
||||
```csharp
app.UseGatewayRequestLoggingScope();
app.UseStaticFiles();
app.UseAuthentication();
app.UseAuthorization();
app.UseAntiforgery();
app.MapGatewayEndpoints();
```
|
||||
|
||||
The order matters: putting the logging scope first ensures that authentication failures, authorization denials, and endpoint exceptions all run inside the request scope, so failure logs still carry the correlation id and session id headers that the caller sent. The `ClientIdentity` field is redacted before logging, so reading the `authorization` header at this stage does not leak the bearer secret into authentication failure logs.
|
||||
|
||||
## Consumers
|
||||
|
||||
`GatewayLoggerExtensions.BeginGatewayScope` is consumed by `GatewayRequestLoggingMiddlewareExtensions` to attach the per-request scope. Component-level call sites build narrower `GatewayLogScope` instances (for example, with a known `WorkerProcessId` after a worker launch) and push a nested scope on top of the request scope.
|
||||
|
||||
`GatewayLogRedactor` is consumed in three places:
|
||||
|
||||
- `GatewayLogScope.ToDictionary` redacts `ClientIdentity` whenever a scope is materialized.
|
||||
- `DashboardRedactor.Redact` delegates to `RedactClientIdentity` for any value containing the `mxgw_` marker, then falls back to a marker-keyword check for fields like `password` or `token`. This keeps dashboard renders aligned with log redaction.
|
||||
- `MxGateway.Tests/Diagnostics/GatewayLogRedactorTests.cs` covers each redaction branch, including the assertion that `WriteSecured` values stay redacted even when `valueLoggingEnabled` is true.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Sessions](./Sessions.md)
|
||||
- [gRPC](./Grpc.md)
|
||||
- [Authentication](./Authentication.md)
|
||||
@@ -0,0 +1,271 @@
|
||||
# Galaxy Repository Browse
|
||||
|
||||
The gateway exposes a read-only browse surface over the AVEVA System Platform
|
||||
Galaxy Repository (the SQL Server database named `ZB`). Clients use it to
|
||||
enumerate the deployed object hierarchy and each object's dynamic attributes
|
||||
before subscribing to runtime values via the existing `MxAccessGateway` RPCs.
|
||||
|
||||
This is a metadata layer: it never reads or writes runtime tag values, never
|
||||
goes through MXAccess COM, and never needs the x86 worker. It runs entirely
|
||||
inside the .NET 10 gateway process and talks to the Galaxy Repository over
|
||||
plain SQL.
|
||||
|
||||
## Why It Exists
|
||||
|
||||
Without browse, a client must already know `tag_name.AttributeName` strings to
|
||||
issue `AddItem`. The Galaxy Repository is the authoritative source for those
|
||||
names — the same database that System Platform itself reads when the
|
||||
ArchestrA IDE renders the deployment tree. Surfacing that data over gRPC lets
|
||||
remote clients build a navigable address space without any coupling to the
|
||||
COM layer or the host platform.
|
||||
|
||||
The query bodies are kept byte-for-byte identical to the equivalent OPC UA
|
||||
server in the OtOpcUa project so the two consumers see the same row sets.
|
||||
|
||||
## RPC Surface
|
||||
|
||||
The service is defined in
|
||||
`src/MxGateway.Contracts/Protos/galaxy_repository.proto` under package
|
||||
`galaxy_repository.v1`.
|
||||
|
||||
| RPC | Purpose |
|-----|---------|
| `TestConnection` | Connectivity probe. Returns `{ ok: bool }` after a `SELECT 1`. Does not throw on SQL failure; it returns `ok = false` instead. Always hits SQL directly so it remains a true health check. |
| `GetLastDeployTime` | Returns the cached `galaxy.time_of_last_deploy`. Served from the shared hierarchy cache; refreshed in the background. |
| `DiscoverHierarchy` | Returns the full deployed hierarchy plus every object's dynamic attributes. **Served from cache**; see [Hierarchy Cache](#hierarchy-cache). |
| `WatchDeployEvents` | **Server-streaming.** The server emits the current state immediately on subscribe (so clients can bootstrap without waiting), then emits one event per detected deploy change. See [Deploy Notifications](#deploy-notifications). |
|
||||
|
||||
`DiscoverHierarchy` is intentionally a single unary RPC rather than a stream:
|
||||
the row set is small (thousands of objects, low tens-of-thousands of
|
||||
attributes for typical Galaxies) and clients almost always want the whole tree
|
||||
at once.
|
||||
|
||||
## Hierarchy Cache
|
||||
|
||||
The gateway holds a single shared `IGalaxyHierarchyCache`
|
||||
(`src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs`) — every
|
||||
`DiscoverHierarchy` and `GetLastDeployTime` request reads from this cache
|
||||
rather than hitting SQL. Many clients can browse concurrently with at most
|
||||
one SQL query in flight.
|
||||
|
||||
Refresh strategy is **deploy-time gated**:
|
||||
|
||||
1. The hosted `GalaxyHierarchyRefreshService` ticks every
   `MxGateway:Galaxy:DashboardRefreshIntervalSeconds` seconds (default 30).
2. Each tick queries the cheap `SELECT time_of_last_deploy FROM galaxy` first.
3. If the deploy timestamp is unchanged, the heavy hierarchy + attributes
   queries are **skipped**. The cache simply marks `LastSuccessAt`.
4. If the deploy timestamp changed (or no data has loaded yet), the cache
   pulls hierarchy + attributes, materializes a `DiscoverHierarchyReply`
   once, replaces the entry atomically, and publishes a deploy event.
|
||||
|
||||
Materializing the reply at refresh time means subsequent `DiscoverHierarchy`
|
||||
calls return a pre-built proto message — no per-request projection, no
|
||||
per-request allocations beyond the gRPC serializer's frame.
|
||||
|
||||
When SQL is unreachable, the cache retains the previous data and flips
|
||||
`Status` to `Stale` (or `Unavailable` if no data was ever loaded). A
|
||||
`SqlException` never bubbles out as the client-facing error.
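
The gating and failure behavior described above can be sketched with in-memory state. Everything below is illustrative: the variable names, the `Refresh` function, and the `"Healthy"` status name are assumptions, and a delegate stands in for the cheap deploy-time query.

```csharp
#nullable enable
using System;

// Hypothetical in-memory stand-in for the cache state.
DateTime? cachedDeployTime = null;
bool hasData = false;
string status = "Unavailable";
int heavyLoads = 0;

void Refresh(Func<DateTime?> readLastDeployTime)
{
    DateTime? latest;
    try
    {
        latest = readLastDeployTime();   // cheap query, runs on every tick
    }
    catch (Exception)
    {
        // Keep whatever was loaded before; only the status changes.
        status = hasData ? "Stale" : "Unavailable";
        return;
    }

    if (!hasData || latest != cachedDeployTime)
    {
        heavyLoads++;                    // stands in for the hierarchy + attributes queries
        cachedDeployTime = latest;
        hasData = true;
    }

    status = "Healthy";                  // assumed name for the non-degraded state
}

DateTime firstDeploy = new(2025, 1, 1, 8, 0, 0, DateTimeKind.Utc);
Refresh(() => firstDeploy);                                      // first load: heavy queries run
Refresh(() => firstDeploy);                                      // unchanged: heavy queries skipped
Refresh(() => throw new InvalidOperationException("SQL down"));  // data kept, status becomes Stale
Refresh(() => firstDeploy.AddHours(1));                          // new deploy: heavy queries run again

Console.WriteLine($"{heavyLoads} heavy loads, status {status}");  // 2 heavy loads, status Healthy
```

Four ticks produce only two heavy loads, which is the point of gating on the deploy timestamp.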
|
||||
|
||||
### First-load behavior
|
||||
|
||||
If a client calls `DiscoverHierarchy` before the background service has
|
||||
populated the cache, the gRPC handler waits up to 5 seconds for the first
|
||||
load to complete before returning. If the first load fails or times out,
|
||||
the client gets `Unavailable` with a short reason. Once any load completes
|
||||
(success or failure), this wait is skipped on subsequent calls.
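
One way to express a bounded first-load wait is `Task.WaitAsync` with a timeout. This is a sketch under assumptions, not the handler's actual code: the `firstLoad` gate, the `DiscoverAsync` function, and the returned strings are hypothetical, and the first call uses a short timeout so the example finishes quickly.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical gate: completed by the background refresh after its first attempt
// (true = load succeeded, false = load failed).
TaskCompletionSource<bool> firstLoad = new(TaskCreationOptions.RunContinuationsAsynchronously);

async Task<string> DiscoverAsync(TimeSpan maxWait)
{
    try
    {
        // Returns immediately once the gate has completed; otherwise waits up to maxWait.
        bool loaded = await firstLoad.Task.WaitAsync(maxWait);
        return loaded ? "OK" : "Unavailable: first load failed";
    }
    catch (TimeoutException)
    {
        return "Unavailable: first load still in progress";
    }
}

Console.WriteLine(await DiscoverAsync(TimeSpan.FromMilliseconds(50)));  // Unavailable: first load still in progress
firstLoad.SetResult(true);
Console.WriteLine(await DiscoverAsync(TimeSpan.FromSeconds(5)));        // OK
```

Because a completed task never blocks, later calls skip the wait without any extra flag.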
|
||||
|
||||
## Deploy Notifications
|
||||
|
||||
`WatchDeployEvents` is a server-streaming RPC backed by
|
||||
`IGalaxyDeployNotifier` (`src/MxGateway.Server/Galaxy/GalaxyDeployNotifier.cs`).
|
||||
The notifier maintains a private bounded channel per subscriber so a slow
|
||||
client cannot back-pressure other subscribers or the publisher.
|
||||
|
||||
Subscriber lifecycle:
|
||||
|
||||
1. On subscribe, the notifier emits the **current** event (the state of the
   most recent successful refresh) so the subscriber can sync its local cache
   without waiting for the next deploy. Clients that already know about that
   deploy can pass `last_seen_deploy_time` in the request to suppress the
   bootstrap event.
2. As the cache observes new deploy timestamps, it publishes one event per
   change. Each event carries:
   - `sequence`: monotonic per server start; gaps signal a dropped event.
   - `observed_at`: server wall-clock when the cache saw the deploy.
   - `time_of_last_deploy` (+ `time_of_last_deploy_present`): the Galaxy
     timestamp; absent only when the source row reports null.
   - `object_count`, `attribute_count`: counts on the new deploy, useful for
     dashboards and "did anything important change" gates without re-pulling.
3. If the subscriber's per-subscriber buffer fills (bound = 16 events with
   `DropOldest`), older events are dropped. Clients use the `sequence` field
   to detect this.
4. Cancellation (or transport disconnect) removes the subscriber.
|
||||
|
||||
Typical client pattern:
|
||||
|
||||
```text
1. Open WatchDeployEvents stream (with last_seen_deploy_time if you have one).
2. On each event, decide whether to call DiscoverHierarchy to refresh local cache.
3. If sequence skipped a number, treat it as a dropped event and refresh.
```
|
||||
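
The drop-and-detect behavior can be reproduced with `System.Threading.Channels`. This sketch assumes the per-subscriber buffer behaves like a bounded channel of capacity 16 in `DropOldest` mode and that sequence numbers start at 1; plain `long` values stand in for deploy events.

```csharp
using System;
using System.Threading.Channels;

// Assumed shape of one subscriber's buffer: capacity 16, oldest entries dropped first.
Channel<long> subscriber = Channel.CreateBounded<long>(new BoundedChannelOptions(16)
{
    FullMode = BoundedChannelFullMode.DropOldest,
});

// The publisher emits 20 events while the subscriber is not reading.
for (long published = 1; published <= 20; published++)
{
    subscriber.Writer.TryWrite(published);
}

long lastSeen = 0;
int refreshes = 0;
while (subscriber.Reader.TryRead(out long received))
{
    if (received != lastSeen + 1)
    {
        // A gap means events were dropped: re-pull the hierarchy instead of trusting deltas.
        refreshes++;
        Console.WriteLine($"gap: expected {lastSeen + 1}, got {received}");  // gap: expected 1, got 5
    }

    lastSeen = received;
}

Console.WriteLine($"last seen {lastSeen}, refreshes {refreshes}");  // last seen 20, refreshes 1
```

The publisher never blocks in this mode; the cost of a slow reader is paid only by that reader, which sees a single gap and refreshes once.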
|
||||
### Reply Shape

The messages below describe each object and attribute returned by `DiscoverHierarchy`:
|
||||
|
||||
```proto
message GalaxyObject {
  int32 gobject_id = 1;
  string tag_name = 2;
  string contained_name = 3;
  string browse_name = 4;            // contained_name when present, else tag_name
  int32 parent_gobject_id = 5;
  bool is_area = 6;
  int32 category_id = 7;
  int32 hosted_by_gobject_id = 8;
  repeated string template_chain = 9;
  repeated GalaxyAttribute attributes = 10;
}

message GalaxyAttribute {
  string attribute_name = 1;
  string full_tag_reference = 2;     // e.g. "DelmiaReceiver_001.DownloadPath"
  int32 mx_data_type = 3;            // raw Galaxy mx_data_type integer
  string data_type_name = 4;
  bool is_array = 5;
  int32 array_dimension = 6;
  bool array_dimension_present = 7;  // distinguishes "no dimension" from 0
  int32 mx_attribute_category = 8;
  int32 security_classification = 9;
  bool is_historized = 10;
  bool is_alarm = 11;
}
```
|
||||
|
||||
### Contained Name vs Tag Name
|
||||
|
||||
Galaxy objects carry two names. `tag_name` is globally unique and is what
|
||||
MXAccess expects in `AddItem`. `contained_name` is the human-readable name
|
||||
used in the IDE browse tree, scoped to the parent. The browse RPC exposes
|
||||
both: clients display `browse_name` to users and pass `tag_name` (or
|
||||
`full_tag_reference`) into MXAccess subscriptions. When `contained_name` is
|
||||
empty (top-level objects), `browse_name` falls back to `tag_name`.
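
The fallback rule is small enough to state as code. This is a sketch of the rule as described, not the gateway's mapper; the function name and the sample names other than `DelmiaReceiver_001` are made up.

```csharp
#nullable enable
using System;

// Assumption: an empty or missing contained_name means "use tag_name".
string BrowseName(string tagName, string? containedName) =>
    string.IsNullOrEmpty(containedName) ? tagName : containedName;

Console.WriteLine(BrowseName("DelmiaReceiver_001", "Receiver"));  // Receiver
Console.WriteLine(BrowseName("Area_Packaging", ""));              // Area_Packaging
```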
|
||||
|
||||
### Data Types
|
||||
|
||||
`mx_data_type` is returned as the raw Galaxy integer rather than mapped to a
|
||||
language-neutral enum. The gateway makes no assumption about the client's
|
||||
target type system — clients map to OPC UA, JSON, .NET CLR types, or
|
||||
something else as appropriate. The Galaxy `data_type` table description is
|
||||
also passed through as `data_type_name`.
|
||||
|
||||
`array_dimension_present` is a separate boolean because protobuf scalar
|
||||
fields cannot express null. Use it to distinguish "no dimension reported" from
|
||||
"dimension is zero."
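
On the client side, the value/presence pair maps naturally back to a nullable integer. The helper names here are hypothetical; they show one way to round-trip the pair and are not part of the gateway or its clients.

```csharp
#nullable enable
using System;

// Hypothetical helpers: split a nullable dimension into the value/presence pair
// carried on the wire, and rebuild the nullable on the receiving side.
int WireValue(int? dimension) => dimension ?? 0;
bool WirePresent(int? dimension) => dimension.HasValue;
int? FromWire(int value, bool present) => present ? value : null;

// "No dimension reported" and "dimension is zero" share the same value but differ in presence.
Console.WriteLine($"{WireValue(null)} {WirePresent(null)}");  // 0 False
Console.WriteLine($"{WireValue(0)} {WirePresent(0)}");        // 0 True

Console.WriteLine(FromWire(0, present: false) is null);       // True
Console.WriteLine(FromWire(0, present: true) == 0);           // True
```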
|
||||
|
||||
## Architecture
|
||||
|
||||
```text
gRPC client(s)
  -> GalaxyRepositoryGrpcService (src/MxGateway.Server/Grpc/)
       DiscoverHierarchy, GetLastDeployTime -> IGalaxyHierarchyCache.Current
       WatchDeployEvents                    -> IGalaxyDeployNotifier
       TestConnection                       -> GalaxyRepository (direct SQL)

GalaxyHierarchyRefreshService (BackgroundService)
  -> IGalaxyHierarchyCache.RefreshAsync
       -> GalaxyRepository.GetLastDeployTimeAsync  (cheap, every tick)
       -> GalaxyRepository.GetHierarchyAsync       (only on deploy change)
       -> GalaxyRepository.GetAttributesAsync      (only on deploy change)
       -> GalaxyProtoMapper.MapObject              (materialize DiscoverHierarchyReply once)
       -> IGalaxyDeployNotifier.Publish            (only on deploy change)
```
|
||||
|
||||
Component breakdown:
|
||||
|
||||
- `GalaxyRepository` (`src/MxGateway.Server/Galaxy/GalaxyRepository.cs`) holds
|
||||
the SQL. Its constants `HierarchySql` and `AttributesSql` are copied verbatim
|
||||
from the OtOpcUa project; do not edit them in isolation here. The two
|
||||
queries walk template-derivation and package-derivation chains via
|
||||
recursive CTEs and pick the most-derived attribute override per object.
|
||||
- `GalaxyHierarchyCache`
|
||||
(`src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs`) holds the most
|
||||
recent immutable `GalaxyHierarchyCacheEntry` (rows + materialized proto
|
||||
reply + counts + status). All gRPC clients share the same entry.
|
||||
- `GalaxyHierarchyRefreshService`
|
||||
(`src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs`) is a
|
||||
hosted `BackgroundService` that drives `RefreshAsync` on the configured
|
||||
interval, with deploy-time gating to avoid unnecessary heavy queries.
|
||||
- `GalaxyDeployNotifier`
|
||||
(`src/MxGateway.Server/Galaxy/GalaxyDeployNotifier.cs`) is a thin
|
||||
per-subscriber-channel fan-out for streaming clients.
|
||||
- `GalaxyProtoMapper`
|
||||
(`src/MxGateway.Server/Grpc/GalaxyProtoMapper.cs`) converts row models to
|
||||
proto messages. Used by the cache during refresh to materialize the reply
|
||||
once.
|
||||
- `GalaxyRepositoryGrpcService`
|
||||
(`src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs`) implements
|
||||
the four RPCs.
|
||||
|
||||
## Configuration
|
||||
|
||||
Bound to `MxGateway:Galaxy` via `GalaxyRepositoryOptions`.
|
||||
|
||||
| Option | Default | Description |
|--------|---------|-------------|
| `MxGateway:Galaxy:ConnectionString` | `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` | SQL Server connection string for the Galaxy Repository. Integrated Security against `localhost` is the dev default; production deployments should override this through the standard double-underscore environment variable form, e.g. `MxGateway__Galaxy__ConnectionString`. |
| `MxGateway:Galaxy:CommandTimeoutSeconds` | `60` | Per-command SQL timeout. Applies to all Galaxy browse RPCs. |
| `MxGateway:Galaxy:DashboardRefreshIntervalSeconds` | `30` | Interval between `GalaxyHierarchyRefreshService` ticks; see [Hierarchy Cache](#hierarchy-cache). |
|
||||
|
||||
The connection string is not treated as a secret in dev (`Integrated
|
||||
Security`), but production deployments that use SQL authentication should set
|
||||
the override via environment variable rather than committing credentials to
|
||||
`appsettings.json`.
|
||||
|
||||
## Authorization
|
||||
|
||||
All four Galaxy RPCs (including `WatchDeployEvents`) require the
|
||||
`metadata:read` API-key scope. Browse is read-only metadata, equivalent in
|
||||
privilege to `MxCommandKind.GetSessionState` or `MxCommandKind.GetWorkerInfo`.
|
||||
The mapping lives in `GatewayGrpcScopeResolver`; see
|
||||
[Authorization](./Authorization.md) for the full scope catalog.
|
||||
|
||||
A request without an API key returns `Unauthenticated`. A request with a key
|
||||
that lacks `metadata:read` returns `PermissionDenied` with the missing scope
|
||||
embedded in the status detail.
|
||||
|
||||
## Dashboard Surface
|
||||
|
||||
The gateway's Blazor dashboard surfaces a Galaxy summary in two places:
|
||||
|
||||
- An overview card on `/dashboard` showing connectivity status, last deploy
|
||||
timestamp, object count (with area count), attribute total, historized and
|
||||
alarm counts, and last successful refresh.
|
||||
- A dedicated `/dashboard/galaxy` page with object-category and top-template
|
||||
breakdowns plus a Sync Info table covering last successful refresh, last
|
||||
attempt, refresh interval, redacted connection string, and command timeout.
|
||||
|
||||
Both views are projected from the same `IGalaxyHierarchyCache` that backs the
|
||||
gRPC service. The dashboard does not run its own refresh — when the
|
||||
background `GalaxyHierarchyRefreshService` updates the cache, both the
|
||||
overview card and the `/dashboard/galaxy` page pick up the new state on the
|
||||
next dashboard tick. When SQL is unreachable, the cache retains the previous
|
||||
data and flips `Status` to `Stale` or `Unavailable`; the dashboard surfaces
|
||||
that as a yellow or red status badge plus the truncated error.
|
||||
|
||||
## Operational Notes
|
||||
|
||||
- The service is registered alongside `MxAccessGatewayService` in
|
||||
`GatewayApplication.MapGatewayEndpoints`. Both services share the same
|
||||
authorization interceptor and authentication policy.
|
||||
- Failures to reach the Galaxy database surface as `Unavailable`. Detailed
|
||||
SQL exceptions are logged at `Warning` and never returned to clients.
|
||||
- Integration tests live in
|
||||
`src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs`. Set
|
||||
`MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1` (and optionally
|
||||
`MXGATEWAY_LIVE_GALAXY_CONN`) to run them; otherwise they skip.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Contracts](./Contracts.md)
|
||||
- [Grpc](./Grpc.md)
|
||||
- [Authorization](./Authorization.md)
|
||||
- [Gateway Configuration](./GatewayConfiguration.md)
|
||||
@@ -53,6 +53,11 @@ paths, timeouts, queue sizes, enum values, or protocol values are invalid.
|
||||
    },
    "Protocol": {
      "WorkerProtocolVersion": 1
    },
    "Galaxy": {
      "ConnectionString": "Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;",
      "CommandTimeoutSeconds": 60,
      "DashboardRefreshIntervalSeconds": 30
    }
  }
}
|
||||
@@ -146,9 +151,21 @@ The protocol option is exposed for diagnostics and explicit deployment
|
||||
configuration, not for compatibility negotiation. A mismatch fails validation
|
||||
at startup.
|
||||
|
||||
## Galaxy Options
|
||||
|
||||
| Option | Default | Description |
|--------|---------|-------------|
| `MxGateway:Galaxy:ConnectionString` | `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` | SQL Server connection string for the Galaxy Repository (`ZB`) used by the `GalaxyRepository` browse RPCs. Override in production via `MxGateway__Galaxy__ConnectionString`. |
| `MxGateway:Galaxy:CommandTimeoutSeconds` | `60` | Per-command SQL timeout for all Galaxy browse RPCs. |
| `MxGateway:Galaxy:DashboardRefreshIntervalSeconds` | `30` | Interval between background refreshes of the shared Galaxy hierarchy cache, which backs both the browse RPCs and the dashboard Galaxy summary. The refresh hits SQL at most once per interval regardless of dashboard render rate. |
|
||||
|
||||
See [Galaxy Repository Browse](./GalaxyRepository.md) for the RPC surface and
|
||||
behavior.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Gateway Process Detailed Design](./gateway-process-design.md)
|
||||
- [Gateway Dashboard Detailed Design](./gateway-dashboard-design.md)
|
||||
- [Worker Process Launcher](./WorkerProcessLauncher.md)
|
||||
- [Worker Frame Protocol](./WorkerFrameProtocol.md)
|
||||
- [Galaxy Repository Browse](./GalaxyRepository.md)
|
||||
|
||||
@@ -0,0 +1,236 @@
|
||||
# Gateway gRPC Service Layer
|
||||
|
||||
The gRPC service layer is the public entry point for client traffic. It is intentionally thin: handlers validate the incoming request, look up or open a session, dispatch to the worker through the session manager, and translate worker replies and events back into public proto types.
|
||||
|
||||
## Layer Responsibilities
|
||||
|
||||
The architecture rule (from `AGENTS.md`) is that the gRPC layer must "validate request, find session, call the session worker client, map worker replies to public replies, and stream events". Anything else — caching, retries, worker process lifetime, event ordering — lives behind `ISessionManager` and the worker client. Keeping the layer thin lets the same session/worker code be reused by future transports (for example, an in-process host or an alternate IPC) without having to re-derive validation or mapping rules.
|
||||
|
||||
The layer is composed of four collaborators:
|
||||
|
||||
| Type | Lifetime | Role |
|------|----------|------|
| `MxAccessGatewayService` | scoped (gRPC) | Implements the four `MxAccessGateway` RPCs, performs exception mapping. |
| `MxAccessGrpcRequestValidator` | singleton | Rejects malformed requests before any session work runs. |
| `MxAccessGrpcMapper` | singleton | Converts public proto types to internal `WorkerCommand`/`WorkerEvent` types and back. |
| `IEventStreamService` (`EventStreamService`) | singleton | Owns the event stream pipeline, including bounded queue and backpressure handling. |
|
||||
|
||||
Registration happens in `GatewayApplication`:
|
||||
|
||||
```csharp
builder.Services.AddSingleton<MxAccessGrpcMapper>();
builder.Services.AddSingleton<MxAccessGrpcRequestValidator>();
builder.Services.AddSingleton<IEventStreamService, EventStreamService>();
```
|
||||
|
||||
The service itself is mapped as a normal gRPC endpoint via `endpoints.MapGrpcService<MxAccessGatewayService>()`.
|
||||
|
||||
A second gRPC service, `GalaxyRepositoryGrpcService`, is mapped alongside it. It exposes the read-only Galaxy Repository browse surface and is documented separately in [Galaxy Repository Browse](./GalaxyRepository.md). It shares the authorization interceptor and authentication policy used by `MxAccessGatewayService`, but it does not go through the session manager or worker — it talks to SQL Server directly.
|
||||
|
||||
## RPC Handlers
|
||||
|
||||
`MxAccessGatewayService` derives from the generated `MxAccessGateway.MxAccessGatewayBase` and implements every RPC declared in `mxaccess_gateway.proto`. The proto contract itself is documented in [Contracts](./Contracts.md); this section covers only what the server-side handler does on top of that contract.
|
||||
|
||||
### `OpenSession`
|
||||
|
||||
`OpenSession` validates the request, asks `ISessionManager` to open a session under the caller's identity, and returns a reply that advertises both protocol versions and the capabilities the gateway supports. Capability strings are static because the gateway has a fixed feature set per build; clients use them as a forward-compatibility hint rather than runtime negotiation.
|
||||
|
||||
```csharp
GatewaySession session = await sessionManager
    .OpenSessionAsync(
        SessionOpenRequest.FromContract(request),
        ResolveClientIdentity(),
        context.CancellationToken)
    .ConfigureAwait(false);

OpenSessionReply reply = new()
{
    SessionId = session.SessionId,
    BackendName = session.BackendName,
    WorkerProcessId = session.WorkerProcessId ?? 0,
    WorkerProtocolVersion = GatewayContractInfo.WorkerProtocolVersion,
    GatewayProtocolVersion = GatewayContractInfo.GatewayProtocolVersion,
    DefaultCommandTimeout = Google.Protobuf.WellKnownTypes.Duration.FromTimeSpan(session.CommandTimeout),
    ProtocolStatus = MxAccessGrpcMapper.Ok(),
};
```
|
||||
|
||||
`ResolveClientIdentity()` reads `IGatewayRequestIdentityAccessor.Current` and prefers `DisplayName`, falling back to `KeyId`. The accessor is populated by the authorization interceptor before the handler runs (see [Authorization](./Authorization.md)).
|
||||
|
||||
### `CloseSession`
|
||||
|
||||
The handler delegates to `sessionManager.CloseSessionAsync` and converts the resulting `SessionCloseResult` into a `CloseSessionReply`. It distinguishes a repeated close from a first close via the protocol status message (`"Session was already closed."` versus `"Session closed."`), so callers do not need to reason about idempotency from a status code alone.
|
||||
|
||||
### `Invoke`
|
||||
|
||||
`Invoke` is the unary command path. It runs the validator (which enforces payload-vs-kind matching), uses the mapper to wrap the public `MxCommand` in a `WorkerCommand` with an enqueue timestamp, calls `sessionManager.InvokeAsync`, and unwraps the worker reply.
|
||||
|
||||
```csharp
requestValidator.ValidateInvoke(request);
WorkerCommand workerCommand = mapper.MapCommand(request);
WorkerCommandReply workerReply = await sessionManager
    .InvokeAsync(request.SessionId, workerCommand, context.CancellationToken)
    .ConfigureAwait(false);

return mapper.MapCommandReply(workerReply);
```
|
||||
|
||||
Carrying the enqueue timestamp into the worker layer is what lets queue-wait time be measured separately from worker-side execution time when troubleshooting timeouts.
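
As a worked example of that split, assume three timestamps for a single command. Only the enqueue timestamp is described in this document; the worker start and finish times are hypothetical values that would have to come from worker-side telemetry.

```csharp
using System;

// Hypothetical timestamps for one command. The first corresponds to
// WorkerCommand.EnqueueTimestamp; the other two are assumed to be available.
DateTimeOffset enqueued = new(2025, 1, 1, 12, 0, 0, TimeSpan.Zero);
DateTimeOffset workerStarted = enqueued.AddMilliseconds(850);
DateTimeOffset workerFinished = workerStarted.AddMilliseconds(120);

TimeSpan queueWait = workerStarted - enqueued;         // time spent waiting for the worker
TimeSpan execution = workerFinished - workerStarted;   // time spent inside the worker

Console.WriteLine($"queue wait {queueWait.TotalMilliseconds} ms, execution {execution.TotalMilliseconds} ms");
// queue wait 850 ms, execution 120 ms
```

With these numbers, a command that nearly hit a one-second timeout spent most of that time queued, which points at worker saturation rather than a slow MXAccess call.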
|
||||
|
||||
### `StreamEvents`
|
||||
|
||||
`StreamEvents` is a server-streaming RPC. The handler delegates the full pipeline to `IEventStreamService` and just forwards each `MxEvent` onto the response stream. Keeping the channel and producer/consumer machinery out of the handler means cancellation, exception mapping, and metric bookkeeping live in one place.
|
||||
|
||||
## Validation Rules
|
||||
|
||||
`MxAccessGrpcRequestValidator` rejects requests with `StatusCode.InvalidArgument` before any session work happens. The rules are intentionally narrow — anything that requires session state (for example, "session does not exist") is left for `ISessionManager` so the validator can stay synchronous and side-effect free.
|
||||
|
||||
| RPC | Rule | Status |
|-----|------|--------|
| `OpenSession` | `command_timeout`, when set, must be `> 0`. | `InvalidArgument` |
| `CloseSession` | `session_id` must be non-empty. | `InvalidArgument` |
| `StreamEvents` | `session_id` must be non-empty. | `InvalidArgument` |
| `Invoke` | `session_id` non-empty, `command` present, `kind` not `Unspecified`, payload oneof must match `kind`. | `InvalidArgument` |
|
||||
|
||||
The payload-vs-kind check matters because nothing on the wire ties the `MxCommand.payload` oneof case to the separate `kind` field: a misaligned client could send `kind = Write` with a `Register` payload and silently confuse the worker. The validator turns that into a clear client error:
|
||||
|
||||
```csharp
private static void ValidateCommandPayload(MxCommand command)
{
    MxCommand.PayloadOneofCase expectedPayload = ExpectedPayload(command.Kind);
    if (command.PayloadCase != expectedPayload)
    {
        throw InvalidArgument(
            $"Command kind {command.Kind} requires payload {expectedPayload} but received {command.PayloadCase}.");
    }
}
```
|
||||
|
||||
`ExpectedPayload` enumerates every `MxCommandKind` that has a matching payload case; unknown kinds map to `PayloadOneofCase.None`, which forces a mismatch and therefore a rejection.
|
||||
|
||||
## Mapping Rules
|
||||
|
||||
`MxAccessGrpcMapper` is the only place that translates between public proto types and internal worker types. Two design choices are worth calling out.
|
||||
|
||||
The mapper clones the inbound `MxCommand` rather than reusing the reference. This isolates the worker pipeline from any later mutation of the request graph by the gRPC framework or interceptors:
|
||||
|
||||
```csharp
|
||||
public WorkerCommand MapCommand(MxCommandRequest request)
|
||||
{
|
||||
ArgumentNullException.ThrowIfNull(request);
|
||||
ArgumentNullException.ThrowIfNull(request.Command);
|
||||
|
||||
return new WorkerCommand
|
||||
{
|
||||
Command = request.Command.Clone(),
|
||||
EnqueueTimestamp = Timestamp.FromDateTimeOffset(_timeProvider.GetUtcNow()),
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
When the worker reply or event payload is missing, the mapper returns a synthetic public message with `ProtocolStatusCode.ProtocolViolation` (for replies) or a sentinel `MxEvent` with `MxEventFamily.Unspecified` (for events). The gateway never relays a partial frame to clients — anything missing is reported as a protocol violation against the worker, not a transport error against the client.
|
||||
|
||||
The mapper also exposes static factory methods for every `ProtocolStatusCode` (`Ok`, `InvalidRequest`, `SessionNotFound`, `SessionNotReady`, `WorkerUnavailable`, `Timeout`, `Canceled`, `ProtocolViolation`) so that handlers and tests can produce status payloads without duplicating the enum-to-string mapping.
|
||||
|
||||
## Exception to Status Mapping
|
||||
|
||||
Every handler wraps its body in `try { ... } catch (Exception exception) when (exception is not RpcException) { throw MapException(exception); }`. `RpcException` is allowed to propagate untouched (the validator already produces them with the right code). All other exceptions are translated by `MapException`, which knows three categories:
|
||||
|
||||
- `OperationCanceledException` becomes `StatusCode.Cancelled`.
|
||||
- `SessionManagerException` is mapped by its `ErrorCode`.
|
||||
- `WorkerClientException` is mapped by its `ErrorCode`.
|
||||
|
||||
Anything else is logged at warning level and surfaced as `Unavailable` with a generic message: clients see a retryable status, and the unexpected exception is captured in gateway logs rather than leaked over the wire.
|
||||
|
||||
```csharp
|
||||
StatusCode statusCode = exception.ErrorCode switch
|
||||
{
|
||||
SessionManagerErrorCode.SessionNotFound => StatusCode.NotFound,
|
||||
SessionManagerErrorCode.SessionNotReady => StatusCode.FailedPrecondition,
|
||||
SessionManagerErrorCode.EventSubscriberAlreadyActive => StatusCode.ResourceExhausted,
|
||||
SessionManagerErrorCode.EventQueueOverflow => StatusCode.ResourceExhausted,
|
||||
SessionManagerErrorCode.SessionLimitExceeded => StatusCode.ResourceExhausted,
|
||||
SessionManagerErrorCode.OpenFailed => StatusCode.Unavailable,
|
||||
SessionManagerErrorCode.CloseFailed => StatusCode.Unavailable,
|
||||
_ => StatusCode.Unavailable,
|
||||
};
|
||||
```
|
||||
|
||||
`WorkerClientException` follows the same pattern: `CommandTimeout` becomes `DeadlineExceeded`, `GatewayShutdown` becomes `Cancelled`, `InvalidState` becomes `FailedPrecondition`, `ProtocolViolation` becomes `Internal`, and unmapped codes fall through to `Unavailable`.
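
The worker-side mapping listed above could look roughly like the following. This is a sketch under stated assumptions: `SketchWorkerErrorCode` and `SketchStatusCode` are hypothetical local enums standing in for `WorkerClientErrorCode` and the gRPC `StatusCode`, so the block needs no gRPC package, and only the codes named in the paragraph are included.

```csharp
using System;

// Hypothetical stand-ins; the real enums have members not listed in this document.
public enum SketchWorkerErrorCode { CommandTimeout, GatewayShutdown, InvalidState, ProtocolViolation }
public enum SketchStatusCode { Cancelled, DeadlineExceeded, FailedPrecondition, Internal, Unavailable }

public static class WorkerErrorMapping
{
    // Mirrors the pairs named in the prose; anything unmapped becomes Unavailable.
    public static SketchStatusCode ToStatusCode(SketchWorkerErrorCode errorCode) => errorCode switch
    {
        SketchWorkerErrorCode.CommandTimeout => SketchStatusCode.DeadlineExceeded,
        SketchWorkerErrorCode.GatewayShutdown => SketchStatusCode.Cancelled,
        SketchWorkerErrorCode.InvalidState => SketchStatusCode.FailedPrecondition,
        SketchWorkerErrorCode.ProtocolViolation => SketchStatusCode.Internal,
        _ => SketchStatusCode.Unavailable,
    };
}
```

Keeping the discard arm last means a newly added error code defaults to a retryable status until someone maps it explicitly.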
|
||||
|
||||
## Event Streaming Model
|
||||
|
||||
`EventStreamService` implements the `StreamEvents` pipeline. It exists as a separate service (rather than living inside `MxAccessGatewayService`) so that the channel, backpressure policy, and metric bookkeeping can be unit-tested without spinning up a Kestrel host.
|
||||
|
||||
The pipeline has three stages:
|
||||
|
||||
1. The session is resolved via `sessionManager.TryGetSession`. A miss raises `SessionManagerException(SessionNotFound)` so the handler reports `StatusCode.NotFound`.
|
||||
2. `session.AttachEventSubscriber(...)` enforces the single-subscriber-per-session rule (or allows multiple subscribers if `Sessions:AllowMultipleEventSubscribers` is enabled). The returned `IDisposable` is released in the `finally` block, ensuring the subscriber slot is freed even when the client cancels mid-stream.
|
||||
3. A `Channel<MxEvent>` decouples the worker-side producer from the gRPC writer. The channel is bounded by `Events:QueueCapacity` and configured for a single reader and writer:
|
||||
|
||||
```csharp
|
||||
Channel<MxEvent> eventQueue = Channel.CreateBounded<MxEvent>(
|
||||
new BoundedChannelOptions(options.Value.Events.QueueCapacity)
|
||||
{
|
||||
SingleReader = true,
|
||||
SingleWriter = true,
|
||||
FullMode = BoundedChannelFullMode.Wait,
|
||||
AllowSynchronousContinuations = false,
|
||||
});
|
||||
```
|
||||
|
||||
The producer reads `WorkerEvent`s from the session, maps them, and writes to the channel. Per-session ordering is preserved end-to-end: events arrive from the worker in `WorkerSequence` order, the producer is the single writer, and the consumer drains FIFO. The producer also honours the request's `AfterWorkerSequence` cursor by skipping any event whose `WorkerSequence` is at or before the requested cutoff, which lets clients resume after a disconnect without server-side replay state.
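
The cursor rule can be illustrated with a small filter. This is a sketch only: `SketchWorkerEvent`, `EventCursor`, and `SkipDelivered` are hypothetical names, and the sequence is assumed to be an unsigned 64-bit integer, which the document does not state.

```csharp
using System.Collections.Generic;

// Hypothetical event shape; the sequence type is an assumption.
public sealed record SketchWorkerEvent(ulong WorkerSequence, string Payload);

public static class EventCursor
{
    // Yields only events strictly after the cursor, preserving input order.
    public static IEnumerable<SketchWorkerEvent> SkipDelivered(
        IEnumerable<SketchWorkerEvent> events,
        ulong afterWorkerSequence)
    {
        foreach (SketchWorkerEvent workerEvent in events)
        {
            if (workerEvent.WorkerSequence <= afterWorkerSequence)
            {
                // At or before the cutoff: the client has already seen it.
                continue;
            }

            yield return workerEvent;
        }
    }
}
```

Because the filter is stateless, a reconnecting client only has to remember the last sequence it processed and pass it back as the cursor.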
|
||||
|
||||
### Backpressure
|
||||
|
||||
When `TryWrite` returns `false`, the queue is full. The handling depends on `Events:BackpressurePolicy`:
|
||||
|
||||
```csharp
|
||||
if (!writer.TryWrite(publicEvent))
|
||||
{
|
||||
string message = $"Session {session.SessionId} event stream queue overflowed.";
|
||||
metrics.QueueOverflow("grpc-event-stream");
|
||||
if (options.Value.Events.BackpressurePolicy == EventBackpressurePolicy.FailFast)
|
||||
{
|
||||
session.MarkFaulted(message);
|
||||
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
|
||||
}
|
||||
else
|
||||
{
|
||||
logger.LogDebug(
|
||||
"Disconnecting event stream for session {SessionId} after queue overflow.",
|
||||
session.SessionId);
|
||||
}
|
||||
|
||||
writer.TryComplete(new SessionManagerException(
|
||||
SessionManagerErrorCode.EventQueueOverflow,
|
||||
message));
|
||||
return;
|
||||
}
|
||||
```
|
||||
|
||||
Under `FailFast` the session is faulted so subsequent commands return `FailedPrecondition`; the client must reopen. Under the default policy only the stream is dropped and the session continues to accept commands, leaving recovery to the client (typically a fresh `StreamEvents` call with an updated `AfterWorkerSequence`). Either way, the consumer side observes `StatusCode.ResourceExhausted` via the `EventQueueOverflow` mapping above.
|
||||
|
||||
### Cancellation and Cleanup
|
||||
|
||||
The handler creates a linked cancellation token (`streamCts`) so that completing the consumer (client disconnect, error, or graceful end-of-stream) also cancels the producer. The `finally` block cancels the source, disposes the subscriber slot, awaits the producer (swallowing the expected cancellation), and emits `StreamDisconnected("Detached")` so dashboards see the disconnection regardless of cause.
|
||||
|
||||
`WorkerClientException` thrown by the producer marks the session as faulted before completing the channel — the worker is presumed gone, and any subsequent command on that session must observe the fault rather than silently retry.
|
||||
|
||||
## Authorization Interceptor Integration
|
||||
|
||||
Authorization is applied as a gRPC interceptor, registered in `GrpcAuthorizationServiceCollectionExtensions`:
|
||||
|
||||
```csharp
|
||||
services.AddSingleton<GatewayGrpcAuthorizationInterceptor>();
|
||||
services.AddGrpc(options => options.Interceptors.Add<GatewayGrpcAuthorizationInterceptor>());
|
||||
```
|
||||
|
||||
Because the interceptor runs before any handler, `MxAccessGatewayService` can safely assume the call has been authorized and that `IGatewayRequestIdentityAccessor.Current` is populated. The handler's only responsibility is to read the identity for `OpenSession` so the session is owned by the authenticated principal; it does not perform any authorization checks of its own. See [Authorization](./Authorization.md) for the policy and identity model.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Contracts](./Contracts.md)
|
||||
- [Sessions](./Sessions.md)
|
||||
- [Authorization](./Authorization.md)
|
||||
- [Gateway Process Design](./gateway-process-design.md)
|
||||
@@ -0,0 +1,210 @@
|
||||
# Gateway Metrics
|
||||
|
||||
The metrics subsystem exposes counters, histograms, and observable gauges that describe gateway throughput, queue health, and worker lifecycle. Both the `System.Diagnostics.Metrics` pipeline and the in-memory `GatewayMetricsSnapshot` consume the same underlying state, so external collectors and the dashboard see consistent numbers.
|
||||
|
||||
## Overview
|
||||
|
||||
`GatewayMetrics` is a singleton (registered in `GatewayApplication.cs`) that owns a single `Meter` named `MxGateway.Server` and a set of synchronised counters, histograms, and observable gauges. Subsystems call typed mutator methods (`SessionOpened`, `CommandFailed`, `EventReceived`, etc.) rather than touching the `Meter` directly, which keeps the OpenTelemetry instrument names and tag conventions in one place. A `lock (_syncRoot)` block guards the scalar fields used by `GetSnapshot`, while per-event maps use `ConcurrentDictionary<string, long>` so the hot event path avoids the lock.
|
||||
|
||||
## Meter and OpenTelemetry Compatibility
|
||||
|
||||
The meter name is exposed as a constant so that hosting code can register it with an OpenTelemetry exporter:
|
||||
|
||||
```csharp
|
||||
public sealed class GatewayMetrics : IDisposable
|
||||
{
|
||||
public const string MeterName = "MxGateway.Server";
|
||||
|
||||
public GatewayMetrics()
|
||||
{
|
||||
_meter = new Meter(MeterName, typeof(GatewayMetrics).Assembly.GetName().Version?.ToString());
|
||||
_sessionsOpenedCounter = _meter.CreateCounter<long>("mxgateway.sessions.opened");
|
||||
...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The meter version is the gateway assembly version, which gives exporters a stable identifier per build. All instrument names use the dotted `mxgateway.<area>.<event>` convention so they group cleanly under a single namespace in tools such as Prometheus, OTLP collectors, or `dotnet-counters`.
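
To see how a consumer keys off the meter name, the sketch below subscribes to a meter with the in-box `MeterListener` from `System.Diagnostics.Metrics`, used here in place of an exporter so the example needs no extra packages. `MeterListenerSketch` and `ObserveOneIncrement` are hypothetical names, and the meter created inside is a throwaway, not the gateway's singleton.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterListenerSketch
{
    // Records one increment on a throwaway meter and returns what the listener saw.
    public static long ObserveOneIncrement(string meterName)
    {
        long observed = 0;

        using var meter = new Meter(meterName);
        Counter<long> counter = meter.CreateCounter<long>("mxgateway.sessions.opened");

        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, meterListener) =>
        {
            // Subscribe only to instruments owned by the meter of interest.
            if (instrument.Meter.Name == meterName)
            {
                meterListener.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<long>(
            (Instrument instrument, long measurement, ReadOnlySpan<KeyValuePair<string, object?>> tags, object? state) =>
            {
                // Counter measurements are delivered synchronously from Add.
                observed += measurement;
            });
        listener.Start();

        counter.Add(1);
        return observed;
    }
}
```

Calling `MeterListenerSketch.ObserveOneIncrement("MxGateway.Server.Sketch")` exercises the same name-based filtering an exporter would apply with `GatewayMetrics.MeterName`.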
|
||||
|
||||
## Instrument Inventory
|
||||
|
||||
### Counters
|
||||
|
||||
All counters are `Counter<long>`. Tag values come from the call sites listed under [Recording Sites](#recording-sites).
|
||||
|
||||
| Instrument | Tags | What it measures |
|
||||
|------------|------|------------------|
|
||||
| `mxgateway.sessions.opened` | none | Successful `SessionManager.OpenSessionAsync` completions. |
|
||||
| `mxgateway.sessions.closed` | none | Sessions closed cleanly via `SessionManager`. |
|
||||
| `mxgateway.commands.started` | `method` | Command dispatches initiated by `WorkerClient`. |
|
||||
| `mxgateway.commands.succeeded` | `method` | Commands acknowledged with success by the worker. |
|
||||
| `mxgateway.commands.failed` | `method`, `category` | Command failures, where `category` is the `WorkerClientErrorCode` or exception type name. |
|
||||
| `mxgateway.events.received` | `family` | Worker events accepted into the event pipeline. |
|
||||
| `mxgateway.queues.overflows` | `queue` | Drops when a bounded queue rejects a message (e.g. `grpc-event-stream`). |
|
||||
| `mxgateway.faults` | `category` | Faults reported by session, event, or worker code paths. The category is a `SessionManagerErrorCode` or `WorkerClientErrorCode` name. |
|
||||
| `mxgateway.workers.killed` | `reason` | Forced terminations of worker processes. |
|
||||
| `mxgateway.workers.exited` | `reason` | Clean or fault-driven worker exits. |
|
||||
| `mxgateway.heartbeats.failed` | `session_id` | Worker heartbeat misses tracked per session. |
|
||||
| `mxgateway.grpc.streams.disconnected` | `reason` | Detachments of the dashboard or client gRPC event stream. |
|
||||
| `mxgateway.retries.attempted` | `area` | Resilience retries executed by gateway components. |
|
||||
|
||||
### Histograms
|
||||
|
||||
Histograms record durations in milliseconds (the `unit` argument on `CreateHistogram`):
|
||||
|
||||
```csharp
|
||||
_workerStartupLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.workers.startup.duration", "ms");
|
||||
_commandLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.commands.duration", "ms");
|
||||
_eventStreamSendLatencyHistogram = _meter.CreateHistogram<double>("mxgateway.events.stream_send.duration", "ms");
|
||||
```
|
||||
|
||||
| Instrument | Tags | What it measures |
|
||||
|------------|------|------------------|
|
||||
| `mxgateway.workers.startup.duration` | none | Time from `WorkerClient` launch to worker-ready. |
|
||||
| `mxgateway.commands.duration` | `method`, optional `category` | Command round-trip time. The `category` tag is added on failure so success and failure latencies stay distinguishable. |
|
||||
| `mxgateway.events.stream_send.duration` | `family` | Time spent writing each public event to the gRPC response stream in `MxAccessGatewayService.StreamEvents`. |
|
||||
|
||||
### Observable Gauges
|
||||
|
||||
Observable gauges are pull-based; the `Meter` invokes the supplied callback whenever a listener samples it. Each callback re-acquires `_syncRoot` so the gauge value matches the snapshot taken at the same instant.
|
||||
|
||||
| Instrument | Source field | Description |
|
||||
|------------|--------------|-------------|
|
||||
| `mxgateway.sessions.open` | `_openSessions` | Currently open sessions tracked by `SessionManager`. |
|
||||
| `mxgateway.workers.running` | `_workersRunning` | Worker clients in a running state. |
|
||||
| `mxgateway.events.worker_queue.depth` | `_workerEventQueueDepth` | Last reported depth of the worker-side event queue. |
|
||||
| `mxgateway.events.grpc_stream_queue.depth` | `_grpcEventStreamQueueDepth` | Backlog held by `EventStreamService` for the active gRPC stream consumer. |
|
||||
|
||||
## Snapshot Shape
|
||||
|
||||
`GatewayMetricsSnapshot` is the immutable view of the same state, returned by `GatewayMetrics.GetSnapshot()` while holding `_syncRoot`. The dictionaries are copied so the caller can iterate without further synchronisation. The dashboard service is the primary consumer.
|
||||
|
||||
```csharp
|
||||
public sealed record GatewayMetricsSnapshot(
|
||||
int OpenSessions,
|
||||
int WorkersRunning,
|
||||
int WorkerEventQueueDepth,
|
||||
int GrpcEventStreamQueueDepth,
|
||||
long SessionsOpened,
|
||||
long SessionsClosed,
|
||||
long CommandsStarted,
|
||||
long CommandsSucceeded,
|
||||
long CommandsFailed,
|
||||
long EventsReceived,
|
||||
long QueueOverflows,
|
||||
long Faults,
|
||||
long WorkerKills,
|
||||
long WorkerExits,
|
||||
long HeartbeatFailures,
|
||||
long StreamDisconnects,
|
||||
long RetryAttempts,
|
||||
IReadOnlyDictionary<string, long> CommandFailuresByMethod,
|
||||
IReadOnlyDictionary<string, long> EventsByFamily,
|
||||
IReadOnlyDictionary<string, long> EventsBySession,
|
||||
IReadOnlyDictionary<string, long> RetryAttemptsByArea);
|
||||
```
|
||||
|
||||
The scalar fields mirror the counters and gauges. The four dictionaries provide the breakdowns that counter tags would otherwise require an exporter to aggregate:
|
||||
|
||||
- `CommandFailuresByMethod` keys by gRPC method name.
|
||||
- `EventsByFamily` keys by event family (the `Family` enum on a worker event).
|
||||
- `EventsBySession` keys by `sessionId`; entries are removed via `RemoveSessionEvents` when a session closes so the map does not grow without bound.
|
||||
- `RetryAttemptsByArea` keys by the resilience `area` tag, e.g. `worker_startup`.
|
||||
|
||||
`EventsReceived` is read with `Interlocked.Read(ref _eventsReceived)` because `EventReceived` increments it via `Interlocked.Increment` outside the lock to keep the event-ingestion path non-blocking.
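
The lock-free pairing of `Interlocked.Increment` and `Interlocked.Read` can be shown in isolation. `EventCounterSketch` and its members are hypothetical names for illustration; the real class also updates per-family and per-session maps that are omitted here.

```csharp
using System.Threading;
using System.Threading.Tasks;

public sealed class EventCounterSketch
{
    private long _eventsReceived;

    // Hot path: an atomic increment, no lock taken.
    public void EventReceived() => Interlocked.Increment(ref _eventsReceived);

    // Snapshot path: an atomic 64-bit read, safe even on 32-bit runtimes.
    public long ReadEventsReceived() => Interlocked.Read(ref _eventsReceived);
}
```

The atomic read matters because a plain read of a `long` is not guaranteed to be atomic on 32-bit platforms, so a snapshot taken during an increment could otherwise observe a torn value.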
|
||||
|
||||
## Recording Sites
|
||||
|
||||
The recording call sites describe the code paths that write into each instrument. This mapping makes it easier to trace an unexpected counter reading back to a subsystem.
|
||||
|
||||
### Session manager
|
||||
|
||||
`Sessions/SessionManager.cs` emits session lifecycle and fault counters:
|
||||
|
||||
```csharp
|
||||
_metrics.SessionOpened();
|
||||
...
|
||||
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
|
||||
...
|
||||
_metrics.SessionClosed();
|
||||
...
|
||||
_metrics.SessionRemoved();
|
||||
...
|
||||
_metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString());
|
||||
...
|
||||
_metrics.RemoveSessionEvents(session.SessionId);
|
||||
```
|
||||
|
||||
`SessionRemoved` decrements the open-session gauge without incrementing the closed counter, which covers cases where a session is evicted rather than closed by the client.
|
||||
|
||||
### Worker client
|
||||
|
||||
`Workers/WorkerClient.cs` records command throughput, worker lifecycle, heartbeat failures, and the worker-side event queue depth:
|
||||
|
||||
- `CommandStarted(method)` and `CommandSucceeded(method, duration)` / `CommandFailed(method, category, duration)` around the worker request/response pair.
|
||||
- `WorkerStarted(startupDuration)` once the worker reports ready.
|
||||
- `RecordWorkerStoppedOnce` calls `WorkerStopped(reason)` exactly once per worker, guarding against double-counting on simultaneous fault and exit signals.
|
||||
- `WorkerKilled(reason)` when the client forcibly terminates the worker.
|
||||
- `HeartbeatFailed(SessionId)` per missed heartbeat.
|
||||
- `SetWorkerEventQueueDepth(queueDepth)` after each event ingest.
|
||||
- `EventReceived(SessionId, workerEvent.Event.Family.ToString())` for each worker event.
|
||||
|
||||
### Worker process launcher
|
||||
|
||||
`Workers/WorkerProcessLauncher.cs` records process-level kills and startup retries:
|
||||
|
||||
```csharp
|
||||
_metrics.WorkerKilled(reason);
|
||||
...
|
||||
_metrics.RetryAttempted("worker_startup");
|
||||
```
|
||||
|
||||
The `worker_startup` tag is hard-coded so the `RetryAttemptsByArea` snapshot reports launcher retries distinctly from other resilience areas.
|
||||
|
||||
### Session worker client factory
|
||||
|
||||
`Sessions/SessionWorkerClientFactory.cs` records the worker kill that follows a failed `OpenSession` handshake:
|
||||
|
||||
```csharp
|
||||
_metrics.WorkerKilled("OpenSessionFailed");
|
||||
```
|
||||
|
||||
This is the only fault path where the factory itself owns the kill decision; once the worker is bound to a session, the `WorkerClient` becomes responsible for lifecycle metrics.
|
||||
|
||||
### gRPC event stream service
|
||||
|
||||
`Grpc/EventStreamService.cs` records the dashboard/client event-stream backlog and disconnect counters:
|
||||
|
||||
```csharp
|
||||
metrics.AdjustGrpcEventStreamQueueDepth(1);
|
||||
...
|
||||
metrics.AdjustGrpcEventStreamQueueDepth(-1);
|
||||
...
|
||||
metrics.AdjustGrpcEventStreamQueueDepth(-remainingDepth);
|
||||
metrics.StreamDisconnected("Detached");
|
||||
...
|
||||
metrics.QueueOverflow("grpc-event-stream");
|
||||
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
|
||||
...
|
||||
metrics.Fault(WorkerClientErrorCode.WorkerFaulted.ToString());
|
||||
```
|
||||
|
||||
The service tracks per-message enqueues and dequeues, so `AdjustGrpcEventStreamQueueDepth` updates the aggregate stream backlog. The `Math.Max(0, ...)` clamp inside the adjuster prevents a negative depth if the bookkeeping ever drifts.
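
A clamped adjuster of the kind described might look like this. It is a sketch, not the gateway's implementation: `QueueDepthSketch` and `Adjust` are hypothetical names, and returning the new depth is an addition made here so the behaviour is easy to check.

```csharp
using System;

public sealed class QueueDepthSketch
{
    private readonly object _syncRoot = new();
    private int _depth;

    // Applies a signed delta and never lets the stored depth go below zero.
    public int Adjust(int delta)
    {
        lock (_syncRoot)
        {
            _depth = Math.Max(0, _depth + delta);
            return _depth;
        }
    }
}
```

With this shape, an unmatched decrement leaves the gauge at zero instead of reporting a negative backlog.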
|
||||
|
||||
`Grpc/MxAccessGatewayService.cs` records gRPC event send latency around each response-stream write:
|
||||
|
||||
```csharp
|
||||
Stopwatch stopwatch = Stopwatch.StartNew();
|
||||
await responseStream.WriteAsync(publicEvent).ConfigureAwait(false);
|
||||
metrics.RecordEventStreamSend(publicEvent.Family.ToString(), stopwatch.Elapsed);
|
||||
```
|
||||
|
||||
## Dashboard Consumption
|
||||
|
||||
`Dashboard/DashboardSnapshotService.cs` calls `_metrics.GetSnapshot()` once per `GetSnapshot` invocation and projects it into the dashboard transport types together with the session registry view. The dashboard receives a single, internally consistent snapshot per tick rather than reading individual counters at separate times. See [Gateway Dashboard Design](./gateway-dashboard-design.md) and [Dashboard Interface Design](./DashboardInterfaceDesign.md) for the projection rules and wire format.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Gateway Dashboard Design](./gateway-dashboard-design.md)
|
||||
- [Dashboard Interface Design](./DashboardInterfaceDesign.md)
|
||||
- [Sessions](./Sessions.md)
|
||||
@@ -0,0 +1,271 @@
|
||||
# Gateway Sessions
|
||||
|
||||
The sessions subsystem owns the in-memory representation of an active gateway-to-worker pairing and coordinates its lifecycle from open through close. Each `GatewaySession` corresponds to exactly one MXAccess worker process connected over a dedicated named pipe.
|
||||
|
||||
## Overview
|
||||
|
||||
A session is the gateway-side handle that callers use to invoke worker commands, stream worker events, and tear the worker down. The subsystem is split between the per-session state machine (`GatewaySession`), an in-memory directory (`SessionRegistry`), the orchestrator that opens and closes sessions (`SessionManager`), the worker construction step (`SessionWorkerClientFactory`), and a hosted service that drains sessions during host shutdown (`SessionShutdownHostedService`).
|
||||
|
||||
All three interfaces (`ISessionManager`, `ISessionRegistry`, `ISessionWorkerClientFactory`) plus `SessionShutdownHostedService` are wired as singletons by `SessionServiceCollectionExtensions.AddGatewaySessions`.
|
||||
|
||||
## Key Types
|
||||
|
||||
### GatewaySession
|
||||
|
||||
`GatewaySession` is a sealed class that holds the identity, configured timeouts, worker client reference, and current `SessionState` for one session. State is protected by a private `_syncRoot` lock so that property reads and transitions are observed atomically by concurrent gRPC calls and the lease sweeper.
|
||||
|
||||
The session id is an opaque string in the form `session-{guid:N}` and the per-session pipe name is `mxaccess-gateway-{ProcessId}-{SessionId}`. Encoding the gateway PID into the pipe name avoids collisions when an old gateway process leaks pipes that the OS has not yet reclaimed.
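
The two naming formats can be reproduced with string interpolation. This is a sketch: `SessionNamingSketch`, `NewSessionId`, and `PipeNameFor` are hypothetical helpers, and passing the process id as a parameter is a choice made here for testability rather than something the document specifies.

```csharp
using System;

public static class SessionNamingSketch
{
    // "N" formats the GUID as 32 hex digits with no hyphens or braces.
    public static string NewSessionId() => $"session-{Guid.NewGuid():N}";

    // Combines the gateway process id and the session id into the pipe name.
    public static string PipeNameFor(int processId, string sessionId) =>
        $"mxaccess-gateway-{processId}-{sessionId}";
}
```

A caller in the gateway process would pass `Environment.ProcessId` as the first argument to `PipeNameFor`.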
|
||||
|
||||
`SessionState` itself is the protobuf-generated enum from `MxGateway.Contracts.Proto`, so it is shared between the gateway and clients on the wire.
|
||||
|
||||
```csharp
|
||||
public void TransitionTo(SessionState nextState)
|
||||
{
|
||||
lock (_syncRoot)
|
||||
{
|
||||
if (_state is SessionState.Closed)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
if (_state is SessionState.Faulted && nextState is not SessionState.Closed)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
_state = nextState;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`Closed` is terminal and `Faulted` only allows a transition to `Closed`. This guards against late callbacks (worker exit, heartbeat timeout) re-animating a session that is already torn down.
|
||||
|
||||
### SessionManager (ISessionManager)
|
||||
|
||||
`SessionManager` is the orchestrator. It exposes `OpenSessionAsync`, `TryGetSession`, `InvokeAsync`, `ReadEventsAsync`, `CloseSessionAsync`, `CloseExpiredLeasesAsync`, and `ShutdownAsync`. It composes `ISessionRegistry`, `ISessionWorkerClientFactory`, `GatewayMetrics`, and `GatewayOptions`.
|
||||
|
||||
Concurrency is bounded by a `SemaphoreSlim` initialized to `GatewayOptions.Sessions.MaxSessions`. Open requests that exceed the bound throw `SessionManagerException` with `SessionLimitExceeded` rather than queuing; the caller is expected to retry.
|
||||
|
||||
```csharp
|
||||
private void EnsureSessionCapacity()
|
||||
{
|
||||
if (!_sessionSlots.Wait(0))
|
||||
{
|
||||
throw new SessionManagerException(
|
||||
SessionManagerErrorCode.SessionLimitExceeded,
|
||||
$"Gateway session limit {_options.Sessions.MaxSessions} has been reached.");
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`SessionManager` also defines three close-reason constants — `DefaultCloseReason` (`"client-close"`), `GatewayShutdownReason` (`"gateway-shutdown"`), and `LeaseExpiredReason` (`"lease-expired"`) — so that the metrics and worker shutdown paths agree on a fixed vocabulary.
|
||||
|
||||
### SessionRegistry (ISessionRegistry)
|
||||
|
||||
`SessionRegistry` is a thin wrapper over a `ConcurrentDictionary<string, GatewaySession>` keyed by session id with `StringComparer.Ordinal`. `Snapshot` materializes the values into an array so iteration callers (lease sweeper, shutdown) do not race with concurrent `TryAdd` and `TryRemove` calls.
|
||||
|
||||
`ActiveCount` filters out sessions whose state is `Closed`; this is consumed by metrics and the dashboard, where `Count` would otherwise momentarily over-report during teardown.
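
The registry behaviour described in this section can be sketched with a `ConcurrentDictionary`. The types below (`SketchState`, `SketchSession`, `RegistrySketch`) are hypothetical, and the session is reduced to an immutable record for brevity, whereas the real `GatewaySession` is mutable and lock-protected.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

public enum SketchState { Ready, Closed }
public sealed record SketchSession(string SessionId, SketchState State);

public sealed class RegistrySketch
{
    private readonly ConcurrentDictionary<string, SketchSession> _sessions =
        new(StringComparer.Ordinal);

    public bool TryAdd(SketchSession session) => _sessions.TryAdd(session.SessionId, session);

    public bool TryRemove(string sessionId) => _sessions.TryRemove(sessionId, out _);

    // Copies the current values so callers iterate a stable array.
    public SketchSession[] Snapshot() => _sessions.Values.ToArray();

    public int Count => _sessions.Count;

    // Excludes sessions that are closed but not yet removed.
    public int ActiveCount => _sessions.Values.Count(s => s.State != SketchState.Closed);
}
```

Iterating the array returned by `Snapshot` stays valid even if another thread removes a session mid-loop, which is the property the lease sweeper and shutdown path rely on.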
|
||||
|
||||
### SessionWorkerClientFactory (ISessionWorkerClientFactory)
|
||||
|
||||
`SessionWorkerClientFactory.CreateAsync` is the only path that builds an `IWorkerClient`. It drives the session through the protobuf `SessionState` substates in order: `StartingWorker`, `WaitingForPipe`, `Handshaking`, `InitializingWorker`. The substates are wire-visible so the dashboard and clients can render startup progress.
|
||||
|
||||
A linked `CancellationTokenSource` enforces `session.StartupTimeout`. If startup fails or times out, the factory either kills the partially-constructed `WorkerClient` or, if the client was never built, kills the launched process and disposes the named pipe before rethrowing. A pure timeout is rewritten as `TimeoutException` so callers can distinguish it from caller-driven cancellation:
|
||||
|
||||
```csharp
|
||||
if (exception is OperationCanceledException
|
||||
&& startupCancellation.IsCancellationRequested
|
||||
&& !cancellationToken.IsCancellationRequested)
|
||||
{
|
||||
throw new TimeoutException(
|
||||
$"Worker session {session.SessionId} did not complete startup within {session.StartupTimeout}.",
|
||||
exception);
|
||||
}
|
||||
```
|
||||
|
||||
The named pipe is created with `maxNumberOfServerInstances: 1` so a second worker cannot connect to the same pipe name even if the first launch is still pending. Combined with the per-session nonce passed to the worker, this is the gateway's defense against a foreign process answering a pipe.
|
||||
|
||||
### SessionShutdownHostedService
|
||||
|
||||
`SessionShutdownHostedService` is an `IHostedService` whose only job is to call `ISessionManager.ShutdownAsync` from `StopAsync`. It catches `OperationCanceledException` triggered by the host shutdown timeout and logs a warning so that an over-running shutdown does not surface as an unhandled exception.
|
||||
|
||||
### SessionOpenRequest
|
||||
|
||||
`SessionOpenRequest` is the gateway-internal record passed to `OpenSessionAsync`. It is constructed from the wire-level `OpenSessionRequest` via `SessionOpenRequest.FromContract`. Keeping a separate internal record means the gRPC layer can normalize input (defaulting backend, sanitizing strings) without leaking generated proto types into `SessionManager`.
|
||||
|
||||
```csharp
|
||||
public sealed record SessionOpenRequest(
|
||||
string? RequestedBackend,
|
||||
string? ClientSessionName,
|
||||
string? ClientCorrelationId,
|
||||
Duration? CommandTimeout)
|
||||
{
|
||||
public static SessionOpenRequest FromContract(OpenSessionRequest request)
|
||||
{
|
||||
ArgumentNullException.ThrowIfNull(request);
|
||||
|
||||
return new SessionOpenRequest(
|
||||
request.RequestedBackend,
|
||||
request.ClientSessionName,
|
||||
request.ClientCorrelationId,
|
||||
request.CommandTimeout);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### SessionCloseResult
|
||||
|
||||
`SessionCloseResult` is the record returned from a successful close. `AlreadyClosed` distinguishes "this call closed the session" from "the session was already closed when we acquired the close lock", which the metrics layer uses to avoid double-counting.
|
||||
|
||||
```csharp
|
||||
public sealed record SessionCloseResult(
|
||||
string SessionId,
|
||||
SessionState FinalState,
|
||||
bool AlreadyClosed);
|
||||
```
|
||||
|
||||
### SessionCloseStartedException
|
||||
|
||||
`SessionCloseStartedException` is `internal` and is only thrown from inside `GatewaySession.CloseAsync` when the close path has already begun mutating worker state and a subsequent step fails. `SessionManager.CloseSessionCoreAsync` catches it, marks the session faulted, increments the close-failed metric, removes the session from the registry, and rethrows it wrapped as `SessionManagerException` with `CloseFailed`. The intermediate type exists so the public API surface only ever exposes `SessionManagerException`.
|
||||
|
||||
### SessionManagerException and SessionManagerErrorCode
|
||||
|
||||
`SessionManagerException` is the single public error type emitted from this subsystem; the code is carried in the `ErrorCode` property and is also surfaced to metrics tags via `SessionManagerErrorCode.ToString()`.
|
||||
|
||||
| Code | Meaning |
|
||||
|------|---------|
|
||||
| `SessionNotFound` | The session id is not in the registry. |
|
||||
| `SessionNotReady` | The session or its `IWorkerClient` is not in `Ready` state. |
|
||||
| `EventSubscriberAlreadyActive` | A second event subscriber attached when only one is allowed. |
|
||||
| `EventQueueOverflow` | Reserved for the worker event channel overflow path. |
|
||||
| `SessionLimitExceeded` | `MaxSessions` is in use. |
|
||||
| `OpenFailed` | `OpenSessionAsync` failed; the inner exception carries the cause. |
|
||||
| `CloseFailed` | A close started but did not complete cleanly; the session is removed and faulted. |
|
||||
|
||||
## Lifecycle
|
||||
|
||||
### Open
|
||||
|
||||
`SessionManager.OpenSessionAsync` allocates a session slot, builds the `GatewaySession`, registers it, and asks the factory to bring up the worker. Failures roll back every preceding step:
|
||||
|
||||
```csharp
|
||||
catch (Exception exception)
|
||||
{
|
||||
session?.MarkFaulted(exception.Message);
|
||||
if (session is not null)
|
||||
{
|
||||
_registry.TryRemove(session.SessionId, out _);
|
||||
await session.DisposeAsync().ConfigureAwait(false);
|
||||
}
|
||||
|
||||
ReleaseSessionSlot();
|
||||
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
|
||||
_logger.LogWarning(
|
||||
exception,
|
||||
"Failed to open gateway session {SessionId}.",
|
||||
session?.SessionId ?? "<not-created>");
|
||||
|
||||
throw new SessionManagerException(
|
||||
SessionManagerErrorCode.OpenFailed,
|
||||
session is null ? "Failed to create session." : $"Failed to open session {session.SessionId}.",
|
||||
exception);
|
||||
}
|
||||
```
|
||||
|
||||
The order — fault, deregister, dispose, release slot, record metric, log, rethrow — matters because releasing the semaphore before disposal would let the next open race the worker process tear-down on the same machine.
|
||||
|
||||
### Run
|
||||
|
||||
While `Ready`, callers reach the worker through `SessionManager.InvokeAsync` or `ReadEventsAsync`. Both delegate to `GatewaySession`, which checks the state under lock and updates `LastClientActivityAt` on every invocation. `GatewaySession` also exposes typed bulk helpers (`AddItemBulkAsync`, `SubscribeBulkAsync`, etc.) that wrap `WorkerCommand` round-trips and translate non-`Ok` `ProtocolStatus` replies into `SessionManagerException` with `SessionNotReady`.
|
||||
|
||||
Event streaming uses `AttachEventSubscriber` which returns a disposable lease. When `allowMultipleSubscribers` is false the second attach throws `EventSubscriberAlreadyActive`; this prevents two gRPC streams from racing on the same worker event channel.
|
||||
|
||||
`ExtendLease` and `IsLeaseExpired` cooperate with `SessionManager.CloseExpiredLeasesAsync`, which iterates a registry snapshot and closes any session whose lease has expired with `LeaseExpiredReason`.
|
||||
|
||||
### Close
|
||||
|
||||
`GatewaySession.CloseAsync` is serialized by a per-session `SemaphoreSlim` (`_closeLock`). It transitions to `Closing`, asks the worker client to shut down within `ShutdownTimeout`, and on success transitions to `Closed`. If `WorkerClient.ShutdownAsync` throws, the session falls back to `IWorkerClient.Kill` (forced close):
|
||||
|
||||
```csharp
|
||||
if (_workerClient is not null)
|
||||
{
|
||||
try
|
||||
{
|
||||
await _workerClient.ShutdownAsync(ShutdownTimeout, cancellationToken).ConfigureAwait(false);
|
||||
}
|
||||
catch (Exception exception)
|
||||
{
|
||||
try
|
||||
{
|
||||
_workerClient.Kill(reason);
|
||||
}
|
||||
catch (Exception killException)
|
||||
{
|
||||
throw new SessionCloseStartedException(
|
||||
$"Session {SessionId} close failed after worker shutdown started.",
|
||||
new AggregateException(exception, killException));
|
||||
}
|
||||
|
||||
throw;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
If both graceful shutdown and the kill fall-back fail, the original and kill exceptions are bundled into an `AggregateException` and surfaced as `SessionCloseStartedException`. `SessionManager.CloseSessionCoreAsync` then translates that into a `SessionManagerException` with `CloseFailed` and removes the session.
|
||||
|
||||
`GatewaySession.KillWorker` is the unconditional forced-close path used by shutdown when graceful close itself throws.
|
||||
|
||||
## Shutdown Coordination
|
||||
|
||||
`SessionShutdownHostedService.StopAsync` calls `SessionManager.ShutdownAsync`, which closes every registered session with `GatewayShutdownReason`. The shutdown loop catches per-session exceptions, calls `KillWorker`, and removes the session so that one stuck worker cannot block the rest of the host:
|
||||
|
||||
```csharp
|
||||
public async Task ShutdownAsync(CancellationToken cancellationToken)
|
||||
{
|
||||
foreach (GatewaySession session in _registry.Snapshot())
|
||||
{
|
||||
try
|
||||
{
|
||||
await CloseSessionCoreAsync(session, GatewayShutdownReason, cancellationToken).ConfigureAwait(false);
|
||||
}
|
||||
catch (Exception exception)
|
||||
{
|
||||
_logger.LogWarning(
|
||||
exception,
|
||||
"Graceful shutdown failed for session {SessionId}; killing worker.",
|
||||
session.SessionId);
|
||||
if (_registry.TryGet(session.SessionId, out _))
|
||||
{
|
||||
session.KillWorker(GatewayShutdownReason);
|
||||
await RemoveSessionAsync(session).ConfigureAwait(false);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Iterating over `Snapshot` rather than the live registry gives the loop a stable list of sessions, so `RemoveSessionAsync` can mutate the registry inside the loop without disturbing the iteration.
|
||||
|
||||
## Dependency Injection
|
||||
|
||||
`SessionServiceCollectionExtensions.AddGatewaySessions` registers three singletons and the hosted service:
|
||||
|
||||
```csharp
|
||||
public static IServiceCollection AddGatewaySessions(this IServiceCollection services)
|
||||
{
|
||||
services.AddSingleton<ISessionRegistry, SessionRegistry>();
|
||||
services.AddSingleton<ISessionWorkerClientFactory, SessionWorkerClientFactory>();
|
||||
services.AddSingleton<ISessionManager, SessionManager>();
|
||||
services.AddHostedService<SessionShutdownHostedService>();
|
||||
|
||||
return services;
|
||||
}
|
||||
```
|
||||
|
||||
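The registry must be a singleton because its `ConcurrentDictionary` is the source of truth for session state across the gRPC service, the lease sweeper, the dashboard, and the shutdown hosted service. Registration order does not decide construction order: the container resolves a hosted service's constructor dependencies, such as `ISessionManager`, before it constructs the service. Registration order does matter among hosted services, because the generic host starts them in registration order and by default stops them in reverse. Registering `SessionShutdownHostedService` after the other hosted services therefore means it drains sessions before they stop.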
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Gateway Process Design](./gateway-process-design.md)
|
||||
- [Gateway Configuration](./GatewayConfiguration.md)
|
||||
- [Worker Process Launcher](./WorkerProcessLauncher.md)
|
||||
@@ -0,0 +1,223 @@
|
||||
# Worker Bootstrap
|
||||
|
||||
The bootstrap layer parses the command-line arguments and environment variables passed to the `MxGateway.Worker` process, validates them against the gateway contract, and produces either a populated `WorkerOptions` instance or a structured failure that maps to a `WorkerExitCode`.
|
||||
|
||||
## Overview
|
||||
|
||||
The worker process is a short-lived child of the gateway. The gateway side of this contract lives in [WorkerProcessLauncher](./WorkerProcessLauncher.md). On the worker side, `Program.cs` is a single statement that delegates to `WorkerApplication.Run(args)`:
|
||||
|
||||
```csharp
|
||||
using MxGateway.Worker;
|
||||
|
||||
return WorkerApplication.Run(args);
|
||||
```
|
||||
|
||||
`WorkerApplication.Run` constructs the bootstrap dependencies (`EnvironmentVariableWorkerEnvironment`, `WorkerConsoleLogger` writing to `Console.Error`, and a `WorkerPipeClient`), runs `WorkerOptionsParser`, and routes the resulting `WorkerBootstrapResult` either into the pipe client or into a non-zero exit. Splitting parsing from process wiring lets tests substitute fakes for the environment, logger, and pipe client without spawning a child process.
|
||||
|
||||
## Worker Options
|
||||
|
||||
`WorkerOptions` is the validated input contract for a worker session. The gateway hands every field to the worker; the worker never reads configuration files.
|
||||
|
||||
```csharp
|
||||
public sealed class WorkerOptions
|
||||
{
|
||||
public const string NonceEnvironmentVariableName = "MXGATEWAY_WORKER_NONCE";
|
||||
|
||||
public WorkerOptions(
|
||||
string sessionId,
|
||||
string pipeName,
|
||||
uint protocolVersion,
|
||||
string nonce)
|
||||
{
|
||||
SessionId = sessionId;
|
||||
PipeName = pipeName;
|
||||
ProtocolVersion = protocolVersion;
|
||||
Nonce = nonce;
|
||||
}
|
||||
|
||||
public string SessionId { get; }
|
||||
public string PipeName { get; }
|
||||
public uint ProtocolVersion { get; }
|
||||
public string Nonce { get; }
|
||||
}
|
||||
```
|
||||
|
||||
### Required inputs
|
||||
|
||||
All four fields are required. Three arrive on the command line and one arrives via environment variable:
|
||||
|
||||
| Source | Name | Maps to |
|
||||
|--------|------|---------|
|
||||
| Argument | `--session-id` | `SessionId` |
|
||||
| Argument | `--pipe-name` | `PipeName` |
|
||||
| Argument | `--protocol-version` | `ProtocolVersion` |
|
||||
| Env var | `MXGATEWAY_WORKER_NONCE` | `Nonce` |
|
||||
|
||||
The nonce travels via environment variable rather than an argument because process command lines are visible to other users on Windows through `wmic`, `Get-CimInstance Win32_Process`, and the kernel object table; environment variables of another process are not. Treating the nonce as a credential keeps it off the command line.
|
||||
|
||||
Every option is required; none is optional. An unknown flag, a flag without a value, or a flag whose value starts with `--` is reported as an error rather than silently ignored.
|
||||
|
||||
## The Parser
|
||||
|
||||
`WorkerOptionsParser` walks `args` once, collects values into a case-insensitive dictionary, and accumulates errors so the caller sees every problem in a single failure rather than fixing them one at a time.
|
||||
|
||||
```csharp
|
||||
for (int index = 0; index < args.Length; index++)
|
||||
{
|
||||
string arg = args[index];
|
||||
if (!IsKnownOption(arg))
|
||||
{
|
||||
errors.Add($"Unknown option '{arg}'.");
|
||||
continue;
|
||||
}
|
||||
|
||||
if (index + 1 >= args.Length || args[index + 1].StartsWith("--", StringComparison.Ordinal))
|
||||
{
|
||||
errors.Add($"Option '{arg}' requires a value.");
|
||||
continue;
|
||||
}
|
||||
|
||||
values[arg] = args[index + 1];
|
||||
index++;
|
||||
}
|
||||
```
|
||||
|
||||
After argument scanning, the parser cross-checks the protocol version against `GatewayContractInfo.WorkerProtocolVersion`. A version that parses as a `uint` but does not match the contract value is a hard failure with `WorkerExitCode.InvalidProtocolVersion`, separate from `InvalidArguments`, so the gateway can distinguish a malformed launch from a version mismatch and report a useful upgrade message.
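
The two-step version check can be sketched as below. This is an illustration under assumptions rather than the parser's actual code: `ProtocolVersionCheck` and `Validate` are hypothetical names, and the constant `1` is only a placeholder for `GatewayContractInfo.WorkerProtocolVersion`, whose real value is not shown in this document.

```csharp
using System.Globalization;

public static class ProtocolVersionCheck
{
    // Placeholder value; the real one comes from GatewayContractInfo.WorkerProtocolVersion.
    public const uint ExpectedProtocolVersion = 1;

    // Returns null when the text is acceptable, otherwise an error message.
    // Both failure cases would map to WorkerExitCode.InvalidProtocolVersion.
    public static string? Validate(string text)
    {
        // NumberStyles.None accepts digits only: no sign, whitespace, or separators.
        if (!uint.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out uint parsed))
        {
            return $"Protocol version '{text}' is not an unsigned integer.";
        }

        if (parsed != ExpectedProtocolVersion)
        {
            return $"Protocol version {parsed} does not match expected version {ExpectedProtocolVersion}.";
        }

        return null;
    }
}
```

Parsing with the invariant culture keeps the result independent of the worker machine's regional settings.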
|
||||
|
||||
The nonce is read last so that argument-shape errors are reported before the parser asks the environment for a secret it might not need.
|
||||
|
||||
## Bootstrap Result
|
||||
|
||||
`WorkerBootstrapResult` is a discriminated success/failure carrier. `Options` is non-null when `Succeeded` is true; `Errors` is populated only on failure.
|
||||
|
||||
```csharp
|
||||
public static WorkerBootstrapResult Success(WorkerOptions options)
|
||||
{
|
||||
return new WorkerBootstrapResult(WorkerExitCode.Success, options, []);
|
||||
}
|
||||
|
||||
public static WorkerBootstrapResult Failure(WorkerExitCode exitCode, IEnumerable<string> errors)
|
||||
{
|
||||
return new WorkerBootstrapResult(exitCode, null, errors.ToArray());
|
||||
}
|
||||
```
|
||||
|
||||
`Succeeded` is defined as `ExitCode == WorkerExitCode.Success` rather than as a separate flag, so the exit code and the success state cannot disagree.
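
A compact sketch of the carrier shows how the derived property fits together with the two factories. The private constructor, the property shapes, the abbreviated enum, and the empty `WorkerOptions` placeholder are assumptions made to keep the example self-contained; only the factory bodies and the `Succeeded` definition come from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Abbreviated for the sketch; the full list is in the "Exit Codes" table.
public enum WorkerExitCode
{
    Success = 0,
    InvalidArguments = 2,
}

// Empty placeholder; the real WorkerOptions carries the four session fields.
public sealed class WorkerOptions
{
}

public sealed class WorkerBootstrapResult
{
    private WorkerBootstrapResult(
        WorkerExitCode exitCode,
        WorkerOptions? options,
        IReadOnlyList<string> errors)
    {
        ExitCode = exitCode;
        Options = options;
        Errors = errors;
    }

    public WorkerExitCode ExitCode { get; }

    public WorkerOptions? Options { get; }

    public IReadOnlyList<string> Errors { get; }

    // Computed from ExitCode, so there is no second flag to keep in sync.
    public bool Succeeded => ExitCode == WorkerExitCode.Success;

    public static WorkerBootstrapResult Success(WorkerOptions options)
    {
        return new WorkerBootstrapResult(WorkerExitCode.Success, options, Array.Empty<string>());
    }

    public static WorkerBootstrapResult Failure(WorkerExitCode exitCode, IEnumerable<string> errors)
    {
        return new WorkerBootstrapResult(exitCode, null, errors.ToArray());
    }
}
```

The sketch does not guard against `Failure` being called with `WorkerExitCode.Success`; whether the real type rejects that combination is not stated here.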
|
||||
|
||||
## Exit Codes
|
||||
|
||||
`WorkerExitCode` is the worker process's exit contract. The gateway-side launcher decodes these values to decide whether a relaunch is safe.
|
||||
|
||||
| Value | Numeric | Produced when |
|
||||
|-------|---------|---------------|
|
||||
| `Success` | 0 | The pipe session ran to a clean close. |
|
||||
| `UnexpectedFailure` | 1 | Any unhandled exception not matched by a more specific catch. |
|
||||
| `InvalidArguments` | 2 | One or more `--session-id`, `--pipe-name`, or `--protocol-version` errors (missing, empty, unknown flag, or no value). |
|
||||
| `InvalidProtocolVersion` | 3 | `--protocol-version` is not a `uint` or does not equal `GatewayContractInfo.WorkerProtocolVersion`. |
|
||||
| `MissingNonce` | 4 | The `MXGATEWAY_WORKER_NONCE` environment variable is null, empty, or whitespace. |
|
||||
| `PipeConnectionFailed` | 5 | An `IOException` or `TimeoutException` escapes the pipe client. |
|
||||
| `ProtocolViolation` | 6 | A `WorkerFrameProtocolException` escapes the pipe client. |
|
||||
|
||||
`InvalidArguments`, `InvalidProtocolVersion`, and `MissingNonce` originate in the parser; the others originate in `WorkerApplication.Run`'s `try/catch` around the pipe client.
|
||||
|
||||
## Environment Abstraction
|
||||
|
||||
`IWorkerEnvironment` exists so tests can supply a fake nonce without mutating the real process environment, which would be a shared mutable global across parallel test runs.
|
||||
|
||||
```csharp
|
||||
public interface IWorkerEnvironment
|
||||
{
|
||||
string? GetEnvironmentVariable(string name);
|
||||
}
|
||||
|
||||
public sealed class EnvironmentVariableWorkerEnvironment : IWorkerEnvironment
|
||||
{
|
||||
public string? GetEnvironmentVariable(string name)
|
||||
{
|
||||
return Environment.GetEnvironmentVariable(name);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The production binding in `WorkerApplication.Run(string[])` is `EnvironmentVariableWorkerEnvironment`, which is a thin pass-through to `System.Environment.GetEnvironmentVariable`.
|
||||
|
||||
## Logging
|
||||
|
||||
The worker writes structured key/value lines to standard error. Standard error is used rather than standard output because the gateway side reads worker stdout for diagnostic capture only, while stderr is reserved for log output that does not interfere with any future stdout-based channel.
|
||||
|
||||
### The logger contract
|
||||
|
||||
`IWorkerLogger` exposes only `Information` and `Error`. There is no `Debug` or `Trace` level, because the worker is launched per session and verbose tracing belongs to the gateway-side launcher.
|
||||
|
||||
```csharp
|
||||
public interface IWorkerLogger
|
||||
{
|
||||
void Information(string eventName, IReadOnlyDictionary<string, object?> fields);
|
||||
|
||||
void Error(string eventName, IReadOnlyDictionary<string, object?> fields);
|
||||
}
|
||||
```
|
||||
|
||||
`WorkerConsoleLogger` formats each call as `level=<Level> event=<EventName> key=value key=value` after running the field dictionary through `WorkerLogRedactor`:
|
||||
|
||||
```csharp
|
||||
private void Write(
|
||||
string level,
|
||||
string eventName,
|
||||
IReadOnlyDictionary<string, object?> fields)
|
||||
{
|
||||
Dictionary<string, object?> redactedFields = WorkerLogRedactor.RedactFields(fields);
|
||||
string fieldText = string.Join(
|
||||
" ",
|
||||
redactedFields.Select(field => $"{field.Key}={FormatValue(field.Value)}"));
|
||||
|
||||
_writer.WriteLine($"level={level} event={eventName} {fieldText}".TrimEnd());
|
||||
}
|
||||
```
|
||||
|
||||
### What the redactor redacts and why
|
||||
|
||||
`AGENTS.md` "Security And Logging" requires that the worker never log raw credential values for `AuthenticateUser`, `WriteSecured`, or related secured operations. The bootstrap nonce is also a credential: anyone who reads it can impersonate the worker to the gateway pipe. `WorkerLogRedactor` enforces this by replacing values whose field name contains any of these substrings (case-insensitive) with the literal `[redacted]`:
|
||||
|
||||
```csharp
|
||||
private static readonly string[] SensitiveFieldNameParts =
|
||||
[
|
||||
"nonce",
|
||||
"secret",
|
||||
"password",
|
||||
"token",
|
||||
"credential",
|
||||
"apikey",
|
||||
"api_key",
|
||||
];
|
||||
```
|
||||
|
||||
The match is on substrings of the field name rather than an exact list, so a field called `auth_token` or `user_password` is redacted automatically without each call site having to remember to opt in. `null` values pass through unchanged so the absence of a value is still visible in logs.
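
The substring rule can be sketched as follows. `LogRedactionSketch` and `IsSensitive` are hypothetical names, and the body is an assumption about how such a redactor might be written; only the list of name parts, the `[redacted]` literal, the case-insensitive match, and the null pass-through come from the description above.

```csharp
using System;
using System.Collections.Generic;

public static class LogRedactionSketch
{
    private static readonly string[] SensitiveFieldNameParts =
    {
        "nonce", "secret", "password", "token", "credential", "apikey", "api_key",
    };

    public static Dictionary<string, object?> RedactFields(
        IReadOnlyDictionary<string, object?> fields)
    {
        var redacted = new Dictionary<string, object?>(fields.Count);
        foreach (KeyValuePair<string, object?> field in fields)
        {
            // Null values pass through so a missing value stays visible in logs.
            bool sensitive = field.Value is not null && IsSensitive(field.Key);
            redacted[field.Key] = sensitive ? "[redacted]" : field.Value;
        }

        return redacted;
    }

    private static bool IsSensitive(string fieldName)
    {
        foreach (string part in SensitiveFieldNameParts)
        {
            // Substring match, ignoring case: "Auth_Token" contains "token".
            if (fieldName.IndexOf(part, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

Matching on name parts errs on the side of over-redaction: a harmless field such as `token_count` would also be masked, which is usually the safer failure mode for a log channel.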
|
||||
|
||||
## How `Program.cs` Consumes The Result
|
||||
|
||||
`WorkerApplication.Run` is the single consumer of `WorkerBootstrapResult`. On failure it logs a `WorkerBootstrapFailed` event and returns the numeric `ExitCode` directly:
|
||||
|
||||
```csharp
|
||||
WorkerOptionsParser parser = new(environment);
|
||||
WorkerBootstrapResult result = parser.Parse(args);
|
||||
|
||||
if (!result.Succeeded)
|
||||
{
|
||||
logger.Error("WorkerBootstrapFailed", new Dictionary<string, object?>
|
||||
{
|
||||
["exit_code"] = result.ExitCode,
|
||||
["errors"] = string.Join(";", result.Errors),
|
||||
});
|
||||
|
||||
return (int)result.ExitCode;
|
||||
}
|
||||
```
|
||||
|
||||
On success it logs `WorkerBootstrapSucceeded` with the session fields (the `nonce` field is redacted by `WorkerLogRedactor` because of its name), hands the `WorkerOptions` to `IWorkerPipeClient.RunAsync`, and waits synchronously. The `try/catch` around the pipe call maps `WorkerFrameProtocolException` to `ProtocolViolation`, `IOException`/`TimeoutException` to `PipeConnectionFailed`, and any other exception to `UnexpectedFailure`, so every code path through `Run` returns one of the values in `WorkerExitCode`.
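
The exception-to-exit-code mapping can be sketched like this. The `ExitCodeMapping` class, the `Action` parameter, and the local `WorkerFrameProtocolException` stand-in are assumptions for the example; the enum values are taken from the exit code table, abbreviated to the members this path produces.

```csharp
using System;
using System.IO;

// Abbreviated to the values produced around the pipe client.
public enum WorkerExitCode
{
    Success = 0,
    UnexpectedFailure = 1,
    PipeConnectionFailed = 5,
    ProtocolViolation = 6,
}

// Hypothetical stand-in for the worker's frame protocol exception.
public sealed class WorkerFrameProtocolException : Exception
{
    public WorkerFrameProtocolException(string message)
        : base(message)
    {
    }
}

public static class ExitCodeMapping
{
    // Runs the pipe session delegate and maps its outcome to an exit code.
    public static WorkerExitCode Run(Action runPipeSession)
    {
        try
        {
            runPipeSession();
            return WorkerExitCode.Success;
        }
        catch (WorkerFrameProtocolException)
        {
            return WorkerExitCode.ProtocolViolation;
        }
        catch (Exception exception) when (exception is IOException || exception is TimeoutException)
        {
            return WorkerExitCode.PipeConnectionFailed;
        }
        catch (Exception)
        {
            // Catch-all last, so the specific handlers above take precedence.
            return WorkerExitCode.UnexpectedFailure;
        }
    }
}
```

Because the final handler catches `Exception`, no exception escapes the method, which is what lets every path return a defined `WorkerExitCode`.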
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Worker Process Launcher](./WorkerProcessLauncher.md)
|
||||
- [Worker STA](./WorkerSta.md)
|
||||
- [Worker Frame Protocol](./WorkerFrameProtocol.md)
|
||||
@@ -0,0 +1,262 @@
|
||||
# Worker Conversion Layer
|
||||
|
||||
The conversion layer in `MxGateway.Worker.Conversion` projects COM `VARIANT` payloads, `HRESULT` codes, and `MXSTATUS_PROXY` records into the protobuf wire types in `MxGateway.Contracts.Proto`. The design is parity-first: every projection preserves enough raw metadata that the original COM observation can be reconstructed downstream.
|
||||
|
||||
## Overview
|
||||
|
||||
`AGENTS.md` (section "Value And Status Rules") requires that the wire format use a value union capable of representing COM `VARIANT` values and arrays, that lossy conversions retain both the typed projection and raw diagnostic metadata, and that `MXSTATUS_PROXY` arrays never collapse to a single success flag. The types in `src/MxGateway.Worker/Conversion/` are the worker-side enforcement of those rules.
|
||||
|
||||
The layer is split into three concerns:
|
||||
|
||||
- Value projection: `VariantConverter`
|
||||
- HRESULT projection: `HResultConverter`, `HResultConversion`
|
||||
- Status projection: `MxStatusProxyConverter`, `MxStatusDetailText`, `MxStatusConversionException`
|
||||
|
||||
## VariantConverter
|
||||
|
||||
`VariantConverter` projects scalar and array `VARIANT` payloads delivered by COM interop into `MxValue` and `MxArray`. It accepts an optional `expectedDataType` so that an MXAccess attribute hint (for example `MxDataType.Time` for a 64-bit FILETIME) overrides the default CLR-driven projection.
|
||||
|
||||
### Scalar projection
|
||||
|
||||
Scalars dispatch on `Type.GetTypeCode` and populate the matching field in the `MxValue` `oneof`. Each branch records both the typed value and the source `VARIANT` tag in `VariantType`, so consumers can distinguish a `short` from an `int` even after both project to `MxDataType.Integer`.
|
||||
|
||||
```csharp
|
||||
case TypeCode.Boolean:
|
||||
return new MxValue
|
||||
{
|
||||
DataType = MxDataType.Boolean,
|
||||
VariantType = variantType,
|
||||
BoolValue = (bool)value,
|
||||
};
|
||||
|
||||
case TypeCode.Byte:
|
||||
case TypeCode.SByte:
|
||||
case TypeCode.Int16:
|
||||
case TypeCode.UInt16:
|
||||
case TypeCode.Int32:
|
||||
return new MxValue
|
||||
{
|
||||
DataType = MxDataType.Integer,
|
||||
VariantType = variantType,
|
||||
Int32Value = System.Convert.ToInt32(value, CultureInfo.InvariantCulture),
|
||||
};
|
||||
```
|
||||
|
||||
`Decimal` and 64-bit unsigned integers that exceed `long.MaxValue` cannot project losslessly. The converter still emits a typed best-effort value but attaches a `RawDiagnostic` so the loss is recorded on the wire rather than silently absorbed.
|
||||
|
||||
```csharp
|
||||
case TypeCode.Decimal:
|
||||
return new MxValue
|
||||
{
|
||||
DataType = MxDataType.Double,
|
||||
VariantType = variantType,
|
||||
DoubleValue = System.Convert.ToDouble(value, CultureInfo.InvariantCulture),
|
||||
RawDiagnostic = "Decimal value projected to double.",
|
||||
};
|
||||
```
|
||||
|
||||
Time-typed integers use the `expectedDataType` hint. `ConvertInt64Scalar` interprets a `long` as a Windows FILETIME when the caller asks for `MxDataType.Time`; otherwise the value stays an integer. `ConvertUInt64Scalar` falls through to a raw projection if the value exceeds `long.MaxValue`.
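
For readers unfamiliar with FILETIME, the sketch below shows the two checks in isolation. `FileTimeSketch` and its method names are hypothetical; the example relies only on the framework's `DateTime.FromFileTimeUtc` and does not reproduce how `VariantConverter` builds the resulting `MxValue`.

```csharp
using System;

public static class FileTimeSketch
{
    // A FILETIME counts 100-nanosecond intervals since 1601-01-01T00:00:00Z.
    // FromFileTimeUtc throws ArgumentOutOfRangeException for negative or
    // out-of-range input, so a real converter would need to handle that case.
    public static DateTime ToUtc(long fileTime)
    {
        return DateTime.FromFileTimeUtc(fileTime);
    }

    // True when a 64-bit unsigned value can be carried as a signed 64-bit value
    // without loss; larger values need the raw projection instead.
    public static bool FitsInInt64(ulong value)
    {
        return value <= (ulong)long.MaxValue;
    }
}
```

As a reference point, the FILETIME value `116444736000000000` corresponds to 1970-01-01T00:00:00Z.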
|
||||
|
||||
### Null and missing values
|
||||
|
||||
`VARIANT` distinguishes `VT_EMPTY` from `VT_NULL`. The converter preserves both by inspecting whether the input is `DBNull` or a CLR null reference, and tags the result with `IsNull = true` so consumers do not misread a default value as data.
|
||||
|
||||
```csharp
|
||||
private static MxValue CreateNullValue(
|
||||
object? value,
|
||||
MxDataType expectedDataType)
|
||||
{
|
||||
return new MxValue
|
||||
{
|
||||
DataType = expectedDataType == MxDataType.Unspecified ? MxDataType.NoData : expectedDataType,
|
||||
VariantType = value is DBNull ? "VT_NULL" : "VT_EMPTY",
|
||||
IsNull = true,
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
### Array projection
|
||||
|
||||
`ConvertArray` records the rank and per-dimension lengths so multi-dimensional `SAFEARRAY` shapes survive the round trip. The element type is resolved from the caller-supplied hint or the CLR element type via `ResolveArrayElementDataType`, then dispatched to the matching typed builder (`ConvertBoolArray`, `ConvertInt64Array`, `ConvertTimestampArray`, and so on).
|
||||
|
||||
```csharp
|
||||
for (int dimension = 0; dimension < array.Rank; dimension++)
|
||||
{
|
||||
mxArray.Dimensions.Add((uint)array.GetLength(dimension));
|
||||
}
|
||||
|
||||
System.Type? elementType = array.GetType().GetElementType();
|
||||
MxDataType elementDataType = ResolveArrayElementDataType(elementType, expectedElementDataType);
|
||||
mxArray.ElementDataType = elementDataType;
|
||||
```
|
||||
|
||||
When the element type cannot be classified, `ConvertArray` does not throw. It downgrades the result to `MxDataType.Unknown`, records the original expected type in `RawElementDataType`, and serializes each element via `ConvertRawArray` as a UTF-8 byte string. This satisfies the `AGENTS.md` requirement to keep both the best typed projection and the raw diagnostic metadata.
|
||||
|
||||
```csharp
|
||||
default:
|
||||
mxArray.ElementDataType = MxDataType.Unknown;
|
||||
mxArray.RawElementDataType = (int)expectedElementDataType;
|
||||
mxArray.RawDiagnostic = CreateRawDiagnostic(array);
|
||||
mxArray.RawValues = ConvertRawArray(array);
|
||||
return mxArray;
|
||||
```
|
||||
|
||||
### Variant type names
|
||||
|
||||
`GetVariantTypeName` maps CLR types to canonical `VT_*` strings (`VT_BOOL`, `VT_I4`, `VT_BSTR`, `VT_DATE`, and so on). Unmapped CLR types fall back to `CLR:<FullName>` so the wire format never silently invents a `VT_*` tag it cannot justify. Array element tags are wrapped as `SAFEARRAY(<element>)` by `CreateArrayVariantType`.
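
The naming scheme can be sketched as below. `VariantTypeNameSketch` is a hypothetical name and the mapping is deliberately partial: beyond the four tags named above, the `VT_I2`, `VT_R4`, and `VT_R8` entries are the standard COM tags for those CLR types and are included as assumptions about what the real table contains.

```csharp
using System;

public static class VariantTypeNameSketch
{
    public static string GetVariantTypeName(Type type)
    {
        if (type == typeof(bool)) return "VT_BOOL";
        if (type == typeof(short)) return "VT_I2";
        if (type == typeof(int)) return "VT_I4";
        if (type == typeof(float)) return "VT_R4";
        if (type == typeof(double)) return "VT_R8";
        if (type == typeof(string)) return "VT_BSTR";
        if (type == typeof(DateTime)) return "VT_DATE";

        // No justified VT_* tag: fall back to the CLR name instead of guessing.
        return "CLR:" + type.FullName;
    }

    public static string CreateArrayVariantType(Type elementType)
    {
        return "SAFEARRAY(" + GetVariantTypeName(elementType) + ")";
    }
}
```

With this sketch, `typeof(Guid)` produces `CLR:System.Guid` and a `bool` element type produces `SAFEARRAY(VT_BOOL)`.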
|
||||
|
||||
### Why lossless preservation matters
|
||||
|
||||
The MXAccess engine returns values whose semantic type only fully resolves after consulting the engine's own attribute metadata. Clients that round-trip these values through the gateway (replay, parity fixtures, diagnostics) need the original `VT_*` tag, the engine-declared `MxDataType`, and any conversion diagnostic; otherwise edge cases such as decimal-to-double rounding, ulong overflow, or an unknown SAFEARRAY element type become invisible bugs. Storing both the typed projection and the raw fields in the same `MxValue`/`MxArray` lets cross-language clients recover the original observation byte-for-byte where possible and detect lossy cases where it is not.
|
||||
|
||||
## HResultConverter and HResultConversion
|
||||
|
||||
`HResultConverter.Convert` wraps any `Exception` thrown across the COM boundary. For a `COMException` it reads `ErrorCode`, the accessor documented for the HRESULT that the COM call returned, and for any other exception it reads `Exception.HResult`. On current .NET runtimes `ExternalException.ErrorCode` returns the same value as `HResult`, so the branch mainly makes the COM case explicit rather than selecting a different number.
|
||||
|
||||
```csharp
|
||||
public HResultConversion Convert(Exception exception)
|
||||
{
|
||||
if (exception is null)
|
||||
{
|
||||
throw new ArgumentNullException(nameof(exception));
|
||||
}
|
||||
|
||||
int hresult = exception is COMException comException
|
||||
? comException.ErrorCode
|
||||
: exception.HResult;
|
||||
|
||||
return new HResultConversion(
|
||||
hresult,
|
||||
CreateProtocolStatus(hresult, exception),
|
||||
CreateSafeDiagnosticMessage(exception));
|
||||
}
|
||||
```
|
||||
|
||||
`CreateProtocolStatus` maps `0` (`S_OK`) to `ProtocolStatusCode.Ok` and any non-zero HRESULT to `ProtocolStatusCode.MxaccessFailure`. The status `Message` includes the exception type name and the formatted HRESULT (`HRESULT 0x<8 hex digits>`), giving downstream operators the same identifier they would see in COM event logs.
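
The `HRESULT 0x<8 hex digits>` form can be produced with a standard format string. `HResultFormatSketch` is a hypothetical helper, and the sketch covers only the formatting, not the construction of the `ProtocolStatus` message.

```csharp
using System.Globalization;

public static class HResultFormatSketch
{
    // "X8" renders the signed 32-bit value as its two's-complement bit pattern,
    // zero-padded to eight uppercase hex digits.
    public static string Format(int hresult)
    {
        return "HRESULT 0x" + hresult.ToString("X8", CultureInfo.InvariantCulture);
    }
}
```

For example, `Format(unchecked((int)0x80004005))` yields `HRESULT 0x80004005`, the familiar `E_FAIL` spelling, even though the underlying `int` is negative.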
|
||||
|
||||
`HResultConversion` is an immutable carrier with three fields:
|
||||
|
||||
| Field | Purpose |
|
||||
|-------|---------|
|
||||
| `HResult` | Raw signed 32-bit HRESULT, suitable for inclusion in a wire reply |
|
||||
| `ProtocolStatus` | The `ProtocolStatus` projection used in command replies |
|
||||
| `DiagnosticMessage` | A safe, fully-qualified diagnostic string for logs |
|
||||
|
||||
The diagnostic message is built from the exception type's `FullName` and the formatted HRESULT only. It deliberately omits `Exception.Message` so user-supplied or localized strings do not leak into worker log channels that may be shipped to less trusted destinations.
|
||||
|
||||
## MxStatusProxyConverter
|
||||
|
||||
`MxStatusProxyConverter` projects MXAccess `MXSTATUS_PROXY` records into the `MxStatusProxy` proto message. Because the COM struct is exposed via interop reflection rather than a typed binding, the converter reads the `success`, `category`, `detectedBy`, and `detail` fields by name through `FieldInfo`.
|
||||
|
||||
```csharp
|
||||
public MxStatusProxy Convert(object status)
|
||||
{
|
||||
if (status is null)
|
||||
{
|
||||
throw new ArgumentNullException(nameof(status));
|
||||
}
|
||||
|
||||
Type statusType = status.GetType();
|
||||
int success = ReadInt32Field(status, statusType, "success");
|
||||
int rawCategory = ReadInt32Field(status, statusType, "category");
|
||||
int rawDetectedBy = ReadInt32Field(status, statusType, "detectedBy");
|
||||
int detail = ReadInt32Field(status, statusType, "detail");
|
||||
|
||||
return new MxStatusProxy
|
||||
{
|
||||
Success = success,
|
||||
Category = MapCategory(rawCategory),
|
||||
DetectedBy = MapSource(rawDetectedBy),
|
||||
Detail = detail,
|
||||
RawCategory = rawCategory,
|
||||
RawDetectedBy = rawDetectedBy,
|
||||
DiagnosticText = MxStatusDetailText.Lookup(detail),
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
`MapCategory` and `MapSource` translate the integer codes documented for `MXSTATUS_PROXY` into typed enums. For example, `MapCategory` maps `0` to `Ok` and `3` to `CommunicationError`, while `MapSource` maps `0` to `RequestingLmx` and `5` to `RespondingAutomationObject`. The original integers are preserved alongside the typed projection in `RawCategory` and `RawDetectedBy`, so any code the runtime emits outside the documented range is still observable.
|
||||
|
||||
### Status arrays
|
||||
|
||||
`ConvertMany` walks an inbound `Array` of status structs and emits `IReadOnlyList<MxStatusProxy>`. Null entries become explicit `Unknown`/`Unknown` placeholders with a diagnostic text rather than being dropped, so the index of each status still aligns with the index of the value or item handle it describes.
|
||||
|
||||
```csharp
|
||||
foreach (object? status in statuses)
|
||||
{
|
||||
if (status is null)
|
||||
{
|
||||
converted.Add(new MxStatusProxy
|
||||
{
|
||||
Category = MxStatusCategory.Unknown,
|
||||
DetectedBy = MxStatusSource.Unknown,
|
||||
DiagnosticText = "Null MXSTATUS_PROXY entry.",
|
||||
});
|
||||
continue;
|
||||
}
|
||||
|
||||
converted.Add(Convert(status));
|
||||
}
|
||||
```
|
||||
|
||||
### Why arrays are not collapsed
|
||||
|
||||
A single MXAccess command (notably `Read`, `Write`, and event callbacks) can return one status per item handle. `AGENTS.md` requires that the wire format represent each entry independently, because collapsing them to a Boolean success flag hides partial failures: a 50-item write where one item fails would be indistinguishable from a 50-item write where every item fails. Preserving the array position by position lets clients correlate each `MxStatusProxy` with its item handle and `MxValue`.
|
||||
|
||||
### Completion-only status fallback
|
||||
|
||||
When MXAccess returns a status payload that is not a recognizable `MXSTATUS_PROXY` struct (for example a completion-only byte buffer from older runtimes), `PreserveCompletionOnlyStatusBytes` hex-encodes the raw bytes so they survive transport even though they cannot be typed.
|
||||
|
||||
```csharp
|
||||
public string PreserveCompletionOnlyStatusBytes(byte[] statusBytes)
|
||||
{
|
||||
if (statusBytes is null)
|
||||
{
|
||||
throw new ArgumentNullException(nameof(statusBytes));
|
||||
}
|
||||
|
||||
return $"completion_only_status_hex={BitConverter.ToString(statusBytes).Replace("-", string.Empty)}";
|
||||
}
|
||||
```
|
||||
|
||||
## MxStatusDetailText
|
||||
|
||||
`MxStatusDetailText` is an internal lookup that maps known `MXSTATUS_PROXY.detail` codes to short human-readable strings (for example `28 = "Index out of range"`, `42 = "Unable to convert string"`, `8017 = "Object must be offscan to modify attributes that have an MxSecurityConfigure security classification"`). `MxStatusProxyConverter.Convert` calls `Lookup` and writes the result to `DiagnosticText`. Unknown codes return `string.Empty`, leaving the numeric `Detail` field as the authoritative identifier.
|
||||
|
||||
The mapping covers the engine-error range documented for MXAccess (16-50, 56-61, 541-542, 8017). Adding entries here is the supported way to enrich wire-level diagnostics without changing the proto schema.
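
A lookup of this kind can be sketched as a static dictionary. `MxStatusDetailTextSketch` is a hypothetical name, and the table holds only the three entries quoted above; the remaining codes in the documented ranges are intentionally not filled in.

```csharp
using System.Collections.Generic;

public static class MxStatusDetailTextSketch
{
    // Only the entries quoted in this document; the real table is larger.
    private static readonly Dictionary<int, string> KnownDetails = new Dictionary<int, string>
    {
        [28] = "Index out of range",
        [42] = "Unable to convert string",
        [8017] = "Object must be offscan to modify attributes that have an MxSecurityConfigure security classification",
    };

    // Unknown codes return an empty string; the numeric detail stays authoritative.
    public static string Lookup(int detail)
    {
        return KnownDetails.TryGetValue(detail, out var text) ? text : string.Empty;
    }
}
```

Returning `string.Empty` instead of throwing keeps an unrecognized code from turning a successful status projection into a conversion failure.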
|
||||
|
||||
## MxStatusConversionException
|
||||
|
||||
`MxStatusConversionException` is the worker-internal signal that a status struct could not be projected at all, for example because the interop type is missing one of the required fields or because a field value is null. `MxStatusProxyConverter.ReadInt32Field` raises it with a message that names both the offending CLR type and the missing field.
|
||||
|
||||
```csharp
|
||||
private static int ReadInt32Field(
|
||||
object value,
|
||||
Type valueType,
|
||||
string fieldName)
|
||||
{
|
||||
FieldInfo? field = valueType.GetField(fieldName, BindingFlags.Instance | BindingFlags.Public);
|
||||
if (field is null)
|
||||
{
|
||||
throw new MxStatusConversionException(
|
||||
$"Status object type '{valueType.FullName}' does not expose required field '{fieldName}'.");
|
||||
}
|
||||
|
||||
object? fieldValue = field.GetValue(value);
|
||||
if (fieldValue is null)
|
||||
{
|
||||
throw new MxStatusConversionException(
|
||||
$"Status object field '{fieldName}' on type '{valueType.FullName}' is null.");
|
||||
}
|
||||
|
||||
return System.Convert.ToInt32(fieldValue, CultureInfo.InvariantCulture);
|
||||
}
|
||||
```
|
||||
|
||||
The exception is distinct from `COMException` so the worker's command pipeline can route it through a dedicated handler. Callers typically translate it into a `ProtocolStatus` with `ProtocolStatusCode.MxaccessFailure` and a synthetic HRESULT, then attach the exception message as a `RawDiagnostic` rather than letting the failure surface as a generic worker error.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [MXAccess Worker Instance Design](./mxaccess-worker-instance-design.md)
|
||||
- [Contracts](./Contracts.md)
|
||||
- [Worker STA Threading](./WorkerSta.md)
|
||||
@@ -0,0 +1,157 @@
|
||||
# Worker STA Runtime
|
||||
|
||||
The worker STA runtime owns the dedicated single-threaded apartment thread that hosts the MXAccess COM object, runs a Windows message pump for COM event delivery, and serializes all gateway commands onto that thread.
|
||||
|
||||
## Why an STA Is Required
|
||||
|
||||
The installed MXAccess interop assembly declares an `Apartment` threading model (see `AGENTS.md` under "Worker Rules"). COM objects with that model must be created and called on a thread initialized as a single-threaded apartment, and any callbacks the object raises (event sink calls) are delivered through the thread's Windows message queue. A plain blocking queue is not sufficient: the STA loop must pump Windows messages so that the COM marshaler can deliver event invocations on the same thread that holds the object. Because of that constraint, every MXAccess operation in the worker is funneled through the types in `src/MxGateway.Worker/Sta/`.
|
||||
|
||||
## Types
|
||||
|
||||
| Type | Role |
|
||||
|------|------|
|
||||
| `StaRuntime` | Owns the STA `Thread`, the command queue, and the lifecycle gates. |
|
||||
| `IStaComApartmentInitializer` / `StaComApartmentInitializer` | Calls `CoInitializeEx` / `CoUninitialize` so the thread enters and leaves the single-threaded apartment. |
|
||||
| `StaMessagePump` | Wraps `MsgWaitForMultipleObjectsEx`, `PeekMessage`, `TranslateMessage`, and `DispatchMessage` so the STA loop can wait on work and drain the Windows message queue. |
|
||||
| `IStaWorkItem` / `StaWorkItem<T>` | Internal queue entries that capture a delegate, a `CancellationToken`, and a `TaskCompletionSource<T>` for the caller. |
|
||||
| `StaCommand` | Carries an `MxCommand` together with `SessionId`, `CorrelationId`, `EnqueueTimestamp`, and a `CancellationToken`. |
|
||||
| `IStaCommandExecutor` | The boundary between the dispatcher and the MXAccess interop layer; returns `MxCommandReply`. |
|
||||
| `StaCommandDispatcher` | Bounded asynchronous queue in front of `StaRuntime` that converts `StaCommand` into `MxCommandReply` and applies status normalization. |
|
||||
|
||||
## STA Thread Initialization
|
||||
|
||||
`StaRuntime`'s constructor configures a background `Thread` named `MxGateway.Worker.STA` and forces it into `ApartmentState.STA` before the thread starts. `Start()` releases the thread and then blocks on `startedEvent` so callers observe a fully-initialized apartment (or a captured `startupException`) before the first `InvokeAsync` call:
|
||||
|
||||
```csharp
staThread = new Thread(ThreadMain)
{
    IsBackground = true,
    Name = "MxGateway.Worker.STA"
};
staThread.SetApartmentState(ApartmentState.STA);
```
|
||||
|
||||
`StaComApartmentInitializer.Initialize` calls `CoInitializeEx` with `COINIT_APARTMENTTHREADED` (`0x2`) and treats both `S_OK` and `S_FALSE` as success because `S_FALSE` indicates the apartment was already initialized on this thread. Any other HRESULT throws `COMException` so `ThreadMain` records the failure and signals `startedEvent`, which causes `Start()` to surface the exception.
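The HRESULT handling described above can be sketched roughly as follows. This is an illustrative sketch rather than the project's source: the class name `ApartmentInitializerSketch` and the helper `IsSuccess` are hypothetical, and only the `CoInitializeEx` / `CoUninitialize` exports of `ole32.dll` and the HRESULT constants are standard Windows definitions.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical sketch; the real StaComApartmentInitializer may differ.
public sealed class ApartmentInitializerSketch
{
    private const uint CoinitApartmentThreaded = 0x2; // COINIT_APARTMENTTHREADED
    private const int SOk = 0;                        // S_OK
    private const int SFalse = 1;                     // S_FALSE: COM already initialized on this thread

    [DllImport("ole32.dll")]
    private static extern int CoInitializeEx(IntPtr reserved, uint coInit);

    [DllImport("ole32.dll")]
    private static extern void CoUninitialize();

    // Both S_OK and S_FALSE count as success; anything else is a failure.
    public static bool IsSuccess(int hresult) => hresult == SOk || hresult == SFalse;

    // Must run on the thread that will own the COM objects (Windows only).
    public void Initialize()
    {
        int hresult = CoInitializeEx(IntPtr.Zero, CoinitApartmentThreaded);
        if (!IsSuccess(hresult))
        {
            throw new COMException("CoInitializeEx failed.", hresult);
        }
    }

    // Must run on the same thread that called Initialize.
    public void Uninitialize() => CoUninitialize();
}
```

A failure such as `RPC_E_CHANGED_MODE` (`0x80010106`), which COM returns when the thread was already initialized with a different concurrency model, would take the exception path in this sketch.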
|
||||
|
||||
## The STA Loop
|
||||
|
||||
`ThreadMain` runs the canonical "wait, pump, dispatch" loop. It enters COM, drains queued work, blocks until either a command arrives or a Windows message is posted, then pumps the message queue and records activity for the heartbeat:
|
||||
|
||||
```csharp
StaThreadId = Thread.CurrentThread.ManagedThreadId;
comApartmentInitializer.Initialize();
comInitialized = true;
MarkActivity();
startedEvent.Set();

while (!IsShutdownRequested())
{
    ProcessQueuedCommands();
    messagePump.WaitForWorkOrMessages(commandWakeEvent, idlePumpInterval);
    messagePump.PumpPendingMessages();
    MarkActivity();
}
```
|
||||
|
||||
`commandWakeEvent` is an `AutoResetEvent` set by `InvokeAsync` whenever a new work item is enqueued. The `idlePumpInterval` defaults to 50 ms so the pump still services Windows messages even when no commands are queued; this matters because COM event sink calls arrive as posted messages and would otherwise sit in the queue until the next command. `LastActivityUtc` is updated through `Volatile.Write` on every iteration so the worker's heartbeat path can read it without holding `gate`.
|
||||
|
||||
### The message pump
|
||||
|
||||
`StaMessagePump.WaitForWorkOrMessages` calls `MsgWaitForMultipleObjectsEx` with `QS_ALLINPUT` and `MWMO_INPUTAVAILABLE`, so the wait wakes for either a signal on `commandWakeEvent` or any new message in the thread's Windows queue. `PumpPendingMessages` then drains the queue with `PM_REMOVE`:
|
||||
|
||||
```csharp
public int PumpPendingMessages()
{
    int pumpedMessages = 0;

    while (PeekMessage(out NativeMessage message, IntPtr.Zero, 0, 0, PmRemove))
    {
        TranslateMessage(ref message);
        DispatchMessage(ref message);
        pumpedMessages++;
    }

    return pumpedMessages;
}
```
|
||||
|
||||
`DispatchMessage` is what causes the COM marshaler to deliver event sink calls onto the STA thread; without this loop, MXAccess events never surface to managed code.
|
||||
|
||||
## Work Items and the Command Queue
|
||||
|
||||
`StaRuntime` exposes two `InvokeAsync` overloads. Both wrap the delegate in an internal `StaWorkItem<T>`, enqueue it on a `ConcurrentQueue<IStaWorkItem>`, and signal `commandWakeEvent`:
|
||||
|
||||
```csharp
StaWorkItem<T> workItem = new(command, cancellationToken);

lock (gate)
{
    if (shutdownRequested)
    {
        return Task.FromException<T>(
            new InvalidOperationException("The worker STA runtime is shutting down."));
    }

    commandQueue.Enqueue(workItem);
}

commandWakeEvent.Set();
return workItem.Task;
```
|
||||
|
||||
`StaWorkItem<T>` uses an `Interlocked.CompareExchange` on `started` so that exactly one of three outcomes happens: the STA thread executes the delegate, an external `CancellationToken` cancellation fires first, or `CancelBeforeExecution` runs during shutdown. `Execute` runs on the STA thread, sets `Completion` from `command()`, and propagates exceptions through `TrySetException` so the awaiting caller observes them.
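The claim-once behavior can be illustrated with a minimal sketch. It is not the project's `StaWorkItem<T>`: the type name `WorkItemSketch<T>`, the `TryClaim` helper, and the use of `TaskCreationOptions.RunContinuationsAsynchronously` are assumptions made for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a claim-once work item; not the project's StaWorkItem<T>.
public sealed class WorkItemSketch<T>
{
    private readonly Func<T> command;

    // Assumption: continuations run asynchronously so awaiting callers
    // never resume on the thread that completes the item.
    private readonly TaskCompletionSource<T> completion =
        new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);

    private int started; // 0 = unclaimed, 1 = claimed by exactly one path

    public WorkItemSketch(Func<T> command, CancellationToken cancellationToken = default)
    {
        this.command = command;

        // External cancellation only wins if nothing has claimed the item yet.
        cancellationToken.Register(() =>
        {
            if (TryClaim())
            {
                completion.TrySetCanceled(cancellationToken);
            }
        });
    }

    public Task<T> Completion => completion.Task;

    // Returns true for the first caller only.
    private bool TryClaim() => Interlocked.CompareExchange(ref started, 1, 0) == 0;

    // Intended to be called on the STA thread.
    public void Execute()
    {
        if (!TryClaim())
        {
            return;
        }

        try
        {
            completion.TrySetResult(command());
        }
        catch (Exception exception)
        {
            completion.TrySetException(exception);
        }
    }

    // Intended to be called during shutdown for items that never ran.
    public void CancelBeforeExecution()
    {
        if (TryClaim())
        {
            completion.TrySetCanceled();
        }
    }
}
```

Whichever path calls `TryClaim` first decides the outcome; the other two become no-ops, so the delegate never runs after the caller has already observed a cancellation.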
|
||||
|
||||
`ProcessQueuedCommands` is the only consumer of the queue and runs on the STA thread, so each work item is guaranteed to execute on the apartment that owns the COM object.
|
||||
|
||||
## Command Dispatch
|
||||
|
||||
`StaCommandDispatcher` sits between the worker's IPC layer and `StaRuntime`. It converts an `StaCommand` into an `MxCommandReply` and enforces a bounded queue: when `commandQueue.Count` reaches `maxPendingCommands` (default `DefaultMaxPendingCommands = 128`) the dispatcher returns a synthetic `WorkerUnavailable` reply rather than queueing further work. This back-pressure keeps the STA from accumulating an unbounded backlog while it is busy with a long-running call.
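The admission check can be sketched in isolation. The names `BoundedAdmissionSketch`, `AdmissionResult`, `TryEnqueue`, and `TryDequeue` are hypothetical, and the queue holds plain correlation IDs instead of real commands; only the limit of 128 and the `WorkerUnavailable` outcome come from the description above.

```csharp
using System.Collections.Generic;

// Hypothetical outcome type standing in for the synthetic reply status.
public enum AdmissionResult
{
    Accepted,
    WorkerUnavailable
}

// Hypothetical sketch of bounded admission; not the project's StaCommandDispatcher.
public sealed class BoundedAdmissionSketch
{
    public const int DefaultMaxPendingCommands = 128;

    private readonly object gate = new object();
    private readonly Queue<string> pending = new Queue<string>();
    private readonly int maxPendingCommands;

    public BoundedAdmissionSketch(int maxPendingCommands = DefaultMaxPendingCommands)
    {
        this.maxPendingCommands = maxPendingCommands;
    }

    // Rejects immediately when the backlog is full instead of waiting for space.
    public AdmissionResult TryEnqueue(string correlationId)
    {
        lock (gate)
        {
            if (pending.Count >= maxPendingCommands)
            {
                return AdmissionResult.WorkerUnavailable;
            }

            pending.Enqueue(correlationId);
            return AdmissionResult.Accepted;
        }
    }

    // Called by the single drain task; frees one slot per dequeued command.
    public bool TryDequeue(out string correlationId)
    {
        lock (gate)
        {
            if (pending.Count == 0)
            {
                correlationId = string.Empty;
                return false;
            }

            correlationId = pending.Dequeue();
            return true;
        }
    }
}
```

Rejecting at the boundary rather than blocking the caller keeps the IPC layer responsive: the caller gets a structured answer right away and can decide whether to retry.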
|
||||
|
||||
A single drain task pulls from `commandQueue` and submits each command to the STA via `staRuntime.InvokeAsync`:
|
||||
|
||||
```csharp
SetCurrentCommand(command.CorrelationId);
try
{
    MxCommandReply reply = await staRuntime
        .InvokeAsync(() => commandExecutor.Execute(command))
        .ConfigureAwait(false);

    queuedCommand.Complete(NormalizeReply(command, reply));
}
catch (Exception exception)
{
    queuedCommand.Complete(CreateExceptionReply(command, exception));
}
finally
{
    SetCurrentCommand(string.Empty);
}
```
|
||||
|
||||
`SetCurrentCommand` records the in-flight `CorrelationId` so `PopulateHeartbeat` can publish both `PendingCommandCount` and `CurrentCommandCorrelationId` on the worker heartbeat. Exceptions are converted through `HResultConverter` so the IPC reply still carries a structured `ProtocolStatus`, an HRESULT, and a diagnostic message instead of an unhandled fault. `NormalizeReply` backfills `SessionId`, `CorrelationId`, `Kind`, and a default `ProtocolStatusCode.Ok` so executors can return minimal replies without restating the envelope.
|
||||
|
||||
`CancelQueuedCommand` walks the queue and completes a single matching entry with `ProtocolStatusCode.Canceled`. It cannot abort a command already running on the STA: per `AGENTS.md`, "Canceling a gRPC call should stop waiting in the gateway, but it cannot safely abort an in-flight COM call on the STA. Hard cancellation means killing the worker process."
|
||||
|
||||
## Why the STA Loop Cannot Block on I/O
|
||||
|
||||
`AGENTS.md` states explicitly: "Do not block the STA on pipe writes, gRPC calls, or slow consumers. Event handlers should convert event args, enqueue outbound events, and return to pumping messages." The STA thread is the only thread that can service COM event callbacks, so any work that blocks it stalls every event the MXAccess object would otherwise deliver. The runtime keeps to that rule by giving the STA only two responsibilities inside `ThreadMain`: executing already-queued work items and pumping messages. Outbound event delivery and pipe writes happen on threads that observe the queues populated from the STA, never on the STA itself.
|
||||
|
||||
## Shutdown Sequence
|
||||
|
||||
`StaRuntime.Shutdown(TimeSpan timeout)` performs an ordered shutdown:
|
||||
|
||||
1. Sets `shutdownRequested` under `gate` so `InvokeAsync` rejects new work with `InvalidOperationException`.
2. Signals `commandWakeEvent` to break the STA out of `WaitForWorkOrMessages`.
3. Waits up to `timeout` on `stoppedEvent`, which the STA sets after it leaves `ThreadMain`.
4. Once the thread has stopped, drains the queue through `CancelQueuedCommands`, which calls `CancelBeforeExecution` on every remaining work item so awaiting callers observe `OperationCanceledException` instead of hanging.
|
||||
|
||||
`ThreadMain`'s `finally` block guarantees that `comApartmentInitializer.Uninitialize` runs (when COM was successfully initialized) before `stoppedEvent.Set`, so the apartment is always torn down on the same thread that initialized it. `Dispose` calls `Shutdown` with a five-second budget and only disposes the wait handles when shutdown actually completed, which prevents a still-running STA thread from touching disposed handles.
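The ordering of the shutdown steps can be sketched without COM or a message pump. Everything in this example is a simplified stand-in: `ShutdownSketch` is a hypothetical type, a plain timed `WaitOne` replaces the message-pumping wait, and the queue stores a pair of delegates instead of real work items.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, COM-free sketch of the shutdown ordering; not the project's StaRuntime.
public sealed class ShutdownSketch
{
    private readonly object gate = new object();
    private readonly ConcurrentQueue<(Action Run, Action Cancel)> queue =
        new ConcurrentQueue<(Action Run, Action Cancel)>();
    private readonly AutoResetEvent wake = new AutoResetEvent(false);
    private readonly ManualResetEventSlim stopped = new ManualResetEventSlim(false);
    private bool shutdownRequested;

    public ShutdownSketch()
    {
        var thread = new Thread(ThreadMain) { IsBackground = true };
        thread.Start();
    }

    public Task<T> InvokeAsync<T>(Func<T> command)
    {
        var completion = new TaskCompletionSource<T>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        Action run = () =>
        {
            try { completion.TrySetResult(command()); }
            catch (Exception exception) { completion.TrySetException(exception); }
        };
        Action cancel = () => completion.TrySetCanceled();

        lock (gate)
        {
            if (shutdownRequested)
            {
                return Task.FromException<T>(
                    new InvalidOperationException("The runtime is shutting down."));
            }

            queue.Enqueue((run, cancel));
        }

        wake.Set();
        return completion.Task;
    }

    public bool Shutdown(TimeSpan timeout)
    {
        lock (gate) { shutdownRequested = true; }      // 1. reject new work
        wake.Set();                                    // 2. break the loop out of its wait
        if (!stopped.Wait(timeout)) { return false; }  // 3. wait for the thread to stop

        // 4. Cancel whatever never ran so awaiting callers do not hang.
        while (queue.TryDequeue(out var item)) { item.Cancel(); }
        return true;
    }

    private bool IsShutdownRequested()
    {
        lock (gate) { return shutdownRequested; }
    }

    private void ThreadMain()
    {
        try
        {
            while (!IsShutdownRequested())
            {
                while (queue.TryDequeue(out var item)) { item.Run(); }
                wake.WaitOne(TimeSpan.FromMilliseconds(50));
            }
        }
        finally
        {
            // The runtime described above uninitializes COM here, before signaling.
            stopped.Set();
        }
    }
}
```

Returning `false` on timeout mirrors the point made about `Dispose`: if the thread has not stopped, the caller should not dispose handles that the thread may still touch.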
|
||||
|
||||
`StaCommandDispatcher.RequestShutdown` mirrors the same intent at the dispatcher layer: it sets `shutdownRequested`, drains its own queue, and completes every queued command with `ProtocolStatusCode.WorkerUnavailable` so callers receive a structured rejection during teardown.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [MXAccess worker instance design](./mxaccess-worker-instance-design.md)
|
||||
- [Worker conversion](./WorkerConversion.md)
|
||||
- [Worker bootstrap](./WorkerBootstrap.md)
|
||||
@@ -23,11 +23,15 @@ The source files listed by the manifest are:
|
||||
|
||||
- `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto`
- `src/MxGateway.Contracts/Protos/mxaccess_worker.proto`
- `src/MxGateway.Contracts/Protos/galaxy_repository.proto`
|
||||
|
||||
`mxaccess_gateway.proto` defines the public gRPC service and shared DTOs.
`mxaccess_worker.proto` is included in the descriptor because worker-aware
tests and fake-worker clients need the same command, reply, event, value, and
status shapes. `galaxy_repository.proto` defines the read-only Galaxy
Repository browse service used by clients to enumerate the deployed object
hierarchy and dynamic attributes; see
[Galaxy Repository Browse](./GalaxyRepository.md).
|
||||
|
||||
## Protocol Version
|
||||
|
||||
|
||||
@@ -48,10 +48,20 @@ Endpoint layout:
|
||||
/dashboard/sessions/{sessionId}
/dashboard/workers
/dashboard/events
/dashboard/galaxy
/dashboard/settings
/dashboard/_blazor
```
|
||||
|
||||
The `/dashboard/galaxy` page surfaces the Galaxy Repository browse summary
(deployed object hierarchy size, last deploy timestamp, attribute totals,
template usage, and connectivity sync info). The summary is fed by
`GalaxySummaryCache`, which is refreshed off the request path by
`GalaxySummaryRefreshService` on the
`MxGateway:Galaxy:DashboardRefreshIntervalSeconds` cadence so the dashboard
never blocks on SQL. See [Galaxy Repository Browse](./GalaxyRepository.md) for
the underlying gRPC service.
|
||||
|
||||
The app should redirect `/` to `/dashboard` only if the deployment wants the
dashboard as the default web page. Otherwise leave gRPC/API hosting unaffected.
|
||||
|
||||
|
||||