Align docs with StyleGuide and add CLAUDE.md

- Rename 16 kebab-case docs to PascalCase per StyleGuide
- Move per-language client design docs from docs/ to clients/<lang>/ alongside their READMEs
- Add ## Related Documentation sections to 15 docs that lacked one
- Fix sentence-case violations in H3 headings (StyleGuide rule)
- Update cross-references in gateway.md, client READMEs, scripts, and generate-proto.ps1 helpers to follow the new paths
- Add CLAUDE.md with build/test commands, the source-update verification matrix, the parity-first contract, and pointers to MXAccess and Galaxy Repository analysis sources

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Gateway Process Detailed Design

## Purpose

The gateway process is the only public network-facing component. It exposes the
modern API, owns session lifecycle, launches and supervises MXAccess worker
processes, and moves commands and events between clients and the worker that
owns each session.

The gateway must not instantiate MXAccess COM, import MXAccess interop types, or
depend on an STA message pump. The installed MXAccess COM component is isolated
behind the worker process boundary.

## Runtime

- Target runtime: .NET 10.
- Language: C#.
- Preferred gateway process architecture: x64 (the MXAccess worker runs as a
  separate x86 process).
- Hosting: ASP.NET Core gRPC.
- Web UI: Blazor Server dashboard with Bootstrap CSS/JS.
- Operating system: Windows.
- Public transport: TCP gRPC.
- Internal worker transport: named pipes with protobuf-framed messages.

Style guides:

- [C# Style Guide](./style-guides/CSharpStyleGuide.md)
- [Protobuf Style Guide](./style-guides/ProtobufStyleGuide.md)

## Responsibilities

The gateway owns:

- public gRPC service endpoints,
- Blazor Server dashboard endpoints,
- optional authentication and authorization,
- session id allocation,
- worker executable selection,
- named-pipe server creation,
- worker process launch,
- gateway/worker handshake,
- command correlation and timeout handling,
- event fan-out to client streams,
- session lease and heartbeat enforcement,
- worker crash and hang detection,
- metrics and structured logging,
- graceful service shutdown.

The gateway does not own:

- MXAccess COM object creation,
- MXAccess method dispatch,
- MXAccess event subscription,
- MXAccess handle generation,
- COM value conversion from native `VARIANT` values.

Those belong to the worker.

## High-Level Components

```text
MxGateway.Server
  Program / Host
  Configuration
  Grpc
    MxAccessGatewayService
    MxAccessGrpcRequestValidator
    MxAccessGrpcMapper
  Dashboard
    Pages
    Components
    DashboardSnapshotService
    DashboardAuthorization
  Sessions
    SessionManager
    GatewaySession
    SessionRegistry
    SessionLeaseMonitor
  Workers
    WorkerProcessLauncher
    WorkerClient
    WorkerPipeTransport
    WorkerProtocolReader
    WorkerProtocolWriter
    WorkerWatchdog
  Security
    ClientIdentityResolver
    CommandAuthorization
  Metrics
    GatewayMetrics
  Diagnostics
    HealthChecks
```

## Public gRPC Surface

Start with unary commands plus an event stream:

```protobuf
service MxAccessGateway {
  rpc OpenSession(OpenSessionRequest) returns (OpenSessionReply);
  rpc CloseSession(CloseSessionRequest) returns (CloseSessionReply);
  rpc Invoke(MxCommandRequest) returns (MxCommandReply);
  rpc StreamEvents(StreamEventsRequest) returns (stream MxEvent);
}
```

`MxAccessGatewayService` implements these public RPCs in the gateway process.
It validates public requests with `MxAccessGrpcRequestValidator`, delegates
session lifecycle and command routing to `ISessionManager`, and maps worker
command replies and events through `MxAccessGrpcMapper`. Session lookup,
validation, and worker transport failures become gRPC status errors. Replies
from MXAccess method calls that reached the worker remain `MxCommandReply`
payloads so HRESULT values, status arrays, and method-specific reply fields
survive transport boundaries.

Add a bidirectional session stream later, and only after the command and event
model is stable:

```protobuf
rpc Session(stream ClientMessage) returns (stream ServerMessage);
```

### OpenSession

`OpenSession` creates one gateway session and one worker process by default.

Inputs should include:

- requested backend, defaulting to `mxaccess-worker`,
- optional client session name,
- optional client correlation id,
- optional timeout policy,
- optional event backpressure policy,
- optional metadata discovery options.

Outputs should include:

- session id,
- backend name,
- worker process id when available,
- protocol version,
- server capabilities,
- default timeout values.

Behavior:

1. Resolve and authorize the client identity.
2. Allocate a session id.
3. Build a pipe name and random handshake nonce.
4. Create a named-pipe server with restrictive local ACLs.
5. Launch the worker executable with session bootstrap data.
6. Accept the pipe connection within the startup timeout.
7. Exchange `GatewayHello` and `WorkerHello`.
8. Wait for `WorkerReady`.
9. Register the session as ready.
10. Return the session details.

If any step fails, clean up all resources. Kill the worker if it was launched
and did not shut down on its own.

### CloseSession

`CloseSession` attempts graceful shutdown and then enforces a kill timeout.

Behavior:

1. Mark the session as `Closing`.
2. Stop accepting new commands.
3. Send a terminal session-close notification to event streams.
4. Send `WorkerShutdown` if the pipe is still connected.
5. Wait for worker exit up to the configured timeout.
6. Kill the worker process if it remains alive.
7. Remove the session from the registry.

`CloseSession` should be idempotent. Closing an already closed session should
return a successful close result with the final known state.

`WorkerClient.ShutdownAsync` sends `WorkerShutdown`, waits for the worker read,
write, and heartbeat loops to stop, and waits for the launched worker process to
exit, all within the same shutdown timeout. If the pipe loops do not stop or the
process does not exit within that timeout, the close operation fails with
`ShutdownTimeout`; `GatewaySession` then kills the worker process tree before
surfacing the close failure.

### Invoke

`Invoke` forwards one MXAccess command to the worker that owns the session.

Behavior:

1. Validate the session id.
2. Check that the session state is `Ready`.
3. Validate the method-specific payload.
4. Authorize the command, especially writes and credential-bearing commands.
5. Assign a gateway correlation id.
6. Write `WorkerCommand` to the worker pipe.
7. Await the correlated `WorkerCommandReply`.
8. Map the worker reply to the public `MxCommandReply`.

Request cancellation stops waiting in the gateway. It does not abort an
in-flight COM call. If the command must be hard-canceled, kill the worker and
fault the session.

### StreamEvents

`StreamEvents` streams events for one session.

The initial implementation allows one active stream subscriber per session. A
second subscriber should be rejected with a clear session error. If multiple
subscribers are later supported, they must have independent backpressure
accounting and a clear fan-out policy.

Behavior:

1. Validate the session id and authorize event access.
2. Attach the single active subscriber lease for the session.
3. Read worker events into a bounded public stream queue.
4. Send events in worker sequence order.
5. Stop on client cancellation, session close, or session fault.
6. Emit a terminal status when the session faults if gRPC status alone cannot
   preserve the required details.

`EventStreamService` owns subscriber tracking and public stream backpressure.
The default policy allows one active subscriber per session. A second subscriber
is rejected with `EventSubscriberAlreadyActive`. Stream cancellation releases
the subscriber lease so a later stream can attach to the session.
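
One way to enforce the single-subscriber lease is an atomic compare-and-swap on a
per-session flag. The sketch below is a minimal, hypothetical illustration
(`SubscriberLeaseGate` is not the actual `EventStreamService` API); disposing the
returned lease is what lets a later stream attach.

```csharp
using System;
using System.Threading;

// Hypothetical per-session gate (not the real EventStreamService API):
// at most one event-stream subscriber may hold the lease at a time.
public sealed class SubscriberLeaseGate
{
    // 0 = free, 1 = held by an active subscriber.
    private int _held;

    public bool IsHeld => Volatile.Read(ref _held) == 1;

    // Returns a lease, or null when another subscriber is already attached;
    // the caller would map null to EventSubscriberAlreadyActive.
    public IDisposable? TryAttach()
    {
        if (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
        {
            return null;
        }

        return new Lease(this);
    }

    private sealed class Lease : IDisposable
    {
        private SubscriberLeaseGate? _gate;

        public Lease(SubscriberLeaseGate gate) => _gate = gate;

        // Only the first Dispose releases the slot, so a double release
        // cannot free a lease that a later subscriber now holds.
        public void Dispose()
        {
            SubscriberLeaseGate? gate = Interlocked.Exchange(ref _gate, null);
            if (gate is not null)
            {
                Volatile.Write(ref gate._held, 0);
            }
        }
    }
}
```

A stream handler would hold the lease in a `using` block so that client
cancellation releases it automatically.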

The gateway must not reorder events from one worker. `EventStreamService` writes
mapped events to a bounded first-in, first-out queue and faults the session with
`EventQueueOverflow` if the queue fills. The gateway does not synthesize
`OperationComplete`; it forwards that family only when the worker reports a
native MXAccess `OperationComplete` event.

## Web Dashboard

The gateway hosts a basic Blazor Server dashboard for operators and developers.
The dashboard is read-only for v1 and should show current gateway/session/worker
state plus basic metrics.

Technology:

- Blazor Server,
- Bootstrap CSS,
- Bootstrap JavaScript,
- no MudBlazor,
- no other Blazor client component libraries.

Suggested routes:

```text
/dashboard
/dashboard/sessions
/dashboard/sessions/{sessionId}
/dashboard/workers
/dashboard/events
/dashboard/settings
```

Dashboard pages:

- home: gateway status, uptime, session count, worker count, command rate,
  event rate, queue depth, recent faults,
- sessions: active/recent session table,
- session details: one session's worker, heartbeat, counters, queues, and fault
  summary,
- workers: worker process table and heartbeat details,
- events: aggregate event counters and rates,
- settings: read-only effective configuration with secrets redacted.

Realtime updates should use Blazor Server component updates from a read-only
snapshot service. Components should subscribe to snapshots and call
`StateHasChanged` through `InvokeAsync`. Do not stream every MXAccess event to
the dashboard; aggregate event rates and counters instead.

Suggested service shape:

```csharp
public interface IDashboardSnapshotService
{
    DashboardSnapshot GetSnapshot();
    IAsyncEnumerable<DashboardSnapshot> WatchSnapshotsAsync(
        CancellationToken cancellationToken);
}
```

Default refresh policy:

- immediate update on session create, close, or fault,
- immediate update on worker fault,
- periodic metrics refresh every 1 second,
- event-rate windows updated every 1 second.

Dashboard access should require API-key-backed authentication with `admin` scope
when enabled. A simple `/dashboard/login` form can validate an API key and issue
an HTTP-only secure cookie for dashboard pages. Do not put API keys in query
strings. Anonymous localhost access may exist only behind an explicit
configuration option that defaults to false.

## Session State Machine

```text
Creating
  -> StartingWorker
  -> WaitingForPipe
  -> Handshaking
  -> InitializingWorker
  -> Ready
  -> Closing
  -> Closed

Any non-terminal state
  -> Faulted

Faulted
  -> Closed
```

### State rules

- `Creating`: session id and in-memory state exist, but no worker has launched.
- `StartingWorker`: worker process launch is in progress.
- `WaitingForPipe`: gateway is waiting for the worker to connect to the pipe.
- `Handshaking`: pipe is connected and protocol hello is being verified.
- `InitializingWorker`: worker is connected but has not reported MXAccess ready.
- `Ready`: commands and event streams may run.
- `Closing`: graceful shutdown is in progress.
- `Closed`: resources are released.
- `Faulted`: a non-graceful terminal fault occurred and must be reported to
  callers before resources are released.

Only `Ready` sessions accept new commands.
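
The transitions above can be checked with a small table. This is a hypothetical
sketch (`SessionState` and `SessionTransitions` are illustrative names). It
treats `Closed` as the only fully terminal state, lets any other state except
`Faulted` move to `Faulted`, and allows `Faulted` only to close.

```csharp
using System.Collections.Generic;

public enum SessionState
{
    Creating,
    StartingWorker,
    WaitingForPipe,
    Handshaking,
    InitializingWorker,
    Ready,
    Closing,
    Closed,
    Faulted,
}

public static class SessionTransitions
{
    // The single forward path from the diagram.
    private static readonly Dictionary<SessionState, SessionState> ForwardPath = new()
    {
        [SessionState.Creating] = SessionState.StartingWorker,
        [SessionState.StartingWorker] = SessionState.WaitingForPipe,
        [SessionState.WaitingForPipe] = SessionState.Handshaking,
        [SessionState.Handshaking] = SessionState.InitializingWorker,
        [SessionState.InitializingWorker] = SessionState.Ready,
        [SessionState.Ready] = SessionState.Closing,
        [SessionState.Closing] = SessionState.Closed,
    };

    public static bool IsAllowed(SessionState from, SessionState to)
    {
        // Closed is terminal: nothing leaves it.
        if (from == SessionState.Closed)
        {
            return false;
        }

        // Any non-terminal state may fault; Faulted does not re-enter itself.
        if (to == SessionState.Faulted)
        {
            return from != SessionState.Faulted;
        }

        // A faulted session can only finish by closing.
        if (from == SessionState.Faulted)
        {
            return to == SessionState.Closed;
        }

        return ForwardPath.TryGetValue(from, out SessionState next) && next == to;
    }

    // Only Ready sessions accept new commands.
    public static bool AcceptsCommands(SessionState state) => state == SessionState.Ready;
}
```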

## Session Model

Gateway session state should include:

- session id,
- client identity,
- backend name,
- worker process id,
- worker executable path and version,
- pipe name,
- pipe connection state,
- open time,
- last client activity time,
- last worker heartbeat time,
- lease expiration,
- command timeout policy,
- startup timeout policy,
- shutdown timeout policy,
- event queue metrics,
- active event stream count,
- final fault if any.

The worker remains authoritative for MXAccess handles. The gateway may keep a
shadow state for diagnostics, but it must not invent, rewrite, or recycle
MXAccess handles.

`SessionManager` owns the current in-memory session registry. It allocates a
session id, creates the worker pipe name and nonce, registers the session before
worker startup, and removes the session if startup fails. A successful
`OpenSession` attaches the ready `IWorkerClient` and transitions the session to
`Ready`.

Only `Ready` sessions accept command and event operations. `CloseSession` shuts
down the worker, disposes the worker client, and removes the session from the
registry so closed sessions do not retain pipe or process handles. A later close
for the same id returns `SessionNotFound`. Lease handling is exposed as a
session hook so a monitor can close expired sessions without embedding lease
policy in the worker client. Gateway shutdown walks the registry, closes each
known session, and kills a worker if graceful shutdown fails.

## Worker Launch

The gateway should launch the worker using explicit configuration:

- worker executable path,
- worker working directory,
- worker architecture requirement,
- protocol version,
- startup timeout,
- environment variables,
- optional restricted user identity.

Command-line arguments should include only non-secret bootstrap values:

```text
--session-id <sessionId>
--pipe-name <pipeName>
--protocol-version <version>
```

Prefer passing the handshake nonce through the inherited environment or another
protected local mechanism instead of on the command line when possible.
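
A minimal sketch of building the worker start info with .NET's
`ProcessStartInfo`, assuming string-valued bootstrap arguments.
`WorkerStartInfoSketch` is a hypothetical helper; `MXGATEWAY_WORKER_NONCE` is the
variable this design uses for the nonce. `ArgumentList` avoids hand-built
quoting, and the nonce goes into the child's environment block instead of argv.

```csharp
using System.Diagnostics;

// Hypothetical helper: builds, but does not start, the worker ProcessStartInfo.
public static class WorkerStartInfoSketch
{
    // Environment variable this design uses for the handshake nonce.
    public const string NonceVariable = "MXGATEWAY_WORKER_NONCE";

    public static ProcessStartInfo Build(
        string executablePath,
        string workingDirectory,
        string sessionId,
        string pipeName,
        string protocolVersion,
        string nonce)
    {
        var startInfo = new ProcessStartInfo(executablePath)
        {
            WorkingDirectory = workingDirectory,
            // Environment overrides require UseShellExecute = false.
            UseShellExecute = false,
            CreateNoWindow = true,
        };

        // ArgumentList quotes each item; only non-secret bootstrap values go here.
        startInfo.ArgumentList.Add("--session-id");
        startInfo.ArgumentList.Add(sessionId);
        startInfo.ArgumentList.Add("--pipe-name");
        startInfo.ArgumentList.Add(pipeName);
        startInfo.ArgumentList.Add("--protocol-version");
        startInfo.ArgumentList.Add(protocolVersion);

        // The nonce reaches the child through its environment block, not argv,
        // so a logged command line never contains it.
        startInfo.Environment[NonceVariable] = nonce;
        return startInfo;
    }
}
```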

Before launch, validate that:

- the worker executable exists,
- the worker path is under the configured install directory,
- the worker file version or product version is acceptable,
- the worker image targets the expected x86 architecture.

`WorkerProcessLauncher` currently implements the first validation layer: it
resolves the worker executable path, requires a `.exe`, validates the Windows
Portable Executable header, and verifies the configured processor architecture.
It passes only `--session-id`, `--pipe-name`, and `--protocol-version` on the
command line. The per-session nonce is set through `MXGATEWAY_WORKER_NONCE` so
the command line remains safe to log. Startup failures and startup timeouts kill
and dispose the worker process and the pre-created pipe reservation before the
session manager observes the failure.
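
A minimal sketch of the Portable Executable check: it follows `e_lfanew` at
offset `0x3C` to the `PE\0\0` signature and returns the COFF `Machine` field
(`0x014C` for x86, `0x8664` for x64). `PeMachineSketch` is a hypothetical name
and assumes a seekable stream. One caveat: managed AnyCPU images also report
`0x014C`, so a stricter check for a .NET worker would also inspect the CLR
header's 32-bit-required flag.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class PeMachineSketch
{
    public const ushort I386 = 0x014C;
    public const ushort Amd64 = 0x8664;

    // Returns the COFF Machine field, or throws InvalidDataException
    // when the DOS or PE headers are malformed.
    public static ushort ReadMachine(Stream image)
    {
        Span<byte> dos = stackalloc byte[64];
        Fill(image, dos);
        if (dos[0] != (byte)'M' || dos[1] != (byte)'Z')
        {
            throw new InvalidDataException("Missing MZ signature.");
        }

        // e_lfanew points at the "PE\0\0" signature.
        int peOffset = BinaryPrimitives.ReadInt32LittleEndian(dos.Slice(0x3C));
        if (peOffset < 0)
        {
            throw new InvalidDataException("Invalid PE header offset.");
        }

        image.Seek(peOffset, SeekOrigin.Begin);
        Span<byte> pe = stackalloc byte[6];
        Fill(image, pe);
        if (pe[0] != (byte)'P' || pe[1] != (byte)'E' || pe[2] != 0 || pe[3] != 0)
        {
            throw new InvalidDataException("Missing PE signature.");
        }

        // Machine is the first field of the COFF header after the signature.
        return BinaryPrimitives.ReadUInt16LittleEndian(pe.Slice(4));
    }

    private static void Fill(Stream stream, Span<byte> buffer)
    {
        while (!buffer.IsEmpty)
        {
            int read = stream.Read(buffer);
            if (read == 0)
            {
                throw new InvalidDataException("Truncated PE header.");
            }

            buffer = buffer[read..];
        }
    }
}
```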

## Worker IPC

The gateway creates the pipe server before launching the worker.

Pipe name:

```text
mxaccess-gateway-{gatewayProcessId}-{sessionId}
```
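
The pipe name and handshake nonce can be produced as in this sketch.
`PipeBootstrapSketch` is hypothetical, and the nonce shape (32 bytes from
`RandomNumberGenerator`, hex-encoded) is an assumption; the design only requires
a random per-session value.

```csharp
using System;
using System.Security.Cryptography;

public static class PipeBootstrapSketch
{
    // Format from this design: mxaccess-gateway-{gatewayProcessId}-{sessionId}
    public static string BuildPipeName(int gatewayProcessId, string sessionId) =>
        $"mxaccess-gateway-{gatewayProcessId}-{sessionId}";

    // Assumed nonce shape: 32 cryptographically random bytes as 64 hex chars.
    public static string CreateNonce() =>
        Convert.ToHexString(RandomNumberGenerator.GetBytes(32));
}
```

A caller would typically pass `Environment.ProcessId` as the gateway process id.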

Message framing:

```text
uint32 little-endian payload_length
payload_length bytes protobuf WorkerEnvelope
```

Recommended size limits:

- default max message size: 16 MiB,
- configurable upper bound for large arrays,
- reject zero-length payloads,
- reject payloads larger than the configured maximum before allocation.
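
A minimal sketch of the length-prefixed framing, assuming the payload bytes are
an already-serialized `WorkerEnvelope` produced by a protobuf serializer.
`FrameCodecSketch` is a hypothetical name. The length is validated before the
payload buffer is allocated, and a clean end of stream between frames is
distinguished from a pipe that closes mid-frame.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class FrameCodecSketch
{
    public const int DefaultMaxPayloadBytes = 16 * 1024 * 1024; // 16 MiB

    public static async Task WriteFrameAsync(
        Stream stream, ReadOnlyMemory<byte> payload, int maxPayloadBytes, CancellationToken ct)
    {
        if (payload.Length == 0 || payload.Length > maxPayloadBytes)
        {
            throw new ArgumentOutOfRangeException(nameof(payload));
        }

        byte[] header = new byte[4];
        BinaryPrimitives.WriteUInt32LittleEndian(header, (uint)payload.Length);
        await stream.WriteAsync(header, ct);
        await stream.WriteAsync(payload, ct);
        await stream.FlushAsync(ct);
    }

    // Returns null on a clean end of stream before a new frame starts.
    public static async Task<byte[]?> ReadFrameAsync(
        Stream stream, int maxPayloadBytes, CancellationToken ct)
    {
        byte[] header = new byte[4];
        if (!await ReadExactAsync(stream, header, allowCleanEof: true, ct))
        {
            return null;
        }

        uint length = BinaryPrimitives.ReadUInt32LittleEndian(header);

        // Validate before allocating so a hostile length cannot force a huge buffer.
        if (length == 0 || length > (uint)maxPayloadBytes)
        {
            throw new InvalidDataException($"Invalid frame length {length}.");
        }

        byte[] payload = new byte[length];
        await ReadExactAsync(stream, payload, allowCleanEof: false, ct);
        return payload;
    }

    private static async Task<bool> ReadExactAsync(
        Stream stream, Memory<byte> buffer, bool allowCleanEof, CancellationToken ct)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = await stream.ReadAsync(buffer[total..], ct);
            if (read == 0)
            {
                if (total == 0 && allowCleanEof)
                {
                    return false;
                }

                throw new EndOfStreamException("Pipe closed mid-frame.");
            }

            total += read;
        }

        return true;
    }
}
```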

### Envelope rules

Every message uses `WorkerEnvelope`:

- `protocol_version` must match a supported version.
- `session_id` must match the pipe/session.
- `sequence` is monotonic per sender.
- `correlation_id` links commands and replies.
- events use either zero or their own event correlation id.
- protocol faults do not replace MXAccess HRESULT/status details.

The gateway should treat malformed frames, sequence regressions, and wrong
session ids as protocol faults and close the session.
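
The inbound envelope checks can be sketched as a small per-pipe validator.
`EnvelopeHeader` and `InboundEnvelopeValidator` are hypothetical stand-ins for
the generated protobuf types, a single supported protocol version is assumed,
and "monotonic" is interpreted as strictly increasing. A non-null result would be
treated as a protocol fault that closes the session.

```csharp
using System;

// Hypothetical mirror of the WorkerEnvelope header fields this design names.
public readonly record struct EnvelopeHeader(
    int ProtocolVersion, string SessionId, ulong Sequence, ulong CorrelationId);

// Hypothetical validator for envelopes received from one worker pipe.
public sealed class InboundEnvelopeValidator
{
    private readonly int _protocolVersion;
    private readonly string _sessionId;
    private ulong? _lastSequence;

    public InboundEnvelopeValidator(int protocolVersion, string sessionId)
    {
        _protocolVersion = protocolVersion;
        _sessionId = sessionId;
    }

    // Returns null when the header is acceptable, otherwise a fault reason.
    public string? Validate(EnvelopeHeader header)
    {
        if (header.ProtocolVersion != _protocolVersion)
        {
            return "ProtocolMismatch";
        }

        if (!string.Equals(header.SessionId, _sessionId, StringComparison.Ordinal))
        {
            return "ProtocolViolation: wrong session id";
        }

        // Assumes sequences must strictly increase per sender.
        if (_lastSequence is ulong last && header.Sequence <= last)
        {
            return "ProtocolViolation: sequence regression";
        }

        _lastSequence = header.Sequence;
        return null;
    }
}
```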

## WorkerClient Design

`WorkerClient` is the gateway-side object that owns one worker connection.

Current public shape:

```csharp
public interface IWorkerClient : IAsyncDisposable
{
    string SessionId { get; }
    int? ProcessId { get; }
    WorkerClientState State { get; }
    DateTimeOffset LastHeartbeatAt { get; }

    Task StartAsync(CancellationToken cancellationToken);
    Task<WorkerCommandReply> InvokeAsync(
        WorkerCommand command,
        TimeSpan timeout,
        CancellationToken cancellationToken);
    IAsyncEnumerable<WorkerEvent> ReadEventsAsync(
        CancellationToken cancellationToken);
    Task ShutdownAsync(TimeSpan timeout, CancellationToken cancellationToken);
    void Kill(string reason);
}
```

Internally it owns:

- process handle,
- pipe stream,
- read loop,
- write loop,
- outbound command/control channel serialized by the write loop,
- bounded inbound event channel,
- pending command dictionary keyed by correlation id,
- heartbeat monitor,
- terminal fault source.

`StartAsync` sends `GatewayHello`, verifies the `WorkerHello` protocol version
and nonce, waits for `WorkerReady`, and only then exposes `Ready` state. The
read loop starts after readiness so the handshake has a single owner for its
ordered frames.

### Read loop

The read loop:

1. Reads one frame.
2. Parses `WorkerEnvelope`.
3. Validates protocol fields.
4. Dispatches by body type:
   - `WorkerCommandReply`: completes the pending command.
   - `WorkerEvent`: enqueues the event.
   - `WorkerHeartbeat`: updates the heartbeat timestamp.
   - `WorkerFault`: faults the session.
5. Stops when the pipe closes or cancellation is requested.

If the pipe closes while the session is not closing, fault the session.

### Write loop

The write loop serializes all writes to the pipe. No other code should write to
the pipe directly.

It handles:

- `GatewayHello`,
- `WorkerCommand`,
- `WorkerCancel`,
- `WorkerShutdown`,
- gateway heartbeat if used.

The write loop should fail the session if a pipe write fails outside normal
shutdown.
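
A minimal sketch of the single-writer rule using `System.Threading.Channels`,
assuming frames are enqueued already length-prefixed. `PipeWriteLoopSketch` is
hypothetical; its point is that callers only enqueue, and one loop owns every
write to the pipe stream.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical single-writer loop: callers enqueue frames, and only RunAsync
// touches the pipe stream, so frames are never interleaved.
public sealed class PipeWriteLoopSketch
{
    private readonly Channel<byte[]> _outbound =
        Channel.CreateUnbounded<byte[]>(new UnboundedChannelOptions { SingleReader = true });

    private volatile bool _shuttingDown;

    // Safe to call from any thread; returns false once the loop is completed.
    public bool TryEnqueue(byte[] frame) => _outbound.Writer.TryWrite(frame);

    // Marks normal shutdown; RunAsync drains queued frames and then returns.
    public void Complete()
    {
        _shuttingDown = true;
        _outbound.Writer.TryComplete();
    }

    public async Task RunAsync(Stream pipe, Action<Exception> faultSession, CancellationToken ct)
    {
        try
        {
            await foreach (byte[] frame in _outbound.Reader.ReadAllAsync(ct))
            {
                await pipe.WriteAsync(frame, ct);
                await pipe.FlushAsync(ct);
            }
        }
        catch (IOException ex)
        {
            _outbound.Writer.TryComplete(ex);

            // Write failures outside normal shutdown fault the session.
            if (!_shuttingDown)
            {
                faultSession(ex);
            }
        }
    }
}
```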

During shutdown the worker client treats `WorkerShutdownAck` as the protocol
close signal, but the process handle remains authoritative for process lifetime.
The client waits for both the protocol close and process exit before reporting a
clean shutdown to `GatewaySession`.

## Command Correlation

Each command gets:

- gateway correlation id,
- method name,
- start timestamp,
- timeout deadline,
- caller cancellation token,
- reply completion source.

Pending command handling:

- Add the pending entry before writing the command.
- Remove it exactly once when a reply, timeout, cancellation, or session fault
  occurs.
- If a late reply arrives after cancellation or timeout, log it with the
  correlation id and discard it.
- If the session faults, complete all pending commands with a structured fault.

Timeouts should not assume the COM call stopped. A timed-out command may still
finish inside the worker.
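
The pending-command rules map onto a concurrent dictionary of
`TaskCompletionSource` entries. In this hypothetical sketch
(`PendingCommandsSketch` is illustrative, not the gateway's type), `TryRemove`
ensures exactly one of reply, timeout, cancellation, or fault wins; a late reply
finds no entry and reports itself as discarded. Timing out only stops the gateway
from waiting; the worker may still complete the call.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class PendingCommandsSketch<TReply>
{
    private readonly ConcurrentDictionary<ulong, TaskCompletionSource<TReply>> _pending = new();
    private long _nextId;

    public int Count => _pending.Count;

    // Registers the entry before the caller writes the command to the pipe.
    public (ulong CorrelationId, Task<TReply> Reply) Register()
    {
        ulong id = (ulong)Interlocked.Increment(ref _nextId);
        var tcs = new TaskCompletionSource<TReply>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[id] = tcs;
        return (id, tcs.Task);
    }

    // Read loop: completes the waiter, or returns false for a late or unknown
    // reply, which the caller would log with its correlation id and discard.
    public bool TryComplete(ulong correlationId, TReply reply) =>
        _pending.TryRemove(correlationId, out var tcs) && tcs.TrySetResult(reply);

    // Caller side: on timeout or cancellation, remove the entry so any later
    // reply is discarded instead of completing a waiter that has given up.
    public async Task<TReply> WaitAsync(
        ulong correlationId, Task<TReply> reply, TimeSpan timeout, CancellationToken ct)
    {
        try
        {
            return await reply.WaitAsync(timeout, ct);
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            _pending.TryRemove(correlationId, out _);
            throw;
        }
    }

    // Session fault: fail every pending command with the same structured fault.
    public void FaultAll(Exception fault)
    {
        foreach (ulong id in _pending.Keys)
        {
            if (_pending.TryRemove(id, out var tcs))
            {
                tcs.TrySetException(fault);
            }
        }
    }
}
```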

## Fault Model

Fault categories:

- `StartupFailed`
- `ProtocolMismatch`
- `ProtocolViolation`
- `PipeDisconnected`
- `WorkerExited`
- `HeartbeatExpired`
- `CommandTimeout`
- `WorkerFaulted`
- `GatewayShutdown`
- `AuthorizationFailed`

Public replies should distinguish:

- gRPC transport failure,
- gateway/session failure,
- worker protocol failure,
- MXAccess method failure,
- MXAccess HRESULT/status failure.

Do not hide an MXAccess HRESULT by returning only an RPC error. When MXAccess
was reached and returned status, preserve that status in the command reply.

## Heartbeats and Leases

Use separate concepts:

- worker heartbeat: proves the worker process and pipe loop are alive,
- session lease: proves the client still owns the session,
- command timeout: bounds one command wait,
- startup timeout: bounds worker creation,
- shutdown timeout: bounds graceful stop.

Suggested defaults for early development:

- startup timeout: 30 seconds,
- worker heartbeat interval: 5 seconds,
- heartbeat grace: 15 seconds,
- default command timeout: 30 seconds,
- graceful shutdown timeout: 10 seconds,
- idle session lease: configurable, disabled in local development.

The exact values should be configurable.
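
A hypothetical options record holding the suggested defaults, with the heartbeat
and lease checks kept separate as described above. `GatewayTimeoutsSketch` and
its property names are illustrative, not the gateway's configuration keys.

```csharp
using System;

public sealed record GatewayTimeoutsSketch
{
    public TimeSpan StartupTimeout { get; init; } = TimeSpan.FromSeconds(30);
    public TimeSpan HeartbeatInterval { get; init; } = TimeSpan.FromSeconds(5);
    public TimeSpan HeartbeatGrace { get; init; } = TimeSpan.FromSeconds(15);
    public TimeSpan CommandTimeout { get; init; } = TimeSpan.FromSeconds(30);
    public TimeSpan ShutdownTimeout { get; init; } = TimeSpan.FromSeconds(10);

    // null means the idle session lease is disabled (the local-development default).
    public TimeSpan? IdleSessionLease { get; init; }

    // Worker heartbeat: proves the worker process and its pipe loop are alive.
    public bool IsHeartbeatExpired(DateTimeOffset lastHeartbeat, DateTimeOffset now) =>
        now - lastHeartbeat > HeartbeatGrace;

    // Session lease: proves the client still owns the session.
    public bool IsLeaseExpired(DateTimeOffset lastClientActivity, DateTimeOffset now) =>
        IdleSessionLease is TimeSpan lease && now - lastClientActivity > lease;
}
```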

## Event Delivery

Events flow:

```text
worker MXAccess event
  -> worker outbound event queue
  -> worker pipe writer
  -> gateway read loop
  -> worker client event queue
  -> EventStreamService bounded stream queue
  -> gRPC StreamEvents
```

The gateway should record:

- worker event sequence,
- gateway receive sequence,
- worker timestamp,
- gateway receive timestamp,
- stream send timestamp if needed for diagnostics.

Default backpressure policy for parity testing should be fail-fast:

1. If the worker client event queue fills, fault the worker client.
2. If the public stream queue fills, fault the gateway session.
3. Preserve the overflow details in logs and metrics.
4. Do not silently drop data-change events.
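
A minimal sketch of the fail-fast policy with a bounded channel, assuming a
single writer (the read loop). `FailFastEventQueueSketch` is a hypothetical name.
In `BoundedChannelFullMode.Wait` mode, `TryWrite` returns `false` when the queue
is full, so overflow becomes an explicit fault rather than a silent drop.

```csharp
using System;
using System.Threading.Channels;

// Hypothetical fail-fast FIFO queue between a producer loop and one consumer.
public sealed class FailFastEventQueueSketch<TEvent>
{
    private readonly Channel<TEvent> _queue;
    private readonly Action<string> _fault;
    private bool _faulted;

    public FailFastEventQueueSketch(int capacity, Action<string> fault)
    {
        // Wait mode never drops: TryWrite reports a full queue by returning false.
        _queue = Channel.CreateBounded<TEvent>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true,
        });
        _fault = fault;
    }

    public ChannelReader<TEvent> Reader => _queue.Reader;

    public long OverflowCount { get; private set; }

    // Called only from the single producer loop, so fields need no locking.
    public bool TryEnqueue(TEvent item)
    {
        if (_faulted)
        {
            return false;
        }

        if (_queue.Writer.TryWrite(item))
        {
            return true;
        }

        // Overflow: record it, complete the channel with an error so the
        // consumer stops after draining queued events, and fault the owner.
        _faulted = true;
        OverflowCount++;
        _queue.Writer.TryComplete(new InvalidOperationException("EventQueueOverflow"));
        _fault("EventQueueOverflow");
        return false;
    }
}
```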

Do not set a production event-rate target before measurement. `GatewayMetrics`
records received event counts by family, queue depth, stream disconnects, and
overflow counts. Later production modes may support explicit coalescing by item
handle as an opt-in behavior.

The gateway should not synthesize `OperationComplete` from write completion,
command replies, ASB completion queues, or completion-only status frames. Forward
`OperationComplete` only when the worker reports the native MXAccess public
event.

## Security

### Public API

Use API key authentication for v1. Store API keys in a gateway-owned SQLite
database, but store only hashed key secrets. Clients should send keys in gRPC
metadata using:

```text
authorization: Bearer mxgw_<key-id>_<secret>
```

The gateway should split the key into a stable key id and secret component,
load the key record by id, hash the presented secret, and compare using a
constant-time comparison.

`ApiKeyParser` accepts only `authorization: Bearer mxgw_<key-id>_<secret>`.
Malformed headers fail before any database lookup. The parsed raw secret is
kept only long enough for `ApiKeySecretHasher` to compute an HMAC-SHA256 hash
keyed by the pepper read from the application-configuration key that
`Authentication:PepperSecretName` names. The raw secret is not stored in the
auth database, identity model, logs, or verification result.
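
A minimal parser sketch for the header format above. `ApiKeyHeaderSketch` is
hypothetical, and it assumes key ids never contain `_` while secrets may; the
real `ApiKeyParser` rules may differ. It does no I/O, so malformed headers are
rejected before any database lookup.

```csharp
using System;

public static class ApiKeyHeaderSketch
{
    private const string Scheme = "Bearer ";
    private const string Prefix = "mxgw_";

    // Splits "Bearer mxgw_<key-id>_<secret>" into its parts.
    // Assumption: the key id contains no '_'; the secret may.
    public static bool TryParse(string? header, out string keyId, out string secret)
    {
        keyId = string.Empty;
        secret = string.Empty;

        if (header is null || !header.StartsWith(Scheme, StringComparison.Ordinal))
        {
            return false;
        }

        string token = header.Substring(Scheme.Length);
        if (!token.StartsWith(Prefix, StringComparison.Ordinal))
        {
            return false;
        }

        string rest = token.Substring(Prefix.Length);
        int split = rest.IndexOf('_');

        // Reject an empty key id or an empty secret.
        if (split <= 0 || split == rest.Length - 1)
        {
            return false;
        }

        keyId = rest.Substring(0, split);
        secret = rest.Substring(split + 1);
        return true;
    }
}
```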

`ApiKeyVerifier` loads the stored key record by key id, rejects revoked keys,
hashes the presented secret, and compares the stored and presented hashes with
`CryptographicOperations.FixedTimeEquals`. A successful verification returns an
`ApiKeyIdentity` with key id, key prefix, display name, and scopes. Failure
results distinguish malformed credentials, missing keys, revoked keys, missing
pepper configuration, and hash mismatch for internal authorization handling.
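
The hash-and-compare step can be sketched with the .NET cryptography primitives
named above. `ApiKeyHashSketch` is hypothetical, and UTF-8 encoding of the secret
and a byte-array pepper are assumptions about how the real `ApiKeySecretHasher`
prepares its inputs.

```csharp
using System.Security.Cryptography;
using System.Text;

public static class ApiKeyHashSketch
{
    // HMAC-SHA256 of the presented secret, keyed by the pepper.
    public static byte[] Hash(string secret, byte[] pepper) =>
        HMACSHA256.HashData(pepper, Encoding.UTF8.GetBytes(secret));

    // Comparison time does not depend on where the hashes first differ.
    public static bool Matches(byte[] storedHash, string presentedSecret, byte[] pepper) =>
        CryptographicOperations.FixedTimeEquals(storedHash, Hash(presentedSecret, pepper));
}
```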

`GatewayGrpcAuthorizationInterceptor` enforces this authentication model for
public gRPC calls. Missing, malformed, revoked, unknown, or mismatched keys fail
with `Unauthenticated`. Authenticated calls missing the scope required by the
RPC fail with `PermissionDenied`. The interceptor applies to unary calls and
server-streaming calls and stores the authenticated `ApiKeyIdentity` in
`IGatewayRequestIdentityAccessor` for the duration of the request handler.
Setting `Authentication:Mode` to `Disabled` bypasses API-key verification and is
intended for local development only.

Dashboard authentication reuses the API-key verifier and scope model. The
dashboard login endpoint accepts the key in a form post, checks `admin` scope
when `Dashboard:RequireAdminScope` is enabled, and signs in with the
`MxGateway.Dashboard` cookie scheme. The cookie is HTTP-only, secure, and
`SameSite=Strict`, and uses the `__Host-MxGatewayDashboard` name so browsers
bind it to the exact host. Logout clears that cookie. Login and logout posts use
anti-forgery validation, and dashboard API keys are not accepted in query
strings. `Dashboard:AllowAnonymousLocalhost` allows only loopback requests to
bypass the dashboard cookie requirement and defaults to `true`.

Recommended scopes:

- `session:open`
- `session:close`
- `invoke:read`
- `invoke:write`
- `invoke:secure`
- `events:read`
- `metadata:read`
- `admin`

If the gateway is exposed outside the local machine, use TLS. Do not log raw API
keys or raw credential-bearing MXAccess values.

API key administration for v1 should be a local CLI/tool rather than a public
gRPC admin API. It should initialize the auth database, create keys, list keys
without secrets, revoke keys, rotate keys, and print raw secrets only once at
creation.

`MxGateway.Server` exposes local API-key administration as an `apikey`
subcommand before the web host starts:

```bash
MxGateway.Server apikey init-db --sqlite-path "C:\ProgramData\MxGateway\gateway-auth.db"
MxGateway.Server apikey create-key --key-id operator01 --display-name Operator --scopes "session:open,events:read"
MxGateway.Server apikey list-keys --json
MxGateway.Server apikey revoke-key --key-id operator01
MxGateway.Server apikey rotate-key --key-id operator01 --json
```

The subcommands accept `--sqlite-path`, `--pepper`, and `--json`. `--pepper`
sets the local `MxGateway:ApiKeyPepper` configuration value for the command
process; deployments should normally provide the pepper through the configured
secret source. `create-key` and `rotate-key` print the full raw API key exactly
once. `list-keys` never prints raw secrets or `secret_hash` values.

SQLite auth storage should use startup migrations with a `schema_version` table.
Migrations should run inside transactions and fail startup if the database
schema is newer than the running binary understands.

The v1 auth store uses `Microsoft.Data.Sqlite` and creates the
`schema_version`, `api_keys`, and `api_key_audit` tables through
`SqliteAuthStoreMigrator`. `AuthStoreMigrationHostedService` runs those
migrations at gateway startup when API-key authentication and
`Authentication:RunMigrationsOnStartup` are enabled. A database with a newer
schema version fails startup instead of being modified by an older gateway
binary.
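
The version rule can be sketched as a pure planning step, separate from SQLite
access. `SchemaMigrationPlanSketch` is hypothetical; in a real migrator each
planned version would run inside its own transaction that also updates
`schema_version`.

```csharp
using System;
using System.Collections.Generic;

public static class SchemaMigrationPlanSketch
{
    // Returns the migration versions still to apply, in order, or throws when
    // the database schema is newer than this binary understands.
    public static IReadOnlyList<int> Plan(int databaseVersion, int binaryVersion)
    {
        if (databaseVersion > binaryVersion)
        {
            throw new InvalidOperationException(
                $"Auth database schema {databaseVersion} is newer than supported schema {binaryVersion}.");
        }

        var pending = new List<int>();
        for (int version = databaseVersion + 1; version <= binaryVersion; version++)
        {
            pending.Add(version);
        }

        return pending;
    }
}
```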

`IApiKeyStore` reads stored key records and exposes an active-key lookup that
excludes rows with `revoked_utc` set. Hash verification belongs to the API-key
hashing layer, but the store preserves the `secret_hash` bytes, display name,
scopes, timestamps, and revocation state needed by that layer.

`IApiKeyAuditStore` appends audit events to `api_key_audit` and returns recent
events for diagnostics and future administrative tools. Audit records store key
ids and event metadata only; they do not store raw API key secrets.

Commands requiring authorization:

- writes,
- secured writes,
- authentication commands,
- worker shutdown diagnostics,
- metadata queries if they expose sensitive plant structure.

Current gRPC scope mapping:

- `OpenSession` requires `session:open`.
- `CloseSession` requires `session:close`.
- `StreamEvents` and `DrainEvents` require `events:read`.
- read-style MXAccess commands such as `Register`, `AddItem`, `Advise`, and
  `Ping` require `invoke:read`.
- `Write` and `Write2` require `invoke:write`.
- `WriteSecured`, `WriteSecured2`, and `AuthenticateUser` require
  `invoke:secure`.
- metadata commands such as `ArchestrAUserToId`, `GetSessionState`, and
  `GetWorkerInfo` require `metadata:read`.
- `ShutdownWorker` requires `admin`.
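
The scope mapping can be expressed as a lookup table. This hypothetical sketch
(`ScopeMapSketch` is illustrative) covers only the RPCs and commands named
above, fails closed for anything else, and treats scopes as independent; it does
not assume that `admin` implies the other scopes.

```csharp
using System;
using System.Collections.Generic;

public static class ScopeMapSketch
{
    private static readonly Dictionary<string, string> RequiredScope = new(StringComparer.Ordinal)
    {
        ["OpenSession"] = "session:open",
        ["CloseSession"] = "session:close",
        ["StreamEvents"] = "events:read",
        ["DrainEvents"] = "events:read",
        ["Register"] = "invoke:read",
        ["AddItem"] = "invoke:read",
        ["Advise"] = "invoke:read",
        ["Ping"] = "invoke:read",
        ["Write"] = "invoke:write",
        ["Write2"] = "invoke:write",
        ["WriteSecured"] = "invoke:secure",
        ["WriteSecured2"] = "invoke:secure",
        ["AuthenticateUser"] = "invoke:secure",
        ["ArchestrAUserToId"] = "metadata:read",
        ["GetSessionState"] = "metadata:read",
        ["GetWorkerInfo"] = "metadata:read",
        ["ShutdownWorker"] = "admin",
    };

    // Unknown RPC or command names fail closed: no scope grants them.
    public static bool IsAuthorized(string rpcOrMethod, IReadOnlySet<string> grantedScopes) =>
        RequiredScope.TryGetValue(rpcOrMethod, out string? scope) && grantedScopes.Contains(scope);
}
```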
|
||||
|
||||
### Worker IPC
|
||||
|
||||
Named pipes should be local only. Pipe ACLs should restrict access to:
|
||||
|
||||
- the gateway process identity,
|
||||
- the launched worker identity,
|
||||
- administrators only when operationally required.
|
||||
|
||||
The worker must validate `GatewayHello` and the nonce before creating MXAccess.
|
||||
|
||||
## Observability
|
||||
|
||||
Use structured logs with these fields where applicable:
|
||||
|
||||
- session id,
|
||||
- client identity,
|
||||
- worker process id,
|
||||
- pipe name hash or suffix,
|
||||
- protocol version,
|
||||
- correlation id,
|
||||
- command method,
|
||||
- MXAccess HRESULT,
|
||||
- MXAccess status summary,
|
||||
- event family,
|
||||
- event sequence,
|
||||
- queue depth,
|
||||
- elapsed milliseconds.
|
||||
|
||||
Metrics:

- open sessions,
- workers running,
- worker startup latency,
- command latency by method,
- command failures by method and category,
- event rate by session and family,
- event queue depth,
- worker exits by reason,
- worker kills,
- heartbeat failures,
- gRPC stream disconnects.

Do not log credential values or full tag values by default.

The gateway registers `GatewayMetrics` as the in-process metrics foundation.
It emits .NET `Meter` instruments for collectors and keeps a
`GatewayMetricsSnapshot` for dashboard projection. The snapshot exists because
the dashboard needs current counters and queue depths without depending on a
specific metrics exporter.

Event metrics use low-cardinality tags such as event family. Per-session event
counts are kept only in the in-process snapshot for active dashboard sessions
and are purged when the session is removed. Worker event queue depth and gRPC
event stream queue depth are reported as separate gauges.

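The following sketch shows one way to combine `Meter` instruments with an in-process snapshot under these rules: the exported event counter is tagged only by family, per-session counts live only in the snapshot and are purged on removal, and the two queue depths are separate gauges. `GatewayMetricsSketch`, `GatewayMetricsSnapshotSketch`, and the instrument names are hypothetical, and queue depths are tracked globally rather than per session for brevity.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical snapshot shape; not the real GatewayMetricsSnapshot.
public sealed record GatewayMetricsSnapshotSketch(
    int OpenSessions,
    IReadOnlyDictionary<string, long> EventsBySession,
    int WorkerEventQueueDepth,
    int GrpcEventQueueDepth);

public sealed class GatewayMetricsSketch : IDisposable
{
    private readonly Meter _meter = new("MxGateway.Sketch");
    private readonly Counter<long> _events;
    private readonly ConcurrentDictionary<string, StrongBox<long>> _eventsBySession = new(StringComparer.Ordinal);
    private int _openSessions;
    private int _workerEventQueueDepth;
    private int _grpcEventQueueDepth;

    public GatewayMetricsSketch()
    {
        _events = _meter.CreateCounter<long>("mxgateway.events", "{event}", "Events received from workers.");

        // Collectors observe these gauges; the two queue depths stay separate instruments.
        _meter.CreateObservableGauge<int>("mxgateway.sessions.open", () => Volatile.Read(ref _openSessions));
        _meter.CreateObservableGauge<int>("mxgateway.worker.event_queue.depth", () => Volatile.Read(ref _workerEventQueueDepth));
        _meter.CreateObservableGauge<int>("mxgateway.grpc.event_queue.depth", () => Volatile.Read(ref _grpcEventQueueDepth));
    }

    public void SessionOpened(string sessionId)
    {
        if (_eventsBySession.TryAdd(sessionId, new StrongBox<long>(0)))
        {
            Interlocked.Increment(ref _openSessions);
        }
    }

    public void SessionRemoved(string sessionId)
    {
        // Purge the per-session entry; repeated removal decrements only once.
        if (_eventsBySession.TryRemove(sessionId, out _))
        {
            Interlocked.Decrement(ref _openSessions);
        }
    }

    public void EventReceived(string sessionId, string eventFamily)
    {
        // The exported counter carries only the low-cardinality family tag.
        _events.Add(1, new KeyValuePair<string, object?>("event.family", eventFamily));

        // Late events for a purged session do not recreate its entry.
        if (_eventsBySession.TryGetValue(sessionId, out StrongBox<long>? box))
        {
            Interlocked.Increment(ref box.Value);
        }
    }

    public void SetQueueDepths(int workerEventQueueDepth, int grpcEventQueueDepth)
    {
        Volatile.Write(ref _workerEventQueueDepth, workerEventQueueDepth);
        Volatile.Write(ref _grpcEventQueueDepth, grpcEventQueueDepth);
    }

    public GatewayMetricsSnapshotSketch Snapshot()
    {
        var counts = new Dictionary<string, long>(StringComparer.Ordinal);
        foreach (KeyValuePair<string, StrongBox<long>> entry in _eventsBySession)
        {
            counts[entry.Key] = Interlocked.Read(ref entry.Value.Value);
        }

        return new GatewayMetricsSnapshotSketch(
            Volatile.Read(ref _openSessions),
            counts,
            Volatile.Read(ref _workerEventQueueDepth),
            Volatile.Read(ref _grpcEventQueueDepth));
    }

    public void Dispose() => _meter.Dispose();
}
```

Keeping session ids out of instrument tags avoids unbounded time-series cardinality in collectors, while the dashboard still gets per-session numbers from the snapshot.
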
HTTP request handling uses `UseGatewayRequestLoggingScope()` to attach common
structured log fields when request metadata is present:

- `SessionId`,
- `ClientIdentity`,
- `WorkerProcessId`,
- `CorrelationId`,
- `CommandMethod`.

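A minimal sketch of the field selection such a scope could perform, assuming request metadata is available as string key/value pairs. The metadata keys are hypothetical, and the real middleware may resolve fields such as `ClientIdentity` from authentication state rather than headers. In ASP.NET Core, a dictionary like this is typically passed to `ILogger.BeginScope` so that structured log providers attach each entry as a property.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class RequestLogScopeSketch
{
    // Hypothetical metadata keys mapped to the documented log field names.
    private static readonly (string MetadataKey, string FieldName)[] Fields =
    {
        ("x-session-id", "SessionId"),
        ("x-client-identity", "ClientIdentity"),
        ("x-worker-process-id", "WorkerProcessId"),
        ("x-correlation-id", "CorrelationId"),
        ("x-command-method", "CommandMethod"),
    };

    public static Dictionary<string, object> Build(IReadOnlyDictionary<string, string> metadata)
    {
        var scope = new Dictionary<string, object>(StringComparer.Ordinal);
        foreach (var (metadataKey, fieldName) in Fields)
        {
            // Attach a field only when the metadata is present and non-blank.
            if (metadata.TryGetValue(metadataKey, out var value) && !string.IsNullOrWhiteSpace(value))
            {
                scope[fieldName] = value;
            }
        }

        return scope;
    }
}
```
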
`GatewayLogRedactor` redacts API key secrets and command values before they are
added to log state. Value logging is opt-in, and values are redacted by default,
so secured writes, authentication commands, and ordinary tag values do not leak
through diagnostics.

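A minimal sketch of redaction with opt-in value logging. `LogRedactorSketch` and the `[redacted]` marker are hypothetical. Keeping credential-bearing commands redacted even when value logging is enabled is a conservative assumption of this sketch, not documented gateway behavior.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class LogRedactorSketch
{
    public const string Redacted = "[redacted]";

    // Assumption: these commands carry credentials and stay redacted even when
    // value logging is turned on.
    private static readonly HashSet<string> CredentialCommands =
        new(StringComparer.Ordinal) { "WriteSecured", "WriteSecured2", "AuthenticateUser" };

    private readonly bool _logValues;

    public LogRedactorSketch(bool logValues) => _logValues = logValues;

    // API key secrets never reach log state.
    public string RedactApiKey(string? apiKey) =>
        string.IsNullOrEmpty(apiKey) ? string.Empty : Redacted;

    public string RedactValue(string commandMethod, object? value)
    {
        if (value is null)
        {
            return "null";
        }

        if (!_logValues || CredentialCommands.Contains(commandMethod))
        {
            return Redacted;
        }

        return Convert.ToString(value, CultureInfo.InvariantCulture) ?? string.Empty;
    }
}
```
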
## Configuration

Suggested configuration shape:

```json
{
  "MxGateway": {
    "Authentication": {
      "Mode": "ApiKey",
      "SqlitePath": "C:\\ProgramData\\MxGateway\\gateway-auth.db",
      "PepperSecretName": "MxGateway:ApiKeyPepper",
      "RunMigrationsOnStartup": true
    },
    "Worker": {
      "ExecutablePath": "src/MxGateway.Worker/bin/x86/Release/MxGateway.Worker.exe",
      "WorkingDirectory": null,
      "RequiredArchitecture": "X86",
      "StartupTimeoutSeconds": 30,
      "StartupProbeRetryAttempts": 3,
      "StartupProbeRetryDelayMilliseconds": 250,
      "PipeConnectAttemptTimeoutMilliseconds": 2000,
      "ShutdownTimeoutSeconds": 10,
      "HeartbeatIntervalSeconds": 5,
      "HeartbeatGraceSeconds": 15,
      "MaxMessageBytes": 16777216
    },
    "Sessions": {
      "DefaultCommandTimeoutSeconds": 30,
      "MaxSessions": 64,
      "MaxPendingCommandsPerSession": 128,
      "AllowMultipleEventSubscribers": false
    },
    "Events": {
      "QueueCapacity": 10000,
      "BackpressurePolicy": "FailFast"
    },
    "Dashboard": {
      "Enabled": true,
      "PathBase": "/dashboard",
      "RequireAdminScope": true,
      "AllowAnonymousLocalhost": true,
      "SnapshotIntervalMilliseconds": 1000,
      "RecentFaultLimit": 100,
      "RecentSessionLimit": 200,
      "ShowTagValues": false
    },
    "Protocol": {
      "WorkerProtocolVersion": 1
    }
  }
}
```

Do not scatter connection or path constants through implementation code.

`MxGateway.Server` binds this section to `GatewayOptions` at startup and
registers validation with `ValidateOnStart()`. Startup fails before the gateway
begins serving traffic when required authentication settings are missing,
timeouts or queue sizes are not positive, dashboard settings are malformed, or
the configured worker protocol version does not match the contract version.

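The validation rules in the previous paragraph can be expressed as a plain function over the bound options. This is a minimal sketch over a flattened, hypothetical subset of the options (`GatewayOptionsSketch`, `WorkerOptionsSketch`), and `ContractProtocolVersion` stands in for the real contract constant. In the gateway host, logic like this would sit behind an `IValidateOptions<GatewayOptions>` implementation so that `ValidateOnStart()` runs it before the server starts.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical, flattened subset of the configuration shape shown above.
public sealed class WorkerOptionsSketch
{
    public int StartupTimeoutSeconds { get; set; } = 30;
    public int ShutdownTimeoutSeconds { get; set; } = 10;
    public int HeartbeatIntervalSeconds { get; set; } = 5;
    public int HeartbeatGraceSeconds { get; set; } = 15;
    public int MaxMessageBytes { get; set; } = 16 * 1024 * 1024;
}

public sealed class GatewayOptionsSketch
{
    public string? SqlitePath { get; set; }              // Authentication:SqlitePath
    public WorkerOptionsSketch Worker { get; set; } = new();
    public int QueueCapacity { get; set; } = 10_000;     // Events:QueueCapacity
    public int WorkerProtocolVersion { get; set; } = 1;  // Protocol:WorkerProtocolVersion
}

public static class GatewayOptionsValidatorSketch
{
    // Assumed stand-in for the version compiled into the worker contract.
    public const int ContractProtocolVersion = 1;

    // Returns every problem found; an empty list means startup may continue.
    public static IReadOnlyList<string> Validate(GatewayOptionsSketch options)
    {
        var errors = new List<string>();

        if (string.IsNullOrWhiteSpace(options.SqlitePath))
        {
            errors.Add("Authentication:SqlitePath is required.");
        }

        RequirePositive(errors, "Worker:StartupTimeoutSeconds", options.Worker.StartupTimeoutSeconds);
        RequirePositive(errors, "Worker:ShutdownTimeoutSeconds", options.Worker.ShutdownTimeoutSeconds);
        RequirePositive(errors, "Worker:HeartbeatIntervalSeconds", options.Worker.HeartbeatIntervalSeconds);
        RequirePositive(errors, "Worker:HeartbeatGraceSeconds", options.Worker.HeartbeatGraceSeconds);
        RequirePositive(errors, "Worker:MaxMessageBytes", options.Worker.MaxMessageBytes);
        RequirePositive(errors, "Events:QueueCapacity", options.QueueCapacity);

        if (options.WorkerProtocolVersion != ContractProtocolVersion)
        {
            errors.Add($"Protocol:WorkerProtocolVersion {options.WorkerProtocolVersion} does not match contract version {ContractProtocolVersion}.");
        }

        return errors;
    }

    private static void RequirePositive(List<string> errors, string key, int value)
    {
        if (value <= 0)
        {
            errors.Add($"{key} must be positive.");
        }
    }
}
```
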
The gateway exposes read-only effective settings through
`IGatewayConfigurationProvider`. This projection is for dashboard settings and
diagnostics, so it redacts secret-related fields such as
`Authentication:PepperSecretName` and does not include raw API keys or key
material.

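A sketch of a redacted settings projection, assuming settings arrive as flattened key/value pairs such as those returned by `IConfiguration.AsEnumerable()`. `EffectiveSettingsSketch` is hypothetical, and its redacted-key list is illustrative rather than the gateway's actual list.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class EffectiveSettingsSketch
{
    // Illustrative list of keys whose values never leave the process.
    // Configuration keys are case-insensitive, so the comparer is too.
    private static readonly HashSet<string> SecretKeys = new(StringComparer.OrdinalIgnoreCase)
    {
        "MxGateway:Authentication:PepperSecretName",
    };

    public static IReadOnlyDictionary<string, string> Project(IEnumerable<KeyValuePair<string, string?>> settings)
    {
        // Sorted for stable display on a settings page.
        var projected = new SortedDictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var (key, value) in settings)
        {
            projected[key] = SecretKeys.Contains(key) ? "[redacted]" : value ?? string.Empty;
        }

        return projected;
    }
}
```
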
The complete option reference, including defaults and validation rules, is in
[Gateway Configuration](./GatewayConfiguration.md).

## Galaxy Repository Metadata

Galaxy hierarchy and tag metadata can be discovered through SQL Server when
needed for browsing or diagnostics. The current notes live outside this repo at:

```text
C:\Users\dohertj2\Desktop\lmxopcua\gr
```

Use SQL metadata as discovery data. It does not replace MXAccess-backed runtime
behavior unless an explicit non-parity backend is designed.

## Testing Strategy

Gateway tests should be able to run without MXAccess installed by using fake
workers and fake transports.

Use `FakeWorkerHarness` for tests that need real gateway-to-worker framing,
handshake, command, event, fault, or malformed-protocol behavior without loading
MXAccess COM. See [Gateway Testing](./GatewayTesting.md) for the harness scope
and focused test commands.

Focused tests:

- session state transitions,
- gRPC API-key authentication for unary and streaming calls,
- gRPC scope mapping for sessions, invokes, events, metadata, and admin
  commands,
- worker startup failures,
- protocol version mismatch,
- malformed frame handling,
- pending command completion,
- command timeout and late reply handling,
- worker crash handling,
- event ordering,
- event queue overflow,
- `CloseSession` idempotency,
- gRPC mapping for command replies and faults,
- dashboard snapshot projection,
- dashboard auth decisions,
- dashboard redaction,
- dashboard realtime subscription disposal.

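The pending-command, timeout, and late-reply cases in the focused test list are easiest to pin down against a small deterministic model. This is a minimal sketch, assuming 64-bit correlation ids and string replies; `PendingCommandTableSketch` is hypothetical and is not the gateway's command router.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class PendingCommandTableSketch
{
    private readonly ConcurrentDictionary<long, TaskCompletionSource<string>> _pending = new();
    private long _nextCorrelationId;

    public int Count => _pending.Count;

    // Allocates a correlation id and a task that completes when the reply arrives.
    public (long CorrelationId, Task<string> Reply) Register()
    {
        long id = Interlocked.Increment(ref _nextCorrelationId);
        var completion = new TaskCompletionSource<string>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[id] = completion;
        return (id, completion.Task);
    }

    // Called when the command timeout elapses: the caller sees a TimeoutException
    // and the slot is freed.
    public bool TimeOut(long correlationId) =>
        _pending.TryRemove(correlationId, out var completion)
        && completion.TrySetException(new TimeoutException($"Command {correlationId} timed out."));

    // Called for each reply frame. A reply for an unknown or already timed-out
    // correlation id is a late reply: it is dropped and reported as false.
    public bool TryComplete(long correlationId, string reply) =>
        _pending.TryRemove(correlationId, out var completion) && completion.TrySetResult(reply);
}
```
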
Integration tests with the real worker should be separated from unit tests and
clearly marked because they require Windows, the .NET Framework worker build
output, and, eventually, an installed MXAccess COM component.

## Initial Implementation Slice

The first gateway slice should implement:

1. Host startup and configuration binding.
2. SQLite auth database initialization and migrations.
3. Local API-key administration CLI/tool.
4. API-key authentication and scope checks.
5. `OpenSession`.
6. Worker process launch.
7. Named-pipe handshake.
8. `Invoke` for `Register`, `AddItem`, and `Advise`.
9. `StreamEvents` with one subscriber per session.
10. `CloseSession`.
11. Worker crash and startup failure handling.
12. Event-rate, queue-depth, and overflow metrics.
13. Blazor Server dashboard with Bootstrap assets.
14. Dashboard home, sessions, and workers pages.
15. Dashboard realtime snapshot refresh.
16. Dashboard API-key login with admin-scope check.
17. Basic structured logs.

This proves the process model before the full command surface is implemented.

## Related Documentation

- [MXAccess Worker Instance Detailed Design](./MxAccessWorkerInstanceDesign.md)
- [Worker Frame Protocol](./WorkerFrameProtocol.md)
- [Worker Process Launcher](./WorkerProcessLauncher.md)
- [Gateway Configuration](./GatewayConfiguration.md)
- [Sessions](./Sessions.md)
- [gRPC](./Grpc.md)
- [Authentication](./Authentication.md)
- [Authorization](./Authorization.md)
- [Metrics](./Metrics.md)
- [Diagnostics](./Diagnostics.md)
- [Gateway Testing](./GatewayTesting.md)