# Gateway Process Detailed Design

## Purpose

The gateway process is the only public network-facing component. It exposes the modern API, owns session lifecycle, launches and supervises MXAccess worker processes, and moves commands and events between clients and the worker that owns each session.

The gateway must not instantiate MXAccess COM, import MXAccess interop types, or depend on an STA message pump. The installed MXAccess COM component is isolated behind the worker process boundary.

## Runtime

- Target runtime: .NET 10.
- Language: C#.
- Preferred process architecture: x64.
- Hosting: ASP.NET Core gRPC.
- Web UI: Blazor Server dashboard with Bootstrap CSS/JS.
- Operating system: Windows.
- Public transport: TCP gRPC.
- Internal worker transport: named pipes with protobuf-framed messages.

Style guides:

- [C# Style Guide](./style-guides/CSharpStyleGuide.md)
- [Protobuf Style Guide](./style-guides/ProtobufStyleGuide.md)

## Responsibilities

The gateway owns:

- public gRPC service endpoints,
- Blazor Server dashboard endpoints,
- optional authentication and authorization,
- session id allocation,
- worker executable selection,
- named-pipe server creation,
- worker process launch,
- gateway/worker handshake,
- command correlation and timeout handling,
- event fan-out to client streams,
- session lease and heartbeat enforcement,
- worker crash and hang detection,
- metrics and structured logging,
- graceful service shutdown.

The gateway does not own:

- MXAccess COM object creation,
- MXAccess method dispatch,
- MXAccess event subscription,
- MXAccess handle generation,
- COM value conversion from native `VARIANT` values.

Those belong to the worker.

## High-Level Components

```text
MxGateway.Server
  Program / Host
  Configuration
  Grpc
    MxAccessGatewayService
    RequestReplyMapper
    EventMapper
  Dashboard
    Pages
    Components
    DashboardSnapshotService
    DashboardAuthorization
  Sessions
    SessionManager
    GatewaySession
    SessionRegistry
    SessionLeaseMonitor
  Workers
    WorkerProcessLauncher
    WorkerClient
    WorkerPipeTransport
    WorkerProtocolReader
    WorkerProtocolWriter
    WorkerWatchdog
  Security
    ClientIdentityResolver
    CommandAuthorization
  Metrics
    GatewayMetrics
  Diagnostics
    HealthChecks
```

## Public gRPC Surface

Start with unary commands plus an event stream:

```protobuf
service MxAccessGateway {
  rpc OpenSession(OpenSessionRequest) returns (OpenSessionReply);
  rpc CloseSession(CloseSessionRequest) returns (CloseSessionReply);
  rpc Invoke(MxCommandRequest) returns (MxCommandReply);
  rpc StreamEvents(StreamEventsRequest) returns (stream MxEvent);
}
```

Add this later only after the command and event model is stable:

```protobuf
rpc Session(stream ClientMessage) returns (stream ServerMessage);
```

### OpenSession

`OpenSession` creates one gateway session and one worker process by default.

Inputs should include:

- requested backend, defaulting to `mxaccess-worker`,
- optional client session name,
- optional client correlation id,
- optional timeout policy,
- optional event backpressure policy,
- optional metadata discovery options.

Outputs should include:

- session id,
- backend name,
- worker process id when available,
- protocol version,
- server capabilities,
- default timeout values.

Behavior:

1. Resolve and authorize the client identity.
2. Allocate a session id.
3. Build a pipe name and random handshake nonce.
4. Create a named-pipe server with restrictive local ACLs.
5. Launch the worker executable with session bootstrap data.
6. Accept the pipe connection within startup timeout.
7. Exchange `GatewayHello` and `WorkerHello`.
8. Wait for `WorkerReady`.
9. Register the session as ready.
10. Return the session details.

If any step fails, clean up all resources. Kill the worker if it was launched and did not shut down on its own.

### CloseSession

`CloseSession` attempts graceful shutdown and then enforces a kill timeout.

Behavior:

1. Mark the session closing.
2. Stop accepting new commands.
3. Notify event streams of terminal session close.
4. Send `WorkerShutdown` when the pipe is still connected.
5. Wait for worker exit up to the configured timeout.
6. Kill the worker process if it remains alive.
7. Remove the session from the registry.

`CloseSession` should be idempotent. Closing an already closed session should return a successful close result with the final known state.

### Invoke

`Invoke` forwards one MXAccess command to the worker that owns the session.

Behavior:

1. Validate the session id.
2. Check session state is `Ready`.
3. Validate the method-specific payload.
4. Authorize the command, especially writes and credential-bearing commands.
5. Assign a gateway correlation id.
6. Write `WorkerCommand` to the worker pipe.
7. Await the correlated `WorkerCommandReply`.
8. Map worker reply to public `MxCommandReply`.

Request cancellation stops waiting in the gateway. It does not abort an in-flight COM call. If the command must be hard-canceled, kill the worker and fault the session.

### StreamEvents

`StreamEvents` streams events for one session.

Initial implementation allows one active stream subscriber per session. A second subscriber should be rejected with a clear session error. If multiple subscribers are later supported, they must have independent backpressure accounting and a clear fan-out policy.

Behavior:

1. Validate session id and authorize event access.
2. Attach a stream cursor to the session event channel.
3. Send events in worker sequence order.
4. Stop on client cancellation, session close, or session fault.
5. Emit a terminal status when the session faults if gRPC status alone cannot preserve the required details.
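
The one-subscriber rule for `StreamEvents` could be enforced with a small atomic gate. The following is a minimal sketch, not the gateway's actual implementation; `SingleSubscriberGate` and `TryAttach` are hypothetical names introduced only for illustration.

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical helper: allows at most one active StreamEvents subscriber per session.
public sealed class SingleSubscriberGate
{
    private int _active; // 0 = free, 1 = taken

    // Returns a lease when the slot was free, or null when a subscriber is already attached.
    public IDisposable? TryAttach()
    {
        return Interlocked.CompareExchange(ref _active, 1, 0) == 0
            ? new Lease(this)
            : null;
    }

    private sealed class Lease : IDisposable
    {
        private SingleSubscriberGate? _owner;

        public Lease(SingleSubscriberGate owner) => _owner = owner;

        public void Dispose()
        {
            // Exchange makes Dispose idempotent: only the first call releases the slot.
            var owner = Interlocked.Exchange(ref _owner, null);
            if (owner is not null)
            {
                Volatile.Write(ref owner._active, 0);
            }
        }
    }
}
```

A `StreamEvents` handler would call `TryAttach()` after validating the session, reject the call with a session error when it returns `null`, and dispose the lease when the stream ends so a later subscriber can attach.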

The gateway must not reorder events from one worker.

## Web Dashboard

The gateway hosts a basic Blazor Server dashboard for operators and developers. The dashboard is read-only for v1 and should show current gateway/session/worker state plus basic metrics.

Technology:

- Blazor Server,
- Bootstrap CSS,
- Bootstrap JavaScript,
- no MudBlazor,
- no other Blazor client component libraries.

Suggested routes:

```text
/dashboard
/dashboard/sessions
/dashboard/sessions/{sessionId}
/dashboard/workers
/dashboard/events
/dashboard/settings
```

Dashboard pages:

- home: gateway status, uptime, session count, worker count, command rate, event rate, queue depth, recent faults,
- sessions: active/recent session table,
- session details: one session's worker, heartbeat, counters, queues, and fault summary,
- workers: worker process table and heartbeat details,
- events: aggregate event counters and rates,
- settings: read-only effective configuration with secrets redacted.

Realtime updates should use Blazor Server component updates from a read-only snapshot service. Components should subscribe to snapshots and call `StateHasChanged` through `InvokeAsync`. Do not stream every MXAccess event to the dashboard; aggregate event rates and counters instead.

Suggested service shape:

```csharp
public interface IDashboardSnapshotService
{
    DashboardSnapshot GetSnapshot();

    IAsyncEnumerable<DashboardSnapshot> WatchSnapshotsAsync(
        CancellationToken cancellationToken);
}
```

Default refresh policy:

- immediate update on session create, close, or fault,
- immediate update on worker fault,
- periodic metrics refresh every 1 second,
- event-rate windows updated every 1 second.

Dashboard access should require API-key-backed authentication with `admin` scope when enabled. A simple `/dashboard/login` form can validate an API key and issue an HTTP-only secure cookie for dashboard pages. Do not put API keys in query strings.
Anonymous localhost access may exist only behind an explicit configuration option that defaults to false.

## Session State Machine

```text
Creating
  -> StartingWorker
  -> WaitingForPipe
  -> Handshaking
  -> InitializingWorker
  -> Ready
  -> Closing
  -> Closed

Any non-terminal state -> Faulted
Faulted -> Closed
```

### State Rules

- `Creating`: session id and in-memory state exist, but no worker has launched.
- `StartingWorker`: worker process launch is in progress.
- `WaitingForPipe`: gateway is waiting for the worker to connect to the pipe.
- `Handshaking`: pipe is connected and protocol hello is being verified.
- `InitializingWorker`: worker is connected but has not reported MXAccess ready.
- `Ready`: commands and event streams may run.
- `Closing`: graceful shutdown is in progress.
- `Closed`: resources are released.
- `Faulted`: a non-graceful terminal fault occurred and must be reported to callers before resources are released.

Only `Ready` sessions accept new commands.

## Session Model

Gateway session state should include:

- session id,
- client identity,
- backend name,
- worker process id,
- worker executable path and version,
- pipe name,
- pipe connection state,
- open time,
- last client activity time,
- last worker heartbeat time,
- lease expiration,
- command timeout policy,
- startup timeout policy,
- shutdown timeout policy,
- event queue metrics,
- active event stream count,
- final fault if any.

The worker remains authoritative for MXAccess handles. The gateway may keep a shadow state for diagnostics, but it must not invent, rewrite, or recycle MXAccess handles.

## Worker Launch

The gateway should launch the worker using explicit configuration:

- worker executable path,
- worker working directory,
- worker architecture requirement,
- protocol version,
- startup timeout,
- environment variables,
- optional restricted user identity.

Command-line arguments should include only non-secret bootstrap values:

```text
--session-id
--pipe-name
--protocol-version
```

Prefer passing the handshake nonce via inherited environment or another protected local mechanism instead of command line when possible.

Before launch, validate:

- worker executable exists,
- worker path is under the configured install directory,
- worker file version or product version is acceptable,
- worker is expected to be x86.

## Worker IPC

The gateway creates the pipe server before launching the worker.

Pipe name:

```text
mxaccess-gateway-{gatewayProcessId}-{sessionId}
```

Message framing:

```text
uint32 little-endian payload_length
payload_length bytes protobuf WorkerEnvelope
```

Recommended size limits:

- default max message size: 16 MiB,
- configurable upper bound for large arrays,
- reject zero-length payloads,
- reject payloads larger than configured maximum before allocation.

### Envelope Rules

Every message uses `WorkerEnvelope`:

- `protocol_version` must match a supported version.
- `session_id` must match the pipe/session.
- `sequence` is monotonic per sender.
- `correlation_id` links commands and replies.
- events use either zero or their own event correlation id.
- protocol faults do not replace MXAccess HRESULT/status details.

The gateway should treat malformed frames, sequence regressions, and wrong session ids as protocol faults and close the session.

## WorkerClient Design

`WorkerClient` is the gateway-side object that owns one worker connection.

Suggested public shape:

```csharp
public interface IWorkerClient : IAsyncDisposable
{
    string SessionId { get; }
    int? ProcessId { get; }
    WorkerClientState State { get; }

    Task StartAsync(CancellationToken cancellationToken);

    // Completes with the worker reply correlated to this command.
    Task<WorkerCommandReply> InvokeAsync(
        WorkerCommand command,
        TimeSpan timeout,
        CancellationToken cancellationToken);

    // Yields worker events in the order the read loop received them.
    IAsyncEnumerable<WorkerEvent> ReadEventsAsync(
        CancellationToken cancellationToken);

    Task ShutdownAsync(TimeSpan timeout, CancellationToken cancellationToken);

    void Kill(string reason);
}
```

Internally it owns:

- process handle,
- pipe stream,
- read loop,
- write loop,
- bounded outbound command/control channel,
- bounded inbound event channel,
- pending command dictionary keyed by correlation id,
- heartbeat monitor,
- terminal fault source.

### Read Loop

The read loop:

1. Reads one frame.
2. Parses `WorkerEnvelope`.
3. Validates protocol fields.
4. Dispatches by body type:
   - `WorkerCommandReply`: completes pending command.
   - `WorkerEvent`: enqueues event.
   - `WorkerHeartbeat`: updates heartbeat timestamp.
   - `WorkerFault`: faults session.
5. Stops when pipe closes or cancellation is requested.

If the pipe closes while the session is not closing, fault the session.

### Write Loop

The write loop serializes all writes to the pipe. No other code should write to the pipe directly.

It handles:

- `GatewayHello`,
- `WorkerCommand`,
- `WorkerCancel`,
- `WorkerShutdown`,
- gateway heartbeat if used.

The write loop should fail the session if a pipe write fails outside normal shutdown.

## Command Correlation

Each command gets:

- gateway correlation id,
- method name,
- start timestamp,
- timeout deadline,
- caller cancellation token,
- reply completion source.

Pending command handling:

- Add the pending entry before writing the command.
- Remove it exactly once when reply, timeout, cancellation, or session fault occurs.
- If a late reply arrives after cancellation or timeout, log it with the correlation id and discard it.
- If the session faults, complete all pending commands with a structured fault.

Timeouts should not assume the COM call stopped.
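
The pending command rules under Command Correlation can be illustrated with a small registry. This is a sketch under stated assumptions: `PendingCommands<TReply>` and its method names are hypothetical, `TReply` stands in for the worker reply type, and a `long` correlation id is assumed because the document does not fix the id type.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical registry of in-flight commands keyed by gateway correlation id.
public sealed class PendingCommands<TReply>
{
    private readonly ConcurrentDictionary<long, TaskCompletionSource<TReply>> _pending = new();

    // Call before writing the command so a fast reply cannot race the registration.
    public Task<TReply> Register(long correlationId)
    {
        var tcs = new TaskCompletionSource<TReply>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        if (!_pending.TryAdd(correlationId, tcs))
        {
            throw new InvalidOperationException($"Duplicate correlation id {correlationId}.");
        }
        return tcs.Task;
    }

    // Returns false for a late reply (entry already removed) so the caller can log and discard it.
    public bool TryComplete(long correlationId, TReply reply) =>
        _pending.TryRemove(correlationId, out var tcs) && tcs.TrySetResult(reply);

    // Timeout or caller cancellation: stop waiting. The worker-side call may still be running.
    public bool TryAbandon(long correlationId, Exception reason) =>
        _pending.TryRemove(correlationId, out var tcs) && tcs.TrySetException(reason);

    // Session fault: fail every command that is still waiting.
    public void FailAll(Exception fault)
    {
        // Keys returns a snapshot, so removing entries while iterating is safe.
        foreach (var id in _pending.Keys)
        {
            TryAbandon(id, fault);
        }
    }
}
```

Because every completion path goes through `TryRemove`, each entry is removed exactly once, and a reply that arrives after a timeout simply finds no entry.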
A timed-out command may still finish inside the worker.

## Fault Model

Fault categories:

- `StartupFailed`
- `ProtocolMismatch`
- `ProtocolViolation`
- `PipeDisconnected`
- `WorkerExited`
- `HeartbeatExpired`
- `CommandTimeout`
- `WorkerFaulted`
- `GatewayShutdown`
- `AuthorizationFailed`

Public replies should distinguish:

- gRPC transport failure,
- gateway/session failure,
- worker protocol failure,
- MXAccess method failure,
- MXAccess HRESULT/status failure.

Do not hide an MXAccess HRESULT by returning only an RPC error. When MXAccess was reached and returned status, preserve that status in the command reply.

## Heartbeats And Leases

Use separate concepts:

- worker heartbeat: proves the worker process and pipe loop are alive,
- session lease: proves the client still owns the session,
- command timeout: bounds one command wait,
- startup timeout: bounds worker creation,
- shutdown timeout: bounds graceful stop.

Suggested defaults for early development:

- startup timeout: 30 seconds,
- worker heartbeat interval: 5 seconds,
- heartbeat grace: 15 seconds,
- default command timeout: 30 seconds,
- graceful shutdown timeout: 10 seconds,
- idle session lease: configurable, disabled in local development.

The exact values should be configurable.

## Event Delivery

Events flow:

```text
worker MXAccess event
  -> worker outbound event queue
  -> worker pipe writer
  -> gateway read loop
  -> session event channel
  -> gRPC StreamEvents
```

The gateway should record:

- worker event sequence,
- gateway receive sequence,
- worker timestamp,
- gateway receive timestamp,
- stream send timestamp if needed for diagnostics.

Default backpressure policy for parity testing should be fail-fast:

1. If the session event channel fills, fault the session.
2. Preserve the overflow details in logs and metrics.
3. Do not silently drop data-change events.

Do not set a production event-rate target before measurement. Emit event rate, queue depth, stream send latency, and overflow metrics.
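
One way to get the fail-fast behavior is a bounded `System.Threading.Channels` channel whose writer reports overflow instead of dropping items. The sketch below is illustrative only; `FailFastEventQueue<TEvent>` is a hypothetical name, and faulting the session is left to the caller.

```csharp
#nullable enable
using System;
using System.Threading.Channels;

// Hypothetical per-session event queue with fail-fast overflow.
public sealed class FailFastEventQueue<TEvent>
{
    private readonly Channel<TEvent> _channel;

    public FailFastEventQueue(int capacity)
    {
        _channel = Channel.CreateBounded<TEvent>(new BoundedChannelOptions(capacity)
        {
            // In Wait mode TryWrite returns false when the channel is full; nothing is dropped.
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true,
            SingleWriter = true,
        });
    }

    public ChannelReader<TEvent> Reader => _channel.Reader;

    // Returns false on overflow. The writer is completed with an error so the
    // stream reader observes the fault after draining what was already queued.
    public bool TryEnqueue(TEvent item)
    {
        if (_channel.Writer.TryWrite(item))
        {
            return true;
        }

        _channel.Writer.TryComplete(
            new InvalidOperationException("Session event queue overflow."));
        return false;
    }
}
```

The read loop would call `TryEnqueue` for each `WorkerEvent` and, on `false`, record the overflow metric and fault the session. Events already queued stay in order for the subscriber.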
Later production modes may support explicit coalescing by item handle as an opt-in behavior.

The gateway should not synthesize `OperationComplete` from write completion, command replies, ASB completion queues, or completion-only status frames. Forward `OperationComplete` only when the worker reports the native MXAccess public event.

## Security

### Public API

Use API key authentication for v1. Store API keys in a gateway-owned SQLite database, but store only hashed key secrets.

Clients should send keys in gRPC metadata using:

```text
authorization: Bearer mxgw__
```

The gateway should split the key into a stable key id and secret component, load the key record by id, hash the presented secret, and compare using a constant-time comparison.

Recommended scopes:

- `session:open`
- `session:close`
- `invoke:read`
- `invoke:write`
- `invoke:secure`
- `events:read`
- `metadata:read`
- `admin`

If the gateway is exposed outside the local machine, use TLS. Do not log raw API keys or raw credential-bearing MXAccess values.

API key administration for v1 should be a local CLI/tool rather than a public gRPC admin API. It should initialize the auth database, create keys, list keys without secrets, revoke keys, rotate keys, and print raw secrets only once at creation.

SQLite auth storage should use startup migrations with a `schema_version` table. Migrations should run inside transactions and fail startup if the database schema is newer than the running binary understands.

Commands requiring authorization:

- writes,
- secured writes,
- authentication commands,
- worker shutdown diagnostics,
- metadata queries if they expose sensitive plant structure.

### Worker IPC

Named pipes should be local only. Pipe ACLs should restrict access to:

- the gateway process identity,
- the launched worker identity,
- administrators only when operationally required.

The worker must validate `GatewayHello` and the nonce before creating MXAccess.
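
Returning to the public API key check described under Public API, the split, hash, and constant-time compare steps could look roughly like this. It is a sketch only: the `mxgw_<keyId>_<secret>` layout, the `ApiKeyVerifier` class, and the choice of HMAC-SHA256 keyed with a pepper are all assumptions made for illustration; the document does not specify the exact token layout or hash function.

```csharp
#nullable enable
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical verifier; token layout and hash choice are assumptions.
public static class ApiKeyVerifier
{
    private const string Prefix = "mxgw_";

    // Splits an assumed "mxgw_<keyId>_<secret>" token into key id and secret.
    public static bool TryParse(string presented, out string keyId, out string secret)
    {
        keyId = secret = string.Empty;
        if (!presented.StartsWith(Prefix, StringComparison.Ordinal))
        {
            return false;
        }

        var rest = presented.Substring(Prefix.Length);
        var split = rest.IndexOf('_');
        if (split <= 0 || split == rest.Length - 1)
        {
            return false; // missing key id or missing secret
        }

        keyId = rest.Substring(0, split);
        secret = rest.Substring(split + 1);
        return true;
    }

    // Only this hash would be stored; the raw secret is never persisted.
    public static byte[] HashSecret(string secret, byte[] pepper) =>
        HMACSHA256.HashData(pepper, Encoding.UTF8.GetBytes(secret));

    // FixedTimeEquals avoids leaking how many leading bytes matched.
    public static bool Verify(string presentedSecret, byte[] storedHash, byte[] pepper) =>
        CryptographicOperations.FixedTimeEquals(
            HashSecret(presentedSecret, pepper), storedHash);
}
```

The key id is used only to look up the stored record; the comparison itself always runs over fixed-length hashes, so its timing does not depend on the presented secret.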

## Observability

Use structured logs with these fields where applicable:

- session id,
- client identity,
- worker process id,
- pipe name hash or suffix,
- protocol version,
- correlation id,
- command method,
- MXAccess HRESULT,
- MXAccess status summary,
- event family,
- event sequence,
- queue depth,
- elapsed milliseconds.

Metrics:

- open sessions,
- workers running,
- worker startup latency,
- command latency by method,
- command failures by method and category,
- event rate by session and family,
- event queue depth,
- worker exits by reason,
- worker kills,
- heartbeat failures,
- gRPC stream disconnects.

Do not log credential values or full tag values by default.

The gateway registers `GatewayMetrics` as the in-process metrics foundation. It emits .NET `Meter` instruments for collectors and keeps a `GatewayMetricsSnapshot` for dashboard projection. The snapshot exists because the dashboard needs current counters and queue depths without depending on a specific metrics exporter.

HTTP request handling uses `UseGatewayRequestLoggingScope()` to attach common structured log fields when request metadata is present:

- `SessionId`,
- `ClientIdentity`,
- `WorkerProcessId`,
- `CorrelationId`,
- `CommandMethod`.

`GatewayLogRedactor` redacts API key secrets and command values before they are added to log state. Value logging remains opt-in and redacted by default so secured writes, authentication commands, and ordinary tag values do not leak through diagnostics.
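
The pairing of `Meter` instruments with an in-process snapshot, as described for `GatewayMetrics` above, might be arranged as in the following sketch. The type names, meter name, and instrument names here are hypothetical and chosen only for illustration; the real `GatewayMetrics` and `GatewayMetricsSnapshot` shapes are not shown in this document.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading;

// Hypothetical snapshot shape for dashboard projection.
public sealed record MetricsSnapshotSketch(long OpenSessions, long CommandFailures);

// Hypothetical metrics holder: instruments for collectors, plain counters for the dashboard.
public sealed class GatewayMetricsSketch : IDisposable
{
    private readonly Meter _meter = new("MxGateway.Sketch");
    private readonly UpDownCounter<long> _openSessions;
    private readonly Counter<long> _commandFailures;
    private readonly Histogram<double> _commandLatencyMs;
    private long _openSessionCount;
    private long _commandFailureCount;

    public GatewayMetricsSketch()
    {
        _openSessions = _meter.CreateUpDownCounter<long>("sessions.open");
        _commandFailures = _meter.CreateCounter<long>("commands.failed");
        _commandLatencyMs = _meter.CreateHistogram<double>("commands.latency", unit: "ms");
    }

    public void SessionOpened()
    {
        _openSessions.Add(1);
        Interlocked.Increment(ref _openSessionCount);
    }

    public void SessionClosed()
    {
        _openSessions.Add(-1);
        Interlocked.Decrement(ref _openSessionCount);
    }

    public void CommandCompleted(string method, double elapsedMs, bool failed)
    {
        var tag = new KeyValuePair<string, object?>("method", method);
        _commandLatencyMs.Record(elapsedMs, tag);
        if (failed)
        {
            _commandFailures.Add(1, tag);
            Interlocked.Increment(ref _commandFailureCount);
        }
    }

    // The dashboard reads these counters directly, independent of any exporter.
    public MetricsSnapshotSketch GetSnapshot() =>
        new(Interlocked.Read(ref _openSessionCount), Interlocked.Read(ref _commandFailureCount));

    public void Dispose() => _meter.Dispose();
}
```

Keeping the snapshot counters next to the instruments means each event is recorded once at the call site and both views stay in step.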

## Configuration

Suggested configuration shape:

```json
{
  "MxGateway": {
    "Authentication": {
      "Mode": "ApiKey",
      "SqlitePath": "C:\\ProgramData\\MxGateway\\gateway-auth.db",
      "PepperSecretName": "MxGateway:ApiKeyPepper",
      "RunMigrationsOnStartup": true
    },
    "Worker": {
      "ExecutablePath": "src/MxGateway.Worker/bin/x86/Release/MxGateway.Worker.exe",
      "StartupTimeoutSeconds": 30,
      "ShutdownTimeoutSeconds": 10,
      "HeartbeatIntervalSeconds": 5,
      "HeartbeatGraceSeconds": 15,
      "MaxMessageBytes": 16777216
    },
    "Sessions": {
      "DefaultCommandTimeoutSeconds": 30,
      "MaxSessions": 64,
      "AllowMultipleEventSubscribers": false
    },
    "Events": {
      "QueueCapacity": 10000,
      "BackpressurePolicy": "FailFast"
    },
    "Dashboard": {
      "Enabled": true,
      "PathBase": "/dashboard",
      "RequireAdminScope": true,
      "AllowAnonymousLocalhost": false,
      "SnapshotIntervalMilliseconds": 1000,
      "RecentFaultLimit": 100,
      "RecentSessionLimit": 200,
      "ShowTagValues": false
    }
  }
}
```

Do not scatter connection or path constants through implementation code.

`MxGateway.Server` binds this section to `GatewayOptions` at startup and registers validation with `ValidateOnStart()`. Startup fails before the gateway begins serving traffic when required authentication settings are missing, timeouts or queue sizes are not positive, dashboard settings are malformed, or the configured worker protocol version does not match the contract version.

The gateway exposes read-only effective settings through `IGatewayConfigurationProvider`. This projection is for dashboard settings and diagnostics, so it redacts secret-related fields such as `Authentication:PepperSecretName` and does not include raw API keys or key material.

## Galaxy Repository Metadata

Galaxy hierarchy and tag metadata can be discovered through SQL Server when needed for browse or diagnostics. The current notes live outside this repo at:

```text
C:\Users\dohertj2\Desktop\lmxopcua\gr
```

Use SQL metadata as discovery data.
It does not replace MXAccess-backed runtime behavior unless an explicit non-parity backend is designed.

## Testing Strategy

Gateway tests should be able to run without installed MXAccess by using fake workers and fake transports.

Focused tests:

- session state transitions,
- worker startup failures,
- protocol version mismatch,
- malformed frame handling,
- pending command completion,
- command timeout and late reply handling,
- worker crash handling,
- event ordering,
- event queue overflow,
- `CloseSession` idempotency,
- gRPC mapping for command replies and faults,
- dashboard snapshot projection,
- dashboard auth decisions,
- dashboard redaction,
- dashboard realtime subscription disposal.

Integration tests with the real worker should be separated from unit tests and clearly marked because they require Windows, .NET Framework worker output, and eventually installed MXAccess COM.

## Initial Implementation Slice

The first gateway slice should implement:

1. Host startup and configuration binding.
2. SQLite auth database initialization and migrations.
3. Local API-key administration CLI/tool.
4. API-key authentication and scope checks.
5. `OpenSession`.
6. Worker process launch.
7. Named-pipe handshake.
8. `Invoke` for `Register`, `AddItem`, and `Advise`.
9. `StreamEvents` with one subscriber per session.
10. `CloseSession`.
11. Worker crash and startup failure handling.
12. Event-rate, queue-depth, and overflow metrics.
13. Blazor Server dashboard with Bootstrap assets.
14. Dashboard home, sessions, and workers pages.
15. Dashboard realtime snapshot refresh.
16. Dashboard API-key login with admin-scope check.
17. Basic structured logs.

This proves the process model before the full command surface is implemented.