# Gateway Process Detailed Design
## Purpose
The gateway process is the only public network-facing component. It exposes the
modern API, owns session lifecycle, launches and supervises MXAccess worker
processes, and moves commands and events between clients and the worker that
owns each session.
The gateway must not instantiate MXAccess COM, import MXAccess interop types, or
depend on an STA message pump. The installed MXAccess COM component is isolated
behind the worker process boundary.
## Runtime
- Target runtime: .NET 10.
- Language: C#.
- Preferred process architecture: x64.
- Hosting: ASP.NET Core gRPC.
- Web UI: Blazor Server dashboard with Bootstrap CSS/JS.
- Operating system: Windows.
- Public transport: TCP gRPC.
- Internal worker transport: named pipes with protobuf-framed messages.
Style guides:
- [C# Style Guide](./style-guides/CSharpStyleGuide.md)
- [Protobuf Style Guide](./style-guides/ProtobufStyleGuide.md)
## Responsibilities
The gateway owns:
- public gRPC service endpoints,
- Blazor Server dashboard endpoints,
- optional authentication and authorization,
- session id allocation,
- worker executable selection,
- named-pipe server creation,
- worker process launch,
- gateway/worker handshake,
- command correlation and timeout handling,
- event fan-out to client streams,
- session lease and heartbeat enforcement,
- worker crash and hang detection,
- metrics and structured logging,
- graceful service shutdown.
The gateway does not own:
- MXAccess COM object creation,
- MXAccess method dispatch,
- MXAccess event subscription,
- MXAccess handle generation,
- COM value conversion from native `VARIANT` values.
Those belong to the worker.
## High-Level Components
```text
ZB.MOM.WW.MxGateway.Server
  Program / Host
  Configuration
  Grpc
    MxAccessGatewayService
    MxAccessGrpcRequestValidator
    MxAccessGrpcMapper
  Dashboard
    Pages
    Components
    DashboardSnapshotService
    DashboardAuthorization
  Sessions
    SessionManager
    GatewaySession
    SessionRegistry
    SessionLeaseMonitor
  Workers
    WorkerProcessLauncher
    WorkerClient
    WorkerPipeTransport
    WorkerProtocolReader
    WorkerProtocolWriter
    WorkerWatchdog
  Security
    ClientIdentityResolver
    CommandAuthorization
  Metrics
    GatewayMetrics
  Diagnostics
    HealthChecks
```
## Public gRPC Surface
Start with unary commands plus an event stream:
```protobuf
service MxAccessGateway {
  rpc OpenSession(OpenSessionRequest) returns (OpenSessionReply);
  rpc CloseSession(CloseSessionRequest) returns (CloseSessionReply);
  rpc Invoke(MxCommandRequest) returns (MxCommandReply);
  rpc StreamEvents(StreamEventsRequest) returns (stream MxEvent);
}
```
`MxAccessGatewayService` implements these public RPCs in the gateway process.
It validates public requests with `MxAccessGrpcRequestValidator`, delegates
session lifecycle and command routing to `ISessionManager`, and maps worker
command replies and events through `MxAccessGrpcMapper`. Session lookup,
validation, and worker transport failures become gRPC status errors. MXAccess
method replies that reached the worker remain `MxCommandReply` payloads so
HRESULT values, status arrays, and method-specific reply fields survive
transport boundaries.
Add this later only after the command and event model is stable:
```protobuf
rpc Session(stream ClientMessage) returns (stream ServerMessage);
```
### OpenSession
`OpenSession` creates one gateway session and one worker process by default.
Inputs should include:
- requested backend, defaulting to `mxaccess-worker`,
- optional client session name,
- optional client correlation id,
- optional timeout policy,
- optional event backpressure policy,
- optional metadata discovery options.
Outputs should include:
- session id,
- backend name,
- worker process id when available,
- protocol version,
- server capabilities,
- default timeout values.
Behavior:
1. Resolve and authorize the client identity.
2. Allocate a session id.
3. Build a pipe name and random handshake nonce.
4. Create a named-pipe server with restrictive local ACLs.
5. Launch the worker executable with session bootstrap data.
6. Accept the pipe connection within startup timeout.
7. Exchange `GatewayHello` and `WorkerHello`.
8. Wait for `WorkerReady`.
9. Register the session as ready.
10. Return the session details.
If any step fails, clean up all resources. Kill the worker if it was launched
and did not shut down on its own.
### CloseSession
`CloseSession` attempts graceful shutdown and then enforces a kill timeout.
Behavior:
1. Mark the session closing.
2. Stop accepting new commands.
3. Notify event streams of terminal session close.
4. Send `WorkerShutdown` when the pipe is still connected.
5. Wait for worker exit up to the configured timeout.
6. Kill the worker process if it remains alive.
7. Remove the session from the registry.
`CloseSession` should be idempotent. Closing an already closed session should
return a successful close result with the final known state.
`WorkerClient.ShutdownAsync` sends `WorkerShutdown`, waits for the worker read,
write, and heartbeat loops to stop, and waits for the launched worker process to
exit within the same shutdown timeout. If the pipe loops or process exit exceed
the timeout, the close operation fails with `ShutdownTimeout`; `GatewaySession`
then kills the worker process tree before surfacing the close failure.
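
The graceful-then-kill sequence above can be sketched as follows. This is a minimal illustration rather than the actual `WorkerClient` or `GatewaySession` code: `ShutdownSketch`, `ShutdownOrKillAsync`, and both delegate parameters are hypothetical names introduced here.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: bounds a graceful shutdown with one timeout and
// falls back to a kill callback when the timeout elapses.
public static class ShutdownSketch
{
    // Returns true when gracefulShutdown finished within the timeout.
    // Returns false after invoking kill when the timeout elapsed first.
    public static async Task<bool> ShutdownOrKillAsync(
        Func<Task> gracefulShutdown,
        TimeSpan timeout,
        Action kill)
    {
        try
        {
            // Task.WaitAsync throws TimeoutException when the timeout wins.
            await gracefulShutdown().WaitAsync(timeout);
            return true;
        }
        catch (TimeoutException)
        {
            kill(); // e.g. kill the worker process tree
            return false;
        }
    }
}
```

A caller would map the `false` result to the `ShutdownTimeout` close failure after the kill has run.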
### Invoke
`Invoke` forwards one MXAccess command to the worker that owns the session.
Behavior:
1. Validate the session id.
2. Check session state is `Ready`.
3. Validate the method-specific payload.
4. Authorize the command, especially writes and credential-bearing commands.
5. Assign a gateway correlation id.
6. Write `WorkerCommand` to the worker pipe.
7. Await the correlated `WorkerCommandReply`.
8. Map worker reply to public `MxCommandReply`.
Request cancellation stops waiting in the gateway. It does not abort an
in-flight COM call. If the command must be hard-canceled, kill the worker and
fault the session.
### StreamEvents
`StreamEvents` streams events for one session.
The initial implementation allows one active stream subscriber per session. A second
subscriber should be rejected with a clear session error. If multiple
subscribers are later supported, they must have independent backpressure
accounting and a clear fan-out policy.
Behavior:
1. Validate session id and authorize event access.
2. Attach the single active subscriber lease for the session.
3. Read worker events into a bounded public stream queue.
4. Send events in worker sequence order.
5. Stop on client cancellation, session close, or session fault.
6. Emit a terminal status when the session faults if gRPC status alone cannot
preserve the required details.
`EventStreamService` owns subscriber tracking and public stream backpressure.
The default policy allows one active subscriber per session. A second subscriber
is rejected with `EventSubscriberAlreadyActive`. Stream cancellation releases
the subscriber lease so a later stream can attach to the session.
The gateway must not reorder events from one worker. `EventStreamService` writes
mapped events to a bounded first-in, first-out queue and faults the session with
`EventQueueOverflow` if the queue fills. The gateway does not synthesize
`OperationComplete`; it forwards that family only when the worker reports a
native MXAccess `OperationComplete` event.
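
The bounded first-in, first-out queue with fail-fast overflow could look roughly like this, assuming `System.Threading.Channels` as the queue primitive. `BoundedEventQueue` and its members are hypothetical names, and the real `EventStreamService` may be structured differently.

```csharp
using System;
using System.Threading.Channels;

// Hypothetical sketch of a fail-fast bounded FIFO for mapped events.
public sealed class BoundedEventQueue<TEvent>
{
    private readonly Channel<TEvent> _channel;

    public BoundedEventQueue(int capacity)
    {
        _channel = Channel.CreateBounded<TEvent>(new BoundedChannelOptions(capacity)
        {
            // With Wait mode, TryWrite returns false when the queue is full
            // instead of silently dropping an item.
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true,
            SingleWriter = true,
        });
    }

    public bool IsFaulted { get; private set; }

    public ChannelReader<TEvent> Reader => _channel.Reader;

    // Returns false on overflow and marks the queue faulted; the caller is
    // expected to fault the session with EventQueueOverflow.
    public bool TryEnqueue(TEvent item)
    {
        if (IsFaulted) return false;
        if (_channel.Writer.TryWrite(item)) return true;

        IsFaulted = true;
        _channel.Writer.TryComplete(new InvalidOperationException("EventQueueOverflow"));
        return false;
    }
}
```

Events already queued stay readable in order after the overflow, so a stream can drain them before it observes the fault.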
## Web Dashboard
The gateway hosts a basic Blazor Server dashboard for operators and developers.
The dashboard is read-only for v1 and should show current gateway/session/worker
state plus basic metrics.
Technology:
- Blazor Server,
- Bootstrap CSS,
- Bootstrap JavaScript,
- no MudBlazor,
- no other Blazor client component libraries.
Suggested routes:
```text
/dashboard
/dashboard/sessions
/dashboard/sessions/{sessionId}
/dashboard/workers
/dashboard/events
/dashboard/settings
```
Dashboard pages:
- home: gateway status, uptime, session count, worker count, command rate,
event rate, queue depth, recent faults,
- sessions: active/recent session table,
- session details: one session's worker, heartbeat, counters, queues, and fault
summary,
- workers: worker process table and heartbeat details,
- events: aggregate event counters and rates,
- settings: read-only effective configuration with secrets redacted.
Realtime updates should use Blazor Server component updates from a read-only
snapshot service. Components should subscribe to snapshots and call
`StateHasChanged` through `InvokeAsync`. Do not stream every MXAccess event to
the dashboard; aggregate event rates and counters instead.
Suggested service shape:
```csharp
public interface IDashboardSnapshotService
{
    DashboardSnapshot GetSnapshot();
    IAsyncEnumerable<DashboardSnapshot> WatchSnapshotsAsync(
        CancellationToken cancellationToken);
}
```
Default refresh policy:
- immediate update on session create, close, or fault,
- immediate update on worker fault,
- periodic metrics refresh every 1 second,
- event-rate windows updated every 1 second.
Dashboard access should require API-key-backed authentication with `admin` scope
when enabled. A simple `/dashboard/login` form can validate an API key and issue
an HTTP-only secure cookie for dashboard pages. Do not put API keys in query
strings. Anonymous localhost access may exist only behind an explicit
configuration option that defaults to false.
## Session State Machine
```text
Creating
  -> StartingWorker
  -> WaitingForPipe
  -> Handshaking
  -> InitializingWorker
  -> Ready
  -> Closing
  -> Closed

Any non-terminal state
  -> Faulted

Faulted
  -> Closed
```
### State rules
- `Creating`: session id and in-memory state exist, but no worker has launched.
- `StartingWorker`: worker process launch is in progress.
- `WaitingForPipe`: gateway is waiting for the worker to connect to the pipe.
- `Handshaking`: pipe is connected and protocol hello is being verified.
- `InitializingWorker`: worker is connected but has not reported MXAccess ready.
- `Ready`: commands and event streams may run.
- `Closing`: graceful shutdown is in progress.
- `Closed`: resources are released.
- `Faulted`: a non-graceful terminal fault occurred and must be reported to
callers before resources are released.
Only `Ready` sessions accept new commands.
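
One way to encode the diagram is a transition table. The sketch below covers only the edges drawn above and treats both `Closed` and `Faulted` as terminal for the "any non-terminal state" rule. `SessionStateRules` and its members are hypothetical names, not the gateway's actual types.

```csharp
using System;
using System.Collections.Generic;

public enum SessionState
{
    Creating, StartingWorker, WaitingForPipe, Handshaking,
    InitializingWorker, Ready, Closing, Closed, Faulted
}

// Hypothetical rule table derived from the state diagram.
public static class SessionStateRules
{
    // Single permitted successor for each state, excluding the Faulted edge.
    private static readonly Dictionary<SessionState, SessionState> Next = new()
    {
        [SessionState.Creating] = SessionState.StartingWorker,
        [SessionState.StartingWorker] = SessionState.WaitingForPipe,
        [SessionState.WaitingForPipe] = SessionState.Handshaking,
        [SessionState.Handshaking] = SessionState.InitializingWorker,
        [SessionState.InitializingWorker] = SessionState.Ready,
        [SessionState.Ready] = SessionState.Closing,
        [SessionState.Closing] = SessionState.Closed,
        [SessionState.Faulted] = SessionState.Closed,
    };

    public static bool IsTerminal(SessionState state) =>
        state is SessionState.Closed or SessionState.Faulted;

    public static bool CanTransition(SessionState from, SessionState to)
    {
        // Any non-terminal state may fault.
        if (to == SessionState.Faulted) return !IsTerminal(from);
        return Next.TryGetValue(from, out var next) && next == to;
    }

    // Only Ready sessions accept new commands.
    public static bool AcceptsCommands(SessionState state) =>
        state == SessionState.Ready;
}
```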
## Session Model
Gateway session state should include:
- session id,
- client identity,
- backend name,
- worker process id,
- worker executable path and version,
- pipe name,
- pipe connection state,
- open time,
- last client activity time,
- last worker heartbeat time,
- lease expiration,
- command timeout policy,
- startup timeout policy,
- shutdown timeout policy,
- event queue metrics,
- active event stream count,
- final fault if any.
The worker remains authoritative for MXAccess handles. The gateway may keep a
shadow state for diagnostics, but it must not invent, rewrite, or recycle
MXAccess handles.
`SessionManager` owns the current in-memory session registry. It allocates a
session id, creates the worker pipe name and nonce, registers the session before
worker startup, and removes the session if startup fails. A successful
`OpenSession` attaches the ready `IWorkerClient` and transitions the session to
`Ready`.
Only `Ready` sessions accept command and event operations. `CloseSession` shuts
down the worker, disposes the worker client, and removes the session from the
registry so closed sessions do not retain pipe or process handles. A later close
for the same id returns `SessionNotFound`. Lease handling is exposed as a
session hook so a monitor can close expired sessions without embedding lease
policy in the worker client. Gateway shutdown walks the registry, closes each
known session, and kills a worker if graceful shutdown fails.
## Worker Launch
The gateway should launch the worker using explicit configuration:
- worker executable path,
- worker working directory,
- worker architecture requirement,
- protocol version,
- startup timeout,
- environment variables,
- optional restricted user identity.
Command-line arguments should include only non-secret bootstrap values:
```text
--session-id <sessionId>
--pipe-name <pipeName>
--protocol-version <version>
```
Prefer passing the handshake nonce through an inherited environment variable or
another protected local mechanism rather than the command line when possible.
Before launch, validate:
- worker executable exists,
- worker path is under the configured install directory,
- worker file version or product version is acceptable,
- worker executable matches the expected x86 architecture.
`WorkerProcessLauncher` implements the first validation layer now: it resolves
the worker executable path, requires a `.exe`, validates the Windows Portable
Executable header, and verifies the configured processor architecture. It passes
only `--session-id`, `--pipe-name`, and `--protocol-version` on the command
line. The per-session nonce is set through `MXGATEWAY_WORKER_NONCE` so the
command line remains safe to log. Startup failures and startup timeouts kill and
dispose the worker process and the pre-created pipe reservation before the
session manager observes the failure.
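
The split between loggable arguments and the environment-carried nonce might be built like this. It is a sketch under stated assumptions: `WorkerLaunchSketch` is a hypothetical name, and the 32-byte hex nonce format is an illustrative choice that this document does not specify.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;
using System.Security.Cryptography;

// Hypothetical helper; only the three argument names and the
// MXGATEWAY_WORKER_NONCE variable come from the design text.
public static class WorkerLaunchSketch
{
    // Assumption: 32 random bytes rendered as hex.
    public static string NewNonce() =>
        Convert.ToHexString(RandomNumberGenerator.GetBytes(32));

    public static ProcessStartInfo Build(
        string executablePath, string sessionId, string pipeName,
        int protocolVersion, string nonce)
    {
        var startInfo = new ProcessStartInfo(executablePath)
        {
            UseShellExecute = false, // required for Environment to apply
            CreateNoWindow = true,
        };

        // Non-secret bootstrap values only, so the command line is safe to log.
        startInfo.ArgumentList.Add("--session-id");
        startInfo.ArgumentList.Add(sessionId);
        startInfo.ArgumentList.Add("--pipe-name");
        startInfo.ArgumentList.Add(pipeName);
        startInfo.ArgumentList.Add("--protocol-version");
        startInfo.ArgumentList.Add(protocolVersion.ToString(CultureInfo.InvariantCulture));

        // The secret travels through the child's environment instead.
        startInfo.Environment["MXGATEWAY_WORKER_NONCE"] = nonce;
        return startInfo;
    }
}
```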
## Worker IPC
The gateway creates the pipe server before launching the worker.
Pipe name:
```text
mxaccess-gateway-{gatewayProcessId}-{sessionId}
```
Message framing:
```text
uint32 little-endian payload_length
payload_length bytes protobuf WorkerEnvelope
```
Recommended size limits:
- default max message size: 16 MiB,
- configurable upper bound for large arrays,
- reject zero-length payloads,
- reject payloads larger than configured maximum before allocation.
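
A minimal sketch of this framing over any `Stream`, treating the payload as opaque bytes so that it does not depend on the generated `WorkerEnvelope` type. `WorkerFrameCodec` is a hypothetical name; see the Worker Frame Protocol document for the authoritative rules.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical codec: 4-byte little-endian length prefix, then the payload.
public static class WorkerFrameCodec
{
    public const int DefaultMaxMessageBytes = 16 * 1024 * 1024; // 16 MiB

    public static void WriteFrame(
        Stream stream, ReadOnlySpan<byte> payload,
        int maxMessageBytes = DefaultMaxMessageBytes)
    {
        if (payload.Length == 0 || payload.Length > maxMessageBytes)
            throw new InvalidDataException($"Invalid payload length {payload.Length}.");

        Span<byte> header = stackalloc byte[4];
        BinaryPrimitives.WriteUInt32LittleEndian(header, (uint)payload.Length);
        stream.Write(header);
        stream.Write(payload);
    }

    public static byte[] ReadFrame(
        Stream stream, int maxMessageBytes = DefaultMaxMessageBytes)
    {
        Span<byte> header = stackalloc byte[4];
        stream.ReadExactly(header); // throws EndOfStreamException on a short read
        uint length = BinaryPrimitives.ReadUInt32LittleEndian(header);

        // Validate the declared length before allocating the payload buffer.
        if (length == 0 || length > (uint)maxMessageBytes)
            throw new InvalidDataException($"Invalid payload length {length}.");

        var payload = new byte[length];
        stream.ReadExactly(payload);
        return payload;
    }
}
```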
### Envelope rules
Every message uses `WorkerEnvelope`:
- `protocol_version` must match a supported version.
- `session_id` must match the pipe/session.
- `sequence` is monotonic per sender.
- `correlation_id` links commands and replies.
- events use either zero or their own event correlation id.
- protocol faults do not replace MXAccess HRESULT/status details.
The gateway should treat malformed frames, sequence regressions, and wrong
session ids as protocol faults and close the session.
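
The per-envelope checks can be sketched as a small stateful validator. Two assumptions are made here that the rules above do not state: sequences start at 1 and must strictly increase, and a version mismatch maps to `ProtocolMismatch` while the other failures map to `ProtocolViolation`. `EnvelopeValidator` is a hypothetical name.

```csharp
#nullable enable
using System;

// Hypothetical validator for envelopes received from one sender.
public sealed class EnvelopeValidator
{
    private readonly string _sessionId;
    private readonly uint _supportedVersion;
    private ulong _lastSequence; // 0 means nothing has been accepted yet

    public EnvelopeValidator(string sessionId, uint supportedVersion)
    {
        _sessionId = sessionId;
        _supportedVersion = supportedVersion;
    }

    // Returns null when the envelope is acceptable, otherwise a fault
    // category name from the fault model.
    public string? Check(uint protocolVersion, string sessionId, ulong sequence)
    {
        if (protocolVersion != _supportedVersion) return "ProtocolMismatch";
        if (!string.Equals(sessionId, _sessionId, StringComparison.Ordinal))
            return "ProtocolViolation";
        if (sequence <= _lastSequence) return "ProtocolViolation"; // regression or repeat

        _lastSequence = sequence;
        return null;
    }
}
```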
## WorkerClient Design
`WorkerClient` is the gateway-side object that owns one worker connection.
Current public shape:
```csharp
public interface IWorkerClient : IAsyncDisposable
{
    string SessionId { get; }
    int? ProcessId { get; }
    WorkerClientState State { get; }
    DateTimeOffset LastHeartbeatAt { get; }
    Task StartAsync(CancellationToken cancellationToken);
    Task<WorkerCommandReply> InvokeAsync(
        WorkerCommand command,
        TimeSpan timeout,
        CancellationToken cancellationToken);
    IAsyncEnumerable<WorkerEvent> ReadEventsAsync(
        CancellationToken cancellationToken);
    Task ShutdownAsync(TimeSpan timeout, CancellationToken cancellationToken);
    void Kill(string reason);
}
```
Internally it owns:
- process handle,
- pipe stream,
- read loop,
- write loop,
- outbound command/control channel serialized by the write loop,
- bounded inbound event channel,
- pending command dictionary keyed by correlation id,
- heartbeat monitor,
- terminal fault source.
`StartAsync` sends `GatewayHello`, verifies the `WorkerHello` protocol version
and nonce, waits for `WorkerReady`, and only then exposes `Ready` state. The
read loop starts after readiness so the handshake has a single owner for its
ordered frames.
### Read loop
The read loop:
1. Reads one frame.
2. Parses `WorkerEnvelope`.
3. Validates protocol fields.
4. Dispatches by body type:
   - `WorkerCommandReply`: completes pending command.
   - `WorkerEvent`: enqueues event.
   - `WorkerHeartbeat`: updates heartbeat timestamp.
   - `WorkerFault`: faults session.
5. Stops when pipe closes or cancellation is requested.
If the pipe closes while the session is not closing, fault the session.
### Write loop
The write loop serializes all writes to the pipe. No other code should write to
the pipe directly.
It handles:
- `GatewayHello`,
- `WorkerCommand`,
- `WorkerCancel`,
- `WorkerShutdown`,
- gateway heartbeat if used.
The write loop should fail the session if a pipe write fails outside normal
shutdown.
During shutdown the worker client treats `WorkerShutdownAck` as the protocol
close signal, but the process handle remains authoritative for process lifetime.
The client waits for both the protocol close and process exit before reporting a
clean shutdown to `GatewaySession`.
## Command Correlation
Each command gets:
- gateway correlation id,
- method name,
- start timestamp,
- timeout deadline,
- caller cancellation token,
- reply completion source.
Pending command handling:
- Add the pending entry before writing the command.
- Remove it exactly once when reply, timeout, cancellation, or session fault
occurs.
- If a late reply arrives after cancellation or timeout, log it with the
correlation id and discard it.
- If the session faults, complete all pending commands with a structured fault.
Timeouts should not assume the COM call stopped. A timed-out command may still
finish inside the worker.
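
The "remove exactly once" rule falls out naturally when every completion path goes through an atomic dictionary removal. The sketch below assumes a `ConcurrentDictionary` keyed by a numeric correlation id; `PendingCommandTable` and its members are hypothetical names, and the real correlation id type may differ.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical pending-command table for one worker connection.
public sealed class PendingCommandTable<TReply>
{
    private readonly ConcurrentDictionary<ulong, TaskCompletionSource<TReply>> _pending = new();
    private long _nextId;

    // Call before writing the command to the pipe so a fast reply always
    // finds its entry.
    public (ulong CorrelationId, Task<TReply> Reply) Register()
    {
        ulong id = (ulong)Interlocked.Increment(ref _nextId);
        var source = new TaskCompletionSource<TReply>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[id] = source;
        return (id, source.Task);
    }

    // Returns false for a late reply whose entry is already gone; the caller
    // logs the correlation id and discards the reply.
    public bool TryComplete(ulong correlationId, TReply reply) =>
        _pending.TryRemove(correlationId, out var source) && source.TrySetResult(reply);

    // Used for timeout, caller cancellation, or a session fault.
    public bool TryAbandon(ulong correlationId, Exception reason) =>
        _pending.TryRemove(correlationId, out var source) && source.TrySetException(reason);

    public void FaultAll(Exception fault)
    {
        foreach (var id in _pending.Keys)
            TryAbandon(id, fault);
    }
}
```

Because `TryRemove` succeeds for only one caller, a reply that races with a timeout resolves to exactly one outcome.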
## Fault Model
Fault categories:
- `StartupFailed`
- `ProtocolMismatch`
- `ProtocolViolation`
- `PipeDisconnected`
- `WorkerExited`
- `HeartbeatExpired`
- `CommandTimeout`
- `WorkerFaulted`
- `GatewayShutdown`
- `AuthorizationFailed`
Public replies should distinguish:
- gRPC transport failure,
- gateway/session failure,
- worker protocol failure,
- MXAccess method failure,
- MXAccess HRESULT/status failure.
Do not hide an MXAccess HRESULT by returning only an RPC error. When MXAccess
was reached and returned status, preserve that status in the command reply.
## Heartbeats And Leases
Use separate concepts:
- worker heartbeat: proves the worker process and pipe loop are alive,
- session lease: proves the client still owns the session,
- command timeout: bounds one command wait,
- startup timeout: bounds worker creation,
- shutdown timeout: bounds graceful stop.
Suggested defaults for early development:
- startup timeout: 30 seconds,
- worker heartbeat interval: 5 seconds,
- heartbeat grace: 15 seconds,
- default command timeout: 30 seconds,
- graceful shutdown timeout: 10 seconds,
- idle session lease: configurable, disabled in local development.
The exact values should be configurable.
## Event Delivery
Events flow:
```text
worker MXAccess event
  -> worker outbound event queue
  -> worker pipe writer
  -> gateway read loop
  -> worker client event queue
  -> EventStreamService bounded stream queue
  -> gRPC StreamEvents
```
The gateway should record:
- worker event sequence,
- gateway receive sequence,
- worker timestamp,
- gateway receive timestamp,
- stream send timestamp if needed for diagnostics.
Default backpressure policy for parity testing should be fail-fast:
1. If the worker client event queue fills, fault the worker client.
2. If the public stream queue fills, fault the gateway session.
3. Preserve the overflow details in logs and metrics.
4. Do not silently drop data-change events.
Do not set a production event-rate target before measurement. `GatewayMetrics`
records received event counts by family, queue depth, stream disconnects, and
overflow counts. Later production modes may support explicit coalescing by item
handle as an opt-in behavior.
The gateway should not synthesize `OperationComplete` from write completion,
command replies, ASB completion queues, or completion-only status frames. Forward
`OperationComplete` only when the worker reports the native MXAccess public
event.
## Security
### Public API
Use API key authentication for v1. Store API keys in a gateway-owned SQLite
database, but store only hashed key secrets. Clients should send keys in gRPC
metadata using:
```text
authorization: Bearer mxgw_<key-id>_<secret>
```
The gateway should split the key into a stable key id and secret component,
load the key record by id, hash the presented secret, and compare using a
constant-time comparison.
`ApiKeyParser` accepts only `authorization: Bearer mxgw_<key-id>_<secret>`.
Malformed headers fail before any database lookup. The parsed raw secret is
kept only long enough for `ApiKeySecretHasher` to compute an HMAC-SHA256 hash
using the configured `Authentication:PepperSecretName` lookup in application
configuration. The raw secret is not stored in the auth database, identity
model, logs, or verification result.
`ApiKeyVerifier` loads the stored key record by key id, rejects revoked keys,
hashes the presented secret, and compares the stored and presented hashes with
`CryptographicOperations.FixedTimeEquals`. A successful verification returns an
`ApiKeyIdentity` with key id, key prefix, display name, and scopes. Failure
results distinguish malformed credentials, missing keys, revoked keys, missing
pepper configuration, and hash mismatch for internal authorization handling.
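
The parse, hash, and compare steps can be illustrated as below. This is a sketch, not the actual `ApiKeyParser`, `ApiKeySecretHasher`, or `ApiKeyVerifier`: `ApiKeySketch` is a hypothetical name, and it assumes the key id contains no underscore so that the first underscore after `mxgw_` separates id from secret.

```csharp
#nullable enable
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper covering header parsing and peppered hashing.
public static class ApiKeySketch
{
    private const string Prefix = "Bearer mxgw_";

    // Rejects malformed headers before any store lookup would happen.
    public static bool TryParse(string? header, out string keyId, out string secret)
    {
        keyId = secret = string.Empty;
        if (header is null || !header.StartsWith(Prefix, StringComparison.Ordinal))
            return false;

        string rest = header.Substring(Prefix.Length);
        int split = rest.IndexOf('_'); // assumption: key ids contain no underscore
        if (split <= 0 || split == rest.Length - 1) return false;

        keyId = rest.Substring(0, split);
        secret = rest.Substring(split + 1);
        return true;
    }

    // HMAC-SHA256 keyed with the pepper; only this hash would be stored.
    public static byte[] HashSecret(string secret, byte[] pepper) =>
        HMACSHA256.HashData(pepper, Encoding.UTF8.GetBytes(secret));

    // Constant-time comparison of the presented and stored hashes.
    public static bool Verify(string presentedSecret, byte[] pepper, byte[] storedHash) =>
        CryptographicOperations.FixedTimeEquals(
            HashSecret(presentedSecret, pepper), storedHash);
}
```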
`GatewayGrpcAuthorizationInterceptor` enforces this authentication model for
public gRPC calls. Missing, malformed, revoked, unknown, or mismatched keys fail
with `Unauthenticated`. Authenticated calls missing the scope required by the
RPC fail with `PermissionDenied`. The interceptor applies to unary calls and
server-streaming calls and stores the authenticated `ApiKeyIdentity` in
`IGatewayRequestIdentityAccessor` for the duration of the request handler.
`Authentication:Mode` set to `Disabled` bypasses API-key verification for local
development only.
Dashboard authentication reuses the API-key verifier and scope model. The
dashboard login endpoint accepts the key in a form post, checks `admin` scope
when `Dashboard:RequireAdminScope` is enabled, and signs in with the
`MxGateway.Dashboard` cookie scheme. The cookie is HTTP-only, secure, strict
SameSite, and scoped with the `__Host-MxGatewayDashboard` name. Logout clears
that cookie. Login and logout posts use anti-forgery validation, and dashboard
API keys are not accepted in query strings. `Dashboard:AllowAnonymousLocalhost`
allows only loopback requests to bypass the dashboard cookie requirement and
defaults to `true`.
Recommended scopes:
- `session:open`
- `session:close`
- `invoke:read`
- `invoke:write`
- `invoke:secure`
- `events:read`
- `metadata:read`
- `admin`
If the gateway is exposed outside the local machine, use TLS. Do not log raw API
keys or raw credential-bearing MXAccess values.
API key administration for v1 should be a local CLI/tool rather than a public
gRPC admin API. It should initialize the auth database, create keys, list keys
without secrets, revoke keys, rotate keys, and print raw secrets only once at
creation.
`ZB.MOM.WW.MxGateway.Server` exposes local API-key administration as an `apikey`
subcommand before the web host starts:
```bash
ZB.MOM.WW.MxGateway.Server apikey init-db --sqlite-path C:\ProgramData\MxGateway\gateway-auth.db
ZB.MOM.WW.MxGateway.Server apikey create-key --key-id operator01 --display-name Operator --scopes session:open,events:read
ZB.MOM.WW.MxGateway.Server apikey list-keys --json
ZB.MOM.WW.MxGateway.Server apikey revoke-key --key-id operator01
ZB.MOM.WW.MxGateway.Server apikey rotate-key --key-id operator01 --json
```
The subcommands accept `--sqlite-path`, `--pepper`, and `--json`. `--pepper`
sets the local `MxGateway:ApiKeyPepper` configuration value for the command
process; deployments should normally provide the pepper through the configured
secret source. `create-key` and `rotate-key` print the full raw API key exactly
once. `list-keys` never prints raw secrets or `secret_hash` values.
SQLite auth storage should use startup migrations with a `schema_version` table.
Migrations should run inside transactions and fail startup if the database
schema is newer than the running binary understands.
The v1 auth store uses `Microsoft.Data.Sqlite` and creates the
`schema_version`, `api_keys`, and `api_key_audit` tables through
`SqliteAuthStoreMigrator`. `AuthStoreMigrationHostedService` runs those
migrations at gateway startup when API-key authentication and
`Authentication:RunMigrationsOnStartup` are enabled. A database with a newer
schema version fails startup instead of being modified by an older gateway
binary.
`IApiKeyStore` reads stored key records and exposes an active-key lookup that
excludes rows with `revoked_utc` set. Hash verification belongs to the API-key
hashing layer, but the store preserves the `secret_hash` bytes, display name,
scopes, timestamps, and revocation state needed by that layer.
`IApiKeyAuditStore` appends audit events to `api_key_audit` and returns recent
events for diagnostics and future administrative tools. Audit records store key
ids and event metadata only; they do not store raw API key secrets.
Commands requiring authorization:
- writes,
- secured writes,
- authentication commands,
- worker shutdown diagnostics,
- metadata queries if they expose sensitive plant structure.
Current gRPC scope mapping:
- `OpenSession` requires `session:open`.
- `CloseSession` requires `session:close`.
- `StreamEvents` and `DrainEvents` require `events:read`.
- read-style MXAccess commands such as `Register`, `AddItem`, `Advise`, and
`Ping` require `invoke:read`.
- `Write` and `Write2` require `invoke:write`.
- `WriteSecured`, `WriteSecured2`, and `AuthenticateUser` require
`invoke:secure`.
- metadata commands such as `ArchestrAUserToId`, `GetSessionState`, and
`GetWorkerInfo` require `metadata:read`.
- `ShutdownWorker` requires `admin`.
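
The mapping above can be expressed as a lookup table. The sketch lists only the operations named in this section and denies anything it does not know, which is an assumption of the example rather than documented behavior. `GatewayScopeMap` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical table from RPC or command name to its required scope.
public static class GatewayScopeMap
{
    private static readonly Dictionary<string, string> RequiredScopes =
        new(StringComparer.Ordinal)
        {
            ["OpenSession"] = "session:open",
            ["CloseSession"] = "session:close",
            ["StreamEvents"] = "events:read",
            ["DrainEvents"] = "events:read",
            ["Register"] = "invoke:read",
            ["AddItem"] = "invoke:read",
            ["Advise"] = "invoke:read",
            ["Ping"] = "invoke:read",
            ["Write"] = "invoke:write",
            ["Write2"] = "invoke:write",
            ["WriteSecured"] = "invoke:secure",
            ["WriteSecured2"] = "invoke:secure",
            ["AuthenticateUser"] = "invoke:secure",
            ["ArchestrAUserToId"] = "metadata:read",
            ["GetSessionState"] = "metadata:read",
            ["GetWorkerInfo"] = "metadata:read",
            ["ShutdownWorker"] = "admin",
        };

    // Unknown operations are denied instead of falling back to a default scope.
    public static bool IsAllowed(string operation, ICollection<string> grantedScopes) =>
        RequiredScopes.TryGetValue(operation, out var required)
        && grantedScopes.Contains(required);
}
```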
### Worker IPC
Named pipes should be local only. Pipe ACLs should restrict access to:
- the gateway process identity,
- the launched worker identity,
- administrators only when operationally required.
The worker must validate `GatewayHello` and the nonce before creating MXAccess.
## Observability
Use structured logs with these fields where applicable:
- session id,
- client identity,
- worker process id,
- pipe name hash or suffix,
- protocol version,
- correlation id,
- command method,
- MXAccess HRESULT,
- MXAccess status summary,
- event family,
- event sequence,
- queue depth,
- elapsed milliseconds.
Metrics:
- open sessions,
- workers running,
- worker startup latency,
- command latency by method,
- command failures by method and category,
- event rate by session and family,
- event queue depth,
- worker exits by reason,
- worker kills,
- heartbeat failures,
- gRPC stream disconnects.
Do not log credential values or full tag values by default.
The gateway registers `GatewayMetrics` as the in-process metrics foundation.
It emits .NET `Meter` instruments for collectors and keeps a
`GatewayMetricsSnapshot` for dashboard projection. The snapshot exists because
the dashboard needs current counters and queue depths without depending on a
specific metrics exporter.
Event metrics use low-cardinality tags such as event family. Per-session event
counts are kept only in the in-process snapshot for active dashboard sessions
and are purged when the session is removed. Worker event queue depth and gRPC
event stream queue depth are reported as separate gauges.
HTTP request handling uses `UseGatewayRequestLoggingScope()` to attach common
structured log fields when request metadata is present:
- `SessionId`,
- `ClientIdentity`,
- `WorkerProcessId`,
- `CorrelationId`,
- `CommandMethod`.
`GatewayLogRedactor` redacts API key secrets and command values before they are
added to log state. Value logging remains opt-in and redacted by default so
secured writes, authentication commands, and ordinary tag values do not leak
through diagnostics.
## Configuration
Suggested configuration shape:
```json
{
  "MxGateway": {
    "Authentication": {
      "Mode": "ApiKey",
      "SqlitePath": "C:\\ProgramData\\MxGateway\\gateway-auth.db",
      "PepperSecretName": "MxGateway:ApiKeyPepper",
      "RunMigrationsOnStartup": true
    },
    "Worker": {
      "ExecutablePath": "src/ZB.MOM.WW.MxGateway.Worker/bin/x86/Release/ZB.MOM.WW.MxGateway.Worker.exe",
      "WorkingDirectory": null,
      "RequiredArchitecture": "X86",
      "StartupTimeoutSeconds": 30,
      "StartupProbeRetryAttempts": 3,
      "StartupProbeRetryDelayMilliseconds": 250,
      "PipeConnectAttemptTimeoutMilliseconds": 2000,
      "ShutdownTimeoutSeconds": 10,
      "HeartbeatIntervalSeconds": 5,
      "HeartbeatGraceSeconds": 15,
      "MaxMessageBytes": 16777216
    },
    "Sessions": {
      "DefaultCommandTimeoutSeconds": 30,
      "MaxSessions": 64,
      "MaxPendingCommandsPerSession": 128,
      "AllowMultipleEventSubscribers": false
    },
    "Events": {
      "QueueCapacity": 10000,
      "BackpressurePolicy": "FailFast"
    },
    "Dashboard": {
      "Enabled": true,
      "PathBase": "/dashboard",
      "RequireAdminScope": true,
      "AllowAnonymousLocalhost": true,
      "SnapshotIntervalMilliseconds": 1000,
      "RecentFaultLimit": 100,
      "RecentSessionLimit": 200,
      "ShowTagValues": false
    },
    "Protocol": {
      "WorkerProtocolVersion": 1
    }
  }
}
```
Do not scatter connection or path constants through implementation code.
`ZB.MOM.WW.MxGateway.Server` binds this section to `GatewayOptions` at startup and
registers validation with `ValidateOnStart()`. Startup fails before the gateway
begins serving traffic when required authentication settings are missing,
timeouts or queue sizes are not positive, dashboard settings are malformed, or
the configured worker protocol version does not match the contract version.
The gateway exposes read-only effective settings through
`IGatewayConfigurationProvider`. This projection is for dashboard settings and
diagnostics, so it redacts secret-related fields such as
`Authentication:PepperSecretName` and does not include raw API keys or key
material.
The complete option reference, including defaults and validation rules, is in
[Gateway Configuration](./GatewayConfiguration.md).
## Galaxy Repository Metadata
Galaxy hierarchy and tag metadata can be discovered through SQL Server when
needed for browse or diagnostics. The current notes live outside this repo at:
```text
C:\Users\dohertj2\Desktop\lmxopcua\gr
```
Use SQL metadata as discovery data. It does not replace MXAccess-backed runtime
behavior unless an explicit non-parity backend is designed.
## Testing Strategy
Gateway tests should be able to run without installed MXAccess by using fake
workers and fake transports.
Use `FakeWorkerHarness` for tests that need real gateway-to-worker framing,
handshake, command, event, fault, or malformed-protocol behavior without loading
MXAccess COM. See [Gateway Testing](./GatewayTesting.md) for the harness scope
and focused test commands.
Focused tests:
- session state transitions,
- gRPC API-key authentication for unary and streaming calls,
- gRPC scope mapping for sessions, invokes, events, metadata, and admin
commands,
- worker startup failures,
- protocol version mismatch,
- malformed frame handling,
- pending command completion,
- command timeout and late reply handling,
- worker crash handling,
- event ordering,
- event queue overflow,
- `CloseSession` idempotency,
- gRPC mapping for command replies and faults,
- dashboard snapshot projection,
- dashboard auth decisions,
- dashboard redaction,
- dashboard realtime subscription disposal.
Integration tests with the real worker should be separated from unit tests and
clearly marked because they require Windows, .NET Framework worker output, and
eventually installed MXAccess COM.
## Initial Implementation Slice
The first gateway slice should implement:
1. Host startup and configuration binding.
2. SQLite auth database initialization and migrations.
3. Local API-key administration CLI/tool.
4. API-key authentication and scope checks.
5. `OpenSession`.
6. Worker process launch.
7. Named-pipe handshake.
8. `Invoke` for `Register`, `AddItem`, and `Advise`.
9. `StreamEvents` with one subscriber per session.
10. `CloseSession`.
11. Worker crash and startup failure handling.
12. Event-rate, queue-depth, and overflow metrics.
13. Blazor Server dashboard with Bootstrap assets.
14. Dashboard home, sessions, and workers pages.
15. Dashboard realtime snapshot refresh.
16. Dashboard API-key login with admin-scope check.
17. Basic structured logs.
This proves the process model before the full command surface is implemented.
## Related Documentation
- [MXAccess Worker Instance Detailed Design](./MxAccessWorkerInstanceDesign.md)
- [Worker Frame Protocol](./WorkerFrameProtocol.md)
- [Worker Process Launcher](./WorkerProcessLauncher.md)
- [Gateway Configuration](./GatewayConfiguration.md)
- [Sessions](./Sessions.md)
- [gRPC](./Grpc.md)
- [Authentication](./Authentication.md)
- [Authorization](./Authorization.md)
- [Metrics](./Metrics.md)
- [Diagnostics](./Diagnostics.md)
- [Gateway Testing](./GatewayTesting.md)