d692232191
EventsHub publisher (closes the v2.1 follow-up flagged in the previous commit)
EventStreamService now mirrors every MxEvent it forwards to a gRPC client
into the `EventsHub` group for the session. The fan-out goes through a new
singleton `IDashboardEventBroadcaster`:
* IDashboardEventBroadcaster — abstraction so EventStreamService doesn't
take a direct dependency on SignalR.
* DashboardEventBroadcaster — singleton implementation that hands the
SendAsync to IHubContext<EventsHub> as fire-and-forget. Errors are
logged at debug and dropped so the source gRPC stream is never
blocked.
EventStreamService now takes IDashboardEventBroadcaster as a ctor parameter
and calls Publish(sessionId, publicEvent) once per event after sequence
filtering, before the bounded queue write. Test fixtures and the live
integration harness pass NullDashboardEventBroadcaster.Instance so the
broadcaster is a no-op in unit tests.
SessionDetailsPage adds a "Recent events" panel:
* implements IAsyncDisposable
* opens a second HubConnection via DashboardHubConnectionFactory targeting
/hubs/events
* calls SubscribeSession(SessionId) on Start
* renders the most recent 50 events in a small table (worker seq, family,
server/item handle, alarm reference when the event is OnAlarmTransition)
* shows a live/offline conn-pill driven by HubConnection.Closed /
Reconnected events
The dashboard mirror is intentionally passive — events appear only while a
gRPC client is also consuming that session's events. Documented as such in
the empty-state copy and in GatewayDashboardDesign.md.
Documentation refresh
Every doc that referenced the retired options (PathBase, RequireAdminScope,
RequiredGroup) and the old API-key-cookie auth flow is updated to describe
the new model:
* CLAUDE.md — Authentication section now explains LDAP bind +
GroupToRole + HubToken bearer flow.
* gateway.md — Dashboard section: root-mounted routes, snapshot/alarms/
events SignalR hubs, LDAP cookie + bearer scheme.
* docs/GatewayConfiguration.md — drop PathBase / RequireAdminScope rows,
add GroupToRole row, append "Authorization policies" and "SignalR hubs"
subsections describing the three policies and the /hubs/* endpoints.
* docs/GatewayDashboardDesign.md — hosting model (root mount, new
endpoint layout), Realtime Updates rewritten as a hub table
(DashboardSnapshotHub / AlarmsHub / EventsHub with producers, payloads,
and routing), Authentication And Authorization rewritten around LDAP +
role mapping + the hub bearer flow, Configuration block updated.
* docs/GatewayProcessDesign.md — security-section dashboard paragraph
and the example config block both refreshed to LDAP/role auth.
* docs/ImplementationPlanGateway.md — dashboard-auth deliverable list
updated (LDAP bind + GroupToRole + /hubs/token bearer mint replace the
API-key login flow).
* docs/GatewayTesting.md — DashboardLdapLiveTests blurb describes the
GroupToRole fixture (`{ GwAdmin: Admin }`) instead of the retired
RequiredGroup default; success-path assertion explains the role-claim
check.
Verification: 475 server tests, 275 worker tests (+ 9 dev-rig skips), 18
integration tests (live MxAccess + LDAP + Galaxy) all pass — including the
live worker smoke test fixture that now constructs EventStreamService with
the new broadcaster parameter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
996 lines
32 KiB
Markdown
# Gateway Process Detailed Design

## Purpose

The gateway process is the only public network-facing component. It exposes the
modern API, owns session lifecycle, launches and supervises MXAccess worker
processes, and moves commands and events between clients and the worker that
owns each session.

The gateway must not instantiate MXAccess COM, import MXAccess interop types, or
depend on an STA message pump. The installed MXAccess COM component is isolated
behind the worker process boundary.

## Runtime

- Target runtime: .NET 10.
- Language: C#.
- Preferred process architecture: x64.
- Hosting: ASP.NET Core gRPC.
- Web UI: Blazor Server dashboard with Bootstrap CSS/JS.
- Operating system: Windows.
- Public transport: TCP gRPC.
- Internal worker transport: named pipes with protobuf-framed messages.

Style guides:

- [C# Style Guide](./style-guides/CSharpStyleGuide.md)
- [Protobuf Style Guide](./style-guides/ProtobufStyleGuide.md)

## Responsibilities

The gateway owns:

- public gRPC service endpoints,
- Blazor Server dashboard endpoints,
- optional authentication and authorization,
- session id allocation,
- worker executable selection,
- named-pipe server creation,
- worker process launch,
- gateway/worker handshake,
- command correlation and timeout handling,
- event fan-out to client streams,
- session lease and heartbeat enforcement,
- worker crash and hang detection,
- metrics and structured logging,
- graceful service shutdown.

The gateway does not own:

- MXAccess COM object creation,
- MXAccess method dispatch,
- MXAccess event subscription,
- MXAccess handle generation,
- COM value conversion from native `VARIANT` values.

Those belong to the worker.

## High-Level Components

```text
ZB.MOM.WW.MxGateway.Server
  Program / Host
  Configuration
  Grpc
    MxAccessGatewayService
    MxAccessGrpcRequestValidator
    MxAccessGrpcMapper
  Dashboard
    Pages
    Components
    DashboardSnapshotService
    DashboardAuthorization
  Sessions
    SessionManager
    GatewaySession
    SessionRegistry
    SessionLeaseMonitor
  Workers
    WorkerProcessLauncher
    WorkerClient
    WorkerPipeTransport
    WorkerProtocolReader
    WorkerProtocolWriter
    WorkerWatchdog
  Security
    ClientIdentityResolver
    CommandAuthorization
  Metrics
    GatewayMetrics
  Diagnostics
    HealthChecks
```

## Public gRPC Surface

Start with unary commands plus an event stream:

```protobuf
service MxAccessGateway {
  rpc OpenSession(OpenSessionRequest) returns (OpenSessionReply);
  rpc CloseSession(CloseSessionRequest) returns (CloseSessionReply);
  rpc Invoke(MxCommandRequest) returns (MxCommandReply);
  rpc StreamEvents(StreamEventsRequest) returns (stream MxEvent);
}
```

`MxAccessGatewayService` implements these public RPCs in the gateway process.
It validates public requests with `MxAccessGrpcRequestValidator`, delegates
session lifecycle and command routing to `ISessionManager`, and maps worker
command replies and events through `MxAccessGrpcMapper`. Session lookup,
validation, and worker transport failures become gRPC status errors. MXAccess
method replies that reached the worker remain `MxCommandReply` payloads so
HRESULT values, status arrays, and method-specific reply fields survive
transport boundaries.

Add this later only after the command and event model is stable:

```protobuf
rpc Session(stream ClientMessage) returns (stream ServerMessage);
```

### OpenSession

`OpenSession` creates one gateway session and one worker process by default.

Inputs should include:

- requested backend, defaulting to `mxaccess-worker`,
- optional client session name,
- optional client correlation id,
- optional timeout policy,
- optional event backpressure policy,
- optional metadata discovery options.

Outputs should include:

- session id,
- backend name,
- worker process id when available,
- protocol version,
- server capabilities,
- default timeout values.

Behavior:

1. Resolve and authorize the client identity.
2. Allocate a session id.
3. Build a pipe name and random handshake nonce.
4. Create a named-pipe server with restrictive local ACLs.
5. Launch the worker executable with session bootstrap data.
6. Accept the pipe connection within the startup timeout.
7. Exchange `GatewayHello` and `WorkerHello`.
8. Wait for `WorkerReady`.
9. Register the session as ready.
10. Return the session details.

If any step fails, clean up all resources. Kill the worker if it was launched
and did not shut down on its own.

### CloseSession

`CloseSession` attempts graceful shutdown and then enforces a kill timeout.

Behavior:

1. Mark the session closing.
2. Stop accepting new commands.
3. Notify event streams of terminal session close.
4. Send `WorkerShutdown` when the pipe is still connected.
5. Wait for worker exit up to the configured timeout.
6. Kill the worker process if it remains alive.
7. Remove the session from the registry.

`CloseSession` should be idempotent. Closing an already closed session should
return a successful close result with the final known state.

`WorkerClient.ShutdownAsync` sends `WorkerShutdown`, waits for the worker read,
write, and heartbeat loops to stop, and waits for the launched worker process to
exit within the same shutdown timeout. If the pipe loops or process exit exceed
the timeout, the close operation fails with `ShutdownTimeout`; `GatewaySession`
then kills the worker process tree before surfacing the close failure.

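The close path above can be sketched as follows. This is a minimal illustration, not the actual `WorkerClient.ShutdownAsync` or `GatewaySession` code: the helper name `CloseWorkerAsync` is hypothetical, and sending `WorkerShutdown`, waiting for exit, and killing the process tree are passed in as delegates so the timeout logic stands on its own.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class WorkerShutdownSketch
{
    // Hypothetical helper: send WorkerShutdown, wait for the worker to exit within
    // the shutdown timeout, and kill the process tree if it is still alive.
    // Returns true for a clean shutdown, false when the kill path was taken.
    public static async Task<bool> CloseWorkerAsync(
        Func<CancellationToken, Task> sendShutdownAsync,
        Func<CancellationToken, Task> waitForExitAsync,
        Action killProcessTree,
        TimeSpan shutdownTimeout)
    {
        using var timeout = new CancellationTokenSource(shutdownTimeout);
        try
        {
            await sendShutdownAsync(timeout.Token);
            await waitForExitAsync(timeout.Token);
            return true;
        }
        catch (OperationCanceledException) when (timeout.IsCancellationRequested)
        {
            // Equivalent of ShutdownTimeout: the worker did not exit in time.
            killProcessTree();
            return false;
        }
    }
}
```

With a real `System.Diagnostics.Process`, the last two delegates would typically be `process.WaitForExitAsync` and `() => process.Kill(entireProcessTree: true)`; a `false` result corresponds to the `ShutdownTimeout` failure that is surfaced after the kill.
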
### Invoke

`Invoke` forwards one MXAccess command to the worker that owns the session.

Behavior:

1. Validate the session id.
2. Check that the session state is `Ready`.
3. Validate the method-specific payload.
4. Authorize the command, especially writes and credential-bearing commands.
5. Assign a gateway correlation id.
6. Write `WorkerCommand` to the worker pipe.
7. Await the correlated `WorkerCommandReply`.
8. Map the worker reply to the public `MxCommandReply`.

Request cancellation stops waiting in the gateway. It does not abort an
in-flight COM call. If the command must be hard-canceled, kill the worker and
fault the session.

### StreamEvents

`StreamEvents` streams events for one session.

The initial implementation allows one active stream subscriber per session. A
second subscriber should be rejected with a clear session error. If multiple
subscribers are later supported, they must have independent backpressure
accounting and a clear fan-out policy.

Behavior:

1. Validate the session id and authorize event access.
2. Attach the single active subscriber lease for the session.
3. Read worker events into a bounded public stream queue.
4. Send events in worker sequence order.
5. Stop on client cancellation, session close, or session fault.
6. Emit a terminal status when the session faults if gRPC status alone cannot
   preserve the required details.

`EventStreamService` owns subscriber tracking and public stream backpressure.
The default policy allows one active subscriber per session. A second subscriber
is rejected with `EventSubscriberAlreadyActive`. Stream cancellation releases
the subscriber lease so a later stream can attach to the session.

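A sketch of the one-subscriber-per-session lease, assuming leases are tracked in a process-wide dictionary keyed by session id. `SubscriberLeases` and `TryAcquire` are hypothetical names; a `null` result is where the caller would raise `EventSubscriberAlreadyActive`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch: at most one active event-stream subscriber per session.
public sealed class SubscriberLeases
{
    private readonly ConcurrentDictionary<string, object> _active = new();

    // Returns a lease, or null when another subscriber already holds the session.
    public IDisposable? TryAcquire(string sessionId)
    {
        var token = new object();
        return _active.TryAdd(sessionId, token) ? new Lease(this, sessionId, token) : null;
    }

    private sealed class Lease : IDisposable
    {
        private readonly SubscriberLeases _owner;
        private readonly string _sessionId;
        private object? _token;

        public Lease(SubscriberLeases owner, string sessionId, object token)
            => (_owner, _sessionId, _token) = (owner, sessionId, token);

        // Releases only this lease's own entry, and only once, so a late or
        // repeated Dispose cannot evict a newer subscriber.
        public void Dispose()
        {
            var token = Interlocked.Exchange(ref _token, null);
            if (token is not null)
            {
                _owner._active.TryRemove(new KeyValuePair<string, object>(_sessionId, token));
            }
        }
    }
}
```

Disposing the lease when the stream is canceled frees the session for a later stream, matching the release rule above.
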
The gateway must not reorder events from one worker. `EventStreamService` writes
mapped events to a bounded first-in, first-out queue and faults the session with
`EventQueueOverflow` if the queue fills. The gateway does not synthesize
`OperationComplete`; it forwards that family only when the worker reports a
native MXAccess `OperationComplete` event.

## Web Dashboard

The gateway hosts a basic Blazor Server dashboard for operators and developers.
The dashboard is read-only for v1 and should show current gateway/session/worker
state plus basic metrics.

Technology:

- Blazor Server,
- Bootstrap CSS,
- Bootstrap JavaScript,
- no MudBlazor,
- no other Blazor client component libraries.

Suggested routes:

```text
/dashboard
/dashboard/sessions
/dashboard/sessions/{sessionId}
/dashboard/workers
/dashboard/events
/dashboard/settings
```

Dashboard pages:

- home: gateway status, uptime, session count, worker count, command rate,
  event rate, queue depth, recent faults,
- sessions: active/recent session table,
- session details: one session's worker, heartbeat, counters, queues, and fault
  summary,
- workers: worker process table and heartbeat details,
- events: aggregate event counters and rates,
- settings: read-only effective configuration with secrets redacted.

Realtime updates should use Blazor Server component updates from a read-only
snapshot service. Components should subscribe to snapshots and call
`StateHasChanged` through `InvokeAsync`. Do not stream every MXAccess event to
the dashboard; aggregate event rates and counters instead.

Suggested service shape:

```csharp
public interface IDashboardSnapshotService
{
    DashboardSnapshot GetSnapshot();
    IAsyncEnumerable<DashboardSnapshot> WatchSnapshotsAsync(
        CancellationToken cancellationToken);
}
```

Default refresh policy:

- immediate update on session create, close, or fault,
- immediate update on worker fault,
- periodic metrics refresh every 1 second,
- event-rate windows updated every 1 second.

Dashboard access requires LDAP-backed authentication with role mapping when
enabled. A simple `/login` form binds against the configured directory, maps
the user's groups to `Admin` or `Viewer` via
`MxGateway:Dashboard:GroupToRole`, and issues an HTTP-only secure cookie.
SignalR hub connections accept either that cookie or a short-lived
data-protected bearer minted at `/hubs/token`. Anonymous localhost access is
gated by `MxGateway:Dashboard:AllowAnonymousLocalhost` (defaults to true for
local development; remote requests always require auth).

## Session State Machine

```text
Creating
  -> StartingWorker
  -> WaitingForPipe
  -> Handshaking
  -> InitializingWorker
  -> Ready
  -> Closing
  -> Closed

Any non-terminal state
  -> Faulted

Faulted
  -> Closed
```

### State rules

- `Creating`: session id and in-memory state exist, but no worker has launched.
- `StartingWorker`: worker process launch is in progress.
- `WaitingForPipe`: gateway is waiting for the worker to connect to the pipe.
- `Handshaking`: pipe is connected and protocol hello is being verified.
- `InitializingWorker`: worker is connected but has not reported MXAccess ready.
- `Ready`: commands and event streams may run.
- `Closing`: graceful shutdown is in progress.
- `Closed`: resources are released.
- `Faulted`: a non-graceful terminal fault occurred and must be reported to
  callers before resources are released.

Only `Ready` sessions accept new commands.

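The transition rules can be encoded as a small lookup, sketched below. The enum mirrors the diagram; `SessionTransitions` is a hypothetical helper, and it assumes `Closed` and `Faulted` are the two states that may not transition to `Faulted` again.

```csharp
public enum SessionState
{
    Creating, StartingWorker, WaitingForPipe, Handshaking,
    InitializingWorker, Ready, Closing, Closed, Faulted,
}

public static class SessionTransitions
{
    // Linear happy path, Faulted reachable from any non-terminal state,
    // and Faulted -> Closed once resources are released.
    public static bool IsAllowed(SessionState from, SessionState to)
    {
        if (to == SessionState.Faulted)
        {
            return from is not (SessionState.Closed or SessionState.Faulted);
        }

        return (from, to) switch
        {
            (SessionState.Creating, SessionState.StartingWorker) => true,
            (SessionState.StartingWorker, SessionState.WaitingForPipe) => true,
            (SessionState.WaitingForPipe, SessionState.Handshaking) => true,
            (SessionState.Handshaking, SessionState.InitializingWorker) => true,
            (SessionState.InitializingWorker, SessionState.Ready) => true,
            (SessionState.Ready, SessionState.Closing) => true,
            (SessionState.Closing, SessionState.Closed) => true,
            (SessionState.Faulted, SessionState.Closed) => true,
            _ => false,
        };
    }

    // Only Ready sessions accept new commands.
    public static bool AcceptsCommands(SessionState state) => state == SessionState.Ready;
}
```
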
## Session Model

Gateway session state should include:

- session id,
- client identity,
- backend name,
- worker process id,
- worker executable path and version,
- pipe name,
- pipe connection state,
- open time,
- last client activity time,
- last worker heartbeat time,
- lease expiration,
- command timeout policy,
- startup timeout policy,
- shutdown timeout policy,
- event queue metrics,
- active event stream count,
- final fault if any.

The worker remains authoritative for MXAccess handles. The gateway may keep a
shadow state for diagnostics, but it must not invent, rewrite, or recycle
MXAccess handles.

`SessionManager` owns the current in-memory session registry. It allocates a
session id, creates the worker pipe name and nonce, registers the session before
worker startup, and removes the session if startup fails. A successful
`OpenSession` attaches the ready `IWorkerClient` and transitions the session to
`Ready`.

Only `Ready` sessions accept command and event operations. `CloseSession` shuts
down the worker, disposes the worker client, and removes the session from the
registry so closed sessions do not retain pipe or process handles. A later close
for the same id returns `SessionNotFound`. Lease handling is exposed as a
session hook so a monitor can close expired sessions without embedding lease
policy in the worker client. Gateway shutdown walks the registry, closes each
known session, and kills a worker if graceful shutdown fails.

## Worker Launch

The gateway should launch the worker using explicit configuration:

- worker executable path,
- worker working directory,
- worker architecture requirement,
- protocol version,
- startup timeout,
- environment variables,
- optional restricted user identity.

Command-line arguments should include only non-secret bootstrap values:

```text
--session-id <sessionId>
--pipe-name <pipeName>
--protocol-version <version>
```

Prefer passing the handshake nonce through the inherited environment or another
protected local mechanism instead of the command line when possible.

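A minimal sketch of the launch arguments, assuming `System.Diagnostics.ProcessStartInfo` is used to start the worker. `WorkerLaunchSketch` and its methods are hypothetical; the pipe-name format (from the Worker IPC section) and the `MXGATEWAY_WORKER_NONCE` variable are the ones this document describes, while the 32-byte nonce length is an assumption.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;
using System.Security.Cryptography;

public static class WorkerLaunchSketch
{
    // Pipe name format: mxaccess-gateway-{gatewayProcessId}-{sessionId}.
    public static string NewPipeName(string sessionId) =>
        $"mxaccess-gateway-{Environment.ProcessId}-{sessionId}";

    // Random per-session handshake nonce (32 bytes, hex-encoded).
    public static string NewNonce() => Convert.ToHexString(RandomNumberGenerator.GetBytes(32));

    // Only non-secret bootstrap values go on the command line; the nonce is
    // placed in the child process environment instead.
    public static ProcessStartInfo BuildStartInfo(
        string workerPath, string workingDirectory, string sessionId,
        string pipeName, int protocolVersion, string nonce)
    {
        var startInfo = new ProcessStartInfo(workerPath)
        {
            WorkingDirectory = workingDirectory,
            UseShellExecute = false, // required for Environment to take effect
            CreateNoWindow = true,
        };
        startInfo.ArgumentList.Add("--session-id");
        startInfo.ArgumentList.Add(sessionId);
        startInfo.ArgumentList.Add("--pipe-name");
        startInfo.ArgumentList.Add(pipeName);
        startInfo.ArgumentList.Add("--protocol-version");
        startInfo.ArgumentList.Add(protocolVersion.ToString(CultureInfo.InvariantCulture));
        startInfo.Environment["MXGATEWAY_WORKER_NONCE"] = nonce;
        return startInfo;
    }
}
```

`ArgumentList` quotes each argument individually, and because the nonce exists only in the child environment, the command line can be logged as-is.
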
Before launch, validate:

- worker executable exists,
- worker path is under the configured install directory,
- worker file version or product version is acceptable,
- worker is expected to be x86.

`WorkerProcessLauncher` implements the first validation layer now: it resolves
the worker executable path, requires a `.exe`, validates the Windows Portable
Executable header, and verifies the configured processor architecture. It passes
only `--session-id`, `--pipe-name`, and `--protocol-version` on the command
line. The per-session nonce is set through `MXGATEWAY_WORKER_NONCE` so the
command line remains safe to log. Startup failures and startup timeouts kill and
dispose the worker process and the pre-created pipe reservation before the
session manager observes the failure.

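The Portable Executable check can be done by reading two header fields, sketched below: the DOS header's `e_lfanew` offset at `0x3C`, and the COFF `Machine` field that follows the `PE\0\0` signature. `PeMachineSketch` is a hypothetical helper, not the actual `WorkerProcessLauncher` code; `0x014C` (i386) is the value an x86 worker image would be expected to carry.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class PeMachineSketch
{
    public const ushort MachineI386 = 0x014C;
    public const ushort MachineAmd64 = 0x8664;

    // Returns the COFF Machine field, or null when the stream is not a PE image.
    // The stream must be seekable (a FileStream or MemoryStream).
    public static ushort? ReadMachine(Stream stream)
    {
        stream.Position = 0;
        var dos = new byte[64];
        if (stream.ReadAtLeast(dos, dos.Length, throwOnEndOfStream: false) < dos.Length
            || dos[0] != (byte)'M' || dos[1] != (byte)'Z')
        {
            return null;
        }

        // e_lfanew (offset 0x3C) points at the "PE\0\0" signature.
        var peOffset = BinaryPrimitives.ReadInt32LittleEndian(dos.AsSpan(0x3C, 4));
        if (peOffset < 0 || peOffset > stream.Length - 6)
        {
            return null;
        }

        stream.Position = peOffset;
        var header = new byte[6];
        if (stream.ReadAtLeast(header, header.Length, throwOnEndOfStream: false) < header.Length
            || header[0] != (byte)'P' || header[1] != (byte)'E' || header[2] != 0 || header[3] != 0)
        {
            return null;
        }

        // Machine is the first field of the COFF file header.
        return BinaryPrimitives.ReadUInt16LittleEndian(header.AsSpan(4, 2));
    }
}
```

A caller would open the resolved `.exe` with `File.OpenRead` and reject the launch unless `ReadMachine` returns `MachineI386`.
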
## Worker IPC

The gateway creates the pipe server before launching the worker.

Pipe name:

```text
mxaccess-gateway-{gatewayProcessId}-{sessionId}
```

Message framing:

```text
uint32 little-endian payload_length
payload_length bytes protobuf WorkerEnvelope
```

Recommended size limits:

- default max message size: 16 MiB,
- configurable upper bound for large arrays,
- reject zero-length payloads,
- reject payloads larger than configured maximum before allocation.

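The framing and size rules above fit in a small codec, sketched here under the assumption that the payload is an already-serialized `WorkerEnvelope` (for example, protobuf bytes). `FrameCodec` is a hypothetical name; the key point is that the length prefix is checked against the limit before the payload buffer is allocated.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class FrameCodec
{
    public const int DefaultMaxMessageSize = 16 * 1024 * 1024; // 16 MiB

    // Writes one frame: uint32 little-endian length, then the payload bytes.
    public static async Task WriteFrameAsync(
        Stream stream, ReadOnlyMemory<byte> payload, int maxMessageSize, CancellationToken ct)
    {
        if (payload.Length == 0 || payload.Length > maxMessageSize)
        {
            throw new InvalidDataException($"Frame length {payload.Length} is outside 1..{maxMessageSize}.");
        }

        var prefix = new byte[4];
        BinaryPrimitives.WriteUInt32LittleEndian(prefix, (uint)payload.Length);
        await stream.WriteAsync(prefix, ct);
        await stream.WriteAsync(payload, ct);
    }

    // Reads one frame, or returns null on a clean end of stream before a prefix.
    public static async Task<byte[]?> ReadFrameAsync(Stream stream, int maxMessageSize, CancellationToken ct)
    {
        var prefix = new byte[4];
        var read = await stream.ReadAtLeastAsync(prefix, prefix.Length, throwOnEndOfStream: false, cancellationToken: ct);
        if (read == 0)
        {
            return null;
        }
        if (read < prefix.Length)
        {
            throw new EndOfStreamException("Truncated frame length prefix.");
        }

        // Validate the declared length before allocating anything for it.
        var length = BinaryPrimitives.ReadUInt32LittleEndian(prefix);
        if (length == 0 || length > (uint)maxMessageSize)
        {
            throw new InvalidDataException($"Frame length {length} is outside 1..{maxMessageSize}.");
        }

        var payload = new byte[length];
        await stream.ReadExactlyAsync(payload, ct);
        return payload;
    }
}
```
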
### Envelope rules

Every message uses `WorkerEnvelope`:

- `protocol_version` must match a supported version.
- `session_id` must match the pipe/session.
- `sequence` is monotonic per sender.
- `correlation_id` links commands and replies.
- events use either zero or their own event correlation id.
- protocol faults do not replace MXAccess HRESULT/status details.

The gateway should treat malformed frames, sequence regressions, and wrong
session ids as protocol faults and close the session.

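A sketch of the envelope header checks, assuming one validator per direction (one per sender) and a strictly increasing `sequence`. `EnvelopeHeader` and `EnvelopeValidator` are hypothetical stand-ins for the generated protobuf type; the returned reasons reuse the `ProtocolMismatch` and `ProtocolViolation` categories from the fault model.

```csharp
using System;

// Hypothetical projection of the WorkerEnvelope header fields.
public readonly record struct EnvelopeHeader(int ProtocolVersion, string SessionId, ulong Sequence);

public sealed class EnvelopeValidator
{
    private readonly int _supportedVersion;
    private readonly string _sessionId;
    private ulong? _lastSequence;

    public EnvelopeValidator(int supportedVersion, string sessionId)
        => (_supportedVersion, _sessionId) = (supportedVersion, sessionId);

    // Returns null when the header is acceptable, otherwise a protocol-fault reason.
    // Rejected headers do not advance the sequence watermark.
    public string? Validate(EnvelopeHeader header)
    {
        if (header.ProtocolVersion != _supportedVersion)
        {
            return "ProtocolMismatch";
        }
        if (!string.Equals(header.SessionId, _sessionId, StringComparison.Ordinal))
        {
            return "ProtocolViolation: wrong session id";
        }
        if (_lastSequence is { } last && header.Sequence <= last)
        {
            return "ProtocolViolation: sequence regression";
        }

        _lastSequence = header.Sequence;
        return null;
    }
}
```

Any non-null result would lead the gateway to fault and close the session, as described above.
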
## WorkerClient Design

`WorkerClient` is the gateway-side object that owns one worker connection.

Current public shape:

```csharp
public interface IWorkerClient : IAsyncDisposable
{
    string SessionId { get; }
    int? ProcessId { get; }
    WorkerClientState State { get; }
    DateTimeOffset LastHeartbeatAt { get; }

    Task StartAsync(CancellationToken cancellationToken);
    Task<WorkerCommandReply> InvokeAsync(
        WorkerCommand command,
        TimeSpan timeout,
        CancellationToken cancellationToken);
    IAsyncEnumerable<WorkerEvent> ReadEventsAsync(
        CancellationToken cancellationToken);
    Task ShutdownAsync(TimeSpan timeout, CancellationToken cancellationToken);
    void Kill(string reason);
}
```

Internally it owns:

- process handle,
- pipe stream,
- read loop,
- write loop,
- outbound command/control channel serialized by the write loop,
- bounded inbound event channel,
- pending command dictionary keyed by correlation id,
- heartbeat monitor,
- terminal fault source.

`StartAsync` sends `GatewayHello`, verifies the `WorkerHello` protocol version
and nonce, waits for `WorkerReady`, and only then exposes `Ready` state. The
read loop starts after readiness so the handshake has a single owner for its
ordered frames.

### Read loop

The read loop:

1. Reads one frame.
2. Parses `WorkerEnvelope`.
3. Validates protocol fields.
4. Dispatches by body type:
   - `WorkerCommandReply`: completes the pending command.
   - `WorkerEvent`: enqueues the event.
   - `WorkerHeartbeat`: updates the heartbeat timestamp.
   - `WorkerFault`: faults the session.
5. Stops when the pipe closes or cancellation is requested.

If the pipe closes while the session is not closing, fault the session.

### Write loop

The write loop serializes all writes to the pipe. No other code should write to
the pipe directly.

It handles:

- `GatewayHello`,
- `WorkerCommand`,
- `WorkerCancel`,
- `WorkerShutdown`,
- gateway heartbeat if used.

The write loop should fail the session if a pipe write fails outside normal
shutdown.

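A sketch of a single-writer loop, assuming frames arrive already length-prefixed and that a `System.Threading.Channels` channel serves as the outbound command/control queue. `PipeWriteLoop` and its members are hypothetical names.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch: every outbound frame passes through one channel with one
// reader, so no two writes can interleave on the pipe stream.
public sealed class PipeWriteLoop
{
    private readonly Channel<byte[]> _outbound =
        Channel.CreateUnbounded<byte[]>(new UnboundedChannelOptions { SingleReader = true });

    public ValueTask EnqueueAsync(byte[] frame, CancellationToken ct) =>
        _outbound.Writer.WriteAsync(frame, ct);

    // Normal shutdown: no more frames will be queued.
    public void Complete() => _outbound.Writer.TryComplete();

    public async Task RunAsync(Stream pipe, Action<Exception> faultSession, CancellationToken ct)
    {
        try
        {
            await foreach (var frame in _outbound.Reader.ReadAllAsync(ct))
            {
                await pipe.WriteAsync(frame, ct);
                await pipe.FlushAsync(ct);
            }
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            // Cancellation is part of normal shutdown, not a fault.
        }
        catch (IOException ex)
        {
            // A pipe write failure outside normal shutdown faults the session.
            faultSession(ex);
        }
    }
}
```

Because the channel has exactly one reader, concurrently enqueued frames are written whole and in enqueue order.
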
During shutdown the worker client treats `WorkerShutdownAck` as the protocol
close signal, but the process handle remains authoritative for process lifetime.
The client waits for both the protocol close and process exit before reporting a
clean shutdown to `GatewaySession`.

## Command Correlation

Each command gets:

- gateway correlation id,
- method name,
- start timestamp,
- timeout deadline,
- caller cancellation token,
- reply completion source.

Pending command handling:

- Add the pending entry before writing the command.
- Remove it exactly once when a reply, timeout, cancellation, or session fault
  occurs.
- If a late reply arrives after cancellation or timeout, log it with the
  correlation id and discard it.
- If the session faults, complete all pending commands with a structured fault.

Timeouts should not assume the COM call stopped. A timed-out command may still
finish inside the worker.

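The pending-command rules can be sketched with a concurrent dictionary of `TaskCompletionSource` entries. `PendingCommands<TReply>` is hypothetical (`TReply` stands in for `WorkerCommandReply`), and the `writeCommandAsync` delegate represents handing the command to the write loop.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class PendingCommands<TReply>
{
    private readonly ConcurrentDictionary<long, TaskCompletionSource<TReply>> _pending = new();
    private long _nextCorrelationId;

    // Registers the entry before the command is written, then waits for the
    // reply, the timeout, or caller cancellation.
    public async Task<TReply> InvokeAsync(
        Func<long, Task> writeCommandAsync, TimeSpan timeout, CancellationToken ct)
    {
        var id = Interlocked.Increment(ref _nextCorrelationId);
        var tcs = new TaskCompletionSource<TReply>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[id] = tcs;
        try
        {
            await writeCommandAsync(id);
            // Throws TimeoutException or OperationCanceledException; neither
            // stops the COM call that may still be running in the worker.
            return await tcs.Task.WaitAsync(timeout, ct);
        }
        finally
        {
            _pending.TryRemove(id, out _);
        }
    }

    // Called by the read loop. False means a late or unknown reply, which the
    // caller logs with the correlation id and discards.
    public bool TryComplete(long correlationId, TReply reply) =>
        _pending.TryRemove(correlationId, out var tcs) && tcs.TrySetResult(reply);

    // Called when the session faults: every waiter observes the structured fault.
    public void FaultAll(Exception fault)
    {
        foreach (var id in _pending.Keys)
        {
            if (_pending.TryRemove(id, out var tcs))
            {
                tcs.TrySetException(fault);
            }
        }
    }
}
```

`TryRemove` is what makes removal happen exactly once: whichever path reaches the entry first wins, and a late reply finds nothing and returns `false`.
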
## Fault Model

Fault categories:

- `StartupFailed`
- `ProtocolMismatch`
- `ProtocolViolation`
- `PipeDisconnected`
- `WorkerExited`
- `HeartbeatExpired`
- `CommandTimeout`
- `WorkerFaulted`
- `GatewayShutdown`
- `AuthorizationFailed`

Public replies should distinguish:

- gRPC transport failure,
- gateway/session failure,
- worker protocol failure,
- MXAccess method failure,
- MXAccess HRESULT/status failure.

Do not hide an MXAccess HRESULT by returning only an RPC error. When MXAccess
was reached and returned status, preserve that status in the command reply.

## Heartbeats And Leases

Use separate concepts:

- worker heartbeat: proves the worker process and pipe loop are alive,
- session lease: proves the client still owns the session,
- command timeout: bounds one command wait,
- startup timeout: bounds worker creation,
- shutdown timeout: bounds graceful stop.

Suggested defaults for early development:

- startup timeout: 30 seconds,
- worker heartbeat interval: 5 seconds,
- heartbeat grace: 15 seconds,
- default command timeout: 30 seconds,
- graceful shutdown timeout: 10 seconds,
- idle session lease: configurable, disabled in local development.

The exact values should be configurable.

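The suggested defaults map naturally onto an options type, sketched below with hypothetical names. It keeps the heartbeat check and the lease check separate, as the list above requires, and assumes "heartbeat grace" means the maximum allowed age of the last worker heartbeat.

```csharp
using System;

// Hypothetical options type carrying the suggested early-development defaults.
public sealed record GatewayTimeoutOptions
{
    public TimeSpan StartupTimeout { get; init; } = TimeSpan.FromSeconds(30);
    public TimeSpan WorkerHeartbeatInterval { get; init; } = TimeSpan.FromSeconds(5);
    public TimeSpan HeartbeatGrace { get; init; } = TimeSpan.FromSeconds(15);
    public TimeSpan DefaultCommandTimeout { get; init; } = TimeSpan.FromSeconds(30);
    public TimeSpan ShutdownTimeout { get; init; } = TimeSpan.FromSeconds(10);

    // null means the idle session lease is disabled (local development).
    public TimeSpan? IdleSessionLease { get; init; }

    // Worker liveness: the last heartbeat must be within the grace window.
    public bool IsHeartbeatExpired(DateTimeOffset lastHeartbeatAt, DateTimeOffset now) =>
        now - lastHeartbeatAt > HeartbeatGrace;

    // Client ownership: evaluated independently of worker heartbeats.
    public bool IsLeaseExpired(DateTimeOffset lastClientActivityAt, DateTimeOffset now) =>
        IdleSessionLease is { } lease && now - lastClientActivityAt > lease;
}
```

In an ASP.NET Core host, a type like this would usually be bound from configuration so the values stay adjustable.
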
## Event Delivery

Events flow:

```text
worker MXAccess event
  -> worker outbound event queue
  -> worker pipe writer
  -> gateway read loop
  -> worker client event queue
  -> EventStreamService bounded stream queue
  -> gRPC StreamEvents
```

The gateway should record:

- worker event sequence,
- gateway receive sequence,
- worker timestamp,
- gateway receive timestamp,
- stream send timestamp if needed for diagnostics.

Default backpressure policy for parity testing should be fail-fast:

1. If the worker client event queue fills, fault the worker client.
2. If the public stream queue fills, fault the gateway session.
3. Preserve the overflow details in logs and metrics.
4. Do not silently drop data-change events.

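The fail-fast policy can be sketched with a bounded `System.Threading.Channels` channel. `FailFastEventQueue` is a hypothetical name; `BoundedChannelFullMode.Wait` is used so that `TryWrite` reports a full queue instead of discarding an event, and the fault callback is where overflow logging and metrics would be recorded.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;

public sealed class FailFastEventQueue<TEvent>
{
    private readonly Channel<TEvent> _channel;
    private readonly Action<string> _fault;
    private readonly int _capacity;
    private int _overflowed;

    public FailFastEventQueue(int capacity, Action<string> fault)
    {
        // In Wait mode, TryWrite returns false when the queue is full, so no
        // event is ever dropped behind the producer's back.
        _channel = Channel.CreateBounded<TEvent>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true,
        });
        _capacity = capacity;
        _fault = fault;
    }

    public ChannelReader<TEvent> Reader => _channel.Reader;

    public int Count => _channel.Reader.Count;

    public bool TryEnqueue(TEvent item)
    {
        if (_channel.Writer.TryWrite(item))
        {
            return true;
        }

        // First overflow faults the owner once and stops further writes.
        if (Interlocked.Exchange(ref _overflowed, 1) == 0)
        {
            _fault($"EventQueueOverflow: capacity {_capacity} reached");
            _channel.Writer.TryComplete();
        }
        return false;
    }
}
```

The drop modes (`DropOldest`, `DropNewest`, `DropWrite`) are avoided on purpose because they would silently discard data-change events.
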
Do not set a production event-rate target before measurement. `GatewayMetrics`
records received event counts by family, queue depth, stream disconnects, and
overflow counts. Later production modes may support explicit coalescing by item
handle as an opt-in behavior.

The gateway should not synthesize `OperationComplete` from write completion,
command replies, ASB completion queues, or completion-only status frames. Forward
`OperationComplete` only when the worker reports the native MXAccess public
event.

## Security

### Public API

Use API key authentication for v1. Store API keys in a gateway-owned SQLite
database, but store only hashed key secrets. Clients should send keys in gRPC
metadata using:

```text
authorization: Bearer mxgw_<key-id>_<secret>
```

The gateway should split the key into a stable key id and secret component,
load the key record by id, hash the presented secret, and compare using a
constant-time comparison.

`ApiKeyParser` accepts only `authorization: Bearer mxgw_<key-id>_<secret>`.
Malformed headers fail before any database lookup. The parsed raw secret is
kept only long enough for `ApiKeySecretHasher` to compute an HMAC-SHA256 hash
using the configured `Authentication:PepperSecretName` lookup in application
configuration. The raw secret is not stored in the auth database, identity
model, logs, or verification result.

`ApiKeyVerifier` loads the stored key record by key id, rejects revoked keys,
hashes the presented secret, and compares the stored and presented hashes with
`CryptographicOperations.FixedTimeEquals`. A successful verification returns an
`ApiKeyIdentity` with key id, key prefix, display name, and scopes. Failure
results distinguish malformed credentials, missing keys, revoked keys, missing
pepper configuration, and hash mismatch for internal authorization handling.

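The parse, hash, and compare steps can be sketched as below. `ApiKeySketch` is hypothetical, not the actual `ApiKeyParser`, `ApiKeySecretHasher`, or `ApiKeyVerifier` code; it assumes the key id contains no underscore (so the first `_` after the prefix separates id and secret) and that the pepper is available as raw bytes.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ApiKeySketch
{
    private const string Prefix = "Bearer mxgw_";

    // Splits "Bearer mxgw_<key-id>_<secret>"; anything else is malformed and
    // is rejected before any database lookup.
    public static bool TryParse(string? headerValue, out string keyId, out string secret)
    {
        keyId = secret = string.Empty;
        if (headerValue is null || !headerValue.StartsWith(Prefix, StringComparison.Ordinal))
        {
            return false;
        }

        var rest = headerValue.AsSpan(Prefix.Length);
        var split = rest.IndexOf('_');
        if (split <= 0 || split == rest.Length - 1)
        {
            return false;
        }

        keyId = rest[..split].ToString();
        secret = rest[(split + 1)..].ToString();
        return true;
    }

    // HMAC-SHA256 of the presented secret, keyed by the pepper.
    public static byte[] HashSecret(string secret, byte[] pepper) =>
        HMACSHA256.HashData(pepper, Encoding.UTF8.GetBytes(secret));

    // Constant-time comparison of the stored hash and the presented secret's hash.
    public static bool Matches(byte[] storedHash, string presentedSecret, byte[] pepper) =>
        CryptographicOperations.FixedTimeEquals(storedHash, HashSecret(presentedSecret, pepper));
}
```
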
`GatewayGrpcAuthorizationInterceptor` enforces this authentication model for
public gRPC calls. Missing, malformed, revoked, unknown, or mismatched keys fail
with `Unauthenticated`. Authenticated calls missing the scope required by the
RPC fail with `PermissionDenied`. The interceptor applies to unary calls and
server-streaming calls and stores the authenticated `ApiKeyIdentity` in
`IGatewayRequestIdentityAccessor` for the duration of the request handler.
`Authentication:Mode` set to `Disabled` bypasses API-key verification for local
development only.

Dashboard authentication uses LDAP bind and role mapping, separate from the
API-key model used on the gRPC API. The login endpoint accepts username and
password in a form post, calls `DashboardAuthenticator` to bind against
`MxGateway:Ldap`, resolves the user's LDAP groups through
`MxGateway:Dashboard:GroupToRole` to one of `Admin` / `Viewer`, and signs in
with the `MxGateway.Dashboard` cookie scheme. The cookie is HTTP-only,
secure, strict SameSite, and named `__Host-MxGatewayDashboard`. Logout
clears it. Login and logout posts validate antiforgery tokens. SignalR
connections additionally accept a 30-minute data-protected bearer minted at
`/hubs/token`. `MxGateway:Dashboard:AllowAnonymousLocalhost` permits loopback
requests to bypass the cookie requirement and defaults to `true`.

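A sketch of the group-to-role resolution, assuming `GroupToRole` is loaded into a dictionary whose comparer decides group-name case sensitivity. `GroupRoleSketch` is hypothetical, and two of its rules are assumptions rather than documented behavior: `Admin` wins when a user's groups map to both roles, and a user with no mapped group gets no role (`null`).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupRoleSketch
{
    // Hypothetical rule set: Admin outranks Viewer; no mapped group yields null.
    public static string? ResolveRole(
        IEnumerable<string> userGroups, IReadOnlyDictionary<string, string> groupToRole)
    {
        var roles = userGroups
            .Select(group => groupToRole.TryGetValue(group, out var role) ? role : null)
            .Where(role => role is not null)
            .ToHashSet(StringComparer.Ordinal);

        if (roles.Contains("Admin"))
        {
            return "Admin";
        }
        return roles.Contains("Viewer") ? "Viewer" : null;
    }
}
```

For example, a mapping of `{ GwAdmin: Admin }` resolves a member of `GwAdmin` to `Admin`.
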
Recommended scopes:

- `session:open`
- `session:close`
- `invoke:read`
- `invoke:write`
- `invoke:secure`
- `events:read`
- `metadata:read`
- `admin`

If the gateway is exposed outside the local machine, use TLS. Do not log raw API
keys or raw credential-bearing MXAccess values.

API key administration for v1 should be a local CLI/tool rather than a public
gRPC admin API. It should initialize the auth database, create keys, list keys
without secrets, revoke keys, rotate keys, and print raw secrets only once at
creation.

`ZB.MOM.WW.MxGateway.Server` exposes local API-key administration as an `apikey`
subcommand before the web host starts:

```bash
ZB.MOM.WW.MxGateway.Server apikey init-db --sqlite-path C:\ProgramData\MxGateway\gateway-auth.db
ZB.MOM.WW.MxGateway.Server apikey create-key --key-id operator01 --display-name Operator --scopes session:open,events:read
ZB.MOM.WW.MxGateway.Server apikey list-keys --json
ZB.MOM.WW.MxGateway.Server apikey revoke-key --key-id operator01
ZB.MOM.WW.MxGateway.Server apikey rotate-key --key-id operator01 --json
```

The subcommands accept `--sqlite-path`, `--pepper`, and `--json`. `--pepper`
sets the local `MxGateway:ApiKeyPepper` configuration value for the command
process; deployments should normally provide the pepper through the configured
secret source. `create-key` and `rotate-key` print the full raw API key exactly
once. `list-keys` never prints raw secrets or `secret_hash` values.

SQLite auth storage should use startup migrations with a `schema_version` table.
Migrations should run inside transactions and fail startup if the database
schema is newer than the running binary understands.

The v1 auth store uses `Microsoft.Data.Sqlite` and creates the
`schema_version`, `api_keys`, and `api_key_audit` tables through
`SqliteAuthStoreMigrator`. `AuthStoreMigrationHostedService` runs those
migrations at gateway startup when API-key authentication and
`Authentication:RunMigrationsOnStartup` are enabled. A database with a newer
schema version fails startup instead of being modified by an older gateway
binary.

`IApiKeyStore` reads stored key records and exposes an active-key lookup that
excludes rows with `revoked_utc` set. Hash verification belongs to the API-key
hashing layer, but the store preserves the `secret_hash` bytes, display name,
scopes, timestamps, and revocation state needed by that layer.

`IApiKeyAuditStore` appends audit events to `api_key_audit` and returns recent
events for diagnostics and future administrative tools. Audit records store key
ids and event metadata only; they do not store raw API key secrets.

Commands requiring authorization:

- writes,
- secured writes,
- authentication commands,
- worker shutdown diagnostics,
- metadata queries if they expose sensitive plant structure.

Current gRPC scope mapping:

- `OpenSession` requires `session:open`.
- `CloseSession` requires `session:close`.
- `StreamEvents` and `DrainEvents` require `events:read`.
- read-style MXAccess commands such as `Register`, `AddItem`, `Advise`, and
  `Ping` require `invoke:read`.
- `Write` and `Write2` require `invoke:write`.
- `WriteSecured`, `WriteSecured2`, and `AuthenticateUser` require
  `invoke:secure`.
- metadata commands such as `ArchestrAUserToId`, `GetSessionState`, and
  `GetWorkerInfo` require `metadata:read`.
- `ShutdownWorker` requires `admin`.

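The scope mapping above can be expressed as a lookup table, together with the `Unauthenticated` versus `PermissionDenied` decision that the interceptor makes. `ScopeMapSketch` and `AuthDecision` are hypothetical names; because the list names some commands only as examples ("such as"), this sketch assumes any command missing from the table is denied.

```csharp
using System;
using System.Collections.Generic;

public enum AuthDecision { Allowed, Unauthenticated, PermissionDenied }

public static class ScopeMapSketch
{
    private static readonly Dictionary<string, string> RequiredScopes = new(StringComparer.Ordinal)
    {
        ["OpenSession"] = "session:open",
        ["CloseSession"] = "session:close",
        ["StreamEvents"] = "events:read",
        ["DrainEvents"] = "events:read",
        ["Register"] = "invoke:read",
        ["AddItem"] = "invoke:read",
        ["Advise"] = "invoke:read",
        ["Ping"] = "invoke:read",
        ["Write"] = "invoke:write",
        ["Write2"] = "invoke:write",
        ["WriteSecured"] = "invoke:secure",
        ["WriteSecured2"] = "invoke:secure",
        ["AuthenticateUser"] = "invoke:secure",
        ["ArchestrAUserToId"] = "metadata:read",
        ["GetSessionState"] = "metadata:read",
        ["GetWorkerInfo"] = "metadata:read",
        ["ShutdownWorker"] = "admin",
    };

    // grantedScopes == null means no verified API key identity.
    public static AuthDecision Authorize(string method, IReadOnlySet<string>? grantedScopes)
    {
        if (grantedScopes is null)
        {
            return AuthDecision.Unauthenticated;
        }

        // Unknown methods fail closed (an assumption of this sketch).
        return RequiredScopes.TryGetValue(method, out var scope) && grantedScopes.Contains(scope)
            ? AuthDecision.Allowed
            : AuthDecision.PermissionDenied;
    }
}
```
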
### Worker IPC

Named pipes should be local only. Pipe ACLs should restrict access to:

- the gateway process identity,
- the launched worker identity,
- administrators only when operationally required.

The worker must validate `GatewayHello` and the nonce before creating MXAccess.

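The ordering rule above (validate first, create MXAccess second) can be sketched as follows. The `GatewayHello` fields, `new_launch_nonce`, `accept_gateway`, and the 32-byte nonce size are hypothetical; the sketch assumes the worker already knows the nonce the gateway generated for this launch and does not show how it is delivered.

```python
import hmac
import secrets
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class GatewayHello:
    """Hypothetical shape; the real message is defined by the worker frame protocol."""
    protocol_version: int
    nonce: bytes


class HandshakeError(Exception):
    pass


def new_launch_nonce() -> bytes:
    # Gateway side (assumed): one unpredictable nonce per worker launch.
    return secrets.token_bytes(32)


def validate_hello(hello: GatewayHello, expected_nonce: bytes, expected_version: int) -> None:
    if hello.protocol_version != expected_version:
        raise HandshakeError(
            f"protocol version {hello.protocol_version} != expected {expected_version}"
        )
    # Constant-time comparison, so response timing does not reveal how many
    # leading nonce bytes a guess got right.
    if not hmac.compare_digest(hello.nonce, expected_nonce):
        raise HandshakeError("nonce mismatch")


def accept_gateway(hello: GatewayHello, expected_nonce: bytes, expected_version: int,
                   create_mxaccess: Callable[[], T]) -> T:
    """Validate first; only a fully validated peer causes MXAccess to be created."""
    validate_hello(hello, expected_nonce, expected_version)
    return create_mxaccess()
```

Passing the MXAccess factory in as a callable makes the ordering explicit: a failed handshake raises before the factory is ever invoked.
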
## Observability

Use structured logs with these fields where applicable:

- session id,
- client identity,
- worker process id,
- pipe name hash or suffix,
- protocol version,
- correlation id,
- command method,
- MXAccess HRESULT,
- MXAccess status summary,
- event family,
- event sequence,
- queue depth,
- elapsed milliseconds.

Metrics:

- open sessions,
- workers running,
- worker startup latency,
- command latency by method,
- command failures by method and category,
- event rate by session and family,
- event queue depth,
- worker exits by reason,
- worker kills,
- heartbeat failures,
- gRPC stream disconnects.

Do not log credential values or full tag values by default.

The gateway registers `GatewayMetrics` as the in-process metrics foundation.
It emits .NET `Meter` instruments for collectors and keeps a
`GatewayMetricsSnapshot` for dashboard projection. The snapshot exists because
the dashboard needs current counters and queue depths without depending on a
specific metrics exporter.

Event metrics use low-cardinality tags such as event family. Per-session event
counts are kept only in the in-process snapshot for active dashboard sessions
and are purged when the session is removed. Worker event queue depth and gRPC
event stream queue depth are reported as separate gauges.

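A sketch of the two-tier split, assuming a plain Python class in place of `GatewayMetricsSnapshot` and leaving out the .NET `Meter` export side entirely. `MetricsSnapshot`, its method names, and the single pair of gauge values are illustrative assumptions. Family counts stay low-cardinality and are safe to export; per-session counts live only in the snapshot and are dropped in `remove_session`.

```python
import threading
from collections import Counter
from typing import Dict


class MetricsSnapshot:
    """Hypothetical stand-in for an in-process dashboard snapshot."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Low-cardinality series: keyed by event family only, safe to export.
        self._events_by_family: Counter = Counter()
        # High-cardinality detail: keyed by session, kept in-process only.
        self._events_by_session: Dict[str, Counter] = {}
        # Two independent gauges; they are reported separately, never summed.
        self._worker_queue_depth = 0
        self._grpc_queue_depth = 0

    def record_event(self, session_id: str, family: str) -> None:
        with self._lock:
            self._events_by_family[family] += 1
            self._events_by_session.setdefault(session_id, Counter())[family] += 1

    def set_queue_depths(self, *, worker: int, grpc: int) -> None:
        with self._lock:
            self._worker_queue_depth = worker
            self._grpc_queue_depth = grpc

    def remove_session(self, session_id: str) -> None:
        # Purge per-session detail so closed sessions do not accumulate;
        # the exportable family totals are left untouched.
        with self._lock:
            self._events_by_session.pop(session_id, None)

    def snapshot(self) -> dict:
        # Copies are taken under the lock, so the dashboard gets a consistent view.
        with self._lock:
            return {
                "events_by_family": dict(self._events_by_family),
                "events_by_session": {s: dict(c) for s, c in self._events_by_session.items()},
                "worker_queue_depth": self._worker_queue_depth,
                "grpc_queue_depth": self._grpc_queue_depth,
            }
```
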
HTTP request handling uses `UseGatewayRequestLoggingScope()` to attach common
structured log fields when request metadata is present:

- `SessionId`,
- `ClientIdentity`,
- `WorkerProcessId`,
- `CorrelationId`,
- `CommandMethod`.

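`UseGatewayRequestLoggingScope()` is ASP.NET Core middleware; the Python sketch below swaps in `contextvars` plus a `logging.Filter` as an analogue to show the "attach only when present" behavior. `request_logging_scope`, `GatewayScopeFilter`, and the `gateway_scope` record attribute are hypothetical names.

```python
import contextvars
import logging
from contextlib import contextmanager
from typing import Iterator, Mapping, Optional

SCOPE_FIELDS = ("SessionId", "ClientIdentity", "WorkerProcessId", "CorrelationId", "CommandMethod")

# Each scope stores a fresh dict; the default dict is never mutated.
_current_scope = contextvars.ContextVar("gateway_log_scope", default={})


@contextmanager
def request_logging_scope(metadata: Mapping[str, Optional[str]]) -> Iterator[None]:
    """Attach only the known fields that are actually present on the request."""
    present = {k: v for k, v in metadata.items() if k in SCOPE_FIELDS and v is not None}
    token = _current_scope.set({**_current_scope.get(), **present})
    try:
        yield
    finally:
        _current_scope.reset(token)  # restore the outer scope on exit


class GatewayScopeFilter(logging.Filter):
    """Copies the active scope onto every record as `record.gateway_scope`."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.gateway_scope = dict(_current_scope.get())
        return True
```

Because the scope lives in a context variable, concurrent requests handled on different tasks or threads each see only their own fields.
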
`GatewayLogRedactor` redacts API key secrets and command values before they are
added to log state. Value logging remains opt-in and redacted by default so
secured writes, authentication commands, and ordinary tag values do not leak
through diagnostics.

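A minimal redaction sketch, assuming log state is a flat dictionary. The field names in `ALWAYS_REDACTED` and `VALUE_FIELDS` are hypothetical, and keeping secured-write and authentication values masked even when value logging is switched on is an assumed policy rather than a documented rule.

```python
from typing import Any, Dict, Mapping

REDACTED = "***"
# Hypothetical field names; the real log state keys are defined by the gateway.
ALWAYS_REDACTED = frozenset({"ApiKey", "Password"})
VALUE_FIELDS = frozenset({"Value"})
SENSITIVE_METHODS = frozenset({"WriteSecured", "WriteSecured2", "AuthenticateUser"})


def redact(state: Mapping[str, Any], *, method: str, log_values: bool = False) -> Dict[str, Any]:
    """Return a copy of `state` that is safe to attach to a log entry."""
    safe: Dict[str, Any] = {}
    for key, value in state.items():
        if key in ALWAYS_REDACTED:
            safe[key] = REDACTED  # credentials never appear, regardless of settings
        elif key in VALUE_FIELDS and (not log_values or method in SENSITIVE_METHODS):
            safe[key] = REDACTED  # values are opt-in, and never for sensitive methods
        else:
            safe[key] = value
    return safe
```
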
## Configuration

Suggested configuration shape:

```json
{
  "MxGateway": {
    "Authentication": {
      "Mode": "ApiKey",
      "SqlitePath": "C:\\ProgramData\\MxGateway\\gateway-auth.db",
      "PepperSecretName": "MxGateway:ApiKeyPepper",
      "RunMigrationsOnStartup": true
    },
    "Worker": {
      "ExecutablePath": "src/ZB.MOM.WW.MxGateway.Worker/bin/x86/Release/ZB.MOM.WW.MxGateway.Worker.exe",
      "WorkingDirectory": null,
      "RequiredArchitecture": "X86",
      "StartupTimeoutSeconds": 30,
      "StartupProbeRetryAttempts": 3,
      "StartupProbeRetryDelayMilliseconds": 250,
      "PipeConnectAttemptTimeoutMilliseconds": 2000,
      "ShutdownTimeoutSeconds": 10,
      "HeartbeatIntervalSeconds": 5,
      "HeartbeatGraceSeconds": 15,
      "MaxMessageBytes": 16777216
    },
    "Sessions": {
      "DefaultCommandTimeoutSeconds": 30,
      "MaxSessions": 64,
      "MaxPendingCommandsPerSession": 128,
      "AllowMultipleEventSubscribers": false
    },
    "Events": {
      "QueueCapacity": 10000,
      "BackpressurePolicy": "FailFast"
    },
    "Dashboard": {
      "Enabled": true,
      "AllowAnonymousLocalhost": true,
      "SnapshotIntervalMilliseconds": 1000,
      "RecentFaultLimit": 100,
      "RecentSessionLimit": 200,
      "ShowTagValues": false,
      "GroupToRole": {
        "GwAdmin": "Admin",
        "GwReader": "Viewer"
      }
    },
    "Protocol": {
      "WorkerProtocolVersion": 1
    }
  }
}
```

Do not scatter connection or path constants through implementation code.

`ZB.MOM.WW.MxGateway.Server` binds this section to `GatewayOptions` at startup and
registers validation with `ValidateOnStart()`. Startup fails before the gateway
begins serving traffic when required authentication settings are missing,
timeouts or queue sizes are not positive, dashboard settings are malformed, or
the configured worker protocol version does not match the contract version.

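In .NET, `ValidateOnStart()` moves options validation from first use to host startup. The Python sketch below mirrors the four documented failure classes against a dictionary shaped like the JSON above. Which keys count as required or positive, the Admin/Viewer role set, `CONTRACT_PROTOCOL_VERSION`, and `OptionsValidationError` are assumptions; the authoritative rules are in the configuration reference.

```python
from typing import Any, Dict, List

CONTRACT_PROTOCOL_VERSION = 1  # hypothetical stand-in for the contract constant
KNOWN_ROLES = {"Admin", "Viewer"}  # assumed from the GroupToRole example

# Assumed subset of settings that must be positive integers.
POSITIVE_INTS = [
    ("Worker", "StartupTimeoutSeconds"),
    ("Worker", "ShutdownTimeoutSeconds"),
    ("Worker", "HeartbeatIntervalSeconds"),
    ("Worker", "HeartbeatGraceSeconds"),
    ("Worker", "MaxMessageBytes"),
    ("Sessions", "DefaultCommandTimeoutSeconds"),
    ("Sessions", "MaxSessions"),
    ("Sessions", "MaxPendingCommandsPerSession"),
    ("Events", "QueueCapacity"),
    ("Dashboard", "SnapshotIntervalMilliseconds"),
]


class OptionsValidationError(ValueError):
    pass


def validate_gateway_options(config: Dict[str, Any]) -> List[str]:
    gw = config.get("MxGateway", {})
    errors: List[str] = []

    auth = gw.get("Authentication", {})
    if auth.get("Mode") == "ApiKey":
        for key in ("SqlitePath", "PepperSecretName"):
            if not auth.get(key):
                errors.append(f"Authentication:{key} is required when Mode is ApiKey")

    for section, key in POSITIVE_INTS:
        value = gw.get(section, {}).get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            errors.append(f"{section}:{key} must be a positive integer (got {value!r})")

    group_to_role = gw.get("Dashboard", {}).get("GroupToRole", {})
    if not isinstance(group_to_role, dict):
        errors.append("Dashboard:GroupToRole must be an object")
    else:
        for group, role in group_to_role.items():
            if role not in KNOWN_ROLES:
                errors.append(f"Dashboard:GroupToRole:{group} maps to unknown role {role!r}")

    version = gw.get("Protocol", {}).get("WorkerProtocolVersion")
    if version != CONTRACT_PROTOCOL_VERSION:
        errors.append(
            f"Protocol:WorkerProtocolVersion {version!r} != contract {CONTRACT_PROTOCOL_VERSION}"
        )
    return errors


def ensure_valid(config: Dict[str, Any]) -> None:
    """Fail fast, before anything starts serving, if any rule is violated."""
    errors = validate_gateway_options(config)
    if errors:
        raise OptionsValidationError("; ".join(errors))
```

Collecting every error before raising gives an operator the full list of problems in one failed start instead of one per restart.
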
The gateway exposes read-only effective settings through
`IGatewayConfigurationProvider`. This projection is for dashboard settings and
diagnostics, so it redacts secret-related fields such as
`Authentication:PepperSecretName` and does not include raw API keys or key
material.

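A sketch of a redacted, read-only projection, assuming the options are available as a nested dictionary. `effective_settings`, `SECRET_OPTION_PATHS`, and the `[redacted]` marker are hypothetical; returning a deep copy is one way to keep the projection read-only from the caller's point of view.

```python
import copy
from typing import Any, Dict

REDACTED = "[redacted]"
# Hypothetical list of secret-related option paths under "MxGateway".
SECRET_OPTION_PATHS = [("Authentication", "PepperSecretName")]


def effective_settings(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a redacted deep copy of the live options for dashboard display.

    The copy means dashboard code cannot mutate the running configuration, and
    secret-related fields are replaced before the data leaves this function.
    """
    projected = copy.deepcopy(config.get("MxGateway", {}))
    for section, key in SECRET_OPTION_PATHS:
        if key in projected.get(section, {}):
            projected[section][key] = REDACTED
    return projected
```
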
The complete option reference, including defaults and validation rules, is in
[Gateway Configuration](./GatewayConfiguration.md).

## Galaxy Repository Metadata

Galaxy hierarchy and tag metadata can be discovered through SQL Server when
needed for browse or diagnostics. The current notes live outside this repo at:

```text
C:\Users\dohertj2\Desktop\lmxopcua\gr
```

Use SQL metadata as discovery data. It does not replace MXAccess-backed runtime
behavior unless an explicit non-parity backend is designed.

## Testing Strategy

Gateway tests should be able to run without installed MXAccess by using fake
workers and fake transports.

Use `FakeWorkerHarness` for tests that need real gateway-to-worker framing,
handshake, command, event, fault, or malformed-protocol behavior without loading
MXAccess COM. See [Gateway Testing](./GatewayTesting.md) for the harness scope
and focused test commands.

Focused tests:

- session state transitions,
- gRPC API-key authentication for unary and streaming calls,
- gRPC scope mapping for sessions, invokes, events, metadata, and admin
  commands,
- worker startup failures,
- protocol version mismatch,
- malformed frame handling,
- pending command completion,
- command timeout and late reply handling,
- worker crash handling,
- event ordering,
- event queue overflow,
- `CloseSession` idempotency,
- gRPC mapping for command replies and faults,
- dashboard snapshot projection,
- dashboard auth decisions,
- dashboard redaction,
- dashboard realtime subscription disposal.

Integration tests with the real worker should be separated from unit tests and
clearly marked because they require Windows, .NET Framework worker output, and
eventually installed MXAccess COM.

## Initial Implementation Slice

The first gateway slice should implement:

1. Host startup and configuration binding.
2. SQLite auth database initialization and migrations.
3. Local API-key administration CLI/tool.
4. API-key authentication and scope checks.
5. `OpenSession`.
6. Worker process launch.
7. Named-pipe handshake.
8. `Invoke` for `Register`, `AddItem`, and `Advise`.
9. `StreamEvents` with one subscriber per session.
10. `CloseSession`.
11. Worker crash and startup failure handling.
12. Event-rate, queue-depth, and overflow metrics.
13. Blazor Server dashboard with Bootstrap assets.
14. Dashboard home, sessions, and workers pages.
15. Dashboard realtime snapshot refresh.
16. Dashboard LDAP login mapped to Admin / Viewer roles.
17. Basic structured logs.

This proves the process model before the full command surface is implemented.

## Related Documentation

- [MXAccess Worker Instance Detailed Design](./MxAccessWorkerInstanceDesign.md)
- [Worker Frame Protocol](./WorkerFrameProtocol.md)
- [Worker Process Launcher](./WorkerProcessLauncher.md)
- [Gateway Configuration](./GatewayConfiguration.md)
- [Sessions](./Sessions.md)
- [gRPC](./Grpc.md)
- [Authentication](./Authentication.md)
- [Authorization](./Authorization.md)
- [Metrics](./Metrics.md)
- [Diagnostics](./Diagnostics.md)
- [Gateway Testing](./GatewayTesting.md)