Design Decisions
This document records current v1 choices for the MXAccess gateway design. These decisions can change, but implementation should follow them until a later design update says otherwise.
Source References
Use these local analysis sources when answering MXAccess-specific design or implementation questions:
C:\Users\dohertj2\Desktop\mxaccess
C:\Users\dohertj2\Desktop\mxaccess\docs\MXAccess-Public-API.md
C:\Users\dohertj2\Desktop\mxaccess\docs\MXAccess-Reverse-Engineering.md
Use these local notes for Galaxy Repository SQL metadata:
C:\Users\dohertj2\Desktop\lmxopcua\gr
MXAccess COM Target
Decision: target the installed MXAccess COM interop surface directly from the x86 worker.
Concrete COM details from the MXAccess analysis:
- Interop assembly: C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll
- Assembly identity: ArchestrA.MxAccess, Version=3.2.0.0, PublicKeyToken=23106a86e706d0ae
- COM class: ArchestrA.MxAccess.LMXProxyServerClass
- CLSID: {C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}
- ProgID: LMXProxy.LMXProxyServer.1
- Version-independent ProgID: LMXProxy.LMXProxyServer
- Registered server: C:\Program Files (x86)\ArchestrA\Framework\Bin\LmxProxy.dll
- Registry view: HKCR\Wow6432Node\CLSID\{C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}
- Threading model: Apartment
Rationale: LMXProxyServer is a 32-bit in-process COM server, so a .NET 10 x64
gateway cannot instantiate it directly. The x86 sidecar worker is the reliable
parity path.
Implementation guidance:
- Worker should reference ArchestrA.MXAccess.dll.
- Worker should instantiate new LMXProxyServerClass() on the dedicated STA.
- Worker should expose the resolved class, ProgID, CLSID, interop assembly version, and LmxProxy.dll path through GetWorkerInfo/WorkerReady.
- Keep the ProgID/path configurable for diagnostics, but the default should be the installed MXAccess class above.
Session Reconnect
Decision: no reconnectable sessions for v1.
One OpenSession creates one gateway session and one worker process. The
session ends on CloseSession, client disconnect policy, lease expiry, worker
fault, or gateway shutdown.
Rationale: reconnectable sessions require event replay, orphan ownership, security checks, and more complicated worker lifetime rules. They are not needed for the first parity slice.
Event Subscribers
Decision: one active StreamEvents subscriber per session for v1.
A second subscriber should be rejected with a clear session error. Multi-client fan-out may be added later with explicit backpressure semantics.
Rationale: one subscriber preserves simple event ordering and failure behavior while parity is being proven.
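The single-subscriber rule can be sketched as a small guard object. This is a minimal illustration in Python, not the gateway's implementation; the names EventSubscriberSlot and SessionError are hypothetical.

```python
import threading
from typing import Optional


class SessionError(Exception):
    """Hypothetical error type for session-level rule violations."""


class EventSubscriberSlot:
    """Holds at most one active StreamEvents subscriber for a session."""

    def __init__(self, session_id: str) -> None:
        self._session_id = session_id
        self._lock = threading.Lock()
        self._active: Optional[str] = None

    def attach(self, subscriber_id: str) -> None:
        # Reject a second subscriber instead of fanning out.
        with self._lock:
            if self._active is not None:
                raise SessionError(
                    f"session {self._session_id} already has an event subscriber"
                )
            self._active = subscriber_id

    def detach(self, subscriber_id: str) -> None:
        # Only the current subscriber can free the slot.
        with self._lock:
            if self._active == subscriber_id:
                self._active = None
```

Once the first subscriber detaches, a new subscriber can attach; nothing is replayed to it, which is consistent with the no-reconnect decision above.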
Alarms — superseded for the alarm subsystem
The single-subscriber rule above no longer applies to alarms. The gateway runs
an always-on central alarm monitor (GatewayAlarmMonitor) that owns one
gateway-managed worker session, caches the active-alarm set, and fans it out to
any number of clients through the session-less StreamAlarms RPC. Per-session
alarm auto-subscribe is removed; AcknowledgeAlarm is session-less and routes
through the monitor. Data-side StreamEvents remains one subscriber per
session. Rationale: alarm state is gateway-wide, not session-scoped — every
client wants the same current set plus updates, and forcing each to own a
worker would multiply AVEVA polling load for no benefit.
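The cache-and-fan-out shape described here can be sketched as follows. This is an illustrative Python model only; AlarmFanOut, the update tuple layout, and the state strings are assumptions and do not reflect the actual GatewayAlarmMonitor or StreamAlarms message types.

```python
from typing import Callable, Dict, List, Tuple

# (alarm_id, state, is_active); a hypothetical update shape.
AlarmUpdate = Tuple[str, str, bool]


class AlarmFanOut:
    """Caches the active-alarm set and forwards updates to every subscriber."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}
        self._subscribers: List[Callable[[AlarmUpdate], None]] = []

    def subscribe(self, deliver: Callable[[AlarmUpdate], None]) -> None:
        # A new client first receives the current active set, then live updates.
        for alarm_id, state in self._active.items():
            deliver((alarm_id, state, True))
        self._subscribers.append(deliver)

    def on_update(self, alarm_id: str, state: str, is_active: bool) -> None:
        # Update the cache before forwarding so late subscribers see it.
        if is_active:
            self._active[alarm_id] = state
        else:
            self._active.pop(alarm_id, None)
        for deliver in self._subscribers:
            deliver((alarm_id, state, is_active))
```

Because every client reads from the same cache, adding a client adds no worker session and no extra polling.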
Authentication
Decision: API key authentication for the public gateway.
API keys are stored in a gateway-owned SQLite database. Store hashed API key secrets only; never store raw key material.
Recommended client format:
authorization: Bearer mxgw_<key-id>_<secret>
Recommended SQLite tables:
CREATE TABLE api_keys (
    key_id TEXT PRIMARY KEY,
    key_prefix TEXT NOT NULL,
    secret_hash BLOB NOT NULL,
    display_name TEXT NOT NULL,
    scopes TEXT NOT NULL,
    created_utc TEXT NOT NULL,
    last_used_utc TEXT NULL,
    revoked_utc TEXT NULL
);
CREATE TABLE api_key_audit (
    audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
    key_id TEXT NULL,
    event_type TEXT NOT NULL,
    remote_address TEXT NULL,
    created_utc TEXT NOT NULL,
    details TEXT NULL
);
Recommended scopes:
- session:open
- session:close
- invoke:read
- invoke:write
- invoke:secure
- events:read
- metadata:read
- admin
Hashing recommendation:
- Use HMAC-SHA256 with a gateway-local secret/pepper stored outside SQLite, or use Argon2id if a suitable dependency is already accepted.
- Compare hashes using constant-time comparison.
- Log only the key id or prefix, not the raw key.
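The HMAC-SHA256 option can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the key id is assumed to contain no underscore (so the token splits unambiguously), and the function names are hypothetical rather than the gateway's own.

```python
import hashlib
import hmac
from typing import Tuple

KEY_PREFIX = "mxgw"


def parse_api_key(token: str) -> Tuple[str, str]:
    """Split 'mxgw_<key-id>_<secret>' into (key_id, secret)."""
    prefix, sep, rest = token.partition("_")
    if prefix != KEY_PREFIX or not sep:
        raise ValueError("not an mxgw API key")
    # Assumption: key ids contain no underscore; the secret may.
    key_id, sep, secret = rest.partition("_")
    if not key_id or not sep or not secret:
        raise ValueError("malformed API key")
    return key_id, secret


def hash_secret(pepper: bytes, secret: str) -> bytes:
    """HMAC-SHA256 of the secret, keyed by a pepper stored outside SQLite."""
    return hmac.new(pepper, secret.encode("utf-8"), hashlib.sha256).digest()


def verify_secret(pepper: bytes, stored_hash: bytes, secret: str) -> bool:
    """Compare in constant time to avoid leaking match position via timing."""
    return hmac.compare_digest(hash_secret(pepper, secret), stored_hash)
```

Only the output of hash_secret would be written to secret_hash; the key id returned by parse_api_key is the value that is safe to log.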
Storage recommendation:
- Default SQLite path should be under ProgramData or another configured gateway data directory.
- Apply restrictive filesystem ACLs for the gateway service identity and administrators.
- Require TLS when the gateway is reachable off-machine.
Authorization
Decision: start with scope checks by command category.
Suggested mapping:
- OpenSession: session:open
- CloseSession: session:close
- Register, Unregister, AddItem, AddItem2, RemoveItem, Advise, UnAdvise, AdviseSupervisory, AddBufferedItem, SetBufferedUpdateInterval, Suspend, Activate: invoke:read
- Write, Write2: invoke:write
- WriteSecured, WriteSecured2, AuthenticateUser, ArchestrAUserToId: invoke:secure
- StreamEvents: events:read
- Galaxy SQL metadata endpoints if added: metadata:read
- worker shutdown diagnostics and key management: admin
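A scope check by command category could be sketched as a lookup table. This Python sketch covers only the named commands from the mapping; denying unknown commands is an assumption made here, and the sketch does not assume that admin implies any other scope.

```python
from typing import Dict, Iterable

# Built from the suggested mapping; grouped by required scope.
_SCOPE_GROUPS: Dict[str, tuple] = {
    "session:open": ("OpenSession",),
    "session:close": ("CloseSession",),
    "invoke:read": (
        "Register", "Unregister", "AddItem", "AddItem2", "RemoveItem",
        "Advise", "UnAdvise", "AdviseSupervisory", "AddBufferedItem",
        "SetBufferedUpdateInterval", "Suspend", "Activate",
    ),
    "invoke:write": ("Write", "Write2"),
    "invoke:secure": (
        "WriteSecured", "WriteSecured2", "AuthenticateUser", "ArchestrAUserToId",
    ),
    "events:read": ("StreamEvents",),
}

COMMAND_SCOPES: Dict[str, str] = {
    command: scope
    for scope, commands in _SCOPE_GROUPS.items()
    for command in commands
}


def is_authorized(command: str, granted_scopes: Iterable[str]) -> bool:
    """Return True only if the key holds the scope the command requires."""
    required = COMMAND_SCOPES.get(command)
    if required is None:
        return False  # Assumption: unmapped commands are denied.
    return required in set(granted_scopes)
```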
Worker Process Identity
Decision: run workers as the gateway service identity for v1.
Rationale: this avoids early COM/DCOM permission failures and keeps the first implementation focused on MXAccess parity. The worker launcher should keep an extension point for a restricted service account later.
Event Backpressure
Decision: fail-fast bounded queues for v1 and parity testing.
If worker or gateway event queues fill, fault the session. Do not silently drop or coalesce events in parity mode.
Rationale: event drops would hide parity defects. Production coalescing by item handle can be added later as an explicit opt-in mode once event rates are measured.
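Fail-fast behavior on a bounded queue can be sketched as below. This is a Python illustration only; FailFastEventQueue and SessionFaulted are hypothetical names, and the real worker and gateway queues may differ in structure.

```python
import queue


class SessionFaulted(Exception):
    """Hypothetical error raised once a session has been faulted."""


class FailFastEventQueue:
    """Bounded queue that faults the session on overflow instead of dropping."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[object]" = queue.Queue(maxsize=capacity)
        self.overflow_count = 0
        self.faulted = False

    def publish(self, event: object) -> None:
        if self.faulted:
            raise SessionFaulted("session already faulted")
        try:
            # Never block and never discard: a full queue is a fault.
            self._queue.put_nowait(event)
        except queue.Full:
            self.faulted = True
            self.overflow_count += 1
            raise SessionFaulted("event queue overflow") from None

    def depth(self) -> int:
        return self._queue.qsize()
```

The depth and overflow_count values are the kind of numbers a metrics endpoint could expose while event rates are still being measured.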
Event-Rate Target
Decision: do not set a production event-rate target before measurement.
For v1, expose queue depth, event rate, stream send latency, and overflow metrics. Keep bounded queues and fail-fast behavior. Use observed load from live systems to set a later coalescing or scaling target.
Command Batching
Decision: no public command batching for v1.
Use one command per request so replies, HRESULTs, status arrays, event ordering, and failure behavior are easy to compare against direct MXAccess.
Batch tag registration can be added later if measured setup latency requires it.
Bulk Command Family
Decision: the gateway exposes a fixed set of bulk command kinds —
AddItemBulk, AdviseItemBulk, RemoveItemBulk, UnAdviseItemBulk,
SubscribeBulk, UnsubscribeBulk, WriteBulk, Write2Bulk,
WriteSecuredBulk, WriteSecured2Bulk, ReadBulk — that carry a list of
entries in one round-trip and return one per-entry result. Each command kind
runs the corresponding single-item MXAccess COM call sequentially on the
worker STA; per-entry failures populate was_successful = false with the
underlying HRESULT and never throw. There is no transactional / fail-fast
semantic — bulk here means "one round-trip, per-entry results", not
"atomic".
Rationale: MXAccess COM itself has no native bulk API for any of these operations. Surfacing the per-entry result list keeps parity transparent — the caller sees the same per-item HRESULT they would see calling MXAccess N times directly — while the bulk shape collapses the gateway/IPC overhead to one round-trip per batch and lets the worker keep the STA hot.
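The "one round-trip, per-entry results" shape can be sketched as a sequential loop that records failures instead of raising. This Python sketch is illustrative; BulkEntryResult, ComError, and run_bulk are hypothetical names, and the real reply messages carry more fields.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


class ComError(Exception):
    """Hypothetical wrapper for a failed COM call carrying its HRESULT."""

    def __init__(self, hresult: int) -> None:
        super().__init__(f"HRESULT 0x{hresult:08X}")
        self.hresult = hresult


@dataclass
class BulkEntryResult:
    tag: str
    was_successful: bool
    hresult: int


def run_bulk(tags: Iterable[str], single_call: Callable[[str], None]) -> List[BulkEntryResult]:
    """Run the single-item call once per entry, in order, never aborting early."""
    results: List[BulkEntryResult] = []
    for tag in tags:
        try:
            single_call(tag)
            results.append(BulkEntryResult(tag, True, 0))
        except ComError as exc:
            # A failed entry is reported in place; later entries still run.
            results.append(BulkEntryResult(tag, False, exc.hresult))
    return results
```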
ReadBulk is the only bulk command without a 1:1 MXAccess analogue. Two
choices were considered:
- Cache-then-snapshot (chosen): when a requested tag is already in the session's item registry AND advised, the worker returns the last cached OnDataChange value without touching the subscription (was_cached = true). Otherwise it takes the full AddItem + Advise + wait-for-first-OnDataChange + UnAdvise + RemoveItem lifecycle itself (was_cached = false) and leaves the session exactly as it was before the call. The cache lives on a per-session MxAccessValueCache, populated by MxAccessBaseEventSink on every OnDataChange after the event clears the outbound queue.
- Always-snapshot: take the AddItem-through-RemoveItem lifecycle for every requested tag. Cleaner conceptually but pays the full lifecycle cost on every call and would interfere with existing subscriptions if MXAccess reuses item handles.
The chosen behavior matches what callers actually want from "current
value" — a free read of an already-streaming tag, and a one-shot snapshot
otherwise — and never disturbs subscriptions the caller did not create.
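The cache-then-snapshot decision can be sketched per tag as follows. This Python sketch is illustrative only: take_snapshot stands in for the full AddItem-through-RemoveItem lifecycle, and treating an advised tag with no cached value yet as a snapshot case is an assumption, not something the design states.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Set


@dataclass
class ReadEntry:
    tag: str
    value: Any
    was_cached: bool


def read_bulk(
    tags: Iterable[str],
    advised_tags: Set[str],
    value_cache: Dict[str, Any],
    take_snapshot: Callable[[str], Any],
) -> List[ReadEntry]:
    """Serve advised tags from the cache; snapshot everything else."""
    results: List[ReadEntry] = []
    for tag in tags:
        if tag in advised_tags and tag in value_cache:
            # Free read: the existing subscription is not touched.
            results.append(ReadEntry(tag, value_cache[tag], True))
        else:
            # One-shot lifecycle owned entirely by this call.
            results.append(ReadEntry(tag, take_snapshot(tag), False))
    return results
```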
The decision intentionally does NOT synthesize an OnDataChange event
from the snapshot path: the snapshot value reaches the caller through
ReadBulk's reply payload only, not through the event stream. This
preserves the "Don't synthesize events" rule that scopes the rest of the
worker.
ReadBulk's wait loop pumps Windows messages on the worker STA
(StaRuntime.PumpPendingMessages) on every poll iteration so the inbound
MXAccess COM event can dispatch while the bulk executor still holds the
thread — without the pump the OnDataChange would never deliver.
Graceful Worker Shutdown
Decision: best-effort cleanup before COM release.
During graceful shutdown, the worker should attempt:
- UnAdvise for advised items.
- RemoveItem for active item handles.
- Unregister for active server handles.
- Event detach.
- COM release.
Failures during cleanup should be logged and preserved diagnostically, but the gateway may still kill the worker after shutdown timeout.
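Best-effort cleanup can be sketched as an ordered list of steps where each failure is recorded and the remaining steps still run. This is a Python illustration; run_cleanup and the step callables are hypothetical, and the shutdown timeout enforced by the gateway is not modeled.

```python
from typing import Callable, List, Sequence, Tuple

CleanupStep = Tuple[str, Callable[[], None]]


def run_cleanup(steps: Sequence[CleanupStep]) -> List[Tuple[str, str]]:
    """Run every step in order; collect failures instead of stopping."""
    failures: List[Tuple[str, str]] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # Best-effort: record and continue.
            failures.append((name, repr(exc)))
    return failures
```

A caller would pass the steps in the order listed above (UnAdvise, RemoveItem, Unregister, event detach, COM release) and log the returned failures for diagnostics.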
OperationComplete
Decision: model and forward OperationComplete only when native MXAccess fires
it. Do not synthesize OperationComplete from writes, command replies, ASB
completion queues, or other status frames.
Rationale: the event signature is known, but the MXAccess analysis has not yet captured the runtime condition that triggers the public event. Synthesizing it would risk breaking parity.
Buffered Data Change
Decision: include OnBufferedDataChange in the protocol and worker event
model, but treat multi-sample payload conversion as capture-validated work.
The event signature and native path are known. A live buffered sample batch has not yet been observed. Until then, preserve raw value, quality, timestamp, data type, and status metadata whenever conversion is incomplete.
Completion-Only Status Mapping
Decision: preserve completion-only operation-status bytes as raw diagnostic
metadata unless native MXAccess raises a public event or the MXAccess analysis
proves an exact MXSTATUS_PROXY[] mapping.
Do not guess status category/source/detail values for frames that MXAccess does not expose through its public COM events.
API Key Administration
Decision: v1 API key management is a local administrative CLI/tool, not a public admin API.
The tool should support:
- initialize auth database,
- create key,
- list keys without showing secrets,
- revoke key,
- rotate key,
- print the raw secret exactly once at creation.
Public gRPC key-management endpoints can be added later only behind admin
scope and TLS.
SQLite Migrations
Decision: use simple startup migrations with a schema_version table.
Recommended table:
CREATE TABLE schema_version (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    version INTEGER NOT NULL,
    applied_utc TEXT NOT NULL
);
Migrations should be idempotent, run inside transactions, and fail gateway startup if the database is newer than the running binary understands.
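A startup migration runner along these lines can be sketched with Python's sqlite3 module. It is a minimal illustration, not the gateway's code: the MIGRATIONS list holds a single abbreviated placeholder step, and the function name migrate is hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA_VERSION_DDL = """
CREATE TABLE IF NOT EXISTS schema_version (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    version INTEGER NOT NULL,
    applied_utc TEXT NOT NULL
)
"""

# Entry N-1 upgrades the database to version N (placeholder content).
MIGRATIONS = [
    "CREATE TABLE api_keys (key_id TEXT PRIMARY KEY, secret_hash BLOB NOT NULL)",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations; refuse databases newer than this binary."""
    conn.execute(SCHEMA_VERSION_DDL)
    row = conn.execute("SELECT version FROM schema_version WHERE id = 1").fetchone()
    current = row[0] if row else 0
    if current > len(MIGRATIONS):
        raise RuntimeError(f"database version {current} is newer than supported")
    for version in range(current + 1, len(MIGRATIONS) + 1):
        conn.execute("BEGIN")  # One explicit transaction per migration step.
        try:
            conn.execute(MIGRATIONS[version - 1])
            conn.execute(
                "INSERT OR REPLACE INTO schema_version (id, version, applied_utc) "
                "VALUES (1, ?, ?)",
                (version, datetime.now(timezone.utc).isoformat()),
            )
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    return len(MIGRATIONS)
```

Running migrate twice is a no-op the second time because the version gate skips steps that were already applied. The sketch expects a connection opened with isolation_level=None so that the explicit BEGIN and COMMIT control the transaction.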
Web Dashboard
Decision: host a basic gateway dashboard with Blazor Server and Bootstrap CSS/JS.
The dashboard should show gateway health, active sessions, worker instances, basic metrics, queue depths, and recent faults. It should update in real time through Blazor Server component updates.
Allowed UI stack:
- Blazor Server,
- Bootstrap CSS,
- Bootstrap JavaScript,
- small local CSS.
Do not use MudBlazor or other Blazor UI component libraries for v1.
Dashboard authentication is LDAP-backed, deliberately separate from the gRPC
API-key model: dashboard users are people who already have directory accounts,
so reusing LDAP avoids minting and distributing API keys for human operators.
DashboardAuthenticator binds the supplied credentials against MxGateway:Ldap
through the shared ILdapAuthService, then maps the user's LDAP groups to the
Administrator or Viewer dashboard role via MxGateway:Dashboard:GroupToRole.
A login whose groups match no role is denied. For local development, anonymous
localhost access is enabled by default through
MxGateway:Dashboard:AllowAnonymousLocalhost; the bypass is limited to loopback
requests.
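The group-to-role step can be sketched as a lookup with a deny-by-default result. This Python sketch is illustrative: exact-match group names and Administrator taking precedence over Viewer are assumptions, and resolve_dashboard_role is a hypothetical name rather than part of DashboardAuthenticator.

```python
from typing import Iterable, Mapping, Optional

# Assumption: when groups map to both roles, the higher role wins.
ROLE_PRECEDENCE = ("Administrator", "Viewer")


def resolve_dashboard_role(
    user_groups: Iterable[str],
    group_to_role: Mapping[str, str],
) -> Optional[str]:
    """Return the user's dashboard role, or None when login must be denied."""
    granted = {group_to_role[g] for g in user_groups if g in group_to_role}
    for role in ROLE_PRECEDENCE:
        if role in granted:
            return role
    return None
```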
Lazy Browse Is Wire-Only
Decision: the gateway continues to pull the full Galaxy hierarchy on each
deploy. BrowseChildren and the lazy dashboard render only avoid sending and
DOM-materializing the full tree — they do not push laziness into SQL or cache
loading.
Rationale: snapshot persistence and the dashboard summary both depend on a fully-materialized cache. Lazy SQL would increase per-click latency on a deployment-heavy box, multiply per-session SQL connections, and complicate the cold-start path. Wire-side laziness solves the actual pain (oversized gRPC replies and a heavy DOM) without disturbing the materialization model.
TLS Auto-Certificate and Lenient Client Trust
Decision: when a Kestrel https:// endpoint is configured without a certificate
of its own (and no Kestrel:Certificates:Default is set), the gateway generates
and persists a self-signed certificate rather than failing to start. Clients
connecting over TLS without a pinned CA accept whatever certificate the server
presents by default; pinning a CA restores full verification.
Rationale: mxaccessgw is an internal tool with no PKI to issue or distribute
certificates. The prior behavior — an https endpoint with no certificate
fails at startup with Kestrel's opaque "no server certificate was specified"
error — pushed operators toward plaintext (h2c), exposing the API key and
request payloads on the wire. Auto-generating a long-lived, persisted, reused
certificate lets TLS "just work" with zero certificate management, while the
lenient client default means clients connect to that self-signed certificate
without a manual trust step. Both choices are deliberate, not oversights:
strict-by-default would force PKI work this tool does not warrant. Plaintext-only
deployments are untouched — no certificate or key material is written for them —
and an operator who supplies a real certificate transparently overrides the
generated one.
Two clients diverge from "accept any certificate" because their gRPC stacks lack a per-channel skip-verify hook:
- Python uses trust-on-first-use: it fetches the server's presented certificate over a separate unverified probe and pins it for the channel, and defaults the SNI/target-name override to localhost (the generated certificate always carries a localhost SAN).
- Rust is pin-only: tonic exposes no public hook to inject a custom certificate verifier, so TLS over Rust requires either a pinned CA or an explicit opt-in to system-trust verification; otherwise connecting returns a clear, actionable error.
See Gateway Configuration — Automatic self-signed certificate and the per-client READMEs for the as-built behavior.
Later Revisit Items
These are explicit post-v1 revisit items, not open blockers:
- reconnectable sessions,
- multiple event subscribers per session,
- restricted worker service account,
- production coalescing by item handle,
- command batching for high-volume tag setup.