docs(audit): apply per-cluster judgment fixes across living docs

Resolve audit findings: correct WorkerEnvelope proto/route/metric/session
facts; rewrite auth (ZB.MOM.WW.Auth migration), dashboard (ZB.MOM.WW.Theme),
and StyleGuide (foreign-project copy-paste); document alarm subsystem, Ldap
options, and gateway alarm broker; fix client CLI flags and package paths.
Joseph Doherty
2026-06-03 16:01:28 -04:00
parent f84e0c3474
commit e541339c07
29 changed files with 1102 additions and 432 deletions
+3 -3
@@ -32,7 +32,7 @@ dotnet test src/MxGateway.Worker.Tests/MxGateway.Worker.Tests.csproj -p:Platform
dotnet run --project src/ZB.MOM.WW.MxGateway.Server/ZB.MOM.WW.MxGateway.Server.csproj
# API-key admin CLI (same exe, "apikey" subcommand)
dotnet run --project src/ZB.MOM.WW.MxGateway.Server/ZB.MOM.WW.MxGateway.Server.csproj -- apikey create --display-name "dev" --scopes session,invoke,event,metadata,admin
dotnet run --project src/ZB.MOM.WW.MxGateway.Server/ZB.MOM.WW.MxGateway.Server.csproj -- apikey create-key --key-id dev --display-name "dev" --scopes session:open,session:close,invoke:read,invoke:write,invoke:secure,events:read,metadata:read,admin
```
Single test by name (xUnit `--filter`):
@@ -77,7 +77,7 @@ powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1
- **Gateway restart does not reattach orphan workers.** The first version terminates orphaned workers on startup; do not design code paths that assume reattachment.
- **No Blazor UI component libraries.** Dashboard uses local Bootstrap CSS/JS only — do not introduce MudBlazor, Radzen, FluentUI, etc.
- **Don't log secrets or full tag values by default.** API keys, passwords, `WriteSecured` payloads, and `AuthenticateUser` credentials must never reach logs. Value logging is opt-in and redacted.
- **Generated code** under `src/MxGateway.Contracts/Generated/`, `clients/*/generated*/`, `clients/python/src/mxgateway/generated/`, etc., is build output. Don't hand-edit. To regenerate, build the contracts project (`dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj`) or run the per-client generation step in that client's README.
- **Generated code** under `src/MxGateway.Contracts/Generated/`, `clients/*/generated*/`, `clients/python/src/zb_mom_ww_mxgateway/generated/`, etc., is build output. Don't hand-edit. To regenerate, build the contracts project (`dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj`) or run the per-client generation step in that client's README.
- **Documentation style** (`StyleGuide.md`): PascalCase filenames, no marketing language, present tense, explain *why* not *what*.
- **Update docs in the same change as the source.** When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs (`gateway.md`, `docs/`, client READMEs, design docs) must change in the same commit. Don't leave stale prose describing old behavior.
@@ -114,7 +114,7 @@ External analysis sources referenced by design docs:
## Authentication
Gateway gRPC clients authenticate with an API key in metadata: `authorization: Bearer mxgw_<key-id>_<secret>`. Keys are stored hashed (with a peppered SHA) in a gateway-owned SQLite DB (default `C:\ProgramData\MxGateway\gateway-auth.db`). Scopes (`session`, `invoke`, `event`, `metadata`, `admin`) gate specific RPCs; missing → `Unauthenticated`, insufficient → `PermissionDenied`. The `apikey` subcommand on the server exe manages keys; see `src/MxGateway.Server/Security/Authentication/`.
Gateway gRPC clients authenticate with an API key in metadata: `authorization: Bearer mxgw_<key-id>_<secret>`. Keys are stored hashed (HMAC-SHA256 keyed by a server-side pepper) in a gateway-owned SQLite DB (default `C:\ProgramData\MxGateway\gateway-auth.db`). Scopes (`session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`) gate specific RPCs; missing → `Unauthenticated`, insufficient → `PermissionDenied`. The `apikey` subcommand on the server exe manages keys; see `src/MxGateway.Server/Security/Authentication/`.
Dashboard auth is LDAP-backed (separate from the gRPC API-key model). `/login` binds against `MxGateway:Ldap` and maps the user's LDAP groups to `Administrator` or `Viewer` via `MxGateway:Dashboard:GroupToRole`, then issues an HTTP-only secure `MxGatewayDashboard` cookie. SignalR hubs at `/hubs/{snapshot,alarms,events}` accept either the cookie or a 30-minute bearer minted at `/hubs/token`. `Dashboard:AllowAnonymousLocalhost` bypasses auth on loopback when enabled.
+76 -55
@@ -1,42 +1,48 @@
# Documentation Style Guide
This guide defines writing conventions and formatting rules for all ScadaBridge documentation.
This guide defines writing conventions and formatting rules for all MXAccess
Gateway (`mxaccessgw`) documentation.
## Tone and Voice
### Be Technical and Direct
Write for developers who are familiar with .NET. Don't explain basic concepts like dependency injection or async/await unless they're used in an unusual way.
Write for developers who are familiar with .NET. Don't explain basic concepts
like dependency injection or async/await unless they're used in an unusual way.
**Good:**
> The `ScadaGatewayActor` routes messages to the appropriate `ScadaClientActor` based on the client ID in the message.
> The `SessionManager` launches one worker per session and tracks it through the
> session state machine.
**Avoid:**
> The ScadaGatewayActor is a really powerful component that helps manage all your SCADA connections efficiently!
> The SessionManager is a really powerful component that helps manage all your
> MXAccess connections efficiently!
### Explain "Why" Not Just "What"
Document the reasoning behind patterns and decisions, not just the mechanics.
**Good:**
> Health checks use a 5-second timeout because actors under heavy load may take several seconds to respond, but longer delays indicate a real problem.
> The worker pumps Windows messages on its STA thread because a plain blocking
> queue does not let MXAccess COM events deliver.
**Avoid:**
> Health checks use a 5-second timeout.
> The worker pumps Windows messages on its STA thread.
### Use Present Tense
Describe what the code does, not what it will do.
**Good:**
> The actor validates the message before processing.
> The gateway terminates orphaned workers on startup.
**Avoid:**
> The actor will validate the message before processing.
> The gateway will terminate orphaned workers on startup.
### No Marketing Language
This is internal technical documentation. Avoid superlatives and promotional language.
This is internal technical documentation. Avoid superlatives and promotional
language.
**Avoid:** "powerful", "robust", "cutting-edge", "seamless", "blazing fast"
@@ -45,10 +51,10 @@ This is internal technical documentation. Avoid superlatives and promotional lan
### File Names
Use `PascalCase.md` for all documentation files:
- `Overview.md`
- `HealthChecks.md`
- `StateMachines.md`
- `SignalR.md`
- `Sessions.md`
- `GatewayConfiguration.md`
- `WorkerSta.md`
- `Diagnostics.md`
### Headings
@@ -58,11 +64,11 @@ Use `PascalCase.md` for all documentation files:
- **H4+ (`####`):** Rarely needed, Sentence case
```markdown
# Actor Health Checks
# Gateway Configuration
## Configuration Options
## Session Options
### Setting the timeout
### Setting the lease timeout
#### Default values
```
@@ -73,40 +79,43 @@ Always specify the language:
````markdown
```csharp
public class MyActor : ReceiveActor { }
public sealed class GatewaySession { }
```
```json
{
"Setting": "value"
"MxGateway": { "Sessions": { "MaxConcurrent": 8 } }
}
```
```bash
dotnet build
```powershell
dotnet build src/ZB.MOM.WW.MxGateway.slnx
```
````
Supported languages: `csharp`, `json`, `bash`, `xml`, `sql`, `yaml`, `html`, `css`, `javascript`
Supported languages: `csharp`, `json`, `bash`, `powershell`, `xml`, `sql`,
`text`, `rust`, `python`, `go`, `proto`, `html`, `css`, `toml`.
### Code Snippets
**Length:** 5-25 lines is typical. Shorter for simple concepts, longer for complete examples.
**Length:** 5-25 lines is typical. Shorter for simple concepts, longer for
complete examples.
**Context:** Include enough to understand where the code lives:
```csharp
// Good - shows class context
public class TemplateInstanceActor : ReceiveActor
public sealed class GatewaySession
{
public TemplateInstanceActor(TemplateInstanceConfig config)
public GatewaySession(SessionId sessionId, WorkerPipeSession pipe)
{
Receive<StartProcessing>(Handle);
_sessionId = sessionId;
_pipe = pipe;
}
}
// Avoid - orphaned snippet
Receive<StartProcessing>(Handle);
_pipe = pipe;
```
**Accuracy:** Only use code that exists in the codebase. Never invent examples.
@@ -134,34 +143,34 @@ Use tables for structured reference information:
```markdown
| Option | Default | Description |
|--------|---------|-------------|
| `Timeout` | `5000` | Milliseconds to wait |
| `RetryCount` | `3` | Number of retry attempts |
| `MaxConcurrent` | `8` | Maximum simultaneous sessions |
| `LeaseTimeoutSeconds` | `60` | Idle lease before sweep |
```
### Inline Code
Use backticks for:
- Class names: `ScadaGatewayActor`
- Method names: `HandleMessage()`
- Class names: `SessionManager`
- Method names: `KillWorkerAsync()`
- File names: `appsettings.json`
- Configuration keys: `ScadaBridge:Timeout`
- Configuration keys: `MxGateway:Sessions:MaxConcurrent`
- Command-line commands: `dotnet build`
### Links
Use relative paths for internal documentation:
```markdown
[See the Actors guide](../Akka/Actors.md)
[Configuration options](./Configuration.md)
[See the architecture overview](./gateway.md)
[Configuration options](./docs/GatewayConfiguration.md)
```
Use descriptive link text:
```markdown
<!-- Good -->
See the [Actor Health Checks](../Akka/HealthChecks.md) documentation.
See the [Gateway Configuration](./docs/GatewayConfiguration.md) documentation.
<!-- Avoid -->
See [here](../Akka/HealthChecks.md) for more.
See [here](./docs/GatewayConfiguration.md) for more.
```
## Structure Conventions
@@ -173,9 +182,10 @@ Every document starts with:
2. 1-2 sentence description of purpose
```markdown
# Actor Health Checks
# Worker STA Thread
Health checks monitor actor responsiveness and report status to the ASP.NET Core health check system.
The worker owns one MXAccess COM instance on a dedicated STA thread and pumps
Windows messages so MXAccess events deliver.
```
### Section Organization
@@ -194,15 +204,15 @@ Organize content from general to specific:
Place code examples immediately after the concept they illustrate:
```markdown
## Message Handling
## Session Close
Actors process messages using `Receive<T>` handlers:
The gateway closes a session by killing its worker behind the close gate:
```csharp
Receive<MyMessage>(msg => HandleMyMessage(msg));
await session.KillWorkerWithCloseGateAsync(cancellationToken);
```
Each handler processes one message type...
The close gate serializes concurrent close attempts...
```
### Related Documentation Section
@@ -212,9 +222,9 @@ End each document with links to related topics:
```markdown
## Related Documentation
- [Actor Patterns](./Patterns.md)
- [Health Checks](../Operations/HealthChecks.md)
- [Configuration](../Configuration/Akka.md)
- [Sessions](./docs/Sessions.md)
- [Worker STA Thread](./docs/WorkerSta.md)
- [Gateway Configuration](./docs/GatewayConfiguration.md)
```
## Naming Conventions
@@ -222,30 +232,33 @@ End each document with links to related topics:
### Match Code Exactly
Use the exact names from source code:
- `TemplateInstanceActor` not "Template Instance Actor"
- `ScadaGatewayActor` not "SCADA Gateway Actor"
- `IRequiredActor<T>` not "required actor interface"
- `MxStatusProxy` not "MX status proxy"
- `SessionManager` not "session manager"
- `OrphanWorkerTerminator` not "orphan worker terminator"
### Acronyms
Spell out on first use, then use acronym:
> OPC Unified Architecture (OPC UA) provides industrial communication standards. OPC UA servers expose...
> Single-threaded apartment (STA) threads serialize COM calls. STA message
> pumping lets MXAccess events deliver...
Common acronyms that don't need expansion:
- API
- JSON
- SQL
- HTTP/HTTPS
- REST
- JWT
- COM
- gRPC
- IPC
- STA
- UI
### File Paths
Use forward slashes and backticks:
- `src/Infrastructure/Akka/Actors/`
- `src/ZB.MOM.WW.MxGateway.Server/`
- `appsettings.json`
- `Documentation/Akka/Overview.md`
- `docs/GatewayConfiguration.md`
## What to Avoid
@@ -260,13 +273,14 @@ The constructor creates a new instance of the class.
<!-- Better - only document if there's something notable -->
## Constructor
The constructor accepts an `IActorRef` for the gateway actor, which must be resolved before actor creation.
The constructor accepts a `WorkerPipeSession`, which must be connected before
the session transitions out of `Handshaking`.
```
### Don't Duplicate Source Code Comments
If code has good comments, reference the file rather than copying:
> See `ScadaGatewayActor.cs` lines 45-60 for the message routing logic.
> See `SessionManager.cs` for the open-failure rollback order.
### Don't Include Temporary Information
@@ -278,5 +292,12 @@ Assume readers know:
- Dependency injection
- async/await
- LINQ
- Entity Framework basics
- ASP.NET Core middleware pipeline
- gRPC service basics
## Related Documentation
- [Architecture overview](./gateway.md)
- [Gateway Configuration](./docs/GatewayConfiguration.md)
- [C# Style Guide](./docs/style-guides/CSharpStyleGuide.md)
- [Go Style Guide](./docs/style-guides/GoStyleGuide.md), [Java Style Guide](./docs/style-guides/JavaStyleGuide.md), [Python Style Guide](./docs/style-guides/PythonStyleGuide.md), [Rust Style Guide](./docs/style-guides/RustStyleGuide.md), [Protobuf Style Guide](./docs/style-guides/ProtobufStyleGuide.md)
-3
@@ -32,8 +32,6 @@ clients/dotnet/
Commands/
ZB.MOM.WW.MxGateway.Client.Tests/
ZB.MOM.WW.MxGateway.Client.Tests.csproj
ZB.MOM.WW.MxGateway.Client.IntegrationTests/
ZB.MOM.WW.MxGateway.Client.IntegrationTests.csproj
```
Target framework:
@@ -52,7 +50,6 @@ Expected packages:
- `Grpc.Net.Client`
- `Google.Protobuf`
- `Grpc.Tools` for generation
- `Microsoft.Extensions.Logging.Abstractions`
- `System.CommandLine` or similar for CLI
- test framework: xUnit or NUnit
+3
@@ -27,6 +27,9 @@ clients/go/
internal/generated/
mxaccess_gateway.pb.go
mxaccess_gateway_grpc.pb.go
galaxy_repository.pb.go
galaxy_repository_grpc.pb.go
mxaccess_worker.pb.go
cmd/mxgw-go/
main.go
tests/
+1 -1
@@ -140,7 +140,7 @@ pairs `Children` with `ChildHasChildren` so you know which nodes to expand. See
request and filter semantics.
```go
import pb "gitea.dohertylan.com/dohertj2/mxaccessgw/clients/go/internal/generated/galaxy_repository/v1"
import pb "gitea.dohertylan.com/dohertj2/mxaccessgw/clients/go/internal/generated"
reply, err := galaxy.BrowseChildren(ctx, &pb.BrowseChildrenRequest{})
if err != nil {
+2 -2
@@ -62,8 +62,8 @@ cargo run -p mxgw-cli -- register --session-id <session-id> --client-name mxgw-r
cargo run -p mxgw-cli -- add-item --session-id <session-id> --server-handle 1 --item TestChildObject.TestInt --json
cargo run -p mxgw-cli -- advise --session-id <session-id> --server-handle 1 --item-handle 1 --json
cargo run -p mxgw-cli -- stream-events --session-id <session-id> --max-events 1 --json
cargo run -p mxgw-cli -- stream-alarms --session-id <session-id> --max-messages 1 --json
cargo run -p mxgw-cli -- acknowledge-alarm --session-id <session-id> --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json
cargo run -p mxgw-cli -- stream-alarms --max-events 1 --json
cargo run -p mxgw-cli -- acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --json
cargo run -p mxgw-cli -- write --session-id <session-id> --server-handle 1 --item-handle 1 --value-type int32 --value 123 --json
```
+187 -38
@@ -67,9 +67,17 @@ list.
## What this means
The architecture comment on
`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/AlarmClientConsumer.cs` (PR A.5) is
**wrong against this deployed assembly**:
> **Historical note (current as built).** This discovery record predates the
> as-built alarm path. The `AlarmClientConsumer.cs` file referenced below was
> retired; the production consumer is
> `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs` (driven by the
> `wwAlarmConsumerClass` COM surface — see [Option A](#option-a--captured-2026-05-01)
> below). The current public RPC surface and broker architecture are summarized
> in [Current alarm path (as built)](#current-alarm-path-as-built) at the end of
> this document; the sections in between are kept as a discovery record.
The architecture comment on the (now-retired) `AlarmClientConsumer.cs` (PR A.5)
was **wrong against this deployed assembly**:
> "The AVEVA alarm-manager surface (`IAlarmMgrDataProvider`) exposes
> the events we need as plain .NET events — no Windows message pump
@@ -601,8 +609,14 @@ returned to normal but is unacknowledged — i.e., visible in the
"current alarms" list because operator hasn't acked it yet) and
`UNACK_ALM` (the alarm is currently active and unacknowledged).
The other states from `eAlmState` (`ACK_RTN`, `ACK_ALM`) would
appear when an ack is performed — `wwAlarmConsumerClass.AlarmAckByGUID`
is the method to call.
appear when an ack is performed.
> **Forward reference / superseded:** an earlier draft named
> `wwAlarmConsumerClass.AlarmAckByGUID` as the ack method. That call turned out
> to be **`E_NOTIMPL`** on this AVEVA build (see
> [`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented)
> below). The as-built ack path is the v1 6-arg `AlarmAckByName` on a dedicated
> ack-only consumer instance. Do not wire acks through `AlarmAckByGUID`.
### `GetStatistics` AV — unrelated quirk
@@ -638,20 +652,25 @@ alarm-consumer surface unblocks A.2 fully. Outline:
payload; diff against the previous snapshot (keyed by
`GUID`); emit `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
events for added/changed/removed records.
- `AlarmAckByGUID(VBGUID, comment, oprName, node, domain,
fullName)` for client-driven acknowledgements (matches
PR A.5's `AlarmAckCommand` payload).
- Client-driven acknowledgements. (This draft named `AlarmAckByGUID` and a
`AlarmAckCommand` payload; as built the ack proto is
`AcknowledgeAlarmCommand` / `AcknowledgeAlarmByNameCommand`, the consumer
interface method is `AcknowledgeByGuid` / `AcknowledgeByName`, and the GUID
path is `E_NOTIMPL` so only the by-name path runs — see
[`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented).)
- Lifecycle teardown: `DeregisterConsumer` +
`UninitializeConsumer` + `Marshal.FinalReleaseComObject`.
3. **Conversion layer:** map XML record fields to
`MxAlarmConditionRecord` proto:
- `GUID` → `condition_id` (canonicalize the no-dashes hex
to a UUID string).
- `STATE` enum → `inAlarm` + `acked` booleans
(`UNACK_ALM` → in_alarm=true, acked=false;
`UNACK_RTN` → in_alarm=false, acked=false;
`ACK_ALM` → in_alarm=true, acked=true;
`ACK_RTN` → in_alarm=false, acked=true).
3. **Conversion layer:** map XML record fields to the alarm proto:
- `GUID` and `PROVIDER_NAME!GROUP.TAGNAME` → `alarm_full_reference` (there is
no `condition_id` field; the public RPC and worker carry the reference as
`alarm_full_reference`, either a canonical GUID or `Provider!Group.Tag`).
- `STATE` → `AlarmConditionState` on `ActiveAlarmSnapshot.current_state`
(this draft used `inAlarm` + `acked` booleans, which the proto does not
have). As built, the snapshot state collapses to three values:
`UNACK_ALM` → `Active`; `ACK_ALM` → `ActiveAcked`; `UNACK_RTN` and
`ACK_RTN` both → `Inactive` (a returned-to-normal alarm is no longer
"active"). For the live `transition` feed the `STATE` instead drives an
`AlarmTransitionKind` (`Raise` / `Acknowledge` / `Clear`).
- `DATE + TIME + GMTOFFSET + DSTADJUST` → reassemble UTC
timestamp; matches the worker's existing `Timestamp`
wire format.
@@ -663,10 +682,14 @@ alarm-consumer surface unblocks A.2 fully. Outline:
`aaAlarmManagedClient`, also true here). The existing
`AlarmClientConsumer` skips Initialize entirely; the new
`WnWrapAlarmConsumer` includes it from day one.
5. **Test reuse:** PR A.5's snapshot/ack contract tests can
stay — they don't touch the underlying COM API. Add a new
integration test against the wnwrap surface (live-AVEVA-only,
Skip-gated like the probe).
5. **Test reuse:** the snapshot/ack contract tests stayed — they don't touch
the underlying COM API. As built, the alarm tests live under
`src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/` (`AlarmDispatcherTests`,
`AlarmRecordTransitionMapperTests`, `AlarmCommandHandlerTests`,
`AlarmCommandExecutorTests`, `WnWrapAlarmConsumerXmlTests`), with the
live-AVEVA-only round-trip in
`src/ZB.MOM.WW.MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs`
(Skip-gated like the probe).
### Settled API-ordering and surface knowledge
@@ -752,26 +775,47 @@ AVEVA fixes the v2 method later.
The v2 `AlarmAckByGUID(VBGUID, …)` throws `NotImplementedException`
(COM `E_NOTIMPL`) on `wwAlarmConsumerClass` against this AVEVA
build. The reference→GUID lookup that we initially planned to wire
through `AlarmAckByGUID` is therefore not viable on wnwrap; all acks
must go through `AlarmAckByName`.
through `AlarmAckByGUID` is therefore not viable on wnwrap; only the
by-name path actually succeeds.
The proto `AcknowledgeAlarmCommand` (GUID-based) and the worker's
`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain
in the codebase for the forward-compat shape, but the gateway-side
`WorkerAlarmRpcDispatcher.AcknowledgeAsync` now always routes through
`AcknowledgeAlarmByName` when the public RPC supplies a recognizable
`Provider!Group.Tag` reference.
**Routing as built (and the GUID hazard).** The gateway-side router is
`GatewayAlarmMonitor.BuildAcknowledgeCommand` (there is no
`WorkerAlarmRpcDispatcher` type). Routing is **conditional on the reference
shape**, not unconditional:
### 5. STA / threading — production fix needed
- A reference that `Guid.TryParse` accepts is built into
`MxCommandKind.AcknowledgeAlarm` / `AcknowledgeAlarmCommand` — the **GUID
path**, which the worker dispatches to `AlarmAckByGUID`.
- A `Provider!Group.Tag` reference (parsed by
`GatewayAlarmMonitor.TryParseAlarmReference`) is built into
`MxCommandKind.AcknowledgeAlarmByName` / `AcknowledgeAlarmByNameCommand` — the
by-name path, which is the only one that succeeds on this build.
- Anything else fails with an `alarm_full_reference` parse error before any
worker call.
The wnwrap COM is `ThreadingModel=Apartment`. The consumer's
internal `Timer` fires on threadpool threads and would block forever
on cross-apartment marshaling unless the host STA pumps Win32
messages. The smoke test sidesteps this by setting
`pollIntervalMilliseconds=0` (Timer disabled) and driving `PollOnce`
manually from the test's STA. Production hosting will route polls
through the worker's `StaRuntime` in a follow-up — the consumer's
`PollOnce` is `public` and idempotent so the wire-up is mechanical.
The GUID arm is **still dispatched unguarded**: the proto
`AcknowledgeAlarmCommand` and the worker's
`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain in the
codebase for forward compatibility, and `BuildAcknowledgeCommand` routes a
GUID-shaped reference straight to them. On the deployed wnwrap build that path
hits the `E_NOTIMPL` `AlarmAckByGUID` and surfaces a `COMException` rather than
acknowledging. **Practical guidance:** acknowledge with the
`Provider!Group.Tag` reference (the same form the transition feed emits in
`alarm_full_reference`), not a raw GUID, until the GUID arm is either guarded or
AVEVA implements `AlarmAckByGUID`.
### 5. STA / threading
The wnwrap COM is `ThreadingModel=Apartment`, so every consumer call
(`Subscribe`, `PollOnce`, the `AcknowledgeBy*` methods) must run on the STA that
created the COM instance. As built, `WnWrapAlarmConsumer` owns **no internal
timer and takes no `pollIntervalMilliseconds` parameter** — an earlier draft
described a self-driven `Timer` that would have blocked on cross-apartment
marshaling, but that design was dropped. Instead `PollOnce()` is a `public`,
idempotent method the host drives on the worker's STA (via
`StaRuntime.InvokeAsync(() => consumer.PollOnce())`); the poll cadence lives in
the host, not the consumer. Each `PollOnce` reads `GetXmlCurrentAlarms2`, diffs
against the previous snapshot, and emits transition events.
### Capture summary
@@ -790,3 +834,108 @@ Post-ack transition: kind=Clear …
10s cadence held throughout; full proto fields populated correctly;
ack registered server-side without errors.
## Current alarm path (as built)
The sections above are a discovery record. This section summarizes the path that
actually ships, grounded in the current code. For the proto shapes see
[Contracts](./Contracts.md#alarm-rpcs-and-messages); for the server handlers see
[gRPC](./Grpc.md); for configuration see
[Gateway Configuration](./GatewayConfiguration.md#alarm-options).
### Public RPCs and configuration
Alarms are exposed through three **session-less** RPCs on `MxAccessGateway`:
`AcknowledgeAlarm`, `StreamAlarms`, and `QueryActiveAlarms`. No client opens a
worker session to use them. They are gated by `MxGateway:Alarms:*`:
- `MxGateway:Alarms:Enabled` (default `false`) turns the whole subsystem on.
- `MxGateway:Alarms:SubscriptionExpression` is the canonical
`\\<machine>\Galaxy!<area>` subscription; when empty, the monitor falls back
to `\\<MachineName>\Galaxy!<DefaultArea>` from `MxGateway:Alarms:DefaultArea`.
  If the subsystem is enabled while both `SubscriptionExpression` and
  `DefaultArea` are empty, the monitor faults with a configuration diagnostic.
- `MxGateway:Alarms:ReconcileIntervalSeconds` (default 30, floored at 5) sets the
reconcile cadence below.
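
The fallback and floor rules above can be sketched in a few lines. This is a minimal Python illustration, not the gateway's C# code: `AlarmOptions`, `resolve_subscription`, and `reconcile_interval` are hypothetical names, and the machine name is passed in rather than read from the host.

```python
from dataclasses import dataclass


@dataclass
class AlarmOptions:
    """Hypothetical mirror of the MxGateway:Alarms:* keys described above."""
    enabled: bool = False
    subscription_expression: str = ""
    default_area: str = ""
    reconcile_interval_seconds: int = 30


def resolve_subscription(options: AlarmOptions, machine_name: str) -> str:
    """Prefer the explicit expression, else build one from DefaultArea."""
    if options.subscription_expression:
        return options.subscription_expression
    if options.default_area:
        # Raw f-string: yields \\<machine>\Galaxy!<area>
        return rf"\\{machine_name}\Galaxy!{options.default_area}"
    # Both empty: modelled here as an exception (the doc calls it a fault).
    raise ValueError("alarms enabled but no SubscriptionExpression or DefaultArea")


def reconcile_interval(options: AlarmOptions) -> int:
    """The configured interval, floored at 5 seconds."""
    return max(5, options.reconcile_interval_seconds)
```

With `default_area="Area001"` and machine name `HOST`, the resolved expression is `\\HOST\Galaxy!Area001`.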
### The always-on `GatewayAlarmMonitor` broker
`GatewayAlarmMonitor` (`src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs`)
is registered by `AddGatewayAlarms` as a singleton, as the `IGatewayAlarmService`,
and as a hosted `BackgroundService`. When `Enabled`, it:
1. Opens **one** gateway-managed worker session dedicated to alarms (client name
   `gateway-alarm-monitor`, backend `Galaxy`), after a brief startup grace period
   so that worker launch and orphan cleanup can settle.
2. Subscribes that session to the resolved subscription expression and feeds an
in-process active-alarm cache (`Dictionary<reference, ActiveAlarmSnapshot>`)
from the session's transition events.
3. Fans the feed out to **any number** of `StreamAlarms` subscribers — clients
never open their own session. The session is transparently re-opened with a
5-second backoff if the worker faults.
### `AlarmFeedMessage` stream protocol
`StreamAsync` (behind `StreamAlarms`) emits, in order:
1. one `AlarmFeedMessage { active_alarm }` per currently-cached alarm matching
the optional `alarm_filter_prefix`,
2. a single `AlarmFeedMessage { snapshot_complete = true }` sentinel,
3. then one `AlarmFeedMessage { transition }` per live change.
The subscriber is registered under the monitor lock **before** the snapshot is
taken, so no transition can slip between the snapshot and the live tail.
`QueryActiveAlarms` reuses the same cache but emits only the `active_alarm`
snapshots and completes — no sentinel, no transitions.
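
The message ordering can be illustrated with a small generator. This sketch only models the order of messages. It does not model the monitor lock or the gRPC stream, and the dictionary message shapes are stand-ins for the real `AlarmFeedMessage` proto. Whether `alarm_filter_prefix` also applies to live transitions is not stated above, so the sketch leaves transitions unfiltered.

```python
from typing import Iterable, Iterator


def stream_alarms(cache: dict, live: Iterable[dict], prefix: str = "") -> Iterator[dict]:
    """Yield snapshot messages, the sentinel, then live transitions."""
    # 1. One active_alarm message per cached alarm matching the prefix.
    for reference, snapshot in list(cache.items()):
        if reference.startswith(prefix):
            yield {"active_alarm": snapshot}
    # 2. Exactly one sentinel marking the end of the snapshot.
    yield {"snapshot_complete": True}
    # 3. One transition message per live change.
    for transition in live:
        yield {"transition": transition}
```

A client can treat everything before the sentinel as initial state and everything after it as incremental updates.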
### Reconcile loop
A `PeriodicTimer` runs `ReconcileAsync` every
`max(5, ReconcileIntervalSeconds)` seconds. It pulls the worker's authoritative
active-alarm snapshot and replaces the cache, broadcasting a synthetic `Clear`
transition for any cached alarm the snapshot no longer contains and a synthetic
`Raise` for any alarm the snapshot adds. This catches transitions the live
poll-and-diff feed missed (e.g. across a transport blip). A failed reconcile
pass logs at Debug and keeps the current cache.
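
A reconcile pass amounts to a set difference between the cached references and the authoritative snapshot. The following Python sketch assumes both are plain dictionaries keyed by `alarm_full_reference`. The `reconcile` function and the transition dictionaries are illustrative, not the gateway's types.

```python
def reconcile(cache: dict, authoritative: dict) -> list:
    """Replace the cache with the snapshot and return synthetic transitions."""
    # Alarms that were cached but are no longer in the snapshot.
    cleared = sorted(cache.keys() - authoritative.keys())
    # Alarms that the snapshot adds.
    raised = sorted(authoritative.keys() - cache.keys())
    transitions = [{"kind": "Clear", "alarm_full_reference": r} for r in cleared]
    transitions += [{"kind": "Raise", "alarm_full_reference": r} for r in raised]
    # The snapshot is authoritative, so the cache is replaced wholesale.
    cache.clear()
    cache.update(authoritative)
    return transitions
```

Alarms present on both sides produce no synthetic transition here; their cached state is simply overwritten by the snapshot.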
### Subscriber backpressure
Each subscriber gets a bounded channel of **2048** messages
(`SubscriberQueueCapacity`). When `Broadcast` cannot write to a subscriber (its
channel is full), that subscriber is **completed with an error and dropped** —
the error message tells the client to reconnect to re-snapshot. Backpressure
from one slow consumer never blocks the broker or other subscribers.
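
The drop-on-full policy can be sketched with a bounded queue and a non-blocking write. This is a single-threaded Python illustration under stated assumptions: `Subscriber` and `broadcast` are hypothetical names, the standard-library `queue.Queue` stands in for the .NET bounded channel, and the error text is invented.

```python
import queue

SUBSCRIBER_QUEUE_CAPACITY = 2048  # capacity quoted in the text above


class Subscriber:
    """Illustrative subscriber with a bounded message queue."""

    def __init__(self, capacity: int = SUBSCRIBER_QUEUE_CAPACITY):
        self.messages = queue.Queue(maxsize=capacity)
        self.error = None


def broadcast(subscribers: list, message) -> None:
    """Write to every subscriber without blocking; drop the ones that are full."""
    for subscriber in list(subscribers):  # iterate a copy while removing
        try:
            subscriber.messages.put_nowait(message)
        except queue.Full:
            # Complete the slow subscriber with an error and stop feeding it.
            subscriber.error = "alarm feed overflow; reconnect to re-snapshot"
            subscribers.remove(subscriber)
```

Because the write never blocks, one full queue costs only that subscriber its stream; the remaining subscribers still receive the message.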
### Snapshot state collapse
`ActiveAlarmSnapshot.current_state` carries only three `AlarmConditionState`
values, so the four AVEVA `STATE`s collapse: `UNACK_ALM` → `Active`,
`ACK_ALM` → `ActiveAcked`, and both `UNACK_RTN` and `ACK_RTN` → `Inactive`
(`AlarmDispatcher`). A returned-to-normal alarm is reported as `Inactive` in a
snapshot even though it is still listed because it is unacknowledged. The live
`transition` feed instead reports `AlarmTransitionKind` (`Raise` / `Acknowledge`
/ `Clear`).
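
The four-to-three collapse is a plain lookup. The Python enum below is an illustrative stand-in for the proto `AlarmConditionState`; only the mapping itself comes from the text above.

```python
from enum import Enum


class AlarmConditionState(Enum):
    """Illustrative stand-in for the proto enum."""
    ACTIVE = "Active"
    ACTIVE_ACKED = "ActiveAcked"
    INACTIVE = "Inactive"


# AVEVA STATE value -> snapshot state, as described above.
STATE_TO_CONDITION = {
    "UNACK_ALM": AlarmConditionState.ACTIVE,
    "ACK_ALM": AlarmConditionState.ACTIVE_ACKED,
    "UNACK_RTN": AlarmConditionState.INACTIVE,
    "ACK_RTN": AlarmConditionState.INACTIVE,
}


def collapse_state(state: str) -> AlarmConditionState:
    """Map an AVEVA STATE string to the three-valued snapshot state."""
    return STATE_TO_CONDITION[state]
```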
### `alarm_full_reference` parse contract
`AcknowledgeAlarm` accepts either form in `alarm_full_reference`
(`GatewayAlarmMonitor.BuildAcknowledgeCommand`):
- a canonical GUID (`Guid.TryParse`) → GUID ack path
(`AcknowledgeAlarmCommand`), which on the deployed wnwrap build hits the
`E_NOTIMPL` `AlarmAckByGUID` — see
[`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented);
- a `Provider!Group.Tag` reference (`TryParseAlarmReference`: first `!` splits
provider from `Group.Tag`, the first `.` after the `!` splits group from tag)
→ by-name ack path (`AcknowledgeAlarmByNameCommand`), the path that works;
- anything else → a parse error before any worker call.
The transition feed emits the `Provider!Group.Tag` form in
`alarm_full_reference`, so echoing that value back into `AcknowledgeAlarm` takes
the working by-name path.
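
The three-way routing can be approximated as follows. This Python sketch is an assumption-laden illustration: `classify_reference` is a hypothetical name, `uuid.UUID` only approximates `Guid.TryParse` (the two accept slightly different input formats), and rejecting empty provider, group, or tag components is a guess rather than confirmed behavior.

```python
import uuid


def classify_reference(reference: str) -> tuple:
    """Route an alarm_full_reference to the GUID or by-name ack path."""
    try:
        uuid.UUID(reference)
        return ("guid", reference)
    except ValueError:
        pass
    # First "!" splits provider from Group.Tag.
    provider, bang, rest = reference.partition("!")
    # First "." after the "!" splits group from tag.
    group, dot, tag = rest.partition(".")
    if bang and dot and provider and group and tag:
        return ("by_name", provider, group, tag)
    raise ValueError(f"unparseable alarm_full_reference: {reference!r}")
```

For example, `Galaxy!Area001.Pump001.PumpFault` splits into provider `Galaxy`, group `Area001`, and tag `Pump001.PumpFault`.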
### Reserved / unused
`AlarmTransitionKind.RETRIGGER` is defined in the proto but is **not currently
produced** — the transition mapper emits only `Raise` / `Acknowledge` / `Clear`.
It is reserved for a future "re-raise from a previously cleared condition"
distinction.
+68 -50
@@ -2,11 +2,13 @@
The gateway authentication subsystem verifies inbound API key credentials against a SQLite-backed key store, hashes secrets with a configurable pepper, and records administrative and verification events to an audit trail.
The peppered-HMAC API-key pipeline — token format, parsing, secret generation and hashing, constant-time comparison, the SQLite schema, the stores, the verifier, and the migrator — lives in the shared `ZB.MOM.WW.Auth.ApiKeys` package (with abstractions in `ZB.MOM.WW.Auth.Abstractions`), of which this gateway is the donor. The gateway references the package and binds the library's `ApiKeyOptions` from its own `MxGateway:Authentication` section through `AddSqliteAuthStore`, then layers the gateway-specific pieces on top: constraint enforcement, the gRPC authorization interceptor, the admin CLI, the dashboard API Keys page, and canonical audit forwarding. Types whose code is shown below for reference are owned by the shared package unless noted; the gateway does not re-implement them.
## Token Format
API keys travel in the HTTP `Authorization` header as a bearer token shaped `mxgw_<keyId>_<secret>`. The `mxgw_` prefix scopes parsing to gateway tokens, the `<keyId>` segment is the public identifier used for lookup, and `<secret>` is the high-entropy portion that the gateway verifies against a stored hash.
`ApiKeyParser` enforces the format and rejects malformed tokens before any database round-trip:
The shared library's `ApiKeyParser` enforces the format and rejects malformed tokens before any database round-trip:
```csharp
public bool TryParseAuthorizationHeader(string? authorizationHeader, out ParsedApiKey? apiKey)
@@ -50,7 +52,7 @@ public static string Generate()
### Peppered hashing
`ApiKeySecretHasher` (registered behind `IApiKeySecretHasher`) hashes secrets with `HMACSHA256` keyed by a server-side pepper. The pepper lives outside the database and is resolved by `IConfiguration` lookup against the configured `PepperSecretName`:
The shared library's `ApiKeySecretHasher` (behind `IApiKeySecretHasher`) hashes secrets with `HMACSHA256` keyed by a server-side pepper. The pepper lives outside the database and is resolved through an `IApiKeyPepperProvider` — the gateway wires the configuration-backed provider so the pepper comes from `IConfiguration` lookup against `MxGateway:ApiKeyPepper` (`PepperSecretName`):
```csharp
public byte[] HashSecret(string secret)
@@ -69,37 +71,29 @@ The pepper is intentionally not stored alongside the hash: an attacker who exfil
## Verification
`ApiKeyVerifier` (`IApiKeyVerifier`) implements the verification flow:
The shared library's `IApiKeyVerifier.VerifyAsync(authorizationHeader, cancellationToken)` owns the whole verification flow — the gateway interceptor hands it the raw `authorization` header value and never parses the token itself:
1. Parse the `Authorization` header into a `ParsedApiKey`.
2. Look up the `ApiKeyRecord` by `KeyId` through `IApiKeyStore.FindByKeyIdAsync`.
3. Reject revoked records (`RevokedUtc is not null`).
1. Parse the `Authorization` header into the key id and secret.
2. Look up the record by key id.
3. Reject revoked records.
4. Hash the presented secret with the configured pepper.
5. Compare hashes with `CryptographicOperations.FixedTimeEquals` to avoid timing oracles.
6. Record a `LastUsedUtc` timestamp via `MarkKeyUsedAsync` and return an `ApiKeyIdentity`.
6. Stamp `last_used_utc` and return an identity.
`VerifyAsync` returns an `ApiKeyVerification` value with a `Succeeded` flag and a nullable `Identity`. On failure the result is discriminated so the caller can tell parse errors, missing pepper, missing or revoked keys, and secret mismatch apart for audit detail — without leaking which check failed to the client. The gateway interceptor treats any non-success uniformly as `Unauthenticated` (see [Authorization](./Authorization.md)):
```csharp
if (!CryptographicOperations.FixedTimeEquals(presentedHash, storedKey.SecretHash))
{
return ApiKeyVerificationResult.Fail(ApiKeyVerificationFailure.SecretMismatch);
}
await keyStore.MarkKeyUsedAsync(storedKey.KeyId, DateTimeOffset.UtcNow, cancellationToken)
ApiKeyVerification verification = await apiKeyVerifier
.VerifyAsync(authorizationHeader ?? string.Empty, context.CancellationToken)
.ConfigureAwait(false);
return ApiKeyVerificationResult.Success(new ApiKeyIdentity(
KeyId: storedKey.KeyId,
KeyPrefix: storedKey.KeyPrefix,
DisplayName: storedKey.DisplayName,
Scopes: storedKey.Scopes,
Constraints: storedKey.Constraints));
if (!verification.Succeeded || verification.Identity is null)
{
throw new RpcException(new Status(StatusCode.Unauthenticated, "Missing or invalid API key."));
}
```
`ApiKeyVerificationResult` carries either an `ApiKeyIdentity` or a discriminated `ApiKeyVerificationFailure` value. The failure enum distinguishes parse errors, missing pepper, missing or revoked keys, and secret mismatch so the calling middleware can emit precise audit detail without leaking which check failed to the client.
`ApiKeyIdentity` exposes only non-secret fields (`KeyId`, `KeyPrefix`,
`DisplayName`, `Scopes`, and `Constraints`) and is the type downstream
authorization code consumes.
The shared verifier returns `ZB.MOM.WW.Auth.Abstractions.ApiKeys.ApiKeyIdentity`, which carries the persisted constraints as an opaque JSON string. The gateway's `GatewayApiKeyIdentityMapper.ToGatewayIdentity` projects it onto the gateway-local `ApiKeyIdentity` record, which exposes only non-secret fields (`KeyId`, `KeyPrefix`, `DisplayName`, `Scopes`) plus the deserialized `Constraints`, and is the type downstream authorization code consumes.
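Steps 4 and 5 of the verification flow (hash with the pepper, then compare in constant time) can be sketched as follows. This is an illustration only: `PepperedHashSketch` is a hypothetical name, and the UTF-8 encoding of the secret is an assumption, not a confirmed detail of the shared library's hasher.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class PepperedHashSketch
{
    // HMAC-SHA256 over the secret, keyed by the server-side pepper.
    // Assumption: the secret is encoded as UTF-8 before hashing.
    public static byte[] HashSecret(string secret, byte[] pepper)
    {
        using HMACSHA256 hmac = new(pepper);
        return hmac.ComputeHash(Encoding.UTF8.GetBytes(secret));
    }

    // Compares the presented secret against the stored hash without
    // short-circuiting on the first differing byte.
    public static bool Matches(string presentedSecret, byte[] storedHash, byte[] pepper)
    {
        byte[] presentedHash = HashSecret(presentedSecret, pepper);
        return CryptographicOperations.FixedTimeEquals(presentedHash, storedHash);
    }
}
```

Because the pepper keys the HMAC, a stored hash is useless for verification without it, which is why it is kept outside the database.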
## Storage
@@ -107,7 +101,7 @@ The gateway keeps API key state in a dedicated SQLite database. SQLite is suffic
### Connection factory
`AuthSqliteConnectionFactory` reads `GatewayOptions.Authentication.SqlitePath`, ensures the parent directory exists, and builds a connection string in `ReadWriteCreate` mode so first-run installations can create the file without manual provisioning. Connection pooling is enabled and the connection string carries a non-zero `DefaultTimeout`:
The shared library's `AuthSqliteConnectionFactory` (registered by `AddZbApiKeyAuth`) reads the bound `ApiKeyOptions.SqlitePath` (populated by the gateway from `MxGateway:Authentication:SqlitePath`), ensures the parent directory exists, and builds a connection string in `ReadWriteCreate` mode so first-run installations can create the file without manual provisioning. Connection pooling is enabled and the connection string carries a non-zero `DefaultTimeout`:
```csharp
SqliteConnectionStringBuilder builder = new()
@@ -119,21 +113,22 @@ SqliteConnectionStringBuilder builder = new()
};
```
Every store opens its connection through `OpenConnectionAsync`, which opens the connection and then applies `PRAGMA journal_mode=WAL` and `PRAGMA busy_timeout`. WAL is a persistent database-level setting so re-applying it per connection is a cheap no-op; `busy_timeout` is per-connection state. Because `MarkKeyUsedAsync` runs on every authenticated request and `SqliteApiKeyAuditStore` appends on every denial, this lets concurrent readers and writers retry briefly instead of surfacing `SQLITE_BUSY` as a hard failure on the request path.
Every store opens its connection through `OpenConnectionAsync`, which opens the connection and then applies `PRAGMA journal_mode=WAL` and `PRAGMA busy_timeout`. WAL is a persistent database-level setting so re-applying it per connection is a cheap no-op; `busy_timeout` is per-connection state. Because `MarkKeyUsedAsync` runs on every authenticated request and the canonical audit writer appends to the same file, this lets concurrent readers and writers retry briefly instead of surfacing `SQLITE_BUSY` as a hard failure on the request path.
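A rough sketch of the open-then-configure pattern, written against the provider-neutral `DbConnection` base class so it needs no SQLite package to compile. `PragmaSketch`, both method names, and any concrete timeout value are assumptions for illustration; the library's actual `OpenConnectionAsync` may differ.

```csharp
using System;
using System.Data.Common;
using System.Globalization;
using System.Threading;
using System.Threading.Tasks;

public static class PragmaSketch
{
    // Builds the per-connection setup text. journal_mode=WAL persists in the
    // database file; busy_timeout (milliseconds) applies to this connection only.
    public static string BuildSetupSql(int busyTimeoutMs)
    {
        if (busyTimeoutMs < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(busyTimeoutMs));
        }

        return "PRAGMA journal_mode=WAL; PRAGMA busy_timeout="
            + busyTimeoutMs.ToString(CultureInfo.InvariantCulture) + ";";
    }

    // Opens the connection, then applies the pragmas before the caller uses it.
    public static async Task OpenWithPragmasAsync(
        DbConnection connection, int busyTimeoutMs, CancellationToken cancellationToken)
    {
        await connection.OpenAsync(cancellationToken).ConfigureAwait(false);

        using DbCommand command = connection.CreateCommand();
        command.CommandText = BuildSetupSql(busyTimeoutMs);
        await command.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
    }
}
```

With a busy timeout in place, a writer that finds the database locked waits up to that many milliseconds before SQLite reports `SQLITE_BUSY`.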
### Schema
`SqliteAuthSchema` declares table names and the current schema version as constants. Three tables are involved:
The shared library's `SqliteAuthSchema` declares the API-key table names and the current schema version as constants. Four tables live in the database file:
- `api_keys` stores `key_id`, `key_prefix`, the `secret_hash` blob,
`display_name`, serialized `scopes`, optional serialized `constraints`, and
the `created_utc`, `last_used_utc`, and `revoked_utc` timestamps.
- `api_key_audit` is an append-only log keyed by an autoincrement `audit_id` with `key_id`, `event_type`, `remote_address`, `created_utc`, and `details` columns.
- `api_key_audit` is the shared library's append-only audit log keyed by an autoincrement `audit_id` with `key_id`, `event_type`, `remote_address`, `created_utc`, and `details` columns. The gateway overrides the library audit store (see [Audit trail](#audit-trail)), so this table is **left in place but unused** at runtime — nothing writes to it.
- `audit_event` is the gateway-owned canonical audit table written by `SqliteCanonicalAuditStore`. It lives in the same SQLite file (reusing the library's `AuthSqliteConnectionFactory`) and is where every gateway audit event actually lands. See [Audit trail](#audit-trail).
- `schema_version` carries a single row whose `version` column is matched against `SqliteAuthSchema.CurrentVersion`.
### Read paths
`SqliteApiKeyStore` (`IApiKeyStore`) handles the two reads needed at request time: `FindByKeyIdAsync` returns any record (so revoked keys can be reported distinctly) and `FindActiveByKeyIdAsync` filters to non-revoked rows. `MarkKeyUsedAsync` updates `last_used_utc` only for non-revoked rows so a freshly revoked key cannot have its timestamp refreshed by a racing verification.
The shared library's `SqliteApiKeyStore` (`IApiKeyStore`) handles the two reads needed at request time: `FindByKeyIdAsync` returns any record (so revoked keys can be reported distinctly) and `FindActiveByKeyIdAsync` filters to non-revoked rows. `MarkKeyUsedAsync` updates `last_used_utc` only for non-revoked rows so a freshly revoked key cannot have its timestamp refreshed by a racing verification.
`ApiKeyRecord` is the in-memory projection. `ApiKeyRecordReader.Read` is shared by every read path so column ordering is defined in one place:
@@ -155,17 +150,21 @@ public static ApiKeyRecord Read(SqliteDataReader reader)
### Write paths
`SqliteApiKeyAdminStore` (`IApiKeyAdminStore`) implements administrative mutations: `CreateAsync` accepts an `ApiKeyCreateRequest`, `RevokeAsync` sets `revoked_utc` only when not already revoked, `RotateAsync` replaces `secret_hash`, clears `last_used_utc`, and clears `revoked_utc` so a rotated key is immediately usable, and `DeleteAsync` permanently removes a row but only when `revoked_utc IS NOT NULL` — active keys are untouched (returns false) so the revoke event lands in the audit log before the row disappears.
The shared library's `SqliteApiKeyAdminStore` (`IApiKeyAdminStore`) implements administrative mutations: `CreateAsync` accepts an `ApiKeyCreateRequest`, `RevokeAsync` sets `revoked_utc` only when not already revoked, `RotateAsync` replaces `secret_hash`, clears `last_used_utc`, and clears `revoked_utc` so a rotated key is immediately usable, and `DeleteAsync` permanently removes a row but only when `revoked_utc IS NOT NULL` — active keys are untouched (returns false) so the revoke event lands in the audit log before the row disappears.
Because `RotateAsync` clears `revoked_utc`, rotating a previously revoked key reactivates it. The dashboard API Keys page therefore offers the Rotate (and Revoke) actions only for keys whose status is `Active`; revoked keys instead show a Delete action that calls `DeleteAsync`, so an operator can permanently remove a revoked row without ever risking un-revocation as a side effect of a rotation.
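The interaction between revoke, rotate, and delete is easier to see as an in-memory model. The sketch below is not the store's code: `KeyRowSketch` is a hypothetical type that mirrors only the column semantics described above.

```csharp
using System;

public sealed class KeyRowSketch
{
    public byte[] SecretHash { get; private set; } = Array.Empty<byte>();
    public DateTimeOffset? LastUsedUtc { get; private set; }
    public DateTimeOffset? RevokedUtc { get; private set; }
    public bool IsActive => RevokedUtc is null;

    // Mirrors MarkKeyUsedAsync: only non-revoked rows get a fresh timestamp.
    public bool MarkUsed(DateTimeOffset now)
    {
        if (!IsActive) return false;
        LastUsedUtc = now;
        return true;
    }

    // Mirrors RevokeAsync: a no-op when the row is already revoked.
    public bool Revoke(DateTimeOffset now)
    {
        if (!IsActive) return false;
        RevokedUtc = now;
        return true;
    }

    // Mirrors RotateAsync: new hash, both timestamps cleared,
    // so a previously revoked row becomes active again.
    public void Rotate(byte[] newHash)
    {
        SecretHash = newHash;
        LastUsedUtc = null;
        RevokedUtc = null;
    }

    // Mirrors the DeleteAsync guard: only revoked rows may be removed.
    public bool CanDelete() => RevokedUtc is not null;
}
```

Walking a row through `Revoke` and then `Rotate` shows the reactivation that the dashboard avoids by hiding Rotate on revoked keys.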
### Audit trail
`SqliteApiKeyAuditStore` (`IApiKeyAuditStore`) appends `ApiKeyAuditEntry` values to the `api_key_audit` table and stamps each row with a UTC timestamp inside the store rather than trusting the caller. `ListRecentAsync` returns the most recent rows ordered by `audit_id` descending and projects them into `ApiKeyAuditRecord`. Rows are kept even after the referenced key is revoked because the audit history is the durable record of administrative action; the `key_id` column is nullable to accommodate non-key-scoped events such as `init-db`.
All gateway audit flows through a single canonical `AuditEvent` written to the gateway-owned `audit_event` table, not the shared library's `api_key_audit` table. The gateway adopts `ZB.MOM.WW.Audit` and **overrides** the library's `IApiKeyAuditStore` registration with `CanonicalForwardingApiKeyAuditStore`. That adapter receives each library-emitted `ApiKeyAuditEntry` — including the library-internal admin-command verbs (`create-key`, `revoke-key`, `rotate-key`, `init-db`) the gateway cannot edit — canonicalizes it onto an `AuditEvent`, and forwards it through `IAuditWriter` (`CanonicalAuditWriter`), which persists to `audit_event` via `SqliteCanonicalAuditStore`.
Because the adapter is registered after `AddZbApiKeyAuth`, it is the `IApiKeyAuditStore` that the admin commands resolve and that the dashboard "recent audit" view reads through `IApiKeyAuditStore.ListRecentAsync`. The library's own `SqliteApiKeyAuditStore` and its `api_key_audit` table are therefore unused at runtime — the override is the only writer. Audit rows are kept even after the referenced key is revoked because the audit history is the durable record of administrative action; non-key-scoped events such as `init-db` carry no key id.
This canonical-forwarding wiring lives under `src/ZB.MOM.WW.MxGateway.Server/Security/Audit/`; the audit store override and writer are gateway types, while the entry shape and admin verbs originate in the shared library.
## Migration
Schema bring-up is centralised behind `IAuthStoreMigrator`. `SqliteAuthStoreMigrator` executes the migration inside a single transaction so a partial failure leaves the database untouched, refuses to start when the on-disk schema version is newer than the binary supports, and idempotently creates the v1 schema:
Schema bring-up for the API-key tables is owned by the shared library's `SqliteAuthStoreMigrator`, wired by `AddZbApiKeyAuth` along with its migration hosted service. It executes the migration inside a single transaction so a partial failure leaves the database untouched, refuses to start when the on-disk schema version is newer than the binary supports, and idempotently creates the schema:
```csharp
if (existingVersion > SqliteAuthSchema.CurrentVersion)
@@ -179,13 +178,11 @@ await ApplyVersionOneAsync(connection, transaction, cancellationToken).Configure
await transaction.CommitAsync(cancellationToken).ConfigureAwait(false);
```
`AuthStoreMigrationHostedService` runs the migrator at startup, but only when API-key authentication is enabled and `RunMigrationsOnStartup` is true. Operators who manage schema out-of-band can disable the hosted run and use the admin CLI's `init-db` command instead.
`AuthStoreMigrationException` is a sealed `InvalidOperationException` so it can be caught precisely without swallowing unrelated failures.
The library's migration hosted service runs the migrator at startup. Operators who manage schema out-of-band can use the admin CLI's `init-db` command instead.
## Admin CLI
`ApiKeyAdminCommandLineParser.Parse` recognises a leading `apikey` argument and dispatches to one of the subcommands declared by `ApiKeyAdminCommandKind`. Each parsed invocation produces an `ApiKeyAdminCommand` (or an `ApiKeyAdminParseResult` carrying an error). `ApiKeyAdminCliRunner` then executes the command, runs the migrator first, calls the relevant store method, appends an audit row, and writes either text or JSON output via `ApiKeyAdminOutput`. The returned `ApiKeyAdminListedKey` projection deliberately omits the `secret_hash` so listing a database does not surface hash material.
`ApiKeyAdminCommandLineParser.Parse` (a gateway type) recognises a leading `apikey` argument and dispatches to one of the subcommands declared by `ApiKeyAdminCommandKind`. Each parsed invocation produces an `ApiKeyAdminCommand` (or an `ApiKeyAdminParseResult` carrying an error). The parser validates requested `--scopes` against `GatewayScopes.All` (see [Authorization](./Authorization.md#scope-catalog)) so a non-canonical scope string cannot be persisted on a key. `ApiKeyAdminCliRunner` then drives the shared library's `ApiKeyAdminCommands` — which the gateway registers over the already-wired stores, pepper provider, and migrator — to execute the command, and writes either text or JSON output via `ApiKeyAdminOutput`. The returned `ApiKeyAdminListedKey` projection deliberately omits the `secret_hash` so listing a database does not surface hash material.
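A minimal sketch of that `--scopes` validation, assuming a catalog built from the scope strings used in this documentation's CLI examples. `ScopeValidationSketch` and `KnownScopes` are hypothetical stand-ins; the real `GatewayScopes.All` may contain a different set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ScopeValidationSketch
{
    // Stand-in for GatewayScopes.All (assumption: values copied from the docs' examples).
    private static readonly HashSet<string> KnownScopes = new(StringComparer.Ordinal)
    {
        "session:open", "session:close",
        "invoke:read", "invoke:write", "invoke:secure",
        "events:read", "metadata:read", "admin",
    };

    // Parses a comma-separated --scopes value and reports entries outside the catalog.
    public static bool TryParse(
        string csv, out IReadOnlySet<string> scopes, out IReadOnlyList<string> unknown)
    {
        string[] parts = csv.Split(
            ',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

        unknown = parts
            .Where(part => !KnownScopes.Contains(part))
            .Distinct(StringComparer.Ordinal)
            .ToList();
        scopes = new HashSet<string>(parts, StringComparer.Ordinal);

        return parts.Length > 0 && unknown.Count == 0;
    }
}
```

Under this sketch `invoke:read,invoke:write` is accepted, while shorthand such as `read,write` is rejected with both entries reported as unknown.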
The supported subcommands match `ApiKeyAdminCommandKind` exactly:
@@ -201,7 +198,7 @@ Examples:
```bash
mxgateway apikey init-db
mxgateway apikey create-key --key-id ops.alice --display-name "Alice (ops)" --scopes read,write
mxgateway apikey create-key --key-id ops.alice --display-name "Alice (ops)" --scopes invoke:read,invoke:write
mxgateway apikey create-key --key-id area1.reader --display-name "Area 1 reader" --scopes invoke:read,metadata:read --read-subtree "Area1/*" --browse-subtree "Area1/*"
mxgateway apikey list-keys --json
mxgateway apikey revoke-key --key-id ops.alice
@@ -226,7 +223,7 @@ confirmation dialog and emits its own audit event
## Scope Serialization
Scopes are persisted as a single TEXT column rather than a join table because the set is small, never queried by membership at the database level, and changes atomically with the owning row. `ApiKeyScopeSerializer.Serialize` writes a JSON array sorted with `StringComparer.Ordinal` so equivalent scope sets produce byte-identical column values, which makes audit diffing and database comparisons deterministic:
Scopes are persisted as a single TEXT column rather than a join table because the set is small, never queried by membership at the database level, and changes atomically with the owning row. The shared library's `ApiKeyScopeSerializer.Serialize` writes a JSON array sorted with `StringComparer.Ordinal` so equivalent scope sets produce byte-identical column values, which makes audit diffing and database comparisons deterministic:
```csharp
public static string Serialize(IReadOnlySet<string> scopes)
@@ -249,29 +246,50 @@ public static IReadOnlySet<string> Deserialize(string value)
`Deserialize` tolerates an empty column by returning an empty set so older rows or hand-edited records do not crash the verifier.
## Dashboard Cookie and Hub Token
The API-key model above guards the gRPC surface. Interactive dashboard requests use a separate LDAP-backed cookie scheme (see [Gateway Dashboard Design](./GatewayDashboardDesign.md)). Two timeouts and a few configuration knobs govern that cookie:
- **Cookie idle timeout — 8 hours.** `DashboardServiceCollectionExtensions` applies the shared `ZbCookieDefaults.Apply` hardened cookie defaults (HttpOnly, `SameSite=Strict`, secure policy, sliding expiration) but overrides the library's 30-minute default with an 8-hour idle timeout, so an active operator is not signed out mid-shift. The expiration is sliding, so each authenticated request resets the window.
- **Hub bearer token — 30 minutes.** SignalR hub connections cannot always carry the HttpOnly cookie (the client SignalR JS may resolve the cookie scope to loopback), so the dashboard mints a short-lived data-protected bearer at `/hubs/token` via `HubTokenService`. The token lifetime is 30 minutes; the hubs accept either it or the cookie.
- **`MxGateway:Dashboard:CookieName`** overrides the cookie name (default `MxGatewayDashboard`, from `DashboardAuthenticationDefaults.CookieName`). Two gateway instances on the same host but different ports share a cookie scope — host+path, not port — so giving each a distinct name keeps their dashboard sessions from clobbering each other. Changing it signs out existing sessions on next deploy.
- **`MxGateway:Dashboard:RequireHttpsCookie`** (default `true`) restricts the cookie to HTTPS via `CookieSecurePolicy.Always`. Set it to `false` for plain-HTTP dev so the cookie uses `SameAsRequest`; leaving it `true` while serving the dashboard over plain HTTP from a non-localhost host breaks login, because browsers drop Secure cookies set over HTTP.
The dashboard issues claims through the shared `ZB.MOM.WW.Auth.AspNetCore.ZbClaimTypes` (e.g. `ZbClaimTypes.Username` = `zb:username`, `ZbClaimTypes.Name` = `ClaimTypes.Name` so `Identity.Name` resolves, `ZbClaimTypes.Role` = `ClaimTypes.Role` so `IsInRole`/`[Authorize(Roles=...)]` work). Cookie hardening defaults come from `ZbCookieDefaults`. Both live in the shared Auth packages, not the gateway.
## Registration
`AuthStoreServiceCollectionExtensions.AddSqliteAuthStore` wires every service in this subsystem as a singleton and registers the migration hosted service:
`AuthStoreServiceCollectionExtensions.AddSqliteAuthStore` is the gateway entry point. It does not register the parser, hasher, verifier, stores, or migrator directly — those come from the shared package. Instead it delegates to the package's `AddZbApiKeyAuth` and then layers the gateway-specific audit and CLI services:
```csharp
public static IServiceCollection AddSqliteAuthStore(this IServiceCollection services)
public static IServiceCollection AddSqliteAuthStore(
this IServiceCollection services,
IConfiguration configuration)
{
services.AddSingleton<IApiKeyParser, ApiKeyParser>();
services.AddSingleton<IApiKeySecretHasher, ApiKeySecretHasher>();
services.AddSingleton<IApiKeyVerifier, ApiKeyVerifier>();
// Register the shared API-key provider: binds ApiKeyOptions from MxGateway:Authentication,
// wires up the SQLite stores, the configuration-backed pepper provider, the verifier, the
// migrator and the migration hosted service.
services.AddZbApiKeyAuth(effectiveConfig, AuthenticationSectionPath);
// Gateway-owned canonical audit (ZB.MOM.WW.Audit) in the same SQLite file.
services.AddSingleton(sp =>
new SqliteCanonicalAuditStore(sp.GetRequiredService<AuthSqliteConnectionFactory>()));
services.AddSingleton<IAuditWriter>(sp => new CanonicalAuditWriter(/* ... */));
// Override the library's IApiKeyAuditStore so every audit lands in audit_event.
services.AddSingleton<IApiKeyAuditStore, CanonicalForwardingApiKeyAuditStore>();
// The shared admin command set, driven by the gateway CLI and dashboard.
services.AddSingleton(sp => new ApiKeyAdminCommands(/* ... */));
services.AddSingleton<ApiKeyAdminCliRunner>();
services.AddSingleton<AuthSqliteConnectionFactory>();
services.AddSingleton<IAuthStoreMigrator, SqliteAuthStoreMigrator>();
services.AddSingleton<IApiKeyStore, SqliteApiKeyStore>();
services.AddSingleton<IApiKeyAdminStore, SqliteApiKeyAdminStore>();
services.AddSingleton<IApiKeyAuditStore, SqliteApiKeyAuditStore>();
services.AddHostedService<AuthStoreMigrationHostedService>();
return services;
}
```
Singletons are safe because each operation opens its own short-lived `SqliteConnection` through the factory; there is no shared mutable state inside the services.
The gateway pins its own API-key contract — token prefix `mxgw` and the pepper key `MxGateway:ApiKeyPepper` — by layering those as fallback defaults under the supplied configuration before calling `AddZbApiKeyAuth`, because `ApiKeyOptions` is an init-only record that must be bound with those values present rather than mutated afterward. Explicit configuration still wins. `AddZbApiKeyAuth` binds `ApiKeyOptions` from the `MxGateway:Authentication` section and registers the connection factory, stores, pepper provider, verifier, migrator, and migration hosted service.
The audit-store override is registered *after* `AddZbApiKeyAuth` so it replaces the library's `TryAddSingleton` registration. The shared admin command set is not auto-registered by `AddZbApiKeyAuth`, so the gateway registers `ApiKeyAdminCommands` itself over the wired stores; the CLI and dashboard drive it. Library services are singletons and safe because each operation opens its own short-lived `SqliteConnection` through the factory.
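The registration-order behaviour can be reproduced in isolation. The sketch below uses hypothetical types (`IAuditStoreSketch`, `LibraryAuditStore`, `ForwardingAuditStore`) in place of the real audit stores and needs the `Microsoft.Extensions.DependencyInjection` package; it shows only the container behaviour, not the gateway's wiring.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;

public interface IAuditStoreSketch { string Name { get; } }
public sealed class LibraryAuditStore : IAuditStoreSketch { public string Name => "library"; }
public sealed class ForwardingAuditStore : IAuditStoreSketch { public string Name => "gateway"; }

public static class OverrideOrderSketch
{
    public static string ResolveName()
    {
        ServiceCollection services = new();

        // What a library extension typically does: register only if nothing is there yet.
        services.TryAddSingleton<IAuditStoreSketch, LibraryAuditStore>();

        // The later AddSingleton appends a second descriptor for the same service type.
        services.AddSingleton<IAuditStoreSketch, ForwardingAuditStore>();

        using ServiceProvider provider = services.BuildServiceProvider();

        // Single-service resolution returns the last registration.
        return provider.GetRequiredService<IAuditStoreSketch>().Name;
    }
}
```

Both descriptors remain in the collection, but any consumer that asks for a single `IAuditStoreSketch` receives the later one.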
## Related Documentation
+25 -12
@@ -58,32 +58,34 @@ if (options.Value.Authentication.Mode == AuthenticationMode.Disabled)
}
string? authorizationHeader = context.RequestHeaders.GetValue("authorization");
ApiKeyVerificationResult verificationResult = await apiKeyVerifier
.VerifyAsync(authorizationHeader, context.CancellationToken)
ApiKeyVerification verification = await apiKeyVerifier
.VerifyAsync(authorizationHeader ?? string.Empty, context.CancellationToken)
.ConfigureAwait(false);
if (!verificationResult.Succeeded || verificationResult.Identity is null)
if (!verification.Succeeded || verification.Identity is null)
{
throw new RpcException(new Status(
StatusCode.Unauthenticated,
"Missing or invalid API key."));
}
ApiKeyIdentity identity = GatewayApiKeyIdentityMapper.ToGatewayIdentity(verification.Identity);
string requiredScope = scopeResolver.ResolveRequiredScope(request);
if (!verificationResult.Identity.Scopes.Contains(requiredScope))
if (!identity.Scopes.Contains(requiredScope))
{
throw new RpcException(new Status(
StatusCode.PermissionDenied,
$"API key is missing required scope '{requiredScope}'."));
}
return verificationResult.Identity;
return identity;
```
The flow is:
1. If `GatewayOptions.Authentication.Mode` is `AuthenticationMode.Disabled`, the helper returns `null` immediately. No identity is pushed onto the accessor and the continuation runs without scope enforcement. This matches the `AuthenticationMode` enum, which only defines `ApiKey` and `Disabled`.
2. Otherwise, the `authorization` request header is read directly off `ServerCallContext.RequestHeaders` and handed to `IApiKeyVerifier.VerifyAsync`. A failed verification or a missing identity throws `RpcException` with `StatusCode.Unauthenticated`.
2. Otherwise, the `authorization` request header is read directly off `ServerCallContext.RequestHeaders` and handed to the shared `IApiKeyVerifier.VerifyAsync`, which returns an `ApiKeyVerification`. A failed verification or a missing identity throws `RpcException` with `StatusCode.Unauthenticated`. The shared library's identity is then projected onto the gateway-local `ApiKeyIdentity` by `GatewayApiKeyIdentityMapper.ToGatewayIdentity` before scope checks run.
3. `GatewayGrpcScopeResolver.ResolveRequiredScope(request)` produces the scope string. If the identity's `Scopes` set does not contain it, the helper throws `RpcException` with `StatusCode.PermissionDenied` and embeds the missing scope name in `Status.Detail` so callers can diagnose the failure.
4. On success, the verified `ApiKeyIdentity` is returned and pushed onto `IGatewayRequestIdentityAccessor` for the lifetime of the call.
@@ -107,7 +109,8 @@ public string ResolveRequiredScope(object request)
TestConnectionRequest or
GetLastDeployTimeRequest or
DiscoverHierarchyRequest or
WatchDeployEventsRequest => GatewayScopes.MetadataRead,
WatchDeployEventsRequest or
BrowseChildrenRequest => GatewayScopes.MetadataRead,
_ => GatewayScopes.Admin
};
}
@@ -194,7 +197,7 @@ the gateway fails closed.
Non-bulk constraint failures return gRPC `PermissionDenied`. Bulk read
commands preserve input order and return a failed `SubscribeResult` for each
denied item while still forwarding allowed items to the worker. Every denial
adds an `api_key_audit` entry with the key id, command kind, target, and
records a canonical audit event with the key id, command kind, target, and
blocking constraint; secured values and raw credentials are never logged.
## Scope Catalog
@@ -209,10 +212,10 @@ blocking constraint; secured values and raw credentials are never logged.
| `InvokeRead` | `invoke:read` | `MxCommandRequest` for read-style command kinds (`Register`, `AddItem`, `Advise`, `ReadBulk`, and any kind not otherwise mapped) |
| `InvokeWrite` | `invoke:write` | `AcknowledgeAlarmRequest`, `MxCommandKind.Write`, `MxCommandKind.Write2`, `MxCommandKind.WriteBulk`, `MxCommandKind.Write2Bulk` |
| `InvokeSecure` | `invoke:secure` | `MxCommandKind.WriteSecured`, `MxCommandKind.WriteSecured2`, `MxCommandKind.WriteSecuredBulk`, `MxCommandKind.WriteSecured2Bulk`, `MxCommandKind.AuthenticateUser` |
| `MetadataRead` | `metadata:read` | `MxCommandKind.ArchestraUserToId`, `MxCommandKind.GetSessionState`, `MxCommandKind.GetWorkerInfo`, `GalaxyRepository.TestConnection`, `GalaxyRepository.GetLastDeployTime`, `GalaxyRepository.DiscoverHierarchy`, `GalaxyRepository.WatchDeployEvents` |
| `Admin` | `admin` | `MxCommandKind.ShutdownWorker`, the default for any unrecognized request type, and the dashboard authorization policy |
| `MetadataRead` | `metadata:read` | `MxCommandKind.ArchestraUserToId`, `MxCommandKind.GetSessionState`, `MxCommandKind.GetWorkerInfo`, `GalaxyRepository.TestConnection`, `GalaxyRepository.GetLastDeployTime`, `GalaxyRepository.DiscoverHierarchy`, `GalaxyRepository.WatchDeployEvents`, `GalaxyRepository.BrowseChildren` |
| `Admin` | `admin` | `MxCommandKind.ShutdownWorker` and the default for any unrecognized request type |
The `Admin` constant is also referenced by `DashboardAuthenticator` and `DashboardAuthorizationHandler` so that the dashboard and the gRPC layer agree on what "admin" means.
The gRPC `admin` scope here is **distinct** from the dashboard's `Administrator` role. The scope gates API-key access to admin-level RPCs; the dashboard role gates interactive cookie-authenticated dashboard pages. `DashboardAuthorizationHandler` and the dashboard policies authorize against the `Administrator`/`Viewer` roles (see [Gateway Dashboard Design](./GatewayDashboardDesign.md)) and do not reference `GatewayScopes.Admin`. The only dashboard code that touches `GatewayScopes` is the API Keys page, which validates requested scopes against `GatewayScopes.All` when creating a key — the same validation the CLI applies.
## Identity Access for Downstream Layers
@@ -263,14 +266,24 @@ public static IServiceCollection AddGatewayGrpcAuthorization(this IServiceCollec
{
services.AddSingleton<GatewayGrpcScopeResolver>();
services.AddSingleton<IGatewayRequestIdentityAccessor, GatewayRequestIdentityAccessor>();
services.AddSingleton<IConstraintEnforcer, ConstraintEnforcer>();
services.AddSingleton<GatewayGrpcAuthorizationInterceptor>();
services
.AddOptions<Grpc.AspNetCore.Server.GrpcServiceOptions>()
.Configure<IConfiguration>((grpcOptions, configuration) =>
{
ProtocolOptions protocolOptions = new();
configuration.GetSection("MxGateway:Protocol").Bind(protocolOptions);
grpcOptions.MaxReceiveMessageSize = protocolOptions.MaxGrpcMessageBytes;
grpcOptions.MaxSendMessageSize = protocolOptions.MaxGrpcMessageBytes;
});
services.AddGrpc(options => options.Interceptors.Add<GatewayGrpcAuthorizationInterceptor>());
return services;
}
```
Singleton lifetimes are appropriate because none of the three classes hold per-request state on instance fields; the request-scoped value lives inside the `AsyncLocal` on `GatewayRequestIdentityAccessor`. `GatewayApplication` calls `builder.Services.AddGatewayGrpcAuthorization()` during startup, and the call also performs `AddGrpc`, so the gateway never registers gRPC without the interceptor attached.
Four singletons are registered: the scope resolver, the identity accessor, the constraint enforcer (`IConstraintEnforcer` → `ConstraintEnforcer`, which service bodies call to apply API-key constraints), and the interceptor itself. The same method also binds gRPC's `GrpcServiceOptions.MaxReceiveMessageSize` and `MaxSendMessageSize` from `MxGateway:Protocol:MaxGrpcMessageBytes` so the message-size limits are configured in the one place that wires the authorization pipeline. Singleton lifetimes are appropriate because none of these classes hold per-request state on instance fields; the request-scoped value lives inside the `AsyncLocal` on `GatewayRequestIdentityAccessor`. `GatewayApplication` calls `builder.Services.AddGatewayGrpcAuthorization()` during startup, and the call also performs `AddGrpc`, so the gateway never registers gRPC without the interceptor attached.
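Why a singleton accessor can still carry a per-request value is shown in the sketch below. `IdentityAccessorSketch` and `RunAsAsync` are hypothetical names, and a plain string stands in for the identity; this illustrates the `AsyncLocal` pattern rather than reproducing `GatewayRequestIdentityAccessor`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class IdentityAccessorSketch
{
    // AsyncLocal values flow with the async call chain, not with the instance,
    // so one shared accessor can serve many concurrent calls.
    private readonly AsyncLocal<string?> current = new();

    public string? Current => current.Value;

    // Runs the continuation with the identity visible, then restores the previous value.
    public async Task<T> RunAsAsync<T>(string keyId, Func<Task<T>> continuation)
    {
        string? previous = current.Value;
        current.Value = keyId;
        try
        {
            return await continuation().ConfigureAwait(false);
        }
        finally
        {
            current.Value = previous;
        }
    }
}
```

Two overlapping calls through the same instance each observe their own key id, and the caller's context is left without an identity once the calls return.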
## Related Documentation
+33 -6
@@ -48,8 +48,8 @@ dotnet build src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csp
Build and test from the repository root:
```powershell
dotnet build clients/dotnet/ZB.MOM.WW.MxGateway.Client.sln
dotnet test clients/dotnet/ZB.MOM.WW.MxGateway.Client.sln --no-build
dotnet build clients/dotnet/ZB.MOM.WW.MxGateway.Client.slnx
dotnet test clients/dotnet/ZB.MOM.WW.MxGateway.Client.slnx --no-build
```
Create local package artifacts:
@@ -173,10 +173,14 @@ Install, test, and build a wheel from `clients/python`:
Push-Location clients/python
python -m pip install -e ".[dev]"
python -m pytest
python -m pip wheel . --no-deps --wheel-dir "$env:TEMP\mxgateway-python-wheel"
python -m build --outdir "$env:TEMP\mxgateway-python-dist"
Pop-Location
```
`python -m build` (sdist plus wheel) is the canonical build method — it is what
`scripts/pack-clients.ps1` runs for the Python package. Use
`python -m pip wheel . --no-deps` only for a quick wheel-only build.
Run the CLI from the editable install or with `python -m`:
```powershell
@@ -190,9 +194,10 @@ Pop-Location
## Java
The Java workspace uses Gradle, Java 21, `mxgateway-client`, and
`mxgateway-cli`. The Gradle protobuf plugin writes generated Java protobuf and
gRPC sources under `clients/java/src/main/generated`.
The Java workspace uses Gradle, Java 21, and the subprojects
`zb-mom-ww-mxgateway-client` and `zb-mom-ww-mxgateway-cli`. The Gradle protobuf
plugin writes generated Java protobuf and gRPC sources under
`clients/java/src/main/generated`.
Regenerate Java bindings:
@@ -228,6 +233,28 @@ gradle :zb-mom-ww-mxgateway-cli:run --args="smoke --endpoint mxgateway.example.l
Pop-Location
```
## Packing All Clients
`scripts/pack-clients.ps1` runs every client's native packaging command and
drops the artifacts into one directory so a release does not depend on running
each per-language command by hand. It packs the .NET NuGet packages
(`ZB.MOM.WW.MxGateway.Contracts` and `ZB.MOM.WW.MxGateway.Client`), the Python
sdist and wheel (`python -m build`), the Rust `.crate` (`cargo package`), and
the Java jars plus generated POM (`gradle assemble` and the publication tasks).
Go has no artifact to pack — it is released by git-tagging, so the script prints
the `scripts/tag-go-module.ps1` command and skips it.
```powershell
pwsh scripts/pack-clients.ps1
pwsh scripts/pack-clients.ps1 -Languages dotnet,python
```
Artifacts land in `-OutputDir` (default `dist/`). Each language runs its
regression tests first unless `-SkipTests` is set. With `-Publish`, every
package is pushed to the internal Gitea feed; this requires the `GITEA_USERNAME`
and `GITEA_TOKEN` environment variables and the script refuses to publish if
either is missing.
## Integration Tests
Client integration checks are opt-in because they need a live gateway and a
+6 -5
@@ -98,7 +98,7 @@ Use these commands to regenerate language-specific client bindings:
| Go | `Push-Location clients/go; ./generate-proto.ps1; Pop-Location` |
| Rust | `Push-Location clients/rust; cargo check --workspace; Pop-Location` |
| Python | `Push-Location clients/python; ./generate-proto.ps1; Pop-Location` |
| Java | `Push-Location clients/java; gradle :mxgateway-client:generateProto; Pop-Location` |
| Java | `Push-Location clients/java; gradle :zb-mom-ww-mxgateway-client:generateProto; Pop-Location` |
.NET generation currently runs through the contracts project:
@@ -152,10 +152,11 @@ clients/python/generate-proto.ps1
```
Java clients use the Gradle protobuf plugin from `clients/java`. The
`mxgateway-client` project reads the shared `.proto` files and writes generated
Java protobuf and gRPC sources under `clients/java/src/main/generated`, matching
the manifest output path. Handwritten client and CLI code stays in the
`mxgateway-client` and `mxgateway-cli` project source trees.
`zb-mom-ww-mxgateway-client` project reads the shared `.proto` files and writes
generated Java protobuf and gRPC sources under
`clients/java/src/main/generated`, matching the manifest output path.
Handwritten client and CLI code stays in the `zb-mom-ww-mxgateway-client` and
`zb-mom-ww-mxgateway-cli` project source trees.
Run the Java workspace checks from `clients/java`:
+38
@@ -77,6 +77,44 @@ only and does not share types with `mxaccess_gateway.proto`. See
[Galaxy Repository Browse](./GalaxyRepository.md) for the RPC catalog and
behavior.
### Alarm RPCs and messages
`mxaccess_gateway.proto` also defines three session-less alarm RPCs served by
the gateway's always-on central alarm monitor (no client worker session is
involved):
- `AcknowledgeAlarm(AcknowledgeAlarmRequest) returns (AcknowledgeAlarmReply)`
acknowledges one alarm by its `alarm_full_reference`, with an operator
`comment` and `operator_user`.
- `StreamAlarms(StreamAlarmsRequest) returns (stream AlarmFeedMessage)` — the
central alarm feed.
- `QueryActiveAlarms(QueryActiveAlarmsRequest) returns (stream
ActiveAlarmSnapshot)` — a point-in-time snapshot of the currently-active
alarm set, streamed so callers can begin processing without buffering the
whole set. `alarm_filter_prefix` (when non-empty) narrows the snapshot to
alarms whose `alarm_full_reference` starts with the prefix.
`StreamAlarms` uses a three-phase protocol carried by the `AlarmFeedMessage`
`oneof payload`: the stream opens with one `active_alarm` (`ActiveAlarmSnapshot`)
per currently-active alarm, then a single `snapshot_complete = true` sentinel,
then a `transition` (`OnAlarmTransitionEvent`) for every subsequent change.
`active_alarm` carries the collapsed current state (`AlarmConditionState`:
`Active` / `ActiveAcked` / `Inactive`); `transition` carries the
`AlarmTransitionKind` (`Raise` / `Acknowledge` / `Clear` / `Retrigger`).
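The three-phase ordering above can be sketched as a small client-side reducer. This is a minimal illustration, not the generated client API: messages are modeled as plain dicts with one key per `oneof payload` case, and the inner field names `state` and `kind` are assumptions made for the example. Treating out-of-order messages as errors is also a choice of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmFeedState:
    """Client-side view built from a StreamAlarms feed (illustrative only)."""
    active: dict = field(default_factory=dict)       # alarm_full_reference -> state
    snapshot_complete: bool = False
    transitions: list = field(default_factory=list)  # transitions seen after the snapshot


def apply_feed_message(feed: AlarmFeedState, message: dict) -> AlarmFeedState:
    """Apply one feed message; each dict key mirrors one oneof payload case."""
    if "active_alarm" in message:
        # Phase 1: one snapshot entry per currently-active alarm.
        if feed.snapshot_complete:
            raise ValueError("active_alarm received after snapshot_complete")
        alarm = message["active_alarm"]
        feed.active[alarm["alarm_full_reference"]] = alarm["state"]  # 'state' is an assumed name
    elif message.get("snapshot_complete"):
        # Phase 2: the single sentinel that closes the initial snapshot.
        feed.snapshot_complete = True
    elif "transition" in message:
        # Phase 3: live changes, expected only after the sentinel.
        if not feed.snapshot_complete:
            raise ValueError("transition received before snapshot_complete")
        feed.transitions.append(message["transition"])
    else:
        raise ValueError("message carries no recognized payload")
    return feed
```

A caller would hold off rendering "no active alarms" until `snapshot_complete` is true, because an empty `active` map before the sentinel only means the snapshot has not finished arriving.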
`AcknowledgeAlarmRequest` and `AcknowledgeAlarmReply` both **reserve** field 1
and the name `session_id`: acknowledgement was made session-less and the field
was retired (the reservation prevents reuse of the tag). The authoritative
ack-outcome field on `AcknowledgeAlarmReply` is `hresult` (the worker's native
by-name/by-GUID ack return code, 0 = success), alongside `protocol_status`. The
structured `MxStatusProxy status` field is intentionally left **unset** on every
reply because the worker ack path produces only the int32 return code; clients
must read `hresult` and must not depend on `status` being populated.
For the broker architecture and the parse contract for `alarm_full_reference`
(GUID vs `Provider!Group.Tag`) see
[Alarm Client Discovery](./AlarmClientDiscovery.md).
Generated C# output is written to `src/ZB.MOM.WW.MxGateway.Contracts/Generated/`. Do not
hand-edit generated files.
+123 -87
@@ -8,8 +8,12 @@ operations-focused projects.
The dashboard is an operational interface, not a landing page. It prioritizes
fast scanning, low visual noise, and stable layouts while live data changes.
The design uses Bootstrap for common behavior and a small local stylesheet for
project identity, spacing, and status presentation.
The layout chrome, status presentation, and design tokens come from the shared
`ZB.MOM.WW.Theme` kit (the technical-light design system). Bootstrap supplies
common widget behavior, and a small local stylesheet (`wwwroot/css/site.css`)
wires the dashboard's own class names and Bootstrap widgets onto the kit's
tokens. The local sheet contains no hard-coded colors; every color, font, and
surface resolves to a theme token.
Use this style for applications where users repeatedly check system state,
compare rows, inspect details, and diagnose faults. Avoid promotional layouts,
@@ -25,7 +29,7 @@ The interface uses a quiet, work-focused visual system:
- White cards and sections carry the actual operational content.
- Borders define structure more often than shadows.
- Accent color is reserved for metric values and important numeric signals.
- Bootstrap status badges provide state color without custom status art.
- The kit's `StatusPill` provides state color without custom status art.
- Tables remain compact and responsive so long identifiers and timestamps stay
readable.
@@ -34,93 +38,113 @@ and dense enough for repeated use.
## Layout Structure
Every page follows the same structure:
The application chassis is the kit's `ThemeShell` component (a vertical side
rail plus a content area), not a horizontal top navbar. `MainLayout.razor` is a
thin wrapper that delegates the rail chassis — brand block, hamburger toggle,
responsive collapse — to `<ThemeShell>` and supplies only the navigation items
and a rail footer:
1. A top navigation bar with the product or service name on the left.
2. A full-width `container-fluid` content area.
3. A page header with the page title, short context text, and optional status
badge.
4. Metric cards when a page has top-level numeric state.
5. Bordered content sections for tables, details, faults, or empty states.
The shell does not use a sidebar. A horizontal navigation bar is enough for the
current page count and keeps the content width available for tables.
```html
<div class="dashboard-shell">
<nav class="navbar navbar-expand-lg bg-body border-bottom dashboard-navbar">
<!-- brand, page links, sign-out action -->
</nav>
<main class="container-fluid dashboard-content">
<!-- page header, metric grid, sections -->
</main>
</div>
```razor
<ThemeShell Product="MXAccess Gateway" Accent="#2f5fd0">
<Nav>
<NavRailItem Href="/" Text="Dashboard" Match="NavLinkMatch.All" />
<NavRailSection Title="Runtime" Key="runtime">
<NavRailItem Href="/sessions" Text="Sessions" />
<NavRailItem Href="/workers" Text="Workers" />
</NavRailSection>
</Nav>
<RailFooter><!-- user name + sign-out --></RailFooter>
<ChildContent>@Body</ChildContent>
</ThemeShell>
```
Within the content area, every page follows the same structure:
1. A page header with the page title, short context text, and optional status
pill.
2. Metric cards when a page has top-level numeric state.
3. Bordered content sections for tables, details, faults, or empty states.
The login page uses `LoginLayout.razor` instead — a minimal layout with no rail
and no brand block, because the page renders its own centered `<LoginCard>`.
## Color Tokens
Use a small token set and let Bootstrap provide the rest. The current dashboard
uses these local tokens:
```css
:root {
--mxgw-surface: #f7f8fa;
--mxgw-border: #d8dee6;
--mxgw-ink-muted: #667085;
--mxgw-accent: #146c64;
}
```
Colors come from the `ZB.MOM.WW.Theme` kit's `theme.css`. The local
`site.css` defines no `:root` custom properties of its own; it references kit
tokens by name. The dashboard does not define a `--mxgw-*` token set.
| Token | Purpose |
|-------|---------|
| `--mxgw-surface` | Page background behind all content. |
| `--mxgw-border` | Borders on cards, tables, sections, and empty states. |
| `--mxgw-ink-muted` | Secondary labels, details, and empty-state text. |
| `--mxgw-accent` | Metric values and important numeric summaries. |
| `var(--card)` | Background of cards, sections, and data tables. |
| `var(--rule)`, `var(--rule-strong)` | Hairline and stronger borders. |
| `var(--ink)`, `var(--ink-soft)`, `var(--ink-faint)` | Primary, secondary, and muted text. |
| `var(--accent)`, `var(--accent-deep)` | Metric values, links, primary buttons, focus rings. |
| `var(--mono)` | Monospace family for values, identifiers, and code. |
| `var(--ok)`/`--ok-bg`, `var(--warn)`/`--warn-bg`, `var(--bad)`/`--bad-bg`, `var(--idle)`/`--idle-bg` | State colors for chips, alerts, and alarm-state labels. |
Keep the palette small. Add new colors only when they encode state or improve
readability. Prefer Bootstrap badge classes for states such as ready, closing,
closed, and faulted.
Keep the palette small and let the kit own it. Add new colors only when they
encode state or improve readability, and resolve them to a kit token rather than
a literal hex value. Use the kit's `StatusPill` for states such as ready,
closing, idle, and faulted.
## Typography
Typography stays compact and consistent:
- Page headings use `1.35rem`, weight `650`, and normal letter spacing.
- Section headings use the same size as page headings when they introduce a
table or details group.
- Metric labels use uppercase text at `.78rem` and weight `650`.
- Metric values use `1.7rem`, weight `700`, and the accent color.
- Page headings (`.dashboard-page-header h1`) use `1.15rem`, weight `600`, and a
slight letter spacing.
- Section headings (`.section-heading h2`) use a small uppercase eyebrow:
`.74rem`, weight `600`, muted ink.
- Metric labels (`.agg-label`) use uppercase text at `.68rem` and weight `600`,
muted ink.
- Metric values (`.agg-value`) use `1.5rem`, weight `600`, the monospace family,
tabular numerics, and primary ink (`var(--ink)`).
- Body and table text inherit Bootstrap defaults for readability.
Do not scale text with viewport width. Long values use `overflow-wrap:
anywhere` so session IDs, paths, and fault messages do not break the layout.
break-word` (numbers and date tokens stay whole, wrapping only at spaces); a few
free-form fields such as `.agg-sub` use `overflow-wrap: anywhere` so session
IDs, paths, and fault messages do not break the layout.
## Spacing And Shape
The dashboard uses modest spacing:
- Page content has `1.25rem` padding on desktop and `.75rem` on small screens.
- The kit owns the rail and content padding; the local small-screen rule sets
`.page` padding to `.85rem`.
- Metric grids use `.75rem` gaps.
- Content sections start with a top border and `1rem` top padding.
- Cards and empty states use Bootstrap's small radius shape, `.375rem`.
- Metric cards have no shadow.
- Content sections (`.dashboard-section`) and metric cards (`.agg-card`) are
fully bordered cards: `var(--card)` fill, a `1px solid var(--rule)` hairline,
and `0.9rem` padding for sections.
- Cards, sections, and modals use an `8px` radius; smaller widgets such as the
empty state use `6px`.
- Metric cards have no shadow (`box-shadow: none`); borders define structure.
This keeps information grouped without turning each section into a decorative
panel. Use cards for repeated metric summaries, login forms, and individual
items. Use unframed sections with a top border for page-level groups.
items. Use bordered sections for page-level groups.
## Navigation
Navigation is a Bootstrap responsive navbar. It includes:
Navigation lives in the `ThemeShell` side rail. It is built from the kit's
`NavRailSection` and `NavRailItem` components: a single home item plus eight
page items grouped into three labeled sections.
- Brand text for the service name.
- Short page labels: `Overview`, `Sessions`, `Workers`, `Events`, `Settings`.
- Active route styling through `NavLink`.
- A right-aligned sign-out button when authentication is enabled.
| Section | Items |
|---------|-------|
| (home) | `Dashboard` (route `/`, `NavLinkMatch.All`) |
| Runtime | `Sessions`, `Workers`, `Events`, `Alarms` |
| Galaxy | `Repository`, `Browse` |
| Admin | `API Keys`, `Settings` |
Keep navigation labels short. Operational users should be able to predict what
each page contains without reading explanatory copy.
Section expand/collapse state is owned by the kit (a `<details>` element plus
`ThemeScripts`); the layout does not run JS interop for it. The rail footer
shows the signed-in user name and a sign-out form (or a sign-in link when
unauthenticated).
Keep navigation labels short and group related pages. Operational users should
be able to predict what each page contains without reading explanatory copy.
## Page Headers
@@ -128,42 +152,43 @@ Each page starts with a `dashboard-page-header`:
- The title is the primary anchor.
- A single secondary line gives timestamp, row count, or configuration context.
- A status badge appears on the right when the page has an overall state.
- A status pill appears on the right when the page has an overall state.
On narrow screens, the header stacks vertically. This prevents long context
text or status badges from overlapping the title.
text or status pills from overlapping the title.
```html
<div class="dashboard-page-header">
<div>
<h1>Overview</h1>
<h1>Dashboard</h1>
<div class="text-secondary">Generated 2026-04-27 17:30:00</div>
</div>
<span class="badge text-bg-success">Healthy</span>
<!-- <StatusBadge Text="Healthy" /> -> kit <StatusPill State="Ok"> -->
</div>
```
## Metric Cards
Metric cards summarize numeric state at the top of overview and diagnostic
pages. They use Bootstrap cards with a local `metric-card` class:
Metric cards summarize numeric state at the top of the home and diagnostic
pages. The `MetricCard` component renders an `.agg-card` with label, value, and
optional sub-line:
- Label: uppercase, muted, compact.
- Value: large enough to scan, accent colored, wraps safely.
- Detail: optional muted text for version, rate context, or explanatory state.
- Label (`.agg-label`): uppercase eyebrow, muted, compact.
- Value (`.agg-value`): large monospace number in primary ink, wraps safely.
- Sub (`.agg-sub`): optional muted text for version, rate context, or state.
Use auto-fit CSS grid tracks so the cards fill available width without custom
breakpoints:
Cards lay out in a `.metric-grid`. Use auto-fill CSS grid tracks so they fill
available width without custom breakpoints:
```css
.metric-grid {
display: grid;
gap: .75rem;
grid-template-columns: repeat(auto-fit, minmax(12rem, 1fr));
grid-template-columns: repeat(auto-fill, minmax(11rem, 1fr));
}
.metric-grid.compact {
grid-template-columns: repeat(auto-fit, minmax(10rem, 1fr));
grid-template-columns: repeat(auto-fill, minmax(10rem, 1fr));
}
```
@@ -188,15 +213,22 @@ entire rows clickable when a single identifier link is clearer.
## Status Badges
Status uses Bootstrap badge classes with a small mapping layer:
`StatusBadge` is a thin adapter over the kit's `StatusPill`. Call sites pass the
literal domain state text (`<StatusBadge Text="Ready" />`); the adapter maps
that text to one of the kit's four `StatusState` values, and `StatusPill`
renders the chip. There are no Bootstrap `text-bg-*` classes in this layer.
| State | Badge class |
|-------|-------------|
| `Ready`, `Healthy` | `text-bg-success` |
| `Creating`, `StartingWorker`, `WaitingForPipe`, `InitializingWorker`, `Closing` | `text-bg-info` |
| `Closed` | `text-bg-secondary` |
| `Faulted` | `text-bg-danger` |
| Unknown state | `text-bg-light text-dark border` |
| Domain state text | `StatusState` |
|-------------------|---------------|
| `Ready`, `Healthy`, `Active` | `Ok` |
| `Creating`, `StartingWorker`, `WaitingForPipe`, `InitializingWorker`, `Closing`, `Stale`, `Degraded` | `Warn` |
| `Faulted`, `Unavailable` | `Bad` |
| Any other text (including `Closed`, `Revoked`, `Unknown`) | `Idle` |
Note the mapping changes from earlier revisions: `Closed` now falls through to
`Idle` (rather than its own neutral badge), and `Active`, `Stale`, `Degraded`,
and `Unavailable` are explicit cases. The kit owns the chip rendering; only this
domain text-to-state vocabulary lives in the app.
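As a language-neutral illustration of the table above, the mapping reduces to a lookup with an `Idle` fallback. This is a sketch only: the real adapter is a Razor component, and exact-match (case-sensitive) comparison is an assumption here.

```python
from enum import Enum


class StatusState(Enum):
    """The kit's four states, as named in the table above."""
    OK = "Ok"
    WARN = "Warn"
    BAD = "Bad"
    IDLE = "Idle"


# Explicit cases from the table; everything else falls through to Idle.
_STATE_BY_TEXT = {
    "Ready": StatusState.OK,
    "Healthy": StatusState.OK,
    "Active": StatusState.OK,
    "Creating": StatusState.WARN,
    "StartingWorker": StatusState.WARN,
    "WaitingForPipe": StatusState.WARN,
    "InitializingWorker": StatusState.WARN,
    "Closing": StatusState.WARN,
    "Stale": StatusState.WARN,
    "Degraded": StatusState.WARN,
    "Faulted": StatusState.BAD,
    "Unavailable": StatusState.BAD,
}


def map_status(text: str) -> StatusState:
    """Map literal domain state text to a StatusState (exact match assumed)."""
    return _STATE_BY_TEXT.get(text, StatusState.IDLE)
```

Keeping the fallback as `Idle` means a new domain state renders as a neutral chip until someone adds an explicit case for it.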
Keep status text literal. Operators benefit from seeing the same state names
that appear in logs and APIs.
@@ -230,8 +262,8 @@ The dashboard uses one small-screen breakpoint:
```css
@media (max-width: 700px) {
.dashboard-content {
padding: .75rem;
.page {
padding: .85rem;
}
.dashboard-page-header {
@@ -245,6 +277,9 @@ The dashboard uses one small-screen breakpoint:
}
```
A second breakpoint (`max-width: 960px`) collapses the Browse two-pane layout
(`.browse-layout`) to a single column.
Do not hide important columns by default. Use horizontal table scrolling for
dense operational data, and reserve column hiding for data that is clearly
duplicative.
@@ -277,13 +312,14 @@ markup.
Use this checklist when applying the design to another project:
- Define four local tokens: surface, border, muted ink, and accent.
- Use a Bootstrap top navbar with short route labels.
- Keep page content inside a full-width fluid container.
- Take colors, fonts, and surfaces from the `ZB.MOM.WW.Theme` kit tokens; do
not define a local color token set.
- Use the kit's `ThemeShell` side rail with `NavRailSection`/`NavRailItem` and
short route labels grouped into sections.
- Start every page with the same header structure.
- Put primary numeric state in `metric-grid` cards.
- Put primary numeric state in `metric-grid` / `agg-card` cards.
- Put detailed runtime state in compact responsive tables.
- Use status badges mapped from real domain states.
- Use `StatusBadge` (kit `StatusPill`) mapped from real domain states.
- Use dashed bordered empty states for loading and no-data cases.
- Use top-bordered sections for page groups instead of nested cards.
- Centralize formatting and redaction outside Razor markup.
+10 -4
@@ -357,10 +357,16 @@ Allowed UI stack:
Do not use MudBlazor or other Blazor UI component libraries for v1.
Dashboard access should require API-key-backed dashboard authentication with
`admin` scope when enabled. For local development, anonymous localhost access
is enabled by default through `Dashboard:AllowAnonymousLocalhost`; the bypass is
limited to loopback requests.
Dashboard authentication is LDAP-backed, deliberately separate from the gRPC
API-key model: dashboard users are people who already have directory accounts,
so reusing LDAP avoids minting and distributing API keys for human operators.
`DashboardAuthenticator` binds the supplied credentials against `MxGateway:Ldap`
through the shared `ILdapAuthService`, then maps the user's LDAP groups to the
`Administrator` or `Viewer` dashboard role via `MxGateway:Dashboard:GroupToRole`.
A login whose groups match no role is denied. For local development, anonymous
localhost access is enabled by default through
`MxGateway:Dashboard:AllowAnonymousLocalhost`; the bypass is limited to loopback
requests.
## Lazy Browse Is Wire-Only
+26 -1
@@ -205,13 +205,38 @@ app.MapGatewayEndpoints();
The order matters: putting the logging scope first ensures that authentication failures, authorization denials, and endpoint exceptions all run inside the request scope, so failure logs still carry the correlation id and session id headers that the caller sent. The `ClientIdentity` field is redacted before logging, so reading the `authorization` header at this stage does not leak the bearer secret into authentication failure logs.
### Telemetry redaction seam
The per-request middleware redacts the `authorization` header before it reaches a scope, but log events produced outside the request scope (or with credential-bearing properties attached by other enrichers) need the same protection. `GatewayLogRedactorSeam` adapts the static `GatewayLogRedactor` to the shared `ILogRedactor` seam so the telemetry `RedactionEnricher` masks identity material on **every** log event:
```csharp
builder.Services.AddSingleton<ILogRedactor, GatewayLogRedactorSeam>();
```
The seam scans a fixed set of identity-bearing property names (`ClientIdentity`, `authorization`, `Authorization`) and rewrites any string value through `GatewayLogRedactor.RedactClientIdentity`. Because it runs in the enricher rather than at the call site, it catches credential material that a component logged without going through `GatewayLogScope`.
## Readiness Health Check
`AuthStoreHealthCheck` is a readiness probe registered under the health-check name `auth-store` and tagged for the readiness set (`ZbHealthTags.Ready`):
```csharp
builder.Services.AddHealthChecks()
.AddTypeActivatedCheck<AuthStoreHealthCheck>(
"auth-store",
failureStatus: null,
tags: new[] { ZbHealthTags.Ready });
```
The gateway authenticates every gRPC call against the SQLite auth store, so its reachability gates readiness. The check opens a connection via `AuthSqliteConnectionFactory` and runs `SELECT 1;`: success reports `Healthy`, any exception (other than the probe being cancelled) reports `Unhealthy` with the underlying error attached. It is surfaced on the readiness endpoint exposed by the shared telemetry wiring (the live/ready split is what the `wonder-app-vd03` deployment exposes as `/health/live` with the dashboard disabled).
## Consumers
`GatewayLoggerExtensions.BeginGatewayScope` is consumed by `GatewayRequestLoggingMiddlewareExtensions` to attach the per-request scope. Component-level call sites build narrower `GatewayLogScope` instances (for example, with a known `WorkerProcessId` after a worker launch) and push a nested scope on top of the request scope.
`GatewayLogRedactor` is consumed in three places:
`GatewayLogRedactor` is consumed in four places:
- `GatewayLogScope.ToDictionary` redacts `ClientIdentity` whenever a scope is materialized.
- `GatewayLogRedactorSeam.Redact` applies the same redaction to identity-bearing properties on every telemetry log event (see above).
- `DashboardRedactor.Redact` delegates to `RedactClientIdentity` for any value containing the `mxgw_` marker, then falls back to a marker-keyword check for fields like `password` or `token`. This keeps dashboard renders aligned with log redaction.
- `ZB.MOM.WW.MxGateway.Tests/Diagnostics/GatewayLogRedactorTests.cs` covers each redaction branch, including the assertion that `WriteSecured` values stay redacted even when `valueLoggingEnabled` is true.
+45 -8
@@ -81,11 +81,16 @@ computed against the *filtered* descendant set, a branch that contains no
matching objects gets `false`, not `true`.
**Paging.** Default page size is 500; the server caps any requested size at
5000. Page tokens encode `(cache_sequence, parent_id, filter_signature,
offset)`. A token from a different cache generation or a different filter set
returns `InvalidArgument`. The error messages reference "DiscoverHierarchy
page_token" because `BrowseChildren` reuses the same encoding and validation
path — if you see that wording in a `BrowseChildren` context it is expected.
5000. Page tokens are the colon-delimited triple `sequence:filterSignature:offset`
— the same encoding `DiscoverHierarchy` uses. The parent selector is not a
separate token field: it is folded into `filterSignature` along with the rest of
the filter set (the projector's `ComputeFilterSignature` takes the parent id),
so a page token implicitly pins the parent. A token from a different cache
generation (`sequence` mismatch) or a different filter set (`filterSignature`
mismatch) returns `InvalidArgument`. The error messages reference
"DiscoverHierarchy page_token" because `BrowseChildren` reuses the same encoding
and validation path — if you see that wording in a `BrowseChildren` context it is
expected.
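The token checks described above can be sketched as follows. This is a hedged illustration rather than the server's code: it assumes `sequence` and `offset` are non-negative decimal integers and that `filterSignature` contains no colon, and it raises `ValueError` where the gateway would return `InvalidArgument`.

```python
def parse_page_token(token: str, current_sequence: int, current_filter_signature: str) -> int:
    """Validate a 'sequence:filterSignature:offset' token and return the offset."""
    parts = token.split(":")
    if len(parts) != 3:
        raise ValueError("malformed page_token")
    sequence_text, filter_signature, offset_text = parts
    if not (sequence_text.isdigit() and offset_text.isdigit()):
        raise ValueError("page_token sequence and offset must be non-negative integers")
    if int(sequence_text) != current_sequence:
        # Token was issued against a different cache generation.
        raise ValueError("page_token belongs to a different cache generation")
    if filter_signature != current_filter_signature:
        # The signature also covers the parent id, so a different parent fails here.
        raise ValueError("page_token belongs to a different filter set")
    return int(offset_text)
```

Because the parent id is folded into the signature, the sketch needs no separate parent check: replaying a token against another parent fails the signature comparison.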
**Errors.**
@@ -133,6 +138,15 @@ When SQL is unreachable, the cache retains the previous data and flips
`Status` to `Stale` (or `Unavailable` if no data was ever loaded). A
`SqlException` never bubbles out as the client-facing error.
The cache also auto-degrades a `Healthy` entry to `Stale` purely on age: when the
last successful refresh is older than five minutes, the projected status is
reported as `Stale` even though the data hasn't otherwise changed. This guards
against a silently wedged refresh loop — if ticks stop succeeding, browse
results visibly go `Stale` rather than continuing to look fresh. (`Unknown` and
`Unavailable` entries are returned as-is and not aged.) The first refresh runs at
service startup, before the interval loop begins, so the cache is populated as
soon as practical rather than waiting one full interval.
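The age-based degradation can be illustrated with a small projection function. This is a sketch under stated assumptions: status values are modeled as plain strings, and the function and parameter names are hypothetical rather than taken from `GalaxyHierarchyCache`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(minutes=5)  # threshold stated in the text above


def project_status(status: str, last_success: Optional[datetime], now: datetime) -> str:
    """Report a Healthy entry as Stale once its last successful refresh is too old."""
    if status == "Healthy" and last_success is not None and now - last_success > STALE_AFTER:
        return "Stale"
    # Unknown, Unavailable, and already-Stale entries are returned as-is.
    return status
```

The stored entry is not mutated in this sketch; only the reported status changes, so a later successful refresh restores `Healthy` without any extra bookkeeping.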
### First-load behavior
If a client calls `DiscoverHierarchy` before the background service has
@@ -156,7 +170,10 @@ working across that gap, the cache persists its dataset to disk:
- On the **first** refresh after startup, before any SQL runs, the cache
reloads that file. The restored data is served with `Stale` status —
it is last-known data, not live — so clients can browse immediately even
when the Galaxy database is unreachable.
when the Galaxy database is unreachable. The restore also publishes a deploy
event through `IGalaxyDeployNotifier`, so a `WatchDeployEvents` subscriber that
attaches before the first live query still sees the restored snapshot's deploy
state.
- The first live query then reconciles: if it observes the **same**
`time_of_last_deploy` the snapshot was saved at, the entry is promoted to
`Healthy` with no heavy re-query (the snapshot is provably current); if it
@@ -349,6 +366,25 @@ Component breakdown:
override per object. `HierarchySql` still matches the OtOpcUa original;
`AttributesSql` does not — it additionally enumerates built-in primitive
attributes (see [Built-in vs configured attributes](#built-in-vs-configured-attributes)).
`HierarchySql` restricts the result to a fixed allow-list of object categories
via `WHERE td.category_id IN (1, 3, 4, 10, 11, 13, 17, 24, 26)` — the same set
the dashboard's `ResolveCategoryName` map names. Categories outside this set
(for example, internal framework objects) are never browsed. The mapping:
| `category_id` | Name |
|---|---|
| 1 | WinPlatform |
| 3 | AppEngine |
| 4 | InTouchViewApp |
| 10 | UserDefined |
| 11 | FieldReference |
| 13 | Area |
| 17 | DIObject |
| 24 | DDESuiteLinkClient |
| 26 | OPCClient |
Any other category id renders as `Category {id}` in the dashboard.
- `GalaxyHierarchyCache`
(`src/ZB.MOM.WW.MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs`) holds the most
recent immutable `GalaxyHierarchyCacheEntry` (materialized objects +
@@ -384,7 +420,7 @@ Bound to `MxGateway:Galaxy` via `GalaxyRepositoryOptions`.
| Option | Default | Description |
|--------|---------|-------------|
| `MxGateway:Galaxy:ConnectionString` | `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` | SQL Server connection string for the Galaxy Repository. Integrated Security against `localhost` is the dev default; production deployments should override this through the standard double-underscore environment variable form, e.g. `MxGateway__Galaxy__ConnectionString`. |
| `MxGateway:Galaxy:CommandTimeoutSeconds` | `60` | Per-command SQL timeout. Applies to all three RPCs. |
| `MxGateway:Galaxy:CommandTimeoutSeconds` | `60` | Per-command SQL timeout applied to every SQL command the repository runs (the connectivity probe, the deploy-time poll, and the hierarchy and attribute queries), which back all five Galaxy RPCs. |
| `MxGateway:Galaxy:PersistSnapshot` | `true` | Persists each successful browse dataset to disk and reloads it at startup. See [On-disk snapshot](#on-disk-snapshot). |
| `MxGateway:Galaxy:SnapshotCachePath` | `C:\ProgramData\MxGateway\galaxy-snapshot.json` | File path for the persisted browse snapshot. Ignored when `PersistSnapshot` is `false`. |
@@ -400,7 +436,8 @@ unparsed connection string text.
## Authorization
All four Galaxy RPCs (including `WatchDeployEvents`) require the
All five Galaxy RPCs (`TestConnection`, `GetLastDeployTime`,
`DiscoverHierarchy`, `WatchDeployEvents`, and `BrowseChildren`) require the
`metadata:read` API-key scope. Browse is read-only metadata, equivalent in
privilege to `MxCommandKind.GetSessionState` or `MxCommandKind.GetWorkerInfo`.
The mapping lives in `GatewayGrpcScopeResolver`; see
+46
@@ -18,6 +18,19 @@ paths, timeouts, queue sizes, enum values, or protocol values are invalid.
"PepperSecretName": "MxGateway:ApiKeyPepper",
"RunMigrationsOnStartup": true
},
"Ldap": {
"Enabled": true,
"Server": "localhost",
"Port": 3893,
"Transport": "None",
"AllowInsecure": true,
"SearchBase": "dc=zb,dc=local",
"ServiceAccountDn": "cn=serviceaccount,dc=zb,dc=local",
"ServiceAccountPassword": "serviceaccount123",
"UserNameAttribute": "cn",
"DisplayNameAttribute": "cn",
"GroupAttribute": "memberOf"
},
"Worker": {
"ExecutablePath": "src\\ZB.MOM.WW.MxGateway.Worker\\bin\\x86\\Release\\ZB.MOM.WW.MxGateway.Worker.exe",
"WorkingDirectory": null,
@@ -93,6 +106,39 @@ Environment variables use the normal .NET double-underscore form. For example,
When `Mode` is `ApiKey`, `SqlitePath` and `PepperSecretName` must be present.
`SqlitePath` must be a valid filesystem path.
## Ldap Options
The `MxGateway:Ldap` section configures the dashboard's LDAP login (the gRPC API
uses API keys, not LDAP — see [Authentication](./Authentication.md)). The same
section is bound twice: the runtime bind/search is performed by the shared
`ZB.MOM.WW.Auth.Ldap` provider wired up by `AddZbLdapAuth`, while the gateway's
own `LdapOptions` shadow exists only for startup validation, the redacted
effective-config display, and the dev/default values. The two stay
field-compatible so the one section binds onto both. The gateway ships
dev-friendly defaults (plaintext localhost); the shared provider's own defaults
are secure-by-default.
| Option | Default | Description |
|--------|---------|-------------|
| `MxGateway:Ldap:Enabled` | `true` | Enables LDAP-backed dashboard login. When `false`, the rest of the section is not validated and LDAP login is not wired up. |
| `MxGateway:Ldap:Server` | `localhost` | LDAP server host. Required when `Enabled`. |
| `MxGateway:Ldap:Port` | `3893` | LDAP server port. Must be a valid port (1-65535). |
| `MxGateway:Ldap:Transport` | `None` | Transport/TLS mode. One of `None` (plaintext), `StartTls` (upgrade a plaintext connection to TLS), or `Ldaps` (TLS from connect). Replaces the former boolean `UseTls`. |
| `MxGateway:Ldap:AllowInsecure` | `true` | Allows plaintext LDAP connections. Must be `true` when `Transport` is `None`; setting `Transport=None` with `AllowInsecure=false` fails validation. |
| `MxGateway:Ldap:SearchBase` | `dc=zb,dc=local` | Search base distinguished name for user lookup. Required when `Enabled`. |
| `MxGateway:Ldap:ServiceAccountDn` | `cn=serviceaccount,dc=zb,dc=local` | Service account DN used to bind before searching for the logging-in user. Required when `Enabled`. Redacted in the effective-config display. |
| `MxGateway:Ldap:ServiceAccountPassword` | `serviceaccount123` | Service account bind password. Required when `Enabled`. Never logged; redacted in the effective-config display. |
| `MxGateway:Ldap:UserNameAttribute` | `cn` | Attribute matched against the login user name (the dev GLAuth directory keys users by `cn`, not `uid`). Required when `Enabled`. |
| `MxGateway:Ldap:DisplayNameAttribute` | `cn` | Attribute read for the user's display name. Required when `Enabled`. |
| `MxGateway:Ldap:GroupAttribute` | `memberOf` | Attribute read for the user's group membership. The resulting group names are mapped to dashboard roles by `MxGateway:Dashboard:GroupToRole`. Required when `Enabled`. |
When `Enabled` is `true`, `Server`, `SearchBase`, `ServiceAccountDn`,
`ServiceAccountPassword`, `UserNameAttribute`, `DisplayNameAttribute`, and
`GroupAttribute` must be non-blank, `Port` must be valid, and `AllowInsecure`
must be `true` whenever `Transport` is `None`. Group-to-role mapping lives in the
dashboard section; see `MxGateway:Dashboard:GroupToRole` below and
[glauth.md](../glauth.md).
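
The validation rules above can be sketched as a small checker. This is a minimal illustration, not the gateway's actual validator: `LdapOptionsSketch`, `LdapTransportSketch`, and `LdapOptionsValidatorSketch` are hypothetical names, and only the rules and defaults stated in the table are modeled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; the real option types may differ.
public enum LdapTransportSketch { None, StartTls, Ldaps }

public sealed class LdapOptionsSketch
{
    // Defaults mirror the option table above.
    public bool Enabled { get; set; } = true;
    public string Server { get; set; } = "localhost";
    public int Port { get; set; } = 3893;
    public LdapTransportSketch Transport { get; set; } = LdapTransportSketch.None;
    public bool AllowInsecure { get; set; } = true;
    public string SearchBase { get; set; } = "dc=zb,dc=local";
    public string ServiceAccountDn { get; set; } = "cn=serviceaccount,dc=zb,dc=local";
    public string ServiceAccountPassword { get; set; } = "serviceaccount123";
    public string UserNameAttribute { get; set; } = "cn";
    public string DisplayNameAttribute { get; set; } = "cn";
    public string GroupAttribute { get; set; } = "memberOf";
}

public static class LdapOptionsValidatorSketch
{
    public static IReadOnlyList<string> Validate(LdapOptionsSketch o)
    {
        var errors = new List<string>();
        if (!o.Enabled)
        {
            return errors; // Disabled: the rest of the section is not validated.
        }

        Require(o.Server, "Server", errors);
        Require(o.SearchBase, "SearchBase", errors);
        Require(o.ServiceAccountDn, "ServiceAccountDn", errors);
        Require(o.ServiceAccountPassword, "ServiceAccountPassword", errors);
        Require(o.UserNameAttribute, "UserNameAttribute", errors);
        Require(o.DisplayNameAttribute, "DisplayNameAttribute", errors);
        Require(o.GroupAttribute, "GroupAttribute", errors);

        if (o.Port < 1 || o.Port > 65535)
        {
            errors.Add("Port must be in the range 1-65535.");
        }

        // Plaintext transport is only accepted when explicitly allowed.
        if (o.Transport == LdapTransportSketch.None && !o.AllowInsecure)
        {
            errors.Add("Transport=None requires AllowInsecure=true.");
        }

        return errors;
    }

    private static void Require(string value, string name, List<string> errors)
    {
        if (string.IsNullOrWhiteSpace(value))
        {
            errors.Add(name + " is required when Enabled is true.");
        }
    }
}
```

With the shipped dev defaults the sketch reports no errors; flipping `AllowInsecure` to `false` while `Transport` stays `None` yields a single error.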
## Worker Options
| Option | Default | Description |
+120 -38
@@ -9,11 +9,13 @@ statistics in real time.
## Technology Choice
Decision: Blazor Server with Bootstrap CSS/JS.
Decision: Blazor Server with the shared `ZB.MOM.WW.Theme` kit layered over
Bootstrap CSS/JS.
Allowed UI stack:
- ASP.NET Core Blazor Server,
- the `ZB.MOM.WW.Theme` kit (layout chassis, status components, design tokens),
- Bootstrap CSS,
- Bootstrap JavaScript,
- small local CSS for layout and status styling,
@@ -30,7 +32,35 @@ Not allowed for v1:
Rationale: Blazor Server keeps the dashboard in the gateway process, avoids a
separate frontend build, and gives real-time UI updates through the Blazor
SignalR circuit. Bootstrap is sufficient for a basic dashboard.
SignalR circuit. The `ZB.MOM.WW.Theme` kit gives the dashboard the same chassis,
status vocabulary, and visual identity as the other ZB.MOM.WW operations UIs
without re-implementing layout and status styling per project.
## Theme Kit
The dashboard depends on the shared `ZB.MOM.WW.Theme` NuGet package
(version `0.2.0`, referenced in `ZB.MOM.WW.MxGateway.Server.csproj`). The kit is
a Razor Class Library that ships the technical-light design system: a layout
chassis, a small set of UI components, the design tokens, and the head/script
asset wiring. The dashboard takes its chrome and status presentation from the
kit and adds only its own pages and view CSS on top.
Components and assets used:
| Kit member | Role in the dashboard |
|---|---|
| `<ThemeShell>` | The application chassis — vertical side rail (brand, hamburger, responsive collapse) plus a content area. `MainLayout.razor` wraps it and supplies `Nav`, `RailFooter`, and `ChildContent` slots. |
| `<NavRailSection>` / `<NavRailItem>` | Grouped navigation items in the rail. Section expand/collapse persistence is owned by the kit (`<details>` + `ThemeScripts`); the app runs no JS interop for it. |
| `<LoginCard>` | The centered login card on `Login.razor`. Renders a native static `<form method="post" action="/login">` so the submit reaches the minimal-API endpoint rather than a Blazor event. |
| `<StatusPill State="…">` | The status chip. `StatusBadge.razor` is a thin adapter that maps domain state text to one of four `StatusState` values (`Ok`, `Warn`, `Bad`, `Idle`) and renders this pill. |
| `<ThemeHead/>` | Loaded in `App.razor`'s `<head>`; injects the kit's `theme.css` and related head assets. |
| `<ThemeScripts/>` | Loaded at the end of `App.razor`'s `<body>`; supplies the rail's interactive behavior. |
| Token system | `theme.css` defines all design tokens (`var(--card)`, `var(--ink)`, `var(--accent)`, `var(--mono)`, the state colors, etc.). The local `site.css` references these tokens and defines no hard-coded colors. |
The dependency on this kit is the reason the layout shell, navigation, status
chips, and tokens differ from a stock Bootstrap dashboard. See
[Dashboard Interface Design](./DashboardInterfaceDesign.md) for how the kit's
tokens and components shape the visual language.
## Hosting Model
@@ -67,8 +97,8 @@ Endpoint layout:
The `/galaxy` page surfaces the Galaxy Repository browse summary
(deployed object hierarchy size, last deploy timestamp, attribute totals,
template usage, and connectivity sync info). The summary is fed by
`GalaxySummaryCache`, which is refreshed off the request path by
`GalaxySummaryRefreshService` on the
`GalaxyHierarchyCache`, which is refreshed off the request path by
`GalaxyHierarchyRefreshService` on the
`MxGateway:Galaxy:DashboardRefreshIntervalSeconds` cadence so the dashboard
never blocks on SQL. See [Galaxy Repository Browse](./GalaxyRepository.md) for
the underlying gRPC service.
@@ -79,24 +109,31 @@ the underlying gRPC service.
ZB.MOM.WW.MxGateway.Server
Dashboard/
Components/
App.razor
App.razor (loads <ThemeHead/> / <ThemeScripts/>)
Routes.razor
DashboardPageBase.cs
DashboardDisplay.cs
Layout/
DashboardLayout.razor
MainLayout.razor (ThemeShell side-rail chassis)
LoginLayout.razor (minimal, no rail; hosts <LoginCard>)
Pages/
DashboardHome.razor
Login.razor
SessionsPage.razor
SessionDetailsPage.razor
WorkersPage.razor
EventsPage.razor
AlarmsPage.razor
GalaxyPage.razor
BrowsePage.razor
ApiKeysPage.razor
SettingsPage.razor
Shared/
MetricCard.razor
StatusBadge.razor
StatusBadge.razor (adapter over kit <StatusPill>)
FaultList.razor
BrowseTreeNodeView.razor
ConfirmDialog.razor
DashboardSnapshotService.cs
DashboardAuthorizationHandler.cs
DashboardAuthenticator.cs
@@ -244,10 +281,14 @@ Show:
- admin Close session / Kill worker controls (Admin role only).
The Sessions list, the Workers list, and this details page all render the same
admin controls when the signed-in principal carries the `Admin` role; viewers
admin controls when the signed-in principal carries the `Administrator` role; viewers
and the localhost-anonymous bypass see no action affordances and the server
re-checks the role on every invocation. Every destructive admin action is
gated by a confirmation dialog before it reaches `ISessionManager`.
gated by the shared `ConfirmDialog` component before it reaches
`ISessionManager`. `ConfirmDialog` is a reusable Bootstrap modal (title,
message, confirm/cancel buttons, and a busy state that disables both buttons
while the action runs); each page binds its open state and confirm/cancel
callbacks. The API keys page uses the same component.
- **Close session** routes through `ISessionManager.CloseSessionAsync`: the
worker is asked to shut down gracefully and is killed only as a fallback if
@@ -289,7 +330,8 @@ it opt-in and redacted.
### Browse page
`/browse` lets an operator explore the Galaxy tag hierarchy and watch
live values. The tree is built in-process by `DashboardBrowseTreeBuilder` from
live values. The tree is built in-process by the static
`DashboardBrowseTreeBuilder` (in `DashboardBrowseModel.cs`) from
`IGalaxyHierarchyCache.Current` — the same cache the Galaxy page reads — so a
render costs no gRPC call and no SQL round-trip. Each node shows its child
objects and, when expanded, its attributes with attribute name, data type
@@ -307,7 +349,10 @@ diagnostic session/worker views.
### Alarms page
`/alarms` lists the alarms the gateway's central alarm monitor
currently holds as Active or ActiveAcked, refreshed every three seconds. It
currently holds as Active or ActiveAcked. The page injects
`IDashboardLiveDataService` and drives a `PeriodicTimer` poll loop that calls
`QueryAlarmsAsync` every three seconds, rather than subscribing to the snapshot
hub or holding a `CurrentAlarms` reference directly. It
defaults to showing unacknowledged `Active` alarms; filters add acknowledged
alarms and narrow by area, severity range, and a reference/source/description
text search. Cleared alarms are not retained — the gateway holds no
@@ -358,7 +403,7 @@ for what each constraint means and how it is enforced on the gRPC path.
Create, Rotate, Revoke, and Delete controls render only when the signed-in
user is authorized. `DashboardApiKeyAuthorization.CanManage` requires an
authenticated principal carrying the `Admin` role claim (resolved at login
authenticated principal carrying the `Administrator` role claim (resolved at login
from the user's LDAP groups via `MxGateway:Dashboard:GroupToRole`). A
`Viewer` role can read the table but sees no action controls, and an
anonymous localhost session shows the same read-only view.
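
The role gate described here amounts to a two-part check on the principal. The sketch below assumes the role arrives as a standard `ClaimTypes.Role` claim; `ApiKeyAuthorizationSketch` is a hypothetical name, not the real `DashboardApiKeyAuthorization` class.

```csharp
using System.Security.Claims;

// Illustrative only: mirrors the documented rule, not the actual implementation.
public static class ApiKeyAuthorizationSketch
{
    public const string AdministratorRole = "Administrator";

    public static bool CanManage(ClaimsPrincipal user)
    {
        // Both conditions must hold: authenticated AND carrying the role claim.
        return user?.Identity?.IsAuthenticated == true
            && user.IsInRole(AdministratorRole);
    }
}
```

An identity created without an authentication type (for example, the anonymous localhost bypass) reports `IsAuthenticated == false`, so it fails the first condition regardless of its claims.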
@@ -385,10 +430,11 @@ Create and Rotate return the assembled `mxgw_<keyId>_<secret>` token **once**,
in a one-time banner. It is never shown again, so the operator must copy it
immediately. This mirrors the `apikey create-key` / `rotate-key` CLI.
Every management action appends an `api_key_audit` entry
(`dashboard-create-key`, `dashboard-rotate-key`, `dashboard-revoke-key`,
`dashboard-delete-key`) with the key id and the caller's remote address.
Secrets and pepper values are never logged.
Every management action writes an entry to the canonical `audit_event` store
through `IAuditWriter` (`dashboard-create-key`, `dashboard-rotate-key`,
`dashboard-revoke-key`, `dashboard-delete-key`) with the key id, the caller's
remote address, and a correlation id. Secrets and pepper values are never
logged.
### Settings page
@@ -408,23 +454,33 @@ Do not show API key secrets or pepper values.
Dashboard authentication is LDAP-backed, distinct from the API-key model used
on the gRPC API. Users sign in with directory credentials; the gateway maps
their LDAP groups to one of two dashboard roles (`Admin` or `Viewer`) and
their LDAP groups to one of two dashboard roles (`Administrator` or `Viewer`) and
issues a cookie carrying those role claims.
Implemented behavior:
- a static `/login` HTML form posts username/password to the gateway;
- `DashboardAuthenticator` binds against `MxGateway:Ldap` (service-account bind,
user search, candidate bind) using `Novell.Directory.Ldap.NETStandard`;
- the user's `memberOf` (or short CN) is matched against
`MxGateway:Dashboard:GroupToRole`; the resolved role(s) are emitted as
`ClaimTypes.Role` claims, alongside the per-group `mxgateway:ldap_group`
claims;
- a successful login signs in the `MxGateway.Dashboard` cookie scheme
(`MxGatewayDashboard`, HttpOnly, SameSite=Strict, Secure);
- `GET /login` is served by the `[AllowAnonymous]` Blazor `Login.razor`
component (under `LoginLayout`), which renders the shared kit's `<LoginCard>`.
`LoginCard` emits a native static `<form method="post" action="/login">`
(username, password, hidden returnUrl) plus an `<AntiforgeryToken/>`. A native
form submit is not a Blazor event, so it reaches the minimal-API `POST /login`
endpoint regardless of the app's InteractiveServer render mode;
- `DashboardAuthenticator` delegates bind/search to the shared
`ZB.MOM.WW.Auth.Ldap` provider, registered by `AddZbLdapAuth(configuration,
"MxGateway:Ldap")`. The provider performs a service-account bind, user search,
then candidate bind, and fails closed;
- the user's group membership (stripped to its first RDN by the provider) is
matched against `MxGateway:Dashboard:GroupToRole`; the resolved role(s) are
emitted as `ClaimTypes.Role` claims, alongside the per-group
`mxgateway:ldap_group` claims;
- a successful login signs in the `MxGateway.Dashboard` cookie scheme. The
cookie defaults to the name `MxGatewayDashboard` (HttpOnly, SameSite=Strict,
Secure) and can be overridden via `MxGateway:Dashboard:CookieName`;
- a user with no matching group cannot sign in — the login screen returns the
generic credential-rejected message;
- antiforgery tokens guard the login and logout POSTs.
generic credential-rejected message via `/login?error=…`;
- antiforgery tokens guard the login and logout POSTs. `POST /logout` (and a
`GET /logout` convenience redirect) sign the cookie out and return to
`/login`.
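
The group-to-role step in the list above can be sketched as follows. This is a hedged illustration: `GroupToRoleSketch` is a hypothetical helper, the group names are assumed to be already stripped to their first RDN, and key comparison is left to whatever comparer the caller's dictionary uses because the source does not state case sensitivity.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Claims;

// Illustrative only; the real DashboardAuthenticator may structure this differently.
public static class GroupToRoleSketch
{
    public const string LdapGroupClaimType = "mxgateway:ldap_group";

    // Returns the distinct roles the user's groups map to. An empty result
    // means no group matched, which the login flow treats as a rejected sign-in.
    public static List<string> ResolveRoles(
        IEnumerable<string> groups,
        IReadOnlyDictionary<string, string> groupToRole)
    {
        var roles = new List<string>();
        foreach (var group in groups)
        {
            if (groupToRole.TryGetValue(group, out var role) && !roles.Contains(role))
            {
                roles.Add(role);
            }
        }

        return roles;
    }

    // Emits one role claim per resolved role plus one group claim per LDAP group.
    public static List<Claim> BuildClaims(IEnumerable<string> groups, IEnumerable<string> roles)
    {
        var claims = new List<Claim>();
        foreach (var role in roles)
        {
            claims.Add(new Claim(ClaimTypes.Role, role));
        }

        foreach (var group in groups)
        {
            claims.Add(new Claim(LdapGroupClaimType, group));
        }

        return claims;
    }
}
```

A caller would reject the login when `ResolveRoles` returns an empty list and only build the cookie identity otherwise.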
Three authorization policies are registered:
@@ -443,8 +499,8 @@ Viewer role.
### Hub bearer flow
SignalR connections cannot reuse the `__Host-` cookie when the JS client
upgrades to WebSocket — the cookie's `SameSite=Strict; Path=/` keeps it from
SignalR connections cannot reuse the `MxGatewayDashboard` cookie when the JS
client upgrades to WebSocket — the cookie's `SameSite=Strict; Path=/` keeps it from
being forwarded by the browser's WebSocket layer in some edge cases. The
dashboard mints short-lived bearer tokens for the connection:
@@ -480,8 +536,10 @@ Effective configuration:
"RecentFaultLimit": 100,
"RecentSessionLimit": 200,
"ShowTagValues": false,
"CookieName": null,
"RequireHttpsCookie": true,
"GroupToRole": {
"GwAdmin": "Admin",
"GwAdmin": "Administrator",
"GwReader": "Viewer"
}
}
@@ -489,6 +547,15 @@ Effective configuration:
}
```
Two cookie keys tune the auth cookie:
- `CookieName` overrides the cookie name. Null or blank keeps the canonical
default `MxGatewayDashboard`, so a misconfiguration cannot leave the cookie
unnamed.
- `RequireHttpsCookie` (default `true`) sets the cookie `SecurePolicy` to
`Always`. Set it to `false` for dev HTTP deployments, which relaxes the policy
to `SameAsRequest`.
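
As a rough sketch of how the two keys resolve, assuming nothing beyond the behavior described above (`DashboardCookieSketch` and `CookieSecurePolicySketch` are hypothetical stand-ins; the real code would use ASP.NET Core's own `CookieSecurePolicy` enum):

```csharp
// Local stand-in for the two ASP.NET Core policy values the text mentions.
public enum CookieSecurePolicySketch { SameAsRequest, Always }

public static class DashboardCookieSketch
{
    public const string DefaultCookieName = "MxGatewayDashboard";

    // Null or blank falls back to the canonical default, so the cookie is never unnamed.
    public static string ResolveName(string configured)
    {
        return string.IsNullOrWhiteSpace(configured) ? DefaultCookieName : configured;
    }

    // true (the default) forces Secure; false relaxes it for dev HTTP deployments.
    public static CookieSecurePolicySketch ResolveSecurePolicy(bool requireHttpsCookie)
    {
        return requireHttpsCookie
            ? CookieSecurePolicySketch.Always
            : CookieSecurePolicySketch.SameAsRequest;
    }
}
```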
See [Gateway Configuration](./GatewayConfiguration.md#dashboard-options) for
the full option table and the policies/hubs that derive from these values.
@@ -504,17 +571,31 @@ the full option table and the policies/hubs that derive from these values.
## Styling
The dashboard serves Bootstrap 5.3.3 assets from
`src/ZB.MOM.WW.MxGateway.Server/wwwroot/lib/bootstrap/` and local layout/status styling
from `src/ZB.MOM.WW.MxGateway.Server/wwwroot/css/dashboard.css`.
Styling is layered. From base to top:
1. Bootstrap 5.3.3 assets served from
`src/ZB.MOM.WW.MxGateway.Server/wwwroot/lib/bootstrap/`.
2. The `ZB.MOM.WW.Theme` kit's `theme.css` (the technical-light design system),
which owns the design tokens and the kit component styles. `App.razor` loads
it through the kit's `<ThemeHead/>` component, and pairs it with
`<ThemeScripts/>` at the end of `<body>` for the rail's interactive behavior.
3. The local view stylesheet
`src/ZB.MOM.WW.MxGateway.Server/wwwroot/css/site.css`, which wires the
dashboard's own class names and Bootstrap widgets onto the kit tokens. It
defines no hard-coded colors.
The minimal `/denied` page is rendered outside the Blazor circuit, so it loads
the kit CSS directly from the static-web-asset path
(`/_content/ZB.MOM.WW.Theme/css/theme.css` and `…/layout.css`) plus Bootstrap
and `site.css`.
Recommended visual language:
- compact tables,
- status badges,
- the kit `StatusPill` for state,
- metric cards,
- Bootstrap alerts for faults,
- restrained colors,
- restrained colors drawn from the kit tokens,
- no decorative hero sections,
- no charting dependency for v1.
@@ -530,7 +611,7 @@ Dashboard unit/component tests should cover:
- snapshot projection,
- dashboard auth authorization decisions,
- login API-key validation behavior,
- login LDAP bind and group-to-role mapping behavior,
- pages render with empty state,
- pages render with active sessions,
- pages render with faulted sessions,
@@ -557,7 +638,8 @@ Integration tests should verify:
The first dashboard slice implements:
1. Blazor Server hosting in `ZB.MOM.WW.MxGateway.Server`.
2. local Bootstrap static assets.
2. local Bootstrap static assets plus the `ZB.MOM.WW.Theme` kit layer
(chassis, tokens, status components).
3. dashboard configuration binding.
4. dashboard auth using LDAP bind + role-mapped HTTP-only cookie.
5. `DashboardSnapshotService` projecting gateway state for read views.
+10 -4
@@ -248,10 +248,15 @@ Suggested routes:
```text
/
/login
/sessions
/sessions/{sessionId}
/workers
/events
/alarms
/galaxy
/browse
/apikeys
/settings
```
@@ -681,13 +686,14 @@ Dashboard authentication uses LDAP bind + role mapping (separate from the
API-key model used on the gRPC API). The login endpoint accepts username and
password in a form post, calls `DashboardAuthenticator` to bind against
`MxGateway:Ldap`, resolves the user's LDAP groups through
`MxGateway:Dashboard:GroupToRole` to one of `Admin` / `Viewer`, and signs in
`MxGateway:Dashboard:GroupToRole` to one of `Administrator` / `Viewer`, and signs in
with the `MxGateway.Dashboard` cookie scheme. The cookie is HTTP-only,
secure, strict SameSite, and named `__Host-MxGatewayDashboard`. Logout
secure, strict SameSite, and named `MxGatewayDashboard` (configurable via
`MxGateway:Dashboard:CookieName`). Logout
clears it. Login and logout posts validate antiforgery tokens. SignalR
connections additionally accept a 30-minute data-protected bearer minted at
`/hubs/token`. `Dashboard:AllowAnonymousLocalhost` permits loopback requests
to bypass the cookie requirement and defaults to `true`.
`/hubs/token`. `MxGateway:Dashboard:AllowAnonymousLocalhost` permits loopback
requests to bypass the cookie requirement and defaults to `true`.
Recommended scopes:
+11
@@ -100,6 +100,17 @@ Optional live smoke variables:
| `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER` | `admin` | ArchestrA user name passed to `AuthenticateUser` before the `WriteSecured` parity step. |
| `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_PASSWORD` | `admin123` | Password paired with the user above. Never logged; the test asserts the value does not appear in the WriteSecured diagnostic message. |
When `MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` is unset, the integration harness
locates the worker by resolving the repository root: `ResolveRepositoryRoot`
walks parent directories from the test binary looking for a directory that
contains a `src` subdirectory next to either a `.git` marker or a `*.sln` /
`*.slnx` file under `src`. The `.git`-or-`.sln` pair lets the resolution work
both in a checked-out repository and in an extracted copy that ships no `.git`
folder. If the walk exhausts without a match, it throws `InvalidOperationException`
naming the start directory and the expected markers; set
`MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` to point directly at a worker executable and
bypass repository-root resolution entirely.
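
The parent-directory walk described above might look roughly like this. It is a sketch under stated assumptions, not the harness's actual `ResolveRepositoryRoot`: the class name is hypothetical, and treating `.git` as either a directory or a file (as in a git worktree) is an assumption.

```csharp
using System;
using System.IO;
using System.Linq;

public static class RepositoryRootSketch
{
    public static string Resolve(string startDirectory)
    {
        for (var dir = new DirectoryInfo(startDirectory); dir != null; dir = dir.Parent)
        {
            var src = Path.Combine(dir.FullName, "src");
            if (!Directory.Exists(src))
            {
                continue; // Not a candidate root; keep walking up.
            }

            // Marker 1: a .git entry next to src (directory, or file in a worktree).
            var gitMarker = Path.Combine(dir.FullName, ".git");
            var hasGit = Directory.Exists(gitMarker) || File.Exists(gitMarker);

            // Marker 2: a solution file directly under src (extracted copies without .git).
            var hasSolution = Directory.EnumerateFiles(src, "*.sln").Any()
                || Directory.EnumerateFiles(src, "*.slnx").Any();

            if (hasGit || hasSolution)
            {
                return dir.FullName;
            }
        }

        throw new InvalidOperationException(
            "Could not locate the repository root starting from '" + startDirectory +
            "'. Expected a directory containing 'src' plus either '.git' or a *.sln/*.slnx file under 'src'.");
    }
}
```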
The test output includes session id, worker process id, command status,
HRESULT/status diagnostics, event sequence and handles, close status, and worker
stdout/stderr lines emitted during the run.
+9 -4
@@ -10,7 +10,7 @@ The layer is composed of four collaborators:
| Type | Lifetime | Role |
|------|----------|------|
| `MxAccessGatewayService` | scoped (gRPC) | Implements the six `MxAccessGateway` RPCs, performs exception mapping. |
| `MxAccessGatewayService` | scoped (gRPC) | Implements the seven `MxAccessGateway` RPCs, performs exception mapping. |
| `MxAccessGrpcRequestValidator` | singleton | Rejects malformed requests before any session work runs. |
| `MxAccessGrpcMapper` | singleton | Converts public proto types to internal `WorkerCommand`/`WorkerEvent` types and back. |
| `IEventStreamService` (`EventStreamService`) | singleton | Owns the event stream pipeline, including bounded queue and backpressure handling. |
@@ -29,7 +29,7 @@ A second gRPC service, `GalaxyRepositoryGrpcService`, is mapped alongside it. It
## RPC Handlers
`MxAccessGatewayService` derives from the generated `MxAccessGateway.MxAccessGatewayBase` and implements every RPC declared in `mxaccess_gateway.proto` — six in total: `OpenSession`, `CloseSession`, `Invoke`, `StreamEvents`, `AcknowledgeAlarm`, and `StreamAlarms`. The proto contract itself is documented in [Contracts](./Contracts.md); this section covers only what the server-side handler does on top of that contract.
`MxAccessGatewayService` derives from the generated `MxAccessGateway.MxAccessGatewayBase` and implements every RPC declared in `mxaccess_gateway.proto` — seven in total: `OpenSession`, `CloseSession`, `Invoke`, `StreamEvents`, `AcknowledgeAlarm`, `StreamAlarms`, and `QueryActiveAlarms`. The proto contract itself is documented in [Contracts](./Contracts.md); this section covers only what the server-side handler does on top of that contract.
Public gRPC send and receive message sizes are configured from
`MxGateway:Protocol:MaxGrpcMessageBytes` (default 16 MiB). Official clients use
@@ -94,6 +94,10 @@ Carrying the enqueue timestamp into the worker layer is what lets queue-wait tim
`StreamAlarms` is a server-streaming, **session-less** RPC that attaches to the gateway's central alarm feed. The handler delegates to `IGatewayAlarmService.StreamAsync`. The stream opens with one `AlarmFeedMessage` carrying an `active_alarm` per currently-active alarm (the ConditionRefresh snapshot), then a single `snapshot_complete`, then a `transition` for every subsequent raise / acknowledge / clear. It is served by the always-on `GatewayAlarmMonitor`, which owns a single gateway-managed worker session and fans out to every attached client — clients no longer open a session of their own. `alarm_filter_prefix`, when set, scopes the stream to a sub-tree.
### `QueryActiveAlarms`
`QueryActiveAlarms` is a server-streaming, **session-less** RPC that returns a point-in-time snapshot of the alarm monitor's active-alarm cache. The handler iterates `IGatewayAlarmService.CurrentAlarms`, writing one `ActiveAlarmSnapshot` per active alarm, then completes — unlike `StreamAlarms` it emits no `snapshot_complete` sentinel and no transitions. When `alarm_filter_prefix` is non-empty, snapshots whose `alarm_full_reference` does not start with the prefix are skipped (ordinal match). Clients use it to seed or reconcile state after a reconnect; for a live feed they use `StreamAlarms`.
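
The prefix rule can be illustrated with a small filter over alarm references. This is a simplified sketch: it operates on plain strings rather than `ActiveAlarmSnapshot` messages, and `ActiveAlarmFilterSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class ActiveAlarmFilterSketch
{
    public static IEnumerable<string> Filter(
        IEnumerable<string> alarmFullReferences,
        string alarmFilterPrefix)
    {
        foreach (var reference in alarmFullReferences)
        {
            // An empty prefix disables filtering; otherwise match ordinally
            // (case-sensitive, culture-independent) against the start of the reference.
            if (!string.IsNullOrEmpty(alarmFilterPrefix)
                && !reference.StartsWith(alarmFilterPrefix, StringComparison.Ordinal))
            {
                continue;
            }

            yield return reference;
        }
    }
}
```

Because the match is ordinal, a prefix that differs only in letter case from the stored reference does not match.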
## Validation Rules
`MxAccessGrpcRequestValidator` rejects requests with `StatusCode.InvalidArgument` before any session work happens. The rules are intentionally narrow — anything that requires session state (for example, "session does not exist") is left for `ISessionManager` so the validator can stay synchronous and side-effect free.
@@ -106,6 +110,7 @@ Carrying the enqueue timestamp into the worker layer is what lets queue-wait tim
| `Invoke` | `session_id` non-empty, `command` present, `kind` not `Unspecified`, payload oneof must match `kind`. | `InvalidArgument` |
| `AcknowledgeAlarm` | `alarm_full_reference` must be non-empty. Validated inline in the handler, not by `MxAccessGrpcRequestValidator`. | `InvalidArgument` |
| `StreamAlarms` | No required fields — `alarm_filter_prefix` is optional. | — |
| `QueryActiveAlarms` | No required fields — `alarm_filter_prefix` is optional. | — |
The payload-vs-kind check matters because the `MxCommand.payload` oneof is non-discriminated on the wire — a misaligned client could send `kind = Write` with a `Register` payload and silently confuse the worker. The validator turns that into a clear client error:
@@ -145,7 +150,7 @@ public WorkerCommand MapCommand(MxCommandRequest request)
When the worker reply or event payload is missing, the mapper returns a synthetic public message with `ProtocolStatusCode.ProtocolViolation` (for replies) or a sentinel `MxEvent` with `MxEventFamily.Unspecified` (for events). The gateway never relays a partial frame to clients — anything missing is reported as a protocol violation against the worker, not a transport error against the client.
The mapper also exposes static factory methods for every `ProtocolStatusCode` (`Ok`, `InvalidRequest`, `SessionNotFound`, `SessionNotReady`, `WorkerUnavailable`, `Timeout`, `Canceled`, `ProtocolViolation`) so that handlers and tests can produce status payloads without duplicating the enum-to-string mapping.
The mapper also exposes static factory methods for most `ProtocolStatusCode` values (`Ok`, `InvalidRequest`, `SessionNotFound`, `SessionNotReady`, `WorkerUnavailable`, `Timeout`, `Canceled`, `ProtocolViolation`) so that handlers and tests can produce status payloads without duplicating the enum-to-string mapping. There is intentionally no factory for `MxAccessFailure` (the ninth enum value): that code is set by the worker on the reply payload to report an MXAccess-side failure, not synthesized by the gateway mapper.
## Exception to Status Mapping
@@ -224,7 +229,7 @@ if (!writer.TryWrite(publicEvent))
}
```
Under `FailFast` the session is faulted so subsequent commands return `FailedPrecondition`; the client must reopen. Under the default policy only the stream is dropped and the session continues to accept commands, leaving recovery to the client (typically a fresh `StreamEvents` call with an updated `AfterWorkerSequence`). Either way, the consumer side observes `StatusCode.ResourceExhausted` via the `EventQueueOverflow` mapping above.
`FailFast` is the **default** policy (`Events:BackpressurePolicy`): on overflow the whole session is faulted, so subsequent commands return `FailedPrecondition` and the client must reopen. This is deliberate — the default refuses to silently drop MXAccess events. The non-default `DisconnectSubscriber` policy drops only the slow stream and leaves the session accepting commands, leaving recovery to the client (typically a fresh `StreamEvents` call with an updated `AfterWorkerSequence`). Either way, the consumer side observes `StatusCode.ResourceExhausted` via the `EventQueueOverflow` mapping above.
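
A minimal sketch of the two overflow policies, assuming a bounded `System.Threading.Channels` queue like the `TryWrite` call shown earlier. The enum and class names are hypothetical, and the real service performs the fault or disconnect itself rather than returning an outcome value.

```csharp
using System.Threading.Channels;

public enum BackpressurePolicySketch { FailFast, DisconnectSubscriber }

public enum PublishOutcomeSketch { Delivered, SessionFaulted, StreamDropped }

public static class EventBackpressureSketch
{
    // FullMode.Wait makes TryWrite return false (instead of dropping items) when full.
    public static Channel<T> CreateQueue<T>(int capacity)
    {
        return Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait
        });
    }

    public static PublishOutcomeSketch Publish<T>(
        ChannelWriter<T> writer,
        T item,
        BackpressurePolicySketch policy)
    {
        if (writer.TryWrite(item))
        {
            return PublishOutcomeSketch.Delivered;
        }

        // Overflow: FailFast faults the whole session; DisconnectSubscriber
        // drops only the slow stream and leaves the session usable.
        return policy == BackpressurePolicySketch.FailFast
            ? PublishOutcomeSketch.SessionFaulted
            : PublishOutcomeSketch.StreamDropped;
    }
}
```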
### Cancellation and cleanup
+83 -27
@@ -94,9 +94,11 @@ Expected protected environment values:
```text
MXGATEWAY_WORKER_NONCE=<random nonce>
MXGATEWAY_WORKER_LOG_CONTEXT=<optional context>
```
The nonce travels through the environment rather than the command line so it
never appears in process-listing tools that expose argument vectors.
Startup sequence:
1. Parse command-line arguments.
@@ -114,16 +116,26 @@ Startup sequence:
If validation fails before MXAccess creation, exit quickly with a non-zero exit
code. If MXAccess creation fails, send `WorkerFault` when possible and exit.
The bootstrap layer returns structured exit codes before it creates pipes,
starts the STA, or touches MXAccess:
`WorkerApplication.Run` returns one of the structured `WorkerExitCode` values.
Codes `2`-`4` are produced by the bootstrap parse phase before any pipe, STA, or
MXAccess work happens; codes `5`-`6` and a clean `0` only become reachable once
the parse succeeds and the worker runs its pipe session:
| Exit code | Name | Meaning |
|-----------|------|---------|
| `0` | `Success` | Required bootstrap options are valid. |
| `0` | `Success` | The pipe session ran to a clean close. |
| `1` | `UnexpectedFailure` | A non-bootstrap exception reaches the process boundary. |
| `2` | `InvalidArguments` | Required arguments are missing or unknown arguments are present. |
| `3` | `InvalidProtocolVersion` | `--protocol-version` is not numeric or does not match the supported worker protocol. |
| `4` | `MissingNonce` | `MXGATEWAY_WORKER_NONCE` is absent or empty. |
| `5` | `PipeConnectionFailed` | The pipe connection raised an `IOException` or `TimeoutException`. |
| `6` | `ProtocolViolation` | A `WorkerFrameProtocolException` escaped the pipe session. |
`WorkerBootstrapResult.Succeeded` is a separate parse-phase gate: it reports
whether argument parsing produced usable `WorkerOptions`. A `false` result
carries one of codes `2`-`4` and the worker exits before running a session, so a
successful parse is distinct from the `0` exit code, which only follows a clean
pipe-session close.
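
The exit-code table and the session-phase mapping can be sketched as below. The numeric values come from the table; the type names carry a `Sketch` suffix because they are illustrative stand-ins, and `WorkerFrameProtocolExceptionSketch` substitutes for the real `WorkerFrameProtocolException`.

```csharp
using System;
using System.IO;

public enum WorkerExitCodeSketch
{
    Success = 0,
    UnexpectedFailure = 1,
    InvalidArguments = 2,
    InvalidProtocolVersion = 3,
    MissingNonce = 4,
    PipeConnectionFailed = 5,
    ProtocolViolation = 6
}

// Stand-in for the worker's frame-protocol exception type.
public sealed class WorkerFrameProtocolExceptionSketch : Exception
{
    public WorkerFrameProtocolExceptionSketch(string message) : base(message) { }
}

public static class WorkerExitCodeMappingSketch
{
    // Maps an exception that escaped the pipe session to an exit code (codes 5, 6, and 1).
    public static WorkerExitCodeSketch FromSessionFailure(Exception exception)
    {
        switch (exception)
        {
            case IOException _:
            case TimeoutException _:
                return WorkerExitCodeSketch.PipeConnectionFailed;
            case WorkerFrameProtocolExceptionSketch _:
                return WorkerExitCodeSketch.ProtocolViolation;
            default:
                return WorkerExitCodeSketch.UnexpectedFailure;
        }
    }
}
```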
Bootstrap logs use `WorkerConsoleLogger` key/value output. `WorkerLogRedactor`
redacts fields whose names indicate nonce, secret, password, token,
@@ -133,30 +145,35 @@ credential, or API key values before the message is written.
```text
ZB.MOM.WW.MxGateway.Worker
Program
Program (calls WorkerApplication.Run)
WorkerApplication (parse, bootstrap, run pipe session, map exit code)
Bootstrap
WorkerOptionsParser (parse args + env into WorkerOptions)
WorkerOptions
WorkerHost
WorkerBootstrapResult (parse outcome + WorkerExitCode)
WorkerExitCode
WorkerConsoleLogger / WorkerLogRedactor
Ipc
PipeClient
FrameReader
FrameWriter
WorkerProtocol
WorkerPipeClient (named-pipe connect + retry, owns the session)
WorkerPipeSession (handshake, read/write/drain/heartbeat loops)
WorkerFrameReader / WorkerFrameWriter
WorkerEnvelopeValidator
WorkerContractInfo (protocol version + descriptor names)
Sta
StaRuntime
StaCommandQueue
MessagePump
StaWatchdog
StaRuntime (the dedicated STA thread + message pump loop)
StaCommandDispatcher
StaMessagePump
MxAccess
MxAccessSession
MxAccessCommandDispatcher
MxAccessEventSink
MxAccessStaSession (IWorkerRuntimeSession over the STA)
MxAccessSession (handle registry + COM-call orchestration)
MxAccessCommandExecutor (IStaCommandExecutor; runs commands on the STA)
MxAccessBaseEventSink (OnDataChange tag-data events)
MxAccessHandleRegistry
(alarm subsystem — see below)
Conversion
VariantConverter
SafeArrayConverter
StatusProxyConverter
HResultMapper
VariantConverter (MxValue <-> COM VARIANT, both directions)
MxStatusProxyConverter
HResultConverter / HResultConversion
```
## Threading Model
@@ -330,13 +347,19 @@ cleanup path completes.
## Event Sink
The worker must subscribe to every public MXAccess event family:
The worker subscribes to every public MXAccess event family through
`MxAccessBaseEventSink`:
- `OnDataChange`
- `OnWriteComplete`
- `OperationComplete`
- `OnBufferedDataChange`
Alarm transitions arrive on a separate path. They do not originate from the
`LMXProxyServerClass` connection points, so `MxAccessAlarmEventSink` (driven by
the alarm subsystem below) feeds them onto the same `MxAccessEventQueue` rather
than `MxAccessBaseEventSink`.
Forward these event families only when the native MXAccess COM object raises
them. Do not synthesize `OperationComplete` from write completion or command
status. `OnBufferedDataChange` must be represented in the protocol now, but
@@ -368,16 +391,49 @@ type on buffered events. `OperationComplete` is only emitted from the native
`MxAccessEventQueue` is the bounded outbound event queue for one worker
session. It assigns the monotonic `WorkerSequence` and `WorkerTimestamp` when an
event is accepted, preserving the order in which MXAccess handlers enqueue
events. The default capacity is `10000`. When the queue reaches capacity it
records a `WorkerFaultCategory.QueueOverflow` fault and rejects further events.
The event handler catches conversion and enqueue failures, records the first
fault on the queue, and returns to the STA message pump instead of writing to
the pipe.
events. The default capacity is `10000`. When the queue reaches capacity, `Enqueue`
records a `WorkerFaultCategory.QueueOverflow` fault and then throws
`MxAccessEventQueueOverflowException` so the caller cannot silently drop the
event. The event handler catches conversion and enqueue failures (including this
overflow exception), records the first fault on the queue, and returns to the
STA message pump instead of writing to the pipe.
If event conversion throws, catch it inside the event handler, record a
structured `WorkerFault`, and keep the worker alive only if the fault policy
allows it.
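
The overflow behavior can be sketched with a small bounded queue. This is an illustration only: the type names are hypothetical, the fault is reduced to a boolean flag, and starting the sequence at 1 is an assumption the source does not state.

```csharp
using System;
using System.Collections.Generic;

public sealed class EventQueueOverflowExceptionSketch : Exception
{
    public EventQueueOverflowExceptionSketch(string message) : base(message) { }
}

public sealed class BoundedEventQueueSketch<T>
{
    private readonly object _gate = new object();
    private readonly Queue<(long Sequence, DateTimeOffset Timestamp, T Payload)> _items =
        new Queue<(long Sequence, DateTimeOffset Timestamp, T Payload)>();
    private readonly int _capacity;
    private long _lastSequence;

    public BoundedEventQueueSketch(int capacity = 10000)
    {
        _capacity = capacity;
    }

    // Set once the queue has rejected an event; stands in for the QueueOverflow fault.
    public bool OverflowFaultRecorded { get; private set; }

    public long Enqueue(T payload)
    {
        lock (_gate)
        {
            if (_items.Count >= _capacity)
            {
                // Record the fault first, then throw so the caller cannot drop the event silently.
                OverflowFaultRecorded = true;
                throw new EventQueueOverflowExceptionSketch("Event queue is at capacity.");
            }

            // Sequence and timestamp are assigned at acceptance, preserving enqueue order.
            var sequence = ++_lastSequence;
            _items.Enqueue((sequence, DateTimeOffset.UtcNow, payload));
            return sequence;
        }
    }
}
```

An event handler wrapping `Enqueue` would catch the overflow exception and return to the message pump, as described above, rather than letting it propagate into the COM callback.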
## Alarm Subsystem
Alarms come from a different COM surface than tag data, so the worker carries a
separate pipeline rather than folding alarms into `MxAccessBaseEventSink`. The
MXAccess `LMXProxyServerClass` does not expose alarm subscription, so the worker
hosts AVEVA's standalone alarm-consumer COM object instead.
- `WnWrapAlarmConsumer` is the production `IMxAccessAlarmConsumer`, backed by
`WNWRAPCONSUMERLib.wwAlarmConsumerClass`. It returns the active alarm set as a
BSTR XML string through `GetXmlCurrentAlarms2`, which avoids the FILETIME→
`DateTime` marshaling that crashed the earlier managed alarm client. The CLSID
is registered `ThreadingModel=Apartment`, so the consumer is created and
driven entirely on the worker's STA. It owns no internal timer.
- `MxAccessStaSession` drives the **STA alarm poll loop**: `RunAlarmPollLoopAsync`
awaits a fixed `500 ms` interval and then calls `IAlarmCommandHandler.PollOnce`
on the STA via the runtime, so every `GetXmlCurrentAlarms2` call stays on the
apartment that owns the consumer. A poll failure is recorded as a
`WorkerFault` on the event queue rather than terminating the worker.
- `AlarmCommandHandler` owns one `AlarmDispatcher` per session and is the entry
point for the alarm IPC commands (`SubscribeAlarms`, `AcknowledgeAlarm` by GUID
or name, `QueryActiveAlarms`, `Unsubscribe`). It rejects a second subscribe
before an unsubscribe, mirroring the consumer's non-idempotent `Subscribe`.
- `AlarmDispatcher` wires the consumer's `AlarmTransitionEmitted` stream onto
`MxAccessAlarmEventSink.EnqueueTransition`. It maps state transitions through
`AlarmRecordTransitionMapper`, composes the canonical
`\\<machine>\Galaxy!<area>` full reference, and projects active-alarm
snapshots to `ActiveAlarmSnapshot` protos for the `QueryActiveAlarms` refresh
stream.
- `MxAccessAlarmEventSink` enqueues each decoded transition onto the shared
`MxAccessEventQueue` as a proto alarm-transition event, stamping the session
id, so alarms ride the same outbound IPC path as tag-data events.
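
The poll-loop behaviour described above (fixed interval, one poll per tick, a failed poll recorded as a fault instead of ending the loop) can be sketched roughly as follows. This is a minimal illustration, not the worker's code: `AlarmPollSketch`, its delegate parameters, and the 10 ms interval are hypothetical, and the real loop marshals `PollOnce` onto the STA rather than calling it inline.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Usage: three polls, the second of which fails.
var faults = new List<string>();
var calls = 0;
using var cts = new CancellationTokenSource();

await AlarmPollSketch.RunAsync(
    pollOnce: () =>
    {
        calls++;
        if (calls == 2) throw new InvalidOperationException("consumer unavailable");
        if (calls == 3) cts.Cancel(); // stop the demo after the third poll
    },
    recordFault: message => faults.Add(message),
    interval: TimeSpan.FromMilliseconds(10),
    cancellationToken: cts.Token);

Console.WriteLine($"polls={calls}, faults={faults.Count}");

// Hypothetical stand-in for the poll loop; names are illustrative only.
public static class AlarmPollSketch
{
    public static async Task RunAsync(
        Action pollOnce,
        Action<string> recordFault,
        TimeSpan interval,
        CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            try
            {
                // Wait out the fixed interval before each poll.
                await Task.Delay(interval, cancellationToken).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                return; // cancellation ends the loop quietly
            }

            try
            {
                pollOnce();
            }
            catch (Exception exception)
            {
                // A failed poll becomes a fault record; the loop keeps running.
                recordFault(exception.Message);
            }
        }
    }
}
```

The point of the second `try` block is that an exception from one poll never escapes the loop, which matches the "recorded as a `WorkerFault` rather than terminating the worker" behaviour.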
## Command Queue
The pipe reader converts `WorkerCommand` messages into `StaCommand` entries.
+45 -11
@@ -4,9 +4,9 @@ The sessions subsystem owns the in-memory representation of an active gateway-to
## Overview
A session is the gateway-side handle that callers use to invoke worker commands, stream worker events, and tear the worker down. The subsystem is split between the per-session state machine (`GatewaySession`), an in-memory directory (`SessionRegistry`), the orchestrator that opens and closes sessions (`SessionManager`), the worker construction step (`SessionWorkerClientFactory`), and a hosted service that drains sessions during host shutdown (`SessionShutdownHostedService`).
A session is the gateway-side handle that callers use to invoke worker commands, stream worker events, and tear the worker down. The subsystem is split between the per-session state machine (`GatewaySession`), an in-memory directory (`SessionRegistry`), the orchestrator that opens and closes sessions (`SessionManager`), the worker construction step (`SessionWorkerClientFactory`), a hosted service that sweeps expired leases (`SessionLeaseMonitorHostedService`), and a hosted service that drains sessions during host shutdown (`SessionShutdownHostedService`).
All four interfaces (`ISessionManager`, `ISessionRegistry`, `ISessionWorkerClientFactory`) plus `SessionShutdownHostedService` are wired as singletons by `SessionServiceCollectionExtensions.AddGatewaySessions`.
The three interfaces (`ISessionManager`, `ISessionRegistry`, `ISessionWorkerClientFactory`) are wired as singletons, and both hosted services (`SessionLeaseMonitorHostedService`, `SessionShutdownHostedService`) are registered, by `SessionServiceCollectionExtensions.AddGatewaySessions`. The startup orphan-worker cleanup that runs before any session opens lives in the worker subsystem (`OrphanWorkerCleanupHostedService`); see [Gateway Restart and Orphan Cleanup](#gateway-restart-and-orphan-cleanup).
## Key Types
@@ -18,6 +18,8 @@ The session id is an opaque string in the form `session-{guid:N}` and the per-se
`SessionState` itself is the protobuf-generated enum from `ZB.MOM.WW.MxGateway.Contracts.Proto`, so it is shared between the gateway and clients on the wire.
`GatewaySession` also keeps an `_items` dictionary keyed by `(ServerHandle, ItemHandle)` mapping each subscribed item to its `SessionItemRegistration` (server handle, item handle, tag address). It is the gateway-side shadow of the items the worker has added, populated as `AddItem`-style commands succeed and pruned on `RemoveItem`. The shadow exists so the gateway can answer item lookups and clean up subscriptions without round-tripping the worker; the worker remains authoritative for the handles themselves (see [gateway.md](../gateway.md)).
```csharp
public void TransitionTo(SessionState nextState)
{
@@ -54,7 +56,7 @@ public void TransitionTo(SessionState nextState)
`CloseSessionAsync` and `KillWorkerAsync` are both end-of-life paths but differ in what they offer the worker:
- `CloseSessionAsync` is the graceful path: it calls `GatewaySession.CloseAsync`, which asks the worker to shut down via `IWorkerClient.ShutdownAsync` and only kills the process as a fallback if shutdown fails.
- `KillWorkerAsync` is the forceful path used by the dashboard's admin Kill button: it calls `GatewaySession.KillWorker` directly, which kills the worker process immediately with no graceful-shutdown attempt and transitions the session to `Closed`.
- `KillWorkerAsync` is the forceful path used by the dashboard's admin Kill button: it calls `GatewaySession.KillWorkerWithCloseGateAsync`, which kills the worker process immediately with no graceful-shutdown attempt and transitions the session to `Closed`. Routing through `KillWorkerWithCloseGateAsync` (rather than the bare `GatewaySession.KillWorker`) acquires the per-session `_closeLock` so a kill and an in-flight graceful close serialize on the same "was the session already closed" observation that drives metric accounting; the method returns that observation so `KillWorkerAsync` increments `mxgateway.sessions.closed` at most once across concurrent callers.
Both paths converge on the same registry/metrics cleanup, so the open-session slot is released and `mxgateway.sessions.closed` is incremented either way.
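
The close-gate idea can be illustrated with a small sketch, assuming a `SemaphoreSlim` plays the role of `_closeLock`. `GatedSessionSketch` and `KillWithCloseGateAsync` are hypothetical names for illustration; the real `GatewaySession` also terminates the worker process and transitions session state.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var session = new GatedSessionSketch();
var closedCounter = 0;

// Two concurrent kill requests against the same session.
var results = await Task.WhenAll(
    Task.Run(() => session.KillWithCloseGateAsync()),
    Task.Run(() => session.KillWithCloseGateAsync()));

foreach (var wasAlreadyClosed in results)
{
    // Only the caller that observed "not yet closed" bumps the counter.
    if (!wasAlreadyClosed) closedCounter++;
}

Console.WriteLine($"closed counter = {closedCounter}");

// Hypothetical stand-in for the per-session close gate.
public sealed class GatedSessionSketch
{
    private readonly SemaphoreSlim _closeLock = new SemaphoreSlim(1, 1);
    private bool _closed;

    // Returns true when the session was already closed before this call.
    public async Task<bool> KillWithCloseGateAsync()
    {
        await _closeLock.WaitAsync().ConfigureAwait(false);
        try
        {
            var wasAlreadyClosed = _closed;
            _closed = true; // the real method would force the worker down here
            return wasAlreadyClosed;
        }
        finally
        {
            _closeLock.Release();
        }
    }
}
```

Because the "already closed" observation is made while holding the lock, exactly one of the concurrent callers sees `false`, so the counter is incremented once regardless of how the calls interleave.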
@@ -99,6 +101,8 @@ if (exception is OperationCanceledException
The named pipe is created with `maxNumberOfServerInstances: 1` so a second worker cannot connect to the same pipe name even if the first launch is still pending. Combined with the per-session nonce passed to the worker, this is the gateway's defense against a foreign process answering a pipe.
The factory also seeds the worker client's `MaxPendingCommands` from `MxGateway:Sessions:MaxPendingCommandsPerSession` (default 128, validated `> 0` at startup). This caps how many commands can be in flight to a single worker at once; the `WorkerClient` rejects an enqueue past the cap and records `mxgateway.queues.overflows` tagged `worker-pending-commands`. The bound exists because the worker executes commands serially on one STA — an unbounded backlog would only grow memory and latency, not throughput.
### SessionShutdownHostedService
`SessionShutdownHostedService` is an `IHostedService` whose only job is to call `ISessionManager.ShutdownAsync` from `StopAsync`. It catches `OperationCanceledException` triggered by the host shutdown timeout and logs a warning so that an over-running shutdown does not surface as an unhandled exception.
@@ -172,6 +176,14 @@ catch (Exception exception)
await session.DisposeAsync().ConfigureAwait(false);
}
// If SessionOpened() already incremented the open-session gauge,
// a failure after that point (e.g. auto-subscribe rejection) must
// decrement it again so mxgateway.sessions.open does not leak.
if (sessionOpenedRecorded)
{
_metrics.SessionRemoved();
}
ReleaseSessionSlot();
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
_logger.LogWarning(
@@ -186,7 +198,7 @@ catch (Exception exception)
}
```
The order — fault, deregister, dispose, release slot, record metric, log, rethrow — matters because releasing the semaphore before disposal would let the next open race the worker process tear-down on the same machine.
The order — fault, deregister, dispose, conditionally decrement the open-session gauge, release slot, record fault metric, log, rethrow — matters because releasing the semaphore before disposal would let the next open race the worker process tear-down on the same machine. The `SessionRemoved()` call is conditional on `sessionOpenedRecorded` (Server-006): a failure *after* `SessionOpened()` already incremented `mxgateway.sessions.open` (for example, an auto-subscribe rejection) must decrement the gauge so it does not leak, but a failure before that point must not.
### Run
@@ -194,6 +206,8 @@ While `Ready`, callers reach the worker through `SessionManager.InvokeAsync` or
Event streaming uses `AttachEventSubscriber` which returns a disposable lease. When `allowMultipleSubscribers` is false the second attach throws `EventSubscriberAlreadyActive`; this prevents two gRPC streams from racing on the same worker event channel. Active event subscribers keep the session lease from expiring until the stream is disposed.
The single-subscriber rule is enforced at startup, not just at runtime: setting `MxGateway:Sessions:AllowMultipleEventSubscribers` to `true` is refused by `GatewayOptionsValidator` with "AllowMultipleEventSubscribers is not supported until event fan-out is implemented," so the gateway fails fast rather than booting in a configuration the event path cannot honor. Multi-subscriber fan-out is explicitly out of scope for v1 (see [Design Decisions](./DesignDecisions.md)).
Sessions open with `MxGateway:Sessions:DefaultLeaseSeconds` (default 1800) added to the open timestamp. Unary client activity refreshes the lease by the same duration. `ExtendLease` and `IsLeaseExpired` cooperate with `SessionManager.CloseExpiredLeasesAsync`, which iterates a registry snapshot and closes any session whose lease has expired with `LeaseExpiredReason`. `SessionLeaseMonitorHostedService` runs that sweep every `MxGateway:Sessions:LeaseSweepIntervalSeconds` seconds (default 30).
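
As a rough illustration of the lease arithmetic, the sketch below models a lease that is set at open, refreshed by activity, and swept from a snapshot. `LeaseSketch` and `FindExpired` are hypothetical names, the `>=` boundary is an assumption, and the rule that active event subscribers hold the lease open is omitted for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var opened = new DateTimeOffset(2026, 1, 1, 0, 0, 0, TimeSpan.Zero);
var leaseDuration = TimeSpan.FromSeconds(1800); // mirrors the documented default

var sessions = new List<LeaseSketch>
{
    new LeaseSketch("session-a", opened, leaseDuration),
    new LeaseSketch("session-b", opened, leaseDuration),
};

// Unary activity on session-b 20 minutes in refreshes its lease.
sessions[1].Extend(opened.AddMinutes(20));

// A sweep 31 minutes after open finds only the session that went quiet.
var sweepTime = opened.AddMinutes(31);
foreach (var expired in LeaseSketch.FindExpired(sessions, sweepTime))
{
    Console.WriteLine($"closing {expired.SessionId}: lease expired");
}

// Hypothetical stand-in for the per-session lease bookkeeping.
public sealed class LeaseSketch
{
    private readonly TimeSpan _duration;

    public LeaseSketch(string sessionId, DateTimeOffset openedAt, TimeSpan duration)
    {
        SessionId = sessionId;
        _duration = duration;
        ExpiresAt = openedAt + duration;
    }

    public string SessionId { get; }

    public DateTimeOffset ExpiresAt { get; private set; }

    // Activity pushes the expiry out by the same duration, measured from "now".
    public void Extend(DateTimeOffset now) => ExpiresAt = now + _duration;

    public bool IsExpired(DateTimeOffset now) => now >= ExpiresAt;

    // Works on a snapshot so callers may remove sessions while iterating the result.
    public static IReadOnlyList<LeaseSketch> FindExpired(
        IEnumerable<LeaseSketch> snapshot, DateTimeOffset now)
        => snapshot.Where(s => s.IsExpired(now)).ToList();
}
```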
### Close
@@ -227,11 +241,11 @@ if (_workerClient is not null)
If both graceful shutdown and the kill fall-back fail, the original and kill exceptions are bundled into an `AggregateException` and surfaced as `SessionCloseStartedException`. `SessionManager.CloseSessionCoreAsync` then translates that into a `SessionManagerException` with `CloseFailed` and removes the session.
`GatewaySession.KillWorker` is the unconditional forced-close path used by shutdown when graceful close itself throws, and also by `SessionManager.KillWorkerAsync` — the explicit kill path that the dashboard's admin Kill button invokes. `KillWorkerAsync` skips `WorkerClient.ShutdownAsync` entirely, so `KillCount` increments while `ShutdownCount` does not; the session is then removed from the registry and the open-session slot is released, identical to the cleanup that follows a successful `CloseSessionAsync`.
`GatewaySession.KillWorker` is the unconditional forced-close path. `SessionManager.KillWorkerAsync` — the explicit kill path that the dashboard's admin Kill button invokes — no longer calls it directly; it routes through `GatewaySession.KillWorkerWithCloseGateAsync` so the kill takes the per-session `_closeLock`. That method skips `WorkerClient.ShutdownAsync` entirely and forces the worker process down via `IWorkerClient.Kill`, which records the `mxgateway.workers.killed` counter through `GatewayMetrics.WorkerKilled(reason)`. The session is then removed from the registry and the open-session slot is released, identical to the cleanup that follows a successful `CloseSessionAsync` (which increments `mxgateway.sessions.closed`). There is no separate `KillCount` / `ShutdownCount`: worker terminations are counted by `mxgateway.workers.killed` (tagged with the kill reason), and session closes by `mxgateway.sessions.closed`.
## Shutdown Coordination
`SessionShutdownHostedService.StopAsync` calls `SessionManager.ShutdownAsync`, which closes every registered session with `GatewayShutdownReason`. The shutdown loop catches per-session exceptions, calls `KillWorker`, and removes the session so that one stuck worker cannot block the rest of the host:
`SessionShutdownHostedService.StopAsync` calls `SessionManager.ShutdownAsync`, which closes every registered session with `GatewayShutdownReason`. The shutdown loop catches per-session exceptions and falls back to a forced kill so that one stuck worker cannot block the rest of the host. The fallback routes through `KillWorkerAsync` (not a bare `session.KillWorker`) so the kill takes the same close-gate and metric bookkeeping as the dashboard kill path (Server-046):
```csharp
public async Task ShutdownAsync(CancellationToken cancellationToken)
@@ -248,21 +262,40 @@ public async Task ShutdownAsync(CancellationToken cancellationToken)
exception,
"Graceful shutdown failed for session {SessionId}; killing worker.",
session.SessionId);
// CloseSessionCoreAsync's inner SessionCloseStartedException catch normally
// removes and accounts the session; this fallback only fires for sessions
// still in the registry, and reuses KillWorkerAsync for identical bookkeeping.
if (_registry.TryGet(session.SessionId, out _))
{
session.KillWorker(GatewayShutdownReason);
await RemoveSessionAsync(session).ConfigureAwait(false);
try
{
await KillWorkerAsync(session.SessionId, GatewayShutdownReason, cancellationToken).ConfigureAwait(false);
}
catch (SessionManagerException killException)
{
_logger.LogWarning(
killException,
"Worker kill fallback failed for session {SessionId}.",
session.SessionId);
}
}
}
}
}
```
Iterating over `Snapshot` rather than the live dictionary lets `RemoveSessionAsync` mutate the registry inside the loop without throwing.
Iterating over `Snapshot` rather than the live dictionary lets the registry mutate inside the loop without throwing.
## Gateway Restart and Orphan Cleanup
A graceful shutdown drains sessions through `ShutdownAsync`, but a gateway crash or `Kill` leaves no chance to tear workers down. Those orphaned worker processes outlive the gateway that launched them, still holding their MXAccess COM instance and their named pipe. Because the pipe name encodes the *old* gateway PID, a fresh gateway will never reconnect to them — v1 deliberately does not reattach orphan workers (see [Design Decisions](./DesignDecisions.md)).
Instead, `OrphanWorkerCleanupHostedService` runs once on startup, before any session opens, and calls `OrphanWorkerTerminator.TerminateOrphans`. The terminator enumerates running processes matching the configured worker executable name, skips the current process, and kills any that it identifies as a leftover worker (matched against the configured executable path). Each kill records `mxgateway.workers.killed` tagged `OrphanStartupCleanup` and logs a warning. The sweep is best-effort: a failure to kill any one orphan (it may have already exited, or be inaccessible) is logged and swallowed so it cannot block gateway startup. This service lives in the worker subsystem, not the session subsystem, because it operates on OS processes rather than `GatewaySession` state.
## Dependency Injection
`SessionServiceCollectionExtensions.AddGatewaySessions` registers the four singletons and the hosted service:
`SessionServiceCollectionExtensions.AddGatewaySessions` registers the three singletons and the two hosted services:
```csharp
public static IServiceCollection AddGatewaySessions(this IServiceCollection services)
@@ -270,13 +303,14 @@ public static IServiceCollection AddGatewaySessions(this IServiceCollection serv
services.AddSingleton<ISessionRegistry, SessionRegistry>();
services.AddSingleton<ISessionWorkerClientFactory, SessionWorkerClientFactory>();
services.AddSingleton<ISessionManager, SessionManager>();
services.AddHostedService<SessionLeaseMonitorHostedService>();
services.AddHostedService<SessionShutdownHostedService>();
return services;
}
```
The registry must be a singleton because its `ConcurrentDictionary` is the source of truth for session state across the gRPC service, the lease sweeper, the dashboard, and the shutdown hosted service. Registering `SessionShutdownHostedService` last ensures it is constructed after `ISessionManager` and therefore drains sessions during host stop.
The registry must be a singleton because its `ConcurrentDictionary` is the source of truth for session state across the gRPC service, the lease sweeper, the dashboard, and the shutdown hosted service. `SessionLeaseMonitorHostedService` runs the periodic expired-lease sweep; `SessionShutdownHostedService` drains sessions during host stop. Both are registered after `ISessionManager` so they resolve the same singleton manager when the host starts; `SessionShutdownHostedService` is registered last so it is the latter of the two to be constructed and is available to drain sessions on stop.
## Related Documentation
+2 -2
@@ -4,7 +4,7 @@ The bootstrap layer parses the command-line arguments and environment variables
## Overview
The worker process is a short-lived child of the gateway. The gateway side of this contract lives in [WorkerProcessLauncher](./WorkerProcessLauncher.md). On the worker side, `Program.cs` is a single line that delegates to `WorkerApplication.Run(args)`:
The worker process is a per-session child process of the gateway: one worker is launched per session and lives for that session's lifetime. The gateway side of this contract lives in [WorkerProcessLauncher](./WorkerProcessLauncher.md). On the worker side, `Program.cs` is a single line that delegates to `WorkerApplication.Run(args)`:
```csharp
using ZB.MOM.WW.MxGateway.Worker;
@@ -143,7 +143,7 @@ The production binding in `WorkerApplication.Run(string[])` is `EnvironmentVaria
## Logging
The worker writes structured key/value lines to standard error. Standard error is used rather than standard output because the gateway side reads worker stdout for diagnostic capture only, while stderr is reserved for log output that does not interfere with any future stdout-based channel.
The worker writes structured key/value lines to standard error. The launcher does not redirect either stream (`WorkerProcessLauncher` sets `UseShellExecute=false` and `CreateNoWindow=true` but leaves stdout and stderr inherited), so log output lands on the inherited console rather than a pipe the gateway reads. Standard error is used rather than standard output so that diagnostic logging stays clear of stdout, keeping that stream free for any future stdout-based channel.
### The logger contract
+25 -1
@@ -109,6 +109,30 @@ default:
The MXAccess engine returns values whose semantic type only fully resolves after consulting the engine's own attribute metadata. Clients that round-trip these values through the gateway (replay, parity fixtures, diagnostics) need the original `VT_*` tag, the engine-declared `MxDataType`, and any conversion diagnostic; otherwise edge cases such as decimal-to-double rounding, ulong overflow, or an unknown SAFEARRAY element type become invisible bugs. Storing both the typed projection and the raw fields in the same `MxValue`/`MxArray` lets cross-language clients recover the original observation byte-for-byte where possible and detect lossy cases where it is not.
### Inverse projection for COM writes
The conversions above run on the read path, turning COM values into `MxValue`.
The write path runs the same `VariantConverter` in reverse: `ConvertToComValue`
takes an `MxValue` from a `Write` command and returns a CLR object that the COM
marshaler boxes into the matching VARIANT, so it is the inverse of `Convert`.
- A null `MxValue` argument throws; an `MxValue` whose `IsNull` flag is set
returns `null` (the MXAccess null), keeping the read/write null semantics
symmetric.
- Each `KindCase` maps to its CLR scalar (`bool`, `int`, `long`, `float`,
`double`, `string`). A `TimestampValue` becomes a `DateTime`, which the
marshaler renders as `VT_DATE` — the form MXAccess accepts for the
timestamped-write argument.
- An array kind delegates to `ConvertToComArray`, which projects each
`MxArray.ValuesCase` to a typed CLR array (for example `int[]`, `string[]`, or
a `DateTime[]` for timestamp arrays) so the marshaler produces the
corresponding SAFEARRAY.
- `RawValue` payloads are intentionally rejected on both the scalar and array
paths. Raw bytes are preserved on the read path for diagnostics, but there is
no safe way to reconstruct the original VARIANT from them, so a write that
carries a raw value throws rather than guessing. An `MxValue` with no value
kind set throws for the same reason — there is nothing to write.
## HResultConverter and HResultConversion
`HResultConverter.Convert` wraps any `Exception` thrown across the COM boundary. It prefers `COMException.ErrorCode` over `Exception.HResult` because the runtime sometimes overwrites `Exception.HResult` while marshalling, and the `ErrorCode` field is the value the COM call actually returned.
@@ -223,7 +247,7 @@ public string PreserveCompletionOnlyStatusBytes(byte[] statusBytes)
`MxStatusDetailText` is an internal lookup that maps known `MXSTATUS_PROXY.detail` codes to short human-readable strings (for example `28 = "Index out of range"`, `42 = "Unable to convert string"`, `8017 = "Object must be offscan to modify attributes that have an MxSecurityConfigure security classification"`). `MxStatusProxyConverter.Convert` calls `Lookup` and writes the result to `DiagnosticText`. Unknown codes return `string.Empty`, leaving the numeric `Detail` field as the authoritative identifier.
The mapping covers the engine-error range documented for MXAccess (16-50, 56-61, 541-542, 8017). Adding entries here is the supported way to enrich wire-level diagnostics without changing the proto schema.
The mapping covers selected detail codes in the MXAccess engine-error ranges (16-50, 56-61, 541-542, 8017). The ranges are not contiguous: codes that the runtime does not assign a distinct meaning are omitted (for example 35, 45, and 46 in the 16-50 range and 58-59 in the 56-61 range), so only codes with a known text appear. Adding entries here is the supported way to enrich wire-level diagnostics without changing the proto schema.
## MxStatusConversionException
+3 -3
@@ -16,7 +16,7 @@ The installed MXAccess interop assembly declares an `Apartment` threading model
| `IStaWorkItem` / `StaWorkItem<T>` | Internal queue entries that capture a delegate, a `CancellationToken`, and a `TaskCompletionSource<T>` for the caller. |
| `StaCommand` | Carries an `MxCommand` together with `SessionId`, `CorrelationId`, `EnqueueTimestamp`, and a `CancellationToken`. |
| `IStaCommandExecutor` | The boundary between the dispatcher and the MXAccess interop layer; returns `MxCommandReply`. |
| `StaCommandDispatcher` | Bounded asynchronous queue in front of `StaRuntime` that converts `StaCommand` into `MxCommandReply` and applies status normalization. |
| `StaCommandDispatcher` | A bounded `Queue<T>` (guarded by a lock) with an async drain loop in front of `StaRuntime` that converts `StaCommand` into `MxCommandReply` and applies status normalization. |
## STA Thread Initialization
@@ -141,10 +141,10 @@ finally
`StaRuntime.Shutdown(TimeSpan timeout)` performs an ordered shutdown:
1. Sets `shutdownRequested` under `gate` so `InvokeAsync` rejects new work with `InvalidOperationException`.
1. Sets `shutdownRequested` under `gate` so subsequent `InvokeAsync` calls reject new work. `InvokeAsync` does not throw inline: it returns a faulted `Task` carrying `StaRuntimeShutdownException` (a dedicated subtype, not a bare `InvalidOperationException`). The distinct type lets callers and the dispatcher distinguish "rejected because the runtime is shutting down" from any other invalid-operation condition.
2. Signals `commandWakeEvent` to break the STA out of `WaitForWorkOrMessages`.
3. Waits up to `timeout` on `stoppedEvent`, which the STA sets after it leaves `ThreadMain`.
4. Once the thread has stopped, drains the queue through `CancelQueuedCommands`, which calls `CancelBeforeExecution` on every remaining work item so awaiting callers observe `OperationCanceledException` instead of hanging.
4. The queue is drained through `CancelQueuedCommands` twice. `ThreadMain`'s `finally` block runs it before setting `stoppedEvent`, so any work that was queued while the loop was exiting is canceled on the STA itself. `Shutdown` then runs it again after the wait returns, which catches work enqueued during the gap between the `finally` drain and the gate close. Either way, `CancelBeforeExecution` completes every remaining work item so awaiting callers observe `OperationCanceledException` instead of hanging. (When the STA thread never started, `Shutdown` instead drains directly and sets `stoppedEvent` itself.)
`ThreadMain`'s `finally` block guarantees that `comApartmentInitializer.Uninitialize` runs (when COM was successfully initialized) before `stoppedEvent.Set`, so the apartment is always torn down on the same thread that initialized it. `Dispose` calls `Shutdown` with a five-second budget and only disposes the wait handles when shutdown actually completed, which prevents a still-running STA thread from touching disposed handles.
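
The "returns a faulted `Task` instead of throwing inline" behaviour in step 1 can be sketched as below. `RuntimeSketch` and `ShutdownSketchException` are hypothetical stand-ins; deriving the exception from `InvalidOperationException` is an assumption based on the "dedicated subtype" wording, and the sketch runs work inline where the real runtime queues it to the STA thread.

```csharp
using System;
using System.Threading.Tasks;

var runtime = new RuntimeSketch();
Console.WriteLine(await runtime.InvokeAsync(() => 42));

runtime.Shutdown();

// The call itself does not throw; the rejection surfaces when the task is awaited.
Task<int> rejected = runtime.InvokeAsync(() => 42);
Console.WriteLine($"faulted = {rejected.IsFaulted}");

try
{
    await rejected;
}
catch (ShutdownSketchException exception)
{
    Console.WriteLine($"rejected: {exception.Message}");
}

// Hypothetical dedicated exception type (assumed to derive from InvalidOperationException).
public sealed class ShutdownSketchException : InvalidOperationException
{
    public ShutdownSketchException()
        : base("STA runtime is shutting down.")
    {
    }
}

// Hypothetical stand-in for the runtime's shutdown gate.
public sealed class RuntimeSketch
{
    private readonly object _gate = new object();
    private bool _shutdownRequested;

    public Task<T> InvokeAsync<T>(Func<T> work)
    {
        lock (_gate)
        {
            if (_shutdownRequested)
            {
                // Report the rejection through the task, not as an inline throw.
                return Task.FromException<T>(new ShutdownSketchException());
            }
        }

        // Sketch only: the real runtime would enqueue the work for the STA thread.
        return Task.FromResult(work());
    }

    public void Shutdown()
    {
        lock (_gate)
        {
            _shutdownRequested = true;
        }
    }
}
```

Catching the dedicated type lets a caller treat "rejected during shutdown" separately from other invalid-operation failures, which is the distinction the step describes.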
+3 -1
@@ -65,4 +65,6 @@ CLI, and tests.
- Use `pytest` and `pytest-asyncio`.
- Use fake generated stubs or an in-process test gRPC server for unit tests.
- Keep live integration tests behind `MXGATEWAY_INTEGRATION=1`.
- Keep live integration tests behind an explicit opt-in environment variable
and a `pytest` skip guard, matching the existing tests (for example the
loopback TLS tests gate on `MXGATEWAY_RUN_TLS_TESTS=1`).
+28 -16
@@ -145,9 +145,10 @@ for the alarm subsystem.
Dashboard authentication is LDAP-backed (distinct from the API-key model on
the gRPC API). `/login` accepts username and password in a form body, binds
against `MxGateway:Ldap`, maps the user's LDAP groups to `Admin` or `Viewer`
via `MxGateway:Dashboard:GroupToRole`, and issues an HTTP-only secure
`__Host-MxGatewayDashboard` cookie. `/logout` clears it. Login and logout
against `MxGateway:Ldap`, maps the user's LDAP groups to `Administrator` or
`Viewer` via `MxGateway:Dashboard:GroupToRole`, and issues an HTTP-only secure
`MxGatewayDashboard` cookie (the name is configurable via
`MxGateway:Dashboard:CookieName`). `/logout` clears it. Login and logout
posts validate antiforgery tokens. SignalR hub connections accept either the
cookie or a 30-minute data-protected bearer minted at `/hubs/token`.
`MxGateway:Dashboard:AllowAnonymousLocalhost` permits loopback to bypass the
@@ -232,27 +233,35 @@ message WorkerEnvelope {
uint32 protocol_version = 1;
string session_id = 2;
uint64 sequence = 3;
uint64 correlation_id = 4;
string correlation_id = 4;
oneof body {
WorkerHello worker_hello = 10;
GatewayHello gateway_hello = 11;
GatewayHello gateway_hello = 10;
WorkerHello worker_hello = 11;
WorkerReady worker_ready = 12;
WorkerCommand command = 20;
WorkerCommandReply command_reply = 21;
WorkerEvent event = 22;
WorkerHeartbeat heartbeat = 23;
WorkerCancel cancel = 24;
WorkerShutdown shutdown = 25;
WorkerFault fault = 26;
WorkerCommand worker_command = 13;
WorkerCommandReply worker_command_reply = 14;
WorkerCancel worker_cancel = 15;
WorkerShutdown worker_shutdown = 16;
WorkerShutdownAck worker_shutdown_ack = 17;
WorkerEvent worker_event = 18;
WorkerHeartbeat worker_heartbeat = 19;
WorkerFault worker_fault = 20;
}
}
```
The contract evolves additively only: field numbers and enum values are never
renumbered or repurposed, so a stale gateway and worker that disagree on the
newest tags still decode the fields they share. `correlation_id` is a `string`
(not a numeric id) because it is the same correlation token the public gRPC API
carries end to end, so the worker never has to translate id formats.
Rules:
- `sequence` is monotonic per sender.
- `correlation_id` links commands to replies.
- Events use their own correlation id or zero.
- Events carry their own correlation id or an empty string.
- Replies must preserve MXAccess HRESULT/status information even when the
command is also represented as a protocol-level failure.
- Protocol version mismatch fails session creation.
@@ -659,8 +668,10 @@ External gateway:
- authenticate v1 gRPC clients with `authorization: Bearer
mxgw_<key-id>_<secret>` API-key metadata,
- reject missing or invalid API keys with gRPC `Unauthenticated`,
- reject valid keys that lack the required session, invoke, event, metadata, or
admin scope with gRPC `PermissionDenied`,
- reject valid keys that lack the required scope with gRPC `PermissionDenied`.
Scopes are fine-grained: `session:open`, `session:close`, `invoke:read`,
`invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, and `admin`
(see `GatewayScopes`),
- authorize access to commands that can write, authenticate users, expose
metadata, stream events, or alter runtime state.
@@ -901,6 +912,7 @@ State machine:
Creating
-> StartingWorker
-> WaitingForPipe
-> Handshaking
-> InitializingWorker
-> Ready
-> Closing
+71 -50
@@ -59,13 +59,17 @@ For mxaccessgw dev, `admin` covers every gw-side capability test;
`readonly` is the right "negative" case for proving Browse-OK /
Write-denied.
The gateway dashboard adds one role beyond this LmxOpcUa taxonomy:
`GwAdmin`. `LdapOptions.RequiredGroup` defaults to `GwAdmin`, so the
dashboard login and `DashboardLdapLiveTests` require `admin` to be a
member of a `GwAdmin` group. `GwAdmin` is **not** in the baseline
GLAuth config — it must be provisioned before dashboard authn or the
LDAP live tests work. See [Provisioning the GwAdmin
group](#provisioning-the-gwadmin-group) below.
The gateway dashboard adds one group beyond this LmxOpcUa taxonomy:
`GwAdmin`. There is no `RequiredGroup` option — dashboard authorization
is driven entirely by `MxGateway:Dashboard:GroupToRole`, which maps an
LDAP group to a dashboard role. A user whose groups produce no mapped
role is rejected at login. So for the dashboard to admit `admin`, a
group named in `GroupToRole` (by convention `GwAdmin``Administrator`)
must exist and `admin` must belong to it. `GwAdmin` is **not** in the
baseline GLAuth config — it must be provisioned before dashboard authn
or the `DashboardLdapLiveTests` (`MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`)
work. See [Provisioning the GwAdmin group](#provisioning-the-gwadmin-group)
below.
> **Dashboard role value (Task 1.7):** the LDAP `GwAdmin` group now maps to
> the canonical dashboard role **`Administrator`** (was `Admin`); `GwReader`
@@ -112,43 +116,58 @@ to avoid re-deriving the LDAP escape-string handling.
## Suggested mxgw configuration shape
A YAML/JSON section for mxaccessgw that mirrors LmxOpcUa's `LdapOptions`
record:
The gateway binds the `MxGateway:Ldap` section onto `LdapOptions`. The
field names are PascalCase config keys (shown here as YAML; JSON
`appsettings` and env-var overrides use the same names). Note the keys
that changed from the older LmxOpcUa shape: `Transport` (an enum,
replacing the boolean `UseTls`), `AllowInsecure` (replacing
`AllowInsecureLdap`), and `UserNameAttribute` which defaults to `cn`:
```yaml
ldap:
enabled: true
server: localhost
port: 3893
useTls: false
allowInsecureLdap: true # dev only
searchBase: "dc=zb,dc=local"
serviceAccountDn: "cn=serviceaccount,dc=zb,dc=local"
serviceAccountPassword: "serviceaccount123"
userNameAttribute: "uid" # GLAuth populates this; AD uses sAMAccountName
displayNameAttribute: "cn"
groupAttribute: "memberOf"
groupToRole:
ReadOnly: "Browse"
WriteOperate: "Write"
WriteTune: "WriteSecured"
WriteConfigure: "WriteSecured"
AlarmAck: "AlarmAck"
MxGateway:
Ldap:
Enabled: true
Server: localhost
Port: 3893
Transport: None # None | StartTls | Ldaps (dev: None)
AllowInsecure: true # dev only
SearchBase: "dc=zb,dc=local"
ServiceAccountDn: "cn=serviceaccount,dc=zb,dc=local"
ServiceAccountPassword: "serviceaccount123"
UserNameAttribute: "cn" # GLAuth keys users by cn; AD uses sAMAccountName
DisplayNameAttribute: "cn"
GroupAttribute: "memberOf"
Dashboard:
GroupToRole:
GwAdmin: "Administrator"
GwReader: "Viewer"
```
`groupAttribute` returns full DNs like
`ou=ReadOnly,ou=groups,dc=zb,dc=local` — the authenticator
should strip the leading `ou=` (or `cn=` against AD) RDN value and
look that up in `groupToRole`.
`Transport` is an `LdapTransport` enum (`None`, `StartTls`, `Ldaps`); it
replaces the old boolean `UseTls` (`true` → `Ldaps`, `false` → `None`).
`UserNameAttribute` defaults to `cn` because GLAuth keys users by `cn`
(`backend.nameformat = "cn"`); only AD needs `sAMAccountName`. The
group-to-role mapping lives under `MxGateway:Dashboard:GroupToRole`, not
in the LDAP section, and its values must be dashboard roles
(`Administrator` or `Viewer`).
The shared `ZB.MOM.WW.Auth.Ldap` provider performs the runtime bind and
search; it returns each group already stripped to its short RDN value
(e.g. `GwAdmin` from `ou=GwAdmin,ou=groups,dc=zb,dc=local`) before the
gateway looks it up in `GroupToRole`. Keep `GroupToRole` keys as short
group names — a full-DN key will never match the short name the provider
returns.
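
To make the short-name lookup concrete, here is a minimal sketch of reducing a group DN to its leading RDN value and resolving roles from a `GroupToRole`-style map. `GroupRoleSketch` is hypothetical and only illustrates the matching rule; in the gateway the stripping is done by the shared provider, and this naive split does not handle escaped commas inside an RDN.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var groupToRole = new Dictionary<string, string>
{
    ["GwAdmin"] = "Administrator",
    ["GwReader"] = "Viewer",
};

var memberOf = new[]
{
    "ou=GwAdmin,ou=groups,dc=zb,dc=local",
    "ou=ReadOnly,ou=groups,dc=zb,dc=local", // no mapping, so it contributes no role
};

var roles = GroupRoleSketch.ResolveRoles(memberOf, groupToRole);
Console.WriteLine(roles.Count == 0 ? "login rejected" : string.Join(", ", roles));

// Hypothetical helper; not the provider's or the gateway's actual code.
public static class GroupRoleSketch
{
    // "ou=GwAdmin,ou=groups,dc=zb,dc=local" -> "GwAdmin"
    public static string ShortName(string groupDn)
    {
        var firstRdn = groupDn.Split(',')[0];
        var equalsIndex = firstRdn.IndexOf('=');
        return equalsIndex < 0
            ? firstRdn.Trim()
            : firstRdn.Substring(equalsIndex + 1).Trim();
    }

    // Maps each short group name through the table; unmapped groups are ignored.
    public static IReadOnlyList<string> ResolveRoles(
        IEnumerable<string> memberOf,
        IReadOnlyDictionary<string, string> groupToRole)
    {
        return memberOf
            .Select(dn => ShortName(dn))
            .Where(name => groupToRole.ContainsKey(name))
            .Select(name => groupToRole[name])
            .Distinct()
            .ToList();
    }
}
```

With a full DN as the dictionary key instead of `GwAdmin`, `ResolveRoles` would return an empty list here, which is the "will never match" case the paragraph warns about.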
## Provisioning the GwAdmin group
`GwAdmin` is the gateway-specific dashboard-admin role. It is the
default `LdapOptions.RequiredGroup`, so the dashboard cookie login and
`DashboardLdapLiveTests` (`MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`) reject
`admin` until a `GwAdmin` group exists and `admin` is a member.
GLAuth's baseline config ships only the five LmxOpcUa role groups, so
`GwAdmin` must be added to GLAuth rather than run from a separate LDAP
`GwAdmin` is the gateway-specific dashboard-admin group, mapped to the
`Administrator` role through `MxGateway:Dashboard:GroupToRole`. Because
dashboard login rejects any user who resolves to no role, the dashboard
cookie login and `DashboardLdapLiveTests`
(`MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`) reject `admin` until a `GwAdmin`
group exists, `admin` is a member, and `GroupToRole` maps `GwAdmin` to a
role. GLAuth's baseline config ships only the five LmxOpcUa role groups,
so `GwAdmin` must be added to GLAuth rather than run from a separate LDAP
server:
1. Edit `C:\publish\glauth\glauth.cfg`
@@ -178,10 +197,11 @@ server:
4. `nssm restart GLAuth`
After the restart, `admin`'s `memberOf` includes
`ou=GwAdmin,ou=groups,dc=zb,dc=local`, which the authenticator
strips to `GwAdmin` and matches against `RequiredGroup`. The same
pattern applies to any future permission that doesn't fit the existing
five roles.
`ou=GwAdmin,ou=groups,dc=zb,dc=local`. The shared LDAP provider strips
that to the short RDN `GwAdmin`, which the gateway looks up in
`MxGateway:Dashboard:GroupToRole` to resolve the dashboard role. The same
pattern applies to any future group that doesn't fit the existing five
roles — add the group, add the member, and add a `GroupToRole` entry.
Generate `passsha256` from a plaintext password:
@@ -254,24 +274,25 @@ Get-Content C:\publish\glauth\logs\stderr.log -Tail 20 -Wait
## Active Directory migration cheat-sheet
LmxOpcUa's `LdapOptions` xml-doc captures the AD overrides; same set
applies to mxaccessgw verbatim. Keys that change:
These `MxGateway:Ldap` keys change when pointing the gateway at AD
instead of dev GLAuth:
| Field | GLAuth dev value | AD production value |
|---|---|---|
| `Server` | `localhost` | a domain controller FQDN, or the domain itself |
| `Port` | `3893` | `636` (LDAPS) — AD increasingly rejects plain bind under LDAP-signing enforcement |
| `UseTls` | `false` | `true` |
| `AllowInsecureLdap` | `true` | `false` |
| `Transport` | `None` | `Ldaps` (or `StartTls`) |
| `AllowInsecure` | `true` | `false` |
| `SearchBase` | `dc=zb,dc=local` | `DC=corp,DC=example,DC=com` |
| `ServiceAccountDn` | `cn=serviceaccount,dc=zb,dc=local` | `CN=MxGwSvc,OU=Service Accounts,DC=corp,...` |
| `UserNameAttribute` | `uid` | `sAMAccountName` (or `userPrincipalName`) |
| `UserNameAttribute` | `cn` | `sAMAccountName` (or `userPrincipalName`) |
| `GroupAttribute` | `memberOf` (unchanged) | `memberOf` (unchanged) |
`memberOf` returns full DNs; the authenticator strips the leading
`CN=` value and uses it as the lookup key in `groupToRole`. Nested
groups are **not** auto-expanded; either flatten in the directory or
add a `tokenGroups` query as an enhancement.
`memberOf` returns full DNs; the shared LDAP provider strips each to its
leading RDN value (`CN=`/`OU=`) and the gateway uses that as the lookup
key in `MxGateway:Dashboard:GroupToRole`. Nested groups are **not**
auto-expanded; either flatten in the directory or add a `tokenGroups`
query as an enhancement.
## Security notes for production