From e541339c076f045f61f7c73d71e16fdaa719d977 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Wed, 3 Jun 2026 16:01:28 -0400 Subject: [PATCH] docs(audit): apply per-cluster judgment fixes across living docs Resolve audit findings: correct WorkerEnvelope proto/route/metric/session facts; rewrite auth (ZB.MOM.WW.Auth migration), dashboard (ZB.MOM.WW.Theme), and StyleGuide (foreign-project copy-paste); document alarm subsystem, Ldap options, and gateway alarm broker; fix client CLI flags and package paths. --- CLAUDE.md | 6 +- StyleGuide.md | 131 ++++++++------- clients/dotnet/DotnetClientDesign.md | 3 - clients/go/GoClientDesign.md | 3 + clients/go/README.md | 2 +- clients/rust/README.md | 4 +- docs/AlarmClientDiscovery.md | 225 +++++++++++++++++++++----- docs/Authentication.md | 118 ++++++++------ docs/Authorization.md | 37 +++-- docs/ClientPackaging.md | 39 ++++- docs/ClientProtoGeneration.md | 11 +- docs/Contracts.md | 38 +++++ docs/DashboardInterfaceDesign.md | 210 ++++++++++++++---------- docs/DesignDecisions.md | 14 +- docs/Diagnostics.md | 27 +++- docs/GalaxyRepository.md | 53 +++++- docs/GatewayConfiguration.md | 46 ++++++ docs/GatewayDashboardDesign.md | 158 +++++++++++++----- docs/GatewayProcessDesign.md | 14 +- docs/GatewayTesting.md | 11 ++ docs/Grpc.md | 13 +- docs/MxAccessWorkerInstanceDesign.md | 110 +++++++++---- docs/Sessions.md | 56 +++++-- docs/WorkerBootstrap.md | 4 +- docs/WorkerConversion.md | 26 ++- docs/WorkerSta.md | 6 +- docs/style-guides/PythonStyleGuide.md | 4 +- gateway.md | 44 +++-- glauth.md | 121 ++++++++------ 29 files changed, 1102 insertions(+), 432 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index a5cb7ef..79aab1d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -32,7 +32,7 @@ dotnet test src/MxGateway.Worker.Tests/MxGateway.Worker.Tests.csproj -p:Platform dotnet run --project src/ZB.MOM.WW.MxGateway.Server/ZB.MOM.WW.MxGateway.Server.csproj # API-key admin CLI (same exe, "apikey" subcommand) -dotnet run --project 
src/ZB.MOM.WW.MxGateway.Server/ZB.MOM.WW.MxGateway.Server.csproj -- apikey create --display-name "dev" --scopes session,invoke,event,metadata,admin +dotnet run --project src/ZB.MOM.WW.MxGateway.Server/ZB.MOM.WW.MxGateway.Server.csproj -- apikey create-key --key-id dev --display-name "dev" --scopes session:open,session:close,invoke:read,invoke:write,invoke:secure,events:read,metadata:read,admin ``` Single test by name (xUnit `--filter`): @@ -77,7 +77,7 @@ powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 - **Gateway restart does not reattach orphan workers.** The first version terminates orphaned workers on startup; do not design code paths that assume reattachment. - **No Blazor UI component libraries.** Dashboard uses local Bootstrap CSS/JS only — do not introduce MudBlazor, Radzen, FluentUI, etc. - **Don't log secrets or full tag values by default.** API keys, passwords, `WriteSecured` payloads, and `AuthenticateUser` credentials must never reach logs. Value logging is opt-in and redacted. -- **Generated code** under `src/MxGateway.Contracts/Generated/`, `clients/*/generated*/`, `clients/python/src/mxgateway/generated/`, etc., is build output. Don't hand-edit. To regenerate, build the contracts project (`dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj`) or run the per-client generation step in that client's README. +- **Generated code** under `src/MxGateway.Contracts/Generated/`, `clients/*/generated*/`, `clients/python/src/zb_mom_ww_mxgateway/generated/`, etc., is build output. Don't hand-edit. To regenerate, build the contracts project (`dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj`) or run the per-client generation step in that client's README. - **Documentation style** (`StyleGuide.md`): PascalCase filenames, no marketing language, present tense, explain *why* not *what*. 
- **Update docs in the same change as the source.** When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs (`gateway.md`, `docs/`, client READMEs, design docs) must change in the same commit. Don't leave stale prose describing old behavior. @@ -114,7 +114,7 @@ External analysis sources referenced by design docs: ## Authentication -Gateway gRPC clients authenticate with an API key in metadata: `authorization: Bearer mxgw__`. Keys are stored hashed (with a peppered SHA) in a gateway-owned SQLite DB (default `C:\ProgramData\MxGateway\gateway-auth.db`). Scopes (`session`, `invoke`, `event`, `metadata`, `admin`) gate specific RPCs; missing → `Unauthenticated`, insufficient → `PermissionDenied`. The `apikey` subcommand on the server exe manages keys; see `src/MxGateway.Server/Security/Authentication/`. +Gateway gRPC clients authenticate with an API key in metadata: `authorization: Bearer mxgw__`. Keys are stored hashed (with a peppered SHA) in a gateway-owned SQLite DB (default `C:\ProgramData\MxGateway\gateway-auth.db`). Scopes (`session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`) gate specific RPCs; missing → `Unauthenticated`, insufficient → `PermissionDenied`. The `apikey` subcommand on the server exe manages keys; see `src/MxGateway.Server/Security/Authentication/`. Dashboard auth is LDAP-backed (separate from the gRPC API-key model). `/login` binds against `MxGateway:Ldap` and maps the user's LDAP groups to `Administrator` or `Viewer` via `MxGateway:Dashboard:GroupToRole`, then issues an HTTP-only secure `MxGatewayDashboard` cookie. SignalR hubs at `/hubs/{snapshot,alarms,events}` accept either the cookie or a 30-minute bearer minted at `/hubs/token`. `Dashboard:AllowAnonymousLocalhost` bypasses auth on loopback when enabled. 
diff --git a/StyleGuide.md b/StyleGuide.md index ad60857..368470d 100644 --- a/StyleGuide.md +++ b/StyleGuide.md @@ -1,42 +1,48 @@ # Documentation Style Guide -This guide defines writing conventions and formatting rules for all ScadaBridge documentation. +This guide defines writing conventions and formatting rules for all MXAccess +Gateway (`mxaccessgw`) documentation. ## Tone and Voice ### Be Technical and Direct -Write for developers who are familiar with .NET. Don't explain basic concepts like dependency injection or async/await unless they're used in an unusual way. +Write for developers who are familiar with .NET. Don't explain basic concepts +like dependency injection or async/await unless they're used in an unusual way. **Good:** -> The `ScadaGatewayActor` routes messages to the appropriate `ScadaClientActor` based on the client ID in the message. +> The `SessionManager` launches one worker per session and tracks it through the +> session state machine. **Avoid:** -> The ScadaGatewayActor is a really powerful component that helps manage all your SCADA connections efficiently! +> The SessionManager is a really powerful component that helps manage all your +> MXAccess connections efficiently! ### Explain "Why" Not Just "What" Document the reasoning behind patterns and decisions, not just the mechanics. **Good:** -> Health checks use a 5-second timeout because actors under heavy load may take several seconds to respond, but longer delays indicate a real problem. +> The worker pumps Windows messages on its STA thread because a plain blocking +> queue does not let MXAccess COM events deliver. **Avoid:** -> Health checks use a 5-second timeout. +> The worker pumps Windows messages on its STA thread. ### Use Present Tense Describe what the code does, not what it will do. **Good:** -> The actor validates the message before processing. +> The gateway terminates orphaned workers on startup. **Avoid:** -> The actor will validate the message before processing. 
+> The gateway will terminate orphaned workers on startup. ### No Marketing Language -This is internal technical documentation. Avoid superlatives and promotional language. +This is internal technical documentation. Avoid superlatives and promotional +language. **Avoid:** "powerful", "robust", "cutting-edge", "seamless", "blazing fast" @@ -45,10 +51,10 @@ This is internal technical documentation. Avoid superlatives and promotional lan ### File Names Use `PascalCase.md` for all documentation files: -- `Overview.md` -- `HealthChecks.md` -- `StateMachines.md` -- `SignalR.md` +- `Sessions.md` +- `GatewayConfiguration.md` +- `WorkerSta.md` +- `Diagnostics.md` ### Headings @@ -58,11 +64,11 @@ Use `PascalCase.md` for all documentation files: - **H4+ (`####`):** Rarely needed, Sentence case ```markdown -# Actor Health Checks +# Gateway Configuration -## Configuration Options +## Session Options -### Setting the timeout +### Setting the lease timeout #### Default values ``` @@ -73,40 +79,43 @@ Always specify the language: ````markdown ```csharp -public class MyActor : ReceiveActor { } +public sealed class GatewaySession { } ``` ```json { - "Setting": "value" + "MxGateway": { "Sessions": { "MaxConcurrent": 8 } } } ``` -```bash -dotnet build +```powershell +dotnet build src/ZB.MOM.WW.MxGateway.slnx ``` ```` -Supported languages: `csharp`, `json`, `bash`, `xml`, `sql`, `yaml`, `html`, `css`, `javascript` +Supported languages: `csharp`, `json`, `bash`, `powershell`, `xml`, `sql`, +`text`, `rust`, `python`, `go`, `proto`, `html`, `css`, `toml`. ### Code Snippets -**Length:** 5-25 lines is typical. Shorter for simple concepts, longer for complete examples. +**Length:** 5-25 lines is typical. Shorter for simple concepts, longer for +complete examples. 
**Context:** Include enough to understand where the code lives: ```csharp // Good - shows class context -public class TemplateInstanceActor : ReceiveActor +public sealed class GatewaySession { - public TemplateInstanceActor(TemplateInstanceConfig config) + public GatewaySession(SessionId sessionId, WorkerPipeSession pipe) { - Receive(Handle); + _sessionId = sessionId; + _pipe = pipe; } } // Avoid - orphaned snippet -Receive(Handle); +_pipe = pipe; ``` **Accuracy:** Only use code that exists in the codebase. Never invent examples. @@ -134,34 +143,34 @@ Use tables for structured reference information: ```markdown | Option | Default | Description | |--------|---------|-------------| -| `Timeout` | `5000` | Milliseconds to wait | -| `RetryCount` | `3` | Number of retry attempts | +| `MaxConcurrent` | `8` | Maximum simultaneous sessions | +| `LeaseTimeoutSeconds` | `60` | Idle lease before sweep | ``` ### Inline Code Use backticks for: -- Class names: `ScadaGatewayActor` -- Method names: `HandleMessage()` +- Class names: `SessionManager` +- Method names: `KillWorkerAsync()` - File names: `appsettings.json` -- Configuration keys: `ScadaBridge:Timeout` +- Configuration keys: `MxGateway:Sessions:MaxConcurrent` - Command-line commands: `dotnet build` ### Links Use relative paths for internal documentation: ```markdown -[See the Actors guide](../Akka/Actors.md) -[Configuration options](./Configuration.md) +[See the architecture overview](./gateway.md) +[Configuration options](./docs/GatewayConfiguration.md) ``` Use descriptive link text: ```markdown -See the [Actor Health Checks](../Akka/HealthChecks.md) documentation. +See the [Gateway Configuration](./docs/GatewayConfiguration.md) documentation. -See [here](../Akka/HealthChecks.md) for more. +See [here](./docs/GatewayConfiguration.md) for more. ``` ## Structure Conventions @@ -173,9 +182,10 @@ Every document starts with: 2. 
1-2 sentence description of purpose ```markdown -# Actor Health Checks +# Worker STA Thread -Health checks monitor actor responsiveness and report status to the ASP.NET Core health check system. +The worker owns one MXAccess COM instance on a dedicated STA thread and pumps +Windows messages so MXAccess events deliver. ``` ### Section Organization @@ -194,15 +204,15 @@ Organize content from general to specific: Place code examples immediately after the concept they illustrate: ```markdown -## Message Handling +## Session Close -Actors process messages using `Receive` handlers: +The gateway closes a session by killing its worker behind the close gate: ```csharp -Receive(msg => HandleMyMessage(msg)); +await session.KillWorkerWithCloseGateAsync(cancellationToken); ``` -Each handler processes one message type... +The close gate serializes concurrent close attempts... ``` ### Related Documentation Section @@ -212,9 +222,9 @@ End each document with links to related topics: ```markdown ## Related Documentation -- [Actor Patterns](./Patterns.md) -- [Health Checks](../Operations/HealthChecks.md) -- [Configuration](../Configuration/Akka.md) +- [Sessions](./docs/Sessions.md) +- [Worker STA Thread](./docs/WorkerSta.md) +- [Gateway Configuration](./docs/GatewayConfiguration.md) ``` ## Naming Conventions @@ -222,30 +232,33 @@ End each document with links to related topics: ### Match Code Exactly Use the exact names from source code: -- `TemplateInstanceActor` not "Template Instance Actor" -- `ScadaGatewayActor` not "SCADA Gateway Actor" -- `IRequiredActor` not "required actor interface" +- `MxStatusProxy` not "MX status proxy" +- `SessionManager` not "session manager" +- `OrphanWorkerTerminator` not "orphan worker terminator" ### Acronyms Spell out on first use, then use acronym: -> OPC Unified Architecture (OPC UA) provides industrial communication standards. OPC UA servers expose... +> Single-threaded apartment (STA) threads serialize COM calls. 
STA message +> pumping lets MXAccess events deliver... Common acronyms that don't need expansion: - API - JSON - SQL - HTTP/HTTPS -- REST -- JWT +- COM +- gRPC +- IPC +- STA - UI ### File Paths Use forward slashes and backticks: -- `src/Infrastructure/Akka/Actors/` +- `src/ZB.MOM.WW.MxGateway.Server/` - `appsettings.json` -- `Documentation/Akka/Overview.md` +- `docs/GatewayConfiguration.md` ## What to Avoid @@ -260,13 +273,14 @@ The constructor creates a new instance of the class. ## Constructor -The constructor accepts an `IActorRef` for the gateway actor, which must be resolved before actor creation. +The constructor accepts a `WorkerPipeSession`, which must be connected before +the session transitions out of `Handshaking`. ``` ### Don't Duplicate Source Code Comments If code has good comments, reference the file rather than copying: -> See `ScadaGatewayActor.cs` lines 45-60 for the message routing logic. +> See `SessionManager.cs` for the open-failure rollback order. ### Don't Include Temporary Information @@ -278,5 +292,12 @@ Assume readers know: - Dependency injection - async/await - LINQ -- Entity Framework basics - ASP.NET Core middleware pipeline +- gRPC service basics + +## Related Documentation + +- [Architecture overview](./gateway.md) +- [Gateway Configuration](./docs/GatewayConfiguration.md) +- [C# Style Guide](./docs/style-guides/CSharpStyleGuide.md) +- [Go Style Guide](./docs/style-guides/GoStyleGuide.md), [Java Style Guide](./docs/style-guides/JavaStyleGuide.md), [Python Style Guide](./docs/style-guides/PythonStyleGuide.md), [Rust Style Guide](./docs/style-guides/RustStyleGuide.md), [Protobuf Style Guide](./docs/style-guides/ProtobufStyleGuide.md) diff --git a/clients/dotnet/DotnetClientDesign.md b/clients/dotnet/DotnetClientDesign.md index 4124f3d..fe997a7 100644 --- a/clients/dotnet/DotnetClientDesign.md +++ b/clients/dotnet/DotnetClientDesign.md @@ -32,8 +32,6 @@ clients/dotnet/ Commands/ ZB.MOM.WW.MxGateway.Client.Tests/ 
ZB.MOM.WW.MxGateway.Client.Tests.csproj - ZB.MOM.WW.MxGateway.Client.IntegrationTests/ - ZB.MOM.WW.MxGateway.Client.IntegrationTests.csproj ``` Target framework: @@ -52,7 +50,6 @@ Expected packages: - `Grpc.Net.Client` - `Google.Protobuf` -- `Grpc.Tools` for generation - `Microsoft.Extensions.Logging.Abstractions` - `System.CommandLine` or similar for CLI - test framework: xUnit or NUnit diff --git a/clients/go/GoClientDesign.md b/clients/go/GoClientDesign.md index dd0d51b..f409202 100644 --- a/clients/go/GoClientDesign.md +++ b/clients/go/GoClientDesign.md @@ -27,6 +27,9 @@ clients/go/ internal/generated/ mxaccess_gateway.pb.go mxaccess_gateway_grpc.pb.go + galaxy_repository.pb.go + galaxy_repository_grpc.pb.go + mxaccess_worker.pb.go cmd/mxgw-go/ main.go tests/ diff --git a/clients/go/README.md b/clients/go/README.md index b6ab95c..ed003f8 100644 --- a/clients/go/README.md +++ b/clients/go/README.md @@ -140,7 +140,7 @@ pairs `Children` with `ChildHasChildren` so you know which nodes to expand. See request and filter semantics. 
```go -import pb "gitea.dohertylan.com/dohertj2/mxaccessgw/clients/go/internal/generated/galaxy_repository/v1" +import pb "gitea.dohertylan.com/dohertj2/mxaccessgw/clients/go/internal/generated" reply, err := galaxy.BrowseChildren(ctx, &pb.BrowseChildrenRequest{}) if err != nil { diff --git a/clients/rust/README.md b/clients/rust/README.md index ccb3397..c8b7ba3 100644 --- a/clients/rust/README.md +++ b/clients/rust/README.md @@ -62,8 +62,8 @@ cargo run -p mxgw-cli -- register --session-id --client-name mxgw-r cargo run -p mxgw-cli -- add-item --session-id --server-handle 1 --item TestChildObject.TestInt --json cargo run -p mxgw-cli -- advise --session-id --server-handle 1 --item-handle 1 --json cargo run -p mxgw-cli -- stream-events --session-id --max-events 1 --json -cargo run -p mxgw-cli -- stream-alarms --session-id --max-messages 1 --json -cargo run -p mxgw-cli -- acknowledge-alarm --session-id --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json +cargo run -p mxgw-cli -- stream-alarms --max-events 1 --json +cargo run -p mxgw-cli -- acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --json cargo run -p mxgw-cli -- write --session-id --server-handle 1 --item-handle 1 --value-type int32 --value 123 --json ``` diff --git a/docs/AlarmClientDiscovery.md b/docs/AlarmClientDiscovery.md index 056bb75..8b213e1 100644 --- a/docs/AlarmClientDiscovery.md +++ b/docs/AlarmClientDiscovery.md @@ -67,9 +67,17 @@ list. ## What this means -The architecture comment on -`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/AlarmClientConsumer.cs` (PR A.5) is -**wrong against this deployed assembly**: +> **Historical note (current as built).** This discovery record predates the +> as-built alarm path. 
The `AlarmClientConsumer.cs` file referenced below was +> retired; the production consumer is +> `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs` (driven by the +> `wwAlarmConsumerClass` COM surface — see [Option A](#option-a--captured-2026-05-01) +> below). The current public RPC surface and broker architecture are summarized +> in [Current alarm path (as built)](#current-alarm-path-as-built) at the end of +> this document; the sections in between are kept as a discovery record. + +The architecture comment on the (now-retired) `AlarmClientConsumer.cs` (PR A.5) +was **wrong against this deployed assembly**: > "The AVEVA alarm-manager surface (`IAlarmMgrDataProvider`) exposes > the events we need as plain .NET events — no Windows message pump @@ -601,8 +609,14 @@ returned to normal but is unacknowledged — i.e., visible in the "current alarms" list because operator hasn't acked it yet) and `UNACK_ALM` (the alarm is currently active and unacknowledged). The other states from `eAlmState` (`ACK_RTN`, `ACK_ALM`) would -appear when an ack is performed — `wwAlarmConsumerClass.AlarmAckByGUID` -is the method to call. +appear when an ack is performed. + +> **Forward reference / superseded:** an earlier draft named +> `wwAlarmConsumerClass.AlarmAckByGUID` as the ack method. That call turned out +> to be **`E_NOTIMPL`** on this AVEVA build (see +> [`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented) +> below). The as-built ack path is the v1 6-arg `AlarmAckByName` on a dedicated +> ack-only consumer instance. Do not wire acks through `AlarmAckByGUID`. ### `GetStatistics` AV — unrelated quirk @@ -638,20 +652,25 @@ alarm-consumer surface unblocks A.2 fully. Outline: payload; diff against the previous snapshot (keyed by `GUID`); emit `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` events for added/changed/removed records. 
- - `AlarmAckByGUID(VBGUID, comment, oprName, node, domain, - fullName)` for client-driven acknowledgements (matches - PR A.5's `AlarmAckCommand` payload). + - Client-driven acknowledgements. (This draft named `AlarmAckByGUID` and a + `AlarmAckCommand` payload; as built the ack proto is + `AcknowledgeAlarmCommand` / `AcknowledgeAlarmByNameCommand`, the consumer + interface method is `AcknowledgeByGuid` / `AcknowledgeByName`, and the GUID + path is `E_NOTIMPL` so only the by-name path runs — see + [`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented).) - Lifecycle teardown: `DeregisterConsumer` + `UninitializeConsumer` + `Marshal.FinalReleaseComObject`. -3. **Conversion layer:** map XML record fields to - `MxAlarmConditionRecord` proto: - - `GUID` → `condition_id` (canonicalize the no-dashes hex - to a UUID string). - - `STATE` enum → `inAlarm` + `acked` booleans - (`UNACK_ALM` → in_alarm=true, acked=false; - `UNACK_RTN` → in_alarm=false, acked=false; - `ACK_ALM` → in_alarm=true, acked=true; - `ACK_RTN` → in_alarm=false, acked=true). +3. **Conversion layer:** map XML record fields to the alarm proto: + - `GUID` and `PROVIDER_NAME!GROUP.TAGNAME` → `alarm_full_reference` (there is + no `condition_id` field; the public RPC and worker carry the reference as + `alarm_full_reference`, either a canonical GUID or `Provider!Group.Tag`). + - `STATE` → `AlarmConditionState` on `ActiveAlarmSnapshot.current_state` + (this draft used `inAlarm` + `acked` booleans, which the proto does not + have). As built, the snapshot state collapses to three values: + `UNACK_ALM` → `Active`; `ACK_ALM` → `ActiveAcked`; `UNACK_RTN` and + `ACK_RTN` both → `Inactive` (a returned-to-normal alarm is no longer + "active"). For the live `transition` feed the `STATE` instead drives an + `AlarmTransitionKind` (`Raise` / `Acknowledge` / `Clear`). - `DATE + TIME + GMTOFFSET + DSTADJUST` → reassemble UTC timestamp; matches the worker's existing `Timestamp` wire format. 
@@ -663,10 +682,14 @@ alarm-consumer surface unblocks A.2 fully. Outline: `aaAlarmManagedClient`, also true here). The existing `AlarmClientConsumer` skips Initialize entirely; the new `WnWrapAlarmConsumer` includes it from day one. -5. **Test reuse:** PR A.5's snapshot/ack contract tests can - stay — they don't touch the underlying COM API. Add a new - integration test against the wnwrap surface (live-AVEVA-only, - Skip-gated like the probe). +5. **Test reuse:** the snapshot/ack contract tests stayed — they don't touch + the underlying COM API. As built, the alarm tests live under + `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/` (`AlarmDispatcherTests`, + `AlarmRecordTransitionMapperTests`, `AlarmCommandHandlerTests`, + `AlarmCommandExecutorTests`, `WnWrapAlarmConsumerXmlTests`), with the + live-AVEVA-only round-trip in + `src/ZB.MOM.WW.MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs` + (Skip-gated like the probe). ### Settled API-ordering and surface knowledge @@ -752,26 +775,47 @@ AVEVA fixes the v2 method later. The v2 `AlarmAckByGUID(VBGUID, …)` throws `NotImplementedException` (COM `E_NOTIMPL`) on `wwAlarmConsumerClass` against this AVEVA build. The reference→GUID lookup that we initially planned to wire -through `AlarmAckByGUID` is therefore not viable on wnwrap; all acks -must go through `AlarmAckByName`. +through `AlarmAckByGUID` is therefore not viable on wnwrap; only the +by-name path actually succeeds. -The proto `AcknowledgeAlarmCommand` (GUID-based) and the worker's -`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain -in the codebase for the forward-compat shape, but the gateway-side -`WorkerAlarmRpcDispatcher.AcknowledgeAsync` now always routes through -`AcknowledgeAlarmByName` when the public RPC supplies a recognizable -`Provider!Group.Tag` reference. +**Routing as built (and the GUID hazard).** The gateway-side router is +`GatewayAlarmMonitor.BuildAcknowledgeCommand` (there is no +`WorkerAlarmRpcDispatcher` type). 
Routing is **conditional on the reference +shape**, not unconditional: -### 5. STA / threading — production fix needed +- A reference that `Guid.TryParse` accepts is built into + `MxCommandKind.AcknowledgeAlarm` / `AcknowledgeAlarmCommand` — the **GUID + path**, which the worker dispatches to `AlarmAckByGUID`. +- A `Provider!Group.Tag` reference (parsed by + `GatewayAlarmMonitor.TryParseAlarmReference`) is built into + `MxCommandKind.AcknowledgeAlarmByName` / `AcknowledgeAlarmByNameCommand` — the + by-name path, which is the only one that succeeds on this build. +- Anything else fails with an `alarm_full_reference` parse error before any + worker call. -The wnwrap COM is `ThreadingModel=Apartment`. The consumer's -internal `Timer` fires on threadpool threads and would block forever -on cross-apartment marshaling unless the host STA pumps Win32 -messages. The smoke test sidesteps this by setting -`pollIntervalMilliseconds=0` (Timer disabled) and driving `PollOnce` -manually from the test's STA. Production hosting will route polls -through the worker's `StaRuntime` in a follow-up — the consumer's -`PollOnce` is `public` and idempotent so the wire-up is mechanical. +The GUID arm is **still dispatched unguarded**: the proto +`AcknowledgeAlarmCommand` and the worker's +`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain in the +codebase for forward compatibility, and `BuildAcknowledgeCommand` routes a +GUID-shaped reference straight to them. On the deployed wnwrap build that path +hits the `E_NOTIMPL` `AlarmAckByGUID` and surfaces a `COMException` rather than +acknowledging. **Practical guidance:** acknowledge with the +`Provider!Group.Tag` reference (the same form the transition feed emits in +`alarm_full_reference`), not a raw GUID, until the GUID arm is either guarded or +AVEVA implements `AlarmAckByGUID`. + +### 5. 
STA / threading + +The wnwrap COM is `ThreadingModel=Apartment`, so every consumer call +(`Subscribe`, `PollOnce`, the `AcknowledgeBy*` methods) must run on the STA that +created the COM instance. As built, `WnWrapAlarmConsumer` owns **no internal +timer and takes no `pollIntervalMilliseconds` parameter** — an earlier draft +described a self-driven `Timer` that would have blocked on cross-apartment +marshaling, but that design was dropped. Instead `PollOnce()` is a `public`, +idempotent method the host drives on the worker's STA (via +`StaRuntime.InvokeAsync(() => consumer.PollOnce())`); the poll cadence lives in +the host, not the consumer. Each `PollOnce` reads `GetXmlCurrentAlarms2`, diffs +against the previous snapshot, and emits transition events. ### Capture summary @@ -790,3 +834,108 @@ Post-ack transition: kind=Clear … 10s cadence held throughout; full proto fields populated correctly; ack registered server-side without errors. + +## Current alarm path (as built) + +The sections above are a discovery record. This section summarizes the path that +actually ships, grounded in the current code. For the proto shapes see +[Contracts](./Contracts.md#alarm-rpcs-and-messages); for the server handlers see +[gRPC](./Grpc.md); for configuration see +[Gateway Configuration](./GatewayConfiguration.md#alarm-options). + +### Public RPCs and configuration + +Alarms are exposed through three **session-less** RPCs on `MxAccessGateway`: +`AcknowledgeAlarm`, `StreamAlarms`, and `QueryActiveAlarms`. No client opens a +worker session to use them. They are gated by `MxGateway:Alarms:*`: + +- `MxGateway:Alarms:Enabled` (default `false`) turns the whole subsystem on. +- `MxGateway:Alarms:SubscriptionExpression` is the canonical + `\\\Galaxy!` subscription; when empty, the monitor falls back + to `\\\Galaxy!` from `MxGateway:Alarms:DefaultArea`. + Enabled with both empty faults the monitor with a configuration diagnostic. 
+- `MxGateway:Alarms:ReconcileIntervalSeconds` (default 30, floored at 5) sets the + reconcile cadence below. + +### The always-on `GatewayAlarmMonitor` broker + +`GatewayAlarmMonitor` (`src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs`) +is registered by `AddGatewayAlarms` as a singleton, as the `IGatewayAlarmService`, +and as a hosted `BackgroundService`. When `Enabled`, it: + +1. Opens **one** gateway-managed worker session dedicated to alarms (client name + `gateway-alarm-monitor`, backend `Galaxy`), after a brief startup grace so + worker launching and orphan cleanup settle. +2. Subscribes that session to the resolved subscription expression and feeds an + in-process active-alarm cache (`Dictionary`) + from the session's transition events. +3. Fans the feed out to **any number** of `StreamAlarms` subscribers — clients + never open their own session. The session is transparently re-opened with a + 5-second backoff if the worker faults. + +### `AlarmFeedMessage` stream protocol + +`StreamAsync` (behind `StreamAlarms`) emits, in order: + +1. one `AlarmFeedMessage { active_alarm }` per currently-cached alarm matching + the optional `alarm_filter_prefix`, +2. a single `AlarmFeedMessage { snapshot_complete = true }` sentinel, +3. then one `AlarmFeedMessage { transition }` per live change. + +The subscriber is registered under the monitor lock **before** the snapshot is +taken, so no transition can slip between the snapshot and the live tail. +`QueryActiveAlarms` reuses the same cache but emits only the `active_alarm` +snapshots and completes — no sentinel, no transitions. + +### Reconcile loop + +A `PeriodicTimer` runs `ReconcileAsync` every +`max(5, ReconcileIntervalSeconds)` seconds. It pulls the worker's authoritative +active-alarm snapshot and replaces the cache, broadcasting a synthetic `Clear` +transition for any cached alarm the snapshot no longer contains and a synthetic +`Raise` for any alarm the snapshot adds. 
This catches transitions the live +poll-and-diff feed missed (e.g. across a transport blip). A failed reconcile +pass logs at Debug and keeps the current cache. + +### Subscriber backpressure + +Each subscriber gets a bounded channel of **2048** messages +(`SubscriberQueueCapacity`). When `Broadcast` cannot write to a subscriber (its +channel is full), that subscriber is **completed with an error and dropped** — +the error message tells the client to reconnect to re-snapshot. Backpressure +from one slow consumer never blocks the broker or other subscribers. + +### Snapshot state collapse + +`ActiveAlarmSnapshot.current_state` carries only three `AlarmConditionState` +values, so the four AVEVA `STATE`s collapse: `UNACK_ALM` → `Active`, +`ACK_ALM` → `ActiveAcked`, and both `UNACK_RTN` and `ACK_RTN` → `Inactive` +(`AlarmDispatcher`). A returned-to-normal alarm is reported as `Inactive` in a +snapshot even though it is still listed because it is unacknowledged. The live +`transition` feed instead reports `AlarmTransitionKind` (`Raise` / `Acknowledge` +/ `Clear`). + +### `alarm_full_reference` parse contract + +`AcknowledgeAlarm` accepts either form in `alarm_full_reference` +(`GatewayAlarmMonitor.BuildAcknowledgeCommand`): + +- a canonical GUID (`Guid.TryParse`) → GUID ack path + (`AcknowledgeAlarmCommand`), which on the deployed wnwrap build hits the + `E_NOTIMPL` `AlarmAckByGUID` — see + [`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented); +- a `Provider!Group.Tag` reference (`TryParseAlarmReference`: first `!` splits + provider from `Group.Tag`, the first `.` after the `!` splits group from tag) + → by-name ack path (`AcknowledgeAlarmByNameCommand`), the path that works; +- anything else → a parse error before any worker call. + +The transition feed emits the `Provider!Group.Tag` form in +`alarm_full_reference`, so echoing that value back into `AcknowledgeAlarm` takes +the working by-name path. 
+ +### Reserved / unused + +`AlarmTransitionKind.RETRIGGER` is defined in the proto but is **not currently +produced** — the transition mapper emits only `Raise` / `Acknowledge` / `Clear`. +It is reserved for a future "re-raise from a previously cleared condition" +distinction. diff --git a/docs/Authentication.md b/docs/Authentication.md index 3fc0074..14c3ded 100644 --- a/docs/Authentication.md +++ b/docs/Authentication.md @@ -2,11 +2,13 @@ The gateway authentication subsystem verifies inbound API key credentials against a SQLite-backed key store, hashes secrets with a configurable pepper, and records administrative and verification events to an audit trail. +The peppered-HMAC API-key pipeline — token format, parsing, secret generation and hashing, constant-time comparison, the SQLite schema, the stores, the verifier, and the migrator — lives in the shared `ZB.MOM.WW.Auth.ApiKeys` package (with abstractions in `ZB.MOM.WW.Auth.Abstractions`), of which this gateway is the donor. The gateway references the package and binds the library's `ApiKeyOptions` from its own `MxGateway:Authentication` section through `AddSqliteAuthStore`, then layers the gateway-specific pieces on top: constraint enforcement, the gRPC authorization interceptor, the admin CLI, the dashboard API Keys page, and canonical audit forwarding. Types whose code is shown below for reference are owned by the shared package unless noted; the gateway does not re-implement them. + ## Token Format API keys travel in the HTTP `Authorization` header as a bearer token shaped `mxgw__`. The `mxgw_` prefix scopes parsing to gateway tokens, the `` segment is the public identifier used for lookup, and `` is the high-entropy portion that the gateway verifies against a stored hash. 
-`ApiKeyParser` enforces the format and rejects malformed tokens before any database round-trip: +The shared library's `ApiKeyParser` enforces the format and rejects malformed tokens before any database round-trip: ```csharp public bool TryParseAuthorizationHeader(string? authorizationHeader, out ParsedApiKey? apiKey) @@ -50,7 +52,7 @@ public static string Generate() ### Peppered hashing -`ApiKeySecretHasher` (registered behind `IApiKeySecretHasher`) hashes secrets with `HMACSHA256` keyed by a server-side pepper. The pepper lives outside the database and is resolved by `IConfiguration` lookup against the configured `PepperSecretName`: +The shared library's `ApiKeySecretHasher` (behind `IApiKeySecretHasher`) hashes secrets with `HMACSHA256` keyed by a server-side pepper. The pepper lives outside the database and is resolved through an `IApiKeyPepperProvider` — the gateway wires the configuration-backed provider so the pepper comes from `IConfiguration` lookup against `MxGateway:ApiKeyPepper` (`PepperSecretName`): ```csharp public byte[] HashSecret(string secret) @@ -69,37 +71,29 @@ The pepper is intentionally not stored alongside the hash: an attacker who exfil ## Verification -`ApiKeyVerifier` (`IApiKeyVerifier`) implements the verification flow: +The shared library's `IApiKeyVerifier.VerifyAsync(authorizationHeader, cancellationToken)` owns the whole verification flow — the gateway interceptor hands it the raw `authorization` header value and never parses the token itself: -1. Parse the `Authorization` header into a `ParsedApiKey`. -2. Look up the `ApiKeyRecord` by `KeyId` through `IApiKeyStore.FindByKeyIdAsync`. -3. Reject revoked records (`RevokedUtc is not null`). +1. Parse the `Authorization` header into the key id and secret. +2. Look up the record by key id. +3. Reject revoked records. 4. Hash the presented secret with the configured pepper. 5. Compare hashes with `CryptographicOperations.FixedTimeEquals` to avoid timing oracles. -6. 
Record a `LastUsedUtc` timestamp via `MarkKeyUsedAsync` and return an `ApiKeyIdentity`. +6. Stamp `last_used_utc` and return an identity. + +`VerifyAsync` returns an `ApiKeyVerification` value with a `Succeeded` flag and a nullable `Identity`. On failure the result is discriminated so the caller can tell parse errors, missing pepper, missing or revoked keys, and secret mismatch apart for audit detail — without leaking which check failed to the client. The gateway interceptor treats any non-success uniformly as `Unauthenticated` (see [Authorization](./Authorization.md)): ```csharp -if (!CryptographicOperations.FixedTimeEquals(presentedHash, storedKey.SecretHash)) -{ - return ApiKeyVerificationResult.Fail(ApiKeyVerificationFailure.SecretMismatch); -} - -await keyStore.MarkKeyUsedAsync(storedKey.KeyId, DateTimeOffset.UtcNow, cancellationToken) +ApiKeyVerification verification = await apiKeyVerifier + .VerifyAsync(authorizationHeader ?? string.Empty, context.CancellationToken) .ConfigureAwait(false); -return ApiKeyVerificationResult.Success(new ApiKeyIdentity( - KeyId: storedKey.KeyId, - KeyPrefix: storedKey.KeyPrefix, - DisplayName: storedKey.DisplayName, - Scopes: storedKey.Scopes, - Constraints: storedKey.Constraints)); +if (!verification.Succeeded || verification.Identity is null) +{ + throw new RpcException(new Status(StatusCode.Unauthenticated, "Missing or invalid API key.")); +} ``` -`ApiKeyVerificationResult` carries either an `ApiKeyIdentity` or a discriminated `ApiKeyVerificationFailure` value. The failure enum distinguishes parse errors, missing pepper, missing or revoked keys, and secret mismatch so the calling middleware can emit precise audit detail without leaking which check failed to the client. - -`ApiKeyIdentity` exposes only non-secret fields (`KeyId`, `KeyPrefix`, -`DisplayName`, `Scopes`, and `Constraints`) and is the type downstream -authorization code consumes. 
+The shared verifier returns `ZB.MOM.WW.Auth.Abstractions.ApiKeys.ApiKeyIdentity`, which carries the persisted constraints as an opaque JSON string. The gateway's `GatewayApiKeyIdentityMapper.ToGatewayIdentity` projects it onto the gateway-local `ApiKeyIdentity` record, which exposes only non-secret fields (`KeyId`, `KeyPrefix`, `DisplayName`, `Scopes`) plus the deserialized `Constraints`, and is the type downstream authorization code consumes. ## Storage @@ -107,7 +101,7 @@ The gateway keeps API key state in a dedicated SQLite database. SQLite is suffic ### Connection factory -`AuthSqliteConnectionFactory` reads `GatewayOptions.Authentication.SqlitePath`, ensures the parent directory exists, and builds a connection string in `ReadWriteCreate` mode so first-run installations can create the file without manual provisioning. Connection pooling is enabled and the connection string carries a non-zero `DefaultTimeout`: +The shared library's `AuthSqliteConnectionFactory` (registered by `AddZbApiKeyAuth`) reads the bound `ApiKeyOptions.SqlitePath` — which the gateway populates from `MxGateway:Authentication:SqlitePath` — ensures the parent directory exists, and builds a connection string in `ReadWriteCreate` mode so first-run installations can create the file without manual provisioning. Connection pooling is enabled and the connection string carries a non-zero `DefaultTimeout`: ```csharp SqliteConnectionStringBuilder builder = new() @@ -119,21 +113,22 @@ SqliteConnectionStringBuilder builder = new() }; ``` -Every store opens its connection through `OpenConnectionAsync`, which opens the connection and then applies `PRAGMA journal_mode=WAL` and `PRAGMA busy_timeout`. WAL is a persistent database-level setting so re-applying it per connection is a cheap no-op; `busy_timeout` is per-connection state. 
Because `MarkKeyUsedAsync` runs on every authenticated request and `SqliteApiKeyAuditStore` appends on every denial, this lets concurrent readers and writers retry briefly instead of surfacing `SQLITE_BUSY` as a hard failure on the request path. +Every store opens its connection through `OpenConnectionAsync`, which opens the connection and then applies `PRAGMA journal_mode=WAL` and `PRAGMA busy_timeout`. WAL is a persistent database-level setting so re-applying it per connection is a cheap no-op; `busy_timeout` is per-connection state. Because `MarkKeyUsedAsync` runs on every authenticated request and the canonical audit writer appends to the same file, this lets concurrent readers and writers retry briefly instead of surfacing `SQLITE_BUSY` as a hard failure on the request path. ### Schema -`SqliteAuthSchema` declares table names and the current schema version as constants. Three tables are involved: +The shared library's `SqliteAuthSchema` declares the API-key table names and the current schema version as constants. Four tables live in the database file: - `api_keys` stores `key_id`, `key_prefix`, the `secret_hash` blob, `display_name`, serialized `scopes`, optional serialized `constraints`, and the `created_utc`, `last_used_utc`, and `revoked_utc` timestamps. -- `api_key_audit` is an append-only log keyed by an autoincrement `audit_id` with `key_id`, `event_type`, `remote_address`, `created_utc`, and `details` columns. +- `api_key_audit` is the shared library's append-only audit log keyed by an autoincrement `audit_id` with `key_id`, `event_type`, `remote_address`, `created_utc`, and `details` columns. The gateway overrides the library audit store (see [Audit trail](#audit-trail)), so this table is **left in place but unused** at runtime — nothing writes to it. +- `audit_event` is the gateway-owned canonical audit table written by `SqliteCanonicalAuditStore`. 
It lives in the same SQLite file (reusing the library's `AuthSqliteConnectionFactory`) and is where every gateway audit event actually lands. See [Audit trail](#audit-trail). - `schema_version` carries a single row whose `version` column is matched against `SqliteAuthSchema.CurrentVersion`. ### Read paths -`SqliteApiKeyStore` (`IApiKeyStore`) handles the two reads needed at request time: `FindByKeyIdAsync` returns any record (so revoked keys can be reported distinctly) and `FindActiveByKeyIdAsync` filters to non-revoked rows. `MarkKeyUsedAsync` updates `last_used_utc` only for non-revoked rows so a freshly revoked key cannot have its timestamp refreshed by a racing verification. +The shared library's `SqliteApiKeyStore` (`IApiKeyStore`) handles the two reads needed at request time: `FindByKeyIdAsync` returns any record (so revoked keys can be reported distinctly) and `FindActiveByKeyIdAsync` filters to non-revoked rows. `MarkKeyUsedAsync` updates `last_used_utc` only for non-revoked rows so a freshly revoked key cannot have its timestamp refreshed by a racing verification. `ApiKeyRecord` is the in-memory projection. `ApiKeyRecordReader.Read` is shared by every read path so column ordering is defined in one place: @@ -155,17 +150,21 @@ public static ApiKeyRecord Read(SqliteDataReader reader) ### Write paths -`SqliteApiKeyAdminStore` (`IApiKeyAdminStore`) implements administrative mutations: `CreateAsync` accepts an `ApiKeyCreateRequest`, `RevokeAsync` sets `revoked_utc` only when not already revoked, `RotateAsync` replaces `secret_hash`, clears `last_used_utc`, and clears `revoked_utc` so a rotated key is immediately usable, and `DeleteAsync` permanently removes a row but only when `revoked_utc IS NOT NULL` — active keys are untouched (returns false) so the revoke event lands in the audit log before the row disappears. 
+The shared library's `SqliteApiKeyAdminStore` (`IApiKeyAdminStore`) implements administrative mutations: `CreateAsync` accepts an `ApiKeyCreateRequest`, `RevokeAsync` sets `revoked_utc` only when not already revoked, `RotateAsync` replaces `secret_hash`, clears `last_used_utc`, and clears `revoked_utc` so a rotated key is immediately usable, and `DeleteAsync` permanently removes a row but only when `revoked_utc IS NOT NULL` — active keys are untouched (returns false) so the revoke event lands in the audit log before the row disappears. Because `RotateAsync` clears `revoked_utc`, rotating a previously revoked key reactivates it. The dashboard API Keys page therefore offers the Rotate (and Revoke) actions only for keys whose status is `Active`; revoked keys instead show a Delete action that calls `DeleteAsync`, so an operator can permanently remove a revoked row without ever risking un-revocation as a side effect of a rotation. ### Audit trail -`SqliteApiKeyAuditStore` (`IApiKeyAuditStore`) appends `ApiKeyAuditEntry` values to the `api_key_audit` table and stamps each row with a UTC timestamp inside the store rather than trusting the caller. `ListRecentAsync` returns the most recent rows ordered by `audit_id` descending and projects them into `ApiKeyAuditRecord`. Rows are kept even after the referenced key is revoked because the audit history is the durable record of administrative action; the `key_id` column is nullable to accommodate non-key-scoped events such as `init-db`. +All gateway audit flows through a single canonical `AuditEvent` written to the gateway-owned `audit_event` table, not the shared library's `api_key_audit` table. The gateway adopts `ZB.MOM.WW.Audit` and **overrides** the library's `IApiKeyAuditStore` registration with `CanonicalForwardingApiKeyAuditStore`. 
That adapter receives each library-emitted `ApiKeyAuditEntry` — including the library-internal admin-command verbs (`create-key`, `revoke-key`, `rotate-key`, `init-db`) the gateway cannot edit — canonicalizes it onto an `AuditEvent`, and forwards it through `IAuditWriter` (`CanonicalAuditWriter`), which persists to `audit_event` via `SqliteCanonicalAuditStore`. + +Because the adapter is registered after `AddZbApiKeyAuth`, it is the `IApiKeyAuditStore` that the admin commands resolve and that the dashboard "recent audit" view reads through `IApiKeyAuditStore.ListRecentAsync`. The library's own `SqliteApiKeyAuditStore` and its `api_key_audit` table are therefore unused at runtime — the override is the only writer. Audit rows are kept even after the referenced key is revoked because the audit history is the durable record of administrative action; non-key-scoped events such as `init-db` carry no key id. + +This canonical-forwarding wiring lives under `src/ZB.MOM.WW.MxGateway.Server/Security/Audit/`; the audit store override and writer are gateway types, while the entry shape and admin verbs originate in the shared library. ## Migration -Schema bring-up is centralised behind `IAuthStoreMigrator`. `SqliteAuthStoreMigrator` executes the migration inside a single transaction so a partial failure leaves the database untouched, refuses to start when the on-disk schema version is newer than the binary supports, and idempotently creates the v1 schema: +Schema bring-up for the API-key tables is owned by the shared library's `SqliteAuthStoreMigrator`, wired by `AddZbApiKeyAuth` along with its migration hosted service. 
It executes the migration inside a single transaction so a partial failure leaves the database untouched, refuses to start when the on-disk schema version is newer than the binary supports, and idempotently creates the schema: ```csharp if (existingVersion > SqliteAuthSchema.CurrentVersion) @@ -179,13 +178,11 @@ await ApplyVersionOneAsync(connection, transaction, cancellationToken).Configure await transaction.CommitAsync(cancellationToken).ConfigureAwait(false); ``` -`AuthStoreMigrationHostedService` runs the migrator at startup, but only when API-key authentication is enabled and `RunMigrationsOnStartup` is true. Operators who manage schema out-of-band can disable the hosted run and use the admin CLI's `init-db` command instead. - -`AuthStoreMigrationException` is a sealed `InvalidOperationException` so it can be caught precisely without swallowing unrelated failures. +The library's migration hosted service runs the migrator at startup. Operators who manage schema out-of-band can use the admin CLI's `init-db` command instead. ## Admin CLI -`ApiKeyAdminCommandLineParser.Parse` recognises a leading `apikey` argument and dispatches to one of the subcommands declared by `ApiKeyAdminCommandKind`. Each parsed invocation produces an `ApiKeyAdminCommand` (or an `ApiKeyAdminParseResult` carrying an error). `ApiKeyAdminCliRunner` then executes the command, runs the migrator first, calls the relevant store method, appends an audit row, and writes either text or JSON output via `ApiKeyAdminOutput`. The returned `ApiKeyAdminListedKey` projection deliberately omits the `secret_hash` so listing a database does not surface hash material. +`ApiKeyAdminCommandLineParser.Parse` (a gateway type) recognises a leading `apikey` argument and dispatches to one of the subcommands declared by `ApiKeyAdminCommandKind`. Each parsed invocation produces an `ApiKeyAdminCommand` (or an `ApiKeyAdminParseResult` carrying an error). 
The parser validates requested `--scopes` against `GatewayScopes.All` (see [Authorization](./Authorization.md#scope-catalog)) so a non-canonical scope string cannot be persisted on a key. `ApiKeyAdminCliRunner` then drives the shared library's `ApiKeyAdminCommands` — which the gateway registers over the already-wired stores, pepper provider, and migrator — to execute the command, and writes either text or JSON output via `ApiKeyAdminOutput`. The returned `ApiKeyAdminListedKey` projection deliberately omits the `secret_hash` so listing a database does not surface hash material. The supported subcommands match `ApiKeyAdminCommandKind` exactly: @@ -201,7 +198,7 @@ Examples: ```bash mxgateway apikey init-db -mxgateway apikey create-key --key-id ops.alice --display-name "Alice (ops)" --scopes read,write +mxgateway apikey create-key --key-id ops.alice --display-name "Alice (ops)" --scopes invoke:read,invoke:write mxgateway apikey create-key --key-id area1.reader --display-name "Area 1 reader" --scopes invoke:read,metadata:read --read-subtree "Area1/*" --browse-subtree "Area1/*" mxgateway apikey list-keys --json mxgateway apikey revoke-key --key-id ops.alice @@ -226,7 +223,7 @@ confirmation dialog and emits its own audit event ## Scope Serialization -Scopes are persisted as a single TEXT column rather than a join table because the set is small, never queried by membership at the database level, and changes atomically with the owning row. `ApiKeyScopeSerializer.Serialize` writes a JSON array sorted with `StringComparer.Ordinal` so equivalent scope sets produce byte-identical column values, which makes audit diffing and database comparisons deterministic: +Scopes are persisted as a single TEXT column rather than a join table because the set is small, never queried by membership at the database level, and changes atomically with the owning row. 
The shared library's `ApiKeyScopeSerializer.Serialize` writes a JSON array sorted with `StringComparer.Ordinal` so equivalent scope sets produce byte-identical column values, which makes audit diffing and database comparisons deterministic: ```csharp public static string Serialize(IReadOnlySet scopes) @@ -249,29 +246,50 @@ public static IReadOnlySet Deserialize(string value) `Deserialize` tolerates an empty column by returning an empty set so older rows or hand-edited records do not crash the verifier. +## Dashboard Cookie and Hub Token + +The API-key model above guards the gRPC surface. Interactive dashboard requests use a separate LDAP-backed cookie scheme (see [Gateway Dashboard Design](./GatewayDashboardDesign.md)). Two timeouts and a few configuration knobs govern that cookie: + +- **Cookie idle timeout — 8 hours.** `DashboardServiceCollectionExtensions` applies the shared `ZbCookieDefaults.Apply` hardened cookie defaults (HttpOnly, `SameSite=Strict`, secure policy, sliding expiration) but overrides the library's 30-minute default with an 8-hour idle timeout, so an active operator is not signed out mid-shift. The expiration is sliding, so each authenticated request resets the window. +- **Hub bearer token — 30 minutes.** SignalR hub connections cannot always carry the HttpOnly cookie (the client SignalR JS may resolve the cookie scope to loopback), so the dashboard mints a short-lived data-protected bearer at `/hubs/token` via `HubTokenService`. The token lifetime is 30 minutes; the hubs accept either it or the cookie. +- **`MxGateway:Dashboard:CookieName`** overrides the cookie name (default `MxGatewayDashboard`, from `DashboardAuthenticationDefaults.CookieName`). Two gateway instances on the same host but different ports share a cookie scope — host+path, not port — so giving each a distinct name keeps their dashboard sessions from clobbering each other. Changing it signs out existing sessions on next deploy. 
+- **`MxGateway:Dashboard:RequireHttpsCookie`** (default `true`) restricts the cookie to HTTPS via `CookieSecurePolicy.Always`. Set it to `false` for plain-HTTP dev so the cookie uses `SameAsRequest`; leaving it `true` while serving the dashboard over plain HTTP from a non-localhost host breaks login, because browsers drop Secure cookies set over HTTP. + +The dashboard issues claims through the shared `ZB.MOM.WW.Auth.AspNetCore.ZbClaimTypes` (e.g. `ZbClaimTypes.Username` = `zb:username`, `ZbClaimTypes.Name` = `ClaimTypes.Name` so `Identity.Name` resolves, `ZbClaimTypes.Role` = `ClaimTypes.Role` so `IsInRole`/`[Authorize(Roles=...)]` work). Cookie hardening defaults come from `ZbCookieDefaults`. Both live in the shared Auth packages, not the gateway. + ## Registration -`AuthStoreServiceCollectionExtensions.AddSqliteAuthStore` wires every service in this subsystem as a singleton and registers the migration hosted service: +`AuthStoreServiceCollectionExtensions.AddSqliteAuthStore` is the gateway entry point. It does not register the parser, hasher, verifier, stores, or migrator directly — those come from the shared package. Instead it delegates to the package's `AddZbApiKeyAuth` and then layers the gateway-specific audit and CLI services: ```csharp -public static IServiceCollection AddSqliteAuthStore(this IServiceCollection services) +public static IServiceCollection AddSqliteAuthStore( + this IServiceCollection services, + IConfiguration configuration) { - services.AddSingleton(); - services.AddSingleton(); - services.AddSingleton(); + // Register the shared API-key provider: binds ApiKeyOptions from MxGateway:Authentication, + // wires up the SQLite stores, the configuration-backed pepper provider, the verifier, the + // migrator and the migration hosted service. + services.AddZbApiKeyAuth(effectiveConfig, AuthenticationSectionPath); + + // Gateway-owned canonical audit (ZB.MOM.WW.Audit) in the same SQLite file. 
+ services.AddSingleton(sp => + new SqliteCanonicalAuditStore(sp.GetRequiredService())); + services.AddSingleton(sp => new CanonicalAuditWriter(/* ... */)); + + // Override the library's IApiKeyAuditStore so every audit lands in audit_event. + services.AddSingleton(); + + // The shared admin command set, driven by the gateway CLI and dashboard. + services.AddSingleton(sp => new ApiKeyAdminCommands(/* ... */)); services.AddSingleton(); - services.AddSingleton(); - services.AddSingleton(); - services.AddSingleton(); - services.AddSingleton(); - services.AddSingleton(); - services.AddHostedService(); return services; } ``` -Singletons are safe because each operation opens its own short-lived `SqliteConnection` through the factory; there is no shared mutable state inside the services. +The gateway pins its own API-key contract — token prefix `mxgw` and the pepper key `MxGateway:ApiKeyPepper` — by layering those as fallback defaults under the supplied configuration before calling `AddZbApiKeyAuth`, because `ApiKeyOptions` is an init-only record that must be bound with those values present rather than mutated afterward. Explicit configuration still wins. `AddZbApiKeyAuth` binds `ApiKeyOptions` from the `MxGateway:Authentication` section and registers the connection factory, stores, pepper provider, verifier, migrator, and migration hosted service. + +The audit-store override is registered *after* `AddZbApiKeyAuth` so it replaces the library's `TryAddSingleton` registration. The shared admin command set is not auto-registered by `AddZbApiKeyAuth`, so the gateway registers `ApiKeyAdminCommands` itself over the wired stores; the CLI and dashboard drive it. Library services are singletons and safe because each operation opens its own short-lived `SqliteConnection` through the factory. 
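
The fallback-default layering described above can be sketched with the standard `Microsoft.Extensions.Configuration` builder, where later sources override earlier ones. This is a hedged illustration, not the gateway's actual code: the `FallbackDefaultsSketch` class is hypothetical, and the two option keys under `MxGateway:Authentication` (`TokenPrefix`, `PepperSecretName`) are assumed names that may not match the real `ApiKeyOptions` properties.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

// Hypothetical helper; requires the Microsoft.Extensions.Configuration package.
public static class FallbackDefaultsSketch
{
    public static IConfiguration WithFallbackDefaults(IConfiguration supplied)
    {
        // Assumed key names: the gateway-pinned defaults sit at the bottom of the stack.
        var defaults = new Dictionary<string, string?>
        {
            ["MxGateway:Authentication:TokenPrefix"] = "mxgw",
            ["MxGateway:Authentication:PepperSecretName"] = "MxGateway:ApiKeyPepper",
        };

        // Sources added later win, so explicit configuration overrides the defaults.
        return new ConfigurationBuilder()
            .AddInMemoryCollection(defaults)
            .AddConfiguration(supplied)
            .Build();
    }
}
```

Building a new configuration this way leaves the supplied one untouched, which fits an options type that has to be bound once with all values present rather than mutated afterward.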
## Related Documentation diff --git a/docs/Authorization.md b/docs/Authorization.md index c693317..a638c4b 100644 --- a/docs/Authorization.md +++ b/docs/Authorization.md @@ -58,32 +58,34 @@ if (options.Value.Authentication.Mode == AuthenticationMode.Disabled) } string? authorizationHeader = context.RequestHeaders.GetValue("authorization"); -ApiKeyVerificationResult verificationResult = await apiKeyVerifier - .VerifyAsync(authorizationHeader, context.CancellationToken) +ApiKeyVerification verification = await apiKeyVerifier + .VerifyAsync(authorizationHeader ?? string.Empty, context.CancellationToken) .ConfigureAwait(false); -if (!verificationResult.Succeeded || verificationResult.Identity is null) +if (!verification.Succeeded || verification.Identity is null) { throw new RpcException(new Status( StatusCode.Unauthenticated, "Missing or invalid API key.")); } +ApiKeyIdentity identity = GatewayApiKeyIdentityMapper.ToGatewayIdentity(verification.Identity); + string requiredScope = scopeResolver.ResolveRequiredScope(request); -if (!verificationResult.Identity.Scopes.Contains(requiredScope)) +if (!identity.Scopes.Contains(requiredScope)) { throw new RpcException(new Status( StatusCode.PermissionDenied, $"API key is missing required scope '{requiredScope}'.")); } -return verificationResult.Identity; +return identity; ``` The flow is: 1. If `GatewayOptions.Authentication.Mode` is `AuthenticationMode.Disabled`, the helper returns `null` immediately. No identity is pushed onto the accessor and the continuation runs without scope enforcement. This matches the `AuthenticationMode` enum, which only defines `ApiKey` and `Disabled`. -2. Otherwise, the `authorization` request header is read directly off `ServerCallContext.RequestHeaders` and handed to `IApiKeyVerifier.VerifyAsync`. A failed verification or a missing identity throws `RpcException` with `StatusCode.Unauthenticated`. +2. 
Otherwise, the `authorization` request header is read directly off `ServerCallContext.RequestHeaders` and handed to the shared `IApiKeyVerifier.VerifyAsync`, which returns an `ApiKeyVerification`. A failed verification or a missing identity throws `RpcException` with `StatusCode.Unauthenticated`. The shared library's identity is then projected onto the gateway-local `ApiKeyIdentity` by `GatewayApiKeyIdentityMapper.ToGatewayIdentity` before scope checks run. 3. `GatewayGrpcScopeResolver.ResolveRequiredScope(request)` produces the scope string. If the identity's `Scopes` set does not contain it, the helper throws `RpcException` with `StatusCode.PermissionDenied` and embeds the missing scope name in `Status.Detail` so callers can diagnose the failure. 4. On success, the verified `ApiKeyIdentity` is returned and pushed onto `IGatewayRequestIdentityAccessor` for the lifetime of the call. @@ -107,7 +109,8 @@ public string ResolveRequiredScope(object request) TestConnectionRequest or GetLastDeployTimeRequest or DiscoverHierarchyRequest or - WatchDeployEventsRequest => GatewayScopes.MetadataRead, + WatchDeployEventsRequest or + BrowseChildrenRequest => GatewayScopes.MetadataRead, _ => GatewayScopes.Admin }; } @@ -194,7 +197,7 @@ the gateway fails closed. Non-bulk constraint failures return gRPC `PermissionDenied`. Bulk read commands preserve input order and return a failed `SubscribeResult` for each denied item while still forwarding allowed items to the worker. Every denial -adds an `api_key_audit` entry with the key id, command kind, target, and +records a canonical audit event with the key id, command kind, target, and blocking constraint; secured values and raw credentials are never logged. ## Scope Catalog @@ -209,10 +212,10 @@ blocking constraint; secured values and raw credentials are never logged. 
| `InvokeRead` | `invoke:read` | `MxCommandRequest` for read-style command kinds (`Register`, `AddItem`, `Advise`, `ReadBulk`, and any kind not otherwise mapped) | | `InvokeWrite` | `invoke:write` | `AcknowledgeAlarmRequest`, `MxCommandKind.Write`, `MxCommandKind.Write2`, `MxCommandKind.WriteBulk`, `MxCommandKind.Write2Bulk` | | `InvokeSecure` | `invoke:secure` | `MxCommandKind.WriteSecured`, `MxCommandKind.WriteSecured2`, `MxCommandKind.WriteSecuredBulk`, `MxCommandKind.WriteSecured2Bulk`, `MxCommandKind.AuthenticateUser` | -| `MetadataRead` | `metadata:read` | `MxCommandKind.ArchestraUserToId`, `MxCommandKind.GetSessionState`, `MxCommandKind.GetWorkerInfo`, `GalaxyRepository.TestConnection`, `GalaxyRepository.GetLastDeployTime`, `GalaxyRepository.DiscoverHierarchy`, `GalaxyRepository.WatchDeployEvents` | -| `Admin` | `admin` | `MxCommandKind.ShutdownWorker`, the default for any unrecognized request type, and the dashboard authorization policy | +| `MetadataRead` | `metadata:read` | `MxCommandKind.ArchestraUserToId`, `MxCommandKind.GetSessionState`, `MxCommandKind.GetWorkerInfo`, `GalaxyRepository.TestConnection`, `GalaxyRepository.GetLastDeployTime`, `GalaxyRepository.DiscoverHierarchy`, `GalaxyRepository.WatchDeployEvents`, `GalaxyRepository.BrowseChildren` | +| `Admin` | `admin` | `MxCommandKind.ShutdownWorker` and the default for any unrecognized request type | -The `Admin` constant is also referenced by `DashboardAuthenticator` and `DashboardAuthorizationHandler` so that the dashboard and the gRPC layer agree on what "admin" means. +The gRPC `admin` scope here is **distinct** from the dashboard's `Administrator` role. The scope gates API-key access to admin-level RPCs; the dashboard role gates interactive cookie-authenticated dashboard pages. `DashboardAuthorizationHandler` and the dashboard policies authorize against the `Administrator`/`Viewer` roles (see [Gateway Dashboard Design](./GatewayDashboardDesign.md)) and do not reference `GatewayScopes.Admin`. 
The only dashboard code that touches `GatewayScopes` is the API Keys page, which validates requested scopes against `GatewayScopes.All` when creating a key — the same validation the CLI applies. ## Identity Access for Downstream Layers @@ -263,14 +266,24 @@ public static IServiceCollection AddGatewayGrpcAuthorization(this IServiceCollec { services.AddSingleton(); services.AddSingleton(); + services.AddSingleton(); services.AddSingleton(); + services + .AddOptions() + .Configure((grpcOptions, configuration) => + { + ProtocolOptions protocolOptions = new(); + configuration.GetSection("MxGateway:Protocol").Bind(protocolOptions); + grpcOptions.MaxReceiveMessageSize = protocolOptions.MaxGrpcMessageBytes; + grpcOptions.MaxSendMessageSize = protocolOptions.MaxGrpcMessageBytes; + }); services.AddGrpc(options => options.Interceptors.Add()); return services; } ``` -Singleton lifetimes are appropriate because none of the three classes hold per-request state on instance fields; the request-scoped value lives inside the `AsyncLocal` on `GatewayRequestIdentityAccessor`. `GatewayApplication` calls `builder.Services.AddGatewayGrpcAuthorization()` during startup, and the call also performs `AddGrpc`, so the gateway never registers gRPC without the interceptor attached. +Four singletons are registered: the scope resolver, the identity accessor, the constraint enforcer (`IConstraintEnforcer` → `ConstraintEnforcer`, which service bodies call to apply API-key constraints), and the interceptor itself. The same method also binds gRPC's `GrpcServiceOptions.MaxReceiveMessageSize` and `MaxSendMessageSize` from `MxGateway:Protocol:MaxGrpcMessageBytes` so the message-size limits are configured in the one place that wires the authorization pipeline. Singleton lifetimes are appropriate because none of these classes hold per-request state on instance fields; the request-scoped value lives inside the `AsyncLocal` on `GatewayRequestIdentityAccessor`. 
`GatewayApplication` calls `builder.Services.AddGatewayGrpcAuthorization()` during startup, and the call also performs `AddGrpc`, so the gateway never registers gRPC without the interceptor attached. ## Related Documentation diff --git a/docs/ClientPackaging.md b/docs/ClientPackaging.md index 71d341c..53084ad 100644 --- a/docs/ClientPackaging.md +++ b/docs/ClientPackaging.md @@ -48,8 +48,8 @@ dotnet build src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csp Build and test from the repository root: ```powershell -dotnet build clients/dotnet/ZB.MOM.WW.MxGateway.Client.sln -dotnet test clients/dotnet/ZB.MOM.WW.MxGateway.Client.sln --no-build +dotnet build clients/dotnet/ZB.MOM.WW.MxGateway.Client.slnx +dotnet test clients/dotnet/ZB.MOM.WW.MxGateway.Client.slnx --no-build ``` Create local package artifacts: @@ -173,10 +173,14 @@ Install, test, and build a wheel from `clients/python`: Push-Location clients/python python -m pip install -e ".[dev]" python -m pytest -python -m pip wheel . --no-deps --wheel-dir "$env:TEMP\mxgateway-python-wheel" +python -m build --outdir "$env:TEMP\mxgateway-python-dist" Pop-Location ``` +`python -m build` (sdist plus wheel) is the canonical build method — it is what +`scripts/pack-clients.ps1` runs for the Python package. Use +`python -m pip wheel . --no-deps` only for a quick wheel-only build. + Run the CLI from the editable install or with `python -m`: ```powershell @@ -190,9 +194,10 @@ Pop-Location ## Java -The Java workspace uses Gradle, Java 21, `mxgateway-client`, and -`mxgateway-cli`. The Gradle protobuf plugin writes generated Java protobuf and -gRPC sources under `clients/java/src/main/generated`. +The Java workspace uses Gradle, Java 21, and the subprojects +`zb-mom-ww-mxgateway-client` and `zb-mom-ww-mxgateway-cli`. The Gradle protobuf +plugin writes generated Java protobuf and gRPC sources under +`clients/java/src/main/generated`. 
Regenerate Java bindings: @@ -228,6 +233,28 @@ gradle :zb-mom-ww-mxgateway-cli:run --args="smoke --endpoint mxgateway.example.l Pop-Location ``` +## Packing All Clients + +`scripts/pack-clients.ps1` runs every client's native packaging command and +drops the artifacts into one directory so a release does not depend on running +each per-language command by hand. It packs the .NET NuGet packages +(`ZB.MOM.WW.MxGateway.Contracts` and `ZB.MOM.WW.MxGateway.Client`), the Python +sdist and wheel (`python -m build`), the Rust `.crate` (`cargo package`), and +the Java jars plus generated POM (`gradle assemble` and the publication tasks). +Go has no artifact to pack — it is released by git-tagging, so the script prints +the `scripts/tag-go-module.ps1` command and skips it. + +```powershell +pwsh scripts/pack-clients.ps1 +pwsh scripts/pack-clients.ps1 -Languages dotnet,python +``` + +Artifacts land in `-OutputDir` (default `dist/`). Each language runs its +regression tests first unless `-SkipTests` is set. With `-Publish`, every +package is pushed to the internal Gitea feed; this requires the `GITEA_USERNAME` +and `GITEA_TOKEN` environment variables and the script refuses to publish if +either is missing. 
+ ## Integration Tests Client integration checks are opt-in because they need a live gateway and a diff --git a/docs/ClientProtoGeneration.md b/docs/ClientProtoGeneration.md index c790f3b..db882dd 100644 --- a/docs/ClientProtoGeneration.md +++ b/docs/ClientProtoGeneration.md @@ -98,7 +98,7 @@ Use these commands to regenerate language-specific client bindings: | Go | `Push-Location clients/go; ./generate-proto.ps1; Pop-Location` | | Rust | `Push-Location clients/rust; cargo check --workspace; Pop-Location` | | Python | `Push-Location clients/python; ./generate-proto.ps1; Pop-Location` | -| Java | `Push-Location clients/java; gradle :mxgateway-client:generateProto; Pop-Location` | +| Java | `Push-Location clients/java; gradle :zb-mom-ww-mxgateway-client:generateProto; Pop-Location` | .NET generation currently runs through the contracts project: @@ -152,10 +152,11 @@ clients/python/generate-proto.ps1 ``` Java clients use the Gradle protobuf plugin from `clients/java`. The -`mxgateway-client` project reads the shared `.proto` files and writes generated -Java protobuf and gRPC sources under `clients/java/src/main/generated`, matching -the manifest output path. Handwritten client and CLI code stays in the -`mxgateway-client` and `mxgateway-cli` project source trees. +`zb-mom-ww-mxgateway-client` project reads the shared `.proto` files and writes +generated Java protobuf and gRPC sources under +`clients/java/src/main/generated`, matching the manifest output path. +Handwritten client and CLI code stays in the `zb-mom-ww-mxgateway-client` and +`zb-mom-ww-mxgateway-cli` project source trees. Run the Java workspace checks from `clients/java`: diff --git a/docs/Contracts.md b/docs/Contracts.md index 0d19ee6..12bcbe4 100644 --- a/docs/Contracts.md +++ b/docs/Contracts.md @@ -77,6 +77,44 @@ only and does not share types with `mxaccess_gateway.proto`. See [Galaxy Repository Browse](./GalaxyRepository.md) for the RPC catalog and behavior. 
+### Alarm RPCs and messages + +`mxaccess_gateway.proto` also defines three session-less alarm RPCs served by +the gateway's always-on central alarm monitor (no client worker session is +involved): + +- `AcknowledgeAlarm(AcknowledgeAlarmRequest) returns (AcknowledgeAlarmReply)` — + acknowledges one alarm by its `alarm_full_reference`, with an operator + `comment` and `operator_user`. +- `StreamAlarms(StreamAlarmsRequest) returns (stream AlarmFeedMessage)` — the + central alarm feed. +- `QueryActiveAlarms(QueryActiveAlarmsRequest) returns (stream + ActiveAlarmSnapshot)` — a point-in-time snapshot of the currently-active + alarm set, streamed so callers can begin processing without buffering the + whole set. `alarm_filter_prefix` (when non-empty) narrows the snapshot to + alarms whose `alarm_full_reference` starts with the prefix. + +`StreamAlarms` uses a three-phase protocol carried by the `AlarmFeedMessage` +`oneof payload`: the stream opens with one `active_alarm` (`ActiveAlarmSnapshot`) +per currently-active alarm, then a single `snapshot_complete = true` sentinel, +then a `transition` (`OnAlarmTransitionEvent`) for every subsequent change. +`active_alarm` carries the collapsed current state (`AlarmConditionState`: +`Active` / `ActiveAcked` / `Inactive`); `transition` carries the +`AlarmTransitionKind` (`Raise` / `Acknowledge` / `Clear` / `Retrigger`). + +`AcknowledgeAlarmRequest` and `AcknowledgeAlarmReply` both **reserve** field 1 +and the name `session_id`: acknowledgement was made session-less and the field +was retired (the reservation prevents reuse of the tag). The authoritative +ack-outcome field on `AcknowledgeAlarmReply` is `hresult` (the worker's native +by-name/by-GUID ack return code, 0 = success), alongside `protocol_status`. 
The +structured `MxStatusProxy status` field is intentionally left **unset** on every +reply because the worker ack path produces only the int32 return code; clients +must read `hresult` and must not depend on `status` being populated. + +For the broker architecture and the parse contract for `alarm_full_reference` +(GUID vs `Provider!Group.Tag`) see +[Alarm Client Discovery](./AlarmClientDiscovery.md). + Generated C# output is written to `src/ZB.MOM.WW.MxGateway.Contracts/Generated/`. Do not hand-edit generated files. diff --git a/docs/DashboardInterfaceDesign.md b/docs/DashboardInterfaceDesign.md index c0dab3e..5cee53c 100644 --- a/docs/DashboardInterfaceDesign.md +++ b/docs/DashboardInterfaceDesign.md @@ -8,8 +8,12 @@ operations-focused projects. The dashboard is an operational interface, not a landing page. It prioritizes fast scanning, low visual noise, and stable layouts while live data changes. -The design uses Bootstrap for common behavior and a small local stylesheet for -project identity, spacing, and status presentation. +The layout chrome, status presentation, and design tokens come from the shared +`ZB.MOM.WW.Theme` kit (the technical-light design system). Bootstrap supplies +common widget behavior, and a small local stylesheet (`wwwroot/css/site.css`) +wires the dashboard's own class names and Bootstrap widgets onto the kit's +tokens. The local sheet contains no hard-coded colors; every color, font, and +surface resolves to a theme token. Use this style for applications where users repeatedly check system state, compare rows, inspect details, and diagnose faults. Avoid promotional layouts, @@ -25,7 +29,7 @@ The interface uses a quiet, work-focused visual system: - White cards and sections carry the actual operational content. - Borders define structure more often than shadows. - Accent color is reserved for metric values and important numeric signals. -- Bootstrap status badges provide state color without custom status art. 
+- The kit's `StatusPill` provides state color without custom status art. - Tables remain compact and responsive so long identifiers and timestamps stay readable. @@ -34,93 +38,113 @@ and dense enough for repeated use. ## Layout Structure -Every page follows the same structure: +The application chassis is the kit's `ThemeShell` component (a vertical side +rail plus a content area), not a horizontal top navbar. `MainLayout.razor` is a +thin wrapper that delegates the rail chassis — brand block, hamburger toggle, +responsive collapse — to `` and supplies only the navigation items +and a rail footer: -1. A top navigation bar with the product or service name on the left. -2. A full-width `container-fluid` content area. -3. A page header with the page title, short context text, and optional status - badge. -4. Metric cards when a page has top-level numeric state. -5. Bordered content sections for tables, details, faults, or empty states. - -The shell does not use a sidebar. A horizontal navigation bar is enough for the -current page count and keeps the content width available for tables. - -```html -
- -
- -
-
+```razor + + + + @Body + ``` +Within the content area, every page follows the same structure: + +1. A page header with the page title, short context text, and optional status + pill. +2. Metric cards when a page has top-level numeric state. +3. Bordered content sections for tables, details, faults, or empty states. + +The login page uses `LoginLayout.razor` instead — a minimal layout with no rail +and no brand block, because the page renders its own centered ``. + ## Color Tokens -Use a small token set and let Bootstrap provide the rest. The current dashboard -uses these local tokens: - -```css -:root { - --mxgw-surface: #f7f8fa; - --mxgw-border: #d8dee6; - --mxgw-ink-muted: #667085; - --mxgw-accent: #146c64; -} -``` +Colors come from the `ZB.MOM.WW.Theme` kit's `theme.css`. The local +`site.css` defines no `:root` custom properties of its own; it references kit +tokens by name. The dashboard does not define a `--mxgw-*` token set. | Token | Purpose | |-------|---------| -| `--mxgw-surface` | Page background behind all content. | -| `--mxgw-border` | Borders on cards, tables, sections, and empty states. | -| `--mxgw-ink-muted` | Secondary labels, details, and empty-state text. | -| `--mxgw-accent` | Metric values and important numeric summaries. | +| `var(--card)` | Background of cards, sections, and data tables. | +| `var(--rule)`, `var(--rule-strong)` | Hairline and stronger borders. | +| `var(--ink)`, `var(--ink-soft)`, `var(--ink-faint)` | Primary, secondary, and muted text. | +| `var(--accent)`, `var(--accent-deep)` | Metric values, links, primary buttons, focus rings. | +| `var(--mono)` | Monospace family for values, identifiers, and code. | +| `var(--ok)`/`--ok-bg`, `var(--warn)`/`--warn-bg`, `var(--bad)`/`--bad-bg`, `var(--idle)`/`--idle-bg` | State colors for chips, alerts, and alarm-state labels. | -Keep the palette small. Add new colors only when they encode state or improve -readability. 
Prefer Bootstrap badge classes for states such as ready, closing, -closed, and faulted. +Keep the palette small and let the kit own it. Add new colors only when they +encode state or improve readability, and resolve them to a kit token rather than +a literal hex value. Use the kit's `StatusPill` for states such as ready, +closing, idle, and faulted. ## Typography Typography stays compact and consistent: -- Page headings use `1.35rem`, weight `650`, and normal letter spacing. -- Section headings use the same size as page headings when they introduce a - table or details group. -- Metric labels use uppercase text at `.78rem` and weight `650`. -- Metric values use `1.7rem`, weight `700`, and the accent color. +- Page headings (`.dashboard-page-header h1`) use `1.15rem`, weight `600`, and a + slight letter spacing. +- Section headings (`.section-heading h2`) use a small uppercase eyebrow: + `.74rem`, weight `600`, muted ink. +- Metric labels (`.agg-label`) use uppercase text at `.68rem` and weight `600`, + muted ink. +- Metric values (`.agg-value`) use `1.5rem`, weight `600`, the monospace family, + tabular numerics, and primary ink (`var(--ink)`). - Body and table text inherit Bootstrap defaults for readability. Do not scale text with viewport width. Long values use `overflow-wrap: -anywhere` so session IDs, paths, and fault messages do not break the layout. +break-word` (numbers and date tokens stay whole, wrapping only at spaces); a few +free-form fields such as `.agg-sub` use `overflow-wrap: anywhere` so session +IDs, paths, and fault messages do not break the layout. ## Spacing And Shape The dashboard uses modest spacing: -- Page content has `1.25rem` padding on desktop and `.75rem` on small screens. +- The kit owns the rail and content padding; the local small-screen rule sets + `.page` padding to `.85rem`. - Metric grids use `.75rem` gaps. -- Content sections start with a top border and `1rem` top padding. 
-- Cards and empty states use Bootstrap's small radius shape, `.375rem`. -- Metric cards have no shadow. +- Content sections (`.dashboard-section`) and metric cards (`.agg-card`) are + fully bordered cards: `var(--card)` fill, a `1px solid var(--rule)` hairline, + and `0.9rem` padding for sections. +- Cards, sections, and modals use an `8px` radius; smaller widgets such as the + empty state use `6px`. +- Metric cards have no shadow (`box-shadow: none`); borders define structure. This keeps information grouped without turning each section into a decorative panel. Use cards for repeated metric summaries, login forms, and individual -items. Use unframed sections with a top border for page-level groups. +items. Use bordered sections for page-level groups. ## Navigation -Navigation is a Bootstrap responsive navbar. It includes: +Navigation lives in the `ThemeShell` side rail. It is built from the kit's +`NavRailSection` and `NavRailItem` components: a single home item plus eight +page items grouped into three labeled sections. -- Brand text for the service name. -- Short page labels: `Overview`, `Sessions`, `Workers`, `Events`, `Settings`. -- Active route styling through `NavLink`. -- A right-aligned sign-out button when authentication is enabled. +| Section | Items | +|---------|-------| +| (home) | `Dashboard` (route `/`, `NavLinkMatch.All`) | +| Runtime | `Sessions`, `Workers`, `Events`, `Alarms` | +| Galaxy | `Repository`, `Browse` | +| Admin | `API Keys`, `Settings` | -Keep navigation labels short. Operational users should be able to predict what -each page contains without reading explanatory copy. +Section expand/collapse state is owned by the kit (a `
` element plus +`ThemeScripts`); the layout does not run JS interop for it. The rail footer +shows the signed-in user name and a sign-out form (or a sign-in link when +unauthenticated). + +Keep navigation labels short and group related pages. Operational users should +be able to predict what each page contains without reading explanatory copy. ## Page Headers @@ -128,42 +152,43 @@ Each page starts with a `dashboard-page-header`: - The title is the primary anchor. - A single secondary line gives timestamp, row count, or configuration context. -- A status badge appears on the right when the page has an overall state. +- A status pill appears on the right when the page has an overall state. On narrow screens, the header stacks vertically. This prevents long context -text or status badges from overlapping the title. +text or status pills from overlapping the title. ```html
-

Overview

+

Dashboard

Generated 2026-04-27 17:30:00
- Healthy +
``` ## Metric Cards -Metric cards summarize numeric state at the top of overview and diagnostic -pages. They use Bootstrap cards with a local `metric-card` class: +Metric cards summarize numeric state at the top of the home and diagnostic +pages. The `MetricCard` component renders an `.agg-card` with label, value, and +optional sub-line: -- Label: uppercase, muted, compact. -- Value: large enough to scan, accent colored, wraps safely. -- Detail: optional muted text for version, rate context, or explanatory state. +- Label (`.agg-label`): uppercase eyebrow, muted, compact. +- Value (`.agg-value`): large monospace number in primary ink, wraps safely. +- Sub (`.agg-sub`): optional muted text for version, rate context, or state. -Use auto-fit CSS grid tracks so the cards fill available width without custom -breakpoints: +Cards lay out in a `.metric-grid`. Use auto-fill CSS grid tracks so they fill +available width without custom breakpoints: ```css .metric-grid { display: grid; gap: .75rem; - grid-template-columns: repeat(auto-fit, minmax(12rem, 1fr)); + grid-template-columns: repeat(auto-fill, minmax(11rem, 1fr)); } .metric-grid.compact { - grid-template-columns: repeat(auto-fit, minmax(10rem, 1fr)); + grid-template-columns: repeat(auto-fill, minmax(10rem, 1fr)); } ``` @@ -188,15 +213,22 @@ entire rows clickable when a single identifier link is clearer. ## Status Badges -Status uses Bootstrap badge classes with a small mapping layer: +`StatusBadge` is a thin adapter over the kit's `StatusPill`. Call sites pass the +literal domain state text (``); the adapter maps +that text to one of the kit's four `StatusState` values, and `StatusPill` +renders the chip. There are no Bootstrap `text-bg-*` classes in this layer. 
-| State | Badge class | -|-------|-------------| -| `Ready`, `Healthy` | `text-bg-success` | -| `Creating`, `StartingWorker`, `WaitingForPipe`, `InitializingWorker`, `Closing` | `text-bg-info` | -| `Closed` | `text-bg-secondary` | -| `Faulted` | `text-bg-danger` | -| Unknown state | `text-bg-light text-dark border` | +| Domain state text | `StatusState` | +|-------------------|---------------| +| `Ready`, `Healthy`, `Active` | `Ok` | +| `Creating`, `StartingWorker`, `WaitingForPipe`, `InitializingWorker`, `Closing`, `Stale`, `Degraded` | `Warn` | +| `Faulted`, `Unavailable` | `Bad` | +| Any other text (including `Closed`, `Revoked`, `Unknown`) | `Idle` | + +Note the mapping changes from earlier revisions: `Closed` now falls through to +`Idle` (rather than its own neutral badge), and `Active`, `Stale`, `Degraded`, +and `Unavailable` are explicit cases. The kit owns the chip rendering; only this +domain text-to-state vocabulary lives in the app. Keep status text literal. Operators benefit from seeing the same state names that appear in logs and APIs. @@ -230,8 +262,8 @@ The dashboard uses one small-screen breakpoint: ```css @media (max-width: 700px) { - .dashboard-content { - padding: .75rem; + .page { + padding: .85rem; } .dashboard-page-header { @@ -245,6 +277,9 @@ The dashboard uses one small-screen breakpoint: } ``` +A second breakpoint (`max-width: 960px`) collapses the Browse two-pane layout +(`.browse-layout`) to a single column. + Do not hide important columns by default. Use horizontal table scrolling for dense operational data, and reserve column hiding for data that is clearly duplicative. @@ -277,13 +312,14 @@ markup. Use this checklist when applying the design to another project: -- Define four local tokens: surface, border, muted ink, and accent. -- Use a Bootstrap top navbar with short route labels. -- Keep page content inside a full-width fluid container. 
+- Take colors, fonts, and surfaces from the `ZB.MOM.WW.Theme` kit tokens; do + not define a local color token set. +- Use the kit's `ThemeShell` side rail with `NavRailSection`/`NavRailItem` and + short route labels grouped into sections. - Start every page with the same header structure. -- Put primary numeric state in `metric-grid` cards. +- Put primary numeric state in `metric-grid` / `agg-card` cards. - Put detailed runtime state in compact responsive tables. -- Use status badges mapped from real domain states. +- Use `StatusBadge` (kit `StatusPill`) mapped from real domain states. - Use dashed bordered empty states for loading and no-data cases. - Use top-bordered sections for page groups instead of nested cards. - Centralize formatting and redaction outside Razor markup. diff --git a/docs/DesignDecisions.md b/docs/DesignDecisions.md index ad8005e..34cbad0 100644 --- a/docs/DesignDecisions.md +++ b/docs/DesignDecisions.md @@ -357,10 +357,16 @@ Allowed UI stack: Do not use MudBlazor or other Blazor UI component libraries for v1. -Dashboard access should require API-key-backed dashboard authentication with -`admin` scope when enabled. For local development, anonymous localhost access -is enabled by default through `Dashboard:AllowAnonymousLocalhost`; the bypass is -limited to loopback requests. +Dashboard authentication is LDAP-backed, deliberately separate from the gRPC +API-key model: dashboard users are people who already have directory accounts, +so reusing LDAP avoids minting and distributing API keys for human operators. +`DashboardAuthenticator` binds the supplied credentials against `MxGateway:Ldap` +through the shared `ILdapAuthService`, then maps the user's LDAP groups to the +`Administrator` or `Viewer` dashboard role via `MxGateway:Dashboard:GroupToRole`. +A login whose groups match no role is denied. 
For local development, anonymous +localhost access is enabled by default through +`MxGateway:Dashboard:AllowAnonymousLocalhost`; the bypass is limited to loopback +requests. ## Lazy Browse Is Wire-Only diff --git a/docs/Diagnostics.md b/docs/Diagnostics.md index ecaab5a..352d8c0 100644 --- a/docs/Diagnostics.md +++ b/docs/Diagnostics.md @@ -205,13 +205,38 @@ app.MapGatewayEndpoints(); The order matters: putting the logging scope first ensures that authentication failures, authorization denials, and endpoint exceptions all run inside the request scope, so failure logs still carry the correlation id and session id headers that the caller sent. The `ClientIdentity` field is redacted before logging, so reading the `authorization` header at this stage does not leak the bearer secret into authentication failure logs. +### Telemetry redaction seam + +The per-request middleware redacts the `authorization` header before it reaches a scope, but log events produced outside the request scope (or with credential-bearing properties attached by other enrichers) need the same protection. `GatewayLogRedactorSeam` adapts the static `GatewayLogRedactor` to the shared `ILogRedactor` seam so the telemetry `RedactionEnricher` masks identity material on **every** log event: + +```csharp +builder.Services.AddSingleton(); +``` + +The seam scans a fixed set of identity-bearing property names (`ClientIdentity`, `authorization`, `Authorization`) and rewrites any string value through `GatewayLogRedactor.RedactClientIdentity`. Because it runs in the enricher rather than at the call site, it catches credential material that a component logged without going through `GatewayLogScope`. 
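
The seam's scan-and-rewrite behaviour can be sketched roughly as follows. This is an illustration in Python rather than the gateway's C# code. `redact_client_identity` is a hypothetical stand-in for `GatewayLogRedactor.RedactClientIdentity`, whose actual masking rules are not shown here, and `redact_event` is an invented name for the enricher step.

```python
# Property names the seam treats as identity-bearing (taken from the text above).
IDENTITY_PROPERTY_NAMES = ("ClientIdentity", "authorization", "Authorization")


def redact_client_identity(value: str) -> str:
    """Hypothetical masking rule: replace the whole value.

    The real redactor may keep a prefix or key id; that detail is an assumption.
    """
    return "[redacted]"


def redact_event(properties: dict) -> dict:
    """Return a copy of a log event's properties with identity values masked."""
    redacted = dict(properties)  # leave the caller's dictionary untouched
    for name in IDENTITY_PROPERTY_NAMES:
        value = redacted.get(name)
        # Only string values are rewritten; other types pass through unchanged.
        if isinstance(value, str):
            redacted[name] = redact_client_identity(value)
    return redacted
```

Because the rewrite keys on property names rather than call sites, a component that attaches `authorization` to an event directly is still covered, which is the point of running it in the enricher.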
+ +## Readiness Health Check + +`AuthStoreHealthCheck` is a readiness probe registered under the health-check name `auth-store` and tagged for the readiness set (`ZbHealthTags.Ready`): + +```csharp +builder.Services.AddHealthChecks() + .AddTypeActivatedCheck( + "auth-store", + failureStatus: null, + tags: new[] { ZbHealthTags.Ready }); +``` + +The gateway authenticates every gRPC call against the SQLite auth store, so its reachability gates readiness. The check opens a connection via `AuthSqliteConnectionFactory` and runs `SELECT 1;`: success reports `Healthy`, any exception (other than the probe being cancelled) reports `Unhealthy` with the underlying error attached. It is surfaced on the readiness endpoint exposed by the shared telemetry wiring (the live/ready split is what the `wonder-app-vd03` deployment exposes as `/health/live` with the dashboard disabled). + ## Consumers `GatewayLoggerExtensions.BeginGatewayScope` is consumed by `GatewayRequestLoggingMiddlewareExtensions` to attach the per-request scope. Component-level call sites build narrower `GatewayLogScope` instances (for example, with a known `WorkerProcessId` after a worker launch) and push a nested scope on top of the request scope. -`GatewayLogRedactor` is consumed in three places: +`GatewayLogRedactor` is consumed in four places: - `GatewayLogScope.ToDictionary` redacts `ClientIdentity` whenever a scope is materialized. +- `GatewayLogRedactorSeam.Redact` applies the same redaction to identity-bearing properties on every telemetry log event (see above). - `DashboardRedactor.Redact` delegates to `RedactClientIdentity` for any value containing the `mxgw_` marker, then falls back to a marker-keyword check for fields like `password` or `token`. This keeps dashboard renders aligned with log redaction. - `ZB.MOM.WW.MxGateway.Tests/Diagnostics/GatewayLogRedactorTests.cs` covers each redaction branch, including the assertion that `WriteSecured` values stay redacted even when `valueLoggingEnabled` is true. 
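
The `auth-store` readiness probe described earlier in this document reduces to "open a connection, run `SELECT 1;`, report the outcome". A minimal Python sketch of that shape, using the standard `sqlite3` module, might look like the following. The function name and the tuple return type are assumptions for illustration; the gateway's check is a C# `IHealthCheck` built on `AuthSqliteConnectionFactory`.

```python
import sqlite3
from typing import Optional, Tuple


def check_auth_store(sqlite_path: str) -> Tuple[str, Optional[str]]:
    """Probe a SQLite store and return (status, error_text).

    Any ordinary exception maps to "Unhealthy". Interrupt-style cancellation
    derives from BaseException in Python, so it is deliberately not caught.
    """
    try:
        connection = sqlite3.connect(sqlite_path)
        try:
            # The cheapest query that proves the store answers.
            connection.execute("SELECT 1;").fetchone()
        finally:
            connection.close()
    except Exception as error:
        return ("Unhealthy", str(error))
    return ("Healthy", None)
```

A caller would treat `"Unhealthy"` as "not ready" rather than "not alive", matching the live/ready split.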
diff --git a/docs/GalaxyRepository.md b/docs/GalaxyRepository.md index 558a062..03e2237 100644 --- a/docs/GalaxyRepository.md +++ b/docs/GalaxyRepository.md @@ -81,11 +81,16 @@ computed against the *filtered* descendant set, a branch that contains no matching objects gets `false`, not `true`. **Paging.** Default page size is 500; the server caps any requested size at -5000. Page tokens encode `(cache_sequence, parent_id, filter_signature, -offset)`. A token from a different cache generation or a different filter set -returns `InvalidArgument`. The error messages reference "DiscoverHierarchy -page_token" because `BrowseChildren` reuses the same encoding and validation -path — if you see that wording in a `BrowseChildren` context it is expected. +5000. Page tokens are the colon-delimited triple `sequence:filterSignature:offset` +— the same encoding `DiscoverHierarchy` uses. The parent selector is not a +separate token field: it is folded into `filterSignature` along with the rest of +the filter set (the projector's `ComputeFilterSignature` takes the parent id), +so a page token implicitly pins the parent. A token from a different cache +generation (`sequence` mismatch) or a different filter set (`filterSignature` +mismatch) returns `InvalidArgument`. The error messages reference +"DiscoverHierarchy page_token" because `BrowseChildren` reuses the same encoding +and validation path — if you see that wording in a `BrowseChildren` context it is +expected. **Errors.** @@ -133,6 +138,15 @@ When SQL is unreachable, the cache retains the previous data and flips `Status` to `Stale` (or `Unavailable` if no data was ever loaded). A `SqlException` never bubbles out as the client-facing error. +The cache also auto-degrades a `Healthy` entry to `Stale` purely on age: when the +last successful refresh is older than five minutes, the projected status is +reported as `Stale` even though the data hasn't otherwise changed. 
This guards +against a silently wedged refresh loop — if ticks stop succeeding, browse +results visibly go `Stale` rather than continuing to look fresh. (`Unknown` and +`Unavailable` entries are returned as-is and not aged.) The first refresh runs at +service startup, before the interval loop begins, so the cache is populated as +soon as practical rather than waiting one full interval. + ### First-load behavior If a client calls `DiscoverHierarchy` before the background service has @@ -156,7 +170,10 @@ working across that gap, the cache persists its dataset to disk: - On the **first** refresh after startup, before any SQL runs, the cache reloads that file. The restored data is served with `Stale` status — it is last-known data, not live — so clients can browse immediately even - when the Galaxy database is unreachable. + when the Galaxy database is unreachable. The restore also publishes a deploy + event through `IGalaxyDeployNotifier`, so a `WatchDeployEvents` subscriber that + attaches before the first live query still sees the restored snapshot's deploy + state. - The first live query then reconciles: if it observes the **same** `time_of_last_deploy` the snapshot was saved at, the entry is promoted to `Healthy` with no heavy re-query (the snapshot is provably current); if it @@ -349,6 +366,25 @@ Component breakdown: override per object. `HierarchySql` still matches the OtOpcUa original; `AttributesSql` does not — it additionally enumerates built-in primitive attributes (see [Built-in vs configured attributes](#built-in-vs-configured-attributes)). + + `HierarchySql` restricts the result to a fixed allow-list of object categories + via `WHERE td.category_id IN (1, 3, 4, 10, 11, 13, 17, 24, 26)` — the same set + the dashboard's `ResolveCategoryName` map names. Categories outside this set + (for example, internal framework objects) are never browsed. 
The mapping: + + | `category_id` | Name | + |---|---| + | 1 | WinPlatform | + | 3 | AppEngine | + | 4 | InTouchViewApp | + | 10 | UserDefined | + | 11 | FieldReference | + | 13 | Area | + | 17 | DIObject | + | 24 | DDESuiteLinkClient | + | 26 | OPCClient | + + Any other category id renders as `Category {id}` in the dashboard. - `GalaxyHierarchyCache` (`src/ZB.MOM.WW.MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs`) holds the most recent immutable `GalaxyHierarchyCacheEntry` (materialized objects + @@ -384,7 +420,7 @@ Bound to `MxGateway:Galaxy` via `GalaxyRepositoryOptions`. | Option | Default | Description | |--------|---------|-------------| | `MxGateway:Galaxy:ConnectionString` | `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` | SQL Server connection string for the Galaxy Repository. Integrated Security against `localhost` is the dev default; production deployments should override this through the standard double-underscore environment variable form, e.g. `MxGateway__Galaxy__ConnectionString`. | -| `MxGateway:Galaxy:CommandTimeoutSeconds` | `60` | Per-command SQL timeout. Applies to all three RPCs. | +| `MxGateway:Galaxy:CommandTimeoutSeconds` | `60` | Per-command SQL timeout applied to every SQL command the repository runs (the connectivity probe, the deploy-time poll, and the hierarchy and attribute queries), which back all five Galaxy RPCs. | | `MxGateway:Galaxy:PersistSnapshot` | `true` | Persists each successful browse dataset to disk and reloads it at startup. See [On-disk snapshot](#on-disk-snapshot). | | `MxGateway:Galaxy:SnapshotCachePath` | `C:\ProgramData\MxGateway\galaxy-snapshot.json` | File path for the persisted browse snapshot. Ignored when `PersistSnapshot` is `false`. | @@ -400,7 +436,8 @@ unparsed connection string text. 
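
The paging contract described for `BrowseChildren` earlier in this document (a `sequence:filterSignature:offset` token that is rejected on a generation or filter mismatch) can be sketched in Python as below. This is not the server's implementation: the names are invented, `InvalidArgument` stands in for the gRPC status of the same name, and the sketch assumes `sequence` and `offset` are decimal integers and that the filter signature contains no colon. Treating a non-positive requested size as "use the default" is also an assumption.

```python
from dataclasses import dataclass

DEFAULT_PAGE_SIZE = 500   # default stated in the paging notes
MAX_PAGE_SIZE = 5000      # server-side cap stated in the paging notes


class InvalidArgument(ValueError):
    """Stand-in for the gRPC InvalidArgument status."""


@dataclass(frozen=True)
class PageToken:
    sequence: int
    filter_signature: str
    offset: int


def effective_page_size(requested: int) -> int:
    """Apply the default and the cap to a requested page size."""
    if requested <= 0:  # assumption: unset or invalid means "default"
        return DEFAULT_PAGE_SIZE
    return min(requested, MAX_PAGE_SIZE)


def encode_page_token(token: PageToken) -> str:
    return f"{token.sequence}:{token.filter_signature}:{token.offset}"


def decode_page_token(raw: str, current_sequence: int,
                      current_filter_signature: str) -> PageToken:
    """Parse a token and reject it if it belongs to another cache
    generation or another filter set."""
    parts = raw.split(":")
    if len(parts) != 3:
        raise InvalidArgument("page_token is malformed")
    sequence_text, signature, offset_text = parts
    try:
        sequence, offset = int(sequence_text), int(offset_text)
    except ValueError:
        raise InvalidArgument("page_token is malformed") from None
    if offset < 0:
        raise InvalidArgument("page_token offset is negative")
    if sequence != current_sequence:
        raise InvalidArgument("page_token is from a different cache generation")
    if signature != current_filter_signature:
        raise InvalidArgument("page_token is from a different filter set")
    return PageToken(sequence, signature, offset)
```

Since the parent id is folded into the filter signature, the signature comparison is also what pins a token to its parent.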
## Authorization -All four Galaxy RPCs (including `WatchDeployEvents`) require the +All five Galaxy RPCs (`TestConnection`, `GetLastDeployTime`, +`DiscoverHierarchy`, `WatchDeployEvents`, and `BrowseChildren`) require the `metadata:read` API-key scope. Browse is read-only metadata, equivalent in privilege to `MxCommandKind.GetSessionState` or `MxCommandKind.GetWorkerInfo`. The mapping lives in `GatewayGrpcScopeResolver`; see diff --git a/docs/GatewayConfiguration.md b/docs/GatewayConfiguration.md index 40c60bd..d97ac80 100644 --- a/docs/GatewayConfiguration.md +++ b/docs/GatewayConfiguration.md @@ -18,6 +18,19 @@ paths, timeouts, queue sizes, enum values, or protocol values are invalid. "PepperSecretName": "MxGateway:ApiKeyPepper", "RunMigrationsOnStartup": true }, + "Ldap": { + "Enabled": true, + "Server": "localhost", + "Port": 3893, + "Transport": "None", + "AllowInsecure": true, + "SearchBase": "dc=zb,dc=local", + "ServiceAccountDn": "cn=serviceaccount,dc=zb,dc=local", + "ServiceAccountPassword": "serviceaccount123", + "UserNameAttribute": "cn", + "DisplayNameAttribute": "cn", + "GroupAttribute": "memberOf" + }, "Worker": { "ExecutablePath": "src\\ZB.MOM.WW.MxGateway.Worker\\bin\\x86\\Release\\ZB.MOM.WW.MxGateway.Worker.exe", "WorkingDirectory": null, @@ -93,6 +106,39 @@ Environment variables use the normal .NET double-underscore form. For example, When `Mode` is `ApiKey`, `SqlitePath` and `PepperSecretName` must be present. `SqlitePath` must be a valid filesystem path. +## Ldap Options + +The `MxGateway:Ldap` section configures the dashboard's LDAP login (the gRPC API +uses API keys, not LDAP — see [Authentication](./Authentication.md)). The same +section is bound twice: the runtime bind/search is performed by the shared +`ZB.MOM.WW.Auth.Ldap` provider wired up by `AddZbLdapAuth`, while the gateway's +own `LdapOptions` shadow exists only for startup validation, the redacted +effective-config display, and the dev/default values. 
The two stay +field-compatible so the one section binds onto both. The gateway ships +dev-friendly defaults (plaintext localhost); the shared provider's own defaults +are secure-by-default. + +| Option | Default | Description | +|--------|---------|-------------| +| `MxGateway:Ldap:Enabled` | `true` | Enables LDAP-backed dashboard login. When `false`, the rest of the section is not validated and LDAP login is not wired up. | +| `MxGateway:Ldap:Server` | `localhost` | LDAP server host. Required when `Enabled`. | +| `MxGateway:Ldap:Port` | `3893` | LDAP server port. Must be a valid port (1–65535). | +| `MxGateway:Ldap:Transport` | `None` | Transport/TLS mode. One of `None` (plaintext), `StartTls` (upgrade a plaintext connection to TLS), or `Ldaps` (TLS from connect). Replaces the former boolean `UseTls`. | +| `MxGateway:Ldap:AllowInsecure` | `true` | Allows plaintext LDAP connections. Must be `true` when `Transport` is `None`; setting `Transport=None` with `AllowInsecure=false` fails validation. | +| `MxGateway:Ldap:SearchBase` | `dc=zb,dc=local` | Search base distinguished name for user lookup. Required when `Enabled`. | +| `MxGateway:Ldap:ServiceAccountDn` | `cn=serviceaccount,dc=zb,dc=local` | Service account DN used to bind before searching for the logging-in user. Required when `Enabled`. Redacted in the effective-config display. | +| `MxGateway:Ldap:ServiceAccountPassword` | `serviceaccount123` | Service account bind password. Required when `Enabled`. Never logged; redacted in the effective-config display. | +| `MxGateway:Ldap:UserNameAttribute` | `cn` | Attribute matched against the login user name (the dev GLAuth directory keys users by `cn`, not `uid`). Required when `Enabled`. | +| `MxGateway:Ldap:DisplayNameAttribute` | `cn` | Attribute read for the user's display name. Required when `Enabled`. | +| `MxGateway:Ldap:GroupAttribute` | `memberOf` | Attribute read for the user's group membership. 
The resulting group names are mapped to dashboard roles by `MxGateway:Dashboard:GroupToRole`. Required when `Enabled`. | + +When `Enabled` is `true`, `Server`, `SearchBase`, `ServiceAccountDn`, +`ServiceAccountPassword`, `UserNameAttribute`, `DisplayNameAttribute`, and +`GroupAttribute` must be non-blank, `Port` must be valid, and `AllowInsecure` +must be `true` whenever `Transport` is `None`. Group-to-role mapping lives in the +dashboard section; see `MxGateway:Dashboard:GroupToRole` below and +[glauth.md](../glauth.md). + ## Worker Options | Option | Default | Description | diff --git a/docs/GatewayDashboardDesign.md b/docs/GatewayDashboardDesign.md index bea9be8..91aea2e 100644 --- a/docs/GatewayDashboardDesign.md +++ b/docs/GatewayDashboardDesign.md @@ -9,11 +9,13 @@ statistics in real time. ## Technology Choice -Decision: Blazor Server with Bootstrap CSS/JS. +Decision: Blazor Server with the shared `ZB.MOM.WW.Theme` kit layered over +Bootstrap CSS/JS. Allowed UI stack: - ASP.NET Core Blazor Server, +- the `ZB.MOM.WW.Theme` kit (layout chassis, status components, design tokens), - Bootstrap CSS, - Bootstrap JavaScript, - small local CSS for layout and status styling, @@ -30,7 +32,35 @@ Not allowed for v1: Rationale: Blazor Server keeps the dashboard in the gateway process, avoids a separate frontend build, and gives real-time UI updates through the Blazor -SignalR circuit. Bootstrap is sufficient for a basic dashboard. +SignalR circuit. The `ZB.MOM.WW.Theme` kit gives the dashboard the same chassis, +status vocabulary, and visual identity as the other ZB.MOM.WW operations UIs +without re-implementing layout and status styling per project. + +## Theme Kit + +The dashboard depends on the shared `ZB.MOM.WW.Theme` NuGet package +(version `0.2.0`, referenced in `ZB.MOM.WW.MxGateway.Server.csproj`). 
The kit is +a Razor Class Library that ships the technical-light design system: a layout +chassis, a small set of UI components, the design tokens, and the head/script +asset wiring. The dashboard takes its chrome and status presentation from the +kit and adds only its own pages and view CSS on top. + +Components and assets used: + +| Kit member | Role in the dashboard | +|---|---| +| `<ThemeShell>` | The application chassis — vertical side rail (brand, hamburger, responsive collapse) plus a content area. `MainLayout.razor` wraps it and supplies `Nav`, `RailFooter`, and `ChildContent` slots. | +| `` / `` | Grouped navigation items in the rail. Section expand/collapse persistence is owned by the kit (`
` + `ThemeScripts`); the app runs no JS interop for it. | +| `<LoginCard>` | The centered login card on `Login.razor`. Renders a native static `
` so the submit reaches the minimal-API endpoint rather than a Blazor event. | +| `<StatusPill>` | The status chip. `StatusBadge.razor` is a thin adapter that maps domain state text to one of four `StatusState` values (`Ok`, `Warn`, `Bad`, `Idle`) and renders this pill. | +| `` | Loaded in `App.razor`'s `<head>`; injects the kit's `theme.css` and related head assets. | +| `<ThemeScripts>` | Loaded at the end of `App.razor`'s `<body>`; supplies the rail's interactive behavior. | +| Token system | `theme.css` defines all design tokens (`var(--card)`, `var(--ink)`, `var(--accent)`, `var(--mono)`, the state colors, etc.). The local `site.css` references these tokens and defines no hard-coded colors. | + +The dependency on this kit is the reason the layout shell, navigation, status +chips, and tokens differ from a stock Bootstrap dashboard. See +[Dashboard Interface Design](./DashboardInterfaceDesign.md) for how the kit's +tokens and components shape the visual language. ## Hosting Model @@ -67,8 +97,8 @@ Endpoint layout: The `/galaxy` page surfaces the Galaxy Repository browse summary (deployed object hierarchy size, last deploy timestamp, attribute totals, template usage, and connectivity sync info). The summary is fed by -`GalaxySummaryCache`, which is refreshed off the request path by -`GalaxySummaryRefreshService` on the +`GalaxyHierarchyCache`, which is refreshed off the request path by +`GalaxyHierarchyRefreshService` on the `MxGateway:Galaxy:DashboardRefreshIntervalSeconds` cadence so the dashboard never blocks on SQL. See [Galaxy Repository Browse](./GalaxyRepository.md) for the underlying gRPC service. @@ -79,24 +109,31 @@ the underlying gRPC service.
 ZB.MOM.WW.MxGateway.Server Dashboard/ Components/ - App.razor + App.razor (loads / ) Routes.razor DashboardPageBase.cs DashboardDisplay.cs Layout/ - DashboardLayout.razor + MainLayout.razor (ThemeShell side-rail chassis) + LoginLayout.razor (minimal, no rail; hosts <LoginCard>) Pages/ DashboardHome.razor + Login.razor SessionsPage.razor SessionDetailsPage.razor WorkersPage.razor EventsPage.razor + AlarmsPage.razor + GalaxyPage.razor + BrowsePage.razor ApiKeysPage.razor SettingsPage.razor Shared/ MetricCard.razor - StatusBadge.razor + StatusBadge.razor (adapter over kit <StatusPill>) FaultList.razor + BrowseTreeNodeView.razor + ConfirmDialog.razor DashboardSnapshotService.cs DashboardAuthorizationHandler.cs DashboardAuthenticator.cs @@ -244,10 +281,14 @@ Show: - admin Close session / Kill worker controls (Admin role only). The Sessions list, the Workers list, and this details page all render the same -admin controls when the signed-in principal carries the `Admin` role; viewers +admin controls when the signed-in principal carries the `Administrator` role; viewers and the localhost-anonymous bypass see no action affordances and the server re-checks the role on every invocation. Every destructive admin action is -gated by a confirmation dialog before it reaches `ISessionManager`. +gated by the shared `ConfirmDialog` component before it reaches +`ISessionManager`. `ConfirmDialog` is a reusable Bootstrap modal (title, +message, confirm/cancel buttons, and a busy state that disables both buttons +while the action runs); each page binds its open state and confirm/cancel +callbacks. The API keys page uses the same component. - **Close session** routes through `ISessionManager.CloseSessionAsync`: the worker is asked to shut down gracefully and is killed only as a fallback if @@ -289,7 +330,8 @@ it opt-in and redacted. ### Browse page `/browse` lets an operator explore the Galaxy tag hierarchy and watch -live values.
The tree is built in-process by the static +`DashboardBrowseTreeBuilder` (in `DashboardBrowseModel.cs`) from `IGalaxyHierarchyCache.Current` — the same cache the Galaxy page reads — so a render costs no gRPC call and no SQL round-trip. Each node shows its child objects and, when expanded, its attributes with attribute name, data type @@ -307,7 +349,10 @@ diagnostic session/worker views. ### Alarms page `/alarms` lists the alarms the gateway's central alarm monitor -currently holds as Active or ActiveAcked, refreshed every three seconds. It +currently holds as Active or ActiveAcked. The page injects +`IDashboardLiveDataService` and drives a `PeriodicTimer` poll loop that calls +`QueryAlarmsAsync` every three seconds, rather than subscribing to the snapshot +hub or holding a `CurrentAlarms` reference directly. It defaults to showing unacknowledged `Active` alarms; filters add acknowledged alarms and narrow by area, severity range, and a reference/source/description text search. Cleared alarms are not retained — the gateway holds no @@ -358,7 +403,7 @@ for what each constraint means and how it is enforced on the gRPC path. Create, Rotate, Revoke, and Delete controls render only when the signed-in user is authorized. `DashboardApiKeyAuthorization.CanManage` requires an -authenticated principal carrying the `Admin` role claim (resolved at login +authenticated principal carrying the `Administrator` role claim (resolved at login from the user's LDAP groups via `MxGateway:Dashboard:GroupToRole`). A `Viewer` role can read the table but sees no action controls, and an anonymous localhost session shows the same read-only view. @@ -385,10 +430,11 @@ Create and Rotate return the assembled `mxgw__` token **once**, in a one-time banner. It is never shown again, so the operator must copy it immediately. This mirrors the `apikey create-key` / `rotate-key` CLI. 
 -Every management action appends an `api_key_audit` entry -(`dashboard-create-key`, `dashboard-rotate-key`, `dashboard-revoke-key`, -`dashboard-delete-key`) with the key id and the caller's remote address. -Secrets and pepper values are never logged. +Every management action writes an entry to the canonical `audit_event` store +through `IAuditWriter` (`dashboard-create-key`, `dashboard-rotate-key`, +`dashboard-revoke-key`, `dashboard-delete-key`) with the key id, the caller's +remote address, and a correlation id. Secrets and pepper values are never +logged. ### Settings page @@ -408,23 +454,33 @@ Do not show API key secrets or pepper values. Dashboard authentication is LDAP-backed, distinct from the API-key model used on the gRPC API. Users sign in with directory credentials; the gateway maps -their LDAP groups to one of two dashboard roles (`Admin` or `Viewer`) and +their LDAP groups to one of two dashboard roles (`Administrator` or `Viewer`) and issues a cookie carrying those role claims. Implemented behavior: -- a static `/login` HTML form posts username/password to the gateway; -- `DashboardAuthenticator` binds against `MxGateway:Ldap` (service-account bind, - user search, candidate bind) using `Novell.Directory.Ldap.NETStandard`; -- the user's `memberOf` (or short CN) is matched against - `MxGateway:Dashboard:GroupToRole`; the resolved role(s) are emitted as - `ClaimTypes.Role` claims, alongside the per-group `mxgateway:ldap_group` - claims; -- a successful login signs in the `MxGateway.Dashboard` cookie scheme - (`MxGatewayDashboard`, HttpOnly, SameSite=Strict, Secure); +- `GET /login` is served by the `[AllowAnonymous]` Blazor `Login.razor` + component (under `LoginLayout`), which renders the shared kit's `<LoginCard>`. + `LoginCard` emits a native static `` + (username, password, hidden returnUrl) plus an ``.
A native + form submit is not a Blazor event, so it reaches the minimal-API `POST /login` + endpoint regardless of the app's InteractiveServer render mode; +- `DashboardAuthenticator` delegates bind/search to the shared + `ZB.MOM.WW.Auth.Ldap` provider, registered by `AddZbLdapAuth(configuration, + "MxGateway:Ldap")`. The provider performs a service-account bind, user search, + then candidate bind, and fails closed; +- the user's group membership (stripped to its first RDN by the provider) is + matched against `MxGateway:Dashboard:GroupToRole`; the resolved role(s) are + emitted as `ClaimTypes.Role` claims, alongside the per-group + `mxgateway:ldap_group` claims; +- a successful login signs in the `MxGateway.Dashboard` cookie scheme. The + cookie defaults to the name `MxGatewayDashboard` (HttpOnly, SameSite=Strict, + Secure) and can be overridden via `MxGateway:Dashboard:CookieName`; - a user with no matching group cannot sign in — the login screen returns the - generic credential-rejected message; -- antiforgery tokens guard the login and logout POSTs. + generic credential-rejected message via `/login?error=…`; +- antiforgery tokens guard the login and logout POSTs. `POST /logout` (and a + `GET /logout` convenience redirect) sign the cookie out and return to + `/login`. Three authorization policies are registered: @@ -443,8 +499,8 @@ Viewer role. ### Hub bearer flow -SignalR connections cannot reuse the `__Host-` cookie when the JS client -upgrades to WebSocket — the cookie's `SameSite=Strict; Path=/` keeps it from +SignalR connections cannot reuse the `MxGatewayDashboard` cookie when the JS +client upgrades to WebSocket — the cookie's `SameSite=Strict; Path=/` keeps it from being forwarded by the browser's WebSocket layer in some edge cases. 
The dashboard mints short-lived bearer tokens for the connection: @@ -480,8 +536,10 @@ Effective configuration: "RecentFaultLimit": 100, "RecentSessionLimit": 200, "ShowTagValues": false, + "CookieName": null, + "RequireHttpsCookie": true, "GroupToRole": { - "GwAdmin": "Admin", + "GwAdmin": "Administrator", "GwReader": "Viewer" } } @@ -489,6 +547,15 @@ Effective configuration: } ``` +Two cookie keys tune the auth cookie: + +- `CookieName` overrides the cookie name. Null or blank keeps the canonical + default `MxGatewayDashboard`, so a misconfiguration cannot leave the cookie + unnamed. +- `RequireHttpsCookie` (default `true`) sets the cookie `SecurePolicy` to + `Always`. Set it to `false` for dev HTTP deployments, which relaxes the policy + to `SameAsRequest`. + See [Gateway Configuration](./GatewayConfiguration.md#dashboard-options) for the full option table and the policies/hubs that derive from these values. @@ -504,17 +571,31 @@ the full option table and the policies/hubs that derive from these values. ## Styling -The dashboard serves Bootstrap 5.3.3 assets from -`src/ZB.MOM.WW.MxGateway.Server/wwwroot/lib/bootstrap/` and local layout/status styling -from `src/ZB.MOM.WW.MxGateway.Server/wwwroot/css/dashboard.css`. +Styling is layered. From base to top: + +1. Bootstrap 5.3.3 assets served from + `src/ZB.MOM.WW.MxGateway.Server/wwwroot/lib/bootstrap/`. +2. The `ZB.MOM.WW.Theme` kit's `theme.css` (the technical-light design system), + which owns the design tokens and the kit component styles. `App.razor` loads + it through the kit's `` component, and pairs it with + `` at the end of `` for the rail's interactive behavior. +3. The local view stylesheet + `src/ZB.MOM.WW.MxGateway.Server/wwwroot/css/site.css`, which wires the + dashboard's own class names and Bootstrap widgets onto the kit tokens. It + defines no hard-coded colors. 
+ +The minimal `/denied` page is rendered outside the Blazor circuit, so it loads +the kit CSS directly from the static-web-asset path +(`/_content/ZB.MOM.WW.Theme/css/theme.css` and `…/layout.css`) plus Bootstrap +and `site.css`. Recommended visual language: - compact tables, -- status badges, +- the kit `StatusPill` for state, - metric cards, - Bootstrap alerts for faults, -- restrained colors, +- restrained colors drawn from the kit tokens, - no decorative hero sections, - no charting dependency for v1. @@ -530,7 +611,7 @@ Dashboard unit/component tests should cover: - snapshot projection, - dashboard auth authorization decisions, -- login API-key validation behavior, +- login LDAP bind and group-to-role mapping behavior, - pages render with empty state, - pages render with active sessions, - pages render with faulted sessions, @@ -557,7 +638,8 @@ Integration tests should verify: The first dashboard slice implements: 1. Blazor Server hosting in `ZB.MOM.WW.MxGateway.Server`. -2. local Bootstrap static assets. +2. local Bootstrap static assets plus the `ZB.MOM.WW.Theme` kit layer + (chassis, tokens, status components). 3. dashboard configuration binding. 4. dashboard auth using LDAP bind + role-mapped HTTP-only cookie. 5. `DashboardSnapshotService` projecting gateway state for read views. diff --git a/docs/GatewayProcessDesign.md b/docs/GatewayProcessDesign.md index bfeb38f..7b65e1f 100644 --- a/docs/GatewayProcessDesign.md +++ b/docs/GatewayProcessDesign.md @@ -248,10 +248,15 @@ Suggested routes: ```text / +/login /sessions /sessions/{sessionId} /workers /events +/alarms +/galaxy +/browse +/apikeys /settings ``` @@ -681,13 +686,14 @@ Dashboard authentication uses LDAP bind + role mapping (separate from the API-key model used on the gRPC API). 
The login endpoint accepts username and password in a form post, calls `DashboardAuthenticator` to bind against `MxGateway:Ldap`, resolves the user's LDAP groups through -`MxGateway:Dashboard:GroupToRole` to one of `Admin` / `Viewer`, and signs in +`MxGateway:Dashboard:GroupToRole` to one of `Administrator` / `Viewer`, and signs in with the `MxGateway.Dashboard` cookie scheme. The cookie is HTTP-only, -secure, strict SameSite, and named `__Host-MxGatewayDashboard`. Logout +secure, strict SameSite, and named `MxGatewayDashboard` (configurable via +`MxGateway:Dashboard:CookieName`). Logout clears it. Login and logout posts validate antiforgery tokens. SignalR connections additionally accept a 30-minute data-protected bearer minted at -`/hubs/token`. `Dashboard:AllowAnonymousLocalhost` permits loopback requests -to bypass the cookie requirement and defaults to `true`. +`/hubs/token`. `MxGateway:Dashboard:AllowAnonymousLocalhost` permits loopback +requests to bypass the cookie requirement and defaults to `true`. Recommended scopes: diff --git a/docs/GatewayTesting.md b/docs/GatewayTesting.md index d3cc886..e481d2a 100644 --- a/docs/GatewayTesting.md +++ b/docs/GatewayTesting.md @@ -100,6 +100,17 @@ Optional live smoke variables: | `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER` | `admin` | ArchestrA user name passed to `AuthenticateUser` before the `WriteSecured` parity step. | | `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_PASSWORD` | `admin123` | Password paired with the user above. Never logged; the test asserts the value does not appear in the WriteSecured diagnostic message. | +When `MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` is unset, the integration harness +locates the worker by resolving the repository root: `ResolveRepositoryRoot` +walks parent directories from the test binary looking for a directory that +contains a `src` subdirectory next to either a `.git` marker or a `*.sln` / +`*.slnx` file under `src`. 
The `.git`-or-`.sln` pair lets the resolution work +both in a checked-out repository and in an extracted copy that ships no `.git` +folder. If the walk exhausts without a match, it throws `InvalidOperationException` +naming the start directory and the expected markers; set +`MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` to point directly at a worker executable and +bypass repository-root resolution entirely. + The test output includes session id, worker process id, command status, HRESULT/status diagnostics, event sequence and handles, close status, and worker stdout/stderr lines emitted during the run. diff --git a/docs/Grpc.md b/docs/Grpc.md index 0043127..0a485b3 100644 --- a/docs/Grpc.md +++ b/docs/Grpc.md @@ -10,7 +10,7 @@ The layer is composed of four collaborators: | Type | Lifetime | Role | |------|----------|------| -| `MxAccessGatewayService` | scoped (gRPC) | Implements the six `MxAccessGateway` RPCs, performs exception mapping. | +| `MxAccessGatewayService` | scoped (gRPC) | Implements the seven `MxAccessGateway` RPCs, performs exception mapping. | | `MxAccessGrpcRequestValidator` | singleton | Rejects malformed requests before any session work runs. | | `MxAccessGrpcMapper` | singleton | Converts public proto types to internal `WorkerCommand`/`WorkerEvent` types and back. | | `IEventStreamService` (`EventStreamService`) | singleton | Owns the event stream pipeline, including bounded queue and backpressure handling. | @@ -29,7 +29,7 @@ A second gRPC service, `GalaxyRepositoryGrpcService`, is mapped alongside it. It ## RPC Handlers -`MxAccessGatewayService` derives from the generated `MxAccessGateway.MxAccessGatewayBase` and implements every RPC declared in `mxaccess_gateway.proto` — six in total: `OpenSession`, `CloseSession`, `Invoke`, `StreamEvents`, `AcknowledgeAlarm`, and `StreamAlarms`. The proto contract itself is documented in [Contracts](./Contracts.md); this section covers only what the server-side handler does on top of that contract. 
+`MxAccessGatewayService` derives from the generated `MxAccessGateway.MxAccessGatewayBase` and implements every RPC declared in `mxaccess_gateway.proto` — seven in total: `OpenSession`, `CloseSession`, `Invoke`, `StreamEvents`, `AcknowledgeAlarm`, `StreamAlarms`, and `QueryActiveAlarms`. The proto contract itself is documented in [Contracts](./Contracts.md); this section covers only what the server-side handler does on top of that contract. Public gRPC send and receive message sizes are configured from `MxGateway:Protocol:MaxGrpcMessageBytes` (default 16 MiB). Official clients use @@ -94,6 +94,10 @@ Carrying the enqueue timestamp into the worker layer is what lets queue-wait tim `StreamAlarms` is a server-streaming, **session-less** RPC that attaches to the gateway's central alarm feed. The handler delegates to `IGatewayAlarmService.StreamAsync`. The stream opens with one `AlarmFeedMessage` carrying an `active_alarm` per currently-active alarm (the ConditionRefresh snapshot), then a single `snapshot_complete`, then a `transition` for every subsequent raise / acknowledge / clear. It is served by the always-on `GatewayAlarmMonitor`, which owns a single gateway-managed worker session and fans out to every attached client — clients no longer open a session of their own. `alarm_filter_prefix`, when set, scopes the stream to a sub-tree. +### `QueryActiveAlarms` + +`QueryActiveAlarms` is a server-streaming, **session-less** RPC that returns a point-in-time snapshot of the alarm monitor's active-alarm cache. The handler iterates `IGatewayAlarmService.CurrentAlarms`, writing one `ActiveAlarmSnapshot` per active alarm, then completes — unlike `StreamAlarms` it emits no `snapshot_complete` sentinel and no transitions. When `alarm_filter_prefix` is non-empty, snapshots whose `alarm_full_reference` does not start with the prefix are skipped (ordinal match). Clients use it to seed or reconcile state after a reconnect; for a live feed they use `StreamAlarms`. 
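As a rough illustration of the prefix rule described for `QueryActiveAlarms`, the sketch below filters a snapshot list with an ordinal `StartsWith`. `AlarmSnapshot` and `AlarmPrefixFilter` are hypothetical stand-ins invented for this example; they are not the generated `ActiveAlarmSnapshot` proto type or the gateway's handler code.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the ActiveAlarmSnapshot proto; only the field
// that the prefix filter reads is modeled here.
public sealed record AlarmSnapshot(string AlarmFullReference);

public static class AlarmPrefixFilter
{
    // Returns the snapshots that pass the optional prefix filter.
    // A null or empty prefix disables filtering. Matching is ordinal,
    // so it is case-sensitive and independent of the current culture.
    public static IEnumerable<AlarmSnapshot> Apply(
        IEnumerable<AlarmSnapshot> current, string? prefix)
    {
        if (string.IsNullOrEmpty(prefix))
        {
            return current;
        }

        return current.Where(alarm =>
            alarm.AlarmFullReference.StartsWith(prefix, StringComparison.Ordinal));
    }
}
```

Because the match is ordinal, a prefix of `Area1` does not match a reference that starts with `area1`; a client that wants case-insensitive scoping would have to normalize references on its own side.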
+ ## Validation Rules `MxAccessGrpcRequestValidator` rejects requests with `StatusCode.InvalidArgument` before any session work happens. The rules are intentionally narrow — anything that requires session state (for example, "session does not exist") is left for `ISessionManager` so the validator can stay synchronous and side-effect free. @@ -106,6 +110,7 @@ Carrying the enqueue timestamp into the worker layer is what lets queue-wait tim | `Invoke` | `session_id` non-empty, `command` present, `kind` not `Unspecified`, payload oneof must match `kind`. | `InvalidArgument` | | `AcknowledgeAlarm` | `alarm_full_reference` must be non-empty. Validated inline in the handler, not by `MxAccessGrpcRequestValidator`. | `InvalidArgument` | | `StreamAlarms` | No required fields — `alarm_filter_prefix` is optional. | — | +| `QueryActiveAlarms` | No required fields — `alarm_filter_prefix` is optional. | — | The payload-vs-kind check matters because the `MxCommand.payload` oneof is non-discriminated on the wire — a misaligned client could send `kind = Write` with a `Register` payload and silently confuse the worker. The validator turns that into a clear client error: @@ -145,7 +150,7 @@ public WorkerCommand MapCommand(MxCommandRequest request) When the worker reply or event payload is missing, the mapper returns a synthetic public message with `ProtocolStatusCode.ProtocolViolation` (for replies) or a sentinel `MxEvent` with `MxEventFamily.Unspecified` (for events). The gateway never relays a partial frame to clients — anything missing is reported as a protocol violation against the worker, not a transport error against the client. -The mapper also exposes static factory methods for every `ProtocolStatusCode` (`Ok`, `InvalidRequest`, `SessionNotFound`, `SessionNotReady`, `WorkerUnavailable`, `Timeout`, `Canceled`, `ProtocolViolation`) so that handlers and tests can produce status payloads without duplicating the enum-to-string mapping. 
+The mapper also exposes static factory methods for most `ProtocolStatusCode` values (`Ok`, `InvalidRequest`, `SessionNotFound`, `SessionNotReady`, `WorkerUnavailable`, `Timeout`, `Canceled`, `ProtocolViolation`) so that handlers and tests can produce status payloads without duplicating the enum-to-string mapping. There is intentionally no factory for `MxAccessFailure` (the ninth enum value): that code is set by the worker on the reply payload to report an MXAccess-side failure, not synthesized by the gateway mapper. ## Exception to Status Mapping @@ -224,7 +229,7 @@ if (!writer.TryWrite(publicEvent)) } ``` -Under `FailFast` the session is faulted so subsequent commands return `FailedPrecondition`; the client must reopen. Under the default policy only the stream is dropped and the session continues to accept commands, leaving recovery to the client (typically a fresh `StreamEvents` call with an updated `AfterWorkerSequence`). Either way, the consumer side observes `StatusCode.ResourceExhausted` via the `EventQueueOverflow` mapping above. +`FailFast` is the **default** policy (`Events:BackpressurePolicy`): on overflow the whole session is faulted, so subsequent commands return `FailedPrecondition` and the client must reopen. This is deliberate — the default refuses to silently drop MXAccess events. The non-default `DisconnectSubscriber` policy drops only the slow stream and leaves the session accepting commands, leaving recovery to the client (typically a fresh `StreamEvents` call with an updated `AfterWorkerSequence`). Either way, the consumer side observes `StatusCode.ResourceExhausted` via the `EventQueueOverflow` mapping above. 
### Cancellation and cleanup diff --git a/docs/MxAccessWorkerInstanceDesign.md b/docs/MxAccessWorkerInstanceDesign.md index bd253f1..a3c09bf 100644 --- a/docs/MxAccessWorkerInstanceDesign.md +++ b/docs/MxAccessWorkerInstanceDesign.md @@ -94,9 +94,11 @@ Expected protected environment values: ```text MXGATEWAY_WORKER_NONCE= -MXGATEWAY_WORKER_LOG_CONTEXT= ``` +The nonce travels through the environment rather than the command line so it +never appears in process-listing tools that expose argument vectors. + Startup sequence: 1. Parse command-line arguments. @@ -114,16 +116,26 @@ Startup sequence: If validation fails before MXAccess creation, exit quickly with a non-zero exit code. If MXAccess creation fails, send `WorkerFault` when possible and exit. -The bootstrap layer returns structured exit codes before it creates pipes, -starts the STA, or touches MXAccess: +`WorkerApplication.Run` returns one of the structured `WorkerExitCode` values. +Codes `2`–`4` are produced by the bootstrap parse phase before any pipe, STA, or +MXAccess work happens; codes `5`–`6` and a clean `0` only become reachable once +the parse succeeds and the worker runs its pipe session: | Exit code | Name | Meaning | |-----------|------|---------| -| `0` | `Success` | Required bootstrap options are valid. | +| `0` | `Success` | The pipe session ran to a clean close. | | `1` | `UnexpectedFailure` | A non-bootstrap exception reaches the process boundary. | | `2` | `InvalidArguments` | Required arguments are missing or unknown arguments are present. | | `3` | `InvalidProtocolVersion` | `--protocol-version` is not numeric or does not match the supported worker protocol. | | `4` | `MissingNonce` | `MXGATEWAY_WORKER_NONCE` is absent or empty. | +| `5` | `PipeConnectionFailed` | The pipe connection raised an `IOException` or `TimeoutException`. | +| `6` | `ProtocolViolation` | A `WorkerFrameProtocolException` escaped the pipe session. 
| + +`WorkerBootstrapResult.Succeeded` is a separate parse-phase gate: it reports +whether argument parsing produced usable `WorkerOptions`. A `false` result +carries one of codes `2`–`4` and the worker exits before running a session, so a +successful parse is distinct from the `0` exit code, which only follows a clean +pipe-session close. Bootstrap logs use `WorkerConsoleLogger` key/value output. `WorkerLogRedactor` redacts fields whose names indicate nonce, secret, password, token, @@ -133,30 +145,35 @@ credential, or API key values before the message is written. ```text ZB.MOM.WW.MxGateway.Worker - Program + Program (calls WorkerApplication.Run) + WorkerApplication (parse, bootstrap, run pipe session, map exit code) Bootstrap + WorkerOptionsParser (parse args + env into WorkerOptions) WorkerOptions - WorkerHost + WorkerBootstrapResult (parse outcome + WorkerExitCode) + WorkerExitCode + WorkerConsoleLogger / WorkerLogRedactor Ipc - PipeClient - FrameReader - FrameWriter - WorkerProtocol + WorkerPipeClient (named-pipe connect + retry, owns the session) + WorkerPipeSession (handshake, read/write/drain/heartbeat loops) + WorkerFrameReader / WorkerFrameWriter + WorkerEnvelopeValidator + WorkerContractInfo (protocol version + descriptor names) Sta - StaRuntime - StaCommandQueue - MessagePump - StaWatchdog + StaRuntime (the dedicated STA thread + message pump loop) + StaCommandDispatcher + StaMessagePump MxAccess - MxAccessSession - MxAccessCommandDispatcher - MxAccessEventSink + MxAccessStaSession (IWorkerRuntimeSession over the STA) + MxAccessSession (handle registry + COM-call orchestration) + MxAccessCommandExecutor (IStaCommandExecutor; runs commands on the STA) + MxAccessBaseEventSink (OnDataChange tag-data events) MxAccessHandleRegistry + (alarm subsystem — see below) Conversion - VariantConverter - SafeArrayConverter - StatusProxyConverter - HResultMapper + VariantConverter (MxValue <-> COM VARIANT, both directions) + MxStatusProxyConverter + HResultConverter / 
HResultConversion ``` ## Threading Model @@ -330,13 +347,19 @@ cleanup path completes. ## Event Sink -The worker must subscribe to every public MXAccess event family: +The worker subscribes to every public MXAccess event family through +`MxAccessBaseEventSink`: - `OnDataChange` - `OnWriteComplete` - `OperationComplete` - `OnBufferedDataChange` +Alarm transitions arrive on a separate path. They do not originate from the +`LMXProxyServerClass` connection points, so `MxAccessAlarmEventSink` (driven by +the alarm subsystem below) feeds them onto the same `MxAccessEventQueue` rather +than `MxAccessBaseEventSink`. + Forward these event families only when the native MXAccess COM object raises them. Do not synthesize `OperationComplete` from write completion or command status. `OnBufferedDataChange` must be represented in the protocol now, but @@ -368,16 +391,49 @@ type on buffered events. `OperationComplete` is only emitted from the native `MxAccessEventQueue` is the bounded outbound event queue for one worker session. It assigns the monotonic `WorkerSequence` and `WorkerTimestamp` when an event is accepted, preserving the order in which MXAccess handlers enqueue -events. The default capacity is `10000`. When the queue reaches capacity it -records a `WorkerFaultCategory.QueueOverflow` fault and rejects further events. -The event handler catches conversion and enqueue failures, records the first -fault on the queue, and returns to the STA message pump instead of writing to -the pipe. +events. The default capacity is `10000`. When the queue reaches capacity, `Enqueue` +records a `WorkerFaultCategory.QueueOverflow` fault and then throws +`MxAccessEventQueueOverflowException` so the caller cannot silently drop the +event. The event handler catches conversion and enqueue failures (including this +overflow exception), records the first fault on the queue, and returns to the +STA message pump instead of writing to the pipe. 
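The overflow behavior can be sketched as a bounded queue that records the first fault and then throws, so the caller cannot miss the rejection. This is a simplified, single-threaded illustration: `BoundedEventQueue`, `QueueOverflowException`, and the `FirstFault` string are hypothetical names chosen for the example, not the worker's `MxAccessEventQueue` implementation.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical exception type standing in for MxAccessEventQueueOverflowException.
public sealed class QueueOverflowException : Exception
{
    public QueueOverflowException(int capacity)
        : base($"Event queue is full (capacity {capacity}).")
    {
    }
}

// Simplified bounded queue; not thread-safe, for illustration only.
public sealed class BoundedEventQueue<T>
{
    private readonly Queue<(long Sequence, T Event)> _items = new();
    private readonly int _capacity;
    private long _lastSequence;

    public BoundedEventQueue(int capacity = 10000)
    {
        if (capacity <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(capacity));
        }

        _capacity = capacity;
    }

    // First fault recorded on the queue; later faults do not overwrite it.
    public string? FirstFault { get; private set; }

    public int Count => _items.Count;

    // Assigns the next monotonic sequence number and stores the event,
    // or records an overflow fault and throws when the queue is full.
    public long Enqueue(T evt)
    {
        if (_items.Count >= _capacity)
        {
            FirstFault ??= "QueueOverflow";
            throw new QueueOverflowException(_capacity);
        }

        long sequence = ++_lastSequence;
        _items.Enqueue((sequence, evt));
        return sequence;
    }
}
```

Throwing after recording the fault means a handler that forgets to check a return value still ends up in its catch path, which is where the text above says the first fault is recorded before control returns to the message pump.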
If event conversion throws, catch it inside the event handler, record a structured `WorkerFault`, and keep the worker alive only if the fault policy allows it. +## Alarm Subsystem + +Alarms come from a different COM surface than tag data, so the worker carries a +separate pipeline rather than folding alarms into `MxAccessBaseEventSink`. The +MXAccess `LMXProxyServerClass` does not expose alarm subscription, so the worker +hosts AVEVA's standalone alarm-consumer COM object instead. + +- `WnWrapAlarmConsumer` is the production `IMxAccessAlarmConsumer`, backed by + `WNWRAPCONSUMERLib.wwAlarmConsumerClass`. It returns the active alarm set as a + BSTR XML string through `GetXmlCurrentAlarms2`, which avoids the FILETIME→ + `DateTime` marshaling that crashed the earlier managed alarm client. The CLSID + is registered `ThreadingModel=Apartment`, so the consumer is created and + driven entirely on the worker's STA. It owns no internal timer. +- `MxAccessStaSession` drives the **STA alarm poll loop**: `RunAlarmPollLoopAsync` + awaits a fixed `500 ms` interval and then calls `IAlarmCommandHandler.PollOnce` + on the STA via the runtime, so every `GetXmlCurrentAlarms2` call stays on the + apartment that owns the consumer. A poll failure is recorded as a + `WorkerFault` on the event queue rather than terminating the worker. +- `AlarmCommandHandler` owns one `AlarmDispatcher` per session and is the entry + point for the alarm IPC commands (`SubscribeAlarms`, `AcknowledgeAlarm` by GUID + or name, `QueryActiveAlarms`, `Unsubscribe`). It rejects a second subscribe + before an unsubscribe, mirroring the consumer's non-idempotent `Subscribe`. +- `AlarmDispatcher` wires the consumer's `AlarmTransitionEmitted` stream onto + `MxAccessAlarmEventSink.EnqueueTransition`. 
It maps state transitions through + `AlarmRecordTransitionMapper`, composes the canonical + `\\\Galaxy!` full reference, and projects active-alarm + snapshots to `ActiveAlarmSnapshot` protos for the `QueryActiveAlarms` refresh + stream. +- `MxAccessAlarmEventSink` enqueues each decoded transition onto the shared + `MxAccessEventQueue` as a proto alarm-transition event, stamping the session + id, so alarms ride the same outbound IPC path as tag-data events. + ## Command Queue The pipe reader converts `WorkerCommand` messages into `StaCommand` entries. diff --git a/docs/Sessions.md b/docs/Sessions.md index e5b8534..decfe2e 100644 --- a/docs/Sessions.md +++ b/docs/Sessions.md @@ -4,9 +4,9 @@ The sessions subsystem owns the in-memory representation of an active gateway-to ## Overview -A session is the gateway-side handle that callers use to invoke worker commands, stream worker events, and tear the worker down. The subsystem is split between the per-session state machine (`GatewaySession`), an in-memory directory (`SessionRegistry`), the orchestrator that opens and closes sessions (`SessionManager`), the worker construction step (`SessionWorkerClientFactory`), and a hosted service that drains sessions during host shutdown (`SessionShutdownHostedService`). +A session is the gateway-side handle that callers use to invoke worker commands, stream worker events, and tear the worker down. The subsystem is split between the per-session state machine (`GatewaySession`), an in-memory directory (`SessionRegistry`), the orchestrator that opens and closes sessions (`SessionManager`), the worker construction step (`SessionWorkerClientFactory`), a hosted service that sweeps expired leases (`SessionLeaseMonitorHostedService`), and a hosted service that drains sessions during host shutdown (`SessionShutdownHostedService`). 
-All four interfaces (`ISessionManager`, `ISessionRegistry`, `ISessionWorkerClientFactory`) plus `SessionShutdownHostedService` are wired as singletons by `SessionServiceCollectionExtensions.AddGatewaySessions`. +The three interfaces (`ISessionManager`, `ISessionRegistry`, `ISessionWorkerClientFactory`) are wired as singletons, and both hosted services (`SessionLeaseMonitorHostedService`, `SessionShutdownHostedService`) are registered, by `SessionServiceCollectionExtensions.AddGatewaySessions`. The startup orphan-worker cleanup that runs before any session opens lives in the worker subsystem (`OrphanWorkerCleanupHostedService`); see [Gateway Restart and Orphan Cleanup](#gateway-restart-and-orphan-cleanup). ## Key Types @@ -18,6 +18,8 @@ The session id is an opaque string in the form `session-{guid:N}` and the per-se `SessionState` itself is the protobuf-generated enum from `ZB.MOM.WW.MxGateway.Contracts.Proto`, so it is shared between the gateway and clients on the wire. +`GatewaySession` also keeps an `_items` dictionary keyed by `(ServerHandle, ItemHandle)` mapping each subscribed item to its `SessionItemRegistration` (server handle, item handle, tag address). It is the gateway-side shadow of the items the worker has added, populated as `AddItem`-style commands succeed and pruned on `RemoveItem`. The shadow exists so the gateway can answer item lookups and clean up subscriptions without round-tripping the worker; the worker remains authoritative for the handles themselves (see [gateway.md](../gateway.md)). + ```csharp public void TransitionTo(SessionState nextState) { @@ -54,7 +56,7 @@ public void TransitionTo(SessionState nextState) `CloseSessionAsync` and `KillWorkerAsync` are both end-of-life paths but differ in what they offer the worker: - `CloseSessionAsync` is the graceful path: it calls `GatewaySession.CloseAsync`, which asks the worker to shut down via `IWorkerClient.ShutdownAsync` and only kills the process as a fallback if shutdown fails. 
-- `KillWorkerAsync` is the forceful path used by the dashboard's admin Kill button: it calls `GatewaySession.KillWorker` directly, which kills the worker process immediately with no graceful-shutdown attempt and transitions the session to `Closed`. +- `KillWorkerAsync` is the forceful path used by the dashboard's admin Kill button: it calls `GatewaySession.KillWorkerWithCloseGateAsync`, which kills the worker process immediately with no graceful-shutdown attempt and transitions the session to `Closed`. Routing through `KillWorkerWithCloseGateAsync` (rather than the bare `GatewaySession.KillWorker`) acquires the per-session `_closeLock` so a kill and an in-flight graceful close serialize on the same "was the session already closed" observation that drives metric accounting; the method returns that observation so `KillWorkerAsync` increments `mxgateway.sessions.closed` at most once across concurrent callers. Both paths converge on the same registry/metrics cleanup, so the open-session slot is released and `mxgateway.sessions.closed` is incremented either way. @@ -99,6 +101,8 @@ if (exception is OperationCanceledException The named pipe is created with `maxNumberOfServerInstances: 1` so a second worker cannot connect to the same pipe name even if the first launch is still pending. Combined with the per-session nonce passed to the worker, this is the gateway's defense against a foreign process answering a pipe. +The factory also seeds the worker client's `MaxPendingCommands` from `MxGateway:Sessions:MaxPendingCommandsPerSession` (default 128, validated `> 0` at startup). This caps how many commands can be in flight to a single worker at once; the `WorkerClient` rejects an enqueue past the cap and records `mxgateway.queues.overflows` tagged `worker-pending-commands`. The bound exists because the worker executes commands serially on one STA — an unbounded backlog would only grow memory and latency, not throughput. 
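
The pending-command cap can be pictured with a small sketch. It assumes a table keyed by correlation id whose slots are reserved before a command is sent and released when its reply arrives. `PendingCommandTable` and its members are hypothetical names for illustration, not the actual `WorkerClient` API, and the overflow counter stands in for the `mxgateway.queues.overflows` metric.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical table of in-flight commands for one worker.
public sealed class PendingCommandTable
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<string>> _pending = new();
    private readonly int _maxPending;
    private int _count;
    private long _overflows;

    public PendingCommandTable(int maxPending = 128)
    {
        if (maxPending <= 0) throw new ArgumentOutOfRangeException(nameof(maxPending));
        _maxPending = maxPending;
    }

    public long Overflows => Interlocked.Read(ref _overflows);

    public int Count => Volatile.Read(ref _count);

    // Reserves a slot and returns the task that completes with the reply.
    public Task<string> Register(string correlationId)
    {
        // Reserve optimistically, then undo the reservation if it exceeds the cap.
        if (Interlocked.Increment(ref _count) > _maxPending)
        {
            Interlocked.Decrement(ref _count);
            Interlocked.Increment(ref _overflows);
            throw new InvalidOperationException("Too many pending commands for this worker.");
        }

        var completion = new TaskCompletionSource<string>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        if (!_pending.TryAdd(correlationId, completion))
        {
            Interlocked.Decrement(ref _count);
            throw new ArgumentException("Duplicate correlation id.", nameof(correlationId));
        }

        return completion.Task;
    }

    // Releases the slot and hands the reply to the waiting caller.
    public bool Complete(string correlationId, string reply)
    {
        if (!_pending.TryRemove(correlationId, out var completion))
        {
            return false;
        }

        Interlocked.Decrement(ref _count);
        completion.TrySetResult(reply);
        return true;
    }
}
```

Rejecting at registration time, rather than queueing and waiting, keeps the failure visible to the caller immediately, which matches the stated rationale that a deeper backlog adds memory and latency without adding throughput.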
+ ### SessionShutdownHostedService `SessionShutdownHostedService` is an `IHostedService` whose only job is to call `ISessionManager.ShutdownAsync` from `StopAsync`. It catches `OperationCanceledException` triggered by the host shutdown timeout and logs a warning so that an over-running shutdown does not surface as an unhandled exception. @@ -172,6 +176,14 @@ catch (Exception exception) await session.DisposeAsync().ConfigureAwait(false); } + // If SessionOpened() already incremented the open-session gauge, + // a failure after that point (e.g. auto-subscribe rejection) must + // decrement it again so mxgateway.sessions.open does not leak. + if (sessionOpenedRecorded) + { + _metrics.SessionRemoved(); + } + ReleaseSessionSlot(); _metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString()); _logger.LogWarning( @@ -186,7 +198,7 @@ catch (Exception exception) } ``` -The order — fault, deregister, dispose, release slot, record metric, log, rethrow — matters because releasing the semaphore before disposal would let the next open race the worker process tear-down on the same machine. +The order — fault, deregister, dispose, conditionally decrement the open-session gauge, release slot, record fault metric, log, rethrow — matters because releasing the semaphore before disposal would let the next open race the worker process tear-down on the same machine. The `SessionRemoved()` call is conditional on `sessionOpenedRecorded` (Server-006): a failure *after* `SessionOpened()` already incremented `mxgateway.sessions.open` (for example, an auto-subscribe rejection) must decrement the gauge so it does not leak, but a failure before that point must not. ### Run @@ -194,6 +206,8 @@ While `Ready`, callers reach the worker through `SessionManager.InvokeAsync` or Event streaming uses `AttachEventSubscriber` which returns a disposable lease. 
When `allowMultipleSubscribers` is false the second attach throws `EventSubscriberAlreadyActive`; this prevents two gRPC streams from racing on the same worker event channel. Active event subscribers keep the session lease from expiring until the stream is disposed. +The single-subscriber rule is enforced at startup, not just at runtime: setting `MxGateway:Sessions:AllowMultipleEventSubscribers` to `true` is refused by `GatewayOptionsValidator` with "AllowMultipleEventSubscribers is not supported until event fan-out is implemented," so the gateway fails fast rather than booting in a configuration the event path cannot honor. Multi-subscriber fan-out is explicitly out of scope for v1 (see [Design Decisions](./DesignDecisions.md)). + Sessions open with `MxGateway:Sessions:DefaultLeaseSeconds` (default 1800) added to the open timestamp. Unary client activity refreshes the lease by the same duration. `ExtendLease` and `IsLeaseExpired` cooperate with `SessionManager.CloseExpiredLeasesAsync`, which iterates a registry snapshot and closes any session whose lease has expired with `LeaseExpiredReason`. `SessionLeaseMonitorHostedService` runs that sweep every `MxGateway:Sessions:LeaseSweepIntervalSeconds` seconds (default 30). ### Close @@ -227,11 +241,11 @@ if (_workerClient is not null) If both graceful shutdown and the kill fall-back fail, the original and kill exceptions are bundled into an `AggregateException` and surfaced as `SessionCloseStartedException`. `SessionManager.CloseSessionCoreAsync` then translates that into a `SessionManagerException` with `CloseFailed` and removes the session. -`GatewaySession.KillWorker` is the unconditional forced-close path used by shutdown when graceful close itself throws, and also by `SessionManager.KillWorkerAsync` — the explicit kill path that the dashboard's admin Kill button invokes. 
`KillWorkerAsync` skips `WorkerClient.ShutdownAsync` entirely, so `KillCount` increments while `ShutdownCount` does not; the session is then removed from the registry and the open-session slot is released, identical to the cleanup that follows a successful `CloseSessionAsync`. +`GatewaySession.KillWorker` is the unconditional forced-close path. `SessionManager.KillWorkerAsync` — the explicit kill path that the dashboard's admin Kill button invokes — no longer calls it directly; it routes through `GatewaySession.KillWorkerWithCloseGateAsync` so the kill takes the per-session `_closeLock`. That method skips `WorkerClient.ShutdownAsync` entirely and forces the worker process down via `IWorkerClient.Kill`, which records the `mxgateway.workers.killed` counter through `GatewayMetrics.WorkerKilled(reason)`. The session is then removed from the registry and the open-session slot is released, identical to the cleanup that follows a successful `CloseSessionAsync` (which increments `mxgateway.sessions.closed`). There is no separate `KillCount` / `ShutdownCount`: worker terminations are counted by `mxgateway.workers.killed` (tagged with the kill reason), and session closes by `mxgateway.sessions.closed`. ## Shutdown Coordination -`SessionShutdownHostedService.StopAsync` calls `SessionManager.ShutdownAsync`, which closes every registered session with `GatewayShutdownReason`. The shutdown loop catches per-session exceptions, calls `KillWorker`, and removes the session so that one stuck worker cannot block the rest of the host: +`SessionShutdownHostedService.StopAsync` calls `SessionManager.ShutdownAsync`, which closes every registered session with `GatewayShutdownReason`. The shutdown loop catches per-session exceptions and falls back to a forced kill so that one stuck worker cannot block the rest of the host. 
The fallback routes through `KillWorkerAsync` (not a bare `session.KillWorker`) so the kill takes the same close-gate and metric bookkeeping as the dashboard kill path (Server-046): ```csharp public async Task ShutdownAsync(CancellationToken cancellationToken) @@ -248,21 +262,40 @@ public async Task ShutdownAsync(CancellationToken cancellationToken) exception, "Graceful shutdown failed for session {SessionId}; killing worker.", session.SessionId); + + // CloseSessionCoreAsync's inner SessionCloseStartedException catch normally + // removes and accounts the session; this fallback only fires for sessions + // still in the registry, and reuses KillWorkerAsync for identical bookkeeping. if (_registry.TryGet(session.SessionId, out _)) { - session.KillWorker(GatewayShutdownReason); - await RemoveSessionAsync(session).ConfigureAwait(false); + try + { + await KillWorkerAsync(session.SessionId, GatewayShutdownReason, cancellationToken).ConfigureAwait(false); + } + catch (SessionManagerException killException) + { + _logger.LogWarning( + killException, + "Worker kill fallback failed for session {SessionId}.", + session.SessionId); + } } } } } ``` -Iterating over `Snapshot` rather than the live dictionary lets `RemoveSessionAsync` mutate the registry inside the loop without throwing. +Iterating over `Snapshot` rather than the live dictionary lets the registry mutate inside the loop without throwing. + +## Gateway Restart and Orphan Cleanup + +A graceful shutdown drains sessions through `ShutdownAsync`, but a gateway crash or `Kill` leaves no chance to tear workers down. Those orphaned worker processes outlive the gateway that launched them, still holding their MXAccess COM instance and their named pipe. Because the pipe name encodes the *old* gateway PID, a fresh gateway will never reconnect to them — v1 deliberately does not reattach orphan workers (see [Design Decisions](./DesignDecisions.md)). 
+ +Instead, `OrphanWorkerCleanupHostedService` runs once on startup, before any session opens, and calls `OrphanWorkerTerminator.TerminateOrphans`. The terminator enumerates running processes matching the configured worker executable name, skips the current process, and kills any that it identifies as a leftover worker (matched against the configured executable path). Each kill records `mxgateway.workers.killed` tagged `OrphanStartupCleanup` and logs a warning. The sweep is best-effort: a failure to kill any one orphan (it may have already exited, or be inaccessible) is logged and swallowed so it cannot block gateway startup. This service lives in the worker subsystem, not the session subsystem, because it operates on OS processes rather than `GatewaySession` state. ## Dependency Injection -`SessionServiceCollectionExtensions.AddGatewaySessions` registers the four singletons and the hosted service: +`SessionServiceCollectionExtensions.AddGatewaySessions` registers the three singletons and the two hosted services: ```csharp public static IServiceCollection AddGatewaySessions(this IServiceCollection services) @@ -270,13 +303,14 @@ public static IServiceCollection AddGatewaySessions(this IServiceCollection serv services.AddSingleton(); services.AddSingleton(); services.AddSingleton(); + services.AddHostedService(); services.AddHostedService(); return services; } ``` -The registry must be a singleton because its `ConcurrentDictionary` is the source of truth for session state across the gRPC service, the lease sweeper, the dashboard, and the shutdown hosted service. Registering `SessionShutdownHostedService` last ensures it is constructed after `ISessionManager` and therefore drains sessions during host stop. +The registry must be a singleton because its `ConcurrentDictionary` is the source of truth for session state across the gRPC service, the lease sweeper, the dashboard, and the shutdown hosted service. 
`SessionLeaseMonitorHostedService` runs the periodic expired-lease sweep; `SessionShutdownHostedService` drains sessions during host stop. Both are registered after `ISessionManager` so they resolve the same singleton manager when the host starts; `SessionShutdownHostedService` is registered last so it is the latter of the two to be constructed and is available to drain sessions on stop. ## Related Documentation diff --git a/docs/WorkerBootstrap.md b/docs/WorkerBootstrap.md index aa2ffc6..de64f39 100644 --- a/docs/WorkerBootstrap.md +++ b/docs/WorkerBootstrap.md @@ -4,7 +4,7 @@ The bootstrap layer parses the command-line arguments and environment variables ## Overview -The worker process is a short-lived child of the gateway. The gateway side of this contract lives in [WorkerProcessLauncher](./WorkerProcessLauncher.md). On the worker side, `Program.cs` is a single line that delegates to `WorkerApplication.Run(args)`: +The worker process is a per-session child process of the gateway: one worker is launched per session and lives for that session's lifetime. The gateway side of this contract lives in [WorkerProcessLauncher](./WorkerProcessLauncher.md). On the worker side, `Program.cs` is a single line that delegates to `WorkerApplication.Run(args)`: ```csharp using ZB.MOM.WW.MxGateway.Worker; @@ -143,7 +143,7 @@ The production binding in `WorkerApplication.Run(string[])` is `EnvironmentVaria ## Logging -The worker writes structured key/value lines to standard error. Standard error is used rather than standard output because the gateway side reads worker stdout for diagnostic capture only, while stderr is reserved for log output that does not interfere with any future stdout-based channel. +The worker writes structured key/value lines to standard error. 
The launcher does not redirect either stream (`WorkerProcessLauncher` sets `UseShellExecute=false` and `CreateNoWindow=true` but leaves stdout and stderr inherited), so log output lands on the inherited console rather than a pipe the gateway reads. Standard error is used rather than standard output so that diagnostic logging stays clear of stdout, keeping that stream free for any future stdout-based channel. ### The logger contract diff --git a/docs/WorkerConversion.md b/docs/WorkerConversion.md index 18691fe..43946da 100644 --- a/docs/WorkerConversion.md +++ b/docs/WorkerConversion.md @@ -109,6 +109,30 @@ default: The MXAccess engine returns values whose semantic type only fully resolves after consulting the engine's own attribute metadata. Clients that round-trip these values through the gateway (replay, parity fixtures, diagnostics) need the original `VT_*` tag, the engine-declared `MxDataType`, and any conversion diagnostic; otherwise edge cases such as decimal-to-double rounding, ulong overflow, or an unknown SAFEARRAY element type become invisible bugs. Storing both the typed projection and the raw fields in the same `MxValue`/`MxArray` lets cross-language clients recover the original observation byte-for-byte where possible and detect lossy cases where it is not. +### Inverse projection for COM writes + +The conversions above run on the read path, turning COM values into `MxValue`. +The write path runs the same `VariantConverter` in reverse: `ConvertToComValue` +takes an `MxValue` from a `Write` command and returns a CLR object that the COM +marshaler boxes into the matching VARIANT, so it is the inverse of `Convert`. + +- A null `MxValue` argument throws; an `MxValue` whose `IsNull` flag is set + returns `null` (the MXAccess null), keeping the read/write null semantics + symmetric. +- Each `KindCase` maps to its CLR scalar (`bool`, `int`, `long`, `float`, + `double`, `string`). 
A `TimestampValue` becomes a `DateTime`, which the + marshaler renders as `VT_DATE` — the form MXAccess accepts for the + timestamped-write argument. +- An array kind delegates to `ConvertToComArray`, which projects each + `MxArray.ValuesCase` to a typed CLR array (for example `int[]`, `string[]`, or + a `DateTime[]` for timestamp arrays) so the marshaler produces the + corresponding SAFEARRAY. +- `RawValue` payloads are intentionally rejected on both the scalar and array + paths. Raw bytes are preserved on the read path for diagnostics, but there is + no safe way to reconstruct the original VARIANT from them, so a write that + carries a raw value throws rather than guessing. An `MxValue` with no value + kind set throws for the same reason — there is nothing to write. + ## HResultConverter and HResultConversion `HResultConverter.Convert` wraps any `Exception` thrown across the COM boundary. It prefers `COMException.ErrorCode` over `Exception.HResult` because the runtime sometimes overwrites `Exception.HResult` while marshalling, and the `ErrorCode` field is the value the COM call actually returned. @@ -223,7 +247,7 @@ public string PreserveCompletionOnlyStatusBytes(byte[] statusBytes) `MxStatusDetailText` is an internal lookup that maps known `MXSTATUS_PROXY.detail` codes to short human-readable strings (for example `28 = "Index out of range"`, `42 = "Unable to convert string"`, `8017 = "Object must be offscan to modify attributes that have an MxSecurityConfigure security classification"`). `MxStatusProxyConverter.Convert` calls `Lookup` and writes the result to `DiagnosticText`. Unknown codes return `string.Empty`, leaving the numeric `Detail` field as the authoritative identifier. -The mapping covers the engine-error range documented for MXAccess (16-50, 56-61, 541-542, 8017). Adding entries here is the supported way to enrich wire-level diagnostics without changing the proto schema. 
+The mapping covers selected detail codes in the MXAccess engine-error ranges (16-50, 56-61, 541-542, 8017). The ranges are not contiguous: codes that the runtime does not assign a distinct meaning are omitted (for example 35, 45, and 46 in the 16-50 range and 58-59 in the 56-61 range), so only codes with a known text appear. Adding entries here is the supported way to enrich wire-level diagnostics without changing the proto schema. ## MxStatusConversionException diff --git a/docs/WorkerSta.md b/docs/WorkerSta.md index 8d20685..382bf6b 100644 --- a/docs/WorkerSta.md +++ b/docs/WorkerSta.md @@ -16,7 +16,7 @@ The installed MXAccess interop assembly declares an `Apartment` threading model | `IStaWorkItem` / `StaWorkItem` | Internal queue entries that capture a delegate, a `CancellationToken`, and a `TaskCompletionSource` for the caller. | | `StaCommand` | Carries an `MxCommand` together with `SessionId`, `CorrelationId`, `EnqueueTimestamp`, and a `CancellationToken`. | | `IStaCommandExecutor` | The boundary between the dispatcher and the MXAccess interop layer; returns `MxCommandReply`. | -| `StaCommandDispatcher` | Bounded asynchronous queue in front of `StaRuntime` that converts `StaCommand` into `MxCommandReply` and applies status normalization. | +| `StaCommandDispatcher` | A bounded `Queue` (guarded by a lock) with an async drain loop in front of `StaRuntime` that converts `StaCommand` into `MxCommandReply` and applies status normalization. | ## STA Thread Initialization @@ -141,10 +141,10 @@ finally `StaRuntime.Shutdown(TimeSpan timeout)` performs an ordered shutdown: -1. Sets `shutdownRequested` under `gate` so `InvokeAsync` rejects new work with `InvalidOperationException`. +1. Sets `shutdownRequested` under `gate` so subsequent `InvokeAsync` calls reject new work. `InvokeAsync` does not throw inline: it returns a faulted `Task` carrying `StaRuntimeShutdownException` (a dedicated subtype, not a bare `InvalidOperationException`). 
The distinct type lets callers and the dispatcher distinguish "rejected because the runtime is shutting down" from any other invalid-operation condition. 2. Signals `commandWakeEvent` to break the STA out of `WaitForWorkOrMessages`. 3. Waits up to `timeout` on `stoppedEvent`, which the STA sets after it leaves `ThreadMain`. -4. Once the thread has stopped, drains the queue through `CancelQueuedCommands`, which calls `CancelBeforeExecution` on every remaining work item so awaiting callers observe `OperationCanceledException` instead of hanging. +4. The queue is drained through `CancelQueuedCommands` twice. `ThreadMain`'s `finally` block runs it before setting `stoppedEvent`, so any work that was queued while the loop was exiting is canceled on the STA itself. `Shutdown` then runs it again after the wait returns, which catches work enqueued during the gap between the `finally` drain and the gate close. Either way, `CancelBeforeExecution` completes every remaining work item so awaiting callers observe `OperationCanceledException` instead of hanging. (When the STA thread never started, `Shutdown` instead drains directly and sets `stoppedEvent` itself.) `ThreadMain`'s `finally` block guarantees that `comApartmentInitializer.Uninitialize` runs (when COM was successfully initialized) before `stoppedEvent.Set`, so the apartment is always torn down on the same thread that initialized it. `Dispose` calls `Shutdown` with a five-second budget and only disposes the wait handles when shutdown actually completed, which prevents a still-running STA thread from touching disposed handles. diff --git a/docs/style-guides/PythonStyleGuide.md b/docs/style-guides/PythonStyleGuide.md index e0241ef..cc284e6 100644 --- a/docs/style-guides/PythonStyleGuide.md +++ b/docs/style-guides/PythonStyleGuide.md @@ -65,4 +65,6 @@ CLI, and tests. - Use `pytest` and `pytest-asyncio`. - Use fake generated stubs or an in-process test gRPC server for unit tests. 
-- Keep live integration tests behind `MXGATEWAY_INTEGRATION=1`. +- Keep live integration tests behind an explicit opt-in environment variable + and a `pytest` skip guard, matching the existing tests (for example the + loopback TLS tests gate on `MXGATEWAY_RUN_TLS_TESTS=1`). diff --git a/gateway.md b/gateway.md index 4db6d7a..382a87b 100644 --- a/gateway.md +++ b/gateway.md @@ -145,9 +145,10 @@ for the alarm subsystem. Dashboard authentication is LDAP-backed (distinct from the API-key model on the gRPC API). `/login` accepts username and password in a form body, binds -against `MxGateway:Ldap`, maps the user's LDAP groups to `Admin` or `Viewer` -via `MxGateway:Dashboard:GroupToRole`, and issues an HTTP-only secure -`__Host-MxGatewayDashboard` cookie. `/logout` clears it. Login and logout +against `MxGateway:Ldap`, maps the user's LDAP groups to `Administrator` or +`Viewer` via `MxGateway:Dashboard:GroupToRole`, and issues an HTTP-only secure +`MxGatewayDashboard` cookie (the name is configurable via +`MxGateway:Dashboard:CookieName`). `/logout` clears it. Login and logout posts validate antiforgery tokens. SignalR hub connections accept either the cookie or a 30-minute data-protected bearer minted at `/hubs/token`. 
`MxGateway:Dashboard:AllowAnonymousLocalhost` permits loopback to bypass the @@ -232,27 +233,35 @@ message WorkerEnvelope { uint32 protocol_version = 1; string session_id = 2; uint64 sequence = 3; - uint64 correlation_id = 4; + string correlation_id = 4; + oneof body { - WorkerHello worker_hello = 10; - GatewayHello gateway_hello = 11; + GatewayHello gateway_hello = 10; + WorkerHello worker_hello = 11; WorkerReady worker_ready = 12; - WorkerCommand command = 20; - WorkerCommandReply command_reply = 21; - WorkerEvent event = 22; - WorkerHeartbeat heartbeat = 23; - WorkerCancel cancel = 24; - WorkerShutdown shutdown = 25; - WorkerFault fault = 26; + WorkerCommand worker_command = 13; + WorkerCommandReply worker_command_reply = 14; + WorkerCancel worker_cancel = 15; + WorkerShutdown worker_shutdown = 16; + WorkerShutdownAck worker_shutdown_ack = 17; + WorkerEvent worker_event = 18; + WorkerHeartbeat worker_heartbeat = 19; + WorkerFault worker_fault = 20; } } ``` +The contract evolves additively only: field numbers and enum values are never +renumbered or repurposed, so a stale gateway and worker that disagree on the +newest tags still decode the fields they share. `correlation_id` is a `string` +(not a numeric id) because it is the same correlation token the public gRPC API +carries end to end, so the worker never has to translate id formats. + Rules: - `sequence` is monotonic per sender. - `correlation_id` links commands to replies. -- Events use their own correlation id or zero. +- Events carry their own correlation id or an empty string. - Replies must preserve MXAccess HRESULT/status information even when the command is also represented as a protocol-level failure. - Protocol version mismatch fails session creation. 
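
A receiver-side check for the first and last rules might look like the sketch below. It is illustrative only: `EnvelopeHeader` and `EnvelopeGuard` are hypothetical types rather than generated contract classes, "monotonic" is assumed to mean strictly increasing, and the session-id comparison is an added assumption that the rules above do not state.

```csharp
using System;

// Hypothetical projection of the four header fields of WorkerEnvelope.
public readonly record struct EnvelopeHeader(
    uint ProtocolVersion,
    string SessionId,
    ulong Sequence,
    string CorrelationId);

// Hypothetical guard that one side keeps per remote sender.
public sealed class EnvelopeGuard
{
    private readonly uint _expectedVersion;
    private readonly string _sessionId;
    private ulong _lastSequence;
    private bool _seenAny;

    public EnvelopeGuard(uint expectedVersion, string sessionId)
    {
        _expectedVersion = expectedVersion;
        _sessionId = sessionId ?? throw new ArgumentNullException(nameof(sessionId));
    }

    // Returns null when the header is acceptable, otherwise a short reason.
    public string? Check(EnvelopeHeader header)
    {
        if (header.ProtocolVersion != _expectedVersion)
        {
            return "protocol version mismatch";
        }

        // Assumption: every envelope on a session's pipe carries that session's id.
        if (!string.Equals(header.SessionId, _sessionId, StringComparison.Ordinal))
        {
            return "session id mismatch";
        }

        // Assumption: monotonic means strictly increasing per sender.
        if (_seenAny && header.Sequence <= _lastSequence)
        {
            return "sequence is not monotonic";
        }

        _lastSequence = header.Sequence;
        _seenAny = true;
        return null;
    }
}
```

The correlation id is deliberately not validated here: an empty string is legal for events, so only the reply-matching layer can decide whether a given id is expected.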
@@ -659,8 +668,10 @@ External gateway: - authenticate v1 gRPC clients with `authorization: Bearer mxgw__` API-key metadata, - reject missing or invalid API keys with gRPC `Unauthenticated`, -- reject valid keys that lack the required session, invoke, event, metadata, or - admin scope with gRPC `PermissionDenied`, +- reject valid keys that lack the required scope with gRPC `PermissionDenied`. + Scopes are fine-grained: `session:open`, `session:close`, `invoke:read`, + `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, and `admin` + (see `GatewayScopes`), - authorize access to commands that can write, authenticate users, expose metadata, stream events, or alter runtime state. @@ -901,6 +912,7 @@ State machine: Creating -> StartingWorker -> WaitingForPipe + -> Handshaking -> InitializingWorker -> Ready -> Closing diff --git a/glauth.md b/glauth.md index 0b57f55..7ebd700 100644 --- a/glauth.md +++ b/glauth.md @@ -59,13 +59,17 @@ For mxaccessgw dev, `admin` covers every gw-side capability test; `readonly` is the right "negative" case for proving Browse-OK / Write-denied. -The gateway dashboard adds one role beyond this LmxOpcUa taxonomy: -`GwAdmin`. `LdapOptions.RequiredGroup` defaults to `GwAdmin`, so the -dashboard login and `DashboardLdapLiveTests` require `admin` to be a -member of a `GwAdmin` group. `GwAdmin` is **not** in the baseline -GLAuth config — it must be provisioned before dashboard authn or the -LDAP live tests work. See [Provisioning the GwAdmin -group](#provisioning-the-gwadmin-group) below. +The gateway dashboard adds one group beyond this LmxOpcUa taxonomy: +`GwAdmin`. There is no `RequiredGroup` option — dashboard authorization +is driven entirely by `MxGateway:Dashboard:GroupToRole`, which maps an +LDAP group to a dashboard role. A user whose groups produce no mapped +role is rejected at login. 
So for the dashboard to admit `admin`, a
+group named in `GroupToRole` (by convention `GwAdmin` → `Administrator`)
+must exist and `admin` must belong to it. `GwAdmin` is **not** in the
+baseline GLAuth config — it must be provisioned before dashboard authn
+or the `DashboardLdapLiveTests` (`MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`)
+work. See [Provisioning the GwAdmin group](#provisioning-the-gwadmin-group)
+below.
 
 > **Dashboard role value (Task 1.7):** the LDAP `GwAdmin` group now maps to
 > the canonical dashboard role **`Administrator`** (was `Admin`); `GwReader`
@@ -112,43 +116,58 @@ to avoid re-deriving the LDAP escape-string handling.
 
 ## Suggested mxgw configuration shape
 
-A YAML/JSON section for mxaccessgw that mirrors LmxOpcUa's `LdapOptions`
-record:
+The gateway binds the `MxGateway:Ldap` section onto `LdapOptions`. The
+field names are PascalCase config keys (shown here as YAML; JSON
+`appsettings` and env-var overrides use the same names). Note the keys
+that changed from the older LmxOpcUa shape: `Transport` (an enum,
+replacing the boolean `UseTls`), `AllowInsecure` (replacing
+`AllowInsecureLdap`), and `UserNameAttribute` which defaults to `cn`:
 
 ```yaml
-ldap:
-  enabled: true
-  server: localhost
-  port: 3893
-  useTls: false
-  allowInsecureLdap: true  # dev only
-  searchBase: "dc=zb,dc=local"
-  serviceAccountDn: "cn=serviceaccount,dc=zb,dc=local"
-  serviceAccountPassword: "serviceaccount123"
-  userNameAttribute: "uid"  # GLAuth populates this; AD uses sAMAccountName
-  displayNameAttribute: "cn"
-  groupAttribute: "memberOf"
-  groupToRole:
-    ReadOnly: "Browse"
-    WriteOperate: "Write"
-    WriteTune: "WriteSecured"
-    WriteConfigure: "WriteSecured"
-    AlarmAck: "AlarmAck"
+MxGateway:
+  Ldap:
+    Enabled: true
+    Server: localhost
+    Port: 3893
+    Transport: None  # None | StartTls | Ldaps (dev: None)
+    AllowInsecure: true  # dev only
+    SearchBase: "dc=zb,dc=local"
+    ServiceAccountDn: "cn=serviceaccount,dc=zb,dc=local"
+    ServiceAccountPassword: "serviceaccount123"
+    UserNameAttribute: "cn"  # GLAuth keys users by cn; AD uses sAMAccountName
+    DisplayNameAttribute: "cn"
+    GroupAttribute: "memberOf"
+  Dashboard:
+    GroupToRole:
+      GwAdmin: "Administrator"
+      GwReader: "Viewer"
 ```
 
-`groupAttribute` returns full DNs like
-`ou=ReadOnly,ou=groups,dc=zb,dc=local` — the authenticator
-should strip the leading `ou=` (or `cn=` against AD) RDN value and
-look that up in `groupToRole`.
+`Transport` is an `LdapTransport` enum (`None`, `StartTls`, `Ldaps`); it
+replaces the old boolean `UseTls` (`true` ≈ `Ldaps`, `false` = `None`).
+`UserNameAttribute` defaults to `cn` because GLAuth keys users by `cn`
+(`backend.nameformat = "cn"`); only AD needs `sAMAccountName`. The
+group-to-role mapping lives under `MxGateway:Dashboard:GroupToRole`, not
+in the LDAP section, and its values must be dashboard roles
+(`Administrator` or `Viewer`).
+
+The shared `ZB.MOM.WW.Auth.Ldap` provider performs the runtime bind and
+search; it returns each group already stripped to its short RDN value
+(e.g. `GwAdmin` from `ou=GwAdmin,ou=groups,dc=zb,dc=local`) before the
+gateway looks it up in `GroupToRole`. Keep `GroupToRole` keys as short
+group names — a full-DN key will never match the short name the provider
+returns.
 
 ## Provisioning the GwAdmin group
 
-`GwAdmin` is the gateway-specific dashboard-admin role. It is the
-default `LdapOptions.RequiredGroup`, so the dashboard cookie login and
-`DashboardLdapLiveTests` (`MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`) reject
-`admin` until a `GwAdmin` group exists and `admin` is a member.
-GLAuth's baseline config ships only the five LmxOpcUa role groups, so
-`GwAdmin` must be added to GLAuth rather than run from a separate LDAP
+`GwAdmin` is the gateway-specific dashboard-admin group, mapped to the
+`Administrator` role through `MxGateway:Dashboard:GroupToRole`. Because
+dashboard login rejects any user who resolves to no role, the dashboard
+cookie login and `DashboardLdapLiveTests`
+(`MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`) reject `admin` until a `GwAdmin`
+group exists, `admin` is a member, and `GroupToRole` maps `GwAdmin` to a
+role. GLAuth's baseline config ships only the five LmxOpcUa role groups,
+so `GwAdmin` must be added to GLAuth rather than run from a separate LDAP
 server:
 
 1. Edit `C:\publish\glauth\glauth.cfg`
@@ -178,10 +197,11 @@ server:
 4. `nssm restart GLAuth`
 
 After the restart, `admin`'s `memberOf` includes
-`ou=GwAdmin,ou=groups,dc=zb,dc=local`, which the authenticator
-strips to `GwAdmin` and matches against `RequiredGroup`. The same
-pattern applies to any future permission that doesn't fit the existing
-five roles.
+`ou=GwAdmin,ou=groups,dc=zb,dc=local`. The shared LDAP provider strips
+that to the short RDN `GwAdmin`, which the gateway looks up in
+`MxGateway:Dashboard:GroupToRole` to resolve the dashboard role. The same
+pattern applies to any future group that doesn't fit the existing five
+roles — add the group, add the member, and add a `GroupToRole` entry.
 
 Generate `passsha256` from a plaintext password:
 
@@ -254,24 +274,25 @@ Get-Content C:\publish\glauth\logs\stderr.log -Tail 20 -Wait
 
 ## Active Directory migration cheat-sheet
 
-LmxOpcUa's `LdapOptions` xml-doc captures the AD overrides; same set
-applies to mxaccessgw verbatim. Keys that change:
+These `MxGateway:Ldap` keys change when pointing the gateway at AD
+instead of dev GLAuth:
 
 | Field | GLAuth dev value | AD production value |
 |---|---|---|
 | `Server` | `localhost` | a domain controller FQDN, or the domain itself |
 | `Port` | `3893` | `636` (LDAPS) — AD increasingly rejects plain bind under LDAP-signing enforcement |
-| `UseTls` | `false` | `true` |
-| `AllowInsecureLdap` | `true` | `false` |
+| `Transport` | `None` | `Ldaps` (or `StartTls`) |
+| `AllowInsecure` | `true` | `false` |
 | `SearchBase` | `dc=zb,dc=local` | `DC=corp,DC=example,DC=com` |
 | `ServiceAccountDn` | `cn=serviceaccount,dc=zb,dc=local` | `CN=MxGwSvc,OU=Service Accounts,DC=corp,...` |
-| `UserNameAttribute` | `uid` | `sAMAccountName` (or `userPrincipalName`) |
+| `UserNameAttribute` | `cn` | `sAMAccountName` (or `userPrincipalName`) |
 | `GroupAttribute` | `memberOf` (unchanged) | `memberOf` (unchanged) |
 
-`memberOf` returns full DNs; the authenticator strips the leading
-`CN=` value and uses it as the lookup key in `groupToRole`. Nested
-groups are **not** auto-expanded; either flatten in the directory or
-add a `tokenGroups` query as an enhancement.
+`memberOf` returns full DNs; the shared LDAP provider strips each to its
+leading RDN value (`CN=`/`OU=`) and the gateway uses that as the lookup
+key in `MxGateway:Dashboard:GroupToRole`. Nested groups are **not**
+auto-expanded; either flatten in the directory or add a `tokenGroups`
+query as an enhancement.
 
 ## Security notes for production