docs(m7): reflect OPC UA / MxGateway UX (T13-T17) across component docs + CLAUDE/stillpending/completion-design

2026-06-18 04:13:21 -04:00
parent 39afa2743e
commit 40928535fd
11 changed files with 158 additions and 19 deletions
@@ -84,8 +84,15 @@ Reshaped during the 2026-06-17 brainstorm (see `docs/plans/2026-06-17-m6-kpi-his
 - **T9 (Teams + other non-Email delivery adapters behind `INotificationDeliveryAdapter`) — DEFERRED to the next major version.** The seam exists; no code now. Transport choice (Incoming Webhook vs Microsoft Graph) and the Teams list-targeting model remain to be designed.
 - **T10 (`NotificationType` enum values + Central UI notification-list `Type` selector) — DEFERRED with T9.** A Type selector has no purpose until a second delivery type exists.

-#### M7 — OPC UA / MxGateway UX (T13–T17)
-Dedicated operator Alarm Summary page; MxGateway secured writes (operator+verifier); OPC UA address-space search + `BrowseNext` paging; type-info surfacing + bulk override CSV import; "Verify endpoint" connectivity button + cert-management UI.
+#### M7 — OPC UA / MxGateway UX (T13–T17) — **DELIVERED**
+Delivered per `docs/plans/2026-06-18-m7-opcua-mxgateway-ux-design.md` (full scope, all five features):
+- **T13** — operator Alarm Summary page (`/monitoring/alarms`, read-only, `RequireDeployment`); per-instance `DebugViewSnapshot` fan-out (no central alarm store); shared `AlarmStateBadges`.
+- **T14** — MxGateway secured writes: new global `Operator` + `Verifier` roles + `RequireOperator`/`RequireVerifier`; central `PendingSecuredWrite` table + migration; ManagementActor submit/approve/reject/list with **no-self-approval + CAS race guard**, MxGateway-protocol-only; approve relays a `WriteTagRequest` to the site; `SecuredWrite` audit channel/kinds (central direct-write, best-effort); Central UI `/operations/secured-writes`.
+- **T15** — OPC UA `BrowseNext` paging + bounded recursive address-space search (`IAddressSpaceSearchable`).
+- **T16** — browse type-info (DataType/ValueRank/Writable) + **attribute**-override CSV import (InstanceConfigure InputFile + CLI `instance import-overrides --file`). Native-alarm-source-override CSV import was **deferred** (attribute overrides only).
+- **T17** — Verify-endpoint probe (captures-but-never-trusts an untrusted server cert) + **site-local** cert trust (per-node `CertStoreActor`, DeploymentManager broadcast to **both** site nodes; D6) + Admin-gated cert-management UI.
+
+Small follow-ups logged (not blocking): stamp `SourceNode` on the `SecuredWrite` audit rows (currently NULL); an aggregated **live** alarm stream for the summary page (snapshot + poll today); central-persisted, auditable cert trust (site-local today).

 #### M8 — Transport (T18, T20)
 Site-scoped / instance-scoped artifact transport (name-mapping subsystem); per-line/Myers diff for Modified artifacts.
@@ -69,6 +69,13 @@ Script-initiated DB **reads** via `Database.Connection().ExecuteReader(...)`
 count as actions from a script and are in scope. Reads via DCL / subscriptions
 are framework traffic and excluded.

+**Extension beyond the script boundary — secured writes.** The `SecuredWrite`
+channel is a deliberate widening of the original script-trust-boundary scope: a
+two-person MxGateway write is **operator-initiated from the Central UI**, not
+script-caused, but it crosses the same equipment-write trust boundary and warrants
+the same append-only "who approved" trail. Its rows are emitted via the central
+direct-write path (see Central direct-write below), not the site hot-path.
+
 ## The `AuditLog` Table (central)

 Single wide table in central MS SQL, polymorphic by `Channel` + `Kind`
@@ -80,7 +87,7 @@ row per lifecycle event across all channels.
 | `EventId` | `uniqueidentifier` PK | Generated where the event originates (site or central). Idempotency key. |
 | `OccurredAtUtc` | `datetime2` | When the event happened (call returned, retry attempted, etc.). |
 | `IngestedAtUtc` | `datetime2` | When central persisted the row (lags `OccurredAtUtc` for site-originated rows). |
-| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`. |
+| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound` \| `SecuredWrite`. |
 | `Kind` | `varchar(32)` | Event kind discriminator (see kinds list below). |
 | `CorrelationId` | `uniqueidentifier` NULL | Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls. |
 | `ExecutionId` | `uniqueidentifier` NULL | The originating script execution / inbound request — the universal per-run correlation value; distinct from `CorrelationId`, which is the per-operation lifecycle id. Stamped on *every* audit row emitted by one execution. |
@@ -113,7 +120,7 @@ row per lifecycle event across all channels.
 - `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` — "what did we send to system X".
 - Monthly partitioning on `OccurredAtUtc` from day one; purge is a partition switch (see Retention & Purge).

-**`Kind` values (flat — 10 discriminators across all channels):**
+**`Kind` values (flat — 14 discriminators across all channels):**

 | Kind | Fires when |
 |---|---|
@@ -127,11 +134,28 @@ row per lifecycle event across all channels.
 | `InboundAuthFailure` | An inbound API request was rejected at the auth boundary (bad/missing key). One row, `Status=Failed`, `HttpStatus=401`. |
 | `CachedSubmit` | Script-side enqueue of a cached call (`ExternalSystem.CachedCall` / `Database.CachedWrite`); first row in the cached-call lifecycle, written to site SQLite before any forward attempt. |
 | `CachedResolve` | Terminal row for a cached operation — `Status` = `Delivered` / `Failed` / `Parked` / `Discarded`. |
+| `SecuredWriteSubmit` | An operator submits a two-person MxGateway secured write (`Status=Submitted`); first row in the secured-write lifecycle. |
+| `SecuredWriteApprove` | A verifier wins the approval CAS for a pending secured write. |
+| `SecuredWriteReject` | A verifier rejects a pending secured write (`Status=Discarded`). |
+| `SecuredWriteExecute` | The approved write was relayed to the site MxGateway — terminal outcome (`Delivered`-equivalent on success, `Failed` on error). |

 Inbound API is intentionally collapsed to a single `InboundRequest` (or
 `InboundAuthFailure` for auth rejections) row per request rather than a
 multi-event lifecycle.

+**Secured writes (`Channel = SecuredWrite`).** The four `SecuredWrite*` kinds
+emit one row per lifecycle event of a two-person MxGateway write
+(submit → approve → execute, or submit → reject). All rows of one operation share
+the `PendingSecuredWrite.Id` (encoded as a `Guid`) in `CorrelationId` so they join,
+and carry both `operatorUser` and `verifierUser` in `Extra` so a single row names
+both parties. Rows are written via the central direct-write path (like Notification
+Outbox dispatch and Inbound API), and emission is **best-effort** — an audit-write
+failure never aborts the secured write itself.
+**Known gap (follow-up):** these central direct-write rows currently leave
+`SourceNode` **NULL** rather than stamping the writing central node's role name
+(`central-a` / `central-b`) as the other central direct-write paths do — stamping
+`SourceNode` for secured-write audit rows is a logged follow-up.
+
 ### `ExecutionId` vs `CorrelationId`

 The table carries two correlation columns at different granularities:
@@ -252,10 +276,14 @@ rejections emit `ApiInbound.InboundAuthFailure` (`Status=Failed`, HTTP 401)
 instead. The Notification Outbox dispatcher writes
 `Notification.NotifyDeliver` with `Status=Attempted` per delivery attempt and
 `Notification.NotifyDeliver` with `Status=Delivered`/`Parked`/`Discarded` on
-terminal status. Central direct-writes use the same insert-if-not-exists
-semantics keyed on `EventId`. `SourceSiteId` is NULL on all central direct-write
-rows; `SourceNode` is stamped to the local central node's role name
-(`central-a` / `central-b`).
+terminal status. The ManagementActor writes the four `SecuredWrite.*` rows for the
+two-person MxGateway write workflow (submit / approve / reject / execute) the same
+way, as best-effort central direct-writes. All central direct-writes use the same
+insert-if-not-exists semantics keyed on `EventId`. `SourceSiteId` is NULL on central
+direct-write rows and `SourceNode` is stamped to the local central node's role name
+(`central-a` / `central-b`), with one exception: the `SecuredWrite.*` rows currently
+leave `SourceNode` NULL (stamping is a logged follow-up) and set `SourceSiteId` to
+the target site.

 ## Cached Operations — Combined Telemetry

@@ -85,6 +85,7 @@ scadabridge instance get --id <id>
 scadabridge instance create --name <name> --template-id <id> --site-id <id> [--area-id <id>]
 scadabridge instance set-bindings --id <id> --bindings <json>
 scadabridge instance set-overrides --id <id> --overrides <json>
+scadabridge instance import-overrides --id <id> --file <path.csv>
 scadabridge instance alarm-override set --instance-id <id> --alarm <name> [--trigger-config <json>] [--priority <n>]
 scadabridge instance alarm-override delete --instance-id <id> --alarm <name>
 scadabridge instance alarm-override list --instance-id <id>
@@ -81,7 +81,9 @@ Central cluster only. Sites have no user interface.
 - Create, edit, and delete site definitions, including Akka node addresses (NodeA/NodeB) and gRPC node addresses (GrpcNodeA/GrpcNodeB).
 - Define data connections and assign them to sites (name, protocol type, connection details).
 - **Data connection form**: "Primary Endpoint Configuration" (required JSON text area) and optional "Backup Endpoint Configuration" (collapsible section, hidden by default, revealed via "Add Backup Endpoint" button; "Remove Backup" button when editing an existing backup). "Failover Retry Count" numeric input (default 3, min 1, max 20) is visible only when a backup endpoint is configured.
+- **Verify endpoint** (OPC UA): the OPC UA endpoint editor (in the data connection form) carries a **"Verify endpoint"** button that asks the target site to probe the configured endpoint — a temporary, short-lived connect against the live (or edited-but-unsaved) config. The result reports success or a typed failure kind (e.g. unreachable, untrusted certificate, server error). When the failure is an **untrusted server certificate**, the probe captures the cert (Subject / Issuer / Thumbprint / validity / DER) and the editor shows a detail panel with a **"Trust certificate"** button. The probe itself **never trusts** the cert — trusting is an explicit, Admin-gated action (see Server certificate management). After a Trust, Verify re-runs automatically and should then succeed.
 - **Data connection list page**: Shows Primary Config and Backup Config columns. Active Endpoint column populated from health reports.
+- **Server certificate management** (`/design/connections/{id}/certificates`, Admin role): a per-connection page that lists the contents of the site's OPC UA trusted-peer and rejected certificate stores (Subject / Issuer / Thumbprint / validity / Trusted-or-Rejected status) with a **Remove** action. The page makes clear that the store is **node-wide for the site**, not per data connection: every OPC UA connection on a site node uses the same store, so trusting or removing a certificate affects all OPC UA connections at that site. Trust and Remove are central commands relayed to **both** site nodes so the node-local PKI stores stay consistent across failover (see Component-SiteRuntime.md, Component-DataConnectionLayer.md).
 - The site detail page exposes a new **"Audit feed"** tab that hosts the Audit Log page pre-filtered to `Site = <site>` — an in-context view of every operational audit event for that site.

 ### Inbound API Management (Admin Role for keys, Design Role for methods)
@@ -102,8 +104,12 @@ Central cluster only. Sites have no user interface.
 - Assign instances to areas.
 - Bind data connections — **per-attribute binding** where each attribute with a data source reference individually selects its data connection from the site's available connections. **Bulk assignment** supported: select multiple attributes and assign a data connection to all of them at once. Each row also exposes:
  - **Override** — optional per-attribute OPC UA node id (or other protocol address). When set, replaces the template's `DataSourceReference` at flattening time; when blank, the template default is used. The greyed placeholder shows the template default for context.
-  - **Browse…** — opens the OPC UA Tag Browser dialog, populated live from the site's OPC UA server via `BrowseOpcUaNodeCommand`. Visible only when the row's connection uses the OPC UA protocol; disabled until a connection is picked on that row. The dialog lazy-loads the address space, supports manual node-id entry as a fallback, and remains usable when the site or its OPC UA session is offline (the manual-paste field stays active even on error).
+  - **Browse…** — opens the OPC UA Tag Browser dialog, populated live from the site's OPC UA server via `BrowseOpcUaNodeCommand`. Visible only when the row's connection uses the OPC UA protocol; disabled until a connection is picked on that row. The dialog lazy-loads the address space, supports manual node-id entry as a fallback, and remains usable when the site or its OPC UA session is offline (the manual-paste field stays active even on error). The dialog adds:
+    - **Load more** — when a browse level is truncated, a "Load more" affordance fetches the next page using the server's continuation point (`BrowseNext`); an expired continuation point falls back to a fresh browse.
+    - **Search** — a search box runs a bounded recursive address-space search (depth + result caps) at the site, matching a case-insensitive substring against node DisplayName/path; clicking a result selects it. The dialog surfaces a "showing first N — refine" note when a result cap is hit.
+    - **Type column** — Variable rows display best-effort type info (data type friendly name, scalar/array value rank, writable flag) read from the server during browse.
 - Set instance-level attribute overrides (non-locked attributes only).
+  - **Bulk override CSV import** (`InstanceConfigure`): a Blazor `InputFile` upload accepts a CSV of `AttributeName, Value, ElementType?` rows (`ElementType` only for `List` attributes). Each row is validated against the instance's flattened attribute schema (name exists + value type-compatible, reusing the existing override validation); the import is **all-or-nothing** — any per-row error aborts the whole upload with a per-row error summary and nothing is applied. On success the rows are upserted through the same ManagementActor add/update-override handlers used by the inline editor. The same import is available from the CLI (`instance import-overrides --file`, see Component-CLI.md).
 - **Native Alarm Source Overrides card** (`InstanceConfigure`): a card placed **after the Alarm Overrides card**, listing the template's native alarm sources for per-instance binding. Each row offers **inline override** of the three fields that typically vary per physical instance:
  - **Connection** — a dropdown (same alarm-capable filtering as the template editor).
  - **Source Reference** — the concrete native key for this instance.
@@ -169,6 +175,14 @@ Per-leaf alarm rendering (leaf nodes are individual conditions for native alarms
 - **Row tooltip** — surfaces native metadata not warranting its own column: `AlarmTypeName`, category, operator user and comment, original raise time, current/limit value.
 - **Computed alarms render unchanged** from the prior flat-table style; the enrichment is purely additive for native rows.

+### Alarm Summary (Deployment Role)
+- A dedicated operator **Alarm Summary** page (`/monitoring/alarms`, `RequireDeployment`) gives a **cross-instance, read-only** roll-up of live alarm state at a site — the operator-facing complement to the per-instance Debug View.
+- **Data path** — no new site-side code and no central alarm store. The page selects a site, queries its deployed instances, then fans out the existing per-instance `DebugViewSnapshot` Ask **concurrently** (capped with a `SemaphoreSlim`) and aggregates the returned `AlarmStates` client-side. The fan-out is **partial-results tolerant**: instances that time out are listed as "not reporting" while the rest still render.
+- **View** — roll-up tiles (total active, worst severity, unacked count, per-`AlarmKind` counts) plus a flat, sortable, filterable table. Filters cover instance, `AlarmKind` (Computed / NativeOpcUa / NativeMxAccess), state, acked/unacked, severity threshold, and name search.
+- **Read-only** — there are no ack / shelve / suppress controls (native alarms remain read-only by design).
+- **Refresh** — manual refresh button plus an optional poll timer (mirroring the Health dashboard); there is no aggregated live alarm stream in this release (snapshot + poll only — logged as a follow-up).
+- **Reuse** — the alarm badge/formatter markup is factored out of Debug View into a shared `AlarmStateBadges` component consumed by both Debug View and this page.
+
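The fan-out described under **Data path** can be illustrated with a small sketch. This is not the Central UI code (which is not shown here); it is a Python illustration of a concurrency-capped, partial-results-tolerant fan-out. The names `fan_out_snapshots`, `ask_snapshot`, `max_concurrency`, and `timeout_s` are hypothetical.

```python
import asyncio

async def fan_out_snapshots(instance_ids, ask_snapshot, max_concurrency=8, timeout_s=5.0):
    """Ask every instance for its alarm states concurrently, tolerating timeouts.

    ask_snapshot is a hypothetical coroutine function: instance_id -> list of alarm states.
    Returns (all_alarm_states, instance_ids_not_reporting).
    """
    gate = asyncio.Semaphore(max_concurrency)  # caps in-flight asks

    async def ask_one(instance_id):
        async with gate:
            try:
                snapshot = await asyncio.wait_for(ask_snapshot(instance_id), timeout_s)
                return instance_id, snapshot
            except asyncio.TimeoutError:
                # A slow instance must not fail the whole page.
                return instance_id, None

    # gather preserves the order of instance_ids in its results.
    results = await asyncio.gather(*(ask_one(i) for i in instance_ids))
    alarms = [a for _, snap in results if snap is not None for a in snap]
    not_reporting = [i for i, snap in results if snap is None]
    return alarms, not_reporting
```

The timeout is applied inside the semaphore, so time spent waiting for a free slot does not count against an instance's reply budget.
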
 ### Parked Message Management (Deployment Role)
 - Query sites for parked messages (external system calls, cached DB writes). (Parked notifications are managed centrally on the Notification Outbox page, not here.)
 - View message details (target, payload, retry count, timestamps).
@@ -183,6 +197,14 @@ Per-leaf alarm rendering (leaf nodes are individual conditions for native alarms
 - **Stuck rows are visually badged** — a notification is stuck if it is `Pending` or `Retrying` and older than the configurable stuck-age threshold. Stuck detection is display-only; there is no automated escalation or alerting.
 - All queries are served from the central `Notifications` table — no remote per-site queries are needed, unlike the Parked Message Management page.

+### Secured Writes (Operator / Verifier Roles)
+- A **Secured Writes** page (`/operations/secured-writes`) drives the **two-person** authorization workflow for writes through the MxAccess Gateway: an **Operator** initiates the write, a separate **Verifier** approves it, and only an approved write reaches the site.
+- **Operator (submit)** — a submit form gated by `RequireOperator`: pick the site → an **MxGateway** connection on that site → the tag path → a typed value → an optional comment. Submission inserts a `Pending` `PendingSecuredWrite` row centrally; it does **not** write anything yet.
+- **Verifier (approve / reject)** — a pending queue gated by `RequireVerifier` with **Approve** / **Reject** (+comment) actions. Approve shows a confirmation of the exact site / connection / tag / value before firing. The verifier's **own submissions are disabled in the UI and rejected server-side** (no self-approval). On approve, central marks the row `Approved` and relays the write to the site MxGateway (records `Executed` / `Failed`); reject moves it to `Rejected` with a reason.
+- **History** — terminal rows (Executed / Failed / Rejected / Expired) with the full who/when/outcome trail (operator, verifier, comments, timestamps, any execution error).
+- Every lifecycle event (submit / approve / reject / execute) is written to the central Audit Log; the rows share the `PendingSecuredWrite.Id` as `CorrelationId` so they join into one operation (see Component-ManagementService.md, Component-AuditLog.md).
+- **Dev caveat**: with `DisableLogin` on, the auto-login identity holds all roles, so the two-person flow cannot be exercised end-to-end by a single user via the dev UI — no-self-approval is covered by handler tests; real two-person use requires two real identities.
+
 ### Site Calls (Deployment Role)
 - Monitor cached calls store-and-forwarded from sites — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` operations. Scoped to the `ExternalCall` and `DatabaseWrite` kinds only; notifications keep their separate Notification Outbox page and are not merged here.
 - A **queryable cached-call list** filterable by site, kind, status, and time range. Each row shows the call's timestamp, site, kind, target summary, status badge, retry count, and last error.
@@ -195,10 +195,32 @@ These are configured via `DataConnectionOptions` in `appsettings.json`, not per-
 DCL is a clean data pipe on the hot path. Browse is an **opt-in capability** for protocols that support it, exposed via `IBrowsableDataConnection`. Only consumed by management/UI (the tag picker on the instance configure page); Instance Actors never call it. The browse path is **protocol-agnostic**: the same command/service/dialog serve every browsable protocol.

 - `OpcUaDataConnection` and `MxGatewayDataConnection` both implement `IBrowsableDataConnection`; other/custom protocols do not (and return a `NotBrowsable` failure).
-- `DataConnectionManagerActor` handles `BrowseNodeCommand` (fields: `ConnectionName`, `ParentNodeId`) and replies with `BrowseNodeResult` (children + `Truncated` + structured `BrowseFailure?`). The Central UI facade is `IBrowseService`/`BrowseService`, backing the `NodeBrowserDialog` tag picker.
+- `DataConnectionManagerActor` handles `BrowseNodeCommand` (fields: `ConnectionName`, `ParentNodeId`, and an optional opaque `ContinuationToken` for paging) and replies with `BrowseNodeResult` (children + `Truncated` + an optional continuation token + structured `BrowseFailure?`). The Central UI facade is `IBrowseService`/`BrowseService`, backing the `NodeBrowserDialog` tag picker.
+- **Browse type-info** (OPC UA): each child `BrowseNode` carries optional best-effort type metadata — `DataType` (friendly name), `ValueRank` (scalar/array), and `Writable`. `RealOpcUaClient` batch-reads these attributes for **Variable** nodes during browse and maps built-in data-type node ids to friendly names; non-Variable rows leave them unset.
 - Node ids are opaque protocol-specific strings: OPC UA uses NodeIds; MxGateway uses Galaxy gobject ids for navigable objects and full tag references for selectable attribute leaves.
 - Browse runs against the live session; no caching at DCL.
-- **Frame-size guard**: the reply crosses the site→central Akka frame (default 128 KB) on a temp Ask actor; an oversized reply is silently discarded by remoting, hanging the picker. The child handler caps each `BrowseNodeResult` to a byte budget (~100 KB) before replying, OR-ing the adapter's own truncation signal into `Truncated`. This is protocol-agnostic (every adapter's reply funnels through it). Per-protocol upstream caps narrow the window first: OPC UA requests at most 500 references per node (continuation point → `Truncated`); MxGateway relies on the gateway's `BrowseChildren` page cap. A `Truncated` level prompts manual node-id entry in the picker rather than auto-paging.
+- **Frame-size guard**: the reply crosses the site→central Akka frame (default 128 KB) on a temp Ask actor; an oversized reply is silently discarded by remoting, hanging the picker. The child handler caps each `BrowseNodeResult` to a byte budget (~100 KB) before replying, OR-ing the adapter's own truncation signal into `Truncated`. This is protocol-agnostic (every adapter's reply funnels through it). Per-protocol upstream caps narrow the window first: OPC UA requests at most 500 references per node (continuation point → `Truncated`); MxGateway relies on the gateway's `BrowseChildren` page cap.
+- **`BrowseNext` paging** (OPC UA): a `Truncated` level no longer forces manual node-id entry. When the OPC UA browse is truncated, the adapter returns the session's continuation point as the reply's opaque `ContinuationToken`; a follow-up `BrowseNodeCommand` carrying that token calls `Session.BrowseNext` to fetch the next page (the picker exposes this as **"Load more"**). Continuation points are session-bound and can expire — `BadContinuationPointInvalid` is caught and the browse restarts from a fresh first page rather than failing. Manual node-id entry remains as a fallback when the site or its session is offline.
+
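The paging fallback described above (use the continuation token when present, restart from a fresh first page when it has expired) can be sketched as follows. This is a Python illustration against an in-memory stand-in, not the adapter's code: `FakeSession`, `ContinuationPointInvalid`, and `browse_page` are hypothetical names, and the exception stands in for the `BadContinuationPointInvalid` status.

```python
class ContinuationPointInvalid(Exception):
    """Stand-in for the OPC UA BadContinuationPointInvalid status."""

class FakeSession:
    """In-memory session that pages children and issues continuation tokens."""
    def __init__(self, children, page_size):
        self._children = children      # parent id -> list of child ids
        self._page_size = page_size
        self._points = {}              # token -> (parent id, next offset)
        self._next_id = 0

    def _page(self, parent, offset):
        items = self._children[parent]
        page = items[offset:offset + self._page_size]
        end = offset + len(page)
        if end >= len(items):
            return page, None          # last page: no continuation token
        self._next_id += 1
        token = f"cp-{self._next_id}"
        self._points[token] = (parent, end)
        return page, token

    def browse(self, parent):
        return self._page(parent, 0)

    def browse_next(self, token):
        if token not in self._points:
            raise ContinuationPointInvalid(token)
        parent, offset = self._points.pop(token)
        return self._page(parent, offset)

    def expire_all(self):
        self._points.clear()           # simulates session-bound tokens expiring

def browse_page(session, parent_node_id, continuation_token=None):
    """Return (children, next_token); fall back to a fresh browse on an expired token."""
    if continuation_token is not None:
        try:
            return session.browse_next(continuation_token)
        except ContinuationPointInvalid:
            pass  # expired: restart from the first page instead of failing
    return session.browse(parent_node_id)
```

Because the fallback returns the first page again, a caller appending pages to a list would need to replace its contents rather than append after a restart; how the picker handles that is not stated here.
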
+### Address-space search
+
+A second opt-in capability seam, `IAddressSpaceSearchable` (in Commons, mirroring the `IBrowsableDataConnection` / `IAlarmSubscribableConnection` pattern; implemented by the OPC UA adapter, consumed by management/UI only):
+
+```
+IAddressSpaceSearchable
+└── SearchAddressSpaceAsync(query, maxDepth, maxResults, ct) → matches
+```
+
+- `DataConnectionManagerActor` handles `SearchAddressSpaceCommand` (`ConnectionName`, `Query`, `MaxDepth`, `MaxResults`); the OPC UA adapter does a **bounded recursive browse** (depth + result caps) from the Objects folder, matching a **case-insensitive substring** against each node's DisplayName and root-relative path, and returns matches as `AddressSpaceMatch` (the `BrowseNode` plus its full path). When a cap is hit the result flags it so the UI can prompt "showing first N — refine".
+- The Central UI facade is `BrowseService.SearchAsync` (Design role), surfaced as the search box in `NodeBrowserDialog`.
+- `StubOpcUaClient` carries a canned browse/search implementation so the picker and its bUnit/unit tests run without a live OPC UA server.
+
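A bounded recursive search of this kind can be sketched as a breadth-first walk with a depth cap, a result cap, and a case-insensitive substring match. The Python below is an illustration only, under stated assumptions: `search_address_space` and `children_of` are hypothetical names, the walk order (breadth-first) is a choice made for the sketch, and only the result cap sets the "capped" flag.

```python
from collections import deque

def search_address_space(children_of, root_id, query, max_depth, max_results):
    """Depth- and result-capped substring search over a browsable tree.

    children_of(node_id) -> iterable of (child_id, display_name) pairs.
    Returns (matches, capped) where matches are (child_id, path) pairs.
    """
    needle = query.casefold()
    matches, seen = [], {root_id}
    queue = deque([(root_id, "", 0)])
    while queue:
        node_id, path, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth cap: do not browse below this level
        for child_id, display_name in children_of(node_id):
            if child_id in seen:
                continue  # references can form cycles; visit each node once
            seen.add(child_id)
            child_path = f"{path}/{display_name}"
            if needle in display_name.casefold() or needle in child_path.casefold():
                if len(matches) >= max_results:
                    return matches, True  # result cap hit: caller can prompt to refine
                matches.append((child_id, child_path))
            queue.append((child_id, child_path, depth + 1))
    return matches, False
```

Matching on the root-relative path means every descendant of a matching node also matches, which is one reason a result cap and a "refine" prompt are useful.
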
+## Endpoint verification (OPC UA)
+
+Before saving or deploying an OPC UA data connection, the Central UI can ask the target site to **probe** the configured endpoint — this exercises connectivity (and TLS trust) against the live config, including edited-but-unsaved values.
+
+- `DataConnectionManagerActor` handles `VerifyEndpointCommand` (`SiteId`, protocol, config JSON). The site spins up a **temporary** `RealOpcUaClient` from the submitted config, attempts discovery + a session with a short timeout (a few seconds), then disconnects and disposes the client. The reply is a `VerifyEndpointResult` (`Success`, a typed `FailureKind`, an error message, and an optional captured server cert).
+- The probe forces **`AutoAcceptUntrustedCerts = false`** and hooks the certificate-validation callback so it can **capture** an untrusted server certificate (Subject / Issuer / Thumbprint / NotBefore / NotAfter / DER) for the UI to display. Critically, the probe **never trusts** the cert — the validation callback always rejects (`Accept = false`); trusting is a separate, explicit, Admin-gated action that writes the cert into the site PKI store (see the cert-trust path in Component-SiteRuntime.md). The probe is read-only with respect to the trust boundary.

 ## Native Alarm Mirroring

@@ -140,6 +140,15 @@ Both endpoints honour any site-scope rules attached to the caller's audit role b
 - **DeployArtifacts**: Deploy system-wide artifacts (shared scripts, external system definitions, DB connections, data connections) to all sites or a specific site.
 - **GetDeploymentStatus**: Query deployment status.

+### Secured Writes (MxGateway, two-person)
+
+The two-person authorization workflow for writes through the MxAccess Gateway. Backed by the central `PendingSecuredWrite` table; all four lifecycle events emit a best-effort Audit Log row sharing the row id as `CorrelationId` (see Component-AuditLog.md).
+
+- **SubmitSecuredWriteCommand** (`Operator` role): validates the target connection exists and uses the **MxGateway** protocol (rejected otherwise), then inserts a `Pending` row capturing site / connection / tag / typed value / operator + comment. Nothing is written yet.
+- **ApproveSecuredWriteCommand** (`Verifier` role): enforces **no self-approval** (`VerifierUser ≠ OperatorUser`, checked **before** any state change) and a **compare-and-swap** race guard (`TryMarkApprovedAsync` — exactly one concurrent approver flips `Pending → Approved`; the loser is rejected). On winning, it validates the value type, decodes the value (guarded — a bad value fails the row deterministically rather than leaving it stuck `Approved`), and relays a `WriteTagRequest` to the site MxGateway via the Communication Layer, recording the outcome as `Executed` or `Failed`.
+- **RejectSecuredWriteCommand** (`Verifier` role): marks a `Pending` row `Rejected` with the verifier's reason.
+- **List / query secured writes**: pending queue + terminal history, global or per-site.
+
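The approve path combines two independent guards: a no-self-approval check made before any state change, and a compare-and-swap so that only one concurrent approver wins. A minimal sketch of that shape, using Python and SQLite purely for illustration (the table layout, `make_db`, `try_mark_approved`, `approve`, and the result strings are assumptions, not the actual `TryMarkApprovedAsync` implementation):

```python
import sqlite3

def make_db():
    """Create an in-memory table shaped loosely like PendingSecuredWrite (hypothetical columns)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE PendingSecuredWrite ("
        "Id INTEGER PRIMARY KEY, OperatorUser TEXT, VerifierUser TEXT, Status TEXT)"
    )
    return db

def try_mark_approved(db, write_id, verifier):
    """Compare-and-swap: flips Pending -> Approved for at most one caller."""
    cur = db.execute(
        "UPDATE PendingSecuredWrite SET Status = 'Approved', VerifierUser = ? "
        "WHERE Id = ? AND Status = 'Pending'",
        (verifier, write_id),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows means another decision got there first

def approve(db, write_id, verifier):
    row = db.execute(
        "SELECT OperatorUser FROM PendingSecuredWrite WHERE Id = ?", (write_id,)
    ).fetchone()
    if row is None:
        return "NotFound"
    if row[0] == verifier:
        return "SelfApprovalRejected"  # checked before any state change
    if not try_mark_approved(db, write_id, verifier):
        return "AlreadyDecided"        # lost the race or row is no longer Pending
    return "Approved"                  # the winner would go on to relay the write
```

Putting the status condition in the `UPDATE` itself, rather than reading the status and then writing, is what makes the guard safe against two verifiers approving at the same moment.
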
 ### External Systems

 - **ListExternalSystems** / **GetExternalSystem**: Query external system definitions.
@@ -192,6 +201,10 @@ Both endpoints honour any site-scope rules attached to the caller's audit role b
 - **QuerySiteEventLog**: Query site event log entries from a remote site (routed via communication layer). Supports date range, keyword search, and pagination.
 - **QueryParkedMessages**: Query parked (dead-letter) messages at a remote site (routed via communication layer). Supports pagination.
 - **DebugSnapshot**: Request a one-shot snapshot of attribute values and alarm states for a running instance. Resolves the instance's site from the config DB and routes via the communication layer. Uses 30s `QueryTimeout`.
+- **BrowseNodeCommand**: Browse an OPC UA / MxGateway connection's address space one level at a time; supports a `BrowseNext` continuation token (paging) and returns per-node type-info for OPC UA Variables. (`Design` role.)
+- **SearchAddressSpaceCommand**: Bounded recursive address-space search (depth + result caps, case-insensitive substring) against an OPC UA connection. (`Design` role.)
+- **VerifyEndpointCommand**: Ask a site to probe an OPC UA endpoint (temporary client, short timeout) and report success / typed failure / a captured-but-untrusted server cert. Read-only — never trusts. (`Design` role — runs inside the Admin-gated connection editor.)
+- **TrustServerCertCommand** / **RemoveServerCertCommand** / **ListServerCertsCommand**: Manage the site's OPC UA trusted-peer PKI store; Trust/Remove broadcast to **both** site nodes (see Component-SiteRuntime.md). (`Admin` role.)

 ## Authorization

@@ -200,6 +213,9 @@ Every incoming message carries the authenticated user's identity and roles. The
 - **Admin** role required for: site management, area management, API key management, role mapping management, scope rule management, system configuration.
 - **Design** role required for: template authoring (including template member management: attributes, alarms, native alarm sources, scripts, compositions), shared scripts, external system definitions, database connection definitions, notification lists, inbound API method definitions.
 - **Deployment** role required for: instance management (including instance alarm overrides and native alarm source overrides), deployments, debug view, debug snapshot, parked message queries, site event log queries. Site scoping is enforced for site-scoped Deployment users.
+- **Operator** role required for: submitting a secured write (`SubmitSecuredWriteCommand`).
+- **Verifier** role required for: approving / rejecting a secured write (`ApproveSecuredWriteCommand` / `RejectSecuredWriteCommand`). The no-self-approval rule (`Operator ≠ Verifier`) is enforced in the handler, independent of the role check.
+- **Admin** role additionally required for: server-certificate trust/remove/list (`TrustServerCertCommand` / `RemoveServerCertCommand` / `ListServerCertsCommand`).
 - **Read-only access** (any authenticated role): health summary, health site, site event log queries, parked message queries.

 Unauthorized commands receive an `Unauthorized` response message. Failed authorization attempts are not audit logged (consistent with existing behavior).
@@ -216,6 +232,8 @@ The ManagementActor receives the following services and repositories via DI (inj
 - `IExternalSystemRepository` — External system definitions.
 - `INotificationRepository` — Notification lists and SMTP config.
 - `ISecurityRepository` — API keys and LDAP role mappings.
+- `ISecuredWriteRepository` — Pending/decided secured writes (the two-person MxGateway write workflow).
+- `IAuditLogRepository` — Best-effort central direct-write of secured-write lifecycle audit rows.
 - `IInboundApiRepository` — Inbound API method definitions.
 - `ISharedScriptRepository` / `SharedScriptService` — Shared script definitions.
 - `IDatabaseConnectionRepository` — Database connection definitions.
@@ -123,6 +123,34 @@ Set in a local or docker-dev environment via the environment variable `ScadaBrid
  - Navigation to audit pages.
  - Does not grant export or any mutating action.

+### Operator
+- **Scope**: System-wide (always).
+- **Permissions**:
+  - Submit a **secured write** to an MxAccess Gateway connection from the Central UI Secured Writes page (`/operations/secured-writes`) — the *initiating* half of the two-person write workflow.
+- **Purpose**: One of the two distinct global roles backing MxGateway secured writes. An Operator only initiates a write; nothing executes until a separate **Verifier** approves it.
+
+### Verifier
+- **Scope**: System-wide (always).
+- **Permissions**:
+  - Approve or reject a pending secured write from the Secured Writes page — the *approving* half of the two-person write workflow.
+- **Purpose**: The approving counterpart to **Operator**. Separation of duties is enforced **server-side**: the ManagementActor rejects any approval where the approving user is the same user who submitted the write (no self-approval), so a write executes only when the submitter and the approver are distinct principals. (See Component-ManagementService.md and Component-CentralUI.md.)
+
+> **Two-person secured-write workflow.** `Operator` and `Verifier` are deliberately separate global roles so a single principal cannot both initiate and approve a write through the MxAccess Gateway. Both are coarse global roles like the others; any site scoping is layered on at the LDAP-mapping level. Note the dev `DisableLogin` caveat: with `DisableLogin` on, the auto-login principal holds **all** roles but is still a single identity, so its own submissions are rejected by the no-self-approval rule and the two-person flow cannot be exercised end-to-end. No-self-approval is covered by handler tests; real two-person use requires two real identities.
+
+## Authorization Policies
+
+Role checks are expressed as named ASP.NET Core authorization policies (in `AuthorizationPolicies`), each requiring the matching role claim:
+
+| Policy | Role required | Used by |
+|--------|---------------|---------|
+| `RequireAdmin` | Administrator | Site/data-connection management, API keys, LDAP mappings, bundle import, server-certificate trust/remove |
+| `RequireDesign` | Designer | Template authoring, shared scripts, external/DB definitions, notification lists, bundle export |
+| `RequireDeployment` | Deployer | Instance/deployment management, debug view, parked messages, Notification Outbox, Alarm Summary |
+| `RequireOperator` | Operator | Submit a secured write (Secured Writes page) |
+| `RequireVerifier` | Verifier | Approve/reject a pending secured write |
+
+`RequireOperator` and `RequireVerifier` were added with the two-person secured-write feature (M7); the rest predate it.
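
As a rough illustration of the table above, each policy can be read as a lookup from policy name to the single role claim it requires. The sketch below is language-neutral pseudologic written in Python, not the ASP.NET Core registration itself; `POLICY_ROLES` and `satisfies` are hypothetical names introduced only for this example.

```python
# Policy name -> required role, copied from the table above.
POLICY_ROLES = {
    "RequireAdmin": "Administrator",
    "RequireDesign": "Designer",
    "RequireDeployment": "Deployer",
    "RequireOperator": "Operator",
    "RequireVerifier": "Verifier",
}


def satisfies(policy, user_roles):
    """Return True when the user's role set contains the policy's role."""
    return POLICY_ROLES[policy] in user_roles


# A user holding several roles satisfies each matching policy independently.
roles = {"Deployer", "Operator"}
print(satisfies("RequireOperator", roles))
print(satisfies("RequireVerifier", roles))
```

Passing `RequireVerifier` is necessary but not sufficient to approve a write: the handler still applies the no-self-approval rule on top of the policy check.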
+
 ## Multi-Role Support

 - A user can hold **multiple roles simultaneously** by being a member of multiple LDAP groups.
@@ -138,6 +166,8 @@ Set in a local or docker-dev environment via the environment variable `ScadaBrid
  - `SCADA-Deploy-All` → Deployer role (all sites)
  - `SCADA-Deploy-SiteA` → Deployer role (Site A only)
  - `SCADA-Deploy-SiteB` → Deployer role (Site B only)
+  - `SCADA-Operators` → Operator role (initiates secured writes)
+  - `SCADA-Verifiers` → Verifier role (approves secured writes)
 - A user can be a member of multiple groups, granting multiple independent roles.
 - Group mappings are stored in the configuration database and managed via the Central UI (Administrator role).
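
The group-to-role mapping above can be pictured as a small lookup that accumulates roles, and the sites each role applies to, across all of a user's groups. This is only a sketch: the tuple layout, the `"*"` marker for "all sites", the site identifiers `SiteA` / `SiteB`, and `resolve_roles` are assumptions made for illustration, not the stored schema.

```python
ALL_SITES = "*"  # hypothetical marker for an unscoped (system-wide) grant

# (LDAP group, role, site scope or None for all sites), from the example list above.
MAPPINGS = [
    ("SCADA-Deploy-All", "Deployer", None),
    ("SCADA-Deploy-SiteA", "Deployer", "SiteA"),
    ("SCADA-Deploy-SiteB", "Deployer", "SiteB"),
    ("SCADA-Operators", "Operator", None),
    ("SCADA-Verifiers", "Verifier", None),
]


def resolve_roles(groups):
    """Map a user's LDAP groups to {role: set of permitted sites}."""
    roles = {}
    for group, role, site in MAPPINGS:
        if group not in groups:
            continue
        # Each matching group contributes its role independently.
        roles.setdefault(role, set()).add(ALL_SITES if site is None else site)
    return roles


print(resolve_roles({"SCADA-Deploy-SiteA", "SCADA-Operators"}))
```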

@@ -158,6 +188,7 @@ Set in a local or docker-dev environment via the environment variable `ScadaBrid
 - **Central UI**: All UI requests pass through authentication and authorization.
 - **Template Engine**: Designer role enforcement.
 - **Deployment Manager**: Deployer role enforcement with site scoping.
+- **Management Service / Central UI (secured writes)**: `RequireOperator` gates submission and `RequireVerifier` gates approval/rejection of MxGateway secured writes; the no-self-approval rule (Operator ≠ Verifier) is enforced server-side by the ManagementActor, not just in the UI.
 - **All central components**: Role checks are a cross-cutting concern applied at the API layer.
 - **Management Service**: The ManagementActor enforces role-based authorization on every incoming command using the authenticated user identity carried in the message envelope. The CLI authenticates users via the same LDAP bind mechanism and passes the user's identity (username, roles, permitted sites) in every request message. The ManagementActor applies the same role and site-scoping rules as the Central UI — no separate authentication path exists on the server side.
 - **Transport (#24)**: Provides the `RequireDesign` policy (export) and `RequireAdmin` policy (import) enforced at both the Razor page layer and inside the `ZB.MOM.WW.ScadaBridge.Transport` service entrypoints.
@@ -116,6 +116,14 @@ flowchart TD
 - Receives `DebugSnapshotRequest` from the Communication Layer and forwards to the Instance Actor by unique name (same lookup as `SubscribeDebugViewRequest`).
 - Returns an error response if no Instance Actor exists for the requested unique name (instance not deployed or not enabled).

+### Server Certificate Trust (Per-Node Broadcast)
+- The OPC UA trusted-peer certificate store is **node-local**: it lives in each node's PKI directory. Trusting a cert on only the active node would therefore go unnoticed until failover, when the other node would fail certificate validation for that server. To keep both nodes consistent, certificate trust is handled by a per-node actor and a broadcast.
+- **`CertStoreActor`** is a per-node, **non-singleton** actor — it runs on **every** site node (unlike the Deployment Manager singleton). It owns its node's OPC UA PKI trusted-peer / rejected stores and handles three commands:
+  - **`TrustServerCertCommand`**: write a base64/DER server certificate into the node's trusted-peer store (the cert is decoded and persisted as a `.der`; thumbprints are guarded against path traversal).
+  - **`RemoveServerCertCommand`** — delete a certificate (by thumbprint) from the node's stores.
+  - **`ListServerCertsCommand`** — enumerate the node's trusted-peer + rejected store contents (Subject / Issuer / Thumbprint / validity / status).
+- The Deployment Manager singleton receives the central cert-trust verbs (relayed via the Communication Layer) and **broadcasts** `TrustServerCertCommand` / `RemoveServerCertCommand` to the `CertStoreActor` on **both** site nodes (via `ActorSelection`), so node-a and node-b PKI stores stay in sync. `ListServerCertsCommand` is not broadcast; it reads from the local node. Trust is **site-local**: there is no central persistence of trusted certs (logged as a follow-up). The captured-but-untrusted server cert that seeds this flow comes from the DCL verify-endpoint probe (see Component-DataConnectionLayer.md), which never trusts on its own; the Central UI surfaces Trust / Remove (Admin-gated; see Component-CentralUI.md).
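
The per-node store and the broadcast can be sketched roughly as below. This is an illustration under stated assumptions, not the `CertStoreActor` implementation: the `CertStore` class, the `trusted/` directory name, naming files by thumbprint, and `broadcast_trust` are all hypothetical, and the sketch does not parse the certificate as X.509. The one detail taken as given is that an OPC UA certificate thumbprint is the SHA-1 digest of the DER encoding.

```python
import base64
import hashlib
import re
from pathlib import Path

# A thumbprint is 40 hex characters (SHA-1). Anything else is rejected before
# it can be used to build a file path, which is the path-traversal guard.
_THUMBPRINT = re.compile(r"[0-9A-F]{40}")


class CertStore:
    """Stand-in for one node's trusted-peer store (layout is an assumption)."""

    def __init__(self, pki_root):
        self.trusted = Path(pki_root) / "trusted"
        self.trusted.mkdir(parents=True, exist_ok=True)

    def _path_for(self, thumbprint):
        tp = thumbprint.upper()
        if not _THUMBPRINT.fullmatch(tp):
            raise ValueError("thumbprint must be 40 hex characters")
        return self.trusted / (tp + ".der")

    def trust(self, cert_base64):
        # Decode the base64 payload and persist the raw DER bytes.
        der = base64.b64decode(cert_base64, validate=True)
        thumbprint = hashlib.sha1(der).hexdigest().upper()
        self._path_for(thumbprint).write_bytes(der)
        return thumbprint

    def remove(self, thumbprint):
        path = self._path_for(thumbprint)
        if path.exists():
            path.unlink()
            return True
        return False

    def list_thumbprints(self):
        return sorted(p.stem for p in self.trusted.glob("*.der"))


def broadcast_trust(stores, cert_base64):
    """Apply the same trust command to every node's store."""
    return [store.trust(cert_base64) for store in stores]
```

Because every node applies the same command to its own directory, the stores converge without any shared persistence, which matches the site-local design described above.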
+
 ---

 ## Instance Actor