Joseph Doherty 254e0e729f docs(m7): approved design — OPC UA / MxGateway UX (T13-T17)
Full-M7 scope: operator Alarm Summary (per-instance live snapshots),
MxGateway secured writes (Operator+Verifier roles + PendingSecuredWrite +
central relay), OPC UA BrowseNext paging + bounded recursive search,
type-info surfacing + attribute-override CSV import, Verify-endpoint button +
site-local cert trust (broadcast to both nodes). Builds on the merged
opcua-tag-browser + mxgw-supervisory-write foundations already in main.
2026-06-18 01:44:40 -04:00

# Design: M7 — OPC UA / MxGateway UX (T13–T17)
> Part of the `stillpending.md` completion roadmap (Phase 2 — Expand). Milestone **M7** of
> `docs/plans/2026-06-15-stillpending-completion-design.md`. Native task **#18**.
> Successor to M6 (KPI History, #26 KpiHistory, landed 241a792).
**Date:** 2026-06-18
**Branch:** `worktree-m7-opcua-mxgateway-ux` (off `origin/main` @ 241a792)
**Scope decision:** Full M7 — all five features T13–T17.
---
## Goal
Round out the operator- and design-time UX for native alarms and OPC UA / MxAccess Gateway
data connections:
- **T13** — a dedicated operator **Alarm Summary** page (cross-instance, read-only).
- **T14** — **MxGateway secured writes**: a two-person (operator initiates, verifier approves)
authorization workflow for writes through the MxAccess Gateway.
- **T15** — OPC UA **address-space search + `BrowseNext` paging** in the node picker.
- **T16** — OPC UA **type-info surfacing** in browse + **bulk instance-override CSV import**.
- **T17** — OPC UA **"Verify endpoint"** connectivity button + **site-local certificate trust**.
This is a large, multi-theme milestone. It is built on infrastructure that already exists in
`main`: the OPC UA node browser (`NodeBrowserDialog`/`BrowseService`, from the merged
`feat/opcua-tag-browser`) and the MxGateway write path (`MxGatewayDataConnection.WriteAsync`,
from the merged `fix/mxgw-supervisory-write`). M7 does **not** reinvent those — it layers on top.
## Non-goals / deferred (logged as follow-ups)
- **Central alarm store / history / journal** — T13 reads live snapshots; no central alarm
tables (remains `[PERM]` per `2026-05-29-native-alarms-design.md`). No aggregated live gRPC
stream for the summary page in v1 (snapshot + poll only).
- **Native-alarm ack / shelve / suppress write-back** — read-only by design, unchanged.
- **CSV bulk import of native-alarm-source overrides** (`InstanceNativeAlarmSourceOverride`) —
the T16 importer targets *attribute* overrides only; native-alarm-source CSV is a follow-up.
- **Central-persisted server-cert trust** — T17 trust is site-local (no central entity/migration).
---
## Locked architecture decisions (from brainstorm)
| # | Feature | Decision |
|---|---------|----------|
| D1 | **T13 data path** | **Per-instance live snapshots.** Central fans out the existing per-instance `DebugViewSnapshot` Ask to each deployed instance and aggregates client-side. Zero new site-side code; no central store. N round-trips per site (concurrency-capped). |
| D2 | **T14 auth model** | **New global `Operator` + `Verifier` roles** + a dedicated **Secured Writes** page. `PendingSecuredWrite` central entity; central relays approved write to the site MxGateway; both users audited; no self-approval; single-tag first; MxGateway-protocol connections only. |
| D3 | **T15 search** | **`BrowseNext` paging + site-side bounded recursive search** (depth + result caps) matching substring on DisplayName/path. |
| D4 | **T17 cert trust** | **Site-local trust-on-verify** — the operator's Trust decision writes the cert directly into the site's OPC UA trusted-peer PKI store; central does **not** persist it. |
| D5 | **T16 CSV target** | **Instance attribute overrides** (same data as `instance set-overrides` / the InstanceConfigure list editor). Pairs with type-info; reuses existing override validation + handlers (MV-10). |
| D6 | **T17 HA divergence** | **Broadcast `TrustServerCert` to both site nodes** so node-a/node-b PKI stores stay consistent (active-node-only trust would silently fail cert validation after failover). |
| D7 | **T17 role gating** | **Verify = `Design`** (read-only probe); **Trust / Reject / Remove = `Administrator`** (changes a security trust boundary — least privilege). |
---
## Shared infrastructure
All cross-cluster verbs hang off the **existing request/response pattern** that
`BrowseNodeCommand` already uses:
```
CentralUI service
→ CommunicationService.XxxAsync(siteId, cmd)
→ SiteEnvelope(siteId, cmd)
→ CentralCommunicationActor (ClusterClient Ask, QueryTimeout budget)
→ SiteCommunicationActor (unwrap, route to site singleton)
→ DeploymentManagerActor → DataConnectionManagerActor (index by connection name)
→ DataConnectionActor (holds the IDataConnection adapter)
→ adapter (RealOpcUaClient / MxGatewayDataConnection)
← typed response flows back via Sender / PipeTo
```
M7 adds these verbs on that path (all additive, mirroring `BrowseCommands.cs` +
`DataConnectionActor.HandleBrowse`):
- `SearchAddressSpaceCommand` / browse-continuation (T15)
- `VerifyEndpointCommand`, `TrustServerCertCommand`, `ListServerCertsCommand` (T17)
- `ExecuteSecuredWriteCommand` (T14 relay)
> **Sequencing chokepoint:** the site-side handlers for T14 (execute relay), T15 (search /
> browse-next) and T17 (verify / trust) all add cases to `DataConnectionActor` /
> `DataConnectionManagerActor`. Those edits are serialized within the OPC UA / DCL stream to
> avoid file collisions (see Delivery).
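
To illustrate why these verbs are additive, here is a minimal Python sketch of the last two hops (manager indexes by connection name, actor dispatches by command type). The real code is C# on Akka.NET; the class shapes, field names and string responses below are simplified assumptions for illustration, not the project's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class SearchAddressSpaceCommand:
    connection_name: str
    query: str


@dataclass(frozen=True)
class ExecuteSecuredWriteCommand:
    connection_name: str
    tag_path: str
    value: Any


@dataclass(frozen=True)
class SiteEnvelope:
    site_id: str
    command: Any


class DataConnectionActor:
    """Holds one adapter (here just a name) and dispatches each verb to a handler."""

    def __init__(self, adapter_name: str) -> None:
        # Adding a verb means adding one entry; existing entries are untouched.
        self._handlers: Dict[type, Callable[[Any], str]] = {
            SearchAddressSpaceCommand: lambda c: f"{adapter_name}: search '{c.query}'",
            ExecuteSecuredWriteCommand: lambda c: f"{adapter_name}: write {c.tag_path}",
        }

    def handle(self, command: Any) -> str:
        handler = self._handlers.get(type(command))
        if handler is None:
            return f"unsupported command {type(command).__name__}"
        return handler(command)


class DataConnectionManager:
    """Routes an unwrapped envelope to the actor that owns the named connection."""

    def __init__(self, actors: Dict[str, DataConnectionActor]) -> None:
        self._by_name = actors

    def route(self, envelope: SiteEnvelope) -> str:
        name = envelope.command.connection_name
        actor = self._by_name.get(name)
        if actor is None:
            return f"unknown connection '{name}'"
        return actor.handle(envelope.command)
```

Because every new verb touches the same dispatch table, concurrent edits collide in one file, which is the chokepoint the note above describes.
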
---
## Feature designs
### T13 — Operator Alarm Summary page *(size: M)*
- **Page:** new `/monitoring/alarms` (`AlarmSummary.razor`) in the Monitoring nav group,
`RequireDeployment` policy (operator observability, same tier as Event Logs / Parked Messages).
- **Data path (D1):** site selector → query the site's deployed instances (existing
deployment-state query) → fan out the existing per-instance `DebugViewSnapshot` Ask
concurrently (capped via `SemaphoreSlim`), aggregate `AlarmStates` client-side. Partial-results
tolerant: instances that time out are listed as "not reporting", the rest still render.
- **View:** roll-up tiles (total active / worst severity / unacked count / per-`AlarmKind`
counts) + a flat, sortable, filterable table. Filters: instance, `AlarmKind`
(Computed / NativeOpcUa / NativeMxAccess), state (Active/Normal), acked/unacked, severity
threshold, name search.
- **Read-only** — no ack/shelve controls.
- **Reuse:** extract DebugView's inline alarm badge/formatter markup into a shared
`AlarmStateBadges` component consumed by both DebugView and the summary page.
- **Refresh:** manual button + optional poll timer (mirrors the Health dashboard's 10 s interval). No
aggregated live stream in v1.
- **Files (indicative):** `Components/Pages/Monitoring/AlarmSummary.razor(.cs)`, a CentralUI
`IAlarmSummaryService`/impl (fan-out + aggregate), `Components/Shared/AlarmStateBadges.razor`,
`NavMenu.razor`, Playwright test.
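
The D1 fan-out (concurrency cap plus partial-results tolerance) can be sketched as follows. This is an illustrative Python/asyncio analogue of the `SemaphoreSlim` approach, not the C# implementation; `fetch_snapshot`, `AlarmSummary`, and the default cap and timeout values are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class AlarmSummary:
    alarms: list = field(default_factory=list)         # (instance, alarm_state) pairs
    not_reporting: list = field(default_factory=list)  # instances that timed out or failed


async def fan_out_snapshots(instances, fetch_snapshot, max_concurrency=8, timeout_s=5.0):
    """Ask every instance for its alarm states, at most max_concurrency at a time."""
    gate = asyncio.Semaphore(max_concurrency)
    summary = AlarmSummary()

    async def ask(instance):
        async with gate:  # caps the number of in-flight round-trips
            try:
                states = await asyncio.wait_for(fetch_snapshot(instance), timeout_s)
            except Exception:
                # A slow or failing instance must not block the rest of the page.
                summary.not_reporting.append(instance)
                return
        summary.alarms.extend((instance, state) for state in states)

    await asyncio.gather(*(ask(i) for i in instances))
    return summary
```

The timeout is applied per instance rather than to the whole batch, so one unresponsive instance costs at most one slot for `timeout_s` seconds.
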
### T14 — MxGateway secured writes (operator + verifier) *(size: L — highest risk)*
- **Roles:** add `Operator` + `Verifier` to `Roles.cs` / `Roles.All`; add `RequireOperator` /
`RequireVerifier` authorization policies; update the LDAP group-mapping seed migration
(idempotent) + the role-mapping UI list.
- **Entity:** `PendingSecuredWrite` (Commons POCO + EF config + migration + regenerated model
snapshot, central MS SQL):
`Id, SiteId, ConnectionName, TagPath, ValueJson, ValueType,
Status {Pending → Approved/Rejected → Executed/Failed/Expired},
OperatorUser, OperatorComment, SubmittedAtUtc,
VerifierUser, VerifierComment, DecidedAtUtc,
ExecutedAtUtc, ExecutionError`.
Repository interface in Commons + impl in ConfigurationDatabase; registered in the unit of work.
- **ManagementActor handlers / commands:**
  - `SubmitSecuredWriteCommand` (Operator) → insert `Pending` row.
  - `ApproveSecuredWriteCommand` (Verifier) → enforce `VerifierUser ≠ OperatorUser`
    server-side → mark `Approved` → relay `ExecuteSecuredWriteCommand` to the site → record
    `Executed`/`Failed` from the `MxWriteOutcome`.
  - `RejectSecuredWriteCommand` (Verifier) → `Rejected` + reason.
  - List/query pending + history (global + per-site).
- **Site relay:** `ExecuteSecuredWriteCommand(connectionName, tagPath, value, valueType)` →
`SiteEnvelope` → `DataConnectionManagerActor` → `DataConnectionActor` →
`MxGatewayDataConnection.WriteAsync`. Validate **MxGateway protocol** at submit *and* execute.
Mirrors the parked-call Retry/Discard relay.
- **Central UI 'Secured Writes' page** (new nav entry):
  - Operator: submit form (site → MxGateway connection → tag path → typed value → comment;
    tag pick may reuse the T15 node browser).
  - Verifier: pending queue table with Approve / Reject (+comment); own submissions disabled
    in the UI *and* rejected server-side.
  - History: terminal rows with full who/when/outcome.
- **Audit:** new `AuditKind.SecuredWrite` (+ channel); central direct-writes one row per lifecycle
event (Submit / Approve / Reject / Execute), sharing the `PendingSecuredWrite.Id` as
`CorrelationId`, capturing operator + verifier + outcome — the "who approved" trail. Uses the
existing central direct-write path (as Notification Outbox dispatch + Inbound API do).
- **Safety:** Approve shows a confirm dialog with the exact site / connection / tag / value; the
write fires *only* on explicit verifier approval.
- **Dev caveat:** with `DisableLogin` on (docker), `AutoLoginAuthenticationHandler` grants
`Roles.All` to one identity, so the two-person flow cannot be exercised end-to-end via the dev
UI with a single user. No-self-approval is covered by handler-level tests; real two-person use
needs two real identities.
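
The status lifecycle and the server-side no-self-approval rule can be sketched as a small state machine. This is an illustrative Python sketch, not the C# handler code. Two details are assumptions: that `Expired` is entered from `Pending`, and that user names are compared case-insensitively.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "Pending"
    APPROVED = "Approved"
    REJECTED = "Rejected"
    EXECUTED = "Executed"
    FAILED = "Failed"
    EXPIRED = "Expired"


# Assumed transition table; the source of the Expired transition is a guess.
ALLOWED = {
    Status.PENDING: {Status.APPROVED, Status.REJECTED, Status.EXPIRED},
    Status.APPROVED: {Status.EXECUTED, Status.FAILED},
}


@dataclass
class PendingSecuredWrite:
    operator_user: str
    status: Status = Status.PENDING
    verifier_user: Optional[str] = None


def transition(write: PendingSecuredWrite, new_status: Status) -> None:
    """Move to new_status, refusing anything outside the table (terminal rows stay put)."""
    if new_status not in ALLOWED.get(write.status, set()):
        raise ValueError(f"illegal transition {write.status.value} -> {new_status.value}")
    write.status = new_status


def approve(write: PendingSecuredWrite, verifier_user: str) -> None:
    """Approve on behalf of verifier_user; the identity check does not rely on UI gating."""
    if verifier_user.casefold() == write.operator_user.casefold():
        raise PermissionError("verifier must differ from operator")
    transition(write, Status.APPROVED)
    write.verifier_user = verifier_user
```

Keeping the identity check inside `approve` rather than in the page is what lets handler-level tests cover the rule even when the dev login grants every role to one user.
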
### T15 — Address-space search + `BrowseNext` paging *(size: M)*
- **`BrowseNext`:** today `RealOpcUaClient.BrowseChildrenAsync` (~line 763) discards the
continuation point and returns `Truncated = true` ("type the node id manually"). Thread an
opaque base64 continuation token through `BrowseNodeCommand` + `BrowseChildrenResult`; when
present, the site calls `Session.BrowseNext` on the connection's live session. Picker gains a
**"Load more"** affordance. Expired/invalid continuation points fall back to a fresh browse.
- **Search:** new `SearchAddressSpaceCommand(ConnectionName, Query, MaxDepth, MaxResults)` →
site does a **bounded recursive browse** (depth + result caps) matching case-insensitive
substring on DisplayName/path; returns matches with full node ids + paths; UI surfaces
"showing first N — refine" when a cap is hit. New `IAddressSpaceSearchable` capability seam on
the OPC UA adapter; `StubOpcUaClient` gets a canned browse/search impl, which the unit/bUnit
tests need (it currently throws `NotImplementedException`).
- **Central:** `BrowseService.SearchAsync` (Design role) + a search box + results list in
`NodeBrowserDialog` (click a result → select it).
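
The bounded recursive search can be sketched as a breadth-first walk with a depth cap and a result cap. This is an illustrative Python sketch under assumptions: `browse_children` is a hypothetical callback returning `(node_id, display_name)` pairs, paths are joined with `/`, and the truncation flag reports only the result cap, not the depth cap.

```python
from collections import deque


def search_address_space(browse_children, root_id, query, max_depth, max_results):
    """Return (matches, truncated); matches are (node_id, path) pairs.

    truncated is True when more matches existed than max_results allowed.
    """
    needle = query.casefold()
    matches = []
    visited = {root_id}              # address spaces can contain reference cycles
    queue = deque([(root_id, "", 0)])
    while queue:
        node_id, path, depth = queue.popleft()
        if depth >= max_depth:       # depth cap: do not expand this node
            continue
        for child_id, display_name in browse_children(node_id):
            if child_id in visited:
                continue
            visited.add(child_id)
            child_path = f"{path}/{display_name}"
            if needle in child_path.casefold():
                if len(matches) >= max_results:
                    return matches, True   # result cap hit: UI shows "refine"
                matches.append((child_id, child_path))
            queue.append((child_id, child_path, depth + 1))
    return matches, False
```

Breadth-first order means that when the result cap is hit, the matches returned are the shallowest ones, which tend to be the nodes an engineer is looking for first.
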
### T16 — Type-info surfacing + bulk override CSV import *(size: S + M)*
- **Type-info:** extend the `BrowseNode` record with optional `DataType` (friendly name),
`ValueRank` (scalar/array), `AccessLevel` (read/write). `RealOpcUaClient` batch-reads these
attributes for Variable nodes during browse; `NodeBrowserDialog` shows a Type column. Small
built-in-type nodeid → friendly-name lookup.
- **CSV bulk import (D5 — attribute overrides):** CSV columns `AttributeName, Value, ElementType?`
(`ElementType` only for `List` attributes). Per-row validation against the instance's flattened
attribute schema (name exists + type compatible — reuse the override validation +
`AttributeValueCodec`); collects per-row errors; all-or-nothing upsert with a result summary.
Writes reuse the existing ManagementActor add/update-override handlers (MV-10). Surfaces via:
  - `InputFile` upload on the InstanceConfigure page (built-in Blazor; no third-party lib).
  - CLI `instance import-overrides --instance-id <N> --file <path.csv>`.
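
The validate-everything-then-apply shape of the importer can be sketched as below. This is an illustrative Python sketch only: `schema` (attribute name to parser) stands in for the flattened attribute schema plus `AttributeValueCodec`, `apply_override` stands in for the existing override handlers, and the `ElementType` column is ignored for brevity.

```python
import csv
import io


def import_overrides(csv_text, schema, apply_override):
    """Validate every row first; apply only if no row failed (all-or-nothing)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    staged, errors = [], []
    for row_no, row in enumerate(reader, start=2):   # row 1 is the header
        name = (row.get("AttributeName") or "").strip()
        raw = row.get("Value") or ""
        parse = schema.get(name)
        if parse is None:
            errors.append((row_no, f"unknown attribute '{name}'"))
            continue
        try:
            staged.append((name, parse(raw)))
        except ValueError:
            errors.append((row_no, f"value '{raw}' is not valid for '{name}'"))
    if errors:
        # Collect all per-row errors in one pass so the user can fix the file at once.
        return {"applied": 0, "errors": errors}
    for name, value in staged:
        apply_override(name, value)
    return {"applied": len(staged), "errors": []}
```

Staging parsed values before any write is what makes the import all-or-nothing without needing a rollback path.
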
### T17 — Verify-endpoint + cert-management (site-local) *(size: M + M)*
- **Verify-endpoint:** `VerifyEndpointCommand(SiteId, protocol, configJson)` → site spins a
**temporary** `RealOpcUaClient` from the submitted config (works for unsaved edits *and*
existing connections), connects (discovery + session) with a short timeout (~5–8 s), then
disconnects; returns `VerifyEndpointResult(Success, FailureKind, Error, ServerCert?)`. The probe
forces `AutoAcceptUntrustedCerts = false` and hooks the certificate-validation event to
**capture** an untrusted server cert (Subject / Issuer / Thumbprint / NotBefore / NotAfter /
DER). Button lives in `OpcUaEndpointEditor` (Design role — D7).
- **Cert trust (D4 site-local + D6 both-nodes):** when verify reports untrusted, the UI shows the
cert detail + a **Trust** button (Administrator role — D7). Trust → `TrustServerCertCommand`
**broadcast to both site nodes** → each node writes the `.der` into its own OPC UA
trusted-peer PKI store; re-verify then succeeds. A small cert-management view lists the site's
trusted/rejected store contents (`ListServerCertsCommand` reading the PKI dirs) with
Trust / Remove (Administrator). No central persistence/migration.
- **Open detail for planning:** the broadcast-to-both-nodes mechanism (per-node actor vs. cluster
broadcast vs. re-apply-on-failover) is settled at plan time; the trust handler must run on each
node because PKI dirs are node-local.
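
The trust step (D4 + D6) amounts to writing the captured DER bytes into each node's own trusted store. The Python sketch below illustrates that under stated assumptions: the store is a plain directory, the file is named by thumbprint (a hypothetical convention, not necessarily the OPC UA stack's), and the "broadcast" is a simple loop over node directories rather than the actor mechanism still to be chosen. The thumbprint is the SHA-1 hash of the DER encoding, as is standard for X.509.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def thumbprint(der: bytes) -> str:
    """X.509 thumbprint: SHA-1 over the DER-encoded certificate, upper-case hex."""
    return hashlib.sha1(der).hexdigest().upper()


def trust_on_node(trusted_dir: Path, der: bytes) -> Path:
    """Write the certificate into one node's trusted-peer directory (idempotent)."""
    trusted_dir.mkdir(parents=True, exist_ok=True)
    target = trusted_dir / f"{thumbprint(der)}.der"   # hypothetical file naming
    target.write_bytes(der)
    return target


def broadcast_trust(node_dirs: Dict[str, Path], der: bytes) -> Dict[str, Optional[str]]:
    """Apply the trust decision on every node; report per-node errors (None = ok)."""
    results: Dict[str, Optional[str]] = {}
    for node, trusted_dir in node_dirs.items():
        try:
            trust_on_node(trusted_dir, der)
            results[node] = None
        except OSError as exc:
            # One node failing must be visible; silent divergence breaks failover.
            results[node] = str(exc)
    return results
```

Returning a per-node result lets the UI show which node did not accept the certificate instead of reporting a single success that only holds on the active node.
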
---
## Delivery approach (waves)
Three mostly-independent streams; full build + docker rebuild + Playwright only at integration.
1. **Wave A (parallel-safe foundations):**
   - **T13** (CentralUI alarm summary + snapshot fan-out + `AlarmStateBadges` extraction).
   - **T14a** (Security project: `Operator`/`Verifier` roles, policies, LDAP mapping seed + UI).

   These are disjoint from each other and from the OPC UA / DCL files.
2. **Wave B (OPC UA / DCL stream — serialized on `RealOpcUaClient`/`StubOpcUaClient`/`DataConnectionActor`/`BrowseService`):**
   - T16-typeinfo → T15 BrowseNext → T15 search → T17 verify → T17 trust (DCL edits in order),
     then the UI layers (`NodeBrowserDialog`, `OpcUaEndpointEditor`, `DataConnectionForm`).
3. **Wave C (T14b — depends on T14a):**
   - `PendingSecuredWrite` entity + migration → ManagementActor handlers + commands →
     site execute-relay (`DataConnectionActor`, merges after Wave B's DCL edits) →
     Secured Writes page → `AuditKind.SecuredWrite` wiring.
4. **Integration:** docs (Component-* updates, README/CLAUDE component count stays 26 — no new
   component; these are features on existing components), `2026-06-15-stillpending-completion-design.md`
   M7 status, full-solution build, `bash docker/deploy.sh`, Playwright, live smoke
   (alarm summary renders; secured-write submit→approve relay; browse search/load-more;
   verify-endpoint button).
**Execution conventions (per CLAUDE.md + standing constraints):** dedicated worktree (this one),
pathspec commits (`git commit -- <paths>`, never `git add -A`), ≤2–3 concurrent committers with
post-wave HEAD-presence checks, targeted builds/tests per task, full build + docker rebuild only
at integration. `TreatWarningsAsErrors=true` everywhere.
## Testing strategy
- **Unit:** T13 alarm aggregation/roll-up + partial-results; T14 no-self-approval + status
transitions + MxGateway-only validation; T15 bounded recursive search + BrowseNext continuation
+ stub impl; T16 CSV parse/validate (good + per-row-error corpus) + type-info mapping;
T17 verify-result mapping + cert-store write + both-node broadcast.
- **Integration (against the cluster):** secured-write end-to-end relay; browse / search /
verify round-trips; trust-then-reverify.
- **Playwright:** alarm summary page (filters, roll-up); secured-writes submit → approve → history;
node-browser search + load-more; verify-endpoint button (success + untrusted-cert path).
## Risks & open items
- **T14 is the heaviest and most security-sensitive** — writes to live process equipment. Mitigated
by the two-person gate, no-self-approval, confirm dialog, and full audit. High-risk
classification; serial spec→code review + final integration review.
- **OPC UA `BrowseNext` continuation points are session-bound** and can expire/be released — handle
invalid-CP by restarting the browse; never assume a CP survives indefinitely.
- **`StubOpcUaClient` currently throws on browse** — must gain a canned browse/search impl or the
new bUnit/unit tests can't run without a live server.
- **DisableLogin dev caveat** for T14 (single identity gets all roles) — documented above.
- **Cert trust HA** — broadcast-to-both-nodes (D6) chosen; the exact broadcast mechanism is a
plan-time detail.
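
The continuation-point risk above (session-bound `BrowseNext` tokens that can expire) can be handled with a fall-back-to-fresh-browse wrapper. The Python sketch below is illustrative only: `session.browse` and `session.browse_next` are hypothetical stand-ins for the adapter calls, each returning `(children, continuation_point_or_None)`, and `InvalidContinuationPoint` is a made-up exception type.

```python
import base64
from typing import Optional


class InvalidContinuationPoint(Exception):
    """Raised by the (hypothetical) session when a continuation point is unknown."""


def _encode(cp: Optional[bytes]) -> Optional[str]:
    # The token is opaque to central and the UI; only the site interprets it.
    return None if cp is None else base64.b64encode(cp).decode("ascii")


def browse_page(session, node_id, token: Optional[str] = None):
    """Return (children, next_token, restarted).

    restarted is True when a supplied token was rejected and a fresh browse ran instead.
    """
    if token is not None:
        try:
            cp = base64.b64decode(token, validate=True)
            children, next_cp = session.browse_next(cp)
            return children, _encode(next_cp), False
        except (ValueError, InvalidContinuationPoint):
            pass  # malformed or expired token: fall through to a fresh browse
    children, next_cp = session.browse(node_id)
    return children, _encode(next_cp), token is not None
```

Surfacing `restarted` lets the picker replace its list rather than append to it, so a restart does not show the first page twice.
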
## Follow-ups (not in M7)
- Native-alarm-source-override CSV bulk import (`InstanceNativeAlarmSourceOverride`).
- Aggregated **live** alarm stream for the summary page (vs. snapshot+poll).
- Central-persisted, auditable server-cert trust (supersede the site-local v1) if cross-site
governance is later wanted.
## Next step
Hand off to the **writing-plans** skill to produce the bite-sized, classification-tagged
implementation plan (`docs/plans/2026-06-18-m7-opcua-mxgateway-ux.md` + `.tasks.json`), then
execute subagent-driven in this session.