docs(m7): approved design — OPC UA / MxGateway UX (T13-T17)
Full-M7 scope: operator Alarm Summary (per-instance live snapshots), MxGateway secured writes (Operator+Verifier roles + PendingSecuredWrite + central relay), OPC UA BrowseNext paging + bounded recursive search, type-info surfacing + attribute-override CSV import, Verify-endpoint button + site-local cert trust (broadcast to both nodes). Builds on the merged opcua-tag-browser + mxgw-supervisory-write foundations already in main.
@@ -0,0 +1,263 @@

# Design: M7 — OPC UA / MxGateway UX (T13–T17)

> Part of the `stillpending.md` completion roadmap (Phase 2 — Expand). Milestone **M7** of
> `docs/plans/2026-06-15-stillpending-completion-design.md`. Native task **#18**.
> Successor to M6 (KPI History, #26 KpiHistory, landed 241a792).

**Date:** 2026-06-18
**Branch:** `worktree-m7-opcua-mxgateway-ux` (off `origin/main` @ 241a792)
**Scope decision:** Full M7 — all five features T13–T17.

---

## Goal

Round out the operator- and design-time UX for native alarms and OPC UA / MxAccess Gateway
data connections:

- **T13** — a dedicated operator **Alarm Summary** page (cross-instance, read-only).
- **T14** — **MxGateway secured writes**: a two-person (operator initiates, verifier approves)
  authorization workflow for writes through the MxAccess Gateway.
- **T15** — OPC UA **address-space search + `BrowseNext` paging** in the node picker.
- **T16** — OPC UA **type-info surfacing** in browse + **bulk instance-override CSV import**.
- **T17** — OPC UA **"Verify endpoint"** connectivity button + **site-local certificate trust**.

This is a large, multi-theme milestone. It is built on infrastructure that already exists in
`main`: the OPC UA node browser (`NodeBrowserDialog`/`BrowseService`, from the merged
`feat/opcua-tag-browser`) and the MxGateway write path (`MxGatewayDataConnection.WriteAsync`,
from the merged `fix/mxgw-supervisory-write`). M7 does **not** reinvent those — it layers on top.

## Non-goals / deferred (logged as follow-ups)

- **Central alarm store / history / journal** — T13 reads live snapshots; no central alarm
  tables (remains `[PERM]` per `2026-05-29-native-alarms-design.md`). No aggregated live gRPC
  stream for the summary page in v1 (snapshot + poll only).
- **Native-alarm ack / shelve / suppress write-back** — read-only by design, unchanged.
- **CSV bulk import of native-alarm-source overrides** (`InstanceNativeAlarmSourceOverride`) —
  the T16 importer targets *attribute* overrides only; native-alarm-source CSV is a follow-up.
- **Central-persisted server-cert trust** — T17 trust is site-local (no central entity/migration).

---

## Locked architecture decisions (from brainstorm)

| # | Feature | Decision |
|---|---------|----------|
| D1 | **T13 data path** | **Per-instance live snapshots.** Central fans out the existing per-instance `DebugViewSnapshot` Ask to each deployed instance and aggregates client-side. Zero new site-side code; no central store. N round-trips per site (concurrency-capped). |
| D2 | **T14 auth model** | **New global `Operator` + `Verifier` roles** + a dedicated **Secured Writes** page. `PendingSecuredWrite` central entity; central relays approved write to the site MxGateway; both users audited; no self-approval; single-tag first; MxGateway-protocol connections only. |
| D3 | **T15 search** | **`BrowseNext` paging + site-side bounded recursive search** (depth + result caps) matching substring on DisplayName/path. |
| D4 | **T17 cert trust** | **Site-local trust-on-verify** — the operator's Trust decision writes the cert directly into the site's OPC UA trusted-peer PKI store; central does **not** persist it. |
| D5 | **T16 CSV target** | **Instance attribute overrides** (same data as `instance set-overrides` / the InstanceConfigure list editor). Pairs with type-info; reuses existing override validation + handlers (MV-10). |
| D6 | **T17 HA divergence** | **Broadcast `TrustServerCert` to both site nodes** so node-a/node-b PKI stores stay consistent (active-node-only trust would silently fail cert validation after failover). |
| D7 | **T17 role gating** | **Verify = `Design`** (read-only probe); **Trust / Reject / Remove = `Administrator`** (changes a security trust boundary — least privilege). |

---

## Shared infrastructure

All cross-cluster verbs hang off the **existing request/response pattern** that
`BrowseNodeCommand` already uses:

```
CentralUI service
  → CommunicationService.XxxAsync(siteId, cmd)
  → SiteEnvelope(siteId, cmd)
  → CentralCommunicationActor (ClusterClient Ask, QueryTimeout budget)
  → SiteCommunicationActor (unwrap, route to site singleton)
  → DeploymentManagerActor → DataConnectionManagerActor (index by connection name)
  → DataConnectionActor (holds the IDataConnection adapter)
  → adapter (RealOpcUaClient / MxGatewayDataConnection)
  ← typed response flows back via Sender / PipeTo
```

M7 adds these verbs on that path (all additive, mirroring `BrowseCommands.cs` +
`DataConnectionActor.HandleBrowse`):

- `SearchAddressSpaceCommand` / browse-continuation (T15)
- `VerifyEndpointCommand`, `TrustServerCertCommand`, `ListServerCertsCommand` (T17)
- `ExecuteSecuredWriteCommand` (T14 relay)
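
As a rough illustration of the routing described above (not the project's actual actor code), the following Python sketch models the envelope hop: central wraps a command in a `SiteEnvelope`, the site unwraps it, and a manager looks up the target connection by name. Python is used only for illustration. `SiteRouter`, `DataConnectionManager`, and the dict-shaped replies are hypothetical stand-ins; only the `SiteEnvelope` and `SearchAddressSpaceCommand` names come from this design.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class SearchAddressSpaceCommand:
    connection_name: str
    query: str
    max_depth: int
    max_results: int


@dataclass(frozen=True)
class SiteEnvelope:
    site_id: str
    command: Any  # any command that exposes `connection_name`


class DataConnectionManager:
    """Hypothetical stand-in: indexes per-connection handlers by connection name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Any], dict]] = {}

    def register(self, connection_name: str, handler: Callable[[Any], dict]) -> None:
        self._handlers[connection_name] = handler

    def handle(self, command: Any) -> dict:
        handler = self._handlers.get(command.connection_name)
        if handler is None:
            return {"success": False, "error": f"unknown connection '{command.connection_name}'"}
        return handler(command)


class SiteRouter:
    """Hypothetical stand-in for the central-to-site hop: unwraps the envelope."""

    def __init__(self) -> None:
        self._sites: Dict[str, DataConnectionManager] = {}

    def add_site(self, site_id: str, manager: DataConnectionManager) -> None:
        self._sites[site_id] = manager

    def ask(self, envelope: SiteEnvelope) -> dict:
        manager = self._sites.get(envelope.site_id)
        if manager is None:
            return {"success": False, "error": f"unknown site '{envelope.site_id}'"}
        return manager.handle(envelope.command)
```

Both failure paths return a typed reply instead of raising, which loosely mirrors the "typed response flows back" step of the diagram.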

> **Sequencing chokepoint:** the site-side handlers for T14 (execute relay), T15 (search /
> browse-next) and T17 (verify / trust) all add cases to `DataConnectionActor` /
> `DataConnectionManagerActor`. Those edits are serialized within the OPC UA / DCL stream to
> avoid file collisions (see Delivery).

---

## Feature designs

### T13 — Operator Alarm Summary page *(size: M)*

- **Page:** new `/monitoring/alarms` (`AlarmSummary.razor`) in the Monitoring nav group,
  `RequireDeployment` policy (operator observability, same tier as Event Logs / Parked Messages).
- **Data path (D1):** site selector → query the site's deployed instances (existing
  deployment-state query) → fan out the existing per-instance `DebugViewSnapshot` Ask
  concurrently (capped via `SemaphoreSlim`), aggregate `AlarmStates` client-side. Partial-results
  tolerant: instances that time out are listed as "not reporting", the rest still render.
- **View:** roll-up tiles (total active / worst severity / unacked count / per-`AlarmKind`
  counts) + a flat, sortable, filterable table. Filters: instance, `AlarmKind`
  (Computed / NativeOpcUa / NativeMxAccess), state (Active/Normal), acked/unacked, severity
  threshold, name search.
- **Read-only** — no ack/shelve controls.
- **Reuse:** extract DebugView's inline alarm badge/formatter markup into a shared
  `AlarmStateBadges` component consumed by both DebugView and the summary page.
- **Refresh:** manual button + optional poll timer (mirrors Health dashboard 10 s). No
  aggregated live stream in v1.
- **Files (indicative):** `Components/Pages/Monitoring/AlarmSummary.razor(.cs)`, a CentralUI
  `IAlarmSummaryService`/impl (fan-out + aggregate), `Components/Shared/AlarmStateBadges.razor`,
  `NavMenu.razor`, Playwright test.
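
The capped, partial-results-tolerant fan-out in the data path above can be sketched as follows. This is a minimal Python `asyncio` analogue, not the planned service: `collect_alarm_snapshots`, `fetch_snapshot`, and the default cap and timeout values are hypothetical, and `asyncio.Semaphore` plays the role that `SemaphoreSlim` plays in the design.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple


async def collect_alarm_snapshots(
    instance_ids: List[str],
    fetch_snapshot: Callable[[str], Awaitable[List[dict]]],
    max_concurrency: int = 8,   # assumed cap, not specified by the design
    timeout_s: float = 5.0,     # assumed per-instance timeout
) -> Tuple[List[dict], List[str]]:
    """Fan out one snapshot request per instance and aggregate the alarm states.

    Returns (all alarm states, ids of instances that did not report).
    """
    gate = asyncio.Semaphore(max_concurrency)

    async def ask_one(instance_id: str) -> Tuple[str, Optional[List[dict]]]:
        async with gate:  # at most `max_concurrency` requests in flight
            try:
                alarms = await asyncio.wait_for(fetch_snapshot(instance_id), timeout_s)
                return instance_id, alarms
            except (asyncio.TimeoutError, ConnectionError):
                return instance_id, None  # tolerated: reported as "not reporting"

    # gather() preserves input order, so the output order is deterministic.
    results = await asyncio.gather(*(ask_one(i) for i in instance_ids))
    alarms = [a for _, snap in results if snap is not None for a in snap]
    not_reporting = [i for i, snap in results if snap is None]
    return alarms, not_reporting
```

A slow or unreachable instance only costs its own timeout; the other instances still contribute rows to the table.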

### T14 — MxGateway secured writes (operator + verifier) *(size: L — highest risk)*

- **Roles:** add `Operator` + `Verifier` to `Roles.cs` / `Roles.All`; add `RequireOperator` /
  `RequireVerifier` authorization policies; update the LDAP group-mapping seed migration
  (idempotent) + the role-mapping UI list.
- **Entity:** `PendingSecuredWrite` (Commons POCO + EF config + migration + regenerated model
  snapshot, central MS SQL):
  `Id, SiteId, ConnectionName, TagPath, ValueJson, ValueType,
  Status {Pending → Approved/Rejected → Executed/Failed/Expired},
  OperatorUser, OperatorComment, SubmittedAtUtc,
  VerifierUser, VerifierComment, DecidedAtUtc,
  ExecutedAtUtc, ExecutionError`.
  Repository interface in Commons + impl in ConfigurationDatabase; registered in the unit of work.
- **ManagementActor handlers / commands:**
  - `SubmitSecuredWriteCommand` (Operator) → insert `Pending` row.
  - `ApproveSecuredWriteCommand` (Verifier) → enforce `VerifierUser ≠ OperatorUser`
    server-side → mark `Approved` → relay `ExecuteSecuredWriteCommand` to the site → record
    `Executed`/`Failed` from the `MxWriteOutcome`.
  - `RejectSecuredWriteCommand` (Verifier) → `Rejected` + reason.
  - List/query pending + history (global + per-site).
- **Site relay:** `ExecuteSecuredWriteCommand(connectionName, tagPath, value, valueType)` →
  `SiteEnvelope` → `DataConnectionManagerActor` → `DataConnectionActor` →
  `MxGatewayDataConnection.WriteAsync`. Validate **MxGateway protocol** at submit *and* execute.
  Mirrors the parked-call Retry/Discard relay.
- **Central UI 'Secured Writes' page** (new nav entry):
  - Operator: submit form (site → MxGateway connection → tag path → typed value → comment;
    tag pick may reuse the T15 node browser).
  - Verifier: pending queue table with Approve / Reject (+comment); own submissions disabled
    in the UI *and* rejected server-side.
  - History: terminal rows with full who/when/outcome.
- **Audit:** new `AuditKind.SecuredWrite` (+ channel). Central direct-writes one row per lifecycle
  event (Submit / Approve / Reject / Execute); every row carries the `PendingSecuredWrite.Id` as
  its `CorrelationId` and captures operator + verifier + outcome, giving the "who approved" trail.
  Uses the existing central direct-write path (as Notification Outbox dispatch + Inbound API do).
- **Safety:** Approve shows a confirm dialog with the exact site / connection / tag / value; the
  write fires *only* on explicit verifier approval.
- **Dev caveat:** with `DisableLogin` on (docker), `AutoLoginAuthenticationHandler` grants
  `Roles.All` to one identity, so the two-person flow cannot be exercised end-to-end via the dev
  UI with a single user. No-self-approval is covered by handler-level tests; real two-person use
  needs two real identities.
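
The status lifecycle and the no-self-approval rule above can be illustrated with a small state machine. This Python sketch is hypothetical and simplified (no persistence, no relay). Two points are assumptions rather than decisions from this design: `Expired` is modeled as reachable from `Pending` only, and user names are compared case-insensitively.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed reading of: Pending → Approved/Rejected → Executed/Failed/Expired.
ALLOWED = {
    "Pending": {"Approved", "Rejected", "Expired"},
    "Approved": {"Executed", "Failed"},
}


class SecuredWriteError(Exception):
    """Raised for an illegal transition or a self-approval attempt."""


@dataclass
class PendingSecuredWrite:
    id: int
    operator_user: str
    tag_path: str
    value_json: str
    status: str = "Pending"
    verifier_user: Optional[str] = None
    # (event, user) pairs; in the design each becomes an audit row keyed by `id`.
    audit: List[Tuple[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.audit.append(("Submit", self.operator_user))

    def _move(self, new_status: str, event: str, user: str) -> None:
        if new_status not in ALLOWED.get(self.status, set()):
            raise SecuredWriteError(f"{self.status} -> {new_status} is not allowed")
        self.status = new_status
        self.audit.append((event, user))

    def approve(self, verifier_user: str) -> None:
        # Server-side check; the UI disabling the button is not relied upon.
        if verifier_user.casefold() == self.operator_user.casefold():
            raise SecuredWriteError("self-approval is not allowed")
        self._move("Approved", "Approve", verifier_user)
        self.verifier_user = verifier_user

    def reject(self, verifier_user: str) -> None:
        self._move("Rejected", "Reject", verifier_user)
        self.verifier_user = verifier_user

    def record_execution(self, succeeded: bool) -> None:
        # Only legal after approval; a Pending row cannot jump to Executed.
        self._move("Executed" if succeeded else "Failed", "Execute", self.verifier_user or "")
```

Because every transition goes through `_move`, the audit trail and the status can never disagree about which lifecycle events happened.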

### T15 — Address-space search + `BrowseNext` paging *(size: M)*

- **`BrowseNext`:** today `RealOpcUaClient.BrowseChildrenAsync` (~line 763) discards the
  continuation point and returns `Truncated = true` ("type the node id manually"). Thread an
  opaque base64 continuation token through `BrowseNodeCommand` + `BrowseChildrenResult`; when
  present, the site calls `Session.BrowseNext` on the connection's live session. Picker gains a
  **"Load more"** affordance. Expired/invalid continuation points fall back to a fresh browse.
- **Search:** new `SearchAddressSpaceCommand(ConnectionName, Query, MaxDepth, MaxResults)` →
  site does a **bounded recursive browse** (depth + result caps) matching case-insensitive
  substring on DisplayName/path; returns matches with full node ids + paths; UI surfaces
  "showing first N — refine" when a cap is hit. New `IAddressSpaceSearchable` capability seam on
  the OPC UA adapter. `StubOpcUaClient` gets a canned browse/search impl, which the unit/bUnit
  tests need (it currently throws `NotImplementedException`).
- **Central:** `BrowseService.SearchAsync` (Design role) + a search box + results list in
  `NodeBrowserDialog` (click a result → select it).
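
The bounded recursive search above can be sketched as a breadth-first walk with a depth cap and a result cap. This is an illustrative Python sketch under stated assumptions: `search_address_space` and the `browse_children` callback are hypothetical names, the path is built by joining display names with `/`, and the returned flag tracks only the result cap.

```python
from collections import deque
from typing import Callable, List, Tuple


def search_address_space(
    root_id: str,
    browse_children: Callable[[str], List[Tuple[str, str]]],  # node id -> [(child id, display name)]
    query: str,
    max_depth: int,
    max_results: int,
) -> Tuple[List[Tuple[str, str]], bool]:
    """Return ([(node id, path)], truncated) for nodes whose path contains `query`."""
    needle = query.casefold()
    matches: List[Tuple[str, str]] = []
    seen = {root_id}  # address spaces can contain cycles; visit each node once
    queue = deque([(root_id, "", 0)])

    while queue:
        node_id, path, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth cap: do not expand this node's children
        for child_id, name in browse_children(node_id):
            if child_id in seen:
                continue
            seen.add(child_id)
            child_path = f"{path}/{name}"
            # The path ends with the display name, so one test covers both.
            if needle in child_path.casefold():
                if len(matches) >= max_results:
                    return matches, True  # result cap hit: caller shows "refine"
                matches.append((child_id, child_path))
            queue.append((child_id, child_path, depth + 1))

    return matches, False
```

Breadth-first order means shallow matches are returned first when the result cap is hit, which tends to suit a picker; a depth-first walk would be an equally valid choice.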

### T16 — Type-info surfacing + bulk override CSV import *(size: S + M)*

- **Type-info:** extend the `BrowseNode` record with optional `DataType` (friendly name),
  `ValueRank` (scalar/array), `AccessLevel` (read/write). `RealOpcUaClient` batch-reads these
  attributes for Variable nodes during browse; `NodeBrowserDialog` shows a Type column. Small
  built-in-type nodeid → friendly-name lookup.
- **CSV bulk import (D5 — attribute overrides):** CSV columns `AttributeName, Value, ElementType?`
  (`ElementType` only for `List` attributes). Per-row validation against the instance's flattened
  attribute schema (name exists + type compatible — reuse the override validation +
  `AttributeValueCodec`); collects per-row errors; all-or-nothing upsert with a result summary.
  Writes reuse the existing ManagementActor add/update-override handlers (MV-10). Surfaces via:
  - `InputFile` upload on the InstanceConfigure page (built-in Blazor; no third-party lib).
  - CLI `instance import-overrides --instance-id <N> --file <path.csv>`.
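
The validate-everything-then-commit shape of the CSV import can be sketched as below. This Python sketch is an assumption-laden illustration: the type names (`Int`, `Double`, `Bool`, `String`, `List`), the `plan_override_import` function, and the flat `schema` dict are hypothetical, and it does not model `AttributeValueCodec` or parse `List` element values.

```python
import csv
import io
from typing import Dict, List, Optional, Tuple


def _parse_bool(text: str) -> bool:
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"'{text}' is not a boolean")
    return lowered == "true"


# Hypothetical scalar type names mapped to a parser that raises ValueError on bad input.
PARSERS = {"Int": int, "Double": float, "Bool": _parse_bool, "String": str}


def plan_override_import(
    csv_text: str, schema: Dict[str, str]
) -> Tuple[Dict[str, Tuple[str, Optional[str]]], List[Tuple[int, str]]]:
    """Validate every row; return (upserts, errors). Upserts are empty if any row failed."""
    upserts: Dict[str, Tuple[str, Optional[str]]] = {}
    errors: List[Tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(csv_text))

    for row_no, row in enumerate(reader, start=2):  # row 1 is the header
        name = (row.get("AttributeName") or "").strip()
        value = row.get("Value") or ""
        element_type = (row.get("ElementType") or "").strip()
        attr_type = schema.get(name)

        if attr_type is None:
            errors.append((row_no, f"unknown attribute '{name}'"))
        elif attr_type == "List":
            if element_type not in PARSERS:
                errors.append((row_no, f"{name}: List needs a valid ElementType"))
            else:
                upserts[name] = (value, element_type)
        elif element_type:
            errors.append((row_no, f"{name}: ElementType is only valid for List"))
        else:
            try:
                PARSERS[attr_type](value)  # type-compatibility check only
                upserts[name] = (value, None)
            except ValueError as exc:
                errors.append((row_no, f"{name}: {exc}"))

    # All-or-nothing: a single bad row means nothing is applied.
    return ({} if errors else upserts), errors
```

Collecting every error before deciding lets the result summary list all bad rows in one pass instead of stopping at the first one.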

### T17 — Verify-endpoint + cert-management (site-local) *(size: M + M)*

- **Verify-endpoint:** `VerifyEndpointCommand(SiteId, protocol, configJson)` → site spins a
  **temporary** `RealOpcUaClient` from the submitted config (works for unsaved edits *and*
  existing connections), connects (discovery + session) with a short timeout (~5–8 s), then
  disconnects; returns `VerifyEndpointResult(Success, FailureKind, Error, ServerCert?)`. The probe
  forces `AutoAcceptUntrustedCerts = false` and hooks the certificate-validation event to
  **capture** an untrusted server cert (Subject / Issuer / Thumbprint / NotBefore / NotAfter /
  DER). Button lives in `OpcUaEndpointEditor` (Design role — D7).
- **Cert trust (D4 site-local + D6 both-nodes):** when verify reports untrusted, the UI shows the
  cert detail + a **Trust** button (Administrator role — D7). Trust → `TrustServerCertCommand`
  **broadcast to both site nodes** → each node writes the `.der` into its own OPC UA
  trusted-peer PKI store; re-verify then succeeds. A small cert-management view lists the site's
  trusted/rejected store contents (`ListServerCertsCommand` reading the PKI dirs) with
  Trust / Remove (Administrator). No central persistence/migration.
- **Open detail for planning:** the broadcast-to-both-nodes mechanism (per-node actor vs. cluster
  broadcast vs. re-apply-on-failover) will be settled at plan time. Whatever is chosen, the trust
  handler must run on each node because PKI dirs are node-local.
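
The both-nodes trust write can be illustrated with a short sketch. This Python code is hypothetical: it stands in for whatever broadcast mechanism is chosen by simply iterating over per-node store directories, and the `<thumbprint>.der` file name is an assumption, not the naming used by the OPC UA stack. The thumbprint itself is computed the conventional way, as the SHA-1 hash of the DER bytes.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def thumbprint(der: bytes) -> str:
    """Conventional X.509 thumbprint: upper-case hex SHA-1 of the DER encoding."""
    return hashlib.sha1(der).hexdigest().upper()


def trust_on_all_nodes(der: bytes, node_store_dirs: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Write the cert into each node's trusted-peer directory.

    Returns {node name: None on success, or an error message}.
    """
    file_name = f"{thumbprint(der)}.der"  # assumed naming scheme
    outcome: Dict[str, Optional[str]] = {}
    for node, store_dir in node_store_dirs.items():
        try:
            target = Path(store_dir)
            target.mkdir(parents=True, exist_ok=True)
            (target / file_name).write_bytes(der)
            outcome[node] = None
        except OSError as exc:
            # Keep going: the caller needs to know exactly which node diverged.
            outcome[node] = str(exc)
    return outcome


def stores_consistent(outcome: Dict[str, Optional[str]]) -> bool:
    """True only if every node accepted the cert (the D6 goal)."""
    return all(error is None for error in outcome.values())
```

Returning a per-node outcome rather than a single boolean lets the UI report a partial failure, which is the case that would otherwise surface only after failover.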

---

## Delivery approach (waves)

Three mostly-independent streams; full build + docker rebuild + Playwright only at integration.

1. **Wave A (parallel-safe foundations):**
   - **T13** (CentralUI alarm summary + snapshot fan-out + `AlarmStateBadges` extraction).
   - **T14a** (Security project: `Operator`/`Verifier` roles, policies, LDAP mapping seed + UI).

   These are disjoint from each other and from the OPC UA / DCL files.
2. **Wave B (OPC UA / DCL stream — serialized on `RealOpcUaClient`/`StubOpcUaClient`/`DataConnectionActor`/`BrowseService`):**
   - T16-typeinfo → T15 BrowseNext → T15 search → T17 verify → T17 trust (DCL edits in order),
     then the UI layers (`NodeBrowserDialog`, `OpcUaEndpointEditor`, `DataConnectionForm`).
3. **Wave C (T14b — depends on T14a):**
   - `PendingSecuredWrite` entity + migration → ManagementActor handlers + commands →
     site execute-relay (`DataConnectionActor`, merges after Wave B's DCL edits) →
     Secured Writes page → `AuditKind.SecuredWrite` wiring.
4. **Integration:** docs (Component-* updates, README/CLAUDE component count stays 26 — no new
   component; these are features on existing components), `2026-06-15-stillpending-completion-design.md`
   M7 status, full-solution build, `bash docker/deploy.sh`, Playwright, live smoke
   (alarm summary renders; secured-write submit→approve relay; browse search/load-more;
   verify-endpoint button).

**Execution conventions (per CLAUDE.md + standing constraints):** dedicated worktree (this one),
pathspec commits (`git commit -- <paths>`, never `git add -A`), ≤2–3 concurrent committers with
post-wave HEAD-presence checks, targeted builds/tests per task, full build + docker rebuild only
at integration. `TreatWarningsAsErrors=true` everywhere.

## Testing strategy

- **Unit:** T13 alarm aggregation/roll-up + partial-results; T14 no-self-approval + status
  transitions + MxGateway-only validation; T15 bounded recursive search + BrowseNext
  continuation + stub impl; T16 CSV parse/validate (good + per-row-error corpus) + type-info
  mapping; T17 verify-result mapping + cert-store write + both-node broadcast.
- **Integration (against the cluster):** secured-write end-to-end relay; browse / search /
  verify round-trips; trust-then-reverify.
- **Playwright:** alarm summary page (filters, roll-up); secured-writes submit → approve → history;
  node-browser search + load-more; verify-endpoint button (success + untrusted-cert path).

## Risks & open items

- **T14 is the heaviest and most security-sensitive** — writes to live process equipment. Mitigated
  by the two-person gate, no-self-approval, confirm dialog, and full audit. High-risk
  classification; serial spec→code review + final integration review.
- **OPC UA `BrowseNext` continuation points are session-bound** and can expire/be released — handle
  invalid-CP by restarting the browse; never assume a CP survives indefinitely.
- **`StubOpcUaClient` currently throws on browse** — must gain a canned browse/search impl or the
  new bUnit/unit tests can't run without a live server.
- **DisableLogin dev caveat** for T14 (single identity gets all roles) — documented above.
- **Cert trust HA** — broadcast-to-both-nodes (D6) chosen; the exact broadcast mechanism is a
  plan-time detail.
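
The continuation-point risk listed above (restart the browse when a CP is no longer valid) can be sketched as follows. This Python sketch is illustrative only: `browse_page`, `InvalidContinuationPoint`, and the `session` object with `browse` / `browse_next` methods are hypothetical stand-ins, not the OPC UA client API. The base64 step mirrors the opaque token that T15 threads through the browse command and result.

```python
import base64
from typing import Any, List, Optional, Tuple


class InvalidContinuationPoint(Exception):
    """Hypothetical error a session raises for an expired or released CP."""


def encode_token(cp: bytes) -> str:
    return base64.b64encode(cp).decode("ascii")


def decode_token(token: str) -> bytes:
    return base64.b64decode(token)


def browse_page(
    session: Any, node_id: str, token: Optional[str] = None
) -> Tuple[List[str], Optional[str], bool]:
    """Return (children, next token or None, restarted).

    `session.browse(node_id)` and `session.browse_next(cp)` are assumed to
    return (children, continuation point bytes), with empty bytes when done.
    """
    if token is not None:
        try:
            children, cp = session.browse_next(decode_token(token))
            return children, (encode_token(cp) if cp else None), False
        except (InvalidContinuationPoint, ValueError):
            # Expired, released, or malformed token: fall through to a fresh browse.
            pass
    children, cp = session.browse(node_id)
    restarted = token is not None  # tells the UI to replace, not append, rows
    return children, (encode_token(cp) if cp else None), restarted
```

The `restarted` flag matters for a "Load more" picker: after a fallback the first page is returned again, so appending it to the already-shown rows would duplicate entries.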

## Follow-ups (not in M7)

- Native-alarm-source-override CSV bulk import (`InstanceNativeAlarmSourceOverride`).
- Aggregated **live** alarm stream for the summary page (vs. snapshot+poll).
- Central-persisted, auditable server-cert trust (supersede the site-local v1) if cross-site
  governance is later wanted.

## Next step

Hand off to the **writing-plans** skill to produce the bite-sized, classification-tagged
implementation plan (`docs/plans/2026-06-18-m7-opcua-mxgateway-ux.md` + `.tasks.json`), then
execute that plan subagent-driven in this session.