ScadaBridge/docs/plans/2026-06-18-m7-opcua-mxgateway-ux-design.md
Joseph Doherty 254e0e729f docs(m7): approved design — OPC UA / MxGateway UX (T13-T17)
Full-M7 scope: operator Alarm Summary (per-instance live snapshots),
MxGateway secured writes (Operator+Verifier roles + PendingSecuredWrite +
central relay), OPC UA BrowseNext paging + bounded recursive search,
type-info surfacing + attribute-override CSV import, Verify-endpoint button +
site-local cert trust (broadcast to both nodes). Builds on the merged
opcua-tag-browser + mxgw-supervisory-write foundations already in main.
2026-06-18 01:44:40 -04:00


Design: M7 — OPC UA / MxGateway UX (T13–T17)

Part of the stillpending.md completion roadmap (Phase 2 — Expand). Milestone M7 of docs/plans/2026-06-15-stillpending-completion-design.md. Native task #18. Successor to M6 (KPI History, #26 KpiHistory, landed 241a792).

Date: 2026-06-18
Branch: worktree-m7-opcua-mxgateway-ux (off origin/main @ 241a792)
Scope decision: Full M7 — all five features T13–T17.


Goal

Round out the operator- and design-time UX for native alarms and OPC UA / MxAccess Gateway data connections:

  • T13 — a dedicated operator Alarm Summary page (cross-instance, read-only).
  • T14 — MxGateway secured writes: a two-person (operator initiates, verifier approves) authorization workflow for writes through the MxAccess Gateway.
  • T15 — OPC UA address-space search + BrowseNext paging in the node picker.
  • T16 — OPC UA type-info surfacing in browse + bulk instance-override CSV import.
  • T17 — OPC UA "Verify endpoint" connectivity button + site-local certificate trust.

This is a large, multi-theme milestone. It is built on infrastructure that already exists in main: the OPC UA node browser (NodeBrowserDialog/BrowseService, from the merged feat/opcua-tag-browser) and the MxGateway write path (MxGatewayDataConnection.WriteAsync, from the merged fix/mxgw-supervisory-write). M7 does not reinvent those — it layers on top.

Non-goals / deferred (logged as follow-ups)

  • Central alarm store / history / journal — T13 reads live snapshots; no central alarm tables (remains [PERM] per 2026-05-29-native-alarms-design.md). No aggregated live gRPC stream for the summary page in v1 (snapshot + poll only).
  • Native-alarm ack / shelve / suppress write-back — read-only by design, unchanged.
  • CSV bulk import of native-alarm-source overrides (InstanceNativeAlarmSourceOverride) — the T16 importer targets attribute overrides only; native-alarm-source CSV is a follow-up.
  • Central-persisted server-cert trust — T17 trust is site-local (no central entity/migration).

Locked architecture decisions (from brainstorm)

| # | Feature | Decision |
|---|---|---|
| D1 | T13 data path | Per-instance live snapshots. Central fans out the existing per-instance DebugViewSnapshot Ask to each deployed instance and aggregates client-side. Zero new site-side code; no central store. N round-trips per site (concurrency-capped). |
| D2 | T14 auth model | New global Operator + Verifier roles + a dedicated Secured Writes page. PendingSecuredWrite central entity; central relays approved write to the site MxGateway; both users audited; no self-approval; single-tag first; MxGateway-protocol connections only. |
| D3 | T15 search | BrowseNext paging + site-side bounded recursive search (depth + result caps) matching substring on DisplayName/path. |
| D4 | T17 cert trust | Site-local trust-on-verify — the operator's Trust decision writes the cert directly into the site's OPC UA trusted-peer PKI store; central does not persist it. |
| D5 | T16 CSV target | Instance attribute overrides (same data as instance set-overrides / the InstanceConfigure list editor). Pairs with type-info; reuses existing override validation + handlers (MV-10). |
| D6 | T17 HA divergence | Broadcast TrustServerCert to both site nodes so node-a/node-b PKI stores stay consistent (active-node-only trust would silently fail cert validation after failover). |
| D7 | T17 role gating | Verify = Design (read-only probe); Trust / Reject / Remove = Administrator (changes a security trust boundary — least privilege). |

Shared infrastructure

All cross-cluster verbs hang off the existing request/response pattern that BrowseNodeCommand already uses:

CentralUI service
  → CommunicationService.XxxAsync(siteId, cmd)
    → SiteEnvelope(siteId, cmd)
      → CentralCommunicationActor  (ClusterClient Ask, QueryTimeout budget)
        → SiteCommunicationActor   (unwrap, route to site singleton)
          → DeploymentManagerActor → DataConnectionManagerActor (index by connection name)
            → DataConnectionActor  (holds the IDataConnection adapter)
              → adapter (RealOpcUaClient / MxGatewayDataConnection)
  ← typed response flows back via Sender / PipeTo

M7 adds these verbs on that path (all additive, mirroring BrowseCommands.cs + DataConnectionActor.HandleBrowse):

  • SearchAddressSpaceCommand / browse-continuation (T15)
  • VerifyEndpointCommand, TrustServerCertCommand, ListServerCertsCommand (T17)
  • ExecuteSecuredWriteCommand (T14 relay)

Sequencing chokepoint: the site-side handlers for T14 (execute relay), T15 (search / browse-next) and T17 (verify / trust) all add cases to DataConnectionActor / DataConnectionManagerActor. Those edits are serialized within the OPC UA / DCL stream to avoid file collisions (see Delivery).
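
As a rough illustration of why the new verbs are purely additive, the request path above can be reduced to an envelope plus a lookup by connection name. This is a language-neutral sketch written in Python rather than the project's own code; every class and field name below is a hypothetical stand-in (only SiteEnvelope and SearchAddressSpaceCommand echo names from this design), and the actor hops are collapsed into plain method calls.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class SiteEnvelope:
    """Wraps a command with the site it is addressed to."""
    site_id: str
    command: Any


@dataclass(frozen=True)
class SearchAddressSpaceCommand:
    connection_name: str
    query: str


@dataclass(frozen=True)
class ErrorResponse:
    message: str


class ConnectionManager:
    """Hypothetical stand-in for the manager that indexes connections by name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Any], Any]] = {}

    def register(self, connection_name: str, handler: Callable[[Any], Any]) -> None:
        self._handlers[connection_name] = handler

    def route(self, command: Any) -> Any:
        # Any command that carries a connection_name routes the same way,
        # so a new verb needs no change to the routing itself.
        handler = self._handlers.get(command.connection_name)
        if handler is None:
            return ErrorResponse(f"unknown connection '{command.connection_name}'")
        return handler(command)


class Site:
    """Hypothetical stand-in for the site-side unwrap step."""

    def __init__(self, site_id: str, manager: ConnectionManager) -> None:
        self.site_id = site_id
        self.manager = manager

    def receive(self, envelope: SiteEnvelope) -> Any:
        if envelope.site_id != self.site_id:
            return ErrorResponse(f"envelope addressed to '{envelope.site_id}'")
        return self.manager.route(envelope.command)
```

In this simplified picture a new verb is one more command type plus one more case in the per-connection handler, which is also why the handler edits collide in the same files and need serializing.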


Feature designs

T13 — Operator Alarm Summary page (size: M)

  • Page: new /monitoring/alarms (AlarmSummary.razor) in the Monitoring nav group, RequireDeployment policy (operator observability, same tier as Event Logs / Parked Messages).
  • Data path (D1): site selector → query the site's deployed instances (existing deployment-state query) → fan out the existing per-instance DebugViewSnapshot Ask concurrently (capped via SemaphoreSlim), aggregate AlarmStates client-side. Partial-results tolerant: instances that time out are listed as "not reporting", the rest still render.
  • View: roll-up tiles (total active / worst severity / unacked count / per-AlarmKind counts) + a flat, sortable, filterable table. Filters: instance, AlarmKind (Computed / NativeOpcUa / NativeMxAccess), state (Active/Normal), acked/unacked, severity threshold, name search.
  • Read-only — no ack/shelve controls.
  • Reuse: extract DebugView's inline alarm badge/formatter markup into a shared AlarmStateBadges component consumed by both DebugView and the summary page.
  • Refresh: manual button + optional poll timer (mirrors Health dashboard 10 s). No aggregated live stream in v1.
  • Files (indicative): Components/Pages/Monitoring/AlarmSummary.razor(.cs), a CentralUI IAlarmSummaryService/impl (fan-out + aggregate), Components/Shared/AlarmStateBadges.razor, NavMenu.razor, Playwright test.
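
The capped fan-out and roll-up described for T13 can be sketched as follows. This is an illustration in Python rather than the planned CentralUI service: asyncio.Semaphore stands in for SemaphoreSlim, and fetch_snapshot, AlarmState and its fields are hypothetical placeholders for the per-instance DebugViewSnapshot Ask and its alarm payload. Counting unacked alarms among active ones only is an assumption, not something this design specifies.

```python
import asyncio
from collections import Counter
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class AlarmState:
    instance: str
    name: str
    kind: str        # "Computed" | "NativeOpcUa" | "NativeMxAccess"
    active: bool
    acked: bool
    severity: int


SnapshotFn = Callable[[str], Awaitable[List[AlarmState]]]


async def collect_alarms(
    instances: List[str],
    fetch_snapshot: SnapshotFn,
    max_concurrency: int = 4,
    timeout_s: float = 2.0,
) -> Tuple[List[AlarmState], List[str]]:
    """Query every instance with a concurrency cap; tolerate per-instance timeouts."""
    gate = asyncio.Semaphore(max_concurrency)

    async def one(instance: str):
        async with gate:
            try:
                return instance, await asyncio.wait_for(fetch_snapshot(instance), timeout_s)
            except asyncio.TimeoutError:
                return instance, None  # listed as "not reporting"

    results = await asyncio.gather(*(one(i) for i in instances))
    alarms = [a for _, snap in results if snap is not None for a in snap]
    not_reporting = [i for i, snap in results if snap is None]
    return alarms, not_reporting


def roll_up(alarms: List[AlarmState]) -> Dict[str, object]:
    """Compute the roll-up tile values from the aggregated alarm states."""
    active = [a for a in alarms if a.active]
    return {
        "total_active": len(active),
        "worst_severity": max((a.severity for a in active), default=None),
        "unacked": sum(1 for a in active if not a.acked),  # assumption: active only
        "by_kind": dict(Counter(a.kind for a in active)),
    }
```

Because the timeout is applied inside the semaphore, a slow instance only holds one slot for at most timeout_s, and the remaining instances still render.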

T14 — MxGateway secured writes (operator + verifier) (size: L — highest risk)

  • Roles: add Operator + Verifier to Roles.cs / Roles.All; add RequireOperator / RequireVerifier authorization policies; update the LDAP group-mapping seed migration (idempotent) + the role-mapping UI list.
  • Entity: PendingSecuredWrite (Commons POCO + EF config + migration + regenerated model snapshot, central MS SQL): Id, SiteId, ConnectionName, TagPath, ValueJson, ValueType, Status {Pending → Approved/Rejected → Executed/Failed/Expired}, OperatorUser, OperatorComment, SubmittedAtUtc, VerifierUser, VerifierComment, DecidedAtUtc, ExecutedAtUtc, ExecutionError. Repository interface in Commons + impl in ConfigurationDatabase; registered in the unit of work.
  • ManagementActor handlers / commands:
    • SubmitSecuredWriteCommand (Operator) → insert Pending row.
    • ApproveSecuredWriteCommand (Verifier) → enforce VerifierUser ≠ OperatorUser server-side → mark Approved → relay ExecuteSecuredWriteCommand to the site → record Executed/Failed from the MxWriteOutcome.
    • RejectSecuredWriteCommand (Verifier) → Rejected + reason.
    • List/query pending + history (global + per-site).
  • Site relay: ExecuteSecuredWriteCommand(connectionName, tagPath, value, valueType) → SiteEnvelope → DataConnectionManagerActor → DataConnectionActor → MxGatewayDataConnection.WriteAsync. Validate MxGateway protocol at submit and execute. Mirrors the parked-call Retry/Discard relay.
  • Central UI 'Secured Writes' page (new nav entry):
    • Operator: submit form (site → MxGateway connection → tag path → typed value → comment; tag pick may reuse the T15 node browser).
    • Verifier: pending queue table with Approve / Reject (+comment); own submissions disabled in the UI and rejected server-side.
    • History: terminal rows with full who/when/outcome.
  • Audit: new AuditKind.SecuredWrite (+ channel); central direct-writes one row per lifecycle event (Submit / Approve / Reject / Execute), each sharing PendingSecuredWrite.Id as its CorrelationId and capturing operator + verifier + outcome (the "who approved" trail). Uses the existing central direct-write path (as Notification Outbox dispatch + Inbound API do).
  • Safety: Approve shows a confirm dialog with the exact site / connection / tag / value; the write fires only on explicit verifier approval.
  • Dev caveat: with DisableLogin on (docker), AutoLoginAuthenticationHandler grants Roles.All to one identity, so the two-person flow cannot be exercised end-to-end via the dev UI with a single user. No-self-approval is covered by handler-level tests; real two-person use needs two real identities.
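
The lifecycle rules above (Pending → Approved/Rejected → Executed/Failed, with no self-approval) can be sketched as a small state machine. This is an illustrative Python sketch, not the ManagementActor handlers: the function names, the relay callback returning an error string or None, and the case-insensitive user comparison are all assumptions, and the Expired status is left out for brevity. Applying the self-check to Reject as well as Approve is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Status(Enum):
    PENDING = "Pending"
    APPROVED = "Approved"
    REJECTED = "Rejected"
    EXECUTED = "Executed"
    FAILED = "Failed"


class SecuredWriteError(Exception):
    """Raised when a lifecycle rule is violated."""


@dataclass
class PendingSecuredWrite:
    id: int
    operator_user: str
    tag_path: str
    value_json: str
    status: Status = Status.PENDING
    verifier_user: Optional[str] = None
    execution_error: Optional[str] = None
    # (event, user, correlation id) rows standing in for the audit trail.
    audit: List[Tuple[str, str, int]] = field(default_factory=list)


Relay = Callable[[PendingSecuredWrite], Optional[str]]  # returns an error or None


def submit(write_id: int, operator: str, tag_path: str, value_json: str) -> PendingSecuredWrite:
    write = PendingSecuredWrite(write_id, operator, tag_path, value_json)
    write.audit.append(("Submit", operator, write.id))
    return write


def _require_decidable(write: PendingSecuredWrite, verifier: str) -> None:
    if write.status is not Status.PENDING:
        raise SecuredWriteError(f"write {write.id} is {write.status.value}, not Pending")
    # Assumption: identities compare case-insensitively.
    if verifier.casefold() == write.operator_user.casefold():
        raise SecuredWriteError("a verifier cannot decide their own submission")


def approve(write: PendingSecuredWrite, verifier: str, relay: Relay) -> PendingSecuredWrite:
    _require_decidable(write, verifier)
    write.status, write.verifier_user = Status.APPROVED, verifier
    write.audit.append(("Approve", verifier, write.id))
    error = relay(write)  # the write fires only after explicit approval
    write.status = Status.EXECUTED if error is None else Status.FAILED
    write.execution_error = error
    write.audit.append(("Execute", verifier, write.id))
    return write


def reject(write: PendingSecuredWrite, verifier: str, reason: str) -> PendingSecuredWrite:
    _require_decidable(write, verifier)
    write.status, write.verifier_user = Status.REJECTED, verifier
    write.audit.append(("Reject", verifier, write.id))
    return write
```

Keeping the check in the handler rather than only in the UI matches the dev caveat: with a single auto-login identity the rule can still be exercised by handler-level tests.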

T15 — Address-space search + BrowseNext paging (size: M)

  • BrowseNext: today RealOpcUaClient.BrowseChildrenAsync (~line 763) discards the continuation point and returns Truncated = true ("type the node id manually"). Thread an opaque base64 continuation token through BrowseNodeCommand + BrowseChildrenResult; when present, the site calls Session.BrowseNext on the connection's live session. Picker gains a "Load more" affordance. Expired/invalid continuation points fall back to a fresh browse.
  • Search: new SearchAddressSpaceCommand(ConnectionName, Query, MaxDepth, MaxResults) → site does a bounded recursive browse (depth + result caps) matching case-insensitive substring on DisplayName/path; returns matches with full node ids + paths; UI surfaces "showing first N — refine" when a cap is hit. New IAddressSpaceSearchable capability seam on the OPC UA adapter. StubOpcUaClient, which currently throws NotImplementedException, gets a canned browse/search impl so that unit/bUnit tests can run without a live server.
  • Central: BrowseService.SearchAsync (Design role) + a search box + results list in NodeBrowserDialog (click a result → select it).
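
The bounded search can be sketched as a capped walk over a browse function. This is a minimal Python illustration under stated assumptions: it uses an iterative breadth-first walk in place of literal recursion, the browse callback and the Node type are hypothetical stand-ins for the adapter's child browse, and the visited set is an added guard against reference cycles that this design does not mention.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Node:
    node_id: str
    display_name: str


BrowseFn = Callable[[str], List[Node]]  # hypothetical: children of a node id


def search_address_space(
    browse: BrowseFn,
    root_id: str,
    query: str,
    max_depth: int,
    max_results: int,
) -> Tuple[List[Tuple[str, str]], bool]:
    """Return ([(node_id, path)], capped); capped is True when more matches exist."""
    needle = query.casefold()
    matches: List[Tuple[str, str]] = []
    visited = {root_id}
    queue = deque([(root_id, "", 0)])
    while queue:
        parent_id, parent_path, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth cap: do not browse below this level
        for child in browse(parent_id):
            if child.node_id in visited:
                continue
            visited.add(child.node_id)
            path = f"{parent_path}/{child.display_name}"
            # The path ends with the display name, so one test covers both.
            if needle in path.casefold():
                if len(matches) >= max_results:
                    return matches, True  # result cap hit: caller shows "refine"
                matches.append((child.node_id, path))
            queue.append((child.node_id, path, depth + 1))
    return matches, False
```

Reporting the cap only when one further match is actually found avoids showing a "refine" hint when the result set merely happens to equal the limit.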

T16 — Type-info surfacing + bulk override CSV import (size: S + M)

  • Type-info: extend the BrowseNode record with optional DataType (friendly name), ValueRank (scalar/array), AccessLevel (read/write). RealOpcUaClient batch-reads these attributes for Variable nodes during browse; NodeBrowserDialog shows a Type column. Small built-in-type nodeid → friendly-name lookup.
  • CSV bulk import (D5 — attribute overrides): CSV columns AttributeName, Value, ElementType? (ElementType only for List attributes). Per-row validation against the instance's flattened attribute schema (name exists + type compatible — reuse the override validation + AttributeValueCodec); collects per-row errors; all-or-nothing upsert with a result summary. Writes reuse the existing ManagementActor add/update-override handlers (MV-10). Surfaces via:
    • InputFile upload on the InstanceConfigure page (built-in Blazor; no third-party lib).
    • CLI instance import-overrides --instance-id <N> --file <path.csv>.
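
The validate-everything-then-upsert behaviour of the CSV import can be sketched as below. This is an illustrative Python sketch, not the reused override validation or AttributeValueCodec: the schema is a hypothetical mapping of attribute name to a type name, only four scalar types are handled, and the optional ElementType column for List attributes is ignored.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple


def _parse_bool(text: str) -> bool:
    lowered = text.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


# Hypothetical type names; the real schema types are not listed in this design.
_PARSERS: Dict[str, Callable[[str], object]] = {
    "Int32": int,
    "Double": float,
    "Boolean": _parse_bool,
    "String": str,
}


def import_overrides(
    csv_text: str,
    schema: Dict[str, str],
    overrides: Dict[str, object],
) -> Tuple[int, List[Tuple[int, str]]]:
    """Validate every row, then upsert all of them or none.

    Returns (applied_count, [(row_number, error)]); row 1 is the header.
    """
    staged: Dict[str, object] = {}
    errors: List[Tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_no, row in enumerate(reader, start=2):
        name = (row.get("AttributeName") or "").strip()
        raw = row.get("Value") or ""
        type_name = schema.get(name)
        if type_name is None:
            errors.append((row_no, f"unknown attribute '{name}'"))
            continue
        try:
            staged[name] = _PARSERS[type_name](raw)
        except ValueError:
            errors.append((row_no, f"'{name}': {raw!r} is not a valid {type_name}"))
    if errors:
        return 0, errors  # all-or-nothing: nothing is written
    overrides.update(staged)
    return len(staged), []
```

Collecting every row error before returning lets the result summary list all problems in one pass instead of failing on the first bad row.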

T17 — Verify-endpoint + cert-management (site-local) (size: M + M)

  • Verify-endpoint: VerifyEndpointCommand(SiteId, protocol, configJson) → site spins a temporary RealOpcUaClient from the submitted config (works for unsaved edits and existing connections), connects (discovery + session) with a short timeout (~5–8 s), then disconnects; returns VerifyEndpointResult(Success, FailureKind, Error, ServerCert?). The probe forces AutoAcceptUntrustedCerts = false and hooks the certificate-validation event to capture an untrusted server cert (Subject / Issuer / Thumbprint / NotBefore / NotAfter / DER). Button lives in OpcUaEndpointEditor (Design role — D7).
  • Cert trust (D4 site-local + D6 both-nodes): when verify reports untrusted, the UI shows the cert detail + a Trust button (Administrator role — D7). Trust → TrustServerCertCommand broadcast to both site nodes → each node writes the .der into its own OPC UA trusted-peer PKI store; re-verify then succeeds. A small cert-management view lists the site's trusted/rejected store contents (ListServerCertsCommand reading the PKI dirs) with Trust / Remove (Administrator). No central persistence/migration.
  • Open detail for planning: the broadcast-to-both-nodes mechanism (per-node actor vs. cluster broadcast vs. re-apply-on-failover) is settled at plan time; the trust handler must run on each node because PKI dirs are node-local.
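
The trust step can be pictured as writing the captured DER bytes into each node's own store directory. The Python sketch below is only an illustration under stated assumptions: the loop over directories stands in for the still-undecided broadcast mechanism, the thumbprint-based file name is a hypothetical convention rather than the OPC UA stack's actual naming, and the thumbprint is taken as the SHA-1 of the DER encoding in upper-case hex.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def thumbprint(der: bytes) -> str:
    """SHA-1 over the DER bytes, rendered as upper-case hex."""
    return hashlib.sha1(der).hexdigest().upper()


def trust_on_all_nodes(der: bytes, node_store_dirs: Iterable[str]) -> List[Path]:
    """Write the cert into every node-local trusted store (hypothetical layout)."""
    file_name = f"{thumbprint(der)}.der"  # assumption: file named by thumbprint
    written: List[Path] = []
    for store_dir in node_store_dirs:
        store = Path(store_dir)
        store.mkdir(parents=True, exist_ok=True)
        target = store / file_name
        target.write_bytes(der)  # idempotent: re-trusting rewrites the same bytes
        written.append(target)
    return written


def is_trusted(der: bytes, store_dir: str) -> bool:
    """True when this node's store already holds the cert."""
    return (Path(store_dir) / f"{thumbprint(der)}.der").is_file()
```

The point of the sketch is the D6 failure mode: if only one directory is written, is_trusted is False on the other node, which is exactly what a failover would then hit.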

Delivery approach (waves)

Three mostly-independent streams; full build + docker rebuild + Playwright only at integration.

  1. Wave A (parallel-safe foundations):
    • T13 (CentralUI alarm summary + snapshot fan-out + AlarmStateBadges extraction).
    • T14a (Security project: Operator/Verifier roles, policies, LDAP mapping seed + UI).
    These two are disjoint from each other and from the OPC UA / DCL files.
  2. Wave B (OPC UA / DCL stream — serialized on RealOpcUaClient/StubOpcUaClient/DataConnectionActor/BrowseService):
    • T16-typeinfo → T15 BrowseNext → T15 search → T17 verify → T17 trust (DCL edits in order), then the UI layers (NodeBrowserDialog, OpcUaEndpointEditor, DataConnectionForm).
  3. Wave C (T14b — depends on T14a):
    • PendingSecuredWrite entity + migration → ManagementActor handlers + commands → site execute-relay (DataConnectionActor, merges after Wave B's DCL edits) → Secured Writes page → AuditKind.SecuredWrite wiring.
  4. Integration: docs (Component-* updates, README/CLAUDE component count stays 26 — no new component; these are features on existing components), 2026-06-15-stillpending-completion-design.md M7 status, full-solution build, bash docker/deploy.sh, Playwright, live smoke (alarm summary renders; secured-write submit→approve relay; browse search/load-more; verify-endpoint button).

Execution conventions (per CLAUDE.md + standing constraints): dedicated worktree (this one), pathspec commits (git commit -- <paths>, never git add -A), ≤2–3 concurrent committers with post-wave HEAD-presence checks, targeted builds/tests per task, full build + docker rebuild only at integration. TreatWarningsAsErrors=true everywhere.

Testing strategy

  • Unit: T13 alarm aggregation/roll-up + partial-results; T14 no-self-approval + status transitions + MxGateway-only validation; T15 bounded recursive search + BrowseNext continuation + stub impl; T16 CSV parse/validate (good + per-row-error corpus) + type-info mapping; T17 verify-result mapping + cert-store write + both-node broadcast.
  • Integration (against the cluster): secured-write end-to-end relay; browse / search / verify round-trips; trust-then-reverify.
  • Playwright: alarm summary page (filters, roll-up); secured-writes submit → approve → history; node-browser search + load-more; verify-endpoint button (success + untrusted-cert path).

Risks & open items

  • T14 is the heaviest and most security-sensitive — writes to live process equipment. Mitigated by the two-person gate, no-self-approval, confirm dialog, and full audit. High-risk classification; serial spec→code review + final integration review.
  • OPC UA BrowseNext continuation points are session-bound and can expire/be released — handle invalid-CP by restarting the browse; never assume a CP survives indefinitely.
  • StubOpcUaClient currently throws on browse — must gain a canned browse/search impl or the new bUnit/unit tests can't run without a live server.
  • DisableLogin dev caveat for T14 (single identity gets all roles) — documented above.
  • Cert trust HA — broadcast-to-both-nodes (D6) chosen; the exact broadcast mechanism is a plan-time detail.
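
The continuation-point risk above comes down to one rule: treat the token as opaque and restart the browse when the site no longer recognizes it. A minimal Python sketch of that rule follows; the browse and browse_next callbacks, the InvalidContinuationPoint exception and the restarted flag are all hypothetical stand-ins, not the RealOpcUaClient API.

```python
import base64
from typing import Callable, List, Optional, Tuple


class InvalidContinuationPoint(Exception):
    """Raised by the hypothetical session when a continuation point is gone."""


Page = Tuple[List[str], Optional[bytes]]  # (node ids, raw continuation point)


def encode_token(cp: Optional[bytes]) -> Optional[str]:
    """Expose the raw continuation point as an opaque base64 string."""
    return None if cp is None else base64.b64encode(cp).decode("ascii")


def browse_page(
    browse: Callable[[str], Page],
    browse_next: Callable[[bytes], Page],
    node_id: str,
    token: Optional[str],
) -> Tuple[List[str], Optional[str], bool]:
    """Return (nodes, next_token, restarted).

    restarted is True when a supplied token was unusable and the browse
    started over, so the picker should replace its list rather than append.
    """
    if token is not None:
        try:
            nodes, cp = browse_next(base64.b64decode(token))
            return nodes, encode_token(cp), False
        except (InvalidContinuationPoint, ValueError):
            pass  # expired, released, or malformed token: fall back below
    nodes, cp = browse(node_id)
    return nodes, encode_token(cp), token is not None
```

Returning the restarted flag is a design choice of this sketch; without it a "Load more" click after an expiry would append the first page a second time.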

Follow-ups (not in M7)

  • Native-alarm-source-override CSV bulk import (InstanceNativeAlarmSourceOverride).
  • Aggregated live alarm stream for the summary page (vs. snapshot+poll).
  • Central-persisted, auditable server-cert trust (supersede the site-local v1) if cross-site governance is later wanted.

Next step

Hand off to the writing-plans skill to produce the bite-sized, classification-tagged implementation plan (docs/plans/2026-06-18-m7-opcua-mxgateway-ux.md + .tasks.json), then execute subagent-driven in this session.