lmxopcua/docs/v2/implementation/phase-6-2-authorization-runtime.md
Joseph Doherty 4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). 
Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes.
2026-04-19 03:15:00 -04:00


Phase 6.2 — Authorization Runtime (ACL + LDAP grants)

Status: DRAFT — the v2 plan.md decision #129 + acl-design.md specify a 6-level permission-trie evaluator with NodePermissions bitmask grants, but no runtime evaluator exists. ACL tables are schematized but unread by the data path.

Branch: v2/phase-6-2-authorization-runtime
Estimated duration: 2.5 weeks
Predecessor: Phase 6.1 (Resilience & Observability) — reuses the Polly pipeline for ACL-cache refresh retries
Successor: Phase 6.3 (Redundancy)

Phase Objective

Wire ACL enforcement on every OPC UA Read / Write / Subscribe / Call path + LDAP group → admin role grants that the v2 plan specified but never ran. End-state: a user's effective permissions resolve through a per-session permission-trie over the 6-level Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag hierarchy, cached per session, invalidated on generation-apply + LDAP group expiry.

Closes these gaps:

  1. Data-path ACL enforcement — NodeAcl table + NodePermissions flags shipped; NodeAclService.cs present as a CRUD surface; no code consults ACLs at Read/Write time. OPC UA server answers everything to everyone.
  2. LdapGroupRoleMapping for cluster-scoped admin grants — decision #105 shipped as the design; admin roles are hardcoded (FleetAdmin / ConfigEditor / ReadOnly) with no cluster-scoping and no LDAP-to-grant table. Decision #105 explicitly lifts this from v2.1 into v2.0.
  3. Explicit Deny pathway — deferred to v2.1 (decision #129 note). Phase 6.2 ships grants only; Deny stays out.
  4. Admin UI ACL grant editor — AclsTab.razor exists but edits the now-unused NodeAcl table; needs to wire to the runtime evaluator + the new LdapGroupRoleMapping table.

Scope — What Changes

| Concern | Change |
| --- | --- |
| Configuration project | New entity LdapGroupRoleMapping { Id, LdapGroup, Role, ClusterId? (nullable = system-wide), IsSystemWide, GeneratedAtUtc }. Migration. Admin CRUD. |
| Core → new Core.Authorization sub-namespace | IPermissionEvaluator interface; concrete PermissionTrieEvaluator implementation loads ACLs + LDAP mappings from Configuration, builds a trie keyed on the 6-level scope hierarchy, evaluates a (UserClaim[], NodeId, NodePermissions) → bool decision in O(depth × group-count). |
| Core.Authorization cache | PermissionTrieCache — one trie per (ClusterId, GenerationId). Rebuilt on sp_PublishGeneration confirmation; served from memory thereafter. Per-session evaluator keeps a reference to the current trie + user's LDAP groups. |
| OPC UA server dispatch | OtOpcUa.Server/OpcUa/DriverNodeManager.cs Read/Write/HistoryRead/MonitoredItem-create paths call PermissionEvaluator.Authorize(session.Identity, nodeId, NodePermissions.Read) etc. before delegating to the driver. Unauthorized returns BadUserAccessDenied (0x801F0000) — not a silent no-op per corrections-doc B1. |
| LdapAuthService (existing) | On cookie-auth success, resolves the user's LDAP groups via LdapGroupService.GetMemberships + loads the matching LdapGroupRoleMapping rows → produces a role-claim list + cluster-scope claim list. Stored on the auth cookie. |
| Admin UI AclsTab.razor | Repoint edits at the new NodeAclService API that writes through to the same table the evaluator reads. Add a "test this permission" probe that runs a dummy evaluator against a chosen (user, nodeId, action) so ops can sanity-check grants before publishing a draft. |
| Admin UI new tab RoleGrantsTab.razor | CRUD over LdapGroupRoleMapping. Per-cluster + system-wide grants. FleetAdmin only. |
| Audit log | Every Grant/Revoke/Publish on LdapGroupRoleMapping or NodeAcl writes an AuditLog row with old/new state + user. |
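
The cache row above can be sketched as follows. This is a minimal illustration, not the planned implementation: the PermissionTrie record is a placeholder for the Stream B type, GenerationId is assumed to be a long, and the build delegate and the OnGenerationPublished method name are hypothetical stand-ins for the real builder and the sp_PublishGeneration confirmation hook.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical placeholder for the built trie; the real type comes from Stream B.
public sealed record PermissionTrie(Guid ClusterId, long GenerationId);

// Sketch of a cache that holds one trie per (ClusterId, GenerationId).
public sealed class PermissionTrieCache
{
    private readonly ConcurrentDictionary<(Guid ClusterId, long GenerationId), PermissionTrie> _tries = new();
    private readonly Func<Guid, long, PermissionTrie> _build;

    // The build delegate is an assumption standing in for PermissionTrieBuilder.
    public PermissionTrieCache(Func<Guid, long, PermissionTrie> build) => _build = build;

    // Returns the cached trie for the generation, building it on first use.
    public PermissionTrie Get(Guid clusterId, long generationId) =>
        _tries.GetOrAdd((clusterId, generationId), key => _build(key.ClusterId, key.GenerationId));

    // Called once a publish is confirmed: drop every trie of the cluster
    // that belongs to a generation other than the newly published one.
    public void OnGenerationPublished(Guid clusterId, long newGenerationId)
    {
        foreach (var key in _tries.Keys)
        {
            if (key.ClusterId == clusterId && key.GenerationId != newGenerationId)
                _tries.TryRemove(key, out _);
        }
    }
}
```

Keying on the generation means a lookup for an older generation can never be answered with a newer trie; at worst the older trie is rebuilt on demand.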

Scope — What Does NOT Change

| Item | Reason |
| --- | --- |
| OPC UA authn | Already done (PR 19 LDAP user identity + Basic256Sha256 profile). Phase 6.2 is authorization only. |
| Explicit Deny grants | Decision #129 note explicitly defers to v2.1. Default-deny + additive grants only. |
| Driver-side SecurityClassification metadata | Drivers keep reporting Operate / ViewOnly / etc. — the evaluator uses them as part of the decision but doesn't replace them. |
| Galaxy namespace (SystemPlatform kind) | UNS levels don't apply; evaluator treats Galaxy nodes as Cluster → Namespace → Tag (skip UnsArea/UnsLine/Equipment). |
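
To make the Galaxy row concrete, here is one possible way to derive the scope path a trie walk would follow. The NodeScope record, its IsSystemPlatform flag, and the ScopePath helper are all hypothetical names introduced for illustration. The sketch mirrors the row as written (Cluster → Namespace → Tag) and does not model the FolderSegment level that finding 6 of the adversarial review adds.

```csharp
using System;
using System.Collections.Generic;

public enum ScopeLevel { Cluster, Namespace, UnsArea, UnsLine, Equipment, Tag }

// Hypothetical flattened view of a tag's ancestry; the UNS ids are null for Galaxy nodes.
public sealed record NodeScope(
    Guid ClusterId, Guid NamespaceId,
    Guid? UnsAreaId, Guid? UnsLineId, Guid? EquipmentId,
    Guid TagId, bool IsSystemPlatform);

public static class ScopePath
{
    // Builds the ordered list of scopes a trie walk would visit for this node.
    public static IReadOnlyList<(ScopeLevel Level, Guid Id)> For(NodeScope n)
    {
        var path = new List<(ScopeLevel Level, Guid Id)>
        {
            (ScopeLevel.Cluster, n.ClusterId),
            (ScopeLevel.Namespace, n.NamespaceId),
        };

        if (!n.IsSystemPlatform)
        {
            // UNS-kind namespaces carry all three intermediate levels.
            path.Add((ScopeLevel.UnsArea, n.UnsAreaId ?? throw new ArgumentException("UnsAreaId required")));
            path.Add((ScopeLevel.UnsLine, n.UnsLineId ?? throw new ArgumentException("UnsLineId required")));
            path.Add((ScopeLevel.Equipment, n.EquipmentId ?? throw new ArgumentException("EquipmentId required")));
        }

        // Galaxy (SystemPlatform) nodes go straight from Namespace to Tag.
        path.Add((ScopeLevel.Tag, n.TagId));
        return path;
    }
}
```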

Entry Gate Checklist

  • Phase 6.1 merged (reuse Core.Resilience Polly pipeline for the ACL cache-refresh retries)
  • acl-design.md re-read in full
  • Decision log #105, #129, corrections-doc B1 re-skimmed
  • Existing NodeAcl + NodePermissions flag enum audited; confirm bitmask flags match acl-design.md table
  • Existing LdapAuthService group-resolution code path traced end-to-end — confirm it already queries group memberships (we only need the caller to consume the result)
  • Test DB scenarios catalogued: two clusters, three LDAP groups per cluster, mixed grant shapes; captured as seed-data fixtures

Task Breakdown

Stream A — LdapGroupRoleMapping table + migration (3 days)

  1. A.1 Entity + EF Core migration. Columns per §Scope table. Unique constraint on (LdapGroup, ClusterId) with null-tolerant comparer for the system-wide case. Index on LdapGroup for the hot-path lookup on auth.
  2. A.2 ILdapGroupRoleMappingService CRUD. Wrap in the Phase 6.1 Polly pipeline (timeout → retry → fallback-to-cache).
  3. A.3 Seed-data migration: preserve the current hardcoded FleetAdmin / ConfigEditor / ReadOnly mappings by seeding rows for the existing LDAP groups the dev box uses (cn=fleet-admin,…, cn=config-editor,…, cn=read-only,…). Operationally a no-op for existing deployments.
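
The entity shape and the null-tolerant uniqueness rule from A.1 might look roughly like the sketch below. It is an in-memory illustration only. The property types (Guid Id, string Role), deriving IsSystemWide from ClusterId rather than storing it as a column, and the case-insensitive comparison of LDAP group names are all assumptions, and the real constraint would live in the EF Core migration rather than in a comparer.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Sketch of the entity; column types are assumed, not taken from the schema.
public sealed record LdapGroupRoleMapping(
    Guid Id, string LdapGroup, string Role, Guid? ClusterId, DateTime GeneratedAtUtc)
{
    // Assumption: system-wide is exactly the "no cluster" case.
    public bool IsSystemWide => ClusterId is null;
}

// Treats (LdapGroup, ClusterId) as the uniqueness key, with two null
// ClusterIds comparing equal so a group gets at most one system-wide row.
public sealed class MappingScopeComparer : IEqualityComparer<LdapGroupRoleMapping>
{
    public bool Equals(LdapGroupRoleMapping? x, LdapGroupRoleMapping? y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return string.Equals(x.LdapGroup, y.LdapGroup, StringComparison.OrdinalIgnoreCase)
            && x.ClusterId == y.ClusterId; // null == null is true for Guid?
    }

    public int GetHashCode(LdapGroupRoleMapping m) =>
        HashCode.Combine(StringComparer.OrdinalIgnoreCase.GetHashCode(m.LdapGroup), m.ClusterId);
}
```

A HashSet built with this comparer rejects a second system-wide row for the same group, which is the behavior the unique constraint is meant to guarantee at the database level.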

Stream B — Permission-trie evaluator (1 week)

  1. B.1 IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, NodeId nodeId, NodePermissions needed) — returns bool. Phase 6.2 returns only true / false; v2.1 can widen to Allow/Deny/Indeterminate if Deny lands.
  2. B.2 PermissionTrieBuilder builds the trie from NodeAcl + LdapGroupRoleMapping joined to the current generation's UnsArea + UnsLine + Equipment + Tag tables. One trie per (ClusterId, GenerationId) so rollback doesn't smear permissions across generations.
  3. B.3 Trie node structure: { Level: enum, ScopeId: Guid, AllowedPermissions: NodePermissions, ChildrenByLevel: Dictionary<Guid, TrieNode> }. Evaluation walks from Cluster → Namespace → UnsArea → UnsLine → Equipment → Tag, ORing allowed permissions at each level. Additive semantics: a grant at Cluster level cascades to every descendant tag.
  4. B.4 PermissionTrieCache service scoped as singleton; exposes GetTrieAsync(ClusterId, ct) that returns the current-generation trie. Invalidated on sp_PublishGeneration via an in-process event bus; also on TTL expiry (24 h safety net).
  5. B.5 Per-session cached evaluator: OPC UA Session authentication produces UserAuthorizationState { ClusterId, LdapGroups[], Trie }; cached on the session until session close or generation-apply.
  6. B.6 Unit tests: trie-walk theory covering (a) Cluster-level grant cascades to tags, (b) Equipment-level grant doesn't leak to sibling Equipment, (c) multi-group union, (d) no-grant → deny, (e) Galaxy nodes skip UnsArea/UnsLine levels.
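
The trie walk in B.3 can be illustrated with a small sketch. It is a simplification under stated assumptions: grants are stored per LDAP group on each node (the plan's node carries a single AllowedPermissions field), scope paths are plain lists of Guids ordered from Cluster down to Tag, and the NodePermissions bit values are hypothetical.

```csharp
using System;
using System.Collections.Generic;

[Flags]
public enum NodePermissions { None = 0, Read = 1, WriteOperate = 2, HistoryRead = 4 } // bit values are hypothetical

public sealed class TrieNode
{
    // Grants attached at this scope, keyed by LDAP group (an assumption of this sketch).
    public Dictionary<string, NodePermissions> GrantsByGroup { get; } = new(StringComparer.OrdinalIgnoreCase);
    public Dictionary<Guid, TrieNode> Children { get; } = new();
}

public sealed class PermissionTrie
{
    private readonly TrieNode _root = new();

    // Attaches a grant at the scope identified by the path (Cluster first).
    public void Grant(IReadOnlyList<Guid> scopePath, string group, NodePermissions perms)
    {
        var node = _root;
        foreach (var id in scopePath)
        {
            if (!node.Children.TryGetValue(id, out var child))
                node.Children[id] = child = new TrieNode();
            node = child;
        }
        node.GrantsByGroup.TryGetValue(group, out var existing);
        node.GrantsByGroup[group] = existing | perms;
    }

    // Walks from Cluster toward Tag, ORing every grant held by any of the
    // user's groups; cost is O(depth x group-count).
    public bool Authorize(IReadOnlyList<Guid> nodePath, IReadOnlyCollection<string> groups, NodePermissions needed)
    {
        var effective = NodePermissions.None;
        var node = _root;
        foreach (var id in nodePath)
        {
            if (!node.Children.TryGetValue(id, out var child)) break; // no grants exist below this point
            node = child;
            foreach (var g in groups)
                if (node.GrantsByGroup.TryGetValue(g, out var p)) effective |= p;
            if ((effective & needed) == needed) return true;
        }
        return false; // default-deny
    }
}
```

Because permissions only accumulate on the way down, a Cluster-level grant reaches every descendant tag, while an Equipment-level grant is never visited when walking to a sibling Equipment.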

Stream C — OPC UA server dispatch wiring (4 days)

  1. C.1 DriverNodeManager.Read — consult evaluator before delegating to IReadable. Unauthorized nodes get BadUserAccessDenied per-attribute, not on the whole batch.
  2. C.2 DriverNodeManager.Write — same. Evaluator needs NodePermissions.WriteOperate / WriteTune / WriteConfigure depending on driver-reported SecurityClassification of the attribute.
  3. C.3 DriverNodeManager.HistoryRead — ACL checks NodePermissions.Read (history uses the same Read flag per acl-design.md).
  4. C.4 DriverNodeManager.CreateMonitoredItem — denies unauthorized nodes at subscription create time, not after the first publish. Cleaner than silently omitting notifications.
  5. C.5 Alarm actions (acknowledge / confirm / shelve) — checks AlarmAck / AlarmConfirm / AlarmShelve flags.
  6. C.6 Integration tests: boot server with a seed trie, auth as three distinct users with different group memberships, assert read of one tag allowed + read of another denied + write denied where Read allowed.
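
The per-item behavior described in C.1 might be sketched as below. The ReadGate class, the ReadResult record, and the two delegates are hypothetical stand-ins for DriverNodeManager, the evaluator, and the driver's IReadable; NodeIds are simplified to strings. The status constant is the BadUserAccessDenied value from the OPC UA status code table (0x801F0000).

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Simplified per-item result; the real server returns DataValue + StatusCode.
public sealed record ReadResult(uint StatusCode, object? Value);

public static class ReadGate
{
    public const uint Good = 0x00000000;
    public const uint BadUserAccessDenied = 0x801F0000;

    // Authorizes each requested node separately so one denied node
    // never fails the rest of the batch.
    public static ReadResult[] ReadBatch(
        IReadOnlyList<string> nodeIds,
        Func<string, bool> canRead,            // stand-in for the permission evaluator
        Func<string, object> readFromDriver)   // stand-in for the driver read
    {
        var results = new ReadResult[nodeIds.Count];
        for (var i = 0; i < nodeIds.Count; i++)
        {
            results[i] = canRead(nodeIds[i])
                ? new ReadResult(Good, readFromDriver(nodeIds[i]))
                : new ReadResult(BadUserAccessDenied, null); // the driver is never consulted
        }
        return results;
    }
}
```

Denied items carry no value at all, which is what keeps a denial distinguishable from a Good result with stale data.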

Stream D — Admin UI refresh (4 days)

  1. D.1 RoleGrantsTab.razor — FleetAdmin-gated CRUD on LdapGroupRoleMapping. Per-cluster dropdown + system-wide checkbox. Validation: LDAP group must exist in the dev LDAP (GLAuth) before saving — best-effort probe with graceful degradation.
  2. D.2 AclsTab.razor rewrites its edit path to write through the new NodeAclService. Adds a "Probe this permission" row: choose (LDAP group, node, action) → shows Allow / Deny + the reason (which grant matched).
  3. D.3 Draft-generation diff viewer now includes an ACL section: "X grants added, Y grants removed, Z grants changed."
  4. D.4 SignalR notification: PermissionTrieCache invalidation on sp_PublishGeneration pushes to Admin UI so operators see "this cluster's permissions were just updated" within 2 s.
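
The ACL section of the diff viewer in D.3 boils down to a three-way count. A minimal sketch follows, assuming a grant is identified by its (LdapGroup, ScopeId) pair and that each pair appears at most once per generation; the AclGrant and AclDiff types and the NodePermissions bit values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

[Flags]
public enum NodePermissions { None = 0, Read = 1, WriteOperate = 2 } // bit values are hypothetical

public sealed record AclGrant(string LdapGroup, Guid ScopeId, NodePermissions Permissions);

public sealed record AclDiff(int Added, int Removed, int Changed)
{
    public override string ToString() =>
        $"{Added} grants added, {Removed} grants removed, {Changed} grants changed";
}

public static class AclDiffer
{
    // Compares the published generation with the draft.
    // ToDictionary throws if a (group, scope) pair is duplicated within one side.
    public static AclDiff Compare(IEnumerable<AclGrant> published, IEnumerable<AclGrant> draft)
    {
        var before = published.ToDictionary(g => (g.LdapGroup, g.ScopeId), g => g.Permissions);
        var after = draft.ToDictionary(g => (g.LdapGroup, g.ScopeId), g => g.Permissions);

        var added = after.Keys.Count(k => !before.ContainsKey(k));
        var removed = before.Keys.Count(k => !after.ContainsKey(k));
        var changed = after.Count(kv => before.TryGetValue(kv.Key, out var old) && old != kv.Value);

        return new AclDiff(added, removed, changed);
    }
}
```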

Compliance Checks (run at exit gate)

  • Data-path enforcement: OPC UA Read against a NodeId the current user has no grant for returns BadUserAccessDenied with a ServiceResult, not Good with stale data. Verified by an integration test with a Basic256Sha256-secured session + a read-only LDAP identity.
  • Trie invariants: PermissionTrieBuilder is idempotent (building twice with identical inputs produces equal tries — override Equals to assert).
  • Additive grants: Cluster-level grant on User A means User A can read every tag in that cluster without needing any lower-level grant.
  • Isolation between clusters: a grant on Cluster 1 has zero effect on Cluster 2 for the same user.
  • Galaxy path coverage: ACL checks work on Galaxy folder nodes + tag nodes where the UNS levels are absent (the trie treats them as shallow Cluster → Namespace → Tag).
  • No regression in driver test counts.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| ACL evaluator latency on per-read hot path | Medium | High | Trie lookup is O(depth) = O(6); session-cached UserAuthorizationState avoids per-Read trie rebuild; benchmark in Stream B.6 |
| Trie cache stale after a rollback | Medium | High | sp_PublishGeneration + sp_RollbackGeneration both emit the invalidation event; trie keyed on (ClusterId, GenerationId) so rollback fetches the prior trie cleanly |
| BadUserAccessDenied returns expose sensitive browse-name metadata | Low | Medium | Server returns only the status code + NodeId; no message leak per OPC UA Part 4 §7.34 guidance |
| LdapGroupRoleMapping migration breaks existing deployments | Low | High | Seed-migration preserves the hardcoded groups' effective grants verbatim; smoke test exercises the post-migration fleet admin login |
| Deny semantics accidentally ship (would break acl-design.md defer) | Low | Medium | IPermissionEvaluator.Authorize returns bool (not tri-state) through Phase 6.2; widening to Allow/Deny/Indeterminate is a v2.1 ticket |

Completion Checklist

  • Stream A: LdapGroupRoleMapping entity + migration + CRUD + seed
  • Stream B: evaluator + trie builder + cache + per-session state + unit tests
  • Stream C: OPC UA dispatch wiring on Read/Write/HistoryRead/Subscribe/Alarm paths
  • Stream D: Admin UI RoleGrantsTab + AclsTab refresh + SignalR invalidation
  • phase-6-2-compliance.ps1 exits 0; exit-gate doc recorded

Adversarial Review — 2026-04-19 (Codex, thread 019da48d-0d2b-7171-aed2-fc05f1f39ca3)

  1. Crit · ACCEPT — Trie must not conflate LdapGroupRoleMapping (control-plane admin claims per decision #105) with data-plane ACLs (decision #129). Change: LdapGroupRoleMapping is consumed only by the Admin UI role router. Data-plane trie reads NodeAcl rows joined against the session's resolved LDAP groups, never admin roles. Stream B.2 updated.
  2. Crit · ACCEPT — Cached UserAuthorizationState survives LDAP group changes because memberships only refresh at cookie-auth. Change: add MembershipFreshnessInterval (default 15 min); past that, next hot-path authz call forces group re-resolution (fail-closed if LDAP unreachable). Session-close-wins on config-rollback.
  3. High · ACCEPT — Node-local invalidation doesn't extend across redundant pair. Change: trie keyed on (ClusterId, GenerationId); hot-path authz looks up CurrentGenerationId from the shared config DB (Polly-wrapped + sub-second cache). A Backup that read stale generation gets a mismatched trie → forces re-load. Implementation note added to Stream B.4.
  4. High · ACCEPT — Browse enforcement missing. Change: new Stream C.7 (Browse + TranslateBrowsePathsToNodeIds enforcement). Ancestor visibility implied when any descendant has a grant; denied ancestors filter from browse results per acl-design.md §Browse.
  5. High · ACCEPT — HistoryRead should use NodePermissions.HistoryRead bit, not Read. Change: Stream C.3 revised; separate unit test asserts Read+no-HistoryRead denies HistoryRead while allowing current-value reads.
  6. High · ACCEPT — Galaxy shallow-path (Cluster→Namespace→Tag) loses folder hierarchy authorization. Change: SystemPlatform namespaces use a FolderSegment scope-level between Namespace and Tag, populated from Tag.FolderPath; UNS-kind namespaces keep the 6-level hierarchy. Trie supports both via ScopeKind on each node.
  7. High · ACCEPT — Subscription re-authorization policy unresolved between create-time-only (fast, wrong on revoke) and per-publish (slow). Change: stamp each MonitoredItem with (AuthGenerationId, MembershipVersion); re-evaluate on Publish only when either version changed. Revoked items drop to BadUserAccessDenied within one publish cycle.
  8. Med · ACCEPT — Mixed-authorization batch Read / CreateMonitoredItems service-result semantics underspecified. Change: Stream C.6 explicitly tests per-ReadValueId + per-MonitoredItemCreateResult denial in mixed batches; batch never collapses to a coarse failure.
  9. Med · ACCEPT — Missing surfaces: Method.Call, HistoryUpdate, event filter on subscriptions, subscription-transfer on reconnect, alarm-ack. Change: scope expanded — every OPC UA authorization surface enumerated in Stream C: Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge/Confirm/Shelve, Browse, TranslateBrowsePathsToNodeIds.
  10. Med · ACCEPT — bool evaluator bakes in grant-only semantics; collides with v2.1 Deny. Change: internal model uses AuthorizationDecision { Allow | NotGranted | Denied, IReadOnlyList<MatchedGrant> Provenance }. Phase 6.2 maps Denied → never produced; UI + audit log use the full record so v2.1 Deny lands without API break.
  11. Med · ACCEPT — 6.1 cache fallback is availability-oriented; applying it to auth is correctness-dangerous. Change: auth-specific staleness budget AuthCacheMaxStaleness (default 5 min, not 24 h). Past that, hot-path evaluator fails closed on cached reads; all authorization calls return NotGranted until fresh data lands. Documented in risks + compliance.
  12. Low · ACCEPT — Existing NodeAclService is raw CRUD. Change: new ValidatedNodeAclAuthoringService enforces scope-uniqueness + draft/publish invariants + rejects invalid (LDAP group, scope) pairs; Admin UI writes through it only. Stream D.2 adjusted.
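
Findings 10 and 11 combine naturally: a tri-state decision record plus a staleness gate that fails closed. The sketch below is one way this could look. The AuthorizationVerdict enum name, the MatchedGrant fields, and the StalenessGuard class are hypothetical; only AuthorizationDecision, its three states, Provenance, and AuthCacheMaxStaleness come from the findings above.

```csharp
using System;
using System.Collections.Generic;

public enum AuthorizationVerdict { Allow, NotGranted, Denied } // Denied is never produced in Phase 6.2

// Hypothetical provenance entry: which group's grant matched, and at which scope.
public sealed record MatchedGrant(string LdapGroup, string Scope);

public sealed record AuthorizationDecision(AuthorizationVerdict Verdict, IReadOnlyList<MatchedGrant> Provenance)
{
    public bool IsAllowed => Verdict == AuthorizationVerdict.Allow;
}

// Fails closed once cached authorization data is older than the budget.
public sealed class StalenessGuard
{
    private readonly TimeSpan _maxStaleness;

    public StalenessGuard(TimeSpan authCacheMaxStaleness) => _maxStaleness = authCacheMaxStaleness;

    public AuthorizationDecision Evaluate(
        DateTime lastRefreshUtc, DateTime nowUtc, Func<AuthorizationDecision> evaluateFromCache)
    {
        // Past the budget, the cache is not consulted at all.
        if (nowUtc - lastRefreshUtc > _maxStaleness)
            return new AuthorizationDecision(AuthorizationVerdict.NotGranted, Array.Empty<MatchedGrant>());

        return evaluateFromCache();
    }
}
```

Passing the clock in as a parameter keeps the gate deterministic in unit tests; a production version would more likely take a time provider through its constructor.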