Joseph Doherty
ac63c2cfb2
ACL + role-grant SignalR invalidation — #196 slice 2. Adds the live-push layer so an operator editing permissions in one Admin session sees the change in peer sessions without a manual reload. Covers both axes of task #196 's invalidation requirement: cluster-scoped NodeAcl mutations push NodeAclChanged to that cluster's subscribers; fleet-wide LdapGroupRoleMapping CRUD pushes RoleGrantsChanged to every Admin session on the fleet group. New AclChangeNotifier service wraps IHubContext<FleetStatusHub> with two methods: NotifyNodeAclChangedAsync(clusterId, generationId) + NotifyRoleGrantsChangedAsync(). Both are fire-and-forget — a failed hub send logs a warning + returns; the authoritative DB write already committed, so worst-case peers see stale data until their next poll (AclsTab has no polling today; on-parameter-set reload + this signal covers the practical refresh cases). Catching OperationCanceledException separately so request-teardown doesn't log a false-positive hub-failure. NodeAclService constructor gains an optional AclChangeNotifier param (defaults to null so the existing unit tests that pass only a DbContext keep compiling). GrantAsync + RevokeAsync both emit NodeAclChanged after the SaveChanges completes — the Revoke path uses the loaded row's ClusterId + GenerationId for accurate routing since the caller passes only the surrogate rowId. RoleGrants.razor consumes the notifier after every Create + Delete + opens a fleet-scoped HubConnection on first render that reloads the grant list on RoleGrantsChanged. AclsTab.razor opens a cluster-scoped connection on first render and reloads only when the incoming NodeAclChanged message matches both the current ClusterId + GenerationId (so a peer editing a different draft doesn't trigger spurious reloads). Both pages IAsyncDisposable the connection on navigation away. AclChangeNotifier is DI-registered alongside PermissionProbeService. Two new message records in AclChangeNotifier.cs: NodeAclChangedMessage(ClusterId, GenerationId, ObservedAtUtc) + RoleGrantsChangedMessage(ObservedAtUtc). Admin.Tests 92/92 passing (unchanged — the notifier is fire-and-forget + tested at hub level in existing FleetStatusPoller suite). Admin builds 0 errors. One slice of #196 remains: the draft-diff ACL section (extend sp_ComputeGenerationDiff to emit NodeAcl rows + wire the DiffViewer NodeAcl card from the empty placeholder it currently shows). Next PR.
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-20 00:32:28 -04:00
Joseph Doherty
ecc2389ca8
AclsTab Probe-this-permission — first of three #196 slices. New /clusters/{ClusterId}/draft/{GenerationId} ACLs-tab gains a probe card above the grant table so operators can ask the trie "if cn=X asks for permission Y on node Z, would it be granted, and which rows contributed?" without shell-ing into the DB. Service thinly wraps the same PermissionTrieBuilder + PermissionTrie.CollectMatches call path the Server's dispatch layer uses at request time, so a probe answer is by construction identical to what the live server would decide. New PermissionProbeService.ProbeAsync(generationId, ldapGroup, NodeScope, requiredFlags) — loads the target generation's NodeAcl rows filtered to the cluster (critical: without the cluster filter, cross-cluster grants leak into the probe which tested false-positive in the unit suite), builds a trie, CollectMatches against the supplied scope + [ldapGroup], ORs the matched-grant flags into Effective, compares to Required. Returns PermissionProbeResult(Granted, Required, Effective, Matches) — Matches carries LdapGroup + Scope + PermissionFlags per matched row so the UI can render the contribution chain. Zero side effects + no audit rows — a failing probe is a question, not a denial. AclsTab.razor gains the probe card at the top (before the New-grant form + grant table): six inputs for ldap group + every NodeScope level (NamespaceId → UnsAreaId → UnsLineId → EquipmentId → TagId — blank fields become null so the trie walks only as deep as the operator specified), a NodePermissions dropdown filtered to skip None, Probe button, green Granted / red Denied badge + Required/Effective bitmask display, and (when matches exist) a small table showing which LdapGroup matched at which level with which flags. Admin csproj adds ProjectReference to Core — the trie + NodeScope live there + were previously Server-only. Five new PermissionProbeServiceTests covering: cluster-level row grants a namespace-level read; no-group-match denies with empty Effective; matching group but insufficient flags (Browse+Read vs WriteOperate required) denies with correct Effective bitmask; cross-cluster grants stay isolated (c2's WriteOperate does NOT leak into c1's probe); generation isolation (gen1's Read-only does NOT let gen2's WriteOperate-requiring probe pass). Admin.Tests 92/92 passing (was 87, +5). Admin builds 0 errors. Remaining #196 slices — SignalR invalidation + draft-diff ACL section — ship in follow-up PRs so the review surface per PR stays tight.
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-20 00:28:17 -04:00
Joseph Doherty
1f3343e61f
OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198 ; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 23:16:09 -04:00
Joseph Doherty
c0751fdda5
ExternalIdReservation merge inside FinaliseBatchAsync. Closes task #197 . The FinaliseBatch docstring called this out as a narrower follow-up pending a concurrent-insert test matrix, and the CSV import UI PR ( #163 ) noted that operators would see raw DbUpdate UNIQUE-constraint messages on ZTag/SAPID collision until this landed. Now every finalised-batch row reserves ZTag + SAPID in the same EF transaction as the Equipment inserts, so either both commit atomically or neither does. New MergeReservation helper handles the four outcomes per (Kind, Value) pair: (1) value empty/whitespace → skip the reservation entirely (operator left the optional identifier blank); (2) active reservation exists for same EquipmentUuid → bump LastPublishedAt + reuse (re-finalising a batch against the same equipment must be idempotent, e.g. a retry after a transient DB blip); (3) active reservation exists for a DIFFERENT EquipmentUuid → throw ExternalIdReservationConflictException with the conflicting UUID + originating cluster + first-published timestamp so operator sees exactly who owns the value + where to resolve it (release via sp_ReleaseExternalIdReservation or pick a new ZTag); (4) no active reservation → create a fresh row with FirstPublishedBy = batch.CreatedBy + FirstPublishedAt = transaction time. Pre-commit overlap scan uses one round-trip (WHERE Kind+Value IN the batch's distinct sets, filtered to ReleasedAt IS NULL so explicitly-released values can be re-issued per decision #124 ) + caches the results in a Dictionary keyed on (Kind, value.ToLowerInvariant()) for O(1) lookup during the row loop. Race-safety catch: if another finalise commits between our cache-load + our SaveChanges, SQL Server surfaces a 2601/2627 unique-index violation against UX_ExternalIdReservation_KindValue_Active — IsReservationUniquenessViolation walks the inner-exception chain for that specific signature + rethrows as ExternalIdReservationConflictException so the UI shows a clean message instead of a raw DbUpdateException. The index-name match means unrelated filtered-unique violations (future indices) don't get mis-classified. Test-fixture Row() helper updated to generate unique SAPID per row (sap-{ZTag}) — the prior shared SAPID="sap" worked only because reservations didn't exist; two rows sharing a SAPID under different EquipmentUuids now collide as intended by decision #124 's fleet-wide uniqueness rule. Four new tests: (a) finalise creates both ZTag + SAPID reservations with expected Kind + Value; (b) re-finalising same EquipmentUuid's ZTag from a different batch does not create a duplicate (LastPublishedAt refresh only); (c) different EquipmentUuid claiming the same ZTag throws ExternalIdReservationConflictException with the ZTag value in the message + Equipment row for the second batch is NOT inserted (transaction rolled back cleanly); (d) row with empty ZTag + empty SAPID skips reservation entirely. Full Admin.Tests suite 85/85 passing (was 81 before this PR, +4). Admin project builds 0 errors. Note: the InMemory EF provider doesn't enforce filtered-unique indices, so the IsReservationUniquenessViolation catch is exercised only in the SQL Server integration path — the in-memory tests cover the cache-level conflict detection in MergeReservation instead, which is the first line of defence + catches the same-batch + published-vs-staged cases. The DbUpdate catch protects only the last-second race where two concurrent transactions both passed the cache check.
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 23:02:31 -04:00
Joseph Doherty
5ee510dc1a
UnsTab native HTML5 drag/drop + 409 concurrent-edit modal + optimistic-concurrency commit path. Closes UI slice of task #153 (Phase 6.4 Stream A UI follow-up). Playwright E2E smoke is split into new task #199 — Playwright install + WebApplicationFactory + seeded-DB harness is genuinely its own infra-setup PR. Native HTML5 attributes (draggable, @ondragstart, @ondragover, @ondragleave, @ondrop) deliberately over MudBlazor per the task title — no MudBlazor ever joins this project. Two new service methods on UnsService land the data layer the existing UnsImpactAnalyzer assumed but which didn't actually exist: (1) LoadSnapshotAsync(generationId) — walks UnsAreas + UnsLines + per-line equipment counts + builds a UnsTreeSnapshot including a 16-char SHA-256 revision token computed deterministically over the sorted (kind, id, parent, name, notes) tuple-set so it's stable across processes + changes whenever any row is added / modified / deleted; (2) MoveLineAsync(generationId, expectedToken, lineId, targetAreaId) — re-parents one line inside the same draft under an EF transaction, recomputes the current revision token from freshly-loaded rows, and throws DraftRevisionConflictException when the caller-supplied token no longer matches. Token mismatch means another operator mutated the draft between preview + commit + the move rolls back rather than clobbering their work. No-op same-area drop is a silent return. Cross-generation move is prevented by the generationId filter on the transaction reads. UnsTab.razor gains draggable="true" on every line row with @ondragstart capturing the LineId into _dragLineId, and every area row is a drop target (@ondragover with :preventDefault so the browser accepts drops, @ondrop kicking off OnLineDroppedAsync). Drop path loads a fresh snapshot, builds a UnsMoveOperation(Kind=LineMove, source/target cluster matching because cross-cluster is decision-#82 rejected), runs UnsImpactAnalyzer.Analyze + shows a Bootstrap modal rendered inline in the component — modal shows HumanReadableSummary + equipment/tag counts + any CascadeWarnings list. Confirm button calls MoveLineAsync with the snapshot's RevisionToken; DraftRevisionConflictException surfaces a separate red-header "Draft changed — refresh required" modal with a Reload button that re-fetches areas + lines from the DB. New DraftRevisionConflictException in UnsService.cs, co-located with the service that throws it. Five new UnsServiceMoveTests covering LoadSnapshotAsync (areas + lines + equipment counts), RevisionToken stability between two reads, RevisionToken changes on AddLineAsync, MoveLineAsync happy path reparents the line in the DB, MoveLineAsync with stale token throws DraftRevisionConflictException + leaves the DB unchanged. Admin suite 81/81 passing (was 76, +5). Admin project builds 0 errors. Task #199 captures the deferred Playwright E2E smoke — drag a line onto a different area in a real browser, assert preview modal contents, click Confirm, assert the line row shows the new area. That PR stands up a new tests/ZB.MOM.WW.OtOpcUa.Admin.E2ETests project with Playwright + WebApplicationFactory + seeded InMemory DbContext.
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 22:30:48 -04:00
Joseph Doherty
13d5a7968b
Admin RedundancyTab — per-cluster read-only topology view. Closes the UI slice of task #149 (Phase 6.3 Stream E — Admin UI RedundancyTab + OpenTelemetry metrics + SignalR); the OpenTelemetry metrics + RoleChanged SignalR push are split into new follow-up task #198 because each is a structural add that deserves its own test matrix + NuGet-dep decision rather than riding this UI PR. New /clusters/{ClusterId} Redundancy tab slotted between ACLs and Audit in the existing ClusterDetail tab bar. Shows each ClusterNode row in the cluster with columns Node / Role (Primary green, Secondary blue, Standalone primary-blue badge) / Host / OPC UA port / ServiceLevel base / ApplicationUri (text-break so the long urn: doesn't blow out the table) / Enabled badge / Last seen (relative age via the same FormatAge helper as Hosts.razor, with a yellow "Stale" chip once LastSeenAt crosses the 30s threshold shared with HostStatusService.StaleThreshold — a missed heartbeat plus clock-skew buffer). Four summary cards above the table — total Nodes, Primary count, Secondary count, Stale count. Two guard-rail alerts: (a) red "No Primary or Standalone" when the cluster has no authoritative write target (all rows are Secondaries — read-only until one is promoted by the server-side RedundancyCoordinator apply-lease flow); (b) red "Split-brain" when >1 Primary exists — apply-lease enforcement at the coordinator level should have made this impossible, so the alert implies a hand-edited DB row + an investigation. New ClusterNodeService with ListByClusterAsync (ordered by ServiceLevelBase descending so Primary rows with higher base float to the top) + a static IsStale predicate matching HostStatusService's 30s convention. DI-registered alongside the existing scoped services in Program.cs. Writes (role swap, enable/disable) are deliberately absent from the service — they go through the RedundancyCoordinator apply-lease flow on the server side + direct DB mutation from Admin would race with it. New ClusterNodeServiceTests covering IsStale across null/recent/old LastSeenAt + ListByClusterAsync ordering + cluster filter. 4/4 new tests passing; full Admin.Tests suite 76/76 (was 72 before this PR, +4). Admin project builds 0 errors. Task #198 captures the deferred work: (1) OpenTelemetry Meter for primary/secondary/stale counts + role_transition counter with from/to/node tags + OTLP exporter config; (2) RoleChanged SignalR push — extend FleetStatusPoller to detect RedundancyRole changes on ClusterNode rows + emit a RoleChanged hub message so the RedundancyTab refreshes instantly instead of on-page-load polling.
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 22:14:25 -04:00
Joseph Doherty
d06cc01a48
Admin /hosts red-badge + resilience columns + Polly telemetry observer. Closes task #164 (the remaining slice of Phase 6.1 Stream E.3 after the earlier publisher + hub PR). Three cooperating pieces wired together so the operator-facing /hosts table actually reflects the live Polly counters that the pipeline builder is producing. DriverResiliencePipelineBuilder gains an optional DriverResilienceStatusTracker ctor param — when non-null, every built pipeline wires Polly's OnRetry/OnOpened/OnClosed strategy-options callbacks into the tracker. OnRetry → tracker.RecordFailure (so ConsecutiveFailures climbs per retry), OnOpened → tracker.RecordBreakerOpen (stamps LastCircuitBreakerOpenUtc), OnClosed → tracker.RecordSuccess (resets the failure counter once the target recovers). Absent tracker = silent, preserving the unit-test constructor path + any deployment that doesn't care about resilience observability. Cancellation stays excluded from the failure count via the existing ShouldHandle predicate. HostStatusService.HostStatusRow extends with four new fields — ConsecutiveFailures, LastCircuitBreakerOpenUtc, CurrentBulkheadDepth, LastRecycleUtc — populated via a second LEFT JOIN onto DriverInstanceResilienceStatuses keyed on (DriverInstanceId, HostName). LEFT JOIN because brand-new hosts haven't been sampled yet; a missing row means zero failures + never-opened breaker, which is the correct default. New FailureFlagThreshold constant (=3, matches plan decision #143 's conservative half-of-breaker convention) + IsFlagged predicate so the UI can pre-warn before the breaker actually trips. Hosts.razor paints three new columns between State and Last-transition — Fail# (bold red when flagged), In-flight (bulkhead-depth proxy), Breaker-opened (relative age). Per-row "Flagged" red badge alongside State when IsFlagged is true. Above the first cluster table, a red alert banner summarises the flagged-host count when ≥1 host is flagged, so operators see the problem before scanning rows. Three new tests in DriverResiliencePipelineBuilderTests — Tracker_RecordsFailure_OnEveryRetry verifies ConsecutiveFailures reaches RetryCount after a transient-forever operation, Tracker_StampsBreakerOpen_WhenBreakerTrips verifies LastBreakerOpenUtc is set after threshold failures on a Write pipeline, Tracker_IsolatesCounters_PerHost verifies one dead host does not leak failure counts into a healthy sibling. Full suite — Core.Tests 14/14 resilience-builder tests passing (11 existing + 3 new), Admin.Tests 72/72 passing, Admin project builds 0 errors. SignalR live push of status changes + browser visual review are deliberately left to a follow-up — this PR keeps the structural change minimal (polling refresh already exists in the page's 10s timer; SignalR would be a structural add that touches hub registration + client subscription).
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 21:35:54 -04:00
Joseph Doherty
ad131932d3
Phase 6.4 Stream B.2-B.4 server-side — EquipmentImportBatch staging + FinaliseBatch transaction
...
Closes the server-side/data-layer piece of Phase 6.4 Stream B.2-B.4. The
CSV-import preview + modal UI (Stream B.3/B.5) still belongs to the Admin
UI follow-up — this PR owns the staging tables + atomic finalise alone.
Configuration:
- New EquipmentImportBatch entity (Id, ClusterId, CreatedBy, CreatedAtUtc,
RowsStaged/Accepted/Rejected, FinalisedAtUtc?). Composite index on
(CreatedBy, FinalisedAtUtc) powers the Admin preview modal's "my open
batches" query.
- New EquipmentImportRow entity — one row per CSV row, 8 required columns
from decision #117 + 9 optional from decision #139 + IsAccepted flag +
RejectReason. FK to EquipmentImportBatch with cascade delete so
DropBatch collapses the whole tree.
- EF migration 20260419_..._AddEquipmentImportBatch.
- SchemaComplianceTests expected tables list gains the two new tables.
Admin.Services.EquipmentImportBatchService:
- CreateBatchAsync — new header row, caller-supplied ClusterId + CreatedBy.
- StageRowsAsync(batchId, acceptedRows, rejectedRows) — bulk-inserts the
parsed CSV rows into staging. Rejected rows carry LineNumberInFile +
RejectReason for the preview modal. Throws when the batch is finalised.
- DropBatchAsync — removes batch + cascaded rows. Throws when the batch
was already finalised (rollback via staging is not a time machine).
- FinaliseBatchAsync(batchId, generationId, driverInstanceId, unsLineId) —
atomic apply. Opens an EF transaction when the provider supports it
(SQL Server in prod; InMemory in tests skips the tx), bulk-inserts
every accepted staging row into Equipment, stamps
EquipmentImportBatch.FinalisedAtUtc, commits. Failure rolls back so
Equipment never partially mutates. Idempotent-under-double-call:
second finalise throws ImportBatchAlreadyFinalisedException.
- ListByUserAsync(createdBy, includeFinalised) — the Admin preview modal's
backing query. OrderByDescending on CreatedAtUtc so the most-recent
batch shows first.
- Two exception types: ImportBatchNotFoundException +
ImportBatchAlreadyFinalisedException.
ExternalIdReservation merging (ZTag + SAPID fleet-wide uniqueness) is NOT
done here — a narrower follow-up wires it once the concurrent-insert test
matrix is green.
Tests (10 new EquipmentImportBatchServiceTests, all pass):
- CreateBatch populates Id + CreatedAtUtc + zero-ed counters.
- StageRows accepted + rejected both persist; counters advance.
- DropBatch cascades row delete.
- DropBatch after finalise throws.
- Finalise translates accepted staging rows → Equipment under the target
GenerationId + DriverInstanceId + UnsLineId.
- Finalise twice throws.
- Finalise of unknown batch throws.
- Stage after finalise throws.
- ListByUserAsync filters by creator + finalised flag.
- Drop of unknown batch is a no-op (idempotent rollback).
Full solution dotnet test: 1235 passing (was 1225, +10). Pre-existing
Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 14:55:39 -04:00
Joseph Doherty
560a961cca
Phase 6.4 Stream A + B data layer — UnsImpactAnalyzer + EquipmentCsvImporter (parser)
...
Ships the pure-logic data layer of Phase 6.4. Blazor UI pieces
(UnsTab drag/drop, CSV import modal, preview table, FinaliseImportBatch txn,
staging tables) are deferred to visual-compliance follow-ups (tasks #153 ,
#155 , #157 ).
Admin.Services additions:
- UnsImpactAnalyzer.Analyze(snapshot, move) — pure-function, no I/O. Three
move variants: LineMove, AreaRename, LineMerge. Returns UnsImpactPreview
with AffectedEquipmentCount + AffectedTagCount + CascadeWarnings +
RevisionToken + HumanReadableSummary the Admin UI shows in the confirm
modal. Cross-cluster moves rejected with CrossClusterMoveRejectedException
per decision #82 . Missing source/target throws UnsMoveValidationException.
Surfaces sibling-line same-name ambiguity as a cascade warning.
- DraftRevisionToken — opaque revision fingerprint. Preview captures the
token; Confirm compares it. The 409-concurrent-edit UX plumbs through on
the Razor-page follow-up (task #153 ). Matches(other) is null-safe.
- UnsTreeSnapshot + UnsAreaSummary + UnsLineSummary — snapshot shape the
caller hands to the analyzer. Tests build them in-memory without a DB.
- EquipmentCsvImporter.Parse(csvText) — RFC 4180 CSV parser per decision #95 .
Version-marker contract: line 1 must be "# OtOpcUaCsv v1" (future shapes
bump the version). Required columns from decision #117 + optional columns
from decision #139 . Rejects unknown columns, duplicate column names,
blank required fields, duplicate ZTags within the file. Quoted-field
handling supports embedded commas + escaped "" quotes. Returns
EquipmentCsvParseResult { AcceptedRows, RejectedRows } so the preview
modal renders accept/reject counts without re-parsing.
Tests (22 new, all pass):
- UnsImpactAnalyzerTests (9): line move counts equipment + tags; cross-
cluster throws; unknown source/target throws validation; ambiguous same-
name target raises warning; area rename sums across lines; line merge
cross-area warns; same-area merge no warning; DraftRevisionToken matches
semantics.
- EquipmentCsvImporterTests (13): empty file throws; missing version marker;
missing required column; unknown column; duplicate column; valid single
row round-trips; optional columns populate when present; blank required
field rejects row; duplicate ZTag rejects second; RFC 4180 quoted fields
with commas + escaped quotes; mismatched column count rejects; blank
lines between rows ignored; required + optional column constants match
decisions #117 + #139 exactly.
Full solution dotnet test: 1159 passing (Phase 6.3 = 1137, Phase 6.4 A+B
data = +22). Pre-existing Client.CLI Subscribe flake unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 10:09:47 -04:00
Joseph Doherty
3b8280f08a
Phase 6.2 Stream D (data layer) — ValidatedNodeAclAuthoringService with write-time invariants
...
Ships the non-UI piece of Stream D: a draft-aware write surface over NodeAcl
that enforces the Phase 6.2 plan's scope-uniqueness + grant-shape invariants.
Blazor UI pieces (RoleGrantsTab + AclsTab refresh + SignalR invalidation +
visual-compliance reviewer signoff) are deferred to the Phase 6.1-style
follow-up task.
Admin.Services:
- ValidatedNodeAclAuthoringService — alongside existing NodeAclService (raw
CRUD, kept for read + revoke paths). GrantAsync enforces:
* Permissions != None (decision #129 — additive only, no empty grants).
* Cluster scope has null ScopeId.
* Sub-cluster scope requires a populated ScopeId.
* No duplicate (GenerationId, ClusterId, LdapGroup, ScopeKind, ScopeId)
tuple — operator updates the row instead of inserting a duplicate.
UpdatePermissionsAsync also rejects None (operator revokes via NodeAclService).
Violations throw InvalidNodeAclGrantException.
Tests (10 new in Admin.Tests/ValidatedNodeAclAuthoringServiceTests):
- Grant rejects None permissions.
- Grant rejects Cluster-scope with ScopeId / sub-cluster without ScopeId.
- Grant succeeds on well-formed row.
- Grant rejects duplicate (group, scope) in same draft.
- Grant allows same group at different scope.
- Grant allows same (group, scope) in different draft.
- UpdatePermissions rejects None.
- UpdatePermissions round-trips new flags + notes.
- UpdatePermissions on unknown rowid throws.
Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Admin.Tests csproj.
Full solution dotnet test: 1097 passing (was 1087, +10). Phase 6.2 total is
now 1087+10 = 1097; baseline 906 → +191 net across Phase 6.1 (all streams) +
Phase 6.2 (Streams A, B, C foundation, D data layer).
Stream D follow-up task tracks: RoleGrantsTab CRUD over LdapGroupRoleMapping,
AclsTab write-through + Probe-this-permission diagnostic, draft-diff ACL
section, SignalR PermissionTrieCache invalidation push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 09:39:06 -04:00
Joseph Doherty
ef2a810b2d
Phase 3 PR 34 — Host-status publisher (Server) + /hosts drill-down page (Admin). Closes LMX follow-up #7 by wiring together the data layer from PR 33. Server.HostStatusPublisher is a BackgroundService that walks every driver registered in DriverHost every 10 seconds, skips drivers that don't implement IHostConnectivityProbe, calls GetHostStatuses() on each probe-capable driver, and upserts one DriverHostStatus row per (NodeId, DriverInstanceId, HostName) into the central config DB. Upsert path: SingleOrDefaultAsync on the composite PK; if no row exists, Add a new one; if a row exists, LastSeenUtc advances unconditionally (heartbeat) and State + StateChangedUtc update only on transitions so Admin UI can distinguish 'still reporting, still Running' from 'freshly transitioned to Running'. MapState translates Core.Abstractions.HostState to Configuration.Enums.DriverHostState (intentional duplicate enum — Configuration project stays free of driver-runtime deps per PR 33's choice). If a driver's GetHostStatuses throws, log warning and skip that driver this tick — never take down the Server on a publisher failure. If the DB is unreachable, log warning + retry next heartbeat (no buffering — next tick's current-state snapshot is more useful than replaying stale transitions after a long outage). 2-second startup delay so NodeBootstrap's RegisterAsync calls land before the first publish tick, then tick runs immediately so a freshly-started Server surfaces its host topology in the Admin UI without waiting a full interval.
...
Polling chosen over event-driven for initial scope: simpler, matches Admin UI consumer cadence, avoids DriverHost lifecycle-event plumbing that doesn't exist today. Event-driven push for sub-heartbeat latency is a straightforward follow-up.
Admin.Services.HostStatusService left-joins DriverHostStatus against ClusterNode on NodeId so rows persist even when the ClusterNode entry doesn't exist yet (first-boot bootstrap case). StaleThreshold = 30s — covers one missed publisher heartbeat plus a generous buffer for clock skew and GC pauses. Admin Components/Pages/Hosts.razor — FleetAdmin-visible page grouped by cluster (handles the '(unassigned)' case for rows without a matching ClusterNode). Four summary cards (Hosts / Running / Stale / Faulted); per-cluster table with Node / Driver / Host / State + Stale-badge / Last-transition / Last-seen / Detail columns; 10s auto-refresh via IServiceScopeFactory timer pattern matching FleetStatusPoller + Fleet dashboard (PR 27). Row-class highlighting: Faulted → table-danger, Stale → table-warning, else default. State badge maps DriverHostState enum to bootstrap color classes. Sidebar link added between 'Fleet status' and 'Clusters'.
Server csproj adds Microsoft.EntityFrameworkCore.SqlServer 10.0.0 + registers OtOpcUaConfigDbContext in Program.cs scoped via NodeOptions.ConfigDbConnectionString (no Admin-style manual SQL raw — the DbContext is the only access path, keeps migrations owner-of-record).
Tests — HostStatusPublisherTests (4 new Integration cases, uses per-run throwaway DB matching the FleetStatusPollerTests pattern): publisher upserts one row per host from each probe-capable driver and skips non-probe drivers; second tick advances LastSeenUtc without creating duplicate rows (upsert pattern verified end-to-end); state change between ticks updates State AND StateChangedUtc (datetime2(3) rounds to millisecond precision so comparison uses 1ms tolerance — documented inline); MapState translates every HostState enum member. Server.Tests Integration: 4 new tests pass. Admin build clean, Admin.Tests Unit still 23 / 0. docs/v2/lmx-followups.md item #7 marked DONE with three explicit deferred items (event-driven push, failure-count column, SignalR fan-out).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-18 15:51:55 -04:00
Joseph Doherty
ed88835d34
Phase 3 PR 28 — Admin UI cert-trust management page. New /certificates route (FleetAdmin-only) surfaces the OPC UA server's PKI store rejected + trusted certs and gives operators Trust / Delete / Revoke actions so rejected client certs can be promoted without touching disk. CertTrustService reads $PkiStoreRoot/{rejected,trusted}/certs/*.der files directly via X509CertificateLoader — no Opc.Ua dependency in the Admin project, which keeps the Admin host runnable on a machine that doesn't have the full Server install locally (only needs the shared PKI directory reachable; typical deployment has Admin + Server side-by-side on the same box and PkiStoreRoot defaults match so a plain-vanilla install needs no override). CertTrustOptions bound from the Admin's 'CertTrust:PkiStoreRoot' section, default %ProgramData%\OtOpcUa\pki (matches OpcUaServerOptions.PkiStoreRoot default). Trust action moves the .der from rejected/certs/ to trusted/certs/ via File.Move(overwrite:true) — idempotent, tolerates a concurrent operator doing the same move. Delete wipes the file. Revoke removes from trusted/certs/ (Opc.Ua re-reads the Directory store on each new client handshake, so no explicit reload signal is needed; operators retry the rejected connection after trusting). Thumbprint matching is case-insensitive because X509Certificate2.Thumbprint is upper-case hex but operators copy-paste from logs that sometimes lowercase it. Malformed files in the store are logged + skipped — a single bad .der can't take the whole management page offline. Missing store directories produce empty lists rather than exceptions so a pristine install (Server never run yet, no rejected/trusted dirs yet) doesn't crash the page.
...
Razor page layout: two tables (Rejected / Trusted) with Subject / Issuer / Thumbprint / Valid-window / Actions columns, status banner after each action with success or warning kind ('file missing' = another admin handled it), FleetAdmin-only via [Authorize(Roles=AdminRoles.FleetAdmin)]. Each action invokes LogActionAsync which Serilog-logs the authenticated admin user + thumbprint + action for an audit trail — DB-level ConfigAuditLog persistence is deferred because its schema is cluster-scoped and cert actions are cluster-agnostic; Serilog + CertTrustService's filesystem-op info logs give the forensic trail in the meantime. Sidebar link added to MainLayout between Reservations and the future Account page.
Tests — CertTrustServiceTests (9 new unit cases): ListRejected parses Subject + Thumbprint + store kind from a self-signed test cert written into rejected/certs/; rejected and trusted stores are kept separate; TrustRejected moves the file and the Rejected list is empty afterwards; TrustRejected with a thumbprint not in rejected returns false without touching trusted; DeleteRejected removes the file; UntrustCert removes from trusted only; thumbprint match is case-insensitive (operator UX); missing store directories produce empty lists instead of throwing DirectoryNotFoundException (pristine-install tolerance); a junk .der in the store is logged + skipped and the valid certs still surface (one bad file doesn't break the page). Full Admin.Tests Unit suite: 23 pass / 0 fail (14 prior + 9 new). Full Admin build clean — 0 errors, 0 warnings.
lmx-followups.md #3 marked DONE with a cross-reference to this PR and a note that flipping AutoAcceptUntrustedClientCertificates to false as the production default is a deployment-config follow-up, not a code gap — the Admin UI is now ready to be the trust gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-18 14:37:55 -04:00
Joseph Doherty
7a5b535cd6
Phase 1 Stream E Admin UI — finish Blazor pages so operators can run the draft → publish → rollback workflow end-to-end without hand-executing SQL. Adds eight new scoped services that wrap the Configuration stored procs + managed validators: EquipmentService (CRUD with auto-derived EquipmentId per decision #125 ), UnsService (areas + lines), NamespaceService, DriverInstanceService (generic JSON DriverConfig editor per decision #94 — per-driver schema validation lands in each driver's phase), NodeAclService (grant + revoke with bundled-preset permission sets; full per-flag editor + bulk-grant + permission simulator deferred to v2.1), ReservationService (fleet-wide active + released reservation inspector + FleetAdmin-only sp_ReleaseExternalIdReservation wrapper with required-reason invariant), DraftValidationService (hydrates a DraftSnapshot from the draft's rows plus prior-cluster Equipment + active reservations, runs the managed DraftValidator to surface every rule in one pass for inline validation panel), AuditLogService (recent ConfigAuditLog reader). Pages: /clusters list with create-new shortcut; /clusters/new wizard that creates the cluster row + initial empty draft in one go; /clusters/{id} detail with 8 tabs (Overview / Generations / Equipment / UNS Structure / Namespaces / Drivers / ACLs / Audit) — tabs that write always target the active draft, published generations stay read-only; /clusters/{id}/draft/{gen} editor with live validation panel (errors list with stable code + message + context; publish button disabled while any error exists) and tab-embedded sub-components; /clusters/{id}/draft/{gen}/diff three-column view backed by sp_ComputeGenerationDiff with Added/Removed/Modified badges; Generations tab with per-row rollback action wired to sp_RollbackToGeneration; /reservations FleetAdmin-only page (CanPublish policy) with active + released lists and a modal release dialog that enforces non-empty reason and round-trips through sp_ReleaseExternalIdReservation; /login scaffold with stub credential accept + FleetAdmin-role cookie issuance (real LDAP bind via the ScadaLink-parity LdapAuthService is deferred until live GLAuth integration — marked in the login view and in the Phase 1 partial-exit TODO). Layout: sidebar gets Overview / Clusters / Reservations + AuthorizeView with signed-in username + roles + sign-out POST to /auth/logout; cascading authentication state registered for <AuthorizeView> to work in RenderMode.InteractiveServer. Integration testing: AdminServicesIntegrationTests creates a throwaway per-run database (same pattern as the Configuration test fixture), applies all three migrations, and exercises (1) create-cluster → add-namespace+UNS+driver+equipment → validate (expects zero errors) → publish (expects Published status) → rollback (expects one new Published + at least one Superseded); (2) cross-cluster namespace binding draft → validates to BadCrossClusterNamespaceBinding per decision #122 . Old flat Components/Pages/Clusters.razor moved to Components/Pages/Clusters/ClustersList.razor so the Clusters folder can host tab sub-components without the razor generator creating a type-and-namespace collision. Dev appsettings.json connection string switched from Integrated Security to sa auth to match the otopcua-mssql container on port 14330 (remapped from 1433 to coexist with the native MSSQL14 Galaxy ZB instance). Browser smoke test completed: home page, clusters list, new-cluster form, cluster detail with a seeded row, reservations (redirected to login for anon user) all return 200 / 302-to-login as expected; full solution 928 pass / 1 pre-existing Phase 0 baseline failure. Phase 1 Stream E items explicitly deferred with TODOs: CSV import for Equipment, SignalR FleetStatusHub + AlertHub real-time push, bulk-grant workflow, permission-simulator trie, merge-equipment draft, AppServer-via-OI-Gateway end-to-end smoke test (decision #142 ), and the real LDAP bind replacing the Login page stub.
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-17 21:52:42 -04:00
Joseph Doherty
01fd90c178
Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124 , sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122 ), reservation pre-flight, EquipmentId derivation (decision #125 ), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79 ); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28 ) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142 ), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-17 21:35:25 -04:00