Files
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Phase 1 Stream E Admin UI — finish Blazor pages so operators can run the draft → publish → rollback workflow end-to-end without hand-executing SQL. Adds eight new scoped services that wrap the Configuration stored procs + managed validators: EquipmentService (CRUD with auto-derived EquipmentId per decision #125), UnsService (areas + lines), NamespaceService, DriverInstanceService (generic JSON DriverConfig editor per decision #94 — per-driver schema validation lands in each driver's phase), NodeAclService (grant + revoke with bundled-preset permission sets; full per-flag editor + bulk-grant + permission simulator deferred to v2.1), ReservationService (fleet-wide active + released reservation inspector + FleetAdmin-only sp_ReleaseExternalIdReservation wrapper with required-reason invariant), DraftValidationService (hydrates a DraftSnapshot from the draft's rows plus prior-cluster Equipment + active reservations, runs the managed DraftValidator to surface every rule in one pass for inline validation panel), AuditLogService (recent ConfigAuditLog reader). Pages: /clusters list with create-new shortcut; /clusters/new wizard that creates the cluster row + initial empty draft in one go; /clusters/{id} detail with 8 tabs (Overview / Generations / Equipment / UNS Structure / Namespaces / Drivers / ACLs / Audit) — tabs that write always target the active draft, published generations stay read-only; /clusters/{id}/draft/{gen} editor with live validation panel (errors list with stable code + message + context; publish button disabled while any error exists) and tab-embedded sub-components; /clusters/{id}/draft/{gen}/diff three-column view backed by sp_ComputeGenerationDiff with Added/Removed/Modified badges; Generations tab with per-row rollback action wired to sp_RollbackToGeneration; /reservations FleetAdmin-only page (CanPublish policy) with active + released lists and a modal release dialog that enforces non-empty reason and round-trips through sp_ReleaseExternalIdReservation; /login scaffold with stub credential accept + FleetAdmin-role cookie issuance (real LDAP bind via the ScadaLink-parity LdapAuthService is deferred until live GLAuth integration — marked in the login view and in the Phase 1 partial-exit TODO). Layout: sidebar gets Overview / Clusters / Reservations + AuthorizeView with signed-in username + roles + sign-out POST to /auth/logout; cascading authentication state registered for <AuthorizeView> to work in RenderMode.InteractiveServer. Integration testing: AdminServicesIntegrationTests creates a throwaway per-run database (same pattern as the Configuration test fixture), applies all three migrations, and exercises (1) create-cluster → add-namespace+UNS+driver+equipment → validate (expects zero errors) → publish (expects Published status) → rollback (expects one new Published + at least one Superseded); (2) cross-cluster namespace binding draft → validates to BadCrossClusterNamespaceBinding per decision #122. Old flat Components/Pages/Clusters.razor moved to Components/Pages/Clusters/ClustersList.razor so the Clusters folder can host tab sub-components without the razor generator creating a type-and-namespace collision. Dev appsettings.json connection string switched from Integrated Security to sa auth to match the otopcua-mssql container on port 14330 (remapped from 1433 to coexist with the native MSSQL14 Galaxy ZB instance). Browser smoke test completed: home page, clusters list, new-cluster form, cluster detail with a seeded row, reservations (redirected to login for anon user) all return 200 / 302-to-login as expected; full solution 928 pass / 1 pre-existing Phase 0 baseline failure. Phase 1 Stream E items explicitly deferred with TODOs: CSV import for Equipment, SignalR FleetStatusHub + AlertHub real-time push, bulk-grant workflow, permission-simulator trie, merge-equipment draft, AppServer-via-OI-Gateway end-to-end smoke test (decision #142), and the real LDAP bind replacing the Login page stub.
Admin RedundancyTab — per-cluster read-only topology view. Closes the UI slice of task #149 (Phase 6.3 Stream E — Admin UI RedundancyTab + OpenTelemetry metrics + SignalR); the OpenTelemetry metrics + RoleChanged SignalR push are split into new follow-up task #198 because each is a structural add that deserves its own test matrix + NuGet-dep decision rather than riding this UI PR. New /clusters/{ClusterId} Redundancy tab slotted between ACLs and Audit in the existing ClusterDetail tab bar. Shows each ClusterNode row in the cluster with columns Node / Role (Primary green, Secondary blue, Standalone primary-blue badge) / Host / OPC UA port / ServiceLevel base / ApplicationUri (text-break so the long urn: doesn't blow out the table) / Enabled badge / Last seen (relative age via the same FormatAge helper as Hosts.razor, with a yellow "Stale" chip once LastSeenAt crosses the 30s threshold shared with HostStatusService.StaleThreshold — a missed heartbeat plus clock-skew buffer). Four summary cards above the table — total Nodes, Primary count, Secondary count, Stale count. Two guard-rail alerts: (a) red "No Primary or Standalone" when the cluster has no authoritative write target (all rows are Secondaries — read-only until one is promoted by the server-side RedundancyCoordinator apply-lease flow); (b) red "Split-brain" when >1 Primary exists — apply-lease enforcement at the coordinator level should have made this impossible, so the alert implies a hand-edited DB row + an investigation. New ClusterNodeService with ListByClusterAsync (ordered by ServiceLevelBase descending so Primary rows with higher base float to the top) + a static IsStale predicate matching HostStatusService's 30s convention. DI-registered alongside the existing scoped services in Program.cs. Writes (role swap, enable/disable) are deliberately absent from the service — they go through the RedundancyCoordinator apply-lease flow on the server side + direct DB mutation from Admin would race with it. New ClusterNodeServiceTests covering IsStale across null/recent/old LastSeenAt + ListByClusterAsync ordering + cluster filter. 4/4 new tests passing; full Admin.Tests suite 76/76 (was 72 before this PR, +4). Admin project builds 0 errors. Task #198 captures the deferred work: (1) OpenTelemetry Meter for primary/secondary/stale counts + role_transition counter with from/to/node tags + OTLP exporter config; (2) RoleChanged SignalR push — extend FleetStatusPoller to detect RedundancyRole changes on ClusterNode rows + emit a RoleChanged hub message so the RedundancyTab refreshes instantly instead of on-page-load polling.
ExternalIdReservation merge inside FinaliseBatchAsync. Closes task #197. The FinaliseBatch docstring called this out as a narrower follow-up pending a concurrent-insert test matrix, and the CSV import UI PR (#163) noted that operators would see raw DbUpdate UNIQUE-constraint messages on ZTag/SAPID collision until this landed. Now every finalised-batch row reserves ZTag + SAPID in the same EF transaction as the Equipment inserts, so either both commit atomically or neither does. New MergeReservation helper handles the four outcomes per (Kind, Value) pair: (1) value empty/whitespace → skip the reservation entirely (operator left the optional identifier blank); (2) active reservation exists for same EquipmentUuid → bump LastPublishedAt + reuse (re-finalising a batch against the same equipment must be idempotent, e.g. a retry after a transient DB blip); (3) active reservation exists for a DIFFERENT EquipmentUuid → throw ExternalIdReservationConflictException with the conflicting UUID + originating cluster + first-published timestamp so operator sees exactly who owns the value + where to resolve it (release via sp_ReleaseExternalIdReservation or pick a new ZTag); (4) no active reservation → create a fresh row with FirstPublishedBy = batch.CreatedBy + FirstPublishedAt = transaction time. Pre-commit overlap scan uses one round-trip (WHERE Kind+Value IN the batch's distinct sets, filtered to ReleasedAt IS NULL so explicitly-released values can be re-issued per decision #124) + caches the results in a Dictionary keyed on (Kind, value.ToLowerInvariant()) for O(1) lookup during the row loop. Race-safety catch: if another finalise commits between our cache-load + our SaveChanges, SQL Server surfaces a 2601/2627 unique-index violation against UX_ExternalIdReservation_KindValue_Active — IsReservationUniquenessViolation walks the inner-exception chain for that specific signature + rethrows as ExternalIdReservationConflictException so the UI shows a clean message instead of a raw DbUpdateException. The index-name match means unrelated filtered-unique violations (future indices) don't get mis-classified. Test-fixture Row() helper updated to generate unique SAPID per row (sap-{ZTag}) — the prior shared SAPID="sap" worked only because reservations didn't exist; two rows sharing a SAPID under different EquipmentUuids now collide as intended by decision #124's fleet-wide uniqueness rule. Four new tests: (a) finalise creates both ZTag + SAPID reservations with expected Kind + Value; (b) re-finalising same EquipmentUuid's ZTag from a different batch does not create a duplicate (LastPublishedAt refresh only); (c) different EquipmentUuid claiming the same ZTag throws ExternalIdReservationConflictException with the ZTag value in the message + Equipment row for the second batch is NOT inserted (transaction rolled back cleanly); (d) row with empty ZTag + empty SAPID skips reservation entirely. Full Admin.Tests suite 85/85 passing (was 81 before this PR, +4). Admin project builds 0 errors. Note: the InMemory EF provider doesn't enforce filtered-unique indices, so the IsReservationUniquenessViolation catch is exercised only in the SQL Server integration path — the in-memory tests cover the cache-level conflict detection in MergeReservation instead, which is the first line of defence + catches the same-batch + published-vs-staged cases. The DbUpdate catch protects only the last-second race where two concurrent transactions both passed the cache check.
OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
UnsTab native HTML5 drag/drop + 409 concurrent-edit modal + optimistic-concurrency commit path. Closes UI slice of task #153 (Phase 6.4 Stream A UI follow-up). Playwright E2E smoke is split into new task #199 — Playwright install + WebApplicationFactory + seeded-DB harness is genuinely its own infra-setup PR. Native HTML5 attributes (draggable, @ondragstart, @ondragover, @ondragleave, @ondrop) deliberately over MudBlazor per the task title — no MudBlazor ever joins this project. Two new service methods on UnsService land the data layer the existing UnsImpactAnalyzer assumed but which didn't actually exist: (1) LoadSnapshotAsync(generationId) — walks UnsAreas + UnsLines + per-line equipment counts + builds a UnsTreeSnapshot including a 16-char SHA-256 revision token computed deterministically over the sorted (kind, id, parent, name, notes) tuple-set so it's stable across processes + changes whenever any row is added / modified / deleted; (2) MoveLineAsync(generationId, expectedToken, lineId, targetAreaId) — re-parents one line inside the same draft under an EF transaction, recomputes the current revision token from freshly-loaded rows, and throws DraftRevisionConflictException when the caller-supplied token no longer matches. Token mismatch means another operator mutated the draft between preview + commit + the move rolls back rather than clobbering their work. No-op same-area drop is a silent return. Cross-generation move is prevented by the generationId filter on the transaction reads. UnsTab.razor gains draggable="true" on every line row with @ondragstart capturing the LineId into _dragLineId, and every area row is a drop target (@ondragover with :preventDefault so the browser accepts drops, @ondrop kicking off OnLineDroppedAsync). Drop path loads a fresh snapshot, builds a UnsMoveOperation(Kind=LineMove, source/target cluster matching because cross-cluster is decision-#82 rejected), runs UnsImpactAnalyzer.Analyze + shows a Bootstrap modal rendered inline in the component — modal shows HumanReadableSummary + equipment/tag counts + any CascadeWarnings list. Confirm button calls MoveLineAsync with the snapshot's RevisionToken; DraftRevisionConflictException surfaces a separate red-header "Draft changed — refresh required" modal with a Reload button that re-fetches areas + lines from the DB. New DraftRevisionConflictException in UnsService.cs, co-located with the service that throws it. Five new UnsServiceMoveTests covering LoadSnapshotAsync (areas + lines + equipment counts), RevisionToken stability between two reads, RevisionToken changes on AddLineAsync, MoveLineAsync happy path reparents the line in the DB, MoveLineAsync with stale token throws DraftRevisionConflictException + leaves the DB unchanged. Admin suite 81/81 passing (was 76, +5). Admin project builds 0 errors. Task #199 captures the deferred Playwright E2E smoke — drag a line onto a different area in a real browser, assert preview modal contents, click Confirm, assert the line row shows the new area. That PR stands up a new tests/ZB.MOM.WW.OtOpcUa.Admin.E2ETests project with Playwright + WebApplicationFactory + seeded InMemory DbContext.