Commit Graph

5 Commits

Author SHA1 Message Date
Joseph Doherty
0687bb2e2d Phase 7 Stream F — Admin UI for scripts + test harness + historian diagnostics
Adds the draft-editor tab + page surface for authoring Phase 7 virtual tags and
scripted alarms, plus the /alarms/historian operator diagnostics page. Monaco loads
from CDN via a progressive-enhancement JS shim — the textarea works immediately so
the page is functional even if the CDN is unreachable.

## New services (Admin)

- ScriptService — CRUD for Script entity. SHA-256 SourceHash recomputed on save so
  Core.Scripting's CompiledScriptCache hits on re-publish of unchanged source + misses
  when the source actually changes.
- VirtualTagService — CRUD for VirtualTag, with Enabled toggle.
- ScriptedAlarmService — CRUD for ScriptedAlarm + lookup of persistent ScriptedAlarmState
  (logical-id-keyed per plan decision #14).
- ScriptTestHarnessService — pre-publish dry-run. Enforces plan decision #22: only
  inputs the DependencyExtractor identifies can be supplied. Missing / extra synthetic
  inputs surface as dedicated outcomes. Captures SetVirtualTag writes + Serilog events
  from the script so the operator can see both the output + the log output before
  publishing.
- HistorianDiagnosticsService — surfaces the local-process IAlarmHistorianSink state
  on /alarms/historian. Null sink reports Disabled + swallows retry. Live
  SqliteStoreAndForwardSink reports real queue depth + last-error + drain state and
  routes the Retry-dead-lettered button through.

## New UI

- ScriptsTab.razor (inside DraftEditor tabs) — list + create/edit/delete scripts with
  Monaco editor + dependency preview + test-harness run panel showing output + writes
  + log emissions.
- ScriptEditor.razor — reusable Monaco-backed textarea. Loads editor from CDN via
  wwwroot/js/monaco-loader.js. Textarea stays authoritative for Blazor binding; Monaco
  mirrors into it on every keystroke.
- AlarmsHistorian.razor (/alarms/historian) — queue depth + dead-letter depth + drain
  state badge + last-error banner + Retry-dead-lettered button.
- DraftEditor.razor — new "Scripts" tab.

## DI wiring

All five services registered in Program.cs. Null historian sink bound at Admin
composition time (real SqliteStoreAndForwardSink lives in the Server process).

## Tests — 13/13

Phase7ServicesTests covers:
- ScriptService: Add generates logical id + hash, Update recomputes hash on source
  change, Update same-source keeps hash (cache-hit preservation), Delete is idempotent
- VirtualTagService: round-trips trigger flags, Enabled toggle works
- ScriptedAlarmService: HistorizeToAveva defaults true per plan decision #15
- ScriptTestHarness: successful run captures output + writes, rejects missing /
  extra inputs, rejects non-literal paths, compile errors surface as Threw
- HistorianDiagnosticsService: null sink reports Disabled + retry returns 0
2026-04-20 19:59:18 -04:00
Joseph Doherty
ef53553e9d OTel Prometheus exporter wiring — RedundancyMetrics meter now scraped at /metrics. Closes task #201. Picked Prometheus over OTLP per the earlier recommendation (pull-based means no OTel Collector deployment required for the common K8s/containers case; the endpoint ASP.NET-hosts inside the Admin app already, so one less moving part). Adds two NuGet refs to the Admin csproj: OpenTelemetry.Extensions.Hosting 1.15.2 (stable) + OpenTelemetry.Exporter.Prometheus.AspNetCore 1.15.2-beta.1 (the exporter has historically been beta-only; rest of the OTel ecosystem treats it as production-acceptable + it's what the upstream OTel docs themselves recommend for AspNetCore hosts). Program.cs gains a Metrics:Prometheus:Enabled toggle (defaults true; setting to false disables both the MeterProvider registration + the scrape endpoint entirely for locked-down deployments). When enabled, AddOpenTelemetry().WithMetrics() registers a MeterProvider that subscribes to the "ZB.MOM.WW.OtOpcUa.Redundancy" meter (the exact MeterName constant on RedundancyMetrics) + wires AddPrometheusExporter. MapPrometheusScrapingEndpoint() appends a /metrics handler producing the Prometheus text-format output; deliberately NOT authenticated because scrape jobs typically run on a trusted network + operators who need auth wrap the endpoint behind a reverse-proxy basic-auth gate per fleet-ops convention. appsettings.json declares the toggle with Enabled: true so the default deploy gets metrics automatically — turning off is the explicit action. Future meters (resilience tracker + host status + auth probe) just AddMeter("Name") alongside the existing call to start flowing through the same endpoint without more infrastructure. Admin project builds 0 errors; Admin.Tests 92/92 passing (unchanged — the OTel pipeline runs at request time, not test time). Still-pending work that was NOT part of #201's scope: an equivalent setup for the Server project (different MeterNames — the Polly pipeline builder's tracker + host-status publisher) + a metrics cheat-sheet in docs/observability.md documenting each meter's tag set + expected alerting thresholds. Those are natural follow-ups when fleet-ops starts building dashboards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:41:16 -04:00
Joseph Doherty
ecc2389ca8 AclsTab Probe-this-permission — first of three #196 slices. New /clusters/{ClusterId}/draft/{GenerationId} ACLs-tab gains a probe card above the grant table so operators can ask the trie "if cn=X asks for permission Y on node Z, would it be granted, and which rows contributed?" without shell-ing into the DB. Service thinly wraps the same PermissionTrieBuilder + PermissionTrie.CollectMatches call path the Server's dispatch layer uses at request time, so a probe answer is by construction identical to what the live server would decide. New PermissionProbeService.ProbeAsync(generationId, ldapGroup, NodeScope, requiredFlags) — loads the target generation's NodeAcl rows filtered to the cluster (critical: without the cluster filter, cross-cluster grants leak into the probe which tested false-positive in the unit suite), builds a trie, CollectMatches against the supplied scope + [ldapGroup], ORs the matched-grant flags into Effective, compares to Required. Returns PermissionProbeResult(Granted, Required, Effective, Matches) — Matches carries LdapGroup + Scope + PermissionFlags per matched row so the UI can render the contribution chain. Zero side effects + no audit rows — a failing probe is a question, not a denial. AclsTab.razor gains the probe card at the top (before the New-grant form + grant table): six inputs for ldap group + every NodeScope level (NamespaceId → UnsAreaId → UnsLineId → EquipmentId → TagId — blank fields become null so the trie walks only as deep as the operator specified), a NodePermissions dropdown filtered to skip None, Probe button, green Granted / red Denied badge + Required/Effective bitmask display, and (when matches exist) a small table showing which LdapGroup matched at which level with which flags. Admin csproj adds ProjectReference to Core — the trie + NodeScope live there + were previously Server-only. Five new PermissionProbeServiceTests covering: cluster-level row grants a namespace-level read; no-group-match denies with empty Effective; matching group but insufficient flags (Browse+Read vs WriteOperate required) denies with correct Effective bitmask; cross-cluster grants stay isolated (c2's WriteOperate does NOT leak into c1's probe); generation isolation (gen1's Read-only does NOT let gen2's WriteOperate-requiring probe pass). Admin.Tests 92/92 passing (was 87, +5). Admin builds 0 errors. Remaining #196 slices — SignalR invalidation + draft-diff ACL section — ship in follow-up PRs so the review surface per PR stays tight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:28:17 -04:00
Joseph Doherty
18f93d72bb Phase 1 LDAP auth + SignalR real-time — closes the last two open Admin UI TODOs. LDAP: Admin/Security/ gets SecurityOptions (bound from appsettings.json Authentication:Ldap), LdapAuthResult record, ILdapAuthService + LdapAuthService ported from scadalink-design's LdapAuthService (TLS guard, search-then-bind when a service account is configured, direct-bind fallback, service-account re-bind after user bind so attribute lookup uses the service principal's read rights, LdapException-to-friendly-message translation, OperationCanceledException pass-through), RoleMapper (pure function: case-insensitive group-name match against LdapOptions.GroupToRole, returns the distinct set of mapped Admin roles). EscapeLdapFilter escapes the five LDAP filter control chars (\, *, (, ), \0); ExtractFirstRdnValue pulls the value portion of a DN's leading RDN for memberOf parsing; ExtractOuSegment added as a GLAuth-specific fallback when the directory doesn't populate memberOf but does embed ou=PrimaryGroup into user DNs (actual GLAuth config in C:\publish\glauth\glauth.cfg uses nameformat=cn, groupformat=ou — direct bind is enough). Login page rewritten: EditForm → ILdapAuthService.AuthenticateAsync → cookie sign-in with claims (Name = displayName, NameIdentifier = username, Role for each mapped role, ldap_group for each raw group); failed bind shows the service's error; empty-role-map returns an explicit "no Admin role mapped" message rather than silently succeeding. appsettings.json gains an Authentication:Ldap section with dev-GLAuth defaults (localhost:3893, UseTls=false, AllowInsecureLdap=true for dev, GroupToRole maps GLAuth's ReadOnly/WriteOperate/AlarmAck → ConfigViewer/ConfigEditor/FleetAdmin). SignalR: two hubs + a BackgroundService poller. FleetStatusHub routes per-cluster NodeStateChanged pushes (SubscribeCluster/UnsubscribeCluster on connection; FleetGroup for dashboard-wide) with a typed NodeStateChangedMessage payload. AlertHub auto-subscribes every connection to the AllAlertsGroup and exposes AcknowledgeAsync (ack persistence deferred to v2.1). FleetStatusPoller (IHostedService, 5s default cadence) scans ClusterNodeGenerationState joined with ClusterNode, caches the prior snapshot per NodeId, pushes NodeStateChanged on any delta, raises AlertMessage("apply-failed") on transition INTO Failed (sticky — the hub client acks later). Program.cs registers HttpContextAccessor (sign-in needs it), SignalR, LdapOptions + ILdapAuthService, the poller as hosted service, and maps /hubs/fleet + /hubs/alerts endpoints. ClusterDetail adds @rendermode RenderMode.InteractiveServer, @implements IAsyncDisposable, and a HubConnectionBuilder subscription that calls LoadAsync() on each NodeStateChanged for its cluster so the "current published" card refreshes without a page reload; a dismissable "Live update" info banner surfaces the most recent event. Microsoft.AspNetCore.SignalR.Client 10.0.0 + Novell.Directory.Ldap.NETStandard 3.6.0 added. Tests: 13 new — RoleMapperTests (single group, case-insensitive match, multi-group distinct-roles, unknown-group ignored, empty-map); LdapAuthServiceTests (EscapeLdapFilter with 4 inputs, ExtractFirstRdnValue with 4 inputs — all via reflection against internals); LdapLiveBindTests (skip when localhost:3893 unreachable; valid-credentials-bind-succeeds; wrong-password-fails-with-recognizable-error; empty-username-rejected-before-hitting-directory); FleetStatusPollerTests (throwaway DB, seeds cluster+node+generation+apply-state, runs PollOnceAsync, asserts NodeStateChanged hit the recorder; second test seeds a Failed state and asserts AlertRaised fired) — backed by RecordingHubContext/RecordingHubClients/RecordingClientProxy that capture SendCoreAsync invocations while throwing NotImplementedException for the IHubClients methods the poller doesn't call (fail-fast if evolution adds new dependencies). InternalsVisibleTo added so the test project can call FleetStatusPoller.PollOnceAsync directly. Full solution 946 pass / 1 pre-existing Phase 0 baseline failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:28:49 -04:00
Joseph Doherty
01fd90c178 Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:35:25 -04:00