Files
lmxopcua/src/ZB.MOM.WW.OtOpcUa.Admin
Joseph Doherty 0687bb2e2d Phase 7 Stream F — Admin UI for scripts + test harness + historian diagnostics
Adds the draft-editor tab + page surface for authoring Phase 7 virtual tags and
scripted alarms, plus the /alarms/historian operator diagnostics page. Monaco loads
from CDN via a progressive-enhancement JS shim — the textarea works immediately so
the page is functional even if the CDN is unreachable.

## New services (Admin)

- ScriptService — CRUD for Script entity. SHA-256 SourceHash recomputed on save so
  Core.Scripting's CompiledScriptCache hits on re-publish of unchanged source + misses
  when the source actually changes.
- VirtualTagService — CRUD for VirtualTag, with Enabled toggle.
- ScriptedAlarmService — CRUD for ScriptedAlarm + lookup of persistent ScriptedAlarmState
  (logical-id-keyed per plan decision #14).
- ScriptTestHarnessService — pre-publish dry-run. Enforces plan decision #22: only
  inputs the DependencyExtractor identifies can be supplied. Missing / extra synthetic
  inputs surface as dedicated outcomes. Captures SetVirtualTag writes + Serilog events
  from the script so the operator can see both the output + the log output before
  publishing.
- HistorianDiagnosticsService — surfaces the local-process IAlarmHistorianSink state
  on /alarms/historian. Null sink reports Disabled + swallows retry. Live
  SqliteStoreAndForwardSink reports real queue depth + last-error + drain state and
  routes the Retry-dead-lettered button through.

## New UI

- ScriptsTab.razor (inside DraftEditor tabs) — list + create/edit/delete scripts with
  Monaco editor + dependency preview + test-harness run panel showing output + writes
  + log emissions.
- ScriptEditor.razor — reusable Monaco-backed textarea. Loads editor from CDN via
  wwwroot/js/monaco-loader.js. Textarea stays authoritative for Blazor binding; Monaco
  mirrors into it on every keystroke.
- AlarmsHistorian.razor (/alarms/historian) — queue depth + dead-letter depth + drain
  state badge + last-error banner + Retry-dead-lettered button.
- DraftEditor.razor — new "Scripts" tab.

## DI wiring

All five services registered in Program.cs. Null historian sink bound at Admin
composition time (real SqliteStoreAndForwardSink lives in the Server process).

## Tests — 13/13

Phase7ServicesTests covers:
- ScriptService: Add generates logical id + hash, Update recomputes hash on source
  change, Update same-source keeps hash (cache-hit preservation), Delete is idempotent
- VirtualTagService: round-trips trigger flags, Enabled toggle works
- ScriptedAlarmService: HistorizeToAveva defaults true per plan decision #15
- ScriptTestHarness: successful run captures output + writes, rejects missing /
  extra inputs, rejects non-literal paths, compile errors surface as Threw
- HistorianDiagnosticsService: null sink reports Disabled + retry returns 0
2026-04-20 19:59:18 -04:00
..
OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
2026-04-19 23:16:09 -04:00
Phase 1 LDAP auth + SignalR real-time — closes the last two open Admin UI TODOs. LDAP: Admin/Security/ gets SecurityOptions (bound from appsettings.json Authentication:Ldap), LdapAuthResult record, ILdapAuthService + LdapAuthService ported from scadalink-design's LdapAuthService (TLS guard, search-then-bind when a service account is configured, direct-bind fallback, service-account re-bind after user bind so attribute lookup uses the service principal's read rights, LdapException-to-friendly-message translation, OperationCanceledException pass-through), RoleMapper (pure function: case-insensitive group-name match against LdapOptions.GroupToRole, returns the distinct set of mapped Admin roles). EscapeLdapFilter escapes the five LDAP filter control chars (\, *, (, ), \0); ExtractFirstRdnValue pulls the value portion of a DN's leading RDN for memberOf parsing; ExtractOuSegment added as a GLAuth-specific fallback when the directory doesn't populate memberOf but does embed ou=PrimaryGroup into user DNs (actual GLAuth config in C:\publish\glauth\glauth.cfg uses nameformat=cn, groupformat=ou — direct bind is enough). Login page rewritten: EditForm → ILdapAuthService.AuthenticateAsync → cookie sign-in with claims (Name = displayName, NameIdentifier = username, Role for each mapped role, ldap_group for each raw group); failed bind shows the service's error; empty-role-map returns an explicit "no Admin role mapped" message rather than silently succeeding. appsettings.json gains an Authentication:Ldap section with dev-GLAuth defaults (localhost:3893, UseTls=false, AllowInsecureLdap=true for dev, GroupToRole maps GLAuth's ReadOnly/WriteOperate/AlarmAck → ConfigViewer/ConfigEditor/FleetAdmin). SignalR: two hubs + a BackgroundService poller. FleetStatusHub routes per-cluster NodeStateChanged pushes (SubscribeCluster/UnsubscribeCluster on connection; FleetGroup for dashboard-wide) with a typed NodeStateChangedMessage payload. AlertHub auto-subscribes every connection to the AllAlertsGroup and exposes AcknowledgeAsync (ack persistence deferred to v2.1). FleetStatusPoller (IHostedService, 5s default cadence) scans ClusterNodeGenerationState joined with ClusterNode, caches the prior snapshot per NodeId, pushes NodeStateChanged on any delta, raises AlertMessage("apply-failed") on transition INTO Failed (sticky — the hub client acks later). Program.cs registers HttpContextAccessor (sign-in needs it), SignalR, LdapOptions + ILdapAuthService, the poller as hosted service, and maps /hubs/fleet + /hubs/alerts endpoints. ClusterDetail adds @rendermode RenderMode.InteractiveServer, @implements IAsyncDisposable, and a HubConnectionBuilder subscription that calls LoadAsync() on each NodeStateChanged for its cluster so the "current published" card refreshes without a page reload; a dismissable "Live update" info banner surfaces the most recent event. Microsoft.AspNetCore.SignalR.Client 10.0.0 + Novell.Directory.Ldap.NETStandard 3.6.0 added. Tests: 13 new — RoleMapperTests (single group, case-insensitive match, multi-group distinct-roles, unknown-group ignored, empty-map); LdapAuthServiceTests (EscapeLdapFilter with 4 inputs, ExtractFirstRdnValue with 4 inputs — all via reflection against internals); LdapLiveBindTests (skip when localhost:3893 unreachable; valid-credentials-bind-succeeds; wrong-password-fails-with-recognizable-error; empty-username-rejected-before-hitting-directory); FleetStatusPollerTests (throwaway DB, seeds cluster+node+generation+apply-state, runs PollOnceAsync, asserts NodeStateChanged hit the recorder; second test seeds a Failed state and asserts AlertRaised fired) — backed by RecordingHubContext/RecordingHubClients/RecordingClientProxy that capture SendCoreAsync invocations while throwing NotImplementedException for the IHubClients methods the poller doesn't call (fail-fast if evolution adds new dependencies). InternalsVisibleTo added so the test project can call FleetStatusPoller.PollOnceAsync directly. Full solution 946 pass / 1 pre-existing Phase 0 baseline failure.
2026-04-17 22:28:49 -04:00
OTel Prometheus exporter wiring — RedundancyMetrics meter now scraped at /metrics. Closes task #201. Picked Prometheus over OTLP per the earlier recommendation (pull-based means no OTel Collector deployment required for the common K8s/containers case; the endpoint ASP.NET-hosts inside the Admin app already, so one less moving part). Adds two NuGet refs to the Admin csproj: OpenTelemetry.Extensions.Hosting 1.15.2 (stable) + OpenTelemetry.Exporter.Prometheus.AspNetCore 1.15.2-beta.1 (the exporter has historically been beta-only; rest of the OTel ecosystem treats it as production-acceptable + it's what the upstream OTel docs themselves recommend for AspNetCore hosts). Program.cs gains a Metrics:Prometheus:Enabled toggle (defaults true; setting to false disables both the MeterProvider registration + the scrape endpoint entirely for locked-down deployments). When enabled, AddOpenTelemetry().WithMetrics() registers a MeterProvider that subscribes to the "ZB.MOM.WW.OtOpcUa.Redundancy" meter (the exact MeterName constant on RedundancyMetrics) + wires AddPrometheusExporter. MapPrometheusScrapingEndpoint() appends a /metrics handler producing the Prometheus text-format output; deliberately NOT authenticated because scrape jobs typically run on a trusted network + operators who need auth wrap the endpoint behind a reverse-proxy basic-auth gate per fleet-ops convention. appsettings.json declares the toggle with Enabled: true so the default deploy gets metrics automatically — turning off is the explicit action. Future meters (resilience tracker + host status + auth probe) just AddMeter("Name") alongside the existing call to start flowing through the same endpoint without more infrastructure. Admin project builds 0 errors; Admin.Tests 92/92 passing (unchanged — the OTel pipeline runs at request time, not test time). Still-pending work that was NOT part of #201's scope: an equivalent setup for the Server project (different MeterNames — the Polly pipeline builder's tracker + host-status publisher) + a metrics cheat-sheet in docs/observability.md documenting each meter's tag set + expected alerting thresholds. Those are natural follow-ups when fleet-ops starts building dashboards.
2026-04-20 00:41:16 -04:00