Files
lmxopcua/src/ZB.MOM.WW.OtOpcUa.Admin/Hubs
Joseph Doherty 1f3343e61f OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:16:09 -04:00
..
Phase 1 LDAP auth + SignalR real-time — closes the last two open Admin UI TODOs. LDAP: Admin/Security/ gets SecurityOptions (bound from appsettings.json Authentication:Ldap), LdapAuthResult record, ILdapAuthService + LdapAuthService ported from scadalink-design's LdapAuthService (TLS guard, search-then-bind when a service account is configured, direct-bind fallback, service-account re-bind after user bind so attribute lookup uses the service principal's read rights, LdapException-to-friendly-message translation, OperationCanceledException pass-through), RoleMapper (pure function: case-insensitive group-name match against LdapOptions.GroupToRole, returns the distinct set of mapped Admin roles). EscapeLdapFilter escapes the five LDAP filter control chars (\, *, (, ), \0); ExtractFirstRdnValue pulls the value portion of a DN's leading RDN for memberOf parsing; ExtractOuSegment added as a GLAuth-specific fallback when the directory doesn't populate memberOf but does embed ou=PrimaryGroup into user DNs (actual GLAuth config in C:\publish\glauth\glauth.cfg uses nameformat=cn, groupformat=ou — direct bind is enough). Login page rewritten: EditForm → ILdapAuthService.AuthenticateAsync → cookie sign-in with claims (Name = displayName, NameIdentifier = username, Role for each mapped role, ldap_group for each raw group); failed bind shows the service's error; empty-role-map returns an explicit "no Admin role mapped" message rather than silently succeeding. appsettings.json gains an Authentication:Ldap section with dev-GLAuth defaults (localhost:3893, UseTls=false, AllowInsecureLdap=true for dev, GroupToRole maps GLAuth's ReadOnly/WriteOperate/AlarmAck → ConfigViewer/ConfigEditor/FleetAdmin). SignalR: two hubs + a BackgroundService poller. FleetStatusHub routes per-cluster NodeStateChanged pushes (SubscribeCluster/UnsubscribeCluster on connection; FleetGroup for dashboard-wide) with a typed NodeStateChangedMessage payload. AlertHub auto-subscribes every connection to the AllAlertsGroup and exposes AcknowledgeAsync (ack persistence deferred to v2.1). FleetStatusPoller (IHostedService, 5s default cadence) scans ClusterNodeGenerationState joined with ClusterNode, caches the prior snapshot per NodeId, pushes NodeStateChanged on any delta, raises AlertMessage("apply-failed") on transition INTO Failed (sticky — the hub client acks later). Program.cs registers HttpContextAccessor (sign-in needs it), SignalR, LdapOptions + ILdapAuthService, the poller as hosted service, and maps /hubs/fleet + /hubs/alerts endpoints. ClusterDetail adds @rendermode RenderMode.InteractiveServer, @implements IAsyncDisposable, and a HubConnectionBuilder subscription that calls LoadAsync() on each NodeStateChanged for its cluster so the "current published" card refreshes without a page reload; a dismissable "Live update" info banner surfaces the most recent event. Microsoft.AspNetCore.SignalR.Client 10.0.0 + Novell.Directory.Ldap.NETStandard 3.6.0 added. Tests: 13 new — RoleMapperTests (single group, case-insensitive match, multi-group distinct-roles, unknown-group ignored, empty-map); LdapAuthServiceTests (EscapeLdapFilter with 4 inputs, ExtractFirstRdnValue with 4 inputs — all via reflection against internals); LdapLiveBindTests (skip when localhost:3893 unreachable; valid-credentials-bind-succeeds; wrong-password-fails-with-recognizable-error; empty-username-rejected-before-hitting-directory); FleetStatusPollerTests (throwaway DB, seeds cluster+node+generation+apply-state, runs PollOnceAsync, asserts NodeStateChanged hit the recorder; second test seeds a Failed state and asserts AlertRaised fired) — backed by RecordingHubContext/RecordingHubClients/RecordingClientProxy that capture SendCoreAsync invocations while throwing NotImplementedException for the IHubClients methods the poller doesn't call (fail-fast if evolution adds new dependencies). InternalsVisibleTo added so the test project can call FleetStatusPoller.PollOnceAsync directly. Full solution 946 pass / 1 pre-existing Phase 0 baseline failure.
2026-04-17 22:28:49 -04:00
Phase 1 LDAP auth + SignalR real-time — closes the last two open Admin UI TODOs. LDAP: Admin/Security/ gets SecurityOptions (bound from appsettings.json Authentication:Ldap), LdapAuthResult record, ILdapAuthService + LdapAuthService ported from scadalink-design's LdapAuthService (TLS guard, search-then-bind when a service account is configured, direct-bind fallback, service-account re-bind after user bind so attribute lookup uses the service principal's read rights, LdapException-to-friendly-message translation, OperationCanceledException pass-through), RoleMapper (pure function: case-insensitive group-name match against LdapOptions.GroupToRole, returns the distinct set of mapped Admin roles). EscapeLdapFilter escapes the five LDAP filter control chars (\, *, (, ), \0); ExtractFirstRdnValue pulls the value portion of a DN's leading RDN for memberOf parsing; ExtractOuSegment added as a GLAuth-specific fallback when the directory doesn't populate memberOf but does embed ou=PrimaryGroup into user DNs (actual GLAuth config in C:\publish\glauth\glauth.cfg uses nameformat=cn, groupformat=ou — direct bind is enough). Login page rewritten: EditForm → ILdapAuthService.AuthenticateAsync → cookie sign-in with claims (Name = displayName, NameIdentifier = username, Role for each mapped role, ldap_group for each raw group); failed bind shows the service's error; empty-role-map returns an explicit "no Admin role mapped" message rather than silently succeeding. appsettings.json gains an Authentication:Ldap section with dev-GLAuth defaults (localhost:3893, UseTls=false, AllowInsecureLdap=true for dev, GroupToRole maps GLAuth's ReadOnly/WriteOperate/AlarmAck → ConfigViewer/ConfigEditor/FleetAdmin). SignalR: two hubs + a BackgroundService poller. FleetStatusHub routes per-cluster NodeStateChanged pushes (SubscribeCluster/UnsubscribeCluster on connection; FleetGroup for dashboard-wide) with a typed NodeStateChangedMessage payload. AlertHub auto-subscribes every connection to the AllAlertsGroup and exposes AcknowledgeAsync (ack persistence deferred to v2.1). FleetStatusPoller (IHostedService, 5s default cadence) scans ClusterNodeGenerationState joined with ClusterNode, caches the prior snapshot per NodeId, pushes NodeStateChanged on any delta, raises AlertMessage("apply-failed") on transition INTO Failed (sticky — the hub client acks later). Program.cs registers HttpContextAccessor (sign-in needs it), SignalR, LdapOptions + ILdapAuthService, the poller as hosted service, and maps /hubs/fleet + /hubs/alerts endpoints. ClusterDetail adds @rendermode RenderMode.InteractiveServer, @implements IAsyncDisposable, and a HubConnectionBuilder subscription that calls LoadAsync() on each NodeStateChanged for its cluster so the "current published" card refreshes without a page reload; a dismissable "Live update" info banner surfaces the most recent event. Microsoft.AspNetCore.SignalR.Client 10.0.0 + Novell.Directory.Ldap.NETStandard 3.6.0 added. Tests: 13 new — RoleMapperTests (single group, case-insensitive match, multi-group distinct-roles, unknown-group ignored, empty-map); LdapAuthServiceTests (EscapeLdapFilter with 4 inputs, ExtractFirstRdnValue with 4 inputs — all via reflection against internals); LdapLiveBindTests (skip when localhost:3893 unreachable; valid-credentials-bind-succeeds; wrong-password-fails-with-recognizable-error; empty-username-rejected-before-hitting-directory); FleetStatusPollerTests (throwaway DB, seeds cluster+node+generation+apply-state, runs PollOnceAsync, asserts NodeStateChanged hit the recorder; second test seeds a Failed state and asserts AlertRaised fired) — backed by RecordingHubContext/RecordingHubClients/RecordingClientProxy that capture SendCoreAsync invocations while throwing NotImplementedException for the IHubClients methods the poller doesn't call (fail-fast if evolution adds new dependencies). InternalsVisibleTo added so the test project can call FleetStatusPoller.PollOnceAsync directly. Full solution 946 pass / 1 pre-existing Phase 0 baseline failure.
2026-04-17 22:28:49 -04:00
OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
2026-04-19 23:16:09 -04:00