Task #242 finish — UnsTab drag-drop interactive E2E tests un-skip + pass #201

Merged
dohertj2 merged 1 commits from task-242-finish-interactive-tests into v2 2026-04-21 02:33:28 -04:00
Owner

Closes the scope-out from PR #200. Two co-operating harness bugs were masking the interactive circuit: (1) UseStaticWebAssets was never called so the staticwebassets manifest had no composed file provider → blazor.web.js served zero bytes; (2) InMemory DB name was re-rolled on every DbContext construction so seed + circuit + assertion scopes had isolated stores → ClusterDetail rendered <p>Loading...</p> forever.

Fix is two small AdminWebAppFactory changes: ApplicationName + UseStaticWebAssets, plus capturing the dbName string. Both interactive tests now un-skip.

Test plan

  • Full E2E suite: 3 passed, 0 skipped, 0 failed (was 1 passed, 2 skipped).
Closes the scope-out from PR #200. Two co-operating harness bugs were masking the interactive circuit: (1) UseStaticWebAssets was never called so the staticwebassets manifest had no composed file provider → blazor.web.js served zero bytes; (2) InMemory DB name was re-rolled on every DbContext construction so seed + circuit + assertion scopes had isolated stores → ClusterDetail rendered `<p>Loading...</p>` forever. Fix is two small AdminWebAppFactory changes: ApplicationName + UseStaticWebAssets, plus capturing the dbName string. Both interactive tests now un-skip. ## Test plan - Full E2E suite: 3 passed, 0 skipped, 0 failed (was 1 passed, 2 skipped).
dohertj2 added 1 commit 2026-04-21 02:33:18 -04:00
Closes the scope-out left by the #242 partial. Root cause of the blazor.web.js
zero-byte response turned out to be two co-operating harness bugs:

1) The static-asset manifest was discoverable but the runtime needs
   UseStaticWebAssets to be called so the StaticWebAssetsLoader composes a
   PhysicalFileProvider per ContentRoot declared in
   staticwebassets.development.json (Admin source wwwroot + obj/compressed +
   the framework NuGet cache). Without that call MapStaticAssets resolves the
   route but has no ContentRoot map — so every asset serves zero bytes.

2) The EF InMemory DB name was being re-generated on every DbContext
   construction (the lambda body called Guid.NewGuid() inline), so the seed
   scope, Blazor circuit scope, and test-assertion scopes all got separate
   stores. Capturing the name as a stable string per fixture instance fixes
   the "cluster not found → page stays at Loading…" symptom.

Fixes:
  - AdminWebAppFactory:
      * ApplicationName set on WebApplicationOptions so UseStaticWebAssets
        discovers the manifest.
      * builder.WebHost.UseStaticWebAssets() wired explicitly (matches what
        `dotnet run` does via MSBuild targets).
      * dbName captured once per fixture; the options lambda reads the
        captured string instead of re-rolling a Guid.
  - UnsTabDragDropE2ETests: the two [Fact(Skip=...)] tests un-skip.

Suite state: 3 passed, 0 skipped, 0 failed. Task #242 closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 merged commit ae07fea630 into v2 2026-04-21 02:33:28 -04:00
dohertj2 referenced this issue from a commit 2026-04-30 08:21:26 -04:00
OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
dohertj2 referenced this issue from a commit 2026-04-30 08:21:26 -04:00
OTel Prometheus exporter wiring — RedundancyMetrics meter now scraped at /metrics. Closes task #201. Picked Prometheus over OTLP per the earlier recommendation (pull-based means no OTel Collector deployment required for the common K8s/containers case; the endpoint ASP.NET-hosts inside the Admin app already, so one less moving part). Adds two NuGet refs to the Admin csproj: OpenTelemetry.Extensions.Hosting 1.15.2 (stable) + OpenTelemetry.Exporter.Prometheus.AspNetCore 1.15.2-beta.1 (the exporter has historically been beta-only; rest of the OTel ecosystem treats it as production-acceptable + it's what the upstream OTel docs themselves recommend for AspNetCore hosts). Program.cs gains a Metrics:Prometheus:Enabled toggle (defaults true; setting to false disables both the MeterProvider registration + the scrape endpoint entirely for locked-down deployments). When enabled, AddOpenTelemetry().WithMetrics() registers a MeterProvider that subscribes to the "ZB.MOM.WW.OtOpcUa.Redundancy" meter (the exact MeterName constant on RedundancyMetrics) + wires AddPrometheusExporter. MapPrometheusScrapingEndpoint() appends a /metrics handler producing the Prometheus text-format output; deliberately NOT authenticated because scrape jobs typically run on a trusted network + operators who need auth wrap the endpoint behind a reverse-proxy basic-auth gate per fleet-ops convention. appsettings.json declares the toggle with Enabled: true so the default deploy gets metrics automatically — turning off is the explicit action. Future meters (resilience tracker + host status + auth probe) just AddMeter("Name") alongside the existing call to start flowing through the same endpoint without more infrastructure. Admin project builds 0 errors; Admin.Tests 92/92 passing (unchanged — the OTel pipeline runs at request time, not test time). Still-pending work that was NOT part of #201's scope: an equivalent setup for the Server project (different MeterNames — the Polly pipeline builder's tracker + host-status publisher) + a metrics cheat-sheet in docs/observability.md documenting each meter's tag set + expected alerting thresholds. Those are natural follow-ups when fleet-ops starts building dashboards.
Sign in to join this conversation.