scadalink-design/Component-TraefikProxy.md
Joseph Doherty 0a85a839a2 feat(infra): add Traefik load balancer with active node health check for central cluster failover
Add ActiveNodeHealthCheck that returns 200 only on the Akka.NET cluster
leader, enabling Traefik to route traffic to the active central node and
automatically fail over when the leader changes. Also fixes AkkaClusterHealthCheck
to resolve ActorSystem from AkkaHostedService (was always null via DI).
2026-03-21 00:44:37 -04:00


Component: Traefik Proxy

Purpose

The Traefik Proxy is a reverse proxy and load balancer that sits in front of the central cluster's two web servers. It provides a single stable URL for the CLI, browser, and external API consumers, automatically routing traffic to the active central node. When the active node fails over, Traefik detects the change via health checks and redirects traffic to the new active node without manual intervention.

Location

Runs as a Docker container (scadalink-traefik) in the cluster compose stack (docker/docker-compose.yml). Not part of the application codebase — it is a third-party infrastructure component with static configuration files.

docker/traefik/

Responsibilities

  • Route all HTTP traffic (Central UI, Management API, Inbound API, health endpoints) to the active central node.
  • Health-check both central nodes via /health/active to determine which is the active (cluster leader) node.
  • Automatically fail over to the standby node when the active node goes down.
  • Provide a dashboard for monitoring routing state and backend health.

How It Works

Active Node Detection

Traefik polls /health/active on both central nodes every 5 seconds. This endpoint returns:

  • HTTP 200 on the active node (the Akka.NET cluster leader).
  • HTTP 503 on the standby node (or if the node is unreachable).

Only the node returning 200 receives traffic. The health check is implemented by ActiveNodeHealthCheck in the Host project, which checks Cluster.Get(system).State.Leader == SelfMember.Address.
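A minimal sketch of what such a health check might look like in C#. Only the class name, the leader comparison, and the fact that the ActorSystem is resolved from AkkaHostedService (per the commit message above) come from this document; everything else, including AkkaHostedService's exact shape, is an assumption:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class ActiveNodeHealthCheck : IHealthCheck
{
    private readonly ActorSystem _system;

    // Resolve the ActorSystem from the Akka hosted service rather than DI
    // (the DI-registered instance was always null, per the commit message).
    public ActiveNodeHealthCheck(AkkaHostedService akkaHost) => _system = akkaHost.ActorSystem;

    public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context,
                                                    CancellationToken ct = default)
    {
        var cluster = Akka.Cluster.Cluster.Get(_system);

        // Active == this node is the cluster leader.
        var isLeader = cluster.State.Leader != null
                       && cluster.State.Leader.Equals(cluster.SelfMember.Address);

        return Task.FromResult(isLeader
            ? HealthCheckResult.Healthy("Active node (cluster leader)")
            : HealthCheckResult.Unhealthy("Standby node"));
    }
}
```

Mapped to an endpoint, Healthy becomes HTTP 200 and Unhealthy becomes HTTP 503, which is exactly the contract Traefik polls.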

Failover Sequence

  1. Active node fails (crash, network partition, or graceful shutdown).
  2. Akka.NET cluster detects the failure (~10s heartbeat timeout).
  3. Split-brain resolver acts after stable-after period (~15s).
  4. Surviving node becomes cluster leader.
  5. ActiveNodeHealthCheck on the surviving node starts returning 200.
  6. Traefik's next health poll (within 5s) detects the change.
  7. Traffic routes to the new active node.

Total failover time: ~25–30s (Akka failover ~25s + Traefik poll interval up to 5s).
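The budget breaks down as follows (all figures are this document's estimates, not measurements):

```python
# Worst-case failover window, summing the stages described above.
heartbeat_timeout = 10  # Akka.NET failure detection (step 2)
stable_after = 15       # split-brain resolver wait (step 3)
traefik_poll = 5        # worst-case Traefik health-check lag (step 6)
print(heartbeat_timeout + stable_after + traefik_poll)  # 30 seconds worst case
```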

SignalR / Blazor Server Considerations

Blazor Server uses persistent SignalR connections (WebSocket circuits). During failover:

  • Active SignalR circuits on the failed node are lost.
  • The browser's SignalR reconnection logic attempts to reconnect.
  • Traefik routes the reconnection to the new active node.
  • The user's session survives because authentication uses cookie-embedded JWT with shared Data Protection keys across both central nodes.
  • The user may see a brief "Reconnecting..." overlay before the circuit re-establishes.
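If the default reconnect behavior needs tuning (for example, to retry past the ~30s failover window), Blazor Server's client can be started manually with custom reconnection options. A hedged sketch — the option names follow ASP.NET Core's documented reconnectionOptions, but the values are illustrative and this project may not need any tuning; it assumes blazor.server.js is loaded with autostart="false":

```javascript
// In the host page, after <script src="_framework/blazor.server.js" autostart="false">:
Blazor.start({
  reconnectionOptions: {
    maxRetries: 8,                   // keep retrying beyond the ~30s failover window
    retryIntervalMilliseconds: 5000  // retry every 5s, matching Traefik's poll interval
  }
});
```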

Configuration

Static Config (docker/traefik/traefik.yml)

entryPoints:
  web:
    address: ":80"

api:
  dashboard: true
  insecure: true

providers:
  file:
    filename: /etc/traefik/dynamic.yml

  • Entrypoint web: Listens on port 80 (mapped to host port 9000).
  • Dashboard: Enabled in insecure mode (no auth) for development. Accessible at http://localhost:8180.
  • File provider: Loads routing rules from a static YAML file (no Docker socket required).

Dynamic Config (docker/traefik/dynamic.yml)

http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web

  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadalink-central-a:5000"
          - url: "http://scadalink-central-b:5000"

  • Router central: Catches all requests and forwards to the central service.
  • Service central: Load balancer with two backends (both central nodes) and a health check on /health/active.
  • Health check interval: 5 seconds. A server failing the health check is removed from the pool within one interval.

Ports

Host Port | Container Port | Purpose
9000 | 80 | Load-balanced entrypoint (Central UI, Management API, Inbound API)
8180 | 8080 | Traefik dashboard

Health Endpoints

The central nodes expose the following health endpoints:

Endpoint | Purpose | Who Uses It
/health/ready | Readiness gate: 200 when database + Akka cluster are healthy | Kubernetes probes, monitoring
/health/active | Active node: 200 only on cluster leader | Traefik (routing decisions)
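With the stack running, both endpoints can be spot-checked through Traefik's entrypoint (host port 9000 from the Ports table; `-w` prints only the HTTP status):

```shell
# Probe the health endpoints through the load-balanced entrypoint.
curl -s -o /dev/null -w "ready:  %{http_code}\n" http://localhost:9000/health/ready
curl -s -o /dev/null -w "active: %{http_code}\n" http://localhost:9000/health/active
```

Through Traefik, /health/active should always return 200, since traffic only reaches the node that is currently active; querying a node's container directly would show the 200/503 split.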

Dependencies

  • Central cluster nodes: The two backends (scadalink-central-a, scadalink-central-b) on the scadalink-net Docker network.
  • ActiveNodeHealthCheck: Health check implementation in src/ScadaLink.Host/Health/ActiveNodeHealthCheck.cs that determines cluster leader status.
  • Docker network: All containers must be on the shared scadalink-net bridge network.

Interactions

  • CLI: Connects to http://localhost:9000/management — routed by Traefik to the active node.
  • Browser (Central UI): Connects to http://localhost:9000 — Blazor Server + SignalR routed to the active node.
  • Inbound API consumers: Connect to http://localhost:9000/api/{methodName} — routed to the active node.
  • Cluster Infrastructure: The ActiveNodeHealthCheck relies on Akka.NET cluster gossip state to determine the leader.

Production Considerations

The current configuration is for development/testing. In production:

  • TLS termination: Add HTTPS entrypoint with certificates (Let's Encrypt via Traefik's ACME provider, or static certs).
  • Dashboard auth: Disable insecure: true and configure authentication on the dashboard.
  • WebSocket support: Traefik supports WebSocket proxying natively — no additional config needed for SignalR.
  • Sticky sessions: Not required. The Management API is stateless (Basic Auth per request). Blazor Server circuits are bound to a specific node via SignalR, but reconnection handles failover transparently.
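A hedged sketch of what the TLS additions to traefik.yml might look like, using Traefik's ACME (Let's Encrypt) certificate resolver — the keys follow Traefik v2's static-configuration schema, but the email, storage path, and challenge type are placeholders, not choices this project has made:

```yaml
# Hypothetical production additions to docker/traefik/traefik.yml
entryPoints:
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      email: ops@example.com          # placeholder contact address
      storage: /etc/traefik/acme.json # must persist across container restarts
      httpChallenge:
        entryPoint: web               # answer HTTP-01 challenges on port 80
```

Routers in dynamic.yml would then opt in with `tls: { certResolver: letsencrypt }` on the websecure entrypoint.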