# Component: Traefik Proxy

## Purpose
The Traefik Proxy is a reverse proxy and load balancer that sits in front of the central cluster's two web servers. It provides a single stable URL for the CLI, browser, and external API consumers, automatically routing traffic to the active central node. When the active node fails over, Traefik detects the change via health checks and redirects traffic to the new active node without manual intervention.
## Location

Runs as a Docker container (`scadalink-traefik`) in the cluster compose stack (`docker/docker-compose.yml`). Not part of the application codebase — it is a third-party infrastructure component with static configuration files in `docker/traefik/`.
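For orientation, the compose wiring for this container might look as follows. This is a hedged sketch: only the container name, port mappings, and network name are stated elsewhere in this document; the image tag and volume paths are illustrative assumptions.

```yaml
# Hypothetical excerpt of docker/docker-compose.yml (image tag and volume
# layout are assumptions; ports and network come from this document)
services:
  scadalink-traefik:
    image: traefik:v2.11
    ports:
      - "9000:80"     # load-balanced entrypoint
      - "8180:8080"   # Traefik dashboard
    volumes:
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
      - ./traefik/dynamic.yml:/etc/traefik/dynamic.yml:ro
    networks:
      - scadalink-net
```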
## Responsibilities

- Route all HTTP traffic (Central UI, Management API, Inbound API, health endpoints) to the active central node.
- Health-check both central nodes via `/health/active` to determine which is the active (cluster leader) node.
- Automatically fail over to the standby node when the active node goes down.
- Provide a dashboard for monitoring routing state and backend health.
## How It Works

### Active Node Detection
Traefik polls `/health/active` on both central nodes every 5 seconds. This endpoint returns:
- HTTP 200 on the active node (the Akka.NET cluster leader).
- HTTP 503 on the standby node (or if the node is unreachable).
Only the node returning 200 receives traffic. The health check is implemented by `ActiveNodeHealthCheck` in the Host project, which checks `Cluster.Get(system).State.Leader == SelfMember.Address`.
### Failover Sequence

- Active node fails (crash, network partition, or graceful shutdown).
- Akka.NET cluster detects the failure (~10s heartbeat timeout).
- Split-brain resolver acts after the stable-after period (~15s).
- Surviving node becomes cluster leader.
- `ActiveNodeHealthCheck` on the surviving node starts returning 200.
- Traefik's next health poll (within 5s) detects the change.
- Traffic routes to the new active node.
Total failover time: ~25–30s (Akka failover ~25s + Traefik poll interval up to 5s).
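The timing budget can be checked by adding up the three intervals quoted in the sequence above (worst case, assuming the Traefik poll lands just after the leader change):

```python
heartbeat_timeout = 10  # Akka.NET failure detection (~10s heartbeat timeout)
stable_after = 15       # split-brain resolver stable-after period (~15s)
traefik_poll = 5        # Traefik health check interval (worst case)

akka_failover = heartbeat_timeout + stable_after
worst_case = akka_failover + traefik_poll
print(akka_failover, worst_case)  # 25 30
```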
### SignalR / Blazor Server Considerations
Blazor Server uses persistent SignalR connections (WebSocket circuits). During failover:
- Active SignalR circuits on the failed node are lost.
- The browser's SignalR reconnection logic attempts to reconnect.
- Traefik routes the reconnection to the new active node.
- The user's session survives because authentication uses cookie-embedded JWT with shared Data Protection keys across both central nodes.
- The user may see a brief "Reconnecting..." overlay before the circuit re-establishes.
## Configuration

### Static Config (`docker/traefik/traefik.yml`)
```yaml
entryPoints:
  web:
    address: ":80"

api:
  dashboard: true
  insecure: true

providers:
  file:
    filename: /etc/traefik/dynamic.yml
```
- Entrypoint `web`: Listens on port 80 (mapped to host port 9000).
- Dashboard: Enabled in insecure mode (no auth) for development. Accessible at `http://localhost:8180`.
- File provider: Loads routing rules from a static YAML file (no Docker socket required).
### Dynamic Config (`docker/traefik/dynamic.yml`)
```yaml
http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web

  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadalink-central-a:5000"
          - url: "http://scadalink-central-b:5000"
```
- Router `central`: Catches all requests and forwards to the `central` service.
- Service `central`: Load balancer with two backends (both central nodes) and a health check on `/health/active`.
- Health check interval: 5 seconds. A server failing the health check is removed from the pool within one interval.
## Ports
| Host Port | Container Port | Purpose |
|---|---|---|
| 9000 | 80 | Load-balanced entrypoint (Central UI, Management API, Inbound API) |
| 8180 | 8080 | Traefik dashboard |
## Health Endpoints

The central nodes expose the following health endpoints:

| Endpoint | Purpose | Who Uses It |
|---|---|---|
| `/health/ready` | Readiness gate — 200 when database + Akka cluster are healthy | Kubernetes probes, monitoring |
| `/health/active` | Active node — 200 only on cluster leader | Traefik (routing decisions) |
## Dependencies

- Central cluster nodes: The two backends (`scadalink-central-a`, `scadalink-central-b`) on the `scadalink-net` Docker network.
- `ActiveNodeHealthCheck`: Health check implementation in `src/ScadaLink.Host/Health/ActiveNodeHealthCheck.cs` that determines cluster leader status.
- Docker network: All containers must be on the shared `scadalink-net` bridge network.
## Interactions

- CLI: Connects to `http://localhost:9000/management` — routed by Traefik to the active node.
- Browser (Central UI): Connects to `http://localhost:9000` — Blazor Server + SignalR routed to the active node.
- Inbound API consumers: Connect to `http://localhost:9000/api/{methodName}` — routed to the active node.
- Cluster Infrastructure: The `ActiveNodeHealthCheck` relies on Akka.NET cluster gossip state to determine the leader.
## Production Considerations
The current configuration is for development/testing. In production:
- TLS termination: Add HTTPS entrypoint with certificates (Let's Encrypt via Traefik's ACME provider, or static certs).
- Dashboard auth: Disable `insecure: true` and configure authentication on the dashboard.
- WebSocket support: Traefik supports WebSocket proxying natively — no additional config needed for SignalR.
- Sticky sessions: Not required. The Management API is stateless (Basic Auth per request). Blazor Server circuits are bound to a specific node via SignalR, but reconnection handles failover transparently.
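As a sketch of the TLS point above, the static config could gain an HTTPS entrypoint and an ACME certificate resolver along these lines. The resolver name, contact email, and storage path are placeholders, not values from this repo:

```yaml
# Hypothetical additions to docker/traefik/traefik.yml for production TLS
entryPoints:
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      email: ops@example.com          # placeholder contact address
      storage: /etc/traefik/acme.json # persisted certificate store
      tlsChallenge: {}                # TLS-ALPN-01; HTTP-01 is an alternative
```

The `central` router in the dynamic config would then opt in with `tls: { certResolver: letsencrypt }` and switch its entrypoint to `websecure`.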