# Component: Traefik Proxy

## Purpose

The Traefik Proxy is a reverse proxy and load balancer that sits in front of the central cluster's two web servers. It provides a single stable URL for the CLI, browser, and external API consumers, automatically routing traffic to the active central node. When the active node fails over, Traefik detects the change via health checks and redirects traffic to the new active node without manual intervention.

## Location

Runs as a Docker container (`scadalink-traefik`) in the cluster compose stack (`docker/docker-compose.yml`). Not part of the application codebase — it is a third-party infrastructure component with static configuration files in `docker/traefik/`.

## Responsibilities

- Route all HTTP traffic (Central UI, Management API, Inbound API, health endpoints) to the active central node.
- Health-check both central nodes via `/health/active` to determine which is the active (cluster leader) node.
- Automatically fail over to the standby node when the active node goes down.
- Provide a dashboard for monitoring routing state and backend health.

## How It Works

### Active Node Detection

Traefik polls `/health/active` on both central nodes every 5 seconds. This endpoint returns:

- **HTTP 200** on the active node (the Akka.NET cluster leader).
- **HTTP 503** on the standby node (or if the node is unreachable).

Only the node returning 200 receives traffic. The health check is implemented by `ActiveNodeHealthCheck` in the Host project, which checks `Cluster.Get(system).State.Leader == SelfMember.Address`.

### Failover Sequence

1. Active node fails (crash, network partition, or graceful shutdown).
2. Akka.NET cluster detects the failure (~10s heartbeat timeout).
3. Split-brain resolver acts after the stable-after period (~15s).
4. The surviving node becomes cluster leader.
5. `ActiveNodeHealthCheck` on the surviving node starts returning 200.
6. Traefik's next health poll (within 5s) detects the change.
7. Traffic routes to the new active node.
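The detection logic above can be sketched as a small simulation: the health endpoint returns 200 only on the cluster leader, and the proxy routes all traffic to whichever backend passes. This is an illustrative model, not project code; the function names (`health_status`, `pick_active`) are invented for the example.

```python
# Illustrative sketch (not project code): model the /health/active contract
# and the routing decision Traefik derives from it.

def health_status(node, leader):
    """Status code /health/active would return on `node`.

    Mirrors the documented ActiveNodeHealthCheck behavior:
    200 only when this node is the cluster leader, else 503.
    """
    return 200 if node == leader else 503

def pick_active(nodes, leader):
    """Route traffic to the single node whose health check returns 200."""
    healthy = [n for n in nodes if health_status(n, leader) == 200]
    return healthy[0] if healthy else None

nodes = ["scadalink-central-a", "scadalink-central-b"]

# Normal operation: node A is cluster leader and receives all traffic.
assert pick_active(nodes, leader="scadalink-central-a") == "scadalink-central-a"

# After failover: node B becomes leader, so traffic shifts to B.
assert pick_active(nodes, leader="scadalink-central-b") == "scadalink-central-b"
```

Because exactly one node can be cluster leader at a time, at most one backend ever returns 200, so the proxy never splits traffic between the two nodes.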
**Total failover time**: ~25–30s (Akka failover ~25s + Traefik poll interval up to 5s).

### SignalR / Blazor Server Considerations

Blazor Server uses persistent SignalR connections (WebSocket circuits). During failover:

- Active SignalR circuits on the failed node are lost.
- The browser's SignalR reconnection logic attempts to reconnect.
- Traefik routes the reconnection to the new active node.
- The user's session survives because authentication uses cookie-embedded JWT with shared Data Protection keys across both central nodes.
- The user may see a brief "Reconnecting..." overlay before the circuit re-establishes.

## Configuration

### Static Config (`docker/traefik/traefik.yml`)

```yaml
entryPoints:
  web:
    address: ":80"

api:
  dashboard: true
  insecure: true

providers:
  file:
    filename: /etc/traefik/dynamic.yml
```

- **Entrypoint `web`**: Listens on port 80 (mapped to host port 9000).
- **Dashboard**: Enabled in insecure mode (no auth) for development. Accessible at `http://localhost:8180`.
- **File provider**: Loads routing rules from a static YAML file (no Docker socket required).

### Dynamic Config (`docker/traefik/dynamic.yml`)

```yaml
http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web

  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadalink-central-a:5000"
          - url: "http://scadalink-central-b:5000"
```

- **Router `central`**: Catches all requests and forwards to the `central` service.
- **Service `central`**: Load balancer with two backends (both central nodes) and a health check on `/health/active`.
- **Health check interval**: 5 seconds. A server failing the health check is removed from the pool within one interval.
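The pool behavior described above can be modeled as a refresh loop: on each check cycle, the load balancer keeps only the servers whose health check passed, so a backend that starts failing drops out of the pool within one interval. This is a sketch of the configured behavior, not Traefik source; `refresh_pool` is an invented name.

```python
# Illustrative model (not Traefik source): a health-checked server pool.
# Each check cycle keeps only the servers that currently pass /health/active.

def refresh_pool(servers, check):
    """Return the subset of servers eligible for traffic this cycle."""
    return [s for s in servers if check(s)]

servers = [
    "http://scadalink-central-a:5000",
    "http://scadalink-central-b:5000",
]

# Cycle 1: node A is the leader (its /health/active returns 200), B is standby.
pool = refresh_pool(servers, check=lambda s: "central-a" in s)
assert pool == ["http://scadalink-central-a:5000"]

# Cycle 2: A has failed and B has become leader; the next 5s refresh
# removes A from the pool and admits B.
pool = refresh_pool(servers, check=lambda s: "central-b" in s)
assert pool == ["http://scadalink-central-b:5000"]
```

The `timeout: 3s` setting bounds each individual probe, so a hung backend counts as failed well before the next 5s cycle begins.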
## Ports

| Host Port | Container Port | Purpose |
|-----------|----------------|---------|
| 9000 | 80 | Load-balanced entrypoint (Central UI, Management API, Inbound API) |
| 8180 | 8080 | Traefik dashboard |

## Health Endpoints

The central nodes expose the following health endpoints:

| Endpoint | Purpose | Who Uses It |
|----------|---------|-------------|
| `/health/ready` | Readiness gate — 200 when database + Akka cluster are healthy | Kubernetes probes, monitoring |
| `/health/active` | Active node — 200 only on cluster leader | **Traefik** (routing decisions) |

## Dependencies

- **Central cluster nodes**: The two backends (`scadalink-central-a`, `scadalink-central-b`) on the `scadalink-net` Docker network.
- **ActiveNodeHealthCheck**: Health check implementation in `src/ScadaLink.Host/Health/ActiveNodeHealthCheck.cs` that determines cluster leader status.
- **Docker network**: All containers must be on the shared `scadalink-net` bridge network.

## Interactions

- **CLI**: Connects to `http://localhost:9000/management` — routed by Traefik to the active node.
- **Browser (Central UI)**: Connects to `http://localhost:9000` — Blazor Server + SignalR routed to the active node.
- **Inbound API consumers**: Connect to `http://localhost:9000/api/{methodName}` — routed to the active node.
- **Cluster Infrastructure**: The `ActiveNodeHealthCheck` relies on Akka.NET cluster gossip state to determine the leader.

## Production Considerations

The current configuration is for development/testing. In production:

- **TLS termination**: Add an HTTPS entrypoint with certificates (Let's Encrypt via Traefik's ACME provider, or static certs).
- **Dashboard auth**: Disable `insecure: true` and configure authentication on the dashboard.
- **WebSocket support**: Traefik proxies WebSockets natively — no additional config needed for SignalR.
- **Sticky sessions**: Not required. The Management API is stateless (Basic Auth per request). Blazor Server circuits are bound to a specific node via SignalR, but reconnection handles failover transparently.
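As a starting point for the TLS item above, a production static config might add an HTTPS entrypoint plus an ACME resolver and redirect plain HTTP to it. This is a hedged sketch, not the project's configuration: the resolver name, email, and storage path are placeholders.

```yaml
# Sketch only: possible production additions to traefik.yml (all values
# below are placeholders, not project settings).
entryPoints:
  web:
    address: ":80"
    http:
      redirections:              # send plain HTTP to HTTPS
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:                   # placeholder resolver name
    acme:
      email: ops@example.com     # placeholder contact address
      storage: /etc/traefik/acme.json
      httpChallenge:
        entryPoint: web
```

Routers in the dynamic config would then opt in by listing `websecure` under `entryPoints` and setting `tls: { certResolver: letsencrypt }`.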