Joseph Doherty 66f0f96328 docs(components): verification pass — fix cross-link targets, tag code fences, correct type names
- Fix 15 link-text/target mismatches (ConfigurationDatabase ×8 to Commons,
  NotificationOutbox ×4, ClusterInfrastructure case, HealthMonitoring,
  SiteCallAudit) caught by a link-text-vs-target consistency check.
- Tag 14 untagged code-fence openers (ASCII diagrams/trees, JSON, HTTP).
- Correct 4 type names to match source (ValidationService, HealthReportSender,
  CentralCommunicationActor, DebugSnapshotCommand set).
- Soften Traefik version prose per the style guide.
2026-06-03 16:09:06 -04:00

# Traefik Proxy
The Traefik Proxy is the reverse proxy and load balancer that fronts the central cluster's two web servers. It exposes a single stable entrypoint for all central traffic — Central UI, Management API, Inbound API — and routes exclusively to whichever central node is currently the Akka.NET cluster leader, using a health-check on each node's `/health/active` endpoint to make that determination. When the active node changes, Traefik detects the change on its next poll cycle and redirects traffic automatically, with no operator intervention.
## Overview
The proxy runs as the `scadabridge-traefik` Docker container in the main compose stack (`docker/docker-compose.yml`). It is a third-party infrastructure component (Traefik; the image tag is pinned in `docker/docker-compose.yml`) — there is no C# project for it. Its entire configuration is two YAML files mounted read-only into the container:
- `docker/traefik/traefik.yml` — static config: entrypoints, API dashboard, and file provider declaration.
- `docker/traefik/dynamic.yml` — routing rules: the router that catches all traffic, the `central` load-balancer service listing both backend nodes, and the `/health/active` health-check settings.
The proxy sits on the `scadabridge-net` Docker bridge network alongside both central nodes (`scadabridge-central-a`, `scadabridge-central-b`) and all site containers, so it can reach the central backends by container name.
## Key Concepts
### Active-node routing via `/health/active`
Traefik does not know which central node is the Akka.NET cluster leader — it discovers this by polling `/health/active` on both backends. The Host registers `ActiveNodeHealthCheck` under the `Active` health tag; `app.MapZbHealth()` serves it at `/health/active`. The check returns HTTP 200 on the leader and HTTP 503 on the standby (or when the actor system has not yet reached `MemberStatus.Up`):
```csharp
public bool IsActiveNode
{
    get
    {
        var system = _akkaService.ActorSystem;
        if (system == null)
            return false;

        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up)
            return false;

        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
```
The identical leadership check backs `ActiveNodeGate` — the `IActiveNodeGate` implementation the Inbound API endpoint filter consults before executing method scripts. Both surfaces agree on which node is active because they share the same Akka cluster state.
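The status-code contract can be exercised from a shell. The following is a minimal sketch, not part of the repository: the function names `classify_active_status` and `probe_node` are hypothetical, and the direct node ports (9001, 9002) are the ones listed under Port mapping.

```bash
# Map the HTTP status returned by /health/active to a node role.
# 200 = cluster leader (active); 503 = standby, or actor system not yet Up.
classify_active_status() {
  case "$1" in
    200) echo "active" ;;
    503) echo "standby" ;;
    *)   echo "unknown" ;;
  esac
}

# Hypothetical helper: probe one node directly and print its role.
# curl reports status 000 when the node is unreachable, which maps to "unknown".
probe_node() {
  local url code
  url="$1"
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$url/health/active")
  classify_active_status "$code"
}

# Usage (requires the compose stack to be running):
#   probe_node http://localhost:9001   # central-a
#   probe_node http://localhost:9002   # central-b
```

In a healthy cluster, exactly one of the two probes should report `active`.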
### Automatic failover
When the active central node goes down, the Akka cluster's keep-oldest split-brain resolver promotes the surviving node to leader (roughly 25 seconds: a 10-second heartbeat threshold plus a 15-second stable-after period). Traefik removes the failed backend from the pool as soon as it fails a health-check poll. Once the surviving node's `ActiveNodeHealthCheck` starts returning 200, Traefik's next poll (at most 5 seconds later) marks it healthy and routes all subsequent requests to the new active node. No config change or restart is required on the Traefik side.
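As a rough worked example, the timings above add up to an upper bound on the failover window. This sketch assumes the three figures are simply additive and ignores the health-check timeout and any startup jitter; `failover_window_seconds` is a hypothetical helper name.

```bash
# Worst-case seconds between the active node failing and Traefik
# routing to the new leader, assuming the delays are additive.
# $1 = heartbeat threshold, $2 = stable-after period, $3 = Traefik poll interval
failover_window_seconds() {
  local heartbeat stable_after poll_interval
  heartbeat="$1"; stable_after="$2"; poll_interval="$3"
  echo $(( heartbeat + stable_after + poll_interval ))
}

failover_window_seconds 10 15 5   # prints 30
```

Clients should therefore expect up to about half a minute of failed requests during a failover and retry accordingly.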
## Architecture
### Docker topology
```text
Clients (CLI, browser, external API)
      host:9000 (HTTP)
    ┌───────▼──────────────────┐
    │  scadabridge-traefik     │  (Traefik container)
    │  entrypoint :80          │
    └──────┬──────────┬────────┘
           │ /health/active poll (5s)
           ▼          ▼
     scadabridge-   scadabridge-
    central-a:5000  central-b:5000
    (ACTIVE → 200)  (STANDBY → 503)
Clients always connect to `http://localhost:9000`. The two central nodes are also reachable directly — `central-a` on host port 9001, `central-b` on host port 9002 — but these bypass the load balancer and should be used only for direct debugging. The Traefik dashboard is accessible at `http://localhost:8180`.
### Request flow
Every incoming request on the `web` entrypoint hits the `central` router, which matches all paths (`PathPrefix("/")`) and forwards to the `central` load-balancer service. The load balancer only includes servers that are currently passing the health check, so in normal operation all traffic goes to the single healthy (active) backend.
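The pool-filtering behaviour can be modelled in a few lines. This is an illustrative simulation only, not Traefik's implementation: `healthy_pool` is a hypothetical function, and it assumes that a 2xx or 3xx response to the health check counts as healthy.

```bash
# Print the backends that would remain in the load-balancer pool.
# Each argument is "name=status", where status is the last health-check
# HTTP status code observed for that backend.
healthy_pool() {
  local entry name code
  for entry in "$@"; do
    name="${entry%%=*}"
    code="${entry##*=}"
    # Assumption: 2xx and 3xx responses count as healthy.
    if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
      echo "$name"
    fi
  done
}

healthy_pool "scadabridge-central-a=200" "scadabridge-central-b=503"
# prints scadabridge-central-a
```

When both backends return 503 the function prints nothing, which corresponds to the window during a failover in which no backend is eligible.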
## Usage
Traefik starts automatically with the cluster compose stack:
```bash
# Start full cluster (includes Traefik)
docker compose -f docker/docker-compose.yml up -d
# Check Traefik dashboard (shows backend health status)
open http://localhost:8180   # macOS; on Linux use xdg-open
# Verify routing — reaches the active node
curl http://localhost:9000/health/active
# Direct node access (bypasses Traefik — use for debugging only)
curl http://localhost:9001/health/active # central-a
curl http://localhost:9002/health/active # central-b
```
The Traefik container's `restart: unless-stopped` policy means it recovers automatically after a Docker host restart.
## Configuration
### Static config (`docker/traefik/traefik.yml`)
```yaml
entryPoints:
  web:
    address: ":80"
api:
  dashboard: true
  insecure: true
providers:
  file:
    filename: /etc/traefik/dynamic.yml
```
| Key | Value | Effect |
|-----|-------|--------|
| `entryPoints.web.address` | `:80` | Listens on container port 80, mapped to host port 9000. |
| `api.dashboard` | `true` | Enables the Traefik web dashboard. |
| `api.insecure` | `true` | Serves the dashboard on port 8080 without auth (development only). |
| `providers.file.filename` | `/etc/traefik/dynamic.yml` | Loads routing rules from the mounted dynamic config; no Docker socket required. |
### Dynamic config (`docker/traefik/dynamic.yml`)
```yaml
http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web
  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadabridge-central-a:5000"
          - url: "http://scadabridge-central-b:5000"
```
| Setting | Value | Effect |
|---------|-------|--------|
| `routers.central.rule` | `PathPrefix("/")` | Catches every request on the `web` entrypoint. |
| `services.central.loadBalancer.healthCheck.path` | `/health/active` | The endpoint Traefik polls on each backend. |
| `services.central.loadBalancer.healthCheck.interval` | `5s` | Poll cadence; a backend failing the check is removed within one interval. |
| `services.central.loadBalancer.healthCheck.timeout` | `3s` | Per-poll timeout; a non-responding backend counts as unhealthy. |
| `servers[0].url` | `http://scadabridge-central-a:5000` | `central-a` backend, reachable by container name on `scadabridge-net`. |
| `servers[1].url` | `http://scadabridge-central-b:5000` | `central-b` backend, reachable by container name on `scadabridge-net`. |
### Port mapping
| Host port | Container port | Purpose |
|-----------|---------------|---------|
| `9000` | `80` | Load-balanced entrypoint — all central traffic (Central UI, Management API, Inbound API). |
| `8180` | `8080` | Traefik dashboard. |
| `9001` | `5000` | Direct access to `central-a` (bypasses Traefik). |
| `9002` | `5000` | Direct access to `central-b` (bypasses Traefik). |
## Dependencies & Interactions
- [Host (#15)](./Host.md) — implements and serves `/health/active` via `ActiveNodeHealthCheck` (tagged `Active`, mounted by `app.MapZbHealth()`). Also implements `ActiveNodeGate`, which enforces the same active-node contract at the Inbound API filter level, providing a defence-in-depth layer if traffic reaches the standby directly.
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — the underlying Akka.NET cluster determines which node is the leader. Traefik's routing decision is derived entirely from cluster leadership state via the health-check poll; Traefik has no Akka dependency of its own.
- [Central UI (#9)](./CentralUI.md) — Blazor Server (SignalR/WebSocket circuits) is proxied through Traefik. Traefik proxies WebSocket connections natively with no additional config. On failover, active SignalR circuits on the failed node are lost; the browser's reconnection logic re-establishes the circuit on the new active node. Session continuity is preserved because authentication uses a cookie-embedded JWT with Data Protection keys shared across both central nodes.
- [Inbound API (#14)](./InboundAPI.md) — external API consumers target `http://localhost:9000/api/{methodName}`. Traefik routes each request to the active node; if a request reaches the standby directly (bypassing Traefik), `ActiveNodeGate` responds with HTTP 503.
- [CLI (#19)](./CLI.md) — the CLI connects to the Management API via `http://localhost:9000` (the Traefik entrypoint) by default, so it always reaches the active central node without needing to know which node is active.
## Troubleshooting
### Both backends show unhealthy on the dashboard
If both `central-a` and `central-b` appear red on the Traefik dashboard, neither node's `ActiveNodeHealthCheck` is returning 200. Common causes:
1. **Akka cluster has not formed yet:** both nodes are still starting. Wait for the cluster to stabilise (typically 10 to 15 seconds after both containers are up). Check the central node logs for `Cluster is now ready`.
2. **Split-brain resolver has downed both nodes:** a network partition was followed by a split-brain condition. Restart the cluster via `bash docker/deploy.sh`.
3. **Traefik cannot reach the backends:** the `scadabridge-net` Docker network may not exist. Create it: `docker network create scadabridge-net`.
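The checks above can be scripted as a quick triage pass. This is a sketch under stated assumptions: both function names are hypothetical, the container and network names are the ones used in this document, and the log line is matched literally as `Cluster is now ready`.

```bash
# Succeeds if the log text on stdin contains the readiness line.
log_shows_cluster_ready() {
  grep -q 'Cluster is now ready'
}

# Hypothetical triage helper: checks the Docker network first, then
# looks for the readiness line in each central node's logs.
triage_unhealthy_backends() {
  local node
  if ! docker network inspect scadabridge-net >/dev/null 2>&1; then
    echo "network scadabridge-net is missing"
    return 1
  fi
  for node in scadabridge-central-a scadabridge-central-b; do
    if docker logs "$node" 2>&1 | log_shows_cluster_ready; then
      echo "$node: cluster formed"
    else
      echo "$node: cluster not ready yet"
    fi
  done
}

# Usage (requires Docker and the compose stack):
#   triage_unhealthy_backends
```

If the network exists and both nodes report that the cluster formed, a split-brain downing is the remaining candidate and a redeploy is the next step.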
### Traffic reaches a standby node
If a client receives HTTP 503 with `X-ScadaBridge-Active: false`, the request reached a standby node — either because Traefik has not yet completed its health-check poll after a failover (up to 5 seconds), or because the client is connecting directly to port 9001/9002 instead of port 9000. Use `http://localhost:9000` for all normal access. The 503 is transient during the Traefik poll window; the client should retry.
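A client-side retry for the transient 503 might look like the following. It is a minimal sketch rather than a recommended policy: `retry_until_active` and `entrypoint_status` are hypothetical names, and the attempt count and delay are placeholders to tune against the 5-second poll interval.

```bash
# Run a status-producing command until it prints 200 or attempts run out.
# $1 = max attempts, $2 = delay in seconds between attempts,
# remaining args = a command that prints an HTTP status code.
retry_until_active() {
  local max delay attempt code
  max="$1"; delay="$2"; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    code=$("$@") || true
    if [ "$code" = "200" ]; then
      echo "ok after $attempt attempt(s)"
      return 0
    fi
    if [ "$attempt" -lt "$max" ]; then sleep "$delay"; fi
    attempt=$(( attempt + 1 ))
  done
  echo "gave up after $max attempt(s), last status $code"
  return 1
}

# Hypothetical probe: prints the status returned through the Traefik entrypoint.
entrypoint_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 3 http://localhost:9000/health/active
}

# Usage (requires the compose stack to be running):
#   retry_until_active 5 2 entrypoint_status
```

Five attempts two seconds apart span more than one poll interval, which is enough to ride out the poll window but not a full leader election.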
### Health check succeeds but `/health/ready` returns degraded
`/health/active` and `/health/ready` are independent. A node can pass the active check (it is the leader) but fail the readiness check (database or Akka cluster health probe failed). Traefik only uses `/health/active`; readiness gating is for orchestration and monitoring. Check the node's structured logs for `database` or `akka-cluster` check failures.
## Related Documentation
- [Traefik Proxy design specification](../requirements/Component-TraefikProxy.md)
- [Host](./Host.md)
- [Cluster Infrastructure](./ClusterInfrastructure.md)
- [Central UI](./CentralUI.md)
- [Inbound API](./InboundAPI.md)
- [CLI](./CLI.md)