Joseph Doherty 66f0f96328 docs(components): verification pass — fix cross-link targets, tag code fences, correct type names
- Fix 15 link-text/target mismatches (ConfigurationDatabase ×8 to Commons,
  NotificationOutbox ×4, ClusterInfrastructure case, HealthMonitoring,
  SiteCallAudit) caught by a link-text-vs-target consistency check.
- Tag 14 untagged code-fence openers (ASCII diagrams/trees, JSON, HTTP).
- Correct 4 type names to match source (ValidationService, HealthReportSender,
  CentralCommunicationActor, DebugSnapshotCommand set).
- Soften Traefik version prose per the style guide.
2026-06-03 16:09:06 -04:00


Traefik Proxy

The Traefik Proxy is the reverse proxy and load balancer that fronts the central cluster's two web servers. It exposes a single stable entrypoint for all central traffic — Central UI, Management API, Inbound API — and routes exclusively to whichever central node is currently the Akka.NET cluster leader, using a health-check on each node's /health/active endpoint to make that determination. When the active node changes, Traefik detects the change on its next poll cycle and redirects traffic automatically, with no operator intervention.

Overview

The proxy runs as the scadabridge-traefik Docker container in the main compose stack (docker/docker-compose.yml). It is a third-party infrastructure component (Traefik; the image tag is pinned in docker/docker-compose.yml) — there is no C# project for it. Its entire configuration is two YAML files mounted read-only into the container:

  • docker/traefik/traefik.yml — static config: entrypoints, API dashboard, and file provider declaration.
  • docker/traefik/dynamic.yml — routing rules: the router that catches all traffic, the central load-balancer service listing both backend nodes, and the /health/active health-check settings.

The proxy sits on the scadabridge-net Docker bridge network alongside both central nodes (scadabridge-central-a, scadabridge-central-b) and all site containers, so it can reach the central backends by container name.

Key Concepts

Active-node routing via /health/active

Traefik does not know which central node is the Akka.NET cluster leader — it discovers this by polling /health/active on both backends. The Host registers ActiveNodeHealthCheck under the Active health tag; app.MapZbHealth() serves it at /health/active. The check returns HTTP 200 on the leader and HTTP 503 on the standby (or when the actor system has not yet reached MemberStatus.Up):

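public bool IsActiveNode
{
    get
    {
        // No actor system yet: the node cannot be active.
        var system = _akkaService.ActorSystem;
        if (system == null)
            return false;

        // A member that has not reached Up is never treated as active.
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up)
            return false;

        // Active means this node's address is the current cluster leader.
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}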

The identical leadership check backs ActiveNodeGate — the IActiveNodeGate implementation the Inbound API endpoint filter consults before executing method scripts. Both surfaces agree on which node is active because they share the same Akka cluster state.
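The decision made by the property can be restated outside Akka.NET as a small pure function. The sketch below is Python rather than C#, and the names is_active_node, health_status, NODE_A and NODE_B are illustrative only, not part of the codebase. It mirrors the three conditions shown above and the 200/503 mapping described for /health/active, nothing more.

```python
from typing import Optional


def is_active_node(system_started: bool, member_status: str,
                   leader: Optional[str], self_address: str) -> bool:
    """Restate the three conditions: system exists, member is Up, leader is self."""
    if not system_started:
        return False
    if member_status != "Up":
        return False
    return leader is not None and leader == self_address


def health_status(active: bool) -> int:
    """Status code that /health/active is described as returning."""
    return 200 if active else 503


# Hypothetical node addresses, for illustration only.
NODE_A = "central-a"
NODE_B = "central-b"

print(health_status(is_active_node(True, "Up", NODE_A, NODE_A)))       # leader: 200
print(health_status(is_active_node(True, "Up", NODE_A, NODE_B)))       # standby: 503
print(health_status(is_active_node(True, "Joining", NODE_A, NODE_A)))  # not yet Up: 503
```

Keeping the check free of side effects is what lets two callers, the health endpoint and the Inbound API filter, reach the same answer from the same cluster state.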

Automatic failover

When the active central node goes down, the Akka cluster's keep-oldest split-brain resolver promotes the surviving node to leader (roughly 25 seconds: 10-second heartbeat threshold plus a 15-second stable-after period). Once the surviving node's ActiveNodeHealthCheck starts returning 200, Traefik's next poll, at most one 5-second interval later, marks that backend healthy and routes all subsequent requests to it. The failed backend has already dropped out of the pool by then, because its health checks no longer succeed. No config change or restart is required on the Traefik side.
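The timing figures quoted here can be added up to estimate how long central traffic is unavailable during a failover. The following is a back-of-the-envelope sketch, assuming promotion takes exactly the heartbeat threshold plus the stable-after period and ignoring the time the health-check request itself takes. The function name failover_window is hypothetical.

```python
HEARTBEAT_THRESHOLD_S = 10  # heartbeat threshold quoted in the text
STABLE_AFTER_S = 15         # stable-after period quoted in the text
POLL_INTERVAL_S = 5         # Traefik health-check interval


def failover_window(heartbeat_s: int, stable_after_s: int, poll_interval_s: int):
    """Return (best_case, worst_case) seconds until Traefik routes to the new leader.

    Best case: a poll lands right after promotion.
    Worst case: promotion happens just after a poll, so a full interval passes.
    """
    promotion = heartbeat_s + stable_after_s
    return promotion, promotion + poll_interval_s


best, worst = failover_window(HEARTBEAT_THRESHOLD_S, STABLE_AFTER_S, POLL_INTERVAL_S)
print(f"routing resumes after roughly {best} to {worst} seconds")
```

Under these assumptions the cluster's own promotion time dominates the window, so shortening the Traefik interval would recover at most a few seconds.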

Architecture

Docker topology

  Clients (CLI, browser, external API)
          │
   host:9000 (HTTP)
          │
  ┌───────▼──────────────────┐
  │  scadabridge-traefik      │  (Traefik container)   
  │  entrypoint :80           │
  └──────┬──────────┬─────────┘
         │  /health/active poll (5s)
         ▼          ▼
  scadabridge-    scadabridge-
  central-a:5000  central-b:5000
  (ACTIVE → 200)  (STANDBY → 503)

Clients always connect to http://localhost:9000. The two central nodes are also reachable directly — central-a on host port 9001, central-b on host port 9002 — but these bypass the load balancer and should be used only for direct debugging. The Traefik dashboard is accessible at http://localhost:8180.

Request flow

Every incoming request on the web entrypoint hits the central router, which matches all paths (PathPrefix("/")) and forwards to the central load-balancer service. The load balancer only includes servers that are currently passing the health check, so in normal operation all traffic goes to the single healthy (active) backend.
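The selection rule can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not Traefik's implementation: the function names and the last_status dictionary are hypothetical, and the round-robin step only matters when more than one server is healthy, which the active/standby design is meant to prevent.

```python
from typing import Dict, List, Optional

SERVERS = [
    "http://scadabridge-central-a:5000",
    "http://scadabridge-central-b:5000",
]


def healthy_servers(servers: List[str], last_status: Dict[str, int]) -> List[str]:
    """Keep servers whose latest health check returned a 2xx or 3xx status."""
    return [s for s in servers if 200 <= last_status.get(s, 0) < 400]


def route(servers: List[str], last_status: Dict[str, int],
          request_index: int) -> Optional[str]:
    """Pick a backend for a request, or None when no server is healthy."""
    pool = healthy_servers(servers, last_status)
    if not pool:
        return None
    return pool[request_index % len(pool)]


# Normal operation: central-a is the leader, central-b answers 503.
status = {SERVERS[0]: 200, SERVERS[1]: 503}
print([route(SERVERS, status, i) for i in range(3)])
```

With one healthy server the pool has a single entry, so every request index maps to the same backend.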

Usage

Traefik starts automatically with the cluster compose stack:

# Start full cluster (includes Traefik)
docker compose -f docker/docker-compose.yml up -d

# Check Traefik dashboard (shows backend health status)
# `open` is the macOS launcher; on Linux use xdg-open instead
open http://localhost:8180

# Verify routing — reaches the active node
curl http://localhost:9000/health/active

# Direct node access (bypasses Traefik — use for debugging only)
curl http://localhost:9001/health/active   # central-a
curl http://localhost:9002/health/active   # central-b

The Traefik container's restart: unless-stopped policy means it recovers automatically after a Docker host restart.
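The three curl checks can also be scripted. Below is a minimal sketch using only the Python standard library. The helper names http_status and probe_all are hypothetical, and the URLs are the host-port endpoints listed above. A 503 from a standby arrives as an HTTPError, so it is caught and reported as a status rather than treated as a failure.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional

ENDPOINTS = {
    "proxy": "http://localhost:9000/health/active",
    "central-a": "http://localhost:9001/health/active",
    "central-b": "http://localhost:9002/health/active",
}


def http_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """Return the HTTP status of a GET, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses such as 503 land here
    except (urllib.error.URLError, OSError):
        return None      # connection refused, timeout, DNS failure


def probe_all(fetch: Callable[[str], Optional[int]] = http_status) -> Dict[str, Optional[int]]:
    """Probe every endpoint and return a name-to-status mapping."""
    return {name: fetch(url) for name, url in ENDPOINTS.items()}
```

Calling probe_all() with the stack running returns one status per endpoint. Passing a different fetch callable makes the function usable in tests without a live cluster.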

Configuration

Static config (docker/traefik/traefik.yml)

entryPoints:
  web:
    address: ":80"

api:
  dashboard: true
  insecure: true

providers:
  file:
    filename: /etc/traefik/dynamic.yml

| Key | Value | Effect |
| --- | --- | --- |
| entryPoints.web.address | :80 | Listens on container port 80, mapped to host port 9000. |
| api.dashboard | true | Enables the Traefik web dashboard. |
| api.insecure | true | Serves the dashboard on port 8080 without auth (development only). |
| providers.file.filename | /etc/traefik/dynamic.yml | Loads routing rules from the mounted dynamic config; no Docker socket required. |

Dynamic config (docker/traefik/dynamic.yml)

http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web

  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadabridge-central-a:5000"
          - url: "http://scadabridge-central-b:5000"

| Setting | Value | Effect |
| --- | --- | --- |
| routers.central.rule | PathPrefix("/") | Catches every request on the web entrypoint. |
| services.central.loadBalancer.healthCheck.path | /health/active | The endpoint Traefik polls on each backend. |
| services.central.loadBalancer.healthCheck.interval | 5s | Poll cadence; a backend failing the check is removed within one interval. |
| services.central.loadBalancer.healthCheck.timeout | 3s | Per-poll timeout; a non-responding backend counts as unhealthy. |
| servers[0].url | http://scadabridge-central-a:5000 | central-a backend, reachable by container name on scadabridge-net. |
| servers[1].url | http://scadabridge-central-b:5000 | central-b backend, reachable by container name on scadabridge-net. |

Port mapping

| Host port | Container port | Purpose |
| --- | --- | --- |
| 9000 | 80 | Load-balanced entrypoint for all central traffic (Central UI, Management API, Inbound API). |
| 8180 | 8080 | Traefik dashboard. |
| 9001 | 5000 | Direct access to central-a (bypasses Traefik). |
| 9002 | 5000 | Direct access to central-b (bypasses Traefik). |

Dependencies & Interactions

  • Host (#15) — implements and serves /health/active via ActiveNodeHealthCheck (tagged Active, mounted by app.MapZbHealth()). Also implements ActiveNodeGate, which enforces the same active-node contract at the Inbound API filter level, providing a defence-in-depth layer if traffic reaches the standby directly.
  • Cluster Infrastructure (#13) — the underlying Akka.NET cluster determines which node is the leader. Traefik's routing decision is derived entirely from cluster leadership state via the health-check poll; Traefik has no Akka dependency of its own.
  • Central UI (#9) — Blazor Server (SignalR/WebSocket circuits) is proxied through Traefik. Traefik proxies WebSocket connections natively with no additional config. On failover, active SignalR circuits on the failed node are lost; the browser's reconnection logic re-establishes the circuit on the new active node. Session continuity is preserved because authentication uses a cookie-embedded JWT with Data Protection keys shared across both central nodes.
  • Inbound API (#14) — external API consumers target http://localhost:9000/api/{methodName}. Traefik routes each request to the active node; if a request reaches the standby directly (bypassing Traefik), ActiveNodeGate responds with HTTP 503.
  • CLI (#19) — the CLI connects to the Management API via http://localhost:9000 (the Traefik entrypoint) by default, so it always reaches the active central node without needing to know which node is active.

Troubleshooting

Both backends show unhealthy on the dashboard

If both central-a and central-b appear red on the Traefik dashboard, neither node's ActiveNodeHealthCheck is returning 200. Common causes:

  1. Akka cluster has not formed yet: both nodes are still starting. Wait for the cluster to stabilise (typically 10 to 15 seconds after both containers are up). Check the central node logs for Cluster is now ready.
  2. Split-brain resolver has downed both nodes: a network partition followed by a split-brain condition. Restart the cluster via bash docker/deploy.sh.
  3. Traefik cannot reach the backends: the scadabridge-net Docker network may not exist. Create it with docker network create scadabridge-net.
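One way to narrow down the three causes is to probe each node directly on its host port (9001 and 9002) and compare the result with what the dashboard shows. The sketch below is a hypothetical triage helper, not part of the project. It takes the two direct statuses, with None meaning the node did not answer, and suggests where to look first. The "containers not serving" branch is an inference rather than something the list above states.

```python
from typing import Optional


def triage(direct_a: Optional[int], direct_b: Optional[int]) -> str:
    """Suggest a first check, given direct /health/active statuses.

    Intended for the case where the dashboard shows both backends unhealthy.
    """
    statuses = (direct_a, direct_b)
    if 200 in statuses:
        # A node reports itself active, yet Traefik sees it as down:
        # the path from Traefik to the backends is the likely suspect.
        return "check scadabridge-net"
    if all(s is None for s in statuses):
        # Assumption: no answer on either host port means the containers
        # are not serving yet.
        return "check containers are running"
    # Nodes answer but none is active: cluster still forming, or nodes downed.
    return "check cluster logs"


print(triage(503, 503))
```

The return values are plain hints for a human operator. Mapping them to automated actions such as a restart is deliberately left out.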

Traffic reaches a standby node

If a client receives HTTP 503 with X-ScadaBridge-Active: false, the request reached a standby node — either because Traefik has not yet completed its health-check poll after a failover (up to 5 seconds), or because the client is connecting directly to port 9001/9002 instead of port 9000. Use http://localhost:9000 for all normal access. The 503 is transient during the Traefik poll window; the client should retry.
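A client-side retry for this transient 503 might look like the sketch below. The function name call_with_retry and its defaults are assumptions chosen for illustration: four attempts two seconds apart span six seconds, slightly longer than the 5-second poll window. The send callable stands in for whatever HTTP call the client makes and returns a status code.

```python
import time
from typing import Callable


def call_with_retry(send: Callable[[], int], attempts: int = 4,
                    delay_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns a status other than 503 or attempts run out."""
    status = send()
    for _ in range(attempts - 1):
        if status != 503:
            break
        sleep(delay_s)
        status = send()
    return status


# Usage with a stubbed request: two 503s during the poll window, then success.
responses = iter([503, 503, 200])
print(call_with_retry(lambda: next(responses), sleep=lambda s: None))  # prints 200
```

Only 503 is retried here. Other error statuses are returned immediately so that genuine failures are not masked by the retry loop.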

Health check succeeds but /health/ready returns degraded

/health/active and /health/ready are independent. A node can pass the active check (it is the leader) but fail the readiness check (database or Akka cluster health probe failed). Traefik only uses /health/active; readiness gating is for orchestration and monitoring. Check the node's structured logs for database or akka-cluster check failures.