docs(components): reference docs batch 4/4 — ManagementService, CLI, Transport, CentralUI, TraefikProxy, TreeView
# Traefik Proxy

The Traefik Proxy is the reverse proxy and load balancer that fronts the central cluster's two web servers. It exposes a single stable entrypoint for all central traffic (Central UI, Management API, Inbound API) and routes exclusively to whichever central node is currently the Akka.NET cluster leader. It identifies the leader by health-checking each node's `/health/active` endpoint. When the active node changes, Traefik detects the change on its next poll cycle and redirects traffic automatically, with no operator intervention.

## Overview

The proxy runs as the `scadabridge-traefik` Docker container in the main compose stack (`docker/docker-compose.yml`). It is a third-party infrastructure component (Traefik v3.4), so there is no C# project for it. Its entire configuration is two YAML files mounted read-only into the container:

- `docker/traefik/traefik.yml`: static config (entrypoints, API dashboard, and file provider declaration).
- `docker/traefik/dynamic.yml`: routing rules (the router that catches all traffic, the `central` load-balancer service listing both backend nodes, and the `/health/active` health-check settings).

The proxy sits on the `scadabridge-net` Docker bridge network alongside both central nodes (`scadabridge-central-a`, `scadabridge-central-b`) and all site containers, so it can reach the central backends by container name.

## Key Concepts

### Active-node routing via `/health/active`

Traefik has no direct knowledge of which central node is the Akka.NET cluster leader; it discovers this by polling `/health/active` on both backends. The Host registers `ActiveNodeHealthCheck` under the `Active` health tag, and `app.MapZbHealth()` serves it at `/health/active`. The check returns HTTP 200 on the leader and HTTP 503 on the standby (or on any node whose actor system has not started or whose cluster membership has not yet reached `MemberStatus.Up`):

```csharp
public bool IsActiveNode
{
    get
    {
        var system = _akkaService.ActorSystem;
        if (system == null)
            return false;

        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up)
            return false;

        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
```

The identical leadership check backs `ActiveNodeGate`, the `IActiveNodeGate` implementation that the Inbound API endpoint filter consults before executing method scripts. Both surfaces agree on which node is active because they read the same Akka cluster state.
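
The 200/503 contract can also be observed from the host by probing each node directly. The following is a minimal sketch, assuming `curl` is installed and that the direct host ports 9001 and 9002 listed under Port mapping are published; `classify_status` and `probe_node` are hypothetical helper names, not part of the repository.

```bash
# Map an HTTP status from /health/active to a node role.
# 200 = active (leader); 503 = standby or not yet Up.
classify_status() {
  case "$1" in
    200) echo "active" ;;
    503) echo "standby" ;;
    000) echo "unreachable" ;;      # curl prints 000 when no HTTP response arrived
    *)   echo "unexpected($1)" ;;
  esac
}

# Probe one node's base URL and print "<url> <role>".
probe_node() {
  code=$(curl -s -o /dev/null -m 3 -w '%{http_code}' "$1/health/active") || true
  echo "$1 $(classify_status "${code:-000}")"
}
```

Running `probe_node http://localhost:9001` and `probe_node http://localhost:9002` against a healthy cluster should report exactly one node as `active`.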

### Automatic failover

When the active central node goes down, the Akka cluster's keep-oldest split-brain resolver promotes the surviving node to leader. This takes roughly 25 seconds: a 10-second heartbeat threshold plus a 15-second stable-after period. The failed backend drops out of Traefik's pool as soon as it fails a health-check poll. Once the surviving node's `ActiveNodeHealthCheck` starts returning 200, Traefik's next poll (at most 5 seconds later) marks it healthy and routes all subsequent requests to it. No config change or restart is required on the Traefik side.
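
The failover timing adds up as follows. This is a rough upper-bound sketch that uses only the figures quoted in this section; the variable and function names are illustrative, and the real duration also depends on where the failure lands relative to the heartbeat and poll cycles.

```bash
# Figures quoted in this section, in seconds.
HEARTBEAT_THRESHOLD=10   # heartbeat threshold before the failed node is acted on
STABLE_AFTER=15          # split-brain resolver stable-after period
POLL_INTERVAL=5          # Traefik healthCheck.interval

# Approximate worst case: leader promotion plus one full Traefik poll interval.
failover_window_seconds() {
  echo $(( HEARTBEAT_THRESHOLD + STABLE_AFTER + POLL_INTERVAL ))
}

failover_window_seconds   # prints 30
```

Clients that retry for somewhat longer than this window should ride through a failover without operator action.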

## Architecture

### Docker topology

```
Clients (CLI, browser, external API)
        │
 host:9000 (HTTP)
        │
┌───────▼──────────────────┐
│  scadabridge-traefik     │  (Traefik v3.4 container)
│  entrypoint :80          │
└──────┬──────────┬────────┘
       │          │   /health/active poll (5s)
       ▼          ▼
 scadabridge-   scadabridge-
 central-a:5000 central-b:5000
 (ACTIVE → 200) (STANDBY → 503)
```

Clients always connect to `http://localhost:9000`. The two central nodes are also reachable directly (`central-a` on host port 9001, `central-b` on host port 9002), but these ports bypass the load balancer and should be used only for debugging. The Traefik dashboard is available at `http://localhost:8180`.

### Request flow

Every incoming request on the `web` entrypoint hits the `central` router, which matches all paths (``PathPrefix(`/`)``) and forwards to the `central` load-balancer service. The load balancer only includes servers that are currently passing the health check, so in normal operation all traffic goes to the single healthy (active) backend.
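
The pool filtering can be modelled in a few lines. This is an illustrative sketch, not Traefik's implementation: `healthy_servers` is a hypothetical helper that takes `name=status` pairs and keeps the servers whose last health-check status was 2xx or 3xx, the range Traefik treats as healthy by default.

```bash
# Print the names of servers whose last health-check status counts as healthy.
# Each argument has the form "<name>=<http-status>".
healthy_servers() {
  for pair in "$@"; do
    name=${pair%%=*}      # text before the first "="
    status=${pair##*=}    # text after the last "="
    case "$status" in
      2??|3??) echo "$name" ;;
    esac
  done
}

healthy_servers "scadabridge-central-a=200" "scadabridge-central-b=503"
# prints: scadabridge-central-a
```

If both nodes report 503 the function prints nothing, which corresponds to an empty pool: Traefik then has no backend to forward to until one node starts passing the check.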

## Usage

Traefik starts automatically with the cluster compose stack:

```bash
# Start full cluster (includes Traefik)
docker compose -f docker/docker-compose.yml up -d

# Check Traefik dashboard (shows backend health status)
# `open` is the macOS launcher; on Linux use xdg-open instead
open http://localhost:8180

# Verify routing: the request should reach the active node
curl http://localhost:9000/health/active

# Direct node access (bypasses Traefik; use for debugging only)
curl http://localhost:9001/health/active   # central-a
curl http://localhost:9002/health/active   # central-b
```

Because the Traefik container uses the `restart: unless-stopped` policy, it recovers automatically after a Docker host restart unless it was stopped manually beforehand.
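
After `docker compose up`, scripts often need to wait until the entrypoint actually has an active backend. The following is a sketch of such a wait loop, assuming `curl` is available; `wait_for_active`, `default_probe`, and the `PROBE` override are hypothetical names introduced here so the HTTP call can be stubbed.

```bash
# Print the HTTP status code for a URL (000 if no response arrived).
default_probe() {
  curl -s -o /dev/null -m 3 -w '%{http_code}' "$1" || true
}

# Poll until /health/active returns 200 or the attempts run out.
# Usage: wait_for_active [url] [attempts] [delay-seconds]
wait_for_active() {
  url=${1:-http://localhost:9000/health/active}
  attempts=${2:-30}
  delay=${3:-2}
  probe=${PROBE:-default_probe}   # set PROBE to a function name to stub the call
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$("$probe" "$url")" = "200" ]; then
      echo "active after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no active node after $attempts attempts" >&2
  return 1
}
```

Calling `wait_for_active` with no arguments polls the Traefik entrypoint for up to about a minute, which leaves room for the cluster to form.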

## Configuration

### Static config (`docker/traefik/traefik.yml`)

```yaml
entryPoints:
  web:
    address: ":80"

api:
  dashboard: true
  insecure: true

providers:
  file:
    filename: /etc/traefik/dynamic.yml
```

| Key | Value | Effect |
|-----|-------|--------|
| `entryPoints.web.address` | `:80` | Listens on container port 80, mapped to host port 9000. |
| `api.dashboard` | `true` | Enables the Traefik web dashboard. |
| `api.insecure` | `true` | Serves the dashboard on port 8080 without auth (development only). |
| `providers.file.filename` | `/etc/traefik/dynamic.yml` | Loads routing rules from the mounted dynamic config; no Docker socket required. |

### Dynamic config (`docker/traefik/dynamic.yml`)

```yaml
http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web

  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadabridge-central-a:5000"
          - url: "http://scadabridge-central-b:5000"
```

| Setting | Value | Effect |
|---------|-------|--------|
| `routers.central.rule` | ``PathPrefix(`/`)`` | Catches every request on the `web` entrypoint. |
| `services.central.loadBalancer.healthCheck.path` | `/health/active` | The endpoint Traefik polls on each backend. |
| `services.central.loadBalancer.healthCheck.interval` | `5s` | Poll cadence; a backend failing the check is removed within one interval. |
| `services.central.loadBalancer.healthCheck.timeout` | `3s` | Per-poll timeout; a non-responding backend counts as unhealthy. |
| `servers[0].url` | `http://scadabridge-central-a:5000` | `central-a` backend, reachable by container name on `scadabridge-net`. |
| `servers[1].url` | `http://scadabridge-central-b:5000` | `central-b` backend, reachable by container name on `scadabridge-net`. |

### Port mapping

| Host port | Container port | Purpose |
|-----------|----------------|---------|
| `9000` | `80` | Load-balanced entrypoint for all central traffic (Central UI, Management API, Inbound API). |
| `8180` | `8080` | Traefik dashboard. |
| `9001` | `5000` | Direct access to `central-a` (bypasses Traefik). |
| `9002` | `5000` | Direct access to `central-b` (bypasses Traefik). |

## Dependencies & Interactions

- [Host (#15)](./Host.md): implements and serves `/health/active` via `ActiveNodeHealthCheck` (tagged `Active`, mounted by `app.MapZbHealth()`). It also implements `ActiveNodeGate`, which enforces the same active-node contract at the Inbound API filter level, providing a defence-in-depth layer if traffic reaches the standby directly.
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md): the underlying Akka.NET cluster determines which node is the leader. Traefik's routing decision is derived entirely from cluster leadership state via the health-check poll; Traefik has no Akka dependency of its own.
- [Central UI (#9)](./CentralUI.md): Blazor Server (SignalR/WebSocket circuits) is proxied through Traefik, which handles WebSocket connections natively with no additional config. On failover, active SignalR circuits on the failed node are lost; the browser's reconnection logic re-establishes the circuit on the new active node. Session continuity is preserved because authentication uses a cookie-embedded JWT with Data Protection keys shared across both central nodes.
- [Inbound API (#14)](./InboundAPI.md): external API consumers target `http://localhost:9000/api/{methodName}`. Traefik routes each request to the active node; if a request reaches the standby directly (bypassing Traefik), `ActiveNodeGate` responds with HTTP 503.
- [CLI (#19)](./CLI.md): the CLI connects to the Management API via `http://localhost:9000` (the Traefik entrypoint) by default, so it always reaches the active central node without needing to know which node that is.

## Troubleshooting

### Both backends show unhealthy on the dashboard

If both `central-a` and `central-b` appear red on the Traefik dashboard, neither node's `ActiveNodeHealthCheck` is returning 200. Common causes:

1. **Akka cluster has not formed yet**: both nodes are still starting. Wait for the cluster to stabilise (typically 10–15 seconds after both containers are up). Check the central node logs for `Cluster is now ready`.
2. **Split-brain resolver has downed both nodes**: this can happen after a network partition. Restart the cluster via `bash docker/deploy.sh`.
3. **Traefik cannot reach the backends**: the `scadabridge-net` Docker network may not exist. Create it with `docker network create scadabridge-net`.

### Traffic reaches a standby node

If a client receives HTTP 503 with `X-ScadaBridge-Active: false`, the request reached a standby node. Either Traefik has not yet completed its health-check poll after a failover (a window of up to 5 seconds), or the client is connecting directly to port 9001 or 9002 instead of port 9000. Use `http://localhost:9000` for all normal access. During the Traefik poll window the 503 is transient, so the client should retry.
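
A client-side retry for the transient 503 might look like the sketch below. It assumes `curl` is available; `retry_on_503`, `default_request`, and the `REQUEST` override are hypothetical names used only for illustration, and the default of four attempts two seconds apart is chosen so that the retries span slightly more than the 5-second poll interval.

```bash
# Print the HTTP status code for a URL (000 if no response arrived).
default_request() {
  curl -s -o /dev/null -m 3 -w '%{http_code}' "$1" || true
}

# Retry while the response is 503, up to max-attempts tries in total.
# Prints the final status code; returns non-zero only if it is still 503.
# Usage: retry_on_503 <url> [max-attempts] [delay-seconds]
retry_on_503() {
  url=$1
  max_attempts=${2:-4}
  delay=${3:-2}
  request=${REQUEST:-default_request}   # set REQUEST to a function name to stub the call
  attempt=1
  while :; do
    code=$("$request" "$url")
    if [ "$code" != "503" ] || [ "$attempt" -ge "$max_attempts" ]; then
      echo "$code"
      [ "$code" != "503" ]
      return
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Any status other than 503 is passed straight back to the caller, so genuine errors are not masked by the retry loop.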

### Health check succeeds but `/health/ready` returns degraded

`/health/active` and `/health/ready` are independent. A node can pass the active check (it is the leader) but fail the readiness check (the database or Akka cluster health probe failed). Traefik uses only `/health/active`; readiness gating is for orchestration and monitoring. Check the node's structured logs for `database` or `akka-cluster` check failures.

## Related Documentation

- [Traefik Proxy design specification](../requirements/Component-TraefikProxy.md)
- [Host](./Host.md)
- [Cluster Infrastructure](./ClusterInfrastructure.md)
- [Central UI](./CentralUI.md)
- [Inbound API](./InboundAPI.md)
- [CLI](./CLI.md)