docs(components): reference docs batch 4/4 — ManagementService, CLI, Transport, CentralUI, TraefikProxy, TreeView

Joseph Doherty
2026-06-03 15:57:32 -04:00
parent c1c8e35687
commit d14fc3f68f
6 changed files with 1352 additions and 0 deletions
@@ -0,0 +1,192 @@
# Traefik Proxy
The Traefik Proxy is the reverse proxy and load balancer that fronts the central cluster's two web servers. It exposes a single stable entrypoint for all central traffic — Central UI, Management API, Inbound API — and routes exclusively to whichever central node is currently the Akka.NET cluster leader, using a health-check on each node's `/health/active` endpoint to make that determination. When the active node changes, Traefik detects the change on its next poll cycle and redirects traffic automatically, with no operator intervention.
## Overview
The proxy runs as the `scadabridge-traefik` Docker container in the main compose stack (`docker/docker-compose.yml`). It is a third-party infrastructure component (Traefik v3.4) — there is no C# project for it. Its entire configuration is two YAML files mounted read-only into the container:
- `docker/traefik/traefik.yml` — static config: entrypoints, API dashboard, and file provider declaration.
- `docker/traefik/dynamic.yml` — routing rules: the router that catches all traffic, the `central` load-balancer service listing both backend nodes, and the `/health/active` health-check settings.
The proxy sits on the `scadabridge-net` Docker bridge network alongside both central nodes (`scadabridge-central-a`, `scadabridge-central-b`) and all site containers, so it can reach the central backends by container name.
## Key Concepts
### Active-node routing via `/health/active`
Traefik does not know which central node is the Akka.NET cluster leader — it discovers this by polling `/health/active` on both backends. The Host registers `ActiveNodeHealthCheck` under the `Active` health tag; `app.MapZbHealth()` serves it at `/health/active`. The check returns HTTP 200 on the leader and HTTP 503 on the standby (or when the actor system has not yet reached `MemberStatus.Up`):
```csharp
public bool IsActiveNode
{
    get
    {
        var system = _akkaService.ActorSystem;
        if (system == null)
            return false;
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up)
            return false;
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
```
The identical leadership check backs `ActiveNodeGate` — the `IActiveNodeGate` implementation the Inbound API endpoint filter consults before executing method scripts. Both surfaces agree on which node is active because they share the same Akka cluster state.
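As a rough illustration of how a poller might interpret these responses, the sketch below maps the HTTP status returned by `/health/active` to a role. The helper names `classify_status` and `probe_node` are hypothetical and not part of the repository, and the 3-second `curl` timeout simply mirrors the health-check timeout described in this document.

```bash
# Hypothetical helper: interpret the status code returned by /health/active.
classify_status() {
  case "$1" in
    200) echo "active" ;;       # leader, MemberStatus.Up
    503) echo "standby" ;;      # standby, or actor system not yet Up
    *)   echo "unreachable" ;;  # curl reports 000 when the connection fails
  esac
}

# Hypothetical helper: probe one node by base URL and print its role.
probe_node() {
  code=$(curl -s -o /dev/null -m 3 -w '%{http_code}' "$1/health/active")
  classify_status "$code"
}
```

For example, `probe_node http://localhost:9001` would query `central-a` directly through its debugging port.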
### Automatic failover
When the active central node goes down, the Akka cluster's keep-oldest split-brain resolver promotes the surviving node to leader (roughly 25 seconds: a 10-second heartbeat threshold plus a 15-second stable-after period). Traefik drops the failed backend from the pool as soon as its health check fails or times out. Once the surviving node's `ActiveNodeHealthCheck` starts returning 200, Traefik's next poll (at most one 5-second interval later) marks it healthy and routes all subsequent requests to it. No config change or restart is required on the Traefik side.
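The timing figures quoted above can be added up to estimate how long clients may see errors after the active node fails. This is only back-of-the-envelope arithmetic under the assumption that the three delays are sequential; the variable names are illustrative.

```bash
# Figures taken from the text above; all values in seconds.
heartbeat_threshold=10
stable_after=15
poll_interval=5

# Time for the surviving node to become leader.
leader_promotion=$((heartbeat_threshold + stable_after))
# Add at most one Traefik poll interval before traffic is routed again.
worst_case=$((leader_promotion + poll_interval))

echo "leader promotion: ~${leader_promotion}s, worst-case routing recovery: ~${worst_case}s"
```

Under these assumptions, the outage window is roughly 25 to 30 seconds.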
## Architecture
### Docker topology
```
Clients (CLI, browser, external API)
        │
        │ host:9000 (HTTP)
┌───────▼──────────────────┐
│  scadabridge-traefik     │  (Traefik v3.4 container)
│  entrypoint :80          │
└──────┬──────────┬────────┘
       │          │  /health/active poll (5s)
       ▼          ▼
scadabridge-    scadabridge-
central-a:5000  central-b:5000
(ACTIVE → 200)  (STANDBY → 503)
```
Clients always connect to `http://localhost:9000`. The two central nodes are also reachable directly — `central-a` on host port 9001, `central-b` on host port 9002 — but these bypass the load balancer and should be used only for direct debugging. The Traefik dashboard is accessible at `http://localhost:8180`.
### Request flow
Every incoming request on the `web` entrypoint hits the `central` router, which matches all paths (`PathPrefix("/")`) and forwards to the `central` load-balancer service. The load balancer only includes servers that are currently passing the health check, so in normal operation all traffic goes to the single healthy (active) backend.
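The selection step can be pictured with a small sketch. It is not Traefik's implementation, only a model of the rule that servers failing the health check are excluded; `healthy_servers` is a hypothetical helper, and the `url=status` input format is an assumption made for the example.

```bash
# Hypothetical sketch: keep only servers whose last health status was 200.
# Input: arguments of the form "url=status"; output: one eligible URL per line.
healthy_servers() {
  for entry in "$@"; do
    # ${entry##*=} is the status after the "=", ${entry%=*} is the URL before it.
    if [ "${entry##*=}" = "200" ]; then
      echo "${entry%=*}"
    fi
  done
}

healthy_servers \
  "http://scadabridge-central-a:5000=200" \
  "http://scadabridge-central-b:5000=503"
```

With `central-a` active and `central-b` on standby, only `http://scadabridge-central-a:5000` is printed, so every request is forwarded there.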
## Usage
Traefik starts automatically with the cluster compose stack:
```bash
# Start full cluster (includes Traefik)
docker compose -f docker/docker-compose.yml up -d
# Check Traefik dashboard (shows backend health status)
# `open` is the macOS launcher; on Linux use xdg-open or paste the URL into a browser
open http://localhost:8180
# Verify routing — reaches the active node
curl http://localhost:9000/health/active
# Direct node access (bypasses Traefik — use for debugging only)
curl http://localhost:9001/health/active # central-a
curl http://localhost:9002/health/active # central-b
```
The Traefik container's `restart: unless-stopped` policy means it recovers automatically after a Docker host restart.
## Configuration
### Static config (`docker/traefik/traefik.yml`)
```yaml
entryPoints:
  web:
    address: ":80"
api:
  dashboard: true
  insecure: true
providers:
  file:
    filename: /etc/traefik/dynamic.yml
```
| Key | Value | Effect |
|-----|-------|--------|
| `entryPoints.web.address` | `:80` | Listens on container port 80, mapped to host port 9000. |
| `api.dashboard` | `true` | Enables the Traefik web dashboard. |
| `api.insecure` | `true` | Serves the dashboard on port 8080 without auth (development only). |
| `providers.file.filename` | `/etc/traefik/dynamic.yml` | Loads routing rules from the mounted dynamic config; no Docker socket required. |
### Dynamic config (`docker/traefik/dynamic.yml`)
```yaml
http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web
  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadabridge-central-a:5000"
          - url: "http://scadabridge-central-b:5000"
```
| Setting | Value | Effect |
|---------|-------|--------|
| `routers.central.rule` | `PathPrefix("/")` | Catches every request on the `web` entrypoint. |
| `services.central.loadBalancer.healthCheck.path` | `/health/active` | The endpoint Traefik polls on each backend. |
| `services.central.loadBalancer.healthCheck.interval` | `5s` | Poll cadence; a backend failing the check is removed within one interval. |
| `services.central.loadBalancer.healthCheck.timeout` | `3s` | Per-poll timeout; a non-responding backend counts as unhealthy. |
| `servers[0].url` | `http://scadabridge-central-a:5000` | `central-a` backend, reachable by container name on `scadabridge-net`. |
| `servers[1].url` | `http://scadabridge-central-b:5000` | `central-b` backend, reachable by container name on `scadabridge-net`. |
### Port mapping
| Host port | Container port | Purpose |
|-----------|---------------|---------|
| `9000` | `80` | Load-balanced entrypoint — all central traffic (Central UI, Management API, Inbound API). |
| `8180` | `8080` | Traefik dashboard. |
| `9001` | `5000` | Direct access to `central-a` (bypasses Traefik). |
| `9002` | `5000` | Direct access to `central-b` (bypasses Traefik). |
## Dependencies & Interactions
- [Host (#15)](./Host.md) — implements and serves `/health/active` via `ActiveNodeHealthCheck` (tagged `Active`, mounted by `app.MapZbHealth()`). Also implements `ActiveNodeGate`, which enforces the same active-node contract at the Inbound API filter level, providing a defence-in-depth layer if traffic reaches the standby directly.
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — the underlying Akka.NET cluster determines which node is the leader. Traefik's routing decision is derived entirely from cluster leadership state via the health-check poll; Traefik has no Akka dependency of its own.
- [Central UI (#9)](./CentralUI.md) — Blazor Server (SignalR/WebSocket circuits) is proxied through Traefik. Traefik proxies WebSocket connections natively with no additional config. On failover, active SignalR circuits on the failed node are lost; the browser's reconnection logic re-establishes the circuit on the new active node. Session continuity is preserved because authentication uses a cookie-embedded JWT with Data Protection keys shared across both central nodes.
- [Inbound API (#14)](./InboundAPI.md) — external API consumers target `http://localhost:9000/api/{methodName}`. Traefik routes each request to the active node; if a request reaches the standby directly (bypassing Traefik), `ActiveNodeGate` responds with HTTP 503.
- [CLI (#19)](./CLI.md) — the CLI connects to the Management API via `http://localhost:9000` (the Traefik entrypoint) by default, so it always reaches the active central node without needing to know which node is active.
## Troubleshooting
### Both backends show unhealthy on the dashboard
If both `central-a` and `central-b` appear red on the Traefik dashboard, neither node's `ActiveNodeHealthCheck` is returning 200. Common causes:
1. **Akka cluster has not formed yet** — both nodes are still starting. Wait for the cluster to stabilise (typically 10 to 15 seconds after both containers are up). Check the central node logs for `Cluster is now ready`.
2. **Split-brain resolver has downed both nodes** — a network partition followed by a split-brain condition. Restart the cluster via `bash docker/deploy.sh`.
3. **Traefik cannot reach the backends** — the `scadabridge-net` Docker network may not exist. Create it: `docker network create scadabridge-net`.
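The three causes above can be checked in one pass. The following is a minimal sketch using standard Docker CLI commands; the `diagnose` function name is hypothetical, and it assumes the container and network names used elsewhere in this document.

```bash
# Hypothetical helper wrapping the checks above; run it by calling `diagnose`.
diagnose() {
  # Cause 3: does the Docker network exist?
  docker network inspect scadabridge-net >/dev/null 2>&1 \
    || echo "scadabridge-net is missing"

  # Causes 1 and 2: are both central containers running?
  docker ps --filter "name=scadabridge-central" --format '{{.Names}}: {{.Status}}'

  # Cause 1: has each node logged that the cluster formed?
  for node in scadabridge-central-a scadabridge-central-b; do
    count=$(docker logs "$node" 2>&1 | grep -c "Cluster is now ready")
    echo "$node: $count matching log lines"
  done
}
```

A count of zero for both nodes suggests the cluster has not formed yet; a missing network points to the third cause.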
### Traffic reaches a standby node
If a client receives HTTP 503 with `X-ScadaBridge-Active: false`, the request reached a standby node — either because Traefik has not yet completed its health-check poll after a failover (up to 5 seconds), or because the client is connecting directly to port 9001/9002 instead of port 9000. Use `http://localhost:9000` for all normal access. The 503 is transient during the Traefik poll window; the client should retry.
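A client-side retry that outlasts the poll window might look like the sketch below. The `retry` helper is hypothetical; the attempt count and delay are illustrative values chosen so that the total wait exceeds the 5-second health-check interval.

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay-seconds> <command...>
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `retry 4 2 curl -fsS -o /dev/null http://localhost:9000/health/active` tries four times with 2 seconds between attempts; `curl -f` exits non-zero on the 503, which triggers the next attempt.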
### Health check succeeds but `/health/ready` returns degraded
`/health/active` and `/health/ready` are independent. A node can pass the active check (it is the leader) but fail the readiness check (database or Akka cluster health probe failed). Traefik only uses `/health/active`; readiness gating is for orchestration and monitoring. Check the node's structured logs for `database` or `akka-cluster` check failures.
## Related Documentation
- [Traefik Proxy design specification](../requirements/Component-TraefikProxy.md)
- [Host](./Host.md)
- [Cluster Infrastructure](./ClusterInfrastructure.md)
- [Central UI](./CentralUI.md)
- [Inbound API](./InboundAPI.md)
- [CLI](./CLI.md)