docs(components): reference docs batch 4/4 — ManagementService, CLI, Transport, CentralUI, TraefikProxy, TreeView

2026-06-03 15:57:32 -04:00
parent c1c8e35687
commit d14fc3f68f
6 changed files with 1352 additions and 0 deletions
@@ -0,0 +1,192 @@
+# Traefik Proxy
+
+The Traefik Proxy is the reverse proxy and load balancer that fronts the central cluster's two web servers. It exposes a single stable entrypoint for all central traffic — Central UI, Management API, Inbound API — and routes exclusively to whichever central node is currently the Akka.NET cluster leader, using a health-check on each node's `/health/active` endpoint to make that determination. When the active node changes, Traefik detects the change on its next poll cycle and redirects traffic automatically, with no operator intervention.
+
+## Overview
+
+The proxy runs as the `scadabridge-traefik` Docker container in the main compose stack (`docker/docker-compose.yml`). It is a third-party infrastructure component (Traefik v3.4) — there is no C# project for it. Its entire configuration is two YAML files mounted read-only into the container:
+
+- `docker/traefik/traefik.yml` — static config: entrypoints, API dashboard, and file provider declaration.
+- `docker/traefik/dynamic.yml` — routing rules: the router that catches all traffic, the `central` load-balancer service listing both backend nodes, and the `/health/active` health-check settings.
+
+The proxy sits on the `scadabridge-net` Docker bridge network alongside both central nodes (`scadabridge-central-a`, `scadabridge-central-b`) and all site containers, so it can reach the central backends by container name.
+
+## Key Concepts
+
+### Active-node routing via `/health/active`
+
+Traefik has no direct knowledge of which central node is the Akka.NET cluster leader; it infers this by polling `/health/active` on both backends. The Host registers `ActiveNodeHealthCheck` under the `Active` health tag, and `app.MapZbHealth()` serves it at `/health/active`. The check returns HTTP 200 on the leader and HTTP 503 everywhere else: on the standby, on a node whose actor system has not started, and on a node whose cluster member has not yet reached `MemberStatus.Up`. The leadership test is:
+
+```csharp
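+public bool IsActiveNode
+{
+    get
+    {
+        // No actor system yet (Host still starting): not active.
+        var system = _akkaService.ActorSystem;
+        if (system == null)
+            return false;
+
+        // A member that has not reached Up cannot be the active node.
+        var cluster = Cluster.Get(system);
+        var self = cluster.SelfMember;
+        if (self.Status != MemberStatus.Up)
+            return false;
+
+        // Active only when the cluster's current leader is this node.
+        var leader = cluster.State.Leader;
+        return leader != null && leader == self.Address;
+    }
+}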
+```
+
+The identical leadership check backs `ActiveNodeGate` — the `IActiveNodeGate` implementation the Inbound API endpoint filter consults before executing method scripts. Both surfaces agree on which node is active because they share the same Akka cluster state.
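
The 200/503 contract described above can be checked from a shell. The following is a minimal sketch, assuming `curl` is installed; the function names `role_from_status` and `probe_role` are illustrative and do not exist in the repository.

```bash
# Map the HTTP status returned by /health/active to a node role.
# 200 = active (leader), 503 = standby or not yet Up, anything else = unknown.
role_from_status() {
  case "$1" in
    200) echo "active" ;;
    503) echo "standby" ;;
    *)   echo "unknown" ;;
  esac
}

# Probe one base URL and print its role (hypothetical helper).
# curl prints 000 when the connection fails, which maps to "unknown".
probe_role() {
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$1/health/active" || true)
  role_from_status "$status"
}
```

For example, `probe_role http://localhost:9001` and `probe_role http://localhost:9002` query the two central nodes directly, so in a healthy cluster one should print `active` and the other `standby`.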
+
+### Automatic failover
+
+When the active central node goes down, the Akka cluster's keep-oldest split-brain resolver promotes the surviving node to leader. This takes roughly 25 seconds: a 10-second heartbeat threshold plus a 15-second stable-after period. Traefik drops the failed backend from the pool once its health checks stop succeeding, and adds the survivor on the first poll after its `ActiveNodeHealthCheck` starts returning 200, which is at most one 5-second interval later. From then on all requests go to the new active node. No config change or restart is required on the Traefik side.
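
The timings quoted above add up to a rough upper bound on how long clients may see errors during a failover. The sketch below only does that arithmetic. The variable and function names are illustrative, and the result is an estimate rather than a guarantee, since it ignores the 3-second health-check timeout and any container startup delay.

```bash
# Timings quoted in this section, in seconds.
HEARTBEAT_THRESHOLD=10   # Akka heartbeat threshold
STABLE_AFTER=15          # split-brain resolver stable-after period
POLL_INTERVAL=5          # Traefik health-check interval

# Upper-bound estimate: cluster failover plus one full Traefik poll.
worst_case_failover_seconds() {
  echo $(( HEARTBEAT_THRESHOLD + STABLE_AFTER + POLL_INTERVAL ))
}
```

Calling `worst_case_failover_seconds` prints `30`, which is a reasonable starting point for client retry budgets and alert thresholds.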
+
+## Architecture
+
+### Docker topology
+
+```
+  Clients (CLI, browser, external API)
+          │
+   host:9000 (HTTP)
+          │
+  ┌───────▼──────────────────┐
+  │  scadabridge-traefik      │  (Traefik v3.4 container)
+  │  entrypoint :80           │
+  └──────┬──────────┬─────────┘
+         │  /health/active poll (5s)
+         ▼          ▼
+  scadabridge-    scadabridge-
+  central-a:5000  central-b:5000
+  (ACTIVE → 200)  (STANDBY → 503)
+```
+
+Clients always connect to `http://localhost:9000`. The two central nodes are also reachable directly — `central-a` on host port 9001, `central-b` on host port 9002 — but these bypass the load balancer and should be used only for direct debugging. The Traefik dashboard is accessible at `http://localhost:8180`.
+
+### Request flow
+
+Every incoming request on the `web` entrypoint hits the `central` router, which matches all paths (`PathPrefix("/")`) and forwards to the `central` load-balancer service. The load balancer only includes servers that are currently passing the health check, so in normal operation all traffic goes to the single healthy (active) backend.
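
To make the pool-filtering behaviour concrete, here is a small simulation of it in shell. It is a sketch of the idea only, not Traefik's implementation: `healthy_backends` is a hypothetical helper that takes `name=status` pairs (the status being the code last returned by `/health/active`) and prints the backends that would remain in the pool. It follows Traefik's default rule of treating 2xx and 3xx responses as healthy.

```bash
# Print the backends that stay in the load-balancer pool.
# Each argument is "name=status"; 2xx and 3xx statuses count as healthy.
healthy_backends() {
  for entry in "$@"; do
    name=${entry%%=*}     # text before the first "="
    status=${entry#*=}    # text after the first "="
    case "$status" in
      2??|3??) echo "$name" ;;
    esac
  done
}
```

With `healthy_backends central-a=200 central-b=503` the output is `central-a` alone, which is why all traffic lands on the active node. When both backends report 503 the output is empty, which corresponds to a pool with no routable server.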
+
+## Usage
+
+Traefik starts automatically with the cluster compose stack:
+
+```bash
+# Start full cluster (includes Traefik)
+docker compose -f docker/docker-compose.yml up -d
+
+# Check Traefik dashboard (shows backend health status)
+open http://localhost:8180
+
+# Verify routing — reaches the active node
+curl http://localhost:9000/health/active
+
+# Direct node access (bypasses Traefik — use for debugging only)
+curl http://localhost:9001/health/active   # central-a
+curl http://localhost:9002/health/active   # central-b
+```
+
+The Traefik container's `restart: unless-stopped` policy brings it back automatically after a crash or a restart of the Docker daemon or host, unless the container was explicitly stopped beforehand.
+
+## Configuration
+
+### Static config (`docker/traefik/traefik.yml`)
+
+```yaml
+entryPoints:
+  web:
+    address: ":80"
+
+api:
+  dashboard: true
+  insecure: true
+
+providers:
+  file:
+    filename: /etc/traefik/dynamic.yml
+```
+
+| Key | Value | Effect |
+|-----|-------|--------|
+| `entryPoints.web.address` | `:80` | Listens on container port 80, mapped to host port 9000. |
+| `api.dashboard` | `true` | Enables the Traefik web dashboard. |
+| `api.insecure` | `true` | Serves the dashboard on port 8080 without auth (development only). |
+| `providers.file.filename` | `/etc/traefik/dynamic.yml` | Loads routing rules from the mounted dynamic config; no Docker socket required. |
+
+### Dynamic config (`docker/traefik/dynamic.yml`)
+
+```yaml
+http:
+  routers:
+    central:
+      rule: "PathPrefix(`/`)"
+      service: central
+      entryPoints:
+        - web
+
+  services:
+    central:
+      loadBalancer:
+        healthCheck:
+          path: /health/active
+          interval: 5s
+          timeout: 3s
+        servers:
+          - url: "http://scadabridge-central-a:5000"
+          - url: "http://scadabridge-central-b:5000"
+```
+
+| Setting | Value | Effect |
+|---------|-------|--------|
+| `routers.central.rule` | `PathPrefix("/")` | Catches every request on the `web` entrypoint. |
+| `services.central.loadBalancer.healthCheck.path` | `/health/active` | The endpoint Traefik polls on each backend. |
+| `services.central.loadBalancer.healthCheck.interval` | `5s` | Poll cadence; a backend failing the check is removed within one interval. |
+| `services.central.loadBalancer.healthCheck.timeout` | `3s` | Per-poll timeout; a non-responding backend counts as unhealthy. |
+| `servers[0].url` | `http://scadabridge-central-a:5000` | `central-a` backend, reachable by container name on `scadabridge-net`. |
+| `servers[1].url` | `http://scadabridge-central-b:5000` | `central-b` backend, reachable by container name on `scadabridge-net`. |
+
+### Port mapping
+
+| Host port | Container port | Purpose |
+|-----------|---------------|---------|
+| `9000` | `80` | Load-balanced entrypoint — all central traffic (Central UI, Management API, Inbound API). |
+| `8180` | `8080` | Traefik dashboard. |
+| `9001` | `5000` | Direct access to `central-a` (bypasses Traefik). |
+| `9002` | `5000` | Direct access to `central-b` (bypasses Traefik). |
+
+## Dependencies & Interactions
+
+- [Host (#15)](./Host.md) — implements and serves `/health/active` via `ActiveNodeHealthCheck` (tagged `Active`, mounted by `app.MapZbHealth()`). Also implements `ActiveNodeGate`, which enforces the same active-node contract at the Inbound API filter level, providing a defence-in-depth layer if traffic reaches the standby directly.
+- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — the underlying Akka.NET cluster determines which node is the leader. Traefik's routing decision is derived entirely from cluster leadership state via the health-check poll; Traefik has no Akka dependency of its own.
+- [Central UI (#9)](./CentralUI.md) — Blazor Server (SignalR/WebSocket circuits) is proxied through Traefik. Traefik proxies WebSocket connections natively with no additional config. On failover, active SignalR circuits on the failed node are lost; the browser's reconnection logic re-establishes the circuit on the new active node. Session continuity is preserved because authentication uses a cookie-embedded JWT with Data Protection keys shared across both central nodes.
+- [Inbound API (#14)](./InboundAPI.md) — external API consumers target `http://localhost:9000/api/{methodName}`. Traefik routes each request to the active node; if a request reaches the standby directly (bypassing Traefik), `ActiveNodeGate` responds with HTTP 503.
+- [CLI (#19)](./CLI.md) — the CLI connects to the Management API via `http://localhost:9000` (the Traefik entrypoint) by default, so it always reaches the active central node without needing to know which node is active.
+
+## Troubleshooting
+
+### Both backends show unhealthy on the dashboard
+
+If both `central-a` and `central-b` appear red on the Traefik dashboard, neither node's `ActiveNodeHealthCheck` is returning 200. Common causes:
+
+1. **Akka cluster has not formed yet**: both nodes are still starting. Wait for the cluster to stabilise (typically 10–15 seconds after both containers are up), then check the central node logs for `Cluster is now ready`.
+2. **Split-brain resolver has downed both nodes**: this can happen after a network partition between the two central nodes. Restart the cluster via `bash docker/deploy.sh`.
+3. **Traefik cannot reach the backends**: the `scadabridge-net` Docker network may not exist. Create it with `docker network create scadabridge-net`.
+
+### Traffic reaches a standby node
+
+If a client receives HTTP 503 with `X-ScadaBridge-Active: false`, the request reached a standby node — either because Traefik has not yet completed its health-check poll after a failover (up to 5 seconds), or because the client is connecting directly to port 9001/9002 instead of port 9000. Use `http://localhost:9000` for all normal access. The 503 is transient during the Traefik poll window; the client should retry.
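
A client-side retry for that transient 503 might look like the sketch below. It assumes `curl` is installed; `fetch_status` and `retry_while_standby` are hypothetical names, and the defaults (3 attempts, 5 seconds apart, matching the poll interval) are illustrative rather than project settings.

```bash
# Print the HTTP status for a URL (000 if the connection fails).
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$1" || true
}

# Retry while the response is 503.
# Usage: retry_while_standby URL [MAX_TRIES] [DELAY_SECONDS]
# Prints the last status seen; returns 0 unless every attempt was a 503.
retry_while_standby() {
  url=$1
  max_tries=${2:-3}
  delay=${3:-5}
  try=1
  while [ "$try" -le "$max_tries" ]; do
    status=$(fetch_status "$url")
    if [ "$status" != "503" ]; then
      echo "$status"
      return 0
    fi
    # Wait before the next attempt, but not after the last one.
    if [ "$try" -lt "$max_tries" ]; then
      sleep "$delay"
    fi
    try=$(( try + 1 ))
  done
  echo "$status"
  return 1
}
```

For example, `retry_while_standby http://localhost:9000/health/active` gives Traefik up to two further poll intervals to pick up the new active node before the caller gives up.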
+
+### Health check succeeds but `/health/ready` returns degraded
+
+`/health/active` and `/health/ready` are independent. A node can pass the active check (it is the leader) but fail the readiness check (database or Akka cluster health probe failed). Traefik only uses `/health/active`; readiness gating is for orchestration and monitoring. Check the node's structured logs for `database` or `akka-cluster` check failures.
+
+## Related Documentation
+
+- [Traefik Proxy design specification](../requirements/Component-TraefikProxy.md)
+- [Host](./Host.md)
+- [Cluster Infrastructure](./ClusterInfrastructure.md)
+- [Central UI](./CentralUI.md)
+- [Inbound API](./InboundAPI.md)
+- [CLI](./CLI.md)