feat(infra): add Traefik load balancer with active node health check for central cluster failover
Add ActiveNodeHealthCheck that returns 200 only on the Akka.NET cluster leader, enabling Traefik to route traffic to the active central node and automatically fail over when the leader changes. Also fixes AkkaClusterHealthCheck to resolve ActorSystem from AkkaHostedService (was always null via DI).
@@ -35,7 +35,7 @@ This project contains design documentation for a distributed SCADA system built
 - Use `git diff` to review changes before committing.
 - Commit related changes together with a descriptive message summarizing the design decision.
 
-## Current Component List (19 components)
+## Current Component List (20 components)
 
 1. Template Engine — Template modeling, inheritance, composition, validation, flattening, diffs.
 2. Deployment Manager — Central-side deployment pipeline, system-wide artifact deployment, instance lifecycle.
@@ -55,7 +55,8 @@ This project contains design documentation for a distributed SCADA system built
 16. Commons — Shared types, POCO entity classes, repository interfaces, message contracts.
 17. Configuration Database — EF Core data access layer, repositories, unit-of-work, audit logging (IAuditService), migrations.
 18. Management Service — Akka.NET actor providing programmatic access to all admin operations, ClusterClientReceptionist registration.
-19. CLI — Command-line tool using ClusterClient to interact with Management Service, System.CommandLine, JSON/table output.
+19. CLI — Command-line tool using HTTP Management API, System.CommandLine, JSON/table output.
+20. Traefik Proxy — Reverse proxy/load balancer fronting central cluster, active node routing via `/health/active`, automatic failover.
 
 ## Key Design Decisions (for context across sessions)
 
@@ -152,7 +153,7 @@ This project contains design documentation for a distributed SCADA system built
 
 ### CLI Quick Reference (Docker / OrbStack)
 
-- **Management URL**: `http://localhost:9001` — the CLI connects to the Central Host's HTTP management API (port 5000 mapped to 9001 in Docker).
+- **Management URL**: `http://localhost:9000` — the CLI connects via the Traefik load balancer, which routes to the active central node. Direct access: central-a on port 9001, central-b on port 9002.
 - **Test user**: `--username multi-role --password password` — has Admin, Design, and Deployment roles. The `admin` user only has the Admin role and cannot create templates, data connections, or deploy.
 - **Config file**: `~/.scadalink/config.json` — stores `managementUrl` and default format. See `docker/README.md` for a ready-to-use test config.
 - **Rebuild cluster**: `bash docker/deploy.sh` — builds the `scadalink:latest` image and recreates all containers. Run this after code changes to ManagementActor, Host, or any server-side component.
Component-TraefikProxy.md (new file, 138 lines)
@@ -0,0 +1,138 @@
# Component: Traefik Proxy

## Purpose

The Traefik Proxy is a reverse proxy and load balancer that sits in front of the central cluster's two web servers. It provides a single stable URL for the CLI, browser, and external API consumers, automatically routing traffic to the active central node. When the active node fails over, Traefik detects the change via health checks and redirects traffic to the new active node without manual intervention.

## Location

Runs as a Docker container (`scadalink-traefik`) in the cluster compose stack (`docker/docker-compose.yml`). Not part of the application codebase — it is a third-party infrastructure component with static configuration files.

`docker/traefik/`

## Responsibilities

- Route all HTTP traffic (Central UI, Management API, Inbound API, health endpoints) to the active central node.
- Health-check both central nodes via `/health/active` to determine which is the active (cluster leader) node.
- Automatically fail over to the standby node when the active node goes down.
- Provide a dashboard for monitoring routing state and backend health.

## How It Works

### Active Node Detection

Traefik polls `/health/active` on both central nodes every 5 seconds. This endpoint returns:

- **HTTP 200** on the active node (the Akka.NET cluster leader).
- **HTTP 503** on the standby node (or if the node is unreachable).

Only the node returning 200 receives traffic. The health check is implemented by `ActiveNodeHealthCheck` in the Host project, which checks `Cluster.Get(system).State.Leader == SelfMember.Address`.
### Failover Sequence

1. Active node fails (crash, network partition, or graceful shutdown).
2. Akka.NET cluster detects the failure (~10s heartbeat timeout).
3. Split-brain resolver acts after the stable-after period (~15s).
4. Surviving node becomes cluster leader.
5. `ActiveNodeHealthCheck` on the surviving node starts returning 200.
6. Traefik's next health poll (within 5s) detects the change.
7. Traffic routes to the new active node.

**Total failover time**: ~25–30s (Akka failover ~25s + Traefik poll interval up to 5s).
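The budget quoted above follows directly from the individual timeouts; a quick arithmetic check, using the approximate values stated in this document:

```python
# Worst-case failover budget from the approximate timings above.
heartbeat_timeout = 10  # Akka.NET failure detection (~10s)
stable_after = 15       # split-brain resolver stable-after period (~15s)
traefik_poll = 5        # Traefik health-check interval (up to 5s)

akka_failover = heartbeat_timeout + stable_after  # ~25s
total_worst_case = akka_failover + traefik_poll   # ~30s

print(f"Akka failover: ~{akka_failover}s, total: up to ~{total_worst_case}s")
```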

### SignalR / Blazor Server Considerations

Blazor Server uses persistent SignalR connections (WebSocket circuits). During failover:

- Active SignalR circuits on the failed node are lost.
- The browser's SignalR reconnection logic attempts to reconnect.
- Traefik routes the reconnection to the new active node.
- The user's session survives because authentication uses cookie-embedded JWT with shared Data Protection keys across both central nodes.
- The user may see a brief "Reconnecting..." overlay before the circuit re-establishes.

## Configuration

### Static Config (`docker/traefik/traefik.yml`)

```yaml
entryPoints:
  web:
    address: ":80"

api:
  dashboard: true
  insecure: true

providers:
  file:
    filename: /etc/traefik/dynamic.yml
```

- **Entrypoint `web`**: Listens on port 80 (mapped to host port 9000).
- **Dashboard**: Enabled in insecure mode (no auth) for development. Accessible at `http://localhost:8180`.
- **File provider**: Loads routing rules from a static YAML file (no Docker socket required).

### Dynamic Config (`docker/traefik/dynamic.yml`)

```yaml
http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web

  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadalink-central-a:5000"
          - url: "http://scadalink-central-b:5000"
```

- **Router `central`**: Catches all requests and forwards to the `central` service.
- **Service `central`**: Load balancer with two backends (both central nodes) and a health check on `/health/active`.
- **Health check interval**: 5 seconds. A server failing the health check is removed from the pool within one interval.
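The pool behavior this produces can be modeled simply: only backends whose most recent `/health/active` probe returned 200 stay in rotation. A hypothetical sketch (the function name is ours; the backend URLs match the config above):

```python
# Model of health-gated load balancing: a backend stays in the pool
# only while its last /health/active probe returned 200.
def healthy_pool(probe_results: dict[str, int]) -> list[str]:
    """Return the backends a health-checking load balancer would keep."""
    return [url for url, status in probe_results.items() if status == 200]

# Normal operation: central-a is leader, so it alone receives traffic.
pool = healthy_pool({"http://scadalink-central-a:5000": 200,
                     "http://scadalink-central-b:5000": 503})
print(pool)

# After failover, central-b's probe flips to 200 within one poll interval
# and the pool switches to it.
```

Because exactly one node is ever the cluster leader, the "pool" normally contains a single backend; the second entry only takes over during failover.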
## Ports

| Host Port | Container Port | Purpose |
|-----------|----------------|---------|
| 9000 | 80 | Load-balanced entrypoint (Central UI, Management API, Inbound API) |
| 8180 | 8080 | Traefik dashboard |

## Health Endpoints

The central nodes expose two health endpoints:

| Endpoint | Purpose | Who Uses It |
|----------|---------|-------------|
| `/health/ready` | Readiness gate — 200 when database + Akka cluster are healthy | Kubernetes probes, monitoring |
| `/health/active` | Active node — 200 only on cluster leader | **Traefik** (routing decisions) |

## Dependencies

- **Central cluster nodes**: The two backends (`scadalink-central-a`, `scadalink-central-b`) on the `scadalink-net` Docker network.
- **ActiveNodeHealthCheck**: Health check implementation in `src/ScadaLink.Host/Health/ActiveNodeHealthCheck.cs` that determines cluster leader status.
- **Docker network**: All containers must be on the shared `scadalink-net` bridge network.

## Interactions

- **CLI**: Connects to `http://localhost:9000/management` — routed by Traefik to the active node.
- **Browser (Central UI)**: Connects to `http://localhost:9000` — Blazor Server + SignalR routed to the active node.
- **Inbound API consumers**: Connect to `http://localhost:9000/api/{methodName}` — routed to the active node.
- **Cluster Infrastructure**: The `ActiveNodeHealthCheck` relies on Akka.NET cluster gossip state to determine the leader.

## Production Considerations

The current configuration is for development/testing. In production:

- **TLS termination**: Add an HTTPS entrypoint with certificates (Let's Encrypt via Traefik's ACME provider, or static certs).
- **Dashboard auth**: Disable `insecure: true` and configure authentication on the dashboard.
- **WebSocket support**: Traefik supports WebSocket proxying natively — no additional config needed for SignalR.
- **Sticky sessions**: Not required. The Management API is stateless (Basic Auth per request). Blazor Server circuits are bound to a specific node via SignalR, but reconnection handles failover transparently.
@@ -52,7 +52,8 @@ This document serves as the master index for the SCADA system design. The system
 | 16 | Commons | [Component-Commons.md](Component-Commons.md) | Namespace/folder convention (Types/Interfaces/Entities/Messages), shared data types, POCOs, repository interfaces, message contracts with additive-only versioning, UTC timestamp convention. |
 | 17 | Configuration Database | [Component-ConfigurationDatabase.md](Component-ConfigurationDatabase.md) | EF Core data access, per-component repositories, unit-of-work, optimistic concurrency on deployment status, audit logging (IAuditService), migration management. |
 | 18 | Management Service | [Component-ManagementService.md](Component-ManagementService.md) | Akka.NET ManagementActor on central, ClusterClientReceptionist registration, programmatic access to all admin operations, CLI interface. |
-| 19 | CLI | [Component-CLI.md](Component-CLI.md) | Standalone command-line tool, System.CommandLine, Akka.NET ClusterClient transport, LDAP auth, JSON/table output, mirrors all Management Service operations. |
+| 19 | CLI | [Component-CLI.md](Component-CLI.md) | Standalone command-line tool, System.CommandLine, HTTP transport via Management API, JSON/table output, mirrors all Management Service operations. |
+| 20 | Traefik Proxy | [Component-TraefikProxy.md](Component-TraefikProxy.md) | Reverse proxy/load balancer fronting central cluster, active node routing via `/health/active`, automatic failover. |
 
 ### Reference Documentation
 
@@ -5,7 +5,12 @@ Local Docker deployment of the full ScadaLink cluster topology: a 2-node central
 ## Cluster Topology
 
 ```
-┌─────────────────────────────────────────────────────┐
+          ┌───────────────────┐
+          │ Traefik LB :9000  │ ◄── CLI / Browser
+          │ Dashboard :8180   │
+          └────────┬──────────┘
+                   │ routes to active node
+┌──────────────────────┼──────────────────────────────┐
 │ Central Cluster                                     │
 │                                                     │
 │  ┌─────────────────┐        ┌─────────────────┐     │
@@ -48,6 +53,7 @@ Each site cluster runs Site Runtime, Data Connection Layer, Store-and-Forward, a
 | Node | Container Name | Host Web Port | Host Akka Port | Internal Ports |
 |------|---------------|---------------|----------------|----------------|
+| Traefik LB | `scadalink-traefik` | 9000 | — | 80 (proxy), 8080 (dashboard) |
 | Central A | `scadalink-central-a` | 9001 | 9011 | 5000 (web), 8081 (Akka) |
 | Central B | `scadalink-central-b` | 9002 | 9012 | 5000 (web), 8081 (Akka) |
 | Site-A A | `scadalink-site-a-a` | — | 9021 | 8082 (Akka) |
@@ -185,22 +191,24 @@ curl -s http://localhost:9002/health/ready | python3 -m json.tool
 ### CLI Access
 
-The CLI connects to the Central Host's HTTP management API. With the Docker setup, the Central UI (and management API) is available at `http://localhost:9001`:
+The CLI connects to the Central Host's HTTP management API via the Traefik load balancer at `http://localhost:9000`, which routes to the active central node:
 
 ```bash
 dotnet run --project src/ScadaLink.CLI -- \
-  --url http://localhost:9001 \
+  --url http://localhost:9000 \
   --username multi-role --password password \
   template list
 ```
 
+Direct access to individual nodes is also available at `http://localhost:9001` (central-a) and `http://localhost:9002` (central-b).
+
 > **Note:** The `multi-role` test user has Admin, Design, and Deployment roles. The `admin` user only has the Admin role and cannot perform design or deployment operations. See `infra/glauth/config.toml` for all test users and their group memberships.
 
 A recommended `~/.scadalink/config.json` for the Docker test environment:
 
 ```json
 {
-  "managementUrl": "http://localhost:9001"
+  "managementUrl": "http://localhost:9000"
 }
 ```
 
@@ -18,10 +18,12 @@ docker compose -f "$SCRIPT_DIR/docker-compose.yml" ps
 echo ""
 echo "Access points:"
+echo "  Central (Traefik LB): http://localhost:9000"
 echo "  Central UI (node A):  http://localhost:9001"
 echo "  Central UI (node B):  http://localhost:9002"
 echo "  Health check:         http://localhost:9001/health/ready"
-echo "  CLI contact points:   akka.tcp://scadalink@localhost:9011"
-echo "                        akka.tcp://scadalink@localhost:9012"
+echo "  Active node check:    http://localhost:9001/health/active"
+echo "  Traefik dashboard:    http://localhost:8180"
+echo "  Management API:       http://localhost:9000/management"
 echo ""
 echo "Logs: docker compose -f $SCRIPT_DIR/docker-compose.yml logs -f"
@@ -123,6 +123,19 @@ services:
       - scadalink-net
     restart: unless-stopped
 
+  traefik:
+    image: traefik:v3.4
+    container_name: scadalink-traefik
+    ports:
+      - "9000:80"    # Central load-balanced entrypoint
+      - "8180:8080"  # Traefik dashboard
+    volumes:
+      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
+      - ./traefik/dynamic.yml:/etc/traefik/dynamic.yml:ro
+    networks:
+      - scadalink-net
+    restart: unless-stopped
+
 networks:
   scadalink-net:
     external: true
docker/traefik/dynamic.yml (new file, 18 lines)
@@ -0,0 +1,18 @@
http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web

  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadalink-central-a:5000"
          - url: "http://scadalink-central-b:5000"
docker/traefik/traefik.yml (new file, 11 lines)
@@ -0,0 +1,11 @@
entryPoints:
  web:
    address: ":80"

api:
  dashboard: true
  insecure: true

providers:
  file:
    filename: /etc/traefik/dynamic.yml
src/ScadaLink.Host/Health/ActiveNodeHealthCheck.cs (new file, 40 lines)
@@ -0,0 +1,40 @@
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using ScadaLink.Host.Actors;

namespace ScadaLink.Host.Health;

/// <summary>
/// Health check that returns healthy only if this node is the active (leader) node
/// in the Akka.NET cluster. Used by Traefik to route traffic to the active node.
/// </summary>
public class ActiveNodeHealthCheck : IHealthCheck
{
    private readonly AkkaHostedService _akkaService;

    public ActiveNodeHealthCheck(AkkaHostedService akkaService)
    {
        _akkaService = akkaService;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var system = _akkaService.ActorSystem;
        if (system == null)
            return Task.FromResult(HealthCheckResult.Unhealthy("ActorSystem not yet available."));

        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;

        if (self.Status != MemberStatus.Up)
            return Task.FromResult(HealthCheckResult.Unhealthy($"Node not Up (status: {self.Status})."));

        var leader = cluster.State.Leader;
        if (leader != null && leader == self.Address)
            return Task.FromResult(HealthCheckResult.Healthy("Active node (cluster leader)."));

        return Task.FromResult(HealthCheckResult.Unhealthy("Standby node (not cluster leader)."));
    }
}
@@ -1,6 +1,6 @@
-using Akka.Actor;
 using Akka.Cluster;
 using Microsoft.Extensions.Diagnostics.HealthChecks;
+using ScadaLink.Host.Actors;
 
 namespace ScadaLink.Host.Health;
 
@@ -10,21 +10,22 @@ namespace ScadaLink.Host.Health;
 /// </summary>
 public class AkkaClusterHealthCheck : IHealthCheck
 {
-    private readonly ActorSystem? _system;
+    private readonly AkkaHostedService _akkaService;
 
-    public AkkaClusterHealthCheck(ActorSystem? system = null)
+    public AkkaClusterHealthCheck(AkkaHostedService akkaService)
     {
-        _system = system;
+        _akkaService = akkaService;
     }
 
     public Task<HealthCheckResult> CheckHealthAsync(
         HealthCheckContext context,
         CancellationToken cancellationToken = default)
     {
-        if (_system == null)
+        var system = _akkaService.ActorSystem;
+        if (system == null)
             return Task.FromResult(HealthCheckResult.Degraded("ActorSystem not yet available."));
 
-        var cluster = Cluster.Get(_system);
+        var cluster = Cluster.Get(system);
         var status = cluster.SelfMember.Status;
 
         var result = status switch
@@ -87,7 +87,8 @@ try
     // WP-12: Health checks for readiness gating
     builder.Services.AddHealthChecks()
         .AddCheck<DatabaseHealthCheck>("database")
-        .AddCheck<AkkaClusterHealthCheck>("akka-cluster");
+        .AddCheck<AkkaClusterHealthCheck>("akka-cluster")
+        .AddCheck<ActiveNodeHealthCheck>("active-node");
 
     // WP-13: Akka.NET bootstrap via hosted service
     builder.Services.AddSingleton<AkkaHostedService>();
@@ -126,6 +127,13 @@ try
         ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
     });
 
+    // Active node endpoint — returns 200 only on the cluster leader; used by Traefik for routing
+    app.MapHealthChecks("/health/active", new HealthCheckOptions
+    {
+        Predicate = check => check.Name == "active-node",
+        ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
+    });
+
     app.MapStaticAssets();
     app.MapCentralUI<ScadaLink.Host.Components.App>();
     app.MapInboundAPI();
@@ -1,17 +1,18 @@
 # Test Infrastructure
 
-This document describes the local Docker-based test infrastructure for ScadaLink development. Five services provide the external dependencies needed to run and test the system locally.
+This document describes the local Docker-based test infrastructure for ScadaLink development. Seven services provide the external dependencies needed to run and test the system locally. The first six run in `infra/docker-compose.yml`; Traefik runs alongside the cluster nodes in `docker/docker-compose.yml`.
 
 ## Services
 
-| Service | Image | Port(s) | Config |
-|---------|-------|---------|--------|
-| OPC UA Server | `mcr.microsoft.com/iotedge/opc-plc:latest` | 50000 (OPC UA), 8080 (web) | `infra/opcua/nodes.json` |
-| LDAP Server | `glauth/glauth:latest` | 3893 | `infra/glauth/config.toml` |
-| MS SQL 2022 | `mcr.microsoft.com/mssql/server:2022-latest` | 1433 | `infra/mssql/setup.sql` |
-| SMTP (Mailpit) | `axllent/mailpit:latest` | 1025 (SMTP), 8025 (web) | Environment vars |
-| REST API (Flask) | Custom build (`infra/restapi/Dockerfile`) | 5200 | `infra/restapi/app.py` |
-| LmxFakeProxy | Custom build (`infra/lmxfakeproxy/Dockerfile`) | 50051 (gRPC) | Environment vars |
+| Service | Image | Port(s) | Config | Compose File |
+|---------|-------|---------|--------|--------------|
+| OPC UA Server | `mcr.microsoft.com/iotedge/opc-plc:latest` | 50000 (OPC UA), 8080 (web) | `infra/opcua/nodes.json` | `infra/` |
+| LDAP Server | `glauth/glauth:latest` | 3893 | `infra/glauth/config.toml` | `infra/` |
+| MS SQL 2022 | `mcr.microsoft.com/mssql/server:2022-latest` | 1433 | `infra/mssql/setup.sql` | `infra/` |
+| SMTP (Mailpit) | `axllent/mailpit:latest` | 1025 (SMTP), 8025 (web) | Environment vars | `infra/` |
+| REST API (Flask) | Custom build (`infra/restapi/Dockerfile`) | 5200 | `infra/restapi/app.py` | `infra/` |
+| LmxFakeProxy | Custom build (`infra/lmxfakeproxy/Dockerfile`) | 50051 (gRPC) | Environment vars | `infra/` |
+| Traefik LB | `traefik:v3.4` | 9000 (proxy), 8180 (dashboard) | `docker/traefik/` | `docker/` |
 
 ## Quick Start
 
@@ -42,6 +43,7 @@ Each service has a dedicated document with configuration details, verification s
 - [test_infra_smtp.md](test_infra_smtp.md) — SMTP test server (Mailpit)
 - [test_infra_restapi.md](test_infra_restapi.md) — REST API test server (Flask)
 - [test_infra_lmxfakeproxy.md](test_infra_lmxfakeproxy.md) — LmxProxy fake server (OPC UA bridge)
+- Traefik LB — see `docker/README.md` and `docker/traefik/` (runs with the cluster, not in `infra/`)
 
 ## Connection Strings
 
@@ -112,4 +114,8 @@ infra/
     lmxfakeproxy/   # .NET gRPC proxy bridging LmxProxy protocol to OPC UA
     tools/          # Python CLI tools (opcua, ldap, mssql, smtp, restapi)
     README.md       # Quick-start for the infra folder
+
+docker/
+    traefik/traefik.yml   # Traefik static config (entrypoints, file provider)
+    traefik/dynamic.yml   # Traefik dynamic config (load balancer, health check routing)
 ```
@@ -1,10 +1,11 @@
 using Microsoft.AspNetCore.Mvc.Testing;
 using Microsoft.Extensions.Configuration;
+using ScadaLink.Host.Health;
 
 namespace ScadaLink.Host.Tests;
 
 /// <summary>
-/// WP-12: Tests for /health/ready endpoint.
+/// WP-12: Tests for /health/ready and /health/active endpoints.
 /// </summary>
 public class HealthCheckTests : IDisposable
 {
@@ -63,4 +64,94 @@ public class HealthCheckTests : IDisposable
             Environment.SetEnvironmentVariable("DOTNET_ENVIRONMENT", previousEnv);
         }
     }
+
+    [Fact]
+    public async Task HealthActive_Endpoint_ReturnsResponse()
+    {
+        var previousEnv = Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT");
+        try
+        {
+            Environment.SetEnvironmentVariable("DOTNET_ENVIRONMENT", "Central");
+
+            var factory = new WebApplicationFactory<Program>()
+                .WithWebHostBuilder(builder =>
+                {
+                    builder.ConfigureAppConfiguration((context, config) =>
+                    {
+                        config.AddInMemoryCollection(new Dictionary<string, string?>
+                        {
+                            ["ScadaLink:Node:NodeHostname"] = "localhost",
+                            ["ScadaLink:Node:RemotingPort"] = "0",
+                            ["ScadaLink:Cluster:SeedNodes:0"] = "akka.tcp://scadalink@localhost:2551",
+                            ["ScadaLink:Cluster:SeedNodes:1"] = "akka.tcp://scadalink@localhost:2552",
+                            ["ScadaLink:Database:SkipMigrations"] = "true",
+                        });
+                    });
+                    builder.UseSetting("ScadaLink:Node:Role", "Central");
+                    builder.UseSetting("ScadaLink:Database:SkipMigrations", "true");
+                });
+            _disposables.Add(factory);
+
+            var client = factory.CreateClient();
+            _disposables.Add(client);
+
+            var response = await client.GetAsync("/health/active");
+
+            // In test mode, the ActorSystem may not be fully available,
+            // so the active-node check returns 503 (Unhealthy).
+            Assert.True(
+                response.StatusCode == System.Net.HttpStatusCode.OK ||
+                response.StatusCode == System.Net.HttpStatusCode.ServiceUnavailable,
+                $"Expected 200 or 503, got {(int)response.StatusCode}");
+        }
+        finally
+        {
+            Environment.SetEnvironmentVariable("DOTNET_ENVIRONMENT", previousEnv);
+        }
+    }
+
+    [Fact]
+    public async Task ActiveNodeHealthCheck_SystemNotStarted_ReturnsUnhealthy()
+    {
+        // AkkaHostedService before StartAsync has ActorSystem == null.
+        // The integration test (HealthActive_Endpoint_ReturnsResponse) validates the full
+        // endpoint wiring. This test validates the null-system path via WebApplicationFactory
+        // where the ActorSystem may not be available.
+        var previousEnv = Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT");
+        try
+        {
+            Environment.SetEnvironmentVariable("DOTNET_ENVIRONMENT", "Central");
+            var factory = new WebApplicationFactory<Program>()
+                .WithWebHostBuilder(builder =>
+                {
+                    builder.ConfigureAppConfiguration((context, config) =>
+                    {
+                        config.AddInMemoryCollection(new Dictionary<string, string?>
+                        {
+                            ["ScadaLink:Node:NodeHostname"] = "localhost",
+                            ["ScadaLink:Node:RemotingPort"] = "0",
+                            ["ScadaLink:Cluster:SeedNodes:0"] = "akka.tcp://scadalink@localhost:2551",
+                            ["ScadaLink:Database:SkipMigrations"] = "true",
+                        });
+                    });
+                    builder.UseSetting("ScadaLink:Node:Role", "Central");
+                    builder.UseSetting("ScadaLink:Database:SkipMigrations", "true");
+                });
+            _disposables.Add(factory);
+
+            var client = factory.CreateClient();
+            _disposables.Add(client);
+
+            var response = await client.GetAsync("/health/active");
+            var body = await response.Content.ReadAsStringAsync();
+
+            // Active-node check returns 503 when ActorSystem is not yet available or not leader
+            Assert.Equal(System.Net.HttpStatusCode.ServiceUnavailable, response.StatusCode);
+            Assert.Contains("active-node", body);
+        }
+        finally
+        {
+            Environment.SetEnvironmentVariable("DOTNET_ENVIRONMENT", previousEnv);
+        }
+    }
 }