ScadaBridge/docs/deployment/topology-guide.md

# ScadaBridge Cluster Topology Guide

## Architecture Overview

ScadaBridge uses a hub-and-spoke architecture:
- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'textColor':'#111111','lineColor':'#555555','edgeLabelBackground':'#ffffff','fontSize':'15px'}}}%%
flowchart TD
    USERS["Users<br/>(HTTPS / LB)"]

    subgraph CENTRAL["Central Cluster"]
        NA["Node A<br/>Active"]
        NB["Node B<br/>Standby"]
        NA <--> NB
    end

    USERS --> NA
    CENTRAL --> SITE01
    CENTRAL --> SITE02
    CENTRAL --> SITE03
    CENTRAL --> SITEN

    subgraph SITE01["Site 01"]
        S01A["A<br/>Active"]
        S01B["B<br/>Standby"]
    end
    subgraph SITE02["Site 02"]
        S02A["A<br/>Active"]
        S02B["B<br/>Standby"]
    end
    subgraph SITE03["Site 03"]
        S03A["A<br/>Active"]
        S03B["B<br/>Standby"]
    end
    subgraph SITEN["Site N"]
        SNA["A<br/>Active"]
        SNB["B<br/>Standby"]
    end

    classDef start fill:#d5e8d4,stroke:#82b366,color:#111111;
    classDef proc fill:#dae8fc,stroke:#6c8ebf,color:#111111;
    classDef dec fill:#fff2cc,stroke:#d6b656,color:#111111;
    classDef warn fill:#ffe6cc,stroke:#d79b00,color:#111111;
    classDef muted fill:#f5f5f5,stroke:#999999,color:#666666;
    class USERS dec
    class CENTRAL proc
    class NA,S01A,S02A,S03A,SNA start
    class NB,S01B,S02B,S03B,SNB muted
    class SITE01,SITE02,SITE03,SITEN warn
```

## Central Cluster Setup

### Cluster Configuration

Each central node's configuration must list both central nodes as seed nodes:

**Node A** (`central-01.example.com`):
```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@central-01.example.com:8081",
        "akka.tcp://scadabridge@central-02.example.com:8081"
      ]
    }
  }
}
```

**Node B** (`central-02.example.com`):
```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@central-01.example.com:8081",
        "akka.tcp://scadabridge@central-02.example.com:8081"
      ]
    }
  }
}
```
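
Because the two files are edited by hand, the seed lists can drift apart. The following is a minimal Python sketch of a pre-deployment consistency check, assuming the JSON layout shown above. The helpers `seed_address`, `check_seed_nodes`, and `central` are hypothetical and are not part of ScadaBridge.

```python
def seed_address(node: dict, system: str = "scadabridge") -> str:
    """Build the akka.tcp address implied by a node's hostname and port."""
    return f"akka.tcp://{system}@{node['NodeHostname']}:{node['RemotingPort']}"


def check_seed_nodes(configs: list) -> list:
    """Return a list of problems; an empty list means the pair is consistent."""
    sections = [c["ScadaBridge"] for c in configs]
    expected = [seed_address(s["Node"]) for s in sections]
    first_host = sections[0]["Node"]["NodeHostname"]
    reference = sections[0]["Cluster"]["SeedNodes"]
    problems = []
    for s in sections:
        host = s["Node"]["NodeHostname"]
        seeds = s["Cluster"]["SeedNodes"]
        # Every node must list its own address and its peer's address.
        for addr in expected:
            if addr not in seeds:
                problems.append(f"{host}: missing seed {addr}")
        # The examples above use identical lists, so flag any difference.
        if seeds != reference:
            problems.append(f"{host}: seed list differs from {first_host}")
    return problems


def central(hostname: str, seeds: list) -> dict:
    """Hypothetical helper that mirrors the central node JSON shown above."""
    return {
        "ScadaBridge": {
            "Node": {"Role": "Central", "NodeHostname": hostname, "RemotingPort": 8081},
            "Cluster": {"SeedNodes": list(seeds)},
        }
    }


SEEDS = [
    "akka.tcp://scadabridge@central-01.example.com:8081",
    "akka.tcp://scadabridge@central-02.example.com:8081",
]
node_a = central("central-01.example.com", SEEDS)
node_b = central("central-02.example.com", SEEDS)
print(check_seed_nodes([node_a, node_b]))  # prints []
```

In practice the dictionaries would come from `json.load` on each node's configuration file rather than from the `central` helper.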

### Cluster Behavior

- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1`, so a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration (roughly the 10-second failure-detection threshold plus the 15-second stable-after window).
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.
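
The timing figures in this list can be combined into a rough failover budget. The sketch below is plain arithmetic in Python, not ScadaBridge code. It assumes that the 10-second threshold is the time a node may stay silent before it is treated as failed, and that the ~25-second total is detection time plus the stable-after wait. The time taken by the singleton handover itself is not modeled.

```python
HEARTBEAT_INTERVAL_S = 2   # heartbeat interval from the list above
FAILURE_THRESHOLD_S = 10   # assumed: silence tolerated before a node counts as failed
STABLE_AFTER_S = 15        # split-brain resolver stable-after window


def missed_heartbeats(threshold_s: int, interval_s: int) -> int:
    """Number of heartbeat intervals that fit inside the threshold."""
    return threshold_s // interval_s


def failover_estimate_s(threshold_s: int, stable_after_s: int) -> int:
    """Detection time plus the resolver's stable-after wait."""
    return threshold_s + stable_after_s


print(missed_heartbeats(FAILURE_THRESHOLD_S, HEARTBEAT_INTERVAL_S))  # 5
print(failover_estimate_s(FAILURE_THRESHOLD_S, STABLE_AFTER_S))      # 25
```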

### Shared State

Both central nodes share state through:
- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or shared file path).

### Load Balancer

A load balancer sits in front of both central nodes for the Blazor Server UI:
- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).
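
To confirm from an operator workstation what the load balancer sees, the readiness endpoint can be probed directly. This is a minimal Python sketch using only the standard library. It assumes that HTTP 200 means "ready" and that any other status or a connection error means "not ready"; the guide does not specify the response format, and the function name `is_ready` is hypothetical.

```python
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True only if GET /health/ready answers with HTTP 200 (assumed)."""
    url = base_url.rstrip("/") + "/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, TLS failure, or a 4xx/5xx status.
        return False
```

For example, `is_ready("https://central-01.example.com")` and `is_ready("https://central-02.example.com")` would each report whether that node currently passes the check.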

## Site Cluster Setup

### Cluster Configuration

Each site runs its own two-node cluster, and the seed list names both site nodes:

**Site Node A** (`site-01-a.example.com`):
```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@site-01-a.example.com:8081",
        "akka.tcp://scadabridge@site-01-b.example.com:8081"
      ]
    }
  }
}
```
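
Only Site Node A is shown above. The sketch below generates a configuration for each node of a site pair in Python, on the assumption that Site Node B mirrors Node A with only `NodeHostname` changed, as the central pair does. The function `site_node_config` is a hypothetical helper, not a ScadaBridge tool.

```python
import json


def site_node_config(site_id: str, hostname: str, seed_hosts: list,
                     port: int = 8081, system: str = "scadabridge") -> dict:
    """Build one site node's settings in the layout shown above."""
    return {
        "ScadaBridge": {
            "Node": {
                "Role": "Site",
                "NodeHostname": hostname,
                "SiteId": site_id,
                "RemotingPort": port,
            },
            "Cluster": {
                # Both nodes receive the same ordered seed list.
                "SeedNodes": [f"akka.tcp://{system}@{h}:{port}" for h in seed_hosts]
            },
        }
    }


hosts = ["site-01-a.example.com", "site-01-b.example.com"]
configs = {h: site_node_config("plant-north", h, hosts) for h in hosts}

# Print the assumed Node B configuration as JSON.
print(json.dumps(configs["site-01-b.example.com"], indent=2))
```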

### Site Cluster Behavior

- Same split-brain resolver as central (keep-oldest).
- Singleton actors: Site Deployment Manager migrates on failover.
- Staggered instance startup: 50 ms delay between consecutive Instance Actor creations to prevent reconnection storms.
- SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).
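
The staggered startup described above can be illustrated with a short Python `asyncio` sketch. It only models the timing: the real implementation creates Akka.NET actors, and the names `start_staggered`, `start_one`, and the `pump-*` instance names are hypothetical.

```python
import asyncio

STAGGER_DELAY_S = 0.05  # the 50 ms delay described above


def stagger_total_ms(instance_count: int, delay_ms: int = 50) -> int:
    """Total wait added by staggering: one gap between each adjacent pair."""
    return max(instance_count - 1, 0) * delay_ms


async def start_staggered(names, start_one, delay_s: float = STAGGER_DELAY_S):
    """Start instances one at a time, pausing between consecutive starts."""
    for index, name in enumerate(names):
        if index > 0:
            await asyncio.sleep(delay_s)
        await start_one(name)


async def demo():
    started = []

    async def start_one(name):  # stand-in for creating an Instance Actor
        started.append(name)

    # A shorter delay keeps the demonstration fast.
    await start_staggered(["pump-1", "pump-2", "pump-3"], start_one, delay_s=0.001)
    return started


print(asyncio.run(demo()))   # ['pump-1', 'pump-2', 'pump-3']
print(stagger_total_ms(50))  # 2450
```

With 50 instances on a site, the stagger adds about 2.5 seconds to startup.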

### Central-Site Communication

- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor will retry until it connects to the active central node.

## Scaling Guidelines

### Target Scale

- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)
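
The subscription total follows directly from the other two figures. The Python sketch below reproduces the arithmetic and checks a planned deployment against these targets. Treating the targets as hard limits is an assumption made for illustration, and the example inputs (8 or 12 sites, 420 machines) are invented.

```python
MAX_SITES = 10
MAX_MACHINES = 500
TAGS_PER_MACHINE = 75


def total_subscriptions(machines: int, tags_per_machine: int = TAGS_PER_MACHINE) -> int:
    """Tag subscriptions across all sites."""
    return machines * tags_per_machine


def within_target_scale(sites: int, machines: int, tags_per_machine: int) -> bool:
    """True if the planned deployment stays inside the target scale."""
    limit = total_subscriptions(MAX_MACHINES)
    return (sites <= MAX_SITES
            and machines <= MAX_MACHINES
            and total_subscriptions(machines, tags_per_machine) <= limit)


print(total_subscriptions(MAX_MACHINES))  # 37500
print(within_target_scale(8, 420, 75))    # True
print(within_target_scale(12, 420, 75))   # False
```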

### Resource Requirements

| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| Central node | 4 cores | 8 GB | 50 GB | SQL Server is separate |
| Site node | 2 cores | 4 GB | 20 GB | SQLite databases grow with store-and-forward (S&F) data |
| SQL Server | 4 cores | 16 GB | 100 GB | Shared across central cluster |

### Network Bandwidth

- Health reports: ~1 KB per site every 30 seconds, which is negligible
- Tag value updates: Depends on data change rate; OPC UA subscription-based
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber
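
The per-message figures above translate into sustained rates as follows. This Python sketch is a back-of-the-envelope estimate only. It assumes 1 KB = 1024 bytes, and the change rate and subscriber count in the example are invented inputs, since the actual rate depends on the process data.

```python
def health_report_bytes_per_s(sites: int, report_bytes: int = 1024,
                              interval_s: int = 30) -> float:
    """Aggregate health-report traffic arriving at central."""
    return sites * report_bytes / interval_s


def debug_stream_bytes_per_s(changes_per_s: float, subscribers: int,
                             bytes_per_change: int = 500) -> float:
    """Debug view traffic: every change is sent to every subscriber."""
    return changes_per_s * subscribers * bytes_per_change


print(round(health_report_bytes_per_s(10)))  # 341
print(debug_stream_bytes_per_s(20, 2))       # 20000
```

At the 10-site maximum, health reports amount to roughly 341 bytes per second in total, which supports calling them negligible.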

## Dual-Node Failure Recovery

### Scenario: Both Nodes Down

1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.

### Automatic Recovery

No manual intervention is required after a dual-node failure. The first node to start will:
- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins