# Primary/Backup Data Connection Endpoints — Design

**Date:** 2026-03-22
**Status:** Approved

## Problem

Data connections currently support a single endpoint. If that endpoint goes down, the connection retries indefinitely at 5s intervals against the same address. When redundant infrastructure exists (e.g., two OPC UA servers), there is no way to automatically fail over to a backup.

## Design Decisions

| Decision | Choice |
|----------|--------|
| Failover mode | Automatic after N failed retries |
| Failback | No auto-failback; stay on the active endpoint until it fails (round-robin) |
| Backup required? | Optional; single-endpoint connections work unchanged |
| Failover trigger | After configurable retry count (default 3) |
| Entity model | Separate `PrimaryConfiguration` and `BackupConfiguration` columns |
| UI approach | Two JSON text areas; backup collapsible |
| Failover logic location | `DataConnectionActor` (adapters stay single-endpoint) |
| Observability | Health reports + site event log entries |

## Entity Model

**`DataConnection` changes:**

| Field | Type | Notes |
|-------|------|-------|
| `PrimaryConfiguration` | string? (max 4000) | Renamed from `Configuration` |
| `BackupConfiguration` | string? (max 4000) | New. Null = no backup |
| `FailoverRetryCount` | int (default 3) | New. Retries before switching |

Both endpoints use the same `Protocol`. EF Core migration renames `Configuration` → `PrimaryConfiguration` (data-preserving).
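
As a minimal sketch of the data-preserving rename, an EF Core migration could call `RenameColumn` instead of dropping and re-adding the column, so existing values carry over. The class name `AddBackupEndpoint` and the table name `DataConnections` are assumptions for illustration; the column names, length limit, and default come from the field table above.

```csharp
using Microsoft.EntityFrameworkCore.Migrations;

// Hypothetical migration; class and table names are assumed, not taken from the codebase.
public partial class AddBackupEndpoint : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // Rename in place so existing configuration values are kept.
        migrationBuilder.RenameColumn(
            name: "Configuration",
            table: "DataConnections",
            newName: "PrimaryConfiguration");

        // Null means "no backup endpoint".
        migrationBuilder.AddColumn<string>(
            name: "BackupConfiguration",
            table: "DataConnections",
            maxLength: 4000,
            nullable: true);

        // Existing rows pick up the default of 3 retries.
        migrationBuilder.AddColumn<int>(
            name: "FailoverRetryCount",
            table: "DataConnections",
            nullable: false,
            defaultValue: 3);
    }

    protected override void Down(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.DropColumn(name: "FailoverRetryCount", table: "DataConnections");
        migrationBuilder.DropColumn(name: "BackupConfiguration", table: "DataConnections");
        migrationBuilder.RenameColumn(
            name: "PrimaryConfiguration",
            table: "DataConnections",
            newName: "Configuration");
    }
}
```

If the migration is scaffolded from a model change, it is worth checking that the generated code contains a rename and not a drop followed by an add, since the latter would discard existing configurations.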

**`DataConnectionArtifact` changes:**
- `ConfigurationJson` → `PrimaryConfigurationJson` + `BackupConfigurationJson`

## Failover State Machine

The `DataConnectionActor` Reconnecting state is extended:

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'textColor':'#111111','lineColor':'#555555','edgeLabelBackground':'#ffffff','fontSize':'15px'}}}%%
flowchart TD
    C(["Connected"])
    BQ["Push bad quality<br/>to all subscribers"]
    RT["Retry active endpoint<br/>(5s interval)"]
    INC["_consecutiveFailures++"]
    BR{"Evaluate<br/>_consecutiveFailures"}
    SAME["Retry same endpoint"]
    FO["Failover<br/>- dispose adapter, switch _activeEndpoint, reset counter<br/>- create fresh adapter with other config<br/>- attempt connect"]
    NB["Keep retrying indefinitely<br/>(current behavior)"]
    RC(["On successful reconnect (either endpoint)<br/>1. Reset _consecutiveFailures = 0<br/>2. ReSubscribeAll(): re-create subscriptions on new adapter<br/>3. Transition to Connected<br/>4. Log failover event if endpoint changed<br/>5. Report active endpoint in health metrics"])

    C -->|"disconnect detected"| BQ
    BQ --> RT
    RT -->|"failure"| INC
    INC --> BR
    BR -->|"< FailoverRetryCount"| SAME
    SAME -.->|"retry"| RT
    BR -->|">= FailoverRetryCount AND backup exists"| FO
    BR -->|">= FailoverRetryCount AND no backup"| NB
    NB -.->|"retry (round-robin n/a)"| RT
    FO -->|"connect succeeds"| RC
    FO -.->|"connect fails (round-robin: primary to backup to primary...)"| RT
    RC -->|"Transition to Connected"| C

    classDef start fill:#d5e8d4,stroke:#82b366,color:#111111;
    classDef proc fill:#dae8fc,stroke:#6c8ebf,color:#111111;
    classDef dec fill:#fff2cc,stroke:#d6b656,color:#111111;
    classDef warn fill:#ffe6cc,stroke:#d79b00,color:#111111;
    classDef bad fill:#f8cecc,stroke:#b85450,color:#111111;
    class C,RC start
    class BQ,RT,SAME proc
    class INC,BR dec
    class FO warn
    class NB bad
```

**On successful reconnect (either endpoint):**
1. Reset `_consecutiveFailures = 0`
2. `ReSubscribeAll()`: re-create all subscriptions on the new adapter
3. Transition to Connected
4. Log failover event if endpoint changed
5. Report active endpoint in health metrics

**Round-robin on failure:** primary → backup → primary → backup...

**Adapter lifecycle on failover:** The actor disposes the current `IDataConnection` adapter and creates a fresh one via `DataConnectionFactory.Create()` with the other endpoint's config. The new adapter starts from a clean slate, so no stale state carries over.
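
The counting and switching rules can be sketched separately from the actor. The following is a minimal illustration, not the actual `DataConnectionActor` code: `FailoverTracker` and `RetryAction` are hypothetical names introduced here, and the sketch assumes the caller decides whether the connect attempt made during failover itself counts as a failure.

```csharp
using System;

public enum ActiveEndpoint { Primary, Backup }

// Hypothetical result type telling the caller what to do after a failed attempt.
public enum RetryAction { RetrySameEndpoint, FailOver }

// Hypothetical helper that isolates the failover rules from actor plumbing.
public sealed class FailoverTracker
{
    private readonly bool _hasBackup;
    private readonly int _failoverRetryCount;
    private int _consecutiveFailures;

    public FailoverTracker(bool hasBackup, int failoverRetryCount = 3)
    {
        if (failoverRetryCount < 1)
            throw new ArgumentOutOfRangeException(nameof(failoverRetryCount));
        _hasBackup = hasBackup;
        _failoverRetryCount = failoverRetryCount;
    }

    public ActiveEndpoint Active { get; private set; } = ActiveEndpoint.Primary;

    public int ConsecutiveFailures => _consecutiveFailures;

    // Call after each failed connect attempt against the active endpoint.
    public RetryAction OnConnectFailed()
    {
        _consecutiveFailures++;

        // Without a backup, or below the threshold, keep retrying the same endpoint.
        if (!_hasBackup || _consecutiveFailures < _failoverRetryCount)
            return RetryAction.RetrySameEndpoint;

        // Threshold reached and a backup exists: switch endpoints and restart the count.
        Active = Active == ActiveEndpoint.Primary
            ? ActiveEndpoint.Backup
            : ActiveEndpoint.Primary;
        _consecutiveFailures = 0;
        return RetryAction.FailOver;
    }

    // Call after a successful connect on either endpoint.
    public void OnConnected() => _consecutiveFailures = 0;
}
```

In this sketch the actor would dispose the adapter and create a new one whenever `OnConnectFailed()` returns `FailOver`, and would call `OnConnected()` before `ReSubscribeAll()`. Because the switch simply toggles `Active`, repeated failures alternate primary → backup → primary, which matches the round-robin rule above.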

## Actor State

New fields in `DataConnectionActor`:

- `IDictionary<string, string> _primaryConfig`
- `IDictionary<string, string>? _backupConfig`
- `ActiveEndpoint _activeEndpoint` (enum: Primary, Backup)
- `int _consecutiveFailures`
- `int _failoverRetryCount`

`CreateConnectionCommand` gains: `primaryConfig`, `backupConfig`, `failoverRetryCount`.

`DataConnectionFactory` is unchanged: it still creates single-endpoint adapters.

## Health & Observability

**`DataConnectionHealthReport`** gains:
- `ActiveEndpoint` (string): `"Primary"`, `"Backup"`, or `"Primary (no backup)"`
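
One way to derive the three report values is from the active endpoint plus whether a backup is configured. This is only a sketch: the `ActiveEndpointLabel` helper is a hypothetical name, and the enum is repeated so the snippet stands alone.

```csharp
public enum ActiveEndpoint { Primary, Backup }

// Hypothetical helper producing the string reported in DataConnectionHealthReport.
public static class ActiveEndpointLabel
{
    public static string Format(ActiveEndpoint active, bool hasBackup)
    {
        // A connection without a backup can only ever be on its primary endpoint.
        if (!hasBackup)
            return "Primary (no backup)";

        return active == ActiveEndpoint.Primary ? "Primary" : "Backup";
    }
}
```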

**Site event log entries:**
- `DataConnectionFailover`: connection name, from-endpoint, to-endpoint, reason
- `DataConnectionRestored`: connection name, active endpoint

Both entries are written through the existing `ISiteEventLogger`.

## Central UI

**List page:** Add `Active Endpoint` column from health reports.

**Form (Create/Edit):**
- "Primary Endpoint Configuration" label (renamed from "Configuration")
- "Add Backup Endpoint" button reveals second JSON text area
- "Remove Backup" button in edit mode when backup exists
- "Failover Retry Count" numeric input (default 3, min 1, max 20), visible only when a backup is configured
- Vertical stacking, collapsible backup subsection

## CLI

- `--configuration` is renamed to `--primary-config`; the old name remains as a hidden alias for backwards compatibility
- `--backup-config` (optional)
- `--failover-retry-count` (optional, default 3)
- `data-connection get` shows both configs and active endpoint

## Management API

- `CreateDataConnectionCommand` / `UpdateDataConnectionCommand` gain `PrimaryConfiguration`, `BackupConfiguration`, `FailoverRetryCount`
- Setting `BackupConfiguration` to null removes the backup
- `GetDataConnectionResponse` returns both configs

## Deployment Flow

`DataConnectionArtifact` carries `PrimaryConfigurationJson` and `BackupConfigurationJson`. The site-side deployment handler passes both to `CreateConnectionCommand`.

## Testing

**Unit tests:**
- Actor: failover after N failures, round-robin, single-endpoint retries forever, counter reset, `ReSubscribeAll` on failover
- Manager actor: updated `CreateConnectionCommand`
- Factory: unchanged registration

**Integration test (manual with test infra):**
1. Primary=`opc.tcp://localhost:50000`, backup=`opc.tcp://localhost:50010`
2. Subscribe to `Motor.Speed`
3. `docker compose stop opcua` → verify failover to opcua2 after 3 retries
4. `docker compose stop opcua2 && docker compose start opcua` → verify round-robin back to the primary

## Implementation Tasks

1. **#4** Entity model & database (foundation)
2. **#6** CreateConnectionCommand & DataConnectionManagerActor (blocked by #4)
3. **#5** DataConnectionActor failover state machine (blocked by #4, #6)
4. **#7** Health reporting & site event log (blocked by #5)
5. **#8** Central UI (blocked by #4)
6. **#9** CLI, Management API, deployment (blocked by #4)
7. **#10** Documentation (blocked by #5)
8. **#11** Tests (blocked by #5)