Joseph Doherty 86a15c0a65 docs(lmxproxy): document reconnection gaps from platform restart testing
Tested aaBootstrap kill on windev — three gaps identified:
1. No active health probing (IsConnected stays true on dead connection)
2. Stale SubscriptionManager handles after reconnect cycle
3. AVEVA objects don't auto-start after platform crash (platform behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 06:19:30 -04:00

SCADA System — Design Documentation

Overview

This document serves as the master index for the SCADA system design. The system is a centrally-managed, distributed SCADA configuration and deployment platform built on Akka.NET, running across a central cluster and multiple site clusters in a hub-and-spoke topology.

Technology Stack

Layer Technology
Runtime .NET, Akka.NET (actors, clustering, remoting, persistence, streams)
Central UI Blazor Server (ASP.NET Core + SignalR)
Inbound API ASP.NET Core Web API (REST/JSON)
Central Database MS SQL Server, Entity Framework Core
Site Storage SQLite (deployed configs, S&F buffer, event logs)
Authentication Direct LDAP/AD bind (LDAPS/StartTLS), JWT sessions
Notifications SMTP with OAuth2 Client Credentials (Microsoft 365)
Hosting Windows Server, Windows Service
Cluster Akka.NET Cluster (active/standby, keep-oldest SBR)
Logging Serilog (structured)

Scale

  • ~10 site clusters, each with 50500 machines, 2575 live tags per machine.
  • Central cluster: 2-node active/standby behind a load balancer.
  • Site clusters: 2-node active/standby, headless (no UI).

Document Map

Requirements

  • HighLevelReqs.md — Complete high-level requirements covering all functional areas.

Component Design Documents

# Component Document Description
1 Template Engine docs/requirements/Component-TemplateEngine.md Template modeling, inheritance, composition, path-qualified member addressing, override granularity, locking, alarms, flattening, semantic validation, revision hashing, and diff calculation.
2 Deployment Manager docs/requirements/Component-DeploymentManager.md Central-side deployment pipeline with deployment ID/idempotency, per-instance operation lock, state transition matrix, all-or-nothing site apply, system-wide artifact deployment with per-site status.
3 Site Runtime docs/requirements/Component-SiteRuntime.md Site-side actor hierarchy with explicit supervision strategies, staggered startup, script trust model (constrained APIs), Tell/Ask conventions, concurrency serialization, and site-wide Akka stream with per-subscriber backpressure.
4 Data Connection Layer docs/requirements/Component-DataConnectionLayer.md Common data connection interface (OPC UA, custom), Become/Stash connection actor model, auto-reconnect, immediate bad quality on disconnect, transparent re-subscribe, synchronous write failures, tag path resolution retry.
5 CentralSite Communication docs/requirements/Component-Communication.md Dual transport: Akka.NET ClusterClient (command/control) + gRPC server-streaming (real-time data). 8 message patterns with per-pattern timeouts, SiteStreamGrpcServer/Client, application-level correlation IDs, transport heartbeat config, gRPC keepalive, message ordering, connection failure behavior.
6 Store-and-Forward Engine docs/requirements/Component-StoreAndForward.md Buffering (transient failures only), fixed-interval retry, parking, async best-effort replication, SQLite persistence at sites.
7 External System Gateway docs/requirements/Component-ExternalSystemGateway.md HTTP/REST + JSON, API key/Basic Auth, per-system timeout, dual call modes (Call/CachedCall), transient/permanent error classification, dedicated blocking I/O dispatcher, ADO.NET connection pooling.
8 Notification Service docs/requirements/Component-NotificationService.md SMTP with OAuth2 (M365) or Basic Auth, BCC delivery, plain text, transient/permanent SMTP error classification, store-and-forward integration.
9 Central UI docs/requirements/Component-CentralUI.md Blazor Server with SignalR real-time push, load balancer failover with JWT, all management workflows.
10 Security & Auth docs/requirements/Component-Security.md Direct LDAP bind (LDAPS/StartTLS), JWT sessions (HMAC-SHA256, 15-min refresh, 30-min idle), role-based authorization, site-scoped permissions.
11 Health Monitoring docs/requirements/Component-HealthMonitoring.md 30s report interval, 60s offline threshold, monotonic sequence numbers, raw error counts, tag resolution counts, dead letter monitoring.
12 Site Event Logging docs/requirements/Component-SiteEventLogging.md SQLite storage, 30-day retention + 1GB cap, daily purge, paginated remote queries with keyword search.
13 Cluster Infrastructure docs/requirements/Component-ClusterInfrastructure.md Akka.NET cluster, keep-oldest SBR with down-if-alone, min-nr-of-members=1, 2s/10s/15s failure detection, CoordinatedShutdown, automatic dual-node recovery.
14 Inbound API docs/requirements/Component-InboundAPI.md POST /api/{methodName}, X-API-Key header, flat JSON, extended type system (Object/List), script-based implementations, failures-only logging.
15 Host docs/requirements/Component-Host.md Single deployable binary, role-based component registration, per-component config binding (Options pattern), readiness gating, dead letter monitoring, Akka.NET bootstrap, ASP.NET Core hosting for central.
16 Commons docs/requirements/Component-Commons.md Namespace/folder convention (Types/Interfaces/Entities/Messages), shared data types, POCOs, repository interfaces, message contracts with additive-only versioning, UTC timestamp convention.
17 Configuration Database docs/requirements/Component-ConfigurationDatabase.md EF Core data access, per-component repositories, unit-of-work, optimistic concurrency on deployment status, audit logging (IAuditService), migration management.
18 Management Service docs/requirements/Component-ManagementService.md Akka.NET ManagementActor on central, ClusterClientReceptionist registration, programmatic access to all admin operations, CLI interface.
19 CLI docs/requirements/Component-CLI.md Standalone command-line tool, System.CommandLine, HTTP transport via Management API, JSON/table output, mirrors all Management Service operations.
20 Traefik Proxy docs/requirements/Component-TraefikProxy.md Reverse proxy/load balancer fronting central cluster, active node routing via /health/active, automatic failover.

Reference Documentation

  • AkkaDotNet/ — Akka.NET reference notes covering actors, remoting, clustering, persistence, streams, serialization, hosting, testing, and best practices.
  • docs/plans/ — Design decision documents from refinement sessions.

Architecture Diagram (Logical)

                    Users (Blazor Server)
                         │
                    Load Balancer
                         │
┌────────────────────────┼────────────────────────────┐
│                   CENTRAL CLUSTER                    │
│              (2-node active/standby)                 │
│                                                      │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐            │
│  │ Template  │ │Deployment│ │ Central  │            │
│  │ Engine    │ │ Manager  │ │ UI       │ Blazor Svr │
│  └──────────┘ └──────────┘ └──────────┘            │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐            │
│  │ Security  │ │  Config  │ │  Health  │            │
│  │ & Auth    │ │   DB     │ │ Monitor  │            │
│  │ (JWT/LDAP)│ │ (EF+IAud)│ │          │            │
│  └──────────┘ └──────────┘ └──────────┘            │
│  ┌──────────┐                                       │
│  │ Inbound  │  ◄── External Systems (X-API-Key)     │
│  │ API      │      POST /api/{method}, JSON         │
│  └──────────┘                                       │
│  ┌──────────┐                                       │
│  │ Mgmt     │  ◄── CLI (ClusterClient)              │
│  │ Service  │      ManagementActor + Receptionist   │
│  └──────────┘                                       │
│  ┌───────────────────────────────────┐              │
│  │    Akka.NET Communication Layer   │              │
│  │  ClusterClient: command/control   │              │
│  │  gRPC Client: real-time streams   │              │
│  │  (correlation IDs, per-pattern    │              │
│  │   timeouts, message ordering)     │              │
│  └──────────────┬────────────────────┘              │
│  ┌──────────────┴────────────────────┐              │
│  │    Configuration Database (EF)    │──► MS SQL    │
│  └───────────────────────────────────┘   (Config DB)│
│                  │                    Machine Data DB│
└─────────────────┼───────────────────────────────────┘
                  │ Akka.NET Remoting (command/control)
                  │ gRPC HTTP/2 (real-time data, port 8083)
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ SITE A  │ │ SITE B  │ │ SITE N  │
│ (2-node)│ │ (2-node)│ │ (2-node)│
│ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │
│ │Data │ │ │ │Data │ │ │ │Data │ │
│ │Conn │ │ │ │Conn │ │ │ │Conn │ │
│ │Layer │ │ │ │Layer │ │ │ │Layer │ │
│ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │
│ │Site │ │ │ │Site │ │ │ │Site │ │
│ │Runtm│ │ │ │Runtm│ │ │ │Runtm│ │
│ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │
│ │gRPC │ │ │ │gRPC │ │ │ │gRPC │ │
│ │Srvr │ │ │ │Srvr │ │ │ │Srvr │ │
│ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │
│ │S&F  │ │ │ │S&F  │ │ │ │S&F  │ │
│ │Engine│ │ │ │Engine│ │ │ │Engine│ │
│ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │
│ │ExtSys│ │ │ │ExtSys│ │ │ │ExtSys│ │
│ │Gatwy │ │ │ │Gatwy │ │ │ │Gatwy │ │
│ └─────┘ │ │ └─────┘ │ │ └─────┘ │
│ SQLite  │ │ SQLite  │ │ SQLite  │
└─────────┘ └─────────┘ └─────────┘
     │            │            │
  OPC UA /     OPC UA /     OPC UA /
  Custom       Custom       Custom
  Protocol     Protocol     Protocol

Site Runtime Actor Hierarchy

Deployment Manager Singleton (Cluster Singleton)
├── Instance Actor (one per deployed, enabled instance)
│   ├── Script Actor (coordinator, one per instance script)
│   │   └── Script Execution Actor (short-lived, per invocation)
│   ├── Alarm Actor (coordinator, one per alarm definition)
│   │   └── Alarm Execution Actor (short-lived, per on-trigger invocation)
│   └── ... (more Script/Alarm Actors)
├── Instance Actor
│   └── ...
└── ... (more Instance Actors)

Site-Wide Akka Stream (attribute + alarm state changes)
├── All Instance Actors publish to the stream
└── Debug view subscribes with instance-level filtering
Description
No description provided
Readme 7.1 MiB
Languages
C# 88.5%
HTML 9.4%
Python 1.3%
TSQL 0.5%
Shell 0.2%