Update CLAUDE.md and README.md with all design decisions from refinement

CLAUDE.md: reorganize Key Design Decisions into categorized sections covering
architecture, data, integrations, templates, S&F, security, cluster, UI,
monitoring, code organization, and Akka.NET conventions. Add docs/plans and
AkkaDotNet to project structure.

README.md: add technology stack table and scale summary. Update all 17
component descriptions to reflect refined designs. Update architecture
diagram with load balancer, 2-node annotations, protocol connections, and
component details. Add links to AkkaDotNet reference docs and design plans.
Joseph Doherty
2026-03-16 09:38:15 -04:00
parent f5b3b2b59e
commit 760eb38eac
2 changed files with 142 additions and 38 deletions


@@ -7,6 +7,8 @@ This project contains design documentation for a distributed SCADA system built
- `README.md` — Master index with component table and architecture diagrams. - `README.md` — Master index with component table and architecture diagrams.
- `HighLevelReqs.md` — Complete high-level requirements covering all functional areas. - `HighLevelReqs.md` — Complete high-level requirements covering all functional areas.
- `Component-*.md` — Individual component design documents (one per component). - `Component-*.md` — Individual component design documents (one per component).
- `docs/plans/` — Design decision documents from refinement sessions.
- `AkkaDotNet/` — Akka.NET reference documentation and best practices notes.
There is no source code in this project — only design documentation in markdown. There is no source code in this project — only design documentation in markdown.
@@ -54,29 +56,87 @@ There is no source code in this project — only design documentation in markdow
## Key Design Decisions (for context across sessions)
### Architecture & Runtime
- Instance modeled as Akka actor (Instance Actor) — single source of truth for runtime state. - Instance modeled as Akka actor (Instance Actor) — single source of truth for runtime state.
- Site Runtime actor hierarchy: Deployment Manager singleton → Instance Actors → Script Actors + Alarm Actors.
- Script Actors spawn short-lived Script Execution Actors for concurrent execution. - Script Actors spawn short-lived Script Execution Actors on a dedicated blocking I/O dispatcher.
- Alarm Actors are separate peer subsystem from scripts (not inside Script Engine).
- Shared scripts execute inline as compiled code (no separate actors).
- Site-wide Akka stream for attribute value and alarm state changes (debug view subscribes to this). - Site-wide Akka stream for attribute value and alarm state changes with per-subscriber buffering.
- Script Engine and Alarm Actors subscribe directly to Instance Actors (not via stream). - Instance Actors serialize all state mutations (Akka actor model); concurrent scripts produce interleaved side effects.
- Staggered Instance Actor startup on failover to prevent reconnection storms.
- Explicit supervision strategies: Resume for coordinator actors, Stop for short-lived execution actors.
### Data & Communication
- Data Connection Layer is a clean data pipe — publishes to Instance Actors only. - Data Connection Layer is a clean data pipe — publishes to Instance Actors only.
- Pre-deployment validation is comprehensive (flattening, script compilation, trigger refs, binding completeness). - DCL connection actor uses Become/Stash pattern for lifecycle state machine.
- Data source references are relative paths; connection binding is per-attribute at instance level. - DCL auto-reconnect at fixed interval; immediate bad quality on disconnect; transparent re-subscribe.
- System-wide artifact deployment requires explicit Deployment role action. - DCL write failures returned synchronously to calling script.
- Store-and-forward: fixed retry interval, no max buffer size. - Tag path resolution retried periodically for devices still booting.
- Instance lifecycle: enabled/disabled states, deletion supported. - Static attribute writes persisted to local SQLite (survive restart/failover, reset on redeployment).
- Template deletion blocked if instances or child templates reference it. - All timestamps are UTC throughout the system.
### External Integrations
- External System Gateway: HTTP/REST only, JSON serialization, API key + Basic Auth.
- Dual call modes: `ExternalSystem.Call()` (synchronous) and `ExternalSystem.CachedCall()` (store-and-forward on transient failure).
- Error classification: HTTP 5xx/408/429/connection errors = transient; other 4xx = permanent (returned to script).
- Notification Service: SMTP with OAuth2 Client Credentials (Microsoft 365) or Basic Auth. BCC delivery, plain text.
- Inbound API: `POST /api/{methodName}`, `X-API-Key` header, flat JSON, extended type system (Object, List).
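
The transient/permanent error rule above can be sketched as a small classifier. This is an illustrative Python sketch only, not the project's code: the function name `classify_failure`, the use of `None` to represent a connection error, and the `"not_a_failure"` result for statuses below 400 are all assumptions made for the example.

```python
from typing import Optional

# Status codes outside 5xx that the design notes treat as transient.
TRANSIENT_NON_5XX = {408, 429}


def classify_failure(status: Optional[int]) -> str:
    """Classify an outbound call result.

    status=None stands for a connection error (hypothetical convention
    for this sketch). Returns "transient", "permanent", or
    "not_a_failure".
    """
    if status is None:
        return "transient"  # connection errors are retried via store-and-forward
    if 500 <= status <= 599 or status in TRANSIENT_NON_5XX:
        return "transient"
    if 400 <= status <= 499:
        return "permanent"  # returned to the calling script, never buffered
    return "not_a_failure"


if __name__ == "__main__":
    for code in (None, 503, 429, 408, 404, 200):
        print(code, classify_failure(code))
```

Under this reading, only `"transient"` results from `ExternalSystem.CachedCall()` would be handed to the store-and-forward buffer; `"permanent"` results go straight back to the script.
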
### Templates & Deployment
- Pre-deployment validation includes semantic checks (call targets, argument types, trigger operand types).
- Composed member addressing uses path-qualified canonical names: `[ModuleInstanceName].[MemberName]`.
- Override granularity defined per entity type and per field.
- Template graph acyclicity enforced on save.
- Flattened configs include a revision hash for staleness detection.
- Deployment identity: unique deployment ID + revision hash for idempotency.
- Per-instance operation lock covers all mutating commands (deploy, disable, enable, delete).
- Site-side apply is all-or-nothing per instance.
- System-wide artifact version skew across sites is supported.
- Last-write-wins for concurrent template editing (no optimistic concurrency on templates).
- Optimistic concurrency on deployment status records.
- Naming collisions in composed feature modules are design-time errors.
- Last-write-wins for concurrent template editing.
- Scripts compiled at site; pre-validated (test compiled) at central. ### Store-and-Forward
- Max recursion depth for script-to-script calls. - Fixed retry interval, no max buffer size. Only transient failures buffered.
- Alarm on-trigger scripts can call instance scripts; instance scripts cannot call alarm scripts. - Async best-effort replication to standby (no ack wait).
- Audit logging absorbed into Configuration Database component (IAuditService). - Messages not cleared on instance deletion.
- CachedCall idempotency is the caller's responsibility.
### Security & Auth
- Authentication: direct LDAP bind (username/password), no Kerberos/NTLM. LDAPS/StartTLS required.
- JWT sessions: HMAC-SHA256 shared symmetric key, 15-minute expiry with sliding refresh, 30-minute idle timeout.
- LDAP failure: new logins fail; active sessions continue with current roles.
- Load balancer in front of central UI; JWT + shared Data Protection keys for failover transparency.
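
One plausible reading of the JWT session timings above (15-minute expiry, sliding refresh, 30-minute idle timeout) is sketched below in Python. The function `session_action` and its three outcomes are hypothetical names chosen for the example, and the rule that an expired token may still be refreshed while the user is inside the idle window is an assumption, not something the design notes state.

```python
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(minutes=15)  # JWT expiry from the design notes
IDLE_TIMEOUT = timedelta(minutes=30)    # idle timeout from the design notes


def session_action(issued_at: datetime, last_activity: datetime, now: datetime) -> str:
    """Decide what to do with a request carrying a session token.

    Assumed policy for this sketch:
      - idle for 30 minutes or more -> "reject" (user must log in again)
      - token older than 15 minutes -> "refresh" (issue a new token)
      - otherwise                   -> "accept"
    All timestamps are UTC, matching the system-wide convention.
    """
    if now - last_activity >= IDLE_TIMEOUT:
        return "reject"
    if now - issued_at >= TOKEN_LIFETIME:
        return "refresh"
    return "accept"


if __name__ == "__main__":
    t0 = datetime(2026, 3, 16, 12, 0, tzinfo=timezone.utc)
    print(session_action(t0, t0, t0 + timedelta(minutes=5)))
    print(session_action(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=20)))
    print(session_action(t0, t0, t0 + timedelta(minutes=31)))
```
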
### Cluster & Failover
- Keep-oldest split-brain resolver with `down-if-alone = on`, 15s stable-after.
- Both nodes are seed nodes. `min-nr-of-members = 1`.
- Failure detection: 2s heartbeat, 10s threshold. Total failover ~25s.
- CoordinatedShutdown for graceful singleton handover.
- Automatic dual-node recovery from persistent storage.
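
The "~25s" failover figure above is consistent with adding the 10s failure-detection threshold to the 15s stable-after window. The Python sketch below only reproduces that arithmetic; treating the threshold as a fixed window is a simplification, and the function names are made up for the example.

```python
HEARTBEAT_INTERVAL_S = 2   # heartbeat interval from the design notes
FAILURE_THRESHOLD_S = 10   # failure-detection threshold from the design notes
STABLE_AFTER_S = 15        # split-brain resolver stable-after from the design notes


def missed_heartbeats(threshold_s: int, heartbeat_s: int) -> int:
    """Number of heartbeat intervals that fit in the detection threshold."""
    return threshold_s // heartbeat_s


def failover_estimate_s(threshold_s: int, stable_after_s: int) -> int:
    """Rough failover time: detect the failure, then wait for stable-after.

    Assumption: singleton restart and other overheads are ignored, which
    is why the design notes quote the total as approximate.
    """
    return threshold_s + stable_after_s


if __name__ == "__main__":
    print(missed_heartbeats(FAILURE_THRESHOLD_S, HEARTBEAT_INTERVAL_S))
    print(failover_estimate_s(FAILURE_THRESHOLD_S, STABLE_AFTER_S))
```
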
### UI & Monitoring
- Central UI: Blazor Server (ASP.NET Core + SignalR). Real-time push for debug view, health dashboard, deployment status.
- Health reports: 30s interval, 60s offline threshold, monotonic sequence numbers, raw error counts per interval.
- Dead letter monitoring as a health metric.
- Site Event Logging: 30-day retention, 1GB storage cap, daily purge, paginated queries with keyword search.
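
A minimal Python sketch of how the central side might apply the health-report parameters above (30s interval, 60s offline threshold, monotonic sequence numbers). The `SiteHealth` class is hypothetical, and using the sequence number to discard duplicate or out-of-order reports is an assumption about its purpose.

```python
from dataclasses import dataclass
from typing import Optional

REPORT_INTERVAL_S = 30    # sites report every 30 seconds
OFFLINE_THRESHOLD_S = 60  # no report for more than 60 seconds means offline


@dataclass
class SiteHealth:
    """Tracks the latest health report seen from one site (illustrative only)."""

    last_sequence: int = -1
    last_seen_s: Optional[float] = None

    def accept(self, sequence: int, received_at_s: float) -> bool:
        """Record a report; ignore duplicates and out-of-order arrivals."""
        if sequence <= self.last_sequence:
            return False
        self.last_sequence = sequence
        self.last_seen_s = received_at_s
        return True

    def is_offline(self, now_s: float) -> bool:
        """A site with no report yet, or none within the threshold, is offline."""
        if self.last_seen_s is None:
            return True
        return now_s - self.last_seen_s > OFFLINE_THRESHOLD_S


if __name__ == "__main__":
    site = SiteHealth()
    print(site.is_offline(0.0))
    print(site.accept(1, 0.0), site.accept(1, 5.0))
    print(site.is_offline(float(REPORT_INTERVAL_S)), site.is_offline(61.0))
```

With a 30s interval and a 60s threshold, a single missed report does not mark a site offline; two consecutive misses do. A real tracker would also need a rule for sequence numbers restarting after a site failover, which this sketch leaves out.
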
### Code Organization
- Entity classes are persistence-ignorant POCOs in Commons; EF mappings in Configuration Database.
- Repository interfaces defined in Commons; implementations in Configuration Database.
- Commons namespace hierarchy: Types/, Interfaces/, Entities/, Messages/ with domain area subfolders.
- Message contracts follow additive-only evolution rules for version compatibility.
- Per-component configuration via `appsettings.json` sections bound to options classes (Options pattern).
- Options classes owned by component projects, not Commons.
- Host readiness gating: `/health/ready` endpoint, no traffic until operational.
- EF Core migrations: auto-apply in dev, manual SQL scripts for production.
- Audit logging absorbed into Configuration Database component (IAuditService).
### Akka.NET Conventions
- Tell for hot-path internal communication; Ask reserved for system boundaries.
- Script trust model: forbidden APIs (System.IO, Process, Threading, Reflection, raw network).
- Application-level correlation IDs on all request/response messages.
## Tool Usage


@@ -4,6 +4,27 @@
This document serves as the master index for the SCADA system design. The system is a centrally-managed, distributed SCADA configuration and deployment platform built on Akka.NET, running across a central cluster and multiple site clusters in a hub-and-spoke topology.
### Technology Stack
| Layer | Technology |
|-------|-----------|
| Runtime | .NET, Akka.NET (actors, clustering, remoting, persistence, streams) |
| Central UI | Blazor Server (ASP.NET Core + SignalR) |
| Inbound API | ASP.NET Core Web API (REST/JSON) |
| Central Database | MS SQL Server, Entity Framework Core |
| Site Storage | SQLite (deployed configs, S&F buffer, event logs) |
| Authentication | Direct LDAP/AD bind (LDAPS/StartTLS), JWT sessions |
| Notifications | SMTP with OAuth2 Client Credentials (Microsoft 365) |
| Hosting | Windows Server, Windows Service |
| Cluster | Akka.NET Cluster (active/standby, keep-oldest SBR) |
| Logging | Serilog (structured) |
### Scale
- ~10 site clusters, each with 50–500 machines, 25–75 live tags per machine.
- Central cluster: 2-node active/standby behind a load balancer.
- Site clusters: 2-node active/standby, headless (no UI).
## Document Map
### Requirements
@@ -13,45 +34,59 @@ This document serves as the master index for the SCADA system design. The system
| # | Component | Document | Description | | # | Component | Document | Description |
|---|-----------|----------|-------------| |---|-----------|----------|-------------|
| 1 | Template Engine | [Component-TemplateEngine.md](Component-TemplateEngine.md) | Template modeling, inheritance, composition, attribute resolution, locking, alarms, flattening, validation, and diff calculation. | | 1 | Template Engine | [Component-TemplateEngine.md](Component-TemplateEngine.md) | Template modeling, inheritance, composition, path-qualified member addressing, override granularity, locking, alarms, flattening, semantic validation, revision hashing, and diff calculation. |
| 2 | Deployment Manager | [Component-DeploymentManager.md](Component-DeploymentManager.md) | Central-side deployment pipeline: requesting configs, sending to sites, tracking status, system-wide artifact deployment, instance disable/delete. | | 2 | Deployment Manager | [Component-DeploymentManager.md](Component-DeploymentManager.md) | Central-side deployment pipeline with deployment ID/idempotency, per-instance operation lock, state transition matrix, all-or-nothing site apply, system-wide artifact deployment with per-site status. |
| 3 | Site Runtime | [Component-SiteRuntime.md](Component-SiteRuntime.md) | Site-side actor hierarchy: Deployment Manager singleton, Instance Actors, Script Actors, Alarm Actors, script compilation, shared script library, and the site-wide attribute/alarm Akka stream. | | 3 | Site Runtime | [Component-SiteRuntime.md](Component-SiteRuntime.md) | Site-side actor hierarchy with explicit supervision strategies, staggered startup, script trust model (constrained APIs), Tell/Ask conventions, concurrency serialization, and site-wide Akka stream with per-subscriber backpressure. |
| 4 | Data Connection Layer | [Component-DataConnectionLayer.md](Component-DataConnectionLayer.md) | Common data connection interface, OPC UA and custom protocol adapters, subscription management. Publishes tag value updates to Instance Actors. | | 4 | Data Connection Layer | [Component-DataConnectionLayer.md](Component-DataConnectionLayer.md) | Common data connection interface (OPC UA, custom), Become/Stash connection actor model, auto-reconnect, immediate bad quality on disconnect, transparent re-subscribe, synchronous write failures, tag path resolution retry. |
| 5 | Central–Site Communication | [Component-Communication.md](Component-Communication.md) | Akka.NET remoting/cluster topology, message patterns, request routing, and debug streaming. | | 5 | Central–Site Communication | [Component-Communication.md](Component-Communication.md) | Akka.NET remoting/cluster topology, 8 message patterns with per-pattern timeouts, application-level correlation IDs, transport heartbeat config, message ordering, connection failure behavior. |
| 6 | Store-and-Forward Engine | [Component-StoreAndForward.md](Component-StoreAndForward.md) | Buffering, retry, parking, application-level replication, and SQLite persistence at sites. | | 6 | Store-and-Forward Engine | [Component-StoreAndForward.md](Component-StoreAndForward.md) | Buffering (transient failures only), fixed-interval retry, parking, async best-effort replication, SQLite persistence at sites. |
| 7 | External System Gateway | [Component-ExternalSystemGateway.md](Component-ExternalSystemGateway.md) | External system definitions, API method invocation, and database connection management. | | 7 | External System Gateway | [Component-ExternalSystemGateway.md](Component-ExternalSystemGateway.md) | HTTP/REST + JSON, API key/Basic Auth, per-system timeout, dual call modes (Call/CachedCall), transient/permanent error classification, dedicated blocking I/O dispatcher, ADO.NET connection pooling. |
| 8 | Notification Service | [Component-NotificationService.md](Component-NotificationService.md) | Notification lists, email delivery, script API, and store-and-forward integration. | | 8 | Notification Service | [Component-NotificationService.md](Component-NotificationService.md) | SMTP with OAuth2 (M365) or Basic Auth, BCC delivery, plain text, transient/permanent SMTP error classification, store-and-forward integration. |
| 9 | Central UI | [Component-CentralUI.md](Component-CentralUI.md) | Web-based management interface, workflows, and pages. | | 9 | Central UI | [Component-CentralUI.md](Component-CentralUI.md) | Blazor Server with SignalR real-time push, load balancer failover with JWT, all management workflows. |
| 10 | Security & Auth | [Component-Security.md](Component-Security.md) | LDAP/AD authentication, role-based authorization, and site-scoped permissions. | | 10 | Security & Auth | [Component-Security.md](Component-Security.md) | Direct LDAP bind (LDAPS/StartTLS), JWT sessions (HMAC-SHA256, 15-min refresh, 30-min idle), role-based authorization, site-scoped permissions. |
| 11 | Health Monitoring | [Component-HealthMonitoring.md](Component-HealthMonitoring.md) | Site health metrics collection (including alarm evaluation errors) and central reporting. | | 11 | Health Monitoring | [Component-HealthMonitoring.md](Component-HealthMonitoring.md) | 30s report interval, 60s offline threshold, monotonic sequence numbers, raw error counts, tag resolution counts, dead letter monitoring. |
| 12 | Site Event Logging | [Component-SiteEventLogging.md](Component-SiteEventLogging.md) | Local operational event logs at sites with central query access. | | 12 | Site Event Logging | [Component-SiteEventLogging.md](Component-SiteEventLogging.md) | SQLite storage, 30-day retention + 1GB cap, daily purge, paginated remote queries with keyword search. |
| 13 | Cluster Infrastructure | [Component-ClusterInfrastructure.md](Component-ClusterInfrastructure.md) | Akka.NET cluster setup, active/standby failover, and node management. | | 13 | Cluster Infrastructure | [Component-ClusterInfrastructure.md](Component-ClusterInfrastructure.md) | Akka.NET cluster, keep-oldest SBR with down-if-alone, min-nr-of-members=1, 2s/10s/15s failure detection, CoordinatedShutdown, automatic dual-node recovery. |
| 14 | Inbound API | [Component-InboundAPI.md](Component-InboundAPI.md) | Web API for external systems to call in, API key auth, method definitions, script-based implementations. | | 14 | Inbound API | [Component-InboundAPI.md](Component-InboundAPI.md) | POST /api/{methodName}, X-API-Key header, flat JSON, extended type system (Object/List), script-based implementations, failures-only logging. |
| 15 | Host | [Component-Host.md](Component-Host.md) | Single deployable binary, role-based component registration, Akka.NET bootstrap, and ASP.NET Core hosting for central nodes. | | 15 | Host | [Component-Host.md](Component-Host.md) | Single deployable binary, role-based component registration, per-component config binding (Options pattern), readiness gating, dead letter monitoring, Akka.NET bootstrap, ASP.NET Core hosting for central. |
| 16 | Commons | [Component-Commons.md](Component-Commons.md) | Shared data types, interfaces, domain entity POCOs, repository interfaces, and message contracts used across all components. | | 16 | Commons | [Component-Commons.md](Component-Commons.md) | Namespace/folder convention (Types/Interfaces/Entities/Messages), shared data types, POCOs, repository interfaces, message contracts with additive-only versioning, UTC timestamp convention. |
| 17 | Configuration Database | [Component-ConfigurationDatabase.md](Component-ConfigurationDatabase.md) | EF Core data access layer, schema ownership, per-component repositories, unit-of-work, audit logging (IAuditService), and migration management for the central MS SQL configuration database. | | 17 | Configuration Database | [Component-ConfigurationDatabase.md](Component-ConfigurationDatabase.md) | EF Core data access, per-component repositories, unit-of-work, optimistic concurrency on deployment status, audit logging (IAuditService), migration management. |
### Reference Documentation
- [AkkaDotNet/](AkkaDotNet/) — Akka.NET reference notes covering actors, remoting, clustering, persistence, streams, serialization, hosting, testing, and best practices.
- [docs/plans/](docs/plans/) — Design decision documents from refinement sessions.
### Architecture Diagram (Logical)
``` ```
┌─────────────────────────────────────────────────────┐ Users (Blazor Server)
Load Balancer
┌────────────────────────┼────────────────────────────┐
│ CENTRAL CLUSTER │ │ CENTRAL CLUSTER │
│ (2-node active/standby) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Template │ │Deployment│ │ Central │ │ │ │ Template │ │Deployment│ │ Central │ │
│ │ Engine │ │ Manager │ │ UI │ │ │ Engine │ │ Manager │ │ UI │ Blazor Svr
│ └──────────┘ └──────────┘ └──────────┘ │ │ └──────────┘ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Security │ │ Audit │ │ Health │ │ │ │ Security │ │ Config │ │ Health │ │
│ │ & Auth │ │ Logging │ │ Monitor │ │ │ │ & Auth │ │ DB │ │ Monitor │ │
│ │ (JWT/LDAP)│ │ (EF+IAud)│ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ │ │ └──────────┘ └──────────┘ └──────────┘ │
│ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Inbound │ ◄── External Systems (API key auth) │ │ Inbound │ ◄── External Systems (X-API-Key)
│ │ API │ │ API POST /api/{method}, JSON
│ └──────────┘ │ │ └──────────┘ │
│ ┌───────────────────────────────────┐ │ │ ┌───────────────────────────────────┐ │
│ │ Akka.NET Communication Layer │ │ │ │ Akka.NET Communication Layer │ │
│ │ (correlation IDs, per-pattern │ │
│ │ timeouts, message ordering) │ │
│ └──────────────┬────────────────────┘ │ │ └──────────────┬────────────────────┘ │
│ ┌──────────────────────────────────┐ │ │ ┌──────────────────────────────────┐ │
│ │ Configuration Database (EF) │──► MS SQL │ │ │ Configuration Database (EF) │──► MS SQL │
│ └───────────────────────────────────┘ (Config DB)│ │ └───────────────────────────────────┘ (Config DB)│
│ │ Machine Data DB│ │ │ Machine Data DB│
@@ -61,18 +96,27 @@ This document serves as the master index for the SCADA system design. The system
▼ ▼ ▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ SITE A │ │ SITE B │ │ SITE N │ │ SITE A │ │ SITE B │ │ SITE N │
│ (2-node)│ │ (2-node)│ │ (2-node)│
│ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │
│ │Data │ │ │ │Data │ │ │ │Data │ │ │ │Data │ │ │ │Data │ │ │ │Data │ │
│ │Conn │ │ │ │Conn │ │ │ │Conn │ │ │ │Conn │ │ │ │Conn │ │ │ │Conn │ │
│ │Layer │ │ │ │Layer │ │ │ │Layer │ │
│ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │
│ │Site │ │ │ │Site │ │ │ │Site │ │ │ │Site │ │ │ │Site │ │ │ │Site │ │
│ │Runtm│ │ │ │Runtm│ │ │ │Runtm│ │ │ │Runtm│ │ │ │Runtm│ │ │ │Runtm│ │
│ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │
│ │S&F │ │ │ │S&F │ │ │ │S&F │ │ │ │S&F │ │ │ │S&F │ │ │ │S&F │ │
│ │Engine│ │ │ │Engine│ │ │ │Engine│ │ │ │Engine│ │ │ │Engine│ │ │ │Engine│ │
│ ├─────┤ │ │ ├─────┤ │ │ ├─────┤ │
│ │ExtSys│ │ │ │ExtSys│ │ │ │ExtSys│ │
│ │Gatwy │ │ │ │Gatwy │ │ │ │Gatwy │ │
│ └─────┘ │ │ └─────┘ │ │ └─────┘ │ │ └─────┘ │ │ └─────┘ │ │ └─────┘ │
│ SQLite │ │ SQLite │ │ SQLite │ │ SQLite │ │ SQLite │ │ SQLite │
└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
│ │ │
OPC UA / OPC UA / OPC UA /
Custom Custom Custom
Protocol Protocol Protocol
``` ```
### Site Runtime Actor Hierarchy