diff --git a/CLAUDE.md b/CLAUDE.md
index f87255d..2bdb921 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -7,6 +7,8 @@ This project contains design documentation for a distributed SCADA system built
 - `README.md` — Master index with component table and architecture diagrams.
 - `HighLevelReqs.md` — Complete high-level requirements covering all functional areas.
 - `Component-*.md` — Individual component design documents (one per component).
+- `docs/plans/` — Design decision documents from refinement sessions.
+- `AkkaDotNet/` — Akka.NET reference documentation and best practices notes.

 There is no source code in this project — only design documentation in markdown.

@@ -54,29 +56,87 @@ There is no source code in this project — only design documentation in markdow

 ## Key Design Decisions (for context across sessions)

+### Architecture & Runtime
 - Instance modeled as Akka actor (Instance Actor) — single source of truth for runtime state.
 - Site Runtime actor hierarchy: Deployment Manager singleton → Instance Actors → Script Actors + Alarm Actors.
-- Script Actors spawn short-lived Script Execution Actors for concurrent execution.
+- Script Actors spawn short-lived Script Execution Actors on a dedicated blocking I/O dispatcher.
 - Alarm Actors are separate peer subsystem from scripts (not inside Script Engine).
 - Shared scripts execute inline as compiled code (no separate actors).
-- Site-wide Akka stream for attribute value and alarm state changes (debug view subscribes to this).
-- Script Engine and Alarm Actors subscribe directly to Instance Actors (not via stream).
+- Site-wide Akka stream for attribute value and alarm state changes with per-subscriber buffering.
+- Instance Actors serialize all state mutations (Akka actor model); concurrent scripts produce interleaved side effects.
+- Staggered Instance Actor startup on failover to prevent reconnection storms.
+- Explicit supervision strategies: Resume for coordinator actors, Stop for short-lived execution actors.
+
+### Data & Communication
 - Data Connection Layer is a clean data pipe — publishes to Instance Actors only.
-- Pre-deployment validation is comprehensive (flattening, script compilation, trigger refs, binding completeness).
-- Data source references are relative paths; connection binding is per-attribute at instance level.
-- System-wide artifact deployment requires explicit Deployment role action.
-- Store-and-forward: fixed retry interval, no max buffer size.
-- Instance lifecycle: enabled/disabled states, deletion supported.
-- Template deletion blocked if instances or child templates reference it.
+- DCL connection actor uses Become/Stash pattern for lifecycle state machine.
+- DCL auto-reconnect at fixed interval; immediate bad quality on disconnect; transparent re-subscribe.
+- DCL write failures returned synchronously to calling script.
+- Tag path resolution retried periodically for devices still booting.
+- Static attribute writes persisted to local SQLite (survive restart/failover, reset on redeployment).
+- All timestamps are UTC throughout the system.
+
+### External Integrations
+- External System Gateway: HTTP/REST only, JSON serialization, API key + Basic Auth.
+- Dual call modes: `ExternalSystem.Call()` (synchronous) and `ExternalSystem.CachedCall()` (store-and-forward on transient failure).
+- Error classification: HTTP 5xx/408/429/connection errors = transient; other 4xx = permanent (returned to script).
+- Notification Service: SMTP with OAuth2 Client Credentials (Microsoft 365) or Basic Auth. BCC delivery, plain text.
+- Inbound API: `POST /api/{methodName}`, `X-API-Key` header, flat JSON, extended type system (Object, List).
+
+### Templates & Deployment
+- Pre-deployment validation includes semantic checks (call targets, argument types, trigger operand types).
+- Composed member addressing uses path-qualified canonical names: `[ModuleInstanceName].[MemberName]`.
+- Override granularity defined per entity type and per field.
+- Template graph acyclicity enforced on save.
+- Flattened configs include a revision hash for staleness detection.
+- Deployment identity: unique deployment ID + revision hash for idempotency.
+- Per-instance operation lock covers all mutating commands (deploy, disable, enable, delete).
+- Site-side apply is all-or-nothing per instance.
+- System-wide artifact version skew across sites is supported.
+- Last-write-wins for concurrent template editing (no optimistic concurrency on templates).
+- Optimistic concurrency on deployment status records.
 - Naming collisions in composed feature modules are design-time errors.
-- Last-write-wins for concurrent template editing.
-- Scripts compiled at site; pre-validated (test compiled) at central.
-- Max recursion depth for script-to-script calls.
-- Alarm on-trigger scripts can call instance scripts; instance scripts cannot call alarm scripts.
-- Audit logging absorbed into Configuration Database component (IAuditService).
+
+### Store-and-Forward
+- Fixed retry interval, no max buffer size. Only transient failures buffered.
+- Async best-effort replication to standby (no ack wait).
+- Messages not cleared on instance deletion.
+- CachedCall idempotency is the caller's responsibility.
+
+### Security & Auth
+- Authentication: direct LDAP bind (username/password), no Kerberos/NTLM. LDAPS/StartTLS required.
+- JWT sessions: HMAC-SHA256 shared symmetric key, 15-minute expiry with sliding refresh, 30-minute idle timeout.
+- LDAP failure: new logins fail; active sessions continue with current roles.
+- Load balancer in front of central UI; JWT + shared Data Protection keys for failover transparency.
+
+### Cluster & Failover
+- Keep-oldest split-brain resolver with `down-if-alone = on`, 15s stable-after.
+- Both nodes are seed nodes. `min-nr-of-members = 1`.
+- Failure detection: 2s heartbeat, 10s threshold. Total failover ~25s.
+- CoordinatedShutdown for graceful singleton handover.
+- Automatic dual-node recovery from persistent storage.
+
+### UI & Monitoring
+- Central UI: Blazor Server (ASP.NET Core + SignalR). Real-time push for debug view, health dashboard, deployment status.
+- Health reports: 30s interval, 60s offline threshold, monotonic sequence numbers, raw error counts per interval.
+- Dead letter monitoring as a health metric.
+- Site Event Logging: 30-day retention, 1GB storage cap, daily purge, paginated queries with keyword search.
+
+### Code Organization
 - Entity classes are persistence-ignorant POCOs in Commons; EF mappings in Configuration Database.
 - Repository interfaces defined in Commons; implementations in Configuration Database.
+- Commons namespace hierarchy: Types/, Interfaces/, Entities/, Messages/ with domain area subfolders.
+- Message contracts follow additive-only evolution rules for version compatibility.
+- Per-component configuration via `appsettings.json` sections bound to options classes (Options pattern).
+- Options classes owned by component projects, not Commons.
+- Host readiness gating: `/health/ready` endpoint, no traffic until operational.
 - EF Core migrations: auto-apply in dev, manual SQL scripts for production.
+- Audit logging absorbed into Configuration Database component (IAuditService).
+
+### Akka.NET Conventions
+- Tell for hot-path internal communication; Ask reserved for system boundaries.
+- Script trust model: forbidden APIs (System.IO, Process, Threading, Reflection, raw network).
+- Application-level correlation IDs on all request/response messages.

 ## Tool Usage

diff --git a/README.md b/README.md
index 76a6578..85f35ad 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,27 @@

 This document serves as the master index for the SCADA system design. The system is a centrally-managed, distributed SCADA configuration and deployment platform built on Akka.NET, running across a central cluster and multiple site clusters in a hub-and-spoke topology.

+### Technology Stack
+
+| Layer | Technology |
+|-------|-----------|
+| Runtime | .NET, Akka.NET (actors, clustering, remoting, persistence, streams) |
+| Central UI | Blazor Server (ASP.NET Core + SignalR) |
+| Inbound API | ASP.NET Core Web API (REST/JSON) |
+| Central Database | MS SQL Server, Entity Framework Core |
+| Site Storage | SQLite (deployed configs, S&F buffer, event logs) |
+| Authentication | Direct LDAP/AD bind (LDAPS/StartTLS), JWT sessions |
+| Notifications | SMTP with OAuth2 Client Credentials (Microsoft 365) |
+| Hosting | Windows Server, Windows Service |
+| Cluster | Akka.NET Cluster (active/standby, keep-oldest SBR) |
+| Logging | Serilog (structured) |
+
+### Scale
+
+- ~10 site clusters, each with 50–500 machines, 25–75 live tags per machine.
+- Central cluster: 2-node active/standby behind a load balancer.
+- Site clusters: 2-node active/standby, headless (no UI).
+
 ## Document Map

 ### Requirements
@@ -13,45 +34,59 @@ This document serves as the master index for the SCADA system design. The system

 | # | Component | Document | Description |
 |---|-----------|----------|-------------|
-| 1 | Template Engine | [Component-TemplateEngine.md](Component-TemplateEngine.md) | Template modeling, inheritance, composition, attribute resolution, locking, alarms, flattening, validation, and diff calculation. |
-| 2 | Deployment Manager | [Component-DeploymentManager.md](Component-DeploymentManager.md) | Central-side deployment pipeline: requesting configs, sending to sites, tracking status, system-wide artifact deployment, instance disable/delete. |
-| 3 | Site Runtime | [Component-SiteRuntime.md](Component-SiteRuntime.md) | Site-side actor hierarchy: Deployment Manager singleton, Instance Actors, Script Actors, Alarm Actors, script compilation, shared script library, and the site-wide attribute/alarm Akka stream. |
-| 4 | Data Connection Layer | [Component-DataConnectionLayer.md](Component-DataConnectionLayer.md) | Common data connection interface, OPC UA and custom protocol adapters, subscription management. Publishes tag value updates to Instance Actors. |
-| 5 | Central–Site Communication | [Component-Communication.md](Component-Communication.md) | Akka.NET remoting/cluster topology, message patterns, request routing, and debug streaming. |
-| 6 | Store-and-Forward Engine | [Component-StoreAndForward.md](Component-StoreAndForward.md) | Buffering, retry, parking, application-level replication, and SQLite persistence at sites. |
-| 7 | External System Gateway | [Component-ExternalSystemGateway.md](Component-ExternalSystemGateway.md) | External system definitions, API method invocation, and database connection management. |
-| 8 | Notification Service | [Component-NotificationService.md](Component-NotificationService.md) | Notification lists, email delivery, script API, and store-and-forward integration. |
-| 9 | Central UI | [Component-CentralUI.md](Component-CentralUI.md) | Web-based management interface, workflows, and pages. |
-| 10 | Security & Auth | [Component-Security.md](Component-Security.md) | LDAP/AD authentication, role-based authorization, and site-scoped permissions. |
-| 11 | Health Monitoring | [Component-HealthMonitoring.md](Component-HealthMonitoring.md) | Site health metrics collection (including alarm evaluation errors) and central reporting. |
-| 12 | Site Event Logging | [Component-SiteEventLogging.md](Component-SiteEventLogging.md) | Local operational event logs at sites with central query access. |
-| 13 | Cluster Infrastructure | [Component-ClusterInfrastructure.md](Component-ClusterInfrastructure.md) | Akka.NET cluster setup, active/standby failover, and node management. |
-| 14 | Inbound API | [Component-InboundAPI.md](Component-InboundAPI.md) | Web API for external systems to call in, API key auth, method definitions, script-based implementations. |
-| 15 | Host | [Component-Host.md](Component-Host.md) | Single deployable binary, role-based component registration, Akka.NET bootstrap, and ASP.NET Core hosting for central nodes. |
-| 16 | Commons | [Component-Commons.md](Component-Commons.md) | Shared data types, interfaces, domain entity POCOs, repository interfaces, and message contracts used across all components. |
-| 17 | Configuration Database | [Component-ConfigurationDatabase.md](Component-ConfigurationDatabase.md) | EF Core data access layer, schema ownership, per-component repositories, unit-of-work, audit logging (IAuditService), and migration management for the central MS SQL configuration database. |
+| 1 | Template Engine | [Component-TemplateEngine.md](Component-TemplateEngine.md) | Template modeling, inheritance, composition, path-qualified member addressing, override granularity, locking, alarms, flattening, semantic validation, revision hashing, and diff calculation. |
+| 2 | Deployment Manager | [Component-DeploymentManager.md](Component-DeploymentManager.md) | Central-side deployment pipeline with deployment ID/idempotency, per-instance operation lock, state transition matrix, all-or-nothing site apply, system-wide artifact deployment with per-site status. |
+| 3 | Site Runtime | [Component-SiteRuntime.md](Component-SiteRuntime.md) | Site-side actor hierarchy with explicit supervision strategies, staggered startup, script trust model (constrained APIs), Tell/Ask conventions, concurrency serialization, and site-wide Akka stream with per-subscriber backpressure. |
+| 4 | Data Connection Layer | [Component-DataConnectionLayer.md](Component-DataConnectionLayer.md) | Common data connection interface (OPC UA, custom), Become/Stash connection actor model, auto-reconnect, immediate bad quality on disconnect, transparent re-subscribe, synchronous write failures, tag path resolution retry. |
+| 5 | Central–Site Communication | [Component-Communication.md](Component-Communication.md) | Akka.NET remoting/cluster topology, 8 message patterns with per-pattern timeouts, application-level correlation IDs, transport heartbeat config, message ordering, connection failure behavior. |
+| 6 | Store-and-Forward Engine | [Component-StoreAndForward.md](Component-StoreAndForward.md) | Buffering (transient failures only), fixed-interval retry, parking, async best-effort replication, SQLite persistence at sites. |
+| 7 | External System Gateway | [Component-ExternalSystemGateway.md](Component-ExternalSystemGateway.md) | HTTP/REST + JSON, API key/Basic Auth, per-system timeout, dual call modes (Call/CachedCall), transient/permanent error classification, dedicated blocking I/O dispatcher, ADO.NET connection pooling. |
+| 8 | Notification Service | [Component-NotificationService.md](Component-NotificationService.md) | SMTP with OAuth2 (M365) or Basic Auth, BCC delivery, plain text, transient/permanent SMTP error classification, store-and-forward integration. |
+| 9 | Central UI | [Component-CentralUI.md](Component-CentralUI.md) | Blazor Server with SignalR real-time push, load balancer failover with JWT, all management workflows. |
+| 10 | Security & Auth | [Component-Security.md](Component-Security.md) | Direct LDAP bind (LDAPS/StartTLS), JWT sessions (HMAC-SHA256, 15-min refresh, 30-min idle), role-based authorization, site-scoped permissions. |
+| 11 | Health Monitoring | [Component-HealthMonitoring.md](Component-HealthMonitoring.md) | 30s report interval, 60s offline threshold, monotonic sequence numbers, raw error counts, tag resolution counts, dead letter monitoring. |
+| 12 | Site Event Logging | [Component-SiteEventLogging.md](Component-SiteEventLogging.md) | SQLite storage, 30-day retention + 1GB cap, daily purge, paginated remote queries with keyword search. |
+| 13 | Cluster Infrastructure | [Component-ClusterInfrastructure.md](Component-ClusterInfrastructure.md) | Akka.NET cluster, keep-oldest SBR with down-if-alone, min-nr-of-members=1, 2s/10s/15s failure detection, CoordinatedShutdown, automatic dual-node recovery. |
+| 14 | Inbound API | [Component-InboundAPI.md](Component-InboundAPI.md) | POST /api/{methodName}, X-API-Key header, flat JSON, extended type system (Object/List), script-based implementations, failures-only logging. |
+| 15 | Host | [Component-Host.md](Component-Host.md) | Single deployable binary, role-based component registration, per-component config binding (Options pattern), readiness gating, dead letter monitoring, Akka.NET bootstrap, ASP.NET Core hosting for central. |
+| 16 | Commons | [Component-Commons.md](Component-Commons.md) | Namespace/folder convention (Types/Interfaces/Entities/Messages), shared data types, POCOs, repository interfaces, message contracts with additive-only versioning, UTC timestamp convention. |
+| 17 | Configuration Database | [Component-ConfigurationDatabase.md](Component-ConfigurationDatabase.md) | EF Core data access, per-component repositories, unit-of-work, optimistic concurrency on deployment status, audit logging (IAuditService), migration management. |
+
+### Reference Documentation
+
+- [AkkaDotNet/](AkkaDotNet/) — Akka.NET reference notes covering actors, remoting, clustering, persistence, streams, serialization, hosting, testing, and best practices.
+- [docs/plans/](docs/plans/) — Design decision documents from refinement sessions.

 ### Architecture Diagram (Logical)

 ```
-┌─────────────────────────────────────────────────────┐
+               Users (Blazor Server)
+                         │
+                   Load Balancer
+                         │
+┌────────────────────────┼────────────────────────────┐
 │                   CENTRAL CLUSTER                   │
+│               (2-node active/standby)               │
+│                                                     │
 │  ┌──────────┐ ┌──────────┐ ┌──────────┐             │
 │  │ Template │ │Deployment│ │ Central  │             │
-│  │  Engine  │ │ Manager  │ │    UI    │             │
+│  │  Engine  │ │ Manager  │ │    UI    │ Blazor Svr  │
 │  └──────────┘ └──────────┘ └──────────┘             │
 │  ┌──────────┐ ┌──────────┐ ┌──────────┐             │
-│  │ Security │ │  Audit   │ │  Health  │             │
-│  │ & Auth   │ │ Logging  │ │ Monitor  │             │
+│  │ Security │ │  Config  │ │  Health  │             │
+│  │ & Auth   │ │    DB    │ │ Monitor  │             │
+│  │ (JWT/LDAP)│ │ (EF+IAud)│ │          │            │
 │  └──────────┘ └──────────┘ └──────────┘             │
 │  ┌──────────┐                                       │
-│  │ Inbound  │ ◄── External Systems (API key auth)   │
-│  │   API    │                                       │
+│  │ Inbound  │ ◄── External Systems (X-API-Key)      │
+│  │   API    │     POST /api/{method}, JSON          │
 │  └──────────┘                                       │
 │  ┌───────────────────────────────────┐              │
 │  │   Akka.NET Communication Layer    │              │
+│  │   (correlation IDs, per-pattern   │              │
+│  │    timeouts, message ordering)    │              │
 │  └──────────────┬────────────────────┘              │
-│  ┌───────────────────────────────────┐              │
+│  ┌──────────────┴────────────────────┐              │
 │  │   Configuration Database (EF)     │──► MS SQL    │
 │  └───────────────────────────────────┘   (Config DB)│
 │                 │                    Machine Data DB│
@@ -61,18 +96,27 @@ This document serves as the master index for the SCADA system design. The system
      ▼             ▼             ▼
 ┌─────────┐   ┌─────────┐   ┌─────────┐
 │ SITE A  │   │ SITE B  │   │ SITE N  │
+│ (2-node)│   │ (2-node)│   │ (2-node)│
 │ ┌─────┐ │   │ ┌─────┐ │   │ ┌─────┐ │
 │ │Data │ │   │ │Data │ │   │ │Data │ │
 │ │Conn │ │   │ │Conn │ │   │ │Conn │ │
+│ │Layer │ │   │ │Layer │ │   │ │Layer │ │
 │ ├─────┤ │   │ ├─────┤ │   │ ├─────┤ │
 │ │Site │ │   │ │Site │ │   │ │Site │ │
 │ │Runtm│ │   │ │Runtm│ │   │ │Runtm│ │
 │ ├─────┤ │   │ ├─────┤ │   │ ├─────┤ │
 │ │S&F  │ │   │ │S&F  │ │   │ │S&F  │ │
 │ │Engine│ │   │ │Engine│ │   │ │Engine│ │
+│ ├─────┤ │   │ ├─────┤ │   │ ├─────┤ │
+│ │ExtSys│ │   │ │ExtSys│ │   │ │ExtSys│ │
+│ │Gatwy │ │   │ │Gatwy │ │   │ │Gatwy │ │
 │ └─────┘ │   │ └─────┘ │   │ └─────┘ │
 │ SQLite  │   │ SQLite  │   │ SQLite  │
 └─────────┘   └─────────┘   └─────────┘
+     │             │             │
+ OPC UA /      OPC UA /      OPC UA /
+  Custom        Custom        Custom
+ Protocol      Protocol      Protocol
 ```

 ### Site Runtime Actor Hierarchy