scadalink-design/CLAUDE.md
Joseph Doherty 760eb38eac Update CLAUDE.md and README.md with all design decisions from refinement
CLAUDE.md: reorganize Key Design Decisions into categorized sections covering
architecture, data, integrations, templates, S&F, security, cluster, UI,
monitoring, code organization, and Akka.NET conventions. Add docs/plans and
AkkaDotNet to project structure.

README.md: add technology stack table and scale summary. Update all 17
component descriptions to reflect refined designs. Update architecture
diagram with load balancer, 2-node annotations, protocol connections, and
component details. Add links to AkkaDotNet reference docs and design plans.
2026-03-16 09:38:15 -04:00

ScadaLink Design Documentation Project

This project contains design documentation for a distributed SCADA system built on Akka.NET. The documents describe a hub-and-spoke architecture with a central cluster and multiple site clusters.

Project Structure

  • README.md — Master index with component table and architecture diagrams.
  • HighLevelReqs.md — Complete high-level requirements covering all functional areas.
  • Component-*.md — Individual component design documents (one per component).
  • docs/plans/ — Design decision documents from refinement sessions.
  • AkkaDotNet/ — Akka.NET reference documentation and best practices notes.

There is no source code in this project — only design documentation in markdown.

Document Conventions

  • All documents are markdown files in the project root directory.
  • Component documents are named Component-<Name>.md (PascalCase, hyphen-separated).
  • Each component document follows a consistent structure: Purpose, Location, Responsibilities, detailed design sections, Dependencies, and Interactions.
  • The README.md component table must stay in sync with actual component documents. When a component is added, removed, or renamed, update the table.
  • Cross-component references in Dependencies and Interactions sections must be kept accurate across all documents. When a component's role changes, update references in all affected documents.
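
The sync rule for the README component table can be checked mechanically. The following is a minimal sketch, assuming the README table mentions each component document by file name; the function name `find_table_drift` and the sample file names are hypothetical.

```python
import re

# Matches file names that follow the Component-<Name>.md convention.
COMPONENT_DOC = re.compile(r"Component-[A-Za-z0-9-]+\.md")

def find_table_drift(readme_text: str, filenames: list[str]) -> tuple[set[str], set[str]]:
    """Return (missing_from_readme, missing_on_disk) for component documents."""
    referenced = set(COMPONENT_DOC.findall(readme_text))
    on_disk = {name for name in filenames if COMPONENT_DOC.fullmatch(name)}
    return on_disk - referenced, referenced - on_disk

# Hypothetical README excerpt and directory listing.
readme = "| Template Engine | Component-TemplateEngine.md |\n| Host | Component-Host.md |"
files = ["README.md", "Component-TemplateEngine.md", "Component-SiteRuntime.md"]
missing_from_readme, missing_on_disk = find_table_drift(readme, files)
```

A non-empty result in either set suggests the table and the documents have drifted apart.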

Refinement Process

  • When refining requirements, use the Socratic method — ask clarifying questions before making changes.
  • Probe for implications of decisions across components before updating documents.
  • When a decision is made, identify all affected documents and update them together for consistency.
  • After updates, verify no stale cross-references remain (e.g., references to removed or renamed components).

Editing Rules

  • Edit documents in place. Do not create copies or backup files.
  • When a change affects multiple documents, update all affected documents in the same session.
  • Use git diff to review changes before committing.
  • Commit related changes together with a descriptive message summarizing the design decision.

Current Component List (17 components)

  1. Template Engine — Template modeling, inheritance, composition, validation, flattening, diffs.
  2. Deployment Manager — Central-side deployment pipeline, system-wide artifact deployment, instance lifecycle.
  3. Site Runtime — Site-side actor hierarchy (Deployment Manager singleton, Instance/Script/Alarm Actors), script compilation, Akka stream.
  4. Data Connection Layer — Protocol abstraction (OPC UA, custom), subscription management, clean data pipe.
  5. Central-Site Communication — Akka.NET remoting, message patterns, debug streaming.
  6. Store-and-Forward Engine — Buffering, fixed-interval retry, parking, SQLite persistence, replication.
  7. External System Gateway — External system definitions, API method invocation, database connections.
  8. Notification Service — Notification lists, email delivery, store-and-forward integration.
  9. Central UI — Web-based management interface, all workflows.
  10. Security & Auth — LDAP/AD authentication, role-based authorization, site-scoped permissions.
  11. Health Monitoring — Site health metrics collection and central reporting.
  12. Site Event Logging — Local operational event logs at sites with central query access.
  13. Cluster Infrastructure — Akka.NET cluster setup, active/standby failover, singleton support.
  14. Inbound API — Web API for external systems, API key auth, script-based implementations.
  15. Host — Single deployable binary, role-based component registration, Akka.NET bootstrap.
  16. Commons — Shared types, POCO entity classes, repository interfaces, message contracts.
  17. Configuration Database — EF Core data access layer, repositories, unit-of-work, audit logging (IAuditService), migrations.

Key Design Decisions (for context across sessions)

Architecture & Runtime

  • Instance modeled as Akka actor (Instance Actor) — single source of truth for runtime state.
  • Site Runtime actor hierarchy: Deployment Manager singleton → Instance Actors → Script Actors + Alarm Actors.
  • Script Actors spawn short-lived Script Execution Actors on a dedicated blocking I/O dispatcher.
  • Alarm Actors are a separate peer subsystem from scripts (not inside Script Engine).
  • Shared scripts execute inline as compiled code (no separate actors).
  • Site-wide Akka stream for attribute value and alarm state changes with per-subscriber buffering.
  • Instance Actors serialize all state mutations (Akka actor model); concurrent scripts produce interleaved side effects.
  • Staggered Instance Actor startup on failover to prevent reconnection storms.
  • Explicit supervision strategies: Resume for coordinator actors, Stop for short-lived execution actors.
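
The staggered startup decision can be illustrated with a small scheduling sketch. It is language-neutral Python rather than Akka.NET code, and the batch size and interval are assumed values that the design notes do not specify.

```python
def staggered_start_delays(instance_ids, batch_size=20, batch_interval_s=1.0):
    """Map each instance id to a start delay in seconds.

    Instances start in batches so that connections are re-established
    gradually after failover instead of all at once.
    batch_size and batch_interval_s are illustrative assumptions.
    """
    return {
        instance_id: (index // batch_size) * batch_interval_s
        for index, instance_id in enumerate(instance_ids)
    }

# Five hypothetical instances, two per batch, half a second between batches.
delays = staggered_start_delays(["a", "b", "c", "d", "e"], batch_size=2, batch_interval_s=0.5)
```

In an actor system the delay would typically be applied by scheduling the start message rather than by sleeping.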

Data & Communication

  • Data Connection Layer is a clean data pipe — publishes to Instance Actors only.
  • DCL connection actor uses Become/Stash pattern for lifecycle state machine.
  • DCL auto-reconnect at fixed interval; immediate bad quality on disconnect; transparent re-subscribe.
  • DCL write failures returned synchronously to calling script.
  • Tag path resolution retried periodically for devices still booting.
  • Static attribute writes persisted to local SQLite (survive restart/failover, reset on redeployment).
  • All timestamps are UTC throughout the system.
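
The Become/Stash lifecycle mentioned above can be sketched as a toy state machine. This is plain Python, not the Akka.NET API, and the choice to stash subscription requests while disconnected is an assumption made for illustration.

```python
from collections import deque

class ConnectionActor:
    """Toy connection actor whose behavior swaps between two states."""

    def __init__(self):
        self._stash = deque()              # messages deferred until connected
        self._behavior = self._disconnected
        self.subscribed = []               # stands in for device subscriptions

    def receive(self, message):
        self._behavior(message)

    def _disconnected(self, message):
        if message == "Connected":
            self._behavior = self._connected       # "become" connected
            pending, self._stash = self._stash, deque()
            for deferred in pending:               # "unstash all", in arrival order
                self.receive(deferred)
        elif message == "Disconnected":
            return                                 # already disconnected, ignore
        else:
            self._stash.append(message)            # "stash" for later

    def _connected(self, message):
        if message == "Disconnected":
            self._behavior = self._disconnected    # "become" disconnected
        else:
            _, tag = message                       # ("Subscribe", tag), assumed shape
            self.subscribed.append(tag)

actor = ConnectionActor()
actor.receive(("Subscribe", "Pump1.Speed"))   # stashed, not yet connected
actor.receive("Connected")                    # replays the stashed request
```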

External Integrations

  • External System Gateway: HTTP/REST only, JSON serialization, API key + Basic Auth.
  • Dual call modes: ExternalSystem.Call() (synchronous) and ExternalSystem.CachedCall() (store-and-forward on transient failure).
  • Error classification: HTTP 5xx/408/429/connection errors = transient; other 4xx = permanent (returned to script).
  • Notification Service: SMTP with OAuth2 Client Credentials (Microsoft 365) or Basic Auth. BCC delivery, plain text.
  • Inbound API: POST /api/{methodName}, X-API-Key header, flat JSON, extended type system (Object, List).
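
The error classification rule above can be written as a small function. This is a sketch: the function name and the use of `None` for a connection error are assumptions, and the source does not say how responses below 400 are labeled.

```python
from typing import Optional

def classify_failure(status_code: Optional[int]) -> str:
    """Classify an HTTP call outcome as 'transient', 'permanent', or 'success'.

    status_code is None when no response arrived (connection error).
    """
    if status_code is None:
        return "transient"                      # connection error
    if status_code >= 500 or status_code in (408, 429):
        return "transient"                      # candidate for store-and-forward
    if 400 <= status_code < 500:
        return "permanent"                      # returned to the calling script
    return "success"                            # assumption: anything below 400
```

Under this reading, only results classified as transient would be buffered by `ExternalSystem.CachedCall()`.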

Templates & Deployment

  • Pre-deployment validation includes semantic checks (call targets, argument types, trigger operand types).
  • Composed member addressing uses path-qualified canonical names: [ModuleInstanceName].[MemberName].
  • Override granularity defined per entity type and per field.
  • Template graph acyclicity enforced on save.
  • Flattened configs include a revision hash for staleness detection.
  • Deployment identity: unique deployment ID + revision hash for idempotency.
  • Per-instance operation lock covers all mutating commands (deploy, disable, enable, delete).
  • Site-side apply is all-or-nothing per instance.
  • System-wide artifact version skew across sites is supported.
  • Last-write-wins for concurrent template editing (no optimistic concurrency on templates).
  • Optimistic concurrency on deployment status records.
  • Naming collisions in composed feature modules are design-time errors.
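
The revision hash and deployment identity decisions can be sketched together. The hash algorithm is not named in these notes, so SHA-256 over canonical JSON is an assumption, as are the class and method names.

```python
import hashlib
import json

def revision_hash(flattened_config: dict) -> str:
    """Hash a flattened config in a key-order-independent way (assumed scheme)."""
    canonical = json.dumps(flattened_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class SiteInstanceState:
    """Tracks what a site has applied for one instance (hypothetical helper)."""

    def __init__(self):
        self.applied_ids = set()     # deployment ids already applied
        self.current_hash = None     # revision hash of the running config

    def apply(self, deployment_id: str, config: dict) -> str:
        if deployment_id in self.applied_ids:
            return "duplicate"       # idempotent: a re-sent deployment is a no-op
        self.applied_ids.add(deployment_id)
        self.current_hash = revision_hash(config)
        return "applied"

    def is_stale(self, central_config: dict) -> bool:
        """True when the central flattened config differs from what is running."""
        return self.current_hash != revision_hash(central_config)

state = SiteInstanceState()
config = {"name": "Pump1", "attributes": {"Speed": 0}}
first = state.apply("dep-1", config)
second = state.apply("dep-1", config)
```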

Store-and-Forward

  • Fixed retry interval, no max buffer size. Only transient failures buffered.
  • Async best-effort replication to standby (no ack wait).
  • Messages not cleared on instance deletion.
  • CachedCall idempotency is the caller's responsibility.
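
A fixed-interval retry loop with parking might look like the sketch below. The condition for parking is not stated in these notes, so the `max_retries` limit is an assumption, and persistence and replication are left out.

```python
from dataclasses import dataclass, field

@dataclass
class BufferedMessage:
    payload: str
    attempts: int = 0

@dataclass
class StoreAndForwardBuffer:
    max_retries: int = 3                          # assumed parking limit
    pending: list = field(default_factory=list)   # unbounded, per the design
    parked: list = field(default_factory=list)

    def enqueue(self, payload: str) -> None:
        self.pending.append(BufferedMessage(payload))

    def retry_tick(self, send) -> None:
        """Run once per fixed retry interval.

        send(payload) returns 'ok', 'transient', or 'permanent'.
        """
        still_pending = []
        for message in self.pending:
            outcome = send(message.payload)
            if outcome == "ok":
                continue                          # delivered, drop from buffer
            message.attempts += 1
            if outcome == "permanent" or message.attempts >= self.max_retries:
                self.parked.append(message)       # stop retrying, keep for review
            else:
                still_pending.append(message)     # retry on the next tick
        self.pending = still_pending

buffer = StoreAndForwardBuffer(max_retries=3)
buffer.enqueue("notify-operators")
buffer.retry_tick(lambda payload: "transient")
buffer.retry_tick(lambda payload: "transient")
```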

Security & Auth

  • Authentication: direct LDAP bind (username/password), no Kerberos/NTLM. LDAPS/StartTLS required.
  • JWT sessions: HMAC-SHA256 shared symmetric key, 15-minute expiry with sliding refresh, 30-minute idle timeout.
  • LDAP failure: new logins fail; active sessions continue with current roles.
  • Load balancer in front of central UI; JWT + shared Data Protection keys for failover transparency.
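
The interaction between the 15-minute token expiry and the 30-minute idle timeout can be sketched as one decision function. This is one possible reading of sliding refresh, where an expired token is reissued as long as the user has been active within the idle window; the function name and return labels are hypothetical.

```python
TOKEN_LIFETIME_S = 15 * 60    # JWT expiry from the notes above
IDLE_TIMEOUT_S = 30 * 60      # idle timeout from the notes above

def evaluate_session(now: float, token_expires_at: float, last_activity_at: float) -> str:
    """Return 'idle_timeout', 'refresh', or 'valid' for a request at time `now`.

    All arguments are seconds on the same clock (for example, UTC epoch seconds).
    """
    if now - last_activity_at >= IDLE_TIMEOUT_S:
        return "idle_timeout"     # user must log in again
    if now >= token_expires_at:
        return "refresh"          # issue a new token, sliding the expiry forward
    return "valid"
```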

Cluster & Failover

  • Keep-oldest split-brain resolver with down-if-alone = on, 15s stable-after.
  • Both nodes are seed nodes. min-nr-of-members = 1.
  • Failure detection: 2s heartbeat, 10s threshold. Total failover ~25s.
  • CoordinatedShutdown for graceful singleton handover.
  • Automatic dual-node recovery from persistent storage.
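
The timing figures above can be combined into a rough failover budget. Reading the ~25s total as the detection threshold plus the stable-after delay is an assumption; it matches the stated numbers but ignores singleton restart time.

```python
HEARTBEAT_INTERVAL_S = 2      # from the notes above
FAILURE_THRESHOLD_S = 10      # from the notes above
STABLE_AFTER_S = 15           # from the notes above

def estimated_failover_s(threshold_s: int = FAILURE_THRESHOLD_S,
                         stable_after_s: int = STABLE_AFTER_S) -> int:
    """Detection time plus split-brain resolver wait (assumed decomposition)."""
    return threshold_s + stable_after_s

# Number of heartbeats that go unanswered before the node is considered unreachable.
missed_heartbeats = FAILURE_THRESHOLD_S // HEARTBEAT_INTERVAL_S
total_s = estimated_failover_s()
```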

UI & Monitoring

  • Central UI: Blazor Server (ASP.NET Core + SignalR). Real-time push for debug view, health dashboard, deployment status.
  • Health reports: 30s interval, 60s offline threshold, monotonic sequence numbers, raw error counts per interval.
  • Dead letter monitoring as a health metric.
  • Site Event Logging: 30-day retention, 1GB storage cap, daily purge, paginated queries with keyword search.
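
The health report rules can be sketched as a small central-side tracker. Using the sequence number to discard stale or duplicate reports is an assumed purpose, and the class name is hypothetical.

```python
class SiteHealthTracker:
    """Tracks health reports from one site (illustrative sketch)."""

    OFFLINE_THRESHOLD_S = 60   # from the notes above; reports arrive every 30s

    def __init__(self):
        self.last_sequence = None
        self.last_seen_at = None

    def accept(self, sequence: int, received_at: float) -> bool:
        """Record a report; return False if it is out of order or a duplicate."""
        if self.last_sequence is not None and sequence <= self.last_sequence:
            return False
        self.last_sequence = sequence
        self.last_seen_at = received_at
        return True

    def is_offline(self, now: float) -> bool:
        """A site is offline if no report was accepted within the threshold."""
        if self.last_seen_at is None:
            return True
        return now - self.last_seen_at > self.OFFLINE_THRESHOLD_S

tracker = SiteHealthTracker()
accepted_first = tracker.accept(sequence=1, received_at=0.0)
accepted_second = tracker.accept(sequence=2, received_at=30.0)
accepted_stale = tracker.accept(sequence=1, received_at=31.0)
```

With a 30s interval and a 60s threshold, a site can miss one report without being marked offline.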

Code Organization

  • Entity classes are persistence-ignorant POCOs in Commons; EF mappings in Configuration Database.
  • Repository interfaces defined in Commons; implementations in Configuration Database.
  • Commons namespace hierarchy: Types/, Interfaces/, Entities/, Messages/ with domain area subfolders.
  • Message contracts follow additive-only evolution rules for version compatibility.
  • Per-component configuration via appsettings.json sections bound to options classes (Options pattern).
  • Options classes owned by component projects, not Commons.
  • Host readiness gating: /health/ready endpoint, no traffic until operational.
  • EF Core migrations: auto-apply in dev, manual SQL scripts for production.
  • Audit logging absorbed into Configuration Database component (IAuditService).
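
Additive-only message evolution can be illustrated with a contract that gains an optional field. The message name and its fields are hypothetical, and the sketch uses Python dataclasses in place of the project's .NET message types.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass(frozen=True)
class DeployInstance:
    deployment_id: str
    revision_hash: str
    requested_by: Optional[str] = None   # added later: optional, with a default

def from_payload(message_type, payload: dict):
    """Build a message, ignoring fields this version does not know about."""
    known = {f.name for f in fields(message_type)}
    return message_type(**{k: v for k, v in payload.items() if k in known})

# A payload from an older sender lacks the new field.
old = from_payload(DeployInstance, {"deployment_id": "dep-1", "revision_hash": "abc"})
# A payload from a newer sender carries a field this version has never seen.
new = from_payload(DeployInstance, {"deployment_id": "dep-2", "revision_hash": "def",
                                    "requested_by": "jdoe", "priority": 5})
```

Because fields are only ever added with defaults and unknown fields are ignored, nodes on different versions can still exchange messages.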

Akka.NET Conventions

  • Tell for hot-path internal communication; Ask reserved for system boundaries.
  • Script trust model: forbidden APIs (System.IO, Process, Threading, Reflection, raw network).
  • Application-level correlation IDs on all request/response messages.
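
With Tell-based messaging there is no built-in link between a request and its reply, which is where application-level correlation IDs come in. The following is a minimal sketch of the bookkeeping; the class name and the use of UUIDs are assumptions.

```python
import uuid
from typing import Optional

class PendingRequests:
    """Matches asynchronous responses to the requests that caused them."""

    def __init__(self):
        self._pending = {}

    def register(self, description: str) -> str:
        """Store a request and return the correlation id to send with it."""
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = description
        return correlation_id

    def complete(self, correlation_id: str) -> Optional[str]:
        """Return the matching request, or None for an unknown or repeated id."""
        return self._pending.pop(correlation_id, None)

requests = PendingRequests()
cid = requests.register("deploy Pump1")
matched = requests.complete(cid)      # the response carried the same id
repeated = requests.complete(cid)     # a duplicate response finds nothing
```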

Tool Usage

  • When consulting with the Codex MCP tool, use model gpt-5.4.