Update CLAUDE.md and README.md with all design decisions from refinement

CLAUDE.md: reorganize Key Design Decisions into categorized sections covering
architecture, data, integrations, templates, S&F, security, cluster, UI,
monitoring, code organization, and Akka.NET conventions. Add docs/plans and
AkkaDotNet to project structure.

README.md: add technology stack table and scale summary. Update all 17
component descriptions to reflect refined designs. Update architecture
diagram with load balancer, 2-node annotations, protocol connections, and
component details. Add links to AkkaDotNet reference docs and design plans.
Joseph Doherty
2026-03-16 09:38:15 -04:00
parent f5b3b2b59e
commit 760eb38eac
2 changed files with 142 additions and 38 deletions

@@ -7,6 +7,8 @@ This project contains design documentation for a distributed SCADA system built
- `README.md` — Master index with component table and architecture diagrams.
- `HighLevelReqs.md` — Complete high-level requirements covering all functional areas.
- `Component-*.md` — Individual component design documents (one per component).
- `docs/plans/` — Design decision documents from refinement sessions.
- `AkkaDotNet/` — Akka.NET reference documentation and best practices notes.
There is no source code in this project — only design documentation in markdown.
@@ -54,29 +56,87 @@ There is no source code in this project — only design documentation in markdow
## Key Design Decisions (for context across sessions)
### Architecture & Runtime
- Instance modeled as Akka actor (Instance Actor) — single source of truth for runtime state.
- Site Runtime actor hierarchy: Deployment Manager singleton → Instance Actors → Script Actors + Alarm Actors.
- Script Actors spawn short-lived Script Execution Actors for concurrent execution.
- Script Actors spawn short-lived Script Execution Actors on a dedicated blocking I/O dispatcher.
- Alarm Actors are separate peer subsystem from scripts (not inside Script Engine).
- Shared scripts execute inline as compiled code (no separate actors).
- Site-wide Akka stream for attribute value and alarm state changes (debug view subscribes to this).
- Script Engine and Alarm Actors subscribe directly to Instance Actors (not via stream).
- Site-wide Akka stream for attribute value and alarm state changes with per-subscriber buffering.
- Instance Actors serialize all state mutations (Akka actor model); concurrent scripts produce interleaved side effects.
- Staggered Instance Actor startup on failover to prevent reconnection storms.
- Explicit supervision strategies: Resume for coordinator actors, Stop for short-lived execution actors.
### Data & Communication
- Data Connection Layer is a clean data pipe — publishes to Instance Actors only.
- Pre-deployment validation is comprehensive (flattening, script compilation, trigger refs, binding completeness).
- Data source references are relative paths; connection binding is per-attribute at instance level.
- System-wide artifact deployment requires explicit Deployment role action.
- Store-and-forward: fixed retry interval, no max buffer size.
- Instance lifecycle: enabled/disabled states, deletion supported.
- Template deletion blocked if instances or child templates reference it.
- DCL connection actor uses Become/Stash pattern for lifecycle state machine.
- DCL auto-reconnect at fixed interval; immediate bad quality on disconnect; transparent re-subscribe.
- DCL write failures returned synchronously to calling script.
- Tag path resolution retried periodically for devices still booting.
- Static attribute writes persisted to local SQLite (survive restart/failover, reset on redeployment).
- All timestamps are UTC throughout the system.
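As an illustration of the Become/Stash idea mentioned for the DCL connection actor, here is a minimal Python sketch. It is not Akka.NET code: the class name, message shapes, and the choice to stash subscribe requests while connecting are all assumptions made for the example.

```python
class ConnectionActor:
    """Toy state machine: swaps its message handler ("become") and
    holds messages it cannot serve yet ("stash")."""

    def __init__(self):
        self._behavior = self._connecting   # current handler
        self._stash = []                    # messages deferred while connecting
        self.subscribed = []                # tags subscribed so far

    def receive(self, msg):
        self._behavior(msg)

    def _connecting(self, msg):
        if msg[0] == "connected":
            # Become: switch handler, then replay everything that was stashed.
            self._behavior = self._connected
            stashed, self._stash = self._stash, []
            for deferred in stashed:
                self.receive(deferred)
        else:
            # Stash: cannot serve this message until the link is up.
            self._stash.append(msg)

    def _connected(self, msg):
        if msg[0] == "disconnected":
            self._behavior = self._connecting
        elif msg[0] == "subscribe":
            self.subscribed.append(msg[1])
```

A subscribe request sent before the link is up is held and replayed once a `("connected",)` message arrives. Re-subscribing existing tags after a reconnect is not modelled here.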
### External Integrations
- External System Gateway: HTTP/REST only, JSON serialization, API key + Basic Auth.
- Dual call modes: `ExternalSystem.Call()` (synchronous) and `ExternalSystem.CachedCall()` (store-and-forward on transient failure).
- Error classification: HTTP 5xx/408/429/connection errors = transient; other 4xx = permanent (returned to script).
- Notification Service: SMTP with OAuth2 Client Credentials (Microsoft 365) or Basic Auth. BCC delivery, plain text.
- Inbound API: `POST /api/{methodName}`, `X-API-Key` header, flat JSON, extended type system (Object, List).
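The error classification rule above can be sketched in a few lines of Python. The function and enum names are hypothetical, `None` is used here to stand for a connection error, and treating non-4xx/5xx statuses as permanent is an assumption the notes do not state.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"   # candidate for store-and-forward retry
    PERMANENT = "permanent"   # returned to the calling script


# 4xx codes the notes list as transient, in addition to all 5xx.
TRANSIENT_4XX = {408, 429}


def classify_failure(status: Optional[int]) -> FailureKind:
    """Classify a failed call; status=None represents a connection error."""
    if status is None:
        return FailureKind.TRANSIENT
    if 500 <= status <= 599 or status in TRANSIENT_4XX:
        return FailureKind.TRANSIENT
    # Assumption: anything else (including other 4xx) is permanent.
    return FailureKind.PERMANENT
```

Under this reading, only `CachedCall()` would buffer a `TRANSIENT` result, while `Call()` surfaces either kind directly to the script.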
### Templates & Deployment
- Pre-deployment validation includes semantic checks (call targets, argument types, trigger operand types).
- Composed member addressing uses path-qualified canonical names: `[ModuleInstanceName].[MemberName]`.
- Override granularity defined per entity type and per field.
- Template graph acyclicity enforced on save.
- Flattened configs include a revision hash for staleness detection.
- Deployment identity: unique deployment ID + revision hash for idempotency.
- Per-instance operation lock covers all mutating commands (deploy, disable, enable, delete).
- Site-side apply is all-or-nothing per instance.
- System-wide artifact version skew across sites is supported.
- Last-write-wins for concurrent template editing (no optimistic concurrency on templates).
- Optimistic concurrency on deployment status records.
- Naming collisions in composed feature modules are design-time errors.
- Last-write-wins for concurrent template editing.
- Scripts compiled at site; pre-validated (test compiled) at central.
- Max recursion depth for script-to-script calls.
- Alarm on-trigger scripts can call instance scripts; instance scripts cannot call alarm scripts.
- Audit logging absorbed into Configuration Database component (IAuditService).
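One way to picture the revision hash and the deployment ID + revision hash idempotency key is the Python sketch below. The hashing scheme (SHA-256 over canonical JSON), the class name, and the return values are assumptions for illustration; the notes do not specify them.

```python
import hashlib
import json


def revision_hash(flattened_config: dict) -> str:
    """Hash a flattened config in a key-order-independent way."""
    canonical = json.dumps(flattened_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InstanceSlot:
    """Hypothetical site-side record of what was last applied to one instance."""

    def __init__(self):
        self.applied = None  # (deployment_id, revision_hash) or None

    def apply(self, deployment_id: str, config: dict) -> str:
        key = (deployment_id, revision_hash(config))
        if key == self.applied:
            return "already-applied"   # duplicate delivery is a no-op
        self.applied = key
        return "applied"

    def is_stale(self, current_config: dict) -> bool:
        """True when the deployed revision differs from the current flattened config."""
        return self.applied is None or self.applied[1] != revision_hash(current_config)
```

Re-sending the same deployment leaves the slot unchanged, and comparing hashes is enough to flag an instance whose template has changed since it was deployed.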
### Store-and-Forward
- Fixed retry interval, no max buffer size. Only transient failures buffered.
- Async best-effort replication to standby (no ack wait).
- Messages not cleared on instance deletion.
- CachedCall idempotency is the caller's responsibility.
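The buffering rules above (fixed retry interval, unbounded buffer, only transient failures kept) might look roughly like this in Python. The `send` callback and its three string outcomes, the 5-second default, and dropping a message that fails permanently on retry are all assumptions for the sketch.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Tuple


@dataclass
class StoreAndForwardBuffer:
    # Hypothetical sender: returns "ok", "transient", or "permanent".
    send: Callable[[str], str]
    retry_interval_s: float = 5.0  # assumed value; the notes only say "fixed"
    pending: Deque[Tuple[float, str]] = field(default_factory=deque)  # (due, message)

    def submit(self, message: str, now: float) -> str:
        """Try once; buffer the message only if the failure is transient."""
        outcome = self.send(message)
        if outcome == "transient":
            self.pending.append((now + self.retry_interval_s, message))
        return outcome

    def retry_due(self, now: float) -> int:
        """Retry every message whose interval has elapsed; return the number delivered."""
        delivered = 0
        for _ in range(len(self.pending)):
            due, message = self.pending.popleft()
            if due > now:
                self.pending.append((due, message))  # not due yet, keep order
                continue
            outcome = self.send(message)
            if outcome == "transient":
                self.pending.append((now + self.retry_interval_s, message))
            elif outcome == "ok":
                delivered += 1
            # Assumption: a permanent failure on retry leaves the buffer.
        return delivered
```

Because a buffered message may be sent more than once, the receiving system sees duplicates unless the caller makes the call idempotent, which matches the last bullet above.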
### Security & Auth
- Authentication: direct LDAP bind (username/password), no Kerberos/NTLM. LDAPS/StartTLS required.
- JWT sessions: HMAC-SHA256 shared symmetric key, 15-minute expiry with sliding refresh, 30-minute idle timeout.
- LDAP failure: new logins fail; active sessions continue with current roles.
- Load balancer in front of central UI; JWT + shared Data Protection keys for failover transparency.
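To make the JWT bullet concrete, here is a standard-library Python sketch of HMAC-SHA256 signing and verification with a 15-minute expiry. The function names and the claim set (`sub`, `roles`, `iat`, `exp`) are assumptions, and sliding refresh and the 30-minute idle timeout are deliberately not modelled.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(key: bytes, signing_input: str) -> str:
    return _b64url(hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest())


def issue_token(key: bytes, subject: str, roles, now: int, ttl_s: int = 15 * 60) -> str:
    """Build a compact HS256 token that expires ttl_s seconds after now."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    claims = {"sub": subject, "roles": list(roles), "iat": now, "exp": now + ttl_s}
    payload = _b64url(json.dumps(claims).encode("utf-8"))
    return f"{header}.{payload}.{_sign(key, header + '.' + payload)}"


def verify_token(key: bytes, token: str, now: int) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired."""
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return None  # not three dot-separated parts
    expected = _sign(key, header + "." + payload)
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        return None
    claims = json.loads(_b64url_decode(payload))
    return claims if now < claims["exp"] else None
```

Because the key is symmetric and shared, any central node holding it can verify a token issued by the other, which is consistent with the failover transparency described in the last bullet.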
### Cluster & Failover
- Keep-oldest split-brain resolver with `down-if-alone = on`, 15s stable-after.
- Both nodes are seed nodes. `min-nr-of-members = 1`.
- Failure detection: 2s heartbeat, 10s threshold. Total failover ~25s.
- CoordinatedShutdown for graceful singleton handover.
- Automatic dual-node recovery from persistent storage.
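The "~25s" total is not derived in the notes. One plausible reading, sketched below, is that it is the 10s failure-detection threshold plus the 15s split-brain resolver stable-after period; the class and field names are hypothetical, and singleton handover time is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverBudget:
    heartbeat_interval_s: float = 2.0   # from the notes
    failure_threshold_s: float = 10.0   # from the notes
    stable_after_s: float = 15.0        # from the notes

    def estimated_failover_s(self) -> float:
        # Assumption: the resolver acts only after the node has been marked
        # unreachable and the membership has then been stable for stable_after_s.
        return self.failure_threshold_s + self.stable_after_s


budget = FailoverBudget()
print(budget.estimated_failover_s())  # 25.0
```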
### UI & Monitoring
- Central UI: Blazor Server (ASP.NET Core + SignalR). Real-time push for debug view, health dashboard, deployment status.
- Health reports: 30s interval, 60s offline threshold, monotonic sequence numbers, raw error counts per interval.
- Dead letter monitoring as a health metric.
- Site Event Logging: 30-day retention, 1GB storage cap, daily purge, paginated queries with keyword search.
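The health-report parameters above suggest a small amount of bookkeeping on the central side. The Python sketch below assumes that sequence numbers are used to discard stale or duplicate reports and that a site counts as offline once no report has been accepted for more than 60 seconds; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

REPORT_INTERVAL_S = 30.0     # from the notes
OFFLINE_THRESHOLD_S = 60.0   # from the notes


@dataclass
class SiteHealth:
    last_seq: int = -1
    last_seen: Optional[float] = None  # receipt time of the last accepted report

    def accept(self, seq: int, received_at: float) -> bool:
        """Accept a report only if its sequence number moves forward."""
        if seq <= self.last_seq:
            return False  # stale or duplicate report
        self.last_seq = seq
        self.last_seen = received_at
        return True

    def is_offline(self, now: float) -> bool:
        """Offline if never seen, or silent for longer than the threshold."""
        return self.last_seen is None or now - self.last_seen > OFFLINE_THRESHOLD_S
```

With a 30s interval and a 60s threshold, a site has to miss two consecutive reports before it is shown as offline. Handling a sequence reset after a site restart is left out of this sketch.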
### Code Organization
- Entity classes are persistence-ignorant POCOs in Commons; EF mappings in Configuration Database.
- Repository interfaces defined in Commons; implementations in Configuration Database.
- Commons namespace hierarchy: Types/, Interfaces/, Entities/, Messages/ with domain area subfolders.
- Message contracts follow additive-only evolution rules for version compatibility.
- Per-component configuration via `appsettings.json` sections bound to options classes (Options pattern).
- Options classes owned by component projects, not Commons.
- Host readiness gating: `/health/ready` endpoint, no traffic until operational.
- EF Core migrations: auto-apply in dev, manual SQL scripts for production.
- Audit logging absorbed into Configuration Database component (IAuditService).
### Akka.NET Conventions
- Tell for hot-path internal communication; Ask reserved for system boundaries.
- Script trust model: forbidden APIs (System.IO, Process, Threading, Reflection, raw network).
- Application-level correlation IDs on all request/response messages.
## Tool Usage