Files
scadalink-design/CLAUDE.md
2026-05-19 12:03:17 -04:00

185 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ScadaLink Design Documentation Project
This project contains design documentation for a distributed SCADA system built on Akka.NET. The documents describe a hub-and-spoke architecture with a central cluster and multiple site clusters.
## Project Structure
- `README.md` — Master index with component table and architecture diagrams.
- `docs/requirements/HighLevelReqs.md` — Complete high-level requirements covering all functional areas.
- `docs/requirements/Component-*.md` — Individual component design documents (one per component).
- `docs/test_infra/test_infra.md` — Master test infrastructure doc (OPC UA, LDAP, MS SQL, SMTP, REST API, Traefik).
- `docs/plans/` — Design decision documents from refinement sessions.
- `AkkaDotNet/` — Akka.NET reference documentation and best practices notes.
- `infra/` — Docker Compose and config files for local test services.
- `docker/` — Docker infrastructure for the 8-node cluster topology (2 central + 3 sites). See [`docker/README.md`](docker/README.md) for cluster setup, port allocation, and management commands.
## Document Conventions
- Requirements documents (high-level and component-level) live in `docs/requirements/`.
- Test infrastructure documentation lives in `docs/test_infra/`.
- Component documents are named `Component-<Name>.md` (PascalCase, hyphen-separated).
- Each component document follows a consistent structure: Purpose, Location, Responsibilities, detailed design sections, Dependencies, and Interactions.
- The README.md component table must stay in sync with actual component documents. When a component is added, removed, or renamed, update the table.
- Cross-component references in Dependencies and Interactions sections must be kept accurate across all documents. When a component's role changes, update references in all affected documents.
## Refinement Process
- When refining requirements, use the Socratic method — ask clarifying questions before making changes.
- Probe for implications of decisions across components before updating documents.
- When a decision is made, identify all affected documents and update them together for consistency.
- After updates, verify no stale cross-references remain (e.g., references to removed or renamed components).
## Editing Rules
- Edit documents in place. Do not create copies or backup files.
- When a change affects multiple documents, update all affected documents in the same session.
- Use `git diff` to review changes before committing.
- Commit related changes together with a descriptive message summarizing the design decision.
## Current Component List (22 components)
1. Template Engine — Template modeling, inheritance, composition, validation, flattening, diffs.
2. Deployment Manager — Central-side deployment pipeline, system-wide artifact deployment, instance lifecycle.
3. Site Runtime — Site-side actor hierarchy (Deployment Manager singleton, Instance/Script/Alarm Actors), script compilation, Akka stream.
4. Data Connection Layer — Protocol abstraction (OPC UA, custom), subscription management, clean data pipe.
5. CentralSite Communication — Akka.NET ClusterClient (command/control) + gRPC server-streaming (real-time data), message patterns, debug streaming.
6. Store-and-Forward Engine — Buffering, fixed-interval retry, parking, SQLite persistence, replication.
7. External System Gateway — External system definitions, API method invocation, database connections.
8. Notification Service — Central-only notification-list and SMTP definitions, per-type delivery adapters (sites no longer deliver notifications).
9. Central UI — Web-based management interface, all workflows.
10. Security & Auth — LDAP/AD authentication, role-based authorization, site-scoped permissions.
11. Health Monitoring — Site health metrics collection and central reporting.
12. Site Event Logging — Local operational event logs at sites with central query access.
13. Cluster Infrastructure — Akka.NET cluster setup, active/standby failover, singleton support.
14. Inbound API — Web API for external systems, API key auth, script-based implementations.
15. Host — Single deployable binary, role-based component registration, Akka.NET bootstrap.
16. Commons — Shared types, POCO entity classes, repository interfaces, message contracts.
17. Configuration Database — EF Core data access layer, repositories, unit-of-work, audit logging (IAuditService), migrations.
18. Management Service — Akka.NET actor providing programmatic access to all admin operations, ClusterClientReceptionist registration.
19. CLI — Command-line tool using HTTP Management API, System.CommandLine, JSON/table output.
20. Traefik Proxy — Reverse proxy/load balancer fronting central cluster, active node routing via `/health/active`, automatic failover.
21. Notification Outbox — Central component ingesting store-and-forwarded notifications, `Notifications` audit table, dispatcher loop, retry/parking, delivery KPIs.
22. Site Call Audit — Central component auditing site cached calls (`CachedCall`/`CachedWrite`); `SiteCalls` audit table, telemetry ingest, reconciliation, KPIs, central→site Retry/Discard relay; sites remain the source of truth.
## Key Design Decisions (for context across sessions)
### Architecture & Runtime
- Instance modeled as Akka actor (Instance Actor) — single source of truth for runtime state.
- Site Runtime actor hierarchy: Deployment Manager singleton → Instance Actors → Script Actors + Alarm Actors.
- Script Actors spawn short-lived Script Execution Actors on a dedicated blocking I/O dispatcher.
- Alarm Actors are separate peer subsystem from scripts (not inside Script Engine).
- Shared scripts execute inline as compiled code (no separate actors).
- Site-wide Akka stream for attribute value and alarm state changes with per-subscriber buffering.
- Instance Actors serialize all state mutations (Akka actor model); concurrent scripts produce interleaved side effects.
- Staggered Instance Actor startup on failover to prevent reconnection storms.
- Explicit supervision strategies: Resume for coordinator actors, Stop for short-lived execution actors.
### Data & Communication
- Data Connection Layer is a clean data pipe — publishes to Instance Actors only.
- DCL connection actor uses Become/Stash pattern for lifecycle state machine.
- DCL auto-reconnect at fixed interval; immediate bad quality on disconnect; transparent re-subscribe.
- DCL write failures returned synchronously to calling script.
- Tag path resolution retried periodically for devices still booting.
- Static attribute writes persisted to local SQLite (survive restart/failover, reset on redeployment).
- All timestamps are UTC throughout the system.
- Inter-cluster communication uses two transports: ClusterClient for command/control (deployments, lifecycle, subscribe/unsubscribe handshake, snapshots) and gRPC server-streaming for real-time data (attribute values, alarm states). Both CentralCommunicationActor and SiteCommunicationActor registered with receptionist. Central creates one ClusterClient per site using NodeA/NodeB as contact points. Sites configure multiple central contact points for failover. Addresses cached in CentralCommunicationActor, refreshed periodically (60s) and on admin changes. Heartbeats serve health monitoring only.
- gRPC streaming channel: SiteStreamGrpcServer on each site node (Kestrel HTTP/2, port 8083); central creates per-site SiteStreamGrpcClient via SiteStreamGrpcClientFactory. Site entity has GrpcNodeAAddress/GrpcNodeBAddress fields. Proto: sitestream.proto with SiteStreamService, SiteStreamEvent (oneof: AttributeValueUpdate, AlarmStateUpdate). DebugStreamEvent message removed (no longer flows through ClusterClient).
### External Integrations
- External System Gateway: HTTP/REST only, JSON serialization, API key + Basic Auth.
- Dual call modes: `ExternalSystem.Call()` (synchronous) and `ExternalSystem.CachedCall()` (store-and-forward on transient failure).
- Error classification: HTTP 5xx/408/429/connection errors = transient; other 4xx = permanent (returned to script).
- Notification Service: SMTP with OAuth2 Client Credentials (Microsoft 365) or Basic Auth. BCC delivery, plain text.
- Notification delivery is central-only: sites store-and-forward notifications to the central cluster (target = central, not SMTP); sites never talk to SMTP. Notification lists and SMTP config are no longer deployed to sites; recipient resolution happens at central, at delivery time.
- Notification lists carry a `Type` discriminator (`Email` now; `Teams` and others later). `Notify.To("list")` is type-agnostic; delivery is via per-type `INotificationDeliveryAdapter` (success/transient/permanent classification, same pattern as External System Gateway).
- `Notify.Send` is async — returns a `NotificationId` (GUID, idempotency key) status handle immediately. `Notify.Status(notificationId)` returns a status record (status, retry count, last error, key timestamps); answered site-locally as `Forwarding` while still in the site S&F buffer, otherwise round-trips to central.
- Inbound API: `POST /api/{methodName}`, `X-API-Key` header, flat JSON, extended type system (Object, List).
### Templates & Deployment
- Pre-deployment validation includes semantic checks (call targets, argument types, trigger operand types).
- Composed member addressing uses path-qualified canonical names: `[ModuleInstanceName].[MemberName]`.
- Override granularity defined per entity type and per field.
- Template graph acyclicity enforced on save.
- Flattened configs include a revision hash for staleness detection.
- Deployment identity: unique deployment ID + revision hash for idempotency.
- Per-instance operation lock covers all mutating commands (deploy, disable, enable, delete).
- Site-side apply is all-or-nothing per instance.
- System-wide artifact version skew across sites is supported.
- Last-write-wins for concurrent template editing (no optimistic concurrency on templates).
- Optimistic concurrency on deployment status records.
- Naming collisions in composed feature modules are design-time errors.
### Store-and-Forward
- Fixed retry interval, no max buffer size. Only transient failures buffered.
- Async best-effort replication to standby (no ack wait).
- Messages not cleared on instance deletion.
- CachedCall idempotency is the caller's responsibility.
- Notification Outbox: central `NotificationOutboxActor` singleton on the active central node — the first centrally-hosted store-and-forward component (the S&F Engine remains site-only).
- `Notifications` table in central MS SQL is the single source of audit truth (one row per notification); type-agnostic via the `Type` discriminator.
- Status lifecycle `Pending → Retrying → Delivered / Parked / Discarded`, plus site-local `Forwarding` (never persisted centrally).
- Dispatcher loop polls due rows, resolves the list, delivers via the typed adapter; transient failures retry to `Parked`, permanent failures park immediately.
- Site→central handoff is at-least-once: ack-after-persist plus insert-if-not-exists on `NotificationId`.
- No Akka replication — MS SQL is the HA store; daily purge of terminal rows after a configurable window (default 365 days).
- Notification Outbox retry reuses central SMTP max-retry-count and fixed interval.
- Cached calls (`ExternalSystem.CachedCall`, `Database.CachedWrite`) return a `TrackedOperationId` tracking handle, unified with `Notify.Send`'s existing tracking model (`Notify.Status` retained as a thin alias).
- A site-local operation tracking table (SQLite, alongside the S&F buffer) is the source of truth for cached-call status; `Tracking.Status(id)` reads it site-locally and authoritatively; terminal rows purged after a configurable window (default 7 days).
- Unified tracking status lifecycle `Pending → Retrying → Delivered / Parked / Failed / Discarded`; `Failed` = permanent failure (also returned synchronously to the calling script). No `Forwarding` state for cached calls.
- Site Call Audit (#22): central `SiteCallAuditActor` singleton with a `SiteCalls` audit table (central MS SQL) fed by best-effort site telemetry plus periodic reconciliation pulls — an eventually-consistent mirror, NOT a dispatcher; cached-call delivery stays site-local. Ingest is insert-if-not-exists then upsert-on-newer-status.
- Central UI Site Calls page + central→site `RetryParkedOperation`/`DiscardParkedOperation` relay for parked cached calls; central never mutates the `SiteCalls` row directly.
### Security & Auth
- Authentication: direct LDAP bind (username/password), no Kerberos/NTLM. LDAPS/StartTLS required.
- Cookie+JWT hybrid sessions: HttpOnly/Secure cookie carries an embedded JWT (HMAC-SHA256 shared symmetric key), 15-minute expiry with sliding refresh, 30-minute idle timeout. Cookies are the correct transport for Blazor Server (SignalR circuits).
- LDAP failure: new logins fail; active sessions continue with current roles.
- Load balancer in front of central UI; cookie-embedded JWT + shared Data Protection keys for failover transparency.
### Cluster & Failover
- Keep-oldest split-brain resolver with `down-if-alone = on`, 15s stable-after.
- Both nodes are seed nodes. `min-nr-of-members = 1`.
- Failure detection: 2s heartbeat, 10s threshold. Total failover ~25s.
- CoordinatedShutdown for graceful singleton handover.
- Automatic dual-node recovery from persistent storage.
### UI & Monitoring
- Central UI: Blazor Server (ASP.NET Core + SignalR) with Bootstrap CSS. No third-party component frameworks (no Blazorise, MudBlazor, Radzen, etc.). Build custom Blazor components for tables, grids, forms, etc.
- UI design: Clean, corporate, internal-use aesthetic. Not flashy. Use the `frontend-design` skill when designing UI pages/components.
- Debug view: real-time streaming via DebugStreamBridgeActor + gRPC (events via SiteStreamGrpcClient, snapshot via ClusterClient). Health dashboard: 10s polling timer. Deployment status: real-time push via SignalR.
- Health reports: 30s interval, 60s offline threshold, monotonic sequence numbers, raw error counts per interval.
- Dead letter monitoring as a health metric.
- Site Event Logging: 30-day retention, 1GB storage cap, daily purge, paginated queries with keyword search.
- Notification Outbox KPIs are central-computed point-in-time from the `Notifications` table (global + per-source-site): queue depth, stuck count, parked count, delivered-last-interval, oldest-pending age.
- Stuck = `Pending`/`Retrying` older than a configurable age threshold (default 10 min) — display-only (KPI count + row badge), no escalation/alerting.
- Headline KPI tiles surface on the Health dashboard; a new Central UI Notification Outbox page offers a queryable list with Retry/Discard actions on parked notifications.
- Site Call Audit KPIs are central-computed point-in-time from the `SiteCalls` table (global + per-site), mirroring the Notification Outbox KPI shape; tiles surface on the Health dashboard alongside a queryable Central UI Site Calls page with Retry/Discard on parked rows.
### Code Organization
- Entity classes are persistence-ignorant POCOs in Commons; EF mappings in Configuration Database.
- Repository interfaces defined in Commons; implementations in Configuration Database.
- Commons namespace hierarchy: Types/, Interfaces/, Entities/, Messages/ with domain area subfolders.
- Message contracts follow additive-only evolution rules for version compatibility.
- Per-component configuration via `appsettings.json` sections bound to options classes (Options pattern).
- Options classes owned by component projects, not Commons.
- Host readiness gating: `/health/ready` endpoint, no traffic until operational.
- EF Core migrations: auto-apply in dev, manual SQL scripts for production.
- Audit logging absorbed into Configuration Database component (IAuditService).
### Akka.NET Conventions
- Tell for hot-path internal communication; Ask reserved for system boundaries.
- ClusterClient for cross-cluster communication; ClusterClientReceptionist for service discovery across cluster boundaries.
- Script trust model: forbidden APIs (System.IO, Process, Threading, Reflection, raw network).
- Application-level correlation IDs on all request/response messages.
## Tool Usage
- When consulting with the Codex MCP tool, use model `gpt-5.4`.
- When a task requires setting up or controlling system state (sites, templates, instances, data connections, deployments, security, etc.) and the Central UI is not needed, prefer the ScadaLink CLI over manual DB edits or UI navigation. See [`src/ScadaLink.CLI/README.md`](src/ScadaLink.CLI/README.md) for the full command reference.
### CLI Quick Reference (Docker / OrbStack)
- **Management URL**: `http://localhost:9000` — the CLI connects via the Traefik load balancer, which routes to the active central node. Direct access: central-a on port 9001, central-b on port 9002.
- **Test user**: `--username multi-role --password password` — has Admin, Design, and Deployment roles. The `admin` user only has the Admin role and cannot create templates, data connections, or deploy.
- **Config file**: `~/.scadalink/config.json` — stores `managementUrl` and default format. See `docker/README.md` for a ready-to-use test config.
- **Rebuild cluster**: `bash docker/deploy.sh` — builds the `scadalink:latest` image and recreates all containers. Run this after code changes to ManagementActor, Host, or any server-side component.
- **Infrastructure services**: `cd infra && docker compose up -d` — starts LDAP, MS SQL, OPC UA, SMTP, and REST API. These are separate from the cluster containers in `docker/`.
- **All test LDAP passwords**: `password` (see `infra/glauth/config.toml` for users and groups).