refactor(docs): move requirements and test infra docs into docs/ subdirectories
Organize documentation by moving requirements (HighLevelReqs, Component-*, lmxproxy_protocol) to docs/requirements/ and test infrastructure docs to docs/test_infra/. Update all cross-references in README, CLAUDE.md, infra/README, component docs, and 23 plan files.
246
docs/requirements/Component-CLI.md
Normal file
@@ -0,0 +1,246 @@
# Component: CLI

## Purpose

The CLI is a standalone command-line tool for scripting and automating administrative operations against the ScadaLink central cluster. It connects to the Central Host's HTTP Management API (`POST /management`), which dispatches commands to the ManagementActor. Authentication and role resolution are handled server-side — the CLI sends credentials via HTTP Basic Auth. The CLI provides the same administrative capabilities as the Central UI, enabling automation, batch operations, and integration with CI/CD pipelines.

## Location

Standalone executable, not part of the Host binary. Deployed on any machine with HTTP access to a central node.

`src/ScadaLink.CLI/`

## Responsibilities

- Parse command-line arguments and dispatch to the appropriate management operation.
- Send HTTP requests to the Central Host's Management API endpoint with Basic Auth credentials.
- Display structured responses from the Management API.
- Support both JSON and human-readable table output formats.

## Technology

- **Argument parsing**: `System.CommandLine` library for command/subcommand/option parsing with built-in help generation.
- **Transport**: HTTP client connecting to the Central Host's `POST /management` endpoint. Authentication is via HTTP Basic Auth — the server performs LDAP bind and role resolution.
- **Serialization**: Commands serialized as JSON with a type discriminator (`command` field). Message contracts from Commons define the command types.
## Authentication

The CLI sends user credentials to the Management API via HTTP Basic Auth:

1. The user provides credentials via `--username` / `--password` options.
2. On each request, the CLI encodes credentials as a Basic Auth header and sends them with the command.
3. The server performs LDAP authentication, group lookup, and role resolution — the CLI does not communicate with LDAP directly.
4. Credentials are not stored or cached between invocations. Each CLI invocation requires fresh credentials.
## Connection

The CLI connects to the Central Host via HTTP:

- **Management URL**: The URL of a central node's web server (e.g., `http://localhost:9001`). The management API is served at `POST /management` on the same host as the Central UI.
- **Failover**: For HA, use a load balancer URL in front of both central nodes. The management API is stateless (Basic Auth per request), so any central node can handle any request without sticky sessions.
- **No Akka.NET dependency**: The CLI is a pure HTTP client with no Akka.NET runtime.

## Command Structure

The CLI uses a hierarchical subcommand structure mirroring the Management Service message groups:

```
scadalink <group> <action> [options]
```
### Template Commands
```
scadalink template list [--format json|table]
scadalink template get <name> [--format json|table]
scadalink template create --name <name> [--parent <parent>] --file <path>
scadalink template update <name> --file <path>
scadalink template delete <name>
scadalink template validate <name>
scadalink template diff <instance-code>
scadalink template attribute add --template-id <id> --name <name> --data-type <type> [--default-value <value>] [--tag-path <path>]
scadalink template attribute update --template-id <id> --name <name> [--data-type <type>] [--default-value <value>] [--tag-path <path>]
scadalink template attribute delete --template-id <id> --name <name>
scadalink template alarm add --template-id <id> --name <name> --trigger-attribute <attr> --condition <cond> --setpoint <value> [--severity <level>] [--notification-list <name>]
scadalink template alarm update --template-id <id> --name <name> [--condition <cond>] [--setpoint <value>] [--severity <level>] [--notification-list <name>]
scadalink template alarm delete --template-id <id> --name <name>
scadalink template script add --template-id <id> --name <name> --trigger-type <type> [--trigger-attribute <attr>] [--interval <ms>] --code <code>
scadalink template script update --template-id <id> --name <name> [--trigger-type <type>] [--trigger-attribute <attr>] [--interval <ms>] [--code <code>]
scadalink template script delete --template-id <id> --name <name>
scadalink template composition add --template-id <id> --module-template-id <id> --instance-name <name>
scadalink template composition delete --template-id <id> --instance-name <name>
```
### Instance Commands
```
scadalink instance list [--site <site>] [--area <area>] [--format json|table]
scadalink instance get <code> [--format json|table]
scadalink instance create --template <name> --site <site> --code <code> [--area <area>]
scadalink instance set-overrides <code> --file <path>
scadalink instance set-bindings <code> --bindings <json>
scadalink instance bind-connections <code> --file <path>
scadalink instance assign-area <code> --area <area>
scadalink instance enable <code>
scadalink instance disable <code>
scadalink instance delete <code>
```
### Site Commands
```
scadalink site list [--format json|table]
scadalink site get <site-id> [--format json|table]
scadalink site create --name <name> --id <site-id>
scadalink site update <site-id> --file <path>
scadalink site delete <site-id>
scadalink site area list <site-id>
scadalink site area create <site-id> --name <name> [--parent <parent-area>]
scadalink site area update <site-id> --name <name> [--new-name <name>] [--parent <parent-area>]
scadalink site area delete <site-id> --name <name>
```

### Deployment Commands
```
scadalink deploy instance <code>
scadalink deploy artifacts [--site <site>] [--type <artifact-type>]
scadalink deploy status [--format json|table]
```

### Data Connection Commands
```
scadalink data-connection list [--format json|table]
scadalink data-connection get <name> [--format json|table]
scadalink data-connection create --file <path>
scadalink data-connection update <name> --file <path>
scadalink data-connection delete <name>
scadalink data-connection assign <name> --site <site-id>
scadalink data-connection unassign <name> --site <site-id>
```
### External System Commands
```
scadalink external-system list [--format json|table]
scadalink external-system get <name> [--format json|table]
scadalink external-system create --file <path>
scadalink external-system update <name> --file <path>
scadalink external-system delete <name>
```

### Notification Commands
```
scadalink notification list [--format json|table]
scadalink notification get <name> [--format json|table]
scadalink notification create --file <path>
scadalink notification update <name> --file <path>
scadalink notification delete <name>
scadalink notification smtp list [--format json|table]
scadalink notification smtp update --file <path>
```

### Security Commands
```
scadalink security api-key list [--format json|table]
scadalink security api-key create --name <name>
scadalink security api-key update <name> [--name <new-name>] [--enabled <bool>]
scadalink security api-key enable <name>
scadalink security api-key disable <name>
scadalink security api-key delete <name>
scadalink security role-mapping list [--format json|table]
scadalink security role-mapping create --group <ldap-group> --role <role> [--site <site>]
scadalink security role-mapping update --id <id> [--group <ldap-group>] [--role <role>]
scadalink security role-mapping delete --group <ldap-group> --role <role>
scadalink security scope-rule list [--role-mapping-id <id>] [--format json|table]
scadalink security scope-rule add --role-mapping-id <id> --site-id <site-id>
scadalink security scope-rule delete --id <id>
```

### Audit Log Commands
```
scadalink audit-log query [--user <username>] [--entity-type <type>] [--from <date>] [--to <date>] [--format json|table]
```
### Health Commands
```
scadalink health summary [--format json|table]
scadalink health site <site-id> [--format json|table]
scadalink health event-log --site-identifier <site-id> [--from <date>] [--to <date>] [--search <term>] [--page <n>] [--page-size <n>] [--format json|table]
scadalink health parked-messages --site-identifier <site-id> [--page <n>] [--page-size <n>] [--format json|table]
```

### Debug Commands
```
scadalink debug snapshot --id <id> [--format json|table]
```

### Shared Script Commands
```
scadalink shared-script list [--format json|table]
scadalink shared-script get --id <id> [--format json|table]
scadalink shared-script create --name <name> --code <code>
scadalink shared-script update --id <id> [--name <name>] [--code <code>]
scadalink shared-script delete --id <id>
```

### Database Connection Commands
```
scadalink db-connection list [--format json|table]
scadalink db-connection get --id <id> [--format json|table]
scadalink db-connection create --name <name> --connection-string <string> [--provider <provider>]
scadalink db-connection update --id <id> [--name <name>] [--connection-string <string>] [--provider <provider>]
scadalink db-connection delete --id <id>
```
### Inbound API Method Commands
```
scadalink api-method list [--format json|table]
scadalink api-method get --id <id> [--format json|table]
scadalink api-method create --name <name> --code <code> [--description <desc>]
scadalink api-method update --id <id> [--name <name>] [--code <code>] [--description <desc>]
scadalink api-method delete --id <id>
```

## Configuration

Configuration is resolved in the following priority order (highest wins):

1. **Command-line options**: `--url`, `--username`, `--password`, `--format`.
2. **Environment variables**:
   - `SCADALINK_MANAGEMENT_URL` — Management API URL (e.g., `http://central-host:5000`).
   - `SCADALINK_FORMAT` — Default output format (`json` or `table`).
3. **Configuration file**: `~/.scadalink/config.json` — Persistent defaults for management URL and output format.
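The precedence above amounts to a first-non-empty lookup. A sketch for the management URL (the function name is invented; only the environment variable and config key names come from this document):

```python
import json
import os
from pathlib import Path
from typing import Optional

def resolve_management_url(cli_option: Optional[str],
                           config_path: Path = Path.home() / ".scadalink" / "config.json"
                           ) -> Optional[str]:
    # 1. Command-line option wins.
    if cli_option:
        return cli_option
    # 2. Then the environment variable.
    env = os.environ.get("SCADALINK_MANAGEMENT_URL")
    if env:
        return env
    # 3. Finally the persistent config file, if present.
    if config_path.exists():
        return json.loads(config_path.read_text()).get("managementUrl")
    return None
```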
### Configuration File Format

```json
{
  "managementUrl": "http://central-host:5000"
}
```

## Output Formats

- **JSON** (default): Machine-readable JSON output to stdout. Suitable for piping to `jq` or processing in scripts. Errors are written to stderr as JSON objects with `error` and `code` fields.
- **Table** (`--format table` or `--table`): Human-readable tabular output with aligned columns. Suitable for interactive use.

## Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error (command failed, connection failure, or authentication failure) |
| 2 | Authorization failure (insufficient role) |
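A sketch of how HTTP outcomes could map onto these exit codes (the handling of statuses other than the documented 401/403 cases is an assumption inferred from the table):

```python
from typing import Optional

def exit_code_for(http_status: Optional[int]) -> int:
    # None models a transport-level failure (no HTTP response at all).
    if http_status is None:
        return 1          # connection failure -> general error
    if http_status == 403:
        return 2          # authorization failure (insufficient role)
    if 200 <= http_status < 300:
        return 0          # success
    return 1              # general error, including 401 authentication failure
```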
## Error Handling

- **Connection failure**: If the CLI cannot connect to the management URL (e.g., DNS failure, connection refused), it exits with code 1 and a descriptive error message.
- **Command timeout**: If the server does not respond within 30 seconds, the command fails with a timeout error (HTTP 504).
- **Authentication failure**: If the server returns HTTP 401 (LDAP bind failed), the CLI exits with code 1.
- **Authorization failure**: If the server returns HTTP 403, the CLI exits with code 2.

## Dependencies

- **Commons**: Message contracts (`Messages/Management/`) for command type definitions and registry.
- **System.CommandLine**: Command-line argument parsing.

## Interactions

- **Management Service (via HTTP)**: The CLI's sole runtime dependency. All operations are sent as HTTP POST requests to the Management API endpoint on a central node, which dispatches to the ManagementActor.
- **Central Host**: Serves the Management API at `POST /management`. Handles LDAP authentication, role resolution, and ManagementActor dispatch.
138
docs/requirements/Component-CentralUI.md
Normal file
@@ -0,0 +1,138 @@
# Component: Central UI

## Purpose

The Central UI is a web-based management interface hosted on the central cluster. It provides all configuration, deployment, monitoring, and troubleshooting workflows for the SCADA system. There is no live machine data visualization — the UI is focused on system management, with the exception of on-demand debug views.

## Location

Central cluster only. Sites have no user interface.

## Technology

- **Framework**: Blazor Server (ASP.NET Core). UI logic executes on the server, updates pushed to the browser via SignalR.
  - Keeps the entire stack in C#/.NET, consistent with the rest of the system (Akka.NET, EF Core).
  - SignalR provides built-in support for real-time UI updates.

## Failover Behavior

- A **load balancer** sits in front of the central cluster and routes to the active node.
- On central failover, the Blazor Server SignalR circuit is interrupted. The browser automatically attempts to reconnect via SignalR's built-in reconnection logic.
- Since sessions use **authentication cookies** carrying an embedded JWT (not server-side state), the user's authentication survives failover — the new active node validates the same cookie-embedded JWT. No re-login required if the token is still valid.
- Active debug view polling and in-progress deployment status subscriptions are lost on failover and must be re-opened by the user.
- Both central nodes share the same **ASP.NET Data Protection keys** (stored in the configuration database or shared configuration) so that tokens and anti-forgery tokens remain valid across failover.

## Real-Time Updates

- **Debug view**: Near-real-time display of attribute values and alarm states, updated via a **2-second polling timer**. This avoids the complexity of cross-cluster streaming while providing responsive feedback — 2s latency is imperceptible for debugging purposes.
- **Health dashboard**: Site status, connection health, error rates, and buffer depths update via a **10-second auto-refresh timer**. Since health reports arrive from sites every 30 seconds, a 10s poll interval catches updates within one reporting cycle without unnecessary overhead.
- **Deployment status**: Pending/in-progress/success/failed transitions **push to the UI immediately** via SignalR (built into Blazor Server). No polling required for deployment tracking.
## Responsibilities

- Provide authenticated access to all management workflows.
- Enforce role-based access control in the UI (Admin, Design, Deployment with site scoping).
- Present data from the configuration database, and from site clusters via remote queries.

## Workflows / Pages

### Template Authoring (Design Role)
- Create, edit, and delete templates.
- **Template deletion** is blocked if any instances or child templates reference the template. The UI displays the references preventing deletion.
- Manage template hierarchy (inheritance) — visual tree of parent/child relationships.
- Manage composition — add/remove feature module instances within templates. **Naming collision detection** provides immediate feedback if composed modules introduce duplicate attribute, alarm, or script names.
- Define and edit attributes, alarms, and scripts on templates.
- Set lock flags on attributes, alarms, and scripts.
- Visual indicator showing inherited vs. locally defined vs. overridden members.
- **On-demand validation**: A "Validate" action allows Design users to run comprehensive pre-deployment validation (flattening, naming collisions, script compilation, trigger references) without triggering a deployment. Provides early feedback during authoring.
- **Last-write-wins** editing — no pessimistic locks or conflict detection on templates.
### Shared Script Management (Design Role)
- Create, edit, and delete shared (global) scripts.
- Shared scripts are not associated with any template.
- On-demand validation (compilation check) available.

### External System Management (Design Role)
- Define external system contracts: connection details, API method definitions (parameters, return types).
- Define retry settings per external system (max retry count, fixed time between retries).

### Database Connection Management (Design Role)
- Define named database connections: server, database, credentials.
- Define retry settings per connection (max retry count, fixed time between retries).

### Notification List Management (Design Role)
- Create, edit, and delete notification lists.
- Manage recipients (name + email) within each list.
- Configure SMTP settings.

### Site & Data Connection Management (Admin Role)
- Create, edit, and delete site definitions.
- Define data connections and assign them to sites (name, protocol type, connection details).

### Area Management (Admin Role)
- Define hierarchical area structures per site.
- Parent-child area relationships.
- Assign areas when managing instances.

### Instance Management (Deployment Role)
- Create instances from templates at a specific site.
- Assign instances to areas.
- Bind data connections — **per-attribute binding** where each attribute with a data source reference individually selects its data connection from the site's available connections. **Bulk assignment** supported: select multiple attributes and assign a data connection to all of them at once.
- Set instance-level attribute overrides (non-locked attributes only).
- Filter/search instances by site, area, template, or status.
- **Disable** instances — stops data collection, script triggers, and alarm evaluation at the site while retaining the deployed configuration.
- **Enable** instances — re-activates a disabled instance.
- **Delete** instances — removes the running configuration from the site. Blocked if the site is unreachable. Store-and-forward messages are not cleared.
### Deployment (Deployment Role)
- View list of instances with staleness indicators (deployed config differs from template-derived config).
- Filter by site, area, template.
- View diff between deployed and current template-derived configuration.
- Deploy updated configuration to individual instances. **Pre-deployment validation** runs automatically before any deployment is sent — validation errors are displayed and block deployment.
- Track deployment status (pending, in-progress, success, failed).

### System-Wide Artifact Deployment (Deployment Role)
- Explicitly deploy shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration to all sites or to an individual site.
- **Per-site deployment**: A "Deploy Artifacts" button on the Sites admin page allows deploying all artifacts to an individual site.
- **Deploy all**: A bulk action deploys artifacts to all sites at once.
- This is a **separate action** from instance deployment — system-wide artifacts are not automatically pushed when definitions change.
- Track per-site deployment status.

### Debug View (Deployment Role)
- Select a deployed instance and open a live debug view.
- Near-real-time polling (2s interval) of all attribute values (with quality and timestamp) and alarm states for that instance.
- Initial snapshot of current state followed by periodic polling for updates.
- Stream includes attribute values formatted as `[InstanceUniqueName].[AttributePath].[AttributeName]` and alarm states formatted as `[InstanceUniqueName].[AlarmName]`.
- Subscribe-on-demand — stream starts when opened, stops when closed.
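The two naming patterns above can be illustrated with a small formatting sketch (the sample instance, attribute, and alarm names are invented):

```python
def attribute_key(instance_unique_name: str, attribute_path: str,
                  attribute_name: str) -> str:
    # [InstanceUniqueName].[AttributePath].[AttributeName]
    return f"{instance_unique_name}.{attribute_path}.{attribute_name}"

def alarm_key(instance_unique_name: str, alarm_name: str) -> str:
    # [InstanceUniqueName].[AlarmName]
    return f"{instance_unique_name}.{alarm_name}"

# Invented example names.
k1 = attribute_key("Press01", "Hydraulics", "OilTemp")
k2 = alarm_key("Press01", "OilTempHigh")
```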
### Parked Message Management (Deployment Role)
- Query sites for parked messages (external system calls, notifications, cached DB writes).
- View message details (target, payload, retry count, timestamps).
- Retry or discard individual parked messages.

### Health Monitoring Dashboard (All Roles)
- Overview of all sites with online/offline status.
- Per-site detail: active/standby node status, data connection health, script error rates, alarm evaluation error rates, store-and-forward buffer depths.

### Site Event Log Viewer (Deployment Role)
- Query site event logs remotely.
- Filter by event type, time range, instance.
- View script executions, alarm events (activations, clears, evaluation errors), deployment events (including script compilation results), connection status changes, store-and-forward activity, instance lifecycle events (enable, disable, delete).

### Audit Log Viewer (Admin Role)
- Query the central audit log.
- Filter by user, entity type, action type, time range.
- View before/after state for each change.

### LDAP Group Mapping (Admin Role)
- Map LDAP groups to system roles (Admin, Design, Deployment).
- Configure site-scoping for Deployment role groups.

## Dependencies

- **Template Engine**: Provides template and instance data models, flattening, diff calculation, and validation.
- **Deployment Manager**: Triggers deployments, system-wide artifact deployments, and instance lifecycle commands. Provides deployment status.
- **Communication Layer**: Routes debug view subscriptions, remote queries to sites.
- **Security & Auth**: Authenticates users and enforces role-based access.
- **Configuration Database**: All central data, including audit log data for the audit log viewer. Accessed via `ICentralUiRepository`.
- **Health Monitoring**: Provides site health data for the dashboard.
135
docs/requirements/Component-ClusterInfrastructure.md
Normal file
@@ -0,0 +1,135 @@
# Component: Cluster Infrastructure

## Purpose

The Cluster Infrastructure component manages the Akka.NET cluster setup, active/standby node roles, failover detection, and the foundational runtime environment on which all other components run. It provides the base layer for both central and site clusters.

## Location

Both central and site clusters.

## Responsibilities

- Bootstrap the Akka.NET actor system on each node.
- Form a two-node cluster (active/standby) using Akka.NET Cluster.
- Manage leader election and role assignment (active vs. standby).
- Detect node failures and trigger failover.
- Provide the Akka.NET remoting infrastructure for inter-cluster communication.
- Support cluster singleton hosting (used by the Site Runtime Deployment Manager singleton on site clusters).
- Manage Windows service lifecycle (start, stop, restart) on each node.

## Cluster Topology

### Central Cluster
- Two nodes forming an Akka.NET cluster.
- One active node runs all central components (Template Engine, Deployment Manager, Central UI, etc.).
- One standby node is ready to take over on failover.
- Connected to MS SQL databases (Config DB, Machine Data DB).

### Site Cluster (per site)
- Two nodes forming an Akka.NET cluster.
- One active node runs all site components (Site Runtime, Data Connection Layer, Store-and-Forward Engine, etc.).
- The Site Runtime Deployment Manager runs as an **Akka.NET cluster singleton** on the active node, owning the full Instance Actor hierarchy.
- One standby node receives replicated store-and-forward data and is ready to take over.
- Connected to local SQLite databases (store-and-forward buffer, event logs, deployed configurations).
- Connected to machines via data connections (OPC UA, LmxProxy).
## Failover Behavior

### Detection
- Akka.NET Cluster monitors node health via heartbeat.
- If the active node becomes unreachable, the standby node detects the failure and promotes itself to active.

### Central Failover
- The new active node takes over all central responsibilities.
- In-progress deployments are treated as **failed** — engineers must retry.
- The UI session may be interrupted — users reconnect to the new active node.
- No message buffering at central — no state to recover beyond what's in MS SQL.

### Site Failover
- The new active node takes over:
  - The Deployment Manager singleton restarts and re-creates the full Instance Actor hierarchy by reading deployed configurations from local SQLite. Each Instance Actor spawns its child Script and Alarm Actors.
  - Data collection (Data Connection Layer re-establishes subscriptions as Instance Actors register their data source references).
  - Store-and-forward delivery (buffer is already replicated locally).
- Active debug view streams from central are interrupted — the engineer must re-open them.
- Health reporting resumes from the new active node.
- Alarm states are re-evaluated from incoming values (alarm state is in-memory only).

## Split-Brain Resolution

The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:

- On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
- **Stable-after duration**: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.
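In HOCON, this strategy could be configured roughly as follows (a sketch following the shape of the Akka.NET split-brain resolver settings; treat it as illustrative rather than the project's actual configuration):

```hocon
akka.cluster {
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }
}
```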
## Single-Node Operation

`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running. If set to 2, the surviving node waits for a second member before allowing the Cluster Singleton (Site Runtime Deployment Manager) to start — blocking all data collection and script execution indefinitely.

## Failure Detection Timing

Configurable defaults for heartbeat and failure detection:

| Setting | Default | Description |
|---------|---------|-------------|
| Heartbeat interval | 2 seconds | Frequency of health check messages between nodes |
| Failure detection threshold | 10 seconds | Time without heartbeat before a node is considered unreachable |
| Stable-after (split-brain) | 15 seconds | Time cluster must be stable before resolver acts |
| **Total failover time** | **~25 seconds** | Detection (10s) + stable-after (15s) + singleton restart |

These values balance failover speed with stability — fast enough that data collection gaps are small, tolerant enough that brief network hiccups don't trigger unnecessary failovers.
## Dual-Node Recovery

If both nodes in a cluster fail simultaneously (e.g., site power outage):

1. **No manual intervention required.** Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
2. **State recovery** (each node has its own local copy of all required data):
   - **Site clusters**: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy. Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
   - **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.

## Graceful Shutdown & Singleton Handover

When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).
|
||||
|
||||
Configuration required:
|
||||
- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
|
||||
- `akka.cluster.run-coordinated-shutdown-when-down = on`
|
||||
|
||||
The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).

## Node Configuration

Each node is configured with:

- **Cluster seed nodes**: **Both nodes** are seed nodes — each node lists both itself and its partner. Either node can start first and form the cluster; the other joins when it starts. No startup ordering dependency.
- **Cluster role**: Central or Site (plus site identifier for site clusters).
- **Akka.NET remoting**: Hostname/port for inter-node and inter-cluster communication.
- **Local storage paths**: SQLite database locations (site nodes only).
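
A minimal HOCON sketch of this configuration (hostnames, ports, and role names are placeholders):

```hocon
akka {
  actor.provider = cluster
  remote.dot-netty.tcp {
    hostname = "site-a-node-1"   # this node's own address
    port = 4053
  }
  cluster {
    # Both nodes list both seed nodes; either node can start first.
    seed-nodes = [
      "akka.tcp://scadalink@site-a-node-1:4053",
      "akka.tcp://scadalink@site-a-node-2:4053"
    ]
    roles = ["site", "site-a"]
  }
}
```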

## Windows Service

- Each node runs as a **Windows service** for automatic startup and recovery.
- Service configuration includes Akka.NET cluster settings and component-specific configuration.

## Platform

- **OS**: Windows Server.
- **Runtime**: .NET (Akka.NET).
- **Cluster**: Akka.NET Cluster (application-level, not Windows Server Failover Clustering).

## Dependencies

- **Akka.NET**: Core actor system, cluster, remoting, and cluster singleton libraries.
- **Windows**: Service hosting, networking.
- **MS SQL** (central only): Database connectivity.
- **SQLite** (sites only): Local storage.

## Interactions

- **All components**: Every component runs within the Akka.NET actor system managed by this infrastructure.
- **Site Runtime**: The Deployment Manager singleton relies on Akka.NET cluster singleton support provided by this infrastructure.
- **Communication Layer**: Built on top of the Akka.NET remoting provided here.
- **Health Monitoring**: Reports node status (active/standby) as a health metric.

239
docs/requirements/Component-Commons.md
Normal file

# Component: Commons

## Purpose

The Commons component provides the shared foundation of data types, interfaces, enums, message contracts, data transfer objects, and persistence-ignorant domain entity classes used across all other ScadaLink components. It ensures consistent type definitions for cross-component communication and data access, and eliminates duplication of common abstractions.

## Location

Referenced by all component libraries and the Host.

## Responsibilities

- Define shared data types (enums, value objects, result types) used across multiple components.
- Define **persistence-ignorant domain entity classes** (POCOs) representing all configuration database entities. These classes have no dependency on Entity Framework or any persistence framework — EF mapping is handled entirely by the Configuration Database component via Fluent API.
- Define **per-component repository interfaces** that consuming components use for data access. Repository implementations are owned by the Configuration Database component.
- Define protocol abstraction interfaces for the Data Connection Layer.
- Define cross-component message contracts and DTOs for deployment, health, communication, instance lifecycle, and other inter-component data flows.
- Contain **no business logic** — only data structures, interfaces, and enums.
- Maintain **minimal dependencies** — only core .NET libraries; no Akka.NET, no ASP.NET, no Entity Framework.

---

## Requirements

### REQ-COM-1: Shared Data Type System

Commons must define shared primitive and utility types used across multiple components, including but not limited to:

- **`DataType` enum**: Enumerates the data types supported by the system (e.g., Boolean, Int32, Float, Double, String, DateTime, Binary).
- **`RetryPolicy`**: A record or immutable class describing retry behavior (max retries, fixed delay between retries).
- **`Result<T>`**: A discriminated result type that represents either a success value or an error, enabling consistent error handling across component boundaries without exceptions.
- **`InstanceState` enum**: Enabled, Disabled.
- **`DeploymentStatus` enum**: Pending, InProgress, Success, Failed.
- **`AlarmState` enum**: Active, Normal.
- **`AlarmTriggerType` enum**: ValueMatch, RangeViolation, RateOfChange.
- **`ConnectionHealth` enum**: Connected, Disconnected, Connecting, Error.

Types defined here must be immutable and thread-safe.
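
As a concrete illustration, `Result<T>` could take the following shape (a sketch, not the definitive API):

```csharp
// Immutable success-or-error result; instances are created only through
// the factory methods, so the invariants cannot be violated.
public sealed record Result<T>
{
    public bool IsSuccess { get; }
    public T? Value { get; }
    public string? Error { get; }

    private Result(bool isSuccess, T? value, string? error)
        => (IsSuccess, Value, Error) = (isSuccess, value, error);

    public static Result<T> Success(T value) => new(true, value, null);
    public static Result<T> Failure(string error) => new(false, default, error);
}
```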

**Timestamp convention**: All timestamps throughout the system must use **UTC** (`DateTime` with `DateTimeKind.Utc` or `DateTimeOffset` with zero offset). This applies to all stored timestamps (SQLite, MS SQL, audit log entries), all message timestamps (attribute values, alarm state changes, health reports, event log entries, deployment records), and all wire-format timestamps (Akka remoting, Inbound API responses). Local time conversion, if needed, is a UI display concern only.

### REQ-COM-2: Protocol Abstraction

Commons must define the protocol abstraction interfaces that the Data Connection Layer implements and other components consume:

- **`IDataConnection`**: The common interface for reading, writing, and subscribing to device data regardless of the underlying protocol (OPC UA, custom legacy, etc.).
- **Related types**: Tag identifiers, read/write results, subscription callbacks, connection status enums, and quality codes.

These interfaces must not reference any specific protocol implementation.
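
A sketch of the abstraction (member names and signatures are assumptions; `QualityCode` stands in for the quality-code type mentioned above):

```csharp
// Protocol-agnostic device access; implemented by the Data Connection Layer.
public interface IDataConnection
{
    Task<Result<TagValue>> ReadAsync(string tagId, CancellationToken ct = default);
    Task<Result<bool>> WriteAsync(string tagId, object value, CancellationToken ct = default);

    // Returns a handle that removes the subscription when disposed.
    IDisposable Subscribe(string tagId, Action<TagValue> onChange);

    ConnectionHealth Health { get; }
}

public sealed record TagValue(string TagId, object? Value, QualityCode Quality, DateTime TimestampUtc);
```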

### REQ-COM-3: Domain Entity Classes (POCOs)

Commons must define persistence-ignorant POCO entity classes for all configuration database entities. These classes:

- Are plain C# classes with properties — no EF attributes, no base classes from EF, no navigation property annotations.
- May include navigation properties (e.g., `Template.Attributes` as `ICollection<TemplateAttribute>`) defined as plain collections. The Configuration Database component configures the relationships via Fluent API.
- May include constructors that enforce invariants (e.g., required fields).
- Must have **no dependency** on Entity Framework Core or any persistence library.

Entity classes are organized by domain area:

- **Template & Modeling**: `Template`, `TemplateAttribute`, `TemplateAlarm`, `TemplateScript`, `TemplateComposition`, `Instance`, `InstanceAttributeOverride`, `InstanceConnectionBinding`, `Area`.
- **Shared Scripts**: `SharedScript`.
- **Sites & Data Connections**: `Site`, `DataConnection`, `SiteDataConnectionAssignment`.
- **External Systems & Database Connections**: `ExternalSystemDefinition`, `ExternalSystemMethod`, `DatabaseConnectionDefinition`.
- **Notifications**: `NotificationList`, `NotificationRecipient`, `SmtpConfiguration`.
- **Inbound API**: `ApiKey`, `ApiMethod`.
- **Security**: `LdapGroupMapping`, `SiteScopeRule`.
- **Deployment**: `DeploymentRecord`, `SystemArtifactDeploymentRecord`.
- **Audit**: `AuditLogEntry`.
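
For example, the `Template` entity might look like this (property names are illustrative; only the class names are fixed by this document):

```csharp
// Persistence-ignorant POCO: no EF attributes or base classes.
// Relationships are mapped by the Configuration Database component
// via the EF Core Fluent API.
public class Template
{
    public Template(string name)
    {
        // Constructor enforces the "required name" invariant.
        if (string.IsNullOrWhiteSpace(name))
            throw new ArgumentException("Template name is required.", nameof(name));
        Name = name;
    }

    public int Id { get; set; }
    public string Name { get; set; }
    public int? ParentTemplateId { get; set; }
    public string? Description { get; set; }

    // Plain collection navigation property.
    public ICollection<TemplateAttribute> Attributes { get; set; } = new List<TemplateAttribute>();
}
```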

### REQ-COM-4: Per-Component Repository Interfaces

Commons must define repository interfaces that consuming components use for data access. Each interface is tailored to the data needs of its consuming component:

- `ITemplateEngineRepository` — Templates, attributes, alarms, scripts, compositions, instances, overrides, connection bindings, areas.
- `IDeploymentManagerRepository` — Deployment records, deployed configuration snapshots, system-wide artifact deployment records.
- `ISecurityRepository` — LDAP group mappings, site scoping rules.
- `IInboundApiRepository` — API keys, API method definitions.
- `IExternalSystemRepository` — External system definitions, method definitions, database connection definitions.
- `INotificationRepository` — Notification lists, recipients, SMTP configuration.
- `ICentralUiRepository` — Read-oriented queries spanning multiple domain areas for display purposes.

All repository interfaces must:

- Accept and return the POCO entity classes defined in Commons.
- Include a `SaveChangesAsync()` method (or equivalent) to support unit-of-work commit.
- Have **no dependency** on Entity Framework Core — they are pure interfaces.

Implementations of these interfaces are owned by the Configuration Database component.
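
An illustrative shape for one of these interfaces (method names are assumptions; only the interface name and `SaveChangesAsync` are fixed by this document):

```csharp
// Pure interface: accepts/returns Commons POCOs, no EF Core types.
public interface ITemplateEngineRepository
{
    Task<Template?> GetTemplateAsync(int templateId, CancellationToken ct = default);
    Task<IReadOnlyList<Instance>> GetInstancesForSiteAsync(int siteId, CancellationToken ct = default);
    void Add(Template template);

    // Unit-of-work commit; the EF Core implementation flushes the DbContext.
    Task SaveChangesAsync(CancellationToken ct = default);
}
```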

### REQ-COM-4a: Cross-Cutting Service Interfaces

Commons must define service interfaces for cross-cutting concerns that multiple components consume:

- **`IAuditService`**: Provides a single method for components to log audit entries: `LogAsync(user, action, entityType, entityId, entityName, afterState)`. The implementation (owned by the Audit Logging component) serializes the state as JSON and adds the audit entry to the current unit-of-work transaction. Defined in Commons so any central component can call it without depending on the Audit Logging component directly.
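
The contract sketched in C# (parameter types are assumptions based on the description above):

```csharp
public interface IAuditService
{
    // The implementation serializes afterState as JSON and enlists the
    // audit entry in the caller's current unit-of-work transaction.
    Task LogAsync(
        string user,
        string action,      // e.g. "Create", "Update", "Delete"
        string entityType,
        string entityId,
        string entityName,
        object? afterState);
}
```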

### REQ-COM-5: Cross-Component Message Contracts

Commons must define the shared DTOs and message contracts used for inter-component communication, including:

- **Deployment DTOs**: Configuration snapshots, deployment commands, deployment status, validation results.
- **Instance Lifecycle DTOs**: Disable, enable, delete commands and responses.
- **Health DTOs**: Health check results, site status reports, heartbeat messages. Includes script error rates and alarm evaluation error rates.
- **Communication DTOs**: Site identity, connection state, routing metadata.
- **Attribute Stream DTOs**: Attribute value change messages (instance name, attribute path, value, quality, timestamp) and alarm state change messages (instance name, alarm name, state, priority, timestamp) for the site-wide Akka stream.
- **Debug View DTOs**: Subscribe/unsubscribe requests, one-shot snapshot request (`DebugSnapshotRequest`), initial snapshot, stream filter criteria.
- **Script Execution DTOs**: Script call requests (with recursion depth), return values, error results.
- **System-Wide Artifact DTOs**: Shared script packages, external system definitions, database connection definitions, notification list definitions.

All message types must be `record` types or immutable classes suitable for use as Akka.NET messages (though Commons itself must not depend on Akka.NET).
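
Two of the attribute-stream contracts might be sketched as follows (field names follow the stream descriptions above but are assumptions; `QualityCode` is assumed):

```csharp
// Immutable records suitable as Akka.NET messages, though Commons
// itself never references Akka.NET.
public sealed record AttributeValueChanged(
    string InstanceUniqueName,
    string AttributePath,
    object? Value,
    QualityCode Quality,
    DateTime TimestampUtc);

public sealed record AlarmStateChanged(
    string InstanceUniqueName,
    string AlarmName,
    AlarmState State,
    int Priority,
    DateTime TimestampUtc);
```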

### REQ-COM-5a: Message Contract Versioning

Since the system supports cross-site artifact version skew (sites may temporarily run different versions), message contracts must follow **additive-only evolution rules**:

- New fields may be added with default values. Existing fields must not be removed or have their types changed.
- Serialization must tolerate unknown fields (forward compatibility) and missing optional fields (backward compatibility).
- Breaking changes require a new message type and a coordinated deployment to all nodes.
- The Akka.NET serialization binding configuration (in the Host component) must explicitly map message types to serializers to prevent accidental binary serialization.
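
An additive change then looks like this (type and field names are illustrative):

```csharp
// v2 adds an optional field with a default value. Old senders omit it
// (it deserializes to null); old receivers ignore it as an unknown field.
// Existing fields keep their names and types.
public sealed record SiteHealthReport(
    string SiteId,
    DateTime TimestampUtc,
    int BufferDepth,
    double? ScriptErrorRate = null);   // added in a later version
```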

### REQ-COM-5b: Namespace & Folder Convention

All types in Commons are organized by **category** and **domain area** using a consistent namespace and folder hierarchy:

```
ScadaLink.Commons/
├── Types/                        # REQ-COM-1: Shared data types
│   ├── DataType.cs
│   ├── RetryPolicy.cs
│   ├── Result.cs
│   └── Enums/
│       ├── InstanceState.cs
│       ├── DeploymentStatus.cs
│       ├── AlarmState.cs
│       ├── AlarmTriggerType.cs
│       └── ConnectionHealth.cs
├── Interfaces/                   # Shared interfaces by concern
│   ├── Protocol/                 # REQ-COM-2: Protocol abstraction
│   │   ├── IDataConnection.cs
│   │   ├── TagValue.cs
│   │   └── SubscriptionCallback.cs
│   ├── Repositories/             # REQ-COM-4: Per-component repository interfaces
│   │   ├── ITemplateEngineRepository.cs
│   │   ├── IDeploymentManagerRepository.cs
│   │   ├── ISecurityRepository.cs
│   │   ├── IInboundApiRepository.cs
│   │   ├── IExternalSystemRepository.cs
│   │   ├── INotificationRepository.cs
│   │   └── ICentralUiRepository.cs
│   └── Services/                 # REQ-COM-4a: Cross-cutting service interfaces
│       └── IAuditService.cs
├── Entities/                     # REQ-COM-3: Domain entity POCOs, by domain area
│   ├── Templates/
│   │   ├── Template.cs
│   │   ├── TemplateAttribute.cs
│   │   ├── TemplateAlarm.cs
│   │   ├── TemplateScript.cs
│   │   └── TemplateComposition.cs
│   ├── Instances/
│   │   ├── Instance.cs
│   │   ├── InstanceAttributeOverride.cs
│   │   ├── InstanceConnectionBinding.cs
│   │   └── Area.cs
│   ├── Sites/
│   │   ├── Site.cs
│   │   ├── DataConnection.cs
│   │   └── SiteDataConnectionAssignment.cs
│   ├── ExternalSystems/
│   │   ├── ExternalSystemDefinition.cs
│   │   ├── ExternalSystemMethod.cs
│   │   └── DatabaseConnectionDefinition.cs
│   ├── Notifications/
│   │   ├── NotificationList.cs
│   │   ├── NotificationRecipient.cs
│   │   └── SmtpConfiguration.cs
│   ├── InboundApi/
│   │   ├── ApiKey.cs
│   │   └── ApiMethod.cs
│   ├── Security/
│   │   ├── LdapGroupMapping.cs
│   │   └── SiteScopeRule.cs
│   ├── Deployment/
│   │   ├── DeploymentRecord.cs
│   │   └── SystemArtifactDeploymentRecord.cs
│   ├── Scripts/
│   │   └── SharedScript.cs
│   └── Audit/
│       └── AuditLogEntry.cs
└── Messages/                     # REQ-COM-5: Cross-component message contracts, by concern
    ├── Deployment/
    ├── Lifecycle/
    ├── Health/
    ├── Communication/
    ├── Streaming/
    ├── DebugView/
    ├── ScriptExecution/
    └── Artifacts/
```

**Naming rules**:

- Namespaces mirror the folder structure: `ScadaLink.Commons.Entities.Templates`, `ScadaLink.Commons.Interfaces.Repositories`, etc.
- Interface names use the `I` prefix: `ITemplateEngineRepository`, `IAuditService`, `IDataConnection`.
- Entity classes are named after the domain concept (no suffixes like `Entity` or `Model`): `Template`, `Instance`, `Site`.
- Message contracts are named as commands, events, or responses: `DeployInstanceCommand`, `DeploymentStatusResponse`, `AttributeValueChanged`.
- Enums use singular names: `AlarmState`, not `AlarmStates`.

### REQ-COM-6: No Business Logic

Commons must contain only:

- Data structures (records, classes, structs)
- Interfaces
- Enums
- Constants

It must **not** contain any business logic, service implementations, actor definitions, or orchestration code. Any method bodies must be limited to trivial construction and validation logic (e.g., factory methods, invariant checks in constructors).

### REQ-COM-7: Minimal Dependencies

Commons must depend only on core .NET libraries (`System.*`, `Microsoft.Extensions.Primitives` if needed). It must **not** reference:

- Akka.NET or any Akka.* packages
- ASP.NET Core or any Microsoft.AspNetCore.* packages
- Entity Framework Core or any Microsoft.EntityFrameworkCore.* packages
- Any third-party libraries requiring paid licenses

This ensures Commons can be referenced by all components without introducing transitive dependency conflicts.

---

## Dependencies

- **None** — only core .NET SDK.

## Interactions

- **All component libraries**: Reference Commons for shared types, interfaces, entity classes, and contracts.
- **Configuration Database**: Implements the repository interfaces defined in Commons. Maps the POCO entity classes to the database via EF Core Fluent API.
- **Host**: References Commons transitively through the component libraries.

183
docs/requirements/Component-Communication.md
Normal file

# Component: Central–Site Communication

## Purpose

The Communication component manages all messaging between the central cluster and site clusters using Akka.NET. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs).

## Location

Both central and site clusters. Each side has communication actors that handle message routing.

## Responsibilities

- Resolve site addresses from the configuration database and maintain a cached address map.
- Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist.
- Route messages between central and site clusters in a hub-and-spoke topology.
- Broker requests from external systems (via central) to sites and return responses.
- Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
- Detect site connectivity status for health monitoring.

## Communication Patterns

### 1. Deployment (Central → Site)

- **Pattern**: Request/Response.
- Central sends a flattened configuration to a site.
- Site Runtime receives, compiles scripts, creates/updates Instance Actors, and responds with success/failure.
- No buffering at central — if the site is unreachable, the deployment fails immediately.

### 2. Instance Lifecycle Commands (Central → Site)

- **Pattern**: Request/Response.
- Central sends disable, enable, or delete commands for specific instances.
- Site Runtime processes the command and responds with success/failure.
- If the site is unreachable, the command fails immediately (no buffering).

### 3. System-Wide Artifact Deployment (Central → Site(s))

- **Pattern**: Broadcast with per-site acknowledgment (deploy to all sites), or targeted to a single site (per-site deployment).
- When shared scripts, external system definitions, database connections, data connections, notification lists, or SMTP configuration are explicitly deployed, central sends them to the target site(s).
- Each site acknowledges receipt and reports success/failure independently.

### 4. Integration Routing (External System → Central → Site → Central → External System)

- **Pattern**: Request/Response (brokered).
- External system sends a request to central (e.g., MES requests machine values).
- Central routes the request to the appropriate site.
- Site reads values from the Instance Actor and responds.
- Central returns the response to the external system.

### 5. Recipe/Command Delivery (External System → Central → Site)

- **Pattern**: Fire-and-forget with acknowledgment.
- External system sends a command to central (e.g., recipe manager sends recipe).
- Central routes to the site.
- Site applies and acknowledges.

### 6. Debug Streaming (Site → Central)

- **Pattern**: Subscribe/stream with initial snapshot.
- Central sends a subscribe request for a specific instance (identified by unique name).
- Site requests a **snapshot** of all current attribute values and alarm states from the Instance Actor and sends it to central.
- Site then subscribes to the **site-wide Akka stream** filtered by the instance's unique name and forwards attribute value changes and alarm state changes to central.
- Attribute value stream messages: `[InstanceUniqueName].[AttributePath].[AttributeName]`, value, quality, timestamp.
- Alarm state stream messages: `[InstanceUniqueName].[AlarmName]`, state (active/normal), priority, timestamp.
- Central sends an unsubscribe request when the debug view closes. The site removes its stream subscription.
- The stream is session-based and temporary.

### 6a. Debug Snapshot (Central → Site)

- **Pattern**: Request/Response (one-shot, no subscription).
- Central sends a `DebugSnapshotRequest` (identified by instance unique name) to the site.
- Site's Deployment Manager routes to the Instance Actor by unique name.
- Instance Actor builds and returns a `DebugViewSnapshot` with all current attribute values and alarm states (same payload as the streaming initial snapshot).
- No subscription is created; no stream is established.
- Uses the 30-second `QueryTimeout`.

### 7. Health Reporting (Site → Central)

- **Pattern**: Periodic push.
- Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.

### 8. Remote Queries (Central → Site)

- **Pattern**: Request/Response.
- Central queries sites for:
  - Parked messages (store-and-forward dead letters).
  - Site event logs.
  - Instance debug snapshots (attribute values and alarm states).
- Central can also send management commands:
  - Retry or discard parked messages.

## Topology

```
Central Cluster
├── ClusterClient → Site A Cluster (SiteCommunicationActor via Receptionist)
├── ClusterClient → Site B Cluster (SiteCommunicationActor via Receptionist)
└── ClusterClient → Site N Cluster (SiteCommunicationActor via Receptionist)

Site Clusters
└── ClusterClient → Central Cluster (CentralCommunicationActor via Receptionist)
```

- Sites do **not** communicate with each other.
- All inter-cluster communication flows through central.
- Both **CentralCommunicationActor** and **SiteCommunicationActor** are registered with their cluster's **ClusterClientReceptionist** for cross-cluster discovery.

## Site Address Resolution

Central discovers site addresses through the **configuration database**, not runtime registration:

- Each site record in the Sites table includes optional **NodeAAddress** and **NodeBAddress** fields containing base Akka addresses of the site's cluster nodes (e.g., `akka.tcp://scadalink@host:port`).
- The **CentralCommunicationActor** loads all site addresses from the database at startup and creates one **ClusterClient per site**, configured with both NodeA and NodeB as contact points.
- The address cache is **refreshed every 60 seconds** and **on-demand** when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
- When routing a message to a site, central sends via `ClusterClient.Send("/user/site-communication", msg)`. **ClusterClient handles failover between NodeA and NodeB internally** — there is no application-level NodeA preference/NodeB fallback logic.
- **Heartbeats** from sites serve **health monitoring only** — they do not serve as a registration or address discovery mechanism.
- If no addresses are configured for a site, messages to that site are **dropped** and the caller's Ask times out.
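
A sketch of the per-site ClusterClient setup (variable names are assumptions; the `/system/receptionist` path suffix is Akka.NET's default receptionist location):

```csharp
// One ClusterClient per site, with both nodes as contact points.
// ClusterClient fails over between NodeA and NodeB internally.
var contacts = ImmutableHashSet.Create(
    ActorPath.Parse($"{site.NodeAAddress}/system/receptionist"),
    ActorPath.Parse($"{site.NodeBAddress}/system/receptionist"));

var client = system.ActorOf(
    ClusterClient.Props(
        ClusterClientSettings.Create(system).WithInitialContacts(contacts)),
    $"site-client-{site.Id}");

// Route a message to the site's communication actor:
client.Tell(new ClusterClient.Send("/user/site-communication", message));
```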

### Site → Central Communication

- Site nodes configure a list of **CentralContactPoints** (both central node addresses) instead of a single `CentralActorPath`.
- The site creates a **ClusterClient** using the central contact points and sends heartbeats, health reports, and other messages via `ClusterClient.Send("/user/central-communication", msg)`.
- ClusterClient handles automatic failover between central nodes — if the active central node goes down, the site's ClusterClient reconnects to the standby node transparently.

## Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

| Pattern | Default Timeout | Rationale |
|---------|-----------------|-----------|
| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
| 2. Instance Lifecycle | 30 seconds | Stopping/starting actors is fast |
| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
| 4. Integration Routing | 30 seconds | External system waiting for response; Inbound API per-method timeout may cap this further |
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |

Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.
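
A caller-side sketch of pattern 1 with its 120-second timeout (type and variable names are assumptions):

```csharp
try
{
    var response = await client.Ask<DeploymentStatusResponse>(
        new ClusterClient.Send("/user/site-communication", deployCommand),
        TimeSpan.FromSeconds(120));   // Pattern 1: Deployment
}
catch (AskTimeoutException)
{
    // No buffering or retry at central: surface the failure to the caller,
    // who re-initiates the deployment from the UI.
}
```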

## Transport Configuration

Akka.NET remoting provides the underlying transport for both intra-cluster communication and ClusterClient connections. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:

- **Transport heartbeat interval**: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
- **Failure detection threshold**: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
- **Reconnection**: ClusterClient handles reconnection and failover between contact points automatically for cross-cluster communication. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.

## Application-Level Correlation

All request/response messages include an application-level **correlation ID** to ensure correct pairing of requests and responses, even across reconnection events:

- Deployments include a **deployment ID** and **revision hash** for idempotency (see Deployment Manager).
- Lifecycle commands include a **command ID** for deduplication.
- Remote queries include a **query ID** for response correlation.

This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.

## Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee — messages to a given site are processed in the order they are sent. Callers do not need to handle out-of-order delivery.

## ManagementActor and the Management API

The ManagementActor is registered at the well-known path `/user/management` on central nodes. External tools (primarily the CLI) do not connect via Akka.NET remoting. Instead, they send commands to the Central Host's HTTP Management API (`POST /management`), which authenticates the caller and dispatches the command to the ManagementActor. The CLI therefore does not participate in cluster membership and does not affect the hub-and-spoke topology.

## Connection Failure Behavior

- **In-flight messages**: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central** — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
- **Debug streams**: Any connection interruption (failover or network blip) kills the debug stream. The engineer must reopen the debug view in the Central UI to re-establish the subscription with a fresh snapshot. There is no auto-resume.

## Failover Behavior

- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Site ClusterClients automatically reconnect to the standby central node via their configured contact points.
- **Site failover**: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central's per-site ClusterClient automatically reconnects to the surviving site node. Ongoing debug streams are interrupted and must be re-established by the engineer.

## Dependencies

- **Akka.NET Remoting + ClusterClient**: Provides the transport layer. ClusterClient/ClusterClientReceptionist used for all cross-cluster messaging.
- **Cluster Infrastructure**: Manages node roles and failover detection.
- **Configuration Database**: Provides site node addresses (NodeAAddress, NodeBAddress) for address resolution.

## Interactions

- **Deployment Manager (central)**: Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and receive status.
- **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
- **Central UI**: Debug view requests and remote queries flow through communication.
- **Health Monitoring**: Receives periodic health reports from sites.
- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication.
- **Site Event Logging**: Event log queries are routed through communication.
- **Management Service**: The ManagementActor handles administrative commands dispatched from the Central Host's HTTP Management API (`POST /management`). CLI traffic arrives over HTTP rather than Akka.NET remoting, keeping it separate from inter-cluster messaging.

293
docs/requirements/Component-ConfigurationDatabase.md
Normal file

# Component: Configuration Database

## Purpose

The Configuration Database component provides the centralized data access layer for all system configuration data stored in MS SQL. It owns the database schema, Entity Framework DbContext, repository implementations, unit-of-work support, migration management, and audit logging. All central components access configuration data through this component — no other component interacts with the configuration database directly.

## Location

Central cluster only. Site clusters do not access the configuration database — they receive all configuration via artifact deployment and instance deployment through the Communication Layer, and read it from local SQLite at runtime.

## Responsibilities

- Define and own the complete database schema for the configuration MS SQL database via EF Core Fluent API mappings.
- Provide the Entity Framework Core DbContext as the single point of access to the configuration database.
- **Implement** the per-component repository interfaces defined in Commons. The interfaces and POCO entity classes live in Commons (persistence-ignorant); this component provides the EF Core implementations.
- **Implement** the `IAuditService` interface defined in Commons. Handles JSON serialization of entity state and writes audit entries within the same unit-of-work transaction as the change being audited.
- Provide unit-of-work support via EF Core's DbContext for transactional multi-entity operations.
- Manage schema migrations via EF Core Migrations with support for generating SQL scripts for manual execution in production.
- Support seed data for initial system setup.
- Manage connection pooling and connection lifecycle for the configuration database.

**Note**: This component does **not** manage the Machine Data Database. The Machine Data Database is a separate concern with different access patterns (direct ADO.NET connections from scripts via `Database.Connection()`).
|
||||
|
||||
---
|
||||
|
||||
## Database Schema
|
||||
|
||||
The configuration database stores all central system data, organized by domain area:
|
||||
|
||||
### Template & Modeling
|
||||
- **Templates**: Template definitions (name, parent template reference, description).
|
||||
- **Template Attributes**: Attribute definitions per template (name, value, data type, lock flag, description, data source reference).
|
||||
- **Template Alarms**: Alarm definitions per template (name, description, priority, lock flag, trigger type, trigger configuration, on-trigger script reference).
|
||||
- **Template Scripts**: Script definitions per template (name, lock flag, C# source code, trigger type, trigger configuration, minimum time between runs, parameter definitions, return value definitions).
|
||||
- **Template Compositions**: Feature module composition relationships (composing template, composed template, module instance name).
|
||||
- **Instances**: Instance definitions (template reference, site reference, area reference, enabled/disabled state).
|
||||
- **Instance Attribute Overrides**: Per-instance attribute value overrides.
|
||||
- **Instance Connection Bindings**: Per-attribute data connection binding for each instance.
|
||||
- **Areas**: Hierarchical area definitions per site (name, parent area reference, site reference).
|
||||
|
||||
### Shared Scripts
|
||||
- **Shared Scripts**: System-wide reusable script definitions (name, C# source code, parameter definitions, return value definitions).
|
||||
|
||||
### Sites & Data Connections
|
||||
- **Sites**: Site definitions (name, identifier, description).
|
||||
- **Data Connections**: Data connection definitions (name, protocol type, connection details) with site assignments.
|
||||
|
||||
### External Systems & Database Connections
|
||||
- **External System Definitions**: External system contracts (name, connection details, retry settings).
|
||||
- **External System Methods**: API method definitions per external system (method name, parameter definitions, return type definitions).
|
||||
- **Database Connection Definitions**: Named database connections (name, connection details, retry settings).
|
||||
|
||||
### Notifications
|
||||
- **Notification Lists**: List definitions (name).
|
||||
- **Notification Recipients**: Recipients per list (name, email address).
|
||||
- **SMTP Configuration**: Email server settings.
|
||||
|
||||
### Inbound API
|
||||
- **API Keys**: Key definitions (name/label, key value, enabled flag).
|
||||
- **API Methods**: Method definitions (name, approved key references, parameter definitions, return value definitions, implementation script, timeout).
|
||||
|
||||
### Security
|
||||
- **LDAP Group Mappings**: Mappings between LDAP group names and system roles (Admin, Design, Deployment).
|
||||
- **Site Scoping Rules**: Per-mapping site scope restrictions for Deployment role.
|
||||
|
||||
### Deployment
|
||||
- **Deployment Records**: Deployment history per instance (timestamp, user, status, deployed configuration snapshot).
|
||||
- **System-Wide Artifact Deployment Records**: Deployment history for shared artifacts (timestamp, user, artifact type, status).
|
||||
|
||||
### Audit Logging
|
||||
- **Audit Log Entries**: Append-only audit trail (timestamp, user, action, entity type, entity ID, entity name, state as JSON). Stores only the after-state — change history is reconstructed by comparing consecutive entries. Entries are never modified or deleted. No retention policy — retained indefinitely. Indexed on timestamp, user, entity type, entity ID, and action for efficient filtering.
|
||||
|
||||
---
|
||||
|
||||
## Data Access Architecture
|
||||
|
||||
### DbContext
|
||||
|
||||
A single `ScadaLinkDbContext` (or a small number of bounded DbContexts if warranted) serves as the EF Core entry point. The DbContext:
|
||||
|
||||
- Maps the POCO entity classes defined in Commons to the database using **Fluent API only** — no data annotations on the entity classes.
|
||||
- Configures relationships, indexes, constraints, and value conversions.
|
||||
- Provides `SaveChangesAsync()` as the unit-of-work commit mechanism.
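As a rough illustration, a Fluent API mapping might look like the sketch below. The `Template` POCO shape, property names, and table name are assumptions for illustration, not the actual Commons definitions:

```csharp
using Microsoft.EntityFrameworkCore;

// Hypothetical Commons POCO, shown only to make the mapping readable.
public class Template
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public int? ParentTemplateId { get; set; }
    public Template? ParentTemplate { get; set; }
}

public class ScadaLinkDbContext : DbContext
{
    public ScadaLinkDbContext(DbContextOptions<ScadaLinkDbContext> options) : base(options) { }

    public DbSet<Template> Templates => Set<Template>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Fluent API only: the POCO itself carries no data annotations.
        modelBuilder.Entity<Template>(entity =>
        {
            entity.ToTable("Templates");
            entity.HasKey(t => t.Id);
            entity.Property(t => t.Name).HasMaxLength(200).IsRequired();
            entity.HasIndex(t => t.Name).IsUnique();

            // Self-referencing parent template relationship.
            entity.HasOne(t => t.ParentTemplate)
                  .WithMany()
                  .HasForeignKey(t => t.ParentTemplateId)
                  .OnDelete(DeleteBehavior.Restrict);
        });
    }
}
```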

### Per-Component Repository Implementations

Repository interfaces are defined in **Commons** alongside the POCO entity classes (see Component-Commons.md, REQ-COM-4). This component provides the **EF Core implementations** of those interfaces.

| Repository Interface (in Commons) | Consuming Component | Scope |
|---|---|---|
| `ITemplateEngineRepository` | Template Engine | Templates, attributes, alarms, scripts, compositions, instances, overrides, connection bindings, areas |
| `IDeploymentManagerRepository` | Deployment Manager | Current deployment status per instance, deployed configuration snapshots, system-wide artifact deployment status per site (no deployment history — audit log provides historical traceability) |
| `ISecurityRepository` | Security & Auth | LDAP group mappings, site scoping rules |
| `IInboundApiRepository` | Inbound API | API keys, API method definitions |
| `IExternalSystemRepository` | External System Gateway | External system definitions, method definitions, database connection definitions |
| `INotificationRepository` | Notification Service | Notification lists, recipients, SMTP configuration |
| `IHealthMonitoringRepository` | Health Monitoring | (Minimal — health data is in-memory; repository needed only if connectivity history is persisted in the future) |
| `ICentralUiRepository` | Central UI | Read-oriented queries spanning multiple domain areas for display purposes |

Each implementation class uses the DbContext internally and works with the POCO entity classes from Commons. Consuming components depend only on Commons (for interfaces and entities) — they never reference this component or EF Core directly. The DI container in the Host wires the implementations to the interfaces.
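A sketch of what one implementation might look like, assuming hypothetical method names on `ITemplateEngineRepository` (the real interface lives in Commons and may differ):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Illustrative sketch; interface method names are assumed.
public sealed class TemplateEngineRepository : ITemplateEngineRepository
{
    private readonly ScadaLinkDbContext _db;

    public TemplateEngineRepository(ScadaLinkDbContext db) => _db = db;

    // Mutations only stage changes; nothing hits the database yet.
    public void AddTemplate(Template template) => _db.Templates.Add(template);

    public Task<Template?> GetTemplateAsync(int id, CancellationToken ct = default) =>
        _db.Templates
           .Include(t => t.ParentTemplate)
           .FirstOrDefaultAsync(t => t.Id == id, ct);

    // Exposes the DbContext's unit-of-work commit to the consuming component.
    public Task SaveChangesAsync(CancellationToken ct = default) => _db.SaveChangesAsync(ct);
}
```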

### Unit of Work

EF Core's DbContext naturally provides unit-of-work semantics:

- Multiple entity modifications within a single request are tracked by the DbContext.
- `SaveChangesAsync()` commits all pending changes in a single database transaction.
- If any part fails, the entire transaction rolls back.
- **Optimistic concurrency** is used on deployment status records and instance lifecycle state via EF Core `rowversion` / concurrency tokens. This prevents stale deployment status transitions (e.g., two concurrent requests both trying to update the same instance's status). Template editing remains **last-write-wins** by design — optimistic concurrency is intentionally not applied to template content.
- For operations that span multiple repository calls (e.g., creating a template with attributes, alarms, and scripts), the consuming component uses a single DbContext instance (via DI scoping) to ensure atomicity.
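To make the concurrency split concrete, a `rowversion` token on a deployment status entity might be configured and handled roughly like this (entity and property names are illustrative):

```csharp
// In OnModelCreating: SQL Server rowversion column as the concurrency token.
modelBuilder.Entity<DeploymentStatus>()
    .Property(d => d.RowVersion)
    .IsRowVersion();

// In the Deployment Manager: a stale status transition surfaces as an exception.
try
{
    status.State = "Deploying";
    await db.SaveChangesAsync();
}
catch (DbUpdateConcurrencyException)
{
    // Another request updated this record first; reload and re-evaluate
    // instead of silently overwriting the newer status.
}
```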

### Example Transactional Flow

```
Template Engine: Create Template
│
├── repository.AddTemplate(template) // template is a Commons POCO
├── repository.AddAttributes(attributes) // attributes are Commons POCOs
├── repository.AddAlarms(alarms) // alarms are Commons POCOs
├── repository.AddScripts(scripts) // scripts are Commons POCOs
└── repository.SaveChangesAsync() // single transaction commits all
```

---

## Audit Logging

The Configuration Database component implements the `IAuditService` interface (defined in Commons), providing audit logging as a built-in capability of the data access layer.

### IAuditService Implementation

Components call `IAuditService` after a successful operation:

```
IAuditService.LogAsync(user, action, entityType, entityId, entityName, afterState)
```

- **`user`**: The authenticated AD user who performed the action.
- **`action`**: The type of operation (`Create`, `Update`, `Delete`, `Deploy`, `Disable`, `Enable`).
- **`entityType`**: What was changed (`Template`, `Instance`, `SharedScript`, `Alarm`, `ExternalSystem`, `DatabaseConnection`, `NotificationList`, `ApiKey`, `ApiMethod`, `Area`, `Site`, `DataConnection`, `LdapGroupMapping`).
- **`entityId`**: Unique identifier of the specific entity.
- **`entityName`**: Human-readable name of the entity.
- **`afterState`**: The entity's state after the change, which the implementation serializes as JSON. Null for deletes.
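The call shape above suggests an interface along these lines; the actual declaration lives in Commons and may differ in detail:

```csharp
using System.Threading;
using System.Threading.Tasks;

public interface IAuditService
{
    // afterState is serialized to JSON by the implementation; pass null for deletes.
    Task LogAsync(
        string user,
        string action,
        string entityType,
        string entityId,
        string entityName,
        object? afterState,
        CancellationToken cancellationToken = default);
}
```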

### Transactional Guarantee

Audit entries are written **synchronously** within the same database transaction as the change. The `IAuditService` implementation adds an `AuditLogEntry` to the current DbContext. When the calling component calls `SaveChangesAsync()`, both the change and the audit entry commit together. This guarantees:

- If the change succeeds, the audit entry is always recorded.
- If the change fails and rolls back, the audit entry is also rolled back.
- No audit entries are lost due to process crashes between the change and the audit write.

### Integration Example

```
Template Engine: Update Template
│
├── repository.UpdateTemplate(template)
├── auditService.LogAsync(user, "Update", "Template", template.Id,
│ template.Name, template)
└── repository.SaveChangesAsync() ← both the change and audit entry commit together
```

### Audit Entry Schema

| Field | Type | Description |
|-------|------|-------------|
| **Id** | Long / GUID | Unique identifier for the audit entry. |
| **Timestamp** | DateTimeOffset | When the action occurred (UTC). |
| **User** | String | Authenticated AD username. |
| **Action** | String | The type of operation. |
| **EntityType** | String | What was changed. |
| **EntityId** | String | Unique identifier of the entity. |
| **EntityName** | String | Human-readable name (for display without deserializing state). |
| **State** | nvarchar(max) | Entity state after the change, serialized as JSON. Null for deletes. |

### State Serialization

- Entity state is serialized as **JSON** using the standard .NET JSON serializer.
- JSON is stored in `nvarchar(max)` and is queryable via SQL Server's `JSON_VALUE` and `OPENJSON` functions.
- For deletes, the state is null. The previous state can be found by querying the most recent prior entry for the same entity.
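Because the state column holds JSON, audit entries can be filtered and projected on serialized properties directly in T-SQL. The column names below follow the audit entry schema; the table name itself is an assumption:

```sql
-- Most recent 50 entries for one template, projecting a property out of the JSON state.
SELECT TOP (50)
    [Timestamp],
    [User],
    [Action],
    JSON_VALUE([State], '$.Name') AS NameAfterChange
FROM AuditLogEntries
WHERE EntityType = 'Template'
  AND EntityId = @entityId
ORDER BY [Timestamp] DESC;
```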

### Granularity

- **One audit entry per save operation**. When a user edits a template and changes multiple attributes in a single save, one entry is created with the full entity state after the save.

### Reconstructing Change History

Since only the after-state is stored, change history for an entity is reconstructed by querying all entries for that entity ordered by timestamp. Comparing consecutive entries reveals what changed at each step. This is a query-time concern handled by the Central UI.
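A minimal sketch of that comparison, diffing the top-level properties of two consecutive JSON states (the helper name is hypothetical; nested objects are compared by raw text only):

```csharp
using System.Collections.Generic;
using System.Text.Json;

static IEnumerable<string> DiffStates(string earlierJson, string laterJson)
{
    using var earlier = JsonDocument.Parse(earlierJson);
    using var later = JsonDocument.Parse(laterJson);

    foreach (var prop in later.RootElement.EnumerateObject())
    {
        // Report properties that are new or whose serialized value changed.
        var existed = earlier.RootElement.TryGetProperty(prop.Name, out var old);
        if (!existed || old.GetRawText() != prop.Value.GetRawText())
            yield return $"{prop.Name}: {(existed ? old.GetRawText() : "(absent)")} -> {prop.Value.GetRawText()}";
    }
}
```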

### Audited Actions

| Category | Actions |
|----------|---------|
| Templates | Create, edit, delete templates |
| Scripts | Create, edit, delete template scripts and shared scripts |
| Alarms | Create, edit, delete alarm definitions |
| Instances | Create, override values, bind connections, area assignment, disable, enable, delete |
| Deployments | Deploy to instance (who, what, which instance, success/failure) |
| System-Wide Artifact Deployments | Deploy shared scripts / external system definitions / DB connections / data connections / notification lists / SMTP config to site(s) (who, what, which site(s), result) |
| External Systems | Create, edit, delete definitions |
| Database Connections | Create, edit, delete definitions |
| Notification Lists | Create, edit, delete lists and recipients |
| Inbound API | API key create, enable/disable, delete. API method create, edit, delete |
| Areas | Create, edit, delete area definitions |
| Sites & Data Connections | Create, edit, delete sites. Define and assign data connections to sites |
| Security/Admin | Role mapping changes, site permission changes |

### Query Capabilities

The Central UI audit log viewer can filter by:
- **User**: Who made the change.
- **Entity type**: What kind of entity was changed.
- **Action type**: What kind of operation was performed.
- **Time range**: When the change occurred.
- **Specific entity ID/name**: Changes to a particular entity.

Results are returned in reverse chronological order (most recent first) with pagination support.

---

## Migration Management

### Entity Framework Core Migrations

- Schema changes are managed via EF Core Migrations (`dotnet ef migrations add`, `dotnet ef migrations script`).
- Each migration is a versioned, incremental schema change.

### Development Environment
- Migrations are **auto-applied** at application startup using `dbContext.Database.MigrateAsync()`.
- This allows rapid iteration without manual SQL execution.

### Production Environment
- Migrations are **never auto-applied**.
- SQL scripts are generated via `dotnet ef migrations script --idempotent` and reviewed by a DBA or engineer.
- Scripts are executed manually in SQL Server Management Studio (SSMS) or equivalent tooling.
- The Host startup in production validates that the database schema version matches the expected migration level and fails fast with a clear error if not.
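The two environments' startup behavior can be sketched as follows; `env` and `dbContext` are assumed to come from the Host's startup and DI setup:

```csharp
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

if (env.IsDevelopment())
{
    // Development: apply pending migrations automatically.
    await dbContext.Database.MigrateAsync();
}
else
{
    // Production: never migrate automatically; fail fast on schema drift.
    var pending = await dbContext.Database.GetPendingMigrationsAsync();
    if (pending.Any())
        throw new InvalidOperationException(
            "Configuration database schema is out of date. Pending migrations: "
            + string.Join(", ", pending)
            + ". Generate and apply the reviewed SQL script before starting the Host.");
}
```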

### Migration Script Generation

```bash
# Generate idempotent SQL script for all pending migrations
dotnet ef migrations script --idempotent --output migration.sql --project <ConfigDbProject>

# Generate script from a specific migration to another
dotnet ef migrations script FromMigration ToMigration --output migration.sql
```

Generated scripts are idempotent — they can be safely re-run without causing errors or duplicate changes.

---

## Seed Data

The Configuration Database supports seeding initial data required for the system to be usable after a fresh installation. Seed data is applied as part of the migration pipeline.

### Seed Data Includes
- Default system configuration values.
- Any baseline reference data required by the application.

### Mechanism
- Seed data is defined using EF Core's `HasData()` in entity configurations or in dedicated seed migrations.
- Seed data is included in the generated SQL scripts, so it is applied alongside schema changes in both development and production.
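With `HasData()`, a baseline row might be seeded like the sketch below (the entity and values are purely illustrative); EF Core folds the data into the migration, so it lands in the generated SQL as well:

```csharp
// Seed rows require explicit primary key values so EF Core can diff them
// across migrations and emit idempotent inserts/updates.
modelBuilder.Entity<SystemSetting>().HasData(
    new SystemSetting { Id = 1, Key = "DeploymentTimeoutSeconds", Value = "300" });
```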

---

## Connection Management

- Connection strings are provided via the Host's `DatabaseConfiguration` options (bound from `appsettings.json`).
- EF Core manages connection pooling via the underlying ADO.NET SQL Server provider.
- The DbContext is registered as a **scoped** service in the DI container, ensuring each request/operation gets its own instance.
- No connection management for the Machine Data Database — that is handled separately by consumers (Inbound API scripts, external system gateway).
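Put together, the Host-side registration might look like the sketch below (option and implementation type names are assumptions):

```csharp
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;

// AddDbContext registers ScadaLinkDbContext with a scoped lifetime by default.
services.AddDbContext<ScadaLinkDbContext>((provider, options) =>
{
    var dbConfig = provider.GetRequiredService<IOptions<DatabaseConfiguration>>().Value;
    options.UseSqlServer(dbConfig.ConfigurationConnectionString);
});

// Repository and audit implementations are scoped so they share the
// operation's DbContext instance (one unit of work per operation).
services.AddScoped<ITemplateEngineRepository, TemplateEngineRepository>();
services.AddScoped<IAuditService, EfCoreAuditService>();
```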

---

## Dependencies

- **Entity Framework Core**: ORM, DbContext, migrations, change tracking.
- **Microsoft.EntityFrameworkCore.SqlServer**: SQL Server database provider.
- **MS SQL Server**: The configuration database instance.
- **Commons**: POCO entity classes and repository interfaces that this component maps and implements.

## Interactions

- **Template Engine**: Uses `ITemplateEngineRepository` for all template, instance, and area data operations.
- **Deployment Manager**: Uses `IDeploymentManagerRepository` for deployment records and status tracking.
- **Security & Auth**: Uses `ISecurityRepository` for LDAP group mappings and site scoping.
- **Inbound API**: Uses `IInboundApiRepository` for API keys and method definitions.
- **External System Gateway**: Uses `IExternalSystemRepository` for external system and database connection definitions.
- **Notification Service**: Uses `INotificationRepository` for notification lists and SMTP configuration.
- **Central UI**: Uses `ICentralUiRepository` for read-oriented queries across domain areas, including audit log queries for the audit log viewer.
- **All central components that modify state**: Call `IAuditService.LogAsync()` after successful operations to record audit entries within the same transaction.
- **Host**: Provides database connection configuration. Registers DbContext, repository implementations, and `IAuditService` implementation in the DI container. Triggers auto-migration in development or validates schema version in production.
docs/requirements/Component-DataConnectionLayer.md

# Component: Data Connection Layer

## Purpose

The Data Connection Layer provides a uniform interface for reading from and writing to physical machines at site clusters. It abstracts protocol-specific details behind a common interface, manages subscriptions, and delivers live tag value updates to Instance Actors. It is a **clean data pipe** — it performs no evaluation of triggers, alarm conditions, or business logic.

## Location

Site clusters only. Central does not interact with machines directly.

## Responsibilities

- Manage data connections defined centrally and deployed to sites as part of artifact deployment (OPC UA servers, LmxProxy endpoints). Data connection definitions are stored in local SQLite after deployment.
- Establish and maintain connections to data sources based on deployed instance configurations.
- Subscribe to tag paths as requested by Instance Actors (based on attribute data source references in the flattened configuration).
- Deliver tag value updates to the requesting Instance Actors.
- Support writing values to machines (when Instance Actors forward `SetAttribute` write requests for data-connected attributes).
- Report data connection health status to the Health Monitoring component.

## Common Interface

Both OPC UA and LmxProxy implement the same interface:

```
IDataConnection : IAsyncDisposable
├── Connect(connectionDetails) → void
├── Disconnect() → void
├── Subscribe(tagPath, callback) → subscriptionId
├── Unsubscribe(subscriptionId) → void
├── Read(tagPath) → value
├── ReadBatch(tagPaths) → values
├── Write(tagPath, value) → void
├── WriteBatch(values) → void
├── WriteBatchAndWait(values, flagPath, flagValue, responsePath, responseValue, timeout) → bool
├── Status → ConnectionHealth
└── Disconnected → event Action?
```

The `Disconnected` event is raised by an adapter when it detects an unexpected connection loss (server offline, network failure, keep-alive timeout). The `DataConnectionActor` subscribes to this event to trigger the reconnection state machine. Additional protocols can be added by implementing this interface.
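Rendered as C#, the interface might look roughly like this. The async method shapes and the `TagValue` / `ConnectionHealth` types follow the tables in this document, but the exact signatures in source may differ:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public interface IDataConnection : IAsyncDisposable
{
    Task ConnectAsync(IDictionary<string, string> connectionDetails);
    Task DisconnectAsync();

    // Callback is invoked for every value update on the subscribed tag.
    Task<Guid> SubscribeAsync(string tagPath, Action<TagValue> callback);
    Task UnsubscribeAsync(Guid subscriptionId);

    Task<TagValue> ReadAsync(string tagPath);
    Task<IDictionary<string, TagValue>> ReadBatchAsync(IEnumerable<string> tagPaths);

    Task WriteAsync(string tagPath, object? value);
    Task WriteBatchAsync(IDictionary<string, object?> values);

    // Returns true when the response tag reaches the expected value in time.
    Task<bool> WriteBatchAndWaitAsync(
        IDictionary<string, object?> values,
        string flagPath, object? flagValue,
        string responsePath, object? responseValue,
        TimeSpan timeout);

    ConnectionHealth Status { get; }
    event Action? Disconnected;
}
```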

### Concrete Type Mappings

| IDataConnection | OPC UA SDK | LmxProxy (`RealLmxProxyClient`) |
|---|---|---|
| `Connect()` | OPC UA session establishment | gRPC `Connect` RPC with `x-api-key` metadata header, server returns `SessionId` |
| `Disconnect()` | Close OPC UA session | gRPC `Disconnect` RPC |
| `Subscribe(tagPath, callback)` | OPC UA Monitored Items | gRPC `Subscribe` server-streaming RPC (`stream VtqMessage`), cancelled via `CancellationTokenSource` |
| `Unsubscribe(id)` | Remove Monitored Item | Cancel the `CancellationTokenSource` for that subscription (stops streaming RPC) |
| `Read(tagPath)` | OPC UA Read | gRPC `Read` RPC → `VtqMessage` → `LmxVtq` |
| `ReadBatch(tagPaths)` | OPC UA Read (multiple nodes) | gRPC `ReadBatch` RPC → `repeated VtqMessage` → `IDictionary<string, LmxVtq>` |
| `Write(tagPath, value)` | OPC UA Write | gRPC `Write` RPC (throws on failure) |
| `WriteBatch(values)` | OPC UA Write (multiple nodes) | gRPC `WriteBatch` RPC (throws on failure) |
| `WriteBatchAndWait(...)` | OPC UA Write + poll for confirmation | `WriteBatch` + poll `Read` at 100ms intervals until response value matches or timeout |
| `Status` | OPC UA session state | `IsConnected` — true when `SessionId` is non-empty |
| `Disconnected` | `Session.KeepAlive` event fires with bad `ServiceResult` | gRPC subscription stream ends or throws non-cancellation `RpcException` |

### Common Value Type

Both protocols produce the same value tuple consumed by Instance Actors. Before the first value update arrives from the DCL, data-sourced attributes are held at **uncertain** quality by the Instance Actor (see Site Runtime — Initialization):

| Concept | ScadaLink Design | LmxProxy Wire Format | Local Type |
|---|---|---|---|
| Value container | `TagValue(Value, Quality, Timestamp)` | `VtqMessage { Tag, Value, TimestampUtcTicks, Quality }` | `LmxVtq(Value, TimestampUtc, Quality)` — readonly record struct |
| Quality | `QualityCode` enum: Good / Bad / Uncertain | String: `"Good"` / `"Uncertain"` / `"Bad"` | `LmxQuality` enum: Good / Uncertain / Bad |
| Timestamp | `DateTimeOffset` (UTC) | `int64` (DateTime.Ticks, UTC) | `DateTime` (UTC) |
| Value type | `object?` | `string` (parsed by client to double, bool, or string) | `object?` |

## Supported Protocols

### OPC UA
- Uses the **OPC Foundation .NET Standard Library** (`OPCFoundation.NetStandard.Opc.Ua.Client`).
- Session-based connection with endpoint discovery, certificate handling, and configurable security modes.
- Subscriptions via OPC UA Monitored Items with data change notifications (1000ms sampling, queue size 10, discard-oldest).
- Read/Write via OPC UA Read/Write services with StatusCode-based quality mapping.
- Disconnect detection via `Session.KeepAlive` event (see Disconnect Detection Pattern below).

### LmxProxy (Custom Protocol)

LmxProxy is a gRPC-based protocol for communicating with LMX data servers. The DCL includes its own proto-generated gRPC client (`RealLmxProxyClient`) — no external SDK dependency.

**Transport & Connection**:
- gRPC over HTTP/2, using proto-generated client stubs from `scada.proto` (service: `scada.ScadaService`). Pre-generated C# files are checked into `Adapters/LmxProxyGrpc/` to avoid running `protoc` in Docker (ARM64 compatibility).
- Default port: **50051**.
- Session-based: `Connect` RPC returns a `SessionId` used for all subsequent operations.
- Keep-alive: Managed by the LmxProxy server's session timeout. The DCL reconnect cycle handles session loss.

**Authentication & TLS**:
- API key-based authentication sent as `x-api-key` gRPC metadata header on every call. The server's `ApiKeyInterceptor` validates the header before the request reaches the service method. The API key is also included in the `ConnectRequest` body for session-level validation.
- Plain HTTP/2 (no TLS) for current deployments. The server supports TLS when configured.

**Subscriptions**:
- Server-streaming gRPC (`Subscribe` RPC returns `stream VtqMessage`).
- Configurable sampling interval (default: 0 = on-change).
- Wire format: `VtqMessage { tag, value (string), timestamp_utc_ticks (int64), quality (string: "Good"/"Uncertain"/"Bad") }`.
- Subscription lifetime managed by `CancellationTokenSource` — cancellation stops the streaming RPC.

**Client Implementation** (`RealLmxProxyClient`):
- Uses `Google.Protobuf` + `Grpc.Net.Client` (standard proto-generated stubs, no protobuf-net runtime IL emit).
- `ILmxProxyClientFactory` creates instances configured with host, port, and API key.
- Value conversion: string values from `VtqMessage` are parsed to `double`, `bool`, or left as `string`.
- Quality mapping: `"Good"` → `LmxQuality.Good`, `"Uncertain"` → `LmxQuality.Uncertain`, else `LmxQuality.Bad`.
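The two conversion bullets can be sketched as plain helpers (helper names are hypothetical; the enum mirrors the `LmxQuality` type named above):

```csharp
using System.Globalization;

enum LmxQuality { Good, Uncertain, Bad }

static object? ParseWireValue(string raw) =>
    double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out var d) ? d
    : bool.TryParse(raw, out var b) ? b
    : raw; // anything non-numeric, non-boolean stays a string

static LmxQuality ParseQuality(string quality) => quality switch
{
    "Good" => LmxQuality.Good,
    "Uncertain" => LmxQuality.Uncertain,
    _ => LmxQuality.Bad, // unknown quality strings are treated as Bad
};
```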

**Proto Source**: The `.proto` file originates from the LmxProxy server repository (`lmx/Proxy/Grpc/Protos/scada.proto` in ScadaBridge). The C# stubs are pre-generated and stored at `Adapters/LmxProxyGrpc/`.

**Test Infrastructure**: The `infra/lmxfakeproxy/` project provides a fake LmxProxy server that bridges to the OPC UA test server. It implements the full `scada.ScadaService` proto, enabling end-to-end testing of `RealLmxProxyClient` without a Windows LmxProxy deployment. See [test_infra_lmxfakeproxy.md](../test_infra/test_infra_lmxfakeproxy.md) for setup.

## Connection Configuration Reference

All settings are parsed from the data connection's `Configuration` JSON dictionary (stored as `IDictionary<string, string>` connection details). Invalid numeric values fall back to defaults silently.

### OPC UA Settings

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `endpoint` / `EndpointUrl` | string | `opc.tcp://localhost:4840` | OPC UA server endpoint URL |
| `SessionTimeoutMs` | int | `60000` | OPC UA session timeout in milliseconds |
| `OperationTimeoutMs` | int | `15000` | Transport operation timeout in milliseconds |
| `PublishingIntervalMs` | int | `1000` | Subscription publishing interval in milliseconds |
| `KeepAliveCount` | int | `10` | Keep-alive frames before session timeout |
| `LifetimeCount` | int | `30` | Subscription lifetime in publish intervals |
| `MaxNotificationsPerPublish` | int | `100` | Max notifications batched per publish cycle |
| `SamplingIntervalMs` | int | `1000` | Per-item server sampling rate in milliseconds |
| `QueueSize` | int | `10` | Per-item notification buffer size |
| `SecurityMode` | string | `None` | Preferred endpoint security: `None`, `Sign`, or `SignAndEncrypt` |
| `AutoAcceptUntrustedCerts` | bool | `true` | Accept untrusted server certificates |

### LmxProxy Settings

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Host` | string | `localhost` | LmxProxy server hostname |
| `Port` | int | `50051` | LmxProxy gRPC port |
| `ApiKey` | string | *(none)* | API key for `x-api-key` header authentication |
| `SamplingIntervalMs` | int | `0` | Subscription sampling interval: 0 = on-change, >0 = time-based (ms) |
| `UseTls` | bool | `false` | Use HTTPS instead of plain HTTP/2 for gRPC channel |

### Shared Settings (appsettings.json)

These are configured via `DataConnectionOptions` in `appsettings.json`, not per-connection:

| Setting | Default | Description |
|---------|---------|-------------|
| `ReconnectInterval` | 5s | Fixed interval between reconnection attempts |
| `TagResolutionRetryInterval` | 10s | Retry interval for unresolved tag paths |
| `WriteTimeout` | 30s | Timeout for write operations |
| `LmxProxyKeepAliveInterval` | 30s | Keep-alive ping interval for LmxProxy sessions |
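An `appsettings.json` fragment matching that table might look like this; the section name `DataConnectionOptions` comes from the text above, while the `hh:mm:ss` TimeSpan notation is an assumption about how the options are bound:

```json
{
  "DataConnectionOptions": {
    "ReconnectInterval": "00:00:05",
    "TagResolutionRetryInterval": "00:00:10",
    "WriteTimeout": "00:00:30",
    "LmxProxyKeepAliveInterval": "00:00:30"
  }
}
```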

## Subscription Management

- When an Instance Actor is created (as part of the Site Runtime actor hierarchy), it registers its data source references with the Data Connection Layer.
- The DCL subscribes to the tag paths using the concrete connection details from the flattened configuration.
- Tag value updates are delivered directly to the requesting Instance Actor.
- When an Instance Actor is stopped (due to disable, delete, or redeployment), the DCL cleans up the associated subscriptions.
- When a new Instance Actor is created for a redeployment, subscriptions are established fresh based on the new configuration.

## Write-Back Support

- When a script calls `Instance.SetAttribute` for an attribute with a data source reference, the Instance Actor sends a write request to the DCL.
- The DCL writes the value to the physical device via the appropriate protocol.
- The existing subscription picks up the confirmed new value from the device and delivers it back to the Instance Actor as a standard value update.
- The Instance Actor's in-memory value is **not** updated until the device confirms the write.

## Value Update Message Format

Each value update delivered to an Instance Actor includes:
- **Tag path**: The relative path of the attribute's data source reference.
- **Value**: The new value from the device.
- **Quality**: Data quality indicator (good, bad, uncertain).
- **Timestamp**: When the value was read from the device.

## Connection Actor Model

Each data connection is managed by a dedicated connection actor that uses the Akka.NET **Become/Stash** pattern to model its lifecycle as a state machine:

- **Connecting**: The actor attempts to establish the connection. Subscription requests and write commands received during this phase are **stashed** (buffered in the actor's stash).
- **Connected**: The actor is actively servicing subscriptions. On entering this state, all stashed messages are unstashed and processed.
- **Reconnecting**: The connection was lost. The actor transitions back to a connecting-like state, stashing new requests while it retries.

This pattern ensures no messages are lost during connection transitions and is the standard Akka.NET approach for actors with I/O lifecycle dependencies.
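A minimal Akka.NET sketch of the pattern; the message types and the connect trigger are illustrative, not the actual protocol messages:

```csharp
using Akka.Actor;

public sealed record ConnectionEstablished;
public sealed record ConnectionLost;
public sealed record SubscribeRequest(string TagPath);

public class DataConnectionActor : ReceiveActor, IWithUnboundedStash
{
    public IStash Stash { get; set; } = null!;

    public DataConnectionActor() => Connecting();

    private void Connecting()
    {
        Receive<ConnectionEstablished>(_ =>
        {
            Become(Connected);
            Stash.UnstashAll(); // replay buffered subscribe/write requests
        });
        ReceiveAny(_ => Stash.Stash()); // buffer everything else until connected
    }

    private void Connected()
    {
        Receive<SubscribeRequest>(msg => { /* subscribe via the IDataConnection adapter */ });
        Receive<ConnectionLost>(_ => Become(Connecting)); // back to stashing while retrying
    }
}
```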

**LmxProxy-specific notes**: The `RealLmxProxyClient` holds the `SessionId` returned by the `Connect` RPC and includes it in all subsequent operations. Subscriptions use server-streaming gRPC — a background task reads from the `ResponseStream` and invokes the callback for each `VtqMessage`. When the stream breaks (server offline, network failure), the background task detects the `RpcException` or stream end and invokes the `onStreamError` callback, which triggers the adapter's `Disconnected` event. The DCL actor transitions to **Reconnecting**, pushes bad quality, disposes the client, and retries at the fixed interval.

**OPC UA-specific notes**: The `RealOpcUaClient` uses the OPC Foundation SDK's `Session.KeepAlive` event for proactive disconnect detection. The SDK sends keep-alive requests at the subscription's `KeepAliveCount × PublishingInterval` (default: 10s). When keep-alive fails, the `ConnectionLost` event fires, triggering the same reconnection flow. On reconnection, the DCL re-creates the OPC UA session and subscription, then re-adds all monitored items.

## Connection Lifecycle & Reconnection

The DCL manages connection lifecycle automatically:

1. **Connection drop detection**: When a connection to a data source is lost, the DCL immediately pushes a value update with quality `bad` for **every tag subscribed on that connection**. Instance Actors and their downstream consumers (alarms, scripts checking quality) see the staleness immediately.
2. **Auto-reconnect with fixed interval**: The DCL retries the connection at a configurable fixed interval (e.g., every 5 seconds). The retry interval is defined **per data connection**. This is consistent with the fixed-interval retry philosophy used throughout the system. Individual gRPC/OPC UA operations (reads, writes) fail immediately to the caller on error; there is no operation-level retry within the adapter.
|
||||
3. **Connection state transitions**: The DCL tracks each connection's state as `connected`, `disconnected`, or `reconnecting`. All transitions are logged to Site Event Logging.
|
||||
4. **Transparent re-subscribe**: On successful reconnection, the DCL automatically re-establishes all previously active subscriptions for that connection. Instance Actors require no action — they simply see quality return to `good` as fresh values arrive from restored subscriptions.
|
||||
|
||||
### Disconnect Detection Pattern
|
||||
|
||||
Each adapter implements the `IDataConnection.Disconnected` event to proactively signal connection loss to the `DataConnectionActor`. Detection uses two complementary paths:
|
||||
|
||||
**Proactive detection** (server goes offline between operations):
|
||||
- **OPC UA**: The OPC Foundation SDK fires `Session.KeepAlive` events at regular intervals. `RealOpcUaClient` hooks this event; when `ServiceResult.IsBad(e.Status)` (server unreachable, keep-alive timeout), it fires `ConnectionLost`. The `OpcUaDataConnection` adapter translates this into `IDataConnection.Disconnected`.
|
||||
- **LmxProxy**: gRPC server-streaming subscriptions run in background tasks reading from `ResponseStream`. When the server goes offline, the stream either ends normally (server closed) or throws a non-cancellation `RpcException`. `RealLmxProxyClient` invokes the `onStreamError` callback, which `LmxProxyDataConnection` translates into `IDataConnection.Disconnected`.
|
||||
|
||||
**Reactive detection** (failure discovered during an operation):
|
||||
- Both adapters wrap `ReadAsync` (and by extension `ReadBatchAsync`) with exception handling. If a read throws a non-cancellation exception, the adapter calls `RaiseDisconnected()` and re-throws. The `DataConnectionActor`'s existing error handling catches the exception while the disconnect event triggers the reconnection state machine.
|
||||
|
||||
**Event marshalling**: The `DataConnectionActor` subscribes to `_adapter.Disconnected` in `PreStart()`. Since `Disconnected` may fire from a background thread (gRPC stream task, OPC UA keep-alive timer), the handler sends an `AdapterDisconnected` message to `Self`, marshalling the notification onto the actor's message loop. This triggers `BecomeReconnecting()` → bad quality push → retry timer.
|
||||
|
||||
**Once-only guard**: Both `LmxProxyDataConnection` and `OpcUaDataConnection` use a `volatile bool _disconnectFired` flag to ensure `RaiseDisconnected()` fires exactly once per connection session. The flag resets on successful reconnection (`ConnectAsync`).
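A sketch of that guard in an adapter base class. The `Interlocked` variant shown here closes the small race a bare `volatile` check-then-set leaves open; the type and member names are assumptions:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative adapter base showing the once-only disconnect guard.
public abstract class DataConnectionAdapterBase
{
    private int _disconnectFired;   // 0 = armed, 1 = already fired

    public event Action? Disconnected;

    // Safe to call from any thread (gRPC stream task, keep-alive handler,
    // or a failed read); the event fires at most once per connection session.
    protected void RaiseDisconnected()
    {
        if (Interlocked.Exchange(ref _disconnectFired, 1) == 1) return;
        Disconnected?.Invoke();
    }

    public virtual Task ConnectAsync(CancellationToken ct)
    {
        Interlocked.Exchange(ref _disconnectFired, 0);   // re-arm on successful reconnect
        return Task.CompletedTask;
    }
}
```

The actor side then simply does `_adapter.Disconnected += () => Self.Tell(new AdapterDisconnected());` in `PreStart()`, so all state changes stay on the actor's message loop.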
## Write Failure Handling

Writes to physical devices are **synchronous** from the script's perspective:

- If the write fails (connection down, device rejection, timeout), the error is **returned to the calling script**. Script authors can catch and handle write errors (log, notify, retry, etc.).
- Write failures are also logged to Site Event Logging.
- There is **no store-and-forward for device writes** — these are real-time control operations. Buffering stale setpoints for later application would be dangerous in an industrial context.

## Tag Path Resolution

When the DCL subscribes to a tag path from the flattened configuration but the path does not exist on the physical device (e.g., typo in the template, device firmware changed, device still booting):

1. The failure is **logged to Site Event Logging**.
2. The attribute is marked with quality `bad`.
3. The DCL **periodically retries resolution** at a configurable interval, accommodating devices that come online in stages or load modules after startup.
4. On successful resolution, the subscription activates normally and quality reflects the live value from the device.

Note: Pre-deployment validation at central does **not** verify that tag paths resolve to real tags on physical devices — that is a runtime concern handled here.

## Health Reporting

The DCL reports the following metrics to the Health Monitoring component via the existing periodic heartbeat:

- **Connection status**: `connected`, `disconnected`, or `reconnecting` per data connection.
- **Tag resolution counts**: Per connection, the number of total subscribed tags vs. successfully resolved tags. This gives operators visibility into misconfigured templates without needing to open the debug view for individual instances.

## Dependencies

- **Site Runtime (Instance Actors)**: Receives subscription registrations and delivers value updates. Receives write requests.
- **Health Monitoring**: Reports connection status.
- **Site Event Logging**: Logs connection status changes.

## Interactions

- **Site Runtime (Instance Actors)**: Bidirectional — delivers value updates, receives subscription registrations and write-back commands.
- **Health Monitoring**: Reports connection health periodically.
- **Site Event Logging**: Logs connection/disconnection events.
149
docs/requirements/Component-DeploymentManager.md
Normal file
@@ -0,0 +1,149 @@
# Component: Deployment Manager

## Purpose

The Deployment Manager orchestrates the process of deploying configurations from the central cluster to site clusters. It coordinates between the Template Engine (which produces flattened and validated configs), the Communication Layer (which delivers them), and tracks deployment status. It also manages system-wide artifact deployment and instance lifecycle commands (disable, enable, delete).

## Location

Central cluster only. The site-side deployment responsibilities (receiving configs, spawning Instance Actors) are handled by the Site Runtime component.

## Responsibilities

- Accept deployment requests from the Central UI for individual instances.
- Request flattened and validated configurations from the Template Engine.
- Request diffs between currently deployed and template-derived configurations from the Template Engine.
- Send flattened configurations to site clusters via the Communication Layer.
- Track deployment status (pending, in-progress, success, failed).
- Handle deployment failures gracefully — if a site is unreachable or the deployment fails, report the failure. No retry or buffering at central.
- If a central failover occurs during deployment, the deployment is treated as failed and must be re-initiated.
- Deploy system-wide artifacts (shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, SMTP configuration) to all sites or to an individual site on explicit request.
- Send instance lifecycle commands (disable, enable, delete) to sites via the Communication Layer.

## Deployment Flow

```
Engineer (UI) → Deployment Manager (Central)
    │
    ├── 1. Request validated + flattened config from Template Engine
    │      (validation includes flattening, script compilation,
    │       trigger references, connection binding completeness)
    ├── 2. If validation fails → return errors to UI, stop
    ├── 3. Send config to site via Communication Layer
    │         │
    │         ▼
    │      Site Runtime (Deployment Manager Singleton)
    │         ├── 4. Store new flattened config locally (SQLite)
    │         ├── 5. Compile scripts at site
    │         ├── 6. Create/update Instance Actor (with child Script + Alarm Actors)
    │         └── 7. Report success/failure back to central
    │
    └── 8. Update deployment status in config DB
```
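The deploy round-trip (steps 3 and 7) implies a pair of messages along these lines. These are hypothetical shapes, not the actual Commons contracts:

```csharp
using System;

// Hypothetical deploy request/response contracts.
public sealed record DeployInstanceConfig(
    Guid DeploymentId,           // unique per deployment attempt
    string InstanceId,
    string RevisionHash,         // produced by the Template Engine
    string FlattenedConfigJson); // validated, flattened configuration

public sealed record DeploymentResult(
    Guid DeploymentId,
    bool Success,
    string? FailureReason);      // e.g. script compilation error details
```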
## Deployment Identity & Idempotency

- Every deployment is assigned a unique **deployment ID** and includes the flattened configuration's **revision hash** (from the Template Engine).
- Site-side apply is **idempotent on deployment ID** — if the same deployment is received twice (e.g., after a timeout where the site actually applied it), the site responds with "already applied" rather than re-applying.
- Sites **reject stale configurations** — if a deployment carries an older revision hash than what is already applied, the site rejects it and reports the current version.
- After a central failover or timeout, the Deployment Manager **queries the site for current deployment state** before allowing a re-deploy. This prevents duplicate application and out-of-order config changes.
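Sketched as site-side logic, with the staleness check abstracted (how revision ordering is established is an assumption here; a bare hash alone carries no order, so some revision metadata must travel with it):

```csharp
using System;

public sealed record ApplyOutcome(bool Success, string Detail);

// Illustrative site-side apply guard.
public sealed class SiteApplyGuard
{
    private Guid _lastAppliedDeploymentId;
    private string _appliedRevisionHash = "";

    public ApplyOutcome Apply(Guid deploymentId, string revisionHash, Func<bool> applyConfig)
    {
        if (deploymentId == _lastAppliedDeploymentId)
            return new ApplyOutcome(true, "already applied");   // idempotent re-delivery

        if (IsOlderThanApplied(revisionHash))
            return new ApplyOutcome(false, $"stale revision; site is at {_appliedRevisionHash}");

        if (!applyConfig())   // store config, compile scripts, update Instance Actor
            return new ApplyOutcome(false, "apply failed; previous config remains active");

        _lastAppliedDeploymentId = deploymentId;
        _appliedRevisionHash = revisionHash;
        return new ApplyOutcome(true, "applied");
    }

    // Placeholder: ordering would come from revision metadata sent with the hash.
    private bool IsOlderThanApplied(string revisionHash) => false;
}
```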
## Operation Concurrency

All mutating operations on a single instance (deploy, disable, enable, delete) share a **per-instance operation lock**:

- Only one mutating operation per instance can be in-flight at a time. A second operation is rejected with an "operation in progress" error.
- **Different instances**: Operations on different instances can proceed **in parallel**, even at the same site. Each tracks status independently. This supports the bulk "deploy all out-of-date instances" operation efficiently.
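One way to sketch the per-instance lock is a concurrent map keyed by instance ID, so distinct instances never contend (the real Deployment Manager may track richer operation state):

```csharp
using System.Collections.Concurrent;

// Illustrative per-instance operation gate.
public sealed class InstanceOperationGate
{
    private readonly ConcurrentDictionary<string, string> _inFlight = new();

    // Returns false ("operation in progress") if a mutating op already holds the instance.
    public bool TryBegin(string instanceId, string operation) =>
        _inFlight.TryAdd(instanceId, operation);

    public void End(string instanceId) =>
        _inFlight.TryRemove(instanceId, out _);
}
```

With this shape, a second `TryBegin("pump-01", "disable")` is rejected while a deploy of `pump-01` is in flight, but `TryBegin("pump-02", "deploy")` proceeds in parallel.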
### Allowed State Transitions

| Current State | Deploy | Disable | Enable | Delete |
|---------------|--------|---------|--------|--------|
| Enabled | Yes | Yes | No (already enabled) | Yes |
| Disabled | Yes (enables on apply) | No (already disabled) | Yes | Yes |
| Not deployed | Yes (initial deploy) | No | No | No |

## System-Wide Artifact Deployment Failure Handling

When deploying artifacts (shared scripts, external system definitions, etc.) to all sites, each site reports success or failure **independently**:

- The deployment status shows a per-site result matrix.
- Successful sites are **not rolled back** if other sites fail.
- The engineer can retry failed sites individually (e.g., when an offline site comes back online).
- This is consistent with the hub-and-spoke independence model — one site's unavailability does not affect others.

## Deployment Status Persistence

- Only the **current deployment status** per instance is stored in the configuration database (pending, in-progress, success, failed).
- No deployment history table — the audit log (via IAuditService) already captures every deployment action with who, what, when, and result.
- The Deployment Manager uses current status to determine staleness (is this instance up-to-date?) and display deployment results in the UI.

## Deployment Scope

- Deployment is performed at the **individual instance level**.
- The UI may provide convenience operations (e.g., "deploy all out-of-date instances at Site A"), but these decompose into individual instance deployments.

## Diff View

Before deploying, the Deployment Manager can request a diff from the Template Engine showing:

- **Added** attributes, alarms, or scripts (new in the template since last deploy).
- **Removed** members (removed from template since last deploy).
- **Changed** values (attribute values, alarm thresholds, script code that differ).
- **Connection binding changes** (data connection references that changed).

## Deployed vs. Template-Derived State

The system maintains two views per instance:

- **Deployed Configuration**: What is currently running at the site, as of the last successful deployment.
- **Template-Derived Configuration**: What the instance would look like if deployed now, based on the current state of its template hierarchy and instance overrides.

These are compared to determine staleness and generate diffs.

## Deployable Artifacts

A deployment to a site includes the flattened instance configuration plus any system-wide artifacts that have changed:

- Shared scripts
- External system definitions
- Database connection definitions
- Data connection definitions
- Notification lists
- SMTP configuration

System-wide artifact deployment is a **separate action** from instance deployment, triggered explicitly by a user with the Deployment role. Artifacts can be deployed to all sites at once or to an individual site (per-site deployment via the Sites admin page).

## Site-Side Apply Atomicity

Applying a deployment at the site is **all-or-nothing per instance**:

- The site stores the new config, compiles all scripts, and creates/updates the Instance Actor as a single operation.
- If any step fails (e.g., script compilation), the entire deployment for that instance is rejected. The previous configuration remains active and unchanged.
- The site reports the specific failure reason (e.g., compilation error details) back to central.

## System-Wide Artifact Version Compatibility

- Cross-site version skew for artifacts (shared scripts, external system definitions, data connection definitions, etc.) is **supported** — sites can temporarily run different artifact versions after a partial deployment.
- Artifacts are self-contained and site-independent. A site running an older version of shared scripts continues to operate correctly with its current instance configurations.
- The Central UI clearly indicates which sites have pending artifact updates so engineers can remediate.

## Instance Lifecycle Commands

The Deployment Manager sends the following commands to sites via the Communication Layer:

- **Disable**: Instructs the site to stop the Instance Actor's data subscriptions, script triggers, and alarm evaluation. The deployed configuration is retained for re-enablement.
- **Enable**: Instructs the site to re-activate a disabled instance.
- **Delete**: Instructs the site to remove the running configuration and destroy the Instance Actor and its children. Store-and-forward messages are not cleared. If the site is unreachable, the delete command **fails** — the central side does not mark the instance as deleted until the site confirms.

## Dependencies

- **Template Engine**: Produces flattened configurations, diffs, and validation results.
- **Communication Layer**: Delivers configurations and lifecycle commands to sites.
- **Configuration Database (MS SQL)**: Stores deployment status and deployed configuration snapshots.
- **Security & Auth**: Enforces Deployment role (with optional site scoping).
- **Configuration Database (via IAuditService)**: Logs all deployment actions, system-wide artifact deployments, and instance lifecycle changes.

## Interactions

- **Central UI**: Engineers trigger deployments, view diffs/status, manage instance lifecycle, and deploy system-wide artifacts.
- **Template Engine**: Provides resolved and validated configurations.
- **Site Runtime**: Receives and applies configurations and lifecycle commands.
- **Health Monitoring**: Deployment failures contribute to site health status.
126
docs/requirements/Component-ExternalSystemGateway.md
Normal file
@@ -0,0 +1,126 @@
# Component: External System Gateway

## Purpose

The External System Gateway manages predefined integrations with external systems (e.g., MES, recipe managers) and database connections. It provides the runtime for invoking external API methods and executing database operations from scripts at site clusters.

## Location

Site clusters (executes calls directly to external systems, reads definitions from local SQLite). Central cluster (stores definitions in config DB, brokers inbound requests from external systems to sites).

## Responsibilities

### Definitions (Central)

- Store external system definitions in the configuration database: connection details, API method signatures (parameters and return types).
- Store database connection definitions: server, database, credentials.
- Deploy definitions uniformly to all sites (no per-site overrides). Deployment requires **explicit action** by a user with the Deployment role.
- Managed by users with the Design role.

### Execution (Site)

- Invoke external system API methods as requested by scripts (via Script Execution Actors and Alarm Execution Actors).
- Provide raw MS SQL client connections (ADO.NET) by name for synchronous database access.
- Submit cached database writes to the Store-and-Forward Engine for reliable delivery.
- Sites communicate with external systems **directly** (not routed through central).

### Integration Brokering (Central)

- Receive inbound requests from external systems (e.g., MES querying machine values).
- Route requests to the appropriate site via the Communication Layer.
- Return responses to the external system.

## External System Definition

Each external system definition includes:

- **Name**: Unique identifier (e.g., "MES", "RecipeManager").
- **Base URL**: The root endpoint URL for the external system (e.g., `https://mes.example.com/api`).
- **Authentication**: One of:
  - **API Key**: Header name (e.g., `X-API-Key`) and key value.
  - **Basic Auth**: Username and password.
- **Timeout**: Per-system timeout for all method calls (e.g., 30 seconds). Applies to the HTTP request round-trip.
- **Retry Settings**: Max retry count, fixed time between retries (used by Store-and-Forward Engine for transient failures only).
- **Method Definitions**: List of available API methods, each with:
  - Method name.
  - **HTTP method**: GET, POST, PUT, or DELETE.
  - **Path**: Relative path appended to the base URL (e.g., `/recipes/{id}`).
  - Parameter definitions (name, type). Supports the extended type system (Boolean, Integer, Float, String, Object, List).
  - Return type definition. Supports the extended type system for complex response structures.
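Put together, a definition might serialize along these lines. The field names are illustrative; the stored schema is not specified here:

```json
{
  "name": "MES",
  "baseUrl": "https://mes.example.com/api",
  "authentication": { "type": "ApiKey", "headerName": "X-API-Key", "key": "<secret>" },
  "timeoutSeconds": 30,
  "retry": { "maxRetries": 5, "retryIntervalSeconds": 10 },
  "methods": [
    {
      "name": "GetRecipe",
      "httpMethod": "GET",
      "path": "/recipes/{id}",
      "parameters": [ { "name": "id", "type": "String" } ],
      "returnType": "Object"
    }
  ]
}
```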
## Database Connection Definition

Each database connection definition includes:

- **Name**: Unique identifier (e.g., "MES_DB", "HistorianDB").
- **Connection Details**: Server address, database name, credentials.
- **Retry Settings**: Max retry count, fixed time between retries (for cached writes).

## Database Access Modes

### Synchronous (Real-time)

- Script calls `Database.Connection("name")` and receives a raw ADO.NET `SqlConnection`.
- Full control: queries, updates, transactions, stored procedures.
- Failures are immediate — no buffering.

### Cached Write (Store-and-Forward)

- Script calls `Database.CachedWrite("name", "sql", parameters)`.
- The write is submitted to the Store-and-Forward Engine.
- Payload includes: connection name, SQL statement, serialized parameter values.
- If the database is unavailable, the write is buffered and retried per the connection's retry settings.
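From a script's point of view the two modes might look like this. `Database.Connection` and `Database.CachedWrite` are the documented entry points; the parameter-passing shape shown is an assumption:

```csharp
// Synchronous: raw ADO.NET, failures surface immediately to the script.
using (var conn = Database.Connection("MES_DB"))
{
    conn.Open();
    using var cmd = conn.CreateCommand();
    cmd.CommandText = "UPDATE Orders SET Status = @s WHERE Id = @id";
    cmd.Parameters.AddWithValue("@s", "Complete");
    cmd.Parameters.AddWithValue("@id", orderId);
    cmd.ExecuteNonQuery();
}

// Cached write: buffered by Store-and-Forward if MES_DB is unreachable;
// the script does not block on delivery.
Database.CachedWrite(
    "MES_DB",
    "INSERT INTO ProductionLog (Machine, PartCount, LoggedAt) VALUES (@m, @c, @t)",
    new { m = machineName, c = partCount, t = DateTime.UtcNow });
```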
## Invocation Protocol

All external system calls are **HTTP/REST** with **JSON** serialization:

- The ESG acts as an HTTP client. The external system definition provides the base URL; each method definition specifies the HTTP method and relative path.
- Request parameters are serialized as JSON in the request body (POST/PUT) or as query parameters (GET/DELETE).
- Response bodies are deserialized from JSON into the method's defined return type.
- Credentials (API key header or Basic Auth header) are attached to every request per the system's authentication configuration.

## External System Call Modes

Scripts choose between two call modes per invocation, mirroring the dual-mode database access pattern:

### Synchronous (Real-time)

- Script calls `ExternalSystem.Call("systemName", "methodName", params)`.
- The HTTP request is executed immediately. The script blocks until the response is received or the timeout elapses.
- **All failures** (transient and permanent) return an error to the calling script. No store-and-forward buffering.
- Use for request/response interactions where the script needs the result (e.g., fetching a recipe, querying inventory).

### Cached (Store-and-Forward)

- Script calls `ExternalSystem.CachedCall("systemName", "methodName", params)`.
- The call is attempted immediately. If it succeeds, the response is discarded (fire-and-forget).
- On **transient failure** (connection refused, timeout, HTTP 5xx), the call is routed to the Store-and-Forward Engine for retry per the system's retry settings. The script does **not** block — the call is buffered and the script continues.
- On **permanent failure** (HTTP 4xx), the error is returned **synchronously** to the calling script. No retry — the request itself is wrong.
- Use for outbound data pushes where deferred delivery is acceptable (e.g., posting production data, sending quality reports).

## Call Timeout & Error Handling

- Each external system definition specifies a **timeout** that applies to all method calls on that system.
- Error classification by HTTP response:
  - **Transient failures** (connection refused, timeout, HTTP 408, 429, 5xx): Behavior depends on call mode — `CachedCall` buffers for retry; `Call` returns error to script.
  - **Permanent failures** (HTTP 4xx except 408/429): Always returned to the calling script regardless of call mode. Logged to Site Event Logging.
- This classification ensures the S&F buffer is not polluted with requests that will never succeed.
- **Idempotency note**: `CachedCall` retries may result in duplicate delivery if the external system received the original request but the response was lost. Callers should use `CachedCall` only for operations that are idempotent or where duplicate delivery is acceptable.
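The transient/permanent split above reduces to a small predicate. A sketch, assuming classification lives in one helper:

```csharp
using System.Net;

public static class ErrorClassifier
{
    // Transient: retryable via Store-and-Forward (CachedCall mode only).
    public static bool IsTransient(HttpStatusCode status) =>
        status == HttpStatusCode.RequestTimeout    // 408
        || (int)status == 429                      // Too Many Requests
        || (int)status >= 500;                     // 5xx server errors

    // No HTTP response at all (connection refused, request timeout) is also
    // transient: HttpRequestException / TaskCanceledException map to the same
    // bucket. Everything else in the 4xx range is permanent and is returned
    // to the script, never buffered.
}
```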
## Blocking I/O Isolation

- External system HTTP calls and database operations are blocking I/O. Script Execution Actors (which are short-lived, per-invocation actors) execute these calls, ensuring that blocking does not starve the parent Script Actor or Instance Actor.
- The Akka.NET actor system should configure a **dedicated dispatcher** for Script Execution Actors to isolate blocking I/O from the default dispatcher used by coordination actors.

## Database Connection Management

- Database connections use **standard ADO.NET connection pooling** per named connection. No custom pool management.
- Pool behavior (max pool size, connection lifetime, etc.) can be tuned via connection string parameters in the database connection definition if needed.
- Synchronous failures on `Database.Connection()` (e.g., unreachable server) return an error to the calling script, consistent with external system permanent failure handling.

## Dependencies

- **Configuration Database (MS SQL)**: Stores external system and database connection definitions (central only).
- **Local SQLite**: At sites, external system and database connection definitions are read from local SQLite (populated by artifact deployment). Sites do not access the central config DB.
- **Store-and-Forward Engine**: Handles buffering for failed external system calls and cached database writes.
- **Communication Layer**: Routes inbound external system requests from central to sites.
- **Security & Auth**: Design role manages definitions.
- **Configuration Database (via IAuditService)**: Definition changes are audit logged.

## Interactions

- **Site Runtime (Script/Alarm Execution Actors)**: Scripts invoke external system methods and database operations through this component.
- **Store-and-Forward Engine**: Failed calls and cached writes are routed here for reliable delivery.
- **Deployment Manager**: Receives updated definitions as part of system-wide artifact deployment (triggered explicitly by Deployment role).
75
docs/requirements/Component-HealthMonitoring.md
Normal file
@@ -0,0 +1,75 @@
# Component: Health Monitoring

## Purpose

The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.

## Location

Site clusters (metric collection and reporting). Central cluster (aggregation and display).

## Responsibilities

### Site Side

- Collect health metrics from local subsystems.
- Periodically report metrics to the central cluster via the Communication Layer.

### Central Side

- Receive and store health metrics from all sites.
- Detect site connectivity status (online/offline) based on heartbeat presence.
- Present health data in the Central UI dashboard.

## Monitored Metrics

| Metric | Source | Description |
|--------|--------|-------------|
| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
| Data connection health | Data Connection Layer | Connected/disconnected/reconnecting per data connection |
| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
| Dead letter count | Akka.NET EventStream | Messages sent to actors that no longer exist — indicates stale references or timing issues |

## Reporting Protocol

- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
- Each report is a **flat snapshot** containing the current values of all monitored metrics, a **monotonic sequence number**, and the **report timestamp** from the site. Central replaces the previous state for that site only if the incoming sequence number is higher than the last received — this prevents stale reports (e.g., delayed in transit or from a pre-failover node) from overwriting newer state.
- **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds** — 2x the report interval), the site is marked as **offline**. This gives one missed report as grace before marking offline.
- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment required — the metrics in the report provide immediate visibility into the site's condition.
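Central's stale-report guard reduces to a sequence-number comparison per site. A sketch; the record shape is illustrative and omits the metric payload:

```csharp
using System;
using System.Collections.Generic;

public sealed record HealthReport(string SiteId, long Sequence, DateTimeOffset ReportedAt);

// Illustrative in-memory cache of the latest report per site.
public sealed class SiteHealthCache
{
    private readonly Dictionary<string, HealthReport> _latest = new();

    // Accepts the report only if it is newer than what we already hold.
    public bool TryAccept(HealthReport report)
    {
        if (_latest.TryGetValue(report.SiteId, out var prev) && report.Sequence <= prev.Sequence)
            return false;                    // stale or duplicate: drop it

        _latest[report.SiteId] = report;     // replaces the previous snapshot; site is online
        return true;
    }
}
```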
## Error Rate Metrics

Script error rates and alarm evaluation error rates are calculated as **raw counts per reporting interval**:

- The site maintains a counter for each metric that increments on every failure.
- Each health report includes the count since the last report. The counter resets after each report is sent.
- Central displays these as "X errors in the last 30 seconds" (or whatever the configured interval is).
- **Script errors** include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
- **Alarm evaluation errors** include all failures during alarm condition evaluation.
- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer** — the health dashboard is for quick triage, not forensics.

## Central Storage

- Health metrics are held **in memory** at the central cluster for display in the UI.
- No historical health data is persisted — the dashboard shows current/latest status only.
- Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.

## No Alerting

- Health monitoring is **display-only** for now — no automated notifications or alerts are triggered by health status changes.
- This can be extended in the future.

## Dependencies

- **Communication Layer**: Transports health reports from sites to central.
- **Data Connection Layer (site)**: Provides connection health metrics.
- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
- **Store-and-Forward Engine (site)**: Provides buffer depth metrics.
- **Cluster Infrastructure (site)**: Provides node role status.

## Interactions

- **Central UI**: Health Monitoring Dashboard displays aggregated metrics.
- **Communication Layer**: Health reports flow as periodic messages.
191
docs/requirements/Component-Host.md
Normal file
@@ -0,0 +1,191 @@
# Component: Host

## Purpose

The Host component is the single deployable executable for the entire ScadaLink system. The same binary runs on every node — central and site alike. The node's role is determined entirely by configuration (`appsettings.json`), not by which binary is deployed. On central nodes the Host additionally bootstraps ASP.NET Core to serve the Central UI and Inbound API web endpoints.

## Location

All nodes (central and site).

## Responsibilities

- Serve as the single entry point (`Program.cs`) for the ScadaLink process.
- Read and validate node configuration at startup before any actor system is created.
- Register the correct set of component services and actors based on the configured node role.
- Bootstrap the Akka.NET actor system with Remoting, Clustering, Persistence, and split-brain resolution via Akka.Hosting.
- Host ASP.NET Core web endpoints on central nodes only.
- Configure structured logging (Serilog) with environment-specific enrichment.
- Support running as a Windows Service in production and as a console application during development.
- Perform graceful shutdown via Akka.NET CoordinatedShutdown when the service is stopped.

---

## Requirements

### REQ-HOST-1: Single Binary Deployment

The same compiled binary must be deployable to both central and site nodes. The node's role (Central or Site) is determined solely by configuration values in `appsettings.json` (or environment-specific overrides). There must be no separate build targets, projects, or conditional compilation symbols for central vs. site.
### REQ-HOST-2: Role-Based Service Registration

At startup the Host must inspect the configured node role and register only the component services appropriate for that role:

- **Shared** (both Central and Site): ClusterInfrastructure, Communication, HealthMonitoring, ExternalSystemGateway, NotificationService.
- **Central only**: TemplateEngine, DeploymentManager, Security, AuditLogging, CentralUI, InboundAPI, ManagementService.
- **Site only**: SiteRuntime, DataConnectionLayer, StoreAndForward, SiteEventLogging.

Components not applicable to the current role must not be registered in the DI container or the Akka.NET actor system.
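The registration split can be sketched as follows. This is a minimal illustration, not the actual `Program.cs` — the `AddXxx()` method names follow the REQ-HOST-10 convention, and `NodeOptions`/`NodeRole` are the configuration types from REQ-HOST-3/4:

```csharp
// Program.cs (sketch): shared services first, then role-specific ones.
// AddXxx() names are assumed from the REQ-HOST-10 convention.
var nodeOptions = builder.Configuration
    .GetSection("ScadaLink:Node")
    .Get<NodeOptions>();

builder.Services
    .AddClusterInfrastructure()
    .AddCommunication()
    .AddHealthMonitoring()
    .AddExternalSystemGateway()
    .AddNotificationService();

if (nodeOptions.Role == NodeRole.Central)
{
    builder.Services
        .AddTemplateEngine()
        .AddDeploymentManager()
        .AddSecurity()
        .AddAuditLogging()
        .AddCentralUI()
        .AddInboundAPI()
        .AddManagementService();
}
else // NodeRole.Site
{
    builder.Services
        .AddSiteRuntime()
        .AddDataConnectionLayer()
        .AddStoreAndForward()
        .AddSiteEventLogging();
}
```

Because registration is a plain runtime branch on configuration, the same binary satisfies REQ-HOST-1 with no conditional compilation.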
### REQ-HOST-3: Configuration Binding

The Host must bind configuration sections from `appsettings.json` to strongly-typed options classes using the .NET **Options pattern** (`IOptions<T>` / `IOptionsSnapshot<T>`). Each component has its own configuration section under `ScadaLink`, mapped to a dedicated configuration class owned by that component.

#### Infrastructure Sections

| Section | Options Class | Owner | Contents |
|---------|--------------|-------|----------|
| `ScadaLink:Node` | `NodeOptions` | Host | Role, NodeHostname, SiteId, RemotingPort |
| `ScadaLink:Cluster` | `ClusterOptions` | ClusterInfrastructure | SeedNodes, SplitBrainResolverStrategy, StableAfter, HeartbeatInterval, FailureDetectionThreshold, MinNrOfMembers |
| `ScadaLink:Database` | `DatabaseOptions` | Host | Central: ConfigurationDb, MachineDataDb connection strings; Site: SQLite paths |

#### Per-Component Sections

| Section | Options Class | Owner | Contents |
|---------|--------------|-------|----------|
| `ScadaLink:DataConnection` | `DataConnectionOptions` | Data Connection Layer | ReconnectInterval, TagResolutionRetryInterval, WriteTimeout |
| `ScadaLink:StoreAndForward` | `StoreAndForwardOptions` | Store-and-Forward | SqliteDbPath, ReplicationEnabled |
| `ScadaLink:HealthMonitoring` | `HealthMonitoringOptions` | Health Monitoring | ReportInterval, OfflineTimeout |
| `ScadaLink:SiteEventLog` | `SiteEventLogOptions` | Site Event Logging | RetentionDays, MaxStorageMb, PurgeScheduleCron |
| `ScadaLink:Communication` | `CommunicationOptions` | Communication | DeploymentTimeout, LifecycleTimeout, QueryTimeout, TransportHeartbeatInterval, TransportFailureThreshold |
| `ScadaLink:Security` | `SecurityOptions` | Security & Auth | LdapServer, LdapPort, LdapUseTls, JwtSigningKey, JwtExpiryMinutes, IdleTimeoutMinutes |
| `ScadaLink:InboundApi` | `InboundApiOptions` | Inbound API | DefaultMethodTimeout |
| `ScadaLink:Notification` | `NotificationOptions` | Notification Service | (SMTP config is stored in config DB and deployed to sites, not in appsettings) |
| `ScadaLink:ManagementService` | `ManagementServiceOptions` | Management Service | (Reserved for future configuration) |
| `ScadaLink:Logging` | `LoggingOptions` | Host | Serilog sink configuration, log level overrides |

#### Convention

- Each component defines its own options class (e.g., `DataConnectionOptions`) in its own project. The class is a plain POCO with properties matching the JSON section keys.
- The Host binds each section during startup via `services.Configure<T>(configuration.GetSection("ScadaLink:<ComponentName>"))`.
- Each component's `AddXxx()` extension method accepts `IServiceCollection` and reads its options via `IOptions<T>` — the component never reads `IConfiguration` directly.
- Options classes live in the component project, not in Commons, because they are component-specific configuration — not shared contracts.
- Startup validation (REQ-HOST-4) validates all required options before the actor system starts.
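For illustration, a central-node `appsettings.json` matching the infrastructure section layout above might look like the following sketch. Section and key names come from the tables; all values (hostnames, ports, connection strings) are invented placeholders:

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.local",
      "RemotingPort": 8091
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://ScadaLink@central-01.example.local:8091",
        "akka.tcp://ScadaLink@central-02.example.local:8091"
      ],
      "SplitBrainResolverStrategy": "keep-majority",
      "StableAfter": "00:00:20"
    },
    "Database": {
      "ConfigurationDb": "Server=sql01;Database=ScadaLinkConfig;Integrated Security=true",
      "MachineDataDb": "Server=sql01;Database=ScadaLinkData;Integrated Security=true"
    }
  }
}
```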
### REQ-HOST-4: Startup Validation

Before the Akka.NET actor system is created, the Host must validate all required configuration values and fail fast with a clear error message if any are missing or invalid. Validation rules include:

- `NodeConfiguration.Role` must be a valid `NodeRole` value.
- `NodeConfiguration.NodeHostname` must not be null or empty.
- `NodeConfiguration.RemotingPort` must be in the valid port range (1–65535).
- Site nodes must have a non-empty `SiteId`.
- Central nodes must have non-empty `ConfigurationDb` and `MachineDataDb` connection strings.
- Site nodes must have non-empty SQLite path values. Site nodes do **not** require a `ConfigurationDb` connection string — all configuration is received via artifact deployment and read from local SQLite.
- At least two seed nodes must be configured.
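A fail-fast check over these rules could be sketched as a plain validation method run before `AddAkka()`. The type and property names (`NodeOptions`, `NodeRole`) are assumptions carried over from REQ-HOST-3; collecting all errors before throwing gives the operator one complete message rather than a fix-one-rerun loop:

```csharp
// Sketch: validate node options and fail fast with an aggregated message.
static void ValidateNodeOptions(NodeOptions node)
{
    var errors = new List<string>();

    if (!Enum.IsDefined(typeof(NodeRole), node.Role))
        errors.Add($"Node.Role '{node.Role}' is not a valid NodeRole.");
    if (string.IsNullOrWhiteSpace(node.NodeHostname))
        errors.Add("Node.NodeHostname must not be empty.");
    if (node.RemotingPort is < 1 or > 65535)
        errors.Add($"Node.RemotingPort {node.RemotingPort} is outside 1-65535.");
    if (node.Role == NodeRole.Site && string.IsNullOrWhiteSpace(node.SiteId))
        errors.Add("Site nodes require a non-empty Node.SiteId.");

    if (errors.Count > 0)
        throw new InvalidOperationException(
            "Invalid ScadaLink configuration:\n - " + string.Join("\n - ", errors));
}
```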
### REQ-HOST-4a: Readiness Gating

On central nodes, the ASP.NET Core web endpoints (Central UI, Inbound API) must **not accept traffic** until the node is fully operational:

- Akka.NET cluster membership is established.
- Database connectivity (MS SQL) is verified.
- Required cluster singletons are running (if applicable).

A standard ASP.NET Core health check endpoint (`/health/ready`) reports readiness status. The load balancer uses this endpoint to determine when to route traffic to the node. During startup or failover, the node returns `503 Service Unavailable` until ready.

### REQ-HOST-5: Windows Service Hosting

The Host must support running as a Windows Service via `UseWindowsService()`. When launched outside of a Windows Service context (e.g., during development), it must run as a standard console application. No code changes or conditional compilation are required to switch between the two modes.
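Readiness gating maps naturally onto the standard ASP.NET Core health check APIs. A minimal sketch — `ClusterMembershipHealthCheck` and `SqlConnectivityHealthCheck` are hypothetical custom `IHealthCheck` implementations that would report `Healthy` only once their respective conditions hold:

```csharp
// Sketch: readiness gating via ASP.NET Core health checks.
// The two check classes are assumptions, not existing types.
builder.Services.AddHealthChecks()
    .AddCheck<ClusterMembershipHealthCheck>("cluster-membership")
    .AddCheck<SqlConnectivityHealthCheck>("mssql");

var app = builder.Build();

// Unhealthy checks cause this endpoint to return 503, which the load
// balancer interprets as "do not route traffic to this node yet".
app.MapHealthChecks("/health/ready");
```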
### REQ-HOST-6: Akka.NET Bootstrap

The Host must configure the Akka.NET actor system using Akka.Hosting with:

- **Remoting**: Configured with the node's hostname and port from `NodeConfiguration`.
- **Clustering**: Configured with seed nodes and the node's cluster role from configuration.
- **Persistence**: Configured with the appropriate journal and snapshot store (SQL for central, SQLite for site).
- **Split-Brain Resolver**: Configured with the strategy and stable-after duration from `ClusterConfiguration`.
- **Actor registration**: Each component's actors registered via its `AddXxxActors()` extension method, conditional on the node's role.

### REQ-HOST-6a: ClusterClientReceptionist (Central Only)

On central nodes, the Host must configure the Akka.NET **ClusterClientReceptionist** and register the ManagementActor with it. This allows external processes (e.g., the CLI) to discover and communicate with the ManagementActor via ClusterClient without joining the cluster as full members. The receptionist is started as part of the Akka.NET bootstrap (REQ-HOST-6) on central nodes only.
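The bootstrap items above map onto Akka.Hosting's fluent builder roughly as follows. This is an illustrative sketch only — overloads and option names should be checked against the Akka.Hosting version actually in use; `NodeOptions` and `ClusterOptions` are the configuration classes from REQ-HOST-3:

```csharp
// Sketch only: exact Akka.Hosting overloads vary by package version.
builder.Services.AddAkka("ScadaLink", (akka, provider) =>
{
    var node = provider.GetRequiredService<IOptions<NodeOptions>>().Value;
    var cluster = provider.GetRequiredService<IOptions<ClusterOptions>>().Value;

    akka.WithRemoting(node.NodeHostname, node.RemotingPort)
        .WithClustering(new Akka.Cluster.Hosting.ClusterOptions
        {
            SeedNodes = cluster.SeedNodes,
            Roles = new[] { node.Role.ToString() }
        });

    if (node.Role == NodeRole.Central)
    {
        // Per REQ-HOST-6a, receptionist + ManagementActor registration
        // happens inside AddManagementServiceActors() (REQ-HOST-10).
        akka.AddManagementServiceActors();
    }
});
```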
### REQ-HOST-7: ASP.NET Web Endpoints (Central Only)

On central nodes, the Host must use `WebApplication.CreateBuilder` to produce a full ASP.NET Core host with Kestrel, and must map web endpoints for:

- Central UI (via `MapCentralUI()` extension method).
- Inbound API (via `MapInboundAPI()` extension method).

On site nodes, the Host must use `Host.CreateDefaultBuilder` to produce a generic `IHost` — **not** a `WebApplication`. This ensures no Kestrel server is started, no HTTP port is opened, and no web endpoint or middleware pipeline is configured. Site nodes are headless and must never accept inbound HTTP connections.
### REQ-HOST-8: Structured Logging

The Host must configure Serilog as the logging provider with:

- Configuration-driven sink setup (console and file sinks at minimum).
- Automatic enrichment of every log entry with `SiteId`, `NodeHostname`, and `NodeRole` properties sourced from `NodeConfiguration`.
- Structured (machine-parseable) output format.

### REQ-HOST-8a: Dead Letter Monitoring

The Host must subscribe to the Akka.NET `DeadLetter` event stream and log dead letters at Warning level. Dead letters indicate messages sent to actors that no longer exist — a common symptom of failover timing issues, stale actor references, or race conditions during instance lifecycle transitions. The dead letter count is reported as a health metric (see Health Monitoring).

### REQ-HOST-9: Graceful Shutdown

When the Host process receives a stop signal (Windows Service stop, `Ctrl+C`, or SIGTERM), it must trigger Akka.NET CoordinatedShutdown to allow actors to drain in-flight work before the process exits. The Host must not call `Environment.Exit()` or forcibly terminate the actor system without coordinated shutdown.
### REQ-HOST-10: Extension Method Convention

Each component library must expose its services to the Host via a consistent set of extension methods:

- `IServiceCollection.AddXxx()` — registers the component's DI services.
- `AkkaConfigurationBuilder.AddXxxActors()` — registers the component's actors with the Akka.NET actor system (for components that have actors).
- `WebApplication.MapXxx()` — maps the component's web endpoints (only for CentralUI and InboundAPI).

The Host's `Program.cs` calls these extension methods; the component libraries own the registration logic. This keeps the Host thin and each component self-contained. The ManagementService component additionally registers the ManagementActor with ClusterClientReceptionist in its `AddManagementServiceActors()` method.
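A component following this convention might look like the sketch below, using a hypothetical HealthMonitoring component as the example. The interface, actor, and service names inside the methods are invented for illustration; only the extension-method shape is prescribed by this requirement:

```csharp
// Sketch of the per-component extension method convention.
// Inner type names (IHealthReportAggregator, HealthMonitorActor) are
// illustrative assumptions.
public static class HealthMonitoringExtensions
{
    // DI registration: the component owns its own service wiring.
    public static IServiceCollection AddHealthMonitoring(
        this IServiceCollection services)
    {
        services.AddSingleton<IHealthReportAggregator, HealthReportAggregator>();
        return services;
    }

    // Actor registration: invoked by the Host only when the node role
    // requires this component (REQ-HOST-2).
    public static AkkaConfigurationBuilder AddHealthMonitoringActors(
        this AkkaConfigurationBuilder builder)
    {
        return builder.WithActors((system, registry) =>
        {
            var actor = system.ActorOf(
                Props.Create<HealthMonitorActor>(), "health-monitor");
            registry.Register<HealthMonitorActor>(actor);
        });
    }
}
```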
---

## Component Registration Matrix

| Component | Central | Site | DI (`AddXxx`) | Actors (`AddXxxActors`) | Endpoints (`MapXxx`) |
|---|---|---|---|---|---|
| ClusterInfrastructure | Yes | Yes | Yes | Yes | No |
| Communication | Yes | Yes | Yes | Yes | No |
| HealthMonitoring | Yes | Yes | Yes | Yes | No |
| ExternalSystemGateway | Yes | Yes | Yes | Yes | No |
| NotificationService | Yes | Yes | Yes | Yes | No |
| TemplateEngine | Yes | No | Yes | Yes | No |
| DeploymentManager | Yes | No | Yes | Yes | No |
| Security | Yes | No | Yes | Yes | No |
| CentralUI | Yes | No | Yes | No | Yes |
| InboundAPI | Yes | No | Yes | No | Yes |
| ManagementService | Yes | No | Yes | Yes | No |
| SiteRuntime | No | Yes | Yes | Yes | No |
| DataConnectionLayer | No | Yes | Yes | Yes | No |
| StoreAndForward | No | Yes | Yes | Yes | No |
| SiteEventLogging | No | Yes | Yes | Yes | No |
| ConfigurationDatabase | Yes | No | Yes | No | No |

---
## Dependencies

- **All 17 component libraries**: The Host references every component project to call their extension methods (excludes CLI, which is a separate executable).
- **Akka.Hosting**: For `AddAkka()` and the hosting configuration builder.
- **Akka.Remote.Hosting, Akka.Cluster.Hosting, Akka.Persistence.Hosting**: For Akka subsystem configuration.
- **Serilog.AspNetCore**: For structured logging integration.
- **Microsoft.Extensions.Hosting.WindowsServices**: For Windows Service support.
- **ASP.NET Core** (central only): For web endpoint hosting.

## Interactions

- **All components**: The Host is the composition root — it wires every component into the DI container and actor system.
- **Configuration Database**: The Host registers the DbContext and wires repository implementations to their interfaces. In development, it triggers auto-migration; in production, it validates the schema version.
- **ClusterInfrastructure**: The Host configures the underlying Akka.NET cluster that ClusterInfrastructure manages at runtime.
- **CentralUI / InboundAPI**: The Host maps their web endpoints into the ASP.NET Core pipeline on central nodes.
- **ManagementService**: The Host registers the ManagementActor and configures ClusterClientReceptionist on central nodes, enabling CLI access.
- **HealthMonitoring**: The Host's startup validation and logging configuration provide the foundation for health reporting.
183
docs/requirements/Component-InboundAPI.md
Normal file
# Component: Inbound API

## Purpose

The Inbound API exposes a web API on the central cluster that external systems can call into. This is the reverse of the External System Gateway — where that component handles the SCADA system calling out to external systems, this component handles external systems calling in. It provides API key authentication, method-level authorization, and script-based method implementations.

## Location

Central cluster only (active node). Not available at site clusters.

## Responsibilities

- Host a web API endpoint on the central cluster.
- Authenticate inbound requests via API keys.
- Route requests to the appropriate API method definition.
- Enforce per-method API key authorization (only approved keys can call a given method).
- Execute the C# script implementation for the called method.
- Return structured responses to the caller.
- Failover: the API becomes available on the new active node after central failover.
## API Key Management

### Storage

- API keys are stored in the **configuration database (MS SQL)**.

### Key Properties

- **Name/Label**: Human-readable identifier for the key (e.g., "MES-Production", "RecipeManager-Dev").
- **Key Value**: The secret key string used for authentication.
- **Enabled/Disabled Flag**: Keys can be disabled without deletion.

### Management

- Managed by users with the **Admin** role via the Central UI.
- All key changes (create, enable/disable, delete) are audit logged.
## API Method Definition

### Properties

Each API method definition includes:

- **Method Name**: Unique identifier and URL path segment for the endpoint.
- **Approved API Keys**: List of API keys authorized to invoke this method. Requests from non-approved keys are rejected.
- **Parameter Definitions**: Ordered list of input parameters, each with:
  - Parameter name.
  - Data type (Boolean, Integer, Float, String — same fixed set as template attributes).
- **Return Value Definition**: Structure of the response, with:
  - Field names and data types. Supports returning **lists of objects**.
- **Implementation Script**: C# script that executes when the method is called. Stored **inline** in the method definition. Follows standard C# authoring patterns but has no template inheritance — it is a standalone script tied to this method.
- **Timeout**: Configurable per method. Defines the maximum time the method is allowed to execute (including any routed calls to sites) before returning a timeout error to the caller.

### Management

- Managed by users with the **Design** role via the Central UI.
- All method definition changes are audit logged.
## HTTP Contract

### URL Structure

- All API calls use `POST /api/{methodName}`.
- Method names map directly to URL path segments (e.g., method "GetProductionReport" → `POST /api/GetProductionReport`).
- All calls are POST — these are RPC-style script invocations, not RESTful resource operations.

### API Key Header

- API key is passed via the `X-API-Key` HTTP header.

### Request Format

- Content-Type: `application/json`.
- Parameters are top-level JSON fields in the request body matching the method's parameter definitions:

  ```json
  {
    "siteId": "SiteA",
    "startDate": "2026-03-01",
    "endDate": "2026-03-16"
  }
  ```

### Response Format

- **Success (200)**: The response body is the method's return value as JSON, with fields matching the return value definition:

  ```json
  {
    "siteName": "Site Alpha",
    "totalUnits": 14250,
    "lines": [
      { "lineName": "Line-1", "units": 8200, "efficiency": 92.5 },
      { "lineName": "Line-2", "units": 6050, "efficiency": 88.1 }
    ]
  }
  ```

- **Failure (4xx/5xx)**: The response body is an error object:

  ```json
  {
    "error": "Site unreachable",
    "code": "SITE_UNREACHABLE"
  }
  ```

- HTTP status codes distinguish success from failure — no envelope wrapper.

### Extended Type System

- API method parameter and return type definitions support an **extended type system** beyond the four template attribute types (Boolean, Integer, Float, String):
  - **Object**: A named structure with typed fields. Supports nesting.
  - **List**: An ordered collection of objects or primitive types.
- This allows complex request/response structures (e.g., an object containing properties and a list of nested objects).
- Template attributes retain the simpler four-type system. The extended types apply only to Inbound API method definitions and External System Gateway method definitions.
## API Call Logging

- **Only failures are logged.** Script execution errors (500 responses) are logged centrally.
- Successful API calls are **not** logged — the audit log is reserved for configuration changes, not operational traffic.
- No rate limiting — this is a private API in a controlled industrial environment with a known set of callers. Misbehaving callers are handled operationally (disable the API key).
## Request Flow

```
External System
      │
      ▼
Inbound API (Central)
      ├── 1. Extract API key from request
      ├── 2. Validate key exists and is enabled
      ├── 3. Resolve method by name
      ├── 4. Check API key is in method's approved list
      ├── 5. Validate and deserialize parameters
      ├── 6. Execute implementation script (subject to method timeout)
      ├── 7. Serialize return value
      └── 8. Return response
```
## Implementation Script Capabilities

The C# script that implements an API method executes on the central cluster. Unlike instance scripts at sites, inbound API scripts run on central and can interact with **any instance at any site** through a routing API.

Inbound API scripts **cannot** call shared scripts directly — shared scripts are deployed to sites only and execute inline in Script Actors. To execute logic on a site, use `Route.To().Call()`.

### Script Runtime API

#### Instance Routing

- `Route.To("instanceUniqueCode").Call("scriptName", parameters)` — Invoke a script on a specific instance at any site. Central routes the call to the appropriate site via the Communication Layer. The call reaches the target Instance Actor's Script Actor, which spawns a Script Execution Actor to execute the script. The return value flows back to the calling API script.
- `Route.To("instanceUniqueCode").GetAttribute("attributeName")` — Read a single attribute value from a specific instance at any site.
- `Route.To("instanceUniqueCode").GetAttributes("attr1", "attr2", ...)` — Read multiple attribute values in a **single call**, returned as a dictionary of name-value pairs.
- `Route.To("instanceUniqueCode").SetAttribute("attributeName", value)` — Write a single attribute value on a specific instance at any site.
- `Route.To("instanceUniqueCode").SetAttributes(dictionary)` — Write multiple attribute values in a **single call**, accepting a dictionary of name-value pairs.

#### Input/Output

- **Input parameters** are available as defined in the method definition.
- **Return value** construction matching the defined return structure.

#### Database Access

- `Database.Connection("connectionName")` — Obtain a raw MS SQL client connection for querying the configuration or machine data databases directly from central.

### Routing Behavior

- The `Route.To()` helper resolves the instance's site assignment from the configuration database and routes the request to the correct site cluster via the Communication Layer.
- The call is **synchronous from the API caller's perspective** — the API method blocks until the site responds or the **method-level timeout** is reached.
- If the target site is unreachable or the call times out, the call fails and the API returns an error to the caller. No store-and-forward buffering is used for inbound API calls.
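As an illustration, a "GetProductionReport" method (the name from the URL example above) might be implemented with a script along these lines. The `Route` helper is the documented runtime API, but the instance codes, attribute names, and return fields are invented for this sketch:

```csharp
// Hypothetical implementation script for a "GetProductionReport" method.
// Instance codes and attribute names are illustrative only.
var lines = new List<object>();
var totalUnits = 0;

foreach (var lineCode in new[] { "SiteA.Line1", "SiteA.Line2" })
{
    // One routed round-trip per line: read several attributes in a single call.
    var attrs = Route.To(lineCode)
        .GetAttributes("LineName", "UnitsProduced", "Efficiency");

    totalUnits += Convert.ToInt32(attrs["UnitsProduced"]);
    lines.Add(new
    {
        lineName = attrs["LineName"],
        units = attrs["UnitsProduced"],
        efficiency = attrs["Efficiency"]
    });
}

// The returned object must match the method's return value definition.
return new { siteName = "Site Alpha", totalUnits, lines };
```

Note that each `Route.To(...)` call is a synchronous round-trip to the site, so the whole loop must complete within the method-level timeout.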
## Authentication Details

- API key is passed via the `X-API-Key` HTTP header.
- The system validates:
  1. The key exists in the configuration database.
  2. The key is enabled.
  3. The key is in the approved list for the requested method.
- Failed authentication returns an appropriate HTTP error (401 Unauthorized or 403 Forbidden).

## Error Handling

- Invalid API key → 401 Unauthorized.
- Valid key but not approved for method → 403 Forbidden.
- Invalid parameters → 400 Bad Request.
- Script execution failure → 500 Internal Server Error (with safe error message, no internal details exposed).
- Script errors are logged in the central audit/event system.
## Dependencies

- **Configuration Database (MS SQL)**: Stores API keys and method definitions.
- **Communication Layer**: Routes requests to sites when method implementations need site data.
- **Security & Auth**: API key validation (separate from LDAP/AD — the API uses key-based auth).
- **Configuration Database (via IAuditService)**: All API key and method definition changes are audit logged. Optionally, API call activity can be logged.
- **Cluster Infrastructure**: The API is hosted on the active central node and fails over with it.

## Interactions

- **External Systems**: Call the API with API keys.
- **Communication Layer**: API method scripts use this to reach sites.
- **Site Runtime (Instance Actors, Script Actors)**: Routed calls execute on site Instance Actors via their Script Actors.
- **Central UI**: Admin manages API keys; Design manages method definitions.
- **Configuration Database (via IAuditService)**: Configuration changes are audited.
219
docs/requirements/Component-ManagementService.md
Normal file
# Component: Management Service

## Purpose

The Management Service is an Akka.NET actor on the central cluster that provides programmatic access to all administrative operations. It exposes the same capabilities as the Central UI but through an actor-based interface, enabling the CLI (and potentially other tooling) to interact with the system without going through the web UI. The ManagementActor registers with ClusterClientReceptionist for cross-cluster access and is also exposed via an HTTP Management API endpoint (`POST /management`) for external tools like the CLI.

## Location

Central cluster only. The ManagementActor runs as a plain actor on **every** central node (not a cluster singleton). Because the actor is completely stateless — it holds no locks and no local state, delegating all work to repositories and services — running on all nodes improves availability without requiring coordination between instances. Either node can serve any request independently.

`src/ScadaLink.ManagementService/`

## Responsibilities

- Provide an actor-based interface to all administrative operations available in the Central UI.
- Register with Akka.NET ClusterClientReceptionist so external tools (CLI) can discover and communicate with it via ClusterClient.
- Expose an HTTP API endpoint (`POST /management`) that accepts JSON commands with Basic Auth, performs LDAP authentication and role resolution, and dispatches to the ManagementActor.
- Validate and authorize all incoming commands using the authenticated user identity carried in message envelopes.
- Delegate to the appropriate services and repositories for each operation.
- Return structured response messages for all commands and queries.
- Failover: The ManagementActor runs on all central nodes, so no actor-level failover is needed. If one node goes down, the ClusterClient transparently routes to the ManagementActor on the remaining node.
## Key Classes

### ManagementActor

The central actor that receives and processes all management commands. Registered at a well-known actor path (`/user/management`) and with ClusterClientReceptionist.

### ManagementEndpoints

Minimal API endpoint (`POST /management`) that serves as the HTTP interface to the ManagementActor. Handles Basic Auth decoding, LDAP authentication via `LdapAuthService`, role resolution via `RoleMapper`, command deserialization via `ManagementCommandRegistry`, and ManagementActor dispatch.

### ManagementActorHolder

DI-registered singleton that holds the `IActorRef` for the ManagementActor. Set during actor registration in `AkkaHostedService` and injected into the HTTP endpoint handler.

### ManagementCommandRegistry

Static registry mapping command names (e.g., `"ListSites"`) to command types (e.g., `ListSitesCommand`). Built via reflection at startup. Used by the HTTP endpoint to deserialize JSON payloads into the correct command type.

### Message Contracts

All request/response messages are defined in **Commons** under `Messages/Management/`. Messages follow the existing additive-only evolution rules for version compatibility. Every request message includes:

- **CorrelationId**: Application-level correlation ID for request/response pairing.
- **AuthenticatedUser**: The identity and roles of the user issuing the command (username, display name, roles, permitted sites). This is populated by the CLI from the user's authenticated session.
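A reflection-built registry of this shape could be sketched as follows. The `"<Name>Command"` naming convention and the use of `ListSitesCommand` to locate the Commons assembly are assumptions inferred from the examples above, not a specification:

```csharp
// Sketch: map command names to command types via reflection at startup.
// Assumes command types live in Commons and end in "Command".
public static class ManagementCommandRegistry
{
    private static readonly Dictionary<string, Type> Commands =
        typeof(ListSitesCommand).Assembly          // the Commons assembly
            .GetTypes()
            .Where(t => !t.IsAbstract && t.Name.EndsWith("Command"))
            .ToDictionary(
                t => t.Name[..^"Command".Length],  // "ListSitesCommand" → "ListSites"
                t => t);

    // Deserialize the JSON payload into the strongly-typed command,
    // or return null when the command name is unknown.
    public static object? Deserialize(string commandName, string payloadJson) =>
        Commands.TryGetValue(commandName, out var type)
            ? System.Text.Json.JsonSerializer.Deserialize(payloadJson, type)
            : null;
}
```

Building the dictionary once at startup means adding a new command type in Commons makes it addressable over HTTP with no registry edits.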
## ClusterClientReceptionist Registration

The ManagementActor registers itself with `ClusterClientReceptionist` at startup. This allows external processes using `ClusterClient` to send messages to the ManagementActor without joining the Akka.NET cluster as a full member. The receptionist advertises the actor under its well-known path.
## HTTP Management API

The Management Service also exposes a `POST /management` endpoint on the Central Host's web server. This provides an HTTP interface to the same ManagementActor, enabling the CLI (and other HTTP clients) to interact without Akka.NET dependencies.

**Request format:**

```http
POST /management
Authorization: Basic base64(username:password)
Content-Type: application/json

{
  "command": "ListSites",
  "payload": {}
}
```

**Response mapping:**

- `ManagementSuccess` → HTTP 200 with JSON body
- `ManagementError` → HTTP 400 with `{ "error": "...", "code": "..." }`
- `ManagementUnauthorized` → HTTP 403 with `{ "error": "...", "code": "UNAUTHORIZED" }`
- Authentication failure → HTTP 401
- Actor timeout → HTTP 504

The endpoint performs LDAP authentication and role resolution server-side, collapsing the CLI's previous two-step flow (ResolveRoles + actual command) into a single HTTP round-trip.
## Message Groups
|
||||
|
||||
### Templates
|
||||
|
||||
- **ListTemplates** / **GetTemplate**: Query template definitions.
|
||||
- **CreateTemplate** / **UpdateTemplate** / **DeleteTemplate**: Manage templates.
|
||||
- **ValidateTemplate**: Run on-demand pre-deployment validation (flattening, naming collisions, script compilation).
|
||||
- **GetTemplateDiff**: Compare deployed vs. template-derived configuration for an instance.
|
||||
|
||||
### Template Members
|
||||
|
||||
- **AddTemplateAttribute** / **UpdateTemplateAttribute** / **DeleteTemplateAttribute**: Manage attributes on a template.
|
||||
- **AddTemplateAlarm** / **UpdateTemplateAlarm** / **DeleteTemplateAlarm**: Manage alarm definitions on a template.
|
||||
- **AddTemplateScript** / **UpdateTemplateScript** / **DeleteTemplateScript**: Manage scripts on a template.
|
||||
- **AddTemplateComposition** / **DeleteTemplateComposition**: Manage feature module compositions on a template.
|
||||
|
||||
### Instances
|
||||
|
||||
- **ListInstances** / **GetInstance**: Query instances, with filtering by site and area.
|
||||
- **CreateInstance**: Create a new instance from a template.
|
||||
- **UpdateInstanceOverrides**: Set attribute overrides on an instance.
|
||||
- **SetInstanceBindings** / **BindDataConnections**: Bind data connections to instance attributes.
|
||||
- **AssignArea**: Assign an instance to an area.
|
||||
- **EnableInstance** / **DisableInstance** / **DeleteInstance**: Instance lifecycle commands.
|
||||
|
||||
### Sites

- **ListSites** / **GetSite**: Query site definitions.
- **CreateSite** / **UpdateSite** / **DeleteSite**: Manage site definitions.
- **ListAreas** / **CreateArea** / **UpdateArea** / **DeleteArea**: Manage area hierarchies per site.

### Data Connections

- **ListDataConnections** / **GetDataConnection**: Query data connection definitions.
- **CreateDataConnection** / **UpdateDataConnection** / **DeleteDataConnection**: Manage data connection definitions.
- **AssignDataConnectionToSite** / **UnassignDataConnectionFromSite**: Manage site assignments.

### Deployments

- **DeployInstance**: Deploy configuration to a specific instance (includes pre-deployment validation).
- **DeployArtifacts**: Deploy system-wide artifacts (shared scripts, external system definitions, DB connections, data connections, notification lists, SMTP config) to all sites or a specific site.
- **GetDeploymentStatus**: Query deployment status.

### External Systems

- **ListExternalSystems** / **GetExternalSystem**: Query external system definitions.
- **CreateExternalSystem** / **UpdateExternalSystem** / **DeleteExternalSystem**: Manage external system definitions.

### Notifications

- **ListNotificationLists** / **GetNotificationList**: Query notification lists.
- **CreateNotificationList** / **UpdateNotificationList** / **DeleteNotificationList**: Manage notification lists and recipients.
- **GetSmtpConfig** / **UpdateSmtpConfig**: Query and update SMTP configuration.

### Security (LDAP & API Keys)

- **ListApiKeys** / **CreateApiKey** / **UpdateApiKey** / **EnableApiKey** / **DisableApiKey** / **DeleteApiKey**: Manage API keys.
- **ListRoleMappings** / **CreateRoleMapping** / **UpdateRoleMapping** / **DeleteRoleMapping**: Manage LDAP group-to-role mappings.
- **ListScopeRules** / **AddScopeRule** / **DeleteScopeRule**: Manage site scope rules on role mappings.

### Audit Log

- **QueryAuditLog**: Query audit log entries with filtering by entity type, user, date range, etc.

### Shared Scripts

- **ListSharedScripts** / **GetSharedScript**: Query shared script definitions.
- **CreateSharedScript** / **UpdateSharedScript** / **DeleteSharedScript**: Manage shared scripts.

### Database Connections

- **ListDatabaseConnections** / **GetDatabaseConnection**: Query database connection definitions.
- **CreateDatabaseConnection** / **UpdateDatabaseConnection** / **DeleteDatabaseConnection**: Manage database connections.

### Inbound API Methods

- **ListApiMethods** / **GetApiMethod**: Query inbound API method definitions.
- **CreateApiMethod** / **UpdateApiMethod** / **DeleteApiMethod**: Manage inbound API methods.

### Health

- **GetHealthSummary**: Query current health status of all sites.
- **GetSiteHealth**: Query detailed health for a specific site.

### Remote Queries

- **QuerySiteEventLog**: Query site event log entries from a remote site (routed via communication layer). Supports date range, keyword search, and pagination.
- **QueryParkedMessages**: Query parked (dead-letter) messages at a remote site (routed via communication layer). Supports pagination.
- **DebugSnapshot**: Request a one-shot snapshot of attribute values and alarm states for a running instance. Resolves the instance's site from the config DB and routes via the communication layer. Uses 30s `QueryTimeout`.

## Authorization

Every incoming message carries the authenticated user's identity and roles. The ManagementActor enforces the same role-based authorization rules as the Central UI:

- **Admin** role required for: site management, area management, API key management, role mapping management, scope rule management, system configuration.
- **Design** role required for: template authoring (including template member management: attributes, alarms, scripts, compositions), shared scripts, external system definitions, database connection definitions, notification lists, inbound API method definitions.
- **Deployment** role required for: instance management, deployments, debug view, debug snapshot, parked message queries, site event log queries. Site scoping is enforced for site-scoped Deployment users.
- **Read-only access** (any authenticated role): health summary, site health, site event log queries, parked message queries.

Unauthorized commands receive an `Unauthorized` response message. Failed authorization attempts are not audit logged (consistent with existing behavior).

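The role rules above amount to a command-to-required-role lookup, with commands absent from the lookup treated as read-only. A minimal sketch, assuming illustrative command names and a `Role` enum that are not the actual Commons contracts:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the ManagementActor's authorization gate.
// Command names and the Role enum are illustrative, not the real contracts.
enum Role { Admin, Design, Deployment }

static class AuthorizationRules
{
    // Maps each command to the role it requires; commands absent from the
    // map are read-only and allowed for any authenticated role.
    static readonly Dictionary<string, Role> RequiredRole = new()
    {
        ["CreateSite"] = Role.Admin,
        ["CreateApiKey"] = Role.Admin,
        ["CreateSharedScript"] = Role.Design,
        ["CreateInstance"] = Role.Deployment,
        ["DeployInstance"] = Role.Deployment,
    };

    public static bool IsAuthorized(string command, ISet<Role> userRoles) =>
        !RequiredRole.TryGetValue(command, out var required) || userRoles.Contains(required);
}
```

Site scoping for site-scoped Deployment users is an additional check layered on top of this lookup.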
## Service Dependencies (DI)

The ManagementActor receives the following services and repositories via DI (injected through the actor's constructor or via a service provider):

- `ITemplateEngineRepository` / `TemplateService` — Template operations.
- `InstanceService` — Instance lifecycle and configuration.
- `ISiteRepository` — Site definitions and area management.
- `IDeploymentManagerRepository` / `DeploymentService` — Deployment pipeline operations.
- `ArtifactDeploymentService` — System-wide artifact deployment.
- `IExternalSystemRepository` — External system definitions.
- `INotificationRepository` — Notification lists and SMTP config.
- `ISecurityRepository` — API keys and LDAP role mappings.
- `IInboundApiRepository` — Inbound API method definitions.
- `ISharedScriptRepository` / `SharedScriptService` — Shared script definitions.
- `IDatabaseConnectionRepository` — Database connection definitions.
- `ICentralHealthAggregator` — Health status aggregation.
- `CommunicationService` — Central-site communication for deployment and remote queries.

## Configuration

| Section | Options Class | Contents |
|---------|--------------|----------|
| `ScadaLink:ManagementService` | `ManagementServiceOptions` | (Reserved for future configuration — e.g., command timeout overrides) |

## Dependencies

- **Commons**: Message contracts (`Messages/Management/`), shared types, repository interfaces.
- **Configuration Database (MS SQL)**: All queries and mutations go through repositories backed by EF Core.
- **Configuration Database (via IAuditService)**: All mutating operations are audit logged through the existing transactional audit mechanism.
- **Communication Layer**: Deployment commands and remote queries (parked messages, event logs) are routed to sites via Communication.
- **Security & Auth**: Authorization rules are enforced on every command using the authenticated user identity from the message envelope.
- **Cluster Infrastructure**: ManagementActor runs on all central nodes; ClusterClientReceptionist requires cluster membership.
- **All service components**: The ManagementActor delegates to the same services used by the Central UI — Template Engine, Deployment Manager, etc.

## Interactions

- **CLI**: The primary consumer. Connects via the HTTP Management API (`POST /management`) and sends commands as JSON with Basic Auth credentials.
- **Host**: Registers the ManagementActor and ClusterClientReceptionist on central nodes during startup.
- **Central UI**: Shares the same underlying services and repositories. The ManagementActor and Central UI are parallel interfaces to the same operations.
- **Communication Layer**: Deployment commands and remote site queries flow through communication actors.
- **Configuration Database (via IAuditService)**: All configuration changes are audited.
- **Security & Auth**: The ManagementActor enforces authorization using user identity passed in messages. For HTTP API access, the Management endpoint authenticates the user via LDAP and resolves roles before dispatching to the ManagementActor.
85
docs/requirements/Component-NotificationService.md
Normal file
@@ -0,0 +1,85 @@

# Component: Notification Service

## Purpose

The Notification Service provides email notification capabilities to scripts running at site clusters. It manages notification lists, handles email delivery, and integrates with the Store-and-Forward Engine for reliable delivery when the email server is unavailable.

## Location

Central cluster (definition management, stores in config DB). Site clusters (email delivery, reads definitions from local SQLite).

## Responsibilities

### Definitions (Central)

- Store notification lists in the configuration database: list name, recipients (name + email address).
- Store email server configuration (SMTP settings).
- Deploy notification lists and SMTP configuration uniformly to all sites. Deployment requires **explicit action** by a user with the Deployment role.
- Managed by users with the Design role.

### Delivery (Site)

- Resolve notification list names to recipient lists from **local SQLite** (populated by artifact deployment). Sites do not access the central config DB.
- Compose and send emails via SMTP using locally stored SMTP configuration.
- On delivery failure, submit the notification to the Store-and-Forward Engine for buffered retry.

## Notification List Definition

Each notification list includes:

- **Name**: Unique identifier (e.g., "Maintenance-Team", "Shift-Supervisors").
- **Recipients**: One or more entries, each with:
  - Recipient name.
  - Email address.

## Email Server Configuration

The SMTP configuration is defined centrally and deployed to all sites. It includes:

- **Server hostname**: SMTP server address (e.g., `smtp.office365.com`).
- **Port**: SMTP port (e.g., 587 for StartTLS, 465 for SSL).
- **Authentication mode**: One of:
  - **Basic Auth**: Username and password. For on-prem SMTP relays or servers that support basic authentication.
  - **OAuth2 Client Credentials**: Tenant ID, Client ID, and Client Secret. For Microsoft 365 and other modern SMTP providers that require OAuth2. The Notification Service handles the token lifecycle internally (fetch, cache, refresh on expiry).
- **TLS mode**: None, StartTLS, or SSL.
- **From address**: The sender email address for all notifications (e.g., `scada-notifications@company.com`).
- **Connection timeout**: Maximum time to wait for SMTP connection (default: 30 seconds).
- **Max concurrent connections**: Maximum simultaneous SMTP connections per site (default: 5).
- **Retry settings**: Max retry count, fixed time between retries (used by Store-and-Forward Engine for transient failures).

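A deployed SMTP configuration might look like the following sketch; the field names and nesting are illustrative, not the actual stored schema:

```json
{
  "host": "smtp.office365.com",
  "port": 587,
  "tlsMode": "StartTLS",
  "authMode": "OAuth2ClientCredentials",
  "tenantId": "<tenant-guid>",
  "clientId": "<app-registration-id>",
  "clientSecret": "<secret>",
  "fromAddress": "scada-notifications@company.com",
  "connectionTimeoutSeconds": 30,
  "maxConcurrentConnections": 5,
  "retry": { "maxCount": 5, "intervalSeconds": 60 }
}
```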
## Script API

```csharp
Notify.To("listName").Send("subject", "message")
```

- Available to instance scripts (via Script Execution Actors), alarm on-trigger scripts (via Alarm Execution Actors), and shared scripts (executing inline).
- Resolves the list name to recipients, composes the email, and attempts delivery.
- The message body is **plain text** only. No HTML content.

## Email Delivery Behavior

### Recipient Handling

- A single email is sent per `Notify.To().Send()` call, with all list recipients in **BCC**. The from address is placed in the To field.
- Recipients do not see each other's email addresses.
- No per-recipient deduplication — if the same email address appears in multiple lists and a script sends to both, they receive multiple emails.
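The single-message BCC composition can be sketched with the BCL `MailMessage` type; this is an illustration of the recipient handling rules, and the real delivery path may use a different SMTP client:

```csharp
using System.Collections.Generic;
using System.Net.Mail;

static class NotificationComposer
{
    // Sketch only: one message per Send() call, recipients hidden in BCC.
    public static MailMessage Compose(string fromAddress, string subject,
                                      string body, IEnumerable<string> recipients)
    {
        var msg = new MailMessage
        {
            From = new MailAddress(fromAddress),
            Subject = subject,
            Body = body,
            IsBodyHtml = false, // plain text only, no HTML content
        };
        msg.To.Add(fromAddress); // the from address is placed in the To field
        foreach (var r in recipients)
            msg.Bcc.Add(r);      // recipients never see each other's addresses
        return msg;
    }
}
```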
### Error Classification

Consistent with the External System Gateway pattern:

- **Transient failures** (connection refused, timeout, SMTP 4xx temporary errors): The notification is handed to the **Store-and-Forward Engine** for buffered retry per the SMTP configuration's retry settings. The script does **not** block waiting for eventual delivery.
- **Permanent failures** (SMTP 5xx permanent errors, e.g., mailbox not found): The error is returned **synchronously** to the calling script for handling. No retry — the notification will never deliver.
- This prevents the S&F buffer from accumulating notifications that will never succeed.

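The transient-vs-permanent split follows the SMTP reply-code classes: 4xx replies are temporary (retry via store-and-forward) and 5xx are permanent (returned synchronously). A minimal sketch, with an illustrative outcome enum:

```csharp
// Sketch of the transient-vs-permanent split by SMTP reply code.
// The DeliveryOutcome enum is illustrative, not a real contract type.
enum DeliveryOutcome { Delivered, RetryViaStoreAndForward, PermanentFailure }

static class SmtpErrorClassifier
{
    public static DeliveryOutcome Classify(int smtpStatusCode) => smtpStatusCode switch
    {
        >= 200 and < 300 => DeliveryOutcome.Delivered,
        >= 400 and < 500 => DeliveryOutcome.RetryViaStoreAndForward, // transient
        >= 500 and < 600 => DeliveryOutcome.PermanentFailure,        // no retry
        _ => DeliveryOutcome.RetryViaStoreAndForward, // e.g., connect timeout
    };
}
```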
### No Rate Limiting

- No application-level rate limiting. If the SMTP server enforces sending limits (e.g., Microsoft 365 throttling), those manifest as transient failures and are handled naturally by store-and-forward.

## Dependencies

- **Configuration Database (MS SQL)**: Stores notification list definitions and SMTP config (central only).
- **Local SQLite**: At sites, notification lists, recipients, and SMTP configuration are read from local SQLite (populated by artifact deployment). Sites do not access the central config DB.
- **Store-and-Forward Engine**: Handles buffering for failed email deliveries.
- **Security & Auth**: Design role manages notification lists.
- **Configuration Database (via IAuditService)**: Notification list changes are audit logged.

## Interactions

- **Site Runtime (Script/Alarm Execution Actors)**: Scripts invoke `Notify.To().Send()` through this component.
- **Store-and-Forward Engine**: Failed notifications are buffered here.
- **Deployment Manager**: Receives updated notification lists and SMTP config as part of system-wide artifact deployment (triggered explicitly by Deployment role).

122
docs/requirements/Component-Security.md
Normal file
@@ -0,0 +1,122 @@

# Component: Security & Auth

## Purpose

The Security & Auth component handles user authentication via LDAP/Active Directory and enforces role-based authorization across the system. It maps LDAP group memberships to system roles and applies permission checks to all operations.

## Location

Central cluster. Sites do not have user-facing interfaces and do not perform independent authentication.

## Responsibilities

- Authenticate users against LDAP/Active Directory by binding with user-supplied credentials.
- Map LDAP group memberships to system roles.
- Enforce role-based access control on all API and UI operations.
- Support site-scoped permissions for the Deployment role.

## Authentication

- **Mechanism**: The Central UI presents a username/password login form. The app validates credentials by binding to the LDAP/AD server with the provided credentials, then queries the user's group memberships.
- **Transport security**: LDAP connections **must** use LDAPS (port 636) or StartTLS to encrypt credentials in transit. Unencrypted LDAP (port 389) is not permitted.
- **No local user store**: All identity and group information comes from AD. No credentials are cached locally.
- **No Windows Integrated Authentication**: The app authenticates directly against LDAP/AD, not via Kerberos/NTLM.

## Session Management

### Cookie + JWT Hybrid

- On successful authentication, the app issues a **JWT** signed with a shared symmetric key (HMAC-SHA256). Both central cluster nodes use the same signing key from configuration, so either node can issue and validate tokens.
- The JWT is embedded in an **authentication cookie** rather than being passed as a bearer token. This is the correct transport for Blazor Server, where persistent SignalR circuits do not carry Authorization headers — the browser automatically sends the cookie with every SignalR connection and HTTP request.
- The cookie is **HttpOnly** and **Secure** (requires HTTPS).
- On each request, the server extracts and validates the JWT from the cookie. All authorization decisions are made from the JWT claims without hitting the database.
- **JWT claims**: User display name, username, list of roles (Admin, Design, Deployment), and for site-scoped Deployment, the list of permitted site IDs.

### Token Lifecycle

- **JWT expiry**: 15 minutes. On each request, if the cookie-embedded JWT is near expiry, the app re-queries LDAP for current group memberships and issues a fresh JWT, writing an updated cookie. Roles are never more than 15 minutes stale.
- **Idle timeout**: Configurable, default **30 minutes**. If no requests are made within the idle window, the token is not refreshed and the user must re-login. Tracked via a last-activity timestamp in the token.
- **Sliding refresh**: Active users stay logged in indefinitely — the token refreshes every 15 minutes as long as requests are made within the 30-minute idle window.

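The per-request refresh/idle decision can be sketched as a pure function; the "near expiry" threshold and the `SessionAction` enum are assumptions for illustration:

```csharp
using System;

// Sketch of the sliding-refresh decision described above.
// The 2-minute "near expiry" threshold is an illustrative assumption.
enum SessionAction { Accept, RefreshToken, RequireLogin }

static class SessionPolicy
{
    public static SessionAction Evaluate(DateTime now, DateTime expiresAt,
                                         DateTime lastActivity, TimeSpan idleTimeout)
    {
        if (now - lastActivity > idleTimeout) return SessionAction.RequireLogin; // idle too long
        if (now >= expiresAt) return SessionAction.RequireLogin;                 // token already expired
        // Near the 15-minute expiry: re-query LDAP group memberships and reissue.
        if (expiresAt - now < TimeSpan.FromMinutes(2)) return SessionAction.RefreshToken;
        return SessionAction.Accept;
    }
}
```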
### Load Balancer Compatibility

- The authentication cookie carries a self-contained JWT — no server-side session state. A load balancer in front of the central cluster can route requests to either node without sticky sessions or a shared session store.
- Since both central nodes share the same JWT signing key, either node can validate the cookie-embedded JWT. Central failover is transparent to users with valid cookies.

## LDAP Connection Failure

- **New logins**: If the LDAP/AD server is unreachable, login attempts **fail**. Users cannot be authenticated without LDAP.
- **Active sessions**: Users with valid (not-yet-expired) JWTs can **continue operating** with their current roles. The token refresh is skipped until LDAP is available again. This avoids disrupting engineers mid-work during a brief LDAP outage.
- **Recovery**: When LDAP becomes reachable again, the next token refresh cycle re-queries group memberships and issues a fresh token with current roles.

## Roles

### Admin

- **Scope**: System-wide (always).
- **Permissions**:
  - Manage site definitions.
  - Manage site-level data connections (define and assign to sites).
  - Manage area definitions per site.
  - Manage LDAP group-to-role mappings.
  - Manage API keys (create, enable/disable, delete).
  - System-level configuration.
  - View audit logs.

### Design

- **Scope**: System-wide (always).
- **Permissions**:
  - Create, edit, delete templates (including attributes, alarms, scripts).
  - Manage shared scripts.
  - Manage external system definitions.
  - Manage database connection definitions.
  - Manage notification lists and SMTP configuration.
  - Manage inbound API method definitions.
  - Run on-demand validation (template flattening, script compilation).

### Deployment

- **Scope**: System-wide or site-scoped.
- **Permissions**:
  - Create and manage instances (overrides, connection bindings, area assignment).
  - Disable, enable, and delete instances.
  - Deploy configurations to instances.
  - Deploy system-wide artifacts (shared scripts, external system definitions, DB connections, notification lists) to all sites.
  - View deployment diffs and status.
  - Use debug view.
  - Manage parked messages.
  - View site event logs.
- **Site scoping**: A user with site-scoped Deployment role can only perform these actions for instances at their permitted sites.

## Multi-Role Support

- A user can hold **multiple roles simultaneously** by being a member of multiple LDAP groups.
- Roles are **independent** — there is no implied hierarchy between roles.
- For example, a user who is a member of both `SCADA-Designers` and `SCADA-Deploy-All` holds both the Design and Deployment roles, allowing them to author templates and also deploy configurations.

## LDAP Group Mapping

- System administrators configure mappings between LDAP groups and roles.
- Examples:
  - `SCADA-Admins` → Admin role
  - `SCADA-Designers` → Design role
  - `SCADA-Deploy-All` → Deployment role (all sites)
  - `SCADA-Deploy-SiteA` → Deployment role (Site A only)
  - `SCADA-Deploy-SiteB` → Deployment role (Site B only)
- A user can be a member of multiple groups, granting multiple independent roles.
- Group mappings are stored in the configuration database and managed via the Central UI (Admin role).

## Permission Enforcement

- Every API endpoint and UI action checks the authenticated user's roles before proceeding.
- Site-scoped checks additionally verify the target site is within the user's permitted sites.
- Unauthorized actions return an appropriate error and are not logged as audit events (only successful changes are audited).
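The site-scoped check layers on top of the role check: a system-wide Deployment user may target any site, a site-scoped one only the sites listed in their claims. A sketch, with an illustrative `UserContext` shape standing in for the real JWT claims:

```csharp
using System.Collections.Generic;

// Sketch of the site-scoping check. UserContext is illustrative;
// the real identity comes from the JWT claims.
record UserContext(ISet<string> Roles, bool DeploymentIsSiteScoped,
                   ISet<string> PermittedSiteIds);

static class SiteScopePolicy
{
    public static bool CanDeployToSite(UserContext user, string targetSiteId)
    {
        if (!user.Roles.Contains("Deployment")) return false; // role check first
        // System-wide Deployment users pass; site-scoped users must hold the site.
        return !user.DeploymentIsSiteScoped || user.PermittedSiteIds.Contains(targetSiteId);
    }
}
```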
## Dependencies

- **Active Directory / LDAP**: Source of user identity and group memberships.
- **Configuration Database (MS SQL)**: Stores LDAP group-to-role mappings and site scoping rules.
- **Configuration Database (via IAuditService)**: Security/admin changes (role mapping updates) are audit logged.

## Interactions

- **Central UI**: All UI requests pass through authentication and authorization.
- **Template Engine**: Design role enforcement.
- **Deployment Manager**: Deployment role enforcement with site scoping.
- **All central components**: Role checks are a cross-cutting concern applied at the API layer.
- **Management Service**: The ManagementActor enforces role-based authorization on every incoming command using the authenticated user identity carried in the message envelope. The CLI authenticates users via the same LDAP bind mechanism and passes the user's identity (username, roles, permitted sites) in every request message. The ManagementActor applies the same role and site-scoping rules as the Central UI — no separate authentication path exists on the server side.

72
docs/requirements/Component-SiteEventLogging.md
Normal file
@@ -0,0 +1,72 @@

# Component: Site Event Logging

## Purpose

The Site Event Logging component records operational events at each site cluster, providing a local audit trail of runtime activity. Events are queryable from the central UI for remote troubleshooting.

## Location

Site clusters (event recording and storage). Central cluster (remote query access via UI).

## Responsibilities

- Record operational events from all site subsystems.
- Persist events to local SQLite.
- Enforce 30-day retention policy with automatic purging.
- Respond to remote queries from central for event log data.

## Events Logged

| Category | Events |
|----------|--------|
| Script Executions | Script started, completed, failed (with error details), recursion limit exceeded |
| Alarm Events | Alarm activated, alarm cleared (which alarm, which instance), alarm evaluation error |
| Deployment Events | Configuration received from central, scripts compiled, applied successfully, apply failed |
| Data Connection Status | Connected, disconnected, reconnected (per connection) |
| Store-and-Forward | Message queued, delivered, retried, parked |
| Instance Lifecycle | Instance enabled, disabled, deleted |

## Event Entry Schema

Each event entry contains:

- **Timestamp**: When the event occurred.
- **Event Type**: Category of the event (script, alarm, deployment, connection, store-and-forward, instance-lifecycle).
- **Severity**: Info, Warning, or Error.
- **Instance ID** *(optional)*: The instance associated with the event (if applicable).
- **Source**: The subsystem that generated the event (e.g., "ScriptActor:MonitorSpeed", "AlarmActor:OverTemp", "DataConnection:PLC1").
- **Message**: Human-readable description of the event.
- **Details** *(optional)*: Additional structured data (e.g., exception stack trace, alarm name, message ID, compilation errors).

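The schema above maps naturally onto a record type; the property names below are illustrative, not the actual storage schema:

```csharp
using System;

// Sketch of an event entry per the schema above; names are assumptions.
enum EventSeverity { Info, Warning, Error }

record SiteEvent(
    DateTime Timestamp,
    string EventType,      // script, alarm, deployment, connection, ...
    EventSeverity Severity,
    string? InstanceId,    // optional: not every event relates to an instance
    string Source,         // e.g., "AlarmActor:OverTemp"
    string Message,        // human-readable description
    string? Details);      // optional structured data, e.g., a stack trace
```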
## Storage

- Events are stored in **local SQLite** on each site node.
- Each node maintains its own event log. Only the **active node** generates and stores events. Event logs are **not replicated** to the standby node. On failover, the new active node starts logging to its own SQLite database; historical events from the previous active node are no longer queryable via central until that node comes back online. This is acceptable because event logs are diagnostic, not transactional.
- **Retention**: 30 days. A **daily background job** runs on the active node and deletes all events older than 30 days. Hard delete — no archival.
- **Storage cap**: A configurable maximum database size (default: 1 GB) is enforced. If the storage cap is reached before the 30-day retention window, the oldest events are purged first. This prevents disk exhaustion from alarm storms, script failure loops, or connection flapping.

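The daily purge reduces to computing a cutoff and issuing a hard delete; the SQL and table/column names in the comment are assumptions for illustration:

```csharp
using System;

static class RetentionPolicy
{
    // Cutoff for the daily purge: everything older than this is deleted.
    public static DateTime Cutoff(DateTime nowUtc, int retentionDays = 30) =>
        nowUtc.AddDays(-retentionDays);

    // Illustrative hard delete the job would run against local SQLite
    // (no archival): DELETE FROM SiteEvents WHERE Timestamp < @cutoff;
}
```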
## Central Access

- The central UI can query site event logs remotely via the Communication Layer.
- Queries support filtering by:
  - Event type / category
  - Time range
  - Instance ID
  - Severity
- **Keyword search**: Free-text search on message and source fields (SQLite LIKE query). Useful for finding events by script name, alarm name, or error message across all instances.
- Results are **paginated** with a configurable page size (default: 500 events). Each response includes a continuation token for fetching additional pages. This prevents broad queries from overwhelming the communication channel.
- The site processes the query locally against SQLite and returns matching results to central.

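The keyword search plus page-size cap can be sketched as a parameterized SQLite query; the table and column names are assumptions, and the keyword is bound as a parameter rather than concatenated:

```csharp
static class EventQueryBuilder
{
    // Sketch of the site-local keyword + pagination query. "SiteEvents" and
    // its columns are illustrative names, not the real schema.
    public static string Build(bool hasKeyword, int pageSize)
    {
        var where = hasKeyword
            ? "WHERE (Message LIKE '%' || @keyword || '%' " +
              "OR Source LIKE '%' || @keyword || '%') "
            : "";
        return $"SELECT * FROM SiteEvents {where}" +
               $"ORDER BY Timestamp DESC LIMIT {pageSize} OFFSET @offset;";
    }
}
```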
## Dependencies

- **SQLite**: Local storage on each site node.
- **Communication Layer**: Handles remote query requests from central.
- **Site Runtime**: Generates script execution events, alarm events, deployment application events, and instance lifecycle events.
- **Data Connection Layer**: Generates connection status events.
- **Store-and-Forward Engine**: Generates buffer activity events.

## Interactions

- **All site subsystems**: Event logging is a cross-cutting concern — any subsystem that produces notable events calls the Event Logging service.
- **Communication Layer**: Receives remote queries from central and returns results.
- **Central UI**: Site Event Log Viewer displays queried events.
- **Health Monitoring**: Script error rates and alarm evaluation error rates can be derived from event log data.

325
docs/requirements/Component-SiteRuntime.md
Normal file
@@ -0,0 +1,325 @@

# Component: Site Runtime

## Purpose

The Site Runtime component manages the execution of deployed machine instances at site clusters. It encompasses the actor hierarchy that represents running instances, their scripts, and their alarms. It owns the site-side deployment lifecycle (receiving configs from central, compiling scripts, creating actors), script execution, alarm evaluation, and the site-wide Akka stream for attribute and alarm state changes.

This component replaces the previously separate Script Engine and Alarm Engine concepts, unifying them under a single actor hierarchy rooted at the Deployment Manager singleton.

## Location

Site clusters only.

## Responsibilities

- Run the Deployment Manager singleton (Akka.NET cluster singleton) on the active site node.
- On startup (or failover), read all deployed configurations from local SQLite and re-create the full actor hierarchy.
- Receive deployment commands from central: new/updated instance configurations, instance lifecycle commands (disable, enable, delete), and system-wide artifact updates.
- Compile C# scripts when deployments are received.
- Manage the Instance Actor hierarchy (Instance Actors, Script Actors, Alarm Actors).
- Execute scripts via Script Actors with support for concurrent execution.
- Evaluate alarm conditions via Alarm Actors and manage alarm state.
- Maintain the site-wide Akka stream for attribute value and alarm state changes.
- Execute shared scripts inline as compiled code libraries (no separate actors).
- Enforce script call recursion limits.

---

## Actor Hierarchy

```
Deployment Manager Singleton (Cluster Singleton)
├── Instance Actor ("MachineA-001")
│   ├── Script Actor ("MonitorSpeed") — coordinator
│   │   └── Script Execution Actor — short-lived, per invocation
│   ├── Script Actor ("CalculateOEE") — coordinator
│   │   └── Script Execution Actor — short-lived, per invocation
│   ├── Alarm Actor ("OverTemp") — coordinator
│   │   └── Alarm Execution Actor — short-lived, per on-trigger invocation
│   └── Alarm Actor ("LowPressure") — coordinator
├── Instance Actor ("MachineA-002")
│   └── ...
└── ...
```

---

## Deployment Manager Singleton

### Role

- Akka.NET **cluster singleton** — guaranteed to run on exactly one node in the site cluster (the active node).
- On failover, Akka.NET restarts the singleton on the new active node.

### Startup Behavior

1. Read all deployed configurations from local SQLite.
2. Read all shared scripts from local storage.
3. Compile all scripts (instance scripts, alarm on-trigger scripts, shared scripts).
4. Create Instance Actors for all deployed, **enabled** instances as child actors. Instance Actors are created in **staggered batches** (e.g., 20 at a time with a short delay between batches) to prevent a reconnection storm — 500 Instance Actors all registering data subscriptions simultaneously would overwhelm OPC UA servers and network capacity.
5. Make compiled shared script code available to all Script Actors.

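Step 4's staggered creation can be sketched as batched spawning with a delay between batches; the batch size, delay, and spawn delegate are illustrative, and the real singleton spawns Akka.NET child actors:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class StaggeredStartup
{
    // Sketch of staggered Instance Actor creation (step 4 above).
    public static async Task SpawnInBatches(IReadOnlyList<string> instanceIds,
                                            Action<string> spawn,
                                            int batchSize = 20,
                                            TimeSpan? delayBetweenBatches = null)
    {
        var delay = delayBetweenBatches ?? TimeSpan.FromMilliseconds(500);
        for (int i = 0; i < instanceIds.Count; i += batchSize)
        {
            foreach (var id in instanceIds.Skip(i).Take(batchSize))
                spawn(id); // each Instance Actor registers its data subscriptions
            if (i + batchSize < instanceIds.Count)
                await Task.Delay(delay); // spread subscription load over time
        }
    }
}
```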
### Deployment Handling
|
||||
- Receives flattened instance configurations from central via the Communication Layer.
|
||||
- Stores the new configuration in local SQLite.
|
||||
- Compiles all scripts in the configuration.
|
||||
- Creates a new Instance Actor (for new instances) or updates an existing one (for redeployments).
|
||||
- For redeployments: the existing Instance Actor and all its children are stopped, then a new Instance Actor is created with the updated configuration. Subscriptions are re-established.
|
||||
- Reports deployment result (success/failure) back to central.
|
||||
|
||||
### System-Wide Artifact Handling
|
||||
- Receives updated shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration from central.
|
||||
- Stores all artifacts in local SQLite. After artifact deployment, the site is fully self-contained — all runtime configuration is read from local SQLite with no access to the central configuration database.
|
||||
- Recompiles shared scripts and makes updated code available to all Script Actors.
|
||||
|
||||
### Instance Lifecycle Commands
|
||||
- **Disable**: Stops the Instance Actor and its children. Retains the deployed configuration in SQLite so the instance can be re-enabled without redeployment.
|
||||
- **Enable**: Creates a new Instance Actor from the stored configuration (same as startup).
|
||||
- **Delete**: Stops the Instance Actor and its children, removes the deployed configuration from local SQLite. Does **not** clear store-and-forward messages.
|
||||
|
||||
### Debug Snapshot Routing
|
||||
- Receives `DebugSnapshotRequest` from the Communication Layer and forwards to the Instance Actor by unique name (same lookup as `SubscribeDebugViewRequest`).
|
||||
- Returns an error response if no Instance Actor exists for the requested unique name (instance not deployed or not enabled).
|
||||
|
||||
---

## Instance Actor

### Role

- **Single source of truth** for all runtime state of a deployed instance.
- Holds all attribute values (both static configuration values and live values from data connections).
- Holds current alarm states (active/normal), updated by child Alarm Actors.
- Publishes attribute value changes and alarm state changes to the site-wide Akka stream.

### Initialization

1. Load all attribute values from the flattened configuration (static defaults).
2. Set quality to **uncertain** for all attributes that have a data source reference. Static attributes (no data source reference) have quality **good**. The uncertain quality persists until the first value update arrives from the Data Connection Layer, distinguishing "not yet received" from "known good" or "connection lost."
3. Register data source references with the Data Connection Layer for subscriptions.
4. Create child Script Actors (one per script defined on the instance).
5. Create child Alarm Actors (one per alarm defined on the instance).
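The quality rule in step 2 can be sketched as follows (a minimal illustration; `AttributeDef`, `AttributeState`, and `Quality` are hypothetical names, not the actual contracts):

```csharp
enum Quality { Good, Uncertain }

record AttributeDef(string Name, object DefaultValue, string? DataSourceRef);
record AttributeState(object Value, Quality Quality);

static AttributeState InitialState(AttributeDef def) =>
    // Data-connected attributes start Uncertain until the first update
    // arrives from the Data Connection Layer; static attributes are Good.
    new(def.DefaultValue,
        def.DataSourceRef is null ? Quality.Good : Quality.Uncertain);
```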
### Attribute Value Updates

- Receives tag value updates from the Data Connection Layer for attributes with data source references.
- Updates the in-memory attribute value.
- Notifies subscribed child Script Actors and Alarm Actors of the change.
- Publishes the change to the site-wide Akka stream.

### Stream Message Format

- **Attribute changes**: `[InstanceUniqueName].[AttributePath].[AttributeName]`, attribute value, attribute quality, attribute change timestamp.
- **Alarm state changes**: `[InstanceUniqueName].[AlarmName]`, alarm state (active/normal), priority, timestamp.
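As a sketch, the two stream messages could be modeled like this (illustrative record shapes; the field names are assumptions, not the actual message contracts):

```csharp
using System;

enum Quality { Good, Uncertain }
enum AlarmState { Active, Normal }

record AttributeChanged(
    string Topic,              // "[InstanceUniqueName].[AttributePath].[AttributeName]"
    object Value,
    Quality Quality,
    DateTimeOffset Timestamp);

record AlarmStateChanged(
    string Topic,              // "[InstanceUniqueName].[AlarmName]"
    AlarmState State,
    int Priority,              // 0–1000
    DateTimeOffset Timestamp);
```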
### GetAttribute / SetAttribute

- **GetAttribute**: Returns the current in-memory value for the requested attribute.
- **SetAttribute** (for attributes with a data source reference): Sends a write request to the Data Connection Layer. The DCL writes to the physical device. If the write fails (connection down, device rejection, timeout), the error is returned synchronously to the calling script for handling. On success, the existing subscription picks up the confirmed value from the device and sends it back as a value update, which then updates the in-memory value. The in-memory value is **not** optimistically updated.
- **SetAttribute** (for static attributes): Updates the in-memory value and **persists the override to local SQLite**. On restart (or failover), the Instance Actor loads persisted overrides on top of the deployed configuration, preserving runtime-modified values. A redeployment of the instance resets all persisted overrides to the new deployed configuration values.

### Debug View Support

- On request from central (via the Communication Layer), the Instance Actor provides a **snapshot** of all current attribute values and alarm states.
- Subsequent changes are delivered via the site-wide Akka stream, filtered by instance unique name.
- The Instance Actor also handles one-shot `DebugSnapshotRequest` messages: it builds the same snapshot (attribute values and alarm states) and replies directly to the sender. Unlike `SubscribeDebugViewRequest`, no subscriber is registered and no stream is established.
### Supervision Strategy

The Instance Actor supervises all child Script and Alarm Actors with explicit strategies:

| Child Actor | Exception Type | Strategy | Rationale |
|-------------|---------------|----------|-----------|
| Script Actor | Any exception | Resume | Script Actor is a coordinator — its state (trigger timers, last execution time) should survive child failures. Script Execution Actor failures are isolated. |
| Alarm Actor | Any exception | Resume | Alarm Actor holds alarm state. Resume preserves state and continues evaluation on next value update. |
| Script Execution Actor | Unhandled exception | Stop | Short-lived, per-invocation. Failure is logged; the Script Actor coordinator remains active for future triggers. |
| Alarm Execution Actor | Unhandled exception | Stop | Short-lived, per on-trigger invocation. Same as Script Execution Actor. |

The Deployment Manager singleton supervises Instance Actors with a **OneForOneStrategy** — one Instance Actor's failure does not affect other instances.

When the Instance Actor is stopped (due to disable, delete, or redeployment), Akka.NET automatically stops all child actors.
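In Akka.NET terms, the table maps to supervisor strategies declared on each parent actor. A sketch assuming `ReceiveActor`-based actors (class names illustrative; the actual actors carry much more behavior):

```csharp
using Akka.Actor;

class InstanceActor : ReceiveActor
{
    // Script and Alarm Actors: Resume on any exception so coordinator
    // state (trigger timers, alarm states) survives the failure.
    protected override SupervisorStrategy SupervisorStrategy() =>
        new OneForOneStrategy(_ => Directive.Resume);
}

class ScriptActor : ReceiveActor
{
    // Script Execution Actors are short-lived: Stop the failed child;
    // the coordinator stays available for future triggers.
    protected override SupervisorStrategy SupervisorStrategy() =>
        new OneForOneStrategy(_ => Directive.Stop);
}
```

The Deployment Manager would declare an analogous `OneForOneStrategy` over its Instance Actor children.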
---

## Script Actor

### Role

- **Coordinator** for a single script definition on an instance.
- Holds the compiled script code and trigger configuration.
- Manages trigger evaluation (interval timer, value change detection, conditional evaluation).
- Spawns short-lived Script Execution Actors for each invocation.

### Trigger Management

- **Interval**: The Script Actor manages an internal timer. When the timer fires, it spawns a Script Execution Actor.
- **Value Change**: The Script Actor subscribes to attribute change notifications from its parent Instance Actor for the specific monitored attribute. When the attribute changes, it spawns a Script Execution Actor.
- **Conditional**: The Script Actor subscribes to attribute change notifications for the monitored attribute. On each update, it evaluates the condition (equals or not-equals a value). If the condition is met, it spawns a Script Execution Actor.
- **Minimum time between runs**: If configured, the Script Actor tracks the last execution time and skips trigger invocations that fire before the minimum interval has elapsed.
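The minimum-time-between-runs check is a small piece of coordinator state. A sketch (the type name `TriggerThrottle` is hypothetical):

```csharp
using System;

class TriggerThrottle
{
    private readonly TimeSpan _minInterval;
    private DateTimeOffset _lastRun = DateTimeOffset.MinValue;

    public TriggerThrottle(TimeSpan minInterval) => _minInterval = minInterval;

    // Returns true if the invocation should proceed, recording the run time;
    // false means the trigger fired before the minimum interval elapsed.
    public bool TryTrigger(DateTimeOffset now)
    {
        if (now - _lastRun < _minInterval) return false; // skip: too soon
        _lastRun = now;
        return true;
    }
}
```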
### Concurrent Execution

- Each invocation spawns a **new Script Execution Actor** as a child.
- Multiple Script Execution Actors can run concurrently (e.g., a trigger fires while a previous `Instance.CallScript` invocation is still running).
- The Script Actor coordinates but does not block on child completion.

### Script Execution Actor

- **Short-lived** child actor created per invocation.
- Receives: compiled script code, input parameters, reference to the parent Instance Actor, current call depth.
- Executes the script in the Akka actor context.
- Has access to the full Script Runtime API (see below).
- Returns the script's return value (if defined) to the caller, then stops.
### Handling `Instance.CallScript`

- When an external caller (another Script Execution Actor, an Alarm Execution Actor, or a routed call from the Inbound API) sends a `CallScript` message to the Script Actor, it spawns a Script Execution Actor to handle the call.
- The caller uses the **Akka ask pattern** and receives the return value when the execution completes.
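A sketch of the caller side using Akka.NET's `Ask<T>` (the `CallScript`/`ScriptResult` message types and the 30-second timeout are illustrative assumptions, not the actual contracts):

```csharp
using System;
using System.Threading.Tasks;
using Akka.Actor;

record CallScript(string Name, object[] Parameters, int Depth);
record ScriptResult(object? ReturnValue);

static async Task<object?> CallScriptAsync(
    IActorRef scriptActor, string name, object[] parameters, int currentDepth)
{
    // Ask wraps the request/reply pair in a Task; it faults if the
    // target does not reply within the timeout.
    var result = await scriptActor.Ask<ScriptResult>(
        new CallScript(name, parameters, currentDepth + 1),
        TimeSpan.FromSeconds(30));
    return result.ReturnValue;
}
```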
---

## Alarm Actor

### Role

- **Coordinator** for a single alarm definition on an instance.
- Evaluates alarm trigger conditions against attribute value updates.
- Manages alarm state (active/normal) in memory.
- Executes on-trigger scripts when the alarm activates.

### Alarm Evaluation

- Subscribes to attribute change notifications from its parent Instance Actor for the attribute(s) referenced by its trigger definition.
- On each value update, evaluates the trigger condition:
  - **Value Match**: Incoming value equals the predefined target.
  - **Range Violation**: Value is outside the allowed min/max range.
  - **Rate of Change**: Value change rate exceeds the defined threshold over time.
- When the condition is met and the alarm is currently in **normal** state, the alarm transitions to **active**:
  - Updates the alarm state on the parent Instance Actor (which publishes to the Akka stream).
  - If an on-trigger script is defined, spawns an Alarm Execution Actor to execute it.
- When the condition clears and the alarm is in **active** state, the alarm transitions to **normal**:
  - Updates the alarm state on the parent Instance Actor.
  - No script execution on clear.
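The edge-triggered transition logic can be sketched independently of the actor plumbing (illustrative; `conditionMet` would come from evaluating the trigger definition, e.g. `value < min || value > max` for a range violation):

```csharp
enum AlarmState { Normal, Active }

class AlarmEvaluator
{
    public AlarmState State { get; private set; } = AlarmState.Normal;

    // Returns true when a state transition occurred; the parent Instance
    // Actor is then notified, and on Normal→Active an on-trigger script
    // may be spawned.
    public bool Evaluate(bool conditionMet)
    {
        if (conditionMet && State == AlarmState.Normal)
        {
            State = AlarmState.Active;
            return true;
        }
        if (!conditionMet && State == AlarmState.Active)
        {
            State = AlarmState.Normal;
            return true;
        }
        return false; // no change: repeated violations do not re-trigger
    }
}
```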
### Alarm State

- Held **in memory** only — not persisted to SQLite.
- On restart (or failover), alarm states are re-evaluated from incoming values. All alarms start in normal state and transition to active when conditions are detected.

### Alarm Execution Actor

- **Short-lived** child actor created when an on-trigger script needs to execute.
- Same pattern as Script Execution Actor — receives compiled code, executes, returns, and stops.
- Has access to the Instance Actor for `GetAttribute`/`SetAttribute`.
- **Can** call instance scripts via `Instance.CallScript()` — sends an ask message to the appropriate sibling Script Actor.
- Instance scripts **cannot** call alarm on-trigger scripts — the call direction is one-way.
---

## Shared Script Library

- Shared scripts are compiled at the site when received from central.
- Compiled code is stored in memory and made available to all Script Actors.
- When a Script Execution Actor calls `Scripts.CallShared("scriptName", params)`, the shared script code executes **inline** in the Script Execution Actor's context — it is a direct method invocation, not an actor message.
- This avoids serialization bottlenecks since there is no shared script actor to contend for.
- Shared scripts have access to the same runtime API as instance scripts (GetAttribute, SetAttribute, external systems, notifications, databases).
---

## Script Runtime API

Available to all Script Execution Actors and Alarm Execution Actors:

### Instance Attributes

- `Instance.GetAttribute("name")` — Read an attribute value from the parent Instance Actor.
- `Instance.SetAttribute("name", value)` — Write an attribute value. For data-connected attributes, writes to the DCL; for static attributes, updates in-memory and persists to local SQLite (survives restart/failover, reset on redeployment).

### Other Scripts

- `Instance.CallScript("scriptName", parameters)` — Send an ask message to a sibling Script Actor. The target Script Actor spawns a Script Execution Actor, executes, and returns the result. The call includes the current recursion depth.
- `Scripts.CallShared("scriptName", parameters)` — Execute shared script code inline (direct method invocation). The call includes the current recursion depth.

### External Systems

- `ExternalSystem.Call("systemName", "methodName", params)` — Synchronous HTTP call. Blocks until response or timeout. All failures return to the script. Use when the script needs the result.
- `ExternalSystem.CachedCall("systemName", "methodName", params)` — Fire-and-forget with store-and-forward on transient failure. Use for outbound data pushes where deferred delivery is acceptable.

### Notifications

- `Notify.To("listName").Send("subject", "message")` — Send an email notification via a named notification list.

### Database Access

- `Database.Connection("connectionName")` — Obtain a raw MS SQL client connection (ADO.NET) for synchronous read/write.
- `Database.CachedWrite("connectionName", "sql", parameters)` — Submit a write operation for store-and-forward delivery.

### Recursion Limit

- Every script call (`Instance.CallScript` and `Scripts.CallShared`) increments a call depth counter.
- If the counter exceeds the maximum recursion depth (default: 10), the call fails with an error.
- The error is logged to the site event log.
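The depth check itself is trivial; a sketch of the guard each call site would apply (the default limit of 10 comes from the text; the names are illustrative):

```csharp
using System;

static class ScriptCallGuard
{
    public const int MaxCallDepth = 10; // default maximum recursion depth

    // Called on every CallScript/CallShared hop with the incremented depth.
    public static void EnsureDepth(int depth)
    {
        if (depth > MaxCallDepth)
            throw new InvalidOperationException(
                $"Script call depth {depth} exceeds maximum of {MaxCallDepth}");
    }
}
```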
---

## Script Trust Model

Scripts execute **in-process** with constrained access. The following restrictions are enforced at compilation and runtime:

- **Allowed**: Access to the Script Runtime API (GetAttribute, SetAttribute, CallScript, CallShared, ExternalSystem, Notify, Database), standard C# language features, basic .NET types (collections, string manipulation, math, date/time).
- **Forbidden**: File system access (`System.IO`), process spawning (`System.Diagnostics.Process`), threading (`System.Threading` — except async/await), reflection (`System.Reflection`), raw network access (`System.Net.Sockets`, `System.Net.Http` — must use `ExternalSystem.Call`), assembly loading, unsafe code.
- **Execution timeout**: Configurable per-script maximum execution time. Exceeding the timeout cancels the script and logs an error.
- **Memory**: Scripts share the host process memory. No per-script memory limit, but the execution timeout prevents runaway allocations.

These constraints are enforced by restricting the set of assemblies and namespaces available to the script compilation context.
## Script Scoping Rules

- Scripts can only read/write attributes on **their own instance** (via the parent Instance Actor).
- Scripts can call other scripts on **their own instance** (via sibling Script Actors).
- Scripts can call **shared scripts** (inline execution).
- Scripts **cannot** access other instances' attributes or scripts.
- Alarm on-trigger scripts **can** call instance scripts; instance scripts **cannot** call alarm on-trigger scripts.
---

## Tell vs. Ask Usage

Per Akka.NET best practices, internal actor communication uses **Tell** (fire-and-forget with reply-to) for the hot path:

- **Tag value updates** (DCL → Instance Actor): Tell. High-frequency, no response needed.
- **Attribute change notifications** (Instance Actor → Script/Alarm Actors): Tell. Fan-out notifications.
- **Stream publishing** (Instance Actor → Akka stream): Tell. Fire-and-forget.

**Ask** is reserved for system boundaries where a synchronous response is needed:

- **`Instance.CallScript()`**: Ask pattern from Script Execution Actor to sibling Script Actor. The caller needs the return value. Acceptable because script calls are infrequent relative to tag updates.
- **`Route.To().Call()`**: Ask from Inbound API to site Instance Actor via Communication Layer. External caller needs a response.
- **Debug view snapshot**: Ask from Communication Layer to Instance Actor for initial state.
## Concurrency & Serialization

- The Instance Actor processes messages **sequentially** (standard Akka actor model). This means `SetAttribute` calls from concurrent Script Execution Actors are serialized at the Instance Actor, preventing race conditions on attribute state.
- Script Execution Actors may run concurrently, but all state mutations (attribute reads/writes, alarm state updates) are mediated through the parent Instance Actor's message queue.
- External side effects (external system calls, notifications, database writes) are not serialized — concurrent scripts may produce interleaved side effects. This is acceptable because each side effect is independent.
## Site-Wide Stream Backpressure

- The site-wide Akka stream uses **per-subscriber buffering** with bounded buffers. Each subscriber (debug view, future consumers) gets an independent buffer.
- If a subscriber falls behind (e.g., slow network on debug view), its buffer fills and oldest events are dropped. This does not affect other subscribers or the publishing Instance Actors.
- Instance Actors publish to the stream with **fire-and-forget** semantics — publishing never blocks the actor.
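The drop-oldest behavior corresponds to Akka.Streams' `OverflowStrategy.DropHead`. The same policy can be sketched as a plain bounded queue (illustrative, not the actual stream stage):

```csharp
using System.Collections.Generic;

// Per-subscriber bounded buffer with drop-oldest overflow: a slow
// consumer loses its oldest events, never blocking the publisher.
class DropOldestBuffer<T>
{
    private readonly Queue<T> _queue = new();
    private readonly int _capacity;

    public DropOldestBuffer(int capacity) => _capacity = capacity;

    public void Publish(T item)
    {
        if (_queue.Count == _capacity) _queue.Dequeue(); // drop oldest
        _queue.Enqueue(item);
    }

    public int Count => _queue.Count;
    public T Take() => _queue.Dequeue();
}
```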
## Error Handling

### Script Errors

- Unhandled exceptions and timeouts in Script Execution Actors are **logged locally** to the site event log.
- The Script Actor (coordinator) is **not affected** — it remains active for future trigger events.
- Script failures are **not reported to central** (except as aggregated error rate metrics via Health Monitoring).

### Alarm Evaluation Errors

- Errors during alarm condition evaluation are **logged locally** to the site event log.
- The Alarm Actor remains active and continues evaluating on subsequent value updates.
- Alarm evaluation error rates are reported to central via Health Monitoring.

### Script Compilation Errors

- If script compilation fails when a deployment is received, the entire deployment for that instance is **rejected**. No partial state is applied.
- The failure is reported back to central as a failed deployment.
- Note: Pre-deployment validation at central should catch compilation errors before they reach the site. Site-side compilation failures indicate an unexpected issue.
---

## Dependencies

- **Data Connection Layer**: Provides tag value updates to Instance Actors. Receives write requests from Instance Actors.
- **Store-and-Forward Engine**: Handles reliable delivery for external system calls, notifications, and cached database writes submitted by scripts.
- **External System Gateway**: Provides external system method invocations for scripts.
- **Notification Service**: Handles email delivery for scripts.
- **Communication Layer**: Receives deployments and lifecycle commands from central. Handles debug view requests. Reports deployment results.
- **Site Event Logging**: Records script executions, alarm events, deployment events, instance lifecycle events.
- **Health Monitoring**: Reports script error rates and alarm evaluation error rates.
- **Local SQLite**: Persists deployed configurations, system-wide artifacts (external system definitions, database connection definitions, data connection definitions, notification lists, SMTP configuration).

## Interactions

- **Deployment Manager (central)**: Receives flattened configurations, system-wide artifact updates, and instance lifecycle commands.
- **Data Connection Layer**: Bidirectional — receives value updates, sends write-back commands.
- **Communication Layer**: Receives commands from central, sends deployment results, serves debug view data.
- **Store-and-Forward Engine**: Scripts route cached writes, notifications, and external system calls here.
- **Health Monitoring**: Periodically reports error rate metrics.
102
docs/requirements/Component-StoreAndForward.md
Normal file
@@ -0,0 +1,102 @@

# Component: Store-and-Forward Engine

## Purpose

The Store-and-Forward Engine provides reliable message delivery for outbound communications from site clusters. It buffers messages when the target system is unavailable, retries them according to configured policies, and parks messages that exhaust retries for manual review.

## Location

Site clusters only. The central cluster does not buffer messages.

## Responsibilities

- Buffer outbound messages when the target system is unavailable.
- Manage three categories of buffered messages:
  - External system API calls.
  - Email notifications.
  - Cached database writes.
- Retry delivery per message according to the configured retry policy.
- Park messages that exhaust their retry limit (dead-letter).
- Persist buffered messages to local SQLite for durability.
- Replicate buffered messages to the standby node via application-level replication over Akka.NET remoting.
- On failover, the standby node takes over delivery from its replicated copy.
- Respond to remote queries from central for parked message management (list, retry, discard).

## Message Lifecycle
```
Script submits message
        │
        ▼
Attempt immediate delivery
        │
        ├── Success → Remove from buffer
        │
        └── Failure → Buffer message
                │
                ▼
        Retry loop (per retry policy)
                │
                ├── Success → Remove from buffer + notify standby
                │
                └── Max retries exhausted → Park message
```

## Retry Policy

Retry settings are defined on the **source entity** (not per-message):

- **External systems**: Each external system definition includes max retry count and time between retries.
- **Notifications**: Email/SMTP configuration includes max retry count and time between retries.
- **Cached database writes**: Each database connection definition includes max retry count and time between retries.

The retry interval is **fixed** (not exponential backoff). Fixed interval is sufficient for the expected use cases.

**Note**: Only **transient failures** are eligible for store-and-forward buffering. For external system calls, transient failures are connection errors, timeouts, and HTTP 5xx responses. Permanent failures (HTTP 4xx) are returned directly to the calling script and are **not** queued for retry. This prevents the buffer from accumulating requests that will never succeed.
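The transient/permanent split for external system calls can be sketched as a classification function (illustrative; the real engine would apply this at the delivery layer):

```csharp
using System;
using System.Net;

static class FailureClassifier
{
    // Connection errors and timeouts surface as exceptions (no status
    // code) and are transient; HTTP 5xx is transient; HTTP 4xx is
    // permanent and is returned to the calling script, never buffered.
    public static bool IsTransient(Exception? error, HttpStatusCode? status)
    {
        if (error is not null) return true;
        if (status is null) return false;
        var code = (int)status;
        return code >= 500 && code <= 599;
    }
}
```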
## Buffer Size

There is **no maximum buffer size**. Messages accumulate in the buffer until delivery succeeds or retries are exhausted and the message is parked. Storage is bounded only by available disk space on the site node.

## Persistence

- Buffered messages are persisted to a **local SQLite database** on each site node.
- The active node persists locally and forwards each buffer operation (add, remove, park) to the standby node **asynchronously** via Akka.NET remoting. The active node does not wait for standby acknowledgment — this avoids adding latency to every script that buffers a message.
- The standby node applies the same operations to its own local SQLite database.
- On failover, the new active node has a near-complete copy of the buffer. In rare cases, the most recent operations may not have been replicated (e.g., a message added or removed just before failover). This can result in a few **duplicate deliveries** (message delivered but remove not replicated) or a few **missed retries** (message added but not replicated). Both are acceptable trade-offs for the latency benefit.
- On failover, the new active node resumes delivery from its local copy.
## Parked Message Management

- Parked messages remain stored at the site in SQLite.
- The central UI can query sites for parked messages via the Communication Layer.
- Operators can:
  - **Retry** a parked message (moves it back to the retry queue).
  - **Discard** a parked message (removes it permanently).
- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages continue to exist and can be managed via the central UI.
## Message Format

Each buffered message stores:

- **Message ID**: Unique identifier.
- **Category**: External system call, notification, or cached database write.
- **Target**: External system name, notification list name, or database connection name.
- **Payload**: Serialized message content (API method + parameters, email subject + body, SQL + parameters).
- **Retry Count**: Number of attempts so far.
- **Created At**: Timestamp when the message was first queued.
- **Last Attempt At**: Timestamp of the most recent delivery attempt.
- **Status**: Pending, retrying, or parked.
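Sketched as a C# record (field names mirror the list above; the exact persisted schema is an assumption):

```csharp
using System;

enum MessageCategory { ExternalSystemCall, Notification, CachedDatabaseWrite }
enum MessageStatus { Pending, Retrying, Parked }

record BufferedMessage(
    Guid MessageId,
    MessageCategory Category,
    string Target,              // external system, notification list, or DB connection name
    byte[] Payload,             // serialized call / email / SQL content
    int RetryCount,
    DateTimeOffset CreatedAt,
    DateTimeOffset? LastAttemptAt,
    MessageStatus Status);
```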
## Dependencies

- **SQLite**: Local persistence on each node.
- **Communication Layer**: Application-level replication to standby node; remote query handling from central.
- **External System Gateway**: Delivers external system API calls.
- **Notification Service**: Delivers email notifications.
- **Database Connections**: Delivers cached database writes.
- **Site Event Logging**: Logs store-and-forward activity (queued, delivered, retried, parked).

## Interactions

- **Site Runtime (Script Actors)**: Scripts submit messages to the buffer (external calls, notifications, cached DB writes).
- **Communication Layer**: Handles parked message queries/commands from central.
- **Health Monitoring**: Reports buffer depth metrics.
165
docs/requirements/Component-TemplateEngine.md
Normal file
@@ -0,0 +1,165 @@

# Component: Template Engine

## Purpose

The Template Engine is the core modeling component that lives on the central cluster. It manages the definition, inheritance, composition, and resolution of machine templates — the blueprints from which all machine instances are created. It handles flattening templates into deployable configurations, calculating diffs between deployed and current states, and performing comprehensive pre-deployment validation.

## Location

Central cluster only. Sites receive flattened output and have no awareness of templates.

## Responsibilities

- Store and manage template definitions (attributes, alarms, scripts) in the configuration database.
- Enforce inheritance (is-a) relationships between templates.
- Enforce composition (has-a) relationships, including recursive nesting of feature modules.
- Detect and reject naming collisions when composing feature modules (design-time error).
- Resolve the attribute chain: Instance → Child Template → Parent Template → Composing Template → Composed Module.
- Enforce locking rules — locked members cannot be overridden downstream, intermediate levels can lock previously unlocked members, and nothing can unlock what's locked above.
- Support adding new attributes, alarms, and scripts in child templates.
- Prevent removal of inherited members.
- Flatten a fully resolved template + instance overrides into a deployable configuration (no template structure, just concrete attribute values with resolved data connection bindings).
- Calculate diffs between deployed and template-derived configurations.
- Perform comprehensive pre-deployment validation (see Validation section).
- Provide on-demand validation for Design users during template authoring.
- Enforce template deletion constraints — templates cannot be deleted if any instances or child templates reference them.
## Key Entities

### Template

- Has a unique name/ID.
- Optionally extends a parent template (inheritance).
- Contains zero or more composed feature modules (composition).
- Defines attributes, alarms, and scripts as first-class members.
- Cannot be deleted if referenced by instances or child templates.
- Concurrent editing uses **last-write-wins** — no pessimistic locking or conflict detection.

### Attribute

- Name, Value, Data Type (Boolean, Integer, Float, String), Lock Flag, Description.
- Optional Data Source Reference — a **relative path** within a data connection (e.g., `/Motor/Speed`). The template defines *what* to read but not *where* to read it from. The connection binding is an instance-level concern.
- Value may be empty if intended to be set at instance level or via data connection binding.

### Alarm

- Name, Description, Priority Level (0–1000), Lock Flag.
- Trigger Definition: Value Match, Range Violation, or Rate of Change.
- Optional On-Trigger Script reference.

### Script (Template-Level)

- Name, Lock Flag, C# source code.
- Trigger configuration: Interval, Value Change, Conditional, or invoked by alarm/other script.
- Optional minimum time between runs.
- **Parameter Definition** *(optional)*: Defines input parameters (name and data type per parameter). Scripts without parameters accept no arguments.
- **Return Value Definition** *(optional)*: Defines the structure of the script's return value (field names and data types). Supports single objects and lists of objects. Scripts without a return definition return void.

### Instance

- Associated with a specific template and a specific site.
- Assigned to an area within the site.
- Can override non-locked attribute values (no adding/removing attributes).
- Bound to data connections at instance creation — **per-attribute binding** where each attribute with a data source reference individually selects its data connection.
- Can be in **enabled** or **disabled** state.
- Can be **deleted** — deletion is blocked if the site is unreachable.

### Area

- Hierarchical groupings per site (parent-child).
- Stored in the configuration database.
- Used for filtering/organizing instances in the UI.
## Composed Member Addressing

When a template composes a feature module, members from that module are addressed using a **path-qualified canonical name**: `[ModuleInstanceName].[MemberName]`. For nested compositions, the path extends: `[OuterModule].[InnerModule].[MemberName]`.

- All internal references (triggers, scripts, diffs, stream topics, UI display) use canonical names.
- The composing template's own members (not from a module) have no prefix — they are top-level names.
- Naming collision detection operates on canonical names, so two modules can define the same member name as long as their module instance names differ.
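Canonical name construction is simple prefix joining; a sketch:

```csharp
static class CanonicalNames
{
    // Members of a composed module are addressed as
    // [OuterModule].[InnerModule].[MemberName]; top-level members of the
    // composing template carry no prefix.
    public static string CanonicalName(string memberName, params string[] modulePath) =>
        modulePath.Length == 0
            ? memberName
            : string.Join(".", modulePath) + "." + memberName;
}
```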
## Override Granularity

Override and lock rules apply per entity type at the following granularity:

- **Attributes**: Value and Description are overridable. Data Type and Data Source Reference are fixed by the defining level. Lock applies to the entire attribute (when locked, no fields can be overridden).
- **Alarms**: Priority Level, Trigger Definition (thresholds/ranges/rates), Description, and On-Trigger Script reference are overridable. Name and Trigger Type (Value Match vs. Range vs. Rate of Change) are fixed. Lock applies to the entire alarm.
- **Scripts**: C# source code, Trigger configuration, minimum time between runs, and parameter/return definitions are overridable. Name is fixed. Lock applies to the entire script.
- **Composed module members**: A composing template or child template can override non-locked members inside a composed module using the canonical path-qualified name.
## Naming Collision Detection

When a template composes two or more feature modules, the system must check for naming collisions across:

- Attribute names
- Alarm names
- Script names

If any composed module introduces a name that already exists (from another composed module or from the composing template itself), this is a **design-time error**. The template cannot be saved until the conflict is resolved. Collision detection is performed recursively for nested module compositions.
## Flattening Process
|
||||
|
||||
When an instance is deployed, the Template Engine resolves the full configuration:
|
||||
|
||||
1. Start with the base template's attributes, alarms, and scripts.
|
||||
2. Walk the inheritance chain, applying overrides at each level (respecting locks).
|
||||
3. Resolve composed feature modules, applying overrides from composing templates (respecting locks).
|
||||
4. Apply instance-level overrides (respecting locks).
|
||||
5. Resolve data connection bindings — replace connection name references with concrete connection details from the site.
|
||||
6. Output a flat structure: list of attributes with resolved values and data source addresses, list of alarms with resolved trigger definitions, list of scripts with resolved code and triggers.
|
||||
|
||||
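The override-with-locks resolution in the flattening steps above can be sketched as follows. This is an illustrative model only, not the Template Engine's actual code: levels are applied from least specific (base template) to most specific (instance), and a lock set at an earlier level suppresses all later overrides.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """A template member (attribute, alarm, or script) at one level."""
    value: object
    locked: bool = False


def flatten(levels):
    """levels: dicts of {name: Member}, ordered base -> ... -> instance.

    Later levels override earlier ones, except where an earlier level
    set the lock flag (locks can never be removed downstream).
    """
    resolved = {}
    for level in levels:
        for name, member in level.items():
            existing = resolved.get(name)
            if existing is not None and existing.locked:
                continue  # locked upstream: downstream override is ignored
            resolved[name] = member
    return resolved
```

For example, an instance override of a locked attribute is silently discarded, while an unlocked one takes effect.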
## Diff Calculation

The Template Engine can compare:

- The **currently deployed** flat configuration of an instance.
- The **current template-derived** flat configuration (what the instance would look like if redeployed now).

The diff output identifies added, removed, and changed attributes/alarms/scripts.

## Pre-Deployment Validation

Before a deployment is sent to a site, the Template Engine performs comprehensive validation:

- **Flattening**: The full template hierarchy resolves and flattens without errors.
- **Naming collision detection**: No duplicate attribute, alarm, or script names in the flattened configuration.
- **Script compilation**: All instance scripts and alarm on-trigger scripts are test-compiled and must compile without errors.
- **Alarm trigger references**: Alarm trigger definitions reference attributes that exist in the flattened configuration.
- **Script trigger references**: Script triggers (value change, conditional) reference attributes that exist in the flattened configuration.
- **Data connection binding completeness**: Every attribute with a data source reference has a data connection binding assigned on the instance, and the bound data connection name exists as a defined connection at the instance's site.
- **Exception**: Validation does **not** verify that data source relative paths resolve to real tags on physical devices — that is a runtime concern.

### Semantic Validation

Beyond compilation, the Template Engine performs static semantic checks:

- **Script call targets**: `Instance.CallScript()` and `Scripts.CallShared()` targets must reference scripts that exist in the flattened configuration or shared script library.
- **Argument compatibility**: Parameter count and data types at call sites must match the target script's parameter definitions.
- **Return type compatibility**: If a script call's return value is used, the return type definition must match the caller's expectations.
- **Trigger operand types**: Alarm triggers and script conditional triggers must reference attributes with compatible data types (e.g., Range Violation requires numeric attributes).

### Graph Acyclicity

The Template Engine enforces that inheritance and composition graphs are **acyclic**:

- A template cannot inherit from itself or from any descendant in its inheritance chain.
- A template cannot compose itself or any ancestor/descendant that would create a circular composition.
- These checks are performed on save.
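The on-save acyclicity check described above amounts to cycle detection over the combined inheritance and composition edges. A minimal sketch (assumed, not the engine's actual implementation) using a standard three-color depth-first search:

```python
def has_cycle(edges):
    """edges: {template: [referenced templates]}, covering both
    'inherits from' and 'composes' links. Returns True if any cycle exists."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {}

    def visit(node):
        state = color.get(node, WHITE)
        if state == GRAY:
            return True   # back edge: node is on the current DFS path
        if state == BLACK:
            return False  # already fully explored, no cycle through here
        color[node] = GRAY
        if any(visit(n) for n in edges.get(node, [])):
            return True
        color[node] = BLACK
        return False

    return any(visit(n) for n in list(edges))
```

A self-reference (a template inheriting from or composing itself) is just the one-node case of the same check.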
### Flattened Configuration Revision

Each flattened configuration output includes a **revision hash** (computed from the resolved content). This hash is used for:

- Staleness detection: comparing the deployed revision to the current template-derived revision without a full diff.
- Diff correlation: ensuring diffs are computed against a consistent baseline.
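One plausible way to derive such a revision hash is to serialize the flattened configuration canonically and hash the result. The exact algorithm is not specified by this document; the sketch below (canonical JSON plus SHA-256) is an assumption that merely illustrates the required property: identical resolved content yields an identical hash, regardless of key ordering.

```python
import hashlib
import json


def revision_hash(flat_config: dict) -> str:
    """Deterministic hash of a flattened configuration (illustrative)."""
    canonical = json.dumps(flat_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Staleness detection then reduces to a string comparison between the deployed hash and the freshly computed one.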
### On-Demand Validation

The same validation logic is available to Design users in the Central UI without triggering a deployment. This allows template authors to check their work for errors during authoring.

### Shared Script Validation

For shared scripts, pre-compilation validation is performed before deployment. Since shared scripts have no instance context, validation is limited to C# syntax and structural correctness.

## Dependencies

- **Configuration Database (MS SQL)**: Stores all templates, instances, areas, and their relationships.
- **Security & Auth**: Enforces the Design role for template authoring and the Deployment role for instance management.
- **Configuration Database (via IAuditService)**: All template and instance changes are audit logged.

## Interactions

- **Deployment Manager**: Requests flattened configurations, diffs, and validation results from the Template Engine.
- **Central UI**: Provides the data model for template authoring, instance management, and on-demand validation.
138
docs/requirements/Component-TraefikProxy.md
Normal file
@@ -0,0 +1,138 @@
# Component: Traefik Proxy

## Purpose

The Traefik Proxy is a reverse proxy and load balancer that sits in front of the central cluster's two web servers. It provides a single stable URL for the CLI, browser, and external API consumers, automatically routing traffic to the active central node. When the active node fails over, Traefik detects the change via health checks and redirects traffic to the new active node without manual intervention.

## Location

Runs as a Docker container (`scadalink-traefik`) in the cluster compose stack (`docker/docker-compose.yml`). Not part of the application codebase — it is a third-party infrastructure component with static configuration files.

`docker/traefik/`

## Responsibilities

- Route all HTTP traffic (Central UI, Management API, Inbound API, health endpoints) to the active central node.
- Health-check both central nodes via `/health/active` to determine which is the active (cluster leader) node.
- Automatically fail over to the standby node when the active node goes down.
- Provide a dashboard for monitoring routing state and backend health.

## How It Works

### Active Node Detection

Traefik polls `/health/active` on both central nodes every 5 seconds. This endpoint returns:

- **HTTP 200** on the active node (the Akka.NET cluster leader).
- **HTTP 503** on the standby node (or if the node is unreachable).

Only the node returning 200 receives traffic. The health check is implemented by `ActiveNodeHealthCheck` in the Host project, which checks `Cluster.Get(system).State.Leader == SelfMember.Address`.
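The routing decision this produces can be stated in a few lines. The sketch below is a hedged model of Traefik's behavior (not its source code): a backend stays in the load-balancing pool only while `/health/active` returns 200, so during normal operation exactly one of the two central nodes receives traffic.

```python
def pool(status_by_server):
    """status_by_server: backend URL -> last /health/active HTTP status.

    Returns the backends Traefik would keep in rotation (those returning
    200). With the active/standby health semantics above, at most one
    central node is ever in the pool.
    """
    return [url for url, code in sorted(status_by_server.items())
            if code == 200]
```

During a failover window both nodes may briefly return 503, in which case the pool is empty and requests fail until the new leader starts passing the check.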
### Failover Sequence

1. Active node fails (crash, network partition, or graceful shutdown).
2. Akka.NET cluster detects the failure (~10s heartbeat timeout).
3. Split-brain resolver acts after the stable-after period (~15s).
4. Surviving node becomes cluster leader.
5. `ActiveNodeHealthCheck` on the surviving node starts returning 200.
6. Traefik's next health poll (within 5s) detects the change.
7. Traffic routes to the new active node.

**Total failover time**: ~25–30s (Akka failover ~25s + Traefik poll interval up to 5s).

### SignalR / Blazor Server Considerations

Blazor Server uses persistent SignalR connections (WebSocket circuits). During failover:

- Active SignalR circuits on the failed node are lost.
- The browser's SignalR reconnection logic attempts to reconnect.
- Traefik routes the reconnection to the new active node.
- The user's session survives because authentication uses a cookie-embedded JWT with shared Data Protection keys across both central nodes.
- The user may see a brief "Reconnecting..." overlay before the circuit re-establishes.

## Configuration

### Static Config (`docker/traefik/traefik.yml`)

```yaml
entryPoints:
  web:
    address: ":80"

api:
  dashboard: true
  insecure: true

providers:
  file:
    filename: /etc/traefik/dynamic.yml
```

- **Entrypoint `web`**: Listens on port 80 (mapped to host port 9000).
- **Dashboard**: Enabled in insecure mode (no auth) for development. Accessible at `http://localhost:8180`.
- **File provider**: Loads routing rules from a static YAML file (no Docker socket required).

### Dynamic Config (`docker/traefik/dynamic.yml`)

```yaml
http:
  routers:
    central:
      rule: "PathPrefix(`/`)"
      service: central
      entryPoints:
        - web

  services:
    central:
      loadBalancer:
        healthCheck:
          path: /health/active
          interval: 5s
          timeout: 3s
        servers:
          - url: "http://scadalink-central-a:5000"
          - url: "http://scadalink-central-b:5000"
```

- **Router `central`**: Catches all requests and forwards to the `central` service.
- **Service `central`**: Load balancer with two backends (both central nodes) and a health check on `/health/active`.
- **Health check interval**: 5 seconds. A server failing the health check is removed from the pool within one interval.

## Ports

| Host Port | Container Port | Purpose |
|-----------|----------------|---------|
| 9000 | 80 | Load-balanced entrypoint (Central UI, Management API, Inbound API) |
| 8180 | 8080 | Traefik dashboard |
## Health Endpoints

The central nodes expose the following health endpoints:

| Endpoint | Purpose | Who Uses It |
|----------|---------|-------------|
| `/health/ready` | Readiness gate — 200 when database + Akka cluster are healthy | Kubernetes probes, monitoring |
| `/health/active` | Active node — 200 only on cluster leader | **Traefik** (routing decisions) |
## Dependencies

- **Central cluster nodes**: The two backends (`scadalink-central-a`, `scadalink-central-b`) on the `scadalink-net` Docker network.
- **ActiveNodeHealthCheck**: Health check implementation in `src/ScadaLink.Host/Health/ActiveNodeHealthCheck.cs` that determines cluster leader status.
- **Docker network**: All containers must be on the shared `scadalink-net` bridge network.

## Interactions

- **CLI**: Connects to `http://localhost:9000/management` — routed by Traefik to the active node.
- **Browser (Central UI)**: Connects to `http://localhost:9000` — Blazor Server + SignalR routed to the active node.
- **Inbound API consumers**: Connect to `http://localhost:9000/api/{methodName}` — routed to the active node.
- **Cluster Infrastructure**: The `ActiveNodeHealthCheck` relies on Akka.NET cluster gossip state to determine the leader.

## Production Considerations

The current configuration is for development/testing. In production:

- **TLS termination**: Add an HTTPS entrypoint with certificates (Let's Encrypt via Traefik's ACME provider, or static certs).
- **Dashboard auth**: Disable `insecure: true` and configure authentication on the dashboard.
- **WebSocket support**: Traefik supports WebSocket proxying natively — no additional config is needed for SignalR.
- **Sticky sessions**: Not required. The Management API is stateless (Basic Auth per request). Blazor Server circuits are bound to a specific node via SignalR, but reconnection handles failover transparently.
497
docs/requirements/HighLevelReqs.md
Normal file
@@ -0,0 +1,497 @@
# SCADA System - High Level Requirements

## 1. Deployment Architecture

- **Site Clusters**: 2-node failover clusters deployed at each site, running on Windows.
- **Central Cluster**: A single 2-node failover cluster serving as the central hub.
- **Communication Topology**: Hub-and-spoke. The central cluster communicates with each site cluster. Site clusters do **not** communicate with one another.

### 1.1 Central vs. Site Responsibilities

- **Central cluster** is the single source of truth for all template authoring, configuration, and deployment decisions.
- **Site clusters** receive **flattened configurations** — fully resolved attribute sets with no template structure. Sites do not need to understand templates, inheritance, or composition.
- Sites **do not** support local/emergency configuration overrides. All configuration changes originate from central.

### 1.2 Failover

- Failover is managed at the **application level** using **Akka.NET** (not Windows Server Failover Clustering).
- Each cluster (central and site) runs an **active/standby** pair where Akka.NET manages node roles and failover detection.
- **Site failover**: The standby node takes over data collection and script execution seamlessly, including responsibility for the store-and-forward buffers. The Site Runtime Deployment Manager singleton is restarted on the new active node, which reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy.
- **Central failover**: The standby node takes over central responsibilities. Deployments that are in progress during a failover are treated as **failed** and must be re-initiated by the engineer.
### 1.3 Store-and-Forward Persistence (Site Clusters Only)

- Store-and-forward applies **only at site clusters** — the central cluster does **not** buffer messages. If a site is unreachable, operations from central fail and must be retried by the engineer.
- All site-level store-and-forward buffers (external system calls, notifications, and cached database writes) are **replicated between the two site cluster nodes** using **application-level replication** over Akka.NET remoting.
- The **active node** persists buffered messages to a **local SQLite database** and forwards them to the standby node, which maintains its own local SQLite copy.
- On failover, the standby node already has a replicated copy of the buffer and takes over delivery seamlessly.
- Successfully delivered messages are removed from both nodes' local stores.
- There is **no maximum buffer size** — messages accumulate until they either succeed or exhaust retries and are parked.
- Retry intervals are **fixed** (not exponential backoff). The fixed interval is sufficient for the expected use cases.
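The fixed-interval retry policy above can be made concrete with a small sketch. The interval value and retry cap here are illustrative assumptions; the requirement only fixes the shape of the policy: constant spacing between attempts (no exponential backoff), and parking rather than dropping once retries are exhausted.

```python
def retry_schedule(first_failure, interval, max_retries):
    """Relative times (seconds) at which delivery is re-attempted after
    first_failure, before the message is parked. Fixed spacing throughout."""
    return [first_failure + i * interval for i in range(1, max_retries + 1)]
```

For example, a 30-second interval with three retries attempts delivery at t+30, t+60, and t+90, then parks the message for manual retry or discard.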
### 1.4 Deployment Behavior

- When central deploys a new configuration to a site instance, the site **applies it immediately** upon receipt — no local operator confirmation is required.
- If a site loses connectivity to central, it **continues operating** with its last received deployed configuration.
- The site reports back to central whether the deployment was successfully applied.
- **Pre-deployment validation**: Before any deployment is sent to a site, the central cluster performs comprehensive validation including flattening the configuration, test-compiling all scripts, verifying alarm trigger references, verifying script trigger references, and checking data connection binding completeness (see Section 3.11).

### 1.5 System-Wide Artifact Deployment

- Changes to shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration are **not automatically propagated** to sites.
- Deployment of system-wide artifacts requires **explicit action** by a user with the **Deployment** role.
- Artifacts can be deployed to **all sites at once** or to an **individual site** (per-site deployment).
- The Design role manages the definitions; the Deployment role triggers deployment to sites. A user may hold both roles.

## 2. Data Storage & Data Flow

### 2.1 Central Databases (MS SQL)

- **Configuration Database**: A dedicated database for system-specific configuration data (e.g., templates, site definitions, instance configurations, system settings).
- **Machine Data Database**: A separate database for collected machine data (e.g., telemetry, measurements, events).
### 2.2 Communication: Central ↔ Site

- Central-to-site and site-to-central communication uses **Akka.NET ClusterClient/ClusterClientReceptionist** for cross-cluster messaging with automatic failover.
- **Site addressing**: Site Akka base addresses (NodeA and NodeB) are stored in the **Sites database table** and configured via the Central UI. Central creates a ClusterClient per site using both addresses as contact points (cached in memory, refreshed periodically and on admin changes) rather than relying on runtime registration messages from sites.
- **Central contact points**: Sites configure **multiple central contact points** (both central node addresses) for redundancy. ClusterClient handles failover between central nodes automatically.
- **Central as integration hub**: Central brokers requests between external systems and sites. For example, a recipe manager sends a recipe to central, which routes it to the appropriate site. MES requests machine values from central, which routes the request to the site and returns the response.
- **Real-time data streaming** is not continuous for all machine data. The only real-time stream is an **on-demand debug view** — an engineer in the central UI can open a live view of a specific instance's tag values and alarm states for troubleshooting purposes. This is session-based and temporary. The debug view subscribes to the site-wide Akka stream filtered by instance (see Section 8.1).

### 2.3 Site-Level Storage & Interface

- Sites have **no user interface** — they are headless collectors, forwarders, and script executors.
- Sites require local storage for: the current deployed (flattened) configurations, deployed scripts, shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration.
- After artifact deployment, sites are **fully self-contained** — all runtime configuration is read from local SQLite. Sites do **not** access the central configuration database at runtime.
- Store-and-forward buffers are persisted to a **local SQLite database on each node** and replicated between nodes via application-level replication (see 1.3).
### 2.4 Data Connection Protocols

- The system supports **OPC UA** and **LmxProxy** (a gRPC-based custom protocol with an existing client SDK).
- Both protocols implement a **common interface** supporting: connect, subscribe to tag paths, receive value updates, and write values.
- Additional protocols can be added by implementing the common interface.
- The Data Connection Layer is a **clean data pipe** — it publishes tag value updates to Instance Actors but performs no evaluation of triggers or alarm conditions.
- **Initial attribute quality**: Attributes bound to a data connection start with **uncertain** quality when the Instance Actor initializes. The quality remains uncertain until the first value update is received from the Data Connection Layer. This distinguishes "never received a value" from "received a known-good value" or "connection lost" (bad quality).
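The common interface above can be sketched as follows. The real system is .NET, so this Python rendering is purely illustrative; the method names are assumptions, not the actual SDK surface. The toy implementation also shows the "clean data pipe" behavior: writes fan out to subscribers with no trigger or alarm evaluation.

```python
from typing import Callable, Protocol


class DataConnection(Protocol):
    """Shape of the common protocol interface (illustrative names)."""
    def connect(self) -> None: ...
    def subscribe(self, tag_path: str,
                  on_update: Callable[[str, object], None]) -> None: ...
    def write(self, tag_path: str, value: object) -> None: ...


class InMemoryConnection:
    """Toy implementation: value updates are forwarded to subscribers
    verbatim; no evaluation of any kind happens in this layer."""

    def __init__(self):
        self._subs = {}

    def connect(self):
        pass  # nothing to do for an in-memory fake

    def subscribe(self, tag_path, on_update):
        self._subs.setdefault(tag_path, []).append(on_update)

    def write(self, tag_path, value):
        for cb in self._subs.get(tag_path, []):
            cb(tag_path, value)
```

A new protocol (e.g., a third fieldbus) would plug in by providing the same three operations.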
### 2.5 Scale

- Approximately **10 sites**.
- **50–500 machines per site**.
- **25–75 live data point tags per machine**.
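A quick back-of-envelope calculation from the figures above gives the upper bounds the system must handle:

```python
# Upper-bound tag counts implied by the scale figures (worst case).
sites = 10
machines_per_site = 500
tags_per_machine = 75

tags_per_site = machines_per_site * tags_per_machine  # 37,500 live tags/site
system_tags = sites * tags_per_site                   # 375,000 system-wide
```

So a single site cluster must sustain subscriptions for up to ~37,500 live tags, and the fleet as a whole up to ~375,000.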
## 3. Template & Machine Modeling

### 3.1 Template Structure

- Machines are modeled as **instances of templates**.
- Templates define a set of **attributes**.
- Each attribute has a **lock flag** that controls whether it can be overridden downstream.

### 3.2 Attribute Definition

Each attribute carries the following metadata:

- **Name**: Identifier for the attribute.
- **Value**: The default or configured value. May be empty if intended to be set at the instance level.
- **Data Type**: The value's type. Fixed set: Boolean, Integer, Float, String.
- **Lock Flag**: Controls whether the attribute can be overridden downstream.
- **Description**: Human-readable explanation of the attribute's purpose.
- **Data Source Reference** *(optional)*: A **relative path** within a data connection (e.g., `/Motor/Speed`). The template defines *what* to read — the path relative to a data connection. The template does **not** specify which data connection to use; that is an instance-level concern (see Section 3.3). Attributes without a data source reference are static configuration values.

### 3.3 Data Connections

- **Data connections** are reusable, named resources defined centrally and then **assigned to specific sites** (e.g., an OPC server, a PLC endpoint). Data connection definitions are deployed to sites as part of **artifact deployment** (see Section 1.5) and stored in local SQLite.
- A data connection encapsulates the details needed to communicate with a data source (protocol, address, credentials, etc.).
- Attributes with a data source reference must be **bound to a data connection at instance creation** — the template defines *what* to read (the relative path), and the instance specifies *where* to read it from (the data connection assigned to the site).
- **Binding is per-attribute**: Each attribute with a data source reference individually selects its data connection. Different attributes on the same instance may use different data connections. The Central UI supports bulk assignment (selecting multiple attributes and assigning a data connection to all of them at once) to reduce tedium.
- Templates do **not** specify a default connection. The connection binding is an instance-level concern.
- The flattened configuration sent to a site resolves connection references into concrete connection details paired with attribute relative paths.
- Data connection names are **not** standardized across sites — different sites may have different data connection names for equivalent devices.
### 3.4 Alarm Definitions

Alarms are **first-class template members** alongside attributes and scripts, following the same **inheritance, override, and lock rules**.

Each alarm has:

- **Name**: Identifier for the alarm.
- **Description**: Human-readable explanation of the alarm condition.
- **Priority Level**: Numeric value from 0–1000.
- **Lock Flag**: Controls whether the alarm can be overridden downstream.
- **Trigger Definition**: One of the following trigger types:
  - **Value Match**: Triggers when a monitored attribute equals a predefined value.
  - **Range Violation**: Triggers when a monitored attribute value falls outside an allowed range.
  - **Rate of Change**: Triggers when a monitored attribute value changes faster than a defined threshold.
- **On-Trigger Script** *(optional)*: A script to execute when the alarm triggers. The alarm on-trigger script executes in the context of the instance and can call instance scripts, but instance scripts **cannot** call alarm on-trigger scripts. The call direction is one-way.
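The three trigger types can be expressed as simple predicates. The sketch below assumes simplified semantics (rate of change compared between consecutive samples); the requirement names the types but leaves exact threshold handling to the implementation, so treat these as illustrative:

```python
def value_match(value, target):
    """Value Match: monitored attribute equals a predefined value."""
    return value == target


def range_violation(value, low, high):
    """Range Violation: value falls outside the allowed [low, high] range."""
    return not (low <= value <= high)


def rate_of_change(prev, curr, dt_seconds, max_rate_per_second):
    """Rate of Change: value changed faster than the defined threshold,
    measured between two consecutive samples dt_seconds apart."""
    return abs(curr - prev) / dt_seconds > max_rate_per_second
```

Each predicate returning `True` corresponds to the alarm entering the active state (and, if configured, the on-trigger script firing).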
### 3.4.1 Alarm State

- Alarm state (active/normal) is **managed at the site level** per instance, held **in memory** by the Alarm Actor.
- When the alarm condition clears, the alarm **automatically returns to normal state** — no acknowledgment workflow is required.
- Alarm state is **not persisted** — on restart, alarm states are re-evaluated from incoming values.
- Alarm state changes are published to the site-wide Akka stream as `[InstanceUniqueName].[AlarmName]`, alarm state (active/normal), priority, timestamp.

### 3.5 Template Relationships

Templates participate in two distinct relationship types:

- **Inheritance (is-a)**: A child template extends a parent template. The child inherits all attributes, alarms, scripts, and composed feature modules from the parent. The child can:
  - Override the **values** of non-locked inherited attributes, alarms, and scripts.
  - **Add** new attributes, alarms, or scripts not present in the parent.
  - **Not** remove attributes, alarms, or scripts defined by the parent.
- **Composition (has-a)**: A template can nest an instance of another template as a **feature module** (e.g., embedding a RecipeSystem module inside a base machine template). Feature modules can themselves compose other feature modules **recursively**.
- **Naming collisions**: If a template composes two feature modules that each define an attribute, alarm, or script with the same name, this is a **design-time error**. The system must detect and report the collision, and the template cannot be saved until the conflict is resolved.
### 3.6 Locking

- Locking applies to **attributes, alarms, and scripts** uniformly.
- Any of these can be **locked** at the level where it is defined or overridden.
- A locked attribute **cannot** be overridden by any downstream level (child templates, composing templates, or instances).
- An unlocked attribute **can** be overridden by any downstream level.
- **Intermediate locking**: Any level in the chain can lock an attribute that was unlocked upstream. Once locked, it remains locked for all levels below — a downstream level **cannot** unlock an attribute locked above it.
### 3.6.1 Attribute Resolution Order

Attributes are resolved from most specific to least specific. The first value encountered wins:

1. **Instance** (site-deployed machine)
2. **Child Template** (most derived first, walking up the inheritance chain)
3. **Composing Template** (the template that embeds a feature module can override the module's attributes)
4. **Composed Module** (the original feature module definition, recursively resolved if modules nest other modules)

At any level, an override is only permitted if the attribute has **not been locked** at a higher-priority level.
### 3.7 Override Scope

- **Inheritance**: Child templates can override non-locked attributes from their parent, including attributes originating from composed feature modules.
- **Composition**: A template that composes a feature module can override non-locked attributes within that module.
- Overrides can "pierce" into composed modules — a child template can override attributes inside a feature module it inherited from its parent.

### 3.8 Instance Rules

- An instance is a deployed occurrence of a template at a site.
- Instances **can** override the values of non-locked attributes.
- Instances **cannot** add new attributes.
- Instances **cannot** remove attributes.
- The instance's structure (which attributes exist, which feature modules are composed) is strictly defined by its template.
- Each instance is **assigned to an area** within its site (see 3.10).

### 3.8.1 Instance Lifecycle

- Instances can be in one of two states: **enabled** or **disabled**.
- **Enabled**: The instance is active at the site — data subscriptions, script triggers, and alarm evaluation are all running.
- **Disabled**: The site **stops** script triggers, data subscriptions (no live data collection), and alarm evaluation. The deployed configuration is **retained** on the site so the instance can be re-enabled without redeployment. Store-and-forward messages for a disabled instance **continue to drain** (deliver pending messages).
- **Deletion**: Instances can be deleted. Deletion removes the running configuration from the site, stops subscriptions, and destroys the Instance Actor and its children. Store-and-forward messages are **not** cleared on deletion — they continue to be delivered or can be managed (retried/discarded) via parked message management. If the site is unreachable when a delete is triggered, the deletion **fails** (same behavior as a failed deployment). The central side does not mark it as deleted until the site confirms.
- Templates **cannot** be deleted if any instances or child templates reference them. The user must remove all references first.
### 3.9 Template Deployment & Change Propagation

- Template changes are **not** automatically propagated to deployed instances.
- The system maintains two views of each instance:
  - **Deployed Configuration**: The currently active configuration on the instance, as it was last explicitly deployed.
  - **Template-Derived Configuration**: The configuration the instance *would* have based on the current state of its template (including resolved inheritance, composition, and overrides).
- Deployment is performed at the **individual instance level** — an engineer explicitly commands the system to update a specific instance.
- The system must be able to **show differences** between the deployed configuration and the current template-derived configuration, allowing engineers to see what would change before deploying.
- **No rollback** support is required. The system only needs to track the current deployed state, not a history of prior deployments.
- **Concurrent editing**: Template editing uses a **last-write-wins** model. No pessimistic locking or optimistic concurrency conflict detection is required.

### 3.10 Areas

- Areas are **predefined hierarchical groupings** associated with a site, stored in the configuration database.
- Areas support **parent-child relationships** (e.g., Plant → Building → Production Line → Cell).
- Each instance is assigned to an area within its site.
- Areas are used for **filtering and finding instances** in the central UI.
- Area definitions are managed by users with the **Design** role.
### 3.11 Pre-Deployment Validation

Before any deployment is sent to a site, the central cluster performs **comprehensive validation**. Validation covers:

- **Flattening**: The full template hierarchy is resolved and flattened successfully.
- **Naming collision detection**: No duplicate attribute, alarm, or script names exist in the flattened configuration.
- **Script compilation**: All instance scripts and alarm on-trigger scripts are test-compiled and must compile without errors.
- **Alarm trigger references**: Alarm trigger definitions reference attributes that exist in the flattened configuration.
- **Script trigger references**: Script triggers (value change, conditional) reference attributes that exist in the flattened configuration.
- **Data connection binding completeness**: Every attribute with a data source reference has a data connection binding assigned on the instance, and the bound data connection name exists as a defined connection at the instance's site.
- **Exception**: Validation does **not** verify that data source relative paths resolve to real tags on physical devices — that is a runtime concern that can only be determined at the site.

Validation is also available **on demand in the Central UI** for Design users during template authoring, providing early feedback without requiring a deployment attempt.

For **shared scripts**, pre-compilation validation is performed before deployment to sites. Since shared scripts have no instance context, validation is limited to C# syntax and structural correctness.
## 4. Scripting

### 4.1 Script Definitions

- Scripts are **C#** and are defined at the **template level** as first-class template members.
- Scripts follow the same **inheritance, override, and lock rules** as attributes. A parent template can define a script, a child template can override it (if not locked), and any level can lock a script to prevent downstream changes.
- Scripts are deployed to sites as part of the flattened instance configuration.
- Scripts are **compiled at the site** when a deployment is received. Pre-compilation validation occurs at central before deployment (see Section 3.11), but the site performs the actual compilation for execution.
- Scripts can optionally define **input parameters** (name and data type per parameter). Scripts without parameter definitions accept no arguments.
- Scripts can optionally define a **return value definition** (field names and data types). Return values support **single objects** and **lists of objects**. Scripts without a return definition return void.
- Return values are used when scripts are called explicitly by other scripts (via `Instance.CallScript()` or `Scripts.CallShared()`) or by the Inbound API (via `Route.To().Call()`). When invoked by a trigger (interval, value change, conditional, alarm), any return value is discarded.
### 4.2 Script Triggers

Scripts can be triggered by:

- **Interval**: Execute on a recurring time schedule.
- **Value Change**: Execute when a specific instance attribute value changes.
- **Conditional**: Execute when an instance attribute value equals or does not equal a given value.

Scripts have an optional **minimum time between runs** setting. If a trigger fires before the minimum interval has elapsed since the last execution, the invocation is skipped.

### 4.3 Script Error Handling

- If a script fails (unhandled exception, timeout, etc.), the failure is **logged locally** at the site.
- The script is **not disabled** — it remains active and will fire on the next qualifying trigger event.
- Script failures are **not reported to central**. Diagnostics are local only.
- For external system call failures within scripts, store-and-forward handling (Section 5.3) applies independently of script error handling.

### 4.4 Script Capabilities

Scripts executing on a site for a given instance can:

- **Read** attribute values on that instance (live data points and static config).
- **Write** attribute values on that instance. For attributes with a data source reference, the write goes to the Data Connection Layer which writes to the physical device; the in-memory value updates when the device confirms the new value via the existing subscription. For static attributes, the write updates the in-memory value and **persists the override to local SQLite** — the value survives restart and failover. Persisted overrides are reset when the instance is redeployed.
- **Call other scripts** on that instance via `Instance.CallScript("scriptName", params)`. Calls use the Akka ask pattern and return the called script's return value. Script-to-script calls support concurrent execution.
- **Call shared scripts** via `Scripts.CallShared("scriptName", params)`. Shared scripts execute **inline** in the calling Script Actor's context — they are compiled code libraries, not separate actors.
- **Call external system API methods** in two modes: `ExternalSystem.Call()` for synchronous request/response, or `ExternalSystem.CachedCall()` for fire-and-forget with store-and-forward on transient failure (see Section 5).
- **Send notifications** (see Section 6).
- **Access databases** by requesting an MS SQL client connection by name (see Section 5.5).

Scripts **cannot** access other instances' attributes or scripts.

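A minimal sketch of how these capabilities might look together inside an instance script. The `Instance.CallScript`, `ExternalSystem.CachedCall`, and `Notify` calls are taken from this section and Section 6.3; the attribute and script names, and the `GetAttribute`/`SetAttribute` accessors, are hypothetical illustrations, not a confirmed API surface.

```csharp
// Illustrative instance script (Section 4.4). Attribute/script names and the
// GetAttribute/SetAttribute accessors are assumptions for this sketch.
var speed = Instance.GetAttribute("MotorSpeed");      // read a live data point
Instance.SetAttribute("TargetSpeed", 1200);           // write: routed to device, or persisted to SQLite if static

// Ask a sibling script for a result (Akka ask pattern; subject to the
// recursion depth limit in Section 4.4.1).
var result = Instance.CallScript("RecalcSetpoints", new { Speed = speed });

// Fire-and-forget external call with store-and-forward on transient failure.
ExternalSystem.CachedCall("MES", "ReportProduction", new { Count = 42 });

// Notify a predefined list (Section 6.3).
Notify.To("Maintenance").Send("Speed changed", "TargetSpeed set to 1200");
```
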
### 4.4.1 Script Call Recursion Limit

- Script-to-script calls (via `Instance.CallScript` and `Scripts.CallShared`) enforce a **maximum recursion depth** to prevent infinite loops.
- The default maximum depth is a reasonable limit (e.g., 10 levels).
- The current call depth is tracked and incremented with each nested call. If the limit is reached, the call fails with an error logged to the site event log.
- This applies to all script call chains, including alarm on-trigger scripts calling instance scripts.

### 4.5 Shared Scripts

- Shared scripts are **not associated with any template** — they are a **system-wide library** of reusable C# scripts.
- Shared scripts can optionally define **input parameters** and **return value definitions**, following the same rules as template-level scripts.
- Managed by users with the **Design** role.
- Deployed to **all sites** for use by any instance script (deployment requires explicit action by a user with the Deployment role).
- Shared scripts execute **inline** in the calling Script Actor's context as compiled code. They are not separate actors. This avoids serialization bottlenecks and messaging overhead.
- Shared scripts are **not available on the central cluster** — Inbound API scripts cannot call them directly. To execute shared script logic, route to a site instance via `Route.To().Call()`.

### 4.6 Alarm On-Trigger Scripts

- Alarm on-trigger scripts are defined as part of the alarm definition and execute when the alarm activates.
- They execute directly in the Alarm Actor's context (via a short-lived Alarm Execution Actor), similar to how shared scripts execute inline.
- Alarm on-trigger scripts **can** call instance scripts via `Instance.CallScript()`, which sends an ask message to the appropriate sibling Script Actor.
- Instance scripts **cannot** call alarm on-trigger scripts — the call direction is one-way.
- The recursion depth limit applies to alarm-to-instance script call chains.

## 5. External System Integrations

### 5.1 External System Definitions

- External systems are **predefined contracts** created by users with the **Design** role.
- Each definition includes:
  - **Connection details**: Endpoint URL, authentication, protocol information.
  - **Method definitions**: Available API methods with defined parameters and return types.
- Definitions are deployed **uniformly to all sites** — no per-site connection detail overrides.
- Deployment of definition changes requires **explicit action** by a user with the Deployment role.
- At the site, external system definitions are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 5.2 Site-to-External-System Communication

- Sites communicate with external systems **directly** (not routed through central).
- Scripts invoke external system methods by referencing the predefined definitions.

### 5.3 Store-and-Forward for External Calls

- If an external system is unavailable when a script invokes a method, the message is **buffered locally at the site**.
- Retry is performed **per message** — individual failed messages retry independently.
- Each external system definition includes configurable retry settings:
  - **Max retry count**: Maximum number of retry attempts before giving up.
  - **Time between retries**: Fixed interval between retry attempts (no exponential backoff).
- After max retries are exhausted, the message is **parked** (dead-lettered) for manual review.
- There is **no maximum buffer size** — messages accumulate until delivery succeeds or retries are exhausted.

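The per-message retry policy above can be sketched as a small loop. This is illustrative only: `deliver` and `park` are hypothetical placeholders for the real delivery handler and dead-letter step, and the settings come from the external system definition.

```csharp
// Fixed-interval, per-message retry with parking (Section 5.3 semantics).
// deliver() and park() are hypothetical placeholders, not the actual API.
async Task DeliverWithRetryAsync(
    Func<Task> deliver, Action park, int maxRetries, TimeSpan retryInterval)
{
    for (var attempt = 0; attempt <= maxRetries; attempt++)
    {
        try
        {
            await deliver();                 // attempt delivery
            return;                          // success: message leaves the buffer
        }
        catch (Exception) when (attempt < maxRetries)
        {
            await Task.Delay(retryInterval); // fixed interval, no exponential backoff
        }
        catch (Exception)
        {
            park();                          // retries exhausted: park for manual review
            return;
        }
    }
}
```
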
### 5.4 Parked Message Management

- Parked messages are **stored at the site** where they originated.
- The **central UI** can **query sites** for parked messages and manage them remotely.
- Operators can **retry** or **discard** parked messages from the central UI.
- Parked message management covers **external system calls**, **notifications**, and **cached database writes**.

### 5.5 Database Connections

- Database connections are **predefined, named resources** created by users with the **Design** role.
- Each definition includes the connection details needed to connect to an MS SQL database (server, database name, credentials, etc.).
- Each definition includes configurable retry settings (same pattern as external systems): **max retry count** and **time between retries** (fixed interval).
- Definitions are deployed **uniformly to all sites** — no per-site overrides.
- Deployment of definition changes requires **explicit action** by a user with the Deployment role.
- At the site, database connection definitions are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 5.6 Database Access Modes

Scripts can interact with databases in two modes:

- **Real-time (synchronous)**: Scripts request a **raw MS SQL client connection by name** (e.g., `Database.Connection("MES_DB")`), giving script authors full ADO.NET-level control for immediate queries and updates.
- **Cached write (store-and-forward)**: Scripts submit a write operation for deferred, reliable delivery. The cached entry stores the **database connection name**, the **SQL statement to execute**, and **parameter values**. If the database is unavailable, the write is buffered locally at the site and retried per the connection's retry settings. After max retries are exhausted, the write is **parked** for manual review (managed via central UI alongside other parked messages).

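Both modes can be sketched as follows. `Database.Connection("MES_DB")` is the named-connection accessor from this section; the table, column, and connection names are illustrative, and the `CachedWrite` method name is an assumption (the requirement only fixes what the cached entry must carry).

```csharp
// Real-time mode: raw ADO.NET over a named connection (Section 5.6).
// "MES_DB" and the table/column names are illustrative only.
using (var conn = Database.Connection("MES_DB"))   // assumed to return an open SqlConnection
using (var cmd = conn.CreateCommand())
{
    cmd.CommandText = "SELECT TOP 1 Setpoint FROM Recipes WHERE Line = @line";
    cmd.Parameters.AddWithValue("@line", "Line1");
    var setpoint = cmd.ExecuteScalar();
}

// Cached-write mode: deferred, store-and-forward delivery. The CachedWrite
// name is hypothetical; the entry carries connection name, SQL, and parameters.
Database.CachedWrite(
    "MES_DB",
    "INSERT INTO ProductionLog (Line, Count) VALUES (@line, @count)",
    new { line = "Line1", count = 42 });
```
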
## 6. Notifications

### 6.1 Notification Lists

- Notification lists are **system-wide**, managed by users with the **Design** role.
- Each list has a **name** and contains one or more **recipients**.
- Each recipient has a **name** and an **email address**.
- Notification lists are deployed to **all sites** (deployment requires explicit action by a user with the Deployment role).
- At the site, notification lists and recipients are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 6.2 Email Support

- The system has **predefined support for sending email** as the notification delivery mechanism.
- Email server configuration (SMTP settings) is defined centrally and deployed to all sites as part of **artifact deployment** (see Section 1.5). Sites read SMTP configuration from **local SQLite**.

### 6.3 Script API

- Scripts send notifications using a simplified API: `Notify.To("list name").Send("subject", "message")`
- This API is available to instance scripts, alarm on-trigger scripts, and shared scripts.

### 6.4 Store-and-Forward for Notifications

- If the email server is unavailable, notifications are **buffered locally at the site**.
- Follows the same retry pattern as external system calls: configurable **max retry count** and **time between retries** (fixed interval).
- After max retries are exhausted, the notification is **parked** for manual review (managed via central UI alongside external system parked messages).
- There is **no maximum buffer size** for notification messages.

## 7. Inbound API (Central)

### 7.1 Purpose

The system exposes a **web API on the central cluster** for external systems to call into the SCADA system. This is the counterpart to the outbound External System Integrations (Section 5) — where Section 5 defines how the system calls out, this section defines how external systems call in.

### 7.2 API Key Management

- API keys are stored in the **configuration database**.
- Each API key has a **name/label** (for identification), the **key value**, and an **enabled/disabled** flag.
- API keys are managed by users with the **Admin** role.

### 7.3 Authentication

- Inbound API requests are authenticated via **API key** (not LDAP/AD).
- The API key must be included with each request.
- Invalid or disabled keys are rejected.

### 7.4 API Method Definitions

- API methods are **predefined** and managed by users with the **Design** role.
- Each method definition includes:
  - **Method name**: Unique identifier for the endpoint.
  - **Approved API keys**: List of API keys authorized to call this method.
  - **Parameter definitions**: Name and data type for each input parameter.
  - **Return value definition**: Data type and structure of the response. Supports **single objects** and **lists of objects**.
  - **Timeout**: Configurable per method. Maximum execution time including routed calls to sites.
- The implementation of each method is a **C# script stored inline** in the method definition. It executes on the central cluster. No template inheritance — API scripts are standalone.
- API scripts can route calls to any instance at any site via `Route.To("instanceCode").Call("scriptName", parameters)`, read/write attributes in batch, and access databases directly.
- API scripts **cannot** call shared scripts directly (shared scripts are site-only). To invoke site logic, use `Route.To().Call()`.

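An inbound API method script might look like the following sketch. `Route.To(...).Call(...)` is the routing API from this section; the `Parameters` accessor, the instance code, and the script name are assumptions for illustration.

```csharp
// Illustrative inbound API method script (Section 7.4). The Parameters
// accessor and the names "instanceCode"/"GetStatus" are assumptions.
var instanceCode = (string)Parameters["instanceCode"];

// Route to the owning instance at its site and call one of its scripts.
// The call counts against the method's configured timeout.
var status = Route.To(instanceCode).Call("GetStatus", new { Detail = true });

// The returned object (or list) is shaped per the method's return value
// definition and serialized back to the API caller.
return status;
```
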
### 7.5 Availability

- The inbound API is hosted **only on the central cluster** (active node).
- On central failover, the API becomes available on the new active node.

## 8. Central UI

The central cluster hosts a **configuration and management UI** (no live machine data visualization, except on-demand debug views). The UI supports the following workflows:

- **Template Authoring**: Create, edit, and manage templates including hierarchy (inheritance) and composition (feature modules). Author and manage scripts within templates. **Design-time validation** available on demand to check flattening, naming collisions, and script compilation without deploying.
- **Shared Script Management**: Create, edit, and manage the system-wide shared script library.
- **Notification List Management**: Create, edit, and manage notification lists and recipients.
- **External System Management**: Define external system contracts (connection details, API method definitions).
- **Database Connection Management**: Define named database connections for script use.
- **Inbound API Management**: Manage API keys (create, enable/disable, delete). Define API methods (name, parameters, return values, approved keys, implementation script). *(Admin role for keys, Design role for methods.)*
- **Instance Management**: Create instances from templates, bind data connections (per-attribute, with **bulk assignment** UI for selecting multiple attributes and assigning a data connection at once), set instance-level attribute overrides, assign instances to areas. **Disable** or **delete** instances.
- **Site & Data Connection Management**: Define sites (including optional NodeAAddress and NodeBAddress fields for Akka remoting paths), manage data connections and assign them to sites.
- **Area Management**: Define hierarchical area structures per site for organizing instances.
- **Deployment**: View diffs between deployed and current template-derived configurations, deploy updates to individual instances. Filter instances by area. Pre-deployment validation runs automatically before any deployment is sent.
- **System-Wide Artifact Deployment**: Explicitly deploy shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration to all sites or to an individual site (requires Deployment role). Per-site deployment is available via the Sites admin page.
- **Deployment Status Monitoring**: Track whether deployments were successfully applied at site level.
- **Debug View**: On-demand real-time view of a specific instance's tag values and alarm states for troubleshooting (see 8.1).
- **Parked Message Management**: Query sites for parked messages (external system calls, notifications, and cached database writes), retry or discard them.
- **Health Monitoring Dashboard**: View site cluster health, node status, data connection health, script error rates, alarm evaluation errors, and store-and-forward buffer depths (see Section 11).
- **Site Event Log Viewer**: Query and view operational event logs from site clusters (see Section 12).

### 8.1 Debug View

- **Subscribe-on-demand**: When an engineer opens a debug view for an instance, central subscribes to the **site-wide Akka stream** filtered by instance unique name. The site first provides a **snapshot** of all current attribute values and alarm states from the Instance Actor, then streams subsequent changes from the Akka stream.
- Attribute value stream messages are structured as: `[InstanceUniqueName].[AttributePath].[AttributeName]`, attribute value, attribute quality, attribute change timestamp.
- Alarm state stream messages are structured as: `[InstanceUniqueName].[AlarmName]`, alarm state (active/normal), priority, timestamp.
- The stream continues until the engineer **closes the debug view**, at which point central unsubscribes and the site stops streaming.
- No attribute/alarm selection — the debug view always shows all tag values and alarm states for the instance.
- No special concurrency limits are required.

## 9. Security & Access Control

### 9.1 Authentication

- **UI users** authenticate via **username/password** validated directly against **LDAP/Active Directory**. Sessions are maintained via JWT tokens.
- **External system API callers** authenticate via **API key** (see Section 7).

### 9.2 Authorization

- Authorization is **role-based**, with roles assigned by **LDAP group membership**.
- Roles are **independent** — they can be mixed and matched per user (via group membership). There is no implied hierarchy between roles.
- A user may hold multiple roles simultaneously (e.g., both Design and Deployment) by being a member of the corresponding LDAP groups.
- Inbound API authorization is per-method, based on **approved API key lists** (see Section 7.4).

### 9.3 Roles

- **Admin**: System-wide permission to manage sites, data connections, LDAP group-to-role mappings, API keys, and system-level configuration.
- **Design**: System-wide permission to author and edit templates, scripts, shared scripts, external system definitions, notification lists, inbound API method definitions, and area definitions.
- **Deployment**: Permission to manage instances (create, set overrides, bind connections, disable, delete) and deploy configurations to sites. Also triggers system-wide artifact deployment. Can be scoped **per site**.

### 9.4 Role Scoping

- Admin is always **system-wide**.
- Design is always **system-wide**.
- Deployment can be **system-wide** or **site-scoped**, controlled by LDAP group membership (e.g., `Deploy-SiteA`, `Deploy-SiteB`, or `Deploy-All`).

## 10. Audit Logging

Audit logging is implemented as part of the **Configuration Database** component via the `IAuditService` interface.

### 10.1 Storage

- Audit logs are stored in the **configuration MS SQL database** alongside system config data, enabling direct querying.
- Entries are **append-only** — never modified or deleted. No retention policy — entries are retained indefinitely.

### 10.2 Scope

All system-modifying actions are logged, including:

- **Template changes**: Create, edit, delete templates.
- **Script changes**: Template script and shared script create, edit, delete.
- **Alarm changes**: Create, edit, delete alarm definitions.
- **Instance changes**: Create, override values, bind connections, area assignment, disable, enable, delete.
- **Deployments**: Who deployed what to which instance, and the result (success/failure).
- **System-wide artifact deployments**: Who deployed shared scripts / external system definitions / DB connections / data connections / notification lists / SMTP config, to which site(s), and the result.
- **External system definition changes**: Create, edit, delete.
- **Database connection changes**: Create, edit, delete.
- **Notification list changes**: Create, edit, delete lists and recipients.
- **Inbound API changes**: API key create, enable/disable, delete. API method create, edit, delete.
- **Area changes**: Create, edit, delete area definitions.
- **Site & data connection changes**: Create, edit, delete.
- **Security/admin changes**: Role mapping changes, site permission changes.

### 10.3 Detail Level

- Each audit log entry records the **state of the entity after the change**, serialized as JSON. Only the after-state is stored — change history is reconstructed by comparing consecutive entries for the same entity at query time.
- Each entry includes: **who** (authenticated user), **what** (action, entity type, entity ID, entity name), **when** (timestamp), and **state** (JSON after-state, null for deletes).
- **One entry per save operation** — when a user edits a template and changes multiple attributes in one save, a single entry captures the full entity state.

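A hypothetical entry illustrating the who/what/when/state shape described above. The field names, the entity, and the values are illustrative only, not the actual schema.

```json
{
  "user": "DOMAIN\\jsmith",
  "action": "Update",
  "entityType": "Template",
  "entityId": 42,
  "entityName": "PumpBase",
  "timestampUtc": "2024-01-15T09:30:00Z",
  "state": {
    "name": "PumpBase",
    "attributes": [ { "name": "Speed", "locked": false } ]
  }
}
```
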
### 10.4 Transactional Guarantee

- Audit entries are written **synchronously** within the same database transaction as the change (via the unit-of-work pattern). If the change succeeds, the audit entry is guaranteed to be recorded. If the change rolls back, the audit entry rolls back too.

## 11. Health Monitoring

### 11.1 Monitored Metrics

The central cluster monitors the health of each site cluster, including:

- **Site cluster online/offline status**: Whether the site is reachable.
- **Active vs. standby node status**: Which node is active and which is standby.
- **Data connection health**: Connected/disconnected status per data connection at the site.
- **Script error rates**: Frequency of script failures at the site.
- **Alarm evaluation errors**: Frequency of alarm evaluation failures at the site.
- **Store-and-forward buffer depth**: Number of messages currently queued (broken down by external system calls, notifications, and cached database writes).

### 11.2 Reporting

- Site clusters **report health metrics to central** periodically.
- Health status is **visible in the central UI** — no automated alerting/notifications for now.

## 12. Site-Level Event Logging

### 12.1 Events Logged

Sites log operational events locally, including:

- **Script executions**: Start, complete, error (with error details).
- **Alarm events**: Alarm activated, alarm cleared (which alarm, which instance, when). Alarm evaluation errors.
- **Deployment applications**: Configuration received from central, applied successfully or failed. Script compilation results.
- **Data connection status changes**: Connected, disconnected, reconnected per connection.
- **Store-and-forward activity**: Message queued, delivered, retried, parked.
- **Instance lifecycle**: Instance enabled, disabled, deleted.

### 12.2 Storage

- Event logs are stored in **local SQLite** on each site node.
- **Retention policy**: 30 days. Events older than 30 days are automatically purged.

### 12.3 Central Access

- The central UI can **query site event logs remotely**, following the same pattern as parked message management — central requests data from the site over Akka.NET remoting.

## 13. Management Service & CLI

### 13.1 Management Service

- The central cluster exposes a **ManagementActor** that provides programmatic access to all administrative operations — the same operations available through the Central UI.
- The ManagementActor registers with Akka.NET **ClusterClientReceptionist** for cross-cluster access, and is also exposed via an HTTP Management API endpoint (`POST /management`) with Basic Auth, LDAP authentication, and role resolution — enabling external tools like the CLI to interact without Akka.NET dependencies.
- The ManagementActor enforces the **same role-based authorization** as the Central UI. Every incoming message carries the authenticated user's identity and roles.
- All mutating operations performed through the Management Service are **audit logged** via IAuditService, identical to operations performed through the Central UI.
- The ManagementActor runs on **every central node** (stateless). For HTTP API access, any central node can handle any request without sticky sessions.

### 13.2 CLI

- The system provides a standalone **command-line tool** (`scadalink`) for scripting and automating administrative operations.
- The CLI connects to the Central Host's HTTP Management API (`POST /management`), sending commands as JSON with **HTTP Basic Auth** credentials. The server authenticates against **LDAP/AD**, resolves roles, and dispatches commands to the ManagementActor.
- CLI commands mirror all Management Service operations: templates, instances, sites, data connections, deployments, external systems, notifications, security (API keys and role mappings), audit log queries, and health status.
- Output is **JSON by default** (machine-readable, suitable for scripting) with an optional `--format table` flag for human-readable tabular output.
- Configuration is resolved from command-line options, **environment variables** (`SCADALINK_MANAGEMENT_URL`, `SCADALINK_FORMAT`), or a **configuration file** (`~/.scadalink/config.json`).
- The CLI is a separate executable from the Host binary — it is deployed on any machine with HTTP access to a central node.

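As an illustration of the configuration and output conventions above. The tool name, flags, and environment variable come from this section; the subcommand names and the host are hypothetical, since the requirements do not fix the command grammar.

```
# Point the CLI at a central node (any node works; no sticky sessions).
export SCADALINK_MANAGEMENT_URL=https://central-a.example.com

# Hypothetical subcommands; output is JSON by default.
scadalink instance list --site SiteA
scadalink instance list --site SiteA --format table   # human-readable table
```
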
## 14. General Conventions

### 14.1 Timestamps

- All timestamps throughout the system are stored, transmitted, and processed in **UTC**.
- This applies to: attribute value timestamps, alarm state change timestamps, audit log entries, event log entries, deployment records, health reports, store-and-forward message timestamps, and all inter-node messages.
- Local time conversion for display is a **Central UI concern only** — no other component performs timezone conversion.

---

*All initial high-level requirements have been captured. This document will continue to be updated as the design evolves.*

360
docs/requirements/lmxproxy_protocol.md
Normal file
@@ -0,0 +1,360 @@
# LmxProxy Protocol Specification

The LmxProxy protocol is a gRPC-based SCADA read/write interface for bridging ScadaLink's Data Connection Layer to devices via an intermediary proxy server (LmxProxy). The proxy translates LmxProxy protocol operations into backend device calls (e.g., OPC UA). All communication uses HTTP/2 gRPC with Protocol Buffers.

## Service Definition

```protobuf
syntax = "proto3";
package scada;

service ScadaService {
  rpc Connect(ConnectRequest) returns (ConnectResponse);
  rpc Disconnect(DisconnectRequest) returns (DisconnectResponse);
  rpc GetConnectionState(GetConnectionStateRequest) returns (GetConnectionStateResponse);
  rpc Read(ReadRequest) returns (ReadResponse);
  rpc ReadBatch(ReadBatchRequest) returns (ReadBatchResponse);
  rpc Write(WriteRequest) returns (WriteResponse);
  rpc WriteBatch(WriteBatchRequest) returns (WriteBatchResponse);
  rpc WriteBatchAndWait(WriteBatchAndWaitRequest) returns (WriteBatchAndWaitResponse);
  rpc Subscribe(SubscribeRequest) returns (stream VtqMessage);
  rpc CheckApiKey(CheckApiKeyRequest) returns (CheckApiKeyResponse);
}
```

Proto file location: `src/ScadaLink.DataConnectionLayer/Adapters/Protos/scada.proto`

## Connection Lifecycle

### Session Model

Every client must call `Connect` before performing any read, write, or subscribe operation. The server returns a session ID (32-character hex GUID) that must be included in all subsequent requests. Sessions persist until `Disconnect` is called or the server restarts — there is no idle timeout.

### Authentication

API key authentication is optional, controlled by server configuration:

- **If required**: The `Connect` RPC fails with `success=false` if the API key doesn't match.
- **If not required**: All API keys are accepted (including empty).
- The API key is sent both in the `ConnectRequest.api_key` field and as an `x-api-key` gRPC metadata header on the `Connect` call.

### Connect

```
ConnectRequest {
  client_id: string    // Client identifier (e.g., "ScadaLink-{guid}")
  api_key: string      // API key for authentication (empty if none)
}

ConnectResponse {
  success: bool        // Whether connection succeeded
  message: string      // Status message
  session_id: string   // 32-char hex GUID (only valid if success=true)
}
```

The client generates `client_id` as `"ScadaLink-{Guid:N}"` for uniqueness.

### Disconnect

```
DisconnectRequest {
  session_id: string
}

DisconnectResponse {
  success: bool
  message: string
}
```

Best-effort — the client calls disconnect but does not retry on failure.

### GetConnectionState

```
GetConnectionStateRequest {
  session_id: string
}

GetConnectionStateResponse {
  is_connected: bool
  client_id: string
  connected_since_utc_ticks: int64   // DateTime.UtcNow.Ticks at connect time
}
```

### CheckApiKey

```
CheckApiKeyRequest {
  api_key: string
}

CheckApiKeyResponse {
  is_valid: bool
  message: string
}
```

Standalone API key validation without creating a session.

## Value-Timestamp-Quality (VTQ)
|
||||
|
||||
The core data structure for all read and subscription results:
|
||||
|
||||
```
|
||||
VtqMessage {
|
||||
tag: string // Tag address
|
||||
value: string // Value encoded as string (see Value Encoding)
|
||||
timestamp_utc_ticks: int64 // UTC DateTime.Ticks (100ns intervals since 0001-01-01)
|
||||
quality: string // "Good", "Uncertain", or "Bad"
|
||||
}
|
||||
```
|
||||
|
||||
### Value Encoding

All values are transmitted as strings on the wire. Both client and server apply the same parsing order; the empty string is checked first so it is not swallowed by the catch-all string case:

| Wire String | Parsed Type | Example |
|-------------|-------------|---------|
| Empty string | `null` | `""` → `null` |
| Numeric (double-parseable) | `double` | `"42.5"` → `42.5` |
| `"true"` / `"false"` (case-insensitive) | `bool` | `"True"` → `true` |
| Everything else | `string` | `"Running"` → `"Running"` |

For write operations, values are converted to strings via `.ToString()` before transmission.

Arrays and lists are JSON-serialized (e.g., `[1,2,3]`).
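The encoding rules above can be sketched as a pair of helper functions. This is an illustrative Python sketch of the wire semantics, not the actual C# implementation:

```python
import json

def decode_value(wire: str):
    """Decode a wire string per the documented parsing order:
    empty -> null, numeric -> double, true/false -> bool, else -> string."""
    if wire == "":
        return None
    try:
        return float(wire)          # "42.5" -> 42.5
    except ValueError:
        pass
    if wire.lower() in ("true", "false"):
        return wire.lower() == "true"
    return wire                     # fall through: plain string

def encode_value(value) -> str:
    """Encode a value for a write: null -> empty string, arrays/lists
    are JSON-serialized, everything else uses its plain string form."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return json.dumps(list(value), separators=(",", ":"))
    return str(value)
```

For example, `decode_value("True")` yields a boolean while `decode_value("Running")` stays a string, matching the table above.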
### Quality Codes

Quality is transmitted as a case-insensitive string:

| Wire Value | Meaning | OPC UA Status Code |
|------------|---------|--------------------|
| `"Good"` | Value is reliable | `0x00000000` (StatusCode == 0) |
| `"Uncertain"` | Value may not be current | Non-zero, high bit clear |
| `"Bad"` | Value is unreliable or unavailable | High bit set (`0x80000000`) |

A null or missing VTQ message is treated as Bad quality with a null value and the current UTC timestamp.
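The status-code mapping in the table reduces to a small branch. A sketch, following the table exactly (zero, high bit, otherwise):

```python
def quality_from_status(status_code: int) -> str:
    """Map an OPC UA status code to the wire quality string:
    0 -> Good, high bit set -> Bad, any other non-zero -> Uncertain."""
    if status_code == 0:
        return "Good"
    if status_code & 0x80000000:
        return "Bad"
    return "Uncertain"
```
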
### Timestamps

- All timestamps are UTC.
- Encoded as `int64` representing `DateTime.Ticks` (100-nanosecond intervals since 0001-01-01 00:00:00 UTC).
- Client reconstructs via `new DateTime(ticks, DateTimeKind.Utc)`.
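For non-.NET consumers, the ticks encoding converts to a standard datetime with integer arithmetic (integer division avoids float precision loss at this magnitude). A hedged Python sketch:

```python
from datetime import datetime, timedelta, timezone

# .NET DateTime.Ticks epoch: 0001-01-01 00:00:00 UTC, in 100 ns units.
DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)

def ticks_to_datetime(ticks: int) -> datetime:
    """Convert int64 ticks (100 ns since 0001-01-01 UTC) to a datetime.
    Sub-microsecond precision (the last tick digit) is truncated."""
    return DOTNET_EPOCH + timedelta(microseconds=ticks // 10)

def datetime_to_ticks(dt: datetime) -> int:
    """Convert an aware UTC datetime back to .NET ticks, using the
    timedelta's integer fields to avoid floating-point rounding."""
    delta = dt - DOTNET_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 \
        + delta.microseconds * 10
```
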
## Read Operations

### Read (Single Tag)

```
ReadRequest {
  session_id: string   // Valid session ID
  tag: string          // Tag address
}

ReadResponse {
  success: bool        // Whether read succeeded
  message: string      // Error message if failed
  vtq: VtqMessage      // Value-timestamp-quality result
}
```
### ReadBatch (Multiple Tags)

```
ReadBatchRequest {
  session_id: string
  tags: repeated string       // Tag addresses
}

ReadBatchResponse {
  success: bool               // false if any tag failed
  message: string             // Error message
  vtqs: repeated VtqMessage   // Results in same order as request
}
```

Batch reads are **partially successful** — individual tags may have Bad quality while the overall response succeeds. If a tag read throws an exception, its VTQ is returned with Bad quality and current UTC timestamp.
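The per-tag isolation described above can be sketched as follows. This is an illustrative Python model of the server-side contract, with `read_one` standing in for the backend read and VTQs modeled as plain dicts:

```python
from datetime import datetime, timezone

def read_batch(read_one, tags):
    """Read each tag independently; a tag whose read throws yields a
    Bad-quality VTQ with a null value and the current UTC timestamp.
    Results preserve the request order."""
    vtqs = []
    for tag in tags:
        try:
            vtqs.append(read_one(tag))
        except Exception:
            vtqs.append({
                "tag": tag,
                "value": "",   # null value on the wire
                "timestamp": datetime.now(timezone.utc),
                "quality": "Bad",
            })
    return vtqs
```
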
## Write Operations

### Write (Single Tag)

```
WriteRequest {
  session_id: string
  tag: string
  value: string        // Value as string (parsed server-side)
}

WriteResponse {
  success: bool
  message: string
}
```
### WriteBatch (Multiple Tags)

```
WriteItem {
  tag: string
  value: string
}

WriteResult {
  tag: string
  success: bool
  message: string
}

WriteBatchRequest {
  session_id: string
  items: repeated WriteItem
}

WriteBatchResponse {
  success: bool                   // Overall success (all items must succeed)
  message: string
  results: repeated WriteResult   // Per-item results
}
```

Batch writes are **all-or-nothing** at the reporting level — if any item fails, overall `success` is `false`.
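The reporting rule can be sketched as a simple aggregation. Results are modeled as `(tag, success, message)` tuples and the combined error message format is an assumption, not specified by the protocol:

```python
def aggregate_write_batch(results):
    """Compute the overall WriteBatchResponse fields from per-item
    results: success is true only when every item succeeded."""
    success = all(ok for _, ok, _ in results)
    message = "" if success else "; ".join(
        f"{tag}: {msg}" for tag, ok, msg in results if not ok)
    return {"success": success, "message": message, "results": results}
```
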
### WriteBatchAndWait (Atomic Write + Flag Polling)

A compound operation: write values, then poll a flag tag until it matches an expected value or times out.

```
WriteBatchAndWaitRequest {
  session_id: string
  items: repeated WriteItem             // Values to write
  flag_tag: string                      // Tag to poll after writes
  flag_value: string                    // Expected value (string comparison)
  timeout_ms: int32                     // Timeout in ms (default 5000 if ≤ 0)
  poll_interval_ms: int32               // Poll interval in ms (default 100 if ≤ 0)
}

WriteBatchAndWaitResponse {
  success: bool                         // Overall operation success
  message: string
  write_results: repeated WriteResult   // Per-item write results
  flag_reached: bool                    // Whether flag matched before timeout
  elapsed_ms: int32                     // Total elapsed time
}
```

**Behavior:**

1. All writes execute first. If any write fails, the operation returns immediately with `success=false`.
2. If writes succeed, polls `flag_tag` at `poll_interval_ms` intervals.
3. Compares `readResult.Value?.ToString() == flag_value` (case-sensitive string comparison).
4. If flag matches before timeout: `success=true`, `flag_reached=true`.
5. If timeout expires: `success=true`, `flag_reached=false` (timeout is not an error).
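The steps above can be sketched as a polling loop. This is an illustrative Python model, not the server code: `write_all` performs the batch write and returns `(tag, success, message)` tuples, and `read_flag` reads the flag tag's current value. Note the documented asymmetry: a write failure fails the call, but a polling timeout still returns `success=True`:

```python
import time

def write_batch_and_wait(write_all, read_flag, flag_value,
                         timeout_ms=5000, poll_interval_ms=100):
    """Write, then poll the flag tag until it matches or times out."""
    if timeout_ms <= 0:
        timeout_ms = 5000        # documented default
    if poll_interval_ms <= 0:
        poll_interval_ms = 100   # documented default

    start = time.monotonic()
    write_results = write_all()
    if not all(ok for _, ok, _ in write_results):
        # Any write failure ends the operation immediately.
        return {"success": False, "flag_reached": False,
                "write_results": write_results,
                "elapsed_ms": int((time.monotonic() - start) * 1000)}

    deadline = start + timeout_ms / 1000
    flag_reached = False
    while time.monotonic() < deadline:
        value = read_flag()
        # Case-sensitive string comparison, mirroring
        # readResult.Value?.ToString() == flag_value.
        if value is not None and str(value) == flag_value:
            flag_reached = True
            break
        time.sleep(poll_interval_ms / 1000)

    # Timeout is not an error: success stays True.
    return {"success": True, "flag_reached": flag_reached,
            "write_results": write_results,
            "elapsed_ms": int((time.monotonic() - start) * 1000)}
```
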
## Subscription (Server Streaming)

### Subscribe

```
SubscribeRequest {
  session_id: string
  tags: repeated string   // Tag addresses to monitor
  sampling_ms: int32      // Backend sampling interval in milliseconds
}

// Returns: stream of VtqMessage
```
**Behavior:**

1. Server validates the session. Invalid session → `RpcException` with `StatusCode.Unauthenticated`.
2. Server registers monitored items on the backend (e.g., OPC UA subscriptions) for all requested tags.
3. On each value change, the server pushes a `VtqMessage` to the response stream.
4. The stream remains open indefinitely until:
   - the client cancels (disposes the subscription),
   - the server encounters an error (backend disconnect, etc.), or
   - the gRPC connection drops.
5. On stream termination, the client's `onStreamError` callback fires exactly once.
**Client-side subscription lifecycle:**

```
ILmxSubscription subscription = await client.SubscribeAsync(
    addresses: ["Motor.Speed", "Motor.Temperature"],
    onUpdate: (tag, vtq) => { /* handle value change */ },
    onStreamError: () => { /* handle disconnect */ });

// Later:
await subscription.DisposeAsync();   // Cancels the stream
```

Disposing the subscription cancels the underlying `CancellationTokenSource`, which terminates the background stream-reading task and triggers server-side cleanup of monitored items.
## Tag Addressing

Tags are string addresses that identify data points. The proxy maps tag addresses to backend-specific identifiers.

**LmxFakeProxy example** (OPC UA backend):

Tag addresses are concatenated with a configurable prefix to form OPC UA node IDs:

```
Prefix:  "ns=3;s="
Tag:     "Motor.Speed"
NodeId:  "ns=3;s=Motor.Speed"
```

The prefix is configured at server startup via the `OPC_UA_PREFIX` environment variable.
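The mapping is plain concatenation. A minimal sketch, assuming the prefix is read from `OPC_UA_PREFIX` with the example value `"ns=3;s="` as an illustrative fallback (the spec does not state a default):

```python
import os

def tag_to_node_id(tag: str, prefix: str = "") -> str:
    """Map a tag address to an OPC UA node ID by prepending the
    configured prefix; fall back to the OPC_UA_PREFIX env var."""
    return (prefix or os.environ.get("OPC_UA_PREFIX", "ns=3;s=")) + tag
```
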
## Transport Details

| Setting | Value |
|---------|-------|
| Protocol | gRPC over HTTP/2 |
| Default port | 50051 |
| TLS | Optional (controlled by `UseTls` connection parameter) |
| Metadata headers | `x-api-key` (sent on Connect call if API key configured) |
### Connection Parameters

The ScadaLink DCL configures LmxProxy connections via a string dictionary:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Host` | string | `"localhost"` | gRPC server hostname |
| `Port` | string (parsed as int) | `"50051"` | gRPC server port |
| `ApiKey` | string | (none) | API key for authentication |
| `SamplingIntervalMs` | string (parsed as int) | `"0"` | Backend sampling interval for subscriptions |
| `UseTls` | string (parsed as bool) | `"false"` | Use HTTPS instead of HTTP |
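Since every value arrives as a string, consumers must parse and default explicitly. An illustrative sketch of applying the table above (dict keys in the returned value are my own naming, not part of the spec):

```python
def parse_connection_params(params: dict) -> dict:
    """Parse the string-valued connection dictionary into typed
    settings, applying the documented defaults for missing keys."""
    return {
        "host": params.get("Host", "localhost"),
        "port": int(params.get("Port", "50051")),
        "api_key": params.get("ApiKey"),   # None when not configured
        "sampling_interval_ms": int(params.get("SamplingIntervalMs", "0")),
        "use_tls": params.get("UseTls", "false").strip().lower() == "true",
    }
```
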
## Error Handling

| Operation | Error Mechanism | Client Behavior |
|-----------|-----------------|-----------------|
| Connect | `success=false` in response | Throws `InvalidOperationException` |
| Read/ReadBatch | `success=false` in response | Throws `InvalidOperationException` |
| Write/WriteBatch | `success=false` in response | Throws `InvalidOperationException` |
| WriteBatchAndWait | `success=false` or `flag_reached=false` | Returns result (timeout is not an exception) |
| Subscribe (auth) | `RpcException` with `Unauthenticated` | Propagated to caller |
| Subscribe (stream) | Stream ends or gRPC error | `onStreamError` callback invoked; `sessionId` nullified |
| Any (disconnected) | Client checks `IsConnected` | Throws `InvalidOperationException("not connected")` |

When a subscription stream ends unexpectedly, the client immediately nullifies its session ID, causing `IsConnected` to return `false`. The DCL adapter fires its `Disconnected` event, which triggers the reconnection cycle in the `DataConnectionActor`.
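The "success=false becomes an exception" convention can be sketched as a response guard. Illustrative Python only; the actual client throws `InvalidOperationException` in C#, for which `RuntimeError` stands in here:

```python
def check_response(response: dict, operation: str) -> dict:
    """Raise on a failed response, carrying the server's message,
    instead of returning a sentinel value to the caller."""
    if not response.get("success", False):
        raise RuntimeError(
            f"{operation} failed: {response.get('message', 'unknown error')}")
    return response
```
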
## Implementation Files

| Component | File |
|-----------|------|
| Proto definition | `src/ScadaLink.DataConnectionLayer/Adapters/Protos/scada.proto` |
| Client interface | `src/ScadaLink.DataConnectionLayer/Adapters/ILmxProxyClient.cs` |
| Client implementation | `src/ScadaLink.DataConnectionLayer/Adapters/RealLmxProxyClient.cs` |
| DCL adapter | `src/ScadaLink.DataConnectionLayer/Adapters/LmxProxyDataConnection.cs` |
| Client factory | `src/ScadaLink.DataConnectionLayer/Adapters/LmxProxyClientFactory.cs` |
| Server implementation | `infra/lmxfakeproxy/Services/ScadaServiceImpl.cs` |
| Session manager | `infra/lmxfakeproxy/Sessions/SessionManager.cs` |
| Tag mapper | `infra/lmxfakeproxy/TagMapper.cs` |
| OPC UA bridge interface | `infra/lmxfakeproxy/Bridge/IOpcUaBridge.cs` |
| OPC UA bridge impl | `infra/lmxfakeproxy/Bridge/OpcUaBridge.cs` |