# SCADA System - High Level Requirements

## 1. Deployment Architecture

- **Site Clusters**: 2-node failover clusters deployed at each site, running on Windows.
- **Central Cluster**: A single 2-node failover cluster serving as the central hub.
- **Communication Topology**: Hub-and-spoke. Central cluster communicates with each site cluster. Site clusters do **not** communicate with one another.

### 1.1 Central vs. Site Responsibilities

- **Central cluster** is the single source of truth for all template authoring, configuration, and deployment decisions.
- **Site clusters** receive **flattened configurations** — fully resolved attribute sets with no template structure. Sites do not need to understand templates, inheritance, or composition.
- Sites **do not** support local/emergency configuration overrides. All configuration changes originate from central.

### 1.2 Failover

- Failover is managed at the **application level** using **Akka.NET** (not Windows Server Failover Clustering).
- Each cluster (central and site) runs an **active/standby** pair where Akka.NET manages node roles and failover detection.
- **Site failover**: The standby node takes over data collection and script execution seamlessly, including responsibility for the store-and-forward buffers. The Site Runtime Deployment Manager singleton is restarted on the new active node, which reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy.
- **Central failover**: The standby node takes over central responsibilities. Deployments that are in-progress during a failover are treated as **failed** and must be re-initiated by the engineer.

### 1.3 Store-and-Forward Persistence (Site Clusters Only)

- Store-and-forward applies **only at site clusters** — the central cluster does **not** buffer messages. If a site is unreachable, operations from central fail and must be retried by the engineer.
- All site-level store-and-forward buffers (external system calls, notifications, and cached database writes) are **replicated between the two site cluster nodes** using **application-level replication** over Akka.NET remoting.
- The **active node** persists buffered messages to a **local SQLite database** and forwards them to the standby node, which maintains its own local SQLite copy.
- On failover, the standby node already has a replicated copy of the buffer and takes over delivery seamlessly.
- Successfully delivered messages are removed from both nodes' local stores.
- There is **no maximum buffer size** — messages accumulate until they either succeed or exhaust retries and are parked.
- Retry intervals are **fixed** (not exponential backoff). The fixed interval is sufficient for the expected use cases.
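
The replication contract above can be sketched as follows. This is an illustrative model only — the class name `ReplicatedBuffer` is an assumption, and plain dictionaries stand in for the per-node SQLite stores; the real system implements this in C# over Akka.NET remoting.

```python
class ReplicatedBuffer:
    """Illustrative sketch of Section 1.3: the active node persists each
    buffered message locally, replicates it to the standby node, and both
    copies are removed once delivery succeeds."""

    def __init__(self):
        self.active_store = {}   # message_id -> payload (active node's SQLite)
        self.standby_store = {}  # message_id -> payload (standby node's SQLite)

    def enqueue(self, message_id, payload):
        # Active node persists locally first, then replicates to the standby.
        self.active_store[message_id] = payload
        self.standby_store[message_id] = payload  # application-level replication

    def mark_delivered(self, message_id):
        # Successful delivery removes the message from BOTH nodes' stores.
        self.active_store.pop(message_id, None)
        self.standby_store.pop(message_id, None)

    def failover(self):
        # The standby already holds a full copy and takes over delivery.
        return dict(self.standby_store)
```

On failover, whatever remains in the standby's store is exactly the set of undelivered messages, which is why delivery can resume seamlessly.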

### 1.4 Deployment Behavior

- When central deploys a new configuration to a site instance, the site **applies it immediately** upon receipt — no local operator confirmation is required.
- If a site loses connectivity to central, it **continues operating** with its last received deployed configuration.
- The site reports back to central whether deployment was successfully applied.
- **Pre-deployment validation**: Before any deployment is sent to a site, the central cluster performs comprehensive validation including flattening the configuration, test-compiling all scripts, verifying alarm trigger references, verifying script trigger references, and checking data connection binding completeness (see Section 3.11).

### 1.5 System-Wide Artifact Deployment

- Changes to shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration are **not automatically propagated** to sites.
- Deployment of system-wide artifacts requires **explicit action** by a user with the **Deployment** role.
- Artifacts can be deployed to **all sites at once** or to an **individual site** (per-site deployment).
- The Design role manages the definitions; the Deployment role triggers deployment to sites. A user may hold both roles.

## 2. Data Storage & Data Flow

### 2.1 Central Databases (MS SQL)

- **Configuration Database**: A dedicated database for system-specific configuration data (e.g., templates, site definitions, instance configurations, system settings).
- **Machine Data Database**: A separate database for collected machine data (e.g., telemetry, measurements, events).

### 2.2 Communication: Central ↔ Site

- Central-to-site and site-to-central communication uses **Akka.NET ClusterClient/ClusterClientReceptionist** for cross-cluster messaging with automatic failover.
- **Site addressing**: Site Akka base addresses (NodeA and NodeB) are stored in the **Sites database table** and configured via the Central UI. Central creates a ClusterClient per site using both addresses as contact points (cached in memory, refreshed periodically and on admin changes) rather than relying on runtime registration messages from sites.
- **Central contact points**: Sites configure **multiple central contact points** (both central node addresses) for redundancy. ClusterClient handles failover between central nodes automatically.
- **Central as integration hub**: Central brokers requests between external systems and sites. For example, a recipe manager sends a recipe to central, which routes it to the appropriate site. MES requests machine values from central, which routes the request to the site and returns the response.
- **Real-time data streaming** is not continuous for all machine data. The only real-time stream is an **on-demand debug view** — an engineer in the central UI can open a live view of a specific instance's tag values and alarm states for troubleshooting purposes. This is session-based and temporary. The debug view subscribes to the site-wide Akka stream filtered by instance (see Section 8.1).

### 2.3 Site-Level Storage & Interface

- Sites have **no user interface** — they are headless collectors, forwarders, and script executors.
- Sites require local storage for: the current deployed (flattened) configurations, deployed scripts, shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration.
- After artifact deployment, sites are **fully self-contained** — all runtime configuration is read from local SQLite. Sites do **not** access the central configuration database at runtime.
- Store-and-forward buffers are persisted to a **local SQLite database on each node** and replicated between nodes via application-level replication (see 1.3).

### 2.4 Data Connection Protocols

- The system supports **OPC UA** and **LmxProxy** (a gRPC-based custom protocol with an existing client SDK).
- Both protocols implement a **common interface** supporting: connect, subscribe to tag paths, receive value updates, and write values.
- Additional protocols can be added by implementing the common interface.
- The Data Connection Layer is a **clean data pipe** — it publishes tag value updates to Instance Actors but performs no evaluation of triggers or alarm conditions.
- **Initial attribute quality**: Attributes bound to a data connection start with **uncertain** quality when the Instance Actor initializes. The quality remains uncertain until the first value update is received from the Data Connection Layer. This distinguishes "never received a value" from "received a known-good value" or "connection lost" (bad quality).
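
The common interface and the initial-quality rule can be sketched as below. This is an illustrative model (the names `DataConnection`, `BoundAttribute`, and the three-state `Quality` enum are assumptions drawn from the prose); the real implementations are C# protocol adapters for OPC UA and LmxProxy.

```python
from abc import ABC, abstractmethod
from enum import Enum

class Quality(Enum):
    UNCERTAIN = "uncertain"  # never received a value yet
    GOOD = "good"            # last update was known-good
    BAD = "bad"              # connection lost

class DataConnection(ABC):
    """Common interface each protocol (OPC UA, LmxProxy, future ones)
    must implement: connect, subscribe to tag paths, write values."""

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def subscribe(self, tag_path: str, on_update) -> None: ...

    @abstractmethod
    def write(self, tag_path: str, value) -> None: ...

class BoundAttribute:
    """An attribute bound to a data connection starts UNCERTAIN and stays
    that way until the first value update arrives."""

    def __init__(self, relative_path: str):
        self.relative_path = relative_path
        self.value = None
        self.quality = Quality.UNCERTAIN

    def on_update(self, value) -> None:
        # First update from the Data Connection Layer promotes the quality.
        self.value = value
        self.quality = Quality.GOOD
```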

### 2.5 Scale

- Approximately **10 sites**.
- **50–500 machines per site**.
- **25–75 live data point tags per machine**.

## 3. Template & Machine Modeling

### 3.1 Template Structure

- Machines are modeled as **instances of templates**.
- Templates define a set of **attributes**.
- Each attribute has a **lock flag** that controls whether it can be overridden downstream.

### 3.2 Attribute Definition

Each attribute carries the following metadata:

- **Name**: Identifier for the attribute.
- **Value**: The default or configured value. May be empty if intended to be set at the instance level.
- **Data Type**: The value's type. Fixed set: Boolean, Integer, Float, String.
- **Lock Flag**: Controls whether the attribute can be overridden downstream.
- **Description**: Human-readable explanation of the attribute's purpose.
- **Data Source Reference** *(optional)*: A **relative path** within a data connection (e.g., `/Motor/Speed`). The template defines *what* to read — the path relative to a data connection. The template does **not** specify which data connection to use; that is an instance-level concern (see Section 3.3). Attributes without a data source reference are static configuration values.
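
The metadata above can be modeled as a small record. The dataclass below is an illustrative sketch, not the real schema; field names mirror the list above, and the validation of the fixed data-type set is an assumption about where that check would live.

```python
from dataclasses import dataclass
from typing import Optional

# Fixed data-type set from Section 3.2.
DATA_TYPES = {"Boolean", "Integer", "Float", "String"}

@dataclass
class AttributeDefinition:
    """Illustrative model of one template attribute (Section 3.2)."""
    name: str
    value: object                          # may be None if set at instance level
    data_type: str                         # one of DATA_TYPES
    locked: bool                           # lock flag: blocks downstream overrides
    description: str
    data_source_ref: Optional[str] = None  # e.g. "/Motor/Speed"; None = static

    def __post_init__(self):
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")
```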

### 3.3 Data Connections

- **Data connections** are reusable, named resources defined centrally and then **assigned to specific sites** (e.g., an OPC server, a PLC endpoint). Data connection definitions are deployed to sites as part of **artifact deployment** (see Section 1.5) and stored in local SQLite.
- A data connection encapsulates the details needed to communicate with a data source (protocol, address, credentials, etc.).
- Attributes with a data source reference must be **bound to a data connection at instance creation** — the template defines *what* to read (the relative path), and the instance specifies *where* to read it from (the data connection assigned to the site).
- **Binding is per-attribute**: Each attribute with a data source reference individually selects its data connection. Different attributes on the same instance may use different data connections. The Central UI supports bulk assignment (selecting multiple attributes and assigning a data connection to all of them at once) to reduce tedium.
- Templates do **not** specify a default connection. The connection binding is an instance-level concern.
- The flattened configuration sent to a site resolves connection references into concrete connection details paired with attribute relative paths.
- Data connection names are **not** standardized across sites — different sites may have different data connection names for equivalent devices.

### 3.4 Alarm Definitions

Alarms are **first-class template members** alongside attributes and scripts, following the same **inheritance, override, and lock rules**.

Each alarm has:

- **Name**: Identifier for the alarm.
- **Description**: Human-readable explanation of the alarm condition.
- **Priority Level**: Numeric value from 0–1000.
- **Lock Flag**: Controls whether the alarm can be overridden downstream.
- **Trigger Definition**: One of the following trigger types:
  - **Value Match**: Triggers when a monitored attribute equals a predefined value.
  - **Range Violation**: Triggers when a monitored attribute value falls outside an allowed range.
  - **Rate of Change**: Triggers when a monitored attribute value changes faster than a defined threshold.
- **On-Trigger Script** *(optional)*: A script to execute when the alarm triggers. The alarm on-trigger script executes in the context of the instance and can call instance scripts, but instance scripts **cannot** call alarm on-trigger scripts. The call direction is one-way.
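
The three trigger types reduce to simple predicates. The sketch below is illustrative only — the function names and the `(previous, current, dt)` signature for rate-of-change are assumptions; the real evaluation happens in the site's Alarm Actors.

```python
def value_match(current, target) -> bool:
    """Value Match: triggers when the monitored attribute equals a
    predefined value."""
    return current == target

def range_violation(current, low, high) -> bool:
    """Range Violation: triggers when the value falls outside the
    allowed [low, high] range."""
    return not (low <= current <= high)

def rate_of_change(previous, current, dt_seconds, max_rate) -> bool:
    """Rate of Change: triggers when the value changes faster than the
    defined threshold (units per second)."""
    return abs(current - previous) / dt_seconds > max_rate
```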

### 3.4.1 Alarm State

- Alarm state (active/normal) is **managed at the site level** per instance, held **in memory** by the Alarm Actor.
- When the alarm condition clears, the alarm **automatically returns to normal state** — no acknowledgment workflow is required.
- Alarm state is **not persisted** — on restart, alarm states are re-evaluated from incoming values.
- Alarm state changes are published to the site-wide Akka stream, keyed as `[InstanceUniqueName].[AlarmName]` and carrying the alarm state (active/normal), priority, and timestamp.

### 3.5 Template Relationships

Templates participate in two distinct relationship types:

- **Inheritance (is-a)**: A child template extends a parent template. The child inherits all attributes, alarms, scripts, and composed feature modules from the parent. The child can:
  - Override the **values** of non-locked inherited attributes, alarms, and scripts.
  - **Add** new attributes, alarms, or scripts not present in the parent.
  - **Not** remove attributes, alarms, or scripts defined by the parent.
- **Composition (has-a)**: A template can nest an instance of another template as a **feature module** (e.g., embedding a RecipeSystem module inside a base machine template). Feature modules can themselves compose other feature modules **recursively**.
- **Naming collisions**: If a template composes two feature modules that each define an attribute, alarm, or script with the same name, this is a **design-time error**. The system must detect and report the collision, and the template cannot be saved until the conflict is resolved.
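
The naming-collision check is a straightforward duplicate scan over the composed modules' member names. A minimal sketch, assuming modules are represented as name-to-member-set mappings (the representation is hypothetical):

```python
from collections import Counter

def find_collisions(feature_modules: dict) -> set:
    """Design-time collision check (Section 3.5): feature_modules maps
    module name -> set of member names (attributes, alarms, scripts).
    Returns the member names defined by more than one module; a non-empty
    result blocks saving the template."""
    counts = Counter(
        name
        for members in feature_modules.values()
        for name in members
    )
    return {name for name, n in counts.items() if n > 1}
```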

### 3.6 Locking

- Locking applies to **attributes, alarms, and scripts** uniformly.
- Any of these can be **locked** at the level where it is defined or overridden.
- A locked attribute **cannot** be overridden by any downstream level (child templates, composing templates, or instances).
- An unlocked attribute **can** be overridden by any downstream level.
- **Intermediate locking**: Any level in the chain can lock an attribute that was unlocked upstream. Once locked, it remains locked for all levels below — a downstream level **cannot** unlock an attribute locked above it.

### 3.6.1 Attribute Resolution Order

Attributes are resolved from most-specific to least-specific. The first value encountered wins:

1. **Instance** (site-deployed machine)
2. **Child Template** (most derived first, walking up the inheritance chain)
3. **Composing Template** (the template that embeds a feature module can override the module's attributes)
4. **Composed Module** (the original feature module definition, recursively resolved if modules nest other modules)

At any level, an override is only permitted if the attribute has **not been locked** further up the chain — at the level where it was defined or at any intermediate level (see 3.6).
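
The resolution walk can be sketched as follows. This is an illustrative model under assumed representations (each level is a dict mapping attribute name to a `(value, locked)` pair, ordered most-specific first); in practice illegal overrides are rejected at design time, so the "skip past the lock" step mostly documents the semantics.

```python
def resolve_attribute(name, levels):
    """Resolve one attribute per Section 3.6.1. levels[0] is the instance
    (most specific), levels[-1] the base module (least specific). Any value
    set more specifically than the outermost lock is an illegal override
    and is ignored; among the remaining levels, the most specific wins."""
    # Find the least-specific level that locks the attribute (Section 3.6:
    # once locked, everything downstream of it cannot override).
    lock_index = max(
        (i for i, lvl in enumerate(levels) if name in lvl and lvl[name][1]),
        default=-1,
    )
    start = lock_index if lock_index >= 0 else 0
    for lvl in levels[start:]:
        if name in lvl:
            return lvl[name][0]  # first value encountered wins
    return None
```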

### 3.7 Override Scope

- **Inheritance**: Child templates can override non-locked attributes from their parent, including attributes originating from composed feature modules.
- **Composition**: A template that composes a feature module can override non-locked attributes within that module.
- Overrides can "pierce" into composed modules — a child template can override attributes inside a feature module it inherited from its parent.

### 3.8 Instance Rules

- An instance is a deployed occurrence of a template at a site.
- Instances **can** override the values of non-locked attributes.
- Instances **cannot** add new attributes.
- Instances **cannot** remove attributes.
- The instance's structure (which attributes exist, which feature modules are composed) is strictly defined by its template.
- Each instance is **assigned to an area** within its site (see 3.10).

### 3.8.1 Instance Lifecycle

- Instances can be in one of two states: **enabled** or **disabled**.
- **Enabled**: The instance is active at the site — data subscriptions, script triggers, and alarm evaluation are all running.
- **Disabled**: The site **stops** script triggers, data subscriptions (no live data collection), and alarm evaluation. The deployed configuration is **retained** on the site so the instance can be re-enabled without redeployment. Store-and-forward messages for a disabled instance **continue to drain** (deliver pending messages).
- **Deletion**: Instances can be deleted. Deletion removes the running configuration from the site, stops subscriptions, and destroys the Instance Actor and its children. Store-and-forward messages are **not** cleared on deletion — they continue to be delivered or can be managed (retried/discarded) via parked message management. If the site is unreachable when a delete is triggered, the deletion **fails** (same behavior as a failed deployment). The central side does not mark it as deleted until the site confirms.
- Templates **cannot** be deleted if any instances or child templates reference them. The user must remove all references first.

### 3.9 Template Deployment & Change Propagation

- Template changes are **not** automatically propagated to deployed instances.
- The system maintains two views of each instance:
  - **Deployed Configuration**: The currently active configuration on the instance, as it was last explicitly deployed.
  - **Template-Derived Configuration**: The configuration the instance *would* have based on the current state of its template (including resolved inheritance, composition, and overrides).
- Deployment is performed at the **individual instance level** — an engineer explicitly commands the system to update a specific instance.
- The system must be able to **show differences** between the deployed configuration and the current template-derived configuration, allowing engineers to see what would change before deploying.
- **No rollback** support is required. The system only needs to track the current deployed state, not a history of prior deployments.
- **Concurrent editing**: Template editing uses a **last-write-wins** model. No pessimistic locking or optimistic concurrency conflict detection is required.
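
The "show differences" requirement is a comparison of the two views. A minimal sketch, treating each configuration as a flat mapping of attribute name to resolved value (the three-bucket output shape is an assumption, not a specified UI format):

```python
def diff_configs(deployed: dict, derived: dict) -> dict:
    """Compare the deployed configuration against the current
    template-derived configuration (Section 3.9) so an engineer can see
    what a deployment would change."""
    return {
        "added":   sorted(k for k in derived if k not in deployed),
        "removed": sorted(k for k in deployed if k not in derived),
        "changed": sorted(k for k in deployed
                          if k in derived and deployed[k] != derived[k]),
    }
```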

### 3.10 Areas

- Areas are **predefined hierarchical groupings** associated with a site, stored in the configuration database.
- Areas support **parent-child relationships** (e.g., Plant → Building → Production Line → Cell).
- Each instance is assigned to an area within its site.
- Areas are used for **filtering and finding instances** in the central UI.
- Area definitions are managed by users with the **Admin** role.

### 3.11 Pre-Deployment Validation

Before any deployment is sent to a site, the central cluster performs **comprehensive validation**. Validation covers:

- **Flattening**: The full template hierarchy is resolved and flattened successfully.
- **Naming collision detection**: No duplicate attribute, alarm, or script names exist in the flattened configuration.
- **Script compilation**: All instance scripts and alarm on-trigger scripts are test-compiled and must compile without errors.
- **Alarm trigger references**: Alarm trigger definitions reference attributes that exist in the flattened configuration.
- **Script trigger references**: Script triggers (value change, conditional) reference attributes that exist in the flattened configuration.
- **Data connection binding completeness**: Every attribute with a data source reference has a data connection binding assigned on the instance, and the bound data connection name exists as a defined connection at the instance's site.
- **Exception**: Validation does **not** verify that data source relative paths resolve to real tags on physical devices — that is a runtime concern that can only be determined at the site.

Validation is also available **on demand in the Central UI** for Design users during template authoring, providing early feedback without requiring a deployment attempt.

For **shared scripts**, pre-compilation validation is performed before deployment to sites. Since shared scripts have no instance context, validation is limited to C# syntax and structural correctness.
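
The reference and binding checks lend themselves to a simple error-collecting pass. The sketch below covers three of the checks (trigger references and binding completeness) under an assumed flattened-config shape; flattening, collision detection, and script compilation are omitted, and all names in the input structure are hypothetical.

```python
def validate(config: dict) -> list:
    """Collect validation errors for a flattened configuration (Section 3.11).
    Assumed shape: config['attributes'] maps name -> {'data_source_ref',
    'connection'}; trigger maps point at attribute names; 'site_connections'
    lists the connections defined at the target site."""
    errors = []
    attrs = config.get("attributes", {})
    # Alarm and script trigger references must point at existing attributes.
    for alarm, ref in config.get("alarm_triggers", {}).items():
        if ref not in attrs:
            errors.append(f"alarm '{alarm}' references missing attribute '{ref}'")
    for script, ref in config.get("script_triggers", {}).items():
        if ref not in attrs:
            errors.append(f"script '{script}' references missing attribute '{ref}'")
    # Binding completeness: every attribute with a data source reference needs
    # a connection binding, and the binding must exist at the instance's site.
    site_connections = set(config.get("site_connections", []))
    for name, meta in attrs.items():
        if meta.get("data_source_ref"):
            bound = meta.get("connection")
            if not bound:
                errors.append(f"attribute '{name}' has no connection binding")
            elif bound not in site_connections:
                errors.append(f"attribute '{name}' bound to unknown connection '{bound}'")
    return errors
```

Deployment proceeds only when the returned list is empty; the same pass backs the on-demand validation in the Central UI.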

## 4. Scripting

### 4.1 Script Definitions

- Scripts are **C#** and are defined at the **template level** as first-class template members.
- Scripts follow the same **inheritance, override, and lock rules** as attributes. A parent template can define a script, a child template can override it (if not locked), and any level can lock a script to prevent downstream changes.
- Scripts are deployed to sites as part of the flattened instance configuration.
- Scripts are **compiled at the site** when a deployment is received. Pre-compilation validation occurs at central before deployment (see Section 3.11), but the site performs the actual compilation for execution.
- Scripts can optionally define **input parameters** (name and data type per parameter). Scripts without parameter definitions accept no arguments.
- Scripts can optionally define a **return value definition** (field names and data types). Return values support **single objects** and **lists of objects**. Scripts without a return definition return void.
- Return values are used when scripts are called explicitly by other scripts (via `Instance.CallScript()` or `Scripts.CallShared()`) or by the Inbound API (via `Route.To().Call()`). When invoked by a trigger (interval, value change, conditional, alarm), any return value is discarded.

### 4.2 Script Triggers

Scripts can be triggered by:

- **Interval**: Execute on a recurring time schedule.
- **Value Change**: Execute when a specific instance attribute value changes.
- **Conditional**: Execute when an instance attribute value equals or does not equal a given value.

Scripts have an optional **minimum time between runs** setting. If a trigger fires before the minimum interval has elapsed since the last execution, the invocation is skipped.
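
The minimum-interval rule is a gate, not a queue: an early trigger is dropped outright. A minimal sketch (the `MinIntervalGate` class and explicit `now` parameter are illustrative choices, not part of the specification):

```python
class MinIntervalGate:
    """Enforces the 'minimum time between runs' setting (Section 4.2):
    a trigger firing before the minimum interval has elapsed since the
    last execution is skipped, not deferred."""

    def __init__(self, min_seconds: float):
        self.min_seconds = min_seconds
        self.last_run = None  # timestamp of the last accepted execution

    def should_run(self, now: float) -> bool:
        if self.last_run is not None and now - self.last_run < self.min_seconds:
            return False  # fired too soon: skip this invocation entirely
        self.last_run = now
        return True
```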

### 4.3 Script Error Handling

- If a script fails (unhandled exception, timeout, etc.), the failure is **logged locally** at the site.
- The script is **not disabled** — it remains active and will fire on the next qualifying trigger event.
- Script failures are **not reported to central**. Diagnostics are local only.
- For external system call failures within scripts, store-and-forward handling (Section 5.3) applies independently of script error handling.

### 4.4 Script Capabilities

Scripts executing on a site for a given instance can:

- **Read** attribute values on that instance (live data points and static config).
- **Write** attribute values on that instance. For attributes with a data source reference, the write goes to the Data Connection Layer which writes to the physical device; the in-memory value updates when the device confirms the new value via the existing subscription. For static attributes, the write updates the in-memory value and **persists the override to local SQLite** — the value survives restart and failover. Persisted overrides are reset when the instance is redeployed.
- **Call other scripts** on that instance via `Instance.CallScript("scriptName", params)`. Calls use the Akka ask pattern and return the called script's return value. Script-to-script calls support concurrent execution.
- **Call shared scripts** via `Scripts.CallShared("scriptName", params)`. Shared scripts execute **inline** in the calling Script Actor's context — they are compiled code libraries, not separate actors.
- **Call external system API methods** in two modes: `ExternalSystem.Call()` for synchronous request/response, or `ExternalSystem.CachedCall()` for fire-and-forget with store-and-forward on transient failure (see Section 5).
- **Send notifications** (see Section 6).
- **Access databases** by requesting an MS SQL client connection by name (see Section 5.5).

Scripts **cannot** access other instances' attributes or scripts.

### 4.4.1 Script Call Recursion Limit

- Script-to-script calls (via `Instance.CallScript` and `Scripts.CallShared`) enforce a **maximum recursion depth** to prevent infinite loops.
- The default maximum depth is a reasonable limit (e.g., 10 levels).
- The current call depth is tracked and incremented with each nested call. If the limit is reached, the call fails with an error logged to the site event log.
- This applies to all script call chains including alarm on-trigger scripts calling instance scripts.
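
The depth-tracking mechanism can be sketched as a counter that travels with each nested call. This is an illustrative model (`call_script`, `RecursionLimitError`, and depth-as-parameter are assumptions); in the real system the depth would ride along with the Akka ask messages between Script Actors.

```python
MAX_DEPTH = 10  # spec suggests "a reasonable limit (e.g., 10 levels)"

class RecursionLimitError(Exception):
    pass

def call_script(script, *args, depth: int = 0):
    """Invoke a script callable, incrementing the call depth so nested
    calls through this guard are counted (Section 4.4.1)."""
    if depth >= MAX_DEPTH:
        # In the real system this failure is logged to the site event log.
        raise RecursionLimitError(f"call depth {depth} exceeds limit {MAX_DEPTH}")
    return script(depth + 1, *args)

def runaway(depth):
    # A script that keeps calling itself: the guard stops the loop.
    return call_script(runaway, depth=depth)
```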

### 4.5 Shared Scripts

- Shared scripts are **not associated with any template** — they are a **system-wide library** of reusable C# scripts.
- Shared scripts can optionally define **input parameters** and **return value definitions**, following the same rules as template-level scripts.
- Managed by users with the **Design** role.
- Deployed to **all sites** for use by any instance script (deployment requires explicit action by a user with the Deployment role).
- Shared scripts execute **inline** in the calling Script Actor's context as compiled code. They are not separate actors. This avoids serialization bottlenecks and messaging overhead.
- Shared scripts are **not available on the central cluster** — Inbound API scripts cannot call them directly. To execute shared script logic, route to a site instance via `Route.To().Call()`.

### 4.6 Alarm On-Trigger Scripts

- Alarm on-trigger scripts are defined as part of the alarm definition and execute when the alarm activates.
- They execute directly in the Alarm Actor's context (via a short-lived Alarm Execution Actor), similar to how shared scripts execute inline.
- Alarm on-trigger scripts **can** call instance scripts via `Instance.CallScript()`, which sends an ask message to the appropriate sibling Script Actor.
- Instance scripts **cannot** call alarm on-trigger scripts — the call direction is one-way.
- The recursion depth limit applies to alarm-to-instance script call chains.

## 5. External System Integrations

### 5.1 External System Definitions

- External systems are **predefined contracts** created by users with the **Design** role.
- Each definition includes:
  - **Connection details**: Endpoint URL, authentication, protocol information.
  - **Method definitions**: Available API methods with defined parameters and return types.
- Definitions are deployed **uniformly to all sites** — no per-site connection detail overrides.
- Deployment of definition changes requires **explicit action** by a user with the Deployment role.
- At the site, external system definitions are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 5.2 Site-to-External-System Communication

- Sites communicate with external systems **directly** (not routed through central).
- Scripts invoke external system methods by referencing the predefined definitions.

### 5.3 Store-and-Forward for External Calls

- If an external system is unavailable when a script invokes a method, the message is **buffered locally at the site**.
- Retry is performed **per message** — individual failed messages retry independently.
- Each external system definition includes configurable retry settings:
  - **Max retry count**: Maximum number of retry attempts before giving up.
  - **Time between retries**: Fixed interval between retry attempts (no exponential backoff).
- After max retries are exhausted, the message is **parked** (dead-lettered) for manual review.
- There is **no maximum buffer size** — messages accumulate until delivery succeeds or retries are exhausted.
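
The per-message retry policy can be sketched as below. This is an illustrative model only (the class names and the synchronous `deliver_with_retry` loop are assumptions; the real system schedules retries asynchronously per buffered message and actually waits out the fixed interval).

```python
from dataclasses import dataclass

@dataclass
class BufferedMessage:
    payload: str
    attempts: int = 0

@dataclass
class RetryPolicy:
    """Retry settings from Section 5.3: a max retry count and a FIXED
    interval between attempts (no exponential backoff)."""
    max_retries: int
    seconds_between: float

    def next_delay(self, msg: BufferedMessage):
        """Fixed delay before the next attempt, or None once the message
        must be parked (dead-lettered) for manual review."""
        if msg.attempts >= self.max_retries:
            return None
        return self.seconds_between

def deliver_with_retry(msg, policy, send) -> str:
    """send(payload) returns True on success. Returns 'delivered' or 'parked'."""
    while True:
        if send(msg.payload):
            return "delivered"
        msg.attempts += 1
        if policy.next_delay(msg) is None:
            return "parked"  # manual retry/discard via the central UI (5.4)
        # Real system: wait policy.seconds_between, then attempt again.
```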

### 5.4 Parked Message Management

- Parked messages are **stored at the site** where they originated.
- The **central UI** can **query sites** for parked messages and manage them remotely.
- Operators can **retry** or **discard** parked messages from the central UI.
- Parked message management covers **external system calls**, **notifications**, and **cached database writes**.

### 5.5 Database Connections

- Database connections are **predefined, named resources** created by users with the **Design** role.
- Each definition includes the connection details needed to connect to an MS SQL database (server, database name, credentials, etc.).
- Each definition includes configurable retry settings (same pattern as external systems): **max retry count** and **time between retries** (fixed interval).
- Definitions are deployed **uniformly to all sites** — no per-site overrides.
- Deployment of definition changes requires **explicit action** by a user with the Deployment role.
- At the site, database connection definitions are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 5.6 Database Access Modes

Scripts can interact with databases in two modes:

- **Real-time (synchronous)**: Scripts request a **raw MS SQL client connection by name** (e.g., `Database.Connection("MES_DB")`), giving script authors full ADO.NET-level control for immediate queries and updates.
- **Cached write (store-and-forward)**: Scripts submit a write operation for deferred, reliable delivery. The cached entry stores the **database connection name**, the **SQL statement to execute**, and **parameter values**. If the database is unavailable, the write is buffered locally at the site and retried per the connection's retry settings. After max retries are exhausted, the write is **parked** for manual review (managed via central UI alongside other parked messages).

## 6. Notifications

### 6.1 Notification Lists

- Notification lists are **system-wide**, managed by users with the **Design** role.
- Each list has a **name** and contains one or more **recipients**.
- Each recipient has a **name** and an **email address**.
- Notification lists are deployed to **all sites** (deployment requires explicit action by a user with the Deployment role).
- At the site, notification lists and recipients are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 6.2 Email Support

- The system has **predefined support for sending email** as the notification delivery mechanism.
- Email server configuration (SMTP settings) is defined centrally and deployed to all sites as part of **artifact deployment** (see Section 1.5). Sites read SMTP configuration from **local SQLite**.

### 6.3 Script API

- Scripts send notifications using a simplified API: `Notify.To("list name").Send("subject", "message")`
- This API is available to instance scripts, alarm on-trigger scripts, and shared scripts.

### 6.4 Store-and-Forward for Notifications

- If the email server is unavailable, notifications are **buffered locally at the site**.
- Follows the same retry pattern as external system calls: configurable **max retry count** and **time between retries** (fixed interval).
- After max retries are exhausted, the notification is **parked** for manual review (managed via central UI alongside external system parked messages).
- There is **no maximum buffer size** for notification messages.

## 7. Inbound API (Central)

### 7.1 Purpose

The system exposes a **web API on the central cluster** for external systems to call into the SCADA system. This is the counterpart to the outbound External System Integrations (Section 5) — where Section 5 defines how the system calls out, this section defines how external systems call in.

### 7.2 API Key Management

- API keys are stored in the **configuration database**.
- Each API key has a **name/label** (for identification), the **key value**, and an **enabled/disabled** flag.
- API keys are managed by users with the **Admin** role.

### 7.3 Authentication

- Inbound API requests are authenticated via **API key** (not LDAP/AD).
- The API key must be included with each request.
- Invalid or disabled keys are rejected.

### 7.4 API Method Definitions

- API methods are **predefined** and managed by users with the **Design** role.
- Each method definition includes:
  - **Method name**: Unique identifier for the endpoint.
  - **Approved API keys**: List of API keys authorized to call this method.
  - **Parameter definitions**: Name and data type for each input parameter.
  - **Return value definition**: Data type and structure of the response. Supports **single objects** and **lists of objects**.
  - **Timeout**: Configurable per method. Maximum execution time including routed calls to sites.
- The implementation of each method is a **C# script stored inline** in the method definition. It executes on the central cluster. No template inheritance — API scripts are standalone.
- API scripts can route calls to any instance at any site via `Route.To("instanceCode").Call("scriptName", parameters)`, read/write attributes in batch, and access databases directly.
- API scripts **cannot** call shared scripts directly (shared scripts are site-only). To invoke site logic, use `Route.To().Call()`.
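
A method implementation script might look like the following sketch. Only `Route.To(...).Call(...)` is defined above; the `Params` accessor, the `GetPumpState` site script, and the response shape are illustrative assumptions:

```csharp
// Hypothetical inbound API method script: return a pump's current state.
// Route.To(...).Call(...) is the routing API from this section; Params,
// "GetPumpState", and the returned fields are assumptions for illustration.
var instanceCode = (string)Params["instanceCode"];

// Routed to the owning site; subject to this method's configured timeout.
var state = await Route.To(instanceCode).Call("GetPumpState", new { });

return new
{
    Instance     = instanceCode,
    Running      = state.Running,
    FlowRate     = state.FlowRate,
    TimestampUtc = state.TimestampUtc   // UTC throughout, per Section 14.1
};
```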

### 7.5 Availability

- The inbound API is hosted **only on the central cluster** (active node).
- On central failover, the API becomes available on the new active node.

## 8. Central UI

The central cluster hosts a **configuration and management UI** (no live machine data visualization, except on-demand debug views). The UI supports the following workflows:

- **Template Authoring**: Create, edit, and manage templates including hierarchy (inheritance) and composition (feature modules). Author and manage scripts within templates. **Design-time validation** available on demand to check flattening, naming collisions, and script compilation without deploying.
- **Shared Script Management**: Create, edit, and manage the system-wide shared script library.
- **Notification List Management**: Create, edit, and manage notification lists and recipients.
- **External System Management**: Define external system contracts (connection details, API method definitions).
- **Database Connection Management**: Define named database connections for script use.
- **Inbound API Management**: Manage API keys (create, enable/disable, delete). Define API methods (name, parameters, return values, approved keys, implementation script). *(Admin role for keys, Design role for methods.)*
- **Instance Management**: Create instances from templates, bind data connections (per-attribute, with **bulk assignment** UI for selecting multiple attributes and assigning a data connection at once), set instance-level attribute overrides, assign instances to areas. **Disable** or **delete** instances.
- **Site & Data Connection Management**: Define sites (including optional NodeAAddress and NodeBAddress fields used as ClusterClient contact points), manage data connections and assign them to sites.
- **Area Management**: Define hierarchical area structures per site for organizing instances.
- **Deployment**: View diffs between deployed and current template-derived configurations, deploy updates to individual instances. Filter instances by area. Pre-deployment validation runs automatically before any deployment is sent.
- **System-Wide Artifact Deployment**: Explicitly deploy shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration to all sites or to an individual site (requires Deployment role). Per-site deployment is available via the Sites admin page.
- **Deployment Status Monitoring**: Track whether deployments were successfully applied at site level.
- **Debug View**: On-demand real-time view of a specific instance's tag values and alarm states for troubleshooting (see Section 8.1).
- **Parked Message Management**: Query sites for parked messages (external system calls, notifications, and cached database writes), retry or discard them.
- **Health Monitoring Dashboard**: View site cluster health, node status, data connection health, script error rates, alarm evaluation errors, and store-and-forward buffer depths (see Section 11).
- **Site Event Log Viewer**: Query and view operational event logs from site clusters (see Section 12).

### 8.1 Debug View

- **Subscribe-on-demand**: When an engineer opens a debug view for an instance, central subscribes to the **site-wide Akka stream** filtered by instance unique name. The site first provides a **snapshot** of all current attribute values and alarm states from the Instance Actor, then streams subsequent changes from the Akka stream.
- Attribute value stream messages are structured as: `[InstanceUniqueName].[AttributePath].[AttributeName]`, attribute value, attribute quality, attribute change timestamp.
- Alarm state stream messages are structured as: `[InstanceUniqueName].[AlarmName]`, alarm state (active/normal), priority, timestamp.
- The stream continues until the engineer **closes the debug view**, at which point central unsubscribes and the site stops streaming.
- No attribute/alarm selection — the debug view always shows all tag values and alarm states for the instance.
- No special concurrency limits are required.
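
The two stream message shapes above could be modeled roughly as the following records. The type and field names are illustrative assumptions; the payload contents match the bullets above:

```csharp
// Sketch of the two debug-view stream payloads. Record and field names are
// assumptions; only the dotted-path structure and fields come from this section.
public sealed record AttributeValueChanged(
    string Path,            // "[InstanceUniqueName].[AttributePath].[AttributeName]"
    object Value,           // attribute value
    string Quality,         // attribute quality
    DateTime TimestampUtc); // attribute change timestamp (UTC, per Section 14.1)

public sealed record AlarmStateChanged(
    string Path,            // "[InstanceUniqueName].[AlarmName]"
    bool IsActive,          // alarm state: active/normal
    int Priority,
    DateTime TimestampUtc);
```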

## 9. Security & Access Control

### 9.1 Authentication

- **UI users** authenticate via **username/password** validated directly against **LDAP/Active Directory**. Sessions are maintained via JWT tokens.
- **External system API callers** authenticate via **API key** (see Section 7).

### 9.2 Authorization

- Authorization is **role-based**, with roles assigned by **LDAP group membership**.
- Roles are **independent** — they can be mixed and matched per user (via group membership). There is no implied hierarchy between roles.
- A user may hold multiple roles simultaneously (e.g., both Design and Deployment) by being a member of the corresponding LDAP groups.
- Inbound API authorization is per-method, based on **approved API key lists** (see Section 7.4).

### 9.3 Roles

- **Admin**: System-wide permission to manage sites, data connections, LDAP group-to-role mappings, API keys, and system-level configuration.
- **Design**: System-wide permission to author and edit templates, scripts, shared scripts, external system definitions, notification lists, and inbound API method definitions.
- **Deployment**: Permission to manage instances (create, set overrides, bind connections, disable, delete) and deploy configurations to sites. Also covers triggering system-wide artifact deployment. Can be scoped **per site**.

### 9.4 Role Scoping

- Admin is always **system-wide**.
- Design is always **system-wide**.
- Deployment can be **system-wide** or **site-scoped**, controlled by LDAP group membership (e.g., `Deploy-SiteA`, `Deploy-SiteB`, or `Deploy-All`).
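
Resolving a user's deployment scope from group names could work roughly as follows. The `Deploy-SiteA` / `Deploy-All` naming pattern comes from the bullet above; the function itself is an illustrative sketch, not the real implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: derive the set of deployable sites from LDAP group names such as
// "Deploy-SiteA" or "Deploy-All". The parsing code is illustrative only.
static IReadOnlyCollection<string> DeployableSites(
    IEnumerable<string> ldapGroups, IReadOnlyCollection<string> allSites)
{
    var groups = ldapGroups.ToHashSet(StringComparer.OrdinalIgnoreCase);

    if (groups.Contains("Deploy-All"))
        return allSites;                                  // system-wide scope

    return allSites
        .Where(site => groups.Contains($"Deploy-{site}")) // site-scoped
        .ToList();
}
```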

## 10. Audit Logging

Audit logging is implemented as part of the **Configuration Database** component via the `IAuditService` interface.

### 10.1 Storage

- Audit logs are stored in the **configuration MS SQL database** alongside system config data, enabling direct querying.
- Entries are **append-only** — never modified or deleted. No retention policy — retained indefinitely.

### 10.2 Scope

All system-modifying actions are logged, including:

- **Template changes**: Create, edit, delete templates.
- **Script changes**: Template script and shared script create, edit, delete.
- **Alarm changes**: Create, edit, delete alarm definitions.
- **Instance changes**: Create, override values, bind connections, area assignment, disable, enable, delete.
- **Deployments**: Who deployed what to which instance, and the result (success/failure).
- **System-wide artifact deployments**: Who deployed shared scripts / external system definitions / DB connections / data connections / notification lists / SMTP config, to which site(s), and the result.
- **External system definition changes**: Create, edit, delete.
- **Database connection changes**: Create, edit, delete.
- **Notification list changes**: Create, edit, delete lists and recipients.
- **Inbound API changes**: API key create, enable/disable, delete. API method create, edit, delete.
- **Area changes**: Create, edit, delete area definitions.
- **Site & data connection changes**: Create, edit, delete.
- **Security/admin changes**: Role mapping changes, site permission changes.

### 10.3 Detail Level

- Each audit log entry records the **state of the entity after the change**, serialized as JSON. Only the after-state is stored — change history is reconstructed by comparing consecutive entries for the same entity at query time.
- Each entry includes: **who** (authenticated user), **what** (action, entity type, entity ID, entity name), **when** (timestamp), and **state** (JSON after-state, null for deletes).
- **One entry per save operation** — when a user edits a template and changes multiple attributes in one save, a single entry captures the full entity state.
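
A single audit entry might therefore look like the following. This is an illustrative example of the who/what/when/state fields described above, not a schema definition; all names and values are assumptions:

```json
{
  "user": "DOMAIN\\jsmith",
  "action": "Update",
  "entityType": "Template",
  "entityId": 42,
  "entityName": "PumpStation",
  "timestampUtc": "2024-03-15T09:21:07Z",
  "state": {
    "name": "PumpStation",
    "attributes": [
      { "name": "FlowRate", "dataType": "Double" }
    ]
  }
}
```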

### 10.4 Transactional Guarantee

- Audit entries are written **synchronously** within the same database transaction as the change (via the unit-of-work pattern). If the change succeeds, the audit entry is guaranteed to be recorded. If the change rolls back, the audit entry rolls back too.
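
In unit-of-work terms, the guarantee looks roughly like this. `IAuditService` is the interface named in this section; the unit-of-work and repository members are illustrative assumptions:

```csharp
// Sketch: the audit write shares the change's transaction. IAuditService is
// named in Section 10; UnitOfWork, repository, and method shapes are assumptions.
using var uow = await unitOfWorkFactory.BeginAsync();   // opens one DB transaction

await templateRepository.SaveAsync(template, uow);      // the actual change
await auditService.RecordAsync(                         // audit entry, same transaction
    user, "Update", "Template", template.Id, template.Name, ToJson(template), uow);

await uow.CommitAsync();   // both persist together; a rollback discards both
```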

## 11. Health Monitoring

### 11.1 Monitored Metrics

The central cluster monitors the health of each site cluster, including:

- **Site cluster online/offline status**: Whether the site is reachable.
- **Active vs. standby node status**: Which node is active and which is standby.
- **Data connection health**: Connected/disconnected status per data connection at the site.
- **Script error rates**: Frequency of script failures at the site.
- **Alarm evaluation errors**: Frequency of alarm evaluation failures at the site.
- **Store-and-forward buffer depth**: Number of messages currently queued (broken down by external system calls, notifications, and cached database writes).

### 11.2 Reporting

- Site clusters **report health metrics to central** periodically.
- Health status is **visible in the central UI** — no automated alerting/notifications for now.

## 12. Site-Level Event Logging

### 12.1 Events Logged

Sites log operational events locally, including:

- **Script executions**: Start, complete, error (with error details).
- **Alarm events**: Alarm activated, alarm cleared (which alarm, which instance, when). Alarm evaluation errors.
- **Deployment applications**: Configuration received from central, applied successfully or failed. Script compilation results.
- **Data connection status changes**: Connected, disconnected, reconnected per connection.
- **Store-and-forward activity**: Message queued, delivered, retried, parked.
- **Instance lifecycle**: Instance enabled, disabled, deleted.

### 12.2 Storage

- Event logs are stored in **local SQLite** on each site node.
- **Retention policy**: 30 days. Events older than 30 days are automatically purged.

### 12.3 Central Access

- The central UI can **query site event logs remotely**, following the same pattern as parked message management — central requests data from the site over Akka.NET remoting.

## 13. Management Service & CLI

### 13.1 Management Service

- The central cluster exposes a **ManagementActor** that provides programmatic access to all administrative operations — the same operations available through the Central UI.
- The ManagementActor registers with the Akka.NET **ClusterClientReceptionist**, allowing external tools to communicate with it via ClusterClient without joining the cluster.
- The ManagementActor enforces the **same role-based authorization** as the Central UI. Every incoming message carries the authenticated user's identity and roles.
- All mutating operations performed through the Management Service are **audit logged** via IAuditService, identical to operations performed through the Central UI.
- The ManagementActor runs on the **active central node** and fails over with it. ClusterClient handles reconnection transparently.

### 13.2 CLI

- The system provides a standalone **command-line tool** (`scadalink`) for scripting and automating administrative operations.
- The CLI connects to the ManagementActor via Akka.NET **ClusterClient** — it does not join the cluster as a full member and does not use HTTP/REST.
- The CLI authenticates the user against **LDAP/AD** (direct bind, same mechanism as the Central UI) and includes the authenticated identity in every message sent to the ManagementActor.
- CLI commands mirror all Management Service operations: templates, instances, sites, data connections, deployments, external systems, notifications, security (API keys and role mappings), audit log queries, and health status.
- Output is **JSON by default** (machine-readable, suitable for scripting) with an optional `--format table` flag for human-readable tabular output.
- Configuration is resolved from command-line options, **environment variables** (`SCADALINK_CONTACT_POINTS`, `SCADALINK_LDAP_SERVER`, etc.), or a **configuration file** (`~/.scadalink/config.json`).
- The CLI is a separate executable from the Host binary — it is deployed on any Windows machine with network access to the central cluster.
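
Typical invocations might look like the following. Only `scadalink`, the `--format table` flag, and the two environment variable names are documented above; the subcommand names and the contact-point address format are illustrative assumptions:

```shell
# Contact points and LDAP server via the documented environment variables
# (the Akka address format shown is an assumption for illustration).
export SCADALINK_CONTACT_POINTS="akka.tcp://scada@central-a:4053,akka.tcp://scada@central-b:4053"
export SCADALINK_LDAP_SERVER="ldap.corp.example"

# JSON output by default (hypothetical subcommand names):
scadalink instances list --site SiteA

# Human-readable output via the documented flag:
scadalink deployments status --format table
```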

## 14. General Conventions

### 14.1 Timestamps

- All timestamps throughout the system are stored, transmitted, and processed in **UTC**.
- This applies to: attribute value timestamps, alarm state change timestamps, audit log entries, event log entries, deployment records, health reports, store-and-forward message timestamps, and all inter-node messages.
- Local time conversion for display is a **Central UI concern only** — no other component performs timezone conversion.
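
In .NET terms, the convention amounts to the following sketch (not prescribed implementation code; the Windows time zone ID is an arbitrary example):

```csharp
// Everywhere: record and transmit UTC only.
var stamp = DateTime.UtcNow;

// Central UI only: convert to local time at the display boundary.
// (Time zone ID is an example; the system runs on Windows, so Windows IDs apply.)
var local = TimeZoneInfo.ConvertTimeFromUtc(
    stamp, TimeZoneInfo.FindSystemTimeZoneById("W. Europe Standard Time"));
```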

---

*All initial high-level requirements have been captured. This document will continue to be updated as the design evolves.*