# SCADA System - High Level Requirements

## 1. Deployment Architecture

- **Site Clusters**: 2-node failover clusters deployed at each site, running on Windows.
- **Central Cluster**: A single 2-node failover cluster serving as the central hub.
- **Communication Topology**: Hub-and-spoke. The central cluster communicates with each site cluster. Site clusters do **not** communicate with one another.

### 1.1 Central vs. Site Responsibilities

- The **central cluster** is the single source of truth for all template authoring, configuration, and deployment decisions.
- **Site clusters** receive **flattened configurations** — fully resolved attribute sets with no template structure. Sites do not need to understand templates, inheritance, or composition.
- Sites do **not** support local/emergency configuration overrides. All configuration changes originate from central.

### 1.2 Failover

- Failover is managed at the **application level** using **Akka.NET** (not Windows Server Failover Clustering).
- Each cluster (central and site) runs an **active/standby** pair in which Akka.NET manages node roles and failover detection.
- **Site failover**: The standby node takes over data collection and script execution seamlessly, including responsibility for the store-and-forward buffers. The Site Runtime Deployment Manager singleton is restarted on the new active node, which reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy.
- **Central failover**: The standby node takes over central responsibilities. Deployments that are in progress during a failover are treated as **failed** and must be re-initiated by the engineer.

### 1.3 Store-and-Forward Persistence (Site Clusters Only)

- Store-and-forward applies **only at site clusters** — the central cluster does **not** buffer messages. If a site is unreachable, operations from central fail and must be retried by the engineer.
- All site-level store-and-forward buffers (external system calls, notifications, and cached database writes) are **replicated between the two site cluster nodes** using **application-level replication** over Akka.NET remoting.
- The **active node** persists buffered messages to a **local SQLite database** and forwards them to the standby node, which maintains its own local SQLite copy.
- On failover, the standby node already has a replicated copy of the buffer and takes over delivery seamlessly.
- Successfully delivered messages are removed from both nodes' local stores.
- There is **no maximum buffer size** — messages accumulate until they either succeed or exhaust retries and are parked.
- Retry intervals are **fixed** (not exponential backoff). The fixed interval is sufficient for the expected use cases.

### 1.4 Deployment Behavior

- When central deploys a new configuration to a site instance, the site **applies it immediately** upon receipt — no local operator confirmation is required.
- If a site loses connectivity to central, it **continues operating** with its last received deployed configuration.
- The site reports back to central whether the deployment was successfully applied.
- **Pre-deployment validation**: Before any deployment is sent to a site, the central cluster performs comprehensive validation, including flattening the configuration, test-compiling all scripts, verifying alarm trigger references, verifying script trigger references, and checking data connection binding completeness (see Section 3.11).

### 1.5 System-Wide Artifact Deployment

- Changes to shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration are **not automatically propagated** to sites.
- Deployment of system-wide artifacts requires **explicit action** by a user with the **Deployment** role.
- Artifacts can be deployed to **all sites at once** or to an **individual site** (per-site deployment).
- The Design role manages the definitions; the Deployment role triggers deployment to sites. A user may hold both roles.

## 2. Data Storage & Data Flow

### 2.1 Central Databases (MS SQL)

- **Configuration Database**: A dedicated database for system-specific configuration data (e.g., templates, site definitions, instance configurations, system settings).
- **Machine Data Database**: A separate database for collected machine data (e.g., telemetry, measurements, events).

### 2.2 Communication: Central ↔ Site

- Two transport layers are used for central-site communication:
  - **Akka.NET ClusterClient/ClusterClientReceptionist**: Handles **command/control** messaging — deployments, instance lifecycle commands, subscribe/unsubscribe handshake, debug snapshots, health reports, remote queries, and integration routing. Provides automatic failover between contact points.
  - **gRPC server-streaming (site→central)**: Handles **real-time data streaming** — attribute value updates and alarm state changes. Each site node hosts a **SiteStreamGrpcServer** on a dedicated HTTP/2 port (Kestrel, default port 8083). Central creates per-site **SiteStreamGrpcClient** instances to subscribe to site streams. gRPC provides HTTP/2 flow control and per-stream backpressure that ClusterClient lacks.
- **Site addressing**: Site Akka base addresses (NodeA and NodeB) and gRPC endpoints (GrpcNodeAAddress and GrpcNodeBAddress) are stored in the **Sites database table** and configured via the Central UI or CLI. Central creates a ClusterClient per site using both Akka addresses as contact points, and per-site gRPC clients using the gRPC addresses.
- **Central contact points**: Sites configure **multiple central contact points** (both central node addresses) for redundancy. ClusterClient handles failover between central nodes automatically.
- **Central as integration hub**: Central brokers requests between external systems and sites. For example, a recipe manager sends a recipe to central, which routes it to the appropriate site. MES requests machine values from central, which routes the request to the site and returns the response.
- **Real-time data streaming** is not continuous for all machine data. The only real-time stream is an **on-demand debug view** — an engineer in the central UI can open a live view of a specific instance's tag values and alarm states for troubleshooting purposes. This is session-based and temporary. The debug view subscribes via gRPC to the site's SiteStreamManager filtered by instance (see Section 8.1).

### 2.3 Site-Level Storage & Interface

- Sites have **no user interface** — they are headless collectors, forwarders, and script executors.
- Sites require local storage for: the current deployed (flattened) configurations, deployed scripts, shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration.
- After artifact deployment, sites are **fully self-contained** — all runtime configuration is read from local SQLite. Sites do **not** access the central configuration database at runtime.
- Store-and-forward buffers are persisted to a **local SQLite database on each node** and replicated between nodes via application-level replication (see 1.3).

### 2.4 Data Connection Protocols

- The system supports **OPC UA** and **LmxProxy** (a gRPC-based custom protocol with an existing client SDK).
- Both protocols implement a **common interface** supporting: connect, subscribe to tag paths, receive value updates, and write values.
- Additional protocols can be added by implementing the common interface.
- The Data Connection Layer is a **clean data pipe** — it publishes tag value updates to Instance Actors but performs no evaluation of triggers or alarm conditions.
- **Initial attribute quality**: Attributes bound to a data connection start with **uncertain** quality when the Instance Actor initializes. The quality remains uncertain until the first value update is received from the Data Connection Layer. This distinguishes "never received a value" from "received a known-good value" or "connection lost" (bad quality).
- Data connections support optional **backup endpoints** with automatic failover after a configurable retry count. On failover, all subscriptions are transparently re-created on the new endpoint.

### 2.5 Scale

- Approximately **10 sites**.
- **50–500 machines per site**.
- **25–75 live data point tags per machine**.

## 3. Template & Machine Modeling

### 3.1 Template Structure

- Machines are modeled as **instances of templates**.
- Templates define a set of **attributes**.
- Each attribute has a **lock flag** that controls whether it can be overridden downstream.

### 3.2 Attribute Definition

Each attribute carries the following metadata:

- **Name**: Identifier for the attribute.
- **Value**: The default or configured value. May be empty if intended to be set at the instance level.
- **Data Type**: The value's type. Fixed set: Boolean, Integer, Float, String.
- **Lock Flag**: Controls whether the attribute can be overridden downstream.
- **Description**: Human-readable explanation of the attribute's purpose.
- **Data Source Reference** *(optional)*: A **relative path** within a data connection (e.g., `/Motor/Speed`). The template defines *what* to read — the path relative to a data connection. The template does **not** specify which data connection to use; that is an instance-level concern (see Section 3.3). Attributes without a data source reference are static configuration values.

### 3.3 Data Connections

- **Data connections** are reusable, named resources defined centrally and then **assigned to specific sites** (e.g., an OPC server, a PLC endpoint).
- Data connection definitions are deployed to sites as part of **artifact deployment** (see Section 1.5) and stored in local SQLite.
- A data connection encapsulates the details needed to communicate with a data source (protocol, address, credentials, etc.).
- Attributes with a data source reference must be **bound to a data connection at instance creation** — the template defines *what* to read (the relative path), and the instance specifies *where* to read it from (the data connection assigned to the site).
- **Binding is per-attribute**: Each attribute with a data source reference individually selects its data connection. Different attributes on the same instance may use different data connections. The Central UI supports bulk assignment (selecting multiple attributes and assigning a data connection to all of them at once) to reduce tedium.
- Templates do **not** specify a default connection. The connection binding is an instance-level concern.
- The flattened configuration sent to a site resolves connection references into concrete connection details paired with attribute relative paths.
- Data connection names are **not** standardized across sites — different sites may have different data connection names for equivalent devices.

### 3.4 Alarm Definitions

Alarms are **first-class template members** alongside attributes and scripts, following the same **inheritance, override, and lock rules**. Each alarm has:

- **Name**: Identifier for the alarm.
- **Description**: Human-readable explanation of the alarm condition.
- **Priority Level**: Numeric value from 0–1000.
- **Lock Flag**: Controls whether the alarm can be overridden downstream.
- **Trigger Definition**: One of the following trigger types:
  - **Value Match**: Triggers when a monitored attribute equals a predefined value.
  - **Range Violation**: Triggers when a monitored attribute value falls outside an allowed range.
  - **Rate of Change**: Triggers when a monitored attribute value changes faster than a defined threshold.
- **On-Trigger Script** *(optional)*: A script to execute when the alarm triggers. The alarm on-trigger script executes in the context of the instance and can call instance scripts, but instance scripts **cannot** call alarm on-trigger scripts. The call direction is one-way.

### 3.4.1 Alarm State

- Alarm state (active/normal) is **managed at the site level** per instance, held **in memory** by the Alarm Actor.
- When the alarm condition clears, the alarm **automatically returns to normal state** — no acknowledgment workflow is required.
- Alarm state is **not persisted** — on restart, alarm states are re-evaluated from incoming values.
- Alarm state changes are published to the site-wide Akka stream as `[InstanceUniqueName].[AlarmName]`, alarm state (active/normal), priority, timestamp.

### 3.5 Template Relationships

Templates participate in two distinct relationship types:

- **Inheritance (is-a)**: A child template extends a parent template. The child inherits all attributes, alarms, scripts, and composed feature modules from the parent. The child can:
  - Override the **values** of non-locked inherited attributes, alarms, and scripts.
  - **Add** new attributes, alarms, or scripts not present in the parent.
  - **Not** remove attributes, alarms, or scripts defined by the parent.
- **Composition (has-a)**: A template can nest an instance of another template as a **feature module** (e.g., embedding a RecipeSystem module inside a base machine template). Feature modules can themselves compose other feature modules **recursively**.
- **Naming collisions**: If a template composes two feature modules that each define an attribute, alarm, or script with the same name, this is a **design-time error**. The system must detect and report the collision, and the template cannot be saved until the conflict is resolved.
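The cross-module collision check described above is mechanically simple. Below is a minimal illustrative sketch in Python (the system itself is C#; this is not the production implementation), using hypothetical module and member names:

```python
def find_collisions(composed_modules):
    """Given {module_name: set_of_member_names} for a template's composed
    feature modules, return the member names (attributes, alarms, or scripts)
    defined by more than one module -- a design-time error."""
    seen = {}          # member name -> first module that defined it
    collisions = set()
    for module, members in composed_modules.items():
        for member in members:
            if member in seen and seen[member] != module:
                collisions.add(member)
            else:
                seen[member] = module
    return collisions

# A template composing two modules that both define "Speed" cannot be saved:
modules = {
    "RecipeSystem": {"ActiveRecipe", "Speed"},   # hypothetical example names
    "MotorControl": {"Speed", "Torque"},
}
print(find_collisions(modules))  # {'Speed'}
```

In practice the same check runs both at template save time and again during pre-deployment validation on the flattened configuration (Section 3.11), since flattening merges inherited and composed members into one namespace.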
### 3.6 Locking

- Locking applies to **attributes, alarms, and scripts** uniformly.
- Any of these can be **locked** at the level where it is defined or overridden.
- A locked attribute **cannot** be overridden by any downstream level (child templates, composing templates, or instances).
- An unlocked attribute **can** be overridden by any downstream level.
- **Intermediate locking**: Any level in the chain can lock an attribute that was unlocked upstream. Once locked, it remains locked for all levels below — a downstream level **cannot** unlock an attribute locked above it.

### 3.6.1 Attribute Resolution Order

Attributes are resolved from most-specific to least-specific. The first value encountered wins:

1. **Instance** (site-deployed machine)
2. **Child Template** (most derived first, walking up the inheritance chain)
3. **Composing Template** (the template that embeds a feature module can override the module's attributes)
4. **Composed Module** (the original feature module definition, recursively resolved if modules nest other modules)

At any level, an override is only permitted if the attribute has **not been locked** at a higher-priority level.

### 3.7 Override Scope

- **Inheritance**: Child templates can override non-locked attributes from their parent, including attributes originating from composed feature modules.
- **Composition**: A template that composes a feature module can override non-locked attributes within that module.
- Overrides can "pierce" into composed modules — a child template can override attributes inside a feature module it inherited from its parent.

### 3.8 Instance Rules

- An instance is a deployed occurrence of a template at a site.
- Instances **can** override the values of non-locked attributes.
- Instances **cannot** add new attributes.
- Instances **cannot** remove attributes.
- The instance's structure (which attributes exist, which feature modules are composed) is strictly defined by its template.
- Each instance is **assigned to an area** within its site (see 3.10).

### 3.8.1 Instance Lifecycle

- Instances can be in one of two states: **enabled** or **disabled**.
- **Enabled**: The instance is active at the site — data subscriptions, script triggers, and alarm evaluation are all running.
- **Disabled**: The site **stops** script triggers, data subscriptions (no live data collection), and alarm evaluation. The deployed configuration is **retained** on the site so the instance can be re-enabled without redeployment. Store-and-forward messages for a disabled instance **continue to drain** (deliver pending messages).
- **Deletion**: Instances can be deleted. Deletion removes the running configuration from the site, stops subscriptions, and destroys the Instance Actor and its children. Store-and-forward messages are **not** cleared on deletion — they continue to be delivered or can be managed (retried/discarded) via parked message management. If the site is unreachable when a delete is triggered, the deletion **fails** (same behavior as a failed deployment). The central side does not mark the instance as deleted until the site confirms.
- Templates **cannot** be deleted if any instances or child templates reference them. The user must remove all references first.

### 3.9 Template Deployment & Change Propagation

- Template changes are **not** automatically propagated to deployed instances.
- The system maintains two views of each instance:
  - **Deployed Configuration**: The currently active configuration on the instance, as it was last explicitly deployed.
  - **Template-Derived Configuration**: The configuration the instance *would* have based on the current state of its template (including resolved inheritance, composition, and overrides).
- Deployment is performed at the **individual instance level** — an engineer explicitly commands the system to update a specific instance.
- The system must be able to **show differences** between the deployed configuration and the current template-derived configuration, allowing engineers to see what would change before deploying.
- **No rollback** support is required. The system only needs to track the current deployed state, not a history of prior deployments.
- **Concurrent editing**: Template editing uses a **last-write-wins** model. No pessimistic locking or optimistic concurrency conflict detection is required.

### 3.10 Areas

- Areas are **predefined hierarchical groupings** associated with a site, stored in the configuration database.
- Areas support **parent-child relationships** (e.g., Plant → Building → Production Line → Cell).
- Each instance is assigned to an area within its site.
- Areas are used for **filtering and finding instances** in the central UI.
- Area definitions are managed by users with the **Design** role.

### 3.11 Pre-Deployment Validation

Before any deployment is sent to a site, the central cluster performs **comprehensive validation**. Validation covers:

- **Flattening**: The full template hierarchy is resolved and flattened successfully.
- **Naming collision detection**: No duplicate attribute, alarm, or script names exist in the flattened configuration.
- **Script compilation**: All instance scripts and alarm on-trigger scripts are test-compiled and must compile without errors.
- **Alarm trigger references**: Alarm trigger definitions reference attributes that exist in the flattened configuration.
- **Script trigger references**: Script triggers (value change, conditional) reference attributes that exist in the flattened configuration.
- **Data connection binding completeness**: Every attribute with a data source reference has a data connection binding assigned on the instance, and the bound data connection name exists as a defined connection at the instance's site.
- **Exception**: Validation does **not** verify that data source relative paths resolve to real tags on physical devices — that is a runtime concern that can only be determined at the site.

Validation is also available **on demand in the Central UI** for Design users during template authoring, providing early feedback without requiring a deployment attempt.

For **shared scripts**, pre-compilation validation is performed before deployment to sites. Since shared scripts have no instance context, validation is limited to C# syntax and structural correctness.

## 4. Scripting

### 4.1 Script Definitions

- Scripts are **C#** and are defined at the **template level** as first-class template members.
- Scripts follow the same **inheritance, override, and lock rules** as attributes. A parent template can define a script, a child template can override it (if not locked), and any level can lock a script to prevent downstream changes.
- Scripts are deployed to sites as part of the flattened instance configuration.
- Scripts are **compiled at the site** when a deployment is received. Pre-compilation validation occurs at central before deployment (see Section 3.11), but the site performs the actual compilation for execution.
- Scripts can optionally define **input parameters** (name and data type per parameter). Scripts without parameter definitions accept no arguments.
- Scripts can optionally define a **return value definition** (field names and data types). Return values support **single objects** and **lists of objects**. Scripts without a return definition return void.
- Return values are used when scripts are called explicitly by other scripts (via `Instance.CallScript()` or `Scripts.CallShared()`) or by the Inbound API (via `Route.To().Call()`). When invoked by a trigger (interval, value change, conditional, alarm), any return value is discarded.

### 4.2 Script Triggers

Scripts can be triggered by:

- **Interval**: Execute on a recurring time schedule.
- **Value Change**: Execute when a specific instance attribute value changes.
- **Conditional**: Execute when an instance attribute value equals or does not equal a given value.

Scripts have an optional **minimum time between runs** setting. If a trigger fires before the minimum interval has elapsed since the last execution, the invocation is skipped.

### 4.3 Script Error Handling

- If a script fails (unhandled exception, timeout, etc.), the failure is **logged locally** at the site.
- The script is **not disabled** — it remains active and will fire on the next qualifying trigger event.
- Script failures are **not reported to central**. Diagnostics are local only.
- For external system call failures within scripts, store-and-forward handling (Section 5.3) applies independently of script error handling.

### 4.4 Script Capabilities

Scripts executing on a site for a given instance can:

- **Read** attribute values on that instance (live data points and static config).
- **Write** attribute values on that instance. For attributes with a data source reference, the write goes to the Data Connection Layer, which writes to the physical device; the in-memory value updates when the device confirms the new value via the existing subscription. For static attributes, the write updates the in-memory value and **persists the override to local SQLite** — the value survives restart and failover. Persisted overrides are reset when the instance is redeployed.
- **Call other scripts** on that instance via `Instance.CallScript("scriptName", params)`. Calls use the Akka ask pattern and return the called script's return value. Script-to-script calls support concurrent execution.
- **Call shared scripts** via `Scripts.CallShared("scriptName", params)`. Shared scripts execute **inline** in the calling Script Actor's context — they are compiled code libraries, not separate actors.
- **Call external system API methods** in two modes: `ExternalSystem.Call()` for synchronous request/response, or `ExternalSystem.CachedCall()` for fire-and-forget with store-and-forward on transient failure (see Section 5).
- **Send notifications** (see Section 6).
- **Access databases** by requesting an MS SQL client connection by name (see Section 5.5).

Scripts **cannot** access other instances' attributes or scripts.

### 4.4.1 Script Call Recursion Limit

- Script-to-script calls (via `Instance.CallScript` and `Scripts.CallShared`) enforce a **maximum recursion depth** to prevent infinite loops.
- The default maximum depth is a reasonable limit (e.g., 10 levels).
- The current call depth is tracked and incremented with each nested call. If the limit is reached, the call fails with an error logged to the site event log.
- This applies to all script call chains, including alarm on-trigger scripts calling instance scripts.

### 4.5 Shared Scripts

- Shared scripts are **not associated with any template** — they are a **system-wide library** of reusable C# scripts.
- Shared scripts can optionally define **input parameters** and **return value definitions**, following the same rules as template-level scripts.
- Managed by users with the **Design** role.
- Deployed to **all sites** for use by any instance script (deployment requires explicit action by a user with the Deployment role).
- Shared scripts execute **inline** in the calling Script Actor's context as compiled code. They are not separate actors. This avoids serialization bottlenecks and messaging overhead.
- Shared scripts are **not available on the central cluster** — Inbound API scripts cannot call them directly. To execute shared script logic, route to a site instance via `Route.To().Call()`.

### 4.6 Alarm On-Trigger Scripts

- Alarm on-trigger scripts are defined as part of the alarm definition and execute when the alarm activates.
- They execute in the Alarm Actor's context (via a short-lived Alarm Execution Actor), similar to how shared scripts execute inline.
- Alarm on-trigger scripts **can** call instance scripts via `Instance.CallScript()`, which sends an ask message to the appropriate sibling Script Actor.
- Instance scripts **cannot** call alarm on-trigger scripts — the call direction is one-way.
- The recursion depth limit applies to alarm-to-instance script call chains.

## 5. External System Integrations

### 5.1 External System Definitions

- External systems are **predefined contracts** created by users with the **Design** role.
- Each definition includes:
  - **Connection details**: Endpoint URL, authentication, protocol information.
  - **Method definitions**: Available API methods with defined parameters and return types.
- Definitions are deployed **uniformly to all sites** — no per-site connection detail overrides.
- Deployment of definition changes requires **explicit action** by a user with the Deployment role.
- At the site, external system definitions are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 5.2 Site-to-External-System Communication

- Sites communicate with external systems **directly** (not routed through central).
- Scripts invoke external system methods by referencing the predefined definitions.

### 5.3 Store-and-Forward for External Calls

- If an external system is unavailable when a script invokes a method, the message is **buffered locally at the site**.
- Retry is performed **per message** — individual failed messages retry independently.
- Each external system definition includes configurable retry settings:
  - **Max retry count**: Maximum number of retry attempts before giving up.
  - **Time between retries**: Fixed interval between retry attempts (no exponential backoff).
- After max retries are exhausted, the message is **parked** (dead-lettered) for manual review.
- There is **no maximum buffer size** — messages accumulate until delivery succeeds or retries are exhausted.

### 5.4 Parked Message Management

- Parked messages are **stored at the site** where they originated.
- The **central UI** can **query sites** for parked messages and manage them remotely.
- Operators can **retry** or **discard** parked messages from the central UI.
- Parked message management covers **external system calls**, **notifications**, and **cached database writes**.

### 5.5 Database Connections

- Database connections are **predefined, named resources** created by users with the **Design** role.
- Each definition includes the connection details needed to connect to an MS SQL database (server, database name, credentials, etc.).
- Each definition includes configurable retry settings (same pattern as external systems): **max retry count** and **time between retries** (fixed interval).
- Definitions are deployed **uniformly to all sites** — no per-site overrides.
- Deployment of definition changes requires **explicit action** by a user with the Deployment role.
- At the site, database connection definitions are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 5.6 Database Access Modes

Scripts can interact with databases in two modes:

- **Real-time (synchronous)**: Scripts request a **raw MS SQL client connection by name** (e.g., `Database.Connection("MES_DB")`), giving script authors full ADO.NET-level control for immediate queries and updates.
- **Cached write (store-and-forward)**: Scripts submit a write operation for deferred, reliable delivery. The cached entry stores the **database connection name**, the **SQL statement to execute**, and **parameter values**. If the database is unavailable, the write is buffered locally at the site and retried per the connection's retry settings.
  After max retries are exhausted, the write is **parked** for manual review (managed via central UI alongside other parked messages).

## 6. Notifications

### 6.1 Notification Lists

- Notification lists are **system-wide**, managed by users with the **Design** role.
- Each list has a **name** and contains one or more **recipients**.
- Each recipient has a **name** and an **email address**.
- Notification lists are deployed to **all sites** (deployment requires explicit action by a user with the Deployment role).
- At the site, notification lists and recipients are read from **local SQLite** (populated by artifact deployment), not from the central config DB.

### 6.2 Email Support

- The system has **predefined support for sending email** as the notification delivery mechanism.
- Email server configuration (SMTP settings) is defined centrally and deployed to all sites as part of **artifact deployment** (see Section 1.5). Sites read SMTP configuration from **local SQLite**.

### 6.3 Script API

- Scripts send notifications using a simplified API: `Notify.To("list name").Send("subject", "message")`
- This API is available to instance scripts, alarm on-trigger scripts, and shared scripts.

### 6.4 Store-and-Forward for Notifications

- If the email server is unavailable, notifications are **buffered locally at the site**.
- Follows the same retry pattern as external system calls: configurable **max retry count** and **time between retries** (fixed interval).
- After max retries are exhausted, the notification is **parked** for manual review (managed via central UI alongside external system parked messages).
- There is **no maximum buffer size** for notification messages.

## 7. Inbound API (Central)

### 7.1 Purpose

The system exposes a **web API on the central cluster** for external systems to call into the SCADA system.
This is the counterpart to the outbound External System Integrations (Section 5) — where Section 5 defines how the system calls out, this section defines how external systems call in.

### 7.2 API Key Management

- API keys are stored in the **configuration database**.
- Each API key has a **name/label** (for identification), the **key value**, and an **enabled/disabled** flag.
- API keys are managed by users with the **Admin** role.

### 7.3 Authentication

- Inbound API requests are authenticated via **API key** (not LDAP/AD).
- The API key must be included with each request.
- Invalid or disabled keys are rejected.

### 7.4 API Method Definitions

- API methods are **predefined** and managed by users with the **Design** role.
- Each method definition includes:
  - **Method name**: Unique identifier for the endpoint.
  - **Approved API keys**: List of API keys authorized to call this method.
  - **Parameter definitions**: Name and data type for each input parameter.
  - **Return value definition**: Data type and structure of the response. Supports **single objects** and **lists of objects**.
  - **Timeout**: Configurable per method. Maximum execution time including routed calls to sites.
- The implementation of each method is a **C# script stored inline** in the method definition. It executes on the central cluster. No template inheritance — API scripts are standalone.
- API scripts can route calls to any instance at any site via `Route.To("instanceCode").Call("scriptName", parameters)`, read/write attributes in batch, and access databases directly.
- API scripts **cannot** call shared scripts directly (shared scripts are site-only). To invoke site logic, use `Route.To().Call()`.

### 7.5 Availability

- The inbound API is hosted **only on the central cluster** (active node).
- On central failover, the API becomes available on the new active node.

## 8. Central UI

The central cluster hosts a **configuration and management UI** (no live machine data visualization, except on-demand debug views). The UI supports the following workflows:

- **Template Authoring**: Create, edit, and manage templates, including hierarchy (inheritance) and composition (feature modules). Author and manage scripts within templates. **Design-time validation** is available on demand to check flattening, naming collisions, and script compilation without deploying.
- **Shared Script Management**: Create, edit, and manage the system-wide shared script library.
- **Notification List Management**: Create, edit, and manage notification lists and recipients.
- **External System Management**: Define external system contracts (connection details, API method definitions).
- **Database Connection Management**: Define named database connections for script use.
- **Inbound API Management**: Manage API keys (create, enable/disable, delete). Define API methods (name, parameters, return values, approved keys, implementation script). *(Admin role for keys, Design role for methods.)*
- **Instance Management**: Create instances from templates, bind data connections (per-attribute, with **bulk assignment** UI for selecting multiple attributes and assigning a data connection at once), set instance-level attribute overrides, assign instances to areas. **Disable** or **delete** instances.
- **Site & Data Connection Management**: Define sites (including optional NodeAAddress and NodeBAddress fields for Akka remoting paths, and optional GrpcNodeAAddress and GrpcNodeBAddress fields for gRPC streaming endpoints), manage data connections and assign them to sites.
- **Area Management**: Define hierarchical area structures per site for organizing instances.
- **Deployment**: View diffs between deployed and current template-derived configurations, deploy updates to individual instances. Filter instances by area.
Pre-deployment validation runs automatically before any deployment is sent.
- **System-Wide Artifact Deployment**: Explicitly deploy shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration to all sites or to an individual site (requires the Deployment role). Per-site deployment is available via the Sites admin page.
- **Deployment Status Monitoring**: Track whether deployments were successfully applied at site level.
- **Debug View**: On-demand real-time view of a specific instance's tag values and alarm states for troubleshooting (see 8.1).
- **Parked Message Management**: Query sites for parked messages (external system calls, notifications, and cached database writes), retry or discard them.
- **Health Monitoring Dashboard**: View site cluster health, node status, data connection health, script error rates, alarm evaluation errors, and store-and-forward buffer depths (see Section 11).
- **Site Event Log Viewer**: Query and view operational event logs from site clusters (see Section 12).

### 8.1 Debug View

- **Subscribe-on-demand**: When an engineer opens a debug view for an instance, central opens a **gRPC server-streaming subscription** to the site's `SiteStreamGrpcServer` for the instance, then requests a **snapshot** of all current attribute values and alarm states via ClusterClient. The gRPC stream delivers subsequent attribute value and alarm state changes directly from the site's `SiteStreamManager`.
- Attribute value stream messages are structured as: `[InstanceUniqueName].[AttributePath].[AttributeName]`, attribute value, attribute quality, attribute change timestamp.
- Alarm state stream messages are structured as: `[InstanceUniqueName].[AlarmName]`, alarm state (active/normal), priority, timestamp.
- The stream continues until the engineer **closes the debug view**, at which point central unsubscribes and the site stops streaming.
- No attribute/alarm selection — the debug view always shows all tag values and alarm states for the instance.
- No special concurrency limits are required.

## 9. Security & Access Control

### 9.1 Authentication

- **UI users** authenticate via **username/password** validated directly against **LDAP/Active Directory**. Sessions are maintained via JWT tokens.
- **External system API callers** authenticate via **API key** (see Section 7).

### 9.2 Authorization

- Authorization is **role-based**, with roles assigned by **LDAP group membership**.
- Roles are **independent** — they can be mixed and matched per user (via group membership). There is no implied hierarchy between roles.
- A user may hold multiple roles simultaneously (e.g., both Design and Deployment) by being a member of the corresponding LDAP groups.
- Inbound API authorization is per-method, based on **approved API key lists** (see Section 7.4).

### 9.3 Roles

- **Admin**: System-wide permission to manage sites, data connections, LDAP group-to-role mappings, API keys, and system-level configuration.
- **Design**: System-wide permission to author and edit templates, scripts, shared scripts, external system definitions, notification lists, inbound API method definitions, and area definitions.
- **Deployment**: Permission to manage instances (create, set overrides, bind connections, disable, delete) and deploy configurations to sites. Also required to trigger system-wide artifact deployment. Can be scoped **per site**.

### 9.4 Role Scoping

- Admin is always **system-wide**.
- Design is always **system-wide**.
- Deployment can be **system-wide** or **site-scoped**, controlled by LDAP group membership (e.g., `Deploy-SiteA`, `Deploy-SiteB`, or `Deploy-All`).

## 10. Audit Logging

Audit logging is implemented as part of the **Configuration Database** component via the `IAuditService` interface.
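The `IAuditService` contract itself is C# in the real system; as a language-neutral, non-normative sketch of what an append-only audit write might capture (the class shape and field names here are assumptions, anticipating the detail level specified in Section 10.3):

```python
# Hypothetical, language-neutral sketch of an append-only audit service
# (the real contract is the C# IAuditService; this is illustrative only).
# Entries record who/what/when plus the JSON-serialized after-state;
# they are never modified or deleted.

import json
from datetime import datetime, timezone

class AuditService:
    def __init__(self):
        self._entries = []  # append-only; backed by MS SQL in the real system

    def record(self, user, action, entity_type, entity_id, entity_name, after_state):
        """Append one entry per save operation; after_state is None for deletes."""
        self._entries.append({
            "who": user,
            "action": action,
            "entityType": entity_type,
            "entityId": entity_id,
            "entityName": entity_name,
            "whenUtc": datetime.now(timezone.utc).isoformat(),  # UTC per Section 14.1
            "state": json.dumps(after_state) if after_state is not None else None,
        })

    def entries_for(self, entity_type, entity_id):
        """All entries for one entity, in insertion (chronological) order."""
        return [e for e in self._entries
                if e["entityType"] == entity_type and e["entityId"] == entity_id]
```

Note that a delete is just another appended entry with a null state, which is what lets change history be reconstructed from the entry sequence alone.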
### 10.1 Storage

- Audit logs are stored in the **configuration MS SQL database** alongside system config data, enabling direct querying.
- Entries are **append-only** — never modified or deleted. There is no retention policy; entries are retained indefinitely.

### 10.2 Scope

All system-modifying actions are logged, including:

- **Template changes**: Create, edit, delete templates.
- **Script changes**: Template script and shared script create, edit, delete.
- **Alarm changes**: Create, edit, delete alarm definitions.
- **Instance changes**: Create, override values, bind connections, area assignment, disable, enable, delete.
- **Deployments**: Who deployed what to which instance, and the result (success/failure).
- **System-wide artifact deployments**: Who deployed shared scripts / external system definitions / DB connections / data connections / notification lists / SMTP config, to which site(s), and the result.
- **External system definition changes**: Create, edit, delete.
- **Database connection changes**: Create, edit, delete.
- **Notification list changes**: Create, edit, delete lists and recipients.
- **Inbound API changes**: API key create, enable/disable, delete. API method create, edit, delete.
- **Area changes**: Create, edit, delete area definitions.
- **Site & data connection changes**: Create, edit, delete.
- **Security/admin changes**: Role mapping changes, site permission changes.

### 10.3 Detail Level

- Each audit log entry records the **state of the entity after the change**, serialized as JSON. Only the after-state is stored — change history is reconstructed by comparing consecutive entries for the same entity at query time.
- Each entry includes: **who** (authenticated user), **what** (action, entity type, entity ID, entity name), **when** (timestamp), and **state** (JSON after-state, null for deletes).
- **One entry per save operation** — when a user edits a template and changes multiple attributes in one save, a single entry captures the full entity state.
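To make the after-state model in Section 10.3 concrete, a query-time diff between two consecutive audit entries for the same entity could look like the following sketch (the function and entry shape are assumptions for illustration, not the system's actual query implementation):

```python
# Hypothetical sketch of query-time change reconstruction (Section 10.3):
# since only the JSON after-state is stored per entry, the "diff" for an
# entity is computed by comparing consecutive entries at read time.

import json

def diff_states(earlier_json, later_json):
    """Return {field: (old, new)} for every field that changed between two
    consecutive after-state snapshots of the same entity. A null snapshot
    (entity not yet created, or deleted) is treated as an empty state."""
    earlier = json.loads(earlier_json) if earlier_json else {}
    later = json.loads(later_json) if later_json else {}
    changed = {}
    for key in earlier.keys() | later.keys():
        old, new = earlier.get(key), later.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed
```

Because a delete stores a null state, diffing the last snapshot against `None` reports every field as removed, which keeps deletes representable without any extra entry type.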
### 10.4 Transactional Guarantee

- Audit entries are written **synchronously** within the same database transaction as the change (via the unit-of-work pattern). If the change succeeds, the audit entry is guaranteed to be recorded. If the change rolls back, the audit entry rolls back too.

## 11. Health Monitoring

### 11.1 Monitored Metrics

The central cluster monitors the health of each site cluster, including:

- **Site cluster online/offline status**: Whether the site is reachable.
- **Active vs. standby node status**: Which node is active and which is standby.
- **Data connection health**: Connected/disconnected status per data connection at the site.
- **Script error rates**: Frequency of script failures at the site.
- **Alarm evaluation errors**: Frequency of alarm evaluation failures at the site.
- **Store-and-forward buffer depth**: Number of messages currently queued (broken down by external system calls, notifications, and cached database writes).

### 11.2 Reporting

- Site clusters **report health metrics to central** periodically.
- Health status is **visible in the central UI** — no automated alerting/notifications for now.

## 12. Site-Level Event Logging

### 12.1 Events Logged

Sites log operational events locally, including:

- **Script executions**: Start, complete, error (with error details).
- **Alarm events**: Alarm activated, alarm cleared (which alarm, which instance, when). Alarm evaluation errors.
- **Deployment applications**: Configuration received from central, applied successfully or failed. Script compilation results.
- **Data connection status changes**: Connected, disconnected, reconnected per connection.
- **Store-and-forward activity**: Message queued, delivered, retried, parked.
- **Instance lifecycle**: Instance enabled, disabled, deleted.

### 12.2 Storage

- Event logs are stored in **local SQLite** on each site node.
- **Retention policy**: 30 days. Events older than 30 days are automatically purged.
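As a non-normative illustration of the Section 12.2 retention rule, a 30-day purge against a local SQLite event log could be as simple as the following sketch (the `event_log` table and column names are assumptions; timestamps are UTC per Section 14.1):

```python
# Hypothetical sketch of the 30-day event-log purge (Section 12.2).
# The event_log schema is illustrative; timestamps are stored as UTC
# ISO-8601 strings, consistent with the system-wide UTC convention,
# so lexicographic comparison matches chronological order.

import sqlite3
from datetime import datetime, timedelta, timezone

def purge_old_events(conn, retention_days=30):
    """Delete events older than the retention window; returns rows removed."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=retention_days)).isoformat()
    cur = conn.execute("DELETE FROM event_log WHERE timestamp_utc < ?", (cutoff,))
    conn.commit()
    return cur.rowcount

# Demo against an in-memory database: one stale event, one recent event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (id INTEGER PRIMARY KEY, timestamp_utc TEXT, event TEXT)")
now = datetime.now(timezone.utc)
conn.execute("INSERT INTO event_log (timestamp_utc, event) VALUES (?, ?)",
             ((now - timedelta(days=45)).isoformat(), "alarm activated"))
conn.execute("INSERT INTO event_log (timestamp_utc, event) VALUES (?, ?)",
             ((now - timedelta(days=1)).isoformat(), "script complete"))
removed = purge_old_events(conn)  # removes only the 45-day-old event
```

Storing timestamps in a sortable UTC text format keeps the purge a single indexed range delete, with no timezone arithmetic inside the query.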
### 12.3 Central Access

- The central UI can **query site event logs remotely**, following the same pattern as parked message management — central requests data from the site over Akka.NET remoting.

## 13. Management Service & CLI

### 13.1 Management Service

- The central cluster exposes a **ManagementActor** that provides programmatic access to all administrative operations — the same operations available through the Central UI.
- The ManagementActor registers with Akka.NET **ClusterClientReceptionist** for cross-cluster access, and is also exposed via an HTTP Management API endpoint (`POST /management`) that handles Basic Auth, LDAP authentication, and role resolution — enabling external tools like the CLI to interact without Akka.NET dependencies.
- The ManagementActor enforces the **same role-based authorization** as the Central UI. Every incoming message carries the authenticated user's identity and roles.
- All mutating operations performed through the Management Service are **audit logged** via IAuditService, identical to operations performed through the Central UI.
- The ManagementActor runs on **every central node** (stateless). For HTTP API access, any central node can handle any request without sticky sessions.

### 13.2 CLI

- The system provides a standalone **command-line tool** (`scadalink`) for scripting and automating administrative operations.
- The CLI connects to the Central Host's HTTP Management API (`POST /management`), sending commands as JSON with **HTTP Basic Auth** credentials. The server authenticates against **LDAP/AD**, resolves roles, and dispatches commands to the ManagementActor.
- CLI commands mirror all Management Service operations: templates, instances, sites, data connections, deployments, external systems, notifications, security (API keys and role mappings), audit log queries, and health status.
- Output is **JSON by default** (machine-readable, suitable for scripting) with an optional `--format table` flag for human-readable tabular output.
- Configuration is resolved from command-line options, **environment variables** (`SCADALINK_MANAGEMENT_URL`, `SCADALINK_FORMAT`), or a **configuration file** (`~/.scadalink/config.json`).
- The CLI is a separate executable from the Host binary — it is deployed on any machine with HTTP access to a central node.

## 14. General Conventions

### 14.1 Timestamps

- All timestamps throughout the system are stored, transmitted, and processed in **UTC**.
- This applies to: attribute value timestamps, alarm state change timestamps, audit log entries, event log entries, deployment records, health reports, store-and-forward message timestamps, and all inter-node messages.
- Local time conversion for display is a **Central UI concern only** — no other component performs timezone conversion.

---

*All initial high-level requirements have been captured. This document will continue to be updated as the design evolves.*