scadalink-design/docs/requirements/HighLevelReqs.md


SCADA System - High Level Requirements

1. Deployment Architecture

  • Site Clusters: 2-node failover clusters deployed at each site, running on Windows.
  • Central Cluster: A single 2-node failover cluster serving as the central hub.
  • Communication Topology: Hub-and-spoke. Central cluster communicates with each site cluster. Site clusters do not communicate with one another.

1.1 Central vs. Site Responsibilities

  • Central cluster is the single source of truth for all template authoring, configuration, and deployment decisions.
  • Site clusters receive flattened configurations — fully resolved attribute sets with no template structure. Sites do not need to understand templates, inheritance, or composition.
  • Sites do not support local/emergency configuration overrides. All configuration changes originate from central.

1.2 Failover

  • Failover is managed at the application level using Akka.NET (not Windows Server Failover Clustering).
  • Each cluster (central and site) runs an active/standby pair where Akka.NET manages node roles and failover detection.
  • Site failover: The standby node takes over data collection and script execution seamlessly, including responsibility for the store-and-forward buffers. The Site Runtime Deployment Manager singleton is restarted on the new active node, which reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy.
  • Central failover: The standby node takes over central responsibilities. Deployments that are in-progress during a failover are treated as failed and must be re-initiated by the engineer.

1.3 Store-and-Forward Persistence (Site Clusters Only)

  • Store-and-forward applies only at site clusters — the central cluster does not buffer messages. If a site is unreachable, operations from central fail and must be retried by the engineer.
  • All site-level store-and-forward buffers (external system calls, notifications, and cached database writes) are replicated between the two site cluster nodes using application-level replication over Akka.NET remoting.
  • The active node persists buffered messages to a local SQLite database and forwards them to the standby node, which maintains its own local SQLite copy.
  • On failover, the standby node already has a replicated copy of the buffer and takes over delivery seamlessly.
  • Successfully delivered messages are removed from both nodes' local stores.
  • There is no maximum buffer size — messages accumulate until they either succeed or exhaust retries and are parked.
  • Retry intervals are fixed (not exponential backoff). The fixed interval is sufficient for the expected use cases.
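The buffering, fixed-interval retry, and parking behavior above can be sketched as follows. This is an illustrative Python model only — the actual system is C#/Akka.NET with SQLite persistence and node-to-node replication, all of which are reduced here to in-memory lists; the class and method names are invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class BufferedMessage:
    payload: str
    attempts: int = 0

class StoreAndForwardBuffer:
    """In-memory model of a site-side store-and-forward buffer.

    The real system persists messages to SQLite on the active node and
    replicates them to the standby over Akka.NET remoting; here persistence
    and replication are elided so only the retry/parking logic remains.
    """
    def __init__(self, max_retries, deliver):
        self.max_retries = max_retries   # per-definition retry setting
        self.deliver = deliver           # callable returning True on success
        self.pending = []                # messages awaiting delivery (unbounded)
        self.parked = []                 # dead-lettered after max retries

    def enqueue(self, payload):
        self.pending.append(BufferedMessage(payload))

    def retry_pass(self):
        """One fixed-interval retry pass (no exponential backoff)."""
        still_pending = []
        for msg in self.pending:
            msg.attempts += 1
            if self.deliver(msg.payload):
                continue                  # delivered: removed from both nodes' stores
            if msg.attempts >= self.max_retries:
                self.parked.append(msg)   # parked for manual review via central UI
            else:
                still_pending.append(msg)
        self.pending = still_pending
```

Note that there is no cap on `pending` — per the requirement, messages accumulate until they succeed or are parked.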

1.4 Deployment Behavior

  • When central deploys a new configuration to a site instance, the site applies it immediately upon receipt — no local operator confirmation is required.
  • If a site loses connectivity to central, it continues operating with its last received deployed configuration.
  • The site reports back to central whether deployment was successfully applied.
  • Pre-deployment validation: Before any deployment is sent to a site, the central cluster performs comprehensive validation including flattening the configuration, test-compiling all scripts, verifying alarm trigger references, verifying script trigger references, and checking data connection binding completeness (see Section 3.11).

1.5 System-Wide Artifact Deployment

  • Changes to shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration are not automatically propagated to sites.
  • Deployment of system-wide artifacts requires explicit action by a user with the Deployment role.
  • Artifacts can be deployed to all sites at once or to an individual site (per-site deployment).
  • The Design role manages the definitions; the Deployment role triggers deployment to sites. A user may hold both roles.

2. Data Storage & Data Flow

2.1 Central Databases (MS SQL)

  • Configuration Database: A dedicated database for system-specific configuration data (e.g., templates, site definitions, instance configurations, system settings).
  • Machine Data Database: A separate database for collected machine data (e.g., telemetry, measurements, events).

2.2 Communication: Central ↔ Site

  • Two transport layers are used for central-site communication:
    • Akka.NET ClusterClient/ClusterClientReceptionist: Handles command/control messaging — deployments, instance lifecycle commands, subscribe/unsubscribe handshake, debug snapshots, health reports, remote queries, and integration routing. Provides automatic failover between contact points.
    • gRPC server-streaming (site→central): Handles real-time data streaming — attribute value updates and alarm state changes. Each site node hosts a SiteStreamGrpcServer on a dedicated HTTP/2 port (Kestrel, default port 8083). Central creates per-site SiteStreamGrpcClient instances to subscribe to site streams. gRPC provides HTTP/2 flow control and per-stream backpressure that ClusterClient lacks.
  • Site addressing: Site Akka base addresses (NodeA and NodeB) and gRPC endpoints (GrpcNodeAAddress and GrpcNodeBAddress) are stored in the Sites database table and configured via the Central UI or CLI. Central creates a ClusterClient per site using both Akka addresses as contact points, and per-site gRPC clients using the gRPC addresses.
  • Central contact points: Sites configure multiple central contact points (both central node addresses) for redundancy. ClusterClient handles failover between central nodes automatically.
  • Central as integration hub: Central brokers requests between external systems and sites. For example, a recipe manager sends a recipe to central, which routes it to the appropriate site. MES requests machine values from central, which routes the request to the site and returns the response.
  • The system does not stream all machine data to central continuously. The only real-time stream is an on-demand debug view — an engineer in the central UI can open a live view of a specific instance's tag values and alarm states for troubleshooting purposes. This is session-based and temporary. The debug view subscribes via gRPC to the site's SiteStreamManager filtered by instance (see Section 8.1).

2.3 Site-Level Storage & Interface

  • Sites have no user interface — they are headless collectors, forwarders, and script executors.
  • Sites require local storage for: the current deployed (flattened) configurations, deployed scripts, shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration.
  • After artifact deployment, sites are fully self-contained — all runtime configuration is read from local SQLite. Sites do not access the central configuration database at runtime.
  • Store-and-forward buffers are persisted to a local SQLite database on each node and replicated between nodes via application-level replication (see 1.3).

2.4 Data Connection Protocols

  • The system supports OPC UA and LmxProxy (a gRPC-based custom protocol with an existing client SDK).
  • Both protocols implement a common interface supporting: connect, subscribe to tag paths, receive value updates, and write values.
  • Additional protocols can be added by implementing the common interface.
  • The Data Connection Layer is a clean data pipe — it publishes tag value updates to Instance Actors but performs no evaluation of triggers or alarm conditions.
  • Initial attribute quality: Attributes bound to a data connection start with uncertain quality when the Instance Actor initializes. The quality remains uncertain until the first value update is received from the Data Connection Layer. This distinguishes "never received a value" from "received a known-good value" or "connection lost" (bad quality).
  • Data connections support optional backup endpoints with automatic failover after a configurable retry count. On failover, all subscriptions are transparently re-created on the new endpoint.
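The common protocol interface and the initial-quality rule can be sketched as below. This is an illustrative Python sketch, not the system's C# API: the interface methods mirror the requirement (connect, subscribe, write), and `BoundAttribute` shows the uncertain-until-first-update quality transition; all names are invented for the sketch.

```python
from abc import ABC, abstractmethod
from enum import Enum

class Quality(Enum):
    UNCERTAIN = "uncertain"   # initialized, no value received yet
    GOOD = "good"             # last update delivered a known-good value
    BAD = "bad"               # connection lost

class DataConnection(ABC):
    """Protocol-agnostic contract each driver (OPC UA, LmxProxy, ...) implements."""
    @abstractmethod
    def connect(self): ...
    @abstractmethod
    def subscribe(self, tag_path, on_update): ...
    @abstractmethod
    def write(self, tag_path, value): ...

class BoundAttribute:
    """An attribute bound to a data connection; quality starts uncertain and
    only changes once the Data Connection Layer reports something."""
    def __init__(self, name):
        self.name = name
        self.value = None
        self.quality = Quality.UNCERTAIN   # "never received a value"

    def on_value(self, value):
        self.value = value
        self.quality = Quality.GOOD        # known-good value received

    def on_connection_lost(self):
        self.quality = Quality.BAD         # last value is now stale
```

The three quality states keep "never received a value" distinguishable from both "healthy" and "connection lost", which matters for scripts and alarm evaluation acting on freshly initialized instances.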

2.5 Scale

  • Approximately 10 sites.
  • 50–500 machines per site.
  • 25–75 live data point tags per machine.

3. Template & Machine Modeling

3.1 Template Structure

  • Machines are modeled as instances of templates.
  • Templates define a set of attributes.
  • Each attribute has a lock flag that controls whether it can be overridden downstream.

3.2 Attribute Definition

Each attribute carries the following metadata:

  • Name: Identifier for the attribute.
  • Value: The default or configured value. May be empty if intended to be set at the instance level.
  • Data Type: The value's type. Fixed set: Boolean, Integer, Float, String.
  • Lock Flag: Controls whether the attribute can be overridden downstream.
  • Description: Human-readable explanation of the attribute's purpose.
  • Data Source Reference (optional): A relative path within a data connection (e.g., /Motor/Speed). The template defines what to read — the path relative to a data connection. The template does not specify which data connection to use; that is an instance-level concern (see Section 3.3). Attributes without a data source reference are static configuration values.

3.3 Data Connections

  • Data connections are reusable, named resources defined centrally and then assigned to specific sites (e.g., an OPC server, a PLC endpoint). Data connection definitions are deployed to sites as part of artifact deployment (see Section 1.5) and stored in local SQLite.
  • A data connection encapsulates the details needed to communicate with a data source (protocol, address, credentials, etc.).
  • Attributes with a data source reference must be bound to a data connection at instance creation — the template defines what to read (the relative path), and the instance specifies where to read it from (the data connection assigned to the site).
  • Binding is per-attribute: Each attribute with a data source reference individually selects its data connection. Different attributes on the same instance may use different data connections. The Central UI supports bulk assignment (selecting multiple attributes and assigning a data connection to all of them at once) to reduce tedium.
  • Templates do not specify a default connection. The connection binding is an instance-level concern.
  • The flattened configuration sent to a site resolves connection references into concrete connection details paired with attribute relative paths.
  • Data connection names are not standardized across sites — different sites may have different data connection names for equivalent devices.
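The binding-resolution step — pairing each attribute's template-defined relative path with the instance-selected connection's concrete details — can be sketched as below. An illustrative Python sketch under assumed data shapes (plain dicts); the real flattening runs in C# on the central cluster, and validation (Section 3.11) guarantees the lookups here succeed.

```python
def flatten_bindings(attributes, bindings, site_connections):
    """Resolve per-attribute connection names into (connection details, path) pairs.

    attributes:       {attr_name: relative_path or None for static attributes}
    bindings:         {attr_name: connection_name} chosen per attribute on the instance
    site_connections: {connection_name: connection_details} defined at the site
    """
    flat = {}
    for name, rel_path in attributes.items():
        if rel_path is None:
            continue                      # static attribute: no binding needed
        conn_name = bindings[name]        # pre-deployment validation guarantees presence
        flat[name] = (site_connections[conn_name], rel_path)
    return flat
```

The output is what the site receives: concrete endpoints paired with relative paths, with no remaining reference to named connections or template structure.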

3.4 Alarm Definitions

Alarms are first-class template members alongside attributes and scripts, following the same inheritance, override, and lock rules.

Each alarm has:

  • Name: Identifier for the alarm.
  • Description: Human-readable explanation of the alarm condition.
  • Priority Level: Numeric value from 0–1000.
  • Lock Flag: Controls whether the alarm can be overridden downstream.
  • Trigger Definition: One of the following trigger types:
    • Value Match: Triggers when a monitored attribute equals a predefined value.
    • Range Violation: Triggers when a monitored attribute value falls outside an allowed range.
    • Rate of Change: Triggers when a monitored attribute value changes faster than a defined threshold.
  • On-Trigger Script (optional): A script to execute when the alarm triggers. The alarm on-trigger script executes in the context of the instance and can call instance scripts, but instance scripts cannot call alarm on-trigger scripts. The call direction is one-way.
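The three trigger types reduce to simple predicates over monitored attribute values. A minimal illustrative sketch (the evaluation actually runs in the site's Alarm Actor in C#; function names and the units assumed for rate of change are invented here):

```python
def value_match(value, target):
    """Triggers when the monitored attribute equals a predefined value."""
    return value == target

def range_violation(value, low, high):
    """Triggers when the value falls outside the allowed [low, high] range."""
    return not (low <= value <= high)

def rate_of_change(prev, curr, elapsed_s, threshold):
    """Triggers when the value changes faster than `threshold` units per second
    (assumes the evaluator retains the previous value and its timestamp)."""
    return abs(curr - prev) / elapsed_s > threshold
```

Rate of change is the only trigger that requires state (the prior sample), which is why alarm evaluation sits in a stateful actor rather than the stateless Data Connection Layer.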

3.4.1 Alarm State

  • Alarm state (active/normal) is managed at the site level per instance, held in memory by the Alarm Actor.
  • When the alarm condition clears, the alarm automatically returns to normal state — no acknowledgment workflow is required.
  • Alarm state is not persisted — on restart, alarm states are re-evaluated from incoming values.
  • Alarm state changes are published to the site-wide Akka stream under the identifier [InstanceUniqueName].[AlarmName], together with the alarm state (active/normal), priority, and timestamp.

3.5 Template Relationships

Templates participate in two distinct relationship types:

  • Inheritance (is-a): A child template extends a parent template. The child inherits all attributes, alarms, scripts, and composed feature modules from the parent. The child can:
    • Override the values of non-locked inherited attributes, alarms, and scripts.
    • Add new attributes, alarms, or scripts not present in the parent.
    • Not remove attributes, alarms, or scripts defined by the parent.
  • Composition (has-a): A template can nest an instance of another template as a feature module (e.g., embedding a RecipeSystem module inside a base machine template). Feature modules can themselves compose other feature modules recursively.
  • Naming collisions: If a template composes two feature modules that each define an attribute, alarm, or script with the same name, this is a design-time error. The system must detect and report the collision, and the template cannot be saved until the conflict is resolved.

3.6 Locking

  • Locking applies to attributes, alarms, and scripts uniformly.
  • Any of these can be locked at the level where it is defined or overridden.
  • A locked attribute cannot be overridden by any downstream level (child templates, composing templates, or instances).
  • An unlocked attribute can be overridden by any downstream level.
  • Intermediate locking: Any level in the chain can lock an attribute that was unlocked upstream. Once locked, it remains locked for all levels below — a downstream level cannot unlock an attribute locked above it.

3.6.1 Attribute Resolution Order

Attributes are resolved from most-specific to least-specific. The first value encountered wins:

  1. Instance (site-deployed machine)
  2. Child Template (most derived first, walking up the inheritance chain)
  3. Composing Template (the template that embeds a feature module can override the module's attributes)
  4. Composed Module (the original feature module definition, recursively resolved if modules nest other modules)

At any level, an override is only permitted if the attribute has not been locked at a higher-priority level.
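The interaction between resolution order and locking can be captured in one function. An illustrative Python sketch under assumed data shapes (dicts ordered most-specific first); the effect of a lock is modeled by ignoring any definition that is more specific than the least-specific locked level, since those overrides would be rejected at design time anyway.

```python
def resolve_attribute(name, levels):
    """Resolve an attribute across the precedence chain.

    levels: dicts ordered most-specific first (instance, child template,
    composing template, composed module); each maps
    attribute name -> {"value": ..., "locked": bool}.

    The first value encountered wins, except that a lock at any level
    invalidates every override defined at a more specific level.
    """
    start = 0
    for i, level in enumerate(levels):
        entry = level.get(name)
        if entry and entry["locked"]:
            start = i   # the least-specific lock wins; overrides above it are ignored
    for level in levels[start:]:
        entry = level.get(name)
        if entry:
            return entry["value"]
    return None
```

With no locks the instance value wins; with a lock at the parent, the parent's value survives even if a child or instance attempted an override; intermediate locking (a child locking an unlocked parent attribute) pins the child's value for everything below it.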

3.7 Override Scope

  • Inheritance: Child templates can override non-locked attributes from their parent, including attributes originating from composed feature modules.
  • Composition: A template that composes a feature module can override non-locked attributes within that module.
  • Overrides can "pierce" into composed modules — a child template can override attributes inside a feature module it inherited from its parent.

3.8 Instance Rules

  • An instance is a deployed occurrence of a template at a site.
  • Instances can override the values of non-locked attributes.
  • Instances cannot add new attributes.
  • Instances cannot remove attributes.
  • The instance's structure (which attributes exist, which feature modules are composed) is strictly defined by its template.
  • Each instance is assigned to an area within its site (see 3.10).

3.8.1 Instance Lifecycle

  • Instances can be in one of two states: enabled or disabled.
  • Enabled: The instance is active at the site — data subscriptions, script triggers, and alarm evaluation are all running.
  • Disabled: The site stops script triggers, data subscriptions (no live data collection), and alarm evaluation. The deployed configuration is retained on the site so the instance can be re-enabled without redeployment. Store-and-forward messages for a disabled instance continue to drain (deliver pending messages).
  • Deletion: Instances can be deleted. Deletion removes the running configuration from the site, stops subscriptions, and destroys the Instance Actor and its children. Store-and-forward messages are not cleared on deletion — they continue to be delivered or can be managed (retried/discarded) via parked message management. If the site is unreachable when a delete is triggered, the deletion fails (same behavior as a failed deployment). The central side does not mark it as deleted until the site confirms.
  • Templates cannot be deleted if any instances or child templates reference them. The user must remove all references first.

3.9 Template Deployment & Change Propagation

  • Template changes are not automatically propagated to deployed instances.
  • The system maintains two views of each instance:
    • Deployed Configuration: The currently active configuration on the instance, as it was last explicitly deployed.
    • Template-Derived Configuration: The configuration the instance would have based on the current state of its template (including resolved inheritance, composition, and overrides).
  • Deployment is performed at the individual instance level — an engineer explicitly commands the system to update a specific instance.
  • The system must be able to show differences between the deployed configuration and the current template-derived configuration, allowing engineers to see what would change before deploying.
  • No rollback support is required. The system only needs to track the current deployed state, not a history of prior deployments.
  • Concurrent editing: Template editing uses a last-write-wins model. No pessimistic locking or optimistic concurrency conflict detection is required.
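The deployed-vs-derived diff is a straightforward comparison of two flattened attribute maps. An illustrative Python sketch (the real diff runs over full flattened configurations in C#; representing each side as a dict of attribute values is an assumption of this sketch):

```python
def config_diff(deployed, derived):
    """Compare the deployed configuration with the current template-derived one.

    Returns {attr: (deployed_value, derived_value)} for every difference;
    an attribute present on only one side reports None for the missing side.
    """
    diff = {}
    for key in set(deployed) | set(derived):
        old, new = deployed.get(key), derived.get(key)
        if old != new:
            diff[key] = (old, new)
    return diff
```

An empty result means the instance is up to date with its template; a non-empty result is exactly what the UI shows the engineer before an explicit deploy.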

3.10 Areas

  • Areas are predefined hierarchical groupings associated with a site, stored in the configuration database.
  • Areas support parent-child relationships (e.g., Plant → Building → Production Line → Cell).
  • Each instance is assigned to an area within its site.
  • Areas are used for filtering and finding instances in the central UI.
  • Area definitions are managed by users with the Design role.

3.11 Pre-Deployment Validation

Before any deployment is sent to a site, the central cluster performs comprehensive validation. Validation covers:

  • Flattening: The full template hierarchy is resolved and flattened successfully.
  • Naming collision detection: No duplicate attribute, alarm, or script names exist in the flattened configuration.
  • Script compilation: All instance scripts and alarm on-trigger scripts are test-compiled and must compile without errors.
  • Alarm trigger references: Alarm trigger definitions reference attributes that exist in the flattened configuration.
  • Script trigger references: Script triggers (value change, conditional) reference attributes that exist in the flattened configuration.
  • Data connection binding completeness: Every attribute with a data source reference has a data connection binding assigned on the instance, and the bound data connection name exists as a defined connection at the instance's site.
  • Exception: Validation does not verify that data source relative paths resolve to real tags on physical devices — that is a runtime concern that can only be determined at the site.

Validation is also available on demand in the Central UI for Design users during template authoring, providing early feedback without requiring a deployment attempt.

For shared scripts, pre-compilation validation is performed before deployment to sites. Since shared scripts have no instance context, validation is limited to C# syntax and structural correctness.
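The validation pass collects every failure rather than stopping at the first, so the engineer sees the full picture in one attempt. An illustrative Python sketch under an assumed flattened-configuration shape (the real checks run in C# on central, and `compile_script` stands in for the actual test-compilation step):

```python
def validate_deployment(flat_config, compile_script, site_connections):
    """Run the pre-deployment checks and collect all errors (not fail-fast).

    flat_config: {"attributes": {name: {"path": ..., "binding": ...}},
                  "scripts": {name: source},
                  "alarm_triggers": {alarm_name: attribute_name},
                  "script_triggers": {script_name: attribute_name}}
    compile_script: callable returning None on success or an error string.
    """
    errors = []
    attrs = flat_config["attributes"]
    for name, source in flat_config["scripts"].items():
        err = compile_script(source)          # test-compile every script
        if err:
            errors.append(f"script {name}: {err}")
    for alarm, attr in flat_config["alarm_triggers"].items():
        if attr not in attrs:                 # alarm trigger references must resolve
            errors.append(f"alarm {alarm}: unknown attribute {attr}")
    for script, attr in flat_config["script_triggers"].items():
        if attr not in attrs:                 # script trigger references must resolve
            errors.append(f"script trigger {script}: unknown attribute {attr}")
    for name, a in attrs.items():
        if a.get("path") is not None:         # only data-bound attributes need bindings
            binding = a.get("binding")
            if binding is None:
                errors.append(f"attribute {name}: no data connection binding")
            elif binding not in site_connections:
                errors.append(f"attribute {name}: unknown connection {binding}")
    return errors
```

Per the stated exception, nothing here probes the physical device — whether a relative path resolves to a real tag remains a runtime concern at the site.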

4. Scripting

4.1 Script Definitions

  • Scripts are C# and are defined at the template level as first-class template members.
  • Scripts follow the same inheritance, override, and lock rules as attributes. A parent template can define a script, a child template can override it (if not locked), and any level can lock a script to prevent downstream changes.
  • Scripts are deployed to sites as part of the flattened instance configuration.
  • Scripts are compiled at the site when a deployment is received. Pre-compilation validation occurs at central before deployment (see Section 3.11), but the site performs the actual compilation for execution.
  • Scripts can optionally define input parameters (name and data type per parameter). Scripts without parameter definitions accept no arguments.
  • Scripts can optionally define a return value definition (field names and data types). Return values support single objects and lists of objects. Scripts without a return definition return void.
  • Return values are used when scripts are called explicitly by other scripts (via Instance.CallScript() or Scripts.CallShared()) or by the Inbound API (via Route.To().Call()). When invoked by a trigger (interval, value change, conditional, alarm), any return value is discarded.

4.2 Script Triggers

Scripts can be triggered by:

  • Interval: Execute on a recurring time schedule.
  • Value Change: Execute when a specific instance attribute value changes.
  • Conditional: Execute when an instance attribute value equals or does not equal a given value.

Scripts have an optional minimum time between runs setting. If a trigger fires before the minimum interval has elapsed since the last execution, the invocation is skipped.
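The minimum-time-between-runs rule skips (rather than queues) early firings. An illustrative Python sketch with an injected clock so the behavior is observable without real waiting; the actual implementation lives in the site's C# script trigger handling, and all names here are invented:

```python
class ThrottledTrigger:
    """Skips trigger firings that arrive before the minimum interval has
    elapsed since the last execution."""
    def __init__(self, min_interval_s, action, clock):
        self.min_interval_s = min_interval_s
        self.action = action
        self.clock = clock          # injectable time source, seconds
        self.last_run = None

    def fire(self):
        """Returns True if the script ran, False if the invocation was skipped."""
        now = self.clock()
        if self.last_run is not None and now - self.last_run < self.min_interval_s:
            return False            # too soon: skipped, not deferred
        self.last_run = now
        self.action()
        return True
```

Skipping (instead of queuing) means a rapidly changing attribute cannot build an unbounded backlog of pending script executions.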

4.3 Script Error Handling

  • If a script fails (unhandled exception, timeout, etc.), the failure is logged locally at the site.
  • The script is not disabled — it remains active and will fire on the next qualifying trigger event.
  • Script failures are not reported to central. Diagnostics are local only.
  • For external system call failures within scripts, store-and-forward handling (Section 5.3) applies independently of script error handling.

4.4 Script Capabilities

Scripts executing on a site for a given instance can:

  • Read attribute values on that instance (live data points and static config).
  • Write attribute values on that instance. For attributes with a data source reference, the write goes to the Data Connection Layer which writes to the physical device; the in-memory value updates when the device confirms the new value via the existing subscription. For static attributes, the write updates the in-memory value and persists the override to local SQLite — the value survives restart and failover. Persisted overrides are reset when the instance is redeployed.
  • Call other scripts on that instance via Instance.CallScript("scriptName", params). Calls use the Akka ask pattern and return the called script's return value. Script-to-script calls support concurrent execution.
  • Call shared scripts via Scripts.CallShared("scriptName", params). Shared scripts execute inline in the calling Script Actor's context — they are compiled code libraries, not separate actors.
  • Call external system API methods in two modes: ExternalSystem.Call() for synchronous request/response, or ExternalSystem.CachedCall() for fire-and-forget with store-and-forward on transient failure (see Section 5).
  • Send notifications (see Section 6).
  • Access databases by requesting an MS SQL client connection by name (see Section 5.5).

Scripts cannot access other instances' attributes or scripts.

4.4.1 Script Call Recursion Limit

  • Script-to-script calls (via Instance.CallScript and Scripts.CallShared) enforce a maximum recursion depth to prevent infinite loops.
  • The default maximum depth is a reasonable limit (e.g., 10 levels).
  • The current call depth is tracked and incremented with each nested call. If the limit is reached, the call fails with an error logged to the site event log.
  • This applies to all script call chains including alarm on-trigger scripts calling instance scripts.
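The depth-tracking mechanism can be sketched as below: the counter travels with the call chain, so mutual or self-recursion is cut off at the limit instead of looping forever. An illustrative Python sketch (the real calls go through the Akka ask pattern between actors; the limit of 10 is the document's own example default, everything else is invented for the sketch):

```python
MAX_CALL_DEPTH = 10  # example default from the requirement

def call_script(scripts, name, depth=0):
    """Invoke a script by name, tracking nested call depth.

    Each script is a callable receiving a `call` function it can use to
    invoke other scripts; the depth counter is threaded through the chain,
    and hitting the limit fails the call (logged in the real system).
    """
    if depth >= MAX_CALL_DEPTH:
        raise RecursionError(f"script call depth limit ({MAX_CALL_DEPTH}) reached")
    nested = lambda n: call_script(scripts, n, depth + 1)
    return scripts[name](nested)
```

A script that calls itself fails at the limit, while legitimate short chains (including alarm-to-instance calls) complete normally.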

4.5 Shared Scripts

  • Shared scripts are not associated with any template — they are a system-wide library of reusable C# scripts.
  • Shared scripts can optionally define input parameters and return value definitions, following the same rules as template-level scripts.
  • Managed by users with the Design role.
  • Deployed to all sites for use by any instance script (deployment requires explicit action by a user with the Deployment role).
  • Shared scripts execute inline in the calling Script Actor's context as compiled code. They are not separate actors. This avoids serialization bottlenecks and messaging overhead.
  • Shared scripts are not available on the central cluster — Inbound API scripts cannot call them directly. To execute shared script logic, route to a site instance via Route.To().Call().

4.6 Alarm On-Trigger Scripts

  • Alarm on-trigger scripts are defined as part of the alarm definition and execute when the alarm activates.
  • They execute directly in the Alarm Actor's context (via a short-lived Alarm Execution Actor), similar to how shared scripts execute inline.
  • Alarm on-trigger scripts can call instance scripts via Instance.CallScript(), which sends an ask message to the appropriate sibling Script Actor.
  • Instance scripts cannot call alarm on-trigger scripts — the call direction is one-way.
  • The recursion depth limit applies to alarm-to-instance script call chains.

5. External System Integrations

5.1 External System Definitions

  • External systems are predefined contracts created by users with the Design role.
  • Each definition includes:
    • Connection details: Endpoint URL, authentication, protocol information.
    • Method definitions: Available API methods with defined parameters and return types.
  • Definitions are deployed uniformly to all sites — no per-site connection detail overrides.
  • Deployment of definition changes requires explicit action by a user with the Deployment role.
  • At the site, external system definitions are read from local SQLite (populated by artifact deployment), not from the central config DB.

5.2 Site-to-External-System Communication

  • Sites communicate with external systems directly (not routed through central).
  • Scripts invoke external system methods by referencing the predefined definitions.

5.3 Store-and-Forward for External Calls

  • If an external system is unavailable when a script invokes a method, the message is buffered locally at the site.
  • Retry is performed per message — individual failed messages retry independently.
  • Each external system definition includes configurable retry settings:
    • Max retry count: Maximum number of retry attempts before giving up.
    • Time between retries: Fixed interval between retry attempts (no exponential backoff).
  • After max retries are exhausted, the message is parked (dead-lettered) for manual review.
  • There is no maximum buffer size — messages accumulate until delivery succeeds or retries are exhausted.

5.4 Parked Message Management

  • Parked messages are stored at the site where they originated.
  • The central UI can query sites for parked messages and manage them remotely.
  • Operators can retry or discard parked messages from the central UI.
  • Parked message management covers external system calls, notifications, and cached database writes.

5.5 Database Connections

  • Database connections are predefined, named resources created by users with the Design role.
  • Each definition includes the connection details needed to connect to an MS SQL database (server, database name, credentials, etc.).
  • Each definition includes configurable retry settings (same pattern as external systems): max retry count and time between retries (fixed interval).
  • Definitions are deployed uniformly to all sites — no per-site overrides.
  • Deployment of definition changes requires explicit action by a user with the Deployment role.
  • At the site, database connection definitions are read from local SQLite (populated by artifact deployment), not from the central config DB.

5.6 Database Access Modes

Scripts can interact with databases in two modes:

  • Real-time (synchronous): Scripts request a raw MS SQL client connection by name (e.g., Database.Connection("MES_DB")), giving script authors full ADO.NET-level control for immediate queries and updates.
  • Cached write (store-and-forward): Scripts submit a write operation for deferred, reliable delivery. The cached entry stores the database connection name, the SQL statement to execute, and parameter values. If the database is unavailable, the write is buffered locally at the site and retried per the connection's retry settings. After max retries are exhausted, the write is parked for manual review (managed via central UI alongside other parked messages).
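The cached-write entry and its delivery attempt can be sketched as follows. An illustrative Python sketch: `CachedWrite` holds exactly the three fields the requirement names (connection name, SQL statement, parameter values), while the connection registry and `execute_cached` are invented stand-ins for the site's C# delivery path.

```python
from dataclasses import dataclass, field

@dataclass
class CachedWrite:
    """What a deferred (store-and-forward) database write persists at the site."""
    connection_name: str        # named database connection, e.g. a MES database
    sql: str                    # statement to execute on delivery
    params: dict = field(default_factory=dict)  # parameter values bound at execution

def execute_cached(write, connections):
    """Attempt delivery; True on success, False to keep buffering/retrying.

    `connections` stands in for the deployed database connection definitions;
    each entry tracks availability and (for the sketch) the statements executed.
    """
    conn = connections.get(write.connection_name)
    if conn is None or not conn["available"]:
        return False            # buffered and retried per the connection's settings
    conn["executed"].append((write.sql, write.params))
    return True
```

A False result feeds the same fixed-interval retry and parking machinery used for external system calls and notifications.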

6. Notifications

6.1 Notification Lists

  • Notification lists are system-wide, managed by users with the Design role.
  • Each list has a name and contains one or more recipients.
  • Each recipient has a name and an email address.
  • Notification lists are deployed to all sites (deployment requires explicit action by a user with the Deployment role).
  • At the site, notification lists and recipients are read from local SQLite (populated by artifact deployment), not from the central config DB.

6.2 Email Support

  • The system has predefined support for sending email as the notification delivery mechanism.
  • Email server configuration (SMTP settings) is defined centrally and deployed to all sites as part of artifact deployment (see Section 1.5). Sites read SMTP configuration from local SQLite.

6.3 Script API

  • Scripts send notifications using a simplified API: Notify.To("list name").Send("subject", "message")
  • This API is available to instance scripts, alarm on-trigger scripts, and shared scripts.

6.4 Store-and-Forward for Notifications

  • If the email server is unavailable, notifications are buffered locally at the site.
  • Follows the same retry pattern as external system calls: configurable max retry count and time between retries (fixed interval).
  • After max retries are exhausted, the notification is parked for manual review (managed via central UI alongside external system parked messages).
  • There is no maximum buffer size for notification messages.

7. Inbound API (Central)

7.1 Purpose

The system exposes a web API on the central cluster for external systems to call into the SCADA system. This is the counterpart to the outbound External System Integrations (Section 5) — where Section 5 defines how the system calls out, this section defines how external systems call in.

7.2 API Key Management

  • API keys are stored in the configuration database.
  • Each API key has a name/label (for identification), the key value, and an enabled/disabled flag.
  • API keys are managed by users with the Admin role.

7.3 Authentication

  • Inbound API requests are authenticated via API key (not LDAP/AD).
  • The API key must be included with each request.
  • Invalid or disabled keys are rejected.

7.4 API Method Definitions

  • API methods are predefined and managed by users with the Design role.
  • Each method definition includes:
    • Method name: Unique identifier for the endpoint.
    • Approved API keys: List of API keys authorized to call this method.
    • Parameter definitions: Name and data type for each input parameter.
    • Return value definition: Data type and structure of the response. Supports single objects and lists of objects.
    • Timeout: Configurable per method. Maximum execution time including routed calls to sites.
  • The implementation of each method is a C# script stored inline in the method definition. It executes on the central cluster. No template inheritance — API scripts are standalone.
  • API scripts can route calls to any instance at any site via Route.To("instanceCode").Call("scriptName", parameters), read/write attributes in batch, and access databases directly.
  • API scripts cannot call shared scripts directly (shared scripts are site-only). To invoke site logic, use Route.To().Call().
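The two-step inbound authorization — a valid, enabled key plus per-method approval — can be sketched as below. An illustrative Python sketch under assumed data shapes; the real checks run against the configuration database on the central cluster, and all names here are invented.

```python
def authorize(api_keys, methods, key_value, method_name):
    """Inbound API authorization check.

    api_keys: {key_value: {"name": label, "enabled": bool}}
    methods:  {method_name: {"approved_keys": [key labels]}}
    Returns (True, None) on success, or (False, reason) on rejection.
    """
    key = api_keys.get(key_value)
    if key is None or not key["enabled"]:
        return False, "invalid or disabled API key"   # authentication failure
    method = methods[method_name]
    if key["name"] not in method["approved_keys"]:
        return False, f"key not approved for {method_name}"  # authorization failure
    return True, None
```

Keeping the approved-key list on the method definition means disabling a key (Admin role) revokes it everywhere at once, while per-method approval (Design role) stays fine-grained.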

7.5 Availability

  • The inbound API is hosted only on the central cluster (active node).
  • On central failover, the API becomes available on the new active node.

8. Central UI

The central cluster hosts a configuration and management UI (no live machine data visualization, except on-demand debug views). The UI supports the following workflows:

  • Template Authoring: Create, edit, and manage templates including hierarchy (inheritance) and composition (feature modules). Author and manage scripts within templates. Design-time validation available on demand to check flattening, naming collisions, and script compilation without deploying.
  • Shared Script Management: Create, edit, and manage the system-wide shared script library.
  • Notification List Management: Create, edit, and manage notification lists and recipients.
  • External System Management: Define external system contracts (connection details, API method definitions).
  • Database Connection Management: Define named database connections for script use.
  • Inbound API Management: Manage API keys (create, enable/disable, delete). Define API methods (name, parameters, return values, approved keys, implementation script). (Admin role for keys, Design role for methods.)
  • Instance Management: Create instances from templates, bind data connections (per-attribute, with bulk assignment UI for selecting multiple attributes and assigning a data connection at once), set instance-level attribute overrides, assign instances to areas. Disable or delete instances.
  • Site & Data Connection Management: Define sites (including optional NodeAAddress and NodeBAddress fields for Akka remoting paths, and optional GrpcNodeAAddress and GrpcNodeBAddress fields for gRPC streaming endpoints), manage data connections and assign them to sites.
  • Area Management: Define hierarchical area structures per site for organizing instances.
  • Deployment: View diffs between deployed and current template-derived configurations, deploy updates to individual instances. Filter instances by area. Pre-deployment validation runs automatically before any deployment is sent.
  • System-Wide Artifact Deployment: Explicitly deploy shared scripts, external system definitions, database connection definitions, data connection definitions, notification lists, and SMTP configuration to all sites or to an individual site (requires Deployment role). Per-site deployment is available via the Sites admin page.
  • Deployment Status Monitoring: Track whether deployments were successfully applied at site level.
  • Debug View: On-demand real-time view of a specific instance's tag values and alarm states for troubleshooting (see 8.1).
  • Parked Message Management: Query sites for parked messages (external system calls, notifications, and cached database writes), retry or discard them.
  • Health Monitoring Dashboard: View site cluster health, node status, data connection health, script error rates, alarm evaluation errors, and store-and-forward buffer depths (see Section 11).
  • Site Event Log Viewer: Query and view operational event logs from site clusters (see Section 12).

8.1 Debug View

  • Subscribe-on-demand: When an engineer opens a debug view for an instance, central opens a gRPC server-streaming subscription for that instance against the site's SiteStreamGrpcServer, then requests a snapshot of all current attribute values and alarm states via ClusterClient. The gRPC stream delivers subsequent attribute value and alarm state changes directly from the site's SiteStreamManager.
  • Attribute value stream messages are structured as: [InstanceUniqueName].[AttributePath].[AttributeName], attribute value, attribute quality, attribute change timestamp.
  • Alarm state stream messages are structured as: [InstanceUniqueName].[AlarmName], alarm state (active/normal), priority, timestamp.
  • The stream continues until the engineer closes the debug view, at which point central unsubscribes and the site stops streaming.
  • No attribute/alarm selection — the debug view always shows all tag values and alarm states for the instance.
  • No special concurrency limits are required.
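
As a sketch, the stream message shapes above might map onto proto definitions like the following. The service, message, and field names are assumptions; only the field content is specified by this section:

```proto
syntax = "proto3";

// Subscription request: one debug view per instance.
message InstanceSubscription {
  string instance_unique_name = 1;
}

// Attribute value change:
// address = [InstanceUniqueName].[AttributePath].[AttributeName].
message AttributeValueUpdate {
  string address = 1;
  string value = 2;
  string quality = 3;
  string timestamp_utc = 4;  // all timestamps are UTC (Section 14.1)
}

// Alarm state change: address = [InstanceUniqueName].[AlarmName].
message AlarmStateUpdate {
  string address = 1;
  string state = 2;          // "active" or "normal"
  int32 priority = 3;
  string timestamp_utc = 4;
}

message InstanceUpdate {
  oneof update {
    AttributeValueUpdate attribute = 1;
    AlarmStateUpdate alarm = 2;
  }
}

service SiteStream {
  // Server streaming: continues until the debug view is closed.
  rpc SubscribeInstance (InstanceSubscription) returns (stream InstanceUpdate);
}
```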

9. Security & Access Control

9.1 Authentication

  • UI users authenticate via username/password validated directly against LDAP/Active Directory. Sessions are maintained via JWT tokens.
  • External system API callers authenticate via API key (see Section 7).

9.2 Authorization

  • Authorization is role-based, with roles assigned by LDAP group membership.
  • Roles are independent, with no implied hierarchy between them. A user may hold multiple roles simultaneously (e.g., both Design and Deployment) by being a member of the corresponding LDAP groups.
  • Inbound API authorization is per-method, based on approved API key lists (see Section 7.4).

9.3 Roles

  • Admin: System-wide permission to manage sites, data connections, LDAP group-to-role mappings, API keys, and system-level configuration.
  • Design: System-wide permission to author and edit templates, scripts, shared scripts, external system definitions, notification lists, inbound API method definitions, and area definitions.
  • Deployment: Permission to manage instances (create, set overrides, bind connections, disable, delete) and deploy configurations to sites. Can also trigger system-wide artifact deployments. Can be scoped per site.

9.4 Role Scoping

  • Admin is always system-wide.
  • Design is always system-wide.
  • Deployment can be system-wide or site-scoped, controlled by LDAP group membership (e.g., Deploy-SiteA, Deploy-SiteB, or Deploy-All).

10. Audit Logging

Audit logging is implemented as part of the Configuration Database component via the IAuditService interface.

10.1 Storage

  • Audit logs are stored in the configuration MS SQL database alongside system config data, enabling direct querying.
  • Entries are append-only — never modified or deleted. No retention policy — retained indefinitely.

10.2 Scope

All system-modifying actions are logged, including:

  • Template changes: Create, edit, delete templates.
  • Script changes: Template script and shared script create, edit, delete.
  • Alarm changes: Create, edit, delete alarm definitions.
  • Instance changes: Create, override values, bind connections, area assignment, disable, enable, delete.
  • Deployments: Who deployed what to which instance, and the result (success/failure).
  • System-wide artifact deployments: Who deployed shared scripts / external system definitions / DB connections / data connections / notification lists / SMTP config, to which site(s), and the result.
  • External system definition changes: Create, edit, delete.
  • Database connection changes: Create, edit, delete.
  • Notification list changes: Create, edit, delete lists and recipients.
  • Inbound API changes: API key create, enable/disable, delete. API method create, edit, delete.
  • Area changes: Create, edit, delete area definitions.
  • Site & data connection changes: Create, edit, delete.
  • Security/admin changes: Role mapping changes, site permission changes.

10.3 Detail Level

  • Each audit log entry records the state of the entity after the change, serialized as JSON. Only the after-state is stored — change history is reconstructed by comparing consecutive entries for the same entity at query time.
  • Each entry includes: who (authenticated user), what (action, entity type, entity ID, entity name), when (timestamp), and state (JSON after-state, null for deletes).
  • One entry per save operation — when a user edits a template and changes multiple attributes in one save, a single entry captures the full entity state.
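
For illustration, a single audit entry for a template edit might serialize as follows. Field names and values are hypothetical, but the fields correspond one-to-one to the who/what/when/state requirements above (for a delete, "state" would be null):

```json
{
  "user": "CORP\\jsmith",
  "action": "Edit",
  "entityType": "Template",
  "entityId": 1042,
  "entityName": "PumpStationBase",
  "timestampUtc": "2025-06-01T14:30:00Z",
  "state": {
    "name": "PumpStationBase",
    "attributes": [
      { "name": "Speed", "dataType": "Float" },
      { "name": "Status", "dataType": "String" }
    ]
  }
}
```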

10.4 Transactional Guarantee

  • Audit entries are written synchronously within the same database transaction as the change (via the unit-of-work pattern). If the change succeeds, the audit entry is guaranteed to be recorded. If the change rolls back, the audit entry rolls back too.
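
A sketch of how this guarantee might look in code, assuming an EF Core-style DbContext; the type and member names are illustrative, not the actual implementation:

```csharp
// Change + audit entry committed atomically: both persist or neither does.
await using var tx = await db.Database.BeginTransactionAsync();

db.Templates.Update(template);              // the configuration change
db.AuditEntries.Add(new AuditEntry          // audit row in the SAME transaction
{
    User = currentUser.Name,
    Action = "Edit",
    EntityType = "Template",
    EntityId = template.Id,
    EntityName = template.Name,
    TimestampUtc = DateTime.UtcNow,
    State = JsonSerializer.Serialize(template)  // after-state only (10.3)
});

await db.SaveChangesAsync();
await tx.CommitAsync();                     // a rollback undoes both writes
```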

11. Health Monitoring

11.1 Monitored Metrics

The central cluster monitors the health of each site cluster, including:

  • Site cluster online/offline status: Whether the site is reachable.
  • Active vs. standby node status: Which node is active and which is standby.
  • Data connection health: Connected/disconnected status per data connection at the site.
  • Script error rates: Frequency of script failures at the site.
  • Alarm evaluation errors: Frequency of alarm evaluation failures at the site.
  • Store-and-forward buffer depth: Number of messages currently queued (broken down by external system calls, notifications, and cached database writes).
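
As an illustration, a periodic health report covering these metrics might look like the following; the field names and values are hypothetical:

```json
{
  "siteCode": "SITE-A",
  "reportedAtUtc": "2025-06-01T14:30:00Z",
  "activeNode": "NodeA",
  "standbyNode": "NodeB",
  "dataConnections": [
    { "name": "PLC-Line1", "status": "Connected" },
    { "name": "PLC-Line2", "status": "Disconnected" }
  ],
  "scriptErrorsPerHour": 3,
  "alarmEvaluationErrorsPerHour": 0,
  "bufferDepths": {
    "externalSystemCalls": 12,
    "notifications": 0,
    "cachedDatabaseWrites": 158
  }
}
```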

11.2 Reporting

  • Site clusters report health metrics to central periodically.
  • Health status is visible in the central UI — no automated alerting/notifications for now.

12. Site-Level Event Logging

12.1 Events Logged

Sites log operational events locally, including:

  • Script executions: Start, complete, error (with error details).
  • Alarm events: Alarm activated, alarm cleared (which alarm, which instance, when). Alarm evaluation errors.
  • Deployment applications: Configuration received from central, applied successfully or failed. Script compilation results.
  • Data connection status changes: Connected, disconnected, reconnected per connection.
  • Store-and-forward activity: Message queued, delivered, retried, parked.
  • Instance lifecycle: Instance enabled, disabled, deleted.

12.2 Storage

  • Event logs are stored in local SQLite on each site node.
  • Retention policy: 30 days. Events older than 30 days are automatically purged.
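
A minimal sketch of the purge, assuming a hypothetical EventLog table whose Timestamp column holds ISO-8601 UTC text (SQLite's datetime('now') already returns UTC, matching Section 14.1):

```sql
-- Run periodically on each site node; deletes events older than 30 days.
DELETE FROM EventLog
WHERE Timestamp < datetime('now', '-30 days');
```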

12.3 Central Access

  • The central UI can query site event logs remotely, following the same pattern as parked message management — central requests data from the site over Akka.NET remoting.

13. Management Service & CLI

13.1 Management Service

  • The central cluster exposes a ManagementActor that provides programmatic access to all administrative operations — the same operations available through the Central UI.
  • The ManagementActor registers with Akka.NET ClusterClientReceptionist for cross-cluster access, and is also exposed via an HTTP Management API endpoint (POST /management) with Basic Auth, LDAP authentication, and role resolution — enabling external tools like the CLI to interact without Akka.NET dependencies.
  • The ManagementActor enforces the same role-based authorization as the Central UI. Every incoming message carries the authenticated user's identity and roles.
  • All mutating operations performed through the Management Service are audit logged via IAuditService, identical to operations performed through the Central UI.
  • The ManagementActor runs on every central node (stateless). For HTTP API access, any central node can handle any request without sticky sessions.

13.2 CLI

  • The system provides a standalone command-line tool (scadalink) for scripting and automating administrative operations.
  • The CLI connects to the Central Host's HTTP Management API (POST /management), sending commands as JSON with HTTP Basic Auth credentials. The server authenticates against LDAP/AD, resolves roles, and dispatches the command to the ManagementActor.
  • CLI commands mirror all Management Service operations: templates, instances, sites, data connections, deployments, external systems, notifications, security (API keys and role mappings), audit log queries, and health status.
  • Output is JSON by default (machine-readable, suitable for scripting) with an optional --format table flag for human-readable tabular output.
  • Configuration is resolved from command-line options, environment variables (SCADALINK_MANAGEMENT_URL, SCADALINK_FORMAT), or a configuration file (~/.scadalink/config.json).
  • The CLI is a separate executable from the Host binary — it is deployed on any machine with HTTP access to a central node.
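
For illustration, a ~/.scadalink/config.json might look like this; the key names are assumptions, chosen to mirror the documented SCADALINK_MANAGEMENT_URL and SCADALINK_FORMAT environment variables:

```json
{
  "managementUrl": "https://central.example.internal:8443/management",
  "format": "table"
}
```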

14. General Conventions

14.1 Timestamps

  • All timestamps throughout the system are stored, transmitted, and processed in UTC.
  • This applies to: attribute value timestamps, alarm state change timestamps, audit log entries, event log entries, deployment records, health reports, store-and-forward message timestamps, and all inter-node messages.
  • Local time conversion for display is a Central UI concern only — no other component performs timezone conversion.
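
In .NET terms, one way to honor this convention is the ISO-8601 round-trip ("o") format, which preserves the UTC designator end to end. This is a sketch of the convention, not a mandated implementation:

```csharp
using System;
using System.Globalization;

// Produce timestamps in UTC and serialize with the round-trip format.
var stamp = DateTime.UtcNow;
string wire = stamp.ToString("o");   // e.g. "2025-06-01T14:30:00.0000000Z"

// Parse back without losing the UTC kind.
var parsed = DateTime.Parse(wire, CultureInfo.InvariantCulture,
    DateTimeStyles.RoundtripKind);
// parsed.Kind is DateTimeKind.Utc; conversion to local time for display
// happens only in the Central UI.
```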

All initial high-level requirements have been captured. This document will continue to be updated as the design evolves.