Apply Codex review findings across all 17 components

Template Engine: add composed member addressing (path-qualified canonical names),
override granularity per entity type, semantic validation (call targets, arg types),
graph acyclicity enforcement, revision hashes for flattened configs.

Deployment Manager: add deployment ID + idempotency, per-instance operation lock
covering all mutating commands, state transition matrix, site-side apply atomicity
(all-or-nothing), artifact version compatibility policy.

Site Runtime: add script trust model (forbidden APIs, execution timeout, constrained
compilation), concurrency/serialization rules (Instance Actor serializes mutations),
site-wide stream backpressure (per-subscriber buffering, fire-and-forget publish).

Communication: add application-level correlation IDs for protocol safety beyond
Akka.NET transport guarantees.

External System Gateway: add 408/429 as transient errors, CachedCall idempotency
note, dedicated dispatcher for blocking I/O isolation.

Health Monitoring: add monotonic sequence numbers to prevent stale report overwrites.

Security: require LDAPS/StartTLS for LDAP connections.

Central UI: add failover behavior (SignalR reconnect, JWT survives, shared Data
Protection keys, load balancer readiness).

Cluster Infrastructure: add down-if-alone=on for safe singleton ownership.

Site Event Logging: clarify active-node-only logging (no replication), add 1GB
storage cap with oldest-first purge.

Host: add readiness gating (health check endpoint, no traffic until operational).

Commons: add message contract versioning policy (additive-only evolution).

Configuration Database: add optimistic concurrency on deployment status records.
Joseph Doherty
2026-03-16 09:06:12 -04:00
parent 70e5ae33d5
commit 34694adba2
13 changed files with 152 additions and 10 deletions

View File

@@ -14,6 +14,14 @@ Central cluster only. Sites have no user interface.
- Keeps the entire stack in C#/.NET, consistent with the rest of the system (Akka.NET, EF Core).
- SignalR provides built-in support for real-time UI updates.
## Failover Behavior
- A **load balancer** sits in front of the central cluster and routes to the active node.
- On central failover, the Blazor Server SignalR circuit is interrupted. The browser automatically attempts to reconnect via SignalR's built-in reconnection logic.
- Since sessions use **JWT tokens** (not server-side state), the user's authentication survives failover — the new active node validates the same JWT. No re-login required if the token is still valid.
- Active debug view streams and in-progress real-time subscriptions are lost on failover and must be re-opened by the user.
- Both central nodes share the same **ASP.NET Data Protection keys** (stored in the configuration database or shared configuration) so that tokens and anti-forgery tokens remain valid across failover.
## Real-Time Updates
All real-time features use **server push via SignalR** (built into Blazor Server):

View File

@@ -61,7 +61,8 @@ The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:
- On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
- **Stable-after duration**: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
-- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest accepts a brief potential dual-active window during true network partitions, which is safe because site state rebuilds from SQLite and central state is in MS SQL.
- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.
## Failure Detection Timing
@@ -81,7 +82,7 @@ These values balance failover speed with stability — fast enough that data col
If both nodes in a cluster fail simultaneously (e.g., site power outage):
1. **No manual intervention required.** Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
-2. **State recovery**:
2. **State recovery** (each node has its own local copy of all required data):
- **Site clusters**: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy. Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
- **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.

View File

@@ -107,6 +107,15 @@ Commons must define the shared DTOs and message contracts used for inter-compone
All message types must be `record` types or immutable classes suitable for use as Akka.NET messages (though Commons itself must not depend on Akka.NET).
### REQ-COM-5a: Message Contract Versioning
Since the system supports cross-site artifact version skew (sites may temporarily run different versions), message contracts must follow **additive-only evolution rules**:
- New fields may be added with default values. Existing fields must not be removed or have their types changed.
- Serialization must tolerate unknown fields (forward compatibility) and missing optional fields (backward compatibility).
- Breaking changes require a new message type and a coordinated deployment to all nodes.
- The Akka.NET serialization binding configuration (in the Host component) must explicitly map message types to serializers to prevent accidental binary serialization.
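The additive-only rules above can be sketched as a tolerant decoder. This is a minimal Python illustration (the real contracts are C# records); the `HealthReportV2` type and its fields are hypothetical examples, not actual Commons message types:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class HealthReportV2:
    """Contract evolves additively: fields are added with defaults, never removed."""
    site_id: str
    cpu_percent: float = 0.0   # added in v2; old senders omit it
    disk_free_mb: int = 0      # added in v2

def deserialize(cls, payload: dict):
    """Tolerant decode: ignore unknown fields, default missing optional ones."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in payload.items() if k in known})

# A v1 sender omits the new fields; a v3 sender includes a field we don't know yet.
old_msg = deserialize(HealthReportV2, {"site_id": "site-7"})
new_msg = deserialize(HealthReportV2,
                      {"site_id": "site-7", "cpu_percent": 41.5, "future_field": True})
```

Both directions of skew decode cleanly: missing optional fields fall back to defaults, and unknown fields are silently dropped.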
### REQ-COM-6: No Business Logic
Commons must contain only:

View File

@@ -107,6 +107,16 @@ Akka.NET remoting provides built-in connection management and failure detection.
These settings should be tuned for the expected network conditions between central and site clusters.
## Application-Level Correlation
All request/response messages include an application-level **correlation ID** to ensure correct pairing of requests and responses, even across reconnection events:
- Deployments include a **deployment ID** and **revision hash** for idempotency (see Deployment Manager).
- Lifecycle commands include a **command ID** for deduplication.
- Remote queries include a **query ID** for response correlation.
This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.
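The correlation mechanism amounts to a pending-request table keyed by ID, where unmatched responses (for example, from before a reconnect) are discarded. A minimal Python sketch, with hypothetical message shapes:

```python
import uuid

class Correlator:
    """Pairs responses to requests by correlation ID; unmatched responses are dropped."""
    def __init__(self):
        self.pending = {}  # correlation_id -> original request

    def send(self, request: dict) -> str:
        cid = str(uuid.uuid4())
        self.pending[cid] = request
        return cid

    def on_response(self, cid: str, payload):
        # A response from before a disconnect/reconnect has no pending entry: ignore.
        request = self.pending.pop(cid, None)
        if request is None:
            return None
        return (request, payload)

c = Correlator()
cid = c.send({"query": "instance-status"})
stale = c.on_response("unknown-id", {"ok": True})   # no pending entry -> dropped
matched = c.on_response(cid, {"ok": True})          # correct pairing
```

Because the pending entry is removed on first match, a duplicate response for the same ID is also ignored.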
## Message Ordering
Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee — messages to a given site are processed in the order they are sent. Callers do not need to handle out-of-order delivery.

View File

@@ -89,7 +89,7 @@ Repository interfaces are defined in **Commons** alongside the POCO entity class
| Repository Interface (in Commons) | Consuming Component | Scope |
|---|---|---|
| `ITemplateEngineRepository` | Template Engine | Templates, attributes, alarms, scripts, compositions, instances, overrides, connection bindings, areas |
-| `IDeploymentManagerRepository` | Deployment Manager | Deployment records, deployed configuration snapshots, system-wide artifact deployment records |
| `IDeploymentManagerRepository` | Deployment Manager | Current deployment status per instance, deployed configuration snapshots, system-wide artifact deployment status per site (no deployment history — audit log provides historical traceability) |
| `ISecurityRepository` | Security & Auth | LDAP group mappings, site scoping rules |
| `IInboundApiRepository` | Inbound API | API keys, API method definitions |
| `IExternalSystemRepository` | External System Gateway | External system definitions, method definitions, database connection definitions |
@@ -106,6 +106,7 @@ EF Core's DbContext naturally provides unit-of-work semantics:
- Multiple entity modifications within a single request are tracked by the DbContext.
- `SaveChangesAsync()` commits all pending changes in a single database transaction.
- If any part fails, the entire transaction rolls back.
- **Optimistic concurrency** is used on deployment status records and instance lifecycle state via EF Core `rowversion` / concurrency tokens. This prevents stale deployment status transitions (e.g., two concurrent requests both trying to update the same instance's status). Template editing remains **last-write-wins** by design — optimistic concurrency is intentionally not applied to template content.
- For operations that span multiple repository calls (e.g., creating a template with attributes, alarms, and scripts), the consuming component uses a single DbContext instance (via DI scoping) to ensure atomicity.
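The rowversion check behaves like a compare-and-swap on a version counter. A minimal in-memory sketch of the semantics (EF Core does this at the database level; the store and state names here are illustrative):

```python
class ConcurrencyError(Exception):
    pass

class StatusStore:
    """In-memory sketch of rowversion-style optimistic concurrency."""
    def __init__(self):
        self.rows = {}  # instance_id -> (status, version)

    def read(self, instance_id):
        return self.rows.get(instance_id, ("NotDeployed", 0))

    def update(self, instance_id, new_status, expected_version):
        _, version = self.read(instance_id)
        if version != expected_version:      # another writer got there first
            raise ConcurrencyError("stale deployment status update")
        self.rows[instance_id] = (new_status, version + 1)

store = StatusStore()
_, current_version = store.read("inst-1")
store.update("inst-1", "Deploying", current_version)   # first writer wins
try:
    store.update("inst-1", "Enabled", current_version)  # stale version loses the race
    lost_race = False
except ConcurrencyError:
    lost_race = True
```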
### Example Transactional Flow

View File

@@ -41,10 +41,27 @@ Engineer (UI) → Deployment Manager (Central)
└── 8. Update deployment status in config DB
```
-## Deployment Concurrency
## Deployment Identity & Idempotency
-- **Same instance**: A deployment to an instance is **blocked** if a previous deployment to that instance is still in progress (waiting for site response). The UI shows the deployment is in progress and rejects the second request. This prevents conflicting state at the site.
-- **Different instances**: Deployments to different instances can proceed **in parallel**, even at the same site. Each deployment tracks status independently. This supports the bulk "deploy all out-of-date instances" operation efficiently.
- Every deployment is assigned a unique **deployment ID** and includes the flattened configuration's **revision hash** (from the Template Engine).
- Site-side apply is **idempotent on deployment ID** — if the same deployment is received twice (e.g., after a timeout where the site actually applied it), the site responds with "already applied" rather than re-applying.
- Sites **reject stale configurations** — if a deployment carries an older revision hash than what is already applied, the site rejects it and reports the current version.
- After a central failover or timeout, the Deployment Manager **queries the site for current deployment state** before allowing a re-deploy. This prevents duplicate application and out-of-order config changes.
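The site-side checks above can be sketched as follows. One assumption made here for brevity: the revision is modeled as a monotonically increasing number so staleness is a simple comparison (the real system carries a revision hash, with ordering established by central):

```python
class SiteApplier:
    """Sketch: apply is idempotent on deployment ID and rejects stale revisions."""
    def __init__(self):
        self.applied = {}          # deployment_id -> applied config
        self.current_revision = -1

    def apply(self, deployment_id, revision, config):
        if deployment_id in self.applied:       # duplicate after a timeout/retry
            return "already-applied"
        if revision <= self.current_revision:   # older than what is running
            return f"stale-rejected (current={self.current_revision})"
        self.current_revision = revision
        self.applied[deployment_id] = config
        return "applied"

site = SiteApplier()
r1 = site.apply("dep-1", revision=1, config={"attrs": 3})
r2 = site.apply("dep-1", revision=1, config={"attrs": 3})  # retried after timeout
r3 = site.apply("dep-0", revision=0, config={"attrs": 2})  # out-of-date config
```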
## Operation Concurrency
All mutating operations on a single instance (deploy, disable, enable, delete) share a **per-instance operation lock**:
- Only one mutating operation per instance can be in-flight at a time. A second operation is rejected with an "operation in progress" error.
- **Different instances**: Operations on different instances can proceed **in parallel**, even at the same site. Each tracks status independently. This supports the bulk "deploy all out-of-date instances" operation efficiently.
### Allowed State Transitions
| Current State | Deploy | Disable | Enable | Delete |
|---------------|--------|---------|--------|--------|
| Enabled | Yes | Yes | No (already enabled) | Yes |
| Disabled | Yes (enables on apply) | No (already disabled) | Yes | Yes |
| Not deployed | Yes (initial deploy) | No | No | No |
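The transition matrix reduces to a small lookup table. An illustrative Python sketch mirroring the table above (state and operation names follow the table, not any actual enum in the codebase):

```python
# Allowed mutating operations per current instance state.
ALLOWED_OPS = {
    "Enabled":     {"Deploy", "Disable", "Delete"},
    "Disabled":    {"Deploy", "Enable", "Delete"},  # Deploy re-enables on apply
    "NotDeployed": {"Deploy"},                      # initial deploy only
}

def is_allowed(state: str, op: str) -> bool:
    return op in ALLOWED_OPS.get(state, set())
```

Guarding every mutating command with this check (under the per-instance operation lock) makes rejected transitions explicit and testable.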
## System-Wide Artifact Deployment Failure Handling
@@ -92,6 +109,20 @@ A deployment to a site includes the flattened instance configuration plus any sy
System-wide artifact deployment is a **separate action** from instance deployment, triggered explicitly by a user with the Deployment role.
## Site-Side Apply Atomicity
Applying a deployment at the site is **all-or-nothing per instance**:
- The site stores the new config, compiles all scripts, and creates/updates the Instance Actor as a single operation.
- If any step fails (e.g., script compilation), the entire deployment for that instance is rejected. The previous configuration remains active and unchanged.
- The site reports the specific failure reason (e.g., compilation error details) back to central.
## System-Wide Artifact Version Compatibility
- Cross-site version skew for artifacts (shared scripts, external system definitions, etc.) is **supported** — sites can temporarily run different artifact versions after a partial deployment.
- Artifacts are self-contained and site-independent. A site running an older version of shared scripts continues to operate correctly with its current instance configurations.
- The Central UI clearly indicates which sites have pending artifact updates so engineers can remediate.
## Instance Lifecycle Commands
The Deployment Manager sends the following commands to sites via the Communication Layer:

View File

@@ -94,9 +94,15 @@ Scripts choose between two call modes per invocation, mirroring the dual-mode da
- Each external system definition specifies a **timeout** that applies to all method calls on that system.
- Error classification by HTTP response:
-- **Transient failures** (connection refused, timeout, HTTP 5xx): Behavior depends on call mode — `CachedCall` buffers for retry; `Call` returns error to script.
-- **Permanent failures** (HTTP 4xx): Always returned to the calling script regardless of call mode. Logged to Site Event Logging.
- **Transient failures** (connection refused, timeout, HTTP 408, 429, 5xx): Behavior depends on call mode — `CachedCall` buffers for retry; `Call` returns error to script.
- **Permanent failures** (HTTP 4xx except 408/429): Always returned to the calling script regardless of call mode. Logged to Site Event Logging.
- This classification ensures the S&F buffer is not polluted with requests that will never succeed.
- **Idempotency note**: `CachedCall` retries may result in duplicate delivery if the external system received the original request but the response was lost. Callers should use `CachedCall` only for operations that are idempotent or where duplicate delivery is acceptable.
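The classification rule can be captured in a few lines. A hedged Python sketch of the decision logic (the gateway itself is C#; the function shape is illustrative):

```python
TRANSIENT_STATUS = {408, 429}  # request timeout, too many requests

def classify(status_code=None, network_error=False):
    """Transient errors may be retried via CachedCall; permanent ones never will succeed."""
    if network_error:                       # connection refused, socket timeout
        return "transient"
    if status_code is not None:
        if status_code in TRANSIENT_STATUS or 500 <= status_code <= 599:
            return "transient"
        if 400 <= status_code <= 499:       # other 4xx: retrying cannot help
            return "permanent"
    return "success"
```

Treating 408/429 as transient keeps legitimately retryable calls in the S&F buffer, while other 4xx responses are surfaced to the script immediately.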
## Blocking I/O Isolation
- External system HTTP calls and database operations are blocking I/O. Script Execution Actors (which are short-lived, per-invocation actors) execute these calls, ensuring that blocking does not starve the parent Script Actor or Instance Actor.
- The Akka.NET actor system should configure a **dedicated dispatcher** for Script Execution Actors to isolate blocking I/O from the default dispatcher used by coordination actors.
## Database Connection Management

View File

@@ -34,7 +34,7 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
## Reporting Protocol
- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
-- Each report is a **flat snapshot** containing the current values of all monitored metrics. Central replaces the entire previous state for that site on receipt.
- Each report is a **flat snapshot** containing the current values of all monitored metrics, a **monotonic sequence number**, and the **report timestamp** from the site. Central replaces the previous state for that site only if the incoming sequence number is higher than the last received — this prevents stale reports (e.g., delayed in transit or from a pre-failover node) from overwriting newer state.
- **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds** — 2x the report interval), the site is marked as **offline**. This gives one missed report as grace before marking offline.
- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment required — the metrics in the report provide immediate visibility into the site's condition.
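The stale-report guard is a per-site high-water mark on the sequence number. A minimal sketch (class and field names are illustrative):

```python
class HealthState:
    """Central-side: keep only the newest report per site, by sequence number."""
    def __init__(self):
        self.latest = {}  # site_id -> (seq, snapshot)

    def on_report(self, site_id, seq, snapshot):
        prev_seq = self.latest.get(site_id, (-1, None))[0]
        if seq <= prev_seq:
            return False   # stale (delayed or pre-failover) report: drop it
        self.latest[site_id] = (seq, snapshot)
        return True

h = HealthState()
first = h.on_report("site-1", 10, {"cpu": 20})
late = h.on_report("site-1", 9, {"cpu": 95})   # arrived out of order
```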

View File

@@ -57,6 +57,16 @@ Before the Akka.NET actor system is created, the Host must validate all required
- Site nodes must have non-empty SQLite path values.
- At least two seed nodes must be configured.
### REQ-HOST-4a: Readiness Gating
On central nodes, the ASP.NET Core web endpoints (Central UI, Inbound API) must **not accept traffic** until the node is fully operational:
- Akka.NET cluster membership is established.
- Database connectivity (MS SQL) is verified.
- Required cluster singletons are running (if applicable).
A standard ASP.NET Core health check endpoint (`/health/ready`) reports readiness status. The load balancer uses this endpoint to determine when to route traffic to the node. During startup or failover, the node returns `503 Service Unavailable` until ready.
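The readiness decision is the conjunction of the three checks. A language-agnostic sketch in Python (ASP.NET Core health checks would express this via `IHealthCheck` registrations; the function here is illustrative):

```python
def readiness_status(cluster_joined: bool, db_reachable: bool, singletons_running: bool):
    """Aggregate readiness for /health/ready: 200 only when every check passes."""
    checks = {
        "cluster": cluster_joined,
        "database": db_reachable,
        "singletons": singletons_running,
    }
    ready = all(checks.values())
    return (200 if ready else 503), checks

code_ok, _ = readiness_status(True, True, True)
code_wait, checks = readiness_status(True, False, True)   # DB not yet verified
```

Returning the per-check map alongside the status code gives the load balancer a single bit to act on while leaving the failing check visible for diagnostics.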
### REQ-HOST-5: Windows Service Hosting
The Host must support running as a Windows Service via `UseWindowsService()`. When launched outside of a Windows Service context (e.g., during development), it must run as a standard console application. No code changes or conditional compilation are required to switch between the two modes.

View File

@@ -18,6 +18,7 @@ Central cluster. Sites do not have user-facing interfaces and do not perform ind
## Authentication
- **Mechanism**: The Central UI presents a username/password login form. The app validates credentials by binding to the LDAP/AD server with the provided credentials, then queries the user's group memberships.
- **Transport security**: LDAP connections **must** use LDAPS (port 636) or StartTLS to encrypt credentials in transit. Unencrypted LDAP (port 389) is not permitted.
- **No local user store**: All identity and group information comes from AD. No credentials are cached locally.
- **No Windows Integrated Authentication**: The app authenticates directly against LDAP/AD, not via Kerberos/NTLM.

View File

@@ -40,8 +40,9 @@ Each event entry contains:
## Storage
- Events are stored in **local SQLite** on each site node.
-- Each node maintains its own event log (the active node generates events; the standby node generates minimal events related to replication).
- Each node maintains its own event log. Only the **active node** generates and stores events. Event logs are **not replicated** to the standby node. On failover, the new active node starts logging to its own SQLite database; historical events from the previous active node are no longer queryable via central until that node comes back online. This is acceptable because event logs are diagnostic, not transactional.
- **Retention**: 30 days. A **daily background job** runs on the active node and deletes all events older than 30 days. Hard delete — no archival.
- **Storage cap**: A configurable maximum database size (default: 1 GB) is enforced. If the storage cap is reached before the 30-day retention window, the oldest events are purged first. This prevents disk exhaustion from alarm storms, script failure loops, or connection flapping.
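The cap-with-oldest-first-purge policy can be sketched as a size-tracked queue. This in-memory Python illustration stands in for the SQLite-backed implementation; the byte accounting is simplified:

```python
from collections import deque

class CappedEventLog:
    """Sketch: enforce a byte cap by purging oldest events first."""
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.events = deque()   # (size, event), oldest at the left
        self.total = 0

    def append(self, event: str):
        size = len(event.encode("utf-8"))
        self.events.append((size, event))
        self.total += size
        while self.total > self.max_bytes:   # alarm storm: drop oldest first
            old_size, _ = self.events.popleft()
            self.total -= old_size

log = CappedEventLog(max_bytes=20)
for e in ["alarm-1", "alarm-2", "alarm-3"]:   # 7 bytes each
    log.append(e)
```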
## Central Access

View File

@@ -222,6 +222,17 @@ Available to all Script Execution Actors and Alarm Execution Actors:
---
## Script Trust Model
Scripts execute **in-process** with constrained access. The following restrictions are enforced at compilation and runtime:
- **Allowed**: Access to the Script Runtime API (GetAttribute, SetAttribute, CallScript, CallShared, ExternalSystem, Notify, Database), standard C# language features, basic .NET types (collections, string manipulation, math, date/time).
- **Forbidden**: File system access (`System.IO`), process spawning (`System.Diagnostics.Process`), threading (`System.Threading` — except async/await), reflection (`System.Reflection`), raw network access (`System.Net.Sockets`, `System.Net.Http` — must use `ExternalSystem.Call`), assembly loading, unsafe code.
- **Execution timeout**: Configurable per-script maximum execution time. Exceeding the timeout cancels the script and logs an error.
- **Memory**: Scripts share the host process memory. No per-script memory limit, but the execution timeout prevents runaway allocations.
These constraints are enforced by restricting the set of assemblies and namespaces available to the script compilation context.
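A deliberately naive sketch of such a compile-time gate, in Python for illustration. The real enforcement restricts the assemblies and namespaces available to the Roslyn compilation context (and still permits async/await); a substring scan like this would over-match and is shown only to make the rule concrete:

```python
FORBIDDEN_NAMESPACES = (
    "System.IO", "System.Diagnostics.Process", "System.Threading",
    "System.Reflection", "System.Net.Sockets", "System.Net.Http",
)

def check_script_source(source: str) -> list:
    """Naive scan: report forbidden namespace references in script source."""
    return [ns for ns in FORBIDDEN_NAMESPACES if ns in source]

ok = check_script_source('var t = Instance.GetAttribute("Temp");')
bad = check_script_source("System.IO.File.Delete(path);")
```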
## Script Scoping Rules
- Scripts can only read/write attributes on **their own instance** (via the parent Instance Actor).
@@ -232,6 +243,18 @@ Available to all Script Execution Actors and Alarm Execution Actors:
---
## Concurrency & Serialization
- The Instance Actor processes messages **sequentially** (standard Akka actor model). This means `SetAttribute` calls from concurrent Script Execution Actors are serialized at the Instance Actor, preventing race conditions on attribute state.
- Script Execution Actors may run concurrently, but all state mutations (attribute reads/writes, alarm state updates) are mediated through the parent Instance Actor's message queue.
- External side effects (external system calls, notifications, database writes) are not serialized — concurrent scripts may produce interleaved side effects. This is acceptable because each side effect is independent.
## Site-Wide Stream Backpressure
- The site-wide Akka stream uses **per-subscriber buffering** with bounded buffers. Each subscriber (debug view, future consumers) gets an independent buffer.
- If a subscriber falls behind (e.g., slow network on debug view), its buffer fills and oldest events are dropped. This does not affect other subscribers or the publishing Instance Actors.
- Instance Actors publish to the stream with **fire-and-forget** semantics — publishing never blocks the actor.
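The buffering policy above — independent bounded buffers, drop-oldest on overflow, non-blocking publish — can be sketched with a bounded deque per subscriber (Akka Streams expresses this with `Buffer` stages and `OverflowStrategy`; this Python version is illustrative):

```python
from collections import deque

class StreamHub:
    """Fire-and-forget publish with an independent bounded buffer per subscriber."""
    def __init__(self, buffer_size: int = 4):
        self.buffers = {}
        self.size = buffer_size

    def subscribe(self, subscriber_id: str):
        # deque with maxlen silently drops the oldest item when full
        self.buffers[subscriber_id] = deque(maxlen=self.size)

    def publish(self, event):
        # Never blocks: a slow subscriber only loses its own oldest events.
        for buf in self.buffers.values():
            buf.append(event)

hub = StreamHub(buffer_size=2)
hub.subscribe("debug-view")
for i in range(3):
    hub.publish(i)
```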
## Error Handling
### Script Errors

View File

@@ -64,6 +64,23 @@ Central cluster only. Sites receive flattened output and have no awareness of te
- Stored in the configuration database.
- Used for filtering/organizing instances in the UI.
## Composed Member Addressing
When a template composes a feature module, members from that module are addressed using a **path-qualified canonical name**: `[ModuleInstanceName].[MemberName]`. For nested compositions, the path extends: `[OuterModule].[InnerModule].[MemberName]`.
- All internal references (triggers, scripts, diffs, stream topics, UI display) use canonical names.
- The composing template's own members (not from a module) have no prefix — they are top-level names.
- Naming collision detection operates on canonical names, so two modules can define the same member name as long as their module instance names differ.
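The addressing and collision rules can be sketched directly. An illustrative Python version (module and member names are invented examples):

```python
def canonical_name(path, member):
    """Path-qualified canonical name: [OuterModule].[InnerModule].[Member]."""
    return ".".join(list(path) + [member])

def detect_collisions(members):
    """Collisions are judged on canonical names, so identical member names
    in different module instances do not collide."""
    seen, collisions = set(), []
    for path, member in members:
        name = canonical_name(path, member)
        if name in seen:
            collisions.append(name)
        seen.add(name)
    return collisions

members = [
    (("PumpA",), "Status"),   # PumpA.Status
    (("PumpB",), "Status"),   # PumpB.Status — same member name, different module
    ((), "Status"),           # top-level member of the composing template
    (("PumpA",), "Status"),   # duplicate -> collision
]
```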
## Override Granularity
Override and lock rules apply per entity type at the following granularity:
- **Attributes**: Value and Description are overridable. Data Type and Data Source Reference are fixed by the defining level. Lock applies to the entire attribute (when locked, no fields can be overridden).
- **Alarms**: Priority Level, Trigger Definition (thresholds/ranges/rates), Description, and On-Trigger Script reference are overridable. Name and Trigger Type (Value Match vs. Range vs. Rate of Change) are fixed. Lock applies to the entire alarm.
- **Scripts**: C# source code, Trigger configuration, minimum time between runs, and parameter/return definitions are overridable. Name is fixed. Lock applies to the entire script.
- **Composed module members**: A composing template or child template can override non-locked members inside a composed module using the canonical path-qualified name.
## Naming Collision Detection
When a template composes two or more feature modules, the system must check for naming collisions across:
@@ -104,6 +121,30 @@ Before a deployment is sent to a site, the Template Engine performs comprehensiv
- **Data connection binding completeness**: Every attribute with a data source reference has a data connection binding assigned on the instance, and the bound data connection name exists as a defined connection at the instance's site.
- **Exception**: Validation does **not** verify that data source relative paths resolve to real tags on physical devices — that is a runtime concern.
### Semantic Validation
Beyond compilation, the Template Engine performs static semantic checks:
- **Script call targets**: `Instance.CallScript()` and `Scripts.CallShared()` targets must reference scripts that exist in the flattened configuration or shared script library.
- **Argument compatibility**: Parameter count and data types at call sites must match the target script's parameter definitions.
- **Return type compatibility**: If a script call's return value is used, the return type definition must match the caller's expectations.
- **Trigger operand types**: Alarm triggers and script conditional triggers must reference attributes with compatible data types (e.g., Range Violation requires numeric attributes).
### Graph Acyclicity
The Template Engine enforces that inheritance and composition graphs are **acyclic**:
- A template cannot inherit from itself or from any descendant in its inheritance chain.
- A template cannot compose itself or any ancestor/descendant that would create a circular composition.
- These checks are performed on save.
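The on-save check is a standard directed-graph cycle detection. A compact DFS sketch in Python (edge map shape is illustrative; template names are invented):

```python
def has_cycle(edges):
    """DFS three-color cycle check over a template graph
    (inheritance and composition edges combined)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in edges.get(node, ()):
            c = color.get(nxt, WHITE)
            if c == GRAY:                      # back edge: cycle found
                return True
            if c == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(visit(n) for n in list(edges) if color.get(n, WHITE) == WHITE)
```

Rejecting the save when `has_cycle` returns true guarantees flattening always terminates.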
### Flattened Configuration Revision
Each flattened configuration output includes a **revision hash** (computed from the resolved content). This hash is used for:
- Staleness detection: comparing the deployed revision to the current template-derived revision without a full diff.
- Diff correlation: ensuring diffs are computed against a consistent baseline.
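A revision hash of this kind only works if the serialization is canonical — the same resolved content must always hash identically. A minimal sketch under that assumption (the hash algorithm and config shape are illustrative):

```python
import hashlib
import json

def revision_hash(flattened_config: dict) -> str:
    """Deterministic hash of a flattened configuration: canonical JSON, sorted keys."""
    canonical = json.dumps(flattened_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = revision_hash({"attributes": {"Temp": 5}, "scripts": []})
b = revision_hash({"scripts": [], "attributes": {"Temp": 5}})   # key order irrelevant
changed = revision_hash({"attributes": {"Temp": 6}, "scripts": []})
```

Staleness detection then reduces to a string comparison between the deployed hash and the current template-derived hash.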
### On-Demand Validation
The same validation logic is available to Design users in the Central UI without triggering a deployment. This allows template authors to check their work for errors during authoring.