Apply Codex review findings across all 17 components

Template Engine: add composed member addressing (path-qualified canonical names), override granularity per entity type, semantic validation (call targets, arg types), graph acyclicity enforcement, revision hashes for flattened configs. Deployment Manager: add deployment ID + idempotency, per-instance operation lock covering all mutating commands, state transition matrix, site-side apply atomicity (all-or-nothing), artifact version compatibility policy. Site Runtime: add script trust model (forbidden APIs, execution timeout, constrained compilation), concurrency/serialization rules (Instance Actor serializes mutations), site-wide stream backpressure (per-subscriber buffering, fire-and-forget publish). Communication: add application-level correlation IDs for protocol safety beyond Akka.NET transport guarantees. External System Gateway: add 408/429 as transient errors, CachedCall idempotency note, dedicated dispatcher for blocking I/O isolation. Health Monitoring: add monotonic sequence numbers to prevent stale report overwrites. Security: require LDAPS/StartTLS for LDAP connections. Central UI: add failover behavior (SignalR reconnect, JWT survives, shared Data Protection keys, load balancer readiness). Cluster Infrastructure: add down-if-alone=on for safe singleton ownership. Site Event Logging: clarify active-node-only logging (no replication), add 1GB storage cap with oldest-first purge. Host: add readiness gating (health check endpoint, no traffic until operational). Commons: add message contract versioning policy (additive-only evolution). Configuration Database: add optimistic concurrency on deployment status records.
2026-03-16 09:06:12 -04:00
parent 70e5ae33d5
commit 34694adba2
13 changed files with 152 additions and 10 deletions
--- a/Component-ClusterInfrastructure.md
+++ b/Component-ClusterInfrastructure.md
@@ -61,7 +61,8 @@ The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:

 - On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
 - **Stable-after duration**: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest accepts a brief potential dual-active window during true network partitions, which is safe because site state rebuilds from SQLite and central state is in MS SQL.
+- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
+- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.

 ## Failure Detection Timing

@@ -81,7 +82,7 @@ These values balance failover speed with stability — fast enough that data col
 If both nodes in a cluster fail simultaneously (e.g., site power outage):

 1. **No manual intervention required.** Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
-2. **State recovery**:
+2. **State recovery** (each node has its own local copy of all required data):
   - **Site clusters**: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy. Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
   - **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
 3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.