Apply Codex review findings across all 17 components

Template Engine: add composed member addressing (path-qualified canonical names), override granularity per entity type, semantic validation (call targets, arg types), graph acyclicity enforcement, revision hashes for flattened configs.
Deployment Manager: add deployment ID + idempotency, per-instance operation lock covering all mutating commands, state transition matrix, site-side apply atomicity (all-or-nothing), artifact version compatibility policy.
Site Runtime: add script trust model (forbidden APIs, execution timeout, constrained compilation), concurrency/serialization rules (Instance Actor serializes mutations), site-wide stream backpressure (per-subscriber buffering, fire-and-forget publish).
Communication: add application-level correlation IDs for protocol safety beyond Akka.NET transport guarantees.
External System Gateway: add 408/429 as transient errors, CachedCall idempotency note, dedicated dispatcher for blocking I/O isolation.
Health Monitoring: add monotonic sequence numbers to prevent stale report overwrites.
Security: require LDAPS/StartTLS for LDAP connections.
Central UI: add failover behavior (SignalR reconnect, JWT survives, shared Data Protection keys, load balancer readiness).
Cluster Infrastructure: add down-if-alone=on for safe singleton ownership.
Site Event Logging: clarify active-node-only logging (no replication), add 1GB storage cap with oldest-first purge.
Host: add readiness gating (health check endpoint, no traffic until operational).
Commons: add message contract versioning policy (additive-only evolution).
Configuration Database: add optimistic concurrency on deployment status records.
2026-03-16 09:06:12 -04:00
parent 70e5ae33d5
commit 34694adba2
13 changed files with 152 additions and 10 deletions
--- a/Component-HealthMonitoring.md
+++ b/Component-HealthMonitoring.md
@@ -34,7 +34,7 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
 ## Reporting Protocol

 - Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
- Each report is a **flat snapshot** containing the current values of all monitored metrics. Central replaces the entire previous state for that site on receipt.
+- Each report is a **flat snapshot** containing the current values of all monitored metrics, a **monotonic sequence number**, and the **report timestamp** from the site. Central replaces the previous state for that site only if the incoming sequence number is higher than the last one received for that site. This prevents stale reports (e.g., delayed in transit or sent by a pre-failover node) from overwriting newer state.
 - **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds** — 2x the report interval), the site is marked as **offline**. This gives one missed report as grace before marking offline.
 - **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment required — the metrics in the report provide immediate visibility into the site's condition.
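
The reporting rules above (sequence-gated replacement, offline detection, online recovery) can be illustrated with a minimal sketch. It is not the project's implementation: the names `HealthReport`, `SiteState`, `HealthAggregator`, `receive`, and `sweep` are hypothetical, and the clock is passed in explicitly to keep the example deterministic. It also assumes that a rejected stale report does not count toward online recovery, and it does not address how a site keeps its sequence increasing across a failover, since the text above does not specify either point.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HealthReport:
    site_id: str
    sequence: int       # monotonic sequence number assigned by the site
    timestamp: float    # report timestamp from the site (seconds)
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass
class SiteState:
    last_report: Optional[HealthReport] = None
    last_received_at: float = 0.0   # central's clock when a report was last accepted
    online: bool = False


class HealthAggregator:
    """Hypothetical central-side aggregator; illustrative only."""

    def __init__(self, offline_timeout: float = 60.0):
        # Default mirrors the documented 60 s window (2x the 30 s interval).
        self.offline_timeout = offline_timeout
        self.sites: Dict[str, SiteState] = {}

    def receive(self, report: HealthReport, now: float) -> bool:
        """Replace the site's state only if the sequence number is higher."""
        state = self.sites.setdefault(report.site_id, SiteState())
        last = state.last_report
        if last is not None and report.sequence <= last.sequence:
            return False  # stale or duplicate report: keep the newer state
        state.last_report = report      # flat snapshot: replace wholesale
        state.last_received_at = now
        state.online = True             # online recovery, no acknowledgment needed
        return True

    def sweep(self, now: float) -> None:
        """Mark sites offline when no report was accepted within the timeout."""
        for state in self.sites.values():
            if state.online and now - state.last_received_at > self.offline_timeout:
                state.online = False
```

In this sketch, a report with sequence 1 that arrives after sequence 2 is dropped, so the metrics from sequence 2 remain visible. A periodic call to `sweep` stands in for whatever timer central actually uses for offline detection.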