fix(high-severity): close 9 of 10 open High findings across 8 modules

Comm-016: delete dead HandleConnectionStateChanged + _debugSubscriptions / _inProgressDeployments tracking + ConnectionStateChanged message record. Disconnect detection is owned by the transport layers (gRPC keepalive PING ~25s; Ask-timeout at CommunicationService). Updates the Component-Communication.md design doc to make that explicit. SnF-018: NotificationForwarder.DeliverAsync now discards a corrupt buffered payload (Warning log + return true) instead of returning false and parking the row — honoring the design's "notifications do not park" invariant. DM-018: reconciliation no longer force-sets Enabled, preserving an intentional Disabled state after central failover. ESG-018: DeliverBufferedAsync (both ExternalSystemClient + DatabaseGateway) catches JsonException and returns false, turning a corrupt buffered row into a parked operation instead of a retry-forever poison message. InboundAPI-022: register ActiveNodeGate as IActiveNodeGate in the Central DI branch so standby-node gating is actually wired up in production. NS-019: remove orphaned NotificationDeliveryService / INotificationDeliveryService / NotificationResult; central notification delivery now lives entirely in NotificationOutbox. SEL-016: normalise From/To filters to UTC before ISO-string compare so non-UTC DateTimeOffset clients no longer get spuriously excluded events. TE-017: include Description on attributes/alarms and a HashableConnections projection (protocol, endpoint JSON, failover count) in the revision hash and DiffService; staleness detection now catches description-only and connection-endpoint edits. Transport-001 and Transport-002 (also High) remain Open — they're being handled in a follow-up batch because both touch BundleImporter.cs and must serialise.
2026-05-28 05:40:15 -04:00
parent f936f55f51
commit ac96b83b08
38 changed files with 852 additions and 1729 deletions
@@ -224,8 +224,10 @@ The ManagementActor is registered at the well-known path `/user/management` on c

 ## Connection Failure Behavior

- **In-flight messages**: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central** — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
- **Debug streams**: Any gRPC stream interruption triggers reconnection logic in the `DebugStreamBridgeActor`. The bridge actor attempts to reconnect to the other site node endpoint (NodeB if NodeA failed, or vice versa), with up to 3 retries and 5-second backoff. If all retries fail, the consumer is notified via `OnStreamTerminated` and the bridge actor is stopped. Events during the reconnection gap are lost (acceptable for real-time debug view). On successful reconnection, the consumer can request a fresh snapshot to re-sync state.
+Disconnect is detected at the **transport layer**, never via an application-level signal from central. There is no `ConnectionStateChanged`-style synchronous notification: the central coordinator does not maintain a model of "this site is up / down" because the two transports already report unavailability at their natural cadence.
+
+- **In-flight command/control messages (ClusterClient + Ask)**: When a connection drops while a request is in flight (e.g., a deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central** — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages. An in-progress deployment whose round-trip exceeds the Ask timeout (default 120 s at `CommunicationService.DeployInstanceAsync`) surfaces as `DeploymentStatus.Failed` to the caller.
+- **Debug streams (gRPC)**: Any gRPC stream interruption is detected by the HTTP/2 keepalive PING (~25 s) and triggers reconnection logic in the `DebugStreamBridgeActor`. The bridge actor attempts to reconnect to the other site node endpoint (NodeB if NodeA failed, or vice versa), with up to 3 retries and 5-second backoff. If all retries fail, the consumer is notified via `OnStreamTerminated` and the bridge actor is stopped. Events during the reconnection gap are lost (acceptable for real-time debug view). On successful reconnection, the consumer can request a fresh snapshot to re-sync state.

 ## Failover Behavior