# Component: Central–Site Communication

## Purpose

The Communication component manages all messaging between the central cluster and site clusters using Akka.NET. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs).

## Location

Both central and site clusters. Each side has communication actors that handle message routing.

## Responsibilities

- Establish and maintain Akka.NET remoting connections between central and each site cluster.
- Route messages between central and site clusters in a hub-and-spoke topology.
- Broker requests from external systems (via central) to sites and return responses.
- Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
- Detect site connectivity status for health monitoring.

## Communication Patterns

### 1. Deployment (Central → Site)

- **Pattern**: Request/Response.
- Central sends a flattened configuration to a site.
- Site Runtime receives, compiles scripts, creates/updates Instance Actors, and responds with success/failure.
- No buffering at central: if the site is unreachable, the deployment fails immediately.

### 2. Instance Lifecycle Commands (Central → Site)

- **Pattern**: Request/Response.
- Central sends disable, enable, or delete commands for specific instances.
- Site Runtime processes the command and responds with success/failure.
- If the site is unreachable, the command fails immediately (no buffering).

### 3. System-Wide Artifact Deployment (Central → All Sites)

- **Pattern**: Broadcast with per-site acknowledgment.
- When shared scripts, external system definitions, database connections, or notification lists are explicitly deployed, central sends them to all sites.
- Each site acknowledges receipt and reports success/failure independently.

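The broadcast pattern above can be illustrated with a minimal, language-neutral sketch in Python. It is not the Akka.NET implementation: `SiteAck`, `broadcast_artifacts`, and the `send_to_site` callback are hypothetical names, and the loop sends sequentially only to keep the example short. The point it shows is that each site's outcome is recorded on its own, so one failing or unreachable site does not hide the results of the others.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class SiteAck:
    """Outcome reported by (or recorded on behalf of) one site."""
    site: str
    success: bool
    detail: str = ""


def broadcast_artifacts(
    sites: Iterable[str],
    send_to_site: Callable[[str], SiteAck],
) -> Dict[str, SiteAck]:
    """Send the artifact set to every site and collect one ack per site.

    A failure at one site (including an unreachable site) is recorded
    and does not stop delivery to the remaining sites.
    """
    results: Dict[str, SiteAck] = {}
    for site in sites:
        try:
            results[site] = send_to_site(site)
        except Exception as exc:  # e.g. a timeout or an unreachable site
            results[site] = SiteAck(site, False, str(exc))
    return results
```

A caller would typically show the per-site results side by side, for example listing which sites succeeded and which need the deployment repeated.
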
### 4. Integration Routing (External System → Central → Site → Central → External System)

- **Pattern**: Request/Response (brokered).
- External system sends a request to central (e.g., MES requests machine values).
- Central routes the request to the appropriate site.
- Site reads values from the Instance Actor and responds.
- Central returns the response to the external system.

### 5. Recipe/Command Delivery (External System → Central → Site)

- **Pattern**: Fire-and-forget with acknowledgment.
- External system sends a command to central (e.g., recipe manager sends recipe).
- Central routes to the site.
- Site applies and acknowledges.

### 6. Debug Streaming (Site → Central)

- **Pattern**: Subscribe/stream with initial snapshot.
- Central sends a subscribe request for a specific instance (identified by unique name).
- Site requests a **snapshot** of all current attribute values and alarm states from the Instance Actor and sends it to central.
- Site then subscribes to the **site-wide Akka stream** filtered by the instance's unique name and forwards attribute value changes and alarm state changes to central.
- Attribute value stream messages: `[InstanceUniqueName].[AttributePath].[AttributeName]`, value, quality, timestamp.
- Alarm state stream messages: `[InstanceUniqueName].[AlarmName]`, state (active/normal), priority, timestamp.
- Central sends an unsubscribe request when the debug view closes. The site removes its stream subscription.
- The stream is session-based and temporary.

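The snapshot-then-stream sequence above can be sketched as follows. This is an illustrative Python model, not the Akka Streams code: `StreamEvent`, `belongs_to`, and `debug_session` are hypothetical names, and the sketch assumes instance unique names contain no dots, so that the name plus a `.` separator identifies the instance unambiguously within an address.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class StreamEvent:
    address: str   # e.g. "Pump01.Motor.Speed" or "Pump01.HighTemp"
    payload: dict  # value/quality/timestamp or state/priority/timestamp


def belongs_to(event: StreamEvent, instance_name: str) -> bool:
    """True if the event's address starts with '<instance_name>.'.

    Matching on the name plus the dot separator avoids treating
    'Pump01' as a prefix match for 'Pump010'.
    """
    return event.address.startswith(instance_name + ".")


def debug_session(
    instance_name: str,
    snapshot: List[StreamEvent],
    site_stream: Iterable[StreamEvent],
) -> Iterator[StreamEvent]:
    """Yield the initial snapshot, then only this instance's changes."""
    yield from snapshot
    for event in site_stream:
        if belongs_to(event, instance_name):
            yield event
```

Sending the snapshot first gives the debug view a complete starting state, so later change events only need to carry deltas.
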
### 7. Health Reporting (Site → Central)

- **Pattern**: Periodic push.
- Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.

### 8. Remote Queries (Central → Site)

- **Pattern**: Request/Response.
- Central queries sites for:
  - Parked messages (store-and-forward dead letters).
  - Site event logs.
- Central can also send management commands:
  - Retry or discard parked messages.

## Topology

```
Central Cluster
  ├── Akka.NET Remoting → Site A Cluster
  ├── Akka.NET Remoting → Site B Cluster
  └── Akka.NET Remoting → Site N Cluster
```

- Sites do **not** communicate with each other.
- All inter-cluster communication flows through central.

## Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

| Pattern | Default Timeout | Rationale |
|---------|----------------|-----------|
| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
| 2. Instance Lifecycle | 30 seconds | Stopping and starting actors is fast |
| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
| 4. Integration Routing | 30 seconds | External system waiting for response; Inbound API per-method timeout may cap this further |
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |

Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.

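As a rough illustration of how the defaults and configuration overrides above might combine, the following Python sketch resolves an effective timeout per pattern. The key names, `resolve_timeout`, and the `caller_cap_s` parameter are assumptions made for the example; the document does not specify the configuration schema. The cap models the note that an Inbound API per-method timeout may shorten the Integration Routing timeout.

```python
from typing import Mapping, Optional

# Defaults from the table above, in seconds. Key names are hypothetical.
DEFAULT_TIMEOUTS_S = {
    "deployment": 120,
    "instance_lifecycle": 30,
    "system_wide_artifacts": 120,  # applied per site
    "integration_routing": 30,
    "recipe_command_delivery": 30,
    "remote_queries": 30,
}


def resolve_timeout(
    pattern: str,
    overrides: Optional[Mapping[str, float]] = None,
    caller_cap_s: Optional[float] = None,
) -> float:
    """Return the effective timeout for a pattern, in seconds.

    A configured override replaces the default; an optional caller cap
    can only shorten the result, never extend it. Unknown pattern names
    raise KeyError.
    """
    default = DEFAULT_TIMEOUTS_S[pattern]
    timeout = float((overrides or {}).get(pattern, default))
    if caller_cap_s is not None:
        timeout = min(timeout, caller_cap_s)
    return timeout
```

For example, `resolve_timeout("deployment", {"deployment": 300})` yields the overridden value, while `resolve_timeout("integration_routing", caller_cap_s=10)` yields the shorter cap.
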
## Transport Configuration

Akka.NET remoting provides built-in connection management and failure detection. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:

- **Transport heartbeat interval**: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
- **Failure detection threshold**: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
- **Reconnection**: Akka.NET remoting handles reconnection automatically. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.

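The arithmetic behind the failure detection example can be written down as a small helper, which is handy when comparing candidate settings against the request timeouts. This is only a sketch of the rule as stated above (interval multiplied by missed-heartbeat count); `detection_window_s` is a hypothetical name, and the actual settings live in Akka.NET's configuration, whose option names are not given in this document.

```python
def detection_window_s(heartbeat_interval_s: float, missed_threshold: int) -> float:
    """Approximate time before a silent connection is considered lost."""
    if heartbeat_interval_s <= 0 or missed_threshold < 1:
        raise ValueError("interval must be positive and threshold at least 1")
    return heartbeat_interval_s * missed_threshold
```

With the example values from the text, `detection_window_s(5, 3)` gives 15 seconds, which is shorter than every default request timeout, so a lost connection would usually be detected before an in-flight request times out.
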
## Application-Level Correlation

All request/response messages include an application-level **correlation ID** to ensure correct pairing of requests and responses, even across reconnection events:

- Deployments include a **deployment ID** and **revision hash** for idempotency (see Deployment Manager).
- Lifecycle commands include a **command ID** for deduplication.
- Remote queries include a **query ID** for response correlation.

This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.

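A minimal Python sketch of the two ideas above: pairing replies to requests by correlation ID, and deduplicating commands by command ID. `PendingRequests` and `CommandDeduplicator` are hypothetical names, not types from the design. The sketch keeps everything in memory and never evicts seen command IDs, which a real implementation would presumably bound or persist.

```python
import uuid
from typing import Dict, Optional, Set


class PendingRequests:
    """Tracks outstanding requests so replies are paired by correlation ID."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}  # correlation ID -> request kind

    def register(self, kind: str) -> str:
        """Record a new outgoing request and return its correlation ID."""
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = kind
        return correlation_id

    def expire(self, correlation_id: str) -> None:
        """Forget a request whose timeout has elapsed."""
        self._pending.pop(correlation_id, None)

    def accept(self, correlation_id: str) -> Optional[str]:
        """Return the request kind for a matching reply, or None if the
        reply is unknown or late (already expired or already answered)."""
        return self._pending.pop(correlation_id, None)


class CommandDeduplicator:
    """Receiver-side guard: lets each command ID through at most once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def should_apply(self, command_id: str) -> bool:
        if command_id in self._seen:
            return False
        self._seen.add(command_id)
        return True
```

A reply that arrives after a reconnect for a request that has already timed out returns `None` from `accept` and can be dropped, which is the kind of mismatch the correlation ID is meant to catch.
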
## Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee: messages sent from one communication actor to its counterpart at a given site are processed in the order they were sent. Callers do not need to handle out-of-order delivery.

## Connection Failure Behavior

- **In-flight messages**: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central**; the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
- **Debug streams**: Any connection interruption (failover or network blip) kills the debug stream. The engineer must reopen the debug view in the Central UI to re-establish the subscription with a fresh snapshot. There is no auto-resume.

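The in-flight behavior described above (one attempt, a timeout, and a failure reported to the caller with no retry) can be modeled with a short Python `asyncio` sketch. It stands in for the Akka.NET ask pattern only conceptually; `ask_once` and the `send` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


async def ask_once(
    send: Callable[[], Awaitable[str]],
    timeout_s: float,
) -> Tuple[bool, str]:
    """Await a single reply; report failure on timeout without retrying."""
    try:
        reply = await asyncio.wait_for(send(), timeout=timeout_s)
        return True, reply
    except asyncio.TimeoutError:
        # No retry and no buffering: the caller surfaces this failure
        # so that the engineer can re-initiate the action.
        return False, f"no response within {timeout_s} s"
```

Keeping the retry decision with the engineer means a request is never replayed against a site whose state may have changed during the outage.
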
## Failover Behavior

- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Sites reconnect to the new active central node.
- **Site failover**: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central detects the node change and reconnects. Ongoing debug streams are interrupted and must be re-established by the engineer.

## Dependencies

- **Akka.NET Remoting**: Provides the transport layer.
- **Cluster Infrastructure**: Manages node roles and failover detection.

## Interactions

- **Deployment Manager (central)**: Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and receive status.
- **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
- **Central UI**: Debug view requests and remote queries flow through communication.
- **Health Monitoring**: Receives periodic health reports from sites.
- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication.
- **Site Event Logging**: Event log queries are routed through communication.