diff --git a/AkkaDotNet/01-Actors.md b/AkkaDotNet/01-Actors.md new file mode 100644 index 0000000..c0a8d71 --- /dev/null +++ b/AkkaDotNet/01-Actors.md @@ -0,0 +1,188 @@ +# 01 — Actors (Akka Core Library) + +## Overview + +Actors are the foundational building block of the entire SCADA system. Every machine, protocol adapter, alarm handler, and command dispatcher is modeled as an actor. Actors encapsulate state and behavior, communicate exclusively through asynchronous message passing, and form supervision hierarchies that provide automatic fault recovery. + +In our system, the actor model provides three critical capabilities: an actor-per-device model where each of the 50–500 machines is represented by its own actor instance, a protocol abstraction hierarchy where a common base actor defines the machine comms API and derived actors implement OPC-UA or custom protocol specifics, and supervision trees that isolate failures in individual device connections from the rest of the system. + +## When to Use + +- Every machine/device should be represented as an actor — this is the core modeling decision +- Protocol adapters (OPC-UA, custom) should be actors that own their connection lifecycle +- Command dispatchers, alarm aggregators, and tag subscription managers are all actors +- Any component that needs isolated state, failure recovery, or concurrent operation + +## When Not to Use + +- Simple data transformations with no state — use plain functions or Streams stages instead +- One-off request/response patterns with external HTTP APIs — consider using standard async/await unless the call needs supervision or retry logic +- Database access layers — wrap in actors only if you need supervision; otherwise inject repository services via DI + +## Design Decisions for the SCADA System + +### Actor-Per-Device Model + +Each physical machine gets its own actor instance. 
The actor holds the current known state of the device (tag values, connection status, last command sent, pending command acknowledgments). This provides natural isolation — if communication with Machine #47 fails, only that actor is affected.

```
/user
  /device-manager
    /machine-001 (OpcUaDeviceActor)
    /machine-002 (CustomProtocolDeviceActor)
    /machine-003 (OpcUaDeviceActor)
    ...
```

### Protocol Abstraction Hierarchy

Define a common message protocol (interface) that all device actors respond to, regardless of the underlying communication protocol. The actor hierarchy looks like:

```
IDeviceActor (message contract)
  ├── OpcUaDeviceActor — implements comms via OPC-UA, subscribes to tags
  └── CustomDeviceActor — implements comms via legacy custom protocol
```

Both actor types handle the same inbound messages (`SubscribeToTag`, `SendCommand`, `GetDeviceState`, `ConnectionLost`, etc.) but implement the underlying transport differently. The rest of the system interacts only with the message contract, never with protocol specifics.

### Supervision Strategy

The device manager actor supervises all device actors with a `OneForOneStrategy`:

- `CommunicationException` → Restart the device actor (re-establishes the connection)
- `ConfigurationException` → Stop the actor (broken config won't fix itself on restart)
- `Exception` (general) → Restart with exponential backoff (to avoid reconnection storms)
- Escalate only if the device manager itself hits an unrecoverable error

### Message Design

Messages should be immutable records. Use C# `record` types:

```csharp
public record SubscribeToTag(string TagName, IActorRef Subscriber);
public record TagValueChanged(string TagName, object Value, DateTime Timestamp);
public record SendCommand(string CommandId, string TagName, object Value);
public record CommandAcknowledged(string CommandId, bool Success, string? Error);
public record GetDeviceState();
public record DeviceState(string DeviceId, ConnectionStatus Status, IReadOnlyDictionary<string, object> Tags);
```

## Common Patterns

### Stash for Connection Lifecycle

Device actors may receive commands before the connection is established. Use `IWithStash` to buffer messages during the connecting phase, then unstash when the connection is ready (`ConnectionEstablished` here is an illustrative internal message):

```csharp
public class OpcUaDeviceActor : ReceiveActor, IWithStash
{
    public IStash Stash { get; set; }

    public OpcUaDeviceActor(DeviceConfig config)
    {
        Connecting();
    }

    private void Connecting()
    {
        Receive<ConnectionEstablished>(msg =>
        {
            Stash.UnstashAll();
            Become(Online);
        });
        ReceiveAny(msg => Stash.Stash()); // buffer everything else until online
    }

    private void Online()
    {
        Receive<SendCommand>(HandleCommand);
        Receive<ConnectionLost>(msg => Become(Connecting));
    }
}
```

### Become/Unbecome for State Machines

Device actors naturally have states: `Connecting`, `Online`, `Faulted`, `Maintenance`. Use `Become()` to switch message handlers cleanly rather than littering `if/else` checks throughout.

### Ask Pattern — Use Sparingly

Prefer `Tell` (fire-and-forget with a reply-to address) over `Ask` (request-response). `Ask` allocates a temporary actor and imposes a timeout on every call. Use `Ask` only at system boundaries (e.g., when an external API needs a synchronous response from a device actor).

## Anti-Patterns

### Sharing Mutable State Between Actors

Never pass mutable objects in messages. Each actor's state is private. If two actors need the same data, send immutable copies via messages. This is especially critical in the SCADA context — two device actors should never share a connection object or a mutable tag cache.

### God Actor

Avoid creating a single "SCADA Manager" actor that handles all devices, alarms, commands, and reporting. Decompose into a hierarchy: device manager → device actors, alarm manager → alarm processors, command queue → command handlers.
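As a concrete illustration of that decomposition, here is a minimal sketch of a thin device-manager layer that only creates and supervises children. `DeviceConfig`, `DeviceProtocol`, and `CustomProtocolDeviceActor` are assumed names for illustration, not APIs defined elsewhere in this repo:

```csharp
public class DeviceManagerActor : ReceiveActor
{
    public DeviceManagerActor(IReadOnlyList<DeviceConfig> devices)
    {
        foreach (var device in devices)
        {
            // One child per machine; the protocol selects the concrete actor type
            var props = device.Protocol == DeviceProtocol.OpcUa
                ? Props.Create(() => new OpcUaDeviceActor(device))
                : Props.Create(() => new CustomProtocolDeviceActor(device));

            Context.ActorOf(props, $"machine-{device.Id}");
        }

        // The manager only routes and supervises; no tag or alarm logic lives here
        Receive<GetDeviceState>(msg => { /* forward to the addressed child */ });
    }
}
```

The manager stays a thin coordination layer; all protocol work happens in the children.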
### Blocking Inside Actors

Never block an actor's message-processing thread with synchronous I/O (e.g., `Thread.Sleep`, synchronous socket reads, `.Result` on tasks). This stalls the actor's mailbox and can cascade into thread-pool starvation. Use `ReceiveAsync` or `PipeTo` for async operations (`ReadTag` is an illustrative request message):

```csharp
// WRONG
Receive<ReadTag>(msg =>
{
    var value = opcClient.ReadValueSync(msg.TagName); // blocks the dispatcher thread!
    Sender.Tell(new TagValue(value));
});

// RIGHT
ReceiveAsync<ReadTag>(async msg =>
{
    var value = await opcClient.ReadValueAsync(msg.TagName);
    Sender.Tell(new TagValue(value));
});
```

### Over-Granular Actors

Not everything needs to be an actor. A device actor can internally use plain objects to manage tag dictionaries, parse protocol frames, or compute alarm thresholds. Only promote a component to a child actor if it needs its own mailbox, lifecycle, or supervision.

## Configuration Guidance

### Dispatcher Tuning

For 500 device actors doing I/O-bound work (network communication with equipment), the default dispatcher is usually sufficient. If profiling shows contention, consider a dedicated dispatcher for device actors:

```hocon
device-dispatcher {
  type = Dispatcher
  throughput = 10
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 4
    parallelism-factor = 1.0
    parallelism-max = 16
  }
}
```

### Mailbox Configuration

The default unbounded mailbox is fine for most SCADA scenarios. If tag subscription updates are very high frequency (thousands per second per device), consider a bounded mailbox to cap memory growth from stale updates. Note that with a zero push-timeout, messages arriving while the mailbox is full are dropped to dead letters:

```hocon
bounded-device-mailbox {
  mailbox-type = "Akka.Dispatch.BoundedMailbox, Akka"
  mailbox-capacity = 1000
  mailbox-push-timeout-time = 0s # overflow goes to dead letters immediately
}
```

### Dead Letter Monitoring

In a SCADA system, dead letters often indicate a device actor has been stopped but something is still trying to send to it.
Subscribe to dead letters for monitoring:

```csharp
system.EventStream.Subscribe(monitoringActor, typeof(DeadLetter));
```

## References

- Official Documentation:
- Actor Concepts:
- Supervision:
- Dispatchers:

diff --git a/AkkaDotNet/02-Remoting.md b/AkkaDotNet/02-Remoting.md
new file mode 100644
index 0000000..5c1b9b2
--- /dev/null
+++ b/AkkaDotNet/02-Remoting.md
@@ -0,0 +1,134 @@

# 02 — Remoting (Akka.Remote)

## Overview

Akka.Remote is the transport layer that enables actor systems on different machines to exchange messages transparently. In our SCADA system, Remoting is the communication backbone between the active and standby nodes in the failover pair. It underpins the Cluster module — you rarely interact with Remoting directly, but understanding its configuration is critical because it governs how the two nodes discover each other, exchange heartbeats, and transfer messages.

## When to Use

- Remoting is always enabled as a prerequisite for Akka.Cluster — it's not optional in our 2-node topology
- Direct Remoting APIs (sending to remote actor paths) are rarely needed; prefer Cluster-aware abstractions (Singleton, Pub-Sub, Distributed Data) instead

## When Not to Use

- Do not use raw Remoting for device communication — use Akka.IO or protocol-specific libraries for equipment comms
- Do not use Remoting to communicate with external systems (databases, APIs) — it's strictly for inter-ActorSystem communication
- Do not attempt to use Remoting without Cluster in our architecture; the Cluster module adds membership management and failure detection that raw Remoting lacks

## Design Decisions for the SCADA System

### Transport: DotNetty TCP

Akka.NET uses DotNetty as its TCP transport. For our Windows Server environment, this works well out of the box. Key decision: use a fixed, known port (not port 0/random), since both nodes in the failover pair have static IPs or hostnames.
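A minimal bootstrap sketch tying this together: load the HOCON that pins the hostname and port, then start the node's single `ActorSystem`. The file name `scada.hocon` is an assumption for illustration:

```csharp
using System.IO;
using Akka.Actor;
using Akka.Configuration;

// Load the HOCON that pins hostname/port (4053) and start the one ActorSystem
// for this node. The system name must be identical on both nodes.
var config = ConfigurationFactory.ParseString(File.ReadAllText("scada.hocon"));
var system = ActorSystem.Create("scada-system", config);
```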
+ +### Port and Hostname Configuration + +Each node in the pair must be individually addressable. Use explicit hostnames and ports: + +- **Node A (Active):** `akka.tcp://scada-system@nodeA-hostname:4053` +- **Node B (Standby):** `akka.tcp://scada-system@nodeB-hostname:4053` + +The ActorSystem name (`scada-system`) must be identical on both nodes. + +### Public Hostname vs. Bind Hostname + +On machines with multiple NICs (common in industrial environments — one NIC for the equipment network, one for the corporate/management network), configure `public-hostname` to advertise the correct address: + +```hocon +akka.remote.dot-netty.tcp { + hostname = "0.0.0.0" # Bind to all interfaces + public-hostname = "nodeA.scada.local" # Advertise this address + port = 4053 +} +``` + +### Serialization Boundary + +Every message that crosses the Remoting boundary must be serializable. This includes messages sent via Cluster Singleton, Distributed Data, and Pub-Sub. Design messages as simple, immutable records and register appropriate serializers (see [20-Serialization.md](./20-Serialization.md)). + +## Common Patterns + +### Separate Equipment and Cluster Networks + +If the site has separate networks for equipment communication and inter-node communication, bind Akka.Remote to the inter-node network only. Device actors use Akka.IO or direct protocol libraries on the equipment network. This prevents equipment traffic from interfering with cluster heartbeats. + +### Logging Remote Lifecycle Events + +Enable remote lifecycle event logging during development and initial deployment to diagnose connection issues: + +```hocon +akka.remote { + log-remote-lifecycle-events = on + log-received-messages = off # Too verbose for production + log-sent-messages = off +} +``` + +### Watching Remote Actors + +`Context.Watch()` works across Remoting boundaries. 
The device manager on the standby node can watch actors on the active node to detect failover conditions — though in practice, Cluster membership changes are a better signal. + +## Anti-Patterns + +### Using Remote Actor Paths Directly + +Avoid constructing remote `ActorSelection` paths like `akka.tcp://scada-system@nodeB:4053/user/some-actor`. This creates tight coupling to physical addresses. Instead, use Cluster-aware mechanisms: Cluster Singleton proxy, Distributed Data, or Pub-Sub. + +### Large Message Payloads + +Remoting is optimized for small, frequent messages — not bulk data transfer. If you need to transfer large datasets between nodes (e.g., a full device state snapshot during failover), consider chunking the data or using an out-of-band mechanism (shared file system, database). + +Default maximum frame size is 128KB. For our SCADA system, this should be sufficient for individual messages, but if device state snapshots are large: + +```hocon +akka.remote.dot-netty.tcp { + maximum-frame-size = 256000b # Increase only if needed +} +``` + +### Assuming Reliable Delivery + +Remoting provides at-most-once delivery. Messages can be lost if the connection drops at the wrong moment. For commands that must not be lost during failover, use Akka.Persistence to journal them (see [08-Persistence.md](./08-Persistence.md)). + +## Configuration Guidance + +### Heartbeat and Failure Detection + +The transport failure detector controls how quickly a downed node is detected. 
For a SCADA failover pair, faster detection means faster failover, but too aggressive settings cause false positives: + +```hocon +akka.remote { + transport-failure-detector { + heartbeat-interval = 4s + acceptable-heartbeat-pause = 20s # Default; increase to 30s if on unreliable networks + } + + watch-failure-detector { + heartbeat-interval = 1s + threshold = 10.0 + acceptable-heartbeat-pause = 10s + } +} +``` + +For our 2-node SCADA pair on a reliable local network, the defaults are generally appropriate. Do not reduce `acceptable-heartbeat-pause` below 10s — garbage collection pauses on .NET can trigger false positives. + +### Connection Limits + +With only 2 nodes, connection limits are not a concern. Keep defaults. + +### Retry and Backoff + +If the active node crashes, the standby will attempt to reconnect. Configure retry behavior: + +```hocon +akka.remote { + retry-gate-closed-for = 5s # Wait before retrying after a failed connection +} +``` + +## References + +- Official Documentation: +- Configuration Reference: +- Transports: diff --git a/AkkaDotNet/03-Cluster.md b/AkkaDotNet/03-Cluster.md new file mode 100644 index 0000000..dbdba26 --- /dev/null +++ b/AkkaDotNet/03-Cluster.md @@ -0,0 +1,206 @@ +# 03 — Cluster (Akka.Cluster) + +## Overview + +Akka.Cluster organizes our two SCADA nodes into a managed membership group with failure detection, role assignment, and coordinated lifecycle management. It is the backbone that makes the active/cold-standby failover model work — Cluster determines which node is "oldest" (and therefore active), detects when a node has failed, and triggers the standby to take over. + +The 2-node cluster topology is the most architecturally challenging configuration in Akka.Cluster because traditional quorum-based decisions require a majority, and with 2 nodes there is no majority when one node is down. 
## When to Use

- Always — the 2-node failover pair requires Cluster for membership management, failure detection, and coordinated role transitions
- Cluster membership events drive the entire failover lifecycle

## When Not to Use

- Do not use Cluster for communication with equipment — that's the device actor layer
- Do not add more than 2 nodes to a site cluster without revisiting the Split Brain Resolver configuration

## Design Decisions for the SCADA System

### Cluster Roles

Assign a single role to both nodes: `scada-node`. Role differentiation between active and standby is handled by Cluster Singleton (the oldest node becomes the singleton host), not by role assignment.

```hocon
akka.cluster {
  roles = ["scada-node"]
}
```

### Seed Nodes

With exactly 2 nodes and static IPs/hostnames, use static seed nodes. Both nodes list both addresses as seeds:

```hocon
akka.cluster {
  seed-nodes = [
    "akka.tcp://scada-system@nodeA.scada.local:4053",
    "akka.tcp://scada-system@nodeB.scada.local:4053"
  ]
}
```

The first seed node listed must be the same on both nodes; it is special in that it can join itself to bootstrap the cluster. Note that the "oldest" member (which determines the Singleton host) is decided by join order, not seed order — in practice the first seed node usually starts first and becomes oldest, but that is a consequence of startup sequence, not a guarantee of the configuration.

### Split Brain Resolver (SBR) — The Critical Decision

With only 2 nodes, quorum-based SBR strategies (majority, static-quorum) do not work because losing 1 of 2 nodes means no majority exists. The options for 2-node clusters:

**Recommended: `keep-oldest`**

The oldest node (the one that has been in the cluster longest) survives a partition. The younger node downs itself. When the failed/partitioned node comes back, it rejoins as the younger node.
+ +```hocon +akka.cluster { + downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster" + split-brain-resolver { + active-strategy = keep-oldest + keep-oldest { + down-if-alone = on # If the oldest is truly alone (no other node visible), it downs itself too + } + stable-after = 15s # Wait for cluster to stabilize before making decisions + } +} +``` + +**Why `down-if-alone = on`:** In a 2-node cluster, if the oldest node is partitioned (it can see no one), it should down itself. Otherwise, both nodes think they are the sole survivor and you get a split brain. With `down-if-alone = on`, the oldest node in a partition will also down itself, and an external mechanism (Windows Service restart, watchdog) must restart both nodes to reform the cluster. + +**Alternative: Lease-based SBR** (see [15-Coordination.md](./15-Coordination.md)) + +If you have access to a shared resource (file share, database), a lease-based approach is more robust for 2-node clusters. Only the node that holds the lease survives. This avoids the "both nodes down themselves" scenario. + +### Failure Detection Tuning + +Cluster uses an accrual failure detector on top of Remoting heartbeats. For a SCADA system on a local network: + +```hocon +akka.cluster { + failure-detector { + heartbeat-interval = 1s + threshold = 8.0 + acceptable-heartbeat-pause = 10s + min-std-deviation = 100ms + } +} +``` + +These settings provide failover detection within ~10–15 seconds. Do not make `acceptable-heartbeat-pause` shorter than 5s — .NET garbage collection can cause pauses that trigger false positives. 
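When tuning these thresholds, it helps to inspect the node's current membership view directly. A small diagnostic sketch, assuming `system` is the node's `ActorSystem`:

```csharp
using System;
using Akka.Cluster;

// Point-in-time snapshot of the cluster as this node currently sees it.
var state = Cluster.Get(system).State;
Console.WriteLine($"Members:     {string.Join(", ", state.Members)}");
Console.WriteLine($"Unreachable: {string.Join(", ", state.Unreachable)}");
```

This polling view complements event subscription and is useful in health checks or an operator console.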
### Cluster Event Subscription

The application should subscribe to cluster membership events to react to failover:

```csharp
Cluster.Get(Context.System).Subscribe(Self, typeof(ClusterEvent.MemberUp),
    typeof(ClusterEvent.MemberRemoved),
    typeof(ClusterEvent.UnreachableMember),
    typeof(ClusterEvent.LeaderChanged));
```

## Common Patterns

### Graceful Shutdown

When performing planned maintenance on a node, leave the cluster gracefully before shutting down. This triggers an orderly handoff rather than a failure-detected removal:

```csharp
await CoordinatedShutdown.Get(system).Run(CoordinatedShutdown.ClrExitReason.Instance);
```

Configure CoordinatedShutdown to leave the cluster:

```hocon
akka.coordinated-shutdown.run-by-clr-shutdown-hook = on
akka.cluster.run-coordinated-shutdown-when-down = on
```

### Health Check Actor

Create an actor that monitors cluster state and exposes it via a local HTTP endpoint (or Windows Event Log) for operations teams:

```csharp
public class ClusterHealthActor : ReceiveActor
{
    public ClusterHealthActor()
    {
        var cluster = Cluster.Get(Context.System);
        cluster.Subscribe(Self, typeof(ClusterEvent.IMemberEvent));

        Receive<ClusterEvent.MemberUp>(msg => LogMemberUp(msg));
        Receive<ClusterEvent.MemberRemoved>(msg => LogMemberRemoved(msg));
        Receive<ClusterEvent.UnreachableMember>(msg => AlertUnreachable(msg));
    }
}
```

### Auto-Restart with Windows Service

Wrap the Akka.NET application as a Windows Service. If both nodes down themselves during a partition (due to `down-if-alone = on`), the Windows Service restart mechanism will bring them back and they'll reform the cluster.
+ +### Over-Tuning Failure Detection + +Making the failure detector extremely aggressive (e.g., `acceptable-heartbeat-pause = 2s`) causes flapping — nodes repeatedly marking each other as unreachable and then reachable. Each transition triggers Singleton handoff, which means device actors restart, connections drop, and commands may be lost. + +### Manual Downing + +Never rely on manual downing (an operator clicking a button to remove a node) in a SCADA system. The failover must be automatic. Always configure SBR. + +### Multiple ActorSystems per Process + +Do not create multiple `ActorSystem` instances in the same process. Each SCADA node should have exactly one `ActorSystem` that participates in exactly one cluster. + +## Configuration Guidance + +### Complete Cluster Configuration Block + +```hocon +akka.cluster { + roles = ["scada-node"] + seed-nodes = [ + "akka.tcp://scada-system@nodeA.scada.local:4053", + "akka.tcp://scada-system@nodeB.scada.local:4053" + ] + + downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster" + split-brain-resolver { + active-strategy = keep-oldest + keep-oldest { + down-if-alone = on + } + stable-after = 15s + } + + failure-detector { + heartbeat-interval = 1s + threshold = 8.0 + acceptable-heartbeat-pause = 10s + } + + min-nr-of-members = 1 # Allow single-node operation (after failover, only 1 node is up) + + run-coordinated-shutdown-when-down = on +} + +akka.coordinated-shutdown { + run-by-clr-shutdown-hook = on + exit-clr = on +} +``` + +### Important: `min-nr-of-members = 1` + +Set this to 1, not 2. After a failover, only 1 node is running. If set to 2, the surviving node will wait for a second member before the Singleton starts — blocking all device communication. 
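That startup dependency can be made explicit in code: defer application services until the node has actually reached the `Up` state. A sketch, assuming `system` is in scope:

```csharp
using Akka.Cluster;

// Defer starting application work until this node is a full cluster member.
// With min-nr-of-members = 1, this fires as soon as the node itself is Up,
// including after a failover where it is the only surviving node.
Cluster.Get(system).RegisterOnMemberUp(() =>
{
    // Safe point to start the singleton-backed device layer, health endpoints, etc.
});
```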
+ +## References + +- Official Documentation: +- Split Brain Resolver: +- Configuration Reference: +- Cluster Membership Events: diff --git a/AkkaDotNet/04-ClusterSharding.md b/AkkaDotNet/04-ClusterSharding.md new file mode 100644 index 0000000..04331f9 --- /dev/null +++ b/AkkaDotNet/04-ClusterSharding.md @@ -0,0 +1,119 @@ +# 04 — Cluster Sharding (Akka.Cluster.Sharding) + +## Overview + +Cluster Sharding distributes a set of actors (called "entities") across cluster members using a consistent hashing strategy. Entities are identified by a unique ID, and the sharding infrastructure ensures each entity exists on exactly one node at a time, automatically rebalancing when nodes join or leave. + +In our SCADA system, Cluster Sharding is a candidate for managing device actors across the cluster — but with a 2-node active/cold-standby topology where only the active node communicates with equipment, the value proposition is nuanced. + +## When to Use + +- If the system evolves beyond a strict active/standby model to allow both nodes to handle subsets of devices +- If device counts grow beyond what a single node can handle (significantly more than 500 devices with high-frequency tag updates) +- If the architecture shifts to active/active with partitioned device ownership + +## When Not to Use + +- In the current design (active/cold-standby), Cluster Singleton is a better fit for owning all device communication on a single node +- Sharding adds complexity (shard coordinators, rebalancing, remember-entities configuration) that is unnecessary when one node does all the work +- With only 2 nodes, rebalancing is not meaningful — entities would just move from one node to the other + +## Design Decisions for the SCADA System + +### Current Recommendation: Do Not Use Sharding + +For the current architecture where one node is active and the other is a cold standby, Cluster Sharding adds overhead without benefit. 
The active node hosts all device actors via the Singleton pattern, and on failover, all actors restart on the new active node. + +### Future Consideration: Active/Active Migration Path + +If the system eventually needs to scale beyond one node's capacity, Sharding provides a clean migration path. Each device becomes a sharded entity identified by its device ID: + +```csharp +// Hypothetical future sharding setup +var sharding = ClusterSharding.Get(system); +var deviceRegion = sharding.Start( + typeName: "Device", + entityPropsFactory: entityId => Props.Create(() => new DeviceActor(entityId)), + settings: ClusterShardingSettings.Create(system), + messageExtractor: new DeviceMessageExtractor() +); +``` + +The message extractor would route based on device ID: + +```csharp +public class DeviceMessageExtractor : HashCodeMessageExtractor +{ + public DeviceMessageExtractor() : base(maxNumberOfShards: 100) { } + + public override string EntityId(object message) => message switch + { + IDeviceMessage m => m.DeviceId, + _ => null + }; +} +``` + +### If Sharding Is Adopted: Remember Entities + +For SCADA, if sharding is used, enable `remember-entities` so that device actors are automatically restarted after rebalancing without waiting for a new message: + +```hocon +akka.cluster.sharding { + remember-entities = on + remember-entities-store = "eventsourced" # Requires Akka.Persistence +} +``` + +This is important because device actors need to maintain their tag subscriptions — they can't wait for an external trigger to restart. + +## Common Patterns + +### Passivation (Not Recommended for SCADA) + +Sharding supports passivating idle entities to save memory. For SCADA device actors, passivation is inappropriate — devices need continuous monitoring regardless of command activity. 
If sharding is used, disable passivation or set very long timeouts: + +```hocon +akka.cluster.sharding { + passivate-idle-entity-after = off +} +``` + +### Shard Count Sizing + +If sharding is adopted, the number of shards should be significantly larger than the number of nodes but not excessively large. For 2 nodes with 500 devices, 50–100 shards is reasonable. The shard count is permanent and cannot be changed without data migration. + +## Anti-Patterns + +### Sharding as a Default Choice + +Do not adopt Sharding just because it's available. For a 2-node active/standby system, it adds a shard coordinator (which itself is a Singleton), persistence requirements for shard state, and rebalancing logic — all of which increase failure surface area without providing benefit. + +### Mixing Sharding and Singleton for the Same Concern + +If device actors are managed by Sharding, do not also wrap them in a Singleton. These are alternative distribution strategies. Pick one. + +## Configuration Guidance + +If Sharding is adopted in the future, here is a starting configuration for the SCADA system: + +```hocon +akka.cluster.sharding { + guardian-name = "sharding" + role = "scada-node" + remember-entities = on + remember-entities-store = "eventsourced" + passivate-idle-entity-after = off + + # For 2-node cluster, keep coordinator on oldest node + least-shard-allocation-strategy { + rebalance-absolute-limit = 5 + rebalance-relative-limit = 0.1 + } +} +``` + +## References + +- Official Documentation: +- Sharding Configuration: diff --git a/AkkaDotNet/05-ClusterSingleton.md b/AkkaDotNet/05-ClusterSingleton.md new file mode 100644 index 0000000..91634f5 --- /dev/null +++ b/AkkaDotNet/05-ClusterSingleton.md @@ -0,0 +1,161 @@ +# 05 — Cluster Singleton (Akka.Cluster.Tools) + +## Overview + +Cluster Singleton ensures that exactly one instance of a particular actor runs across the entire cluster at any time. 
If the node hosting the singleton goes down, the singleton is automatically restarted on the next oldest node. This is the primary mechanism for implementing the active/cold-standby model in our SCADA system.

The "Device Manager" — the top-level actor that owns all device actors and manages equipment communication — runs as a Cluster Singleton. The active node hosts the singleton; the standby node runs a Singleton Proxy that can route messages to the active node's singleton.

## When to Use

- The Device Manager actor that spawns and supervises all device actors — this is the primary use case
- Any component where exactly-one semantics are required: alarm aggregator, command queue processor, historian writer
- Any coordination point that must not have duplicates during normal operation

## When Not to Use

- Do not make every actor a singleton — only top-level coordinators that genuinely need cluster-wide uniqueness
- Do not use a singleton for high-throughput work that could benefit from parallelism
- Individual device actors should not be singletons; they are children of the singleton Device Manager

## Design Decisions for the SCADA System

### Singleton: Device Manager Actor

The Device Manager is the singleton. On startup, it reads the site's device configuration, creates one device actor per machine, and manages their lifecycle. When failover occurs, the singleton restarts on the standby node, creating new device actors that reconnect to equipment.

```csharp
// Singleton registration via Akka.Hosting
builder.WithSingleton<DeviceManagerActor>(
    singletonName: "device-manager",
    propsFactory: (system, registry, resolver) =>
        resolver.Props<DeviceManagerActor>(),
    options: new ClusterSingletonOptions
    {
        Role = "scada-node",
        // Hand-over retry interval during graceful migration
        HandOverRetryInterval = TimeSpan.FromSeconds(5),
    });
```

### Singleton Proxy on Both Nodes

Both nodes should have a Singleton Proxy. Even on the active node (which hosts the actual singleton), the proxy provides a stable `IActorRef` that other actors can use to send messages. This decouples message senders from knowing which node currently hosts the singleton.

```csharp
builder.WithSingletonProxy<DeviceManagerActor>(
    singletonName: "device-manager",
    options: new ClusterSingletonProxyOptions
    {
        Role = "scada-node",
        BufferSize = 1000, // Buffer messages while singleton is being handed over
    });
```

### Singleton Lifecycle During Failover

When the active node goes down:

1. Cluster failure detector marks the node as unreachable (~10–15 seconds with our config)
2. SBR downs the unreachable node
3. Cluster Singleton notices the singleton host is gone
4. Singleton starts on the next oldest (surviving) node
5. The new Device Manager reads device config, replays the Persistence journal for in-flight commands, and creates device actors
6. Device actors connect to equipment and resume tag subscriptions

**Total failover time:** ~20–40 seconds depending on failure detection + singleton startup + equipment reconnection.

### Buffering During Hand-Over

During singleton migration (whether from failure or graceful shutdown), messages sent to the Singleton Proxy are buffered. Configure `BufferSize` to handle the expected message volume during the hand-over window:

```csharp
new ClusterSingletonProxyOptions
{
    BufferSize = 1000, // Buffer up to 1000 messages during hand-over
}
```

If the buffer overflows, messages are dropped. For a SCADA system, 1000 is usually sufficient — commands arrive at human-operator speed, and tag updates will be re-sent by equipment once the new singleton subscribes.

## Common Patterns

### Singleton with Persistence

The Device Manager singleton should use Akka.Persistence to journal in-flight commands.
When the singleton restarts on the standby node, it replays the journal to identify commands that were sent but not yet acknowledged: + +```csharp +public class DeviceManagerActor : ReceivePersistentActor +{ + public override string PersistenceId => "device-manager-singleton"; + + // On recovery, rebuild the pending command queue + // On command: persist the command event, then send to device + // On ack: persist the ack event, remove from pending queue +} +``` + +See [08-Persistence.md](./08-Persistence.md) for full details. + +### Graceful Hand-Over + +When performing planned maintenance, use `CoordinatedShutdown` to trigger a graceful singleton hand-over. The singleton on the old node stops, the proxy buffers messages, and the singleton starts on the new node — minimizing the gap in equipment communication. + +### Singleton Health Monitoring + +Create a periodic self-check inside the singleton that publishes its status. The standby node's monitoring actor can watch for these heartbeats to provide early warning of issues: + +```csharp +// Inside the singleton +Context.System.Scheduler.ScheduleTellRepeatedly( + TimeSpan.FromSeconds(5), + TimeSpan.FromSeconds(5), + Self, + new HealthCheck(), + ActorRefs.NoSender); +``` + +## Anti-Patterns + +### Putting All Logic in the Singleton + +The singleton should be a thin coordination layer. It owns the device actor hierarchy but does not process tag updates, execute commands, or handle alarms directly. Those are delegated to child actors. If the singleton actor becomes a bottleneck, the entire system stalls. + +### Not Handling Singleton Restart + +When the singleton restarts on the standby node, all child actors (device actors) are created fresh. Any in-memory state from the previous instance is gone. If the system assumes device actors persist across failover, it will fail. Design for restart: use Persistence for critical state, re-read configuration, and re-establish equipment connections. 
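One way to make that restart contract explicit is to rebuild everything in `PreStart`. A sketch in which `LoadDeviceConfigs` and `DeviceActorProps` are assumed helper methods, not existing APIs:

```csharp
// Inside the Device Manager singleton: derive all state in PreStart rather
// than assuming anything survived the previous incarnation.
protected override void PreStart()
{
    base.PreStart();
    var devices = LoadDeviceConfigs();                         // re-read configuration
    foreach (var device in devices)
        Context.ActorOf(DeviceActorProps(device), device.Id);  // fresh child actors
    // Pending-command state is rebuilt separately by Akka.Persistence replay.
}
```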
### Ignoring the Buffer Overflow Scenario

If the Singleton Proxy buffer fills up during a long failover (e.g., a network partition where SBR takes time to act), messages are silently dropped. For commands, this means lost instructions. Mitigate by persisting commands before sending them through the proxy.

## Configuration Guidance

```hocon
akka.cluster.singleton {
  # Name of the singleton actor and the cluster role it runs on
  singleton-name = "device-manager"
  role = "scada-node"

  # Hand-over retry behavior during failover:
  # retry every 5s, up to 15 times, before giving up
  hand-over-retry-interval = 5s
  min-number-of-hand-over-retries = 15
}

akka.cluster.singleton-proxy {
  singleton-name = "device-manager"
  role = "scada-node"
  buffer-size = 1000
  singleton-identification-interval = 1s
}
```

### Important: Single-Node Operation

After failover, only one node is running. The singleton must be able to start with just one cluster member. Ensure `akka.cluster.min-nr-of-members = 1` (set in [03-Cluster.md](./03-Cluster.md)).

## References

- Official Documentation:
- Cluster Tools Configuration:

diff --git a/AkkaDotNet/06-ClusterPubSub.md b/AkkaDotNet/06-ClusterPubSub.md
new file mode 100644
index 0000000..d51ad88
--- /dev/null
+++ b/AkkaDotNet/06-ClusterPubSub.md
@@ -0,0 +1,159 @@

# 06 — Cluster Publish-Subscribe (Akka.Cluster.Tools)

## Overview

Cluster Publish-Subscribe provides a distributed message broker within the Akka.NET cluster. Actors can publish messages to named topics, and any actor in the cluster that has subscribed to that topic receives the message. It also supports "send to one subscriber" semantics for load-balanced distribution.

In the SCADA system, Pub-Sub serves as the event distribution backbone — enabling the active node's device actors to broadcast tag updates, alarms, and status changes to subscribers on both nodes (e.g., a logging actor on the standby node, or a monitoring dashboard service).
+ +## When to Use + +- Broadcasting alarm events from device actors to all interested subscribers (alarm processors, historians, UI notifiers) +- Distributing tag value changes to monitoring/dashboard actors +- Decoupling event producers (device actors) from consumers (alarm handlers, loggers, external integrations) +- Cross-node event distribution where the producer doesn't need to know about specific consumers + +## When Not to Use + +- Point-to-point command delivery — use direct actor references or the Singleton Proxy instead +- High-frequency, high-volume data streams — use Akka.Streams for backpressure support +- Guaranteed delivery — Pub-Sub is at-most-once; if a subscriber is down, it misses the message + +## Design Decisions for the SCADA System + +### Topic Structure + +Define topics by event category, not by device. This keeps subscription management simple and avoids an explosion of topics (500 devices × N event types): + +``` +Topics: + "alarms" — all alarm raise/clear events + "tag-updates" — tag value changes (filtered by subscriber if needed) + "device-status" — connection up/down, device faulted + "commands" — command dispatched/acknowledged (for audit/logging) +``` + +### DistributedPubSub Mediator + +Access the mediator via the `DistributedPubSub` extension: + +```csharp +var mediator = DistributedPubSub.Get(Context.System).Mediator; + +// Publishing +mediator.Tell(new Publish("alarms", new AlarmRaised("machine-001", "HighTemp", DateTime.UtcNow))); + +// Subscribing +mediator.Tell(new Subscribe("alarms", Self)); +``` + +### Subscribing from the Standby Node + +Even though the standby node is "cold" (not running device actors), it can still subscribe to topics. This enables a monitoring or logging actor on the standby to receive events from the active node's device actors. 
This is useful for keeping a warm audit log on both nodes:

```csharp
// On the standby node
public class AuditLogActor : ReceiveActor
{
    public AuditLogActor()
    {
        var mediator = DistributedPubSub.Get(Context.System).Mediator;
        mediator.Tell(new Subscribe("alarms", Self));
        mediator.Tell(new Subscribe("commands", Self));

        Receive<AlarmRaised>(msg => WriteToLocalLog(msg));
        Receive<CommandAcknowledged>(msg => WriteToLocalLog(msg));
    }
}
```

### Message Filtering

Pub-Sub delivers all messages on a topic to all subscribers. If a subscriber only cares about a subset (e.g., alarms from a specific device group), filter inside the subscriber actor rather than creating per-device topics:

```csharp
Receive<AlarmRaised>(msg =>
{
    if (_monitoredDeviceGroup.Contains(msg.DeviceId))
        ProcessAlarm(msg);
    // else: ignore
});
```

## Common Patterns

### Topic Groups for Load Distribution

If multiple actors should share the processing of a topic (only one handles each message), subscribe each handler with the same group name and publish with `sendOneMessageToEachGroup: true`. This is useful for distributing alarm processing across multiple handler actors:

```csharp
// Subscribe with a group name — each message goes to one member of the group
mediator.Tell(new Subscribe("alarms", Self, "alarm-processors"));

// Publish so that each group receives exactly one copy
mediator.Tell(new Publish("alarms", alarmEvent, sendOneMessageToEachGroup: true));
```

### Unsubscribe on Actor Stop

Pub-Sub automatically cleans up subscriptions when an actor is terminated, but explicitly unsubscribing during graceful shutdown ensures clean behavior:

```csharp
protected override void PostStop()
{
    var mediator = DistributedPubSub.Get(Context.System).Mediator;
    mediator.Tell(new Unsubscribe("alarms", Self));
    base.PostStop();
}
```

### Event Envelope Pattern

Wrap all published events in a common envelope that includes metadata (source device, timestamp, sequence number).
This makes it easier for subscribers to filter, order, and deduplicate:

```csharp
public record ScadaEvent(string DeviceId, DateTime Timestamp, long SequenceNr, object Payload);

// Publishing
mediator.Tell(new Publish("alarms",
    new ScadaEvent("machine-001", DateTime.UtcNow, _seqNr++, new AlarmRaised("HighTemp"))));
```

## Anti-Patterns

### Using Pub-Sub for Commands

Commands (write a tag value, start/stop equipment) should not go through Pub-Sub. Commands need a specific target and acknowledgment. Use direct actor references or the Singleton Proxy for command delivery.

### Publishing High-Frequency Data Without Throttling

If a device updates a tag value every 100ms and publishes each update to Pub-Sub, subscribers can be overwhelmed. Throttle at the publisher: only publish on significant value changes (deadband) or at a reduced rate.

### Relying on Pub-Sub for Critical Event Delivery

Pub-Sub provides at-most-once delivery. If the subscriber is temporarily unreachable (e.g., during singleton hand-over), events are lost. For events that must not be lost (safety alarms, command audit trail), persist them to the journal first, then publish to Pub-Sub as a notification.

## Configuration Guidance

```hocon
akka.cluster.pub-sub {
  # Actor name of the mediator
  name = "distributedPubSubMediator"

  # The role to use for the mediator — should match our cluster role
  role = "scada-node"

  # How often to gossip subscription state between nodes
  gossip-interval = 1s

  # How long entries for removed (downed) nodes are retained before pruning
  removed-time-to-live = 120s

  # Maximum delta elements to transfer in one round of gossip
  max-delta-elements = 3000
}
```

For a 2-node cluster, the defaults are fine. The `gossip-interval` of 1 second ensures subscription changes propagate quickly between the pair.
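If events carrying the envelope above are ever re-delivered (e.g., a publisher retries after a hand-over), the sequence number gives subscribers a cheap deduplication key. A standalone sketch — the class name is illustrative, not an Akka API:

```csharp
using System.Collections.Generic;

public class EnvelopeDeduplicator
{
    // Highest sequence number seen so far, per source device.
    private readonly Dictionary<string, long> _lastSeen = new();

    // True the first time a (deviceId, sequenceNr) pair is seen;
    // false for replays and out-of-order duplicates.
    public bool IsNew(string deviceId, long sequenceNr)
    {
        if (_lastSeen.TryGetValue(deviceId, out var last) && sequenceNr <= last)
            return false;
        _lastSeen[deviceId] = sequenceNr;
        return true;
    }
}
```

Note that this also discards late out-of-order events — the right trade-off for last-value tag data, but not for audit records, which should be logged unconditionally.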
## References

- Official Documentation:
- Cluster Tools Configuration:

diff --git a/AkkaDotNet/07-ClusterMetrics.md b/AkkaDotNet/07-ClusterMetrics.md
new file mode 100644
index 0000000..6f1b7ea
--- /dev/null
+++ b/AkkaDotNet/07-ClusterMetrics.md
@@ -0,0 +1,118 @@

# 07 — Cluster Metrics (Akka.Cluster.Metrics)

## Overview

Akka.Cluster.Metrics periodically collects node-level resource metrics (CPU usage, memory consumption) from each cluster member and publishes them across the cluster. It can also drive adaptive load-balancing routers that route messages to the least-loaded node.

In our 2-node SCADA system, Cluster Metrics serves primarily as a health monitoring input — providing visibility into whether the active node is under resource pressure. The adaptive routing capability is less relevant since only one node is actively processing.

## When to Use

- Monitoring active node resource consumption (CPU, memory) as a health indicator
- Triggering alerts if the active node approaches resource limits (e.g., memory pressure from too many device actors or tag subscriptions)
- Feeding operational dashboards that display cluster health

## When Not to Use

- Adaptive load-balancing routing — irrelevant in an active/standby topology since the standby does no work
- As a replacement for proper application-level metrics (device connection counts, command throughput, alarm rates) — Cluster Metrics only covers infrastructure-level metrics

## Design Decisions for the SCADA System

### Enable for Monitoring, Not Routing

Install Cluster Metrics on both nodes to collect health data, but do not configure metrics-based routers. The standby node will report minimal resource usage (it's idle); the active node's metrics are the useful signal.
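The threshold check itself is plain arithmetic, and keeping it as a pure function makes it unit-testable without the Cluster Metrics API. A sketch — the names are illustrative:

```csharp
using System;

public static class ResourceThresholds
{
    // True when used/total exceeds the given fraction (e.g. 0.80 for 80%).
    public static bool Exceeds(long usedBytes, long totalBytes, double fraction)
    {
        if (totalBytes <= 0) return false; // no sample yet — don't alert
        return (double)usedBytes / totalBytes > fraction;
    }
}
```

A metrics listener can call this from its memory and CPU checks and log or publish an alert when it returns true.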
### Metrics Listener Actor

Create a metrics listener that logs warnings when the active node exceeds thresholds:

```csharp
public class MetricsMonitorActor : ReceiveActor
{
    private readonly ClusterMetrics _metrics = ClusterMetrics.Get(Context.System);
    private readonly Cluster _cluster = Cluster.Get(Context.System);

    public MetricsMonitorActor()
    {
        Receive<ClusterMetricsChanged>(changed =>
        {
            foreach (var nodeMetrics in changed.NodeMetrics)
            {
                if (nodeMetrics.Address.Equals(_cluster.SelfAddress))
                {
                    CheckMemory(nodeMetrics);
                    CheckCpu(nodeMetrics);
                }
            }
        });
    }

    protected override void PreStart()
    {
        _metrics.Subscribe(Self);
    }

    protected override void PostStop()
    {
        _metrics.Unsubscribe(Self);
    }

    private void CheckMemory(NodeMetrics metrics)
    {
        // Alert if memory usage exceeds 80%
        if (metrics.Metric("MemoryUsed") is { } mem)
        {
            // Log or publish alert
        }
    }
}
```

### Collection Interval

The default collection interval (3 seconds) is adequate for SCADA monitoring. Faster collection wastes resources; slower collection misses transient spikes. Leave the default unless specific monitoring needs dictate otherwise.

## Common Patterns

### Forwarding Metrics to External Monitoring

In production SCADA deployments, metrics should feed into the site's existing monitoring infrastructure (Windows Performance Counters, Prometheus, Grafana, or SCADA historian). The metrics listener actor can bridge Cluster Metrics data to these external systems.

### Memory Pressure Early Warning

Use memory metrics to detect when the active node is approaching limits. If device actor count increases (e.g., new equipment added) and memory grows toward the machine's limit, the metrics listener can log warnings before out-of-memory conditions cause crashes.

## Anti-Patterns

### Using Metrics for Failover Decisions

Do not use Cluster Metrics to trigger failover.
Failover should be driven by Cluster membership events (unreachable/down). High CPU or memory on the active node does not mean it should hand off to the standby — it means the active node needs attention (configuration tuning, resource allocation).

### Excessive Collection Frequency

Collecting metrics every 100ms creates unnecessary garbage collection pressure from the metrics snapshots. Stick to the default 3-second interval.

## Configuration Guidance

```hocon
akka {
  extensions = ["Akka.Cluster.Metrics.ClusterMetricsExtensionProvider, Akka.Cluster.Metrics"]

  cluster.metrics {
    # Use the default collector
    collector {
      provider = "Akka.Cluster.Metrics.Collectors.DefaultCollector, Akka.Cluster.Metrics"
      sample-interval = 3s
      gossip-interval = 3s
    }

    # Disable the metrics-based router (not needed for active/standby)
    # No router configuration required
  }
}
```

## References

- Official Documentation:

diff --git a/AkkaDotNet/08-Persistence.md b/AkkaDotNet/08-Persistence.md
new file mode 100644
index 0000000..dace7d6
--- /dev/null
+++ b/AkkaDotNet/08-Persistence.md
@@ -0,0 +1,244 @@

# 08 — Persistence (Akka.Persistence)

## Overview

Akka.Persistence enables actors to persist their state as a sequence of events (event sourcing) and/or snapshots. On restart — whether from a crash, deployment, or failover — the actor replays its event journal to recover its last known state. Persistence uses pluggable backends for the event journal and snapshot store.

In our SCADA system, Persistence is the mechanism that ensures **no commands are lost or duplicated during failover**. The Device Manager singleton persists in-flight commands to a SQLite journal. When the singleton restarts on the standby node, it replays the journal to identify commands that were sent but not yet acknowledged, and can re-issue or reconcile them.
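The replay described above amounts to a fold over journaled events: dispatched commands enter a pending set, acknowledgments remove them, and whatever remains after replay is the reconciliation work list. A simplified standalone sketch, using two-field stand-ins for the event records defined later in this file:

```csharp
using System.Collections.Generic;

public record CommandDispatched(string CommandId, string DeviceId);
public record CommandAcknowledged(string CommandId, bool Success);

public static class PendingCommandRebuild
{
    // Replay events in journal order; commands dispatched but never
    // acknowledged are still pending and must be reconciled.
    public static HashSet<string> Rebuild(IEnumerable<object> events)
    {
        var pending = new HashSet<string>();
        foreach (var evt in events)
        {
            switch (evt)
            {
                case CommandDispatched d: pending.Add(d.CommandId); break;
                case CommandAcknowledged a: pending.Remove(a.CommandId); break;
            }
        }
        return pending;
    }
}
```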
## When to Use

- The Device Manager singleton — for journaling in-flight commands and their acknowledgments
- Any actor where state must survive failover (pending command queues, alarm acknowledgment state)
- Command deduplication — the journal provides a record of commands already sent

## When Not to Use

- Device actors that hold transient equipment state (current tag values) — this state is re-read from equipment on reconnection; persisting it adds latency without benefit
- High-frequency tag value updates — these should go to the historian (SQL Server or time-series DB), not the Persistence journal
- Read models / projections — use the historian or a dedicated CQRS read store, not the Persistence journal

## Design Decisions for the SCADA System

### SQLite as the Journal Backend

SQL Server is not guaranteed at every site. SQLite provides a local, zero-configuration persistence backend. Use `Akka.Persistence.Sql` with the SQLite provider:

```
NuGet: Akka.Persistence.Sql
```

The SQLite database file lives on the local filesystem of each node.

### Journal Sync Between Nodes

The key challenge: each node has its own SQLite file. When the standby takes over, it needs the active node's journal data. Options:

**Option A: Local Journals, Reconciliation via Distributed Data (Recommended)**

Do not attempt to sync SQLite files between nodes. Instead, use the active node as the journal host during normal operation, and ensure the Device Manager actor persists only critical state (in-flight commands). On failover, the new singleton's journal starts fresh, and it reconciles state by:

1. Re-reading equipment state from devices (tag values, connection status)
2. Checking a lightweight "pending commands" record in Distributed Data (see [09-DistributedData.md](./09-DistributedData.md))

This approach avoids the complexity and corruption risks of SQLite file replication.
**Option B: Shared SQLite on a Network Path**

Place the SQLite file on a shared filesystem (SMB share) accessible by both nodes. This is simple but introduces risks:

- SQLite does not handle network filesystem locking reliably — corruption is possible
- If the network share becomes unavailable, persistence fails entirely
- Not recommended for production SCADA systems

**Option C: Use SQL Server Where Available, SQLite as Fallback**

Configure the journal backend based on site capabilities. Sites with SQL Server use it as a shared journal (both nodes point to the same database). Sites without SQL Server use local SQLite with the reconciliation approach from Option A.

```csharp
// In Akka.Hosting configuration
if (siteConfig.HasSqlServer)
{
    builder.WithSqlPersistence(siteConfig.SqlServerConnectionString, ProviderName.SqlServer2019);
}
else
{
    builder.WithSqlPersistence($"Data Source={localDbPath}", ProviderName.SQLite);
}
```

### What to Persist: In-Flight Command Journal

The Device Manager persists a focused set of events — only commands and their outcomes:

```csharp
// Events (persisted to journal)
public record CommandDispatched(string CommandId, string DeviceId, string TagName, object Value, DateTime Timestamp);
public record CommandAcknowledged(string CommandId, bool Success, DateTime Timestamp);
public record CommandExpired(string CommandId, DateTime Timestamp);

// State (rebuilt from events)
public class DeviceManagerState
{
    public Dictionary<string, CommandDispatched> PendingCommands { get; } = new();
}
```

### Persistence ID Strategy

The singleton Device Manager uses a fixed persistence ID since there's only ever one instance:

```csharp
public override string PersistenceId => "device-manager";
```

### Snapshots for Fast Recovery

If the command journal grows large (unlikely if commands are acknowledged/expired regularly), use snapshots to speed up recovery:

```csharp
// Snapshot every 100 events
if (LastSequenceNr % 100 == 0)
{
    SaveSnapshot(_state);
}

// Handle snapshot offers
Recover<SnapshotOffer>(offer =>
{
    _state = (DeviceManagerState)offer.Snapshot;
});
```

## Common Patterns

### Command Lifecycle Pattern

```
1. Operator sends command → Device Manager
2. Device Manager persists CommandDispatched event
3. Device Manager sends command to device actor
4. Device actor sends to equipment, waits for ack
5. Device actor receives ack → tells Device Manager
6. Device Manager persists CommandAcknowledged event
7. Device Manager removes from pending queue

On failover recovery:
1. New Device Manager replays journal
2. Rebuilds pending command queue (dispatched but not acknowledged)
3. For each pending command: re-read equipment state
4. If equipment shows command was applied → persist ack
5. If equipment shows command was not applied → re-issue
6. If uncertain → flag for operator review
```

### Expiration of Stale Commands

Commands that remain pending beyond a timeout should be expired to prevent the journal from accumulating indefinitely:

```csharp
Context.System.Scheduler.ScheduleTellRepeatedly(
    TimeSpan.FromMinutes(1),
    TimeSpan.FromMinutes(1),
    Self,
    new ExpireStaleCommands(),
    ActorRefs.NoSender);

// Handler
Receive<ExpireStaleCommands>(_ =>
{
    var cutoff = DateTime.UtcNow.AddMinutes(-5);
    foreach (var cmd in _state.PendingCommands.Values.Where(c => c.Timestamp < cutoff).ToList())
    {
        Persist(new CommandExpired(cmd.CommandId, DateTime.UtcNow), evt =>
        {
            _state.PendingCommands.Remove(evt.CommandId);
        });
    }
});
```

### Journal Cleanup

Periodically delete old journal entries and snapshots to prevent SQLite file growth:

```csharp
// After saving a snapshot, delete events up to the snapshot sequence number
Receive<SaveSnapshotSuccess>(success =>
{
    DeleteMessages(success.Metadata.SequenceNr);
    DeleteSnapshots(new SnapshotSelectionCriteria(success.Metadata.SequenceNr - 1));
});
```

## Anti-Patterns

### Persisting Tag Value Updates

Do not use
Akka.Persistence to journal every tag value change from equipment. A 500-device system with 50 tags each updating every second produces 25,000 events per second. This will overwhelm the SQLite journal and is the historian's job. + +### Large Event Payloads + +Keep persisted events small. Do not embed full device state snapshots in every event — persist only the delta (command ID, tag name, value). Large events slow down recovery and bloat the journal. + +### Not Handling Recovery Failures + +If the journal is corrupted or inaccessible, the persistent actor will fail to start. Handle this gracefully: + +```csharp +protected override void OnRecoveryFailure(Exception reason, object message) +{ + _log.Error(reason, "Recovery failed — starting with empty state"); + // Optionally: start with empty state and reconcile from equipment +} +``` + +### Assuming Exactly-Once Delivery from Persistence Alone + +Persistence ensures the command is journaled, but it does not guarantee the equipment received it exactly once. After failover, the reconciliation logic (re-read from equipment) is essential to determine whether a pending command was actually applied. + +## Configuration Guidance + +### SQLite Journal Configuration + +```hocon +akka.persistence { + journal { + plugin = "akka.persistence.journal.sql" + sql { + connection-string = "Data Source=C:\\ProgramData\\SCADA\\persistence.db" + provider-name = "Microsoft.Data.Sqlite" + auto-initialize = true + } + } + + snapshot-store { + plugin = "akka.persistence.snapshot-store.sql" + sql { + connection-string = "Data Source=C:\\ProgramData\\SCADA\\persistence.db" + provider-name = "Microsoft.Data.Sqlite" + auto-initialize = true + } + } + + # Limit concurrent recoveries — with only one persistent actor, this is less critical + max-concurrent-recoveries = 2 +} +``` + +### Recovery Throttling + +During failover, the singleton replays the journal. 
Limit recovery throughput to avoid spiking CPU on startup:

```hocon
akka.persistence.journal.sql {
  replay-batch-size = 100
}
```

## References

- Official Documentation:
- Configuration Reference:
- Persistence Query:
- Akka.Persistence.Sql:

diff --git a/AkkaDotNet/09-DistributedData.md b/AkkaDotNet/09-DistributedData.md
new file mode 100644
index 0000000..0c30500
--- /dev/null
+++ b/AkkaDotNet/09-DistributedData.md
@@ -0,0 +1,167 @@

# 09 — Distributed Data (Akka.DistributedData)

## Overview

Akka.DistributedData shares state between cluster nodes using Conflict-Free Replicated Data Types (CRDTs). Writes can happen on any node and are merged automatically without coordination, providing eventual consistency. Data is replicated in-memory — there is no external database dependency.

In our SCADA system, Distributed Data provides a lightweight, zero-infrastructure mechanism for sharing state between the active and standby nodes. Unlike Persistence (which journals events to disk), Distributed Data keeps state in memory and gossips it between nodes. This makes it ideal for state that should be visible on both nodes but does not need to survive a full cluster restart.
+ +## When to Use + +- Sharing pending command summaries between active and standby (so the standby knows what to reconcile on failover) +- Replicating device connection status across nodes (so the standby's monitoring actor has live status) +- Sharing alarm acknowledgment state (so the standby knows which alarms have been acknowledged) +- Any state that should be available on both nodes without the overhead of disk persistence + +## When Not to Use + +- State that must survive a full cluster restart (both nodes down simultaneously) — use Persistence instead +- Large datasets (hundreds of MB) — Distributed Data holds everything in memory and gossips the full state +- Strongly consistent state where both nodes must agree before proceeding — CRDTs provide eventual consistency only +- High-frequency updates (thousands per second on the same key) — gossip merging has overhead + +## Design Decisions for the SCADA System + +### Data Model: What to Replicate + +Keep Distributed Data focused on small, operationally critical state: + +| Key | CRDT Type | Purpose | +|---|---|---| +| `pending-commands` | `LWWDictionary` | In-flight commands (mirrors Persistence journal for fast standby access) | +| `device-status` | `LWWDictionary` | Connection status per device | +| `alarm-acks` | `ORDictionary` | Which alarms have been acknowledged | +| `system-config` | `LWWRegister` | Current site config (device list, thresholds) | + +### CRDT Type Selection + +- **LWWRegister** (Last-Writer-Wins Register): For single-value state where the most recent write wins. Good for device status, configuration. +- **LWWDictionary**: For dictionaries where entries are independently updated. Good for pending commands (keyed by command ID). +- **ORDictionary** (Observed-Remove Dictionary): For dictionaries where entries can be added and removed concurrently. Good for alarm acknowledgments. +- **GCounter / PNCounter**: For monotonic counters (e.g., total commands issued). Rarely needed in SCADA. 
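The merge behavior behind the `LWW*` types can be illustrated with a toy register: each entry carries a timestamp, and merging keeps the later one, which makes the merge commutative and idempotent so both nodes converge. This is a simplified model, not the Akka.DistributedData implementation:

```csharp
public record LwwEntry<T>(T Value, long Timestamp);

public static class Lww
{
    // Later timestamp wins; ties here go to `a`, while the real
    // implementation breaks ties by node id so both sides agree.
    public static LwwEntry<T> Merge<T>(LwwEntry<T> a, LwwEntry<T> b)
        => a.Timestamp >= b.Timestamp ? a : b;
}
```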
### Write and Read Consistency Levels

For a 2-node cluster:

- **WriteLocal** (`WriteTo(1)`): Write to the local node only. Fastest, but the standby won't see the update until the next gossip round.
- **WriteTo(2)** (equivalent to `WriteAll` — and to `WriteMajority`, since the majority of 2 is 2): Write to both nodes before confirming. Ensures the standby has the data immediately, but blocks if the standby is down.

**Recommendation for SCADA:**

- **Pending commands:** `WriteLocal` with frequent gossip. Speed matters when issuing commands; the standby gets the update within 1–2 gossip intervals.
- **Device status:** `WriteLocal`. Status is informational; eventual consistency is fine.
- **Alarm acknowledgments:** `WriteTo(2)` if both nodes are up, fall back to `WriteLocal` on failure. Alarm state is safety-critical.

```csharp
var replicator = DistributedData.Get(Context.System).Replicator;

// Write pending command with local consistency (fast)
var key = new LWWDictionaryKey<string, string>("pending-commands");
replicator.Tell(Dsl.Update(key, LWWDictionary<string, string>.Empty, WriteLocal.Instance,
    dict => dict.SetItem(Cluster.Get(Context.System), commandId, serializedCommand)));

// Read device status with local consistency (fast)
var statusKey = new LWWDictionaryKey<string, string>("device-status");
replicator.Tell(Dsl.Get(statusKey, ReadLocal.Instance));
```

### Gossip Interval

With 2 nodes on a local network, gossip propagation is nearly instant. The default 2-second gossip interval means the standby sees updates within ~2 seconds. For command-critical state, this is acceptable because Persistence provides the durability guarantee; Distributed Data is a convenience for the standby.
## Common Patterns

### Subscribe to Data Changes

The standby node's monitoring actor can subscribe to Distributed Data changes to maintain a live view of the active node's state:

```csharp
public class StandbyMonitorActor : ReceiveActor
{
    public StandbyMonitorActor()
    {
        var replicator = DistributedData.Get(Context.System).Replicator;
        var statusKey = new LWWDictionaryKey<string, string>("device-status");

        replicator.Tell(Dsl.Subscribe(statusKey, Self));

        Receive<Changed>(changed =>
        {
            var status = changed.Get(statusKey);
            UpdateDashboard(status);
        });
    }
}
```

### Expiry / Cleanup Pattern

Distributed Data does not have built-in TTL. For entries like pending commands that should be cleaned up after acknowledgment, the active node must explicitly remove them:

```csharp
// After command is acknowledged
replicator.Tell(Dsl.Update(key, LWWDictionary<string, string>.Empty, WriteLocal.Instance,
    dict => dict.Remove(Cluster.Get(Context.System), commandId)));
```

### Fallback When Standby Is Down

Always handle the case where the standby node is unreachable. Use `WriteLocal` as a fallback:

```csharp
var writeConsistency = clusterHasTwoMembers
    ? (IWriteConsistency)new WriteTo(2, TimeSpan.FromSeconds(3))
    : WriteLocal.Instance;
```

## Anti-Patterns

### Storing Large or High-Cardinality Data

Distributed Data replicates everything to all nodes. Storing 500 devices × 50 tags × full value history would consume significant memory and gossip bandwidth. Store only summaries and current state, not history.

### Using Distributed Data as a Database

Distributed Data is ephemeral — it lives in memory only. If both nodes restart simultaneously (power failure, site-wide event), all Distributed Data is lost. Critical state that must survive this scenario belongs in Persistence or the historian database.

### Ignoring Merge Conflicts

CRDTs resolve conflicts automatically, but the resolution may not match business expectations.
For example, with `LWWDictionary`, if both nodes update the same key simultaneously, the last writer wins. For command state, this is fine (only the active node writes). For alarm acknowledgments, ensure the business logic is idempotent — acknowledging an already-acknowledged alarm should be a no-op.

## Configuration Guidance

```hocon
akka.cluster.distributed-data {
  # Role for replication — both nodes must have this role
  role = "scada-node"

  # Gossip interval — how often state is pushed between nodes
  gossip-interval = 2s

  # Notification interval — how often subscribers are notified of changes
  notify-subscribers-interval = 500ms

  # Maximum delta size for gossip — increase if data volume is high
  max-delta-elements = 500

  # Durable storage (optional) — persist Distributed Data to disk
  # Enable this for data that should survive single-node restarts
  durable {
    keys = ["pending-commands", "alarm-acks"]
    lmdb {
      dir = "C:\\ProgramData\\SCADA\\ddata"
      map-size = 104857600 # 100 MB
    }
  }
}
```

### Durable Distributed Data

Akka.DistributedData supports optional durable storage (LMDB) that persists specified keys to disk. This means the data survives a single-node restart (but not a full cluster loss, since LMDB is node-local). This is a good middle ground for pending command state — it's faster than full Persistence recovery but still durable.

## References

- Official Documentation:
- CRDT Concepts:

diff --git a/AkkaDotNet/10-Streams.md b/AkkaDotNet/10-Streams.md
new file mode 100644
index 0000000..3f89dbd
--- /dev/null
+++ b/AkkaDotNet/10-Streams.md
@@ -0,0 +1,208 @@

# 10 — Streams (Akka.Streams)

## Overview

Akka.Streams provides a higher-level abstraction for building reactive, back-pressured data processing pipelines on top of the actor model. Streams compose `Source`, `Flow`, and `Sink` stages into graphs that handle flow control automatically — fast producers cannot overwhelm slow consumers.
+ +In the SCADA system, Streams is the natural fit for processing continuous tag subscription data from equipment. Both OPC-UA and the custom protocol support tag subscriptions that push data continuously. Streams provides backpressure, batching, and error handling for these data flows without writing manual actor mailbox management. + +## When to Use + +- Processing continuous tag value feeds from equipment subscriptions +- Batching tag updates for historian writes (buffering individual updates and writing in bulk to SQL Server) +- Transforming and filtering raw protocol data into domain events (e.g., raw Modbus registers → typed tag values) +- Connecting device data flows to Pub-Sub for alarm/event distribution +- Any pipeline where backpressure is important (producer faster than consumer) + +## When Not to Use + +- Simple request/response patterns — use actors directly +- Command dispatch (point-to-point, needs acknowledgment) — use actors with Persistence +- Control flow logic (state machines, lifecycle management) — use actor `Become`/`Unbecome` + +## Design Decisions for the SCADA System + +### Tag Subscription Pipeline + +Each device actor can host a stream that processes tag subscription updates from the protocol adapter: + +``` +Equipment (OPC-UA / Custom Protocol) + → Source: Tag subscription callback + → Flow: Parse raw values, apply deadband filtering + → Flow: Enrich with device metadata (device ID, tag name, timestamp) + → Sink: Forward to device actor mailbox (for state update) + → Sink: Publish to Pub-Sub (for alarm processing, historian) +``` + +### Source: Wrapping Protocol Callbacks + +Both OPC-UA and the custom protocol deliver tag updates via callbacks. 
Wrap these in a `Source` using `Source.Queue` or `Source.ActorRef`:

```csharp
// Using Source.Queue for backpressured ingestion
var (queue, source) = Source
    .Queue<RawTagUpdate>(bufferSize: 1000, OverflowStrategy.DropHead)
    .PreMaterialize(materializer);

// Protocol callback pushes to the queue
opcUaSubscription.OnTagChanged += (tag, value) =>
    queue.OfferAsync(new RawTagUpdate(tag, value, DateTime.UtcNow));
```

`OverflowStrategy.DropHead` drops the oldest update when the buffer is full — appropriate for tag values where only the latest matters.

### Flow: Deadband Filtering

Reduce noise by filtering out tag updates that haven't changed significantly:

```csharp
Flow<RawTagUpdate, RawTagUpdate, NotUsed> deadbandFilter =
    Flow.Create<RawTagUpdate>()
        .StatefulSelectMany(() =>
        {
            var lastValues = new Dictionary<string, double>();
            return update =>
            {
                if (update.Value is double current)
                {
                    if (lastValues.TryGetValue(update.TagName, out var last)
                        && Math.Abs(current - last) < update.Deadband)
                    {
                        return Enumerable.Empty<RawTagUpdate>(); // Within deadband — filtered out
                    }
                    lastValues[update.TagName] = current;
                }
                return new[] { update };
            };
        });
```

### Sink: Batched Historian Writes

Buffer tag updates and write to the historian database in batches:

```csharp
Sink<RawTagUpdate, NotUsed> historianSink =
    Flow.Create<RawTagUpdate>()
        .GroupedWithin(500, TimeSpan.FromSeconds(2))
        .SelectAsync(parallelism: 2, batch => WriteToHistorian(batch))
        .To(Sink.Ignore<Done>());
```

This batches up to 500 updates or flushes every 2 seconds, whichever comes first. `parallelism: 2` allows two concurrent database writes.

### Materializer Lifecycle

Create one `ActorMaterializer` per `ActorSystem` (or per top-level actor).
Do not create a materializer per device actor — this wastes resources:

```csharp
// In the Device Manager singleton
var materializer = Context.Materializer(); // Tied to this actor's lifecycle
```

## Common Patterns

### Fan-Out: One Source, Multiple Sinks

A single tag subscription feed often needs to go to multiple consumers (device state, historian, alarm checker). Use `Broadcast`:

```csharp
var graph = GraphDsl.Create(builder =>
{
    var broadcast = builder.Add(new Broadcast<TagUpdate>(3));
    var source = builder.Add(tagSubscriptionSource);
    var stateSink = builder.Add(deviceStateSink);
    var historianSink = builder.Add(batchedHistorianSink);
    var alarmSink = builder.Add(alarmCheckSink);

    builder.From(source).To(broadcast.In);
    builder.From(broadcast.Out(0)).To(stateSink);
    builder.From(broadcast.Out(1)).To(historianSink);
    builder.From(broadcast.Out(2)).To(alarmSink);

    return ClosedShape.Instance;
});

RunnableGraph.FromGraph(graph).Run(materializer);
```

### Error Handling with Supervision

Stream stages can fail (protocol errors, serialization issues). Use a stream supervision decider to handle errors without killing the entire pipeline:

```csharp
// A Decider is a delegate mapping an Exception to a supervision Directive
Decider decider = cause => cause switch
{
    FormatException => Directive.Resume,   // Skip the failed element and continue
    TimeoutException => Directive.Restart, // Restart the stage, discarding its state
    _ => Directive.Resume                  // Default: skip the failed element and continue
};

var settings = ActorMaterializerSettings.Create(system)
    .WithSupervisionStrategy(decider);
```

### Throttling High-Frequency Updates

If equipment sends updates faster than the system can process, throttle the stream:

```csharp
Flow.Create<TagUpdate>()
    .Throttle(elements: 100, per: TimeSpan.FromSeconds(1), maximumBurst: 50, ThrottleMode.Shaping)
```

## Anti-Patterns

### Building Complex Business Logic in Streams

Streams excel at data flow — transforming, filtering, routing, batching.
Do not embed complex business logic (alarm escalation rules, command sequencing, state machine transitions) in stream stages. Keep that in actors, where state management and supervision are more natural.

### Unbounded Buffers

Never use `OverflowStrategy.Backpressure` with an effectively unbounded buffer when the source is a protocol callback. If the protocol pushes faster than the stream processes, backpressure propagates to the protocol callback — which typically can't handle it (it's a fire-and-forget callback). Use `DropHead` or `DropTail` for tag updates where only recent values matter.

### Creating Streams Per-Message

Do not materialize a new stream for each incoming message. Stream materialization has overhead (actor creation, buffer allocation). Materialize the stream once (at actor startup) and keep it running for the lifetime of the device actor.

### Ignoring Materialized Values

When a stream completes or fails, the materialized value (a `Task`) completes. Monitor this to detect stream failures:

```csharp
var completionTask = stream.Run(materializer);

// Route stream termination back into the owning actor as a message
// (StreamCompleted/StreamFailed are application-defined messages)
completionTask.PipeTo(Self,
    success: () => new StreamCompleted(),
    failure: ex => new StreamFailed(ex));
```

## Configuration Guidance

```hocon
akka.stream {
  # Default materializer settings
  materializer {
    # Initial and maximum buffer sizes for stream stages
    initial-input-buffer-size = 4
    max-input-buffer-size = 16

    # Dispatcher for stream actors
    dispatcher = "" # Uses default dispatcher

    # Subscription timeout — how long a publisher waits for a subscriber
    subscription-timeout {
      mode = cancel
      timeout = 5s
    }
  }
}
```

For the SCADA system, the default buffer sizes are fine. If profiling shows that specific high-throughput flows (e.g., historian batching) need larger buffers, increase `max-input-buffer-size` selectively by passing custom `ActorMaterializerSettings` to those flows.
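The deadband rule at the heart of the filtering flow is ordinary per-tag state, so it can be unit-tested without materializing a stream. A minimal stand-alone sketch (the class and member names here are illustrative, mirroring the hypothetical `RawTagUpdate` fields used earlier):

```csharp
using System;
using System.Collections.Generic;

// Stand-alone deadband filter: keeps an update only if it moved at least
// the deadband away from the last *kept* value for that tag.
public sealed class DeadbandFilter
{
    private readonly Dictionary<string, double> _lastValues = new();

    public bool ShouldKeep(string tagName, double value, double deadband)
    {
        if (_lastValues.TryGetValue(tagName, out var last)
            && Math.Abs(value - last) < deadband)
        {
            return false; // Within deadband — suppress the update
        }
        _lastValues[tagName] = value; // Significant change — remember it
        return true;
    }
}
```

The first update for a tag always passes; subsequent updates pass only once they drift at least the deadband away from the last value that was let through, which is the same semantics the `StatefulSelectMany` stage implements inside the stream.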
+ +## References + +- Official Documentation: +- Stream Quickstart: +- Modularity & Composition: +- Buffers & Working with Rate: diff --git a/AkkaDotNet/11-Hosting.md b/AkkaDotNet/11-Hosting.md new file mode 100644 index 0000000..49f9b15 --- /dev/null +++ b/AkkaDotNet/11-Hosting.md @@ -0,0 +1,195 @@ +# 11 — Akka.Hosting + +## Overview + +Akka.Hosting is the recommended integration layer between Akka.NET and the `Microsoft.Extensions.*` ecosystem (Hosting, DependencyInjection, Configuration, Logging). It replaces the traditional HOCON-only configuration approach with a code-first, type-safe API and provides an `ActorRegistry` for injecting actor references into non-actor services. + +In our SCADA system, Akka.Hosting is the entry point for bootstrapping the entire ActorSystem — configuring Remoting, Cluster, Singleton, Persistence, and all other modules through a fluent C# API that integrates with the standard .NET Generic Host. + +## When to Use + +- Always — Akka.Hosting should be the primary way to configure and start Akka.NET in any new .NET 10 application +- Configuring all Akka.NET modules (Remoting, Cluster, Persistence, etc.) 
in code rather than HOCON files
- Registering top-level actors in the `ActorRegistry` for injection into ASP.NET controllers, Windows Services, or other DI consumers
- Managing ActorSystem lifecycle tied to the .NET host lifecycle

## When Not to Use

- There are no scenarios where you should avoid Akka.Hosting in new projects on .NET 6+
- Legacy projects on older .NET may need the traditional HOCON approach if they cannot adopt `Microsoft.Extensions.Hosting`

## Design Decisions for the SCADA System

### Windows Service Host

The SCADA application runs as a Windows Service using the .NET Generic Host:

```csharp
var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddAkka("scada-system", (akkaBuilder, sp) =>
{
    var siteConfig = sp.GetRequiredService<IOptions<ScadaSiteOptions>>().Value;

    akkaBuilder
        .ConfigureLoggers(loggers =>
        {
            loggers.LogLevel = Akka.Event.LogLevel.InfoLevel;
            loggers.AddLoggerFactory(); // Bridge to Microsoft.Extensions.Logging
        })
        .WithRemoting(
            hostname: "0.0.0.0",
            port: 4053,
            publicHostname: siteConfig.NodeHostname)
        .WithClustering(new ClusterOptions
        {
            Roles = new[] { "scada-node" },
            SeedNodes = siteConfig.SeedNodes
        })
        .WithSingleton<DeviceManagerActor>("device-manager",
            (system, registry, resolver) => resolver.Props<DeviceManagerActor>(),
            new ClusterSingletonOptions { Role = "scada-node" })
        .WithSingletonProxy<DeviceManagerActor>("device-manager",
            new ClusterSingletonOptions { Role = "scada-node" })
        .WithDistributedData()
        .WithActors((system, registry, resolver) =>
        {
            var healthActor = system.ActorOf(
                resolver.Props<ClusterHealthActor>(), "cluster-health");
            registry.Register<ClusterHealthActor>(healthActor);
        });
});

var host = builder.Build();
await host.RunAsync();
```

### ActorRegistry for DI Integration

The `ActorRegistry` stores `IActorRef` instances, keyed by type, that can be injected into non-actor services:

```csharp
// Registering actors (keyed by type)
registry.Register<DeviceManagerActor>(deviceManagerRef);
registry.Register<ClusterHealthActor>(healthActorRef);

//
Consuming in a Windows Service health check endpoint
public class HealthCheckService
{
    private readonly IRequiredActor<ClusterHealthActor> _healthActor;

    public HealthCheckService(IRequiredActor<ClusterHealthActor> healthActor)
    {
        _healthActor = healthActor;
    }

    public async Task<HealthStatus> CheckHealthAsync()
    {
        var actorRef = await _healthActor.GetAsync();
        // HealthStatusResponse is the assumed reply type of ClusterHealthActor
        var status = await actorRef.Ask<HealthStatusResponse>(new GetHealth(), TimeSpan.FromSeconds(5));
        return status.IsHealthy ? HealthStatus.Healthy : HealthStatus.Unhealthy;
    }
}
```

### Site-Specific Configuration

Use `Microsoft.Extensions.Configuration` (appsettings.json, environment variables) for site-specific values (hostnames, seed nodes, device lists). Inject these into the Akka configuration:

```json
// appsettings.json
{
  "ScadaSite": {
    "NodeHostname": "nodeA.scada.local",
    "SeedNodes": [
      "akka.tcp://scada-system@nodeA.scada.local:4053",
      "akka.tcp://scada-system@nodeB.scada.local:4053"
    ],
    "DeviceConfigPath": "C:\\ProgramData\\SCADA\\devices.json"
  }
}
```

```csharp
builder.Services.Configure<ScadaSiteOptions>(
    builder.Configuration.GetSection("ScadaSite"));
```

## Common Patterns

### Combining HOCON and Hosting

For module-specific settings not yet covered by Hosting APIs, you can mix HOCON with code-first configuration:

```csharp
akkaBuilder.AddHocon(ConfigurationFactory.ParseString(@"
    akka.cluster.split-brain-resolver {
        active-strategy = keep-oldest
        keep-oldest.down-if-alone = on
        stable-after = 15s
    }
"), HoconAddMode.Prepend);
```

### Logging Bridge

Bridge Akka.NET's internal logging to `Microsoft.Extensions.Logging` so all logs go through the same pipeline (Serilog, NLog, etc.):

```csharp
akkaBuilder.ConfigureLoggers(loggers =>
{
    loggers.ClearLoggers();
    loggers.AddLoggerFactory();
});
```

### Coordinated Shutdown Integration

Akka.Hosting automatically ties `CoordinatedShutdown` to the host's lifetime.
When the Windows Service stops, the ActorSystem leaves the cluster gracefully and shuts down. No additional configuration is needed.

## Anti-Patterns

### Bypassing Hosting to Create ActorSystem Manually

Do not call `ActorSystem.Create()` alongside Akka.Hosting. This creates a second, unmanaged ActorSystem. Always use `AddAkka()` on the service collection.

### Registering Actors That Don't Exist Yet

The `ActorRegistry` resolves asynchronously (`GetAsync()`), but if a key is registered without the corresponding actor ever being created, consumers awaiting it will hang indefinitely. Always register actors in the `WithActors` callback where they are created.

### Hardcoding Configuration

Do not hardcode hostnames, ports, or seed nodes in the `AddAkka` configuration. Use `Microsoft.Extensions.Configuration` so that each site's deployment can override these values via appsettings.json or environment variables.

## Configuration Guidance

### NuGet Packages

```
Akka.Hosting
Akka.Remote.Hosting
Akka.Cluster.Hosting (includes Sharding, Tools)
Akka.Persistence.Hosting
```

These Hosting packages replace the need to install the base packages separately — they include their dependencies.

### Startup Order

Akka.Hosting initializes components in the order they are registered. Register in dependency order:

1. Loggers
2. Remoting
3. Clustering (depends on Remoting)
4. Persistence
5. Distributed Data
6. Singletons and Singleton Proxies (depend on Clustering)
7.
Application actors

## References

- GitHub / Documentation:
- Petabridge Blog:
- DI with Akka.NET:

diff --git a/AkkaDotNet/12-DependencyInjection.md b/AkkaDotNet/12-DependencyInjection.md
new file mode 100644
index 0000000..3055290
--- /dev/null
+++ b/AkkaDotNet/12-DependencyInjection.md
@@ -0,0 +1,176 @@

# 12 — Dependency Injection (Akka.DependencyInjection)

## Overview

Akka.DependencyInjection integrates `Microsoft.Extensions.DependencyInjection` with Akka.NET actor construction, allowing actors to receive injected services through their constructors. When using Akka.Hosting (recommended), DI integration is handled automatically — `Akka.DependencyInjection` is the underlying mechanism.

In the SCADA system, DI is essential for injecting protocol adapters, configuration services, logging, and database clients into actors without tight coupling.

## When to Use

- Injecting protocol adapter factories (OPC-UA client factory, custom protocol client factory) into device actors
- Injecting configuration services (`IOptions<T>`) into the Device Manager
- Injecting logging (`ILogger`) into actors for structured logging
- Injecting database clients or repository services for historian writes

## When Not to Use

- Do not inject `IActorRef` via standard DI — use the `ActorRegistry` and `IRequiredActor<TActor>` from Akka.Hosting instead
- Do not inject heavy, stateful services whose own lifecycle management conflicts with the actor lifecycle
- Do not use DI to inject mutable shared state — this breaks actor isolation

## Design Decisions for the SCADA System

### Actor Construction via DI Resolver

Use Akka.Hosting's `resolver.Props<TActor>()` to create actors with DI-injected constructors:

```csharp
akkaBuilder.WithActors((system, registry, resolver) =>
{
    // DeviceManagerActor receives IOptions<ScadaSiteOptions> and
    // IProtocolAdapterFactory via its constructor
    var props = resolver.Props<DeviceManagerActor>();
    var manager = system.ActorOf(props, "device-manager");
    registry.Register<DeviceManagerActor>(manager);
});
```

### Protocol Adapter Injection

The unified protocol abstraction is implemented via a factory pattern. Register the factory in DI and inject it into the Device Manager, which then creates the appropriate adapter actor per device:

```csharp
// DI registration
builder.Services.AddSingleton<IProtocolAdapterFactory, ProtocolAdapterFactory>();
builder.Services.AddTransient<OpcUaClientWrapper>();
builder.Services.AddTransient<CustomProtocolClientWrapper>();

// Device Manager actor constructor
public class DeviceManagerActor : ReceiveActor
{
    public DeviceManagerActor(
        IOptions<ScadaSiteOptions> config,
        IProtocolAdapterFactory adapterFactory,
        IServiceProvider serviceProvider)
    {
        // Use adapterFactory to create protocol-specific adapters
        // Use serviceProvider for creating scoped services
    }
}
```

### Scoped Services and Actor Lifecycle

Actors are long-lived. DI scopes must be managed manually to avoid memory leaks with scoped/transient services:

```csharp
public class DeviceActor : ReceiveActor
{
    private readonly IServiceScope _scope;
    private readonly IHistorianWriter _historian;

    public DeviceActor(IServiceProvider sp, DeviceConfig config)
    {
        _scope = sp.CreateScope();
        _historian = _scope.ServiceProvider.GetRequiredService<IHistorianWriter>();

        Receive<TagValueChanged>(HandleTagUpdate);
    }

    protected override void PostStop()
    {
        _scope.Dispose(); // Clean up scoped services
        base.PostStop();
    }
}
```

### Child Actor Creation with DI

When the Device Manager creates child device actors that also need DI services, use `DependencyResolver.For`:

```csharp
// Inside DeviceManagerActor
var resolver = DependencyResolver.For(Context.System);
var deviceProps = resolver.Props<DeviceActor>(deviceConfig); // Additional constructor args
var deviceActor = Context.ActorOf(deviceProps, $"device-{deviceConfig.DeviceId}");
```

## Common Patterns

### Factory Pattern for Protocol Selection

```csharp
public interface IProtocolAdapterFactory
{
    Props CreateAdapterProps(DeviceConfig config, IDependencyResolver resolver);
}

public class
ProtocolAdapterFactory : IProtocolAdapterFactory
{
    public Props CreateAdapterProps(DeviceConfig config, IDependencyResolver resolver)
    {
        return config.Protocol switch
        {
            ProtocolType.OpcUa => resolver.Props<OpcUaDeviceActor>(config),
            ProtocolType.Custom => resolver.Props<CustomProtocolDeviceActor>(config),
            _ => throw new ArgumentException($"Unknown protocol: {config.Protocol}")
        };
    }
}
```

### ILogger Integration

Inject `ILoggerFactory` and create loggers inside actors:

```csharp
public class DeviceActor : ReceiveActor
{
    private readonly ILogger _logger;

    public DeviceActor(ILoggerFactory loggerFactory, DeviceConfig config)
    {
        _logger = loggerFactory.CreateLogger($"Device.{config.DeviceId}");
        _logger.LogInformation("Device actor started for {DeviceId}", config.DeviceId);
    }
}
```

## Anti-Patterns

### Injecting IActorRef Directly via DI

Do not register `IActorRef` in the DI container. Actor references are runtime constructs managed by the ActorSystem. Use `ActorRegistry` and `IRequiredActor<TActor>` instead.

### Singleton Services with Mutable State

If a DI singleton service has mutable state, injecting it into multiple actors creates shared mutable state — violating actor isolation. Either make the service thread-safe (with locks) or restructure it as an actor.

### Forgetting to Dispose Scopes

If an actor creates an `IServiceScope` and doesn't dispose it in `PostStop`, scoped services (database connections, HTTP clients) leak. Always pair scope creation with disposal.

### Constructor Doing Heavy Work

Actor constructors should be lightweight. Do not perform I/O (network connections, database queries) in the constructor. Use `PreStart` for initialization that involves I/O or other heavy work.

## Configuration Guidance

No additional HOCON or Hosting configuration is needed beyond the standard Akka.Hosting setup. DI integration is enabled automatically when using `AddAkka()` with `resolver.Props<TActor>()`.
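One benefit of keeping the factory's protocol selection a pure `switch` (see Common Patterns above) is that the mapping can be regression-tested without an `ActorSystem` or DI container. A minimal sketch — the enum values and the two marker classes are illustrative stand-ins for the real adapter actor types:

```csharp
using System;

// Stand-in marker types for the real adapter actors
public sealed class OpcUaDeviceActor { }
public sealed class CustomProtocolDeviceActor { }

public enum ProtocolType { OpcUa, Custom }

// Pure selection logic, separated from Props creation so it can be
// unit-tested in isolation
public static class ProtocolSelection
{
    public static Type AdapterActorType(ProtocolType protocol) => protocol switch
    {
        ProtocolType.OpcUa => typeof(OpcUaDeviceActor),
        ProtocolType.Custom => typeof(CustomProtocolDeviceActor),
        _ => throw new ArgumentException($"Unknown protocol: {protocol}")
    };
}
```

Keeping `Props` construction as a thin wrapper around this pure function means the interesting branch logic stays cheap to test.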
+ +### Service Lifetimes + +| Service | Recommended Lifetime | Reason | +|---|---|---| +| `IProtocolAdapterFactory` | Singleton | Stateless factory, shared across actors | +| `OpcUaClientWrapper` | Transient | One per device actor, owned by the actor | +| `IHistorianWriter` | Scoped | Tied to actor lifecycle via `IServiceScope` | +| `IOptions` | Singleton | Configuration doesn't change at runtime | +| `ILoggerFactory` | Singleton | Standard .NET pattern | + +## References + +- Official Documentation: +- Akka.Hosting DI Integration: diff --git a/AkkaDotNet/13-Discovery.md b/AkkaDotNet/13-Discovery.md new file mode 100644 index 0000000..c287464 --- /dev/null +++ b/AkkaDotNet/13-Discovery.md @@ -0,0 +1,125 @@ +# 13 — Discovery (Akka.Discovery) + +## Overview + +Akka.Discovery provides a pluggable service discovery API that allows cluster nodes to find each other dynamically. It supports multiple backends: configuration-based (static), Azure Table Storage, AWS EC2, Kubernetes DNS, and more. Discovery is used by Akka.Management's Cluster Bootstrap to automate cluster formation. + +In the SCADA system's 2-node topology with static IPs, Discovery's dynamic capabilities are largely unnecessary — static seed nodes work well. However, understanding Discovery is valuable for future-proofing and for sites that may use DHCP or containerized deployments. 
+ +## When to Use + +- If site deployments use DHCP or dynamic IP assignment (uncommon in industrial environments but possible) +- If the system is ever deployed in containers (Docker on Windows Server) +- For sites where node addresses may change due to infrastructure migration + +## When Not to Use + +- In the current architecture with static IPs/hostnames — static seed nodes in the Cluster configuration are simpler and more predictable +- For equipment discovery (finding machines on the network) — Akka.Discovery is for finding *cluster nodes*, not industrial equipment + +## Design Decisions for the SCADA System + +### Current Recommendation: Static Seed Nodes + +For fixed 2-node deployments on Windows Server with known hostnames, static seed nodes (configured in [03-Cluster.md](./03-Cluster.md)) are the simplest and most reliable approach. No Discovery plugin is needed. + +### Future Option: Config-Based Discovery + +If the system needs a Discovery mechanism without external dependencies, use the built-in configuration-based discovery as a stepping stone: + +```hocon +akka.discovery { + method = config + config { + services { + scada-cluster { + endpoints = [ + "nodeA.scada.local:4053", + "nodeB.scada.local:4053" + ] + } + } + } +} +``` + +This is functionally equivalent to static seed nodes but uses the Discovery API, making it easier to swap to a dynamic provider later. 
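Whether endpoints come from the `config` method above or from a site database, a provider ultimately normalizes `host:port` strings into host/port pairs. A minimal parsing sketch — this helper is purely illustrative and not part of the Akka.Discovery API:

```csharp
using System;

// Parse a "host:port" endpoint string (as used in the config-based
// discovery block above) into a host / optional-port pair.
public static class EndpointParser
{
    public static (string Host, int? Port) Parse(string endpoint)
    {
        var idx = endpoint.LastIndexOf(':');
        if (idx < 0)
            return (endpoint, null); // No port — resolver default applies

        // Only treat the suffix as a port if it is numeric
        return int.TryParse(endpoint.Substring(idx + 1), out var port)
            ? (endpoint.Substring(0, idx), port)
            : (endpoint, null);
    }
}
```

A real provider would feed these pairs into resolved-target objects for the Discovery lookup; the parsing shown here is the only protocol-independent part.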
### Future Option: Custom Discovery Provider

For sites with a central configuration management system (e.g., a site database that tracks which machines run the SCADA software), a custom Discovery provider could query that system to find cluster nodes:

```csharp
public class SiteDbDiscovery : ServiceDiscovery
{
    private readonly ISiteDatabase _siteDb; // hypothetical site-database client

    public override Task<Resolved> Lookup(Lookup lookup, TimeSpan resolveTimeout)
    {
        // Query site database for SCADA node addresses
        var nodes = _siteDb.GetScadaNodes();
        var targets = nodes.Select(n => new ResolvedTarget(n.Hostname, n.Port)).ToList();
        return Task.FromResult(new Resolved(lookup.ServiceName, targets));
    }
}
```

## Common Patterns

### Discovery + Cluster Bootstrap

If Discovery is adopted, pair it with Akka.Management's Cluster Bootstrap (see [14-Management.md](./14-Management.md)):

```csharp
akkaBuilder
    .WithAkkaManagement(port: 8558)
    .WithClusterBootstrap(
        serviceName: "scada-cluster",
        requiredContactPoints: 2);
```

### Fallback Chain

Aggregate Discovery allows chaining multiple discovery methods with fallback:

```hocon
akka.discovery {
  method = aggregate
  aggregate {
    discovery-methods = ["custom-site-db", "config"]
  }
}
```

## Anti-Patterns

### Using Discovery for a 2-Node Static Deployment

Adding Discovery and Cluster Bootstrap to a fixed 2-node deployment with known addresses adds complexity without benefit. The Discovery plugin, Management HTTP endpoint, and Bootstrap coordination all introduce additional failure points.

### Confusing Node Discovery with Equipment Discovery

Akka.Discovery finds *Akka.NET cluster nodes*. It does not discover industrial equipment on the network. OPC-UA has its own discovery mechanism (OPC-UA Discovery Server); the custom protocol presumably has its own registration or configuration approach.

## Configuration Guidance

For the current architecture, no Discovery configuration is needed.
If adopted in the future: + +```hocon +akka.discovery { + method = config # Or "azure", "aws", "kubernetes", etc. + config { + services { + scada-cluster { + endpoints = [ + "nodeA.scada.local:4053", + "nodeB.scada.local:4053" + ] + } + } + } +} +``` + +## References + +- Official Documentation: +- Azure Discovery: diff --git a/AkkaDotNet/14-Management.md b/AkkaDotNet/14-Management.md new file mode 100644 index 0000000..8fe26a6 --- /dev/null +++ b/AkkaDotNet/14-Management.md @@ -0,0 +1,130 @@ +# 14 — Management (Akka.Management) + +## Overview + +Akka.Management exposes HTTP endpoints for cluster health checks, membership information, and Cluster Bootstrap coordination. It is primarily used in dynamic environments (Kubernetes, cloud) where nodes need to discover and coordinate with each other via HTTP probes. + +In the SCADA system, Management is useful for providing HTTP health check endpoints that operations tools (load balancers, monitoring systems, Windows Service health probes) can query to verify the Akka.NET cluster is functioning. 
## When to Use

- Exposing an HTTP health check endpoint for monitoring tools or Windows Service health checks
- If the system ever moves to containerized deployments where Cluster Bootstrap replaces static seed nodes
- Providing operational visibility into cluster membership state via HTTP

## When Not to Use

- As a general-purpose HTTP API for the SCADA system — use ASP.NET Core for that
- For equipment communication — Management HTTP endpoints are for cluster operations only

## Design Decisions for the SCADA System

### Health Check Endpoint

Configure Akka.Management to expose a simple HTTP health check on a management port:

```csharp
akkaBuilder.WithAkkaManagement(options =>
{
    options.Port = 8558;
    options.Hostname = "0.0.0.0";
});
```

This provides endpoints like:

- `GET /cluster/members` — current cluster membership
- `GET /cluster/health` — cluster health status
- `GET /alive` — basic liveness probe
- `GET /ready` — readiness probe (cluster joined and stable)

### Windows Service Health Integration

Use the Management health endpoint to report Akka.NET cluster health as part of the Windows Service's health reporting:

```csharp
// In a health check service
public class AkkaHealthCheck : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken ct)
    {
        try
        {
            // In production, prefer a shared HttpClient / IHttpClientFactory
            using var client = new HttpClient();
            var response = await client.GetAsync("http://localhost:8558/ready", ct);
            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy("Akka cluster ready")
                : HealthCheckResult.Degraded("Akka cluster not ready");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Akka management unreachable", ex);
        }
    }
}
```

### Port Selection

Use a dedicated port (8558 is the convention) separate from Akka.Remote's port (4053). Ensure Windows Firewall allows this port on both nodes.
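The `/cluster/members` endpoint listed above returns a JSON document with `members` and `unreachable` arrays. An external monitor can reduce that payload to a single verdict; a minimal sketch with `System.Text.Json` (field names follow the sample payload shown in this document and should be verified against the real endpoint):

```csharp
using System;
using System.Text.Json;

// Reduce a /cluster/members payload to one health verdict:
// healthy when at least `expectedNodes` members are present and
// nothing is marked unreachable.
public static class ClusterMembersCheck
{
    public static bool IsHealthy(string membersJson, int expectedNodes)
    {
        using var doc = JsonDocument.Parse(membersJson);
        var root = doc.RootElement;
        var members = root.GetProperty("members").GetArrayLength();
        var unreachable = root.GetProperty("unreachable").GetArrayLength();
        return members >= expectedNodes && unreachable == 0;
    }
}
```

Polling this from the monitoring network (never the equipment network) complements the simpler `/ready` liveness probe.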
+ +## Common Patterns + +### Cluster Bootstrap (Future Use) + +If the system moves to dynamic infrastructure: + +```csharp +akkaBuilder + .WithAkkaManagement(port: 8558) + .WithClusterBootstrap( + serviceName: "scada-cluster", + requiredContactPoints: 2); +``` + +`requiredContactPoints: 2` ensures both nodes must be visible before forming a cluster — preventing a single node from bootstrapping alone and creating a split-brain situation. + +### Operational Monitoring + +Periodically poll the `/cluster/members` endpoint from an external monitoring system to track cluster state: + +``` +GET http://nodeA.scada.local:8558/cluster/members + +{ + "selfNode": "akka.tcp://scada-system@nodeA.scada.local:4053", + "members": [...], + "unreachable": [...] +} +``` + +## Anti-Patterns + +### Exposing Management Port to Equipment Network + +The Management HTTP endpoint should only be accessible on the management/corporate network, not the equipment network. Equipment should never be able to reach cluster management endpoints. + +### Using Management as a REST API + +Management endpoints are for cluster operations, not application logic. Do not add custom routes for SCADA operations (command dispatch, alarm query) to the Management HTTP server. Use ASP.NET Core for that. + +## Configuration Guidance + +```hocon +akka.management { + http { + hostname = "0.0.0.0" + port = 8558 + route-providers-read-only = true # Only expose read endpoints + } +} +``` + +### Firewall Rules + +Ensure port 8558 is open between the two SCADA nodes and from the monitoring network. Block access from the equipment network. + +## References + +- Official Documentation: +- GitHub: diff --git a/AkkaDotNet/15-Coordination.md b/AkkaDotNet/15-Coordination.md new file mode 100644 index 0000000..3b15699 --- /dev/null +++ b/AkkaDotNet/15-Coordination.md @@ -0,0 +1,165 @@ +# 15 — Coordination (Akka.Coordination) + +## Overview + +Akka.Coordination provides lease-based distributed locking primitives. 
A lease is a time-bounded lock that a node acquires from an external store. It is used by the Split Brain Resolver, Cluster Sharding, and Cluster Singleton to prevent split-brain scenarios by ensuring only the node holding the lease can act as the leader or singleton host. + +In the SCADA system's 2-node topology, lease-based coordination addresses the fundamental challenge of 2-node split-brain resolution: without a third-party arbiter, two partitioned nodes cannot determine which should survive. + +## When to Use + +- If the `keep-oldest` SBR strategy with `down-if-alone = on` (see [03-Cluster.md](./03-Cluster.md)) is insufficient — specifically, if the scenario where *both* nodes down themselves during a partition is unacceptable +- When a shared resource (file share, database) is available to serve as the lease store +- For stronger singleton guarantees — ensuring only the lease holder can run the Device Manager singleton + +## When Not to Use + +- If the site has no shared resource accessible by both nodes — a lease requires an external store +- If the `keep-oldest` strategy with automatic Windows Service restart is acceptable for your availability requirements +- If the added dependency on the lease store introduces more risk than it mitigates (lease store becomes a single point of failure) + +## Design Decisions for the SCADA System + +### Lease Store Options + +For Windows Server deployments without cloud services: + +**Option A: SMB File Share Lease** + +If the site has a shared filesystem (NAS, Windows file server), a file-based lease can work. However, Akka.NET does not ship a file-based lease implementation — a custom one would need to be built. + +**Option B: SQL Server Lease (Where Available)** + +For sites with SQL Server, use a database-backed lease. 
This provides strong consistency guarantees:

```csharp
// Custom lease implementation using SQL Server (sketch — the
// Akka.Coordination Lease base class requires acquire/release/check)
public class SqlServerLease : Lease
{
    // Acquire: INSERT the lease row with optimistic concurrency
    // Heartbeat: UPDATE the timestamp periodically
    // Release: DELETE the lease row
    // Check: SELECT and verify the timestamp is recent
}
```

**Option C: Azure Blob Storage Lease (Akka.Coordination.Azure)**

If the site has Azure connectivity, `Akka.Coordination.Azure` provides a production-ready lease implementation backed by Azure Blob Storage. It is registered through that package's configuration and then referenced by name from the SBR and Singleton settings, in the same way as the custom SQL lease shown below.

**Current Recommendation:**

For most on-premise SCADA sites without cloud access, use the `keep-oldest` SBR strategy without a lease, and rely on Windows Service auto-restart to recover from the "both nodes downed" scenario. The recovery time is longer (~1–2 minutes for service restart + cluster reformation) but avoids the lease store dependency.

If faster recovery or stronger guarantees are needed, implement a SQL Server-backed lease for sites that have SQL Server.

### Lease-Based SBR Configuration

If a lease is available:

```hocon
akka.cluster {
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = lease-majority
    lease-majority {
      lease-implementation = "custom-sql-lease"
      acquire-lease-delay-for-minority = 5s
    }
  }
}
```

### Lease for Cluster Singleton

The Singleton can be configured to require a lease before starting. This provides an additional guarantee against duplicate singletons:

```hocon
akka.cluster.singleton {
  use-lease = "custom-sql-lease"
  lease-retry-interval = 5s
}
```

## Common Patterns

### Lease Heartbeat

Leases are time-bounded. The holder must periodically renew (heartbeat) the lease.
If the holder crashes without releasing, the lease expires after the heartbeat timeout, allowing another node to acquire it: + +``` +Heartbeat interval: 12s (default) +Heartbeat timeout: 120s (default) +``` + +This means after a node crash, it takes up to 120 seconds for the lease to expire and the standby to acquire it. Reduce the timeout for faster failover, but be cautious of network glitches causing false lease expiry: + +```hocon +custom-sql-lease { + heartbeat-timeout = 30s + heartbeat-interval = 5s + lease-operation-timeout = 5s +} +``` + +### Lease + SBR Interaction + +When using lease-based SBR in a 2-node cluster: + +1. Both nodes attempt to acquire the lease on startup +2. Only one succeeds — this node becomes the leader +3. During a partition, both nodes attempt to renew/acquire the lease +4. The node that holds the lease survives; the other downs itself +5. On the surviving node, the Singleton continues running +6. No "both nodes down" scenario occurs (as long as the lease store is reachable) + +## Anti-Patterns + +### Lease Store on One of the Two SCADA Nodes + +Never host the lease store on one of the SCADA nodes themselves. If that node goes down, the lease store goes with it, and the surviving node cannot acquire the lease. The lease store must be on an independent resource. + +### Very Short Lease Timeouts + +Setting `heartbeat-timeout` below 10 seconds risks false lease expiry during garbage collection pauses, network blips, or high CPU load. This would cause the singleton to stop unnecessarily. + +### Assuming the Lease Prevents All Split-Brain + +The lease only works if both nodes can reach the lease store. If the lease store itself is partitioned from one node, that node cannot acquire the lease and will down itself — even if it's otherwise healthy. Consider lease store availability as part of the system's overall availability design. + +## Configuration Guidance + +### Without Lease (Current Default) + +No coordination configuration needed. 
Use `keep-oldest` SBR as described in [03-Cluster.md](./03-Cluster.md). + +### With SQL Server Lease (Future Enhancement) + +```hocon +custom-sql-lease { + lease-class = "ScadaSystem.SqlServerLease, ScadaSystem" + heartbeat-timeout = 30s + heartbeat-interval = 5s + lease-operation-timeout = 5s +} + +akka.cluster.split-brain-resolver { + active-strategy = lease-majority + lease-majority { + lease-implementation = "custom-sql-lease" + } +} +``` + +## References + +- Akka Coordination Concepts: +- Akka.Coordination.Azure: +- Split Brain Resolver: diff --git a/AkkaDotNet/16-TestKit.md b/AkkaDotNet/16-TestKit.md new file mode 100644 index 0000000..fb88c9a --- /dev/null +++ b/AkkaDotNet/16-TestKit.md @@ -0,0 +1,223 @@ +# 16 — TestKit (Akka.TestKit) + +## Overview + +Akka.TestKit provides infrastructure for unit and integration testing of actors. It creates a controlled test environment with a test `ActorSystem`, `TestProbe` actors for asserting message flows, `TestActorRef` for synchronous actor access, and `EventFilter` for asserting log output. Test-framework-specific adapters are available for xUnit, NUnit, and MSTest. + +In the SCADA system, TestKit is essential for validating device actor behavior, protocol abstraction correctness, command handling logic, and supervision strategies — all without connecting to real equipment. 
+ +## When to Use + +- Unit testing individual actor behavior (device actors, command handlers, alarm processors) +- Testing message flows between actors (Device Manager → Device Actor → Protocol Adapter) +- Verifying supervision strategies (what happens when a device actor throws a `CommunicationException`) +- Testing actor state transitions (`Connecting` → `Online` → `Faulted`) +- Asserting that specific messages are sent in response to inputs + +## When Not to Use + +- Testing non-actor code (protocol parsing logic, tag value conversions) — use standard unit tests +- Full integration tests with DI, hosting, and configuration — use Akka.Hosting.TestKit (see [17-HostingTestKit.md](./17-HostingTestKit.md)) +- Multi-node failover tests — use MultiNodeTestRunner (see [18-MultiNodeTestRunner.md](./18-MultiNodeTestRunner.md)) + +## Design Decisions for the SCADA System + +### Test Framework: xUnit + +Use `Akka.TestKit.Xunit2` to integrate with xUnit, which is the standard test framework for modern .NET projects: + +``` +NuGet: Akka.TestKit.Xunit2 +``` + +### Testing the Protocol Abstraction + +The protocol abstraction (common message contract with OPC-UA and custom protocol implementations) is the most critical code to test. 
Create tests that verify both implementations respond identically to the same message contract:
+
+```csharp
+public class OpcUaDeviceActorTests : TestKit
+{
+    [Fact]
+    public void Should_respond_with_DeviceState_when_asked()
+    {
+        var deviceActor = Sys.ActorOf(Props.Create(() =>
+            new OpcUaDeviceActor(TestDeviceConfig.Create())));
+
+        deviceActor.Tell(new GetDeviceState());
+
+        var state = ExpectMsg<DeviceState>();
+        Assert.Equal(ConnectionStatus.Connecting, state.Status);
+    }
+}
+
+public class CustomProtocolDeviceActorTests : TestKit
+{
+    [Fact]
+    public void Should_respond_with_DeviceState_when_asked()
+    {
+        var deviceActor = Sys.ActorOf(Props.Create(() =>
+            new CustomProtocolDeviceActor(TestDeviceConfig.Create())));
+
+        deviceActor.Tell(new GetDeviceState());
+
+        var state = ExpectMsg<DeviceState>();
+        Assert.Equal(ConnectionStatus.Connecting, state.Status);
+    }
+}
+```
+
+### TestProbe for Interaction Testing
+
+Use `TestProbe` to verify that actors send the correct messages to their collaborators:
+
+```csharp
+[Fact]
+public void DeviceManager_should_forward_command_to_correct_device()
+{
+    var probe = CreateTestProbe();
+
+    // Create a DeviceManager that uses the probe as a device actor
+    var manager = Sys.ActorOf(Props.Create(() =>
+        new TestableDeviceManager(deviceActorOverride: probe)));
+
+    manager.Tell(new SendCommand("cmd-1", "machine-001", "StartMotor", true));
+
+    // Verify the command was forwarded to the device actor
+    var forwarded = probe.ExpectMsg<SendCommand>();
+    Assert.Equal("cmd-1", forwarded.CommandId);
+}
+```
+
+### Testing Supervision Strategies
+
+Verify that the Device Manager restarts device actors on communication failures:
+
+```csharp
+[Fact]
+public void Should_restart_device_actor_on_communication_exception()
+{
+    var probe = CreateTestProbe();
+    var deviceActor = Sys.ActorOf(Props.Create(() =>
+        new FailingDeviceActor(failOnFirst: true)));
+
+    Watch(deviceActor);
+
+    // Send a message that triggers the CommunicationException
+    deviceActor.Tell(new ConnectToDevice());
+
+    // The actor should be restarted, not stopped
+    ExpectNoMsg(TimeSpan.FromSeconds(1)); // No Terminated message
+
+    // After restart, the actor should accept messages again
+    deviceActor.Tell(new GetDeviceState());
+    ExpectMsg<DeviceState>();
+}
+```
+
+## Common Patterns
+
+### ExpectMsg with Timeout
+
+Always specify reasonable timeouts for message expectations. In CI environments, use the `Dilated` method to account for slower machines:
+
+```csharp
+ExpectMsg<DeviceState>(TimeSpan.FromSeconds(5));
+// or
+ExpectMsg<DeviceState>(Dilated(TimeSpan.FromSeconds(3)));
+```
+
+### ExpectNoMsg for Negative Testing
+
+Verify that an actor does NOT send a message in certain conditions (e.g., a filtered tag update should not be forwarded):
+
+```csharp
+deviceActor.Tell(new TagUpdate("temp", 20.0)); // Within deadband
+ExpectNoMsg(TimeSpan.FromMilliseconds(500)); // Should be filtered
+```
+
+### WithinAsync for Timing Assertions
+
+Verify that a sequence of messages arrives within a time window:
+
+```csharp
+await WithinAsync(TimeSpan.FromSeconds(5), async () =>
+{
+    deviceActor.Tell(new ConnectToDevice());
+    await ExpectMsgAsync<DeviceConnected>();
+    deviceActor.Tell(new SubscribeToTag("temperature"));
+    await ExpectMsgAsync<TagSubscribed>();
+});
+```
+
+### Mock Protocol Adapters
+
+Create test doubles for protocol adapters that simulate equipment behavior:
+
+```csharp
+public class MockOpcUaClient : IOpcUaClient
+{
+    private readonly Dictionary<string, object> _tagValues = new();
+
+    public Task<object> ReadTagAsync(string tagName) =>
+        Task.FromResult(_tagValues.GetValueOrDefault(tagName, 0.0));
+
+    public void SimulateTagChange(string tagName, object value) =>
+        _tagValues[tagName] = value;
+}
+```
+
+## Anti-Patterns
+
+### Testing Internal Actor State Directly
+
+Avoid using `TestActorRef` to inspect internal actor state. Test behavior through messages — send an input, assert the output. This keeps tests decoupled from implementation details.
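+
+For example, instead of reaching into the actor's fields, assert on what it emits. This sketch reuses the document's hypothetical SCADA messages (`SubscribeToTag`, `TagUpdate`, `TagValueChanged`) and its deadband-filtering behavior:
+
+```csharp
+[Fact]
+public void Tag_behavior_should_be_observable_through_messages_only()
+{
+    var subscriber = CreateTestProbe();
+    var deviceActor = Sys.ActorOf(Props.Create(() =>
+        new OpcUaDeviceActor(TestDeviceConfig.Create())));
+
+    deviceActor.Tell(new SubscribeToTag("temp", subscriber));
+
+    // Assert on the observable output, not on the actor's internal tag buffer
+    deviceActor.Tell(new TagUpdate("temp", 20.0));
+    subscriber.ExpectMsg<TagValueChanged>(t => t.TagName == "temp");
+}
+```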
+ +### Shared Test ActorSystem + +Each test class gets its own `ActorSystem` (TestKit creates one per fixture). Do not share an ActorSystem across test classes — actor names and state leak between tests. + +### Ignoring Dead Letters in Tests + +Dead letters in tests often indicate bugs (messages sent to stopped actors, wrong actor references). Subscribe to dead letters in tests and fail on unexpected ones: + +```csharp +Sys.EventStream.Subscribe(TestActor, typeof(DeadLetter)); +``` + +### Flaky Time-Dependent Tests + +Avoid tests that depend on exact timing (`Thread.Sleep(100)`). Use `ExpectMsg` with generous timeouts and `Dilated` for CI compatibility. + +## Configuration Guidance + +### Test-Specific Configuration + +Pass custom configuration to TestKit for specific test scenarios: + +```csharp +public class DeviceActorTests : TestKit +{ + public DeviceActorTests() + : base(ConfigurationFactory.ParseString(@" + akka.loglevel = DEBUG + akka.actor.debug.receive = on + ")) + { + } +} +``` + +### CI/CD Time Factor + +Set the time factor for CI environments where machines are slower: + +```hocon +akka.test { + timefactor = 3.0 # Multiply all timeouts by 3 in CI +} +``` + +## References + +- Official Documentation: +- Akka.TestKit.Xunit2 NuGet: diff --git a/AkkaDotNet/17-HostingTestKit.md b/AkkaDotNet/17-HostingTestKit.md new file mode 100644 index 0000000..1855baf --- /dev/null +++ b/AkkaDotNet/17-HostingTestKit.md @@ -0,0 +1,192 @@ +# 17 — Akka.Hosting.TestKit + +## Overview + +Akka.Hosting.TestKit extends the standard TestKit with full `Microsoft.Extensions.Hosting` integration. It spins up a complete host environment with DI, logging, configuration, and Akka.NET — making it ideal for integration tests that need the full application wiring. 
+
+In the SCADA system, Hosting.TestKit is used to test the full actor startup pipeline: DI-injected actors, protocol adapter factories, configuration loading, and the Device Manager singleton lifecycle — all within a controlled test environment.
+
+## When to Use
+
+- Integration tests that require DI (protocol adapter factories, configuration services, loggers)
+- Testing the full Akka.Hosting configuration pipeline (Remoting, Cluster, Singleton registration)
+- Verifying that the ActorRegistry correctly resolves actor references
+- Testing actors that depend on `IServiceProvider` or `IOptions<T>`
+
+## When Not to Use
+
+- Simple actor unit tests that don't need DI — use Akka.TestKit directly (see [16-TestKit.md](./16-TestKit.md))
+- Multi-node cluster tests — use MultiNodeTestRunner (see [18-MultiNodeTestRunner.md](./18-MultiNodeTestRunner.md))
+- Performance tests — the hosting overhead adds latency that distorts benchmarks
+
+## Design Decisions for the SCADA System
+
+### Test Fixture Structure
+
+Each integration test class inherits from `Akka.Hosting.TestKit.TestKit` and overrides configuration methods:
+
+```csharp
+public class DeviceManagerIntegrationTests : Akka.Hosting.TestKit.TestKit
+{
+    public DeviceManagerIntegrationTests(ITestOutputHelper output)
+        : base(output: output) { }
+
+    protected override void ConfigureServices(HostBuilderContext context, IServiceCollection services)
+    {
+        // Register test doubles for protocol adapters
+        services.AddSingleton<IProtocolAdapterFactory, MockProtocolAdapterFactory>();
+        services.AddSingleton<IOptions<SiteConfig>>(
+            Options.Create(TestSiteConfig.TwoDevices()));
+    }
+
+    protected override void ConfigureAkka(AkkaConfigurationBuilder builder, IServiceProvider provider)
+    {
+        builder
+            .WithActors((system, registry, resolver) =>
+            {
+                var manager = system.ActorOf(
+                    resolver.Props<DeviceManagerActor>(), "device-manager");
+                registry.Register<DeviceManagerActor>(manager);
+            });
+    }
+
+    [Fact]
+    public async Task DeviceManager_should_create_device_actors_from_config()
+    {
+        var manager = await ActorRegistry.GetAsync<DeviceManagerActor>();
+        manager.Tell(new GetManagedDevices());
+
+        var devices = ExpectMsg<ManagedDevices>();
+        Assert.Equal(2, devices.Devices.Count);
+    }
+}
+```
+
+### Mock Protocol Adapters
+
+Register mock protocol adapters that simulate equipment behavior without requiring actual network connections:
+
+```csharp
+public class MockProtocolAdapterFactory : IProtocolAdapterFactory
+{
+    public Props CreateAdapterProps(DeviceConfig config, IDependencyResolver resolver)
+    {
+        return Props.Create(() => new MockDeviceActor(config));
+    }
+}
+```
+
+### Testing Configuration Loading
+
+Verify that site-specific configuration is correctly loaded and applied:
+
+```csharp
+protected override void ConfigureServices(HostBuilderContext context, IServiceCollection services)
+{
+    // Load test configuration
+    var config = new ConfigurationBuilder()
+        .AddJsonFile("test-appsettings.json")
+        .Build();
+
+    services.Configure<SiteConfig>(config.GetSection("ScadaSite"));
+}
+```
+
+## Common Patterns
+
+### ActorRegistry Assertions
+
+Use `ActorRegistry.GetAsync<TKey>()` to verify actors were correctly registered during startup:
+
+```csharp
+[Fact]
+public async Task Should_register_all_required_actors()
+{
+    var manager = await ActorRegistry.GetAsync<DeviceManagerActor>();
+    Assert.NotNull(manager);
+
+    var health = await ActorRegistry.GetAsync<HealthCheckActor>();
+    Assert.NotNull(health);
+}
+```
+
+### Testing with TestProbe
+
+TestProbe works alongside Hosting.TestKit:
+
+```csharp
+[Fact]
+public async Task DeviceManager_should_forward_commands()
+{
+    var probe = CreateTestProbe();
+    var manager = await ActorRegistry.GetAsync<DeviceManagerActor>();
+
+    // Tell the manager to use the probe for a specific device
+    manager.Tell(new RegisterTestDevice("test-device", probe));
+    manager.Tell(new SendCommand("cmd-1", "test-device", "Start", true));
+
+    probe.ExpectMsg<SendCommand>(cmd => cmd.CommandId == "cmd-1");
+}
+```
+
+### Lifecycle Testing
+
+Verify that the actor system starts and stops cleanly:
+
+```csharp
+[Fact]
+public async Task Should_shutdown_gracefully()
+{
+    var manager = await 
ActorRegistry.GetAsync<DeviceManagerActor>();
+    Watch(manager);
+
+    // Trigger coordinated shutdown
+    await CoordinatedShutdown.Get(Sys).Run(CoordinatedShutdown.ClrExitReason.Instance);
+
+    ExpectTerminated(manager);
+}
+```
+
+## Anti-Patterns
+
+### Duplicating Production Configuration
+
+Test configuration should be minimal and focused on the test scenario. Do not copy the entire production Akka.Hosting setup into tests — only configure what the test needs. This prevents tests from breaking when production config changes.
+
+### Testing Cluster Behavior in Hosting.TestKit
+
+Hosting.TestKit runs a single-node ActorSystem. Do not test cluster membership, failover, or Singleton migration here — those require MultiNodeTestRunner. Hosting.TestKit is for testing application-level actor behavior with DI, not distributed system behavior.
+
+### Slow Tests from Full Hosting Stack
+
+Each Hosting.TestKit test spins up a full host. If you have hundreds of tests, this adds up. Reserve Hosting.TestKit for integration tests that genuinely need DI; use plain TestKit for unit tests.
+
+## Configuration Guidance
+
+### NuGet Package
+
+```
+Akka.Hosting.TestKit
+```
+
+This pulls in `Akka.Hosting`, `Akka.TestKit.Xunit2`, and `Microsoft.Extensions.Hosting.Testing`.
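+
+The package is added to the test project like any other NuGet package (the project name below is assumed):
+
+```shell
+dotnet add ScadaSystem.Tests package Akka.Hosting.TestKit
+```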
+ +### Test Project Structure + +``` +ScadaSystem.Tests/ + Unit/ + DeviceActorTests.cs (plain TestKit) + CommandHandlerTests.cs (plain TestKit) + Integration/ + DeviceManagerIntegTests.cs (Hosting.TestKit) + ConfigurationIntegTests.cs (Hosting.TestKit) + Fixtures/ + TestSiteConfig.cs + MockProtocolAdapterFactory.cs +``` + +## References + +- NuGet: +- Petabridge Bootcamp: diff --git a/AkkaDotNet/18-MultiNodeTestRunner.md b/AkkaDotNet/18-MultiNodeTestRunner.md new file mode 100644 index 0000000..8e3bd18 --- /dev/null +++ b/AkkaDotNet/18-MultiNodeTestRunner.md @@ -0,0 +1,220 @@ +# 18 — MultiNodeTestRunner (Akka.MultiNodeTestRunner) + +## Overview + +The MultiNodeTestRunner provides infrastructure for running distributed integration tests across multiple actor systems simultaneously. Each "node" in the test runs in its own process, with full cluster formation, network simulation, and coordinated test assertions. This is the tool for validating failover behavior, split-brain scenarios, and cluster membership transitions. + +In the SCADA system, MultiNodeTestRunner is essential for validating the core availability guarantee: that the standby node correctly takes over device communication when the active node fails, without losing or duplicating commands. + +## When to Use + +- Testing failover scenarios (active node crash → standby takes over) +- Validating Split Brain Resolver behavior in the 2-node topology +- Testing Cluster Singleton migration (Device Manager moves to the standby) +- Verifying Distributed Data replication between nodes +- Testing graceful shutdown and rejoin sequences + +## When Not to Use + +- Unit testing individual actor logic — use TestKit +- Integration tests that only need DI — use Hosting.TestKit +- Performance or load testing — MultiNodeTestRunner adds significant overhead from process coordination + +## Design Decisions for the SCADA System + +### Key Failover Scenarios to Test + +1. 
**Active node hard crash:** Kill the active node's process. Verify the standby detects the failure, acquires the singleton, and starts device actors.
+
+2. **Active node graceful shutdown:** Initiate CoordinatedShutdown on the active node. Verify the singleton migrates cleanly with buffered messages preserved.
+
+3. **Network partition (simulated):** Prevent the two nodes from communicating. Verify SBR correctly resolves the partition (one node survives, one downs itself).
+
+4. **Rejoin after failure:** After failover, restart the failed node. Verify it joins the cluster as the new standby without disrupting the active node.
+
+5. **Command in-flight during failover:** Send a command to the active node, then kill it before the command is acknowledged. Verify the new active node recovers the pending command from the Persistence journal.
+
+### Test Structure
+
+```csharp
+public class FailoverSpec : MultiNodeClusterSpec
+{
+    public FailoverSpec() : base(new FailoverSpecConfig()) { }
+
+    [MultiNodeFact]
+    public void Active_node_failure_should_trigger_singleton_migration()
+    {
+        // Arrange: Both nodes join cluster
+        RunOn(() => Cluster.Join(GetAddress(First)), First, Second);
+        AwaitMembersUp(2);
+
+        // Verify singleton is on the first (oldest) node
+        RunOn(() =>
+        {
+            var singleton = Sys.ActorSelection("/user/device-manager");
+            singleton.Tell(new Identify(1));
+            var identity = ExpectMsg<ActorIdentity>();
+            Assert.NotNull(identity.Subject);
+        }, First);
+
+        EnterBarrier("singleton-running");
+
+        // Act: Kill the first node
+        RunOn(() =>
+        {
+            TestConductor.Exit(First, 0).Wait();
+        }, Second);
+
+        // Assert: Singleton migrates to second node
+        RunOn(() =>
+        {
+            AwaitAssert(() =>
+            {
+                var singleton = Sys.ActorSelection("/user/device-manager");
+                singleton.Tell(new Identify(2));
+                var identity = ExpectMsg<ActorIdentity>(TimeSpan.FromSeconds(30));
+                Assert.NotNull(identity.Subject);
+            }, TimeSpan.FromSeconds(60));
+        }, Second);
+    }
+}
+```
+
+### Spec Configuration
+
+```csharp
+public class 
FailoverSpecConfig : MultiNodeConfig +{ + public RoleName First { get; } + public RoleName Second { get; } + + public FailoverSpecConfig() + { + First = Role("first"); + Second = Role("second"); + + CommonConfig = ConfigurationFactory.ParseString(@" + akka.actor.provider = cluster + akka.remote.dot-netty.tcp.port = 0 + akka.cluster { + downing-provider-class = ""Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"" + split-brain-resolver { + active-strategy = keep-oldest + keep-oldest.down-if-alone = on + } + min-nr-of-members = 1 + } + "); + } +} +``` + +## Common Patterns + +### Barriers for Synchronization + +Use `EnterBarrier` to synchronize test steps across nodes: + +```csharp +// Both nodes reach this point before proceeding +EnterBarrier("cluster-formed"); +// ... do work ... +EnterBarrier("work-complete"); +``` + +### RunOn for Node-Specific Logic + +Execute test logic on specific nodes: + +```csharp +RunOn(() => +{ + // This code runs only on the "first" node + Cluster.Join(GetAddress(First)); +}, First); + +RunOn(() => +{ + // This code runs only on the "second" node + Cluster.Join(GetAddress(First)); +}, Second); +``` + +### TestConductor for Failure Injection + +The TestConductor controls node lifecycle and network simulation: + +```csharp +// Kill a node +TestConductor.Exit(First, exitCode: 0).Wait(); + +// Simulate network partition (blackhole traffic) +TestConductor.Blackhole(First, Second, ThrottleTransportAdapter.Direction.Both).Wait(); + +// Restore network +TestConductor.PassThrough(First, Second, ThrottleTransportAdapter.Direction.Both).Wait(); +``` + +### Timeout Handling + +Multi-node tests involve network coordination and are inherently slower. 
Use generous timeouts: + +```csharp +AwaitAssert(() => +{ + // Assertion that may take time (singleton migration, failure detection) +}, max: TimeSpan.FromSeconds(60), interval: TimeSpan.FromSeconds(2)); +``` + +## Anti-Patterns + +### Testing Everything Multi-Node + +Multi-node tests are slow (process startup, cluster formation, barrier synchronization). Only test scenarios that genuinely require multiple nodes: failover, partition handling, data replication. All other tests should use TestKit or Hosting.TestKit. + +### Brittle Timing Assertions + +Do not assert that failover completes in exactly N seconds. Timing varies with machine load, GC pauses, and CI environment. Use `AwaitAssert` with a generous maximum timeout. + +### Forgetting Cleanup + +Ensure all node processes are terminated after each test. The MultiNodeTestRunner handles this, but custom test infrastructure must clean up explicitly. + +### Testing with Real Equipment + +Multi-node tests should use mock protocol adapters, not real equipment connections. Equipment behavior during test-driven cluster failures could be unpredictable. + +## Configuration Guidance + +### Running Multi-Node Tests + +```bash +# Using the Akka.MultiNodeTestRunner CLI +dotnet tool install --global Akka.MultiNodeTestRunner + +# Run tests +mntr run ScadaSystem.MultiNode.Tests.dll +``` + +### CI/CD Integration + +Multi-node tests require multiple processes on the same machine. Ensure the CI agent has sufficient resources and that ports are available (the test runner uses random ports). 
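+
+Because each spec spawns several node processes, it can also help to stop the test framework from running multi-node specs in parallel. Assuming the specs are executed through an xUnit-based host, an `xunit.runner.json` in the test project does this:
+
+```json
+{
+  "$schema": "https://xunit.net/schema/current/xunit.runner.schema.json",
+  "parallelizeTestCollections": false
+}
+```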
+ +### Test Project Structure + +``` +ScadaSystem.MultiNode.Tests/ + Specs/ + FailoverSpec.cs + SplitBrainSpec.cs + RejoinSpec.cs + CommandRecoverySpec.cs + Configs/ + FailoverSpecConfig.cs + SplitBrainSpecConfig.cs +``` + +## References + +- GitHub: +- Testing Actor Systems: diff --git a/AkkaDotNet/19-AkkaIO.md b/AkkaDotNet/19-AkkaIO.md new file mode 100644 index 0000000..5c610f4 --- /dev/null +++ b/AkkaDotNet/19-AkkaIO.md @@ -0,0 +1,243 @@ +# 19 — Akka.IO (TCP/UDP Networking) + +## Overview + +Akka.IO provides actor-based, non-blocking TCP and UDP networking built into the core Akka library. Instead of working with raw sockets, you interact with special I/O manager actors that handle connection lifecycle, reading, and writing through messages. This fits naturally into the actor model — connection state is managed by actors, and data flows through the mailbox. + +In the SCADA system, Akka.IO is a candidate for implementing the custom legacy protocol adapter. If the custom protocol runs over raw TCP sockets (not HTTP, not a managed library), Akka.IO provides the connection management layer. 
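+
+The same manager-actor pattern applies to UDP, which the rest of this page does not cover. As a minimal sketch (the port is illustrative), an actor binds by messaging the UDP manager and then receives datagrams as `Udp.Received` messages:
+
+```csharp
+public class UdpListenerActor : ReceiveActor
+{
+    public UdpListenerActor()
+    {
+        // Ask the UDP manager actor to bind a socket and route datagrams to this actor
+        Context.System.Udp().Tell(new Udp.Bind(Self, new IPEndPoint(IPAddress.Any, 9001)));
+
+        Receive<Udp.Bound>(bound => { /* socket is ready at bound.LocalAddress */ });
+        Receive<Udp.Received>(r =>
+        {
+            // r.Data is a ByteString; r.Sender is the remote endpoint
+        });
+    }
+}
+```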
+
+## When to Use
+
+- Implementing the custom legacy SCADA protocol adapter if it communicates over raw TCP or UDP sockets
+- Building any custom network protocol handler that needs actor-based lifecycle management
+- When you need non-blocking I/O integrated with the actor supervision model (automatic reconnection on failure)
+
+## When Not to Use
+
+- OPC-UA communication — use an OPC-UA client library (e.g., OPC Foundation's .NET Standard Stack), not raw sockets
+- HTTP communication — use `HttpClient` or ASP.NET Core
+- If the custom protocol has its own managed .NET client library — use that library instead, wrapping it in an actor
+- High-throughput bulk data transfer — Akka.Streams with custom stages may be more appropriate for backpressure handling
+
+## Design Decisions for the SCADA System
+
+### Custom Protocol Actor with Akka.IO TCP
+
+If the custom legacy protocol is a proprietary TCP-based protocol, model each device connection as an actor that uses Akka.IO:
+
+```csharp
+public class CustomProtocolConnectionActor : ReceiveActor
+{
+    private readonly EndPoint _remoteEndpoint;
+    private IActorRef _connection;
+
+    public CustomProtocolConnectionActor(DeviceConfig config)
+    {
+        _remoteEndpoint = new DnsEndPoint(config.Hostname, config.Port);
+        Become(Disconnected);
+    }
+
+    private void Disconnected()
+    {
+        // Request a TCP connection
+        Context.System.Tcp().Tell(new Tcp.Connect(_remoteEndpoint));
+
+        Receive<Tcp.Connected>(connected =>
+        {
+            _connection = Sender;
+            _connection.Tell(new Tcp.Register(Self));
+            Become(Connected);
+        });
+
+        Receive<Tcp.CommandFailed>(failed =>
+        {
+            // Connection failed — schedule retry
+            Context.System.Scheduler.ScheduleTellOnce(
+                TimeSpan.FromSeconds(5), Self, new RetryConnect(), ActorRefs.NoSender);
+        });
+
+        Receive<RetryConnect>(_ => Become(Disconnected)); // Re-triggers connect
+    }
+
+    private void Connected()
+    {
+        Receive<Tcp.Received>(received =>
+        {
+            // Parse the custom protocol frame from received.Data
+            var frame = CustomProtocolParser.Parse(received.Data);
+            
HandleFrame(frame);
+        });
+
+        Receive<SendCommand>(cmd =>
+        {
+            var bytes = CustomProtocolSerializer.Serialize(cmd);
+            _connection.Tell(Tcp.Write.Create(ByteString.FromBytes(bytes)));
+        });
+
+        Receive<Tcp.ConnectionClosed>(closed =>
+        {
+            _connection = null;
+            Become(Disconnected);
+        });
+    }
+}
+```
+
+### Frame Parsing and Buffering
+
+Industrial protocols often use framed messages (length-prefixed or delimited). TCP delivers data as a byte stream, so you must handle partial reads and frame reassembly:
+
+```csharp
+private ByteString _buffer = ByteString.Empty;
+
+private void HandleReceived(Tcp.Received received)
+{
+    _buffer = _buffer.Concat(received.Data);
+
+    while (TryParseFrame(_buffer, out var frame, out var remaining))
+    {
+        _buffer = remaining;
+        ProcessFrame(frame);
+    }
+}
+
+private bool TryParseFrame(ByteString data, out CustomFrame frame, out ByteString remaining)
+{
+    // Check if we have a complete frame (e.g., length prefix + payload)
+    if (data.Count < 4)
+    {
+        frame = null;
+        remaining = data;
+        return false;
+    }
+
+    var length = BitConverter.ToInt32(data.Take(4).ToArray(), 0);
+    if (data.Count < 4 + length)
+    {
+        frame = null;
+        remaining = data;
+        return false;
+    }
+
+    frame = CustomFrame.Parse(data.Slice(4, length));
+    remaining = data.Slice(4 + length);
+    return true;
+}
+```
+
+### Connection Supervision
+
+Wrap the connection actor in a parent that supervises reconnection:
+
+```csharp
+// Parent actor's supervision strategy
+protected override SupervisorStrategy SupervisorStrategy()
+{
+    return new OneForOneStrategy(
+        maxNrOfRetries: -1, // Unlimited retries
+        withinTimeRange: TimeSpan.FromMinutes(1),
+        // Restart on any exception (including SocketException) — re-establishes the connection
+        decider: Decider.From(ex => Directive.Restart));
+}
+```
+
+### Akka.IO vs. 
Direct Socket Wrapper
+
+If the custom protocol client already has a managed .NET library, wrapping it in an actor (without Akka.IO) is simpler:
+
+```csharp
+public class CustomProtocolDeviceActor : ReceiveActor
+{
+    private readonly CustomProtocolClient _client; // Existing library
+
+    public CustomProtocolDeviceActor(DeviceConfig config)
+    {
+        _client = new CustomProtocolClient(config.Hostname, config.Port);
+        _client.OnTagChanged += (tag, value) =>
+            Self.Tell(new TagValueReceived(tag, value));
+
+        ReceiveAsync<ConnectToDevice>(async _ => await _client.ConnectAsync());
+        Receive<TagValueReceived>(HandleTagUpdate);
+    }
+}
+```
+
+Use Akka.IO only if you need actor-level control over the TCP connection lifecycle, or if no managed client library exists.
+
+## Common Patterns
+
+### Tag Subscription via Polling or Push
+
+If the custom protocol supports push-based tag subscriptions (the device sends updates when values change), the connection actor receives `Tcp.Received` messages passively. If polling is required, use the scheduler:
+
+```csharp
+// Polling pattern
+Context.System.Scheduler.ScheduleTellRepeatedly(
+    TimeSpan.FromSeconds(1),
+    TimeSpan.FromSeconds(1),
+    Self,
+    new PollTags(),
+    ActorRefs.NoSender);
+
+Receive<PollTags>(_ =>
+{
+    foreach (var tag in _subscribedTags)
+    {
+        var request = CustomProtocolSerializer.CreateReadRequest(tag);
+        _connection.Tell(Tcp.Write.Create(ByteString.FromBytes(request)));
+    }
+});
+```
+
+### Backpressure with Ack-Based Writing
+
+For high-throughput writes, use Akka.IO's ack-based flow control:
+
+```csharp
+_connection.Tell(Tcp.Write.Create(data, ack: new WriteAck()));
+
+Receive<WriteAck>(_ =>
+{
+    // Previous write completed — safe to send next
+    SendNextQueuedCommand();
+});
+```
+
+## Anti-Patterns
+
+### Blocking Socket Operations in Actors
+
+Never use synchronous socket calls (`Socket.Receive`, `Socket.Send`) inside an actor. This blocks the dispatcher thread. Akka.IO handles all I/O asynchronously.
+
+### Not Handling Partial Reads
+
+TCP is a stream protocol. 
A single `Tcp.Received` message may contain a partial frame, multiple frames, or a frame split across two receives. Always implement frame buffering and parsing. + +### Creating One TCP Manager Per Device + +The TCP manager (`Context.System.Tcp()`) is a singleton per ActorSystem. Do not create additional instances. Each device actor sends `Tcp.Connect` to the same TCP manager. + +## Configuration Guidance + +```hocon +akka.io.tcp { + # Buffer pool settings — defaults are fine for SCADA scale + buffer-pool = "akka.io.tcp.disabled-buffer-pool" + + # Maximum number of open channels + max-channels = 1024 # Sufficient for 500 devices + + # Batch sizes for reads + received-message-size-limit = 65536 # 64KB per read + direct-buffer-size = 65536 +} +``` + +For 500 device connections, the default TCP settings are adequate. Increase `max-channels` only if you anticipate more concurrent connections. + +## References + +- Official Documentation: +- Akka IO Configuration: diff --git a/AkkaDotNet/20-Serialization.md b/AkkaDotNet/20-Serialization.md new file mode 100644 index 0000000..94ae139 --- /dev/null +++ b/AkkaDotNet/20-Serialization.md @@ -0,0 +1,215 @@ +# 20 — Serialization + +## Overview + +Akka.NET's serialization system converts messages to bytes and back. Serialization is required whenever a message crosses a boundary: Remoting (between nodes), Persistence (to the journal/snapshot store), or Cluster Sharding (entity passivation). The serialization system is pluggable — you can register different serializers for different message types. + +In the SCADA system, serialization matters in two critical paths: messages between the active and standby nodes (via Remoting/Cluster), and command events persisted to the SQLite journal (via Persistence). The default JSON serializer works but is slow and produces large payloads. For a production SCADA system, a more efficient serializer is recommended. 
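+
+One way to surface serialization problems early — before Remoting or Persistence is involved — is to force Akka to round-trip even local messages through the serializer in development and test builds:
+
+```hocon
+akka.actor {
+  # Development/test only: serialize and deserialize every message, even local ones
+  serialize-messages = on
+  # Also verify that Props are serializable
+  serialize-creators = on
+}
+```
+
+Leave both settings off in production — they exist purely to catch non-serializable messages during development.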
+ +## When to Use + +- Serialization configuration is required whenever Remoting or Persistence is enabled — which is always in our system +- Explicit serializer registration is recommended for all application messages that cross node boundaries or are persisted +- Custom serialization is needed if messages contain types that the default serializer handles poorly (e.g., complex object graphs, binary data) + +## When Not to Use + +- Messages that stay within a single ActorSystem (local-only messages between actors on the same node) are not serialized — they are passed by reference +- Do not serialize large binary blobs (equipment firmware, images) through Akka messages — use out-of-band transfer + +## Design Decisions for the SCADA System + +### Serializer Choice + +**Recommended: System.Text.Json or a binary serializer** + +Akka.NET v1.5+ supports pluggable serializers. Options: + +| Serializer | Pros | Cons | +|---|---|---| +| Newtonsoft.Json (default) | Human-readable, easy debugging | Slow, large payloads, type handling quirks | +| System.Text.Json | Fast, built into .NET, human-readable | Requires explicit converters for some types | +| Hyperion | Fast binary, handles complex types | Not human-readable, occasional compatibility issues across versions | +| Custom (protobuf, MessagePack) | Maximum performance, schema evolution | Requires manual schema management | + +**Recommendation for SCADA:** Use Hyperion for Remoting messages (speed matters for cluster heartbeats and Distributed Data gossip) and Newtonsoft.Json or System.Text.Json for Persistence events (human-readable journal aids debugging). 
+
+### Message Design for Serialization
+
+Design all cross-boundary messages as simple, immutable records with primitive or well-known types:
+
+```csharp
+// GOOD — simple types, easy to serialize
+public record CommandDispatched(string CommandId, string DeviceId, string TagName, double Value, DateTime Timestamp);
+public record TagValueChanged(string DeviceId, string TagName, double Value, DateTime Timestamp);
+
+// BAD — complex types, hard to serialize
+public record DeviceSnapshot(IActorRef DeviceActor, ConcurrentDictionary<string, object> State);
+```
+
+Rules for serializable messages:
+
+- Use primitive types (string, int, double, bool, DateTime, Guid)
+- Use immutable collections (`IReadOnlyList<T>`, `IReadOnlyDictionary<TKey, TValue>`)
+- Never include `IActorRef` in persisted messages — actor references are not stable across restarts
+- Never include mutable state or framework types
+
+### Serializer Binding Configuration
+
+Register serializers for application message types:
+
+```hocon
+akka.actor {
+  serializers {
+    hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
+    json = "Akka.Serialization.NewtonSoftJsonSerializer, Akka"
+  }
+
+  serialization-bindings {
+    # Remoting messages — use Hyperion for speed
+    "ScadaSystem.Messages.IClusterMessage, ScadaSystem" = hyperion
+
+    # Persistence events — use JSON for readability
+    "ScadaSystem.Persistence.IPersistedEvent, ScadaSystem" = json
+  }
+}
+```
+
+### Marker Interfaces for Binding
+
+Use marker interfaces to group messages by serialization strategy:
+
+```csharp
+// All messages that cross the Remoting boundary
+public interface IClusterMessage { }
+
+// All events persisted to the journal
+public interface IPersistedEvent { }
+
+// Application messages
+public record CommandDispatched(...) : IClusterMessage, IPersistedEvent;
+public record TagValueChanged(...) : IClusterMessage;
+public record AlarmRaised(...) 
: IClusterMessage, IPersistedEvent;
+```
+
+## Common Patterns
+
+### Versioning Persisted Events
+
+Persistence events are stored permanently. When the message schema changes (new fields, renamed fields), the journal contains old-format events. Handle this with version-tolerant deserialization:
+
+```csharp
+// Version 1
+public record CommandDispatched(string CommandId, string DeviceId, string TagName, double Value, DateTime Timestamp);
+
+// Version 2 — added Priority field
+public record CommandDispatchedV2(string CommandId, string DeviceId, string TagName, double Value, DateTime Timestamp, int Priority);
+
+// In the persistent actor's recovery
+Recover<CommandDispatched>(evt =>
+{
+    // Handle v1 events
+    _state.AddPendingCommand(evt.CommandId, evt.DeviceId, priority: 0);
+});
+Recover<CommandDispatchedV2>(evt =>
+{
+    // Handle v2 events
+    _state.AddPendingCommand(evt.CommandId, evt.DeviceId, evt.Priority);
+});
+```
+
+Alternatively, use a custom serializer with built-in schema evolution (protobuf, Avro).
+
+### Serialization Verification in Tests
+
+Verify that all cross-boundary messages serialize and deserialize correctly:
+
+```csharp
+[Theory]
+[MemberData(nameof(AllClusterMessages))]
+public void All_cluster_messages_should_roundtrip_serialize(IClusterMessage message)
+{
+    var serializer = Sys.Serialization.FindSerializerFor(message);
+    var bytes = serializer.ToBinary(message);
+    var deserialized = serializer.FromBinary(bytes, message.GetType());
+
+    Assert.Equal(message, deserialized);
+}
+```
+
+### Excluding Local-Only Messages
+
+Not all messages need serialization. Mark local-only messages to avoid accidentally sending them across Remoting:
+
+```csharp
+// Local-only message — never crosses node boundaries
+public record InternalDeviceStateUpdate(string TagName, object Value);
+// This does NOT implement IClusterMessage
+```
+
+## Anti-Patterns
+
+### IActorRef in Persisted Messages
+
+`IActorRef` contains a node address that becomes invalid after restart. 
Never persist actor references. Store the actor's logical identifier (device ID, entity name) and resolve the reference at runtime.
+
+### Serializing Everything as JSON
+
+JSON serialization of cluster heartbeats, Distributed Data gossip, and Singleton coordination messages adds unnecessary latency. Use a binary serializer (Hyperion) for infrastructure messages.
+
+### Ignoring Serialization in Development
+
+Serialization issues often surface only when Remoting is enabled (in multi-node testing or production). Test serialization explicitly during development, not just in production.
+
+### Large Serialized Payloads
+
+If a serialized message exceeds Remoting's maximum frame size (default 128KB), the message is dropped — the transport logs an error, but the sender receives no failure notification. Monitor serialized message sizes, especially for device state snapshots.
+
+## Configuration Guidance
+
+### Hyperion for Remoting
+
+```
+NuGet: Akka.Serialization.Hyperion
+```
+
+```hocon
+akka.actor {
+  serializers {
+    hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
+  }
+  serialization-bindings {
+    "System.Object" = hyperion  # Default all messages to Hyperion
+  }
+  serialization-settings.hyperion {
+    preserve-object-references = false  # Better performance, no circular refs needed
+    known-types-provider = "ScadaSystem.HyperionKnownTypes, ScadaSystem"
+  }
+}
+```
+
+### Known Types for Performance
+
+Register frequently serialized types to improve Hyperion performance:
+
+```csharp
+public class HyperionKnownTypes : IKnownTypesProvider
+{
+    public IEnumerable<Type> GetKnownTypes()
+    {
+        return new[]
+        {
+            typeof(CommandDispatched),
+            typeof(CommandAcknowledged),
+            typeof(TagValueChanged),
+            typeof(AlarmRaised),
+            typeof(DeviceStatus)
+        };
+    }
+}
+```
+
+## References
+
+- Official Documentation:
+- Hyperion Serializer: (Note: check current status — Hyperion has had maintenance concerns; evaluate alternatives)
diff --git a/AkkaDotNet/21-Configuration.md b/AkkaDotNet/21-Configuration.md
new file mode 100644
index 0000000..382a282
--- /dev/null
+++ b/AkkaDotNet/21-Configuration.md
@@ -0,0 +1,226 @@
+# 21 — HOCON Configuration
+
+## Overview
+
+HOCON (Human-Optimized Config Object Notation) is Akka.NET's native configuration format. It supports hierarchical keys, includes, substitutions, fallback chains, and default values. While Akka.Hosting provides code-first configuration for most modules, HOCON remains necessary for fine-grained settings that Hosting APIs don't yet cover, and for understanding default behaviors.
+
+In the SCADA system, the recommended approach is **Akka.Hosting as the primary configuration mechanism**, with HOCON used only for settings not exposed by Hosting's fluent API. Site-specific values (hostnames, ports, device lists) come from `appsettings.json` via `Microsoft.Extensions.Configuration`.
+
+## When to Use
+
+- Fine-grained settings not yet covered by Akka.Hosting APIs (e.g., specific dispatcher tuning, transport failure detector thresholds, serialization bindings)
+- Understanding and overriding Akka.NET default configurations
+- Legacy configuration that predates Akka.Hosting adoption
+
+## When Not to Use
+
+- As the primary configuration mechanism in new .NET 10 projects — use Akka.Hosting instead
+- For site-specific deployment values (hostnames, connection strings) — use `appsettings.json` and `IConfiguration`
+- Do not duplicate settings in both HOCON and Akka.Hosting — pick one source of truth per setting
+
+## Design Decisions for the SCADA System
+
+### Configuration Layering
+
+The SCADA system uses a three-layer configuration approach:
+
+1. **Akka.Hosting (code-first):** Remoting, Cluster, Singleton, Persistence, Distributed Data — all major module configuration
+2. **HOCON (embedded in code):** Fine-grained tuning that Hosting APIs don't expose — SBR details, dispatcher settings, serialization bindings
+3. **appsettings.json:** Site-specific values — node hostname, seed node addresses, device configuration path, database connection strings
+
+```csharp
+akkaBuilder
+    // Layer 1: Hosting APIs
+    .WithRemoting(options => { ... })
+    .WithClustering(new ClusterOptions { ... })
+
+    // Layer 2: HOCON for fine-grained settings
+    .AddHocon(ConfigurationFactory.ParseString(@"
+        akka.cluster.split-brain-resolver {
+            active-strategy = keep-oldest
+            keep-oldest.down-if-alone = on
+            stable-after = 15s
+        }
+        akka.actor.serialization-bindings {
+            ""ScadaSystem.Messages.IClusterMessage, ScadaSystem"" = hyperion
+        }
+    "), HoconAddMode.Prepend);
+```
+
+### HOCON Precedence
+
+When combining HOCON with Hosting, use `HoconAddMode.Prepend` to ensure your HOCON overrides defaults, or `HoconAddMode.Append` to provide fallback values. Hosting API settings are applied after HOCON, so they take highest precedence.
+
+Effective precedence (highest to lowest):
+
+1. Akka.Hosting API calls
+2. HOCON with `Prepend`
+3. Akka.NET internal defaults (reference.conf)
+
+### Avoiding HOCON Files on Disk
+
+Do not deploy `akka.hocon` or `akka.conf` files alongside the SCADA application. All HOCON should be embedded in code (via `ConfigurationFactory.ParseString`) or generated from `appsettings.json`. This avoids configuration drift between nodes — both nodes use the same compiled code with identical embedded HOCON.
+
+Site-specific differences (hostname, seed nodes) come from `appsettings.json`, which is deployed per-node.
+
+## Common Patterns
+
+### Reading HOCON Defaults for Reference
+
+Akka.NET modules ship with `reference.conf` files that contain all default values. These are embedded in the NuGet packages.
To view defaults for any module:
+
+```csharp
+var defaults = ConfigurationFactory.Default();
+var remoteDefaults = defaults.GetConfig("akka.remote");
+Console.WriteLine(remoteDefaults.ToString());
+```
+
+This is useful when tuning specific settings — you can see the default value before overriding it.
+
+### HOCON Substitutions
+
+HOCON supports variable substitution, useful for sharing values:
+
+```hocon
+scada {
+  cluster-role = "scada-node"
+}
+
+akka.cluster {
+  roles = [${scada.cluster-role}]
+}
+
+akka.cluster.singleton {
+  role = ${scada.cluster-role}
+}
+```
+
+### Environment Variable Overrides
+
+HOCON can read environment variables, useful for container deployments:
+
+```hocon
+akka.remote.dot-netty.tcp {
+  hostname = ${?SCADA_HOSTNAME}  # Override from env var if set
+  port = 4053
+}
+```
+
+The `?` prefix makes the substitution optional — if the env var isn't set, the setting is omitted (and the default applies).
+
+### Complete HOCON Reference for the SCADA System
+
+This is the full HOCON block for settings that Hosting APIs don't cover:
+
+```hocon
+akka {
+  # SBR fine-tuning
+  cluster {
+    split-brain-resolver {
+      active-strategy = keep-oldest
+      keep-oldest.down-if-alone = on
+      stable-after = 15s
+    }
+
+    failure-detector {
+      heartbeat-interval = 1s
+      threshold = 8.0
+      acceptable-heartbeat-pause = 10s
+    }
+  }
+
+  # Serialization bindings
+  actor {
+    serializers {
+      hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
+    }
+    serialization-bindings {
+      "ScadaSystem.Messages.IClusterMessage, ScadaSystem" = hyperion
+    }
+  }
+
+  # Remote transport tuning
+  remote {
+    dot-netty.tcp {
+      maximum-frame-size = 256000b
+    }
+    transport-failure-detector {
+      heartbeat-interval = 4s
+      acceptable-heartbeat-pause = 20s
+    }
+  }
+
+  # Distributed Data tuning
+  cluster.distributed-data {
+    gossip-interval = 2s
+    notify-subscribers-interval = 500ms
+    durable {
+      keys = ["pending-commands"]
+      lmdb {
+        dir = "C:\\ProgramData\\SCADA\\ddata"
map-size = 104857600
+      }
+    }
+  }
+
+  # IO tuning for custom protocol connections
+  io.tcp {
+    max-channels = 1024
+  }
+}
+```
+
+## Anti-Patterns
+
+### HOCON Files with Different Content Per Node
+
+If `akka.conf` files are deployed separately to each node and they drift out of sync, cluster behavior becomes unpredictable. Both nodes must have identical Akka configuration except for node-specific values (hostname). Use embedded HOCON + `appsettings.json` to enforce this.
+
+### Overriding Hosting Settings with HOCON
+
+If you configure Remoting via `WithRemoting()` and also set `akka.remote.dot-netty.tcp.port` in HOCON, the effective value depends on precedence order. This is confusing and error-prone. Pick one mechanism per setting.
+
+### Deep HOCON Nesting Without Comments
+
+HOCON is powerful but can become unreadable. Always comment non-obvious settings, especially timeouts and thresholds that affect failover behavior.
+
+### Ignoring reference.conf Defaults
+
+Akka.NET's defaults are generally sensible. Do not override every setting "just to be explicit." Override only settings where the default doesn't match your SCADA requirements (e.g., SBR strategy, failure detector tuning).
+
+## Configuration Guidance
+
+### appsettings.json Structure
+
+```json
+{
+  "ScadaSite": {
+    "NodeHostname": "nodeA.scada.local",
+    "SeedNodes": [
+      "akka.tcp://scada-system@nodeA.scada.local:4053",
+      "akka.tcp://scada-system@nodeB.scada.local:4053"
+    ],
+    "RemotePort": 4053,
+    "ManagementPort": 8558,
+    "DeviceConfigPath": "C:\\ProgramData\\SCADA\\devices.json",
+    "PersistenceDbPath": "C:\\ProgramData\\SCADA\\persistence.db",
+    "DistributedDataPath": "C:\\ProgramData\\SCADA\\ddata"
+  },
+  "Logging": {
+    "LogLevel": {
+      "Default": "Information",
+      "Akka": "Warning"
+    }
+  }
+}
+```
+
+### File Paths
+
+All SCADA data files (persistence DB, distributed data, device config) should live under a dedicated directory (`C:\ProgramData\SCADA\`) with appropriate ACLs.
Both nodes should use the same path structure for consistency, even though the files are node-local.
+
+## References
+
+- Official Documentation:
+- Module Configs:
+- Akka.Remote Config:
+- Akka.Persistence Config:
diff --git a/AkkaDotNet/AkkaComponents.md b/AkkaDotNet/AkkaComponents.md
new file mode 100644
index 0000000..eccc716
--- /dev/null
+++ b/AkkaDotNet/AkkaComponents.md
@@ -0,0 +1,218 @@
+# Akka.NET Components Reference
+
+This document catalogs the major components (modules/libraries) of the Akka.NET framework, along with a brief description and a link to the official documentation for each.
+
+---
+
+## Core Components
+
+### 1. Actors (Akka — The Core Library)
+
+The foundational building block of Akka.NET. Actors encapsulate state and behavior, communicate exclusively through asynchronous message passing, and form supervision hierarchies for fault tolerance. The core library includes the `ActorSystem`, `Props`, `IActorRef`, mailboxes, dispatchers, routers, FSM (Finite State Machine), and the supervision/monitoring model.
+
+- **NuGet Package:** `Akka`
+- **Documentation:**
+- **Concepts — Actor References:**
+- **Supervision & Monitoring:**
+
+### 2. Remoting (Akka.Remote)
+
+Enables transparent communication between actor systems running on different hosts or processes. Messages are serialized and sent over the network; remote and local message sends use the same API. Remoting is the transport layer upon which Clustering is built.
+
+- **NuGet Package:** `Akka.Remote`
+- **Documentation:**
+- **Configuration Reference:**
+
+### 3. Cluster (Akka.Cluster)
+
+Organizes multiple actor systems into a membership-based "meta-system" with failure detection, member lifecycle management, role assignment, and split-brain resolution. In most real-world applications, Cluster is used instead of raw Remoting.
+
+- **NuGet Package:** `Akka.Cluster`
+- **Documentation:**
+- **Configuration Reference:**
+
+### 4. Cluster Sharding (Akka.Cluster.Sharding)
+
+Distributes a large set of stateful entities (actors) across cluster members. Entities are addressed by an identifier and are automatically balanced, migrated on failure, and guaranteed to be unique across the cluster. Typically paired with Persistence.
+
+- **NuGet Package:** `Akka.Cluster.Sharding`
+- **Documentation:**
+
+### 5. Cluster Singleton (Akka.Cluster.Tools)
+
+Ensures that exactly one instance of a particular actor exists across the entire cluster at any given time. If the hosting node goes down, the singleton is automatically migrated to another node.
+
+- **NuGet Package:** `Akka.Cluster.Tools`
+- **Documentation:**
+
+### 6. Cluster Publish-Subscribe (Akka.Cluster.Tools)
+
+Provides a distributed publish-subscribe mechanism within a cluster. Messages can be broadcast to all subscribers of a topic or sent to a single interested party. Part of the `Akka.Cluster.Tools` package alongside Cluster Singleton.
+
+- **NuGet Package:** `Akka.Cluster.Tools`
+- **Documentation:**
+
+### 7. Cluster Metrics (Akka.Cluster.Metrics)
+
+Collects and publishes node-level resource metrics (CPU, memory) across the cluster. Useful for adaptive load balancing and monitoring.
+
+- **NuGet Package:** `Akka.Cluster.Metrics`
+- **Documentation:**
+
+### 8. Persistence (Akka.Persistence)
+
+Enables actors to persist their state as a sequence of events (event sourcing) or snapshots. On restart or migration, the actor replays its event journal to recover state. Supports pluggable journal and snapshot store backends (SQL Server, PostgreSQL, SQLite, MongoDB, etc.).
+
+- **NuGet Package:** `Akka.Persistence`
+- **Documentation:**
+- **Configuration Reference:**
+- **Persistence Query (CQRS Projections):**
+
+### 9. Distributed Data (Akka.DistributedData)
+
+Shares data across cluster nodes using Conflict-Free Replicated Data Types (CRDTs).
Supports reads and writes even during network partitions, with eventual consistency guarantees and automatic conflict resolution.
+
+- **NuGet Package:** `Akka.DistributedData`
+- **Documentation:**
+
+### 10. Streams (Akka.Streams)
+
+A higher-level abstraction on top of actors for building reactive, back-pressured stream processing pipelines. Implements the Reactive Streams standard. Provides composable building blocks: `Source`, `Flow`, `Sink`, and `Graph` DSL.
+
+- **NuGet Package:** `Akka.Streams`
+- **Documentation:**
+- **Modularity & Composition:**
+
+---
+
+## Hosting & Integration
+
+### 11. Akka.Hosting
+
+The recommended modern integration layer for Akka.NET with the `Microsoft.Extensions.*` ecosystem (Hosting, DependencyInjection, Configuration, Logging). Provides HOCON-less, code-first configuration and an `ActorRegistry` for injecting actor references into ASP.NET Core, SignalR, gRPC, and other DI-based services.
+
+- **NuGet Package:** `Akka.Hosting`
+- **GitHub / Documentation:**
+- **Sub-packages:**
+  - `Akka.Remote.Hosting` — Akka.Remote configuration via Hosting
+  - `Akka.Cluster.Hosting` — Akka.Cluster, Sharding, and Tools configuration via Hosting
+  - `Akka.Persistence.Hosting` — Persistence configuration via Hosting
+
+### 12. Akka.DependencyInjection
+
+Integrates `Microsoft.Extensions.DependencyInjection` (or other DI containers) directly into actor construction, allowing actors to receive injected services through their constructors.
+
+- **NuGet Package:** `Akka.DependencyInjection`
+- **Documentation:**
+
+---
+
+## Discovery & Management
+
+### 13. Akka.Discovery
+
+Provides a pluggable service discovery API for dynamically locating cluster nodes in cloud and containerized environments. Ships with a built-in config-based discovery method and supports plugins for AWS, Azure, Kubernetes, and more.
+
+- **NuGet Package:** `Akka.Discovery` (core); provider-specific packages (e.g.
`Akka.Discovery.Azure`, `Akka.Discovery.AwsApi`, `Akka.Discovery.KubernetesApi`)
+- **Documentation:**
+
+### 14. Akka.Management
+
+A toolkit for managing and bootstrapping Akka.NET clusters in dynamic environments. Exposes HTTP endpoints for cluster coordination and works with Akka.Discovery to enable safe, automated cluster formation via Cluster Bootstrap (replacing static seed nodes).
+
+- **NuGet Package:** `Akka.Management`
+- **Documentation:**
+- **GitHub:**
+
+### 15. Akka.Coordination
+
+Provides lease-based distributed locking primitives used by Split Brain Resolver, Cluster Sharding, and Cluster Singleton to prevent split-brain scenarios. Backend implementations include Azure Blob Storage (`Akka.Coordination.Azure`).
+
+- **NuGet Package:** `Akka.Coordination.Azure`
+- **Documentation (Azure Lease):** (see Coordination section)
+
+---
+
+## Testing
+
+### 16. Akka.TestKit
+
+The base testing framework for Akka.NET actors. Provides `TestProbe`, `TestActorRef`, `EventFilter`, and other utilities for unit and integration testing of actor-based systems. Requires a test-framework-specific adapter package.
+
+- **NuGet Package:** `Akka.TestKit` (base), `Akka.TestKit.Xunit2`, `Akka.TestKit.NUnit`, `Akka.TestKit.MSTest`
+- **Documentation:**
+
+### 17. Akka.Hosting.TestKit
+
+An integration testing toolkit built on top of `Akka.Hosting` and xUnit. Provides a `TestKit` base class that spins up a full `Microsoft.Extensions.Hosting` environment with DI, logging, and Akka.NET — ideal for testing actors alongside other services.
+
+- **NuGet Package:** `Akka.Hosting.TestKit`
+- **NuGet Page:**
+
+### 18. Akka.MultiNodeTestRunner
+
+Infrastructure for running distributed, multi-node integration tests across multiple actor systems. Used to validate cluster behavior, split-brain scenarios, and network partition handling.
+
+- **GitHub:**
+
+---
+
+## Networking
+
+### 19. Akka.IO
+
+Low-level, actor-based TCP and UDP networking built into the core Akka library.
Provides non-blocking I/O through an actor API rather than traditional socket programming.
+
+- **Part of the core `Akka` NuGet package**
+- **Documentation (TCP):**
+
+---
+
+## Serialization & Configuration
+
+### 20. Serialization
+
+Akka.NET includes a pluggable serialization system used for message passing (both local and remote). The default serializer is JSON-based; high-performance alternatives include Hyperion and custom `Serializer` implementations. Serialization is critical for Remoting, Persistence, and Cluster Sharding.
+
+- **Part of the core `Akka` NuGet package**
+- **Documentation:**
+
+### 21. HOCON Configuration
+
+Akka.NET uses HOCON (Human-Optimized Config Object Notation) as its primary configuration format. HOCON supports includes, substitutions, and hierarchical keys. With `Akka.Hosting`, HOCON can be replaced entirely by code-first configuration.
+
+- **Documentation:**
+- **Module Configs:**
+
+---
+
+## Persistence Providers (Selected)
+
+Akka.Persistence supports pluggable backends.
Some notable provider packages:
+
+| Provider | NuGet Package | Documentation / GitHub |
+|---|---|---|
+| SQL (Linq2Db — SQL Server, PostgreSQL, SQLite, MySQL, Oracle) | `Akka.Persistence.Sql` | |
+| MongoDB | `Akka.Persistence.MongoDb` | |
+| Azure (Table Storage / Blob Storage) | `Akka.Persistence.Azure` | |
+
+---
+
+## Logging Integrations
+
+| Logger | NuGet Package | GitHub |
+|---|---|---|
+| Serilog | `Akka.Logger.Serilog` | |
+| NLog | `Akka.Logger.NLog` | |
+
+---
+
+## Additional Resources
+
+- **Official Documentation Home:**
+- **GitHub Organization:**
+- **Modules Overview:**
+- **Bootcamp (Learn Akka.NET):**
+- **Example Projects:**
+- **Community Discord:**
+- **Petabridge (Commercial Support):**
diff --git a/AkkaDotNet/BestPracticesAndTraps.md b/AkkaDotNet/BestPracticesAndTraps.md
new file mode 100644
index 0000000..b8c8524
--- /dev/null
+++ b/AkkaDotNet/BestPracticesAndTraps.md
@@ -0,0 +1,358 @@
+# Best Practices and Traps to Avoid
+
+This document consolidates cross-cutting best practices and common traps for building the SCADA system on Akka.NET. While each component document covers module-specific guidance, the issues here span multiple components or emerge from the interactions between them.
+
+---
+
+## Best Practices
+
+### 1. Message Design
+
+**Use immutable C# records for all messages.** Records provide value equality, `ToString()` for logging, and immutability by default. Every message that crosses an actor boundary — whether local, remote, or persisted — should be a record.
+
+```csharp
+// Good
+public record SendCommand(string CommandId, string DeviceId, string TagName, object Value, DateTime Timestamp);
+
+// Bad — mutable class with no equality semantics
+public class SendCommand
+{
+    public string CommandId { get; set; }
+    public Dictionary<string, object> Metadata { get; set; } // Mutable collection
+}
+```
+
+**Keep messages small and focused.** A message should represent one thing: a command, an event, a query, or a response.
Do not bundle unrelated data into a single message to "save roundtrips." Actors process messages sequentially — smaller messages mean faster processing and clearer intent.
+
+**Never put IActorRef in persisted messages.** Actor references contain node addresses that are invalid after restart. Store logical identifiers (device ID, actor name) and resolve references at runtime via the `ActorRegistry` or `ActorSelection`.
+
+### 2. Actor Hierarchy Design
+
+**Model the domain, not the infrastructure.** The actor hierarchy should reflect the SCADA domain:
+
+```
+/user
+  /device-manager (Singleton)
+    /machine-001
+    /machine-002
+    /...
+  /alarm-manager
+    /alarm-processor-1
+    /alarm-processor-2
+  /historian-writer
+```
+
+Do not create actors for infrastructure concerns (one actor per database connection, one actor per thread). Use DI for infrastructure services; use actors for domain entities with state and lifecycle.
+
+**One actor, one responsibility.** The Device Manager creates and supervises device actors — it does not process tag updates, evaluate alarm conditions, or write to the historian. Each of those concerns gets its own actor or actor subtree.
+
+**Prefer flat-and-wide over deep-and-narrow.** A Device Manager with 500 direct child device actors is fine. A hierarchy where `DeviceManager → Zone → Group → Subgroup → Device` adds supervision overhead at every level. Only add hierarchy levels when you need different supervision strategies at each level.
+
+### 3. Supervision Strategy
+
+**Design supervision strategies explicitly for every parent actor.** The default strategy (restart on any exception) is rarely correct for a SCADA system.
Think through each failure type:
+
+| Exception Type | Strategy | Rationale |
+|---|---|---|
+| `CommunicationException` | Restart with backoff | Transient network issue; reconnection likely succeeds |
+| `ConfigurationException` | Stop | Bad config won't fix itself on restart |
+| `TimeoutException` | Restart | Equipment may be temporarily unresponsive |
+| `SerializationException` | Resume | Bad message; skip it and continue |
+| `OutOfMemoryException` | Escalate | Node-level problem; let the parent/system handle it |
+
+**Use exponential backoff for restarts.** When a device actor fails repeatedly (equipment offline), exponential backoff prevents the actor from saturating the network with rapid reconnection attempts:
+
+```csharp
+var strategy = new OneForOneStrategy(
+    maxNrOfRetries: -1,
+    withinTimeRange: TimeSpan.MaxValue,
+    decider: Decider.From(Directive.Restart));
+
+// Combine with BackoffSupervisor
+var backoffProps = BackoffSupervisor.Props(
+    Backoff.OnFailure(
+        childProps: Props.Create(() => new DeviceActor(config)),
+        childName: $"device-{config.DeviceId}",
+        minBackoff: TimeSpan.FromSeconds(3),
+        maxBackoff: TimeSpan.FromSeconds(60),
+        randomFactor: 0.2));
+```
+
+### 4. Failover Design
+
+**Design every actor for restart.** The entire device actor subtree will be destroyed and recreated during failover. No actor should assume its state persists across restarts unless it explicitly uses Akka.Persistence. This means:
+
+- Device actors re-read current state from equipment on startup
+- The Device Manager replays its Persistence journal for in-flight commands
+- Alarm state is reconstructed from Distributed Data (or re-evaluated from current tag values)
+- Subscriptions are re-established, not carried over
+
+**Test failover as a first-class scenario, not an afterthought.** Build MultiNodeTestRunner specs for failover early in development. The most dangerous bugs in a SCADA system are the ones that only appear during failover.
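+
+The "design for restart" rule above can be sketched as a device actor that rebuilds every piece of transient state in `PreStart` and releases it in `PostStop`. This is only a sketch: the session types and factory (`IOpcUaSession`, `OpcUaSessionFactory`) are illustrative placeholders, not the project's actual API.
+
+```csharp
+// Sketch: all transient state is rebuilt on every start/restart,
+// so supervision restarts and failover always begin from a clean slate.
+// IOpcUaSession / OpcUaSessionFactory stand in for whatever client library is used.
+public class OpcUaDeviceActor : ReceiveActor
+{
+    private readonly DeviceConfig _config;
+    private IOpcUaSession _session; // transient — never persisted, never assumed valid
+
+    public OpcUaDeviceActor(DeviceConfig config)
+    {
+        _config = config;
+        Receive<TagValueChanged>(msg => { /* update local tag cache */ });
+    }
+
+    protected override void PreStart()
+    {
+        // Re-establish everything from scratch: connection, subscriptions,
+        // and current tag values read directly from the equipment.
+        _session = OpcUaSessionFactory.Connect(_config.Endpoint);
+        foreach (var tag in _config.SubscribedTags)
+            _session.Subscribe(tag, Self);
+    }
+
+    protected override void PostStop()
+    {
+        _session?.Dispose(); // release the connection on every stop or restart
+    }
+}
+```
+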
**Accept that failover has a gap.** With a cold standby, there will be a window (20–40 seconds) where no node is actively communicating with equipment. Design the system so this gap is safe: equipment should have local safety logic that does not depend on continuous SCADA commands. If the SCADA system is the sole safety mechanism, a cold standby may not be appropriate — consider a warm standby architecture.
+
+### 5. Configuration Management
+
+**Identical code, different config.** Both nodes in the pair should run the exact same compiled code. All node-specific differences (hostname, seed node order) come from `appsettings.json`. This prevents configuration drift bugs that only manifest in one node.
+
+**Validate configuration at startup.** Before the ActorSystem starts, validate all configuration values. A misconfigured seed node address that silently fails to connect is worse than a loud startup crash:
+
+```csharp
+public static class ConfigValidator
+{
+    public static void Validate(SiteConfiguration config)
+    {
+        if (string.IsNullOrEmpty(config.NodeHostname))
+            throw new ConfigurationException("NodeHostname is required");
+        if (config.SeedNodes.Count < 2)
+            throw new ConfigurationException("At least 2 seed nodes are required for failover pair");
+        if (config.SeedNodes.All(s => !s.Contains(config.NodeHostname)))
+            throw new ConfigurationException("This node's hostname must appear in the seed node list");
+    }
+}
+```
+
+### 6. Logging and Observability
+
+**Use structured logging with correlation IDs.** Every command should carry a `CommandId` that flows through all actors and log entries.
When diagnosing a failover issue, you need to trace a specific command's journey:
+
+```csharp
+_logger.LogInformation("Command {CommandId} dispatched to device {DeviceId} for tag {TagName}",
+    command.CommandId, command.DeviceId, command.TagName);
+```
+
+**Log cluster membership events at Warning level.** In a SCADA system, cluster membership changes are operationally significant. An `UnreachableMember` event means failover may be imminent:
+
+```csharp
+Receive<ClusterEvent.UnreachableMember>(msg =>
+    _logger.LogWarning("Node unreachable: {Address} — failover may initiate", msg.Member.Address));
+```
+
+**Monitor dead letters.** Dead letters in production indicate messages being sent to actors that no longer exist — often a symptom of failover timing issues or stale actor references. Subscribe to dead letters and log them at Warning level.
+
+### 7. Performance
+
+**Do not prematurely optimize.** Akka.NET handles millions of messages per second. A 500-device SCADA system with on the order of 50 tags per device updating every second is approximately 25,000 messages per second — well within the comfort zone of a single node. Optimize only after profiling identifies an actual bottleneck.
+
+**Use `Tell`, not `Ask`, for internal actor communication.** `Ask` creates a temporary actor, allocates a `TaskCompletionSource`, and starts a timeout timer for every call. In the hot path (tag updates flowing from device actors to alarm processors), use `Tell` with reply-to patterns:
+
+```csharp
+// Good — fire and forget with reply via Tell
+deviceActor.Tell(new GetDeviceState(replyTo: Self));
+
+// Bad in hot paths — creates overhead per call
+var state = await deviceActor.Ask<DeviceState>(new GetDeviceState(), TimeSpan.FromSeconds(5));
+```
+
+**Batch historian writes.** Writing each tag update individually to SQL Server creates excessive I/O. Use Akka.Streams `GroupedWithin` to batch updates (see [10-Streams.md](./10-Streams.md)).
+
+### 8. Testing Strategy
+
+**Test at three levels:**
+
+| Level | Tool | What to Test | Volume |
+|---|---|---|---|
+| Unit | Akka.TestKit | Individual actor behavior, message handling, state transitions | Many — fast, cheap |
+| Integration | Akka.Hosting.TestKit | DI wiring, configuration loading, actor startup pipeline | Moderate |
+| Distributed | MultiNodeTestRunner | Failover, SBR, singleton migration, command recovery | Few — slow, expensive |
+
+**Create mock protocol adapters from day one.** The ability to test without real equipment is essential. Mock adapters should simulate both normal behavior (tag updates, command acks) and failure scenarios (connection drops, timeouts, garbled responses).
+
+---
+
+## Traps to Avoid
+
+### Trap 1: The 2-Node Split Brain
+
+**The problem:** With exactly 2 nodes, there is no majority. If the network partitions, each node sees itself as the sole survivor and runs the Cluster Singleton. Both nodes issue commands to equipment simultaneously.
+
+**How it manifests:** Two Device Manager singletons running concurrently, each sending commands to the same equipment. Motors start and stop unpredictably. Safety-critical commands are duplicated.
+
+**How to avoid it:** Configure the Split Brain Resolver with `keep-oldest` and `down-if-alone = on`. Accept that both nodes may down themselves during a true partition, and rely on Windows Service auto-restart to reform the cluster. Alternatively, implement a lease-based SBR with an external arbiter (see [15-Coordination.md](./15-Coordination.md)).
+
+**How to detect it:** Monitor for `ClusterEvent.MemberUp` events where the cluster has 2 members that both believe they are leader. Log the singleton actor's lifecycle — if you see two "Singleton started" log entries in the same time window from different nodes, you have a split brain.
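+
+To make the duplicate-singleton symptom easy to spot in logs, the singleton actor can log the hosting node's address in its lifecycle hooks. A minimal sketch (the actor name is illustrative; `Cluster.Get`, `Context.GetLogger`, and the lifecycle hooks are standard Akka.NET APIs):
+
+```csharp
+// Sketch: two overlapping "STARTED" entries from different addresses,
+// with no "STOPPED" between them, indicate a split brain.
+public class DeviceManagerActor : ReceiveActor
+{
+    private readonly ILoggingAdapter _log = Context.GetLogger();
+
+    protected override void PreStart() =>
+        _log.Warning("Device Manager singleton STARTED on {0}",
+            Cluster.Get(Context.System).SelfAddress);
+
+    protected override void PostStop() =>
+        _log.Warning("Device Manager singleton STOPPED on {0}",
+            Cluster.Get(Context.System).SelfAddress);
+}
+```
+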
### Trap 2: Blocking Inside Actors
+
+**The problem:** A device actor makes a synchronous network call (e.g., `opcClient.ReadValue()` without `await`) inside a `Receive` handler. This blocks the actor's dispatcher thread. If enough actors block simultaneously, the thread pool is exhausted and the entire ActorSystem stalls — including cluster heartbeats.
+
+**How it manifests:** The cluster failure detector triggers because heartbeats stop being processed. The standby node marks the active node as unreachable, even though it's running — it's just deadlocked. Failover initiates unnecessarily, and the same blocking behavior occurs on the new active node.
+
+**How to avoid it:** Use `ReceiveAsync` or `PipeTo` for all asynchronous operations. If a third-party library only offers synchronous APIs, wrap the call in `Task.Run` and `PipeTo` the result:
+
+```csharp
+Receive<ReadTag>(msg =>
+{
+    Task.Run(() => syncOnlyClient.ReadValue(msg.TagName))
+        .PipeTo(Self,
+            success: value => new TagValueRead(msg.TagName, value),
+            failure: ex => new TagReadFailed(msg.TagName, ex));
+});
+```
+
+### Trap 3: Singleton Starvation on Startup
+
+**The problem:** After failover, the surviving node is alone. If `akka.cluster.min-nr-of-members = 2`, the Cluster waits for a second member before allowing the Singleton to start. No device communication occurs until the failed node is restarted.
+
+**How it manifests:** Failover appears to succeed (cluster state shows 1 member), but the Device Manager never starts. Equipment is disconnected indefinitely.
+
+**How to avoid it:** Set `akka.cluster.min-nr-of-members = 1`. The Singleton must be able to start on a single-node cluster.
+
+### Trap 4: Persistence Journal Growth
+
+**The problem:** The Device Manager persists every command event and never cleans up. Over weeks of operation, the SQLite journal grows to gigabytes. Recovery after failover takes minutes instead of seconds as the entire journal is replayed.
**How it manifests:** Failover time gradually increases. Eventually, the standby node takes so long to recover that operators assume it has failed and restart it, creating a cascading failure loop.
+
+**How to avoid it:** Take periodic snapshots and delete old journal entries. Expire stale pending commands. Set up a maintenance task that monitors journal file size:
+
+```csharp
+// After every 100 persisted events
+if (LastSequenceNr % 100 == 0)
+{
+    SaveSnapshot(_state);
+}
+
+Receive<SaveSnapshotSuccess>(success =>
+{
+    DeleteMessages(success.Metadata.SequenceNr);
+    DeleteSnapshots(new SnapshotSelectionCriteria(success.Metadata.SequenceNr - 1));
+});
+```
+
+### Trap 5: Serialization Mismatch After Deployment
+
+**The problem:** A new version of the SCADA software changes a message type (adds a field, renames a property, changes a namespace). The Persistence journal contains events serialized with the old schema. On recovery, deserialization fails and the persistent actor crashes.
+
+**How it manifests:** After a software update, the Device Manager singleton fails to start. The error log shows `SerializationException` during recovery. The system is down until someone manually fixes or clears the journal.
+
+**How to avoid it:** Never modify existing persisted event types in a breaking way. Add new event versions alongside old ones and handle both during recovery (see [20-Serialization.md](./20-Serialization.md)). Test journal recovery with old-format events as part of the CI/CD pipeline:
+
+```csharp
+[Fact]
+public void Should_recover_v1_command_events()
+{
+    // Seed the journal with V1 events
+    // Start the persistent actor
+    // Verify state is correctly recovered
+}
+```
+
+### Trap 6: Equipment Reconnection Storm
+
+**The problem:** After failover, 500 device actors start simultaneously on the standby node and all attempt to connect to their respective equipment at the same instant.
This saturates the network and the equipment's connection capacity, causing most connections to fail. All 500 actors retry after the same backoff interval, creating synchronized waves.

**How it manifests:** After failover, only a handful of devices connect successfully. The rest cycle through connect → timeout → retry in lockstep waves. Full reconnection takes 10–15 minutes instead of 30 seconds.

**How to avoid it:** Stagger device actor startup. The Device Manager should create device actors in small batches with delays between them:

```csharp
// DeviceConfig is the per-device configuration type (illustrative name).
// Call this from ReceiveAsync so Context remains valid across awaits.
private async Task StartDeviceActors(IReadOnlyList<DeviceConfig> configs)
{
    const int batchSize = 20;
    foreach (var batch in configs.Chunk(batchSize))
    {
        foreach (var config in batch)
        {
            var props = _adapterFactory.CreateAdapterProps(config, _resolver);
            Context.ActorOf(props, $"device-{config.DeviceId}");
        }
        await Task.Delay(TimeSpan.FromSeconds(1)); // Pause between batches
    }
}
```

Additionally, add random jitter to the `BackoffSupervisor`'s `randomFactor` so retrying actors don't synchronize.

### Trap 7: Distributed Data Loss on Full Cluster Restart

**The problem:** Distributed Data is in-memory by default. If both nodes go down simultaneously (power failure, site-wide event), all Distributed Data is lost. If the system relies on Distributed Data for pending command state without a Persistence backup, those commands are lost.

**How it manifests:** After a site-wide restart, the Device Manager has no record of in-flight commands. Commands that were sent but not yet acknowledged are neither retried nor flagged for operator review. Equipment may be in an inconsistent state.

**How to avoid it:** Use Distributed Data's durable LMDB storage for critical keys (see [09-DistributedData.md](./09-DistributedData.md)). Additionally, always use Akka.Persistence as the authoritative record for in-flight commands — Distributed Data is a convenience for the standby, not the source of truth.
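As a sketch, the durable-storage side of this mitigation looks like the following HOCON fragment. The key pattern and directory name are illustrative assumptions, not settings taken from this system's configuration; see [09-DistributedData.md](./09-DistributedData.md) for the authoritative values:

```hocon
akka.cluster.distributed-data.durable {
  # Only keys matching these patterns survive a full cluster restart
  # ("*" alone would persist every key)
  keys = ["pending-commands*"]
  lmdb {
    dir = "ddata"          # Local directory for the LMDB files
    map-size = 100 MiB     # Upper bound for the memory-mapped store
  }
}
```

Durable keys add a local disk write to every update of the matching entries, so restrict the patterns to genuinely critical state rather than persisting everything.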

### Trap 8: Forgetting CoordinatedShutdown

**The problem:** The SCADA Windows Service is stopped (for maintenance, update, etc.) by killing the process or stopping the service without graceful shutdown. The Cluster Singleton does not hand over cleanly; the standby detects the active node as unreachable (not gracefully left) and waits for the SBR timeout before taking over.

**How it manifests:** Planned maintenance causes the same 20–40 second failover gap as an unplanned crash. Operators learn to distrust the failover system and develop manual procedures that bypass it.

**How to avoid it:** Ensure the Windows Service wrapper triggers `CoordinatedShutdown`:

```csharp
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    stoppingToken.Register(() =>
    {
        CoordinatedShutdown.Get(_actorSystem)
            .Run(CoordinatedShutdown.ClrExitReason.Instance)
            .Wait(TimeSpan.FromSeconds(30));
    });

    // Keep the hosted service alive until shutdown is requested
    try
    {
        await Task.Delay(Timeout.Infinite, stoppingToken);
    }
    catch (OperationCanceledException)
    {
        // Normal path on service stop
    }
}
```

And configure Akka to respect the CLR shutdown hook:

```hocon
akka.coordinated-shutdown {
  run-by-clr-shutdown-hook = on
  exit-clr = on
}
akka.cluster.run-coordinated-shutdown-when-down = on
```

With graceful shutdown, the singleton migrates in seconds (limited by the hand-over retry interval) instead of the full failure detection timeout.

### Trap 9: Testing Only the Happy Path

**The problem:** Tests cover normal operation (device connects, tags update, commands succeed) but not failure scenarios (connection drops mid-command, equipment returns garbled data, failover during alarm escalation). The system works perfectly in the lab and fails unpredictably in production.

**How to avoid it:** For every happy-path test, write at least one failure-path test:

- Device connects → Device connection times out
- Command succeeds → Command times out, equipment never responds
- Tag updates normally → Tag update contains invalid data
- Singleton runs on Node A → Node A crashes during command processing
- Both nodes healthy → Network partition between nodes

### Trap 10: Overusing Ask in Actor-to-Actor Communication

**The problem:** Internal actor communication uses `Ask` (request-response) pervasively. Each `Ask` creates a temporary actor and a timeout. Under load, this creates thousands of temporary actors, increasing garbage collection pressure and memory consumption. Worse, if an actor is processing slowly, `Ask` timeouts cascade into `AskTimeoutException` storms that fill logs and trigger supervisor restarts.

**How it manifests:** Under moderate load, the system starts logging `AskTimeoutException` frequently. Device actors are restarted by supervisors that interpret the timeout exception as a failure. The restarts disconnect equipment, causing more timeouts, creating a cascading failure.

**How to avoid it:** Use `Tell` with explicit reply-to actor references for all internal communication.
Reserve `Ask` for system boundaries where an external caller (ASP.NET controller, health check) needs a synchronous response from an actor:

```csharp
// Internal: Tell with reply (message and response types are illustrative)
target.Tell(new RequestState(replyTo: Self));
Receive<StateResponse>(response => ProcessResponse(response));

// Boundary only: Ask from ASP.NET
public async Task<IActionResult> GetDeviceState(string deviceId)
{
    var state = await _deviceManager.Ask<DeviceState>(
        new GetDeviceState(deviceId), TimeSpan.FromSeconds(5));
    return Ok(state);
}
```

---

## Checklist: Before Going to Production

- [ ] Split Brain Resolver is configured and tested (not left on defaults)
- [ ] `min-nr-of-members = 1` so the singleton starts after failover
- [ ] All cross-boundary messages are serialization-tested
- [ ] Persistence journal cleanup is implemented (snapshots + deletion)
- [ ] Persisted event schema versioning strategy is documented
- [ ] Device actor startup is staggered to prevent reconnection storms
- [ ] CoordinatedShutdown is wired into the Windows Service lifecycle
- [ ] MultiNodeTestRunner specs cover failover, partition, and rejoin scenarios
- [ ] Dead letter monitoring is enabled
- [ ] Structured logging with command correlation IDs is in place
- [ ] Mock protocol adapters exist for all equipment protocols
- [ ] appsettings.json is validated at startup before the ActorSystem starts
- [ ] Both nodes have been tested as the surviving node (not just Node A)
- [ ] The failover gap duration is documented and accepted by operations
- [ ] Equipment has local safety logic that does not depend on continuous SCADA commands

diff --git a/AkkaDotNet/README.md b/AkkaDotNet/README.md
new file mode 100644
index 0000000..b7b4db4
--- /dev/null
+++ b/AkkaDotNet/README.md
@@ -0,0 +1,100 @@
# Akka.NET SCADA System — Architecture & Design Knowledge Base

## System Overview

This knowledge base documents the architectural design decisions for an industrial SCADA system built on Akka.NET.
The system manages communication with 50–500 machines per site using a unified protocol abstraction layer (OPC-UA and a custom legacy SCADA protocol), deployed as **2-node active/cold-standby failover pairs** on Windows Server.

### Key Architectural Constraints

- **Platform:** Windows Server, .NET 10, C#
- **Cluster Topology:** Exactly 2 nodes per site — active/cold-standby
- **Equipment Scale:** 50–500 machines per site
- **Protocols:** OPC-UA and a custom legacy SCADA protocol, both supporting tag subscriptions; unified behind a common high-level comms abstraction
- **Consistency Requirement:** No commands lost or duplicated during failover
- **Persistence:** Akka.Persistence on SQLite (no SQL Server guaranteed at every site), synced between nodes via Akka.NET distributed mechanisms
- **Database:** MS SQL Server used where available (historian, reporting); not the persistence journal
- **Failover Model:** Cold standby — the standby node is idle until failover, then recovers state via the Akka.Persistence journal and re-reads from equipment

### Core Architectural Patterns

| Pattern | Akka.NET Component | Purpose |
|---|---|---|
| Actor-per-device | Actors (Core) | Each machine is represented by a dedicated actor |
| Protocol abstraction | Actors (Core) + Akka.IO/Streams | Common comms interface with OPC-UA and custom protocol implementations |
| Active node ownership | Cluster Singleton | Only the active node owns device communication |
| Command recovery | Akka.Persistence (SQLite) | In-flight command journal for failover recovery |
| Cross-node state sync | Distributed Data | Lightweight state replication between the pair |
| 2-node split-brain safety | Cluster (SBR) | Keep-oldest or lease-based strategy for the 2-node edge case |

---

## Document Index

### Reference Catalog

| Document | Description |
|---|---|
| [AkkaComponents.md](./AkkaComponents.md) | Master catalog of all Akka.NET components with documentation URLs |
| [BestPracticesAndTraps.md](./BestPracticesAndTraps.md) | Cross-cutting best practices, common traps, and a production readiness checklist |

### Core Components

| # | Document | Component | Relevance to SCADA System |
|---|---|---|---|
| 1 | [01-Actors.md](./01-Actors.md) | Actors (Core Library) | Actor-per-device model, supervision hierarchies, protocol abstraction |
| 2 | [02-Remoting.md](./02-Remoting.md) | Akka.Remote | Transport layer between the two failover nodes |
| 3 | [03-Cluster.md](./03-Cluster.md) | Akka.Cluster | 2-node membership, failure detection, split-brain resolution |
| 4 | [04-ClusterSharding.md](./04-ClusterSharding.md) | Cluster Sharding | Potential for distributing device actors (evaluated but may not apply) |
| 5 | [05-ClusterSingleton.md](./05-ClusterSingleton.md) | Cluster Singleton | Active node owns all device communication |
| 6 | [06-ClusterPubSub.md](./06-ClusterPubSub.md) | Cluster Publish-Subscribe | Event/alarm distribution between nodes |
| 7 | [07-ClusterMetrics.md](./07-ClusterMetrics.md) | Cluster Metrics | Node health monitoring |
| 8 | [08-Persistence.md](./08-Persistence.md) | Akka.Persistence | Command journal on SQLite for failover recovery |
| 9 | [09-DistributedData.md](./09-DistributedData.md) | Distributed Data | Cross-node state sync via CRDTs |
| 10 | [10-Streams.md](./10-Streams.md) | Akka.Streams | Tag subscription data flow, backpressure handling |

### Hosting & Integration

| # | Document | Component | Relevance to SCADA System |
|---|---|---|---|
| 11 | [11-Hosting.md](./11-Hosting.md) | Akka.Hosting | Microsoft.Extensions integration, DI, service lifecycle |
| 12 | [12-DependencyInjection.md](./12-DependencyInjection.md) | Akka.DependencyInjection | Injecting protocol adapters and services into actors |

### Discovery & Management

| # | Document | Component | Relevance to SCADA System |
|---|---|---|---|
| 13 | [13-Discovery.md](./13-Discovery.md) | Akka.Discovery | Node discovery in the failover pair |
| 14 | [14-Management.md](./14-Management.md) | Akka.Management | HTTP health endpoints, cluster bootstrap |
| 15 | [15-Coordination.md](./15-Coordination.md) | Akka.Coordination | Lease-based locking for singleton/SBR safety |

### Testing

| # | Document | Component | Relevance to SCADA System |
|---|---|---|---|
| 16 | [16-TestKit.md](./16-TestKit.md) | Akka.TestKit | Unit testing device actors and protocol logic |
| 17 | [17-HostingTestKit.md](./17-HostingTestKit.md) | Akka.Hosting.TestKit | Integration testing with full DI/hosting stack |
| 18 | [18-MultiNodeTestRunner.md](./18-MultiNodeTestRunner.md) | MultiNodeTestRunner | Failover scenario testing |

### Networking & Infrastructure

| # | Document | Component | Relevance to SCADA System |
|---|---|---|---|
| 19 | [19-AkkaIO.md](./19-AkkaIO.md) | Akka.IO | TCP/UDP communication with equipment |
| 20 | [20-Serialization.md](./20-Serialization.md) | Serialization | Message serialization for remoting and persistence |
| 21 | [21-Configuration.md](./21-Configuration.md) | HOCON Configuration | System configuration management |

---

## How to Use This Knowledge Base

Each component document follows a consistent structure:

1. **Overview** — What the component does and its role in the SCADA system
2. **When to Use / When Not to Use** — Decision criteria for this system
3. **Design Decisions** — Architectural choices specific to our failover-pair SCADA topology
4. **Common Patterns** — Recommended usage patterns for our context
5. **Anti-Patterns** — What to avoid and why
6. **Configuration Guidance** — Key settings with recommended values for our deployment
7. **References** — Links to official documentation

Start with [AkkaComponents.md](./AkkaComponents.md) for a broad overview, then read the core components (1–10) for the primary architectural decisions. The hosting/integration docs (11–12) cover how the system boots and wires together. Testing docs (16–18) cover validation strategy. Before going to production, review [BestPracticesAndTraps.md](./BestPracticesAndTraps.md) for cross-cutting concerns and the production readiness checklist.