# 08 — Persistence (Akka.Persistence) ## Overview Akka.Persistence enables actors to persist their state as a sequence of events (event sourcing) and/or snapshots. On restart — whether from a crash, deployment, or failover — the actor replays its event journal to recover its last known state. Persistence uses pluggable backends for the event journal and snapshot store. In our SCADA system, Persistence is the mechanism that ensures **no commands are lost or duplicated during failover**. The Device Manager singleton persists in-flight commands to a SQLite journal. When the singleton restarts on the standby node, it replays the journal to identify commands that were sent but not yet acknowledged, and can re-issue or reconcile them. ## When to Use - The Device Manager singleton — for journaling in-flight commands and their acknowledgments - Any actor where state must survive failover (pending command queues, alarm acknowledgment state) - Command deduplication — the journal provides a record of commands already sent ## When Not to Use - Device actors that hold transient equipment state (current tag values) — this state is re-read from equipment on reconnection; persisting it adds latency without benefit - High-frequency tag value updates — these should go to the historian (SQL Server or time-series DB), not the Persistence journal - Read models / projections — use the historian or a dedicated CQRS read store, not the Persistence journal ## Design Decisions for the SCADA System ### SQLite as the Journal Backend SQL Server is not guaranteed at every site. SQLite provides a local, zero-configuration persistence backend. Use `Akka.Persistence.Sql` with the SQLite provider: ``` NuGet: Akka.Persistence.Sql ``` The SQLite database file lives on the local filesystem of each node. ### Journal Sync Between Nodes The key challenge: each node has its own SQLite file. When the standby takes over, it needs the active node's journal data. Options: **Option A: Shared Journal via Distributed Data (Recommended)** Do not attempt to sync SQLite files between nodes. Instead, use the active node as the journal host during normal operation, and ensure the Device Manager actor persists only critical state (in-flight commands). On failover, the new singleton's journal starts fresh, and it reconciles state by: 1. Re-reading equipment state from devices (tag values, connection status) 2. Checking a lightweight "pending commands" record in Distributed Data (see [09-DistributedData.md](./09-DistributedData.md)) This approach avoids the complexity and corruption risks of SQLite file replication. **Option B: Shared SQLite on a Network Path** Place the SQLite file on a shared filesystem (SMB share) accessible by both nodes. This is simple but introduces risks: - SQLite does not handle network filesystem locking reliably — corruption is possible - If the network share becomes unavailable, persistence fails entirely - Not recommended for production SCADA systems **Option C: Use SQL Server Where Available, SQLite as Fallback** Configure the journal backend based on site capabilities. Sites with SQL Server use it as a shared journal (both nodes point to the same database). Sites without SQL Server use local SQLite with the reconciliation approach from Option A. ```csharp // In Akka.Hosting configuration if (siteConfig.HasSqlServer) { builder.WithSqlPersistence(siteConfig.SqlServerConnectionString, ProviderName.SqlServer2019); } else { builder.WithSqlPersistence($"Data Source={localDbPath}", ProviderName.SQLite); } ``` ### What to Persist: In-Flight Command Journal The Device Manager persists a focused set of events — only commands and their outcomes: ```csharp // Events (persisted to journal) public record CommandDispatched(string CommandId, string DeviceId, string TagName, object Value, DateTime Timestamp); public record CommandAcknowledged(string CommandId, bool Success, DateTime Timestamp); public record CommandExpired(string CommandId, DateTime Timestamp); // State (rebuilt from events) public class DeviceManagerState { public Dictionary PendingCommands { get; } = new(); } ``` ### Persistence ID Strategy The singleton Device Manager uses a fixed persistence ID since there's only ever one instance: ```csharp public override string PersistenceId => "device-manager"; ``` ### Snapshots for Fast Recovery If the command journal grows large (unlikely if commands are acknowledged/expired regularly), use snapshots to speed up recovery: ```csharp // Snapshot every 100 events if (LastSequenceNr % 100 == 0) { SaveSnapshot(_state); } // Handle snapshot offers Recover(offer => { _state = (DeviceManagerState)offer.Snapshot; }); ``` ## Common Patterns ### Command Lifecycle Pattern ``` 1. Operator sends command → Device Manager 2. Device Manager persists CommandDispatched event 3. Device Manager sends command to device actor 4. Device actor sends to equipment, waits for ack 5. Device actor receives ack → tells Device Manager 6. Device Manager persists CommandAcknowledged event 7. Device Manager removes from pending queue On failover recovery: 1. New Device Manager replays journal 2. Rebuilds pending command queue (dispatched but not acknowledged) 3. For each pending command: re-read equipment state 4. If equipment shows command was applied → persist ack 5. If equipment shows command was not applied → re-issue 6. If uncertain → flag for operator review ``` ### Expiration of Stale Commands Commands that remain pending beyond a timeout should be expired to prevent the journal from accumulating indefinitely: ```csharp Context.System.Scheduler.ScheduleTellRepeatedly( TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1), Self, new ExpireStaleCommands(), ActorRefs.NoSender); // Handler Receive(_ => { var cutoff = DateTime.UtcNow.AddMinutes(-5); foreach (var cmd in _state.PendingCommands.Values.Where(c => c.Timestamp < cutoff)) { Persist(new CommandExpired(cmd.CommandId, DateTime.UtcNow), evt => { _state.PendingCommands.Remove(evt.CommandId); }); } }); ``` ### Journal Cleanup Periodically delete old journal entries and snapshots to prevent SQLite file growth: ```csharp // After saving a snapshot, delete events up to the snapshot sequence number Receive(success => { DeleteMessages(success.Metadata.SequenceNr); DeleteSnapshots(new SnapshotSelectionCriteria(success.Metadata.SequenceNr - 1)); }); ``` ## Anti-Patterns ### Persisting Tag Value Updates Do not use Akka.Persistence to journal every tag value change from equipment. A 500-device system with 50 tags each updating every second produces 25,000 events per second. This will overwhelm the SQLite journal and is the historian's job. ### Large Event Payloads Keep persisted events small. Do not embed full device state snapshots in every event — persist only the delta (command ID, tag name, value). Large events slow down recovery and bloat the journal. ### Not Handling Recovery Failures If the journal is corrupted or inaccessible, the persistent actor will fail to start. Handle this gracefully: ```csharp protected override void OnRecoveryFailure(Exception reason, object message) { _log.Error(reason, "Recovery failed — starting with empty state"); // Optionally: start with empty state and reconcile from equipment } ``` ### Assuming Exactly-Once Delivery from Persistence Alone Persistence ensures the command is journaled, but it does not guarantee the equipment received it exactly once. After failover, the reconciliation logic (re-read from equipment) is essential to determine whether a pending command was actually applied. ## Configuration Guidance ### SQLite Journal Configuration ```hocon akka.persistence { journal { plugin = "akka.persistence.journal.sql" sql { connection-string = "Data Source=C:\\ProgramData\\SCADA\\persistence.db" provider-name = "Microsoft.Data.Sqlite" auto-initialize = true } } snapshot-store { plugin = "akka.persistence.snapshot-store.sql" sql { connection-string = "Data Source=C:\\ProgramData\\SCADA\\persistence.db" provider-name = "Microsoft.Data.Sqlite" auto-initialize = true } } # Limit concurrent recoveries — with only one persistent actor, this is less critical max-concurrent-recoveries = 2 } ``` ### Recovery Throttling During failover, the singleton replays the journal. Limit recovery throughput to avoid spiking CPU on startup: ```hocon akka.persistence.journal.sql { replay-batch-size = 100 } ``` ## References - Official Documentation: - Configuration Reference: - Persistence Query: - Akka.Persistence.Sql: