All phases (0-8) now have detailed implementation plans with:
- Bullet-level requirement extraction from HighLevelReqs sections
- Design constraint traceability (KDD + Component Design)
- Work packages with acceptance criteria mapped to every requirement
- Split-section ownership verified across phases
- Orphan checks (forward, reverse, negative) all passing
- Codex MCP (gpt-5.4) external verification completed per phase

Total: 7,549 lines across 11 plan documents, ~160 work packages, ~400 requirements traced, ~25 open questions logged for follow-up.
Phase 3A: Runtime Foundation & Persistence Model
Date: 2026-03-16
Status: Draft
Prerequisites: Phase 0 (Solution Skeleton), Phase 1 (Central Platform Foundations — Akka.NET bootstrap via REQ-HOST-6), Phase 2 (Modeling & Validation — deployment package contract defines the serialized format stored in SQLite)
Scope
Goal: Prove the Akka.NET cluster, singleton, and local persistence model work correctly — including failover.
Components:
- Cluster Infrastructure (full) — Akka.NET cluster setup, split-brain resolution, failure detection, graceful shutdown, dual-node recovery.
- Host (site-role Akka bootstrap) — Site nodes use generic `IHost` (no Kestrel), Akka.NET actor system with Remoting, Clustering, Persistence (SQLite), and SBR.
- Site Runtime (partial) — Deployment Manager singleton skeleton, basic Instance Actor (holds attribute state, static attribute persistence). Script Actors, Alarm Actors, stream, and script execution are Phase 3B.
- Local SQLite persistence model — Schema for deployed configurations, static attribute overrides.
Testable Outcome: Two-node site cluster forms. Singleton starts on oldest node. Failover migrates singleton to surviving node. Singleton reads deployed configs from SQLite and recreates Instance Actors. Static attribute overrides persist across restart. `min-nr-of-members = 1` verified. CoordinatedShutdown enables fast handover.
Prerequisites
| Prerequisite | Phase | What's Needed |
|---|---|---|
| Solution structure with all component projects | 0 | ClusterInfrastructure, Host, SiteRuntime, Commons projects exist and compile |
| Commons shared types | 0 | NodeRole enum, Result<T>, UTC timestamp types |
| Commons entity POCOs | 0 | Deployed configuration entity, attribute entity |
| Commons message contracts | 0 | Base message types with correlation IDs |
| Host skeleton with role detection | 0 | Program.cs reads NodeOptions and selects service registration path |
| Host Akka.NET bootstrap | 1 | REQ-HOST-6 baseline: Akka.Hosting, Remoting, Clustering wired |
| Host configuration binding | 1 | ClusterOptions, NodeOptions, DatabaseOptions bound via Options pattern |
| Host CoordinatedShutdown wiring | 1 | REQ-HOST-9 baseline wired into service lifecycle |
| Host structured logging | 1 | Serilog with SiteId/NodeHostname/NodeRole enrichment |
| Deployment package contract | 2 | Stable serialization format for flattened configs that Site Runtime will store in SQLite |
Requirements Checklist
Section 1.1 — Central vs. Site Responsibilities (partial — this phase covers site-side bullets only)
- [1.1-1] Central cluster is the single source of truth for all template authoring, configuration, and deployment decisions.
  - Phase 3A scope: Not directly implemented here (central-side concern), but the site-side design must not include any local authoring capability. Negative requirement: Site has no mechanism to create or edit configurations locally.
- [1.1-2] Site clusters receive flattened configurations — fully resolved attribute sets with no template structure. Sites do not need to understand templates, inheritance, or composition.
  - Phase 3A scope: SQLite schema stores flattened configs. Deployment Manager reads them. Instance Actor loads attributes from them. No template resolution logic on site.
- [1.1-3] Sites do not support local/emergency configuration overrides. All configuration changes originate from central.
  - Phase 3A scope: Negative requirement — no API or mechanism for local configuration changes. Static attribute writes (SetAttribute) are runtime value overrides, not configuration overrides.

Split-section note: Section 1.1 is primarily Phase 3A. All three bullets are addressed here. No bullets deferred to other phases (central-side truth enforcement is implicit in system design — central is the only entity that sends configs).
Section 1.2 — Failover (partial — Phase 3A covers mechanism; Phase 8 covers full-system validation)
- [1.2-1] Failover is managed at the application level using Akka.NET (not Windows Server Failover Clustering).
- [1.2-2] Each cluster (central and site) runs an active/standby pair where Akka.NET manages node roles and failover detection.
- [1.2-3] Site failover: The standby node takes over data collection and script execution seamlessly, including responsibility for the store-and-forward buffers.
  - Phase 3A scope: Singleton migration to standby and Instance Actor recreation proven. Data collection (DCL) seamless takeover is Phase 3B. Script execution resumption is Phase 3B. S&F buffer takeover is Phase 3C. This phase proves the foundation: singleton migrates, deployed configs are read from SQLite, and Instance Actors are recreated — the prerequisite for all higher-level takeover behaviors.
- [1.2-4] The Site Runtime Deployment Manager singleton is restarted on the new active node, which reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy.
  - Phase 3A scope: "Full hierarchy" in this phase means Deployment Manager → Instance Actors. Script Actors and Alarm Actors (the lower levels of the hierarchy) are added in Phase 3B. The recreation pattern established here is extended in Phase 3B to include script compilation and child actor creation.
- [1.2-5] Central failover: The standby node takes over central responsibilities. Deployments that are in-progress during a failover are treated as failed and must be re-initiated by the engineer.
  - Phase 3A scope: Central failover is not tested here (Phase 8). But the cluster infrastructure must support both central and site cluster topologies.

Split-section note: Section 1.2 contains 4 prose bullets. Phase 3A decomposes them into 5 atomic requirements for finer traceability:
- Phase 3A owns: [1.2-1] (app-level failover), [1.2-2] (active/standby pair), [1.2-3] (site failover — singleton migration and Instance Actor recreation foundation), [1.2-4] (singleton restart and hierarchy recreation from SQLite).
- Phase 3B owns: [1.2-3] completion (DCL reconnection, script execution resumption, alarm re-evaluation).
- Phase 3C owns: [1.2-3] completion (S&F buffer takeover).
- Phase 8 owns: [1.2-3] full-system validation, [1.2-5] (central failover end-to-end).
Section 2.3 — Site-Level Storage & Interface (partial)
- [2.3-1] Sites have no user interface — they are headless collectors, forwarders, and script executors.
  - Phase 3A scope: Site-role Host uses generic `IHost` (no Kestrel, no HTTP). Negative requirement.
- [2.3-2] Sites require local storage for: the current deployed (flattened) configurations, deployed scripts, shared scripts, external system definitions, database connection definitions, and notification lists.
  - Phase 3A scope: SQLite schema for deployed flattened configurations. Scripts, shared scripts, external system defs, DB connection defs, and notification lists are stored in later phases (3B, 3C, 7), but the schema should be extensible.
- [2.3-3] Store-and-forward buffers are persisted to a local SQLite database on each node and replicated between nodes via application-level replication.
  - Phase 3A scope: Not implemented here (Phase 3C). But the SQLite infrastructure established here is reused.

Split-section note: [2.3-2] is a compound bullet listing 6 storage categories. Phase ownership by sub-item:
- Phase 3A: deployed (flattened) configurations.
- Phase 3B: deployed scripts, shared scripts (stored when deployments are received and scripts compiled).
- Phase 3C/7: external system definitions, database connection definitions, notification lists (stored when system-wide artifacts are deployed).
- Phase 3C: [2.3-3] store-and-forward buffers.
Design Constraints Checklist
From CLAUDE.md Key Design Decisions
- [KDD-runtime-1] Instance modeled as Akka actor (Instance Actor) — single source of truth for runtime state.
- [KDD-runtime-2] Site Runtime actor hierarchy: Deployment Manager singleton → Instance Actors → Script Actors + Alarm Actors.
  - Phase 3A scope: Deployment Manager → Instance Actors. Script/Alarm Actors are Phase 3B.
- [KDD-runtime-8] Staggered Instance Actor startup on failover to prevent reconnection storms (e.g., 20 at a time with short delay between batches).
- [KDD-runtime-9] Supervision: Resume for coordinator actors, Stop for short-lived execution actors.
  - Phase 3A scope: Deployment Manager supervises Instance Actors with OneForOneStrategy. Instance Actor supervision of Script/Alarm Actors is Phase 3B.
- [KDD-data-5] Static attribute writes persisted to local SQLite (survive restart/failover, reset on redeployment).
- [KDD-data-7] Tell for hot-path internal communication; Ask reserved for system boundaries.
  - Phase 3A scope: Establish the convention. Instance Actor attribute updates use Tell.
- [KDD-cluster-1] Keep-oldest SBR with `down-if-alone = on`, 15s stable-after.
- [KDD-cluster-2] Both nodes are seed nodes. `min-nr-of-members = 1`.
- [KDD-cluster-3] Failure detection: 2s heartbeat, 10s threshold. Total failover ~25s.
- [KDD-cluster-4] CoordinatedShutdown for graceful singleton handover.
- [KDD-cluster-5] Automatic dual-node recovery from persistent storage.
From Component-ClusterInfrastructure.md
- [CD-CI-1] Two-node cluster (active/standby) using Akka.NET Cluster.
- [CD-CI-2] Leader election and role assignment (active vs. standby).
- [CD-CI-3] Cluster singleton hosting for Site Runtime Deployment Manager.
- [CD-CI-4] Cluster seed nodes: both nodes listed; either can start first.
- [CD-CI-5] Cluster role configuration: Central or Site (plus site identifier for site clusters).
- [CD-CI-6] Akka.NET remoting: hostname/port for inter-node communication.
- [CD-CI-7] Local storage paths: SQLite database locations (site nodes only).
- [CD-CI-8] `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`.
- [CD-CI-9] `akka.cluster.run-coordinated-shutdown-when-down = on`.
- [CD-CI-10] Dual-node recovery: no manual intervention required. First node forms cluster, second joins.
- [CD-CI-11] Deployment Manager singleton reads deployed configurations from local SQLite on recovery.
- [CD-CI-12] Alarm states re-evaluated from incoming values on recovery (alarm state is in-memory only).
  - Phase 3A scope: Establish the pattern — no alarm persistence. Alarm Actors are Phase 3B, but the design must not persist alarm state.
- [CD-CI-13] Keep-oldest SBR rationale: with two nodes, quorum-based strategies cause total shutdown. Keep-oldest with `down-if-alone` ensures at most one node runs the singleton.
From Component-SiteRuntime.md
- [CD-SR-1] Deployment Manager is an Akka.NET cluster singleton — guaranteed to run on exactly one node.
- [CD-SR-2] Startup behavior step 1: Read all deployed configurations from local SQLite.
- [CD-SR-3] Startup behavior step 4: Create Instance Actors for all deployed, enabled instances as child actors in staggered batches (e.g., 20 at a time with short delay).
- [CD-SR-4] Instance Actor: single source of truth for all runtime state of a deployed instance.
- [CD-SR-5] Instance Actor initialization: Load all attribute values from flattened configuration (static defaults).
- [CD-SR-6] Instance Actor SetAttribute (static): Updates in-memory value and persists override to local SQLite. On restart/failover, loads persisted overrides on top of deployed config. Redeployment resets all persisted overrides.
- [CD-SR-7] Deployment Manager supervises Instance Actors with OneForOneStrategy — one Instance Actor's failure does not affect others.
- [CD-SR-8] Instance lifecycle: Disable stops actor, retains config in SQLite. Enable re-creates actor. Delete stops actor, removes config from SQLite. Delete does not clear S&F messages.
  - Phase 3A scope: Skeleton lifecycle — disable/enable/delete message handling in Deployment Manager. Full lifecycle with DCL/scripts is Phase 3B/3C.
- [CD-SR-9] When an Instance Actor is stopped (disable, delete, redeployment), Akka.NET automatically stops all child actors.
From Component-Host.md
- [CD-HOST-1] REQ-HOST-6: Site-role Akka bootstrap with Remoting, Clustering, Persistence (SQLite), Split-Brain Resolver.
- [CD-HOST-2] REQ-HOST-7: Site nodes use `Host.CreateDefaultBuilder` — generic `IHost`, not `WebApplication`. No Kestrel, no HTTP port, no web endpoints.
- [CD-HOST-3] REQ-HOST-2: Site-role service registration includes SiteRuntime, DataConnectionLayer, StoreAndForward, SiteEventLogging (only SiteRuntime is wired in this phase; others are stubs).
- [CD-HOST-4] ClusterOptions: SeedNodes, SplitBrainResolverStrategy, StableAfter, HeartbeatInterval, FailureDetectionThreshold, MinNrOfMembers.
- [CD-HOST-5] DatabaseOptions: Site SQLite paths.
Work Packages
WP-1: Akka.NET Cluster Configuration (HOCON/Akka.Hosting)
Description: Implement the full Akka.NET cluster configuration for site nodes using Akka.Hosting, driven by ClusterOptions. This includes Remoting, Clustering, Split-Brain Resolver, failure detection, and CoordinatedShutdown settings.
Acceptance Criteria:
- Cluster configured with keep-oldest SBR, `down-if-alone = on`, 15s stable-after. ([KDD-cluster-1], [CD-CI-13])
- Both nodes configured as seed nodes. Either node can start first. ([KDD-cluster-2], [CD-CI-4])
- `min-nr-of-members = 1` — surviving node can operate alone. ([KDD-cluster-2])
- Failure detection: 2s heartbeat interval, 10s failure threshold. ([KDD-cluster-3])
- Total failover time ~25s (detection + stable-after + singleton restart). ([KDD-cluster-3])
- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`. ([CD-CI-8])
- `akka.cluster.run-coordinated-shutdown-when-down = on`. ([CD-CI-9])
- Remoting configured with hostname/port from NodeOptions. ([CD-CI-6])
- Cluster role set to "site" with SiteId from NodeOptions. Site identifier is included in the cluster role tag for site clusters. ([CD-CI-5])
- All cluster settings driven by ClusterOptions (Options pattern). ([CD-HOST-4])
- Failover is application-level (Akka.NET), not WSFC. ([1.2-1])
Estimated Complexity: M
Requirements Traced: [1.2-1], [1.2-2], [KDD-cluster-1], [KDD-cluster-2], [KDD-cluster-3], [KDD-cluster-4], [CD-CI-1], [CD-CI-2], [CD-CI-4], [CD-CI-5], [CD-CI-6], [CD-CI-8], [CD-CI-9], [CD-CI-13], [CD-HOST-1], [CD-HOST-4]
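Under the assumption that the criteria above are expressed directly in HOCON, the resulting configuration might look as follows. System name, hostnames, and port are placeholders (in this plan they would be bound from ClusterOptions/NodeOptions), and mapping the 10s threshold to `acceptable-heartbeat-pause` is an assumption to confirm during tuning:

```hocon
akka {
  # CoordinatedShutdown hooks ([CD-CI-8], [CD-CI-9])
  coordinated-shutdown.run-by-clr-shutdown-hook = on

  actor.provider = cluster

  remote.dot-netty.tcp {
    hostname = "site-node-a"   # placeholder — bound from NodeOptions
    port = 4053
  }

  cluster {
    roles = ["site", "site-<site-id>"]   # role + site identifier tag (placeholder)
    seed-nodes = [
      "akka.tcp://SiteSystem@site-node-a:4053",
      "akka.tcp://SiteSystem@site-node-b:4053"
    ]
    min-nr-of-members = 1
    run-coordinated-shutdown-when-down = on

    # 2s heartbeat; the 10s failure threshold is mapped here to
    # acceptable-heartbeat-pause — verify against [KDD-cluster-3] timings.
    failure-detector {
      heartbeat-interval = 2s
      acceptable-heartbeat-pause = 10s
    }

    downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-oldest
      stable-after = 15s
      keep-oldest {
        down-if-alone = on
      }
    }
  }
}
```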
WP-2: Site-Role Host Bootstrap
Description: Implement the site-role startup path in Program.cs. Site nodes use generic IHost (no Kestrel), configure Akka.NET with Remoting, Clustering, SQLite Persistence, and SBR. Register the SiteRuntime component services and actors.
Acceptance Criteria:
- Site nodes use `Host.CreateDefaultBuilder` — no `WebApplication`, no Kestrel, no HTTP port. ([CD-HOST-2], [2.3-1])
- Site node cannot accept inbound HTTP connections. ([2.3-1] — negative)
- Akka.NET actor system boots with Remoting, Clustering, SQLite Persistence, and SBR. ([CD-HOST-1])
- SiteRuntime `AddSiteRuntime()` / `AddSiteRuntimeActors()` extension methods called for site role. ([CD-HOST-3])
- SQLite paths read from DatabaseOptions. ([CD-HOST-5])
- Akka.NET Persistence configured with SQLite journal and snapshot store. ([CD-HOST-1])
- Active/standby pair forms when two site nodes start. ([1.2-2], [CD-CI-1])
Estimated Complexity: M
Requirements Traced: [1.2-2], [2.3-1], [CD-HOST-1], [CD-HOST-2], [CD-HOST-3], [CD-HOST-4], [CD-HOST-5], [CD-CI-1]
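A minimal sketch of the site-role startup path, assuming the Akka.Hosting / Akka.Cluster.Hosting packages. Option and extension names below follow those packages, but exact signatures vary by version; hostnames and the system name are placeholders, and SQLite persistence plus SiteRuntime registration are elided:

```csharp
// Sketch only — assumes NuGet packages Akka.Hosting, Akka.Cluster.Hosting.
using Akka.Cluster.Hosting;
using Akka.Cluster.Hosting.SBR;
using Akka.Hosting;
using Akka.Remote.Hosting;
using Microsoft.Extensions.Hosting;

var host = Host.CreateDefaultBuilder(args)   // generic IHost: no Kestrel, no HTTP ([CD-HOST-2])
    .ConfigureServices((context, services) =>
    {
        services.AddAkka("SiteSystem", (builder, provider) =>
        {
            builder
                .WithRemoting(hostname: "site-node-a", port: 4053)   // from NodeOptions in practice
                .WithClustering(new ClusterOptions
                {
                    Roles = new[] { "site" },
                    SeedNodes = new[]
                    {
                        "akka.tcp://SiteSystem@site-node-a:4053",
                        "akka.tcp://SiteSystem@site-node-b:4053"
                    },
                    MinimumNumberOfMembers = 1,
                    SplitBrainResolver = new KeepOldestOption { DownIfAlone = true }
                });
            // Next: SQLite journal/snapshot store, AddSiteRuntime(), AddSiteRuntimeActors().
        });
    })
    .Build();

await host.RunAsync();
```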
WP-3: Local SQLite Persistence Schema
Description: Design and implement the local SQLite schema for site nodes. This phase covers deployed configurations and static attribute overrides. The schema must be extensible for future additions (scripts, shared scripts, S&F buffers, event logs).
Acceptance Criteria:
- `deployed_configurations` table stores flattened configuration blobs keyed by instance unique name. Stores the deployment package format defined in Phase 2. ([1.1-2], [2.3-2], [CD-SR-2])
- `static_attribute_overrides` table stores per-instance, per-attribute runtime value overrides. ([KDD-data-5], [CD-SR-6])
- Schema includes an `enabled` flag per deployed instance to support disable/enable lifecycle. ([CD-SR-8])
- Schema supports efficient lookup: all configs for startup, single config for deployment/lifecycle. ([CD-SR-2])
- SQLite database file created at path from DatabaseOptions. ([CD-HOST-5])
- Schema migration strategy for SQLite (code-first or explicit migration scripts).
- No template structure in site storage — only flattened configs. Schema does not include tables for templates, inheritance relationships, or composition relationships. Stored configs are the deployment package format (flat attribute sets). ([1.1-2])
- No local configuration authoring or editing capability. ([1.1-3] — negative)
Estimated Complexity: M
Requirements Traced: [1.1-2], [1.1-3], [2.3-2], [KDD-data-5], [CD-SR-2], [CD-SR-6], [CD-SR-8], [CD-HOST-5]
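One possible shape for the two Phase 3A tables — table and column names are illustrative, not mandated by this plan:

```sql
-- Sketch: Phase 3A tables only; later phases add script, S&F, and event tables.
CREATE TABLE IF NOT EXISTS deployed_configurations (
    instance_unique_name TEXT    PRIMARY KEY,          -- lookup key for single-config reads
    enabled              INTEGER NOT NULL DEFAULT 1,   -- disable/enable lifecycle flag ([CD-SR-8])
    config_blob          BLOB    NOT NULL,             -- Phase 2 deployment package, flattened
    deployed_at_utc      TEXT    NOT NULL
);

CREATE TABLE IF NOT EXISTS static_attribute_overrides (
    instance_unique_name TEXT NOT NULL,
    attribute_name       TEXT NOT NULL,
    value                TEXT NOT NULL,                 -- serialized runtime value override
    updated_at_utc       TEXT NOT NULL,
    PRIMARY KEY (instance_unique_name, attribute_name)
);
```

With this shape, redeployment would delete an instance's `static_attribute_overrides` rows in the same transaction that replaces its `deployed_configurations` row, satisfying the override-reset rule in [CD-SR-6].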
WP-4: Deployment Manager Singleton
Description: Implement the Deployment Manager as an Akka.NET cluster singleton on site nodes. On startup (or failover recovery), it reads all deployed configurations from SQLite and creates Instance Actors for enabled instances in staggered batches.
Acceptance Criteria:
- Deployment Manager registered as an Akka.NET cluster singleton via `ClusterSingletonManager`. ([CD-CI-3], [CD-SR-1])
- Cluster singleton proxy registered for communication with the singleton. ([CD-CI-3])
- On startup: reads all deployed configurations from local SQLite. ([CD-SR-2], [1.2-4], [CD-CI-11])
- Creates Instance Actors only for enabled instances. ([CD-SR-3])
- Instance Actors created in staggered batches (configurable batch size, e.g., 20, with configurable delay between batches). ([KDD-runtime-8], [CD-SR-3])
- Supervises Instance Actors with OneForOneStrategy — one failure does not affect others. ([CD-SR-7])
- Supervision directive for Instance Actors is Resume (coordinator-level actors retain state across child failures). Verify: an Instance Actor that throws an unhandled exception resumes with its pre-exception state intact. ([KDD-runtime-9])
- Actor hierarchy: Deployment Manager → Instance Actors (children). ([KDD-runtime-2])
- Handles skeleton lifecycle messages: Deploy (store config, create actor), Disable (stop actor, mark disabled), Enable (re-create actor), Delete (stop actor, remove config). ([CD-SR-8])
- Deploy does not include any local authoring — configs come from central only. ([1.1-1], [1.1-3])
- Delete does not clear store-and-forward messages. Implementation: delete logic only removes the `deployed_configurations` and `static_attribute_overrides` rows for the instance. It does not touch any other tables. When S&F tables are added in Phase 3C, this constraint is verified end-to-end. ([CD-SR-8] — negative)
- When an Instance Actor is stopped, Akka.NET automatically stops all child actors. ([CD-SR-9])
Estimated Complexity: L
Requirements Traced: [1.1-1], [1.1-3], [1.2-4], [KDD-runtime-2], [KDD-runtime-8], [KDD-runtime-9], [CD-CI-3], [CD-CI-11], [CD-SR-1], [CD-SR-2], [CD-SR-3], [CD-SR-7], [CD-SR-8], [CD-SR-9]
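The registration and staggered-startup pattern might look as follows — two C# fragments sketched against Akka.Cluster.Tools. `DeploymentManager`, `InstanceActor`, `DeployedConfig`, `CreateBatch`, and `_options` are illustrative names, not defined contracts:

```csharp
// Sketch — assumes Akka.NET with the Akka.Cluster.Tools package.
using Akka.Actor;
using Akka.Cluster.Tools.Singleton;

// Fragment 1: singleton + proxy registration ([CD-CI-3], [CD-SR-1]).
var manager = system.ActorOf(
    ClusterSingletonManager.Props(
        singletonProps: Props.Create(() => new DeploymentManager()),
        terminationMessage: PoisonPill.Instance,
        settings: ClusterSingletonManagerSettings.Create(system).WithRole("site")),
    name: "deployment-manager");

var proxy = system.ActorOf(
    ClusterSingletonProxy.Props(
        singletonManagerPath: "/user/deployment-manager",
        settings: ClusterSingletonProxySettings.Create(system).WithRole("site")),
    name: "deployment-manager-proxy");

// Fragment 2: inside DeploymentManager — staggered child creation ([KDD-runtime-8]).
// On CreateBatch, create up to BatchSize children, then self-schedule the remainder.
private void OnCreateBatch(Queue<DeployedConfig> pending)
{
    for (var i = 0; i < _options.BatchSize && pending.Count > 0; i++)
    {
        var config = pending.Dequeue();
        Context.ActorOf(Props.Create(() => new InstanceActor(config)), config.InstanceUniqueName);
    }

    if (pending.Count > 0)
        Context.System.Scheduler.ScheduleTellOnce(
            _options.BatchDelay, Self, new CreateBatch(pending), Self);
}
```

Self-scheduling the next batch (rather than blocking in a loop) keeps the singleton responsive to lifecycle messages while the hierarchy is still being rebuilt.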
WP-5: Instance Actor Skeleton
Description: Implement the basic Instance Actor that holds attribute state from the flattened configuration. In this phase, it loads attributes, supports GetAttribute/SetAttribute for static attributes (with SQLite persistence), and establishes the single-source-of-truth pattern. DCL integration, Script/Alarm Actors, and stream publishing are Phase 3B.
Acceptance Criteria:
- Instance Actor is the single source of truth for all runtime state of a deployed instance. ([KDD-runtime-1], [CD-SR-4])
- On initialization: loads all attribute values from flattened configuration (static defaults). ([CD-SR-5])
- On initialization: loads persisted static attribute overrides from SQLite and applies them on top of deployed config defaults. ([CD-SR-6], [KDD-data-5])
- GetAttribute returns current in-memory value for requested attribute. ([CD-SR-5])
- SetAttribute for static attributes: updates in-memory value and persists override to local SQLite. ([CD-SR-6], [KDD-data-5])
- Static attribute overrides survive restart and failover. ([KDD-data-5])
- Static attribute overrides are reset when the instance is redeployed (new deployment clears previous overrides). ([CD-SR-6], [KDD-data-5])
- Internal communication uses Tell pattern for attribute updates. ([KDD-data-7])
- Alarm state is not persisted — design explicitly excludes alarm persistence. ([CD-CI-12] — negative)
Estimated Complexity: M
Requirements Traced: [KDD-runtime-1], [KDD-data-5], [KDD-data-7], [CD-CI-12], [CD-SR-4], [CD-SR-5], [CD-SR-6]
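A sketch of the Instance Actor's initialization order and attribute handling. The message records, `DeployedConfig`, and `IStaticOverrideStore` are hypothetical stand-ins for the Commons contracts and the SQLite access layer:

```csharp
// Sketch — assumes Akka.NET; DeployedConfig/IStaticOverrideStore are illustrative.
using System.Collections.Generic;
using Akka.Actor;

public sealed record GetAttribute(string Name);
public sealed record SetAttribute(string Name, string Value);

public sealed class InstanceActor : ReceiveActor
{
    private readonly Dictionary<string, string> _attributes;
    private readonly IStaticOverrideStore _overrides;
    private readonly string _instanceName;

    public InstanceActor(DeployedConfig config, IStaticOverrideStore overrides)
    {
        _instanceName = config.InstanceUniqueName;
        _overrides = overrides;

        // 1. Deployed defaults from the flattened configuration ([CD-SR-5])...
        _attributes = new Dictionary<string, string>(config.StaticDefaults);

        // 2. ...then persisted overrides applied on top ([CD-SR-6], [KDD-data-5]).
        foreach (var (name, value) in _overrides.LoadFor(_instanceName))
            _attributes[name] = value;

        // Tell-based access ([KDD-data-7]); alarm state is never loaded or saved ([CD-CI-12]).
        Receive<GetAttribute>(msg => Sender.Tell(_attributes[msg.Name]));
        Receive<SetAttribute>(msg =>
        {
            _attributes[msg.Name] = msg.Value;                    // in-memory first
            _overrides.Save(_instanceName, msg.Name, msg.Value);  // then persist override
        });
    }
}
```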
WP-6: CoordinatedShutdown & Graceful Singleton Handover
Description: Implement and verify CoordinatedShutdown for graceful singleton handover. When a site node is stopped (service stop, Ctrl+C), the Deployment Manager singleton hands over to the other node in seconds rather than waiting for full failure detection timeout.
Acceptance Criteria:
- CoordinatedShutdown triggers graceful leave from cluster on service stop. ([KDD-cluster-4], [CD-CI-8], [CD-CI-9])
- Graceful shutdown enables singleton handover in seconds (hand-over retry interval), not ~25s. ([KDD-cluster-4])
- Host does not call `Environment.Exit()` or forcibly terminate the actor system without coordinated shutdown. (REQ-HOST-9)
- CoordinatedShutdown wired into Windows Service lifecycle. ([CD-CI-8])
Estimated Complexity: S
Requirements Traced: [KDD-cluster-4], [CD-CI-8], [CD-CI-9]
WP-7: Dual-Node Recovery
Description: Implement and verify automatic recovery when both nodes in a site cluster fail simultaneously (e.g., power outage). Whichever node starts first forms a new cluster; the second joins. No manual intervention required.
Acceptance Criteria:
- Both nodes are seed nodes — either can start first and form a cluster. ([KDD-cluster-2], [CD-CI-4], [CD-CI-10])
- `min-nr-of-members = 1` allows single surviving node to operate. ([KDD-cluster-2])
- First node starts, forms cluster, Deployment Manager singleton starts and rebuilds from SQLite. ([CD-CI-10], [CD-CI-11], [KDD-cluster-5])
- Second node joins the cluster as standby. ([CD-CI-10])
- No manual intervention required for recovery. ([CD-CI-10])
Estimated Complexity: S
Requirements Traced: [KDD-cluster-2], [KDD-cluster-5], [CD-CI-4], [CD-CI-10], [CD-CI-11]
WP-8: Failover Acceptance Tests
Description: Comprehensive integration/acceptance tests proving failover, recovery, and persistence semantics. These are the primary verification gate for Phase 3A.
Acceptance Criteria:
Test: Active node crash → singleton migration
- Kill the active node process. Standby detects failure within ~25s. ([KDD-cluster-3])
- Deployment Manager singleton restarts on surviving node. ([1.2-4], [CD-CI-11])
- Singleton reads deployed configs from SQLite and recreates Instance Actors. ([1.2-4], [CD-SR-2])
- Instance Actors have correct attribute state (deployed defaults + persisted overrides). ([CD-SR-5], [CD-SR-6])
Test: Graceful shutdown → fast singleton handover
- Stop the active node gracefully (service stop). ([KDD-cluster-4])
- Singleton hands over to standby in seconds (faster than crash scenario). ([KDD-cluster-4])
- Instance Actors recreated on new active node. ([1.2-4])
Test: Both nodes down → first up forms cluster, rebuilds from SQLite
- Both nodes stopped. ([CD-CI-10])
- First node starts and forms cluster alone (seed node + `min-nr-of-members = 1`). ([KDD-cluster-2], [CD-CI-10])
- Deployment Manager singleton starts and rebuilds Instance Actor hierarchy from SQLite. ([KDD-cluster-5], [CD-CI-11])
- Second node starts and joins as standby. ([CD-CI-10])
Test: Static attribute overrides survive failover
- Set a static attribute via SetAttribute on Instance Actor. ([KDD-data-5])
- Kill active node. Wait for failover. ([KDD-cluster-3])
- On new active node, Instance Actor loads with persisted override value. ([KDD-data-5], [CD-SR-6])
Test: Static attribute overrides reset on redeployment
- Set a static attribute override. ([KDD-data-5])
- Redeploy the instance (send new flattened config). ([CD-SR-6])
- Instance Actor loads with new deployed defaults; persisted override is cleared. ([CD-SR-6])
Test: Staggered Instance Actor startup
- Deploy many instances (e.g., 50+). ([KDD-runtime-8])
- On startup/failover, verify Instance Actors are created in batches (default batch size configurable, e.g., 20) with observable delays between batches. ([KDD-runtime-8], [CD-SR-3])
- Verify batch size and delay are configurable via options. ([CD-SR-3])
Test: Instance lifecycle (disable/enable/delete)
- Disable an instance: actor stopped, config retained in SQLite, not recreated on restart. ([CD-SR-8])
- Enable a disabled instance: actor re-created from stored config. ([CD-SR-8])
- Delete an instance: actor stopped, config removed from SQLite. ([CD-SR-8])
Test: Singleton on single node
- Start only one node. Cluster forms. Singleton starts. Instances created. ([KDD-cluster-2])
- Confirms `min-nr-of-members = 1` works correctly. ([KDD-cluster-2])
Test: Negative — no local configuration authoring or overrides
- Verify the Deployment Manager accepts configuration only via deployment messages (the central-to-site message path). No public method, message type, or API endpoint exists for local config creation or modification. ([1.1-1], [1.1-3])
- Verify that the only way to modify deployed config structure is via a new deployment from central. Static attribute SetAttribute modifies runtime values only, not config structure (attributes, scripts, alarms). ([1.1-3])
Test: Negative — site nodes are headless (no UI, no inbound HTTP)
- Verify site node process does not bind any TCP listener on HTTP/HTTPS ports. Scan for open ports after startup. ([2.3-1])
- Verify site-role Host uses `Host.CreateDefaultBuilder` (not `WebApplication.CreateBuilder`). ([2.3-1])
Test: Negative — no alarm state persistence
- Verify no alarm state table or column exists in the SQLite schema. ([CD-CI-12])
- Verify Instance Actor initialization does not attempt to load alarm state from any persistent store. ([CD-CI-12])
Test: Supervision — OneForOneStrategy isolation and Resume directive
- One Instance Actor throws an unhandled exception. Other Instance Actors continue processing unaffected. ([CD-SR-7])
- The failing Instance Actor resumes with its pre-exception in-memory state intact (Resume directive). ([KDD-runtime-9])
Estimated Complexity: L
Requirements Traced: [1.1-1], [1.1-3], [1.2-4], [2.3-1], [KDD-runtime-8], [KDD-data-5], [KDD-cluster-2], [KDD-cluster-3], [KDD-cluster-4], [KDD-cluster-5], [CD-CI-10], [CD-CI-11], [CD-CI-12], [CD-CI-13], [CD-SR-2], [CD-SR-3], [CD-SR-5], [CD-SR-6], [CD-SR-7], [CD-SR-8]
Test Strategy
Unit Tests
| Area | Tests |
|---|---|
| SQLite schema | Table creation, CRUD operations for deployed configs and attribute overrides |
| Deployment Manager | Startup reads configs, creates correct number of Instance Actors, staggering logic |
| Instance Actor | Attribute loading from flattened config, static override load/save, override reset on redeploy |
| Cluster config | HOCON/Akka.Hosting configuration generates correct settings (SBR, seed nodes, timings) |
| Lifecycle commands | Deploy/Disable/Enable/Delete state transitions, SQLite side effects |
Integration Tests
| Area | Tests |
|---|---|
| Two-node cluster formation | Two processes join cluster, leader elected, singleton starts on oldest |
| Singleton migration on crash | Kill active process, singleton restarts on standby |
| Graceful handover | CoordinatedShutdown on active, measure handover time |
| Dual-node recovery | Both down, first up forms cluster, second joins |
| Static attribute persistence | Set override, restart, verify value survives |
| Staggered startup | Deploy 50+ instances, verify batch creation with timing |
Negative Tests
| Requirement | Test |
|---|---|
| [1.1-1] No local authoring | Verify Deployment Manager only accepts configs via deployment messages from central. No local creation path exists. |
| [1.1-3] No local overrides | Verify no mechanism to modify deployed config structure locally. SetAttribute modifies runtime values only, not config structure. |
| [2.3-1] No HTTP on site | Verify no TCP listener on any HTTP port. Verify Host.CreateDefaultBuilder used (not WebApplication). |
| [CD-CI-12] No alarm persistence | Verify no alarm state table/column in SQLite. Verify Instance Actor does not load alarm state from storage. |
| [CD-SR-8] Delete does not clear S&F | Verify delete only removes deployed_configurations and static_attribute_overrides rows. Does not touch other tables. |
| [KDD-runtime-9] Resume directive | Verify Instance Actor resumes with intact state after unhandled exception (not restarted from scratch). |
Failover Tests
See WP-8 for complete failover test scenarios.
Verification Gate
Phase 3A is complete when all of the following pass:
- Two-node site cluster forms reliably with keep-oldest SBR.
- Deployment Manager singleton starts on oldest node and creates Instance Actors from SQLite.
- Instance Actors hold correct attribute state (deployed defaults + persisted overrides).
- Active node crash triggers failover: singleton migrates, Instance Actors recreated within ~25s.
- Graceful shutdown triggers fast handover (seconds, not ~25s).
- Both-nodes-down recovery works with no manual intervention.
- `min-nr-of-members = 1` allows single-node operation.
- Static attribute overrides persist across restart and failover.
- Static attribute overrides reset on redeployment.
- Instance lifecycle (disable/enable/delete) works correctly.
- Staggered Instance Actor startup is observable.
- All negative tests pass (no HTTP on site, no local authoring, no alarm persistence).
- All unit and integration tests pass.
Open Questions
| # | Question | Context | Impact | Status |
|---|---|---|---|---|
| Q-P3A-1 | What is the optimal batch size and delay for staggered Instance Actor startup? | Component-SiteRuntime.md suggests 20 with a "short delay." Actual values depend on OPC UA server capacity. | Performance tuning. Default to 20/100ms, make configurable. | Deferred — tune during Phase 3B when DCL is integrated. |
| Q-P3A-2 | Should the SQLite schema use a single database file or separate files per concern (configs, overrides, S&F, events)? | Single file is simpler. Separate files isolate concerns and allow independent backup/maintenance. | Schema design. | Recommend single file with separate tables. Simpler transaction management. Final decision during implementation. |
| Q-P3A-3 | Should Akka.Persistence (event sourcing / snapshotting) be used for the Deployment Manager singleton, or is direct SQLite access sufficient? | Akka.Persistence adds complexity (journal, snapshots) but provides built-in recovery. Direct SQLite is simpler for this use case (singleton reads all configs on startup). | Architecture. | Recommend direct SQLite — Deployment Manager recovery is a full read-all-configs-and-rebuild pattern, not event replay. Akka.Persistence is overkill here. |
Orphan Check Result
Forward Check (Requirements → Work Packages)
Every item in the Requirements Checklist and Design Constraints Checklist is verified against work packages:
| Requirement/Constraint | Work Package(s) | Verified |
|---|---|---|
| [1.1-1] Central is single source of truth | WP-4 (negative: no local authoring), WP-8 (negative test) | Yes |
| [1.1-2] Sites receive flattened configs | WP-3 (schema), WP-4 (reads configs), WP-5 (loads attributes) | Yes |
| [1.1-3] Sites do not support local overrides | WP-3 (negative), WP-4 (negative), WP-8 (negative test) | Yes |
| [1.2-1] Failover at application level (Akka.NET) | WP-1 (config) | Yes |
| [1.2-2] Active/standby pair | WP-1, WP-2 | Yes |
| [1.2-3] Site failover: standby takes over | WP-8 (failover tests) | Yes |
| [1.2-4] Singleton restarts, reads SQLite, recreates hierarchy | WP-4, WP-8 | Yes |
| [1.2-5] Central failover (Phase 8) | Out of scope — noted in split-section | Yes |
| [2.3-1] Sites are headless | WP-2 (no Kestrel), WP-8 (negative test) | Yes |
| [2.3-2] Local storage for deployed configs | WP-3 (schema) | Yes |
| [2.3-3] S&F buffers (Phase 3C) | Out of scope — noted in split-section | Yes |
| [KDD-runtime-1] Instance as Akka actor | WP-5 | Yes |
| [KDD-runtime-2] Actor hierarchy | WP-4, WP-5 | Yes |
| [KDD-runtime-8] Staggered startup | WP-4, WP-8 | Yes |
| [KDD-runtime-9] Supervision strategies | WP-4 | Yes |
| [KDD-data-5] Static attribute SQLite persistence | WP-3, WP-5, WP-8 | Yes |
| [KDD-data-7] Tell for hot-path | WP-5 | Yes |
| [KDD-cluster-1] Keep-oldest SBR | WP-1 | Yes |
| [KDD-cluster-2] Both seed nodes, min-nr=1 | WP-1, WP-7, WP-8 | Yes |
| [KDD-cluster-3] Failure detection timing | WP-1, WP-8 | Yes |
| [KDD-cluster-4] CoordinatedShutdown | WP-1, WP-6, WP-8 | Yes |
| [KDD-cluster-5] Dual-node recovery | WP-7, WP-8 | Yes |
| [CD-CI-1] through [CD-CI-13] | WP-1, WP-2, WP-4, WP-6, WP-7, WP-8 | Yes |
| [CD-SR-1] through [CD-SR-9] | WP-3, WP-4, WP-5, WP-8 | Yes |
| [CD-HOST-1] through [CD-HOST-5] | WP-1, WP-2, WP-3 | Yes |
Result: All checklist items map to at least one work package. No orphans.
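The cluster constraints traced above (keep-oldest SBR, `min-nr-of-members = 1`, both seed nodes) map to a HOCON fragment along these lines. This is a sketch, not the authoritative configuration: the actor system name, host names, and port are placeholders, and the real values come from WP-1.

```hocon
akka {
  cluster {
    # Placeholder addresses — both nodes of the site pair are listed as seeds.
    seed-nodes = [
      "akka.tcp://SiteSystem@site-node-a:4053",
      "akka.tcp://SiteSystem@site-node-b:4053"
    ]
    min-nr-of-members = 1    # KDD-cluster-2: a lone node may still form the cluster
    downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-oldest    # KDD-cluster-1
    }
  }
}
```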
Reverse Check (Work Packages → Requirements)
| Work Package | Requirements Traced | Verified |
|---|---|---|
| WP-1 | `[1.2-1]`, `[1.2-2]`, `[KDD-cluster-1–4]`, `[CD-CI-1,2,4,5,6,8,9,13]`, `[CD-HOST-1,4]` | Yes |
| WP-2 | `[1.2-2]`, `[2.3-1]`, `[CD-HOST-1,2,3,4,5]`, `[CD-CI-1]` | Yes |
| WP-3 | `[1.1-2]`, `[1.1-3]`, `[2.3-2]`, `[KDD-data-5]`, `[CD-SR-2,6,8]`, `[CD-HOST-5]` | Yes |
| WP-4 | `[1.1-1]`, `[1.1-3]`, `[1.2-4]`, `[KDD-runtime-2,8,9]`, `[CD-CI-3,11]`, `[CD-SR-1,2,3,7,8,9]` | Yes |
| WP-5 | `[KDD-runtime-1]`, `[KDD-data-5,7]`, `[CD-CI-12]`, `[CD-SR-4,5,6]` | Yes |
| WP-6 | `[KDD-cluster-4]`, `[CD-CI-8,9]` | Yes |
| WP-7 | `[KDD-cluster-2,5]`, `[CD-CI-4,10,11]` | Yes |
| WP-8 | All requirements verified via tests | Yes |
Result: Every work package traces to at least one requirement or constraint. No untraceable work.
Split-Section Check
| Section | This Phase Covers | Other Phase Covers | Gap |
|---|---|---|---|
| 1.1 | `[1.1-1]`, `[1.1-2]`, `[1.1-3]` (all bullets) | — | None |
| 1.2 | `[1.2-1]`, `[1.2-2]`, `[1.2-3]` (singleton portion), `[1.2-4]` | Phase 8: `[1.2-3]` (full DCL/S&F), `[1.2-5]` (central) | None |
| 2.3 | `[2.3-1]`, `[2.3-2]` (deployed configs) | Phase 3B/3C: `[2.3-2]` (scripts, artifacts), `[2.3-3]` (S&F) | None |
Result: No gaps in split-section coverage.
Negative Requirement Check
| Negative Requirement | Acceptance Criterion | Sufficient |
|---|---|---|
| `[1.1-1]` No local authoring | WP-4: Configs accepted only via deployment messages. WP-8: Verify no local creation path exists. | Yes |
| `[1.1-3]` No local overrides | WP-3: No config structure modification API. WP-8: Verify SetAttribute modifies runtime values only, not config structure. | Yes |
| `[2.3-1]` No HTTP on site | WP-2: `Host.CreateDefaultBuilder` (no Kestrel). WP-8: Port scan + builder type verification. | Yes |
| `[CD-CI-12]` No alarm persistence | WP-5: No alarm state in SQLite. WP-8: Verify no alarm table/column and no alarm state load on init. | Yes |
| `[CD-SR-8]` Delete does not clear S&F | WP-4: Delete only removes `deployed_configurations` and `static_attribute_overrides` rows. End-to-end S&F verification in Phase 3C. | Yes |
| `[KDD-runtime-9]` Resume directive | WP-4: Resume directive on Instance Actors. WP-8: Verify Instance Actor retains state after exception. | Yes |
Result: All negative requirements have explicit behavioral acceptance criteria. Criteria strengthened after Codex review.
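As one concrete shape for the WP-8 "no HTTP on site" port scan, a test harness could probe the common HTTP ports on a site node. The port list and helper names below are illustrative assumptions; the real test would also assert on the Host builder type from the .NET side, as noted above.

```python
import socket

def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def assert_headless(host: str, http_ports: tuple[int, ...] = (80, 443, 5000)) -> None:
    """Fail if any of the (assumed) HTTP ports is listening on a site node."""
    for port in http_ports:
        if tcp_port_open(host, port):
            raise AssertionError(f"site node {host} is listening on HTTP port {port}")
```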
Codex MCP Verification
Model: gpt-5.4 Date: 2026-03-16 Result: Pass with corrections applied.
Step 1 — Requirements Coverage Review
Codex identified 15 findings. Disposition:
| # | Finding | Disposition |
|---|---|---|
| 1 | `[1.2-3]` not fully covered (DCL, scripts, S&F) | Acknowledged — by design. Phase 3A covers singleton migration foundation only. Split-section notes updated to explicitly list Phase 3B (DCL, scripts) and Phase 3C (S&F) ownership. |
| 2 | "Full hierarchy" means Script/Alarm Actors too | Acknowledged — clarified. Added scope note to `[1.2-4]` explaining "full hierarchy" in this phase means DM → Instance Actors; Script/Alarm Actors added in Phase 3B. |
| 3 | `[2.3-2]` missing deployed scripts, ext sys defs, etc. | Acknowledged — by design. Split-section note updated to list all 6 storage categories with per-phase ownership. |
| 4 | `[2.3-3]` S&F not covered | Acknowledged — by design. Explicitly deferred to Phase 3C in split-section. |
| 5 | `[1.2-5]` central failover not covered | Acknowledged — by design. Deferred to Phase 8 per phase definition. |
| 6 | REQ-HOST-2 only partially covered (missing DCL, S&F, SiteEventLogging registration) | Acknowledged. [CD-HOST-3] already notes "only SiteRuntime is wired in this phase; others are stubs." Stub registrations are sufficient for Phase 3A. |
| 7 | Script compilation not covered in startup | Acknowledged — by design. Script compilation is Phase 3B. Startup step 3 (compile scripts) is deferred. |
| 8 | Site identifier not explicit in cluster role | Corrected. WP-1 acceptance criterion updated to include site identifier. |
| 9 | Database path configuration not explicit | Dismissed. Already covered by [CD-HOST-5] in WP-2 and WP-3 acceptance criteria. |
| 10 | Supervision strategy (Resume) not verified | Corrected. WP-4 and WP-8 updated with explicit Resume directive verification. |
| 11 | Ask boundary rule not tested | Dismissed. [KDD-data-7] in Phase 3A scope note says "Establish the convention." Ask pattern usage is Phase 3B (CallScript). WP-5 verifies Tell usage. |
| 12 | Delete S&F negative not verifiable yet | Corrected. WP-4 criterion strengthened to specify delete only removes config/override rows. End-to-end S&F verification deferred to Phase 3C. |
| 13 | Alarm re-evaluation not tested | Dismissed. Alarm Actors are Phase 3B. [CD-CI-12] in Phase 3A scope is limited to "no alarm persistence." Re-evaluation is Phase 3B concern. |
| 14 | Flattened config not explicitly verified | Corrected. WP-3 criterion strengthened to verify schema has no template/inheritance/composition tables. |
| 15 | `[1.1-3]` negative test too narrow | Corrected. WP-8 negative test expanded to verify SetAttribute modifies runtime values only, not config structure. |
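The strengthened criterion from finding 12 — delete removes only the config and override rows — can be illustrated with a small sketch. The `store_and_forward` table here is a hypothetical stand-in for the Phase 3C buffer, and the column layout is an assumption; the point is that the delete statements never touch it.

```python
import sqlite3

# Hypothetical minimal schema; store_and_forward stands in for the
# Phase 3C buffer table and must survive a deployment delete.
SETUP = """
CREATE TABLE deployed_configurations (instance_id TEXT PRIMARY KEY, config_json TEXT);
CREATE TABLE static_attribute_overrides (instance_id TEXT, attribute TEXT, value_json TEXT);
CREATE TABLE store_and_forward (id INTEGER PRIMARY KEY, instance_id TEXT, payload TEXT);
"""

def delete_deployment(conn: sqlite3.Connection, instance_id: str) -> None:
    """Remove the deployed config and its overrides; never touch S&F rows."""
    with conn:  # one transaction covers both deletes
        conn.execute("DELETE FROM static_attribute_overrides WHERE instance_id = ?",
                     (instance_id,))
        conn.execute("DELETE FROM deployed_configurations WHERE instance_id = ?",
                     (instance_id,))
```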
Step 2 — Negative Requirement Review
Codex flagged all 5 negative requirements as weak. Disposition:
| # | Finding | Disposition |
|---|---|---|
| 1 | `[1.1-1]` only checks one mechanism | Corrected. Test expanded to verify Deployment Manager only accepts configs via deployment messages. |
| 2 | `[1.1-3]` misses override paths | Corrected. Test expanded to verify SetAttribute modifies runtime values only. |
| 3 | `[2.3-1]` "headless" vs "no HTTP" misaligned | Partially corrected. Added Host builder type verification. The "no HTTP" test is the practical enforcement of "headless" — site nodes have no web framework loaded. |
| 4 | `[CD-CI-12]` schema check insufficient | Corrected. Added verification that Instance Actor does not attempt to load alarm state from storage. |
| 5 | `[CD-SR-8]` delete S&F just restated | Corrected. Specified delete only removes config/override rows. |
Step 3 — Split-Section Gap Review
Codex found:
- `[1.2-3]` double-assigned (Phase 3A and Phase 8): Intentional — Phase 3A proves the foundation, Phase 8 validates the full behavior. Different aspects.
- `[2.3-2]` triple-assigned: Intentional — compound bullet decomposed into sub-items across phases. Split-section note now lists all 6 storage categories with explicit per-phase ownership.
- `[1.2-5]` numbering concern: Clarified. Section 1.2 has 4 prose bullets, but bullet 3 contains two distinct requirements (site failover mechanics + singleton restart behavior), hence 5 atomic IDs. Split-section note updated.