# Phase 3C: Deployment Pipeline & Store-and-Forward

**Date**: 2026-03-16
**Status**: Draft
**Prerequisites**: Phase 2 (Template Engine, deployment package contract), Phase 3A (Cluster Infrastructure, Site Runtime skeleton, local SQLite persistence), Phase 3B (Communication Layer, Site Runtime full actor hierarchy, Health Monitoring)

---

## Scope

**Goal**: Complete the deploy-to-site pipeline end-to-end with resilience.

**Components**:

- **Deployment Manager** (full) — Central-side deployment orchestration, instance lifecycle, system-wide artifact deployment
- **Store-and-Forward Engine** (full) — Site-side message buffering, retry, parking, replication, parked message management

**Testable Outcome**: Central validates, flattens, and deploys an instance to a site. The site compiles scripts, creates actors, and reports success. The deployment ID ensures idempotency. The per-instance operation lock works. Instance lifecycle commands (disable, enable, delete) work. Store-and-forward buffers messages on transient failure, retries them at a fixed interval, and parks them when retries are exhausted. Buffer operations replicate asynchronously to the standby node. Parked messages are queryable from central.
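The instance lifecycle behavior summarized in the testable outcome follows a fixed state transition matrix, specified under `[CD-DM-7]` in the design constraints below. A minimal Python sketch of that matrix — illustrative only; the real implementation is C# in the central Deployment Manager, and the names here are hypothetical:

```python
# Allowed mutating operations per instance state, per [CD-DM-7].
# "deploy" on a disabled instance re-enables it on apply.
ALLOWED_OPS = {
    "enabled":      {"deploy", "disable", "delete"},
    "disabled":     {"deploy", "enable", "delete"},
    "not-deployed": {"deploy"},
}

def check_operation(state: str, operation: str) -> None:
    """Reject invalid transitions with a clear error message (see WP-4)."""
    if operation not in ALLOWED_OPS[state]:
        raise ValueError(
            f"operation '{operation}' is not allowed while instance is '{state}'"
        )
```

Keeping the matrix as data rather than branching logic makes the "invalid transitions return clear error messages" acceptance criterion trivial to test exhaustively.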
---

## Prerequisites

| Prerequisite | Phase | What Must Be Complete |
|---|---|---|
| Template Engine | 2 | Flattening, validation pipeline, revision hash generation, diff calculation, deployment package contract |
| Configuration Database | 1, 2 | Schema, repositories (IDeploymentManagerRepository), IAuditService, optimistic concurrency support |
| Cluster Infrastructure | 3A | Akka.NET cluster with SBR, failover, CoordinatedShutdown |
| Site Runtime | 3A, 3B | Deployment Manager singleton, Instance Actor hierarchy, script compilation, alarm actors, full actor lifecycle |
| Communication Layer | 3B | All 8 message patterns (deployment, lifecycle, artifact deployment, remote queries), correlation IDs, timeouts |
| Health Monitoring | 3B | Metric collection framework (S&F buffer depth will be added as a new metric) |
| Site Event Logging | 3B | Event recording to SQLite (S&F activity events will be added) |
| Security & Auth | 1 | Deployment role with optional site scoping |

---

## Requirements Checklist

Each bullet is extracted from the referenced docs/requirements/HighLevelReqs.md sections. Items marked with a phase note indicate split-section bullets owned by another phase.

### Section 1.3 — Store-and-Forward Persistence (Site Clusters Only)

- `[1.3-1]` Store-and-forward applies only at site clusters — central does not buffer messages.
- `[1.3-2]` All site-level S&F buffers (external system calls, notifications, cached database writes) are replicated between the two site cluster nodes using application-level replication over Akka.NET remoting.
- `[1.3-3]` The active node persists buffered messages to a local SQLite database and forwards them to the standby node, which maintains its own local SQLite copy.
- `[1.3-4]` On failover, the standby node already has a replicated copy and takes over delivery seamlessly.
- `[1.3-5]` Successfully delivered messages are removed from both nodes' local stores.
- `[1.3-6]` There is no maximum buffer size — messages accumulate until they either succeed or exhaust retries and are parked.
- `[1.3-7]` Retry intervals are fixed (not exponential backoff).

### Section 1.4 — Deployment Behavior

- `[1.4-1]` When central deploys a new configuration to a site instance, the site applies it immediately upon receipt — no local operator confirmation required. *(Phase 3C)*
- `[1.4-2]` If a site loses connectivity to central, it continues operating with its last received deployed configuration. *(Phase 3C — verified via resilience tests)*
- `[1.4-3]` The site reports back to central whether deployment was successfully applied. *(Phase 3C)*
- `[1.4-4]` Pre-deployment validation: before any deployment is sent to a site, the central cluster performs comprehensive validation, including flattening, test-compiling scripts, verifying alarm trigger references, verifying script trigger references, and checking data connection binding completeness. *(Phase 3C — orchestration; validation pipeline built in Phase 2)*

**Split-section note**: Section 1.4 is fully covered by Phase 3C (backend pipeline). Phase 6 covers the UI for deployment workflows (diff view, deploy button, status tracking display).

### Section 1.5 — System-Wide Artifact Deployment

- `[1.5-1]` Changes to shared scripts, external system definitions, database connection definitions, and notification lists are not automatically propagated to sites.
- `[1.5-2]` Deployment of system-wide artifacts requires explicit action by a user with the Deployment role.
- `[1.5-3]` The Design role manages the definitions; the Deployment role triggers deployment to sites. A user may hold both roles.

**Split-section note**: Phase 3C covers the backend pipeline for artifact deployment. Phase 6 covers the UI for triggering and monitoring artifact deployment.

### Section 3.8.1 — Instance Lifecycle (Phase 3C portion)

- `[3.8.1-1]` Instances can be in one of two states: enabled or disabled.
- `[3.8.1-2]` Enabled: the instance is active — data subscriptions, script triggers, and alarm evaluation are all running.
- `[3.8.1-3]` Disabled: the site stops script triggers, data subscriptions (no live data collection), and alarm evaluation. Deployed configuration is retained so the instance can be re-enabled without redeployment.
- `[3.8.1-4]` Disabled: store-and-forward messages for a disabled instance continue to drain (deliver pending messages).
- `[3.8.1-5]` Deletion removes the running configuration from the site, stops subscriptions, and destroys the Instance Actor and its children.
- `[3.8.1-6]` Store-and-forward messages are not cleared on deletion — they continue to be delivered or can be managed via parked message management.
- `[3.8.1-7]` If the site is unreachable when a delete is triggered, the deletion fails. Central does not mark it as deleted until the site confirms.
- `[3.8.1-8]` Templates cannot be deleted if any instances or child templates reference them.

**Split-section note**: Phase 3C covers the backend for lifecycle commands. Phase 4 covers the UI for disable/enable/delete actions.

### Section 3.9 — Template Deployment & Change Propagation (Phase 3C portion)

- `[3.9-1]` Template changes are not automatically propagated to deployed instances.
- `[3.9-2]` The system maintains two views: deployed configuration (currently running) and template-derived configuration (what it would look like if deployed now).
- `[3.9-3]` Deployment is performed at the individual instance level — an engineer explicitly commands the system to update a specific instance.
- `[3.9-4]` The system must show differences between deployed and template-derived configuration.
- `[3.9-5]` No rollback support is required. The system tracks only the current deployed state, not history.
- `[3.9-6]` Concurrent editing uses a last-write-wins model. No pessimistic locking or optimistic concurrency conflict detection on templates.
**Split-section note**: Phase 3C covers `[3.9-1]`, `[3.9-2]` (backend maintenance of the two views), `[3.9-3]` (backend deployment pipeline), `[3.9-5]` (no rollback), and `[3.9-6]` (last-write-wins — already in place from Phase 2). Phase 6 covers `[3.9-4]` (diff view UI) and the deployment trigger UI. The diff calculation itself is built in Phase 2; Phase 3C uses it and stores the deployed configuration snapshot that enables diff comparison.

### Section 5.3 — Store-and-Forward for External Calls (Phase 3C portion: engine)

- `[5.3-1]` If an external system is unavailable when a script invokes a method, the message is buffered locally at the site.
- `[5.3-2]` Retry is performed per message — individual failed messages retry independently.
- `[5.3-3]` Each external system definition includes configurable retry settings: max retry count and time between retries (fixed interval, no exponential backoff).
- `[5.3-4]` After max retries are exhausted, the message is parked (dead-lettered) for manual review.
- `[5.3-5]` There is no maximum buffer size — messages accumulate until delivery succeeds or retries are exhausted.

**Split-section note**: Phase 3C builds the S&F engine that handles buffering, retry, and parking. Phase 7 integrates the External System Gateway as a delivery target and implements the error classification (transient vs. permanent).

### Section 5.4 — Parked Message Management (Phase 3C portion: backend)

- `[5.4-1]` Parked messages are stored at the site where they originated.
- `[5.4-2]` The central UI can query sites for parked messages and manage them remotely.
- `[5.4-3]` Operators can retry or discard parked messages from the central UI.
- `[5.4-4]` Parked message management covers external system calls, notifications, and cached database writes.

**Split-section note**: Phase 3C builds the site-side storage, query handler, and retry/discard command handler for parked messages. Phase 6 builds the central UI for parked message management.
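The retry and parking rules in sections 5.3 and 5.4 reduce to a small per-message state step: each transient failure increments the retry count, and the message is parked once the source entity's max retry count is exceeded. A hedged Python sketch of that step — the real engine is C# at the site, and all names and shapes here are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class RetrySettings:
    """Defined on the source entity (external system def, SMTP config,
    DB connection def), not per message [CD-SF-2]."""
    max_retries: int
    interval_seconds: int  # fixed interval, no exponential backoff [1.3-7]

@dataclass
class BufferedMessage:
    message_id: str
    retry_count: int = 0
    status: str = "pending"  # pending -> retrying -> parked

def on_delivery_failed(msg: BufferedMessage, settings: RetrySettings) -> BufferedMessage:
    """Advance one step of the per-message retry loop after a transient failure."""
    msg.retry_count += 1
    if msg.retry_count > settings.max_retries:
        msg.status = "parked"    # dead-lettered for manual review [5.3-4]
    else:
        msg.status = "retrying"  # next attempt after the fixed interval
    return msg
```

Note what is deliberately absent: there is no buffer-size check (`[5.3-5]`) and no backoff calculation — the interval is a constant read from the source entity's settings.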
### Section 6.4 — Store-and-Forward for Notifications (Phase 3C portion: engine)

- `[6.4-1]` If the email server is unavailable, notifications are buffered locally at the site.
- `[6.4-2]` Follows the same retry pattern as external system calls: configurable max retry count and time between retries (fixed interval).
- `[6.4-3]` After max retries are exhausted, the notification is parked for manual review.
- `[6.4-4]` There is no maximum buffer size for notification messages.

**Split-section note**: Phase 3C builds the S&F engine generically to support all three message categories. Phase 7 integrates the Notification Service as a delivery target.

---

## Design Constraints Checklist

Constraints from the CLAUDE.md Key Design Decisions and Component-*.md documents relevant to this phase.

### KDD Constraints

- `[KDD-deploy-6]` Deployment identity: unique deployment ID + revision hash for idempotency.
- `[KDD-deploy-7]` Per-instance operation lock covers all mutating commands (deploy, disable, enable, delete).
- `[KDD-deploy-8]` Site-side apply is all-or-nothing per instance.
- `[KDD-deploy-9]` System-wide artifact version skew across sites is supported.
- `[KDD-deploy-11]` Optimistic concurrency on deployment status records.
- `[KDD-sf-1]` Fixed retry interval, no maximum buffer size. Only transient failures are buffered.
- `[KDD-sf-2]` Async best-effort replication to standby (no ack wait).
- `[KDD-sf-3]` Messages are not cleared on instance deletion.
- `[KDD-sf-4]` CachedCall idempotency is the caller's responsibility. *(Documented in Phase 3C; enforced in Phase 7 integration.)*

### Component Design Constraints (from docs/requirements/Component-DeploymentManager.md)

- `[CD-DM-1]` Deployment flow: validate -> flatten -> send -> track. Validation failures stop the pipeline before anything is sent.
- `[CD-DM-2]` Site-side idempotency on deployment ID — a duplicate deployment receives an "already applied" response.
- `[CD-DM-3]` Sites reject stale configurations — an older revision hash than currently applied is rejected.
- `[CD-DM-4]` After central failover or timeout, the Deployment Manager queries the site for current deployment state before allowing re-deploy.
- `[CD-DM-5]` Only one mutating operation per instance in-flight at a time. A second operation is rejected with an "operation in progress" error.
- `[CD-DM-6]` Different instances can proceed in parallel, even at the same site.
- `[CD-DM-7]` State transition matrix: Enabled allows deploy/disable/delete; Disabled allows deploy (enables on apply)/enable/delete; Not-deployed allows deploy only.
- `[CD-DM-8]` System-wide artifact deployment shows a per-site result matrix. Successful sites are not rolled back if others fail. Failed sites can be retried individually.
- `[CD-DM-9]` Only the current deployment status per instance is stored (pending, in-progress, success, failed). No deployment history table — the audit log captures history.
- `[CD-DM-10]` Deployment scope is the individual instance level. Bulk operations decompose into individual instance deployments.
- `[CD-DM-11]` Diff view available before deploying (added/removed/changed members, connection binding changes). *(Diff calculation from Phase 2; orchestration in Phase 3C.)*
- `[CD-DM-12]` Two views maintained: deployed configuration and template-derived configuration.
- `[CD-DM-13]` Deployable artifacts include the flattened instance config plus system-wide artifacts (shared scripts, external system defs, DB connection defs, notification lists). System-wide artifact deployment is a separate action.
- `[CD-DM-14]` Site-side apply is all-or-nothing per instance. If any step fails (e.g., script compilation), the entire deployment is rejected. The previous config remains active and unchanged.
- `[CD-DM-15]` Cross-site version skew for artifacts is supported. Artifacts are self-contained and site-independent.
- `[CD-DM-16]` Disable: stops data subscriptions, script triggers, alarm evaluation. Config retained.
- `[CD-DM-17]` Enable: re-activates a disabled instance.
- `[CD-DM-18]` Delete: removes running config, destroys Instance Actor and children. S&F messages not cleared. Fails if site unreachable — central does not mark deleted until site confirms.

### Component Design Constraints (from docs/requirements/Component-StoreAndForward.md)

- `[CD-SF-1]` Three message categories: external system calls, email notifications, cached database writes.
- `[CD-SF-2]` Retry settings defined on the source entity (external system def, SMTP config, DB connection def), not per-message.
- `[CD-SF-3]` Only transient failures are eligible for S&F buffering. Permanent failures (HTTP 4xx) are returned to the script, not queued.
- `[CD-SF-4]` No maximum buffer size. Bounded only by available disk space.
- `[CD-SF-5]` Active node persists locally and forwards each buffer operation (add, remove, park) to standby asynchronously. No ack wait.
- `[CD-SF-6]` Standby applies operations to its own local SQLite.
- `[CD-SF-7]` On failover, rare cases of duplicate deliveries (delivered but remove not replicated) or missed retries (added but not replicated). Both acceptable.
- `[CD-SF-8]` Parked messages remain in SQLite at the site. Central queries via the Communication Layer.
- `[CD-SF-9]` Operators can retry (move back to retry queue) or discard (remove permanently) parked messages.
- `[CD-SF-10]` Messages are not automatically cleared when an instance is deleted. Pending and parked messages continue to exist.
- `[CD-SF-11]` Message format stores: message ID, category, target, payload, retry count, created at, last attempt at, status (pending/retrying/parked).
- `[CD-SF-12]` Message lifecycle: attempt immediate delivery -> success removes; failure buffers -> retry loop -> success removes + notify standby; max retries exhausted -> park.
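The `[CD-SF-11]` message format maps naturally onto a single SQLite table at each site node. A minimal sketch using Python's `sqlite3` as a stand-in for the site's local store — table and column names are illustrative assumptions, not the project schema:

```python
import sqlite3

# One row per buffered message, fields per [CD-SF-11]; status values follow
# the pending -> retrying -> parked lifecycle. Timestamps are UTC (WP-9).
SCHEMA = """
CREATE TABLE IF NOT EXISTS sf_messages (
    message_id      TEXT PRIMARY KEY,
    category        TEXT NOT NULL CHECK (category IN
                        ('external_call', 'notification', 'cached_db_write')),
    target          TEXT NOT NULL,            -- delivery target, e.g. external system id
    payload         BLOB NOT NULL,
    retry_count     INTEGER NOT NULL DEFAULT 0,
    created_at      TEXT NOT NULL,            -- UTC ISO-8601
    last_attempt_at TEXT,                     -- NULL until first delivery attempt
    status          TEXT NOT NULL DEFAULT 'pending'
                        CHECK (status IN ('pending', 'retrying', 'parked'))
);
"""

def open_buffer(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and if needed create) a node's local S&F buffer store."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

There is intentionally no row limit or size cap (`[CD-SF-4]`); the same table serves as the parked-message store queried remotely by central (`[CD-SF-8]`), filtered on `status = 'parked'`.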
### Component Design Constraints (from docs/requirements/Component-SiteRuntime.md — deployment-related)

- `[CD-SR-1]` Deployment handling: receive config -> store in SQLite -> compile scripts -> create/update Instance Actor -> report result.
- `[CD-SR-2]` For redeployments: existing Instance Actor and children are stopped, then a new Instance Actor is created with the updated config. Subscriptions re-established.
- `[CD-SR-3]` Disable: stops Instance Actor and children. Retains deployed config in SQLite for re-enablement.
- `[CD-SR-4]` Enable: creates a new Instance Actor from stored config (same as startup).
- `[CD-SR-5]` Delete: stops Instance Actor and children, removes deployed config from SQLite. Does not clear S&F messages.
- `[CD-SR-6]` Script compilation failure during deployment rejects the entire deployment. No partial state applied. Failure reported to central.

### Component Design Constraints (from docs/requirements/Component-Communication.md — deployment-related)

- `[CD-COM-1]` Deployment pattern: request/response. No buffering at central. Unreachable site = immediate failure.
- `[CD-COM-2]` Instance lifecycle pattern: request/response. Unreachable site = immediate failure.
- `[CD-COM-3]` System-wide artifact pattern: broadcast with per-site acknowledgment.
- `[CD-COM-4]` Deployment timeout: 120 seconds default (script compilation can be slow).
- `[CD-COM-5]` Lifecycle command timeout: 30 seconds.
- `[CD-COM-6]` System-wide artifact timeout: 120 seconds per site.
- `[CD-COM-7]` Application-level correlation: deployments include deployment ID + revision hash; lifecycle commands include a command ID.
- `[CD-COM-8]` Remote query pattern for parked messages: request/response with query ID, 30-second timeout.
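The interplay of `[CD-DM-2]` (idempotency on deployment ID) and `[CD-DM-3]` (stale revision rejection) amounts to a small site-side gate that runs before the all-or-nothing apply. A Python sketch of that decision logic — illustrative only, not the C#/Akka.NET implementation. Note one explicit assumption: a revision hash by itself is unordered, so this sketch assumes the package also carries a monotonically increasing config version that makes "older revision" decidable at the site.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ApplyDecision(Enum):
    APPLY = "apply"
    ALREADY_APPLIED = "already-applied"   # duplicate deployment ID [CD-DM-2]
    STALE_REJECTED = "stale-rejected"     # older config than applied [CD-DM-3]

@dataclass(frozen=True)
class DeploymentHeader:
    deployment_id: str
    revision_hash: str
    config_version: int  # hypothetical ordering field accompanying the hash

def decide_apply(current: Optional[DeploymentHeader],
                 incoming: DeploymentHeader) -> ApplyDecision:
    """Site-side gate evaluated before the all-or-nothing apply."""
    if current is None:
        return ApplyDecision.APPLY                # nothing deployed yet
    if incoming.deployment_id == current.deployment_id:
        return ApplyDecision.ALREADY_APPLIED      # idempotent re-send
    if incoming.config_version < current.config_version:
        return ApplyDecision.STALE_REJECTED       # site reports its current version
    return ApplyDecision.APPLY
```

The "already applied" branch is what makes a central retry after timeout or failover safe (`[CD-DM-4]`): re-sending the same deployment ID is a no-op rather than a second apply.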
---

## Work Packages

### WP-1: Deployment Manager — Core Deployment Flow

**Description**: Implement the central-side deployment orchestration pipeline: accept a deployment request, call the Template Engine for the validated and flattened config, send it to the site via the Communication Layer, and track status.

**Acceptance Criteria**:

- Deployment request triggers the validate -> flatten -> send -> track flow `[CD-DM-1]`
- Validation failures stop the pipeline before sending; errors returned to caller `[CD-DM-1]`, `[1.4-4]`
- Pre-deployment validation invokes the Template Engine for flattening, naming collision detection, script compilation, trigger references, connection binding `[1.4-4]`
- Validation does not verify that data source relative paths resolve to real tags on physical devices (runtime concern) `[1.4-4]`
- Successful deployment sends flattened config to site via Communication Layer `[1.4-1]`
- Site applies immediately upon receipt — no operator confirmation `[1.4-1]`
- Site reports success/failure back to central `[1.4-3]`
- Deployment status updated in config DB (pending -> in-progress -> success/failed) `[CD-DM-9]`
- Deployment scope is individual instance level `[CD-DM-10]`, `[3.9-3]`
- Template changes not auto-propagated — explicit deploy required `[3.9-1]`
- No rollback support — only current deployed state tracked `[3.9-5]`
- Uses 120-second deployment timeout `[CD-COM-4]`
- If site unreachable, deployment fails immediately (no central buffering) `[CD-COM-1]`

**Estimated Complexity**: L

**Requirements Traced**: `[1.4-1]`, `[1.4-3]`, `[1.4-4]`, `[3.9-1]`, `[3.9-3]`, `[3.9-5]`, `[CD-DM-1]`, `[CD-DM-9]`, `[CD-DM-10]`, `[CD-COM-1]`, `[CD-COM-4]`

---

### WP-2: Deployment Identity & Idempotency

**Description**: Implement deployment ID generation, revision hash propagation, and idempotent site-side apply.
**Acceptance Criteria**:

- Every deployment is assigned a unique deployment ID `[KDD-deploy-6]`
- Deployment includes the flattened config's revision hash (from the Template Engine) `[KDD-deploy-6]`
- Site-side apply is idempotent on deployment ID — a duplicate deployment returns "already applied" `[CD-DM-2]`
- Sites reject stale configurations — an older revision hash than currently applied is rejected, and the site reports its current version `[CD-DM-3]`
- After central failover or timeout, Deployment Manager queries the site for current deployment state before allowing re-deploy `[CD-DM-4]`
- Deployment messages include deployment ID + revision hash as correlation `[CD-COM-7]`

**Estimated Complexity**: M

**Requirements Traced**: `[KDD-deploy-6]`, `[CD-DM-2]`, `[CD-DM-3]`, `[CD-DM-4]`, `[CD-COM-7]`

---

### WP-3: Per-Instance Operation Lock

**Description**: Implement concurrency control ensuring only one mutating operation per instance can be in-flight at a time.

**Acceptance Criteria**:

- Only one mutating operation (deploy, disable, enable, delete) per instance in-flight at a time `[KDD-deploy-7]`, `[CD-DM-5]`
- Second operation on the same instance rejected with an "operation in progress" error `[CD-DM-5]`
- Different instances can proceed in parallel, even at the same site `[CD-DM-6]`
- Lock released when the operation completes (success or failure) or times out
- Lock state does not survive central failover (in-progress operations treated as failed per `[CD-DM-4]`)

**Estimated Complexity**: M

**Requirements Traced**: `[KDD-deploy-7]`, `[CD-DM-5]`, `[CD-DM-6]`

---

### WP-4: State Transition Matrix & Deployment Status

**Description**: Implement the allowed state transitions for instance operations and deployment status persistence with optimistic concurrency.

**Acceptance Criteria**:

- State transition matrix enforced: `[CD-DM-7]`
  - Enabled: allows deploy, disable, delete. Rejects enable (already enabled).
  - Disabled: allows deploy (enables on apply), enable, delete. Rejects disable (already disabled).
  - Not-deployed: allows deploy only. Rejects disable, enable, delete.
- Invalid state transitions return clear error messages
- Only current deployment status per instance stored (pending, in-progress, success, failed) `[CD-DM-9]`
- No deployment history table — audit log captures history via IAuditService `[CD-DM-9]`
- Optimistic concurrency on deployment status records `[KDD-deploy-11]`
- All deployment actions logged via IAuditService (who, what, when, result)

**Estimated Complexity**: M

**Requirements Traced**: `[CD-DM-7]`, `[CD-DM-9]`, `[KDD-deploy-11]`, `[3.8.1-1]`, `[3.8.1-2]`

---

### WP-5: Site-Side Apply Atomicity

**Description**: Implement all-or-nothing deployment application at the site.

**Acceptance Criteria**:

- Site stores new config, compiles all scripts, creates/updates Instance Actor as a single operation `[KDD-deploy-8]`, `[CD-DM-14]`
- If any step fails (e.g., script compilation), the entire deployment for that instance is rejected `[CD-DM-14]`, `[CD-SR-6]`
- Previous configuration remains active and unchanged on failure `[CD-DM-14]`
- Site reports the specific failure reason (e.g., compilation error details) back to central `[CD-SR-6]`
- For redeployments: existing Instance Actor and children stopped, then new Instance Actor created with updated config `[CD-SR-2]`
- Subscriptions re-established after redeployment `[CD-SR-2]`
- Site continues operating with last deployed config if connectivity to central is lost `[1.4-2]`
- Deployment handling follows: receive -> store SQLite -> compile -> create/update actor -> report `[CD-SR-1]`

**Estimated Complexity**: L

**Requirements Traced**: `[KDD-deploy-8]`, `[CD-DM-14]`, `[CD-SR-1]`, `[CD-SR-2]`, `[CD-SR-6]`, `[1.4-2]`

---

### WP-6: Instance Lifecycle Commands

**Description**: Implement disable, enable, and delete commands sent from central to site.
**Acceptance Criteria**:

- **Disable**: site stops script triggers, data subscriptions, and alarm evaluation `[3.8.1-3]`, `[CD-DM-16]`
- Disable retains deployed configuration for re-enablement without redeployment `[3.8.1-3]`, `[CD-DM-16]`, `[CD-SR-3]`
- Disable: S&F messages for the disabled instance continue to drain `[3.8.1-4]`
- **Enable**: re-activates a disabled instance by creating a new Instance Actor from stored config, restoring data subscriptions, script triggers, and alarm evaluation `[CD-DM-17]`, `[CD-SR-4]`
- Disable and enable commands fail immediately if the site is unreachable (no buffering, consistent with deployment behavior) `[CD-COM-2]`
- **Delete**: removes running config from site, stops subscriptions, destroys Instance Actor and children `[3.8.1-5]`, `[CD-DM-18]`, `[CD-SR-5]`
- Delete: S&F messages are not cleared `[3.8.1-6]`, `[CD-DM-18]`, `[CD-SR-5]`, `[KDD-sf-3]`
- Delete fails if site unreachable — central does not mark deleted until site confirms `[3.8.1-7]`, `[CD-DM-18]`
- Templates cannot be deleted if instances or child templates reference them `[3.8.1-8]`
- Lifecycle commands use request/response pattern with 30s timeout `[CD-COM-2]`, `[CD-COM-5]`
- Lifecycle commands include a command ID for deduplication (duplicate commands recognized and not re-applied) `[CD-COM-7]`

**Estimated Complexity**: L

**Requirements Traced**: `[3.8.1-1]` through `[3.8.1-8]`, `[KDD-sf-3]`, `[CD-DM-16]`, `[CD-DM-17]`, `[CD-DM-18]`, `[CD-SR-3]`, `[CD-SR-4]`, `[CD-SR-5]`, `[CD-COM-2]`, `[CD-COM-5]`, `[CD-COM-7]`

---

### WP-7: System-Wide Artifact Deployment

**Description**: Implement deployment of shared scripts, external system definitions, database connection definitions, and notification lists to all sites.
**Acceptance Criteria**:

- Changes not automatically propagated to sites `[1.5-1]`
- Deployment requires explicit action by a user with the Deployment role `[1.5-2]`
- Design role manages definitions; Deployment role triggers deployment `[1.5-3]`
- Broadcast pattern with per-site acknowledgment `[CD-COM-3]`
- Per-site result matrix — each site reports independently `[CD-DM-8]`
- Successful sites not rolled back if other sites fail `[CD-DM-8]`
- Failed sites can be retried individually `[CD-DM-8]`
- 120-second timeout per site `[CD-COM-6]`
- Cross-site version skew supported — sites can run different artifact versions `[KDD-deploy-9]`, `[CD-DM-15]`
- Artifacts are self-contained and site-independent `[CD-DM-15]`
- System-wide artifact deployment is a separate action from instance deployment `[CD-DM-13]`
- Shared scripts undergo pre-compilation validation (syntax/structural correctness) before deployment to sites
- All artifact deployment actions logged via IAuditService

**Estimated Complexity**: L

**Requirements Traced**: `[1.5-1]`, `[1.5-2]`, `[1.5-3]`, `[KDD-deploy-9]`, `[CD-DM-8]`, `[CD-DM-13]`, `[CD-DM-15]`, `[CD-COM-3]`, `[CD-COM-6]`

---

### WP-8: Deployed vs. Template-Derived State Management

**Description**: Implement storage and retrieval of deployed configuration snapshots, enabling comparison with template-derived configs.
**Acceptance Criteria**:

- System maintains two views per instance: deployed configuration and template-derived configuration `[3.9-2]`, `[CD-DM-12]`
- Deployed configuration updated on successful deployment `[CD-DM-12]`
- Template-derived configuration computed on demand from current template state (uses Phase 2 flattening)
- Diff can be computed between deployed and template-derived (uses Phase 2 diff calculation) `[CD-DM-11]`
- Diff shows added/removed/changed members and connection binding changes `[CD-DM-11]`
- Staleness detectable via revision hash comparison `[3.9-4]`

**Estimated Complexity**: M

**Requirements Traced**: `[3.9-2]`, `[3.9-4]`, `[CD-DM-11]`, `[CD-DM-12]`

---

### WP-9: S&F SQLite Persistence & Message Format

**Description**: Implement the SQLite schema and data access layer for store-and-forward message buffering at site nodes.

**Acceptance Criteria**:

- Buffered messages persisted to local SQLite on each site node `[1.3-3]`
- Message format stores: message ID, category, target, payload, retry count, created at, last attempt at, status (pending/retrying/parked) `[CD-SF-11]`
- Three message categories supported: external system calls, email notifications, cached database writes `[CD-SF-1]`
- No maximum buffer size — messages accumulate until delivery or parking `[1.3-6]`, `[CD-SF-4]`
- Central does not buffer messages (S&F is site-only) `[1.3-1]`
- All S&F timestamps are UTC

**Estimated Complexity**: M

**Requirements Traced**: `[1.3-1]`, `[1.3-3]`, `[1.3-6]`, `[CD-SF-1]`, `[CD-SF-4]`, `[CD-SF-11]`

---

### WP-10: S&F Retry Engine

**Description**: Implement the fixed-interval retry loop with per-source-entity retry settings and transient-only buffering.
**Acceptance Criteria**:

- Message lifecycle: attempt immediate delivery -> failure buffers -> retry loop -> success removes; max retries -> park `[CD-SF-12]`
- Retry is per-message — individual messages retry independently `[5.3-2]`
- Fixed retry interval (not exponential backoff) `[1.3-7]`, `[KDD-sf-1]`
- Retry settings defined on the source entity (external system def, SMTP config, DB connection def), not per-message `[CD-SF-2]`
- External system definitions include max retry count and time between retries `[5.3-3]`
- Notification config includes max retry count and time between retries `[6.4-2]`
- After max retries exhausted, message is parked (dead-lettered) `[5.3-4]`, `[6.4-3]`
- Only transient failures eligible for buffering; permanent failures returned to caller, not queued `[KDD-sf-1]`, `[CD-SF-3]`
- No maximum buffer size `[5.3-5]`, `[6.4-4]`, `[KDD-sf-1]`
- Messages for external calls buffered locally when system unavailable `[5.3-1]`
- Notifications buffered when email server unavailable `[6.4-1]`
- Successfully delivered messages removed from local store `[1.3-5]`

**Estimated Complexity**: L

**Requirements Traced**: `[1.3-5]`, `[1.3-7]`, `[5.3-1]` through `[5.3-5]`, `[6.4-1]` through `[6.4-4]`, `[KDD-sf-1]`, `[CD-SF-2]`, `[CD-SF-3]`, `[CD-SF-12]`

---

### WP-11: S&F Async Replication to Standby

**Description**: Implement application-level replication of buffer operations from active to standby node.

**Acceptance Criteria**:

- All S&F buffers replicated between the two site cluster nodes via application-level replication over Akka.NET remoting `[1.3-2]`
- Active node forwards each buffer operation (add, remove, park) to standby asynchronously `[CD-SF-5]`, `[KDD-sf-2]`
- Active node does not wait for standby acknowledgment (no ack wait) `[KDD-sf-2]`, `[CD-SF-5]`
- Standby applies operations to its own local SQLite `[CD-SF-6]`
- On failover, standby takes over delivery from its replicated copy `[1.3-4]`.
  Note: per `[CD-SF-7]`, the async replication design means the copy is near-complete — rare duplicate deliveries or missed retries are acceptable trade-offs for the latency benefit.
- Duplicate deliveries and missed retries accepted as trade-offs for async replication `[CD-SF-7]`
- Successfully delivered messages removed from both nodes' stores `[1.3-5]`

**Estimated Complexity**: L

**Requirements Traced**: `[1.3-2]`, `[1.3-4]`, `[1.3-5]`, `[KDD-sf-2]`, `[CD-SF-5]`, `[CD-SF-6]`, `[CD-SF-7]`

---

### WP-12: Parked Message Management

**Description**: Implement site-side parked message storage, query handling, and retry/discard commands accessible from central.

**Acceptance Criteria**:

- Parked messages stored at the site in SQLite `[5.4-1]`, `[CD-SF-8]`
- Central can query sites for parked messages via Communication Layer `[5.4-2]`, `[CD-SF-8]`
- Operators can retry a parked message (moves back to retry queue) `[5.4-3]`, `[CD-SF-9]`
- Operators can discard a parked message (removes permanently) `[5.4-3]`, `[CD-SF-9]`
- Management covers all three categories: external system calls, notifications, cached database writes `[5.4-4]`
- Remote query uses request/response pattern with query ID, 30s timeout `[CD-COM-8]`
- Messages not automatically cleared when instance deleted `[CD-SF-10]`, `[KDD-sf-3]`, `[3.8.1-6]`
- Pending and parked messages continue to exist after instance deletion `[CD-SF-10]`

**Estimated Complexity**: M

**Requirements Traced**: `[5.4-1]` through `[5.4-4]`, `[KDD-sf-3]`, `[CD-SF-8]`, `[CD-SF-9]`, `[CD-SF-10]`, `[CD-COM-8]`, `[3.8.1-6]`

---

### WP-13: S&F Messages Survive Instance Deletion

**Description**: Ensure store-and-forward messages are preserved when an instance is deleted.
**Acceptance Criteria**:

- S&F messages not cleared on instance deletion `[3.8.1-6]`, `[KDD-sf-3]`, `[CD-SF-10]`
- Pending messages continue retry delivery after instance deletion
- Parked messages remain queryable and manageable from central after instance deletion
- S&F messages for disabled instances continue to drain `[3.8.1-4]`

**Estimated Complexity**: S

**Requirements Traced**: `[3.8.1-4]`, `[3.8.1-6]`, `[KDD-sf-3]`, `[CD-SF-10]`

---

### WP-14: S&F Health Metrics & Event Logging Integration

**Description**: Integrate S&F buffer depth as a health metric and log S&F activity to the site event log.

**Acceptance Criteria**:

- S&F buffer depth reported as a health metric (broken down by category) — integrates with Phase 3B Health Monitoring
- S&F activity logged to site event log: message queued, delivered, retried, parked (per docs/requirements/Component-StoreAndForward.md Dependencies)
- S&F buffer depth visible in health reports sent to central

**Estimated Complexity**: S

**Requirements Traced**: `[CD-SF-1]` (categories), docs/requirements/Component-StoreAndForward.md Dependencies (Site Event Logging, Health Monitoring)

---

### WP-15: CachedCall Idempotency Documentation

**Description**: Document that CachedCall idempotency is the caller's responsibility.

**Acceptance Criteria**:

- Script API documentation clearly states that `ExternalSystem.CachedCall()` idempotency is the caller's responsibility `[KDD-sf-4]`
- S&F engine makes no idempotency guarantees — duplicate delivery possible (especially on failover) `[CD-SF-7]`

**Estimated Complexity**: S

**Requirements Traced**: `[KDD-sf-4]`, `[CD-SF-7]`

---

### WP-16: Deployment Manager — Concurrent Template Editing Semantics

**Description**: Ensure last-write-wins semantics for template editing do not conflict with the deployment pipeline.
**Acceptance Criteria**: - Last-write-wins for concurrent template editing — no pessimistic locking or optimistic concurrency on templates `[3.9-6]` - Deployment uses optimistic concurrency on deployment status records only `[KDD-deploy-11]` - Template state at time of deployment is captured in the flattened config and revision hash **Estimated Complexity**: S **Requirements Traced**: `[3.9-6]`, `[KDD-deploy-11]` --- ## Test Strategy ### Unit Tests | Area | Tests | |------|-------| | Deployment flow | Validate -> flatten -> send pipeline; validation failure stops pipeline | | Deployment identity | Deployment ID generation uniqueness; revision hash propagation | | Operation lock | Concurrent requests on same instance rejected; different instances proceed in parallel; lock released on completion/timeout | | State transitions | All valid transitions succeed; all invalid transitions rejected with correct error messages | | Deployment status | CRUD with optimistic concurrency; conflicting concurrent updates detected and the stale write rejected | | S&F message format | Serialization/deserialization of all three categories; all fields stored correctly | | S&F retry logic | Fixed interval timing; per-source-entity settings respected; reaching max retries triggers parking; transient-only filter | | Parked message ops | Retry moves to queue; discard removes; query returns correct results | | Template deletion constraint | Templates with instance references cannot be deleted; templates with child template references cannot be deleted | ### Integration Tests | Area | Tests | |------|-------| | End-to-end deploy | Central sends deployment -> site compiles -> actors created -> success reported -> status updated | | Deploy with validation failure | Template with compilation error -> deployment blocked before send | | Idempotent deploy | Same deployment ID sent twice -> second returns "already applied" | | Stale config rejection | Older revision hash sent -> site rejects with current version | | Lifecycle commands | Disable -> verify subscriptions stopped and config retained; Enable -> verify instance re-activates; Delete -> verify actors destroyed and config removed | | S&F buffer and retry | Submit message -> delivery fails -> buffered -> retry succeeds -> message removed | | S&F parking | Submit message -> delivery fails -> max retries -> message parked | | S&F replication | Buffer message on active -> verify replicated to standby SQLite | | Parked message remote query | Central queries site for parked messages -> correct results returned | | Parked message retry/discard | Central retries parked message -> moves to queue; Central discards -> removed | | System-wide artifact deploy | Deploy shared scripts to multiple sites -> per-site status tracked | | S&F survives deletion | Delete instance -> verify S&F messages still exist and deliver | | S&F drains on disable | Disable instance -> verify pending S&F messages continue delivery | ### Negative Tests | Requirement | Test | |-------------|------| | `[1.3-1]` Central does not buffer | Verify no S&F infrastructure exists on central; central deployment to unreachable site fails immediately | | `[1.3-6]` No maximum buffer size | Submit messages continuously -> verify no rejection based on count | | `[3.8.1-7]` Delete fails if unreachable | Attempt delete when site offline -> verify failure; verify central does not mark as deleted | | `[3.8.1-8]` Template deletion constraint | Attempt to delete template with active instances -> verify rejection | | `[3.9-1]` No auto-propagation | Change template -> verify deployed instance unaffected | | `[3.9-5]` No rollback | Verify no rollback mechanism exists; only current deployed state tracked | | `[CD-DM-5]` Operation lock rejects | Send two concurrent deploys for same instance -> verify second rejected | | `[CD-DM-7]` Invalid transitions | Attempt enable on already-enabled instance -> verify rejection; attempt disable on not-deployed -> verify rejection | | `[CD-SF-3]` Permanent failures not
buffered | Submit message with permanent failure classification -> verify not buffered, error returned to caller | | `[KDD-sf-3]` Messages survive deletion | Delete instance -> verify S&F messages not cleared | ### Failover & Resilience Tests | Scenario | Test | |----------|------| | Mid-deploy central failover | Deploy in progress -> kill central active -> verify deployment treated as failed -> re-query site state -> re-deploy succeeds | | Mid-deploy site failover | Deploy in progress -> kill site active -> verify deployment times out or fails -> re-deploy to new active succeeds | | Timeout + reconciliation | Deploy sent -> site applies but response lost -> central times out -> central queries site state -> finds "already applied" -> updates status | | S&F buffer takeover | Buffer messages on active -> kill active -> standby takes over -> verify messages delivered from replicated copy | | S&F replication gap | Buffer message -> immediately kill active (before replication) -> verify standby handles gap gracefully (missed message, no crash) | | Site offline then online | Deploy to offline site -> fails -> site comes online -> re-deploy succeeds | | System-wide artifact partial failure | Deploy artifacts to 3 sites, 1 offline -> verify 2 succeed -> retry failed site when online | --- ## Verification Gate Phase 3C is complete when **all** of the following pass: 1. **Deployment pipeline end-to-end**: Central validates, flattens, sends, site compiles, creates actors, reports success. Status tracked in config DB. 2. **Idempotency**: Duplicate deployment ID returns "already applied." Stale revision hash rejected. 3. **Operation lock**: Concurrent operations on same instance rejected; parallel operations on different instances succeed. 4. **State transitions**: All valid transitions work; all invalid transitions rejected. 5. **Site-side atomicity**: Compilation failure rejects entire deployment; previous config unchanged. 6. 
**Lifecycle commands**: Disable/enable/delete work correctly with proper state effects. 7. **S&F buffering**: Messages buffered on transient failure, retried at fixed interval, parked after max retries. 8. **S&F replication**: Buffer operations replicated to standby; failover resumes delivery. 9. **Parked message management**: Central can query, retry, and discard parked messages at sites. 10. **S&F survival**: Messages persist through instance deletion and continue delivery. 11. **System-wide artifacts**: Deployed to all sites with per-site status; version skew tolerated. 12. **Resilience**: Mid-deploy failover, timeout+reconciliation, and S&F takeover tests pass. 13. **Audit logging**: All deployment and lifecycle actions recorded via IAuditService. 14. **All unit, integration, negative, and failover tests pass.** --- ## Open Questions | # | Question | Context | Impact | Status | |---|----------|---------|--------|--------| | Q-P3C-1 | Should S&F retry timers be reset on failover or continue from the last known retry timestamp? | On failover, the new active node loads buffer from SQLite. Messages have `last_attempt_at` timestamps. Should retry timing continue relative to `last_attempt_at` or reset to "now"? | Affects retry behavior immediately after failover. Recommend: continue from `last_attempt_at` to avoid burst retries. | Open | | Q-P3C-2 | What is the maximum number of parked messages returned in a single remote query? | Communication Layer pattern 8 uses 30s timeout. Very large parked message sets may need pagination. | Recommend: paginated query (e.g., 100 per page) consistent with Site Event Logging pagination pattern. | Open | | Q-P3C-3 | Should the per-instance operation lock be in-memory (lost on central failover) or persisted? | In-memory is simpler and consistent with "in-progress deployments treated as failed on failover." Persisted lock could cause orphan locks. | Recommend: in-memory. On failover, all locks released. 
Site state query resolves any ambiguity. | Open | --- ## Orphan Check Result ### Forward Check (Requirements -> Work Packages) Every item in the Requirements Checklist and Design Constraints Checklist was walked. Results: | Checklist Item | Mapped To | Verified | |---|---|---| | `[1.3-1]` through `[1.3-7]` | WP-9, WP-10, WP-11 | Yes | | `[1.4-1]` through `[1.4-4]` | WP-1, WP-5 | Yes | | `[1.5-1]` through `[1.5-3]` | WP-7 | Yes | | `[3.8.1-1]` through `[3.8.1-8]` | WP-4, WP-6, WP-12, WP-13 | Yes | | `[3.9-1]`, `[3.9-2]`, `[3.9-3]`, `[3.9-5]`, `[3.9-6]` | WP-1, WP-8, WP-16 | Yes | | `[3.9-4]` | WP-8 (staleness detection); diff UI deferred to Phase 6 | Yes | | `[5.3-1]` through `[5.3-5]` | WP-10 | Yes | | `[5.4-1]` through `[5.4-4]` | WP-12 | Yes | | `[6.4-1]` through `[6.4-4]` | WP-10 | Yes | | `[KDD-deploy-6]` | WP-2 | Yes | | `[KDD-deploy-7]` | WP-3 | Yes | | `[KDD-deploy-8]` | WP-5 | Yes | | `[KDD-deploy-9]` | WP-7 | Yes | | `[KDD-deploy-11]` | WP-4, WP-16 | Yes | | `[KDD-sf-1]` | WP-10 | Yes | | `[KDD-sf-2]` | WP-11 | Yes | | `[KDD-sf-3]` | WP-6, WP-12, WP-13 | Yes | | `[KDD-sf-4]` | WP-15 | Yes | | `[CD-DM-1]` through `[CD-DM-18]` | WP-1 through WP-8 | Yes | | `[CD-SF-1]` through `[CD-SF-12]` | WP-9 through WP-14 | Yes | | `[CD-SR-1]` through `[CD-SR-6]` | WP-5, WP-6 | Yes | | `[CD-COM-1]` through `[CD-COM-8]` | WP-1, WP-2, WP-6, WP-7, WP-12 | Yes | **Forward check result: PASS — no orphan requirements.** ### Reverse Check (Work Packages -> Requirements) Every work package traces to at least one requirement or design constraint: | Work Package | Traces To | |---|---| | WP-1 | `[1.4-1]`, `[1.4-3]`, `[1.4-4]`, `[3.9-1]`, `[3.9-3]`, `[3.9-5]`, `[CD-DM-1]`, `[CD-DM-9]`, `[CD-DM-10]`, `[CD-COM-1]`, `[CD-COM-4]` | | WP-2 | `[KDD-deploy-6]`, `[CD-DM-2]`, `[CD-DM-3]`, `[CD-DM-4]`, `[CD-COM-7]` | | WP-3 | `[KDD-deploy-7]`, `[CD-DM-5]`, `[CD-DM-6]` | | WP-4 | `[CD-DM-7]`, `[CD-DM-9]`, `[KDD-deploy-11]`, `[3.8.1-1]`, `[3.8.1-2]` | | WP-5 | `[KDD-deploy-8]`, `[CD-DM-14]`, 
`[CD-SR-1]`, `[CD-SR-2]`, `[CD-SR-6]`, `[1.4-2]` | | WP-6 | `[3.8.1-1]` through `[3.8.1-8]`, `[KDD-sf-3]`, `[CD-DM-16]` through `[CD-DM-18]`, `[CD-SR-3]` through `[CD-SR-5]`, `[CD-COM-2]`, `[CD-COM-5]`, `[CD-COM-7]` | | WP-7 | `[1.5-1]` through `[1.5-3]`, `[KDD-deploy-9]`, `[CD-DM-8]`, `[CD-DM-13]`, `[CD-DM-15]`, `[CD-COM-3]`, `[CD-COM-6]` | | WP-8 | `[3.9-2]`, `[3.9-4]`, `[CD-DM-11]`, `[CD-DM-12]` | | WP-9 | `[1.3-1]`, `[1.3-3]`, `[1.3-6]`, `[CD-SF-1]`, `[CD-SF-4]`, `[CD-SF-11]` | | WP-10 | `[1.3-5]`, `[1.3-7]`, `[5.3-1]` through `[5.3-5]`, `[6.4-1]` through `[6.4-4]`, `[KDD-sf-1]`, `[CD-SF-2]`, `[CD-SF-3]`, `[CD-SF-12]` | | WP-11 | `[1.3-2]`, `[1.3-4]`, `[1.3-5]`, `[KDD-sf-2]`, `[CD-SF-5]`, `[CD-SF-6]`, `[CD-SF-7]` | | WP-12 | `[5.4-1]` through `[5.4-4]`, `[KDD-sf-3]`, `[CD-SF-8]`, `[CD-SF-9]`, `[CD-SF-10]`, `[CD-COM-8]`, `[3.8.1-6]` | | WP-13 | `[3.8.1-4]`, `[3.8.1-6]`, `[KDD-sf-3]`, `[CD-SF-10]` | | WP-14 | `[CD-SF-1]`, docs/requirements/Component-StoreAndForward.md Dependencies | | WP-15 | `[KDD-sf-4]`, `[CD-SF-7]` | | WP-16 | `[3.9-6]`, `[KDD-deploy-11]` | **Reverse check result: PASS — no untraceable work packages.** ### Split-Section Check | Section | Phase 3C Covers | Other Phase Covers | Gap? 
| |---|---|---|---| | 1.4 | `[1.4-1]` through `[1.4-4]` (all bullets — backend pipeline) | Phase 6: deployment UI triggers and status display | No gap | | 1.5 | `[1.5-1]` through `[1.5-3]` (all bullets — backend pipeline) | Phase 6: artifact deployment UI | No gap | | 3.8.1 | `[3.8.1-1]` through `[3.8.1-8]` (all bullets — backend commands) | Phase 4: lifecycle command UI | No gap | | 3.9 | `[3.9-1]`, `[3.9-2]`, `[3.9-3]`, `[3.9-5]`, `[3.9-6]` | Phase 6: `[3.9-4]` (diff view UI), deployment trigger UI | No gap | | 5.3 | `[5.3-1]` through `[5.3-5]` (S&F engine) | Phase 7: External System Gateway delivery integration, error classification | No gap | | 5.4 | `[5.4-1]` through `[5.4-4]` (backend query/command handling) | Phase 6: parked message management UI | No gap | | 6.4 | `[6.4-1]` through `[6.4-4]` (S&F engine) | Phase 7: Notification Service delivery integration | No gap | **Split-section check result: PASS — no unowned bullets.** ### Negative Requirement Check | Negative Requirement | Acceptance Criterion | Adequate? 
| |---|---|---| | `[1.3-1]` Central does not buffer | Test verifies no S&F infrastructure on central; unreachable site = immediate failure | Yes | | `[1.3-6]` No maximum buffer size | Test submits messages continuously, verifies no count-based rejection | Yes | | `[3.8.1-6]` S&F messages not cleared on deletion | Test deletes instance, verifies messages still exist and deliver | Yes | | `[3.8.1-7]` Delete fails if unreachable | Test attempts delete to offline site, verifies failure and central status unchanged | Yes | | `[3.8.1-8]` Templates cannot be deleted with references | Test attempts deletion of referenced template, verifies rejection | Yes | | `[3.9-1]` Changes not auto-propagated | Test changes template, verifies deployed instance unchanged | Yes | | `[3.9-5]` No rollback | Verifies no rollback mechanism; only current state tracked | Yes | | `[CD-SF-3]` Permanent failures not buffered | Test submits permanent failure, verifies not queued | Yes | **Negative requirement check result: PASS — all prohibitions have verification criteria.** --- ## Codex MCP Verification **Model**: gpt-5.4 **Result**: Pass with corrections ### Step 1 — Requirements Coverage Review Codex identified 10 findings. Disposition: | # | Finding | Disposition | |---|---------|-------------| | 1 | Naming collision detection and device tag resolution exclusion missing from WP-1 | **Corrected** — added naming collision detection to WP-1 acceptance criteria; added explicit exclusion of device tag resolution. | | 2 | Shared script pre-compilation validation missing from WP-7 | **Corrected** — added shared script validation acceptance criterion to WP-7. | | 3 | Role overlap (user may hold both Design+Deployment) not verified | **Dismissed** — this is a Phase 1 Security & Auth concern. Phase 3C assumes the auth model works correctly. Role overlap is tested in Phase 1 integration tests. 
| | 4 | WP-4 traces [3.8.1-2] but doesn't verify runtime activation | **Dismissed** — WP-4 owns the state transition matrix. Runtime behavior of "enabled" (subscriptions, triggers, alarms running) is the responsibility of Phase 3B Site Runtime, which creates Instance Actors with full initialization. WP-6 verifies enable recreates the actor. | | 5 | Enable flow underspecified (should verify actor recreation with subscriptions) | **Corrected** — expanded WP-6 enable criteria to explicitly verify actor creation, subscription restoration, script triggers, and alarm evaluation. | | 6 | Command ID described as "correlation" but source says "deduplication" | **Corrected** — changed wording to "deduplication" with acceptance criterion that duplicate commands are recognized and not re-applied. | | 7 | Disable/enable unreachable failure not explicitly covered | **Corrected** — added acceptance criterion that disable and enable fail immediately if site unreachable. | | 8 | Diff "show" requirement only partially verified (compute, not expose) | **Dismissed** — Phase 3C provides the backend API for diff computation and staleness detection. The "show" (UI) aspect is explicitly deferred to Phase 6 per the split-section note. WP-8 correctly scopes to backend. | | 9 | Parked message management UI not verified | **Dismissed** — same as #8. Phase 3C builds the site-side backend (query handler, retry/discard commands). Phase 6 builds the central UI. Split documented in plan. | | 10 | "near-complete copy" weakens HighLevelReqs "seamless" wording | **Corrected** — updated WP-11 to reference [1.3-4] for the seamless takeover requirement, with a note that [CD-SF-7] acknowledges the async replication trade-off (rare duplicates/misses). The component design explicitly documents this as an acceptable trade-off; HighLevelReqs 1.3 bullet 4 does not preclude it since "seamlessly" refers to the takeover process, not data completeness. 
| ### Step 2 — Negative Requirement Review Not submitted separately; negative requirements were included in Step 1 review. All negative requirements have adequate acceptance criteria per the orphan check. ### Step 3 — Split-Section Gap Review Not submitted separately; split sections were documented in the plan and reviewed in Step 1. No gaps identified.
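To close, the core S&F behavior exercised throughout the test strategy and verification gate (buffer on transient failure only, fixed-interval retry, park after max retries) can be condensed into a short sketch. Python is used for illustration only; the names, the in-memory queue, and the `MAX_RETRIES` constant are assumptions, and the real engine persists to SQLite and replicates to the standby per WP-9 and WP-11.

```python
from dataclasses import dataclass

MAX_RETRIES = 3  # illustrative; real values are per-source-entity settings

@dataclass
class BufferedMessage:
    message_id: str
    payload: str
    attempts: int = 0
    state: str = "pending"      # pending | parked

class TransientError(Exception): ...
class PermanentError(Exception): ...

def submit(queue: list, deliver, payload: str, message_id: str) -> None:
    """Buffer only transient failures; permanent failures propagate ([CD-SF-3])."""
    try:
        deliver(payload)
    except TransientError:
        queue.append(BufferedMessage(message_id, payload, attempts=1))
    # PermanentError is deliberately NOT caught: it returns to the caller.

def retry_tick(queue: list, deliver) -> None:
    """One fixed-interval pass: retry each pending message, park on exhaustion."""
    for msg in [m for m in queue if m.state == "pending"]:
        try:
            deliver(msg.payload)
            queue.remove(msg)   # delivered: remove from the store
        except TransientError:
            msg.attempts += 1
            if msg.attempts >= MAX_RETRIES:
                msg.state = "parked"   # operator must retry or discard

# Example: an endpoint that stays down long enough to exhaust retries.
queue: list = []
def always_down(_payload):
    raise TransientError()

submit(queue, always_down, "notify-ops", "msg-1")
retry_tick(queue, always_down)
retry_tick(queue, always_down)
print(queue[0].state)  # parked
```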