Phase 3C: Deployment Pipeline & Store-and-Forward
Date: 2026-03-16
Status: Draft
Prerequisites: Phase 2 (Template Engine, deployment package contract), Phase 3A (Cluster Infrastructure, Site Runtime skeleton, local SQLite persistence), Phase 3B (Communication Layer, Site Runtime full actor hierarchy, Health Monitoring)
Scope
Goal: Complete the deploy-to-site pipeline end-to-end with resilience.
Components:
- Deployment Manager (full) — Central-side deployment orchestration, instance lifecycle, system-wide artifact deployment
- Store-and-Forward Engine (full) — Site-side message buffering, retry, parking, replication, parked message management
Testable Outcome: Central validates, flattens, and deploys an instance to a site. Site compiles scripts, creates actors, reports success. Deployment ID ensures idempotency. Per-instance operation lock works. Instance lifecycle commands (disable, enable, delete) work. Store-and-forward buffers messages on transient failure, retries, parks. Async replication to standby. Parked messages queryable from central.
Prerequisites
| Prerequisite | Phase | What Must Be Complete |
|---|---|---|
| Template Engine | 2 | Flattening, validation pipeline, revision hash generation, diff calculation, deployment package contract |
| Configuration Database | 1, 2 | Schema, repositories (IDeploymentManagerRepository), IAuditService, optimistic concurrency support |
| Cluster Infrastructure | 3A | Akka.NET cluster with SBR, failover, CoordinatedShutdown |
| Site Runtime | 3A, 3B | Deployment Manager singleton, Instance Actor hierarchy, script compilation, alarm actors, full actor lifecycle |
| Communication Layer | 3B | All 8 message patterns (deployment, lifecycle, artifact deployment, remote queries), correlation IDs, timeouts |
| Health Monitoring | 3B | Metric collection framework (S&F buffer depth will be added as a new metric) |
| Site Event Logging | 3B | Event recording to SQLite (S&F activity events will be added) |
| Security & Auth | 1 | Deployment role with optional site scoping |
Requirements Checklist
Each bullet is extracted from the referenced HighLevelReqs.md sections. Items marked with a phase note indicate split-section bullets owned by another phase.
Section 1.3 — Store-and-Forward Persistence (Site Clusters Only)
- [1.3-1] Store-and-forward applies only at site clusters — central does not buffer messages.
- [1.3-2] All site-level S&F buffers (external system calls, notifications, cached database writes) are replicated between the two site cluster nodes using application-level replication over Akka.NET remoting.
- [1.3-3] Active node persists buffered messages to a local SQLite database and forwards them to the standby node, which maintains its own local SQLite copy.
- [1.3-4] On failover, the standby node already has a replicated copy and takes over delivery seamlessly.
- [1.3-5] Successfully delivered messages are removed from both nodes' local stores.
- [1.3-6] There is no maximum buffer size — messages accumulate until they either succeed or exhaust retries and are parked.
- [1.3-7] Retry intervals are fixed (not exponential backoff).
Section 1.4 — Deployment Behavior
- [1.4-1] When central deploys a new configuration to a site instance, the site applies it immediately upon receipt — no local operator confirmation required. (Phase 3C)
- [1.4-2] If a site loses connectivity to central, it continues operating with its last received deployed configuration. (Phase 3C — verified via resilience tests)
- [1.4-3] The site reports back to central whether deployment was successfully applied. (Phase 3C)
- [1.4-4] Pre-deployment validation: before any deployment is sent to a site, the central cluster performs comprehensive validation including flattening, test-compiling scripts, verifying alarm trigger references, verifying script trigger references, and checking data connection binding completeness. (Phase 3C — orchestration; validation pipeline built in Phase 2)
Split-section note: Section 1.4 is fully covered by Phase 3C (backend pipeline). Phase 6 covers the UI for deployment workflows (diff view, deploy button, status tracking display).
Section 1.5 — System-Wide Artifact Deployment
- [1.5-1] Changes to shared scripts, external system definitions, database connection definitions, and notification lists are not automatically propagated to sites.
- [1.5-2] Deployment of system-wide artifacts requires explicit action by a user with the Deployment role.
- [1.5-3] The Design role manages the definitions; the Deployment role triggers deployment to sites. A user may hold both roles.
Split-section note: Phase 3C covers the backend pipeline for artifact deployment. Phase 6 covers the UI for triggering and monitoring artifact deployment.
Section 3.8.1 — Instance Lifecycle (Phase 3C portion)
- [3.8.1-1] Instances can be in one of two states: enabled or disabled.
- [3.8.1-2] Enabled: instance is active — data subscriptions, script triggers, and alarm evaluation are all running.
- [3.8.1-3] Disabled: site stops script triggers, data subscriptions (no live data collection), and alarm evaluation. Deployed configuration is retained so the instance can be re-enabled without redeployment.
- [3.8.1-4] Disabled: store-and-forward messages for a disabled instance continue to drain (deliver pending messages).
- [3.8.1-5] Deletion removes the running configuration from the site, stops subscriptions, destroys the Instance Actor and its children.
- [3.8.1-6] Store-and-forward messages are not cleared on deletion — they continue to be delivered or can be managed via parked message management.
- [3.8.1-7] If the site is unreachable when a delete is triggered, the deletion fails. Central does not mark it as deleted until the site confirms.
- [3.8.1-8] Templates cannot be deleted if any instances or child templates reference them.
Split-section note: Phase 3C covers the backend for lifecycle commands. Phase 4 covers the UI for disable/enable/delete actions.
Section 3.9 — Template Deployment & Change Propagation (Phase 3C portion)
- [3.9-1] Template changes are not automatically propagated to deployed instances.
- [3.9-2] The system maintains two views: deployed configuration (currently running) and template-derived configuration (what it would look like if deployed now).
- [3.9-3] Deployment is performed at the individual instance level — an engineer explicitly commands the system to update a specific instance.
- [3.9-4] The system must show differences between deployed and template-derived configuration.
- [3.9-5] No rollback support required. Only the current deployed state is tracked, not history.
- [3.9-6] Concurrent editing uses a last-write-wins model. No pessimistic locking or optimistic concurrency conflict detection on templates.
Split-section note: Phase 3C covers [3.9-1], [3.9-2] (backend maintenance of two views), [3.9-3] (backend deployment pipeline), [3.9-5] (no rollback), [3.9-6] (last-write-wins — already from Phase 2). Phase 6 covers [3.9-4] (diff view UI) and the deployment trigger UI. The diff calculation itself is built in Phase 2; Phase 3C uses it. Phase 3C stores the deployed configuration snapshot that enables diff comparison.
Section 5.3 — Store-and-Forward for External Calls (Phase 3C portion: engine)
- [5.3-1] If an external system is unavailable when a script invokes a method, the message is buffered locally at the site.
- [5.3-2] Retry is performed per message — individual failed messages retry independently.
- [5.3-3] Each external system definition includes configurable retry settings: max retry count and time between retries (fixed interval, no exponential backoff).
- [5.3-4] After max retries are exhausted, the message is parked (dead-lettered) for manual review.
- [5.3-5] There is no maximum buffer size — messages accumulate until delivery succeeds or retries are exhausted.
Split-section note: Phase 3C builds the S&F engine that handles buffering, retry, and parking. Phase 7 integrates the External System Gateway as a delivery target and implements the error classification (transient vs. permanent).
Section 5.4 — Parked Message Management (Phase 3C portion: backend)
- [5.4-1] Parked messages are stored at the site where they originated.
- [5.4-2] Central UI can query sites for parked messages and manage them remotely.
- [5.4-3] Operators can retry or discard parked messages from the central UI.
- [5.4-4] Parked message management covers external system calls, notifications, and cached database writes.
Split-section note: Phase 3C builds the site-side storage, query handler, and retry/discard command handler for parked messages. Phase 6 builds the central UI for parked message management.
Section 6.4 — Store-and-Forward for Notifications (Phase 3C portion: engine)
- [6.4-1] If the email server is unavailable, notifications are buffered locally at the site.
- [6.4-2] Follows the same retry pattern as external system calls: configurable max retry count and time between retries (fixed interval).
- [6.4-3] After max retries are exhausted, the notification is parked for manual review.
- [6.4-4] There is no maximum buffer size for notification messages.
Split-section note: Phase 3C builds the S&F engine generically to support all three message categories. Phase 7 integrates the Notification Service as a delivery target.
Design Constraints Checklist
Constraints from CLAUDE.md Key Design Decisions and Component-*.md documents relevant to this phase.
KDD Constraints
- [KDD-deploy-6] Deployment identity: unique deployment ID + revision hash for idempotency.
- [KDD-deploy-7] Per-instance operation lock covers all mutating commands (deploy, disable, enable, delete).
- [KDD-deploy-8] Site-side apply is all-or-nothing per instance.
- [KDD-deploy-9] System-wide artifact version skew across sites is supported.
- [KDD-deploy-11] Optimistic concurrency on deployment status records.
- [KDD-sf-1] Fixed retry interval, no max buffer size. Only transient failures buffered.
- [KDD-sf-2] Async best-effort replication to standby (no ack wait).
- [KDD-sf-3] Messages not cleared on instance deletion.
- [KDD-sf-4] CachedCall idempotency is the caller's responsibility. (Documented in Phase 3C; enforced in Phase 7 integration.)
Component Design Constraints (from Component-DeploymentManager.md)
- [CD-DM-1] Deployment flow: validate -> flatten -> send -> track. Validation failures stop the pipeline before anything is sent.
- [CD-DM-2] Site-side idempotency on deployment ID — duplicate deployment receives "already applied" response.
- [CD-DM-3] Sites reject stale configurations — older revision hash than currently applied is rejected.
- [CD-DM-4] After central failover or timeout, Deployment Manager queries the site for current deployment state before allowing re-deploy.
- [CD-DM-5] Only one mutating operation per instance in-flight at a time. Second operation rejected with "operation in progress" error.
- [CD-DM-6] Different instances can proceed in parallel, even at the same site.
- [CD-DM-7] State transition matrix: Enabled allows deploy/disable/delete; Disabled allows deploy (enables on apply)/enable/delete; Not-deployed allows deploy only.
- [CD-DM-8] System-wide artifact deployment shows per-site result matrix. Successful sites not rolled back if others fail. Failed sites can be retried individually.
- [CD-DM-9] Only current deployment status per instance stored (pending, in-progress, success, failed). No deployment history table — audit log captures history.
- [CD-DM-10] Deployment scope is individual instance level. Bulk operations decompose into individual instance deployments.
- [CD-DM-11] Diff view available before deploying (added/removed/changed members, connection binding changes). (Diff calculation from Phase 2; orchestration in Phase 3C.)
- [CD-DM-12] Two views maintained: deployed configuration and template-derived configuration.
- [CD-DM-13] Deployable artifacts include flattened instance config plus system-wide artifacts (shared scripts, external system defs, DB connection defs, notification lists). System-wide artifact deployment is a separate action.
- [CD-DM-14] Site-side apply is all-or-nothing per instance. If any step fails (e.g., script compilation), entire deployment rejected. Previous config remains active and unchanged.
- [CD-DM-15] Cross-site version skew for artifacts is supported. Artifacts are self-contained and site-independent.
- [CD-DM-16] Disable: stops data subscriptions, script triggers, alarm evaluation. Config retained.
- [CD-DM-17] Enable: re-activates a disabled instance.
- [CD-DM-18] Delete: removes running config, destroys Instance Actor and children. S&F messages not cleared. Fails if site unreachable — central does not mark deleted until site confirms.
Component Design Constraints (from Component-StoreAndForward.md)
- [CD-SF-1] Three message categories: external system calls, email notifications, cached database writes.
- [CD-SF-2] Retry settings defined on the source entity (external system def, SMTP config, DB connection def), not per-message.
- [CD-SF-3] Only transient failures eligible for S&F buffering. Permanent failures (HTTP 4xx) returned to script, not queued.
- [CD-SF-4] No maximum buffer size. Bounded only by available disk space.
- [CD-SF-5] Active node persists locally and forwards each buffer operation (add, remove, park) to standby asynchronously. No ack wait.
- [CD-SF-6] Standby applies operations to its own local SQLite.
- [CD-SF-7] On failover, rare cases of duplicate deliveries (delivered but remove not replicated) or missed retries (added but not replicated). Both acceptable.
- [CD-SF-8] Parked messages remain in SQLite at site. Central queries via Communication Layer.
- [CD-SF-9] Operators can retry (move back to retry queue) or discard (remove permanently) parked messages.
- [CD-SF-10] Messages not automatically cleared when instance deleted. Pending and parked messages continue to exist.
- [CD-SF-11] Message format stores: message ID, category, target, payload, retry count, created at, last attempt at, status (pending/retrying/parked).
- [CD-SF-12] Message lifecycle: attempt immediate delivery -> success removes; failure buffers -> retry loop -> success removes + notify standby; max retries exhausted -> park.
Component Design Constraints (from Component-SiteRuntime.md — deployment-related)
- [CD-SR-1] Deployment handling: receive config -> store in SQLite -> compile scripts -> create/update Instance Actor -> report result.
- [CD-SR-2] For redeployments: existing Instance Actor and children stopped, then new Instance Actor created with updated config. Subscriptions re-established.
- [CD-SR-3] Disable: stops Instance Actor and children. Retains deployed config in SQLite for re-enablement.
- [CD-SR-4] Enable: creates new Instance Actor from stored config (same as startup).
- [CD-SR-5] Delete: stops Instance Actor and children, removes deployed config from SQLite. Does not clear S&F messages.
- [CD-SR-6] Script compilation failure during deployment rejects entire deployment. No partial state applied. Failure reported to central.
Component Design Constraints (from Component-Communication.md — deployment-related)
- [CD-COM-1] Deployment pattern: request/response. No buffering at central. Unreachable site = immediate failure.
- [CD-COM-2] Instance lifecycle pattern: request/response. Unreachable site = immediate failure.
- [CD-COM-3] System-wide artifact pattern: broadcast with per-site acknowledgment.
- [CD-COM-4] Deployment timeout: 120 seconds default (script compilation can be slow).
- [CD-COM-5] Lifecycle command timeout: 30 seconds.
- [CD-COM-6] System-wide artifact timeout: 120 seconds per site.
- [CD-COM-7] Application-level correlation: deployments include deployment ID + revision hash; lifecycle commands include command ID.
- [CD-COM-8] Remote query pattern for parked messages: request/response with query ID, 30-second timeout.
Work Packages
WP-1: Deployment Manager — Core Deployment Flow
Description: Implement the central-side deployment orchestration pipeline: accept deployment request, call Template Engine for validated+flattened config, send to site via Communication Layer, track status.
Acceptance Criteria:
- Deployment request triggers validation -> flatten -> send -> track flow [CD-DM-1]
- Validation failures stop the pipeline before sending; errors returned to caller [CD-DM-1],[1.4-4]
- Pre-deployment validation invokes Template Engine for flattening, naming collision detection, script compilation, trigger references, connection binding [1.4-4]
- Validation does not verify that data source relative paths resolve to real tags on physical devices (runtime concern) [1.4-4]
- Successful deployment sends flattened config to site via Communication Layer [1.4-1]
- Site applies immediately upon receipt — no operator confirmation [1.4-1]
- Site reports success/failure back to central [1.4-3]
- Deployment status updated in config DB (pending -> in-progress -> success/failed) [CD-DM-9]
- Deployment scope is individual instance level [CD-DM-10],[3.9-3]
- Template changes not auto-propagated — explicit deploy required [3.9-1]
- No rollback support — only current deployed state tracked [3.9-5]
- Uses 120-second deployment timeout [CD-COM-4]
- If site unreachable, deployment fails immediately (no central buffering) [CD-COM-1]
Estimated Complexity: L
Requirements Traced: [1.4-1], [1.4-3], [1.4-4], [3.9-1], [3.9-3], [3.9-5], [CD-DM-1], [CD-DM-9], [CD-DM-10], [CD-COM-1], [CD-COM-4]
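The WP-1 pipeline can be sketched as a single orchestration function. This is an illustrative Python sketch under assumed interfaces, not the production Akka.NET code: `template_engine`, `site_channel`, and `status_store` are hypothetical stand-ins for the Template Engine (Phase 2), the Communication Layer (Phase 3B), and the deployment status repository.

```python
class ValidationError(Exception):
    """Raised when pre-deployment validation fails; nothing is sent [CD-DM-1]."""

def deploy_instance(instance_id, template_engine, site_channel, status_store):
    # 1. Validate and flatten via the Template Engine. Validation failures
    #    stop the pipeline before anything is sent to the site [CD-DM-1].
    errors = template_engine.validate(instance_id)
    if errors:
        raise ValidationError(errors)
    package = template_engine.flatten(instance_id)   # carries the revision hash

    # 2. Track status transitions: pending -> in-progress -> success/failed [CD-DM-9].
    status_store.set(instance_id, "pending")
    status_store.set(instance_id, "in-progress")
    try:
        # 3. Send to the site. No buffering at central: an unreachable site
        #    fails immediately [CD-COM-1]; 120 s deployment timeout [CD-COM-4].
        result = site_channel.send(package, timeout_s=120)
    except TimeoutError:
        status_store.set(instance_id, "failed")
        raise
    status_store.set(instance_id, "success" if result.applied else "failed")
    return result
```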
WP-2: Deployment Identity & Idempotency
Description: Implement deployment ID generation, revision hash propagation, and idempotent site-side apply.
Acceptance Criteria:
- Every deployment assigned a unique deployment ID [KDD-deploy-6]
- Deployment includes flattened config's revision hash (from Template Engine) [KDD-deploy-6]
- Site-side apply is idempotent on deployment ID — duplicate deployment returns "already applied" [CD-DM-2]
- Sites reject stale configurations — older revision hash than currently applied is rejected, site reports current version [CD-DM-3]
- After central failover or timeout, Deployment Manager queries site for current deployment state before allowing re-deploy [CD-DM-4]
- Deployment messages include deployment ID + revision hash as correlation [CD-COM-7]
Estimated Complexity: M
Requirements Traced: [KDD-deploy-6], [CD-DM-2], [CD-DM-3], [CD-DM-4], [CD-COM-7]
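The [CD-DM-2]/[CD-DM-3] checks reduce to a small site-side decision function. A sketch only, with one stated assumption: raw hashes are not orderable, so it assumes the deployment carries an orderable revision (e.g. a monotonic counter alongside the revision hash); the real contract is the Phase 2 deployment package.

```python
def decide_apply(incoming_id, incoming_rev, applied_id, applied_rev):
    """Return 'already-applied', 'stale-rejected', or 'apply'.

    applied_id/applied_rev are None when nothing has been deployed yet.
    """
    if incoming_id == applied_id:
        return "already-applied"    # duplicate deployment ID [CD-DM-2]
    if applied_rev is not None and incoming_rev < applied_rev:
        return "stale-rejected"     # older than currently applied [CD-DM-3]
    return "apply"
```

A "stale-rejected" response would also carry the site's current revision so central can reconcile after failover [CD-DM-4].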
WP-3: Per-Instance Operation Lock
Description: Implement concurrency control ensuring only one mutating operation per instance can be in-flight at a time.
Acceptance Criteria:
- Only one mutating operation (deploy, disable, enable, delete) per instance in-flight at a time [KDD-deploy-7],[CD-DM-5]
- Second operation on same instance rejected with "operation in progress" error [CD-DM-5]
- Different instances can proceed in parallel, even at the same site [CD-DM-6]
- Lock released when operation completes (success or failure) or times out
- Lock state does not survive central failover (in-progress operations treated as failed per [CD-DM-4])
Estimated Complexity: M
Requirements Traced: [KDD-deploy-7], [CD-DM-5], [CD-DM-6]
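The lock semantics above can be sketched with an in-memory map, which also mirrors the rule that lock state does not survive central failover. Class and method names are hypothetical; the real implementation would live in the Deployment Manager singleton.

```python
class OperationInProgress(Exception):
    """Rejected because another mutating operation is in flight [CD-DM-5]."""

class InstanceLocks:
    def __init__(self):
        self._in_flight = {}    # instance_id -> operation name; in-memory only

    def acquire(self, instance_id, operation):
        # Only one mutating operation per instance [KDD-deploy-7], [CD-DM-5];
        # other instances are unaffected [CD-DM-6].
        if instance_id in self._in_flight:
            raise OperationInProgress(
                f"{self._in_flight[instance_id]} in progress on {instance_id}")
        self._in_flight[instance_id] = operation

    def release(self, instance_id):
        # Released on success, failure, or timeout alike.
        self._in_flight.pop(instance_id, None)
```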
WP-4: State Transition Matrix & Deployment Status
Description: Implement the allowed state transitions for instance operations and deployment status persistence with optimistic concurrency.
Acceptance Criteria:
- State transition matrix enforced: [CD-DM-7]
  - Enabled: allows deploy, disable, delete. Rejects enable (already enabled).
  - Disabled: allows deploy (enables on apply), enable, delete. Rejects disable (already disabled).
  - Not-deployed: allows deploy only. Rejects disable, enable, delete.
- Invalid state transitions return clear error messages
- Only current deployment status per instance stored (pending, in-progress, success, failed) [CD-DM-9]
- No deployment history table — audit log captures history via IAuditService [CD-DM-9]
- Optimistic concurrency on deployment status records [KDD-deploy-11]
- All deployment actions logged via IAuditService (who, what, when, result)
Estimated Complexity: M
Requirements Traced: [CD-DM-7], [CD-DM-9], [KDD-deploy-11], [3.8.1-1], [3.8.1-2]
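The [CD-DM-7] matrix is small enough to express as data, which keeps the check trivially testable. A sketch; state and operation names follow the acceptance criteria above, not a confirmed enum.

```python
# Allowed operations per instance state [CD-DM-7]. "deploy" on a disabled
# instance re-enables it on apply.
ALLOWED = {
    "enabled":      {"deploy", "disable", "delete"},
    "disabled":     {"deploy", "enable", "delete"},
    "not-deployed": {"deploy"},
}

def check_transition(state, operation):
    """Raise with a clear message when the operation is invalid for the state."""
    if operation not in ALLOWED[state]:
        raise ValueError(f"'{operation}' is not allowed while instance is {state}")
```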
WP-5: Site-Side Apply Atomicity
Description: Implement all-or-nothing deployment application at the site.
Acceptance Criteria:
- Site stores new config, compiles all scripts, creates/updates Instance Actor as single operation [KDD-deploy-8],[CD-DM-14]
- If any step fails (e.g., script compilation), entire deployment for that instance rejected [CD-DM-14],[CD-SR-6]
- Previous configuration remains active and unchanged on failure [CD-DM-14]
- Site reports specific failure reason (e.g., compilation error details) back to central [CD-SR-6]
- For redeployments: existing Instance Actor and children stopped, then new Instance Actor created with updated config [CD-SR-2]
- Subscriptions re-established after redeployment [CD-SR-2]
- Site continues operating with last deployed config if connectivity to central lost [1.4-2]
- Deployment handling follows: receive -> store SQLite -> compile -> create/update actor -> report [CD-SR-1]
Estimated Complexity: L
Requirements Traced: [KDD-deploy-8], [CD-DM-14], [CD-SR-1], [CD-SR-2], [CD-SR-6], [1.4-2]
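The all-or-nothing rule is easiest to see as "validate everything before swapping anything". A minimal sketch, assuming a `compile_script` callable that returns `(ok, error)`; the config shape and names are illustrative, not the site runtime's actual types.

```python
def apply_deployment(current_config, new_config, compile_script):
    """All-or-nothing apply [KDD-deploy-8], [CD-DM-14].

    Returns (active_config, (status, error)). On any compilation failure the
    previous config stays active and the specific reason is reported [CD-SR-6].
    """
    for script in new_config["scripts"]:
        ok, error = compile_script(script)
        if not ok:
            # Reject the entire deployment; no partial state applied.
            return current_config, ("failed", error)
    # Every step succeeded: the new config becomes the active one atomically.
    return new_config, ("success", None)
```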
WP-6: Instance Lifecycle Commands
Description: Implement disable, enable, and delete commands sent from central to site.
Acceptance Criteria:
- Disable: site stops script triggers, data subscriptions, and alarm evaluation [3.8.1-3],[CD-DM-16]
- Disable retains deployed configuration for re-enablement without redeployment [3.8.1-3],[CD-DM-16],[CD-SR-3]
- Disable: S&F messages for disabled instance continue to drain [3.8.1-4]
- Enable: re-activates a disabled instance by creating a new Instance Actor from stored config, restoring data subscriptions, script triggers, and alarm evaluation [CD-DM-17],[CD-SR-4]
- Disable and enable commands fail immediately if the site is unreachable (no buffering, consistent with deployment behavior) [CD-COM-2]
- Delete: removes running config from site, stops subscriptions, destroys Instance Actor and children [3.8.1-5],[CD-DM-18],[CD-SR-5]
- Delete: S&F messages are not cleared [3.8.1-6],[CD-DM-18],[CD-SR-5],[KDD-sf-3]
- Delete fails if site unreachable — central does not mark deleted until site confirms [3.8.1-7],[CD-DM-18]
- Templates cannot be deleted if instances or child templates reference them [3.8.1-8]
- Lifecycle commands use request/response pattern with 30s timeout [CD-COM-2],[CD-COM-5]
- Lifecycle commands include command ID for deduplication (duplicate commands recognized and not re-applied) [CD-COM-7]
Estimated Complexity: L
Requirements Traced: [3.8.1-1] through [3.8.1-8], [KDD-sf-3], [CD-DM-16], [CD-DM-17], [CD-DM-18], [CD-SR-3], [CD-SR-4], [CD-SR-5], [CD-COM-2], [CD-COM-5], [CD-COM-7]
WP-7: System-Wide Artifact Deployment
Description: Implement deployment of shared scripts, external system definitions, database connection definitions, and notification lists to all sites.
Acceptance Criteria:
- Changes not automatically propagated to sites [1.5-1]
- Deployment requires explicit action by a user with Deployment role [1.5-2]
- Design role manages definitions; Deployment role triggers deployment [1.5-3]
- Broadcast pattern with per-site acknowledgment [CD-COM-3]
- Per-site result matrix — each site reports independently [CD-DM-8]
- Successful sites not rolled back if other sites fail [CD-DM-8]
- Failed sites can be retried individually [CD-DM-8]
- 120-second timeout per site [CD-COM-6]
- Cross-site version skew supported — sites can run different artifact versions [KDD-deploy-9],[CD-DM-15]
- Artifacts are self-contained and site-independent [CD-DM-15]
- System-wide artifact deployment is a separate action from instance deployment [CD-DM-13]
- Shared scripts undergo pre-compilation validation (syntax/structural correctness) before deployment to sites
- All artifact deployment actions logged via IAuditService
Estimated Complexity: L
Requirements Traced: [1.5-1], [1.5-2], [1.5-3], [KDD-deploy-9], [CD-DM-8], [CD-DM-13], [CD-DM-15], [CD-COM-3], [CD-COM-6]
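The per-site result matrix and individual retry semantics can be sketched as two small functions. `send_to_site` is a hypothetical stand-in for the Communication Layer broadcast; the real pattern is broadcast with per-site acknowledgment [CD-COM-3].

```python
def deploy_artifact(artifact, sites, send_to_site):
    """Send to every site independently; one failure never rolls back
    the others [CD-DM-8]. Returns the per-site result matrix."""
    results = {}
    for site in sites:
        try:
            send_to_site(site, artifact)      # 120 s timeout per site [CD-COM-6]
            results[site] = "success"
        except Exception as exc:              # noqa: BLE001 -- sketch only
            results[site] = f"failed: {exc}"  # remaining sites still proceed
    return results

def retry_failed(artifact, results, send_to_site):
    """Retry only the sites that failed, individually [CD-DM-8]."""
    failed = [site for site, r in results.items() if r != "success"]
    results.update(deploy_artifact(artifact, failed, send_to_site))
    return results
```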
WP-8: Deployed vs. Template-Derived State Management
Description: Implement storage and retrieval of deployed configuration snapshots, enabling comparison with template-derived configs.
Acceptance Criteria:
- System maintains two views per instance: deployed configuration and template-derived configuration [3.9-2],[CD-DM-12]
- Deployed configuration updated on successful deployment [CD-DM-12]
- Template-derived configuration computed on demand from current template state (uses Phase 2 flattening)
- Diff can be computed between deployed and template-derived (uses Phase 2 diff calculation) [CD-DM-11]
- Diff shows added/removed/changed members and connection binding changes [CD-DM-11]
- Staleness detectable via revision hash comparison [3.9-4]
Estimated Complexity: M
Requirements Traced: [3.9-2], [3.9-4], [CD-DM-11], [CD-DM-12]
WP-9: S&F SQLite Persistence & Message Format
Description: Implement the SQLite schema and data access layer for store-and-forward message buffering at site nodes.
Acceptance Criteria:
- Buffered messages persisted to local SQLite on each site node [1.3-3]
- Message format stores: message ID, category, target, payload, retry count, created at, last attempt at, status (pending/retrying/parked) [CD-SF-11]
- Three message categories supported: external system calls, email notifications, cached database writes [CD-SF-1]
- No maximum buffer size — messages accumulate until delivery or parking [1.3-6],[CD-SF-4]
- Central does not buffer messages (S&F is site-only) [1.3-1]
- All S&F timestamps are UTC
Estimated Complexity: M
Requirements Traced: [1.3-1], [1.3-3], [1.3-6], [CD-SF-1], [CD-SF-4], [CD-SF-11]
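One way the WP-9 schema could look, covering exactly the [CD-SF-11] fields. Column names and the index are illustrative; the final schema is owned by this work package.

```python
import sqlite3

# Hypothetical S&F buffer schema; one row per buffered message [CD-SF-11].
SCHEMA = """
CREATE TABLE IF NOT EXISTS sf_message (
    message_id      TEXT PRIMARY KEY,
    category        TEXT NOT NULL CHECK (category IN
                      ('external_call', 'notification', 'cached_db_write')),
    target          TEXT NOT NULL,          -- external system / SMTP / DB conn
    payload         BLOB NOT NULL,
    retry_count     INTEGER NOT NULL DEFAULT 0,
    created_at      TEXT NOT NULL,          -- UTC, ISO-8601
    last_attempt_at TEXT,                   -- UTC, NULL until first attempt
    status          TEXT NOT NULL CHECK (status IN
                      ('pending', 'retrying', 'parked'))
);
CREATE INDEX IF NOT EXISTS ix_sf_status ON sf_message (status, category);
"""

def open_buffer(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

No size limit appears anywhere in the schema [CD-SF-4]: the buffer is bounded only by disk.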
WP-10: S&F Retry Engine
Description: Implement the fixed-interval retry loop with per-source-entity retry settings and transient-only buffering.
Acceptance Criteria:
- Message lifecycle: attempt immediate delivery -> failure buffers -> retry loop -> success removes; max retries -> park [CD-SF-12]
- Retry is per-message — individual messages retry independently [5.3-2]
- Fixed retry interval (not exponential backoff) [1.3-7],[KDD-sf-1]
- Retry settings defined on the source entity (external system def, SMTP config, DB connection def), not per-message [CD-SF-2]
- External system definitions include max retry count and time between retries [5.3-3]
- Notification config includes max retry count and time between retries [6.4-2]
- After max retries exhausted, message is parked (dead-lettered) [5.3-4],[6.4-3]
- Only transient failures eligible for buffering. Permanent failures returned to caller, not queued [KDD-sf-1],[CD-SF-3]
- No maximum buffer size [5.3-5],[6.4-4],[KDD-sf-1]
- Messages for external calls buffered locally when system unavailable [5.3-1]
- Notifications buffered when email server unavailable [6.4-1]
- Successfully delivered messages removed from local store [1.3-5]
Estimated Complexity: L
Requirements Traced: [1.3-5], [1.3-7], [5.3-1] through [5.3-5], [6.4-1] through [6.4-4], [KDD-sf-1], [CD-SF-2], [CD-SF-3], [CD-SF-12]
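The [CD-SF-12] lifecycle can be sketched as a submit path plus one fixed-interval retry pass. Assumptions: `deliver` returns `(ok, transient)`, so permanent failures are never queued [CD-SF-3]; the transient/permanent classification itself lands in Phase 7, and the scheduler that invokes `retry_tick` at the fixed interval is omitted.

```python
def submit(msg, deliver, buffer):
    """Attempt immediate delivery; buffer only transient failures [CD-SF-12]."""
    ok, transient = deliver(msg)
    if ok:
        return "delivered"            # success removes: never buffered
    if not transient:
        return "permanent-failure"    # returned to the caller, not queued
    buffer.append(msg)
    return "buffered"

def retry_tick(buffer, deliver, max_retries):
    """One fixed-interval retry pass; retries are per-message [5.3-2]."""
    still_pending, parked = [], []
    for msg in buffer:
        ok, _ = deliver(msg)
        if ok:
            continue                  # success removes the message [1.3-5]
        msg["retry_count"] += 1
        if msg["retry_count"] >= max_retries:
            parked.append(msg)        # dead-lettered for manual review [5.3-4]
        else:
            still_pending.append(msg)
    return still_pending, parked
```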
WP-11: S&F Async Replication to Standby
Description: Implement application-level replication of buffer operations from active to standby node.
Acceptance Criteria:
- All S&F buffers replicated between two site cluster nodes via application-level replication over Akka.NET remoting [1.3-2]
- Active node forwards each buffer operation (add, remove, park) to standby asynchronously [CD-SF-5],[KDD-sf-2]
- Active node does not wait for standby acknowledgment (no ack wait) [KDD-sf-2],[CD-SF-5]
- Standby applies operations to its own local SQLite [CD-SF-6]
- On failover, standby takes over delivery from its replicated copy [1.3-4]. Note: per [CD-SF-7], the async replication design means the copy is near-complete — rare duplicate deliveries or missed retries are acceptable trade-offs for the latency benefit.
- Duplicate deliveries and missed retries accepted as trade-offs for async replication [CD-SF-7]
- Successfully delivered messages removed from both nodes' stores [1.3-5]
Estimated Complexity: L
Requirements Traced: [1.3-2], [1.3-4], [1.3-5], [KDD-sf-2], [CD-SF-5], [CD-SF-6], [CD-SF-7]
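The fire-and-forget pattern above can be sketched in a few lines: apply locally first, then forward with no ack wait, so a lost forward yields only the accepted [CD-SF-7] anomalies and never blocks delivery. The dict-based store and operation tuples are illustrative; the real stores are the per-node SQLite databases.

```python
def apply_and_replicate(op, local_store, forward_to_standby):
    """op = (kind, message_id[, payload]); kind in {'add', 'remove', 'park'}.

    Applies the buffer operation locally, then forwards it to the standby
    asynchronously with no acknowledgment wait [KDD-sf-2], [CD-SF-5].
    """
    kind, msg_id = op[0], op[1]
    if kind == "add":
        local_store[msg_id] = {"payload": op[2], "status": "pending"}
    elif kind == "remove":
        local_store.pop(msg_id, None)
    elif kind == "park":
        local_store[msg_id]["status"] = "parked"
    try:
        forward_to_standby(op)   # best effort; standby applies to its own store
    except Exception:            # noqa: BLE001 -- sketch only
        pass                     # a lost forward is an accepted [CD-SF-7] anomaly
```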
WP-12: Parked Message Management
Description: Implement site-side parked message storage, query handling, and retry/discard commands accessible from central.
Acceptance Criteria:
- Parked messages stored at the site in SQLite [5.4-1],[CD-SF-8]
- Central can query sites for parked messages via Communication Layer [5.4-2],[CD-SF-8]
- Operators can retry a parked message (moves back to retry queue) [5.4-3],[CD-SF-9]
- Operators can discard a parked message (removes permanently) [5.4-3],[CD-SF-9]
- Management covers all three categories: external system calls, notifications, cached database writes [5.4-4]
- Remote query uses request/response pattern with query ID, 30s timeout [CD-COM-8]
- Messages not automatically cleared when instance deleted [CD-SF-10],[KDD-sf-3],[3.8.1-6]
- Pending and parked messages continue to exist after instance deletion [CD-SF-10]
Estimated Complexity: M
Requirements Traced: [5.4-1] through [5.4-4], [KDD-sf-3], [CD-SF-8], [CD-SF-9], [CD-SF-10], [CD-COM-8], [3.8.1-6]
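The three site-side handlers (query, retry, discard) are small; a sketch over the same illustrative dict-shaped store used elsewhere in this plan, not the real SQLite access layer. Whether a retried message keeps or resets its retry count is an open design choice; the sketch resets it.

```python
def query_parked(store, category=None):
    """Answer a central remote query for parked messages [CD-SF-8]."""
    return [m for m in store.values()
            if m["status"] == "parked"
            and (category is None or m["category"] == category)]

def retry_parked(store, msg_id):
    """Move a parked message back to the retry queue [CD-SF-9]."""
    msg = store[msg_id]
    msg["status"] = "pending"
    msg["retry_count"] = 0      # assumption: retry starts a fresh count
    return msg

def discard_parked(store, msg_id):
    """Remove a parked message permanently [CD-SF-9]."""
    del store[msg_id]
```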
WP-13: S&F Messages Survive Instance Deletion
Description: Ensure store-and-forward messages are preserved when an instance is deleted.
Acceptance Criteria:
- S&F messages not cleared on instance deletion [3.8.1-6],[KDD-sf-3],[CD-SF-10]
- Pending messages continue retry delivery after instance deletion
- Parked messages remain queryable and manageable from central after instance deletion
- S&F messages for disabled instances continue to drain [3.8.1-4]
Estimated Complexity: S
Requirements Traced: [3.8.1-4], [3.8.1-6], [KDD-sf-3], [CD-SF-10]
WP-14: S&F Health Metrics & Event Logging Integration
Description: Integrate S&F buffer depth as a health metric and log S&F activity to site event log.
Acceptance Criteria:
- S&F buffer depth reported as health metric (broken down by category) — integrates with Phase 3B Health Monitoring
- S&F activity logged to site event log: message queued, delivered, retried, parked (per Component-StoreAndForward.md Dependencies)
- S&F buffer depth visible in health reports sent to central
Estimated Complexity: S
Requirements Traced: [CD-SF-1] (categories), Component-StoreAndForward.md Dependencies (Site Event Logging, Health Monitoring)
WP-15: CachedCall Idempotency Documentation
Description: Document that CachedCall idempotency is the caller's responsibility.
Acceptance Criteria:
- Script API documentation clearly states that ExternalSystem.CachedCall() idempotency is the caller's responsibility [KDD-sf-4]
- S&F engine makes no idempotency guarantees — duplicate delivery possible (especially on failover) [CD-SF-7]
Estimated Complexity: S
Requirements Traced: [KDD-sf-4], [CD-SF-7]
WP-16: Deployment Manager — Concurrent Template Editing Semantics
Description: Ensure last-write-wins semantics for template editing do not conflict with deployment pipeline.
Acceptance Criteria:
- Last-write-wins for concurrent template editing — no pessimistic locking or optimistic concurrency on templates [3.9-6]
- Deployment uses optimistic concurrency on deployment status records only [KDD-deploy-11]
- Template state at time of deployment is captured in the flattened config and revision hash
Estimated Complexity: S
Requirements Traced: [3.9-6], [KDD-deploy-11]
Test Strategy
Unit Tests
| Area | Tests |
|---|---|
| Deployment flow | Validate -> flatten -> send pipeline; validation failure stops pipeline |
| Deployment identity | Deployment ID generation uniqueness; revision hash propagation |
| Operation lock | Concurrent requests on same instance rejected; different instances proceed in parallel; lock released on completion/timeout |
| State transitions | All valid transitions succeed; all invalid transitions rejected with correct error messages |
| Deployment status | CRUD with optimistic concurrency; concurrent updates handled correctly |
| S&F message format | Serialization/deserialization of all three categories; all fields stored correctly |
| S&F retry logic | Fixed interval timing; per-source-entity settings respected; max retries triggers parking; transient-only filter |
| Parked message ops | Retry moves to queue; discard removes; query returns correct results |
| Template deletion constraint | Templates with instance references cannot be deleted; templates with child template references cannot be deleted |
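The S&F retry rules exercised by the unit tests above (transient-only buffering, fixed interval, per-source-entity settings, parking after max retries) can be sketched as follows. This is an illustrative Python model, not the site runtime implementation; field and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class BufferedMessage:
    payload: str
    max_retries: int          # per-source-entity setting
    retry_interval_s: float   # fixed interval — no backoff
    attempts: int = 0
    last_attempt_at: float = 0.0
    state: str = "queued"     # queued | delivered | parked

def submit(msg: BufferedMessage, transient: bool) -> BufferedMessage:
    # Only transient failures are buffered; permanent failures are
    # surfaced to the caller immediately ([CD-SF-3]).
    if not transient:
        raise RuntimeError("permanent failure — returned to caller, not buffered")
    return msg

def tick(msg: BufferedMessage, now: float, deliver) -> None:
    if msg.state != "queued":
        return
    if msg.attempts > 0 and now - msg.last_attempt_at < msg.retry_interval_s:
        return  # fixed interval has not elapsed yet
    msg.attempts += 1
    msg.last_attempt_at = now
    if deliver(msg.payload):
        msg.state = "delivered"
    elif msg.attempts >= msg.max_retries:
        msg.state = "parked"  # exhausted — awaits operator retry/discard
```

A message whose delivery keeps failing transitions `queued -> parked` after exactly `max_retries` attempts spaced at the fixed interval, which is the behavior the retry-logic unit tests pin down.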
Integration Tests
| Area | Tests |
|---|---|
| End-to-end deploy | Central sends deployment -> site compiles -> actors created -> success reported -> status updated |
| Deploy with validation failure | Template with compilation error -> deployment blocked before send |
| Idempotent deploy | Same deployment ID sent twice -> second returns "already applied" |
| Stale config rejection | Older revision hash sent -> site rejects with current version |
| Lifecycle commands | Disable -> verify subscriptions stopped and config retained; Enable -> verify instance re-activates; Delete -> verify actors destroyed and config removed |
| S&F buffer and retry | Submit message -> delivery fails -> buffered -> retry succeeds -> message removed |
| S&F parking | Submit message -> delivery fails -> max retries -> message parked |
| S&F replication | Buffer message on active -> verify replicated to standby SQLite |
| Parked message remote query | Central queries site for parked messages -> correct results returned |
| Parked message retry/discard | Central retries parked message -> moves to queue; Central discards -> removed |
| System-wide artifact deploy | Deploy shared scripts to multiple sites -> per-site status tracked |
| S&F survives deletion | Delete instance -> verify S&F messages still exist and deliver |
| S&F drains on disable | Disable instance -> verify pending S&F messages continue delivery |
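The idempotent-deploy and stale-config rows above hinge on two site-side checks, sketched here in Python. One assumption is made loudly: the sketch pairs the revision hash with a monotonic revision sequence so "older" is decidable — the plan only requires that a stale revision be detected and rejected, not this particular mechanism.

```python
class SiteDeploymentHandler:
    """Sketch of the site-side acceptance checks. The monotonic
    revision_seq alongside the hash is an illustrative assumption."""
    def __init__(self):
        self._applied_ids: set[str] = set()
        self.revision_seq = 0
        self.revision_hash = "(none)"

    def apply(self, deployment_id: str, revision_seq: int,
              revision_hash: str, compile_ok: bool = True) -> str:
        if deployment_id in self._applied_ids:
            return "already applied"               # idempotent re-send
        if revision_seq < self.revision_seq:
            return f"rejected: stale (current {self.revision_hash})"
        if not compile_ok:
            return "rejected: compilation failed"  # previous config untouched
        self.revision_seq = revision_seq
        self.revision_hash = revision_hash
        self._applied_ids.add(deployment_id)
        return "applied"
```

Note the ordering: the duplicate-ID check comes first, so a re-sent deployment short-circuits to "already applied" before any staleness or compilation work, and a compilation failure leaves the previously applied revision in place (site-side atomicity).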
Negative Tests
| Requirement | Test |
|---|---|
| [1.3-1] Central does not buffer | Verify no S&F infrastructure exists on central; central deployment to unreachable site fails immediately |
| [1.3-6] No max buffer | Submit messages continuously -> verify no rejection based on count |
| [3.8.1-7] Delete fails if unreachable | Attempt delete when site offline -> verify failure; verify central does not mark as deleted |
| [3.8.1-8] Template deletion constraint | Attempt to delete template with active instances -> verify rejection |
| [3.9-1] No auto-propagation | Change template -> verify deployed instance unaffected |
| [3.9-5] No rollback | Verify no rollback mechanism exists; only current deployed state tracked |
| [CD-DM-5] Operation lock rejects | Send two concurrent deploys for same instance -> verify second rejected |
| [CD-DM-7] Invalid transitions | Attempt enable on already-enabled instance -> verify rejection; attempt disable on not-deployed -> verify rejection |
| [CD-SF-3] Permanent failures not buffered | Submit message with permanent failure classification -> verify not buffered, error returned to caller |
| [KDD-sf-3] Messages survive deletion | Delete instance -> verify S&F messages not cleared |
Failover & Resilience Tests
| Scenario | Test |
|---|---|
| Mid-deploy central failover | Deploy in progress -> kill central active -> verify deployment treated as failed -> re-query site state -> re-deploy succeeds |
| Mid-deploy site failover | Deploy in progress -> kill site active -> verify deployment times out or fails -> re-deploy to new active succeeds |
| Timeout + reconciliation | Deploy sent -> site applies but response lost -> central times out -> central queries site state -> finds "already applied" -> updates status |
| S&F buffer takeover | Buffer messages on active -> kill active -> standby takes over -> verify messages delivered from replicated copy |
| S&F replication gap | Buffer message -> immediately kill active (before replication) -> verify standby handles gap gracefully (missed message, no crash) |
| Site offline then online | Deploy to offline site -> fails -> site comes online -> re-deploy succeeds |
| System-wide artifact partial failure | Deploy artifacts to 3 sites, 1 offline -> verify 2 succeed -> retry failed site when online |
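The two S&F failover rows above (buffer takeover, replication gap) reduce to one property: the standby rebuilds its queues purely from the replicated SQLite copy and tolerates rows that never arrived. A minimal sketch of that property, assuming async replication lags by a whole number of rows (the functions and row shape are illustrative only):

```python
def replicate(active_rows: list[dict], lag: int) -> list[dict]:
    # Async replication: the last `lag` rows buffered on the active
    # node have not reached the standby's SQLite copy yet.
    return active_rows[:len(active_rows) - lag]

def take_over(replicated_rows: list[dict]) -> tuple[list[dict], list[dict]]:
    # The standby rebuilds in-memory queues from whatever replicated;
    # missing tail rows are an accepted gap ([CD-SF-7]) — no crash,
    # delivery resumes for everything that is present.
    queued = [r for r in replicated_rows if r["state"] == "queued"]
    parked = [r for r in replicated_rows if r["state"] == "parked"]
    return queued, parked
```

The "replication gap" test in the table is then: buffer N messages, kill the active with `lag > 0`, and verify takeover delivers the N − lag replicated messages without error.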
Verification Gate
Phase 3C is complete when all of the following pass:
- Deployment pipeline end-to-end: Central validates, flattens, sends, site compiles, creates actors, reports success. Status tracked in config DB.
- Idempotency: Duplicate deployment ID returns "already applied." Stale revision hash rejected.
- Operation lock: Concurrent operations on same instance rejected; parallel operations on different instances succeed.
- State transitions: All valid transitions work; all invalid transitions rejected.
- Site-side atomicity: Compilation failure rejects entire deployment; previous config unchanged.
- Lifecycle commands: Disable/enable/delete work correctly with proper state effects.
- S&F buffering: Messages buffered on transient failure, retried at fixed interval, parked after max retries.
- S&F replication: Buffer operations replicated to standby; failover resumes delivery.
- Parked message management: Central can query, retry, and discard parked messages at sites.
- S&F survival: Messages persist through instance deletion and continue delivery.
- System-wide artifacts: Deployed to all sites with per-site status; version skew tolerated.
- Resilience: Mid-deploy failover, timeout+reconciliation, and S&F takeover tests pass.
- Audit logging: All deployment and lifecycle actions recorded via IAuditService.
- All unit, integration, negative, and failover tests pass.
Open Questions
| # | Question | Context | Impact | Status |
|---|---|---|---|---|
| Q-P3C-1 | Should S&F retry timers be reset on failover or continue from the last known retry timestamp? | On failover, the new active node loads buffer from SQLite. Messages have last_attempt_at timestamps. Should retry timing continue relative to last_attempt_at or reset to "now"? | Affects retry behavior immediately after failover. Recommend: continue from last_attempt_at to avoid burst retries. | Open |
| Q-P3C-2 | What is the maximum number of parked messages returned in a single remote query? | Communication Layer pattern 8 uses 30s timeout. Very large parked message sets may need pagination. | Recommend: paginated query (e.g., 100 per page) consistent with Site Event Logging pagination pattern. | Open |
| Q-P3C-3 | Should the per-instance operation lock be in-memory (lost on central failover) or persisted? | In-memory is simpler and consistent with "in-progress deployments treated as failed on failover." Persisted lock could cause orphan locks. | Recommend: in-memory. On failover, all locks released. Site state query resolves any ambiguity. | Open |
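If the Q-P3C-1 recommendation is adopted, the takeover path becomes a one-line scheduling rule: continue each message's fixed-interval schedule from the persisted `last_attempt_at` rather than restarting every timer at takeover time. A sketch (row shape and function name are assumptions):

```python
def next_attempt_times(rows: list[dict], interval_s: float) -> list[float]:
    # Continue the fixed-interval schedule from each message's persisted
    # last_attempt_at instead of restarting all timers at takeover time.
    return [r["last_attempt_at"] + interval_s for r in rows]
```

Messages whose slot has already passed become due immediately; the rest keep their original cadence, so staggered schedules stay staggered across the failover.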
Orphan Check Result
Forward Check (Requirements -> Work Packages)
Every item in the Requirements Checklist and Design Constraints Checklist was walked. Results:
| Checklist Item | Mapped To | Verified |
|---|---|---|
| [1.3-1] through [1.3-7] | WP-9, WP-10, WP-11 | Yes |
| [1.4-1] through [1.4-4] | WP-1, WP-5 | Yes |
| [1.5-1] through [1.5-3] | WP-7 | Yes |
| [3.8.1-1] through [3.8.1-8] | WP-4, WP-6, WP-12, WP-13 | Yes |
| [3.9-1], [3.9-2], [3.9-3], [3.9-5], [3.9-6] | WP-1, WP-8, WP-16 | Yes |
| [3.9-4] | WP-8 (staleness detection); diff UI deferred to Phase 6 | Yes |
| [5.3-1] through [5.3-5] | WP-10 | Yes |
| [5.4-1] through [5.4-4] | WP-12 | Yes |
| [6.4-1] through [6.4-4] | WP-10 | Yes |
| [KDD-deploy-6] | WP-2 | Yes |
| [KDD-deploy-7] | WP-3 | Yes |
| [KDD-deploy-8] | WP-5 | Yes |
| [KDD-deploy-9] | WP-7 | Yes |
| [KDD-deploy-11] | WP-4, WP-16 | Yes |
| [KDD-sf-1] | WP-10 | Yes |
| [KDD-sf-2] | WP-11 | Yes |
| [KDD-sf-3] | WP-6, WP-12, WP-13 | Yes |
| [KDD-sf-4] | WP-15 | Yes |
| [CD-DM-1] through [CD-DM-18] | WP-1 through WP-8 | Yes |
| [CD-SF-1] through [CD-SF-12] | WP-9 through WP-14 | Yes |
| [CD-SR-1] through [CD-SR-6] | WP-5, WP-6 | Yes |
| [CD-COM-1] through [CD-COM-8] | WP-1, WP-2, WP-6, WP-7, WP-12 | Yes |
Forward check result: PASS — no orphan requirements.
Reverse Check (Work Packages -> Requirements)
Every work package traces to at least one requirement or design constraint:
| Work Package | Traces To |
|---|---|
| WP-1 | [1.4-1], [1.4-3], [1.4-4], [3.9-1], [3.9-3], [3.9-5], [CD-DM-1], [CD-DM-9], [CD-DM-10], [CD-COM-1], [CD-COM-4] |
| WP-2 | [KDD-deploy-6], [CD-DM-2], [CD-DM-3], [CD-DM-4], [CD-COM-7] |
| WP-3 | [KDD-deploy-7], [CD-DM-5], [CD-DM-6] |
| WP-4 | [CD-DM-7], [CD-DM-9], [KDD-deploy-11], [3.8.1-1], [3.8.1-2] |
| WP-5 | [KDD-deploy-8], [CD-DM-14], [CD-SR-1], [CD-SR-2], [CD-SR-6], [1.4-2] |
| WP-6 | [3.8.1-1] through [3.8.1-8], [KDD-sf-3], [CD-DM-16] through [CD-DM-18], [CD-SR-3] through [CD-SR-5], [CD-COM-2], [CD-COM-5], [CD-COM-7] |
| WP-7 | [1.5-1] through [1.5-3], [KDD-deploy-9], [CD-DM-8], [CD-DM-13], [CD-DM-15], [CD-COM-3], [CD-COM-6] |
| WP-8 | [3.9-2], [3.9-4], [CD-DM-11], [CD-DM-12] |
| WP-9 | [1.3-1], [1.3-3], [1.3-6], [CD-SF-1], [CD-SF-4], [CD-SF-11] |
| WP-10 | [1.3-5], [1.3-7], [5.3-1] through [5.3-5], [6.4-1] through [6.4-4], [KDD-sf-1], [CD-SF-2], [CD-SF-3], [CD-SF-12] |
| WP-11 | [1.3-2], [1.3-4], [1.3-5], [KDD-sf-2], [CD-SF-5], [CD-SF-6], [CD-SF-7] |
| WP-12 | [5.4-1] through [5.4-4], [KDD-sf-3], [CD-SF-8], [CD-SF-9], [CD-SF-10], [CD-COM-8], [3.8.1-6] |
| WP-13 | [3.8.1-4], [3.8.1-6], [KDD-sf-3], [CD-SF-10] |
| WP-14 | [CD-SF-1], Component-StoreAndForward.md Dependencies |
| WP-15 | [KDD-sf-4], [CD-SF-7] |
| WP-16 | [3.9-6], [KDD-deploy-11] |
Reverse check result: PASS — no untraceable work packages.
Split-Section Check
| Section | Phase 3C Covers | Other Phase Covers | Gap? |
|---|---|---|---|
| 1.4 | [1.4-1] through [1.4-4] (all bullets — backend pipeline) | Phase 6: deployment UI triggers and status display | No gap |
| 1.5 | [1.5-1] through [1.5-3] (all bullets — backend pipeline) | Phase 6: artifact deployment UI | No gap |
| 3.8.1 | [3.8.1-1] through [3.8.1-8] (all bullets — backend commands) | Phase 4: lifecycle command UI | No gap |
| 3.9 | [3.9-1], [3.9-2], [3.9-3], [3.9-5], [3.9-6] | Phase 6: [3.9-4] (diff view UI), deployment trigger UI | No gap |
| 5.3 | [5.3-1] through [5.3-5] (S&F engine) | Phase 7: External System Gateway delivery integration, error classification | No gap |
| 5.4 | [5.4-1] through [5.4-4] (backend query/command handling) | Phase 6: parked message management UI | No gap |
| 6.4 | [6.4-1] through [6.4-4] (S&F engine) | Phase 7: Notification Service delivery integration | No gap |
Split-section check result: PASS — no unowned bullets.
Negative Requirement Check
| Negative Requirement | Acceptance Criterion | Adequate? |
|---|---|---|
| [1.3-1] Central does not buffer | Test verifies no S&F infrastructure on central; unreachable site = immediate failure | Yes |
| [1.3-6] No maximum buffer size | Test submits messages continuously, verifies no count-based rejection | Yes |
| [3.8.1-6] S&F messages not cleared on deletion | Test deletes instance, verifies messages still exist and deliver | Yes |
| [3.8.1-7] Delete fails if unreachable | Test attempts delete to offline site, verifies failure and central status unchanged | Yes |
| [3.8.1-8] Templates cannot be deleted with references | Test attempts deletion of referenced template, verifies rejection | Yes |
| [3.9-1] Changes not auto-propagated | Test changes template, verifies deployed instance unchanged | Yes |
| [3.9-5] No rollback | Verifies no rollback mechanism; only current state tracked | Yes |
| [CD-SF-3] Permanent failures not buffered | Test submits permanent failure, verifies not queued | Yes |
Negative requirement check result: PASS — all prohibitions have verification criteria.
Codex MCP Verification
Model: gpt-5.4 Result: Pass with corrections
Step 1 — Requirements Coverage Review
Codex identified 10 findings. Disposition:
| # | Finding | Disposition |
|---|---|---|
| 1 | Naming collision detection and device tag resolution exclusion missing from WP-1 | Corrected — added naming collision detection to WP-1 acceptance criteria; added explicit exclusion of device tag resolution. |
| 2 | Shared script pre-compilation validation missing from WP-7 | Corrected — added shared script validation acceptance criterion to WP-7. |
| 3 | Role overlap (user may hold both Design+Deployment) not verified | Dismissed — this is a Phase 1 Security & Auth concern. Phase 3C assumes the auth model works correctly. Role overlap is tested in Phase 1 integration tests. |
| 4 | WP-4 traces [3.8.1-2] but doesn't verify runtime activation | Dismissed — WP-4 owns the state transition matrix. Runtime behavior of "enabled" (subscriptions, triggers, alarms running) is the responsibility of Phase 3B Site Runtime, which creates Instance Actors with full initialization. WP-6 verifies enable recreates the actor. |
| 5 | Enable flow underspecified (should verify actor recreation with subscriptions) | Corrected — expanded WP-6 enable criteria to explicitly verify actor creation, subscription restoration, script triggers, and alarm evaluation. |
| 6 | Command ID described as "correlation" but source says "deduplication" | Corrected — changed wording to "deduplication" with acceptance criterion that duplicate commands are recognized and not re-applied. |
| 7 | Disable/enable unreachable failure not explicitly covered | Corrected — added acceptance criterion that disable and enable fail immediately if site unreachable. |
| 8 | Diff "show" requirement only partially verified (compute, not expose) | Dismissed — Phase 3C provides the backend API for diff computation and staleness detection. The "show" (UI) aspect is explicitly deferred to Phase 6 per the split-section note. WP-8 correctly scopes to backend. |
| 9 | Parked message management UI not verified | Dismissed — same as #8. Phase 3C builds the site-side backend (query handler, retry/discard commands). Phase 6 builds the central UI. Split documented in plan. |
| 10 | "near-complete copy" weakens HighLevelReqs "seamless" wording | Corrected — updated WP-11 to reference [1.3-4] for the seamless takeover requirement, with a note that [CD-SF-7] acknowledges the async replication trade-off (rare duplicates/misses). The component design explicitly documents this as an acceptable trade-off; HighLevelReqs 1.3 bullet 4 does not preclude it since "seamlessly" refers to the takeover process, not data completeness. |
Step 2 — Negative Requirement Review
Not submitted separately; negative requirements were included in Step 1 review. All negative requirements have adequate acceptance criteria per the orphan check.
Step 3 — Split-Section Gap Review
Not submitted separately; split sections were documented in the plan and reviewed in Step 1. No gaps identified.