scadalink-design/docs/plans/phase-3c-deployment-store-forward.md
Joseph Doherty d91aa83665 refactor(docs): move requirements and test infra docs into docs/ subdirectories
Organize documentation by moving requirements (HighLevelReqs, Component-*,
lmxproxy_protocol) to docs/requirements/ and test infrastructure docs to
docs/test_infra/. Updates all cross-references in README, CLAUDE.md,
infra/README, component docs, and 23 plan files.
2026-03-21 01:11:35 -04:00


Phase 3C: Deployment Pipeline & Store-and-Forward

Date: 2026-03-16
Status: Draft
Prerequisites: Phase 2 (Template Engine, deployment package contract), Phase 3A (Cluster Infrastructure, Site Runtime skeleton, local SQLite persistence), Phase 3B (Communication Layer, Site Runtime full actor hierarchy, Health Monitoring)


Scope

Goal: Complete the deploy-to-site pipeline end-to-end with resilience.

Components:

  • Deployment Manager (full) — Central-side deployment orchestration, instance lifecycle, system-wide artifact deployment
  • Store-and-Forward Engine (full) — Site-side message buffering, retry, parking, replication, parked message management

Testable Outcome: Central validates, flattens, and deploys an instance to a site. Site compiles scripts, creates actors, reports success. Deployment ID ensures idempotency. Per-instance operation lock works. Instance lifecycle commands (disable, enable, delete) work. Store-and-forward buffers messages on transient failure, retries, parks. Async replication to standby. Parked messages queryable from central.


Prerequisites

| Prerequisite | Phase | What Must Be Complete |
| --- | --- | --- |
| Template Engine | 2 | Flattening, validation pipeline, revision hash generation, diff calculation, deployment package contract |
| Configuration Database | 1, 2 | Schema, repositories (IDeploymentManagerRepository), IAuditService, optimistic concurrency support |
| Cluster Infrastructure | 3A | Akka.NET cluster with SBR, failover, CoordinatedShutdown |
| Site Runtime | 3A, 3B | Deployment Manager singleton, Instance Actor hierarchy, script compilation, alarm actors, full actor lifecycle |
| Communication Layer | 3B | All 8 message patterns (deployment, lifecycle, artifact deployment, remote queries), correlation IDs, timeouts |
| Health Monitoring | 3B | Metric collection framework (S&F buffer depth will be added as a new metric) |
| Site Event Logging | 3B | Event recording to SQLite (S&F activity events will be added) |
| Security & Auth | 1 | Deployment role with optional site scoping |

Requirements Checklist

Each bullet is extracted from the referenced docs/requirements/HighLevelReqs.md sections. Where a source section is split across phases, the per-bullet phase notes and the split-section notes below each section identify which phase owns each item.

Section 1.3 — Store-and-Forward Persistence (Site Clusters Only)

  • [1.3-1] Store-and-forward applies only at site clusters — central does not buffer messages.
  • [1.3-2] All site-level S&F buffers (external system calls, notifications, cached database writes) are replicated between the two site cluster nodes using application-level replication over Akka.NET remoting.
  • [1.3-3] Active node persists buffered messages to a local SQLite database and forwards them to the standby node, which maintains its own local SQLite copy.
  • [1.3-4] On failover, the standby node already has a replicated copy and takes over delivery seamlessly.
  • [1.3-5] Successfully delivered messages are removed from both nodes' local stores.
  • [1.3-6] There is no maximum buffer size — messages accumulate until they either succeed or exhaust retries and are parked.
  • [1.3-7] Retry intervals are fixed (not exponential backoff).

Section 1.4 — Deployment Behavior

  • [1.4-1] When central deploys a new configuration to a site instance, the site applies it immediately upon receipt — no local operator confirmation required. (Phase 3C)
  • [1.4-2] If a site loses connectivity to central, it continues operating with its last received deployed configuration. (Phase 3C — verified via resilience tests)
  • [1.4-3] The site reports back to central whether deployment was successfully applied. (Phase 3C)
  • [1.4-4] Pre-deployment validation: before any deployment is sent to a site, the central cluster performs comprehensive validation including flattening, test-compiling scripts, verifying alarm trigger references, verifying script trigger references, and checking data connection binding completeness. (Phase 3C — orchestration; validation pipeline built in Phase 2)

Split-section note: Section 1.4 is fully covered by Phase 3C (backend pipeline). Phase 6 covers the UI for deployment workflows (diff view, deploy button, status tracking display).

Section 1.5 — System-Wide Artifact Deployment

  • [1.5-1] Changes to shared scripts, external system definitions, database connection definitions, and notification lists are not automatically propagated to sites.
  • [1.5-2] Deployment of system-wide artifacts requires explicit action by a user with the Deployment role.
  • [1.5-3] The Design role manages the definitions; the Deployment role triggers deployment to sites. A user may hold both roles.

Split-section note: Phase 3C covers the backend pipeline for artifact deployment. Phase 6 covers the UI for triggering and monitoring artifact deployment.

Section 3.8.1 — Instance Lifecycle (Phase 3C portion)

  • [3.8.1-1] Instances can be in one of two states: enabled or disabled.
  • [3.8.1-2] Enabled: instance is active — data subscriptions, script triggers, and alarm evaluation are all running.
  • [3.8.1-3] Disabled: site stops script triggers, data subscriptions (no live data collection), and alarm evaluation. Deployed configuration is retained so instance can be re-enabled without redeployment.
  • [3.8.1-4] Disabled: store-and-forward messages for a disabled instance continue to drain (deliver pending messages).
  • [3.8.1-5] Deletion removes the running configuration from the site, stops subscriptions, destroys the Instance Actor and its children.
  • [3.8.1-6] Store-and-forward messages are not cleared on deletion — they continue to be delivered or can be managed via parked message management.
  • [3.8.1-7] If the site is unreachable when a delete is triggered, the deletion fails. Central does not mark it as deleted until the site confirms.
  • [3.8.1-8] Templates cannot be deleted if any instances or child templates reference them.

Split-section note: Phase 3C covers the backend for lifecycle commands. Phase 4 covers the UI for disable/enable/delete actions.

Section 3.9 — Template Deployment & Change Propagation (Phase 3C portion)

  • [3.9-1] Template changes are not automatically propagated to deployed instances.
  • [3.9-2] The system maintains two views: deployed configuration (currently running) and template-derived configuration (what it would look like if deployed now).
  • [3.9-3] Deployment is performed at the individual instance level — an engineer explicitly commands the system to update a specific instance.
  • [3.9-4] The system must show differences between deployed and template-derived configuration.
  • [3.9-5] No rollback support required. Only tracks current deployed state, not history.
  • [3.9-6] Concurrent editing uses last-write-wins model. No pessimistic locking or optimistic concurrency conflict detection on templates.

Split-section note: Phase 3C covers [3.9-1], [3.9-2] (backend maintenance of two views), [3.9-3] (backend deployment pipeline), [3.9-5] (no rollback), [3.9-6] (last-write-wins — already from Phase 2). Phase 6 covers [3.9-4] (diff view UI) and the deployment trigger UI. The diff calculation itself is built in Phase 2; Phase 3C uses it. Phase 3C stores the deployed configuration snapshot that enables diff comparison.

Section 5.3 — Store-and-Forward for External Calls (Phase 3C portion: engine)

  • [5.3-1] If an external system is unavailable when a script invokes a method, the message is buffered locally at the site.
  • [5.3-2] Retry is performed per message — individual failed messages retry independently.
  • [5.3-3] Each external system definition includes configurable retry settings: max retry count and time between retries (fixed interval, no exponential backoff).
  • [5.3-4] After max retries are exhausted, the message is parked (dead-lettered) for manual review.
  • [5.3-5] There is no maximum buffer size — messages accumulate until delivery succeeds or retries exhausted.

Split-section note: Phase 3C builds the S&F engine that handles buffering, retry, and parking. Phase 7 integrates the External System Gateway as a delivery target and implements the error classification (transient vs. permanent).

Section 5.4 — Parked Message Management (Phase 3C portion: backend)

  • [5.4-1] Parked messages are stored at the site where they originated.
  • [5.4-2] Central UI can query sites for parked messages and manage them remotely.
  • [5.4-3] Operators can retry or discard parked messages from the central UI.
  • [5.4-4] Parked message management covers external system calls, notifications, and cached database writes.

Split-section note: Phase 3C builds the site-side storage, query handler, and retry/discard command handler for parked messages. Phase 6 builds the central UI for parked message management.

Section 6.4 — Store-and-Forward for Notifications (Phase 3C portion: engine)

  • [6.4-1] If the email server is unavailable, notifications are buffered locally at the site.
  • [6.4-2] Follows the same retry pattern as external system calls: configurable max retry count and time between retries (fixed interval).
  • [6.4-3] After max retries are exhausted, the notification is parked for manual review.
  • [6.4-4] There is no maximum buffer size for notification messages.

Split-section note: Phase 3C builds the S&F engine generically to support all three message categories. Phase 7 integrates the Notification Service as a delivery target.


Design Constraints Checklist

Constraints from CLAUDE.md Key Design Decisions and Component-*.md documents relevant to this phase.

KDD Constraints

  • [KDD-deploy-6] Deployment identity: unique deployment ID + revision hash for idempotency.
  • [KDD-deploy-7] Per-instance operation lock covers all mutating commands (deploy, disable, enable, delete).
  • [KDD-deploy-8] Site-side apply is all-or-nothing per instance.
  • [KDD-deploy-9] System-wide artifact version skew across sites is supported.
  • [KDD-deploy-11] Optimistic concurrency on deployment status records.
  • [KDD-sf-1] Fixed retry interval, no max buffer size. Only transient failures buffered.
  • [KDD-sf-2] Async best-effort replication to standby (no ack wait).
  • [KDD-sf-3] Messages not cleared on instance deletion.
  • [KDD-sf-4] CachedCall idempotency is the caller's responsibility. (Documented in Phase 3C; enforced in Phase 7 integration.)

Component Design Constraints (from docs/requirements/Component-DeploymentManager.md)

  • [CD-DM-1] Deployment flow: validate -> flatten -> send -> track. Validation failures stop the pipeline before anything is sent.
  • [CD-DM-2] Site-side idempotency on deployment ID — duplicate deployment receives "already applied" response.
  • [CD-DM-3] Sites reject stale configurations — older revision hash than currently applied is rejected.
  • [CD-DM-4] After central failover or timeout, Deployment Manager queries the site for current deployment state before allowing re-deploy.
  • [CD-DM-5] Only one mutating operation per instance in-flight at a time. Second operation rejected with "operation in progress" error.
  • [CD-DM-6] Different instances can proceed in parallel, even at the same site.
  • [CD-DM-7] State transition matrix: Enabled allows deploy/disable/delete; Disabled allows deploy(enables on apply)/enable/delete; Not-deployed allows deploy only.
  • [CD-DM-8] System-wide artifact deployment shows per-site result matrix. Successful sites not rolled back if others fail. Failed sites can be retried individually.
  • [CD-DM-9] Only current deployment status per instance stored (pending, in-progress, success, failed). No deployment history table — audit log captures history.
  • [CD-DM-10] Deployment scope is individual instance level. Bulk operations decompose into individual instance deployments.
  • [CD-DM-11] Diff view available before deploying (added/removed/changed members, connection binding changes). (Diff calculation from Phase 2; orchestration in Phase 3C.)
  • [CD-DM-12] Two views maintained: deployed configuration and template-derived configuration.
  • [CD-DM-13] Deployable artifacts include flattened instance config plus system-wide artifacts (shared scripts, external system defs, DB connection defs, notification lists). System-wide artifact deployment is a separate action.
  • [CD-DM-14] Site-side apply is all-or-nothing per instance. If any step fails (e.g., script compilation), entire deployment rejected. Previous config remains active and unchanged.
  • [CD-DM-15] Cross-site version skew for artifacts is supported. Artifacts are self-contained and site-independent.
  • [CD-DM-16] Disable: stops data subscriptions, script triggers, alarm evaluation. Config retained.
  • [CD-DM-17] Enable: re-activates a disabled instance.
  • [CD-DM-18] Delete: removes running config, destroys Instance Actor and children. S&F messages not cleared. Fails if site unreachable — central does not mark deleted until site confirms.

Component Design Constraints (from docs/requirements/Component-StoreAndForward.md, plus the Site Runtime [CD-SR-*] and Communication Layer [CD-COM-*] component docs)

  • [CD-SF-1] Three message categories: external system calls, email notifications, cached database writes.
  • [CD-SF-2] Retry settings defined on the source entity (external system def, SMTP config, DB connection def), not per-message.
  • [CD-SF-3] Only transient failures eligible for S&F buffering. Permanent failures (HTTP 4xx) returned to script, not queued.
  • [CD-SF-4] No maximum buffer size. Bounded only by available disk space.
  • [CD-SF-5] Active node persists locally and forwards each buffer operation (add, remove, park) to standby asynchronously. No ack wait.
  • [CD-SF-6] Standby applies operations to its own local SQLite.
  • [CD-SF-7] On failover, rare cases of duplicate deliveries (delivered but remove not replicated) or missed retries (added but not replicated). Both acceptable.
  • [CD-SF-8] Parked messages remain in SQLite at site. Central queries via Communication Layer.
  • [CD-SF-9] Operators can retry (move back to retry queue) or discard (remove permanently) parked messages.
  • [CD-SF-10] Messages not automatically cleared when instance deleted. Pending and parked messages continue to exist.
  • [CD-SF-11] Message format stores: message ID, category, target, payload, retry count, created at, last attempt at, status (pending/retrying/parked).
  • [CD-SF-12] Message lifecycle: attempt immediate delivery -> success removes; failure buffers -> retry loop -> success removes + notify standby; max retries exhausted -> park.
  • [CD-SR-1] Deployment handling: receive config -> store in SQLite -> compile scripts -> create/update Instance Actor -> report result.
  • [CD-SR-2] For redeployments: existing Instance Actor and children stopped, then new Instance Actor created with updated config. Subscriptions re-established.
  • [CD-SR-3] Disable: stops Instance Actor and children. Retains deployed config in SQLite for re-enablement.
  • [CD-SR-4] Enable: creates new Instance Actor from stored config (same as startup).
  • [CD-SR-5] Delete: stops Instance Actor and children, removes deployed config from SQLite. Does not clear S&F messages.
  • [CD-SR-6] Script compilation failure during deployment rejects entire deployment. No partial state applied. Failure reported to central.
  • [CD-COM-1] Deployment pattern: request/response. No buffering at central. Unreachable site = immediate failure.
  • [CD-COM-2] Instance lifecycle pattern: request/response. Unreachable site = immediate failure.
  • [CD-COM-3] System-wide artifact pattern: broadcast with per-site acknowledgment.
  • [CD-COM-4] Deployment timeout: 120 seconds default (script compilation can be slow).
  • [CD-COM-5] Lifecycle command timeout: 30 seconds.
  • [CD-COM-6] System-wide artifact timeout: 120 seconds per site.
  • [CD-COM-7] Application-level correlation: deployments include deployment ID + revision hash; lifecycle commands include command ID.
  • [CD-COM-8] Remote query pattern for parked messages: request/response with query ID, 30-second timeout.

Work Packages

WP-1: Deployment Manager — Core Deployment Flow

Description: Implement the central-side deployment orchestration pipeline: accept deployment request, call Template Engine for validated+flattened config, send to site via Communication Layer, track status.

Acceptance Criteria:

  • Deployment request triggers validation -> flatten -> send -> track flow [CD-DM-1]
  • Validation failures stop the pipeline before sending; errors returned to caller [CD-DM-1], [1.4-4]
  • Pre-deployment validation invokes Template Engine for flattening, naming collision detection, script compilation, trigger references, connection binding [1.4-4]
  • Validation does not verify that data source relative paths resolve to real tags on physical devices (runtime concern) [1.4-4]
  • Successful deployment sends flattened config to site via Communication Layer [1.4-1]
  • Site applies immediately upon receipt — no operator confirmation [1.4-1]
  • Site reports success/failure back to central [1.4-3]
  • Deployment status updated in config DB (pending -> in-progress -> success/failed) [CD-DM-9]
  • Deployment scope is individual instance level [CD-DM-10], [3.9-3]
  • Template changes not auto-propagated — explicit deploy required [3.9-1]
  • No rollback support — only current deployed state tracked [3.9-5]
  • Uses 120-second deployment timeout [CD-COM-4]
  • If site unreachable, deployment fails immediately (no central buffering) [CD-COM-1]

Estimated Complexity: L

Requirements Traced: [1.4-1], [1.4-3], [1.4-4], [3.9-1], [3.9-3], [3.9-5], [CD-DM-1], [CD-DM-9], [CD-DM-10], [CD-COM-1], [CD-COM-4]
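The WP-1 flow can be illustrated as a short sketch. The production system is C#/Akka.NET; this Python version only shows the control flow, and the names `validate_and_flatten`, `send_deployment`, and the status store are hypothetical stand-ins for the Phase 2 Template Engine and Phase 3B Communication Layer interfaces:

```python
# Illustrative sketch of the [CD-DM-1] pipeline: validate -> flatten -> send -> track.
# Validation failures stop the pipeline before anything is sent to the site.
from dataclasses import dataclass, field

@dataclass
class DeploymentResult:
    status: str                 # "failed-validation", "failed", or "success"
    errors: list = field(default_factory=list)

def deploy_instance(instance_id, template_engine, comm_layer, status_store):
    status_store[instance_id] = "pending"
    ok, flattened, errors = template_engine.validate_and_flatten(instance_id)
    if not ok:
        # Nothing was sent; errors are returned to the caller [1.4-4].
        status_store[instance_id] = "failed"
        return DeploymentResult("failed-validation", errors)
    status_store[instance_id] = "in-progress"
    # Unreachable site = immediate failure; no buffering at central [CD-COM-1].
    reply = comm_layer.send_deployment(instance_id, flattened, timeout_s=120)
    status_store[instance_id] = "success" if reply.ok else "failed"
    return DeploymentResult(status_store[instance_id], reply.errors)
```

Only the current status per instance is tracked (no history table), matching [CD-DM-9].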


WP-2: Deployment Identity & Idempotency

Description: Implement deployment ID generation, revision hash propagation, and idempotent site-side apply.

Acceptance Criteria:

  • Every deployment assigned a unique deployment ID [KDD-deploy-6]
  • Deployment includes flattened config's revision hash (from Template Engine) [KDD-deploy-6]
  • Site-side apply is idempotent on deployment ID — duplicate deployment returns "already applied" [CD-DM-2]
  • Sites reject stale configurations — older revision hash than currently applied is rejected, site reports current version [CD-DM-3]
  • After central failover or timeout, Deployment Manager queries site for current deployment state before allowing re-deploy [CD-DM-4]
  • Deployment messages include deployment ID + revision hash as correlation [CD-COM-7]

Estimated Complexity: M

Requirements Traced: [KDD-deploy-6], [CD-DM-2], [CD-DM-3], [CD-DM-4], [CD-COM-7]
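A minimal sketch of the site-side checks follows. One assumption to note: revision hashes are not themselves ordered, so the sketch uses an integer revision counter as a stand-in for "older than currently applied" — the real comparison would be against the applied revision lineage:

```python
# Sketch of [CD-DM-2] idempotency and [CD-DM-3] stale rejection at the site.
def handle_deployment(site_state, deployment_id, revision, config):
    """site_state is a dict standing in for the site's local SQLite record."""
    if deployment_id == site_state.get("last_deployment_id"):
        return "already-applied"        # duplicate deployment, not re-applied
    if revision < site_state.get("applied_revision", -1):
        # Site reports its current version back to central.
        return f"stale-rejected (current: {site_state['applied_revision']})"
    site_state.update(last_deployment_id=deployment_id,
                      applied_revision=revision, config=config)
    return "applied"
```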


WP-3: Per-Instance Operation Lock

Description: Implement concurrency control ensuring only one mutating operation per instance can be in-flight at a time.

Acceptance Criteria:

  • Only one mutating operation (deploy, disable, enable, delete) per instance in-flight at a time [KDD-deploy-7], [CD-DM-5]
  • Second operation on same instance rejected with "operation in progress" error [CD-DM-5]
  • Different instances can proceed in parallel, even at the same site [CD-DM-6]
  • Lock released when operation completes (success or failure) or times out
  • Lock state does not survive central failover (in-progress operations treated as failed per [CD-DM-4])

Estimated Complexity: M

Requirements Traced: [KDD-deploy-7], [CD-DM-5], [CD-DM-6]
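The lock semantics can be sketched as below. In the actual system this would likely be serialized through the Deployment Manager singleton actor rather than an explicit lock table; the sketch only demonstrates the acceptance criteria:

```python
# Per-instance operation lock [CD-DM-5]: one mutating op in flight per
# instance; different instances proceed in parallel [CD-DM-6].
class OperationLocks:
    def __init__(self):
        self._in_flight = {}    # instance_id -> operation name

    def acquire(self, instance_id, operation):
        if instance_id in self._in_flight:
            raise RuntimeError(
                f"operation in progress: {self._in_flight[instance_id]}")
        self._in_flight[instance_id] = operation

    def release(self, instance_id):
        # Called on success, failure, or timeout alike; idempotent.
        self._in_flight.pop(instance_id, None)
```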


WP-4: State Transition Matrix & Deployment Status

Description: Implement the allowed state transitions for instance operations and deployment status persistence with optimistic concurrency.

Acceptance Criteria:

  • State transition matrix enforced: [CD-DM-7]
    • Enabled: allows deploy, disable, delete. Rejects enable (already enabled).
    • Disabled: allows deploy (enables on apply), enable, delete. Rejects disable (already disabled).
    • Not-deployed: allows deploy only. Rejects disable, enable, delete.
  • Invalid state transitions return clear error messages
  • Only current deployment status per instance stored (pending, in-progress, success, failed) [CD-DM-9]
  • No deployment history table — audit log captures history via IAuditService [CD-DM-9]
  • Optimistic concurrency on deployment status records [KDD-deploy-11]
  • All deployment actions logged via IAuditService (who, what, when, result)

Estimated Complexity: M

Requirements Traced: [CD-DM-7], [CD-DM-9], [KDD-deploy-11], [3.8.1-1], [3.8.1-2]
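The [CD-DM-7] matrix is small enough to express as a lookup table; a sketch of that representation (not the production encoding):

```python
# Allowed operations per instance state [CD-DM-7].
ALLOWED = {
    "enabled":      {"deploy", "disable", "delete"},
    "disabled":     {"deploy", "enable", "delete"},  # deploy enables on apply
    "not-deployed": {"deploy"},
}

def check_transition(state, operation):
    """Raise with a clear message if the operation is invalid in this state."""
    if operation not in ALLOWED[state]:
        raise ValueError(f"'{operation}' not allowed while '{state}'")
```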


WP-5: Site-Side Apply Atomicity

Description: Implement all-or-nothing deployment application at the site.

Acceptance Criteria:

  • Site stores new config, compiles all scripts, creates/updates Instance Actor as single operation [KDD-deploy-8], [CD-DM-14]
  • If any step fails (e.g., script compilation), entire deployment for that instance rejected [CD-DM-14], [CD-SR-6]
  • Previous configuration remains active and unchanged on failure [CD-DM-14]
  • Site reports specific failure reason (e.g., compilation error details) back to central [CD-SR-6]
  • For redeployments: existing Instance Actor and children stopped, then new Instance Actor created with updated config [CD-SR-2]
  • Subscriptions re-established after redeployment [CD-SR-2]
  • Site continues operating with last deployed config if connectivity to central lost [1.4-2]
  • Deployment handling follows: receive -> store SQLite -> compile -> create/update actor -> report [CD-SR-1]

Estimated Complexity: L

Requirements Traced: [KDD-deploy-8], [CD-DM-14], [CD-SR-1], [CD-SR-2], [CD-SR-6], [1.4-2]
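The atomicity contract can be sketched as follows. `compile_scripts` and `build_actor` are hypothetical stand-ins for the site's script compiler and Instance Actor construction; the key point is that nothing is committed until every step succeeds:

```python
# All-or-nothing site-side apply [CD-DM-14]/[CD-SR-6]: build everything
# first, commit as a single switch-over, report the failure reason otherwise.
def apply_deployment(site, config, compile_scripts, build_actor):
    try:
        compiled = [compile_scripts(src) for src in config["scripts"]]
        actor = build_actor(config, compiled)
    except Exception as e:
        # No partial state: previous config and actor remain active [CD-DM-14].
        return {"ok": False, "reason": str(e)}
    # Commit: store config, swap in the new Instance Actor [CD-SR-2].
    site["config"], site["actor"] = config, actor
    return {"ok": True}
```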


WP-6: Instance Lifecycle Commands

Description: Implement disable, enable, and delete commands sent from central to site.

Acceptance Criteria:

  • Disable: site stops script triggers, data subscriptions, and alarm evaluation [3.8.1-3], [CD-DM-16]
  • Disable retains deployed configuration for re-enablement without redeployment [3.8.1-3], [CD-DM-16], [CD-SR-3]
  • Disable: S&F messages for disabled instance continue to drain [3.8.1-4]
  • Enable: re-activates a disabled instance by creating a new Instance Actor from stored config, restoring data subscriptions, script triggers, and alarm evaluation [CD-DM-17], [CD-SR-4]
  • Disable and enable commands fail immediately if the site is unreachable (no buffering, consistent with deployment behavior) [CD-COM-2]
  • Delete: removes running config from site, stops subscriptions, destroys Instance Actor and children [3.8.1-5], [CD-DM-18], [CD-SR-5]
  • Delete: S&F messages are not cleared [3.8.1-6], [CD-DM-18], [CD-SR-5], [KDD-sf-3]
  • Delete fails if site unreachable — central does not mark deleted until site confirms [3.8.1-7], [CD-DM-18]
  • Templates cannot be deleted if instances or child templates reference them [3.8.1-8]
  • Lifecycle commands use request/response pattern with 30s timeout [CD-COM-2], [CD-COM-5]
  • Lifecycle commands include command ID for deduplication (duplicate commands recognized and not re-applied) [CD-COM-7]

Estimated Complexity: L

Requirements Traced: [3.8.1-1] through [3.8.1-8], [KDD-sf-3], [CD-DM-16], [CD-DM-17], [CD-DM-18], [CD-SR-3], [CD-SR-4], [CD-SR-5], [CD-COM-2], [CD-COM-5], [CD-COM-7]
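A compressed sketch of the site-side command handler, showing the command-ID deduplication from [CD-COM-7] and the effect of each command. The dict-based site state is illustrative only; the real handler drives the Instance Actor hierarchy:

```python
# Lifecycle command handling with command-ID dedup [CD-COM-7].
def handle_command(site, seen_ids, command_id, action):
    if command_id in seen_ids:
        return "duplicate-ignored"      # recognized, not re-applied
    seen_ids.add(command_id)
    if action == "disable":
        site["running"] = False         # stop triggers/subscriptions/alarms;
                                        # config retained [3.8.1-3]
    elif action == "enable":
        site["running"] = True          # rebuild actor from stored config [CD-SR-4]
    elif action == "delete":
        site.pop("config", None)        # config removed; S&F messages are
        site["running"] = False         # NOT cleared here [3.8.1-6]
    return "applied"
```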


WP-7: System-Wide Artifact Deployment

Description: Implement deployment of shared scripts, external system definitions, database connection definitions, and notification lists to all sites.

Acceptance Criteria:

  • Changes not automatically propagated to sites [1.5-1]
  • Deployment requires explicit action by a user with Deployment role [1.5-2]
  • Design role manages definitions; Deployment role triggers deployment [1.5-3]
  • Broadcast pattern with per-site acknowledgment [CD-COM-3]
  • Per-site result matrix — each site reports independently [CD-DM-8]
  • Successful sites not rolled back if other sites fail [CD-DM-8]
  • Failed sites can be retried individually [CD-DM-8]
  • 120-second timeout per site [CD-COM-6]
  • Cross-site version skew supported — sites can run different artifact versions [KDD-deploy-9], [CD-DM-15]
  • Artifacts are self-contained and site-independent [CD-DM-15]
  • System-wide artifact deployment is a separate action from instance deployment [CD-DM-13]
  • Shared scripts undergo pre-compilation validation (syntax/structural correctness) before deployment to sites
  • All artifact deployment actions logged via IAuditService

Estimated Complexity: L

Requirements Traced: [1.5-1], [1.5-2], [1.5-3], [KDD-deploy-9], [CD-DM-8], [CD-DM-13], [CD-DM-15], [CD-COM-3], [CD-COM-6]
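The per-site result matrix behaves as sketched below: each site acknowledges independently, successes are never rolled back, and only failed sites are retried. `send` is a hypothetical Communication Layer call:

```python
# Per-site result matrix for system-wide artifact deployment [CD-DM-8].
def deploy_artifact(artifact, sites, send):
    """Broadcast with per-site acknowledgment [CD-COM-3]."""
    return {site: send(site, artifact) for site in sites}

def retry_failed(artifact, results, send):
    """Retry only the sites that failed; successes are left untouched."""
    retried = {s: send(s, artifact) for s, ok in results.items() if not ok}
    return {**results, **retried}
```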


WP-8: Deployed vs. Template-Derived State Management

Description: Implement storage and retrieval of deployed configuration snapshots, enabling comparison with template-derived configs.

Acceptance Criteria:

  • System maintains two views per instance: deployed configuration and template-derived configuration [3.9-2], [CD-DM-12]
  • Deployed configuration updated on successful deployment [CD-DM-12]
  • Template-derived configuration computed on demand from current template state (uses Phase 2 flattening)
  • Diff can be computed between deployed and template-derived (uses Phase 2 diff calculation) [CD-DM-11]
  • Diff shows added/removed/changed members and connection binding changes [CD-DM-11]
  • Staleness detectable via revision hash comparison [3.9-4]

Estimated Complexity: M

Requirements Traced: [3.9-2], [3.9-4], [CD-DM-11], [CD-DM-12]


WP-9: S&F SQLite Persistence & Message Format

Description: Implement the SQLite schema and data access layer for store-and-forward message buffering at site nodes.

Acceptance Criteria:

  • Buffered messages persisted to local SQLite on each site node [1.3-3]
  • Message format stores: message ID, category, target, payload, retry count, created at, last attempt at, status (pending/retrying/parked) [CD-SF-11]
  • Three message categories supported: external system calls, email notifications, cached database writes [CD-SF-1]
  • No maximum buffer size — messages accumulate until delivery or parking [1.3-6], [CD-SF-4]
  • Central does not buffer messages (S&F is site-only) [1.3-1]
  • All S&F timestamps are UTC

Estimated Complexity: M

Requirements Traced: [1.3-1], [1.3-3], [1.3-6], [CD-SF-1], [CD-SF-4], [CD-SF-11]
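One possible SQLite shape for the [CD-SF-11] message format is sketched below. Table and column names are assumptions, not the final schema; the point is that every [CD-SF-11] field maps to a column and the category/status vocabularies are constrained:

```python
# Candidate SQLite schema for the site-side S&F buffer (illustrative).
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sf_messages (
    message_id      TEXT PRIMARY KEY,
    category        TEXT NOT NULL CHECK (category IN
                        ('external_call', 'notification', 'cached_db_write')),
    target          TEXT NOT NULL,
    payload         BLOB NOT NULL,
    retry_count     INTEGER NOT NULL DEFAULT 0,
    created_at      TEXT NOT NULL,    -- UTC ISO-8601
    last_attempt_at TEXT,             -- UTC ISO-8601, NULL before first attempt
    status          TEXT NOT NULL CHECK (status IN
                        ('pending', 'retrying', 'parked'))
);
CREATE INDEX IF NOT EXISTS ix_sf_status ON sf_messages (status, category);
"""

def open_buffer(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```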


WP-10: S&F Retry Engine

Description: Implement the fixed-interval retry loop with per-source-entity retry settings and transient-only buffering.

Acceptance Criteria:

  • Message lifecycle: attempt immediate delivery -> failure buffers -> retry loop -> success removes; max retries -> park [CD-SF-12]
  • Retry is per-message — individual messages retry independently [5.3-2]
  • Fixed retry interval (not exponential backoff) [1.3-7], [KDD-sf-1]
  • Retry settings defined on the source entity (external system def, SMTP config, DB connection def), not per-message [CD-SF-2]
  • External system definitions include max retry count and time between retries [5.3-3]
  • Notification config includes max retry count and time between retries [6.4-2]
  • After max retries exhausted, message is parked (dead-lettered) [5.3-4], [6.4-3]
  • Only transient failures eligible for buffering. Permanent failures returned to caller, not queued [KDD-sf-1], [CD-SF-3]
  • No maximum buffer size [5.3-5], [6.4-4], [KDD-sf-1]
  • Messages for external calls buffered locally when system unavailable [5.3-1]
  • Notifications buffered when email server unavailable [6.4-1]
  • Successfully delivered messages removed from local store [1.3-5]

Estimated Complexity: L

Requirements Traced: [1.3-5], [1.3-7], [5.3-1] through [5.3-5], [6.4-1] through [6.4-4], [KDD-sf-1], [CD-SF-2], [CD-SF-3], [CD-SF-12]
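The [CD-SF-12] lifecycle can be sketched with two functions: immediate submit and a fixed-interval retry sweep. `deliver` and `is_transient` are hypothetical hooks (Phase 7 supplies the real delivery targets and transient/permanent classification):

```python
# Sketch of the S&F message lifecycle: immediate attempt, transient-only
# buffering, fixed-interval retries, parking on exhaustion [CD-SF-12].
def submit(msg, deliver, is_transient, buffer):
    try:
        deliver(msg)                 # attempt immediate delivery
        return "delivered"
    except Exception as e:
        if not is_transient(e):
            raise                    # permanent failure goes back to the caller
        buffer.append({"msg": msg, "retries": 0})
        return "buffered"

def retry_pass(buffer, deliver, max_retries):
    """One sweep; a fixed-interval scheduler would call this on a timer."""
    parked, remaining = [], []
    for entry in buffer:
        try:
            deliver(entry["msg"])    # success removes the message
        except Exception:
            entry["retries"] += 1
            (parked if entry["retries"] >= max_retries
                    else remaining).append(entry)
    buffer[:] = remaining
    return parked                    # parked messages go to manual review
```

Note that `max_retries` comes from the source entity (external system def, SMTP config, DB connection def) per [CD-SF-2], not from the message itself.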


WP-11: S&F Async Replication to Standby

Description: Implement application-level replication of buffer operations from active to standby node.

Acceptance Criteria:

  • All S&F buffers replicated between two site cluster nodes via application-level replication over Akka.NET remoting [1.3-2]
  • Active node forwards each buffer operation (add, remove, park) to standby asynchronously [CD-SF-5], [KDD-sf-2]
  • Active node does not wait for standby acknowledgment (no ack wait) [KDD-sf-2], [CD-SF-5]
  • Standby applies operations to its own local SQLite [CD-SF-6]
  • On failover, standby takes over delivery from its replicated copy [1.3-4]. Note: per [CD-SF-7], the async replication design means the copy is near-complete — rare duplicate deliveries or missed retries are acceptable trade-offs for the latency benefit.
  • Duplicate deliveries and missed retries accepted as trade-offs for async replication [CD-SF-7]
  • Successfully delivered messages removed from both nodes' stores [1.3-5]

Estimated Complexity: L

Requirements Traced: [1.3-2], [1.3-4], [1.3-5], [KDD-sf-2], [CD-SF-5], [CD-SF-6], [CD-SF-7]
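A sketch of the fire-and-forget replication contract follows. The standby inbox here stands in for an Akka.NET `Tell` over remoting; the active node never waits on it, which is exactly what produces the accepted [CD-SF-7] anomalies:

```python
# Async best-effort replication [KDD-sf-2]: apply locally, then notify the
# standby without waiting for an acknowledgment.
def apply_op(store, op, msg_id, payload=None):
    if op == "add":
        store[msg_id] = {"payload": payload, "status": "pending"}
    elif op == "park":
        store[msg_id]["status"] = "parked"
    elif op == "remove":
        store.pop(msg_id, None)

def replicated(active, standby_inbox, op, msg_id, payload=None):
    apply_op(active, op, msg_id, payload)        # local SQLite first
    standby_inbox.append((op, msg_id, payload))  # fire-and-forget, no ack wait
```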


WP-12: Parked Message Management

Description: Implement site-side parked message storage, query handling, and retry/discard commands accessible from central.

Acceptance Criteria:

  • Parked messages stored at the site in SQLite [5.4-1], [CD-SF-8]
  • Central can query sites for parked messages via Communication Layer [5.4-2], [CD-SF-8]
  • Operators can retry a parked message (moves back to retry queue) [5.4-3], [CD-SF-9]
  • Operators can discard a parked message (removes permanently) [5.4-3], [CD-SF-9]
  • Management covers all three categories: external system calls, notifications, cached database writes [5.4-4]
  • Remote query uses request/response pattern with query ID, 30s timeout [CD-COM-8]
  • Messages not automatically cleared when instance deleted [CD-SF-10], [KDD-sf-3], [3.8.1-6]
  • Pending and parked messages continue to exist after instance deletion [CD-SF-10]

Estimated Complexity: M

Requirements Traced: [5.4-1] through [5.4-4], [KDD-sf-3], [CD-SF-8], [CD-SF-9], [CD-SF-10], [CD-COM-8], [3.8.1-6]
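The site-side handlers for the parked-message commands reduce to three operations, sketched here against a dict standing in for the site's SQLite store (the real versions sit behind Communication Layer request/response endpoints):

```python
# Parked-message query/retry/discard handlers [CD-SF-8]/[CD-SF-9].
def query_parked(store, category=None):
    """List parked message IDs, optionally filtered by category [5.4-4]."""
    return [mid for mid, m in store.items()
            if m["status"] == "parked"
            and (category is None or m["category"] == category)]

def retry_parked(store, msg_id):
    store[msg_id].update(status="pending", retries=0)   # back to retry queue

def discard_parked(store, msg_id):
    del store[msg_id]                                   # removed permanently
```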


WP-13: S&F Messages Survive Instance Deletion

Description: Ensure store-and-forward messages are preserved when an instance is deleted.

Acceptance Criteria:

  • S&F messages not cleared on instance deletion [3.8.1-6], [KDD-sf-3], [CD-SF-10]
  • Pending messages continue retry delivery after instance deletion
  • Parked messages remain queryable and manageable from central after instance deletion
  • S&F messages for disabled instances continue to drain [3.8.1-4]

Estimated Complexity: S

Requirements Traced: [3.8.1-4], [3.8.1-6], [KDD-sf-3], [CD-SF-10]


WP-14: S&F Health Metrics & Event Logging Integration

Description: Integrate S&F buffer depth as a health metric and log S&F activity to site event log.

Acceptance Criteria:

  • S&F buffer depth reported as health metric (broken down by category) — integrates with Phase 3B Health Monitoring
  • S&F activity logged to site event log: message queued, delivered, retried, parked (per docs/requirements/Component-StoreAndForward.md Dependencies)
  • S&F buffer depth visible in health reports sent to central

Estimated Complexity: S

Requirements Traced: [CD-SF-1] (categories), docs/requirements/Component-StoreAndForward.md Dependencies (Site Event Logging, Health Monitoring)


WP-15: CachedCall Idempotency Documentation

Description: Document that CachedCall idempotency is the caller's responsibility.

Acceptance Criteria:

  • Script API documentation clearly states that ExternalSystem.CachedCall() idempotency is the caller's responsibility [KDD-sf-4]
  • S&F engine makes no idempotency guarantees — duplicate delivery possible (especially on failover) [CD-SF-7]

Estimated Complexity: S

Requirements Traced: [KDD-sf-4], [CD-SF-7]


WP-16: Deployment Manager — Concurrent Template Editing Semantics

Description: Ensure last-write-wins semantics for template editing do not conflict with deployment pipeline.

Acceptance Criteria:

  • Last-write-wins for concurrent template editing — no pessimistic locking or optimistic concurrency on templates [3.9-6]
  • Deployment uses optimistic concurrency on deployment status records only [KDD-deploy-11]
  • Template state at time of deployment is captured in the flattened config and revision hash

Estimated Complexity: S

Requirements Traced: [3.9-6], [KDD-deploy-11]
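The optimistic-concurrency rule on deployment status records can be sketched as a version-guarded update (schema and column names are assumptions, and the real implementation is C# against the configuration database): the write succeeds only if the row still carries the version the writer read, so a concurrent update is detected rather than silently overwritten.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE deployment_status "
           "(instance_id TEXT PRIMARY KEY, state TEXT, version INTEGER)")
db.execute("INSERT INTO deployment_status VALUES ('pump-01', 'Deploying', 1)")

def update_status(instance_id, expected_version, new_state):
    """Version-guarded write: rowcount 0 means a concurrent writer won."""
    cur = db.execute(
        "UPDATE deployment_status SET state = ?, version = version + 1 "
        "WHERE instance_id = ? AND version = ?",
        (new_state, instance_id, expected_version))
    return cur.rowcount == 1

ok = update_status("pump-01", 1, "Deployed")   # succeeds; version is now 2
stale = update_status("pump-01", 1, "Failed")  # stale version: rejected
print(ok, stale)  # True False
```

Templates, by contrast, get no such guard at all: last write wins, per [3.9-6].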


Test Strategy

Unit Tests

| Area | Tests |
| --- | --- |
| Deployment flow | Validate -> flatten -> send pipeline; validation failure stops pipeline |
| Deployment identity | Deployment ID generation uniqueness; revision hash propagation |
| Operation lock | Concurrent requests on same instance rejected; different instances proceed in parallel; lock released on completion/timeout |
| State transitions | All valid transitions succeed; all invalid transitions rejected with correct error messages |
| Deployment status | CRUD with optimistic concurrency; concurrent updates handled correctly |
| S&F message format | Serialization/deserialization of all three categories; all fields stored correctly |
| S&F retry logic | Fixed interval timing; per-source-entity settings respected; max retries triggers parking; transient-only filter |
| Parked message ops | Retry moves to queue; discard removes; query returns correct results |
| Template deletion constraint | Templates with instance references cannot be deleted; templates with child template references cannot be deleted |
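The S&F retry rows above can be sketched as one small decision function (the state names and failure-classification values are illustrative, not the engine's actual API): transient failures are retried at a fixed interval until max retries, then parked; permanent failures are never buffered.

```python
def on_delivery_failure(message, failure_kind, max_retries):
    """Decide what happens to a message after a failed delivery attempt."""
    if failure_kind == "permanent":
        return "rejected"          # [CD-SF-3]: error returned to caller, not buffered
    message["attempts"] += 1
    if message["attempts"] >= max_retries:
        return "parked"            # awaits operator retry/discard from central
    return "retry-scheduled"       # next attempt after the fixed interval

msg = {"id": 1, "attempts": 0}
outcomes = [on_delivery_failure(msg, "transient", max_retries=3) for _ in range(3)]
print(outcomes)  # ['retry-scheduled', 'retry-scheduled', 'parked']
print(on_delivery_failure({"id": 2, "attempts": 0}, "permanent", max_retries=3))  # rejected
```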

Integration Tests

| Area | Tests |
| --- | --- |
| End-to-end deploy | Central sends deployment -> site compiles -> actors created -> success reported -> status updated |
| Deploy with validation failure | Template with compilation error -> deployment blocked before send |
| Idempotent deploy | Same deployment ID sent twice -> second returns "already applied" |
| Stale config rejection | Older revision hash sent -> site rejects with current version |
| Lifecycle commands | Disable -> verify subscriptions stopped and config retained; Enable -> verify instance re-activates; Delete -> verify actors destroyed and config removed |
| S&F buffer and retry | Submit message -> delivery fails -> buffered -> retry succeeds -> message removed |
| S&F parking | Submit message -> delivery fails -> max retries -> message parked |
| S&F replication | Buffer message on active -> verify replicated to standby SQLite |
| Parked message remote query | Central queries site for parked messages -> correct results returned |
| Parked message retry/discard | Central retries parked message -> moves to queue; Central discards -> removed |
| System-wide artifact deploy | Deploy shared scripts to multiple sites -> per-site status tracked |
| S&F survives deletion | Delete instance -> verify S&F messages still exist and deliver |
| S&F drains on disable | Disable instance -> verify pending S&F messages continue delivery |
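The "Idempotent deploy" and "Stale config rejection" rows can be illustrated with a site-side sketch (names are assumptions; staleness is modeled here with a monotonic revision counter for simplicity, whereas the real design compares revision hashes): the site remembers applied deployment IDs and the currently deployed revision, answering "already applied" to a duplicate ID and rejecting an older revision with its current version.

```python
applied_ids = set()      # deployment IDs the site has already applied
current_revision = 0     # stands in for the deployed revision hash

def apply_deployment(deployment_id, revision):
    """Site-side idempotency and staleness check before applying a deployment."""
    global current_revision
    if deployment_id in applied_ids:
        return "already-applied"
    if revision < current_revision:
        return f"rejected-stale (site at revision {current_revision})"
    applied_ids.add(deployment_id)
    current_revision = revision
    return "applied"

print(apply_deployment("d-1", revision=5))  # applied
print(apply_deployment("d-1", revision=5))  # already-applied
print(apply_deployment("d-2", revision=3))  # rejected-stale (site at revision 5)
```

The same check is what makes the "Timeout + reconciliation" failover scenario safe: a re-sent deployment whose response was lost resolves to "already-applied" rather than a double apply.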

Negative Tests

| Requirement | Test |
| --- | --- |
| [1.3-1] Central does not buffer | Verify no S&F infrastructure exists on central; central deployment to unreachable site fails immediately |
| [1.3-6] No max buffer | Submit messages continuously -> verify no rejection based on count |
| [3.8.1-7] Delete fails if unreachable | Attempt delete when site offline -> verify failure; verify central does not mark as deleted |
| [3.8.1-8] Template deletion constraint | Attempt to delete template with active instances -> verify rejection |
| [3.9-1] No auto-propagation | Change template -> verify deployed instance unaffected |
| [3.9-5] No rollback | Verify no rollback mechanism exists; only current deployed state tracked |
| [CD-DM-5] Operation lock rejects | Send two concurrent deploys for same instance -> verify second rejected |
| [CD-DM-7] Invalid transitions | Attempt enable on already-enabled instance -> verify rejection; attempt disable on not-deployed -> verify rejection |
| [CD-SF-3] Permanent failures not buffered | Submit message with permanent failure classification -> verify not buffered, error returned to caller |
| [KDD-sf-3] Messages survive deletion | Delete instance -> verify S&F messages not cleared |

Failover & Resilience Tests

| Scenario | Test |
| --- | --- |
| Mid-deploy central failover | Deploy in progress -> kill central active -> verify deployment treated as failed -> re-query site state -> re-deploy succeeds |
| Mid-deploy site failover | Deploy in progress -> kill site active -> verify deployment times out or fails -> re-deploy to new active succeeds |
| Timeout + reconciliation | Deploy sent -> site applies but response lost -> central times out -> central queries site state -> finds "already applied" -> updates status |
| S&F buffer takeover | Buffer messages on active -> kill active -> standby takes over -> verify messages delivered from replicated copy |
| S&F replication gap | Buffer message -> immediately kill active (before replication) -> verify standby handles gap gracefully (missed message, no crash) |
| Site offline then online | Deploy to offline site -> fails -> site comes online -> re-deploy succeeds |
| System-wide artifact partial failure | Deploy artifacts to 3 sites, 1 offline -> verify 2 succeed -> retry failed site when online |

Verification Gate

Phase 3C is complete when all of the following pass:

  1. Deployment pipeline end-to-end: Central validates, flattens, sends, site compiles, creates actors, reports success. Status tracked in config DB.
  2. Idempotency: Duplicate deployment ID returns "already applied." Stale revision hash rejected.
  3. Operation lock: Concurrent operations on same instance rejected; parallel operations on different instances succeed.
  4. State transitions: All valid transitions work; all invalid transitions rejected.
  5. Site-side atomicity: Compilation failure rejects entire deployment; previous config unchanged.
  6. Lifecycle commands: Disable/enable/delete work correctly with proper state effects.
  7. S&F buffering: Messages buffered on transient failure, retried at fixed interval, parked after max retries.
  8. S&F replication: Buffer operations replicated to standby; failover resumes delivery.
  9. Parked message management: Central can query, retry, and discard parked messages at sites.
  10. S&F survival: Messages persist through instance deletion and continue delivery.
  11. System-wide artifacts: Deployed to all sites with per-site status; version skew tolerated.
  12. Resilience: Mid-deploy failover, timeout+reconciliation, and S&F takeover tests pass.
  13. Audit logging: All deployment and lifecycle actions recorded via IAuditService.
  14. All unit, integration, negative, and failover tests pass.

Open Questions

| # | Question | Context | Impact | Status |
| --- | --- | --- | --- | --- |
| Q-P3C-1 | Should S&F retry timers be reset on failover or continue from the last known retry timestamp? | On failover, the new active node loads buffer from SQLite. Messages have last_attempt_at timestamps. Should retry timing continue relative to last_attempt_at or reset to "now"? | Affects retry behavior immediately after failover. Recommend: continue from last_attempt_at to avoid burst retries. | Open |
| Q-P3C-2 | What is the maximum number of parked messages returned in a single remote query? | Communication Layer pattern 8 uses 30s timeout. Very large parked message sets may need pagination. | Recommend: paginated query (e.g., 100 per page) consistent with Site Event Logging pagination pattern. | Open |
| Q-P3C-3 | Should the per-instance operation lock be in-memory (lost on central failover) or persisted? | In-memory is simpler and consistent with "in-progress deployments treated as failed on failover." Persisted lock could cause orphan locks. | Recommend: in-memory. On failover, all locks released. Site state query resolves any ambiguity. | Open |
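The Q-P3C-1 recommendation can be made concrete with a small scheduling sketch (names are illustrative): after failover, each message's next attempt is computed relative to its persisted last_attempt_at rather than reset to "now", so the new active node neither fires a burst of immediate retries nor schedules anything in the past.

```python
RETRY_INTERVAL = 60.0  # seconds; a fixed interval per the retry policy

def next_attempt_at(last_attempt_at, now):
    """Schedule the next retry relative to the persisted last attempt."""
    if last_attempt_at is None:          # never attempted: try immediately
        return now
    due = last_attempt_at + RETRY_INTERVAL
    return max(due, now)                 # overdue messages fire now, not in the past

print(next_attempt_at(last_attempt_at=1000.0, now=1010.0))  # 1060.0 — 50s still to wait
print(next_attempt_at(last_attempt_at=1000.0, now=2000.0))  # 2000.0 — overdue, fire now
```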

Orphan Check Result

Forward Check (Requirements -> Work Packages)

Every item in the Requirements Checklist and Design Constraints Checklist was walked. Results:

| Checklist Item | Mapped To | Verified |
| --- | --- | --- |
| [1.3-1] through [1.3-7] | WP-9, WP-10, WP-11 | Yes |
| [1.4-1] through [1.4-4] | WP-1, WP-5 | Yes |
| [1.5-1] through [1.5-3] | WP-7 | Yes |
| [3.8.1-1] through [3.8.1-8] | WP-4, WP-6, WP-12, WP-13 | Yes |
| [3.9-1], [3.9-2], [3.9-3], [3.9-5], [3.9-6] | WP-1, WP-8, WP-16 | Yes |
| [3.9-4] | WP-8 (staleness detection); diff UI deferred to Phase 6 | Yes |
| [5.3-1] through [5.3-5] | WP-10 | Yes |
| [5.4-1] through [5.4-4] | WP-12 | Yes |
| [6.4-1] through [6.4-4] | WP-10 | Yes |
| [KDD-deploy-6] | WP-2 | Yes |
| [KDD-deploy-7] | WP-3 | Yes |
| [KDD-deploy-8] | WP-5 | Yes |
| [KDD-deploy-9] | WP-7 | Yes |
| [KDD-deploy-11] | WP-4, WP-16 | Yes |
| [KDD-sf-1] | WP-10 | Yes |
| [KDD-sf-2] | WP-11 | Yes |
| [KDD-sf-3] | WP-6, WP-12, WP-13 | Yes |
| [KDD-sf-4] | WP-15 | Yes |
| [CD-DM-1] through [CD-DM-18] | WP-1 through WP-8 | Yes |
| [CD-SF-1] through [CD-SF-12] | WP-9 through WP-14 | Yes |
| [CD-SR-1] through [CD-SR-6] | WP-5, WP-6 | Yes |
| [CD-COM-1] through [CD-COM-8] | WP-1, WP-2, WP-6, WP-7, WP-12 | Yes |

Forward check result: PASS — no orphan requirements.

Reverse Check (Work Packages -> Requirements)

Every work package traces to at least one requirement or design constraint:

| Work Package | Traces To |
| --- | --- |
| WP-1 | [1.4-1], [1.4-3], [1.4-4], [3.9-1], [3.9-3], [3.9-5], [CD-DM-1], [CD-DM-9], [CD-DM-10], [CD-COM-1], [CD-COM-4] |
| WP-2 | [KDD-deploy-6], [CD-DM-2], [CD-DM-3], [CD-DM-4], [CD-COM-7] |
| WP-3 | [KDD-deploy-7], [CD-DM-5], [CD-DM-6] |
| WP-4 | [CD-DM-7], [CD-DM-9], [KDD-deploy-11], [3.8.1-1], [3.8.1-2] |
| WP-5 | [KDD-deploy-8], [CD-DM-14], [CD-SR-1], [CD-SR-2], [CD-SR-6], [1.4-2] |
| WP-6 | [3.8.1-1] through [3.8.1-8], [KDD-sf-3], [CD-DM-16] through [CD-DM-18], [CD-SR-3] through [CD-SR-5], [CD-COM-2], [CD-COM-5], [CD-COM-7] |
| WP-7 | [1.5-1] through [1.5-3], [KDD-deploy-9], [CD-DM-8], [CD-DM-13], [CD-DM-15], [CD-COM-3], [CD-COM-6] |
| WP-8 | [3.9-2], [3.9-4], [CD-DM-11], [CD-DM-12] |
| WP-9 | [1.3-1], [1.3-3], [1.3-6], [CD-SF-1], [CD-SF-4], [CD-SF-11] |
| WP-10 | [1.3-5], [1.3-7], [5.3-1] through [5.3-5], [6.4-1] through [6.4-4], [KDD-sf-1], [CD-SF-2], [CD-SF-3], [CD-SF-12] |
| WP-11 | [1.3-2], [1.3-4], [1.3-5], [KDD-sf-2], [CD-SF-5], [CD-SF-6], [CD-SF-7] |
| WP-12 | [5.4-1] through [5.4-4], [KDD-sf-3], [CD-SF-8], [CD-SF-9], [CD-SF-10], [CD-COM-8], [3.8.1-6] |
| WP-13 | [3.8.1-4], [3.8.1-6], [KDD-sf-3], [CD-SF-10] |
| WP-14 | [CD-SF-1], docs/requirements/Component-StoreAndForward.md Dependencies |
| WP-15 | [KDD-sf-4], [CD-SF-7] |
| WP-16 | [3.9-6], [KDD-deploy-11] |

Reverse check result: PASS — no untraceable work packages.

Split-Section Check

| Section | Phase 3C Covers | Other Phase Covers | Gap? |
| --- | --- | --- | --- |
| 1.4 | [1.4-1] through [1.4-4] (all bullets — backend pipeline) | Phase 6: deployment UI triggers and status display | No gap |
| 1.5 | [1.5-1] through [1.5-3] (all bullets — backend pipeline) | Phase 6: artifact deployment UI | No gap |
| 3.8.1 | [3.8.1-1] through [3.8.1-8] (all bullets — backend commands) | Phase 4: lifecycle command UI | No gap |
| 3.9 | [3.9-1], [3.9-2], [3.9-3], [3.9-5], [3.9-6] | Phase 6: [3.9-4] (diff view UI), deployment trigger UI | No gap |
| 5.3 | [5.3-1] through [5.3-5] (S&F engine) | Phase 7: External System Gateway delivery integration, error classification | No gap |
| 5.4 | [5.4-1] through [5.4-4] (backend query/command handling) | Phase 6: parked message management UI | No gap |
| 6.4 | [6.4-1] through [6.4-4] (S&F engine) | Phase 7: Notification Service delivery integration | No gap |

Split-section check result: PASS — no unowned bullets.

Negative Requirement Check

| Negative Requirement | Acceptance Criterion | Adequate? |
| --- | --- | --- |
| [1.3-1] Central does not buffer | Test verifies no S&F infrastructure on central; unreachable site = immediate failure | Yes |
| [1.3-6] No maximum buffer size | Test submits messages continuously, verifies no count-based rejection | Yes |
| [3.8.1-6] S&F messages not cleared on deletion | Test deletes instance, verifies messages still exist and deliver | Yes |
| [3.8.1-7] Delete fails if unreachable | Test attempts delete to offline site, verifies failure and central status unchanged | Yes |
| [3.8.1-8] Templates cannot be deleted with references | Test attempts deletion of referenced template, verifies rejection | Yes |
| [3.9-1] Changes not auto-propagated | Test changes template, verifies deployed instance unchanged | Yes |
| [3.9-5] No rollback | Verifies no rollback mechanism; only current state tracked | Yes |
| [CD-SF-3] Permanent failures not buffered | Test submits permanent failure, verifies not queued | Yes |

Negative requirement check result: PASS — all prohibitions have verification criteria.


Codex MCP Verification

Model: gpt-5.4
Result: Pass with corrections

Step 1 — Requirements Coverage Review

Codex identified 10 findings. Disposition:

| # | Finding | Disposition |
| --- | --- | --- |
| 1 | Naming collision detection and device tag resolution exclusion missing from WP-1 | Corrected — added naming collision detection to WP-1 acceptance criteria; added explicit exclusion of device tag resolution. |
| 2 | Shared script pre-compilation validation missing from WP-7 | Corrected — added shared script validation acceptance criterion to WP-7. |
| 3 | Role overlap (user may hold both Design+Deployment) not verified | Dismissed — this is a Phase 1 Security & Auth concern. Phase 3C assumes the auth model works correctly. Role overlap is tested in Phase 1 integration tests. |
| 4 | WP-4 traces [3.8.1-2] but doesn't verify runtime activation | Dismissed — WP-4 owns the state transition matrix. Runtime behavior of "enabled" (subscriptions, triggers, alarms running) is the responsibility of Phase 3B Site Runtime, which creates Instance Actors with full initialization. WP-6 verifies enable recreates the actor. |
| 5 | Enable flow underspecified (should verify actor recreation with subscriptions) | Corrected — expanded WP-6 enable criteria to explicitly verify actor creation, subscription restoration, script triggers, and alarm evaluation. |
| 6 | Command ID described as "correlation" but source says "deduplication" | Corrected — changed wording to "deduplication" with acceptance criterion that duplicate commands are recognized and not re-applied. |
| 7 | Disable/enable unreachable failure not explicitly covered | Corrected — added acceptance criterion that disable and enable fail immediately if site unreachable. |
| 8 | Diff "show" requirement only partially verified (compute, not expose) | Dismissed — Phase 3C provides the backend API for diff computation and staleness detection. The "show" (UI) aspect is explicitly deferred to Phase 6 per the split-section note. WP-8 correctly scopes to backend. |
| 9 | Parked message management UI not verified | Dismissed — same as #8. Phase 3C builds the site-side backend (query handler, retry/discard commands). Phase 6 builds the central UI. Split documented in plan. |
| 10 | "near-complete copy" weakens HighLevelReqs "seamless" wording | Corrected — updated WP-11 to reference [1.3-4] for the seamless takeover requirement, with a note that [CD-SF-7] acknowledges the async replication trade-off (rare duplicates/misses). The component design explicitly documents this as an acceptable trade-off; HighLevelReqs 1.3 bullet 4 does not preclude it since "seamlessly" refers to the takeover process, not data completeness. |

Step 2 — Negative Requirement Review

Not submitted separately; negative requirements were included in Step 1 review. All negative requirements have adequate acceptance criteria per the orphan check.

Step 3 — Split-Section Gap Review

Not submitted separately; split sections were documented in the plan and reviewed in Step 1. No gaps identified.