# Phase 3B: Site I/O & Observability — Implementation Plan

**Date**: 2026-03-16
**Status**: Draft
**Predecessor**: Phase 3A (Runtime Foundation & Persistence Model)

---

## Scope

Phase 3B brings the site cluster to life as a fully operational data collection, scripting, alarm evaluation, and health reporting platform. Upon completion, a site can:

- Communicate bidirectionally with the central cluster using all 8 message patterns.
- Connect to OPC UA servers and LmxProxy endpoints, subscribe to tags, and deliver values to Instance Actors.
- Execute scripts in response to triggers (interval, value change, conditional).
- Evaluate alarm conditions, manage alarm state, and execute on-trigger scripts.
- Compile and execute shared scripts inline.
- Report health metrics to central with monotonic sequence numbers and offline detection.
- Record operational events to local SQLite with retention enforcement.
- Support remote event log queries from central.
- Stream live debug data (attribute values + alarm states) on demand.
### Components Included

| Component | Scope |
|-----------|-------|
| Central-Site Communication | Full — all 8 message patterns, correlation IDs, per-pattern timeouts, transport heartbeat |
| Data Connection Layer | Full — IDataConnection, OPC UA adapter, LmxProxy adapter, connection actor, auto-reconnect, write-back, tag path resolution, health reporting |
| Site Runtime | Full runtime — Script Actor, Alarm Actor, shared scripts, Script Runtime API (core operations), script trust model, site-wide Akka stream |
| Health Monitoring | Site-side collection + central-side aggregation and offline detection |
| Site Event Logging | Event recording, retention/purge, remote query with pagination |

---

## Prerequisites

| Dependency | Phase | What Must Be Complete |
|------------|-------|----------------------|
| Cluster Infrastructure | 3A | Akka.NET cluster, SBR, singleton, CoordinatedShutdown |
| Host (site role) | 3A | Site-role Akka bootstrap |
| Site Runtime skeleton | 3A | Deployment Manager singleton, basic Instance Actor, supervision strategies, staggered startup |
| Local SQLite persistence | 3A | Deployed config storage, static attribute override persistence |
| Commons | 0 | IDataConnection interface, message contracts, shared types |
| Configuration Database | 1 | Central-side repositories (for health metric storage if needed) |

---

## Requirements Checklist

Each bullet extracted from docs/requirements/HighLevelReqs.md at the individual requirement level. Checkbox items must each map to at least one work package.

### Section 2.2 — Communication: Central <-> Site

- [ ] `[2.2-1]` Central-to-site and site-to-central communication uses Akka.NET (remoting/cluster).
- [ ] `[2.2-2]` Central as integration hub: central brokers requests between external systems and sites (e.g., recipe to site, MES requests machine values).
- [ ] `[2.2-3]` Real-time data streaming is not continuous for all machine data.
- [ ] `[2.2-4]` Only real-time stream is on-demand debug view — engineer opens live view of specific instance's tag values and alarm states.
- [ ] `[2.2-5]` Debug view is session-based and temporary.
- [ ] `[2.2-6]` Debug view subscribes to site-wide Akka stream filtered by instance (see Section 8.1).

### Section 2.3 — Site-Level Storage & Interface

- [ ] `[2.3-1]` Sites have no user interface — headless collectors, forwarders, and script executors.
- [ ] `[2.3-2]` Sites require local storage for: deployed (flattened) configurations, deployed scripts, shared scripts, external system definitions, database connection definitions, and notification lists.
- [ ] `[2.3-3]` Store-and-forward buffers persisted to local SQLite on each node and replicated between nodes. *(Phase 3C owns S&F engine and buffer persistence/replication. Not in scope for Phase 3B — listed here for split-section completeness only.)*

### Section 2.4 — Data Connection Protocols

- [ ] `[2.4-1]` System supports OPC UA and LmxProxy (gRPC-based custom protocol with existing client SDK).
- [ ] `[2.4-2]` Both protocols implement a common interface supporting: connect, subscribe to tag paths, receive value updates, and write values.
- [ ] `[2.4-3]` Additional protocols can be added by implementing the common interface.
- [ ] `[2.4-4]` Data Connection Layer is a clean data pipe — publishes tag value updates to Instance Actors but performs no evaluation of triggers or alarm conditions.

### Section 2.5 — Scale (context only for this phase)

- [ ] `[2.5-1]` Approximately 10 sites. *(Validated in Phase 8; informs design here.)*
- [ ] `[2.5-2]` 50-500 machines per site. *(Validated in Phase 8; informs staggered startup batch sizing.)*
- [ ] `[2.5-3]` 25-75 live data point tags per machine. *(Validated in Phase 8; informs subscription management design.)*

### Section 3.4.1 — Alarm State

- [ ] `[3.4.1-1]` Alarm state (active/normal) is managed at the site level per instance, held in memory by the Alarm Actor.
- [ ] `[3.4.1-2]` When alarm condition clears, alarm automatically returns to normal state — no acknowledgment workflow.
- [ ] `[3.4.1-3]` Alarm state is not persisted — on restart, alarm states are re-evaluated from incoming values.
- [ ] `[3.4.1-4]` Alarm state changes published to site-wide Akka stream as `[InstanceUniqueName].[AlarmName]`, alarm state (active/normal), priority, timestamp.

### Section 4.1 — Script Definitions (Phase 3B portion: runtime compilation/execution)

- [ ] `[4.1-5]` Scripts are compiled at the site when a deployment is received. Pre-compilation validation occurs at central (Phase 2), but site performs actual compilation for execution.
- [ ] `[4.1-6]` Scripts can optionally define input parameters (name and data type per parameter). Scripts without parameter definitions accept no arguments.
- [ ] `[4.1-7]` Scripts can optionally define a return value definition (field names and data types). Return values support single objects and lists of objects. Scripts without a return definition return void.
- [ ] `[4.1-8]` Return values used when scripts called by other scripts (CallScript, CallShared) or by Inbound API (Route.To().Call()). When invoked by trigger, return value is discarded.

**Phase 2 owns**: `[4.1-1]` scripts are C# defined at template level, `[4.1-2]` inheritance/override/lock rules, `[4.1-3]` deployed as part of flattened config, `[4.1-4]` script definitions as first-class template members.

### Section 4.2 — Script Triggers

- [ ] `[4.2-1]` Interval trigger: execute on recurring time schedule.
- [ ] `[4.2-2]` Value Change trigger: execute when a specific instance attribute value changes.
- [ ] `[4.2-3]` Conditional trigger: execute when an instance attribute value equals or does not equal a given value.
- [ ] `[4.2-4]` Optional minimum time between runs — if trigger fires before minimum interval has elapsed since last execution, invocation is skipped.
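
The interaction between the conditional trigger (`[4.2-3]`) and the minimum time between runs (`[4.2-4]`) can be sketched as follows. This is an illustrative Python model only, not the planned C# Script Actor. The class name `ConditionalTrigger` and its fields are hypothetical. The sketch also assumes two things the requirements do not state: that "last execution" means the start time of the last run, and that the trigger is evaluated on every attribute update while the condition holds.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ConditionalTrigger:
    """Hypothetical model of a conditional trigger ([4.2-3]) with the
    optional minimum time between runs ([4.2-4])."""
    expected: Any                           # value the attribute is compared against
    negate: bool = False                    # False: run on equals; True: run on not-equals
    min_interval_s: Optional[float] = None  # None means no minimum interval
    last_run_s: Optional[float] = None      # start time of the last execution (assumption)

    def condition_met(self, value: Any) -> bool:
        matches = value == self.expected
        return (not matches) if self.negate else matches

    def should_run(self, value: Any, now_s: float) -> bool:
        """Return True (and record the run) if this update should execute the script."""
        if not self.condition_met(value):
            return False
        too_soon = (
            self.min_interval_s is not None
            and self.last_run_s is not None
            and now_s - self.last_run_s < self.min_interval_s
        )
        if too_soon:
            return False  # invocation is skipped, not queued for later
        self.last_run_s = now_s
        return True


trigger = ConditionalTrigger(expected="Running", min_interval_s=10.0)
print(trigger.should_run("Running", now_s=0.0))   # True: first qualifying update
print(trigger.should_run("Running", now_s=4.0))   # False: only 4 s since last run
print(trigger.should_run("Stopped", now_s=11.0))  # False: condition not met
print(trigger.should_run("Running", now_s=12.0))  # True: 12 s since last run
```

Keeping the gate as a pure decision over the incoming value and the current time makes the skip rule testable without timers or actors.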
### Section 4.3 — Script Error Handling

- [ ] `[4.3-1]` If a script fails (unhandled exception, timeout, etc.), the failure is logged locally at the site.
- [ ] `[4.3-2]` The script is not disabled — remains active and will fire on next qualifying trigger event.
- [ ] `[4.3-3]` Script failures are not reported to central. Diagnostics are local only. *(Except aggregated error rate via Health Monitoring.)*
- [ ] `[4.3-4]` For external system call failures within scripts, store-and-forward handling (Section 5.3) applies independently of script error handling. *(S&F integration is Phase 7; noted here as boundary.)*

### Section 4.4 — Script Capabilities (Phase 3B portion)

- [ ] `[4.4-1]` Read attribute values on that instance (live data points and static config).
- [ ] `[4.4-2]` Write attributes — for attributes with data source reference, write goes to DCL which writes to physical device; in-memory value updates when device confirms via existing subscription.
- [ ] `[4.4-3]` Write attributes — for static attributes, write updates in-memory value and persists override to local SQLite; value survives restart and failover; persisted overrides reset on redeployment.
- [ ] `[4.4-4]` CallScript with ask pattern — `Instance.CallScript("scriptName", params)` returns called script's return value; supports concurrent execution.
- [ ] `[4.4-5]` CallShared — `Scripts.CallShared("scriptName", params)` executes inline in calling Script Actor's context; compiled code libraries, not separate actors.
- [ ] `[4.4-10]` Scripts cannot access other instances' attributes or scripts. *(Negative requirement.)*

**Phase 7 owns**: `[4.4-6]` ExternalSystem.Call(), `[4.4-7]` ExternalSystem.CachedCall(), `[4.4-8]` Send notifications, `[4.4-9]` Database.Connection().

### Section 4.4.1 — Script Call Recursion Limit

- [ ] `[4.4.1-1]` Script-to-script calls (CallScript and CallShared) enforce maximum recursion depth.
- [ ] `[4.4.1-2]` Default maximum depth is a reasonable limit (e.g., 10 levels).
- [ ] `[4.4.1-3]` Current call depth is tracked and incremented with each nested call.
- [ ] `[4.4.1-4]` If limit reached, call fails with error logged to site event log.
- [ ] `[4.4.1-5]` Applies to all script call chains including alarm on-trigger scripts calling instance scripts.

### Section 4.5 — Shared Scripts (Phase 3B portion: runtime)

- [ ] `[4.5-1]` Shared scripts are not associated with any template — system-wide library of reusable C# scripts.
- [ ] `[4.5-2]` Shared scripts can optionally define input parameters and return value definitions, same rules as template-level scripts.
- [ ] `[4.5-3]` Deployed to all sites for use by any instance script (deployment requires explicit action by Deployment role user). *(Deployment mechanism is Phase 3C; this phase implements site-side reception and compilation.)*
- [ ] `[4.5-4]` Shared scripts execute inline in calling Script Actor's context as compiled code — not separate actors. Avoids serialization bottlenecks and messaging overhead.
- [ ] `[4.5-5]` Shared scripts are not available on the central cluster — Inbound API scripts cannot call them directly. *(Negative requirement; verified as boundary.)*

### Section 4.6 — Alarm On-Trigger Scripts

- [ ] `[4.6-1]` Alarm on-trigger scripts defined as part of alarm definition, execute when alarm activates.
- [ ] `[4.6-2]` Execute directly in Alarm Actor's context (via short-lived Alarm Execution Actor), similar to shared scripts executing inline.
- [ ] `[4.6-3]` Alarm on-trigger scripts can call instance scripts via `Instance.CallScript()` — sends ask message to sibling Script Actor.
- [ ] `[4.6-4]` Instance scripts cannot call alarm on-trigger scripts — call direction is one-way. *(Negative requirement.)*
- [ ] `[4.6-5]` Recursion depth limit applies to alarm-to-instance script call chains.

### Section 8.1 — Debug View

- [ ] `[8.1-1]` Subscribe-on-demand: engineer opens debug view, central subscribes to site-wide Akka stream filtered by instance unique name.
- [ ] `[8.1-2]` Site first provides a snapshot of all current attribute values and alarm states from Instance Actor.
- [ ] `[8.1-3]` Then streams subsequent changes from Akka stream.
- [ ] `[8.1-4]` Attribute value stream messages: `[InstanceUniqueName].[AttributePath].[AttributeName]`, value, quality, timestamp.
- [ ] `[8.1-5]` Alarm state stream messages: `[InstanceUniqueName].[AlarmName]`, state (active/normal), priority, timestamp.
- [ ] `[8.1-6]` Stream continues until engineer closes debug view; central unsubscribes and site stops streaming.
- [ ] `[8.1-7]` No attribute/alarm selection — debug view always shows all tag values and alarm states for the instance.
- [ ] `[8.1-8]` No special concurrency limits required.

### Section 11.1 — Monitored Metrics

- [ ] `[11.1-1]` Site cluster online/offline status — whether site is reachable.
- [ ] `[11.1-2]` Active vs. standby node status — which node is active, which is standby.
- [ ] `[11.1-3]` Data connection health — connected/disconnected status per data connection.
- [ ] `[11.1-4]` Script error rates — frequency of script failures at site.
- [ ] `[11.1-5]` Alarm evaluation errors — frequency of alarm evaluation failures at site.
- [ ] `[11.1-6]` Store-and-forward buffer depth — number of messages currently queued, broken down by external system calls, notifications, and cached database writes. *(S&F engine is Phase 3C; 3B reports placeholder/zero until S&F exists.)*

### Section 11.2 — Health Reporting

- [ ] `[11.2-1]` Site clusters report health metrics to central periodically.
- [ ] `[11.2-2]` Health status is visible in the central UI — no automated alerting/notifications for now.

### Section 12.1 — Events Logged

- [ ] `[12.1-1]` Script executions: start, complete, error (with error details).
- [ ] `[12.1-2]` Alarm events: alarm activated, alarm cleared (which alarm, which instance, when). Alarm evaluation errors.
- [ ] `[12.1-3]` Deployment applications: configuration received from central, applied successfully or failed. Script compilation results.
- [ ] `[12.1-4]` Data connection status changes: connected, disconnected, reconnected per connection.
- [ ] `[12.1-5]` Store-and-forward activity: message queued, delivered, retried, parked. *(S&F engine is Phase 3C; event logging API is available, S&F calls it when implemented.)*
- [ ] `[12.1-6]` Instance lifecycle: instance enabled, disabled, deleted.

### Section 12.2 — Event Log Storage

- [ ] `[12.2-1]` Event logs stored in local SQLite on each site node.
- [ ] `[12.2-2]` Retention policy: 30 days. Events older than 30 days automatically purged.

### Section 12.3 — Central Access to Event Logs

- [ ] `[12.3-1]` Central UI can query site event logs remotely, following same pattern as parked message management — central requests data from site over Akka.NET remoting. *(UI is Phase 6; backend query mechanism implemented here.)*

---

## Design Constraints Checklist

Constraints from CLAUDE.md Key Design Decisions (KDD) and Component-*.md (CD) that impose implementation requirements beyond HighLevelReqs.

### Runtime & Actor Architecture

- [ ] `[KDD-runtime-2]` Site Runtime actor hierarchy: Deployment Manager singleton -> Instance Actors -> Script Actors + Alarm Actors. *(Hierarchy established in 3A; 3B adds Script/Alarm Actor children.)*
- [ ] `[KDD-runtime-3]` Script Actors spawn short-lived Script Execution Actors on a dedicated blocking I/O dispatcher.
- [ ] `[KDD-runtime-4]` Alarm Actors are a separate peer subsystem from scripts (not inside Script Engine).
- [ ] `[KDD-runtime-5]` Shared scripts execute inline as compiled code (no separate actors).
- [ ] `[KDD-runtime-6]` Site-wide Akka stream for attribute value and alarm state changes with per-subscriber buffering.
- [ ] `[KDD-runtime-7]` Instance Actors serialize all state mutations (Akka actor model); concurrent scripts produce interleaved side effects.
- [ ] `[KDD-runtime-9]` Supervision: Resume for coordinator actors (Script Actor, Alarm Actor), Stop for short-lived execution actors. *(Strategy defined in 3A; 3B implements the actual actor types.)*

### Data & Communication

- [ ] `[KDD-data-1]` DCL connection actor uses Become/Stash pattern for lifecycle state machine (Connecting -> Connected -> Reconnecting).
- [ ] `[KDD-data-2]` DCL auto-reconnect at fixed interval; immediate bad quality on disconnect; transparent re-subscribe.
- [ ] `[KDD-data-3]` DCL write failures returned synchronously to calling script.
- [ ] `[KDD-data-4]` Tag path resolution retried periodically for devices still booting.
- [ ] `[KDD-data-7]` Tell for hot-path internal communication (tag value updates, attribute change notifications, stream publishing); Ask reserved for system boundaries (CallScript, Route.To, debug snapshot).
- [ ] `[KDD-data-8]` Application-level correlation IDs on all request/response messages (deployment ID, command ID, query ID).

### Script Trust Model

- [ ] `[KDD-code-9]` Script trust model: forbidden APIs — System.IO, Process, Threading (except async/await), Reflection, raw network (System.Net.Sockets, System.Net.Http). Enforced at compilation and runtime.

### Health & UI

- [ ] `[KDD-ui-2]` Real-time push for debug view and health dashboard. *(Backend streaming support; UI rendering is Phase 6.)*
- [ ] `[KDD-ui-3]` Health reports: 30s interval, 60s offline threshold, monotonic sequence numbers, raw error counts per interval.
- [ ] `[KDD-ui-4]` Dead letter monitoring as a health metric.
- [ ] `[KDD-ui-5]` Site Event Logging: 30-day retention, 1GB storage cap, daily purge, paginated queries with keyword search.

### LmxProxy Protocol Details

- [ ] `[CD-DCL-1]` LmxProxy: gRPC/HTTP/2 transport, protobuf-net code-first, port 5050.
- [ ] `[CD-DCL-2]` LmxProxy: API key auth, session-based (SessionId), 30s keep-alive heartbeat via `GetConnectionStateAsync`.
- [ ] `[CD-DCL-3]` LmxProxy: Server-streaming gRPC for subscriptions (`IAsyncEnumerable`), 1000ms default sampling, on-change with 0.
- [ ] `[CD-DCL-4]` LmxProxy: SDK retry policy (exponential backoff via Polly) complements DCL's fixed-interval reconnect. SDK handles operation-level transient failures; DCL handles connection-level recovery.
- [ ] `[CD-DCL-5]` LmxProxy: Batch read/write capabilities (ReadBatchAsync, WriteBatchAsync, WriteBatchAndWaitAsync).
- [ ] `[CD-DCL-6]` LmxProxy: TLS 1.2/1.3, mutual TLS (client cert + key PEM), custom CA trust, self-signed for dev.

### Communication Component Design

- [ ] `[CD-Comm-1]` 8 distinct message patterns: Deployment, Instance Lifecycle, System-Wide Artifact, Integration Routing, Recipe/Command Delivery, Debug Streaming, Health Reporting, Remote Queries.
- [ ] `[CD-Comm-2]` Per-pattern timeouts: Deployment 120s, Instance Lifecycle 30s, System-Wide Artifacts 120s/site, Integration Routing 30s, Recipe/Command 30s, Remote Queries 30s.
- [ ] `[CD-Comm-3]` Transport heartbeat explicitly configured (not framework defaults).
- [ ] `[CD-Comm-4]` Message ordering: Akka.NET guarantees sender/receiver pair ordering; Communication Layer relies on this.
- [ ] `[CD-Comm-5]` Connection failure: in-flight messages fail via ask timeout, no central buffering. Debug streams killed on interruption — engineer must reopen.
- [ ] `[CD-Comm-6]` Failover: central failover = in-progress deployments treated as failed; site failover = singleton restarts, debug streams interrupted.

### Site Event Logging Component Design

- [ ] `[CD-SEL-1]` Event entry schema: timestamp, event type, severity, instance ID (optional), source, message, details (optional).
- [ ] `[CD-SEL-2]` Only active node generates and stores events. Event logs not replicated to standby. On failover, new active starts fresh log; old node's events unavailable until it comes back.
- [ ] `[CD-SEL-3]` Storage cap (default 1 GB) enforced — if reached before 30-day window, oldest events purged first.
- [ ] `[CD-SEL-4]` Queries support filtering by: event type/category, time range, instance ID, severity, keyword search (SQLite LIKE on message and source).
- [ ] `[CD-SEL-5]` Results paginated (default 500 events) with continuation token.

### Health Monitoring Component Design

- [ ] `[CD-HM-1]` Health report is flat snapshot of all metrics + monotonic sequence number + report timestamp.
- [ ] `[CD-HM-2]` Central replaces previous state only if incoming sequence number > last received (prevents stale report overwrite).
- [ ] `[CD-HM-3]` Online recovery: receipt of report from offline site automatically marks it online.
- [ ] `[CD-HM-4]` Error rates as raw counts per reporting interval, reset after each report.
- [ ] `[CD-HM-5]` Tag resolution counts: per connection, total subscribed vs. successfully resolved.
- [ ] `[CD-HM-6]` Health metrics held in memory at central — no historical data persisted.
- [ ] `[CD-HM-7]` No alerting — display-only for now.

### Site Runtime Component Design (beyond HighLevelReqs)

- [ ] `[CD-SR-1]` Script Execution Actor receives: compiled script code, input parameters, reference to parent Instance Actor, current call depth.
- [ ] `[CD-SR-2]` Alarm evaluation: Value Match (equals predefined), Range Violation (outside min/max), Rate of Change (exceeds threshold).
- [ ] `[CD-SR-3]` On alarm clear, no script execution — only state transition.
- [ ] `[CD-SR-4]` Script compilation errors on deployment cause entire instance deployment to be rejected (no partial state).
- [ ] `[CD-SR-5]` Script error includes: unhandled exceptions, timeouts, recursion limit violations.
- [ ] `[CD-SR-6]` Alarm evaluation errors logged locally; Alarm Actor remains active for subsequent updates.
- [ ] `[CD-SR-7]` Site-wide stream uses per-subscriber bounded buffers. Slow subscriber drops oldest events, does not block publishers.
- [ ] `[CD-SR-8]` Instance Actors publish to stream with fire-and-forget — publishing never blocks the actor.
- [ ] `[CD-SR-9]` Alarm Execution Actor can call instance scripts; instance scripts cannot call alarm on-trigger scripts (enforced at runtime).
- [ ] `[CD-SR-10]` Execution timeout per script is configurable. Exceeding timeout cancels script and logs error.
- [ ] `[CD-SR-11]` Memory: scripts share host process memory. No per-script memory limit.
- [ ] `[CD-SR-12]` Script trust model enforced by restricting assemblies/namespaces available to compilation context.

### Data Connection Layer Component Design (beyond HighLevelReqs)

- [ ] `[CD-DCL-7]` Connection actor Become/Stash states: Connecting (stash requests), Connected (unstash and process), Reconnecting (stash new requests).
- [ ] `[CD-DCL-8]` On connection drop, immediately push bad quality for every tag subscribed on that connection.
- [ ] `[CD-DCL-9]` Auto-reconnect interval configurable per data connection.
- [ ] `[CD-DCL-10]` Tag path resolution failure: log to event log, mark attribute bad quality, periodically retry at configurable interval.
- [ ] `[CD-DCL-11]` Write failure: error returned to calling script; also logged to site event logging. No S&F for device writes.
- [ ] `[CD-DCL-12]` Value update message format: tag path, value, quality (good/bad/uncertain), timestamp.
- [ ] `[CD-DCL-13]` When Instance Actor stopped, DCL cleans up associated subscriptions.
- [ ] `[CD-DCL-14]` On redeployment, subscriptions established fresh based on new configuration.
- [ ] `[CD-DCL-15]` LmxProxy connection actor holds SessionId, starts 30s keep-alive timer on Connected state. On keep-alive failure, transitions to Reconnecting, client disposes subscriptions.

---

## Work Packages

### WP-1: Communication Layer — Message Contracts & Correlation IDs

**Description**: Define all message contracts for the 8 communication patterns with application-level correlation IDs.

**Acceptance Criteria**:

- Message contract types defined in Commons/Messages for all 8 patterns: Deployment request/response, Instance Lifecycle command/response, System-Wide Artifact deploy/ack, Integration Routing request/response, Recipe/Command request/ack, Debug Subscribe/Unsubscribe/Snapshot/StreamMessage, Health Report, Remote Query request/response (event logs, parked messages).
- All request/response message pairs include a correlation ID field (deployment ID, command ID, query ID).
- Contracts follow additive-only versioning rules (REQ-COM-5a).
- All timestamps in message contracts are UTC.

**Estimated Complexity**: M

**Requirements Traced**: `[2.2-1]`, `[2.2-2]`, `[KDD-data-8]`, `[CD-Comm-1]`

---

### WP-2: Communication Layer — Per-Pattern Timeouts

**Description**: Implement configurable per-pattern timeout support for all request/response patterns using the Akka ask pattern.

**Acceptance Criteria**:

- Timeout configuration via options class (bound to appsettings.json section).
- Default values: Deployment 120s, Instance Lifecycle 30s, System-Wide Artifacts 120s/site, Integration Routing 30s, Recipe/Command 30s, Remote Queries 30s.
- Timeout exceeded produces a clear failure result (not an unhandled exception).
- Integration test: verify timeout fires at configured interval.

**Estimated Complexity**: S

**Requirements Traced**: `[CD-Comm-2]`

---

### WP-3: Communication Layer — Transport Heartbeat Configuration

**Description**: Explicitly configure Akka.NET remoting transport heartbeat settings (not framework defaults).

**Acceptance Criteria**:

- Transport heartbeat interval explicitly set in Akka.NET HOCON config.
- Failure detection threshold explicitly set.
- Values configurable via appsettings (not hardcoded).
- Settings documented in site and central appsettings templates.

**Estimated Complexity**: S

**Requirements Traced**: `[CD-Comm-3]`

---

### WP-4: Communication Layer — All 8 Message Patterns Implementation

**Description**: Implement central-side and site-side actors/handlers for all 8 communication patterns.

**Acceptance Criteria**:

- Pattern 1 (Deployment): Central sends flattened config, site responds success/failure. Unreachable site fails immediately.
- Pattern 2 (Instance Lifecycle): Central sends disable/enable/delete, site responds. Unreachable site fails immediately.
- Pattern 3 (System-Wide Artifacts): Central broadcasts to all sites, each site acknowledges independently.
- Pattern 4 (Integration Routing): Central brokers external request to site and returns response.
- Pattern 5 (Recipe/Command): Central routes fire-and-forget with ack.
- Pattern 6 (Debug Streaming): Subscribe request, snapshot response, then continuous stream. Unsubscribe request stops stream.
- Pattern 7 (Health Reporting): Site periodically pushes health report (Tell, no response needed).
- Pattern 8 (Remote Queries): Central queries site for event logs / parked messages, site responds.
- Message ordering preserved per sender/receiver pair (Akka guarantee relied upon).
- Sites do not communicate with each other — all messages hub-and-spoke through central.

**Estimated Complexity**: L

**Requirements Traced**: `[2.2-1]`, `[2.2-2]`, `[2.2-3]`, `[2.2-4]`, `[2.2-5]`, `[2.2-6]`, `[CD-Comm-1]`, `[CD-Comm-4]`, `[CD-Comm-5]`, `[CD-Comm-6]`

---

### WP-5: Communication Layer — Connection Failure & Failover Behavior

**Description**: Implement connection failure handling and failover behavior for the communication layer.

**Acceptance Criteria**:

- In-flight messages: on connection drop, ask pattern times out and caller receives failure. No central-side buffering or retry.
- Debug streams: connection interruption kills the stream. Engineer must reopen debug view.
- Central failover: in-progress deployments treated as failed.
- Site failover: singleton restarts, central detects node change and reconnects. Debug streams interrupted.

**Estimated Complexity**: M

**Requirements Traced**: `[CD-Comm-5]`, `[CD-Comm-6]`

---

### WP-6: Data Connection Layer — Connection Actor with Become/Stash Lifecycle

**Description**: Implement the connection actor using Akka.NET Become/Stash pattern for lifecycle state machine.

**Acceptance Criteria**:

- Three states implemented: Connecting, Connected, Reconnecting.
- In Connecting state: subscription requests and write commands are stashed.
- On transition to Connected: all stashed messages unstashed and processed.
- In Reconnecting state: new requests stashed while retry occurs.
- State transitions logged to Site Event Logging (`[12.1-4]`).
- One connection actor per data connection definition at the site.

**Estimated Complexity**: M

**Requirements Traced**: `[KDD-data-1]`, `[CD-DCL-7]`

---

### WP-7: Data Connection Layer — OPC UA Adapter

**Description**: Implement the OPC UA adapter conforming to IDataConnection.

**Acceptance Criteria**:

- Implements all IDataConnection methods: Connect, Disconnect, Subscribe, Unsubscribe, Read, Write, Status.
- OPC UA client establishes session with configured endpoint.
- Subscribe creates OPC UA monitored items.
- Value updates delivered as `{tagPath, value, quality, timestamp}` tuples.
- Write operation sends value to OPC UA server.
- Status reports connection state (connected/disconnected/reconnecting).
- Integration test against OPC PLC simulator (from test infrastructure).

**Estimated Complexity**: L

**Requirements Traced**: `[2.4-1]`, `[2.4-2]`, `[CD-DCL-12]`

---

### WP-8: Data Connection Layer — LmxProxy Adapter

**Description**: Implement the LmxProxy adapter wrapping the existing `LmxProxyClient` SDK behind IDataConnection.

**Acceptance Criteria**:

- Implements all IDataConnection methods mapped per docs/requirements/Component-DCL concrete type mappings.
- Connect: calls `ConnectAsync`, stores SessionId.
- Subscribe: calls `SubscribeAsync`, processes `IAsyncEnumerable` stream, forwards updates.
- Write: calls `WriteAsync`.
- Read: calls `ReadAsync`.
- Configurable sampling interval (default 1000ms, 0 = on-change).
- gRPC/HTTP/2 transport on configured port (default 5050).
- API key authentication passed in ConnectRequest.
- TLS support: TLS 1.2/1.3, mutual TLS, custom CA trust, self-signed for dev.
- 30s keep-alive heartbeat via `GetConnectionStateAsync`. On failure, marks disconnected, disposes subscriptions.
- SDK retry policy (Polly exponential backoff) retained for operation-level transient failures.
- Batch operations exposed (ReadBatchAsync, WriteBatchAsync) for future use.

**Estimated Complexity**: L

**Requirements Traced**: `[2.4-1]`, `[2.4-2]`, `[CD-DCL-1]`, `[CD-DCL-2]`, `[CD-DCL-3]`, `[CD-DCL-4]`, `[CD-DCL-5]`, `[CD-DCL-6]`, `[CD-DCL-15]`

---

### WP-9: Data Connection Layer — Auto-Reconnect & Bad Quality Propagation

**Description**: Implement auto-reconnection at fixed interval with immediate bad quality propagation on disconnect.

**Acceptance Criteria**:

- On connection drop: immediately push value update with quality `bad` for every tag subscribed on that connection.
- Auto-reconnect at configurable fixed interval per data connection (e.g., 5 seconds default).
- Reconnect interval is per-connection, not global.
- Connection state tracked as connected/disconnected/reconnecting.
- All state transitions logged to Site Event Logging.
- Instance Actors and downstream consumers see staleness immediately on disconnect.

**Estimated Complexity**: M

**Requirements Traced**: `[KDD-data-2]`, `[CD-DCL-8]`, `[CD-DCL-9]`

---

### WP-10: Data Connection Layer — Transparent Re-Subscribe

**Description**: On successful reconnection, automatically re-establish all previously active subscriptions.

**Acceptance Criteria**:

- After reconnection, all subscriptions that were active before disconnect are re-subscribed.
- Instance Actors require no action — they see quality return to good as fresh values arrive.
- LmxProxy adapter: new session established, new subscriptions created (old session/subscriptions were disposed on disconnect).
- OPC UA adapter: new session established, monitored items re-created.
- Test: disconnect OPC UA server, reconnect, verify values resume without Instance Actor intervention.

**Estimated Complexity**: M

**Requirements Traced**: `[KDD-data-2]`, `[2.4-2]`

---

### WP-11: Data Connection Layer — Write-Back Support

**Description**: Implement write-back from Instance Actors through DCL to physical devices.

**Acceptance Criteria**:

- Instance Actor sends write request to DCL when script calls SetAttribute for data-connected attribute.
- DCL writes value via appropriate protocol (OPC UA Write / LmxProxy WriteAsync).
- Write failure (connection down, device rejection, timeout) returned synchronously to calling script.
- Successful write: in-memory value NOT optimistically updated. Value updates only when device confirms via existing subscription.
- Write failures also logged to Site Event Logging.
- No store-and-forward for device writes.
- Test: script writes value, verify value update arrives only after device confirms.

**Estimated Complexity**: M

**Requirements Traced**: `[4.4-2]`, `[KDD-data-3]`, `[CD-DCL-11]`

---

### WP-12: Data Connection Layer — Tag Path Resolution with Retry

**Description**: Handle tag paths that do not resolve on the physical device, with periodic retry.

**Acceptance Criteria**:

- When tag path does not exist on device: failure logged to Site Event Logging.
- Attribute marked with quality `bad`.
- Periodic retry at configurable interval to accommodate devices that boot in stages.
- On successful resolution: subscription activates normally, quality reflects live value.
- Separate from connection-level reconnect — tag resolution retry handles individual tag failures on an active connection.
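
The bookkeeping behind the WP-12 criteria above can be sketched as follows. This is an illustrative Python model only, not the planned C# connection actor. `TagResolutionTracker`, `resolve`, and `log_event` are hypothetical names, and the sketch assumes the failure is logged once on the first attempt rather than on every retry, which the criteria do not specify. The `counts()` method shows how the same state could feed the per-connection tag resolution counts in `[CD-HM-5]`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class TagResolutionTracker:
    """Hypothetical per-connection bookkeeping for tag path resolution."""
    resolve: Callable[[str], bool]     # stand-in for the adapter's subscribe attempt
    log_event: Callable[[str], None]   # stand-in for Site Event Logging
    subscribed: Set[str] = field(default_factory=set)
    resolved: Set[str] = field(default_factory=set)

    def subscribe(self, tag_path: str) -> None:
        """First attempt; an unresolved tag is logged and stays bad quality."""
        self.subscribed.add(tag_path)
        if self.resolve(tag_path):
            self.resolved.add(tag_path)
        else:
            self.log_event(f"Tag path not resolved: {tag_path}")

    def bad_quality_tags(self) -> Set[str]:
        """Unresolved tags are reported with quality 'bad' until they resolve."""
        return self.subscribed - self.resolved

    def retry_unresolved(self) -> List[str]:
        """Run on each retry-timer tick; returns tags that resolved on this pass."""
        newly_resolved = [t for t in sorted(self.bad_quality_tags()) if self.resolve(t)]
        self.resolved.update(newly_resolved)
        return newly_resolved

    def counts(self) -> Tuple[int, int]:
        """(total subscribed, successfully resolved), as needed by [CD-HM-5]."""
        return len(self.subscribed), len(self.resolved)


available = {"Line1.Temp"}   # tags the simulated device currently exposes
events: List[str] = []
tracker = TagResolutionTracker(resolve=lambda t: t in available, log_event=events.append)
tracker.subscribe("Line1.Temp")
tracker.subscribe("Line1.Pressure")
print(tracker.counts())            # (2, 1)
available.add("Line1.Pressure")    # device finishes booting
print(tracker.retry_unresolved())  # ['Line1.Pressure']
print(tracker.counts())            # (2, 2)
```

Because the tracker only touches tags that are still unresolved, the retry pass stays independent of connection-level reconnect, matching the separation called out in the last criterion.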

**Estimated Complexity**: M

**Requirements Traced**: `[KDD-data-4]`, `[CD-DCL-10]`

---

### WP-13: Data Connection Layer — Health Reporting

**Description**: DCL reports connection status and tag resolution metrics to Health Monitoring.

**Acceptance Criteria**:

- Reports connection status (connected/disconnected/reconnecting) per data connection.
- Reports tag resolution counts per connection: total subscribed tags vs. successfully resolved tags.
- Metrics collected and available for inclusion in periodic health report.

**Estimated Complexity**: S

**Requirements Traced**: `[11.1-3]`, `[CD-HM-5]`, `[CD-DCL-12]`

---

### WP-14: Data Connection Layer — Subscription & Cleanup Lifecycle

**Description**: Manage subscription creation when Instance Actors start and cleanup when they stop.

**Acceptance Criteria**:

- When Instance Actor created: registers data source references with DCL for subscription.
- DCL subscribes to tag paths using concrete connection details from flattened configuration.
- Tag value updates delivered directly to requesting Instance Actor.
- When Instance Actor stopped (disable, delete, redeployment): DCL cleans up associated subscriptions.
- On redeployment: subscriptions established fresh based on new configuration.
- Protocol-agnostic — works for both OPC UA and LmxProxy.

**Estimated Complexity**: M

**Requirements Traced**: `[2.4-4]`, `[CD-DCL-13]`, `[CD-DCL-14]`

---

### WP-15: Site Runtime — Script Actor & Script Execution Actor

**Description**: Implement the Script Actor coordinator and short-lived Script Execution Actor for script invocation.

**Acceptance Criteria**:

- Script Actor created as child of Instance Actor (one per script definition).
- Script Actor holds compiled script code, trigger configuration, and manages trigger evaluation.
- Interval trigger: internal timer, spawns Script Execution Actor on fire.
- Value Change trigger: subscribes to attribute change notifications from Instance Actor, spawns Script Execution Actor on change.
- Conditional trigger: subscribes to attribute notifications, evaluates condition (equals/not-equals), spawns Script Execution Actor when condition met.
- Minimum time between runs: Script Actor tracks last execution time, skips trigger if minimum interval not elapsed.
- Script Execution Actor is short-lived child, receives compiled code, input parameters, reference to Instance Actor, current call depth.
- Script Execution Actor runs on dedicated blocking I/O dispatcher.
- Multiple Script Execution Actors can run concurrently.
- Script Actor coordinator does not block on child completion.
- Supervision: Script Actor resumed on exception; Script Execution Actor stopped on unhandled exception.
- Return value (if defined) sent back to caller; discarded for trigger invocations.

**Estimated Complexity**: L

**Requirements Traced**: `[4.2-1]`, `[4.2-2]`, `[4.2-3]`, `[4.2-4]`, `[4.1-5]`, `[4.1-6]`, `[4.1-7]`, `[4.1-8]`, `[KDD-runtime-2]`, `[KDD-runtime-3]`, `[KDD-runtime-9]`, `[CD-SR-1]`, `[CD-SR-10]`

---

### WP-16: Site Runtime — Alarm Actor & Alarm Execution Actor

**Description**: Implement the Alarm Actor coordinator for alarm condition evaluation and state management.

**Acceptance Criteria**:

- Alarm Actor created as child of Instance Actor (one per alarm definition).
- Alarm Actor subscribes to attribute change notifications from Instance Actor for referenced attribute(s).
- Evaluates trigger conditions: Value Match, Range Violation, Rate of Change.
- Alarm state (active/normal) held in memory only — not persisted.
- On alarm activate (condition met, currently normal): transition to active, update Instance Actor alarm state (publishes to stream), spawn Alarm Execution Actor for on-trigger script if defined.
- On alarm clear (condition clears, currently active): transition to normal, update Instance Actor. No script execution on clear.
- On restart/failover: alarm starts in normal, re-evaluates from incoming values.
- Alarm Execution Actor: short-lived child, same pattern as Script Execution Actor. Has access to Instance Actor for GetAttribute/SetAttribute.
- Alarm Actors are a separate peer subsystem from Script Actors (not nested inside).
- Alarm evaluation errors logged locally; Alarm Actor remains active for subsequent updates.
- Supervision: Alarm Actor resumed on exception; Alarm Execution Actor stopped on unhandled exception.

**Estimated Complexity**: L

**Requirements Traced**: `[3.4.1-1]`, `[3.4.1-2]`, `[3.4.1-3]`, `[3.4.1-4]`, `[4.6-1]`, `[4.6-2]`, `[KDD-runtime-4]`, `[KDD-runtime-9]`, `[CD-SR-2]`, `[CD-SR-3]`, `[CD-SR-6]`

---

### WP-17: Site Runtime — Shared Script Library (Inline Execution)

**Description**: Implement shared script compilation and inline execution within Script Execution Actor context.

**Acceptance Criteria**:

- Shared scripts compiled at site when received from central.
- Compiled code stored in memory, made available to all Script Actors.
- `Scripts.CallShared("scriptName", params)` executes shared script inline — direct method invocation, not actor message.
- Shared scripts not associated with any template — system-wide library.
- Shared scripts can define input parameters and return value definitions.
- No serialization bottleneck — inline execution avoids contention on a shared actor.
- Shared scripts have access to same runtime API as instance scripts (GetAttribute, SetAttribute, etc.).
- Shared scripts are not available on central cluster. *(Negative: verified by architecture — site-only compilation.)*

**Estimated Complexity**: M

**Requirements Traced**: `[4.5-1]`, `[4.5-2]`, `[4.5-3]`, `[4.5-4]`, `[4.5-5]`, `[KDD-runtime-5]`

---

### WP-18: Site Runtime — Script Runtime API (Core Operations)

**Description**: Implement the core Script Runtime API available to all script and alarm execution actors.

**Acceptance Criteria**:

- `Instance.GetAttribute("name")` — reads current in-memory value from parent Instance Actor.
- `Instance.SetAttribute("name", value)` — for data-connected: sends write to DCL, error returned synchronously; for static: updates in-memory + persists to SQLite, survives restart/failover, resets on redeployment.
- `Instance.CallScript("scriptName", params)` — ask pattern to sibling Script Actor, target spawns Script Execution Actor, returns result. Includes current recursion depth.
- `Scripts.CallShared("scriptName", params)` — inline execution. Includes current recursion depth.
- Scripts can only access own instance's attributes/scripts. Cross-instance access fails with clear error.
- Runtime API provided via a context object injected into Script Execution Actor.

**Estimated Complexity**: L

**Requirements Traced**: `[4.4-1]`, `[4.4-2]`, `[4.4-3]`, `[4.4-4]`, `[4.4-5]`, `[4.4-10]`, `[KDD-data-7]`

---

### WP-19: Site Runtime — Script Trust Model & Constrained Compilation

**Description**: Implement compilation restrictions and runtime constraints for script execution.

**Acceptance Criteria**:

- Forbidden APIs enforced at compilation: System.IO, System.Diagnostics.Process, System.Threading (except async/await), System.Reflection, System.Net.Sockets, System.Net.Http, assembly loading, unsafe code.
- Compilation context restricts available assemblies and namespaces.
- Execution timeout: configurable per-script maximum execution time. Exceeding timeout cancels script and logs error.
- Memory: scripts share host process memory, no per-script memory limit (timeout prevents runaway allocations).
- Test: verify compilation fails when script references forbidden API.
- Test: verify runtime timeout cancels long-running script.

**Estimated Complexity**: L

**Requirements Traced**: `[KDD-code-9]`, `[CD-SR-10]`, `[CD-SR-11]`, `[CD-SR-12]`

---

### WP-20: Site Runtime — Recursion Limit Enforcement

**Description**: Enforce maximum recursion depth for script-to-script calls.

**Acceptance Criteria**:

- Every CallScript and CallShared increments call depth counter.
- Default maximum depth: 10 levels (configurable).
- If limit exceeded, call fails with error.
- Error logged to site event log.
- Applies to all call chains: script -> script, script -> shared, alarm on-trigger -> instance script chains.
- Test: create call chain of depth 11, verify it fails at the 11th level with logged error.

**Estimated Complexity**: S

**Requirements Traced**: `[4.4.1-1]`, `[4.4.1-2]`, `[4.4.1-3]`, `[4.4.1-4]`, `[4.4.1-5]`, `[4.6-5]`

---

### WP-21: Site Runtime — Alarm On-Trigger Script Call Direction Enforcement

**Description**: Enforce one-way call direction between alarm on-trigger scripts and instance scripts.

**Acceptance Criteria**:

- Alarm Execution Actor can call instance scripts via `Instance.CallScript()` (sends ask to sibling Script Actor).
- Instance scripts (Script Execution Actors) cannot call alarm on-trigger scripts. Mechanism: alarm on-trigger scripts are not exposed as callable targets in the Script Runtime API; no `Instance.CallAlarmScript()` API exists.
- Test: verify alarm on-trigger script successfully calls instance script.
- Test: verify no API path exists for instance scripts to invoke alarm on-trigger scripts.

**Estimated Complexity**: S

**Requirements Traced**: `[4.6-3]`, `[4.6-4]`, `[CD-SR-9]`

---

### WP-22: Site Runtime — Tell vs Ask Conventions

**Description**: Implement correct Tell/Ask usage patterns per Akka.NET best practices.

**Acceptance Criteria**:

- Tell (fire-and-forget) used for: tag value updates (DCL -> Instance Actor), attribute change notifications (Instance Actor -> Script/Alarm Actors), stream publishing (Instance Actor -> Akka stream).
- Ask used for: `Instance.CallScript()` (Script Execution Actor -> sibling Script Actor), `Route.To().Call()` (Inbound API -> site, Phase 7), debug view snapshot (Communication Layer -> Instance Actor).
- No Ask usage on the hot path (tag updates, notifications).
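
The difference between the two conventions in WP-22 can be shown with a small, language-neutral sketch. It is written in Python with `asyncio` and does not use the Akka.NET API; `MiniActor`, `tell`, `ask`, and the message tuples are hypothetical names chosen for illustration. The sketch assumes a single FIFO mailbox per actor: `tell` enqueues and returns immediately, while `ask` enqueues a message together with a reply future and waits for it with a timeout.

```python
import asyncio


class MiniActor:
    """Toy mailbox actor: one handler, messages processed strictly in order."""

    def __init__(self, handler):
        self._handler = handler
        self._mailbox = asyncio.Queue()
        self._task = None

    def start(self):
        self._task = asyncio.create_task(self._run())

    def stop(self):
        self._task.cancel()

    def tell(self, message):
        # Fire-and-forget: no reply future, the sender never waits.
        self._mailbox.put_nowait((message, None))

    async def ask(self, message, timeout_s):
        # Request/response: the sender waits for the reply or a timeout.
        reply = asyncio.get_running_loop().create_future()
        self._mailbox.put_nowait((message, reply))
        return await asyncio.wait_for(reply, timeout_s)

    async def _run(self):
        while True:
            message, reply = await self._mailbox.get()
            try:
                result = self._handler(message)
            except Exception as exc:
                if reply is not None and not reply.done():
                    reply.set_exception(exc)
                continue
            if reply is not None and not reply.done():
                reply.set_result(result)


async def demo():
    attributes = {}

    def instance_handler(message):
        kind, name, value = message
        if kind == "tag_update":          # hot path: arrives via tell
            attributes[name] = value
            return None
        if kind == "snapshot":            # debug view snapshot: arrives via ask
            return dict(attributes)
        raise ValueError(f"unknown message kind: {kind}")

    instance = MiniActor(instance_handler)
    instance.start()
    instance.tell(("tag_update", "Speed", 42))
    snapshot = await instance.ask(("snapshot", None, None), timeout_s=1.0)
    instance.stop()
    return snapshot
```

Calling `asyncio.run(demo())` returns the snapshot taken after the tag update, because both messages pass through the same ordered mailbox. The tag update itself never allocates a reply future, which is the property the "no Ask on the hot path" criterion protects.
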

**Estimated Complexity**: S

**Requirements Traced**: `[KDD-data-7]`

---

### WP-23: Site Runtime — Site-Wide Akka Stream

**Description**: Implement the site-wide Akka stream for attribute value and alarm state changes with per-subscriber backpressure.

**Acceptance Criteria**:

- All Instance Actors publish attribute value changes and alarm state changes to the stream.
- Attribute change format: `[InstanceUniqueName].[AttributePath].[AttributeName]`, value, quality, timestamp.
- Alarm change format: `[InstanceUniqueName].[AlarmName]`, state (active/normal), priority, timestamp.
- Per-subscriber bounded buffers. Each subscriber gets independent buffer.
- Slow subscriber: buffer fills, oldest events dropped. Does not affect other subscribers or publishers.
- Instance Actors publish with fire-and-forget — publishing never blocks the actor.
- Debug view can subscribe filtered by instance unique name.
- Stream survives individual Instance Actor stop/restart.

**Estimated Complexity**: L

**Requirements Traced**: `[3.4.1-4]`, `[KDD-runtime-6]`, `[CD-SR-7]`, `[CD-SR-8]`, `[8.1-1]`, `[8.1-4]`, `[8.1-5]`

---

### WP-24: Site Runtime — Concurrency Serialization

**Description**: Ensure Instance Actor correctly serializes all state mutations while allowing concurrent script execution.

**Acceptance Criteria**:

- Instance Actor processes messages sequentially (standard Akka model).
- SetAttribute calls from concurrent Script Execution Actors serialized at Instance Actor — no race conditions on attribute state.
- Script Execution Actors may run concurrently; all state mutations mediated through Instance Actor message queue.
- External side effects (external system calls, notifications) not serialized — concurrent scripts produce interleaved side effects (acceptable).
- Test: two concurrent scripts writing to same attribute, verify no lost updates (serialized through Instance Actor).
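
The "no lost updates" test in WP-24 can be sketched as follows. This is a Python illustration using threads and a queue in place of Akka.NET actors; `InstanceMailbox`, `run_script`, and the `"Counter"` attribute are hypothetical names. It assumes that every write is sent as a message and that a single consumer applies messages one at a time, which is the property the Instance Actor mailbox provides.

```python
import queue
import threading


class InstanceMailbox:
    """Single consumer thread applies attribute writes one at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self.attributes = {}
        self.applied = []          # order in which writes were actually applied
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def set_attribute(self, name, value):
        # Callers only enqueue; they never touch attribute state directly.
        self._mailbox.put((name, value))

    def stop(self):
        self._mailbox.put(None)    # sentinel: drain remaining writes, then exit
        self._worker.join()

    def _run(self):
        while True:
            item = self._mailbox.get()
            if item is None:
                break
            name, value = item
            self.attributes[name] = value
            self.applied.append((name, value))


def run_script(instance, script_id, writes):
    # Stands in for a Script Execution Actor issuing SetAttribute calls.
    for i in range(writes):
        instance.set_attribute("Counter", (script_id, i))
```

Running two `run_script` threads against one `InstanceMailbox` and then calling `stop()` should leave `applied` holding every write exactly once, with each script's own writes in the order it issued them. The interleaving between scripts is not deterministic, which matches the plan's note that only state mutations, not external side effects, are serialized.
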

**Estimated Complexity**: M

**Requirements Traced**: `[KDD-runtime-7]`

---

### WP-25: Site Runtime — Debug View Backend Support

**Description**: Implement the site-side debug view infrastructure: snapshot + stream subscription.

**Acceptance Criteria**:

- Central sends subscribe request for specific instance (by unique name).
- Instance Actor provides snapshot of all current attribute values and alarm states.
- Site subscribes to site-wide Akka stream filtered by instance unique name and forwards changes to central.
- Central sends unsubscribe request when debug view closes; site removes stream subscription.
- Session-based and temporary — no persistent subscriptions.
- No attribute/alarm selection — always shows all tags and alarms for the instance.
- No special concurrency limits on debug subscriptions.
- Connection interruption kills debug stream; engineer must reopen.

**Estimated Complexity**: M

**Requirements Traced**: `[8.1-1]`, `[8.1-2]`, `[8.1-3]`, `[8.1-4]`, `[8.1-5]`, `[8.1-6]`, `[8.1-7]`, `[8.1-8]`, `[KDD-ui-2]`

---

### WP-26: Health Monitoring — Site-Side Metric Collection

**Description**: Implement the site-side health metric collector that aggregates metrics from all site subsystems.

**Acceptance Criteria**:

- Collects all metrics defined in 11.1:
  - Active/standby node status (from Cluster Infrastructure).
  - Data connection health: connected/disconnected/reconnecting per data connection (from DCL).
  - Tag resolution counts per connection (from DCL).
  - Script error rates: raw count per interval, reset after report (from Site Runtime).
  - Alarm evaluation error rates: raw count per interval, reset after report (from Site Runtime).
  - Store-and-forward buffer depth by category. *(Reports 0/placeholder until S&F implemented in Phase 3C.)*
  - Dead letter count: subscribed to Akka.NET EventStream dead letter events, count per interval.
- Script errors include: unhandled exceptions, timeouts, recursion limit violations.
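
The split between per-interval counters (reset after each report) and point-in-time values (kept between reports) can be sketched as below. This is a Python illustration, not the planned implementation; `HealthMetricCollector`, `snapshot_and_reset`, and the field and key names are hypothetical. It assumes all three script error kinds feed one `script_errors` counter, and it reports store-and-forward depth as a fixed `0`, matching the Phase 3C placeholder noted above.

```python
from dataclasses import dataclass, field

INTERVAL_COUNTERS = ("script_errors", "alarm_eval_errors", "dead_letters")
SCRIPT_ERROR_KINDS = ("unhandled_exception", "timeout", "recursion_limit")


@dataclass
class HealthMetricCollector:
    node_role: str = "active"
    connection_status: dict = field(default_factory=dict)  # connection -> state
    tag_resolution: dict = field(default_factory=dict)     # connection -> (total, resolved)
    counters: dict = field(
        default_factory=lambda: dict.fromkeys(INTERVAL_COUNTERS, 0)
    )

    def record_script_error(self, kind):
        # Every kind of script error increments the same per-interval counter.
        if kind not in SCRIPT_ERROR_KINDS:
            raise ValueError(f"unknown script error kind: {kind}")
        self.counters["script_errors"] += 1

    def record_alarm_eval_error(self):
        self.counters["alarm_eval_errors"] += 1

    def record_dead_letter(self):
        self.counters["dead_letters"] += 1

    def snapshot_and_reset(self):
        """Build a flat report, then zero the per-interval counters."""
        report = {
            "node_role": self.node_role,
            "connections": dict(self.connection_status),
            "tag_resolution": dict(self.tag_resolution),
            "store_and_forward_depth": 0,   # placeholder until Phase 3C
            **self.counters,
        }
        for name in self.counters:
            self.counters[name] = 0
        return report
```

Connection status and tag resolution counts are copied into the report but left in place, so a second report taken immediately afterwards would show the same connection state with all interval counters back at zero.
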

**Estimated Complexity**: M

**Requirements Traced**: `[11.1-1]`, `[11.1-2]`, `[11.1-3]`, `[11.1-4]`, `[11.1-5]`, `[11.1-6]`, `[KDD-ui-4]`, `[CD-HM-4]`, `[CD-HM-5]`, `[CD-SR-5]`

---

### WP-27: Health Monitoring — Periodic Reporting with Sequence Numbers

**Description**: Implement periodic health report sending from site to central with monotonic sequence numbers.

**Acceptance Criteria**:

- Health report sent at configurable interval (default 30 seconds).
- Report is flat snapshot of all current metric values.
- Includes monotonic sequence number (incremented per report).
- Includes report timestamp (UTC from site clock).
- Sent via Communication Layer (Pattern 7: periodic push, Tell — no response needed).
- Sequence number is monotonic within a singleton lifecycle and resets on singleton restart (central handles the reset; see WP-28 and Q-P3B-4).

**Estimated Complexity**: S

**Requirements Traced**: `[11.2-1]`, `[KDD-ui-3]`, `[CD-HM-1]`

---

### WP-28: Health Monitoring — Central-Side Aggregation & Offline Detection

**Description**: Implement central-side health metric reception, aggregation, and site online/offline detection.

**Acceptance Criteria**:

- Receives health reports from all sites.
- Stores latest metrics per site in memory (no persistence).
- For a site currently marked online: replaces previous state only if incoming sequence number > last received (prevents stale overwrite). For a site currently marked offline: accepts the report regardless of sequence number, because the sequence restarts after a singleton restart or failover (see Q-P3B-4).
- Offline detection: if no report received within configurable timeout (default 60s — 2x report interval), site marked offline.
- Online recovery: receipt of report from offline site automatically marks it online — no manual ack.
- Metrics available for Central UI dashboard (rendering is Phase 4/6).
- No alerting — display-only.

**Estimated Complexity**: M

**Requirements Traced**: `[11.1-1]`, `[11.2-1]`, `[11.2-2]`, `[KDD-ui-3]`, `[CD-HM-2]`, `[CD-HM-3]`, `[CD-HM-6]`, `[CD-HM-7]`

---

### WP-29: Site Event Logging — Event Recording to SQLite

**Description**: Implement the site event logging service with SQLite persistence.

**Acceptance Criteria**:

- Event logging service available as a cross-cutting concern to all site subsystems.
- Events recorded with schema: timestamp (UTC), event type, severity (Info/Warning/Error), instance ID (optional), source, message, details (optional).
- Categories supported: script executions, alarm events, deployment applications, data connection status, store-and-forward activity, instance lifecycle.
- Only active node generates and stores events. Event logs not replicated to standby.
- On failover, new active node starts logging to its own SQLite; historical from previous active unavailable until that node returns.
- SQLite database created at site startup if not exists.

**Estimated Complexity**: M

**Requirements Traced**: `[12.1-1]`, `[12.1-2]`, `[12.1-3]`, `[12.1-4]`, `[12.1-5]`, `[12.1-6]`, `[12.2-1]`, `[CD-SEL-1]`, `[CD-SEL-2]`

---

### WP-30: Site Event Logging — Retention & Storage Cap Enforcement

**Description**: Implement 30-day retention with daily purge and 1GB storage cap.

**Acceptance Criteria**:

- Daily background job on active node deletes all events older than 30 days. Hard delete, no archival.
- Configurable storage cap (default 1 GB). If cap reached before 30-day window, oldest events purged first.
- Storage cap checked periodically (at least daily, ideally on each purge run).
- Purge does not block event recording (runs on background thread/task).

**Estimated Complexity**: S

**Requirements Traced**: `[12.2-2]`, `[KDD-ui-5]`, `[CD-SEL-3]`

---

### WP-31: Site Event Logging — Remote Query with Pagination & Keyword Search

**Description**: Implement remote query support for central to query site event logs.

**Acceptance Criteria**:

- Query received via Communication Layer (Pattern 8: Remote Queries).
- Supports filtering by: event type/category, time range, instance ID, severity, keyword search (SQLite LIKE on message and source fields).
- Results paginated with configurable page size (default 500 events).
- Each response includes continuation token for fetching additional pages.
- Site processes query locally against SQLite and returns matching results to central.

**Estimated Complexity**: M

**Requirements Traced**: `[12.3-1]`, `[KDD-ui-5]`, `[CD-SEL-4]`, `[CD-SEL-5]`

---

### WP-32: Site Runtime — Script Error Handling Integration

**Description**: Implement script error handling behavior per requirements.

**Acceptance Criteria**:

- Script failure (unhandled exception, timeout): logged locally to site event log with error details.
- Script not disabled after failure — remains active, fires on next qualifying trigger.
- Script failures not reported to central individually (only aggregated error rate via health report).
- Script compilation errors on deployment reject entire instance deployment — no partial state.

**Estimated Complexity**: S

**Requirements Traced**: `[4.3-1]`, `[4.3-2]`, `[4.3-3]`, `[CD-SR-4]`, `[CD-SR-5]`

---

### WP-33: Site Runtime — Local Artifact Storage

**Description**: Implement local storage for system-wide artifacts received from central (shared scripts, external system definitions, DB connection definitions, notification lists).

**Acceptance Criteria**:

- SQLite schema or file storage for: shared scripts, external system definitions, database connection definitions, notification lists.
- Artifacts stored on receipt from central (via Pattern 3: System-Wide Artifact Deployment).
- Shared scripts recompiled on update and new code made available to Script Actors.
- Artifact storage persists across restart.
- Sites are headless — no local UI for artifact management.

**Estimated Complexity**: M

**Requirements Traced**: `[2.3-1]`, `[2.3-2]`, `[4.5-3]`

---

### WP-34: Data Connection Layer — Protocol Extensibility

**Description**: Ensure the IDataConnection interface allows adding new protocol adapters.

**Acceptance Criteria**:

- IDataConnection interface defined in Commons (Phase 0 — REQ-COM-2).
- OPC UA adapter and LmxProxy adapter both implement IDataConnection.
- Connection actor instantiates the correct adapter based on data connection protocol type from configuration.
- Adding a new protocol requires only implementing IDataConnection and registering the adapter — no changes to connection actor or Instance Actor.

**Estimated Complexity**: S

**Requirements Traced**: `[2.4-3]`

---

### WP-35: Failover Acceptance Tests

**Description**: Validate failover behavior for all Phase 3B components.

**Acceptance Criteria**:

- **DCL reconnection after failover**: Active node fails, singleton migrates, new Deployment Manager re-creates Instance Actors, DCL re-establishes connections and subscriptions. Values resume flowing.
- **Health report continuity**: After failover, new active node begins sending health reports with new sequence numbers. Central detects the gap but accepts new reports (sequence number > 0 accepted for a site that was marked offline).
- **Stream recovery**: Debug stream interrupted on failover. Engineer reopens debug view and gets fresh snapshot + stream.
- **Alarm re-evaluation**: After failover, alarms start in normal state and re-evaluate from incoming values.
- **Script triggers resume**: After failover, interval timers restart, value change/conditional triggers re-subscribe.
- **Event log continuity**: New active node starts fresh event log. Previous active's events available when that node returns.
- **Static attribute overrides survive**: Instance Actor loads persisted overrides from SQLite after failover.
  *(Covered in Phase 3A but re-verified here with full runtime.)*

**Estimated Complexity**: L

**Requirements Traced**: `[3.4.1-3]`, `[CD-SEL-2]`, `[KDD-data-2]`, `[CD-Comm-6]`

---

## Test Strategy

### Unit Tests

| Area | Test Scenarios |
|------|---------------|
| Connection Actor | State machine transitions (Connecting -> Connected -> Reconnecting), stash/unstash behavior, bad quality propagation on disconnect |
| OPC UA Adapter | IDataConnection contract compliance, subscribe/unsubscribe, write |
| LmxProxy Adapter | IDataConnection contract compliance, SessionId management, keep-alive, subscription stream processing |
| Script Actor | Trigger evaluation (interval, value change, conditional), minimum time between runs, concurrent execution |
| Alarm Actor | Condition evaluation (Value Match, Range Violation, Rate of Change), state transitions (normal->active, active->normal), no script on clear |
| Script Runtime API | GetAttribute, SetAttribute (data-connected + static), CallScript, CallShared |
| Script Trust Model | Compilation rejection for forbidden APIs, execution timeout |
| Recursion Limit | Depth tracking, limit enforcement, error logging |
| Health Metric Collector | Counter accumulation, reset after report, dead letter counting |
| Event Logger | Event recording, schema compliance, retention purge, storage cap |
| Event Query | Filter combinations, pagination, keyword search |
| Communication Contracts | Serialization/deserialization, correlation ID propagation |

### Integration Tests

| Area | Test Scenarios |
|------|---------------|
| OPC UA End-to-End | Connect to OPC PLC simulator, subscribe, receive values, write, verify round-trip |
| DCL -> Instance Actor | Tag value updates flow from DCL to Instance Actor, update in-memory state, publish to stream |
| Script Execution | Trigger fires, Script Execution Actor spawns, executes script, reads/writes attributes, returns |
| Alarm Evaluation | Value update triggers alarm, state change published to stream, on-trigger script fires |
| CallScript Chain | Script A calls Script B, recursion depth tracked, return value propagated |
| Shared Script | Instance script calls shared script inline, shared script accesses runtime API |
| Debug View | Subscribe, receive snapshot, stream changes, unsubscribe |
| Health Report | Site sends report, central receives, offline detection after timeout |
| Event Log Query | Central queries site event log, receives paginated results |
| Communication Patterns | All 8 patterns exercised end-to-end |

### Negative Tests

| Requirement | Test |
|-------------|------|
| `[4.4-10]` Scripts cannot access other instances | Script attempts cross-instance attribute access; verify clear error returned |
| `[4.6-4]` Instance scripts cannot call alarm scripts | Verify no API path exists for this; attempt to address alarm script from instance script fails |
| `[4.5-5]` Shared scripts not available on central | Verify shared script library is site-only compilation |
| `[2.2-3]` No continuous real-time streaming | Verify no background stream runs without debug view open |
| `[4.3-2]` Script not disabled after failure | Script fails, verify next trigger still fires |
| `[4.3-3]` Script failures not reported to central | Verify no individual failure message sent; only aggregated rate in health report |
| `[3.4.1-3]` Alarm state not persisted | Restart, verify all alarms start normal |
| `[CD-DCL-11]` No S&F for device writes | Verify write failure returned to script, not buffered |
| `[CD-HM-7]` No alerting | Verify health monitoring is display-only |
| `[KDD-code-9]` Forbidden APIs | Compile script with System.IO reference; verify compilation fails |

### Failover Tests

See WP-35 acceptance criteria above.

---

## Verification Gate

Phase 3B is complete when ALL of the following pass:

1. **OPC UA integration**: Site connects to OPC PLC simulator, subscribes to tags, values flow to Instance Actors, attribute values visible in debug view snapshot.
2. **Script execution**: All three trigger types (interval, value change, conditional) fire correctly. Minimum time between runs enforced. Scripts read/write attributes. CallScript returns values. CallShared executes inline.
3. **Alarm evaluation**: All three condition types (Value Match, Range Violation, Rate of Change) correctly transition alarms. Alarm state changes appear on Akka stream. On-trigger scripts execute. No script on clear.
4. **Script trust model**: Forbidden APIs rejected at compilation. Execution timeout cancels scripts.
5. **Recursion limit**: Call chain depth enforced at configured limit. Error logged.
6. **Health monitoring**: Site sends periodic reports with sequence numbers. Central aggregates, detects offline (60s), detects online recovery. All metric categories populated.
7. **Event logging**: Events recorded for all categories. 30-day retention purge works. 1GB cap enforced. Remote query with pagination and keyword search returns correct results.
8. **Debug view**: Full cycle — subscribe, snapshot, stream changes, unsubscribe.
9. **Communication**: All 8 patterns exercised. Per-pattern timeouts verified. Correlation IDs propagated.
10. **Failover**: WP-35 acceptance tests pass — DCL reconnection, health continuity, stream recovery, alarm re-evaluation, script trigger resume.
11. **Negative tests**: All negative test cases pass (cross-instance access blocked, alarm script call direction enforced, forbidden APIs rejected, etc.).

---

## Open Questions

| # | Question | Context | Impact | Status |
|---|----------|---------|--------|--------|
| Q-P3B-1 | What is the exact dedicated blocking I/O dispatcher configuration for Script Execution Actors? | KDD-runtime-3 says "dedicated blocking I/O dispatcher" — need Akka.NET HOCON config (thread pool size, throughput settings). | WP-15. Sensible defaults can be set; tuned in Phase 8. | Deferred — use Akka.NET default blocking-io-dispatcher config; tune during Phase 8 performance testing. |
| Q-P3B-2 | Should LmxProxy adapter expose WriteBatchAndWaitAsync (write-and-poll handshake) through IDataConnection or as a protocol-specific extension? | CD-DCL-5 lists WriteBatchAndWaitAsync but IDataConnection only defines simple Write. | WP-8. Does not block core functionality. | Deferred — expose as protocol-specific extension method; not part of IDataConnection core contract. |
| Q-P3B-3 | What is the Rate of Change alarm evaluation time window? | Section 3.4 says "changes faster than a defined threshold" but does not specify the time window (per-second? per-minute? configurable?). | WP-16. Needs a design decision for the evaluation algorithm. | Deferred — implement as configurable window (default: per-second rate). Document in alarm definition schema. |
| Q-P3B-4 | How does the health report sequence number behave across failover? | Sequence number is monotonic within a singleton lifecycle. After failover, the new singleton starts at 1. Central must handle this. | WP-27, WP-28. Central should accept any report from a site marked offline regardless of sequence number. | Resolved in design — central accepts report when site is offline; for online sites, requires seq > last. On failover, site goes offline first (missed reports), so the reset is naturally handled. |

---

## Split-Section Tracking

### Section 4.1 — Script Definitions

- **Phase 3B covers**: `[4.1-5]` site compilation, `[4.1-6]` input parameters (runtime), `[4.1-7]` return values (runtime), `[4.1-8]` return value usage (trigger vs. call).
- **Phase 2 covers**: `[4.1-1]` C# defined at template level, `[4.1-2]` inheritance/override/lock, `[4.1-3]` deployed as flattened config, `[4.1-4]` first-class template members.
- **Union**: Complete.

### Section 4.4 — Script Capabilities

- **Phase 3B covers**: `[4.4-1]` read, `[4.4-2]` write data-sourced, `[4.4-3]` write static, `[4.4-4]` CallScript, `[4.4-5]` CallShared, `[4.4-10]` cannot access other instances.
- **Phase 7 covers**: `[4.4-6]` ExternalSystem.Call, `[4.4-7]` CachedCall, `[4.4-8]` notifications, `[4.4-9]` Database.Connection.
- **Union**: Complete.

### Section 4.5 — Shared Scripts

- **Phase 3B covers**: `[4.5-1]` system-wide library, `[4.5-2]` parameters/return values, `[4.5-3]` deployment to sites (site-side reception), `[4.5-4]` inline execution, `[4.5-5]` not available on central.
- **Phase 2 covers**: Model/definition (shared script entity schema).
- **Union**: Complete.

### Section 2.3 — Site-Level Storage & Interface

- **Phase 3A covers**: `[2.3-2]` deployed configs, `[2.3-3]` S&F buffers (schema preparation).
- **Phase 3B covers**: `[2.3-1]` headless, `[2.3-2]` shared scripts/ext sys defs/db conn defs/notification lists storage.
- **Phase 3C covers**: `[2.3-3]` S&F buffer persistence and replication.
- **Union**: Complete.

### Section 8.1 — Debug View

- **Phase 3B covers**: `[8.1-1]` through `[8.1-8]` — all backend/site-side debug view infrastructure.
- **Phase 6 covers**: Central UI rendering of debug view.
- **Union**: Complete (backend vs. UI split).

### Section 12.3 — Central Access to Event Logs

- **Phase 3B covers**: `[12.3-1]` backend query mechanism (site-side query processing, communication pattern).
- **Phase 6 covers**: Central UI Event Log Viewer rendering.
- **Union**: Complete.

### Section 4.3 — Script Error Handling

- **Phase 3B covers**: `[4.3-1]`, `[4.3-2]`, `[4.3-3]` (all core error handling).
- **Phase 7 covers**: `[4.3-4]` external system call failure S&F interaction (depends on S&F integration).
- **Union**: Complete.

---

## Orphan Check Result

### Forward Check (Requirements -> Work Packages)

Every item in the Requirements Checklist and Design Constraints Checklist was walked. Results:

- **Requirements Checklist**: All 79 requirement bullets map to at least one work package with acceptance criteria.
- **Design Constraints Checklist**: All 47 design constraint items map to at least one work package with acceptance criteria.
- **No orphaned requirements or constraints found.** Note: `[2.5-1]`, `[2.5-2]`, `[2.5-3]` are context-only items that inform design decisions in this phase but are formally validated in Phase 8. They are referenced in WP-15 (staggered startup batch sizing consideration) and WP-14 (subscription management design). ### Reverse Check (Work Packages -> Requirements) Every work package was walked. Results: - All 35 work packages trace back to at least one requirement bullet or design constraint. - **No untraceable work packages found.** ### Split-Section Check All 7 split sections verified. The union of bullets across phases equals the complete section for each. **No gaps found.** ### Negative Requirement Check All negative requirements have explicit test cases in the Test Strategy: | Negative Requirement | Test Location | |---------------------|---------------| | `[4.4-10]` Cannot access other instances | Negative Tests table | | `[4.6-4]` Instance scripts cannot call alarm scripts | Negative Tests table | | `[4.5-5]` Shared scripts not available on central | Negative Tests table | | `[2.2-3]` No continuous real-time streaming | Negative Tests table | | `[4.3-2]` Script not disabled after failure | Negative Tests table | | `[4.3-3]` Failures not reported to central | Negative Tests table | | `[3.4.1-3]` Alarm state not persisted | Negative Tests table | | `[CD-DCL-11]` No S&F for device writes | Negative Tests table | | `[CD-HM-7]` No alerting | Negative Tests table | | `[KDD-code-9]` Forbidden APIs | Negative Tests table | | `[3.4.1-2]` No acknowledgment workflow | Covered by WP-16 acceptance criteria | **All negative requirements have acceptance criteria that would catch violations.** ### Verification Status - **Forward check**: PASS - **Reverse check**: PASS - **Split-section check**: PASS - **Negative requirement check**: PASS --- ## External Verification (Codex MCP) **Model**: gpt-5.4 **Date**: 2026-03-16 ### Step 1 — Requirements Coverage Review Codex received 
work package titles (not full acceptance criteria due to prompt size constraints) and identified 12 findings. Analysis:

| # | Finding | Disposition |
|---|---------|-------------|
| 1 | `[2.3-3]` S&F buffer persistence listed in checklist but no WP covers it | **Valid** — clarified as Phase 3C scope. `[2.3-3]` annotation updated to note split-section reference only. |
| 2 | Script Runtime API missing ExternalSystem/Notify/Database | **False positive** — plan explicitly assigns `[4.4-6]` through `[4.4-9]` to Phase 7. WP-18 covers only the Phase 3B portion (read/write/CallScript/CallShared). Scope table says "core operations." |
| 3 | Static attribute SQLite persistence not verified for restart/failover/redeploy | **False positive** — WP-18 acceptance criteria explicitly state "persists to SQLite, survives restart/failover, resets on redeployment." WP-35 re-verifies with full runtime. |
| 4 | System-wide artifact explicit deployment behavior uncovered | **False positive** — WP-33 covers artifact storage on receipt. Deployment trigger mechanism is Phase 3C (Deployment Manager). WP-4 Pattern 3 covers the communication pattern. |
| 5 | Staggered startup missing | **False positive** — staggered startup is Phase 3A (listed in prerequisites table). |
| 6 | Blocking I/O dispatcher and supervision strategy uncovered | **False positive** — WP-15 acceptance criteria: "runs on dedicated blocking I/O dispatcher" and "Supervision: Script Actor resumed, Script Execution Actor stopped." WP-16 has the same for Alarm Actors. |
| 7 | Per-subscriber buffering uncovered in WP-23 | **False positive** — WP-23 acceptance criteria explicitly cover: "Per-subscriber bounded buffers. Each subscriber gets independent buffer. Slow subscriber: buffer fills, oldest events dropped." |
| 8 | Tag resolution counts and dead letter count missing | **False positive** — WP-26 acceptance criteria include both. WP-13 covers tag resolution counts from DCL side. |
| 9 | UTC timestamps not covered | **False positive** — UTC is a Phase 0 convention (KDD-data-6). Message contracts in WP-1 specify "All timestamps in message contracts are UTC." Health report in WP-27 specifies "UTC from site clock." |
| 10 | Event log schema and active-node behavior uncovered | **False positive** — WP-29 acceptance criteria list full schema and "Only active node generates and stores events. Event logs not replicated to standby." |
| 11 | Remote query filters/pagination details uncovered | **False positive** — WP-31 acceptance criteria list all filter types, "default 500 events," and "continuation token." |
| 12 | LmxProxy details uncovered in WP-8 | **False positive** — WP-8 acceptance criteria explicitly cover port, API key, SessionId, keep-alive, TLS, batch ops, Polly retry. |

### Step 2 — Negative Requirement Review

Codex raised no concerns about negative requirements, which were not included in the abbreviated submission. Self-review confirms all 11 negative requirements are covered: 10 by explicit test cases in the Negative Tests table and `[3.4.1-2]` by WP-16 acceptance criteria.

### Step 3 — Split-Section Gap Review

Not submitted separately. Self-review in the Split-Section Tracking section confirms all 7 split sections have complete unions.

### Outcome

**Pass with 1 correction** — `[2.3-3]` annotation clarified as Phase 3C scope reference. All other findings were false positives caused by Codex receiving only work package titles rather than full acceptance criteria.
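
For future re-runs of the self-review, the forward, reverse, and split-section checks reported in the Orphan Check Result reduce to set comparisons over a traceability matrix. The following is a minimal sketch only, not the tooling used to produce this plan. The function name `check_traceability`, the `TraceReport` type, and the `REQ-*` / `WP-A..C` identifiers are hypothetical placeholders; only the Section 4.3 split (Phase 3B covering `[4.3-1]` through `[4.3-3]`, Phase 7 covering `[4.3-4]`) is taken from this plan.

```python
from dataclasses import dataclass, field


@dataclass
class TraceReport:
    """Findings from one traceability pass; all-empty collections mean PASS."""
    orphaned_items: set = field(default_factory=set)   # forward-check failures
    untraceable_wps: set = field(default_factory=set)  # reverse-check failures
    split_gaps: dict = field(default_factory=dict)     # section -> missing bullets

    @property
    def passed(self) -> bool:
        return not (self.orphaned_items or self.untraceable_wps or self.split_gaps)


def check_traceability(checklist, wp_refs, split_sections):
    """Run the three checks.

    checklist:      set of requirement / design-constraint IDs
    wp_refs:        dict of work package ID -> set of IDs it references
    split_sections: dict of section -> (all bullets, dict of phase -> bullets covered)
    """
    checklist = set(checklist)

    # Forward check: every checklist item is referenced by at least one WP.
    covered = set().union(*wp_refs.values())
    orphaned = checklist - covered

    # Reverse check: every WP references at least one checklist item.
    untraceable = {wp for wp, refs in wp_refs.items() if not (set(refs) & checklist)}

    # Split-section check: the union across phases equals the full section.
    gaps = {}
    for section, (bullets, by_phase) in split_sections.items():
        missing = set(bullets) - set().union(*by_phase.values())
        if missing:
            gaps[section] = missing

    return TraceReport(orphaned, untraceable, gaps)


# Hypothetical input: WP-C references nothing, so the reverse check flags it.
checklist = {"REQ-1", "REQ-2", "REQ-3"}
wp_refs = {"WP-A": {"REQ-1", "REQ-2"}, "WP-B": {"REQ-3"}, "WP-C": set()}
split_sections = {
    "4.3": (
        {"4.3-1", "4.3-2", "4.3-3", "4.3-4"},
        {"Phase 3B": {"4.3-1", "4.3-2", "4.3-3"}, "Phase 7": {"4.3-4"}},
    )
}
report = check_traceability(checklist, wp_refs, split_sections)
print(report.passed, sorted(report.untraceable_wps))
```

Keeping the three result sets separate, rather than returning a single boolean, mirrors the per-check PASS lines in the Verification Status list and makes a failing run point directly at the offending IDs.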