Phase 3B: Site I/O & Observability — Implementation Plan
Date: 2026-03-16
Status: Draft
Predecessor: Phase 3A (Runtime Foundation & Persistence Model)
Scope
Phase 3B brings the site cluster to life as a fully operational data collection, scripting, alarm evaluation, and health reporting platform. Upon completion, a site can:
- Communicate bidirectionally with the central cluster using all 8 message patterns.
- Connect to OPC UA servers and LmxProxy endpoints, subscribe to tags, and deliver values to Instance Actors.
- Execute scripts in response to triggers (interval, value change, conditional).
- Evaluate alarm conditions, manage alarm state, and execute on-trigger scripts.
- Compile and execute shared scripts inline.
- Report health metrics to central with monotonic sequence numbers and offline detection.
- Record operational events to local SQLite with retention enforcement.
- Support remote event log queries from central.
- Stream live debug data (attribute values + alarm states) on demand.
Components Included
| Component | Scope |
|---|---|
| Central-Site Communication | Full — all 8 message patterns, correlation IDs, per-pattern timeouts, transport heartbeat |
| Data Connection Layer | Full — IDataConnection, OPC UA adapter, LmxProxy adapter, connection actor, auto-reconnect, write-back, tag path resolution, health reporting |
| Site Runtime | Full runtime — Script Actor, Alarm Actor, shared scripts, Script Runtime API (core operations), script trust model, site-wide Akka stream |
| Health Monitoring | Site-side collection + central-side aggregation and offline detection |
| Site Event Logging | Event recording, retention/purge, remote query with pagination |
Prerequisites
| Dependency | Phase | What Must Be Complete |
|---|---|---|
| Cluster Infrastructure | 3A | Akka.NET cluster, SBR, singleton, CoordinatedShutdown |
| Host (site role) | 3A | Site-role Akka bootstrap |
| Site Runtime skeleton | 3A | Deployment Manager singleton, basic Instance Actor, supervision strategies, staggered startup |
| Local SQLite persistence | 3A | Deployed config storage, static attribute override persistence |
| Commons | 0 | IDataConnection interface, message contracts, shared types |
| Configuration Database | 1 | Central-side repositories (for health metric storage if needed) |
Requirements Checklist
Each checkbox item below is extracted from docs/requirements/HighLevelReqs.md at the individual-requirement level and must map to at least one work package.
Section 2.2 — Communication: Central <-> Site
- [ ] [2.2-1] Central-to-site and site-to-central communication uses Akka.NET (remoting/cluster).
- [ ] [2.2-2] Central as integration hub: central brokers requests between external systems and sites (e.g., recipe to site, MES requests machine values).
- [ ] [2.2-3] Real-time data streaming is not continuous for all machine data.
- [ ] [2.2-4] Only real-time stream is on-demand debug view — engineer opens live view of a specific instance's tag values and alarm states.
- [ ] [2.2-5] Debug view is session-based and temporary.
- [ ] [2.2-6] Debug view subscribes to site-wide Akka stream filtered by instance (see Section 8.1).
Section 2.3 — Site-Level Storage & Interface
- [ ] [2.3-1] Sites have no user interface — headless collectors, forwarders, and script executors.
- [ ] [2.3-2] Sites require local storage for: deployed (flattened) configurations, deployed scripts, shared scripts, external system definitions, database connection definitions, and notification lists.
- [ ] [2.3-3] Store-and-forward buffers persisted to local SQLite on each node and replicated between nodes. (Phase 3C owns S&F engine and buffer persistence/replication. Not in scope for Phase 3B — listed here for split-section completeness only.)
Section 2.4 — Data Connection Protocols
- [ ] [2.4-1] System supports OPC UA and LmxProxy (gRPC-based custom protocol with existing client SDK).
- [ ] [2.4-2] Both protocols implement a common interface supporting: connect, subscribe to tag paths, receive value updates, and write values.
- [ ] [2.4-3] Additional protocols can be added by implementing the common interface.
- [ ] [2.4-4] Data Connection Layer is a clean data pipe — publishes tag value updates to Instance Actors but performs no evaluation of triggers or alarm conditions.
Section 2.5 — Scale (context only for this phase)
- [ ] [2.5-1] Approximately 10 sites. (Validated in Phase 8; informs design here.)
- [ ] [2.5-2] 50-500 machines per site. (Validated in Phase 8; informs staggered startup batch sizing.)
- [ ] [2.5-3] 25-75 live data point tags per machine. (Validated in Phase 8; informs subscription management design.)
Section 3.4.1 — Alarm State
- [ ] [3.4.1-1] Alarm state (active/normal) is managed at the site level per instance, held in memory by the Alarm Actor.
- [ ] [3.4.1-2] When alarm condition clears, alarm automatically returns to normal state — no acknowledgment workflow.
- [ ] [3.4.1-3] Alarm state is not persisted — on restart, alarm states are re-evaluated from incoming values.
- [ ] [3.4.1-4] Alarm state changes published to site-wide Akka stream as [InstanceUniqueName].[AlarmName], alarm state (active/normal), priority, timestamp.
Section 4.1 — Script Definitions (Phase 3B portion: runtime compilation/execution)
- [ ] [4.1-5] Scripts are compiled at the site when a deployment is received. Pre-compilation validation occurs at central (Phase 2), but site performs actual compilation for execution.
- [ ] [4.1-6] Scripts can optionally define input parameters (name and data type per parameter). Scripts without parameter definitions accept no arguments.
- [ ] [4.1-7] Scripts can optionally define a return value definition (field names and data types). Return values support single objects and lists of objects. Scripts without a return definition return void.
- [ ] [4.1-8] Return values used when scripts called by other scripts (CallScript, CallShared) or by Inbound API (Route.To().Call()). When invoked by trigger, return value is discarded.
Phase 2 owns: [4.1-1] scripts are C# defined at template level, [4.1-2] inheritance/override/lock rules, [4.1-3] deployed as part of flattened config, [4.1-4] script definitions as first-class template members.
Section 4.2 — Script Triggers
- [ ] [4.2-1] Interval trigger: execute on recurring time schedule.
- [ ] [4.2-2] Value Change trigger: execute when a specific instance attribute value changes.
- [ ] [4.2-3] Conditional trigger: execute when an instance attribute value equals or does not equal a given value.
- [ ] [4.2-4] Optional minimum time between runs — if trigger fires before minimum interval has elapsed since last execution, invocation is skipped.
Section 4.3 — Script Error Handling
- [ ] [4.3-1] If a script fails (unhandled exception, timeout, etc.), the failure is logged locally at the site.
- [ ] [4.3-2] The script is not disabled — remains active and will fire on next qualifying trigger event.
- [ ] [4.3-3] Script failures are not reported to central. Diagnostics are local only. (Except aggregated error rate via Health Monitoring.)
- [ ] [4.3-4] For external system call failures within scripts, store-and-forward handling (Section 5.3) applies independently of script error handling. (S&F integration is Phase 7; noted here as boundary.)
Section 4.4 — Script Capabilities (Phase 3B portion)
- [ ] [4.4-1] Read attribute values on that instance (live data points and static config).
- [ ] [4.4-2] Write attributes — for attributes with data source reference, write goes to DCL which writes to physical device; in-memory value updates when device confirms via existing subscription.
- [ ] [4.4-3] Write attributes — for static attributes, write updates in-memory value and persists override to local SQLite; value survives restart and failover; persisted overrides reset on redeployment.
- [ ] [4.4-4] CallScript with ask pattern — Instance.CallScript("scriptName", params) returns called script's return value; supports concurrent execution.
- [ ] [4.4-5] CallShared — Scripts.CallShared("scriptName", params) executes inline in calling Script Actor's context; compiled code libraries, not separate actors.
- [ ] [4.4-10] Scripts cannot access other instances' attributes or scripts. (Negative requirement.)
Phase 7 owns: [4.4-6] ExternalSystem.Call(), [4.4-7] ExternalSystem.CachedCall(), [4.4-8] Send notifications, [4.4-9] Database.Connection().
Section 4.4.1 — Script Call Recursion Limit
- [ ] [4.4.1-1] Script-to-script calls (CallScript and CallShared) enforce maximum recursion depth.
- [ ] [4.4.1-2] Default maximum depth is a reasonable limit (e.g., 10 levels).
- [ ] [4.4.1-3] Current call depth is tracked and incremented with each nested call.
- [ ] [4.4.1-4] If limit reached, call fails with error logged to site event log.
- [ ] [4.4.1-5] Applies to all script call chains including alarm on-trigger scripts calling instance scripts.
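The depth-tracking rule above can be sketched as a small value type carried with each nested call. This is a minimal illustration, not the actual implementation; the names `ScriptCallContext` and `ScriptCallDepthException` are hypothetical.

```csharp
using System;

// Sketch of recursion-depth enforcement per Section 4.4.1. The caller is
// expected to log ScriptCallDepthException to the site event log [4.4.1-4].
public sealed record ScriptCallContext(int Depth)
{
    public const int MaxDepth = 10; // default limit [4.4.1-2]

    // Called before each nested CallScript/CallShared invocation [4.4.1-3].
    public ScriptCallContext Descend(string scriptName)
    {
        if (Depth + 1 > MaxDepth)
            throw new ScriptCallDepthException(
                $"Recursion limit ({MaxDepth}) reached calling '{scriptName}'.");
        return this with { Depth = Depth + 1 };
    }
}

public sealed class ScriptCallDepthException : Exception
{
    public ScriptCallDepthException(string message) : base(message) { }
}
```

Because the context travels with the call (including alarm-to-instance chains, [4.4.1-5]), the limit holds across actor boundaries without shared state.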
Section 4.5 — Shared Scripts (Phase 3B portion: runtime)
- [ ] [4.5-1] Shared scripts are not associated with any template — system-wide library of reusable C# scripts.
- [ ] [4.5-2] Shared scripts can optionally define input parameters and return value definitions, same rules as template-level scripts.
- [ ] [4.5-3] Deployed to all sites for use by any instance script (deployment requires explicit action by Deployment role user). (Deployment mechanism is Phase 3C; this phase implements site-side reception and compilation.)
- [ ] [4.5-4] Shared scripts execute inline in calling Script Actor's context as compiled code — not separate actors. Avoids serialization bottlenecks and messaging overhead.
- [ ] [4.5-5] Shared scripts are not available on the central cluster — Inbound API scripts cannot call them directly. (Negative requirement; verified as boundary.)
Section 4.6 — Alarm On-Trigger Scripts
- [ ] [4.6-1] Alarm on-trigger scripts defined as part of alarm definition, execute when alarm activates.
- [ ] [4.6-2] Execute directly in Alarm Actor's context (via short-lived Alarm Execution Actor), similar to shared scripts executing inline.
- [ ] [4.6-3] Alarm on-trigger scripts can call instance scripts via Instance.CallScript() — sends ask message to sibling Script Actor.
- [ ] [4.6-4] Instance scripts cannot call alarm on-trigger scripts — call direction is one-way. (Negative requirement.)
- [ ] [4.6-5] Recursion depth limit applies to alarm-to-instance script call chains.
Section 8.1 — Debug View
- [ ] [8.1-1] Subscribe-on-demand: engineer opens debug view, central subscribes to site-wide Akka stream filtered by instance unique name.
- [ ] [8.1-2] Site first provides a snapshot of all current attribute values and alarm states from Instance Actor.
- [ ] [8.1-3] Then streams subsequent changes from Akka stream.
- [ ] [8.1-4] Attribute value stream messages: [InstanceUniqueName].[AttributePath].[AttributeName], value, quality, timestamp.
- [ ] [8.1-5] Alarm state stream messages: [InstanceUniqueName].[AlarmName], state (active/normal), priority, timestamp.
- [ ] [8.1-6] Stream continues until engineer closes debug view; central unsubscribes and site stops streaming.
- [ ] [8.1-7] No attribute/alarm selection — debug view always shows all tag values and alarm states for the instance.
- [ ] [8.1-8] No special concurrency limits required.
Section 11.1 — Monitored Metrics
- [ ] [11.1-1] Site cluster online/offline status — whether site is reachable.
- [ ] [11.1-2] Active vs. standby node status — which node is active, which is standby.
- [ ] [11.1-3] Data connection health — connected/disconnected status per data connection.
- [ ] [11.1-4] Script error rates — frequency of script failures at site.
- [ ] [11.1-5] Alarm evaluation errors — frequency of alarm evaluation failures at site.
- [ ] [11.1-6] Store-and-forward buffer depth — number of messages currently queued, broken down by external system calls, notifications, and cached database writes. (S&F engine is Phase 3C; 3B reports placeholder/zero until S&F exists.)
Section 11.2 — Health Reporting
- [ ] [11.2-1] Site clusters report health metrics to central periodically.
- [ ] [11.2-2] Health status is visible in the central UI — no automated alerting/notifications for now.
Section 12.1 — Events Logged
- [ ] [12.1-1] Script executions: start, complete, error (with error details).
- [ ] [12.1-2] Alarm events: alarm activated, alarm cleared (which alarm, which instance, when). Alarm evaluation errors.
- [ ] [12.1-3] Deployment applications: configuration received from central, applied successfully or failed. Script compilation results.
- [ ] [12.1-4] Data connection status changes: connected, disconnected, reconnected per connection.
- [ ] [12.1-5] Store-and-forward activity: message queued, delivered, retried, parked. (S&F engine is Phase 3C; event logging API is available, S&F calls it when implemented.)
- [ ] [12.1-6] Instance lifecycle: instance enabled, disabled, deleted.
Section 12.2 — Event Log Storage
- [ ] [12.2-1] Event logs stored in local SQLite on each site node.
- [ ] [12.2-2] Retention policy: 30 days. Events older than 30 days automatically purged.
Section 12.3 — Central Access to Event Logs
- [ ] [12.3-1] Central UI can query site event logs remotely, following same pattern as parked message management — central requests data from site over Akka.NET remoting. (UI is Phase 6; backend query mechanism implemented here.)
Design Constraints Checklist
Constraints from CLAUDE.md Key Design Decisions (KDD) and Component-*.md (CD) that impose implementation requirements beyond HighLevelReqs.
Runtime & Actor Architecture
- [ ] [KDD-runtime-2] Site Runtime actor hierarchy: Deployment Manager singleton -> Instance Actors -> Script Actors + Alarm Actors. (Hierarchy established in 3A; 3B adds Script/Alarm Actor children.)
- [ ] [KDD-runtime-3] Script Actors spawn short-lived Script Execution Actors on a dedicated blocking I/O dispatcher.
- [ ] [KDD-runtime-4] Alarm Actors are a separate peer subsystem from scripts (not inside Script Engine).
- [ ] [KDD-runtime-5] Shared scripts execute inline as compiled code (no separate actors).
- [ ] [KDD-runtime-6] Site-wide Akka stream for attribute value and alarm state changes with per-subscriber buffering.
- [ ] [KDD-runtime-7] Instance Actors serialize all state mutations (Akka actor model); concurrent scripts produce interleaved side effects.
- [ ] [KDD-runtime-9] Supervision: Resume for coordinator actors (Script Actor, Alarm Actor), Stop for short-lived execution actors. (Strategy defined in 3A; 3B implements the actual actor types.)
Data & Communication
- [ ] [KDD-data-1] DCL connection actor uses Become/Stash pattern for lifecycle state machine (Connecting -> Connected -> Reconnecting).
- [ ] [KDD-data-2] DCL auto-reconnect at fixed interval; immediate bad quality on disconnect; transparent re-subscribe.
- [ ] [KDD-data-3] DCL write failures returned synchronously to calling script.
- [ ] [KDD-data-4] Tag path resolution retried periodically for devices still booting.
- [ ] [KDD-data-7] Tell for hot-path internal communication (tag value updates, attribute change notifications, stream publishing); Ask reserved for system boundaries (CallScript, Route.To, debug snapshot).
- [ ] [KDD-data-8] Application-level correlation IDs on all request/response messages (deployment ID, command ID, query ID).
Script Trust Model
- [ ] [KDD-code-9] Script trust model: forbidden APIs — System.IO, Process, Threading (except async/await), Reflection, raw network (System.Net.Sockets, System.Net.Http). Enforced at compilation and runtime.
Health & UI
- [ ] [KDD-ui-2] Real-time push for debug view and health dashboard. (Backend streaming support; UI rendering is Phase 6.)
- [ ] [KDD-ui-3] Health reports: 30s interval, 60s offline threshold, monotonic sequence numbers, raw error counts per interval.
- [ ] [KDD-ui-4] Dead letter monitoring as a health metric.
- [ ] [KDD-ui-5] Site Event Logging: 30-day retention, 1GB storage cap, daily purge, paginated queries with keyword search.
LmxProxy Protocol Details
- [ ] [CD-DCL-1] LmxProxy: gRPC/HTTP/2 transport, protobuf-net code-first, port 5050.
- [ ] [CD-DCL-2] LmxProxy: API key auth, session-based (SessionId), 30s keep-alive heartbeat via GetConnectionStateAsync.
- [ ] [CD-DCL-3] LmxProxy: Server-streaming gRPC for subscriptions (IAsyncEnumerable<VtqMessage>), 1000ms default sampling interval, 0 = on-change.
- [ ] [CD-DCL-4] LmxProxy: SDK retry policy (exponential backoff via Polly) complements DCL's fixed-interval reconnect. SDK handles operation-level transient failures; DCL handles connection-level recovery.
- [ ] [CD-DCL-5] LmxProxy: Batch read/write capabilities (ReadBatchAsync, WriteBatchAsync, WriteBatchAndWaitAsync).
- [ ] [CD-DCL-6] LmxProxy: TLS 1.2/1.3, mutual TLS (client cert + key PEM), custom CA trust, self-signed for dev.
Communication Component Design
- [ ] [CD-Comm-1] 8 distinct message patterns: Deployment, Instance Lifecycle, System-Wide Artifact, Integration Routing, Recipe/Command Delivery, Debug Streaming, Health Reporting, Remote Queries.
- [ ] [CD-Comm-2] Per-pattern timeouts: Deployment 120s, Instance Lifecycle 30s, System-Wide Artifacts 120s/site, Integration Routing 30s, Recipe/Command 30s, Remote Queries 30s.
- [ ] [CD-Comm-3] Transport heartbeat explicitly configured (not framework defaults).
- [ ] [CD-Comm-4] Message ordering: Akka.NET guarantees sender/receiver pair ordering; Communication Layer relies on this.
- [ ] [CD-Comm-5] Connection failure: in-flight messages fail via ask timeout, no central buffering. Debug streams killed on interruption — engineer must reopen.
- [ ] [CD-Comm-6] Failover: central failover = in-progress deployments treated as failed; site failover = singleton restarts, debug streams interrupted.
Site Event Logging Component Design
- [ ] [CD-SEL-1] Event entry schema: timestamp, event type, severity, instance ID (optional), source, message, details (optional).
- [ ] [CD-SEL-2] Only active node generates and stores events. Event logs not replicated to standby. On failover, new active starts fresh log; old node's events unavailable until it comes back.
- [ ] [CD-SEL-3] Storage cap (default 1 GB) enforced — if reached before 30-day window, oldest events purged first.
- [ ] [CD-SEL-4] Queries support filtering by: event type/category, time range, instance ID, severity, keyword search (SQLite LIKE on message and source).
- [ ] [CD-SEL-5] Results paginated (default 500 events) with continuation token.
Health Monitoring Component Design
- [ ] [CD-HM-1] Health report is flat snapshot of all metrics + monotonic sequence number + report timestamp.
- [ ] [CD-HM-2] Central replaces previous state only if incoming sequence number > last received (prevents stale report overwrite).
- [ ] [CD-HM-3] Online recovery: receipt of report from offline site automatically marks it online.
- [ ] [CD-HM-4] Error rates as raw counts per reporting interval, reset after each report.
- [ ] [CD-HM-5] Tag resolution counts: per connection, total subscribed vs. successfully resolved.
- [ ] [CD-HM-6] Health metrics held in memory at central — no historical data persisted.
- [ ] [CD-HM-7] No alerting — display-only for now.
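The sequence-number rule in [CD-HM-2] and the online-recovery rule in [CD-HM-3] reduce to a simple compare-and-replace per site. A minimal sketch, assuming a hypothetical `SiteHealthState` holder (the real central-side type may differ):

```csharp
using System;
using System.Collections.Generic;

// Sketch of central-side report acceptance. Accepting any report also
// implies marking the site online [CD-HM-3]; absence of reports for the
// 60s offline threshold [KDD-ui-3] marks it offline (timer not shown).
public sealed class SiteHealthState
{
    private readonly Dictionary<string, (long Seq, object Report)> _latest = new();

    // Returns true if the report replaced the stored state (i.e., it was
    // strictly newer than the last accepted report for this site).
    public bool Accept(string siteId, long sequenceNumber, object report)
    {
        if (_latest.TryGetValue(siteId, out var last) && sequenceNumber <= last.Seq)
            return false; // stale or duplicate report: keep existing state [CD-HM-2]

        _latest[siteId] = (sequenceNumber, report);
        return true;
    }
}
```

State lives only in this in-memory dictionary, consistent with [CD-HM-6] (no historical persistence).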
Site Runtime Component Design (beyond HighLevelReqs)
- [ ] [CD-SR-1] Script Execution Actor receives: compiled script code, input parameters, reference to parent Instance Actor, current call depth.
- [ ] [CD-SR-2] Alarm evaluation: Value Match (equals predefined), Range Violation (outside min/max), Rate of Change (exceeds threshold).
- [ ] [CD-SR-3] On alarm clear, no script execution — only state transition.
- [ ] [CD-SR-4] Script compilation errors on deployment cause entire instance deployment to be rejected (no partial state).
- [ ] [CD-SR-5] Script error includes: unhandled exceptions, timeouts, recursion limit violations.
- [ ] [CD-SR-6] Alarm evaluation errors logged locally; Alarm Actor remains active for subsequent updates.
- [ ] [CD-SR-7] Site-wide stream uses per-subscriber bounded buffers. Slow subscriber drops oldest events, does not block publishers.
- [ ] [CD-SR-8] Instance Actors publish to stream with fire-and-forget — publishing never blocks the actor.
- [ ] [CD-SR-9] Alarm Execution Actor can call instance scripts; instance scripts cannot call alarm on-trigger scripts (enforced at runtime).
- [ ] [CD-SR-10] Execution timeout per script is configurable. Exceeding timeout cancels script and logs error.
- [ ] [CD-SR-11] Memory: scripts share host process memory. No per-script memory limit.
- [ ] [CD-SR-12] Script trust model enforced by restricting assemblies/namespaces available to compilation context.
Data Connection Layer Component Design (beyond HighLevelReqs)
- [ ] [CD-DCL-7] Connection actor Become/Stash states: Connecting (stash requests), Connected (unstash and process), Reconnecting (stash new requests).
- [ ] [CD-DCL-8] On connection drop, immediately push bad quality for every tag subscribed on that connection.
- [ ] [CD-DCL-9] Auto-reconnect interval configurable per data connection.
- [ ] [CD-DCL-10] Tag path resolution failure: log to event log, mark attribute bad quality, periodically retry at configurable interval.
- [ ] [CD-DCL-11] Write failure: error returned to calling script; also logged to site event logging. No S&F for device writes.
- [ ] [CD-DCL-12] Value update message format: tag path, value, quality (good/bad/uncertain), timestamp.
- [ ] [CD-DCL-13] When Instance Actor stopped, DCL cleans up associated subscriptions.
- [ ] [CD-DCL-14] On redeployment, subscriptions established fresh based on new configuration.
- [ ] [CD-DCL-15] LmxProxy connection actor holds SessionId, starts 30s keep-alive timer on Connected state. On keep-alive failure, transitions to Reconnecting, client disposes subscriptions.
Work Packages
WP-1: Communication Layer — Message Contracts & Correlation IDs
Description: Define all message contracts for the 8 communication patterns with application-level correlation IDs.
Acceptance Criteria:
- Message contract types defined in Commons/Messages for all 8 patterns: Deployment request/response, Instance Lifecycle command/response, System-Wide Artifact deploy/ack, Integration Routing request/response, Recipe/Command request/ack, Debug Subscribe/Unsubscribe/Snapshot/StreamMessage, Health Report, Remote Query request/response (event logs, parked messages).
- All request/response message pairs include a correlation ID field (deployment ID, command ID, query ID).
- Contracts follow additive-only versioning rules (REQ-COM-5a).
- All timestamps in message contracts are UTC.
Estimated Complexity: M
Requirements Traced: [2.2-1], [2.2-2], [KDD-data-8], [CD-Comm-1]
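An illustrative shape for one correlated request/response pair follows. The type and field names are hypothetical; the actual contracts live in Commons/Messages and may differ.

```csharp
using System;

// Sketch of a correlated contract pair for Pattern 1 (Deployment).
// The correlation ID [KDD-data-8] is carried on both sides so central
// can match the site's reply to the original request.
public sealed record DeploymentRequest(
    Guid DeploymentId,       // correlation ID
    string SiteId,
    byte[] FlattenedConfig,
    DateTime SentAtUtc);     // all message timestamps are UTC

public sealed record DeploymentResponse(
    Guid DeploymentId,       // echoed back unchanged
    bool Success,
    string? Error,
    DateTime CompletedAtUtc);
```

New optional fields may be appended later without breaking existing readers, which is what the additive-only versioning rule (REQ-COM-5a) requires.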
WP-2: Communication Layer — Per-Pattern Timeouts
Description: Implement configurable per-pattern timeout support for all request/response patterns using the Akka ask pattern.
Acceptance Criteria:
- Timeout configuration via options class (bound to appsettings.json section).
- Default values: Deployment 120s, Instance Lifecycle 30s, System-Wide Artifacts 120s/site, Integration Routing 30s, Recipe/Command 30s, Remote Queries 30s.
- Timeout exceeded produces a clear failure result (not an unhandled exception).
- Integration test: verify timeout fires at configured interval.
Estimated Complexity: S
Requirements Traced: [CD-Comm-2]
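A minimal sketch of the options class, with defaults mirroring [CD-Comm-2]. Property names are illustrative; the real class would be bound from an appsettings.json section via the standard .NET options pattern.

```csharp
using System;

// Per-pattern ask timeouts [CD-Comm-2]. Bound from configuration so
// operators can tune them without a rebuild.
public sealed class CommunicationTimeoutOptions
{
    public TimeSpan Deployment { get; set; } = TimeSpan.FromSeconds(120);
    public TimeSpan InstanceLifecycle { get; set; } = TimeSpan.FromSeconds(30);
    public TimeSpan SystemWideArtifactPerSite { get; set; } = TimeSpan.FromSeconds(120);
    public TimeSpan IntegrationRouting { get; set; } = TimeSpan.FromSeconds(30);
    public TimeSpan RecipeCommand { get; set; } = TimeSpan.FromSeconds(30);
    public TimeSpan RemoteQuery { get; set; } = TimeSpan.FromSeconds(30);
}
```

When an `Ask` exceeds its timeout, Akka.NET faults the task with an `AskTimeoutException`; the communication layer should catch this and return a failure result rather than letting the exception propagate unhandled.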
WP-3: Communication Layer — Transport Heartbeat Configuration
Description: Explicitly configure Akka.NET remoting transport heartbeat settings (not framework defaults).
Acceptance Criteria:
- Transport heartbeat interval explicitly set in Akka.NET HOCON config.
- Failure detection threshold explicitly set.
- Values configurable via appsettings (not hardcoded).
- Settings documented in site and central appsettings templates.
Estimated Complexity: S
Requirements Traced: [CD-Comm-3]
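For reference, the relevant Akka.NET failure-detector knobs look roughly like the HOCON below. The interval values shown are illustrative examples, not the values this plan mandates; the point is that each setting is stated explicitly rather than inherited from framework defaults.

```hocon
# Illustrative overrides only — actual values come from appsettings templates.
akka.cluster.failure-detector {
  heartbeat-interval = 1s           # explicit, not framework default
  acceptable-heartbeat-pause = 6s   # tolerated pause before a node is suspected
  threshold = 8.0                   # phi accrual suspicion threshold
}
akka.remote.watch-failure-detector {
  heartbeat-interval = 1s
  acceptable-heartbeat-pause = 10s
}
```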
WP-4: Communication Layer — All 8 Message Patterns Implementation
Description: Implement central-side and site-side actors/handlers for all 8 communication patterns.
Acceptance Criteria:
- Pattern 1 (Deployment): Central sends flattened config, site responds success/failure. Unreachable site fails immediately.
- Pattern 2 (Instance Lifecycle): Central sends disable/enable/delete, site responds. Unreachable site fails immediately.
- Pattern 3 (System-Wide Artifacts): Central broadcasts to all sites, each site acknowledges independently.
- Pattern 4 (Integration Routing): Central brokers external request to site and returns response.
- Pattern 5 (Recipe/Command): Central routes fire-and-forget with ack.
- Pattern 6 (Debug Streaming): Subscribe request, snapshot response, then continuous stream. Unsubscribe request stops stream.
- Pattern 7 (Health Reporting): Site periodically pushes health report (Tell, no response needed).
- Pattern 8 (Remote Queries): Central queries site for event logs / parked messages, site responds.
- Message ordering preserved per sender/receiver pair (Akka guarantee relied upon).
- Sites do not communicate with each other — all messages hub-and-spoke through central.
Estimated Complexity: L
Requirements Traced: [2.2-1], [2.2-2], [2.2-3], [2.2-4], [2.2-5], [2.2-6], [CD-Comm-1], [CD-Comm-4], [CD-Comm-5], [CD-Comm-6]
WP-5: Communication Layer — Connection Failure & Failover Behavior
Description: Implement connection failure handling and failover behavior for the communication layer.
Acceptance Criteria:
- In-flight messages: on connection drop, ask pattern times out and caller receives failure. No central-side buffering or retry.
- Debug streams: connection interruption kills the stream. Engineer must reopen debug view.
- Central failover: in-progress deployments treated as failed.
- Site failover: singleton restarts, central detects node change and reconnects. Debug streams interrupted.
Estimated Complexity: M
Requirements Traced: [CD-Comm-5], [CD-Comm-6]
WP-6: Data Connection Layer — Connection Actor with Become/Stash Lifecycle
Description: Implement the connection actor using Akka.NET Become/Stash pattern for lifecycle state machine.
Acceptance Criteria:
- Three states implemented: Connecting, Connected, Reconnecting.
- In Connecting state: subscription requests and write commands are stashed.
- On transition to Connected: all stashed messages unstashed and processed.
- In Reconnecting state: new requests stashed while retry occurs.
- State transitions logged to Site Event Logging ([12.1-4]).
- One connection actor per data connection definition at the site.
Estimated Complexity: M
Requirements Traced: [KDD-data-1], [CD-DCL-7]
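The three states above map naturally onto `Become` handler sets with stashing, sketched below. Message types (`ConnectSucceeded`, `ConnectionLost`, `SubscribeTag`) are hypothetical placeholders for whatever the real DCL contracts define.

```csharp
using Akka.Actor;

public sealed record ConnectSucceeded;
public sealed record ConnectionLost;
public sealed record SubscribeTag(string TagPath);

// Sketch of the Become/Stash lifecycle from [CD-DCL-7].
public sealed class ConnectionActor : ReceiveActor, IWithUnboundedStash
{
    public IStash Stash { get; set; } = null!;

    public ConnectionActor() => Connecting();

    private void Connecting()
    {
        Receive<ConnectSucceeded>(_ =>
        {
            Stash.UnstashAll();          // replay requests queued while connecting
            Become(Connected);
        });
        ReceiveAny(_ => Stash.Stash());  // subscriptions and writes wait
    }

    private void Connected()
    {
        Receive<ConnectionLost>(_ => Become(Reconnecting));
        Receive<SubscribeTag>(msg => { /* create subscription via protocol adapter */ });
    }

    private void Reconnecting()
    {
        Receive<ConnectSucceeded>(_ => { Stash.UnstashAll(); Become(Connected); });
        ReceiveAny(_ => Stash.Stash());  // new requests stashed during retry
    }
}
```

Using `Become` keeps each state's handler set small and makes illegal transitions impossible to express, at the cost of remembering to unstash on every entry into Connected.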
WP-7: Data Connection Layer — OPC UA Adapter
Description: Implement the OPC UA adapter conforming to IDataConnection.
Acceptance Criteria:
- Implements all IDataConnection methods: Connect, Disconnect, Subscribe, Unsubscribe, Read, Write, Status.
- OPC UA client establishes session with configured endpoint.
- Subscribe creates OPC UA monitored items.
- Value updates delivered as {tagPath, value, quality, timestamp} tuples.
- Write operation sends value to OPC UA server.
- Status reports connection state (connected/disconnected/reconnecting).
- Integration test against OPC PLC simulator (from test infrastructure).
Estimated Complexity: L
Requirements Traced: [2.4-1], [2.4-2], [CD-DCL-12]
WP-8: Data Connection Layer — LmxProxy Adapter
Description: Implement the LmxProxy adapter wrapping the existing LmxProxyClient SDK behind IDataConnection.
Acceptance Criteria:
- Implements all IDataConnection methods mapped per docs/requirements/Component-DCL concrete type mappings.
- Connect: calls ConnectAsync, stores SessionId.
- Subscribe: calls SubscribeAsync, processes IAsyncEnumerable<VtqMessage> stream, forwards updates.
- Write: calls WriteAsync.
- Read: calls ReadAsync.
- Configurable sampling interval (default 1000ms, 0 = on-change).
- gRPC/HTTP/2 transport on configured port (default 5050).
- API key authentication passed in ConnectRequest.
- TLS support: TLS 1.2/1.3, mutual TLS, custom CA trust, self-signed for dev.
- 30s keep-alive heartbeat via GetConnectionStateAsync. On failure, marks disconnected, disposes subscriptions.
- SDK retry policy (Polly exponential backoff) retained for operation-level transient failures.
- Batch operations exposed (ReadBatchAsync, WriteBatchAsync) for future use.
Estimated Complexity: L
Requirements Traced: [2.4-1], [2.4-2], [CD-DCL-1], [CD-DCL-2], [CD-DCL-3], [CD-DCL-4], [CD-DCL-5], [CD-DCL-6], [CD-DCL-15]
WP-9: Data Connection Layer — Auto-Reconnect & Bad Quality Propagation
Description: Implement auto-reconnection at fixed interval with immediate bad quality propagation on disconnect.
Acceptance Criteria:
- On connection drop: immediately push value update with quality bad for every tag subscribed on that connection.
- Auto-reconnect at configurable fixed interval per data connection (e.g., 5 seconds default).
- Reconnect interval is per-connection, not global.
- Connection state tracked as connected/disconnected/reconnecting.
- All state transitions logged to Site Event Logging.
- Instance Actors and downstream consumers see staleness immediately on disconnect.
Estimated Complexity: M
Requirements Traced: [KDD-data-2], [CD-DCL-8], [CD-DCL-9]
WP-10: Data Connection Layer — Transparent Re-Subscribe
Description: On successful reconnection, automatically re-establish all previously active subscriptions.
Acceptance Criteria:
- After reconnection, all subscriptions that were active before disconnect are re-subscribed.
- Instance Actors require no action — they see quality return to good as fresh values arrive.
- LmxProxy adapter: new session established, new subscriptions created (old session/subscriptions were disposed on disconnect).
- OPC UA adapter: new session established, monitored items re-created.
- Test: disconnect OPC UA server, reconnect, verify values resume without Instance Actor intervention.
Estimated Complexity: M
Requirements Traced: [KDD-data-2], [2.4-2]
WP-11: Data Connection Layer — Write-Back Support
Description: Implement write-back from Instance Actors through DCL to physical devices.
Acceptance Criteria:
- Instance Actor sends write request to DCL when script calls SetAttribute for data-connected attribute.
- DCL writes value via appropriate protocol (OPC UA Write / LmxProxy WriteAsync).
- Write failure (connection down, device rejection, timeout) returned synchronously to calling script.
- Successful write: in-memory value NOT optimistically updated. Value updates only when device confirms via existing subscription.
- Write failures also logged to Site Event Logging.
- No store-and-forward for device writes.
- Test: script writes value, verify value update arrives only after device confirms.
Estimated Complexity: M
Requirements Traced: [4.4-2], [KDD-data-3], [CD-DCL-11]
WP-12: Data Connection Layer — Tag Path Resolution with Retry
Description: Handle tag paths that do not resolve on the physical device, with periodic retry.
Acceptance Criteria:
- When tag path does not exist on device: failure logged to Site Event Logging.
- Attribute marked with quality bad.
- Periodic retry at configurable interval to accommodate devices that boot in stages.
- On successful resolution: subscription activates normally, quality reflects live value.
- Separate from connection-level reconnect — tag resolution retry handles individual tag failures on an active connection.
Estimated Complexity: M
Requirements Traced: [KDD-data-4], [CD-DCL-10]
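The retry bookkeeping described above can be sketched as a small tracker the connection actor polls on a timer tick. The `UnresolvedTagTracker` name and shape are illustrative, not the actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of per-tag resolution retry on an active connection [CD-DCL-10].
public sealed class UnresolvedTagTracker
{
    private readonly Dictionary<string, DateTime> _nextAttempt = new();
    private readonly TimeSpan _retryInterval; // configurable per connection

    public UnresolvedTagTracker(TimeSpan retryInterval) => _retryInterval = retryInterval;

    // Called when a subscribe reports the tag path missing on the device;
    // the caller also logs the failure and marks the attribute bad quality.
    public void MarkUnresolved(string tagPath, DateTime nowUtc) =>
        _nextAttempt[tagPath] = nowUtc + _retryInterval;

    // Called on the periodic tick; returns tags due for another resolve attempt
    // and reschedules them in case the attempt fails again.
    public IReadOnlyList<string> TakeDue(DateTime nowUtc)
    {
        var due = _nextAttempt.Where(kv => kv.Value <= nowUtc)
                              .Select(kv => kv.Key).ToList();
        foreach (var tag in due) _nextAttempt[tag] = nowUtc + _retryInterval;
        return due;
    }

    // Called when the tag resolves and the subscription activates normally.
    public void MarkResolved(string tagPath) => _nextAttempt.Remove(tagPath);
}
```

Keeping this separate from connection-level reconnect preserves the boundary the acceptance criteria draw: tag retry runs against a healthy connection.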
WP-13: Data Connection Layer — Health Reporting
Description: DCL reports connection status and tag resolution metrics to Health Monitoring.
Acceptance Criteria:
- Reports connection status (connected/disconnected/reconnecting) per data connection.
- Reports tag resolution counts per connection: total subscribed tags vs. successfully resolved tags.
- Metrics collected and available for inclusion in periodic health report.
Estimated Complexity: S
Requirements Traced: [11.1-3], [CD-HM-5], [CD-DCL-12]
WP-14: Data Connection Layer — Subscription & Cleanup Lifecycle
Description: Manage subscription creation when Instance Actors start and cleanup when they stop.
Acceptance Criteria:
- When Instance Actor created: registers data source references with DCL for subscription.
- DCL subscribes to tag paths using concrete connection details from flattened configuration.
- Tag value updates delivered directly to requesting Instance Actor.
- When Instance Actor stopped (disable, delete, redeployment): DCL cleans up associated subscriptions.
- On redeployment: subscriptions established fresh based on new configuration.
- Protocol-agnostic — works for both OPC UA and LmxProxy.
Estimated Complexity: M
Requirements Traced: [2.4-4], [CD-DCL-13], [CD-DCL-14]
WP-15: Site Runtime — Script Actor & Script Execution Actor
Description: Implement the Script Actor coordinator and short-lived Script Execution Actor for script invocation.
Acceptance Criteria:
- Script Actor created as child of Instance Actor (one per script definition).
- Script Actor holds compiled script code, trigger configuration, and manages trigger evaluation.
- Interval trigger: internal timer, spawns Script Execution Actor on fire.
- Value Change trigger: subscribes to attribute change notifications from Instance Actor, spawns Script Execution Actor on change.
- Conditional trigger: subscribes to attribute notifications, evaluates condition (equals/not-equals), spawns Script Execution Actor when condition met.
- Minimum time between runs: Script Actor tracks last execution time, skips trigger if minimum interval not elapsed.
- Script Execution Actor is short-lived child, receives compiled code, input parameters, reference to Instance Actor, current call depth.
- Script Execution Actor runs on dedicated blocking I/O dispatcher.
- Multiple Script Execution Actors can run concurrently.
- Script Actor coordinator does not block on child completion.
- Supervision: Script Actor resumed on exception; Script Execution Actor stopped on unhandled exception.
- Return value (if defined) sent back to caller; discarded for trigger invocations.
Estimated Complexity: L
Requirements Traced: [4.2-1], [4.2-2], [4.2-3], [4.2-4], [4.1-5], [4.1-6], [4.1-7], [4.1-8], [KDD-runtime-2], [KDD-runtime-3], [KDD-runtime-9], [CD-SR-1], [CD-SR-10]
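The minimum-time-between-runs rule reduces to a small gate the coordinator consults before spawning an execution actor. A hedged Python sketch of that logic (names such as `TriggerGate` and `try_fire` are invented for illustration; the real Script Actor is C#/Akka.NET):

```python
import time

class TriggerGate:
    """Enforces minimum time between runs: the coordinator records the
    last execution time and skips triggers that fire too soon."""

    def __init__(self, min_interval_s, now_fn=time.monotonic):
        self.min_interval_s = min_interval_s
        self.now_fn = now_fn
        self.last_run = None

    def try_fire(self):
        """Return True if the trigger may spawn a Script Execution Actor."""
        now = self.now_fn()
        if self.last_run is not None and now - self.last_run < self.min_interval_s:
            return False   # skipped: minimum interval not elapsed
        self.last_run = now
        return True
```

The same gate applies uniformly to interval, value change, and conditional triggers, since all three funnel through the Script Actor before spawning an execution child.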
WP-16: Site Runtime — Alarm Actor & Alarm Execution Actor
Description: Implement the Alarm Actor coordinator for alarm condition evaluation and state management.
Acceptance Criteria:
- Alarm Actor created as child of Instance Actor (one per alarm definition).
- Alarm Actor subscribes to attribute change notifications from Instance Actor for referenced attribute(s).
- Evaluates trigger conditions: Value Match, Range Violation, Rate of Change.
- Alarm state (active/normal) held in memory only — not persisted.
- On alarm activate (condition met, currently normal): transition to active, update Instance Actor alarm state (publishes to stream), spawn Alarm Execution Actor for on-trigger script if defined.
- On alarm clear (condition clears, currently active): transition to normal, update Instance Actor. No script execution on clear.
- On restart/failover: alarm starts in normal, re-evaluates from incoming values.
- Alarm Execution Actor: short-lived child, same pattern as Script Execution Actor. Has access to Instance Actor for GetAttribute/SetAttribute.
- Alarm Actors are a separate peer subsystem from Script Actors (not nested inside).
- Alarm evaluation errors logged locally; Alarm Actor remains active for subsequent updates.
- Supervision: Alarm Actor resumed on exception; Alarm Execution Actor stopped on unhandled exception.
Estimated Complexity: L
Requirements Traced: [3.4.1-1], [3.4.1-2], [3.4.1-3], [3.4.1-4], [4.6-1], [4.6-2], [KDD-runtime-4], [KDD-runtime-9], [CD-SR-2], [CD-SR-3], [CD-SR-6]
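The alarm lifecycle above is an edge-triggered two-state machine: the on-trigger script fires only on the normal-to-active transition, and nothing runs on clear. A minimal Python sketch of that logic (class and hook names are hypothetical; the real coordinator is an Akka.NET actor):

```python
class AlarmState:
    """Alarm state held in memory only: starts normal, fires the on-trigger
    hook on the normal->active edge, and runs nothing on clear."""

    def __init__(self, condition, on_trigger=None):
        self.condition = condition      # callable(value) -> bool
        self.on_trigger = on_trigger    # optional on-trigger script hook
        self.active = False             # always normal after restart/failover

    def evaluate(self, value):
        met = self.condition(value)
        if met and not self.active:
            self.active = True          # activate: update Instance Actor, publish to stream
            if self.on_trigger:
                self.on_trigger(value)  # spawn Alarm Execution Actor
        elif not met and self.active:
            self.active = False         # clear: state update only, no script
```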
WP-17: Site Runtime — Shared Script Library (Inline Execution)
Description: Implement shared script compilation and inline execution within Script Execution Actor context.
Acceptance Criteria:
- Shared scripts compiled at site when received from central.
- Compiled code stored in memory, made available to all Script Actors.
- `Scripts.CallShared("scriptName", params)` executes shared script inline — direct method invocation, not actor message.
- Shared scripts not associated with any template — system-wide library.
- Shared scripts can define input parameters and return value definitions.
- No serialization bottleneck — inline execution avoids contention on a shared actor.
- Shared scripts have access to same runtime API as instance scripts (GetAttribute, SetAttribute, etc.).
- Shared scripts are not available on central cluster. (Negative: verified by architecture — site-only compilation.)
Estimated Complexity: M
Requirements Traced: [4.5-1], [4.5-2], [4.5-3], [4.5-4], [4.5-5], [KDD-runtime-5]
WP-18: Site Runtime — Script Runtime API (Core Operations)
Description: Implement the core Script Runtime API available to all script and alarm execution actors.
Acceptance Criteria:
- `Instance.GetAttribute("name")` — reads current in-memory value from parent Instance Actor.
- `Instance.SetAttribute("name", value)` — for data-connected: sends write to DCL, error returned synchronously; for static: updates in-memory + persists to SQLite, survives restart/failover, resets on redeployment.
- `Instance.CallScript("scriptName", params)` — ask pattern to sibling Script Actor, target spawns Script Execution Actor, returns result. Includes current recursion depth.
- `Scripts.CallShared("scriptName", params)` — inline execution. Includes current recursion depth.
- Scripts can only access own instance's attributes/scripts. Cross-instance access fails with clear error.
- Runtime API provided via a context object injected into Script Execution Actor.
Estimated Complexity: L
Requirements Traced: [4.4-1], [4.4-2], [4.4-3], [4.4-4], [4.4-5], [4.4-10], [KDD-data-7]
WP-19: Site Runtime — Script Trust Model & Constrained Compilation
Description: Implement compilation restrictions and runtime constraints for script execution.
Acceptance Criteria:
- Forbidden APIs enforced at compilation: System.IO, System.Diagnostics.Process, System.Threading (except async/await), System.Reflection, System.Net.Sockets, System.Net.Http, assembly loading, unsafe code.
- Compilation context restricts available assemblies and namespaces.
- Execution timeout: configurable per-script maximum execution time. Exceeding timeout cancels script and logs error.
- Memory: scripts share host process memory, no per-script memory limit (timeout prevents runaway allocations).
- Test: verify compilation fails when script references forbidden API.
- Test: verify runtime timeout cancels long-running script.
Estimated Complexity: L
Requirements Traced: [KDD-code-9], [CD-SR-10], [CD-SR-11], [CD-SR-12]
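As a rough illustration of the compilation gate, the sketch below checks script source against the forbidden-namespace list. The real gate would operate on the Roslyn compilation context (restricted assembly references and namespaces, unsafe-code rejection) rather than a text scan, so treat this purely as a statement of the denylist, with hypothetical function names:

```python
# Namespaces denied to site scripts, per the trust model in WP-19.
# (System.Threading is restricted except async/await; omitted from this
# simplified scan because that exception needs syntax-level analysis.)
FORBIDDEN_NAMESPACES = [
    "System.IO",
    "System.Diagnostics.Process",
    "System.Reflection",
    "System.Net.Sockets",
    "System.Net.Http",
]

def check_script_source(source: str):
    """Return the forbidden namespaces referenced by the source text.
    An empty list means the script passes this (simplified) gate."""
    return [ns for ns in FORBIDDEN_NAMESPACES if ns in source]
```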
WP-20: Site Runtime — Recursion Limit Enforcement
Description: Enforce maximum recursion depth for script-to-script calls.
Acceptance Criteria:
- Every CallScript and CallShared increments call depth counter.
- Default maximum depth: 10 levels (configurable).
- If limit exceeded, call fails with error.
- Error logged to site event log.
- Applies to all call chains: script -> script, script -> shared, alarm on-trigger -> instance script chains.
- Test: create call chain of depth 11, verify it fails at the 11th level with logged error.
Estimated Complexity: S
Requirements Traced: [4.4.1-1], [4.4.1-2], [4.4.1-3], [4.4.1-4], [4.4.1-5], [4.6-5]
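Depth enforcement is a counter threaded through every call. A minimal Python sketch, assuming a hypothetical `call_with_depth` helper and the default limit of 10:

```python
MAX_CALL_DEPTH = 10  # default, configurable

class RecursionLimitError(Exception):
    pass

def call_with_depth(invoke, depth):
    """Every CallScript/CallShared passes the current depth; the callee
    runs at depth + 1 and the call fails once the limit is exceeded."""
    if depth + 1 > MAX_CALL_DEPTH:
        # In the real runtime this error is logged to the site event log.
        raise RecursionLimitError(
            f"call depth {depth + 1} exceeds limit {MAX_CALL_DEPTH}")
    return invoke(depth + 1)
```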
WP-21: Site Runtime — Alarm On-Trigger Script Call Direction Enforcement
Description: Enforce one-way call direction between alarm on-trigger scripts and instance scripts.
Acceptance Criteria:
- Alarm Execution Actor can call instance scripts via `Instance.CallScript()` (sends ask to sibling Script Actor).
- Instance scripts (Script Execution Actors) cannot call alarm on-trigger scripts. Mechanism: alarm on-trigger scripts are not exposed as callable targets in the Script Runtime API; no `Instance.CallAlarmScript()` API exists.
- Test: verify alarm on-trigger script successfully calls instance script.
- Test: verify no API path exists for instance scripts to invoke alarm on-trigger scripts.
Estimated Complexity: S
Requirements Traced: [4.6-3], [4.6-4], [CD-SR-9]
WP-22: Site Runtime — Tell vs Ask Conventions
Description: Implement correct Tell/Ask usage patterns per Akka.NET best practices.
Acceptance Criteria:
- Tell (fire-and-forget) used for: tag value updates (DCL -> Instance Actor), attribute change notifications (Instance Actor -> Script/Alarm Actors), stream publishing (Instance Actor -> Akka stream).
- Ask used for: `Instance.CallScript()` (Script Execution Actor -> sibling Script Actor), `Route.To().Call()` (Inbound API -> site, Phase 7), debug view snapshot (Communication Layer -> Instance Actor).
- No Ask usage on the hot path (tag updates, notifications).
Estimated Complexity: S
Requirements Traced: [KDD-data-7]
WP-23: Site Runtime — Site-Wide Akka Stream
Description: Implement the site-wide Akka stream for attribute value and alarm state changes with per-subscriber backpressure.
Acceptance Criteria:
- All Instance Actors publish attribute value changes and alarm state changes to the stream.
- Attribute change format: `[InstanceUniqueName].[AttributePath].[AttributeName]`, value, quality, timestamp.
- Alarm change format: `[InstanceUniqueName].[AlarmName]`, state (active/normal), priority, timestamp.
- Per-subscriber bounded buffers. Each subscriber gets independent buffer.
- Slow subscriber: buffer fills, oldest events dropped. Does not affect other subscribers or publishers.
- Instance Actors publish with fire-and-forget — publishing never blocks the actor.
- Debug view can subscribe filtered by instance unique name.
- Stream survives individual Instance Actor stop/restart.
Estimated Complexity: L
Requirements Traced: [3.4.1-4], [KDD-runtime-6], [CD-SR-7], [CD-SR-8], [8.1-1], [8.1-4], [8.1-5]
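The per-subscriber buffering rule maps naturally onto independent bounded queues, where appending to a full queue evicts the oldest event. A Python sketch using `collections.deque(maxlen=...)` (the `SiteStream` class and buffer size are illustrative; the real stream is built on Akka.NET Streams):

```python
from collections import deque

class SiteStream:
    """Fan-out with an independent bounded buffer per subscriber: a slow
    subscriber overflows only its own buffer (oldest events dropped),
    without affecting publishers or other subscribers."""

    def __init__(self, buffer_size=1000):
        self.buffer_size = buffer_size
        self.subscribers = {}          # subscriber id -> bounded deque

    def subscribe(self, sub_id):
        self.subscribers[sub_id] = deque(maxlen=self.buffer_size)

    def publish(self, event):
        # Fire-and-forget from the publisher's point of view: appending
        # to a full deque silently evicts the oldest event, never blocks.
        for buf in self.subscribers.values():
            buf.append(event)

    def drain(self, sub_id):
        """Hand the buffered events to the subscriber and reset its buffer."""
        events = list(self.subscribers[sub_id])
        self.subscribers[sub_id] = deque(maxlen=self.buffer_size)
        return events
```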
WP-24: Site Runtime — Concurrency Serialization
Description: Ensure Instance Actor correctly serializes all state mutations while allowing concurrent script execution.
Acceptance Criteria:
- Instance Actor processes messages sequentially (standard Akka model).
- SetAttribute calls from concurrent Script Execution Actors serialized at Instance Actor — no race conditions on attribute state.
- Script Execution Actors may run concurrently; all state mutations mediated through Instance Actor message queue.
- External side effects (external system calls, notifications) not serialized — concurrent scripts produce interleaved side effects (acceptable).
- Test: two concurrent scripts writing to same attribute, verify no lost updates (serialized through Instance Actor).
Estimated Complexity: M
Requirements Traced: [KDD-runtime-7]
WP-25: Site Runtime — Debug View Backend Support
Description: Implement the site-side debug view infrastructure: snapshot + stream subscription.
Acceptance Criteria:
- Central sends subscribe request for specific instance (by unique name).
- Instance Actor provides snapshot of all current attribute values and alarm states.
- Site subscribes to site-wide Akka stream filtered by instance unique name and forwards changes to central.
- Central sends unsubscribe request when debug view closes; site removes stream subscription.
- Session-based and temporary — no persistent subscriptions.
- No attribute/alarm selection — always shows all tags and alarms for the instance.
- No special concurrency limits on debug subscriptions.
- Connection interruption kills debug stream; engineer must reopen.
Estimated Complexity: M
Requirements Traced: [8.1-1], [8.1-2], [8.1-3], [8.1-4], [8.1-5], [8.1-6], [8.1-7], [8.1-8], [KDD-ui-2]
WP-26: Health Monitoring — Site-Side Metric Collection
Description: Implement the site-side health metric collector that aggregates metrics from all site subsystems.
Acceptance Criteria:
- Collects all metrics defined in 11.1:
  - Active/standby node status (from Cluster Infrastructure).
  - Data connection health: connected/disconnected/reconnecting per data connection (from DCL).
  - Tag resolution counts per connection (from DCL).
  - Script error rates: raw count per interval, reset after report (from Site Runtime).
  - Alarm evaluation error rates: raw count per interval, reset after report (from Site Runtime).
  - Store-and-forward buffer depth by category. (Reports 0/placeholder until S&F implemented in Phase 3C.)
  - Dead letter count: subscribed to Akka.NET EventStream dead letter events, count per interval.
- Script errors include: unhandled exceptions, timeouts, recursion limit violations.
Estimated Complexity: M
Requirements Traced: [11.1-1], [11.1-2], [11.1-3], [11.1-4], [11.1-5], [11.1-6], [KDD-ui-4], [CD-HM-4], [CD-HM-5], [CD-SR-5]
WP-27: Health Monitoring — Periodic Reporting with Sequence Numbers
Description: Implement periodic health report sending from site to central with monotonic sequence numbers.
Acceptance Criteria:
- Health report sent at configurable interval (default 30 seconds).
- Report is flat snapshot of all current metric values.
- Includes monotonic sequence number (incremented per report).
- Includes report timestamp (UTC from site clock).
- Sent via Communication Layer (Pattern 7: periodic push, Tell — no response needed).
- Sequence number survives within a singleton lifecycle; resets on singleton restart (central handles via comparison).
Estimated Complexity: S
Requirements Traced: [11.2-1], [KDD-ui-3], [CD-HM-1]
WP-28: Health Monitoring — Central-Side Aggregation & Offline Detection
Description: Implement central-side health metric reception, aggregation, and site online/offline detection.
Acceptance Criteria:
- Receives health reports from all sites.
- Stores latest metrics per site in memory (no persistence).
- Replaces previous state only if incoming sequence number > last received (prevents stale overwrite).
- Offline detection: if no report received within configurable timeout (default 60s — 2x report interval), site marked offline.
- Online recovery: receipt of report from offline site automatically marks it online — no manual ack.
- Metrics available for Central UI dashboard (rendering is Phase 4/6).
- No alerting — display-only.
Estimated Complexity: M
Requirements Traced: [11.1-1], [11.2-1], [11.2-2], [KDD-ui-3], [CD-HM-2], [CD-HM-3], [CD-HM-6], [CD-HM-7]
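The sequence-number and offline rules combine into a small acceptance policy, which also covers the failover reset discussed in Q-P3B-4: stale reports are rejected only while the site is online, so a post-failover report with a low sequence number from an offline site is accepted. A hedged Python sketch (class and field names are illustrative):

```python
import time

class SiteHealthTracker:
    """Central-side view of one site. Lower sequence numbers are discarded
    while the site is online (prevents stale overwrite), but any report
    from a site marked offline is accepted, which absorbs the sequence
    reset after a site failover."""

    def __init__(self, offline_timeout_s=60.0, now_fn=time.monotonic):
        self.offline_timeout_s = offline_timeout_s  # default 2x report interval
        self.now_fn = now_fn
        self.last_seq = None
        self.last_seen = None
        self.metrics = None            # latest metrics, in memory only

    @property
    def online(self):
        return (self.last_seen is not None
                and self.now_fn() - self.last_seen < self.offline_timeout_s)

    def on_report(self, seq, metrics):
        """Return True if the report was accepted."""
        if self.online and self.last_seq is not None and seq <= self.last_seq:
            return False               # stale report, keep current state
        self.last_seq, self.metrics = seq, metrics
        self.last_seen = self.now_fn() # receipt marks the site online again
        return True
```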
WP-29: Site Event Logging — Event Recording to SQLite
Description: Implement the site event logging service with SQLite persistence.
Acceptance Criteria:
- Event logging service available as a cross-cutting concern to all site subsystems.
- Events recorded with schema: timestamp (UTC), event type, severity (Info/Warning/Error), instance ID (optional), source, message, details (optional).
- Categories supported: script executions, alarm events, deployment applications, data connection status, store-and-forward activity, instance lifecycle.
- Only active node generates and stores events. Event logs not replicated to standby.
- On failover, the new active node starts logging to its own SQLite; historical events from the previous active node are unavailable until that node returns.
- SQLite database created at site startup if not exists.
Estimated Complexity: M
Requirements Traced: [12.1-1], [12.1-2], [12.1-3], [12.1-4], [12.1-5], [12.1-6], [12.2-1], [CD-SEL-1], [CD-SEL-2]
WP-30: Site Event Logging — Retention & Storage Cap Enforcement
Description: Implement 30-day retention with daily purge and 1GB storage cap.
Acceptance Criteria:
- Daily background job on active node deletes all events older than 30 days. Hard delete, no archival.
- Configurable storage cap (default 1 GB). If cap reached before 30-day window, oldest events purged first.
- Storage cap checked periodically (at least daily, ideally on each purge run).
- Purge does not block event recording (runs on background thread/task).
Estimated Complexity: S
Requirements Traced: [12.2-2], [KDD-ui-5], [CD-SEL-3]
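The purge job above amounts to two deletes: one by age, then oldest-first batches while the database is over the cap. A runnable SQLite sketch in Python (the `events` schema is simplified to the columns the purge touches, and the live-size estimate via PRAGMAs is an assumption; the real cap check might measure the database file directly):

```python
import sqlite3

RETENTION_DAYS = 30
STORAGE_CAP_BYTES = 1_000_000_000   # default 1 GB, configurable

def db_size_bytes(conn):
    # Approximate live size: pages in use (total minus freelist) * page size.
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    freelist = conn.execute("PRAGMA freelist_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return (page_count - freelist) * page_size

def purge(conn, now_utc_s, cap_bytes=STORAGE_CAP_BYTES):
    """Daily purge: hard-delete events past the retention window, then
    evict oldest-first in batches while the storage cap is exceeded."""
    cutoff = now_utc_s - RETENTION_DAYS * 86400
    conn.execute("DELETE FROM events WHERE timestamp < ?", (cutoff,))
    while db_size_bytes(conn) > cap_bytes:
        cur = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events ORDER BY timestamp, id LIMIT 1000)")
        if cur.rowcount == 0:
            break   # cap still exceeded but nothing left to evict
    conn.commit()
```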
WP-31: Site Event Logging — Remote Query with Pagination & Keyword Search
Description: Implement remote query support for central to query site event logs.
Acceptance Criteria:
- Query received via Communication Layer (Pattern 8: Remote Queries).
- Supports filtering by: event type/category, time range, instance ID, severity, keyword search (SQLite LIKE on message and source fields).
- Results paginated with configurable page size (default 500 events).
- Each response includes continuation token for fetching additional pages.
- Site processes query locally against SQLite and returns matching results to central.
Estimated Complexity: M
Requirements Traced: [12.3-1], [KDD-ui-5], [CD-SEL-4], [CD-SEL-5]
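One way to realise the continuation token is keyset pagination on the event id: the token for the next page is simply the last id returned. A Python/SQLite sketch with a simplified schema (`query_events`, the token scheme, and the LIKE pattern are illustrative choices, not mandated by the plan):

```python
import sqlite3

PAGE_SIZE = 500  # default page size, configurable

def query_events(conn, keyword=None, after_id=0, page_size=PAGE_SIZE):
    """Keyset pagination: the continuation token is the last event id of
    the previous page. Keyword search uses SQLite LIKE on message and
    source, matching the WP-31 filter description."""
    sql = "SELECT id, source, message FROM events WHERE id > ?"
    params = [after_id]
    if keyword:
        sql += " AND (message LIKE ? OR source LIKE ?)"
        pattern = f"%{keyword}%"
        params += [pattern, pattern]
    sql += " ORDER BY id LIMIT ?"
    params.append(page_size)
    rows = conn.execute(sql, params).fetchall()
    # A full page may have more results behind it; a short page is the last.
    token = rows[-1][0] if len(rows) == page_size else None
    return rows, token
```

Keyset pagination keeps each page query cheap on the site (an index seek rather than an OFFSET scan), which matters since the query runs locally against the site's SQLite.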
WP-32: Site Runtime — Script Error Handling Integration
Description: Implement script error handling behavior per requirements.
Acceptance Criteria:
- Script failure (unhandled exception, timeout): logged locally to site event log with error details.
- Script not disabled after failure — remains active, fires on next qualifying trigger.
- Script failures not reported to central individually (only aggregated error rate via health report).
- Script compilation errors on deployment reject entire instance deployment — no partial state.
Estimated Complexity: S
Requirements Traced: [4.3-1], [4.3-2], [4.3-3], [CD-SR-4], [CD-SR-5]
WP-33: Site Runtime — Local Artifact Storage
Description: Implement local storage for system-wide artifacts received from central (shared scripts, external system definitions, DB connection definitions, notification lists).
Acceptance Criteria:
- SQLite schema or file storage for: shared scripts, external system definitions, database connection definitions, notification lists.
- Artifacts stored on receipt from central (via Pattern 3: System-Wide Artifact Deployment).
- Shared scripts recompiled on update and new code made available to Script Actors.
- Artifact storage persists across restart.
- Sites are headless — no local UI for artifact management.
Estimated Complexity: M
Requirements Traced: [2.3-1], [2.3-2], [4.5-3]
WP-34: Data Connection Layer — Protocol Extensibility
Description: Ensure the IDataConnection interface allows adding new protocol adapters.
Acceptance Criteria:
- IDataConnection interface defined in Commons (Phase 0 — REQ-COM-2).
- OPC UA adapter and LmxProxy adapter both implement IDataConnection.
- Connection actor instantiates the correct adapter based on data connection protocol type from configuration.
- Adding a new protocol requires only implementing IDataConnection and registering the adapter — no changes to connection actor or Instance Actor.
Estimated Complexity: S
Requirements Traced: [2.4-3]
WP-35: Failover Acceptance Tests
Description: Validate failover behavior for all Phase 3B components.
Acceptance Criteria:
- DCL reconnection after failover: Active node fails, singleton migrates, new Deployment Manager re-creates Instance Actors, DCL re-establishes connections and subscriptions. Values resume flowing.
- Health report continuity: After failover, new active node begins sending health reports with new sequence numbers. Central detects the gap but accepts new reports (sequence number > 0 accepted for a site that was marked offline).
- Stream recovery: Debug stream interrupted on failover. Engineer reopens debug view and gets fresh snapshot + stream.
- Alarm re-evaluation: After failover, alarms start in normal state and re-evaluate from incoming values.
- Script triggers resume: After failover, interval timers restart, value change/conditional triggers re-subscribe.
- Event log continuity: New active node starts fresh event log. Previous active's events available when that node returns.
- Static attribute overrides survive: Instance Actor loads persisted overrides from SQLite after failover. (Covered in Phase 3A but re-verified here with full runtime.)
Estimated Complexity: L
Requirements Traced: [3.4.1-3], [CD-SEL-2], [KDD-data-2], [CD-Comm-6]
Test Strategy
Unit Tests
| Area | Test Scenarios |
|---|---|
| Connection Actor | State machine transitions (Connecting -> Connected -> Reconnecting), stash/unstash behavior, bad quality propagation on disconnect |
| OPC UA Adapter | IDataConnection contract compliance, subscribe/unsubscribe, write |
| LmxProxy Adapter | IDataConnection contract compliance, SessionId management, keep-alive, subscription stream processing |
| Script Actor | Trigger evaluation (interval, value change, conditional), minimum time between runs, concurrent execution |
| Alarm Actor | Condition evaluation (Value Match, Range Violation, Rate of Change), state transitions (normal->active, active->normal), no script on clear |
| Script Runtime API | GetAttribute, SetAttribute (data-connected + static), CallScript, CallShared |
| Script Trust Model | Compilation rejection for forbidden APIs, execution timeout |
| Recursion Limit | Depth tracking, limit enforcement, error logging |
| Health Metric Collector | Counter accumulation, reset after report, dead letter counting |
| Event Logger | Event recording, schema compliance, retention purge, storage cap |
| Event Query | Filter combinations, pagination, keyword search |
| Communication Contracts | Serialization/deserialization, correlation ID propagation |
Integration Tests
| Area | Test Scenarios |
|---|---|
| OPC UA End-to-End | Connect to OPC PLC simulator, subscribe, receive values, write, verify round-trip |
| DCL -> Instance Actor | Tag value updates flow from DCL to Instance Actor, update in-memory state, publish to stream |
| Script Execution | Trigger fires, Script Execution Actor spawns, executes script, reads/writes attributes, returns |
| Alarm Evaluation | Value update triggers alarm, state change published to stream, on-trigger script fires |
| CallScript Chain | Script A calls Script B, recursion depth tracked, return value propagated |
| Shared Script | Instance script calls shared script inline, shared script accesses runtime API |
| Debug View | Subscribe, receive snapshot, stream changes, unsubscribe |
| Health Report | Site sends report, central receives, offline detection after timeout |
| Event Log Query | Central queries site event log, receives paginated results |
| Communication Patterns | All 8 patterns exercised end-to-end |
Negative Tests
| Requirement | Test |
|---|---|
| [4.4-10] Scripts cannot access other instances | Script attempts cross-instance attribute access; verify clear error returned |
| [4.6-4] Instance scripts cannot call alarm scripts | Verify no API path exists for this; attempt to address alarm script from instance script fails |
| [4.5-5] Shared scripts not available on central | Verify shared script library is site-only compilation |
| [2.2-3] No continuous real-time streaming | Verify no background stream runs without debug view open |
| [4.3-2] Script not disabled after failure | Script fails, verify next trigger still fires |
| [4.3-3] Script failures not reported to central | Verify no individual failure message sent; only aggregated rate in health report |
| [3.4.1-3] Alarm state not persisted | Restart, verify all alarms start normal |
| [CD-DCL-11] No S&F for device writes | Verify write failure returned to script, not buffered |
| [CD-HM-7] No alerting | Verify health monitoring is display-only |
| [KDD-code-9] Forbidden APIs | Compile script with System.IO reference; verify compilation fails |
Failover Tests
See WP-35 acceptance criteria above.
Verification Gate
Phase 3B is complete when ALL of the following pass:
- OPC UA integration: Site connects to OPC PLC simulator, subscribes to tags, values flow to Instance Actors, attribute values visible in debug view snapshot.
- Script execution: All three trigger types (interval, value change, conditional) fire correctly. Minimum time between runs enforced. Scripts read/write attributes. CallScript returns values. CallShared executes inline.
- Alarm evaluation: All three condition types (Value Match, Range Violation, Rate of Change) correctly transition alarms. Alarm state changes appear on Akka stream. On-trigger scripts execute. No script on clear.
- Script trust model: Forbidden APIs rejected at compilation. Execution timeout cancels scripts.
- Recursion limit: Call chain depth enforced at configured limit. Error logged.
- Health monitoring: Site sends periodic reports with sequence numbers. Central aggregates, detects offline (60s), detects online recovery. All metric categories populated.
- Event logging: Events recorded for all categories. 30-day retention purge works. 1GB cap enforced. Remote query with pagination and keyword search returns correct results.
- Debug view: Full cycle — subscribe, snapshot, stream changes, unsubscribe.
- Communication: All 8 patterns exercised. Per-pattern timeouts verified. Correlation IDs propagated.
- Failover: WP-35 acceptance tests pass — DCL reconnection, health continuity, stream recovery, alarm re-evaluation, script trigger resume.
- Negative tests: All negative test cases pass (cross-instance access blocked, alarm script call direction enforced, forbidden APIs rejected, etc.).
Open Questions
| # | Question | Context | Impact | Status |
|---|---|---|---|---|
| Q-P3B-1 | What is the exact dedicated blocking I/O dispatcher configuration for Script Execution Actors? | KDD-runtime-3 says "dedicated blocking I/O dispatcher" — need Akka.NET HOCON config (thread pool size, throughput settings). | WP-15. Sensible defaults can be set; tuned in Phase 8. | Deferred — use Akka.NET default blocking-io-dispatcher config; tune during Phase 8 performance testing. |
| Q-P3B-2 | Should LmxProxy adapter expose WriteBatchAndWaitAsync (write-and-poll handshake) through IDataConnection or as a protocol-specific extension? | CD-DCL-5 lists WriteBatchAndWaitAsync but IDataConnection only defines simple Write. | WP-8. Does not block core functionality. | Deferred — expose as protocol-specific extension method; not part of IDataConnection core contract. |
| Q-P3B-3 | What is the Rate of Change alarm evaluation time window? | Section 3.4 says "changes faster than a defined threshold" but does not specify the time window (per-second? per-minute? configurable?). | WP-16. Needs a design decision for the evaluation algorithm. | Deferred — implement as configurable window (default: per-second rate). Document in alarm definition schema. |
| Q-P3B-4 | How does the health report sequence number behave across failover? | Sequence number is monotonic within a singleton lifecycle. After failover, the new singleton starts at 1. Central must handle this. | WP-27, WP-28. Central should accept any report from a site marked offline regardless of sequence number. | Resolved in design — central accepts report when site is offline; for online sites, requires seq > last. On failover, site goes offline first (missed reports), so the reset is naturally handled. |
Split-Section Tracking
Section 4.1 — Script Definitions
- Phase 3B covers: [4.1-5] site compilation, [4.1-6] input parameters (runtime), [4.1-7] return values (runtime), [4.1-8] return value usage (trigger vs. call).
- Phase 2 covers: [4.1-1] C# defined at template level, [4.1-2] inheritance/override/lock, [4.1-3] deployed as flattened config, [4.1-4] first-class template members.
- Union: Complete.
Section 4.4 — Script Capabilities
- Phase 3B covers: [4.4-1] read, [4.4-2] write data-sourced, [4.4-3] write static, [4.4-4] CallScript, [4.4-5] CallShared, [4.4-10] cannot access other instances.
- Phase 7 covers: [4.4-6] ExternalSystem.Call, [4.4-7] CachedCall, [4.4-8] notifications, [4.4-9] Database.Connection.
- Union: Complete.
Section 4.5 — Shared Scripts
- Phase 3B covers: [4.5-1] system-wide library, [4.5-2] parameters/return values, [4.5-3] deployment to sites (site-side reception), [4.5-4] inline execution, [4.5-5] not available on central.
- Phase 2 covers: Model/definition (shared script entity schema).
- Union: Complete.
Section 2.3 — Site-Level Storage & Interface
- Phase 3A covers: [2.3-2] deployed configs, [2.3-3] S&F buffers (schema preparation).
- Phase 3B covers: [2.3-1] headless, [2.3-2] shared scripts/ext sys defs/db conn defs/notification lists storage.
- Phase 3C covers: [2.3-3] S&F buffer persistence and replication.
- Union: Complete.
Section 8.1 — Debug View
- Phase 3B covers: [8.1-1] through [8.1-8] — all backend/site-side debug view infrastructure.
- Phase 6 covers: Central UI rendering of debug view.
- Union: Complete (backend vs. UI split).
Section 12.3 — Central Access to Event Logs
- Phase 3B covers: [12.3-1] backend query mechanism (site-side query processing, communication pattern).
- Phase 6 covers: Central UI Event Log Viewer rendering.
- Union: Complete.
Section 4.3 — Script Error Handling
- Phase 3B covers: [4.3-1], [4.3-2], [4.3-3] (all core error handling).
- Phase 7 covers: [4.3-4] external system call failure S&F interaction (depends on S&F integration).
- Union: Complete.
Orphan Check Result
Forward Check (Requirements -> Work Packages)
Every item in the Requirements Checklist and Design Constraints Checklist was walked. Results:
- Requirements Checklist: All 79 requirement bullets map to at least one work package with acceptance criteria.
- Design Constraints Checklist: All 47 design constraint items map to at least one work package with acceptance criteria.
- No orphaned requirements or constraints found.
Note: [2.5-1], [2.5-2], [2.5-3] are context-only items that inform design decisions in this phase but are formally validated in Phase 8. They are referenced in WP-15 (staggered startup batch sizing consideration) and WP-14 (subscription management design).
Reverse Check (Work Packages -> Requirements)
Every work package was walked. Results:
- All 35 work packages trace back to at least one requirement bullet or design constraint.
- No untraceable work packages found.
Split-Section Check
All 7 split sections verified. The union of bullets across phases equals the complete section for each. No gaps found.
Negative Requirement Check
All negative requirements have explicit test cases in the Test Strategy:
| Negative Requirement | Test Location |
|---|---|
| [4.4-10] Cannot access other instances | Negative Tests table |
| [4.6-4] Instance scripts cannot call alarm scripts | Negative Tests table |
| [4.5-5] Shared scripts not available on central | Negative Tests table |
| [2.2-3] No continuous real-time streaming | Negative Tests table |
| [4.3-2] Script not disabled after failure | Negative Tests table |
| [4.3-3] Failures not reported to central | Negative Tests table |
| [3.4.1-3] Alarm state not persisted | Negative Tests table |
| [CD-DCL-11] No S&F for device writes | Negative Tests table |
| [CD-HM-7] No alerting | Negative Tests table |
| [KDD-code-9] Forbidden APIs | Negative Tests table |
| [3.4.1-2] No acknowledgment workflow | Covered by WP-16 acceptance criteria |
All negative requirements have acceptance criteria that would catch violations.
Verification Status
- Forward check: PASS
- Reverse check: PASS
- Split-section check: PASS
- Negative requirement check: PASS
External Verification (Codex MCP)
Model: gpt-5.4 Date: 2026-03-16
Step 1 — Requirements Coverage Review
Codex received work package titles (not full acceptance criteria due to prompt size constraints) and identified 12 findings. Analysis:
| # | Finding | Disposition |
|---|---|---|
| 1 | [2.3-3] S&F buffer persistence listed in checklist but no WP covers it | Valid — clarified as Phase 3C scope. [2.3-3] annotation updated to note split-section reference only. |
| 2 | Script Runtime API missing ExternalSystem/Notify/Database | False positive — plan explicitly assigns [4.4-6] through [4.4-9] to Phase 7. WP-18 covers only the Phase 3B portion (read/write/CallScript/CallShared). Scope table says "core operations." |
| 3 | Static attribute SQLite persistence not verified for restart/failover/redeploy | False positive — WP-18 acceptance criteria explicitly state "persists to SQLite, survives restart/failover, resets on redeployment." WP-35 re-verifies with full runtime. |
| 4 | System-wide artifact explicit deployment behavior uncovered | False positive — WP-33 covers artifact storage on receipt. Deployment trigger mechanism is Phase 3C (Deployment Manager). WP-4 Pattern 3 covers the communication pattern. |
| 5 | Staggered startup missing | False positive — staggered startup is Phase 3A (listed in prerequisites table). |
| 6 | Blocking I/O dispatcher and supervision strategy uncovered | False positive — WP-15 acceptance criteria: "runs on dedicated blocking I/O dispatcher" and "Supervision: Script Actor resumed, Script Execution Actor stopped." WP-16 has same for Alarm Actors. |
| 7 | Per-subscriber buffering uncovered in WP-23 | False positive — WP-23 acceptance criteria explicitly cover: "Per-subscriber bounded buffers. Each subscriber gets independent buffer. Slow subscriber: buffer fills, oldest events dropped." |
| 8 | Tag resolution counts and dead letter count missing | False positive — WP-26 acceptance criteria include both. WP-13 covers tag resolution counts from DCL side. |
| 9 | UTC timestamps not covered | False positive — UTC is a Phase 0 convention (KDD-data-6). Message contracts in WP-1 specify "All timestamps in message contracts are UTC." Health report in WP-27 specifies "UTC from site clock." |
| 10 | Event log schema and active-node behavior uncovered | False positive — WP-29 acceptance criteria list full schema and "Only active node generates and stores events. Event logs not replicated to standby." |
| 11 | Remote query filters/pagination details uncovered | False positive — WP-31 acceptance criteria list all filter types, "default 500 events," and "continuation token." |
| 12 | LmxProxy details uncovered in WP-8 | False positive — WP-8 acceptance criteria explicitly cover port, API key, SessionId, keep-alive, TLS, batch ops, Polly retry. |
Step 2 — Negative Requirement Review
Codex did not raise concerns about negative requirements (not included in abbreviated submission). Self-review confirms all 11 negative requirements have explicit test cases in the Negative Tests table.
Step 3 — Split-Section Gap Review
Not submitted separately. Self-review in Split-Section Tracking section confirms all 7 split sections have complete unions.
Outcome
Pass with 1 correction — [2.3-3] annotation clarified as Phase 3C scope reference. All other findings were false positives caused by Codex receiving only work package titles rather than full acceptance criteria.