Fix ScadaBridge accuracy per design repo review

Corrections:
- Notifications: email only, not Teams. Design repo documents SMTP/OAuth2
  email only; Teams was incorrectly claimed. Corrected in current-state.md
  and legacy-integrations.md (LEG-003).
- EventHub/Kafka forwarding: committed but not yet implemented. Clarified
  as a Year 1 ScadaBridge Extensions deliverable, not an existing capability.

Additions from design repo (previously undocumented):
- Dual transport (Akka.NET ClusterClient + gRPC server-streaming)
- Split-brain resolver (keep-oldest, 15s stability, ~25s failover)
- Staggered batch startup (20 instances at a time)
- Central UI: Blazor Server with LDAP/AD, JWT sessions, SignalR debug
- Comprehensive synchronous audit logging (JSON after-state)
- Three-phase deployment process with rollback
- Site-level SQLite (flattened config, not full SQL Server)
- Supervision detail: OneForOneStrategy, Resume/Stop per actor type
Joseph Doherty
2026-04-17 09:30:22 -04:00
parent fc3e19fde1
commit 8428b7c186
2 changed files with 13 additions and 7 deletions


@@ -80,17 +80,23 @@ SCADA responsibilities are split across two platforms by purpose:
- _TBD — serialization format on the wire, push mechanism (pull vs push), conflict/version handling if a site is offline during an update, audit trail of template changes._
- **Secure Web API (inbound)** — external systems can interface with ScadaBridge over an authenticated Web API. Authentication is handled via **API keys** — clients present a static, per-client API key on each call. _TBD — key issuance and rotation process, storage at the client side, scoping (per client vs per capability), revocation process, audit trail of key usage._
- **Web API client (outbound) — pre-configured, script-callable.** ScadaBridge provides a generic outbound Web API client capability: **any Web API can be pre-configured** (endpoint URL, credentials, headers, auth scheme, etc.) and then **called easily from scripts** using the configured name. There is no hard-coded list of "known" external Web APIs — the set of callable APIs is whatever is configured today, and new APIs can be added without ScadaBridge code changes.
- - **Notifications — contact-list driven, transport-agnostic.** ScadaBridge maintains **contact lists** (named groups of recipients) as a first-class concept. Scripts send notifications **to a contact list**; ScadaBridge handles the delivery over the appropriate transport (**email** or **Microsoft Teams**) based on how the contacts are configured. Scripts do not care about the transport — they call a single "notify" capability against a named contact list, and routing/fan-out happens inside ScadaBridge. New contact lists and new recipients can be added without script changes.
+ - **Notifications — contact-list driven.** ScadaBridge maintains **contact lists** (named groups of recipients) as a first-class concept. Scripts send notifications **to a contact list**; ScadaBridge handles the delivery. **Supported transport today: email only** (SMTP-based with OAuth2 Client Credentials for modern providers; plain-text message bodies, no HTML). Scripts do not care about the transport — they call a single "notify" capability against a named contact list, and routing/fan-out happens inside ScadaBridge. New contact lists and new recipients can be added without script changes. _Note: an earlier version of this plan claimed Microsoft Teams support — the ScadaBridge design repo does not document Teams as an implemented transport. If Teams support is needed, it would be a ScadaBridge extension, not an existing capability._
- **Database writes** — can write to databases as a sink for collected/processed data. **Supported target today: SQL Server only** — other databases (PostgreSQL, Oracle, etc.) are not currently supported. _TBD — whether a generic ADO.NET / ODBC path is planned to broaden support, or whether SQL Server is intentionally the only target._
- **Equipment writes via OPC UA** — can write back to equipment over OPC UA (not just read).
- - **EventHub forwarding** — can forward events to an **EventHub** (Kafka-compatible) for async downstream consumers.
+ - **EventHub forwarding (committed, not yet implemented)** — designed to forward events to an **EventHub** (Kafka-compatible) for async downstream consumers. The design is committed and the architecture accommodates it, but the implementation does not exist yet in the ScadaBridge codebase. The Year 1 ScadaBridge Extensions workstream in [`roadmap.md`](roadmap.md) includes this as the "EventHub producer" deliverable.
- **Store-and-forward (per-call, optional)** — Web API calls, notifications, and database interactions can **optionally** be cached in a **store-and-forward** queue on a **per-call basis**. If the downstream target is unreachable (WAN outage, target down), the call is persisted locally and replayed when connectivity returns — preserving site resilience without forcing every integration to be async.
- **Deployment topology:** runs as **2-node Akka.NET clusters**, **co-located on the existing Aveva System Platform cluster nodes** (no dedicated hardware — shares the same physical/virtual nodes that host System Platform).
- **Benchmarked throughput (OPC UA ingestion ceiling):** a single 2-node site cluster has been **benchmarked** to handle **~225,000 OPC UA updates per second** at the ingestion layer. This is the **input rate ceiling**, not the downstream work rate — triggered events, script executions, Web API calls, DB writes, EventHub forwards, and notifications all happen on a filtered subset of those updates and run at significantly lower rates. _TBD — actual production load per site (typically far below this ceiling), downstream work-rate profile (what percent of ingested updates trigger work), whether the benchmark was sustained or peak, and the co-located System Platform node headroom at benchmark load._
+ - **Dual transport (Akka.NET + gRPC):** cluster-internal communication uses two channels — **Akka.NET ClusterClient** for command/control and **gRPC server-streaming** for real-time data. This is a design-repo-documented architectural detail, not visible to external consumers.
- **Supervision model (Akka.NET):** ScadaBridge uses Akka.NET supervision to self-heal around transient failures. Concretely:
- **OPC UA connection restarts.** When an OPC UA source disconnects, returns malformed data, or stalls, ScadaBridge **restarts the connection to that source** rather than letting the failure propagate up. Individual source failures are isolated from each other.
- - **Actor tree restarts on failure.** When a failure escalates beyond a single connection (e.g., a faulty script or a downstream integration wedged in an unrecoverable state), ScadaBridge can **restart the affected actor tree**, bringing its children back to a known-good state without taking the whole cluster down.
- - _TBD — specific supervision strategies per actor tier (OneForOne vs AllForOne, restart limits, backoff), what failures escalate to cluster-level rather than tree-level, and how recurring script failures are throttled/quarantined rather than restart-looped._
+ - **Actor tree restarts on failure.** When a failure escalates beyond a single connection (e.g., a faulty script or a downstream integration wedged in an unrecoverable state), ScadaBridge can **restart the affected actor tree**, bringing its children back to a known-good state without taking the whole cluster down. The Deployment Manager supervises Instance Actors with a **OneForOneStrategy**; Script/Alarm Actors use **Resume** (preserves coordinator state); Execution Actors use **Stop**.
+ - **Split-brain resolver: keep-oldest strategy** with a 15-second stability window and `down-if-alone = on` safeguards. Approximate failover time: **~25 seconds** (detection + stabilization).
+ - **Staggered batch startup** — the Deployment Manager creates Instance Actors in staggered batches (e.g., 20 at a time) to avoid overwhelming OPC UA servers and network capacity during cluster start or restart.
+ - **Central UI: Blazor Server.** Full web-based authoring environment for templates, scripts, external system definitions, and deployment workflows. Role-based access with three levels (**Admin**, **Design**, **Deployment**). **LDAP / Active Directory integration** with JWT session management (15-minute expiry, sliding refresh, HttpOnly/Secure cookies). Includes **real-time debug streaming** via SignalR WebSocket for live attribute values and alarm state changes.
+ - **Audit logging.** Comprehensive synchronous audit entries within the same database transaction as changes. After-state serialized as JSON for full change-history reconstruction.
+ - **Three-phase deployment process.** Central nodes first, database validation required, staggered site deployment, rollback strategy with retained binaries and backups.
+ - **Site-level configuration: SQLite.** Sites receive only **flattened configurations** and read from a **local SQLite** database, not a full SQL Server instance. The central SQL Server configuration database is the source of truth; sites consume a read-only serialized subset.
- **Downstream consumers / integration targets in production today:**
- **Aveva System Platform — via LmxOpcUa.** ScadaBridge interacts with System Platform through the in-house **LmxOpcUa** server rather than a direct System Platform API. LmxOpcUa exposes System Platform objects over OPC UA; ScadaBridge reads from and writes to System Platform through that OPC UA surface. This is the primary OT-side consumer.
- **Internal Web APIs.** ScadaBridge makes outbound calls to internal enterprise Web APIs using its generic pre-configured Web API client capability (see Capabilities above). Because any Web API can be configured dynamically, there is no fixed enumeration of "ScadaBridge's Web API integrations" to capture here; specific IT↔OT Web API crossings land in the legacy integrations inventory (`current-state/legacy-integrations.md`) regardless of whether they're reached via ScadaBridge's generic client.
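The split-brain behaviour described above can be sketched in code. The following is a minimal, illustrative Python model of a keep-oldest + `down-if-alone` decision as seen from one side of a network partition — the function name, tuple shape, and return values are assumptions for illustration, not ScadaBridge's API; only the 15-second stability window comes from the design repo:

```python
def keep_oldest_decision(reachable, unreachable, stability_elapsed,
                         stability_window=15.0, down_if_alone=True):
    """Decide this partition's fate under a keep-oldest split-brain resolver.

    reachable / unreachable: lists of (node_name, up_since) tuples as seen
    from this side of the partition. Returns "survive", "down-self", or
    None while the stability window has not yet elapsed.
    Illustrative sketch only -- not ScadaBridge code.
    """
    if stability_elapsed < stability_window:
        return None  # membership view not yet stable; defer the decision
    oldest = min(reachable + unreachable, key=lambda m: m[1])
    if oldest in reachable:
        # Oldest node is on our side: we keep the cluster, unless the
        # oldest is completely alone and down-if-alone is on.
        if down_if_alone and len(reachable) == 1 and unreachable:
            return "down-self"
        return "survive"
    # Oldest is on the other side: normally we down ourselves, but if the
    # oldest is alone over there, down-if-alone lets the majority survive.
    if down_if_alone and unreachable == [oldest]:
        return "survive"
    return "down-self"
```

On a 2-node site cluster this resolves a clean partition in roughly detection time plus the stability window, which is consistent with the ~25-second failover figure above.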
@@ -102,7 +108,7 @@ SCADA responsibilities are split across two platforms by purpose:
- **Direct access** — site-level ScadaBridge clusters can also be reached directly (not only via the hub), enabling point-to-point integration where appropriate.
- **Data locality (design principle):** ScadaBridge is designed to **keep local data sources localized** — equipment at a site communicates with the **local ScadaBridge instance** at that site, not with the central cluster. This minimizes cross-site/WAN traffic, reduces latency, and keeps site operations resilient to WAN outages.
- **Deployment status:** ScadaBridge is **already deployed** across the current cluster footprint. However, **not all legacy API integrations have been migrated onto it yet** — some older point-to-point integrations still run outside ScadaBridge and need to be ported. The authoritative inventory of these integrations (and their retirement tracking against `goal-state.md` pillar 3) lives in [`current-state/legacy-integrations.md`](current-state/legacy-integrations.md).
- - _TBD — scripting language/runtime, template format, Web API auth model, supported DB targets, supervision model, throughput, downstream consumers, resource impact of co-location with System Platform._
+ - _TBD — resource impact of co-location with System Platform at the largest sites; whether any additional downstream consumers exist beyond those listed above; whether the notification capability will be extended to support Microsoft Teams (not currently implemented)._
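The per-call store-and-forward capability described earlier (combined with the site-level SQLite pattern) can be sketched as follows — a hypothetical Python model using the stdlib `sqlite3`; the class name, schema, and `store_and_forward` flag are assumptions for illustration, not ScadaBridge's actual implementation:

```python
import json
import sqlite3


class StoreAndForwardQueue:
    """Illustrative per-call store-and-forward queue backed by local SQLite.

    A call that opts in via store_and_forward=True is persisted if delivery
    fails and replayed in order once the target is reachable again.
    """

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT,"
            "target TEXT NOT NULL, payload TEXT NOT NULL)"
        )

    def send(self, target, payload, deliver, store_and_forward=False):
        """Attempt delivery now; queue on failure only if opted in."""
        try:
            deliver(target, payload)
            return True
        except ConnectionError:
            if store_and_forward:  # only opted-in calls are queued
                self.db.execute(
                    "INSERT INTO pending (target, payload) VALUES (?, ?)",
                    (target, json.dumps(payload)),
                )
                self.db.commit()
            return False

    def replay(self, deliver):
        """Replay queued calls in arrival order; stop on first failure."""
        delivered = 0
        for row_id, target, payload in self.db.execute(
            "SELECT id, target, payload FROM pending ORDER BY id"
        ).fetchall():
            try:
                deliver(target, json.loads(payload))
            except ConnectionError:
                break  # target still unreachable; retry on next replay
            self.db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            delivered += 1
        self.db.commit()
        return delivered
```

The per-call opt-in mirrors the design intent above: integrations that must stay synchronous fail fast, while resilient ones ride out WAN outages from local storage.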
### LmxOpcUa (in-house)
- **What:** in-house **OPC UA server** with **tight integration to Aveva System Platform**.