Finalize digital-twin scope as two ACL-based patterns

Plan's digital-twin scope is now exactly (1) environment-lifecycle
promotion via ACL flip on write authority and (2) safe read-only
consumption for KPI / monitoring systems — both delivered by
already-committed architecture. Removes the meeting-prep brief and the
management-delivered use-cases source document; canonical model and
state vocabulary stand as pillar-2 work on their own.
Joseph Doherty
2026-04-24 14:13:10 -04:00
parent c8d7bf37de
commit 22a86974f6
11 changed files with 59 additions and 303 deletions


@@ -234,14 +234,14 @@ Two projection flavors are possible, not mutually exclusive:
**Decision trigger for building a projection service:** when a specific consumer (vendor tool, COTS HMI, analytics product, new initiative) requires a classic UNS surface and the cost of writing a Kafka client for that consumer exceeds the cost of operating the projection layer for the rest of the consumer's lifetime. Until that trigger is hit, the canonical model + Redpanda **is** the UNS and consumers reach it directly.
This mirrors the treatment of OtOpcUa's future `simulated` namespace and the Digital Twin Use Case 2 simulation-lite foundation: the architecture supports the addition; the plan does not commit the build until a specific need justifies it.
This mirrors the treatment of OtOpcUa's future `simulated` namespace: the architecture supports the addition; the plan does not commit the build until a specific need justifies it.
#### What the UNS framing does and does not change
**Changes:**
- Stakeholders who ask "do we have a UNS?" get a direct "yes — composed of OtOpcUa + Redpanda + `schemas` repo + dbt" answer instead of "we have a canonical model but we didn't use that word."
- Digital Twin Use Cases 1 and 3 (see **Strategic Considerations → Digital twin**) — which are functionally UNS use cases in another vocabulary — now have a second name and a second stakeholder audience.
- The **canonical machine state vocabulary** and **canonical equipment/production/event model declaration** (see **Async Event Backbone → Canonical Equipment, Production, and Event Model**) — which are functionally UNS deliverables in another vocabulary — now have a second name and a second stakeholder audience.
- A future projection service is pre-legitimized as a small optional addition, not a parallel or competing initiative.
- Vendor conversations that assume "UNS" means a specific MQTT broker purchase can be reframed: the plan delivers the UNS value proposition via different transport; the vendor's MQTT expectations become a projection-layer concern, not a core-architecture concern.
@@ -411,7 +411,7 @@ _TBD — service name (working title only); hosting (South Bend, alongside Redpa
1. **Equipment namespace (raw data).** Live values read from equipment via native OPC UA or native device protocols (Modbus, Ethernet/IP, Siemens S7, etc.) translated to OPC UA. This is the new capability the plan introduces — what the "layer 2 — raw data" role in the layered architecture describes.
2. **System Platform namespace (processed data tap).** The former **LmxOpcUa** functionality, folded in. Exposes Aveva System Platform objects (via the local App Server's LMX API) as OPC UA so that OPC UA-native consumers can read processed data through the same endpoint they use for raw equipment data.
**Namespace model is extensible — future "simulated" namespace supported architecturally, not committed for build.** The two-namespace design is not a hard cap. A future **`simulated` namespace** could expose synthetic or replayed equipment data to consumers, letting tier-1 / tier-2 consumers (ScadaBridge, Ignition, System Platform IO) be exercised against real-shaped-but-offline data streams without physical equipment. This is the **OtOpcUa-side foundation for Digital Twin Use Case 2** (Virtual Testing / Simulation — see **Strategic Considerations → Digital twin**). The plan **does not commit to building** a simulated namespace in the 3-year scope; it commits that the namespace architecture can accommodate one when a specific testing need justifies it, without reshaping OtOpcUa. The complementary foundation (historical event replay) lives in the Redpanda layer — see **Async Event Backbone → Usage patterns → Historical replay**.
**Namespace model is extensible — future "simulated" namespace supported architecturally, not committed for build.** The two-namespace design is not a hard cap. A future **`simulated` namespace** could expose synthetic or replayed equipment data to consumers, letting tier-1 / tier-2 consumers (ScadaBridge, Ignition, System Platform IO) be exercised against real-shaped-but-offline data streams without physical equipment — primarily useful for the **pre-install case** (dev work against a piece of equipment that is not yet physically on the floor). The plan **does not commit to building** a simulated namespace in the 3-year scope; it commits that the namespace architecture can accommodate one when a specific testing need justifies it, without reshaping OtOpcUa. The complementary foundation (historical event replay) lives in the Redpanda layer — see **Async Event Backbone → Usage patterns → Historical replay**. Note: once equipment is physically present, the two access-control patterns that are this plan's digital-twin scope (environment-lifecycle promotion without reconfiguration, and safe read-only KPI/monitoring exposure) are delivered by the **ACL model below**, not by the `simulated` namespace. See **Consumer access patterns enabled by the ACL model** further down, and **Strategic Considerations → Digital twin**.
**LmxOpcUa is absorbed into OtOpcUa, not replaced by a separate component.** The existing LmxOpcUa software and deployment pattern (per-node service on every System Platform node) evolves into OtOpcUa. Consumers that previously pointed at LmxOpcUa for System Platform data and at "nothing yet" for equipment data now point at OtOpcUa and see both in its namespace. There is not a second OPC UA server running alongside.
@@ -475,6 +475,13 @@ _TBD — service name (working title only); hosting (South Bend, alongside Redpa
- **Phasing:** Phase 1 ships the schema + Admin UI + evaluator unit tests; per-driver enforcement lands in each driver's phase (Phase 2+). **Phase 1 completes before any driver phase**, so the ACL model exists in the central config DB before any driver consumes it — satisfying the "must be working before Tier 1 cutover" timing constraint.
- _TBD — specific OPC UA security mode + profile combinations required vs allowed; where UserName credentials/certs are sourced from (local site directory, a per-site credential vault, AD/LDAP); rotation cadence; audit trail of authz decisions._
**Consumer access patterns enabled by the ACL model (plan's digital-twin scope).** Two consumer patterns fall directly out of the ACL model + single-connection-per-equipment design and are worth naming explicitly, because they are this plan's full digital-twin scope (see **Strategic Considerations → Digital twin**):
- **Environment-lifecycle promotion without reconfiguration.** Dev / QA / Prod System Platform instances each authenticate to OtOpcUa as a distinct identity (e.g., `sp-dev`, `sp-qa`, `sp-prod`). Promotion of a piece of equipment from dev → qa → prod is an ACL change that moves the **single write-holder grant** for that equipment UUID from one identity to the next; the OPC UA session itself — which OtOpcUa owns — is configured once and never torn down or rebuilt. Read grants can stay broad (all three environment identities observe continuously); only write authority is single-assignee and mobile. This replaces today's pattern of disabling the direct System Platform connection on the dev box, then re-creating the same connection on the qa box, then again on the prod box.
- **Safe read-only consumption for KPI / monitoring systems.** Ignition KPI views, Power BI dashboards, observability / monitoring consumers, and any future read-only analytics consumer authenticate to OtOpcUa as identities with read-only grants. Because OtOpcUa owns the single OPC UA session to each piece of equipment, there is **no write path available to these consumers at all** — the guarantee is structural (there is no equipment-side session for a read-only consumer to misuse) rather than procedural (relying on the consumer's own code to not issue writes). This materially reduces the risk of adding new KPI / monitoring consumers to the estate.
**Out of scope for this plan:** how OtOpcUa arbitrates write-authority moves between environments — e.g., an Admin UI switch, a PR-merge on the `schemas` repo, a release-pipeline step, or some combination. That mechanism is the OtOpcUa team's implementation decision. What the plan commits to is the architectural substrate (stable equipment UUID, single OPC UA session per equipment owned by OtOpcUa, read-vs-write-distinguishing ACL model) that makes both patterns above possible.
**Open questions (TBD).**
- **Driver coverage.** Which equipment protocols need to be bridged to OPC UA beyond native OPC UA equipment — this is where product-driven decisions matter most.
- **Rollout posture: build and deploy the cluster software to every site ASAP.** The cluster software (server + core driver library) is built and rolled out to **every site's System Platform nodes as fast as practical** — deployment to all sites is treated as a prerequisite for the rest of the OT plan, not a gradual per-site effort. "Deployment" here means installing and configuring the cluster software at each site so the node is ready to front equipment; it does **not** mean immediately migrating consumers (that follows the tiered cutover below). A deployed but inactive cluster is cheap; what's expensive is delaying deployment and then trying to do it site-by-site on the critical path of every other workstream.
@@ -564,14 +571,14 @@ _TBD — service name (working title only); hosting (South Bend, alongside Redpa
- **Async event notifications** — shopfloor events (state changes, alarms, lifecycle events, etc.) published to EventHub for any interested consumer to subscribe to, without producers needing to know who's listening.
- **Async processing for KPI** — KPI calculations (currently handled on Ignition SCADA) can consume event streams from EventHub, enabling decoupled, replayable KPI pipelines instead of tightly coupled point queries.
- **System integrations** — other enterprise systems (Camstar, Snowflake, future consumers) integrate by subscribing to EventHub topics rather than opening point-to-point connections into OT.
- **Historical replay for integration testing and simulation-lite.** The `analytics`-tier retention (30 days) is explicitly also a **replay surface** for testing and simulation-lite: downstream consumers (ScadaBridge scripts, KPI pipelines, dbt models, a future digital twin layer) can be exercised against real historical event streams instead of synthetic data. This is the minimal answer to **Digital Twin Use Case 2 (Virtual Testing / Simulation)** — see **Strategic Considerations → Digital twin** → use case 2 — and does not require any new component. When longer horizons are needed, extend to the `compliance` tier (90 days). Replay windows beyond 90 days are served by the dbt curated layer in Snowflake, not by Redpanda.
- **Historical replay for integration testing.** The `analytics`-tier retention (30 days) is explicitly also a **replay surface** for testing: downstream consumers (ScadaBridge scripts, KPI pipelines, dbt models, or any future consumer that needs to re-run historical windows) can be exercised against real historical event streams instead of synthetic data. Does not require any new component. When longer horizons are needed, extend to the `compliance` tier (90 days). Replay windows beyond 90 days are served by the dbt curated layer in Snowflake, not by Redpanda. **Note:** if a funded physics-simulation / FAT initiative ever materializes, this replay surface is one of the foundations it can consume — but such an initiative is out of the 3-year scope of this plan.
- _Remaining open items are tracked inline in the subsections above — sizing, read-path implications, long-outage planning, IdP selection, schema subject/versioning details, etc. Support staffing and on-call ownership are out of scope for this plan._
#### Canonical Equipment, Production, and Event Model
The plan already delivers the infrastructure for a cross-system canonical model — OtOpcUa's equipment namespace, Redpanda's `{domain}.{entity}.{event-type}` topic taxonomy, Protobuf schemas in the central `schemas` repo, and the dbt curated layer in Snowflake. What it had not, until now, explicitly committed to is **declaring** that these pieces together constitute the enterprise's canonical equipment / production / event model, and that consumers are entitled to treat them as an integration interface.
This subsection makes that declaration. It is the plan's answer to **Digital Twin Use Cases 1 and 3** (see **Strategic Considerations → Digital twin**) and — independent of digital twin framing — is load-bearing for pillar 2 (analytics/AI enablement) because a canonical model is what makes "not possible before" cross-domain analytics possible at all.
This subsection makes that declaration. It is load-bearing for pillar 2 (analytics/AI enablement) because a canonical model is what makes "not possible before" cross-domain analytics possible at all.
> **Schemas-repo dependency — partially resolved.** The OtOpcUa team has contributed an initial seed at [`schemas/`](../schemas/) (temporary location in the 3-year-plan repo until the dedicated `schemas` repo is created — Gitea push-to-create is disabled). The seed includes:
> - JSON Schema format definitions (`format/equipment-class.schema.json` with an `extends` field for class inheritance, `format/tag-definition.schema.json`, `format/uns-subtree.schema.json`)
@@ -613,7 +620,7 @@ Consumers that need to know "what does a `Faulted` state mean" or "what are all
##### Canonical machine state vocabulary
The plan commits to a **single authoritative set of machine state values** used consistently across layer-3 state derivations, Redpanda event payloads, and dbt curated views. This is the answer to Digital Twin Use Case 1.
The plan commits to a **single authoritative set of machine state values** used consistently across layer-3 state derivations, Redpanda event payloads, and dbt curated views.
Starting set (subject to refinement during implementation, but the names and semantics below are committed as the baseline):
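A state derivation that collapses raw interlock and status bits into one of the canonical values can be sketched as a precedence ladder. This is illustrative only: the input bit names and the precedence order (fault first, then flow constraints, then run/idle) are assumptions, not the plan's committed layer-3 derivation logic:

```python
def derive_state(faulted: bool, cycling: bool,
                 downstream_full: bool, upstream_empty: bool) -> str:
    """Collapse raw status bits into a single authoritative canonical state.

    Precedence (assumed): a fault masks everything else; Blocked/Starved
    describe flow constraints on an otherwise healthy machine; Running vs
    Idle is decided last from the cycling bit.
    """
    if faulted:
        return "Faulted"
    if downstream_full:
        return "Blocked"
    if upstream_empty:
        return "Starved"
    return "Running" if cycling else "Idle"
```

Whatever the final derivation rules are, the commitment above is that its output range is exactly the published vocabulary, on every surface.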
@@ -741,58 +748,42 @@ _TBD — named owners for each pillar's criterion; quarterly progress metrics (e
External strategic asks that are **not** part of this plan's three pillars but that the plan should be *shaped to serve* when they materialize. None of these commit the plan to deliver anything — they are constraints on how components are built so that future adjacent initiatives can consume them.
### Digital twin (management ask — use cases received 2026-04-15)
### Digital twin (scope: two access-control patterns)
**Status: management has delivered the requirements; the plan absorbs two of the three use cases and treats the third as exploratory.** The plan does not add a new "digital twin workstream" to `roadmap.md`, and no pillar criterion depends on a digital twin deliverable. What the plan does is **commit to the pieces** that management's three use cases actually require, as additions to existing components rather than as a parallel initiative. See [`goal-state/digital-twin-management-brief.md`](goal-state/digital-twin-management-brief.md) → "Outcome" for the meeting resolution.
**Status: scope is definitive as of 2026-04-24.** This plan's digital-twin scope is exactly **two access-control patterns**, both delivered by architecture already committed elsewhere in this document. No new component, no new workstream, no pillar criterion depends on a digital-twin deliverable. Anything else stakeholders may call "digital twin" (physics simulation, FAT / commissioning emulation, 3D visualization, genealogy tracking, predictive-maintenance AI) is explicitly **not** in the plan's digital-twin scope — it either belongs to an adjacent initiative, a different pillar, or a separately funded future effort.
#### Management-provided use cases
#### The two patterns
These are the **only requirements** management can provide — high-level framing, no product selection, no sponsor, no timeline beyond "directionally, this is what we want." Captured here verbatim in intent; the source document lives at [`../digital_twin_usecases.md.txt`](../digital_twin_usecases.md.txt) in its original form.
1. **Environment-lifecycle promotion without reconfiguration.** Dev / QA / Prod System Platform instances all consume equipment through OtOpcUa; promotion of a piece of equipment from dev → qa → prod is an ACL flip that moves the single write-holder grant from `sp-dev` → `sp-qa` → `sp-prod` against the same equipment UUID. The connection is configured once; only write authority moves. Replaces today's disable-dev / enable-qa / re-create-connection pattern, and eliminates the stomping-on-each-other risk when multiple environments would otherwise each need write access to the single physical equipment.
2. **Safe read-only consumption for KPI / monitoring systems.** Ignition KPI views, Power BI dashboards, observability / monitoring consumers, and any future read-only analytics consumer get read-only access to canonical equipment streams with **zero write path to physical equipment**. The guarantee is structural (single OPC UA session per piece of equipment, owned by OtOpcUa — no equipment-side session exists for a read-only consumer to misuse) rather than procedural (relying on the consumer's own code to not issue writes). Materially reduces the risk of adding new KPI / monitoring consumers to the estate.
1. **Standardized Equipment State / Metadata Model.** A consistent, high-level representation of machine state derived from raw signals: Running / Idle / Faulted / Starved / Blocked. Normalized across equipment types. Single authoritative machine state, derived from multiple interlocks and status bits. Actual-vs-theoretical cycle time. Top-fault instead of dozens of raw alarms. Value: single consistent view of equipment behavior, reduced downstream complexity, improved KPI accuracy (OEE, downtime).
2. **Virtual Testing / Simulation (FAT, Integration, Validation).** A digital representation of equipment that emulates signals, states, and sequences, so automation logic / workflows / integrations can be tested without physical machines. Replay of historical scenarios, synthetic scenarios, edge-case coverage. Value: earlier testing, reduced commissioning time and risk, improved deployed-system stability.
3. **Cross-System Data Normalization / Canonical Model.** A common semantic layer between systems: standardized data structures for equipment, production, and events. Translates system-specific formats into a unified model. Consistent interface for all consumers. Uniform event definitions (`machine fault`, `job complete`). Value: simplified integration, reduced duplication of transformation logic, improved consistency across the enterprise.
**Implementation surface:** both patterns live in **OtOpcUa → Consumer access patterns enabled by the ACL model**. The structural substrate that makes them possible is: (a) the canonical model's stable equipment UUIDs, (b) the single OPC UA session per piece of equipment owned by OtOpcUa, (c) the read-vs-write-distinguishing ACL model committed in OtOpcUa v2.
Management's own framing of the combined outcome: "a translator (raw signals → meaningful state), a simulator (test without physical dependency), and a standard interface (consistent data across systems)."
#### Design constraints for future adjacent initiatives
#### Plan mapping — what each use case costs this plan
If a later adjacent initiative builds something stakeholders want to call "digital twin" on top of this plan's foundation (physics simulation, 3D visualization, a twin product surface), these constraints apply — they are already committed plan decisions, restated here so adjacent initiatives consume this plan cleanly:
| # | Use case | Maps to existing plan components | Delta this plan commits to |
|---|---|---|---|
| 1 | Standardized equipment state model | Layer 3 (Aveva System Platform + Ignition state derivation) for real-time; dbt curated layer for historical; Redpanda event schemas for event-level state transitions | **Canonical machine state vocabulary.** Adopt `Running / Idle / Faulted / Starved / Blocked` (plus any additions agreed during implementation) as the **authoritative state set** across layer-3 derivations, Redpanda event payloads, and dbt curated views. No new component — commitment is that every surface uses the same state values, and the vocabulary is published in the central `schemas` repo. See **Async Event Backbone → Canonical Equipment, Production, and Event Model.** |
| 2 | Virtual testing / simulation | Not served today by the plan, and not going to be served by a full simulation stack. | **Simulation-lite via replay.** Redpanda's analytics-tier retention (30 days) already enables historical event replay to exercise downstream consumers. OtOpcUa's namespace architecture can in principle host a future "simulated" namespace that replays historical equipment data to exercise tier-1 and tier-2 consumers — architecturally supported, not committed for build in this plan. **Full commissioning-grade simulation stays out of scope** pending a separate funded initiative. |
| 3 | Cross-system canonical model | OtOpcUa equipment namespace (canonical OPC UA surface); Redpanda topic taxonomy (`{domain}.{entity}.{event-type}`) + Protobuf schemas; dbt curated layer (canonical analytics model) — all three already committed. | **Canonical model declaration.** The plan already builds the pieces; what it did not do is **declare** that these pieces together constitute a canonical equipment/production/event model that consumers are entitled to use as an integration interface. This declaration lives in the central `schemas` repo as first-class content and is referenced from every surface that exposes the model. See **Async Event Backbone → Canonical Equipment, Production, and Event Model.** |
- **Must consume equipment data through OtOpcUa.** No direct equipment OPC UA sessions.
- **Must consume historical and analytical data through Snowflake + dbt** — not Historian directly, not a bespoke pipeline. The `≤15-minute analytics` SLO is the freshness budget available.
- **Must consume event streams through Redpanda** — not a parallel bus. The same schemas-in-git and `{domain}.{entity}.{event-type}` topic naming apply. The canonical state vocabulary and canonical model declaration (see **Async Event Backbone → Canonical Equipment, Production, and Event Model**) are how consistent state semantics are delivered.
- **Must stay within the IT↔OT boundary.** Enterprise-hosted twin capabilities cross through ScadaBridge central and the SnowBridge like every other enterprise consumer.
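The `{domain}.{entity}.{event-type}` taxonomy referenced above lends itself to a mechanical shape check at publish time. A minimal sketch, assuming lowercase kebab-case segments — the actual character rules belong to the `schemas` repo and are not specified here:

```python
import re

# Assumed segment grammar: each of domain, entity, and event-type is a
# lowercase kebab-case token; exactly three dot-separated segments.
TOPIC_RE = re.compile(r"^[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*$")


def is_canonical_topic(name: str) -> bool:
    """True if name matches the {domain}.{entity}.{event-type} taxonomy."""
    return TOPIC_RE.fullmatch(name) is not None
```

A check like this is cheap to run in CI on the `schemas` repo, which keeps the "not a parallel bus" constraint enforceable rather than aspirational.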
#### Resolution against the meeting brief's four buckets
#### What this commits / does not commit
The meeting brief framed four outcome buckets (#1 already-delivered, #2 adjacent-funded, #3 future-plan-cycle, #4 exploratory). Management's actual answer does not land in a single bucket — it **splits per use case:**
**Commits** (the two patterns — all substrate already committed elsewhere):
- **Use cases 1 and 3 → Bucket #1 with small plan additions.** The plan already delivers the substrate; it now also commits to the canonical state vocabulary (use case 1) and the canonical model declaration (use case 3), both captured below under **Async Event Backbone → Canonical Equipment, Production, and Event Model**. No new workstream, no new component, no pillar impact.
- **Use case 2 → Bucket #4, served minimally.** Replay-based "simulation-lite" is architecturally enabled by Redpanda's retention tiers and OtOpcUa's namespace model. Full FAT / commissioning / integration-test simulation remains out of scope for this plan. If a funded simulation initiative materializes later, this plan's foundation supports it; until then, the narrow answer to use case 2 is "replay what Redpanda already holds, and build a simulated OtOpcUa namespace when a specific testing need justifies it."
#### Design constraints this imposes (unchanged)
- **Any digital twin capability must consume equipment data through OtOpcUa.** No direct equipment OPC UA sessions.
- **Any digital twin capability must consume historical and analytical data through Snowflake + dbt** — not from Historian directly, not through a bespoke pipeline. The `≤15-minute analytics` SLO is the freshness budget available to it.
- **Any digital twin capability must consume event streams through Redpanda** — not a parallel bus. The same schemas-in-git and `{domain}.{entity}.{event-type}` topic naming apply. The canonical state vocabulary and canonical model declaration (see below) are how "consistent state semantics" is delivered.
- **Any digital twin capability must stay within the IT↔OT boundary.** Enterprise-hosted twins cross through ScadaBridge central and the SnowBridge like every other enterprise consumer.
> **Unified Namespace vocabulary:** stakeholders framing the digital twin ask in "Unified Namespace" terms are asking for the same thing Use Cases 1 and 3 describe, just in UNS language. See **Target IT/OT Integration → Unified Namespace (UNS) posture** for the plan's explicit UNS framing and the decision trigger for a future MQTT/Sparkplug projection service. In short: the plan **already** delivers the UNS value proposition; an MQTT-native projection can be added later if a consumer specifically requires it.
#### What this does and does not commit
**Commits:**
- A canonical machine state vocabulary (`Running / Idle / Faulted / Starved / Blocked` + any additions), published in the `schemas` repo and used consistently across layer-3 derivations, Redpanda event schemas, and dbt curated views.
- A canonical equipment / production / event model declaration in the `schemas` repo, referencing the three surfaces (OtOpcUa, Redpanda, dbt) where it is exposed.
- Retention-tier replay of Redpanda analytics topics as a documented capability usable for integration testing and simulation-lite.
- A single OPC UA session per piece of equipment owned by OtOpcUa, keyed to stable equipment UUIDs. (See **OtOpcUa**.)
- An ACL model on OtOpcUa that distinguishes read from write and is scoped per equipment UUID. (See **OtOpcUa → Authorization model**.)
- The canonical model and stable UUID identity that make both patterns portable across environments and consumers. (See **Unified Namespace (UNS) posture** and **Async Event Backbone → Canonical Equipment, Production, and Event Model**.)
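The single-session-per-equipment commitment in the list above is, mechanically, a session registry keyed on the stable equipment UUID. A minimal sketch under that assumption; the class name and connection-factory shape are hypothetical, not OtOpcUa's design:

```python
class SessionRegistry:
    """Sketch: OtOpcUa owns exactly one upstream session per equipment UUID.

    Consumers multiplex through the registry and never obtain an
    equipment-side session of their own — which is what makes the
    read-only guarantee structural rather than procedural.
    """

    def __init__(self, connect):
        self._connect = connect                 # factory: uuid -> session handle
        self._sessions: dict[str, object] = {}

    def session_for(self, equipment_uuid: str):
        # Reuse the existing session; never open a second one per UUID.
        if equipment_uuid not in self._sessions:
            self._sessions[equipment_uuid] = self._connect(equipment_uuid)
        return self._sessions[equipment_uuid]
```

Paired with the read-vs-write ACL, this is the whole substrate the two patterns need: one session per UUID, plus a grant check deciding who may write through it.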
**Does not commit:**
- Building or buying a full commissioning-grade simulation product (Aveva Digital Twin, Siemens NX, DELMIA, Azure Digital Twins, etc.).
- A digital twin UI, dashboard, 3D visualization, or product surface.
- Predictive / AI models specific to digital twin use cases — those are captured under pillar 2 as general analytics/AI enablement, not as digital-twin-specific deliverables.
- Any new workstream, pillar, or end-of-plan criterion tied to digital twin delivery.
_TBD — whether any equipment state additions beyond the five names above are needed (e.g., `Changeover`, `Maintenance`, `Setup`); ownership of the canonical state vocabulary in the `schemas` repo (likely a domain-specific team rather than the ScadaBridge team); whether a use-case-2 funded simulation initiative is on anyone's horizon._
- The mechanism by which OtOpcUa arbitrates write-authority moves between environments (Admin UI switch, PR-merge on the `schemas` repo, release-pipeline step, or any combination). That is the OtOpcUa team's implementation decision and lives outside this plan.
- Any form of physics simulation, FAT / commissioning-grade integration emulation, 3D visualization, predictive-maintenance AI, or genealogy tracking branded as "digital twin." Adjacent initiatives and other pillars may build such things on this plan's foundation; this plan does not.
- Purchase or build of a commercial digital-twin product (Aveva Digital Twin, Siemens NX, DELMIA, Azure Digital Twins, etc.).
- Any new workstream in `roadmap.md`, any pillar, or any end-of-plan criterion tied to digital-twin delivery.
_TBD — none remaining for this section. Canonical state vocabulary ownership and possible additions (`Changeover`, `Maintenance`, `Setup`) are tracked under **Async Event Backbone → Canonical machine state vocabulary**, where that work now lives._
### Enterprise reporting: BOBJ → Power BI migration (adjacent initiative)