Joseph Doherty 78a58b3a31 Resolve enterprise shortname = zb (closes corrections-doc D4) and propagate through all UNS path examples and schema seeds.
Updated goal-state.md UNS hierarchy table (level 1 example with rationale: matches existing ZB.MOM.WW.* namespace prefix, short by design for a segment that appears in every equipment path, operators already say "ZB" colloquially), all worked-example paths in text + OPC UA browse forms, small-site placeholder example. Removed enterprise-shortname from the §UNS-hierarchy TBD list.

Updated schemas/uns/example-warsaw-west.json `enterprise: "zb"`.

Updated corrections-doc D4 entry to RESOLVED with full propagation list, and updated summary table accordingly.

Production deployments use `zb` from cluster-create. The hardcoded `_default` reserved-segment rule is unchanged (still the placeholder for unused Area/Line levels at single-cluster sites).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:12:59 -04:00

# Goal State (3-Year Target)
Target end-state for shopfloor IT/OT interfaces and data collection at the end of the 3-year plan.
> When a section below grows beyond a few paragraphs, break it out into `goal-state/<component>.md` and leave a short summary + link here. See [`CLAUDE.md`](CLAUDE.md#breaking-out-components).
## Vision
**Overarching theme: provide a stable, single point of integration between shopfloor OT and enterprise IT.** Every design decision in this plan reduces back to that goal — one bridge (ScadaBridge central), one event backbone (Redpanda), one machine-data path to Snowflake (the integration service), one place for schemas, one set of conventions. "Stable" and "single" are load-bearing words: stability rules out bespoke one-offs that drift, and singleness rules out parallel integration estates that compete with the unified model. When a decision is ambiguous later in the plan, this theme is the tiebreaker.
By the end of the 3-year plan, shopfloor IT/OT is built on **one unified integration model across every site** — large campuses (Warsaw West/North), mid-size integrated sites (Shannon, Galway, TMT, Ponce), and the currently-unintegrated smaller sites (Berlin, Winterthur, Jacksonville, and others) all onboard through the same standardized pattern centered on **ScadaBridge** as the IT/OT bridge and **EventHub** as the async backbone. That unified foundation **unlocks enterprise analytics and AI on shopfloor data** by making curated, governed shopfloor data available in **Snowflake** at the right latency and granularity for downstream consumers. In parallel, **legacy point-to-point middleware and bespoke integrations are retired** in favor of ScadaBridge-managed flows, leaving a single, supportable IT/OT integration estate.
**Explicitly in scope:**
- (a) Unifying all sites under one integration model.
- (b) Unlocking enterprise analytics/AI on shopfloor data.
- (d) Replacing legacy middleware.
**Explicitly not a primary goal:**
- (c) Modernizing operator UX is **not** a primary driver of this 3-year plan. Operator-facing UIs may improve incidentally (e.g., as a side effect of migrating to new integrations), but UX modernization is not a success criterion and should not compete for scope, budget, or sequencing. See **Non-Goals**.
## Target IT/OT Integration
### Layered architecture — the mental model
The target architecture has **four layers** on the OT side plus the enterprise IT side above them. Each layer has one job, one kind of data, and one set of downstream consumers. This is the framing to hold in mind before diving into the component-level sections below.
```
                           ┌──────────────────────────────────────┐
                           │ Enterprise IT side                   │
                           │ (Camstar, Delmia, Snowflake via      │
                           │  Machine Data → Snowflake Service,   │
                           │  Power BI / BOBJ, other apps)        │
                           └───────────────▲──────────────────────┘
── IT / OT boundary ───────────────────────┼──────────────────────
                                           │ (ScadaBridge central
                                           │  is the sole crossing)
                           ┌───────────────┴──────────────────────┐
Layer 4 — Bridge           │ ScadaBridge                          │
(bridge to enterprise)     │ (site clusters + central cluster)    │
                           └───────────────▲──────────────────────┘
                           ┌───────────────┴──────────────────────┐
Layer 3 — SCADA            │ Aveva System Platform  │ Ignition    │
(processed data)           │ (validated collection) │ (KPI)       │
                           └───────────────▲──────────────────────┘
                           ┌───────────────┴──────────────────────┐
Layer 2 — Raw data         │ OtOpcUa                              │
(single OT OPC UA endpoint │ (one cluster per site; one session   │
per site, two namespaces)  │  per equipment; System Platform      │
                           │  namespace folded in from LmxOpcUa)  │
                           └───────────────▲──────────────────────┘
                           ┌───────────────┴──────────────────────┐
Layer 1 — Equipment        │ PLCs · CNCs · robots · controllers · │
(physical devices)         │ instruments · sensors                │
                           └──────────────────────────────────────┘
```
**Layer responsibilities.**
- **Layer 1 — Equipment.** Physical devices. Speaks whatever native protocol each device supports (native OPC UA, Modbus, Ethernet/IP, Siemens S7, proprietary, etc.). This layer does not change as part of the plan.
- **Layer 2 — Raw data: OtOpcUa.** Holds the **single** session to each piece of equipment. Translates native device protocols into a uniform OPC UA surface (equipment namespace). Also hosts a **System Platform namespace** — the evolved LmxOpcUa capability — that exposes Aveva System Platform objects as OPC UA through the same endpoint, so OT OPC UA clients have one place to read both raw equipment data and processed System Platform data. Enforces access control and audit at the OT data boundary. Raw equipment data at this layer is exactly that — **raw** — no deadbanding, no aggregation, no business meaning; the System Platform namespace is a view onto layer 3, not a transformation.
- **Layer 3 — SCADA: Aveva System Platform + Ignition.** Transforms raw OPC UA data into **processed data** with business meaning. Aveva System Platform handles validated/compliance-grade processing and collection; Ignition handles KPI-focused processing and presentation. Both layer-3 systems read raw equipment data from OtOpcUa's equipment namespace — neither holds direct equipment sessions in the target state. This is where engineering-unit conversions, state derivations, alarm definitions, and validated archiving happen. System Platform's outputs are re-exposed back through OtOpcUa's System Platform namespace for OPC UA-native consumers that need them.
- **Layer 4 — Bridge: ScadaBridge.** Bridges processed OT data into the enterprise IT side. Everything that needs to cross IT↔OT goes through ScadaBridge central. Site-level ScadaBridge clusters handle site-local scripting, templating, notifications, and store-and-forward; ScadaBridge central is the sanctioned crossing point. ScadaBridge does **not** reach into equipment directly — it consumes OT data from OtOpcUa (both namespaces as needed).
- **Enterprise IT side.** Camstar, Delmia, Snowflake (via the SnowBridge), Power BI / BusinessObjects, and any other enterprise app. Lives entirely above the IT↔OT boundary; consumes layer-4 outputs through sanctioned interfaces.
**Where the other components fit in this mental model.**
- **LmxOpcUa is not a separate component in the target state — it is absorbed into OtOpcUa as the System Platform namespace.** The layered model therefore has no separate "LmxOpcUa tap"; OtOpcUa serves both raw equipment data (its primary layer-2 role) and a System Platform view (the absorbed LmxOpcUa role) through a single endpoint with two namespaces. See the OtOpcUa section above for the fold-in details.
- **Aveva Historian** is a **store adjacent to layer 3**, not a layer of its own. System Platform (layer 3) writes into it; consumers read historical validated data out of it (either through its SQL interface or through Snowflake after SnowBridge has pulled from it). It is the long-term system of record for validated data regardless of what Snowflake chooses to store.
- **Redpanda (EventHub)** is **infrastructure used between layer 4 and enterprise IT**, not a layer of its own. ScadaBridge publishes events into it; enterprise consumers (SnowBridge, KPI processors, Camstar integration) read from it. It decouples layer-4 producers from enterprise consumers without introducing a fifth layer.
- **SnowBridge** is an **enterprise-side consumer** that happens to read from both the adjacent-to-layer-3 store (Aveva Historian) and the layer-4-to-enterprise backbone (Redpanda). Its job is the governed, filtered upload to Snowflake — it does not fit inside the layered data path itself.
- **dbt** runs **inside Snowflake**, so it is enterprise-side infrastructure that transforms landed data. It has no layer-1-through-4 position.
- **OtOpcUa's raw data goes "up" through this stack and back out the top on the IT side.** A tag read from a machine in Warsaw West flows: equipment → OtOpcUa (layer 2) → System Platform or Ignition (layer 3) → ScadaBridge (layer 4) → Redpanda → SnowBridge → Snowflake → dbt → Power BI or downstream consumer.
**What the layering rules out.** Cross-layer shortcuts that bypass the layer in between:
- No layer-4 component (ScadaBridge) reads equipment directly. It must go through OtOpcUa (for OPC UA reads) or layer 3 (for processed data).
- No enterprise-IT component reads layer 2 or equipment directly. It must go through layer 4 (or, for historical data, through the Aveva Historian → SnowBridge path, which still originates in layer 3's write-to-store).
- No layer-3 component holds direct equipment sessions in the target state. Today they do (see current-state); the OtOpcUa tier-3 cutover is what ends that.
- No enterprise-IT component holds direct Redpanda-bypass connections into site infrastructure. Redpanda is the only path from layer 4 out to enterprise consumers for event-driven flows.
- No second OPC UA server runs alongside OtOpcUa. LmxOpcUa's role lives inside OtOpcUa; a future component that needs to expose OT data over OPC UA should add a namespace to OtOpcUa, not stand up its own OPC UA server.
If a future component (including a digital twin, a new AI/ML platform, an additional enterprise app, or a new third-party tool) does not fit into one of these layer roles, it is almost certainly either a layer-3 or layer-4 consumer that needs to be reshaped to consume through the existing stack — not a new parallel path.
### Unified Namespace (UNS) posture
"Unified Namespace" is a concept popularized by Walker Reynolds / 4.0 Solutions / HighByte and the MQTT-Sparkplug community. It typically means: one hierarchical data tree (ISA-95 — Enterprise / Site / Area / Line / Cell / Equipment), MQTT Sparkplug B as transport, a central broker as the hub, and every system publishing to or consuming from the same namespace instead of building point-to-point integrations. The plan **does not build a classic MQTT/Sparkplug B UNS** — but the composition of existing commitments already delivers the UNS value proposition, and the plan reframes that composition as **"our UNS"** for stakeholders who use that vocabulary.
This subsection is a **framing commitment**, not a build commitment. It does not add a new component, a new workstream, or a new pillar criterion. It is the plan's answer to the question "do we have a unified namespace?"
#### The plan's UNS composition
Four existing commitments, together, constitute the unified namespace:
| Component | UNS role |
|---|---|
| **OtOpcUa equipment namespace** (per site) | Hierarchical real-time OT data surface. Equipment, tags, and derived state exposed as a canonical OPC UA tree at each site. This is the "classic UNS hierarchy" at the site level — equipment-class templates in the `schemas` repo define the node layout. |
| **Redpanda topics + canonical Protobuf schemas** | Enterprise-wide pub-sub backbone carrying canonical equipment / production / event messages. The `{domain}.{entity}.{event-type}` taxonomy + schema registry define what "speaking UNS" means on the wire. Retention tiers give consumers a bounded replay window against the UNS. |
| **`schemas` repo + canonical model declaration** | The shared **context layer** — equipment classes, machine states (`Running / Idle / Faulted / Starved / Blocked`), event types — that makes every surface's data semantically consistent. See **Async Event Backbone → Canonical Equipment, Production, and Event Model** for the full declaration. This is where the ISA-95 hierarchy conceptually lives, even though it is not the topic path. |
| **dbt curated layer in Snowflake** | Canonical historical / analytical surface. Consumers that need "what has this equipment done over time" read the UNS via the curated layer, with the same vocabulary as the real-time surfaces. Same canonical model, different access pattern. |
Together: a single canonical data model (the `schemas` repo), a single real-time backbone (Redpanda), a canonical OT-side hierarchy at each site (OtOpcUa), and a canonical analytical surface (dbt). That is the UNS.
#### UNS naming hierarchy standard
The plan commits to a **single canonical naming hierarchy** for addressing equipment across every UNS surface (OtOpcUa, Redpanda, dbt, `schemas` repo). Without this, each surface would re-derive its own naming and drift apart; the whole point of "a single canonical model" evaporates.
##### Hierarchy — five levels, always present
| Level | Name | Semantics | Example |
|---|---|---|---|
| 1 | **Enterprise** | Single root for the whole organization. One value for the entire estate. | **`zb`** — confirmed 2026-04-17 (matches the existing `ZB.MOM.WW.*` namespace prefix used in the codebase; short by design since this segment appears in every equipment path) |
| 2 | **Site** | Physical location. Matches the authoritative site list in [`current-state.md`](current-state.md) → Enterprise Layout. | `south-bend`, `warsaw-west`, `warsaw-north`, `shannon`, `galway`, `tmt`, `ponce`, `berlin`, `winterthur`, `jacksonville`, … |
| 3 | **Area** | A section of the site — typically a **production building** at the Warsaw campuses (which run one cluster per building), or `_default` at sites that have a single cluster covering the whole site. Always present; uniform path depth is a design goal. | `bldg-3`, `bldg-7`, `_default` |
| 4 | **Line** | A production line or work cell within an area. One line = one coherent sequence of equipment working together toward a product or sub-assembly. | `line-2`, `assembly-a`, `packout-1` |
| 5 | **Equipment** | An individual machine instance. Equipment class prefix + instance number or shortname. | `cnc-mill-05`, `injection-molder-02`, `vision-system-01` |
Five levels is a **hard commitment**. Consumers can assume every equipment instance in the UNS has exactly five path segments from root to leaf. This simplifies parsing, filtering, and joining across surfaces.
**Signal / tag is a level-6 property of equipment, not a path level.** Individual data points on a piece of equipment (e.g., `spindle-speed`, `door-state`, `top-fault`) live as child nodes under the equipment in the OtOpcUa namespace and as field references in canonical event payloads. The "address" of the **equipment** stops at level 5; a **signal** is addressed as `equipment-path + signal-name`.
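Because depth is fixed at five levels, consumers can parse any UNS address positionally, with the optional signal as a sixth trailing segment. A minimal sketch of the consumer-side parse (function and variable names are illustrative, not part of the plan):

```python
# Positional parse of a UNS text-form address. The five-level depth is a
# hard commitment, so no lookahead is needed; a sixth trailing segment,
# if present, is the signal/tag, not a path level.
LEVELS = ("enterprise", "site", "area", "line", "equipment")

def parse_uns_address(path: str) -> dict:
    segments = path.split(".")
    if len(segments) < 5:
        raise ValueError(f"expected at least 5 segments: {path!r}")
    parsed = dict(zip(LEVELS, segments[:5]))
    # A signal is addressed as equipment-path + signal-name.
    parsed["signal"] = segments[5] if len(segments) > 5 else None
    return parsed
```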
##### Why "always present with placeholders" rather than "variable depth"
- **Uniform depth makes consumers simpler.** Subscribers and dbt models assume a fixed schema for the equipment identifier; variable-depth paths require special-casing.
- **Adding a building later doesn't shift paths.** If a small site adds a second production building and needs an Area level, the existing equipment at that site keeps its path (now pointing at a named area instead of `_default`), and the new building gets a new area segment — no rewrites, no breaking changes for historical consumers.
- **Explicit placeholder is more discoverable than an implicit skip.** A reader looking at `zb.shannon._default.line-1.cnc-mill-03` immediately sees that Shannon has no area distinction today; a variable-depth alternative like `zb.shannon.line-1.cnc-mill-03` leaves the reader wondering whether a level is missing.
##### Naming rules
Identical conventions to the existing Redpanda topic naming — one vocabulary, two serializations (text form for messages / docs / dbt keys; OPC UA browse-path form for OtOpcUa):
- **Character set:** `[a-z0-9-]` only. Lowercase enforced. No underscores (except the literal placeholder `_default`), no camelCase, no spaces, no unicode.
- **Segment separator:** `.` (dot) in text form; `/` (slash) in OPC UA browse paths. The two forms are **mechanically interchangeable** — same segments, different delimiter.
- **Within-segment separator:** `-` (hyphen). Multi-word segments use hyphens (`warsaw-west`, `cnc-mill-05`).
- **Segment length:** max 32 characters per segment. Keeps individual segments readable and bounds OPC UA node-name length.
- **Total path length:** max 200 characters in text form. Five maximum-length segments plus four dots come to 164 characters in the absolute worst case; real paths run far shorter, and the cap leaves headroom for an appended signal name.
- **Reserved tokens:** `_default` is the only reserved segment name, used exclusively as the placeholder for a level that does not apply at a given site.
- **Case rule on display:** paths are always shown in their canonical lowercase form. UIs may style them (larger font, breadcrumbs) but may not transform case.
**Worked example — full path for one tag, two serializations:**
| Form | Example |
|---|---|
| Text (messages, docs, dbt keys) | `zb.warsaw-west.bldg-3.line-2.cnc-mill-05.spindle-speed` |
| OPC UA browse path | `zb/warsaw-west/bldg-3/line-2/cnc-mill-05/spindle-speed` |
| Same machine at a small site (area placeholder) | `zb.shannon._default.line-1.cnc-mill-03` |
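The naming rules above are mechanical enough to enforce in a few lines. A sketch of a shared validator and the text-to-browse-path conversion (function names are illustrative; the actual CI implementation is not specified here):

```python
import re

# Sketch of the naming rules: lowercase [a-z0-9-] segments of at most
# 32 chars, `_default` as the only underscore-bearing reserved token,
# 200-char cap on the text form.
SEGMENT_RE = re.compile(r"^(_default|[a-z0-9-]{1,32})$")

def validate_text_path(path: str) -> None:
    if len(path) > 200:
        raise ValueError("path exceeds the 200-character cap")
    for seg in path.split("."):
        if not SEGMENT_RE.match(seg):
            raise ValueError(f"segment violates naming rules: {seg!r}")

def to_browse_path(path: str) -> str:
    # The two forms are mechanically interchangeable: same segments,
    # '/' in OPC UA browse paths instead of '.' in text form.
    validate_text_path(path)
    return path.replace(".", "/")
```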
##### Stable equipment identity — path is navigation, UUID is lineage
The path is the **navigation identifier**: it tells you where the equipment lives today and how to browse to it. But paths **can change** — a machine moves from one line to another, a building gets renumbered, a campus reorganizes. If the path were the only identifier, every rename would break historical queries, genealogy traces, and audit trails.
The plan commits to a **multi-identifier model** — the v2 implementation design surfaced that production usage requires more than UUID + path. Every equipment instance carries **five identifiers**, all exposed as OPC UA properties on the equipment node so external systems can resolve by their preferred identifier without a sidecar service:
| Identifier | Required? | Uniqueness | Mutable? | Who sets it? | Purpose |
|---|---|---|---|---|---|
| **EquipmentUuid** | Yes | Fleet-wide | **Immutable** | System-generated (UUIDv4) | Downstream events, canonical model joins, cross-system lineage. Not derived from the path. |
| **EquipmentId** | Yes | Within cluster | **Immutable** | **System-generated** (`'EQ-' + first 12 hex chars of EquipmentUuid`) | Internal logical key for cross-generation config diffs. Never operator-supplied, never editable, never present in CSV imports. System-generation eliminates the corruption path where operator typos or bulk-import renames would mint duplicate equipment identities and permanently split downstream UUID-keyed lineage. |
| **MachineCode** | Yes | Within cluster | Mutable (rename tracked) | **Operator-set** | **Operator-facing colloquial name** (e.g., `machine_001`). What operators say on the radio and write in runbooks. Surfaced prominently in Admin UI alongside ZTag. |
| **ZTag** | Optional | Fleet-wide | Mutable (re-assigned by ERP) | **Operator-set** (sourced from ERP) | **ERP equipment identifier.** Primary identifier for browsing in Admin UI per operational request. Fleet-wide uniqueness enforced via `ExternalIdReservation` table outside generation versioning (see below). |
| **SAPID** | Optional | Fleet-wide | Mutable (re-assigned by SAP PM) | **Operator-set** (sourced from SAP PM) | **SAP PM equipment identifier.** Required for maintenance system join. Fleet-wide uniqueness enforced via `ExternalIdReservation` table (see below). |
**Three operator-set fields** (MachineCode, ZTag, SAPID), **two system-generated** (EquipmentUuid, EquipmentId). CSV imports match by `EquipmentUuid` for updates; rows without UUID create new equipment with system-generated identifiers.
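The two system-generated identifiers can be sketched directly from the derivation rule in the table (a sketch, assuming "first 12 hex chars" means the dash-free hex digits of the UUIDv4):

```python
import uuid

# Sketch of the system-generated identifier pair: an immutable UUIDv4
# plus 'EQ-' + the first 12 hex characters of that UUID. Never
# operator-supplied, never editable.
def new_equipment_identity() -> tuple[str, str]:
    equipment_uuid = uuid.uuid4()                    # fleet-wide, immutable
    equipment_id = "EQ-" + equipment_uuid.hex[:12]   # derived, never edited
    return str(equipment_uuid), equipment_id
```

A CSV import row without a UUID would mint both values this way; a row with a UUID matches an existing record and never re-mints them.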
Equipment also carries **OPC UA Companion Spec OPC 40010 (Machinery) Identification fields** as first-class metadata — Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation (free-text supplementary to the UNS path), ManufacturerUri, DeviceManualUri. These are operator-set static metadata (Manufacturer + Model required, the rest optional) exposed on the OPC UA equipment node under the OPC 40010-standard `Identification` sub-folder. Drivers that can read these dynamically (FANUC `cnc_sysinfo()`, Beckhoff `TwinCAT.SystemInfo`, etc.) override the static value at runtime. The full universal cross-machine signal set lives in the `_base` equipment-class template in the schemas repo; every other class extends it.
**`ExternalIdReservation` table — fleet-wide uniqueness for ZTag and SAPID across rollback and re-enable.** Fleet-wide uniqueness for external identifiers cannot be expressed within generation-versioned tables because old generations and disabled equipment can still hold the same values — rollback or re-enable would silently reintroduce duplicates that corrupt downstream ERP/SAP joins. A dedicated `ExternalIdReservation` table sits **outside generation versioning**; `sp_PublishGeneration` reserves IDs atomically at publish; FleetAdmin-only `sp_ReleaseExternalIdReservation` (audit-logged, requires reason) is the only path to free a value for reuse. This is a precedent: some cross-generation invariants need their own non-versioned tables. When ACL design is scoped (see OtOpcUa → Authorization model), check whether any ACL grant has a similar rollback-reuse hazard.
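The reservation mechanics reduce to an ordinary unique constraint in a table that publishing inserts into atomically. A minimal sketch (SQLite stand-in; column names are hypothetical, and only the table name and the two stored-procedure roles come from the plan):

```python
import sqlite3

# Illustrative sketch of the ExternalIdReservation idea: a table outside
# generation versioning with a fleet-wide unique constraint, reserved
# atomically at publish time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ExternalIdReservation (
        id_type        TEXT NOT NULL,   -- 'ZTag' or 'SAPID'
        id_value       TEXT NOT NULL,
        equipment_uuid TEXT NOT NULL,
        UNIQUE (id_type, id_value)      -- fleet-wide, across all generations
    )""")

def reserve(id_type: str, id_value: str, equipment_uuid: str) -> bool:
    # Reserve-at-publish: a duplicate insert fails instead of silently
    # reintroducing a value still held by an old generation.
    try:
        with conn:
            conn.execute("INSERT INTO ExternalIdReservation VALUES (?, ?, ?)",
                         (id_type, id_value, equipment_uuid))
        return True
    except sqlite3.IntegrityError:
        return False

def release(id_type: str, id_value: str, reason: str) -> None:
    # Stand-in for sp_ReleaseExternalIdReservation: the only path to free
    # a value; FleetAdmin-only and audit-logged in the real design.
    if not reason:
        raise ValueError("release requires a reason")
    with conn:
        conn.execute(
            "DELETE FROM ExternalIdReservation WHERE id_type=? AND id_value=?",
            (id_type, id_value))
```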
**Path** (the 5-level UNS hierarchy address) is a **sixth** identifier but is **not** stored on the equipment node as a flat property — it is the node's **browse path** by construction. Path can change (equipment moves, area renamed); UUID and EquipmentId cannot.
**Consumer rules of thumb:**
- Use **EquipmentUuid** for joins, lineage, genealogy, audit, and long-running historical queries.
- Use **MachineCode** for operator-facing UIs, runbooks, on-the-floor communications.
- Use **ZTag** when correlating with ERP/finance systems.
- Use **SAPID** when correlating with maintenance/PM systems.
- Use the **path** for dashboards, filters, browsing, human search, and anywhere a reader needs to know *where* the equipment is right now.
Canonical events on Redpanda carry `equipment_uuid` (stable) and `equipment_path` (current at event time) as fields. A dbt dimension table (`dim_equipment`) carries all five identifiers plus current and historical paths, and is the authoritative join point for analytical consumers. OtOpcUa's equipment namespace exposes all five as properties on each equipment node.
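A toy illustration of why events carry both fields (the `equipment_uuid` / `equipment_path` field names are from the plan; the sample data is invented):

```python
# dim_equipment is the authoritative join point; events carry the stable
# UUID plus the path as of event time. A historical event whose path has
# since changed still joins cleanly via UUID.
dim_equipment = {
    "uuid-cnc-mill-03": {
        "machine_code": "machine_001",
        "current_path": "zb.shannon.bldg-9.line-1.cnc-mill-03",
    },
}

# Event recorded before this site promoted `_default` to a named area:
event = {
    "equipment_uuid": "uuid-cnc-mill-03",
    "equipment_path": "zb.shannon._default.line-1.cnc-mill-03",
}

row = dim_equipment[event["equipment_uuid"]]    # join by the stable UUID
path_changed = row["current_path"] != event["equipment_path"]  # True here
```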
_TBD — **UUID generation authority** (E1 from v2 corrections): OtOpcUa Admin UI currently auto-generates UUIDv4 on equipment creation. If ERP or SAP PM systems take authoritative ownership of equipment registration in the future, the UUID-generation policy should be configurable per cluster (generate locally vs. look up from external system). For now, OtOpcUa generates by default._
##### Where the authoritative hierarchy lives
**The `schemas` repo is the single source of truth.** Every other UNS surface consumes the authoritative definition; none of them carry an independent copy.
| Surface | What the surface carries | Authority relationship |
|---|---|---|
| `schemas` repo | Canonical hierarchy definition — the full tree with UUIDs, current paths, equipment-class assignments, and evolution history. Stored as **JSON Schema (.json files)** — idiomatic for .NET (System.Text.Json), CI-friendly (any runner can validate with `jq`), human-authorable, and merge-friendly in git. Protobuf is derived (code-generated) from the JSON Schema for wire serialization where needed (Redpanda events). One-way derivation: JSON Schema → Protobuf, not bidirectional. | **Authoritative.** Changes go through the same CODEOWNERS + `buf`-CI governance as other schema changes. |
| OtOpcUa equipment namespace | Browse-path structure matching the hierarchy; equipment nodes carry the UUID as a property. Built per-site from the relevant subtree (each site's OtOpcUa only exposes equipment at that site). | **Consumer.** Generated from the `schemas` repo definition at deploy/config time. Drift between OtOpcUa and `schemas` repo is a defect. |
| Redpanda canonical event payloads | Every event payload carries `equipment_uuid` (stable) and `equipment_path` (current at event time) as fields. Enables filtering without topic explosion. | **Consumer.** Protobuf schemas reference the hierarchy definition in the same `schemas` repo. |
| dbt curated layer in Snowflake | `dim_equipment` dimension table with UUID, current path, historical path versions, equipment class, site, area, line. Used as the join key by every analytical consumer. | **Consumer.** Populated by a dbt model that reads from an upstream reference table synced from the `schemas` repo — not hand-maintained in Snowflake. |
##### Evolution and change management
The hierarchy will change. Sites get added (smaller sites onboarding in Year 2). Buildings get reorganized. Lines get restructured. Equipment moves. Renames happen. The plan commits to a governance model that makes these changes safe:
- **All changes go through the `schemas` repo** via normal PR + CODEOWNERS review.
- **CI enforces uniqueness** — no two pieces of equipment can share a UUID; no two pieces of equipment at the same (site, area, line) can share a leaf name; `_default` is reserved.
- **CI enforces structural invariants** — every equipment has exactly 5 path segments; every segment matches the naming rules; no reserved tokens except `_default`.
- **Path changes are tracked as history** — when an equipment's path changes, the old path is retained in a `path_history` list on that equipment's definition, with a timestamp and reason. Historical messages that carry the old path can be joined back to the current definition via UUID.
- **Renames never reuse UUIDs** — if a machine is physically replaced, the new machine gets a new UUID even if the new machine sits at the same path. Genealogy stays clean.
- **Area placeholder → named area promotion** — when a small site grows to justify area distinctions (e.g., adding a second building), the existing equipment has its path updated from `_default` to the new named area via a single PR. Historical events keep the old `_default` path; new events carry the new path; UUID stays stable.
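The CI invariants listed above reduce to a short structural check over the hierarchy definition. A sketch (the record shape is hypothetical; the plan fixes the invariants, not the file layout):

```python
import re

# Sketch of the CI structural checks: exactly 5 segments, naming rules,
# UUID uniqueness, and no leaf name reused at the same site/area/line.
SEGMENT_RE = re.compile(r"^(_default|[a-z0-9-]{1,32})$")

def check_hierarchy(equipment: list[dict]) -> list[str]:
    errors, seen_uuids, seen_paths = [], set(), set()
    for eq in equipment:
        segments = eq["path"].split(".")
        if len(segments) != 5:
            errors.append(f"{eq['path']}: expected exactly 5 segments")
        if any(not SEGMENT_RE.match(s) for s in segments):
            errors.append(f"{eq['path']}: segment violates naming rules")
        if eq["uuid"] in seen_uuids:
            errors.append(f"{eq['uuid']}: duplicate UUID")
        if eq["path"] in seen_paths:
            errors.append(f"{eq['path']}: duplicate path (leaf name reused "
                          "at the same site/area/line)")
        seen_uuids.add(eq["uuid"])
        seen_paths.add(eq["path"])
    return errors
```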
##### Out of scope for this subsection
- **Product / job / traveler hierarchy.** Products flow through equipment orthogonally to the equipment tree and are tracked in Camstar's MES genealogy, not in the UNS equipment hierarchy. A product's current equipment is joined in via MES events (`mes.workorder.*`) referencing equipment UUIDs — not by putting products into the equipment path.
- **Operator / crew / shift hierarchy.** Same reason — orthogonal to equipment; lives elsewhere.
- **Logical vs physical equipment.** The plan's hierarchy addresses **physical equipment instances**. Logical groupings (e.g., "all CNC mills," "all equipment on the shop floor") are queryable via equipment class + attributes in dbt or via OPC UA browse filters — not via the path hierarchy.
- **Real-time UNS browsing UI.** If stakeholders want a tree-browse experience against the UNS (an HMI, an engineering tool), that is a consumer surface, not a hierarchy definition. The projection service discussed below is the likely delivery path if this is ever funded.
**Resolved:** storage format for the hierarchy in the `schemas` repo is **JSON Schema** (see "Where the authoritative hierarchy lives" above).
_TBD — authoritative initial **UNS hierarchy snapshot** for the currently-integrated sites — requires a per-site area/line/equipment walk to capture equipment instances, their UNS path assignments, and stable UUIDs (the protocol survey has been removed since the OtOpcUa v2 design committed the driver list directly; the hierarchy walk is now a standalone Year 1 deliverable); whether the dbt `dim_equipment` historical-path tracking needs a slowly-changing-dimension type-2 pattern or a simpler current+history list; ownership of hierarchy change PRs (likely a domain SME group, not the ScadaBridge team)._
#### How this differs from a classic MQTT-based UNS
Three deliberate deviations from the MQTT/Sparkplug B template. Each is a decision this plan already made for other reasons — none was made to avoid UNS, and none precludes serving consumers that expect UNS shape.
- **Transport is Kafka, not MQTT.** Kafka was chosen for analytics replay, a bundled schema registry, long-horizon retention (30/90 days per tier), and native Snowflake ingestion via the Snowflake Kafka Connector — all of which are critical to pillar 2. MQTT is better for lightweight real-time pub-sub and has a richer tooling ecosystem for HMIs and COTS integration products; Kafka is better for the plan's actual consumer mix. **Cost of the deviation:** vendor tools that expect "UNS = MQTT" need either a Kafka client (many modern vendors have one) or a projection layer (see below).
- **Topic structure is flat, not ISA-95-hierarchical.** Topics are named `{domain}.{entity}.{event-type}` with **site identity in the message**, not `enterprise.site.area.line.cell.equipment.tag` with per-equipment topic explosion. This was a deliberate topic-count bounding decision — adding Berlin, Winterthur, Jacksonville, etc. does not multiply the topic set. The ISA-95 hierarchy still exists **conceptually** in the `schemas` repo's equipment-class definitions; it is just not the Kafka topic path. **Cost of the deviation:** consumers expecting hierarchical topic navigation get it via the `schemas` repo and (optionally) via the projection service below, not via Kafka topic listing.
- **No Sparkplug B state machine.** Sparkplug B adds birth/death/state certificates and a stateful producer model on top of MQTT. The plan's canonical events are **stateless** messages carrying explicit state transitions (`equipment.state.transitioned`) rather than implicit Sparkplug state. **Cost of the deviation:** if a consumer needs Sparkplug state semantics, the projection service translates at the boundary; the plan does not emit Sparkplug natively.
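A concrete illustration of the flat-topic deviation (the topic and the two identifier field names follow the plan's taxonomy; the payload sample is invented):

```python
# Flat topic, site identity in the message: one topic per event type,
# not one per equipment, so onboarding a new site adds zero topics.
topic = "equipment.state.transitioned"   # {domain}.{entity}.{event-type}

message = {
    "equipment_uuid": "uuid-cnc-mill-05",
    "equipment_path": "zb.warsaw-west.bldg-3.line-2.cnc-mill-05",
    "new_state": "Faulted",
}

# Consumers filter on the in-message path instead of subscribing to a
# hierarchical per-equipment topic tree.
site = message["equipment_path"].split(".")[1]
is_warsaw_west = site == "warsaw-west"
```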
#### Optional future: UNS projection service (architecturally supported, not committed for build)
For consumers that expect a "classic UNS" surface — vendor COTS tools, operator HMI platforms with an MQTT / Sparkplug B client, third-party analytics tools pre-wired to MQTT, or enterprise OPC UA clients expecting a unified address space — a **small projection service** can consume from Redpanda topics and expose the same canonical data on a classic UNS surface. The plan supports this architecturally (Pattern A from the brainstorm that produced this subsection) but **does not commit to building it** in the 3-year scope.
Two projection flavors are possible, not mutually exclusive:
1. **MQTT Sparkplug B broker projection.** A central MQTT broker (HiveMQ, EMQX, Mosquitto, or similar) republishes selected Redpanda topics as Sparkplug-shaped messages on an ISA-95 topic path constructed from the canonical model's equipment-class metadata. Consumers connect to the MQTT broker as if it were their UNS.
2. **Enterprise OPC UA aggregator projection.** A central OPC UA server unions the per-site OtOpcUa instances into one enterprise-wide hierarchical address space, for consumers that prefer OPC UA over MQTT. This flavor needs care around the IT↔OT boundary — an enterprise-reachable OPC UA surface on top of OT-side servers can blur the "ScadaBridge central is the sole IT↔OT crossing" rule, and would need to be either routed through ScadaBridge central or limited to OT-side clients. Build this only if a consumer specifically needs it and the boundary question gets explicit resolution.
**Decision trigger for building a projection service:** when a specific consumer (vendor tool, COTS HMI, analytics product, new initiative) requires a classic UNS surface and the cost of writing a Kafka client for that consumer exceeds the cost of operating the projection layer for the rest of the consumer's lifetime. Until that trigger is hit, the canonical model + Redpanda **is** the UNS and consumers reach it directly.
This mirrors the treatment of OtOpcUa's future `simulated` namespace and the Digital Twin Use Case 2 simulation-lite foundation: the architecture supports the addition; the plan does not commit the build until a specific need justifies it.
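If the MQTT projection flavor were ever built, its core job is a deterministic mapping from canonical event metadata to a Sparkplug-shaped topic path. A minimal sketch, assuming hypothetical event field names (`site`, `area`, `line`, `equipment_id` — the real canonical schema may differ) and mapping site to Sparkplug group, area/line to edge node, and equipment to device:

```python
# Hypothetical sketch of projection flavor 1: republishing a canonical
# Redpanda event on a Sparkplug B topic path
# (spBv1.0/{group}/{message_type}/{edge_node}/{device} per the Sparkplug spec).
# The event field names and the group/edge-node/device mapping are illustrative.

def sparkplug_topic(event: dict) -> str:
    """Build a Sparkplug B DDATA topic from the event's equipment-class metadata."""
    group = f"{event['enterprise']}-{event['site']}"
    edge_node = f"{event['area']}.{event['line']}"
    device = event["equipment_id"]
    return f"spBv1.0/{group}/DDATA/{edge_node}/{device}"

event = {
    "enterprise": "zb",            # enterprise shortname from the UNS hierarchy
    "site": "warsaw-west",
    "area": "machining",           # hypothetical area/line/equipment names
    "line": "line-04",
    "equipment_id": "cnc-0412",
}
print(sparkplug_topic(event))
# spBv1.0/zb-warsaw-west/DDATA/machining.line-04/cnc-0412
```

The projection service would run this mapping inside a plain Redpanda consumer loop and republish to the MQTT broker; consumers never see Kafka.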
#### What the UNS framing does and does not change
**Changes:**
- Stakeholders who ask "do we have a UNS?" get a direct "yes — composed of OtOpcUa + Redpanda + `schemas` repo + dbt" answer instead of "we have a canonical model but we didn't use that word."
- Digital Twin Use Cases 1 and 3 (see **Strategic Considerations → Digital twin**) — which are functionally UNS use cases in another vocabulary — now have a second name and a second stakeholder audience.
- A future projection service is pre-legitimized as a small optional addition, not a parallel or competing initiative.
- Vendor conversations that assume "UNS" means a specific MQTT broker purchase can be reframed: the plan delivers the UNS value proposition via different transport; the vendor's MQTT expectations become a projection-layer concern, not a core-architecture concern.
**Does not change:**
- **Does not add a new workstream** to [`roadmap.md`](roadmap.md).
- **Does not commit to building** an MQTT broker, a Sparkplug B producer layer, or an enterprise OPC UA aggregator.
- **Does not change** the Redpanda topic naming convention, the `{domain}.{entity}.{event-type}` taxonomy, or the site-identity-in-message decision.
- **Does not change** the IT/OT boundary or the "ScadaBridge central is the sole IT↔OT crossing" rule. A projection service, if ever built, would live on one side of the boundary (most likely IT side, as a Redpanda consumer) and would not create a new crossing.
- **Does not invalidate** any existing plan decision — the UNS framing is additive and interpretive, not restructural.
_TBD — whether any stakeholder has specifically asked for UNS vocabulary, or whether this framing is proactive; whether any vendor tool currently in evaluation expects an MQTT/Sparkplug UNS and would motivate a Year 2 or Year 3 projection build; whether to add the projection service as an explicit "optional future capability" callout in roadmap.md, or leave it as an unscheduled architectural option (current posture)._
### IT/OT Bridge — ScadaBridge as the Strategic Layer
- **ScadaBridge** is the **global integration network** providing controlled access between **IT and OT**.
- **The IT↔OT boundary sits at ScadaBridge central.** In the target architecture:
- **OT side = machine data.** Everything that collects, transforms, or stores machine data lives on the OT side. Concretely this includes: **Aveva System Platform** (primary and site clusters, Global Galaxy federation, hot-warm redundancy), **equipment OPC UA and native device protocols** (PLCs, controllers, instruments), **OtOpcUa** (the unified per-site OPC UA layer that exposes raw equipment data and the System Platform namespace — the evolution of LmxOpcUa), **ScadaBridge** (site clusters and central), **Aveva Historian**, and **Ignition SCADA** (as the KPI SCADA UX layer per the UX split).
- **IT side = enterprise applications.** Everything business-facing lives on the IT side. Concretely this includes: **Camstar** (MES), **Delmia** (DNC / digital manufacturing), **enterprise reporting and analytics** (Snowflake, dbt, SAP BusinessObjects today / Power BI tomorrow), **the SnowBridge** (it's a Snowflake-facing enterprise consumer, not an OT component — it happens to read from OT sources but its identity, hosting, and governance are IT), and **any other enterprise app** that needs shopfloor data or has to drive shopfloor behavior.
- **Long-term posture.** System Platform traffic (Global Galaxy, site↔site cluster federation, site System Platform clusters ↔ central System Platform cluster, site-level ScadaBridge ↔ local equipment) **stays on the OT side** and is **not** subject to "retire to ScadaBridge." Global Galaxy is how System Platform is supposed to federate and stays the authorized mechanism for OT-internal integration.
- **The crossing point.** **ScadaBridge central ↔ enterprise integrations is the single IT↔OT bridge.** Any traffic that crosses between the two zones must cross *through* ScadaBridge central; nothing else is permitted as a long-term path. That includes reads (enterprise app wanting machine data) and writes (enterprise app driving shopfloor state).
- **Implication for the Global Galaxy Web API** on the Aveva System Platform primary cluster: its two existing interfaces (Delmia DNC, Camstar MES) are IT↔OT crossings that currently run *outside* ScadaBridge and are therefore in-scope for retirement under pillar 3.
- **Implication for the EventHub backbone.** Redpanda is used by both sides: ScadaBridge (OT) produces to it, and the SnowBridge (IT) consumes from it. The cluster itself lives in South Bend and is operationally an IT-zone resource — the physical network path from site ScadaBridge to the central cluster is therefore an IT↔OT crossing, and its ScadaBridge-side producer counts as the ScadaBridge-central-mediated crossing per the rule above.
- **Out of scope here:** physical network segmentation (VLANs, firewalls, DMZ design, IDMZ conduits, specific IEC 62443 zone/conduit definitions, etc.). This plan defines the **logical** IT↔OT boundary and the **single sanctioned crossing point**. How the enterprise network team implements that logical boundary in physical infrastructure is owned outside this plan.
- All cross-domain traffic (enterprise systems ↔ shopfloor equipment/SCADA) flows through ScadaBridge rather than ad-hoc point-to-point connections.
- Leverages ScadaBridge's existing capabilities — **OPC UA** on the OT side, **secure Web API**, **scripting**, and **templating** — to standardize how integrations are built and governed.
- **Legacy integration migration:** ScadaBridge is already deployed, but **remaining legacy IT↔OT integrations must be migrated onto it** to retire point-to-point paths. This migration is a prerequisite for the "all cross-domain traffic flows through ScadaBridge" target state. The authoritative inventory lives in [`current-state/legacy-integrations.md`](current-state/legacy-integrations.md). _TBD — migration plan, sequencing, decommission criteria for each legacy path._
- _TBD — identity & auth details beyond what's already captured (EventHub SASL/OAUTHBEARER, ScadaBridge API keys, OtOpcUa UserName tokens with standard OPC UA security modes/profiles — inherited from the LmxOpcUa pattern), change control, HA/DR, capacity planning per site._
### Enterprise System Integration
- **Camstar MES** — deeper, richer integration than today's single Web API interface. _TBD — bidirectional flows, event-driven vs request/response, data scope, error handling._
- **Snowflake** — first-class integration with the enterprise data platform so shopfloor data lands in Snowflake for analytics, reporting, and downstream consumers. See **Aveva Historian → Snowflake** below for the time-series ingestion pattern. _TBD — non-historian data flows (MES, ScadaBridge events, metadata), schema ownership, latency targets, governance._
- _TBD — other enterprise systems (ERP, PLM, quality, etc.) that need integration._
### SnowBridge — the Machine Data to Snowflake upload service
**SnowBridge** is a **dedicated integration service** that owns all **machine data** flows into Snowflake. It is a new, purpose-built component — **not** the Snowflake Kafka Connector directly, and **not** configuration living inside ScadaBridge or the central `schemas` repo.
**Responsibilities.**
- **Source abstraction.** Reads from multiple machine-data sources behind a common interface: **Aveva Historian** (via its SQL interface), **ScadaBridge / EventHub (Redpanda topics)**, and any future source (e.g., Ignition, Aveva Data Hub, direct OPC UA collectors) without each source needing its own bespoke pipeline.
- **Selection.** Operators configure **which topics** (for Redpanda-backed sources) and **which tags/streams** (for Historian and other sources) flow to Snowflake. Selection is a first-class, governed action inside this service — not a side effect of deploying a schema or a ScadaBridge template.
- **Sink to Snowflake.** Writes into Snowflake via the appropriate native mechanism per source (Snowpipe Streaming for event/topic sources, bulk or COPY-based for Historian backfills, etc.) while presenting one unified operational surface.
- **Governance home.** Topic/source/tag opt-in, mapping to Snowflake target tables, schema bindings, and freshness expectations all live in this service's configuration — one place to ask "is this tag going to Snowflake, and why?"
**Rationale.**
- **Separation of concerns.** ScadaBridge's job is IT/OT integration at the edge (site-local, OPC UA, store-and-forward, scripting). Shoveling curated data into Snowflake is a different job — long-lived connector state, Snowflake credentials, per-source backfill logic, schema mapping — and does not belong on every ScadaBridge cluster.
- **Source-agnostic.** Not all machine data is going to flow through Redpanda. Aveva Historian in particular has its own SQL interface that is better read directly for bulk/historical work than replayed through EventHub. This service handles that heterogeneity in one place.
- **Governance visibility.** A single operator-facing system answers the "what machine data is in Snowflake?" question, which matters for compliance, cost attribution, and incident response.
- **Decouples schema evolution from data flow.** Adding a Protobuf schema to the central repo no longer implicitly adds data to Snowflake — that requires an explicit action in this service. Prevents accidental volume.
**Implications for other decisions in this plan.**
- The **Aveva Historian → Snowflake** recommendation (below) is updated: this service is the component that actually implements the path, rather than a direct ScadaBridge→EventHub→Snowflake Kafka Connector pipeline.
- The **tag opt-in governance** question for the Snowflake-bound stream is resolved here: the opt-in list lives in **this service**, not in the central `schemas` repo and not in ScadaBridge configuration.
- The **Snowflake Kafka Connector** is no longer presumed to be the primary path. It may still be used internally by this service for Redpanda-backed flows, or this service may implement its own consumer — an implementation choice inside the service, not a plan-level commitment.
- The **central EventHub cluster** does not change — machine data still flows through Redpanda for event/topic sources; this service is just one of several consumers (alongside KPI processors, Camstar integration, etc.).
**Build-vs-buy: custom build, in-house.** The service is built in-house rather than adopting Aveva Data Hub or a third-party ETL tool (Fivetran, Airbyte, StreamSets, Precog, etc.).
- **Rationale:** bespoke fit for the exact source mix (Aveva Historian SQL + Redpanda/ScadaBridge + future sources), full control over the selection/governance model, alignment with the existing .NET ecosystem that ScadaBridge and OtOpcUa already run on, no commercial license dependency, and no vendor roadmap risk for a component this central.
- **Trade-off accepted:** commits the organization to building and operating another service over the lifetime of the plan. Justified because the requirements (multi-source abstraction, topic/tag selection as a governed first-class action, Snowflake as a targeted sink) don't map cleanly onto any off-the-shelf tool, and the cost of a bad fit would be paid forever.
- **Implementation hint (not a commitment):** the most natural starting point is a .NET service — possibly an Akka.NET application to share infrastructure patterns with ScadaBridge — but the specific runtime/framework is an implementation detail for the build team.
**Operator interface: web UI backed by an API.** Operators manage source/topic/tag selection through a **dedicated web UI** that sits on top of the service's own API. Selection state lives in an **internal datastore owned by the service**, not in a git repo.
- **Rationale:** lowest friction for the operators who actually run the machine-data estate — non-engineers can onboard a tag or disable a topic without opening a PR. Makes the "what's flowing to Snowflake right now?" question answerable from one screen instead of correlating git state with running state. The underlying API lets ScadaBridge, dbt, or future tooling drive selection changes programmatically when needed.
- **Trade-off accepted:** git is **not** the source of truth for selection state. Audit, change review, and rollback all have to be built into the service itself — they do not come for free from `git log` and PR review.
- **Non-negotiable requirements on the UI/API (to offset the trade-off):**
- **Full audit trail** — every selection change records who, what, when, and why (with a required change-reason field). Audit entries are queryable and exportable.
- **Role-based access control** — viewing vs proposing vs approving selection changes are separate permissions; ties into the same enterprise IdP used for Redpanda SASL/OAUTHBEARER (see EventHub auth).
- **Approval workflow — blast-radius based.** Not every change needs four-eyes review; the gate is the **potential blast radius** of the change, not the environment it runs in.
- **Self-service (no approval):** adding/removing a single tag from a non-compliance topic, adjusting a non-structural mapping, toggling an individual tag on/off, renaming a display label.
- **Requires approval (four-eyes):** adding a brand-new topic or source, adding or modifying any tag/topic classified as **compliance-tier**, any change whose **estimated Snowflake storage or compute impact** crosses a configured threshold (the UI must compute and display that estimate before the change can be submitted), and any change to the approval rules themselves.
- **Rationale:** cost and compliance incidents can happen in any environment connected to a real Snowflake account, so gating approvals purely by environment is weaker than gating by blast radius. Fast path stays fast; risky changes always get reviewed regardless of where they're being made.
- _TBD — concrete storage/compute impact thresholds that trigger the gate, whether approvers can be inside or must be outside the requester's team, SLA for approvals so the gate doesn't become a bottleneck, and how "estimated impact" is calculated (heuristic, dry-run query, cost modeling service)._
- **Readable, exportable state** — the current selection can be dumped as a machine-readable document (YAML/JSON) at any time, so operators and auditors can see the full picture without walking the UI screen-by-screen, and so disaster recovery of the service has a clear input format.
- **Reversible changes** — every change can be reverted; reverts are themselves audited.
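The "readable, exportable state" requirement implies a dump format. A sketch of what that export could look like — every field name and value here is illustrative, not a schema commitment (only `equipment.state.transitioned` comes from the plan itself):

```yaml
# Hypothetical SnowBridge selection-state export — shape only.
exported_at: "2029-01-15T10:30:00Z"
sources:
  - name: eventhub
    kind: redpanda
    topics:
      - topic: equipment.state.transitioned
        target_table: RAW.EQUIPMENT_STATE_TRANSITIONS
        tier: standard            # standard | compliance (drives approval gating)
        enabled: true
  - name: historian-southbend
    kind: aveva-historian-sql
    tags:
      - tag: WW.Line04.CNC0412.SpindleLoad   # hypothetical tag path
        target_table: RAW.HISTORIAN_ANALOG
        enabled: true
```

A format like this doubles as the disaster-recovery input the requirement calls for: re-importing the last export restores selection state.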
_TBD — service name (working title only); hosting (South Bend, alongside Redpanda, or elsewhere); high-availability posture; how it authenticates to Snowflake, Historian, and Redpanda; which datastore holds selection state (SQL? internal KV?); how the service recovers its selection state after a failure; how it handles schema evolution when the upstream Protobuf schema changes; backfill and replay semantics per source; whether the UI is a standalone app or embeds into an existing operations portal._
### Aveva Historian → Snowflake
**Problem.** Aveva Historian exposes a SQL interface (OPENQUERY, history views) that *can* be queried from Snowflake, but full-fidelity bulk loading of raw tag data into Snowflake is not viable at enterprise scale — the data volume (sub-second tag values across thousands of tags across many sites) would overwhelm both the export path and Snowflake storage/compute costs.
**Design constraints.**
- **Volume-aware** — we cannot ship raw full-resolution data wholesale.
- **Data locality** — collection stays local to each site (see ScadaBridge principle). Heavy pre-processing should happen near the source, not after a WAN hop.
- **Replayability** — downstream systems (including Snowflake) must tolerate outages without data loss.
- **Single backbone** — reuse the planned **EventHub (Kafka-compatible)** backbone rather than inventing a point-to-point Historian↔Snowflake path.
**Candidate options researched.**
1. **Direct Snowflake → Historian SQL (OPENQUERY / linked server / JDBC pulls).**
- *Pros:* simplest conceptually; no new infra; uses existing Historian SQL surface.
- *Cons:* **does not scale** — pulling raw resolution is the volume problem we're trying to avoid; puts read load on production Historians; no store-and-forward; Snowflake compute wakes up repeatedly to poll.
- *Verdict:* viable only for low-volume metadata/config tables, not tag data.
2. **Aveva Historian Tier-2 Summary Replication → Snowflake.**
- Aveva Historian natively supports **Tier-2 replication** with **summary replication** (periodic analog/state summary stats rather than full resolution). Summaries are stored as history blocks, not raw SQL rows.
- *Pros in principle:* native feature; summaries designed for analytics.
- *Why it doesn't apply here:* the current-state Historian topology is **central-only in South Bend** — there is no tier-1/tier-2 replication in play today, so "Tier-2 summary replication" would require introducing a new historian server to replicate into before it could feed Snowflake. That's infrastructure we don't have and don't want to add just to serve Snowflake ingestion.
- *Verdict:* **not used.** This option assumed a tiered historian that doesn't exist in the current topology, and standing one up just for Snowflake contradicts the "stable, single point of integration" theme.
3. **ScadaBridge → EventHub → Snowflake Kafka Connector (Snowpipe Streaming).** *(recommended primary path)*
- ScadaBridge already has **EventHub forwarding** as a capability and runs at each site (data locality). It can publish tag values, aggregates, or exception-based events to EventHub.
- Snowflake offers the **Snowflake Connector for Kafka with Snowpipe Streaming**, which streams rows directly into Snowflake tables with ~1 second latency and lower per-row cost than file-based Snowpipe.
- *Pros:* reuses the planned EventHub backbone; ScadaBridge scripting/templating does the aggregation/thinning at the source; store-and-forward handles WAN outages; Snowflake-native streaming ingestion; horizontally scalable per site.
- *Cons:* requires disciplined schema/topic taxonomy and a schema registry; ScadaBridge has to carry aggregation logic (or delegate it to Historian Tier-2 summaries — see option 2); cost of EventHub + Snowpipe Streaming must be modeled.
- *Verdict:* **recommended primary path.** It aligns with every other goal-state decision (ScadaBridge as IT/OT bridge, EventHub as async backbone, data locality).
4. **Aveva Data Hub / Cogent DataHub as the bridge.**
- Aveva's own DataHub products can connect to AVEVA Historian and forward to external targets including **Apache Kafka** and **Azure Event Hubs**, with store-and-forward.
- *Pros:* supported Aveva product; pre-built Historian connector; could reduce custom ScadaBridge logic for historian-sourced flows.
- *Cons:* additional licensed product to buy, deploy, and operate; overlaps significantly with ScadaBridge's role and risks creating two IT/OT bridges; may not fit the existing ScadaBridge-centric governance model.
- *Verdict:* fallback/complement — evaluate if ScadaBridge's historian read path proves insufficient, or for sites where a packaged product is preferable to scripting.
5. **File-based bulk export → object storage → Snowpipe / external tables.**
- Periodic SQL export to Parquet/CSV on S3/ADLS/GCS, loaded via Snowpipe or queried as external tables / Iceberg.
- *Pros:* cheap for large batch windows; decouples Snowflake cost from ingestion timing.
- *Cons:* high latency (minutes to hours); still requires someone to do the SQL pull; not great for operational/near-real-time KPIs.
- *Verdict:* useful for **historical backfill** and cold/archive tier only.
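For scale, the low-volume pulls that option 1's verdict still permits look like the following hedged T-SQL fragment. The linked-server name (`INSQL`) and the Historian `Runtime` database/view names are the conventional Aveva defaults, assumed here rather than confirmed against the deployment:

```sql
-- Illustrative only: the kind of metadata/config pull option 1 remains viable for.
-- Verify the linked-server name and Runtime schema against the actual Historian.
SELECT *
FROM OPENQUERY(INSQL,
    'SELECT TagName, Description FROM Runtime.dbo.Tag');
```

The same pattern applied to the full-resolution history views is exactly the volume problem the verdicts above rule out.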
**Recommended direction.**
- **Primary path (updated):** all machine-data ingestion into Snowflake is owned by the **SnowBridge** (see its dedicated section above). That service reads from ScadaBridge/EventHub for event-driven flows and directly from Aveva Historian's SQL interface for historian-native flows, then writes to Snowflake. ScadaBridge remains the producer for event-driven machine data into Redpanda; the SnowBridge consumes from Redpanda rather than Snowflake pulling directly.
- **Aggregation boundary: aggregation lives in Snowflake.** Heavy transforms — summary statistics, time-window rollups, state derivations, cross-site joins, enrichment with MES/other data, business-level KPIs — are **all done in Snowflake** using Snowflake-native transform tooling (dbt, Dynamic Tables, Streams + Tasks — exact tool selection TBD).
- **Rationale:** keeps transform logic where the data analysts and platform owners already work; avoids scattering business logic across Historian Tier-2, ScadaBridge scripts, and Snowflake. One place to version, review, and lineage transforms. Accepts higher Snowflake compute/storage cost as the explicit trade-off.
- **Not used:** Aveva Historian **Tier-2 summary replication** is **not** used as the aggregation layer for the Snowflake path. (Tier-2 may still be used for its original historian purpose, but it's not part of the Snowflake ingestion pipeline.)
- **Not used:** ScadaBridge **scripting** is **not** the aggregation layer either — aggregation logic does not live in ScadaBridge scripts.
- **Volume control — two layers.** Since aggregation happens in Snowflake, both ScadaBridge and the SnowBridge share responsibility for preventing a full-fidelity firehose from reaching Snowflake:
- **At ScadaBridge (producing to EventHub):** **deadband / exception-based publishing** — only publish when a tag value changes by a configured threshold or a state changes, not on every OPC UA sample. **Rate limiting** per tag / per site where needed. This controls how much machine data reaches Redpanda in the first place.
- **At the SnowBridge (selecting what reaches Snowflake):** **topic and tag selection** — only topics and tags explicitly opted in within the service are forwarded to Snowflake. Not every Redpanda topic or every Historian tag automatically flows. Adding a tag or topic to Snowflake is a governed action in this service.
- Keep **raw full-resolution data in Aveva Historian** as the system of record — Snowflake stores the selected, deadband-filtered stream plus whatever aggregations dbt builds on top, not a mirror of Historian.
- **Drill-down:** for rare raw-data investigations, query the Historian SQL interface directly from analyst tooling rather than copying raw data into Snowflake.
- **Historical backfill:** one-off file-based exports (option 5) to seed Snowflake history when a new tag set comes online.
**Snowflake-side transform tooling: dbt.** All Snowflake transforms are built in **dbt**, versioned in git alongside the other integration artifacts (schemas, topic config, etc.), and run on a schedule (or via CI) — not as real-time streaming transforms.
- **Rationale:** dbt is the mature, portable standard for SQL transforms. Strong lineage, testing (`dbt test`), environment separation, and documentation generation. Fits the "everything load-bearing lives in git and is reviewed before it ships" discipline already established for schemas and topic definitions.
- **Explicit trade-off — no real-time transforms.** dbt is batch/micro-batch, not streaming. Transforms land in Snowflake tables on whatever cadence dbt runs (likely minutes-to-hours depending on the model), **not** sub-second. This is acceptable because operational/real-time KPIs continue to run on **Ignition SCADA**, not on Snowflake (see Target Operator / User Experience — Ignition owns KPI UX). Snowflake's job is analytics and enterprise rollups, which tolerate minute-plus latency.
- **Not used:** Dynamic Tables and Streams+Tasks are **not** adopted as part of the primary transform toolchain. If a specific future use case genuinely needs sub-minute latency from Snowflake itself (not Ignition), re-open this decision — don't quietly add a second tool.
- **Orchestration: dbt Core on a self-hosted scheduler.** dbt runs are driven by a self-hosted orchestrator (Airflow / Dagster / Prefect — specific tool TBD), **not** dbt Cloud and **not** CI-only. The orchestrator schedules `dbt build`/`dbt run`/`dbt test`, manages freshness SLAs, backfills, and coordinates dbt alongside the rest of the data pipeline (Snowflake Kafka Connector health checks, ad-hoc backfill jobs, downstream notifications).
  - **Rationale:** gives full control over scheduling, dependencies, and alerting; avoids recurring dbt Cloud SaaS spend; keeps dbt runs decoupled from CI so a long-running transform isn't consuming CI build minutes. Accepts the operational cost of running one more platform service.
- **Rationale:** gives full control over scheduling, dependencies, and alerting; avoids recurring dbt Cloud SaaS spend; keeps dbt runs decoupled from CI so a long-running transform isn't sitting inside a CI build minute. Accepts the operational cost of running one more platform service.
- **Not used:** dbt Cloud (avoided recurring SaaS cost and vendor coupling), pure CI-driven runs (too coupled to PR merge cadence), and Snowflake Tasks as the primary scheduler (too limited).
- **Out of scope for this plan:** specific orchestrator selection (Airflow vs Dagster vs Prefect), whether to stand up a new one or reuse one the enterprise data platform already runs, hosting, and credential management. This plan commits to the *pattern* (dbt Core run by a self-hosted orchestrator) but leaves the concrete orchestrator choice to the team that owns the Snowflake-side data platform.
- _TBD — dbt project layout (one project vs per-domain), model cadence targets, test coverage expectations, source freshness thresholds, CI/CD pipeline for dbt changes._
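Under these choices, a curated-layer model might look like the sketch below. The model, source, and column names are hypothetical; only the pattern (an incremental dbt model doing a time-window rollup inside Snowflake) is what the plan commits to:

```sql
-- models/marts/equipment_state_hourly.sql  (hypothetical names throughout)
-- Hourly state rollup over a SnowBridge landing table; incremental so each
-- scheduled run reprocesses only recent event-time windows.
{{ config(materialized='incremental', unique_key='state_hour_key') }}

select
    equipment_id,
    date_trunc('hour', transitioned_at)                         as state_hour,
    equipment_id || '|' || date_trunc('hour', transitioned_at)  as state_hour_key,
    count(*)                                                    as transitions,
    count_if(new_state = 'FAULTED')                             as fault_transitions
from {{ source('eventhub', 'equipment_state_transitions') }}
{% if is_incremental() %}
where transitioned_at >= (select dateadd('hour', -2, max(state_hour)) from {{ this }})
{% endif %}
group by 1, 2, 3
```

`dbt test` coverage and source-freshness thresholds would attach to the same model in `schema.yml`, per the TBDs above.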
**Deadband / filtering model: global default with explicit per-tag overrides.** ScadaBridge applies **one global deadband** to every tag opted in to the Snowflake stream, and specific tags can be **explicitly overridden** when the global default is too loose or too tight. No per-tag-class templating for deadband — the global default is the floor, overrides are the only fine-tuning mechanism.
- **Rationale:** simplest model to reason about and operate — one number to understand across the whole fleet, plus an explicit list of exceptions. Makes it immediately obvious which tags have bespoke tuning (any tag *not* on the override list uses the global default). Avoids per-class template proliferation as a governance surface.
- **Trade-off accepted:** a single global default will be wrong for many tags — too coarse for some (losing signal) and too fine for others (wasting throughput). This is acceptable because (1) overrides exist for the tags that matter, and (2) Snowflake compute cost of over-sampled tags is a known-and-bounded cost, not a silent failure.
- **Starting value (not a final commitment):** **approximately 1% of span** for analog tags, **change-only** for booleans/state tags, and **every-increment** for counters — as a sensible industry-conservative starting point. The exact percentage is deliberately not pinned in this plan: the mechanism (global default + per-tag overrides) is what's load-bearing, and the starting value will be retuned in Year 2 based on observed Snowflake cost and signal loss. The build team may adjust the starting point during implementation without reopening the plan.
- _TBD — override governance (how overrides are requested, reviewed, and tracked) and whether the override list lives in the central `schemas` repo alongside tag opt-in metadata or in ScadaBridge configuration directly._
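As a minimal sketch of the mechanism (not the real ScadaBridge configuration surface), the publish decision for analog tags reduces to one comparison. Tag names, span values, and the function shape are illustrative:

```python
# Producer-side deadband filter: one global default (% of span) with explicit
# per-tag overrides. All names and numbers here are hypothetical.

GLOBAL_DEADBAND_PCT = 1.0   # plan's starting point; retuned in Year 2
OVERRIDES_PCT = {"WW.Line04.CNC0412.SpindleLoad": 0.25}  # the explicit exception list

def should_publish(tag: str, last_sent: float, new_value: float, span: float) -> bool:
    """Publish only when the change since the last sent value exceeds the
    tag's deadband, expressed as a percentage of the tag's span."""
    pct = OVERRIDES_PCT.get(tag, GLOBAL_DEADBAND_PCT)
    return abs(new_value - last_sent) >= span * pct / 100.0

# Default tag: 1% of a 0-500 span = 5.0 units of change required
print(should_publish("WW.Line04.Oven01.Temp", 100.0, 104.0, span=500.0))  # False
print(should_publish("WW.Line04.Oven01.Temp", 100.0, 106.0, span=500.0))  # True
# Overridden tag: 0.25% of a 0-200 span = 0.5 units
print(should_publish("WW.Line04.CNC0412.SpindleLoad", 50.0, 50.6, span=200.0))  # True
```

Booleans/state tags bypass this check (change-only), and counters publish every increment, per the starting values above.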
**Open questions (TBD).**
- Tag and topic opt-in governance: who approves adding a tag/topic to Snowflake? The **list now lives inside the SnowBridge**, not the central `schemas` repo — approval workflow, audit trail, and change control for that service's selection state are TBD.
- ~~Latency SLOs per data class.~~ **Resolved — captured below.**
**Latency SLOs per data class.** End-to-end latency is measured from the moment ScadaBridge (or the Historian source) emits a value to the moment it is queryable in its target consumer.
- **Operational / real-time KPI UX — out of scope for the Snowflake path.** Real-time KPI runs on **Ignition SCADA** per the UX split (see Target Operator / User Experience). Snowflake has no sub-minute SLO obligation because no operational UI depends on it.
- **Analytics feeds (Snowflake): ≤ 15 minutes end-to-end.** Covers ScadaBridge emit → Redpanda → SnowBridge → Snowflake landing table → dbt model refresh → queryable in the curated layer. Tight enough to feel alive for analysts and dashboards, loose enough to be reachable with dbt on a self-hosted scheduler and no streaming transforms.
- **Compliance / validated data feeds (Snowflake): ≤ 60 minutes end-to-end.** Snowflake is an investigation/reporting tier for validated data; the system of record remains **Aveva Historian**. A 60-minute SLO is sufficient because no compliance control depends on Snowflake freshness — if an investigation needs sub-hour data, it queries Historian directly.
- **Ad-hoc raw drill-down — no SLO.** Analysts query the Historian SQL interface directly for rare raw-resolution investigations; this path is not budgeted against any latency target.
- _TBD — which layer is responsible for each segment of the budget (e.g., how much of the 15 minutes is Redpanda vs SnowBridge vs dbt), and how the SLOs are monitored and alerted on in practice._
- Cost model: EventHub throughput + ingestion (Snowpipe Streaming or whatever the SnowBridge uses) + Snowflake **compute for transforms** + Snowflake storage for the target tag volume. Compute cost matters more under this choice than it would have under a source-aggregated model — worth pricing early.
- Whether Aveva Data Hub (option 4) should still be piloted as a **reference point** for the custom build — useful for comparison on specific capabilities (Historian connector depth, store-and-forward behavior) even though it is not the target implementation.
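Whatever monitoring eventually implements the SLO TBD above, the per-class check itself is simple: compare each record's source emit timestamp with the time it became queryable in the curated layer. A sketch, with the thresholds taken from the plan and everything else illustrative:

```python
# Hedged sketch of a per-class end-to-end latency check. Class names and
# thresholds come from the plan; the timestamps would come from pipeline
# instrumentation that does not exist yet.
from datetime import datetime, timedelta

SLOS = {
    "analytics": timedelta(minutes=15),    # ScadaBridge emit -> queryable in curated layer
    "compliance": timedelta(minutes=60),   # Historian remains the system of record
}

def slo_breached(data_class: str, emitted_at: datetime, queryable_at: datetime) -> bool:
    """True when end-to-end latency exceeds the class's SLO budget."""
    return (queryable_at - emitted_at) > SLOS[data_class]

emit = datetime(2029, 1, 15, 10, 0)
print(slo_breached("analytics", emit, emit + timedelta(minutes=12)))   # False
print(slo_breached("analytics", emit, emit + timedelta(minutes=20)))   # True
print(slo_breached("compliance", emit, emit + timedelta(minutes=45)))  # False
```

Operational KPI paths and ad-hoc Historian drill-downs deliberately have no entry in the table, matching the SLO decisions above.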
### OtOpcUa — the unified site-level OPC UA layer (absorbs LmxOpcUa)
**OtOpcUa** is a per-site **clustered OPC UA server** that is the **single sanctioned OPC UA access point for all OT data at each site**. It owns the one connection to each piece of equipment and exposes a unified OPC UA surface — containing **two logical namespaces** — to every downstream consumer (System Platform, Ignition, ScadaBridge, future consumers). In a 2-node cluster, **each node has its own endpoint** (unique `ApplicationUri` per OPC UA spec); both nodes expose identical address spaces. Consumers see both endpoints in `ServerUriArray` and select by `ServiceLevel` — this is **non-transparent redundancy** (the v1 LmxOpcUa pattern, inherited by v2). Transparent single-endpoint redundancy would require a VIP / load balancer per cluster and is not planned.
**The two namespaces served by OtOpcUa:**
1. **Equipment namespace (raw data).** Live values read from equipment via native OPC UA or native device protocols (Modbus, EtherNet/IP, Siemens S7, etc.) translated to OPC UA. This is the new capability the plan introduces — what the "layer 2 — raw data" role in the layered architecture describes.
2. **System Platform namespace (processed data tap).** The former **LmxOpcUa** functionality, folded in. Exposes Aveva System Platform objects (via the local App Server's LMX API) as OPC UA so that OPC UA-native consumers can read processed data through the same endpoint they use for raw equipment data.
**Namespace model is extensible — future "simulated" namespace supported architecturally, not committed for build.** The two-namespace design is not a hard cap. A future **`simulated` namespace** could expose synthetic or replayed equipment data to consumers, letting tier-1 / tier-2 consumers (ScadaBridge, Ignition, System Platform IO) be exercised against real-shaped-but-offline data streams without physical equipment. This is the **OtOpcUa-side foundation for Digital Twin Use Case 2** (Virtual Testing / Simulation — see **Strategic Considerations → Digital twin**). The plan **does not commit to building** a simulated namespace in the 3-year scope; it commits that the namespace architecture can accommodate one when a specific testing need justifies it, without reshaping OtOpcUa. The complementary foundation (historical event replay) lives in the Redpanda layer — see **Async Event Backbone → Usage patterns → Historical replay**.
**LmxOpcUa is absorbed into OtOpcUa, not replaced by a separate component.** The existing LmxOpcUa software and deployment pattern (per-node service on every System Platform node) evolves into OtOpcUa. Consumers that previously pointed at LmxOpcUa for System Platform data and at "nothing yet" for equipment data now point at OtOpcUa and see both in its namespace. There is not a second OPC UA server running alongside.
**Responsibilities.**
- **Single connection per equipment.** OtOpcUa is the **only** OPC UA client that talks to equipment directly. Equipment holds one session — to OtOpcUa — regardless of how many downstream consumers need its data. This eliminates the multiple-direct-sessions problem documented in `current-state.md` → Equipment OPC UA.
- **Site-local aggregation.** Downstream consumers (System Platform IO, Ignition, ScadaBridge, and any future consumers such as a prospective digital twin layer) connect to OtOpcUa rather than to equipment directly. A consumer reading the same tag gets the same value regardless of who else is subscribed.
- **Unified OPC UA endpoint for OT data (per node).** Each cluster node exposes both raw equipment data and processed System Platform data through a **single OPC UA endpoint with two namespaces**, instead of consumers connecting to two separate OPC UA servers. In a 2-node cluster, consumers connect to one of the two node endpoints (selected by `ServiceLevel`); each node serves identical namespaces.
- **Access control / authorization chokepoint.** Authentication, authorization, rate limiting, and audit of OT OPC UA reads/writes are enforced at OtOpcUa, not at each consumer. This is the site-level analogue of the "single sanctioned crossing point" theme the plan applies between IT and OT.
- **Clustered for HA.** Runs as a cluster (multi-node), not a single server, so a node loss does not drop equipment or System Platform visibility.
**Rationale.**
- **Stable, single point of integration — applied at the equipment boundary.** The overarching plan theme is a stable, single point of integration between IT and OT. This component extends the same discipline down one layer: a stable, single point of integration between **consumers and equipment** at each site. Every other decision in this plan presumes equipment data is sharable — this is what makes that true in practice.
- **Protects equipment.** Many devices have hard caps on concurrent OPC UA sessions; some degrade under session load. Collapsing to one session per equipment removes that risk entirely.
- **Data consistency.** Downstream consumers reading the same tag see the same value (sampling cadence, deadband, and buffer are controlled in one place).
- **Auditability.** Equipment access becomes observable and governable from one place — which also gives OT security a concrete control surface.
- **Preserves data locality.** The cluster is per-site, same pattern as ScadaBridge and Aveva System Platform site clusters. Equipment never has to be reachable from outside its site to serve a downstream consumer.
**Implications for other decisions in this plan.**
- **ScadaBridge** stops connecting to equipment OPC UA directly. In the target state, ScadaBridge reads equipment data from OtOpcUa's equipment namespace and System Platform data from OtOpcUa's System Platform namespace — all from the same OPC UA endpoint. ScadaBridge's **data locality** principle is preserved and strengthened — the local ScadaBridge talks to the local OtOpcUa, which talks to local equipment and the local System Platform node.
- **Ignition** stops connecting to equipment OPC UA directly. Today Ignition is central in South Bend and holds direct OPC UA sessions to equipment across the WAN; in the target state, Ignition consumes from each site's OtOpcUa instead. If Ignition remains centrally hosted, the WAN dependency is still present but collapses from *N sessions per equipment* to *one session per site*. (A future goal-state decision on Ignition deployment topology — per-site vs central — is independent of this change.)
- **Aveva System Platform IO** consumes equipment data from OtOpcUa's equipment namespace rather than holding its own direct equipment sessions. This is a meaningful shift in System Platform's IO layer and needs validation against Aveva's supported patterns — System Platform is the most opinionated consumer about how its IO is sourced. (System Platform is still the owner of the objects in OtOpcUa's System Platform namespace — that namespace is a view onto System Platform, not a replacement for it.)
- **LmxOpcUa evolves into OtOpcUa**, rather than running alongside it. The existing LmxOpcUa deployment (per-node service on every System Platform node, exposing System Platform objects) grows to also expose the equipment namespace, picks up clustering, and is renamed OtOpcUa. Consumers that used LmxOpcUa-style OPC UA access to System Platform continue to get that access through OtOpcUa; the previous LmxOpcUa operational pattern (credentials, security modes, namespace shapes for System Platform) carries forward.
- **The IT↔OT boundary is unchanged.** OtOpcUa lives entirely on the **OT side** — it's OT-data-facing, site-local, and fronts OT consumers. It does not change where the IT↔OT crossing sits (still ScadaBridge central ↔ enterprise integrations).
**Build-vs-buy: custom build, in-house.** The site-level OPC UA server cluster is built in-house rather than adopting Kepware, Matrikon, Aveva Communication Drivers, or any other off-the-shelf OPC UA aggregator.
- **Rationale:** matches the existing in-house .NET pattern for ScadaBridge and SnowBridge (and continues the in-house .NET pattern LmxOpcUa already established — which OtOpcUa is the evolution of); full control over clustering semantics, access model, and integration with ScadaBridge's operational surface; no per-site commercial license; no vendor roadmap risk for a component this central to the OT plan.
- **Primary cost driver acknowledged upfront: equipment driver coverage.** Unlike ScadaBridge (which reads OPC UA and doesn't have to speak native device protocols) or the SnowBridge (which reads from a small set of well-defined sources), this component has to **speak the actual protocols of every piece of equipment it fronts**. Commercial aggregators like Kepware justify their license cost largely through their driver library, and picking custom build means that library has to be built or sourced internally over the lifetime of the plan. This is the real cost of option (a), and it is accepted as the trade-off for control.
- **Mitigation:** equipment that already speaks native OPC UA does not require a *new protocol-specific driver project* — the `OpcUaClient` gateway driver in the core library handles all of them. However, per-equipment configuration (endpoint URL, browse strategy, namespace remapping to UNS, certificate trust, security policy) is still required. **Onboarding effort per OPC UA-native equipment is ~hours of config, not zero** — the "no driver build" framing from earlier versions of this plan understated the work.
- **Driver strategy: hybrid — proactive core library plus on-demand long-tail.** A **core driver library** covering the top equipment protocols for the estate is built **proactively** (Year 1 into Year 2), so that most site onboardings can draw from existing drivers rather than blocking on driver work. Protocols beyond the core library — long-tail equipment specific to one site or one equipment class — are built **on-demand** as each site onboards.
- **Why hybrid:** purely lazy (on-demand only) makes site onboarding unpredictable and bumpy; purely proactive risks building drivers for protocols nobody actually uses. The hybrid matches the reality of a mixed equipment estate over a 3-year horizon.
- **Core library scope** is confirmed by the v2 implementation team's internal knowledge of the estate. A formal protocol survey is no longer required for driver scoping.
- **Committed core driver list (from v2 implementation design):** (1) **OPC UA Client** — gateway driver for OPC UA-native equipment; (2) **Modbus TCP** (also covers DL205 via octal address translation); (3) **AB CIP** (ControlLogix / CompactLogix); (4) **AB Legacy** (SLC 500 / MicroLogix, PCCC — separate driver from CIP due to different protocol stack); (5) **Siemens S7** (S7-300/400/1200/1500); (6) **TwinCAT** (Beckhoff ADS, native subscriptions — known Beckhoff installations at specific sites); (7) **FOCAS** (FANUC CNC). Plus the existing **Galaxy** driver (System Platform namespace, carried forward from LmxOpcUa v1). Total: 8 drivers.
- **Implementation approach:** embedded open-source protocol stacks (NModbus for Modbus, Sharp7 for Siemens S7, libplctag for Ethernet/IP and AB Legacy) wrapped in OtOpcUa's driver framework. Reduces driver work to "write the OPC UA ↔ protocol adapter" rather than "implement the protocol."
- _TBD — how long-tail driver requests (protocols beyond the committed 8) are prioritized vs site-onboarding deadlines._
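As an illustration of the "octal address translation" note in the committed driver list (DL205 under Modbus TCP) — a minimal sketch, with the register offset treated as an assumption to be confirmed against the device documentation rather than a stated fact of this plan:

```python
def v_memory_to_modbus(v_addr: str, base_offset: int = 1) -> int:
    """Translate a DirectLogic V-memory address (documented in octal)
    to a decimal Modbus holding-register number.

    `base_offset` is an assumption for illustration -- the real
    per-family offset into the 4xxxx space comes from the device docs.
    """
    if not v_addr.upper().startswith("V"):
        raise ValueError(f"expected V-memory address, got {v_addr!r}")
    return int(v_addr[1:], 8) + base_offset  # octal -> decimal, plus offset

# V2000 octal == 1024 decimal -> holding register 1025 with a +1 offset.
assert v_memory_to_modbus("V2000") == 1025
```

The point of the sketch: DL205 support is an address-translation shim inside the Modbus TCP driver, not a separate protocol driver.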
- **Driver stability tiers (v2 implementation design decision — process isolation model).** Not all drivers have equal stability profiles. The v2 design introduces a three-tier model that determines whether a driver runs in-process or out-of-process:
- **Tier A (pure managed .NET):** Modbus TCP, OPC UA Client. Run **in-process** in the OtOpcUa server. Low risk — no native code, no COM interop.
- **Tier B (wrapped native, mature libraries):** S7, AB CIP, AB Legacy, TwinCAT. Run **in-process** with additional guards — SafeHandle wrappers, memory watchdog, bounded queues. The native libraries are mature and well-tested, so process isolation is not required, but guardrails contain resource leaks.
- **Tier C (heavy native / COM / black-box vendor DLL):** Galaxy, FOCAS. Run **out-of-process** as separate Windows services with named-pipe IPC. An `AccessViolationException` from native code (e.g., FANUC's `Fwlib64.dll`) is uncatchable in modern .NET and would tear down the entire OtOpcUa server — process isolation contains the blast radius. Galaxy additionally requires out-of-process because MXAccess COM is .NET Framework 4.8 x86 (bitness constraint forces a separate process).
- **Operational footprint impact:** a site with Tier C drivers runs **1 to 3 Windows services per cluster node** (OtOpcUa main + Galaxy host + FOCAS host). With 2-node clusters, that's up to 6 services per cluster. Deployment guides and operational runbooks must cover the multi-service install/upgrade/recycle workflow.
- **Reusable pattern:** Tier C drivers follow a generalized `Proxy/Host/Shared` three-project layout reusable for any future driver that needs process isolation.
- **Per-device resilience: Polly v8+ (`Microsoft.Extensions.Resilience`).** Every driver instance and every device within a driver uses composable **retry, circuit-breaker, and timeout pipelines** via Polly v8+. Circuit-breaker state is surfaced in the status dashboard. **Write safety: default-no-retry on writes.** Timeouts on equipment writes can fire after the device has already accepted the command; blind retry of non-idempotent operations (pulses, alarm acks, recipe steps) would cause duplicate field actions. Per-tag `WriteIdempotent` flag with explicit opt-in required for write retry. This is a substantive safety decision that affects operator training and tag onboarding runbooks.
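A minimal sketch of the default-no-retry write rule, assuming a per-tag policy object carrying the `WriteIdempotent` flag (names illustrative; the real pipelines are Polly v8 in .NET, not this Python toy):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagWritePolicy:
    """Per-tag write policy (illustrative shape; real config lives in OtOpcUa)."""
    tag: str
    write_idempotent: bool = False  # explicit opt-in required for retry

def max_write_attempts(policy: TagWritePolicy, retry_limit: int = 3) -> int:
    """Default-no-retry on writes: a timed-out write may already have been
    accepted by the device, so only tags explicitly marked idempotent
    (e.g. setpoints) may be retried. Non-idempotent tags (pulses, alarm
    acks, recipe steps) get exactly one attempt."""
    return retry_limit if policy.write_idempotent else 1

assert max_write_attempts(TagWritePolicy("Line1.Setpoint", write_idempotent=True)) == 3
assert max_write_attempts(TagWritePolicy("Line1.StartPulse")) == 1
```

Reads, by contrast, retry freely — re-reading a value is always safe, which is why the asymmetry lives at the write path only.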
- **Not used:** Kepware, Matrikon, Aveva Communication Drivers, HiveMQ Edge, and other off-the-shelf options. Reference products may still be useful for comparison on specific capabilities (clustering patterns, security features, driver implementations) even though they are not the target implementation.
**Deployment footprint per site: co-located on existing Aveva System Platform nodes.** Same pattern as ScadaBridge today — the site-level OPC UA server cluster runs on the **same physical/virtual nodes** that host Aveva System Platform and ScadaBridge, not on dedicated hardware.
- **Rationale:** zero new hardware footprint, consistent operational model across the in-house .NET OT components (ScadaBridge and OtOpcUa — with OtOpcUa running on the same nodes LmxOpcUa already runs on today), and the driver workload at a typical site is modest compared to ScadaBridge's 225k/sec OPC UA ingestion ceiling that these nodes already handle. Co-location keeps site infrastructure simple as smaller sites onboard.
- **Cluster size:** same as ScadaBridge — **2-node clusters at most sites**, with the understanding that the **largest sites** (Warsaw West, Warsaw North) run **one cluster per production building** matching ScadaBridge's and System Platform's existing per-building cluster pattern. No special hardware or quorum model is required beyond what ScadaBridge already uses.
- **Per-building cluster implication for consumers needing site-wide visibility.** At Warsaw campuses, a consumer (e.g., a ScadaBridge instance or a reporting tool) that needs to see equipment across multiple buildings must connect to **N clusters** (one per building) and stitch the data. OtOpcUa's "site-local aggregation" responsibility is satisfied per-cluster, not site-wide. Two mitigations exist: (a) configure consumer-side templates to enumerate per-building clusters — current expected pattern, adds consumer-side complexity; (b) deploy a **site-aggregator OtOpcUa instance** that consumes from per-building clusters via the OPC UA Client gateway driver and re-exposes a unified site namespace — doable with existing toolset (the OpcUaClient driver was designed for gateway-of-gateways), but adds operational complexity. _TBD — whether the per-building cluster decision is a constraint to optimize for or whether a single Warsaw-West cluster is feasible if hardware allows; whether ZTag fleet-wide uniqueness extends across per-building clusters at the same site (assumed yes, confirm with ERP team)._
- **Trade-off accepted:** System Platform, ScadaBridge, and OtOpcUa all share the same nodes' CPU, memory, and network. Resource contention is a risk — mitigated by (1) the modest driver workload relative to ScadaBridge's proven ingestion ceiling, (2) monitoring via the observability minimum signal set, and (3) the option to move off-node if contention is observed during rollout. Note: the OtOpcUa workload largely replaces what LmxOpcUa already runs on these nodes, so the *incremental* resource draw is just the new equipment-driver and clustering work, not a full new service. _TBD — measured impact of adding this workload to already-shared nodes; headroom numbers at the largest sites; whether any specific site needs to escalate to dedicated hardware._
**Authorization model: OPC UA-native — user tokens for authentication + namespace-level ACLs for authorization.** Every downstream consumer (System Platform IO, Ignition, ScadaBridge, future consumers) authenticates to the cluster using **standard OPC UA user tokens** (UserName tokens and/or X.509 client certs, per site/consumer policy), and authorization is enforced via **namespace-level ACLs** inside the cluster — each authenticated identity is scoped to the equipment/namespaces it is permitted to read/write.
- **Rationale:** OPC UA is the protocol we're fronting, so the auth model stays in OPC UA's own terms. No SASL/OAUTHBEARER bridging, no custom token-exchange glue — OtOpcUa is self-contained and operable with standard OPC UA client tooling. **Inherits the LmxOpcUa auth pattern** — UserName tokens with standard OPC UA security modes/profiles — so the consumer-side experience does not change for clients that used LmxOpcUa previously, and the fold-in is an evolution rather than a rewrite.
- **Explicitly not federated with the enterprise IdP.** Unlike Redpanda (which uses SASL/OAUTHBEARER against the enterprise IdP) and SnowBridge (which uses the same IdP for RBAC), OtOpcUa does **not** pull enterprise IdP identity into the OT data access path. OT data access is a pure OT concern, and the plan's IT/OT boundary stays at ScadaBridge central — not here.
- **Trade-off accepted:** identity lifecycle (user token/cert provisioning, rotation, revocation) is managed locally in the OT estate rather than inherited from the enterprise IdP. Two identity stores to operate (enterprise IdP for IT-facing components, OPC UA-native identities for OtOpcUa) is the cost of keeping the OPC UA layer clean and self-contained.
- **Data-path ACL model (designed and committed — lmxopcua decisions #129–132).** The v2 implementation design has committed the full ACL model in `lmxopcua/docs/v2/acl-design.md`. Key design points:
- **`NodePermissions` bitmask:** Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall, plus common bundles (`ReadOnly` / `Operator` / `Engineer` / `Admin`).
- **6-level scope hierarchy** with default-deny + additive grants: Cluster → Namespace → UnsArea → UnsLine → Equipment → Tag. Grant at UnsArea cascades to all children unless overridden. Browse-implication on ancestors (granting Read on a child implies Browse on its parents).
- **`NodeAcl` table is generation-versioned** (decision #130) — ACL changes go through draft → diff → publish → rollback like every other content table.
- **Cluster-create seeds default ACLs** matching the v1 LmxOpcUa LDAP-role-to-permission map (decision #131), preserving behavioral parity for v1 → v2 consumer migration.
- **Per-session permission-trie evaluator** with O(depth × group-count) cost; cache invalidated on generation-apply or LDAP group cache expiry.
- **Admin UI:** ACL tab + bulk grant + permission simulator.
- **Phasing:** Phase 1 ships the schema + Admin UI + evaluator unit tests; per-driver enforcement lands in each driver's phase (Phase 2+). **Phase 1 completes before any driver phase**, so the ACL model exists in the central config DB before any driver consumes it — satisfying the "must be working before Tier 1 cutover" timing constraint.
- _TBD — specific OPC UA security mode + profile combinations required vs allowed; where UserName credentials/certs are sourced from (local site directory, a per-site credential vault, AD/LDAP); rotation cadence; audit trail of authz decisions._
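A toy sketch of the default-deny + additive-cascade evaluation described above, assuming a small subset of the permission bitmask and a flat dict of grants standing in for the real per-session permission trie (browse-implication on ancestors and grant overrides are omitted for brevity):

```python
from enum import Flag

class NodePermissions(Flag):
    """Small illustrative subset of the v2 NodePermissions bitmask."""
    NONE = 0
    BROWSE = 1
    READ = 2
    SUBSCRIBE = 4
    WRITE_OPERATE = 8
    READ_ONLY = BROWSE | READ | SUBSCRIBE     # common bundle
    OPERATOR = READ_ONLY | WRITE_OPERATE      # common bundle

def effective_permissions(grants, node_path):
    """Default-deny with additive grants cascading down the scope
    hierarchy (Cluster -> Namespace -> UnsArea -> UnsLine -> Equipment
    -> Tag): a grant at any ancestor applies to all descendants."""
    perms = NodePermissions.NONE
    for depth in range(len(node_path) + 1):          # () = cluster scope
        perms |= grants.get(node_path[:depth], NodePermissions.NONE)
    return perms

# Grant Operator at the UnsArea scope; it cascades to every child tag.
grants = {("equipment", "packaging"): NodePermissions.OPERATOR}
tag = ("equipment", "packaging", "line1", "filler", "speed")
assert bool(effective_permissions(grants, tag) & NodePermissions.READ)
assert bool(effective_permissions(grants, tag) & NodePermissions.WRITE_OPERATE)
assert effective_permissions(grants, ("equipment", "qc")) == NodePermissions.NONE  # default-deny
```

The real evaluator walks a trie instead of slicing tuples, which is where the O(depth × group-count) cost in the design comes from.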
**Open questions (TBD).**
- **Driver coverage.** Which equipment protocols need to be bridged to OPC UA beyond native OPC UA equipment — this is where product-driven decisions matter most.
- **Rollout posture: build and deploy the cluster software to every site ASAP.** The cluster software (server + core driver library) is built and rolled out to **every site's System Platform nodes as fast as practical** — deployment to all sites is treated as a prerequisite for the rest of the OT plan, not a gradual per-site effort. "Deployment" here means installing and configuring the cluster software at each site so the node is ready to front equipment; it does **not** mean immediately migrating consumers (that follows the tiered cutover below). A deployed but inactive cluster is cheap; what's expensive is delaying deployment and then trying to do it site-by-site on the critical path of every other workstream.
- **Migration path: tiered consumer-by-consumer cutover, sequenced by risk.** Existing direct equipment connections are moved to the cluster **one consumer at a time**, in risk order — not equipment-by-equipment or site-by-site.
1. **ScadaBridge first.** We own both the server and the client, so redirecting ScadaBridge to consume from the new cluster is the lowest-risk cutover. It also validates the cluster under real ingestion load before Ignition or System Platform are affected.
2. **Ignition second.** Moving Ignition off direct equipment OPC UA to the site-level cluster collapses its WAN session count per equipment from *N* to *one per site cluster*. Medium risk — Ignition is the KPI SCADA and a cutover mistake is user-visible, but Ignition has no validated-data obligations.
3. **Aveva System Platform IO last.** System Platform IO is the hardest cutover because its IO path feeds validated data collection. Moving it through the new cluster needs validation/re-qualification with compliance stakeholders, and System Platform is the most opinionated consumer about how its IO is sourced. Doing it last lets us accumulate operational confidence on the cluster from the ScadaBridge and Ignition cutovers first.
- **Certificate-distribution pre-cutover step (B3 from v2 corrections).** Before any consumer is cut over at a site, that consumer's OPC UA certificate trust store must be populated with the target OtOpcUa cluster's **per-node certificates and ApplicationUris** (2 per cluster; at Warsaw campuses with per-building clusters, multiply by building count if the consumer needs cross-building visibility). Consumers without pre-loaded trust will fail to connect. **Once a consumer has trusted a node's `ApplicationUri`, changing that `ApplicationUri` requires the consumer to re-establish trust** — this is an OPC UA spec constraint, not an implementation choice. OtOpcUa's Admin UI auto-suggests `urn:{Host}:OtOpcUa` on node creation but warns if `Host` changes later.
- **Acceptable double-connection windows.** During each consumer's cutover, a short window of **both old direct connection and new cluster connection** existing at the same time for the same equipment is **tolerated** — it temporarily aggravates the session-load problem the cluster is meant to solve, but keeping the window short (minutes to hours, not days) bounds the exposure. Longer parallel windows are only acceptable for the System Platform cutover where compliance validation may require extended dual-run.
- **Rollback posture.** Each consumer's cutover is reversible — if the cluster misbehaves during or immediately after a cutover, the consumer falls back to direct equipment OPC UA, and the cutover is retried after the issue is understood. The old direct-connection capability is **not removed** from consumers until all three cutover tiers are complete and stable at a site.
- **Consumer cutover plan — owned by a separate integration / operations team (not OtOpcUa).** Per lmxopcua decision #136, consumer cutover is **out of OtOpcUa v2 scope**. The OtOpcUa team's responsibility ends at Phase 5 — all drivers built, all stability protections in place, full Admin UI shipped including the data-path ACL editor. Cutover sequencing per site, validation methodology (proving consumers see equivalent data through OtOpcUa), rollback procedures, coordination with Aveva for System Platform IO cutover (tier 3), and operational runbooks are deliverables of a separate **integration / operations team that has yet to be named**. The handoff's tier 1/2/3 sequencing (above) remains the authoritative high-level roadmap; the implementation-level cutover plan lives outside OtOpcUa's docs. _TBD — name the integration/operations team and link their cutover plan doc._
- _TBD — per-site cutover sequencing across the three tiers (all sites reach tier 1 before any reaches tier 2, or one site completes all three tiers before the next site starts), and per-equipment-class criteria for when a System Platform IO cutover requires compliance re-validation; cutover plan owner assignment._
- **Validated-data implication (E2 — Aveva pattern validation needed Year 1 or early Year 2).** System Platform's validated data collection currently uses its own IO path; moving that through OtOpcUa may require validation/re-qualification depending on the regulated context. **Year 1 or early Year 2 research deliverable:** validate with Aveva that System Platform IO drivers support upstream OPC UA-server data sources (OtOpcUa), including any restrictions on security mode, namespace structure, or session model. If Aveva's pattern requires something OtOpcUa doesn't expose, that's a long-lead-time discovery that must surface well before Year 3's Tier 3 cutover.
- **Relationship to ScadaBridge's 225k/sec ingestion ceiling** (per `current-state.md`): the cluster's aggregate throughput must be able to feed ScadaBridge at its capacity without becoming a bottleneck — sizing needs to reflect this.
### Async Event Backbone
- **EventHub (Kafka-compatible)** — introduce an EventHub as the **async event integration layer** between shopfloor systems and enterprise consumers. Provides a Kafka-compatible API so producers/consumers can use standard Kafka tooling.
- **Purpose:** decouple producers from consumers, enable fan-out, replay, and event-driven integrations (Camstar, Snowflake, and future systems).
- **Hosting model:** **self-hosted Redpanda** (Kafka-API-compatible). Chosen over Apache Kafka and over managed cloud offerings (Azure Event Hubs, Confluent Cloud, AWS MSK) for: maximum control, lower per-message cost at scale, no cloud-provider coupling, and — vs Apache Kafka specifically — a single-binary operational model (no ZooKeeper/KRaft quorum to manage separately), bundled **Schema Registry** and HTTP Proxy, and lower node count for equivalent throughput. Trade-off accepted: higher operational burden than managed offerings (cluster ops, upgrades, capacity planning, DR) is owned internally, and we commit to the Redpanda ecosystem for broker-adjacent features.
- **Deployment footprint: central cluster only (South Bend).** A single self-hosted EventHub cluster runs in the South Bend Data Center. No per-site or per-region clusters.
- **Rationale:** keeps operational burden manageable (one cluster to run, upgrade, secure, and back up) and gives all enterprise consumers (Snowflake, Camstar, global dashboards) a single integration point. Avoids the cost and complexity of federation/mirroring across N sites.
- **Write-path resilience via ScadaBridge store-and-forward.** Because the cluster is central, every site→EventHub write traverses the WAN. To preserve the ScadaBridge **data locality** and **WAN-outage resilience** principles, **all ScadaBridge writes to EventHub are configured as store-and-forward**. During a WAN outage, ScadaBridge queues events locally at the site and replays them to the central EventHub when connectivity returns — no data loss, no site operational impact. This leverages ScadaBridge's existing per-call store-and-forward capability (see `current-state.md`).
- **Consequence:** site-local producers never block on the central cluster being reachable. Producers remain local; durability during outages is handled at the ScadaBridge layer, not the Kafka layer.
- **DR posture: single-cluster HA only; disaster recovery is handled at the VM layer, outside this plan.** The Redpanda cluster is deployed as a **multi-node HA cluster inside South Bend** (rack/zone awareness, replication factor sufficient for node-level failures) and **does not** run a second Redpanda cluster, Cluster Linking, or MirrorMaker. Cross-DC recovery of South Bend as a whole is covered by the **existing enterprise VM-level DR** solution, which is **out of scope for this plan** — EventHub is just another VM workload as far as that DR story is concerned.
- **Rationale:** avoids operating a standby Redpanda cluster (significant ongoing cost for a failure mode the enterprise already has a solution for), and inherits the DR posture already decided for every other central workload rather than inventing a bespoke one.
- **Medium-outage resilience comes from ScadaBridge, not from a second cluster.** ScadaBridge's per-call store-and-forward (see `current-state.md`) keeps site writes durable through any outage shorter than its local queue capacity — that's the mechanism for ride-through, not a standby Redpanda.
- _TBD — long-outage planning: at what outage duration does ScadaBridge local queue capacity become the binding constraint, and what's the operational response then (grow local disk, prioritize topics, drop operational-tier events)? Should be modeled once the Redpanda write-volume numbers are firm._
- _TBD — read-path implications: any site-local consumers (e.g., site KPI processors, alarm pipelines) that need to react to events would need to traverse WAN to the central cluster. Confirm whether all planned consumers tolerate this, or whether specific high-criticality local consumers need a different pattern (e.g., consume directly from ScadaBridge rather than EventHub)._
- _TBD — sizing the central cluster for the aggregate write volume of every site plus full-site backlog replay after a WAN outage._
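The store-and-forward write path above can be sketched as follows — an in-memory toy standing in for ScadaBridge's disk-backed per-call queue (the real implementation persists locally and replays to the central EventHub; nothing here is its actual API):

```python
import collections

class StoreAndForward:
    """Minimal sketch: events queue locally during a WAN outage and
    replay in order once the central cluster is reachable again."""

    def __init__(self, publish):
        self.publish = publish               # callable -> True on confirmed delivery
        self.backlog = collections.deque()

    def send(self, event):
        self.backlog.append(event)           # durable-first: enqueue before trying
        self.flush()

    def flush(self):
        while self.backlog and self.publish(self.backlog[0]):
            self.backlog.popleft()           # drop only after confirmed delivery

# Simulate an outage: publishes fail while the WAN is down, then recover.
wan_up = {"ok": False}
delivered = []
def publish(event):
    if wan_up["ok"]:
        delivered.append(event)
        return True
    return False

saf = StoreAndForward(publish)
saf.send("e1"); saf.send("e2")               # queued locally, WAN down
wan_up["ok"] = True
saf.flush()                                  # replay in order on reconnect
assert delivered == ["e1", "e2"]
```

The ordering guarantee (replay in original order, drop only on confirmed delivery) is the property that lets the central-cluster-only topology coexist with the data-locality principle.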
- **Topic design — site identity lives in the message, not the topic name.** Topics are named by **domain and event type only**, and every event carries a **`site` field in its headers/payload**. Consumers that want a single site's data filter in code (or via consumer-side stream processing).
- **Rationale:** keeps topic count bounded as new sites onboard — adding Berlin, Winterthur, Jacksonville, etc. does not create a new topic set per site. Consumers that span all sites (Snowflake ingestion, global dashboards, enterprise Camstar integration) subscribe once and see the whole fleet; consumers that need only one site's data filter on the header.
- **Trade-off:** per-site ACLs and quotas are harder — Kafka ACLs are topic-scoped, so site-level authorization has to be enforced by producers (ScadaBridge) or by a stream-processing layer, not by the broker.
- _TBD — whether the `site` identifier goes in Kafka **record headers** (cheaper to filter, not part of payload) or in the **payload schema** (forces schema discipline, survives header-unaware consumers). Likely both: header for routing/filtering, payload for durability._
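A minimal sketch of consumer-side site filtering, assuming the `site` identifier lands in record headers (that placement is still TBD above; the header name and record shape here are illustrative, not a committed contract):

```python
def site_of(record_headers):
    """Read the site identifier from Kafka-style headers
    (a list of (key, bytes-value) pairs, per Kafka client convention)."""
    for key, value in record_headers:
        if key == "site":
            return value.decode("utf-8")
    return None

def for_site(records, site):
    """Consumer-side filter: one fleet-wide topic, site selected in code
    rather than by subscribing to a per-site topic."""
    return [r for r in records if site_of(r["headers"]) == site]

records = [
    {"headers": [("site", b"warsaw-west")], "value": "..."},
    {"headers": [("site", b"berlin")], "value": "..."},
]
assert len(for_site(records, "warsaw-west")) == 1
```

Header-based filtering avoids deserializing the payload for records the consumer will discard — one argument for the "header for routing, payload for durability" split floated in the TBD.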
- **Topic naming convention: `{domain}.{entity}.{event-type}`.** Hierarchical, lowercase, dot-separated. Examples:
- `mes.workorder.started`
- `mes.workorder.completed`
- `equipment.tag.value-changed`
- `equipment.tag.quality-changed`
- `scada.alarm.raised`
- `scada.alarm.acknowledged`
- `quality.inspection.completed`
- **Rules of thumb:**
- `domain` is the bounded context (mes, scada, equipment, quality, maintenance, …).
- `entity` is the thing the event is about (workorder, tag, alarm, inspection, …).
- `event-type` is what happened in past tense (started, completed, raised, acknowledged, …).
- Use **hyphens within a segment** (`value-changed`), **dots between segments** (`equipment.tag.value-changed`).
- Keep all segments lowercase; no site names, no environment names (dev/prod handled by cluster, not topic).
- _TBD — authoritative list of `domain` values and who owns each; governance process for adding a new domain or event type._
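The naming rules above are mechanical enough to lint at topic-onboarding time; a minimal sketch of such a check (the regex encodes only the rules stated here — three lowercase dot-separated segments, hyphens only within a segment):

```python
import re

# One segment: lowercase alphanumerics, hyphens allowed inside only.
SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
TOPIC_RE = re.compile(rf"{SEGMENT}\.{SEGMENT}\.{SEGMENT}")

def is_valid_topic(name: str) -> bool:
    """Check a topic name against {domain}.{entity}.{event-type}."""
    return TOPIC_RE.fullmatch(name) is not None

assert is_valid_topic("equipment.tag.value-changed")
assert is_valid_topic("mes.workorder.started")
assert not is_valid_topic("Equipment.Tag.ValueChanged")    # uppercase
assert not is_valid_topic("warsaw.mes.workorder.started")  # site name adds a 4th segment
```

A check like this belongs wherever topic creation is governed (CI on a topic manifest, or the onboarding tooling the governance TBD will define).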
- **Schema registry: Redpanda Schema Registry (bundled).** Reuse the schema registry bundled with Redpanda — Confluent-API-compatible, so the **Snowflake Kafka Connector** and any Confluent-compatible client libraries work unmodified. Avoids running a separate Apicurio/Confluent registry service.
- **Schema format: Protobuf.** All EventHub payloads are encoded as **Protobuf** and registered in the schema registry.
- **Rationale:** first-class .NET tooling (matters because ScadaBridge is Akka.NET), compact binary wire format, and `.proto` files are reusable outside Kafka — the same schemas can be shared with Web API consumers, internal services, and any downstream system that wants a canonical definition of a shopfloor event.
- **Trade-off:** Protobuf's schema-evolution rules are looser than Avro's, so we compensate with a strict **compatibility policy** in the registry (see TBD below) and discipline around field numbering and `reserved` markers.
- **Compatibility policy: `BACKWARD_TRANSITIVE`.** New schema versions must be readable by consumers compiled against **any** previous version, not just the immediately preceding one. Chosen because: (1) it matches the natural producer-first rollout order (producers upgrade, consumers follow), (2) it is what the Snowflake Kafka Connector expects, and (3) the *transitive* variant protects against long-tail consumers that haven't upgraded in a while — a real risk given the per-site, per-ScadaBridge-cluster producer footprint. Trade-off: disallows some schema changes (e.g., removing a field still used by an older consumer); `buf breaking` in CI catches these at PR time.
- **Subject naming strategy: `TopicNameStrategy`.** Each topic has exactly one message type, and registry subjects follow `{topic}-value` (and `{topic}-key` if keys are schematized) — e.g., `mes.workorder.started-value`. One topic, one schema, one subject. Chosen because: (1) it matches the "one topic = one event type" convention already adopted for topic naming, (2) it's the Snowflake Kafka Connector's default, so the Historian→Snowflake path works with no extra connector config, and (3) per-topic compatibility is easy to reason about (each subject evolves independently). Trade-off: topics cannot carry mixed record types — if we ever need an "event envelope" pattern, that topic will need a dedicated wrapper message rather than multiple concrete records on the same topic.
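Under `TopicNameStrategy` the subject name is purely a function of the topic name, which any consumer can exploit to look up the current schema. A small sketch (the registry hostname is a placeholder; the REST path is the Confluent Schema Registry API that Redpanda's bundled registry implements):

```python
from urllib.parse import quote

def subject_for(topic: str, part: str = "value") -> str:
    """Registry subject under TopicNameStrategy: {topic}-value (or {topic}-key)."""
    assert part in ("value", "key")
    return f"{topic}-{part}"

def latest_schema_url(registry_base: str, topic: str) -> str:
    """URL for the latest registered schema of a topic's value subject."""
    subject = quote(subject_for(topic), safe="")
    return f"{registry_base}/subjects/{subject}/versions/latest"

assert subject_for("mes.workorder.started") == "mes.workorder.started-value"
```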
- **Retention policy: per-topic tiered, classified by purpose.** Retention is not a single cluster-wide default; every topic is assigned a **retention tier** as part of its onboarding metadata.
- **`operational` tier — 7 days.** Short-lived operational events (e.g., `scada.alarm.raised`, transient state changes) whose value decays rapidly. Enough to cover consumer restarts, deploys, and long weekends, but not so long that storage and replay costs balloon.
- **`analytics` tier — 30 days.** Streams consumed by Snowflake ingestion, KPI processors, and other analytical consumers. 30 days gives enough room for backfill, connector re-bootstrapping, and monthly reprocessing windows without turning Redpanda into a long-term store.
- **`compliance` tier — 90 days.** Events tied to validated/regulated flows where retention is driven by compliance requirements rather than operational need. Explicitly not the compliance system of record — Aveva Historian remains that — but 90 days gives a working window for investigations and audit support.
- **Rationale:** retention is cheap to vary per topic, and these event classes have genuinely different needs. A cluster-wide default would either under-serve analytics or over-pay for operational topics.
- **Enforcement:** the tier is set when the topic is created (via infra-as-code for topic definitions), not edited ad-hoc. Classification lives next to the schema in the central `schemas` repo so retention moves with the topic's definition.
- _TBD — exact topic-classification criteria, exception process for topics that don't fit cleanly, tiered-storage offload for the analytics/compliance tiers (Redpanda Tiered Storage → S3/ADLS) to keep hot broker storage small, whether compliance-tier topics need separate ACLs or encryption._
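As a sketch, the tier table above expressed as topic-creation config (tier names and durations come from this section; `retention.ms` is the standard Kafka topic property, which Redpanda honors):

```python
# Retention tiers from this section, expressed in milliseconds for `retention.ms`.
DAY_MS = 24 * 60 * 60 * 1000

RETENTION_TIERS = {
    "operational": 7 * DAY_MS,   # short-lived operational events
    "analytics": 30 * DAY_MS,    # Snowflake ingestion, KPI processors, replay surface
    "compliance": 90 * DAY_MS,   # working window for investigations / audit support
}

def topic_config(tier: str) -> dict[str, str]:
    """Broker config fragment for a topic in the given tier (set at creation via
    infra-as-code, never edited ad hoc)."""
    return {"retention.ms": str(RETENTION_TIERS[tier])}

assert topic_config("operational") == {"retention.ms": "604800000"}
```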
- **Security / auth model: SASL/OAUTHBEARER + prefix-based ACLs.**
- **Authentication:** producers and consumers authenticate to Redpanda with **SASL/OAUTHBEARER** tokens issued by the enterprise identity provider. Every ScadaBridge cluster, every connector (Snowflake Kafka Connector), and every downstream service acquires a token via the OAuth **client-credentials** flow against the IdP, then presents it to Redpanda. Human operators reuse the same IdP via existing SSO.
- **Rationale:** reuses enterprise SSO/IdP identity for machine workloads, avoids managing per-client X.509 certificates or long-lived passwords, and gives the security team one place to revoke, rotate, and audit access. Aligns with the broader enterprise integration direction.
- **WAN-outage implication:** OAuth requires the IdP to be reachable for token issuance/refresh. When the WAN is down, a site's ScadaBridge cannot refresh its token. This is **acceptable** because the site's write path uses ScadaBridge **store-and-forward** (see EventHub Deployment footprint above) — queued events simply wait for both the IdP and Redpanda to be reachable before replaying. Local site operations are unaffected because ScadaBridge's producers don't block on EventHub availability.
- _TBD — which IdP (Azure AD/Entra ID is the obvious candidate but not confirmed), token lifetime and refresh strategy, how ScadaBridge securely stores its client-credentials secret at each site, whether there's a local IdP fallback (short-lived cached tokens, emergency break-glass credentials) for extended multi-day WAN outages._
- **Authorization: prefix-based ACLs.** Redpanda ACLs are granted on **topic prefixes** that follow the `{domain}.{entity}.{event-type}` naming convention, so each principal gets a compact, legible rule set instead of per-topic grants.
- **Principal shape:** one principal per ScadaBridge cluster (per site), one per connector/consumer service, one per human role.
- **Grant pattern:** write access is scoped to the domains a site owns (e.g., a site's ScadaBridge cluster gets `PRODUCE` on `equipment.*` and `scada.*`), while read access is scoped to the domains a consumer cares about (e.g., Snowflake ingestion gets `CONSUME` on `equipment.*`, `scada.*`, `mes.*`, and `quality.*`).
- **Rationale:** the topic naming convention already carries ownership and domain information; prefix ACLs let that structure drive authorization without per-topic bookkeeping. Adding a new topic inside a domain automatically inherits the domain's grants.
- _TBD — authoritative mapping of principals to prefix grants, whether ScadaBridge cluster principals should be scoped by site (producing only its own site's events) — note that site filtering is enforced at the producer today since site identity lives in the message, not the topic name, so Redpanda ACLs cannot enforce "only Warsaw West produces Warsaw West events"; that invariant is a ScadaBridge-side responsibility, which should be called out explicitly in the ScadaBridge configuration contract._
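The grant pattern reduces to a prefix check. A minimal illustration (the principal names and the grant table are assumptions for the example, not the authoritative mapping, which is a TBD):

```python
# Illustrative principal -> grant table; the authoritative mapping is TBD.
# Note the stated limitation: site identity lives in the message, not the topic
# name, so a prefix ACL cannot scope a producer to its own site's events.
GRANTS = {
    "scadabridge-warsaw-west": {"PRODUCE": ["equipment.", "scada."]},
    "snowflake-ingest": {"CONSUME": ["equipment.", "scada.", "mes.", "quality."]},
}

def authorized(principal: str, operation: str, topic: str) -> bool:
    """Prefix-ACL check: a grant on 'equipment.' covers every topic in that domain,
    so new topics inside a domain inherit the domain's grants automatically."""
    prefixes = GRANTS.get(principal, {}).get(operation, [])
    return any(topic.startswith(p) for p in prefixes)

assert authorized("scadabridge-warsaw-west", "PRODUCE", "equipment.tag.value-changed")
assert not authorized("scadabridge-warsaw-west", "PRODUCE", "mes.workorder.started")
```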
- **Schema source of truth: single central `schemas` repo.** All `.proto` files live in one central repository, not per-domain repos and not co-located with producing services.
- **Structure:** organized by `domain/entity` inside the repo, mirroring the topic naming convention (`mes/workorder.proto`, `equipment/tag.proto`, `scada/alarm.proto`, etc.). **CODEOWNERS** enforces per-domain ownership inside the single repo — reviews still go to the right team, but the style, layout, and compatibility policy are enforced uniformly.
- **Publishing pipeline:** on every merge to main, the repo builds and publishes:
- A **NuGet package** for .NET consumers (ScadaBridge and any .NET-based Web API clients).
- Additional language packages (Java/Maven, Python/PyPI, etc.) **added only when a concrete consumer needs them** — not up front.
- Compiled schemas are also **registered with the Redpanda Schema Registry** as part of the same pipeline, so registry state and source state never drift.
- **Rationale:** one place to enforce the compatibility policy, one chokepoint for schema review, one versioned artifact for every consumer. Prevents drift between copies, avoids duplicate message definitions across domains, and gives enterprise consumers (Snowflake ingestion, Camstar integration) exactly one dependency to track.
- **Trade-off accepted:** the central repo is a review chokepoint. Mitigated by CODEOWNERS routing and by keeping the compatibility policy mechanical (tooling enforces it — see TBD) rather than relying on a gatekeeper team.
- **Schema tooling: `buf` (buf.build).** The central `schemas` repo standardizes on `buf` for lint, breaking-change detection, and code generation.
- **`buf lint`** enforces style in CI — field naming, package layout, `reserved` discipline — so schema style doesn't drift across domains.
- **`buf breaking`** runs in CI on every PR, checking the proposed changes against the previously published version of the package. Breaking changes fail the PR **before** merge — compatibility is enforced at PR time, not at registry-publish time.
- **`buf generate`** drives code generation for the NuGet package (and any other language packages added on demand), replacing ad-hoc `protoc` invocations.
- **Rationale:** PR-time feedback is dramatically better than publish-time errors; `buf` is the de-facto modern Protobuf toolchain with strong .NET/C# support; having lint, breaking-change checks, and codegen in one tool keeps the schema repo's CI simple.
- _TBD — `buf.yaml` lint/breaking rule set (likely default + a few project-specific additions), versioning scheme on the NuGet package (semver, how breaking changes are signaled), release cadence (per-merge vs batched), whether to use `buf push` to a BSR or keep the NuGet package as the only distribution artifact._
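A starting `buf.yaml` consistent with the policy above might look like this (a sketch using buf's built-in rule categories; the project-specific rule additions are exactly the TBD):

```yaml
version: v1
lint:
  use:
    - DEFAULT     # field naming, package layout, enum zero-value conventions
breaking:
  use:
    - WIRE_JSON   # fail PRs on changes that break wire or JSON compatibility
```

`WIRE_JSON` is the stricter of buf's wire-level categories and lines up with the `BACKWARD_TRANSITIVE` registry policy: anything it flags would also fail registry registration, but the failure surfaces at PR time instead.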
- **Usage patterns:**
- **Async event notifications** — shopfloor events (state changes, alarms, lifecycle events, etc.) published to EventHub for any interested consumer to subscribe to, without producers needing to know who's listening.
- **Async processing for KPI** — KPI calculations (currently handled on Ignition SCADA) can consume event streams from EventHub, enabling decoupled, replayable KPI pipelines instead of tightly coupled point queries.
- **System integrations** — other enterprise systems (Camstar, Snowflake, future consumers) integrate by subscribing to EventHub topics rather than opening point-to-point connections into OT.
- **Historical replay for integration testing and simulation-lite.** The `analytics`-tier retention (30 days) is explicitly also a **replay surface** for testing and simulation-lite: downstream consumers (ScadaBridge scripts, KPI pipelines, dbt models, a future digital twin layer) can be exercised against real historical event streams instead of synthetic data. This is the minimal answer to **Digital Twin Use Case 2 (Virtual Testing / Simulation)** — see **Strategic Considerations → Digital twin** → use case 2 — and does not require any new component. When longer horizons are needed, extend to the `compliance` tier (90 days). Replay windows beyond 90 days are served by the dbt curated layer in Snowflake, not by Redpanda.
- _Remaining open items are tracked inline in the subsections above — sizing, read-path implications, long-outage planning, IdP selection, schema subject/versioning details, etc. Support staffing and on-call ownership are out of scope for this plan._
#### Canonical Equipment, Production, and Event Model
The plan already delivers the infrastructure for a cross-system canonical model — OtOpcUa's equipment namespace, Redpanda's `{domain}.{entity}.{event-type}` topic taxonomy, Protobuf schemas in the central `schemas` repo, and the dbt curated layer in Snowflake. What it has not, until now, explicitly committed to is **declaring** that these pieces together constitute the enterprise's canonical equipment / production / event model, and that consumers are entitled to treat them as an integration interface.
This subsection makes that declaration. It is the plan's answer to **Digital Twin Use Cases 1 and 3** (see **Strategic Considerations → Digital twin**) and — independent of digital twin framing — is load-bearing for pillar 2 (analytics/AI enablement) because a canonical model is what makes "not possible before" cross-domain analytics possible at all.
> **Schemas-repo dependency — partially resolved.** The OtOpcUa team has contributed an initial seed at [`schemas/`](../schemas/) (temporary location in the 3-year-plan repo until the dedicated `schemas` repo is created — Gitea push-to-create is disabled). The seed includes:
> - JSON Schema format definitions (`format/equipment-class.schema.json` with an `extends` field for class inheritance, `format/tag-definition.schema.json`, `format/uns-subtree.schema.json`)
> - A **`_base` equipment-class template** (`classes/_base.json` — the universal cross-machine baseline that every other class extends, providing 27 signals across Identity / Status / Alarm / Diagnostic / Counter / Process categories + 2 universal alarms + the canonical state vocabulary). Aligned with **OPC UA Companion Spec OPC 40010 (Machinery)** for the Identification component (Manufacturer, Model, ProductInstanceUri, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, ManufacturerUri, DeviceManual, AssetLocation) and MachineryOperationMode enum, **OPC UA Part 9** for alarm-summary fields (HasActiveAlarms, ActiveAlarmCount, HighestActiveAlarmSeverity), **ISO 22400** for lifetime counters that feed Availability + Performance KPIs (TotalRunSeconds, TotalCycles), and the canonical state vocabulary (Running / Idle / Faulted / Starved / Blocked) declared in `_base.stateModel`
> - The **FANUC CNC pilot equipment class** (`classes/fanuc-cnc.json`, `extends: "_base"` — adds 14 CNC-specific signals on top of the inherited base) + 3 FANUC-specific alarm definitions
> - A worked UNS subtree example (`uns/example-warsaw-west.json`)
> - Documentation: `docs/overview.md`, `docs/format-decisions.md` with 10 numbered decisions (the original 8 plus D9 covering `_base` + `extends` inheritance and D10 covering the signal `category` → OPC UA folder placement table), `docs/consumer-integration.md`
>
> The **UNS hierarchy snapshot** (a per-site equipment-instance walk) feeds the initial hierarchy definition. Core driver scope is already resolved by the v2 implementation team's committed driver list.
>
> **OtOpcUa central config DB extended** (per lmxopcua decisions #138 + #139): the Equipment table gains 9 nullable columns for the OPC 40010 Identification fields (Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri) so operator-set static identity has a first-class home; drivers that can read these dynamically (FANUC `cnc_sysinfo()`, Beckhoff `TwinCAT.SystemInfo`, etc.) override the static value at runtime. Exposed on the OPC UA equipment node under the OPC 40010-standard `Identification` sub-folder per the `category` → folder mapping in `schemas/docs/format-decisions.md` D10.
>
> **Still needs cross-team ownership:**
> - Name an owner team for the schemas content (it's consumed by OT and IT systems alike — OtOpcUa, Redpanda, dbt)
> - Decide whether to move to a dedicated `gitea.dohertylan.com/dohertj2/schemas` repo (proposed) or keep as a 3-year-plan sub-tree
> - Ratify or revise the 10 format decisions in `schemas/docs/format-decisions.md`
> - Establish the CI gate for JSON Schema validation
> - Decide on consumer-integration plumbing for Redpanda Protobuf code-gen and dbt macro generation per `schemas/docs/consumer-integration.md`
**Unified Namespace framing:** this canonical model is also the plan's **Unified Namespace** (UNS) — see **Target IT/OT Integration → Unified Namespace (UNS) posture**. The two sections describe the same mechanics at different altitudes: this section specifies the canonical model concretely, while the UNS posture explains, for stakeholders asking about UNS, how the plan delivers the UNS value proposition without an MQTT/Sparkplug broker.
##### The three surfaces
The canonical model is exposed on three surfaces, one per layer:
| Layer | Surface | What it canonicalizes |
|---|---|---|
| Layer 2 — Equipment | **OtOpcUa equipment namespace** | Canonical per-equipment OPC UA node structure. A consumer reading tag `X` from equipment `Y` at any site gets the same node path, the same data type, and the same unit. Equipment-class templates (e.g., "3-axis CNC," "injection molding cell") live here and are referenced from the Redpanda and dbt surfaces. |
| Layer 4 → IT — Events | **Redpanda topics + Protobuf schemas** (`schemas` repo) | Canonical event shape. Every shopfloor event — `equipment.tag.value-changed`, `equipment.state.transitioned`, `mes.workorder.started`, `scada.alarm.raised`, etc. — has exactly one Protobuf message type, registered once, consumed everywhere. **This is where the canonical model is source-of-truth.** |
| IT — Analytics | **dbt curated layer in Snowflake** | Canonical analytics model. Curated views expose equipment, production runs, events, and aggregates with the same vocabulary, dimensions, and state values as the OtOpcUa and Redpanda surfaces. Downstream reporting (Power BI, ad-hoc SQL) and AI/ML consume from here. |
**Single source of truth: the `schemas` repo.** The three surfaces reference a shared canonical definition — they do not each carry their own. Specifically:
- **Protobuf message definitions** in the `schemas` repo define the wire format for every canonical event.
- **Shared enum types** in the `schemas` repo define the canonical **machine state vocabulary** (see below), canonical event-type values, and any other closed sets of values.
- **Equipment-class definitions** in the `schemas` repo (format TBD — could be a Protobuf message set, could be a YAML document set referenced from Protobuf) describe the canonical node layout that OtOpcUa templates instantiate and that dbt curated views flatten into fact/dim tables.
Consumers that need to know "what does a `Faulted` state mean" or "what are all the event types in the `mes` domain" look at the `schemas` repo. Any divergence between a surface and the `schemas` repo is a defect in the surface, not in the schema.
##### Canonical machine state vocabulary
The plan commits to a **single authoritative set of machine state values** used consistently across layer-3 state derivations, Redpanda event payloads, and dbt curated views. This is the answer to Digital Twin Use Case 1.
Starting set (subject to refinement during implementation, but the names and semantics below are committed as the baseline):
| State | Semantics |
|---|---|
| `Running` | Equipment is actively producing at or near theoretical cycle time. |
| `Idle` | Equipment is powered and available but not producing — no work in progress, no fault, nothing blocking. |
| `Faulted` | Equipment has raised a fault that requires acknowledgement or intervention before it can return to `Running`. |
| `Starved` | Equipment is ready to run but is blocked by missing upstream input (material, preceding operation). |
| `Blocked` | Equipment is ready to run but is blocked by a downstream constraint (full buffer, unavailable downstream operation). |
**Rules of the vocabulary:**
- **One state at a time.** An equipment instance is in exactly one of these states at any moment. Multi-dimensional status (e.g., alarm severity, operator mode) is carried in **additional fields** on the state event, not by overloading the state value.
- **Derivation lives at layer 3.** Deriving "true state" from raw signals (interlocks, status bits, PLC words, alarm registers) is a **Layer 3** responsibility — Aveva System Platform for validated derivations, Ignition for KPI-facing derivations. The dbt curated layer consumes the already-derived state; it does not re-derive.
- **Events carry state transitions, not state polls.** Redpanda publishes a canonical `equipment.state.transitioned` event every time an equipment instance changes state, with the previous state, the new state, a reason code when available, and the underlying derivation inputs referenced by ID where possible. Current state is reconstructable from the transition stream.
- **State values are an enum in the `schemas` repo.** Adding a state value is a schema change reviewed through the normal `schemas` repo governance (CODEOWNERS, `buf` CI, compatibility checks). Removing a state value is effectively impossible without a long-tail consumer migration — treat the starting set as durable.
- **Top-fault derivation.** When `Faulted`, the canonical event carries a `top_fault` field (single fault code or string, per the `schemas` repo enum) rather than exposing the full alarm vector. The derivation of "top" from the underlying alarm set lives at layer 3 and is documented per-equipment-class in the `schemas` repo alongside the equipment-class definition.
**Additions under active consideration (TBD — resolve during Year 1 implementation):**
- `Changeover` — distinct from `Idle` because it is planned downtime with a known duration and setup workflow. Likely needed for OEE accuracy.
- `Maintenance` — planned or unplanned maintenance state, again distinct from `Idle` for OEE accounting.
- `Setup` / `WarmingUp` — start-of-shift or start-of-run conditioning where equipment is powered but not yet eligible to run.
These are strong candidates but not committed in the starting set; the implementation team closes them with domain SMEs once the first equipment classes are modeled.
##### Relationship to OEE and KPI
The canonical state vocabulary directly enables accurate OEE computation in the dbt curated layer without each consumer having to re-derive availability / performance / quality from scratch. This is one of the most immediate answers to pillar 2's "not possible before" use case criterion: cross-equipment, cross-site OEE computed once in dbt from a canonical state stream is meaningfully harder today because the state-derivation logic is fragmented across System Platform and Ignition scripts. Once the canonical state vocabulary is in place, OEE becomes a dbt model, not a bespoke script per site.
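As an illustration of why the canonical transition stream makes availability a straightforward derivation: the logic below would live as a dbt model in the goal state; this is a pure-Python sketch of the interval accounting, not the dbt SQL, and it counts only `Running` as uptime (to be refined per ISO 22400 once states like `Changeover` and `Maintenance` are committed):

```python
def availability(transitions, window_start, window_end):
    """Availability over [window_start, window_end] from a (timestamp, new_state)
    transition stream. Assumes the first element gives the state in effect
    entering the window; timestamps are in any consistent unit (e.g. seconds)."""
    running = 0.0
    prev_t, prev_state = window_start, transitions[0][1]
    for t, state in transitions[1:]:
        t = min(max(t, window_start), window_end)  # clamp to the window
        if prev_state == "Running":
            running += t - prev_t
        prev_t, prev_state = t, state
    if prev_state == "Running":
        running += window_end - prev_t
    return running / (window_end - window_start)

# Running 0-60, Faulted 60-80, Running 80-100 -> availability 0.8
stream = [(0, "Running"), (60, "Faulted"), (80, "Running")]
assert availability(stream, 0, 100) == 0.8
```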
##### Not in scope for this subsection
- **UI / visualization of the canonical model.** The model is a data contract, not a product. Dashboards, HMIs, and reporting surfaces (Ignition KPI, Power BI, any future digital-twin dashboard) all consume the canonical model — but building those surfaces is **not** what the canonical model commits to.
- **Real-time state event frequency guarantees.** Latency and delivery semantics for `equipment.state.transitioned` events are governed by the general Redpanda latency profile and the `analytics`-tier retention (30 days); this subsection does not add a per-event SLO beyond what pillar 2's `≤15-minute analytics` commitment already provides.
- **Predictive state forecasting.** Forecasting whether an equipment instance is "about to fault" is a pillar-2 AI/ML use case built on top of the canonical stream, not a canonical-model deliverable. The canonical model just has to be clean enough to train on.
**Resolved:** equipment-class definitions in the `schemas` repo use **JSON Schema (.json files)** as the authoring format, with Protobuf code-generated for wire serialization (see UNS naming hierarchy → Where the authoritative hierarchy lives). **Pilot equipment class: FANUC CNC** — the v2 OtOpcUa FOCAS driver design already specifies a fixed pre-defined node hierarchy (Identity, Status, Axes, Spindle, Program, Tool, Alarms, PMC, Macro) populated by specific FOCAS2 API calls, which is essentially a class template already; the schemas repo formalizes it. FANUC CNC is the natural pilot because: single vendor with well-known API surface, finite equipment count per site, clean failure-mode boundary (Tier C out-of-process driver), and no greenfield template invention required. The pilot should land in the schemas repo before Tier 1 cutover begins, to validate the template-consumer contract end-to-end.
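The `_base` + `extends` inheritance resolves with a simple recursive merge. A sketch under the assumption that class documents carry a `signals` map (real documents follow `format/equipment-class.schema.json`, so the field names and signals here are illustrative):

```python
# Illustrative minimal class documents; in practice these are loaded from
# schemas/classes/*.json and validated against the JSON Schema format files.
CLASSES = {
    "_base": {
        "signals": {
            "TotalCycles": {"type": "int64"},
            "HasActiveAlarms": {"type": "bool"},
        }
    },
    "fanuc-cnc": {
        "extends": "_base",
        "signals": {"SpindleSpeed": {"type": "double"}},
    },
}

def resolve(name: str) -> dict:
    """Flatten the `extends` chain: a child's signals extend (and, on name
    collision, override) its parent's, recursively up to `_base`."""
    cls = CLASSES[name]
    parent = cls.get("extends")
    merged = dict(resolve(parent)["signals"]) if parent else {}
    merged.update(cls.get("signals", {}))
    return {"signals": merged}

resolved = resolve("fanuc-cnc")
assert set(resolved["signals"]) == {"TotalCycles", "HasActiveAlarms", "SpindleSpeed"}
```

Consumers (OtOpcUa templates, Protobuf code-gen, dbt macro generation) would all work from the flattened form, so inheritance stays an authoring convenience rather than a runtime concern.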
_TBD — ownership of the canonical state vocabulary (likely a domain SME group rather than the ScadaBridge team); reconciliation process if System Platform and Ignition derivations of the same equipment's state diverge (they should not, but the canonical surface needs a tiebreak rule)._
## Observability
The plan commits to **what must be observable**, not to **which tool** emits/stores/visualizes the signals. Each new component must expose its signals to whatever enterprise observability stack exists (reuse, don't reinvent). Tool selection is out of scope for this plan — the same posture as VM-level DR and orchestrator selection.
**Minimum signal set — non-negotiable requirements on the goal-state components.**
- **Redpanda (central EventHub cluster).**
- Per-topic throughput (messages/sec, bytes/sec).
- Producer and consumer **lag** per consumer group.
- Broker node health and partition distribution.
- **ACL deny counts** (authorization failures).
- **OAuth token acquisition failures** (ties to the SASL/OAUTHBEARER auth model).
- Disk utilization against retention policy per tier (`operational`, `analytics`, `compliance`).
- **ScadaBridge (existing component, observability obligations added by this plan).**
- Per-site store-and-forward queue depth and oldest-unsent age (critical for detecting extended WAN outages against ScadaBridge disk capacity).
- EventHub publish success/failure rate per site.
- Outbound Web API success/failure rate per destination.
- Deadband filter activity (how many samples suppressed vs published) per tag class, to tune the global deadband value.
- **SnowBridge.**
- Per-source ingest rate (events/sec from Redpanda, rows/sec from Historian, etc.).
- Per-source error rate and last-successful-read timestamp.
- Snowflake write latency and failure rate per target table.
- **Selection-change audit events** — every approved change is observable as an event, not just a DB row (so alerting on unusual change patterns is possible).
- End-to-end latency (source emit → Snowflake queryable), measured against the **≤15-minute analytics** and **≤60-minute compliance** SLOs.
- **dbt (on the self-hosted orchestrator).**
- Per-model run duration, success/failure, and last-successful-run timestamp.
- **Source freshness** — how stale the landing-table sources are that dbt reads from.
- Failed test count (not just failed model count — `dbt test` results are a first-class signal).
- Queued/stalled job counts on the orchestrator.
**Alerting floor (what must page someone, whatever tool is chosen).**
- Any component above breaching its SLO for sustained periods (definition of "sustained" is a per-signal TBD).
- ScadaBridge store-and-forward queue approaching disk capacity at any site.
- Consumer lag on any Snowflake-bound topic exceeding the analytics SLO budget.
- Any selection change that bypasses the approval workflow (defense-in-depth against the UI being compromised).
- Any OAuth token acquisition failure affecting production producers.
_TBD — "sustained" durations per signal, concrete alert routing (on-call rotation, ticketing, etc.) — both outside this plan's scope per the "don't cover support staffing" decision, but noted as a seam for operations teams to wire up._
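A point-in-time check of the two latency SLOs might look like this (a sketch: the tier names and budgets come from this plan, while the "sustained breach" logic that should actually gate paging remains a TBD above):

```python
from datetime import datetime, timedelta, timezone

# End-to-end latency budgets committed in this plan (source emit -> Snowflake queryable).
SLO = {
    "analytics": timedelta(minutes=15),
    "compliance": timedelta(minutes=60),
}

def slo_breached(tier: str, emitted_at: datetime, queryable_at: datetime) -> bool:
    """True if a single observation exceeds the tier's latency budget.
    Paging should require a sustained run of breaches, per the TBD above."""
    return (queryable_at - emitted_at) > SLO[tier]

t0 = datetime(2026, 4, 17, 12, 0, tzinfo=timezone.utc)
assert not slo_breached("analytics", t0, t0 + timedelta(minutes=10))
assert slo_breached("analytics", t0, t0 + timedelta(minutes=20))
```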
## Target Operator / User Experience
> UX modernization is **not a primary goal** of this plan (see **Vision** and **Non-Goals**). This section captures the **long-term UX split** so that any incidental UX work done during migrations lands in the right place — not so that UX becomes a funded workstream.
The long-term UX split **mirrors the SCADA data split** already in place:
- **Ignition SCADA → KPI UX.** All KPI-facing operator and supervisor UX lives on Ignition. Dashboards, OEE views, production status boards, and any other KPI visualizations are built and maintained in Ignition.
- **Aveva System Platform HMI → Validated data UX.** Any operator UX that interacts with **validated/regulated data** (compliance-grade data collection, GxP-relevant workflows, etc.) lives on System Platform HMI. Validation controls, audit trails, and regulated operator actions stay on System Platform.
**Implications for the 3-year plan:**
- Don't introduce a third UX platform. New operator UX should be placed on Ignition or System Platform HMI based on whether it's KPI-facing or validated-data-facing.
- Migrations that touch UX should route it to the correct side of the split (e.g., a legacy KPI screen living outside Ignition should move to Ignition; a validated workflow outside System Platform HMI should move to System Platform).
- No commitment is made here to rebuild existing UX that already runs acceptably — we're setting the target shape, not scheduling a UX refresh.
_TBD — criteria for when a screen counts as "KPI" vs "validated"; handling of edge cases that need data from both worlds._
## Success Criteria / KPIs
Success is measured against the three in-scope pillars from the **Vision**. Each pillar has one concrete, measurable criterion — vague criteria produce plans that quietly expand, so "done" is unambiguous.
1. **Unification — 100% of sites integrated under the standardized model.**
- Every site is onboarded through the standard stack: **ScadaBridge** at the edge, **Redpanda EventHub** as the async backbone, the **SnowBridge** for Snowflake-bound flows, and the **long-term UX split** (Ignition for KPI, System Platform HMI for validated).
- In-scope sites: South Bend, Warsaw West, Warsaw North, Shannon, Galway, TMT, Ponce, **and all currently unintegrated smaller sites** (Berlin, Winterthur, Jacksonville, and the rest of the list once firm).
    - Measurement: a site counts as "integrated" when its equipment produces events into Redpanda via a local ScadaBridge cluster (or a documented, legitimate exception is in place), it is represented in the standard topic taxonomy, and its legacy point-to-point integrations are retired (pillar 3).
2. **Analytics / AI enablement — machine data queryable in Snowflake at the committed SLOs.**
- **Aveva Historian** machine data (and event-driven data from ScadaBridge) is queryable in **Snowflake** at the committed latencies: **analytics ≤ 15 minutes end-to-end**, **compliance ≤ 60 minutes end-to-end**.
- A defined set of **priority tags** (list TBD as part of onboarding) is flowing end-to-end through the SnowBridge, landing in Snowflake, and transformed into **dbt curated layers**.
- At least **one production enterprise analytics or AI use case that was not possible before this pipeline existed** is consuming the curated layer in production. The test is enablement, not throughput: the use case must depend on data, freshness, or cross-site reach that the pre-plan stack could not deliver. Re-platforming an existing report onto Snowflake does not count; a new AI/ML model trained on cross-site machine data, a net-new cross-site OEE view, or an alert that depends on the ≤15-minute SLO would count.
3. **Legacy middleware retirement — zero remaining legacy point-to-point integrations.**
- The inventory of "legacy API integrations not yet migrated to ScadaBridge" (captured in `current-state.md`) goes to **zero** by end of plan.
- Every cross-domain IT↔OT path runs through **ScadaBridge** (or, for the Snowflake path, through ScadaBridge → Redpanda → the SnowBridge).
- Documented exceptions are allowed only with explicit approval and a retirement date; "temporary" carve-outs that outlast the plan count as a failure of this criterion.
**Operating principle.** A pillar is not "partially done." Each criterion is binary at the end of the 3 years — either every site is on the standardized model or not, either the analytics pipeline is in production or not, either the legacy inventory is zero or not. Intermediate progress is tracked per-site / per-tag / per-integration in the plan's working documents, not by softening the end-state criteria.
_TBD — named owners for each pillar's criterion; quarterly progress metrics (e.g., sites integrated/quarter, tags onboarded/quarter, legacy integrations retired/quarter) that roll up into the three binary end-state checks; the priority-tag list for pillar 2; the authoritative legacy-integration inventory for pillar 3._
## Strategic Considerations (Adjacent Asks)
External strategic asks that are **not** part of this plan's three pillars but that the plan should be *shaped to serve* when they materialize. None of these commit the plan to deliver anything — they are constraints on how components are built so that future adjacent initiatives can consume them.
### Digital twin (management ask — use cases received 2026-04-15)
**Status: management has delivered the requirements; the plan absorbs two of the three use cases and treats the third as exploratory.** The plan does not add a new "digital twin workstream" to `roadmap.md`, and no pillar criterion depends on a digital twin deliverable. What the plan does is **commit to the pieces** that management's three use cases actually require, as additions to existing components rather than as a parallel initiative. See [`goal-state/digital-twin-management-brief.md`](goal-state/digital-twin-management-brief.md) → "Outcome" for the meeting resolution.
#### Management-provided use cases
These are the **only requirements** management can provide — high-level framing, no product selection, no sponsor, no timeline beyond "directionally, this is what we want." Captured here faithfully in intent, though lightly reworded; the source document lives at [`../digital_twin_usecases.md.txt`](../digital_twin_usecases.md.txt) in its original form.
1. **Standardized Equipment State / Metadata Model.** A consistent, high-level representation of machine state derived from raw signals: Running / Idle / Faulted / Starved / Blocked. Normalized across equipment types. Single authoritative machine state, derived from multiple interlocks and status bits. Actual-vs-theoretical cycle time. Top-fault instead of dozens of raw alarms. Value: single consistent view of equipment behavior, reduced downstream complexity, improved KPI accuracy (OEE, downtime).
2. **Virtual Testing / Simulation (FAT, Integration, Validation).** A digital representation of equipment that emulates signals, states, and sequences, so automation logic / workflows / integrations can be tested without physical machines. Replay of historical scenarios, synthetic scenarios, edge-case coverage. Value: earlier testing, reduced commissioning time and risk, improved deployed-system stability.
3. **Cross-System Data Normalization / Canonical Model.** A common semantic layer between systems: standardized data structures for equipment, production, and events. Translates system-specific formats into a unified model. Consistent interface for all consumers. Uniform event definitions (`machine fault`, `job complete`). Value: simplified integration, reduced duplication of transformation logic, improved consistency across the enterprise.
Management's own framing of the combined outcome: "a translator (raw signals → meaningful state), a simulator (test without physical dependency), and a standard interface (consistent data across systems)."
#### Plan mapping — what each use case costs this plan
| # | Use case | Maps to existing plan components | Delta this plan commits to |
|---|---|---|---|
| 1 | Standardized equipment state model | Layer 3 (Aveva System Platform + Ignition state derivation) for real-time; dbt curated layer for historical; Redpanda event schemas for event-level state transitions | **Canonical machine state vocabulary.** Adopt `Running / Idle / Faulted / Starved / Blocked` (plus any additions agreed during implementation) as the **authoritative state set** across layer-3 derivations, Redpanda event payloads, and dbt curated views. No new component — commitment is that every surface uses the same state values, and the vocabulary is published in the central `schemas` repo. See **Async Event Backbone → Canonical Equipment, Production, and Event Model.** |
| 2 | Virtual testing / simulation | Not served today by the plan, and not going to be served by a full simulation stack. | **Simulation-lite via replay.** Redpanda's analytics-tier retention (30 days) already enables historical event replay to exercise downstream consumers. OtOpcUa's namespace architecture can in principle host a future "simulated" namespace that replays historical equipment data to exercise tier-1 and tier-2 consumers — architecturally supported, not committed for build in this plan. **Full commissioning-grade simulation stays out of scope** pending a separate funded initiative. |
| 3 | Cross-system canonical model | OtOpcUa equipment namespace (canonical OPC UA surface); Redpanda topic taxonomy (`{domain}.{entity}.{event-type}`) + Protobuf schemas; dbt curated layer (canonical analytics model) — all three already committed. | **Canonical model declaration.** The plan already builds the pieces; what it did not do is **declare** that these pieces together constitute a canonical equipment/production/event model that consumers are entitled to use as an integration interface. This declaration lives in the central `schemas` repo as first-class content and is referenced from every surface that exposes the model. See **Async Event Backbone → Canonical Equipment, Production, and Event Model.** |
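The single-authoritative-state derivation in use case 1 can be sketched in a few lines. This is illustrative only: the signal names are hypothetical, and the precedence order among the five canonical values is an assumption standing in for the real layer-3 derivation rules the `schemas` repo would publish.

```python
from enum import Enum


class MachineState(Enum):
    """Canonical state vocabulary from use case 1 (additions such as
    Changeover / Maintenance / Setup are still TBD)."""
    RUNNING = "Running"
    IDLE = "Idle"
    FAULTED = "Faulted"
    STARVED = "Starved"
    BLOCKED = "Blocked"


def derive_state(fault_active: bool, infeed_empty: bool,
                 outfeed_full: bool, cycle_running: bool) -> MachineState:
    """Collapse raw interlocks and status bits into one authoritative state.

    The precedence (fault > blocked > starved > running > idle) is an
    illustrative assumption; the agreed rules belong in the schemas repo.
    """
    if fault_active:
        return MachineState.FAULTED
    if outfeed_full:
        return MachineState.BLOCKED
    if infeed_empty:
        return MachineState.STARVED
    if cycle_running:
        return MachineState.RUNNING
    return MachineState.IDLE
```

The point of the sketch is the shape of the contract: every surface (layer-3 derivations, Redpanda payloads, dbt views) emits one of the five `MachineState.value` strings, never raw interlock bits.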
#### Resolution against the meeting brief's four buckets
The meeting brief framed four outcome buckets (#1 already-delivered, #2 adjacent-funded, #3 future-plan-cycle, #4 exploratory). Management's actual answer does not land in a single bucket — it **splits per use case:**
- **Use cases 1 and 3 → Bucket #1 with small plan additions.** The plan already delivers the substrate; it now also commits to the canonical state vocabulary (use case 1) and the canonical model declaration (use case 3), both captured below under **Async Event Backbone → Canonical Equipment, Production, and Event Model**. No new workstream, no new component, no pillar impact.
- **Use case 2 → Bucket #4, served minimally.** Replay-based "simulation-lite" is architecturally enabled by Redpanda's retention tiers and OtOpcUa's namespace model. Full FAT / commissioning / integration-test simulation remains out of scope for this plan. If a funded simulation initiative materializes later, this plan's foundation supports it; until then, the narrow answer to use case 2 is "replay what Redpanda already holds, and build a simulated OtOpcUa namespace when a specific testing need justifies it."
#### Design constraints this imposes (unchanged)
- **Any digital twin capability must consume equipment data through OtOpcUa.** No direct equipment OPC UA sessions.
- **Any digital twin capability must consume historical and analytical data through Snowflake + dbt** — not from Historian directly, not through a bespoke pipeline. The ≤15-minute analytics SLO is the freshness budget available to it.
- **Any digital twin capability must consume event streams through Redpanda** — not a parallel bus. The same schemas-in-git and `{domain}.{entity}.{event-type}` topic naming apply. The canonical state vocabulary and canonical model declaration (see below) are how "consistent state semantics" is delivered.
- **Any digital twin capability must stay within the IT↔OT boundary.** Enterprise-hosted twins cross through ScadaBridge central and the SnowBridge like every other enterprise consumer.
> **Unified Namespace vocabulary:** stakeholders framing the digital twin ask in "Unified Namespace" terms are asking for the same thing Use Cases 1 and 3 describe, just in UNS language. See **Target IT/OT Integration → Unified Namespace (UNS) posture** for the plan's explicit UNS framing and the decision trigger for a future MQTT/Sparkplug projection service. In short: the plan **already** delivers the UNS value proposition; an MQTT-native projection can be added later if a consumer specifically requires it.
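A concrete reading of the `{domain}.{entity}.{event-type}` constraint, as a minimal validator. The three-segment shape comes from the plan; the per-segment grammar (lowercase kebab-case) is an assumption standing in for whatever the `schemas` repo ultimately specifies.

```python
import re

# One lowercase kebab-case segment. The three-segment shape is the plan's
# taxonomy; the per-segment grammar is an assumption for illustration.
_SEGMENT = r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*"
_TOPIC_RE = re.compile(rf"^{_SEGMENT}\.{_SEGMENT}\.{_SEGMENT}$")


def is_canonical_topic(name: str) -> bool:
    """True when a topic name matches {domain}.{entity}.{event-type}."""
    return _TOPIC_RE.fullmatch(name) is not None
```

A check like this would run in CI on the `schemas` repo, so any consumer (a twin included) can rely on topic names parsing deterministically.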
#### What this does and does not commit
**Commits:**
- A canonical machine state vocabulary (`Running / Idle / Faulted / Starved / Blocked` + any additions), published in the `schemas` repo and used consistently across layer-3 derivations, Redpanda event schemas, and dbt curated views.
- A canonical equipment / production / event model declaration in the `schemas` repo, referencing the three surfaces (OtOpcUa, Redpanda, dbt) where it is exposed.
- Retention-tier replay of Redpanda analytics topics as a documented capability usable for integration testing and simulation-lite.
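The replay commitment leans on one mechanical detail: seeking a consumer by wall-clock time, which the Kafka API (Redpanda is Kafka-API-compatible) expresses in epoch milliseconds. A small helper, with hypothetical naming, makes the windowing explicit:

```python
from datetime import datetime, timezone


def replay_window_ms(start_iso: str, end_iso: str) -> tuple[int, int]:
    """Convert ISO-8601 UTC wall-clock bounds into the epoch-millisecond
    values a Kafka offsets-for-times lookup expects."""
    def to_ms(s: str) -> int:
        dt = datetime.fromisoformat(s).replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)

    start_ms, end_ms = to_ms(start_iso), to_ms(end_iso)
    if end_ms <= start_ms:
        raise ValueError("replay window must end after it starts")
    return start_ms, end_ms
```

With a Kafka client, the start value feeds an offsets-for-times lookup and a seek to the returned offset; the consumer then replays forward and stops once record timestamps pass the end value. Anything older than the 30-day analytics retention tier is simply gone, which is exactly the boundary between simulation-lite and a real simulation product.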
**Does not commit:**
- Building or buying a full commissioning-grade simulation product (Aveva Digital Twin, Siemens NX, DELMIA, Azure Digital Twins, etc.).
- A digital twin UI, dashboard, 3D visualization, or product surface.
- Predictive / AI models specific to digital twin use cases — those are captured under pillar 2 as general analytics/AI enablement, not as digital-twin-specific deliverables.
- Any new workstream, pillar, or end-of-plan criterion tied to digital twin delivery.
_TBD — whether any equipment state additions beyond the five names above are needed (e.g., `Changeover`, `Maintenance`, `Setup`); ownership of the canonical state vocabulary in the `schemas` repo (likely a domain-specific team rather than the ScadaBridge team); whether a use-case-2 funded simulation initiative is on anyone's horizon._
### Enterprise reporting: BOBJ → Power BI migration (adjacent initiative)
**Status: in-flight, not owned by this plan.** Enterprise reporting is actively migrating from **SAP BusinessObjects** to **Microsoft Power BI** (see [`current-state.md`](current-state.md) → Aveva Historian → Current consumers). This is a reporting-team initiative, not a workstream of this 3-year plan — but it **overlaps with pillar 2** (analytics/AI enablement) in a way that requires explicit coordination, because both initiatives ultimately consume machine data and both ultimately present analytics to business users.
**This plan's posture:** no workstream is added to `roadmap.md`, and no pillar criterion depends on the Power BI migration landing on any particular schedule. However, the plan's Snowflake-side components (SnowBridge, dbt curated layer) are shaped so that Power BI can consume them cleanly **if and when** the reporting team decides to point there. Whether Power BI actually does so, on what timeline, and for which reports is **not this plan's decision** — it is a coordination question between this plan and the reporting team.
#### Three consumption paths for Power BI
The reporting team's Power BI migration can land on any of three paths. Each has different implications for this plan:
**Path A — Power BI reads from the Snowflake dbt curated layer.**
- *Fit with this plan's architecture:* **best.** Machine data flows through the planned pipeline (equipment → OtOpcUa → layer 3 → ScadaBridge → Redpanda → SnowBridge → Snowflake → dbt → Power BI). The architectural diagram in `## Layered Architecture` above already shows this as the intended shape.
- *What it requires from this plan:* the dbt curated layer must be built to serve **reporting**, not only AI/ML. Likely adds a **reporting-shaped view or semantic layer** on top of the curated layer, tuned for Power BI's query patterns and cross-domain joins. SnowBridge's tag selection must include tags that feed reporting, not only tags that feed the pillar-2 AI use case.
- *What it requires from the reporting team:* capacity and willingness to consume Snowflake as a data source (Power BI has a native Snowflake connector; the learning curve is in the semantic layer, not the connection). Commitment to defer at least the machine-data portion of the BOBJ migration until the dbt curated layer is live — which ties the reporting migration's machine-data cutover to this plan's Year 2+ delivery.
- *Risk:* **timing coupling.** If the reporting team wants to finish their migration inside Year 1, this path doesn't work for machine-data reports. They'd need to hold machine-data reports back and migrate the rest first — which is tenable (reports migrate in waves anyway) but needs agreement.
- *"Not possible before" hook:* Path A opens the door to **cross-domain reports** (machine data joined with MES/ERP data in one query) that BOBJ couldn't easily deliver. This is a strong candidate for pillar 2's "not possible before" use case.
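What a Path-A cross-domain query buys is easiest to see in miniature. The sketch below uses stdlib `sqlite3` as a stand-in for Snowflake, with hypothetical table and column names; in practice both tables would be dbt models and the query itself a curated view.

```python
import sqlite3

# In-memory SQLite stands in for Snowflake; table/column names are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE curated_machine_state (
        machine_id TEXT, state TEXT, minutes REAL);
    CREATE TABLE erp_orders (
        machine_id TEXT, order_id TEXT, planned_minutes REAL);
    INSERT INTO curated_machine_state VALUES
        ('fill-01', 'Running', 420.0),
        ('fill-01', 'Faulted', 35.0);
    INSERT INTO erp_orders VALUES ('fill-01', 'ORD-1001', 480.0);
""")

# The cross-domain join BOBJ could not easily deliver:
# fault time as a share of each order's planned runtime.
row = con.execute("""
    SELECT o.order_id,
           s.minutes / o.planned_minutes AS fault_share
    FROM erp_orders o
    JOIN curated_machine_state s USING (machine_id)
    WHERE s.state = 'Faulted'
""").fetchone()
print(row)
```

The join key (`machine_id`) is the kind of conformed identifier the canonical model declaration exists to guarantee; without it, this query collapses back into the per-system stitching that made cross-domain reporting hard.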
**Path B — Power BI reads from Historian's MSSQL surface directly.**
- *Fit with this plan's architecture:* **neutral.** Historian's SQL interface is its native consumption surface (see [`current-state/legacy-integrations.md`](current-state/legacy-integrations.md) → Deliberately not tracked → Historian SQL reporting consumers). This path is not legacy, not a retirement target.
- *What it requires from this plan:* **nothing.** This plan makes no changes to Historian's MSSQL surface.
- *What it requires from the reporting team:* a pure tool migration (BOBJ → Power BI, same data path). Shortest path to finishing the Power BI migration on the reporting team's preferred timeline.
- *Risk:* **perpetuates the current pattern.** All of the reasons the plan chose a Snowflake-based analytics substrate still apply — cross-domain joins are hard, raw-resolution scale is painful, Historian carries reporting read load on top of its compliance role. Pillar 2's "single analytics substrate" story weakens; the organization ends up running two reporting substrates (Historian SQL for machine data, Snowflake for AI/ML use cases). The machine-data analytics cost moves with Historian rather than with Snowflake's pay-per-use model, which makes the "Snowflake cost story" of this plan less compelling against a baseline that doesn't include reporting load.
- *"Not possible before" hook:* none beyond what Historian SQL already offers.
**Path C — Both, partitioned by report category.**
- *Shape:* compliance/validation reports read Historian directly (because Historian is the authoritative system of record and auditors typically want reports against it); machine-data analytics and cross-domain reports read from Snowflake dbt; reports sourced from Camstar/Delmia/ERP stay on their native connectors. Reports migrate per category.
- *Fit with this plan's architecture:* **pragmatic.** Acknowledges that enterprise reporting is heterogeneous and that one path doesn't fit everything.
- *What it requires from this plan:* Path-A requirements (reporting-shaped dbt layer, tag selection in SnowBridge) for the Snowflake portion. No new requirements for the Historian portion.
- *What it requires from the reporting team:* a published **report-category → data-source** rubric that dev teams can use to place new reports on the right path. Needs governance; otherwise new reports land wherever feels easiest at the time.
- *Risk:* **complexity.** Two semantic layers, two connection paths, two mental models for report authors. Worth it only if the volume of cross-domain / AI-adjacent reporting is high enough to justify Path A alongside Path B.
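The rubric Path C depends on can be as small as a routing table. The category names and targets below are assumptions lifted from this section's own description, not an agreed governance artifact:

```python
# Hypothetical rubric sketch; categories and targets mirror the Path C
# description above, pending a published governance document.
RUBRIC = {
    "compliance": "historian-sql",          # authoritative system of record
    "machine-data": "snowflake-dbt",        # curated layer (Path A portion)
    "cross-domain": "snowflake-dbt",        # machine + MES/ERP joins
    "pure-enterprise": "native-connector",  # Camstar/Delmia/ERP stay put
}


def data_source_for(report_category: str) -> str:
    """Route a new report to its data source; unknown categories fail
    loudly instead of landing on whatever path feels easiest."""
    try:
        return RUBRIC[report_category]
    except KeyError:
        raise ValueError(
            f"uncategorized report: {report_category!r}; "
            "extend the rubric before building"
        ) from None
```

The `ValueError` is the governance hook: a report with no category is a rubric gap to close, not a free choice for the report author.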
#### Recommended position
**Path C (with Path A as the strategic direction).** Expect most machine-data-heavy reports and all cross-domain reports to move to Snowflake (Path A) over Years 2–3 as the dbt curated layer matures; expect compliance reports to stay on Historian's SQL surface (Path B) indefinitely because Historian is the authoritative regulatory system of record and moving compliance reporting off it introduces chain-of-custody questions we don't want to open. Path B is **explicitly** not a retirement target (see the carve-out in the legacy inventory), so "staying" is a valid end state for compliance reporting.
**Why not pure Path A:** forces a needless fight over compliance reports that have no business case for leaving Historian.
**Why not pure Path B:** gives up the cross-domain reporting upside that is one of the most compelling answers to "what does pillar 2 get us that we couldn't do before?"
**Why not leave the decision open:** without a plan position, the reporting team will default to Path B by inertia (it's the shortest path and they're already mid-migration). That locks in the weakest of the three outcomes.
#### Questions to take to the reporting team
Use these to land the coordination conversation. Priority order — the first four are the must-answers:
1. **What's your timeline for completing BOBJ → Power BI?** Specifically, when do you expect to have migrated (a) all non-machine-data reports, (b) machine-data reports that read Historian, and (c) cross-domain reports? This tells us whether holding machine-data reports for Path A is even tenable on your side.
2. **Have you made an architectural decision on Power BI's connection to Historian?** Direct MSSQL link, Power BI gateway + on-prem data source, Azure Analysis Services in front of Historian, dataflows, something else? A decision already baked in may be hard to unwind.
3. **Has Snowflake been evaluated as a Power BI data source?** If yes, what were the findings (cost, performance, semantic modeling effort)? If no, would you be open to an evaluation once the first dbt curated layer is live in Year 2?
4. **Is there a business stakeholder asking for cross-domain reports** (machine data joined with MES/ERP/Camstar data in one report) that BOBJ can't deliver today? A named stakeholder here is the strongest signal that Path A is worth the coordination cost.
5. **What's the rough split of your report inventory** between machine-data-heavy reports, compliance reports, cross-domain reports, and pure-enterprise reports? A rough count is enough — we're not looking for a census, just the shape of the portfolio.
6. **Does the reporting team have capacity to learn Snowflake + dbt semantic modeling?** If that's a deal-breaker, Path A is off the table and we should plan for Path B + a parallel Snowflake analytics stack that non-reporting users consume.
7. **Who owns the decision on Power BI's data sources?** Your team, a BI governance body, IT architecture, the CIO? We need to know who to bring into the Path-A discussion if it progresses.
8. **Would you be willing to pilot one cross-domain report on Snowflake (Path A) during Year 2** as a proof point, independent of the rest of the migration? This is a low-commitment way to validate Path A before betting more reports on it.
#### Decision rubric
After the conversation, place the outcome into one of these buckets:
- **Bucket A — Full Path A commitment.** Reporting team commits to migrating all non-compliance reports to Snowflake over Years 2–3. → Update `roadmap.md` (Snowflake dbt Transform Layer workstream) to include reporting-shaped views in Year 2. Update `goal-state.md` to name cross-domain reporting as a pillar 2 "not possible before" candidate.
- **Bucket B — Path C commitment.** Reporting team commits to the hybrid path with a published report-category rubric. → Same roadmap updates as A, plus document the rubric as a link from this subsection.
- **Bucket C — Path B lock-in.** Reporting team declines Path A for cost, capacity, or timing reasons. → Update `goal-state.md` here to record the decision. No roadmap changes. Pillar 2's "not possible before" use case must come from a different source (e.g., predictive maintenance, OEE anomaly detection) because cross-domain reporting is off the table.
- **Bucket D — Conversation inconclusive.** Reporting team needs more time, or the decision is above their level. → Schedule follow-up. Note which questions were answered and which are still open.
#### What this does NOT decide
- Whether the reporting team completes their Power BI migration (their decision).
- Whether Historian's SQL surface is ever retired (no — it's the compliance system of record).
- Whether this plan's Snowflake dbt layer supports Power BI (yes, it can — the question is only whether the reporting team will consume it).
- Whether the SnowBridge's tag selection is driven by reporting requirements (partly — SnowBridge's selection is governed by blast-radius approval, so reporting-team requests are handled through the same workflow as any other).
_TBD — name and sponsor of the Power BI migration initiative; named owner on the reporting team for this coordination; whether a joint session between this plan's build team and the reporting team has been scheduled; whether a Power BI + Snowflake proof-of-concept can fit into Year 1 as a forward-looking test, independent of the rest of Year 1's scope._
## Non-Goals
- **Operator UX modernization is not a primary goal** of this 3-year plan. Revamping HMIs, operator dashboards, or shopfloor UI frameworks is explicitly deprioritized against the three in-scope pillars (unification, analytics/AI enablement, legacy middleware retirement). If UX work happens, it happens as a by-product of migrations — never as a standalone initiative funded from this plan.
- _TBD — other explicit non-goals (e.g., specific technologies we are not adopting, scope boundaries vs. adjacent programs)._