diff --git a/CLAUDE.md b/CLAUDE.md index 352c2a5..e5331b6 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -17,6 +17,7 @@ Plan content lives in markdown files at the repo root to keep it easy to read an - [`current-state/legacy-integrations.md`](current-state/legacy-integrations.md) — authoritative inventory of **bespoke IT↔OT integrations** that cross ScadaBridge-central outside ScadaBridge. Denominator for pillar 3 retirement. - [`goal-state/digital-twin-management-brief.md`](goal-state/digital-twin-management-brief.md) — meeting-prep artifact for the management conversation that turns "we want digital twins" into a scoped response. Parallel structure to `goal-state.md` → Strategic Considerations → Digital twin. +- [`schemas/`](schemas/) — **Canonical OT equipment definitions seed** (temporary location — see `schemas/README.md`). JSON Schema format definitions, FANUC CNC pilot equipment class, UNS subtree example, and documentation. Contributed by the OtOpcUa team; ownership TBD. To be migrated to a dedicated `schemas` repo once created. ### Output generation pipeline diff --git a/goal-state.md b/goal-state.md index 4d67c15..c6a4125 100644 --- a/goal-state.md +++ b/goal-state.md @@ -463,7 +463,14 @@ _TBD — service name (working title only); hosting (South Bend, alongside Redpa - **Rationale:** OPC UA is the protocol we're fronting, so the auth model stays in OPC UA's own terms. No SASL/OAUTHBEARER bridging, no custom token-exchange glue — OtOpcUa is self-contained and operable with standard OPC UA client tooling. **Inherits the LmxOpcUa auth pattern** — UserName tokens with standard OPC UA security modes/profiles — so the consumer-side experience does not change for clients that used LmxOpcUa previously, and the fold-in is an evolution rather than a rewrite. 
- **Explicitly not federated with the enterprise IdP.** Unlike Redpanda (which uses SASL/OAUTHBEARER against the enterprise IdP) and SnowBridge (which uses the same IdP for RBAC), OtOpcUa does **not** pull enterprise IdP identity into the OT data access path. OT data access is a pure OT concern, and the plan's IT/OT boundary stays at ScadaBridge central — not here. - **Trade-off accepted:** identity lifecycle (user token/cert provisioning, rotation, revocation) is managed locally in the OT estate rather than inherited from the enterprise IdP. Two identity stores to operate (enterprise IdP for IT-facing components, OPC UA-native identities for OtOpcUa) is the cost of keeping the OPC UA layer clean and self-contained. -- **ACL implementation (Year 1 deliverable — required before Tier 1 cutover).** The v2 implementation design surfaced that namespace-level ACLs are not yet modeled. The plan commits to: a per-cluster `EquipmentAcl` table (or equivalent) in the **central configuration database** mapping LDAP-group → permitted Namespace + UnsArea / UnsLine / Equipment subtree + permission level (Read / Write / AlarmAck). ACLs support four granularity levels with inheritance: Namespace → UnsArea → UnsLine → Equipment (grant at UnsArea cascades to all children unless overridden). ACLs are edited through the **Admin UI**, go through the same draft → diff → publish flow as driver/topology config, and are **generation-versioned** for auditability and rollback. The OPC UA NodeManager checks the ACL on every browse / read / write / subscribe against the connected user's LDAP group claims. **This is a substantial missing surface area that must be built before Tier 1 ScadaBridge cutover**, since the "access control / authorization chokepoint" responsibility is the plan's core promise at this layer. +- **Data-path ACL model (designed and committed — lmxopcua decisions #129–132).** The full ACL model is specified in `lmxopcua/docs/v2/acl-design.md`.
Key design points: + - **`NodePermissions` bitmask:** Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall, plus common bundles (`ReadOnly` / `Operator` / `Engineer` / `Admin`). + - **6-level scope hierarchy** with default-deny + additive grants: Cluster → Namespace → UnsArea → UnsLine → Equipment → Tag. Grant at UnsArea cascades to all children unless overridden. Browse-implication on ancestors (granting Read on a child implies Browse on its parents). + - **`NodeAcl` table is generation-versioned** (decision #130) — ACL changes go through draft → diff → publish → rollback like every other content table. + - **Cluster-create seeds default ACLs** matching the v1 LmxOpcUa LDAP-role-to-permission map (decision #131), preserving behavioral parity for v1 → v2 consumer migration. + - **Per-session permission-trie evaluator** with O(depth × group-count) cost; cache invalidated on generation-apply or LDAP group cache expiry. + - **Admin UI:** ACL tab + bulk grant + permission simulator. + - **Phasing:** Phase 1 ships the schema + Admin UI + evaluator unit tests; per-driver enforcement lands in each driver's phase (Phase 2+). **Phase 1 completes before any driver phase**, so the ACL model exists in the central config DB before any driver consumes it — satisfying the "must be working before Tier 1 cutover" timing constraint. 
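The bitmask and additive scope cascade above can be sketched as follows (a minimal illustration: the grant-store shape, helper names, and the omission of override/deny handling are assumptions of this sketch, not the lmxopcua implementation):

```python
from enum import IntFlag

class NodePermissions(IntFlag):
    """Subset of the v2 bitmask; the remaining bits (WriteTune,
    WriteConfigure, AlarmConfirm, AlarmShelve, MethodCall) follow
    the same pattern."""
    NONE = 0
    BROWSE = 1
    READ = 2
    SUBSCRIBE = 4
    HISTORY_READ = 8
    WRITE_OPERATE = 16
    ALARM_READ = 32
    ALARM_ACKNOWLEDGE = 64

# Common bundles named in the design.
READ_ONLY = NodePermissions.BROWSE | NodePermissions.READ | NodePermissions.SUBSCRIBE
OPERATOR = READ_ONLY | NodePermissions.WRITE_OPERATE | NodePermissions.ALARM_ACKNOWLEDGE

def effective_permissions(grants, node_path, groups):
    """Default-deny, additive evaluation: OR together every grant whose
    scope is a prefix of the node's path (Cluster -> Namespace -> UnsArea
    -> UnsLine -> Equipment -> Tag) and whose group matches the session.
    Cost is O(depth x group-count), matching the trie evaluator's bound."""
    perms = NodePermissions.NONE
    for depth in range(len(node_path) + 1):
        scope = tuple(node_path[:depth])  # () is the cluster-level scope
        for group in groups:
            perms |= grants.get((scope, group), NodePermissions.NONE)
    return perms

def can_browse(grants, node_path, groups):
    """Browse-implication: Read granted anywhere below a node implies
    Browse on that node, so clients can navigate down to what they can read."""
    if NodePermissions.BROWSE in effective_permissions(grants, node_path, groups):
        return True
    prefix = tuple(node_path)
    return any(
        scope[: len(prefix)] == prefix and NodePermissions.READ in perm
        for (scope, _group), perm in grants.items()
    )
```

With a single `OPERATOR` grant at a `UnsArea` scope, every descendant line, equipment node, and tag inherits it, while sibling namespaces stay default-deny.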
- _TBD — specific OPC UA security mode + profile combinations required vs allowed; where UserName credentials/certs are sourced from (local site directory, a per-site credential vault, AD/LDAP); rotation cadence; audit trail of authz decisions._ **Open questions (TBD).** @@ -476,7 +483,7 @@ _TBD — service name (working title only); hosting (South Bend, alongside Redpa - **Certificate-distribution pre-cutover step (B3 from v2 corrections).** Before any consumer is cut over at a site, that consumer's OPC UA certificate trust store must be populated with the target OtOpcUa cluster's **per-node certificates and ApplicationUris** (2 per cluster; at Warsaw campuses with per-building clusters, multiply by building count if the consumer needs cross-building visibility). Consumers without pre-loaded trust will fail to connect. **Once a consumer has trusted a node's `ApplicationUri`, changing that `ApplicationUri` requires the consumer to re-establish trust** — this is an OPC UA spec constraint, not an implementation choice. OtOpcUa's Admin UI auto-suggests `urn:{Host}:OtOpcUa` on node creation but warns if `Host` changes later. - **Acceptable double-connection windows.** During each consumer's cutover, a short window of **both old direct connection and new cluster connection** existing at the same time for the same equipment is **tolerated** — it temporarily aggravates the session-load problem the cluster is meant to solve, but keeping the window short (minutes to hours, not days) bounds the exposure. Longer parallel windows are only acceptable for the System Platform cutover where compliance validation may require extended dual-run. - **Rollback posture.** Each consumer's cutover is reversible — if the cluster misbehaves during or immediately after a cutover, the consumer falls back to direct equipment OPC UA, and the cutover is retried after the issue is understood. 
The old direct-connection capability is **not removed** from consumers until all three cutover tiers are complete and stable at a site. - - **Consumer cutover plan needs an owner.** The v2 OtOpcUa implementation design covers building the server, drivers, config, and Admin UI (Phases 0–5) but does **not** address consumer cutover planning. The following are unaddressed and need ownership: per-site cutover sequencing, per-equipment validation methodology (proving consumers see equivalent data through OtOpcUa), rollback procedures, coordination with Aveva for System Platform IO cutover, operational runbooks for consumer connection failures. **Either** the OtOpcUa team adds cutover phases (6/7/8) to the v2 design, **or** an integration / operations team owns the cutover plan separately — in which case this section should name them and link the doc. + - **Consumer cutover plan — owned by a separate integration / operations team (not OtOpcUa).** Per lmxopcua decision #136, consumer cutover is **out of OtOpcUa v2 scope**. The OtOpcUa team's responsibility ends at Phase 5 — all drivers built, all stability protections in place, full Admin UI shipped including the data-path ACL editor. Cutover sequencing per site, validation methodology (proving consumers see equivalent data through OtOpcUa), rollback procedures, coordination with Aveva for System Platform IO cutover (tier 3), and operational runbooks are deliverables of a separate **integration / operations team that has yet to be named**. The handoff's tier 1/2/3 sequencing (above) remains the authoritative high-level roadmap; the implementation-level cutover plan lives outside OtOpcUa's docs. 
_TBD — name the integration/operations team and link their cutover plan doc._ - _TBD — per-site cutover sequencing across the three tiers (all sites reach tier 1 before any reaches tier 2, or one site completes all three tiers before the next site starts), and per-equipment-class criteria for when a System Platform IO cutover requires compliance re-validation; cutover plan owner assignment._ - **Validated-data implication (E2 — Aveva pattern validation needed Year 1 or early Year 2).** System Platform's validated data collection currently uses its own IO path; moving that through OtOpcUa may require validation/re-qualification depending on the regulated context. **Year 1 or early Year 2 research deliverable:** validate with Aveva that System Platform IO drivers support upstream OPC UA-server data sources (OtOpcUa), including any restrictions on security mode, namespace structure, or session model. If Aveva's pattern requires something OtOpcUa doesn't expose, that's a long-lead-time discovery that must surface well before Year 3's Tier 3 cutover. - **Relationship to ScadaBridge's 225k/sec ingestion ceiling** (per `current-state.md`): the cluster's aggregate throughput must be able to feed ScadaBridge at its capacity without becoming a bottleneck — sizing needs to reflect this. @@ -564,7 +571,14 @@ The plan already delivers the infrastructure for a cross-system canonical model This subsection makes that declaration. It is the plan's answer to **Digital Twin Use Cases 1 and 3** (see **Strategic Considerations → Digital twin**) and — independent of digital twin framing — is load-bearing for pillar 2 (analytics/AI enablement) because a canonical model is what makes "not possible before" cross-domain analytics possible at all. -> **Schemas-repo dependency is on the OtOpcUa critical path (B2 from v2 corrections).** The `schemas` repo does not exist yet. 
Until it does, OtOpcUa equipment configurations are hand-curated per-equipment with no class templates, no auto-generated tag lists, no cross-cluster consistency checks, and no signal-validation contract for Layer 3 state derivation. The plan commits to **schemas-repo creation as a Year 1 deliverable** (its own scope, distinct from the OtOpcUa workstream) with a **pilot equipment class (FANUC CNC)** landed in the repo before Tier 1 cutover begins. The **UNS hierarchy snapshot** (a per-site equipment-instance walk) feeds the initial schemas-repo equipment-class list and hierarchy definition. Core driver scope is already resolved by the v2 implementation team's committed driver list. +> **Schemas-repo dependency — partially resolved.** The OtOpcUa team has contributed an initial seed at [`schemas/`](../schemas/) (temporary location in the 3-year-plan repo until the dedicated `schemas` repo is created — Gitea push-to-create is disabled). The seed includes: JSON Schema format definitions (`format/equipment-class.schema.json`, `format/tag-definition.schema.json`, `format/uns-subtree.schema.json`), the **FANUC CNC pilot equipment class** (`classes/fanuc-cnc.json` — 16 signals + 3 alarm definitions + state-derivation notes), a worked UNS subtree example (`uns/example-warsaw-west.json`), and documentation (`docs/overview.md`, `docs/format-decisions.md` with 8 numbered decisions, `docs/consumer-integration.md`). The **UNS hierarchy snapshot** (a per-site equipment-instance walk) feeds the initial hierarchy definition. Core driver scope is already resolved by the v2 implementation team's committed driver list. 
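>
> A purely illustrative sketch of the shape an equipment-class entry might take (every field name here is a placeholder; the authoritative format is `format/equipment-class.schema.json`, and the real `classes/fanuc-cnc.json` carries the full 16-signal, 3-alarm definition):
>
> ```json
> {
>   "classId": "fanuc-cnc",
>   "protocol": "FOCAS",
>   "signals": [
>     { "name": "SpindleSpeedActual", "dataType": "Double", "unit": "rpm" },
>     { "name": "ProgramNumber", "dataType": "Int32" }
>   ],
>   "alarms": [
>     { "name": "ServoAlarm", "severity": "High" }
>   ],
>   "stateDerivation": {
>     "notes": "Layer 3 mapping to the canonical Running/Idle/Faulted/Starved/Blocked vocabulary"
>   }
> }
> ```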
+> +> **Still needs cross-team ownership:** +> - Name an owner team for the schemas content (it's consumed by OT and IT systems alike — OtOpcUa, Redpanda, dbt) +> - Decide whether to move to a dedicated `gitea.dohertylan.com/dohertj2/schemas` repo (proposed) or keep as a 3-year-plan sub-tree +> - Ratify or revise the 8 format decisions in `schemas/docs/format-decisions.md` +> - Establish the CI gate for JSON Schema validation +> - Decide on consumer-integration plumbing for Redpanda Protobuf code-gen and dbt macro generation per `schemas/docs/consumer-integration.md` > **Unified Namespace framing:** this canonical model is also the plan's **Unified Namespace** (UNS) — see **Target IT/OT Integration → Unified Namespace (UNS) posture**. The UNS posture is a higher-level framing of the same mechanics described here: this section specifies the canonical model mechanically; the UNS posture explains what stakeholders asking about UNS should understand about how the plan delivers the UNS value proposition without an MQTT/Sparkplug broker. diff --git a/roadmap.md b/roadmap.md index 9f46ebe..da3b422 100644 --- a/roadmap.md +++ b/roadmap.md @@ -63,7 +63,7 @@ The roadmap is laid out as a 2D grid — **workstreams** (rows) crossed with **y | Workstream | **Year 1 — Foundation** | **Year 2 — Scale** | **Year 3 — Completion** | |---|---|---|---| -| **OtOpcUa** | **Evolve LmxOpcUa into OtOpcUa** — extend the existing in-house OPC UA server to add (a) a new equipment namespace with single session per equipment via native protocols translated to OPC UA (committed core drivers: OPC UA Client, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, plus Galaxy carried forward), and (b) clustering (non-transparent redundancy, 2-node per site) on top of the existing per-node deployment. 
**Driver stability tiers:** Tier A in-process (Modbus, OPC UA Client), Tier B in-process with guards (S7, AB CIP, AB Legacy, TwinCAT), Tier C out-of-process (Galaxy — bitness constraint, FOCAS — uncatchable AVE). Core driver list confirmed by v2 implementation team (protocol survey no longer needed for driver scoping). **UNS hierarchy snapshot walk** — per-site equipment-instance discovery (site/area/line/equipment + UUID assignment) to feed the initial schemas-repo hierarchy definition and canonical model; target done Q1–Q2. **Build ACL surface** (per-cluster `EquipmentAcl` table, Admin UI, OPC UA NodeManager enforcement) — required before tier-1 cutover. **Deploy OtOpcUa to every site** as fast as practical. **Begin tier 1 cutover (ScadaBridge)** at large sites. **Prerequisite: certificate-distribution** to consumer trust stores before each cutover. **Aveva System Platform IO pattern validation** — Year 1 or early Year 2 research to confirm Aveva supports upstream OPC UA data sources, well ahead of Year 3 tier 3. _TBD — survey owner; first-cutover site selection; cutover plan owner (OtOpcUa team or integration team); enterprise shortname for UNS hierarchy root._ | **Complete tier 1 (ScadaBridge)** across all sites. **Begin tier 2 (Ignition)** — Ignition consumers redirected from direct-equipment OPC UA to each site's OtOpcUa, collapsing WAN session counts from *N per equipment* to *one per site*. **Build long-tail drivers** on demand as sites require them. Resolve Warsaw per-building multi-cluster consumer-addressing pattern (consumer-side stitching vs site-aggregator OtOpcUa instance). _TBD — per-site tier-2 rollout sequence._ | **Complete tier 2 (Ignition)** across all sites. **Execute tier 3 (Aveva System Platform IO)** with compliance stakeholder validation — the hardest cutover because System Platform IO feeds validated data collection. Reach steady state: every equipment session is held by OtOpcUa, every downstream consumer reads OT data through it. 
_TBD — per-equipment-class criteria for System Platform IO re-validation._ | +| **OtOpcUa** | **Evolve LmxOpcUa into OtOpcUa** — extend the existing in-house OPC UA server to add (a) a new equipment namespace with single session per equipment via native protocols translated to OPC UA (committed core drivers: OPC UA Client, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, plus Galaxy carried forward), and (b) clustering (non-transparent redundancy, 2-node per site) on top of the existing per-node deployment. **Driver stability tiers:** Tier A in-process (Modbus, OPC UA Client), Tier B in-process with guards (S7, AB CIP, AB Legacy, TwinCAT), Tier C out-of-process (Galaxy — bitness constraint, FOCAS — uncatchable AVE). Core driver list confirmed by v2 implementation team (protocol survey no longer needed for driver scoping). **UNS hierarchy snapshot walk** — per-site equipment-instance discovery (site/area/line/equipment + UUID assignment) to feed the initial schemas-repo hierarchy definition and canonical model; target done Q1–Q2. **ACL model designed and committed** (decisions #129–132): 6-level scope hierarchy, `NodePermissions` bitmask, generation-versioned `NodeAcl` table, Admin UI + permission simulator. Phase 1 ships before any driver phase. **Deploy OtOpcUa to every site** as fast as practical. **Begin tier 1 cutover (ScadaBridge)** at large sites. **Prerequisite: certificate-distribution** to consumer trust stores before each cutover. **Aveva System Platform IO pattern validation** — Year 1 or early Year 2 research to confirm Aveva supports upstream OPC UA data sources, well ahead of Year 3 tier 3. _TBD — first-cutover site selection; **cutover plan owner** (not OtOpcUa — a separate integration/operations team, per decision #136, not yet named); enterprise shortname for UNS hierarchy root; schemas-repo owner team and dedicated repo creation._ | **Complete tier 1 (ScadaBridge)** across all sites. 
**Begin tier 2 (Ignition)** — Ignition consumers redirected from direct-equipment OPC UA to each site's OtOpcUa, collapsing WAN session counts from *N per equipment* to *one per site*. **Build long-tail drivers** on demand as sites require them. Resolve Warsaw per-building multi-cluster consumer-addressing pattern (consumer-side stitching vs site-aggregator OtOpcUa instance). _TBD — per-site tier-2 rollout sequence._ | **Complete tier 2 (Ignition)** across all sites. **Execute tier 3 (Aveva System Platform IO)** with compliance stakeholder validation — the hardest cutover because System Platform IO feeds validated data collection. Reach steady state: every equipment session is held by OtOpcUa, every downstream consumer reads OT data through it. _TBD — per-equipment-class criteria for System Platform IO re-validation._ | | **Redpanda EventHub** | Stand up central Redpanda cluster in South Bend (single-cluster HA). Stand up bundled Schema Registry. Wire SASL/OAUTHBEARER to enterprise IdP. Create initial topic set (prefix-based ACLs). Hook up observability minimum signal set. Define the three retention tiers (`operational`/`analytics`/`compliance`). **Stand up the central `schemas` repo** with `buf` CI, CODEOWNERS, and the NuGet publishing pipeline. **Publish the canonical equipment/production/event model v1** — including the canonical machine state vocabulary (`Running / Idle / Faulted / Starved / Blocked` + any agreed additions) as a Protobuf enum, the `equipment.state.transitioned` event schema, and initial equipment-class definitions for pilot equipment. This is the foundation for Digital Twin Use Cases 1 and 3 (see `goal-state.md` → Strategic Considerations → Digital twin) and is load-bearing for pillar 2. **Pilot equipment class for canonical definition: FANUC CNC** (pre-defined FOCAS2 hierarchy already exists in OtOpcUa v2 driver design). Land the FANUC CNC class template in the schemas repo before Tier 1 cutover begins. 
_TBD — sizing decisions, initial topic list, canonical vocabulary ownership (domain SME group)._ | Expand topic coverage as additional domains onboard. Enforce tiered retention and ACLs at scale. Prove backlog replay after a WAN-outage drill (also exercises the Digital Twin Use Case 2 simulation-lite replay path). Exercise long-outage planning (ScadaBridge queue capacity vs. outage duration). Iterate the canonical model as additional equipment classes and domains onboard. _TBD — concrete drill cadence._ | Steady-state operation. Harden alerting and runbooks against the observed failure modes from Years 1–2. Canonical model is mature and covers every in-scope equipment class; schema changes are routine rather than foundational. | | **SnowBridge** | Design and begin custom build in .NET. **Filtered, governed upload to Snowflake is the Year 1 purpose** — the service is the component that decides which topics/tags flow to Snowflake, applies the governed selection model, and writes into Snowflake. Ship an initial version with **one working source adapter** — starting with **Aveva Historian (SQL interface)** because it's central-only, exists today, and lets the workstream progress in parallel with Redpanda rather than waiting on it. First end-to-end **filtered** flow to Snowflake landing tables on a handful of priority tags. Selection model in place even if the operator UI isn't yet (config-driven is acceptable for Year 1). _TBD — team, credential management, datastore for selection state._ | Add the **ScadaBridge/Redpanda source adapter** alongside Historian. Build and ship the operator **web UI + API** on top of the Year 1 selection model, including the blast-radius-based approval workflow, audit trail, RBAC, and exportable state. Onboard priority tags per domain under the UI-driven governance path. _TBD — UI framework._ | All planned source adapters live behind the unified interface. Approval workflow tuned based on Year 2 operational experience. 
Feature freeze; focus on hardening. | | **Snowflake dbt Transform Layer** | Scaffold a dbt project in git, wired to the self-hosted orchestrator (per `goal-state.md`; specific orchestrator chosen outside this plan). Build first **landing → curated** model for priority tags. **Align curated views with the canonical model v1** published in the `schemas` repo — equipment, production, and event entities in the curated layer use the canonical state vocabulary and the same event-type enum values, so downstream consumers (Power BI, ad-hoc analysts, future AI/ML) see the same shape of data Redpanda publishes. This is the dbt-side delivery for Digital Twin Use Cases 1 and 3. Establish `dbt test` discipline from day one — including tests that catch divergence between curated views and the canonical enums. _TBD — project layout (single vs per-domain); reconciliation rule if derived state in curated views disagrees with the layer-3 derivation (should not happen, but the rule needs to exist)._ | Build curated layers for all in-scope domains. **Ship a canonical-state-based OEE model** as a strong candidate for the pillar-2 "not possible before" use case — accurate cross-equipment, cross-site OEE computed once in dbt from the canonical state stream, rather than re-derived in every reporting surface. Source-freshness SLAs tied to the **≤15-minute analytics** budget. Begin development of the first **"not possible before" AI/analytics use case** (pillar 2). | The "not possible before" use case is **in production**, consuming the curated layer, meeting its own SLO. Pillar 2 check passes. |
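
The enum-divergence tests committed in the dbt row can start as built-in generic tests pinning curated state columns to the canonical vocabulary. A minimal sketch (model and column names are placeholders; the value list must track the Protobuf enum in the `schemas` repo, including any agreed additions):

```yaml
# Hypothetical dbt schema file; model and column names are illustrative.
version: 2
models:
  - name: curated_equipment_state
    columns:
      - name: state
        tests:
          - not_null
          - accepted_values:
              values: ['Running', 'Idle', 'Faulted', 'Starved', 'Blocked']
```

`accepted_values` fails the build the moment a curated view emits a state outside the canonical set, which is exactly the divergence signal the plan asks the `dbt test` discipline to provide.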