Compare commits

`master` ... `1189dc87fd` (9 commits)

| Author | SHA1 | Date |
|--------|------|------|
| | `1189dc87fd` | |
| | `0307a498ed` | |
| | `d8fa3a0ddd` | |
| | `4903a19ec9` | |
| | `2a6c9828e4` | |
| | `592fa79e3c` | |
| | `a59ad2e0c6` | |
| | `449bc454b7` | |
| | `a1e79cdb06` | |

**.gitignore** (vendored, 3 changed lines)

    @@ -26,3 +26,6 @@ packages/
    *.pfx
    *.pem

    # Claude Code (per-developer settings, runtime lock files, agent transcripts)
    .claude/

**docs/v2/acl-design.md** (new file, 379 lines)

@@ -0,0 +1,379 @@

# OPC UA Client Authorization (ACL Design) — OtOpcUa v2

> **Status**: DRAFT — closes corrections-doc finding B1 (namespace / equipment-subtree ACLs not yet modeled in the data path).
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Scope

This document defines the **OPC UA client data-path authorization model** — who can read, write, subscribe, browse, ack alarms, etc. on which nodes when connecting to the OtOpcUa server endpoint. It is distinct from:

- **Admin UI authorization** (`admin-ui.md`) — who can edit configuration. That layer has FleetAdmin / ConfigEditor / ReadOnly roles + cluster-scoped grants per decisions #88, #105.
- **DB principal authorization** (`config-db-schema.md` §"Authorization Model") — who can call which stored procedures on the central config DB. That layer is per-NodeId for cluster nodes and per-Admin for Admin app users.

The data-path ACL layer covers OPC UA clients (ScadaBridge, Ignition, System Platform IO, third-party tools) that connect to the OPC UA endpoint to read or modify equipment data.

## Permission Model

Every node operation requires an explicit permission. Permissions are bitmask flags on the v2 schema; the OPC UA NodeManager checks them on every browse, read, write, subscribe, history read, alarm event, and method call.

### Permission flags

```csharp
[Flags]
public enum NodePermissions : uint
{
    None = 0,

    // Read-side
    Browse      = 1 << 0,  // See node in BrowseRequest results
    Read        = 1 << 1,  // ReadRequest current value
    Subscribe   = 1 << 2,  // CreateMonitoredItems
    HistoryRead = 1 << 3,  // HistoryReadRaw / HistoryReadProcessed

    // Write-side (mirrors v1 SecurityClassification model — see config-db-schema.md Equipment ACL)
    WriteOperate   = 1 << 4,  // Write attrs with FreeAccess/Operate classification
    WriteTune      = 1 << 5,  // Write attrs with Tune classification
    WriteConfigure = 1 << 6,  // Write attrs with Configure classification

    // Alarm-side
    AlarmRead        = 1 << 7,   // Receive alarm events for this node
    AlarmAcknowledge = 1 << 8,   // Ack alarms (separate from Confirm — OPC UA Part 9 distinction)
    AlarmConfirm     = 1 << 9,   // Confirm alarms
    AlarmShelve      = 1 << 10,  // Shelve / unshelve alarms

    // Method invocation (OPC UA Part 4 §5.11)
    MethodCall = 1 << 11,  // Invoke methods on the node

    // Common bundles (also exposed in Admin UI as one-click selections)
    ReadOnly = Browse | Read | Subscribe | HistoryRead | AlarmRead,
    Operator = ReadOnly | WriteOperate | AlarmAcknowledge | AlarmConfirm,
    Engineer = Operator | WriteTune | AlarmShelve,
    Admin    = Engineer | WriteConfigure | MethodCall,
}
```

The bundles (`ReadOnly` / `Operator` / `Engineer` / `Admin`) are derived from production patterns at sites running v1 LmxOpcUa — they're the common grant shapes operators reach for. Granular per-flag grants stay supported for unusual cases.

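As a sanity check on the layout, the flags and bundles can be mirrored in a few lines; a minimal sketch in Python (a stand-in for the C# enum above, same names and bit positions):

```python
from enum import IntFlag

class NodePermissions(IntFlag):
    """Python stand-in for the C# [Flags] enum above."""
    NONE = 0
    BROWSE = 1 << 0
    READ = 1 << 1
    SUBSCRIBE = 1 << 2
    HISTORY_READ = 1 << 3
    WRITE_OPERATE = 1 << 4
    WRITE_TUNE = 1 << 5
    WRITE_CONFIGURE = 1 << 6
    ALARM_READ = 1 << 7
    ALARM_ACKNOWLEDGE = 1 << 8
    ALARM_CONFIRM = 1 << 9
    ALARM_SHELVE = 1 << 10
    METHOD_CALL = 1 << 11
    # Bundles are plain unions of the flags above; no extra mechanism needed
    READ_ONLY = BROWSE | READ | SUBSCRIBE | HISTORY_READ | ALARM_READ
    OPERATOR = READ_ONLY | WRITE_OPERATE | ALARM_ACKNOWLEDGE | ALARM_CONFIRM
    ENGINEER = OPERATOR | WRITE_TUNE | ALARM_SHELVE
    ADMIN = ENGINEER | WRITE_CONFIGURE | METHOD_CALL

# A bundle grant is indistinguishable from granting its flags individually
assert NodePermissions.OPERATOR & NodePermissions.WRITE_OPERATE
```

Because the bundles are plain unions, granting a bundle and granting its constituent flags one by one produce identical bitmasks, which is what lets the Admin UI expose them as one-click selections without special-casing them in the evaluator.
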
### Why three Write tiers (Operate / Tune / Configure)

Mirrors v1's `SecurityClassification` mapping (`docs/DataTypeMapping.md`). Galaxy attributes carry a security classification; v1 maps `FreeAccess`/`Operate` to writable, `SecuredWrite`/`VerifiedWrite`/`ViewOnly` to read-only. The v2 model preserves this for Galaxy and extends it to all drivers via `Tag.SecurityClassification`:

| Classification | Permission needed to write |
|----------------|----------------------------|
| FreeAccess | `WriteOperate` |
| Operate | `WriteOperate` |
| Tune | `WriteTune` |
| Configure | `WriteConfigure` |
| SecuredWrite / VerifiedWrite / ViewOnly | (not writable from OPC UA — v1 behavior preserved) |

A user with `WriteTune` can also write Operate-classified attrs (Tune is the more privileged tier). The check is `requestedClassification ≤ grantedTier`.

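A sketch of that ordering check, assuming tiers and classifications are compared by rank (`can_write` and the rank tables are illustrative names, not the real NodeManager API):

```python
# Rank each write tier; a granted tier covers every classification at or below its rank.
WRITE_TIER_RANK = {"WriteOperate": 1, "WriteTune": 2, "WriteConfigure": 3}

# SecurityClassification -> rank required to write (None = never writable from OPC UA)
CLASSIFICATION_RANK = {
    "FreeAccess": 1,
    "Operate": 1,
    "Tune": 2,
    "Configure": 3,
    "SecuredWrite": None,
    "VerifiedWrite": None,
    "ViewOnly": None,
}

def can_write(classification: str, granted_tiers: set[str]) -> bool:
    """requestedClassification <= grantedTier, per the table above."""
    required = CLASSIFICATION_RANK[classification]
    if required is None:
        return False  # v1 behavior preserved: not writable over OPC UA
    granted = max((WRITE_TIER_RANK[t] for t in granted_tiers), default=0)
    return required <= granted

assert can_write("Operate", {"WriteTune"})        # Tune covers Operate
assert not can_write("Configure", {"WriteTune"})  # Tune does not cover Configure
```
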
### Why AlarmRead is separate from Read

In OPC UA Part 9, alarm subscriptions are a distinct subscription type — a client can subscribe to events on a node without reading its value. Granting Read alone does not let a client see alarm events; AlarmRead is required separately. The `ReadOnly` bundle includes both.

### Why MethodCall is separate

OPC UA methods (Part 4 §5.11) are arbitrary procedure invocations on a node. v1 LmxOpcUa exposes very few; future drivers (especially the OPC UA Client gateway) will surface more. MethodCall is gated explicitly because side effects can be unbounded — analogous to executing a stored procedure rather than reading a column.

## Scope Hierarchy

ACL grants attach to one of six scope levels. A grant at a higher level cascades to the levels below it (with Browse implied at ancestors); explicit Deny at a lower level is **deferred to v2.1** (decision below).

```
Cluster                      ← cluster-wide grant (highest scope)
└── Namespace                ← per-namespace grant (Equipment vs SystemPlatform vs Simulated)
    └── UnsArea              ← per-area grant (Equipment-namespace only)
        └── UnsLine          ← per-line grant
            └── Equipment    ← per-equipment grant
                └── Tag      ← per-tag grant (lowest scope; rarely used)
```

For SystemPlatform-namespace tags (no Equipment row, no UNS structure), the chain shortens to:

```
Cluster
└── Namespace
    └── (Tag's FolderPath segments — treated as opaque hierarchy)
        └── Tag
```

### Inheritance and evaluation

For each operation on a node:

1. Walk the node's scope chain from leaf to root (`Tag → Equipment → UnsLine → UnsArea → Namespace → Cluster`)
2. At each level, look up `NodeAcl` rows where `LdapGroup ∈ user.Groups` and `(ScopeKind, ScopeId)` matches
3. Union the `PermissionFlags` from every matching row
4. Required permission must be set in the union → allow; else → deny
5. Browse is implied at every ancestor of any node where the user has any non-Browse permission — otherwise the user can't navigate to it

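The five steps above can be sketched as follows, with permissions as plain int bitmasks and `acl_rows` as a hypothetical stand-in for the generation's `NodeAcl` rows:

```python
BROWSE, READ = 1 << 0, 1 << 1  # flag values from the NodePermissions enum

# (ldap_group, scope_kind, scope_id) -> permission bitmask; stand-in for NodeAcl rows
acl_rows = {
    ("Operators", "Cluster", None): BROWSE | READ,
    ("Line3Crew", "UnsLine", "line-3"): READ,
}

def effective_permissions(scope_chain, user_groups):
    """Steps 1-3: walk leaf-to-root, union every matching row's flags."""
    union = 0
    for scope_kind, scope_id in scope_chain:  # Tag -> ... -> Cluster
        for group in user_groups:
            union |= acl_rows.get((group, scope_kind, scope_id), 0)
    return union

def authorize(scope_chain, user_groups, required):
    """Step 4: all required bits must be present in the union; else deny."""
    return (effective_permissions(scope_chain, user_groups) & required) == required

chain = [("Tag", "temp-01"), ("Equipment", "cnc-05"), ("UnsLine", "line-3"),
         ("UnsArea", "bldg-3"), ("Namespace", "Equipment-NS"), ("Cluster", None)]
assert authorize(chain, {"Operators"}, READ)        # cluster-wide grant cascades down
assert not authorize(chain, {"Line3Crew"}, BROWSE)  # default-deny: Browse never granted
```

Note the empty-union case needs no special handling: with no matching rows the union is `0`, so any non-zero requirement fails, which is exactly the default-deny rule below.
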
### Default-deny

If the union is empty (none of the user's groups has a grant matching any level of the node's chain), the operation is **denied**:

- Browse → node hidden from results
- Read / Subscribe / HistoryRead → `BadUserAccessDenied`
- Write → `BadUserAccessDenied`
- AlarmAck / AlarmConfirm / AlarmShelve → `BadUserAccessDenied`
- MethodCall → `BadUserAccessDenied`

### Why no explicit Deny in v2.0

Two patterns can express "X group can write everywhere except production line 3":

- **(a)** Verbose: grant Engineering on every line except line 3 — many rows but unambiguous
- **(b)** Explicit Deny that overrides Grant — fewer rows but evaluation logic must distinguish "no grant" from "explicit deny"

For v2.0 fleets (≤50 clusters, ≤20 lines per cluster typical) approach (a) is workable — operators use the bulk-grant Admin UI flow to apply grants across many lines minus exceptions. Explicit Deny adds non-trivial complexity to the evaluator and the Admin UI; defer to v2.1 unless a deployment demonstrates a real need.

## Schema — `NodeAcl` Table

Generation-versioned (decision #105 pattern — ACLs are content, travel through draft → diff → publish like every other consumer-visible config):

```sql
CREATE TABLE dbo.NodeAcl (
    NodeAclRowId    uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId    bigint           NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    NodeAclId       nvarchar(64)     NOT NULL,  -- stable logical ID across generations
    ClusterId       nvarchar(64)     NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    LdapGroup       nvarchar(256)    NOT NULL,  -- LDAP group name (e.g. "OtOpcUaOperators-LINE3")
    ScopeKind       nvarchar(16)     NOT NULL CHECK (ScopeKind IN ('Cluster', 'Namespace', 'UnsArea', 'UnsLine', 'Equipment', 'Tag')),
    ScopeId         nvarchar(64)     NULL,      -- NULL when ScopeKind='Cluster'; otherwise the logical ID of the scoped entity
    PermissionFlags int              NOT NULL,  -- bitmask of NodePermissions
    Notes           nvarchar(512)    NULL
);

CREATE INDEX IX_NodeAcl_Generation_Cluster
    ON dbo.NodeAcl (GenerationId, ClusterId);
CREATE INDEX IX_NodeAcl_Generation_Group
    ON dbo.NodeAcl (GenerationId, LdapGroup);
CREATE INDEX IX_NodeAcl_Generation_Scope
    ON dbo.NodeAcl (GenerationId, ScopeKind, ScopeId) WHERE ScopeId IS NOT NULL;
CREATE UNIQUE INDEX UX_NodeAcl_Generation_LogicalId
    ON dbo.NodeAcl (GenerationId, NodeAclId);

-- Within a generation, a (Group, Scope) pair has at most one row (additive grants would be confusing
-- in the audit trail; use a single row with the union of intended permissions instead)
CREATE UNIQUE INDEX UX_NodeAcl_Generation_GroupScope
    ON dbo.NodeAcl (GenerationId, ClusterId, LdapGroup, ScopeKind, ScopeId);
```

### Cross-generation invariant

Same pattern as Equipment / Namespace: `NodeAclId` is append-only per cluster — once published, the logical ID stays bound to its `(LdapGroup, ScopeKind, ScopeId)` triple. Renaming an LDAP group is forbidden — disable the old grant and create a new one. This protects the audit trail.

### Validation in `sp_ValidateDraft`

Adds these checks beyond the existing schema rules:

- **ScopeId resolution**: when `ScopeKind ∈ {Namespace, UnsArea, UnsLine, Equipment, Tag}`, `ScopeId` must resolve to the corresponding entity in the same generation
- **Cluster cohesion**: the resolved scope must belong to the same `ClusterId` as the ACL row
- **PermissionFlags validity**: the bitmask must only contain bits defined in the `NodePermissions` enum (no future-bit speculation)
- **LdapGroup format**: non-empty, ≤256 chars, no characters that would break LDAP DN escaping (allowlist)
- **No identity drift**: once a `NodeAclId` is published with a given `(LdapGroup, ScopeKind, ScopeId)` triple, none of those columns may change in a future generation

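The PermissionFlags validity rule reduces to one mask comparison; a sketch assuming the twelve flags defined earlier (bits 0 through 11):

```python
# All bits currently defined in NodePermissions (Browse .. MethodCall = bits 0..11)
ALL_DEFINED_FLAGS = (1 << 12) - 1

def permission_flags_valid(flags: int) -> bool:
    """Reject negative values and any bit outside the defined enum range."""
    return flags >= 0 and (flags & ~ALL_DEFINED_FLAGS) == 0

assert permission_flags_valid(0b1011)        # Browse | Read | HistoryRead
assert not permission_flags_valid(1 << 12)   # future-bit speculation rejected
```

The mask constant has to track the enum: adding a thirteenth flag in a later version means widening `ALL_DEFINED_FLAGS` in the same change, or the validator will reject the new bit.
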
## Evaluation Algorithm

### At session establishment

```
on AcceptSession(user):
    user.Groups = LdapAuth.ResolveGroups(user.Token)
    user.PermissionMap = BuildEffectivePermissionMap(currentGeneration, user.Groups)
    cache user.PermissionMap on the session
```

`BuildEffectivePermissionMap` produces a sparse trie keyed by node-path prefix:

```
PermissionMap structure (per session):
/                                                    → grant union from Cluster + Namespace-level rows
/Equipment-NS/UnsArea-A/                             → adds UnsArea-level grants
/Equipment-NS/UnsArea-A/UnsLine-1/                   → adds UnsLine-level grants
/Equipment-NS/UnsArea-A/UnsLine-1/Equipment-X/       → adds Equipment-level grants
/Equipment-NS/UnsArea-A/UnsLine-1/Equipment-X/Tag-Y  → adds Tag-level grants (rare)
```

Lookup for a node at path P: walk the trie from `/` to P, OR-ing PermissionFlags at each visited level. Result = effective permissions for P. O(depth) — typically 6 or fewer hops.

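That walk can be sketched with a dict keyed by path prefix (illustrative values; the real structure is a per-session trie that visits only the path's ancestors):

```python
BROWSE, READ, WRITE_OPERATE = 1 << 0, 1 << 1, 1 << 4  # flag values from the enum

# Path prefix -> flags added at that level (illustrative grants)
permission_map = {
    "/": BROWSE | READ,                         # Cluster + Namespace rows
    "/Equipment-NS/UnsArea-A/": WRITE_OPERATE,  # UnsArea-level grant
}

def lookup(path: str) -> int:
    """OR the flags of every prefix that covers the path.

    The real trie walks only the path's ancestors (O(depth));
    this sketch scans all entries for brevity.
    """
    effective = 0
    for prefix, flags in permission_map.items():
        if path.startswith(prefix):
            effective |= flags
    return effective

assert lookup("/Equipment-NS/UnsArea-A/UnsLine-1/") == BROWSE | READ | WRITE_OPERATE
assert lookup("/SystemPlatform-NS/SomeTag") == BROWSE | READ
```
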
### Per-operation check

```csharp
bool Authorize(SessionContext ctx, NodePath path, NodePermissions required)
{
    var effective = ctx.PermissionMap.Lookup(path);
    return (effective & required) == required;
}
```

- Browse: `Authorize(ctx, path, Browse)` — false → omit from results
- Read: `Authorize(ctx, path, Read)` — false → `BadUserAccessDenied`
- Write: `Authorize(ctx, path, requiredWriteFlag)` where `requiredWriteFlag` is derived from the target attribute's `SecurityClassification`
- Subscribe: `Authorize(ctx, path, Subscribe)` — also implies Browse on the path
- HistoryRead: `Authorize(ctx, path, HistoryRead)`
- Alarm event: `Authorize(ctx, path, AlarmRead)` — events for unauthorized nodes are filtered out before delivery
- AlarmAck/Confirm/Shelve: corresponding flag check
- MethodCall: `Authorize(ctx, methodNode.path, MethodCall)`

### Cache invalidation

The session's `PermissionMap` is rebuilt when:

- A new config generation is applied locally (the path trie may have changed structure due to a UNS reorg or new equipment)
- The LDAP group cache for the user expires (default: 15 min — driven by the LDAP layer, separate from this design)
- The user's session is re-established

For unattended consumer connections (ScadaBridge, Ignition) that hold long sessions, the per-generation rebuild keeps permissions current without forcing reconnects.

## Performance

Worst-case per-operation cost: O(depth × group-count). For typical fleet sizes (10 LDAP groups per user, 6-deep UNS path), that's ~60 trie lookups per operation — sub-microsecond on modern hardware. The session-scoped cache means the per-operation hot path is array indexing, not DB queries.

Build cost (at session establishment or generation reapply): O(N_acl × M_groups) for N_acl rows and M_groups groups in the user's claim set. For 1000 ACL rows × 10 groups = 10k comparisons; sub-second on any reasonable DB.

Memory cost: the per-session trie is ~4 KB for typical scopes, bounded by O(N_acl) worst case. Sessions hold their own trie — no shared state to invalidate.

## Default Permissions for Existing v1 LDAP Groups

To preserve v1 LmxOpcUa behavior on first migration, the v2 default ACL set on cluster creation maps the existing v1 LDAP-role-to-permission grants:

| v1 LDAP role (per `Security.md`) | v2 NodePermissions bundle | Scope |
|----------------------------------|---------------------------|-------|
| `ReadOnly` (group: `OtOpcUaReadOnly`) | `ReadOnly` bundle | Cluster |
| `WriteOperate` (group: `OtOpcUaWriteOperate`) | `Operator` bundle | Cluster |
| `WriteTune` (group: `OtOpcUaWriteTune`) | `Engineer` bundle | Cluster |
| `WriteConfigure` (group: `OtOpcUaWriteConfigure`) | `Admin` bundle | Cluster |
| `AlarmAck` (group: `OtOpcUaAlarmAck`) | adds `AlarmAcknowledge \| AlarmConfirm` to user's existing grants | Cluster |

These are seeded by the cluster-create workflow into the initial draft generation (per decision #123 — namespaces and ACLs both travel through the publish boundary). Operators can then refine to per-Equipment scopes as needed.

## Admin UI

### New tab: ACLs (under Cluster Detail)

```
/clusters/{ClusterId}    Cluster detail (tabs: Overview / Namespaces / UNS Structure / Drivers / Devices / Equipment / Tags / **ACLs** / Generations / Audit)
```

Two views, toggle at the top:

#### View 1 — By LDAP group

| LDAP Group | Scopes | Permissions | Notes |
|------------|--------|-------------|-------|
| `OtOpcUaOperators` | Cluster | Operator bundle | Default operators (seeded) |
| `OtOpcUaOperators-LINE3` | UnsArea bldg-3 | Engineer bundle | Line 3 supervisors |
| `OtOpcUaScadaBridge` | Cluster | ReadOnly | Tier 1 consumer (added before cutover) |

Click a row → edit grant: change scope, change permission set (one-click bundles or per-flag), edit notes.

#### View 2 — By scope (UNS tree)

Tree view of UnsArea → UnsLine → Equipment with permission badges per node showing which groups have what:

```
bldg-3/                        [Operators: Operator, ScadaBridge: ReadOnly]
├── line-2/                    [+ LINE3-Supervisors: Engineer]
│   ├── cnc-mill-05            [+ CNC-Maintenance: WriteTune]
│   ├── cnc-mill-06
│   └── injection-molder-02
└── line-3/
```

Click a node → see effective permissions per group, edit grants at that scope.

### Bulk grant flow

"Bulk grant" button on either view:

1. Pick LDAP group(s)
2. Pick permission bundle or per-flag
3. Pick scope set: pattern (e.g. all UnsArea matching `bldg-*`), or multi-select from tree
4. Preview: list of `NodeAcl` rows that will be created
5. Confirm → adds rows to current draft

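Step 4's preview is a straightforward pattern expansion; a sketch where `uns_areas` is a hypothetical scope inventory for the cluster and `preview_bulk_grant` an illustrative name:

```python
import fnmatch

# Hypothetical UnsArea inventory for the cluster
uns_areas = ["bldg-1", "bldg-2", "bldg-3", "annex-1"]

def preview_bulk_grant(groups, flags, area_pattern):
    """Expand the pattern into the NodeAcl rows that would be created (step 4)."""
    return [
        {"LdapGroup": g, "ScopeKind": "UnsArea", "ScopeId": a, "PermissionFlags": flags}
        for g in groups
        for a in fnmatch.filter(uns_areas, area_pattern)
    ]

rows = preview_bulk_grant(["OtOpcUaOperators"], 0b10011111, "bldg-*")
assert [r["ScopeId"] for r in rows] == ["bldg-1", "bldg-2", "bldg-3"]
```

Confirming then simply inserts the previewed rows into the current draft, so what the operator approved is exactly what gets published.
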
### Permission simulator

"Simulate as user" panel: enter username + LDAP groups → UI shows the effective permission map across the cluster's UNS tree. Useful before publishing — operators verify "after this change, ScadaBridge can still read everything it needs" without actually deploying.

### Operator workflows added to admin-ui.md

Three new workflows:

1. **Grant ACL** — usual draft → diff → publish, scoped to ACLs tab
2. **Bulk grant** — multi-select scope + group + permission, preview, publish
3. **Simulate as user** — preview-only, no publish required

### v1 deviation log

For each cluster, the Admin UI shows a banner if its NodeAcl set diverges from the v1-default seed (per the table above). This makes intentional tightening or loosening visible at a glance — important for compliance review during the long v1 → v2 coexistence period.

## Audit

Every NodeAcl change is in `ConfigAuditLog` automatically (per the publish boundary — same as any other content edit). Plus the OPC UA NodeManager logs every **denied** operation:

```
EventType   = 'OpcUaAccessDenied'
DetailsJson = { user, groups, requestedOperation, nodePath, requiredPermission, effectivePermissions }
```

Allowed operations are NOT logged at this layer (they would dwarf the audit log; the OPC UA SDK has its own session/operation diagnostics for high-frequency telemetry). The choice to log denials only mirrors typical authorization-audit practice and can be tightened per deployment if a customer requires full positive-action logging.

## Test Strategy

Unit tests for the evaluator:

- Empty ACL set → all operations denied (default-deny invariant)
- Single Cluster-scope grant → operation allowed at every node in the cluster
- Single Equipment-scope grant → allowed at the equipment + its tags; denied at sibling equipment
- Multiple grants for the same group → union (additive)
- Multiple groups for the same user → union of all groups' grants
- Browse implication: granting Read on a deep equipment auto-allows Browse at every ancestor
- Permission bundle expansion: granting the `Operator` bundle = granting `Browse | Read | Subscribe | HistoryRead | AlarmRead | WriteOperate | AlarmAcknowledge | AlarmConfirm`
- v1-compatibility seed: a fresh cluster with the default ACL set behaves identically to v1 LmxOpcUa for users in the v1 LDAP groups

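The bundle-expansion and additivity cases translate into assertions directly; a sketch with plain int flags matching the enum's bit positions (test names are illustrative):

```python
# Flag values from the NodePermissions enum
BROWSE, READ, SUBSCRIBE, HISTORY_READ = 1 << 0, 1 << 1, 1 << 2, 1 << 3
WRITE_OPERATE = 1 << 4
ALARM_READ, ALARM_ACKNOWLEDGE, ALARM_CONFIRM = 1 << 7, 1 << 8, 1 << 9

READ_ONLY = BROWSE | READ | SUBSCRIBE | HISTORY_READ | ALARM_READ
OPERATOR = READ_ONLY | WRITE_OPERATE | ALARM_ACKNOWLEDGE | ALARM_CONFIRM

def test_bundle_expansion():
    # Granting the Operator bundle == granting its flags individually
    explicit = (BROWSE | READ | SUBSCRIBE | HISTORY_READ | ALARM_READ
                | WRITE_OPERATE | ALARM_ACKNOWLEDGE | ALARM_CONFIRM)
    assert OPERATOR == explicit

def test_grants_are_additive():
    # Multiple grants for the same group union together
    assert (READ_ONLY | WRITE_OPERATE) & WRITE_OPERATE

test_bundle_expansion()
test_grants_are_additive()
```
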
Integration test (Phase 1+):

- Create cluster + equipment + tags + ACL grants
- Connect an OPC UA client as a `ReadOnly`-mapped user → browse and read succeed; write fails
- Re-publish with a tighter ACL → the existing session's permission map rebuilds; subsequent writes that were allowed are now denied
- Verify `OpcUaAccessDenied` audit log entries for the denied operations

Adversarial review checks (run during the exit gate):

- Can a client connect with no LDAP group at all and read anything? (must be no — default deny)
- Can a client see a node in browse but not read its value? (yes, if Browse is granted but not Read — unusual but valid)
- Does a UnsArea rename cascade ACL grants correctly? (the grant references UnsAreaId, not the name, so a rename is transparent)
- Does an Equipment merge (Admin operator flow) preserve ACL grants on the surviving equipment? (must be yes; the merge flow updates references)
- Does generation rollback restore the prior ACL state? (must be yes; ACLs are generation-versioned)

## Implementation Plan

ACL design enters the implementation pipeline as follows:

### Phase 1 (Configuration + Admin scaffold)

- Schema: add `NodeAcl` table to the Phase 1 migration
- Validation: add NodeAcl rules to `sp_ValidateDraft`
- Admin UI: scaffold the ACLs tab with view + edit + bulk grant + simulator
- Default seed: cluster-create workflow seeds the v1-compatibility ACL set
- Generation diff: include NodeAcl in `sp_ComputeGenerationDiff`

### Phase 2+ (every driver phase)

- Wire the ACL evaluator into `GenericDriverNodeManager` so every Browse / Read / Write / Subscribe / HistoryRead / AlarmRead / AlarmAck / MethodCall consults the per-session permission map
- Per-driver tests: assert that a default-deny user cannot read or subscribe to that driver's namespace; assert that a `ReadOnly`-bundled user can; assert that the appropriate Write tier is needed for each `SecurityClassification`

### Pre-tier-1-cutover (before Phase 6 / consumer cutover)

- Verify ScadaBridge's effective permissions in the Admin UI simulator before any cutover
- Adversarial review of the per-cluster ACL set with a fresh pair of eyes

## Decisions to Add to plan.md

(Will be appended to the decision log on the next plan.md edit.)

| # | Decision | Rationale |
|---|----------|-----------|
| 129 | OPC UA client data-path authorization model = bitmask `NodePermissions` flags + per-LDAP-group grants on a 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) | Closes corrections-doc finding B1. Mirrors v1 SecurityClassification model for Write tiers; adds explicit AlarmRead/Ack/Confirm/Shelve and MethodCall flags. Default-deny; additive grants; explicit Deny deferred to v2.1. See `acl-design.md` |
| 130 | `NodeAcl` table generation-versioned, edited via draft → diff → publish like every other content table | Same pattern as Namespace (decision #123) and Equipment (decision #109). ACL changes are content, not topology — they affect what consumers see at the OPC UA endpoint. Rollback restores the prior ACL state |
| 131 | Cluster-create workflow seeds default ACL set matching v1 LmxOpcUa LDAP-role-to-permission map | Preserves behavioral parity for v1 → v2 consumer migration. Operators tighten or loosen from there. Admin UI flags any cluster whose ACL set diverges from the seed |
| 132 | OPC UA NodeManager logs denied operations only; allowed operations rely on SDK session/operation diagnostics | Logging every allowed op would dwarf the audit log. Denied-only mirrors typical authorization audit practice. Per-deployment policy can tighten if compliance requires positive-action logging |

## Open Questions

- **OPC UA Method support scope**: how many methods does v1 expose? Need to enumerate before tier 3 cutover (System Platform IO is the most likely consumer of methods). The MethodCall permission is defined defensively but may not be exercised in v2.0.
- **Group claim source latency**: the LDAP group cache TTL (default 15 min above) is taken from the v1 LDAP layer. If the OPC UA session's group claims need to be refreshed faster (e.g. for an emergency revoke), we need a shorter TTL or an explicit revoke channel. Decide per operational risk appetite.
- **AlarmConfirm vs AlarmAcknowledge semantics**: OPC UA Part 9 distinguishes them (Ack = "I've seen this"; Confirm = "I've taken action"). Some sites only use Ack; the v2.0 model exposes both but a deployment-level policy can collapse them in practice.

**docs/v2/admin-ui.md** (new file, 502 lines)

@@ -0,0 +1,502 @@

# Admin Web UI — OtOpcUa v2

> **Status**: DRAFT — companion to `plan.md` §4 and `config-db-schema.md`. Defines the Blazor Server admin app for managing the central config DB.
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Scope

This document covers the **OtOpcUa Admin** web app — the operator-facing UI for managing fleet configuration. It owns every write to the central config DB; OtOpcUa nodes are read-only consumers.

Out of scope here:

- Per-node operator dashboards (status, alarm acks for runtime concerns) — that's the existing Status Dashboard, deployed alongside each node, not the Admin app
- Driver-specific config screens — these are deferred to each driver's implementation phase per decision #27, and each driver doc is responsible for sketching its config UI surface
- Authentication of the OPC UA endpoint itself — covered by `Security.md` (LDAP)

## Tech Stack

**Aligned with ScadaLink CentralUI** (`scadalink-design/src/ScadaLink.CentralUI`) — operators using both apps see the same login screen, same sidebar, same component vocabulary. Same patterns, same aesthetic.

| Component | Choice | Reason |
|-----------|--------|--------|
| Framework | **Blazor Server** (.NET 10 Razor Components, `AddInteractiveServerComponents`) | Same as ScadaLink; real-time UI without a separate SPA build; SignalR built in for live cluster status |
| Hosting | Co-deploy with central DB by default; standalone option | Most deployments run Admin on the same machine as MSSQL; large fleets can split |
| Auth | **LDAP bind via `LdapAuthService` (sibling of `ScadaLink.Security`) + cookie auth + `JwtTokenService` for API tokens** | Direct parity with ScadaLink — same login form, same cookie scheme, same claim shape, same `RoleMapper` pattern. Operators authenticated to one app feel at home in the other |
| DB access | EF Core (same `Configuration` project that nodes use) | Schema versioning lives in one place |
| Real-time | SignalR (Blazor Server's underlying transport) | Live updates on `ClusterNodeGenerationState` and crash-loop alerts |
| Styling | **Bootstrap 5** vendored under `wwwroot/lib/bootstrap/` | Direct parity with ScadaLink; standard component vocabulary (card, table, alert, btn, form-control, modal); no third-party Blazor component library dependency |
| Shared components | `DataTable`, `ConfirmDialog`, `LoadingSpinner`, `ToastNotification`, `TimestampDisplay`, `RedirectToLogin`, `NotAuthorizedView` | Same set as ScadaLink CentralUI; copy structurally so the cross-app feel is identical |
| Reconnect overlay | Custom Bootstrap modal triggered on Blazor SignalR disconnect | Same pattern as ScadaLink — modal appears on connection loss, dismisses on reconnect |

### Code organization

Mirror ScadaLink's layout exactly:

```
src/
  ZB.MOM.WW.OtOpcUa.Admin/                   # Razor Components project (.NET 10)
    Auth/
      AuthEndpoints.cs                       # /auth/login, /auth/logout, /auth/token
      CookieAuthenticationStateProvider.cs   # bridges cookie auth to Blazor <AuthorizeView>
    Components/
      Layout/
        MainLayout.razor                     # dark sidebar + light main flex layout
        NavMenu.razor                        # role-gated nav sections
      Pages/
        Login.razor                          # server-rendered HTML form POSTing to /auth/login
        Dashboard.razor                      # default landing
        Clusters/
        Generations/
        Credentials/
        Audit/
      Shared/
        DataTable.razor                      # paged/sortable/filterable table (verbatim from ScadaLink)
        ConfirmDialog.razor
        LoadingSpinner.razor
        ToastNotification.razor
        TimestampDisplay.razor
        RedirectToLogin.razor
        NotAuthorizedView.razor
    EndpointExtensions.cs                    # MapAuthEndpoints + role policies
    ServiceCollectionExtensions.cs           # AddCentralAdmin
  ZB.MOM.WW.OtOpcUa.Admin.Security/          # LDAP + role mapping + JWT (sibling of ScadaLink.Security)
```

The `Admin.Security` project carries `LdapAuthService`, `RoleMapper`, `JwtTokenService`, `AuthorizationPolicies`. If it ever makes sense to consolidate with ScadaLink's identical project, lift it into a shared internal NuGet — out of scope for v2.0, to keep OtOpcUa decoupled from ScadaLink's release cycle.

## Authentication & Authorization

### Operator authentication

**Identical pattern to ScadaLink CentralUI.** Operators log in via LDAP bind against the GLAuth server. The login flow is a server-rendered HTML form POSTing to `/auth/login` (NOT a Blazor interactive form — `data-enhance="false"` disables Blazor enhanced navigation), handled by a minimal-API endpoint that:

1. Reads `username` / `password` from the form
2. Calls `LdapAuthService.AuthenticateAsync(username, password)` — performs the LDAP bind, returns `Username`, `DisplayName`, `Groups`
3. Calls `RoleMapper.MapGroupsToRolesAsync(groups)` — translates LDAP groups → application roles + cluster-scope set
4. Builds a `ClaimsIdentity` with `Name`, `DisplayName`, `Username`, `Role` (multiple), `ClusterId` scope claims (multiple, when not system-wide)
5. Calls `HttpContext.SignInAsync(CookieAuthenticationDefaults.AuthenticationScheme, principal, ...)` with `IsPersistent = true`, `ExpiresUtc = +30 min` (sliding)
6. Redirects to `/`
7. On failure, redirects to `/login?error={URL-encoded message}`

A parallel `/auth/token` endpoint returns a JWT for API clients (CLI tooling, scripts) — same auth, different transport. Symmetric with ScadaLink's pattern.

`CookieAuthenticationStateProvider` bridges the cookie principal to Blazor's `AuthenticationStateProvider` so `<AuthorizeView>` and `[Authorize]` work in components.

### LDAP group → role mapping

| LDAP group | Admin role | Capabilities |
|------------|------------|--------------|
| `OtOpcUaAdmins` | `FleetAdmin` | Everything: cluster CRUD, node CRUD, credential management, publish/rollback any cluster |
| `OtOpcUaConfigEditors` | `ConfigEditor` | Edit drafts and publish for assigned clusters; cannot create/delete clusters or manage credentials |
| `OtOpcUaViewers` | `ReadOnly` | View-only access to all clusters and generations; cannot edit drafts or publish |

`AuthorizationPolicies` constants (mirrors ScadaLink): `RequireFleetAdmin`, `RequireConfigEditor`, `RequireReadOnly`. `<AuthorizeView Policy="@AuthorizationPolicies.RequireFleetAdmin">` gates nav menu sections and page-level access.

### Cluster-scoped grants (lifted from v2.1 to v2.0)

Because ScadaLink already has the site-scoped grant pattern (`PermittedSiteIds` claim, `IsSystemWideDeployment` flag), we get cluster-scoped grants essentially for free in v2.0 by mirroring it:

- A `ConfigEditor` user mapped to LDAP group `OtOpcUaConfigEditors-LINE3` is granted the `ConfigEditor` role plus a scope claim for `ClusterId=LINE3-OPCUA` only
- The `RoleMapper` reads a small `LdapGroupRoleMapping` table (Group → Role, Group → ClusterId scope) configured by a `FleetAdmin` via the Admin UI
- All cluster-scoped pages check both the role AND the `ClusterId` scope claim before showing edit affordances

System-wide users (no `ClusterId` scope claims, `IsSystemWideDeployment = true`) see every cluster.

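The "role AND `ClusterId` scope" check that cluster-scoped pages run before showing edit affordances can be sketched as follows (field and function names are assumptions for illustration):

```typescript
interface AdminPrincipal {
  role: "FleetAdmin" | "ConfigEditor" | "ReadOnly";
  clusterScopes: string[]; // empty for system-wide users
  isSystemWideDeployment: boolean;
}

// Hedged sketch of the page-level gate: edit affordances appear only when
// both the role and the ClusterId scope claim allow it.
export function canEditCluster(p: AdminPrincipal, clusterId: string): boolean {
  if (p.role === "ReadOnly") return false; // viewers never edit
  if (p.role === "FleetAdmin") return true; // fleet admins edit everywhere
  if (p.isSystemWideDeployment && p.clusterScopes.length === 0) return true;
  return p.clusterScopes.includes(clusterId); // scoped ConfigEditor
}
```

The deliberate asymmetry: a missing scope claim denies by default unless the principal is explicitly system-wide, so a misconfigured group mapping fails closed.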
### Bootstrap (first-run)

Same as ScadaLink: a local-admin login configured in `appsettings.json` (or a local certificate-authenticated user) bootstraps the first `OtOpcUaAdmins` LDAP group binding before LDAP-only access takes over. Documented as a one-time setup step.

### Audit

Every write operation goes through `sp_*` procs that log to `ConfigAuditLog` with the operator's principal. The Admin UI also logs view-only actions (page navigation, generation diff views) to a separate UI access log for compliance.

## Visual Design — Direct Parity with ScadaLink

Every visual element is lifted from ScadaLink CentralUI's design system to ensure cross-app consistency. Concrete specs:

### Layout

- **Flex layout**: `<div class="d-flex">` containing `<NavMenu />` (sidebar) and `<main class="flex-grow-1 p-3">` (content)
- **Sidebar**: 220px fixed width (`min-width: 220px; max-width: 220px`), full viewport height (`min-height: 100vh`), background `#212529` (Bootstrap dark)
- **Main background**: `#f8f9fa` (Bootstrap light)
- **Brand**: "OtOpcUa" in white bold (font-size `1.1rem`, padding `1rem`, border-bottom `1px solid #343a40`) at the top of the sidebar
- **Nav links**: color `#adb5bd`, padding `0.4rem 1rem`, font-size `0.9rem`. Hover: white text, background `#343a40`. Active: white text, background `#0d6efd` (Bootstrap primary)
- **Section headers** ("Admin", "Configuration", "Monitoring"): color `#6c757d`, uppercase, font-size `0.75rem`, font-weight `600`, letter-spacing `0.05em`, padding `0.75rem 1rem 0.25rem`
- **User strip** at the bottom of the sidebar: display name (`text-light small`) + Sign Out button (`btn-outline-light btn-sm`), separated from the nav by `border-top border-secondary`

### Login page

Verbatim structure from ScadaLink's `Login.razor`:

```razor
<div class="container" style="max-width: 400px; margin-top: 10vh;">
    <div class="card shadow-sm">
        <div class="card-body p-4">
            <h4 class="card-title mb-4 text-center">OtOpcUa</h4>

            @if (!string.IsNullOrEmpty(ErrorMessage))
            {
                <div class="alert alert-danger py-2" role="alert">@ErrorMessage</div>
            }

            <form method="post" action="/auth/login" data-enhance="false">
                <div class="mb-3">
                    <label for="username" class="form-label">Username</label>
                    <input type="text" class="form-control" id="username" name="username"
                           required autocomplete="username" autofocus />
                </div>
                <div class="mb-3">
                    <label for="password" class="form-label">Password</label>
                    <input type="password" class="form-control" id="password" name="password"
                           required autocomplete="current-password" />
                </div>
                <button type="submit" class="btn btn-primary w-100">Sign In</button>
            </form>
        </div>
    </div>
    <p class="text-center text-muted mt-3 small">Authenticate with your organization's LDAP credentials.</p>
</div>
```

Exact same dimensions, exact same copy pattern; only the brand name differs.

### Reconnection overlay

Same SignalR-disconnect modal as ScadaLink — `#reconnect-modal` overlay (`rgba(0,0,0,0.5)` backdrop, centered white card with `spinner-border text-primary`, "Connection Lost" heading, "Attempting to reconnect to the server. Please wait..." body). Listens for `Blazor.addEventListener('enhancedload')` to dismiss on reconnect. Lifted from ScadaLink's `App.razor` inline styles.

### Shared components — direct copies

All seven shared components from ScadaLink CentralUI are copied verbatim into our `Components/Shared/`:

| Component | Use |
|-----------|-----|
| `DataTable.razor` | Sortable, filterable, paged table — used for tags, generations, audit log, cluster list |
| `ConfirmDialog.razor` | Modal confirmation for destructive actions (publish, rollback, discard draft, disable credential) |
| `LoadingSpinner.razor` | Standard spinner for in-flight DB operations |
| `ToastNotification.razor` | Transient success/error toasts for non-modal feedback |
| `TimestampDisplay.razor` | Consistent UTC + relative-time rendering ("3 minutes ago") |
| `RedirectToLogin.razor` | Used by pages requiring auth — server-side redirect to `/login?returnUrl=...` |
| `NotAuthorizedView.razor` | Standard "you don't have permission for this action" view, shown by the `<AuthorizeView>` NotAuthorized branch |

If we discover an Admin-specific component need, we add it to our Shared folder rather than diverging from ScadaLink's set.

## Information Architecture

```
/                                             Fleet Overview (default landing)
/clusters                                     Cluster list
/clusters/{ClusterId}                         Cluster detail (tabs: Overview / Namespaces / UNS Structure / Drivers / Devices / Equipment / Tags / Generations / Audit)
/clusters/{ClusterId}/nodes/{NodeId}          Node detail
/clusters/{ClusterId}/namespaces              Namespace management (generation-versioned via draft → publish; same boundary as drivers/tags)
/clusters/{ClusterId}/uns                     UNS structure management (areas, lines, drag-drop reorganize)
/clusters/{ClusterId}/equipment               Equipment list (default sorted by ZTag)
/clusters/{ClusterId}/equipment/{EquipmentId} Equipment detail (5 identifiers, UNS placement, signals, audit)
/clusters/{ClusterId}/draft                   Draft editor (drivers/devices/equipment/tags)
/clusters/{ClusterId}/draft/diff              Draft vs current diff viewer
/clusters/{ClusterId}/generations             Generation history
/clusters/{ClusterId}/generations/{Id}        Generation detail (read-only view of any generation)
/clusters/{ClusterId}/audit                   Audit log filtered to this cluster
/credentials                                  Credential management (FleetAdmin only)
/audit                                        Fleet-wide audit log
/admin/users                                  Admin role assignments (FleetAdmin only)
```

## Core Pages

### Fleet Overview (`/`)

Single-page summary intended as the operator landing page.

- **Cluster cards**, one per `ServerCluster`, showing:
  - Cluster name, site, redundancy mode, node count
  - Per-node status: online/offline (from `ClusterNodeGenerationState.LastSeenAt`), current generation, RedundancyRole, ServiceLevel (last reported)
  - Drift indicator: red if a 2-node cluster's nodes are on different generations, amber if mid-apply, green if converged
- **Active alerts** strip (top of page):
  - Sticky crash-loop circuit alerts (per `driver-stability.md`)
  - Stragglers: nodes that haven't applied the latest published generation within 5 min
  - Failed applies (`LastAppliedStatus = 'Failed'`)
- **Recent activity**: last 20 events from `ConfigAuditLog` across the fleet
- **Search bar** at top: jump to any cluster, node, tag, or driver instance by name

Refresh: SignalR push for status changes; full reload every 30 s as a safety net.

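The drift indicator's three states reduce to a simple function over the per-node generation state. A hedged sketch (the shape of the node-state record is an assumption):

```typescript
interface NodeGenState {
  appliedGeneration: number;
  applying: boolean; // mid-apply: new generation received, not yet active
}

type Drift = "green" | "amber" | "red";

// Green: all nodes converged on one generation. Amber: at least one node is
// mid-apply. Red: nodes have *settled* on different generations (real drift).
export function driftIndicator(nodes: NodeGenState[]): Drift {
  if (nodes.some((n) => n.applying)) return "amber";
  const generations = new Set(nodes.map((n) => n.appliedGeneration));
  return generations.size > 1 ? "red" : "green";
}
```

Checking `applying` before comparing generations is what keeps a normal rolling apply amber rather than flashing red mid-transition.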
### Cluster Detail (`/clusters/{ClusterId}`)

Tabbed view for one cluster.

**Tabs:**

1. **Overview** — cluster metadata (name, Enterprise, Site, redundancy mode), namespace summary (which kinds are configured + their URIs), node table with online/offline/role/generation/last-applied-status, current published generation summary, draft status (none / in progress / ready to publish)
2. **Namespaces** — list of `Namespace` rows for this cluster *in the current published generation* (Kind, NamespaceUri, Enabled). **Namespaces are generation-versioned** (revised after adversarial review finding #2): add, disable, or re-enable a namespace by opening a draft, making the change, and publishing. The tab is read-only when no draft is open; an "Edit in draft" button opens the cluster's draft scoped to the namespace section. The Equipment kind is auto-included in the cluster's first generation; the SystemPlatform kind is added when a Galaxy driver is configured. The Simulated kind is reserved (an operator can add a `Kind = 'Simulated'` row in a draft, but no driver populates it in v2.0; the UI shows an "Awaiting replay driver — see roadmap" placeholder).
3. **UNS Structure** — tree view of `UnsArea` → `UnsLine` → `Equipment` for this cluster's current published generation. Operators can:
   - Add/rename/delete areas and lines (changes go into the active draft)
   - Bulk-move lines between areas (drag-and-drop in the tree; a single edit propagates UNS path changes to all equipment under the moved line)
   - Bulk-move equipment between lines
   - View a live UNS path preview per node (`Enterprise/Site/Area/Line/Equipment`)
   - See validation errors inline (segment regex, length cap, `_default` placeholder rules)
   - See counts per node: lines per area, equipment per line, signals per equipment
   - See path-rename impact: when renaming an area, the UI shows "X lines, Y equipment, Z signals will pick up the new path" before commit
4. **Drivers** — table of `DriverInstance` rows in the *current published* generation, with per-row namespace assignment shown. Per-row navigation to driver-specific config screens. An "Edit in draft" button creates or opens the cluster's draft.
5. **Devices** — table of `Device` rows (where applicable), grouped by `DriverInstance`
6. **Equipment** — table of `Equipment` rows in the current published generation, scoped to drivers in Equipment-kind namespaces. **Default sort: ZTag ascending** (the primary browse identifier per decision #117). Default columns:
   - `ZTag` (primary, bold, copyable)
   - `MachineCode` (secondary, e.g. `machine_001`)
   - Full UNS path (rendered live from cluster + UnsLine→UnsArea + Equipment.Name)
   - `SAPID` (when set)
   - `EquipmentUuid` (collapsed badge, copyable on click — "show UUID" toggle to expand)
   - `EquipmentClassRef` (placeholder until the schemas repo lands)
   - DriverInstance, DeviceId, Enabled

   The search bar accepts any of the five identifiers (ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid) — the operator types and the search dispatches across all five, with a typeahead that disambiguates ("Found in ZTag" / "Found in MachineCode" labels on each suggestion). Clicking a row opens the Equipment Detail page.
7. **Tags** — paged, filterable table of all tags. Filters: namespace kind, equipment (by ZTag/MachineCode/SAPID), driver, device, folder path, name pattern, data type. For Equipment-ns tags the path is shown as the full UNS path; for SystemPlatform-ns tags the v1-style `FolderPath/Name` is shown. Bulk-operations toolbar: export to CSV, import from CSV (validated against the active draft).
8. **ACLs** — OPC UA client data-path authorization grants. Two views (toggle at top): "By LDAP group" (rows) and "By scope" (UNS tree with permission badges per node). Bulk-grant flow: pick a group + a permission bundle (`ReadOnly` / `Operator` / `Engineer` / `Admin`) or a per-flag selection + a scope (multi-select from the tree or a pattern), preview, confirm via draft. Permission simulator panel: enter username + LDAP groups → effective permission map across the cluster's UNS tree. The default seed on cluster creation maps the v1 LmxOpcUa LDAP roles; a banner shows when this cluster's ACL set diverges from the seed. See `acl-design.md` for the full design.
9. **Generations** — generation history list (see the Generation History page)
10. **Audit** — filtered audit log

The Drivers/Devices/Equipment/Tags tabs are **read-only views** of the published generation; editing is done in the dedicated draft editor to make the publish boundary explicit. The Namespaces tab and the UNS Structure tab follow the same hybrid pattern: navigation is read-only over the published generation, and click-to-edit on any node opens the draft editor scoped to that node. **No table in v2.0 is edited outside the publish boundary** (revised after adversarial review finding #2).

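Since the data-path permission model is bitmask flags checked on every browse, read, write, subscribe, history read, alarm event, and method call, the ACL tab's permission bundles can be sketched as flag unions. Flag names and bit positions below are illustrative assumptions, not the shipped schema:

```typescript
// Illustrative permission flags; the real bit layout lives in the v2 schema.
const Perm = {
  Browse: 1 << 0,
  Read: 1 << 1,
  Subscribe: 1 << 2,
  Write: 1 << 3,
  AckAlarm: 1 << 4,
  CallMethod: 1 << 5,
  HistoryRead: 1 << 6,
} as const;

// Bundles offered by the bulk-grant flow, each widening the previous one.
const bundles: Record<string, number> = {};
bundles.ReadOnly = Perm.Browse | Perm.Read | Perm.Subscribe | Perm.HistoryRead;
bundles.Operator = bundles.ReadOnly | Perm.Write | Perm.AckAlarm;
bundles.Engineer = bundles.Operator | Perm.CallMethod;

// The per-operation check: every required bit must be present.
export function hasPermission(granted: number, required: number): boolean {
  return (granted & required) === required;
}
```

The permission simulator panel is essentially this check evaluated over the OR of all grants matching the user's LDAP groups at each UNS scope.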
### Equipment Detail (`/clusters/{ClusterId}/equipment/{EquipmentId}`)

Per-equipment view. Form sections:

- **OPC 40010 Identification panel** (per the `_base` equipment-class template): operator-set static metadata exposed as OPC UA properties on the equipment node's `Identification` sub-folder — Manufacturer (required), Model (required), SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation (free text, supplementary to the UNS path), ManufacturerUri (URL), DeviceManualUri (URL). Manufacturer + Model are required because the `_base` template declares them `isRequired: true`; the rest are optional and can be filled in over time. Drivers that can read fields dynamically (e.g. FANUC `cnc_sysinfo()` returning `SoftwareRevision`) override the static value at runtime; otherwise the operator-set value flows through.
- **Identifiers panel**: all five identifiers, with explicit purpose labels and copy-to-clipboard buttons
  - `ZTag` — editable; live fleet-wide uniqueness check via `ExternalIdReservation` (warns if the value is currently held by another EquipmentUuid; cannot save unless the reservation is released first)
  - `MachineCode` — editable; live within-cluster uniqueness check
  - `SAPID` — editable; same reservation-backed check as ZTag
  - `EquipmentId` — **read-only forever** (revised after adversarial review finding #4). System-generated as `'EQ-' + first 12 hex chars of EquipmentUuid`. Never operator-editable, never present in any input form, never accepted from CSV imports
  - `EquipmentUuid` — read-only forever (auto-generated UUIDv4 on creation, never editable; copyable badge with a "downstream consumers join on this" tooltip)
- **UNS placement panel**: UnsArea/UnsLine pickers (typeahead from the existing structure); `Equipment.Name` field with live segment validation; live full-path preview with a character counter
- **Class template panel**: `EquipmentClassRef` — free text in v2.0; becomes a typeahead picker when the schemas repo lands
- **Driver source panel**: DriverInstance + DeviceId pickers (filtered to drivers in Equipment-kind namespaces of this cluster)
- **Signals panel**: list of `Tag` rows that belong to this equipment; inline edit is not supported here (use the Draft Editor's Tags panel for editing); read-only with an "Edit in draft" deep link
- **Audit panel**: audit log scoped to this equipment row across generations

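The `EquipmentId` derivation rule above is deterministic and worth pinning down in code. A sketch, assuming "first 12 hex chars" means the canonical dashless form of the UUID string:

```typescript
// 'EQ-' + first 12 hex chars of the equipment's UUIDv4 (dashes stripped).
// Loud failure on malformed input rather than a silent fallback id.
export function deriveEquipmentId(equipmentUuid: string): string {
  const hex = equipmentUuid.toLowerCase().replace(/-/g, "");
  if (!/^[0-9a-f]{32}$/.test(hex)) {
    throw new Error(`not a UUID: ${equipmentUuid}`);
  }
  return "EQ-" + hex.slice(0, 12);
}
```

Because the function is pure over the UUID, the id survives every rename and move the doc lists, and any two systems deriving it independently agree.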
### Node Detail (`/clusters/{ClusterId}/nodes/{NodeId}`)

Per-node view for `ClusterNode` management.

- **Physical attributes** form: Host, OpcUaPort, DashboardPort, ApplicationUri, ServiceLevelBase, RedundancyRole
- **ApplicationUri auto-suggest** behavior (per decision #86):
  - When creating a new node: prefilled with `urn:{Host}:OtOpcUa`
  - When editing an existing node: changing `Host` shows a warning banner — "ApplicationUri is not updated automatically. Changing it will require all OPC UA clients to re-establish trust." The operator must explicitly click an "Update ApplicationUri" button to apply the suggestion.
- **Credentials** sub-tab: list of `ClusterNodeCredential` rows (kind, value, enabled, rotated-at). A FleetAdmin can add/disable/rotate. The credential rotation flow is documented inline ("create new credential → wait for the node to use it → disable the old credential").
- **Per-node overrides** sub-tab: structured editor for `DriverConfigOverridesJson`. Surfaces the cluster's `DriverInstance` rows with their current `DriverConfig`, and lets the operator add path → value override entries per driver. Validation: the override path must exist in the current draft's `DriverConfig`; loud failure if it doesn't (per the merge semantics in the schema doc).
- **Generation state**: current applied generation, last-applied timestamp, last-applied status, last error if any
- **Recent node activity**: filtered audit log

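The override merge semantics (apply path → value entries onto a `DriverConfig`, and fail loudly when a path does not resolve) can be sketched as follows. The dotted-path convention is an assumption; the schema doc owns the real merge rules:

```typescript
type Json = { [key: string]: unknown };

// Apply per-node overrides onto a base driver config. Every override path
// must already exist in the base; an unknown path is an error, never a
// silent no-op (loud failures over silent fallbacks).
export function applyOverrides(base: Json, overrides: Record<string, unknown>): Json {
  const merged: Json = JSON.parse(JSON.stringify(base)); // deep copy
  for (const [path, value] of Object.entries(overrides)) {
    const segments = path.split(".");
    let cursor: Json = merged;
    for (const seg of segments.slice(0, -1)) {
      const next = cursor[seg];
      if (typeof next !== "object" || next === null) {
        throw new Error(`override path '${path}' does not resolve in DriverConfig`);
      }
      cursor = next as Json;
    }
    const leaf = segments[segments.length - 1];
    if (!(leaf in cursor)) {
      throw new Error(`override path '${path}' does not resolve in DriverConfig`);
    }
    cursor[leaf] = value;
  }
  return merged;
}
```

Rejecting unknown paths is what lets the editor flag a stale override red when the draft's `DriverConfig` shape changes underneath it.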
### Draft Editor (`/clusters/{ClusterId}/draft`)

The primary edit surface. Three-panel layout: tree on the left (Drivers → Devices → Equipment → Tags, with Equipment shown only for drivers in Equipment-kind namespaces), edit form on the right, validation panel at the bottom.

- **Drivers panel**: add/edit/remove `DriverInstance` rows in the draft. Each driver type opens a driver-specific config screen (deferred per #27). Generic fields (Name, NamespaceId, Enabled) are always editable. The NamespaceId picker is filtered to namespace kinds valid for the chosen driver type (e.g. selecting `DriverType=Galaxy` restricts the picker to SystemPlatform-kind namespaces only).
- **Devices panel**: scoped to the selected driver instance (where applicable)
- **UNS Structure panel** (Equipment-ns drivers only): tree of UnsArea → UnsLine; CRUD on areas and lines; rename and move operations with live impact preview ("renaming bldg-3 → bldg-3a will update 12 lines, 47 equipment, 1,103 signal paths"); the validator rejects identity reuse with a different parent
- **Equipment panel** (Equipment-ns drivers only):
  - Add/edit/remove `Equipment` rows scoped to the selected driver
  - Inline form sections:
    - **Identifiers**: `MachineCode` (required, e.g. `machine_001`, live within-cluster uniqueness validation); `ZTag` (optional, ERP id, live fleet-wide uniqueness validation via `ExternalIdReservation` lookup — surfaces "currently reserved by EquipmentUuid X in cluster Y" on collision); `SAPID` (optional, SAP PM id, same reservation-backed check)
    - **UNS placement**: `UnsLineId` picker (typeahead from the existing structure, or "Create new line" inline); `Name` (UNS level 5, live segment validation against `^[a-z0-9-]{1,32}$`)
    - **Class template**: `EquipmentClassRef` (free text in v2.0; becomes a typeahead picker when the schemas repo lands)
    - **Source**: `DeviceId` (when the driver has multiple devices); `Enabled`
  - **`EquipmentUuid` is auto-generated UUIDv4 on creation, displayed read-only as a copyable badge**, never editable. **`EquipmentId` is also auto-generated** (`'EQ-' + first 12 hex chars of EquipmentUuid`) and never editable in any form. Both stay constant across renames, MachineCode/ZTag/SAPID edits, and area/line moves. The validator rejects any draft that tries to change either value on published equipment.
  - **Live UNS path preview** above the form: `{Cluster.Enterprise}/{Cluster.Site}/{UnsArea.Name}/{UnsLine.Name}/{Name}` with character count and a ≤200-character limit indicator
  - Bulk operations:
    - Move many equipment from one line to another (UUIDs and identifiers preserved)
    - Bulk-edit MachineCode/ZTag/SAPID via an inline grid (validation per row)
    - Bulk-create equipment from CSV (one row per equipment; UUIDs auto-generated for new rows)
- **Tags panel**:
  - Tree view: by Equipment in Equipment-ns; by `FolderPath` in SystemPlatform-ns
  - Inline edit for individual tags (Name, DataType, AccessLevel, WriteIdempotent, PollGroupId, TagConfig JSON in a structured editor)
  - **Bulk operations**: select multiple tags → bulk edit (change poll group, access level, etc.)
  - **CSV import** schemas (one per namespace kind):
    - Equipment-ns: `(EquipmentId, Name, DataType, AccessLevel, WriteIdempotent, PollGroupId, TagConfig)`
    - SystemPlatform-ns: `(DriverInstanceId, DeviceId?, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, PollGroupId, TagConfig)`
    - The preview shows additions/modifications/removals against the current draft, with row-level validation errors. The operator confirms or cancels.
  - **CSV export**: emits the matching shape from the current published generation
  - **Equipment CSV import** (separate flow): bulk create-or-update equipment. Columns: `(EquipmentUuid?, MachineCode, ZTag?, SAPID?, UnsAreaName, UnsLineName, Name, DriverInstanceId, DeviceId?, EquipmentClassRef?)`. **No `EquipmentId` column** (revised after adversarial review finding #4 — an operator-supplied EquipmentId would mint duplicate equipment identity on typos):
    - **Row with `EquipmentUuid` set**: matches existing equipment by UUID and updates the matched row's editable fields (MachineCode/ZTag/SAPID/UnsLineId/Name/EquipmentClassRef/DeviceId/Enabled). A mismatched UUID = error; abort the row.
    - **Row without `EquipmentUuid`**: creates new equipment. The system generates a fresh UUID and `EquipmentId = 'EQ-' + first 12 hex chars`. Cannot be used to update an existing row — the operator must include the UUID for updates.
    - UnsArea/UnsLine are resolved by name within the cluster (auto-created if not present, with a validation prompt).
    - Identifier uniqueness checks run row by row, with errors surfaced before commit. ZTag/SAPID are checked against `ExternalIdReservation` — collisions surface inline with the conflicting EquipmentUuid named.
    - Explicit "merge equipment A into B" or "rebind ZTag from A to B" operations are not in the CSV import path — see the Merge / Rebind operator flow below.
- **Validation panel**: runs `sp_ValidateDraft` continuously (debounced 500 ms) and surfaces FK errors, JSON schema errors, duplicate paths, missing references, UNS naming-rule violations, UUID-immutability violations, and driver-type-vs-namespace-kind mismatches. The publish button is disabled while errors exist.
- **Diff link** at top: opens the diff viewer comparing the draft against the current published generation

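The two UNS naming rules the editor enforces live (segment regex `^[a-z0-9-]{1,32}$`, full path ≤200 characters) can be sketched as a single validator:

```typescript
const SEGMENT = /^[a-z0-9-]{1,32}$/; // per-segment rule from the editor
const MAX_PATH = 200;                // full-path character cap

// Validate the five UNS segments (Enterprise/Site/Area/Line/Equipment).
// Returns a list of human-readable errors; an empty list means valid.
export function validateUnsPath(segments: string[]): string[] {
  const errors: string[] = [];
  for (const s of segments) {
    if (!SEGMENT.test(s)) errors.push(`invalid segment '${s}'`);
  }
  const path = segments.join("/");
  if (path.length > MAX_PATH) {
    errors.push(`path is ${path.length} chars, cap is ${MAX_PATH}`);
  }
  return errors;
}
```

Returning all errors at once (rather than failing on the first) matches the validation panel's behavior of listing every violation before publish is enabled.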
### Diff Viewer (`/clusters/{ClusterId}/draft/diff`)

Three-column compare: previous published | draft | summary. Per-table sections (drivers, devices, tags, poll groups) with rows colored by change type:

- Green: added in draft
- Red: removed in draft
- Yellow: modified (with field-level diff on hover/expand)

Includes a **publish dialog** triggered from this view: a required Notes field, and a choice between "publish and apply now" and "publish and let nodes pick up on next poll" (the latter is the default; the former invokes a one-shot push notification, deferred per the existing plan).

### Generation History (`/clusters/{ClusterId}/generations`)

List of all generations for the cluster with: ID, status, published-by, published-at, notes, and a per-row "Roll back to this" action (FleetAdmin or ConfigEditor). Clicking a row opens the generation detail page (a read-only view of all rows in that generation, with a diff-against-current button).

Rollback flow:

1. Operator clicks "Roll back to this generation"
2. Modal: "This will create a new published generation cloned from generation N. Both nodes of this cluster will pick up the change on their next poll. Notes (required):"
3. Confirm → invokes `sp_RollbackToGeneration` → immediate UI feedback that a new generation was published

### Credential Management (`/credentials`)

FleetAdmin-only. Lists all `ClusterNodeCredential` rows fleet-wide, filterable by cluster/node/kind/enabled.

Operations: add a credential to a node, disable a credential, mark a credential rotated. Rotation is the most common operation — the UI provides a guided flow ("create new → confirm the node has used it once via a `LastAppliedAt` advance → disable old").

### Fleet Audit (`/audit`)

Searchable, filterable view of `ConfigAuditLog` across all clusters. Filters: cluster, node, principal, event type, date range. Export to CSV for compliance.

## Real-Time Updates

Blazor Server runs over SignalR by default. The Admin app uses two SignalR hubs:

| Hub | Purpose |
|-----|---------|
| `FleetStatusHub` | Push `ClusterNodeGenerationState` changes (LastSeenAt updates, applied-generation transitions, status changes) to any open Fleet Overview or Cluster Detail page |
| `AlertHub` | Push new sticky alerts (crash-loop circuit trips, failed applies) to all subscribed pages |

Updates fan out from a backend `IHostedService` that polls `ClusterNodeGenerationState` every 5 s and diffs against the last-known state. Pages subscribe selectively (a Cluster Detail page subscribes to one cluster's updates; Fleet Overview subscribes to all). No polling from the browser.

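The poll-and-diff step the hosted service performs (compare the last-known snapshot to the fresh one, push only changed nodes) can be sketched as follows. The record shape is an assumption for illustration:

```typescript
interface NodeState {
  appliedGeneration: number;
  lastAppliedStatus: string;
  online: boolean;
}

// Node ids whose state changed between two 5-second polls; only these fan
// out over FleetStatusHub, so an idle fleet pushes nothing to browsers.
export function diffStates(
  prev: Map<string, NodeState>,
  next: Map<string, NodeState>,
): string[] {
  const changed: string[] = [];
  for (const [nodeId, state] of next) {
    const old = prev.get(nodeId);
    if (
      !old ||
      old.appliedGeneration !== state.appliedGeneration ||
      old.lastAppliedStatus !== state.lastAppliedStatus ||
      old.online !== state.online
    ) {
      changed.push(nodeId);
    }
  }
  return changed;
}
```

Diffing on the backend is the design choice that keeps browser traffic proportional to actual fleet activity rather than to poll frequency.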
## UX Rules

- **Sticky alerts that don't auto-clear** — per the crash-loop circuit-breaker rule in `driver-stability.md`, alerts in the Active Alerts strip require explicit operator acknowledgment before clearing, regardless of whether the underlying state has recovered. "We crash-looped 3 times overnight" must remain visible the next morning.
- **The publish boundary is explicit** — there is no "edit in place" path. All changes go through draft → diff → publish. The diff viewer is required reading before the publish dialog enables.
- **Loud failures over silent fallbacks** — if validation fails, the publish button is disabled and the failures are listed; we never publish a generation with warnings hidden. If a node override path doesn't resolve in the draft, the override editor flags it red, not yellow.
- **No auto-rewrite of `ApplicationUri`** — see the Node Detail page above. The principle generalizes: any field that OPC UA clients pin trust to (`ApplicationUri`, certificate thumbprints) requires explicit operator action to change, never a silent update.
- **Bulk operations always preview before commit** — CSV imports, bulk tag edits, and rollbacks all show a diff and require confirmation. No "apply" buttons that act without a preview.

## Per-Driver Config Screens (deferred)

Per decision #27, driver-specific config screens are added in each driver's implementation phase, not up front. The Admin app provides:

- A pluggable `IDriverConfigEditor` interface in `Configuration.Abstractions`
- Driver projects implement an editor that renders into a slot on the Driver Detail screen
- For drivers that don't yet have a custom editor, a generic JSON editor with schema-driven validation is used (better than nothing: ugly but functional)

The generic JSON editor uses the per-driver JSON schema from `DriverTypeRegistry`, so validation works even before a custom editor exists.

## Workflows

### Add a new cluster

1. FleetAdmin: `/clusters` → "New cluster"
2. Form: Name, **Enterprise** (UNS level 1; default-prefilled `zb` per the org-wide canonical value, validated against `^[a-z0-9-]{1,32}$`), **Site** (UNS level 2, e.g. `warsaw-west`, same validation), NodeCount (1 or 2), RedundancyMode (auto-set based on NodeCount)
3. Save → cluster row created (`Enabled = 1`, no generations yet)
4. **Open initial draft** containing the default namespaces:
   - Equipment-kind namespace (`NamespaceId = {ClusterName}-equipment`, `NamespaceUri = urn:{Enterprise}:{Site}:equipment`). The operator can edit the URI in the draft before publish.
   - Prompt: "Will this cluster host a Galaxy / System Platform driver?" → if yes, the draft also includes a SystemPlatform-kind namespace (`urn:{Enterprise}:{Site}:system-platform`). If no, skip — the operator can add it later via a draft.
5. The operator reviews the initial draft, optionally adds the first drivers/equipment, then publishes generation 1. The cluster cannot serve any consumer until generation 1 is published (no namespaces exist before that).
6. Redirect to Cluster Detail; prompt to add nodes via the Nodes tab (cluster topology) — node addition itself remains cluster-level, since `ClusterNode` rows are physical-machine topology, not consumer-visible content.

(Revised after adversarial review finding #2 — namespaces must travel through the publish boundary; the cluster-create flow no longer writes namespace rows directly.)

### Add a node to a cluster

1. Cluster Detail → "Add node"
2. Form: NodeId, RedundancyRole, Host (required), OpcUaPort (default 4840), DashboardPort (default 8081), ApplicationUri (auto-prefilled `urn:{Host}:OtOpcUa`), ServiceLevelBase (auto: Primary=200, Secondary=150)
3. Save
4. Prompt: "Add a credential for this node now?" → opens the credential add flow
5. The node won't be functional until at least one credential is added and provisioned on the node's machine (an out-of-band step documented in the deployment guide)

### Edit drivers/tags and publish

1. Cluster Detail → "Edit configuration" → opens the draft editor (creates a draft generation if none exists)
2. Operator edits drivers, devices, tags, poll groups
3. Validation panel updates live; publish is disabled while errors exist
4. Operator clicks "Diff" → diff viewer
5. Operator clicks "Publish" → modal asks for Notes, confirms
6. `sp_PublishGeneration` runs in a transaction; on success, the draft becomes the new published generation and the previous published generation becomes superseded
7. Within ~30 s (the default poll interval), both nodes pick up the new generation; the Cluster Detail page shows live progress as `LastAppliedAt` advances on each node

### Roll back

1. Cluster Detail → Generations tab → find the target generation → "Roll back to this"
2. Modal: explains that a new generation will be created (a clone of the target) and published; requires Notes
3. Confirm → `sp_RollbackToGeneration` runs
4. Same propagation as a forward publish — both nodes pick up the new generation on their next poll

### Override a setting per node

1. Node Detail → Overrides sub-tab
2. Pick a driver instance from the dropdown → a schema-driven editor shows the current `DriverConfig` keys
3. Add an override row: select a key path (validated against the driver's JSON schema), enter the override value
4. Save → updates `ClusterNode.DriverConfigOverridesJson`
5. **No new generation is created** — overrides are per-node metadata, not generation-versioned. They take effect on the node's next config-apply cycle.

The "no new generation" choice is deliberate: overrides are operationally bound to a specific physical machine, not to the cluster's logical config evolution. A node-replacement scenario would copy the override to the replacement node via the credential/override migration flow, not by replaying generation history.

### Rotate a credential

1. Node Detail → Credentials sub-tab → "Add credential"
2. Pick Kind, enter Value, save → the new credential is enabled alongside the old one
3. Wait for `LastAppliedAt` on the node to advance (this proves the node is using the new credential — the operator-side work of provisioning the new credential on the node's machine happens out-of-band)
4. Once verified, disable the old credential → only the new one is valid

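The precondition in step 3 ("wait for `LastAppliedAt` to advance") amounts to a small guard the UI could run before enabling the disable-old-credential action. A hedged sketch; the field names and the exact comparison are assumptions:

```typescript
// The old credential may safely be disabled only after the node has applied
// config at least once *since* the new credential was created, which is the
// doc's proxy for "the node-side provisioning actually happened".
export function safeToDisableOldCredential(
  newCredentialCreatedAt: Date,
  lastAppliedAt: Date | null,
): boolean {
  if (lastAppliedAt === null) return false; // node has never applied anything
  return lastAppliedAt.getTime() > newCredentialCreatedAt.getTime();
}
```

Gating the action this way turns the documented manual checklist into something the UI can enforce rather than merely describe.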
### Release an external-ID reservation

When equipment is permanently retired and its `ZTag` or `SAPID` needs to be reusable by a different physical asset (a known-rare event):

1. FleetAdmin: navigate to Equipment Detail of the retired equipment, or to a global "External ID Reservations" view
2. Select the reservation (Kind + Value), click "Release"
3. Modal requires: confirmation of the EquipmentUuid that currently holds the reservation, and a free-text **release reason** (compliance audit trail)
4. Confirm → `sp_ReleaseExternalIdReservation` runs: sets `ReleasedAt`, `ReleasedBy`, `ReleaseReason`. Audit-logged with `EventType = 'ExternalIdReleased'`.
5. The same `(Kind, Value)` can now be reserved by a different EquipmentUuid in a future publish. The released row stays in the table forever for audit.

This is the **only** path that allows ZTag/SAPID reuse — no implicit release on equipment disable, no implicit release on cluster delete. Requires explicit FleetAdmin action with a documented reason.
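The reuse rule amounts to a single predicate over reservation rows: a `(Kind, Value)` pair is free only if every prior holder was explicitly released. An in-memory sketch (Python for illustration; field names are assumed from the steps above, the real check lives in the reservation stored procedures):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reservation:
    kind: str                          # "ZTag" or "SAPID"
    value: str
    equipment_uuid: str
    released_at: Optional[str] = None  # set only by sp_ReleaseExternalIdReservation

def can_reserve(rows: list[Reservation], kind: str, value: str) -> bool:
    """A (Kind, Value) is free only if every prior reservation of it has been
    explicitly released. Released rows stay in the table forever, so history
    is preserved even after reuse."""
    return all(r.released_at is not None
               for r in rows
               if r.kind == kind and r.value == value)

rows = [Reservation("ZTag", "Z-1001", "uuid-a")]
assert not can_reserve(rows, "ZTag", "Z-1001")   # still held
rows[0].released_at = "2026-04-17T12:00:00Z"
assert can_reserve(rows, "ZTag", "Z-1001")       # freed by explicit release
```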
### Merge or rebind equipment (rare)

When operators discover that two `EquipmentRow`s in different generations actually represent the same physical asset (e.g. a typo created a duplicate) — or when an asset's identity has been incorrectly split across UUIDs — the resolution is **not** an in-place EquipmentId edit (which is now impossible per finding #4). Instead:

1. FleetAdmin: Equipment Detail of the row that should be retained → "Merge from another EquipmentUuid"
2. Pick the source EquipmentUuid (the one to retire); modal shows a side-by-side diff of identifiers and signal counts
3. Confirm → opens a **draft** that:
   - Disables the source equipment row (`Enabled = 0`) and adds an `EventType = 'EquipmentMergedAway'` audit entry naming the target UUID
   - Re-points any tags currently on the source equipment to the target equipment
   - If the source held a ZTag/SAPID reservation that should move to the target: explicit release of the source's reservation followed by re-reservation under the target UUID, both audit-logged
4. Operator reviews the draft diff; publishes
5. Downstream consumers see the source EquipmentUuid disappear (joins on it return historical data only) and the target EquipmentUuid gain the merged tags

Merge is a destructive lineage operation — the source EquipmentUuid is never reused, but its history persists in old generations + audit log. Rare by intent; the UI buries the action behind two confirmation prompts.
## Deferred / Out of Scope

- **Cluster-scoped admin grants** (`ConfigEditor` for Cluster X only, not for Cluster Y) — originally slated for v2.1, since lifted into v2.0 (see Resolved Defaults below)
- **Per-driver custom config editors** — added in each driver's implementation phase
- **Tag template / inheritance** — define a tag pattern once and apply to many similar device instances; deferred until the bulk import path proves insufficient
- **Multi-cluster synchronized publish** — push a configuration change across many clusters atomically. Out of scope; orchestrate via per-cluster publishes from a script if needed.
- **Mobile / tablet layout** — desktop-only initially
- **Role grants editor in UI** — initial v2 manages LDAP group → admin role mappings via `appsettings.json`; UI editor surfaced later

## Decisions / Open Questions

**Decided** (captured in `plan.md` decision log):

- Blazor Server tech stack (vs. SPA + API)
- **Visual + auth parity with ScadaLink CentralUI** — Bootstrap 5, dark sidebar, server-rendered login form, cookie auth + JWT API endpoint, copied shared component set, reconnect overlay
- LDAP for operator auth via `LdapAuthService` + `RoleMapper` + `JwtTokenService` mirrored from `ScadaLink.Security`
- Three admin roles: FleetAdmin / ConfigEditor / ReadOnly, with cluster-scoped grants in v2.0 (mirrored from ScadaLink's site-scoped pattern)
- Draft → diff → publish is the only edit path; no in-place edits
- Sticky alerts require manual ack
- Per-node overrides are NOT generation-versioned
- **All content edits go through the draft → diff → publish boundary** — Namespaces, UNS Structure, Drivers, Devices, Equipment, Tags. The UNS Structure and Namespaces tabs are hybrid (read-only navigation over the published generation; click-to-edit opens the draft editor scoped to that node). No table is editable outside the publish boundary in v2.0 (revised after adversarial review finding #2 — an earlier draft mistakenly treated namespaces as cluster-level)
- **Equipment list defaults to ZTag sort** (primary browse identifier per the 3-year-plan handoff). All five identifiers (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid) are searchable; typeahead disambiguates which field matched
- **EquipmentUuid is read-only forever** in the UI; never editable. Auto-generated UUIDv4 on equipment creation, displayed as a copyable badge

**Resolved Defaults**:

- **Styling: Bootstrap 5 vendored** (not MudBlazor or Fluent UI). Direct parity with ScadaLink CentralUI; standard component vocabulary; no Blazor-specific component-library dependency. Reverses an earlier draft choice — the cross-app consistency requirement outweighs MudBlazor's component conveniences.
- **Theme: light only (single theme matching ScadaLink).** ScadaLink ships light-only with the dark sidebar / light main pattern. Operators using both apps see one consistent aesthetic. Reverses an earlier draft choice that proposed both light and dark — cross-app consistency wins. Revisit only if ScadaLink adds dark mode.
- **CSV import dialect: strict CSV (RFC 4180), UTF-8 BOM accepted.** Excel "Save as CSV (UTF-8)" produces RFC 4180-compatible output and is the documented primary input format. TSV is not supported initially; add it only if operator feedback shows real friction with Excel CSV.
- **Push notification deferred to v2.1; polling is the initial model.** SignalR-from-DB-to-nodes would tighten apply latency from ~30 s to ~1 s but adds infrastructure (a SignalR backplane or SQL Service Broker) that's not earning its keep at v2.0 scale. The publish dialog reserves a disabled **"Push now"** button labeled "Available in v2.1" so the future UX is anchored.
- **Auto-save drafts with explicit Discard button.** Every form field change writes to the draft rows immediately (debounced 500 ms). The Discard button shows a confirmation dialog ("Discard all changes since last publish?") and rolls the draft generation back to empty. The Publish button is the only commit; auto-save does not publish.
- **Cluster-scoped admin grants in v2.0** (lifted from the v2.1 deferred list). ScadaLink already ships the equivalent site-scoped pattern, so we get cluster-scoped grants essentially for free by mirroring it. `RoleMapper` reads an `LdapGroupRoleMapping` table; cluster-scoped users carry `ClusterId` claims and see only their permitted clusters.
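The CSV dialect default above is cheap to honor in an import parser because "UTF-8 BOM accepted" is just a decoding choice. A minimal sketch (Python for illustration only — the actual import path is .NET; the header names are hypothetical):

```python
import csv
import io

def read_strict_csv(data: bytes) -> list[dict[str, str]]:
    """Parse RFC 4180-style CSV with an optional UTF-8 BOM, as produced by
    Excel's "Save as CSV (UTF-8)". The 'utf-8-sig' codec strips the BOM if
    present and is a no-op otherwise, so plain UTF-8 files also work."""
    text = data.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text), strict=True))

# Excel-style payload: BOM + CRLF line endings (column names are illustrative)
payload = "\ufeffTagId,Address\r\nT1,40001\r\n".encode("utf-8")
rows = read_strict_csv(payload)
print(rows)  # [{'TagId': 'T1', 'Address': '40001'}]
```

Without the BOM handling, the first header cell would come back as `\ufeffTagId` and every lookup on the first column would silently fail — the classic Excel-CSV import bug this default is designed to avoid.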

180 docs/v2/aveva-system-platform-io-research.md (new file)
@@ -0,0 +1,180 @@

# AVEVA System Platform — Upstream OPC UA Server Research

> **Status**: DRAFT — closes corrections-doc finding E2. Research deliverable for the v2 plan; informs the tier-3 (System Platform IO) cutover scheduled for Year 3 of the 3-year plan.
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Question

Can AVEVA System Platform (formerly Wonderware) consume equipment data from an arbitrary upstream OPC UA server (in our case `OtOpcUa`) instead of talking to equipment directly? This is the technical premise of tier 3 of the consumer cutover (handoff §"Rollout Posture", Year 3).

## Verdict — GREEN-YELLOW

**The cutover is technically supported.** AVEVA has a documented, official path for AppServer's IO layer to consume from any compliant OPC UA server via the **OI Gateway** communication driver. Multiple AVEVA partners (Software Toolbox, InSource) have published end-to-end integrations against four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, Cogent DataHub).

The yellow comes from two unanswered items in public sources: (1) GxP / 21 CFR Part 11 validation guidance for non-AVEVA upstream servers, and (2) published scale benchmarks for OI Gateway's OPC UA client. Both are integrator-burden items, not blockers.

## 1. Does AppServer's IO layer support an upstream OPC UA server?

Yes — through the **OI Gateway** communication driver (formerly "Wonderware Gateway", "FSGateway", product code `OI.GATEWAY.3`). OI Gateway is an OPC UA client that republishes data to System Platform via SuiteLink. AppServer attributes reference data through a DI Object pointing at the OI Gateway instance.

AVEVA's official docs page "Configure an OPC UA data source object" describes adding an `OPCUA Connection` under OI Gateway, entering a Server Node (localhost / IP / hostname), and browsing the upstream server's namespace ([docs.aveva.com](https://docs.aveva.com/bundle/sp-appserver/page/689813.html)).

**Sources** confirming the pattern is intentionally generic:

- [AVEVA docs: Configure an OPC UA data source object](https://docs.aveva.com/bundle/sp-appserver/page/689813.html)
- [AVEVA docs: Connect to an OPC UA data source](https://docs.aveva.com/bundle/sp-cdp-drivers/page/202524.html)
- [Software Toolbox: System Platform → TOP Server via OPC UA](https://blog.softwaretoolbox.com/topserver-aveva-system-platform-opc-ua)
- [Software Toolbox: System Platform → OPC Router via OPC UA](https://blog.softwaretoolbox.com/opc-router-aveva-system-platform-opc-ua)
- [Software Toolbox: System Platform → OmniServer via OPC UA](https://blog.softwaretoolbox.com/omniserver-aveva-system-platform-opc-ua)
- [Software Toolbox: System Platform → Cogent DataHub via OPC UA](https://blog.softwaretoolbox.com/cogent-datahub-aveva-system-platform-opc-ua)

## 2. IO reference shape

AppServer does not have a native "OPC UA IO reference" type; the path is:

```
OPC UA node → OI Gateway (OPC UA client) → SuiteLink → $DDESuiteLinkDIObject → AppServer attribute
```

Attribute `IO.SourceAttribute` syntax:

```
<SuiteLinkDIObjectName>.<TopicName>.<ItemReference>
```

where `TopicName` matches the OPC UA Group defined inside OI Gateway, and `ItemReference` is the OPC UA node (browsed or manually entered).
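The three-part reference can be assembled mechanically once the DI Object, topic, and item are known. A sketch (Python for illustration; the object and topic names are invented examples, not real AppServer artifacts — the `tag_prefix` parameter models OI Gateway's optional Tag Prefix column described in §3):

```python
def io_source_attribute(di_object: str, topic: str, item_ref: str,
                        tag_prefix: str = "") -> str:
    """Build a SuiteLink IO.SourceAttribute string of the form
        <SuiteLinkDIObjectName>.<TopicName>.<ItemReference>
    If tag_prefix is set, strip it from the item name first — mirroring
    OI Gateway's Tag Prefix column, which shortens subscription item names
    by removing a common namespace prefix."""
    if tag_prefix and item_ref.startswith(tag_prefix):
        item_ref = item_ref[len(tag_prefix):]
    return f"{di_object}.{topic}.{item_ref}"

ref = io_source_attribute("DIObj_OtOpcUa", "Line3Grp",
                          "ns=2;s=Plant.Line3.Mixer1.Speed",
                          tag_prefix="ns=2;s=Plant.")
print(ref)  # DIObj_OtOpcUa.Line3Grp.Line3.Mixer1.Speed
```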

**From System Platform 2023 R2 Patch 01 onward**: AppServer/Industrial Graphics scripts can also call OPC UA methods on the upstream server via the Gateway driver, and the External Providers pane exposes the Gateway's configured OPC UA servers directly in the Tag Dictionary ([2023 R2 P01 Readme](https://docs-be.aveva.com/bundle/sp-2023-r2-p01-readme/raw/resource/enus/sp-2023-r2-p01-readme.pdf)).

## 3. Restrictions and requirements for the upstream OPC UA server

### Security (documented)

| Setting | Supported values |
|---------|------------------|
| Security policy | `None`, `Basic128Rsa15`, `Basic256`, `Basic256Sha256` (default + recommended) |
| Message security mode | `None`, `Sign`, `SignAndEncrypt` |
| User token | Anonymous, Username/Password (cert-based not called out) |

**Hard rule** from UA SDK v1.7.0 onward: if username/password is used, the security policy cannot be `None` ([AVEVA Communications Drivers Pack 2020 R2 Gateway Readme](https://industrial-software.com/wp-content/uploads/Communication_Drivers/oi-communication-drivers-pack-2020-r2/Readme_Gateway.html)).

The TLS stack is OpenSSL — `1.1.1g` in 2020 R2, upgraded to `1.1.1n` in System Platform 2023 ([System Platform 2023 Readme](https://industrial-software.com/wp-content/uploads/AVEVA_SystemPlatform/2023/ReadMe.html)).

### Certificates

OI Gateway auto-generates a client cert on first test connection and pushes it to the upstream server. The server-side admin must move it from rejected/quarantine to Trusted. The reverse direction (trusting the server cert on OI Gateway's side) is handled via OI Gateway's client security certificate pane ([AVEVA docs: OI Gateway client cert](https://docs.aveva.com/bundle/intouch-hmi/page/732922.html)).

### Namespace structure

**No mandated structure.** OI Gateway browses whatever the server exposes. There is an optional **Tag Prefix** column to strip a common namespace prefix and reduce item-name length in subscriptions.

### OPC UA Companion Specifications

**No AVEVA documentation found that requires** OPC 40010 Machinery, PA-DIM, or any other companion spec. OI Gateway behaves as a vanilla DA-style subscription client. **Implication for OtOpcUa**: our `_base` equipment-class template alignment with OPC 40010 is a *cross-system consistency* benefit, not an AppServer-required feature.

### Subscription / sampling

Configured per OPCUAGroup (publishing interval, sampling). **Specific documented limits not found in public readmes** — see §6 (Performance / scale).

## 4. Minimum versions

| Version | Capability |
|---------|------------|
| OI Gateway 2.0+ (Wonderware AppServer 2017 Update 3 / Comm Drivers Pack) | First OPC UA client capability |
| Communications Drivers Pack 2020 R2 | First broadly-documented baseline |
| System Platform 2020 | Native OPC UA *server* added (separate feature) |
| System Platform 2023 | OpenSSL upgraded to 1.1.1n; endpoint/security-policy browsing added |
| **System Platform 2023 R2 Patch 01** | **Recommended floor** — script-level OPC UA method calls + External Providers UI |

Recommend that greenfield deployments standardize on System Platform 2023 R2 P01 plus the Comm Drivers Pack that ships with it.

## 5. Validated / GxP / 21 CFR Part 11

AVEVA has a [21 CFR Part 11 doc page](https://docs.aveva.com/bundle/system-platform/page/338331.html) for System Platform, but the Zoomin-rendered content was not extractable over HTTP. AVEVA's Part 11 story centers on electronic signatures, audit trails, and secure user access at the AppServer / InTouch layer, with AVEVA PI often cited as the validated historian.

**No AVEVA document was found that explicitly addresses inserting a non-AVEVA OPC UA server between equipment and AppServer for a validated deployment.** This does not mean it's prohibited — it means the validation burden (qualification, change control, data-integrity evidence) falls on the integrator. Plan to:

- Produce an IQ/OQ package for OtOpcUa itself
- Map OtOpcUa controls to ALCOA+ data-integrity principles and 21 CFR Part 11 §11.10(a)–(k)
- Run a change-control impact assessment against the existing validated AppServer deployment when introducing OtOpcUa as the upstream IO source
- Engage QA / regulatory early (well before Year 3) so the validation paperwork doesn't gate the cutover

## 6. Performance / scale guidance

**Public AVEVA documentation does not publish per-topic item limits, max subscription counts, or throughput benchmarks** for OI Gateway's OPC UA client. The Gateway readme only mentions the Tag Prefix optimization and a note that parallel driver instances improve throughput ([AVEVA Communication Drivers product page](https://www.aveva.com/en/products/communication-drivers/)).

For a sizing commitment: benchmark in-house at the target item count, publishing interval, and `SignAndEncrypt` security mode, **before tier 3 cutover scheduling**. Open an AVEVA support case if the in-house benchmark surfaces concerns; AVEVA support has performance data they don't publish.

## 7. Practical reports — teams doing this in production

- **Software Toolbox** has production-grade walkthroughs for four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, DataHub — links in §1). Consistent pattern: OPCUA Connection → OPCUAGroup → SuiteLink DI → `$AnalogDevice` / `$DiscreteDevice` attributes.
- **Inductive Automation forum**: ["Wonderware as OPC UA Client"](https://forum.inductiveautomation.com/t/wonderware-as-opc-ua-client/55888) — a user connected OI Gateway to an Ignition OPC UA server. **Two recurring failure modes worth pre-empting in OtOpcUa's onboarding runbook**:
  1. `Bad_SecurityChecksFailed` after trusting the cert — the fix was **removing `/discovery` from the endpoint URL** (`opc.tcp://host:port`, not `opc.tcp://host:port/discovery`)
  2. The connection worked only with security policy `None` after a hostname rename invalidated the Common Name on the reissued cert. **Pin the cert CN to the hostname; don't rename hosts post-issue.**
- **InSource Tech Note DAS153** ([knowledge.insourcess.com](https://knowledge.insourcess.com/aveva-communication-drivers-tech-notes/tn-das153-connecting-to-an-opc-ua-server-through-the-wonderware-gateway-oi-server)) — the canonical AVEVA-partner procedure for the OPCUA Connection + OPCUAGroup + Activate pattern.
- **No public reports found** of teams deploying AppServer against a fully-custom in-house OPC UA server at scale. All published examples are against commercial upstream servers. **OtOpcUa would be the first of its kind in the public record** — not necessarily a problem, but worth noting that we won't have prior art to lean on for tuning at production scale.

## 8. Known incompatibilities / hard blockers

- **OI Gateway under the SYSTEM account cannot connect to a remote OPC server** (known issue per the 2020 R2 readme). Use a dedicated service account.
- **Endpoint URL sensitivity** to the `/discovery` suffix (see §7).
- **Cert Common Name must match the hostname** in use.
- **No deprecation timeline** found for OI Gateway's OPC UA client; AVEVA actively enhanced it in 2023 and 2023 R2 P01 (a positive signal — AVEVA is still investing in this path).
- **Public docs make no commitment** on OPC UA Companion Specifications. If OtOpcUa's future uplift relies on PA-DIM or Machinery types being consumed by AppServer specifically, validate that on a real deployment first.

## Implications for OtOpcUa Implementation

**Things to bake in earlier rather than retrofit:**

1. **`Basic256Sha256` + `SignAndEncrypt` + Username token must work** — this is the recommended-and-most-secure combination AppServer expects. v2's transport security model (`docs/v2/plan.md` §"Transport Security") already commits to `Basic256Sha256-Sign` and `-SignAndEncrypt` profiles per Phase 1. Confirm during Phase 1 implementation that the combination of `Basic256Sha256` + `SignAndEncrypt` + LDAP username token works end-to-end.
2. **Endpoint URL hygiene**: don't expose `/discovery` as part of the canonical endpoint URL operators copy. If we surface a "connection string for AppServer" in the Admin UI's Cluster Detail page, render it as `opc.tcp://{host}:{port}` with no path suffix.
3. **Certificate / hostname stability**: per decision #86, `ApplicationUri` is auto-suggested but never auto-rewritten because OPC UA clients pin trust to it. AppServer is exactly that case. The Admin UI's "warn on `Host` change" affordance applies directly here. Document the hostname-stable-certs requirement explicitly in the deployment guide.
4. **Service-account install** — OtOpcUa already runs as a Windows service per Phase 1 decisions; just make sure the deployment guide explicitly says **don't** run AppServer's OI Gateway under SYSTEM when pointing at OtOpcUa.
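The endpoint-hygiene item is easy to enforce wherever the Admin UI renders a copy-paste connection string. A sketch (Python for illustration only — the Admin UI is Blazor Server / .NET; the hostname is an invented example):

```python
from urllib.parse import urlsplit

def canonical_endpoint(url: str) -> str:
    """Render the operator-facing endpoint as opc.tcp://host:port with no
    path suffix — in particular no trailing /discovery, the failure mode
    reported against OI Gateway in the forum thread cited in §7."""
    parts = urlsplit(url)
    if parts.scheme != "opc.tcp":
        raise ValueError(f"not an OPC UA TCP endpoint: {url}")
    return f"opc.tcp://{parts.netloc}"

print(canonical_endpoint("opc.tcp://otopcua01:4840/discovery"))
# opc.tcp://otopcua01:4840
```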

**Things NOT required by AppServer that we still chose for other reasons** — useful to know they're not load-bearing for tier 3 specifically:

- OPC 40010 Machinery alignment in the `_base` template — for cross-system consistency (Redpanda, dbt, MQTT projection), not because AppServer reads it
- `_base` template's Identification fields exposed as OPC UA properties — for general client convenience, not because AppServer reads them

## Risks for the implementation team

In priority order:

1. **Validation / GxP story is the integrator's problem.** AVEVA has no published blueprint for Part 11 deployments where a non-AVEVA OPC UA server sits between equipment and AppServer. Budget IQ/OQ work for OtOpcUa itself and a change-control impact assessment against the existing validated AppServer deployment.
2. **Scale is unpublished.** Benchmark OI Gateway → OtOpcUa at the target item count, publishing interval, and `SignAndEncrypt` security mode before cutover scheduling. Don't assume the current direct-equipment IO rates carry over.
3. **First-of-its-kind deployment.** All published partner examples use commercial upstream servers (TOP Server, OPC Router, etc.). OtOpcUa would be the first custom in-house upstream server in the public record. Mitigation: stand up a non-production AppServer + OtOpcUa pairing well before tier 3 cutover begins, exercise the full Software Toolbox-style configuration walkthrough, and capture lessons.

## Recommendation

**Tier 3 cutover is feasible and can stay on the Year 3 schedule.** No re-architecting of OtOpcUa is required to support it. The implementation team should:

- Add a **Phase 1 acceptance test**: AppServer (any version ≥ 2023 R2 P01) connects to OtOpcUa via OI Gateway with `Basic256Sha256` + `SignAndEncrypt` + username token; reads at least one tag end-to-end through a SuiteLink DI Object.
- Add **`docs/deployment/aveva-system-platform-integration.md`** (separate doc, post-Phase-1) covering: OI Gateway version requirements, certificate exchange procedure, endpoint URL convention (no `/discovery`), service-account install requirement, and common failure modes from §7.
- **Schedule the validation/QA conversation in Year 1** so paperwork doesn't gate the Year 3 cutover.
- **Stand up a non-production AppServer + OtOpcUa pairing in Year 2** for the first-of-its-kind benchmarking and operational learning.

## Decisions to add to plan.md

| # | Decision | Rationale |
|---|----------|-----------|
| 141 | Tier 3 (AppServer IO) cutover is feasible — AVEVA's OI Gateway supports arbitrary upstream OPC UA servers as a documented pattern | Closes corrections-doc E2 with **GREEN-YELLOW** verdict. Multiple AVEVA partners have published working integrations against non-AVEVA upstream servers. No re-architecting of OtOpcUa required. See `aveva-system-platform-io-research.md` |
| 142 | Phase 1 acceptance includes an end-to-end AppServer-via-OI-Gateway smoke test against OtOpcUa | Catches the AppServer-specific quirks (cert exchange, endpoint URL handling, service account, `Basic256Sha256` + `SignAndEncrypt` + username token combo) early — well before the Year 3 cutover schedule |

## References

- [AVEVA docs: Configure an OPC UA data source object](https://docs.aveva.com/bundle/sp-appserver/page/689813.html)
- [AVEVA docs: Connect to an OPC UA data source](https://docs.aveva.com/bundle/sp-cdp-drivers/page/202524.html)
- [AVEVA docs: OI Gateway OPC UA config (GATEWAY4OI4OPCUA)](https://docs.aveva.com/bundle/sp-cdp-drivers/page/GATEWAY4OI4OPCUA.html)
- [AVEVA docs: OI Gateway client security certificate](https://docs.aveva.com/bundle/intouch-hmi/page/732922.html)
- [AVEVA docs: 21 CFR Part 11 in System Platform](https://docs.aveva.com/bundle/system-platform/page/338331.html)
- [AVEVA Communications Drivers Pack 2020 R2 Gateway Readme](https://industrial-software.com/wp-content/uploads/Communication_Drivers/oi-communication-drivers-pack-2020-r2/Readme_Gateway.html)
- [AVEVA System Platform 2023 Readme](https://industrial-software.com/wp-content/uploads/AVEVA_SystemPlatform/2023/ReadMe.html)
- [AVEVA System Platform 2023 R2 Patch 01 Readme](https://docs-be.aveva.com/bundle/sp-2023-r2-p01-readme/raw/resource/enus/sp-2023-r2-p01-readme.pdf)
- [AVEVA Communication Drivers product page](https://www.aveva.com/en/products/communication-drivers/)
- [Software Toolbox: System Platform → TOP Server via OPC UA](https://blog.softwaretoolbox.com/topserver-aveva-system-platform-opc-ua)
- [Software Toolbox: System Platform → OPC Router via OPC UA](https://blog.softwaretoolbox.com/opc-router-aveva-system-platform-opc-ua)
- [Software Toolbox: System Platform → OmniServer via OPC UA](https://blog.softwaretoolbox.com/omniserver-aveva-system-platform-opc-ua)
- [Software Toolbox: System Platform → Cogent DataHub via OPC UA](https://blog.softwaretoolbox.com/cogent-datahub-aveva-system-platform-opc-ua)
- [InSource TN DAS153: Gateway OI Server → OPC UA](https://knowledge.insourcess.com/aveva-communication-drivers-tech-notes/tn-das153-connecting-to-an-opc-ua-server-through-the-wonderware-gateway-oi-server)
- [Inductive Automation forum: Wonderware as OPC UA Client](https://forum.inductiveautomation.com/t/wonderware-as-opc-ua-client/55888)
- [OPC Connect: Turning Wonderware System Platform into an OPC UA Server (2017)](https://opcconnect.opcfoundation.org/2017/06/turning-wonderware-system-platform-into-an-opc-ua-server/)

912 docs/v2/config-db-schema.md (new file)
@@ -0,0 +1,912 @@

# Central Config DB Schema — OtOpcUa v2

> **Status**: DRAFT — companion to `plan.md` §4. Concrete schema, indexes, stored procedures, and authorization model for the central MSSQL configuration database.
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Scope

This document defines the central MSSQL database that stores all OtOpcUa fleet configuration: clusters, nodes, drivers, devices, tags, poll groups, credentials, and config generations. It is the single source of truth for fleet management — every running OtOpcUa node reads its config from here, and every operator change goes through here.

Out of scope here (covered elsewhere):

- The Admin web UI that edits this DB → `admin-ui.md`
- The local LiteDB cache on each node → covered briefly at the end of this doc; the full schema is small and tracks only what's needed for offline boot
- Driver-specific JSON shapes inside `DriverConfig` / `DeviceConfig` / `TagConfig` → `driver-specs.md` per driver
- The cluster topology and rollout model → `plan.md` §4

## Design Goals

1. **Atomic publish, surgical apply** — operators publish a whole generation in one transaction; nodes apply only the diff
2. **Cluster-scoped isolation** — one cluster's config changes never affect another cluster
3. **Per-node credential binding** — each physical node has its own auth principal; the DB rejects cross-cluster reads server-side
4. **Schemaless driver config** — driver-type-specific settings live in JSON columns so adding a new driver type doesn't require a schema migration
5. **Append-only generations** — old generations are never deleted; rollback is just publishing an older generation as new
6. **Auditable** — every publish, rollback, and apply event is recorded with the principal that did it
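Goal 1 ("surgical apply") reduces to a keyed diff between the generation a node has applied and the newly published one. A minimal sketch (Python for illustration only — the node runtime is .NET; rows are modeled as logical-ID → content-hash maps, an assumption about the apply algorithm, not a specified API):

```python
def generation_diff(applied: dict[str, str], published: dict[str, str]):
    """Diff two generations keyed by stable logical ID (TagId, NamespaceId,
    DriverInstanceId, ...) with a content hash per row. Returns the three
    sets the node must act on: add, update, remove."""
    added   = sorted(k for k in published if k not in applied)
    removed = sorted(k for k in applied if k not in published)
    changed = sorted(k for k in published
                     if k in applied and applied[k] != published[k])
    return added, changed, removed

old = {"tag:T1": "h1", "tag:T2": "h2"}
new = {"tag:T1": "h1", "tag:T2": "h2b", "tag:T3": "h3"}
print(generation_diff(old, new))  # (['tag:T3'], ['tag:T2'], [])
```

Keying the diff on stable logical IDs rather than row PKs is what lets rollback reuse the exact same apply path: publishing an older generation as new just produces a diff in the other direction.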

## Schema Overview

```
ServerCluster (1)──(1..2) ClusterNode (1)──(1..N) ClusterNodeCredential
  │
  └──(1)──(N) ConfigGeneration ──(N)── Namespace (generation-versioned; Kind: Equipment | SystemPlatform | Simulated)
       │                                  ↑
       │                                  │
       ├──(N)── DriverInstance ──(N)── Device
       │            │
       │            │  Same-cluster invariant:
       │            │    DriverInstance.NamespaceId → Namespace
       │            │    must satisfy Namespace.ClusterId = DriverInstance.ClusterId
       │            │
       │       (1)──┴──(N) Equipment ──(N)── Tag (Equipment-ns)
       │            │          │
       │            │          │  Equipment carries:
       │            │          │    - EquipmentId (system-generated 'EQ-' + uuid prefix; never operator-set)
       │            │          │    - EquipmentUuid (immutable UUIDv4)
       │            │          │    - MachineCode (operator colloquial; required)
       │            │          │    - ZTag (ERP id; primary browse identifier; reservation-backed)
       │            │          │    - SAPID (SAP PM id; reservation-backed)
       │            │          │    - UnsLineId → UnsLine → UnsArea (UNS structure)
       │            │          │    - Name (UNS level 5)
       │            │
       │            └──(N)── Tag (SystemPlatform-ns; via DriverInstance + FolderPath)
       │
       ├──(N)── UnsArea (UNS level 3; per-cluster, generation-versioned)
       │            │
       │            └──(1..N) UnsLine (UNS level 4; per-area, generation-versioned)
       │
       └──(N)── PollGroup (driver-scoped)

ExternalIdReservation — fleet-wide ZTag/SAPID uniqueness, NOT generation-versioned;
                        survives rollback, disable, and re-enable
ClusterNodeGenerationState (1:1 ClusterNode) — tracks applied generation per node
ConfigAuditLog — append-only event log
```

**Key relationships for UNS / two-namespace model**:

- Each `DriverInstance` is bound to one `Namespace` (driver type restricts allowed `Namespace.Kind`).
- `UnsArea` and `UnsLine` are first-class generation-versioned entities, so renaming/reorganizing the UNS structure doesn't require rewriting every equipment row — change one `UnsArea.Name` and every equipment under it picks up the new path automatically.
- `Equipment` rows exist only when their driver is in an Equipment-kind namespace; `EquipmentUuid` is immutable across all generations of the cluster. Five identifiers per equipment (EquipmentId / EquipmentUuid / MachineCode / ZTag / SAPID) serve different audiences and are all exposed as OPC UA properties.
- `Tag.EquipmentId` is required for Equipment-ns tags, NULL for SystemPlatform-ns tags. The `FolderPath` column is used only by SystemPlatform-ns tags (preserving the v1 LmxOpcUa hierarchy expression).

## Table Definitions

All `Json` columns use `nvarchar(max)` with a `CHECK (ISJSON(col) = 1)` constraint. Timestamps are `datetime2(3)` UTC. PKs use `uniqueidentifier` (sequential GUIDs) unless noted; logical IDs (`ClusterId`, `NodeId`, `DriverInstanceId`, `TagId`) are `nvarchar(64)` for human readability.

### `ServerCluster`

```sql
CREATE TABLE dbo.ServerCluster (
    ClusterId       nvarchar(64)   NOT NULL PRIMARY KEY,
    Name            nvarchar(128)  NOT NULL,
    Enterprise      nvarchar(32)   NOT NULL,  -- UNS level 1, canonical org value: "zb"
    Site            nvarchar(32)   NOT NULL,  -- UNS level 2, e.g. "warsaw-west"
    NodeCount       tinyint        NOT NULL CHECK (NodeCount IN (1, 2)),
    RedundancyMode  nvarchar(16)   NOT NULL CHECK (RedundancyMode IN ('None', 'Warm', 'Hot')),
    Enabled         bit            NOT NULL DEFAULT 1,
    Notes           nvarchar(1024) NULL,
    CreatedAt       datetime2(3)   NOT NULL DEFAULT SYSUTCDATETIME(),
    CreatedBy       nvarchar(128)  NOT NULL,
    ModifiedAt      datetime2(3)   NULL,
    ModifiedBy      nvarchar(128)  NULL,
    CONSTRAINT CK_ServerCluster_RedundancyMode_NodeCount
        CHECK ((NodeCount = 1 AND RedundancyMode = 'None')
            OR (NodeCount = 2 AND RedundancyMode IN ('Warm', 'Hot')))
    -- Stricter UNS segment validation (`^[a-z0-9-]{1,32}$`) is enforced in the
    -- application layer + sp_ValidateDraft. The GRANT model prevents direct table
    -- inserts, so application validation is the enforcement point.
);

CREATE UNIQUE INDEX UX_ServerCluster_Name ON dbo.ServerCluster (Name);
CREATE INDEX IX_ServerCluster_Site ON dbo.ServerCluster (Site);
```

`Enterprise` and `Site` are UNS levels 1–2; they are cluster-level (they don't change per generation) and feed every Equipment-namespace path. `NamespaceUri` moved out of this table — namespaces are now first-class rows in the `Namespace` table.
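The segment rule noted in the `CHECK` comment (`^[a-z0-9-]{1,32}$`) can be mirrored client-side before a draft ever reaches `sp_ValidateDraft`. A sketch (Python for illustration; the validation itself lives in the .NET application layer):

```python
import re

# Lowercase alphanumerics and hyphen, 1-32 chars — the rule applied to
# Enterprise and Site (UNS levels 1-2) in the application layer.
UNS_SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")

def valid_uns_segment(s: str) -> bool:
    return bool(UNS_SEGMENT.fullmatch(s))

assert valid_uns_segment("warsaw-west")
assert not valid_uns_segment("Warsaw West")   # uppercase + space rejected
assert not valid_uns_segment("")              # too short
```

Because the GRANT model blocks direct table inserts, this check plus `sp_ValidateDraft` together form the actual enforcement boundary; the table-level comment documents the rule rather than implementing it.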

### `Namespace`

```sql
CREATE TABLE dbo.Namespace (
    NamespaceRowId  uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId    bigint           NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    NamespaceId     nvarchar(64)     NOT NULL,  -- stable logical ID across generations, e.g. "LINE3-OPCUA-equipment"
    ClusterId       nvarchar(64)     NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    Kind            nvarchar(32)     NOT NULL CHECK (Kind IN ('Equipment', 'SystemPlatform', 'Simulated')),
    NamespaceUri    nvarchar(256)    NOT NULL,
    Enabled         bit              NOT NULL DEFAULT 1,
    Notes           nvarchar(1024)   NULL
);

-- Within a generation: a cluster has at most one namespace per Kind
CREATE UNIQUE INDEX UX_Namespace_Generation_Cluster_Kind ON dbo.Namespace (GenerationId, ClusterId, Kind);
-- Within a generation: NamespaceUri unique fleet-wide (clients pin to namespace URIs)
CREATE UNIQUE INDEX UX_Namespace_Generation_NamespaceUri ON dbo.Namespace (GenerationId, NamespaceUri);
-- Within a generation: logical ID unique fleet-wide
CREATE UNIQUE INDEX UX_Namespace_Generation_LogicalId ON dbo.Namespace (GenerationId, NamespaceId);
-- Composite key DriverInstance uses for same-cluster validation
CREATE UNIQUE INDEX UX_Namespace_Generation_LogicalId_Cluster ON dbo.Namespace (GenerationId, NamespaceId, ClusterId);
CREATE INDEX IX_Namespace_Generation_Cluster ON dbo.Namespace (GenerationId, ClusterId);
```
|
||||
|
||||
`Namespace` is **generation-versioned** (revised after adversarial review 2026-04-17 finding #2). Adding, disabling, or changing a namespace is a content publish, not a topology operation — these changes affect what consumers see at the OPC UA endpoint and must travel through the same draft → diff → publish → rollback flow as drivers/tags/equipment. Reasoning: a cluster-level namespace would let an admin disable a namespace that a published driver depends on, breaking the live config without a generation change and making rollback unreproducible.
|
||||
|
||||
**Cross-generation invariants** (enforced by `sp_ValidateDraft`):
|
||||
- **Logical-ID identity stability**: once a `(NamespaceId, ClusterId)` pair is published, every subsequent generation that includes that NamespaceId must keep the same `Kind` and the same `NamespaceUri`. Renaming a NamespaceUri or changing its Kind is forbidden — create a new NamespaceId instead. This protects clients that pin trust to the URI.
|
||||
- **Append-only logical-ID space**: a NamespaceId once introduced is never reused in the same cluster for a different namespace, even after disable. Disabling sets `Enabled = 0`; the logical ID stays bound to its original Kind/URI.
|
||||
- **Auto-rollback safety**: rolling back to a generation that included a namespace which is currently disabled is permitted (publish reactivates it). Rolling back through a NamespaceUri rename is forbidden by the invariant above — operator must explicitly reconcile.
|
||||
|
||||
`Simulated` is reserved in the `Kind` enum but no driver populates it in v2.0 — adding the future replay driver is a draft → publish flow that adds a Namespace row of `Kind = 'Simulated'` and one or more drivers bound to it.

### `ClusterNode`

```sql
CREATE TABLE dbo.ClusterNode (
    NodeId nvarchar(64) NOT NULL PRIMARY KEY,
    ClusterId nvarchar(64) NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    RedundancyRole nvarchar(16) NOT NULL CHECK (RedundancyRole IN ('Primary', 'Secondary', 'Standalone')),
    Host nvarchar(255) NOT NULL,
    OpcUaPort int NOT NULL DEFAULT 4840,
    DashboardPort int NOT NULL DEFAULT 8081,
    ApplicationUri nvarchar(256) NOT NULL,
    ServiceLevelBase tinyint NOT NULL DEFAULT 200,
    DriverConfigOverridesJson nvarchar(max) NULL CHECK (DriverConfigOverridesJson IS NULL OR ISJSON(DriverConfigOverridesJson) = 1),
    Enabled bit NOT NULL DEFAULT 1,
    LastSeenAt datetime2(3) NULL,
    CreatedAt datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
    CreatedBy nvarchar(128) NOT NULL
);

-- ApplicationUri uniqueness is FLEET-WIDE, not per-cluster (per plan.md decision #86)
CREATE UNIQUE INDEX UX_ClusterNode_ApplicationUri ON dbo.ClusterNode (ApplicationUri);
CREATE INDEX IX_ClusterNode_ClusterId ON dbo.ClusterNode (ClusterId);

-- Each cluster has at most one Primary
CREATE UNIQUE INDEX UX_ClusterNode_Primary_Per_Cluster
    ON dbo.ClusterNode (ClusterId)
    WHERE RedundancyRole = 'Primary';
```

`DriverConfigOverridesJson` shape:

```jsonc
{
  "<DriverInstanceId>": {
    "<JSON path within DriverConfig>": "<override value>"
  },
  // Example:
  "GalaxyMain": {
    "MxAccess.ClientName": "OtOpcUa-NodeB"
  }
}
```

The merge happens at apply time on the node — cluster-level `DriverConfig` is read, then this node's overrides are layered on top using JSON-pointer or simple key-path semantics. Tags and devices have **no** per-node override path.
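
The apply-time layering described above can be sketched as follows — a minimal stand-in for the real merge code, assuming simple dotted key-path semantics (function and variable names are illustrative, not the actual implementation):

```python
import json

def apply_node_overrides(driver_config: dict, overrides: dict) -> dict:
    """Layer key-path overrides (e.g. "MxAccess.ClientName") onto a config dict.

    Illustrative sketch: the real apply code may use JSON-pointer semantics instead.
    """
    merged = json.loads(json.dumps(driver_config))  # deep copy via JSON round-trip
    for key_path, value in overrides.items():
        target = merged
        *parents, leaf = key_path.split(".")
        for segment in parents:
            # walk (or create) intermediate objects along the key path
            target = target.setdefault(segment, {})
        target[leaf] = value
    return merged

# Hypothetical cluster-level config plus the per-node override from the example above
cluster_config = {"MxAccess": {"ClientName": "OtOpcUa-NodeA", "Timeout": 5000}}
node_overrides = {"MxAccess.ClientName": "OtOpcUa-NodeB"}
merged = apply_node_overrides(cluster_config, node_overrides)
# merged keeps Timeout from the cluster config, ClientName from the node override
```

Note the cluster-level config is never mutated — each node derives its own merged copy at apply time.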

### `ClusterNodeCredential`

```sql
CREATE TABLE dbo.ClusterNodeCredential (
    CredentialId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    NodeId nvarchar(64) NOT NULL FOREIGN KEY REFERENCES dbo.ClusterNode(NodeId),
    Kind nvarchar(32) NOT NULL CHECK (Kind IN ('SqlLogin', 'ClientCertThumbprint', 'ADPrincipal', 'gMSA')),
    Value nvarchar(512) NOT NULL, -- login name, cert thumbprint, SID, etc.
    Enabled bit NOT NULL DEFAULT 1,
    RotatedAt datetime2(3) NULL,
    CreatedAt datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
    CreatedBy nvarchar(128) NOT NULL
);

CREATE INDEX IX_ClusterNodeCredential_NodeId ON dbo.ClusterNodeCredential (NodeId, Enabled);
CREATE UNIQUE INDEX UX_ClusterNodeCredential_Value ON dbo.ClusterNodeCredential (Kind, Value) WHERE Enabled = 1;
```

A node may have multiple enabled credentials simultaneously (e.g. during cert rotation: old + new both valid for a window). Disabled rows are kept for audit.

### `ConfigGeneration`

```sql
CREATE TABLE dbo.ConfigGeneration (
    GenerationId bigint IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    ClusterId nvarchar(64) NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    Status nvarchar(16) NOT NULL CHECK (Status IN ('Draft', 'Published', 'Superseded', 'RolledBack')),
    ParentGenerationId bigint NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    PublishedAt datetime2(3) NULL,
    PublishedBy nvarchar(128) NULL,
    Notes nvarchar(1024) NULL,
    CreatedAt datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
    CreatedBy nvarchar(128) NOT NULL
);

-- Fast lookup of "latest published generation for cluster X" (the per-node poll path)
CREATE INDEX IX_ConfigGeneration_Cluster_Published
    ON dbo.ConfigGeneration (ClusterId, Status, GenerationId DESC)
    INCLUDE (PublishedAt);

-- One Draft per cluster at a time (prevents accidental concurrent edits)
CREATE UNIQUE INDEX UX_ConfigGeneration_Draft_Per_Cluster
    ON dbo.ConfigGeneration (ClusterId)
    WHERE Status = 'Draft';
```

`Status` transitions: `Draft → Published → Superseded` (when a newer generation is published) or `Draft → Published → RolledBack` (when explicitly rolled back). No transition skips Published.
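
The transition rules above can be sketched as a tiny allowed-transition table — illustrative only; the real enforcement lives in the publish/rollback stored procedures:

```python
# Allowed ConfigGeneration status transitions, mirroring the prose above.
ALLOWED_TRANSITIONS = {
    ("Draft", "Published"),
    ("Published", "Superseded"),   # a newer generation was published
    ("Published", "RolledBack"),   # explicit rollback
}

def can_transition(current: str, new: str) -> bool:
    """True when the status change is one of the allowed edges."""
    return (current, new) in ALLOWED_TRANSITIONS

assert can_transition("Draft", "Published")
assert not can_transition("Draft", "Superseded")  # no transition skips Published
```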

### `DriverInstance`

```sql
CREATE TABLE dbo.DriverInstance (
    DriverInstanceRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId bigint NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    DriverInstanceId nvarchar(64) NOT NULL, -- stable logical ID across generations
    ClusterId nvarchar(64) NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    NamespaceId nvarchar(64) NOT NULL, -- logical FK to Namespace (resolved within the same generation; see note below)
    Name nvarchar(128) NOT NULL,
    DriverType nvarchar(32) NOT NULL, -- Galaxy | ModbusTcp | AbCip | AbLegacy | S7 | TwinCat | Focas | OpcUaClient
    Enabled bit NOT NULL DEFAULT 1,
    DriverConfig nvarchar(max) NOT NULL CHECK (ISJSON(DriverConfig) = 1)
);

CREATE INDEX IX_DriverInstance_Generation_Cluster
    ON dbo.DriverInstance (GenerationId, ClusterId);
CREATE INDEX IX_DriverInstance_Generation_Namespace
    ON dbo.DriverInstance (GenerationId, NamespaceId);
CREATE UNIQUE INDEX UX_DriverInstance_Generation_LogicalId
    ON dbo.DriverInstance (GenerationId, DriverInstanceId);
```

`NamespaceId` references the generation-versioned `Namespace` row that this driver populates. It is a logical FK (matched by `GenerationId + NamespaceId` in app code), not a declared SQL FK — `Namespace.NamespaceId` is only unique per generation, so a single-column FK cannot be declared against it. The driver type → allowed namespace Kind mapping is enforced in `sp_ValidateDraft` (not in a DB CHECK, because it's a cross-table constraint):

| `DriverType` | Allowed `Namespace.Kind` |
|--------------|--------------------------|
| Galaxy | SystemPlatform |
| ModbusTcp / AbCip / AbLegacy / S7 / TwinCat / Focas | Equipment |
| OpcUaClient | Equipment OR SystemPlatform (per-instance config decides) |

**Same-cluster invariant** (revised after adversarial review 2026-04-17 finding #1): the `Namespace` referenced by `DriverInstance.NamespaceId` MUST belong to the same `ClusterId`. This is a cross-cluster trust boundary — without enforcement, a draft for cluster A could bind to a namespace owned by cluster B, leaking that cluster's URI into A's endpoint and breaking tenant isolation. Three layers of enforcement:

1. **`sp_ValidateDraft`**: rejects any draft where `(NamespaceId, ClusterId)` does not resolve in the `Namespace` table for the same generation. The implementation joins `DriverInstance` (NamespaceId, ClusterId) against `UX_Namespace_Generation_LogicalId_Cluster` — the unique index above is sized for exactly this lookup.
2. **API scoping**: the namespace-selection endpoint used by the Admin UI's draft editor accepts a `ClusterId` parameter and returns only namespaces for that cluster. UI filtering alone is insufficient — server-side scoping prevents bypass via crafted requests.
3. **Audit on cross-cluster attempt**: any rejected draft that attempted a cross-cluster namespace binding is logged with `EventType = 'CrossClusterNamespaceAttempt'` in `ConfigAuditLog` for review.
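
The layer-1 check reduces to a set-membership test over `(NamespaceId, ClusterId)` pairs within one generation. A pure-Python stand-in for the SQL join (data shapes and names are illustrative):

```python
def find_cross_cluster_bindings(drivers, namespaces):
    """Return DriverInstance rows whose (NamespaceId, ClusterId) pair does not
    resolve against the Namespace rows of the same generation.

    drivers / namespaces: iterables of dicts with "NamespaceId" and "ClusterId" keys.
    """
    valid = {(ns["NamespaceId"], ns["ClusterId"]) for ns in namespaces}
    return [d for d in drivers if (d["NamespaceId"], d["ClusterId"]) not in valid]

# Namespace owned by cluster A; a driver in cluster B attempts to bind to it.
namespaces = [{"NamespaceId": "LINE3-OPCUA-equipment", "ClusterId": "A"}]
ok = {"NamespaceId": "LINE3-OPCUA-equipment", "ClusterId": "A"}
bad = {"NamespaceId": "LINE3-OPCUA-equipment", "ClusterId": "B"}  # cross-cluster
assert find_cross_cluster_bindings([ok, bad], namespaces) == [bad]
```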

### `Device`

```sql
CREATE TABLE dbo.Device (
    DeviceRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId bigint NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    DeviceId nvarchar(64) NOT NULL,
    DriverInstanceId nvarchar(64) NOT NULL,
    Name nvarchar(128) NOT NULL,
    Enabled bit NOT NULL DEFAULT 1,
    DeviceConfig nvarchar(max) NOT NULL CHECK (ISJSON(DeviceConfig) = 1)
);

CREATE INDEX IX_Device_Generation_Driver
    ON dbo.Device (GenerationId, DriverInstanceId);
CREATE UNIQUE INDEX UX_Device_Generation_LogicalId
    ON dbo.Device (GenerationId, DeviceId);
```

The FK to `DriverInstance` is logical (matched by `GenerationId + DriverInstanceId` in app code), not declared as a SQL FK — declaring it would require composite FKs that are awkward when generations are immutable. The publish stored procedure validates referential integrity before flipping `Status`.

### `UnsArea`

```sql
CREATE TABLE dbo.UnsArea (
    UnsAreaRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId bigint NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    UnsAreaId nvarchar(64) NOT NULL, -- stable logical ID across generations
    ClusterId nvarchar(64) NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    Name nvarchar(32) NOT NULL, -- UNS level 3, [a-z0-9-]{1,32} or "_default"
    Notes nvarchar(512) NULL
);

CREATE INDEX IX_UnsArea_Generation_Cluster
    ON dbo.UnsArea (GenerationId, ClusterId);
CREATE UNIQUE INDEX UX_UnsArea_Generation_LogicalId
    ON dbo.UnsArea (GenerationId, UnsAreaId);
CREATE UNIQUE INDEX UX_UnsArea_Generation_ClusterName
    ON dbo.UnsArea (GenerationId, ClusterId, Name);
```

### `UnsLine`

```sql
CREATE TABLE dbo.UnsLine (
    UnsLineRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId bigint NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    UnsLineId nvarchar(64) NOT NULL, -- stable logical ID across generations
    UnsAreaId nvarchar(64) NOT NULL, -- FK to UnsArea (by logical id; resolved within same generation)
    Name nvarchar(32) NOT NULL, -- UNS level 4, [a-z0-9-]{1,32} or "_default"
    Notes nvarchar(512) NULL
);

CREATE INDEX IX_UnsLine_Generation_Area
    ON dbo.UnsLine (GenerationId, UnsAreaId);
CREATE UNIQUE INDEX UX_UnsLine_Generation_LogicalId
    ON dbo.UnsLine (GenerationId, UnsLineId);
CREATE UNIQUE INDEX UX_UnsLine_Generation_AreaName
    ON dbo.UnsLine (GenerationId, UnsAreaId, Name);
```

`UnsArea` and `UnsLine` make the UNS structure first-class so operators can rename / move it without rewriting every equipment row. Both are **generation-versioned** (renames go through publish + diff for safety + audit). Cross-generation logical-ID stability is enforced by `sp_ValidateDraft`: a `UnsAreaId` once introduced keeps the same identity across generations, even if its `Name` changes.

### `Equipment`

```sql
CREATE TABLE dbo.Equipment (
    EquipmentRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId bigint NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    EquipmentId nvarchar(64) NOT NULL,       -- system-generated stable internal logical ID. NEVER operator-supplied,
                                             -- NEVER appears in CSV imports, NEVER editable in Admin UI. Format:
                                             -- 'EQ-' + first 12 hex chars of EquipmentUuid. Generated by server-side
                                             -- equipment-creation API; sp_ValidateDraft rejects any draft whose
                                             -- Equipment.EquipmentId does not match the canonical derivation rule.
    EquipmentUuid uniqueidentifier NOT NULL, -- UUIDv4, IMMUTABLE across all generations of the same EquipmentId
    DriverInstanceId nvarchar(64) NOT NULL,  -- which driver provides data for this equipment
    DeviceId nvarchar(64) NULL,              -- optional, for multi-device drivers
    UnsLineId nvarchar(64) NOT NULL,         -- FK to UnsLine (by logical id; resolved within same generation).
                                             -- Determines UNS Area + Line via the UnsLine→UnsArea chain.
    Name nvarchar(32) NOT NULL,              -- UNS level 5, [a-z0-9-]{1,32} (the equipment segment in the path)

    -- Operator-facing and external-system identifiers
    MachineCode nvarchar(64) NOT NULL, -- operator colloquial id (e.g. "machine_001"). Unique within cluster.
    ZTag nvarchar(64) NULL,            -- ERP equipment id. Unique fleet-wide. Primary identifier for browsing in Admin UI.
    SAPID nvarchar(64) NULL,           -- SAP PM equipment id. Unique fleet-wide.

    -- OPC 40010 Machinery Identification fields (per the `_base` equipment-class template in the schemas repo).
    -- All nullable so equipment can be added before identity is fully captured; populated over time.
    Manufacturer nvarchar(64) NULL,       -- OPC 40010 Manufacturer
    Model nvarchar(64) NULL,              -- OPC 40010 Model
    SerialNumber nvarchar(64) NULL,       -- OPC 40010 SerialNumber
    HardwareRevision nvarchar(32) NULL,   -- OPC 40010 HardwareRevision
    SoftwareRevision nvarchar(32) NULL,   -- OPC 40010 SoftwareRevision (some drivers can read dynamically; this is the operator-set fallback)
    YearOfConstruction smallint NULL,     -- OPC 40010 YearOfConstruction
    AssetLocation nvarchar(256) NULL,     -- OPC 40010 Location, free-text supplementary to UNS path (e.g. "Bay 3, Row 12")
    ManufacturerUri nvarchar(512) NULL,   -- OPC 40010 ManufacturerUri
    DeviceManualUri nvarchar(512) NULL,   -- OPC 40010 DeviceManual (URL)

    EquipmentClassRef nvarchar(128) NULL, -- nullable hook for future schemas-repo template ID
    Enabled bit NOT NULL DEFAULT 1
);

CREATE INDEX IX_Equipment_Generation_Driver
    ON dbo.Equipment (GenerationId, DriverInstanceId);
CREATE INDEX IX_Equipment_Generation_Line
    ON dbo.Equipment (GenerationId, UnsLineId);
CREATE UNIQUE INDEX UX_Equipment_Generation_LogicalId
    ON dbo.Equipment (GenerationId, EquipmentId);
-- UNS path uniqueness within a generation: (UnsLineId, Name) — Area/Line names live on UnsLine
CREATE UNIQUE INDEX UX_Equipment_Generation_LinePath
    ON dbo.Equipment (GenerationId, UnsLineId, Name);
-- EquipmentUuid → EquipmentId mapping is 1:1 across all generations of a cluster (cross-gen check in sp_ValidateDraft)
CREATE UNIQUE INDEX UX_Equipment_Generation_Uuid
    ON dbo.Equipment (GenerationId, EquipmentUuid);

-- Operator-facing identifier indexes — primary browse identifier is ZTag
CREATE INDEX IX_Equipment_Generation_ZTag
    ON dbo.Equipment (GenerationId, ZTag) WHERE ZTag IS NOT NULL;
CREATE INDEX IX_Equipment_Generation_SAPID
    ON dbo.Equipment (GenerationId, SAPID) WHERE SAPID IS NOT NULL;
-- MachineCode unique within cluster — composite check in sp_ValidateDraft (needs join through DriverInstance to get cluster)
CREATE INDEX IX_Equipment_Generation_MachineCode
    ON dbo.Equipment (GenerationId, MachineCode);
```

**Note on ZTag/SAPID uniqueness**: the per-generation indexes above are non-unique (only `IX_*`, not `UX_*`). Fleet-wide uniqueness lives in the `ExternalIdReservation` table (see below). Per-generation indexes exist only for fast lookup; uniqueness is enforced at publish time against the reservation table, which is rollback-safe.

**Identifier responsibilities** — equipment carries five distinct identifiers, each with a different audience:

| Identifier | Audience | Mutable? | Uniqueness scope | Purpose |
|------------|----------|:--------:|------------------|---------|
| `EquipmentId` | Internal config DB | No (after publish) | Within cluster | Stable logical key for cross-generation diffs |
| `EquipmentUuid` | Downstream events / dbt / Redpanda | **Never** | Globally unique (UUIDv4) | Permanent join key across systems and time |
| `MachineCode` | OT operators | Yes (with publish) | Within cluster | Colloquial name in conversations and runbooks (e.g. `machine_001`) |
| `ZTag` | ERP integration | Yes (rare) | Fleet-wide | **Primary identifier for browsing in Admin UI** — list/search default sort |
| `SAPID` | SAP PM integration | Yes (rare) | Fleet-wide | Maintenance system join key |

All five are exposed as **OPC UA properties** on the equipment node so external systems can resolve equipment by whichever identifier they natively use, without needing a sidecar lookup service.

**OPC 40010 Machinery identity columns**: `Manufacturer`, `Model`, `SerialNumber`, `HardwareRevision`, `SoftwareRevision`, `YearOfConstruction`, `AssetLocation`, `ManufacturerUri`, `DeviceManualUri` are operator-set static metadata exposed as OPC UA properties on the equipment node's `Identification` sub-folder per OPC UA Companion Spec OPC 40010 (Machinery). Drivers that can read these dynamically (e.g. FANUC `cnc_sysinfo()` returns the software revision) override the static value at runtime; for everything else, the operator-set value flows through. The `_base` equipment-class template in the schemas repo declares these as required identity signals (`Manufacturer` and `Model` required; the rest optional) — every equipment-class template inherits the set via `extends: "_base"`.

**UUID immutability**: `sp_ValidateDraft` rejects a generation if any `(EquipmentId, EquipmentUuid)` pair conflicts with the same `EquipmentId` in any prior generation of the same cluster. Once an EquipmentId is published with a UUID, that UUID is locked for the life of the cluster. Operators can rename Area/Line/Name and edit MachineCode/ZTag/SAPID freely; the UUID stays.
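
The canonical EquipmentId derivation rule stated in the column comment (`'EQ-'` + first 12 hex chars of EquipmentUuid) can be sketched as follows — `sp_ValidateDraft` re-derives and compares; the function name is illustrative, and "first 12 hex chars" is read here as the first 12 characters of the hyphen-free hex form:

```python
import uuid

def derive_equipment_id(equipment_uuid: uuid.UUID) -> str:
    """Canonical derivation: 'EQ-' + first 12 hex chars of the UUID (no hyphens)."""
    return "EQ-" + equipment_uuid.hex[:12]

u = uuid.UUID("12345678-1234-4abc-8def-1234567890ab")  # hypothetical UUIDv4
assert derive_equipment_id(u) == "EQ-123456781234"
```

Because the ID is derived rather than assigned, a draft row whose EquipmentId does not match this re-derivation is rejected outright — there is no way to hand-craft one.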

**UNS validation** (in `sp_ValidateDraft` and Admin UI):
- `UnsArea.Name`, `UnsLine.Name`, `Equipment.Name`: each matches `^[a-z0-9-]{1,32}$` OR equals the literal `_default`
- Computed full path `{Cluster.Enterprise}/{Cluster.Site}/{UnsArea.Name}/{UnsLine.Name}/{Equipment.Name}` ≤ 200 chars
- The driver providing this Equipment must belong to a namespace with `Kind = 'Equipment'` (cross-table check)
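
The segment and path rules above, sketched in Python — the regex and the 200-char limit come from the list; treating Enterprise/Site under the same segment rule is an assumption (the `_default` escape only applies to Area/Line in practice):

```python
import re

SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")

def valid_segment(name: str) -> bool:
    """A UNS segment is lowercase alphanumeric/hyphen, 1-32 chars, or literal "_default"."""
    return name == "_default" or SEGMENT.match(name) is not None

def valid_path(enterprise: str, site: str, area: str, line: str, equipment: str) -> bool:
    """All five segments valid and the joined path within the 200-char limit."""
    segments = [enterprise, site, area, line, equipment]
    return all(valid_segment(s) for s in segments) and len("/".join(segments)) <= 200

assert valid_path("zb", "warsaw-west", "packaging", "_default", "machine-001")
assert not valid_segment("Machine_001")  # uppercase and underscore both rejected
```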

**Identifier validation** (in `sp_ValidateDraft`):
- `MachineCode` unique within cluster (cross-table check via `DriverInstance.ClusterId`)
- `ZTag` unique fleet-wide when not null
- `SAPID` unique fleet-wide when not null
- `MachineCode` is required; `ZTag` and `SAPID` are optional (some equipment might not yet be in ERP/SAP)

`EquipmentClassRef` is a nullable string hook; v2.0 ships with no validation. When the central `schemas` repo lands, this becomes a foreign key into the schemas-repo equipment-class catalog, validated at draft-publish time.

### `NodeAcl`

```sql
CREATE TABLE dbo.NodeAcl (
    NodeAclRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId bigint NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    NodeAclId nvarchar(64) NOT NULL, -- stable logical ID across generations
    ClusterId nvarchar(64) NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    LdapGroup nvarchar(256) NOT NULL,
    ScopeKind nvarchar(16) NOT NULL CHECK (ScopeKind IN ('Cluster', 'Namespace', 'UnsArea', 'UnsLine', 'Equipment', 'Tag')),
    ScopeId nvarchar(64) NULL,    -- NULL when ScopeKind='Cluster'; logical ID otherwise
    PermissionFlags int NOT NULL, -- bitmask of NodePermissions enum
    Notes nvarchar(512) NULL
);

CREATE INDEX IX_NodeAcl_Generation_Cluster
    ON dbo.NodeAcl (GenerationId, ClusterId);
CREATE INDEX IX_NodeAcl_Generation_Group
    ON dbo.NodeAcl (GenerationId, LdapGroup);
CREATE INDEX IX_NodeAcl_Generation_Scope
    ON dbo.NodeAcl (GenerationId, ScopeKind, ScopeId) WHERE ScopeId IS NOT NULL;
CREATE UNIQUE INDEX UX_NodeAcl_Generation_LogicalId
    ON dbo.NodeAcl (GenerationId, NodeAclId);
-- Within a generation, a (Group, Scope) pair has at most one row
CREATE UNIQUE INDEX UX_NodeAcl_Generation_GroupScope
    ON dbo.NodeAcl (GenerationId, ClusterId, LdapGroup, ScopeKind, ScopeId);
```

`NodeAcl` is **generation-versioned** (decision #130). ACL changes go through draft → diff → publish → rollback like every other content table. Cross-generation invariant: a `NodeAclId` once published with `(LdapGroup, ScopeKind, ScopeId)` cannot have any of those columns change in a future generation; rename an LDAP group by disabling the old grant and creating a new one.

`PermissionFlags` is a bitmask of the `NodePermissions` enum defined in `acl-design.md` (Browse, Read, Subscribe, HistoryRead, WriteOperate, WriteTune, WriteConfigure, AlarmRead, AlarmAcknowledge, AlarmConfirm, AlarmShelve, MethodCall). Common bundles (`ReadOnly`, `Operator`, `Engineer`, `Admin`) expand to specific flag combinations at evaluation time.
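
A sketch of the bitmask and bundle expansion. The flag names come from the list above; the specific bit positions and bundle contents are illustrative assumptions — the authoritative definitions live in `acl-design.md`:

```python
from enum import IntFlag

class NodePermissions(IntFlag):
    # Bit positions are illustrative; the real values are defined in acl-design.md.
    Browse = 1 << 0
    Read = 1 << 1
    Subscribe = 1 << 2
    HistoryRead = 1 << 3
    WriteOperate = 1 << 4
    WriteTune = 1 << 5
    WriteConfigure = 1 << 6
    AlarmRead = 1 << 7
    AlarmAcknowledge = 1 << 8
    AlarmConfirm = 1 << 9
    AlarmShelve = 1 << 10
    MethodCall = 1 << 11

# Bundles expand to flag combinations at evaluation time (contents assumed here).
READ_ONLY = (NodePermissions.Browse | NodePermissions.Read | NodePermissions.Subscribe
             | NodePermissions.HistoryRead | NodePermissions.AlarmRead)
OPERATOR = READ_ONLY | NodePermissions.WriteOperate | NodePermissions.AlarmAcknowledge

def is_allowed(granted: NodePermissions, required: NodePermissions) -> bool:
    """Every required bit must be present in the granted mask."""
    return (granted & required) == required

assert is_allowed(OPERATOR, NodePermissions.WriteOperate)
assert not is_allowed(READ_ONLY, NodePermissions.WriteOperate)
```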

Validation in `sp_ValidateDraft`:
- `ScopeId` must resolve in the same generation when `ScopeKind ≠ 'Cluster'`
- Resolved scope must belong to the same `ClusterId` as the ACL row (cross-cluster bindings rejected, same pattern as decision #122)
- `PermissionFlags` must contain only bits defined in `NodePermissions`
- `LdapGroup` non-empty, ≤256 chars, allowlisted character set (no LDAP-DN-breaking chars)
- Cross-generation identity stability per the invariant above

Full evaluation algorithm + Admin UI design + v1-compatibility seed in `acl-design.md`.

### `ExternalIdReservation`

```sql
CREATE TABLE dbo.ExternalIdReservation (
    ReservationId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    Kind nvarchar(16) NOT NULL CHECK (Kind IN ('ZTag', 'SAPID')),
    Value nvarchar(64) NOT NULL,
    EquipmentUuid uniqueidentifier NOT NULL, -- which equipment owns this reservation, FOREVER
    ClusterId nvarchar(64) NOT NULL,         -- first cluster to publish this id
    FirstPublishedAt datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
    FirstPublishedBy nvarchar(128) NOT NULL,
    LastPublishedAt datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
    ReleasedAt datetime2(3) NULL, -- non-null when explicitly released by operator
    ReleasedBy nvarchar(128) NULL,
    ReleaseReason nvarchar(512) NULL
);

-- Active reservations (not released) MUST be unique per (Kind, Value)
CREATE UNIQUE INDEX UX_ExternalIdReservation_KindValue_Active
    ON dbo.ExternalIdReservation (Kind, Value)
    WHERE ReleasedAt IS NULL;

-- One Equipment can hold reservations for both ZTag and SAPID
CREATE INDEX IX_ExternalIdReservation_Equipment ON dbo.ExternalIdReservation (EquipmentUuid);
CREATE INDEX IX_ExternalIdReservation_KindValue ON dbo.ExternalIdReservation (Kind, Value);
```

`ExternalIdReservation` is **NOT generation-versioned** (revised after adversarial review 2026-04-17 finding #3). It exists outside the generation-publish flow specifically to provide rollback-safe identifier uniqueness — generation-versioned uniqueness alone fails because old generations and disabled equipment can hold the same external ID, allowing rollback or re-enable to silently reintroduce duplicates that corrupt downstream ERP/SAP joins.

**Lifecycle**:
- **Reserve on publish**: `sp_PublishGeneration` creates a reservation row for every `(Kind, Value, EquipmentUuid)` triple in the new generation that doesn't already have a reservation; updates `LastPublishedAt` for existing reservations.
- **Reject on conflict**: if a publish includes `(Kind = 'ZTag', Value = 'ABC')` for `EquipmentUuid = X` but an active reservation already binds `('ZTag', 'ABC')` to `EquipmentUuid = Y`, the publish fails with `BadDuplicateExternalIdentifier` and the offending row is named in the audit log.
- **Survive disable**: disabling an equipment (`Equipment.Enabled = 0` in a future generation) does NOT release the reservation. The ID stays bound to that EquipmentUuid until explicit operator release.
- **Survive rollback**: rollback to an old generation that includes the original `(Kind, Value, EquipmentUuid)` triple is permitted (the reservation already binds it correctly). Rollback through a state where the same value was bound to a different EquipmentUuid is rejected — the operator must explicitly release the conflicting reservation first.
- **Explicit release**: an operator can release a reservation via the Admin UI (FleetAdmin only); requires a reason; audit-logged. Released reservations stay in the table (`ReleasedAt` non-null) for audit; the unique index on `WHERE ReleasedAt IS NULL` allows the same value to be re-reserved by a different EquipmentUuid afterward.

This is the only safe way to express "ZTag and SAPID are fleet-wide unique forever, including under rollback and re-enable" without a generation-versioned schema constraint that can't see other generations.
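
The reserve-on-publish / reject-on-conflict rule reduces to an ownership check on active reservations. A minimal in-memory sketch (data shapes and the dict representation are illustrative; in the real system this is a filtered-unique-index insert inside `sp_PublishGeneration`):

```python
def check_publish(reservations: dict, kind: str, value: str, equipment_uuid: str) -> None:
    """reservations: {(kind, value): equipment_uuid} holding ACTIVE rows only.

    Reserves the (kind, value) pair for equipment_uuid, or raises when it is
    already bound to a different equipment.
    """
    owner = reservations.get((kind, value))
    if owner is not None and owner != equipment_uuid:
        raise ValueError("BadDuplicateExternalIdentifier")
    reservations[(kind, value)] = equipment_uuid  # reserve, or refresh LastPublishedAt

active = {}
check_publish(active, "ZTag", "ABC", "uuid-X")  # first publish reserves
check_publish(active, "ZTag", "ABC", "uuid-X")  # re-publish / rollback: same owner, OK
try:
    check_publish(active, "ZTag", "ABC", "uuid-Y")  # different equipment: rejected
    conflict = False
except ValueError:
    conflict = True
assert conflict
```

Releasing a reservation corresponds to deleting the key from `active` (setting `ReleasedAt` in the table), after which a different equipment may claim the value.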

### `Tag`

```sql
CREATE TABLE dbo.Tag (
    TagRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId bigint NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    TagId nvarchar(64) NOT NULL,
    DriverInstanceId nvarchar(64) NOT NULL,
    DeviceId nvarchar(64) NULL,    -- null for driver-scoped tags
    EquipmentId nvarchar(64) NULL, -- REQUIRED when driver is in Equipment-kind namespace;
                                   -- NULL when driver is in SystemPlatform-kind namespace.
                                   -- Cross-table constraint enforced by sp_ValidateDraft.
    Name nvarchar(128) NOT NULL,   -- signal name; level 6 in the Equipment ns
    FolderPath nvarchar(512) NULL, -- only used when EquipmentId IS NULL (SystemPlatform ns).
                                   -- Equipment provides the path otherwise.
    DataType nvarchar(32) NOT NULL, -- OPC UA built-in type name (Boolean, Int32, Float, etc.)
    AccessLevel nvarchar(16) NOT NULL CHECK (AccessLevel IN ('Read', 'ReadWrite')),
    WriteIdempotent bit NOT NULL DEFAULT 0,
    PollGroupId nvarchar(64) NULL,
    TagConfig nvarchar(max) NOT NULL CHECK (ISJSON(TagConfig) = 1)
);

CREATE INDEX IX_Tag_Generation_Driver_Device
    ON dbo.Tag (GenerationId, DriverInstanceId, DeviceId);
CREATE INDEX IX_Tag_Generation_Equipment
    ON dbo.Tag (GenerationId, EquipmentId) WHERE EquipmentId IS NOT NULL;
CREATE UNIQUE INDEX UX_Tag_Generation_LogicalId
    ON dbo.Tag (GenerationId, TagId);
-- Path uniqueness: in the Equipment ns the path is (EquipmentId, Name); in the SystemPlatform ns it's (DriverInstanceId, FolderPath, Name)
CREATE UNIQUE INDEX UX_Tag_Generation_EquipmentPath
    ON dbo.Tag (GenerationId, EquipmentId, Name) WHERE EquipmentId IS NOT NULL;
CREATE UNIQUE INDEX UX_Tag_Generation_FolderPath
    ON dbo.Tag (GenerationId, DriverInstanceId, FolderPath, Name) WHERE EquipmentId IS NULL;
```

**Path resolution at apply time**:
- If `EquipmentId IS NOT NULL` (Equipment-namespace tag): full path = `{Cluster.Enterprise}/{Cluster.Site}/{UnsArea.Name}/{UnsLine.Name}/{Equipment.Name}/{Tag.Name}`, with Area and Line resolved through the equipment's `UnsLineId` → `UnsArea` chain. `FolderPath` is ignored.
- If `EquipmentId IS NULL` (SystemPlatform-namespace tag): full path = `{FolderPath}/{Tag.Name}` exactly as v1 LmxOpcUa expressed it. No UNS rules apply.

`sp_ValidateDraft` enforces the EquipmentId-vs-namespace-kind invariant: if the tag's `DriverInstanceId` belongs to an Equipment-kind namespace, `EquipmentId` must be set; if it belongs to a SystemPlatform-kind namespace, `EquipmentId` must be null. A DB CHECK can't see across tables, so this check lives in the validator.

### `PollGroup`

```sql
CREATE TABLE dbo.PollGroup (
    PollGroupRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId bigint NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    PollGroupId nvarchar(64) NOT NULL,
    DriverInstanceId nvarchar(64) NOT NULL,
    Name nvarchar(128) NOT NULL,
    IntervalMs int NOT NULL CHECK (IntervalMs >= 50)
);

CREATE INDEX IX_PollGroup_Generation_Driver
    ON dbo.PollGroup (GenerationId, DriverInstanceId);
CREATE UNIQUE INDEX UX_PollGroup_Generation_LogicalId
    ON dbo.PollGroup (GenerationId, PollGroupId);
```

### `ClusterNodeGenerationState`

```sql
CREATE TABLE dbo.ClusterNodeGenerationState (
    NodeId nvarchar(64) NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES dbo.ClusterNode(NodeId),
    CurrentGenerationId bigint NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    LastAppliedAt datetime2(3) NULL,
    LastAppliedStatus nvarchar(16) NULL CHECK (LastAppliedStatus IN ('Applied', 'RolledBack', 'Failed', 'InProgress')),
    LastAppliedError nvarchar(2048) NULL,
    LastSeenAt datetime2(3) NULL -- updated on every poll, for liveness
);

CREATE INDEX IX_ClusterNodeGenerationState_Generation
    ON dbo.ClusterNodeGenerationState (CurrentGenerationId);
```

A 2-node cluster with both nodes on the same `CurrentGenerationId` is "converged"; nodes on different generations are "applying" or "diverged" — the Admin UI surfaces this directly.
|
||||
|
||||
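The convergence classification is trivial to state precisely; this Python sketch illustrates one plausible reading of it (the status names and the treatment of a not-yet-reporting node are assumptions, not the shipped logic):

```python
def cluster_convergence(node_generations: dict, published_generation: int) -> str:
    """Classify a cluster from each node's CurrentGenerationId (None = not yet reported)."""
    gens = set(node_generations.values())
    if gens == {published_generation}:
        return "Converged"
    if published_generation in gens or None in gens:
        return "Applying"   # at least one node mid-apply or not yet reporting
    return "Diverged"       # no node is on the published generation
```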
### `ConfigAuditLog`

```sql
CREATE TABLE dbo.ConfigAuditLog (
    AuditId bigint IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    Timestamp datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
    Principal nvarchar(128) NOT NULL, -- DB principal that performed the action
    EventType nvarchar(64) NOT NULL,  -- DraftCreated, DraftEdited, Published, RolledBack, NodeApplied, CredentialAdded, CredentialDisabled, ClusterCreated, NodeAdded, etc.
    ClusterId nvarchar(64) NULL,
    NodeId nvarchar(64) NULL,
    GenerationId bigint NULL,
    DetailsJson nvarchar(max) NULL CHECK (DetailsJson IS NULL OR ISJSON(DetailsJson) = 1)
);

CREATE INDEX IX_ConfigAuditLog_Cluster_Time
    ON dbo.ConfigAuditLog (ClusterId, Timestamp DESC);
CREATE INDEX IX_ConfigAuditLog_Generation
    ON dbo.ConfigAuditLog (GenerationId) WHERE GenerationId IS NOT NULL;
```

Append-only by convention: no UPDATE/DELETE permissions are granted to any application principal, enforced by the GRANT model below.

## Stored Procedures

All non-trivial DB access goes through stored procedures. Direct table SELECT/INSERT/UPDATE/DELETE is **not granted** to node or admin principals — only the procs are callable. This is the enforcement point for the authorization model.
### `sp_GetCurrentGenerationForCluster` (called by node)

```sql
-- @NodeId: passed by the calling node; verified against authenticated principal
-- @ClusterId: passed by the calling node; verified to match @NodeId's cluster
-- Returns: latest Published generation for the cluster, or NULL if none
CREATE PROCEDURE dbo.sp_GetCurrentGenerationForCluster
    @NodeId nvarchar(64),
    @ClusterId nvarchar(64)
AS
BEGIN
    SET NOCOUNT ON;

    -- 1. Authenticate: verify the calling principal is bound to @NodeId
    DECLARE @CallerPrincipal nvarchar(128) = SUSER_SNAME();
    IF NOT EXISTS (
        SELECT 1 FROM dbo.ClusterNodeCredential
        WHERE NodeId = @NodeId
          AND Value = @CallerPrincipal
          AND Enabled = 1
    )
    BEGIN
        RAISERROR('Unauthorized: caller %s is not bound to NodeId %s', 16, 1, @CallerPrincipal, @NodeId);
        RETURN;
    END

    -- 2. Authorize: verify @NodeId belongs to @ClusterId
    IF NOT EXISTS (
        SELECT 1 FROM dbo.ClusterNode
        WHERE NodeId = @NodeId AND ClusterId = @ClusterId AND Enabled = 1
    )
    BEGIN
        RAISERROR('Forbidden: NodeId %s does not belong to ClusterId %s', 16, 1, @NodeId, @ClusterId);
        RETURN;
    END

    -- 3. Return the latest Published generation
    SELECT TOP 1 GenerationId, PublishedAt, PublishedBy, Notes
    FROM dbo.ConfigGeneration
    WHERE ClusterId = @ClusterId AND Status = 'Published'
    ORDER BY GenerationId DESC;
END
```

Companion procs: `sp_GetGenerationContent` (returns the full generation rows for a given `GenerationId`, with the same auth checks) and `sp_RegisterNodeGenerationApplied` (the node reports back which generation it has now applied, plus status).
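Together, these three procs form the node's poll cycle. A sketch of that cycle (the `db` object and its method names are hypothetical stand-ins for the real data layer, not an API defined in this document):

```python
def poll_once(db, node_id: str, cluster_id: str, current_gen):
    """One config-poll cycle: fetch latest generation, apply if newer, report back."""
    latest = db.get_current_generation_for_cluster(node_id, cluster_id)  # sp_GetCurrentGenerationForCluster
    if latest is None or latest == current_gen:
        return current_gen                       # nothing to do; LastSeenAt still updates server-side
    content = db.get_generation_content(node_id, latest)                 # sp_GetGenerationContent
    try:
        db.apply_locally(content)                # swap drivers / address space to the new generation
        db.register_applied(node_id, latest, "Applied")                  # sp_RegisterNodeGenerationApplied
        return latest
    except Exception as exc:
        db.register_applied(node_id, latest, "Failed", error=str(exc))
        return current_gen
```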
### `sp_PublishGeneration` (called by Admin)

```sql
-- Atomic: validates the draft, reserves external identifiers, flips Status
CREATE PROCEDURE dbo.sp_PublishGeneration
    @ClusterId nvarchar(64),
    @DraftGenerationId bigint,
    @Notes nvarchar(1024) = NULL
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;
    BEGIN TRANSACTION;

    -- 1. Verify caller is an admin (separate authz check vs. node auth)
    -- 2. Validate Draft: FKs resolve, no orphan tags, JSON columns parse, identity invariants,
    --    same-cluster namespace bindings, ZTag/SAPID reservation pre-flight, etc.
    --    EXEC sp_ValidateDraft @DraftGenerationId; — raises on failure
    -- 3. RESERVE / RENEW external identifiers atomically with the publish.
    --    For each (Kind, Value, EquipmentUuid) triple in the draft's Equipment rows:
    --    - INSERT into ExternalIdReservation if no row matches (Kind, Value, EquipmentUuid)
    --      AND no active row matches (Kind, Value) — the latter would have been caught by
    --      sp_ValidateDraft, but is rechecked here under transaction lock to prevent a race.
    --    - UPDATE LastPublishedAt for any existing matching reservation.
    --    Roll back the whole publish if any reservation conflict surfaces under lock.
    MERGE dbo.ExternalIdReservation AS tgt
    USING (
        SELECT 'ZTag' AS Kind, ZTag AS Value, EquipmentUuid
        FROM dbo.Equipment
        WHERE GenerationId = @DraftGenerationId AND ZTag IS NOT NULL
        UNION ALL
        SELECT 'SAPID', SAPID, EquipmentUuid
        FROM dbo.Equipment
        WHERE GenerationId = @DraftGenerationId AND SAPID IS NOT NULL
    ) AS src
    ON tgt.Kind = src.Kind AND tgt.Value = src.Value AND tgt.EquipmentUuid = src.EquipmentUuid
    WHEN MATCHED THEN
        UPDATE SET LastPublishedAt = SYSUTCDATETIME()
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Kind, Value, EquipmentUuid, ClusterId, FirstPublishedBy, LastPublishedAt)
        VALUES (src.Kind, src.Value, src.EquipmentUuid, @ClusterId, SUSER_SNAME(), SYSUTCDATETIME());
    -- The unique index UX_ExternalIdReservation_KindValue_Active raises a duplicate-key violation
    -- if a different EquipmentUuid attempts to reserve the same active (Kind, Value).

    -- 4. Mark the previous Published generation as Superseded
    UPDATE dbo.ConfigGeneration
    SET Status = 'Superseded'
    WHERE ClusterId = @ClusterId AND Status = 'Published';

    -- 5. Promote the Draft to Published (Status guard keeps a non-draft from being re-flipped)
    UPDATE dbo.ConfigGeneration
    SET Status = 'Published',
        PublishedAt = SYSUTCDATETIME(),
        PublishedBy = SUSER_SNAME(),
        Notes = ISNULL(@Notes, Notes)
    WHERE GenerationId = @DraftGenerationId AND ClusterId = @ClusterId AND Status = 'Draft';

    -- 6. Audit log
    INSERT dbo.ConfigAuditLog (Principal, EventType, ClusterId, GenerationId)
    VALUES (SUSER_SNAME(), 'Published', @ClusterId, @DraftGenerationId);

    COMMIT;
END
```

**`sp_ReleaseExternalIdReservation`** (FleetAdmin only): an explicit operator action to release a reservation when equipment is permanently retired and its ZTag/SAPID needs to be reused by a different physical asset. Sets `ReleasedAt`, `ReleasedBy`, `ReleaseReason`. After release, the unique index's `WHERE ReleasedAt IS NULL` filter allows the same `(Kind, Value)` to be re-reserved by a different EquipmentUuid in a future publish. Audit-logged with `EventType = 'ExternalIdReleased'`.
### `sp_RollbackToGeneration` (called by Admin)

Creates a *new* Published generation by cloning rows from the target generation. The target stays in `Superseded` state; the new clone becomes `Published`. This way every state visible to nodes is an actual published generation, never a "rolled back to" pointer that's hard to reason about.
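The clone-on-rollback behavior can be illustrated with a small sketch (the data shapes are simplified for the example and are not the real schema):

```python
def rollback_to(history: list, target_id: int) -> list:
    """Roll back by cloning the target generation as a brand-new Published one."""
    target = next(g for g in history if g["id"] == target_id)
    for g in history:
        if g["status"] == "Published":
            g["status"] = "Superseded"   # the current head is retired, never deleted
    clone = {"id": max(g["id"] for g in history) + 1,
             "status": "Published",
             "rows": target["rows"]}     # content copied from the target generation
    return history + [clone]
```

Note that the target itself never flips back to `Published` — only the clone does, so generation IDs stay strictly increasing.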
### `sp_ValidateDraft` (called inside publish, also exposed for Admin preview)

Checks (existing):

- Every `Tag.DriverInstanceId` resolves
- Every `Tag.DeviceId` resolves to a `Device` whose `DriverInstanceId` matches the tag's
- Every `Tag.PollGroupId` resolves
- Every `Device.DriverInstanceId` resolves
- No duplicate `(GenerationId, DriverInstanceId, FolderPath, Name)` collisions for SystemPlatform-ns tags
- No duplicate `(GenerationId, EquipmentId, Name)` collisions for Equipment-ns tags
- Every JSON column parses; every `DriverConfig` matches its `DriverType`'s schema (per "JSON Column Conventions" below)

Checks (UNS / namespace integration):

- **Namespace exists in same generation**: `DriverInstance.NamespaceId` must resolve to a `Namespace` row in the same `GenerationId`
- **Same-cluster namespace binding** (revised after adversarial review finding #1): the resolved `Namespace.ClusterId` must equal `DriverInstance.ClusterId`. Cross-cluster bindings are rejected with `BadCrossClusterNamespaceBinding` and audit-logged as `EventType = 'CrossClusterNamespaceAttempt'`
- **Namespace identity stability across generations** (finding #2): for every `Namespace` row in the draft, if a row with the same `(NamespaceId, ClusterId)` exists in any prior generation, it must have the same `Kind` and the same `NamespaceUri`. NamespaceUri renames are forbidden — use a new NamespaceId
- **Driver type ↔ namespace kind**: every `DriverInstance.NamespaceId` must resolve to a `Namespace` whose `Kind` matches the allowed set for that `DriverType` (Galaxy → SystemPlatform; native-protocol drivers → Equipment; OpcUaClient → either)
- **Tag ↔ namespace kind**: if a tag's `DriverInstanceId` belongs to an Equipment-kind namespace, `EquipmentId` must be set; if SystemPlatform-kind, `EquipmentId` must be null
- **UnsArea / UnsLine / Equipment.Name segment validation**: each matches `^[a-z0-9-]{1,32}$` OR equals the literal `_default`
- **UnsLine.UnsAreaId resolves**: must reference a `UnsArea` row in the same generation; both must belong to the same cluster (via `UnsArea.ClusterId`)
- **Equipment.UnsLineId resolves**: must reference a `UnsLine` row in the same generation, and the area chain must trace to the same cluster as the equipment's driver
- **Equipment full-path length**: `LEN(Cluster.Enterprise) + LEN(Cluster.Site) + LEN(UnsArea.Name) + LEN(UnsLine.Name) + LEN(Equipment.Name) + 4` (slashes) ≤ 200
- **UnsArea/UnsLine logical-ID stability across generations**: once introduced, a `UnsAreaId` keeps the same identity across generations even if its `Name` changes. Same for `UnsLineId`. Renaming surfaces in the diff viewer; identity reuse with a different parent is rejected
- **EquipmentUuid immutability across generations**: for every `(EquipmentId, EquipmentUuid)` pair in the draft, no prior generation of this cluster has the same `EquipmentId` with a different `EquipmentUuid`. Once published, an EquipmentId's UUID is locked for the cluster's lifetime
- **EquipmentId belongs to the same cluster**: `Equipment.DriverInstanceId` must resolve to a `DriverInstance` whose `ClusterId` matches the draft's cluster
- **Equipment.DriverInstanceId namespace kind**: the equipment's driver must be in an Equipment-kind namespace
- **Cluster Enterprise/Site UNS segment validation**: same regex as Area/Line/Name (defense in depth — also enforced at cluster create time)
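The two purely lexical checks in the list above — segment regex and full-path length — reduce to a few lines; this Python sketch mirrors them directly (function names are illustrative):

```python
import re

_SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")

def valid_uns_segment(s: str) -> bool:
    """Area/Line/Equipment.Name rule: lowercase slug, or the literal '_default'."""
    return s == "_default" or bool(_SEGMENT.fullmatch(s))

def equipment_path_ok(enterprise: str, site: str, area: str, line: str, name: str) -> bool:
    """Full-path length check: five segments plus four slashes must fit in 200 chars."""
    return (len(enterprise) + len(site) + len(area)
            + len(line) + len(name) + 4) <= 200
```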
Checks (operator/external identifiers):

- **EquipmentId is system-generated** (revised after adversarial review finding #4): every `Equipment.EquipmentId` in the draft must match the canonical derivation `'EQ-' + LOWER(LEFT(REPLACE(CONVERT(nvarchar(36), EquipmentUuid), '-', ''), 12))`. Operator-supplied or modified IDs are rejected. CSV imports never carry an `EquipmentId` column — see the Admin UI workflow
- **EquipmentUuid required and stable**: `Equipment.EquipmentUuid` must be non-NULL on every row; once published with a given `(EquipmentId, EquipmentUuid)`, neither value can change in any future generation of the same cluster (cross-generation invariant)
- **MachineCode required and unique within cluster**: `Equipment.MachineCode` must be non-empty; uniqueness is checked across all equipment whose driver shares the same `ClusterId` in the same generation
- **ZTag/SAPID uniqueness via reservation table** (revised after adversarial review finding #3): per-generation, per-cluster checks are insufficient because old generations and disabled equipment can hold the same external IDs. Fleet-wide uniqueness is enforced by `ExternalIdReservation`:
  - The validator pre-flights every `(Kind, Value, EquipmentUuid)` triple in the draft against `ExternalIdReservation WHERE ReleasedAt IS NULL`
  - If a reservation exists for the same EquipmentUuid → OK (continuation)
  - If a reservation exists for a different EquipmentUuid → REJECT with `BadDuplicateExternalIdentifier`; the operator must explicitly release the conflicting reservation first
  - If no reservation exists → OK (`sp_PublishGeneration` creates one on commit)
- **Identifier free-text**: MachineCode/ZTag/SAPID are not subject to the UNS-segment regex (they're external system identifiers, not OPC UA path segments) — only required to be non-empty (when present) and ≤64 chars
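The canonical EquipmentId derivation and the reservation pre-flight both reduce to a few lines. This Python sketch mirrors the T-SQL expression above (the `active` mapping and status strings are illustrative):

```python
import uuid

def derive_equipment_id(equipment_uuid: uuid.UUID) -> str:
    """'EQ-' + first 12 hex chars of the UUID, lowercased (matches the T-SQL derivation)."""
    return "EQ-" + equipment_uuid.hex[:12].lower()

def preflight_reservation(active: dict, kind: str, value: str, equipment_uuid: str) -> str:
    """active maps (Kind, Value) -> EquipmentUuid for rows WHERE ReleasedAt IS NULL."""
    holder = active.get((kind, value))
    if holder is None:
        return "OK-new"              # sp_PublishGeneration creates the reservation on commit
    if holder == equipment_uuid:
        return "OK-continuation"     # same equipment keeps its identifier
    return "BadDuplicateExternalIdentifier"
```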
### `sp_ComputeGenerationDiff`

Returns the rows that differ between two generations — added, removed, and modified, per table. Used by the Admin UI's diff viewer and by the node's apply logic to decide what to surgically update without bouncing the whole driver instance.
## Authorization Model

### SQL principals

Two principal classes:

1. **Node principals** — one per `ClusterNode` (SQL login, gMSA, or cert-mapped user). Granted EXECUTE on `sp_GetCurrentGenerationForCluster`, `sp_GetGenerationContent`, and `sp_RegisterNodeGenerationApplied` only. No table SELECT.
2. **Admin principals** — granted to operator accounts. EXECUTE on all `sp_*` procs. No direct table access either, except read-only views for reporting (`vw_ClusterFleetStatus`, `vw_GenerationHistory`).

The `dbo` schema is owned by no application principal; only `db_owner` (DBA-managed) can change schema.

### Per-node binding enforcement

`sp_GetCurrentGenerationForCluster` uses `SUSER_SNAME()` to identify the calling principal and cross-checks it against `ClusterNodeCredential.Value`. A principal asking for another node's cluster gets a `RAISERROR` with HTTP-403-equivalent semantics (severity 16, state 1).

For `Authentication=ActiveDirectoryMsi` or cert-auth scenarios where `SUSER_SNAME()` returns the AD principal name or cert thumbprint, `ClusterNodeCredential.Value` stores the matching value. Multiple `Kind` values are supported, so a single deployment can mix gMSA and cert auth across different nodes.

### Defense-in-depth: SESSION_CONTEXT

After authentication, the caller-side connection wrapper sets `SESSION_CONTEXT` with `NodeId` and `ClusterId` to make audit logging trivial. The procs ignore client-asserted SESSION_CONTEXT values — they recompute from `SUSER_SNAME()` — but the audit log captures both, so any attempt to spoof shows up in the audit trail.

### Admin authn separation

The Admin UI authenticates operators via the LDAP layer described in `Security.md` (existing v1 LDAP authentication, reused). A successful LDAP bind maps to a SQL principal that has admin DB grants. Operators do not get direct DB credentials.
## JSON Column Conventions

`DriverConfig`, `DeviceConfig`, `TagConfig`, and `DriverConfigOverridesJson` are schemaless to the DB but **strictly schemaed by the application**. Each driver type registers a JSON schema in `Core.Abstractions.DriverTypeRegistry` describing the valid keys for its `DriverConfig`, `DeviceConfig`, and `TagConfig`. `sp_ValidateDraft` calls into managed code (a CLR-hosted validator or an external EF/.NET pre-publish step) to validate before the `Status` flip.

Examples of the per-driver shapes — full specs in `driver-specs.md`:

```jsonc
// DriverConfig for DriverType=Galaxy
{
  "MxAccess": { "ClientName": "OtOpcUa-Cluster1", "RequestTimeoutSeconds": 30 },
  "Database": { "ConnectionString": "Server=...;Database=ZB;...", "PollIntervalSeconds": 60 },
  "Historian": { "Enabled": false }
}

// DeviceConfig for DriverType=ModbusTcp
{
  "Host": "10.0.3.42",
  "Port": 502,
  "UnitId": 1,
  "ByteOrder": "BigEndianBigEndianWord",
  "AddressFormat": "Standard" // or "DL205"
}

// TagConfig for DriverType=ModbusTcp
{
  "RegisterType": "HoldingRegister",
  "Address": 100,
  "Length": 1,
  "Scaling": { "Multiplier": 0.1, "Offset": 0 }
}
```

The JSON schema lives in source so it versions with the driver; the DB doesn't carry per-type DDL.
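The application-side validation amounts to checking each config against its driver type's registered schema. A minimal sketch, assuming a trivial registry that only checks top-level keys (the real registry holds full JSON schemas; the key set here is a trimmed, hypothetical subset):

```python
REGISTRY = {"Galaxy": {"MxAccess", "Database", "Historian"}}   # hypothetical subset

def unknown_keys(driver_type: str, config: dict) -> set:
    """Return the config keys that the driver type's schema does not allow."""
    return set(config) - REGISTRY[driver_type]
```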
## Per-Node Override Merge Semantics

At config-apply time on a node:

1. The node fetches `DriverInstance` rows for the current generation and its `ClusterId`
2. The node fetches its own `ClusterNode.DriverConfigOverridesJson`
3. For each `DriverInstance`, the node parses `DriverConfig` (cluster-level), then walks the override JSON for that `DriverInstanceId`, applying each leaf-key override on top
4. The merge is **shallow at the leaf level** — the override key path locates the exact JSON node to replace. Arrays are replaced wholesale, not merged element-wise. If the override path doesn't exist in `DriverConfig`, the merge fails the apply step (loud failure beats silent drift).
5. The resulting JSON is the effective `DriverConfig` for this node, passed to the driver factory

Tags and devices are never overridden per-node. If you need a tag definition to differ between nodes, you have a different cluster — split it.
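The steps above can be sketched as a small leaf-path merger, using the dotted-path convention adopted under "Resolved Defaults" (escaping of `\.` and bracket indexing are omitted for brevity; this is illustrative, not the shipped merger):

```python
import copy

def apply_overrides(config: dict, overrides: dict) -> dict:
    """Apply dotted-path leaf overrides onto a parsed DriverConfig, failing loudly."""
    merged = copy.deepcopy(config)
    for path, value in overrides.items():
        node = merged
        *parents, leaf = path.split(".")
        for seg in parents:
            if not isinstance(node, dict) or seg not in node:
                raise KeyError(f"override path not in DriverConfig: {path}")
            node = node[seg]
        if not isinstance(node, dict) or leaf not in node:
            raise KeyError(f"override path not in DriverConfig: {path}")
        node[leaf] = value   # leaf replace; arrays would be replaced wholesale
    return merged
```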
## Local LiteDB Cache

Each node maintains a small LiteDB file (default `config_cache.db`) keyed by `GenerationId`. On startup, if the central DB is unreachable, the node loads the most recent cached generation and starts.

Schema (LiteDB collections):

| Collection | Purpose |
|------------|---------|
| `Generations` | Header rows (GenerationId, ClusterId, PublishedAt, Notes) |
| `DriverInstances` | Cluster-level driver definitions per generation |
| `Devices` | Per-driver devices |
| `Tags` | Per-driver/device tags |
| `PollGroups` | Per-driver poll groups |
| `NodeConfig` | This node's `ClusterNode` row + overrides JSON |

A node only ever caches its own cluster's generations. Cached generations beyond the most recent N (default 10) are pruned to bound disk usage.
## EF Core Migrations

The `Configuration` project (per `plan.md` §5) owns the schema, with EF Core code-first migrations under `Configuration/Migrations/`. Every migration ships with:

- The forward `Up()` and reverse `Down()` operations
- A schema-validation test that runs the migration against a clean DB and verifies that indexes, constraints, and stored procedures match the expected DDL
- A data-fixture test that seeds a minimal cluster + node + generation and exercises `sp_GetCurrentGenerationForCluster` end-to-end

Stored procedures are managed via `MigrationBuilder.Sql()` blocks (idempotent CREATE OR ALTER style) so they version with the schema rather than as separate DDL artifacts.
## Indexes — Hot Paths Summary

| Path | Index |
|------|-------|
| Node poll: "latest published generation for my cluster" | `IX_ConfigGeneration_Cluster_Published` |
| Node fetch generation content | Per-table `(GenerationId, ...)` indexes |
| Admin: list clusters by site | `IX_ServerCluster_Site` |
| Admin: list generations per cluster | `IX_ConfigGeneration_Cluster_Published` (covers all statuses via DESC scan) |
| Admin: who's on which generation | `IX_ClusterNodeGenerationState_Generation` |
| Admin / driver build: list equipment for a driver | `IX_Equipment_Generation_Driver` |
| Admin / driver build: list equipment for a UNS line | `IX_Equipment_Generation_Line` |
| Admin / driver build: lookup equipment by UNS path within line | `UX_Equipment_Generation_LinePath` |
| Admin: list lines for a UNS area | `IX_UnsLine_Generation_Area` |
| Admin: list areas for a cluster | `IX_UnsArea_Generation_Cluster` |
| Admin: equipment search by ZTag (primary browse identifier) | `UX_Equipment_Generation_ZTag` |
| Admin: equipment search by SAPID | `UX_Equipment_Generation_SAPID` |
| Admin: equipment search by MachineCode (cluster-scoped) | `IX_Equipment_Generation_MachineCode` |
| Tag fetch by equipment (address-space build) | `IX_Tag_Generation_Equipment` |
| Tag fetch by driver (SystemPlatform-ns address-space build) | `IX_Tag_Generation_Driver_Device` |
| Cross-generation UUID immutability check | `UX_Equipment_Generation_Uuid` (per-gen scan combined with prior-gen lookup) |
| Driver fetch by namespace | `IX_DriverInstance_Generation_Namespace` |
| Same-cluster namespace validation | `UX_Namespace_Generation_LogicalId_Cluster` |
| Namespace fetch for cluster | `IX_Namespace_Generation_Cluster` |
| External-ID reservation lookup at publish | `UX_ExternalIdReservation_KindValue_Active` |
| External-ID reservation by equipment | `IX_ExternalIdReservation_Equipment` |
| Audit query: cluster history | `IX_ConfigAuditLog_Cluster_Time` |
| Auth check on every node poll | `IX_ClusterNodeCredential_NodeId` |
## Backup, Retention, and Operational Concerns

- **Generations are never deleted** (per decision #58). Storage cost is small — even at one publish per day per cluster, a 50-cluster fleet generates ~18k generations/year with average row counts in the hundreds. Total at full v2 fleet scale: well under 10 GB/year.
- **Backup**: standard SQL Server full + differential + log backups. Point-in-time restore covers operator-mistake recovery (rolled back the wrong generation, etc.).
- **Audit log retention**: 7 years by default, partitioned by year for cheap pruning if a customer requires shorter retention.
- **Connection pooling**: each OtOpcUa node holds a pooled connection; the Admin UI uses standard EF DbContext pooling.
## Decisions / Open Questions

**Decided** (captured in the `plan.md` decision log):

- Cluster-scoped generations (#82)
- Per-node credential binding (#83)
- Both nodes apply independently with brief divergence acceptable (#84)
- ApplicationUri unique fleet-wide, never auto-rewritten (#86)
- All new tables (#79, #80)

**Resolved Defaults**:

- **JSON validation: external (in the Admin app), not CLR-hosted.** Requiring CLR on the SQL Server is an operational tax (CLR is disabled by default on hardened DB instances, and many DBAs refuse to enable it). The Admin app validates draft content against the per-driver JSON schemas before calling `sp_PublishGeneration`; the proc enforces structural integrity (FKs, uniqueness, JSON parseability via `ISJSON`) but trusts the caller for content schema. Direct proc invocation outside the Admin app is already prevented by the GRANT model — only admin principals can publish.
- **Dotted JSON path syntax for `DriverConfigOverridesJson`.** Example: `"MxAccess.ClientName"`, not `"/MxAccess/ClientName"`. Dotted is more readable in operator-facing UI and CSV exports. Reserved chars: a literal `.` in a key segment is escaped as `\.`; a literal `\` is escaped as `\\`. Array indexing uses bracket form: `Items[0].Name`. Documented inline in the override editor's help text.
- **`sp_PurgeGenerationsBefore` proc deferred to v2.1.** The initial release ships with "keep all generations forever" (decision #58). The purge proc is shaped now so we don't have to re-think it later: the signature `sp_PurgeGenerationsBefore(@ClusterId, @CutoffGenerationId, @ConfirmToken)` requires an Admin-supplied confirmation token (random hex shown in the UI) to prevent script-based mass deletion; deletes are cascaded via per-table `WHERE GenerationId IN (...)`; an audit log entry records the principal, the cutoff, and the row counts deleted. Surface in v2.1 only when a customer compliance ask demands it.
276
docs/v2/dev-environment.md
Normal file
@@ -0,0 +1,276 @@
# Development Environment — OtOpcUa v2

> **Status**: DRAFT — concrete inventory + setup plan for every external resource the v2 build needs. Companion to `test-data-sources.md` (which catalogues the simulator/stub strategy per driver) and `implementation/overview.md` (which references the dev environment in entry-gate checklists).
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Scope

Every external resource a developer needs on their machine, plus the dedicated integration host that runs the heavier simulators per CI tiering decision #99. Includes Docker container images, ports, default credentials (dev only — production overrides documented), and ownership.

**Not in scope here**: production deployment topology (separate doc when v2 ships), CI pipeline configuration (separate ops concern), and individual developers' IDE / editor preferences.

## Two Environment Tiers

Per decision #99:

| Tier | Purpose | Where it runs | Resources |
|------|---------|---------------|-----------|
| **PR-CI / inner-loop dev** | Fast; runs on minimal Windows + Linux build agents and developer laptops | Each developer's machine; CI runners | Pure-managed in-process simulators (NModbus, OPC Foundation reference server, FOCAS TCP stub from the test project). No Docker, no VMs. |
| **Nightly / integration CI** | Full driver-stack validation against real wire protocols | One dedicated Windows host with Docker Desktop + Hyper-V + a TwinCAT XAR VM | All Docker simulators (`oitc/modbus-server`, `ab_server`, Snap7), TwinCAT XAR VM, Galaxy.Host installer + dev Galaxy access, FOCAS TCP stub binary, FOCAS FaultShim assembly |

The tier split keeps developer onboarding fast (no Docker required for the first build) while concentrating the heavy simulator setup on one machine the team maintains.
## Resource Inventory

### A. Always-required (every developer + integration host)

| Resource | Purpose | Type | Default port | Default credentials | Owner |
|----------|---------|------|--------------|---------------------|-------|
| **.NET 10 SDK** | Build all .NET 10 x64 projects | OS install | n/a | n/a | Developer |
| **.NET Framework 4.8 SDK + targeting pack** | Build `Driver.Galaxy.Host` (Phase 2+) | Windows install | n/a | n/a | Developer |
| **Visual Studio 2022 17.8+ or Rider 2024+** | IDE (any C# IDE works; these are the supported configs) | OS install | n/a | n/a | Developer |
| **Git** | Source control | OS install | n/a | n/a | Developer |
| **PowerShell 7.4+** | Compliance scripts (`phase-N-compliance.ps1`) | OS install | n/a | n/a | Developer |
| **Repo clones** | `lmxopcua` (this repo), `scadalink-design` (UI/auth reference per memory file `scadalink_reference.md`), `3yearplan` (handoff + corrections) | Git clone | n/a | n/a | Developer |

### B. Inner-loop dev (developer machines + PR-CI)

| Resource | Purpose | Type | Default port | Default credentials | Owner |
|----------|---------|------|--------------|---------------------|-------|
| **SQL Server 2022 dev edition** | Central config DB; integration tests against the `Configuration` project | Local install OR Docker container `mcr.microsoft.com/mssql/server:2022-latest` | 1433 | `sa` / `OtOpcUaDev_2026!` (dev only — production uses Integrated Security or gMSA per decision #46) | Developer (per machine) |
| **GLAuth (LDAP server)** | Admin UI authentication tests; data-path ACL evaluation tests | Local binary at `C:\publish\glauth\` per existing CLAUDE.md | 3893 (LDAP) / 3894 (LDAPS) | Service principal: `cn=admin,dc=otopcua,dc=local` / `OtOpcUaDev_2026!`; test users defined in the GLAuth config | Developer (per machine) |
| **Local dev Galaxy** (Aveva System Platform) | Galaxy driver tests; v1 IntegrationTests parity | Existing on dev box per CLAUDE.md | n/a (local COM) | Windows Auth | Developer (already present per project setup) |
### C. Integration host (one dedicated Windows machine the team shares)

| Resource | Purpose | Type | Default port | Default credentials | Owner |
|----------|---------|------|--------------|---------------------|-------|
| **Docker Desktop for Windows** | Host for containerized simulators | Install | n/a (Hyper-V required; not compatible with the TwinCAT runtime — see the TwinCAT row below for the workaround) | n/a | Integration host admin |
| **`oitc/modbus-server`** | Modbus TCP simulator (per `test-data-sources.md` §1) | Docker container | 502 (Modbus TCP) | n/a (no auth in protocol) | Integration host admin |
| **`ab_server`** (libplctag binary) | AB CIP + AB Legacy simulator (per `test-data-sources.md` §2 + §3) | Native binary built from libplctag source; runs in a separate VM or host since it conflicts with Docker Desktop's Hyper-V if run on bare metal | 44818 (CIP) | n/a | Integration host admin |
| **Snap7 Server** | S7 simulator (per `test-data-sources.md` §4) | Native binary; runs in a separate VM or in WSL2 to avoid the Hyper-V conflict | 102 (ISO-TCP) | n/a | Integration host admin |
| **TwinCAT XAR runtime VM** | TwinCAT ADS testing (per `test-data-sources.md` §5; Beckhoff XAR cannot coexist with Hyper-V on the same OS) | Hyper-V VM with Windows + TwinCAT XAR installed under the 7-day renewable trial | 48898 (ADS over TCP) | TwinCAT default route credentials configured per Beckhoff docs | Integration host admin |
| **OPC Foundation reference server** | OPC UA Client driver test source (per `test-data-sources.md` §"OPC UA Client") | Built from the `OPCFoundation/UA-.NETStandard` `ConsoleReferenceServer` project | 62541 (default for the reference server) | Anonymous + Username (`user1` / `password1`) per the reference server's built-in user list | Integration host admin |
| **FOCAS TCP stub** (`Driver.Focas.TestStub`) | FOCAS functional testing (per `test-data-sources.md` §6) | Local .NET 10 console app from this repo | 8193 (FOCAS) | n/a | Developer / integration host (run on demand) |
| **FOCAS FaultShim** (`Driver.Focas.FaultShim`) | FOCAS native-fault injection (per `test-data-sources.md` §6) | Test-only native DLL named `Fwlib64.dll`, loaded via the DLL search path in the test fixture | n/a (in-process) | n/a | Developer / integration host (test-only) |

### D. Cloud / external services

| Resource | Purpose | Type | Access | Owner |
|----------|---------|------|--------|-------|
| **Gitea** at `gitea.dohertylan.com` | Hosts the `lmxopcua`, `3yearplan`, and `scadalink-design` repos | HTTPS git | Existing org credentials | Org IT |
| **Anthropic API** (for Codex adversarial reviews) | `/codex:adversarial-review` invocations during exit gates | HTTPS via the Codex companion script | API key in the developer's `~/.claude/...` config | Developer (per `codex:setup` skill) |
## Network Topology (integration host)

```
┌────────────────────────────────────────┐
│ Integration Host (Windows + Docker)    │
│                                        │
│ Docker Desktop (Linux containers):     │
│  ┌───────────────────────────────┐     │
│  │ oitc/modbus-server :502/tcp   │     │
│  └───────────────────────────────┘     │
│                                        │
│ WSL2 (Snap7 + ab_server, separate      │
│ from Docker Desktop's HyperV):         │
│  ┌───────────────────────────────┐     │
│  │ snap7-server :102/tcp         │     │
│  │ ab_server    :44818/tcp       │     │
│  └───────────────────────────────┘     │
│                                        │
│ Hyper-V VM (Windows + TwinCAT XAR):    │
│  ┌───────────────────────────────┐     │
│  │ TwinCAT XAR :48898            │     │
│  └───────────────────────────────┘     │
│                                        │
│ Native processes:                      │
│  ┌───────────────────────────────┐     │
│  │ ConsoleReferenceServer :62541 │     │
│  │ FOCAS TestStub         :8193  │     │
│  └───────────────────────────────┘     │
│                                        │
│ SQL Server 2022 (local install):       │
│  ┌───────────────────────────────┐     │
│  │ OtOpcUaConfig_Test :1433      │     │
│  └───────────────────────────────┘     │
└────────────────────────────────────────┘
                    ▲
                    │ tests connect via the host's hostname or 127.0.0.1
                    │
┌────────────────────────────────────────┐
│ Developer / CI machine running         │
│ `dotnet test --filter Category=...`    │
└────────────────────────────────────────┘
```


## Bootstrap Order — Inner-loop Developer Machine

Order matters because some installs have prerequisites. ~30–60 min total on a fresh machine.

1. **Install .NET 10 SDK** (https://dotnet.microsoft.com/) — required to build anything
2. **Install .NET Framework 4.8 SDK + targeting pack** — only needed when starting Phase 2 (Galaxy.Host); skip until then
3. **Install Git + PowerShell 7.4+**
4. **Clone repos**:
   ```powershell
   git clone https://gitea.dohertylan.com/dohertj2/lmxopcua.git
   git clone https://gitea.dohertylan.com/dohertj2/scadalink-design.git
   git clone https://gitea.dohertylan.com/dohertj2/3yearplan.git
   ```
5. **Install SQL Server 2022 dev edition** (local install) OR start the Docker container (see Resource B):
   ```powershell
   docker run --name otopcua-mssql -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=OtOpcUaDev_2026!" `
     -p 1433:1433 -d mcr.microsoft.com/mssql/server:2022-latest
   ```
6. **Install GLAuth** at `C:\publish\glauth\` per existing CLAUDE.md instructions; populate `glauth-otopcua.cfg` with the test users + groups (template in `docs/v2/dev-environment-glauth-config.md` — to be added in the setup task)
7. **Run `dotnet restore`** in the `lmxopcua` repo
8. **Run `dotnet build ZB.MOM.WW.OtOpcUa.slnx`** (post-Phase-0) or `ZB.MOM.WW.LmxOpcUa.slnx` (pre-Phase-0) — verifies the toolchain
9. **Run `dotnet test`** with the inner-loop filter — should pass on a fresh machine

## Bootstrap Order — Integration Host

Order matters more here because of Hyper-V conflicts. ~half-day on a fresh machine.

1. **Install Windows Server 2022 or Windows 11 Pro** (Hyper-V capable)
2. **Enable Hyper-V** + WSL2
3. **Install Docker Desktop for Windows**, configure to use WSL2 backend (NOT Hyper-V backend — leaves Hyper-V free for the TwinCAT XAR VM)
4. **Set up WSL2 distro** (Ubuntu 22.04 LTS) for native Linux binaries that conflict with Docker Desktop
5. **Pull / start Modbus simulator**:
   ```powershell
   docker run -d --name modbus-sim -p 502:502 -v ${PWD}/modbus-config.yaml:/server_config.yaml oitc/modbus-server
   ```
6. **Build + start ab_server** (in WSL2):
   ```bash
   git clone https://github.com/libplctag/libplctag
   cd libplctag/src/tests
   make ab_server
   ./ab_server --plc=ControlLogix --port=44818   # default tags loaded from a config file
   ```
7. **Build + start Snap7 Server** (in WSL2):
   - Download Snap7 from https://snap7.sourceforge.net/
   - Build the example server; run on port 102 with the test DB layout from `test-data-sources.md` §4
8. **Set up TwinCAT XAR VM**:
   - Create a Hyper-V VM (Gen 2, Windows 11)
   - Install TwinCAT 3 XAE + XAR (download from Beckhoff, free for dev/test)
   - Activate the 7-day trial; document the rotation schedule
   - Configure ADS routes for the integration host to reach the VM
   - Deploy the test PLC project from `test-data-sources.md` §5 ("a tiny test project — `MAIN` (PLC code) + `GVL`")
9. **Build + start OPC Foundation reference server**:
   ```bash
   git clone https://github.com/OPCFoundation/UA-.NETStandard
   cd UA-.NETStandard/Applications/ConsoleReferenceServer
   dotnet run --port 62541
   ```
10. **Install SQL Server 2022 dev edition** (or run the Docker container as on developer machines)
11. **Build + run FOCAS TestStub** (from this repo, post-Phase-5):
    ```powershell
    dotnet run --project src/ZB.MOM.WW.OtOpcUa.Driver.Focas.TestStub -- --port 8193
    ```
12. **Verify** by running `dotnet test --filter Category=Integration` from a developer machine pointed at the integration host

## Credential Management

### Dev environment defaults

The defaults in this doc are **for dev environments only**. They're documented here so a developer can stand up a working setup without hunting; they're not secret.

### Production overrides

For any production deployment:

- SQL Server: Integrated Security with gMSA (decision #46) — never a SQL login with a shared password
- LDAP: production GLAuth or AD instance with a proper service principal
- TwinCAT: paid license (per-runtime), not the 7-day trial
- All other services: deployment team's credential management process; documented in deployment-guide.md (separate doc, post-v2.0)

### Storage

For dev defaults:

- SQL Server SA password: stored in each developer's local `appsettings.Development.json` (gitignored)
- GLAuth bind DN/password: stored in `glauth-otopcua.cfg` (gitignored)
- Docker secrets / volumes: developer-local

For production:

- gMSA / cert-mapped principals — no passwords stored anywhere
- Per-NodeId credentials in the `ClusterNodeCredential` table (per decision #83)
- Admin app uses LDAP (no SQL credential at all on the user-facing side)

## Test Data Seed

Each environment needs a baseline data set so cross-developer tests are reproducible. Lives in `tests/ZB.MOM.WW.OtOpcUa.IntegrationTests/SeedData/`:

- **GLAuth** users: `test-readonly@otopcua.local` (in `OtOpcUaReadOnly`), `test-operator@otopcua.local` (`OtOpcUaWriteOperate` + `OtOpcUaAlarmAck`), `test-fleetadmin@otopcua.local` (`OtOpcUaAdmins`)
- **Central config DB**: a seed cluster `TEST-CLUSTER-01` with 1 node + 1 namespace + 0 drivers (other tests add drivers)
- **Modbus sim**: YAML config preloading the addresses from `test-data-sources.md` §1 (HR 0–9 constants, ramp at HR 100, etc.)
- **TwinCAT XAR**: the test PLC project deployed; symbols match `test-data-sources.md` §5
- **OPC Foundation reference server**: starts with the built-in test address space; tests don't modify it

Seeds are idempotent (re-runnable) and gitignored where they contain credentials.

## Setup Plan (executable)

### Step 1 — Inner-loop dev environment (each developer, ~1 day with documentation)

**Owner**: developer
**Prerequisite**: Bootstrap order steps 1–9 above
**Acceptance**:

- `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes
- A test that touches the central config DB succeeds (proves SQL Server reachable)
- A test that authenticates against GLAuth succeeds (proves LDAP reachable)

### Step 2 — Integration host (one-time, ~1 week)

**Owner**: DevOps lead
**Prerequisite**: dedicated Windows machine, hardware specs ≥ 8 cores / 32 GB RAM / 500 GB SSD
**Acceptance**:

- Each simulator (Modbus, AB, S7, TwinCAT, OPC UA reference) responds to a probe from a developer machine
- A nightly CI job runs `dotnet test --filter Category=Integration` against the integration host and passes
- Service-account permissions reviewed by the security lead

### Step 3 — TwinCAT XAR VM trial rotation automation (one-time, half-day)

**Owner**: DevOps lead
**Prerequisite**: Step 2 complete
**Acceptance**:

- A scheduled task on the integration host either re-activates the 7-day trial automatically OR alerts the team 24 h before expiry; cycle tested

### Step 4 — Per-developer GLAuth config sync (recurring, when test users change)

**Owner**: developer (each)
**Acceptance**:

- A script in the repo (`scripts/sync-glauth-dev-config.ps1`) updates the local GLAuth config from a template; documented in CLAUDE.md
- Test users defined in the template work on every developer machine

### Step 5 — Docker simulator config (per-developer, ~30 min)

**Owner**: developer (each)
**Acceptance**:

- The Modbus simulator container is reachable at `127.0.0.1:502` from the developer's test runner (only needed if the developer is debugging Modbus driver work; not required for Phase 0/1)

### Step 6 — Codex companion setup (per-developer, ~5 min)

**Owner**: developer (each)
**Acceptance**:

- `/codex:setup` skill confirms readiness; `/codex:adversarial-review` works against a small test diff
## Operational Risks

| Risk | Mitigation |
|------|------------|
| TwinCAT 7-day trial expires mid-CI run | Step 3 automation; alert before expiry; license budget approved as fallback for production-grade pre-release validation |
| Docker Desktop license terms change for org use | Track Docker pricing; budget approved or fall back to Podman if license becomes blocking |
| Integration host single point of failure | Document the setup so a second host can be provisioned in <2 days; test fixtures pin to a hostname so failover changes one DNS entry |
| GLAuth dev config drifts between developers | Sync script + template (Step 4) keep configs aligned; periodic review |
| Galaxy / MXAccess licensing for non-dev machines | Galaxy stays on the dev machines that already have Aveva licenses; the integration host does NOT run Galaxy (Galaxy.Host integration tests run on the dev box, not the shared host) |
| Long-lived dev credentials in `appsettings.Development.json` | Gitignored; documented as dev-only; production never uses these |

## Decisions to Add to plan.md

| # | Decision | Rationale |
|---|----------|-----------|
| 133 | Two-tier dev environment: inner-loop (in-process simulators on developer machines) + integration (Docker / VM / native simulators on a single dedicated Windows host) | Per decision #99. Concrete inventory + setup plan in `dev-environment.md` |
| 134 | Docker Desktop with WSL2 backend (not Hyper-V backend) on integration host so TwinCAT XAR VM can run in Hyper-V alongside Docker | TwinCAT runtime cannot coexist with Hyper-V-mode Docker Desktop; WSL2 backend leaves Hyper-V free for the XAR VM. Documented operational constraint |
| 135 | TwinCAT XAR runs only in a dedicated VM on the integration host; developer machines do NOT run XAR locally | The 7-day trial reactivation needs centralized management; the VM is shared infrastructure |
| 136 | Galaxy / MXAccess testing happens on developer machines that have local Aveva installs, NOT on the shared integration host | Aveva licensing scoped to dev workstations; integration host doesn't carry the license. v1 IntegrationTests parity (Phase 2) runs on developer boxes |
| 137 | Dev env credentials are documented openly in `dev-environment.md`; production credentials use Integrated Security / gMSA per decision #46 | Dev defaults are not secrets; they're convenience. Production never uses these values |

1030
docs/v2/driver-specs.md
Normal file
File diff suppressed because it is too large

495
docs/v2/driver-stability.md
Normal file
@@ -0,0 +1,495 @@

# Driver Stability & Isolation — OtOpcUa v2

> **Status**: DRAFT — companion to `plan.md`. Defines the stability tier model, per-driver hosting decisions, cross-cutting protections every driver process must apply, and the canonical worked example (FOCAS) for the high-risk tier.
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Problem Statement

The v2 plan adds eight drivers spanning pure managed code (Modbus, OPC UA Client), wrapped C libraries (libplctag for AB CIP/Legacy, S7netplus for Siemens, Beckhoff.TwinCAT.Ads for ADS), heavy native/COM with thread affinity (Galaxy MXAccess), and black-box vendor DLLs (FANUC `Fwlib64.dll` for FOCAS).

These do not all carry the same failure profile, but the v1 plan treats them uniformly: every driver runs in-process in the .NET 10 server except Galaxy (isolated only because of its 32-bit COM constraint). This means:

- An `AccessViolationException` from `Fwlib64.dll` — **uncatchable** by managed code in modern .NET — tears down the whole OPC UA server, all subscriptions, and every other driver with it.
- A native handle leak (FOCAS `cnc_allclibhndl3` not paired with `cnc_freelibhndl`, or libplctag tag handles not freed) accumulates against the *server* process, not the driver.
- A thread-affinity bug (calling Fwlib on two threads against the same handle) corrupts state for every other driver sharing the process.
- Polly's circuit breaker handles transient *errors*; it does nothing for *process death* or *resource exhaustion*.

Driver stability needs to be a first-class architectural concern, not a per-driver afterthought.

---

## Stability Tier Model

Every driver is assigned to one of three tiers based on the trust level of its dependency stack:

### Tier A — Pure Managed

Drivers whose entire dependency chain is verifiable .NET. Standard exception handling and Polly are sufficient. Run in-process in the main server.

| Driver | Stack | Notes |
|--------|-------|-------|
| Modbus TCP | NModbus (pure managed) | Sockets only |
| OPC UA Client | OPC Foundation .NETStandard SDK (pure managed) | Reference-grade SDK |

### Tier B — Wrapped Native, Mature

Drivers that P/Invoke into a mature, well-maintained native library, or use a managed wrapper that has limited native bits (router, transport). Run in-process **with the cross-cutting protections from §3 mandatory**: SafeHandle for every native resource, memory watchdog, bounded queues. Any driver in this tier may be promoted to Tier C if production data shows leaks or crashes.

| Driver | Stack | Notes |
|--------|-------|-------|
| Siemens S7 | S7netplus (mostly managed) | Sockets + small native helpers |
| AB CIP | libplctag (C library via P/Invoke) | Mature, widely deployed; manages its own threads |
| AB Legacy | libplctag (same as CIP) | Same library, different protocol mode |
| TwinCAT | Beckhoff.TwinCAT.Ads v6 + AmsTcpIpRouter | Mostly managed; native callback pump for ADS notifications |

### Tier C — Heavy Native / COM / Thread-Affinity

Drivers whose dependency is a black-box vendor DLL, COM object with apartment requirements, or any code where a fault is likely uncatchable. **Run as a separate Windows service** behind the Galaxy.Proxy/Host/Shared pattern. A crash isolates to that driver's process; the main server fans out Bad quality on the affected nodes and respawns the host.

| Driver | Stack | Reason for Tier C |
|--------|-------|-------------------|
| Galaxy | MXAccess COM (.NET 4.8 x86) | Bitness mismatch + COM/STA + long history of native quirks |
| FOCAS | `Fwlib64.dll` P/Invoke | Black-box vendor DLL, handle-affinity, thread-unsafe per handle, no public SLA |

---

## Cross-Cutting Protections

Two distinct protection sets, **scoped by hosting mode** rather than applied uniformly. This split exists because process-level signals (RSS watchdog, recycle, kill) act on a *process*, not a driver — applying them in the shared server process would let a leak in one in-proc driver knock out every other driver, every session, and the OPC UA endpoint. That contradicts the v2 isolation invariant. Process-level protections therefore apply **only to isolated host processes** (Tier C); in-process drivers (Tier A/B) get a different set of guards that operate at the driver-instance level.

### Universal — apply to every driver regardless of tier

#### SafeHandle for every native resource

Every native handle (a FOCAS library handle from `cnc_allclibhndl3`, a libplctag tag handle, a COM IUnknown ref, OS file/socket handles we pass through P/Invoke) is wrapped in a `SafeHandle` subclass with a finalizer that calls the release function. This guarantees release even when:

- The owning thread crashes
- A `using` block is bypassed by an exception we forgot to catch
- The driver host process is shutting down ungracefully

`Marshal.ReleaseComObject` calls go through `CriticalFinalizerObject` to honor finalizer ordering during AppDomain unload.
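
To make the guarantee concrete, here is a minimal Python sketch of the same pattern — a finalizer-backed wrapper that releases its handle whether or not the owner remembers to close it. The production code is a C# `SafeHandle` subclass; `NativeHandleWrapper` and the list-based `released` recorder are illustrative stand-ins for a real release function such as `cnc_freelibhndl`.

```python
import gc
import weakref

class NativeHandleWrapper:
    """Illustrative analog of a SafeHandle subclass: the release function
    runs even if the owner never calls close()."""
    def __init__(self, raw_handle, release_fn):
        self._raw = raw_handle
        self._release_fn = release_fn
        # Plays the role of the SafeHandle finalizer: fires at collection
        # or interpreter exit if close() was never called.
        self._finalizer = weakref.finalize(self, release_fn, raw_handle)

    def close(self):
        # Explicit, idempotent release (the Dispose / `using` path);
        # detach() guarantees the finalizer cannot release a second time.
        if self._finalizer.detach() is not None:
            self._release_fn(self._raw)

released = []
h = NativeHandleWrapper(42, released.append)
h.close()
h.close()                # second close is a no-op
assert released == [42]

# Leak path: wrapper dropped without close() — the finalizer still releases.
h2 = NativeHandleWrapper(99, released.append)
del h2
gc.collect()
assert released == [42, 99]
```

The point the sketch makes is the dual path: explicit disposal for the happy path, a finalizer as the backstop for every crash and bypass scenario listed above.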

#### Bounded operation queues (per device, per driver instance)

Every driver-instance/device pairing has a bounded outgoing-operation queue (default 1000 entries). When the queue is full, new operations fail fast with `BadResourceUnavailable` rather than backing up unboundedly against a slow or dead device. Polly's circuit breaker also opens, surfacing the device-down state to the dashboard.

This prevents the canonical "device went offline → reads pile up → driver eats all RAM" failure mode. Crucially, it operates **per device** in the in-process case so one stuck device cannot starve another driver's queue or accumulate against the shared server's heap.
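
A sketch of the fail-fast behavior, assuming a simple synchronous queue (the real queue is per device inside each C# driver instance; the class and method names here are illustrative):

```python
from collections import deque

class BoundedOpQueue:
    """Per-device outgoing-operation queue: a full queue fails fast instead
    of backing up unboundedly against a slow or dead device."""
    def __init__(self, capacity=1000):
        self._capacity = capacity
        self._ops = deque()

    def try_enqueue(self, op):
        if len(self._ops) >= self._capacity:
            # Caller sees the OPC UA status code; the circuit breaker opens.
            return "BadResourceUnavailable"
        self._ops.append(op)
        return "Good"

q = BoundedOpQueue(capacity=2)
results = [q.try_enqueue(f"read-{i}") for i in range(3)]
assert results == ["Good", "Good", "BadResourceUnavailable"]
```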

#### Crash-loop circuit breaker

If a driver host crashes 3 times within 5 minutes, the supervisor stops respawning, leaves the driver's nodes in Bad quality, raises an operator alert, and starts an **escalating cooldown** before attempting auto-reset. This balances "unattended sites need recovery without an operator on console" against "don't silently mask a persistent problem."

| Trip sequence | Cooldown before auto-reset |
|---------------|----------------------------|
| First trip | 1 hour |
| Re-trips within 10 min of an auto-reset | 4 hours |
| Re-trips after the 4 h cooldown | **24 hours, manual reset required via Admin UI** |

Every trip raises a sticky operator alert that does **not** auto-clear when the cooldown elapses — only manual acknowledgment clears it. So even if recovery is automatic, "we crash-looped 3 times overnight" stays visible the next morning. The auto-reset path keeps unattended plants running; the sticky alert + 24 h manual-only floor prevents the breaker from becoming a "silent retry forever" mechanism.

For Tier A/B (in-process) drivers, the "crash" being counted is a driver-instance reset (capability-level reinitialization, not a process exit). For Tier C drivers, it's a host process exit.
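
The escalation ladder can be sketched as a small state machine. This is an illustrative Python model of the table above, not production code; the behavior for a re-trip long after a 1 h auto-reset is not specified by the table and is modeled here, as an assumption, by escalating to the manual floor:

```python
class CrashLoopBreaker:
    """Escalating-cooldown ladder: 1 h, then 4 h, then 24 h + manual reset."""
    def __init__(self):
        self._stage = 0  # 0 = untripped

    def trip(self, minutes_since_last_auto_reset=None):
        """Called when a driver crashes 3x within 5 min.
        Returns (cooldown_hours, manual_reset_required)."""
        if self._stage == 0:
            self._stage = 1                  # first trip: 1 h
        elif self._stage == 1 and (minutes_since_last_auto_reset or 0) < 10:
            self._stage = 2                  # re-trip right after auto-reset: 4 h
        else:
            self._stage = 3                  # persistent problem: 24 h, manual only
        return {1: 1, 2: 4, 3: 24}[self._stage], self._stage == 3

b = CrashLoopBreaker()
assert b.trip() == (1, False)
assert b.trip(minutes_since_last_auto_reset=5) == (4, False)
assert b.trip(minutes_since_last_auto_reset=5) == (24, True)
```

Note that the sticky alert is orthogonal to this state machine — it is raised on every trip and cleared only by acknowledgment, never by the cooldown elapsing.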

### In-process only (Tier A/B) — driver-instance allocation tracking

In-process drivers cannot be recycled by killing the server process — that would take down every other driver, every session, and the OPC UA endpoint. RSS watchdogs and scheduled recycle therefore do **not** apply to Tier A/B. Instead, each driver instance is monitored at a finer grain:

- **Per-instance allocation tracking**: drivers expose a `GetMemoryFootprint()` capability returning bytes attributable to their own caches (symbol cache, subscription items, queued operations). The Core polls this every 30 s and logs growth slope per driver instance.
- **Soft-limit on cached state**: each driver declares a memory budget for its caches in `DriverConfig`. On breach, the Core asks the driver to flush optional caches (e.g. discard symbol cache, force re-discovery). No process action.
- **Escalation rule**: if a driver instance's footprint cannot be bounded by cache flushing — or if growth is in opaque allocations the driver can't account for — that driver is a candidate for **promotion to Tier C**. Process recycle is the only safe leak remediation, and the only way to apply process recycle to a single driver is to give it its own process.
- **No process kill on a Tier A/B driver**. Ever. The only Core-initiated recovery is asking the driver to reset its own state via `IDriver.Reinitialize()`. If that fails, the driver instance is marked Faulted, its nodes go Bad quality, and the operator is alerted. The server process keeps running for everyone else.

### Isolated host only (Tier C) — process-level protections

These act on the host process. They cannot affect any other driver or the main server, because each Tier C driver has its own process.

#### Per-host memory watchdog

Each host process measures baseline RSS after warm-up (post-discovery, post-first-poll). A monitor thread samples RSS every 30 s and tracks **both a multiplier of baseline and an absolute hard ceiling**.

| Threshold | Action |
|-----------|--------|
| 1.5× baseline **OR** baseline + 50 MB (whichever larger) | Log warning, surface in status dashboard |
| 3× baseline **OR** baseline + 200 MB (whichever larger) | Trigger soft recycle (graceful drain → exit → respawn) |
| 1 GB absolute hard ceiling | Force-kill driver process, supervisor respawns |
| Slope > 2 MB/min sustained 30 min | Treat as leak signal, soft recycle even below absolute threshold |

The "whichever larger" floor prevents spurious triggers when baseline is tiny — a 30 MB FOCAS Host shouldn't recycle at 45 MB just because the multiplier says so. All thresholds are per-driver-type defaults, overridable per-driver-instance in central config. **Only valid for isolated hosts** — never apply to the main server process.
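
The threshold logic can be sketched as an illustrative Python function (production is C# in the host's monitor thread; `watchdog_action` is a hypothetical name). Note how the "whichever larger" floor keeps a 30 MB baseline host at 45 MB in the no-action band:

```python
def watchdog_action(rss_mb, baseline_mb, slope_mb_per_min=0.0, slope_minutes=0):
    """Threshold decision for a Tier C host, per the global defaults above.
    The max(...) floors keep tiny baselines from tripping on the multiplier."""
    if rss_mb >= 1024:                                    # 1 GB hard ceiling
        return "force-kill"
    if rss_mb >= max(3 * baseline_mb, baseline_mb + 200):
        return "soft-recycle"
    if slope_mb_per_min > 2 and slope_minutes >= 30:      # sustained leak slope
        return "soft-recycle"
    if rss_mb >= max(1.5 * baseline_mb, baseline_mb + 50):
        return "warn"
    return "ok"

# A 30 MB host at 45 MB is 1.5x baseline, but below the +50 MB floor: no action.
assert watchdog_action(rss_mb=45, baseline_mb=30) == "ok"
assert watchdog_action(rss_mb=85, baseline_mb=30) == "warn"
assert watchdog_action(rss_mb=240, baseline_mb=30) == "soft-recycle"
assert watchdog_action(rss_mb=30, baseline_mb=30,
                       slope_mb_per_min=3, slope_minutes=45) == "soft-recycle"
```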

#### Heartbeat between proxy and host

The proxy in the main server sends a heartbeat ping to the driver host **every 2 s** and expects a reply within 1 s. **Three consecutive misses → proxy declares the host dead** (6 s total detection latency), fans out Bad quality on all of that driver's nodes, and asks the supervisor to respawn.

2 s is fast enough that subscribers on a 1 s OPC UA publishing interval see Bad quality within one or two missed publish cycles, but slow enough that GC pauses (typically <500 ms even on bad days) and Windows pipe scheduling jitter don't generate false positives. The 3-miss tolerance absorbs single-cycle noise.

The heartbeat is on a separate named-pipe channel from the data-plane RPCs so a stuck data-plane operation doesn't mask host death. Cadence and miss-count are tunable per-driver-instance in central config.
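
The liveness rule — three consecutive misses on a 2 s cadence — can be sketched as (illustrative Python; the real sender is C# on the proxy side):

```python
def seconds_to_declare_dead(replies, period_s=2.0, tolerated_misses=3):
    """replies: one bool per heartbeat cycle (True = pong within 1 s).
    Returns seconds until the proxy declares the host dead, or None."""
    consecutive = 0
    for cycle, ok in enumerate(replies, start=1):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= tolerated_misses:
            return cycle * period_s
    return None

assert seconds_to_declare_dead([True, False, True, False, False]) is None  # jitter absorbed
assert seconds_to_declare_dead([False, False, False]) == 6.0               # worst-case 6 s
assert seconds_to_declare_dead([True, False, False, False]) == 8.0
```

The first case shows why the counter must be *consecutive*: isolated misses from GC pauses or pipe jitter reset it rather than accumulate toward a false positive.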

#### Scheduled recycling

Each Tier C host process is recycled on a schedule (default 24 h, configurable per driver type). The recycle is a soft drain → exit → respawn, identical to a watchdog-triggered recycle. Defensive measure against slow leaks that stay below the watchdog thresholds.

### Post-mortem log

Each driver process writes a ring buffer of the last 1000 operations to a memory-mapped file (`%ProgramData%\OtOpcUa\driver-postmortem\<driver>.mmf`):

```
timestamp | handle/connection ID | operation | args summary | return code | duration
```

On graceful shutdown, the ring is flushed to a rotating log. On a hard crash (including AV), the supervisor reads the MMF after the corpse is gone and attaches the tail to the crash event reported on the dashboard. Without this, post-mortem of a Fwlib AV is impossible.

---

## Out-of-Process Driver Pattern (Generalized)

This is the Galaxy.Proxy/Host/Shared layout from `plan.md` §3, lifted to a reusable pattern for every Tier C driver. Two new projects per Tier C driver beyond the in-process driver projects:

```
src/
  ZB.MOM.WW.OtOpcUa.Driver.<Name>.Proxy/   # In main server: implements IDriver, forwards over IPC
  ZB.MOM.WW.OtOpcUa.Driver.<Name>.Host/    # Separate Windows service: actual driver implementation
  ZB.MOM.WW.OtOpcUa.Driver.<Name>.Shared/  # IPC message contracts (.NET Standard 2.0)
```

Common contract for a Tier C host:

- Hosted as a Windows service with `Microsoft.Extensions.Hosting`
- Named-pipe IPC server (named pipes already established for Galaxy in §3)
- MessagePack-serialized contracts in `<Name>.Shared`
- Heartbeat endpoint on a separate pipe from the data plane
- Memory watchdog runs in-process and triggers `Environment.Exit(2)` on threshold breach
- Post-mortem MMF writer initialized on startup
- Standard supervisor protocol: respawn-with-backoff, crash-loop circuit breaker

Common contract for the proxy in the main server:

- Implements `IDriver` + capability interfaces; forwards every call over IPC
- Owns the heartbeat sender and host liveness state
- Fans out Bad quality on all nodes when the host is declared dead
- Owns the supervisor that respawns the host process
- Exposes host status (Up / Down / Recycling / CircuitOpen) to the status dashboard

### IPC Security (mandatory for every Tier C driver)

Named pipes default to allowing connections from any local user. Without explicit ACLs, any process on the host machine that knows the pipe name could connect, bypass the OPC UA server's authentication and authorization layers, and issue reads, writes, or alarm acknowledgments directly against the driver host. **This is a real privilege-escalation surface** — a service account with no OPC UA permissions could write field values it should never have access to. Every Tier C driver enforces the following:

1. **Pipe ACL**: the host creates the pipe with a `PipeSecurity` ACL that grants `ReadWrite | Synchronize` only to the OtOpcUa server's service principal SID. All other local users — including LocalSystem and Administrators — are explicitly denied. The ACL is set at pipe-creation time so it's atomic with the pipe being listenable.
2. **Caller identity verification**: on each new pipe connection, the host calls `NamedPipeServerStream.GetImpersonationUserName()` (or impersonates and inspects the token) and verifies the connected client's SID matches the configured server service SID. Mismatches are logged and the connection is dropped before any RPC frame is read.
3. **Per-message authorization context**: every RPC frame includes the operation's authenticated OPC UA principal (forwarded by the Core after it has done its own authn/authz). The host treats this as input only — the driver-level authorization (e.g. "is this principal allowed to write Tune attributes?") is performed by the Core, but the host's own audit log records the principal so post-incident attribution is possible.
4. **No anonymous endpoints**: the heartbeat pipe has the same ACL as the data-plane pipe. There are no "open" pipes a generic client can probe.
5. **Defense-in-depth shared secret**: the supervisor generates a per-host-process random secret at spawn time, passes it to both proxy and host via command-line args (or a parent-pipe handshake), and the host requires it on the first frame of every connection. This is belt-and-suspenders for the case where pipe ACLs are misconfigured during deployment.

Configuration: the server service SID is read from `appsettings.json` (`Hosting.ServiceAccountSid`) and validated against the actual running identity at startup. Mismatch fails startup loudly rather than producing a silently-insecure pipe.

For Galaxy, this pattern is retroactively required (the v1 named-pipe IPC predates this contract and must be hardened during the Phase 2 refactor). For FOCAS and any future Tier C driver, IPC security is part of the initial implementation, not an add-on.
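
Point 5's first-frame secret check can be sketched as follows (Python for illustration; the production host is C#, and the pipe-ACL and SID checks of points 1–2 are not modeled here):

```python
import hmac
import secrets

def spawn_secret():
    # Supervisor generates a fresh secret per host-process spawn and hands
    # it to both proxy and host (command-line arg or parent-pipe handshake).
    return secrets.token_hex(32)

def host_accepts_first_frame(expected_secret, presented_secret):
    # Constant-time compare; on mismatch the connection is dropped
    # before any RPC frame is read.
    return hmac.compare_digest(expected_secret, presented_secret)

s = spawn_secret()
assert host_accepts_first_frame(s, s)
assert not host_accepts_first_frame(s, secrets.token_hex(32))
```

The constant-time comparison matters even on a local pipe: it removes timing as a channel for guessing the secret byte by byte.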

### Reusability

For Galaxy, this pattern is already specified. For FOCAS, the same three projects appear in §5 below. Future Tier C escalations (e.g. if libplctag develops a stability problem) reuse the same template.

---

## FOCAS — Deep Dive (Canonical Tier C Worked Example)

FOCAS is the most exposed driver in the v2 plan: a black-box vendor DLL (`Fwlib64.dll`), handle-based API with per-handle thread-affinity, no public stability SLA, and a target market (CNC integrations) where periodic-restart workarounds are common practice. The protections below are not theoretical — every one is a known FOCAS failure mode.

### Project Layout

```
src/
  ZB.MOM.WW.OtOpcUa.Driver.Focas.Proxy/     # .NET 10 x64 in main server
  ZB.MOM.WW.OtOpcUa.Driver.Focas.Host/      # .NET 10 x64 separate Windows service
  ZB.MOM.WW.OtOpcUa.Driver.Focas.Shared/    # .NET Standard 2.0 IPC contracts
  ZB.MOM.WW.OtOpcUa.Driver.Focas.TestStub/  # Stub FOCAS server for dev/CI (per test-data-sources.md)
```

The Host process is the only place `Fwlib64.dll` is loaded. Every concern below is a Host-internal concern.

### Handle Pool

One Fwlib handle per CNC connection. Pool design:

- **`FocasHandle : SafeHandle`** wraps the integer handle returned by `cnc_allclibhndl3`. Finalizer calls `cnc_freelibhndl`. Use of the handle inside the wrapper goes through `DangerousAddRef`/`DangerousRelease` to prevent finalization mid-call.
- **Per-handle lock**. Fwlib is thread-unsafe per handle — one mutex per `FocasHandle`, every API call acquires it. Lock fairness is FIFO so polling and write requests don't starve each other.
- **Pool size of 1 per CNC by default**. FANUC controllers typically allow 4–8 concurrent FOCAS sessions; we don't need parallelism inside one driver-to-CNC link unless profiling shows it. Configurable per device.
- **Health probe**. A background task issues `cnc_sysinfo` against each handle every 30 s. Failure → release the handle, mark device disconnected, let normal reconnect logic re-establish.
- **TTL**. Each handle is forcibly recycled every 6 h (configurable) regardless of health. Defensive against slow Fwlib state corruption.
- **Acquire timeout**. Handle-lock acquisition has a 10 s timeout. Timeout = treat the handle as wedged, kill it, mark device disconnected. (Real FOCAS calls have hung indefinitely in production reports.)
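
A sketch of the per-handle lock with acquire timeout (Python for illustration; a plain `threading.Lock` is not FIFO-fair, so this models only the timeout/wedge behavior, not the fairness requirement):

```python
import threading

class PooledHandle:
    """One mutex per handle; a timed-out acquisition marks the handle wedged."""
    def __init__(self, acquire_timeout_s=10.0):
        self._lock = threading.Lock()
        self._timeout = acquire_timeout_s
        self.wedged = False

    def call(self, fwlib_fn):
        if not self._lock.acquire(timeout=self._timeout):
            self.wedged = True        # kill the handle, mark device disconnected
            return "BadCommunicationError"
        try:
            return fwlib_fn()         # every Fwlib call runs under the lock
        finally:
            self._lock.release()

h = PooledHandle(acquire_timeout_s=0.05)
assert h.call(lambda: "Good") == "Good"
h._lock.acquire()                     # simulate a native call that never returns
assert h.call(lambda: "Good") == "BadCommunicationError"
assert h.wedged
```

The wedge flag is what feeds the "kill it, mark device disconnected" step: once set, the pool never hands the handle out again and lets reconnect logic build a fresh one.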

### Thread Serialization

The Host runs a single-threaded scheduler with **handle-affinity dispatch**: each pending operation is tagged with the target handle, and a dedicated worker thread per handle drains its queue. Two consequences:

- Zero parallel calls into Fwlib for the same handle (correctness).
- A single slow CNC's queue can grow without blocking other CNCs' workers (isolation).

The bounded outgoing queue from §3 is per-handle, not process-global, so one stuck CNC can't starve another's queue capacity.
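
The handle-affinity grouping can be sketched as follows — operations tagged with a target handle land in that handle's own queue, so queues grow independently (illustrative grouping only; the real dispatcher runs a C# worker thread per handle):

```python
from collections import defaultdict

def group_by_handle(operations):
    """operations: (handle_id, op) pairs. Each handle gets its own queue,
    drained by its own worker, so a stuck CNC backs up only itself."""
    queues = defaultdict(list)
    for handle_id, op in operations:
        queues[handle_id].append(op)
    return dict(queues)

queues = group_by_handle([(1, "read"), (2, "read"), (1, "write")])
assert queues == {1: ["read", "write"], 2: ["read"]}
```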

### Memory Watchdog Thresholds (FOCAS-specific)

FOCAS baseline is small (~30–50 MB after discovery on a typical 32-axis machine). Defaults are tighter than the global protection — FOCAS workloads should be stable, so any meaningful growth is a leak signal worth acting on early.

| Threshold | Action |
|-----------|--------|
| 1.5× baseline **OR** baseline + 25 MB (whichever larger) | Warning |
| 2× baseline **OR** baseline + 75 MB (whichever larger) | Soft recycle |
| 300 MB absolute hard ceiling | Force-kill |
| Slope > 1 MB/min sustained 15 min | Soft recycle |

Same multiplier + floor + hard-ceiling pattern as the global default; tighter ratios and a lower hard ceiling because the workload profile is well-bounded.
|
||||
|
||||
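The warn/recycle rows combine a multiplier with an absolute floor. A minimal sketch of that evaluation, using the FOCAS defaults from the table (the class and method names are hypothetical):

```csharp
using System;

// Hypothetical helper illustrating the hybrid threshold rule: trigger at
// max(N x baseline, baseline + floor), with an absolute hard ceiling that
// always force-kills. Values are the FOCAS defaults from the table above.
internal static class FocasWatchdogPolicy
{
    private const double HardCeilingMb = 300;

    public static string Evaluate(double baselineMb, double currentMb)
    {
        if (currentMb >= HardCeilingMb) return "ForceKill";

        double recycleAt = Math.Max(2.0 * baselineMb, baselineMb + 75);
        if (currentMb >= recycleAt) return "SoftRecycle";

        double warnAt = Math.Max(1.5 * baselineMb, baselineMb + 25);
        if (currentMb >= warnAt) return "Warning";

        return "Ok";
    }
}
```

With a 40 MB baseline the absolute floor dominates both rows: warning from 65 MB (40 + 25 beats 1.5 × 40 = 60), soft recycle from 115 MB (40 + 75 beats 2 × 40 = 80).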
### Recycle Policy

Soft recycle in the Host distinguishes between **operations queued in managed code** (safely cancellable) and **operations currently inside `Fwlib64.dll`** (not safely cancellable — Fwlib calls have no cancellation mechanism, and freeing a handle while a native call is using it is undefined behavior, exactly the AV path the isolation is meant to prevent).

Sequence:

1. Stop accepting new IPC requests (the pipe rejects with `BadServerHalted`)
2. Cancel queued (not-yet-dispatched) operations: return `BadCommunicationError` to the proxy
3. Wait up to **10 s grace** for each handle's worker thread to return from its current native call
4. **For handles whose worker thread returned within grace**: call `cnc_freelibhndl` on the handle, dispose the `FocasHandle`
5. **For handles still inside a native call after grace**: do **NOT** call `cnc_freelibhndl` — mark the handle wrapper Abandoned and skip the clean release. The OS reclaims the file descriptors and TCP sockets when the process exits; the CNC's session count decrements on its own connection timeout (typically 30–60 s)
6. Flush the post-mortem ring buffer to disk; record which handles were Abandoned and why
7. **If any handle was Abandoned** → escalate from soft recycle to **hard exit**: `Environment.Exit(2)` rather than `Environment.Exit(0)`. The supervisor logs this as an unclean recycle and applies the crash-loop circuit breaker to it (an Abandoned handle indicates a wedged Fwlib call — exactly the kind of state that justifies treating the recycle as "this driver is in trouble")
8. **If all handles released cleanly** → `Environment.Exit(0)` and the supervisor respawns normally

Recycle triggers (any one):

- Memory watchdog threshold breach
- Scheduled (daily at 03:00 local by default)
- Operator command via the Admin UI
- Crash-loop circuit breaker fired and reset (manual reset)

Recycle frequency cap: 1/hour. More than that = page the operator instead of thrashing.

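Steps 3–8 of the sequence reduce to a small decision per handle: wait out the grace window, free cleanly if the worker returned, otherwise abandon and escalate the exit code. A sketch, assuming hypothetical stand-ins (`Worker` for the per-handle worker task, `FreeHandle` for the `cnc_freelibhndl` + dispose path):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch of recycle steps 3-8. Worker/FreeHandle are hypothetical
// stand-ins; the real Host wires these to its handle pool.
internal static class SoftRecycle
{
    public static int Run(IReadOnlyList<(Task Worker, Action FreeHandle)> handles)
    {
        bool anyAbandoned = false;

        foreach (var (worker, freeHandle) in handles)
        {
            // Step 3: grace window for the current native call to return.
            if (worker.Wait(TimeSpan.FromSeconds(10)))
                freeHandle();          // Step 4: clean cnc_freelibhndl + dispose
            else
                anyAbandoned = true;   // Step 5: never free under an active call
        }

        // Steps 7-8: any Abandoned handle escalates the soft recycle to a
        // hard exit (code 2) so the supervisor's crash-loop breaker sees it.
        return anyAbandoned ? 2 : 0;
    }
}
```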
#### Why we never free a handle with an active native call

Calling `cnc_freelibhndl` on a handle while another thread is mid-call inside `cnc_*` against that same handle is undefined behavior per FANUC's documentation (the handle is not thread-safe; release races with use). The most likely outcome is an immediate AV inside Fwlib — precisely the scenario the entire Tier C isolation is designed to contain. The defensive choice: if we can't release cleanly within the grace window, accept the handle leak (bounded by process lifetime) and let process exit do what we can't safely do from managed code.

This means a wedged Fwlib call always escalates to process exit. There is no in-process recovery path for a hung native call — the only correct response is to let the process die and have the supervisor start a fresh one.

### What Survives a Recycle

| State | Survives? | How |
|-------|:---------:|-----|
| Subscription set | ✔ | Proxy re-issues subscribe on host startup |
| Last-known values | ✔ (cached in proxy) | Surfaced as Bad quality during recycle window |
| In-flight reads | ✗ | Proxy returns BadCommunicationError; OPC UA client retries |
| In-flight writes | ✗ | Per Polly write-retry policy: NOT auto-retried; OPC UA client decides |
| Handle TTL clocks | ✗ (intentional) | Fresh handles after recycle, fresh TTL |

### Recovery Sequence After Crash

1. Supervisor detects host exit (heartbeat timeout or process exit code)
2. Supervisor reads the post-mortem MMF and attaches the tail to a crash event
3. Proxy fans out Bad quality on all FOCAS device nodes
4. Backoff before respawn: 5 s → 15 s → 60 s (capped)
5. Spawn a new Host process
6. Host re-discovers (functional structure is fixed; PMC/macro discovery from central config) and re-subscribes
7. Quality returns to Good as values arrive
8. **3 crashes in 5 minutes → crash-loop circuit opens.** The supervisor stops respawning, leaves Bad quality in place, and raises an operator alert. Manual reset required via the Admin UI.

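The backoff ladder in step 4 and the circuit-breaker rule in step 8 can be sketched as two pure functions (the class and method names are illustrative, not the shipped supervisor API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of the respawn policy: capped 5/15/60 s backoff plus the
// 3-crashes-in-5-minutes circuit breaker described in step 8.
internal static class RespawnPolicy
{
    private static readonly int[] BackoffSeconds = { 5, 15, 60 };

    // consecutiveCrashes is 1-based: first crash waits 5 s, then 15, then 60 capped.
    public static int BackoffFor(int consecutiveCrashes) =>
        BackoffSeconds[Math.Min(consecutiveCrashes - 1, BackoffSeconds.Length - 1)];

    public static bool CircuitOpen(IEnumerable<DateTime> crashTimesUtc, DateTime nowUtc) =>
        crashTimesUtc.Count(t => nowUtc - t <= TimeSpan.FromMinutes(5)) >= 3;
}
```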
### Post-Mortem Log Contents (FOCAS-specific)

In addition to the generic last-N-operations ring, the FOCAS Host post-mortem captures:

- Active handle pool snapshot (handle ID, target IP, age, last-call timestamp, consecutive failures)
- Handle health probe history (last 100 results)
- Memory samples (last 60 — 30 minutes at a 30 s cadence)
- Recycle history (last 10 recycles with trigger reason)
- Last 50 IPC requests received (for correlating crashes with specific operator actions)

This makes the post-mortem of an `AccessViolationException` actionable — without it, a Fwlib AV is essentially undebuggable.

### Test Coverage for FOCAS Stability

There are **two distinct test surfaces** here, and an earlier draft conflated them. Splitting them explicitly:

#### Surface 1 — Functional protocol coverage via the TCP stub

The `Driver.Focas.TestStub` (per `test-data-sources.md` §6) is a TCP listener that mimics a CNC over the FOCAS wire protocol. It can exercise everything that travels over the network:

- **Inject network slowness** — the stub adds latency to FOCAS responses, exercising the bounded queue, the Polly timeout, and the handle-lock acquire timeout
- **Inject network hang** — the stub stops responding mid-call (TCP keeps the socket open but never writes), exercising the per-call grace window and the wedged-handle → hard-exit escalation
- **Inject protocol errors** — the stub returns FOCAS error codes (`EW_HANDLE`, `EW_SOCKET`, etc.) at chosen call boundaries, exercising error-code → StatusCode mapping and the Polly retry policies
- **Inject disconnect** — the stub closes the TCP socket, exercising the reconnect path and Bad-quality fan-out

This covers the **majority** of stability paths, because most FOCAS failure modes manifest as the network behaving badly — the Fwlib library itself tends to be stable when its CNC behaves; the trouble is that real CNCs misbehave often.

#### Surface 2 — Native fault injection via a separate shim

Native AVs and native handle leaks **cannot** be triggered through a TCP stub — they live inside `Fwlib64.dll`, on the host side of the P/Invoke boundary. Faking them requires a separate mechanism:

- **`Driver.Focas.FaultShim` project** — a small native DLL named `Fwlib64.dll` (test-only build configuration) that exports the same FOCAS API surface but, instead of calling FANUC's library, performs configurable fault behaviors: deliberately raise an AV at a chosen call site, return success but never release allocated buffers (leak), return success on `cnc_freelibhndl` but keep the handle table populated (orphan handle), and so on.
- **Activated by binding redirect / DLL search-path order** in the Host's test fixture only; production builds load FANUC's real `Fwlib64.dll`.
- **Tested paths**: supervisor respawn after an AV, post-mortem MMF readability after a hard crash, the watchdog → recycle path on simulated leaks, and the Abandoned-handle path when the shim simulates a wedged native call.

The Host code is unchanged between the two surfaces — it just experiences different symptoms depending on which DLL it loaded. Honest framing of test coverage: **the TCP stub covers ~80% of real-world FOCAS failures (network/protocol); the FaultShim covers the remaining ~20% (native crashes/leaks). Hardware/manual testing on a real CNC remains the only validation path for vendor-specific Fwlib quirks that neither stub can predict.**

---
## Galaxy — Deep Dive (Tier C, COM/STA Worked Example)

Galaxy is the second Tier C driver and the only one bound to .NET 4.8 x86 (MXAccess COM has no 64-bit variant). Unlike FOCAS, Galaxy carries 12+ years of v1 production history, so the failure surface is well-mapped — most of the protections below close known incident classes rather than guarding against speculative ones. The four findings closed in commit `c76ab8f` (stability review 2026-04-13) are concrete examples: a failed runtime probe subscription leaving a phantom entry that flipped Tick() to Stopped and fanned out false BadOutOfService quality; a recovering host's startup callback clearing Bad quality on a sibling host's still-stopped variables; sync-over-async on the OPC UA stack thread; and fire-and-forget alarm tasks racing shutdown.

### Project Layout

```
src/
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/    # .NET 10 x64, in the main server
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/     # .NET 4.8 x86, separate Windows service
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/   # .NET Standard 2.0 IPC contracts
```

The Host is the only place MXAccess COM objects, the Galaxy SQL Server connection, and the optional Wonderware Historian SDK are loaded. Bitness mismatch with the .NET 10 x64 main server is the original isolation reason; Tier C stability isolation is the layered reason.

### STA Thread + Win32 Message Pump (the foundation)

Every MXAccess COM call must execute on a dedicated STA thread that runs a `GetMessage`/`DispatchMessage` loop, because MXAccess delivers `OnDataChange` / `OnWriteComplete` / advisory callbacks via window messages. This is non-negotiable — calls from the wrong apartment fail or, worse, cross-thread COM marshaling silently corrupts state.

- **One STA thread per Host process** owns all `LMXProxyServer` instances and all advisory subscriptions
- **Work-item dispatch** uses `PostThreadMessage(WM_APP)` to marshal incoming IPC requests onto the STA thread
- **Pump shutdown** posts `WM_QUIT` only after all outstanding work items have completed, preventing torn-down COM proxies from receiving callbacks
- **Pump health** is itself probed: the proxy sends a no-op work item every 10 s and expects a round-trip; a missing round-trip = pump wedged = trigger recycle

The pattern is the same as the v1 `StaComThread` in `ZB.MOM.WW.LmxProxy.Host` — proven at this point and not a place for invention.

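The shape of the pattern can be sketched as below. Note the simplification: the blocking work queue stands in for the Win32 pump so the structure is visible without P/Invoke — the real Host must use `GetMessage`/`DispatchMessage` plus `PostThreadMessage(WM_APP)`, since MXAccess callbacks arrive as window messages that only a real pump dispatches. `StaWorkPump` is a hypothetical name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;
using System.Threading;

// Sketch of the one-STA-thread-owns-all-COM pattern. A blocking queue
// stands in for the GetMessage/DispatchMessage pump.
internal sealed class StaWorkPump : IDisposable
{
    private readonly BlockingCollection<Action> _queue = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public StaWorkPump()
    {
        _thread = new Thread(() =>
        {
            foreach (var work in _queue.GetConsumingEnumerable())
                work(); // every COM call executes here, on the one STA thread
        });
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            _thread.SetApartmentState(ApartmentState.STA); // apartments only exist on Windows
        _thread.IsBackground = false;
        _thread.Start();
    }

    // Analogous to PostThreadMessage(WM_APP): marshal work onto the STA thread.
    public void Post(Action work) => _queue.Add(work);

    public void Dispose()
    {
        _queue.CompleteAdding(); // analogous to posting WM_QUIT after the drain
        _thread.Join();          // all queued work completes before shutdown
    }
}
```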
### COM Object Lifetime

MXAccess COM objects (`LMXProxyServer` connection handles, item handles) accumulate native references that the GC does not track. Leaks here are silent until the Host runs out of handles or the Galaxy refuses new advisory subscriptions.

- **`MxAccessHandle : SafeHandle`** wraps each `LMXProxyServer` connection. The finalizer calls `Marshal.ReleaseComObject` until the refcount reaches 0, then `UnregisterProxy`.
- **Subscription handles** are wrapped per item; `RemoveAdvise` + `RemoveItem` on dispose, in that order (event handlers must be unwired before the item handle goes away — undefined behavior otherwise).
- **`CriticalFinalizerObject`** for handle wrappers so finalizer ordering during AppDomain unload is predictable.
- **Pre-shutdown drain**: on Host stop, the Proxy first cancels all subscriptions cleanly via the STA pump (`AdviseSupervisory(stop)` → `RemoveItem` → `UnregisterProxy`). Only then does the Host exit. Fire-and-forget shutdown is a known v1 bug class — the four 2026-04-13 stability findings include "alarm auto-subscribe and transferred-subscription restore no longer race shutdown as untracked fire-and-forget tasks."

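The release-until-zero rule in the first bullet can be sketched as follows. `Marshal.ReleaseComObject` returns the remaining RCW reference count, so looping to zero ensures the native reference actually drops before unregistering; `ReleaseFully` and the `unregister` delegate (standing in for the MXAccess `UnregisterProxy` call) are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch of the ReleaseComObject-until-zero pattern described above.
internal static class ComLifetime
{
    public static void ReleaseFully(object comObject, Action unregister)
    {
        if (comObject != null && Marshal.IsComObject(comObject))
        {
            // ReleaseComObject returns the new refcount; loop until it hits 0.
            while (Marshal.ReleaseComObject(comObject) > 0)
            {
            }
        }
        unregister(); // only after the COM reference is fully released
    }
}
```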
### Subscription State and Reconnect

Galaxy's MXAccess advisory subscriptions are stateful — once established, Galaxy pushes value updates until `RemoveAdvise`. Network disconnects, Galaxy redeployments, and Platform/AppEngine restarts all break the subscription stream and require replay.

- **Subscription registry** in the Host: every `AddItem` + `AdviseSupervisory` is recorded so reconnect can replay it
- **Reconnect trigger**: the connection-health probe (see below) detects loss → marks subscriptions Disconnected → fans out Bad quality via the Proxy → enters the reconnect loop
- **Replay order**: register proxy → re-add items → re-advise. Order matters; re-advising an item that was never re-added wedges silently.
- **Quality fan-out** during the reconnect window respects host scope — per the same 2026-04-13 findings, a stopped DevAppEngine must not let a recovering DevPlatform's startup callback wipe Bad quality on the still-stopped engine's variables. **Cross-host quality clear is gated on a host-status check.**
- **Symbol-version-changed equivalent**: a Galaxy `time_of_last_deploy` change → the driver invokes `IRediscoverable` → rebuild the affected subtree only (per the Galaxy platform scope filter, commit `bc282b6`)

### Connection Health Probe (`GalaxyRuntimeProbeManager`)

A dedicated probe subscribes to a synthetic per-host runtime-status attribute (Platform/Engine ScanState). Probe state drives:

- **Bad-quality fan-out** when a host (Platform or AppEngine) reports Stopped
- **Quality restoration** when state transitions back to Running, scoped to that host's subtree only (not Galaxy-wide — closes the 2026-04-13 finding about a Running→Unknown→Running callback wiping sibling state)
- **Probe failure handling**: a failed probe subscription must NOT leave a phantom entry that Tick() flips to Stopped — phantom probes are an accidental Bad-quality source. Closed in `c76ab8f`.

### Memory Watchdog Thresholds (Galaxy-specific)

The Galaxy baseline depends heavily on Galaxy size. The platform scope filter (commit `bc282b6`) reduced a dev Galaxy's footprint from 49 objects / 4,206 attributes (full Galaxy) to 3 objects / 386 attributes (local subtree). Real production Galaxies vary from a few hundred to tens of thousands of attributes.

| Threshold | Action |
|-----------|--------|
| 1.5× baseline (per-instance, after warm-up) | Warning |
| 2× baseline **OR** baseline + 200 MB (whichever larger) | Soft recycle |
| 1.5 GB absolute hard ceiling | Force-kill |
| Slope > 5 MB/min sustained 30 min | Soft recycle |

Higher hard ceiling than FOCAS (1.5 GB vs 300 MB) because legitimate Galaxy baselines are larger. Same multiplier-with-floor pattern. The slope threshold is more permissive (5 MB/min vs 1 MB/min) because Galaxy's address-space rebuild on redeploy can transiently allocate large amounts.

### Recycle Policy (COM-specific)

Soft recycle distinguishes between **work items queued for the STA pump** (cancellable before dispatch) and **MXAccess calls in flight on the STA thread** (not cancellable — COM has no abort).

1. Stop accepting new IPC requests
2. Cancel queued (not-yet-dispatched) STA work items
3. Wait up to **15 s grace** for the in-flight STA call to return (longer than FOCAS because some MXAccess calls — bulk attribute reads, large hierarchy traversals — legitimately take seconds)
4. **For each subscription**: post `RemoveAdvise` → `RemoveItem` → release the item handle, in that order, on the STA thread
5. **For the proxy connection**: post `UnregisterProxy` → `Marshal.ReleaseComObject` until refcount = 0 → release the `MxAccessHandle`
6. **STA pump shutdown**: post `WM_QUIT` only after all of the above have completed
7. Flush the post-mortem ring buffer
8. **If the STA pump did not exit within 5 s** of `WM_QUIT` → escalate to `Environment.Exit(2)`. A wedged COM call cannot be recovered cleanly; same logic as the FOCAS Abandoned-handle escalation.
9. **If clean** → `Environment.Exit(0)`; the supervisor respawns

The recycle frequency cap is the same as FOCAS (1/hour). Scheduled recycle defaults to every 24 h.

### What Survives a Galaxy Recycle

| State | Survives? | How |
|-------|:---------:|-----|
| Address space (built from Galaxy DB) | ✔ | Proxy caches the last built tree; rebuild from DB on host startup |
| Subscription set | ✔ | Proxy re-issues subscribe on host startup |
| Last-known values | ✔ (in proxy cache) | Surfaced as Bad quality during recycle window |
| Alarm state | partial | Active alarm registry replayed; AlarmTracking re-subscribes |
| In-flight reads | ✗ | BadCommunicationError; client retries |
| In-flight writes | ✗ | Per Polly write-retry policy: not auto-retried |
| Historian subscriptions | ✗ | Re-established on next HistoryRead |
| `time_of_last_deploy` watermark | ✔ | Cached in proxy; resync on startup avoids a spurious full rebuild |

### Recovery Sequence After Crash

Same supervisor protocol as FOCAS, with one Galaxy-specific addition:

1. Supervisor detects host exit
2. Reads the post-mortem MMF, attaches the tail to a crash event
3. Proxy fans out Bad quality on **all Galaxy nodes scoped to the lost host's platform** (not necessarily every Galaxy node — multi-host respect is per the 2026-04-13 findings)
4. Backoff: 5 s → 15 s → 60 s
5. Spawn a new Host
6. Host checks `time_of_last_deploy`; if unchanged from the cached watermark, skip full DB rediscovery and reuse the cached hierarchy (faster recovery for the common case where the crash was unrelated to a redeploy)
7. Re-register the MXAccess proxy, re-add items, re-advise
8. Quality returns to Good as values arrive
9. **3 crashes in 5 minutes → crash-loop circuit opens** (same escalating-cooldown rules as FOCAS)

### Post-Mortem Log Contents (Galaxy-specific)

In addition to the universal last-N-operations ring:

- **STA pump state snapshot**: thread ID, last-message-dispatched timestamp, queue depth
- **Active subscription count** + breakdown by host (Platform/AppEngine)
- **`MxAccessHandle` refcount snapshot** for every live handle
- **Last 100 probe results** with host status transitions
- **Last redeploy event** timestamp (from `time_of_last_deploy` polling)
- **Galaxy DB connection state** (last query duration, last error)
- **Historian connection state** if HDA is enabled

### Test Coverage for Galaxy Stability

Galaxy is the easiest of the Tier C drivers to test because the dev machine already has a real Galaxy. Three test surfaces:

1. **Real Galaxy on the dev machine** (per `test-data-sources.md`) — the primary integration-test environment. Covers MXAccess wire behavior, subscription replay, redeploy-triggered rediscovery, and host status transitions.
2. **`Driver.Galaxy.FaultShim`** — analogous to the FOCAS FaultShim: a test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding. Injects: a COM exception at a chosen call site, a subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning an unexpected refcount, and an STA pump deadlock simulation.
3. **v1 IntegrationTests parity suite** — the existing v1 test suite must pass against the v2 Galaxy driver before the move-behind-IPC is considered complete (decision #56). This is the primary regression net.

The 2026-04-13 stability findings should each become a regression test in the parity suite — phantom probe subscription, cross-host quality clear, sync-over-async on the stack thread, fire-and-forget shutdown race. Closing those bugs without test coverage is how they come back.

---

## Decision Additions for `plan.md`

Proposed new entries for the Decision Log (numbering continues from #62):

| # | Decision | Rationale |
|---|----------|-----------|
| 63 | Driver stability tier model (A/B/C) | Drivers vary in failure profile; tier dictates hosting and protection level. See `driver-stability.md` |
| 64 | FOCAS is Tier C — out-of-process Windows service | Fwlib64.dll is black-box, AV uncatchable, handle-affine, no SLA. Same Proxy/Host/Shared pattern as Galaxy |
| 65 | Cross-cutting protections mandatory in all tiers | SafeHandle, memory watchdog, bounded queues, scheduled recycle, post-mortem log apply to every driver process |
| 66 | Out-of-process driver pattern is reusable | The Galaxy Proxy/Host/Shared template generalizes to any Tier C driver; FOCAS is the second user |
| 67 | Tier B drivers may escalate to Tier C on production evidence | libplctag, S7netplus, TwinCAT.Ads start in-process; promote if leaks or crashes appear in production |
| 68 | Crash-loop circuit breaker stops respawn after 3 crashes / 5 min | Prevents thrashing; requires manual reset to surface an operator-actionable problem |
| 69 | Post-mortem log via memory-mapped file | Survives hard process death (including AV); the supervisor reads it after the corpse is gone; the only viable post-mortem path for native crashes |

---
## Resolved Defaults

The three open questions from the initial draft are resolved as follows. All values are tunable per driver instance in central config; the defaults are what ships out of the box.

### Watchdog thresholds — hybrid multiplier + absolute floor + hard ceiling

Pure multipliers misfire on tiny baselines (a 30 MB FOCAS Host shouldn't recycle at 45 MB). Pure absolute thresholds in MB don't scale across deployment sizes. The hybrid triggers on whichever threshold is reached first — `max(N × baseline, baseline + floor MB)` for warn/recycle, plus an absolute hard ceiling that always force-kills. Slope detection stays orthogonal — it catches slow leaks well below any threshold.

### Crash-loop reset — auto-reset with escalating cooldown, sticky alert, 24 h manual floor

Manual-only reset is too rigid for unattended plants (CNC sites don't have operators on console 24/7). Pure auto-reset after a fixed cooldown defeats the purpose of the breaker by letting it silently retry forever. An escalating cooldown (1 h → 4 h → 24 h with manual reset) auto-recovers from transient problems while ensuring persistent problems eventually demand human attention. Sticky alerts that don't auto-clear keep the trail visible regardless.

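The escalation ladder can be sketched as a pure function of how many times the breaker has opened (the class and member names are illustrative):

```csharp
using System;

// Sketch of the escalating auto-reset: the first open retries after 1 h,
// the second after 4 h; from the third open onward the breaker holds for
// 24 h and additionally requires a manual reset (the "manual floor").
internal static class BreakerCooldown
{
    public static (TimeSpan Cooldown, bool ManualResetRequired) ForOpenCount(int openCount)
    {
        switch (openCount)
        {
            case 1: return (TimeSpan.FromHours(1), false);
            case 2: return (TimeSpan.FromHours(4), false);
            default: return (TimeSpan.FromHours(24), true); // 24 h manual floor
        }
    }
}
```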
### Heartbeat cadence — 2 s with 3-miss tolerance

5 s × 3 misses = 15 s detection is too slow against typical 1 s OPC UA publishing intervals (subscribers see Bad quality 15+ samples late). 1 s × 3 = 3 s detection is plausible but raises the false-positive rate from GC pauses and Windows pipe scheduling. 2 s × 3 = 6 s is the sweet spot: subscribers see Bad quality within one or two missed publish cycles, and GC pauses (~500 ms typical) and pipe jitter stay well inside the tolerance budget.

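The detector reduces to a timestamp comparison: the host is considered lost only when the gap since the last beat exceeds interval × tolerance (6 s with the defaults above). A sketch with hypothetical names:

```csharp
using System;

// Sketch of the 2 s x 3-miss heartbeat detector: a host is declared lost
// only after three consecutive missed beats (6 s), so a ~500 ms GC pause
// never trips it.
internal sealed class HeartbeatMonitor
{
    private static readonly TimeSpan Interval = TimeSpan.FromSeconds(2);
    private const int MissTolerance = 3;

    private DateTime _lastBeatUtc;

    public HeartbeatMonitor(DateTime nowUtc) => _lastBeatUtc = nowUtc;

    public void OnBeat(DateTime nowUtc) => _lastBeatUtc = nowUtc;

    public bool IsLost(DateTime nowUtc) =>
        nowUtc - _lastBeatUtc > TimeSpan.FromTicks(Interval.Ticks * MissTolerance);
}
```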
---

*File: `docs/v2/implementation/overview.md` (new, 181 lines)*

# Implementation Plan Overview — OtOpcUa v2

> **Status**: DRAFT — defines the gate structure, compliance-check approach, and deliverable conventions used across all phase implementation plans (`phase-0-*.md`, `phase-1-*.md`, etc.).
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Purpose

Each phase of the v2 build (`plan.md` §6 Migration Strategy) gets a dedicated detailed implementation doc in this folder. This overview defines the structure those docs follow so reviewers can verify compliance with the v2 design without re-reading every artifact.

## Phase Gate Structure

Every phase has **three gates** the work must pass through:

```
        ┌──────────┐         ┌──────────┐            ┌──────────┐
START ──┤  ENTRY   │── do ───┤   MID    │── verify ──┤   EXIT   │── PHASE COMPLETE
        │  GATE    │  work   │  GATE    │ artifacts  │  GATE    │
        └──────────┘         └──────────┘            └──────────┘
```

### Entry gate

**Purpose**: ensures the phase starts from a known-good state with all prerequisites met. Prevents starting work on top of broken foundations.

**Checked before any phase work begins**:

- The prior phase has cleared its **exit gate** (or this is Phase 0)
- The working tree is clean on the appropriate branch
- All baseline tests for the prior phase still pass
- Any external dependencies the phase needs are confirmed available
- The implementation lead has read the phase doc and the relevant sections of `plan.md`, `config-db-schema.md`, `driver-specs.md`, `driver-stability.md`, `admin-ui.md`

**Evidence captured**: a short markdown file `entry-gate-{phase}.md` recording the date, signoff, baseline test pass, and any deviations noted.

### Mid gate

**Purpose**: course-correct partway through the phase. Catches drift before it compounds. Optional for phases ≤ 2 weeks; required for longer phases.

**Checked at the midpoint**:

- Are the highest-risk deliverables landing on schedule?
- Have any new design questions surfaced that the v2 docs don't answer? If so, escalate to plan revision before continuing.
- Are tests being written alongside code, or accumulating as a backlog?
- Has any decision (`plan.md` decision log) been silently violated by the implementation? If so, either revise the implementation or revise the decision (with an explicit "supersedes" entry).

**Evidence captured**: a short status update appended to the phase doc.

### Exit gate

**Purpose**: ensures the phase actually achieved what the v2 design specified, not just "the code compiles". This is where compliance verification happens.

**Checked before the phase is declared complete**:

- All **acceptance criteria** for every task in the phase doc are met (each criterion has explicit evidence)
- All **compliance checks** (see below) pass
- All **completion checklist** items are ticked, with links to the verifying artifact (test, screenshot, log line, etc.)
- The phase commit history is clean (no half-merged WIP, no skipped hooks)
- Documentation updates are merged: any change of approach during the phase is reflected back in the v2 design docs (the `plan.md` decision log gets new entries; `config-db-schema.md` is updated if the schema differed from spec; etc.)
- An adversarial review has been run on the phase output (`/codex:adversarial-review` or equivalent) — findings closed or explicitly deferred with rationale
- The implementation lead **and** one other reviewer sign off

**Evidence captured**: `exit-gate-{phase}.md` recording all of the above with links and signatures.

## Compliance Check Categories

Phase exit gates run compliance checks across these axes. Each phase doc enumerates the specific checks for that phase under "Compliance Checks".

### 1. Schema compliance (Phase 1+)

For phases that touch the central config DB:

- Run EF Core migrations against a clean SQL Server instance
- Diff the resulting schema against the DDL in `config-db-schema.md`:
  - Table list matches
  - Column types and nullability match
  - Indexes (regular + unique + filtered) match
  - CHECK constraints match
  - Foreign keys match
  - Stored procedures are present and signatures match
- Any drift is blocking. Either fix the migration or update the schema doc with explicit reasoning, then re-run.

### 2. Decision compliance

For each decision number cited in the phase doc (`#XX` references to the `plan.md` decision log):

- Locate the artifact (code module, test, configuration file) that demonstrates the decision is honored
- Add a code comment or test name that cites the decision number
- The phase exit gate uses a script (or grep) to verify every cited decision has at least one citation in the codebase

This makes the decision log a **load-bearing reference**, not a historical record.

### 3. Visual compliance (Admin UI phases)

For phases that touch the Admin UI:

- Side-by-side screenshots of equivalent ScadaLink CentralUI screens vs the new OtOpcUa Admin screens
- Login page, sidebar, dashboard, and generic forms must visually match per `admin-ui.md` §"Visual Design — Direct Parity with ScadaLink"
- Reviewer signoff question: "could the same operator move between apps without noticing?"

### 4. Behavioral compliance (end-to-end smoke tests)

For each phase, an integration test exercises the new capability end-to-end:

- Phase 0: existing v1 IntegrationTests pass under the renamed projects
- Phase 1: create a cluster → publish a generation → a node fetches the generation → roll back → fetch again
- Phase 2: the v1 IntegrationTests parity suite passes against the v2 Galaxy.Host (per decision #56)
- Phase 3+: per-driver smoke test against the simulator

Smoke tests are **always green at exit**, never "known broken, fix later".

### 5. Stability compliance (Phase 2+ for Tier C drivers)

For phases that introduce Tier C drivers (Galaxy in Phase 2, FOCAS in Phase 5):

- All cross-cutting protections from `driver-stability.md` §"Cross-Cutting Protections" are wired up:
  - SafeHandle wrappers exist for every native handle
  - The memory watchdog runs and triggers recycle on threshold breach (testable via the FaultShim)
  - The crash-loop circuit breaker fires after 3 crashes / 5 min (testable via a stub-injected crash)
  - The heartbeat between proxy and host functions; missed heartbeats trigger respawn
  - The post-mortem MMF survives a hard process kill, and the supervisor reads it on respawn
- Each protection has a regression test in the driver's test suite

### 6. Documentation compliance

For every phase:

- Any deviation from the v2 design docs (`plan.md`, `config-db-schema.md`, `admin-ui.md`, `driver-specs.md`, `driver-stability.md`, `test-data-sources.md`) is reflected back in the docs
- New decisions are added to the decision log with rationale
- Old decisions are superseded explicitly (not silently)
- Cross-references between docs stay current

## Deliverable Types

Each phase produces a defined set of deliverables. The phase doc enumerates which deliverables apply.

| Type | Format | Purpose |
|------|--------|---------|
| **Code** | Source files committed to a feature branch, merged to `v2` after the exit gate | The implementation itself |
| **Tests** | xUnit unit + integration tests; per-phase smoke tests | Behavioral evidence |
| **Migrations** | EF Core migrations under `Configuration/Migrations/` | Schema delta |
| **Decision-log entries** | New rows appended to the `plan.md` decision table | Architectural choices made during the phase |
| **Doc updates** | Edits to existing v2 docs | Keep design and implementation aligned |
| **Gate records** | `entry-gate-{phase}.md`, `exit-gate-{phase}.md` in this folder | Audit trail of gate clearance |
| **Compliance script** | Per-phase shell or PowerShell script that runs the compliance checks | Repeatable verification |
| **Adversarial review** | `/codex:adversarial-review` output on the phase diff | Independent challenge |

## Branch and PR Conventions

| Branch | Purpose |
|--------|---------|
| `v2` | Long-running design + implementation branch. All phase work merges here. |
| `v2/phase-{N}-{slug}` | Per-phase feature branch (e.g. `v2/phase-0-rename`) |
| `v2/phase-{N}-{slug}-{subtask}` | Per-subtask branches when the phase is large enough to warrant them |

Each phase merges to `v2` via PR after the exit gate clears. PRs include:

- Link to the phase implementation doc
- Link to the exit-gate record
- Compliance-script output
- Adversarial-review output
- Reviewer signoffs

The `master` branch stays at v1 production state until all phases are complete and a separate v2 release decision is made.
## What Counts as "Following the Plan"

The implementation **follows the plan** when, at every phase exit gate:

1. Every task listed in the phase doc has been done OR explicitly deferred with rationale
2. Every compliance check has a passing artifact OR an explicit deviation note signed off by the reviewer
3. The codebase contains traceable references to every decision number the phase implements
4. The v2 design docs are updated to reflect any approach changes
5. The smoke test for the phase passes
6. Two people have signed off — implementation lead + one other reviewer

The implementation **deviates from the plan** when any of those conditions fails. Deviations are not failures; they are signals to update the plan or revise the implementation. The unrecoverable failure mode is **silent deviation** — code that doesn't match the plan, with no decision-log update explaining why. The exit gate's compliance checks exist specifically to make silent deviation impossible to ship.
## Phase Implementation Docs

| Phase | Doc | Status |
|-------|-----|--------|
| 0 | [`phase-0-rename-and-net10.md`](phase-0-rename-and-net10.md) | DRAFT |
| 1 | [`phase-1-configuration-and-admin-scaffold.md`](phase-1-configuration-and-admin-scaffold.md) | DRAFT |
| 2 | [`phase-2-galaxy-out-of-process.md`](phase-2-galaxy-out-of-process.md) | DRAFT |
| 3 | (Phase 3: Modbus TCP driver — TBD) | NOT STARTED |
| 4 | (Phase 4: PLC drivers AB CIP / AB Legacy / S7 / TwinCAT — TBD) | NOT STARTED |
| 5 | (Phase 5: Specialty drivers FOCAS / OPC UA Client — TBD) | NOT STARTED |

**Consumer cutover (ScadaBridge / Ignition / System Platform IO) is OUT of v2 scope.** It is a separate work track owned by the integration / operations team, tracked in the 3-year-plan handoff (`handoffs/otopcua-handoff.md` §"Rollout Posture") and the corrections doc (§C5). The OtOpcUa team's responsibility ends at Phase 5 (all drivers built, all stability protections in place, full Admin UI shipped). Cutover sequencing, validation methodology, rollback procedures, and Aveva-pattern validation for tier 3 are the integration team's deliverables.
---

**New file**: `docs/v2/implementation/phase-0-rename-and-net10.md` (269 lines)
# Phase 0 — Rename to OtOpcUa + .NET 10 Cleanup

> **Status**: DRAFT — implementation plan for Phase 0 of the v2 build (`plan.md` §6).
>
> **Branch**: `v2/phase-0-rename`
> **Estimated duration**: 3–5 working days
> **Predecessor**: none (first phase)
> **Successor**: Phase 1 (`phase-1-configuration-and-admin-scaffold.md`)

## Phase Objective

Mechanically rename the existing v1 codebase from `LmxOpcUa` to `OtOpcUa` and verify all existing v1 tests still pass under the new names. **No new functionality**, **no .NET 10 retargeting of `Host` or `Historian.Aveva`** (those move in Phase 2 with the Galaxy split — they need to stay on .NET 4.8 because of MXAccess and Wonderware Historian SDK dependencies). All other projects are already on .NET 10 and stay there.

The phase exists as a clean checkpoint: future PRs reference `OtOpcUa` consistently, the rename is not entangled with semantic changes, and the diff is mechanical enough to review safely.
## Scope — What Changes

| Concern | Change |
|---------|--------|
| Project names | `ZB.MOM.WW.LmxOpcUa.*` → `ZB.MOM.WW.OtOpcUa.*` (all 11 projects) |
| Solution file | `ZB.MOM.WW.LmxOpcUa.slnx` → `ZB.MOM.WW.OtOpcUa.slnx` |
| Namespaces | `ZB.MOM.WW.LmxOpcUa` root → `ZB.MOM.WW.OtOpcUa` root (all source files) |
| Assembly names | `<AssemblyName>` and `<RootNamespace>` in every csproj |
| Folder names | `src/ZB.MOM.WW.LmxOpcUa.*` → `src/ZB.MOM.WW.OtOpcUa.*`; same in `tests/` |
| Default `appsettings.json` keys | `Lmx*` → `Ot*` only where the section name is product-bound (e.g. `LmxOpcUa.Server` → `OtOpcUa.Server`); leave `MxAccess.*` keys alone (those refer to the AVEVA product, not ours) |
| Service registration name | TopShelf service name `LmxOpcUa` → `OtOpcUa` (until Phase 1 swaps TopShelf for `Microsoft.Extensions.Hosting`) |
| Documentation | All `docs/*.md` references; `CLAUDE.md` |
| Repo name | **NOT** in scope for Phase 0 — repo rename happens in a separate ops step after exit gate clears |

## Scope — What Does NOT Change

| Item | Reason |
|------|--------|
| `.NET Framework 4.8` target on `Host` and `Historian.Aveva` | MXAccess COM is 32-bit only; Wonderware Historian SDK is .NET 4.8. Both move to `Galaxy.Host` (still .NET 4.8 x86) in Phase 2. |
| `.NET 10` target on Client.CLI / Client.Shared / Client.UI / all Tests | Already there (verified 2026-04-17 via `grep TargetFramework src/*/*.csproj`). |
| Project structure (no new projects) | New projects (Configuration, Core, Core.Abstractions, Server, Admin) are added in Phase 1, not Phase 0. |
| Galaxy / MXAccess implementation | Stays in `OtOpcUa.Host` for now; Phase 2 splits it into Proxy/Host/Shared. |
| `master` branch / production deployments | Untouched — v2 work all happens on the `v2` branch. |
| OPC UA `ApplicationUri` defaults | Currently include `LmxOpcUa` — leave as-is to avoid breaking existing client trust during v1/v2 coexistence. New `ApplicationUri` defaults land in Phase 1 alongside the cluster model. |
| MxAccess product references in docs / code | "MxAccess" is AVEVA's product name, not ours. Stays. |
## Entry Gate Checklist

Verify all before opening the `v2/phase-0-rename` branch:

- [ ] `v2` branch is at commit `a59ad2e` or later (decisions #1–125 captured)
- [ ] `git status` is clean on `v2`
- [ ] `dotnet test ZB.MOM.WW.LmxOpcUa.slnx` passes locally with **zero failing tests**, baseline test count recorded
- [ ] `dotnet build ZB.MOM.WW.LmxOpcUa.slnx` succeeds with zero errors and ≤ baseline warning count
- [ ] All design docs reviewed by the implementation lead: `docs/v2/plan.md`, `docs/v2/config-db-schema.md`, `docs/v2/admin-ui.md`, `docs/v2/driver-specs.md`, `docs/v2/driver-stability.md`, `docs/v2/implementation/overview.md`
- [ ] Decision #9 (rename to OtOpcUa as step 1) re-read and confirmed
- [ ] No other developers have open work on `v2` that would conflict with bulk renames

**Evidence file**: `docs/v2/implementation/entry-gate-phase-0.md` recording date, baseline test count, signoff name.
## Task Breakdown

### Task 0.1 — Inventory references

Generate a complete map of every place `LmxOpcUa` appears:

```bash
grep -rln "LmxOpcUa" --include="*.cs" --include="*.csproj" --include="*.slnx" --include="*.json" --include="*.md" --include="*.razor" .
```

Save the result to `docs/v2/implementation/phase-0-rename-inventory.md` (gitignored after phase completes).

**Acceptance**:
- Inventory file exists, lists every reference grouped by file type
- Reviewer agrees inventory is complete (cross-check against `git grep -i lmx` for case-sensitivity bugs)
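Because the risk table later calls for string literals to be renamed deliberately rather than via bulk `sed`, the inventory can be split into namespace-qualified occurrences and bare literals. A minimal sketch (the output file names are assumptions, not part of the plan):

```bash
# Split the inventory: namespace-qualified occurrences are safe for the bulk
# rename in Task 0.5; bare "LmxOpcUa" literals get individual review.
grep -rn 'ZB\.MOM\.WW\.LmxOpcUa' --include='*.cs' . > inventory-namespaced.txt || true
grep -rn 'LmxOpcUa' --include='*.cs' . | grep -v 'ZB\.MOM\.WW\.LmxOpcUa' > inventory-literals.txt || true
```

Anything in `inventory-literals.txt` (runtime identifiers, log strings, service names) is handled by its owning task (0.6, 0.7), not the bulk rename.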
### Task 0.2 — Rename project folders

Per project (11 projects total — 5 src + 6 tests):

```bash
git mv src/ZB.MOM.WW.LmxOpcUa.Client.CLI src/ZB.MOM.WW.OtOpcUa.Client.CLI
git mv src/ZB.MOM.WW.OtOpcUa.Client.CLI/ZB.MOM.WW.LmxOpcUa.Client.CLI.csproj \
       src/ZB.MOM.WW.OtOpcUa.Client.CLI/ZB.MOM.WW.OtOpcUa.Client.CLI.csproj
```

Repeat for: `Client.Shared`, `Client.UI`, `Historian.Aveva`, `Host`, and all 6 test projects.

Use `git mv` (not `mv` + `git rm`/`git add`) to preserve history.

**Acceptance**:
- `ls src/` shows only `ZB.MOM.WW.OtOpcUa.*` folders
- `ls tests/` shows only `ZB.MOM.WW.OtOpcUa.*` folders
- `git log --follow` on a renamed file shows continuous history pre-rename
### Task 0.3 — Rename solution file

```bash
git mv ZB.MOM.WW.LmxOpcUa.slnx ZB.MOM.WW.OtOpcUa.slnx
```

Edit the `.slnx` to update every project path reference inside it.

**Acceptance**:
- `ZB.MOM.WW.OtOpcUa.slnx` exists and references the renamed project paths
- `dotnet sln list` (or `dotnet build` against the slnx) succeeds
### Task 0.4 — Update csproj contents

For every csproj:
- Update `<AssemblyName>` if explicitly set
- Update `<RootNamespace>` if explicitly set
- Update `<ProjectReference Include=...>` paths for inter-project refs
- Update `<PackageId>` if any project ships as a NuGet (none currently expected, but verify)

**Acceptance**:
- `grep -rl "LmxOpcUa" src/*/*.csproj tests/*/*.csproj` returns empty
- `dotnet restore` succeeds with no missing project references
### Task 0.5 — Bulk-rename namespaces in source files

Run the rename across all `.cs` and `.razor` files:

```bash
grep -rl "ZB.MOM.WW.LmxOpcUa" --include="*.cs" --include="*.razor" . \
  | xargs sed -i 's/ZB\.MOM\.WW\.LmxOpcUa/ZB.MOM.WW.OtOpcUa/g'
```

**Acceptance**:
- `grep -rln "ZB.MOM.WW.LmxOpcUa" --include="*.cs" --include="*.razor" .` returns empty
- `dotnet build ZB.MOM.WW.OtOpcUa.slnx` succeeds
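Before running the bulk substitution across the repo, the exact `sed` pattern can be sanity-checked on a scratch file — in particular that external product names like `MxAccess` survive untouched. A sketch (GNU `sed` assumed for `-i` without a backup suffix):

```bash
# Scratch check of the Task 0.5 substitution in a throwaway directory:
# the namespace prefix is renamed; MxAccess (AVEVA's product) is untouched.
cd "$(mktemp -d)"
printf 'namespace ZB.MOM.WW.LmxOpcUa.Host;\nusing MxAccess.Interop;\n' > Sample.cs
sed -i 's/ZB\.MOM\.WW\.LmxOpcUa/ZB.MOM.WW.OtOpcUa/g' Sample.cs
grep -q 'ZB.MOM.WW.OtOpcUa.Host' Sample.cs   # renamed
grep -q 'MxAccess' Sample.cs                 # left alone
```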
### Task 0.6 — Update appsettings.json + service hosting

In `src/ZB.MOM.WW.OtOpcUa.Host/appsettings.json` and equivalents:
- Rename product-named sections: `LmxOpcUa.Server` → `OtOpcUa.Server` (if present)
- Leave `MxAccess`, `Galaxy`, `Historian` keys untouched (those are external product names)
- Update TopShelf `ServiceName` constant from `LmxOpcUa` → `OtOpcUa`

**Acceptance**:
- Service install (`dotnet run --project src/.../Host install`) registers as `OtOpcUa`
- Service uninstall + reinstall cycle succeeds on a Windows test box
### Task 0.7 — Update documentation references

- `CLAUDE.md`: replace `LmxOpcUa` references with `OtOpcUa` in product-naming contexts; leave `MxAccess` / `MXAccess` references alone
- `docs/*.md` (existing v1 docs): same pattern
- `docs/v2/*.md`: already uses `OtOpcUa` — verify with grep

**Acceptance**:
- `grep -rln "LmxOpcUa" docs/ CLAUDE.md` returns only references that explicitly need to retain the old name (e.g. historical sections, change log)
- Each retained reference has a comment explaining why
### Task 0.8 — Run full test suite + smoke test

```bash
dotnet build ZB.MOM.WW.OtOpcUa.slnx
dotnet test ZB.MOM.WW.OtOpcUa.slnx
```

Plus manual smoke test of Client.CLI against a running v1 OPC UA server:

```bash
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- connect -u opc.tcp://localhost:4840
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- browse -u opc.tcp://localhost:4840 -r -d 2
```

**Acceptance**:
- Test count matches the baseline recorded at entry gate; **zero failing tests**
- Smoke test produces equivalent output to baseline (capture both, diff)
### Task 0.9 — Update build commands in CLAUDE.md

The Build Commands section currently references `ZB.MOM.WW.LmxOpcUa.slnx`. Update to `ZB.MOM.WW.OtOpcUa.slnx`. Also update test paths.

**Acceptance**:
- `grep -i lmxopcua CLAUDE.md` returns only retained-by-design references
- A new developer cloning the repo can follow CLAUDE.md to build + test successfully
## Compliance Checks (run at exit gate)

A `phase-0-compliance.ps1` (or `.sh`) script runs all these and exits non-zero on any failure:

1. **No stale `LmxOpcUa` references**:

   ```
   grep -rln "LmxOpcUa" --include="*.cs" --include="*.csproj" --include="*.slnx" \
        --include="*.json" --include="*.razor" . | wc -l
   ```

   Expected: 0 (or only allowlisted retained references)

2. **All projects build**:

   ```
   dotnet build ZB.MOM.WW.OtOpcUa.slnx --warnaserror
   ```

   Expected: success, warning count ≤ baseline

3. **All tests pass**:

   ```
   dotnet test ZB.MOM.WW.OtOpcUa.slnx
   ```

   Expected: total count = baseline, failures = 0

4. **Solution structure matches plan**:
   - `ls src/` shows exactly: `ZB.MOM.WW.OtOpcUa.{Client.CLI, Client.Shared, Client.UI, Historian.Aveva, Host}` (5 entries)
   - `ls tests/` shows the 6 test projects similarly renamed
   - No new projects yet (those land in Phase 1)

5. **.NET targets unchanged**:
   - Client projects (CLI/Shared/UI): `net10.0`
   - Host + Historian.Aveva: `net48` (split + retarget happens Phase 2)
   - All test projects: same targets as their SUT projects

6. **Decision compliance**: this phase implements decision #9 ("Rename to OtOpcUa as step 1"). Verify by:

   ```
   grep -rln "decision #9\|Decision #9" src/ tests/ CLAUDE.md
   ```

   Expected: at least one citation in CLAUDE.md or a phase-rename README explaining the mechanical scope.

7. **Service registration works**:
   - Install service → `sc query OtOpcUa` returns the service
   - Uninstall service → `sc query OtOpcUa` returns "service does not exist"
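Check 1 is the only purely textual gate, so it is the part of the compliance script most worth sketching. A minimal shape the `.sh` variant could take, assuming an allowlist file of retained references (the name `allowlist.txt` and the one-`./path`-per-line format are assumptions):

```bash
# Sketch: exit non-zero if any file still references LmxOpcUa and is not on
# the allowlist. Allowlist entries match grep -rl output ("./relative/path").
check_stale_refs() {
  local allowlist="$1"
  grep -rl "LmxOpcUa" --include="*.cs" --include="*.csproj" --include="*.json" . \
    | grep -vxFf "$allowlist" \
    && return 1 || return 0   # any surviving line printed = failure
}
```

The printed survivors double as the failure report, so the script output can go straight into the exit-gate record.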
## Behavioral Smoke Test (exit-gate requirement)

The v1 IntegrationTests suite is the authoritative behavioral spec for Phase 0. The renamed code must pass it identically.

```bash
dotnet test tests/ZB.MOM.WW.OtOpcUa.IntegrationTests --logger "console;verbosity=detailed"
```

Expected: pass count = baseline. Fail count = 0. Skipped count = baseline.
## Completion Checklist

The exit gate signs off only when **every** item below is checked:

- [ ] All 11 projects renamed (5 src + 6 tests)
- [ ] Solution file renamed
- [ ] All `<AssemblyName>` / `<RootNamespace>` / `<ProjectReference>` updated
- [ ] All namespaces in source files updated
- [ ] `appsettings.json` product-named sections updated; external product names untouched
- [ ] TopShelf service name updated; install/uninstall cycle verified on a Windows host
- [ ] `docs/*.md` and `CLAUDE.md` references updated; retained references explained
- [ ] Build succeeds with zero errors and warning count ≤ baseline
- [ ] Test suite passes with zero failures and count = baseline
- [ ] Smoke test against running OPC UA server matches baseline output
- [ ] `phase-0-compliance.ps1` script runs and exits 0
- [ ] Adversarial review of the phase diff (`/codex:adversarial-review --base v2`) — findings closed or deferred with rationale
- [ ] PR opened against `v2`, includes: link to this doc, link to exit-gate record, compliance script output, adversarial review output
- [ ] Reviewer signoff (one reviewer beyond the implementation lead)
- [ ] `exit-gate-phase-0.md` recorded with all of the above

After the PR merges, repo rename (`lmxopcua` → `otopcua` on Gitea) happens as a separate ops step — out of scope for Phase 0.
## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Bulk `sed` rename breaks string literals (e.g. `"LmxOpcUa"` used as a runtime identifier) | Medium | Medium | Inventory step (0.1) flags string literals separately; rename them deliberately, not via bulk sed |
| MxAccess / Galaxy / Wonderware references accidentally renamed | Low | High (breaks COM interop) | Inventory step (0.1) calls out external product names explicitly; bulk rename targets only `ZB.MOM.WW.LmxOpcUa` (with namespace prefix), not bare `LmxOpcUa` |
| Test count drops silently because a test project doesn't get re-discovered | Medium | High | Baseline test count captured at entry gate; exit gate compares exactly |
| `.slnx` references break and projects disappear from solution view | Low | Medium | `dotnet sln list` after Task 0.3 verifies all projects load |
| TopShelf service install fails on a hardened Windows box (UAC, signing) | Low | Low | Manual install/uninstall cycle is part of Task 0.6 acceptance |
| Long-lived branches diverge while phase 0 is in flight | Medium | Low | Phase 0 expected duration ≤ 5 days; coordinate that no other v2 work merges during the phase |
## Out of Scope (do not do in Phase 0)

- Adding any new project (Configuration, Admin, Core, Server, Driver.* — all Phase 1+)
- Splitting Host into Galaxy.Proxy/Host/Shared (Phase 2)
- Migrating Host/Historian.Aveva to .NET 10 (Phase 2 — when Galaxy is split, the .NET 4.8 x86 piece becomes Galaxy.Host and the rest can move)
- Replacing TopShelf with `Microsoft.Extensions.Hosting` (Phase 1, decision #30)
- Implementing the cluster / namespace / equipment data model (Phase 1)
- Changing any OPC UA wire behavior
- Renaming the Gitea repo
---

**New file**: `docs/v2/implementation/phase-1-configuration-and-admin-scaffold.md` (652 lines)
# Phase 1 — Configuration Project + Core.Abstractions + Admin UI Scaffold

> **Status**: DRAFT — implementation plan for Phase 1 of the v2 build (`plan.md` §6).
>
> **Branch**: `v2/phase-1-configuration`
> **Estimated duration**: 4–6 weeks (largest greenfield phase; most foundational)
> **Predecessor**: Phase 0 (`phase-0-rename-and-net10.md`)
> **Successor**: Phase 2 (Galaxy parity refactor)

## Phase Objective

Stand up the **central configuration substrate** for the v2 fleet:

1. **`Core.Abstractions` project** — driver capability interfaces (`IDriver`, `ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IRediscoverable`, `IHostConnectivityProbe`, `IDriverConfigEditor`, `DriverAttributeInfo`)
2. **`Configuration` project** — central MSSQL schema + EF Core migrations + stored procedures + LiteDB local cache + generation-diff application logic
3. **`Core` project** — `GenericDriverNodeManager` (renamed from `LmxNodeManager`), driver-hosting infrastructure, OPC UA server lifecycle, address-space registration via `IAddressSpaceBuilder`
4. **`Server` project** — `Microsoft.Extensions.Hosting`-based Windows Service host (replacing TopShelf), bootstrap from Configuration using node-bound credential, register drivers, start Core
5. **`Admin` project** — Blazor Server admin app scaffolded with ScadaLink CentralUI parity (Bootstrap 5, dark sidebar, LDAP cookie auth, three admin roles, draft → publish → rollback workflow, cluster/node/namespace/equipment/tag CRUD)

**No driver instances yet** (Galaxy stays in legacy in-process Host until Phase 2). The phase exit requires that an empty cluster can be created in Admin, an empty generation can be published, and a node can fetch the published generation — proving the configuration substrate works end-to-end.
## Scope — What Changes

| Concern | Change |
|---------|--------|
| New projects | 5 new src projects + 5 matching test projects |
| Existing v1 Host project | Refactored to consume `Core.Abstractions` interfaces against its existing Galaxy implementation — **but not split into Proxy/Host/Shared yet** (Phase 2) |
| `LmxNodeManager` | **Renamed to `GenericDriverNodeManager`** in Core, with `IDriver` swapped in for `IMxAccessClient`. The existing v1 Host instantiates `GalaxyNodeManager : GenericDriverNodeManager` (legacy in-process) — see `plan.md` §5a |
| Service hosting | TopShelf removed; `Microsoft.Extensions.Hosting` BackgroundService used (decision #30) |
| Central config DB | New SQL Server database `OtOpcUaConfig` provisioned from EF Core migrations |
| LDAP authentication for Admin | `Admin.Security` project mirrors `ScadaLink.Security`; cookie auth + JWT API endpoint |
| Local LiteDB cache on each node | New `config_cache.db` per node; bootstraps from central DB or cache |
## Scope — What Does NOT Change

| Item | Reason |
|------|--------|
| Galaxy out-of-process split | Phase 2 |
| Any new driver (Modbus, AB, S7, etc.) | Phase 3+ |
| OPC UA wire behavior | Galaxy address space still served exactly as v1; the Configuration substrate is read but not yet driving everything |
| Equipment-class template integration with future schemas repo | `EquipmentClassRef` is a nullable hook column; no validation yet (decisions #112, #115) |
| Per-driver custom config editors in Admin | Generic JSON editor only in v2.0 (decision #27); driver-specific editors land in their respective phases |
| Consumer cutover (ScadaBridge / Ignition / SystemPlatform IO) | OUT of v2 scope — separate integration-team track per `implementation/overview.md` |
## Entry Gate Checklist

- [ ] Phase 0 exit gate cleared (rename complete, all v1 tests pass under OtOpcUa names)
- [ ] `v2` branch is clean
- [ ] Phase 0 PR merged
- [ ] SQL Server 2019+ instance available for development (local dev box minimum; shared dev instance for integration tests)
- [ ] LDAP / GLAuth dev instance available for Admin auth integration testing
- [ ] ScadaLink CentralUI source accessible at `C:\Users\dohertj2\Desktop\scadalink-design\` for parity reference
- [ ] All Phase 1-relevant design docs reviewed: `plan.md` §4–5, `config-db-schema.md` (entire), `admin-ui.md` (entire), `driver-stability.md` §"Cross-Cutting Protections" (sets context for `Core.Abstractions` scope)
- [ ] Decisions #1–125 read at least skim-level; key ones for Phase 1: #14–22, #25, #28, #30, #32–33, #46–51, #79–125

**Evidence file**: `docs/v2/implementation/entry-gate-phase-1.md` recording date, signoff, environment availability.
## Task Breakdown

Phase 1 is large — broken into 5 work streams (A–E) that can partly overlap. A typical sequencing: A → B → (C and D in parallel) → E.
### Stream A — Core.Abstractions (1 week)

#### Task A.1 — Define driver capability interfaces

Create `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/` (.NET 10, no dependencies). Define:

```csharp
public interface IDriver { /* lifecycle, metadata, health */ }
public interface ITagDiscovery { /* discover tags/hierarchy from backend */ }
public interface IReadable { /* on-demand read */ }
public interface IWritable { /* on-demand write */ }
public interface ISubscribable { /* data change subscriptions */ }
public interface IAlarmSource { /* alarm events + acknowledgment */ }
public interface IHistoryProvider { /* historical reads */ }
public interface IRediscoverable { /* opt-in change-detection signal */ }
public interface IHostConnectivityProbe { /* per-host runtime status */ }
public interface IDriverConfigEditor { /* Admin UI plug point per driver */ }
public interface IAddressSpaceBuilder { /* core-owned tree builder */ }
```

Plus the data models referenced from the interfaces:

```csharp
public sealed record DriverAttributeInfo(
    string FullName,
    DriverDataType DriverDataType,
    bool IsArray,
    uint? ArrayDim,
    SecurityClassification SecurityClass,
    bool IsHistorized);

public enum DriverDataType { Boolean, Int16, Int32, Int64, UInt16, UInt32, UInt64, Float32, Float64, String, DateTime, Reference, Custom }

public enum SecurityClassification { FreeAccess, Operate, SecuredWrite, VerifiedWrite, Tune, Configure, ViewOnly }
```

**Acceptance**:
- All interfaces compile in a project with **zero dependencies** beyond BCL
- xUnit test project asserts (via reflection) that no interface returns or accepts a type from `Core` or `Configuration` (interface independence per decision #59)
- Each interface XML doc cites the design decision(s) it implements (e.g. `IRediscoverable` cites #54)
#### Task A.2 — Define DriverTypeRegistry

```csharp
public sealed class DriverTypeRegistry
{
    public DriverTypeMetadata Get(string driverType);
    public IEnumerable<DriverTypeMetadata> All();
}

public sealed record DriverTypeMetadata(
    string TypeName,                                  // "Galaxy" | "ModbusTcp" | ...
    NamespaceKindCompatibility AllowedNamespaceKinds, // per decision #111
    string DriverConfigJsonSchema,                    // per decision #91
    string DeviceConfigJsonSchema,                    // optional
    string TagConfigJsonSchema);

[Flags]
public enum NamespaceKindCompatibility
{
    Equipment = 1, SystemPlatform = 2, Simulated = 4
}
```

In v2.0, only the `Galaxy` driver type is registered (`AllowedNamespaceKinds = SystemPlatform`); Phase 3+ extends the registry.

**Acceptance**:
- Registry compiles, has unit tests for: register a type, look it up, reject duplicate registration, enumerate all
- Galaxy registration entry exists with `AllowedNamespaceKinds = SystemPlatform` per decision #111
### Stream B — Configuration project (1.5 weeks)

#### Task B.1 — EF Core schema + initial migration

Create `src/ZB.MOM.WW.OtOpcUa.Configuration/` (.NET 10, EF Core 10).

Implement DbContext with entities matching `config-db-schema.md` exactly:
- `ServerCluster`, `ClusterNode`, `ClusterNodeCredential`
- `Namespace` (generation-versioned per decision #123)
- `UnsArea`, `UnsLine`
- `ConfigGeneration`
- `DriverInstance`, `Device`, `Equipment`, `Tag`, `PollGroup`
- `NodeAcl` (generation-versioned per decision #130; data-path authorization grants per `acl-design.md`)
- `ClusterNodeGenerationState`, `ConfigAuditLog`
- `ExternalIdReservation` (NOT generation-versioned per decision #124)

Generate the initial migration:

```bash
dotnet ef migrations add InitialSchema --project src/ZB.MOM.WW.OtOpcUa.Configuration
```

**Acceptance**:
- Applying the migration to a clean SQL Server instance produces the schema in `config-db-schema.md`
- Schema-validation test (`SchemaComplianceTests`) introspects the live DB and asserts every table/column/index/constraint matches the doc
- Test runs in CI against a SQL Server container
#### Task B.2 — Stored procedures via `MigrationBuilder.Sql`

Add stored procedures from `config-db-schema.md` §"Stored Procedures":
- `sp_GetCurrentGenerationForCluster`
- `sp_GetGenerationContent`
- `sp_RegisterNodeGenerationApplied`
- `sp_PublishGeneration` (with the `MERGE` against `ExternalIdReservation` per decision #124)
- `sp_RollbackToGeneration`
- `sp_ValidateDraft` (calls into managed validator code per decision #91 — proc is structural-only, content schema validation is in the Admin app)
- `sp_ComputeGenerationDiff`
- `sp_ReleaseExternalIdReservation` (FleetAdmin only)

Use `CREATE OR ALTER` style in `MigrationBuilder.Sql()` blocks so procs version with the schema.

**Acceptance**:
- Each proc has at least one xUnit test exercising the happy path + at least one error path
- `sp_PublishGeneration` has a concurrency test: two simultaneous publishes for the same cluster → one wins, one fails with a recognizable error
- `sp_GetCurrentGenerationForCluster` has an authorization test: caller bound to NodeId X cannot read cluster Y's generation
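The concurrency contract on `sp_PublishGeneration` amounts to a compare-and-set on the cluster's current generation number: each publisher states the generation it last saw, and the loser gets a recognizable error instead of silently overwriting. A toy Python model of that rule (illustrative only — not the actual T-SQL; class and method names are invented for the sketch):

```python
import threading

class PublishGuard:
    """Toy model of the sp_PublishGeneration conflict rule: a publish
    succeeds only if the caller saw the current generation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = {}  # cluster_id -> current generation number

    def publish(self, cluster_id, expected_current):
        with self._lock:
            if self._current.get(cluster_id, 0) != expected_current:
                raise RuntimeError("generation conflict")  # the loser's error
            self._current[cluster_id] = expected_current + 1
            return self._current[cluster_id]
```

The xUnit concurrency test in Task B.2 asserts the same shape at the database level: two racing callers, one new generation, one `RuntimeError`-equivalent.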
#### Task B.3 — Authorization model (SQL principals + GRANT)

Add a separate migration `AuthorizationGrants` that:
- Creates two SQL roles: `OtOpcUaNode`, `OtOpcUaAdmin`
- Grants EXECUTE on the appropriate procs per `config-db-schema.md` §"Authorization Model"
- Grants no direct table access to either role

**Acceptance**:
- Test that runs as an `OtOpcUaNode`-roled principal can only call the node procs, not admin procs
- Test that runs as an `OtOpcUaAdmin`-roled principal can call publish/rollback procs
- Test that direct `SELECT * FROM dbo.ConfigGeneration` from an `OtOpcUaNode` principal is denied
#### Task B.4 — JSON-schema validators (managed code)

In `Configuration.Validation/`, implement validators consumed by `sp_ValidateDraft` (called from the Admin app pre-publish per decision #91):
- UNS segment regex (`^[a-z0-9-]{1,32}$` or `_default`)
- Path length (≤200 chars)
- UUID immutability across generations
- Same-cluster namespace binding (decision #122)
- ZTag/SAPID reservation pre-flight (decision #124)
- EquipmentId derivation rule (decision #125)
- Driver type ↔ namespace kind allowed (decision #111)
- JSON-schema validation per `DriverType` from `DriverTypeRegistry`

**Acceptance**:
- One unit test per rule, both passing and failing cases
- Cross-rule integration test: a draft that violates 3 rules surfaces all 3 (not just the first)
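The UNS segment rule is the simplest of these validators and illustrates the shape of the rest. A sketch (Python here for brevity; the production validator is C# per the task, and the function name is invented):

```python
import re

# 1-32 chars of lowercase alphanumerics/hyphens, per the Task B.4 regex.
_SEGMENT = re.compile(r"[a-z0-9-]{1,32}")

def valid_uns_segment(segment: str) -> bool:
    """UNS segment rule: matches ^[a-z0-9-]{1,32}$, or is the literal
    reserved segment `_default`."""
    return segment == "_default" or _SEGMENT.fullmatch(segment) is not None
```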
#### Task B.5 — LiteDB local cache

In `Configuration.LocalCache/`, implement the LiteDB schema from `config-db-schema.md` §"Local LiteDB Cache":

```csharp
public interface ILocalConfigCache
{
    Task<GenerationCacheEntry?> GetMostRecentAsync(string clusterId);
    Task PutAsync(GenerationCacheEntry entry);
    Task PruneOldGenerationsAsync(string clusterId, int keepLatest = 10);
}
```

**Acceptance**:
- Round-trip test: write a generation snapshot, read it back, assert deep equality
- Pruning test: write 15 generations, prune to 10, assert the 5 oldest are gone
- Corruption test: corrupt the LiteDB file, assert the loader fails fast with a clear error
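The pruning contract (`keepLatest` newest survive, oldest are dropped) is worth pinning down precisely, since "newest" is by generation number, not insertion order. A language-neutral sketch (Python for brevity; entry shape is an assumption):

```python
def prune_generations(entries, keep_latest=10):
    """Keep only the `keep_latest` newest cache entries by generation
    number; returns survivors in ascending generation order."""
    newest_first = sorted(entries, key=lambda e: e["generation"], reverse=True)
    return sorted(newest_first[:keep_latest], key=lambda e: e["generation"])
```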
#### Task B.6 — Generation-diff application logic

In `Configuration.Apply/`, implement the diff-and-apply logic that runs on each node when a new generation arrives:

```csharp
public interface IGenerationApplier
{
    Task<ApplyResult> ApplyAsync(GenerationSnapshot from, GenerationSnapshot to, CancellationToken ct);
}
```

Diff per entity type, dispatch to driver `Reinitialize` / cache flush as needed.
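The per-entity diff step reduces to a keyed set comparison. A self-contained sketch, assuming entities carry a stable UUID key and value-equality on their editable fields (the `EntityDiff` / `DiffResult` names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record DiffResult<T>(
    IReadOnlyList<T> Added, IReadOnlyList<T> Modified, IReadOnlyList<T> Removed);

public static class EntityDiff
{
    // Keyed on the stable Uuid; "modified" means same key, different value.
    public static DiffResult<T> Diff<T>(
        IReadOnlyDictionary<Guid, T> from,
        IReadOnlyDictionary<Guid, T> to) where T : IEquatable<T>
    {
        var added    = to.Keys.Except(from.Keys).Select(k => to[k]).ToList();
        var removed  = from.Keys.Except(to.Keys).Select(k => from[k]).ToList();
        var modified = to.Keys.Intersect(from.Keys)
                              .Where(k => !to[k].Equals(from[k]))
                              .Select(k => to[k]).ToList();
        return new DiffResult<T>(added, modified, removed);
    }
}
```

The real applier runs this once per entity type (drivers, equipment, tags) and sequences the results so removals cascade before additions apply.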
**Acceptance**:

- Diff test: from = empty, to = (1 driver + 5 equipment + 50 tags) → `Added` for each
- Diff test: from = (above), to = same with one tag's `Name` changed → `Modified` for one tag, no other changes
- Diff test: from = (above), to = same with one equipment removed → `Removed` for the equipment + cascading `Removed` for its tags
- Apply test against an in-memory mock driver: applies the diff in correct order, idempotent on retry
### Stream C — Core project (1 week, can parallel with Stream D)

#### Task C.1 — Rename `LmxNodeManager` → `GenericDriverNodeManager`

Per `plan.md` §5a:

- Lift the file from `Host/OpcUa/LmxNodeManager.cs` to `Core/OpcUa/GenericDriverNodeManager.cs`
- Swap `IMxAccessClient` for `IDriver` (composing `IReadable` / `IWritable` / `ISubscribable`)
- Swap `GalaxyAttributeInfo` for `DriverAttributeInfo`
- Promote `GalaxyRuntimeProbeManager` interactions to use `IHostConnectivityProbe`
- Move `MxDataTypeMapper` and `SecurityClassificationMapper` to a new `Driver.Galaxy.Mapping/` (still in legacy Host until Phase 2)

**Acceptance**:

- v1 IntegrationTests still pass against the renamed class (parity is the gate, decision #62 — class is "foundation, not rewrite")
- Reflection test asserts `GenericDriverNodeManager` has no static or instance reference to any Galaxy-specific type
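The independence check can be expressed as a small reflection helper; a self-contained sketch (the real test points it at `GenericDriverNodeManager` with `"Galaxy"` as the forbidden namespace fragment):

```csharp
using System;
using System.Linq;
using System.Reflection;

public static class TypeIsolationCheck
{
    // True if any field, parameter, or return type of `t` lives in a
    // namespace containing `forbidden` — including non-public members.
    public static bool References(Type t, string forbidden)
    {
        var flags = BindingFlags.Instance | BindingFlags.Static |
                    BindingFlags.Public | BindingFlags.NonPublic;
        var types = t.GetFields(flags).Select(f => f.FieldType)
            .Concat(t.GetMethods(flags).SelectMany(m =>
                m.GetParameters().Select(p => p.ParameterType).Append(m.ReturnType)));
        return types.Any(rt => rt.Namespace?.Contains(forbidden) == true);
    }
}
```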
#### Task C.2 — Derive `GalaxyNodeManager : GenericDriverNodeManager` (legacy in-process)

In the existing Host project, add a thin `GalaxyNodeManager` that:

- Inherits from `GenericDriverNodeManager`
- Wires up `MxDataTypeMapper`, `SecurityClassificationMapper`, the probe manager, etc.
- Replaces direct instantiation of the renamed class

**Acceptance**:

- v1 IntegrationTests pass identically with `GalaxyNodeManager` instantiated instead of the old direct class
- Existing dev Galaxy still serves the same address space byte-for-byte (compare with a baseline browse capture)
#### Task C.3 — `IAddressSpaceBuilder` API (decision #52)

Implement the streaming builder API drivers use to register nodes:

```csharp
public interface IAddressSpaceBuilder
{
    IFolderBuilder Folder(string browseName, string displayName);
    IVariableBuilder Variable(string browseName, DriverDataType type, ...);
    void AddProperty(string browseName, object value);
}
```
Refactor `GenericDriverNodeManager.BuildAddressSpace` to consume `IAddressSpaceBuilder` (the driver streams tags in rather than buffering them).

**Acceptance**:

- Build a Galaxy address space via the new builder API, assert byte-equivalent OPC UA browse output vs v1
- Memory profiling test: building a 5000-tag address space via the builder uses <50% the peak RAM of the buffered approach
#### Task C.4 — Driver hosting + isolation (decisions #65, #74)

Implement the in-process driver host that:

- Loads each `DriverInstance` row's driver assembly
- Catches and contains driver exceptions (driver isolation, decision #12)
- Surfaces `IDriver.Reinitialize()` to the configuration applier
- Tracks per-driver allocation footprint (`GetMemoryFootprint()` polled every 30s per `driver-stability.md`)
- Flushes optional caches on budget breach
- Marks drivers `Faulted` (Bad quality on their nodes) if `Reinitialize` fails

**Acceptance**:

- Integration test: spin up two mock drivers; one throws on Read; the other keeps working. Quality on the broken driver's nodes goes Bad; the other driver is unaffected.
- Memory-budget test: mock driver reports growing footprint above budget; cache-flush is triggered; footprint drops; no process action taken.
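The containment rule above can be sketched as a guard around each driver call; `DriverGuard`, the state enum, and the delegate-based read are illustrative stand-ins for the real `IReadable` abstractions:

```csharp
using System;
using System.Threading.Tasks;

public enum DriverState { Running, Faulted }

public sealed class DriverGuard
{
    public DriverState State { get; private set; } = DriverState.Running;

    // Wraps one driver read so an exception never escapes the host:
    // the driver is marked Faulted and its nodes report Bad quality,
    // while every other driver keeps running untouched.
    public async Task<(bool Ok, object? Value)> GuardedReadAsync(
        Func<Task<object?>> read)
    {
        try
        {
            return (true, await read());
        }
        catch (Exception)
        {
            State = DriverState.Faulted;
            return (false, null);
        }
    }
}
```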
### Stream D — Server project (4 days, can parallel with Stream C)

#### Task D.1 — `Microsoft.Extensions.Hosting` Windows Service host (decision #30)

Replace TopShelf with `Microsoft.Extensions.Hosting`:

- New `Program.cs` using `Host.CreateApplicationBuilder()`
- A `BackgroundService` that owns the OPC UA server lifecycle
- `builder.Services.AddWindowsService()` registers the app as a Windows Service
- Configuration bootstrap from `appsettings.json` (NodeId + ClusterId + DB conn) per decision #18

**Acceptance**:

- `dotnet run` runs interactively (console mode)
- Installed as a Windows Service (`sc create OtOpcUa ...`), starts and stops cleanly
- Service install + uninstall cycle leaves no leftover state
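A minimal `Program.cs` sketch for the hosting swap, assuming the `Microsoft.Extensions.Hosting.WindowsServices` package and a hypothetical `OtOpcUaServerService` as the `BackgroundService` owning the OPC UA lifecycle:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// Registers the Windows Service lifetime; this is a no-op when launched
// from a console, so `dotnet run` still works interactively.
builder.Services.AddWindowsService(options => options.ServiceName = "OtOpcUa");

builder.Services.AddHostedService<OtOpcUaServerService>();

// NodeId + ClusterId + DB connection come from appsettings.json (decision #18).
builder.Build().Run();
```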
#### Task D.2 — Bootstrap with credential-bound DB connection (decisions #46, #83)

On startup:

- Read `Cluster.NodeId` + `Cluster.ClusterId` + `ConfigDatabase.ConnectionString` from `appsettings.json`
- Connect to the central DB with the configured principal (gMSA / SQL login / cert-mapped)
- Call `sp_GetCurrentGenerationForCluster(@NodeId, @ClusterId)` — the proc verifies the connected principal is bound to the NodeId
- If the proc rejects → fail startup loudly with the principal-mismatch message

**Acceptance**:

- Test: principal bound to Node A boots successfully when configured with NodeId = A
- Test: principal bound to Node A configured with NodeId = B → startup fails with `Unauthorized` and the service does not stay running
- Test: principal bound to Node A in cluster C1 configured with ClusterId = C2 → `Forbidden`
#### Task D.3 — LiteDB cache fallback on DB outage

If the central DB is unreachable at startup, load the most recent cached generation from LiteDB and start with it. Log loudly. Keep retrying the central DB in the background; on reconnect, resume the normal poll cycle.

**Acceptance**:

- Test: with the central DB unreachable, the node starts from cache, logs a `ConfigDbUnreachableUsingCache` event, and the OPC UA endpoint serves the cached config
- Test: cache empty AND central DB unreachable → startup fails with `NoConfigAvailable` (decision #21)
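The fallback decision above reduces to a small pure function; delegates stand in for the real `sp_GetCurrentGenerationForCluster` call and the Task B.5 `ILocalConfigCache`, and the error/event names follow the acceptance tests:

```csharp
using System;
using System.Threading.Tasks;

public static class StartupConfig
{
    public static async Task<string> LoadAsync(
        Func<Task<string>> fetchFromDb,      // throws when the central DB is down
        Func<Task<string?>> fetchFromCache,  // null when the local cache is empty
        Action<string> log)
    {
        try
        {
            return await fetchFromDb();
        }
        catch (Exception)
        {
            var cached = await fetchFromCache();
            if (cached is null)
                throw new InvalidOperationException("NoConfigAvailable"); // decision #21

            // Log loudly; the caller keeps retrying the central DB in the background.
            log("ConfigDbUnreachableUsingCache");
            return cached;
        }
    }
}
```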
### Stream E — Admin project (2.5 weeks)

#### Task E.1 — Project scaffold mirroring ScadaLink CentralUI (decision #102)

Copy the project layout from `scadalink-design/src/ScadaLink.CentralUI/` (decision #104):

- `src/ZB.MOM.WW.OtOpcUa.Admin/`: Razor Components project, .NET 10, `AddInteractiveServerComponents`
- `Auth/AuthEndpoints.cs`, `Auth/CookieAuthenticationStateProvider.cs`
- `Components/Layout/MainLayout.razor`, `Components/Layout/NavMenu.razor`
- `Components/Pages/Login.razor`, `Components/Pages/Dashboard.razor`
- `Components/Shared/{DataTable, ConfirmDialog, LoadingSpinner, NotAuthorizedView, RedirectToLogin, TimestampDisplay, ToastNotification}.razor`
- `EndpointExtensions.cs`, `ServiceCollectionExtensions.cs`

Plus `src/ZB.MOM.WW.OtOpcUa.Admin.Security/` (decision #104): `LdapAuthService`, `RoleMapper`, `JwtTokenService`, `AuthorizationPolicies` mirroring `ScadaLink.Security`.

**Acceptance**:

- App builds and runs locally
- Login page renders with OtOpcUa branding (only the `<h4>` text differs from ScadaLink)
- Visual diff between OtOpcUa and ScadaLink login pages: only the brand text differs (compliance check #3)
#### Task E.2 — Bootstrap LDAP + cookie auth + admin role mapping

Wire up `LdapAuthService` against the dev GLAuth instance per `Security.md`. Map LDAP groups to admin roles:

- `OtOpcUaAdmins` → `FleetAdmin`
- `OtOpcUaConfigEditors` → `ConfigEditor`
- `OtOpcUaViewers` → `ReadOnly`

Plus cluster-scoped grants per decision #105 (LDAP group `OtOpcUaConfigEditors-LINE3` → `ConfigEditor` + `ClusterId = LINE3-OPCUA` claim).

**Acceptance**:

- Login as a `FleetAdmin`-mapped user → redirected to `/`, sidebar shows admin sections
- Login as a `ReadOnly`-mapped user → redirected to `/`, sidebar shows view-only sections
- Login as a cluster-scoped `ConfigEditor` → only their permitted clusters appear in `/clusters`
- Login with bad credentials → redirected to `/login?error=...` with the LDAP error surfaced
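The group-to-role convention can be sketched as follows; the `RoleMapperSketch` name and claim types are illustrative, and resolving the group suffix (`LINE3`) to the full cluster id (`LINE3-OPCUA`) is left to the real `RoleMapper`, which mirrors `ScadaLink.Security`:

```csharp
using System.Collections.Generic;
using System.Security.Claims;

public static class RoleMapperSketch
{
    public static IEnumerable<Claim> MapGroups(IEnumerable<string> ldapGroups)
    {
        foreach (var g in ldapGroups)
        {
            switch (g)
            {
                case "OtOpcUaAdmins":        yield return new Claim(ClaimTypes.Role, "FleetAdmin");   break;
                case "OtOpcUaConfigEditors": yield return new Claim(ClaimTypes.Role, "ConfigEditor"); break;
                case "OtOpcUaViewers":       yield return new Claim(ClaimTypes.Role, "ReadOnly");     break;
                default:
                    // Cluster-scoped grant per decision #105, e.g.
                    // "OtOpcUaConfigEditors-LINE3" → ConfigEditor + ClusterId claim.
                    const string prefix = "OtOpcUaConfigEditors-";
                    if (g.StartsWith(prefix))
                    {
                        yield return new Claim(ClaimTypes.Role, "ConfigEditor");
                        yield return new Claim("ClusterId", g.Substring(prefix.Length));
                    }
                    break;
            }
        }
    }
}
```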
#### Task E.3 — Cluster CRUD pages

Implement per `admin-ui.md`:

- `/clusters` — cluster list (FleetAdmin sees all, ConfigEditor sees scoped)
- `/clusters/{ClusterId}` — Cluster Detail with all 9 tabs (Overview / Namespaces / UNS Structure / Drivers / Devices / Equipment / Tags / Generations / Audit), but the Drivers/Devices/Equipment/Tags tabs initially show empty tables (no driver implementations yet — Phase 2+)
- "New cluster" workflow per `admin-ui.md` §"Add a new cluster" — creates the cluster row, opens an initial draft with default namespaces (decision #123)
- ApplicationUri auto-suggest on node create per decision #86

**Acceptance**:

- Create a cluster → cluster row exists, initial draft exists with an Equipment-kind namespace
- Edit a cluster name → change reflected in list + detail
- Disable a cluster → no longer offered as a target for new nodes; existing nodes keep showing in the list with a "Disabled" badge
#### Task E.4 — Draft → diff → publish workflow (decision #89)

Implement per `admin-ui.md` §"Draft Editor", §"Diff Viewer", §"Generation History":

- `/clusters/{Id}/draft` — full draft editor with auto-save (debounced 500ms per decision #97)
- `/clusters/{Id}/draft/diff` — three-column diff viewer
- `/clusters/{Id}/generations` — list of historical generations with a rollback action
- Live `sp_ValidateDraft` invocation in the validation panel; publish disabled while errors exist
- Publish dialog requires Notes; runs `sp_PublishGeneration` in a transaction

**Acceptance**:

- Create draft → validation panel runs and shows a clean state for an empty draft
- Add an invalid Equipment row (bad UNS segment) → validation panel surfaces the error inline + publish stays disabled
- Fix the row → validation panel goes green + publish enables
- Publish → generation moves Draft → Published; the previous Published moves to Superseded; an audit-log row is created
- Roll back to a prior generation → new generation cloned from the target; the previous generation moves to Superseded; nodes pick up the new generation on next poll
- The "Push now" button per decision #96 is rendered but disabled with the "Available in v2.1" label
#### Task E.5 — UNS Structure + Equipment + Namespace tabs

Implement the three hybrid tabs:

- Namespaces tab — list with click-to-edit-in-draft
- UNS Structure tab — tree view with drag-drop reorganize, rename with live impact preview
- Equipment tab — list with default sort by ZTag, search across all 5 identifiers

CSV import for Equipment per the revised schema in `admin-ui.md` (no EquipmentId column; matches by EquipmentUuid for updates per decision #125).

**Acceptance**:

- Add a UnsArea via draft → publishes → appears in the tree
- Drag a UnsLine to a different UnsArea → impact preview shows the count of affected equipment + signals → publish moves it; UUIDs preserved
- Equipment CSV import: 10 new rows → all get a system-generated EquipmentId + EquipmentUuid; ZTag uniqueness checked against `ExternalIdReservation` (decision #124)
- Equipment CSV import: 1 row with an existing EquipmentUuid → updates the matched row's editable fields
#### Task E.6 — Generic JSON config editor for `DriverConfig`

Per decision #94 — until per-driver editors land in their respective phases, use a generic JSON editor with schema-driven validation against `DriverTypeRegistry`'s registered JSON schema for the driver type.

**Acceptance**:

- Add a Galaxy `DriverInstance` in a draft → JSON editor renders the Galaxy DriverConfig schema
- Editing produces live validation errors per the schema
- Saving with errors → publish stays disabled
#### Task E.7 — Real-time updates via SignalR (admin-ui.md §"Real-Time Updates")

Two SignalR hubs:

- `FleetStatusHub` — pushes `ClusterNodeGenerationState` changes
- `AlertHub` — pushes new sticky alerts (crash-loop circuit trips, failed applies)

A backend `IHostedService` polls every 5s and diffs.

**Acceptance**:

- Open Cluster Detail in two browser tabs → publish in tab A → tab B's "current generation" updates within 5s without a page reload
- Simulate `LastAppliedStatus = Failed` for a node → AlertHub pushes a sticky alert that doesn't auto-clear
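The poll-and-diff step of the hosted service reduces to a pure function over the last and current state maps; the `FleetDiff` name and string-keyed state shape are illustrative:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FleetDiff
{
    // Returns the (nodeId, state) pairs that changed since the last poll —
    // exactly the set the hosted service pushes through FleetStatusHub.
    public static IReadOnlyList<(string NodeId, string State)> DiffStates(
        IReadOnlyDictionary<string, string> last,
        IReadOnlyDictionary<string, string> current) =>
        current.Where(kv => !last.TryGetValue(kv.Key, out var prev) || prev != kv.Value)
               .Select(kv => (kv.Key, kv.Value))
               .ToList();
}
```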
#### Task E.8 — Release reservation + Merge equipment workflows

Per `admin-ui.md` §"Release an external-ID reservation" and §"Merge or rebind equipment":

- Release flow: FleetAdmin only, requires a reason, audit-logged via `sp_ReleaseExternalIdReservation`
- Merge flow: opens a draft that disables the source equipment, re-points tags, releases + re-reserves IDs

**Acceptance**:

- Release a reservation → `ReleasedAt` set in the DB + audit-log entry created with the reason
- After release: the same `(Kind, Value)` can be reserved by a different EquipmentUuid in a future publish
- Merge equipment A → B: draft preview shows tag re-pointing + ID re-reservation; publish executes atomically; A is disabled with an `EquipmentMergedAway` audit entry
#### Task E.10 — AppServer-via-OI-Gateway end-to-end smoke test (decision #142)

Per `aveva-system-platform-io-research.md`, the tier-3 (Year 3) cutover redirects AVEVA System Platform's IO layer from direct-equipment connections to consuming OtOpcUa via OI Gateway. Catching AppServer-specific quirks in Phase 1 — well before the cutover schedule — protects the Year 3 timeline and confirms that OtOpcUa's transport security model is actually compatible with the most opinionated downstream consumer.

Stand up a non-production AppServer + OtOpcUa pairing and verify:

1. AppServer (System Platform 2023 R2 Patch 01 or later) installed with the matching Communications Drivers Pack
2. OI Gateway service runs under a dedicated service account (NOT SYSTEM — known issue per the AVEVA 2020 R2 readme)
3. OtOpcUa endpoint exposed as `opc.tcp://{host}:{port}` with **no `/discovery` suffix** (forum-documented failure mode that produces `Bad_SecurityChecksFailed` even after cert trust)
4. Configure an OPCUA Connection in OI Gateway pointing at OtOpcUa with `Basic256Sha256` + `SignAndEncrypt` + LDAP-username token
5. OI Gateway client cert appears in OtOpcUa's pending-certs folder; an admin moves it to Trusted; the OtOpcUa server cert is trusted on the OI Gateway side
6. Configure an OPCUAGroup with at least one tag from a published OtOpcUa generation
7. Configure a SuiteLink DI Object in AppServer pointing at the OI Gateway instance
8. Create an AppServer attribute with `IO.SourceAttribute = <SuiteLinkDIObjectName>.<TopicName>.<ItemReference>`
9. Verify the attribute reads end-to-end with quality `0x00C0` (good)
10. Re-verify after a publish-generation cycle in OtOpcUa (the AppServer attribute must continue reading without manual re-trust)
11. Capture the full configuration as `docs/deployment/aveva-system-platform-integration.md` for the future tier-3 cutover team

**Acceptance**:

- Connection steps 1–10 succeed; AppServer reads at least one tag end-to-end with good quality
- Reconnect after an OtOpcUa publish: no manual intervention required
- Documentation captured for the cutover team
- Any failure mode that surfaces during the test is either: (a) fixed in OtOpcUa Phase 1, (b) added to Phase 1 known-limitations + escalated to the corrections doc, or (c) confirmed as an AppServer / OI Gateway quirk operators must accept

This task is **non-blocking** for Phase 1 exit if the test surfaces only documentation-level fixes. It IS blocking if it surfaces an OtOpcUa-side incompatibility that requires architectural change — that would be a tier-3 cutover risk and should escalate immediately.
#### Task E.9 — ACLs tab + bulk-grant + permission simulator

Per `admin-ui.md` Cluster Detail tab #8 ("ACLs") and `acl-design.md` §"Admin UI":

- ACLs tab on Cluster Detail with two views ("By LDAP group" + "By scope")
- Edit-grant flow: pick scope, group, permission bundle or per-flag, save to draft
- Bulk-grant flow: multi-select scope, group, permissions, preview the rows that will be created, publish via draft
- Permission simulator: enter username + LDAP groups → live trie of effective permissions across the cluster's UNS tree
- Cluster-create workflow seeds the v1-compatibility default ACL set (per decision #131)
- Banner on Cluster Detail when the cluster's ACL set diverges from the seed

**Acceptance**:

- Add an ACL grant via draft → publishes → row in the `NodeAcl` table; appears in both Admin views
- Bulk-grant 10 LDAP groups × 1 permission set across 5 UnsAreas → preview shows 50 rows; publish creates them atomically
- Simulator: a user in the `OtOpcUaReadOnly` group sees the `ReadOnly` bundle effective at every node in the cluster
- Simulator: a user in `OtOpcUaWriteTune` sees the `Engineer` bundle effective; `WriteConfigure` is denied
- Cluster-create workflow seeds 5 default ACL grants matching the v1 LDAP roles (table in `acl-design.md` §"Default Permissions")
- Divergence banner appears when an operator removes any of the seeded grants
## Compliance Checks (run at exit gate)

A `phase-1-compliance.ps1` script that exits non-zero on any failure:

### Schema compliance

```powershell
# Run all migrations against a clean SQL Server instance
$stamp = [DateTimeOffset]::UtcNow.ToUnixTimeSeconds()
dotnet ef database update --project src/ZB.MOM.WW.OtOpcUa.Configuration --connection "Server=...;Database=OtOpcUaConfig_Test_$stamp;..."

# Run schema-introspection tests
dotnet test tests/ZB.MOM.WW.OtOpcUa.Configuration.Tests --filter "Category=SchemaCompliance"
```

Expected: every table, column, index, FK, CHECK, and stored procedure in `config-db-schema.md` is present and matches.
### Decision compliance

```powershell
# For each decision number Phase 1 implements (#9, #14-22, #25, #28, #30, #32-33, #46-51, #79-125),
# verify at least one citation exists in source, tests, or migrations.
# (Concatenating the range with `+` keeps the array flat; a range nested
# inside @() would stay a single array element and break the foreach.)
$decisions = @(9, 14, 15, 16, 17, 18, 19, 20, 21, 22, 25, 28, 30, 32, 33, 46, 47, 48, 49, 50, 51) + 79..125
foreach ($d in $decisions) {
    $hits = git grep "decision #$d" -- 'src/' 'tests/' 'docs/v2/implementation/'
    if (-not $hits) { Write-Error "Decision #$d has no citation in code or tests"; exit 1 }
}
```
### Visual compliance (Admin UI)

Manual screenshot review:

1. Login page side-by-side with ScadaLink's `Login.razor` rendered
2. Sidebar + main layout side-by-side with ScadaLink's `MainLayout.razor` + `NavMenu.razor`
3. Dashboard side-by-side with ScadaLink's `Dashboard.razor`
4. Reconnect overlay triggered (kill the SignalR connection) — same modal as ScadaLink

The reviewer answers one question — "could the same operator move between apps without noticing?" — Y/N. N is blocking.
### Behavioral compliance (end-to-end smoke test)

```bash
dotnet test tests/ZB.MOM.WW.OtOpcUa.IntegrationTests --filter "Category=Phase1Smoke"
```

The smoke test:

1. Spins up SQL Server in a container
2. Runs all migrations
3. Creates an `OtOpcUaAdmin` SQL principal + an `OtOpcUaNode` principal bound to a test NodeId
4. Starts the Admin app
5. Creates a cluster + 1 node + an Equipment-kind namespace via the Admin API
6. Opens a draft, adds 1 UnsArea + 1 UnsLine + 1 Equipment + 0 tags (empty)
7. Publishes the draft
8. Boots a Server instance configured with the test NodeId
9. Asserts the Server fetched the published generation via `sp_GetCurrentGenerationForCluster`
10. Asserts the Server's `ClusterNodeGenerationState` row reports `Applied`
11. Adds a tag in a new draft, publishes
12. Asserts the Server picks up the new generation within 30s (next poll)
13. Rolls back to generation 1
14. Asserts the Server picks up the rollback within 30s

Expected: all 14 steps pass. The smoke test runs in CI on every PR to `v2/phase-1-*` branches.
### Stability compliance

For Phase 1 the only stability concern is the in-process driver-isolation primitives (used later by Phase 3+ drivers, but built in Phase 1):

- `IDriver.Reinitialize()` semantics tested
- Driver-instance allocation tracking + cache flush tested with a mock driver
- Crash-loop circuit breaker tested with a mock driver that throws on every Reinitialize

Galaxy is still legacy in-process in Phase 1 — Tier C protections for Galaxy land in Phase 2.

### Documentation compliance

- Every Phase 1 task in this doc must either be Done or have a deferral note in `exit-gate-phase-1.md`
- Every decision the phase implements must be reflected in `plan.md` (no silent decisions)
- The schema doc + admin-ui doc must be updated if implementation deviated
## Completion Checklist

The exit gate signs off only when **every** item below is checked. Each item links to the verifying artifact (test name, screenshot, log line, etc.).

### Stream A — Core.Abstractions

- [ ] All 11 capability interfaces defined and compiling
- [ ] `DriverAttributeInfo` + supporting enums defined
- [ ] `DriverTypeRegistry` implemented with Galaxy registration
- [ ] Interface-independence reflection test passes

### Stream B — Configuration

- [ ] EF Core migration `InitialSchema` applies cleanly to a clean SQL Server
- [ ] Schema introspection test asserts the live schema matches `config-db-schema.md`
- [ ] All stored procedures present and tested (happy path + error paths)
- [ ] `sp_PublishGeneration` concurrency test passes (one wins, one fails)
- [ ] Authorization tests pass (Node principal limited to its cluster, Admin can read/write fleet-wide)
- [ ] All 12 validation rules in `Configuration.Validation` have unit tests
- [ ] LiteDB cache round-trip + pruning + corruption tests pass
- [ ] Generation-diff applier handles add/remove/modify across all entity types

### Stream C — Core

- [ ] `LmxNodeManager` renamed to `GenericDriverNodeManager`; v1 IntegrationTests still pass
- [ ] `GalaxyNodeManager : GenericDriverNodeManager` exists in legacy Host
- [ ] `IAddressSpaceBuilder` API implemented; byte-equivalent OPC UA browse output to v1
- [ ] Driver hosting + isolation tested with mock drivers (one fails, others continue)
- [ ] Memory-budget cache-flush tested with mock driver

### Stream D — Server

- [ ] `Microsoft.Extensions.Hosting` host runs in console mode and as a Windows Service
- [ ] TopShelf removed from the codebase
- [ ] Credential-bound bootstrap tested (correct principal succeeds; wrong principal fails)
- [ ] LiteDB fallback on DB outage tested

### Stream E — Admin

- [ ] Admin app boots, login screen renders with ScadaLink-equivalent visuals
- [ ] LDAP cookie auth works against dev GLAuth
- [ ] Admin roles mapped (FleetAdmin / ConfigEditor / ReadOnly)
- [ ] Cluster-scoped grants work (decision #105)
- [ ] Cluster CRUD works end-to-end
- [ ] Draft → diff → publish workflow works end-to-end
- [ ] Rollback works end-to-end
- [ ] UNS Structure tab supports add / rename / drag-move with impact preview
- [ ] Equipment tab supports CSV import + search across 5 identifiers
- [ ] Generic JSON config editor renders + validates DriverConfig per registered schema
- [ ] SignalR real-time updates work (multi-tab test)
- [ ] Release-reservation flow works + audit-logged
- [ ] Merge-equipment flow works + audit-logged

### Cross-cutting

- [ ] `phase-1-compliance.ps1` runs and exits 0
- [ ] Smoke test (14 steps) passes in CI
- [ ] Visual compliance review signed off (operator-equivalence test)
- [ ] All decisions cited in code/tests (`git grep "decision #N"` returns hits for each)
- [ ] Adversarial review of the phase diff (`/codex:adversarial-review --base v2`) — findings closed or deferred with rationale
- [ ] PR opened against `v2`, includes: link to this doc, link to the exit-gate record, compliance script output, smoke-test logs, adversarial-review output, screenshots
- [ ] Reviewer signoff (one reviewer beyond the implementation lead)
- [ ] `exit-gate-phase-1.md` recorded
## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| EF Core 10 idiosyncrasies vs the documented schema | Medium | Medium | Schema-introspection test catches drift; validate early in Stream B |
| `sp_ValidateDraft` cross-table checks complex enough to be slow | Medium | Medium | Per-decision-cited test exists; benchmark with a large draft (1000+ tags) before exit |
| Visual parity with ScadaLink slips because the two component libraries diverge over time | Low | Medium | Copy ScadaLink's CSS verbatim where possible; the shared component set is structurally identical |
| LDAP integration breaks against production GLAuth (different schema than dev) | Medium | High | Use the v1 LDAP layer as the integration reference; mirror its config exactly |
| Generation-diff applier has subtle bugs on edge cases (renamed entity with the same logical ID) | High | High | Property-based test that generates random diffs and asserts apply-then-rebuild produces the same end state |
| ScadaLink.Security pattern works well for site-scoped roles but our cluster-scoped grants are subtly different | Medium | Medium | Side-by-side review of `RoleMapper` after Stream E starts; refactor if the claim shape diverges |
| Phase 1 takes longer than 6 weeks | High | Medium | Mid-gate review at 3 weeks — if Stream B isn't done, defer Tasks E.5–E.8 to a Phase 1.5 follow-up |
| `MERGE` against `ExternalIdReservation` has a deadlock pathology under concurrent publishes | Medium | High | Concurrency test in Task B.2 specifically targets this; if it deadlocks, switch to `INSERT ... WHERE NOT EXISTS` with explicit row locks |
## Out of Scope (do not do in Phase 1)

- Galaxy out-of-process split (Phase 2)
- Any Modbus / AB / S7 / TwinCAT / FOCAS driver code (Phases 3–5)
- Per-driver custom config editors in Admin (each driver's phase)
- Equipment-class template integration with the schemas repo
- Consumer cutover (out of v2 scope; separate integration-team track per `implementation/overview.md`)
- Wiring the OPC UA NodeManager to enforce ACLs at runtime (Phase 2+, in each driver phase). Phase 1 ships the `NodeAcl` table + Admin UI ACL editing + evaluator unit tests; per-driver enforcement lands in each driver's phase per `acl-design.md` §"Implementation Plan"
- Push-from-DB notification (decision #96 — v2.1)
- Generation pruning operator UI (decision #93 — v2.1)
- Cluster-scoped admin grant editor in UI (admin-ui.md "Deferred / Out of Scope" — v2.1)
- Mobile / tablet layout
505 docs/v2/implementation/phase-2-galaxy-out-of-process.md Normal file

@@ -0,0 +1,505 @@
# Phase 2 — Galaxy Out-of-Process Refactor (Tier C)

> **Status**: DRAFT — implementation plan for Phase 2 of the v2 build (`plan.md` §6, `driver-stability.md` §"Galaxy — Deep Dive").
>
> **Branch**: `v2/phase-2-galaxy`
> **Estimated duration**: 6–8 weeks (largest refactor phase; Tier C protections + IPC are the bulk)
> **Predecessor**: Phase 1 (`phase-1-configuration-and-admin-scaffold.md`)
> **Successor**: Phase 3 (Modbus TCP driver)

## Phase Objective

Move Galaxy / MXAccess from the legacy in-process `OtOpcUa.Host` project into the **Tier C out-of-process** topology specified in `driver-stability.md`:

1. **`Driver.Galaxy.Shared`** — .NET Standard 2.0 IPC message contracts (MessagePack DTOs)
2. **`Driver.Galaxy.Host`** — a separate .NET 4.8 x86 Windows Service that owns `MxAccessBridge`, `GalaxyRepository`, alarm tracking, `GalaxyRuntimeProbeManager`, the Wonderware Historian SDK, the STA thread + Win32 message pump, and all Tier C cross-cutting protections (memory watchdog, scheduled recycle, post-mortem MMF, IPC ACL + caller-SID verification, per-process shared secret)
3. **`Driver.Galaxy.Proxy`** — a .NET 10 in-process driver implementing every capability interface (`IDriver`, `ITagDiscovery`, `IRediscoverable`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`), forwarding each call over named-pipe IPC and owning the supervisor (heartbeat, host liveness, respawn with backoff, crash-loop circuit breaker, fan-out of Bad quality on host death)
4. **Retire the legacy `OtOpcUa.Host` project** — its responsibilities now live in `OtOpcUa.Server` (built in Phase 1) for OPC UA hosting and in `OtOpcUa.Driver.Galaxy.Host` for the Galaxy-specific runtime

**Parity, not regression.** The phase exit gate is: the v1 `IntegrationTests` suite passes byte-for-byte against the v2 Galaxy.Proxy + Galaxy.Host topology, and a scripted Client.CLI walkthrough produces output equivalent to v1 (decision #56). Anything different — quality codes, browse paths, alarm shapes, history responses — is a parity defect.

This phase also closes the four 2026-04-13 stability findings (commits `c76ab8f` and `7310925`) by adding regression tests to the parity suite per `driver-specs.md` Galaxy "Operational Stability Notes".
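An illustrative request/reply pair for the `Driver.Galaxy.Shared` contracts — the DTO names, members, and key layout are assumptions, not the final wire contract; it only shows the MessagePack shape the .NET Standard 2.0 project would carry across the named pipe:

```csharp
using MessagePack;

[MessagePackObject]
public sealed class ReadRequest
{
    [Key(0)] public long CorrelationId { get; set; }   // echoed on the reply
    [Key(1)] public string TagReference { get; set; } = "";
}

[MessagePackObject]
public sealed class ReadReply
{
    [Key(0)] public long CorrelationId { get; set; }
    [Key(1)] public string Value { get; set; } = "";   // serialized value
    [Key(2)] public ushort Quality { get; set; }       // OPC quality word
    [Key(3)] public long TimestampUtcTicks { get; set; }
}
```

Because both endpoints (the .NET 4.8 x86 host and the .NET 10 proxy) reference the same netstandard2.0 contract assembly, the wire format stays version-locked across the framework boundary.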
## Scope — What Changes

| Concern | Change |
|---------|--------|
| Project layout | 3 new projects: `Driver.Galaxy.Shared` (.NET Standard 2.0), `Driver.Galaxy.Host` (.NET 4.8 x86), `Driver.Galaxy.Proxy` (.NET 10) |
| `OtOpcUa.Host` (legacy in-process) | **Retired**. Galaxy-specific code moves to `Driver.Galaxy.Host`; the small remainder (TopShelf wrapper, `Program.cs`) was already replaced by `OtOpcUa.Server` in Phase 1 |
| MXAccess COM access | Now lives only in `Driver.Galaxy.Host` (.NET 4.8 x86, STA thread + Win32 message pump). The main server (`OtOpcUa.Server`, .NET 10 x64) never references `ArchestrA.MxAccess` |
| Wonderware Historian SDK | Same — only in `Driver.Galaxy.Host` |
| Galaxy DB queries | `GalaxyRepository` moves to `Driver.Galaxy.Host`; the SQL connection string lives in the Galaxy `DriverConfig` JSON |
| OPC UA address space build for Galaxy | Driven by `Driver.Galaxy.Proxy` calls into `IAddressSpaceBuilder` (the Phase 1 API) — the Proxy fetches the hierarchy via IPC and streams nodes to the builder |
| Subscriptions, reads, writes, alarms, history | All forwarded over named-pipe IPC via the MessagePack contracts in `Driver.Galaxy.Shared` |
| Tier C cross-cutting protections | All wired up per `driver-stability.md` §"Cross-Cutting Protections" → "Isolated host only (Tier C)" + the Galaxy deep dive |
| Windows service installer | Two services per Galaxy-using cluster node: `OtOpcUa` (the main server) + `OtOpcUaGalaxyHost` (the Galaxy host). Installer scripts updated |
| `appsettings.json` (legacy Galaxy config sections) | Migrated into the central config DB under `DriverInstance.DriverConfig` JSON for the Galaxy driver instance. Local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + the DB conn (per decision #18) |
## Scope — What Does NOT Change
|
||||
|
||||
| Item | Reason |
|
||||
|------|--------|
|
||||
| OPC UA wire behavior visible to clients | Parity is the gate. Clients see the same browse paths, quality codes, alarm shapes, and history responses as v1 |
|
||||
| Galaxy hierarchy mapping (gobject parents → OPC UA folders) | Galaxy uses the SystemPlatform-kind namespace; UNS rules don't apply (decision #108). `Tag.FolderPath` mirrors v1 LmxOpcUa exactly |
|
||||
| Galaxy `EquipmentClassRef` integration | Galaxy is SystemPlatform-namespace; no `Equipment` rows are created for Galaxy tags. Equipment-namespace work is for the native-protocol drivers in Phase 3+ |
|
||||
| Any non-Galaxy driver | Phase 3+ |
|
||||
| `OtOpcUa.Server` lifecycle / configuration substrate / Admin UI | Built in Phase 1; Phase 2 only adds the Galaxy.Proxy as a `DriverInstance` |
|
||||
| Wonderware Historian dependency | Stays optional, loaded only when `Historian.Enabled = true` in the Galaxy `DriverConfig` |
|
||||
|
||||
## Entry Gate Checklist

- [ ] Phase 1 exit gate cleared (Configuration + Admin + Server + Core.Abstractions all green; Galaxy still in-process via legacy Host)
- [ ] `v2` branch is clean
- [ ] Phase 1 PR merged
- [ ] Dev Galaxy reachable for parity testing — same Galaxy that v1 tests against
- [ ] v1 IntegrationTests baseline pass count + duration recorded (this is the parity bar)
- [ ] Client.CLI walkthrough script captured against v1 and saved as reference output
- [ ] All Phase 2-relevant docs reviewed: `plan.md` §3–4, §5a (LmxNodeManager reusability), `driver-stability.md` §"Out-of-Process Driver Pattern (Generalized)" + §"Galaxy — Deep Dive (Tier C)", `driver-specs.md` §1 (Galaxy)
- [ ] Decisions cited or implemented by Phase 2 read at least at skim level: #11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102 (the Tier C IPC ACL + recycle decisions are all relevant)
- [ ] Confirmation that the four 2026-04-13 stability findings (`c76ab8f`, `7310925`) have existing v1 tests that will be the regression net for the v2 split

**Evidence file**: `docs/v2/implementation/entry-gate-phase-2.md`.

## Task Breakdown

Five work streams (A–E). Stream A is the foundation; B and C run partly in parallel after A; D depends on B + C; E is the parity gate at the end.

### Stream A — Driver.Galaxy.Shared (1 week)

#### Task A.1 — Create the project

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` (.NET Standard 2.0 — must be consumable by both .NET 10 Proxy and .NET 4.8 Host per decision #25). Single dependency: `MessagePack` NuGet (decision #32).

#### Task A.2 — IPC message contracts

Define the MessagePack DTOs covering every Galaxy operation the Proxy will forward:
- **Lifecycle**: `OpenSessionRequest`, `OpenSessionResponse`, `CloseSessionRequest`, `Heartbeat` (separate channel per decision §"Heartbeat between proxy and host")
- **Discovery**: `DiscoverGalaxyHierarchyRequest`, `GalaxyObjectInfo`, `GalaxyAttributeInfo` (these are not the v1 Domain types — they're the IPC-shape with MessagePack attributes; the Proxy maps to/from `DriverAttributeInfo` from `Core.Abstractions`)
- **Read / Write**: `ReadValuesRequest`, `ReadValuesResponse`, `WriteValuesRequest`, `WriteValuesResponse` (carries `DataValue` shape per decision #13: value + StatusCode + timestamps)
- **Subscriptions**: `SubscribeRequest`, `UnsubscribeRequest`, `OnDataChangeNotification` (server-pushed)
- **Alarms**: `AlarmSubscribeRequest`, `AlarmEvent`, `AlarmAcknowledgeRequest`
- **History**: `HistoryReadRequest`, `HistoryReadResponse`
- **Probe**: `HostConnectivityStatus`, `RuntimeStatusChangeNotification`
- **Recycle / control**: `RecycleHostRequest`, `RecycleStatusResponse`

Length-prefixed framing per decision #28; MessagePack body inside each frame.
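
The framing rule can be sketched language-agnostically. The snippet below is illustrative only — a 4-byte little-endian prefix is an assumption for the sketch, not the actual decision #28 wire format, and the real body is a MessagePack-encoded DTO rather than raw bytes:

```python
import struct

def frame(payload: bytes) -> bytes:
    # Prepend a 4-byte little-endian length prefix to the serialized body.
    return struct.pack("<I", len(payload)) + payload

def deframe(buffer: bytes):
    # Extract complete payloads; return any trailing partial frame so the
    # reader can wait for more bytes from the pipe.
    offset, payloads = 0, []
    while len(buffer) - offset >= 4:
        (length,) = struct.unpack_from("<I", buffer, offset)
        if len(buffer) - offset - 4 < length:
            break  # partial frame — stop and keep the remainder
        payloads.append(buffer[offset + 4 : offset + 4 + length])
        offset += 4 + length
    return payloads, buffer[offset:]
```

Length prefixing keeps the reader free of any knowledge of the body format: it can slice complete frames off the pipe and hand opaque byte blobs to the MessagePack deserializer.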

**Acceptance**:
- All contracts compile against .NET Standard 2.0
- Unit test project asserts each contract round-trips through MessagePack serialize → deserialize byte-for-byte
- Reflection test asserts no contract references `System.Text.Json` or anything not in BCL/MessagePack

#### Task A.3 — Versioning + capability negotiation

Add a top-of-stream `Hello` message exchanged on connection: protocol version, supported features. Future-proofs for adding new operations without breaking older Hosts.
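
A minimal sketch of the negotiation policy (the dict shape and the `negotiate` name are illustrative — the real `Hello` contract is a MessagePack DTO in `Driver.Galaxy.Shared`):

```python
def negotiate(local: dict, remote: dict) -> set:
    # local/remote are Hello payloads: {"major": int, "features": set}.
    if local["major"] != remote["major"]:
        # Refuse loudly: an unknown major means the frames themselves may
        # not be parseable, so no fallback is attempted.
        raise ConnectionError(
            f"protocol major mismatch: {local['major']} vs {remote['major']}")
    # Within a shared major, operate on the intersection of advertised
    # optional features so a newer peer degrades gracefully against an
    # older one instead of breaking it.
    return set(local["features"]) & set(remote["features"])
```

Matching on major only (while letting optional features vary) is what lets new operations be added without breaking older Hosts.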

**Acceptance**:
- Proxy refuses to talk to a Host advertising a major version it doesn't understand; logs the mismatch
- Host refuses to accept a Proxy from an unknown major version

### Stream B — Driver.Galaxy.Host (3–4 weeks)

#### Task B.1 — Create the project + move Galaxy code

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` (.NET 4.8, **x86 platform** target — required for MXAccess COM per decision #23).

Move from legacy `OtOpcUa.Host`:
- `MxAccessBridge.cs` and supporting types
- `GalaxyRepository.cs` and SQL queries
- Alarm tracking infrastructure
- `GalaxyRuntimeProbeManager.cs`
- `MxDataTypeMapper.cs`, `SecurityClassificationMapper.cs`
- Historian plugin loader and `IHistorianDataSource` (only loaded when `Historian.Enabled = true`)
- Configuration types (`MxAccessConfiguration`, `GalaxyRepositoryConfiguration`, `HistorianConfiguration`, `GalaxyScope`) — these now read from the JSON `DriverConfig` rather than `appsettings.json`

`Driver.Galaxy.Host` does **not** reference `Core.Abstractions` (decision §5 dependency graph) — it's a closed unit, IPC-fronted.

**Acceptance**:
- Project builds against .NET 4.8 x86
- All moved files have their namespace updated to `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.*`
- v1 unit tests for these classes (still in `OtOpcUa.Host.Tests`) move to a new `OtOpcUa.Driver.Galaxy.Host.Tests` project and pass

#### Task B.2 — STA thread + Win32 message pump

Per `driver-stability.md` Galaxy deep dive:
- Single STA thread per Host process owns all `LMXProxyServer` instances
- Work item dispatch via `PostThreadMessage(WM_APP)`
- `WM_QUIT` shutdown only after all outstanding work items complete
- Pump health probe: no-op work item every 10s, timeout = wedged-pump signal that triggers recycle

This is essentially v1's `StaComThread`, lifted from the `LmxProxy.Host` reference implementation (per the CLAUDE.md "Reference Implementation" section).

**Acceptance**:
- Pump starts, dispatches work items, exits cleanly on `WM_QUIT`
- Pump-wedged simulation (work item that infinite-loops) triggers the 10s timeout and posts a recycle event
- COM call from non-STA thread fails fast with a recognizable error (regression net for cross-apartment bugs)

#### Task B.3 — `MxAccessHandle : SafeHandle` for COM lifetime

Wrap each `LMXProxyServer` connection in a `SafeHandle` subclass (decision #65 + Galaxy deep dive):
- `ReleaseHandle()` calls `Marshal.ReleaseComObject` until refcount = 0, then `UnregisterProxy`
- Subscription handles wrapped per item; `RemoveAdvise` → `RemoveItem` ordering enforced
- `CriticalFinalizerObject` for finalizer ordering during AppDomain unload
- Pre-shutdown drain: cancel all subscriptions cleanly via the STA pump, in order, before pump exit

**Acceptance**:
- Unit test asserts a leaked handle (no `Dispose`) is released by the finalizer
- Shutdown test asserts no orphan COM refs after Host exits cleanly
- Stress test: 1000 subscribe/unsubscribe cycles → handle table empty at the end

#### Task B.4 — Subscription registry + reconnect

Per `driver-stability.md` Galaxy deep dive §"Subscription State and Reconnect":
- In-memory registry of `(Item, AdviseId, OwningHost)` for every subscription
- Reconnect order: register proxy → re-add items → re-advise
- Cross-host quality clear gated on host-status check (closes 2026-04-13 finding)

**Acceptance**:
- Disconnect simulation: kill TCP to MXAccess; subscriptions go Bad; reconnect; subscriptions restore in correct order
- Multi-host test: stop AppEngine A while AppEngine B is running; verify A's subscriptions go Bad but B's stay Good (closes the cross-host quality clear regression)

#### Task B.5 — Connection health probe (`GalaxyRuntimeProbeManager` rebuild)

Lift the existing `GalaxyRuntimeProbeManager` into the new project. Behaviors per `driver-stability.md`:
- Subscribe to per-host runtime-status synthetic attribute
- Bad-quality fan-out scoped to the host's subtree (not Galaxy-wide)
- Failed probe subscription does **not** leave a phantom entry that Tick() flips to Stopped (closes 2026-04-13 finding)

**Acceptance**:
- Probe failure simulation → no phantom entry; Tick() does not flip arbitrary subscriptions to Stopped (regression test for the finding)
- Probe transitions Stopped → Running → Stopped → Running over 5 minutes; quality fan-out happens correctly each transition

#### Task B.6 — Named-pipe IPC server with mandatory ACL

Per decision #76 + `driver-stability.md` §"IPC Security":
- Pipe ACL on creation: `ReadWrite | Synchronize` granted only to the OtOpcUa server's service principal SID; LocalSystem and Administrators **explicitly denied**
- Caller identity verification on each new connection: `GetImpersonationUserName()` cross-checked against configured server service SID; mismatches dropped before any RPC frame is read
- Per-process shared secret: passed by the supervisor at spawn time, required on first frame of every connection
- Heartbeat pipe: separate from data-plane pipe, same ACL
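
The shared-secret layer of the checks above can be sketched as follows (names are illustrative; the Windows pipe ACL and SID verification are separate, platform-level checks not modeled here):

```python
import hmac

def accept_first_frame(first_frame: bytes, expected_secret: bytes) -> bool:
    # Defense-in-depth check behind the pipe ACL: the supervisor hands the
    # Host a per-process secret at spawn, and the first frame of every
    # connection must carry it. compare_digest keeps the comparison
    # constant-time so a co-resident process cannot probe the secret
    # byte-by-byte via timing.
    return hmac.compare_digest(first_frame, expected_secret)
```

On `False` the Host drops the connection (with an audit log entry) before reading any further frames.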

**Acceptance**:
- Unit test: pipe ACL enumeration shows only the configured SID + Synchronize/ReadWrite
- Integration test: connection from a non-server-SID local process is dropped with audit log entry
- Integration test: connection without correct shared secret on first frame is dropped
- Defense-in-depth test: even if ACL is misconfigured (manually overridden), shared-secret check catches the wrong client

#### Task B.7 — Memory watchdog with Galaxy-specific thresholds

Per `driver-stability.md` Galaxy deep dive §"Memory Watchdog Thresholds":
- Sample RSS every 30s
- Warning: `1.5× baseline OR baseline + 200 MB` (whichever larger)
- Soft recycle: `2× baseline OR baseline + 200 MB` (whichever larger)
- Hard ceiling: 1.5 GB → force-kill
- Slope: > 5 MB/min sustained 30 min → soft recycle
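
The threshold arithmetic above reduces to a pure function over the sampled values (MB units; the action names and function shape are illustrative, not the actual watchdog API):

```python
def watchdog_action(rss_mb: float, baseline_mb: float,
                    slope_mb_per_min: float = 0.0,
                    slope_sustained_min: float = 0.0) -> str:
    if rss_mb >= 1536:                                # 1.5 GB hard ceiling
        return "force-kill"
    soft = max(2.0 * baseline_mb, baseline_mb + 200)  # whichever larger
    warn = max(1.5 * baseline_mb, baseline_mb + 200)
    if rss_mb >= soft:
        return "soft-recycle"
    if slope_mb_per_min > 5 and slope_sustained_min >= 30:
        return "soft-recycle"                          # sustained leak
    if rss_mb >= warn:
        return "warn"
    return "ok"
```

Note the "whichever larger" clause: for a small baseline (say 300 MB) the absolute +200 MB floor dominates the warning threshold, while the multiplicative term dominates for large baselines.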

**Acceptance**:
- Unit test against a mock RSS source: each threshold triggers the correct action
- Integration test with the FaultShim (Stream B.10): leak simulation crosses the soft-recycle threshold and triggers soft recycle path

#### Task B.8 — Recycle policy with WM_QUIT escalation

Per `driver-stability.md` Galaxy deep dive §"Recycle Policy (COM-specific)":
- 15s grace for in-flight COM calls (longer than FOCAS because legitimate MXAccess bulk reads take seconds)
- Per-handle: `RemoveAdvise` → `RemoveItem` → `ReleaseComObject` → `UnregisterProxy`, on the STA thread
- `WM_QUIT` posted only after all of the above complete
- If STA pump doesn't exit within 5s of `WM_QUIT` → `Environment.Exit(2)` (hard exit)
- Soft recycle scheduled daily at 03:00 local; recycle frequency cap 1/hour

**Acceptance**:
- Soft recycle test: in-flight call returns within grace → clean exit (`Exit(0)`)
- Soft recycle test: in-flight call exceeds grace → hard exit (`Exit(2)`); supervisor records as unclean recycle
- Wedged-pump test: pump doesn't drain after `WM_QUIT` → `Exit(2)` within 5s
- Frequency cap test: trigger 2 soft recycles within an hour → second is blocked, alert raised

#### Task B.9 — Post-mortem MMF writer

Per `driver-stability.md` Galaxy deep dive §"Post-Mortem Log Contents":
- Ring buffer of last 1000 IPC operations
- Plus Galaxy-specific snapshots: STA pump state (thread ID, last dispatched timestamp, queue depth), active subscription count by host, `MxAccessHandle` refcount snapshot, last 100 probe results, last redeploy event, Galaxy DB connection state, Historian connection state if HDA enabled
- Memory-mapped file at `%ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf`
- On graceful shutdown: flush ring + snapshots to a rotating log
- On hard crash: supervisor reads the MMF after the corpse is gone
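
The ring semantics can be sketched in a few lines — this models only the fixed-capacity overwrite behavior, not the MMF layout, offsets, or crash-safe write ordering (class and method names are illustrative):

```python
from collections import deque

class PostMortemRing:
    """Fixed-capacity ring of the most recent IPC operations."""

    def __init__(self, capacity: int = 1000):
        # deque with maxlen silently evicts the oldest entry on overflow,
        # which is exactly the ring-buffer overwrite behavior.
        self._ops = deque(maxlen=capacity)

    def record(self, op: str) -> None:
        self._ops.append(op)

    def snapshot(self) -> list:
        # Oldest-first view, as the supervisor would read it post-crash;
        # the tail is the operation that was in flight.
        return list(self._ops)
```
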

**Acceptance**:
- Round-trip test: write 1000 operations → read back → assert order + content
- Hard-crash test: kill the process mid-operation → supervisor reads the MMF → ring tail shows the operation that was in flight

#### Task B.10 — Driver.Galaxy.FaultShim (test-only)

Per `driver-stability.md` §"Test Coverage for Galaxy Stability" — analogous to FOCAS FaultShim:
- Test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding
- Injects: COM exception at chosen call site, subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning unexpected refcount, STA pump deadlock simulation
- Production builds load the real `ArchestrA.MxAccess` from GAC

**Acceptance**:
- FaultShim binds successfully under test configuration
- Each fault scenario triggers the expected protection (memory watchdog → recycle, supervisor → respawn, etc.)

### Stream C — Driver.Galaxy.Proxy (1.5 weeks, can parallel with B after A done)

#### Task C.1 — Create the project + capability interface implementation

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` (.NET 10). Dependencies: `Core.Abstractions` (Phase 1) + `Driver.Galaxy.Shared` (Stream A) + `MessagePack`.

Implement every interface listed in Phase Objective above. Each method:
- Marshals arguments into the matching IPC contract
- Sends over the data-plane pipe
- Awaits the response (with timeout per Polly per decision #34)
- Maps the response into the `Core.Abstractions` shape (`DataValue`, `DriverAttributeInfo`, etc.)
- Surfaces failures as the appropriate StatusCode
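
The per-method shape above can be sketched generically — this is an illustrative asyncio sketch, not the actual .NET/Polly implementation, and the status strings and `forward` name are assumptions:

```python
import asyncio

async def forward(ipc_call, request, timeout_s: float = 5.0):
    # Await the IPC response with a timeout, and map failures to an
    # OPC UA-style bad status instead of letting exceptions escape
    # into the NodeManager.
    try:
        return await asyncio.wait_for(ipc_call(request), timeout_s)
    except asyncio.TimeoutError:
        return {"status": "BadTimeout"}
    except ConnectionError:
        return {"status": "BadCommunicationError"}
```

The key design point is that the Proxy never raises raw transport errors upward: every failure mode collapses to a StatusCode the OPC UA layer already knows how to present.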

**Acceptance**:
- Each interface method has a unit test against a mock IPC channel: happy path + IPC timeout path + IPC error path
- `IRediscoverable` opt-in works: when Galaxy.Host signals a redeploy, Proxy invokes the Core's rediscovery flow (not full restart)

#### Task C.2 — Heartbeat sender + host liveness

Per `driver-stability.md` §"Heartbeat between proxy and host":
- 2s cadence (decision #72) on the dedicated heartbeat pipe
- 3 consecutive missed responses = host declared dead (6s detection)
- On host-dead: fan out Bad quality on all Galaxy-namespace nodes; ask supervisor to respawn
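
The miss-counting logic reduces to a tiny state machine (class and method names are illustrative):

```python
class HeartbeatMonitor:
    """3 consecutive missed 2 s beats => host declared dead (~6 s)."""

    def __init__(self, miss_threshold: int = 3):
        self._threshold = miss_threshold
        self._misses = 0

    def on_beat(self) -> None:
        # Any response resets the counter — only *consecutive* misses count.
        self._misses = 0

    def on_miss(self) -> bool:
        # Returns True when the host should be declared dead.
        self._misses += 1
        return self._misses >= self._threshold
```

Requiring consecutive misses is what absorbs a single late beat (e.g. a proxy-side GC pause) without a false host-dead declaration.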

**Acceptance**:
- Heartbeat round-trip test against a mock host
- Missed-heartbeat test: stop the mock host's heartbeat responder → 3 misses → supervisor respawn requested
- GC pause test: simulate a 700ms GC pause on the proxy side → no false positive (single missed beat absorbed by 3-miss tolerance)

#### Task C.3 — Supervisor with respawn-with-backoff + crash-loop circuit breaker

Per `driver-stability.md` §"Crash-loop circuit breaker" + Galaxy §"Recovery Sequence After Crash":
- Backoff: 5s → 15s → 60s (capped)
- Crash-loop: 3 crashes / 5 min → escalating cooldown (1h → 4h → 24h manual)
- Sticky alert that doesn't auto-clear when cooldown elapses
- On respawn after recycle: reuse cached `time_of_last_deploy` watermark to skip full DB rediscovery if unchanged
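
The backoff sequence and the crash-window check can be sketched as follows (names are illustrative, and the cooldown escalation and sticky alert are handled outside this sketch):

```python
def respawn_delays():
    # 5 s -> 15 s -> then capped at 60 s forever.
    yield from (5, 15)
    while True:
        yield 60

class CrashLoopBreaker:
    """3 crashes inside a sliding 5-minute window opens the circuit."""

    def __init__(self, window_s: float = 300, limit: int = 3):
        self._window, self._limit = window_s, limit
        self._crashes = []

    def record_crash(self, now_s: float) -> bool:
        # Returns True if respawn is still allowed, False once the
        # circuit is open (escalating cooldown takes over from there).
        self._crashes = [t for t in self._crashes if now_s - t < self._window]
        self._crashes.append(now_s)
        return len(self._crashes) < self._limit
```

The sliding window matters: crashes spaced wider than 5 minutes apart never open the circuit, so a slow-but-recoverable fault keeps getting respawn attempts while a tight crash loop is cut off.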

**Acceptance**:
- Respawn test: kill host process → supervisor respawns within 5s → host re-establishes
- Crash-loop test: force 3 crashes within 5 minutes → 4th respawn blocked, alert raised, manual reset clears alert
- Cooldown escalation test: trip → 1h auto-reset → re-trip within 10 min → 4h cooldown → re-trip → 24h manual

#### Task C.4 — Address space build via `IAddressSpaceBuilder`

When the Proxy is asked to discover its tags, it issues `DiscoverGalaxyHierarchyRequest` to the Host, receives the gobject tree + attributes, and streams them to `IAddressSpaceBuilder` (Phase 1 API per decision #52). Galaxy uses the SystemPlatform-kind namespace; tags use `FolderPath` (v1-style) — no `Equipment` rows are created.

**Acceptance**:
- Build a Galaxy address space via the Proxy → byte-equivalent OPC UA browse output to v1
- Memory test: large Galaxy (4000+ attributes) → Proxy peak RAM stays under 200 MB during build

### Stream D — Retire legacy OtOpcUa.Host (1 week, depends on B + C)

#### Task D.1 — Delete legacy Host project

Once Galaxy.Host + Galaxy.Proxy are functional, the legacy `OtOpcUa.Host` project's responsibilities are split:
- Galaxy-specific code → `Driver.Galaxy.Host` (already moved in Stream B)
- TopShelf wrapper, `Program.cs`, generic OPC UA hosting → already replaced by `OtOpcUa.Server` in Phase 1
- Anything else (configuration types, generic helpers) → moved to `OtOpcUa.Server` or `OtOpcUa.Configuration` as appropriate

Delete the project from the solution. Update `.slnx` and any references.

**Acceptance**:
- `ls src/` shows `OtOpcUa.Host` is gone
- `dotnet build OtOpcUa.slnx` succeeds with `OtOpcUa.Host` no longer in the build graph
- All previously-`OtOpcUa.Host.Tests` tests are either moved to the appropriate new test project or deleted as obsolete

#### Task D.2 — Update Windows service installer scripts

Two services per cluster node when Galaxy is configured:
- `OtOpcUa` (the main `OtOpcUa.Server`) — already installable per Phase 1
- `OtOpcUaGalaxyHost` (the `Driver.Galaxy.Host`) — new service registration

Installer must:
- Install both services with the correct service-account SIDs (Galaxy.Host's pipe ACL must grant the OtOpcUa service principal)
- Set the supervisor's per-process secret in the registry or a protected file before first start
- Honor service dependency: Galaxy.Host should be configured to start before OtOpcUa, or OtOpcUa retries until Galaxy.Host is up

**Acceptance**:
- Install both services on a test box → both start successfully
- Uninstall both → no leftover registry / file system state
- Service-restart cycle: stop OtOpcUa.Server → Galaxy.Host stays up → start OtOpcUa.Server → reconnects to Galaxy.Host pipe

#### Task D.3 — Migrate Galaxy `appsettings.json` config to central config DB

Galaxy-specific config sections (`MxAccess`, `Galaxy`, `Historian`) move into the `DriverInstance.DriverConfig` JSON for the Galaxy driver instance in the Configuration DB. The local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18).

Migration script: for each existing v1 `appsettings.json`, generate the equivalent `DriverConfig` JSON and insert it either via the Admin UI or via a one-shot SQL script.

**Acceptance**:
- Migration script runs against a v1 dev `appsettings.json` → produces a JSON blob that loads into the Galaxy `DriverConfig` field
- The Galaxy driver instance starts with the migrated config and serves the same address space as v1

### Stream E — Parity validation (1 week, gate)

#### Task E.1 — Run v1 IntegrationTests against v2 Galaxy topology

Per decision #56:
- The same v1 IntegrationTests suite runs against the v2 build with Galaxy.Proxy + Galaxy.Host instead of in-process Galaxy
- All tests must pass
- Pass count = v1 baseline; failure count = 0; skip count = v1 baseline
- Test duration may increase (IPC round-trip latency); document the deviation

**Acceptance**:
- Test report shows pass/fail/skip counts identical to v1 baseline
- Per-test duration regression report: any test that takes >2× v1 baseline is flagged for review (may be an IPC bottleneck)

#### Task E.2 — Scripted Client.CLI walkthrough parity

Per decision #56:
- Execute the captured Client.CLI script (recorded at Phase 2 entry gate against v1) against the v2 Galaxy topology
- Diff the output against v1 reference
- Differences allowed only in: timestamps, latency-measurement output. Any value, quality, browse path, or alarm shape difference = parity defect

**Acceptance**:
- Walkthrough completes without errors
- Output diff vs v1: only timestamp / latency lines differ

#### Task E.3 — Regression tests for the four 2026-04-13 stability findings

Per `driver-specs.md` Galaxy "Operational Stability Notes": each of the four findings closed in commits `c76ab8f` and `7310925` should have a regression test in the Phase 2 parity suite:
- Phantom probe subscription flipping Tick() to Stopped (covered by Task B.5)
- Cross-host quality clear wiping sibling state during recovery (covered by Task B.4)
- Sync-over-async on the OPC UA stack thread → guard against new instances in `GenericDriverNodeManager`
- Fire-and-forget alarm tasks racing shutdown → guard via the pre-shutdown drain ordering in Task B.3

**Acceptance**:
- Each of the four scenarios has a named test in the parity suite
- Each test fails on a hand-introduced regression (revert the v1 fix, see test fail)

#### Task E.4 — Adversarial review of the Phase 2 diff

Per `implementation/overview.md` exit gate:
- Run `/codex:adversarial-review --base v2` on the merged Phase 2 diff
- Findings closed or explicitly deferred with rationale and ticket link

## Compliance Checks (run at exit gate)

`phase-2-compliance.ps1`:

### Schema compliance
N/A for Phase 2 — no schema changes (Configuration DB schema is unchanged from Phase 1).

### Decision compliance
For each decision number Phase 2 implements (#11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102, plus #122–#124), verify at least one citation exists in source, tests, or migrations:

```powershell
$decisions = @(11, 24, 25, 28, 29, 32, 34, 44, 46, 47, 55, 56, 62, 63..69, 76, 102, 122, 123, 124)
foreach ($d in $decisions) {
    $hits = git grep "decision #$d" -- 'src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/' 'tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/'
    if (-not $hits) { Write-Error "Decision #$d has no citation"; exit 1 }
}
```

### Visual compliance
N/A — no Admin UI changes in Phase 2 (Galaxy is just another `DriverInstance` in the Drivers tab).

### Behavioral compliance — parity smoke test
The parity suite (Stream E) is the smoke test:
1. v1 IntegrationTests pass count = baseline, fail count = 0
2. Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
3. Four regression tests for 2026-04-13 findings pass

### Stability compliance
For Phase 2 (introduces the first Tier C driver in production form):
- Galaxy.Host implements every Tier C cross-cutting protection from `driver-stability.md`:
  - SafeHandle wrappers for COM (Task B.3) ✓
  - Memory watchdog with Galaxy thresholds (Task B.7) ✓
  - Bounded operation queues per device (already in Core, Phase 1) ✓
  - Heartbeat between proxy and host on separate channel (Tasks A.2, B.6, C.2) ✓
  - Scheduled recycling with `WM_QUIT` escalation to hard exit (Task B.8) ✓
  - Crash-loop circuit breaker (Task C.3) ✓
  - Post-mortem MMF readable after hard crash (Task B.9) ✓
  - IPC ACL + caller SID verification + per-process shared secret (Task B.6) ✓

Each protection has at least one regression test. The compliance script enumerates and verifies presence:

```powershell
$protections = @(
    @{Name="SafeHandle for COM"; Test="MxAccessHandleFinalizerReleasesCom"},
    @{Name="Memory watchdog"; Test="WatchdogTriggersRecycleAtThreshold"},
    @{Name="Heartbeat detection"; Test="ThreeMissedHeartbeatsDeclaresHostDead"},
    @{Name="WM_QUIT escalation"; Test="WedgedPumpEscalatesToHardExit"},
    @{Name="Crash-loop breaker"; Test="ThreeCrashesInFiveMinutesOpensCircuit"},
    @{Name="Post-mortem MMF"; Test="MmfSurvivesHardCrashAndIsReadable"},
    @{Name="Pipe ACL enforcement"; Test="NonServerSidConnectionRejected"},
    @{Name="Shared secret"; Test="ConnectionWithoutSecretRejected"}
)
foreach ($p in $protections) {
    $hits = dotnet test --filter "FullyQualifiedName~$($p.Test)" --no-build --logger "console;verbosity=quiet"
    if ($LASTEXITCODE -ne 0) { Write-Error "Stability protection '$($p.Name)' has no passing test '$($p.Test)'"; exit 1 }
}
```

### Documentation compliance
- Any deviation from the Galaxy deep dive in `driver-stability.md` reflected back; new decisions added with `supersedes` notes if needed
- `driver-specs.md` §1 (Galaxy) updated to reflect the actual implementation if the IPC contract or recycle behavior differs from the design doc

## Completion Checklist

### Stream A — Driver.Galaxy.Shared
- [ ] Project created (.NET Standard 2.0, MessagePack-only dependency)
- [ ] All IPC contracts defined and round-trip tested
- [ ] Hello-message version negotiation implemented
- [ ] Reflection test confirms no .NET 10-only types leaked in

### Stream B — Driver.Galaxy.Host
- [ ] Project created (.NET 4.8 x86)
- [ ] All Galaxy-specific code moved from legacy Host
- [ ] STA thread + Win32 pump implemented; pump health probe wired up
- [ ] `MxAccessHandle : SafeHandle` for COM lifetime
- [ ] Subscription registry + reconnect with cross-host quality scoping
- [ ] `GalaxyRuntimeProbeManager` rebuilt; phantom-probe regression test passes
- [ ] Named-pipe IPC server with mandatory ACL + caller SID verification + per-process secret
- [ ] Memory watchdog with Galaxy-specific thresholds
- [ ] Recycle policy with 15s grace + WM_QUIT escalation to hard exit
- [ ] Post-mortem MMF writer + supervisor reader
- [ ] FaultShim test-only assembly for fault injection

### Stream C — Driver.Galaxy.Proxy
- [ ] Project created (.NET 10, depends on Core.Abstractions + Galaxy.Shared)
- [ ] All capability interfaces implemented (IDriver, ITagDiscovery, IRediscoverable, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe)
- [ ] Heartbeat sender on dedicated channel; missed-heartbeat detection
- [ ] Supervisor with respawn-with-backoff + crash-loop circuit breaker (escalating cooldown 1h/4h/24h)
- [ ] Address space build via `IAddressSpaceBuilder` produces byte-equivalent v1 output

### Stream D — Retire legacy OtOpcUa.Host
- [ ] Legacy `OtOpcUa.Host` project deleted from solution
- [ ] Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost)
- [ ] Galaxy `appsettings.json` config migrated into central DB `DriverConfig`
- [ ] Migration script tested against v1 dev config

### Stream E — Parity validation
- [ ] v1 IntegrationTests pass with count = baseline, failures = 0
- [ ] Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
- [ ] All four 2026-04-13 stability findings have passing regression tests
- [ ] Per-test duration regression report: no test >2× v1 baseline (or flagged for review)

### Cross-cutting
- [ ] `phase-2-compliance.ps1` runs and exits 0
- [ ] All 8 Tier C stability protections have named, passing tests
- [ ] Adversarial review of the phase diff — findings closed or deferred with rationale
- [ ] PR opened against `v2`, includes: link to this doc, link to exit-gate record, compliance script output, parity test report, adversarial review output
- [ ] Reviewer signoff (one reviewer beyond the implementation lead)
- [ ] `exit-gate-phase-2.md` recorded

## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| IPC round-trip latency makes parity tests fail on timing assumptions | High | Medium | Per-test duration regression report identifies hot tests; tune timeouts in test config rather than in production code |
| MessagePack contract drift between Proxy and Host during development | Medium | High | Hello-message version negotiation rejects mismatched majors loudly; CI builds both projects in the same job |
| STA pump health probe is itself flaky and triggers spurious recycles | Medium | High | Probe interval tunable; default 10s gives 1000ms+ slack on a healthy pump; monitor via post-mortem MMF for false positives |
| Pipe ACL misconfiguration on installer leaves the IPC accessible to local users | Low | Critical | Defense-in-depth shared secret catches the case; ACL enumeration test in installer integration test |
| Galaxy.Host process recycle thrash if Galaxy or DB is intermittently unavailable | Medium | Medium | Crash-loop circuit breaker with escalating cooldown caps the thrash; Polly retry on the data path inside Host (not via supervisor restart) handles transient errors |
| Migration of `appsettings.json` Galaxy config to DB blob breaks existing deployments | Medium | Medium | Migration script is idempotent and dry-run-able; deploy script asserts central DB has the migrated config before stopping legacy Host |
| Phase 2 takes longer than 8 weeks | High | Medium | Mid-gate review at 4 weeks — if Stream B isn't past Task B.6 (IPC + ACL), defer Stream B.10 (FaultShim) to Phase 2.5 follow-up |
| Wonderware Historian SDK incompatibility with .NET 4.8 x86 in the new project layout | Low | High | Move and validate Historian loader as part of Task B.1 — early signal if SDK has any project-shape sensitivity |
| Hard-exit on wedged pump leaks COM resources | Accepted | Low | Documented intent: hard exit is the only safe response; process exit reclaims OS handles, and COM cleanup beyond that is best-effort. The FOCAS (CNC) deep dive accepts the same trade-off |

## Out of Scope (do not do in Phase 2)

- Any non-Galaxy driver (Phase 3+)
- UNS / Equipment-namespace work for Galaxy (Galaxy is SystemPlatform-namespace; no Equipment rows for Galaxy tags per decision #108)
- Equipment-class template integration with the schemas repo (Galaxy doesn't use `EquipmentClassRef`)
- Push-from-DB notification (decision #96 — v2.1)
- Any change to OPC UA wire behavior visible to clients (parity is the gate)
- Consumer cutover (ScadaBridge, Ignition, System Platform IO) — out of v2 scope, separate integration-team track per `implementation/overview.md`
- Removing the v1 deployment from production (a v2 release decision, not Phase 2)
928
docs/v2/plan.md
Normal file
@@ -0,0 +1,928 @@
# Next Phase Plan — OtOpcUa v2: Multi-Driver Architecture

> **Status**: DRAFT — brainstorming in progress, do NOT execute until explicitly approved.
>
> **Branch**: `v2`
> **Created**: 2026-04-16

## Vision

Rename from **LmxOpcUa** to **OtOpcUa** and evolve from a single-protocol OPC UA server (Galaxy/MXAccess only) into a **multi-driver OPC UA server** where:

- The **common core** owns the OPC UA server, address space management, session/security/subscription machinery, and client-facing concerns.
- **Driver modules** are pluggable backends that each know how to connect to a specific data source, discover its tags/hierarchy, and shuttle live data back through the core to OPC UA clients.
- Drivers implement **composable capability interfaces** — a driver only implements what it supports (e.g. subscriptions, alarms, history).
- The existing Galaxy/MXAccess integration becomes the **first driver module**, proving the abstraction works against real production use.

---

## Target Drivers

| Driver | Protocol | Capability Profile | Notes |
|--------|----------|--------------------|-------|
| **Galaxy** | MXAccess COM + Galaxy DB | Read, Write, Subscribe, Alarms, HDA | Existing v1 logic, out-of-process (.NET 4.8 x86) |
| **Modbus TCP** | MB-TCP | Read, Write, Subscribe (polled) | Flat register model, config-driven tag map. Also covers DL205 via `AddressFormat=DL205` (octal translation) |
| **AB CIP** | EtherNet/IP CIP | Read, Write, Subscribe (polled) | ControlLogix/CompactLogix, symbolic tag addressing |
| **AB Legacy** | EtherNet/IP PCCC | Read, Write, Subscribe (polled) | SLC 500/MicroLogix, file-based addressing |
| **Siemens S7** | S7comm (ISO-on-TCP) | Read, Write, Subscribe (polled) | S7-300/400/1200/1500, DB/M/I/Q addressing |
| **TwinCAT** | ADS (Beckhoff) | Read, Write, Subscribe (native) | Symbol-based, native ADS notifications |
| **FOCAS** | FOCAS2 (FANUC CNC) | Read, Write, Subscribe (polled) | CNC data model (axes, spindle, PMC, macros) |
| **OPC UA Client** | OPC UA | Read, Write, Subscribe, Alarms, HDA | Gateway/aggregation — proxy a remote server |

### Driver Characteristics That Shape the Interface

| Concern | Galaxy | Modbus TCP | AB CIP | AB Legacy | S7 | TwinCAT | FOCAS | OPC UA Client |
|---------|--------|------------|--------|-----------|-----|---------|-------|---------------|
| Tag discovery | DB query | Config DB | Config DB | Config DB | Config DB | Symbol upload | CNC query + Config DB | Browse remote |
| Hierarchy | Rich tree | Flat (user groups) | Flat or program-scoped | Flat (file-based) | Flat (DB/area) | Symbol tree | Functional (axes/spindle/PMC) | Mirror remote |
| Data types | mx_data_type | Raw registers (user-typed) | CIP typed | File-typed (N=INT16, F=FLOAT) | S7 typed | IEC 61131-3 | Scaled integers + structs | Full OPC UA |
| Native subscriptions | Yes (MXAccess) | No (polled) | No (polled) | No (polled) | No (polled) | **Yes (ADS notifications)** | No (polled) | Yes (OPC UA) |
| Alarms | Yes | No | No | No | No | Possible (ADS state) | Yes (CNC alarms) | Yes (A&C) |
| History | Yes (Historian) | No | No | No | No | No | No | Yes (HistoryRead) |

**Note:** AutomationDirect DL205 PLCs are supported by the Modbus TCP driver via `AddressFormat=DL205` (octal V/X/Y/C/T/CT address translation over the H2-ECOM100 module, port 502). No separate driver needed.

---
## Architecture — Key Decisions & Open Questions

### 1. Common Core Boundary

**Core owns:**
- OPC UA server lifecycle (startup, shutdown, session management)
- Security (transport profiles, authentication, authorization)
- Address space tree management (add/remove/update nodes)
- Subscription engine (create, publish, transfer)
- Status dashboard / health reporting
- Redundancy
- Configuration framework
- Namespace allocation per driver

**Driver owns:**
- Data source connection management
- Tag/hierarchy discovery
- Data type mapping (driver types → OPC UA types)
- Read/write translation
- Alarm sourcing (if supported)
- Historical data access (if supported)

**Decided:**
- Each driver instance manages its own polling internally — the core does not provide a shared poll scheduler.
- Multiple instances of the same driver type are supported (e.g. two Modbus TCP drivers for different device groups).
- One namespace index per driver instance (each instance gets its own `NamespaceUri`).
- Drivers register nodes via a **builder/context API** (`IAddressSpaceBuilder`) provided by the core. The core owns the tree; the driver streams `AddFolder` / `AddVariable` calls as it discovers nodes. Supports incremental/large address spaces without forcing the driver to buffer the whole tree.
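
The streaming-registration idea can be sketched as follows. This is an illustrative Python sketch of the pattern, not the real API — `IAddressSpaceBuilder` is C#, and the `add_folder`/`add_variable` names below are hypothetical stand-ins for `AddFolder`/`AddVariable`:

```python
class AddressSpaceBuilder:
    """Core-owned tree; drivers stream nodes into it during discovery."""
    def __init__(self):
        self.nodes = {}  # browse path -> node kind

    def add_folder(self, path):
        self.nodes[path] = "Folder"
        return path

    def add_variable(self, parent, name, data_type):
        path = f"{parent}/{name}"
        self.nodes[path] = f"Variable<{data_type}>"
        return path

def discover(builder):
    """A driver registers nodes as it finds them — no full-tree buffering."""
    plc = builder.add_folder("plc-1")
    for tag, dtype in [("RunState", "Boolean"), ("FeedRate", "Float")]:
        builder.add_variable(plc, tag, dtype)

builder = AddressSpaceBuilder()
discover(builder)
print(sorted(builder.nodes))  # folder first, then its two variables
```

The key property: the driver never holds the whole address space — each discovered node is handed to the core immediately.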

---

### 2. Driver Capability Interfaces

Composable — a driver implements only what it supports:

```
IDriver — required: lifecycle, metadata, health
├── ITagDiscovery — discover tags/hierarchy from the backend
├── IReadable — on-demand read
├── IWritable — on-demand write
├── ISubscribable — data change subscriptions (native or driver-managed polling)
├── IAlarmSource — alarm events and acknowledgment
└── IHistoryProvider — historical data reads
```

Note: `ISubscribable` covers both native subscriptions (Galaxy MXAccess advisory, OPC UA monitored items) and driver-internal polled subscriptions (Modbus, AB CIP). The driver owns its polling loop — the core just sees `OnDataChange` callbacks regardless of mechanism.
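
To make the "polled looks like native" point concrete, here is an illustrative Python sketch (the real interfaces are C#; `read_tag` and `on_data_change` are hypothetical stand-ins): a driver-internal poll loop that reports by exception through the same callback a native driver would invoke.

```python
import itertools

def run_polled_subscription(read_tag, tags, on_data_change, cycles):
    """Driver-owned poll loop: the core only ever sees on_data_change."""
    last = {}
    for _ in range(cycles):             # real driver: loop at the poll interval
        for tag in tags:
            value = read_tag(tag)
            if last.get(tag) != value:  # report-by-exception to the core
                last[tag] = value
                on_data_change(tag, value)

values = itertools.cycle([0, 0, 1])     # fake device: value changes on 3rd poll
changes = []
run_polled_subscription(lambda t: next(values), ["RunState"],
                        lambda t, v: changes.append((t, v)), cycles=3)
print(changes)  # [('RunState', 0), ('RunState', 1)]
```

Only the initial value and the actual change reach the callback — identical in shape to what a native-subscription driver (MXAccess, ADS) would deliver.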

**Capability matrix:**

| Interface | Galaxy | Modbus TCP | AB CIP | AB Legacy | S7 | TwinCAT | FOCAS | OPC UA Client |
|-----------|--------|------------|--------|-----------|-----|---------|-------|---------------|
| IDriver | Y | Y | Y | Y | Y | Y | Y | Y |
| ITagDiscovery | Y | Y (config DB) | Y (config DB) | Y (config DB) | Y (config DB) | Y (symbol upload) | Y (built-in + config DB) | Y (browse) |
| IReadable | Y | Y | Y | Y | Y | Y | Y | Y |
| IWritable | Y | Y | Y | Y | Y | Y | Y (limited) | Y |
| ISubscribable | Y (native) | Y (polled) | Y (polled) | Y (polled) | Y (polled) | Y (native ADS) | Y (polled) | Y (native) |
| IAlarmSource | Y | — | — | — | — | — | Y (CNC alarms) | Y |
| IHistoryProvider | Y | — | — | — | — | — | — | Y |

**Decided:**
- The data change callback uses shared data models (`DataValue` with value, `StatusCode` quality, timestamp). Every driver maps to the same OPC UA `StatusCode` space — drivers define which quality codes they can produce, but the model is universal.
- Driver isolation: each driver instance runs independently. A crash or disconnect in one driver sets Bad quality on its own nodes only — no impact on other driver instances. The core must catch and contain driver failures.
### Resilience — Polly

**Decided: Use Polly v8+ (`Microsoft.Extensions.Resilience`) as the resilience layer across all drivers and the configuration subsystem.**

Polly provides composable resilience pipelines rather than hand-rolled retry/circuit-breaker logic. Each driver instance (and each device within a driver) gets its own pipeline so failures are isolated at the finest practical level.

**Where Polly applies:**

| Component | Pipeline | Strategies | Purpose |
|-----------|----------|------------|---------|
| **Driver device connection** | Per device | Retry (exp. backoff) + CircuitBreaker + Timeout | Reconnect to an offline PLC/device, stop hammering after N failures, bound connection attempts |
| **Driver read ops** | Per device | Timeout + Retry | Reads are idempotent — retry transient failures freely |
| **Driver write ops** | Per device | Timeout **only** by default | Writes are NOT auto-retried — a timeout may fire after the device already accepted the command; replaying non-idempotent field actions (pulses, acks, recipe steps, counter increments) can cause duplicate operations |
| **Driver poll loop** | Per device | CircuitBreaker | When a device is consistently unreachable, open the circuit and probe periodically instead of polling at full rate |
| **Galaxy IPC (Proxy → Host)** | Per proxy | Retry (backoff) + CircuitBreaker | Reconnect when the Galaxy Host service restarts, stop retrying if Host is down for an extended period |
| **Config DB polling** | Singleton | Retry (backoff) + Fallback (use cache) | Central DB unreachable → fall back to LiteDB cache, keep retrying in background |
| **Config DB startup** | Singleton | Retry (backoff) + Fallback (use cache) | If the DB is briefly unavailable at startup, retry before falling back to cache |

**How it integrates:**

```
IHostedService (per driver instance)
├── Per-device ReadPipeline
│    ├── Timeout — bound how long a read can take
│    ├── Retry — transient failure recovery with jitter (SAFE: reads are idempotent)
│    └── CircuitBreaker — stop polling dead devices, probe periodically
│         on break: set device tags to Bad quality
│         on reset: resume normal polling, restore quality
│
└── Per-device WritePipeline
     ├── Timeout — bound how long a write can take
     └── (NO retry by default) — opt-in per tag via TagConfig.WriteIdempotent = true
          OR via a CAS (compare-and-set) wrapper that verifies
          the device state before each retry attempt

ConfigurationService
└── ResiliencePipeline
     ├── Retry — transient DB connectivity issues
     └── Fallback — serve from LiteDB cache on sustained outage
```

**Write-retry policy (per the adversarial review, finding #1):**
- Default: **no automatic retry on writes.** A timeout bubbles up as a write failure; the OPC UA client decides whether to re-issue.
- Opt-in per tag via `TagConfig.WriteIdempotent = true` — an explicit assertion by the configurer that replaying the same write has no side effect (e.g. setpoint overwrite, steady-state mode selection).
- Opt-in via CAS (compare-and-set): before retrying, read the current value; retry only if the device still holds the pre-write value. Drivers whose protocol supports atomic read-modify-write (e.g. Modbus mask-write, OPC UA writes with expected-value) can plug this in.
- Documented **never-retry** cases: edge-triggered acks, pulse outputs, monotonic counters, recipe-step advances, alarm acknowledgments, any "fire-and-forget" command register.
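
The CAS guard can be sketched as follows — an illustrative Python sketch of the decision logic only (the real wrapper would sit inside a Polly pipeline in C#; `read`/`write` are hypothetical device-I/O callables):

```python
def cas_write(read, write, expected, new_value, max_attempts=3):
    """Retry a timed-out write ONLY if the device still holds the pre-write
    value — a write the device already accepted is never replayed."""
    for _ in range(max_attempts):
        try:
            write(new_value)
            return True
        except TimeoutError:
            if read() != expected:   # device already took the write (or moved on)
                return False         # do NOT replay — surface as indeterminate
    return False

# Fake device: the first write times out AFTER being applied — the classic hazard.
state = {"value": 10, "calls": 0}
def flaky_write(v):
    state["calls"] += 1
    state["value"] = v               # device applies the write...
    if state["calls"] == 1:
        raise TimeoutError           # ...but the reply is lost
cas_write(lambda: state["value"], flaky_write, expected=10, new_value=42)
print(state["calls"])  # 1 — the guard saw value != 10 and refused to retry
```

Without the guard, a blind retry here would issue the write twice — harmless for a setpoint, dangerous for a pulse or counter.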

**Polly integration points:**
- `Microsoft.Extensions.Resilience` for DI-friendly pipeline registration
- `TelemetryListener` feeds circuit-breaker state changes into the status dashboard (operators see which devices are in open/half-open/closed state)
- Per-driver/per-device pipeline configuration from the central config DB (retry counts, backoff intervals, circuit-breaker thresholds can be tuned per device)

**Decided:**
- Capability discovery uses **interface checks via `is`** (e.g. `if (driver is IAlarmSource a) ...`). The interface *is* the capability — no redundant flag enum to keep in sync.
- `ITagDiscovery` is discovery-only. Drivers with a change signal (Galaxy deploy time, OPC UA server change notifications) additionally implement an **optional `IRediscoverable`** sub-interface; the core subscribes and rebuilds the affected subtree. Static drivers (Modbus, S7, etc., whose tags only change via a published config generation) don't implement it.
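
An illustrative Python sketch of the same check (mirroring the C# `is` pattern with `isinstance`; the class names are stand-ins, not real project types):

```python
class AlarmSource:                        # stands in for IAlarmSource
    def subscribe_alarms(self):
        return "alarms wired"

class ModbusDriver:                       # no alarm support — simply doesn't implement it
    pass

class GalaxyDriver(AlarmSource):          # capability expressed by implementing the interface
    pass

def wire_alarms(driver):
    if isinstance(driver, AlarmSource):   # C#: if (driver is IAlarmSource a) ...
        return driver.subscribe_alarms()
    return "no alarms"                    # not an error — the capability is absent

print(wire_alarms(GalaxyDriver()), "/", wire_alarms(ModbusDriver()))
```

There is nothing to keep in sync: a driver that stops implementing the interface automatically stops being offered the capability.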

---

### 3. Runtime & Target Framework

**Decided: .NET 10, C#, x64 for everything — except where explicitly required.**

| Component | Target | Reason |
|-----------|--------|--------|
| Core, Core.Abstractions | .NET 10 x64 | Default |
| Server | .NET 10 x64 | Default |
| Configuration | .NET 10 x64 | Default |
| Admin | .NET 10 x64 | Blazor Server |
| Driver.ModbusTcp | .NET 10 x64 | Default |
| Driver.AbCip | .NET 10 x64 | Default |
| Driver.OpcUaClient | .NET 10 x64 | Default |
| Client.CLI | .NET 10 x64 | Default |
| Client.UI | .NET 10 x64 | Avalonia |
| **Driver.Galaxy** | **.NET Framework 4.8 x86** | **MXAccess COM interop requires 32-bit** |

**Critical implication:** The Galaxy driver **cannot load in-process** with a .NET 10 x64 server. It must run as an **out-of-process driver** — a separate .NET 4.8 x86 process that the core communicates with over IPC.

**Decided: Named pipes with MessagePack serialization for IPC.**
- Galaxy Host always runs on the same machine (MXAccess needs a local ArchestrA Platform)
- Named pipes are fast, need no port allocation, and are built into both .NET 4.8 (`System.IO.Pipes`) and .NET 10
- `Galaxy.Shared` defines request/response message types serialized with **MessagePack** over length-prefixed frames
- MessagePack-CSharp (the `MessagePack` NuGet package) supports .NET Framework 4.6.1+ and .NET Standard 2.0+ — works on both sides
- Compact binary format, faster than JSON, a good fit for high-frequency data change callbacks
- Simpler than gRPC on .NET 4.8 (which needs the legacy `Grpc.Core` native library)
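
The length-prefixed framing can be sketched as follows — an illustrative Python sketch; the prefix width and endianness shown are assumptions of this sketch (not a decided wire format), and the payloads are stubs rather than real MessagePack:

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prepend a 4-byte little-endian length so the reader knows how many
    bytes belong to this message."""
    return struct.pack("<I", len(payload)) + payload

def read_frames(stream: bytes):
    """Split a byte stream back into the original payloads."""
    frames, offset = [], 0
    while offset < len(stream):
        (length,) = struct.unpack_from("<I", stream, offset)
        offset += 4
        frames.append(stream[offset:offset + length])
        offset += length
    return frames

# Two back-to-back messages survive a single buffered read intact.
wire = frame(b"hello-v2") + frame(b"read:RunState")
print(read_frames(wire))  # [b'hello-v2', b'read:RunState']
```

The framing is what makes the pipe usable for high-frequency callbacks: message boundaries never depend on read-chunk boundaries.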

**Decided: Galaxy Host is a separate Windows service.**
- Independent lifecycle from the OtOpcUa Server
- Can be restarted without affecting the main server or other drivers
- Galaxy.Proxy detects connection loss, sets Bad quality on Galaxy nodes, reconnects when Host comes back
- Installed/managed via standard Windows service tooling

```
┌──────────────────────────────────┐  named pipe  ┌───────────────────────────┐
│ OtOpcUa Server (.NET 10 x64)     │◄────────────►│ Galaxy Host Service       │
│ Windows Service                  │              │ Windows Service           │
│ (Microsoft.Extensions.Hosting)   │              │ (.NET 4.8 x86)            │
│                                  │              │                           │
│ Core                             │              │ MxAccessBridge            │
│ ├── Driver.ModbusTcp (in-proc)   │              │ GalaxyRepository          │
│ ├── Driver.AbCip (in-proc)       │              │ GalaxyDriverService       │
│ └── GalaxyProxy (in-proc)────────┼──────────────┼──AlarmTracking            │
│                                  │              │ HDA Plugin                │
└──────────────────────────────────┘              └───────────────────────────┘
```

**Notes for future work:**
- The Proxy/Host/Shared split is a general pattern — any future driver with process-isolation requirements (bitness mismatch, unstable native dependency, license boundary) can reuse the same three-project layout.
- Reusability of `LmxNodeManager` as a "generic driver node manager" will be assessed during Phase 2 interface extraction.

---
### 4. Galaxy/MXAccess as Out-of-Process Driver

**Current tightly-coupled pieces to refactor:**
- `LmxNodeManager` — mixes OPC UA node management with MXAccess-specific logic
- `MxAccessBridge` — COM thread, subscriptions, reconnect
- `GalaxyRepository` — SQL queries for hierarchy/attributes
- Alarm tracking tied to the MXAccess subscription model
- HDA via the Wonderware Historian plugin

All of these stay in the Galaxy Host process (.NET 4.8 x86). The `GalaxyProxy` in the main server implements the standard driver interfaces and forwards over IPC.

**Decided:**
- The refactor is **incremental**: extract `IDriver` / `ISubscribable` / `ITagDiscovery` etc. against the existing `LmxNodeManager` first (still in-process on the v2 branch), validate the system still runs, *then* move the implementation behind the IPC boundary into Galaxy.Host. Keeps the system runnable at each step and de-risks the out-of-process move.
- **Parity test**: run the existing v1 IntegrationTests suite against the v2 Galaxy driver (same Galaxy, same expectations) **plus** a scripted Client.CLI walkthrough (connect / browse / read / write / subscribe / history / alarms) on a dev Galaxy. Automated regression + human-observable behavior.

---
### 5. Configuration Model — Centralized MSSQL + Local Cache

**Deployment topology — server clusters:**

Sites deploy OtOpcUa as **2-node clusters** to provide non-transparent OPC UA redundancy (per v1 — `RedundancySupport.Warm` / `Hot`, no VIP/load-balancer involvement; clients see both endpoints in `ServerUriArray` and pick by `ServiceLevel`). Single-node deployments are the same model with `NodeCount = 1`. The config schema treats this uniformly: every server is a member of a **`ServerCluster`** with 1 or 2 **`ClusterNode`** members.

Within a cluster, both nodes serve **identical** address spaces — defining tags twice would invite drift — so driver definitions, device configs, tag definitions, and poll groups attach to `ClusterId`, not to individual nodes. Per-node overrides exist only for physical-machine settings that legitimately differ (host, port, `ApplicationUri`, redundancy role, machine cert) and for the rare driver setting that must differ per node (e.g. `MxAccess.ClientName` so Galaxy can distinguish the nodes). Overrides are minimal by intent.
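
The apply-time override merge can be sketched as follows — an illustrative Python sketch; the shallow, per-key merge semantics (and the key names) are assumptions of this sketch, not a decided contract:

```python
def merge_driver_config(cluster_config: dict, node_overrides: dict) -> dict:
    """Lay node-level override keys over the cluster-level driver config."""
    merged = dict(cluster_config)   # cluster-level settings are the base
    merged.update(node_overrides)   # per-node keys win — kept minimal by intent
    return merged

cluster_cfg = {"GalaxyNode": "GR", "PollMs": 500, "MxAccess.ClientName": "OTUA"}
node_a_override = {"MxAccess.ClientName": "OTUA-A"}   # only what truly differs
merged = merge_driver_config(cluster_cfg, node_a_override)
print(merged["MxAccess.ClientName"], merged["PollMs"])  # OTUA-A 500
```

Because the base config is never mutated, both nodes of a cluster can merge their own overrides against the same published generation.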

**Namespaces — two today, extensible to N:**

Each cluster serves **multiple OPC UA namespaces through a single endpoint**, per the 3-year-plan handoff (`handoffs/otopcua-handoff.md` §4). At v2.0 GA there are two namespace kinds:

| Kind | Source | Purpose |
|------|--------|---------|
| **Equipment** | New drivers (Modbus, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client when gatewaying raw data) | Raw equipment data — no deadbanding, no aggregation, no business meaning. The OT-side surface of the canonical model. |
| **SystemPlatform** | Galaxy driver (existing v1 LmxOpcUa functionality, folded in) | Processed data tap — Aveva System Platform objects exposed as OPC UA so OPC UA-native consumers read derived state through the same endpoint as raw equipment data. |

Future kinds — `Simulated` is named in the plan as a next addition (replay historical equipment data to exercise tier-1/tier-2 consumers without physical equipment). Architecturally supported, **not committed for the v2.0 build**. The schema models the namespace as a first-class entity (`Namespace` table), so adding a third kind is a config-DB row insert plus driver wiring, not a structural refactor.

A cluster always has at most one namespace per kind (UNIQUE on `ClusterId, Kind`). Each `DriverInstance` is bound to exactly one `NamespaceId`; a driver type is restricted to the namespace kinds it can populate (Galaxy → SystemPlatform; all native-protocol drivers → Equipment; OPC UA Client → either, by config).

**UNS naming hierarchy — mandatory in the Equipment namespace:**

Per the 3-year-plan handoff §12, Equipment-namespace browse paths must conform to the canonical 5-level Unified Namespace structure (five hierarchy levels, with the signal name as the leaf):

| Level | Name | Source | Example |
|-------|------|--------|---------|
| 1 | Enterprise | `ServerCluster.Enterprise` | `zb` |
| 2 | Site | `ServerCluster.Site` | `warsaw-west` |
| 3 | Area | `UnsArea.Name` (first-class table) | `bldg-3` or `_default` |
| 4 | Line | `UnsLine.Name` (first-class table) | `line-2` or `_default` |
| 5 | Equipment | `Equipment.Name` | `cnc-mill-05` |
| 6 | Signal | `Tag.Name` | `RunState`, `ActualFeedRate` |

OPC UA browse path: `zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState`.

**`UnsArea` and `UnsLine` are first-class, generation-versioned entities** so the UNS structure is manageable on its own — operators can rename `bldg-3` → `bldg-3a` and every piece of equipment under it picks up the new path automatically, or bulk-move 5 lines from one building to another with a single edit. Equipment references `UnsLineId` (FK), not denormalized Area/Line strings.

**Naming rules** (validated at draft-publish time and in the Admin UI):
- Each segment matches `^[a-z0-9-]{1,32}$`, OR equals the reserved placeholder `_default`
- Lowercase enforced; hyphens allowed within a segment, slashes only between segments
- Total path ≤ 200 characters
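
The rules above can be sketched as a validator — an illustrative Python sketch applied to the five hierarchy segments (how the rules apply to the `Tag.Name` leaf, which the examples show capitalized, is left to the real `sp_ValidateDraft`):

```python
import re

SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")   # lowercase, digits, hyphens, 1-32 chars

def validate_uns_path(segments):
    """True if every segment is valid and the joined path fits the cap."""
    if any(s != "_default" and not SEGMENT.match(s) for s in segments):
        return False                          # bad charset/case/length in a segment
    return len("/".join(segments)) <= 200     # slashes only between segments

print(validate_uns_path(["zb", "warsaw-west", "bldg-3", "_default", "cnc-mill-05"]))
```

`_default` passes as the reserved placeholder; an uppercase segment like `Bldg3` fails the charset check.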

**Equipment is a first-class entity** with five distinct identifiers serving different audiences:

| Identifier | Audience | Mutability | Uniqueness | Purpose |
|------------|----------|:----------:|------------|---------|
| `EquipmentUuid` | Downstream events / dbt / Redpanda | **Immutable forever** | Globally unique (UUIDv4) | Permanent join key across systems and time |
| `EquipmentId` | Internal config DB | Immutable after publish | Within cluster | Stable logical key for cross-generation diffs |
| `MachineCode` | OT operators | Mutable (with publish) | Within cluster | Colloquial name in conversations and runbooks (e.g. `machine_001`) |
| `ZTag` | ERP integration | Mutable (rare) | Fleet-wide | **Primary identifier for browsing in Admin UI** — list/search default sort |
| `SAPID` | SAP PM integration | Mutable (rare) | Fleet-wide | Maintenance system join key |

All five are exposed as **OPC UA properties** on the equipment node. External systems can resolve equipment by whichever identifier they natively use — ERP queries by ZTag, SAP PM by SAPID, OT operators by MachineCode in conversation, downstream events by EquipmentUuid for permanent lineage. The OPC UA browse path uses `Equipment.Name` as the level-5 segment; the other identifiers do not appear in the path but are properties on the node.

**The SystemPlatform namespace does NOT use UNS** — Galaxy's hierarchy is preserved as v1 LmxOpcUa exposes it (Area > Object). UNS rules apply only to drivers in Equipment-kind namespaces.

**Authority for equipment-class templates** lives in a future central `schemas` repo (not yet created per the 3-year plan). v2.0 ships an `Equipment.EquipmentClassRef` column as a hook (nullable, FK-to-future); enforcement is added when the schemas repo lands. Cheap to add now, expensive to retrofit.

**Canonical machine-state vocabulary** (`Running`, `Idle`, `Faulted`, `Starved`, `Blocked`) — derivation lives at Layer 3 (System Platform / Ignition), not in OtOpcUa. Our role is delivering the raw signals cleanly so derivation is accurate. Equipment-class templates from the schemas repo will define which raw signals each class exposes.

**Architecture:**

```
┌──────────────────────────────────┐
│ Central Config DB (MSSQL)        │
│                                  │
│ - Server clusters (1 or 2 nodes) │
│ - Cluster nodes (physical srvs)  │
│ - Driver assignments (per cluster)│
│ - Tag definitions (per cluster)  │
│ - Device configs (per cluster)   │
│ - Per-node overrides (minimal)   │
│ - Schemaless driver config       │
│   (JSON; cluster-level + node    │
│    override JSON)                │
└──────────┬───────────────────────┘
           │ poll / change detection
           ▼
  ┌─── Cluster LINE3-OPCUA ────────────────────────────┐
  │                                                    │
┌─┴───────────────────────────┐   ┌────────────────────┴─────┐
│ Node LINE3-OPCUA-A          │   │ Node LINE3-OPCUA-B       │
│ RedundancyRole=Primary      │   │ RedundancyRole=Secondary │
│                             │   │                          │
│ appsettings.json:           │   │ appsettings.json:        │
│  - MSSQL conn string        │   │  - MSSQL conn string     │
│  - ClusterId                │   │  - ClusterId             │
│  - NodeId                   │   │  - NodeId                │
│  - Local cache path         │   │  - Local cache path      │
│                             │   │                          │
│ Local cache (LiteDB)        │   │ Local cache (LiteDB)     │
└─────────────────────────────┘   └──────────────────────────┘
```
**How it works:**

1. Each OtOpcUa node has a minimal `appsettings.json` with just: the MSSQL connection string, its `ClusterId` and `NodeId`, a local machine-bound client certificate (or gMSA credential), and the local cache file path. **The OPC UA port and `ApplicationUri` come from the central DB** (`ClusterNode.OpcUaPort` / `ClusterNode.ApplicationUri`), not from local config — they're cluster topology, not local concerns.
2. On startup, the node authenticates to the central DB **using a credential bound to its `NodeId`** — a client cert or SQL login per node, NOT a shared DB login. The DB-side authorization layer enforces that the authenticated principal may only read config for its `NodeId`'s `ClusterId`. A self-asserted `NodeId` with the wrong credential is rejected. A node may not read another cluster's config, even if both clusters belong to the same admin team.
3. The node requests its current **config generation** from the central DB: "give me the latest published generation for cluster X." Generations are **cluster-scoped** — one generation = one cluster's full configuration snapshot.
4. The node receives the cluster-level config (drivers, devices, tags, poll groups) plus its own `ClusterNode` row (physical attributes + override JSON). It merges node overrides onto cluster-level driver configs at apply time.
5. Config is cached locally in a **LiteDB file** keyed by generation number — if the central DB is unreachable at startup, the node boots from the latest cached generation.
6. The node polls the central DB for a **new published generation**. When one is published, the node downloads it, diffs it against its current generation, and applies only the affected drivers/devices/tags (surgical *application* against an atomic *snapshot*).
7. **Both nodes of a cluster apply the same generation**, but apply timing can differ slightly (network jitter, polling phase). During the apply window, one node may be on generation N and the other on N+1; this is acceptable because OPC UA non-transparent redundancy already accommodates per-endpoint state divergence, and `ServiceLevel` will dip on the node that's mid-apply.
8. If generation application fails mid-flight, the node rolls back to the previous generation and surfaces the failure in the status dashboard; admins can publish a corrective generation or explicitly roll back the cluster.
9. The central DB is the single source of truth for fleet management — all tag definitions, device configs, driver assignments, and cluster topology live there, versioned by generation.
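
The fetch/fallback/diff cycle above can be sketched as follows — an illustrative Python sketch with hypothetical names (`fetch_latest`, the dict shapes); the real node talks to MSSQL and a LiteDB cache:

```python
def resolve_generation(fetch_latest, cache):
    """Prefer the latest published generation; fall back to the local cache
    when the central DB is unreachable (steps 3 and 5)."""
    try:
        gen = fetch_latest()      # central DB: latest *published* generation
        cache.update(gen)         # refresh the LiteDB-style cache
        return gen
    except ConnectionError:
        return cache              # boot/continue from the last cached generation

def diff_drivers(current, incoming):
    """Surgical apply against an atomic snapshot: changed driver IDs only (step 6)."""
    return sorted(k for k in incoming["drivers"]
                  if current["drivers"].get(k) != incoming["drivers"][k])

current  = {"gen": 41, "drivers": {"modbus-1": {"PollMs": 500}, "galaxy-1": {}}}
incoming = {"gen": 42, "drivers": {"modbus-1": {"PollMs": 250}, "galaxy-1": {}}}
cache = dict(current)
gen = resolve_generation(lambda: incoming, cache)
print(gen["gen"], diff_drivers(current, gen))  # 42 ['modbus-1']
```

Only `modbus-1` is re-applied; `galaxy-1` is untouched, even though the whole generation was downloaded as one snapshot.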

**Central DB schema (conceptual):**

```
ServerCluster ← top-level deployment unit (1 or 2 nodes)
  - ClusterId (PK)
  - Name ← human-readable, e.g. "LINE3-OPCUA"
  - Enterprise ← UNS level 1, canonical org value: "zb" (validated [a-z0-9-]{1,32})
  - Site ← UNS level 2, e.g. "warsaw-west" (validated [a-z0-9-]{1,32})
  - NodeCount (1 | 2)
  - RedundancyMode (None | Warm | Hot) ← None when NodeCount=1
  - Enabled
  - Notes
  -- NOTE: NamespaceUri removed; namespaces are now first-class rows in the Namespace table

Namespace ← generation-versioned (revised after adversarial review finding #2),
            1+ per cluster per generation
  - NamespaceRowId (PK)
  - GenerationId (FK)
  - NamespaceId ← stable logical ID across generations, e.g. "LINE3-OPCUA-equipment"
  - ClusterId (FK)
  - Kind (Equipment | SystemPlatform | Simulated) ← UNIQUE (GenerationId, ClusterId, Kind)
  - NamespaceUri ← e.g. "urn:zb:warsaw-west:equipment".
        UNIQUE per generation; cross-generation invariant: once a
        (NamespaceId, ClusterId) pair publishes a NamespaceUri,
        it cannot change in any future generation
  - Enabled
  - Notes

ClusterNode ← physical OPC UA server within a cluster
  - NodeId (PK) ← stable per physical machine, e.g. "LINE3-OPCUA-A"
  - ClusterId (FK)
  - RedundancyRole (Primary | Secondary | Standalone)
  - Host ← machine hostname / IP
  - OpcUaPort ← typically 4840 on each machine
  - DashboardPort ← typically 8081
  - ApplicationUri ← MUST be unique per node per the OPC UA spec.
        Convention: urn:{Host}:OtOpcUa (hostname-embedded).
        Unique index enforced fleet-wide, not just per-cluster
        — two clusters sharing an ApplicationUri would confuse
        any client that browses both.
        Stored explicitly, NOT derived from Host at runtime —
        OPC UA clients pin trust to ApplicationUri (part of
        the cert validation chain), so silent rewrites would
        break client trust.
  - ServiceLevelBase ← Primary 200, Secondary 150 by default
  - DriverConfigOverridesJson ← per-node overrides keyed by DriverInstanceId,
        merged onto the cluster-level DriverConfig at apply.
        Minimal by intent — only settings that genuinely
        differ per node (e.g. MxAccess.ClientName).
  - Enabled
  - LastSeenAt

ClusterNodeCredential ← 1:1 or 1:N with ClusterNode
  - CredentialId (PK)
  - NodeId (FK) ← bound to the physical node, NOT the cluster
  - Kind (SqlLogin | ClientCertThumbprint | ADPrincipal | gMSA)
  - Value ← login name, thumbprint, SID, etc.
  - Enabled
  - RotatedAt

ConfigGeneration ← atomic, immutable snapshot of one cluster's config
  - GenerationId (PK) ← monotonically increasing
  - ClusterId (FK) ← cluster-scoped — every generation belongs to one cluster
  - PublishedAt
  - PublishedBy
  - Status (Draft | Published | Superseded | RolledBack)
  - ParentGenerationId (FK) ← rollback target
  - Notes

DriverInstance ← rows reference GenerationId; new generations = new rows
  - DriverInstanceRowId (PK)
  - GenerationId (FK)
  - DriverInstanceId ← stable logical ID across generations
  - ClusterId (FK) ← the driver lives at the cluster level — both nodes
        instantiate it identically (modulo node overrides)
  - NamespaceId (FK) ← which namespace this driver populates.
        Driver type restricts the allowed namespace Kind:
          Galaxy → SystemPlatform
          Modbus/AB CIP/AB Legacy/S7/TwinCAT/FOCAS → Equipment
          OpcUaClient → either, by config
  - Name
  - DriverType (Galaxy | ModbusTcp | AbCip | OpcUaClient | …)
  - Enabled
  - DriverConfig (JSON) ← schemaless, driver-type-specific settings.
        Per-node overrides applied via
        ClusterNode.DriverConfigOverridesJson at apply time.

Device (for multi-device drivers like Modbus, CIP)
  - DeviceRowId (PK)
  - GenerationId (FK)
  - DeviceId ← stable logical ID
  - DriverInstanceId (FK)
  - Name
  - DeviceConfig (JSON) ← host, port, unit ID, slot, etc.

UnsArea ← UNS level 3 (first-class for rename/move)
  - UnsAreaRowId (PK)
  - GenerationId (FK)
  - UnsAreaId ← stable logical ID across generations
  - ClusterId (FK)
  - Name ← UNS level 3, [a-z0-9-]{1,32} or "_default"
  - Notes

UnsLine ← UNS level 4 (first-class for rename/move)
  - UnsLineRowId (PK)
  - GenerationId (FK)
  - UnsLineId ← stable logical ID across generations
  - UnsAreaId (FK)
  - Name ← UNS level 4, [a-z0-9-]{1,32} or "_default"
  - Notes

Equipment ← UNS level-5 entity. Only for drivers in an Equipment-kind namespace.
  - EquipmentRowId (PK)
  - GenerationId (FK)
  - EquipmentId ← SYSTEM-GENERATED ('EQ-' + first 12 hex chars of EquipmentUuid).
        Never operator-supplied, never editable, never in CSV imports.
        (Revised after adversarial review finding #4 — an operator-set ID
        is a corruption path: typos mint duplicate identities.)
  - EquipmentUuid (UUIDv4) ← IMMUTABLE across all generations of the same EquipmentId.
        Validated by sp_ValidateDraft. Path/MachineCode/ZTag/SAPID
        can change; the UUID cannot.
  - DriverInstanceId (FK) ← which driver provides data for this equipment
  - DeviceId (FK, nullable) ← optional, for multi-device drivers
  - UnsLineId (FK) ← UNS levels 3+4 source via UnsLine→UnsArea
  - Name ← UNS level 5, [a-z0-9-]{1,32} (the equipment name)

  -- Operator-facing and external-system identifiers (all exposed as OPC UA properties)
  - MachineCode ← operator colloquial id (e.g. "machine_001"); REQUIRED;
        unique within cluster
  - ZTag ← ERP equipment id; nullable; unique fleet-wide;
        PRIMARY identifier for browsing in the Admin UI
  - SAPID ← SAP PM equipment id; nullable; unique fleet-wide

  - EquipmentClassRef ← nullable; future FK to schemas-repo template (TBD authority)
|
||||
- Enabled
|
||||
|
||||
Tag
|
||||
- TagRowId (PK)
|
||||
- GenerationId (FK)
|
||||
- TagId ← stable logical ID
|
||||
- EquipmentId (FK, nullable) ← REQUIRED when driver is in Equipment-kind namespace.
|
||||
NULL when driver is in SystemPlatform-kind namespace
|
||||
(Galaxy hierarchy is preserved as v1 expressed it).
|
||||
- DriverInstanceId (FK) ← always present (Equipment.DriverInstanceId mirrors this
|
||||
when EquipmentId is set; redundant but indexed for joins)
|
||||
- DeviceId (FK, nullable)
|
||||
- Name ← signal name. UNS level 6 when in Equipment namespace.
|
||||
- FolderPath ← only used when EquipmentId is NULL (SystemPlatform ns);
|
||||
Equipment provides path otherwise.
|
||||
- DataType
|
||||
- AccessLevel (Read | ReadWrite)
|
||||
- WriteIdempotent (bool) ← opt-in for write retry eligibility (see Polly section)
|
||||
- TagConfig (JSON) ← register address, poll group, scaling, etc.
|
||||
|
||||
PollGroup
|
||||
- PollGroupRowId (PK)
|
||||
- GenerationId (FK)
|
||||
- PollGroupId ← stable logical ID
|
||||
- DriverInstanceId (FK)
|
||||
- Name
|
||||
- IntervalMs
|
||||
|
||||
ClusterNodeGenerationState ← tracks which generation each NODE has applied
|
||||
- NodeId (PK, FK) ← per-node, not per-cluster — both nodes of a
|
||||
2-node cluster track independently
|
||||
- CurrentGenerationId (FK)
|
||||
- LastAppliedAt
|
||||
- LastAppliedStatus (Applied | RolledBack | Failed)
|
||||
- LastAppliedError
|
||||
|
||||
ExternalIdReservation ← NOT generation-versioned (revised after adversarial review finding #3).
|
||||
Fleet-wide ZTag/SAPID uniqueness that survives rollback,
|
||||
disable, and re-enable. Per-generation indexes can't enforce
|
||||
this — old generations still hold the same external IDs.
|
||||
- ReservationId (PK)
|
||||
- Kind (ZTag | SAPID)
|
||||
- Value ← the identifier string
|
||||
- EquipmentUuid ← which equipment owns this reservation, FOREVER
|
||||
- ClusterId ← first cluster to publish it
|
||||
- FirstPublishedAt / LastPublishedAt
|
||||
- ReleasedAt / ReleasedBy / ReleaseReason ← non-null when explicitly released by FleetAdmin
|
||||
|
||||
Lifecycle: sp_PublishGeneration auto-reserves on publish. Disable doesn't release.
|
||||
Rollback respects the reservation table. Explicit release is the only way to free a value
|
||||
for reuse by a different EquipmentUuid. UNIQUE (Kind, Value) WHERE ReleasedAt IS NULL.
|
||||
```
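
The `EquipmentId` rule in the schema above is mechanical enough to pin down in a few lines. A minimal sketch (Python for illustration — the real implementation lives in the Configuration layer's validators; lowercase hex is an assumption, since the schema only specifies "first 12 hex chars"):

```python
import uuid

def make_equipment_id(equipment_uuid: uuid.UUID) -> str:
    """Derive the system-generated EquipmentId from the immutable EquipmentUuid.

    Per the schema note: 'EQ-' + first 12 hex chars of the UUID. Because the
    id is derived and never operator-supplied, typos cannot mint duplicate
    equipment identities.
    """
    return "EQ-" + equipment_uuid.hex[:12]

# Deterministic: the same UUID always yields the same EquipmentId.
u = uuid.UUID("12345678-1234-4abc-8def-0123456789ab")
print(make_equipment_id(u))  # EQ-123456781234
```

Since the id is a pure function of the UUID, `sp_ValidateDraft` can recompute and reject any row where the two disagree.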

**Authorization model (server-side, enforced in DB):**

- All config reads go through stored procedures that take the authenticated principal from `SESSION_CONTEXT` / `SUSER_SNAME()` / `CURRENT_USER` and cross-check it against `ClusterNodeCredential.Value` for the requesting `NodeId`. A principal asking for config of a `ClusterId` that does not contain its `NodeId` gets rejected, not just filtered.
- Cross-cluster reads are forbidden even within the same site or admin scope — every config read carries the requesting `NodeId` and is checked.
- Admin UI connects with a separate elevated principal that has read/write on all clusters and generations.
- Publishing a generation is a stored procedure that validates the draft, computes the diff vs. the previous generation, and flips `Status` to `Published` atomically within a transaction. The publish is **cluster-scoped** — publishing a new generation for one cluster does not affect any other cluster.
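
The reject-don't-filter rule can be illustrated with a small sketch. This is not the real T-SQL — just a Python stand-in for the proc logic, with hypothetical in-memory rows playing the part of `ClusterNodeCredential` and `ClusterNode`:

```python
# Stand-in lookup tables (real data lives in the config DB).
credentials = {  # ClusterNodeCredential.Value -> NodeId
    "DOMAIN\\svc-line3-a$": "LINE3-OPCUA-A",
}
node_cluster = {  # ClusterNode.NodeId -> ClusterId
    "LINE3-OPCUA-A": "LINE3-OPCUA",
    "LINE3-OPCUA-B": "LINE3-OPCUA",
}

def authorize_config_read(principal: str, requested_cluster_id: str) -> str:
    """Reject (not filter) any request where the authenticated principal's
    bound NodeId does not belong to the requested cluster."""
    node_id = credentials.get(principal)
    if node_id is None:
        raise PermissionError("unknown principal")
    if node_cluster.get(node_id) != requested_cluster_id:
        raise PermissionError("cross-cluster config read rejected")
    return node_id

print(authorize_config_read("DOMAIN\\svc-line3-a$", "LINE3-OPCUA"))  # LINE3-OPCUA-A
```

The key property is that a self-asserted `NodeId` never enters the decision — only the credential-to-node binding stored server-side does.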

**appsettings.json stays minimal:**

```jsonc
{
  "Cluster": {
    "ClusterId": "LINE3-OPCUA",
    "NodeId": "LINE3-OPCUA-A"
    // OPC UA port, ApplicationUri, redundancy role all come from central DB
  },
  "ConfigDatabase": {
    // The connection string MUST authenticate as a principal bound to this NodeId.
    // Options (pick one per deployment):
    //   - Integrated Security + gMSA (preferred on AD-joined hosts)
    //   - Client certificate (Authentication=ActiveDirectoryMsi or cert-auth)
    //   - SQL login scoped via ClusterNodeCredential table (rotate regularly)
    // A shared DB login across nodes is NOT supported — the server-side
    // authorization layer will reject cross-cluster config reads.
    "ConnectionString": "Server=configsrv;Database=OtOpcUaConfig;Authentication=...;...",
    "GenerationPollIntervalSeconds": 30,
    "LocalCachePath": "config_cache.db"
  },
  "Security": { /* transport/auth settings — still local */ }
}
```
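
The bootstrap order implied by the `ConfigDatabase` settings and the fail-to-start rule (decision #21) can be sketched as follows; `load_from_db` and `load_from_cache` are hypothetical stand-ins for the EF Core and LiteDB code paths:

```python
def resolve_startup_config(load_from_db, load_from_cache):
    """Central DB first; fall back to the generation-keyed local cache;
    refuse to start if neither is available (decision #21 — there is no
    meaningful zero-config mode)."""
    try:
        return load_from_db(), "db"
    except ConnectionError:
        pass  # sustained DB outage — try the LiteDB cache
    cached = load_from_cache()
    if cached is not None:
        return cached, "cache"
    raise SystemExit("no central DB and no local cache — refusing to start")
```

In the real service this sits behind the Polly retry/fallback pipeline (decision #36) rather than a bare `except`.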

**Decided:**

- Central MSSQL database is the single source of truth for all configuration.
- **Top-level deployment unit is `ServerCluster`** with 1 or 2 `ClusterNode` members. Single-node and 2-node deployments use the same schema; single-node is a cluster of one.
- **Driver, device, tag, equipment, and poll-group config attaches to `ClusterId`, not to individual nodes.** Both nodes of a cluster serve identical address spaces.
- **Per-node overrides are minimal by intent** — `ClusterNode.DriverConfigOverridesJson` is the only override mechanism, scoped to driver-config settings that genuinely must differ per node (e.g. `MxAccess.ClientName`). Tags, equipment, and devices have no per-node override path.
- **`ApplicationUri` is auto-suggested but never auto-rewritten.** When an operator creates a new `ClusterNode` in Admin, the UI prefills `urn:{Host}:OtOpcUa`. If the operator later changes `Host`, the UI surfaces a warning that `ApplicationUri` is **not** updated automatically — OPC UA clients pin trust to it, and a silent rewrite would force every client to re-pair. The operator must explicitly opt in to changing it.
- Each node identifies itself by `NodeId` and `ClusterId` **and authenticates with a credential bound to its NodeId**; the DB enforces the mapping server-side. A self-asserted `NodeId` is not accepted, and a node may not read another cluster's config.
- **Each cluster serves multiple namespaces through one endpoint**, modeled as first-class `Namespace` rows (Kind ∈ {Equipment, SystemPlatform, Simulated}). Adding a future namespace kind is a config-DB row insert + driver wiring, not a structural refactor.
- **UNS naming hierarchy is mandatory in Equipment-kind namespaces**: 5 levels (Enterprise/Site/Area/Line/Equipment) with signals as level-6 children. Each segment is validated against `^[a-z0-9-]{1,32}$` or `_default`; the total path is ≤ 200 chars. The SystemPlatform namespace preserves Galaxy's existing hierarchy unchanged.
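
A validator for the UNS rules above might look like this (illustrative Python; the real validators live in the Configuration project per Phase 1):

```python
import re

# Each UNS segment: ^[a-z0-9-]{1,32}$ or the literal "_default" placeholder.
SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")

def validate_uns_path(segments: list) -> bool:
    """Validate a 5-level UNS path (Enterprise/Site/Area/Line/Equipment).

    Rules per the decision above: exactly 5 segments, each matching the
    segment regex or "_default", and a joined path of at most 200 chars.
    """
    if len(segments) != 5:
        return False
    if not all(s == "_default" or SEGMENT.match(s) for s in segments):
        return False
    return len("/".join(segments)) <= 200

print(validate_uns_path(["acme", "plant-1", "packaging", "line-3", "filler"]))  # True
print(validate_uns_path(["acme", "Plant_1", "packaging", "line-3", "filler"]))  # False
```

Signal names (level 6) would be checked by the same segment rule; whether the 200-char budget includes level 6 is for the Configuration validators to fix precisely.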
- **Equipment is a first-class entity in Equipment namespaces** with a stable `EquipmentUuid` (UUIDv4) immutable across renames, moves, and generations. Path can change; UUID cannot.
- **`Equipment.EquipmentClassRef` is a hook for future schemas-repo integration** — nullable now, FK enforcement added when the central `schemas` repo lands per the 3-year-plan.
- Local LiteDB cache for offline startup resilience, keyed by generation.
- JSON columns for driver-type-specific config (schemaless per driver type, structured at the fleet level).
- Multiple instances of the same driver type supported within one cluster.
- Each device in a driver instance appears as a folder node in the address space.

**Decided (rollout model):**

- Config is versioned as **immutable, cluster-scoped generations**. Admin authors a draft for a cluster, then publishes it in a single transaction. Nodes only ever observe a fully-published generation — never a half-edited mix of rows.
- One generation = one cluster's full configuration snapshot. Publishing a generation for one cluster does not affect any other cluster.
- Each node polls for the latest generation for its cluster, diffs it against its current applied generation, and surgically applies only the affected drivers/devices/tags. Surgical *application* is safe because the *source snapshot* is atomic.
- **Both nodes of a cluster apply the same generation independently** — the apply timing can differ slightly. During the apply window, one node may be on generation N while the other is on N+1; this is acceptable because non-transparent redundancy already accommodates per-endpoint state divergence and `ServiceLevel` will dip on the node that's mid-apply.
- Rollback: publishing a new generation never deletes old ones. Admins can roll back a cluster to any previous generation; nodes apply the target generation the same way as a forward publish.
- Applied state per node is tracked in `ClusterNodeGenerationState` so Admin can see which nodes have picked up a new publish and detect stragglers or a 2-node cluster that's diverged.
- If neither the central DB nor a local cache is available, the node fails to start. This is acceptable — there's no meaningful "run with zero config" mode.
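
The poll-diff-apply step can be sketched at the tag level. The shapes below are hypothetical — the real diff walks drivers, devices, and tags, each keyed by its stable logical ID so renames and config edits show up as changes, not delete+create pairs:

```python
def diff_generation(current: dict, target: dict):
    """Diff two generation snapshots keyed by stable logical ID.

    Returns (added, removed, changed) so the node can surgically reload
    only the affected entries instead of rebuilding the address space.
    """
    added   = {k: target[k] for k in target.keys() - current.keys()}
    removed = {k: current[k] for k in current.keys() - target.keys()}
    changed = {k: target[k] for k in current.keys() & target.keys()
               if current[k] != target[k]}
    return added, removed, changed

gen_n  = {"tag-1": {"addr": 40001}, "tag-2": {"addr": 40002}}
gen_n1 = {"tag-1": {"addr": 40001}, "tag-2": {"addr": 40010}, "tag-3": {"addr": 40003}}
added, removed, changed = diff_generation(gen_n, gen_n1)
print(sorted(added), sorted(removed), sorted(changed))  # ['tag-3'] [] ['tag-2']
```

Because the source snapshot is an immutable generation, the diff is computed against two consistent states — the incremental apply never mixes rows from a half-published edit.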

**Decided:**

- **Transport security config (certs, LDAP settings, transport profiles) stays local** in `appsettings.json` per instance. Avoids a bootstrap chicken-and-egg where DB connection credentials would depend on config retrieved *from* the DB. Matches the current v1 deployment model.
- **Generation retention: keep all generations forever.** Rollback target is always available; audit trail is complete. Config rows are small and publish cadence is low (days/weeks), so storage cost is negligible versus the utility of a complete history.

**Deferred:**

- Event-driven generation notification (SignalR / Service Broker) as an optimisation over poll interval — deferred until polling proves insufficient.

---

### 5. Project Structure

**All projects target .NET 10 x64 unless noted.**

```
src/
  # ── Configuration layer ──
  ZB.MOM.WW.OtOpcUa.Configuration/        # Central DB schema (EF), change detection,
                                          # local LiteDB cache, config models (.NET 10)
  ZB.MOM.WW.OtOpcUa.Admin/                # Blazor Server admin UI + API for managing the
                                          # central config DB (.NET 10)

  # ── Core + Server ──
  ZB.MOM.WW.OtOpcUa.Core/                 # OPC UA server, address space, subscriptions,
                                          # driver hosting (.NET 10)
  ZB.MOM.WW.OtOpcUa.Core.Abstractions/    # IDriver, IReadable, ISubscribable, etc.
                                          # thin contract (.NET 10)
  ZB.MOM.WW.OtOpcUa.Server/               # Host (Microsoft.Extensions.Hosting),
                                          # Windows Service, config bootstrap (.NET 10)

  # ── In-process drivers (.NET 10 x64) ──
  ZB.MOM.WW.OtOpcUa.Driver.ModbusTcp/     # Modbus TCP driver (NModbus)
  ZB.MOM.WW.OtOpcUa.Driver.AbCip/         # Allen-Bradley CIP driver (libplctag)
  ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/      # Allen-Bradley SLC/MicroLogix driver (libplctag)
  ZB.MOM.WW.OtOpcUa.Driver.S7/            # Siemens S7 driver (S7netplus)
  ZB.MOM.WW.OtOpcUa.Driver.TwinCat/       # Beckhoff TwinCAT ADS driver (Beckhoff.TwinCAT.Ads)
  ZB.MOM.WW.OtOpcUa.Driver.Focas/         # FANUC FOCAS CNC driver (Fwlib64.dll P/Invoke)
  ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/   # OPC UA client gateway driver

  # ── Out-of-process Galaxy driver ──
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/  # In-process proxy that implements IDriver interfaces
                                          # and forwards over IPC (.NET 10)
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/   # Separate process: MXAccess COM, Galaxy DB,
                                          # alarms, HDA. Hosts IPC server (.NET 4.8 x86)
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ # Shared IPC message contracts between Proxy
                                          # and Host (.NET Standard 2.0)

  # ── Client tooling (.NET 10 x64) ──
  ZB.MOM.WW.OtOpcUa.Client.CLI/           # client CLI
  ZB.MOM.WW.OtOpcUa.Client.UI/            # Avalonia client

tests/
  ZB.MOM.WW.OtOpcUa.Configuration.Tests/
  ZB.MOM.WW.OtOpcUa.Core.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.ModbusTcp.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.TwinCat.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.Focas.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/
  ZB.MOM.WW.OtOpcUa.IntegrationTests/
```

**Deployment units:**

| Unit | Description | Target | Deploys to |
|------|-------------|--------|------------|
| **OtOpcUa Server** | Windows Service (M.E.Hosting) — OPC UA server + in-process drivers | .NET 10 x64 | Each site node |
| **Galaxy Host** | Windows Service — out-of-process MXAccess driver | .NET 4.8 x86 | Same machine as Server (when Galaxy driver is used) |
| **OtOpcUa Admin** | Blazor Server config management UI | .NET 10 x64 | Same server or central management host |
| **OtOpcUa Client CLI** | Operator CLI tool | .NET 10 x64 | Any workstation |
| **OtOpcUa Client UI** | Avalonia desktop client | .NET 10 x64 | Any workstation |

**Dependency graph:**

```
Admin ──→ Configuration

Server ──→ Core ──→ Core.Abstractions
  │                    ↑
  │          Driver.ModbusTcp, Driver.AbCip, Driver.AbLegacy,
  │          Driver.S7, Driver.TwinCat, Driver.Focas,
  │          Driver.OpcUaClient (in-process)
  │          Driver.Galaxy.Proxy (in-process, forwards over IPC)
  ↓
Configuration

Galaxy.Proxy ──→ Galaxy.Shared ←── Galaxy.Host
                                   (.NET 4.8 x86, separate process)
```

- `Core.Abstractions` — no dependencies, referenced by Core and all drivers (including Galaxy.Proxy)
- `Configuration` — owns central DB access + local cache, referenced by Server and Admin
- `Admin` — Blazor Server app, depends on Configuration, can deploy on same server
- In-process drivers depend on `Core.Abstractions` only
- `Galaxy.Shared` — .NET Standard 2.0 IPC contracts, referenced by both Proxy (.NET 10) and Host (.NET 4.8)
- `Galaxy.Host` — standalone .NET 4.8 x86 process, does NOT reference Core or Core.Abstractions
- `Galaxy.Proxy` — implements `IDriver` etc., depends on Core.Abstractions + Galaxy.Shared

**Decided:**

- Mono-repo (Decision #31 above).
- `Core.Abstractions` is **internal-only for now** — no standalone NuGet. Keep the contract mutable while the first 8 drivers are being built; revisit publishing after Phase 5 when the shape has stabilized. Design the contract *as if* it will eventually be public (no leaky types, stable names) to minimize churn later.

---

### 5a. LmxNodeManager Reusability Analysis

**Investigated 2026-04-17.** The existing `LmxNodeManager` (2923 lines) is the foundation for the new generic node manager — not a rewrite candidate. Categorized inventory:

| Bucket | Lines | % | What's here |
|--------|-------|-----|-------------|
| **Already generic** | ~1310 | 45% | OPC UA plumbing: `CreateAddressSpace` + topological sort + `_nodeMap`, Read/Write dispatch, HistoryRead + continuation points, subscription delivery + `_pendingDataChanges` queue, dispatch thread lifecycle, runtime-status node mechanism, status-code mapping |
| **Generic pattern, Galaxy-coded today** | ~1170 | 40% | Bad-quality fan-out when a host drops, alarm auto-subscribe (InAlarm+Priority+Description pattern), background-subscribe tracking with shutdown-safe WaitAll, value normalization for arrays, connection-health probe machinery — each is a pattern every driver will need, currently wired to Galaxy types |
| **Truly MXAccess-specific** | ~290 | 10% | `IMxAccessClient` calls, `MxDataTypeMapper`, `SecurityClassificationMapper`, `GalaxyRuntimeProbeManager` construction/lifecycle, Historian literal, alarm auto-subscribe trigger |
| Metadata / comments | ~153 | 5% | |

**Interleaving assessment:** concerns are cleanly separated at method boundaries. Read/Write handlers do generic resolution → generic host-status check → isolated `_mxAccessClient` call. The dispatch loop is fully generic. The only meaningful interleaving is in `BuildAddressSpace()`, where `GalaxyAttributeInfo` leaks into node creation — fixable by introducing a driver-agnostic `DriverAttributeInfo` DTO.

**Refactor plan:**

1. **Rename `LmxNodeManager` → `GenericDriverNodeManager : CustomNodeManager2`** and lift the generic blocks unchanged. Swap `IMxAccessClient` for `IDriver` (composing `IReadable` / `IWritable` / `ISubscribable`). Swap `GalaxyAttributeInfo` for a driver-agnostic `DriverAttributeInfo { FullName, DriverDataType, IsArray, ArrayDim, SecurityClass, IsHistorized }`. Promote `GalaxyRuntimeProbeManager` to an `IHostConnectivityProbe` capability interface.
2. **Derive `GalaxyNodeManager : GenericDriverNodeManager`** — driver-specific builder that maps `GalaxyAttributeInfo → DriverAttributeInfo`, registers `MxDataTypeMapper` / `SecurityClassificationMapper`, injects the probe manager.
3. **New drivers** (Modbus, S7, etc.) extend `GenericDriverNodeManager` and implement the capability interfaces. No forking of the OPC UA machinery.

**Ordering within Phase 2** (fits the "incremental extraction" approach in Decision #55):

- (a) Introduce capability interfaces + `DriverAttributeInfo` in `Core.Abstractions`.
- (b) Rename to `GenericDriverNodeManager` with Galaxy still in-process as the only driver; validate parity against v1 integration tests + CLI walkthrough.
- (c) Only then move Galaxy behind the IPC boundary into `Galaxy.Host`.

Each step leaves the system runnable. The generic extraction is effectively free — the class is already mostly generic, just named and typed for Galaxy.
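
For reference, the shape of the driver-agnostic DTO from step 1 of the refactor plan, rendered as an illustrative Python dataclass (the real type is C# in `Core.Abstractions`; the Galaxy-side field names read by `from_galaxy` below are assumptions, not the actual `GalaxyAttributeInfo` members):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DriverAttributeInfo:
    """Driver-agnostic attribute descriptor — field names per the refactor plan."""
    full_name: str          # FullName — fully-qualified tag/attribute path
    driver_data_type: str   # DriverDataType — driver-native type token
    is_array: bool          # IsArray
    array_dim: int          # ArrayDim — 0 for scalars
    security_class: str     # SecurityClass
    is_historized: bool     # IsHistorized

def from_galaxy(attr: dict) -> DriverAttributeInfo:
    """Sketch of the GalaxyNodeManager mapping GalaxyAttributeInfo → DTO.
    Dict keys here are hypothetical stand-ins for the Galaxy type's fields."""
    return DriverAttributeInfo(
        full_name=attr["FullName"],
        driver_data_type=attr["MxDataType"],
        is_array=attr.get("IsArray", False),
        array_dim=attr.get("ArrayDim", 0),
        security_class=attr.get("SecurityClass", "Operate"),
        is_historized=attr.get("IsHistorized", False),
    )
```

Each driver supplies one such mapping; `GenericDriverNodeManager` consumes only the DTO, which is what removes `GalaxyAttributeInfo` from `BuildAddressSpace()`.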

---

### 6. Migration Strategy

**Decided approach:**

**Phase 0 — Rename + .NET 10 migration**

1. **Rename to OtOpcUa** — mechanical rename of namespaces, assemblies, config, and docs
2. **Migrate to .NET 10 x64** — retarget all projects except Galaxy Host

**Phase 1 — Core extraction + Configuration layer + Admin scaffold**

3. **Build `Configuration` project** — central MSSQL schema with `ServerCluster`, `ClusterNode`, `ClusterNodeCredential`, `Namespace` (generation-versioned), `UnsArea`, `UnsLine`, `ConfigGeneration`, `ClusterNodeGenerationState`, `ExternalIdReservation` (NOT generation-versioned, fleet-wide ZTag/SAPID uniqueness) plus the cluster-scoped `DriverInstance` / `Device` / `Equipment` / `Tag` / `PollGroup` tables (EF Core + migrations); UNS naming validators (segment regex, path length, `_default` placeholder, UUIDv4 immutability across generations, system-generated EquipmentId, same-cluster namespace binding, ZTag/SAPID reservation pre-flight, within-cluster uniqueness for MachineCode); server-side authorization stored procs that enforce per-node-bound-to-cluster access from authenticated principals; atomic cluster-scoped publish/rollback stored procs (`sp_PublishGeneration` reserves external IDs atomically; `sp_ReleaseExternalIdReservation` is FleetAdmin-only); LiteDB local cache keyed by generation; generation-diff application logic; per-node override merge at apply time.
4. **Extract `Core.Abstractions`** — define `IDriver`, `ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`. `IWritable` contract separates idempotent vs. non-idempotent writes at the interface level.
5. **Build `Core`** — generic driver-hosting node manager that delegates to capability interfaces, driver isolation (catch/contain), address space registration, separate Polly pipelines for reads vs. writes per the write-retry policy above.
6. **Wire `Server`** — bootstrap from Configuration using an instance-bound credential (cert/gMSA/SQL login), fail fast if the credential is rejected, register drivers, start Core.
7. **Scaffold `Admin`** — Blazor Server app with: instance + credential management, draft/publish/rollback generation workflow (diff viewer, "publish to fleet", per-instance override), and core CRUD for drivers/devices/tags. Driver-specific config screens deferred to later phases.

**Phase 2 — Galaxy driver (prove the refactor)**

8. **Build `Galaxy.Shared`** — .NET Standard 2.0 IPC message contracts
9. **Build `Galaxy.Host`** — .NET 4.8 x86 process hosting MxAccessBridge, GalaxyRepository, alarms, HDA with IPC server
10. **Build `Galaxy.Proxy`** — .NET 10 in-process proxy implementing IDriver interfaces, forwarding over IPC
11. **Validate parity** — v2 Galaxy driver must pass the same integration tests as v1

**Phase 3 — Modbus TCP driver (prove the abstraction)**

12. **Build `Driver.ModbusTcp`** — NModbus, config-driven tags from central DB, internal poll loop, device-as-folder hierarchy
13. **Add Modbus config screens to Admin** (first driver-specific config UI)

**Phase 4 — PLC drivers**

14. **Build `Driver.AbCip`** — libplctag, ControlLogix/CompactLogix symbolic tags + Admin config screens
15. **Build `Driver.AbLegacy`** — libplctag, SLC 500/MicroLogix file-based addressing + Admin config screens
16. **Build `Driver.S7`** — S7netplus, Siemens S7-300/400/1200/1500 + Admin config screens
17. **Build `Driver.TwinCat`** — Beckhoff.TwinCAT.Ads v6, native ADS notifications, symbol upload + Admin config screens

**Phase 5 — Specialty drivers**

18. **Build `Driver.Focas`** — FANUC FOCAS2 P/Invoke, pre-defined CNC tag set, PMC/macro config + Admin config screens
19. **Build `Driver.OpcUaClient`** — OPC UA client gateway/aggregation, namespace remapping, subscription proxying + Admin config screens

**Decided:**

- **Parity test for Galaxy**: existing v1 IntegrationTests suite + scripted Client.CLI walkthrough (see Section 4 above).
- **Timeline**: no hard deadline. Each phase ships when it's right — tests passing, Galaxy parity bar met. Quality cadence over calendar cadence.
- **FOCAS SDK**: license already secured. Phase 5 can proceed as scheduled; `Fwlib64.dll` available for P/Invoke.
|
||||
|
||||
---
|
||||
|
||||
## Decision Log
|
||||
|
||||
| # | Decision | Rationale | Date |
|
||||
|---|----------|-----------|------|
|
||||
| 1 | Work on `v2` branch | Keep master stable for production | 2026-04-16 |
|
||||
| 2 | OPC UA core + pluggable driver modules | Enable multi-protocol support without forking the server | 2026-04-16 |
|
||||
| 3 | Rename to **OtOpcUa** | Product is no longer LMX-specific | 2026-04-16 |
|
||||
| 4 | Composable capability interfaces | Drivers vary widely in what they support; flat `IDriver` would force stubs | 2026-04-16 |
|
||||
| 5 | Target drivers: Galaxy, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client | Full PLC/CNC/SCADA/aggregation coverage | 2026-04-16 |
|
||||
| 6 | Polling is driver-internal, not core-managed | Each driver owns its poll loop; core just sees data change callbacks | 2026-04-16 |
|
||||
| 7 | Multiple instances of same driver type supported | Need e.g. separate Modbus drivers for different device groups | 2026-04-16 |
|
||||
| 8 | Namespace index per driver instance | Each instance gets its own NamespaceUri for clean isolation | 2026-04-16 |
|
||||
| 9 | Rename to OtOpcUa as step 1 | Clean mechanical change before any refactoring | 2026-04-16 |
|
||||
| 10 | Modbus TCP as second driver | Simplest protocol, validates abstraction with flat/polled/config-driven model | 2026-04-16 |
|
||||
| 11 | Library selections per driver | NModbus (Modbus), libplctag (AB CIP + AB Legacy), S7netplus (S7), Beckhoff.TwinCAT.Ads v6 (TwinCAT), Fwlib64.dll P/Invoke (FOCAS), OPC Foundation SDK (OPC UA Client) | 2026-04-16 |
|
||||
| 12 | Driver isolation — failure contained per instance | One driver crash/disconnect must not affect other drivers' nodes or quality | 2026-04-16 |
|
||||
| 13 | Shared OPC UA StatusCode model for quality | Drivers map to the same StatusCode space; each defines which codes it produces | 2026-04-16 |
|
||||
| 14 | Central MSSQL config database | Single source of truth for fleet-wide config — instances, drivers, tags, devices | 2026-04-16 |
|
||||
| 15 | LiteDB local cache per instance | Offline startup resilience — instance boots from cache if central DB is unreachable | 2026-04-16 |
|
||||
| 16 | JSON columns for driver-specific config | Schemaless per driver type, avoids table-per-driver-type explosion | 2026-04-16 |
|
||||
| 17 | Device-as-folder in address space | Multi-device drivers expose Device/Tag hierarchy for intuitive browsing | 2026-04-16 |
|
||||
| 18 | Minimal appsettings.json (ClusterId + NodeId + DB conn) | All real config lives in central DB, not local files. OPC UA port and ApplicationUri come from `ClusterNode` row, not local config | 2026-04-16 / 2026-04-17 |
|
||||
| 19 | Blazor Server admin app for config management | Separate deployable, manages central MSSQL config DB | 2026-04-16 |
|
||||
| 20 | Surgical config change detection | Instance detects which drivers/devices/tags changed, applies incremental updates | 2026-04-16 |
|
||||
| 21 | Fail-to-start without DB or cache | No meaningful zero-config mode — requires at least cached config | 2026-04-16 |
|
||||
| 22 | `Configuration` project owns DB + cache layer | Clean separation: Server and Admin both depend on it | 2026-04-16 |
|
||||
| 23 | .NET 10 x64 default, .NET 4.8 x86 only for Galaxy Host | Modern runtime for everything; COM constraint isolated to Galaxy | 2026-04-16 |
|
||||
| 24 | Galaxy driver is out-of-process | .NET 4.8 x86 process can't load into .NET 10 x64; IPC bridge required | 2026-04-16 |
|
||||
| 25 | Galaxy.Shared (.NET Standard 2.0) for IPC contracts | Must be consumable by both .NET 10 Proxy and .NET 4.8 Host | 2026-04-16 |
|
||||
| 26 | Admin deploys on same server (co-hosted) | Simplifies deployment; can also run on separate management host | 2026-04-16 |
|
||||
| 27 | Admin scaffold early, driver-specific screens deferred | Core CRUD for instances/drivers first; per-driver config UI added with each driver | 2026-04-16 |
|
||||
| 28 | Named pipes for Galaxy IPC | Fast, no port conflicts, native to both .NET 4.8 and .NET 10 | 2026-04-16 |
|
||||
| 29 | Galaxy Host is a separate Windows service | Independent lifecycle, can restart without affecting main server or other drivers | 2026-04-16 |
|
||||
| 30 | Drop TopShelf, use Microsoft.Extensions.Hosting | Built-in Windows Service support in .NET 10, no third-party dependency | 2026-04-16 |
|
||||
| 31 | Mono-repo for all drivers | Simpler dependency management, single CI pipeline, shared abstractions | 2026-04-16 |
|
||||
| 32 | MessagePack serialization for Galaxy IPC | Binary, fast, works on .NET 4.8+ and .NET 10 via MessagePack-CSharp NuGet | 2026-04-16 |
|
||||
| 33 | EF Core for Configuration DB | Migrations, LINQ queries, standard .NET 10 ORM | 2026-04-16 |
|
||||
| 34 | Polly v8+ for resilience | Retry, circuit breaker, timeout per device/driver — replaces hand-rolled supervision | 2026-04-16 |
|
||||
| 35 | Per-device resilience pipelines | Circuit breaker on Drive1 doesn't affect Drive2, even in same driver instance | 2026-04-16 |
|
||||
| 36 | Polly for config DB access | Retry + fallback to LiteDB cache on sustained DB outage | 2026-04-16 |
|
||||
| 37 | FOCAS driver uses pre-defined tag set | CNC data is functional (axes, spindle, PMC), not user-defined tags — driver exposes fixed node hierarchy populated by specific FOCAS2 API calls | 2026-04-16 |
|
||||
| 38 | FOCAS PMC + macro variables are user-configured | PMC addresses (R, D, G, F, etc.) and macro variable ranges configured in central DB; not auto-discovered | 2026-04-16 |
|
||||
| 39 | TwinCAT uses native ADS notifications | One of 3 drivers with native subscriptions (Galaxy, TwinCAT, OPC UA Client); no polling needed for subscribed tags | 2026-04-16 |
|
||||
| 40 | TwinCAT no runtime required on server | Beckhoff.TwinCAT.Ads v6 supports in-process ADS router; only needs AMS route on target device | 2026-04-16 |
|
||||
| 41 | AB Legacy (SLC/MicroLogix) as separate driver from AB CIP | Different protocol (PCCC vs CIP), different addressing (file-based vs symbolic), severe connection limits (4-8) | 2026-04-16 |
|
||||
| 42 | S7 driver notes: PUT/GET must be enabled on S7-1200/1500 | Disabled by default in TIA Portal; document as prerequisite | 2026-04-16 |
|
||||
| 43 | DL205 (AutomationDirect) handled by Modbus TCP driver | DL205 supports Modbus TCP via H2-ECOM100; no separate driver needed — `AddressFormat=DL205` adds octal address translation | 2026-04-16 |
|
||||
| 44 | No automatic retry on writes by default | Write retries are unsafe for non-idempotent field actions — a timeout can fire after the device already accepted the command, and replay duplicates pulses/acks/counters/recipe steps (adversarial review finding #1) | 2026-04-16 |
|
||||
| 45 | Opt-in write retry via `TagConfig.WriteIdempotent` or CAS wrapper | Retries must be explicit per tag; CAS (compare-and-set) verifies device state before retry where the protocol supports it | 2026-04-16 |
|
||||
| 46 | Instance identity is credential-bound, not self-asserted | Each instance authenticates to the central DB with a credential (cert/gMSA/SQL login) bound to its `InstanceId`; the DB rejects cross-instance config reads server-side (adversarial review finding #2) | 2026-04-16 |
|
||||
| 47 | `InstanceCredential` table + authorization stored procs | Credentials and the `InstanceId` they are authorized for live in the DB; all config reads go through procs that enforce the mapping rather than trusting the client | 2026-04-16 |
|
||||
| 48 | Config is versioned as immutable generations with atomic publish | Admin publishes a whole generation in one transaction; instances only ever observe fully-published generations, never partial multi-row edits (adversarial review finding #3) | 2026-04-16 |
| 49 | Surgical reload applies a generation diff, not raw row deltas | The source snapshot is atomic (generation), but applying it to a running instance is still incremental — only affected drivers/devices/tags reload | 2026-04-16 |
| 50 | Explicit rollback via re-publishing a prior generation | Generations are never deleted; rollback is just publishing an older generation as the new current, so instances apply it the same way as a forward publish | 2026-04-16 |
| 51 | `InstanceGenerationState` tracks applied generation per instance | Admin can see which instances have picked up a new publish and detect stragglers or failed applies | 2026-04-16 |
| 52 | Address space registration via builder/context API | Core owns the tree; the driver streams AddFolder/AddVariable calls on an `IAddressSpaceBuilder`, which avoids buffering the whole tree and supports incremental discovery | 2026-04-17 |
| 53 | Capability discovery via interface checks (`is IAlarmSource`) | The interface *is* the capability — no redundant flag enum to keep in sync with the implementation | 2026-04-17 |
| 54 | Optional `IRediscoverable` sub-interface for change-detection | Drivers with a native change signal (Galaxy deploy time, OPC UA change notifications) opt in; static drivers skip it | 2026-04-17 |
| 55 | Galaxy refactor is incremental — extract interfaces in place first | Refactor `LmxNodeManager` against new abstractions while still in-process, validate, then move behind IPC. Keeps system runnable at each step | 2026-04-17 |
| 56 | Galaxy parity test = v1 integration suite + scripted CLI walkthrough | Automated regression plus human-observable behavior on a dev Galaxy | 2026-04-17 |
| 57 | Transport security config stays local in `appsettings.json` | Avoids bootstrap chicken-and-egg (DB-connection credentials can't depend on config fetched from the DB); matches v1 deployment | 2026-04-17 |
| 58 | Generation retention: keep all generations forever | Rollback target always available; audit trail complete; storage cost negligible at publish cadence of days/weeks | 2026-04-17 |
| 59 | `Core.Abstractions` internal-only for now, no NuGet | Keep the contract mutable through the first 8 drivers; design as if public, revisit after Phase 5 | 2026-04-17 |
| 60 | No hard deadline — phases deliver when they're right | Quality cadence over calendar cadence; Galaxy parity bar must be met before moving on | 2026-04-17 |
| 61 | FOCAS SDK license already secured | Phase 5 can proceed; `Fwlib64.dll` available for P/Invoke with no procurement blocker | 2026-04-17 |
| 62 | `LmxNodeManager` is the foundation for `GenericDriverNodeManager`, not a rewrite | ~85% of the 2923 lines are generic or generic-in-spirit; only ~10% (~290 lines) are truly MXAccess-specific. Concerns are cleanly separated at method boundaries — refactor is rename + DTO swap, not restructuring | 2026-04-17 |
| 63 | Driver stability tier model (A/B/C) | Drivers vary in failure profile (pure managed vs wrapped native vs black-box DLL); tier dictates hosting and protection level. See `driver-stability.md` | 2026-04-17 |
| 64 | FOCAS is Tier C — out-of-process Windows service from day one | `Fwlib64.dll` is a black-box vendor DLL; an `AccessViolationException` is uncatchable in modern .NET and would tear down the OPC UA server. Same Proxy/Host/Shared pattern as Galaxy | 2026-04-17 |
| 65 | Cross-cutting stability protections mandatory in all tiers | SafeHandle for every native resource, memory watchdog, bounded operation queues, scheduled recycle, crash-loop circuit breaker, post-mortem log — apply to every driver process whether in-proc or isolated | 2026-04-17 |
| 66 | Out-of-process driver pattern is reusable across Tier C drivers | Galaxy.Proxy/Host/Shared template generalizes; FOCAS is the second user; future Tier B → Tier C escalations reuse the same three-project template | 2026-04-17 |
| 67 | Tier B drivers may escalate to Tier C on production evidence | libplctag (AB CIP/Legacy), S7netplus, TwinCAT.Ads start in-process; promote to isolated host if leaks or crashes appear in field | 2026-04-17 |
| 68 | Crash-loop circuit breaker — 3 crashes/5 min stops respawn | Prevents host respawn thrashing when the underlying device or DLL is in a state respawning won't fix; surfaces operator-actionable alert; manual reset via Admin UI | 2026-04-17 |
| 69 | Post-mortem log via memory-mapped file | Ring buffer of last-N operations + driver-specific state; survives hard process death including native AV; supervisor reads MMF after corpse is gone — only viable post-mortem path for native crashes | 2026-04-17 |
| 70 | Watchdog thresholds = hybrid multiplier + absolute floor + hard ceiling | Pure multipliers misfire on tiny baselines; a pure absolute-MB threshold doesn't scale across deployment sizes. `max(N × baseline, baseline + floor MB)` for warn/recycle plus an absolute hard ceiling. Slope detection stays orthogonal | 2026-04-17 |
| 71 | Crash-loop reset = escalating cooldown (1 h → 4 h → 24 h manual) with sticky alerts | Manual-only is too rigid for unattended plants; pure auto-reset silently retries forever. Escalating cooldown auto-recovers transient problems but forces human attention on persistent ones; sticky alerts preserve the trail regardless of reset path | 2026-04-17 |
| 72 | Heartbeat cadence = 2 s with 3-miss tolerance (6 s detection) | 5 s × 3 = 15 s is too slow against 1 s OPC UA publish intervals; 1 s × 3 = 3 s produces false positives on GC pauses and pipe jitter. 2 s × 3 = 6 s is the sweet spot | 2026-04-17 |
| 73 | Process-level protections (RSS watchdog, scheduled recycle) apply ONLY to Tier C isolated host processes | Process recycle in the shared server would kill every other in-proc driver, every session, and the OPC UA endpoint — directly contradicts the per-driver isolation invariant. Tier A/B drivers get per-instance allocation tracking + cache flush + no-process-kill instead (adversarial review finding #1) | 2026-04-17 |
| 74 | A Tier A/B driver that needs process-level recycle MUST be promoted to Tier C | The only safe way to apply process recycle to a single driver is to give it its own process. If allocation tracking + cache flush can't bound a leak, the answer is isolation, not killing the server | 2026-04-17 |
| 75 | Wedged native calls in Tier C drivers escalate to hard process exit, never handle-free-during-call | Calling release functions on a handle with an active native call is undefined behavior — exactly the AV path Tier C is designed to prevent. After grace window, leave the handle Abandoned and `Environment.Exit(2)`. The OS reclaims fds/sockets on exit; the device's connection-timeout reclaims its end (adversarial review finding #2) | 2026-04-17 |
| 76 | Tier C IPC has mandatory pipe ACL + caller SID verification + per-process shared secret | Default named-pipe ACL allows any local user to bypass OPC UA auth and issue reads/writes/acks directly against the host. Pipe ACL restricts to server service SID, host verifies caller token on connect, supervisor-generated per-process secret as defense-in-depth (adversarial review finding #3) | 2026-04-17 |
| 77 | FOCAS stability test coverage = TCP stub (functional) + FaultShim native DLL (host-side faults) | A TCP stub cannot make Fwlib leak handles or AV — those live inside the P/Invoke boundary. Two artifacts cover the two layers honestly: TCP stub for ~80% of failures (network/protocol), FaultShim for the remaining ~20% (native crashes/leaks). Real-CNC validation remains the only path for vendor-specific Fwlib quirks (adversarial review finding #5) | 2026-04-17 |
| 78 | Per-driver stability treatment is proportional to driver risk | Galaxy and FOCAS get full Tier C deep dives in `driver-stability.md` (different concerns: COM/STA pump vs Fwlib handle pool); TwinCAT, AB CIP, AB Legacy get short Operational Stability Notes in `driver-specs.md` for their tier-promotion triggers and protocol-specific failure modes; pure-managed Tier A drivers get one paragraph each. Avoids duplicating the cross-cutting protections doc seven times | 2026-04-17 |
| 79 | Top-level deployment unit is `ServerCluster` with 1 or 2 `ClusterNode` members | Sites deploy 2-node clusters for OPC UA non-transparent redundancy (per v1 — Warm/Hot, no VIP). Single-node deployments are clusters of one. Uniform schema avoids forking the config model | 2026-04-17 |
| 80 | Driver / device / tag / poll-group config attaches to `ClusterId`, not to individual nodes | Both nodes of a cluster serve identical address spaces; defining tags twice would invite drift. One generation = one cluster's complete config | 2026-04-17 |
| 81 | Per-node overrides minimal — `ClusterNode.DriverConfigOverridesJson` only | Some driver settings legitimately differ per node (e.g. `MxAccess.ClientName` so Galaxy distinguishes them) but the surface is small. Single JSON column merged onto cluster-level `DriverConfig` at apply time. Tags and devices have no per-node override path | 2026-04-17 |
| 82 | `ConfigGeneration` is cluster-scoped, not fleet-scoped | Publishing a generation for one cluster does not affect any other cluster. Simpler rollout (one cluster at a time), simpler rollback, simpler auth boundary. Fleet-wide synchronized rollouts (if ever needed) become a separate concern — orchestrate per-cluster publishes from Admin | 2026-04-17 |
| 83 | Each node authenticates with its own `ClusterNodeCredential` bound to `NodeId` | Cluster-scoped auth would be too coarse — both nodes sharing a credential makes credential rotation harder and obscures which node read what. Per-node binding also enforces that Node A cannot impersonate Node B in audit logs | 2026-04-17 |
| 84 | Both nodes apply the same generation independently; brief divergence acceptable | OPC UA non-transparent redundancy already handles per-endpoint state divergence; `ServiceLevel` dips on the node mid-apply and clients fail over. Forcing two-phase commit across nodes would be a complex distributed-system problem with no real upside | 2026-04-17 |
| 85 | OPC UA `RedundancySupport.Transparent` not adopted in v2 | True transparent redundancy needs a VIP/load-balancer in front of the cluster. v1 ships non-transparent (Warm/Hot) with `ServerUriArray` and client-driven failover; v2 inherits the same model. Revisit only if a customer requirement demands LB-fronted transparency | 2026-04-17 |
| 86 | `ApplicationUri` auto-suggested as `urn:{Host}:OtOpcUa` but never auto-rewritten | OPC UA clients pin trust to `ApplicationUri` — it's part of the cert validation chain. Auto-rewriting it when an operator changes `Host` would silently invalidate every client trust relationship. Admin UI prefills on node creation, warns on `Host` change, requires explicit opt-in to change. Fleet-wide unique index enforces no two nodes share an `ApplicationUri` | 2026-04-17 |
| 87 | Concrete schema and stored-proc design lives in `config-db-schema.md` | The plan §4 sketches the conceptual model; the schema doc carries the actual DDL, indexes, stored procs, JSON conventions, and authorization model implementations. Keeps the plan readable while making the schema concrete enough to start implementing | 2026-04-17 |
| 88 | Admin UI is Blazor Server with LDAP-mapped admin roles (FleetAdmin / ConfigEditor / ReadOnly) | Blazor Server gives real-time SignalR for live cluster status without a separate SPA build pipeline. LDAP reuses the OPC UA auth provider (no parallel user table). Three roles cover the common ops split; cluster-scoped editor grants deferred to v2.1 | 2026-04-17 |
| 89 | Edit path is draft → diff → publish; no in-place edits, no auto-publish | Generations are atomic snapshots — every change goes through an explicit publish boundary so operators see what they're committing. The diff viewer is required reading before the publish dialog enables. Bulk operations always preview before commit | 2026-04-17 |
| 90 | Per-node overrides are NOT generation-versioned | Overrides are operationally bound to a specific physical machine, not to the cluster's logical config evolution. Editing a node override doesn't create a new generation — it updates `ClusterNode.DriverConfigOverridesJson` directly and takes effect on next apply. Replacement-node scenarios copy the override via deployment tooling, not by replaying generation history | 2026-04-17 |
| 91 | JSON content validation runs in the Admin app, not via SQL CLR | CLR is disabled by default on hardened SQL Server instances; many DBAs refuse to enable it. Admin validates against per-driver JSON schemas before invoking `sp_PublishGeneration`; the proc enforces structural integrity (FKs, uniqueness, `ISJSON`) only. Direct proc invocation is already prevented by the GRANT model | 2026-04-17 |
| 92 | Dotted-path syntax for `DriverConfigOverridesJson` keys (e.g. `MxAccess.ClientName`) | More readable than JSON Pointer in operator UI and CSV exports. Reserved-char escaping documented (`\.`, `\\`); array indexing uses `Items[0].Name` | 2026-04-17 |
| 93 | `sp_PurgeGenerationsBefore` deferred to v2.1; signature pre-specified | Initial release keeps all generations forever (decision #58). Purge proc shape locked in now: requires `@ConfirmToken` UI-shown random hex to prevent script-based mass deletion, CASCADE-deletes via `WHERE GenerationId IN (...)`, audit-log entry with row counts. Surface only when a customer compliance ask demands it | 2026-04-17 |
| 94 | ~~Admin UI component library = MudBlazor~~ **SUPERSEDED by #102** | (See #102 — switched to Bootstrap 5 for ScadaLink parity) | 2026-04-17 |
| 95 | CSV import dialect = strict CSV (RFC 4180) UTF-8, BOM accepted | Excel "Save as CSV (UTF-8)" produces RFC 4180 output and is the documented primary input format. TSV not initially supported | 2026-04-17 |
| 96 | Push-from-DB notification deferred to v2.1; polling is the v2.0 model | Tightening apply latency from ~30 s → ~1 s would need SignalR backplane or SQL Service Broker — infrastructure not earning its keep at v2.0 scale. Publish dialog reserves a disabled "Push now" button labeled "Available in v2.1" so the future UX is anchored | 2026-04-17 |
| 97 | Draft auto-save (debounced 500 ms) with explicit Discard; Publish is the only commit | Eliminates "lost work" complaints; matches Google Docs / Notion mental model. Auto-save writes to draft rows only — never to Published. Discard requires confirmation dialog | 2026-04-17 |
| 98 | ~~Admin UI ships both light and dark themes~~ **SUPERSEDED by #103** | (See #103 — light-only to match ScadaLink) | 2026-04-17 |
| 99 | CI tiering: PR-CI uses only in-process simulators; nightly/integration CI runs on dedicated Docker + Hyper-V host | Keeps PR builds fast and runnable on minimal build agents; the dedicated integration host runs the heavy simulators (`oitc/modbus-server`, TwinCAT XAR VM, Snap7 Server, libplctag `ab_server`). Operational dependency: stand up the dedicated host before Phase 3 | 2026-04-17 |
| 100 | Studio 5000 Logix Emulate: pre-release validation tier only, no phase-gate | If an org license can be earmarked, designate a golden box for quarterly UDT/Program-scope passes. If not, AB CIP ships validated against `ab_server` only with documented UAT-time fidelity gap. Don't block Phase 4 on procurement | 2026-04-17 |
| 101 | FOCAS Wireshark capture is a Phase 5 prerequisite identified during Phase 4 | Target capture (production CNC, CNC Guide seat, or customer site visit) identified by Phase 4 mid-point; if no target by then, escalate to procurement (CNC Guide license or dev-rig CNC) as a Phase 5 dependency | 2026-04-17 |
| 102 | Admin UI styling = Bootstrap 5 vendored (parity with ScadaLink CentralUI) | Operators using both ScadaLink and OtOpcUa Admin see the same login screen, same sidebar, same component vocabulary. ScadaLink ships Bootstrap 5 with a custom dark-sidebar + light-main aesthetic; mirroring it directly outweighs MudBlazor's Blazor-component conveniences. Supersedes #94 | 2026-04-17 |
| 103 | Admin UI ships single light theme matching ScadaLink (no dark mode in v2.0) | ScadaLink is light-only; cross-app aesthetic consistency outweighs the ergonomic argument for dark mode. Revisit only if ScadaLink adds dark mode. Supersedes #98 | 2026-04-17 |
| 104 | Admin auth pattern lifted directly from ScadaLink: `LdapAuthService` + `RoleMapper` + `JwtTokenService` + cookie auth + `CookieAuthenticationStateProvider` | Same login form, same cookie scheme (30-min sliding), same claim shape (Name, DisplayName, Username, Role[], optional ClusterId[] scope), parallel `/auth/token` endpoint for API clients. Code lives in `ZB.MOM.WW.OtOpcUa.Admin.Security` (sibling of `ScadaLink.Security`); consolidate to a shared NuGet only if it later makes operational sense | 2026-04-17 |
| 105 | Cluster-scoped admin grants ship in v2.0 (lifted from v2.1 deferred list) | ScadaLink already ships the equivalent site-scoped pattern (`PermittedSiteIds` claim, `IsSystemWideDeployment` flag), so we get cluster-scoped grants free by mirroring it. `LdapGroupRoleMapping` table maps groups → role + cluster scope; users without explicit cluster claims are system-wide | 2026-04-17 |
| 106 | Shared component set copied verbatim from ScadaLink CentralUI | `DataTable`, `ConfirmDialog`, `LoadingSpinner`, `ToastNotification`, `TimestampDisplay`, `RedirectToLogin`, `NotAuthorizedView`. New Admin-specific shared components added to our folder rather than diverging from ScadaLink's set, so the shared vocabulary stays aligned | 2026-04-17 |
| 107 | Each cluster serves multiple OPC UA namespaces through one endpoint, modeled as first-class `Namespace` rows | Per 3-year-plan handoff §4: at v2.0 GA there are two namespaces (Equipment for raw signals, SystemPlatform for Galaxy-derived data); future Simulated namespace must be addable as a config-DB row + driver wiring, not a structural refactor. UNIQUE (ClusterId, Kind) | 2026-04-17 |
| 108 | UNS 5-level naming hierarchy mandatory in Equipment-kind namespaces | Per 3-year-plan handoff §12: Enterprise/Site/Area/Line/Equipment with signals as level-6 children. Each segment `^[a-z0-9-]{1,32}$` or `_default`; total path ≤ 200 chars. Validated at draft-publish and in Admin UI. SystemPlatform namespace preserves Galaxy's existing hierarchy unchanged — UNS rules don't apply there | 2026-04-17 |
| 109 | `Equipment` is a first-class entity in Equipment namespaces with stable `EquipmentUuid` (UUIDv4), immutable across renames/moves/generations | Per handoff §12: path can change (rename, move) but UUID cannot. Downstream consumers (Redpanda events, dbt) carry both UUID for joins/lineage and path for dashboards/filtering. `sp_ValidateDraft` enforces UUID-per-EquipmentId is constant across all generations of a cluster | 2026-04-17 |
| 110 | Tag belongs to Equipment in Equipment namespaces; tag belongs to Driver+FolderPath in SystemPlatform namespaces | Single `Tag` table with nullable `EquipmentId`. When set (Equipment ns), full path is computed `Enterprise/Site/Area/Line/Name/TagName`. When null (SystemPlatform ns), v1-style `DriverInstanceId + FolderPath` provides the path. Application-level constraint enforced by `sp_ValidateDraft`, not DB CHECK | 2026-04-17 |
| 111 | Driver type restricts allowed namespace Kind | Galaxy → SystemPlatform only; Modbus/AB CIP/AB Legacy/S7/TwinCAT/FOCAS → Equipment only; OpcUaClient → either, by config. Encoded in `Core.Abstractions` driver-type registry; enforced by `sp_ValidateDraft` | 2026-04-17 |
| 112 | `Equipment.EquipmentClassRef` shipped as nullable hook in v2.0 for future schemas-repo integration | Per handoff §12: equipment-class templates will live in a central `schemas` repo (not yet created). Cheap to add the column now; expensive to retrofit later. Enforcement added when schemas repo lands. v2.0 ships without template validation | 2026-04-17 |
| 113 | Canonical machine state derivation lives at Layer 3, not in OtOpcUa | Per handoff §13: `Running`/`Idle`/`Faulted`/`Starved`/`Blocked` derivation is System Platform / Ignition's job. OtOpcUa's role is delivering raw signals cleanly so derivation is accurate. Equipment-class templates (when schemas repo lands) define which raw signals each class exposes | 2026-04-17 |
| 114 | Future `Simulated` namespace architecturally supported, not v2.0 committed | Per handoff §14: `Simulated` is named as the next namespace kind for replaying historical equipment data without physical equipment. The `Namespace.Kind` enum reserves the value; no driver implementation in v2.0. Adds via config-DB row + a future replay driver | 2026-04-17 |
| 115 | UNS structure (Area, Line) modeled as first-class generation-versioned tables (`UnsArea`, `UnsLine`), not denormalized strings on Equipment | Renaming an area or moving lines between buildings is a single edit that propagates to every equipment under it; bulk-restructure operations work cleanly. Generation-versioning preserves the publish/diff/rollback safety boundary for structural changes | 2026-04-17 |
| 116 | Equipment carries five identifiers: EquipmentUuid, EquipmentId, MachineCode, ZTag, SAPID — each with a different audience | Single-identifier-per-equipment can't satisfy the diverse consumer set: downstream events need a UUID for permanent lineage, OT operators say `machine_001` (MachineCode), ERP queries by ZTag, SAP PM by SAPID, internal config diffs need a stable EquipmentId. All five exposed as OPC UA properties on the equipment node so external systems resolve by their preferred identifier without a sidecar | 2026-04-17 |
| 117 | `ZTag` is the primary browse identifier in the Admin UI | Equipment list/search defaults to ZTag column + sort. MachineCode shown alongside; SAPID searchable. The OPC UA browse path itself uses `Equipment.Name` (UNS-segment rules); ZTag/MachineCode/SAPID are properties on the node, not path components | 2026-04-17 |
| 118 | `MachineCode` required, fleet-wide uniqueness on `ZTag` and `SAPID` when set | MachineCode is the operator's colloquial name — every equipment must have one. ZTag and SAPID are external system identifiers that may not exist for newly commissioned equipment. Fleet-wide uniqueness on ERP/SAP IDs prevents the same external identifier from referencing two equipment in our config (which would silently corrupt joins) | 2026-04-17 |
| 119 | MachineCode/ZTag/SAPID free-text, not subject to UNS regex | These are external system identifiers, not OPC UA path segments. They can carry whatever conventions ERP/SAP/operator workflows use (mixed case, underscores, vendor-specific schemes). Validation is only non-empty (when present) and ≤64 chars | 2026-04-17 |
| 120 | Admin UI exposes UNS structure as a first-class management surface | Dedicated **UNS Structure tab** with tree of UnsArea → UnsLine → Equipment, drag-drop reorganize, rename with live impact preview ("X lines, Y equipment, Z signals will pick up new path"). Hybrid model: read-only navigation over the published generation, click-to-edit opens the draft editor scoped to that node. Bulk-rename and bulk-move propagate through UnsLineId FK (no per-equipment row rewrite) | 2026-04-17 |
| 121 | All five equipment identifiers exposed as OPC UA properties on the equipment node | `MachineCode`, `ZTag`, `SAPID`, `EquipmentUuid`, `EquipmentId` are properties so external systems resolve equipment by their preferred identifier without a sidecar lookup service. Browse path uses `Equipment.Name` as the level-5 segment (UNS-compliant); the other identifiers are properties, not path components | 2026-04-17 |
| 122 | Same-cluster invariant on `DriverInstance.NamespaceId` enforced in three layers (sp_ValidateDraft, API scoping, audit) | Without enforcement a draft for cluster A could bind to cluster B's namespace, leaking the URI into A's endpoint and breaking tenant isolation. UI filtering alone is insufficient — server-side scoping prevents bypass via crafted requests. Cross-cluster attempts audit-logged as `CrossClusterNamespaceAttempt`. (Closes adversarial review 2026-04-17 finding #1, critical) | 2026-04-17 |
| 123 | `Namespace` is generation-versioned (revised from earlier "cluster-level" decision) | A cluster-level namespace lets an admin disable a namespace that a published driver depends on, breaking the live config without a generation change and making rollback unreproducible. Namespaces affect what consumers see at the OPC UA endpoint — they are content, not topology — and must travel through draft → diff → publish like every other consumer-visible config. Cross-generation invariant: once a (NamespaceId, ClusterId) publishes a NamespaceUri/Kind, it cannot change. (Closes adversarial review 2026-04-17 finding #2, supersedes part of #107) | 2026-04-17 |
| 124 | ZTag/SAPID fleet-wide uniqueness backed by an `ExternalIdReservation` table, NOT generation-versioned per-generation indexes | Per-generation indexes fail under rollback and disable: old generations and disabled equipment can still hold the same external IDs, so rollback or re-enable can silently reintroduce duplicates that corrupt downstream ERP/SAP joins. The reservation table sits outside generation versioning, survives rollback, and reserves fresh values atomically at publish via `sp_PublishGeneration`. Explicit FleetAdmin release (audit-logged) is the only path that frees a value for reuse by a different EquipmentUuid. (Closes adversarial review 2026-04-17 finding #3) | 2026-04-17 |
| 125 | `Equipment.EquipmentId` is system-generated (`'EQ-' + first 12 hex chars of EquipmentUuid`), never operator-supplied or editable, never in CSV imports | Operator-supplied IDs are a real corruption path: typos and bulk-import renames mint new EquipmentIds, which then get new UUIDs even when the physical asset is the same. That permanently splits downstream joins keyed on EquipmentUuid. Removing operator authoring of EquipmentId eliminates the failure mode entirely. CSV imports match by EquipmentUuid (preferred) for updates; rows without UUID create new equipment with system-generated identifiers. Explicit Merge / Rebind operator flow handles the rare case where two UUIDs need to be reconciled. (Closes adversarial review 2026-04-17 finding #4, supersedes part of #116) | 2026-04-17 |
| 126 | Three-gate model (entry / mid / exit) for every implementation phase, with explicit compliance-check categories | Specified in `implementation/overview.md`. Categories: schema compliance (DB matches the doc), decision compliance (every decision number has a code/test citation), visual compliance (Admin UI parity with ScadaLink), behavioral compliance (per-phase smoke test), stability compliance (cross-cutting protections wired up for Tier C drivers), documentation compliance (any deviation reflected back in v2 docs). Exit gate requires two-reviewer signoff; silent deviation is the failure mode the gates exist to prevent | 2026-04-17 |
| 127 | Per-phase implementation docs live under `docs/v2/implementation/` with structured task / acceptance / compliance / completion sections | Each phase doc enumerates: scope (in / out), entry gate checklist, task breakdown with per-task acceptance criteria, compliance checks (script-runnable), behavioral smoke test, completion checklist. Phase 0 + Phase 1 docs are committed; Phases 2–8 land as their predecessors clear exit gates | 2026-04-17 |
| 128 | Driver list is fixed for v2.0 — Equipment Protocol Survey is NOT a prerequisite | The seven committed drivers (Modbus TCP including DL205, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client) plus the existing Galaxy/MXAccess driver are confirmed by direct knowledge of the equipment estate, not pending the formal survey. Supersedes the corrections-doc concern (C1) that the v2 commitment was made pre-survey. The survey may still produce useful inventory data for downstream planning (capacity, prioritization), but adding or removing drivers from the v2 implementation list is out of scope. Closes corrections-doc C1 | 2026-04-17 |
| 129 | OPC UA client data-path authorization model = `NodePermissions` bitmask flags + per-LDAP-group grants on a 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with default-deny + additive grants; explicit Deny deferred to v2.1 | Mirrors v1 SecurityClassification model for Write tiers (WriteOperate / WriteTune / WriteConfigure); adds explicit AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall flags; bundles (`ReadOnly` / `Operator` / `Engineer` / `Admin`) for one-click grants. Per-session permission-trie evaluator with O(depth × group-count) cost; cache invalidated on generation-apply or LDAP group cache expiry. Closes corrections-doc B1. See `acl-design.md` | 2026-04-17 |
| 130 | `NodeAcl` table generation-versioned, edited via draft → diff → publish | Same pattern as Namespace (#123) and Equipment (#109). ACL changes are content, not topology — they affect what consumers see at the OPC UA endpoint. Rollback restores the prior ACL state. Cross-generation invariant: `NodeAclId` once published with `(LdapGroup, ScopeKind, ScopeId)` cannot have any of those columns change | 2026-04-17 |
| 131 | Cluster-create workflow seeds default ACL set matching v1 LmxOpcUa LDAP-role-to-permission map | Preserves behavioral parity for v1 → v2 consumer migration. Operators tighten or loosen from there. Admin UI flags any cluster whose ACL set diverges from the seed | 2026-04-17 |
| 132 | OPC UA NodeManager logs denied operations only; allowed operations rely on SDK session/operation diagnostics | Logging every allowed op would dwarf the audit log. Denied-only mirrors typical authorization audit practice. Per-deployment policy can tighten if compliance requires positive-action logging | 2026-04-17 |
| 133 | Two-tier dev environment: inner-loop (in-process simulators on developer machines) + integration (Docker / VM / native simulators on a single dedicated Windows host) | Per decision #99. Concrete inventory + setup plan in `dev-environment.md` | 2026-04-17 |
| 134 | Docker Desktop with WSL2 backend (not Hyper-V backend) on integration host so TwinCAT XAR VM can run in Hyper-V alongside Docker | TwinCAT runtime cannot coexist with Hyper-V-mode Docker Desktop; WSL2 backend leaves Hyper-V free for the XAR VM. Documented operational constraint | 2026-04-17 |
| 135 | TwinCAT XAR runs only in a dedicated VM on the integration host; developer machines do NOT run XAR locally | The 7-day trial reactivation needs centralized management; the VM is shared infrastructure. Galaxy is the inverse — runs only on developer machines (Aveva license scoping), not on integration host | 2026-04-17 |
| 136 | Consumer cutover (ScadaBridge / Ignition / System Platform IO) is OUT of v2 scope | Owned by a separate integration / operations team. OtOpcUa team's scope ends at Phase 5 (all drivers built, all stability protections in place, full Admin UI shipped including ACL editor). Cutover sequencing, validation methodology, rollback procedures, and Aveva-pattern validation for tier 3 are the integration team's deliverables, tracked in 3-year-plan handoff §"Rollout Posture" and corrections doc §C5 | 2026-04-17 |
| 137 | Dev env credentials documented openly in `dev-environment.md`; production uses Integrated Security / gMSA per decision #46 | Dev defaults are not secrets — they're convenience. Production never uses these values; documented separation prevents leakage | 2026-04-17 |
| 138 | Every equipment-class template extends a shared `_base` class providing universal cross-machine metadata (identity, state, alarm summary, optional production context) | References OPC UA Companion Spec OPC 40010 (Machinery) for the Identification component + MachineryOperationMode enum, OPC UA Part 9 for alarm summary fields, ISO 22400 for KPI inputs (TotalRunSeconds, TotalCycles), 3-year-plan handoff §"Canonical Model Integration" for the canonical state vocabulary. Inheritance via `extends` field on the equipment-class JSON Schema. Avoids per-class drift in identity / state / alarm field naming and ensures every machine in the estate exposes the same baseline metadata regardless of vendor or protocol. `_base` lives in `3yearplan/schemas/classes/_base.json` (temporary location until the dedicated `schemas` repo is created) | 2026-04-17 |
|
||||
| 139 | Equipment table extended with OPC 40010 identity columns (Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri) all nullable so equipment can be added before identity is fully captured | First-class columns rather than a JSON blob because these fields are universal (every machine has them) and need to be queryable / searchable in the Admin UI. Manufacturer and Model are declared `isRequired: true` in `_base.json` and the Admin UI flags equipment that lacks them; the rest are optional. Drivers that can read these dynamically (FANUC, Beckhoff, etc.) override the static value at runtime; static value is the fallback. Exposed on the OPC UA node under the OPC 40010-standard `Identification` sub-folder | 2026-04-17 |
|
||||
| 140 | Enterprise shortname = `zb` (UNS level-1 segment) | Closes corrections-doc D4. Matches the existing `ZB.MOM.WW.*` namespace prefix used throughout the codebase; short by design since this segment appears in every equipment path (`zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState`); operators already say "ZB" colloquially. Admin UI cluster-create form default-prefills `zb` for the Enterprise field. Production deployments use it directly from cluster-create | 2026-04-17 |
|
||||
| 141 | Tier 3 (AppServer IO) cutover is feasible — AVEVA's OI Gateway supports arbitrary upstream OPC UA servers as a documented pattern | Closes corrections-doc E2 with **GREEN-YELLOW** verdict. Multiple AVEVA partners (Software Toolbox, InSource) have published working integrations against four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, Cogent DataHub). No re-architecting of OtOpcUa required. Path: `OPC UA node → OI Gateway → SuiteLink → $DDESuiteLinkDIObject → AppServer attribute`. Recommended AppServer floor: System Platform 2023 R2 Patch 01. Two integrator-burden risks tracked: validation/GxP paperwork (no AVEVA blueprint exists for non-AVEVA upstream servers in Part 11 deployments) and unpublished scale benchmarks (in-house benchmark required before cutover scheduling). See `aveva-system-platform-io-research.md` | 2026-04-17 |
|
||||
| 142 | Phase 1 acceptance includes an end-to-end AppServer-via-OI-Gateway smoke test against OtOpcUa | Catches AppServer-specific quirks (cert exchange via reject-and-trust workflow, endpoint URL must NOT include `/discovery` suffix per Inductive Automation forum failure mode, service-account install required because OI Gateway under SYSTEM cannot connect to remote OPC servers, `Basic256Sha256` + `SignAndEncrypt` + LDAP-username token combination must work end-to-end) early — well before the Year 3 tier-3 cutover schedule. Adds one task to `phase-1-configuration-and-admin-scaffold.md` Stream E (Admin smoke test) | 2026-04-17 |
|
||||
|
||||
## Reference Documents

- **[Driver Implementation Specifications](driver-specs.md)** — per-driver details: connection settings, addressing, data types, libraries, API mappings, error handling, implementation notes
- **[Test Data Sources](test-data-sources.md)** — per-driver simulator/emulator/stub for development and integration testing
- **[Driver Stability & Isolation](driver-stability.md)** — stability tier model (A/B/C), per-driver hosting decisions, cross-cutting protections, FOCAS and Galaxy deep dives
- **[Central Config DB Schema](config-db-schema.md)** — concrete table definitions, indexes, stored procedures, authorization model, JSON conventions, EF Core migrations approach
- **[Admin Web UI](admin-ui.md)** — Blazor Server admin app: information architecture, page-by-page workflows, per-driver config screen extensibility, real-time updates, UX rules
- **[OPC UA Client Authorization (ACL Design)](acl-design.md)** — data-path authz model: `NodePermissions` bitmask flags (Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall + bundles), 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with inheritance, default-deny + additive grants, per-session permission-trie evaluator with O(depth × group-count) cost, default cluster-seed mapping v1 LmxOpcUa LDAP roles, Admin UI ACL tab + bulk grant + simulator. Closes corrections-doc finding B1.
- **[Development Environment](dev-environment.md)** — every external resource the v2 build needs (SQL Server, GLAuth, Galaxy, Docker simulators, TwinCAT XAR VM, OPC Foundation reference server, FOCAS stub + FaultShim) with default ports / credentials / owners; two-tier model (inner-loop on developer machines, integration on a single dedicated Windows host with WSL2-backed Docker + Hyper-V VM for TwinCAT); concrete bootstrap order for both tiers
- **[AVEVA System Platform IO research](aveva-system-platform-io-research.md)** — closes corrections-doc E2. Validates that the planned tier-3 cutover (AppServer IO consuming from OtOpcUa instead of equipment directly) is supported via AVEVA's OI Gateway driver. Verdict: **GREEN-YELLOW**. Multiple non-AVEVA upstream-server integrations published. Two integrator-burden risks: validation/GxP paperwork and unpublished scale benchmarks
- **[Implementation Plan Overview](implementation/overview.md)** — phase gate structure (entry / mid / exit), compliance check categories (schema / decision / visual / behavioral / stability / documentation), deliverable conventions, "what counts as following the plan"
- **[Phase 0 — Rename + .NET 10 cleanup](implementation/phase-0-rename-and-net10.md)** — mechanical LmxOpcUa → OtOpcUa rename with full task breakdown, compliance checks, completion checklist
- **[Phase 1 — Configuration + Core.Abstractions + Admin scaffold](implementation/phase-1-configuration-and-admin-scaffold.md)** — central MSSQL schema, EF Core migrations, stored procs, LDAP-authenticated Blazor Server admin app with ScadaLink visual parity, LiteDB local cache, generation-diff applier; 5 work streams (A–E), full task breakdown, compliance checks, 14-step end-to-end smoke test
- **[Phase 2 — Galaxy out-of-process refactor (Tier C)](implementation/phase-2-galaxy-out-of-process.md)** — split legacy in-process Galaxy into `Driver.Galaxy.Shared` (.NET Standard 2.0 IPC contracts) + `Driver.Galaxy.Host` (.NET 4.8 x86 separate Windows service with STA pump, COM SafeHandle wrappers, named-pipe IPC with mandatory ACL, memory watchdog, scheduled recycle with WM_QUIT escalation, post-mortem MMF, FaultShim) + `Driver.Galaxy.Proxy` (.NET 10 in-process IDriver implementation with heartbeat sender and crash-loop circuit-breaker supervisor); retire legacy `OtOpcUa.Host` project; parity gate is v1 IntegrationTests + scripted Client.CLI walkthrough passing byte-equivalent to v1; closes the four 2026-04-13 stability findings as named regression tests


## Out of Scope / Deferred

-

499
docs/v2/test-data-sources.md
Normal file
@@ -0,0 +1,499 @@

# Test Data Sources — OtOpcUa v2

> **Status**: DRAFT — companion to `plan.md`. Identifies the simulator/emulator/stub each driver will be developed and integration-tested against, so a developer laptop and a CI runner can exercise every driver without physical hardware.
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Scope

The v2 plan adds eight drivers (Galaxy, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client). Each needs a repeatable, low-friction data source for:

- **Inner-loop development** — a developer running tests on their own machine
- **CI integration tests** — automated runs against a known-good fixture
- **Pre-release fidelity validation** — at least one "golden" rig with the highest-fidelity option available, even if paid/heavy

Two drivers are already covered and are **out of scope** for this document:

| Driver | Existing source | Why no work needed |
|--------|-----------------|---------------------|
| Galaxy | Real System Platform Galaxy on the dev machine | MXAccess requires a deployed ArchestrA Platform anyway; the dev box already has one |
| OPC UA Client | OPC Foundation `ConsoleReferenceServer` from UA-.NETStandard | Reference-grade simulator from the same SDK we depend on; trivial to spin up |

The remaining six drivers are the subject of this document.

## Standard Test Scenario

Each simulator must expose a fixture that lets cross-driver integration tests exercise three orthogonal axes: the **data type matrix**, the **behavior matrix**, and **capability-gated extras**. v1 LMX testing already exercises ~12 Galaxy types plus 1D arrays plus security classifications plus historized attributes — the v2 fixture for each driver has to reach at least that bar.

### A. Data type matrix (every driver, scalar and array)

Each simulator exposes one tag per cell where the protocol supports the type natively:

| Type family | Scalar | 1D array (small, ~10) | 1D array (large, ~500) | Notes |
|-------------|:------:|:---------------------:|:----------------------:|-------|
| Bool | ✔ | ✔ | — | Discrete subscription test |
| Int16 (signed) | ✔ | ✔ | ✔ | Where the protocol distinguishes it from Int32 |
| Int32 (signed) | ✔ | ✔ | ✔ | Universal |
| Int64 | ✔ | ✔ | — | Where the protocol supports it |
| UInt16 / UInt32 | ✔ | — | — | Where the protocol distinguishes signed/unsigned (Modbus, S7) |
| Float32 | ✔ | ✔ | ✔ | Endianness test |
| Float64 | ✔ | ✔ | — | Where the protocol supports it |
| String | ✔ | ✔ (Galaxy/AB/TwinCAT) | — | Include empty, ASCII, UTF-8/Unicode, max-length |
| DateTime | ✔ | — | — | Galaxy, TwinCAT, OPC UA Client only |

Large arrays (~500 elements) catch paged-read, fragmentation, and PDU-batching bugs that small arrays miss.

### B. Behavior matrix (applied to a subset of the type matrix)

| Behavior | Applied to | Validates |
|----------|------------|-----------|
| **Static read** | One tag per type in matrix A | Type mapping, value decoding |
| **Ramp** | Int32, Float32 | Subscription delivery cadence, source timestamps |
| **Write-then-read-back** | Bool, Int32, Float32, String | Round-trip per type family, idempotent-write path |
| **Array element write** | Int32[10] | Partial-write paths (where the protocol supports them); whole-array replace where it doesn't |
| **Large-array read** | Int32[500] | Paged reads, PDU batching, no truncation |
| **Bool toggle on cadence** | Bool | Discrete subscription, change detection |
| **Bad-quality on demand** | Any tag | Polly circuit-breaker → quality fan-out |
| **Disconnect / reconnect** | Whole simulator | Reconnect, subscription replay, status dashboard, redundancy failover |

### C. Capability-gated extras (only where the driver supports them)

| Extra | Drivers | Fixture requirement |
|-------|---------|---------------------|
| **Security / access levels** | Galaxy, OPC UA Client | At least one read-only and one read-write tag of the same type |
| **Alarms** | Galaxy, FOCAS, OPC UA Client | One alarm that fires after N seconds; one that the test can acknowledge; one that auto-clears |
| **HistoryRead** | Galaxy, OPC UA Client | One historized tag with a known back-fill of >100 samples spanning >1 hour |
| **String edge cases** | All with String support | Empty string, max-length string, embedded nulls, UTF-8 multi-byte chars |
| **Endianness round-trip** | Modbus, S7 | Float32 written by the test, read back, byte-for-byte equality |

Each driver section below maps these axes to concrete addresses/tags in that protocol's namespace. Where the protocol has no native equivalent (e.g. Modbus has no String type), the row is marked **N/A** and the driver-side tests skip it.

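
The N/A-skip rule can be made mechanical by expressing each driver's fixture as data. A minimal Python sketch (the project itself is .NET; the dict shape and helper are illustrative, not part of the plan — the addresses come from the Modbus mapping table in §1):

```python
# Hypothetical per-driver fixture manifest: one entry per cell of the type
# matrix, with None marking cells the protocol has no native equivalent for.
MODBUS_FIXTURE = {
    ("Bool", "scalar"): "Coil 1000",
    ("Int32", "array-10"): "HR 2010-2029",
    ("Float32", "array-500"): "HR 4000-4999",
    ("String", "scalar"): None,    # Modbus has no native String type
    ("DateTime", "scalar"): None,  # nor DateTime
}

def runnable_cases(fixture: dict) -> list:
    """Cells the driver-side tests actually execute; None rows are skipped."""
    return sorted(key for key, addr in fixture.items() if addr is not None)
```

A parametrized test suite can then iterate `runnable_cases(...)` per driver instead of hard-coding skip lists.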
---

## 1. Modbus TCP (and DL205)

### Recommendation

**Default**: `oitc/modbus-server` Docker image for CI; in-process `NModbus` slave for xUnit fixtures.

Both speak real Modbus TCP wire protocol. The Docker image is a one-line `docker run` for whole-system tests; the in-proc slave gives per-test deterministic state with no new dependencies (NModbus is already on the driver-side dependency list).

### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **oitc/modbus-server** ([Docker Hub](https://hub.docker.com/r/oitc/modbus-server), [GitHub](https://github.com/cybcon/modbus-server)) | MIT | Docker | YAML preload of all 4 register areas; `docker run -p 502:502` |
| **NModbus `ModbusTcpSlave`** ([GitHub](https://github.com/NModbus/NModbus)) | MIT | In-proc .NET 10 | ~20 LOC fixture; programmatic register control |
| **diagslave** ([modbusdriver.com](https://www.modbusdriver.com/diagslave.html)) | Free (proprietary) | Win/Linux/QNX | Single binary; free mode times out hourly |
| **EasyModbusTCP** | LGPL | .NET / Java / Python | MSI installer |
| **ModbusPal** ([SourceForge](https://sourceforge.net/projects/modbuspal/)) | BSD | Java | Register automation scripting; needs a JVM |

### DL205 Coverage

DL205 PLCs are accessed via H2-ECOM100, which exposes plain Modbus TCP. The `AddressFormat=DL205` feature is purely an octal-to-decimal **address translation** in the driver — the simulator only needs to expose the underlying Modbus registers. Unit-test the translation by preloading specific Modbus addresses (`HR 1024 = V2000`, `DI 15 = X17`, `Coil 8 = Y10`) and asserting the driver reads them via DL205 notation.
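
The translation itself is small enough to sketch (Python for brevity; the prefix map and helper name are hypothetical, but the three asserted mappings are exactly the preloads above):

```python
# Sketch of the DL205 octal-to-decimal address translation described above.
# Prefix-to-Modbus-area mapping is an assumption for illustration; the driver
# owns the real table.
DL205_PREFIXES = {
    "V": "HR",    # V-memory  -> holding registers
    "X": "DI",    # inputs    -> discrete inputs
    "Y": "Coil",  # outputs   -> coils
}

def translate_dl205(address: str) -> tuple[str, int]:
    """Translate a DL205 octal address (e.g. 'V2000') to (Modbus area, decimal offset)."""
    prefix, octal_part = address[0], address[1:]
    return DL205_PREFIXES[prefix], int(octal_part, 8)
```

Asserting `translate_dl205("V2000") == ("HR", 1024)` against the preloaded register is the whole unit test.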

### Native Type Coverage

Modbus has no native String, DateTime, or Int64 — those rows are skipped on this driver. Native primitives are coil/discrete-input (Bool) and 16-bit registers; everything wider is composed from contiguous registers with explicit byte/word ordering.

| Type | Modbus mapping | Supported |
|------|----------------|:---------:|
| Bool | Coil / DI | ✔ |
| Int16 / UInt16 | One HR/IR | ✔ |
| Int32 / UInt32 | Two HR (big-endian word) | ✔ |
| Float32 | Two HR | ✔ |
| Float64 | Four HR | ✔ |
| String | — | N/A |
| DateTime | — | N/A |

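
The two-register composition with a pinned byte order (big-endian word, big-endian byte — the harness default from "Gotchas" below) can be sketched as follows; the function names are ours, but the layout matches the table:

```python
import struct

def float_to_registers(value: float) -> tuple[int, int]:
    """Encode an IEEE-754 Float32 into two 16-bit holding registers,
    big-endian word order and big-endian byte order."""
    raw = struct.pack(">f", value)          # 4 bytes, big-endian
    return struct.unpack(">HH", raw)        # (high word, low word)

def registers_to_float(hi: int, lo: int) -> float:
    """Inverse: reassemble the two registers into a Float32."""
    return struct.unpack(">f", struct.pack(">HH", hi, lo))[0]
```

The endianness round-trip test (HR 6000–6001) writes with one function and asserts byte-for-byte equality after reading back with the other.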
### Standard Scenario Mapping

| Axis | Address |
|------|---------|
| Bool scalar / Bool[10] | Coil 1000 / Coils 1010–1019 |
| Int16 scalar / Int16[10] / Int16[500] | HR 0 / HR 10–19 / HR 500–999 |
| Int32 scalar / Int32[10] | HR 2000–2001 / HR 2010–2029 |
| UInt16 scalar | HR 50 |
| UInt32 scalar | HR 60–61 |
| Float32 scalar / Float32[10] / Float32[500] | HR 3000–3001 / HR 3010–3029 / HR 4000–4999 |
| Float64 scalar | HR 5000–5003 |
| Ramp (Int32) | HR 100–101 — 0→1000 @ 1 Hz |
| Ramp (Float32) | HR 110–111 — sine wave |
| Write-read-back (Bool / Int32 / Float32) | Coil 1100 / HR 2100–2101 / HR 3100–3101 |
| Array element write (Int32[10]) | HR 2200–2219 |
| Bool toggle on cadence | Coil 0 — toggles @ 2 Hz |
| Endianness round-trip (Float32) | HR 6000–6001, written then read |
| Bad on demand | Coil 99 — write `1` to make the slave drop the TCP socket |
| Disconnect | Restart the container / dispose the in-proc slave |

### Gotchas

- **Byte order** is simulator-configurable. Pin a default in our test harness (big-endian word, big-endian byte) and document it.
- **diagslave free mode** restarts every hour — fine for inner-loop, not CI.
- **The Docker image defaults all registers to 0** — ship a YAML config in the test repo.

---

## 2. Allen-Bradley CIP (ControlLogix / CompactLogix)

### Recommendation

**Default**: `ab_server` from the libplctag repo. Real CIP-over-EtherNet/IP, written by the same project that owns the libplctag NuGet our driver consumes — every tag shape the simulator handles is one the driver can address.

**Pre-release fidelity tier**: Studio 5000 Logix Emulate on one designated "golden" dev box for cases that need full UDT / Program-scope fidelity. Not a default because of cost (~$1k+ Pro-edition add-on) and toolchain weight.

### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **ab_server** ([libplctag](https://github.com/libplctag/libplctag), [kyle-github/ab_server](https://github.com/kyle-github/ab_server)) | MIT | Win/Linux/macOS | Build from source; CI-grade fixture |
| **Studio 5000 Logix Emulate** | Rockwell paid (~$1k+) | Windows | 100% firmware fidelity |
| **Factory I/O + PLCSIM** | Paid | Windows | Visual sim, not raw CIP |

### Native Type Coverage

| Type | CIP mapping | Supported by ab_server |
|------|-------------|:----------------------:|
| Bool | BOOL | ✔ |
| Int16 | INT | ✔ |
| Int32 | DINT | ✔ |
| Int64 | LINT | ✔ |
| Float32 | REAL | ✔ |
| Float64 | LREAL | ✔ |
| String | STRING (built-in struct) | ✔ basic only |
| DateTime | — | N/A |
| UDT | user-defined STRUCT | not in ab_server CI scope |

### Standard Scenario Mapping

| Axis | Tag |
|------|-----|
| Bool scalar / Bool[10] | `bTest` / `abTest[10]` |
| Int16 scalar / Int16[10] | `iTest` / `aiTest[10]` |
| Int32 scalar / Int32[10] / Int32[500] | `diTest` / `adiTest[10]` / `adiBig[500]` |
| Int64 scalar | `liTest` |
| Float32 scalar / Float32[10] / Float32[500] | `Motor1_Speed` / `aReal[10]` / `aRealBig[500]` |
| Float64 scalar | `Motor1_Position` (LREAL) |
| String scalar / String[10] | `sIdentity` / `asNames[10]` |
| Ramp (Float32) | `Motor1_Speed` (0→60 @ 0.5 Hz) |
| Ramp (Int32) | `StepCounter` (0→10000 @ 1 Hz) |
| Write-read-back (Bool / Int32 / Float32 / String) | `bWriteTarget` / `StepIndex` / `rSetpoint` / `sLastWrite` |
| Array element write (Int32[10]) | `adiWriteTarget[10]` |
| Bool toggle on cadence | `Flags[0]` toggling @ 2 Hz; `Flags[1..15]` latched |
| Bad on demand | Test-harness flag that makes ab_server refuse the next read |
| Disconnect | Stop the ab_server process |

### Gotchas

- **ab_server tag-type coverage is finite** (BOOL, DINT, REAL, arrays, basic strings). UDTs and `Program:` scoping are not fully implemented. Document an "ab_server-supported tag set" in the harness and exclude the rest from default CI; UDT coverage moves to the Studio 5000 Emulate golden-box tier.
- CIP has no native subscriptions, so polling behavior matches real hardware.

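
Because CIP is poll-only, driver-side change detection stands in for subscriptions. A sketch of the poll-and-diff step (hypothetical helper, not the driver's real API; tag names from the mapping table above):

```python
def detect_changes(previous: dict, current: dict) -> dict:
    """Return the tags whose values changed between two poll cycles — the
    driver-side change detection that backs the 'Bool toggle on cadence'
    test when the protocol has no native subscriptions."""
    return {tag: val for tag, val in current.items() if previous.get(tag) != val}
```

The toggle test then asserts `Flags[0]` appears in the diff every cycle while the latched `Flags[1..15]` never do.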
---

## 3. Allen-Bradley Legacy (SLC 500 / MicroLogix, PCCC)

### Recommendation

**Default**: `ab_server` in PCCC mode, with a small in-repo PCCC stub for any file types ab_server doesn't fully cover (notably Timer/Counter `.ACC`/`.PRE`/`.DN` decomposition).

The same binary covers AB CIP and AB Legacy via a `plc=slc500` (or `micrologix`) flag, so we get one fixture for two drivers. If the timer/counter fidelity proves too thin in practice, fall back to a ~200-line `TcpListener` stub answering the specific PCCC function codes the driver issues.

### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **ab_server PCCC mode** | MIT | Cross-platform | Same binary as AB CIP; partial T/C/R structure fidelity |
| **Rockwell RSEmulate 500** | Rockwell legacy paid | Windows | EOL; ages poorly on modern Windows |
| **In-repo PCCC stub** | Own | .NET 10 | Fallback only — covers what we P/Invoke |

### Native Type Coverage

PCCC types are file-based. Int32, Float64, and DateTime are not native to SLC/MicroLogix.

| Type | PCCC mapping | Supported |
|------|--------------|:---------:|
| Bool | `B3:n/b` (bit in B file) | ✔ |
| Int16 | `N7:n` | ✔ |
| Int32 | — (driver composes from two N words) | partial |
| Float32 | `F8:n` | ✔ |
| String | `ST9:n` | ✔ |
| Timer struct | `T4:n.ACC` / `.PRE` / `/DN` | ✔ |
| Counter struct | `C5:n.ACC` / `.PRE` / `/DN` | ✔ |

### Standard Scenario Mapping

| Axis | Address |
|------|---------|
| Bool scalar / Bool[16] | `B3:0/0` / `B3:0` (treated as a bit array) |
| Int16 scalar / Int16[10] / Int16[500] | `N7:0` / `N7:0..9` / `N10:0..499` (separate file) |
| Float32 scalar / Float32[10] | `F8:0` / `F8:0..9` |
| String scalar / String[10] | `ST9:0` / `ST9:0..9` |
| Ramp (Int16) | `N7:1` 0→1000 |
| Ramp (Float32) | `F8:1` sine wave |
| Write-read-back (Bool / Int16 / Float32 / String) | `B3:1/0` / `N7:100` / `F8:100` / `ST9:100` |
| Array element write (Int16[10]) | `N7:200..209` |
| Timer fidelity | `T4:0.ACC`, `T4:0.PRE`, `T4:0/DN` |
| Counter fidelity | `C5:0.ACC`, `C5:0.PRE`, `C5:0/DN` |
| Connection-limit refusal | Driver-harness toggle to simulate the 4-connection limit |
| Bad on demand | Connection-refused toggle |

### Gotchas

- **Real SLC/MicroLogix enforce 4–8 connection limits**; ab_server does not. Add a test-only toggle in the driver (or in the stub) to refuse connections so we exercise the queuing path.
- Timer/Counter structures are the most likely place ab_server fidelity falls short — design the test harness so we can drop in a stub for those specific files without rewriting the rest.

---

## 4. Siemens S7 (S7-300/400/1200/1500)

### Recommendation

**Default**: Snap7 Server. Real S7comm over ISO-on-TCP, free, cross-platform, and the same wire protocol the S7netplus driver emits.

**Pre-release fidelity tier**: PLCSIM Advanced on one golden dev box (7-day renewable trial; paid for production). Required for true firmware-level validation and for testing programs that include actual ladder logic.

### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **Snap7 Server** ([snap7](https://snap7.sourceforge.net/snap7_server.html)) | LGPLv3 | Win/Linux/macOS, 32/64-bit | CP emulator; no PLC logic execution |
| **PLCSIM Advanced** ([Siemens](https://www.siemens.com/en-us/products/simatic/s7-plcsim-advanced/)) | Siemens trial / paid | Windows + VM | Full S7-1500 fidelity; runs TIA programs |
| **S7-PLCSIM (classic)** | Bundled with TIA | Windows | S7-300/400; no external S7comm without PLCSIM Advanced |

### Native Type Coverage

S7 has a rich native type system; Snap7 supports wire-level read/write of all of it via DB byte access.

| Type | S7 mapping | Notes |
|------|------------|-------|
| Bool | `M0.0`, `DBn.DBXm.b` | ✔ |
| Byte / Word / DWord | `DBn.DBB`, `.DBW`, `.DBD` | Unsigned |
| Int (Int16) / DInt (Int32) | `DBn.DBW`, `.DBD` | Signed, big-endian |
| LInt (Int64) | `DBn.DBLW` | S7-1500 only |
| Real (Float32) / LReal (Float64) | `.DBD`, `.DBLW` | Big-endian IEEE |
| String | `DBn.DBB[]` | Length-prefixed: max + actual + chars |
| Char / WChar | Byte / word with character semantics | |
| Date / Time / DT / TOD | Structured byte layouts | |

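
The length-prefixed STRING layout (one byte of declared max length, one byte of actual length, then the characters) round-trips as follows — a minimal sketch, ASCII-only for brevity:

```python
def encode_s7_string(text: str, max_len: int = 254) -> bytes:
    """Build the S7 STRING byte layout: [max][actual][chars...]."""
    data = text.encode("ascii")
    if len(data) > max_len:
        raise ValueError("string exceeds declared max length")
    return bytes([max_len, len(data)]) + data

def decode_s7_string(buf: bytes) -> str:
    """Read back only the 'actual length' characters, ignoring slack space."""
    actual = buf[1]
    return buf[2:2 + actual].decode("ascii")
```

The write-read-back test on `DB1.STRING908` is exactly this encode, a DB write, a DB read, and this decode.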
### Standard Scenario Mapping

All addresses are in `DB1` unless noted; a host script provides the ramp behavior since Snap7 runs no logic.

| Axis | Address |
|------|---------|
| Bool scalar / Bool[16] | `M0.0` / `DB1.DBX0.0..1.7` |
| Int16 scalar / Int16[10] / Int16[500] | `DB1.DBW10` / `DB1.DBW20..38` / `DB2.DBW0..998` |
| Int32 scalar / Int32[10] | `DB1.DBD100` / `DB1.DBD110..146` |
| Int64 scalar | `DB1.DBLW200` |
| UInt16 / UInt32 | `DB1.DBW300` / `DB1.DBD310` |
| Float32 scalar / Float32[10] / Float32[500] | `DB1.DBD400` / `DB1.DBD410..446` / `DB3.DBD0..1996` |
| Float64 scalar | `DB1.DBLW500` |
| String scalar / String[10] | `DB1.STRING600` (max 254) / `DB1.STRING700..` |
| DateTime scalar | `DB1.DT800` |
| Ramp (Int16) | `DB1.DBW10` 0→1000 @ 1 Hz (host script) |
| Ramp (Float32) | `DB1.DBD400` sine (host script) |
| Write-read-back (Bool / Int16 / Float32 / String) | `M1.0` / `DB1.DBW900` / `DB1.DBD904` / `DB1.STRING908` |
| Array element write (Int16[10]) | `DB1.DBW1000..1018` |
| Endianness round-trip (Float32) | `DB1.DBD1100` |
| Big-endian Int32 check | `DB2.DBD0` |
| PUT/GET-disabled simulation | Refuse-connection toggle |
| Bad on demand | Stop the Snap7 host process |
| Re-download | Swap DB definitions (exercises symbol-version handling) |

### Gotchas

- **Snap7 is not a SoftPLC** — no logic runs. Ramps must be scripted by the test host writing to a DB on a timer.
- **PUT/GET enforcement** is a property of real S7-1200/1500 hardware (disabled by default in TIA). Snap7 doesn't enforce it. Add a test case that simulates "PUT/GET disabled" via a deliberately refused connection.
- **Snap7 binary bitness**: some distributions are 32-bit only — match the test-harness bitness.
- **PLCSIM Advanced in VMs** is officially supported but trips up on nested virtualization and time sync.

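
Since the ramp lives in the host script, the value/encoding side is worth pinning down separately from the transport. A sketch (names are ours; the Snap7 `db_write` call that consumes each frame is omitted, as is the 1 Hz timer):

```python
import struct

def ramp_frames(lo: int = 0, hi: int = 1000, step: int = 1):
    """Yield successive big-endian INT16 payloads for the DB1.DBW10 ramp
    (0→1000, wrapping). The test host writes one frame per tick; the
    transport call is intentionally out of scope here."""
    value = lo
    while True:
        yield struct.pack(">h", value)  # S7 INT is a signed big-endian word
        value = lo if value >= hi else value + step
```

Keeping the encoding pure makes the big-endian assertion testable without a running Snap7 server.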
---
|
||||
|
||||
## 5. Beckhoff TwinCAT (ADS)
|
||||
|
||||
### Recommendation
|
||||
|
||||
**Default**: TwinCAT XAR runtime in a dev VM under Beckhoff's 7-day renewable dev/test trial. Required because TwinCAT is the only one of three native-subscription drivers (Galaxy, TwinCAT, OPC UA Client) that doesn't have a separate stub option — exercising native ADS notifications without a real XAR would hide the most important driver bugs.
|
||||
|
||||
The OtOpcUa test process talks to the VM over the network using `Beckhoff.TwinCAT.Ads` v6's in-process router (`AmsTcpIpRouter`), so individual dev machines and CI runners don't need the full TwinCAT stack installed locally.
|
||||
|
||||
### Options Evaluated
|
||||
|
||||
| Option | License | Platform | Notes |
|
||||
|--------|---------|----------|-------|
|
||||
| **TwinCAT 3 XAE + XAR** ([Beckhoff](https://www.beckhoff.com/en-us/products/automation/twincat/), [licensing](https://infosys.beckhoff.com/content/1033/tc3_licensing/921947147.html)) | Free dev download; 7-day renewable trial; paid for production | Windows + Hyper-V/VMware | Full ADS fidelity with real PLC runtime |
|
||||
| **Beckhoff.TwinCAT.Ads.TcpRouter** ([NuGet](https://www.nuget.org/packages/Beckhoff.TwinCAT.Ads.TcpRouter)) | Free, bundled | NuGet, in-proc | Router only — needs a real XAR on the other end |
|
||||
| **TwinCAT XAR in Docker** ([Beckhoff/TC_XAR_Container_Sample](https://github.com/Beckhoff/TC_XAR_Container_Sample)) | Same trial license; no prebuilt image | **Linux host with PREEMPT_RT** | Evaluated and rejected — see "Why not Docker" below |
|
||||
| **Roll-our-own ADS stub** | Own | .NET 10 | Would have to fake notifications; significant effort |
|
||||
|
||||
### Why not Docker (evaluated 2026-04-17)
|
||||
|
||||
Beckhoff publishes an [official sample](https://github.com/Beckhoff/TC_XAR_Container_Sample) for running XAR in a container, but it's not a viable replacement for the VM in our environment. Four blockers:
|
||||
|
||||
1. **Linux-only host with PREEMPT_RT.** The container is a Linux container that requires a Beckhoff RT Linux host (or equivalent PREEMPT_RT kernel). Docker Desktop on Windows forces Hyper-V, which [TwinCAT runtime cannot coexist with](https://hellotwincat.dev/disable-hyper-v-vs-twincat-problem-solution/). Our CI and dev boxes are Windows.
|
||||
2. **ADS-over-MQTT, not classic TCP/48898.** The official sample exposes ADS through a containerized mosquitto broker. Real field deployments use TCP/48898; testing against MQTT reduces the fidelity we're paying for.
|
||||
3. **XAE-on-Windows still required for project deployment.** No headless `.tsproj` deploy path exists. We don't escape the Windows dependency by going to Docker.
|
||||
4. **Same trial license either way.** No licensing win — 7-day renewable applies identically to bare-metal XAR and containerized XAR.
|
||||
|
||||
Revisit if Beckhoff publishes a prebuilt image with classic TCP ADS exposure, or if our CI fleet gains a Linux RT runner. Until then, Windows VM with XAR + XAE + trial license is the pragmatic answer.
|
||||
|
||||
### Native Type Coverage
|
||||
|
||||
TwinCAT exposes the full IEC 61131-3 type system; the test PLC project includes one symbol per cell.
|
||||
|
||||
| Type | TwinCAT mapping | Supported |
|
||||
|------|-----------------|:---------:|
|
||||
| Bool | `BOOL` | ✔ |
|
||||
| Int16 / UInt16 | `INT` / `UINT` | ✔ |
|
||||
| Int32 / UInt32 | `DINT` / `UDINT` | ✔ |
|
||||
| Int64 / UInt64 | `LINT` / `ULINT` | ✔ |
|
||||
| Float32 / Float64 | `REAL` / `LREAL` | ✔ |
|
||||
| String | `STRING(255)` | ✔ |
|
||||
| WString | `WSTRING(255)` | ✔ Unicode coverage |
|
||||
| DateTime | `DT`, `DATE`, `TOD`, `TIME`, `LTIME` | ✔ |
|
||||
| STRUCT / ENUM / ALIAS | user-defined | ✔ |
|
||||
|
||||
### Standard Scenario Mapping

In a tiny test project — `MAIN` (PLC code) + `GVL` (constants and write targets):

| Axis | Symbol |
|------|--------|
| Bool scalar / Bool[10] | `GVL.bTest` / `GVL.abTest` |
| Int16 / Int32 / Int64 scalars | `GVL.iTest` / `GVL.diTest` / `GVL.liTest` |
| UInt16 / UInt32 scalars | `GVL.uiTest` / `GVL.udiTest` |
| Int32[10] / Int32[500] | `GVL.adiTest` / `GVL.adiBig` |
| Float32 / Float64 scalars | `GVL.rTest` / `GVL.lrTest` |
| Float32[10] / Float32[500] | `GVL.arTest` / `GVL.arBig` |
| String / WString / String[10] | `GVL.sIdentity` / `GVL.wsIdentity` / `GVL.asNames` |
| DateTime (DT) | `GVL.dtTimestamp` |
| STRUCT member access | `GVL.fbMotor.rSpeed` (REAL inside FB) |
| Ramp (DINT) | `MAIN.nRamp` — PLC increments each cycle |
| Ramp (REAL) | `MAIN.rSine` — PLC computes sine |
| Write-read-back (Bool / DINT / REAL / STRING / WSTRING) | `GVL.bWriteTarget` / `GVL.diWriteTarget` / `GVL.rWriteTarget` / `GVL.sWriteTarget` / `GVL.wsWriteTarget` |
| Array element write (DINT[10]) | `GVL.adiWriteTarget` |
| Native ADS notification | every scalar above subscribed via OnDataChange |
| Bad on demand | Stop the runtime — driver gets port-not-found |
| Re-download | Re-deploy the project to exercise symbol-version-changed (`0x0702`) |
### Gotchas

- **AMS route table** — XAR refuses ADS connections from unknown hosts. The test setup must add a route back to each dev machine and CI runner (scriptable via `AddRoute` on the NuGet API).
- **7-day trial reset** requires a click in the XAE UI; investigate scripting it for unattended CI.
- **Symbol-version-changed** is the hardest path to exercise — it needs a PLC re-download mid-test, so structure the integration suite to accommodate that step.

---
## 6. FANUC FOCAS (FOCAS2)

### Recommendation

**No good off-the-shelf simulator exists. Build two test artifacts** that cover different layers of the FOCAS surface:

1. **`Driver.Focas.TestStub`** — a TCP listener mimicking a real CNC over the FOCAS wire protocol. Covers functional behavior (reads, writes, ramps, alarms, network failures).
2. **`Driver.Focas.FaultShim`** — a test-only native DLL that masquerades as `Fwlib64.dll` and injects faults inside the host process (access violations, leaks, orphan handles). Covers the stability-recovery paths in `driver-stability.md` that the TCP stub physically cannot exercise.

CNC Guide is the only off-the-shelf FOCAS-capable simulator, and gating every dev rig on a FANUC purchase isn't viable. There are no open-source FOCAS server stubs at useful fidelity. The FOCAS SDK license is already secured (decision #61), so we own the API contract — build both artifacts ourselves against captured Wireshark traces from a real CNC.
### Artifact 1 — TCP Stub (functional coverage)

A `TcpListener` on port 8193 that answers only the FOCAS2 functions the driver P/Invokes:

```
cnc_allclibhndl3, cnc_freelibhndl, cnc_sysinfo, cnc_statinfo,
cnc_actf, cnc_acts, cnc_absolute, cnc_machine, cnc_rdaxisname,
cnc_rdspmeter, cnc_rdprgnum, cnc_rdparam, cnc_rdalmmsg,
pmc_rdpmcrng, cnc_rdmacro, cnc_getfigure
```

Capture the wire framing once against a real CNC (or a colleague's CNC Guide seat); after that the stub becomes a fixed-point reference. For pre-release validation, run the driver against a real CNC.
**Covers**: read/write/poll behavior, scaled-integer round-trip, alarm fire/clear, network slowness, network hang, network disconnect, FOCAS-error-code → StatusCode mapping. Roughly 80% of real-world FOCAS failure modes.
### Artifact 2 — FaultShim (native fault injection, host-side)

A separate test-only native DLL named `Fwlib64.dll` that exports the same function surface but, instead of calling FANUC's library, performs configurable fault behaviors: a deliberate access violation at a chosen call site; returning success but never releasing allocated buffers (memory leak); accepting `cnc_freelibhndl` but keeping the handle table populated (orphan handle); simulating a wedged native call that never returns.

Activated by DLL search-path order in the test fixture only; production builds load FANUC's real `Fwlib64.dll`. The host code is unchanged — it just experiences different symptoms depending on which DLL is loaded.
**Covers**: supervisor respawn after AV, post-mortem MMF readability after hard crash, watchdog → recycle path on simulated leak, Abandoned-handle path when a wedged native call exceeds recycle grace. The remaining ~20% of failure modes that live below the network layer.
### What neither artifact covers

Vendor-specific Fwlib quirks that depend on the real `Fwlib64.dll` interacting with a real CNC firmware version. These remain hardware/manual-test-only and are validated on the pre-release real-CNC tier, not in CI.

### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **FANUC CNC Guide** ([FANUC](https://www.fanuc.co.jp/en/product/cnc/f_ncguide.html)) | Paid, dealer-ordered | Windows | High fidelity; FOCAS-over-Ethernet not enabled in all editions |
| **FANUC Roboguide** | Paid | Windows | Robot-focused, not CNC FOCAS |
| **MTConnect agents** | various | — | Different protocol; not a FOCAS source |
| **Public FOCAS stubs** | — | — | None at useful fidelity |
| **In-repo TCP stub + FaultShim DLL** | Own | .NET 10 + native | Recommended path — two artifacts, see above |
### Native Type Coverage

FOCAS does not have a tag system in the conventional sense — it has a fixed set of API calls returning structured CNC data. Tag families the driver exposes:

| Type | FOCAS source | Notes |
|------|--------------|-------|
| Bool | PMC bit | discrete inputs/outputs |
| Int16 / Int32 | PMC R/D word & dword, status fields | |
| Int64 | composed from PMC | rare |
| Float32 / Float64 | macros (`cnc_rdmacro`), some params | |
| Scaled integer | position values + `cnc_getfigure()` decimal places | THE FOCAS-specific bug surface |
| String | alarm messages, program names | length-bounded |
| Array | PMC ranges (`pmc_rdpmcrng`), per-axis arrays | |
### Standard Scenario Mapping

| Axis | Element |
|------|---------|
| Static identity (struct) | `cnc_sysinfo` — series, version, axis count |
| Bool scalar / Bool[16] | PMC `G0.0` / PMC `G0` (bits 0–15) |
| Int16 / Int32 PMC scalars | PMC `R200` / PMC `D300` |
| Int32 PMC array (small / large) | PMC `R1000..R1019` / PMC `R5000..R5499` |
| Float64 macro variable | macro `#100` |
| Macro array | macro `#500..#509` |
| String | active program name; alarm message text |
| Scaled integer round-trip | X-axis position (decimal-place conversion via `cnc_getfigure`) |
| State machine | `RunState` cycling Stop → Running → Hold |
| Ramp (scaled int) | X-axis position 0→100.000 mm |
| Step (Int32) | `ActualFeedRate` stepping on `cnc_actf` |
| Write-read-back (PMC Int32) | PMC `R100` 32-bit scratch register |
| PMC array element write | PMC `R200..R209` |
| Alarms | one alarm appears after N seconds; one is acknowledgeable; one auto-clears |
| Bad on demand | Stub closes the socket on a marker request |
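Several rows above use a range notation for PMC addresses, e.g. `R1000..R1019`. A hedged sketch of parsing that notation into `(area, start, count)`; the helper name and exact syntax are illustrative, not a fixed driver interface:

```python
import re

# Hypothetical helper: parse the PMC address notation used in the scenario
# table ("R100", "R1000..R1019") into (area, start, count).
_PMC = re.compile(r"^([A-Z])(\d+)(?:\.\.([A-Z])?(\d+))?$")

def parse_pmc(addr: str) -> tuple:
    m = _PMC.match(addr)
    if not m:
        raise ValueError(f"bad PMC address: {addr}")
    area, start = m.group(1), int(m.group(2))
    if m.group(4) is None:
        return area, start, 1           # single register
    end = int(m.group(4))
    if end < start:
        raise ValueError(f"range end before start: {addr}")
    return area, start, end - start + 1  # inclusive range
```

The `(area, start, count)` triple maps directly onto the arguments a `pmc_rdpmcrng`-style bulk read would need.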
### Gotchas

- **FOCAS wire framing is proprietary** — stub fidelity depends entirely on Wireshark captures from a real CNC. Plan to do that capture early.
- **Fwlib is thread-unsafe per handle** — the stub must serialize requests so we don't accidentally test threading behavior the driver can't rely on in production.
- **Scaled-integer position values** require the stub to return a credible `cnc_getfigure()` so the driver's decimal-place conversion is exercised.
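The conversion itself is trivial; the bug surface is applying the wrong per-axis decimal count. A sketch, assuming `cnc_getfigure` has already supplied the decimal places (the function names here are ours):

```python
# Hypothetical sketch of the decimal-place conversion the driver performs on
# position values: FOCAS returns positions as integers, and cnc_getfigure
# reports how many decimal places each axis uses.
def scaled_to_real(raw: int, decimals: int) -> float:
    """e.g. raw=123456 with 3 decimal places -> 123.456 mm"""
    return raw / (10 ** decimals)

def real_to_scaled(value: float, decimals: int) -> int:
    """Inverse conversion for writes; rounds to the nearest raw count."""
    return round(value * (10 ** decimals))
```

A stub that returns a credible decimal-place count per axis forces both directions of this conversion through the driver on every position read and write.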

---

## Summary

| Driver | Primary | License | Fallback / fidelity tier |
|--------|---------|---------|---------------------------|
| Galaxy | Real Galaxy on dev box | — | (n/a — already covered) |
| Modbus TCP / DL205 | `oitc/modbus-server` + NModbus in-proc | MIT | diagslave for wire-inspection |
| AB CIP | libplctag `ab_server` | MIT | Studio 5000 Logix Emulate (golden box) |
| AB Legacy | `ab_server` PCCC mode + in-repo PCCC stub | MIT | Real SLC/MicroLogix on lab rig |
| S7 | Snap7 Server | LGPLv3 | PLCSIM Advanced (golden box) |
| TwinCAT | TwinCAT XAR in dev VM | Free trial | — |
| FOCAS | **In-repo `Driver.Focas.TestStub` (TCP)** + `Driver.Focas.FaultShim` (native DLL) | Own code | CNC Guide / real CNC pre-release |
| OPC UA Client | OPC Foundation `ConsoleReferenceServer` | OPC Foundation | — |

Six of eight drivers have a free, scriptable, cross-platform test source we can check into CI. TwinCAT requires a VM but no recurring cost. FOCAS is the one case with no public answer — we own the stub. The driver specs in `driver-specs.md` enumerate every API call we make, which scopes the FOCAS stub.
## Resolved Defaults

The questions raised by the initial draft are resolved as planning defaults below. Each carries an operational dependency that needs site/team confirmation before Phase 1 work depends on it; these are flagged inline so the dependency stays visible.

- **CI tiering: PR-CI uses only in-process simulators; nightly/integration CI runs on a dedicated host with Docker + the TwinCAT VM.** PR builds need to be fast and must run on minimal Windows/Linux build agents; standardizing on the in-process subset (the `NModbus` server fixture for Modbus, the OPC Foundation `ConsoleReferenceServer` in-process for OPC UA Client, and the FOCAS TCP stub from the test project) covers roughly 70% of cross-driver behavior with no infrastructure dependency. Anything needing Docker (`oitc/modbus-server`), the TwinCAT XAR VM, the libplctag `ab_server` binary, or the Snap7 Server runs on a single dedicated integration host that executes the full suite nightly and on demand. **Operational dependency**: stand up one Windows host with Docker Desktop + Hyper-V before Phase 3 (Modbus driver) — without it, integration tests for Modbus/AB CIP/AB Legacy/S7/TwinCAT all queue behind the same scarcity.
- **Studio 5000 Logix Emulate: not assumed in CI; pre-release validation only.** Don't gate any phase on procuring a license. If an existing org license can be earmarked, designate one Windows machine as the AB CIP golden box and run a quarterly UDT/Program-scope fidelity pass against it. If no license is available, the AB CIP driver ships validated against `ab_server` only, with a documented limitation that UDTs and `Program:` scoping are exercised at customer sites during UAT, not in our CI.
- **FANUC CNC Wireshark captures: scheduled as a Phase 5 prerequisite.** During Phase 4 (PLC drivers), the team identifies a target CNC — a production machine accessible during a maintenance window, a colleague's CNC Guide seat, or a customer who'll allow a one-day on-site capture. Capture the wire framing for every FOCAS function in the call list (per `driver-stability.md` §FOCAS), plus a 30-minute poll trace, before Phase 5 starts. If no target is identified by the Phase 4 mid-point, escalate to procurement: a CNC Guide license seat (one-time cost) or a small dev-rig CNC purchase becomes a Phase 5 dependency.