Draft v2 multi-driver planning docs (docs/v2/) so Phase 0–5 work has a complete reference:

- Rename to OtOpcUa; migrate to .NET 10 x64 (Galaxy stays .NET 4.8 x86 out-of-process)
- Add seven new drivers behind composable capability interfaces: Modbus TCP / DL205, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client
- Introduce a central MSSQL config DB with cluster-scoped immutable generations and per-node credential binding
- Deploy as two-node site clusters with non-transparent redundancy and minimal per-node overrides
- Classify drivers by stability tier (A pure-managed / B wrapped-native / C out-of-process Windows service), with Tier C deep dives for both Galaxy and FOCAS
- Define per-driver test data sources (libplctag ab_server, Snap7, NModbus in-proc, TwinCAT XAR VM, FOCAS TCP stub plus native FaultShim) plus a 6-axis cross-driver test matrix
- Ship a Blazor Server admin UI mirroring ScadaLink CentralUI's Bootstrap 5 / LDAP cookie auth / dark-sidebar look-and-feel

106 numbered decisions across six docs (plan.md, driver-specs.md, driver-stability.md, test-data-sources.md, config-db-schema.md, admin-ui.md). DRAFT only and intentionally not yet wired to code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/v2/admin-ui.md — new file, 402 lines

# Admin Web UI — OtOpcUa v2

> **Status**: DRAFT — companion to `plan.md` §4 and `config-db-schema.md`. Defines the Blazor Server admin app for managing the central config DB.
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Scope

This document covers the **OtOpcUa Admin** web app — the operator-facing UI for managing fleet configuration. It owns every write to the central config DB; OtOpcUa nodes are read-only consumers.

Out of scope here:

- Per-node operator dashboards (status, alarm acks for runtime concerns) — that's the existing Status Dashboard, deployed alongside each node, not the Admin app
- Driver-specific config screens — deferred to each driver's implementation phase per decision #27; each driver doc is responsible for sketching its config UI surface
- Authentication of the OPC UA endpoint itself — covered by `Security.md` (LDAP)
## Tech Stack

**Aligned with ScadaLink CentralUI** (`scadalink-design/src/ScadaLink.CentralUI`) — operators using both apps see the same login screen, the same sidebar, and the same component vocabulary.

| Component | Choice | Reason |
|-----------|--------|--------|
| Framework | **Blazor Server** (.NET 10 Razor Components, `AddInteractiveServerComponents`) | Same as ScadaLink; real-time UI without a separate SPA build; SignalR built in for live cluster status |
| Hosting | Co-deploy with central DB by default; standalone option | Most deployments run Admin on the same machine as MSSQL; large fleets can split |
| Auth | **LDAP bind via `LdapAuthService` (sibling of `ScadaLink.Security`) + cookie auth + `JwtTokenService` for API tokens** | Direct parity with ScadaLink — same login form, cookie scheme, claim shape, and `RoleMapper` pattern. Operators authenticated to one app feel at home in the other |
| DB access | EF Core (same `Configuration` project that nodes use) | Schema versioning lives in one place |
| Real-time | SignalR (Blazor Server's underlying transport) | Live updates on `ClusterNodeGenerationState` and crash-loop alerts |
| Styling | **Bootstrap 5** vendored under `wwwroot/lib/bootstrap/` | Direct parity with ScadaLink; standard component vocabulary (card, table, alert, btn, form-control, modal); no third-party Blazor component-library dependency |
| Shared components | `DataTable`, `ConfirmDialog`, `LoadingSpinner`, `ToastNotification`, `TimestampDisplay`, `RedirectToLogin`, `NotAuthorizedView` | Same set as ScadaLink CentralUI; copied structurally so the cross-app feel is identical |
| Reconnect overlay | Custom Bootstrap modal triggered on Blazor SignalR disconnect | Same pattern as ScadaLink — modal appears on connection loss, dismisses on reconnect |
### Code organization

Mirror ScadaLink's layout exactly:

```
src/
  ZB.MOM.WW.OtOpcUa.Admin/                   # Razor Components project (.NET 10)
    Auth/
      AuthEndpoints.cs                       # /auth/login, /auth/logout, /auth/token
      CookieAuthenticationStateProvider.cs   # bridges cookie auth to Blazor <AuthorizeView>
    Components/
      Layout/
        MainLayout.razor                     # dark sidebar + light main flex layout
        NavMenu.razor                        # role-gated nav sections
      Pages/
        Login.razor                          # server-rendered HTML form POSTing to /auth/login
        Dashboard.razor                      # default landing
        Clusters/
        Generations/
        Credentials/
        Audit/
      Shared/
        DataTable.razor                      # paged/sortable/filterable table (verbatim from ScadaLink)
        ConfirmDialog.razor
        LoadingSpinner.razor
        ToastNotification.razor
        TimestampDisplay.razor
        RedirectToLogin.razor
        NotAuthorizedView.razor
    EndpointExtensions.cs                    # MapAuthEndpoints + role policies
    ServiceCollectionExtensions.cs           # AddCentralAdmin
  ZB.MOM.WW.OtOpcUa.Admin.Security/          # LDAP + role mapping + JWT (sibling of ScadaLink.Security)
```

The `Admin.Security` project carries `LdapAuthService`, `RoleMapper`, `JwtTokenService`, `AuthorizationPolicies`. If it ever makes sense to consolidate with ScadaLink's identical project, lift it to a shared internal NuGet package — out of scope for v2.0, to keep OtOpcUa decoupled from ScadaLink's release cycle.
## Authentication & Authorization

### Operator authentication

**Identical pattern to ScadaLink CentralUI.** Operators log in via LDAP bind against the GLAuth server. The login flow is a server-rendered HTML form POSTing to `/auth/login` (NOT a Blazor interactive form — `data-enhance="false"` disables Blazor enhanced navigation), handled by a minimal-API endpoint that:

1. Reads `username` / `password` from the form
2. Calls `LdapAuthService.AuthenticateAsync(username, password)` — performs the LDAP bind; returns `Username`, `DisplayName`, `Groups`
3. Calls `RoleMapper.MapGroupsToRolesAsync(groups)` — translates LDAP groups → application roles + cluster-scope set
4. Builds a `ClaimsIdentity` with `Name`, `DisplayName`, `Username`, `Role` (multiple), and `ClusterId` scope claims (multiple, when not system-wide)
5. Calls `HttpContext.SignInAsync(CookieAuthenticationDefaults.AuthenticationScheme, principal, ...)` with `IsPersistent = true`, `ExpiresUtc = +30 min` (sliding)
6. Redirects to `/`
7. On failure, redirects to `/login?error={URL-encoded message}`

A parallel `/auth/token` endpoint returns a JWT for API clients (CLI tooling, scripts) — same auth, different transport. Symmetric with ScadaLink's pattern.

`CookieAuthenticationStateProvider` bridges the cookie principal to Blazor's `AuthenticationStateProvider` so `<AuthorizeView>` and `[Authorize]` work in components.
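
The steps above can be sketched as a minimal-API handler. This is illustrative only — the exact return types of `LdapAuthService.AuthenticateAsync` and `RoleMapper.MapGroupsToRolesAsync` (here assumed to expose `Groups`, `Roles`, and `ClusterIds`) are owned by the `Admin.Security` project:

```csharp
// Sketch of AuthEndpoints.cs — assumes the usual ASP.NET Core usings
// (System.Security.Claims, Microsoft.AspNetCore.Authentication[.Cookies]).
app.MapPost("/auth/login", async (HttpContext http, LdapAuthService ldap, RoleMapper roleMapper) =>
{
    var form = await http.Request.ReadFormAsync();
    var user = await ldap.AuthenticateAsync(form["username"].ToString(), form["password"].ToString());
    if (user is null)
        return Results.Redirect("/login?error=" + Uri.EscapeDataString("Invalid username or password"));

    // Translate LDAP groups into application roles + cluster-scope set (step 3).
    var grant = await roleMapper.MapGroupsToRolesAsync(user.Groups);

    var claims = new List<Claim>
    {
        new(ClaimTypes.Name, user.Username),
        new("DisplayName", user.DisplayName),
    };
    claims.AddRange(grant.Roles.Select(r => new Claim(ClaimTypes.Role, r)));
    claims.AddRange(grant.ClusterIds.Select(c => new Claim("ClusterId", c)));

    var identity = new ClaimsIdentity(claims, CookieAuthenticationDefaults.AuthenticationScheme);
    await http.SignInAsync(
        CookieAuthenticationDefaults.AuthenticationScheme,
        new ClaimsPrincipal(identity),
        new AuthenticationProperties
        {
            IsPersistent = true,
            ExpiresUtc = DateTimeOffset.UtcNow.AddMinutes(30),
            AllowRefresh = true, // sliding expiration
        });

    return Results.Redirect("/");
});
```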

### LDAP group → role mapping

| LDAP group | Admin role | Capabilities |
|------------|------------|--------------|
| `OtOpcUaAdmins` | `FleetAdmin` | Everything: cluster CRUD, node CRUD, credential management, publish/rollback any cluster |
| `OtOpcUaConfigEditors` | `ConfigEditor` | Edit drafts and publish for assigned clusters; cannot create/delete clusters or manage credentials |
| `OtOpcUaViewers` | `ReadOnly` | View-only access to all clusters and generations; cannot edit drafts or publish |

`AuthorizationPolicies` constants (mirrors ScadaLink): `RequireFleetAdmin`, `RequireConfigEditor`, `RequireReadOnly`. `<AuthorizeView Policy="@AuthorizationPolicies.RequireFleetAdmin">` gates nav-menu sections and page-level access.
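
A possible registration sketch for those constants — the policy names come from this doc, but the role-hierarchy choice (higher roles satisfy lower-privilege policies) is an assumption to be confirmed against ScadaLink's actual `AuthorizationPolicies`:

```csharp
// Sketch — names from this doc; hierarchy wiring is illustrative.
public static class AuthorizationPolicies
{
    public const string RequireFleetAdmin = nameof(RequireFleetAdmin);
    public const string RequireConfigEditor = nameof(RequireConfigEditor);
    public const string RequireReadOnly = nameof(RequireReadOnly);

    public static void AddAdminPolicies(this AuthorizationOptions options)
    {
        // Assumed hierarchy: FleetAdmin ⊃ ConfigEditor ⊃ ReadOnly.
        options.AddPolicy(RequireFleetAdmin, p => p.RequireRole("FleetAdmin"));
        options.AddPolicy(RequireConfigEditor, p => p.RequireRole("FleetAdmin", "ConfigEditor"));
        options.AddPolicy(RequireReadOnly, p => p.RequireRole("FleetAdmin", "ConfigEditor", "ReadOnly"));
    }
}
```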

### Cluster-scoped grants (lifted from v2.1 to v2.0)

Because ScadaLink already has the site-scoped grant pattern (`PermittedSiteIds` claim, `IsSystemWideDeployment` flag), we get cluster-scoped grants essentially for free in v2.0 by mirroring it:

- A `ConfigEditor` user mapped to LDAP group `OtOpcUaConfigEditors-LINE3` is granted the `ConfigEditor` role + a `ClusterId=LINE3-OPCUA` scope claim only
- The `RoleMapper` reads a small `LdapGroupRoleMapping` table (Group → Role, Group → ClusterId scope) configured by `FleetAdmin` via the Admin UI
- All cluster-scoped pages check both the role AND the `ClusterId` scope claim before showing edit affordances

System-wide users (no `ClusterId` scope claims, `IsSystemWideDeployment = true`) see every cluster.
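
The "role AND scope" check could look like the following helper — the method name and placement are assumptions, not part of the spec; the "no `ClusterId` claims means system-wide" rule is from the paragraph above:

```csharp
// Illustrative helper for cluster-scoped edit affordances.
public static class ClusterScopeExtensions
{
    public static bool CanEditCluster(this ClaimsPrincipal user, string clusterId)
    {
        // Role check first: ReadOnly users never get edit affordances.
        if (!user.IsInRole("FleetAdmin") && !user.IsInRole("ConfigEditor"))
            return false;

        // System-wide users carry no ClusterId scope claims and see every cluster.
        var scopes = user.FindAll("ClusterId").Select(c => c.Value).ToList();
        return scopes.Count == 0 || scopes.Contains(clusterId);
    }
}
```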

### Bootstrap (first-run)

Same as ScadaLink: a local-admin login configured in `appsettings.json` (or a local certificate-authenticated user) bootstraps the first `OtOpcUaAdmins` LDAP group binding before LDAP-only access takes over. Documented as a one-time setup step.

### Audit

Every write operation goes through `sp_*` procs that log to `ConfigAuditLog` with the operator's principal. The Admin UI also logs view-only actions (page navigation, generation diff views) to a separate UI access log for compliance.

## Visual Design — Direct Parity with ScadaLink

Every visual element is lifted from ScadaLink CentralUI's design system to ensure cross-app consistency. Concrete specs:

### Layout

- **Flex layout**: `<div class="d-flex">` containing `<NavMenu />` (sidebar) and `<main class="flex-grow-1 p-3">` (content)
- **Sidebar**: 220px fixed width (`min-width: 220px; max-width: 220px`), full viewport height (`min-height: 100vh`), background `#212529` (Bootstrap dark)
- **Main background**: `#f8f9fa` (Bootstrap light)
- **Brand**: "OtOpcUa" in white bold (font-size `1.1rem`, padding `1rem`, border-bottom `1px solid #343a40`) at top of sidebar
- **Nav links**: color `#adb5bd`, padding `0.4rem 1rem`, font-size `0.9rem`. Hover: white text, background `#343a40`. Active: white text, background `#0d6efd` (Bootstrap primary)
- **Section headers** ("Admin", "Configuration", "Monitoring"): color `#6c757d`, uppercase, font-size `0.75rem`, font-weight `600`, letter-spacing `0.05em`, padding `0.75rem 1rem 0.25rem`
- **User strip** at bottom of sidebar: display name (`text-light small`) + Sign Out button (`btn-outline-light btn-sm`), separated from nav by `border-top border-secondary`

### Login page

Verbatim structure from ScadaLink's `Login.razor`:

```razor
<div class="container" style="max-width: 400px; margin-top: 10vh;">
    <div class="card shadow-sm">
        <div class="card-body p-4">
            <h4 class="card-title mb-4 text-center">OtOpcUa</h4>

            @if (!string.IsNullOrEmpty(ErrorMessage))
            {
                <div class="alert alert-danger py-2" role="alert">@ErrorMessage</div>
            }

            <form method="post" action="/auth/login" data-enhance="false">
                <div class="mb-3">
                    <label for="username" class="form-label">Username</label>
                    <input type="text" class="form-control" id="username" name="username"
                           required autocomplete="username" autofocus />
                </div>
                <div class="mb-3">
                    <label for="password" class="form-label">Password</label>
                    <input type="password" class="form-control" id="password" name="password"
                           required autocomplete="current-password" />
                </div>
                <button type="submit" class="btn btn-primary w-100">Sign In</button>
            </form>
        </div>
    </div>
    <p class="text-center text-muted mt-3 small">Authenticate with your organization's LDAP credentials.</p>
</div>
```

Exact same dimensions, exact same copy pattern; only the brand name differs.

### Reconnection overlay

Same SignalR-disconnect modal as ScadaLink — a `#reconnect-modal` overlay (`rgba(0,0,0,0.5)` backdrop, centered white card with `spinner-border text-primary`, "Connection Lost" heading, "Attempting to reconnect to the server. Please wait..." body). Dismisses on reconnect via `Blazor.addEventListener('enhancedload')`. Lifted from ScadaLink's `App.razor` inline styles.
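
One possible wiring sketch — the `#reconnect-modal` id comes from this doc, but the show/hide hook below uses Blazor's documented `reconnectionHandler` circuit option (which requires `autostart="false"` on the Blazor script tag) rather than ScadaLink's inline-style approach, so treat it as an alternative, not the spec:

```javascript
// Sketch only: drive the custom Bootstrap overlay from Blazor Server's
// circuit reconnection events.
const modal = document.getElementById('reconnect-modal');

Blazor.start({
  circuit: {
    reconnectionHandler: {
      onConnectionDown: () => { modal.style.display = 'flex'; }, // connection lost
      onConnectionUp:   () => { modal.style.display = 'none'; }, // reconnected
    },
  },
});
```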

### Shared components — direct copies

All seven shared components from ScadaLink CentralUI are copied verbatim into our `Components/Shared/`:

| Component | Use |
|-----------|-----|
| `DataTable.razor` | Sortable, filterable, paged table — used for tags, generations, audit log, cluster list |
| `ConfirmDialog.razor` | Modal confirmation for destructive actions (publish, rollback, discard draft, disable credential) |
| `LoadingSpinner.razor` | Standard spinner for in-flight DB operations |
| `ToastNotification.razor` | Transient success/error toasts for non-modal feedback |
| `TimestampDisplay.razor` | Consistent UTC + relative-time rendering ("3 minutes ago") |
| `RedirectToLogin.razor` | Component used by pages requiring auth — server-side redirect to `/login?returnUrl=...` |
| `NotAuthorizedView.razor` | Standard "you don't have permission for this action" view, shown by `<AuthorizeView>`'s NotAuthorized branch |

If we discover an Admin-specific component need, add it to our Shared folder rather than diverging from ScadaLink's set.

## Information Architecture

```
/                                         Fleet Overview (default landing)
/clusters                                 Cluster list
/clusters/{ClusterId}                     Cluster detail
/clusters/{ClusterId}/nodes/{NodeId}      Node detail
/clusters/{ClusterId}/draft               Draft editor (drivers/devices/tags)
/clusters/{ClusterId}/draft/diff          Draft vs current diff viewer
/clusters/{ClusterId}/generations         Generation history
/clusters/{ClusterId}/generations/{Id}    Generation detail (read-only view of any generation)
/clusters/{ClusterId}/audit               Audit log filtered to this cluster
/credentials                              Credential management (FleetAdmin only)
/audit                                    Fleet-wide audit log
/admin/users                              Admin role assignments (FleetAdmin only)
```

## Core Pages

### Fleet Overview (`/`)

Single-page summary intended as the operator landing page.

- **Cluster cards**, one per `ServerCluster`, showing:
  - Cluster name, site, redundancy mode, node count
  - Per-node status: online/offline (from `ClusterNodeGenerationState.LastSeenAt`), current generation, RedundancyRole, ServiceLevel (last reported)
  - Drift indicator: red if a 2-node cluster's nodes are on different generations, amber if mid-apply, green if converged
- **Active alerts** strip (top of page):
  - Sticky crash-loop circuit alerts (per `driver-stability.md`)
  - Stragglers: nodes that haven't applied the latest published generation within 5 min
  - Failed applies (`LastAppliedStatus = 'Failed'`)
- **Recent activity**: last 20 events from `ConfigAuditLog` across the fleet
- **Search bar** at top: jump to any cluster, node, tag, or driver instance by name

Refresh: SignalR push for status changes; full reload every 30 s as a safety net.

### Cluster Detail (`/clusters/{ClusterId}`)

Tabbed view for one cluster.

**Tabs:**

1. **Overview** — cluster metadata (name, site, redundancy mode, namespace URI), node table with online/offline/role/generation/last-applied-status, current published generation summary, draft status (none / in progress / ready to publish)
2. **Drivers** — table of `DriverInstance` rows in the *current published* generation, with per-row navigation to driver-specific config screens. "Edit in draft" button creates or opens the cluster's draft.
3. **Devices** — table of `Device` rows (where applicable), grouped by `DriverInstance`
4. **Tags** — paged, filterable table of all tags. Filters: driver, device, folder path, name pattern, data type. Bulk operations toolbar: export to CSV, import from CSV (validated against the active draft).
5. **Generations** — generation history list (see Generation History page)
6. **Audit** — filtered audit log

The Drivers/Devices/Tags tabs are **read-only views** of the published generation; editing happens in the dedicated draft editor to make the publish boundary explicit.

### Node Detail (`/clusters/{ClusterId}/nodes/{NodeId}`)

Per-node view for `ClusterNode` management.

- **Physical attributes** form: Host, OpcUaPort, DashboardPort, ApplicationUri, ServiceLevelBase, RedundancyRole
- **ApplicationUri auto-suggest** behavior (per decision #86):
  - When creating a new node: prefilled with `urn:{Host}:OtOpcUa`
  - When editing an existing node: changing `Host` shows a warning banner — "ApplicationUri is not updated automatically. Changing it will require all OPC UA clients to re-establish trust." The operator must explicitly click an "Update ApplicationUri" button to apply the suggestion.
- **Credentials** sub-tab: list of `ClusterNodeCredential` rows (kind, value, enabled, rotated-at). FleetAdmin can add/disable/rotate. The credential rotation flow is documented inline ("create new credential → wait for node to use it → disable old credential").
- **Per-node overrides** sub-tab: structured editor for `DriverConfigOverridesJson`. Surfaces the cluster's `DriverInstance` rows with their current `DriverConfig`, and lets the operator add path → value override entries per driver. Validation: the override path must exist in the current draft's `DriverConfig`; loud failure if it doesn't (per the merge semantics in the schema doc).
- **Generation state**: current applied generation, last-applied timestamp, last-applied status, last error if any
- **Recent node activity**: filtered audit log
### Draft Editor (`/clusters/{ClusterId}/draft`)

The primary edit surface. Three-panel layout: tree on the left (drivers → devices → tags), edit form on the right, validation panel at the bottom.

- **Drivers panel**: add/edit/remove `DriverInstance` rows in the draft. Each driver type opens a driver-specific config screen (deferred per #27). Generic fields (Name, NamespaceUri, Enabled) are always editable.
- **Devices panel**: scoped to the selected driver instance (where applicable)
- **Tags panel**:
  - Tree view by `FolderPath`
  - Inline edit for individual tags (Name, DataType, AccessLevel, WriteIdempotent, PollGroupId, TagConfig JSON in a structured editor)
  - **Bulk operations**: select multiple tags → bulk edit (change poll group, access level, etc.)
  - **CSV import**: upload a CSV with `(DriverInstanceId, DeviceId?, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, PollGroupId, TagConfig)` columns. A preview shows additions/modifications/removals against the current draft, with row-level validation errors. The operator confirms or cancels.
  - **CSV export**: emit the same shape from the current published generation, useful as a starting point for bulk edits in Excel
- **Validation panel** runs `sp_ValidateDraft` continuously (debounced) and surfaces FK errors, JSON schema errors, duplicate paths, and missing references. The Publish button is disabled while errors exist.
- **Diff link** at top: opens the diff viewer comparing the draft against the current published generation
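
For illustration, one data row in the import shape above — every value is invented, and note that a `TagConfig` JSON cell must be quoted with doubled quotes per RFC 4180:

```csv
DriverInstanceId,DeviceId,FolderPath,Name,DataType,AccessLevel,WriteIdempotent,PollGroupId,TagConfig
7,3,Line3/Press,CycleCount,Int32,ReadOnly,false,2,"{""address"":""DB10.DBD4""}"
```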

### Diff Viewer (`/clusters/{ClusterId}/draft/diff`)

Three-column compare: previous published | draft | summary. Per-table sections (drivers, devices, tags, poll groups) with rows colored by change type:

- Green: added in draft
- Red: removed in draft
- Yellow: modified (with field-level diff on hover/expand)

Includes a **publish dialog** triggered from this view: required Notes field, optional "publish and apply now" vs. "publish and let nodes pick up on next poll" (the latter is the default; the former invokes a one-shot push notification, deferred per the existing plan).

### Generation History (`/clusters/{ClusterId}/generations`)

List of all generations for the cluster with: ID, status, published-by, published-at, notes, and a per-row "Roll back to this" action (FleetAdmin or ConfigEditor). Clicking a row opens the generation detail page (read-only view of all rows in that generation, with diff-against-current as a button).

Rollback flow:

1. Operator clicks "Roll back to this generation"
2. Modal: "This will create a new published generation cloned from generation N. Both nodes of this cluster will pick up the change on their next poll. Notes (required):"
3. Confirm → invokes `sp_RollbackToGeneration` → immediate UI feedback that a new generation was published
### Credential Management (`/credentials`)

FleetAdmin-only. Lists all `ClusterNodeCredential` rows fleet-wide, filterable by cluster/node/kind/enabled.

Operations: add credential to node, disable credential, mark credential rotated. Rotation is the most common operation — the UI provides a guided flow ("create new → confirm the node has used it once via `LastAppliedAt` advance → disable old").

### Fleet Audit (`/audit`)

Searchable / filterable view of `ConfigAuditLog` across all clusters. Filters: cluster, node, principal, event type, date range. Export to CSV for compliance.

## Real-Time Updates

Blazor Server runs over SignalR by default. The Admin app uses two SignalR hubs:

| Hub | Purpose |
|-----|---------|
| `FleetStatusHub` | Push `ClusterNodeGenerationState` changes (LastSeenAt updates, applied-generation transitions, status changes) to any open Fleet Overview or Cluster Detail page |
| `AlertHub` | Push new sticky alerts (crash-loop circuit trips, failed applies) to all subscribed pages |

Updates fan out from a backend `IHostedService` that polls `ClusterNodeGenerationState` every 5 s and diffs against last-known state. Pages subscribe selectively (the Cluster Detail page subscribes to one cluster's updates; Fleet Overview subscribes to all). No polling from the browser.
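
The fan-out service could be sketched as follows. The hub name comes from the table above; the snapshot record, DbSet name, group-naming convention (`"fleet"`, `"cluster:{id}"`), and hub method name are all illustrative assumptions:

```csharp
// Sketch of the 5 s poll-and-diff fan-out (BackgroundService implements IHostedService).
public sealed class FleetStatusPoller(
    IHubContext<FleetStatusHub> hub,
    IDbContextFactory<ConfigDbContext> dbFactory) : BackgroundService
{
    private Dictionary<(string ClusterId, string NodeId), NodeStateSnapshot> _last = new();

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var timer = new PeriodicTimer(TimeSpan.FromSeconds(5));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            await using var db = await dbFactory.CreateDbContextAsync(stoppingToken);
            var current = await db.ClusterNodeGenerationStates
                .Select(s => new NodeStateSnapshot(s.ClusterId, s.NodeId,
                    s.LastSeenAt, s.AppliedGenerationId, s.LastAppliedStatus))
                .ToDictionaryAsync(s => (s.ClusterId, s.NodeId), stoppingToken);

            foreach (var (key, snap) in current)
            {
                // Push only rows that changed since the last poll; record equality
                // gives us the diff. Pages join "fleet" or "cluster:{id}" groups.
                if (!_last.TryGetValue(key, out var prev) || prev != snap)
                {
                    await hub.Clients.Groups("fleet", $"cluster:{key.ClusterId}")
                        .SendAsync("NodeStateChanged", snap, stoppingToken);
                }
            }
            _last = current;
        }
    }
}

public sealed record NodeStateSnapshot(
    string ClusterId, string NodeId,
    DateTime LastSeenAt, long AppliedGenerationId, string LastAppliedStatus);
```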

## UX Rules

- **Sticky alerts that don't auto-clear** — per the crash-loop circuit-breaker rule in `driver-stability.md`, alerts in the Active Alerts strip require explicit operator acknowledgment before clearing, regardless of whether the underlying state has recovered. "We crash-looped 3 times overnight" must remain visible the next morning.
- **Publish boundary is explicit** — there is no "edit in place" path. All changes go through draft → diff → publish. The diff viewer is required reading before the publish dialog enables.
- **Loud failures over silent fallbacks** — if validation fails, the publish button is disabled and the failures are listed; we never publish a generation with warnings hidden. If a node override path doesn't resolve in the draft, the override editor flags it red, not yellow.
- **No auto-rewrite of `ApplicationUri`** — see the Node Detail page above. The principle generalizes: any field that OPC UA clients pin trust to (`ApplicationUri`, certificate thumbprints) requires explicit operator action to change, never silent updates.
- **Bulk operations always preview before commit** — CSV imports, bulk tag edits, and rollbacks all show a diff and require confirmation. No "apply" buttons that act without preview.

## Per-Driver Config Screens (deferred)

Per decision #27, driver-specific config screens are added in each driver's implementation phase, not up front. The Admin app provides:

- A pluggable `IDriverConfigEditor` interface in `Configuration.Abstractions`
- Driver projects implement an editor that renders into a slot on the Driver Detail screen
- For drivers that don't yet have a custom editor, a generic JSON editor with schema-driven validation is used (better than nothing; ugly but functional)

The generic JSON editor uses the per-driver JSON schema from `DriverTypeRegistry`, so validation works even before a custom editor exists.
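
The interface name is fixed by this doc, but its member shape is not — the following is one minimal possibility, to be pinned down in `Configuration.Abstractions`:

```csharp
// Sketch — members are illustrative assumptions, not the spec.
public interface IDriverConfigEditor
{
    // Driver type key this editor handles, e.g. "ModbusTcp" (key format assumed).
    string DriverTypeKey { get; }

    // Razor component type rendered into the Driver Detail slot; it receives
    // the draft DriverConfig JSON and writes edits back through the draft API.
    Type EditorComponentType { get; }
}
```

A driver type with no registered `IDriverConfigEditor` falls back to the generic JSON editor validated against its `DriverTypeRegistry` schema.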

## Workflows

### Add a new cluster

1. FleetAdmin: `/clusters` → "New cluster"
2. Form: Name, Site, NodeCount (1 or 2), RedundancyMode (auto-set based on NodeCount), NamespaceUri (auto-suggested from name)
3. Save → cluster row created (`Status = Enabled`, no generations yet)
4. Redirect to Cluster Detail; prompt to add nodes

### Add a node to a cluster

1. Cluster Detail → "Add node"
2. Form: NodeId, RedundancyRole, Host (required), OpcUaPort (default 4840), DashboardPort (default 8081), ApplicationUri (auto-prefilled `urn:{Host}:OtOpcUa`), ServiceLevelBase (auto: Primary=200, Secondary=150)
3. Save
4. Prompt: "Add a credential for this node now?" → opens the credential add flow
5. The node won't be functional until at least one credential is added and provisioned on the node's machine (an out-of-band step documented in the deployment guide)

### Edit drivers/tags and publish

1. Cluster Detail → "Edit configuration" → opens the draft editor (creates a draft generation if none exists)
2. Operator edits drivers, devices, tags, poll groups
3. Validation panel updates live; publish is disabled while errors exist
4. Operator clicks "Diff" → diff viewer
5. Operator clicks "Publish" → modal asks for Notes, confirms
6. `sp_PublishGeneration` runs in a transaction; on success, the draft becomes the new published generation and the previous published generation becomes superseded
7. Within ~30 s (default poll interval), both nodes pick up the new generation; the Cluster Detail page shows live progress as `LastAppliedAt` advances on each node
### Roll back

1. Cluster Detail → Generations tab → find target generation → "Roll back to this"
2. Modal: explains that a new generation will be created (a clone of the target) and published; requires Notes
3. Confirm → `sp_RollbackToGeneration` runs
4. Same propagation as a forward publish — both nodes pick up the new generation on next poll

### Override a setting per node

1. Node Detail → Overrides sub-tab
2. Pick a driver instance from the dropdown → schema-driven editor shows the current `DriverConfig` keys
3. Add an override row: select a key path (validated against the driver's JSON schema), enter the override value
4. Save → updates `ClusterNode.DriverConfigOverridesJson`
5. **No new generation created** — overrides are per-node metadata, not generation-versioned. They take effect on the node's next config-apply cycle.

The "no new generation" choice is deliberate: overrides are operationally bound to a specific physical machine, not to the cluster's logical config evolution. A node replacement scenario would copy the override to the replacement node via the credential/override migration flow, not by replaying generation history.
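
As a purely hypothetical illustration of what `DriverConfigOverridesJson` might hold (the real shape, key paths, and values are defined by the merge semantics in `config-db-schema.md`, not here):

```json
{
  "overrides": [
    { "driverInstance": "Line3-S7", "path": "connection.ipAddress", "value": "10.20.3.41" },
    { "driverInstance": "Line3-S7", "path": "connection.rack", "value": 0 }
  ]
}
```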

### Rotate a credential

1. Node Detail → Credentials sub-tab → "Add credential"
2. Pick Kind, enter Value, save → the new credential is enabled alongside the old
3. Wait for `LastAppliedAt` on the node to advance (proves the node is using the new credential — the operator-side work to provision the new credential on the node's machine happens out-of-band)
4. Once verified, disable the old credential → only the new one is valid
## Deferred / Out of Scope

- **Per-driver custom config editors** — added in each driver's implementation phase
- **Tag template / inheritance** — define a tag pattern once and apply it to many similar device instances; deferred until the bulk import path proves insufficient
- **Multi-cluster synchronized publish** — push a configuration change across many clusters atomically. Out of scope; orchestrate via per-cluster publishes from a script if needed.
- **Mobile / tablet layout** — desktop-only initially
- **Role grants editor in UI** — initial v2 manages LDAP group → admin role mappings via `appsettings.json`; a UI editor surfaces later

(Cluster-scoped admin grants were originally on this list for v2.1, but have been lifted into v2.0 — see "Cluster-scoped grants" above.)
## Decisions / Open Questions

**Decided** (captured in `plan.md` decision log):

- Blazor Server tech stack (vs. SPA + API)
- **Visual + auth parity with ScadaLink CentralUI** — Bootstrap 5, dark sidebar, server-rendered login form, cookie auth + JWT API endpoint, copied shared component set, reconnect overlay
- LDAP for operator auth via `LdapAuthService` + `RoleMapper` + `JwtTokenService` mirrored from `ScadaLink.Security`
- Three admin roles: FleetAdmin / ConfigEditor / ReadOnly, with cluster-scoped grants in v2.0 (mirrored from ScadaLink's site-scoped pattern)
- Draft → diff → publish is the only edit path; no in-place edits
- Sticky alerts require manual ack
- Per-node overrides are NOT generation-versioned

**Resolved Defaults**:

- **Styling: Bootstrap 5 vendored** (not MudBlazor or Fluent UI). Direct parity with ScadaLink CentralUI; standard component vocabulary; no Blazor-specific component-library dependency. Reverses an earlier draft choice — the cross-app consistency requirement outweighs MudBlazor's component conveniences.
- **Theme: light only (single theme matching ScadaLink).** ScadaLink ships light-only with the dark-sidebar / light-main pattern. Operators using both apps see one consistent aesthetic. Reverses an earlier draft choice that proposed both light and dark — cross-app consistency wins. Revisit only if ScadaLink adds dark mode.
- **CSV import dialect: strict CSV (RFC 4180), UTF-8 BOM accepted.** Excel "Save as CSV (UTF-8)" produces RFC 4180-compatible output and is the documented primary input format. TSV is not supported initially; add it only if operator feedback shows real friction with Excel CSV.
- **Push notification deferred to v2.1; polling is the initial model.** SignalR-from-DB-to-nodes would tighten apply latency from ~30 s to ~1 s but adds infrastructure (a SignalR backplane or SQL Service Broker) that's not earning its keep at v2.0 scale. The publish dialog reserves a disabled **"Push now"** button labeled "Available in v2.1" so the future UX is anchored.
- **Auto-save drafts with explicit Discard button.** Every form-field change writes to the draft rows immediately (debounced 500 ms). The Discard button shows a confirmation dialog ("Discard all changes since last publish?") and rolls the draft generation back to empty. The Publish button is the only commit; auto-save does not publish.
- **Cluster-scoped admin grants in v2.0** (lifted from the v2.1 deferred list). ScadaLink already ships the equivalent site-scoped pattern, so we get cluster-scoped grants essentially for free by mirroring it. `RoleMapper` reads an `LdapGroupRoleMapping` table; cluster-scoped users carry `ClusterId` claims and see only their permitted clusters.
docs/v2/config-db-schema.md — new file, 522 lines
|
||||
# Central Config DB Schema — OtOpcUa v2

> **Status**: DRAFT — companion to `plan.md` §4. Concrete schema, indexes, stored procedures, and authorization model for the central MSSQL configuration database.
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Scope

This document defines the central MSSQL database that stores all OtOpcUa fleet configuration: clusters, nodes, drivers, devices, tags, poll groups, credentials, and config generations. It is the single source of truth for fleet management — every running OtOpcUa node reads its config from here, and every operator change goes through here.

Out of scope here (covered elsewhere):

- The Admin web UI that edits this DB → `admin-ui.md`
- The local LiteDB cache on each node → covered briefly at the end of this doc; the full schema is small and tracks only what's needed for offline boot
- Driver-specific JSON shapes inside `DriverConfig` / `DeviceConfig` / `TagConfig` → `driver-specs.md`, per driver
- The cluster topology and rollout model → `plan.md` §4
## Design Goals

1. **Atomic publish, surgical apply** — operators publish a whole generation in one transaction; nodes apply only the diff
2. **Cluster-scoped isolation** — one cluster's config changes never affect another cluster
3. **Per-node credential binding** — each physical node has its own auth principal; the DB rejects cross-cluster reads server-side
4. **Schemaless driver config** — driver-type-specific settings live in JSON columns, so adding a new driver type doesn't require a schema migration
5. **Append-only generations** — old generations are never deleted; rollback is just publishing an older generation as new
6. **Auditable** — every publish, rollback, and apply event is recorded with the principal that performed it
## Schema Overview

```
ServerCluster (1)──(1..2) ClusterNode (1)──(1..N) ClusterNodeCredential
     │
     └──(1)──(N) ConfigGeneration ──(N)── DriverInstance ──(N)── Device ──(N)── Tag
                      │                         │
                      │                         └──(N)── PollGroup
                      │
                      └──(N)── PollGroup (driver-scoped)

ClusterNodeGenerationState (1:1 ClusterNode) — tracks applied generation per node
ConfigAuditLog             — append-only event log
```
## Table Definitions

All `Json` columns use `nvarchar(max)` with a `CHECK (ISJSON(col) = 1)` constraint. Timestamps are `datetime2(3)` in UTC. PKs use `uniqueidentifier` (sequential GUIDs) unless noted; logical IDs (`ClusterId`, `NodeId`, `DriverInstanceId`, `TagId`) are `nvarchar(64)` for human readability.
### `ServerCluster`

```sql
CREATE TABLE dbo.ServerCluster (
    ClusterId      nvarchar(64)   NOT NULL PRIMARY KEY,
    Name           nvarchar(128)  NOT NULL,
    Site           nvarchar(64)   NULL,  -- grouping for fleet management
    NodeCount      tinyint        NOT NULL CHECK (NodeCount IN (1, 2)),
    RedundancyMode nvarchar(16)   NOT NULL CHECK (RedundancyMode IN ('None', 'Warm', 'Hot')),
    NamespaceUri   nvarchar(256)  NOT NULL,  -- shared by both nodes
    Enabled        bit            NOT NULL DEFAULT 1,
    Notes          nvarchar(1024) NULL,
    CreatedAt      datetime2(3)   NOT NULL DEFAULT SYSUTCDATETIME(),
    CreatedBy      nvarchar(128)  NOT NULL,
    ModifiedAt     datetime2(3)   NULL,
    ModifiedBy     nvarchar(128)  NULL,
    CONSTRAINT CK_ServerCluster_RedundancyMode_NodeCount
        CHECK ((NodeCount = 1 AND RedundancyMode = 'None')
            OR (NodeCount = 2 AND RedundancyMode IN ('Warm', 'Hot')))
);

CREATE UNIQUE INDEX UX_ServerCluster_Name ON dbo.ServerCluster (Name);
CREATE INDEX IX_ServerCluster_Site ON dbo.ServerCluster (Site) WHERE Site IS NOT NULL;
```
### `ClusterNode`

```sql
CREATE TABLE dbo.ClusterNode (
    NodeId                    nvarchar(64)  NOT NULL PRIMARY KEY,
    ClusterId                 nvarchar(64)  NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    RedundancyRole            nvarchar(16)  NOT NULL CHECK (RedundancyRole IN ('Primary', 'Secondary', 'Standalone')),
    Host                      nvarchar(255) NOT NULL,
    OpcUaPort                 int           NOT NULL DEFAULT 4840,
    DashboardPort             int           NOT NULL DEFAULT 8081,
    ApplicationUri            nvarchar(256) NOT NULL,
    ServiceLevelBase          tinyint       NOT NULL DEFAULT 200,
    DriverConfigOverridesJson nvarchar(max) NULL CHECK (DriverConfigOverridesJson IS NULL OR ISJSON(DriverConfigOverridesJson) = 1),
    Enabled                   bit           NOT NULL DEFAULT 1,
    LastSeenAt                datetime2(3)  NULL,
    CreatedAt                 datetime2(3)  NOT NULL DEFAULT SYSUTCDATETIME(),
    CreatedBy                 nvarchar(128) NOT NULL
);

-- ApplicationUri uniqueness is FLEET-WIDE, not per-cluster (per plan.md decision #86)
CREATE UNIQUE INDEX UX_ClusterNode_ApplicationUri ON dbo.ClusterNode (ApplicationUri);
CREATE INDEX IX_ClusterNode_ClusterId ON dbo.ClusterNode (ClusterId);

-- Each cluster has at most one Primary
CREATE UNIQUE INDEX UX_ClusterNode_Primary_Per_Cluster
    ON dbo.ClusterNode (ClusterId)
    WHERE RedundancyRole = 'Primary';
```

`DriverConfigOverridesJson` shape:

```jsonc
{
  "<DriverInstanceId>": {
    "<JSON path within DriverConfig>": "<override value>"
  },
  // Example:
  "GalaxyMain": {
    "MxAccess.ClientName": "OtOpcUa-NodeB"
  }
}
```

The merge happens at apply time on the node — the cluster-level `DriverConfig` is read, then this node's overrides are layered on top using JSON-pointer or simple key-path semantics. Tags and devices have **no** per-node override path.
### `ClusterNodeCredential`

```sql
CREATE TABLE dbo.ClusterNodeCredential (
    CredentialId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    NodeId       nvarchar(64)     NOT NULL FOREIGN KEY REFERENCES dbo.ClusterNode(NodeId),
    Kind         nvarchar(32)     NOT NULL CHECK (Kind IN ('SqlLogin', 'ClientCertThumbprint', 'ADPrincipal', 'gMSA')),
    Value        nvarchar(512)    NOT NULL,  -- login name, cert thumbprint, SID, etc.
    Enabled      bit              NOT NULL DEFAULT 1,
    RotatedAt    datetime2(3)     NULL,
    CreatedAt    datetime2(3)     NOT NULL DEFAULT SYSUTCDATETIME(),
    CreatedBy    nvarchar(128)    NOT NULL
);

CREATE INDEX IX_ClusterNodeCredential_NodeId ON dbo.ClusterNodeCredential (NodeId, Enabled);
CREATE UNIQUE INDEX UX_ClusterNodeCredential_Value ON dbo.ClusterNodeCredential (Kind, Value) WHERE Enabled = 1;
```

A node may have multiple enabled credentials simultaneously (e.g. during cert rotation, when old and new are both valid for a window). Disabled rows are kept for audit.
### `ConfigGeneration`

```sql
CREATE TABLE dbo.ConfigGeneration (
    GenerationId       bigint         NOT NULL PRIMARY KEY IDENTITY(1, 1),
    ClusterId          nvarchar(64)   NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    Status             nvarchar(16)   NOT NULL CHECK (Status IN ('Draft', 'Published', 'Superseded', 'RolledBack')),
    ParentGenerationId bigint         NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    PublishedAt        datetime2(3)   NULL,
    PublishedBy        nvarchar(128)  NULL,
    Notes              nvarchar(1024) NULL,
    CreatedAt          datetime2(3)   NOT NULL DEFAULT SYSUTCDATETIME(),
    CreatedBy          nvarchar(128)  NOT NULL
);

-- Fast lookup of "latest published generation for cluster X" (the per-node poll path)
CREATE INDEX IX_ConfigGeneration_Cluster_Published
    ON dbo.ConfigGeneration (ClusterId, Status, GenerationId DESC)
    INCLUDE (PublishedAt);

-- One Draft per cluster at a time (prevents accidental concurrent edits)
CREATE UNIQUE INDEX UX_ConfigGeneration_Draft_Per_Cluster
    ON dbo.ConfigGeneration (ClusterId)
    WHERE Status = 'Draft';
```

`Status` transitions: `Draft → Published → Superseded` (when a newer generation is published) or `Draft → Published → RolledBack` (when explicitly rolled back). No transition skips Published.
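
The transition rule above is small enough to capture as a table — a sketch only; names are illustrative, not from the codebase:

```python
# Allowed Status transitions for ConfigGeneration rows.
ALLOWED = {
    ("Draft", "Published"),
    ("Published", "Superseded"),
    ("Published", "RolledBack"),
}

def can_transition(current: str, target: str) -> bool:
    """True iff the status change is legal; no transition skips Published."""
    return (current, target) in ALLOWED
```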
### `DriverInstance`

```sql
CREATE TABLE dbo.DriverInstance (
    DriverInstanceRowId uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId        bigint           NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    DriverInstanceId    nvarchar(64)     NOT NULL,  -- stable logical ID across generations
    ClusterId           nvarchar(64)     NOT NULL FOREIGN KEY REFERENCES dbo.ServerCluster(ClusterId),
    Name                nvarchar(128)    NOT NULL,
    DriverType          nvarchar(32)     NOT NULL,  -- Galaxy | ModbusTcp | AbCip | AbLegacy | S7 | TwinCat | Focas | OpcUaClient
    NamespaceUri        nvarchar(256)    NOT NULL,  -- per-driver namespace within the cluster's URI scope
    Enabled             bit              NOT NULL DEFAULT 1,
    DriverConfig        nvarchar(max)    NOT NULL CHECK (ISJSON(DriverConfig) = 1)
);

CREATE INDEX IX_DriverInstance_Generation_Cluster
    ON dbo.DriverInstance (GenerationId, ClusterId);
CREATE UNIQUE INDEX UX_DriverInstance_Generation_LogicalId
    ON dbo.DriverInstance (GenerationId, DriverInstanceId);
CREATE UNIQUE INDEX UX_DriverInstance_Generation_NamespaceUri
    ON dbo.DriverInstance (GenerationId, NamespaceUri);
```

### `Device`

```sql
CREATE TABLE dbo.Device (
    DeviceRowId      uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId     bigint           NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    DeviceId         nvarchar(64)     NOT NULL,
    DriverInstanceId nvarchar(64)     NOT NULL,
    Name             nvarchar(128)    NOT NULL,
    Enabled          bit              NOT NULL DEFAULT 1,
    DeviceConfig     nvarchar(max)    NOT NULL CHECK (ISJSON(DeviceConfig) = 1)
);

CREATE INDEX IX_Device_Generation_Driver
    ON dbo.Device (GenerationId, DriverInstanceId);
CREATE UNIQUE INDEX UX_Device_Generation_LogicalId
    ON dbo.Device (GenerationId, DeviceId);
```

The FK to `DriverInstance` is logical (matched by `GenerationId + DriverInstanceId` in app code), not declared as a SQL FK — declaring it would require composite FKs that are awkward when generations are immutable. The publish stored procedure validates referential integrity before flipping `Status`.
### `Tag`

```sql
CREATE TABLE dbo.Tag (
    TagRowId         uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId     bigint           NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    TagId            nvarchar(64)     NOT NULL,
    DriverInstanceId nvarchar(64)     NOT NULL,
    DeviceId         nvarchar(64)     NULL,  -- NULL for driver-scoped tags (no device layer)
    Name             nvarchar(128)    NOT NULL,
    FolderPath       nvarchar(512)    NOT NULL,  -- address-space hierarchy
    DataType         nvarchar(32)     NOT NULL,  -- OPC UA built-in type name (Boolean, Int32, Float, etc.)
    AccessLevel      nvarchar(16)     NOT NULL CHECK (AccessLevel IN ('Read', 'ReadWrite')),
    WriteIdempotent  bit              NOT NULL DEFAULT 0,
    PollGroupId      nvarchar(64)     NULL,
    TagConfig        nvarchar(max)    NOT NULL CHECK (ISJSON(TagConfig) = 1)
);

CREATE INDEX IX_Tag_Generation_Driver_Device
    ON dbo.Tag (GenerationId, DriverInstanceId, DeviceId);
CREATE UNIQUE INDEX UX_Tag_Generation_LogicalId
    ON dbo.Tag (GenerationId, TagId);
CREATE UNIQUE INDEX UX_Tag_Generation_Path
    ON dbo.Tag (GenerationId, DriverInstanceId, FolderPath, Name);
```

### `PollGroup`

```sql
CREATE TABLE dbo.PollGroup (
    PollGroupRowId   uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    GenerationId     bigint           NOT NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    PollGroupId      nvarchar(64)     NOT NULL,
    DriverInstanceId nvarchar(64)     NOT NULL,
    Name             nvarchar(128)    NOT NULL,
    IntervalMs       int              NOT NULL CHECK (IntervalMs >= 50)
);

CREATE INDEX IX_PollGroup_Generation_Driver
    ON dbo.PollGroup (GenerationId, DriverInstanceId);
CREATE UNIQUE INDEX UX_PollGroup_Generation_LogicalId
    ON dbo.PollGroup (GenerationId, PollGroupId);
```

### `ClusterNodeGenerationState`

```sql
CREATE TABLE dbo.ClusterNodeGenerationState (
    NodeId              nvarchar(64)   NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES dbo.ClusterNode(NodeId),
    CurrentGenerationId bigint         NULL FOREIGN KEY REFERENCES dbo.ConfigGeneration(GenerationId),
    LastAppliedAt       datetime2(3)   NULL,
    LastAppliedStatus   nvarchar(16)   NULL CHECK (LastAppliedStatus IN ('Applied', 'RolledBack', 'Failed', 'InProgress')),
    LastAppliedError    nvarchar(2048) NULL,
    LastSeenAt          datetime2(3)   NULL  -- updated on every poll, for liveness
);

CREATE INDEX IX_ClusterNodeGenerationState_Generation
    ON dbo.ClusterNodeGenerationState (CurrentGenerationId);
```

A 2-node cluster with both nodes on the same `CurrentGenerationId` is "converged"; nodes on different generations are "applying" or "diverged" — the Admin UI surfaces this state directly.
### `ConfigAuditLog`

```sql
CREATE TABLE dbo.ConfigAuditLog (
    AuditId      bigint        NOT NULL PRIMARY KEY IDENTITY(1, 1),
    Timestamp    datetime2(3)  NOT NULL DEFAULT SYSUTCDATETIME(),
    Principal    nvarchar(128) NOT NULL,  -- DB principal that performed the action
    EventType    nvarchar(64)  NOT NULL,  -- DraftCreated, DraftEdited, Published, RolledBack, NodeApplied, CredentialAdded, CredentialDisabled, ClusterCreated, NodeAdded, etc.
    ClusterId    nvarchar(64)  NULL,
    NodeId       nvarchar(64)  NULL,
    GenerationId bigint        NULL,
    DetailsJson  nvarchar(max) NULL CHECK (DetailsJson IS NULL OR ISJSON(DetailsJson) = 1)
);

CREATE INDEX IX_ConfigAuditLog_Cluster_Time
    ON dbo.ConfigAuditLog (ClusterId, Timestamp DESC);
CREATE INDEX IX_ConfigAuditLog_Generation
    ON dbo.ConfigAuditLog (GenerationId) WHERE GenerationId IS NOT NULL;
```

Append-only by convention (no UPDATE/DELETE permissions are granted to any application principal); enforced by the GRANT model below.
## Stored Procedures

All non-trivial DB access goes through stored procedures. Direct table SELECT/INSERT/UPDATE/DELETE is **not granted** to node or admin principals — only the procs are callable. This is the enforcement point for the authorization model.
### `sp_GetCurrentGenerationForCluster` (called by node)

```sql
-- @NodeId:    passed by the calling node; verified against the authenticated principal
-- @ClusterId: passed by the calling node; verified to match @NodeId's cluster
-- Returns: the latest Published generation for the cluster, or an empty result set if none
CREATE PROCEDURE dbo.sp_GetCurrentGenerationForCluster
    @NodeId    nvarchar(64),
    @ClusterId nvarchar(64)
AS
BEGIN
    SET NOCOUNT ON;

    -- 1. Authenticate: verify the calling principal is bound to @NodeId
    DECLARE @CallerPrincipal nvarchar(128) = SUSER_SNAME();
    IF NOT EXISTS (
        SELECT 1 FROM dbo.ClusterNodeCredential
        WHERE NodeId = @NodeId
          AND Value = @CallerPrincipal
          AND Enabled = 1
    )
    BEGIN
        RAISERROR('Unauthorized: caller %s is not bound to NodeId %s', 16, 1, @CallerPrincipal, @NodeId);
        RETURN;
    END

    -- 2. Authorize: verify @NodeId belongs to @ClusterId
    IF NOT EXISTS (
        SELECT 1 FROM dbo.ClusterNode
        WHERE NodeId = @NodeId AND ClusterId = @ClusterId AND Enabled = 1
    )
    BEGIN
        RAISERROR('Forbidden: NodeId %s does not belong to ClusterId %s', 16, 1, @NodeId, @ClusterId);
        RETURN;
    END

    -- 3. Return the latest Published generation
    SELECT TOP 1 GenerationId, PublishedAt, PublishedBy, Notes
    FROM dbo.ConfigGeneration
    WHERE ClusterId = @ClusterId AND Status = 'Published'
    ORDER BY GenerationId DESC;
END
```

Companion procs: `sp_GetGenerationContent` (returns full generation rows for a given `GenerationId`, with the same auth checks) and `sp_RegisterNodeGenerationApplied` (node reports back which generation it has now applied + status).
### `sp_PublishGeneration` (called by Admin)

```sql
-- Atomic: validates the draft, computes the diff vs. the previous Published, flips Status
CREATE PROCEDURE dbo.sp_PublishGeneration
    @ClusterId         nvarchar(64),
    @DraftGenerationId bigint,
    @Notes             nvarchar(1024) = NULL
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;
    BEGIN TRANSACTION;

    -- 1. Verify the caller is an admin (separate authz check vs. node auth)
    -- 2. Validate the Draft: foreign keys resolve, no orphan tags, JSON columns parse, etc.
    --    EXEC dbo.sp_ValidateDraft @DraftGenerationId;  -- raises on failure
    -- 3. Mark the previous Published generation as Superseded
    UPDATE dbo.ConfigGeneration
    SET Status = 'Superseded'
    WHERE ClusterId = @ClusterId AND Status = 'Published';
    -- 4. Promote the Draft to Published
    UPDATE dbo.ConfigGeneration
    SET Status      = 'Published',
        PublishedAt = SYSUTCDATETIME(),
        PublishedBy = SUSER_SNAME(),
        Notes       = ISNULL(@Notes, Notes)
    WHERE GenerationId = @DraftGenerationId AND ClusterId = @ClusterId;
    -- 5. Audit log
    INSERT dbo.ConfigAuditLog (Principal, EventType, ClusterId, GenerationId)
    VALUES (SUSER_SNAME(), 'Published', @ClusterId, @DraftGenerationId);

    COMMIT;
END
```

### `sp_RollbackToGeneration` (called by Admin)

Creates a *new* Published generation by cloning rows from the target generation. The target stays in the `Superseded` state; the new clone becomes `Published`. This way every state visible to nodes is an actual published generation, never a "rolled back to" pointer that's hard to reason about.

### `sp_ValidateDraft` (called inside publish, also exposed for Admin preview)

Checks: every `Tag.DriverInstanceId` resolves; every `Tag.DeviceId` resolves to a `Device` whose `DriverInstanceId` matches the tag's; every `Tag.PollGroupId` resolves; every `Device.DriverInstanceId` resolves; no duplicate `(GenerationId, DriverInstanceId, FolderPath, Name)` collisions; every JSON column parses; every `DriverConfig` matches its `DriverType`'s schema (validated against a registered JSON schema per driver type — see "JSON Column Conventions" below).
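
The referential checks lend themselves to a few set comparisons — a minimal in-memory sketch of the same rules, illustrative rather than the proc's actual T-SQL:

```python
def validate_draft(drivers: set[str], devices: dict[str, str],
                   pollgroups: set[str], tags: list[dict]) -> list[str]:
    """Return referential-integrity errors for one draft generation.

    `devices` maps DeviceId -> its DriverInstanceId.  Each tag dict has
    TagId, DriverInstanceId, DeviceId (or None), PollGroupId (or None).
    """
    errors = []
    for dev_id, drv in devices.items():
        if drv not in drivers:
            errors.append(f"Device {dev_id}: unknown driver {drv}")
    for t in tags:
        if t["DriverInstanceId"] not in drivers:
            errors.append(f"Tag {t['TagId']}: unknown driver")
        dev = t.get("DeviceId")
        if dev is not None and devices.get(dev) != t["DriverInstanceId"]:
            errors.append(f"Tag {t['TagId']}: DeviceId does not resolve within its driver")
        pg = t.get("PollGroupId")
        if pg is not None and pg not in pollgroups:
            errors.append(f"Tag {t['TagId']}: unknown poll group")
    return errors
```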
### `sp_ComputeGenerationDiff`

Returns the rows that differ between two generations — added, removed, and modified, per table. Used by the Admin UI's diff viewer and by the node's apply logic to decide what to surgically update without bouncing the whole driver instance.
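
With rows keyed by their stable logical IDs, the diff is a pair of set comparisons — a sketch of the same added/removed/modified shape (illustrative, not the proc's implementation):

```python
def diff_generation(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list[str]]:
    """Diff one table's rows across two generations, keyed by logical ID."""
    added    = [k for k in new if k not in old]
    removed  = [k for k in old if k not in new]
    modified = [k for k in new if k in old and new[k] != old[k]]
    return {"added": added, "removed": removed, "modified": modified}
```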
## Authorization Model

### SQL principals

Two principal classes:

1. **Node principals** — one per `ClusterNode` (SQL login, gMSA, or cert-mapped user). Granted EXECUTE on `sp_GetCurrentGenerationForCluster`, `sp_GetGenerationContent`, and `sp_RegisterNodeGenerationApplied` only. No table SELECT.
2. **Admin principals** — granted to operator accounts. EXECUTE on all `sp_*` procs. No direct table access either, except read-only views for reporting (`vw_ClusterFleetStatus`, `vw_GenerationHistory`).

The `dbo` schema is owned by no application principal; only `db_owner` (DBA-managed) can change the schema.
### Per-node binding enforcement

`sp_GetCurrentGenerationForCluster` uses `SUSER_SNAME()` to identify the calling principal and cross-checks it against `ClusterNodeCredential.Value`. A principal asking for another node's cluster gets a `RAISERROR` with HTTP-403-equivalent semantics (severity 16, state 1).

For `Authentication=ActiveDirectoryMsi` or cert-auth scenarios — where `SUSER_SNAME()` returns the AD principal name or the name of the cert-mapped login — `ClusterNodeCredential.Value` stores the matching value. Multiple `Kind` values are supported so a single deployment can mix gMSA and cert auth across different nodes.
### Defense-in-depth: SESSION_CONTEXT

After authentication, the caller-side connection wrapper sets `SESSION_CONTEXT` with `NodeId` and `ClusterId` to make audit logging trivial. The procs ignore client-asserted `SESSION_CONTEXT` values — they recompute identity from `SUSER_SNAME()` — but the audit log captures both, so any attempt to spoof shows up in the audit trail.

### Admin authn separation

The Admin UI authenticates operators via the LDAP layer described in `Security.md` (the existing v1 LDAP authentication, reused). A successful LDAP bind maps to a SQL principal that has admin DB grants. Operators do not get direct DB credentials.
## JSON Column Conventions

`DriverConfig`, `DeviceConfig`, `TagConfig`, and `DriverConfigOverridesJson` are schemaless to the DB but **strictly schemaed by the application**. Each driver type registers a JSON schema in `Core.Abstractions.DriverTypeRegistry` describing the valid keys for its `DriverConfig`, `DeviceConfig`, and `TagConfig`. `sp_ValidateDraft` calls into managed code (a CLR-hosted validator or an external EF/.NET pre-publish step) to validate content before the `Status` flip.

Examples of the per-driver shapes — full specs in `driver-specs.md`:
```jsonc
// DriverConfig for DriverType=Galaxy
{
  "MxAccess": { "ClientName": "OtOpcUa-Cluster1", "RequestTimeoutSeconds": 30 },
  "Database": { "ConnectionString": "Server=...;Database=ZB;...", "PollIntervalSeconds": 60 },
  "Historian": { "Enabled": false }
}

// DeviceConfig for DriverType=ModbusTcp
{
  "Host": "10.0.3.42",
  "Port": 502,
  "UnitId": 1,
  "ByteOrder": "BigEndianBigEndianWord",
  "AddressFormat": "Standard" // or "DL205"
}

// TagConfig for DriverType=ModbusTcp
{
  "RegisterType": "HoldingRegister",
  "Address": 100,
  "Length": 1,
  "Scaling": { "Multiplier": 0.1, "Offset": 0 }
}
```

The JSON schemas live in source so they version with the driver; the DB doesn't carry per-type DDL.
## Per-Node Override Merge Semantics

At config-apply time on a node:

1. The node fetches the `DriverInstance` rows for the current generation and its `ClusterId`
2. The node fetches its own `ClusterNode.DriverConfigOverridesJson`
3. For each `DriverInstance`, the node parses the cluster-level `DriverConfig`, then walks the override JSON for that `DriverInstanceId`, applying each leaf-key override on top
4. The merge is **shallow at the leaf level** — the override key path locates the exact JSON node to replace. Arrays are replaced wholesale, not merged element-wise. If an override path doesn't exist in `DriverConfig`, the merge fails the apply step (loud failure beats silent drift).
5. The resulting JSON is the effective `DriverConfig` for this node, passed to the driver factory

Tags and devices are never overridden per-node. If you need a tag definition to differ between nodes, you have a different cluster — split it.
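
Steps 3–4 can be sketched as a small recursive walk — a minimal illustration using the dotted-path syntax from the decisions section (plain `.`-split only; the `\.` escaping and `Items[0]` indexing rules are omitted here):

```python
import copy

def apply_overrides(config: dict, overrides: dict[str, object]) -> dict:
    """Layer leaf-path overrides onto a copy of the cluster-level DriverConfig.

    Raises KeyError when a path doesn't exist — loud failure beats silent drift.
    """
    merged = copy.deepcopy(config)
    for path, value in overrides.items():
        *parents, leaf = path.split(".")
        node = merged
        for key in parents:
            node = node[key]          # KeyError here fails the apply step
        if leaf not in node:
            raise KeyError(f"override path not in DriverConfig: {path}")
        node[leaf] = value            # leaf replaced wholesale (arrays too)
    return merged

cluster_cfg = {"MxAccess": {"ClientName": "OtOpcUa-Cluster1", "RequestTimeoutSeconds": 30}}
node_cfg = apply_overrides(cluster_cfg, {"MxAccess.ClientName": "OtOpcUa-NodeB"})
```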
## Local LiteDB Cache

Each node maintains a small LiteDB file (default `config_cache.db`) keyed by `GenerationId`. On startup, if the central DB is unreachable, the node loads the most recent cached generation and starts.

Schema (LiteDB collections):

| Collection | Purpose |
|------------|---------|
| `Generations` | Header rows (GenerationId, ClusterId, PublishedAt, Notes) |
| `DriverInstances` | Cluster-level driver definitions per generation |
| `Devices` | Per-driver devices |
| `Tags` | Per-driver/device tags |
| `PollGroups` | Per-driver poll groups |
| `NodeConfig` | This node's `ClusterNode` row + overrides JSON |

A node only ever caches its own cluster's generations. Cached generations older than the most recent N (default 10) are pruned to bound disk usage.
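
The pruning rule is a one-liner over the cached generation IDs — a sketch with the keep-count default from this doc:

```python
def prune_cached_generations(cached_ids: list[int], keep: int = 10) -> list[int]:
    """Return the cached generation IDs to delete, keeping the most recent `keep`."""
    return sorted(cached_ids)[:-keep] if len(cached_ids) > keep else []
```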
## EF Core Migrations

The `Configuration` project (per `plan.md` §5) owns the schema. EF Core code-first migrations live under `Configuration/Migrations/`. Every migration ships with:

- The forward `Up()` and reverse `Down()` operations
- A schema-validation test that runs the migration against a clean DB and verifies that indexes, constraints, and stored procedures match the expected DDL
- A data-fixture test that seeds a minimal cluster + node + generation and exercises `sp_GetCurrentGenerationForCluster` end-to-end

Stored procedures are managed via `MigrationBuilder.Sql()` blocks (idempotent `CREATE OR ALTER` style) so they version with the schema, not as separate DDL artifacts.
## Indexes — Hot Paths Summary

| Path | Index |
|------|-------|
| Node poll: "latest published generation for my cluster" | `IX_ConfigGeneration_Cluster_Published` |
| Node fetch of generation content | Per-table `(GenerationId, ...)` indexes |
| Admin: list clusters by site | `IX_ServerCluster_Site` |
| Admin: list generations per cluster | `IX_ConfigGeneration_Cluster_Published` (covers all statuses via DESC scan) |
| Admin: who's on which generation | `IX_ClusterNodeGenerationState_Generation` |
| Audit query: cluster history | `IX_ConfigAuditLog_Cluster_Time` |
| Auth check on every node poll | `IX_ClusterNodeCredential_NodeId` |
## Backup, Retention, and Operational Concerns

- **Generations are never deleted** (per decision #58). The storage cost is small — even at one publish per day per cluster, a 50-cluster fleet generates ~18k generations/year with average row counts in the hundreds. Total at full v2 fleet scale: well under 10 GB/year.
- **Backup**: standard SQL Server full + differential + log backups. Point-in-time restore covers operator-mistake recovery (rolled back the wrong generation, etc.).
- **Audit log retention**: 7 years by default, partitioned by year for cheap pruning if a customer requires shorter retention.
- **Connection pooling**: each OtOpcUa node holds a pooled connection; the Admin UI uses standard EF `DbContext` pooling.
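
The ~18k figure above is a straightforward back-of-envelope check — one publish per day per cluster across a 50-cluster fleet:

```python
clusters = 50
publishes_per_cluster_per_day = 1
generations_per_year = clusters * publishes_per_cluster_per_day * 365
# 50 clusters x 1 publish/day x 365 days = 18,250 generations/year (~18k)
```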
## Decisions / Open Questions

**Decided** (captured in the `plan.md` decision log):

- Cluster-scoped generations (#82)
- Per-node credential binding (#83)
- Both nodes apply independently, with brief divergence acceptable (#84)
- ApplicationUri unique fleet-wide, never auto-rewritten (#86)
- All new tables (#79, #80)

**Resolved defaults**:

- **JSON validation: external (in the Admin app), not CLR-hosted.** Requiring CLR on the SQL Server is an operational tax (CLR is disabled by default on hardened DB instances, and many DBAs refuse to enable it). The Admin app validates draft content against the per-driver JSON schemas before calling `sp_PublishGeneration`; the proc enforces structural integrity (FKs, uniqueness, JSON parseability via `ISJSON`) but trusts the caller for content schema. Direct proc invocation outside the Admin app is already prevented by the GRANT model — only admin principals can publish.
- **Dotted JSON path syntax for `DriverConfigOverridesJson`.** Example: `"MxAccess.ClientName"`, not `"/MxAccess/ClientName"`. Dotted is more readable in operator-facing UI and CSV exports. Reserved chars: a literal `.` in a key segment is escaped as `\.`; a literal `\` is escaped as `\\`. Array indexing uses bracket form: `Items[0].Name`. Documented inline in the override editor's help text.
- **`sp_PurgeGenerationsBefore` proc deferred to v2.1.** The initial release ships with "keep all generations forever" (decision #58). The purge proc is shaped now so we don't have to re-think it later: the signature `sp_PurgeGenerationsBefore(@ClusterId, @CutoffGenerationId, @ConfirmToken)` requires an Admin-supplied confirmation token (random hex shown in the UI) to prevent script-based mass deletion; deletes cascade via per-table `WHERE GenerationId IN (...)`; an audit log entry records the principal, the cutoff, and the row counts deleted. Surface it in v2.1 only when a customer compliance ask demands it.
1012
docs/v2/driver-specs.md
Normal file
File diff suppressed because it is too large
495
docs/v2/driver-stability.md
Normal file
# Driver Stability & Isolation — OtOpcUa v2

> **Status**: DRAFT — companion to `plan.md`. Defines the stability tier model, per-driver hosting decisions, cross-cutting protections every driver process must apply, and the canonical worked example (FOCAS) for the high-risk tier.
>
> **Branch**: `v2`
> **Created**: 2026-04-17

## Problem Statement

The v2 plan spans eight drivers: pure managed code (Modbus, OPC UA Client), wrapped C libraries (libplctag for AB CIP/Legacy, S7netplus for Siemens, Beckhoff.TwinCAT.Ads for ADS), heavy native/COM with thread affinity (Galaxy MXAccess), and black-box vendor DLLs (FANUC `Fwlib64.dll` for FOCAS).

These do not all carry the same failure profile, but the v1 plan treats them uniformly: every driver runs in-process in the .NET 10 server except Galaxy (isolated only because of its 32-bit COM constraint). This means:

- An `AccessViolationException` from `Fwlib64.dll` — **uncatchable** by managed code in modern .NET — tears down the whole OPC UA server, all subscriptions, and every other driver with it.
- A native handle leak (a FOCAS `cnc_allclibhndl3` call not paired with `cnc_freelibhndl`, or libplctag tag handles never freed) accumulates against the *server* process, not the driver.
- A thread-affinity bug (calling Fwlib on two threads against the same handle) corrupts state for every other driver sharing the process.
- Polly's circuit breaker handles transient *errors*; it does nothing for *process death* or *resource exhaustion*.

Driver stability needs to be a first-class architectural concern, not a per-driver afterthought.

---

## Stability Tier Model

Every driver is assigned to one of three tiers based on the trust level of its dependency stack:

### Tier A — Pure Managed

Drivers whose entire dependency chain is verifiable .NET. Standard exception handling and Polly are sufficient. Run in-process in the main server.

| Driver | Stack | Notes |
|--------|-------|-------|
| Modbus TCP | NModbus (pure managed) | Sockets only |
| OPC UA Client | OPC Foundation .NETStandard SDK (pure managed) | Reference-grade SDK |

### Tier B — Wrapped Native, Mature

Drivers that P/Invoke into a mature, well-maintained native library, or use a managed wrapper with limited native bits (router, transport). Run in-process **with the cross-cutting protections from §3 mandatory**: a SafeHandle for every native resource, a memory watchdog, bounded queues. Any driver in this tier may be promoted to Tier C if production data shows leaks or crashes.

| Driver | Stack | Notes |
|--------|-------|-------|
| Siemens S7 | S7netplus (mostly managed) | Sockets + small native helpers |
| AB CIP | libplctag (C library via P/Invoke) | Mature, widely deployed; manages its own threads |
| AB Legacy | libplctag (same as CIP) | Same library, different protocol mode |
| TwinCAT | Beckhoff.TwinCAT.Ads v6 + AmsTcpIpRouter | Mostly managed; native callback pump for ADS notifications |

### Tier C — Heavy Native / COM / Thread-Affinity

Drivers whose dependency is a black-box vendor DLL, a COM object with apartment requirements, or any code where a fault is likely uncatchable. **Run as a separate Windows service** behind the Galaxy.Proxy/Host/Shared pattern. A crash isolates to that driver's process; the main server fans out Bad quality on the affected nodes and respawns the host.

| Driver | Stack | Reason for Tier C |
|--------|-------|-------------------|
| Galaxy | MXAccess COM (.NET 4.8 x86) | Bitness mismatch + COM/STA + long history of native quirks |
| FOCAS | `Fwlib64.dll` P/Invoke | Black-box vendor DLL, handle affinity, thread-unsafe per handle, no public SLA |
---
|
||||
|
||||
## Cross-Cutting Protections

Two distinct protection sets, **scoped by hosting mode** rather than applied uniformly. This split exists because process-level signals (RSS watchdog, recycle, kill) act on a *process*, not a driver — applying them in the shared server process would let a leak in one in-proc driver knock out every other driver, every session, and the OPC UA endpoint. That contradicts the v2 isolation invariant. Process-level protections therefore apply **only to isolated host processes** (Tier C); in-process drivers (Tier A/B) get a different set of guards that operate at the driver-instance level.

### Universal — apply to every driver regardless of tier

#### SafeHandle for every native resource

Every native handle (FOCAS library handles from `cnc_allclibhndl3`, libplctag tag handles, COM IUnknown refs, OS file/socket handles we pass through P/Invoke) is wrapped in a `SafeHandle` subclass whose finalizer calls the matching release function (`cnc_freelibhndl`, etc.). This guarantees release even when:

- The owning thread crashes
- A `using` block is bypassed by an exception we forgot to catch
- The driver host process is shutting down ungracefully

`Marshal.ReleaseComObject` calls go through wrappers deriving from `CriticalFinalizerObject` to honor finalizer ordering during AppDomain unload.

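As a sketch of the wrapper shape (the P/Invoke declaration is illustrative — the real import lives in the FOCAS Host project):

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical P/Invoke surface for illustration only.
internal static class FocasNative
{
    [DllImport("Fwlib64.dll", EntryPoint = "cnc_freelibhndl")]
    internal static extern short cnc_freelibhndl(ushort handle);
}

// Wraps a FOCAS library handle so release happens even on finalization.
internal sealed class FocasHandle : SafeHandle
{
    public FocasHandle(ushort handle) : base(IntPtr.Zero, ownsHandle: true)
        => SetHandle((IntPtr)handle);

    public override bool IsInvalid => handle == IntPtr.Zero;

    public ushort Value => (ushort)handle;

    // Runs on the finalizer thread when Dispose was bypassed by a crash,
    // a missed using block, or an ungraceful shutdown.
    protected override bool ReleaseHandle()
        => FocasNative.cnc_freelibhndl((ushort)handle) == 0; // 0 == EW_OK
}
```

The zero-is-invalid convention is an assumption here; the real wrapper would use whatever sentinel `cnc_allclibhndl3` actually reserves.
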
#### Bounded operation queues (per device, per driver instance)

Every driver-instance/device pairing has a bounded outgoing-operation queue (default 1000 entries). When the queue is full, new operations fail fast with `BadResourceUnavailable` rather than backing up unboundedly against a slow or dead device. Polly's circuit breaker also opens, surfacing the device-down state to the dashboard.

This prevents the canonical "device went offline → reads pile up → driver eats all RAM" failure mode. Crucially, it operates **per device** in the in-process case so one stuck device cannot starve another driver's queue or accumulate against the shared server's heap.

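A minimal sketch with `System.Threading.Channels` (type names are assumptions; the `BadResourceUnavailable` mapping happens in the caller):

```csharp
using System.Threading.Channels;

// One bounded queue per (driver instance, device) pairing.
public sealed class DeviceOperationQueue<TOp>
{
    private readonly Channel<TOp> _channel;

    public DeviceOperationQueue(int capacity = 1000) =>   // mirrors the 1000-entry default
        _channel = Channel.CreateBounded<TOp>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait // TryWrite still returns false when full
        });

    // False when full — the caller maps this to BadResourceUnavailable
    // instead of letting operations back up against a dead device.
    public bool TryEnqueue(TOp op) => _channel.Writer.TryWrite(op);

    public ChannelReader<TOp> Reader => _channel.Reader;
}
```

`TryWrite` on a bounded channel fails immediately at capacity, which gives the fail-fast behavior without any producer-side blocking.
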
#### Crash-loop circuit breaker

If a driver host crashes 3 times within 5 minutes, the supervisor stops respawning, leaves the driver's nodes in Bad quality, raises an operator alert, and starts an **escalating cooldown** before attempting auto-reset. This balances "unattended sites need recovery without an operator on console" against "don't silently mask a persistent problem."

| Trip sequence | Cooldown before auto-reset |
|---------------|----------------------------|
| First trip | 1 hour |
| Re-trips within 10 min of an auto-reset | 4 hours |
| Re-trips after the 4 h cooldown | **24 hours, manual reset required via Admin UI** |

Every trip raises a sticky operator alert that does **not** auto-clear when the cooldown elapses — only manual acknowledgment clears it. So even if recovery is automatic, "we crash-looped 3 times overnight" stays visible the next morning. The auto-reset path keeps unattended plants running; the sticky alert + 24 h manual-only floor prevents the breaker from becoming a "silent retry forever" mechanism.

For Tier A/B (in-process) drivers, the "crash" being counted is a driver-instance reset (capability-level reinitialization, not a process exit). For Tier C drivers, it's a host process exit.

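One reading of the trip ladder as a state machine (how the ladder unwinds after a long stable run is not specified by the table; this sketch restarts at the 1 h rung):

```csharp
using System;

public sealed class CrashLoopBreaker
{
    public enum State { Closed, CoolingDown, ManualResetRequired }

    public State Current { get; private set; } = State.Closed;
    public TimeSpan Cooldown { get; private set; }
    private DateTimeOffset? _lastAutoReset;

    public void Trip(DateTimeOffset now)
    {
        if (Current == State.ManualResetRequired) return; // latched until Admin UI reset

        bool quickRetrip = _lastAutoReset is DateTimeOffset r
                           && now - r < TimeSpan.FromMinutes(10);
        if (!quickRetrip)                          Cooldown = TimeSpan.FromHours(1);
        else if (Cooldown < TimeSpan.FromHours(4)) Cooldown = TimeSpan.FromHours(4);
        else                                       Cooldown = TimeSpan.FromHours(24);

        Current = Cooldown >= TimeSpan.FromHours(24)
            ? State.ManualResetRequired
            : State.CoolingDown;
        // Sticky operator alert is raised here — it never auto-clears with the cooldown.
    }

    public void AutoReset(DateTimeOffset now)
    {
        if (Current != State.CoolingDown) return;
        Current = State.Closed;
        _lastAutoReset = now;
    }

    public void ManualReset()
    {
        Current = State.Closed; Cooldown = default; _lastAutoReset = null;
    }
}
```
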
### In-process only (Tier A/B) — driver-instance allocation tracking

In-process drivers cannot be recycled by killing the server process — that would take down every other driver, every session, and the OPC UA endpoint. RSS watchdogs and scheduled recycle therefore do **not** apply to Tier A/B. Instead, each driver instance is monitored at a finer grain:

- **Per-instance allocation tracking**: drivers expose a `GetMemoryFootprint()` capability returning bytes attributable to their own caches (symbol cache, subscription items, queued operations). The Core polls this every 30 s and logs growth slope per driver instance.
- **Soft-limit on cached state**: each driver declares a memory budget for its caches in `DriverConfig`. On breach, the Core asks the driver to flush optional caches (e.g. discard symbol cache, force re-discovery). No process action.
- **Escalation rule**: if a driver instance's footprint cannot be bounded by cache flushing — or if growth is in opaque allocations the driver can't account for — that driver is a candidate for **promotion to Tier C**. Process recycle is the only safe leak remediation, and the only way to apply process recycle to a single driver is to give it its own process.
- **No process kill on a Tier A/B driver**. Ever. The only Core-initiated recovery is asking the driver to reset its own state via `IDriver.Reinitialize()`. If that fails, the driver instance is marked Faulted, its nodes go Bad quality, and the operator is alerted. The server process keeps running for everyone else.

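The instance-level guards above could surface through capability shapes along these lines (interface and member names are illustrative; the real signatures live in `driver-specs.md`):

```csharp
using System;

// Hypothetical capability: lets the Core account for and bound per-instance caches.
public interface IMemoryAccountable
{
    // Bytes attributable to this instance's own caches (symbol cache,
    // subscription items, queued operations). Polled by the Core every 30 s.
    long GetMemoryFootprint();

    // Called on budget breach: discard optional caches (e.g. symbol cache,
    // force re-discovery). Never a process-level action.
    void FlushOptionalCaches();
}

public interface IDriver
{
    // ... other members defined in driver-specs.md ...

    // The only Core-initiated recovery for Tier A/B. If this throws, the
    // instance is marked Faulted, its nodes go Bad quality, operator alerted.
    void Reinitialize();
}
```
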
### Isolated host only (Tier C) — process-level protections

These act on the host process. They cannot affect any other driver or the main server, because each Tier C driver has its own process.

#### Per-host memory watchdog

Each host process measures baseline RSS after warm-up (post-discovery, post-first-poll). A monitor thread samples RSS every 30 s and tracks **both a multiplier of baseline and an absolute hard ceiling**.

| Threshold | Action |
|-----------|--------|
| 1.5× baseline **OR** baseline + 50 MB (whichever larger) | Log warning, surface in status dashboard |
| 3× baseline **OR** baseline + 200 MB (whichever larger) | Trigger soft recycle (graceful drain → exit → respawn) |
| 1 GB absolute hard ceiling | Force-kill driver process, supervisor respawns |
| Slope > 2 MB/min sustained 30 min | Treat as leak signal, soft recycle even below absolute threshold |

The "whichever larger" floor prevents spurious triggers when baseline is tiny — a 30 MB FOCAS Host shouldn't recycle at 45 MB just because the multiplier says so. All thresholds are per-driver-type defaults, overridable per-driver-instance in central config. **Only valid for isolated hosts** — never apply to the main server process.

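The multiplier-with-floor rule is easy to get wrong in prose, so here is the threshold evaluation as a sketch (global Tier C defaults; enum and class names are assumptions):

```csharp
using System;

public enum WatchdogAction { None, Warn, SoftRecycle, ForceKill }

public static class MemoryWatchdog
{
    // Mirrors the table above: each rung is max(multiplier, baseline + floor),
    // plus an absolute hard ceiling. Slope detection is tracked separately.
    public static WatchdogAction Evaluate(long baselineBytes, long currentBytes)
    {
        const long Mb = 1024 * 1024;

        if (currentBytes >= 1024 * Mb) return WatchdogAction.ForceKill; // 1 GB ceiling

        long recycleAt = Math.Max(3 * baselineBytes, baselineBytes + 200 * Mb);
        if (currentBytes >= recycleAt) return WatchdogAction.SoftRecycle;

        long warnAt = Math.Max((long)(1.5 * baselineBytes), baselineBytes + 50 * Mb);
        if (currentBytes >= warnAt) return WatchdogAction.Warn;

        return WatchdogAction.None;
    }
}
```

With a 30 MB baseline the warning rung sits at 80 MB (the +50 MB floor), not at the 45 MB the bare multiplier would give — exactly the spurious-trigger case the text calls out.
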
#### Heartbeat between proxy and host

The proxy in the main server sends a heartbeat ping to the driver host **every 2 s** and expects a reply within 1 s. **Three consecutive misses → proxy declares the host dead** (6 s total detection latency), fans out Bad quality on all of that driver's nodes, and asks the supervisor to respawn.

2 s is fast enough that subscribers on a 1 s OPC UA publishing interval see Bad quality within one or two missed publish cycles, but slow enough that GC pauses (typically <500 ms even on bad days) and Windows pipe scheduling jitter don't generate false positives. The 3-miss tolerance absorbs single-cycle noise.

The heartbeat is on a separate named-pipe channel from the data-plane RPCs so a stuck data-plane operation doesn't mask host death. Cadence and miss-count are tunable per-driver-instance in central config.

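The proxy-side liveness tracking reduces to a small counter (a sketch; the ping loop and pipe plumbing are elided):

```csharp
using System;

// 2 s cadence, 1 s reply window, three consecutive misses => host declared
// dead (~6 s detection). Single-cycle noise never trips it.
public sealed class HeartbeatMonitor
{
    private int _consecutiveMisses;
    public bool HostDead { get; private set; }

    // Proxy fans out Bad quality and asks the supervisor to respawn.
    public event Action? OnHostDead;

    public void RecordReply() => _consecutiveMisses = 0;

    public void RecordMiss()
    {
        if (HostDead) return;
        if (++_consecutiveMisses >= 3)
        {
            HostDead = true;
            OnHostDead?.Invoke();
        }
    }
}
```
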
#### Scheduled recycling

Each Tier C host process is recycled on a schedule (default 24 h, configurable per driver type). The recycle is a soft drain → exit → respawn, identical to a watchdog-triggered recycle. Defensive measure against slow leaks that stay below the watchdog thresholds.

### Post-mortem log

Each driver process writes a ring buffer of the last 1000 operations to a memory-mapped file (`%ProgramData%\OtOpcUa\driver-postmortem\<driver>.mmf`):

```
timestamp | handle/connection ID | operation | args summary | return code | duration
```

On graceful shutdown, the ring is flushed to a rotating log. On a hard crash (including AV), the supervisor reads the MMF after the corpse is gone and attaches the tail to the crash event reported on the dashboard. Without this, post-mortem of a Fwlib AV is impossible.

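A minimal sketch of the MMF-backed ring (the fixed-slot layout, per-write flush, and ASCII-width padding are assumptions, not the final on-disk format):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

// Fixed-slot ring over a memory-mapped file so the tail survives a hard crash:
// the OS owns the dirty pages, so an AV in the writer doesn't lose them.
public sealed class PostMortemRing : IDisposable
{
    private const int Slots = 1000, SlotSize = 256;
    private readonly MemoryMappedFile _mmf;
    private readonly MemoryMappedViewAccessor _view;
    private long _next;

    public PostMortemRing(string path)
    {
        _mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Create, mapName: null, capacity: Slots * SlotSize);
        _view = _mmf.CreateViewAccessor();
    }

    public void Record(string line) // "timestamp | handle | operation | ..."
    {
        var slot = Encoding.UTF8.GetBytes(line.PadRight(SlotSize)[..SlotSize]);
        _view.WriteArray((_next++ % Slots) * SlotSize, slot, 0, SlotSize);
    }

    public void Dispose() { _view.Dispose(); _mmf.Dispose(); }
}
```

The supervisor can open the same file read-only after the host dies and walk all 1000 slots to recover the tail.
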
---

## Out-of-Process Driver Pattern (Generalized)

This is the Galaxy.Proxy/Host/Shared layout from `plan.md` §3, lifted to a reusable pattern for every Tier C driver. Two new projects per Tier C driver beyond the in-process driver projects:

```
src/
  ZB.MOM.WW.OtOpcUa.Driver.<Name>.Proxy/    # In main server: implements IDriver, forwards over IPC
  ZB.MOM.WW.OtOpcUa.Driver.<Name>.Host/     # Separate Windows service: actual driver implementation
  ZB.MOM.WW.OtOpcUa.Driver.<Name>.Shared/   # IPC message contracts (.NET Standard 2.0)
```

Common contract for a Tier C host:

- Hosted as a Windows service with `Microsoft.Extensions.Hosting`
- Named-pipe IPC server (named pipes already established for Galaxy in §3)
- MessagePack-serialized contracts in `<Name>.Shared`
- Heartbeat endpoint on a separate pipe from the data plane
- Memory watchdog runs in-process and triggers `Environment.Exit(2)` on threshold breach
- Post-mortem MMF writer initialized on startup
- Standard supervisor protocol: respawn-with-backoff, crash-loop circuit breaker

Common contract for the proxy in the main server:

- Implements `IDriver` + capability interfaces; forwards every call over IPC
- Owns the heartbeat sender and host liveness state
- Fans out Bad quality on all nodes when host is declared dead
- Owns the supervisor that respawns the host process
- Exposes host status (Up / Down / Recycling / CircuitOpen) to the status dashboard

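A Shared-project contract pair might look like the following sketch (type and member names are illustrative assumptions; the principal field mirrors the per-message authorization context required of every Tier C driver):

```csharp
using System;
using MessagePack;

// Illustrative request/reply shapes for ZB.MOM.WW.OtOpcUa.Driver.<Name>.Shared.
[MessagePackObject]
public sealed class ReadRequest
{
    [Key(0)] public long CorrelationId { get; set; }
    [Key(1)] public string NodeAddress { get; set; } = "";
    [Key(2)] public string Principal { get; set; } = ""; // forwarded OPC UA identity, audit only
}

[MessagePackObject]
public sealed class ReadReply
{
    [Key(0)] public long CorrelationId { get; set; }
    [Key(1)] public byte[] Value { get; set; } = Array.Empty<byte>();
    [Key(2)] public uint StatusCode { get; set; }   // OPC UA StatusCode, numeric form
    [Key(3)] public long SourceTimestampUtcTicks { get; set; }
}
```

Keeping the contracts as plain `[MessagePackObject]` DTOs in a .NET Standard 2.0 project lets the .NET 10 proxy and a .NET 4.8 host (Galaxy) share them unchanged.
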
### IPC Security (mandatory for every Tier C driver)

Named pipes default to allowing connections from any local user. Without explicit ACLs, any process on the host machine that knows the pipe name could connect, bypass the OPC UA server's authentication and authorization layers, and issue reads, writes, or alarm acknowledgments directly against the driver host. **This is a real privilege-escalation surface** — a service account with no OPC UA permissions could write field values it should never have access to. Every Tier C driver enforces the following:

1. **Pipe ACL**: the host creates the pipe with a `PipeSecurity` ACL that grants `ReadWrite | Synchronize` only to the OtOpcUa server's service principal SID. All other local users — including LocalSystem and Administrators — are explicitly denied. The ACL is set at pipe-creation time so it's atomic with the pipe being listenable.
2. **Caller identity verification**: on each new pipe connection, the host calls `NamedPipeServerStream.GetImpersonationUserName()` (or impersonates and inspects the token) and verifies the connected client's SID matches the configured server service SID. Mismatches are logged and the connection is dropped before any RPC frame is read.
3. **Per-message authorization context**: every RPC frame includes the operation's authenticated OPC UA principal (forwarded by the Core after it has done its own authn/authz). The host treats this as input only — the driver-level authorization (e.g. "is this principal allowed to write Tune attributes?") is performed by the Core, but the host's own audit log records the principal so post-incident attribution is possible.
4. **No anonymous endpoints**: the heartbeat pipe has the same ACL as the data-plane pipe. There are no "open" pipes a generic client can probe.
5. **Defense-in-depth shared secret**: the supervisor generates a per-host-process random secret at spawn time, passes it to both proxy and host via command-line args (or a parent-pipe handshake), and the host requires it on the first frame of every connection. This is belt-and-suspenders for the case where pipe ACLs are misconfigured during deployment.

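Point 1 reduces to a few lines of `PipeSecurity` plumbing (a sketch — a single allow rule with no other rules gives implicit deny-all; the explicit deny ACEs for LocalSystem/Administrators would be additional `PipeAccessRule` entries with `AccessControlType.Deny`):

```csharp
using System.IO.Pipes;
using System.Security.AccessControl;
using System.Security.Principal;

public static class SecuredPipe
{
    public static NamedPipeServerStream Create(string name, SecurityIdentifier serverServiceSid)
    {
        var acl = new PipeSecurity();
        acl.AddAccessRule(new PipeAccessRule(
            serverServiceSid,
            PipeAccessRights.ReadWrite | PipeAccessRights.Synchronize,
            AccessControlType.Allow));

        // The ACL is supplied at creation time, so it is atomic with the
        // pipe becoming listenable — no window for an unsecured connect.
        return NamedPipeServerStreamAcl.Create(
            name, PipeDirection.InOut, maxNumberOfServerInstances: 1,
            PipeTransmissionMode.Message, PipeOptions.Asynchronous,
            inBufferSize: 0, outBufferSize: 0, pipeSecurity: acl);
    }
}
```

The per-connection check from point 2 then compares `GetImpersonationUserName()` (or the impersonated token's SID) against the configured service account before any frame is read.
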
Configuration: the server service SID is read from `appsettings.json` (`Hosting.ServiceAccountSid`) and validated against the actual running identity at startup. Mismatch fails startup loudly rather than producing a silently-insecure pipe.

For Galaxy, this pattern is retroactively required (the v1 named-pipe IPC predates this contract and must be hardened during the Phase 2 refactor). For FOCAS and any future Tier C driver, IPC security is part of the initial implementation, not an add-on.

### Reusability

For Galaxy, this pattern is already specified. For FOCAS, the same three projects appear in §5 below. Future Tier C escalations (e.g. if libplctag develops a stability problem) reuse the same template.

---

## FOCAS — Deep Dive (Canonical Tier C Worked Example)

FOCAS is the most exposed driver in the v2 plan: a black-box vendor DLL (`Fwlib64.dll`), handle-based API with per-handle thread-affinity, no public stability SLA, and a target market (CNC integrations) where periodic-restart workarounds are common practice. The protections below are not theoretical — every one is a known FOCAS failure mode.

### Project Layout

```
src/
  ZB.MOM.WW.OtOpcUa.Driver.Focas.Proxy/      # .NET 10 x64 in main server
  ZB.MOM.WW.OtOpcUa.Driver.Focas.Host/       # .NET 10 x64 separate Windows service
  ZB.MOM.WW.OtOpcUa.Driver.Focas.Shared/     # .NET Standard 2.0 IPC contracts
  ZB.MOM.WW.OtOpcUa.Driver.Focas.TestStub/   # Stub FOCAS server for dev/CI (per test-data-sources.md)
```

The Host process is the only place `Fwlib64.dll` is loaded. Every concern below is a Host-internal concern.

### Handle Pool

One Fwlib handle per CNC connection. Pool design:

- **`FocasHandle : SafeHandle`** wraps the integer handle returned by `cnc_allclibhndl3`. Finalizer calls `cnc_freelibhndl`. Use of the handle inside the wrapper goes through `DangerousAddRef`/`DangerousRelease` to prevent finalization mid-call.
- **Per-handle lock**. Fwlib is thread-unsafe per handle — one mutex per `FocasHandle`, every API call acquires it. Lock fairness is FIFO so polling and write requests don't starve each other.
- **Pool size of 1 per CNC by default**. FANUC controllers typically allow 4–8 concurrent FOCAS sessions; we don't need parallelism inside one driver-to-CNC link unless profiling shows it. Configurable per device.
- **Health probe**. A background task issues `cnc_sysinfo` against each handle every 30 s. Failure → release the handle, mark device disconnected, let normal reconnect logic re-establish.
- **TTL**. Each handle is forcibly recycled every 6 h (configurable) regardless of health. Defensive against slow Fwlib state corruption.
- **Acquire timeout**. Handle-lock acquisition has a 10 s timeout. Timeout = treat the handle as wedged, kill it, mark device disconnected. (Real FOCAS calls have hung indefinitely in production reports.)

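The per-handle lock plus acquire timeout can be sketched with a `SemaphoreSlim` gate (a sketch — `SemaphoreSlim`'s async waiter queue is only approximately FIFO; strict FIFO fairness would need a custom queue; the exception type is hypothetical):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Every Fwlib call for a given handle goes through that handle's gate,
// so there are never two concurrent native calls on one handle.
public sealed class HandleGate
{
    private readonly SemaphoreSlim _gate = new(1, 1);

    public async Task<T> RunAsync<T>(Func<T> fwlibCall)
    {
        // 10 s acquire timeout: a miss means the current holder is wedged
        // inside Fwlib — caller kills the handle and marks device disconnected.
        if (!await _gate.WaitAsync(TimeSpan.FromSeconds(10)))
            throw new HandleWedgedException();
        try
        {
            return fwlibCall(); // synchronous P/Invoke under the gate
        }
        finally
        {
            _gate.Release();
        }
    }
}

public sealed class HandleWedgedException : Exception { }
```
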
### Thread Serialization

The Host runs a single-threaded scheduler with **handle-affinity dispatch**: each pending operation is tagged with the target handle, and a dedicated worker thread per handle drains its queue. Two consequences:

- Zero parallel calls into Fwlib for the same handle (correctness).
- A single slow CNC's queue can grow without blocking other CNCs' workers (isolation).

The bounded outgoing queue from §3 is per-handle, not process-global, so one stuck CNC can't starve another's queue capacity.

### Memory Watchdog Thresholds (FOCAS-specific)

FOCAS baseline is small (~30–50 MB after discovery on a typical 32-axis machine). Defaults tighter than the global protection — FOCAS workloads should be stable, so any meaningful growth is a leak signal worth acting on early.

| Threshold | Action |
|-----------|--------|
| 1.5× baseline **OR** baseline + 25 MB (whichever larger) | Warning |
| 2× baseline **OR** baseline + 75 MB (whichever larger) | Soft recycle |
| 300 MB absolute hard ceiling | Force-kill |
| Slope > 1 MB/min sustained 15 min | Soft recycle |

Same multiplier + floor + hard-ceiling pattern as the global default; tighter ratios and a lower hard ceiling because the workload profile is well-bounded.

### Recycle Policy

Soft recycle in the Host distinguishes between **operations queued in managed code** (safely cancellable) and **operations currently inside `Fwlib64.dll`** (not safely cancellable — Fwlib calls have no cancellation mechanism, and freeing a handle while a native call is using it is undefined behavior, exactly the AV path the isolation is meant to prevent).

Sequence:

1. Stop accepting new IPC requests (pipe rejects with `BadServerHalted`)
2. Cancel queued (not-yet-dispatched) operations: return `BadCommunicationError` to the proxy
3. Wait up to **10 s grace** for any handle's worker thread to return from its current native call
4. **For handles whose worker thread returned within grace**: call `cnc_freelibhndl` on the handle, dispose `FocasHandle`
5. **For handles still inside a native call after grace**: do **NOT** call `cnc_freelibhndl` — leave the handle wrapper marked Abandoned, skip clean release. The OS reclaims the file descriptors and TCP sockets when the process exits; the CNC's session count decrements on its own connection-timeout (typically 30–60 s)
6. Flush post-mortem ring buffer to disk; record which handles were Abandoned and why
7. **If any handle was Abandoned** → escalate from soft recycle to **hard exit**: `Environment.Exit(2)` rather than `Environment.Exit(0)`. The supervisor logs this as an unclean recycle and applies the crash-loop circuit breaker to it (an Abandoned handle indicates a wedged Fwlib call, which is the kind of state that justifies treating the recycle as "this driver is in trouble").
8. **If all handles released cleanly** → `Environment.Exit(0)` and supervisor respawns normally

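Steps 3–5 reduce to a drain loop over the per-handle workers (a sketch; `SafeHandle` stands in for the `FocasHandle` wrapper described above, and the tuple shape is an assumption):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Threading;

public static class RecycleDrain
{
    // Returns true when every handle released cleanly (Environment.Exit(0));
    // false when any handle had to be Abandoned (escalate to Environment.Exit(2)).
    public static bool TryDrain(IReadOnlyList<(Thread Worker, SafeHandle Handle)> handles)
    {
        var deadline = DateTime.UtcNow + TimeSpan.FromSeconds(10); // shared grace window
        bool clean = true;

        foreach (var (worker, handle) in handles)
        {
            var remaining = deadline - DateTime.UtcNow;
            if (remaining > TimeSpan.Zero && worker.Join(remaining))
                handle.Dispose();  // worker is out of Fwlib: cnc_freelibhndl is safe
            else
                clean = false;     // still mid-call in native code: freeing now is the AV path
        }
        return clean;
    }
}
```

Note the window is shared across handles, not 10 s each — total drain time stays bounded regardless of how many handles are wedged.
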
Recycle triggers (any one):

- Memory watchdog threshold breach
- Scheduled (daily 03:00 local by default)
- Operator command via Admin UI
- Crash-loop circuit breaker fired and reset (manual reset)

Recycle frequency cap: 1/hour. More than that = page operator instead of thrashing.

#### Why we never free a handle with an active native call

Calling `cnc_freelibhndl` on a handle while another thread is mid-call inside `cnc_*` against that same handle is undefined behavior per FANUC's docs (handle is not thread-safe; release races with use). The most likely outcome is an immediate AV inside Fwlib — which is precisely the scenario the entire Tier C isolation is designed to contain. The defensive choice is: if we can't release cleanly within the grace window, accept the handle leak (bounded by process lifetime) and let process exit do what we can't safely do from managed code.

This means a wedged Fwlib call always escalates to process exit. There is no in-process recovery path for a hung native call — the only correct response is to let the process die and have the supervisor start a fresh one.

### What Survives a Recycle

| State | Survives? | How |
|-------|:---------:|-----|
| Subscription set | ✔ | Proxy re-issues subscribe on host startup |
| Last-known values | ✔ (cached in proxy) | Surfaced as Bad quality during recycle window |
| In-flight reads | ✗ | Proxy returns BadCommunicationError; OPC UA client retries |
| In-flight writes | ✗ | Per Polly write-retry policy: NOT auto-retried; OPC UA client decides |
| Handle TTL clocks | ✗ (intentional) | Fresh handles after recycle, fresh TTL |

### Recovery Sequence After Crash

1. Supervisor detects host exit (heartbeat timeout or process exit code)
2. Supervisor reads post-mortem MMF, attaches tail to a crash event
3. Proxy fans out Bad quality on all FOCAS device nodes
4. Backoff before respawn: 5 s → 15 s → 60 s (capped)
5. Spawn new Host process
6. Host re-discovers (functional structure is fixed; PMC/macro discovery from central config), re-subscribes
7. Quality returns to Good as values arrive
8. **3 crashes in 5 minutes → crash-loop circuit opens.** Supervisor stops respawning, leaves Bad quality in place, raises operator alert. Manual reset required via Admin UI.

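The backoff in step 4 is a simple capped ladder (a sketch; the class name is an assumption):

```csharp
using System;

public static class Respawn
{
    // 5 s → 15 s → 60 s, capped at the third attempt. The crash-loop breaker
    // (step 8) takes over once 3 crashes land inside a 5-minute window.
    public static TimeSpan Delay(int attempt) => attempt switch
    {
        <= 1 => TimeSpan.FromSeconds(5),
        2    => TimeSpan.FromSeconds(15),
        _    => TimeSpan.FromSeconds(60),
    };
}
```
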
### Post-Mortem Log Contents (FOCAS-specific)

In addition to the generic last-N-operations ring, the FOCAS Host post-mortem captures:

- Active handle pool snapshot (handle ID, target IP, age, last-call timestamp, consecutive failures)
- Handle health probe history (last 100 results)
- Memory samples (last 60 — 30 minutes at 30 s cadence)
- Recycle history (last 10 recycles with trigger reason)
- Last 50 IPC requests received (for correlating crashes to specific operator actions)

This makes post-mortem of an `AccessViolationException` actionable — without it, a Fwlib AV is essentially undebuggable.

### Test Coverage for FOCAS Stability

There are **two distinct test surfaces** here, and an earlier draft conflated them. Splitting them honestly:

#### Surface 1 — Functional protocol coverage via the TCP stub

The `Driver.Focas.TestStub` (per `test-data-sources.md` §6) is a TCP listener that mimics a CNC over the FOCAS wire protocol. It can exercise everything that travels over the network:

- **Inject network slow** — stub adds latency on FOCAS responses, exercising the bounded queue, Polly timeout, and handle-lock acquire timeout
- **Inject network hang** — stub stops responding mid-call (TCP keeps the socket open but never writes), exercising the per-call grace window and the wedged-handle → hard-exit escalation
- **Inject protocol error** — stub returns FOCAS error codes (`EW_HANDLE`, `EW_SOCKET`, etc.) at chosen call boundaries, exercising error-code → StatusCode mapping and Polly retry policies
- **Inject disconnect** — stub closes the TCP socket, exercising the reconnect path and Bad-quality fan-out

This covers the **majority** of stability paths because most FOCAS failure modes manifest as the network behaving badly — the Fwlib library itself tends to be stable when its CNC behaves; the trouble is that real CNCs misbehave often.

#### Surface 2 — Native fault injection via a separate shim

Native AVs and native handle leaks **cannot** be triggered through a TCP stub — they live inside `Fwlib64.dll`, on the host side of the P/Invoke boundary. Faking them requires a separate mechanism:

- **`Driver.Focas.FaultShim` project** — a small native DLL named `Fwlib64.dll` (test-only build configuration) that exports the same FOCAS API surface but, instead of calling FANUC's library, performs configurable fault behaviors: deliberately raise an AV at a chosen call site, return success but never release allocated buffers (leak), return success on `cnc_freelibhndl` but keep the handle table populated (orphan handle), etc.
- **Activated by binding redirect / DLL search path order** in the Host's test fixture only; production builds load FANUC's real `Fwlib64.dll`.
- **Tested paths**: supervisor respawn after AV, post-mortem MMF readability after hard crash, watchdog → recycle path on simulated leaks, Abandoned-handle path when the shim simulates a wedged native call.

The Host code is unchanged between the two surfaces — it just experiences different symptoms depending on which DLL it loaded. Honest framing of test coverage: **the TCP stub covers ~80% of real-world FOCAS failures (network/protocol); the FaultShim covers the remaining ~20% (native crashes/leaks). Hardware/manual testing on a real CNC remains the only validation path for vendor-specific Fwlib quirks that neither stub can predict.**

---

## Galaxy — Deep Dive (Tier C, COM/STA Worked Example)

Galaxy is the second Tier C driver and the only one bound to .NET 4.8 x86 (MXAccess COM has no 64-bit variant). Unlike FOCAS, Galaxy carries 12+ years of v1 production history, so the failure surface is well-mapped — most of the protections below close known incident classes rather than guarding against speculative ones. The four findings closed in commit `c76ab8f` (stability-review 2026-04-13) are concrete examples: a failed runtime probe subscription leaving a phantom entry that flipped Tick() to Stopped and fanned out false BadOutOfService quality, sync-over-async on the OPC UA stack thread, fire-and-forget alarm tasks racing shutdown.

### Project Layout

```
src/
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/    # .NET 10 x64 in main server
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/     # .NET 4.8 x86 separate Windows service
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/   # .NET Standard 2.0 IPC contracts
```

The Host is the only place MXAccess COM objects, the Galaxy SQL Server connection, and the optional Wonderware Historian SDK are loaded. Bitness mismatch with the .NET 10 x64 main server is the original isolation reason; Tier C stability isolation is the layered reason.

### STA Thread + Win32 Message Pump (the foundation)

Every MXAccess COM call must execute on a dedicated STA thread that runs a `GetMessage`/`DispatchMessage` loop, because MXAccess delivers `OnDataChange` / `OnWriteComplete` / advisory callbacks via window messages. This is non-negotiable — calls from the wrong apartment fail or, worse, cross-thread COM marshaling silently corrupts state.

- **One STA thread per Host process** owns all `LMXProxyServer` instances and all advisory subscriptions
- **Work item dispatch** uses `PostThreadMessage(WM_APP)` to marshal incoming IPC requests onto the STA thread
- **Pump shutdown** posts `WM_QUIT` only after all outstanding work items have completed, preventing torn-down COM proxies from receiving callbacks
- **Pump health** is itself probed: the proxy sends a no-op work item every 10 s and expects a round-trip; missing round-trip = pump wedged = trigger recycle

The pattern is the same as the v1 `StaComThread` in `ZB.MOM.WW.LmxProxy.Host` — proven at this point and not a place for invention.

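The skeleton of that pump, as a sketch (the `MSG` layout is abbreviated, and the startup race on `_threadId` plus the 10 s health probe are elided — this illustrates the shape of `StaComThread`, not the production class):

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;
using System.Threading;

public sealed class StaPump
{
    private const uint WM_APP = 0x8000, WM_QUIT = 0x0012;
    private readonly ConcurrentQueue<Action> _work = new ConcurrentQueue<Action>();
    private uint _threadId;

    [StructLayout(LayoutKind.Sequential)]
    private struct MSG { public IntPtr hwnd; public uint message; public IntPtr wParam, lParam; public uint time; public long pt; }

    [DllImport("user32.dll")] private static extern int GetMessage(out MSG msg, IntPtr hwnd, uint min, uint max);
    [DllImport("user32.dll")] private static extern bool TranslateMessage(ref MSG msg);
    [DllImport("user32.dll")] private static extern IntPtr DispatchMessage(ref MSG msg);
    [DllImport("user32.dll")] private static extern bool PostThreadMessage(uint threadId, uint msg, IntPtr w, IntPtr l);
    [DllImport("kernel32.dll")] private static extern uint GetCurrentThreadId();

    public void Start()
    {
        var thread = new Thread(() =>
        {
            _threadId = GetCurrentThreadId();
            // GetMessage returns 0 on WM_QUIT, ending the pump.
            while (GetMessage(out var msg, IntPtr.Zero, 0, 0) > 0)
            {
                if (msg.message == WM_APP)
                    while (_work.TryDequeue(out var item)) item(); // MXAccess calls run here
                TranslateMessage(ref msg);
                DispatchMessage(ref msg); // delivers MXAccess callback window messages
            }
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
    }

    // Marshals an IPC request onto the STA thread and wakes the pump.
    public void Post(Action work)
    {
        _work.Enqueue(work);
        PostThreadMessage(_threadId, WM_APP, IntPtr.Zero, IntPtr.Zero);
    }

    public void Shutdown() => PostThreadMessage(_threadId, WM_QUIT, IntPtr.Zero, IntPtr.Zero);
}
```
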
### COM Object Lifetime

MXAccess COM objects (`LMXProxyServer` connection handles, item handles) accumulate native references that the GC does not track. Leaks here are silent until the Host runs out of handles or the Galaxy refuses new advisory subscriptions.

- **`MxAccessHandle : SafeHandle`** wraps each `LMXProxyServer` connection. Finalizer calls `Marshal.ReleaseComObject` until refcount = 0, then `UnregisterProxy`.
- **Subscription handles** wrapped per item; `RemoveAdvise` + `RemoveItem` on dispose, in that order (event handlers must be unwired before the item handle goes away — undefined behavior otherwise).
- **`CriticalFinalizerObject`** for handle wrappers so finalizer ordering during AppDomain unload is predictable.
- **Pre-shutdown drain**: on Host stop, Proxy first cancels all subscriptions cleanly via the STA pump (`AdviseSupervisory(stop)` → `RemoveItem` → `UnregisterProxy`). Only then does the Host exit. Fire-and-forget shutdown is a known v1 bug class — the four 2026-04-13 stability findings include "alarm auto-subscribe and transferred-subscription restore no longer race shutdown as untracked fire-and-forget tasks."

### Subscription State and Reconnect

Galaxy's MXAccess advisory subscriptions are stateful — once established, Galaxy pushes value updates until `RemoveAdvise`. Network disconnects, Galaxy redeployments, and Platform/AppEngine restarts all break the subscription stream and require replay.

- **Subscription registry** in the Host: every `AddItem` + `AdviseSupervisory` is recorded so reconnect can replay
- **Reconnect trigger**: connection-health probe (see below) detects loss → marks subscriptions Disconnected → fans out Bad quality via Proxy → enters reconnect loop
- **Replay order**: register proxy → re-add items → re-advise. Order matters; re-advising an item that was never re-added wedges silently.
- **Quality fan-out** during reconnect window respects host scope — per the same 2026-04-13 findings, a stopped DevAppEngine must not let a recovering DevPlatform's startup callback wipe Bad quality on the still-stopped engine's variables. **Cross-host quality clear is gated on host-status check.**
- **Symbol-version-changed equivalent**: Galaxy `time_of_last_deploy` change → driver invokes `IRediscoverable` → rebuild affected subtree only (per Galaxy platform scope filter, commit `bc282b6`)

### Connection Health Probe (`GalaxyRuntimeProbeManager`)

A dedicated probe subscribes to a synthetic per-host runtime-status attribute (Platform/Engine ScanState). Probe state drives:

- **Bad-quality fan-out** when a host (Platform or AppEngine) reports Stopped
- **Quality restoration** when state transitions back to Running, scoped to that host's subtree only (not Galaxy-wide — closes the 2026-04-13 finding about a Running→Unknown→Running callback wiping sibling state)
- **Probe failure handling**: a failed probe subscription must NOT leave a phantom entry that Tick() flips to Stopped — phantom probes are an accidental Bad-quality source. Closed in `c76ab8f`.

### Memory Watchdog Thresholds (Galaxy-specific)

Galaxy baseline depends heavily on Galaxy size. The platform scope filter (commit `bc282b6`) reduced a dev Galaxy's footprint from 49 objects / 4206 attributes (full Galaxy) to 3 objects / 386 attributes (local subtree). Real production Galaxies vary from a few hundred to tens of thousands of attributes.

| Threshold | Action |
|-----------|--------|
| 1.5× baseline (per-instance, after warm-up) | Warning |
| 2× baseline **OR** baseline + 200 MB (whichever larger) | Soft recycle |
| 1.5 GB absolute hard ceiling | Force-kill |
| Slope > 5 MB/min sustained 30 min | Soft recycle |

Higher hard ceiling than FOCAS (1.5 GB vs 300 MB) because legitimate Galaxy baselines are larger. Same multiplier-with-floor pattern. The slope threshold is more permissive (5 MB/min vs 1 MB/min) because Galaxy's address-space rebuild on redeploy can transiently allocate large amounts.
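The multiplier-with-floor rule in the table can be sketched as a single evaluation function. This is a language-agnostic illustration in Python (the function name and parameter defaults are illustrative; the shipped implementation is .NET and reads these values from central config):

```python
# Sketch of the hybrid watchdog evaluation: multiplier thresholds with an
# absolute floor for recycle and an absolute hard ceiling that always wins.
def watchdog_action(rss_mb: float, baseline_mb: float,
                    warn_mult: float = 1.5, recycle_mult: float = 2.0,
                    recycle_floor_mb: float = 200.0,
                    hard_ceiling_mb: float = 1536.0) -> str:
    """Return the action for the current working-set size (slope detection is separate)."""
    if rss_mb >= hard_ceiling_mb:          # absolute ceiling always force-kills
        return "force-kill"
    # hybrid recycle threshold: whichever of multiplier / baseline+floor is larger
    recycle_at = max(recycle_mult * baseline_mb, baseline_mb + recycle_floor_mb)
    if rss_mb >= recycle_at:
        return "soft-recycle"
    if rss_mb >= warn_mult * baseline_mb:
        return "warning"
    return "ok"

# A tiny 30 MB baseline does NOT recycle at 60 MB (2×) — the +200 MB floor governs:
print(watchdog_action(40, 30))     # "ok"       (warn threshold is 45 MB)
print(watchdog_action(250, 30))    # "soft-recycle" (max(60, 230) = 230 MB)
print(watchdog_action(1600, 400))  # "force-kill"   (hard ceiling)
```

Note how the floor keeps small-baseline hosts (FOCAS-sized) from recycling on trivial growth, while the multiplier governs large Galaxy baselines.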
### Recycle Policy (COM-specific)

Soft recycle distinguishes between **work items queued for the STA pump** (cancellable before dispatch) and **MXAccess calls in flight on the STA thread** (not cancellable — COM has no abort).

1. Stop accepting new IPC requests
2. Cancel queued (not-yet-dispatched) STA work items
3. Wait up to **15 s grace** for the in-flight STA call to return (longer than FOCAS because some MXAccess calls — bulk attribute reads, large hierarchy traversals — legitimately take seconds)
4. **For each subscription**: post `RemoveAdvise` → `RemoveItem` → release item handle, in that order, on the STA thread
5. **For the proxy connection**: post `UnregisterProxy` → `Marshal.ReleaseComObject` until refcount = 0 → release `MxAccessHandle`
6. **STA pump shutdown**: post `WM_QUIT` only after all of the above have completed
7. Flush post-mortem ring buffer
8. **If STA pump did not exit within 5 s** of `WM_QUIT` → escalate to `Environment.Exit(2)`. A wedged COM call cannot be recovered cleanly; same logic as the FOCAS Abandoned-handle escalation.
9. **If clean** → `Environment.Exit(0)`, supervisor respawns

Recycle frequency cap is the same as FOCAS (1/hour). Scheduled recycle defaults to 24 h.
### What Survives a Galaxy Recycle

| State | Survives? | How |
|-------|:---------:|-----|
| Address space (built from Galaxy DB) | ✔ | Proxy caches the last built tree; rebuild from DB on host startup |
| Subscription set | ✔ | Proxy re-issues subscribe on host startup |
| Last-known values | ✔ (in proxy cache) | Surfaced as Bad quality during recycle window |
| Alarm state | partial | Active alarm registry replayed; AlarmTracking re-subscribes |
| In-flight reads | ✗ | BadCommunicationError; client retries |
| In-flight writes | ✗ | Per Polly write-retry policy: not auto-retried |
| Historian subscriptions | ✗ | Re-established on next HistoryRead |
| `time_of_last_deploy` watermark | ✔ | Cached in proxy; resync on startup avoids spurious full rebuild |
### Recovery Sequence After Crash

Same supervisor protocol as FOCAS, with one Galaxy-specific addition:

1. Supervisor detects host exit
2. Reads post-mortem MMF, attaches tail to crash event
3. Proxy fans out Bad quality on **all Galaxy nodes scoped to the lost host's platform** (not necessarily every Galaxy node — multi-host respect is per the 2026-04-13 findings)
4. Backoff: 5 s → 15 s → 60 s
5. Spawn new Host
6. Host checks `time_of_last_deploy`; if unchanged from cached watermark, skip full DB rediscovery and reuse cached hierarchy (faster recovery for the common case where the crash was unrelated to a redeploy)
7. Re-register MXAccess proxy, re-add items, re-advise
8. Quality returns to Good as values arrive
9. **3 crashes in 5 minutes → crash-loop circuit opens** (same escalating-cooldown rules as FOCAS)
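Steps 4 and 9 above are mechanical enough to sketch. A minimal Python illustration of the supervisor's respawn backoff and the crash-loop window check (names are illustrative; the real supervisor is .NET):

```python
from collections import deque

BACKOFF_S = [5, 15, 60]  # respawn backoff per step 4, capped at the last value

def next_backoff(attempt: int) -> int:
    """Seconds to wait before the attempt-th respawn (0-based)."""
    return BACKOFF_S[min(attempt, len(BACKOFF_S) - 1)]

class CrashLoopBreaker:
    """Opens after 3 crashes within a 5-minute sliding window (step 9)."""
    def __init__(self, max_crashes: int = 3, window_s: float = 300.0):
        self.max_crashes, self.window_s = max_crashes, window_s
        self.crashes = deque()

    def record_crash(self, now_s: float) -> bool:
        """Record a crash timestamp; return True when the circuit opens."""
        self.crashes.append(now_s)
        # drop crashes that have aged out of the sliding window
        while self.crashes and now_s - self.crashes[0] > self.window_s:
            self.crashes.popleft()
        return len(self.crashes) >= self.max_crashes

b = CrashLoopBreaker()
b.record_crash(0.0)          # first crash — keep respawning
b.record_crash(100.0)        # second crash — keep respawning
print(b.record_crash(200.0)) # True — third crash inside 5 min, circuit opens
```

Once the circuit opens, the escalating-cooldown rules in the Resolved Defaults section govern when respawn resumes.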
### Post-Mortem Log Contents (Galaxy-specific)

In addition to the universal last-N-operations ring:

- **STA pump state snapshot**: thread ID, last-message-dispatched timestamp, queue depth
- **Active subscription count** + breakdown by host (Platform/AppEngine)
- **`MxAccessHandle` refcount snapshot** for every live handle
- **Last 100 probe results** with host status transitions
- **Last redeploy event** timestamp (from `time_of_last_deploy` polling)
- **Galaxy DB connection state** (last query duration, last error)
- **Historian connection state** if HDA enabled
### Test Coverage for Galaxy Stability

Galaxy is the easiest of the Tier C drivers to test because the dev machine already has a real Galaxy. Three test surfaces:

1. **Real Galaxy on dev machine** (per `test-data-sources.md`) — the primary integration test environment. Covers MXAccess wire behavior, subscription replay, redeploy-triggered rediscovery, host status transitions.
2. **`Driver.Galaxy.FaultShim`** — analogous to the FOCAS FaultShim, a test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding. Injects: COM exception at chosen call site, subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning unexpected refcount, STA pump deadlock simulation.
3. **v1 IntegrationTests parity suite** — the existing v1 test suite must pass against the v2 Galaxy driver before move-behind-IPC is considered complete (decision #56). This is the primary regression net.

The 2026-04-13 stability findings should each become a regression test in the parity suite — phantom probe subscription, cross-host quality clear, sync-over-async on stack thread, fire-and-forget shutdown race. Closing those bugs without test coverage is how they come back.
---

## Decision Additions for `plan.md`

Proposed new entries for the Decision Log (numbering continues from #62):

| # | Decision | Rationale |
|---|----------|-----------|
| 63 | Driver stability tier model (A/B/C) | Drivers vary in failure profile; tier dictates hosting and protection level. See `driver-stability.md` |
| 64 | FOCAS is Tier C — out-of-process Windows service | Fwlib64.dll is black-box, AV uncatchable, handle-affinity, no SLA. Same Proxy/Host/Shared pattern as Galaxy |
| 65 | Cross-cutting protections mandatory in all tiers | SafeHandle, memory watchdog, bounded queues, scheduled recycle, post-mortem log apply to every driver process |
| 66 | Out-of-process driver pattern is reusable | Galaxy.Proxy/Host/Shared template generalizes to any Tier C driver; FOCAS is the second user |
| 67 | Tier B drivers may escalate to Tier C on production evidence | libplctag, S7netplus, TwinCAT.Ads start in-process; promote if leaks or crashes appear in production |
| 68 | Crash-loop circuit breaker stops respawn after 3 crashes/5 min | Prevents thrashing; requires manual reset to surface an operator-actionable problem |
| 69 | Post-mortem log via memory-mapped file | Survives hard process death (including AV); supervisor reads after corpse is gone; only viable post-mortem path for native crashes |
---

## Resolved Defaults

The three open questions from the initial draft are resolved as follows. All values are tunable per-driver-instance in central config; the defaults are what ships out of the box.

### Watchdog thresholds — hybrid multiplier + absolute floor + hard ceiling

Pure multipliers misfire on tiny baselines (a 30 MB FOCAS Host shouldn't recycle at 45 MB). Pure absolute thresholds in MB don't scale across deployment sizes. Hybrid: trigger on whichever threshold is reached first — `max(N× baseline, baseline + floor MB)` for warn/recycle, plus an absolute hard ceiling that always force-kills. Slope detection stays orthogonal — it catches slow leaks well below any threshold.
### Crash-loop reset — auto-reset with escalating cooldown, sticky alert, 24 h manual floor

Manual-only reset is too rigid for unattended plants (CNC sites don't have operators on console 24/7). Pure auto-reset after a fixed cooldown defeats the purpose of the breaker by letting it silently retry forever. Escalating cooldown (1 h → 4 h → 24 h-with-manual-reset) auto-recovers from transient problems while ensuring persistent problems eventually demand human attention. Sticky alerts that don't auto-clear keep the trail visible regardless.
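The escalation ladder can be sketched in a few lines — a language-agnostic Python illustration (function name is illustrative; the real implementation is .NET and the tiers come from central config):

```python
# Escalating cooldown tiers: 1 h, then 4 h, then 24 h; the last tier also
# requires a manual reset before respawn resumes.
COOLDOWNS_H = [1, 4, 24]

def cooldown_hours(open_count: int) -> tuple:
    """Return (cooldown_hours, manual_reset_required) for the
    open_count-th time the crash-loop circuit has opened (1-based)."""
    tier = min(open_count, len(COOLDOWNS_H)) - 1
    return COOLDOWNS_H[tier], tier == len(COOLDOWNS_H) - 1

print(cooldown_hours(1))  # (1, False)  — first opening: auto-reset after 1 h
print(cooldown_hours(3))  # (24, True)  — third and later: 24 h AND manual reset
```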
### Heartbeat cadence — 2 s with 3-miss tolerance

5 s × 3 misses = 15 s detection is too slow against typical 1 s OPC UA publishing intervals (subscribers see Bad quality 15+ samples late). 1 s × 3 = 3 s is plausible but raises the false-positive rate from GC pauses and Windows pipe scheduling. 2 s × 3 = 6 s is the sweet spot: subscribers see Bad quality within one or two missed publish cycles, and GC pauses (~500 ms typical) and pipe jitter stay well inside the tolerance budget.
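The trade-off above is plain arithmetic — worst-case detection latency is roughly interval × miss tolerance. A trivial sketch of the three candidates:

```python
def detection_latency_s(interval_s: float, missed_tolerance: int) -> float:
    """Worst-case time to declare a host dead: tolerance consecutive misses."""
    return interval_s * missed_tolerance

print(detection_latency_s(5, 3))  # 15.0 — too slow vs 1 s publish intervals
print(detection_latency_s(1, 3))  # 3.0  — fast but GC-pause-sensitive
print(detection_latency_s(2, 3))  # 6.0  — the chosen default
```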
746
docs/v2/plan.md
Normal file
@@ -0,0 +1,746 @@
# Next Phase Plan — OtOpcUa v2: Multi-Driver Architecture

> **Status**: DRAFT — brainstorming in progress, do NOT execute until explicitly approved.
>
> **Branch**: `v2`
> **Created**: 2026-04-16

## Vision

Rename from **LmxOpcUa** to **OtOpcUa** and evolve from a single-protocol OPC UA server (Galaxy/MXAccess only) into a **multi-driver OPC UA server** where:

- The **common core** owns the OPC UA server, address space management, session/security/subscription machinery, and client-facing concerns.
- **Driver modules** are pluggable backends that each know how to connect to a specific data source, discover its tags/hierarchy, and shuttle live data back through the core to OPC UA clients.
- Drivers implement **composable capability interfaces** — a driver only implements what it supports (e.g. subscriptions, alarms, history).
- The existing Galaxy/MXAccess integration becomes the **first driver module**, proving the abstraction works against real production use.

---
## Target Drivers

| Driver | Protocol | Capability Profile | Notes |
|--------|----------|--------------------|-------|
| **Galaxy** | MXAccess COM + Galaxy DB | Read, Write, Subscribe, Alarms, HDA | Existing v1 logic, out-of-process (.NET 4.8 x86) |
| **Modbus TCP** | MB-TCP | Read, Write, Subscribe (polled) | Flat register model, config-driven tag map. Also covers DL205 via `AddressFormat=DL205` (octal translation) |
| **AB CIP** | EtherNet/IP CIP | Read, Write, Subscribe (polled) | ControlLogix/CompactLogix, symbolic tag addressing |
| **AB Legacy** | EtherNet/IP PCCC | Read, Write, Subscribe (polled) | SLC 500/MicroLogix, file-based addressing |
| **Siemens S7** | S7comm (ISO-on-TCP) | Read, Write, Subscribe (polled) | S7-300/400/1200/1500, DB/M/I/Q addressing |
| **TwinCAT** | ADS (Beckhoff) | Read, Write, Subscribe (native) | Symbol-based, native ADS notifications |
| **FOCAS** | FOCAS2 (FANUC CNC) | Read, Write, Subscribe (polled) | CNC data model (axes, spindle, PMC, macros) |
| **OPC UA Client** | OPC UA | Read, Write, Subscribe, Alarms, HDA | Gateway/aggregation — proxy a remote server |
### Driver Characteristics That Shape the Interface

| Concern | Galaxy | Modbus TCP | AB CIP | AB Legacy | S7 | TwinCAT | FOCAS | OPC UA Client |
|---------|--------|------------|--------|-----------|-----|---------|-------|---------------|
| Tag discovery | DB query | Config DB | Config DB | Config DB | Config DB | Symbol upload | CNC query + Config DB | Browse remote |
| Hierarchy | Rich tree | Flat (user groups) | Flat or program-scoped | Flat (file-based) | Flat (DB/area) | Symbol tree | Functional (axes/spindle/PMC) | Mirror remote |
| Data types | mx_data_type | Raw registers (user-typed) | CIP typed | File-typed (N=INT16, F=FLOAT) | S7 typed | IEC 61131-3 | Scaled integers + structs | Full OPC UA |
| Native subscriptions | Yes (MXAccess) | No (polled) | No (polled) | No (polled) | No (polled) | **Yes (ADS notifications)** | No (polled) | Yes (OPC UA) |
| Alarms | Yes | No | No | No | No | Possible (ADS state) | Yes (CNC alarms) | Yes (A&C) |
| History | Yes (Historian) | No | No | No | No | No | No | Yes (HistoryRead) |

**Note:** AutomationDirect DL205 PLCs are supported by the Modbus TCP driver via `AddressFormat=DL205` (octal V/X/Y/C/T/CT address translation over the H2-ECOM100 module, port 502). No separate driver needed.
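The heart of the `AddressFormat=DL205` mode is an octal→decimal step: DL205 addresses like `V2000` are written in octal, so they must be parsed base-8 before offsetting into the Modbus register space. A minimal Python sketch — **the `40001` base offset is an assumption for illustration only**; the real per-memory-type register maps come from the H2-ECOM100 documentation:

```python
# Illustrative octal translation for DL205 V-memory addresses.
# ASSUMPTION: a simple 4xxxx holding-register base — real mapping tables
# per memory type (V/X/Y/C/T/CT) must be consulted; only the octal→decimal
# parse step is the point here.
def dl205_v_to_modbus(v_address_octal: str, base: int = 40001) -> int:
    """Translate an octal V-memory address (e.g. 'V2000') to a Modbus 4xxxx reference."""
    offset = int(v_address_octal.lstrip("Vv"), 8)  # DL205 addresses are octal
    return base + offset

print(dl205_v_to_modbus("V2000"))  # 41025 — because 0o2000 = 1024 decimal
print(dl205_v_to_modbus("V0"))     # 40001
```

Parsing the address in base 10 by mistake (`V2000` → offset 2000 instead of 1024) is exactly the class of bug this mode exists to prevent.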
---

## Architecture — Key Decisions & Open Questions
### 1. Common Core Boundary

**Core owns:**

- OPC UA server lifecycle (startup, shutdown, session management)
- Security (transport profiles, authentication, authorization)
- Address space tree management (add/remove/update nodes)
- Subscription engine (create, publish, transfer)
- Status dashboard / health reporting
- Redundancy
- Configuration framework
- Namespace allocation per driver

**Driver owns:**

- Data source connection management
- Tag/hierarchy discovery
- Data type mapping (driver types → OPC UA types)
- Read/write translation
- Alarm sourcing (if supported)
- Historical data access (if supported)

**Decided:**

- Each driver instance manages its own polling internally — the core does not provide a shared poll scheduler.
- Multiple instances of the same driver type are supported (e.g. two Modbus TCP drivers for different device groups).
- One namespace index per driver instance (each instance gets its own `NamespaceUri`).
- Drivers register nodes via a **builder/context API** (`IAddressSpaceBuilder`) provided by the core. Core owns the tree; the driver streams `AddFolder` / `AddVariable` calls as it discovers nodes. Supports incremental/large address spaces without forcing the driver to buffer the whole tree.

---
### 2. Driver Capability Interfaces

Composable — a driver implements only what it supports:

```
IDriver — required: lifecycle, metadata, health
  ├── ITagDiscovery    — discover tags/hierarchy from the backend
  ├── IReadable        — on-demand read
  ├── IWritable        — on-demand write
  ├── ISubscribable    — data change subscriptions (native or driver-managed polling)
  ├── IAlarmSource     — alarm events and acknowledgment
  └── IHistoryProvider — historical data reads
```

Note: `ISubscribable` covers both native subscriptions (Galaxy MXAccess advisory, OPC UA monitored items) and driver-internal polled subscriptions (Modbus, AB CIP). The driver owns its polling loop — the core just sees `OnDataChange` callbacks regardless of mechanism.
**Capability matrix:**

| Interface | Galaxy | Modbus TCP | AB CIP | AB Legacy | S7 | TwinCAT | FOCAS | OPC UA Client |
|-----------|--------|------------|--------|-----------|-----|---------|-------|---------------|
| IDriver | Y | Y | Y | Y | Y | Y | Y | Y |
| ITagDiscovery | Y | Y (config DB) | Y (config DB) | Y (config DB) | Y (config DB) | Y (symbol upload) | Y (built-in + config DB) | Y (browse) |
| IReadable | Y | Y | Y | Y | Y | Y | Y | Y |
| IWritable | Y | Y | Y | Y | Y | Y | Y (limited) | Y |
| ISubscribable | Y (native) | Y (polled) | Y (polled) | Y (polled) | Y (polled) | Y (native ADS) | Y (polled) | Y (native) |
| IAlarmSource | Y | — | — | — | — | — | Y (CNC alarms) | Y |
| IHistoryProvider | Y | — | — | — | — | — | — | Y |

**Decided:**

- Data change callback uses shared data models (`DataValue` with value, `StatusCode` quality, timestamp). Every driver maps to the same OPC UA `StatusCode` space — drivers define which quality codes they can produce, but the model is universal.
- Driver isolation: each driver instance runs independently. A crash or disconnect in one driver sets Bad quality on its own nodes only — no impact on other driver instances. The core must catch and contain driver failures.
### Resilience — Polly

**Decided: Use Polly v8+ (`Microsoft.Extensions.Resilience`) as the resilience layer across all drivers and the configuration subsystem.**

Polly provides composable resilience pipelines rather than hand-rolled retry/circuit-breaker logic. Each driver instance (and each device within a driver) gets its own pipeline so failures are isolated at the finest practical level.

**Where Polly applies:**

| Component | Pipeline | Strategies | Purpose |
|-----------|----------|------------|---------|
| **Driver device connection** | Per device | Retry (exp. backoff) + CircuitBreaker + Timeout | Reconnect to offline PLC/device, stop hammering after N failures, bound connection attempts |
| **Driver read ops** | Per device | Timeout + Retry | Reads are idempotent — retry transient failures freely |
| **Driver write ops** | Per device | Timeout **only** by default | Writes are NOT auto-retried — a timeout may fire after the device already accepted the command; replaying non-idempotent field actions (pulses, acks, recipe steps, counter increments) can cause duplicate operations |
| **Driver poll loop** | Per device | CircuitBreaker | When a device is consistently unreachable, open circuit and probe periodically instead of polling at full rate |
| **Galaxy IPC (Proxy → Host)** | Per proxy | Retry (backoff) + CircuitBreaker | Reconnect when Galaxy Host service restarts, stop retrying if Host is down for an extended period |
| **Config DB polling** | Singleton | Retry (backoff) + Fallback (use cache) | Central DB unreachable → fall back to LiteDB cache, keep retrying in background |
| **Config DB startup** | Singleton | Retry (backoff) + Fallback (use cache) | If DB is briefly unavailable at startup, retry before falling back to cache |
**How it integrates:**

```
IHostedService (per driver instance)
  ├── Per-device ReadPipeline
  │     ├── Timeout        — bound how long a read can take
  │     ├── Retry          — transient failure recovery with jitter (SAFE: reads are idempotent)
  │     └── CircuitBreaker — stop polling dead devices, probe periodically
  │           on break: set device tags to Bad quality
  │           on reset: resume normal polling, restore quality
  │
  └── Per-device WritePipeline
        ├── Timeout — bound how long a write can take
        └── (NO retry by default) — opt-in per tag via TagConfig.WriteIdempotent = true
              OR via a CAS (compare-and-set) wrapper that verifies
              the device state before each retry attempt

ConfigurationService
  └── ResiliencePipeline
        ├── Retry    — transient DB connectivity issues
        └── Fallback — serve from LiteDB cache on sustained outage
```
**Write-retry policy (per the adversarial review, finding #1):**

- Default: **no automatic retry on writes.** A timeout bubbles up as a write failure; the OPC UA client decides whether to re-issue.
- Opt-in per tag via `TagConfig.WriteIdempotent = true` — an explicit assertion by the configurer that replaying the same write has no side effect (e.g. setpoint overwrite, steady-state mode selection).
- Opt-in via CAS (compare-and-set): before retrying, read the current value; retry only if the device still holds the pre-write value. Drivers whose protocol supports atomic read-modify-write (e.g. Modbus mask-write, OPC UA writes with expected-value) can plug this in.
- Documented **never-retry** cases: edge-triggered acks, pulse outputs, monotonic counters, recipe-step advances, alarm acknowledgments, any "fire-and-forget" command register.
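The CAS opt-in can be sketched concretely. A language-agnostic Python illustration (the real wrapper is a Polly strategy in .NET; `read`/`write` here are stand-ins for driver device I/O):

```python
# Sketch of the CAS (compare-and-set) write-retry wrapper: retry a timed-out
# write only while the device still holds the pre-write value.
def cas_write(read, write, new_value, max_attempts: int = 3) -> bool:
    expected = read()                  # value observed before the first attempt
    for _ in range(max_attempts):
        try:
            write(new_value)
            return True
        except TimeoutError:
            if read() != expected:     # device state moved — maybe OUR write landed,
                return False           # maybe someone else's — either way, never replay
    return False

# Case 1: the timeout really did lose the write — retry is safe and succeeds.
class FlakyRegister:
    def __init__(self, value): self.value, self.calls = value, 0
    def read(self): return self.value
    def write(self, v):
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError()       # timed out, write NOT applied
        self.value = v

r = FlakyRegister(10)
print(cas_write(r.read, r.write, 42), r.value)  # True 42

# Case 2: the write landed but the ACK timed out — CAS refuses to replay it.
class AppliedButTimedOut:
    def __init__(self, value): self.value, self.first = value, True
    def read(self): return self.value
    def write(self, v):
        self.value = v                 # write applied...
        if self.first:
            self.first = False
            raise TimeoutError()       # ...but the response was lost

s = AppliedButTimedOut(10)
print(cas_write(s.read, s.write, 42), s.value)  # False 42 — no duplicate operation
```

Case 2 is exactly the duplicate-operation hazard the default no-retry policy exists to avoid; CAS makes retry safe only when the pre-write value is still observable.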
**Polly integration points:**

- `Microsoft.Extensions.Resilience` for DI-friendly pipeline registration
- `TelemetryListener` feeds circuit-breaker state changes into the status dashboard (operators see which devices are in open/half-open/closed state)
- Per-driver/per-device pipeline configuration from the central config DB (retry counts, backoff intervals, and circuit-breaker thresholds can be tuned per device)
**Decided:**

- Capability discovery uses **interface checks via `is`** (e.g. `if (driver is IAlarmSource a) ...`). The interface *is* the capability — no redundant flag enum to keep in sync.
- `ITagDiscovery` is discovery-only. Drivers with a change signal (Galaxy deploy time, OPC UA server change notifications) additionally implement an **optional `IRediscoverable`** sub-interface; the core subscribes and rebuilds the affected subtree. Static drivers (Modbus, S7, etc., whose tags only change via a published config generation) don't implement it.
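The "interface *is* the capability" pattern translates to any language with runtime type tests. A Python sketch mirroring the decided C# `if (driver is IAlarmSource a)` shape (class and method names here are illustrative, not the real API):

```python
# Sketch: the core wires optional plumbing only when the driver's type
# implements the capability — no parallel flags enum to drift out of sync.
from abc import ABC, abstractmethod

class IDriver(ABC):
    @abstractmethod
    def start(self): ...

class IAlarmSource(ABC):
    @abstractmethod
    def subscribe_alarms(self, callback): ...

class GalaxyLikeDriver(IDriver, IAlarmSource):   # supports alarms
    def start(self): return "started"
    def subscribe_alarms(self, callback): return "alarms-subscribed"

class ModbusLikeDriver(IDriver):                 # no IAlarmSource — capability absent
    def start(self): return "started"

def wire_alarms(driver: IDriver) -> bool:
    """Core-side wiring: the type check itself is the capability flag."""
    if isinstance(driver, IAlarmSource):         # ≈ C# `if (driver is IAlarmSource a)`
        driver.subscribe_alarms(print)
        return True
    return False

print(wire_alarms(GalaxyLikeDriver()))  # True  — alarm plumbing attached
print(wire_alarms(ModbusLikeDriver()))  # False — skipped, no dead config paths
```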
---

### 3. Runtime & Target Framework
**Decided: .NET 10, C#, x64 for everything — except where explicitly required.**

| Component | Target | Reason |
|-----------|--------|--------|
| Core, Core.Abstractions | .NET 10 x64 | Default |
| Server | .NET 10 x64 | Default |
| Configuration | .NET 10 x64 | Default |
| Admin | .NET 10 x64 | Blazor Server |
| Driver.ModbusTcp | .NET 10 x64 | Default |
| Driver.AbCip | .NET 10 x64 | Default |
| Driver.OpcUaClient | .NET 10 x64 | Default |
| Client.CLI | .NET 10 x64 | Default |
| Client.UI | .NET 10 x64 | Avalonia |
| **Driver.Galaxy** | **.NET Framework 4.8 x86** | **MXAccess COM interop requires 32-bit** |

**Critical implication:** The Galaxy driver **cannot load in-process** with a .NET 10 x64 server. It must run as an **out-of-process driver** — a separate .NET 4.8 x86 process that the core communicates with over IPC.
**Decided: Named pipes with MessagePack serialization for IPC.**

- Galaxy Host always runs on the same machine (MXAccess needs a local ArchestrA Platform)
- Named pipes are fast, need no port allocation, and are built into both .NET 4.8 (`System.IO.Pipes`) and .NET 10
- `Galaxy.Shared` defines request/response message types serialized with **MessagePack** over length-prefixed frames
- MessagePack-CSharp (`MessagePack` NuGet) supports .NET Framework 4.6.1+ and .NET Standard 2.0+ — works on both sides
- Compact binary format, faster than JSON, a good fit for high-frequency data change callbacks
- Simpler than gRPC on .NET 4.8 (which needs the legacy `Grpc.Core` native library)
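The length-prefixed framing is simple enough to sketch end to end. A Python illustration of the frame/deframe round trip — JSON stands in for MessagePack here so the sketch has no external dependency, and the 4-byte little-endian prefix is an assumed convention for illustration:

```python
# Sketch of length-prefixed framing over the pipe: each message is
# <4-byte LE length><payload>. The real payload format is MessagePack.
import json
import struct

def frame(message: dict) -> bytes:
    payload = json.dumps(message).encode()
    return struct.pack("<I", len(payload)) + payload  # length prefix, then body

def deframe(buffer: bytes):
    """Return (message, remaining_bytes); assumes a whole frame is buffered."""
    (length,) = struct.unpack_from("<I", buffer)
    payload = buffer[4:4 + length]
    return json.loads(payload), buffer[4 + length:]

# Two back-to-back frames on the wire deframe cleanly — the prefix tells the
# reader exactly where one message ends and the next begins.
wire = frame({"op": "Read", "tag": "Pump1.PV"}) + frame({"op": "Ping"})
msg, rest = deframe(wire)
print(msg)              # {'op': 'Read', 'tag': 'Pump1.PV'}
print(deframe(rest)[0]) # {'op': 'Ping'}
```

The prefix is what lets high-frequency data-change callbacks stream over one pipe without message boundaries getting lost.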
**Decided: Galaxy Host is a separate Windows service.**

- Independent lifecycle from the OtOpcUa Server
- Can be restarted without affecting the main server or other drivers
- Galaxy.Proxy detects connection loss, sets Bad quality on Galaxy nodes, reconnects when the Host comes back
- Installed/managed via standard Windows service tooling
```
┌──────────────────────────────────┐  named pipe  ┌───────────────────────────┐
│ OtOpcUa Server (.NET 10 x64)     │◄────────────►│ Galaxy Host Service       │
│ Windows Service                  │              │ Windows Service           │
│ (Microsoft.Extensions.Hosting)   │              │ (.NET 4.8 x86)            │
│                                  │              │                           │
│ Core                             │              │ MxAccessBridge            │
│  ├── Driver.ModbusTcp (in-proc)  │              │ GalaxyRepository          │
│  ├── Driver.AbCip (in-proc)      │              │ GalaxyDriverService       │
│  └── GalaxyProxy (in-proc) ──────┼──────────────┼── AlarmTracking           │
│                                  │              │ HDA Plugin                │
└──────────────────────────────────┘              └───────────────────────────┘
```
**Notes for future work:**

- The Proxy/Host/Shared split is a general pattern — any future driver with process-isolation requirements (bitness mismatch, unstable native dependency, license boundary) can reuse the same three-project layout.
- Reusability of `LmxNodeManager` as a "generic driver node manager" will be assessed during Phase 2 interface extraction.

---
### 4. Galaxy/MXAccess as Out-of-Process Driver

**Current tightly-coupled pieces to refactor:**

- `LmxNodeManager` — mixes OPC UA node management with MXAccess-specific logic
- `MxAccessBridge` — COM thread, subscriptions, reconnect
- `GalaxyRepository` — SQL queries for hierarchy/attributes
- Alarm tracking tied to the MXAccess subscription model
- HDA via the Wonderware Historian plugin

All of these stay in the Galaxy Host process (.NET 4.8 x86). The `GalaxyProxy` in the main server implements the standard driver interfaces and forwards over IPC.

**Decided:**

- The refactor is **incremental**: extract `IDriver` / `ISubscribable` / `ITagDiscovery` etc. against the existing `LmxNodeManager` first (still in-process on the v2 branch), validate the system still runs, *then* move the implementation behind the IPC boundary into Galaxy.Host. This keeps the system runnable at each step and de-risks the out-of-process move.
- **Parity test**: run the existing v1 IntegrationTests suite against the v2 Galaxy driver (same Galaxy, same expectations) **plus** a scripted Client.CLI walkthrough (connect / browse / read / write / subscribe / history / alarms) on a dev Galaxy. Automated regression + human-observable behavior.

---
### 5. Configuration Model — Centralized MSSQL + Local Cache

**Deployment topology — server clusters:**

Sites deploy OtOpcUa as **2-node clusters** to provide non-transparent OPC UA redundancy (per v1 — `RedundancySupport.Warm` / `Hot`, no VIP/load-balancer involvement; clients see both endpoints in `ServerUriArray` and pick by `ServiceLevel`). Single-node deployments are the same model with `NodeCount = 1`. The config schema treats this uniformly: every server is a member of a **`ServerCluster`** with 1 or 2 **`ClusterNode`** members.

Within a cluster, both nodes serve **identical** address spaces — defining tags twice would invite drift — so driver definitions, device configs, tag definitions, and poll groups attach to `ClusterId`, not to individual nodes. Per-node overrides exist only for physical-machine settings that legitimately differ (host, port, `ApplicationUri`, redundancy role, machine cert) and for the rare driver setting that must differ per node (e.g. `MxAccess.ClientName` so Galaxy distinguishes them). Overrides are minimal by intent.

**Architecture:**
```
┌──────────────────────────────────────┐
│ Central Config DB (MSSQL)            │
│                                      │
│ - Server clusters (1 or 2 nodes)     │
│ - Cluster nodes (physical servers)   │
│ - Driver assignments (per cluster)   │
│ - Tag definitions (per cluster)      │
│ - Device configs (per cluster)       │
│ - Per-node overrides (minimal)       │
│ - Schemaless driver config           │
│   (JSON; cluster-level + node        │
│   override JSON)                     │
└──────────────┬───────────────────────┘
               │ poll / change detection
               ▼
     ┌─── Cluster LINE3-OPCUA ────────────────────┐
     │                                            │
┌────┴────────────────────┐      ┌────────────────┴──────────┐
│ Node LINE3-OPCUA-A      │      │ Node LINE3-OPCUA-B        │
│ RedundancyRole=Primary  │      │ RedundancyRole=Secondary  │
│                         │      │                           │
│ appsettings.json:       │      │ appsettings.json:         │
│  - MSSQL conn string    │      │  - MSSQL conn string      │
│  - ClusterId            │      │  - ClusterId              │
│  - NodeId               │      │  - NodeId                 │
│  - Local cache path     │      │  - Local cache path       │
│                         │      │                           │
│ Local cache (LiteDB)    │      │ Local cache (LiteDB)      │
└─────────────────────────┘      └───────────────────────────┘
```
**How it works:**

1. Each OtOpcUa node has a minimal `appsettings.json` with just: the MSSQL connection string, its `ClusterId` and `NodeId`, a local machine-bound client certificate (or gMSA credential), and the local cache file path. **OPC UA port and `ApplicationUri` come from the central DB** (`ClusterNode.OpcUaPort` / `ClusterNode.ApplicationUri`), not from local config — they're cluster topology, not local concerns.
2. On startup, the node authenticates to the central DB **using a credential bound to its `NodeId`** — a client cert or SQL login per node, NOT a shared DB login. The DB-side authorization layer enforces that the authenticated principal may only read config for its `NodeId`'s `ClusterId`. A self-asserted `NodeId` with the wrong credential is rejected. A node may not read another cluster's config, even if both clusters belong to the same admin team.
3. The node requests its current **config generation** from the central DB: "give me the latest published generation for cluster X." Generations are **cluster-scoped** — one generation = one cluster's full configuration snapshot.
4. The node receives the cluster-level config (drivers, devices, tags, poll groups) plus its own `ClusterNode` row (physical attributes + override JSON). It merges node overrides onto cluster-level driver configs at apply time.
5. Config is cached locally in a **LiteDB file** keyed by generation number — if the central DB is unreachable at startup, the node boots from the latest cached generation.
6. The node polls the central DB for a **new published generation**. When a new generation is published, the node downloads it, diffs it against its current one, and applies only the affected drivers/devices/tags (surgical *application* against an atomic *snapshot*).
7. **Both nodes of a cluster apply the same generation**, but apply timing can differ slightly (network jitter, polling phase). During the apply window, one node may be on generation N and the other on N+1; this is acceptable because OPC UA non-transparent redundancy already accommodates per-endpoint state divergence, and `ServiceLevel` will dip on the node that's mid-apply.
8. If generation application fails mid-flight, the node rolls back to the previous generation and surfaces the failure in the status dashboard; admins can publish a corrective generation or explicitly roll back the cluster.
9. The central DB is the single source of truth for fleet management — all tag definitions, device configs, driver assignments, and cluster topology live there, versioned by generation.
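Step 4's override merge is worth pinning down. A minimal Python sketch, assuming shallow per-key overrides (the real merge operates on the JSON columns described in the schema below; field names here mirror the schema but the merge depth is an assumption):

```python
# Sketch of apply-time config resolution: node DriverConfigOverridesJson
# (keyed by DriverInstanceId) layered over the cluster-level DriverConfig.
def effective_driver_config(cluster_cfg: dict, node_overrides: dict,
                            driver_instance_id: str) -> dict:
    merged = dict(cluster_cfg)                                 # cluster config is the base
    merged.update(node_overrides.get(driver_instance_id, {}))  # node wins on conflict
    return merged

cluster = {"GalaxyHost": "localhost", "MxAccess.ClientName": "OtOpcUa"}
node_a_overrides = {"galaxy-1": {"MxAccess.ClientName": "OtOpcUa-A"}}  # per-node identity

cfg = effective_driver_config(cluster, node_a_overrides, "galaxy-1")
print(cfg["MxAccess.ClientName"])  # OtOpcUa-A — overridden per node
print(cfg["GalaxyHost"])           # localhost — inherited from the cluster
```

Keeping the override dict nearly empty is the design intent: everything that can live at cluster scope does, so the two nodes cannot drift.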
**Central DB schema (conceptual):**

```
ServerCluster              ← top-level deployment unit (1 or 2 nodes)
  - ClusterId (PK)
  - Name                   ← human-readable e.g. "LINE3-OPCUA"
  - Site                   ← grouping for fleet management e.g. "PlantA"
  - NodeCount (1 | 2)
  - RedundancyMode (None | Warm | Hot)  ← None when NodeCount=1
  - NamespaceUri           ← shared by both nodes (per v1 redundancy model)
  - Enabled
  - Notes

ClusterNode                ← physical OPC UA server within a cluster
  - NodeId (PK)            ← stable per physical machine, e.g. "LINE3-OPCUA-A"
  - ClusterId (FK)
  - RedundancyRole (Primary | Secondary | Standalone)
  - Host                   ← machine hostname / IP
  - OpcUaPort              ← typically 4840 on each machine
  - DashboardPort          ← typically 8081
  - ApplicationUri         ← MUST be unique per node per OPC UA spec.
                             Convention: urn:{Host}:OtOpcUa (hostname-embedded).
                             Unique index enforced fleet-wide, not just per-cluster
                             — two clusters sharing an ApplicationUri would confuse
                             any client that browses both.
                             Stored explicitly, NOT derived from Host at runtime —
                             OPC UA clients pin trust to ApplicationUri (part of
                             the cert validation chain), so silent rewrites would
                             break client trust.
  - ServiceLevelBase       ← Primary 200, Secondary 150 by default
  - DriverConfigOverridesJson ← per-node overrides keyed by DriverInstanceId,
                             merged onto cluster-level DriverConfig at apply.
                             Minimal by intent — only settings that genuinely
                             differ per node (e.g. MxAccess.ClientName).
  - Enabled
  - LastSeenAt

ClusterNodeCredential      ← 1:1 or 1:N with ClusterNode
  - CredentialId (PK)
  - NodeId (FK)            ← bound to the physical node, NOT the cluster
  - Kind (SqlLogin | ClientCertThumbprint | ADPrincipal | gMSA)
  - Value                  ← login name, thumbprint, SID, etc.
  - Enabled
  - RotatedAt

ConfigGeneration           ← atomic, immutable snapshot of one cluster's config
  - GenerationId (PK)      ← monotonically increasing
  - ClusterId (FK)         ← cluster-scoped — every generation belongs to one cluster
  - PublishedAt
  - PublishedBy
  - Status (Draft | Published | Superseded | RolledBack)
  - ParentGenerationId (FK) ← rollback target
  - Notes

DriverInstance             ← rows reference GenerationId; new generations = new rows
  - DriverInstanceRowId (PK)
  - GenerationId (FK)
  - DriverInstanceId       ← stable logical ID across generations
  - ClusterId (FK)         ← driver lives at the cluster level — both nodes
                             instantiate it identically (modulo node overrides)
  - Name
  - DriverType (Galaxy | ModbusTcp | AbCip | OpcUaClient | …)
  - NamespaceUri           ← per-driver namespace within the cluster's URI scope
  - Enabled
  - DriverConfig (JSON)    ← schemaless, driver-type-specific settings.
                             Per-node overrides applied via
                             ClusterNode.DriverConfigOverridesJson at apply time.

Device                     ← for multi-device drivers like Modbus, CIP
  - DeviceRowId (PK)
  - GenerationId (FK)
  - DeviceId               ← stable logical ID
  - DriverInstanceId (FK)
  - Name
  - DeviceConfig (JSON)    ← host, port, unit ID, slot, etc.

Tag
  - TagRowId (PK)
  - GenerationId (FK)
  - TagId                  ← stable logical ID
  - DeviceId (FK) or DriverInstanceId (FK)
  - Name
  - FolderPath             ← address space hierarchy
  - DataType
  - AccessLevel (Read | ReadWrite)
  - WriteIdempotent (bool) ← opt-in for write retry eligibility (see Polly section)
  - TagConfig (JSON)       ← register address, poll group, scaling, etc.

PollGroup
  - PollGroupRowId (PK)
  - GenerationId (FK)
  - PollGroupId            ← stable logical ID
  - DriverInstanceId (FK)
  - Name
  - IntervalMs

ClusterNodeGenerationState ← tracks which generation each NODE has applied
  - NodeId (PK, FK)        ← per-node, not per-cluster — both nodes of a
|
||||
2-node cluster track independently
|
||||
- CurrentGenerationId (FK)
|
||||
- LastAppliedAt
|
||||
- LastAppliedStatus (Applied | RolledBack | Failed)
|
||||
- LastAppliedError
|
||||
```
|
||||
|
||||
**Authorization model (server-side, enforced in DB):**

- All config reads go through stored procedures that take the authenticated principal from `SESSION_CONTEXT` / `SUSER_SNAME()` / `CURRENT_USER` and cross-check it against `ClusterNodeCredential.Value` for the requesting `NodeId`. A principal asking for config of a `ClusterId` that does not contain its `NodeId` gets rejected, not just filtered.
- Cross-cluster reads are forbidden even within the same site or admin scope — every config read carries the requesting `NodeId` and is checked.
- Admin UI connects with a separate elevated principal that has read/write on all clusters and generations.
- Publishing a generation is a stored procedure that validates the draft, computes the diff vs. the previous generation, and flips `Status` to `Published` atomically within a transaction. The publish is **cluster-scoped** — publishing a new generation for one cluster does not affect any other cluster.

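The reject-not-filter semantics above can be sketched as follows. This is an illustrative Python sketch of the check's logic only — the real enforcement is T-SQL stored procedures, and the dictionary shapes standing in for the `ClusterNodeCredential` and `ClusterNode` tables are assumptions:

```python
# Illustrative sketch: server-side node-bound authorization.
# A request for a cluster that does not contain the authenticated
# node is REJECTED with an error, never silently filtered to empty.

CREDENTIALS = {  # stand-in for ClusterNodeCredential: principal -> NodeId
    "PLANTA\\svc-line3-a": "LINE3-OPCUA-A",
}
NODES = {  # stand-in for ClusterNode: NodeId -> ClusterId
    "LINE3-OPCUA-A": "LINE3-OPCUA",
    "LINE3-OPCUA-B": "LINE3-OPCUA",
}

def read_cluster_config(principal: str, requested_cluster: str) -> str:
    node_id = CREDENTIALS.get(principal)
    if node_id is None:
        raise PermissionError("unknown principal")            # no credential row
    if NODES.get(node_id) != requested_cluster:
        raise PermissionError("cross-cluster read rejected")  # reject, don't filter
    return f"config for {requested_cluster}"

print(read_cluster_config("PLANTA\\svc-line3-a", "LINE3-OPCUA"))
```

Rejecting (rather than returning an empty result set) makes a misconfigured or compromised node loudly visible in the server logs instead of quietly starting with no config.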
**appsettings.json stays minimal:**

```jsonc
{
  "Cluster": {
    "ClusterId": "LINE3-OPCUA",
    "NodeId": "LINE3-OPCUA-A"
    // OPC UA port, ApplicationUri, redundancy role all come from central DB
  },
  "ConfigDatabase": {
    // The connection string MUST authenticate as a principal bound to this NodeId.
    // Options (pick one per deployment):
    //   - Integrated Security + gMSA (preferred on AD-joined hosts)
    //   - Client certificate (Authentication=ActiveDirectoryMsi or cert-auth)
    //   - SQL login scoped via ClusterNodeCredential table (rotate regularly)
    // A shared DB login across nodes is NOT supported — the server-side
    // authorization layer will reject cross-cluster config reads.
    "ConnectionString": "Server=configsrv;Database=OtOpcUaConfig;Authentication=...;...",
    "GenerationPollIntervalSeconds": 30,
    "LocalCachePath": "config_cache.db"
  },
  "Security": { /* transport/auth settings — still local */ }
}
```

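The `ConfigDatabase` settings imply a strict bootstrap order at node startup: central DB first, `LocalCachePath` generation cache second, fail-fast third. A minimal Python sketch of that ordering — illustrative only; the real implementation is .NET with Polly retry around the DB call and LiteDB as the cache, and the function names here are placeholders:

```python
# Startup bootstrap sketch: central DB -> local generation cache -> refuse to start.
# fetch_from_db / load_cache are hypothetical stand-ins for the real providers.
def bootstrap(fetch_from_db, load_cache):
    try:
        gen = fetch_from_db()   # authenticated with the node-bound credential
    except ConnectionError:
        gen = load_cache()      # last applied generation from local cache, or None
    if gen is None:
        # No central DB and no cached generation: there is no meaningful
        # "run with zero config" mode, so the node fails to start.
        raise SystemExit("no config source available — refusing to start")
    return gen

# Usage: DB unreachable but a cached generation exists -> the node still starts.
def db_down():
    raise ConnectionError

print(bootstrap(db_down, lambda: {"generation": 41}))
```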
**Decided:**

- Central MSSQL database is the single source of truth for all configuration.
- **Top-level deployment unit is `ServerCluster`** with 1 or 2 `ClusterNode` members. Single-node and 2-node deployments use the same schema; single-node is a cluster of one.
- **Driver, device, tag, and poll-group config attaches to `ClusterId`, not to individual nodes.** Both nodes of a cluster serve identical address spaces.
- **Per-node overrides are minimal by intent** — `ClusterNode.DriverConfigOverridesJson` is the only override mechanism, scoped to driver-config settings that genuinely must differ per node (e.g. `MxAccess.ClientName`). Tags and devices have no per-node override path.
- **`ApplicationUri` is auto-suggested but never auto-rewritten.** When an operator creates a new `ClusterNode` in Admin, the UI prefills `urn:{Host}:OtOpcUa`. If the operator later changes `Host`, the UI surfaces a warning that `ApplicationUri` is **not** updated automatically — OPC UA clients pin trust to it, and a silent rewrite would force every client to re-pair. The operator must explicitly opt in to changing it.
- Each node identifies itself by `NodeId` and `ClusterId` **and authenticates with a credential bound to its NodeId**; the DB enforces the mapping server-side. A self-asserted `NodeId` is not accepted, and a node may not read another cluster's config.
- Local LiteDB cache for offline startup resilience, keyed by generation.
- JSON columns for driver-type-specific config (schemaless per driver type, structured at the fleet level).
- Multiple instances of the same driver type are supported within one cluster.
- Each device in a driver instance appears as a folder node in the address space.

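The per-node override mechanism reduces to a small merge at apply time: each node overlays `ClusterNode.DriverConfigOverridesJson`, keyed by `DriverInstanceId`, onto the cluster-level `DriverConfig`. A Python sketch of the idea — illustrative only, and it assumes a shallow key-wise merge (an override replaces a whole top-level setting rather than deep-merging into it):

```python
# Override merge sketch: node overrides win over cluster-level settings,
# but only for the driver instance they are keyed to.
def effective_driver_config(cluster_config: dict, node_overrides: dict,
                            driver_instance_id: str) -> dict:
    merged = dict(cluster_config)                            # cluster defaults
    merged.update(node_overrides.get(driver_instance_id, {}))  # node-specific wins
    return merged

cluster = {"PollIntervalMs": 1000, "MxAccess": {"ClientName": "OtOpcUa"}}
overrides = {"galaxy-1": {"MxAccess": {"ClientName": "OtOpcUa-A"}}}

print(effective_driver_config(cluster, overrides, "galaxy-1"))
```

A driver instance with no entry in the overrides map gets the cluster config unchanged, which keeps the two nodes identical except where divergence is explicitly configured.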
**Decided (rollout model):**

- Config is versioned as **immutable, cluster-scoped generations**. Admin authors a draft for a cluster, then publishes it in a single transaction. Nodes only ever observe a fully-published generation — never a half-edited mix of rows.
- One generation = one cluster's full configuration snapshot. Publishing a generation for one cluster does not affect any other cluster.
- Each node polls for the latest generation for its cluster, diffs it against its currently applied generation, and surgically applies only the affected drivers/devices/tags. Surgical *application* is safe because the *source snapshot* is atomic.
- **Both nodes of a cluster apply the same generation independently** — apply timing can differ slightly. During the apply window, one node may be on generation N while the other is on N+1; this is acceptable because non-transparent redundancy already accommodates per-endpoint state divergence, and `ServiceLevel` will dip on the node that's mid-apply.
- Rollback: publishing a new generation never deletes old ones. Admins can roll back a cluster to any previous generation; nodes apply the target generation the same way as a forward publish.
- Applied state per node is tracked in `ClusterNodeGenerationState` so Admin can see which nodes have picked up a new publish and can detect stragglers or a 2-node cluster that has diverged.
- If neither the central DB nor a local cache is available, the node fails to start. This is acceptable — there is no meaningful "run with zero config" mode.

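The poll/diff/apply cycle above hinges on diffing by *stable logical IDs* (`DriverInstanceId`, `DeviceId`, `TagId`), not by row PKs, since every generation creates new rows. A Python sketch of the diff step — illustrative only; the row shape and diff granularity here are assumptions, not the actual `Configuration`-project API:

```python
# Generation diff sketch: compare the applied snapshot to the target snapshot
# keyed by stable logical IDs; only changed or removed entries get touched.
def diff_generation(current: dict, target: dict) -> dict:
    """current/target map stable logical IDs (e.g. 'tag:T1') to their config."""
    changed = {k: v for k, v in target.items() if current.get(k) != v}  # new or edited
    removed = [k for k in current if k not in target]                   # deleted
    return {"changed": changed, "removed": removed}

current = {"tag:T1": {"addr": 40001}, "tag:T2": {"addr": 40002}}
target  = {"tag:T1": {"addr": 40001}, "tag:T2": {"addr": 40010},
           "tag:T3": {"addr": 40020}}

# T1 is untouched and keeps running; only T2 (edited) and T3 (new) reload.
print(diff_generation(current, target))
```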
**Decided:**

- **Transport security config (certs, LDAP settings, transport profiles) stays local** in `appsettings.json` per instance. This avoids a bootstrap chicken-and-egg where DB-connection credentials would depend on config retrieved *from* the DB, and matches the current v1 deployment model.
- **Generation retention: keep all generations forever.** The rollback target is always available and the audit trail is complete. Config rows are small and publish cadence is low (days/weeks), so storage cost is negligible versus the utility of a complete history.

**Deferred:**

- Event-driven generation notification (SignalR / Service Broker) as an optimisation over the poll interval — deferred until polling proves insufficient.

---

### 5. Project Structure

**All projects target .NET 10 x64 unless noted.**

```
src/
  # ── Configuration layer ──
  ZB.MOM.WW.OtOpcUa.Configuration/         # Central DB schema (EF), change detection,
                                           # local LiteDB cache, config models (.NET 10)
  ZB.MOM.WW.OtOpcUa.Admin/                 # Blazor Server admin UI + API for managing the
                                           # central config DB (.NET 10)

  # ── Core + Server ──
  ZB.MOM.WW.OtOpcUa.Core/                  # OPC UA server, address space, subscriptions,
                                           # driver hosting (.NET 10)
  ZB.MOM.WW.OtOpcUa.Core.Abstractions/     # IDriver, IReadable, ISubscribable, etc.
                                           # thin contract (.NET 10)
  ZB.MOM.WW.OtOpcUa.Server/                # Host (Microsoft.Extensions.Hosting),
                                           # Windows Service, config bootstrap (.NET 10)

  # ── In-process drivers (.NET 10 x64) ──
  ZB.MOM.WW.OtOpcUa.Driver.ModbusTcp/      # Modbus TCP driver (NModbus)
  ZB.MOM.WW.OtOpcUa.Driver.AbCip/          # Allen-Bradley CIP driver (libplctag)
  ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/       # Allen-Bradley SLC/MicroLogix driver (libplctag)
  ZB.MOM.WW.OtOpcUa.Driver.S7/             # Siemens S7 driver (S7netplus)
  ZB.MOM.WW.OtOpcUa.Driver.TwinCat/        # Beckhoff TwinCAT ADS driver (Beckhoff.TwinCAT.Ads)
  ZB.MOM.WW.OtOpcUa.Driver.Focas/          # FANUC FOCAS CNC driver (Fwlib64.dll P/Invoke)
  ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/    # OPC UA client gateway driver

  # ── Out-of-process Galaxy driver ──
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/   # In-process proxy that implements IDriver interfaces
                                           # and forwards over IPC (.NET 10)
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/    # Separate process: MXAccess COM, Galaxy DB,
                                           # alarms, HDA. Hosts IPC server (.NET 4.8 x86)
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/  # Shared IPC message contracts between Proxy
                                           # and Host (.NET Standard 2.0)

  # ── Client tooling (.NET 10 x64) ──
  ZB.MOM.WW.OtOpcUa.Client.CLI/            # client CLI
  ZB.MOM.WW.OtOpcUa.Client.UI/             # Avalonia client

tests/
  ZB.MOM.WW.OtOpcUa.Configuration.Tests/
  ZB.MOM.WW.OtOpcUa.Core.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.ModbusTcp.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.TwinCat.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.Focas.Tests/
  ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/
  ZB.MOM.WW.OtOpcUa.IntegrationTests/
```

**Deployment units:**

| Unit | Description | Target | Deploys to |
|------|-------------|--------|------------|
| **OtOpcUa Server** | Windows Service (M.E.Hosting) — OPC UA server + in-process drivers | .NET 10 x64 | Each site node |
| **Galaxy Host** | Windows Service — out-of-process MXAccess driver | .NET 4.8 x86 | Same machine as Server (when Galaxy driver is used) |
| **OtOpcUa Admin** | Blazor Server config management UI | .NET 10 x64 | Same server or central management host |
| **OtOpcUa Client CLI** | Operator CLI tool | .NET 10 x64 | Any workstation |
| **OtOpcUa Client UI** | Avalonia desktop client | .NET 10 x64 | Any workstation |

**Dependency graph:**

```
Admin ──→ Configuration

Server ──→ Core ──→ Core.Abstractions
  │                     ↑
  │    Driver.ModbusTcp, Driver.AbCip, Driver.AbLegacy,
  │    Driver.S7, Driver.TwinCat, Driver.Focas,
  │    Driver.OpcUaClient              (in-process)
  │    Driver.Galaxy.Proxy             (in-process, forwards over IPC)
  ↓
Configuration

Galaxy.Proxy ──→ Galaxy.Shared ←── Galaxy.Host
                                   (.NET 4.8 x86, separate process)
```

- `Core.Abstractions` — no dependencies, referenced by Core and all drivers (including Galaxy.Proxy)
- `Configuration` — owns central DB access + local cache, referenced by Server and Admin
- `Admin` — Blazor Server app, depends on Configuration, can deploy on same server
- In-process drivers depend on `Core.Abstractions` only
- `Galaxy.Shared` — .NET Standard 2.0 IPC contracts, referenced by both Proxy (.NET 10) and Host (.NET 4.8)
- `Galaxy.Host` — standalone .NET 4.8 x86 process, does NOT reference Core or Core.Abstractions
- `Galaxy.Proxy` — implements `IDriver` etc., depends on Core.Abstractions + Galaxy.Shared

**Decided:**

- Mono-repo (Decision #31 above).
- `Core.Abstractions` is **internal-only for now** — no standalone NuGet. Keep the contract mutable while the first 8 drivers are being built; revisit publishing after Phase 5 when the shape has stabilized. Design the contract *as if* it will eventually be public (no leaky types, stable names) to minimize churn later.

---

### 5a. LmxNodeManager Reusability Analysis

**Investigated 2026-04-17.** The existing `LmxNodeManager` (2923 lines) is the foundation for the new generic node manager — not a rewrite candidate. Categorized inventory:

| Bucket | Lines | % | What's here |
|--------|-------|-----|-------------|
| **Already generic** | ~1310 | 45% | OPC UA plumbing: `CreateAddressSpace` + topological sort + `_nodeMap`, Read/Write dispatch, HistoryRead + continuation points, subscription delivery + `_pendingDataChanges` queue, dispatch thread lifecycle, runtime-status node mechanism, status-code mapping |
| **Generic pattern, Galaxy-coded today** | ~1170 | 40% | Bad-quality fan-out when a host drops, alarm auto-subscribe (InAlarm+Priority+Description pattern), background-subscribe tracking with shutdown-safe WaitAll, value normalization for arrays, connection-health probe machinery — each is a pattern every driver will need, currently wired to Galaxy types |
| **Truly MXAccess-specific** | ~290 | 10% | `IMxAccessClient` calls, `MxDataTypeMapper`, `SecurityClassificationMapper`, `GalaxyRuntimeProbeManager` construction/lifecycle, Historian literal, alarm auto-subscribe trigger |
| Metadata / comments | ~153 | 5% | |

**Interleaving assessment:** concerns are cleanly separated at method boundaries. Read/Write handlers do generic resolution → generic host-status check → isolated `_mxAccessClient` call. The dispatch loop is fully generic. The only meaningful interleaving is in `BuildAddressSpace()` where `GalaxyAttributeInfo` leaks into node creation — fixable by introducing a driver-agnostic `DriverAttributeInfo` DTO.

**Refactor plan:**

1. **Rename `LmxNodeManager` → `GenericDriverNodeManager : CustomNodeManager2`** and lift the generic blocks unchanged. Swap `IMxAccessClient` for `IDriver` (composing `IReadable` / `IWritable` / `ISubscribable`). Swap `GalaxyAttributeInfo` for a driver-agnostic `DriverAttributeInfo { FullName, DriverDataType, IsArray, ArrayDim, SecurityClass, IsHistorized }`. Promote `GalaxyRuntimeProbeManager` to an `IHostConnectivityProbe` capability interface.
2. **Derive `GalaxyNodeManager : GenericDriverNodeManager`** — driver-specific builder that maps `GalaxyAttributeInfo → DriverAttributeInfo`, registers `MxDataTypeMapper` / `SecurityClassificationMapper`, injects the probe manager.
3. **New drivers** (Modbus, S7, etc.) extend `GenericDriverNodeManager` and implement the capability interfaces. No forking of the OPC UA machinery.

**Ordering within Phase 2** (fits the "incremental extraction" approach in Decision #55):

- (a) Introduce capability interfaces + `DriverAttributeInfo` in `Core.Abstractions`.
- (b) Rename to `GenericDriverNodeManager` with Galaxy still in-process as the only driver; validate parity against v1 integration tests + CLI walkthrough.
- (c) Only then move Galaxy behind the IPC boundary into `Galaxy.Host`.

Each step leaves the system runnable. The generic extraction is effectively free — the class is already mostly generic, just named and typed for Galaxy.

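The shape of the driver-agnostic DTO from the refactor plan, illustrated as a Python dataclass. This is a sketch of the field set only — the real type would be a C# record in `Core.Abstractions`, and the field semantics noted in comments are inferred from the surrounding text:

```python
# DriverAttributeInfo sketch: the DTO every driver-specific builder maps
# its native attribute metadata onto (GalaxyAttributeInfo, Modbus register
# metadata, S7 DB entries, ...). Field names mirror the doc's list.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DriverAttributeInfo:
    full_name: str            # FullName: address-space path of the attribute
    driver_data_type: str     # DriverDataType: driver-native type token
    is_array: bool            # IsArray
    array_dim: Optional[int]  # ArrayDim: None for scalars
    security_class: str       # SecurityClass
    is_historized: bool       # IsHistorized

info = DriverAttributeInfo("Line3.Pump1.Speed", "Float", False, None,
                           "Operate", True)
print(info.full_name)
```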
---

### 6. Migration Strategy

**Decided approach:**

**Phase 0 — Rename + .NET 10 migration**

1. **Rename to OtOpcUa** — mechanical rename of namespaces, assemblies, config, and docs
2. **Migrate to .NET 10 x64** — retarget all projects except Galaxy Host

**Phase 1 — Core extraction + Configuration layer + Admin scaffold**

3. **Build `Configuration` project** — central MSSQL schema with `ServerCluster`, `ClusterNode`, `ClusterNodeCredential`, `ConfigGeneration`, `ClusterNodeGenerationState` plus the cluster-scoped `DriverInstance` / `Device` / `Tag` / `PollGroup` tables (EF Core + migrations); server-side authorization stored procs that enforce per-node-bound-to-cluster access from authenticated principals; atomic cluster-scoped publish/rollback stored procs; LiteDB local cache keyed by generation; generation-diff application logic; per-node override merge at apply time.
4. **Extract `Core.Abstractions`** — define `IDriver`, `ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`. `IWritable` contract separates idempotent vs. non-idempotent writes at the interface level.
5. **Build `Core`** — generic driver-hosting node manager that delegates to capability interfaces, driver isolation (catch/contain), address space registration, separate Polly pipelines for reads vs. writes per the write-retry policy above.
6. **Wire `Server`** — bootstrap from Configuration using an instance-bound credential (cert/gMSA/SQL login), fail fast if the credential is rejected, register drivers, start Core.
7. **Scaffold `Admin`** — Blazor Server app with: instance + credential management, draft/publish/rollback generation workflow (diff viewer, "publish to fleet", per-instance override), and core CRUD for drivers/devices/tags. Driver-specific config screens deferred to later phases.

**Phase 2 — Galaxy driver (prove the refactor)**

8. **Build `Galaxy.Shared`** — .NET Standard 2.0 IPC message contracts
9. **Build `Galaxy.Host`** — .NET 4.8 x86 process hosting MxAccessBridge, GalaxyRepository, alarms, HDA with IPC server
10. **Build `Galaxy.Proxy`** — .NET 10 in-process proxy implementing IDriver interfaces, forwarding over IPC
11. **Validate parity** — v2 Galaxy driver must pass the same integration tests as v1

**Phase 3 — Modbus TCP driver (prove the abstraction)**

12. **Build `Driver.ModbusTcp`** — NModbus, config-driven tags from central DB, internal poll loop, device-as-folder hierarchy
13. **Add Modbus config screens to Admin** (first driver-specific config UI)

**Phase 4 — PLC drivers**

14. **Build `Driver.AbCip`** — libplctag, ControlLogix/CompactLogix symbolic tags + Admin config screens
15. **Build `Driver.AbLegacy`** — libplctag, SLC 500/MicroLogix file-based addressing + Admin config screens
16. **Build `Driver.S7`** — S7netplus, Siemens S7-300/400/1200/1500 + Admin config screens
17. **Build `Driver.TwinCat`** — Beckhoff.TwinCAT.Ads v6, native ADS notifications, symbol upload + Admin config screens

**Phase 5 — Specialty drivers**

18. **Build `Driver.Focas`** — FANUC FOCAS2 P/Invoke, pre-defined CNC tag set, PMC/macro config + Admin config screens
19. **Build `Driver.OpcUaClient`** — OPC UA client gateway/aggregation, namespace remapping, subscription proxying + Admin config screens

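The capability composition defined in Phase 1 step 4 is discovered at runtime by interface checks (Decision #53: `driver is IAlarmSource` in C#) rather than a flags enum. A Python analogue using `isinstance`, illustrative only — the interface names come from the doc, but the method signatures are assumptions:

```python
# Capability-discovery sketch: the interface IS the capability.
# Core probes each driver with type checks; a driver that lacks alarms
# simply does not implement IAlarmSource, and Core skips alarm wiring.
from abc import ABC, abstractmethod

class IReadable(ABC):
    @abstractmethod
    def read(self, tag_id: str): ...

class IAlarmSource(ABC):
    @abstractmethod
    def subscribe_alarms(self, callback): ...

class ModbusDriver(IReadable):   # polled reads, no native alarm source
    def read(self, tag_id: str):
        return 0

def capabilities(driver) -> list:
    return [i.__name__ for i in (IReadable, IAlarmSource)
            if isinstance(driver, i)]

print(capabilities(ModbusDriver()))   # ['IReadable']
```

Because the check and the implementation are the same artifact, there is no flag enum that can drift out of sync with what the driver actually supports.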
**Decided:**

- **Parity test for Galaxy**: existing v1 IntegrationTests suite + scripted Client.CLI walkthrough (see Section 4 above).
- **Timeline**: no hard deadline. Each phase ships when it's right — tests passing, Galaxy parity bar met. Quality cadence over calendar cadence.
- **FOCAS SDK**: license already secured. Phase 5 can proceed as scheduled; `Fwlib64.dll` available for P/Invoke.

---

## Decision Log
|
||||
|
||||
| # | Decision | Rationale | Date |
|
||||
|---|----------|-----------|------|
|
||||
| 1 | Work on `v2` branch | Keep master stable for production | 2026-04-16 |
|
||||
| 2 | OPC UA core + pluggable driver modules | Enable multi-protocol support without forking the server | 2026-04-16 |
|
||||
| 3 | Rename to **OtOpcUa** | Product is no longer LMX-specific | 2026-04-16 |
|
||||
| 4 | Composable capability interfaces | Drivers vary widely in what they support; flat `IDriver` would force stubs | 2026-04-16 |
|
||||
| 5 | Target drivers: Galaxy, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client | Full PLC/CNC/SCADA/aggregation coverage | 2026-04-16 |
|
||||
| 6 | Polling is driver-internal, not core-managed | Each driver owns its poll loop; core just sees data change callbacks | 2026-04-16 |
|
||||
| 7 | Multiple instances of same driver type supported | Need e.g. separate Modbus drivers for different device groups | 2026-04-16 |
|
||||
| 8 | Namespace index per driver instance | Each instance gets its own NamespaceUri for clean isolation | 2026-04-16 |
|
||||
| 9 | Rename to OtOpcUa as step 1 | Clean mechanical change before any refactoring | 2026-04-16 |
|
||||
| 10 | Modbus TCP as second driver | Simplest protocol, validates abstraction with flat/polled/config-driven model | 2026-04-16 |
|
||||
| 11 | Library selections per driver | NModbus (Modbus), libplctag (AB CIP + AB Legacy), S7netplus (S7), Beckhoff.TwinCAT.Ads v6 (TwinCAT), Fwlib64.dll P/Invoke (FOCAS), OPC Foundation SDK (OPC UA Client) | 2026-04-16 |
|
||||
| 12 | Driver isolation — failure contained per instance | One driver crash/disconnect must not affect other drivers' nodes or quality | 2026-04-16 |
|
||||
| 13 | Shared OPC UA StatusCode model for quality | Drivers map to the same StatusCode space; each defines which codes it produces | 2026-04-16 |
|
||||
| 14 | Central MSSQL config database | Single source of truth for fleet-wide config — instances, drivers, tags, devices | 2026-04-16 |
|
||||
| 15 | LiteDB local cache per instance | Offline startup resilience — instance boots from cache if central DB is unreachable | 2026-04-16 |
|
||||
| 16 | JSON columns for driver-specific config | Schemaless per driver type, avoids table-per-driver-type explosion | 2026-04-16 |
|
||||
| 17 | Device-as-folder in address space | Multi-device drivers expose Device/Tag hierarchy for intuitive browsing | 2026-04-16 |
|
||||
| 18 | Minimal appsettings.json (ClusterId + NodeId + DB conn) | All real config lives in central DB, not local files. OPC UA port and ApplicationUri come from `ClusterNode` row, not local config | 2026-04-16 / 2026-04-17 |
|
||||
| 19 | Blazor Server admin app for config management | Separate deployable, manages central MSSQL config DB | 2026-04-16 |
|
||||
| 20 | Surgical config change detection | Instance detects which drivers/devices/tags changed, applies incremental updates | 2026-04-16 |
|
||||
| 21 | Fail-to-start without DB or cache | No meaningful zero-config mode — requires at least cached config | 2026-04-16 |
|
||||
| 22 | `Configuration` project owns DB + cache layer | Clean separation: Server and Admin both depend on it | 2026-04-16 |
|
||||
| 23 | .NET 10 x64 default, .NET 4.8 x86 only for Galaxy Host | Modern runtime for everything; COM constraint isolated to Galaxy | 2026-04-16 |
|
||||
| 24 | Galaxy driver is out-of-process | .NET 4.8 x86 process can't load into .NET 10 x64; IPC bridge required | 2026-04-16 |
|
||||
| 25 | Galaxy.Shared (.NET Standard 2.0) for IPC contracts | Must be consumable by both .NET 10 Proxy and .NET 4.8 Host | 2026-04-16 |
|
||||
| 26 | Admin deploys on same server (co-hosted) | Simplifies deployment; can also run on separate management host | 2026-04-16 |
|
||||
| 27 | Admin scaffold early, driver-specific screens deferred | Core CRUD for instances/drivers first; per-driver config UI added with each driver | 2026-04-16 |
|
||||
| 28 | Named pipes for Galaxy IPC | Fast, no port conflicts, native to both .NET 4.8 and .NET 10 | 2026-04-16 |
|
||||
| 29 | Galaxy Host is a separate Windows service | Independent lifecycle, can restart without affecting main server or other drivers | 2026-04-16 |
|
||||
| 30 | Drop TopShelf, use Microsoft.Extensions.Hosting | Built-in Windows Service support in .NET 10, no third-party dependency | 2026-04-16 |
|
||||
| 31 | Mono-repo for all drivers | Simpler dependency management, single CI pipeline, shared abstractions | 2026-04-16 |
|
||||
| 32 | MessagePack serialization for Galaxy IPC | Binary, fast, works on .NET 4.8+ and .NET 10 via MessagePack-CSharp NuGet | 2026-04-16 |
|
||||
| 33 | EF Core for Configuration DB | Migrations, LINQ queries, standard .NET 10 ORM | 2026-04-16 |
|
||||
| 34 | Polly v8+ for resilience | Retry, circuit breaker, timeout per device/driver — replaces hand-rolled supervision | 2026-04-16 |
|
||||
| 35 | Per-device resilience pipelines | Circuit breaker on Drive1 doesn't affect Drive2, even in same driver instance | 2026-04-16 |
|
||||
| 36 | Polly for config DB access | Retry + fallback to LiteDB cache on sustained DB outage | 2026-04-16 |
|
||||
| 37 | FOCAS driver uses pre-defined tag set | CNC data is functional (axes, spindle, PMC), not user-defined tags — driver exposes fixed node hierarchy populated by specific FOCAS2 API calls | 2026-04-16 |
|
||||
| 38 | FOCAS PMC + macro variables are user-configured | PMC addresses (R, D, G, F, etc.) and macro variable ranges configured in central DB; not auto-discovered | 2026-04-16 |
|
||||
| 39 | TwinCAT uses native ADS notifications | One of 3 drivers with native subscriptions (Galaxy, TwinCAT, OPC UA Client); no polling needed for subscribed tags | 2026-04-16 |
|
||||
| 40 | TwinCAT no runtime required on server | Beckhoff.TwinCAT.Ads v6 supports in-process ADS router; only needs AMS route on target device | 2026-04-16 |
|
||||
| 41 | AB Legacy (SLC/MicroLogix) as separate driver from AB CIP | Different protocol (PCCC vs CIP), different addressing (file-based vs symbolic), severe connection limits (4-8) | 2026-04-16 |
|
||||
| 42 | S7 driver notes: PUT/GET must be enabled on S7-1200/1500 | Disabled by default in TIA Portal; document as prerequisite | 2026-04-16 |
|
||||
| 43 | DL205 (AutomationDirect) handled by Modbus TCP driver | DL205 supports Modbus TCP via H2-ECOM100; no separate driver needed — `AddressFormat=DL205` adds octal address translation | 2026-04-16 |
|
||||
| 44 | No automatic retry on writes by default | Write retries are unsafe for non-idempotent field actions — a timeout can fire after the device already accepted the command, and replay duplicates pulses/acks/counters/recipe steps (adversarial review finding #1) | 2026-04-16 |
|
||||
| 45 | Opt-in write retry via `TagConfig.WriteIdempotent` or CAS wrapper | Retries must be explicit per tag; CAS (compare-and-set) verifies device state before retry where the protocol supports it | 2026-04-16 |
|
||||
| 46 | Instance identity is credential-bound, not self-asserted | Each instance authenticates to the central DB with a credential (cert/gMSA/SQL login) bound to its `InstanceId`; the DB rejects cross-instance config reads server-side (adversarial review finding #2) | 2026-04-16 |
|
||||
| 47 | `InstanceCredential` table + authorization stored procs | Credentials and the `InstanceId` they are authorized for live in the DB; all config reads go through procs that enforce the mapping rather than trusting the client | 2026-04-16 |
|
||||
| 48 | Config is versioned as immutable generations with atomic publish | Admin publishes a whole generation in one transaction; instances only ever observe fully-published generations, never partial multi-row edits (adversarial review finding #3) | 2026-04-16 |
|
||||
| 49 | Surgical reload applies a generation diff, not raw row deltas | The source snapshot is atomic (generation), but applying it to a running instance is still incremental — only affected drivers/devices/tags reload | 2026-04-16 |
|
||||
| 50 | Explicit rollback via re-publishing a prior generation | Generations are never deleted; rollback is just publishing an older generation as the new current, so instances apply it the same way as a forward publish | 2026-04-16 |
|
||||
| 51 | `InstanceGenerationState` tracks applied generation per instance | Admin can see which instances have picked up a new publish and detect stragglers or failed applies | 2026-04-16 |
|
||||
| 52 | Address space registration via builder/context API | Core owns the tree; driver streams AddFolder/AddVariable on an `IAddressSpaceBuilder`, avoids buffering the whole tree and supports incremental discovery | 2026-04-17 |
|
||||
| 53 | Capability discovery via interface checks (`is IAlarmSource`) | The interface *is* the capability — no redundant flag enum to keep in sync with the implementation | 2026-04-17 |
|
||||
| 54 | Optional `IRediscoverable` sub-interface for change-detection | Drivers with a native change signal (Galaxy deploy time, OPC UA change notifications) opt in; static drivers skip it | 2026-04-17 |
|
||||
| 55 | Galaxy refactor is incremental — extract interfaces in place first | Refactor `LmxNodeManager` against new abstractions while still in-process, validate, then move behind IPC. Keeps system runnable at each step | 2026-04-17 |
|
||||
| 56 | Galaxy parity test = v1 integration suite + scripted CLI walkthrough | Automated regression plus human-observable behavior on a dev Galaxy | 2026-04-17 |
|
||||
| 57 | Transport security config stays local in `appsettings.json` | Avoids bootstrap chicken-and-egg (DB-connection credentials can't depend on config fetched from the DB); matches v1 deployment | 2026-04-17 |
|
||||
| 58 | Generation retention: keep all generations forever | Rollback target always available; audit trail complete; storage cost negligible at publish cadence of days/weeks | 2026-04-17 |
|
||||
| 59 | `Core.Abstractions` internal-only for now, no NuGet | Keep the contract mutable through the first 8 drivers; design as if public, revisit after Phase 5 | 2026-04-17 |
|
||||
| 60 | No hard deadline — phases deliver when they're right | Quality cadence over calendar cadence; Galaxy parity bar must be met before moving on | 2026-04-17 |
|
||||
| 61 | FOCAS SDK license already secured | Phase 5 can proceed; `Fwlib64.dll` available for P/Invoke with no procurement blocker | 2026-04-17 |
|
||||
| 62 | `LmxNodeManager` is the foundation for `GenericDriverNodeManager`, not a rewrite | ~85% of the 2923 lines are generic or generic-in-spirit; only ~10% (~290 lines) are truly MXAccess-specific. Concerns are cleanly separated at method boundaries — refactor is rename + DTO swap, not restructuring | 2026-04-17 |
|
||||
| 63 | Driver stability tier model (A/B/C) | Drivers vary in failure profile (pure managed vs wrapped native vs black-box DLL); tier dictates hosting and protection level. See `driver-stability.md` | 2026-04-17 |
|
||||
| 64 | FOCAS is Tier C — out-of-process Windows service from day one | `Fwlib64.dll` is a black-box vendor DLL; an `AccessViolationException` is uncatchable in modern .NET and would tear down the OPC UA server. Same Proxy/Host/Shared pattern as Galaxy | 2026-04-17 |
|
||||
| 65 | Cross-cutting stability protections mandatory in all tiers | SafeHandle for every native resource, memory watchdog, bounded operation queues, scheduled recycle, crash-loop circuit breaker, post-mortem log — apply to every driver process whether in-proc or isolated | 2026-04-17 |
| 66 | Out-of-process driver pattern is reusable across Tier C drivers | Galaxy.Proxy/Host/Shared template generalizes; FOCAS is the second user; future Tier B → Tier C escalations reuse the same three-project template | 2026-04-17 |
| 67 | Tier B drivers may escalate to Tier C on production evidence | libplctag (AB CIP/Legacy), S7netplus, TwinCAT.Ads start in-process; promote to isolated host if leaks or crashes appear in field | 2026-04-17 |
| 68 | Crash-loop circuit breaker — 3 crashes/5 min stops respawn | Prevents host respawn thrashing when the underlying device or DLL is in a state respawning won't fix; surfaces operator-actionable alert; manual reset via Admin UI | 2026-04-17 |
| 69 | Post-mortem log via memory-mapped file | Ring buffer of last-N operations + driver-specific state; survives hard process death including native AV; supervisor reads MMF after corpse is gone — only viable post-mortem path for native crashes | 2026-04-17 |
| 70 | Watchdog thresholds = hybrid multiplier + absolute floor + hard ceiling | Pure multipliers misfire on tiny baselines; pure absolute MB doesn't scale across deployment sizes. `max(N× baseline, baseline + floor MB)` for warn/recycle plus an absolute hard ceiling. Slope detection stays orthogonal | 2026-04-17 |
| 71 | Crash-loop reset = escalating cooldown (1 h → 4 h → 24 h manual) with sticky alerts | Manual-only is too rigid for unattended plants; pure auto-reset silently retries forever. Escalating cooldown auto-recovers transient problems but forces human attention on persistent ones; sticky alerts preserve the trail regardless of reset path | 2026-04-17 |
| 72 | Heartbeat cadence = 2 s with 3-miss tolerance (6 s detection) | 5 s × 3 = 15 s is too slow against 1 s OPC UA publish intervals; 1 s × 3 = 3 s false-positives on GC pauses and pipe jitter. 2 s × 3 = 6 s is the sweet spot | 2026-04-17 |
| 73 | Process-level protections (RSS watchdog, scheduled recycle) apply ONLY to Tier C isolated host processes | Process recycle in the shared server would kill every other in-proc driver, every session, and the OPC UA endpoint — directly contradicts the per-driver isolation invariant. Tier A/B drivers get per-instance allocation tracking + cache flush + no-process-kill instead (adversarial review finding #1) | 2026-04-17 |
| 74 | A Tier A/B driver that needs process-level recycle MUST be promoted to Tier C | The only safe way to apply process recycle to a single driver is to give it its own process. If allocation tracking + cache flush can't bound a leak, the answer is isolation, not killing the server | 2026-04-17 |
| 75 | Wedged native calls in Tier C drivers escalate to hard process exit, never handle-free-during-call | Calling release functions on a handle with an active native call is undefined behavior — exactly the AV path Tier C is designed to prevent. After grace window, leave the handle Abandoned and `Environment.Exit(2)`. The OS reclaims fds/sockets on exit; the device's connection-timeout reclaims its end (adversarial review finding #2) | 2026-04-17 |
| 76 | Tier C IPC has mandatory pipe ACL + caller SID verification + per-process shared secret | Default named-pipe ACL allows any local user to bypass OPC UA auth and issue reads/writes/acks directly against the host. Pipe ACL restricts to server service SID, host verifies caller token on connect, supervisor-generated per-process secret as defense-in-depth (adversarial review finding #3) | 2026-04-17 |
| 77 | FOCAS stability test coverage = TCP stub (functional) + FaultShim native DLL (host-side faults) | A TCP stub cannot make Fwlib leak handles or AV — those live inside the P/Invoke boundary. Two artifacts cover the two layers honestly: TCP stub for ~80% of failures (network/protocol), FaultShim for the remaining ~20% (native crashes/leaks). Real-CNC validation remains the only path for vendor-specific Fwlib quirks (adversarial review finding #5) | 2026-04-17 |
| 78 | Per-driver stability treatment is proportional to driver risk | Galaxy and FOCAS get full Tier C deep dives in `driver-stability.md` (different concerns: COM/STA pump vs Fwlib handle pool); TwinCAT, AB CIP, AB Legacy get short Operational Stability Notes in `driver-specs.md` for their tier-promotion triggers and protocol-specific failure modes; pure-managed Tier A drivers get one paragraph each. Avoids duplicating the cross-cutting protections doc seven times | 2026-04-17 |
| 79 | Top-level deployment unit is `ServerCluster` with 1 or 2 `ClusterNode` members | Sites deploy 2-node clusters for OPC UA non-transparent redundancy (per v1 — Warm/Hot, no VIP). Single-node deployments are clusters of one. Uniform schema avoids forking the config model | 2026-04-17 |
| 80 | Driver / device / tag / poll-group config attaches to `ClusterId`, not to individual nodes | Both nodes of a cluster serve identical address spaces; defining tags twice would invite drift. One generation = one cluster's complete config | 2026-04-17 |
| 81 | Per-node overrides minimal — `ClusterNode.DriverConfigOverridesJson` only | Some driver settings legitimately differ per node (e.g. `MxAccess.ClientName` so Galaxy distinguishes them) but the surface is small. Single JSON column merged onto cluster-level `DriverConfig` at apply time. Tags and devices have no per-node override path | 2026-04-17 |
| 82 | `ConfigGeneration` is cluster-scoped, not fleet-scoped | Publishing a generation for one cluster does not affect any other cluster. Simpler rollout (one cluster at a time), simpler rollback, simpler auth boundary. Fleet-wide synchronized rollouts (if ever needed) become a separate concern — orchestrate per-cluster publishes from Admin | 2026-04-17 |
| 83 | Each node authenticates with its own `ClusterNodeCredential` bound to `NodeId` | Cluster-scoped auth would be too coarse — both nodes sharing a credential makes credential rotation harder and obscures which node read what. Per-node binding also enforces that Node A cannot impersonate Node B in audit logs | 2026-04-17 |
| 84 | Both nodes apply the same generation independently; brief divergence acceptable | OPC UA non-transparent redundancy already handles per-endpoint state divergence; `ServiceLevel` dips on the node mid-apply and clients fail over. Forcing two-phase commit across nodes would be a complex distributed-system problem with no real upside | 2026-04-17 |
| 85 | OPC UA `RedundancySupport.Transparent` not adopted in v2 | True transparent redundancy needs a VIP/load-balancer in front of the cluster. v1 ships non-transparent (Warm/Hot) with `ServerUriArray` and client-driven failover; v2 inherits the same model. Revisit only if a customer requirement demands LB-fronted transparency | 2026-04-17 |
| 86 | `ApplicationUri` auto-suggested as `urn:{Host}:OtOpcUa` but never auto-rewritten | OPC UA clients pin trust to `ApplicationUri` — it's part of the cert validation chain. Auto-rewriting it when an operator changes `Host` would silently invalidate every client trust relationship. Admin UI prefills on node creation, warns on `Host` change, requires explicit opt-in to change. Fleet-wide unique index enforces no two nodes share an `ApplicationUri` | 2026-04-17 |
| 87 | Concrete schema and stored-proc design lives in `config-db-schema.md` | The plan §4 sketches the conceptual model; the schema doc carries the actual DDL, indexes, stored procs, JSON conventions, and authorization model implementations. Keeps the plan readable while making the schema concrete enough to start implementing | 2026-04-17 |
| 88 | Admin UI is Blazor Server with LDAP-mapped admin roles (FleetAdmin / ConfigEditor / ReadOnly) | Blazor Server gives real-time SignalR for live cluster status without a separate SPA build pipeline. LDAP reuses the OPC UA auth provider (no parallel user table). Three roles cover the common ops split; cluster-scoped editor grants deferred to v2.1 | 2026-04-17 |
| 89 | Edit path is draft → diff → publish; no in-place edits, no auto-publish | Generations are atomic snapshots — every change goes through an explicit publish boundary so operators see what they're committing. The diff viewer is required reading before the publish dialog enables. Bulk operations always preview before commit | 2026-04-17 |
| 90 | Per-node overrides are NOT generation-versioned | Overrides are operationally bound to a specific physical machine, not to the cluster's logical config evolution. Editing a node override doesn't create a new generation — it updates `ClusterNode.DriverConfigOverridesJson` directly and takes effect on next apply. Replacement-node scenarios copy the override via deployment tooling, not by replaying generation history | 2026-04-17 |
| 91 | JSON content validation runs in the Admin app, not via SQL CLR | CLR is disabled by default on hardened SQL Server instances; many DBAs refuse to enable it. Admin validates against per-driver JSON schemas before invoking `sp_PublishGeneration`; the proc enforces structural integrity (FKs, uniqueness, `ISJSON`) only. Direct proc invocation is already prevented by the GRANT model | 2026-04-17 |
| 92 | Dotted-path syntax for `DriverConfigOverridesJson` keys (e.g. `MxAccess.ClientName`) | More readable than JSON Pointer in operator UI and CSV exports. Reserved-char escaping documented (`\.`, `\\`); array indexing uses `Items[0].Name` | 2026-04-17 |
| 93 | `sp_PurgeGenerationsBefore` deferred to v2.1; signature pre-specified | Initial release keeps all generations forever (decision #58). Purge proc shape locked in now: requires `@ConfirmToken` UI-shown random hex to prevent script-based mass deletion, CASCADE-deletes via `WHERE GenerationId IN (...)`, audit-log entry with row counts. Surface only when a customer compliance ask demands it | 2026-04-17 |
| 94 | ~~Admin UI component library = MudBlazor~~ **SUPERSEDED by #102** | (See #102 — switched to Bootstrap 5 for ScadaLink parity) | 2026-04-17 |
| 95 | CSV import dialect = strict CSV (RFC 4180) UTF-8, BOM accepted | Excel "Save as CSV (UTF-8)" produces RFC 4180 output and is the documented primary input format. TSV not initially supported | 2026-04-17 |
| 96 | Push-from-DB notification deferred to v2.1; polling is the v2.0 model | Tightening apply latency from ~30 s → ~1 s would need SignalR backplane or SQL Service Broker — infrastructure not earning its keep at v2.0 scale. Publish dialog reserves a disabled "Push now" button labeled "Available in v2.1" so the future UX is anchored | 2026-04-17 |
| 97 | Draft auto-save (debounced 500 ms) with explicit Discard; Publish is the only commit | Eliminates "lost work" complaints; matches Google Docs / Notion mental model. Auto-save writes to draft rows only — never to Published. Discard requires confirmation dialog | 2026-04-17 |
| 98 | ~~Admin UI ships both light and dark themes~~ **SUPERSEDED by #103** | (See #103 — light-only to match ScadaLink) | 2026-04-17 |
| 99 | CI tiering: PR-CI uses only in-process simulators; nightly/integration CI runs on dedicated Docker + Hyper-V host | Keeps PR builds fast and runnable on minimal build agents; the dedicated integration host runs the heavy simulators (`oitc/modbus-server`, TwinCAT XAR VM, Snap7 Server, libplctag `ab_server`). Operational dependency: stand up the dedicated host before Phase 3 | 2026-04-17 |
| 100 | Studio 5000 Logix Emulate: pre-release validation tier only, no phase-gate | If an org license can be earmarked, designate a golden box for quarterly UDT/Program-scope passes. If not, AB CIP ships validated against `ab_server` only with documented UAT-time fidelity gap. Don't block Phase 4 on procurement | 2026-04-17 |
| 101 | FOCAS Wireshark capture is a Phase 5 prerequisite identified during Phase 4 | Target capture (production CNC, CNC Guide seat, or customer site visit) identified by Phase 4 mid-point; if no target by then, escalate to procurement (CNC Guide license or dev-rig CNC) as a Phase 5 dependency | 2026-04-17 |
| 102 | Admin UI styling = Bootstrap 5 vendored (parity with ScadaLink CentralUI) | Operators using both ScadaLink and OtOpcUa Admin see the same login screen, same sidebar, same component vocabulary. ScadaLink ships Bootstrap 5 with a custom dark-sidebar + light-main aesthetic; mirroring it directly outweighs MudBlazor's Blazor-component conveniences. Supersedes #94 | 2026-04-17 |
| 103 | Admin UI ships single light theme matching ScadaLink (no dark mode in v2.0) | ScadaLink is light-only; cross-app aesthetic consistency outweighs the ergonomic argument for dark mode. Revisit only if ScadaLink adds dark mode. Supersedes #98 | 2026-04-17 |
| 104 | Admin auth pattern lifted directly from ScadaLink: `LdapAuthService` + `RoleMapper` + `JwtTokenService` + cookie auth + `CookieAuthenticationStateProvider` | Same login form, same cookie scheme (30-min sliding), same claim shape (Name, DisplayName, Username, Role[], optional ClusterId[] scope), parallel `/auth/token` endpoint for API clients. Code lives in `ZB.MOM.WW.OtOpcUa.Admin.Security` (sibling of `ScadaLink.Security`); consolidate to a shared NuGet only if it later makes operational sense | 2026-04-17 |
| 105 | Cluster-scoped admin grants ship in v2.0 (lifted from v2.1 deferred list) | ScadaLink already ships the equivalent site-scoped pattern (`PermittedSiteIds` claim, `IsSystemWideDeployment` flag), so we get cluster-scoped grants free by mirroring it. `LdapGroupRoleMapping` table maps groups → role + cluster scope; users without explicit cluster claims are system-wide | 2026-04-17 |
| 106 | Shared component set copied verbatim from ScadaLink CentralUI | `DataTable`, `ConfirmDialog`, `LoadingSpinner`, `ToastNotification`, `TimestampDisplay`, `RedirectToLogin`, `NotAuthorizedView`. New Admin-specific shared components added to our folder rather than diverging from ScadaLink's set, so the shared vocabulary stays aligned | 2026-04-17 |
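Decision #70's threshold rule is concrete enough to sketch. A minimal Python illustration, not the implementation: the 2x/3x multipliers, 200/400 MB floors, and 4 GB ceiling are placeholder numbers invented for the example, and baseline capture and slope detection are elided.

```python
# Sketch of decision #70's hybrid watchdog thresholds. Multiplier, floor,
# and ceiling values are placeholders, not decided values.

def threshold(baseline_mb: float, multiplier: float, floor_mb: float) -> float:
    """max(N x baseline, baseline + floor MB), decision #70's hybrid rule."""
    return max(multiplier * baseline_mb, baseline_mb + floor_mb)

def watchdog_thresholds(baseline_mb: float, hard_ceiling_mb: float = 4096.0) -> dict:
    warn = threshold(baseline_mb, multiplier=2.0, floor_mb=200.0)
    recycle = threshold(baseline_mb, multiplier=3.0, floor_mb=400.0)
    # The absolute hard ceiling caps both, regardless of baseline
    return {
        "warn_mb": min(warn, hard_ceiling_mb),
        "recycle_mb": min(recycle, hard_ceiling_mb),
        "hard_ceiling_mb": hard_ceiling_mb,
    }

# Tiny baseline: the absolute floor dominates, so the multiplier can't misfire
assert watchdog_thresholds(50.0)["warn_mb"] == 250.0
# Large baseline: the multiplier dominates and scales with deployment size
assert watchdog_thresholds(1000.0)["warn_mb"] == 2000.0
```

The point of the hybrid: with a pure multiplier, a 50 MB baseline would warn at 100 MB (noise), while the floor pushes it to 250 MB; at a 1000 MB baseline the multiplier takes over.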
## Reference Documents

- **[Driver Implementation Specifications](driver-specs.md)** — per-driver details: connection settings, addressing, data types, libraries, API mappings, error handling, implementation notes
- **[Test Data Sources](test-data-sources.md)** — per-driver simulator/emulator/stub for development and integration testing
- **[Driver Stability & Isolation](driver-stability.md)** — stability tier model (A/B/C), per-driver hosting decisions, cross-cutting protections, FOCAS and Galaxy deep dives
- **[Central Config DB Schema](config-db-schema.md)** — concrete table definitions, indexes, stored procedures, authorization model, JSON conventions, EF Core migrations approach
- **[Admin Web UI](admin-ui.md)** — Blazor Server admin app: information architecture, page-by-page workflows, per-driver config screen extensibility, real-time updates, UX rules
## Out of Scope / Deferred

-
499
docs/v2/test-data-sources.md
Normal file
@@ -0,0 +1,499 @@
# Test Data Sources — OtOpcUa v2

> **Status**: DRAFT — companion to `plan.md`. Identifies the simulator/emulator/stub each driver will be developed and integration-tested against, so a developer laptop and a CI runner can exercise every driver without physical hardware.
>
> **Branch**: `v2`
> **Created**: 2026-04-17
## Scope

The v2 plan adds eight drivers (Galaxy, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client). Each needs a repeatable, low-friction data source for:

- **Inner-loop development** — a developer running tests on their own machine
- **CI integration tests** — automated runs against a known-good fixture
- **Pre-release fidelity validation** — at least one "golden" rig with the highest-fidelity option available, even if paid/heavy

Two drivers are already covered and are **out of scope** for this document:

| Driver | Existing source | Why no work needed |
|--------|-----------------|--------------------|
| Galaxy | Real System Platform Galaxy on the dev machine | MXAccess requires a deployed ArchestrA Platform anyway; the dev box already has one |
| OPC UA Client | OPC Foundation `ConsoleReferenceServer` from UA-.NETStandard | Reference-grade simulator from the same SDK we depend on; trivial to spin up |

The remaining six drivers are the subject of this document.
## Standard Test Scenario

Each simulator must expose a fixture that lets cross-driver integration tests exercise three orthogonal axes: the **data type matrix**, the **behavior matrix**, and **capability-gated extras**. v1 LMX testing already exercises ~12 Galaxy types plus 1D arrays, security classifications, and historized attributes — the v2 fixture per driver has to reach at least that bar.

### A. Data type matrix (every driver, scalar and array)

Each simulator exposes one tag per cell where the protocol supports the type natively:

| Type family | Scalar | 1D array (small, ~10) | 1D array (large, ~500) | Notes |
|-------------|:------:|:---------------------:|:----------------------:|-------|
| Bool | ✔ | ✔ | — | Discrete subscription test |
| Int16 (signed) | ✔ | ✔ | ✔ | Where protocol distinguishes from Int32 |
| Int32 (signed) | ✔ | ✔ | ✔ | Universal |
| Int64 | ✔ | ✔ | — | Where protocol supports it |
| UInt16 / UInt32 | ✔ | — | — | Where protocol distinguishes signed/unsigned (Modbus, S7) |
| Float32 | ✔ | ✔ | ✔ | Endianness test |
| Float64 | ✔ | ✔ | — | Where protocol supports it |
| String | ✔ | ✔ (Galaxy/AB/TwinCAT) | — | Include empty, ASCII, UTF-8/Unicode, max-length |
| DateTime | ✔ | — | — | Galaxy, TwinCAT, OPC UA Client only |

Large arrays (~500 elements) catch paged-read, fragmentation, and PDU-batching bugs that small arrays miss.
### B. Behavior matrix (applied to a subset of the type matrix)

| Behavior | Applied to | Validates |
|----------|------------|-----------|
| **Static read** | One tag per type in matrix A | Type mapping, value decoding |
| **Ramp** | Int32, Float32 | Subscription delivery cadence, source timestamps |
| **Write-then-read-back** | Bool, Int32, Float32, String | Round-trip per type family, idempotent-write path |
| **Array element write** | Int32[10] | Partial-write paths (where protocol supports them); whole-array replace where it doesn't |
| **Large-array read** | Int32[500] | Paged reads, PDU batching, no truncation |
| **Bool toggle on cadence** | Bool | Discrete subscription, change detection |
| **Bad-quality on demand** | Any tag | Polly circuit-breaker → quality fan-out |
| **Disconnect / reconnect** | Whole simulator | Reconnect, subscription replay, status dashboard, redundancy failover |
### C. Capability-gated extras (only where the driver supports them)

| Extra | Drivers | Fixture requirement |
|-------|---------|---------------------|
| **Security / access levels** | Galaxy, OPC UA Client | At least one read-only and one read-write tag of the same type |
| **Alarms** | Galaxy, FOCAS, OPC UA Client | One alarm that fires after N seconds; one that the test can acknowledge; one that auto-clears |
| **HistoryRead** | Galaxy, OPC UA Client | One historized tag with a known back-fill of >100 samples spanning >1 hour |
| **String edge cases** | All with String support | Empty string, max-length string, embedded nulls, UTF-8 multi-byte chars |
| **Endianness round-trip** | Modbus, S7 | Float32 written by test, read back, byte-for-byte equality |

Each driver section below maps these axes to concrete addresses/tags in that protocol's namespace. Where the protocol has no native equivalent (e.g. Modbus has no String type), the row is marked **N/A** and the driver-side tests skip it.
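As a sketch of how a harness might drive these axes: the following is illustrative only — the capability sets and shape names are invented for the example, abbreviated from the per-driver Native Type Coverage tables below — and shows matrix A expanding into per-driver cases with N/A cells skipped.

```python
# Illustrative only: expand matrix A into (type, shape) test cases per driver,
# skipping cells the protocol has no native equivalent for (the N/A rule above).

MATRIX_A = {
    "Bool":    ("scalar", "small_array"),
    "Int16":   ("scalar", "small_array", "large_array"),
    "Int32":   ("scalar", "small_array", "large_array"),
    "Float32": ("scalar", "small_array", "large_array"),
    "String":  ("scalar", "small_array"),
}

# Abbreviated capability sets (see the per-driver Native Type Coverage tables)
DRIVER_TYPES = {
    "modbus": {"Bool", "Int16", "Int32", "Float32"},  # no native String
    "ab_cip": {"Bool", "Int16", "Int32", "Float32", "String"},
}

def enumerate_cases(driver: str):
    """Yield every (type, shape) cell the driver natively supports."""
    for type_name, shapes in MATRIX_A.items():
        if type_name not in DRIVER_TYPES[driver]:
            continue  # N/A row -- driver-side tests skip it entirely
        for shape in shapes:
            yield (type_name, shape)

assert ("String", "scalar") not in list(enumerate_cases("modbus"))
assert ("String", "scalar") in list(enumerate_cases("ab_cip"))
```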
---
## 1. Modbus TCP (and DL205)

### Recommendation

**Default**: `oitc/modbus-server` Docker image for CI; in-process `NModbus` slave for xUnit fixtures.

Both speak the real Modbus TCP wire protocol. The Docker image is a one-line `docker run` for whole-system tests; the in-proc slave gives per-test deterministic state with no new dependencies (NModbus is already on the driver-side dependency list).
### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **oitc/modbus-server** ([Docker Hub](https://hub.docker.com/r/oitc/modbus-server), [GitHub](https://github.com/cybcon/modbus-server)) | MIT | Docker | YAML preload of all 4 register areas; `docker run -p 502:502` |
| **NModbus `ModbusTcpSlave`** ([GitHub](https://github.com/NModbus/NModbus)) | MIT | In-proc .NET 10 | ~20 LOC fixture; programmatic register control |
| **diagslave** ([modbusdriver.com](https://www.modbusdriver.com/diagslave.html)) | Free (proprietary) | Win/Linux/QNX | Single binary; free mode times out hourly |
| **EasyModbusTCP** | LGPL | .NET / Java / Python | MSI installer |
| **ModbusPal** ([SourceForge](https://sourceforge.net/projects/modbuspal/)) | BSD | Java | Register automation scripting; needs a JVM |
### DL205 Coverage

DL205 PLCs are accessed via H2-ECOM100, which exposes plain Modbus TCP. The `AddressFormat=DL205` feature is purely an octal-to-decimal **address translation** in the driver — the simulator only needs to expose the underlying Modbus registers. Unit-test the translation by preloading specific Modbus addresses (`HR 1024 = V2000`, `DI 15 = X17`, `Coil 8 = Y10`) and asserting the driver reads them via DL205 notation.
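The translation itself is small enough to sketch. Illustrative Python, assuming the area mappings implied by the three examples above (V → holding registers, X → discrete inputs, Y → coils); the real driver's offset handling may differ.

```python
# Illustrative sketch of the DL205 octal-to-decimal address translation.
# Area mapping (V->HR, X->DI, Y->Coil) is assumed from the examples above.

def dl205_to_modbus(address: str) -> tuple[str, int]:
    """Translate a DL205 octal address into (Modbus area, decimal offset)."""
    prefix, digits = address[0], address[1:]
    area = {"V": "HR", "X": "DI", "Y": "Coil"}[prefix]
    return area, int(digits, 8)  # DL205 numbering is octal

assert dl205_to_modbus("V2000") == ("HR", 1024)
assert dl205_to_modbus("X17") == ("DI", 15)
assert dl205_to_modbus("Y10") == ("Coil", 8)
```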
### Native Type Coverage

Modbus has no native String, DateTime, or Int64 — those rows are skipped on this driver. Native primitives are coil/discrete-input (Bool) and 16-bit registers; everything wider is composed from contiguous registers with explicit byte/word ordering.

| Type | Modbus mapping | Supported |
|------|----------------|:---------:|
| Bool | Coil / DI | ✔ |
| Int16 / UInt16 | One HR/IR | ✔ |
| Int32 / UInt32 | Two HR (big-endian word) | ✔ |
| Float32 | Two HR | ✔ |
| Float64 | Four HR | ✔ |
| String | — | N/A |
| DateTime | — | N/A |
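How the "two HR" composition works can be sketched with the harness default byte order (big-endian word, big-endian byte — the convention pinned in Gotchas). Illustrative only, not the driver's decoder.

```python
import struct

# Illustrative: compose/decompose a Float32 across two 16-bit holding
# registers, big-endian word order and big-endian byte order.

def registers_to_float32(hi: int, lo: int) -> float:
    """Decode two holding registers (high word first) as an IEEE-754 float."""
    return struct.unpack(">f", struct.pack(">HH", hi, lo))[0]

def float32_to_registers(value: float) -> tuple[int, int]:
    """Encode an IEEE-754 float into two holding registers, high word first."""
    return struct.unpack(">HH", struct.pack(">f", value))

# The "Endianness round-trip" behavior row: write, read back, exact equality
hi, lo = float32_to_registers(1.5)
assert (hi, lo) == (0x3FC0, 0x0000)  # 1.5 == 0x3FC00000
assert registers_to_float32(hi, lo) == 1.5
```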
### Standard Scenario Mapping

| Axis | Address |
|------|---------|
| Bool scalar / Bool[10] | Coil 1000 / Coils 1010–1019 |
| Int16 scalar / Int16[10] / Int16[500] | HR 0 / HR 10–19 / HR 500–999 |
| Int32 scalar / Int32[10] | HR 2000–2001 / HR 2010–2029 |
| UInt16 scalar | HR 50 |
| UInt32 scalar | HR 60–61 |
| Float32 scalar / Float32[10] / Float32[500] | HR 3000–3001 / HR 3010–3029 / HR 4000–4999 |
| Float64 scalar | HR 5000–5003 |
| Ramp (Int32) | HR 100–101 — 0→1000 @ 1 Hz |
| Ramp (Float32) | HR 110–111 — sine wave |
| Write-read-back (Bool / Int32 / Float32) | Coil 1100 / HR 2100–2101 / HR 3100–3101 |
| Array element write (Int32[10]) | HR 2200–2219 |
| Bool toggle on cadence | Coil 0 — toggles @ 2 Hz |
| Endianness round-trip (Float32) | HR 6000–6001, written then read |
| Bad on demand | Coil 99 — write `1` to make the slave drop the TCP socket |
| Disconnect | restart container / dispose in-proc slave |
### Gotchas

- **Byte order** is simulator-configurable. Pin a default in our test harness (big-endian word, big-endian byte) and document it.
- **diagslave free mode** restarts every hour — fine for inner-loop, not CI.
- **The Docker image defaults all registers to 0** — ship a YAML config in the test repo.
---
## 2. Allen-Bradley CIP (ControlLogix / CompactLogix)

### Recommendation

**Default**: `ab_server` from the libplctag repo. Real CIP-over-EtherNet/IP, written by the same project that owns the libplctag NuGet our driver consumes — every tag shape the simulator handles is one the driver can address.

**Pre-release fidelity tier**: Studio 5000 Logix Emulate on one designated "golden" dev box for cases that need full UDT / Program-scope fidelity. Not a default because of cost (~$1k+ Pro-edition add-on) and toolchain weight.
### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **ab_server** ([libplctag](https://github.com/libplctag/libplctag), [kyle-github/ab_server](https://github.com/kyle-github/ab_server)) | MIT | Win/Linux/macOS | Build from source; CI-grade fixture |
| **Studio 5000 Logix Emulate** | Rockwell paid (~$1k+) | Windows | 100% firmware fidelity |
| **Factory I/O + PLCSIM** | Paid | Windows | Visual sim, not raw CIP |
### Native Type Coverage

| Type | CIP mapping | Supported by ab_server |
|------|-------------|:----------------------:|
| Bool | BOOL | ✔ |
| Int16 | INT | ✔ |
| Int32 | DINT | ✔ |
| Int64 | LINT | ✔ |
| Float32 | REAL | ✔ |
| Float64 | LREAL | ✔ |
| String | STRING (built-in struct) | ✔ basic only |
| DateTime | — | N/A |
| UDT | user-defined STRUCT | not in ab_server CI scope |
### Standard Scenario Mapping

| Axis | Tag |
|------|-----|
| Bool scalar / Bool[10] | `bTest` / `abTest[10]` |
| Int16 scalar / Int16[10] | `iTest` / `aiTest[10]` |
| Int32 scalar / Int32[10] / Int32[500] | `diTest` / `adiTest[10]` / `adiBig[500]` |
| Int64 scalar | `liTest` |
| Float32 scalar / Float32[10] / Float32[500] | `Motor1_Speed` / `aReal[10]` / `aRealBig[500]` |
| Float64 scalar | `Motor1_Position` (LREAL) |
| String scalar / String[10] | `sIdentity` / `asNames[10]` |
| Ramp (Float32) | `Motor1_Speed` (0→60 @ 0.5 Hz) |
| Ramp (Int32) | `StepCounter` (0→10000 @ 1 Hz) |
| Write-read-back (Bool / Int32 / Float32 / String) | `bWriteTarget` / `StepIndex` / `rSetpoint` / `sLastWrite` |
| Array element write (Int32[10]) | `adiWriteTarget[10]` |
| Bool toggle on cadence | `Flags[0]` toggling @ 2 Hz; `Flags[1..15]` latched |
| Bad on demand | Test harness flag that makes ab_server refuse the next read |
| Disconnect | Stop the ab_server process |
### Gotchas

- **ab_server tag-type coverage is finite** (BOOL, DINT, REAL, arrays, basic strings). UDTs and `Program:` scoping are not fully implemented. Document an "ab_server-supported tag set" in the harness and exclude the rest from default CI; UDT coverage moves to the Studio 5000 Emulate golden-box tier.
- CIP has no native subscriptions, so polling behavior matches real hardware.
---
## 3. Allen-Bradley Legacy (SLC 500 / MicroLogix, PCCC)

### Recommendation

**Default**: `ab_server` in PCCC mode, with a small in-repo PCCC stub for any file types ab_server doesn't fully cover (notably Timer/Counter `.ACC`/`.PRE`/`.DN` decomposition).

The same binary covers AB CIP and AB Legacy via a `plc=slc500` (or `micrologix`) flag, so we get one fixture for two drivers. If the timer/counter fidelity is too thin in practice, fall back to a ~200-line `TcpListener` stub answering the specific PCCC function codes the driver issues.
### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **ab_server PCCC mode** | MIT | cross-platform | Same binary as AB CIP; partial T/C/R structure fidelity |
| **Rockwell RSEmulate 500** | Rockwell legacy paid | Windows | EOL; ages poorly on modern Windows |
| **In-repo PCCC stub** | Own | .NET 10 | Fallback only — covers just the function codes the driver issues |
### Native Type Coverage

PCCC types are file-based. Int32, Float64, and DateTime are not native to SLC/MicroLogix.

| Type | PCCC mapping | Supported |
|------|--------------|:---------:|
| Bool | `B3:n/b` (bit in B file) | ✔ |
| Int16 | `N7:n` | ✔ |
| Int32 | — (decomposed in the driver from two N words) | partial |
| Float32 | `F8:n` | ✔ |
| String | `ST9:n` | ✔ |
| Timer struct | `T4:n.ACC` / `.PRE` / `/DN` | ✔ |
| Counter struct | `C5:n.ACC` / `.PRE` / `/DN` | ✔ |
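The file-based addressing above can be sketched as a parser. Illustrative only: the regex and result shape are invented for the example and are not the driver's parser.

```python
import re

# Illustrative parser for the PCCC logical addresses used in this fixture:
# word elements (N7:0), bits (B3:0/0), Timer/Counter members (T4:0.ACC, T4:0/DN).

ADDR = re.compile(
    r"^(?P<file>[A-Z]+)(?P<num>\d+):(?P<elem>\d+)"
    r"(?:\.(?P<member>ACC|PRE)|/(?P<bit>\w+))?$"
)

def parse_pccc(address: str) -> dict:
    m = ADDR.match(address)
    if m is None:
        raise ValueError(f"not a PCCC address: {address}")
    parts = m.groupdict()
    parts["num"] = int(parts["num"])
    parts["elem"] = int(parts["elem"])
    return parts

assert parse_pccc("N7:100") == {"file": "N", "num": 7, "elem": 100,
                                "member": None, "bit": None}
assert parse_pccc("B3:0/0")["bit"] == "0"
assert parse_pccc("T4:0.ACC")["member"] == "ACC"
assert parse_pccc("T4:0/DN")["bit"] == "DN"
```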
### Standard Scenario Mapping

| Axis | Address |
|------|---------|
| Bool scalar / Bool[16] | `B3:0/0` / `B3:0` (treated as a bit array) |
| Int16 scalar / Int16[10] / Int16[500] | `N7:0` / `N7:0..9` / `N10:0..499` (separate file) |
| Float32 scalar / Float32[10] | `F8:0` / `F8:0..9` |
| String scalar / String[10] | `ST9:0` / `ST9:0..9` |
| Ramp (Int16) | `N7:1` 0→1000 |
| Ramp (Float32) | `F8:1` sine wave |
| Write-read-back (Bool / Int16 / Float32 / String) | `B3:1/0` / `N7:100` / `F8:100` / `ST9:100` |
| Array element write (Int16[10]) | `N7:200..209` |
| Timer fidelity | `T4:0.ACC`, `T4:0.PRE`, `T4:0/DN` |
| Counter fidelity | `C5:0.ACC`, `C5:0.PRE`, `C5:0/DN` |
| Connection-limit refusal | Driver harness toggle to simulate the 4-connection limit |
| Bad on demand | Connection-refused toggle |
### Gotchas

- **Real SLC/MicroLogix PLCs enforce 4–8 connection limits**; ab_server does not. Add a test-only toggle in the driver (or in the stub) to refuse connections so we exercise the queuing path.
- Timer/Counter structures are the most likely place ab_server fidelity falls short — design the test harness so we can drop in a stub for those specific files without rewriting the rest.
---
## 4. Siemens S7 (S7-300/400/1200/1500)

### Recommendation

**Default**: Snap7 Server. Real S7comm over ISO-on-TCP, free, cross-platform, and the same wire protocol the S7netplus driver emits.

**Pre-release fidelity tier**: PLCSIM Advanced on one golden dev box (7-day renewable trial; paid for production). Required for true firmware-level validation and for testing programs that include actual ladder logic.
### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **Snap7 Server** ([snap7](https://snap7.sourceforge.net/snap7_server.html)) | LGPLv3 | Win/Linux/macOS, 32/64-bit | CP emulator; no PLC logic execution |
| **PLCSIM Advanced** ([Siemens](https://www.siemens.com/en-us/products/simatic/s7-plcsim-advanced/)) | Siemens trial / paid | Windows + VM | Full S7-1500 fidelity; runs TIA programs |
| **S7-PLCSIM (classic)** | Bundled with TIA | Windows | S7-300/400; no external S7comm without PLCSIM Advanced |
### Native Type Coverage

S7 has a rich native type system; Snap7 supports wire-level read/write of all of them via DB byte access.

| Type | S7 mapping | Notes |
|------|------------|-------|
| Bool | `M0.0`, `DBn.DBXm.b` | ✔ |
| Byte / Word / DWord | `DBn.DBB`, `.DBW`, `.DBD` | unsigned |
| Int (Int16) / DInt (Int32) | `DBn.DBW`, `.DBD` | signed, big-endian |
| LInt (Int64) | `DBn.DBLW` | S7-1500 only |
| Real (Float32) / LReal (Float64) | `.DBD`, `.DBLW` | big-endian IEEE |
| String | `DBn.DBB[]` (length-prefixed: max + actual + chars) | length-prefixed |
| Char / WChar | byte / word with character semantics | |
| Date / Time / DT / TOD | structured byte layouts | |
|
||||
|
||||
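The length-prefixed STRING layout (one byte of declared max length, one byte of actual length, then the characters) is easy to get wrong; a minimal encoder/decoder sketch for test fixtures, assuming ASCII payloads (function names ours):

```python
def encode_s7_string(s: str, max_len: int = 254) -> bytes:
    """S7 STRING wire layout: [declared max][actual length][chars]."""
    data = s.encode("ascii")
    if not 0 < max_len <= 254 or len(data) > max_len:
        raise ValueError("string exceeds declared max (or max out of range)")
    return bytes([max_len, len(data)]) + data

def decode_s7_string(buf: bytes) -> str:
    """Inverse: reads only `actual` chars, ignoring slack up to `max`."""
    max_len, actual = buf[0], buf[1]
    if actual > max_len or len(buf) < 2 + actual:
        raise ValueError("malformed S7 STRING")
    return buf[2:2 + actual].decode("ascii")
```

A round-trip assertion over these two functions is a cheap CI check that the driver and the Snap7 fixture agree on the layout.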
### Standard Scenario Mapping

All in `DB1` unless noted; host script provides ramp behavior since Snap7 has no logic.

| Axis | Address |
|------|---------|
| Bool scalar / Bool[16] | `M0.0` / `DB1.DBX0.0..1.7` |
| Int16 scalar / Int16[10] / Int16[500] | `DB1.DBW10` / `DB1.DBW20..38` / `DB2.DBW0..998` |
| Int32 scalar / Int32[10] | `DB1.DBD100` / `DB1.DBD110..146` |
| Int64 scalar | `DB1.DBLW200` |
| UInt16 / UInt32 | `DB1.DBW300` / `DB1.DBD310` |
| Float32 scalar / Float32[10] / Float32[500] | `DB1.DBD400` / `DB1.DBD410..446` / `DB3.DBD0..1996` |
| Float64 scalar | `DB1.DBLW500` |
| String scalar / String[10] | `DB1.STRING600` (max 254) / `DB1.STRING700..` |
| DateTime scalar | `DB1.DT800` |
| Ramp (Int16) | `DB1.DBW10` 0→1000 @ 1 Hz (host script) |
| Ramp (Float32) | `DB1.DBD400` sine (host script) |
| Write-read-back (Bool / Int16 / Float32 / String) | `M1.0` / `DB1.DBW900` / `DB1.DBD904` / `DB1.STRING908` |
| Array element write (Int16[10]) | `DB1.DBW1000..1018` |
| Endianness round-trip (Float32) | `DB1.DBD1100` |
| Big-endian Int32 check | `DB2.DBD0` |
| PUT/GET disabled simulation | Refuse-connection toggle |
| Bad on demand | Stop Snap7 host process |
| Re-download | Swap DB definitions (exercises symbol-version handling) |

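The host-side ramp script can keep its value computation pure so the same logic drives either a python-snap7 client or an in-proc .NET writer (Python and the function name are illustrative; the real harness is .NET). Offsets match the table above; S7 byte order is big-endian:

```python
import math
import struct

def ramp_writes(tick: int) -> list[tuple[int, int, bytes]]:
    """(db_number, byte_offset, payload) triples to push on each 1 Hz tick."""
    ramp = (tick * 100) % 1001                 # Int16 ramp 0→1000 at DB1.DBW10
    sine = math.sin(tick)                      # Float32 sine at DB1.DBD400
    return [
        (1, 10, struct.pack(">h", ramp)),      # big-endian Int16
        (1, 400, struct.pack(">f", sine)),     # big-endian IEEE 754 Real
    ]
```

Each triple then feeds a DB write call on the Snap7 client (in python-snap7 terms, `client.db_write(db, offset, data)`) inside a one-second timer loop.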
### Gotchas

- **Snap7 is not a SoftPLC** — no logic runs. Ramps must be scripted by the test host writing to a DB on a timer.
- **PUT/GET enforcement** is a property of real S7-1200/1500 (disabled by default in TIA). Snap7 doesn't enforce it. Add a test case that simulates "PUT/GET disabled" via a deliberately refused connection.
- **Snap7 binary bitness**: some distributions are 32-bit only — match the test harness bitness.
- **PLCSIM Advanced in VMs** is officially supported but commonly trips over nested virtualization and time-sync.

---

## 5. Beckhoff TwinCAT (ADS)

### Recommendation

**Default**: TwinCAT XAR runtime in a dev VM under Beckhoff's 7-day renewable dev/test trial. Required because TwinCAT is the only one of the three native-subscription drivers (Galaxy, TwinCAT, OPC UA Client) without a separate stub option — exercising native ADS notifications without a real XAR would hide the most important driver bugs.

The OtOpcUa test process talks to the VM over the network using `Beckhoff.TwinCAT.Ads` v6's in-process router (`AmsTcpIpRouter`), so individual dev machines and CI runners don't need the full TwinCAT stack installed locally.

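For orientation on AMS addressing (conventions, not a substitute for the Beckhoff docs): an AmsNetId is six dotted byte values, typically the target's IPv4 address with `.1.1` appended; the first PLC runtime instance listens on ADS port 851, and the router speaks classic ADS over TCP 48898. A tiny validator/packer the test setup scripts can share (Python for illustration):

```python
def pack_ams_netid(netid: str) -> bytes:
    """Pack a dotted AmsNetId (e.g. '192.168.10.5.1.1') into its 6-byte wire form."""
    parts = netid.split(".")
    if len(parts) != 6:
        raise ValueError(f"AmsNetId needs 6 fields, got {len(parts)}: {netid!r}")
    values = []
    for p in parts:
        if not p.isdigit() or not 0 <= int(p) <= 255:
            raise ValueError(f"bad AmsNetId field {p!r} in {netid!r}")
        values.append(int(p))
    return bytes(values)

ADS_PORT_PLC_RT1 = 851   # first PLC runtime instance (TwinCAT 3 convention)
AMS_ROUTER_TCP = 48898   # classic ADS-over-TCP router port
```

Validating route parameters this way before calling `AddRoute` keeps route-table failures (see Gotchas below) distinguishable from driver bugs.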
### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **TwinCAT 3 XAE + XAR** ([Beckhoff](https://www.beckhoff.com/en-us/products/automation/twincat/), [licensing](https://infosys.beckhoff.com/content/1033/tc3_licensing/921947147.html)) | Free dev download; 7-day renewable trial; paid for production | Windows + Hyper-V/VMware | Full ADS fidelity with real PLC runtime |
| **Beckhoff.TwinCAT.Ads.TcpRouter** ([NuGet](https://www.nuget.org/packages/Beckhoff.TwinCAT.Ads.TcpRouter)) | Free, bundled | NuGet, in-proc | Router only — needs a real XAR on the other end |
| **TwinCAT XAR in Docker** ([Beckhoff/TC_XAR_Container_Sample](https://github.com/Beckhoff/TC_XAR_Container_Sample)) | Same trial license; no prebuilt image | **Linux host with PREEMPT_RT** | Evaluated and rejected — see "Why not Docker" below |
| **Roll-our-own ADS stub** | Own | .NET 10 | Would have to fake notifications; significant effort |

### Why not Docker (evaluated 2026-04-17)

Beckhoff publishes an [official sample](https://github.com/Beckhoff/TC_XAR_Container_Sample) for running XAR in a container, but it's not a viable replacement for the VM in our environment. Four blockers:

1. **Linux-only host with PREEMPT_RT.** The container is a Linux container that requires a Beckhoff RT Linux host (or equivalent PREEMPT_RT kernel). Docker Desktop on Windows forces Hyper-V, which [TwinCAT runtime cannot coexist with](https://hellotwincat.dev/disable-hyper-v-vs-twincat-problem-solution/). Our CI and dev boxes are Windows.
2. **ADS-over-MQTT, not classic TCP/48898.** The official sample exposes ADS through a containerized mosquitto broker. Real field deployments use TCP/48898; testing against MQTT reduces the fidelity we're paying for.
3. **XAE-on-Windows still required for project deployment.** No headless `.tsproj` deploy path exists. We don't escape the Windows dependency by going to Docker.
4. **Same trial license either way.** No licensing win — 7-day renewable applies identically to bare-metal XAR and containerized XAR.

Revisit if Beckhoff publishes a prebuilt image with classic TCP ADS exposure, or if our CI fleet gains a Linux RT runner. Until then, Windows VM with XAR + XAE + trial license is the pragmatic answer.

### Native Type Coverage

TwinCAT exposes the full IEC 61131-3 type system; the test PLC project includes one symbol per cell.

| Type | TwinCAT mapping | Supported |
|------|-----------------|:---------:|
| Bool | `BOOL` | ✔ |
| Int16 / UInt16 | `INT` / `UINT` | ✔ |
| Int32 / UInt32 | `DINT` / `UDINT` | ✔ |
| Int64 / UInt64 | `LINT` / `ULINT` | ✔ |
| Float32 / Float64 | `REAL` / `LREAL` | ✔ |
| String | `STRING(255)` | ✔ |
| WString | `WSTRING(255)` | ✔ Unicode coverage |
| DateTime | `DT`, `DATE`, `TOD`, `TIME`, `LTIME` | ✔ |
| STRUCT / ENUM / ALIAS | user-defined | ✔ |

### Standard Scenario Mapping

In a tiny test project — `MAIN` (PLC code) + `GVL` (constants and write targets):

| Axis | Symbol |
|------|--------|
| Bool scalar / Bool[10] | `GVL.bTest` / `GVL.abTest` |
| Int16 / Int32 / Int64 scalars | `GVL.iTest` / `GVL.diTest` / `GVL.liTest` |
| UInt16 / UInt32 scalars | `GVL.uiTest` / `GVL.udiTest` |
| Int32[10] / Int32[500] | `GVL.adiTest` / `GVL.adiBig` |
| Float32 / Float64 scalars | `GVL.rTest` / `GVL.lrTest` |
| Float32[10] / Float32[500] | `GVL.arTest` / `GVL.arBig` |
| String / WString / String[10] | `GVL.sIdentity` / `GVL.wsIdentity` / `GVL.asNames` |
| DateTime (DT) | `GVL.dtTimestamp` |
| STRUCT member access | `GVL.fbMotor.rSpeed` (REAL inside FB) |
| Ramp (DINT) | `MAIN.nRamp` — PLC increments each cycle |
| Ramp (REAL) | `MAIN.rSine` — PLC computes sine |
| Write-read-back (Bool / DINT / REAL / STRING / WSTRING) | `GVL.bWriteTarget` / `GVL.diWriteTarget` / `GVL.rWriteTarget` / `GVL.sWriteTarget` / `GVL.wsWriteTarget` |
| Array element write (DINT[10]) | `GVL.adiWriteTarget` |
| Native ADS notification | every scalar above subscribed via OnDataChange |
| Bad on demand | Stop the runtime — driver gets port-not-found |
| Re-download | Re-deploy the project to exercise symbol-version-changed (`0x0702`) |

### Gotchas

- **AMS route table** — XAR refuses ADS connections from unknown hosts. Test setup must add a backroute for each dev machine and CI runner (scriptable via `AddRoute` on the NuGet API).
- **7-day trial reset** requires a click in the XAE UI; investigate scripting it for unattended CI.
- **Symbol-version-changed** is the hardest path to exercise — needs a PLC re-download mid-test, so structure the integration suite to accommodate that step.

---

## 6. FANUC FOCAS (FOCAS2)

### Recommendation

**No good off-the-shelf simulator exists. Build two test artifacts** that cover different layers of the FOCAS surface:

1. **`Driver.Focas.TestStub`** — a TCP listener mimicking a real CNC over the FOCAS wire protocol. Covers functional behavior (reads, writes, ramps, alarms, network failures).
2. **`Driver.Focas.FaultShim`** — a test-only native DLL that masquerades as `Fwlib64.dll` and injects faults inside the host process (AVs, handle leaks, orphan handles). Covers the stability-recovery paths in `driver-stability.md` that the TCP stub physically cannot exercise.

CNC Guide is the only off-the-shelf FOCAS-capable simulator, and gating every dev rig on a FANUC purchase isn't viable. There are no open-source FOCAS server stubs at useful fidelity. The FOCAS SDK license is already secured (decision #61), so we own the API contract — build both artifacts ourselves against captured Wireshark traces from a real CNC.

### Artifact 1 — TCP Stub (functional coverage)

A `TcpListener` on port 8193 that answers only the FOCAS2 functions the driver P/Invokes:

```
cnc_allclibhndl3, cnc_freelibhndl, cnc_sysinfo, cnc_statinfo,
cnc_actf, cnc_acts, cnc_absolute, cnc_machine, cnc_rdaxisname,
cnc_rdspmeter, cnc_rdprgnum, cnc_rdparam, cnc_rdalmmsg,
pmc_rdpmcrng, cnc_rdmacro, cnc_getfigure
```

Capture the wire framing once against a real CNC (or a colleague's CNC Guide seat); the stub then becomes a fixed reference point. For pre-release validation, run the driver against a real CNC.

**Covers**: read/write/poll behavior, scaled-integer round-trip, alarm fire/clear, network slowness, network hang, network disconnect, FOCAS-error-code → StatusCode mapping. Roughly 80% of real-world FOCAS failure modes.

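The stub's dispatch core can be sketched independently of the proprietary framing — every opcode value and reply byte below is a placeholder until the Wireshark captures pin the real layouts down (Python for illustration; the real stub is .NET):

```python
from typing import Callable, Dict

# opcode -> reply builder; real opcodes and frame layouts come from captures
HANDLERS: Dict[int, Callable[[bytes], bytes]] = {}

def handles(opcode: int):
    """Register a reply builder for one FOCAS function's request frame."""
    def register(fn: Callable[[bytes], bytes]) -> Callable[[bytes], bytes]:
        HANDLERS[opcode] = fn
        return fn
    return register

ERR_UNSUPPORTED = bytes([0xFF])  # placeholder error frame for unknown functions

def dispatch(opcode: int, payload: bytes) -> bytes:
    """Route one decoded request frame to its handler, or fail loudly."""
    handler = HANDLERS.get(opcode)
    return handler(payload) if handler else ERR_UNSUPPORTED

@handles(0x01)  # placeholder opcode standing in for cnc_allclibhndl3
def alloc_handle(_payload: bytes) -> bytes:
    return bytes([0x00, 0x01])  # EW_OK-style status + fake handle id
```

The table-driven shape keeps each of the sixteen functions in the list above a small, independently testable handler, and makes "unknown function" a first-class test case.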
### Artifact 2 — FaultShim (native fault injection, host-side)

A separate test-only native DLL named `Fwlib64.dll` that exports the same function surface but, instead of calling FANUC's library, performs configurable fault behaviors: a deliberate AV at a chosen call site, returning success but never releasing allocated buffers (memory leak), accepting `cnc_freelibhndl` while keeping the handle table populated (orphan handle), or simulating a wedged native call that never returns.

Activated by DLL search-path order in the test fixture only; production builds load FANUC's real `Fwlib64.dll`. The host code is unchanged — it just experiences different symptoms depending on which DLL is loaded.

**Covers**: supervisor respawn after AV, post-mortem MMF readability after hard crash, watchdog → recycle path on simulated leak, abandoned-handle path when a wedged native call exceeds recycle grace. The remaining ~20% of failure modes that live below the network layer.

### What neither artifact covers

Vendor-specific Fwlib quirks that depend on the real `Fwlib64.dll` interacting with a real CNC firmware version. These remain hardware/manual-test-only and are validated on the pre-release real-CNC tier, not in CI.

### Options Evaluated

| Option | License | Platform | Notes |
|--------|---------|----------|-------|
| **FANUC CNC Guide** ([FANUC](https://www.fanuc.co.jp/en/product/cnc/f_ncguide.html)) | Paid, dealer-ordered | Windows | High fidelity; FOCAS-over-Ethernet not enabled in all editions |
| **FANUC Roboguide** | Paid | Windows | Robot-focused, not CNC FOCAS |
| **MTConnect agents** | various | — | Different protocol; not a FOCAS source |
| **Public FOCAS stubs** | — | — | None at useful fidelity |
| **In-repo TCP stub + FaultShim DLL** | Own | .NET 10 + native | Recommended path — two artifacts, see above |

### Native Type Coverage

FOCAS does not have a tag system in the conventional sense — it has a fixed set of API calls returning structured CNC data. Tag families the driver exposes:

| Type | FOCAS source | Notes |
|------|--------------|-------|
| Bool | PMC bit | discrete inputs/outputs |
| Int16 / Int32 | PMC R/D word & dword, status fields | |
| Int64 | composed from PMC | rare |
| Float32 / Float64 | macros (`cnc_rdmacro`), some params | |
| Scaled integer | position values + `cnc_getfigure()` decimal places | THE FOCAS-specific bug surface |
| String | alarm messages, program names | length-bounded |
| Array | PMC ranges (`pmc_rdpmcrng`), per-axis arrays | |

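The scaled-integer conversion itself is trivial, but because `cnc_getfigure()` supplies the decimal-place count at runtime, it is worth pinning with round-trip tests — a sketch (function names ours):

```python
def to_engineering(raw: int, decimals: int) -> float:
    """Convert a FOCAS scaled-integer position to engineering units.

    `decimals` comes from cnc_getfigure(); e.g. raw=100000 with
    decimals=3 is 100.000 mm.
    """
    return raw / 10 ** decimals

def to_raw(value: float, decimals: int) -> int:
    """Inverse, for writes. Round rather than truncate so float
    representation error (99.999 -> 99998.999...) doesn't drop a count."""
    return round(value * 10 ** decimals)
```

A property-style test (`to_raw(to_engineering(n, d), d) == n` for every axis's `d`) catches the classic off-by-one-decimal bug the stub must exercise.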
### Standard Scenario Mapping

| Axis | Element |
|------|---------|
| Static identity (struct) | `cnc_sysinfo` — series, version, axis count |
| Bool scalar / Bool[16] | PMC `G0.0` / PMC `G0` (bits 0–15) |
| Int16 / Int32 PMC scalars | PMC `R200` / PMC `D300` |
| Int32 PMC array (small / large) | PMC `R1000..R1019` / PMC `R5000..R5499` |
| Float64 macro variable | macro `#100` |
| Macro array | macro `#500..#509` |
| String | active program name; alarm message text |
| Scaled integer round-trip | X-axis position (decimal-place conversion via `cnc_getfigure`) |
| State machine | `RunState` cycling Stop → Running → Hold |
| Ramp (scaled int) | X-axis position 0→100.000 mm |
| Step (Int32) | `ActualFeedRate` stepping on `cnc_actf` |
| Write-read-back (PMC Int32) | PMC `R100` 32-bit scratch register |
| PMC array element write | PMC `R200..R209` |
| Alarms | one alarm appears after N seconds; one is acknowledgeable; one auto-clears |
| Bad on demand | Stub closes the socket on a marker request |

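The PMC range notation used above (`R1000..R1019`, `D300`) is our own spec grammar, not a FANUC one, so the harness needs a small parser for it — a sketch:

```python
import re

# PMC address areas: G, F, Y, X, A, R, T, K, C, D, E
_PMC_RANGE = re.compile(r"^([GFYXARTKCDE])(\d+)(?:\.\.([GFYXARTKCDE])(\d+))?$")

def parse_pmc_range(spec: str) -> tuple[str, int, int]:
    """Parse 'R1000..R1019' -> ('R', 1000, 20); 'D300' -> ('D', 300, 1)."""
    m = _PMC_RANGE.match(spec)
    if not m:
        raise ValueError(f"bad PMC spec: {spec!r}")
    area, start = m.group(1), int(m.group(2))
    if m.group(4) is None:
        return area, start, 1                      # bare scalar address
    if m.group(3) != area:
        raise ValueError(f"mixed PMC areas in {spec!r}")
    end = int(m.group(4))
    if end < start:
        raise ValueError(f"descending PMC range in {spec!r}")
    return area, start, end - start + 1
```

The `(area, start, count)` triple maps directly onto a `pmc_rdpmcrng` call's address-type, start, and length arguments.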
### Gotchas

- **FOCAS wire framing is proprietary** — stub fidelity depends entirely on Wireshark captures from a real CNC. Plan to do that capture early.
- **Fwlib is thread-unsafe per handle** — the stub must serialize requests so we don't accidentally test threading behavior the driver can't rely on in production.
- **Scaled-integer position values** require the stub to return a credible `cnc_getfigure()` so the driver's decimal-place conversion is exercised.

---

## Summary

| Driver | Primary | License | Fallback / fidelity tier |
|--------|---------|---------|---------------------------|
| Galaxy | Real Galaxy on dev box | — | (n/a — already covered) |
| Modbus TCP / DL205 | `oitc/modbus-server` + NModbus in-proc | MIT | diagslave for wire-inspection |
| AB CIP | libplctag `ab_server` | MIT | Studio 5000 Logix Emulate (golden box) |
| AB Legacy | `ab_server` PCCC mode + in-repo PCCC stub | MIT | Real SLC/MicroLogix on lab rig |
| S7 | Snap7 Server | LGPLv3 | PLCSIM Advanced (golden box) |
| TwinCAT | TwinCAT XAR in dev VM | Free trial | — |
| FOCAS | **In-repo `Driver.Focas.TestStub` (TCP)** + `Driver.Focas.FaultShim` (native DLL) | Own code | CNC Guide / real CNC pre-release |
| OPC UA Client | OPC Foundation `ConsoleReferenceServer` | OPC Foundation | — |

Six of eight drivers have a free, scriptable, cross-platform test source we can check into CI. TwinCAT requires a VM but no recurring cost. FOCAS is the one case with no public answer — we own the stub. The driver specs in `driver-specs.md` enumerate every API call we make, which scopes the FOCAS stub.

## Resolved Defaults

The questions raised by the initial draft are resolved as planning defaults below. Each carries an operational dependency that needs site/team confirmation before Phase 1 work depends on it; flagged inline so the dependency stays visible.

- **CI tiering: PR-CI uses only in-process simulators; nightly/integration CI runs on a dedicated host with Docker + TwinCAT VM.** PR builds need to be fast and need to run on minimal Windows/Linux build agents; standardizing on the in-process subset (`NModbus` server fixture for Modbus, OPC Foundation `ConsoleReferenceServer` in-process for OPC UA Client, and the FOCAS TCP stub from the test project) covers ~70% of cross-driver behavior with no infrastructure dependency. Anything needing Docker (`oitc/modbus-server`), the TwinCAT XAR VM, the libplctag `ab_server` binary, or the Snap7 Server runs on a single dedicated integration host that runs the full suite nightly and on demand. **Operational dependency**: stand up one Windows host with Docker Desktop + Hyper-V before Phase 3 (Modbus driver) — without it, integration tests for Modbus/AB CIP/AB Legacy/S7/TwinCAT all queue behind the same scarcity.
- **Studio 5000 Logix Emulate: not assumed in CI; pre-release validation only.** Don't gate any phase on procuring a license. If an existing org license can be earmarked, designate one Windows machine as the AB CIP golden box and run a quarterly UDT/Program-scope fidelity pass against it. If no license is available, the AB CIP driver ships validated against `ab_server` only, with a documented limitation that UDTs and `Program:` scoping are exercised at customer sites during UAT, not in our CI.
- **FANUC CNC Wireshark captures: scheduled as a Phase 5 prerequisite.** During Phase 4 (PLC drivers), the team identifies a target CNC — a production machine accessible during a maintenance window, a colleague's CNC Guide seat, or a customer who'll allow a one-day on-site capture. Capture the wire framing for every FOCAS function in the call list (per `driver-stability.md` §FOCAS) plus a 30-min poll trace, before Phase 5 starts. If no target is identified by the Phase 4 mid-point, escalate to procurement: a CNC Guide license seat (one-time cost) or a small dev-rig CNC purchase becomes a Phase 5 dependency.