diff --git a/docs/AddressSpace.md b/docs/AddressSpace.md
index a687687..eba3ad3 100644
--- a/docs/AddressSpace.md
+++ b/docs/AddressSpace.md
@@ -1,82 +1,72 @@
 # Address Space
 
-The address space maps the Galaxy object hierarchy and attribute definitions into an OPC UA browse tree. `LmxNodeManager` builds the tree from data queried by `GalaxyRepositoryService`, while `AddressSpaceBuilder` provides a testable in-memory model of the same structure.
+Each driver's browsable subtree is built by streaming nodes from the driver's `ITagDiscovery.DiscoverAsync` implementation into an `IAddressSpaceBuilder`. `GenericDriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs`) owns the shared orchestration; `DriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) implements `IAddressSpaceBuilder` against the OPC Foundation stack's `CustomNodeManager2`. The same code path serves Galaxy object hierarchies, Modbus PLC registers, AB CIP tags, TwinCAT symbols, FOCAS CNC parameters, and OPC UA Client aggregations — Galaxy is one driver of seven, not the driver.
 
-## Root ZB Folder
+## Driver root folder
 
-Every address space starts with a single root folder node named `ZB` (NodeId `ns=1;s=ZB`). This folder is added under the standard OPC UA `Objects` folder via an `Organizes` reference. The reverse reference is registered through `MasterNodeManager.AddReferences` because `BuildAddressSpace` runs after `CreateAddressSpace` has already consumed the external references dictionary.
+Every driver's subtree starts with a root `FolderState` under the standard OPC UA `Objects` folder, wired with an `Organizes` reference. `DriverNodeManager.CreateAddressSpace` creates this folder with `NodeId = ns;s={DriverInstanceId}`, `BrowseName = {DriverInstanceId}`, and `EventNotifier = SubscribeToEvents | HistoryRead` so alarm and history-event subscriptions can target the root. The namespace URI is `urn:OtOpcUa:{DriverInstanceId}`.
 
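As a rough illustration of the naming rule in the added "Driver root folder" paragraph, the sketch below composes the namespace URI and the root NodeId text for one driver instance. `DriverRootNaming`, the instance id `galaxy-01`, and the namespace index `2` are assumptions made for the example and are not taken from the OtOpcUa source; the real index is whatever the server assigns to the driver's namespace at runtime.

```csharp
using System;

// Hypothetical driver instance id and namespace index, used only for illustration.
var driverInstanceId = "galaxy-01";
Console.WriteLine(DriverRootNaming.NamespaceUri(driverInstanceId));
Console.WriteLine(DriverRootNaming.RootNodeId(2, driverInstanceId));

// Hypothetical helper; it only mirrors the string formats quoted in the paragraph.
static class DriverRootNaming
{
    // Namespace URI format stated in the text: urn:OtOpcUa:{DriverInstanceId}
    public static string NamespaceUri(string driverInstanceId) =>
        $"urn:OtOpcUa:{driverInstanceId}";

    // Standard OPC UA string-NodeId text form: ns=<index>;s=<identifier>
    public static string RootNodeId(ushort namespaceIndex, string driverInstanceId) =>
        $"ns={namespaceIndex};s={driverInstanceId}";
}
```

Keeping the instance id as both the namespace suffix and the root identifier means a client can locate a driver's root from the namespace table alone, which is presumably why the two formats share the same token.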
-The root folder has `EventNotifier = SubscribeToEvents` enabled so alarm events propagate up to clients subscribed at the root level. +## IAddressSpaceBuilder surface -## Area Folders vs Object Nodes +`IAddressSpaceBuilder` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAddressSpaceBuilder.cs`) offers three calls: -Galaxy objects fall into two categories based on `template_definition.category_id`: +- `Folder(browseName, displayName)` — creates a child `FolderState` and returns a child builder scoped to it. +- `Variable(browseName, displayName, DriverAttributeInfo attributeInfo)` — creates a `BaseDataVariableState` and returns an `IVariableHandle` the driver keeps for alarm wiring. +- `AddProperty(browseName, DriverDataType, value)` — attaches a `PropertyState` for static metadata (e.g. equipment identification fields). -- **Areas** (`category_id = 13`) become `FolderState` nodes with `FolderType` type definition and `Organizes` references. They represent logical groupings in the Galaxy hierarchy (e.g., production lines, cells). -- **Non-area objects** (AppEngine, Platform, UserDefined, etc.) become `BaseObjectState` nodes with `BaseObjectType` type definition and `HasComponent` references. These represent runtime automation objects that carry attributes. +Drivers drive ordering. Typical pattern: root → folder per equipment → variables per tag. `GenericDriverNodeManager` calls `DiscoverAsync` once on startup and once per rediscovery cycle. -Both node types use `contained_name` as the browse name. When `contained_name` is null or empty, `tag_name` is used as a fallback. +## DriverAttributeInfo → OPC UA variable -## Variable Nodes for Attributes +Each variable carries a `DriverAttributeInfo` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs`): -Each Galaxy attribute becomes a `BaseDataVariableState` node under its parent object. 
The variable is configured with: +| Field | OPC UA target | +|---|---| +| `FullName` | `NodeId.Identifier` — used as the driver-side lookup key for Read/Write/Subscribe | +| `DriverDataType` | mapped to a built-in `DataTypeIds.*` NodeId via `DriverNodeManager.MapDataType` | +| `IsArray` | `ValueRank = OneDimension` when true, `Scalar` otherwise | +| `ArrayDim` | declared array length, carried through as metadata | +| `SecurityClass` | stored in `_securityByFullRef` for `WriteAuthzPolicy` gating on write | +| `IsHistorized` | flips `AccessLevel.HistoryRead` + `Historizing = true` | +| `IsAlarm` | drives the `MarkAsAlarmCondition` pass (see below) | +| `WriteIdempotent` | stored in `_writeIdempotentByFullRef`; fed to `CapabilityInvoker.ExecuteWriteAsync` | -- **DataType** -- Mapped from `mx_data_type` via `MxDataTypeMapper` (see [DataTypeMapping.md](DataTypeMapping.md)) -- **ValueRank** -- `OneDimension` (1) for arrays, `Scalar` (-1) for scalars -- **ArrayDimensions** -- Set to `[array_dimension]` when the attribute is an array -- **AccessLevel** -- `CurrentReadOrWrite` or `CurrentRead` based on security classification, with `HistoryRead` added for historized attributes -- **Historizing** -- Set to `true` for attributes with a `HistoryExtension` primitive -- **Initial value** -- `null` with `StatusCode = BadWaitingForInitialData` until the first MXAccess callback delivers a live value +The initial value stays `null` with `StatusCode = BadWaitingForInitialData` until the first Read or `ISubscribable.OnDataChange` push lands. -## Primitive Grouping +## CapturingBuilder + alarm sink registration -Galaxy objects can have primitive components (e.g., alarm extensions, history extensions) that attach sub-attributes to a parent attribute. The address space handles this with a two-pass approach: +`GenericDriverNodeManager.BuildAddressSpaceAsync` wraps the supplied builder in a `CapturingBuilder` before calling `DiscoverAsync`. 
The wrapper observes every `Variable()` call: when a returned `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` fires, the sink is registered in the manager's `_alarmSinks` dictionary keyed by the variable's `FullReference`. Subsequent `IAlarmSource.OnAlarmEvent` pushes are routed to the matching sink by `SourceNodeId`. This keeps the alarm-wiring protocol declarative — drivers just flag `DriverAttributeInfo.IsAlarm = true` and the materialization of the OPC UA `AlarmConditionState` node is handled by the server layer. See `docs/AlarmTracking.md`. -### First pass: direct attributes +## NodeId scheme -Attributes with an empty `PrimitiveName` are created as direct variable children of the object node. If a direct attribute shares its name with a primitive group, the variable node reference is saved for the second pass. - -### Second pass: primitive child attributes - -Attributes with a non-empty `PrimitiveName` are grouped by that name. For each group: - -1. If a direct attribute variable with the same name already exists, the primitive's child attributes are added as `HasComponent` children of that variable node. This merges alarm/history sub-attributes (e.g., `InAlarm`, `Priority`) under the parent variable they describe. -2. If no matching direct attribute exists, a new `BaseObjectState` node is created with NodeId `ns=1;s={TagName}.{PrimitiveName}`, and the primitive's attributes are added under it. - -This structure means that browsing `TestMachine_001/SomeAlarmAttr` reveals both the process value and its alarm sub-attributes (`InAlarm`, `Priority`, `DescAttrName`) as children. - -## NodeId Scheme - -All node identifiers use string-based NodeIds in namespace index 1 (`ns=1`): +All nodes live in the driver's namespace (not a shared `ns=1`). 
Browse paths are driver-defined: | Node type | NodeId format | Example | -|-----------|---------------|---------| -| Root folder | `ns=1;s=ZB` | `ns=1;s=ZB` | -| Area folder | `ns=1;s={tag_name}` | `ns=1;s=Area_001` | -| Object node | `ns=1;s={tag_name}` | `ns=1;s=TestMachine_001` | -| Scalar variable | `ns=1;s={tag_name}.{attr}` | `ns=1;s=TestMachine_001.MachineID` | -| Array variable | `ns=1;s={tag_name}.{attr}` | `ns=1;s=MESReceiver_001.MoveInPartNumbers` | -| Primitive sub-object | `ns=1;s={tag_name}.{prim}` | `ns=1;s=TestMachine_001.AlarmPrim` | +|---|---|---| +| Driver root | `ns;s={DriverInstanceId}` | `urn:OtOpcUa:galaxy-01;s=galaxy-01` | +| Folder | `ns;s={parent}/{browseName}` | `ns;s=galaxy-01/Area_001` | +| Variable | `ns;s={DriverAttributeInfo.FullName}` | `ns;s=DelmiaReceiver_001.DownloadPath` | +| Alarm condition | `ns;s={FullReference}.Condition` | `ns;s=DelmiaReceiver_001.Temperature.Condition` | -For array attributes, the `[]` suffix present in `full_tag_reference` is stripped from the NodeId. The `full_tag_reference` (with `[]`) is kept internally for MXAccess subscription addressing. This means `MESReceiver_001.MoveInPartNumbers[]` in the Galaxy maps to NodeId `ns=1;s=MESReceiver_001.MoveInPartNumbers`. +For Galaxy the `FullName` stays in the legacy `tag_name.AttributeName` format; Modbus uses `unit:register:type`; AB CIP uses the native `program:tag.member` path; etc. — the shape is the driver's choice. -## Topological Sort +## Per-driver hierarchy examples -The hierarchy query returns objects ordered by `parent_gobject_id, tag_name`, but this does not guarantee that a parent appears before all of its children in all cases. `LmxNodeManager.TopologicalSort` performs a depth-first traversal to produce a list where every parent is guaranteed to precede its children. This allows the build loop to look up parent nodes from `_nodeMap` without forward references. 
+- **Galaxy Proxy**: walks the DB-snapshot hierarchy (`GalaxyProxyDriver.DiscoverAsync`), streams Area objects as folders and non-area objects as variable-bearing folders, marks `IsAlarm = true` on attributes that have an `AlarmExtension` primitive. The v1 two-pass primitive-grouping logic is retained inside the Galaxy driver. +- **Modbus**: streams one folder per device, one variable per register range from `ModbusDriverOptions`. No alarm surface. +- **AB CIP**: uses `AbCipTemplateCache` to enumerate user-defined types, streams a folder per program with variables keyed on the native tag path. +- **OPC UA Client**: re-exposes a remote server's address space — browses the upstream and relays nodes through the builder. -## Platform Scope Filtering +See `docs/v2/driver-specs.md` for the per-driver discovery contracts. -When `GalaxyRepository.Scope` is set to `LocalPlatform`, the hierarchy and attributes passed to `BuildAddressSpace` are pre-filtered by `PlatformScopeFilter` inside `GalaxyRepositoryService`. The node manager receives only the local platform's objects and their ancestor areas, so the resulting browse tree is a subset of the full Galaxy. The filtering is transparent to `LmxNodeManager` — it builds nodes from whatever data it receives. +## Rediscovery -Clients browsing a `LocalPlatform`-scoped server will see only the areas and objects hosted by that platform. Areas that exist in the Galaxy but contain no local descendants are excluded. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) for the filtering algorithm and configuration. - -## Incremental Sync - -On address space rebuild (triggered by a Galaxy deploy change), `SyncAddressSpace` uses `AddressSpaceDiff` to identify which `gobject_id` values have changed between the old and new snapshots. Only the affected subtrees are torn down and rebuilt, preserving unchanged nodes and their active subscriptions. 
Affected subscriptions are snapshot before teardown and replayed after rebuild. - -If no previous state is cached (first build), the full `BuildAddressSpace` path runs instead. +Drivers that implement `IRediscoverable` fire `OnRediscoveryNeeded` when their backend signals a change (Galaxy: `time_of_last_deploy` advance; TwinCAT: symbol-version-changed; OPC UA Client: server namespace change). Core re-runs `DiscoverAsync` and diffs — see `docs/IncrementalSync.md`. Static drivers (Modbus, S7) don't implement `IRediscoverable`; their address space only changes when a new generation is published from the Config DB. ## Key source files -- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/LmxNodeManager.cs` -- Node manager with `BuildAddressSpace`, `SyncAddressSpace`, and `TopologicalSort` -- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/AddressSpaceBuilder.cs` -- Testable in-memory model builder +- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — orchestration + `CapturingBuilder` +- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — OPC UA materialization (`IAddressSpaceBuilder` impl + `NestedBuilder`) +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAddressSpaceBuilder.cs` — builder contract +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ITagDiscovery.cs` — driver discovery capability +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs` — per-attribute descriptor diff --git a/docs/AlarmTracking.md b/docs/AlarmTracking.md index 8d3558b..e0ba2c5 100644 --- a/docs/AlarmTracking.md +++ b/docs/AlarmTracking.md @@ -1,234 +1,76 @@ # Alarm Tracking -`LmxNodeManager` generates OPC UA alarm conditions from Galaxy attributes marked as alarms. The system detects alarm-capable attributes during address space construction, creates `AlarmConditionState` nodes, auto-subscribes to the runtime alarm tags via MXAccess, and reports state transitions as OPC UA events. 
+Alarm surfacing is an optional driver capability exposed via `IAlarmSource` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs`). Drivers whose backends have an alarm concept implement it — today: Galaxy (MXAccess alarms), FOCAS (CNC alarms), OPC UA Client (A&C events from the upstream server). Modbus / S7 / AB CIP / AB Legacy / TwinCAT do not implement the interface and the feature is simply absent from their subtrees.
 
-## AlarmInfo Structure
-
-Each tracked alarm is represented by an `AlarmInfo` instance stored in the `_alarmInAlarmTags` dictionary, keyed by the `InAlarm` tag reference:
+## IAlarmSource surface
 
 ```csharp
-private sealed class AlarmInfo
-{
-    public string SourceTagReference { get; set; } // e.g., "Tag_001.Temperature"
-    public NodeId SourceNodeId { get; set; }
-    public string SourceName { get; set; } // attribute name for event messages
-    public bool LastInAlarm { get; set; } // tracks previous state for edge detection
-    public AlarmConditionState? ConditionNode { get; set; }
-    public string PriorityTagReference { get; set; } // e.g., "Tag_001.Temperature.Priority"
-    public string DescAttrNameTagReference { get; set; } // e.g., "Tag_001.Temperature.DescAttrName"
-    public ushort CachedSeverity { get; set; }
-    public string CachedMessage { get; set; }
-}
+Task<IAlarmSubscriptionHandle> SubscribeAlarmsAsync(
+    IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken);
+Task UnsubscribeAlarmsAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken);
+Task AcknowledgeAsync(IReadOnlyList acknowledgements,
+    CancellationToken cancellationToken);
+event EventHandler<AlarmEventArgs>? OnAlarmEvent;
 ```
 
-`LastInAlarm` enables edge detection so only actual transitions (inactive-to-active or active-to-inactive) generate events, not repeated identical values.
+The driver fires `OnAlarmEvent` for every transition (`Active`, `Acknowledged`, `Inactive`) with an `AlarmEventArgs` carrying the source node id, condition id, alarm type, message, severity (`AlarmSeverity` enum), and source timestamp. -## Alarm Detection via is_alarm Flag +## AlarmSurfaceInvoker -During `BuildAddressSpace` (and `BuildSubtree` for incremental sync), the node manager scans each non-area Galaxy object for attributes where `IsAlarm == true` and `PrimitiveName` is empty (direct attributes only, not primitive children): +`AlarmSurfaceInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs`) wraps the three mutating surfaces through `CapabilityInvoker`: -```csharp -var alarmAttrs = objAttrs.Where(a => a.IsAlarm && string.IsNullOrEmpty(a.PrimitiveName)).ToList(); -``` +- `SubscribeAlarmsAsync` / `UnsubscribeAlarmsAsync` run through the `DriverCapability.AlarmSubscribe` pipeline — retries apply under the tier configuration. +- `AcknowledgeAsync` runs through `DriverCapability.AlarmAcknowledge` which does NOT retry per decision #143. A timed-out ack may have already registered at the plant floor; replay would silently double-acknowledge. -The `IsAlarm` flag originates from the `AlarmExtension` primitive in the Galaxy repository database. When a Galaxy attribute has an associated `AlarmExtension` primitive, the SQL query sets `is_alarm = 1` on the corresponding `GalaxyAttributeInfo`. +Multi-host fan-out: when the driver implements `IPerCallHostResolver`, each source node id is resolved individually and batches are grouped by host so a dead PLC inside a multi-device driver doesn't poison sibling breakers. Single-host drivers fall back to `IDriver.DriverInstanceId` as the pipeline-key host. -For each alarm attribute, the code verifies that a corresponding `InAlarm` sub-attribute variable node exists in `_tagToVariableNode` (constructed from `FullTagReference + ".InAlarm"`). 
If the variable node is missing, the alarm is skipped -- this prevents creating orphaned alarm conditions for attributes whose extension primitives were not published. +## Condition-node creation via CapturingBuilder -## Template-Based Alarm Object Filter +Alarm-condition nodes are materialized at address-space build time. During `GenericDriverNodeManager.BuildAddressSpaceAsync` the builder is wrapped in a `CapturingBuilder` that observes every `Variable()` call. When a driver calls `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` on a returned handle, the server-side `DriverNodeManager.VariableHandle` creates a sibling `AlarmConditionState` node and returns an `IAlarmConditionSink`. The wrapper stores the sink in `_alarmSinks` keyed by the variable's full reference, then `GenericDriverNodeManager` registers a forwarder on `IAlarmSource.OnAlarmEvent` that routes each push to the matching sink by `SourceNodeId`. Unknown source ids are dropped silently — they may belong to another driver. -When large galaxies contain more alarm-bearing objects than clients need, `OpcUa.AlarmFilter.ObjectFilters` restricts alarm condition creation to a subset of objects selected by **template name pattern**. The filter is applied at both alarm creation sites -- the full build in `BuildAddressSpace` and the subtree rebuild path triggered by Galaxy redeployment -- so the included set is recomputed on every rebuild against the fresh hierarchy. +The `AlarmConditionState` layout matches OPC UA Part 9: -### Matching rules +- `SourceNode` → the originating variable +- `SourceName` / `ConditionName` → from `AlarmConditionInfo.SourceName` +- Initial state: enabled, inactive, acknowledged, severity per `InitialSeverity`, retain false +- `HasCondition` references wire the source variable ↔ the condition node bidirectionally -- `*` is the only wildcard (glob-style, zero or more characters). All other regex metacharacters are escaped and matched literally. 
-- Matching is case-insensitive. -- The leading `$` used by Galaxy template `tag_name` values is normalized away on both the stored chain entry and the operator pattern, so `TestMachine*` matches the stored `$TestMachine`. -- Each configured entry may itself be comma-separated for operator convenience (`"TestMachine*, Pump_*"`). -- An empty list disables the filter and restores the prior behavior: every alarm-bearing object is tracked when `AlarmTrackingEnabled=true`. +Drivers flag alarm-bearing variables at discovery time via `DriverAttributeInfo.IsAlarm = true`. The Galaxy driver, for example, sets this on attributes that have an `AlarmExtension` primitive in the Galaxy repository DB; FOCAS sets it on the CNC alarm register. -### What gets included +## State transitions -Every Galaxy object whose **template derivation chain** contains any template matching any pattern is included. The chain walks `gobject.derived_from_gobject_id` from the instance through its immediate template and each ancestor template, up to `$Object`. An instance of `TestCoolMachine` whose chain is `$TestCoolMachine -> $TestMachine -> $UserDefined` matches the pattern `TestMachine` via the ancestor hit. +`ConditionSink.OnTransition` runs under the node manager's `Lock` and maps the `AlarmEventArgs.AlarmType` string to Part 9 state: -Inclusion propagates down the **containment hierarchy**: if an object matches, all of its descendants are included as well, regardless of their own template chains. This lets operators target a parent and pick up all its alarm-bearing children with one pattern. +| AlarmType | Action | +|---|---| +| `Active` | `SetActiveState(true)`, `SetAcknowledgedState(false)`, `Retain = true` | +| `Acknowledged` | `SetAcknowledgedState(true)` | +| `Inactive` | `SetActiveState(false)`; `Retain = false` once both inactive and acknowledged | -Each object is evaluated exactly once. 
Overlapping matches (multiple patterns hit, or both an ancestor and descendant match independently) never produce duplicate alarm condition subscriptions -- the filter operates on object identity via a `HashSet` of included `GobjectId` values. +Severity is remapped: `AlarmSeverity.Low/Medium/High/Critical` → OPC UA numeric 250 / 500 / 700 / 900. `Message.Value` is set from `AlarmEventArgs.Message` on every transition. `ClearChangeMasks(true)` and `ReportEvent(condition)` fire the OPC UA event notification for clients subscribed to any ancestor notifier. -### Resolution algorithm +## Acknowledge dispatch -`AlarmObjectFilter.ResolveIncludedObjects(hierarchy)` runs once per build: +Alarm acknowledgement initiated by an OPC UA client flows: -1. Compile each pattern into a regex with `IgnoreCase | CultureInvariant | Compiled`. -2. Build a `parent -> children` map from the hierarchy. Orphans (parent id not in the hierarchy) are treated as roots. -3. BFS from each root with a `(nodeId, parentIncluded)` queue and a `visited` set for cycle defense. -4. At each node: if the parent was included OR any chain entry matches any pattern, add the node and mark its subtree as included. -5. Return the `HashSet` of included object IDs. When no patterns are configured the filter is disabled and the method returns `null`, which the alarm loop treats as "no filtering". +1. The SDK invokes the `AlarmConditionState.OnAcknowledge` method delegate. +2. The handler checks the session's roles for `AlarmAck` — drivers never see a request the session wasn't entitled to make. +3. `AlarmSurfaceInvoker.AcknowledgeAsync` is called with the source / condition / comment tuple. The invoker groups by host and runs each batch through the no-retry `AlarmAcknowledge` pipeline. -After each resolution, `UnmatchedPatterns` exposes any raw pattern that matched zero objects so the startup log can warn about operator typos without failing startup. 
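The three acknowledge-dispatch steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the server's code: the `Ack` record, `AckDispatchSketch.Plan`, the host names, and the use of `UnauthorizedAccessException` are all hypothetical, and the host is carried on the record directly instead of being resolved through `IPerCallHostResolver`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical acknowledgements; host names and node ids are made up for the example.
var acks = new[]
{
    new Ack("plc-a", "Tank1.Level", "valve checked"),
    new Ack("plc-b", "Pump3.Temp", "sensor reset"),
    new Ack("plc-a", "Tank2.Level", "ok"),
};
var sessionRoles = new HashSet<string> { "AlarmAck" };

// Each host gets its own batch, and each batch would be sent exactly once.
foreach (var (host, batch) in AckDispatchSketch.Plan(sessionRoles, acks))
    Console.WriteLine($"{host}: {batch.Count} acknowledgement(s), one attempt, no retry");

// Hypothetical stand-in for the source / condition / comment tuple, plus its resolved host.
record Ack(string Host, string SourceNodeId, string Comment);

static class AckDispatchSketch
{
    public static Dictionary<string, List<Ack>> Plan(
        ISet<string> sessionRoles, IEnumerable<Ack> acks)
    {
        // Step 2: reject before any driver is involved if the role is missing.
        if (!sessionRoles.Contains("AlarmAck"))
            throw new UnauthorizedAccessException("session lacks the AlarmAck role");

        // Step 3: group by host so one unreachable device cannot affect the others.
        return acks.GroupBy(a => a.Host)
                   .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

Doing the role check before grouping keeps the authorization decision in one place, and grouping before sending is what allows a per-host pipeline to fail independently.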
+Drivers return normally for success or throw to signal the ack failed at the backend. -### How the alarm loop applies the filter +## EventNotifier propagation -```csharp -// LmxNodeManager.BuildAddressSpace (and the subtree rebuild path) -if (_alarmTrackingEnabled) -{ - var includedIds = ResolveAlarmFilterIncludedIds(sorted); // null if no filter - foreach (var obj in sorted) - { - if (obj.IsArea) continue; - if (includedIds != null && !includedIds.Contains(obj.GobjectId)) continue; - // ... existing alarm-attribute collection + AlarmConditionState creation - } -} -``` +Drivers that want hierarchical alarm subscriptions propagate `EventNotifier.SubscribeToEvents` up the containment chain during discovery — the Galaxy driver flips the flag on every ancestor of an alarm-bearing object up to the driver root, mirroring v1 behavior. Clients subscribed at the driver root, a mid-level folder, or the `Objects/` root see alarm events from every descendant with an `AlarmConditionState` sibling. The driver-root `FolderState` is created in `DriverNodeManager.CreateAddressSpace` with `EventNotifier = SubscribeToEvents | HistoryRead` so alarm event subscriptions and alarm history both have a single natural target. -`ResolveAlarmFilterIncludedIds` also emits a one-line summary (`Alarm filter: X of Y objects included (Z pattern(s))`) and per-pattern warnings for patterns that matched nothing. The included count is published to the dashboard via `AlarmFilterIncludedObjectCount`. +## ConditionRefresh -### Runtime telemetry +The OPC UA `ConditionRefresh` service queues the current state of every retained condition back to the requesting monitored items. `DriverNodeManager` iterates the node manager's `AlarmConditionState` collection and queues each condition whose `Retain.Value == true` — matching the Part 9 requirement. -`LmxNodeManager` exposes three read-only properties populated by the filter: +## Key source files -- `AlarmFilterEnabled` -- true when patterns are configured. 
-- `AlarmFilterPatternCount` -- number of compiled patterns. -- `AlarmFilterIncludedObjectCount` -- number of objects in the most recent included set. - -`StatusReportService` reads these into `AlarmStatusInfo.FilterEnabled`, `FilterPatternCount`, and `FilterIncludedObjectCount`. The Alarms panel on the dashboard renders `Filter: N pattern(s), M object(s) included` only when the filter is enabled. See [Status Dashboard](StatusDashboard.md#alarms). - -### Validator warning - -`ConfigurationValidator.ValidateAndLog()` logs the effective filter at startup and emits a `Warning` if `AlarmFilter.ObjectFilters` is non-empty while `AlarmTrackingEnabled` is `false`, because the filter would have no effect. - -## AlarmConditionState Creation - -Each detected alarm attribute produces an `AlarmConditionState` node: - -```csharp -var condition = new AlarmConditionState(sourceVariable); -condition.Create(SystemContext, conditionNodeId, - new QualifiedName(alarmAttr.AttributeName + "Alarm", NamespaceIndex), - new LocalizedText("en", alarmAttr.AttributeName + " Alarm"), - true); -``` - -Key configuration on the condition node: - -- **SourceNode** -- Set to the OPC UA NodeId of the source variable, linking the condition to the attribute that triggered it. -- **SourceName / ConditionName** -- Set to the Galaxy attribute name for identification in event notifications. -- **AutoReportStateChanges** -- Set to `true` so the OPC UA framework automatically generates event notifications when condition properties change. -- **Initial state** -- Enabled, inactive, acknowledged, severity Medium, retain false. -- **HasCondition references** -- Bidirectional references are added between the source variable and the condition node. - -The condition's `OnReportEvent` callback forwards events to `Server.ReportEvent` so they reach clients subscribed at the server level. 
- -### Condition Methods - -Each alarm condition supports the following OPC UA Part 9 methods: - -- **Acknowledge** (`OnAcknowledge`) -- Writes the acknowledgment message to the Galaxy `AckMsg` tag. Requires the `AlarmAck` role. -- **Confirm** (`OnConfirm`) -- Confirms a previously acknowledged alarm. The SDK manages the `ConfirmedState` transition. -- **AddComment** (`OnAddComment`) -- Attaches an operator comment to the condition for audit trail purposes. -- **Enable / Disable** (`OnEnableDisable`) -- Activates or deactivates alarm monitoring for the specific condition. The SDK manages the `EnabledState` transition. -- **Shelve** (`OnShelve`) -- Supports `TimedShelve`, `OneShotShelve`, and `Unshelve` operations. The SDK manages the `ShelvedStateMachineType` state transitions including automatic timed unshelve. -- **TimedUnshelve** (`OnTimedUnshelve`) -- Automatically called by the SDK when a timed shelve period expires. - -### Event Fields - -Alarm events include the following fields: - -- `EventId` -- Unique GUID for each event, used as reference for Acknowledge/Confirm -- `ActiveState`, `AckedState`, `ConfirmedState` -- State transitions -- `Message` -- Alarm message from Galaxy `DescAttrName` or default text -- `Severity` -- Galaxy Priority clamped to OPC UA range 1-1000 -- `Retain` -- True while alarm is active or unacknowledged -- `LocalTime` -- Server timezone offset with daylight saving flag -- `Quality` -- Set to Good for alarm events - -## Auto-subscription to Alarm Tags - -After alarm condition nodes are created, `SubscribeAlarmTags` opens MXAccess subscriptions for three tags per alarm: - -1. **InAlarm** (`Tag_001.Temperature.InAlarm`) -- The boolean trigger for alarm activation/deactivation. -2. **Priority** (`Tag_001.Temperature.Priority`) -- Numeric priority that maps to OPC UA severity. -3. **DescAttrName** (`Tag_001.Temperature.DescAttrName`) -- String description used as the alarm event message. 
- -These subscriptions are opened unconditionally (not ref-counted) because they serve the server's own alarm tracking, not client-initiated monitoring. Tags that do not have corresponding variable nodes in `_tagToVariableNode` are skipped. - -## EventNotifier Propagation - -When a Galaxy object contains at least one alarm attribute, `EventNotifiers.SubscribeToEvents` is set on the object node **and all its ancestors** up to the root. This allows OPC UA clients to subscribe to events at any level in the hierarchy and receive alarm notifications from all descendants: - -```csharp -if (hasAlarms && _nodeMap.TryGetValue(obj.GobjectId, out var objNode)) - EnableEventNotifierUpChain(objNode); -``` - -For example, an alarm on `TestMachine_001.SubObject.Temperature` will be visible to clients subscribed on `SubObject`, `TestMachine_001`, or the root `ZB` folder. The root `ZB` folder also has `EventNotifiers.SubscribeToEvents` enabled during initial construction. - -## InAlarm Transition Detection in DispatchLoop - -Alarm state changes are detected in the dispatch loop's Phase 1 (outside `Lock`), which runs on the background dispatch thread rather than the STA thread. This placement is intentional because the detection logic reads Priority and DescAttrName values from MXAccess, which would block the STA thread if done inside the `OnMxAccessDataChange` callback. - -For each pending data change, the loop checks whether the address matches a key in `_alarmInAlarmTags`: - -```csharp -if (_alarmInAlarmTags.TryGetValue(address, out var alarmInfo)) -{ - var newInAlarm = vtq.Value is true || vtq.Value is 1 - || (vtq.Value is int intVal && intVal != 0); - if (newInAlarm != alarmInfo.LastInAlarm) - { - alarmInfo.LastInAlarm = newInAlarm; - // Read Priority and DescAttrName via MXAccess (outside Lock) - ... - pendingAlarmEvents.Add((alarmInfo, newInAlarm)); - } -} -``` - -The boolean coercion handles multiple value representations: `true`, integer `1`, or any non-zero integer. 
When the value changes state, Priority and DescAttrName are read synchronously from MXAccess to populate `CachedSeverity` and `CachedMessage`. These reads happen outside `Lock` because they call into the STA thread. - -Priority values are clamped to the OPC UA severity range (1-1000). Both `int` and `short` types are handled. - -## ReportAlarmEvent - -`ReportAlarmEvent` runs inside `Lock` during Phase 2 of the dispatch loop. It updates the `AlarmConditionState` and generates an OPC UA event: - -```csharp -condition.SetActiveState(SystemContext, active); -condition.Message.Value = new LocalizedText("en", message); -condition.SetSeverity(SystemContext, (EventSeverity)severity); -condition.Retain.Value = active || (condition.AckedState?.Id?.Value == false); -``` - -Key behaviors: - -- **Active state** -- Set to `true` on activation, `false` on clearing. -- **Message** -- Uses `CachedMessage` (from DescAttrName) when available on activation. Falls back to a generated `"Alarm active: {SourceName}"` string. Cleared alarms always use `"Alarm cleared: {SourceName}"`. -- **Severity** -- Set from `CachedSeverity`, which was read from the Priority tag. -- **Retain** -- `true` while the alarm is active or unacknowledged. This keeps the condition visible in condition refresh responses. -- **Acknowledged state** -- Reset to `false` when the alarm activates, requiring explicit client acknowledgment. When role-based auth is active, alarm acknowledgment requires the `AlarmAck` role on the session (checked via `GrantedRoleIds`). Users without this role receive `BadUserAccessDenied`. - -The event is reported by walking up the notifier chain from the source variable's parent through all ancestor nodes. Each ancestor with `EventNotifier` set receives the event via `ReportEvent`, so clients subscribed at any level in the Galaxy hierarchy see alarm transitions from descendant objects. 
- -## Condition Refresh Override - -The `ConditionRefresh` override iterates all tracked alarms and queues retained conditions to the requesting monitored items: - -```csharp -public override ServiceResult ConditionRefresh(OperationContext context, - IList monitoredItems) -{ - foreach (var kvp in _alarmInAlarmTags) - { - var info = kvp.Value; - if (info.ConditionNode == null || info.ConditionNode.Retain?.Value != true) - continue; - foreach (var item in monitoredItems) - item.QueueEvent(info.ConditionNode); - } - return ServiceResult.Good; -} -``` - -Only conditions where `Retain.Value == true` are included. This means only active or unacknowledged alarms appear in condition refresh responses, matching the OPC UA specification requirement that condition refresh returns the current state of all retained conditions. +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs` — capability contract + `AlarmEventArgs` +- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs` — per-host fan-out + no-retry ack +- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — `CapturingBuilder` + alarm forwarder +- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — `VariableHandle.MarkAsAlarmCondition` + `ConditionSink` +- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Alarms/GalaxyAlarmTracker.cs` — Galaxy-specific alarm-event production diff --git a/docs/Client.CLI.md b/docs/Client.CLI.md index 32ed038..cdab19d 100644 --- a/docs/Client.CLI.md +++ b/docs/Client.CLI.md @@ -2,9 +2,9 @@ ## Overview -`ZB.MOM.WW.OtOpcUa.Client.CLI` is a cross-platform command-line client for the LmxOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations. Commands are routed and parsed by [CliFx](https://github.com/Tyrrrz/CliFx). +`ZB.MOM.WW.OtOpcUa.Client.CLI` is a cross-platform command-line client for the OtOpcUa OPC UA server. 
It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations. Commands are routed and parsed by [CliFx](https://github.com/Tyrrrz/CliFx). -The CLI is the primary tool for operators and developers to test and interact with the server from a terminal. It supports all core operations: connectivity testing, browsing, reading, writing, subscriptions, alarm monitoring, history reads, and redundancy queries. +The CLI is the primary tool for operators and developers to test and interact with the server from a terminal. It supports all core operations: connectivity testing, browsing, reading, writing, subscriptions, alarm monitoring, history reads, and redundancy queries. Any driver surface exposed by the server (Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, OPC UA Client) is reachable through these commands — the CLI is driver-agnostic because everything below the OPC UA endpoint is. ## Build and Run @@ -14,7 +14,7 @@ dotnet build dotnet run -- [options] ``` -The executable name is `lmxopcua-cli`. +The executable name is still `lmxopcua-cli`, a residual from the pre-v2 rename (`Program.cs:SetExecutableName`). Scripts and operator muscle memory depend on the name; renaming it to `otopcua-cli` is a follow-up that must also move the client-side PKI store folder (`{LocalAppData}/LmxOpcUaClient/pki/`, used by the shared client for its application certificate) so trust relationships survive the rename. ## Architecture @@ -54,7 +54,7 @@ lmxopcua-cli write -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -v 42 -U opera When `-F` is provided, the shared service tries the primary URL first, then each failover URL in order. For long-running commands (`subscribe`, `alarms`), the service monitors the session via keep-alive and automatically reconnects to the next available server on failure.
```bash -lmxopcua-cli connect -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa +lmxopcua-cli connect -u opc.tcp://localhost:4840/OtOpcUa -F opc.tcp://localhost:4841/OtOpcUa ``` ### Transport Security @@ -67,7 +67,7 @@ When `sign` or `encrypt` is specified, the shared service: 4. Fails with a clear error if no matching endpoint is found ```bash -lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -S encrypt -U admin -P secret -r -d 2 +lmxopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -S encrypt -U admin -P secret -r -d 2 ``` ### Verbose Logging @@ -81,14 +81,14 @@ The `--verbose` flag switches Serilog output from `Warning` to `Debug` level, sh Tests connectivity to an OPC UA server. Creates a session, prints connection metadata, and disconnects. ```bash -lmxopcua-cli connect -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123 +lmxopcua-cli connect -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123 ``` Output: ```text -Connected to: opc.tcp://localhost:4840/LmxOpcUa -Server: LmxOpcUa +Connected to: opc.tcp://localhost:4840/OtOpcUa +Server: OtOpcUa Server Security Mode: None Security Policy: http://opcfoundation.org/UA/SecurityPolicy#None Connection successful. @@ -99,7 +99,7 @@ Connection successful. Reads the current value of a single node and prints the value, status code, and timestamps. 
```bash -lmxopcua-cli read -u opc.tcp://localhost:4840/LmxOpcUa -n "ns=3;s=DEV.ScanState" -U admin -P admin123 +lmxopcua-cli read -u opc.tcp://localhost:4840/OtOpcUa -n "ns=3;s=DEV.ScanState" -U admin -P admin123 ``` | Flag | Description | @@ -135,10 +135,10 @@ Browses the OPC UA address space starting from the Objects folder or a specified ```bash # Browse top-level Objects folder -lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123 +lmxopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123 # Browse a specific node recursively to depth 3 -lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123 -r -d 3 -n "ns=3;s=ZB" +lmxopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123 -r -d 3 -n "ns=3;s=ZB" ``` | Flag | Description | @@ -166,12 +166,12 @@ Reads historical data from a node. Supports raw history reads and aggregate (pro ```bash # Raw history -lmxopcua-cli historyread -u opc.tcp://localhost:4840/LmxOpcUa \ +lmxopcua-cli historyread -u opc.tcp://localhost:4840/OtOpcUa \ -n "ns=1;s=TestMachine_001.TestHistoryValue" \ --start "2026-03-25" --end "2026-03-30" # Aggregate: 1-hour average -lmxopcua-cli historyread -u opc.tcp://localhost:4840/LmxOpcUa \ +lmxopcua-cli historyread -u opc.tcp://localhost:4840/OtOpcUa \ -n "ns=1;s=TestMachine_001.TestHistoryValue" \ --start "2026-03-25" --end "2026-03-30" \ --aggregate Average --interval 3600000 @@ -203,10 +203,10 @@ Subscribes to alarm events on a node. 
Prints structured alarm output including s ```bash # Subscribe to alarm events on the Server node -lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa +lmxopcua-cli alarms -u opc.tcp://localhost:4840/OtOpcUa # Subscribe to a specific source node with condition refresh -lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa \ +lmxopcua-cli alarms -u opc.tcp://localhost:4840/OtOpcUa \ -n "ns=1;s=TestMachine_001" --refresh ``` @@ -221,7 +221,7 @@ lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa \ Reads the OPC UA redundancy state from a server: redundancy mode, service level, server URIs, and application URI. ```bash -lmxopcua-cli redundancy -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123 +lmxopcua-cli redundancy -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123 ``` Example output: @@ -230,9 +230,9 @@ Example output: Redundancy Mode: Warm Service Level: 200 Server URIs: - - urn:localhost:LmxOpcUa:instance1 - - urn:localhost:LmxOpcUa:instance2 -Application URI: urn:localhost:LmxOpcUa:instance1 + - urn:localhost:OtOpcUa:instance1 + - urn:localhost:OtOpcUa:instance2 +Application URI: urn:localhost:OtOpcUa:instance1 ``` ## Testing diff --git a/docs/Client.UI.md b/docs/Client.UI.md index 3c9c861..6839055 100644 --- a/docs/Client.UI.md +++ b/docs/Client.UI.md @@ -2,7 +2,7 @@ ## Overview -`ZB.MOM.WW.OtOpcUa.Client.UI` is a cross-platform Avalonia desktop application for connecting to and interacting with the LmxOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations. +`ZB.MOM.WW.OtOpcUa.Client.UI` is a cross-platform Avalonia desktop application for connecting to and interacting with the OtOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations. 
The UI provides a single-window interface for browsing the address space, reading and writing values, monitoring live subscriptions, managing alarms, and querying historical data. @@ -43,7 +43,7 @@ The application uses a single-window layout with five main areas: │ │ │ ││ │ (lazy-load) │ └──────────────────────────────────────────────┘│ ├──────────────┴──────────────────────────────────────────────┤ -│ Connected to opc.tcp://... | LmxOpcUa | Session: ... | 3 subs│ +│ Connected to opc.tcp://... | OtOpcUa Server | Session: ... | 3 subs│ └─────────────────────────────────────────────────────────────┘ ``` @@ -55,7 +55,7 @@ The top bar provides the endpoint URL, Connect, and Disconnect buttons. The **Co | Setting | Description | |---------|-------------| -| Endpoint URL | OPC UA server endpoint (e.g., `opc.tcp://localhost:4840/LmxOpcUa`) | +| Endpoint URL | OPC UA server endpoint (e.g., `opc.tcp://localhost:4840/OtOpcUa`) | | Username / Password | Credentials for `UserName` token authentication | | Security Mode | Transport security: None, Sign, SignAndEncrypt | | Failover URLs | Comma-separated backup endpoints for redundancy failover | @@ -65,7 +65,7 @@ The top bar provides the endpoint URL, Connect, and Disconnect buttons. The **Co ### Settings Persistence -Connection settings are saved to `{LocalAppData}/LmxOpcUaClient/settings.json` after each successful connection and on window close. The settings are reloaded on next launch, including: +Connection settings are saved to `{LocalAppData}/LmxOpcUaClient/settings.json` after each successful connection and on window close. The folder name is a residual from the pre-v2 rename (the `Client.Shared` session factory still calls itself `LmxOpcUaClient` at `OpcUaClientService.cs:428`); renaming to `OtOpcUaClient` is a follow-up that needs a migration shim so existing users don't lose their settings on upgrade. 
The settings are reloaded on next launch, including: - All connection parameters - Active subscription node IDs (restored after reconnection) diff --git a/docs/Configuration.md b/docs/Configuration.md index 213c978..b55ef1a 100644 --- a/docs/Configuration.md +++ b/docs/Configuration.md @@ -1,370 +1,141 @@ # Configuration -## Overview +## Two-layer model -The service loads configuration from `appsettings.json` at startup using the Microsoft.Extensions.Configuration stack. `AppConfiguration` is the root holder class that aggregates typed sections: `OpcUa`, `MxAccess`, `GalaxyRepository`, `Dashboard`, `Historian`, `Authentication`, and `Security`. Each section binds to a dedicated POCO class with sensible defaults, so the service runs with zero configuration on a standard deployment. +OtOpcUa configuration is split into two layers: -## Config Binding Pattern +| Layer | Where | Scope | Edited by | +|---|---|---|---| +| **Bootstrap** | `appsettings.json` per process | Enough to start the process and reach the Config DB | Local file edit + process restart | +| **Authoritative config** | Config DB (SQL Server) via `OtOpcUaConfigDbContext` | Clusters, namespaces, UNS hierarchy, equipment, tags, driver instances, ACLs, role grants, poll groups | Admin UI draft/publish workflow | -The production constructor in `OpcUaService` builds the configuration pipeline and binds each JSON section to its typed class: +The rule: if the setting describes *how the process connects to the rest of the world* (Config DB connection string, LDAP bind, transport security profile, node identity, logging), it lives in `appsettings.json`. If it describes *what the fleet does* (clusters, drivers, tags, UNS, ACLs), it lives in the Config DB and is edited through the Admin UI. -```csharp -var configuration = new ConfigurationBuilder() - .AddJsonFile("appsettings.json", optional: false) - .AddJsonFile($"appsettings.{Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT") ?? 
"Production"}.json", optional: true) - .AddEnvironmentVariables() - .Build(); +--- -_config = new AppConfiguration(); -configuration.GetSection("OpcUa").Bind(_config.OpcUa); -configuration.GetSection("MxAccess").Bind(_config.MxAccess); -configuration.GetSection("GalaxyRepository").Bind(_config.GalaxyRepository); -configuration.GetSection("Dashboard").Bind(_config.Dashboard); -configuration.GetSection("Historian").Bind(_config.Historian); -configuration.GetSection("Authentication").Bind(_config.Authentication); -configuration.GetSection("Security").Bind(_config.Security); -``` +## Bootstrap configuration (`appsettings.json`) -This pattern uses `IConfiguration.GetSection().Bind()` rather than `IOptions` because the service targets .NET Framework 4.8, where the full dependency injection container is not used. +Each of the three processes (Server, Admin, Galaxy.Host) reads its own `appsettings.json` plus environment overrides. -## Environment-Specific Overrides +### OtOpcUa Server — `src/ZB.MOM.WW.OtOpcUa.Server/appsettings.json` -The configuration pipeline supports three layers of override, applied in order: +Bootstrap-only. `Program.cs` reads four top-level sections: -1. `appsettings.json` -- base configuration (required) -2. `appsettings.{DOTNET_ENVIRONMENT}.json` -- environment-specific overlay (optional) -3. Environment variables -- highest priority, useful for deployment automation +| Section | Keys | Purpose | +|---|---|---| +| `Node` | `NodeId`, `ClusterId`, `ConfigDbConnectionString`, `LocalCachePath` | Identity + path to the Config DB + LiteDB offline cache path. | +| `OpcUaServer` | `EndpointUrl`, `ApplicationName`, `ApplicationUri`, `PkiStoreRoot`, `AutoAcceptUntrustedClientCertificates`, `SecurityProfile` | OPC UA endpoint + transport security. See [`security.md`](security.md). 
| +| `OpcUaServer:Ldap` | `Enabled`, `Server`, `Port`, `UseTls`, `AllowInsecureLdap`, `SearchBase`, `ServiceAccountDn`, `ServiceAccountPassword`, `GroupToRole`, `UserNameAttribute`, `GroupAttribute` | LDAP auth for OPC UA UserName tokens. See [`security.md`](security.md). | +| `Serilog` | Standard Serilog keys + `WriteJson` bool | Logging verbosity + optional JSON file sink for SIEM ingest. | +| `Authorization` | `StrictMode` (bool) | Flip `true` to fail-closed on sessions lacking LDAP group metadata. Default false during ACL rollouts. | +| `Metrics:Prometheus:Enabled` | bool | Toggles the `/metrics` endpoint. | -Set the `DOTNET_ENVIRONMENT` variable to load a named overlay file. For example, setting `DOTNET_ENVIRONMENT=Staging` loads `appsettings.Staging.json` if it exists. - -Environment variables follow the standard `Section__Property` naming convention. For example, `OpcUa__Port=5840` overrides the OPC UA port. - -## Configuration Sections - -### OpcUa - -Controls the OPC UA server endpoint and session limits. Defined in `OpcUaConfiguration`. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `BindAddress` | `string` | `"0.0.0.0"` | IP address or hostname the server binds to. 
Use `0.0.0.0` for all interfaces, `localhost` for local-only, or a specific IP | -| `Port` | `int` | `4840` | TCP port the OPC UA server listens on | -| `EndpointPath` | `string` | `"/LmxOpcUa"` | Path appended to the host URI | -| `ServerName` | `string` | `"LmxOpcUa"` | Server name presented to OPC UA clients | -| `GalaxyName` | `string` | `"ZB"` | Galaxy name used as the OPC UA namespace | -| `MaxSessions` | `int` | `100` | Maximum simultaneous OPC UA sessions | -| `SessionTimeoutMinutes` | `int` | `30` | Idle session timeout in minutes | -| `AlarmTrackingEnabled` | `bool` | `false` | Enables `AlarmConditionState` nodes for alarm attributes | -| `AlarmFilter.ObjectFilters` | `List` | `[]` | Wildcard template-name patterns (with `*`) that scope alarm tracking to matching objects and their descendants. Empty list disables filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) | -| `ApplicationUri` | `string?` | `null` | Explicit application URI for this server instance. Required when redundancy is enabled. Defaults to `urn:{GalaxyName}:LmxOpcUa` when null | - -### MxAccess - -Controls the MXAccess runtime connection used for live tag reads and writes. Defined in `MxAccessConfiguration`. 
- -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `ClientName` | `string` | `"LmxOpcUa"` | Client name registered with MXAccess | -| `NodeName` | `string?` | `null` | Optional Galaxy node name to target | -| `GalaxyName` | `string?` | `null` | Optional Galaxy name for MXAccess reference resolution | -| `ReadTimeoutSeconds` | `int` | `5` | Maximum wait for a live tag read | -| `WriteTimeoutSeconds` | `int` | `5` | Maximum wait for a write acknowledgment | -| `MaxConcurrentOperations` | `int` | `10` | Cap on concurrent MXAccess operations | -| `MonitorIntervalSeconds` | `int` | `5` | Connectivity monitor probe interval | -| `AutoReconnect` | `bool` | `true` | Automatically re-establish dropped MXAccess sessions | -| `ProbeTag` | `string?` | `null` | Optional tag used to verify the runtime returns fresh data | -| `ProbeStaleThresholdSeconds` | `int` | `60` | Seconds a probe value may remain unchanged before the connection is considered stale | -| `RuntimeStatusProbesEnabled` | `bool` | `true` | Advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` to track per-host runtime state. Drives the Galaxy Runtime dashboard panel, HealthCheck Rule 2e, and the Read-path short-circuit that invalidates OPC UA variable quality when a host is Stopped. Set `false` to return to legacy behavior where host state is invisible and the bridge serves whatever quality MxAccess reports for individual tags. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) | -| `RuntimeStatusUnknownTimeoutSeconds` | `int` | `15` | Maximum seconds to wait for the initial probe callback before marking a host as Stopped. Only applies to the Unknown → Stopped transition; Running hosts never time out because `ScanState` is delivered on-change only. 
A value below 5s triggers a validator warning | -| `RequestTimeoutSeconds` | `int` | `30` | Outer safety timeout applied to sync-over-async MxAccess operations invoked from the OPC UA stack thread (Read, Write, address-space rebuild probe sync). Backstop for the inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds`. A timed-out operation returns `BadTimeout`. Validator rejects values < 1 and warns if set below the inner Read/Write timeouts. See [MXAccess Bridge](MxAccessBridge.md#request-timeout-safety-backstop). Stability review 2026-04-13 Finding 3 | - -### GalaxyRepository - -Controls the Galaxy repository database connection used to build the OPC UA address space. Defined in `GalaxyRepositoryConfiguration`. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `ConnectionString` | `string` | `"Server=localhost;Database=ZB;Integrated Security=true;"` | SQL Server connection string for the Galaxy database | -| `ChangeDetectionIntervalSeconds` | `int` | `30` | How often the service polls for Galaxy deploy changes | -| `CommandTimeoutSeconds` | `int` | `30` | SQL command timeout for repository queries | -| `ExtendedAttributes` | `bool` | `false` | Load extended Galaxy attribute metadata into the OPC UA model | -| `Scope` | `GalaxyScope` | `"Galaxy"` | Controls how much of the Galaxy hierarchy is loaded. `Galaxy` loads all deployed objects (default). `LocalPlatform` loads only objects hosted by the platform deployed on this machine. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) | -| `PlatformName` | `string?` | `null` | Explicit platform hostname for `LocalPlatform` filtering. When null, uses `Environment.MachineName`. Only used when `Scope` is `LocalPlatform` | - -### Dashboard - -Controls the embedded HTTP status dashboard. Defined in `DashboardConfiguration`. 
- -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Enabled` | `bool` | `true` | Whether the status dashboard is hosted | -| `Port` | `int` | `8081` | HTTP port for the dashboard endpoint | -| `RefreshIntervalSeconds` | `int` | `10` | HTML auto-refresh interval in seconds | - -### Historian - -Controls the Wonderware Historian SDK connection for OPC UA historical data access. Defined in `HistorianConfiguration`. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Enabled` | `bool` | `false` | Enables OPC UA historical data access | -| `ServerName` | `string` | `"localhost"` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments | -| `ServerNames` | `List` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover. See [Historical Data Access](HistoricalDataAccess.md#read-only-cluster-failover) | -| `FailureCooldownSeconds` | `int` | `60` | How long a failed cluster node is skipped before being re-tried. Zero disables the cooldown | -| `IntegratedSecurity` | `bool` | `true` | Use Windows authentication | -| `UserName` | `string?` | `null` | Username when `IntegratedSecurity` is false | -| `Password` | `string?` | `null` | Password when `IntegratedSecurity` is false | -| `Port` | `int` | `32568` | Historian TCP port | -| `CommandTimeoutSeconds` | `int` | `30` | SDK packet timeout in seconds (inner async bound) | -| `RequestTimeoutSeconds` | `int` | `60` | Outer safety timeout applied to sync-over-async Historian operations invoked from the OPC UA stack thread (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`). Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Validator rejects values < 1 and warns if set below `CommandTimeoutSeconds`. 
Stability review 2026-04-13 Finding 3 | -| `MaxValuesPerRead` | `int` | `10000` | Maximum values returned per `HistoryRead` request | - -### Authentication - -Controls user authentication and write authorization for the OPC UA server. Defined in `AuthenticationConfiguration`. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `AllowAnonymous` | `bool` | `true` | Accepts anonymous client connections when `true` | -| `AnonymousCanWrite` | `bool` | `true` | Permits anonymous users to write when `true` | - -#### LDAP Authentication - -When `Ldap.Enabled` is `true`, credentials are validated against the configured LDAP server and group membership determines OPC UA permissions. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Ldap.Enabled` | `bool` | `false` | Enables LDAP authentication | -| `Ldap.Host` | `string` | `localhost` | LDAP server hostname | -| `Ldap.Port` | `int` | `3893` | LDAP server port | -| `Ldap.BaseDN` | `string` | `dc=lmxopcua,dc=local` | Base DN for LDAP operations | -| `Ldap.BindDnTemplate` | `string` | `cn={username},dc=lmxopcua,dc=local` | Bind DN template (`{username}` is replaced) | -| `Ldap.ServiceAccountDn` | `string` | `""` | Service account DN for group lookups | -| `Ldap.ServiceAccountPassword` | `string` | `""` | Service account password | -| `Ldap.TimeoutSeconds` | `int` | `5` | Connection timeout | -| `Ldap.ReadOnlyGroup` | `string` | `ReadOnly` | LDAP group granting read-only access | -| `Ldap.WriteOperateGroup` | `string` | `WriteOperate` | LDAP group granting write access for FreeAccess/Operate attributes | -| `Ldap.WriteTuneGroup` | `string` | `WriteTune` | LDAP group granting write access for Tune attributes | -| `Ldap.WriteConfigureGroup` | `string` | `WriteConfigure` | LDAP group granting write access for Configure attributes | -| `Ldap.AlarmAckGroup` | `string` | `AlarmAck` | LDAP group granting alarm acknowledgment | - -#### Permission 
Model - -When LDAP is enabled, LDAP group membership is mapped to OPC UA session role NodeIds during authentication. All authenticated LDAP users can browse and read nodes regardless of group membership. Groups grant additional permissions: - -| LDAP Group | Permission | -|---|---| -| ReadOnly | No additional permissions (read-only access) | -| WriteOperate | Write FreeAccess and Operate attributes | -| WriteTune | Write Tune attributes | -| WriteConfigure | Write Configure attributes | -| AlarmAck | Acknowledge alarms | - -Users can belong to multiple groups. The `admin` user in the default GLAuth configuration belongs to all three groups. - -Write access depends on both the user's role and the Galaxy attribute's security classification. See the [Effective Permission Matrix](Security.md#effective-permission-matrix) in the Security Guide for the full breakdown. - -Example configuration: - -```json -"Authentication": { - "AllowAnonymous": true, - "AnonymousCanWrite": false, - "Ldap": { - "Enabled": true, - "Host": "localhost", - "Port": 3893, - "BaseDN": "dc=lmxopcua,dc=local", - "BindDnTemplate": "cn={username},dc=lmxopcua,dc=local", - "ServiceAccountDn": "cn=serviceaccount,dc=lmxopcua,dc=local", - "ServiceAccountPassword": "serviceaccount123", - "TimeoutSeconds": 5, - "ReadOnlyGroup": "ReadOnly", - "WriteOperateGroup": "WriteOperate", - "WriteTuneGroup": "WriteTune", - "WriteConfigureGroup": "WriteConfigure", - "AlarmAckGroup": "AlarmAck" - } -} -``` - -### Security - -Controls OPC UA transport security profiles and certificate handling. Defined in `SecurityProfileConfiguration`. See [Security Guide](security.md) for detailed usage. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Profiles` | `List` | `["None"]` | Security profiles to expose. 
Valid: `None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt` | -| `AutoAcceptClientCertificates` | `bool` | `true` | Auto-accept untrusted client certificates. Set to `false` in production | -| `RejectSHA1Certificates` | `bool` | `true` | Reject client certificates signed with SHA-1 | -| `MinimumCertificateKeySize` | `int` | `2048` | Minimum RSA key size for client certificates | -| `PkiRootPath` | `string?` | `null` | Override for PKI root directory. Defaults to `%LOCALAPPDATA%\OPC Foundation\pki` | -| `CertificateSubject` | `string?` | `null` | Override for server certificate subject. Defaults to `CN={ServerName}, O=ZB MOM, DC=localhost` | - -Example — production deployment with encrypted transport: - -```json -"Security": { - "Profiles": ["Basic256Sha256-SignAndEncrypt"], - "AutoAcceptClientCertificates": false, - "RejectSHA1Certificates": true, - "MinimumCertificateKeySize": 2048 -} -``` - -### Redundancy - -Controls non-transparent OPC UA redundancy. Defined in `RedundancyConfiguration`. See [Redundancy Guide](Redundancy.md) for detailed usage. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Enabled` | `bool` | `false` | Enables redundancy mode and ServiceLevel computation | -| `Mode` | `string` | `"Warm"` | Redundancy mode: `Warm` or `Hot` | -| `Role` | `string` | `"Primary"` | Instance role: `Primary` (higher ServiceLevel) or `Secondary` | -| `ServerUris` | `List` | `[]` | ApplicationUri values for all servers in the redundant set | -| `ServiceLevelBase` | `int` | `200` | Base ServiceLevel when healthy (1-255). 
Secondary receives base - 50 | - -Example — two-instance redundant pair (Primary): - -```json -"Redundancy": { - "Enabled": true, - "Mode": "Warm", - "Role": "Primary", - "ServerUris": ["urn:localhost:LmxOpcUa:instance1", "urn:localhost:LmxOpcUa:instance2"], - "ServiceLevelBase": 200 -} -``` - -## Feature Flags - -Three boolean properties act as feature flags that control optional subsystems: - -- **`OpcUa.AlarmTrackingEnabled`** -- When `true`, the node manager creates `AlarmConditionState` nodes for alarm attributes and monitors `InAlarm` transitions. Disabled by default because alarm tracking adds per-attribute overhead. -- **`OpcUa.AlarmFilter.ObjectFilters`** -- List of wildcard template-name patterns that scope alarm tracking to matching objects and their descendants. An empty list preserves the current unfiltered behavior; a non-empty list includes an object only when any name in its template derivation chain matches any pattern, then propagates the inclusion to every descendant in the containment hierarchy. `*` is the only wildcard, matching is case-insensitive, and the Galaxy `$` prefix on template names is normalized so operators can write `TestMachine*` instead of `$TestMachine*`. Each list entry may itself contain comma-separated patterns (`"TestMachine*, Pump_*"`) for convenience. When the list is non-empty but `AlarmTrackingEnabled` is `false`, the validator emits a warning because the filter has no effect. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the full matching algorithm and telemetry. -- **`Historian.Enabled`** -- When `true`, the service calls `HistorianPluginLoader.TryLoad(config)` to load the `ZB.MOM.WW.OtOpcUa.Historian.Aveva` plugin from the `Historian/` subfolder next to the host exe and registers the resulting `IHistorianDataSource` with the OPC UA server host. 
Disabled by default because not all deployments have a Historian instance -- when disabled the plugin is not probed and the Wonderware SDK DLLs are not required on the host. If the flag is `true` but the plugin or its SDK dependencies cannot be loaded, the server still starts and every history read returns `BadHistoryOperationUnsupported` with a warning in the log. -- **`GalaxyRepository.ExtendedAttributes`** -- When `true`, the repository loads additional Galaxy attribute metadata beyond the core set needed for the address space. Disabled by default to minimize startup query time. -- **`GalaxyRepository.Scope`** -- When set to `LocalPlatform`, the repository filters the hierarchy and attributes to only include objects hosted by the platform whose `node_name` matches this machine (or the explicit `PlatformName` override). Ancestor areas are retained to keep the browse tree connected. Default is `Galaxy` (load everything). See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter). - -## Configuration Validation - -`ConfigurationValidator.ValidateAndLog()` runs at the start of `OpcUaService.Start()`. 
It logs every resolved configuration value at `Information` level and validates required constraints: - -- `OpcUa.Port` must be between 1 and 65535 -- `OpcUa.GalaxyName` must not be empty -- `MxAccess.ClientName` must not be empty -- `GalaxyRepository.ConnectionString` must not be empty -- `Security.MinimumCertificateKeySize` must be at least 2048 -- Unknown security profile names are logged as warnings -- `AutoAcceptClientCertificates = true` emits a warning -- Only-`None` profile configuration emits a warning -- `OpcUa.AlarmFilter.ObjectFilters` is non-empty while `OpcUa.AlarmTrackingEnabled = false` emits a warning (filter has no effect) -- `Historian.ServerName` (or `Historian.ServerNames`) must not be empty when `Historian.Enabled = true` -- `Historian.FailureCooldownSeconds` must be zero or positive -- `Historian.ServerName` is set alongside a non-empty `Historian.ServerNames` emits a warning (single ServerName is ignored) -- `MxAccess.RuntimeStatusUnknownTimeoutSeconds` below 5s emits a warning (below the reasonable floor for MxAccess initial-resolution latency) -- `OpcUa.ApplicationUri` must be set when `Redundancy.Enabled = true` -- `Redundancy.ServiceLevelBase` must be between 1 and 255 -- `Redundancy.ServerUris` should contain at least 2 entries when enabled -- Local `ApplicationUri` should appear in `Redundancy.ServerUris` - -If validation fails, the service throws `InvalidOperationException` and does not start. - -## Test Constructor Pattern - -`OpcUaService` provides an `internal` constructor that accepts pre-built dependencies instead of loading `appsettings.json`: - -```csharp -internal OpcUaService( - AppConfiguration config, - IMxProxy? mxProxy, - IGalaxyRepository? galaxyRepository, - IMxAccessClient? 
mxAccessClientOverride = null, - bool hasMxAccessClientOverride = false) -``` - -Integration tests use this constructor to inject substitute implementations of `IMxProxy`, `IGalaxyRepository`, and `IMxAccessClient`, bypassing the STA thread, COM interop, and SQL Server dependencies. The `hasMxAccessClientOverride` flag tells the service to use the injected `IMxAccessClient` directly instead of creating one from the `IMxProxy` on the STA thread. - -## Example appsettings.json +Minimal example: ```json { - "OpcUa": { - "BindAddress": "0.0.0.0", - "Port": 4840, - "EndpointPath": "/LmxOpcUa", - "ServerName": "LmxOpcUa", - "GalaxyName": "ZB", - "MaxSessions": 100, - "SessionTimeoutMinutes": 30, - "AlarmTrackingEnabled": false, - "AlarmFilter": { - "ObjectFilters": [] - }, - "ApplicationUri": null + "Serilog": { "MinimumLevel": "Information" }, + "Node": { + "NodeId": "node-dev-a", + "ClusterId": "cluster-dev", + "ConfigDbConnectionString": "Server=localhost,14330;Database=OtOpcUaConfig;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;", + "LocalCachePath": "config_cache.db" }, - "MxAccess": { - "ClientName": "LmxOpcUa", - "NodeName": null, - "GalaxyName": null, - "ReadTimeoutSeconds": 5, - "WriteTimeoutSeconds": 5, - "MaxConcurrentOperations": 10, - "MonitorIntervalSeconds": 5, - "AutoReconnect": true, - "ProbeTag": null, - "ProbeStaleThresholdSeconds": 60, - "RuntimeStatusProbesEnabled": true, - "RuntimeStatusUnknownTimeoutSeconds": 15, - "RequestTimeoutSeconds": 30 - }, - "GalaxyRepository": { - "ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;", - "ChangeDetectionIntervalSeconds": 30, - "CommandTimeoutSeconds": 30, - "ExtendedAttributes": false, - "Scope": "Galaxy", - "PlatformName": null - }, - "Dashboard": { - "Enabled": true, - "Port": 8081, - "RefreshIntervalSeconds": 10 - }, - "Historian": { - "Enabled": false, - "ServerName": "localhost", - "ServerNames": [], - "FailureCooldownSeconds": 60, - "IntegratedSecurity": 
true, - "UserName": null, - "Password": null, - "Port": 32568, - "CommandTimeoutSeconds": 30, - "RequestTimeoutSeconds": 60, - "MaxValuesPerRead": 10000 - }, - "Authentication": { - "AllowAnonymous": true, - "AnonymousCanWrite": true, - "Ldap": { - "Enabled": false - } - }, - "Security": { - "Profiles": ["None"], - "AutoAcceptClientCertificates": true, - "RejectSHA1Certificates": true, - "MinimumCertificateKeySize": 2048, - "PkiRootPath": null, - "CertificateSubject": null - }, - "Redundancy": { - "Enabled": false, - "Mode": "Warm", - "Role": "Primary", - "ServerUris": [], - "ServiceLevelBase": 200 + "OpcUaServer": { + "EndpointUrl": "opc.tcp://0.0.0.0:4840/OtOpcUa", + "ApplicationUri": "urn:node-dev-a:OtOpcUa", + "SecurityProfile": "None", + "AutoAcceptUntrustedClientCertificates": true, + "Ldap": { "Enabled": false } } } ``` + +### OtOpcUa Admin — `src/ZB.MOM.WW.OtOpcUa.Admin/appsettings.json` + +| Section | Purpose | +|---|---| +| `ConnectionStrings:ConfigDb` | SQL connection string — must point at the same Config DB every Server reaches. | +| `Authentication:Ldap` | LDAP bind for the Admin login form (same options shape as the Server's `OpcUaServer:Ldap`). | +| `CertTrust` | `CertTrustOptions` — file-system path under the Server's `PkiStoreRoot` so the Admin Certificates page can promote rejected client certs. | +| `Metrics:Prometheus:Enabled` | Toggles the `/metrics` scrape endpoint (default true). | +| `Serilog` | Logging. | + +### Galaxy.Host + +Environment-variable driven (`OTOPCUA_GALAXY_PIPE`, `OTOPCUA_ALLOWED_SID`, `OTOPCUA_GALAXY_SECRET`, `OTOPCUA_GALAXY_BACKEND`, `OTOPCUA_GALAXY_ZB_CONN`, `OTOPCUA_HISTORIAN_*`). No `appsettings.json` — the supervisor owns the launch environment. See [`ServiceHosting.md`](ServiceHosting.md#galaxyhost-process). + +### Environment overrides + +Standard .NET config layering applies: `appsettings.{Environment}.json`, then environment variables with `Section__Property` naming. 
`DOTNET_ENVIRONMENT` (or `ASPNETCORE_ENVIRONMENT` for Admin) selects the overlay. + +--- + +## Authoritative configuration (Config DB) + +The Config DB is the single source of truth for every setting that a v1 deployment used to carry in `appsettings.json` as driver-specific state. `OtOpcUaConfigDbContext` (`src/ZB.MOM.WW.OtOpcUa.Configuration/OtOpcUaConfigDbContext.cs`) is the EF Core context used by both the Admin writer and every Server reader. + +### Top-level sections operators touch + +| Concept | Entity | Admin UI surface | Purpose | +|---|---|---|---| +| Cluster | `ServerCluster` | Clusters pages | Fleet unit; owns nodes, generations, UNS, ACLs. | +| Cluster node | `ClusterNode` + `ClusterNodeCredential` | RedundancyTab, Hosts page | Per-node identity, `RedundancyRole`, `ServiceLevelBase`, ApplicationUri, service-account credentials. | +| Generation | `ConfigGeneration` + `ClusterNodeGenerationState` | Generations / DiffViewer | Append-only; draft → publish workflow (`sp_PublishGeneration`). | +| Namespace | `Namespace` | Namespaces tab | Per-cluster OPC UA namespace; `Kind` = Equipment / SystemPlatform / Simulated. | +| Driver instance | `DriverInstance` | Drivers tab | Configured driver (Modbus, S7, OpcUaClient, Galaxy, …) + `DriverConfig` JSON + resilience profile. | +| Device | `Device` | Under each driver instance | Per-host settings inside a driver instance (IP, port, unit-id…). | +| UNS hierarchy | `UnsArea` + `UnsLine` | UnsTab (drag/drop) | L3 / L4 of the unified namespace. | +| Equipment | `Equipment` | Equipment pages, CSV import | L5; carries `MachineCode`, `ZTag`, `SAPID`, `EquipmentUuid`, reservation-backed external ids. | +| Tag | `Tag` | Under each equipment | Driver-specific tag address + `SecurityClassification` + poll-group assignment. | +| Poll group | `PollGroup` | Driver-scoped | Poll cadence buckets; `PollGroupEngine` in Core.Abstractions uses this at runtime. 
| +| ACL | `NodeAcl` | AclsTab + Probe dialog | Per-level permission grants, additive only. See [`security.md`](security.md#data-plane-authorization). | +| Role grant | `LdapGroupRoleMapping` | RoleGrants page | Maps LDAP groups → Admin roles (`ConfigViewer` / `ConfigEditor` / `FleetAdmin`). | +| External id reservation | `ExternalIdReservation` | Reservations page | Reservation-backed `ZTag` and `SAPID` uniqueness. | +| Equipment import batch | `EquipmentImportBatch` | CSV import flow | Staged bulk-add with validation preview. | +| Audit log | `ConfigAuditLog` | Audit page | Append-only record of every publish, rollback, credential rotation, role-grant change. | + +### Draft → publish generation model + +All edits go into a **draft** generation scoped to one cluster. `DraftValidationService` checks invariants (same-cluster FKs, reservation collisions, UNS path consistency, ACL scope validity). When the operator clicks Publish, `sp_PublishGeneration` atomically promotes the draft, records the audit event, and causes every `RedundancyCoordinator.RefreshAsync` in the affected cluster to pick up the new topology + ACL set. The Admin UI `DiffViewer` shows exactly what's changing before publish. + +Old generations are retained; rollback is "publish older generation as new". `ConfigAuditLog` makes every change auditable by principal + timestamp. + +### Offline cache + +Each Server process caches the last-seen published generation in `Node:LocalCachePath` via LiteDB (`LiteDbConfigCache` in `src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/`). The cache lets a node start without the central DB reachable; once the DB comes back, `NodeBootstrap` syncs to the current generation. + +### Full schema reference + +For table columns, indexes, stored procedures, the publish-transaction semantics, and the SQL authorization model (per-node SQL principals + `SESSION_CONTEXT` cluster binding), see [`docs/v2/config-db-schema.md`](v2/config-db-schema.md). 
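
The offline-cache behaviour described under "Offline cache" can be pictured with a minimal sketch. It assumes the startup path tries the central Config DB first and falls back to the last cached generation; `GenerationSnapshot`, `BootstrapSketch`, and the three delegates are hypothetical stand-ins, not the real `NodeBootstrap` or `LiteDbConfigCache` API.

```csharp
using System;

// Hypothetical stand-in for a published generation; not the project's real entity.
public sealed record GenerationSnapshot(long GenerationId, string Payload);

public static class BootstrapSketch
{
    // Tries the central Config DB first and falls back to the local cache.
    // fetchFromDb, readCache and writeCache are placeholders for the real
    // EF Core and LiteDB calls, which this sketch does not model.
    public static GenerationSnapshot LoadStartupGeneration(
        Func<GenerationSnapshot> fetchFromDb,
        Func<GenerationSnapshot?> readCache,
        Action<GenerationSnapshot> writeCache)
    {
        try
        {
            var current = fetchFromDb();
            writeCache(current);   // keep the cache at the last-seen published generation
            return current;
        }
        catch (Exception dbError)
        {
            var cached = readCache();
            if (cached is null)
            {
                throw new InvalidOperationException(
                    "Config DB unreachable and no cached generation available.", dbError);
            }
            return cached;         // start from the cache; resync once the DB is reachable
        }
    }
}
```

A node started this way would still need a later resync step once the database returns; that part is left out here.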
+ +### Admin UI flow + +For the draft editor, DiffViewer, CSV import, IdentificationFields, RedundancyTab, AclsTab + Probe-this-permission, RoleGrants, and the SignalR real-time surface, see [`docs/v2/admin-ui.md`](v2/admin-ui.md). + +--- + +## Where did v1 appsettings sections go? + +Quick index for operators coming from v1 LmxOpcUa: + +| v1 appsettings section | v2 home | +|---|---| +| `OpcUa.Port` / `BindAddress` / `EndpointPath` / `ServerName` | Bootstrap `OpcUaServer:EndpointUrl` + `ApplicationName`. | +| `OpcUa.ApplicationUri` | Config DB `ClusterNode.ApplicationUri`. | +| `OpcUa.MaxSessions` / `SessionTimeoutMinutes` | Bootstrap `OpcUaServer:*` (if exposed) or stack defaults. | +| `OpcUa.AlarmTrackingEnabled` / `AlarmFilter` | Per driver instance in Config DB (alarm surface is capability-driven per `IAlarmSource`). | +| `MxAccess.*` | Galaxy driver instance `DriverConfig` JSON + Galaxy.Host env vars (see [`ServiceHosting.md`](ServiceHosting.md#galaxyhost-process)). | +| `GalaxyRepository.*` | Galaxy driver instance `DriverConfig` JSON + `OTOPCUA_GALAXY_ZB_CONN` env var. | +| `Dashboard.*` | Retired — Admin UI replaces the dashboard. See [`StatusDashboard.md`](StatusDashboard.md). | +| `Historian.*` | Galaxy driver instance `DriverConfig` JSON + `OTOPCUA_HISTORIAN_*` env vars. | +| `Authentication.Ldap.*` | Bootstrap `OpcUaServer:Ldap` (same shape) + Admin `Authentication:Ldap` for the UI login. | +| `Security.*` | Bootstrap `OpcUaServer:SecurityProfile` + `PkiStoreRoot` + `AutoAcceptUntrustedClientCertificates`. | +| `Redundancy.*` | Config DB `ClusterNode.RedundancyRole` + `ServiceLevelBase`. | + +--- + +## Validation + +- **Bootstrap**: the process fails fast on missing required keys in `Program.cs` (e.g. `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString` all throw `InvalidOperationException` if unset). 
+- **Authoritative**: `DraftValidationService` runs on every save; `sp_ValidateDraft` runs as part of `sp_PublishGeneration` so an invalid draft cannot reach any node. diff --git a/docs/DataTypeMapping.md b/docs/DataTypeMapping.md index 9fa4cfb..e4e6d95 100644 --- a/docs/DataTypeMapping.md +++ b/docs/DataTypeMapping.md @@ -1,84 +1,65 @@ # Data Type Mapping -`MxDataTypeMapper` and `SecurityClassificationMapper` translate Galaxy attribute metadata into OPC UA variable node properties. These mappings determine how Galaxy runtime values are represented to OPC UA clients and whether clients can write to them. +Data-type mapping is driver-defined. Each driver translates its native attribute metadata into two driver-agnostic enums from `Core.Abstractions` — `DriverDataType` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverDataType.cs`) and `SecurityClassification` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/SecurityClassification.cs`) — and populates the `DriverAttributeInfo` record it hands to `IAddressSpaceBuilder.Variable(...)`. Core doesn't interpret the native types; it trusts the driver's translation. -## mx_data_type to OPC UA Type Mapping +## DriverDataType → OPC UA built-in type -Each Galaxy attribute carries an `mx_data_type` integer that identifies its data type. 
`MxDataTypeMapper.MapToOpcUaDataType` maps these to OPC UA built-in type NodeIds: +`DriverNodeManager.MapDataType` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) is the single translation table for every driver: -| mx_data_type | Galaxy type | OPC UA type | NodeId | CLR type | -|:---:|-------------|-------------|:------:|----------| -| 1 | Boolean | Boolean | i=1 | `bool` | -| 2 | Integer | Int32 | i=6 | `int` | -| 3 | Float | Float | i=10 | `float` | -| 4 | Double | Double | i=11 | `double` | -| 5 | String | String | i=12 | `string` | -| 6 | Time | DateTime | i=13 | `DateTime` | -| 7 | ElapsedTime | Double | i=11 | `double` | -| 8 | Reference | String | i=12 | `string` | -| 13 | Enumeration | Int32 | i=6 | `int` | -| 14 | Custom | String | i=12 | `string` | -| 15 | InternationalizedString | LocalizedText | i=21 | `string` | -| 16 | Custom | String | i=12 | `string` | -| other | Unknown | String | i=12 | `string` | +| DriverDataType | OPC UA NodeId | +|---|---| +| `Boolean` | `DataTypeIds.Boolean` (i=1) | +| `Int32` | `DataTypeIds.Int32` (i=6) | +| `Float32` | `DataTypeIds.Float` (i=10) | +| `Float64` | `DataTypeIds.Double` (i=11) | +| `String` | `DataTypeIds.String` (i=12) | +| `DateTime` | `DataTypeIds.DateTime` (i=13) | +| anything else | `DataTypeIds.BaseDataType` | -Unknown types default to String. This is a safe fallback because MXAccess delivers values as COM `VARIANT` objects, and string serialization preserves any value that does not have a direct OPC UA counterpart. +The enum also carries `Int16 / Int64 / UInt16 / UInt32 / UInt64 / Reference` members for drivers that need them; the mapping table is extended as those types surface in actual drivers. `Reference` is the Galaxy-style attribute reference — it's encoded as an OPC UA `String` on the wire. -### Why ElapsedTime maps to Double +## Per-driver mappers -Galaxy `ElapsedTime` (mx_data_type 7) represents a duration/timespan. OPC UA has no native `TimeSpan` type. 
The OPC UA specification defines a `Duration` type alias (NodeId i=290) that is semantically a `Double` representing milliseconds, but the simpler approach is to map directly to `Double` (i=11) representing seconds. This avoids ambiguity about whether the value is in seconds or milliseconds and matches how the Galaxy runtime exposes elapsed time values through MXAccess. +Each driver owns its native → `DriverDataType` translation: -## Array Handling +- **Galaxy Proxy** — `GalaxyProxyDriver.MapDataType(int mxDataType)` and `MapSecurity(int mxSec)` (inline in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/GalaxyProxyDriver.cs`). The Galaxy `mx_data_type` integer is sent across the Host↔Proxy pipe and mapped on the Proxy side. Galaxy's full classic 16-entry table (Boolean / Integer / Float / Double / String / Time / ElapsedTime / Reference / Enumeration / Custom / InternationalizedString) is preserved but compressed into the seven-entry `DriverDataType` enum — `ElapsedTime` → `Float64`, `InternationalizedString` → `String`, `Reference` → `Reference`, enumerations → `Int32`. +- **AB CIP** — `src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDataType.cs` maps CIP tag type codes. +- **Modbus** — `src/ZB.MOM.WW.OtOpcUa.Driver.Modbus/ModbusDriver.cs` maps register shapes (16-bit signed, 16-bit unsigned, 32-bit float, etc.) including the DirectLogic quirk table in `DirectLogicAddress.cs`. +- **S7 / AB Legacy / TwinCAT / FOCAS / OPC UA Client** — each has its own inline mapper or `*DataType.cs` file per the same pattern. -Galaxy attributes with `is_array = 1` in the repository are exposed as one-dimensional OPC UA array variables. +The driver's mapping is authoritative — when a field type is ambiguous (a `LREAL` that could be bit-reinterpreted, a BCD counter, a string of a particular encoding), the driver decides the exposed OPC UA shape. 
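
As a rough illustration of the two translation steps described above, the sketch below chains a Galaxy-style native mapper into the shared `DriverDataType` table. The enum is a local mirror of the member names listed in this document, and the fallback for `Custom` and unknown `mx_data_type` values is an assumption carried over from the v1 table; the real `GalaxyProxyDriver.MapDataType` and `DriverNodeManager.MapDataType` may differ.

```csharp
using System;

// Local mirror of the DriverDataType members named in this document (ordering assumed).
public enum DriverDataType
{
    Boolean, Int16, Int32, Int64, UInt16, UInt32, UInt64,
    Float32, Float64, String, DateTime, Reference
}

public static class TypeMappingSketch
{
    // Galaxy mx_data_type -> DriverDataType, following the v1 table and the notes above.
    public static DriverDataType MapGalaxy(int mxDataType) => mxDataType switch
    {
        1 => DriverDataType.Boolean,
        2 => DriverDataType.Int32,
        3 => DriverDataType.Float32,
        4 => DriverDataType.Float64,
        5 => DriverDataType.String,
        6 => DriverDataType.DateTime,
        7 => DriverDataType.Float64,   // ElapsedTime
        8 => DriverDataType.Reference,
        13 => DriverDataType.Int32,    // Enumeration
        15 => DriverDataType.String,   // InternationalizedString
        _ => DriverDataType.String,    // assumption: v1 fallback for Custom/unknown
    };

    // DriverDataType -> numeric identifier of the OPC UA built-in DataType node (namespace 0).
    public static uint ToDataTypeNodeId(DriverDataType type) => type switch
    {
        DriverDataType.Boolean => 1,
        DriverDataType.Int32 => 6,
        DriverDataType.Float32 => 10,
        DriverDataType.Float64 => 11,
        DriverDataType.String => 12,
        DriverDataType.DateTime => 13,
        _ => 24,                       // BaseDataType, the table's "anything else" row
    };
}
```

For example, `TypeMappingSketch.ToDataTypeNodeId(TypeMappingSketch.MapGalaxy(7))` follows the `ElapsedTime` → `Float64` → `i=11` path.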
-### ValueRank +## Array handling -The `ValueRank` property on the OPC UA variable node indicates the array dimensionality: +`DriverAttributeInfo.IsArray = true` flips `ValueRank = OneDimension` on the generated `BaseDataVariableState`; scalars stay at `ValueRank.Scalar`. `DriverAttributeInfo.ArrayDim` carries the declared length. Writing element-by-element (OPC UA `IndexRange`) is a driver-level decision — see `docs/ReadWriteOperations.md`. -| `is_array` | ValueRank | Constant | -|:---:|:---------:|----------| -| 0 | -1 | `ValueRanks.Scalar` | -| 1 | 1 | `ValueRanks.OneDimension` | +## SecurityClassification — metadata, not ACL -### ArrayDimensions +`SecurityClassification` is driver-reported metadata only. Drivers never enforce write permissions themselves — the classification flows into the Server project where `WriteAuthzPolicy.IsAllowed(classification, userRoles)` (`src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs`) gates the write against the session's LDAP-derived roles, and (Phase 6.2) the `AuthorizationGate` + permission trie apply on top. This is the "ACL at server layer" invariant recorded in `feedback_acl_at_server_layer.md`. 
-When `ValueRank = 1`, the `ArrayDimensions` property is set to a single-element `ReadOnlyList` containing the declared array length from `array_dimension`: +The classification values mirror the v1 Galaxy model so existing galaxies keep their published semantics: -```csharp -if (attr.IsArray && attr.ArrayDimension.HasValue) -{ - variable.ArrayDimensions = new ReadOnlyList<uint>( - new List<uint> { (uint)attr.ArrayDimension.Value }); -} -``` +| SecurityClassification | Required role | Write-from-OPC-UA | +|---|---|---| +| `FreeAccess` | — | yes (even anonymous) | +| `Operate` | `WriteOperate` | yes | +| `Tune` | `WriteTune` | yes | +| `Configure` | `WriteConfigure` | yes | +| `SecuredWrite` | `WriteOperate` | yes | +| `VerifiedWrite` | `WriteConfigure` | yes | +| `ViewOnly` | — | no | -The `array_dimension` value is extracted from the `mx_value` binary column in the Galaxy database (bytes 13-16, little-endian int32). +Drivers whose backend has no notion of classification (Modbus, most PLCs) default every tag to `FreeAccess` or `Operate`; drivers whose backend does carry the notion (Galaxy, OPC UA Client relaying `UserAccessLevel`) translate it directly. -### NodeId for array variables +## Historization -Array variables use a NodeId without the `[]` suffix. The `full_tag_reference` stored internally for MXAccess addressing retains the `[]` (e.g., `MESReceiver_001.MoveInPartNumbers[]`), but the OPC UA NodeId strips it to `ns=1;s=MESReceiver_001.MoveInPartNumbers`. - -## Security Classification to AccessLevel Mapping - -Galaxy attributes carry a `security_classification` value that controls write permissions. 
`SecurityClassificationMapper.IsWritable` determines the OPC UA `AccessLevel`: - -| security_classification | Galaxy level | OPC UA AccessLevel | Writable | -|:---:|--------------|-------------------|:--------:| -| 0 | FreeAccess | CurrentReadOrWrite | Yes | -| 1 | Operate | CurrentReadOrWrite | Yes | -| 2 | SecuredWrite | CurrentRead | No | -| 3 | VerifiedWrite | CurrentRead | No | -| 4 | Tune | CurrentReadOrWrite | Yes | -| 5 | Configure | CurrentReadOrWrite | Yes | -| 6 | ViewOnly | CurrentRead | No | - -Most attributes default to Operate (1). The mapper treats SecuredWrite, VerifiedWrite, and ViewOnly as read-only because the OPC UA server does not implement the Galaxy's multi-level authentication model. Allowing writes to SecuredWrite or VerifiedWrite attributes without proper verification would bypass Galaxy security. - -For historized attributes, `AccessLevels.HistoryRead` is added to the access level via bitwise OR, enabling OPC UA history read requests when an `IHistorianDataSource` is configured via the runtime-loaded historian plugin. +`DriverAttributeInfo.IsHistorized = true` flips `AccessLevel.HistoryRead` and `Historizing = true` on the variable. The driver must then implement `IHistoryProvider` for HistoryRead service calls to succeed; otherwise the node manager surfaces `BadHistoryOperationUnsupported` per request. 
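
A small sketch of the historization rule above, under stated assumptions: the bit values are the standard OPC UA `AccessLevel` bits, while `IHistoryProviderSketch`, `HistorizationSketch`, and the string status names are hypothetical placeholders rather than the node manager's actual code.

```csharp
using System;

// Hypothetical marker standing in for the real IHistoryProvider capability interface.
public interface IHistoryProviderSketch { }

public static class HistorizationSketch
{
    // AccessLevel bit values as defined by OPC UA Part 3.
    public const byte CurrentRead = 0x01;
    public const byte CurrentWrite = 0x02;
    public const byte HistoryRead = 0x04;

    // Adds the HistoryRead bit and sets Historizing when the attribute is historized.
    public static (byte AccessLevel, bool Historizing) Apply(byte baseAccessLevel, bool isHistorized) =>
        isHistorized
            ? ((byte)(baseAccessLevel | HistoryRead), true)
            : (baseAccessLevel, false);

    // Status the node manager would surface for a HistoryRead call on this driver's nodes.
    public static string HistoryReadStatus(object driver) =>
        driver is IHistoryProviderSketch ? "Good" : "BadHistoryOperationUnsupported";
}
```

The second helper shows why the flag alone is not enough: a node can advertise `HistoryRead` while its driver still lacks the capability, in which case each request is rejected.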
## Key source files -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/MxDataTypeMapper.cs` -- Type and CLR mapping -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/SecurityClassificationMapper.cs` -- Write access mapping -- `gr/data_type_mapping.md` -- Reference documentation for the full mapping table +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverDataType.cs` — driver-agnostic type enum +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/SecurityClassification.cs` — write-authz tier metadata +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs` — per-attribute descriptor +- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — `MapDataType` translation +- `src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs` — classification-to-role policy +- Per-driver mappers in each `Driver.*` project diff --git a/docs/HistoricalDataAccess.md b/docs/HistoricalDataAccess.md index 57987a1..6b531b6 100644 --- a/docs/HistoricalDataAccess.md +++ b/docs/HistoricalDataAccess.md @@ -1,228 +1,109 @@ # Historical Data Access -`LmxNodeManager` exposes OPC UA historical data access (HDA) through an abstract `IHistorianDataSource` interface (`Historian/IHistorianDataSource.cs`). The Wonderware Historian implementation lives in a separate assembly, `ZB.MOM.WW.OtOpcUa.Historian.Aveva`, which is loaded at runtime only when `Historian.Enabled=true`. This keeps the `aahClientManaged` SDK out of the core Host so deployments that do not need history do not need the SDK installed. +OPC UA HistoryRead is a **per-driver optional capability** in OtOpcUa. The Core dispatches HistoryRead service calls to the owning driver through the `IHistoryProvider` capability interface (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs`). Drivers that don't implement the interface return `BadHistoryOperationUnsupported` for every history call on their nodes; that is the expected behavior for protocol drivers (Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS) whose wire protocols carry no time-series data. 
-## Plugin Architecture +Historian integration is no longer a separate bolt-on assembly, as it was in v1 (`ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin). It is now one optional capability any driver can implement. The first implementation is the Galaxy driver's Wonderware Historian integration; OPC UA Client forwards HistoryRead to the upstream server. Every other driver leaves the capability unimplemented and the Core short-circuits history calls on nodes that belong to those drivers. -The historian surface is split across two assemblies: +## `IHistoryProvider` -- **`ZB.MOM.WW.OtOpcUa.Host`** (core) owns only OPC UA / BCL types: - - `IHistorianDataSource` -- the interface `LmxNodeManager` depends on - - `HistorianEventDto` -- SDK-free representation of a historian event record - - `HistorianAggregateMap` -- maps OPC UA aggregate NodeIds to AnalogSummary column names - - `HistorianPluginLoader` -- loads the plugin via `Assembly.LoadFrom` at startup - - `HistoryContinuationPointManager` -- paginates HistoryRead results -- **`ZB.MOM.WW.OtOpcUa.Historian.Aveva`** (plugin) owns everything SDK-bound: - - `HistorianDataSource` -- implements `IHistorianDataSource`, wraps `aahClientManaged` - - `IHistorianConnectionFactory` / `SdkHistorianConnectionFactory` -- opens and polls `ArchestrA.HistorianAccess` connections - - `AvevaHistorianPluginEntry.Create(HistorianConfiguration)` -- the static factory invoked by the loader +Four methods, mapping onto the four OPC UA HistoryRead service variants: -The plugin assembly and its SDK dependencies (`aahClientManaged.dll`, `aahClient.dll`, `aahClientCommon.dll`, `Historian.CBE.dll`, `Historian.DPAPI.dll`, `ArchestrA.CloudHistorian.Contract.dll`) deploy to a `Historian/` subfolder next to `ZB.MOM.WW.OtOpcUa.Host.exe`. See [Service Hosting](ServiceHosting.md#required-runtime-assemblies) for the full layout and deployment matrix. 
+| Method | OPC UA service | Notes | +|--------|----------------|-------| +| `ReadRawAsync` | HistoryReadRawModified (raw subset) | Returns `HistoryReadResult { Samples, ContinuationPoint? }`. The Core handles `ContinuationPoint` pagination. | +| `ReadProcessedAsync` | HistoryReadProcessed | Takes a `HistoryAggregateType` (Average / Minimum / Maximum / Total / Count) and a bucket `interval`. Drivers that can't express an aggregate throw `NotSupportedException`; the Core translates that into `BadAggregateNotSupported`. | +| `ReadAtTimeAsync` | HistoryReadAtTime | Default implementation throws `NotSupportedException` — drivers without interpolation / prior-boundary support leave the default. | +| `ReadEventsAsync` | HistoryReadEvents | Historical alarm/event rows, distinct from the live `IAlarmSource` stream. Default throws; only drivers with an event historian (Galaxy's A&E log) override. | -## Plugin Loading +Supporting DTOs live alongside the interface in `Core.Abstractions`: -When the service starts with `Historian.Enabled=true`, `OpcUaService` calls `HistorianPluginLoader.TryLoad(config)`. The loader: +- `HistoryReadResult(IReadOnlyList Samples, byte[]? ContinuationPoint)` +- `HistoryAggregateType` — enum `{ Average, Minimum, Maximum, Total, Count }` +- `HistoricalEvent(EventId, SourceName?, EventTimeUtc, ReceivedTimeUtc, Message?, Severity)` +- `HistoricalEventsResult(IReadOnlyList Events, byte[]? ContinuationPoint)` -1. Probes `AppDomain.CurrentDomain.BaseDirectory\Historian\ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll`. -2. Installs a one-shot `AppDomain.AssemblyResolve` handler that redirects any `aahClientManaged`/`aahClientCommon`/`Historian.*` lookups to the same subfolder, so the CLR can resolve SDK dependencies when the plugin first JITs. -3. Calls the plugin's `AvevaHistorianPluginEntry.Create(HistorianConfiguration)` via reflection and returns the resulting `IHistorianDataSource`. -4. 
On any failure (plugin missing, entry type not found, SDK assembly unresolvable, bad image), logs a warning with the expected plugin path and returns `null`. The server starts normally and `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call. +## Dispatch through `CapabilityInvoker` -## Wonderware Historian SDK +All four HistoryRead surfaces are wrapped by `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`) with `DriverCapability.HistoryRead`. The Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability.HistoryRead)` provides timeout, circuit-breaker, and bulkhead defaults per the driver's stability tier (see [docs/v2/driver-stability.md](v2/driver-stability.md)). -The plugin uses the AVEVA Historian managed SDK (`aahClientManaged.dll`) to query historical data. The SDK provides a cursor-based query API through `ArchestrA.HistorianAccess`, replacing direct SQL queries against the Historian Runtime database. Two query types are used: +The dispatch point is `DriverNodeManager` in `ZB.MOM.WW.OtOpcUa.Server`. When the OPC UA stack calls `HistoryRead`, the node manager: -- **`HistoryQuery`** -- Raw historical samples with timestamp, value (numeric or string), and OPC quality. -- **`AnalogSummaryQuery`** -- Pre-computed aggregates with properties for Average, Minimum, Maximum, ValueCount, First, Last, StdDev, and more. +1. Resolves the target `NodeHandle` to a `(DriverInstanceId, fullReference)` pair. +2. Checks the owning driver's `DriverTypeMetadata` to see if the type may advertise history at all (fast reject for types that never implement `IHistoryProvider`). +3. If the driver instance implements `IHistoryProvider`, wraps the `ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` call in `CapabilityInvoker.InvokeAsync(... DriverCapability.HistoryRead ...)`. +4. Translates the `HistoryReadResult` into an OPC UA `HistoryData` + `ExtensionObject`. +5. 
Manages the continuation point via `HistoryContinuationPointManager` so clients can page through large result sets. -The SDK DLLs are located in `lib/` and originate from `C:\Program Files (x86)\Wonderware\Historian\`. Only the plugin project (`src/ZB.MOM.WW.OtOpcUa.Historian.Aveva/`) references them at build time; the core Host project does not. +Driver-level history code never sees the continuation-point protocol or the OPC UA stack types — those stay in the Core. -## Configuration +## Driver coverage -`HistorianConfiguration` controls the SDK connection: +| Driver | Implements `IHistoryProvider`? | Source | +|--------|:------------------------------:|--------| +| Galaxy | Yes — raw, processed, at-time, events | `aahClientManaged` SDK (Wonderware Historian) on the Host side, forwarded through the Proxy's IPC | +| OPC UA Client | Yes — raw, processed, at-time, events (forwarded to upstream) | `Opc.Ua.Client.Session.HistoryRead` against the remote server | +| Modbus | No | Wire protocol has no time-series concept | +| Siemens S7 | No | S7comm has no time-series concept | +| AB CIP | No | CIP has no time-series concept | +| AB Legacy | No | PCCC has no time-series concept | +| TwinCAT | No | ADS symbol reads are point-in-time; archiving is an external concern | +| FOCAS | No | Default — FOCAS has no general-purpose historian API | -```csharp -public class HistorianConfiguration -{ - public bool Enabled { get; set; } = false; - public string ServerName { get; set; } = "localhost"; - public List<string> ServerNames { get; set; } = new(); - public int FailureCooldownSeconds { get; set; } = 60; - public bool IntegratedSecurity { get; set; } = true; - public string? UserName { get; set; } - public string? 
Password { get; set; } - public int Port { get; set; } = 32568; - public int CommandTimeoutSeconds { get; set; } = 30; - public int MaxValuesPerRead { get; set; } = 10000; - public int RequestTimeoutSeconds { get; set; } = 60; -} -``` +## Galaxy — Wonderware Historian (`aahClientManaged`) -When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plugin is loaded, and the node manager returns `BadHistoryOperationUnsupported` for history read requests. When `Enabled` is `true` but the plugin cannot be loaded (missing `Historian/` subfolder, SDK assembly resolve failure, etc.), the server still starts and returns the same `BadHistoryOperationUnsupported` status with a warning in the log. +The Galaxy driver's `IHistoryProvider` implementation lives on the Host side (`.NET 4.8 x86`) in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Historian/`. The Proxy's `GalaxyProxyDriver.ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` each serializes a `HistoryRead*Request` and awaits the matching `HistoryRead*Response` over the named pipe (see [drivers/Galaxy.md](drivers/Galaxy.md#ipc-transport)). -### Connection Properties +Host-side, `HistorianDataSource` uses the AVEVA Historian managed SDK (`aahClientManaged.dll`) to query historical data via a cursor-based API through `ArchestrA.HistorianAccess`: -| Property | Default | Description | -|---|---|---| -| `ServerName` | `localhost` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments | -| `ServerNames` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover (see [Cluster Failover](#read-only-cluster-failover)) | -| `FailureCooldownSeconds` | `60` | How long a failed cluster node is skipped before being re-tried. 
Zero means no cooldown (retry on every request) | -| `IntegratedSecurity` | `true` | Use Windows authentication | -| `UserName` | `null` | Username when `IntegratedSecurity` is false | -| `Password` | `null` | Password when `IntegratedSecurity` is false | -| `Port` | `32568` | Historian TCP port | -| `CommandTimeoutSeconds` | `30` | SDK packet timeout in seconds (inner async bound) | -| `RequestTimeoutSeconds` | `60` | Outer safety timeout applied to sync-over-async history reads on the OPC UA stack thread. Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Should be greater than `CommandTimeoutSeconds`. Stability review 2026-04-13 Finding 3 | -| `MaxValuesPerRead` | `10000` | Maximum values per history read request | +- **`HistoryQuery`** — raw historical samples (timestamp, value, OPC quality) +- **`AnalogSummaryQuery`** — pre-computed aggregates (Average, Minimum, Maximum, ValueCount, First, Last, StdDev) -## Connection Lifecycle +The SDK DLLs are pulled into the Galaxy.Host project at build time; the Server and every other driver project remain SDK-free. -`HistorianDataSource` (in the plugin assembly) maintains a persistent connection to the Historian server via `ArchestrA.HistorianAccess`: +> **Gap / status note.** The raw SDK wrapper (`HistorianDataSource`, `HistorianClusterEndpointPicker`, `HistorianHealthSnapshot`, etc.) has been ported from the v1 `ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin into `Driver.Galaxy.Host/Backend/Historian/`. The **IPC wire-up** — `HistoryReadRequest` / `HistoryReadResponse` message kinds, Proxy-side `ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` forwarding — is in place on `GalaxyProxyDriver`. 
What remains to close on a given branch is Host-side **mapping of `HistoryAggregateType` onto the `AnalogSummaryQuery` column names** (done in `GalaxyProxyDriver.MapAggregateToColumn`; the Host side must mirror it) and the **end-to-end integration test** that was held by the v1 plugin suite. Until those land on a given driver branch, history calls against Galaxy may surface `GalaxyIpcException { Code = "not-implemented" }` or backend-specific errors rather than populated `HistoryReadResult`s. Track the remaining work against the Phase 2 Galaxy out-of-process gate in `docs/v2/plan.md`. -1. **Lazy connect** -- The connection is established on the first query via `EnsureConnected()`. When a cluster is configured, the data source iterates `HistorianClusterEndpointPicker.GetHealthyNodes()` in order and returns the first node that successfully connects. -2. **Connection reuse** -- Subsequent queries reuse the same connection. The active node is tracked in `_activeProcessNode` / `_activeEventNode` and surfaced on the dashboard. -3. **Auto-reconnect** -- On connection failure, the connection is disposed, the active node is marked failed in the picker, and the next query re-enters the picker loop to try the next eligible candidate. -4. **Clean shutdown** -- `Dispose()` closes the connection when the service stops. +### Aggregate function mapping -The connection is opened with `ReadOnly = true` and `ConnectionType = Process`. The event (alarm history) path uses a separate connection with `ConnectionType = Event`, but both silos share the same cluster picker so a node that fails on one silo is immediately skipped on the other. 
+`GalaxyProxyDriver.MapAggregateToColumn` (Proxy-side) translates the OPC UA Part 13 standard aggregate enum onto `AnalogSummaryQuery` column names consumed by `HistorianDataSource.ReadAggregateAsync`: -## Read-Only Cluster Failover +| `HistoryAggregateType` | Result Property | +|------------------------|-----------------| +| `Average` | `Average` | +| `Minimum` | `Minimum` | +| `Maximum` | `Maximum` | +| `Count` | `ValueCount` | -When `HistorianConfiguration.ServerNames` is non-empty, the plugin picks from an ordered list of cluster nodes instead of a single `ServerName`. Each connection attempt tries candidates in configuration order until one succeeds. Failed nodes are placed into a timed cooldown and re-admitted when the cooldown elapses. +`HistoryAggregateType.Total` is **not supported** by Wonderware `AnalogSummary` and raises `NotSupportedException`, which the Core translates to `BadAggregateNotSupported`. Additional OPC UA aggregates (`Start`, `End`, `StandardDeviationPopulation`) sit on the Historian columns `First`, `Last`, `StdDev` and can be exposed by extending the enum + mapping together. -### HistorianClusterEndpointPicker +### Read-only cluster failover -The picker (in the plugin assembly, internal) is pure logic with no SDK dependency — all cluster behavior is unit-testable with a fake clock and scripted factory. Key characteristics: +`HistorianConfiguration.ServerNames` accepts an ordered list of cluster nodes. `HistorianClusterEndpointPicker` iterates the list in configuration order, marks failed nodes with a `FailureCooldownSeconds` window, and re-admits them when the cooldown elapses. One picker instance is shared by the process-values connection and the event-history connection (two SDK silos), so a node failure on one silo immediately benches it for the other. `FailureCooldownSeconds = 0` disables the cooldown — the SDK's own retry semantics are the sole gate. 
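The cooldown behavior described for read-only cluster failover can be illustrated with a minimal sketch. This is not the real `HistorianClusterEndpointPicker`; the class name `CooldownPickerSketch`, its constructor shape, and the injected clock delegate are assumptions made for illustration. Only the ordered iteration, the per-node cooldown window, and the zero-cooldown behavior are taken from the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of ordered failover with a per-node cooldown.
// Names and signatures here are illustrative, not the production API.
public sealed class CooldownPickerSketch
{
    private readonly object _lock = new object();
    private readonly List<string> _nodes;   // kept in configuration order
    private readonly Dictionary<string, DateTime> _cooldownUntil =
        new Dictionary<string, DateTime>();
    private readonly TimeSpan _cooldown;
    private readonly Func<DateTime> _utcNow; // injected clock so tests can fake time

    public CooldownPickerSketch(IEnumerable<string> nodes,
                                int failureCooldownSeconds,
                                Func<DateTime> utcNow)
    {
        _nodes = nodes.ToList();
        _cooldown = TimeSpan.FromSeconds(failureCooldownSeconds);
        _utcNow = utcNow;
    }

    // Returns nodes in configuration order, skipping any still in cooldown.
    // A node whose window has elapsed is re-admitted without a background probe.
    public IReadOnlyList<string> GetHealthyNodes()
    {
        lock (_lock)
        {
            var now = _utcNow();
            return _nodes
                .Where(n => !_cooldownUntil.TryGetValue(n, out var until) || until <= now)
                .ToList();
        }
    }

    // Benches the node for the cooldown window; a zero cooldown never benches.
    public void MarkFailed(string node)
    {
        lock (_lock)
        {
            if (_cooldown > TimeSpan.Zero)
                _cooldownUntil[node] = _utcNow() + _cooldown;
        }
    }

    // Clears the cooldown immediately, e.g. after a successful connect.
    public void MarkHealthy(string node)
    {
        lock (_lock) { _cooldownUntil.Remove(node); }
    }
}
```

Because a single instance would be shared by both connection silos, a `MarkFailed` call made on the process-values path also hides the node from the event-history path on its next `GetHealthyNodes()` call.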
-- **Ordered iteration**: nodes are tried in the exact order they appear in `ServerNames`. Operators can express a preference ("primary first, fallback second") by ordering the list. -- **Per-node cooldown**: `MarkFailed(node, error)` starts a `FailureCooldownSeconds` window during which the node is skipped from `GetHealthyNodes()`. `MarkHealthy(node)` clears the window immediately (used on successful connect). -- **Automatic re-admission**: when a node's cooldown elapses, the next call to `GetHealthyNodes()` includes it automatically — no background probe, no manual reset. The cumulative `FailureCount` and `LastError` are retained for operator diagnostics. -- **Thread-safe**: a single lock guards the per-node state. Operations are microsecond-scale so contention is a non-issue. -- **Shared across silos**: one picker instance is shared by the process-values connection and the event-history connection, so a node failure on one path immediately benches it for the other. -- **Zero cooldown mode**: `FailureCooldownSeconds = 0` disables the cooldown entirely — the node is never benched. Useful for tests or for operators who want the SDK's own retry semantics to be the sole gate. +Host-side cluster health is surfaced via `HistorianHealthSnapshot { NodeCount, HealthyNodeCount, ActiveProcessNode, ActiveEventNode, Nodes }` and forwarded to the Proxy so the Admin UI Historian panel can render a per-node table. `HealthCheckService` flips overall service health to `Degraded` when `HealthyNodeCount < NodeCount`. 
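The degradation rule for a partially failed cluster can be written as a small pure function. The sketch below is an assumption about shape only: `HistorianHealthRulesSketch`, `HealthSketch`, and the parameter list are hypothetical, and the real `HealthCheckService` may evaluate more inputs. It combines the `HealthyNodeCount < NodeCount` rule with the consecutive-failure threshold of 3 that this document gives for the runtime health counters.

```csharp
using System;

// Hypothetical two-state health result used only by this sketch.
public enum HealthSketch { Healthy, Degraded }

public static class HistorianHealthRulesSketch
{
    // Threshold stated in the documentation for ConsecutiveFailures.
    public const int ConsecutiveFailureThreshold = 3;

    public static HealthSketch Evaluate(int nodeCount,
                                        int healthyNodeCount,
                                        int consecutiveFailures)
    {
        // Any benched node degrades health, even while queries still succeed
        // through the remaining nodes.
        if (healthyNodeCount < nodeCount)
            return HealthSketch.Degraded;

        // Repeated query failures degrade health even when every node is eligible.
        if (consecutiveFailures >= ConsecutiveFailureThreshold)
            return HealthSketch.Degraded;

        return HealthSketch.Healthy;
    }
}
```

Keeping the rule free of SDK types means it could be unit-tested without a Historian connection.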
-### Connection attempt flow +### Runtime health counters -`HistorianDataSource.ConnectToAnyHealthyNode(HistorianConnectionType)` performs the actual iteration: +`HistorianDataSource` maintains per-read counters — `TotalQueries`, `TotalSuccesses`, `TotalFailures`, `ConsecutiveFailures`, `LastSuccessTime`, `LastFailureTime`, `LastError`, `ProcessConnectionOpen`, `EventConnectionOpen` — so the dashboard can distinguish "backend loaded but never queried" from "backend loaded and queries are failing". `LastError` is prefixed with the read path (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which silo is broken. `HealthCheckService` degrades at `ConsecutiveFailures >= 3`. -1. Snapshot healthy nodes from the picker. If empty, throw `InvalidOperationException` with either "No historian nodes configured" or "All N historian nodes are in cooldown". -2. For each candidate, clone `HistorianConfiguration` with the candidate as `ServerName` and pass it to the factory. On success: `MarkHealthy(node)` and return the `(Connection, Node)` tuple. On exception: `MarkFailed(node, ex.Message)`, log a warning, continue. -3. If all candidates fail, wrap the last inner exception in an `InvalidOperationException` with the cumulative failure count so the existing read-method catch blocks surface a meaningful error through the health counters. +### Quality mapping -The wrapping exception intentionally includes the last inner error message in the outer `Message` so the health snapshot's `LastError` field is still human-readable when the cluster exhausts every candidate. - -### Single-node backward compatibility - -When `ServerNames` is empty, the picker is seeded with a single entry from `ServerName` and the iteration loop still runs — it just has one candidate. 
Legacy deployments see no behavior change: the picker marks the single node healthy on success, runs the same cooldown logic on failure, and the dashboard renders a compact `Node: ` line instead of the cluster table. - -### Cluster health surface - -Runtime cluster state is exposed on `HistorianHealthSnapshot`: - -- `NodeCount` / `HealthyNodeCount` -- size of the configured cluster and how many are currently eligible. -- `ActiveProcessNode` / `ActiveEventNode` -- which nodes are currently serving the two connection silos, or `null` when a silo has no open connection. -- `Nodes: List` -- per-node state with `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`. - -The dashboard renders this as a cluster table when `NodeCount > 1`. See [Status Dashboard](StatusDashboard.md#historian). `HealthCheckService` flips the overall service health to `Degraded` when `HealthyNodeCount < NodeCount` so operators can alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes. - -## Runtime Health Counters - -`HistorianDataSource` maintains runtime query counters updated on every read method exit — success or failure — so the dashboard can distinguish "plugin loaded but never queried" from "plugin loaded and queries are failing". The load-time `HistorianPluginLoader.LastOutcome` only reports whether the assembly resolved at startup; it cannot catch a connection that succeeds at boot and degrades later. - -### Counters - -- `TotalQueries` / `TotalSuccesses` / `TotalFailures` — cumulative since startup. Every call to `RecordSuccess` or `RecordFailure` in the read methods updates these under `_healthLock`. Empty result sets count as successes — the counter reflects "the SDK call returned" rather than "the SDK call returned data". -- `ConsecutiveFailures` — latches while queries are failing; reset to zero by the first success. Drives `HealthCheckService` degradation at threshold 3. 
-- `LastSuccessTime` / `LastFailureTime` — UTC timestamps of the most recent success or failure, or `null` when no query of that outcome has occurred yet. -- `LastError` — exception message from the most recent failure, prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call is broken. Cleared on the next success. -- `ProcessConnectionOpen` / `EventConnectionOpen` — whether the plugin currently holds an open SDK connection on each silo. Read from the data source's `_connection` / `_eventConnection` fields via a `Volatile.Read`. - -These fields are read once per dashboard refresh via `IHistorianDataSource.GetHealthSnapshot()` and serialized into `HistorianStatusInfo`. See [Status Dashboard](StatusDashboard.md#historian) for the HTML/JSON surface. - -### Two SDK connection silos - -The plugin maintains two independent `ArchestrA.HistorianAccess` connections, one per `HistorianConnectionType`: - -- **Process connection** (`ConnectionType = Process`) — serves historical *value* queries: `ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`. This is the SDK's query channel for tags stored in the Historian runtime. -- **Event connection** (`ConnectionType = Event`) — serves historical *event/alarm* queries: `ReadEventsAsync`. The SDK requires a separately opened connection for its event store because the query API and wire schema are distinct from value queries. - -Both connections are lazy: they open on the first query that needs them. Either can be open, closed, or open against a different cluster node than the other. The dashboard renders both independently in the Historian panel (`Process Conn: open (host-a) | Event Conn: closed`) so operators can tell which silos are active and which node is serving each. When cluster support is configured, both silos share the same `HistorianClusterEndpointPicker`, so a failure on one silo marks the node unhealthy for the other as well. 
- -## Raw Reads - -`IHistorianDataSource.ReadRawAsync` (plugin implementation) uses a `HistoryQuery` to retrieve individual samples within a time range: - -1. Create a `HistoryQuery` via `_connection.CreateHistoryQuery()` -2. Configure `HistoryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `RetrievalMode = Full` -3. Iterate: `StartQuery` -> `MoveNext` loop -> `EndQuery` - -Each result row is converted to an OPC UA `DataValue`: - -- `QueryResult.Value` (double) takes priority; `QueryResult.StringValue` is used as fallback for string-typed tags. -- `SourceTimestamp` and `ServerTimestamp` are both set to `QueryResult.StartDateTime`. -- `StatusCode` is mapped from the `QueryResult.OpcQuality` (UInt16) via `QualityMapper` (the same OPC DA quality byte mapping used for live MXAccess data). - -## Aggregate Reads - -`IHistorianDataSource.ReadAggregateAsync` (plugin implementation) uses an `AnalogSummaryQuery` to retrieve pre-computed aggregates: - -1. Create an `AnalogSummaryQuery` via `_connection.CreateAnalogSummaryQuery()` -2. Configure `AnalogSummaryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `Resolution` (milliseconds) -3. Iterate the same `StartQuery` -> `MoveNext` -> `EndQuery` pattern -4. Extract the requested aggregate from named properties on `AnalogSummaryQueryResult` - -Null aggregate values return `BadNoData` status rather than `Good` with a null variant. - -## Quality Mapping - -The Historian SDK returns standard OPC DA quality values in `QueryResult.OpcQuality` (UInt16). The low byte is passed through the shared `QualityMapper` pipeline (`MapFromMxAccessQuality` -> `MapToOpcUaStatusCode`), which maps the OPC DA quality families to OPC UA status codes: +The Historian SDK returns standard OPC DA quality values in `QueryResult.OpcQuality` (UInt16). 
The low byte flows through the shared `QualityMapper` pipeline (`MapFromMxAccessQuality` → `MapToOpcUaStatusCode`): | OPC Quality Byte | OPC DA Family | OPC UA StatusCode | -|---|---|---| +|------------------|---------------|-------------------| | 0-63 | Bad | `Bad` (with sub-code when an exact enum match exists) | | 64-191 | Uncertain | `Uncertain` (with sub-code when an exact enum match exists) | | 192+ | Good | `Good` (with sub-code when an exact enum match exists) | -See `Domain/QualityMapper.cs` and `Domain/Quality.cs` for the full mapping table and sub-code definitions. +See `Domain/QualityMapper.cs` and `Domain/Quality.cs` in `Driver.Galaxy.Host` for the full table. -## Aggregate Function Mapping +## OPC UA Client — upstream forwarding -`HistorianAggregateMap.MapAggregateToColumn` (in the core Host assembly, so the node manager can validate aggregate support without requiring the plugin to be loaded) translates OPC UA aggregate NodeIds to `AnalogSummaryQueryResult` property names: +The OPC UA Client driver (`Driver.OpcUaClient`) implements `IHistoryProvider` by forwarding each call to the upstream server via `Session.HistoryRead`. Raw / processed / at-time / events map onto the stack's native HistoryRead details types. Continuation points are passed through — the Core's `HistoryContinuationPointManager` treats the driver as an opaque pager. -| OPC UA Aggregate | Result Property | -|---|---| -| `AggregateFunction_Average` | `Average` | -| `AggregateFunction_Minimum` | `Minimum` | -| `AggregateFunction_Maximum` | `Maximum` | -| `AggregateFunction_Count` | `ValueCount` | -| `AggregateFunction_Start` | `First` | -| `AggregateFunction_End` | `Last` | -| `AggregateFunction_StandardDeviationPopulation` | `StdDev` | +## Historizing flag and AccessLevel -Unsupported aggregates return `null`, which causes the node manager to return `BadAggregateNotSupported`. 
- -## HistoryReadRawModified Override - -`LmxNodeManager` overrides `HistoryReadRawModified` to handle raw history read requests: - -1. Resolve the `NodeHandle` to a tag reference via `_nodeIdToTagReference`. Return `BadNodeIdUnknown` if not found. -2. Check that `_historianDataSource` is not null. Return `BadHistoryOperationUnsupported` if historian is disabled. -3. Call `ReadRawAsync` with the time range and `NumValuesPerNode` from the `ReadRawModifiedDetails`. -4. Pack the resulting `DataValue` list into a `HistoryData` object and wrap it in an `ExtensionObject` for the `HistoryReadResult`. - -## HistoryReadProcessed Override - -`HistoryReadProcessed` handles aggregate history requests with additional validation: - -1. Resolve the node and check historian availability (same as raw). -2. Validate that `AggregateType` is present in the `ReadProcessedDetails`. Return `BadAggregateListMismatch` if empty. -3. Map the requested aggregate to a result property via `MapAggregateToColumn`. Return `BadAggregateNotSupported` if unmapped. -4. Call `ReadAggregateAsync` with the time range, `ProcessingInterval`, and property name. -5. Return results in the same `HistoryData` / `ExtensionObject` format. - -## Historizing Flag and AccessLevel - -During variable node creation in `CreateAttributeVariable`, attributes with `IsHistorized == true` receive two additional settings: +During variable node creation, drivers that advertise history set: ```csharp if (attr.IsHistorized) @@ -230,7 +111,13 @@ if (attr.IsHistorized) variable.Historizing = attr.IsHistorized; ``` -- **`Historizing = true`** -- Tells OPC UA clients that this node has historical data available. -- **`AccessLevels.HistoryRead`** -- Enables the `HistoryRead` access bit on the node, which the OPC UA stack checks before routing history requests to the node manager override. Nodes without this bit set will be rejected by the framework before reaching `HistoryReadRawModified` or `HistoryReadProcessed`. 
+- **`Historizing = true`** — tells OPC UA clients that the node has historical data available. +- **`AccessLevels.HistoryRead`** — enables the `HistoryRead` access bit. The OPC UA stack checks this bit before routing history requests to the Core dispatcher; nodes without it are rejected before reaching `IHistoryProvider`. -The `IsHistorized` flag originates from the Galaxy repository database query, which checks whether the attribute has Historian logging configured. +The `IsHistorized` flag originates in the driver's discovery output. For Galaxy it comes from the repository query detecting a `HistoryExtension` primitive (see [drivers/Galaxy-Repository.md](drivers/Galaxy-Repository.md)). For OPC UA Client it is copied from the upstream server's `Historizing` property. + +## Configuration + +Driver-specific historian config lives in each driver's `DriverConfig` JSON blob, validated against the driver type's `DriverConfigJsonSchema` in `DriverTypeRegistry`. The Galaxy driver's historian section carries the fields exercised by `HistorianConfiguration` — `ServerName` / `ServerNames`, `FailureCooldownSeconds`, `IntegratedSecurity` / `UserName` / `Password`, `Port` (default `32568`), `CommandTimeoutSeconds`, `RequestTimeoutSeconds`, `MaxValuesPerRead`. The OPC UA Client driver inherits its timeouts from the upstream session. + +See [Configuration.md](Configuration.md) for the schema shape and validation path. diff --git a/docs/IncrementalSync.md b/docs/IncrementalSync.md index 2ee05b1..09daefd 100644 --- a/docs/IncrementalSync.md +++ b/docs/IncrementalSync.md @@ -1,121 +1,65 @@ # Incremental Sync -When a Galaxy redeployment is detected, the OPC UA address space must be updated to reflect the new hierarchy and attributes. 
Rather than tearing down the entire address space and rebuilding from scratch (which disconnects all clients and drops all subscriptions), `LmxNodeManager` performs an incremental sync that identifies changed objects and rebuilds only the affected subtrees. +Two distinct change-detection paths feed the running server: driver-backend rediscovery (Galaxy's `time_of_last_deploy`, TwinCAT's symbol-version-changed, OPC UA Client's upstream namespace change) and generation-level config publishes from the Admin UI. Both flow into re-runs of `ITagDiscovery.DiscoverAsync`, but they originate differently. -## Cached State +## Driver-backend rediscovery — IRediscoverable -`LmxNodeManager` retains shallow copies of the last-published hierarchy and attributes: +Drivers whose backend has a native change signal implement `IRediscoverable` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IRediscoverable.cs`): ```csharp -private List? _lastHierarchy; -private List? _lastAttributes; -``` - -These are updated at the end of every `BuildAddressSpace` or `SyncAddressSpace` call via `new List(source)` to create independent copies. The copies serve as the baseline for the next diff comparison. - -On the first call (when `_lastHierarchy` is null), `SyncAddressSpace` falls through to a full `BuildAddressSpace` since there is no baseline to diff against. - -## AddressSpaceDiff - -`AddressSpaceDiff` is a static helper class that computes the set of changed Galaxy object IDs between two snapshots. - -### FindChangedGobjectIds - -This method compares old and new hierarchy+attributes and returns a `HashSet` of gobject IDs that have any difference. 
It detects three categories of changes: - -**Added objects** -- Present in new hierarchy but not in old: - -```csharp -foreach (var id in newObjects.Keys) - if (!oldObjects.ContainsKey(id)) - changed.Add(id); -``` - -**Removed objects** -- Present in old hierarchy but not in new: - -```csharp -foreach (var id in oldObjects.Keys) - if (!newObjects.ContainsKey(id)) - changed.Add(id); -``` - -**Modified objects** -- Present in both but with different properties. `ObjectsEqual` compares `TagName`, `BrowseName`, `ContainedName`, `ParentGobjectId`, and `IsArea`. - -**Attribute set changes** -- For objects that exist in both snapshots, attributes are grouped by `GobjectId` and compared pairwise. `AttributeSetsEqual` sorts both lists by `FullTagReference` and `PrimitiveName`, then checks each pair via `AttributesEqual`, which compares `AttributeName`, `FullTagReference`, `MxDataType`, `IsArray`, `ArrayDimension`, `PrimitiveName`, `SecurityClassification`, `IsHistorized`, and `IsAlarm`. A difference in count or any field mismatch marks the owning gobject as changed. - -Objects already marked as changed by hierarchy comparison are skipped during attribute comparison to avoid redundant work. - -### ExpandToSubtrees - -When a Galaxy object changes, its children must also be rebuilt because they may reference the parent's node or have inherited attribute changes. 
`ExpandToSubtrees` performs a BFS traversal from each changed ID, adding all descendants: - -```csharp -public static HashSet ExpandToSubtrees(HashSet changed, - List hierarchy) +public interface IRediscoverable { - var childrenByParent = hierarchy.GroupBy(h => h.ParentGobjectId) - .ToDictionary(g => g.Key, g => g.Select(h => h.GobjectId).ToList()); - - var expanded = new HashSet(changed); - var queue = new Queue(changed); - while (queue.Count > 0) - { - var id = queue.Dequeue(); - if (childrenByParent.TryGetValue(id, out var children)) - foreach (var childId in children) - if (expanded.Add(childId)) - queue.Enqueue(childId); - } - return expanded; + event EventHandler? OnRediscoveryNeeded; } +public sealed record RediscoveryEventArgs(string Reason, string? ScopeHint); ``` -The expansion runs against both the old and new hierarchy. This is necessary because a removed parent's children appear in the old hierarchy (for teardown) while an added parent's children appear in the new hierarchy (for construction). +The driver fires the event with a reason string (for the diagnostic log) and an optional scope hint — a non-null hint lets Core scope the rebuild surgically to that subtree; null means "the whole address space may have changed". -## SyncAddressSpace Flow +Drivers that implement the capability today: -`SyncAddressSpace` orchestrates the incremental update inside the OPC UA framework `Lock`: +- **Galaxy** — polls `galaxy.time_of_last_deploy` in the Galaxy repository DB and fires on change. This is Galaxy-internal change detection, not the platform-wide mechanism. +- **TwinCAT** — observes ADS symbol-version-changed notifications (`0x0702`). +- **OPC UA Client** — subscribes to the upstream server's `Server/NamespaceArray` change notifications. -1. **Diff** -- Call `FindChangedGobjectIds` with the cached and new snapshots. If no changes are detected, update the cached snapshots and return early. 
+Static drivers (Modbus, S7, AB CIP, AB Legacy, FOCAS) do not implement `IRediscoverable` — their tags only change when a new generation is published from the Config DB. Core sees absence of the interface and skips change-detection wiring for those drivers (decision #54). -2. **Expand** -- Call `ExpandToSubtrees` on both old and new hierarchies to include descendant objects. +## Config-DB generation publishes -3. **Snapshot subscriptions** -- Before teardown, iterate `_gobjectToTagRefs` for each changed gobject ID and record the current MXAccess subscription ref-counts. These are needed to restore subscriptions after rebuild. +Tag-set changes authored in the Admin UI (UNS edits, CSV imports, driver-config edits) accumulate in a draft generation and commit via `sp_PublishGeneration`. The delta between the currently-published generation and the proposed next one is computed by `sp_ComputeGenerationDiff`, which drives: -4. **Teardown** -- Call `TearDownGobjects` to remove the old nodes and clean up tracking state. +- The **DiffViewer** in Admin (`src/ZB.MOM.WW.OtOpcUa.Admin/Components/Pages/Clusters/DiffViewer.razor`) so operators can preview what will change before clicking Publish. +- The 409-on-stale-draft flow (decision #161) — a UNS drag-reorder preview carries a `DraftRevisionToken` so Confirm returns `409 Conflict / refresh-required` if the draft advanced between preview and commit. -5. **Rebuild** -- Filter the new hierarchy and attributes to only the changed gobject IDs, then call `BuildSubtree` to create the replacement nodes. +After publish, the server's generation applier invokes `IDriver.ReinitializeAsync(driverConfigJson, ct)` on every driver whose `DriverInstance.DriverConfig` row changed in the new generation. Reinitialize is the in-process recovery path for Tier A/B drivers; if it fails the driver is marked `DriverState.Faulted` and its nodes go Bad quality — but the server process stays running. See `docs/v2/driver-stability.md`. -6. 
**Restore subscriptions** -- For each previously subscribed tag reference that still exists in `_tagToVariableNode` after rebuild, re-open the MXAccess subscription and restore the original ref-count. +Drivers whose discovery depends on Config DB state (Modbus register maps, S7 DBs, AB CIP tag lists) re-run their discovery inside `ReinitializeAsync`; Core then diffs the new node set against the current address space. -7. **Update cache** -- Replace `_lastHierarchy` and `_lastAttributes` with shallow copies of the new data. +## Rebuild flow -## TearDownGobjects +When a rediscovery is triggered (by either source), `GenericDriverNodeManager` re-runs `ITagDiscovery.DiscoverAsync` into the same `CapturingBuilder` it used at first build. The new node set is diffed against the current: -`TearDownGobjects` removes all OPC UA nodes and tracking state for a set of gobject IDs: +1. **Diff** — full-name comparison of the new `DriverAttributeInfo` set against the existing `_variablesByFullRef` map. Added / removed / modified references are partitioned. +2. **Snapshot subscriptions** — before teardown, Core captures the current monitored-item ref-counts for every affected reference so subscriptions can be replayed after rebuild. +3. **Teardown** — removed / modified variable nodes are deleted via `CustomNodeManager2.DeleteNode`. Driver-side subscriptions for those references are unwound via `ISubscribable.UnsubscribeAsync`. +4. **Rebuild** — added / modified references get fresh `BaseDataVariableState` nodes via the standard `IAddressSpaceBuilder.Variable(...)` path. Alarm-flagged references re-register their `IAlarmConditionSink` through `CapturingBuilder`. +5. **Restore subscriptions** — for every captured reference that still exists after rebuild, Core re-opens the driver subscription and restores the original ref-count. 
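The diff step of the rebuild flow can be sketched as a partition over two maps keyed by full reference. This is an illustrative assumption rather than the `GenericDriverNodeManager` implementation: `RediscoveryDiffSketch` and `Partition` are hypothetical names, and the generic parameter `T` stands in for whatever per-reference description is compared (for example a `DriverAttributeInfo` with value equality).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of the OtOpcUa code base.
public static class RediscoveryDiffSketch
{
    // Splits full references into added, removed, and modified sets by comparing
    // the current node map with the freshly discovered one.
    public static (List<string> Added, List<string> Removed, List<string> Modified)
        Partition<T>(IReadOnlyDictionary<string, T> current,
                     IReadOnlyDictionary<string, T> discovered)
    {
        var comparer = EqualityComparer<T>.Default;
        var added = new List<string>();
        var removed = new List<string>();
        var modified = new List<string>();

        foreach (var pair in discovered)
        {
            if (!current.TryGetValue(pair.Key, out var existing))
                added.Add(pair.Key);        // new reference: build a fresh node
            else if (!comparer.Equals(existing, pair.Value))
                modified.Add(pair.Key);     // changed: tear down, then rebuild
            // equal entries are left alone so their subscriptions stay live
        }

        foreach (var key in current.Keys)
        {
            if (!discovered.ContainsKey(key))
                removed.Add(key);           // gone from discovery: delete node
        }

        return (added, removed, modified);
    }
}
```

References that land in none of the three lists are untouched, which matches the document's statement that subscriptions for unchanged references survive a rebuild.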
-For each gobject ID, it processes the associated tag references from `_gobjectToTagRefs`: +Exceptions during teardown are swallowed per decision #12 — a driver throw must not leave the node tree half-deleted. -1. **Unsubscribe** -- If the tag has an active MXAccess subscription (entry in `_subscriptionRefCounts`), call `UnsubscribeAsync` and remove the ref-count entry. +## Scope hint -2. **Remove alarm tracking** -- Find any `_alarmInAlarmTags` entries whose `SourceTagReference` matches the tag. For each, unsubscribe the InAlarm, Priority, and DescAttrName tags, then remove the alarm entry. +When `RediscoveryEventArgs.ScopeHint` is non-null (e.g. a folder path), Core restricts the diff to that subtree. This matters for Galaxy Platform-scoped deployments where a `time_of_last_deploy` advance may only affect one platform's subtree, and for OPC UA Client where an upstream change may be localized. Null scope falls back to a full-tree diff. -3. **Delete variable node** -- Call `DeleteNode` on the variable's `NodeId`, remove from `_tagToVariableNode`, clean up `_nodeIdToTagReference` and `_tagMetadata`, and decrement `VariableNodeCount`. +## Active subscriptions survive rebuild -4. **Delete object/folder node** -- Remove the gobject's entry from `_nodeMap` and call `DeleteNode`. Non-folder nodes decrement `ObjectNodeCount`. +Subscriptions for unchanged references stay live across rebuilds — their ref-count map is not disturbed. Clients monitoring a stable tag never see a data-change gap during a deploy; only clients monitoring a tag that was genuinely removed see the subscription drop. -All MXAccess calls and `DeleteNode` calls are wrapped in try/catch with ignored exceptions, since teardown must complete even if individual cleanup steps fail. +## Key source files -## BuildSubtree - -`BuildSubtree` creates OPC UA nodes for a subset of the Galaxy hierarchy, reusing existing parent nodes from `_nodeMap`.
- -The method first topologically sorts the input hierarchy (same `TopologicalSort` used by `BuildAddressSpace`) to ensure parents are created before children. For each object: - -1. **Find parent** -- Look up `ParentGobjectId` in `_nodeMap`. If the parent was not part of the changed set, it already exists from the previous build. If no parent is found, fall back to the root `ZB` folder. This is the key difference from `BuildAddressSpace` -- subtree builds reuse the existing node tree rather than starting from the root. - -2. **Create node** -- Areas become `FolderState` with `Organizes` reference; non-areas become `BaseObjectState` with `HasComponent` reference. The node is added to `_nodeMap`. - -3. **Create variable nodes** -- Attributes are processed with the same primitive-grouping logic as `BuildAddressSpace`, creating `BaseDataVariableState` nodes via `CreateAttributeVariable`. - -4. **Alarm tracking** -- If `_alarmTrackingEnabled` is set, alarm attributes are detected and `AlarmConditionState` nodes are created using the same logic as the full build. EventNotifier flags are set on parent nodes, and alarm tags are auto-subscribed. +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IRediscoverable.cs` — backend-change capability +- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — discovery orchestration +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriver.cs` — `ReinitializeAsync` contract +- `src/ZB.MOM.WW.OtOpcUa.Admin/Services/GenerationService.cs` — publish-flow driver +- `docs/v2/config-db-schema.md` — `sp_PublishGeneration` + `sp_ComputeGenerationDiff` +- `docs/v2/admin-ui.md` — DiffViewer + draft-revision-token flow diff --git a/docs/MxAccessBridge.md b/docs/MxAccessBridge.md deleted file mode 100644 index f81ae66..0000000 --- a/docs/MxAccessBridge.md +++ /dev/null @@ -1,166 +0,0 @@ -# MXAccess Bridge - -The MXAccess bridge connects the OPC UA server to the AVEVA System Platform runtime through the `ArchestrA.MxAccess` COM API. 
It handles all COM threading requirements, translates between OPC UA read/write requests and MXAccess operations, and manages connection health. - -## STA Thread Requirement - -MXAccess is a COM-based API that requires a Single-Threaded Apartment (STA). All COM objects -- `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls -- must execute on the same STA thread. Calling COM objects from the wrong thread causes marshalling failures or silent data corruption. - -`StaComThread` provides a dedicated STA thread with the apartment state set before the thread starts: - -```csharp -_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true }; -_thread.SetApartmentState(ApartmentState.STA); -``` - -Work items are queued via `RunAsync(Action)` or `RunAsync(Func)`, which enqueue the work to a `ConcurrentQueue` and post a `WM_APP` message to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread. - -## Win32 Message Pump - -COM callbacks (like `OnDataChange`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump using P/Invoke: - -1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works) -2. `GetMessage` blocks until a message arrives -3. `WM_APP` messages drain the work queue -4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop -5. All other messages are passed through `TranslateMessage`/`DispatchMessage` for COM callback delivery - -Without this message pump, MXAccess COM callbacks would never fire and the server would receive no live data. - -## LMXProxyServer COM Object - -`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface. This abstraction allows unit tests to substitute a fake proxy without requiring the ArchestrA runtime. - -The COM object lifecycle: - -1. 
**`Register(clientName)`** -- Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, and calls `Register` to obtain a connection handle -2. **`Unregister(handle)`** -- Unwires event handlers, calls `Unregister`, and releases the COM object via `Marshal.ReleaseComObject` - -## Register/AddItem/AdviseSupervisory Pattern - -Every MXAccess data operation follows a three-step pattern, all executed on the STA thread: - -1. **`AddItem(handle, address)`** -- Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle -2. **`AdviseSupervisory(handle, itemHandle)`** -- Subscribes the item for supervisory data change callbacks -3. The runtime begins delivering `OnDataChange` events for the item - -For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value to the runtime. The `OnWriteComplete` callback confirms or rejects the write. - -Cleanup reverses the pattern: `UnAdviseSupervisory` then `RemoveItem`. - -## OnDataChange and OnWriteComplete Callbacks - -### OnDataChange - -Fired by the COM runtime on the STA thread when a subscribed tag value changes. The handler in `MxAccessClient.EventHandlers.cs`: - -1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress` -2. Maps the MXAccess quality code to the internal `Quality` enum -3. Checks `MXSTATUS_PROXY` for error details and adjusts quality accordingly -4. Converts the timestamp to UTC -5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to: - - The stored per-tag subscription callback - - Any pending one-shot read completions - - The global `OnTagValueChanged` event (consumed by `LmxNodeManager`) - -### OnWriteComplete - -Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource` for the item handle. 
If `MXSTATUS_PROXY.success == 0`, the write is considered failed and the error detail is logged. - -## Reconnection Logic - -`MxAccessClient` implements automatic reconnection through two mechanisms: - -### Monitor loop - -`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle: - -- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync` -- If connected and a probe tag is configured, it checks the probe staleness threshold - -### Reconnect sequence - -`ReconnectAsync` performs a full disconnect-then-connect cycle: - -1. Increment the reconnect counter -2. `DisconnectAsync` -- Tears down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detaches COM event handlers, calls `Unregister`, and clears all handle mappings -3. `ConnectAsync` -- Creates a fresh `LMXProxyServer`, registers, replays all stored subscriptions, and re-subscribes the probe tag - -Stored subscriptions (`_storedSubscriptions`) persist across reconnects. When `ConnectAsync` succeeds, `ReplayStoredSubscriptionsAsync` iterates all stored entries and calls `AddItem` + `AdviseSupervisory` for each. - -## Probe Tag Health Monitoring - -A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange` callback. - -The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data. - -## Per-Host Runtime Status Probes (`.ScanState`) - -Separate from the connection-level probe above, the bridge advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. 
These probes track per-host runtime state so the dashboard can report "this specific Platform / AppEngine is off scan" and the bridge can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MxAccess from serving stale Good-quality cached values to clients who read those tags while the host is down. - -Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](Configuration.md#mxaccess) for the two config fields. - -### How it works - -`GalaxyRuntimeProbeManager` is owned by `LmxNodeManager` and operates on a simple three-state machine per host (Unknown / Running / Stopped): - -1. **Discovery** — After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff. -2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**. -3. **On-change-only delivery** — `ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts. -4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. 
The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped." -5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Without this rollback, a failed subscribe would leave the entry in `Unknown` forever, and `Tick()` would later transition it to `Stopped` after the unknown-resolution timeout, fanning out a **false-negative** host-down signal that invalidates the subtree of a host that was never actually advised. Stability review 2026-04-13 Finding 1. - -### Subtree quality invalidation on transition - -When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. The reverse happens on **Stopped → Running**: `ClearHostVariablesBadQuality` resets each to `Good` and lets subsequent on-change MxAccess updates repopulate the values. - -The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in **both** the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables. - -### Read-path short-circuit (`IsTagUnderStoppedHost`) - -`LmxNodeManager.Read` override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called `_mxAccessClient.ReadAsync(tagRef)` unconditionally and returned whatever VTQ the runtime reported. That created a gap: MxAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan. 
- -The Read override now checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MxAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MxAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing. - -### Deferred dispatch: the STA deadlock - -**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the `LmxNodeManager.Lock`, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern. - -The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — **outside** any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural `Lock` acquisition. No circular wait, no STA dispatch involvement. - -See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-flight debugging that led to this pattern. - -### Dashboard + health surface - -- Dashboard **Galaxy Runtime** panel between Galaxy Info and Historian shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MxAccess transport disconnected). 
-- Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish bridge-owned probe count from client-driven subscriptions. -- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MxAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging. - -See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields. - -## Request Timeout Safety Backstop - -Every sync-over-async site on the OPC UA stack thread that calls into MxAccess (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). This is a backstop: `MxAccessClient.Read/Write` already enforce inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path. The outer wrapper exists so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely. - -On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation. - -`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3. - -## Why Marshal.ReleaseComObject Is Needed - -The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. 
`MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately release the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance. - -## Key source files - -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/StaComThread.cs` -- STA thread and Win32 message pump -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.cs` -- Core client class (partial) -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Connection.cs` -- Connect, disconnect, reconnect -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Subscription.cs` -- Subscribe, unsubscribe, replay -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs` -- Read and write operations -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown` / `Running` / `Stopped` enum -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface diff --git a/docs/OpcUaServer.md b/docs/OpcUaServer.md index 2ce4232..88f15b9 100644 --- a/docs/OpcUaServer.md +++ b/docs/OpcUaServer.md @@ -1,137 +1,88 @@ # OPC UA Server -The OPC UA server component hosts the Galaxy-backed namespace on a configurable TCP endpoint and exposes deployed System Platform objects and attributes to OPC UA clients. +The OPC UA server component (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs`) hosts the OPC UA stack and exposes one browsable subtree per registered driver. 
The server itself is driver-agnostic — Galaxy/MXAccess, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, and OPC UA Client are all plugged in as `IDriver` implementations via the capability interfaces in `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/`. + +## Composition + +`OtOpcUaServer` subclasses the OPC Foundation `StandardServer` and wires: + +- A `DriverHost` (`src/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs`) which registers drivers and holds the per-instance `IDriver` references. +- One `DriverNodeManager` per registered driver (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`), constructed in `CreateMasterNodeManager`. Each manager owns its own namespace URI (`urn:OtOpcUa:{DriverInstanceId}`) and exposes the driver as a subtree under the standard `Objects` folder. +- A `CapabilityInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs`) per driver instance, keyed on `(DriverInstanceId, HostName, DriverCapability)` against the shared `DriverResiliencePipelineBuilder`. Every Read/Write/Discovery/Subscribe/HistoryRead/AlarmSubscribe call on the driver flows through this invoker so the Polly pipeline (retry / timeout / breaker / bulkhead) applies. The OTOPCUA0001 Roslyn analyzer enforces the wrapping at compile time. +- An `IUserAuthenticator` (LDAP in production, injected stub in tests) for `UserName` token validation in the `ImpersonateUser` hook. +- Optional `AuthorizationGate` + `NodeScopeResolver` (Phase 6.2) that sit in front of every dispatch call. In lax mode the gate passes through when the identity lacks LDAP groups so existing integration tests keep working; strict mode (`Authorization:StrictMode = true`) denies those cases. + +`OtOpcUaServer.DriverNodeManagers` exposes the materialized list so the hosting layer can walk each one post-start and call `GenericDriverNodeManager.BuildAddressSpaceAsync(manager)` — the manager is passed as its own `IAddressSpaceBuilder`. 
## Configuration -`OpcUaConfiguration` defines the server endpoint and session settings. All properties have sensible defaults: +Server wiring used to live in `appsettings.json`. It now flows from the SQL Server **Config DB**: `ServerInstance` + `DriverInstance` + `Tag` + `NodeAcl` rows are published as a *generation* via `sp_PublishGeneration` and loaded into the running process by the generation applier. The Admin UI (Blazor Server, `docs/v2/admin-ui.md`) is the operator surface — drafts accumulate edits; `sp_ComputeGenerationDiff` drives the DiffViewer preview; a UNS drag-reorder carries a `DraftRevisionToken` so Confirm re-checks against the current draft and returns 409 if it advanced (decision #161). See `docs/v2/config-db-schema.md` for the schema. -| Property | Default | Description | -|----------|---------|-------------| -| `BindAddress` | `0.0.0.0` | IP address or hostname the server binds to | -| `Port` | `4840` | TCP port the server listens on | -| `EndpointPath` | `/LmxOpcUa` | URI path appended to the base address | -| `ServerName` | `LmxOpcUa` | Application name presented to clients | -| `GalaxyName` | `ZB` | Galaxy name used in the namespace URI | -| `MaxSessions` | `100` | Maximum concurrent client sessions | -| `SessionTimeoutMinutes` | `30` | Idle session timeout | -| `AlarmTrackingEnabled` | `false` | Enables `AlarmConditionState` nodes for alarm attributes | -| `AlarmFilter.ObjectFilters` | `[]` | Wildcard template-name patterns that scope alarm tracking to matching objects and their descendants (see [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter)) | +Environmental knobs that aren't per-tenant (bind address, port, PKI path) still live in `appsettings.json` on the Server project; everything tenant-scoped moved to the Config DB. -The resulting endpoint URL is `opc.tcp://{BindAddress}:{Port}{EndpointPath}`, e.g., `opc.tcp://0.0.0.0:4840/LmxOpcUa`. 
+## Transport -The namespace URI follows the pattern `urn:{GalaxyName}:LmxOpcUa` and is used as the `ProductUri`. The `ApplicationUri` can be set independently via `OpcUa.ApplicationUri` to support redundant deployments where each instance needs a unique identity. When `ApplicationUri` is null, it defaults to the namespace URI. +The server binds one TCP endpoint per `ServerInstance` (default `opc.tcp://0.0.0.0:4840`). The `ApplicationConfiguration` is built programmatically in the `OpcUaApplicationHost` — there are no UA XML files. Security profiles (`None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`) are resolved from the `ServerInstance.Security` JSON at startup; the default profile is still `None` for backward compatibility. User token policies (`Anonymous`, `UserName`) are attached based on whether LDAP is configured. See `docs/security.md` for hardening. -## Programmatic ApplicationConfiguration +## Session impersonation -`OpcUaServerHost` builds the entire `ApplicationConfiguration` in code. There are no XML configuration files. This keeps deployment simple on factory floor machines where editing XML is error-prone. +`OtOpcUaServer.OnImpersonateUser` handles the three token types: -The configuration covers: +- `AnonymousIdentityToken` → default anonymous `UserIdentity`. +- `UserNameIdentityToken` → `IUserAuthenticator.AuthenticateAsync` validates the credential (`LdapUserAuthenticator` in production). On success, the resolved display name + LDAP-derived roles are wrapped in a `RoleBasedIdentity` that implements `IRoleBearer`. `DriverNodeManager.OnWriteValue` reads these roles via `context.UserIdentity is IRoleBearer` and applies `WriteAuthzPolicy` per write. +- Anything else → `BadIdentityTokenInvalid`. 
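
The three token cases listed for `OnImpersonateUser` can be sketched as a single pattern match. This is a minimal illustration, not the server's actual code: the token records, `IAuthenticator`, `StubAuthenticator`, the `Operators` role name, and the string results are all hypothetical stand-ins for the OPC Foundation token types, `IUserAuthenticator`, and `RoleBasedIdentity`. The real authenticator is asynchronous, and how a failed credential check is reported is not stated in the text, so the "rejected" branch is an assumption.

```csharp
using System;

// Hypothetical stand-ins for the stack's identity token types.
public abstract record IdentityToken;
public sealed record AnonymousToken : IdentityToken;
public sealed record UserNameToken(string UserName, string Password) : IdentityToken;
public sealed record CertificateToken : IdentityToken;

// Hypothetical, synchronous simplification of IUserAuthenticator.
public interface IAuthenticator
{
    bool TryAuthenticate(string user, string password, out string[] roles);
}

// Test-style stub: accepts exactly one hard-coded credential.
public sealed class StubAuthenticator : IAuthenticator
{
    public bool TryAuthenticate(string user, string password, out string[] roles)
    {
        bool ok = user == "operator" && password == "secret";
        roles = ok ? new[] { "Operators" } : Array.Empty<string>();
        return ok;
    }
}

public static class Program
{
    // Returns a description of the resulting identity, or the status name on failure.
    public static string Impersonate(IdentityToken token, IAuthenticator auth)
    {
        switch (token)
        {
            case AnonymousToken _:
                return "anonymous";
            case UserNameToken u:
                // On success the roles would be wrapped in a role-bearing identity.
                return auth.TryAuthenticate(u.UserName, u.Password, out var roles)
                    ? $"{u.UserName} [{string.Join(",", roles)}]"
                    : "rejected"; // assumption: failure handling is not specified in the text
            default:
                return "BadIdentityTokenInvalid";
        }
    }

    public static void Main()
    {
        var auth = new StubAuthenticator();
        Console.WriteLine(Impersonate(new AnonymousToken(), auth));                    // anonymous
        Console.WriteLine(Impersonate(new UserNameToken("operator", "secret"), auth)); // operator [Operators]
        Console.WriteLine(Impersonate(new CertificateToken(), auth));                  // BadIdentityTokenInvalid
    }
}
```

Keeping the fall-through case last means any token type the server does not explicitly handle is refused rather than silently treated as anonymous.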
-- **ServerConfiguration** -- base address, session limits, security policies, and user token policies -- **SecurityConfiguration** -- certificate store paths under `%LOCALAPPDATA%\OPC Foundation\pki\`, auto-accept enabled -- **TransportQuotas** -- 4 MB max message/string/byte-string size, 120-second operation timeout, 1-hour security token lifetime -- **TraceConfiguration** -- OPC Foundation SDK tracing is disabled (output path `null`, trace masks `0`); all logging goes through Serilog instead +The Phase 6.2 `AuthorizationGate` runs on top of this baseline: when configured it consults the cluster's permission trie (loaded from `NodeAcl` rows) using the session's `UserAuthorizationState` and can deny Read / HistoryRead / Write / Browse independently per tag. See `docs/v2/acl-design.md`. -## Security Profiles +## Dispatch -The server supports configurable transport security profiles controlled by the `Security` section in `appsettings.json`. The default configuration exposes only `MessageSecurityMode.None` for backward compatibility. 
+Every service call the stack hands to `DriverNodeManager` is translated to the driver's capability interface and routed through `CapabilityInvoker`: -Supported Phase 1 profiles: - -| Profile Name | SecurityPolicy URI | MessageSecurityMode | +| Service | Capability | Invoker method | |---|---|---| -| `None` | `SecurityPolicy#None` | `None` | -| `Basic256Sha256-Sign` | `SecurityPolicy#Basic256Sha256` | `Sign` | -| `Basic256Sha256-SignAndEncrypt` | `SecurityPolicy#Basic256Sha256` | `SignAndEncrypt` | +| Read | `IReadable.ReadAsync` | `ExecuteAsync(DriverCapability.Read, host, …)` | +| Write | `IWritable.WriteAsync` | `ExecuteWriteAsync(host, isIdempotent, …)` — honors `WriteIdempotentAttribute` (#143) | +| CreateMonitoredItems / DeleteMonitoredItems | `ISubscribable.SubscribeAsync/UnsubscribeAsync` | `ExecuteAsync(DriverCapability.Subscribe, host, …)` | +| HistoryRead (raw / processed / at-time / events) | `IHistoryProvider.*Async` | `ExecuteAsync(DriverCapability.HistoryRead, host, …)` | +| ConditionRefresh / Acknowledge | `IAlarmSource.*Async` | via `AlarmSurfaceInvoker` (fans out per host) | -`SecurityProfileResolver` maps configured profile names to `ServerSecurityPolicy` instances at startup. Unknown names are skipped with a warning, and an empty or invalid list falls back to `None`. - -For production deployments, configure `["Basic256Sha256-SignAndEncrypt"]` or `["None", "Basic256Sha256-SignAndEncrypt"]` and set `AutoAcceptClientCertificates` to `false`. See the [Security Guide](security.md) for hardening details. +The host name fed to the invoker comes from `IPerCallHostResolver.ResolveHost(fullReference)` when the driver implements it (multi-host drivers: AB CIP, Modbus with per-device options). Single-host drivers fall back to `DriverInstanceId`, preserving pre-Phase-6.1 pipeline-key semantics (decision #144). 
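
The host-name fallback described above can be sketched as follows. Only `IPerCallHostResolver.ResolveHost(fullReference)` and the `DriverInstanceId` fallback come from the text; the interface declaration here is a local stand-in, and the driver classes, the `host/tag` reference format, and the `PipelineHost` helper are hypothetical names used for illustration.

```csharp
using System;

// Local stand-in for the capability interface named in the text.
public interface IPerCallHostResolver
{
    string ResolveHost(string fullReference);
}

// Hypothetical multi-host driver: assumes references look like "host/tag".
public sealed class MultiHostDriver : IPerCallHostResolver
{
    public string ResolveHost(string fullReference) => fullReference.Split('/')[0];
}

// Hypothetical single-host driver: does not implement the resolver.
public sealed class SingleHostDriver { }

public static class Program
{
    // Picks the host part of the resilience pipeline key for one call.
    public static string PipelineHost(object driver, string driverInstanceId, string fullReference) =>
        driver is IPerCallHostResolver resolver
            ? resolver.ResolveHost(fullReference) // multi-host: per-reference host
            : driverInstanceId;                   // single-host: fall back to the instance id

    public static void Main()
    {
        Console.WriteLine(PipelineHost(new MultiHostDriver(), "abcip-1", "plc-07/Motor.Speed")); // plc-07
        Console.WriteLine(PipelineHost(new SingleHostDriver(), "galaxy-1", "Tank.Level"));       // galaxy-1
    }
}
```

Because the fallback is the instance id itself, a single-host driver presumably ends up with one pipeline per capability, while a multi-host driver gets one per resolved host, so a breaker opened for one device need not affect the others.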
## Redundancy -When `Redundancy.Enabled = true`, `LmxOpcUaServer` exposes the standard OPC UA redundancy nodes on startup: +`Redundancy.Enabled = true` on the `ServerInstance` activates the `RedundancyCoordinator` + `ServiceLevelCalculator` (`src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`). Standard OPC UA redundancy nodes (`Server/ServerRedundancy/RedundancySupport`, `ServerUriArray`, `Server/ServiceLevel`) are populated on startup; `ServiceLevel` recomputes whenever any driver's `DriverHealth` changes. The apply-lease mechanism prevents two instances from concurrently applying a generation. See `docs/Redundancy.md`. -- `Server/ServerRedundancy/RedundancySupport` — set to `Warm` or `Hot` based on configuration -- `Server/ServerRedundancy/ServerUriArray` — populated with the configured `ServerUris` -- `Server/ServiceLevel` — computed dynamically from role and runtime health +## Server class hierarchy -The `ServiceLevel` is updated whenever MXAccess connection state changes or Galaxy DB health changes. See [Redundancy Guide](Redundancy.md) for full details. +### OtOpcUaServer extends StandardServer -### User token policies +- **`CreateMasterNodeManager`** — Iterates `_driverHost.RegisteredDriverIds`, builds one `DriverNodeManager` per driver with its own `CapabilityInvoker` + resilience options (tier from `DriverTypeRegistry`, per-instance JSON overrides from `DriverInstance.ResilienceConfig` via `DriverResilienceOptionsParser`). The managers are wrapped in a `MasterNodeManager` with no additional core managers. +- **`OnServerStarted`** — Hooks `SessionManager.ImpersonateUser` for LDAP auth. Redundancy + server-capability population happens via `OpcUaApplicationHost`. +- **`LoadServerProperties`** — Manufacturer `OtOpcUa`, Product `OtOpcUa.Server`, ProductUri `urn:OtOpcUa:Server`. 
-`UserTokenPolicies` are dynamically configured based on the `Authentication` settings in `appsettings.json`: +### ServerCapabilities -- An `Anonymous` user token policy is added when `AllowAnonymous` is `true` (the default). -- A `UserName` user token policy is added when an authentication provider is configured (LDAP or injected). - -Both policies can be active simultaneously, allowing clients to connect with or without credentials. - -### Session impersonation - -When a client presents `UserName` credentials, the server validates them through `IUserAuthenticationProvider`. If the provider also implements `IRoleProvider` (as `LdapAuthenticationProvider` does), LDAP group membership is resolved once during authentication and mapped to custom OPC UA role `NodeId`s in a dedicated `urn:zbmom:lmxopcua:roles` namespace. These role NodeIds are added to the session's `RoleBasedIdentity.GrantedRoleIds`. - -Anonymous sessions receive `WellKnownRole_Anonymous`. Authenticated sessions receive `WellKnownRole_AuthenticatedUser` plus any LDAP-derived role NodeIds. Permission checks in `LmxNodeManager` inspect `GrantedRoleIds` directly — no username extraction or side-channel cache is needed. - -`AnonymousCanWrite` controls whether anonymous sessions can write, regardless of whether LDAP is enabled. +`OpcUaApplicationHost` populates `Server/ServerCapabilities` with `StandardUA2017`, `en` locale, 100 ms `MinSupportedSampleRate`, 4 MB message caps, and per-operation limits (1000 per Read/Write/Browse/TranslateBrowsePaths/MonitoredItems/HistoryRead; 0 for MethodCall/NodeManagement/HistoryUpdate). ## Certificate handling -On startup, `OpcUaServerHost.StartAsync` calls `CheckApplicationInstanceCertificate(false, minKeySize)` to locate or create a self-signed certificate meeting the configured minimum key size (default 2048). The certificate subject defaults to `CN={ServerName}, O=ZB MOM, DC=localhost` but can be overridden via `Security.CertificateSubject`. 
Certificate stores use the directory-based store type under the configured `Security.PkiRootPath` (default `%LOCALAPPDATA%\OPC Foundation\pki\`): +Certificate stores default to `%LOCALAPPDATA%\OPC Foundation\pki\` (directory-based): | Store | Path suffix | -|-------|-------------| +|---|---| | Own | `pki/own` | | Trusted issuers | `pki/issuer` | | Trusted peers | `pki/trusted` | | Rejected | `pki/rejected` | -`AutoAcceptUntrustedCertificates` is controlled by `Security.AutoAcceptClientCertificates` (default `true`). Set to `false` in production to enforce client certificate trust. When `RejectSHA1Certificates` is `true` (default), client certificates signed with SHA-1 are rejected. Certificate validation events are logged for visibility into accepted and rejected client connections. - -## Server class hierarchy - -### LmxOpcUaServer extends StandardServer - -`LmxOpcUaServer` inherits from the OPC Foundation `StandardServer` base class and overrides two methods: - -- **`CreateMasterNodeManager`** -- Instantiates `LmxNodeManager` with the Galaxy namespace URI, the `IMxAccessClient` for runtime I/O, performance metrics, and an optional `IHistorianDataSource` (supplied by the runtime-loaded historian plugin, see [Historical Data Access](HistoricalDataAccess.md)). The node manager is wrapped in a `MasterNodeManager` with no additional core node managers. -- **`OnServerStarted`** -- Configures redundancy, history capabilities, and server capabilities at startup. Called after the server is fully initialized. -- **`LoadServerProperties`** -- Returns server metadata: manufacturer `ZB MOM`, product `LmxOpcUa Server`, and the assembly version as the software version. 
- -### ServerCapabilities - -`ConfigureServerCapabilities` populates the `ServerCapabilities` node at startup: - -- **ServerProfileArray** -- `StandardUA2017` -- **LocaleIdArray** -- `en` -- **MinSupportedSampleRate** -- 100ms -- **MaxBrowseContinuationPoints** -- 100 -- **MaxHistoryContinuationPoints** -- 100 -- **MaxArrayLength** -- 65535 -- **MaxStringLength / MaxByteStringLength** -- 4MB -- **OperationLimits** -- 1000 nodes per Read/Write/Browse/RegisterNodes/TranslateBrowsePaths/MonitoredItems/HistoryRead; 0 for MethodCall/NodeManagement/HistoryUpdate (not supported) -- **ServerDiagnostics.EnabledFlag** -- `true` (SDK tracks session/subscription counts automatically) - -### Session tracking - -`LmxOpcUaServer` exposes `ActiveSessionCount` by querying `ServerInternal.SessionManager.GetSessions().Count`. `OpcUaServerHost` surfaces this for status reporting. - -## Startup and Shutdown - -`OpcUaServerHost.StartAsync` performs the following sequence: - -1. Build `ApplicationConfiguration` programmatically -2. Validate the configuration via `appConfig.Validate(ApplicationType.Server)` -3. Create `ApplicationInstance` and check/create the application certificate -4. Instantiate `LmxOpcUaServer` and start it via `ApplicationInstance.Start` - -`OpcUaServerHost.Stop` calls `_server.Stop()` and nulls both the server and application instance references. The class implements `IDisposable`, delegating to `Stop`. +`Security.AutoAcceptClientCertificates` (default `true`) and `RejectSHA1Certificates` (default `true`) are honored. The server certificate is always created — even for `None`-only deployments — because `UserName` token encryption needs it. 
## Key source files -- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OpcUaServerHost.cs` -- Application lifecycle and programmatic configuration -- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/LmxOpcUaServer.cs` -- StandardServer subclass and node manager creation -- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/SecurityProfileResolver.cs` -- Profile-name to ServerSecurityPolicy mapping -- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/OpcUaConfiguration.cs` -- Configuration POCO -- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/SecurityProfileConfiguration.cs` -- Security configuration POCO +- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs` — `StandardServer` subclass + `ImpersonateUser` hook +- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — per-driver `CustomNodeManager2` + dispatch surface +- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs` — programmatic `ApplicationConfiguration` + lifecycle +- `src/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs` — driver registration +- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs` — Polly pipeline entry point +- `src/ZB.MOM.WW.OtOpcUa.Core/Authorization/` — Phase 6.2 permission trie + evaluator +- `src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` — stack-to-evaluator bridge diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..e06888c --- /dev/null +++ b/docs/README.md @@ -0,0 +1,83 @@ +# OtOpcUa documentation + +Two tiers of documentation live here: + +- **Current reference** at the top level (`docs/*.md`) — describes what's shipped today. Start here for operator + integrator reference. +- **Implementation history + design notes** at `docs/v2/*.md` — the authoritative plan + decision log the current reference is built from. Start here when you need the *why* behind an architectural choice, or when a top-level doc says "see plan.md § X". 
+ +The project was originally called **LmxOpcUa** (a single-driver Galaxy/MXAccess OPC UA server) and has since become **OtOpcUa**, a multi-driver OPC UA server platform. Any lingering `LmxOpcUa`-string in a path you see in docs is a deliberate residual (executable name `lmxopcua-cli`, client PKI folder `{LocalAppData}/LmxOpcUaClient/`) — fixing those requires migration shims + is tracked as follow-ups. + +## Platform overview + +- **Core** owns the OPC UA stack, address space, session/security/subscription machinery. +- **Drivers** plug in via capability interfaces in `ZB.MOM.WW.OtOpcUa.Core.Abstractions`: `IDriver`, `IReadable`, `IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, `IHistoryProvider`, `IPerCallHostResolver`. Each driver opts into whichever it supports. +- **Server** is the OPC UA endpoint process (net10, x64). Hosts every driver except Galaxy in-process; talks to Galaxy via a named pipe because MXAccess COM is 32-bit-only. +- **Admin** is the Blazor Server operator UI (net10, x64). Owns the Config DB draft/publish flow, ACL + role-grant authoring, fleet status + `/metrics` scrape endpoint. +- **Galaxy.Host** is a .NET Framework 4.8 x86 Windows service that wraps MXAccess COM on an STA thread for the Galaxy driver. 
+ +## Where to find what + +### Architecture + data-path reference + +| Doc | Covers | +|-----|--------| +| [OpcUaServer.md](OpcUaServer.md) | Top-level server architecture — Core, driver dispatch, Config DB, generations | +| [AddressSpace.md](AddressSpace.md) | `GenericDriverNodeManager` + `ITagDiscovery` + `IAddressSpaceBuilder` | +| [ReadWriteOperations.md](ReadWriteOperations.md) | OPC UA Read/Write → `CapabilityInvoker` → `IReadable`/`IWritable` | +| [Subscriptions.md](Subscriptions.md) | Monitored items → `ISubscribable` + per-driver subscription refcount | +| [AlarmTracking.md](AlarmTracking.md) | `IAlarmSource` + `AlarmSurfaceInvoker` + OPC UA alarm conditions | +| [DataTypeMapping.md](DataTypeMapping.md) | Per-driver `DriverAttributeInfo` → OPC UA variable types | +| [IncrementalSync.md](IncrementalSync.md) | Address-space rebuild on redeploy + `sp_ComputeGenerationDiff` | +| [HistoricalDataAccess.md](HistoricalDataAccess.md) | `IHistoryProvider` as a per-driver optional capability | + +### Drivers + +| Doc | Covers | +|-----|--------| +| [drivers/README.md](drivers/README.md) | Index of the seven shipped drivers + capability matrix | +| [drivers/Galaxy.md](drivers/Galaxy.md) | Galaxy driver — MXAccess bridge, Host/Proxy split, named-pipe IPC | +| [drivers/Galaxy-Repository.md](drivers/Galaxy-Repository.md) | Galaxy-specific discovery via the ZB SQL database | + +For Modbus / S7 / AB CIP / AB Legacy / TwinCAT / FOCAS / OPC UA Client specifics, see [v2/driver-specs.md](v2/driver-specs.md). 
+ +### Operational + +| Doc | Covers | +|-----|--------| +| [Configuration.md](Configuration.md) | appsettings bootstrap + Config DB + Admin UI draft/publish | +| [security.md](security.md) | Transport security profiles, LDAP auth, ACL trie, role grants, OTOPCUA0001 analyzer | +| [Redundancy.md](Redundancy.md) | `RedundancyCoordinator`, `ServiceLevelCalculator`, apply-lease, Prometheus metrics | +| [ServiceHosting.md](ServiceHosting.md) | Three-process deploy (Server + Admin + Galaxy.Host) install/uninstall | +| [StatusDashboard.md](StatusDashboard.md) | Pointer — superseded by [v2/admin-ui.md](v2/admin-ui.md) | + +### Client tooling + +| Doc | Covers | +|-----|--------| +| [Client.CLI.md](Client.CLI.md) | `lmxopcua-cli` — command-line client | +| [Client.UI.md](Client.UI.md) | Avalonia desktop client | + +### Requirements + +| Doc | Covers | +|-----|--------| +| [reqs/HighLevelReqs.md](reqs/HighLevelReqs.md) | HLRs — numbered system-level requirements | +| [reqs/OpcUaServerReqs.md](reqs/OpcUaServerReqs.md) | OPC UA server-layer reqs | +| [reqs/ServiceHostReqs.md](reqs/ServiceHostReqs.md) | Per-process hosting reqs | +| [reqs/ClientRequirements.md](reqs/ClientRequirements.md) | Client CLI + UI reqs | +| [reqs/GalaxyRepositoryReqs.md](reqs/GalaxyRepositoryReqs.md) | Galaxy-scoped repository reqs | +| [reqs/MxAccessClientReqs.md](reqs/MxAccessClientReqs.md) | Galaxy-scoped MXAccess reqs | +| [reqs/StatusDashboardReqs.md](reqs/StatusDashboardReqs.md) | Pointer — superseded by Admin UI | + +## Implementation history (`docs/v2/`) + +Design decisions + phase plans + execution notes. 
Load-bearing cross-references from the top-level docs: + +- [v2/plan.md](v2/plan.md) — authoritative v2 vision doc + numbered decision log (referenced as "decision #N" elsewhere) +- [v2/admin-ui.md](v2/admin-ui.md) — Admin UI spec +- [v2/acl-design.md](v2/acl-design.md) — data-plane ACL + permission-trie design (Phase 6.2) +- [v2/config-db-schema.md](v2/config-db-schema.md) — Config DB schema reference +- [v2/driver-specs.md](v2/driver-specs.md) — per-driver addressing + quirks for every shipped protocol +- [v2/dev-environment.md](v2/dev-environment.md) — dev-box bootstrap +- [v2/test-data-sources.md](v2/test-data-sources.md) — integration-test simulator matrix (includes the pinned libplctag `ab_server` version for AB CIP tests) +- [v2/implementation/phase-*-*.md](v2/implementation/) — per-phase execution plans with exit-gate evidence diff --git a/docs/ReadWriteOperations.md b/docs/ReadWriteOperations.md index 48b10f1..c6ed0b4 100644 --- a/docs/ReadWriteOperations.md +++ b/docs/ReadWriteOperations.md @@ -1,99 +1,57 @@ # Read/Write Operations -`LmxNodeManager` overrides the OPC UA `Read` and `Write` methods to translate client requests into MXAccess runtime calls. Each override resolves the OPC UA `NodeId` to a Galaxy tag reference, performs the I/O through `IMxAccessClient`, and returns the result with appropriate status codes. +`DriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) wires the OPC UA stack's per-variable `OnReadValue` and `OnWriteValue` hooks to each driver's `IReadable` and `IWritable` capabilities. Every dispatch flows through `CapabilityInvoker` so the Polly pipeline (retry / timeout / breaker / bulkhead) applies uniformly across Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, and OPC UA Client drivers. -## Read Override +## OnReadValue -The `Read` override in `LmxNodeManager` intercepts value attribute reads for nodes in the Galaxy namespace. 
+The hook is registered on every `BaseDataVariableState` created by the `IAddressSpaceBuilder.Variable(...)` call during discovery. When the stack dispatches a Read for a node in this namespace: -### Resolution flow +1. If the driver does not implement `IReadable`, the hook returns `BadNotReadable`. +2. The node's `NodeId.Identifier` is used directly as the driver-side full reference — it matches `DriverAttributeInfo.FullName` registered at discovery time. +3. (Phase 6.2) If an `AuthorizationGate` + `NodeScopeResolver` are wired, the gate is consulted first via `IsAllowed(identity, OpcUaOperation.Read, scope)`. A denied read never hits the driver. +4. The call is wrapped by `_invoker.ExecuteAsync(DriverCapability.Read, ResolveHostFor(fullRef), …)`. The resolved host is `IPerCallHostResolver.ResolveHost(fullRef)` for multi-host drivers; single-host drivers fall back to `DriverInstanceId` (decision #144). +5. The first `DataValueSnapshot` from the batch populates the outgoing `value` / `statusCode` / `timestamp`. An empty batch surfaces `BadNoData`; any exception surfaces `BadInternalError`. -1. The base class `Read` runs first, handling non-value attributes (DisplayName, DataType, etc.) through the standard node manager. -2. For each `ReadValueId` where `AttributeId == Attributes.Value`, the override checks whether the node belongs to this namespace (`NamespaceIndex` match). -3. The string-typed `NodeId.Identifier` is looked up in `_nodeIdToTagReference` to find the corresponding `FullTagReference` (e.g., `DelmiaReceiver_001.DownloadPath`). -4. `_mxAccessClient.ReadAsync(tagRef)` retrieves the current value, timestamp, and quality from MXAccess. The async call is synchronously awaited because the OPC UA SDK `Read` override is synchronous. -5. The returned `Vtq` is converted to a `DataValue` via `CreatePublishedDataValue`, which normalizes array values through `NormalizePublishedValue` (substituting a default typed array when the value is null for array nodes). -6. 
On success, `errors[i]` is set to `ServiceResult.Good`. On exception, the error is set to `BadInternalError`. +The hook is synchronous — the async invoker call is bridged with `AsTask().GetAwaiter().GetResult()` because the OPC UA SDK's value-hook signature is sync. Idempotent-by-construction reads mean this bridge is safe to retry inside the Polly pipeline. -```csharp -if (_nodeIdToTagReference.TryGetValue(nodeIdStr, out var tagRef)) -{ - var vtq = _mxAccessClient.ReadAsync(tagRef).GetAwaiter().GetResult(); - results[i] = CreatePublishedDataValue(tagRef, vtq); - errors[i] = ServiceResult.Good; -} -``` +## OnWriteValue -## Write Override +`OnWriteValue` follows the same shape with two additional concerns: authorization and idempotence. -The `Write` override follows a similar pattern but includes access-level enforcement and array element write support. +### Authorization (two layers) -### Access level check +1. **SecurityClassification gate.** Every variable stores its `SecurityClassification` in `_securityByFullRef` at registration time (populated from `DriverAttributeInfo.SecurityClass`). `WriteAuthzPolicy.IsAllowed(classification, userRoles)` runs first, consulting the session's roles via `context.UserIdentity is IRoleBearer`. `FreeAccess` passes anonymously, `ViewOnly` denies everyone, and `Operate / Tune / Configure / SecuredWrite / VerifiedWrite` require `WriteOperate / WriteTune / WriteConfigure` roles respectively. Denial returns `BadUserAccessDenied` without consulting the driver — drivers never enforce ACLs themselves; they only report classification as discovery metadata (feedback `feedback_acl_at_server_layer.md`). +2. **Phase 6.2 permission-trie gate.** When `AuthorizationGate` is wired, it re-runs with the operation derived from `WriteAuthzPolicy.ToOpcUaOperation(classification)`. The gate consults the per-cluster permission trie loaded from `NodeAcl` rows, enforcing fine-grained per-tag ACLs on top of the role-based classification policy. 
See `docs/v2/acl-design.md`. -The base class `Write` runs first and sets `BadNotWritable` for nodes whose `AccessLevel` does not include `CurrentWrite`. The override skips these nodes: +### Dispatch -```csharp -if (errors[i] != null && errors[i].StatusCode == StatusCodes.BadNotWritable) - continue; -``` +`_invoker.ExecuteWriteAsync(host, isIdempotent, callSite, …)` honors the `WriteIdempotentAttribute` semantics per decisions #44-45 and #143: -The `AccessLevel` is set during node creation based on `SecurityClassificationMapper.IsWritable(attr.SecurityClassification)`. Read-only Galaxy attributes (e.g., security classification `FreeRead`) get `AccessLevels.CurrentRead` only. +- `isIdempotent = true` (tag flagged `WriteIdempotent` in the Config DB) → runs through the standard `DriverCapability.Write` pipeline; retry may apply per the tier configuration. +- `isIdempotent = false` (default) → the invoker builds a one-off pipeline with `RetryCount = 0`. A timeout may fire after the device already accepted the pulse / alarm-ack / counter-increment; replay is the caller's decision, not the server's. -### Write flow +The `_writeIdempotentByFullRef` lookup is populated at discovery time from the `DriverAttributeInfo.WriteIdempotent` field. -1. The `NodeId` is resolved to a tag reference via `_nodeIdToTagReference`. -2. The raw value is extracted from `writeValue.Value.WrappedValue.Value`. -3. If the write includes an `IndexRange` (array element write), `TryApplyArrayElementWrite` handles the merge before sending the full array to MXAccess. -4. `_mxAccessClient.WriteAsync(tagRef, value)` sends the value to the Galaxy runtime. -5. On success, `PublishLocalWrite` updates the in-memory node immediately so subscribed clients see the change without waiting for the next MXAccess data change callback. +### Per-write status -### Array element writes via IndexRange +`IWritable.WriteAsync` returns `IReadOnlyList` — one numeric `StatusCode` per requested write. 
A non-zero code is surfaced directly to the client; exceptions become `BadInternalError`. The OPC UA stack's pattern of batching per-service is preserved through the full chain. -`TryApplyArrayElementWrite` supports writing individual elements of an array attribute. MXAccess does not support element-level writes, so the method performs a read-modify-write: +## Array element writes -1. Parse the `IndexRange` string as a zero-based integer index. Return `BadIndexRangeInvalid` if parsing fails or the index is negative. -2. Read the current array value from MXAccess via `ReadAsync`. -3. Clone the array and set the element at the target index. -4. `NormalizeIndexedWriteValue` unwraps single-element arrays (OPC UA clients sometimes wrap a scalar in a one-element array). -5. `ConvertArrayElementValue` coerces the value to the array's element type using `Convert.ChangeType`, handling null values by substituting the type's default. -6. The full modified array is written back to MXAccess as a single `WriteAsync` call. +Array-element writes via OPC UA `IndexRange` are driver-specific. The OPC UA stack hands the dispatch an unwrapped `NumericRange` on the `indexRange` parameter of `OnWriteValue`; `DriverNodeManager` passes the full `value` object to `IWritable.WriteAsync` and the driver decides whether to support partial writes. Galaxy performs a read-modify-write inside the Galaxy driver (MXAccess has no element-level writes); other drivers generally accept only full-array writes today. -```csharp -var nextArray = (Array)currentArray.Clone(); -nextArray.SetValue(ConvertArrayElementValue(normalizedValue, elementType), index); -updatedArray = nextArray; -``` +## HistoryRead -### Role-based write enforcement +`DriverNodeManager.HistoryReadRawModified`, `HistoryReadProcessed`, `HistoryReadAtTime`, and `HistoryReadEvents` route through the driver's `IHistoryProvider` capability with `DriverCapability.HistoryRead`. 
Drivers without `IHistoryProvider` surface `BadHistoryOperationUnsupported` per node. See `docs/HistoricalDataAccess.md`. -When `AnonymousCanWrite` is `false` in the `Authentication` configuration, the write override enforces role-based access control before dispatching to MXAccess. The check order is: +## Failure isolation -1. The base class `Write` runs first, enforcing `AccessLevel`. Nodes without `CurrentWrite` get `BadNotWritable` and the override skips them. -2. The override checks whether the node is in the Galaxy namespace. Non-namespace nodes are skipped. -3. If `AnonymousCanWrite` is `false`, the override inspects `context.OperationContext.Session` for `GrantedRoleIds`. If the session does not hold `WellKnownRole_AuthenticatedUser`, the error is set to `BadUserAccessDenied` and the write is rejected. -4. If the role check passes (or `AnonymousCanWrite` is `true`), the write proceeds to MXAccess. +Per decision #12, exceptions in the driver's capability call are logged and converted to a per-node `BadInternalError` — they never unwind into the master node manager. This keeps one driver's outage from disrupting sibling drivers in the same server process. -The existing security classification enforcement (ReadOnly nodes getting `BadNotWritable` via `AccessLevel`) still applies first and takes precedence over the role check. +## Key source files -## Value Type Conversion - -`CreatePublishedDataValue` wraps the conversion pipeline. `NormalizePublishedValue` checks whether the tag is an array type with a declared `ArrayDimension` and substitutes a default typed array (via `CreateDefaultArrayValue`) when the raw value is null. This prevents OPC UA clients from receiving a null variant for array nodes, which violates the specification for nodes declared with `ValueRank.OneDimension`. - -`CreateDefaultArrayValue` uses `MxDataTypeMapper.MapToClrType` to determine the CLR element type, then creates an `Array.CreateInstance` of the declared length. 
String arrays are initialized with `string.Empty` elements rather than null. - -## PublishLocalWrite - -After a successful write, `PublishLocalWrite` updates the variable node in memory without waiting for the MXAccess `OnDataChange` callback to arrive: - -```csharp -private void PublishLocalWrite(string tagRef, object? value) -{ - var dataValue = CreatePublishedDataValue(tagRef, Vtq.Good(value)); - variable.Value = dataValue.Value; - variable.StatusCode = dataValue.StatusCode; - variable.Timestamp = dataValue.SourceTimestamp; - variable.ClearChangeMasks(SystemContext, false); -} -``` - -`ClearChangeMasks` notifies the OPC UA framework that the node value has changed, which triggers data change notifications to any active monitored items. Without this call, subscribed clients would only see the update when the next MXAccess data change event arrives, which could be delayed depending on the subscription interval. +- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — `OnReadValue` / `OnWriteValue` hooks +- `src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs` — classification-to-role policy +- `src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` — Phase 6.2 trie gate +- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs` — `ExecuteAsync` / `ExecuteWriteAsync` +- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IReadable.cs`, `IWritable.cs`, `WriteIdempotentAttribute.cs` diff --git a/docs/Redundancy.md b/docs/Redundancy.md index f78a971..91ea62c 100644 --- a/docs/Redundancy.md +++ b/docs/Redundancy.md @@ -2,189 +2,102 @@ ## Overview -LmxOpcUa supports OPC UA **non-transparent redundancy** in Warm or Hot mode. In a non-transparent redundancy deployment, two independent server instances run side by side. Both connect to the same Galaxy repository database and the same MXAccess runtime, but each maintains its own OPC UA sessions and subscriptions. 
Clients discover the redundant set through the `ServerUriArray` exposed in each server's address space and are responsible for managing failover between the two endpoints. +OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes. -When redundancy is disabled (the default), the server reports `RedundancySupport.None` and a fixed `ServiceLevel` of 255. +The redundancy surface lives in `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`: -## Namespace vs Application Identity - -Both servers in the redundant set share the same **namespace URI** so that clients see identical node IDs regardless of which instance they are connected to. The namespace URI follows the pattern `urn:{GalaxyName}:LmxOpcUa` (e.g., `urn:ZB:LmxOpcUa`). - -The **ApplicationUri**, on the other hand, must be unique per instance. This is how the OPC UA stack and clients distinguish one server from the other within the redundant set. Each instance sets its own ApplicationUri via the `OpcUa.ApplicationUri` configuration property (e.g., `urn:localhost:LmxOpcUa:instance1` and `urn:localhost:LmxOpcUa:instance2`). - -When redundancy is disabled, `ApplicationUri` defaults to `urn:{GalaxyName}:LmxOpcUa` if left null. - -## Configuration - -### Redundancy Section - -| Property | Type | Default | Description | -|---|---|---|---| -| `Enabled` | bool | `false` | Enables non-transparent redundancy. When false, the server reports `RedundancySupport.None` and `ServiceLevel = 255`. | -| `Mode` | string | `"Warm"` | The redundancy mode advertised to clients. Valid values: `Warm`, `Hot`. 
| -| `Role` | string | `"Primary"` | This instance's role in the redundant pair. Valid values: `Primary`, `Secondary`. The Primary advertises a higher ServiceLevel than the Secondary when both are healthy. | -| `ServerUris` | string[] | `[]` | The ApplicationUri values of all servers in the redundant set. Must include this instance's own `OpcUa.ApplicationUri`. Should contain at least 2 entries. | -| `ServiceLevelBase` | int | `200` | The base ServiceLevel when the server is fully healthy. Valid range: 1-255. The Secondary automatically receives `ServiceLevelBase - 50`. | - -### OpcUa.ApplicationUri - -| Property | Type | Default | Description | -|---|---|---|---| -| `ApplicationUri` | string | `null` | Explicit application URI for this server instance. When null, defaults to `urn:{GalaxyName}:LmxOpcUa`. **Required when redundancy is enabled** -- each instance needs a unique identity. | - -## ServiceLevel Computation - -ServiceLevel is a standard OPC UA diagnostic value (0-255) that indicates server health. Clients in a redundant deployment should prefer the server advertising the highest ServiceLevel. - -**Baseline values:** - -| Role | Baseline | +| Class | Role | |---|---| -| Primary | `ServiceLevelBase` (default 200) | -| Secondary | `ServiceLevelBase - 50` (default 150) | +| `RedundancyCoordinator` | Process-singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. | +| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. | +| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. 
`await using` the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. | +| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. | +| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. | +| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. | +| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. | -**Penalties applied to the baseline:** +## Data model -| Condition | Penalty | +Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`): + +| Column | Role | |---|---| -| MXAccess disconnected | -100 | -| Galaxy DB unreachable | -50 | -| Both MXAccess and DB down | ServiceLevel forced to 0 | +| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. | +| `ClusterId` | Foreign key into `ServerCluster`. | +| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). | +| `ServiceLevelBase` | Per-node base value used to bias nominal ServiceLevel output. | +| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. | -The final value is clamped to the range 0-255. 
+`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes. -**Examples (with default ServiceLevelBase = 200):** +## ServiceLevel matrix -| Scenario | Primary | Secondary | +`ServiceLevelCalculator` produces one of the following bands (see `ServiceLevelBand` enum in the same file): + +| Band | Byte | Meaning | |---|---|---| -| Both healthy | 200 | 150 | -| MXAccess down | 100 | 50 | -| DB down | 150 | 100 | -| Both down | 0 | 0 | +| `Maintenance` | 0 | Operator-declared maintenance. | +| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). | +| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. | +| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. | +| `BackupMidApply` | 50 | Backup inside a publish-apply window. | +| `IsolatedBackup` | 80 | Primary unreachable; Backup says "take over if asked" — does **not** auto-promote (non-transparent model). | +| `AuthoritativeBackup` | 100 | Backup nominal. | +| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. | +| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. | +| `IsolatedPrimary` | 230 | Primary with unreachable peer, retains authority. | +| `AuthoritativePrimary` | 255 | Primary nominal. | -## Two-Instance Deployment +The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 2..255 so spec-compliant clients that treat "<3 = unhealthy" keep working. -When deploying a redundant pair, the following configuration properties must differ between the two instances. All other settings (GalaxyName, ConnectionString, etc.) are shared. +Standalone nodes (single-instance deployments) report `AuthoritativePrimary` when healthy and `PrimaryMidApply` during publish. 
-| Property | Instance 1 (Primary) | Instance 2 (Secondary) | -|---|---|---| -| `OpcUa.Port` | 4840 | 4841 | -| `OpcUa.ServerName` | `LmxOpcUa-1` | `LmxOpcUa-2` | -| `OpcUa.ApplicationUri` | `urn:localhost:LmxOpcUa:instance1` | `urn:localhost:LmxOpcUa:instance2` | -| `Dashboard.Port` | 8081 | 8082 | -| `MxAccess.ClientName` | `LmxOpcUa-1` | `LmxOpcUa-2` | -| `Redundancy.Role` | `Primary` | `Secondary` | +## Publish fencing and split-brain prevention -### Instance 1 -- Primary (appsettings.json) +Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`. While the lease is held: -```json -{ - "OpcUa": { - "Port": 4840, - "ServerName": "LmxOpcUa-1", - "GalaxyName": "ZB", - "ApplicationUri": "urn:localhost:LmxOpcUa:instance1" - }, - "MxAccess": { - "ClientName": "LmxOpcUa-1" - }, - "Dashboard": { - "Port": 8081 - }, - "Redundancy": { - "Enabled": true, - "Mode": "Warm", - "Role": "Primary", - "ServerUris": [ - "urn:localhost:LmxOpcUa:instance1", - "urn:localhost:LmxOpcUa:instance2" - ], - "ServiceLevelBase": 200 - } -} -``` +- The calculator reports `PrimaryMidApply` / `BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation. +- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically. +- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`. -### Instance 2 -- Secondary (appsettings.json) +Because role transitions are **operator-driven** (write `RedundancyRole` in the Config DB + publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154). 
-```json -{ - "OpcUa": { - "Port": 4841, - "ServerName": "LmxOpcUa-2", - "GalaxyName": "ZB", - "ApplicationUri": "urn:localhost:LmxOpcUa:instance2" - }, - "MxAccess": { - "ClientName": "LmxOpcUa-2" - }, - "Dashboard": { - "Port": 8082 - }, - "Redundancy": { - "Enabled": true, - "Mode": "Warm", - "Role": "Secondary", - "ServerUris": [ - "urn:localhost:LmxOpcUa:instance1", - "urn:localhost:LmxOpcUa:instance2" - ], - "ServiceLevelBase": 200 - } -} -``` +## Metrics -## CLI `redundancy` Command +`RedundancyMetrics` in `src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. Instruments: -The Client CLI includes a `redundancy` command that reads the redundancy state from a running server. +| Name | Kind | Tags | Description | +|---|---|---|---| +| `otopcua.redundancy.role_transition` | Counter | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. | +| `otopcua.redundancy.primary_count` | ObservableGauge | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in nominal state. | +| `otopcua.redundancy.secondary_count` | ObservableGauge | `cluster.id` | Secondary-role nodes per cluster. | +| `otopcua.redundancy.stale_count` | ObservableGauge | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. | -```bash -dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u opc.tcp://localhost:4840/LmxOpcUa -dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u opc.tcp://localhost:4841/LmxOpcUa -``` +Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed. 
-The command reads the following standard OPC UA nodes and displays their values: +## Real-time notifications (Admin UI) -- **Redundancy Mode** -- from `Server_ServerRedundancy_RedundancySupport` (None, Warm, or Hot) -- **Service Level** -- from `Server_ServiceLevel` (0-255) -- **Server URIs** -- from `Server_ServerRedundancy_ServerUriArray` (list of ApplicationUri values in the redundant set) -- **Application URI** -- from `Server_ServerArray` (this instance's ApplicationUri) +`FleetStatusPoller` in `src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On("RoleChanged", …)` so connected Admin sessions see role swaps the moment they happen. -Example output for a healthy Primary: +## Configuring a redundant pair -``` -Redundancy Mode: Warm -Service Level: 200 -Server URIs: - - urn:localhost:LmxOpcUa:instance1 - - urn:localhost:LmxOpcUa:instance2 -Application URI: urn:localhost:LmxOpcUa:instance1 -``` +Redundancy is configured **in the Config DB, not appsettings.json**. The fields that must differ between the two instances: -The command also supports `--username`/`--password` and `--security` options for authenticated or encrypted connections. +| Field | Location | Instance 1 | Instance 2 | +|---|---|---|---| +| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` | +| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` | +| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` | +| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 | -### Client Failover with `-F` +Shared between instances: `ClusterId`, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances. 
-All CLI commands support the `-F` / `--failover-urls` flag for automatic client-side failover. When provided, the CLI tries the primary endpoint first and falls back to the listed URLs if the primary is unreachable. +Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab` — the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart. -```bash -# Connect with failover — uses secondary if primary is down -dotnet run -- connect -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa +## Client-side failover -# Subscribe with live failover — reconnects to secondary if primary drops mid-stream -dotnet run -- subscribe -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa \ - -n "ns=1;s=TestMachine_001.MachineID" -``` +The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md) for the command reference. -For long-running commands (`subscribe`), the CLI monitors the session KeepAlive and automatically reconnects to the next available server when the current session drops. The subscription is re-created on the new server. +## Depth reference -## Troubleshooting - -**Mismatched ServerUris between instances** -- Both instances must list the exact same set of ApplicationUri values in `Redundancy.ServerUris`. If they differ, clients may not discover the full redundant set. Check the startup log for the `Redundancy.ServerUris` line on each instance. - -**ServiceLevel stuck at 255** -- This indicates redundancy is not enabled. 
When `Redundancy.Enabled` is false (the default), the server always reports `ServiceLevel = 255` and `RedundancySupport.None`. Verify that `Redundancy.Enabled` is set to `true` in the configuration and that the configuration section is correctly bound. - -**ApplicationUri not set** -- The configuration validator rejects startup when redundancy is enabled but `OpcUa.ApplicationUri` is null or empty. Each instance must have a unique ApplicationUri. Check the error log for: `OpcUa.ApplicationUri must be set when redundancy is enabled`. - -**Both servers report the same ServiceLevel** -- Verify that one instance has `Redundancy.Role` set to `Primary` and the other to `Secondary`. Both set to `Primary` (or both to `Secondary`) will produce identical baseline values, preventing clients from distinguishing the preferred server. - -**ServerUriArray not readable** -- When `RedundancySupport` is `None` (redundancy disabled), the OPC UA SDK may not expose the `ServerUriArray` node or it may return an empty value. The CLI `redundancy` command handles this gracefully by catching the read error. Enable redundancy to populate this array. +For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see `docs/v2/plan.md` §Phase 6.3. diff --git a/docs/ServiceHosting.md b/docs/ServiceHosting.md index f9b7735..833ca27 100644 --- a/docs/ServiceHosting.md +++ b/docs/ServiceHosting.md @@ -2,189 +2,132 @@ ## Overview -The service runs as a Windows service or console application using TopShelf for lifecycle management. It targets .NET Framework 4.8 with an x86 (32-bit) platform target, which is required for MXAccess COM interop with the ArchestrA runtime DLLs. 
+A production OtOpcUa deployment runs **three processes**, each with a distinct runtime, platform target, and install surface: -## TopShelf Configuration +| Process | Project | Runtime | Platform | Responsibility | +|---|---|---|---|---| +| **OtOpcUa Server** | `src/ZB.MOM.WW.OtOpcUa.Server` | .NET 10 | x64 | Hosts the OPC UA endpoint; loads every non-Galaxy driver in-process; exposes `/healthz`. | +| **OtOpcUa Admin** | `src/ZB.MOM.WW.OtOpcUa.Admin` | .NET 10 (ASP.NET Core / Blazor Server) | x64 | Operator UI for Config DB editing + fleet status, SignalR hubs (`FleetStatusHub`, `AlertHub`), Prometheus `/metrics`. | +| **OtOpcUa Galaxy.Host** | `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host` | .NET Framework 4.8 | x86 (32-bit) | Hosts MXAccess COM on a dedicated STA thread with a Win32 message pump; exposes a named-pipe IPC surface consumed by `Driver.Galaxy.Proxy` inside the Server process. | -`Program.Main()` configures TopShelf to manage the `OpcUaService` lifecycle: +The x86 / .NET Framework 4.8 constraint applies **only** to Galaxy.Host because the MXAccess toolkit DLLs (`Program Files (x86)\ArchestrA\Framework\bin`) are 32-bit-only COM. Every other driver (Modbus, S7, OpcUaClient, AbCip, AbLegacy, TwinCAT, FOCAS) runs in-process in the 64-bit Server. 
+ +## Server process + +`src/ZB.MOM.WW.OtOpcUa.Server/Program.cs` uses the generic host: ```csharp -var exitCode = HostFactory.Run(host => -{ - host.UseSerilog(); - - host.Service<OpcUaService>(svc => - { - svc.ConstructUsing(() => new OpcUaService()); - svc.WhenStarted(s => s.Start()); - svc.WhenStopped(s => s.Stop()); - }); - - host.SetServiceName("LmxOpcUa"); - host.SetDisplayName("LMX OPC UA Server"); - host.SetDescription("OPC UA server exposing System Platform Galaxy tags via MXAccess."); - host.RunAsLocalSystem(); - host.StartAutomatically(); -}); +var builder = Host.CreateApplicationBuilder(args); +builder.Services.AddSerilog(); +builder.Services.AddWindowsService(o => o.ServiceName = "OtOpcUa"); +… +builder.Services.AddHostedService<OpcUaServerService>(); +builder.Services.AddHostedService<HostStatusPublisher>(); ``` -TopShelf provides these deployment modes from the same executable: +`OpcUaServerService` is a `BackgroundService` (decision #30 — TopShelf from v1 was replaced by the generic-host `AddWindowsService` wrapper; no TopShelf dependency remains in any csproj). It owns: -| Command | Description | -|---------|-------------| -| `OtOpcUa.Host.exe` | Run as a console application (foreground) | -| `OtOpcUa.Host.exe install` | Install as a Windows service | -| `OtOpcUa.Host.exe uninstall` | Remove the Windows service | -| `OtOpcUa.Host.exe start` | Start the installed service | -| `OtOpcUa.Host.exe stop` | Stop the installed service | +1. Config bootstrap — reads `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString`, `Node:LocalCachePath` from `appsettings.json`. +2. `NodeBootstrap` — pulls the latest published generation from the Config DB into the LiteDB local cache (`LiteDbConfigCache`) so the node starts even if the central DB is briefly unreachable. +3. `DriverHost` — instantiates configured driver instances from the generation, wires each through `CapabilityInvoker` resilience pipelines. +4.
`OpcUaApplicationHost` — builds the OPC UA endpoint, applies `OpcUaServerOptions` + `LdapOptions`, registers `AuthorizationGate` at dispatch. +5. `HostStatusPublisher` — a second hosted service that heartbeats `DriverHostStatus` rows so the Admin UI Fleet view sees the node. -The service is configured to run as `LocalSystem` and start automatically on boot. +### Installation -## Working Directory +Same executable, different modes driven by the .NET generic-host `AddWindowsService` wrapper: -Before configuring Serilog, `Program.Main()` sets the working directory to the executable's location: +| Mode | Invocation | +|---|---| +| Console | `ZB.MOM.WW.OtOpcUa.Server.exe` | +| Install as Windows service | `sc create OtOpcUa binPath="C:\Program Files\OtOpcUa\Server\ZB.MOM.WW.OtOpcUa.Server.exe" start=auto` | +| Start | `sc start OtOpcUa` | +| Stop | `sc stop OtOpcUa` | +| Uninstall | `sc delete OtOpcUa` | -```csharp -Environment.CurrentDirectory = AppDomain.CurrentDomain.BaseDirectory; -``` +### Health endpoints -This is necessary because Windows services default their working directory to `System32`, which would cause relative log paths and `appsettings.json` to resolve incorrectly. +The Server exposes `/healthz` + `/readyz` used by (a) the Admin `FleetStatusPoller` as input to Fleet status and (b) `PeerReachabilityTracker` in a peer Server process as the HTTP side of the peer-reachability probe. -## Startup Sequence +## Admin process -`OpcUaService.Start()` executes the following steps in order. If any required step fails, the service logs the error and throws, preventing a partially initialized state. +`src/ZB.MOM.WW.OtOpcUa.Admin/Program.cs` is a stock `WebApplication`. Highlights: -1. **Load configuration** -- The production constructor reads `appsettings.json`, optional environment overlay, and environment variables, then binds each section to its typed configuration class. -2. 
**Validate configuration** -- `ConfigurationValidator.ValidateAndLog()` logs all resolved values and checks required constraints (port range, non-empty names and connection strings). If validation fails, the service throws `InvalidOperationException`. -3. **Register exception handler** -- Registers `AppDomain.CurrentDomain.UnhandledException` to log fatal unhandled exceptions with `IsTerminating` context. -4. **Create performance metrics** -- Creates the `PerformanceMetrics` instance and a `CancellationTokenSource` for coordinating shutdown. -5. **Create and connect MXAccess client** -- Starts the STA COM thread, creates the `MxAccessClient`, and attempts an initial connection. If the connection fails, the service logs a warning and continues -- the monitor loop will retry in the background. -6. **Start MXAccess monitor** -- Starts the connectivity monitor loop that probes the runtime connection at the configured interval and handles auto-reconnect. -7. **Test Galaxy repository connection** -- Calls `TestConnectionAsync()` on the Galaxy repository to verify the SQL Server database is reachable. If it fails, the service continues without initial address-space data. -8. **Create OPC UA server host** -- Creates `OpcUaServerHost` with the effective MXAccess client (real, override, or null fallback), performance metrics, and an optional `IHistorianDataSource` obtained from `HistorianPluginLoader.TryLoad` when `Historian.Enabled=true` (returns `null` if the plugin is absent or fails to load). -9. **Query Galaxy hierarchy** -- Fetches the object hierarchy and attribute definitions from the Galaxy repository database, recording object and attribute counts. -10. **Start server and build address space** -- Starts the OPC UA server, retrieves the `LmxNodeManager`, and calls `BuildAddressSpace()` with the queried hierarchy and attributes. If the query or build fails, the server still starts with an empty address space. -11. 
**Start change detection** -- Creates and starts `ChangeDetectionService`, which polls `galaxy.time_of_last_deploy` at the configured interval. When a change is detected, it triggers an address-space rebuild via the `OnGalaxyChanged` event. -12. **Start status dashboard** -- Creates the `HealthCheckService` and `StatusReportService`, wires in all live components, and starts the `StatusWebServer` HTTP listener if the dashboard is enabled. If `StatusWebServer.Start()` returns `false` (port already bound, insufficient permissions, etc.), the service logs a warning, disposes the unstarted instance, sets `OpcUaService.DashboardStartFailed = true`, and continues in degraded mode. Matches the warning-continue policy applied to MxAccess connect, Galaxy DB connect, and initial address space build. Stability review 2026-04-13 Finding 2. -13. **Log startup complete** -- Logs "LmxOpcUa service started successfully" at `Information` level. +- Cookie auth (`CookieAuthenticationDefaults`, scheme name `OtOpcUa.Admin`) + Blazor Server (`AddInteractiveServerComponents`) + SignalR. +- Authorization policies gated by `AdminRoles`: `ConfigViewer`, `ConfigEditor`, `FleetAdmin` (see `Services/AdminRoles.cs`). `CanEdit` policy requires `ConfigEditor` or `FleetAdmin`; `CanPublish` requires `FleetAdmin`. +- `OtOpcUaConfigDbContext` registered against `ConnectionStrings:ConfigDb`. +- Scoped services: `ClusterService`, `GenerationService`, `EquipmentService`, `UnsService`, `NamespaceService`, `DriverInstanceService`, `NodeAclService`, `PermissionProbeService`, `AclChangeNotifier`, `ReservationService`, `DraftValidationService`, `AuditLogService`, `HostStatusService`, `ClusterNodeService`, `EquipmentImportBatchService`, `ILdapGroupRoleMappingService`. +- Singleton `RedundancyMetrics` (meter name `ZB.MOM.WW.OtOpcUa.Redundancy`) + `CertTrustService` (promotes rejected client certs in the Server's PKI store to trusted via the Admin Certificates page). 
+- `LdapAuthService` bound to `Authentication:Ldap` — same LDAP flow as ScadaLink CentralUI for visual parity. +- SignalR hubs mapped at `/hubs/fleet` and `/hubs/alerts`; `FleetStatusPoller` runs as a hosted service and pushes `RoleChanged`, host status, and alert events. +- OpenTelemetry → Prometheus exporter at `/metrics` when `Metrics:Prometheus:Enabled=true` (default). Pull-based means no Collector required in the common K8s deploy. -## Shutdown Sequence +### Installation -`OpcUaService.Stop()` tears down components in reverse dependency order: +Deployed as an ASP.NET Core service; the generic-host `AddWindowsService` wrapper (or IIS reverse-proxy for multi-node fleets) provides install/uninstall. Listens on whatever `ASPNETCORE_URLS` specifies. -1. **Cancel operations** -- Signals the `CancellationTokenSource` to stop all background loops. -2. **Stop change detection** -- Stops the Galaxy deploy polling loop. -3. **Stop OPC UA server** -- Shuts down the OPC UA server host, disconnecting all client sessions. -4. **Stop MXAccess monitor** -- Stops the connectivity monitor loop. -5. **Disconnect MXAccess** -- Disconnects the MXAccess client and releases COM resources. -6. **Dispose STA thread** -- Shuts down the dedicated STA COM thread and its message pump. -7. **Stop dashboard** -- Disposes the `StatusWebServer` HTTP listener. -8. **Dispose metrics** -- Releases the performance metrics collector. -9. **Dispose change detection** -- Releases the change detection service. -10. **Unregister exception handler** -- Removes the `AppDomain.UnhandledException` handler. +## Galaxy.Host process -The entire shutdown is wrapped in a `try/catch` that logs warnings for errors during cleanup, ensuring the service exits even if a component fails to dispose cleanly. +`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Program.cs` is a .NET Framework 4.8 x86 console executable. 
Configuration comes from environment variables supplied by the supervisor (`Driver.Galaxy.Proxy.Supervisor`): -## Error Handling +| Env var | Purpose | +|---|---| +| `OTOPCUA_GALAXY_PIPE` | Pipe name the host listens on (default `OtOpcUaGalaxy`). | +| `OTOPCUA_ALLOWED_SID` | SID of the Server process's principal; anyone else is refused during the handshake. | +| `OTOPCUA_GALAXY_SECRET` | Per-spawn shared secret the client must present in the Hello frame. | +| `OTOPCUA_GALAXY_BACKEND` | `mxaccess` (default), `db` (ZB-only, no COM), `stub` (in-memory; for tests). | +| `OTOPCUA_GALAXY_ZB_CONN` | SQL connection string to the ZB Galaxy repository. | +| `OTOPCUA_HISTORIAN_*` | Optional Wonderware Historian SDK config if Historian is enabled for this node. | -### Unhandled exceptions +The host spins up `StaPump` (the STA thread with message pump), creates the MXAccess `LMXProxyServer` COM object on that thread, and handles all COM calls there; the IPC layer marshals work items via `PostThreadMessage`. -`AppDomain.CurrentDomain.UnhandledException` is registered at startup and removed at shutdown. The handler logs the exception at `Fatal` level with the `IsTerminating` flag: +### Pipe security -```csharp -Log.Fatal(e.ExceptionObject as Exception, - "Unhandled exception (IsTerminating={IsTerminating})", e.IsTerminating); -``` +`PipeServer` builds a `PipeAcl` from the provided `SecurityIdentifier` + uses `NamedPipeServerStream` with `maxNumberOfServerInstances: 1`. The handshake requires a matching shared secret in the first Hello frame; callers whose SID doesn't match `OTOPCUA_ALLOWED_SID` are rejected before any frame is processed. **By design the pipe ACL denies BUILTIN\Administrators** — live smoke tests must therefore run from a non-elevated shell that matches the allowed principal. The installed dev host (`OtOpcUaGalaxyHost`) runs as `dohertj2` with the secret at `.local/galaxy-host-secret.txt`. 
-### Startup resilience +### Installation -The startup sequence is designed to degrade gracefully rather than fail entirely: - -- If MXAccess connection fails, the service continues with a `NullMxAccessClient` that returns bad-quality values for all reads. -- If the Galaxy repository database is unreachable, the OPC UA server starts with an empty address space. -- If the status dashboard port is in use, the dashboard logs a warning and does not start, but the OPC UA server continues. - -### Fatal startup failure - -If a critical step (configuration validation, OPC UA server start) throws, `Start()` catches the exception, logs it at `Fatal`, and re-throws to let TopShelf report the failure. - -## Logging - -The service uses Serilog with two sinks configured in `Program.Main()`: - -```csharp -Log.Logger = new LoggerConfiguration() - .MinimumLevel.Information() - .WriteTo.Console() - .WriteTo.File( - path: "logs/lmxopcua-.log", - rollingInterval: RollingInterval.Day, - retainedFileCountLimit: 31) - .CreateLogger(); -``` - -| Sink | Details | -|------|---------| -| Console | Writes to stdout, useful when running as a console application | -| Rolling file | Writes to `logs/lmxopcua-{date}.log`, rolls daily, retains 31 days of history | - -Log files are written relative to the executable directory (see Working Directory above). Each component creates its own contextual logger using `Log.ForContext<T>()` or `Log.ForContext(typeof(T))`. - -`Log.CloseAndFlush()` is called in the `finally` block of `Program.Main()` to ensure all buffered log entries are written before process exit. - -## Multi-Instance Deployment - -The service supports running multiple instances for redundancy.
Each instance requires: - -- A unique Windows service name (e.g., `LmxOpcUa`, `LmxOpcUa2`) -- A unique OPC UA port and dashboard port -- A unique `OpcUa.ApplicationUri` and `OpcUa.ServerName` -- A unique `MxAccess.ClientName` -- Matching `Redundancy.ServerUris` arrays on all instances - -Install additional instances using TopShelf's `-servicename` flag: +NSSM-wrapped (the Non-Sucking Service Manager) because the executable itself is a plain console app, not a `ServiceBase` Windows service. The supervisor then adopts the child process over the pipe after install. Install/uninstall commands follow the NSSM pattern: ```bash -cd C:\publish\lmxopcua\instance2 -ZB.MOM.WW.OtOpcUa.Host.exe install -servicename "LmxOpcUa2" -displayname "LMX OPC UA Server (Instance 2)" +nssm install OtOpcUaGalaxyHost "C:\Program Files (x86)\OtOpcUa\Galaxy.Host\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe" +nssm set OtOpcUaGalaxyHost ObjectName .\dohertj2 +nssm set OtOpcUaGalaxyHost AppEnvironmentExtra OTOPCUA_GALAXY_BACKEND=mxaccess OTOPCUA_GALAXY_SECRET=… OTOPCUA_ALLOWED_SID=… +nssm start OtOpcUaGalaxyHost ``` -See [Redundancy Guide](Redundancy.md) for full deployment details. +(Exact values for the environment block are generated by the Admin UI + committed alongside `.local/galaxy-host-secret.txt` on the dev box.) -## Required Runtime Assemblies - -The build uses Costura.Fody to embed all NuGet dependencies into the single `ZB.MOM.WW.OtOpcUa.Host.exe`. The only native dependency that must sit alongside the executable in every deployment is the MXAccess COM toolkit: - -| Assembly | Purpose | -|----------|---------| -| `ArchestrA.MxAccess.dll` | MXAccess COM interop — runtime data access to Galaxy tags | - -The Wonderware Historian SDK is packaged as a **runtime-loaded plugin** so hosts that will not use historical data access do not need the SDK installed. 
The plugin lives in a `Historian/` subfolder next to `ZB.MOM.WW.OtOpcUa.Host.exe`: +## Inter-process communication ``` -ZB.MOM.WW.OtOpcUa.Host.exe -ArchestrA.MxAccess.dll -Historian/ - ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll - aahClientManaged.dll - aahClientCommon.dll - aahClient.dll - Historian.CBE.dll - Historian.DPAPI.dll - ArchestrA.CloudHistorian.Contract.dll +┌──────────────────────────┐ LDAP bind (Authentication:Ldap) ┌──────────────────────────┐ +│ OtOpcUa Admin (x64) │ ─────────────────────────────────────────────▶│ LDAP / AD │ +│ Blazor Server + SignalR │ └──────────────────────────┘ +│ /metrics (Prometheus) │ FleetStatusPoller → ClusterNode poll +│ │ ─────────────────────────────────────────────▶┌──────────────────────────┐ +│ │ Cluster/Generation/ACL writes │ Config DB (SQL Server) │ +└──────────────────────────┘ ─────────────────────────────────────────────▶│ OtOpcUaConfigDbContext │ + ▲ └──────────────────────────┘ + │ SignalR ▲ + │ (role change, │ sp_GetCurrentGenerationForCluster + │ host status, │ sp_PublishGeneration + │ alerts) │ +┌──────────────────────────┐ │ +│ OtOpcUa Server (x64) │ ──────────────────────────────────────────────────────────┘ +│ OPC UA endpoint │ +│ Non-Galaxy drivers │ Named pipe (OtOpcUaGalaxy) ┌──────────────────────────┐ +│ Driver.Galaxy.Proxy │ ─────────────────────────────────────────────▶│ Galaxy.Host (x86 .NFx) │ +│ │ SID + shared-secret handshake │ STA + message pump │ +│ /healthz /readyz │ │ MXAccess COM │ +└──────────────────────────┘ │ Historian SDK (opt) │ + └──────────────────────────┘ ``` -At startup, if `Historian.Enabled=true` in `appsettings.json`, `HistorianPluginLoader` probes `Historian/ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll` via `Assembly.LoadFrom` and instantiates the plugin's entry point. An `AppDomain.AssemblyResolve` handler redirects the SDK assembly lookups (`aahClientManaged`, `aahClientCommon`, …) to the same subfolder so the CLR can resolve them when the plugin first JITs. 
If the plugin directory is absent or any SDK dependency fails to load, the loader logs a warning and the server continues to run with history support disabled — `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call. +## appsettings.json boundary -Deployment matrix: +Each process reads its own `appsettings.json` for **bootstrap only** — connection strings, LDAP bind config, transport security profile, redundancy node id, logging. The authoritative configuration tree (drivers, UNS, tags, ACLs) lives in the Config DB and is edited through the Admin UI. See [`Configuration.md`](Configuration.md) for the split. -| Scenario | Host exe | `ArchestrA.MxAccess.dll` | `Historian/` subfolder | -|----------|----------|--------------------------|------------------------| -| `Historian.Enabled=false` | required | required | **omit** | -| `Historian.Enabled=true` | required | required | required | +## Development bootstrap -`ArchestrA.MxAccess.dll` and the historian SDK DLLs are not redistributable — they are provided by the AVEVA System Platform and Historian installations on the target machine. The copies in `lib/` are taken from `Program Files (x86)\ArchestrA\Framework\bin` on a machine with the platform installed. - -## Platform Target - -The service must be compiled and run as x86 (32-bit). The MXAccess COM toolkit DLLs in `Program Files (x86)\ArchestrA\Framework\bin` are 32-bit only. Running the service as x64 or AnyCPU (64-bit preferred) causes COM interop failures when creating the `LMXProxyServer` object on the STA thread. +For the Windows install steps (SQL Server in Docker, .NET 10 SDK, .NET Framework 4.8 SDK, Docker Desktop WSL 2 backend, EF Core CLI, first-run migration), see [`docs/v2/dev-environment.md`](v2/dev-environment.md). 
diff --git a/docs/StatusDashboard.md b/docs/StatusDashboard.md index 75896e0..5050c59 100644 --- a/docs/StatusDashboard.md +++ b/docs/StatusDashboard.md @@ -1,274 +1,16 @@ -# Status Dashboard +# Status Dashboard — Superseded -## Overview +This document has been superseded. -The service hosts an embedded HTTP status dashboard that surfaces real-time health, connection state, subscription counts, data change throughput, and Galaxy metadata. Operators access it through a browser to verify the bridge is functioning without needing an OPC UA client. The dashboard is enabled by default on port 8081 and can be disabled via configuration. +The single-process, HTTP-listener "Status Dashboard" (`StatusWebServer` bound to port 8081) belonged to v1 LmxOpcUa, where one process owned the OPC UA endpoint, the MXAccess bridge, and the operator surface. In the multi-process OtOpcUa platform the operator surface has moved into the **OtOpcUa Admin** app — a Blazor Server UI that talks to the shared Config DB and to every deployed node over SignalR (`FleetStatusHub`, `AlertHub`). Prometheus scraping lives on the Admin app's `/metrics` endpoint via OpenTelemetry (`Metrics:Prometheus:Enabled`). -## HTTP Server +Operator surfaces now covered by the Admin UI: -`StatusWebServer` wraps a `System.Net.HttpListener` bound to `http://+:{port}/`. It starts a background task that accepts requests in a loop and dispatches them by path. Only `GET` requests are accepted; all other methods return `405 Method Not Allowed`. Responses include `Cache-Control: no-cache` headers to prevent stale data in the browser. 
+- Fleet health, per-node role/ServiceLevel, crash-loop detection (`Fleet.razor`, `Hosts.razor`, `FleetStatusPoller`) +- Redundancy state + role transitions (`RedundancyMetrics`, `otopcua.redundancy.*`) +- Cluster + node + credential management (`ClusterService`, `ClusterNodeService`) +- Draft/publish generation editor, diff viewer, CSV import, UnsTab, IdentificationFields, RedundancyTab, AclsTab with Probe-this-permission +- Certificate trust management (`CertTrustService` promotes rejected client certs to trusted) +- Audit log viewer (`AuditLogService`) -### Endpoints - -| Path | Content-Type | Description | -|------|-------------|-------------| -| `/` | `text/html` | Operator dashboard with auto-refresh | -| `/health` | `text/html` | Focused health page with service-level badge and component cards | -| `/api/status` | `application/json` | Full status snapshot as JSON (`StatusData`) | -| `/api/health` | `application/json` | Health endpoint (`HealthEndpointData`) -- returns `503` when status is `Unhealthy`, `200` otherwise | - -Any other path returns `404 Not Found`. - -## Health Check Logic - -`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed: - -1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state. -2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available. -3. **Rule 2 / 2c -- Degraded**: Any recorded operation has a low success rate. The sample threshold depends on the operation category: - - Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate. 
- - Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads. -4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts. -5. **Rule 2e -- Degraded**: `RuntimeStatus.StoppedCount > 0` -- at least one Galaxy runtime host (`$WinPlatform` / `$AppEngine`) is currently reported Stopped by the runtime probe manager. The rule names the stopped hosts in the message. Ordered after Rule 1 so an MxAccess transport outage stays `Unhealthy` via Rule 1 and this rule never double-messages; the probe manager also forces every entry to `Unknown` when the transport is disconnected, so the `StoppedCount` is always 0 in that case. -6. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational." - -The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection. - -## Status Data Model - -`StatusReportService` aggregates data from all bridge components into a `StatusData` DTO, which is then rendered as HTML or serialized to JSON. 
The DTO contains the following sections: - -### Connection - -| Field | Type | Description | -|-------|------|-------------| -| `State` | `string` | Current MXAccess connection state (Connected, Disconnected, Connecting) | -| `ReconnectCount` | `int` | Number of reconnect attempts since startup | -| `ActiveSessions` | `int` | Number of active OPC UA client sessions | - -### Health - -| Field | Type | Description | -|-------|------|-------------| -| `Status` | `string` | Healthy, Degraded, or Unhealthy | -| `Message` | `string` | Operator-facing explanation | -| `Color` | `string` | CSS color token (green, yellow, red, gray) | - -### Subscriptions - -| Field | Type | Description | -|-------|------|-------------| -| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes — see `ProbeCount`) | -| `ProbeCount` | `int` | Subset of `ActiveCount` attributable to bridge-owned runtime status probes (`.ScanState` per deployed `$WinPlatform` / `$AppEngine`). 
Rendered as a separate `Probes: N (bridge-owned runtime status)` line on the dashboard so operators can distinguish probe overhead from client-driven subscription load | - -### Galaxy - -| Field | Type | Description | -|-------|------|-------------| -| `GalaxyName` | `string` | Name of the Galaxy being bridged | -| `DbConnected` | `bool` | Whether the Galaxy repository database is reachable | -| `LastDeployTime` | `DateTime?` | Most recent deploy timestamp from the Galaxy | -| `ObjectCount` | `int` | Number of Galaxy objects in the address space | -| `AttributeCount` | `int` | Number of Galaxy attributes as OPC UA variables | -| `LastRebuildTime` | `DateTime?` | UTC timestamp of the last completed address-space rebuild | - -### Data change - -| Field | Type | Description | -|-------|------|-------------| -| `EventsPerSecond` | `double` | Rate of MXAccess data change events per second | -| `AvgBatchSize` | `double` | Average items processed per dispatch cycle | -| `PendingItems` | `int` | Items waiting in the dispatch queue | -| `TotalEvents` | `long` | Total MXAccess data change events since startup | - -### Galaxy Runtime - -Populated from the `GalaxyRuntimeProbeManager` that advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine`. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Disabled when `MxAccess.RuntimeStatusProbesEnabled = false`; the panel is suppressed entirely from the HTML when `Total == 0`. 
- -| Field | Type | Description | -|-------|------|-------------| -| `Total` | `int` | Number of runtime hosts tracked (Platforms + AppEngines) | -| `RunningCount` | `int` | Hosts whose last probe callback reported `ScanState = true` with Good quality | -| `StoppedCount` | `int` | Hosts whose last probe callback reported `ScanState != true` or a failed item status, or whose initial probe timed out in Unknown state | -| `UnknownCount` | `int` | Hosts still awaiting initial probe resolution, or rewritten to Unknown when the MxAccess transport is Disconnected | -| `Hosts` | `List<GalaxyRuntimeStatus>` | Per-host detail rows, sorted alphabetically by `ObjectName` | - -Each `GalaxyRuntimeStatus` entry: - -| Field | Type | Description | -|-------|------|-------------| -| `ObjectName` | `string` | Galaxy `tag_name` of the host (e.g., `DevPlatform`, `DevAppEngine`) | -| `GobjectId` | `int` | Galaxy `gobject_id` of the host | -| `Kind` | `string` | `$WinPlatform` or `$AppEngine` | -| `State` | `enum` | `Unknown`, `Running`, or `Stopped` | -| `LastStateCallbackTime` | `DateTime?` | UTC time of the most recent probe callback, whether good or bad | -| `LastStateChangeTime` | `DateTime?` | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column | -| `LastScanState` | `bool?` | Last `ScanState` value received; `null` before the first callback | -| `LastError` | `string?` | Detail message from the most recent failure callback (e.g., `"ScanState = false (OffScan)"`); cleared on successful recovery | -| `GoodUpdateCount` | `long` | Cumulative count of `ScanState = true` callbacks | -| `FailureCount` | `long` | Cumulative count of `ScanState != true` callbacks or failed item statuses | - -The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns.
Panel color reflects aggregate state: green when every host is `Running`, yellow when any host is `Unknown` with zero `Stopped`, red when any host is `Stopped`, gray when the MxAccess transport is disconnected (the Connection panel is the primary signal in that case and every row is force-rewritten to `Unknown`). - -### Operations - -A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains: - -- `TotalCount` -- total invocations -- `SuccessRate` -- fraction of successful operations -- `AverageMilliseconds`, `MinMilliseconds`, `MaxMilliseconds`, `Percentile95Milliseconds` -- latency distribution - -The instrumented operation names are: - -| Name | Source | -|---|---| -| `Read` | MXAccess live tag reads (`MxAccessClient.ReadWrite.cs`) | -| `Write` | MXAccess live tag writes | -| `Subscribe` | MXAccess subscription attach | -| `HistoryReadRaw` | `LmxNodeManager.HistoryReadRawModified` -> historian plugin | -| `HistoryReadProcessed` | `LmxNodeManager.HistoryReadProcessed` -> historian plugin (aggregates) | -| `HistoryReadAtTime` | `LmxNodeManager.HistoryReadAtTime` -> historian plugin (interpolated) | -| `HistoryReadEvents` | `LmxNodeManager.HistoryReadEvents` -> historian plugin (alarm/event history) | -| `AlarmAcknowledge` | `LmxNodeManager.OnAlarmAcknowledge` -> MXAccess AckMsg write | - -New operation names are auto-registered on first use, so the `Operations` dictionary only contains entries for features that have actually been exercised since startup. - -### Historian - -`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters. See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture and the [Runtime Health Counters](HistoricalDataAccess.md#runtime-health-counters) section for the data source instrumentation. 
- -| Field | Type | Description | -|-------|------|-------------| -| `Enabled` | `bool` | Whether `Historian.Enabled` is set in configuration | -| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` — load-time outcome from `HistorianPluginLoader.LastOutcome` | -| `PluginError` | `string?` | Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null` | -| `PluginPath` | `string` | Absolute path the loader probed for the plugin assembly | -| `ServerName` | `string` | Legacy single-node hostname from `Historian.ServerName`; ignored when `ServerNames` is non-empty | -| `Port` | `int` | Configured historian TCP port | -| `QueryTotal` | `long` | Total historian read queries attempted since startup (raw + aggregate + at-time + events) | -| `QuerySuccesses` | `long` | Queries that completed without an exception | -| `QueryFailures` | `long` | Queries that raised an exception — each failure also triggers the plugin's reconnect path | -| `ConsecutiveFailures` | `int` | Failures since the last success. Resets to zero on any successful query. Drives the `Degraded` health rule at threshold 3 | -| `LastSuccessTime` | `DateTime?` | UTC timestamp of the most recent successful query, or `null` when no query has succeeded since startup | -| `LastFailureTime` | `DateTime?` | UTC timestamp of the most recent failure | -| `LastQueryError` | `string?` | Exception message from the most recent failure. Prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call failed | -| `ProcessConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **process** silo (historical value queries — `ReadRaw`, `ReadAggregate`, `ReadAtTime`). 
See [Two SDK connection silos](HistoricalDataAccess.md#two-sdk-connection-silos) | -| `EventConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **event** silo (alarm history queries — `ReadEvents`). Separate from the process connection because the SDK requires distinct query channels | -| `ActiveProcessNode` | `string?` | Cluster node currently serving the process silo, or `null` when no process connection is open | -| `ActiveEventNode` | `string?` | Cluster node currently serving the event silo, or `null` when no event connection is open | -| `NodeCount` | `int` | Total configured historian cluster nodes. 1 for a legacy single-node deployment | -| `HealthyNodeCount` | `int` | Nodes currently eligible for new connections (not in failure cooldown) | -| `Nodes` | `List` | Per-node cluster state in configuration order. Each entry carries `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime` | - -The operator dashboard renders a cluster table inside the Historian panel when `NodeCount > 1`. Legacy single-node deployments render a compact `Node: ` line and no table. Panel color reflects combined load-time + runtime health: green when everything is fine, yellow when any cluster node is in cooldown or 1-4 consecutive query failures are accumulated, red when the plugin is unloaded / all cluster nodes are failed / 5+ consecutive failures. - -### Alarms - -`AlarmStatusInfo` -- surfaces alarm-condition tracking health and dispatch counters. 
- -| Field | Type | Description | -|-------|------|-------------| -| `TrackingEnabled` | `bool` | Whether `OpcUa.AlarmTrackingEnabled` is set in configuration | -| `ConditionCount` | `int` | Number of distinct alarm conditions currently tracked | -| `ActiveAlarmCount` | `int` | Number of alarms currently in the `InAlarm=true` state | -| `TransitionCount` | `long` | Total `InAlarm` transitions observed in the dispatch loop since startup | -| `AckEventCount` | `long` | Total alarm acknowledgement transitions observed since startup | -| `AckWriteFailures` | `long` | Total MXAccess AckMsg writes that have failed while processing alarm acknowledges. Any non-zero value latches the service into Degraded (see Rule 2d). | -| `FilterEnabled` | `bool` | Whether `OpcUa.AlarmFilter.ObjectFilters` has any patterns configured | -| `FilterPatternCount` | `int` | Number of compiled filter patterns (after comma-splitting and trimming) | -| `FilterIncludedObjectCount` | `int` | Number of Galaxy objects included by the filter during the most recent address-space build. Zero when the filter is disabled. | - -When the filter is active, the operator dashboard's Alarms panel renders an extra line `Filter: N pattern(s), M object(s) included` so operators can verify scope at a glance. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the matching rules and resolution algorithm. - -### Redundancy - -`RedundancyInfo` -- only populated when `Redundancy.Enabled=true` in configuration. Shows mode, role, computed service level, application URI, and the set of peer server URIs. See [Redundancy](Redundancy.md) for the full guide. - -### Footer - -| Field | Type | Description | -|-------|------|-------------| -| `Timestamp` | `DateTime` | UTC time when the snapshot was generated | -| `Version` | `string` | Service assembly version | - -## `/api/health` Payload - -The health endpoint returns a `HealthEndpointData` document distinct from the full dashboard snapshot. 
It is designed for load balancers and external monitoring probes that only need an up/down signal plus component-level detail: - -| Field | Type | Description | -|-------|------|-------------| -| `Status` | `string` | `Healthy`, `Degraded`, or `Unhealthy` (drives the HTTP status code) | -| `ServiceLevel` | `byte` | OPC UA-style 0-255 service level. 255 when healthy non-redundant; 0 when MXAccess is down; redundancy-adjusted otherwise | -| `RedundancyEnabled` | `bool` | Whether redundancy is configured | -| `RedundancyRole` | `string?` | `Primary` or `Secondary` when redundancy is enabled; `null` otherwise | -| `RedundancyMode` | `string?` | `Warm` or `Hot` when redundancy is enabled; `null` otherwise | -| `Components.MxAccess` | `string` | `Connected` or `Disconnected` | -| `Components.Database` | `string` | `Connected` or `Disconnected` | -| `Components.OpcUaServer` | `string` | `Running` or `Stopped` | -| `Components.Historian` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` -- matches `HistorianStatusInfo.PluginStatus` | -| `Components.Alarms` | `string` | `Disabled` or `Enabled` -- mirrors `OpcUa.AlarmTrackingEnabled` | -| `Uptime` | `string` | Formatted service uptime (e.g., `3d 5h 20m`) | -| `Timestamp` | `DateTime` | UTC time the snapshot was generated | - -Monitoring tools should: - -- Alert on `Status=Unhealthy` (HTTP 503) for hard outages. -- Alert on `Status=Degraded` (HTTP 200) for latched or cumulative failures -- a degraded status means the server is still operating but a subsystem needs attention (historian plugin missing, alarm ack writes failing, history read error rate too high, etc.). - -## HTML Dashboards - -### `/` -- Operator dashboard - -Monospace, dark background, color-coded panels. Panels: Connection, Health, Redundancy (when enabled), Subscriptions, Data Change Dispatch, Galaxy Info, **Historian**, **Alarms**, Operations (table), Footer. Each panel border color reflects component state (green, yellow, red, or gray). 
- -The page includes a `<meta http-equiv="refresh">` tag set to the configured `RefreshIntervalSeconds` (default 10 seconds), so the browser polls automatically without JavaScript. - -### `/health` -- Focused health view - -Large status badge, computed `ServiceLevel` value, redundancy summary (when enabled), and a row of component cards: MXAccess, Galaxy Database, OPC UA Server, **Historian**, **Alarm Tracking**. Each card turns red when its component is in a failure state and grey when disabled. Best for wallboards and quick at-a-glance monitoring. - -## Configuration - -The dashboard is configured through the `Dashboard` section in `appsettings.json`: - -```json -{ - "Dashboard": { - "Enabled": true, - "Port": 8081, - "RefreshIntervalSeconds": 10 - } -} -``` - -Setting `Enabled` to `false` prevents the `StatusWebServer` from starting. The `StatusReportService` is still created so that other components can query health programmatically, but no HTTP listener is opened. - -### Dashboard start failures are non-fatal - -If the dashboard is enabled but the configured port is already bound (e.g., a previous instance did not clean up, another service is squatting on the port, or the user lacks URL-reservation rights), `StatusWebServer.Start()` logs the listener exception at Error level and returns `false`. `OpcUaService` then logs a Warning, disposes the unstarted instance, sets `DashboardStartFailed = true`, and continues in degraded mode — the OPC UA endpoint still starts. Operators can detect the failure by searching the service log for: - -``` -[WRN] Status dashboard failed to bind on port {Port}; service continues without dashboard -``` - -Stability review 2026-04-13 Finding 2. - -## Component Wiring - -`StatusReportService` is initialized after all other service components are created.
`OpcUaService.Start()` calls `SetComponents()` to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b: - -```csharp -StatusReportInstance.SetComponents( - effectiveMxClient, - Metrics, - GalaxyStatsInstance, - ServerHost, - NodeManagerInstance, - _config.Redundancy, - _config.OpcUa.ApplicationUri, - _config.Historian); -``` - -This deferred wiring allows the report service to be constructed before the MXAccess client or node manager are fully initialized. If a component is `null`, the report service falls back to default values (e.g., `ConnectionState.Disconnected`, zero counts, `HistorianPluginStatus.Disabled`). - -The historian plugin status is sourced from `HistorianPluginLoader.LastOutcome`, which is updated on every load attempt. `OpcUaService` explicitly calls `HistorianPluginLoader.MarkDisabled()` when `Historian.Enabled=false` so the dashboard can distinguish "feature off" from "load failed" without ambiguity. +See [`docs/v2/admin-ui.md`](v2/admin-ui.md) for the current operator surface and [`docs/ServiceHosting.md`](ServiceHosting.md) for the three-process layout. diff --git a/docs/Subscriptions.md b/docs/Subscriptions.md index de7b2d3..19505f4 100644 --- a/docs/Subscriptions.md +++ b/docs/Subscriptions.md @@ -1,135 +1,60 @@ # Subscriptions -`LmxNodeManager` bridges OPC UA monitored items to MXAccess runtime subscriptions using reference counting and a decoupled dispatch architecture. This design ensures that MXAccess COM callbacks (which run on the STA thread) never contend with the OPC UA framework lock. +Driver-side data-change subscriptions live behind `ISubscribable` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ISubscribable.cs`). 
The interface is deliberately mechanism-agnostic: it covers native subscriptions (Galaxy MXAccess advisory, OPC UA monitored items on an upstream server, TwinCAT ADS notifications) and driver-internal polled subscriptions (Modbus, AB CIP, S7, FOCAS). Core sees the same event shape regardless — drivers fire `OnDataChange` and Core dispatches to the matching OPC UA monitored items. -## Ref-Counted MXAccess Subscriptions - -Multiple OPC UA clients can subscribe to the same Galaxy tag simultaneously. Rather than opening duplicate MXAccess subscriptions, `LmxNodeManager` maintains a reference count per tag in `_subscriptionRefCounts`. - -### SubscribeTag - -`SubscribeTag` increments the reference count for a tag reference. On the first subscription (count goes from 0 to 1), it calls `_mxAccessClient.SubscribeAsync` to open the MXAccess runtime subscription: +## ISubscribable surface ```csharp -internal void SubscribeTag(string fullTagReference) -{ - lock (_lock) - { - if (_subscriptionRefCounts.TryGetValue(fullTagReference, out var count)) - _subscriptionRefCounts[fullTagReference] = count + 1; - else - { - _subscriptionRefCounts[fullTagReference] = 1; - _ = _mxAccessClient.SubscribeAsync(fullTagReference, (_, _) => { }); - } - } -} +Task<ISubscriptionHandle> SubscribeAsync( + IReadOnlyList<string> fullReferences, + TimeSpan publishingInterval, + CancellationToken cancellationToken); + +Task UnsubscribeAsync(ISubscriptionHandle handle, CancellationToken cancellationToken); + +event EventHandler<DataChangeEventArgs>? OnDataChange; ``` -### UnsubscribeTag +A single `SubscribeAsync` call may batch many attributes and returns an opaque handle the caller passes back to `UnsubscribeAsync`. The driver may emit an immediate `OnDataChange` for each subscribed reference (the OPC UA initial-data convention) and then a push per change. -`UnsubscribeTag` decrements the reference count.
When the count reaches zero, the MXAccess subscription is closed via `UnsubscribeAsync` and the tag is removed from the dictionary: +Every subscribe / unsubscribe call goes through `CapabilityInvoker.ExecuteAsync(DriverCapability.Subscribe, host, …)` so the per-host pipeline applies. -```csharp -if (count <= 1) -{ - _subscriptionRefCounts.Remove(fullTagReference); - _ = _mxAccessClient.UnsubscribeAsync(fullTagReference); -} -else - _subscriptionRefCounts[fullTagReference] = count - 1; -``` +## Reference counting at Core -Both methods use `lock (_lock)` (a private object, distinct from the OPC UA framework `Lock`) to serialize ref-count updates without blocking node value dispatches. +Multiple OPC UA clients can monitor the same variable simultaneously. Rather than open duplicate driver subscriptions, Core maintains a ref-count per `(driver, fullReference)` pair: the first OPC UA monitored-item for a reference triggers `ISubscribable.SubscribeAsync` with that single reference; each additional monitored-item just increments the count; decrement-to-zero triggers `UnsubscribeAsync`. Transferred subscriptions (client reconnect → resume session) replay against the same ref-count map so active driver subscriptions are preserved across session migration. -## OnMonitoredItemCreated +## Threading -The OPC UA framework calls `OnMonitoredItemCreated` when a client creates a monitored item. 
The override resolves the node handle to a tag reference and calls `SubscribeTag`, which opens the MXAccess subscription early so runtime values start arriving before the first publish cycle: +The STA thread story is now driver-specific, not a server-wide concern: -```csharp -protected override void OnMonitoredItemCreated(ServerSystemContext context, - NodeHandle handle, MonitoredItem monitoredItem) -{ - base.OnMonitoredItemCreated(context, handle, monitoredItem); - var nodeIdStr = handle?.NodeId?.Identifier as string; - if (nodeIdStr != null && _nodeIdToTagReference.TryGetValue(nodeIdStr, out var tagRef)) - SubscribeTag(tagRef); -} -``` +- **Galaxy** runs its MXAccess COM objects on a dedicated STA thread with a Win32 message pump (`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Sta/StaPump.cs`) inside the standalone `Driver.Galaxy.Host` Windows service. The Proxy driver (`Driver.Galaxy.Proxy`) connects to the Host via named pipe and re-exposes the data on a free-threaded surface to Core. Core never touches COM. +- **Modbus / S7 / AB CIP / AB Legacy / TwinCAT / FOCAS** are free-threaded — they run their polling loops on ordinary `Task`s. Their `OnDataChange` fires on thread-pool threads. +- **OPC UA Client** delegates to the OPC Foundation stack's subscription loop. -`OnDeleteMonitoredItemsComplete` performs the inverse, calling `UnsubscribeTag` for each deleted monitored item. +The common contract: drivers are responsible for marshalling from whatever native thread the backend uses onto thread-pool threads before raising `OnDataChange`. Core's dispatch path acquires the OPC UA framework `Lock` and calls `ClearChangeMasks` on the corresponding `BaseDataVariableState` to notify subscribed clients. -## Data Change Dispatch Queue +## Dispatch -MXAccess delivers data change callbacks on the STA thread via the `OnTagValueChanged` event. 
These callbacks must not acquire the OPC UA framework `Lock` directly because the lock is also held during `Read`/`Write` operations that call into MXAccess (creating a potential deadlock with the STA thread). The solution is a `ConcurrentDictionary<string, Vtq>` named `_pendingDataChanges` that decouples the two threads. +Core's subscription dispatch path: -### Callback handler +1. `ISubscribable.OnDataChange` fires on a thread-pool thread with a `DataChangeEventArgs(subscriptionHandle, fullReference, DataValueSnapshot)`. +2. Core looks up the variable by `fullReference` in the driver's `DriverNodeManager` variable map. +3. Under the OPC UA framework `Lock`, the variable's `Value` / `StatusCode` / `Timestamp` are updated and `ClearChangeMasks(SystemContext, false)` is called. +4. The OPC Foundation stack then enqueues data-change notifications for every monitored-item attached to that variable, honoring each subscription's sampling + filter configuration. -`OnMxAccessDataChange` runs on the STA thread. It stores the latest value in the concurrent dictionary (coalescing rapid updates for the same tag) and signals the dispatch thread: +Batch coalescing — coalescing multiple pushes for the same reference between publish cycles — is done driver-side when the backend natively supports it (Galaxy keeps the v1 coalescing dictionary); otherwise the SDK's own data-change filter suppresses no-change notifications. -```csharp -private void OnMxAccessDataChange(string address, Vtq vtq) -{ - Interlocked.Increment(ref _totalMxChangeEvents); - _pendingDataChanges[address] = vtq; - _dataChangeSignal.Set(); -} -``` +## Initial values -### Dispatch thread architecture +A freshly-built variable carries `StatusCode = BadWaitingForInitialData` until the driver delivers the first value. Drivers whose backends supply an initial read (Galaxy `AdviseSupervisory`, TwinCAT `AddDeviceNotification`) fire `OnDataChange` immediately after `SubscribeAsync` returns.
Polled drivers fire the first push when their first poll cycle completes. -A dedicated background thread (`OpcUaDataChangeDispatch`) runs `DispatchLoop`, which waits on an `AutoResetEvent` with a 100ms timeout. The decoupled design exists for two reasons: +## Transferred subscription restoration -1. **Deadlock avoidance** -- The STA thread must not acquire the OPC UA `Lock`. The dispatch thread is a normal background thread that can safely acquire `Lock`. -2. **Batch coalescing** -- Multiple MXAccess callbacks for the same tag between dispatch cycles are collapsed to the latest value via dictionary key overwrite. Under high load, this reduces the number of `ClearChangeMasks` calls. +When an OPC UA session is resumed (client reconnect with `TransferSubscriptions`), Core walks the transferred monitored-items and ensures every referenced `(driver, fullReference)` has a live driver subscription. References already active (in-process migration) skip re-subscribing; references that lost their driver-side handle during the session gap are re-subscribed via `SubscribeAsync`. -The dispatch loop processes changes in two phases: +## Key source files -**Phase 1 (outside Lock):** Drain keys from `_pendingDataChanges`, convert each `Vtq` to a `DataValue` via `CreatePublishedDataValue`, and collect alarm transition events. MXAccess reads for alarm Priority and DescAttrName values also happen in this phase, since they call back into the STA thread. - -**Phase 2 (inside Lock):** Apply all prepared updates to variable nodes and call `ClearChangeMasks` on each to trigger OPC UA data change notifications. Alarm events are reported in this same lock scope. 
- -```csharp -lock (Lock) -{ - foreach (var (variable, dataValue) in updates) - { - variable.Value = dataValue.Value; - variable.StatusCode = dataValue.StatusCode; - variable.Timestamp = dataValue.SourceTimestamp; - variable.ClearChangeMasks(SystemContext, false); - } -} -``` - -### ClearChangeMasks - -`ClearChangeMasks(SystemContext, false)` is the mechanism that notifies the OPC UA framework a node's value has changed. The framework uses change masks internally to track which nodes have pending notifications for active monitored items. Calling this method causes the server to enqueue data change notifications for all monitoring clients of that node. The `false` parameter indicates that child nodes should not be recursively cleared. - -## Transferred Subscription Restoration - -When OPC UA sessions are transferred (e.g., client reconnects and resumes a previous session), the framework calls `OnMonitoredItemsTransferred`. The override collects the tag references for all transferred items and calls `RestoreTransferredSubscriptions`. - -`RestoreTransferredSubscriptions` groups the tag references by count and, for each tag that does not already have an active ref-count entry, opens a new MXAccess subscription and sets the initial reference count: - -```csharp -internal void RestoreTransferredSubscriptions(IEnumerable<string> fullTagReferences) -{ - var transferredCounts = fullTagReferences - .GroupBy(tagRef => tagRef, StringComparer.OrdinalIgnoreCase) - .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase); - - foreach (var kvp in transferredCounts) - { - lock (_lock) - { - if (_subscriptionRefCounts.ContainsKey(kvp.Key)) - continue; - _subscriptionRefCounts[kvp.Key] = kvp.Value; - } - _ = _mxAccessClient.SubscribeAsync(kvp.Key, (_, _) => { }); - } -} -``` - -Tags that already have in-memory bookkeeping are skipped to avoid double-counting when the transfer happens within the same server process (normal in-process session migration).
+- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ISubscribable.cs` — capability contract +- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs` — pipeline wrapping +- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Sta/StaPump.cs` — Galaxy STA thread + message pump +- Per-driver subscribe implementations in each `Driver.*` project diff --git a/docs/GalaxyRepository.md b/docs/drivers/Galaxy-Repository.md similarity index 51% rename from docs/GalaxyRepository.md rename to docs/drivers/Galaxy-Repository.md index 30be885..381ca85 100644 --- a/docs/GalaxyRepository.md +++ b/docs/drivers/Galaxy-Repository.md @@ -1,6 +1,19 @@ -# Galaxy Repository +# Galaxy Repository — Tag Discovery for the Galaxy Driver -`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. This data drives the construction of the OPC UA address space. +`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. It is the Galaxy driver's implementation of **`ITagDiscovery.DiscoverAsync`** — every driver has its own discovery source, and the Galaxy driver's is a direct SQL query against the Galaxy Repository (the `ZB` database). 
Other drivers use completely different mechanisms: + +| Driver | `ITagDiscovery` source | +|--------|------------------------| +| Galaxy | ZB SQL hierarchy + attribute queries (this doc) | +| AB CIP | `@tags` walker against the PLC controller | +| AB Legacy | Data-table scan via PCCC `LogicalRead` on the PLC | +| TwinCAT | Beckhoff `SymbolLoaderFactory` — uploads the full symbol tree from the ADS runtime | +| S7 | Config-DB enumeration (no native symbol upload for S7comm) | +| Modbus | Config-DB enumeration (flat register map, user-authored) | +| FOCAS | CNC queries (`cnc_rdaxisname`, `cnc_rdmacroinfo`, …) + optional Config-DB overlays | +| OPC UA Client | `Session.Browse` against the remote server | + +`GalaxyRepositoryService` lives in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/` — Host-side, .NET Framework 4.8 x86, same process that owns the MXAccess COM objects. The Proxy forwards discovery over IPC the same way it forwards reads and writes. ## Connection Configuration @@ -19,7 +32,7 @@ The connection uses Windows Authentication because the Galaxy Repository databas ## SQL Queries -All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used. +All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used. Project convention `GR-006` requires `const string` SQL queries; any new query must be added as a named constant rather than built at runtime. ### Hierarchy query @@ -31,9 +44,9 @@ Returns deployed Galaxy objects with their parent relationships, browse names, a - Marks objects with `category_id = 13` as areas - Filters to `is_template = 0` (instances only, not templates) - Filters to `deployed_package_id <> 0` (deployed objects only) -- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). 
Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter). -- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `.ScanState` probe advised. Also used by `LmxNodeManager.BuildHostedVariablesMap` to identify Platform/Engine ancestors during the hosted-variables walk. -- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The node manager walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [MXAccess Bridge — Per-Host Runtime Status Probes](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate). +- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). 
Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](../AlarmTracking.md#template-based-alarm-object-filter). +- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `.ScanState` probe advised. Also used during the hosted-variables walk to identify Platform/Engine ancestors. +- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The driver walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [Galaxy driver — Per-Host Runtime Status Probes](Galaxy.md#per-host-runtime-status-probes-hostscanstate). ### Attributes query (standard) @@ -53,8 +66,8 @@ Returns user-defined dynamic attributes for deployed objects: When `ExtendedAttributes = true`, a more comprehensive query runs that unions two sources: -1. 
**Primitive attributes** -- Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable. -2. **Dynamic attributes** -- The same CTE-based query as the standard path, with an empty `primitive_name`. +1. **Primitive attributes** — Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable. +2. **Dynamic attributes** — The same CTE-based query as the standard path, with an empty `primitive_name`. The `full_tag_reference` for primitive attributes follows the pattern `tag_name.primitive_name.attribute_name` (e.g., `TestMachine_001.AlarmAttr.InAlarm`). @@ -66,10 +79,10 @@ A single-column query: `SELECT time_of_last_deploy FROM galaxy`. The `galaxy` ta The Galaxy maintains two package references for each object: -- `checked_in_package_id` -- The latest saved version, which may include undeployed configuration changes -- `deployed_package_id` -- The version currently running on the target platform +- `checked_in_package_id` — the latest saved version, which may include undeployed configuration changes +- `deployed_package_id` — the version currently running on the target platform -The queries filter on `deployed_package_id <> 0` because the OPC UA server must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime. +The queries filter on `deployed_package_id <> 0` because the OPC UA address space must mirror what is actually running in the Galaxy runtime. 
Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime. ## Platform Scope Filter @@ -77,21 +90,16 @@ When `Scope` is set to `LocalPlatform`, the repository applies a post-query C# f ### How it works -1. **Platform lookup** -- A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load. - -2. **Platform matching** -- The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms, and the address space is empty. - -3. **Host chain collection** -- The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. This produces the full set of host gobject_ids under the local platform. - -4. **Object inclusion** -- All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves. - -5. **Area retention** -- `ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded. - -6. **Attribute filtering** -- The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope. +1. **Platform lookup** — A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load. +2. 
**Platform matching** — The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms and the address space is empty. +3. **Host chain collection** — The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. This produces the full set of host gobject_ids under the local platform. +4. **Object inclusion** — All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves. +5. **Area retention** — `ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded. +6. **Attribute filtering** — The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope. ### Design rationale -The filter is applied in C# rather than SQL because the project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query. +The filter is applied in C# rather than SQL because project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query. 
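The filtering steps above can be sketched in a few lines of C#. This is a minimal illustration, not the actual `PlatformScopeFilter`: the `PlatformRow` and `HierarchyRow` record shapes, the `IsArea` flag, and the `ScopeFilterSketch` name are assumptions made for the example, and it is written against a current .NET SDK for brevity rather than the Host's .NET Framework target.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shapes; the real DTOs in the repository may differ.
public sealed record PlatformRow(int GobjectId, string NodeName);
public sealed record HierarchyRow(
    int GobjectId, int ParentGobjectId, int HostedByGobjectId, int CategoryId, bool IsArea);

public static class ScopeFilterSketch
{
    private const int AppEngineCategory = 3; // $AppEngine, per the steps above

    public static List<HierarchyRow> Filter(
        IReadOnlyList<HierarchyRow> rows, IReadOnlyList<PlatformRow> platforms, string? platformName)
    {
        // Platform matching: case-insensitive, falling back to the local machine name.
        var target = platformName ?? Environment.MachineName;
        var platform = platforms.FirstOrDefault(
            p => string.Equals(p.NodeName, target, StringComparison.OrdinalIgnoreCase));
        if (platform is null) return new List<HierarchyRow>(); // no match: empty address space

        // Host chain collection: the platform plus every AppEngine it hosts.
        var hosts = new HashSet<int> { platform.GobjectId };
        foreach (var r in rows)
            if (r.CategoryId == AppEngineCategory && r.HostedByGobjectId == platform.GobjectId)
                hosts.Add(r.GobjectId);

        // Object inclusion: the hosts themselves plus non-area objects hosted by them.
        var included = new HashSet<int>();
        foreach (var r in rows)
            if (!r.IsArea && (hosts.Contains(r.GobjectId) || hosts.Contains(r.HostedByGobjectId)))
                included.Add(r.GobjectId);

        // Area retention: walk parent chains upward so ancestors stay in the browse tree.
        // HashSet.Add returns false for an already-visited parent, which ends the walk.
        var byId = rows.ToDictionary(r => r.GobjectId);
        foreach (var id in included.ToList())
        {
            var current = byId[id];
            while (byId.TryGetValue(current.ParentGobjectId, out var parent) && included.Add(parent.GobjectId))
                current = parent;
        }

        return rows.Where(r => included.Contains(r.GobjectId)).ToList();
    }
}
```

Because everything the sketch needs is already on the hierarchy rows, no extra SQL beyond the platform lookup is required, which is the point the design rationale makes.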
### Configuration @@ -102,7 +110,7 @@ The filter is applied in C# rather than SQL because the project convention `GR-0 } ``` -- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything, backward compatible). +- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything). - Set `PlatformName` to an explicit hostname to target a specific platform, or leave null to use the local machine name. ### Startup log @@ -119,25 +127,26 @@ GetAttributesAsync returned 4206 attributes (extended=true) Scope filter retained 2100 of 4206 attributes ``` -## Change Detection Polling +## Change Detection Polling and IRediscoverable -`ChangeDetectionService` runs a background polling loop that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value: +`ChangeDetectionService` runs a background polling loop in the Host process that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value: - On the first poll (no previous state), the timestamp is recorded and `OnGalaxyChanged` fires unconditionally - On subsequent polls, `OnGalaxyChanged` fires only when `time_of_last_deploy` differs from the cached value -When the event fires, the host service queries fresh hierarchy and attribute data from the repository and calls `LmxNodeManager.RebuildAddressSpace` (which delegates to incremental `SyncAddressSpace`). +When the event fires, the Host re-runs the hierarchy and attribute queries and pushes the result back to the Server via an IPC `RediscoveryNeeded` message. That surfaces on `GalaxyProxyDriver` as the **`IRediscoverable.OnRediscoveryNeeded`** event; the Server's `DriverNodeManager` consumes it and calls `SyncAddressSpace` to compute the diff against the live address space. The polling approach is used because the Galaxy Repository database does not provide change notifications. 
The `galaxy.time_of_last_deploy` column updates only on completed deployments, so the polling interval controls how quickly the OPC UA address space reflects Galaxy changes. ## TestConnection -`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at service startup to verify connectivity before attempting the full hierarchy query. +`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at Host startup to verify connectivity before attempting the full hierarchy query. ## Key source files -- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/GalaxyRepositoryService.cs` -- SQL queries and data access -- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/PlatformScopeFilter.cs` -- Platform-based hierarchy and attribute filtering -- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs` -- Deploy timestamp polling loop -- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/GalaxyRepositoryConfiguration.cs` -- Connection, polling, and scope settings -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/PlatformInfo.cs` -- Platform-to-hostname DTO +- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/GalaxyRepositoryService.cs` — SQL queries and data access +- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/PlatformScopeFilter.cs` — Platform-based hierarchy and attribute filtering +- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/ChangeDetectionService.cs` — Deploy timestamp polling loop +- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Configuration/GalaxyRepositoryConfiguration.cs` — Connection, polling, and scope settings +- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Domain/PlatformInfo.cs` — Platform-to-hostname DTO +- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/Contracts/DiscoveryResponse.cs` — IPC DTO the Host uses to return hierarchy + attribute results across the pipe diff --git a/docs/drivers/Galaxy.md b/docs/drivers/Galaxy.md new file mode 100644 index 0000000..91d3039 --- 
/dev/null +++ b/docs/drivers/Galaxy.md @@ -0,0 +1,211 @@ +# Galaxy Driver + +The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); all other drivers run in-process in the main Server (.NET 10 x64). Galaxy is the exception — it runs as its own Windows service and talks to the Server over a local named pipe. + +For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md). + +## Project Split + +Galaxy ships as three projects: + +| Project | Target | Role | +|---------|--------|------| +| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides | +| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager | +| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe` — loaded in-process by the Server; every call forwards over the pipe to the Host | + +The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process. + +## Why Out-of-Process + +Two reasons drive the split, per `docs/v2/plan.md`: + +1. 
**Bitness constraint.** MXAccess is 32-bit COM only — `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit. +2. **Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md) — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host, guarded by a crash-loop circuit breaker. + +The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver. + +## IPC Transport + +`GalaxyProxyDriver` → `GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server. + +- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface) +- Wire format: MessagePack-CSharp, length-prefixed frames +- ACL: pipe is created with a DACL that grants only the Server's service identity; the Admins group is explicitly denied so a live-smoke test running from an elevated shell fails fast rather than silently bypassing the handshake +- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}` +- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host + +Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.)
serializes a `*Request`, awaits the matching `*Response` via a `CallAsync` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects. + +## STA Thread Requirement (Host-side) + +MXAccess COM objects — `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption. + +`StaComThread` in the Host provides that thread with the apartment state set before the thread starts: + +```csharp +_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true }; +_thread.SetApartmentState(ApartmentState.STA); +``` + +Work items queue via `RunAsync(Action)` or `RunAsync(Func)` into a `ConcurrentQueue` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread — including the IPC handler thread that receives the inbound pipe request. + +## Win32 Message Pump (Host-side) + +COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke: + +1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works) +2. `GetMessage` blocks until a message arrives +3. `WM_APP` drains the work queue +4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop +5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery + +Without this pump MXAccess callbacks never fire and the driver delivers no live data. + +## LMXProxyServer COM Object + +`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle: + +1. 
**`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle +2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject` + +## Register / AddItem / AdviseSupervisory Pattern + +Every MXAccess data operation follows a three-step pattern, all executed on the STA thread: + +1. **`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle +2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks +3. The runtime begins delivering `OnDataChange` events + +For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`. + +## OnDataChange and OnWriteComplete Callbacks + +### OnDataChange + +Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`: + +1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress` +2. Maps the MXAccess quality code to the internal `Quality` enum +3. Checks `MXSTATUS_PROXY` for error details and adjusts quality +4. Converts the timestamp to UTC +5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to: + - The stored per-tag subscription callback + - Any pending one-shot read completions + - The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`) + +### OnWriteComplete + +Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource` for the item handle. 
If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged. + +## Reconnection Logic + +`MxAccessClient` implements automatic reconnection through two mechanisms. + +### Monitor loop + +`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle: + +- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync` +- If connected and a probe tag is configured, it checks the probe staleness threshold + +### Reconnect sequence + +`ReconnectAsync` performs a full disconnect-then-connect cycle: + +1. Increment the reconnect counter +2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings +3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag + +Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each. + +## Probe Tag Health Monitoring + +A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data. + +## Per-Host Runtime Status Probes (`.ScanState`) + +Separate from the connection-level probe, the driver advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. 
These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down. + +Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields. + +### How it works + +`GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped): + +1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff. +2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**. +3. **On-change-only delivery** — `ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts. +4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. 
The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped". +5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1. + +### Subtree quality invalidation on transition + +When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value. + +The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables. + +### Read-path short-circuit (`IsTagUnderStoppedHost`) + +The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing. + +### Deferred dispatch — the STA deadlock + +**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. 
`MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern. + +The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement. + +### Dashboard and health surface + +- Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected) +- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped + +See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields. + +## Request Timeout Safety Backstop + +Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely. + +On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. 
This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation. + +`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3. + +All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time. + +## Why Marshal.ReleaseComObject Is Needed + +The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance. + +## Tag Discovery and Historical Data + +Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path. + +Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md). 
+ +## Key source files + +Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`): + +- `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump +- `Backend/MxAccess/MxAccessClient.cs` — Core client (partial) +- `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect +- `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay +- `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations +- `Backend/MxAccess/MxAccessClient.EventHandlers.cs` — `OnDataChange` / `OnWriteComplete` handlers +- `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor +- `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper +- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup +- `Backend/Historian/HistorianDataSource.cs` — `aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md)) +- `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch +- `Domain/IMxAccessClient.cs` — Client interface + +Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`): + +- `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …) +- `Contracts/*.cs` — MessagePack DTOs for every request/response pair + +Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`): + +- `GalaxyProxyDriver.cs` — `IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient` +- `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync`, reconnect on broken pipe +- `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch diff --git a/docs/drivers/README.md b/docs/drivers/README.md new file mode 100644 index 0000000..164ac03 --- /dev/null +++ 
b/docs/drivers/README.md @@ -0,0 +1,46 @@ +# Drivers + +OtOpcUa is a multi-driver OPC UA server. The Core (`ZB.MOM.WW.OtOpcUa.Core` + `Core.Abstractions` + `Server`) owns the OPC UA stack, address space, session/security/subscription machinery, resilience pipeline, and namespace kinds (Equipment + SystemPlatform). Drivers plug in through **capability interfaces** defined in `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/`: + +- `IDriver` — lifecycle (`InitializeAsync`, `ReinitializeAsync`, `ShutdownAsync`, `GetHealth`) +- `IReadable` / `IWritable` — one-shot reads and writes +- `ITagDiscovery` — address-space enumeration +- `ISubscribable` — driver-pushed data-change streams +- `IHostConnectivityProbe` — per-host reachability events +- `IPerCallHostResolver` — multi-host drivers that route each call to a target endpoint at dispatch time +- `IAlarmSource` — driver-emitted OPC UA A&C events +- `IHistoryProvider` — raw / processed / at-time / events HistoryRead (see [HistoricalDataAccess.md](../HistoricalDataAccess.md)) +- `IRediscoverable` — driver-initiated address-space rebuild notifications + +Each driver opts into only the capabilities it supports. Every async capability call at the Server dispatch layer goes through `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`), which wraps it in a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. The `OTOPCUA0001` analyzer enforces the wrap at build time. Drivers themselves never depend on Polly; they just implement the capability interface and let the Core wrap it. + +Driver type metadata is registered at startup in `DriverTypeRegistry` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTypeRegistry.cs`). The registry records each type's allowed namespace kinds (`Equipment` / `SystemPlatform` / `Simulated`), its JSON Schema for `DriverConfig` / `DeviceConfig` / `TagConfig` columns, and its stability tier per [docs/v2/driver-stability.md](../v2/driver-stability.md). 
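The opt-in model described above can be illustrated with a small sketch. The interfaces here are deliberately simplified stand-ins with a `Sketch` suffix; the real `IDriver`, `IReadable`, and `IWritable` in `Core.Abstractions` carry richer signatures that this document does not list, and the dispatch helper is a hypothetical illustration rather than the Server's actual `CapabilityInvoker`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Simplified stand-ins for the capability interfaces (assumed shapes).
public interface IDriverSketch { string DriverInstanceId { get; } }
public interface IReadableSketch { Task<object?> ReadAsync(string tagRef); }
public interface IWritableSketch { Task WriteAsync(string tagRef, object? value); }

// A read-only driver opts into reads and simply omits the write capability.
public sealed class ReadOnlyDriverSketch : IDriverSketch, IReadableSketch
{
    private readonly Dictionary<string, object?> _values = new() { ["Line1.Speed"] = 42 };

    public string DriverInstanceId => "demo-1";

    public Task<object?> ReadAsync(string tagRef) =>
        Task.FromResult(_values.TryGetValue(tagRef, out var v) ? v : null);
}

public static class DispatchSketch
{
    // The dispatch layer probes for the capability instead of assuming it.
    public static async Task<string> TryWriteAsync(IDriverSketch driver, string tagRef, object? value)
    {
        if (driver is not IWritableSketch writable)
            return "unsupported";

        await writable.WriteAsync(tagRef, value);
        return "ok";
    }
}
```

Keeping each capability on its own interface means a driver that cannot write never has to stub out a write method, and the dispatch layer can tell "unsupported" apart from "failed".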
+ +## Ground-truth driver list + +| Driver | Project path | Tier | Wire / library | Capabilities | Notable quirk | +|--------|--------------|:----:|----------------|--------------|---------------| +| [Galaxy](Galaxy.md) | `Driver.Galaxy.{Shared, Host, Proxy}` | C | MXAccess COM + `aahClientManaged` + SqlClient | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe | Out-of-process — Host is its own Windows service (.NET 4.8 x86 for the COM bitness constraint); Proxy talks to Host over a named pipe | +| Modbus TCP | `Driver.Modbus` | A | NModbus-derived in-house client | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Polled subscriptions via the shared `PollGroupEngine`. DL205 PLCs are covered by `AddressFormat=DL205` (octal V/X/Y/C/T/CT translation) — no separate driver | +| Siemens S7 | `Driver.S7` | A | [S7netplus](https://github.com/S7NetPlus/s7netplus) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Single S7netplus `Plc` instance per PLC serialized with `SemaphoreSlim` — the S7 CPU's comm mailbox is scanned at most once per cycle, so parallel reads don't help | +| AB CIP | `Driver.AbCip` | A | libplctag CIP | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | ControlLogix / CompactLogix. Tag discovery uses the `@tags` walker to enumerate controller-scoped + program-scoped symbols; UDT member resolution via the UDT template reader | +| AB Legacy | `Driver.AbLegacy` | A | libplctag PCCC | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | SLC 500 / MicroLogix. 
File-based addressing (`N7:0`, `F8:0`) — no symbol table, tag list is user-authored in the config DB | +| TwinCAT | `Driver.TwinCAT` | B | Beckhoff `TwinCAT.Ads` (`TcAdsClient`) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | The only native-notification driver outside Galaxy — ADS delivers `ValueChangedCallback` events the driver forwards straight to `ISubscribable.OnDataChange` without polling. Symbol tree uploaded via `SymbolLoaderFactory` | +| FOCAS | `Driver.FOCAS` | C | FANUC FOCAS2 (`Fwlib32.dll` P/Invoke) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | Tier C — FOCAS DLL has crash modes that warrant process isolation. CNC-shaped data model (axes, spindle, PMC, macros, alarms) not a flat tag map | +| OPC UA Client | `Driver.OpcUaClient` | B | OPCFoundation `Opc.Ua.Client` | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe | Gateway/aggregation driver. Opens a single `Session` against a remote OPC UA server and re-exposes its address space. Owns its own `ApplicationConfiguration` (distinct from `Client.Shared`) because it's always-on with keep-alive + `TransferSubscriptions` across SDK reconnect, not an interactive CLI | + +## Per-driver documentation + +- **Galaxy** has its own docs in this folder because the out-of-process architecture + MXAccess COM rules + Galaxy Repository SQL + Historian + runtime probe manager don't fit a single table row: + - [Galaxy.md](Galaxy.md) — COM bridge, STA pump, IPC, runtime probes + - [Galaxy-Repository.md](Galaxy-Repository.md) — ZB SQL reader, `LocalPlatform` scope filter, change detection + +- **All other drivers** share a single per-driver specification in [docs/v2/driver-specs.md](../v2/driver-specs.md) — addressing, data-type maps, connection settings, and quirks live there. 
That file is the authoritative per-driver reference; this index points at it rather than duplicating. + +## Related cross-driver docs + +- [HistoricalDataAccess.md](../HistoricalDataAccess.md) — `IHistoryProvider` dispatch, aggregate mapping, continuation points. The Galaxy driver's Aveva Historian implementation is the first; OPC UA Client forwards to the upstream server; other drivers do not implement the interface and return `BadHistoryOperationUnsupported`. +- [AlarmTracking.md](../AlarmTracking.md) — `IAlarmSource` event model and filtering. +- [Subscriptions.md](../Subscriptions.md) — how the Server multiplexes subscriptions onto `ISubscribable.OnDataChange`. +- [docs/v2/driver-stability.md](../v2/driver-stability.md) — tier system (A / B / C), shared `CapabilityPolicy` defaults per tier × capability, `MemoryTracking` hybrid formula, and process-level recycle rules. +- [docs/v2/plan.md](../v2/plan.md) — authoritative vision, architecture decisions, migration strategy. diff --git a/docs/reqs/ClientRequirements.md b/docs/reqs/ClientRequirements.md index c9bd0ab..277469a 100644 --- a/docs/reqs/ClientRequirements.md +++ b/docs/reqs/ClientRequirements.md @@ -1,8 +1,10 @@ # OPC UA Client Requirements -## Overview +> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). The Client surface (shared library + CLI + UI) shipped for v2 is preserved; this refresh restructures the document into numbered, directly-verifiable requirements (CLI-* and UI-* prefixes) layered on top of the existing detailed design content. Requirement coverage added for the `redundancy` command, alarm subscribe/ack round-trip, history-read, and UI tree-browser drag-to-subscribe behaviors. Original design-spec material for `ConnectionSettings`, `IOpcUaClientService`, models, and view-models is retained as reference-level details below the numbered requirements. 
-Three new .NET 10 cross-platform projects providing a shared OPC UA client library, a CLI tool, and an Avalonia desktop UI. All projects target Windows and macOS.
+Parent: [HLR-001](HighLevelReqs.md#hlr-001-opc-ua-server), [HLR-009](HighLevelReqs.md#hlr-009-transport-security-and-authentication), [HLR-013](HighLevelReqs.md#hlr-013-cluster-redundancy)
+
+See also: `docs/Client.CLI.md`, `docs/Client.UI.md`.
 
 ## Projects
 
@@ -10,134 +12,161 @@ Three new .NET 10 cross-platform projects providing a shared OPC UA client libra
 |---------|------|---------|
 | `ZB.MOM.WW.OtOpcUa.Client.Shared` | Class library | Core OPC UA client, models, interfaces |
 | `ZB.MOM.WW.OtOpcUa.Client.CLI` | Console app | Command-line interface using CliFx |
-| `ZB.MOM.WW.OtOpcUa.Client.UI` | Avalonia app | Desktop UI with tree browser, subscriptions, alarms |
-| `ZB.MOM.WW.OtOpcUa.Client.Shared.Tests` | Test project | Unit tests for shared library |
-| `ZB.MOM.WW.OtOpcUa.Client.CLI.Tests` | Test project | Unit tests for CLI commands |
-| `ZB.MOM.WW.OtOpcUa.Client.UI.Tests` | Test project | Unit tests for UI view models |
+| `ZB.MOM.WW.OtOpcUa.Client.UI` | Avalonia app | Desktop UI |
+| `ZB.MOM.WW.OtOpcUa.Client.Shared.Tests` | Test project | Shared-library unit tests |
+| `ZB.MOM.WW.OtOpcUa.Client.CLI.Tests` | Test project | CLI command tests |
+| `ZB.MOM.WW.OtOpcUa.Client.UI.Tests` | Test project | ViewModel unit tests |
+
+## Shared Requirements (Client.Shared)
+
+### SHR-001: Single Service Interface
+
+The Client.Shared library shall expose a single service interface `IOpcUaClientService` covering connect, disconnect, read, write, browse, subscribe, alarm-subscribe, alarm-ack, history-read-raw, history-read-aggregate, and get-redundancy-info operations.
+
+### SHR-002: ConnectionSettings Model
+
+The library shall expose a `ConnectionSettings` record with the fields: `EndpointUrl` (required), `FailoverUrls[]`, `Username`, `Password`, `SecurityMode` (None/Sign/SignAndEncrypt; default None), `SessionTimeoutSeconds` (default 60), `AutoAcceptCertificates` (default true), `CertificateStorePath`.
+
+### SHR-003: Automatic Failover
+
+The library shall monitor session keep-alive and automatically fail over across `FailoverUrls` when the primary endpoint is unreachable, emitting a `ConnectionStateChanged` event on each transition (Disconnected / Connecting / Connected / Reconnecting).
+
+### SHR-004: Cross-Platform Certificate Store
+
+The library shall auto-generate a client certificate on first use and store it in a cross-platform path (default `{AppData}/OtOpcUaClient/pki/`). Server certificates are auto-accepted when `AutoAcceptCertificates = true`.
+
+### SHR-005: Type-Coercing Write
+
+The library's `WriteValueAsync(NodeId, object)` shall read the node's current value to determine target type and coerce the input value before sending.
+
+### SHR-006: UI-Thread Dispatch Neutrality
+
+The library shall not assume any specific synchronization context. Events (`DataChanged`, `AlarmEvent`, `ConnectionStateChanged`) are raised on the OPC UA stack thread; the consuming CLI / UI is responsible for dispatching to its UI thread.
+
+---
+
+## CLI Requirements (Client.CLI)
+
+### CLI-001: Command Surface
+
+The CLI shall expose the following commands: `connect`, `read`, `write`, `browse`, `subscribe`, `historyread`, `alarms`, `redundancy`.
+
+### CLI-002: Common Options
+
+All CLI commands shall accept the options `-u, --url` (required), `-U, --username`, `-P, --password`, `-S, --security none|sign|encrypt`, `-F, --failover-urls` (comma-separated), `--verbose`.
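The CLI-002 option surface can be sketched as follows. This is a minimal, language-neutral illustration written in Python with `argparse`, not the project's CliFx-based C# implementation. The helper names (`split_urls`, `build_common_parser`, `parse_common`), the `SECURITY_MODES` table, and the `none` default for `-S` (borrowed from the SHR-002 `SecurityMode` default) are assumptions made for the example.

```python
import argparse

# Assumed mapping from the CLI-002 `-S` values to the SHR-002 SecurityMode names.
SECURITY_MODES = {"none": "None", "sign": "Sign", "encrypt": "SignAndEncrypt"}


def split_urls(raw):
    """Split a comma-separated URL list, trimming whitespace and dropping blanks."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_common_parser():
    """Build a parser holding only the options shared by every command (CLI-002)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-u", "--url", required=True)
    parser.add_argument("-U", "--username")
    parser.add_argument("-P", "--password")
    # Default of "none" is an assumption; CLI-002 itself does not state one.
    parser.add_argument("-S", "--security", choices=sorted(SECURITY_MODES), default="none")
    parser.add_argument("-F", "--failover-urls", type=split_urls, default=[])
    parser.add_argument("--verbose", action="store_true")
    return parser


def parse_common(argv):
    """Parse the shared options and attach the resolved SecurityMode name."""
    args = build_common_parser().parse_args(argv)
    args.security_mode = SECURITY_MODES[args.security]
    return args
```

For instance, `parse_common(["-u", "opc.tcp://localhost:4840", "-S", "encrypt"])` yields a `security_mode` of `SignAndEncrypt` and an empty failover list. Keeping the shared options in one parser mirrors the requirement that every command accepts the same set.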
+
+### CLI-003: Connect Command
+
+The `connect` command shall attempt to establish a session using the supplied options and print `Connected` plus the resolved endpoint's `ServerUriArray` and `ApplicationUri` on success, or a diagnostic error message on failure.
+
+### CLI-004: Read Command
+
+The `read -n ` command shall print `NodeId`, `Value`, `StatusCode`, `SourceTimestamp`, `ServerTimestamp` one per line.
+
+### CLI-005: Write Command
+
+The `write -n -v ` command shall coerce the value to the node's current type (per SHR-005) and print the resulting `StatusCode`. A `Bad_UserAccessDenied` result is printed verbatim so operators see the authorization outcome.
+
+### CLI-006: Browse Command
+
+The `browse [-n ] [-r] [-d ]` command shall list child nodes under `parent` (or the `Objects` folder if omitted). `-r` enables recursion up to `-d` depth (default 1).
+
+### CLI-007: Subscribe Command
+
+The `subscribe -n -i ` command shall create a monitored item at `intervalMs` publishing interval, print each `DataChanged` event as ` ` until Ctrl-C, then cleanly unsubscribe.
+
+### CLI-008: Historyread Command
+
+The `historyread -n --start --end [--max ] [--aggregate --interval ]` command shall print raw values or aggregate buckets. Supported aggregate types: Average, Minimum, Maximum, Count, Start, End.
+
+### CLI-009: Alarms Command
+
+The `alarms [-n ] [-i ]` command shall subscribe to alarm events, print each event as `