Doc refresh (task #204) — operational docs for multi-process multi-driver OtOpcUa

Five operational docs rewritten for v2 (multi-process, multi-driver, Config-DB authoritative): - docs/Configuration.md — replaced appsettings-only story with the two-layer model. appsettings.json is bootstrap only (Node identity, Config DB connection string, transport security, LDAP bind, logging). Authoritative config (clusters, namespaces, UNS, equipment, tags, driver instances, ACLs, role grants, poll groups) lives in the Config DB accessed via OtOpcUaConfigDbContext and edited through the Admin UI draft/publish workflow. Added v1-to-v2 migration index so operators can locate where each old section moved. Cross-links to docs/v2/config-db-schema.md + docs/v2/admin-ui.md. - docs/Redundancy.md — Phase 6.3 rewrite. Named every class under src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/: RedundancyCoordinator, RedundancyTopology, ApplyLeaseRegistry (publish fencing), PeerReachabilityTracker, RecoveryStateManager, ServiceLevelCalculator (pure function), RedundancyStatePublisher. Documented the full 11-band ServiceLevel matrix (Maintenance=0 through AuthoritativePrimary=255) from ServiceLevelCalculator.cs and the per-ClusterNode fields (RedundancyRole, ServiceLevelBase, ApplicationUri). Covered metrics (otopcua.redundancy.role_transition counter + primary/secondary/stale_count gauges on meter ZB.MOM.WW.OtOpcUa.Redundancy) and SignalR RoleChanged push from FleetStatusPoller to RedundancyTab.razor. - docs/security.md — preserved the transport-security section (still accurate) and added Phase 6.2 authorization. Four concerns now documented in one place: (1) transport security profiles, (2) OPC UA auth via LdapUserAuthenticator (note: task spec called this LdapAuthenticationProvider — actual class name is LdapUserAuthenticator in Server/Security/), (3) data-plane authorization via NodeAcl + PermissionTrie + AuthorizationGate — additive-only model per decision #129, ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag hierarchy, NodePermissions bundle, PermissionProbeService in Admin for "probe this permission", (4) control-plane authorization via LdapGroupRoleMapping + AdminRole (ConfigViewer / ConfigEditor / FleetAdmin, CanEdit / CanPublish policies) — deliberately independent of data-plane ACLs per decision #150. Documented the OTOPCUA0001 Roslyn analyzer (UnwrappedCapabilityCallAnalyzer) as the compile-time guard ensuring every driver-capability async call is wrapped by CapabilityInvoker. - docs/ServiceHosting.md — three-process rewrite: OtOpcUa Server (net10 x64, BackgroundService + AddWindowsService, hosts OPC UA endpoint + all non-Galaxy drivers), OtOpcUa Admin (net10 x64, Blazor Server + SignalR + /metrics via OpenTelemetry Prometheus exporter), OtOpcUa Galaxy.Host (.NET Framework 4.8 x86, NSSM-wrapped, env-variable driven, STA thread + MXAccess COM). Pipe ACL denies-Admins detail + non-elevated shell requirement captured from feedback memory. Divergence from CLAUDE.md: task spec said "TopShelf is still the service-installer wrapper per CLAUDE.md note" but no csproj in the repo references TopShelf — decision #30 replaced it with the generic host's AddWindowsService wrapper (per the doc comment on OpcUaServerService). Reflected the actual state + flagged this divergence here so someone can update CLAUDE.md separately. - docs/StatusDashboard.md — replaced the full v1 reference (dashboard endpoints, health check rules, StatusData DTO, etc.) with a short "superseded by Admin UI" pointer that preserves git-blame continuity + avoids broken links from other docs that reference it. Class references verified by reading: src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{RedundancyCoordinator, ServiceLevelCalculator, ApplyLeaseRegistry, RedundancyStatePublisher}.cs src/ZB.MOM.WW.OtOpcUa.Core/Authorization/{PermissionTrie, PermissionTrieBuilder, PermissionTrieCache, TriePermissionEvaluator, AuthorizationGate}.cs src/ZB.MOM.WW.OtOpcUa.Server/Security/{AuthorizationGate, LdapUserAuthenticator}.cs src/ZB.MOM.WW.OtOpcUa.Admin/{Program.cs, Services/AdminRoles.cs, Services/RedundancyMetrics.cs, Hubs/FleetStatusPoller.cs} src/ZB.MOM.WW.OtOpcUa.Server/Program.cs + appsettings.json src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/{Program.cs, Ipc/PipeServer.cs} src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{ClusterNode, NodeAcl, LdapGroupRoleMapping}.cs src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:30:56 -04:00
parent 71339307fa
commit 5506b43ddc
5 changed files with 509 additions and 1236 deletions
@@ -2,189 +2,102 @@

 ## Overview

-LmxOpcUa supports OPC UA **non-transparent redundancy** in Warm or Hot mode. In a non-transparent redundancy deployment, two independent server instances run side by side. Both connect to the same Galaxy repository database and the same MXAccess runtime, but each maintains its own OPC UA sessions and subscriptions. Clients discover the redundant set through the `ServerUriArray` exposed in each server's address space and are responsible for managing failover between the two endpoints.
+OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes.

-When redundancy is disabled (the default), the server reports `RedundancySupport.None` and a fixed `ServiceLevel` of 255.
+The redundancy surface lives in `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`:

-## Namespace vs Application Identity
-
-Both servers in the redundant set share the same **namespace URI** so that clients see identical node IDs regardless of which instance they are connected to. The namespace URI follows the pattern `urn:{GalaxyName}:LmxOpcUa` (e.g., `urn:ZB:LmxOpcUa`).
-
-The **ApplicationUri**, on the other hand, must be unique per instance. This is how the OPC UA stack and clients distinguish one server from the other within the redundant set. Each instance sets its own ApplicationUri via the `OpcUa.ApplicationUri` configuration property (e.g., `urn:localhost:LmxOpcUa:instance1` and `urn:localhost:LmxOpcUa:instance2`).
-
-When redundancy is disabled, `ApplicationUri` defaults to `urn:{GalaxyName}:LmxOpcUa` if left null.
-
-## Configuration
-
-### Redundancy Section
-
-| Property | Type | Default | Description |
-|---|---|---|---|
-| `Enabled` | bool | `false` | Enables non-transparent redundancy. When false, the server reports `RedundancySupport.None` and `ServiceLevel = 255`. |
-| `Mode` | string | `"Warm"` | The redundancy mode advertised to clients. Valid values: `Warm`, `Hot`. |
-| `Role` | string | `"Primary"` | This instance's role in the redundant pair. Valid values: `Primary`, `Secondary`. The Primary advertises a higher ServiceLevel than the Secondary when both are healthy. |
-| `ServerUris` | string[] | `[]` | The ApplicationUri values of all servers in the redundant set. Must include this instance's own `OpcUa.ApplicationUri`. Should contain at least 2 entries. |
-| `ServiceLevelBase` | int | `200` | The base ServiceLevel when the server is fully healthy. Valid range: 1-255. The Secondary automatically receives `ServiceLevelBase - 50`. |
-
-### OpcUa.ApplicationUri
-
-| Property | Type | Default | Description |
-|---|---|---|---|
-| `ApplicationUri` | string | `null` | Explicit application URI for this server instance. When null, defaults to `urn:{GalaxyName}:LmxOpcUa`. **Required when redundancy is enabled** -- each instance needs a unique identity. |
-
-## ServiceLevel Computation
-
-ServiceLevel is a standard OPC UA diagnostic value (0-255) that indicates server health. Clients in a redundant deployment should prefer the server advertising the highest ServiceLevel.
-
-**Baseline values:**
-
-| Role | Baseline |
+| Class | Role |
 |---|---|
-| Primary | `ServiceLevelBase` (default 200) |
-| Secondary | `ServiceLevelBase - 50` (default 150) |
+| `RedundancyCoordinator` | Process-singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. |
+| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. |
+| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. `await using` the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. |
+| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. |
+| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
+| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. |
+| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. |

-**Penalties applied to the baseline:**
+## Data model

-| Condition | Penalty |
+Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`):
+
+| Column | Role |
 |---|---|
-| MXAccess disconnected | -100 |
-| Galaxy DB unreachable | -50 |
-| Both MXAccess and DB down | ServiceLevel forced to 0 |
+| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. |
+| `ClusterId` | Foreign key into `ServerCluster`. |
+| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). |
+| `ServiceLevelBase` | Per-node base value used to bias nominal ServiceLevel output. |
+| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |

-The final value is clamped to the range 0-255.
+`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes.

-**Examples (with default ServiceLevelBase = 200):**
+## ServiceLevel matrix

-| Scenario | Primary | Secondary |
+`ServiceLevelCalculator` produces one of the following bands (see `ServiceLevelBand` enum in the same file):
+
+| Band | Byte | Meaning |
 |---|---|---|
-| Both healthy | 200 | 150 |
-| MXAccess down | 100 | 50 |
-| DB down | 150 | 100 |
-| Both down | 0 | 0 |
+| `Maintenance` | 0 | Operator-declared maintenance. |
+| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). |
+| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. |
+| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. |
+| `BackupMidApply` | 50 | Backup inside a publish-apply window. |
+| `IsolatedBackup` | 80 | Primary unreachable; Backup says "take over if asked" — does **not** auto-promote (non-transparent model). |
+| `AuthoritativeBackup` | 100 | Backup nominal. |
+| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. |
+| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. |
+| `IsolatedPrimary` | 230 | Primary with unreachable peer, retains authority. |
+| `AuthoritativePrimary` | 255 | Primary nominal. |

-## Two-Instance Deployment
+The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 2..255 so spec-compliant clients that treat "<3 = unhealthy" keep working.

-When deploying a redundant pair, the following configuration properties must differ between the two instances. All other settings (GalaxyName, ConnectionString, etc.) are shared.
+Standalone nodes (single-instance deployments) report `AuthoritativePrimary` when healthy and `PrimaryMidApply` during publish.

-| Property | Instance 1 (Primary) | Instance 2 (Secondary) |
-|---|---|---|
-| `OpcUa.Port` | 4840 | 4841 |
-| `OpcUa.ServerName` | `LmxOpcUa-1` | `LmxOpcUa-2` |
-| `OpcUa.ApplicationUri` | `urn:localhost:LmxOpcUa:instance1` | `urn:localhost:LmxOpcUa:instance2` |
-| `Dashboard.Port` | 8081 | 8082 |
-| `MxAccess.ClientName` | `LmxOpcUa-1` | `LmxOpcUa-2` |
-| `Redundancy.Role` | `Primary` | `Secondary` |
+## Publish fencing and split-brain prevention

-### Instance 1 -- Primary (appsettings.json)
+Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`. While the lease is held:

-```json
-{
-  "OpcUa": {
-    "Port": 4840,
-    "ServerName": "LmxOpcUa-1",
-    "GalaxyName": "ZB",
-    "ApplicationUri": "urn:localhost:LmxOpcUa:instance1"
-  },
-  "MxAccess": {
-    "ClientName": "LmxOpcUa-1"
-  },
-  "Dashboard": {
-    "Port": 8081
-  },
-  "Redundancy": {
-    "Enabled": true,
-    "Mode": "Warm",
-    "Role": "Primary",
-    "ServerUris": [
-      "urn:localhost:LmxOpcUa:instance1",
-      "urn:localhost:LmxOpcUa:instance2"
-    ],
-    "ServiceLevelBase": 200
-  }
-}
-```
+- The calculator reports `PrimaryMidApply` / `BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
+- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
+- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`.

-### Instance 2 -- Secondary (appsettings.json)
+Because role transitions are **operator-driven** (write `RedundancyRole` in the Config DB + publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).

-```json
-{
-  "OpcUa": {
-    "Port": 4841,
-    "ServerName": "LmxOpcUa-2",
-    "GalaxyName": "ZB",
-    "ApplicationUri": "urn:localhost:LmxOpcUa:instance2"
-  },
-  "MxAccess": {
-    "ClientName": "LmxOpcUa-2"
-  },
-  "Dashboard": {
-    "Port": 8082
-  },
-  "Redundancy": {
-    "Enabled": true,
-    "Mode": "Warm",
-    "Role": "Secondary",
-    "ServerUris": [
-      "urn:localhost:LmxOpcUa:instance1",
-      "urn:localhost:LmxOpcUa:instance2"
-    ],
-    "ServiceLevelBase": 200
-  }
-}
-```
+## Metrics

-## CLI `redundancy` Command
+`RedundancyMetrics` in `src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. Instruments:

-The Client CLI includes a `redundancy` command that reads the redundancy state from a running server.
+| Name | Kind | Tags | Description |
+|---|---|---|---|
+| `otopcua.redundancy.role_transition` | Counter<long> | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. |
+| `otopcua.redundancy.primary_count` | ObservableGauge<long> | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in nominal state. |
+| `otopcua.redundancy.secondary_count` | ObservableGauge<long> | `cluster.id` | Secondary-role nodes per cluster. |
+| `otopcua.redundancy.stale_count` | ObservableGauge<long> | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. |

-```bash
-dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u opc.tcp://localhost:4840/LmxOpcUa
-dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u opc.tcp://localhost:4841/LmxOpcUa
-```
+Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.

-The command reads the following standard OPC UA nodes and displays their values:
+## Real-time notifications (Admin UI)

- **Redundancy Mode** -- from `Server_ServerRedundancy_RedundancySupport` (None, Warm, or Hot)
- **Service Level** -- from `Server_ServiceLevel` (0-255)
- **Server URIs** -- from `Server_ServerRedundancy_ServerUriArray` (list of ApplicationUri values in the redundant set)
- **Application URI** -- from `Server_ServerArray` (this instance's ApplicationUri)
+`FleetStatusPoller` in `src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On<RoleChangedMessage>("RoleChanged", …)` so connected Admin sessions see role swaps the moment they happen.

-Example output for a healthy Primary:
+## Configuring a redundant pair

-```
-Redundancy Mode:  Warm
-Service Level:    200
-Server URIs:
-  - urn:localhost:LmxOpcUa:instance1
-  - urn:localhost:LmxOpcUa:instance2
-Application URI:  urn:localhost:LmxOpcUa:instance1
-```
+Redundancy is configured **in the Config DB, not appsettings.json**. The fields that must differ between the two instances:

-The command also supports `--username`/`--password` and `--security` options for authenticated or encrypted connections.
+| Field | Location | Instance 1 | Instance 2 |
+|---|---|---|---|
+| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` |
+| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` |
+| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` |
+| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 |

-### Client Failover with `-F`
+Shared between instances: `ClusterId`, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.

-All CLI commands support the `-F` / `--failover-urls` flag for automatic client-side failover. When provided, the CLI tries the primary endpoint first and falls back to the listed URLs if the primary is unreachable.
+Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab` — the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart.

-```bash
-# Connect with failover — uses secondary if primary is down
-dotnet run -- connect -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa
+## Client-side failover

-# Subscribe with live failover — reconnects to secondary if primary drops mid-stream
-dotnet run -- subscribe -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa \
-  -n "ns=1;s=TestMachine_001.MachineID"
-```
+The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md) for the command reference.

-For long-running commands (`subscribe`), the CLI monitors the session KeepAlive and automatically reconnects to the next available server when the current session drops. The subscription is re-created on the new server.
+## Depth reference

-## Troubleshooting
-
-**Mismatched ServerUris between instances** -- Both instances must list the exact same set of ApplicationUri values in `Redundancy.ServerUris`. If they differ, clients may not discover the full redundant set. Check the startup log for the `Redundancy.ServerUris` line on each instance.
-
-**ServiceLevel stuck at 255** -- This indicates redundancy is not enabled. When `Redundancy.Enabled` is false (the default), the server always reports `ServiceLevel = 255` and `RedundancySupport.None`. Verify that `Redundancy.Enabled` is set to `true` in the configuration and that the configuration section is correctly bound.
-
-**ApplicationUri not set** -- The configuration validator rejects startup when redundancy is enabled but `OpcUa.ApplicationUri` is null or empty. Each instance must have a unique ApplicationUri. Check the error log for: `OpcUa.ApplicationUri must be set when redundancy is enabled`.
-
-**Both servers report the same ServiceLevel** -- Verify that one instance has `Redundancy.Role` set to `Primary` and the other to `Secondary`. Both set to `Primary` (or both to `Secondary`) will produce identical baseline values, preventing clients from distinguishing the preferred server.
-
-**ServerUriArray not readable** -- When `RedundancySupport` is `None` (redundancy disabled), the OPC UA SDK may not expose the `ServerUriArray` node or it may return an empty value. The CLI `redundancy` command handles this gracefully by catching the read error. Enable redundancy to populate this array.
+For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see `docs/v2/plan.md` §Phase 6.3.