Files
lmxopcua/docs/Redundancy.md
Joseph Doherty 5506b43ddc Doc refresh (task #204) — operational docs for multi-process multi-driver OtOpcUa
Five operational docs rewritten for v2 (multi-process, multi-driver, Config-DB authoritative):

- docs/Configuration.md — replaced appsettings-only story with the two-layer model.
  appsettings.json is bootstrap only (Node identity, Config DB connection string,
  transport security, LDAP bind, logging). Authoritative config (clusters, namespaces,
  UNS, equipment, tags, driver instances, ACLs, role grants, poll groups) lives in
  the Config DB accessed via OtOpcUaConfigDbContext and edited through the Admin UI
  draft/publish workflow. Added v1-to-v2 migration index so operators can locate where
  each old section moved. Cross-links to docs/v2/config-db-schema.md + docs/v2/admin-ui.md.

- docs/Redundancy.md — Phase 6.3 rewrite. Named every class under
  src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/: RedundancyCoordinator, RedundancyTopology,
  ApplyLeaseRegistry (publish fencing), PeerReachabilityTracker, RecoveryStateManager,
  ServiceLevelCalculator (pure function), RedundancyStatePublisher. Documented the
  full 11-band ServiceLevel matrix (Maintenance=0 through AuthoritativePrimary=255)
  from ServiceLevelCalculator.cs and the per-ClusterNode fields (RedundancyRole,
  ServiceLevelBase, ApplicationUri). Covered metrics
  (otopcua.redundancy.role_transition counter + primary/secondary/stale_count gauges
  on meter ZB.MOM.WW.OtOpcUa.Redundancy) and SignalR RoleChanged push from
  FleetStatusPoller to RedundancyTab.razor.

- docs/security.md — preserved the transport-security section (still accurate) and
  added Phase 6.2 authorization. Four concerns now documented in one place:
  (1) transport security profiles, (2) OPC UA auth via LdapUserAuthenticator
  (note: task spec called this LdapAuthenticationProvider — actual class name is
  LdapUserAuthenticator in Server/Security/), (3) data-plane authorization via
  NodeAcl + PermissionTrie + AuthorizationGate — additive-only model per decision
  #129, ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag hierarchy,
  NodePermissions bundle, PermissionProbeService in Admin for "probe this permission",
  (4) control-plane authorization via LdapGroupRoleMapping + AdminRole
  (ConfigViewer / ConfigEditor / FleetAdmin, CanEdit / CanPublish policies) —
  deliberately independent of data-plane ACLs per decision #150. Documented the
  OTOPCUA0001 Roslyn analyzer (UnwrappedCapabilityCallAnalyzer) as the compile-time
  guard ensuring every driver-capability async call is wrapped by CapabilityInvoker.

- docs/ServiceHosting.md — three-process rewrite: OtOpcUa Server (net10 x64,
  BackgroundService + AddWindowsService, hosts OPC UA endpoint + all non-Galaxy
  drivers), OtOpcUa Admin (net10 x64, Blazor Server + SignalR + /metrics via
  OpenTelemetry Prometheus exporter), OtOpcUa Galaxy.Host (.NET Framework 4.8 x86,
  NSSM-wrapped, env-variable driven, STA thread + MXAccess COM). Pipe ACL
  denies-Admins detail + non-elevated shell requirement captured from feedback memory.
  Divergence from CLAUDE.md: task spec said "TopShelf is still the service-installer
  wrapper per CLAUDE.md note" but no csproj in the repo references TopShelf — decision
  #30 replaced it with the generic host's AddWindowsService wrapper (per the doc
  comment on OpcUaServerService). Reflected the actual state + flagged this divergence
  here so someone can update CLAUDE.md separately.

- docs/StatusDashboard.md — replaced the full v1 reference (dashboard endpoints,
  health check rules, StatusData DTO, etc.) with a short "superseded by Admin UI"
  pointer that preserves git-blame continuity + avoids broken links from other docs
  that reference it.

Class references verified by reading:
  src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{RedundancyCoordinator, ServiceLevelCalculator,
      ApplyLeaseRegistry, RedundancyStatePublisher}.cs
  src/ZB.MOM.WW.OtOpcUa.Core/Authorization/{PermissionTrie, PermissionTrieBuilder,
      PermissionTrieCache, TriePermissionEvaluator, AuthorizationGate}.cs
  src/ZB.MOM.WW.OtOpcUa.Server/Security/{AuthorizationGate, LdapUserAuthenticator}.cs
  src/ZB.MOM.WW.OtOpcUa.Admin/{Program.cs, Services/AdminRoles.cs,
      Services/RedundancyMetrics.cs, Hubs/FleetStatusPoller.cs}
  src/ZB.MOM.WW.OtOpcUa.Server/Program.cs + appsettings.json
  src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/{Program.cs, Ipc/PipeServer.cs}
  src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{ClusterNode, NodeAcl,
      LdapGroupRoleMapping}.cs
  src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:34:25 -04:00

104 lines
8.2 KiB
Markdown

# Redundancy
## Overview
OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes.
The redundancy surface lives in `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`:
| Class | Role |
|---|---|
| `RedundancyCoordinator` | Process-singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. |
| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. |
| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. `await using` the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. |
| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. |
| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. |
| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. |
## Data model
Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`):
| Column | Role |
|---|---|
| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. |
| `ClusterId` | Foreign key into `ServerCluster`. |
| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). |
| `ServiceLevelBase` | Per-node base value used to bias nominal ServiceLevel output. |
| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |
`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes.
## ServiceLevel matrix
`ServiceLevelCalculator` produces one of the following bands (see `ServiceLevelBand` enum in the same file):
| Band | Byte | Meaning |
|---|---|---|
| `Maintenance` | 0 | Operator-declared maintenance. |
| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). |
| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. |
| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. |
| `BackupMidApply` | 50 | Backup inside a publish-apply window. |
| `IsolatedBackup` | 80 | Primary unreachable; Backup says "take over if asked" — does **not** auto-promote (non-transparent model). |
| `AuthoritativeBackup` | 100 | Backup nominal. |
| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. |
| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. |
| `IsolatedPrimary` | 230 | Primary with unreachable peer, retains authority. |
| `AuthoritativePrimary` | 255 | Primary nominal. |
The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 2..255 so spec-compliant clients that treat "<3 = unhealthy" keep working.
Standalone nodes (single-instance deployments) report `AuthoritativePrimary` when healthy and `PrimaryMidApply` during publish.
## Publish fencing and split-brain prevention
Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`. While the lease is held:
- The calculator reports `PrimaryMidApply` / `BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`.
Because role transitions are **operator-driven** (write `RedundancyRole` in the Config DB + publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).
## Metrics
`RedundancyMetrics` in `src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. Instruments:
| Name | Kind | Tags | Description |
|---|---|---|---|
| `otopcua.redundancy.role_transition` | Counter<long> | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. |
| `otopcua.redundancy.primary_count` | ObservableGauge<long> | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in nominal state. |
| `otopcua.redundancy.secondary_count` | ObservableGauge<long> | `cluster.id` | Secondary-role nodes per cluster. |
| `otopcua.redundancy.stale_count` | ObservableGauge<long> | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. |
Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
## Real-time notifications (Admin UI)
`FleetStatusPoller` in `src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On<RoleChangedMessage>("RoleChanged", …)` so connected Admin sessions see role swaps the moment they happen.
## Configuring a redundant pair
Redundancy is configured **in the Config DB, not appsettings.json**. The fields that must differ between the two instances:
| Field | Location | Instance 1 | Instance 2 |
|---|---|---|---|
| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` |
| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` |
| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` |
| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 |
Shared between instances: `ClusterId`, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.
Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab` — the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart.
## Client-side failover
The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md) for the command reference.
## Depth reference
For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see `docs/v2/plan.md` §Phase 6.3.