docs: v2 updates to Redundancy, ServiceHosting, security, README (Task 64)

- Redundancy.md: full rewrite. Akka-leader-driven ServiceLevel replaces operator-managed RedundancyRole. Documents the 5-tier ServiceLevelCalculator, RedundancyStateActor cluster singleton, and the DPS data flow.
- ServiceHosting.md: full rewrite. Single fused OtOpcUa.Host binary with OTOPCUA_ROLES env gating. Documents the conditional DI graph and the new health endpoints (/health/ready, /health/active, /healthz).
- security.md: v2 banner at top covering path/project renames + new JWT bearer + DataProtection persisted to ConfigDb. Body unchanged because the 4-concern security model is unchanged in v2; full per-section rewrite waits for F15 (Admin pages migration) since security.md references many pages that move.
- README.md: platform overview updated to v2 (fused Host + role gating).
2026-05-26 06:38:55 -04:00
parent a8becc9c46
commit 3c3fef911c
4 changed files with 139 additions and 118 deletions
@@ -1,103 +1,93 @@
-# Redundancy
+# Redundancy (v2)

 ## Overview

-OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes.
+OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two or more `OtOpcUa.Host` processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` byte that each server publishes.

-The redundancy surface lives in `src/Server/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`:
+> **v2 change.** v1's operator-managed `ClusterNode.RedundancyRole` column is gone, along with `RedundancyCoordinator` / `ApplyLeaseRegistry` / `PeerHttpProbeLoop`. Primary/secondary is now derived from the **Akka cluster role-leader** for the `driver` role. The operator no longer writes a role into the DB; cluster topology and health drive ServiceLevel automatically.

-| Class | Role |
-|---|---|
-| `RedundancyCoordinator` | Process-singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. |
-| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. |
-| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. `await using` the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. |
-| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. |
-| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
-| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. |
-| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. |
+The runtime pieces live in:

-## Data model
-
-Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`):
-
-| Column | Role |
-|---|---|
-| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. |
-| `ClusterId` | Foreign key into `ServerCluster`. |
-| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). |
-| `ServiceLevelBase` | Per-node base value used to bias nominal ServiceLevel output. |
-| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |
-
-`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes.
-
-## ServiceLevel matrix
-
-`ServiceLevelCalculator` produces one of the following bands (see `ServiceLevelBand` enum in the same file):
-
-| Band | Byte | Meaning |
+| Component | Project | Role |
 |---|---|---|
-| `Maintenance` | 0 | Operator-declared maintenance. |
-| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). |
-| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. |
-| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. |
-| `BackupMidApply` | 50 | Backup inside a publish-apply window. |
-| `IsolatedBackup` | 80 | Primary unreachable; Backup says "take over if asked" — does **not** auto-promote (non-transparent model). |
-| `AuthoritativeBackup` | 100 | Backup nominal. |
-| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. |
-| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. |
-| `IsolatedPrimary` | 230 | Primary with unreachable peer, retains authority. |
-| `AuthoritativePrimary` | 255 | Primary nominal. |
+| `ServiceLevelCalculator` | `OtOpcUa.ControlPlane.Redundancy` | Pure function `(NodeHealthInputs) → byte`. No side effects. |
+| `RedundancyStateActor` | `OtOpcUa.ControlPlane.Redundancy` | Admin-role cluster singleton; subscribes to cluster topology events, debounces them for 250 ms, and broadcasts `RedundancyStateChanged` on the `redundancy-state` DPS topic. |
+| `DbHealthProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; runs `SELECT 1` against ConfigDb every 5 s. Its result is read by the health endpoint and the redundancy calculation. |
+| `PeerOpcUaProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; pings the peer at `opc.tcp://peer:4840` (the real probe call is staged for follow-up F12). |
+| `ClusterRoleInfo` | `OtOpcUa.Cluster` | Live view of cluster membership + role-leader; exposes `IClusterRoleInfo` to the rest of the host. |

-The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 2..255 so spec-compliant clients that treat "<3 = unhealthy" keep working.
+## ServiceLevel tiers (Part 5 §6.5)

-Standalone nodes (single-instance deployments) report `AuthoritativePrimary` when healthy and `PrimaryMidApply` during publish.
+`ServiceLevelCalculator.Compute(NodeHealthInputs)` returns a byte in 0..255 by tier:

-## Publish fencing and split-brain prevention
+| Tier | Byte | Condition |
+|---|---|---|
+| Down | 0 | Member status is neither `Up` nor `Joining` (for example leaving, exiting, or removed). |
+| Critically degraded | 100 | ConfigDb unreachable AND data is stale. |
+| Stale | 200 | Data stale but ConfigDb reachable. |
+| Healthy follower | 240 | DB ok + OPC UA probe ok + not stale. |
+| Healthy leader | 250 | Healthy + this node is the `driver` role-leader. |

-Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`. While the lease is held:
+Driver nodes write the computed byte into the OPC UA `ServiceLevel` Variable on each refresh. Clients that apply the standard redundancy heuristic ("pick the highest ServiceLevel") therefore prefer the role-leader and fall back to a follower when the leader degrades.

- The calculator reports `PrimaryMidApply` / `BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`.
+## Data flow

-Because role transitions are **operator-driven** (write `RedundancyRole` in the Config DB + publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).
+```
+Cluster topology event ──┐
+DB health probe ─────────┤
+OPC UA peer probe ───────┤
+                          ▼
+              RedundancyStateActor (admin singleton)
+                          │  debounce 250ms
+                          ▼
+              DPS topic "redundancy-state"
+                          │
+                          ▼
+                Driver nodes' OpcUaPublishActor
+                          │
+                          ▼
+              ServiceLevelCalculator → byte
+                          │
+                          ▼
+              OPC UA ServiceLevel Variable
+```

-## Metrics
+The admin singleton is the cluster's only `RedundancyStateActor`. If the admin leader fails over, the new admin node spins up its replacement, re-subscribes to cluster events, and publishes a fresh snapshot from the current `Cluster.State`. There is no DB-persisted state to recover.

-`RedundancyMetrics` in `src/Server/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. Instruments:
+## Configuration

-| Name | Kind | Tags | Description |
-|---|---|---|---|
-| `otopcua.redundancy.role_transition` | Counter<long> | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. |
-| `otopcua.redundancy.primary_count` | ObservableGauge<long> | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in nominal state. |
-| `otopcua.redundancy.secondary_count` | ObservableGauge<long> | `cluster.id` | Secondary-role nodes per cluster. |
-| `otopcua.redundancy.stale_count` | ObservableGauge<long> | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. |
+Per-node identity comes from `appsettings.json` + the `OTOPCUA_ROLES` env var:

-Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
+```json
+{
+  "Cluster": {
+    "Hostname": "0.0.0.0",
+    "Port": 4053,
+    "PublicHostname": "node-a.lan",
+    "SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
+    "Roles": ["admin", "driver"]
+  }
+}
+```

-## Real-time notifications (Admin UI)
+```
+OTOPCUA_ROLES=admin,driver
+```

-`FleetStatusPoller` in `src/Server/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On<RoleChangedMessage>("RoleChanged", …)` so connected Admin sessions see role swaps the moment they happen.
+Both nodes share the same `ConfigDb` connection string; `Cluster.PublicHostname` and `Roles` are what make them distinct in cluster gossip. The first node bootstraps the cluster (its address goes in `SeedNodes`); the second node joins via the same `SeedNodes` list.

-## Configuring a redundant pair
+There is no longer a `Node:NodeId` setting, no `ClusterNode.RedundancyRole`, no `ServiceLevelBase`. NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula).

-Redundancy is configured **in the Config DB, not appsettings.json**. The fields that must differ between the two instances:
+## Split-brain

-| Field | Location | Instance 1 | Instance 2 |
-|---|---|---|---|
-| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` |
-| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` |
-| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` |
-| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 |
+`akka.conf` configures Akka's split-brain resolver with `active-strategy = keep-oldest`, `stable-after = 15s`, and `failure-detector.threshold = 10.0`. Under a clean partition, the side that contains the oldest member stays up and the other side downs itself within ~15 seconds. The `RedundancyStateActor` on the surviving partition re-computes from the post-partition `Cluster.State`.

-Shared between instances: `ClusterId`, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.
-
-Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab` — the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart.
+There is no operator-driven role swap during a partition. The cluster handles failover automatically.

 ## Client-side failover

-The OtOpcUa Client CLI at `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md) for the command reference.
+The OtOpcUa Client CLI at `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md).

 ## Depth reference

-For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see `docs/v2/plan.md` §Phase 6.3.
+For the full design — message contracts, tiered calculator truth table, recovery semantics — see `docs/plans/2026-05-26-akka-hosting-alignment-design.md` §6.
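The ServiceLevel tier table in the new Redundancy.md can be read as a small decision function. The following is a minimal, language-neutral sketch in Python, not the project's C# `ServiceLevelCalculator`. The field names on `NodeHealthInputs` and the function name `compute_service_level` are hypothetical, chosen only to mirror the table's conditions. The table does not say what happens when the node is fresh but ConfigDb is unreachable or the peer probe fails, so the sketch returns `None` for those combinations rather than guessing a byte.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeHealthInputs:
    """Hypothetical input record; field names are assumptions, not the real type."""
    member_status: str          # Akka member status, e.g. "Up", "Joining", "Leaving"
    config_db_reachable: bool   # result of the ConfigDb health probe
    peer_probe_ok: bool         # result of the OPC UA peer probe
    data_stale: bool            # whether the node's data is considered stale
    is_driver_role_leader: bool # whether this node is the `driver` role-leader


def compute_service_level(i: NodeHealthInputs) -> Optional[int]:
    """Map health inputs to a ServiceLevel byte following the documented tiers.

    Returns None for input combinations the tier table does not cover.
    """
    # Tier "Down": member is neither Up nor Joining.
    if i.member_status not in ("Up", "Joining"):
        return 0
    if i.data_stale:
        # "Stale" when ConfigDb is reachable, "Critically degraded" when it is not.
        return 200 if i.config_db_reachable else 100
    if i.config_db_reachable and i.peer_probe_ok:
        # Healthy: the driver role-leader outranks a follower.
        return 250 if i.is_driver_role_leader else 240
    # Fresh data but a failing DB or peer probe: not specified by the table.
    return None


leader = NodeHealthInputs("Up", True, True, False, True)
follower = NodeHealthInputs("Up", True, True, False, False)
print(compute_service_level(leader), compute_service_level(follower))
```

Because the function is pure, a client-side "pick the highest ServiceLevel" rule applied to these two nodes would select the leader, which matches the behaviour the document describes.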