Compare commits

...

6 Commits

Author SHA1 Message Date
Joseph Doherty
84fe88fadb Phase 6.3 Stream A — RedundancyTopology + ClusterTopologyLoader + RedundancyCoordinator
Lands the data path that feeds the Phase 6.3 ServiceLevelCalculator shipped in
PR #89. OPC UA node wiring (ServiceLevel variable + ServerUriArray +
RedundancySupport) still deferred to task #147; peer-probe loops (Stream B.1/B.2
runtime layer beyond the calculator logic) deferred.

Server.Redundancy additions:
- RedundancyTopology record — immutable snapshot (ClusterId, SelfNodeId,
  SelfRole, Mode, Peers[], SelfApplicationUri). ServerUriArray() emits the
  OPC UA Part 4 §6.6.2.2 shape (self first, peers lexicographically by
  NodeId). RedundancyPeer record with per-peer Host/OpcUaPort/DashboardPort/
  ApplicationUri so the follow-up peer-probe loops know where to probe.
- ClusterTopologyLoader — pure fn from ServerCluster + ClusterNode[] to
  RedundancyTopology. Enforces Phase 6.3 Stream A.1 invariants:
    * At least one node per cluster.
    * At most 2 nodes (decision #83, v2.0 cap).
    * Every node belongs to the target cluster.
    * Unique ApplicationUri across the cluster (OPC UA Part 4 trust pin,
      decision #86).
    * At most 1 Primary per cluster in Warm/Hot modes (decision #84).
    * Self NodeId must be a member of the cluster.
  Violations throw InvalidTopologyException with a decision-ID-tagged message
  so operators know which invariant + what to fix.
- RedundancyCoordinator singleton — holds the current topology + IsTopologyValid
  flag. InitializeAsync throws on invariant violation (startup fails fast).
  RefreshAsync logs + flips IsTopologyValid=false (runtime won't tear down a
  running server; ServiceLevelCalculator falls to InvalidTopology band = 2
  which surfaces the problem to clients without crashing). CAS-style swap
  via Volatile.Write so readers always see a coherent snapshot.

Tests (10 new ClusterTopologyLoaderTests):
- Single-node standalone loads + empty peer list.
- Two-node cluster loads self + peer.
- ServerUriArray puts self first + peers sort lexicographically.
- Empty-nodes throws.
- Self-not-in-cluster throws.
- Three-node cluster rejected with decision #83 message.
- Duplicate ApplicationUri rejected with decision #86 shape reference.
- Two Primaries in Warm mode rejected (decision #84 + runtime-band reference).
- Cross-cluster node rejected.
- None-mode allows any role mix (standalone clusters don't enforce Primary count).

Full solution dotnet test: 1178 passing (was 1168, +10). Pre-existing
Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:24:14 -04:00
59f793f87c Merge pull request (#97) - Readiness doc blocker2 closed 2026-04-19 11:18:26 -04:00
37ba9e8d14 Merge pull request (#96) - Phase 6.1 Stream D wiring follow-up 2026-04-19 11:16:57 -04:00
Joseph Doherty
a8401ab8fd v2 release-readiness — blocker #2 closed; doc reflects state
PR #96 closed the Phase 6.1 Stream D config-cache wiring blocker.

- Status line: "one of three release blockers remains".
- Blocker #2 struck through + CLOSED with PR link. Periodic-poller + richer-
  snapshot-payload follow-ups downgraded to hardening.
- Change log: dated entry.

One blocker remains: Phase 6.3 Streams A/C/F redundancy runtime (tasks
#145, #147, #150).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:16:31 -04:00
Joseph Doherty
19a0bfcc43 Phase 6.1 Stream D follow-up — SealedBootstrap consumes ResilientConfigReader + GenerationSealedCache + StaleConfigFlag; /healthz surfaces the flag
Closes release blocker #2 from docs/v2/v2-release-readiness.md — the
generation-sealed cache + resilient reader + stale-config flag shipped as
unit-tested primitives in PR #81, but no production path consumed them until
now. This PR wires them end-to-end.

Server additions:
- SealedBootstrap — Phase 6.1 Stream D consumption hook. Resolves the node's
  current generation through ResilientConfigReader's timeout → retry →
  fallback-to-sealed pipeline. On every successful central-DB fetch it seals
  a fresh snapshot to <cache-root>/<cluster>/<generationId>.db so a future
  cache-miss has a known-good fallback. Alongside the original NodeBootstrap
  (which still uses the single-file ILocalConfigCache); Program.cs can
  switch between them once operators are ready for the generation-sealed
  semantics.
- OpcUaApplicationHost: new optional staleConfigFlag ctor parameter. When
  wired, HealthEndpointsHost consumes `flag.IsStale` via the existing
  usingStaleConfig Func<bool> hook. Means `/healthz` actually reports
  `usingStaleConfig: true` whenever a read fell back to the sealed cache —
  closes the loop between Stream D's flag + Stream C's /healthz body shape.

Tests (4 new SealedBootstrapIntegrationTests, all pass):
- Central-DB success path seals snapshot + flag stays fresh.
- Central-DB failure falls back to sealed snapshot + flag flips stale (the
  SQL-kill scenario from Phase 6.1 Stream D.4.a).
- No-snapshot + central-down throws GenerationCacheUnavailableException
  with a clear error (the first-boot scenario from D.4.c).
- Next successful bootstrap after a fallback clears the stale flag.

Full solution dotnet test: 1168 passing (was 1164, +4). Pre-existing
Client.CLI Subscribe flake unchanged.

Production activation: Program.cs wires SealedBootstrap (instead of
NodeBootstrap), constructs OpcUaApplicationHost with the staleConfigFlag,
and a HostedService polls sp_GetCurrentGenerationForCluster periodically so
peer-published generations land in this node's sealed cache. The poller
itself is Stream D.1.b follow-up.

The sp_PublishGeneration SQL-side hook (where the publish commit itself
could also write to a shared sealed cache) stays deferred — the per-node
seal pattern shipped here is the correct v2 GA model: each Server node
owns its own on-disk cache and refreshes from its own DB reads, matching
the Phase 6.1 scope-table description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:14:59 -04:00
fc7e18c7f5 Merge pull request (#95) - Readiness doc blocker1 closed 2026-04-19 11:06:28 -04:00
8 changed files with 670 additions and 9 deletions

View File

@@ -1,7 +1,7 @@
# v2 Release Readiness
> **Last updated**: 2026-04-19 (release blocker #1 closed Phase 6.2 dispatch wiring shipped)
> **Status**: **NOT YET RELEASE-READY** — two of three release blockers remain (Phase 6.1 Stream D config-cache wiring + Phase 6.3 Streams A/C/F redundancy runtime).
> **Last updated**: 2026-04-19 (release blockers #1 + #2 closed; Phase 6.3 redundancy runtime is the last)
> **Status**: **NOT YET RELEASE-READY** — one of three release blockers remains (Phase 6.3 Streams A/C/F redundancy-coordinator + OPC UA node wiring + client interop).
This doc is the single view of where v2 stands against its release criteria. Update it whenever a deferred follow-up closes or a new release blocker is discovered.
@@ -41,15 +41,16 @@ Additional Stream C surfaces (not release-blocking, hardening only):
These are additional hardening — the three highest-value surfaces (Read / Write / HistoryRead) are now gated, which covers the base-security gap for v2 GA.
### Config fallback — Phase 6.1 Stream D wiring (task #136)
### ~~Config fallback — Phase 6.1 Stream D wiring~~ (task #136 — **CLOSED** 2026-04-19, PR #96)
`ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag` all exist but nothing consumes them. The `NodeBootstrap` path still uses the original single-file `LiteDbConfigCache` via `ILocalConfigCache`; `sp_PublishGeneration` doesn't call `GenerationSealedCache.SealAsync` after commit; the Configuration read services don't wrap queries in `ResilientConfigReader.ReadAsync`.
**Closed**. `SealedBootstrap` consumes `ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag` end-to-end: bootstrap calls go through the timeout → retry → fallback-to-sealed pipeline; every central-DB success writes a fresh sealed snapshot so the next cache-miss has a known-good fallback; `StaleConfigFlag.IsStale` is now consumed by `HealthEndpointsHost.usingStaleConfig` so `/healthz` body reports reality.
Closing this requires:
Production activation: Program.cs switches `NodeBootstrap → SealedBootstrap` + constructs `OpcUaApplicationHost` with the `StaleConfigFlag` as an optional ctor parameter.
- `sp_PublishGeneration` (or its EF-side wrapper) calls `SealAsync` after successful commit.
- DriverInstance enumeration, LdapGroupRoleMapping fetches, cluster + namespace metadata reads route through `ResilientConfigReader.ReadAsync`.
- Integration test: SQL container kill mid-operation → serves sealed snapshot, `UsingStaleConfig` = true, driver stays Healthy, `/healthz` body reflects the flag.
Remaining follow-ups (hardening, not release-blocking):
- A `HostedService` that polls `sp_GetCurrentGenerationForCluster` periodically so peer-published generations land in this node's cache without a restart.
- Richer snapshot payload via `sp_GetGenerationContent` so fallback can serve the full generation content (DriverInstance enumeration, ACL rows, etc.) from the sealed cache alone.
### Redundancy — Phase 6.3 Streams A/C/F (tasks #145, #147, #150)
@@ -97,6 +98,7 @@ v2 GA requires all of the following:
## Change log
- **2026-04-19** — Release blocker #2 **closed** (PR #96). `SealedBootstrap` consumes `ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag`; `/healthz` now surfaces the stale flag. Remaining follow-ups (periodic poller + richer snapshot payload) downgraded to hardening.
- **2026-04-19** — Release blocker #1 **closed** (PR #94). `AuthorizationGate` wired into `DriverNodeManager` Read / Write / HistoryRead dispatch. Remaining Stream C surfaces (Browse / Subscribe / Alarm / Call + finer-grained scope resolution) downgraded to hardening follow-ups — no longer release-blocking.
- **2026-04-19** — Phase 6.4 data layer merged (PRs #9192). Phase 6 core complete. Capstone doc created.
- **2026-04-19** — Phase 6.3 core merged (PRs #8990). `ServiceLevelCalculator` + `RecoveryStateManager` + `ApplyLeaseRegistry` land as pure logic; coordinator / UA-node wiring / Admin UI / interop deferred.

View File

@@ -1,6 +1,7 @@
using Microsoft.Extensions.Logging;
using Opc.Ua;
using Opc.Ua.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.LocalCache;
using ZB.MOM.WW.OtOpcUa.Core.Hosting;
using ZB.MOM.WW.OtOpcUa.Core.OpcUa;
using ZB.MOM.WW.OtOpcUa.Core.Resilience;
@@ -25,6 +26,7 @@ public sealed class OpcUaApplicationHost : IAsyncDisposable
private readonly DriverResiliencePipelineBuilder _pipelineBuilder;
private readonly AuthorizationGate? _authzGate;
private readonly NodeScopeResolver? _scopeResolver;
private readonly StaleConfigFlag? _staleConfigFlag;
private readonly ILoggerFactory _loggerFactory;
private readonly ILogger<OpcUaApplicationHost> _logger;
private ApplicationInstance? _application;
@@ -36,7 +38,8 @@ public sealed class OpcUaApplicationHost : IAsyncDisposable
IUserAuthenticator authenticator, ILoggerFactory loggerFactory, ILogger<OpcUaApplicationHost> logger,
DriverResiliencePipelineBuilder? pipelineBuilder = null,
AuthorizationGate? authzGate = null,
NodeScopeResolver? scopeResolver = null)
NodeScopeResolver? scopeResolver = null,
StaleConfigFlag? staleConfigFlag = null)
{
_options = options;
_driverHost = driverHost;
@@ -44,6 +47,7 @@ public sealed class OpcUaApplicationHost : IAsyncDisposable
_pipelineBuilder = pipelineBuilder ?? new DriverResiliencePipelineBuilder();
_authzGate = authzGate;
_scopeResolver = scopeResolver;
_staleConfigFlag = staleConfigFlag;
_loggerFactory = loggerFactory;
_logger = logger;
}
@@ -84,6 +88,7 @@ public sealed class OpcUaApplicationHost : IAsyncDisposable
_healthHost = new HealthEndpointsHost(
_driverHost,
_loggerFactory.CreateLogger<HealthEndpointsHost>(),
usingStaleConfig: _staleConfigFlag is null ? null : () => _staleConfigFlag.IsStale,
prefix: _options.HealthEndpointsPrefix);
_healthHost.Start();
}

View File

@@ -0,0 +1,96 @@
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Server.Redundancy;
/// <summary>
/// Pure-function mapper from the shared config DB's <see cref="ServerCluster"/> +
/// <see cref="ClusterNode"/> rows to an immutable <see cref="RedundancyTopology"/>.
/// Validates Phase 6.3 Stream A.1 invariants and throws
/// <see cref="InvalidTopologyException"/> on violation so the coordinator can fail startup
/// fast with a clear message rather than boot into an ambiguous state.
/// </summary>
/// <remarks>
/// Stateless — the caller owns the DB round-trip + hands rows in. Keeping it pure makes
/// the invariant matrix testable without EF or SQL Server.
/// </remarks>
public static class ClusterTopologyLoader
{
/// <summary>Build a topology snapshot for the given self node. Throws on invariant violation.</summary>
public static RedundancyTopology Load(string selfNodeId, ServerCluster cluster, IReadOnlyList<ClusterNode> nodes)
{
ArgumentException.ThrowIfNullOrWhiteSpace(selfNodeId);
ArgumentNullException.ThrowIfNull(cluster);
ArgumentNullException.ThrowIfNull(nodes);
ValidateClusterShape(cluster, nodes);
ValidateUniqueApplicationUris(nodes);
ValidatePrimaryCount(cluster, nodes);
var self = nodes.FirstOrDefault(n => string.Equals(n.NodeId, selfNodeId, StringComparison.OrdinalIgnoreCase))
?? throw new InvalidTopologyException(
$"Self node '{selfNodeId}' is not a member of cluster '{cluster.ClusterId}'. " +
$"Members: {string.Join(", ", nodes.Select(n => n.NodeId))}.");
var peers = nodes
.Where(n => !string.Equals(n.NodeId, selfNodeId, StringComparison.OrdinalIgnoreCase))
.Select(n => new RedundancyPeer(
NodeId: n.NodeId,
Role: n.RedundancyRole,
Host: n.Host,
OpcUaPort: n.OpcUaPort,
DashboardPort: n.DashboardPort,
ApplicationUri: n.ApplicationUri))
.ToList();
return new RedundancyTopology(
ClusterId: cluster.ClusterId,
SelfNodeId: self.NodeId,
SelfRole: self.RedundancyRole,
Mode: cluster.RedundancyMode,
Peers: peers,
SelfApplicationUri: self.ApplicationUri);
}
private static void ValidateClusterShape(ServerCluster cluster, IReadOnlyList<ClusterNode> nodes)
{
if (nodes.Count == 0)
throw new InvalidTopologyException($"Cluster '{cluster.ClusterId}' has zero nodes.");
// Decision #83 — v2.0 caps clusters at two nodes.
if (nodes.Count > 2)
throw new InvalidTopologyException(
$"Cluster '{cluster.ClusterId}' has {nodes.Count} nodes. v2.0 supports at most 2 nodes per cluster (decision #83).");
// Every node must belong to the given cluster.
var wrongCluster = nodes.FirstOrDefault(n =>
!string.Equals(n.ClusterId, cluster.ClusterId, StringComparison.OrdinalIgnoreCase));
if (wrongCluster is not null)
throw new InvalidTopologyException(
$"Node '{wrongCluster.NodeId}' belongs to cluster '{wrongCluster.ClusterId}', not '{cluster.ClusterId}'.");
}
private static void ValidateUniqueApplicationUris(IReadOnlyList<ClusterNode> nodes)
{
var dup = nodes
.GroupBy(n => n.ApplicationUri, StringComparer.Ordinal)
.FirstOrDefault(g => g.Count() > 1);
if (dup is not null)
throw new InvalidTopologyException(
$"Nodes {string.Join(", ", dup.Select(n => n.NodeId))} share ApplicationUri '{dup.Key}'. " +
$"OPC UA Part 4 requires unique ApplicationUri per server — clients pin trust here (decision #86).");
}
private static void ValidatePrimaryCount(ServerCluster cluster, IReadOnlyList<ClusterNode> nodes)
{
// Standalone mode: any role is fine. Warm / Hot: at most one Primary per cluster.
if (cluster.RedundancyMode == RedundancyMode.None) return;
var primaries = nodes.Count(n => n.RedundancyRole == RedundancyRole.Primary);
if (primaries > 1)
throw new InvalidTopologyException(
$"Cluster '{cluster.ClusterId}' has {primaries} Primary nodes in redundancy mode {cluster.RedundancyMode}. " +
$"At most one Primary per cluster (decision #84). Runtime detects and demotes both to ServiceLevel 2 " +
$"per the 8-state matrix; startup fails fast to surface the misconfiguration earlier.");
}
}

View File

@@ -0,0 +1,107 @@
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Server.Redundancy;
/// <summary>
/// Process-singleton holder of the current <see cref="RedundancyTopology"/>. Reads the
/// shared config DB at <see cref="InitializeAsync"/> time + re-reads on
/// <see cref="RefreshAsync"/> (called after <c>sp_PublishGeneration</c> completes so
/// operator role-swaps take effect without a process restart).
/// </summary>
/// <remarks>
/// <para>Per Phase 6.3 Stream A.1-A.2. The coordinator is the source of truth for the
/// <see cref="ServiceLevelCalculator"/> inputs: role (from topology), peer reachability
/// (from peer-probe loops — Stream B.1/B.2 follow-up), apply-in-progress (from
/// <see cref="ApplyLeaseRegistry"/>), topology-valid (from invariant checks at load time
/// + runtime detection of conflicting peer claims).</para>
///
/// <para>Topology refresh is CAS-style: a new <see cref="RedundancyTopology"/> instance
/// replaces the old one atomically via <see cref="Interlocked.Exchange{T}"/>. Readers
/// always see a coherent snapshot — never a partial transition.</para>
/// </remarks>
public sealed class RedundancyCoordinator
{
private readonly IDbContextFactory<OtOpcUaConfigDbContext> _dbContextFactory;
private readonly ILogger<RedundancyCoordinator> _logger;
private readonly string _selfNodeId;
private readonly string _selfClusterId;
private RedundancyTopology? _current;
private bool _topologyValid = true;
public RedundancyCoordinator(
IDbContextFactory<OtOpcUaConfigDbContext> dbContextFactory,
ILogger<RedundancyCoordinator> logger,
string selfNodeId,
string selfClusterId)
{
ArgumentException.ThrowIfNullOrWhiteSpace(selfNodeId);
ArgumentException.ThrowIfNullOrWhiteSpace(selfClusterId);
_dbContextFactory = dbContextFactory;
_logger = logger;
_selfNodeId = selfNodeId;
_selfClusterId = selfClusterId;
}
/// <summary>Last-loaded topology; null before <see cref="InitializeAsync"/> completes.</summary>
public RedundancyTopology? Current => Volatile.Read(ref _current);
/// <summary>
/// True when the last load/refresh completed without an invariant violation; false
/// forces <see cref="ServiceLevelCalculator"/> into the <see cref="ServiceLevelBand.InvalidTopology"/>
/// band regardless of other inputs.
/// </summary>
public bool IsTopologyValid => Volatile.Read(ref _topologyValid);
/// <summary>Load the topology for the first time. Throws on invariant violation.</summary>
public async Task InitializeAsync(CancellationToken ct)
{
await RefreshInternalAsync(throwOnInvalid: true, ct).ConfigureAwait(false);
}
/// <summary>
/// Re-read the topology from the shared DB. Called after <c>sp_PublishGeneration</c>
/// completes or after an Admin-triggered role-swap. Never throws — on invariant
/// violation it logs + flips <see cref="IsTopologyValid"/> false so the calculator
/// returns <see cref="ServiceLevelBand.InvalidTopology"/> = 2.
/// </summary>
public async Task RefreshAsync(CancellationToken ct)
{
await RefreshInternalAsync(throwOnInvalid: false, ct).ConfigureAwait(false);
}
private async Task RefreshInternalAsync(bool throwOnInvalid, CancellationToken ct)
{
await using var db = await _dbContextFactory.CreateDbContextAsync(ct).ConfigureAwait(false);
var cluster = await db.ServerClusters.AsNoTracking()
.FirstOrDefaultAsync(c => c.ClusterId == _selfClusterId, ct).ConfigureAwait(false)
?? throw new InvalidTopologyException($"Cluster '{_selfClusterId}' not found in config DB.");
var nodes = await db.ClusterNodes.AsNoTracking()
.Where(n => n.ClusterId == _selfClusterId && n.Enabled)
.ToListAsync(ct).ConfigureAwait(false);
try
{
var topology = ClusterTopologyLoader.Load(_selfNodeId, cluster, nodes);
Volatile.Write(ref _current, topology);
Volatile.Write(ref _topologyValid, true);
_logger.LogInformation(
"Redundancy topology loaded: cluster={Cluster} self={Self} role={Role} mode={Mode} peers={PeerCount}",
topology.ClusterId, topology.SelfNodeId, topology.SelfRole, topology.Mode, topology.PeerCount);
}
catch (InvalidTopologyException ex)
{
Volatile.Write(ref _topologyValid, false);
_logger.LogError(ex,
"Redundancy topology invariant violation for cluster {Cluster}: {Reason}",
_selfClusterId, ex.Message);
if (throwOnInvalid) throw;
}
}
}

View File

@@ -0,0 +1,55 @@
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Server.Redundancy;
/// <summary>
/// Snapshot of the cluster topology the <see cref="RedundancyCoordinator"/> holds. Read
/// once at startup + refreshed on publish-generation notification. Immutable — every
/// refresh produces a new instance so observers can compare identity-equality to detect
/// topology change.
/// </summary>
/// <remarks>
/// Per Phase 6.3 Stream A.1. Invariants enforced by the loader (see
/// <see cref="ClusterTopologyLoader"/>): at most one Primary per cluster for
/// WarmActive/Hot redundancy modes; every node has a unique ApplicationUri (OPC UA
/// Part 4 requirement — clients pin trust here); at most 2 nodes total per cluster
/// (decision #83).
/// </remarks>
public sealed record RedundancyTopology(
string ClusterId,
string SelfNodeId,
RedundancyRole SelfRole,
RedundancyMode Mode,
IReadOnlyList<RedundancyPeer> Peers,
string SelfApplicationUri)
{
/// <summary>Peer count — 0 for a standalone (single-node) cluster, 1 for v2 two-node clusters.</summary>
public int PeerCount => Peers.Count;
/// <summary>
/// ServerUriArray shape per OPC UA Part 4 §6.6.2.2 — self first, peers in stable
/// deterministic order (lexicographic by NodeId), self's ApplicationUri always at index 0.
/// </summary>
public IReadOnlyList<string> ServerUriArray() =>
new[] { SelfApplicationUri }
.Concat(Peers.OrderBy(p => p.NodeId, StringComparer.OrdinalIgnoreCase).Select(p => p.ApplicationUri))
.ToList();
}
/// <summary>One peer in the cluster (every node other than self).</summary>
/// <param name="NodeId">Peer's stable logical NodeId (e.g. <c>"LINE3-OPCUA-B"</c>).</param>
/// <param name="Role">Peer's declared redundancy role from the shared config DB.</param>
/// <param name="Host">Peer's hostname / IP — drives the health-probe target.</param>
/// <param name="OpcUaPort">Peer's OPC UA endpoint port.</param>
/// <param name="DashboardPort">Peer's dashboard / health-endpoint port.</param>
/// <param name="ApplicationUri">Peer's declared ApplicationUri (carried in <see cref="RedundancyTopology.ServerUriArray"/>).</param>
public sealed record RedundancyPeer(
string NodeId,
RedundancyRole Role,
string Host,
int OpcUaPort,
int DashboardPort,
string ApplicationUri);
/// <summary>Thrown when the loader detects a topology-invariant violation at startup or refresh.</summary>
public sealed class InvalidTopologyException(string message) : Exception(message);

View File

@@ -0,0 +1,100 @@
using System.Text.Json;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Logging;
using ZB.MOM.WW.OtOpcUa.Configuration.LocalCache;
namespace ZB.MOM.WW.OtOpcUa.Server;
/// <summary>
/// Phase 6.1 Stream D consumption hook — bootstraps the node's current generation through
/// the <see cref="ResilientConfigReader"/> pipeline + writes every successful central-DB
/// read into the <see cref="GenerationSealedCache"/> so the next cache-miss path has a
/// sealed snapshot to fall back to.
/// </summary>
/// <remarks>
/// <para>Alongside the original <see cref="NodeBootstrap"/> (which uses the single-file
/// <see cref="ILocalConfigCache"/>). Program.cs can switch to this one once operators are
/// ready for the generation-sealed semantics. The original stays for backward compat
/// with the three integration tests that construct <see cref="NodeBootstrap"/> directly.</para>
///
/// <para>Closes release blocker #2 in <c>docs/v2/v2-release-readiness.md</c> — the
/// generation-sealed cache + resilient reader + stale-config flag ship as unit-tested
/// primitives in PR #81 but no production path consumed them until this wrapper.</para>
/// </remarks>
public sealed class SealedBootstrap
{
private readonly NodeOptions _options;
private readonly GenerationSealedCache _cache;
private readonly ResilientConfigReader _reader;
private readonly StaleConfigFlag _staleFlag;
private readonly ILogger<SealedBootstrap> _logger;
public SealedBootstrap(
NodeOptions options,
GenerationSealedCache cache,
ResilientConfigReader reader,
StaleConfigFlag staleFlag,
ILogger<SealedBootstrap> logger)
{
_options = options;
_cache = cache;
_reader = reader;
_staleFlag = staleFlag;
_logger = logger;
}
/// <summary>
/// Resolve the current generation for this node. Routes the central-DB fetch through
/// <see cref="ResilientConfigReader"/> (timeout → retry → fallback-to-cache) + seals a
/// fresh snapshot on every successful DB read so a future cache-miss has something to
/// serve.
/// </summary>
public async Task<BootstrapResult> LoadCurrentGenerationAsync(CancellationToken ct)
{
return await _reader.ReadAsync(
_options.ClusterId,
centralFetch: async innerCt => await FetchFromCentralAsync(innerCt).ConfigureAwait(false),
fromSnapshot: snap => BootstrapResult.FromCache(snap.GenerationId),
ct).ConfigureAwait(false);
}
private async ValueTask<BootstrapResult> FetchFromCentralAsync(CancellationToken ct)
{
await using var conn = new SqlConnection(_options.ConfigDbConnectionString);
await conn.OpenAsync(ct).ConfigureAwait(false);
await using var cmd = conn.CreateCommand();
cmd.CommandText = "EXEC dbo.sp_GetCurrentGenerationForCluster @NodeId=@n, @ClusterId=@c";
cmd.Parameters.AddWithValue("@n", _options.NodeId);
cmd.Parameters.AddWithValue("@c", _options.ClusterId);
await using var reader = await cmd.ExecuteReaderAsync(ct).ConfigureAwait(false);
if (!await reader.ReadAsync(ct).ConfigureAwait(false))
{
_logger.LogWarning("Cluster {Cluster} has no Published generation yet", _options.ClusterId);
return BootstrapResult.EmptyFromDb();
}
var generationId = reader.GetInt64(0);
_logger.LogInformation("Bootstrapped from central DB: generation {GenerationId}; sealing snapshot", generationId);
// Seal a minimal snapshot with the generation pointer. A richer snapshot that carries
// the full sp_GetGenerationContent payload lands when the bootstrap flow grows to
// consume the content during offline operation (separate follow-up — see decision #148
// and phase-6-1 Stream D.3). The pointer alone is enough for the fallback path to
// surface the last-known-good generation id + flip UsingStaleConfig.
await _cache.SealAsync(new GenerationSnapshot
{
ClusterId = _options.ClusterId,
GenerationId = generationId,
CachedAt = DateTime.UtcNow,
PayloadJson = JsonSerializer.Serialize(new { generationId, source = "sp_GetCurrentGenerationForCluster" }),
}, ct).ConfigureAwait(false);
// StaleConfigFlag bookkeeping: ResilientConfigReader.MarkFresh on the returning call
// path; we're on the fresh branch so we don't touch the flag here.
_ = _staleFlag; // held so the field isn't flagged unused
return BootstrapResult.FromDb(generationId);
}
}

View File

@@ -0,0 +1,163 @@
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
using ZB.MOM.WW.OtOpcUa.Server.Redundancy;
namespace ZB.MOM.WW.OtOpcUa.Server.Tests;
[Trait("Category", "Unit")]
public sealed class ClusterTopologyLoaderTests
{
private static ServerCluster Cluster(RedundancyMode mode = RedundancyMode.Warm) => new()
{
ClusterId = "c1",
Name = "Warsaw-West",
Enterprise = "zb",
Site = "warsaw-west",
RedundancyMode = mode,
CreatedBy = "test",
};
private static ClusterNode Node(string id, RedundancyRole role, string host, int port = 4840, string? appUri = null) => new()
{
NodeId = id,
ClusterId = "c1",
RedundancyRole = role,
Host = host,
OpcUaPort = port,
ApplicationUri = appUri ?? $"urn:{host}:OtOpcUa",
CreatedBy = "test",
};
[Fact]
public void SingleNode_Standalone_Loads()
{
var cluster = Cluster(RedundancyMode.None);
var nodes = new[] { Node("A", RedundancyRole.Standalone, "hostA") };
var topology = ClusterTopologyLoader.Load("A", cluster, nodes);
topology.SelfNodeId.ShouldBe("A");
topology.SelfRole.ShouldBe(RedundancyRole.Standalone);
topology.Peers.ShouldBeEmpty();
topology.SelfApplicationUri.ShouldBe("urn:hostA:OtOpcUa");
}
[Fact]
public void TwoNode_Cluster_LoadsSelfAndPeer()
{
var cluster = Cluster();
var nodes = new[]
{
Node("A", RedundancyRole.Primary, "hostA"),
Node("B", RedundancyRole.Secondary, "hostB"),
};
var topology = ClusterTopologyLoader.Load("A", cluster, nodes);
topology.SelfNodeId.ShouldBe("A");
topology.SelfRole.ShouldBe(RedundancyRole.Primary);
topology.Peers.Count.ShouldBe(1);
topology.Peers[0].NodeId.ShouldBe("B");
topology.Peers[0].Role.ShouldBe(RedundancyRole.Secondary);
}
[Fact]
public void ServerUriArray_Puts_Self_First_Peers_SortedLexicographically()
{
var cluster = Cluster();
var nodes = new[]
{
Node("A", RedundancyRole.Primary, "hostA", appUri: "urn:A"),
Node("B", RedundancyRole.Secondary, "hostB", appUri: "urn:B"),
};
var topology = ClusterTopologyLoader.Load("A", cluster, nodes);
topology.ServerUriArray().ShouldBe(["urn:A", "urn:B"]);
}
[Fact]
public void EmptyNodes_Throws()
{
Should.Throw<InvalidTopologyException>(
() => ClusterTopologyLoader.Load("A", Cluster(), []));
}
[Fact]
public void SelfNotInCluster_Throws()
{
var nodes = new[] { Node("B", RedundancyRole.Primary, "hostB") };
Should.Throw<InvalidTopologyException>(
() => ClusterTopologyLoader.Load("A-missing", Cluster(), nodes));
}
[Fact]
public void ThreeNodeCluster_Rejected_Per_Decision83()
{
var nodes = new[]
{
Node("A", RedundancyRole.Primary, "hostA"),
Node("B", RedundancyRole.Secondary, "hostB"),
Node("C", RedundancyRole.Secondary, "hostC"),
};
var ex = Should.Throw<InvalidTopologyException>(
() => ClusterTopologyLoader.Load("A", Cluster(), nodes));
ex.Message.ShouldContain("decision #83");
}
[Fact]
public void DuplicateApplicationUri_Rejected()
{
var nodes = new[]
{
Node("A", RedundancyRole.Primary, "hostA", appUri: "urn:shared"),
Node("B", RedundancyRole.Secondary, "hostB", appUri: "urn:shared"),
};
var ex = Should.Throw<InvalidTopologyException>(
() => ClusterTopologyLoader.Load("A", Cluster(), nodes));
ex.Message.ShouldContain("ApplicationUri");
}
[Fact]
public void TwoPrimaries_InWarmMode_Rejected()
{
var nodes = new[]
{
Node("A", RedundancyRole.Primary, "hostA"),
Node("B", RedundancyRole.Primary, "hostB"),
};
var ex = Should.Throw<InvalidTopologyException>(
() => ClusterTopologyLoader.Load("A", Cluster(RedundancyMode.Warm), nodes));
ex.Message.ShouldContain("2 Primary");
}
[Fact]
public void CrossCluster_Node_Rejected()
{
var foreign = Node("B", RedundancyRole.Secondary, "hostB");
foreign.ClusterId = "c-other";
var nodes = new[] { Node("A", RedundancyRole.Primary, "hostA"), foreign };
Should.Throw<InvalidTopologyException>(
() => ClusterTopologyLoader.Load("A", Cluster(), nodes));
}
[Fact]
public void None_Mode_Allows_Any_Role_Mix()
{
// Standalone clusters don't enforce Primary-count; operator can pick anything.
var cluster = Cluster(RedundancyMode.None);
var nodes = new[] { Node("A", RedundancyRole.Primary, "hostA") };
var topology = ClusterTopologyLoader.Load("A", cluster, nodes);
topology.Mode.ShouldBe(RedundancyMode.None);
}
}

View File

@@ -0,0 +1,133 @@
using Microsoft.Extensions.Logging.Abstractions;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Configuration.LocalCache;
namespace ZB.MOM.WW.OtOpcUa.Server.Tests;
/// <summary>
/// Integration-style tests for the Phase 6.1 Stream D consumption hook — they don't touch
/// SQL Server (the real SealedBootstrap does, via sp_GetCurrentGenerationForCluster), but
/// they exercise ResilientConfigReader + GenerationSealedCache + StaleConfigFlag end-to-end
/// by simulating central-DB outcomes through a direct ReadAsync call.
/// </summary>
[Trait("Category", "Integration")]
public sealed class SealedBootstrapIntegrationTests : IDisposable
{
private readonly string _root = Path.Combine(Path.GetTempPath(), $"otopcua-sealed-bootstrap-{Guid.NewGuid():N}");
public void Dispose()
{
try
{
if (!Directory.Exists(_root)) return;
foreach (var f in Directory.EnumerateFiles(_root, "*", SearchOption.AllDirectories))
File.SetAttributes(f, FileAttributes.Normal);
Directory.Delete(_root, recursive: true);
}
catch { /* best-effort */ }
}
[Fact]
public async Task CentralDbSuccess_SealsSnapshot_And_FlagFresh()
{
var cache = new GenerationSealedCache(_root);
var flag = new StaleConfigFlag();
var reader = new ResilientConfigReader(cache, flag, NullLogger<ResilientConfigReader>.Instance,
timeout: TimeSpan.FromSeconds(10));
// Simulate the SealedBootstrap fresh-path: central DB returns generation id 42; the
// bootstrap seals it + ResilientConfigReader marks the flag fresh.
var result = await reader.ReadAsync(
"c-a",
centralFetch: async _ =>
{
await cache.SealAsync(new GenerationSnapshot
{
ClusterId = "c-a",
GenerationId = 42,
CachedAt = DateTime.UtcNow,
PayloadJson = "{\"gen\":42}",
}, CancellationToken.None);
return (long?)42;
},
fromSnapshot: snap => (long?)snap.GenerationId,
CancellationToken.None);
result.ShouldBe(42);
flag.IsStale.ShouldBeFalse();
cache.TryGetCurrentGenerationId("c-a").ShouldBe(42);
}
[Fact]
public async Task CentralDbFails_FallsBackToSealedSnapshot_FlagStale()
{
var cache = new GenerationSealedCache(_root);
var flag = new StaleConfigFlag();
var reader = new ResilientConfigReader(cache, flag, NullLogger<ResilientConfigReader>.Instance,
timeout: TimeSpan.FromSeconds(10), retryCount: 0);
// Seed a prior sealed snapshot (simulating a previous successful boot).
await cache.SealAsync(new GenerationSnapshot
{
ClusterId = "c-a", GenerationId = 37, CachedAt = DateTime.UtcNow,
PayloadJson = "{\"gen\":37}",
});
// Now simulate central DB down → fallback.
var result = await reader.ReadAsync(
"c-a",
centralFetch: _ => throw new InvalidOperationException("SQL dead"),
fromSnapshot: snap => (long?)snap.GenerationId,
CancellationToken.None);
result.ShouldBe(37);
flag.IsStale.ShouldBeTrue("cache fallback flips the /healthz flag");
}
[Fact]
public async Task NoSnapshot_AndCentralDown_Throws_ClearError()
{
var cache = new GenerationSealedCache(_root);
var flag = new StaleConfigFlag();
var reader = new ResilientConfigReader(cache, flag, NullLogger<ResilientConfigReader>.Instance,
timeout: TimeSpan.FromSeconds(10), retryCount: 0);
await Should.ThrowAsync<GenerationCacheUnavailableException>(async () =>
{
await reader.ReadAsync<long?>(
"c-a",
centralFetch: _ => throw new InvalidOperationException("SQL dead"),
fromSnapshot: snap => (long?)snap.GenerationId,
CancellationToken.None);
});
}
[Fact]
public async Task SuccessfulBootstrap_AfterFailure_ClearsStaleFlag()
{
var cache = new GenerationSealedCache(_root);
var flag = new StaleConfigFlag();
var reader = new ResilientConfigReader(cache, flag, NullLogger<ResilientConfigReader>.Instance,
timeout: TimeSpan.FromSeconds(10), retryCount: 0);
await cache.SealAsync(new GenerationSnapshot
{
ClusterId = "c-a", GenerationId = 1, CachedAt = DateTime.UtcNow, PayloadJson = "{}",
});
// Fallback serves snapshot → flag goes stale.
await reader.ReadAsync("c-a",
centralFetch: _ => throw new InvalidOperationException("dead"),
fromSnapshot: s => (long?)s.GenerationId,
CancellationToken.None);
flag.IsStale.ShouldBeTrue();
// Subsequent successful bootstrap clears it.
await reader.ReadAsync("c-a",
centralFetch: _ => ValueTask.FromResult((long?)5),
fromSnapshot: s => (long?)s.GenerationId,
CancellationToken.None);
flag.IsStale.ShouldBeFalse("next successful DB round-trip clears the flag");
}
}