Files
lmxopcua/docs/plans/2026-06-07-per-cluster-scoping.md
T
Joseph Doherty 7bec2fd4db docs(plan): per-ClusterId scoping implementation plan + task graph
10 tasks: runtime scoping (ResolveClusterScope + scoped ParseDriverInstances/
ParseComposition, DriverHostActor + OpcUaPublishActor wiring, multi-cluster
E2E) then docker-dev compose/traefik/seed rewrite, live verification, docs.
2026-06-07 02:59:46 -04:00

37 KiB
Raw Blame History

Per-ClusterId Scoping (hub-and-spoke single mesh) Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.

Goal: Let one central cluster's Admin UI deploy to multiple logically-separate clusters that share one Akka mesh, with each node applying only its own ClusterId's drivers + OPC UA address space.

Architecture: Approach A — node-side, parse-time filtering. Each node resolves its own ClusterId from the deployment artifact's ClusterNode rows (no extra DB query) and filters both the driver specs and the address-space composition to that cluster. The coordinator stays a single broadcast; every node applies its own slice and acks. A single-cluster artifact filters to nothing-different, so existing deployments + tests are unaffected.

Tech Stack: .NET 10, Akka.NET, EF Core, System.Text.Json (artifact parse), xUnit v2 + Shouldly (Runtime.Tests uses Akka.TestKit.Xunit2), Docker Compose + Traefik.

Design doc: docs/plans/2026-06-07-per-cluster-scoping-design.md (approved).

The fallback rule (single source of truth — implemented once in ResolveClusterScope):

  • artifact has ≤1 clusterNone (apply everything; legacy/single-cluster + all existing tests behave identically).
  • artifact has >1 cluster and the node's ClusterNode row is found → ScopeTo(clusterId).
  • artifact has >1 cluster and the node's row is not found → Suppress (apply nothing + the caller logs).

Hard rules (carry through every task): never git add . — stage by explicit path; never stage sql_login.txt or src/Server/.../pki/; never echo the gateway API key into a new tracked file; never force-push or skip hooks.


Task 1: ResolveClusterScope + node-scoped ParseDriverInstances

Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 6, Task 7, Task 8

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactTests.cs

Context: DeploymentArtifact is a static JSON decoder over the artifact blob produced by ConfigComposer.SnapshotAndFlattenAsync. The artifact root has Pascal-case arrays: Clusters (ServerCluster, has ClusterId), Nodes (ClusterNode, has NodeId + ClusterId), DriverInstances (has ClusterId), Namespaces/UnsAreas (have ClusterId), Equipment/Tags/UnsLines/ScriptedAlarms (no ClusterId — traced via DriverInstanceId/UnsAreaId/EquipmentId). DriverInstanceSpec already carries ClusterId (DeploymentArtifact.cs:19).

Step 1: Write the failing tests

Add to DeploymentArtifactTests.cs. Reuse the file's existing artifact-blob helper if present; otherwise add this minimal builder:

private static byte[] BlobOf(object snapshot) =>
    System.Text.Json.JsonSerializer.SerializeToUtf8Bytes(snapshot);

private static object MultiClusterSnapshot() => new
{
    Clusters = new[] { new { ClusterId = "MAIN" }, new { ClusterId = "SITE-A" } },
    Nodes = new[]
    {
        new { NodeId = "central-1:4053", ClusterId = "MAIN" },
        new { NodeId = "site-a-1:4053", ClusterId = "SITE-A" },
    },
    DriverInstances = new[]
    {
        new { DriverInstanceRowId = Guid.NewGuid(), DriverInstanceId = "main-galaxy", Name = "g", DriverType = "GalaxyMxGateway", Enabled = true, DriverConfig = "{}", ClusterId = "MAIN", NamespaceId = "main-ns" },
        new { DriverInstanceRowId = Guid.NewGuid(), DriverInstanceId = "sa-modbus",  Name = "m", DriverType = "Modbus",          Enabled = true, DriverConfig = "{}", ClusterId = "SITE-A", NamespaceId = "sa-ns" },
    },
};
[Fact]
public void ResolveClusterScope_single_cluster_artifact_returns_None()
{
    var blob = BlobOf(new { Clusters = new[] { new { ClusterId = "MAIN" } }, Nodes = Array.Empty<object>() });
    var scope = DeploymentArtifact.ResolveClusterScope(blob, "central-1:4053");
    scope.Mode.ShouldBe(ClusterFilterMode.None);
}

[Fact]
public void ResolveClusterScope_multi_cluster_known_node_scopes_to_its_cluster()
{
    var scope = DeploymentArtifact.ResolveClusterScope(BlobOf(MultiClusterSnapshot()), "site-a-1:4053");
    scope.Mode.ShouldBe(ClusterFilterMode.ScopeTo);
    scope.ClusterId.ShouldBe("SITE-A");
}

[Fact]
public void ResolveClusterScope_multi_cluster_unknown_node_suppresses()
{
    var scope = DeploymentArtifact.ResolveClusterScope(BlobOf(MultiClusterSnapshot()), "ghost-9:4053");
    scope.Mode.ShouldBe(ClusterFilterMode.Suppress);
}

[Fact]
public void ParseDriverInstances_scoped_returns_only_my_clusters_drivers()
{
    var specs = DeploymentArtifact.ParseDriverInstances(BlobOf(MultiClusterSnapshot()), "central-1:4053");
    specs.Select(s => s.DriverInstanceId).ShouldBe(new[] { "main-galaxy" });
}

[Fact]
public void ParseDriverInstances_scoped_unknown_node_returns_empty()
{
    var specs = DeploymentArtifact.ParseDriverInstances(BlobOf(MultiClusterSnapshot()), "ghost-9:4053");
    specs.ShouldBeEmpty();
}

[Fact]
public void ParseDriverInstances_scoped_single_cluster_returns_all()
{
    var blob = BlobOf(new
    {
        Clusters = new[] { new { ClusterId = "MAIN" } },
        Nodes = new[] { new { NodeId = "n1:4053", ClusterId = "MAIN" } },
        DriverInstances = new[] { new { DriverInstanceRowId = Guid.NewGuid(), DriverInstanceId = "d1", Name = "d", DriverType = "Modbus", Enabled = true, DriverConfig = "{}", ClusterId = "MAIN" } },
    });
    DeploymentArtifact.ParseDriverInstances(blob, "anything:4053").Select(s => s.DriverInstanceId).ShouldBe(new[] { "d1" });
}

Step 2: Run the tests — verify they fail

Run: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DeploymentArtifactTests" Expected: FAIL — ClusterFilterMode / ResolveClusterScope / the 2-arg ParseDriverInstances don't exist.

Step 3: Implement

In DeploymentArtifact.cs, add the scope types just above public static class DeploymentArtifact (top-level, same namespace):

/// <summary>How a node should scope a deployment artifact to its own ClusterId.</summary>
public enum ClusterFilterMode { None, ScopeTo, Suppress }

/// <summary>Resolved scoping decision for a node against an artifact.</summary>
/// <param name="Mode">None = apply everything (single-cluster / legacy); ScopeTo = filter to <paramref name="ClusterId"/>; Suppress = apply nothing.</param>
/// <param name="ClusterId">The node's ClusterId when <paramref name="Mode"/> is ScopeTo; otherwise null.</param>
public readonly record struct ClusterScope(ClusterFilterMode Mode, string? ClusterId);

Inside the class, add ResolveClusterScope and the 2-arg ParseDriverInstances overload (place after the existing ParseDriverInstances):

/// <summary>
///     Resolve how a node should scope a multi-cluster deployment artifact to its own logical
///     cluster, from the same consistent snapshot it applies (the artifact's ClusterNode rows map
///     NodeId → ClusterId; the ServerCluster count decides single- vs multi-cluster). Fallback rule:
///     ≤1 cluster ⇒ no filter (legacy single-cluster meshes + existing tests unchanged); &gt;1 cluster
///     with the node's row found ⇒ scope to that ClusterId; &gt;1 cluster with the row missing ⇒
///     suppress (apply nothing) — a node in a multi-cluster mesh with no ClusterNode row is
///     misconfigured and must not serve other clusters' data.
/// </summary>
/// <param name="blob">The deployment artifact blob.</param>
/// <param name="nodeId">This node's identity in "host:port" form (matches ClusterNode.NodeId).</param>
/// <returns>The scoping decision for this node.</returns>
public static ClusterScope ResolveClusterScope(ReadOnlySpan<byte> blob, string nodeId)
{
    if (blob.IsEmpty) return new ClusterScope(ClusterFilterMode.None, null);
    try
    {
        using var doc = JsonDocument.Parse(blob.ToArray());
        var root = doc.RootElement;
        var clusterCount = root.TryGetProperty("Clusters", out var cl) && cl.ValueKind == JsonValueKind.Array
            ? cl.GetArrayLength() : 0;
        if (clusterCount <= 1) return new ClusterScope(ClusterFilterMode.None, null);

        string? myCluster = null;
        if (root.TryGetProperty("Nodes", out var nodes) && nodes.ValueKind == JsonValueKind.Array)
        {
            foreach (var el in nodes.EnumerateArray())
            {
                if (el.ValueKind != JsonValueKind.Object) continue;
                var nid = el.TryGetProperty("NodeId", out var nEl) ? nEl.GetString() : null;
                if (!string.Equals(nid, nodeId, StringComparison.Ordinal)) continue;
                myCluster = el.TryGetProperty("ClusterId", out var cEl) ? cEl.GetString() : null;
                break;
            }
        }
        return string.IsNullOrWhiteSpace(myCluster)
            ? new ClusterScope(ClusterFilterMode.Suppress, null)
            : new ClusterScope(ClusterFilterMode.ScopeTo, myCluster);
    }
    catch (JsonException)
    {
        return new ClusterScope(ClusterFilterMode.None, null);
    }
}

/// <summary>Cluster-scoped overload: the driver specs a node should host given its NodeId.</summary>
/// <param name="blob">The deployment artifact blob.</param>
/// <param name="nodeId">This node's identity in "host:port" form.</param>
/// <returns>The filtered driver specs per the node's <see cref="ResolveClusterScope"/> decision.</returns>
public static IReadOnlyList<DriverInstanceSpec> ParseDriverInstances(ReadOnlySpan<byte> blob, string nodeId)
{
    var scope = ResolveClusterScope(blob, nodeId);
    var all = ParseDriverInstances(blob);
    return scope.Mode switch
    {
        ClusterFilterMode.Suppress => Array.Empty<DriverInstanceSpec>(),
        ClusterFilterMode.ScopeTo => all.Where(
            s => string.Equals(s.ClusterId, scope.ClusterId, StringComparison.Ordinal)).ToArray(),
        _ => all,
    };
}

Step 4: Run the tests — verify they pass

Run: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DeploymentArtifactTests" Expected: PASS (new + all pre-existing DeploymentArtifact tests).

Step 5: Commit

git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactTests.cs
git commit -m "feat(runtime): ClusterId scope resolution + node-scoped driver-spec parse"

Acceptance: ResolveClusterScope implements the 3-branch rule; the scoped ParseDriverInstances filters per the rule; the no-arg overload is untouched.


Task 2: Node-scoped ParseComposition (address-space filter)

Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 6, Task 7, Task 8

Blocked by: Task 1 (uses ClusterScope / ResolveClusterScope, same file).

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactTests.cs

Context: ParseComposition(blob) returns a Phase7CompositionResult whose projections carry no ClusterId. Filter by building in-cluster id sets from the raw artifact: DriverInstanceIds and UnsAreaIds whose row's ClusterId matches, plus EquipmentIds whose DriverInstanceId is in-cluster. Then filter each projection (areas by UnsAreaId, lines by UnsAreaId, equipment by EquipmentId, drivers/galaxyTags/equipmentTags by DriverInstanceId, alarms by EquipmentId).

Step 1: Write the failing tests

Extend MultiClusterSnapshot() from Task 1 with namespaces + tags so galaxy-tag filtering is exercised, then add the test:

private static object MultiClusterSnapshotWithTags() => new
{
    Clusters = new[] { new { ClusterId = "MAIN" }, new { ClusterId = "SITE-A" } },
    Nodes = new[]
    {
        new { NodeId = "central-1:4053", ClusterId = "MAIN" },
        new { NodeId = "site-a-1:4053",  ClusterId = "SITE-A" },
    },
    DriverInstances = new[]
    {
        new { DriverInstanceId = "main-galaxy", DriverType = "GalaxyMxGateway", DriverConfig = "{}", ClusterId = "MAIN",   NamespaceId = "main-ns" },
        new { DriverInstanceId = "sa-galaxy",   DriverType = "GalaxyMxGateway", DriverConfig = "{}", ClusterId = "SITE-A", NamespaceId = "sa-ns" },
    },
    Namespaces = new[]
    {
        new { NamespaceId = "main-ns", ClusterId = "MAIN",   Kind = 1 },
        new { NamespaceId = "sa-ns",   ClusterId = "SITE-A", Kind = 1 },
    },
    Tags = new[]
    {
        new { TagId = "t-main", DriverInstanceId = "main-galaxy", EquipmentId = (string?)null, Name = "M1", FolderPath = "F", DataType = "Boolean", TagConfig = "{}" },
        new { TagId = "t-sa",   DriverInstanceId = "sa-galaxy",   EquipmentId = (string?)null, Name = "S1", FolderPath = "F", DataType = "Boolean", TagConfig = "{}" },
    },
};

[Fact]
public void ParseComposition_scoped_keeps_only_my_clusters_drivers_and_tags()
{
    var blob = BlobOf(MultiClusterSnapshotWithTags());

    var main = DeploymentArtifact.ParseComposition(blob, "central-1:4053");
    main.DriverInstancePlans.Select(d => d.DriverInstanceId).ShouldBe(new[] { "main-galaxy" });
    main.GalaxyTags.Select(t => t.TagId).ShouldBe(new[] { "t-main" });

    var siteA = DeploymentArtifact.ParseComposition(blob, "site-a-1:4053");
    siteA.DriverInstancePlans.Select(d => d.DriverInstanceId).ShouldBe(new[] { "sa-galaxy" });
    siteA.GalaxyTags.Select(t => t.TagId).ShouldBe(new[] { "t-sa" });
}

[Fact]
public void ParseComposition_scoped_unknown_node_is_empty()
{
    var comp = DeploymentArtifact.ParseComposition(BlobOf(MultiClusterSnapshotWithTags()), "ghost-9:4053");
    comp.GalaxyTags.ShouldBeEmpty();
    comp.DriverInstancePlans.ShouldBeEmpty();
}

[Fact]
public void ParseComposition_single_cluster_node_id_overload_matches_legacy()
{
    var blob = BlobOf(new
    {
        Clusters = new[] { new { ClusterId = "MAIN" } },
        Nodes = new[] { new { NodeId = "n1:4053", ClusterId = "MAIN" } },
        DriverInstances = new[] { new { DriverInstanceId = "d1", DriverType = "Modbus", DriverConfig = "{}", ClusterId = "MAIN", NamespaceId = "ns" } },
    });
    DeploymentArtifact.ParseComposition(blob, "anything:4053").DriverInstancePlans.Count
        .ShouldBe(DeploymentArtifact.ParseComposition(blob).DriverInstancePlans.Count);
}

Step 2: Run — verify FAIL (2-arg ParseComposition missing): dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DeploymentArtifactTests"

Step 3: Implement

Add to DeploymentArtifact.cs (after the no-arg ParseComposition):

/// <summary>Cluster-scoped overload: the address-space composition a node should materialise given
/// its NodeId. Filters every projection to the node's own ClusterId (see <see cref="ResolveClusterScope"/>).</summary>
/// <param name="blob">The deployment artifact blob.</param>
/// <param name="nodeId">This node's identity in "host:port" form.</param>
/// <returns>The filtered composition per the node's scoping decision.</returns>
public static Phase7CompositionResult ParseComposition(ReadOnlySpan<byte> blob, string nodeId)
{
    var scope = ResolveClusterScope(blob, nodeId);
    if (scope.Mode == ClusterFilterMode.None) return ParseComposition(blob);
    if (scope.Mode == ClusterFilterMode.Suppress) return Empty();

    var full = ParseComposition(blob);
    var sets = BuildClusterSets(blob, scope.ClusterId!);
    return new Phase7CompositionResult(
        full.UnsAreas.Where(a => sets.AreaIds.Contains(a.UnsAreaId)).ToArray(),
        full.UnsLines.Where(l => sets.AreaIds.Contains(l.UnsAreaId)).ToArray(),
        full.EquipmentNodes.Where(e => sets.EquipmentIds.Contains(e.EquipmentId)).ToArray(),
        full.DriverInstancePlans.Where(d => sets.DriverIds.Contains(d.DriverInstanceId)).ToArray(),
        full.ScriptedAlarmPlans.Where(a => sets.EquipmentIds.Contains(a.EquipmentId)).ToArray(),
        full.GalaxyTags.Where(t => sets.DriverIds.Contains(t.DriverInstanceId)).ToArray())
    {
        EquipmentTags = full.EquipmentTags.Where(t => sets.DriverIds.Contains(t.DriverInstanceId)).ToArray(),
    };
}

private sealed record ClusterSets(HashSet<string> DriverIds, HashSet<string> AreaIds, HashSet<string> EquipmentIds);

/// <summary>Build the in-cluster id sets used to filter a composition: DriverInstanceIds + UnsAreaIds
/// that directly carry the ClusterId, plus EquipmentIds whose DriverInstanceId is in-cluster.</summary>
private static ClusterSets BuildClusterSets(ReadOnlySpan<byte> blob, string clusterId)
{
    var driverIds = new HashSet<string>(StringComparer.Ordinal);
    var areaIds = new HashSet<string>(StringComparer.Ordinal);
    var equipmentIds = new HashSet<string>(StringComparer.Ordinal);
    try
    {
        using var doc = JsonDocument.Parse(blob.ToArray());
        var root = doc.RootElement;
        CollectIdsWhereCluster(root, "DriverInstances", "DriverInstanceId", clusterId, driverIds);
        CollectIdsWhereCluster(root, "UnsAreas", "UnsAreaId", clusterId, areaIds);
        // Equipment carries no ClusterId — include it when its DriverInstanceId is in-cluster.
        if (root.TryGetProperty("Equipment", out var eq) && eq.ValueKind == JsonValueKind.Array)
        {
            foreach (var el in eq.EnumerateArray())
            {
                if (el.ValueKind != JsonValueKind.Object) continue;
                var di = el.TryGetProperty("DriverInstanceId", out var diEl) ? diEl.GetString() : null;
                var id = el.TryGetProperty("EquipmentId", out var idEl) ? idEl.GetString() : null;
                if (!string.IsNullOrWhiteSpace(id) && di is not null && driverIds.Contains(di))
                    equipmentIds.Add(id!);
            }
        }
    }
    catch (JsonException) { /* empty sets ⇒ nothing matches ⇒ empty composition */ }
    return new ClusterSets(driverIds, areaIds, equipmentIds);
}

private static void CollectIdsWhereCluster(
    JsonElement root, string arrayName, string idField, string clusterId, HashSet<string> into)
{
    if (!root.TryGetProperty(arrayName, out var arr) || arr.ValueKind != JsonValueKind.Array) return;
    foreach (var el in arr.EnumerateArray())
    {
        if (el.ValueKind != JsonValueKind.Object) continue;
        var cid = el.TryGetProperty("ClusterId", out var cEl) ? cEl.GetString() : null;
        if (!string.Equals(cid, clusterId, StringComparison.Ordinal)) continue;
        var id = el.TryGetProperty(idField, out var idEl) ? idEl.GetString() : null;
        if (!string.IsNullOrWhiteSpace(id)) into.Add(id!);
    }
}

Note: equipment is filtered via its DriverInstanceId (schema-guaranteed present for equipment-namespace rows). If a future schema allows equipment with a null DriverInstanceId, extend BuildClusterSets to also include equipment whose UnsLineId maps to an in-cluster UnsArea — out of scope here (the dev rig's sites are empty).

Step 4: Run — verify PASS (new + pre-existing tests): dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DeploymentArtifactTests"

Step 5: Commit

git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactTests.cs
git commit -m "feat(runtime): node-scoped ParseComposition filters address space by ClusterId"

Acceptance: scoped ParseComposition excludes cross-cluster projections; single-cluster + unknown-node behavior matches the rule; no-arg overload untouched.


Task 3: Wire driver-spawn + SubscribeBulk filtering into DriverHostActor

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: Task 4

Blocked by: Task 1, Task 2.

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:367 (ReconcileDrivers) and :432 (PushDesiredSubscriptions)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/ (add DriverHostActorClusterScopeTests.cs, or extend an existing DriverHostActor test if present)

Context: ReconcileDrivers (line 349) loads the artifact blob then calls ParseDriverInstances(blob). PushDesiredSubscriptions (line 412) calls ParseComposition(blob). Both run for normal applies (ApplyAndAck, line 311/321) and restart restore (RestoreApplied, line 393/395) — so changing these two call sites covers both paths. The ack (SendAck, line 314) fires unconditionally before the rebuild, so an empty/suppressed slice still acks — no ack change needed.

Step 1: Write the failing test

A TestKit test that a driver node in a multi-cluster artifact spawns only its cluster's drivers, and a node whose cluster has no drivers still reaches Applied (acks). Model it on the existing DriverHostActor tests in the same folder (reuse their in-memory DbContext + DispatchDeployment plumbing — inspect a sibling test for the exact harness helpers). Core assertions:

// Given a sealed deployment whose artifact has 2 clusters (MAIN: 1 driver, SITE-A: 1 driver)
// and a DriverHostActor whose _localNode = "site-a-1:4053":
//   - after DispatchDeployment, GetDiagnostics shows ONLY the SITE-A driver (not MAIN's)
//   - the node sends an Applied ApplyAck (convergence holds even though it ignored MAIN's driver)
// And a second actor with _localNode = "central-1:4053" shows ONLY the MAIN driver.

Step 2: Run — verify FAIL (node currently spawns both clusters' drivers): dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverHostActorClusterScope"

Step 3: Implement — two one-line call-site changes:

DriverHostActor.cs:367:

// before:
var specs = DeploymentArtifact.ParseDriverInstances(blob);
// after:
var specs = DeploymentArtifact.ParseDriverInstances(blob, _localNode.Value);

DriverHostActor.cs:432:

// before:
composition = DeploymentArtifact.ParseComposition(blob);
// after:
composition = DeploymentArtifact.ParseComposition(blob, _localNode.Value);

Step 4: Run — verify PASS, then the whole Runtime suite for no regression:

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests

Expected: new test green; all pre-existing tests green (single-cluster harnesses hit the None branch → unchanged).

Step 5: Commit

git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorClusterScopeTests.cs
git commit -m "feat(runtime): DriverHost spawns + subscribes only its own ClusterId's drivers"

Acceptance: a site node spawns only its cluster's drivers and still acks Applied with an empty slice; existing single-cluster tests stay green.


Task 4: Wire scoped composition into OpcUaPublishActor.HandleRebuild

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: Task 3

Blocked by: Task 2.

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:212 (HandleRebuild)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/OpcUa/OpcUaPublishActorRebuildTests.cs

Context: HandleRebuild (line ~210) loads the artifact then calls ParseComposition(artifact) and materialises via Phase7Applier. _localNode is NodeId? (line 46) — null on legacy/dev callers, so guard for null. The existing OpcUaPublishActorRebuildTests use a fake/inspectable sink — reuse that pattern.

Step 1: Write the failing test

// Build a 2-cluster artifact (MAIN galaxy tag t-main; SITE-A galaxy tag t-sa),
// seal it as a Deployment row in the test DbContext, construct an OpcUaPublishActor
// with _localNode = NodeId.Parse("site-a-1:4053") and an inspectable sink, send
// RebuildAddressSpace(correlation, depId), then assert the sink received ONLY the
// SITE-A variable/folders (t-sa) and NOT the MAIN ones (t-main).
// Mirror with _localNode = "central-1:4053" → only MAIN.

Step 2: Run — verify FAIL (sink currently gets both clusters' nodes).

Step 3: ImplementOpcUaPublishActor.cs:212:

// before:
var composition = DeploymentArtifact.ParseComposition(artifact);
// after: scope to this node's ClusterId when we know our identity; legacy/dev callers (null
// _localNode) keep the unscoped behaviour.
var composition = _localNode is { } ln
    ? DeploymentArtifact.ParseComposition(artifact, ln.Value)
    : DeploymentArtifact.ParseComposition(artifact);

Step 4: Run — verify PASS + full Runtime suite:

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests

Step 5: Commit

git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/OpcUa/OpcUaPublishActorRebuildTests.cs
git commit -m "feat(runtime): OPC UA rebuild materialises only the node's ClusterId slice"

Acceptance: a node materialises only its own cluster's address space; null _localNode keeps legacy behavior; existing rebuild tests stay green.


Task 5: Multi-cluster scoping E2E on the cluster harness

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: none

Blocked by: Task 3, Task 4.

Files:

  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests/ (add MultiClusterScopingTests.cs)
  • Possibly Modify: tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests/TwoNodeClusterHarness.cs (only if a 2-ClusterId seed helper is needed)

Context: TwoNodeClusterHarness boots an in-process 2-node cluster on an in-memory DB with a null OPC UA sink. It proves the deploy path end-to-end (compose → broadcast → apply → ack) but cannot assert a materialised tree (null sink). So this test asserts driver scoping through the real path: seed two ServerCluster rows + two ClusterNode rows (one per node, different ClusterId)

  • one DriverInstance per cluster, run one deployment, and assert via each node's GetDiagnostics that each node hosts only its own cluster's driver.

If the harness's seed helpers can't express two clusters cleanly, that's a plan defect — surface it; the unit + actor tests (Tasks 14) already cover the scoping logic, and Task 9 covers the live proof.

Step 1: Write the failing test (per the context above). Step 2: Run — verify FAIL. Step 3: No production code — this test passes once Tasks 3+4 are in. If it fails, the defect is in 3/4; fix there, not here. Step 4: Run — verify PASS: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests --filter "FullyQualifiedName~MultiClusterScoping" Step 5: Commit

git add tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests/MultiClusterScopingTests.cs
# add TwoNodeClusterHarness.cs ONLY if you modified it
git commit -m "test(integration): multi-cluster deploy scopes drivers per node"

Acceptance: one deploy over a 2-cluster mesh leaves each node hosting only its own cluster's driver.


Task 6: Rewrite docker-dev/docker-compose.yml for the single mesh

Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 1, Task 2, Task 7, Task 8

Files:

  • Modify: docker-dev/docker-compose.yml

Context: Today MAIN is 4 nodes (admin-a/admin-b admin + driver-a/driver-b driver) as one mesh; SITE-A/SITE-B are 2-node fused meshes with their own seeds. The anchor &otopcua-host (currently on admin-a) holds the shared build + env.

Steps:

  1. Replace the four MAIN services with two fused nodes:
    • central-1 (becomes the &otopcua-host anchor): OTOPCUA_ROLES: "admin,driver", ASPNETCORE_URLS: "http://+:9000", Cluster__PublicHostname: "central-1", Cluster__SeedNodes__0: "akka.tcp://otopcua@central-1:4053", Cluster__Roles__0: "admin", Cluster__Roles__1: "driver", keep all Security__* (Jwt/Ldap/DeployApiKey) + GALAXY_MXGW_API_KEY (keep the existing ${GALAXY_MXGW_API_KEY:-mxgw_otopcua2_...} default — do not introduce a new hardcoded key), ports: ["4840:4840"].
    • central-2: same env, Cluster__PublicHostname: "central-2", seed → central-1, ports: ["4841:4840"], depends_on: { sql: healthy, central-1: started }.
  2. Convert the four site services to driver-only, all seeding central-1:
    • For each of site-a-1, site-a-2, site-b-1, site-b-2: OTOPCUA_ROLES: "driver", Cluster__Roles__0: "driver" (remove the admin role + Cluster__Roles__1), keep Cluster__PublicHostname = own name, set Cluster__SeedNodes__0: "akka.tcp://otopcua@central-1:4053", remove ASPNETCORE_URLS + the Security__Jwt__* / Security__Ldap__* / Security__DeployApiKey block (driver-only nodes serve no UI and authenticate no users), keep the ConnectionStrings__ConfigDb
      • GALAXY_MXGW_API_KEY lines, keep their OPC UA ports (48424845), and add depends_on: { sql: healthy, central-1: started }.
  3. Update traefik.depends_on to [central-1, central-2] (drop the removed services).
  4. Rewrite the header comment block (lines ~140) to describe the single mesh, hub-and-spoke topology: one Akka mesh seeded by central-1; central-1/2 are admin,driver (the only UI + deploy singleton); site-* are driver-only members scoped by ClusterId; central UI at :9200 manages + deploys to all. Keep the existing accurate notes (SQL persistence, mesh isolation note now describes a single mesh, headless deploy via :9200/api/deployments).

Verify:

docker compose -f docker-dev/docker-compose.yml config --quiet && echo "compose OK"

Expected: compose OK.

Commit:

git add docker-dev/docker-compose.yml
git commit -m "feat(docker-dev): single-mesh hub-and-spoke (central-1/2 + driver-only sites)"

Acceptance: docker compose config parses; central nodes fused with UI; site nodes driver-only seeding central; no new hardcoded API key.


Task 7: Rewrite docker-dev/traefik-dynamic.yml (central-only route)

Classification: small Estimated implement time: ~3 min Parallelizable with: Task 1, Task 2, Task 6, Task 8

Files:

  • Modify: docker-dev/traefik-dynamic.yml

Steps:

  1. Keep the otopcua-admin router (PathPrefix(/)) + service, but point its loadBalancer.servers at http://central-1:9000 and http://central-2:9000.
  2. Delete the otopcua-site-a and otopcua-site-b routers and their services (driver-only sites serve no UI). Keep the sticky-cookie + /health/active healthcheck on the surviving otopcua-admin service.
  3. Update the file header comment to describe the single central UI route.

Verify: covered by Task 6's docker compose config (Traefik file is mounted, not parsed by compose) — sanity-check it's valid YAML by eye; the live bring-up in Task 9 is the real check.

Commit:

git add docker-dev/traefik-dynamic.yml
git commit -m "feat(docker-dev): Traefik routes only the central cluster UI"

Acceptance: one router → central-1/central-2; site routers/services removed.


Task 8: Rewrite docker-dev/seed/seed-clusters.sql (MAIN nodes → central-1/2)

Classification: small Estimated implement time: ~4 min Parallelizable with: Task 1, Task 2, Task 6, Task 7

Files:

  • Modify: docker-dev/seed/seed-clusters.sql

Context: The seed inserts 3 ServerCluster rows + 6 ClusterNode rows. MAIN's ClusterNode rows are currently driver-a:4053 / driver-b:4053. Sites already have no drivers/tags (only ServerCluster + ClusterNode), which matches the "empty sites" decision — leave them empty.

Steps:

  1. Change MAIN's two ClusterNode inserts from driver-a / driver-b to central-1 / central-2: NodeId central-1:4053 / central-2:4053, Host central-1 / central-2, ApplicationUri urn:OtOpcUa:central-1 / urn:OtOpcUa:central-2, keep OpcUaPort 4840, ServiceLevelBase 200/150. Update the IF NOT EXISTS ... WHERE NodeId = '...' guards to the new ids.
  2. Update the MAIN ServerCluster Notes to "central-1/central-2 fused admin+driver — UI + deploy singleton + MAIN OPC UA publishers."
  3. Update the SITE-A/SITE-B ServerCluster Notes to "2-node driver-only, managed by the central cluster over the shared mesh (empty until configured)."
  4. Update the file header comment block (the ClusterNode map at lines ~57) to central-1, central-2 → MAIN.
  5. Leave the Galaxy namespace/driver/tags (MAIN) and the LDAP→role mappings unchanged.

Verify: SQL isn't run locally on macOS (no SQL reachable); correctness is confirmed by the live bring-up in Task 9 (the cluster-seed job runs it). Eyeball that every changed NodeId/guard is consistent.

Commit:

git add docker-dev/seed/seed-clusters.sql
git commit -m "feat(docker-dev): seed MAIN ClusterNodes as central-1/central-2"

Acceptance: MAIN ClusterNode rows are central-1/central-2; sites keep their cluster + node rows with no drivers; notes/header updated.


Task 9: Live docker-dev verification

Classification: standard (verification — no subagent review needed) Estimated implement time: ~5 min (plus container build time) Parallelizable with: none

Blocked by: Task 3, Task 4, Task 5, Task 6, Task 7, Task 8.

Files: none (operational).

Steps (run from repo root on this Mac; Docker is local):

  1. Sync deployment + rebuild the image and bring the rig up:
    docker compose -f docker-dev/docker-compose.yml down
    docker compose -f docker-dev/docker-compose.yml up -d --build
    
  2. Confirm the mesh formed and the seed ran:
    docker compose -f docker-dev/docker-compose.yml ps
    docker compose -f docker-dev/docker-compose.yml logs cluster-seed --tail=40
    
    Expect 3 ServerCluster rows + 6 ClusterNode rows (central-1/2, site-a-1/2, site-b-1/2).
  3. Trigger a global deploy headlessly (no UI login needed — the deploy API is on the central admin nodes):
    curl -s -X POST http://localhost:9200/api/deployments \
      -H "X-Api-Key: docker-dev-deploy-key" -H "Content-Type: application/json" \
      -d '{"createdBy":"per-cluster-verify"}'
    
    Expect 202 + a deploymentId.
  4. Confirm scoping in the driver logs — central applies the Galaxy driver, sites apply empty:
    docker compose -f docker-dev/docker-compose.yml logs central-1 | grep -iE "applied deployment|galaxy|materialis"
    docker compose -f docker-dev/docker-compose.yml logs site-a-1 | grep -iE "applied deployment|galaxy|materialis"
    
    Expect central-1 to materialise the MAIN Galaxy tags; site-a-1 to apply with no Galaxy/MAIN nodes.
  5. Browse-check with the Client CLI:
    dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- browse -u opc.tcp://localhost:4840 -r -d 4   # central: Galaxy tree
    dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- browse -u opc.tcp://localhost:4842 -r -d 4   # site-a: empty (no MAIN tree)
    
    Expect the MAIN Galaxy hierarchy on :4840 and an empty address space on :4842 (NOT the merged tree).

Acceptance: the rig boots as one mesh; a single deploy populates the central OPC UA tree and leaves the site nodes empty (proving per-ClusterId scoping live). If anything regresses, stop and debug (do not paper over).


Task 10: Update docker-dev docs + memory

Classification: small Estimated implement time: ~4 min Parallelizable with: Task 9

Blocked by: Task 6 (final topology).

Files:

  • Modify: CLAUDE.md (the Docker Workflow / docker-dev references that mention the cluster layout)
  • Modify: docs/v2/dev-environment.md (if it describes the three-isolated-mesh topology)
  • Modify: /Users/dohertj2/.claude/projects/-Users-dohertj2-Desktop-OtOpcUa/memory/project_dev_environment.md + MEMORY.md pointer

Steps:

  1. In CLAUDE.md and docs/v2/dev-environment.md, update any description of the docker-dev topology from "three isolated meshes / MAIN admin-a+admin-b / site fused" to "single mesh, hub-and-spoke: central-1/central-2 fused admin+driver own the only UI + deploy singleton; site-* are driver-only members scoped by ClusterId; central UI at :9200 deploys to all." Update the OPC UA endpoint list (central-1 :4840, central-2 :4841, sites :4842:4845).
  2. Update the project_dev_environment.md memory's docker-dev section to match (node names, hub-and-spoke, "central UI deploys to all clusters"). Keep the one-line MEMORY.md pointer accurate.

Commit:

git add CLAUDE.md docs/v2/dev-environment.md
git commit -m "docs(docker-dev): document single-mesh hub-and-spoke topology"

(The memory files live outside the repo — write them with the memory workflow, not git.)

Acceptance: docs + memory describe the new topology accurately.


Done criteria

  • dotnet build ZB.MOM.WW.OtOpcUa.slnx clean.
  • dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests green (new scoping tests + no regressions) and ...Host.IntegrationTests multi-cluster test green.
  • The full pre-existing suite stays green (the ClusterCount ≤ 1 lenient branch guarantees single-cluster behavior is unchanged).
  • docker compose config parses; the live rig (Task 9) shows central serving the Galaxy tree and sites empty under one global deploy.

Out of scope (follow-ups)

  • Per-cluster deploy (deploy just SITE-A from the UI) — coordinator + UI work.
  • Seeding demo drivers on the sites — added via the central UI.