docs(docker-dev): document single-mesh hub-and-spoke topology

Rewrite docker-dev/README.md and update docker-dev/Dockerfile comment to reflect the new topology: one Akka mesh seeded by central-1/central-2 (fused admin+driver, MAIN cluster, single UI at http://localhost:9200), with site-a-*/site-b-* as driver-only members scoped by ClusterId. Removes all references to the old three-mesh layout (admin-a, admin-b, driver-a, driver-b, site-a.localhost, site-b.localhost).
2026-06-07 03:34:49 -04:00
parent ec9599e234
commit b88ae5db10
2 changed files with 39 additions and 37 deletions
@@ -1,6 +1,6 @@
 # docker-dev

-Mac-friendly multi-cluster OtOpcUa fleet for manual UI exercise + integration smoke tests. Spins up **three isolated Akka clusters** + SQL Server + Traefik on the same Compose network. All three clusters share the single `OtOpcUa` ConfigDb — multi-tenancy is enforced by per-row `ServerCluster.ClusterId` scoping. Akka.Cluster gossip stays isolated between meshes because their seed-node lists are disjoint, even though they share the same system name `otopcua`.
+Mac-friendly OtOpcUa fleet for manual UI exercise + integration smoke tests. Spins up **a single Akka mesh** (hub-and-spoke topology) + SQL Server + Traefik on the same Compose network. All six host nodes share the single `OtOpcUa` ConfigDb; logical separation between MAIN, SITE-A, and SITE-B is enforced by per-row `ServerCluster.ClusterId` scoping, not by mesh isolation.

 ## Stack

@@ -8,50 +8,48 @@ Mac-friendly multi-cluster OtOpcUa fleet for manual UI exercise + integration sm

 | Service | Role | Ports |
 |---|---|---|
-| `sql` | SQL Server 2022 — single `OtOpcUa` ConfigDb shared by all three clusters | host `14330` → container `1433` |
-| `traefik` | Routes :80 by Host header / PathPrefix | host `80`, dashboard `8089` |
+| `sql` | SQL Server 2022 — single `OtOpcUa` ConfigDb shared by all nodes | host `14330` → container `1433` |
+| `traefik` | Routes `:80` by PathPrefix to central admin nodes | host `80`, dashboard `8089` |

-Authentication uses the **shared GLAuth** on the Linux Docker host at `10.100.0.35:3893` (baseDN `dc=zb,dc=local`). Every host container binds that instance via `cn=serviceaccount,dc=zb,dc=local`. `DevStubMode` is **not** active. Sign in as `multi-role` / `password` to get all three OtOpcUa roles (Administrator, Designer, Viewer), or use any other shared test user with password `password`. Group→role mappings are seeded by `seed/seed-clusters.sql` (`OtOpcUa-Admins`→Administrator, `OtOpcUa-Designers`→Designer, `OtOpcUa-Viewers`→Viewer). The shared GLAuth source of truth and deploy runbook live in `scadaproj/infra/glauth/`.
+Authentication uses the **shared GLAuth** on the Linux Docker host at `10.100.0.35:3893` (baseDN `dc=zb,dc=local`). Only the central admin nodes authenticate users. Sign in as `multi-role` / `password` to get all three OtOpcUa roles (Administrator, Designer, Viewer), or use any other shared test user with password `password`. Group→role mappings are seeded by `seed/seed-clusters.sql` (`OtOpcUa-Admins`→Administrator, `OtOpcUa-Designers`→Designer, `OtOpcUa-Viewers`→Viewer). The shared GLAuth source of truth and deploy runbook live in `scadaproj/infra/glauth/`.

-### Main cluster — split admin/driver roles
+### Central nodes — fused admin+driver (MAIN cluster, UI + deploy singleton)

-| Service | Role | Ports |
+| Service | Roles | Ports |
 |---|---|---|
-| `admin-a` | `OTOPCUA_ROLES=admin`, cluster seed | internal `9000` |
-| `admin-b` | `OTOPCUA_ROLES=admin`, joins admin-a | internal `9000` |
-| `driver-a` | `OTOPCUA_ROLES=driver` | host `4840` → container `4840` |
-| `driver-b` | `OTOPCUA_ROLES=driver` | host `4841` → container `4840` |
+| `central-1` | `OTOPCUA_ROLES=admin,driver`, Akka mesh seed | host `4840` → container `4840`; internal `9000` |
+| `central-2` | `OTOPCUA_ROLES=admin,driver`, joins central-1 | host `4841` → container `4840`; internal `9000` |

-### Site A cluster — 2-node fused admin+driver
+`central-1` and `central-2` are the **only** nodes that host the Admin UI and the deploy singleton. They are also the OPC UA publishers for the MAIN cluster. Traefik routes all `PathPrefix(/)` traffic to whichever central node has the leader role.

-| Service | Role | Ports |
+### Site A nodes — driver-only (SITE-A cluster)
+
+| Service | Roles | Ports |
 |---|---|---|
-| `site-a-1` | `OTOPCUA_ROLES=admin,driver`, cluster seed | host `4842` → container `4840` |
-| `site-a-2` | `OTOPCUA_ROLES=admin,driver`, joins site-a-1 | host `4843` → container `4840` |
+| `site-a-1` | `OTOPCUA_ROLES=driver`, joins the single mesh | host `4842` → container `4840` |
+| `site-a-2` | `OTOPCUA_ROLES=driver`, joins the single mesh | host `4843` → container `4840` |

-### Site B cluster — 2-node fused admin+driver
+### Site B nodes — driver-only (SITE-B cluster)

-| Service | Role | Ports |
+| Service | Roles | Ports |
 |---|---|---|
-| `site-b-1` | `OTOPCUA_ROLES=admin,driver`, cluster seed | host `4844` → container `4840` |
-| `site-b-2` | `OTOPCUA_ROLES=admin,driver`, joins site-b-1 | host `4845` → container `4840` |
+| `site-b-1` | `OTOPCUA_ROLES=driver`, joins the single mesh | host `4844` → container `4840` |
+| `site-b-2` | `OTOPCUA_ROLES=driver`, joins the single mesh | host `4845` → container `4840` |

-All containers bind Akka remoting to port `4053` inside their own network namespace; the `PublicHostname` of each matches its Compose service name. Akka mesh isolation is enforced purely by disjoint seed lists. Configuration-side isolation is enforced by `ServerCluster.ClusterId` — see "Multi-tenancy" below.
+Site nodes serve no UI and authenticate no users. The central cluster manages and deploys to them over the shared Akka mesh. All six nodes bind Akka remoting to port `4053` inside their own network namespace; `PublicHostname` for each matches its Compose service name.

 ## Multi-tenancy

-All eight host nodes write to the same `OtOpcUa` ConfigDb. The `ServerCluster` table differentiates the three Akka meshes: each Akka cluster maps to one row, and each `ClusterNode` row's `ClusterId` ties the runtime node back to its owning cluster scope.
+All six host nodes write to the same `OtOpcUa` ConfigDb. The `ServerCluster` table differentiates the three logical clusters: each maps to one row, and each `ClusterNode` row's `ClusterId` ties the runtime node back to its owning cluster scope.

 A one-shot `cluster-seed` Compose service (image `mcr.microsoft.com/mssql-tools`) waits for SQL + the EF auto-migration to complete and then INSERTs the rows below. The seed is **idempotent** — `IF NOT EXISTS` guards every insert — so re-runs on `docker compose up` are no-ops:

-| Akka mesh | `ServerCluster.ClusterId` | `ClusterNode.NodeId` rows |
+| Logical cluster | `ServerCluster.ClusterId` | `ClusterNode.NodeId` rows |
 |---|---|---|
-| Main | `MAIN` | `driver-a`, `driver-b` (OPC UA publishers) |
+| Main | `MAIN` | `central-1`, `central-2` (OPC UA publishers + admin UI) |
 | Site A | `SITE-A` | `site-a-1`, `site-a-2` |
 | Site B | `SITE-B` | `site-b-1`, `site-b-2` |

-`ClusterNode` is the table for **OPC UA-publishing nodes** (not every Akka cluster member), which is why the main cluster's `admin-a` / `admin-b` don't get rows — they're control-plane-only.
-
 Each `ClusterNode.NodeId` matches the node's `Cluster__PublicHostname` env value (Compose service name) — that's the lookup the runtime uses to resolve its own membership. `ApplicationUri` follows the `urn:OtOpcUa:<NodeId>` convention.

 The SQL lives at `seed/seed-clusters.sql`; the wait-and-apply wrapper lives at `seed/entrypoint.sh`. To re-seed manually:
@@ -72,21 +70,24 @@ The DriverHost actor doesn't spawn drivers from raw DriverInstance rows on its o
 # from the repo root
 docker compose -f docker-dev/docker-compose.yml up -d --build

-# wait ~20 seconds for SQL to come up + all three clusters to form
+# wait ~20 seconds for SQL to come up + the mesh to form

-open http://localhost                      # main cluster admin UI
-open http://site-a.localhost               # site A admin UI
-open http://site-b.localhost               # site B admin UI
+open http://localhost:9200                 # Admin UI (Traefik → central-1 or central-2)
 open http://localhost:8089                 # Traefik dashboard
 ```

-On macOS, `*.localhost` resolves to `127.0.0.1` automatically. On Linux add `127.0.0.1 site-a.localhost site-b.localhost` to `/etc/hosts` if your resolver doesn't.
-
 The first build takes a few minutes (.NET SDK image + restore + publish). Subsequent rebuilds are faster with Docker's layer cache.

 ## Auth (dev only)

-All host containers authenticate against the shared GLAuth at `10.100.0.35:3893` (baseDN `dc=zb,dc=local`). `DevStubMode` is **not** active. Sign in with any test user (password `password`); `multi-role` / `password` returns all three roles (Administrator, Designer, Viewer). Group→role mappings are seeded by `seed/seed-clusters.sql`. The GLAuth source of truth + deploy runbook is in `scadaproj/infra/glauth/`. **Do not** enable `DevStubMode` outside local debugging — production must always bind a real LDAP backend.
+Central nodes authenticate against the shared GLAuth at `10.100.0.35:3893` (baseDN `dc=zb,dc=local`). `DevStubMode` is **not** active. Sign in with any test user (password `password`); `multi-role` / `password` returns all three roles (Administrator, Designer, Viewer). Group→role mappings are seeded by `seed/seed-clusters.sql`. The GLAuth source of truth + deploy runbook is in `scadaproj/infra/glauth/`. **Do not** enable `DevStubMode` outside local debugging — production must always bind a real LDAP backend.
+
+## Headless deploy
+
+```bash
+# Trigger a deployment through the central admin API
+curl -X POST http://localhost:9200/api/deployments \
+  -H "X-Api-Key: docker-dev-deploy-key"
+```

 ## Tear down

@@ -98,15 +99,16 @@ The `-v` drops the SQL volume; remove it to keep ConfigDb state across restarts.

 ## Failover smoke

-1. Watch the Traefik dashboard at `http://localhost:8089`. Both `admin-a` and `admin-b` should be listed as healthy in the `otopcua-admin` service.
-2. `docker compose -f docker-dev/docker-compose.yml stop admin-a` — `admin-b` should pick up the admin role-leader within ~15 s (Akka split-brain stable-after). Traefik will route traffic to `admin-b` once its `/health/active` returns 200.
-3. `docker compose -f docker-dev/docker-compose.yml start admin-a` — `admin-a` rejoins as a follower; `admin-b` keeps the leader role until something disturbs it.
+1. Watch the Traefik dashboard at `http://localhost:8089`. Both `central-1` and `central-2` should be listed as healthy in the `otopcua-admin` service.
+2. `docker compose -f docker-dev/docker-compose.yml stop central-1` — `central-2` should pick up the admin role-leader within ~15 s (Akka split-brain stable-after). Traefik will route traffic to `central-2` once its `/health/active` returns 200.
+3. `docker compose -f docker-dev/docker-compose.yml start central-1` — `central-1` rejoins as a follower; `central-2` keeps the leader role until something disturbs it.

 ## Notes

 - This compose is for the **local Mac/Linux developer rig**. The team's CI + soak runs go to the remote docker host at `10.100.0.35` (see `docs/v2/dev-environment.md`); the file there mirrors this one with adjusted port bindings.
- The OPC UA driver endpoints are reachable directly from the host (Traefik is only in front of the admin HTTP surface):
-  - Main: `opc.tcp://localhost:4840` (driver-a), `opc.tcp://localhost:4841` (driver-b)
+- The OPC UA endpoints are reachable directly from the host (Traefik is only in front of the admin HTTP surface):
+  - Main: `opc.tcp://localhost:4840` (central-1), `opc.tcp://localhost:4841` (central-2)
  - Site A: `opc.tcp://localhost:4842` (site-a-1), `opc.tcp://localhost:4843` (site-a-2)
  - Site B: `opc.tcp://localhost:4844` (site-b-1), `opc.tcp://localhost:4845` (site-b-2)
 - Galaxy + Wonderware drivers can't run in Linux containers (they need the Windows-only mxaccessgw + Historian SDK). On non-Windows, `DriverInstanceActor.ShouldStub(driverType, roles)` returns `true` for those types and the actor goes straight to a `Stubbed` state that returns deterministic success.
+- SQL persistence: ConfigDb state lives in a named Docker volume, so it survives container restarts. Tear down with `down -v` to drop the volume and start from a clean slate.
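
The hub-and-spoke layout documented in this change can be summarized as a small lookup table. The sketch below is illustrative only: the node names, `ClusterId` values, roles, host ports, and the `urn:OtOpcUa:<NodeId>` convention are taken from the README tables above, while the dictionary layout and the helper names (`opcua_endpoint`, `application_uri`, `nodes_in_cluster`, `admin_nodes`) are assumptions made for this example and are not part of the OtOpcUa codebase.

```python
"""Sketch of the docker-dev topology tables as data (illustrative only)."""

# Values are copied from the README tables in this change. The dict layout
# and the helper names below are assumptions made for this example.
NODES = {
    "central-1": {"cluster_id": "MAIN", "roles": ("admin", "driver"), "opcua_port": 4840},
    "central-2": {"cluster_id": "MAIN", "roles": ("admin", "driver"), "opcua_port": 4841},
    "site-a-1": {"cluster_id": "SITE-A", "roles": ("driver",), "opcua_port": 4842},
    "site-a-2": {"cluster_id": "SITE-A", "roles": ("driver",), "opcua_port": 4843},
    "site-b-1": {"cluster_id": "SITE-B", "roles": ("driver",), "opcua_port": 4844},
    "site-b-2": {"cluster_id": "SITE-B", "roles": ("driver",), "opcua_port": 4845},
}


def opcua_endpoint(node: str) -> str:
    """Return the host-side OPC UA endpoint URL for a Compose service name."""
    return f"opc.tcp://localhost:{NODES[node]['opcua_port']}"


def application_uri(node: str) -> str:
    """Return the ApplicationUri following the urn:OtOpcUa:<NodeId> convention."""
    return f"urn:OtOpcUa:{node}"


def nodes_in_cluster(cluster_id: str) -> list[str]:
    """Return the sorted node names scoped to one logical cluster."""
    return sorted(n for n, info in NODES.items() if info["cluster_id"] == cluster_id)


def admin_nodes() -> list[str]:
    """Return the sorted names of nodes that carry the admin role."""
    return sorted(n for n, info in NODES.items() if "admin" in info["roles"])


if __name__ == "__main__":
    # Print one line per node: name, logical cluster, endpoint, ApplicationUri.
    for name, info in NODES.items():
        print(name, info["cluster_id"], opcua_endpoint(name), application_uri(name))
```

All six entries belong to one Akka mesh; only the `cluster_id` field separates MAIN, SITE-A, and SITE-B, which mirrors the `ServerCluster.ClusterId` scoping described in the README.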