# docker-dev

Mac-friendly OtOpcUa fleet for manual UI exercise + integration smoke tests. Spins up **one single Akka mesh** (hub-and-spoke topology) + SQL Server + Traefik on the same Compose network. All six host nodes share the single `OtOpcUa` ConfigDb — logical separation between MAIN, SITE-A, and SITE-B is enforced by per-row `ServerCluster.ClusterId` scoping, not by mesh isolation.

## Stack

### Shared infrastructure

| Service | Role | Ports |
|---|---|---|
| `sql` | SQL Server 2022 — single `OtOpcUa` ConfigDb shared by all nodes | host `14330` → container `1433` |
| `traefik` | Routes `:80` by PathPrefix to central admin nodes | host `80`, dashboard `8089` |

Authentication uses the **shared GLAuth** on the Linux Docker host at `10.100.0.35:3893` (baseDN `dc=zb,dc=local`). Only the central admin nodes authenticate users. Sign in as `multi-role` / `password` to get all three OtOpcUa roles (Administrator, Designer, Viewer), or use any other shared test user with password `password`. Group→role mappings are seeded by `seed/seed-clusters.sql` (`OtOpcUa-Admins`→Administrator, `OtOpcUa-Designers`→Designer, `OtOpcUa-Viewers`→Viewer). The shared GLAuth source of truth and deploy runbook live in `scadaproj/infra/glauth/`.

### Central nodes — fused admin+driver (MAIN cluster, UI + deploy singleton)

| Service | Roles | Ports |
|---|---|---|
| `central-1` | `OTOPCUA_ROLES=admin,driver`, Akka mesh seed | host `4840` → container `4840`; internal `9000` |
| `central-2` | `OTOPCUA_ROLES=admin,driver`, joins central-1 | host `4841` → container `4840`; internal `9000` |

`central-1` and `central-2` are the **only** nodes that host the Admin UI and the deploy singleton. They are also the OPC UA publishers for the MAIN cluster. Traefik routes all `PathPrefix(/)` traffic to whichever central node has the leader role.

### Site A nodes — driver-only (SITE-A cluster)

| Service | Roles | Ports |
|---|---|---|
| `site-a-1` | `OTOPCUA_ROLES=driver`, joins the single mesh | host `4842` → container `4840` |
| `site-a-2` | `OTOPCUA_ROLES=driver`, joins the single mesh | host `4843` → container `4840` |

### Site B nodes — driver-only (SITE-B cluster)

| Service | Roles | Ports |
|---|---|---|
| `site-b-1` | `OTOPCUA_ROLES=driver`, joins the single mesh | host `4844` → container `4840` |
| `site-b-2` | `OTOPCUA_ROLES=driver`, joins the single mesh | host `4845` → container `4840` |

Site nodes serve no UI and authenticate no users. The central cluster manages and deploys to them over the shared Akka mesh. All six nodes bind Akka remoting to port `4053` inside their own network namespace; `PublicHostname` for each matches its Compose service name.

## Multi-tenancy

All six host nodes write to the same `OtOpcUa` ConfigDb. The `ServerCluster` table differentiates the three logical clusters: each maps to one row, and each `ClusterNode` row's `ClusterId` ties the runtime node back to its owning cluster scope.

Two one-shot Compose services bootstrap the DB on bring-up: `migrator` applies the EF Core migrations (so a fresh SQL volume gets the schema with no operator step — the host nodes deliberately do **not** auto-migrate, since production owns schema changes), then `cluster-seed` (image `mcr.microsoft.com/mssql-tools`) INSERTs the rows below. `cluster-seed` and every host node `depend_on` the `migrator` completing (`service_completed_successfully`), so the seed never races an in-progress migration. The seed is **idempotent** — `IF NOT EXISTS` guards every insert — so re-runs on `docker compose up` are no-ops:

| Logical cluster | `ServerCluster.ClusterId` | `ClusterNode.NodeId` rows |
|---|---|---|
| Main | `MAIN` | `central-1`, `central-2` (OPC UA publishers + admin UI) |
| Site A | `SITE-A` | `site-a-1`, `site-a-2` |
| Site B | `SITE-B` | `site-b-1`, `site-b-2` |

Each `ClusterNode.NodeId` matches the node's `Cluster__PublicHostname` env value (Compose service name) — that's the lookup the runtime uses to resolve its own membership. `ApplicationUri` follows the `urn:OtOpcUa:<NodeId>` convention.

The SQL lives at `seed/seed-clusters.sql`; the wait-and-apply wrapper lives at `seed/entrypoint.sh`. To re-seed manually:

```bash
docker compose -f docker-dev/docker-compose.yml run --rm cluster-seed
```

### Galaxy / MxAccess gateway

The seed also pre-creates a `SystemPlatform` Namespace + a `GalaxyMxGateway` DriverInstance in the MAIN cluster pointing at `http://10.100.0.48:5120`. The API key is resolved from the `GALAXY_MXGW_API_KEY` env var set on every driver-role container in compose; override via `GALAXY_MXGW_API_KEY=... docker compose up -d` to swap keys without editing the compose file.

The DriverHost actor doesn't spawn drivers from raw DriverInstance rows on its own — the v2 deploy lifecycle requires a *sealed Deployment* before drivers materialise. After first bring-up, sign in to the Admin UI and click **Deploy current configuration** on `/deployments` to compose the seeded rows into an artifact and dispatch it. The Galaxy driver instance will start its gRPC connection to the gateway on the next deploy ack.

## Bring up

```bash
# from the repo root
docker compose -f docker-dev/docker-compose.yml up -d --build

# the one-shot migrator + cluster-seed bootstrap the DB; watch the seed finish:
docker compose -f docker-dev/docker-compose.yml logs -f cluster-seed   # ^C once it prints "[cluster-seed] done."

open http://localhost:9200                 # Admin UI (Traefik → central-1 or central-2)
open http://localhost:8089                 # Traefik dashboard
```

The first build takes a few minutes (.NET SDK image + restore + publish). **No manual schema step is needed** — on a fresh SQL volume the one-shot `migrator` service applies the EF migrations (the host nodes deliberately don't auto-migrate, since production owns schema changes), then `cluster-seed` populates the cluster/namespace/driver rows. `cluster-seed` and the host nodes wait for the migrator via `service_completed_successfully`, so nothing races an in-progress migration. A plain `docker compose ... up -d` on an existing volume is a fast no-op for both — the named SQL volume keeps the schema + rows across restarts; only `down -v` wipes them, after which the next `up` re-migrates + re-seeds automatically.

## Auth (dev only)

Central nodes authenticate against the shared GLAuth at `10.100.0.35:3893` (baseDN `dc=zb,dc=local`). `DevStubMode` is **not** active. Sign in with any test user (password `password`); `multi-role` / `password` returns all three roles (Administrator, Designer, Viewer). Group→role mappings are seeded by `seed/seed-clusters.sql`. The GLAuth source of truth + deploy runbook is in `scadaproj/infra/glauth/`. **Do not** enable `DevStubMode` outside local debugging — production must always bind a real LDAP backend.

## Headless deploy

```bash
POST http://localhost:9200/api/deployments
X-Api-Key: docker-dev-deploy-key
```

## Tear down

```bash
docker compose -f docker-dev/docker-compose.yml down -v
```

The `-v` drops the SQL volume; remove it to keep ConfigDb state across restarts. There is no local LDAP volume — LDAP is the shared external GLAuth on `10.100.0.35:3893`.

## Failover smoke

1. Watch the Traefik dashboard at `http://localhost:8089`. Both `central-1` and `central-2` should be listed as healthy in the `otopcua-admin` service.
2. `docker compose -f docker-dev/docker-compose.yml stop central-1` — `central-2` should pick up the admin role-leader within ~15 s (Akka split-brain stable-after). Traefik will route traffic to `central-2` once its `/health/active` returns 200.
3. `docker compose -f docker-dev/docker-compose.yml start central-1` — `central-1` rejoins as a follower; `central-2` keeps the leader role until something disturbs it.

## Resource limits & dev logging

The full single-mesh stack (`central-1`/`central-2` + the four site nodes) can OOM-kill `central-1` on a loaded host. Two settings in the compose file guard against that:

- **EF Core + ASP.NET Core logs are pinned to `Warning`** on every host node (`Serilog__MinimumLevel__Override__Microsoft.EntityFrameworkCore` / `…Microsoft.AspNetCore` = `Warning`). The host logs via Serilog (`AddZbSerilog` → `ReadFrom.Configuration`), and in `Development` the default level is `Debug` — without these overrides every Deployment-poll emits an `Executed DbCommand` / `SELECT … FROM [Deployment]` line, flooding the Serilog pipeline and starving the Akka cluster heartbeat thread. Application + Akka log levels are left untouched, so this only silences the per-poll SQL chatter. To temporarily restore the SQL log flood for debugging, drop those two env vars (or set them back to `Information`) on the node you're inspecting.
- **Each host node has `mem_limit: 1g`** (`mem_reservation: 512m`). A quiet solo `central-1` measures ~357 MiB; the limit leaves headroom for the deploy/UI load and per-cluster driver subscriptions that push a fully-loaded node higher. The limit/reservation live on the `&otopcua-host` anchor, so all six host services inherit them; `sql`, `traefik`, and the one-shot `migrator`/`cluster-seed` are left unbounded.

The full six-node host stack therefore needs roughly **6 GiB** of Docker Desktop VM memory just for the host nodes (plus SQL Server's own footprint on top). On a constrained host, either raise the Docker Desktop VM memory or run fewer host services (e.g. just `central-1` + `central-2`, or a single central node) rather than the full mesh.

## Notes

- This compose is for the **local Mac/Linux developer rig**. The team's CI + soak runs go to the remote docker host at `10.100.0.35` (see `docs/v2/dev-environment.md`); the file there mirrors this one with adjusted port bindings.
- The OPC UA endpoints are reachable directly from the host (Traefik is only in front of the admin HTTP surface):
  - Main: `opc.tcp://localhost:4840` (central-1), `opc.tcp://localhost:4841` (central-2)
  - Site A: `opc.tcp://localhost:4842` (site-a-1), `opc.tcp://localhost:4843` (site-a-2)
  - Site B: `opc.tcp://localhost:4844` (site-b-1), `opc.tcp://localhost:4845` (site-b-2)
- Galaxy + Wonderware drivers can't run in Linux containers (they need the Windows-only mxaccessgw + Historian SDK). On non-Windows, `DriverInstanceActor.ShouldStub(driverType, roles)` returns `true` for those types and the actor goes straight to a `Stubbed` state that returns deterministic success.
- SQL persistence: ConfigDb state survives container restarts (named Docker volume). Drop the volume with `down -v` for a clean slate.