From 3d25ee50902b4ce1615350d3d4dadabbbcd3ef59 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 1 Jun 2026 06:23:53 -0400
Subject: [PATCH] docs(health): current-state x3 + GAPS + README

Code-verified current-state docs for OtOpcUa (three-tier full), ScadaBridge
(two-tier, no /healthz), and MxAccessGateway (bare liveness only / no probes).
GAPS backlog with P1 for MxGateway and convergence items for Akka status policy,
DB probe technique, and response writer. README with per-project status table.
---
 components/health/GAPS.md                     | 141 +++++++++++++
 components/health/README.md                   |  89 +++++++++
 .../current-state/mxaccessgw/CURRENT-STATE.md | 133 +++++++++++++
 .../current-state/otopcua/CURRENT-STATE.md    | 150 ++++++++++++++
 .../scadabridge/CURRENT-STATE.md              | 185 ++++++++++++++++++
 5 files changed, 698 insertions(+)
 create mode 100644 components/health/GAPS.md
 create mode 100644 components/health/README.md
 create mode 100644 components/health/current-state/mxaccessgw/CURRENT-STATE.md
 create mode 100644 components/health/current-state/otopcua/CURRENT-STATE.md
 create mode 100644 components/health/current-state/scadabridge/CURRENT-STATE.md

diff --git a/components/health/GAPS.md b/components/health/GAPS.md
new file mode 100644
index 0000000..e50474b
--- /dev/null
+++ b/components/health/GAPS.md
@@ -0,0 +1,141 @@
+# Health — gaps & adoption backlog
+
+Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
+reach the shared `ZB.MOM.WW.Health` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
+
+## Divergence vs spec
+
+### §1 Endpoint tiers
+
+| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| `/health/ready` (tag `ready`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
+| `/health/active` (tag `active`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
+| `/healthz` (bare process liveness) | ✅ present | ⛔ absent | ⛔ absent |
+| `/health/live` (non-standard) | — | ⛔ present (hardcoded `"Healthy"`, bypasses health-check pipeline) | — |
+
+→ **Gap T1 (P1):** MxAccessGateway has no standard health tiers. The existing `/health/live`
+  `MapGet` lambda must be replaced by `app.MapZbHealth()` + real probes.
+→ **Gap T2:** ScadaBridge lacks `/healthz`. `MapZbHealth()` adds it automatically.
+→ **Gap T3:** MxAccessGateway's `/health/live` uses a raw `MapGet` that bypasses the ASP.NET Core
+  health-check middleware — it does not participate in `IHealthCheckPublisher`, `HealthReport`, or
+  UI integration. Must be removed.
+
+### §2 Probe coverage
+
+| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| Database connectivity | ✅ `DatabaseHealthCheck` (query probe) | ⛔ none | ✅ `DatabaseHealthCheck` (`CanConnectAsync`) |
+| Akka cluster membership | ✅ `AkkaClusterHealthCheck` (2-way) | n/a (no Akka) | ✅ `AkkaClusterHealthCheck` (3-way) |
+| Active / leader node | ✅ `AdminRoleLeaderHealthCheck` (role-filtered) | n/a | ✅ `ActiveNodeHealthCheck` (role-less) |
+| Downstream gRPC dependency | ⛔ none | ⛔ none | ⛔ none |
+
+→ **Gap P1 (P1):** MxAccessGateway has zero probes — `AddHealthChecks()` at
+  `GatewayApplication.cs:61` is dead code. Minimum viable: a `GrpcDependencyHealthCheck`
+  targeting the x86 worker IPC channel.
+→ **Gap P2:** No project probes its downstream gRPC dependency. OtOpcUa should probe the
+  MxAccessGateway channel; MxAccessGateway should probe the worker IPC.
+→ **Gap P3:** Dead `AddHealthChecks()` in MxAccessGateway (`GatewayApplication.cs:61`) should be
+  removed or replaced — it currently implies health checks are configured when they are not.
+
+### §3 Akka status-policy divergence
+
+| Aspect | OtOpcUa | ScadaBridge |
+|---|---|---|
+| Probe implementation | Scans `State.Members` for self by address | Reads `SelfMember.Status` directly |
+| Joining status | Degraded (not in Members as Up) | Healthy |
+| Leaving/Exiting status | Degraded | Degraded |
+| Other (Removed, Down…) | Degraded | Unhealthy |
+| ActorSystem null guard | — (none; `ActorSystem` injected directly) | ✅ Degraded if null |
+
+The two implementations diverge in how they classify `Joining` (ScadaBridge calls it Healthy;
+OtOpcUa would see it as Degraded because `SelfMember` with status `Joining` would not appear as
+`Up` in the member scan). They also diverge in the Removed/Down classification (ScadaBridge
+Unhealthy, OtOpcUa Degraded).
+
+The shared `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` ships two presets to preserve both
+behaviors rather than forcing one onto the other:
+- **Default** — ScadaBridge's three-way policy (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded,
+  else Unhealthy)
+- **OtOpcUaCompat** — OtOpcUa's self-Up-among-members scan (found Up=Healthy, not found=Degraded)
+
+→ **Gap A1:** OtOpcUa adopts the `OtOpcUaCompat` preset; ScadaBridge adopts the `Default` preset.
+  Both preserve existing behavior without forcing convergence on a single policy.
+→ **Gap A2:** OtOpcUa's `AkkaClusterHealthCheck` injects `ActorSystem` directly (no null guard).
+  The shared implementation injects via `AkkaHostedService` for startup safety.
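
The two Akka status presets described in §3 can be pictured as two small pure functions. The following is a minimal sketch only, not the shared library's code: `NodeStatus`, `Health`, and `AkkaStatusPolicy` are hypothetical stand-ins for Akka's `MemberStatus`, ASP.NET Core's `HealthStatus`, and whatever type the library ends up using, chosen so the snippet compiles without any package references.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for Akka's MemberStatus and ASP.NET Core's HealthStatus,
// so the sketch compiles without Akka or ASP.NET Core packages.
public enum NodeStatus { Joining, WeaklyUp, Up, Leaving, Exiting, Down, Removed }
public enum Health { Healthy, Degraded, Unhealthy }

public static class AkkaStatusPolicy
{
    // "Default" preset: three-way mapping of the self member's status
    // (Up/Joining = Healthy, Leaving/Exiting = Degraded, else Unhealthy).
    public static Health Default(NodeStatus self) => self switch
    {
        NodeStatus.Up or NodeStatus.Joining => Health.Healthy,
        NodeStatus.Leaving or NodeStatus.Exiting => Health.Degraded,
        _ => Health.Unhealthy,
    };

    // "OtOpcUaCompat" preset: Healthy only when self is found among the
    // members with status Up; every other situation is Degraded.
    public static Health OtOpcUaCompat(
        string selfAddress,
        IEnumerable<(string Address, NodeStatus Status)> members) =>
        members.Any(m => m.Address == selfAddress && m.Status == NodeStatus.Up)
            ? Health.Healthy
            : Health.Degraded;
}
```

For a node that is still `Joining`, `Default` reports Healthy while `OtOpcUaCompat` reports Degraded, which is the divergence the table above records.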
+
+### §4 Database probe technique
+
+| Aspect | OtOpcUa | ScadaBridge |
+|---|---|---|
+| Probe method | `db.Deployments.AsNoTracking().Take(1).ToListAsync()` (query) | `_dbContext.Database.CanConnectAsync()` (connection only) |
+| Injection style | `IDbContextFactory` (pooled, safe for concurrent probes) | `DbContext` directly (scoped, requires care in background use) |
+| Schema verification | ✅ implies schema is applied | ⛔ connection only |
+
+→ **Gap D1:** `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck` uses
+  `CanConnectAsync` as the default (ScadaBridge behavior). An optional `ProbeQuery` delegate covers
+  OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced
+  to change unless desired.
+→ **Gap D2:** ScadaBridge injects `DbContext` directly; the shared probe should use
+  `IDbContextFactory` for safe reuse from a background-service health-check context.
+  ScadaBridge's DI registration will need updating on adoption.
+
+### §5 Active-node / leader check
+
+| Aspect | OtOpcUa | ScadaBridge |
+|---|---|---|
+| Probe type | `AdminRoleLeaderHealthCheck` (role-filtered: `"admin"`) | `ActiveNodeHealthCheck` (role-less; Up + leader) |
+| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
+| Leader status | Healthy | Healthy |
+| Non-leader (standby) | Degraded | Unhealthy |
+| `IActiveNodeGate` backing | Not present | `ActiveNodeGate` (separate type, duplicated logic) |
+
+→ **Gap L1:** `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with an optional `RoleFilter`
+  parameter unifies both behaviors. OtOpcUa passes `RoleFilter = "admin"` (role-filtered);
+  ScadaBridge uses no role filter.
+→ **Gap L2:** ScadaBridge's `ActiveNodeGate` duplicates `ActiveNodeHealthCheck` logic. The shared
+  `IActiveNodeGate` seam + a backing singleton eliminates the duplication.
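
To make the §5 table concrete, here is one way an optional role filter could select between the two behaviors. This is a hedged sketch under stated assumptions: `NodeView`, `ActiveNodePolicy`, and the `roleFilter` parameter are hypothetical names, `IsLeader` is assumed to mean cluster leader in the role-less case and role leader in the role-filtered case, and the standby outcomes (Unhealthy versus Degraded) are taken from the table rather than from the library.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for ASP.NET Core's HealthStatus.
public enum Health { Healthy, Degraded, Unhealthy }

// Hypothetical snapshot of the local node. IsLeader is assumed to mean
// "cluster leader" without a role filter and "role leader" with one.
public sealed record NodeView(bool IsUp, bool IsLeader, IReadOnlyCollection<string> Roles);

public static class ActiveNodePolicy
{
    public static Health Evaluate(NodeView node, string? roleFilter = null)
    {
        if (roleFilter is null)
        {
            // Role-less (ScadaBridge row): only an Up leader is active;
            // a standby or non-Up node is Unhealthy.
            return node.IsUp && node.IsLeader ? Health.Healthy : Health.Unhealthy;
        }

        // Role-filtered (OtOpcUa row): a node without the role is never gated.
        if (!node.Roles.Contains(roleFilter)) return Health.Healthy;

        // A role-bearing node is Healthy as leader and Degraded as standby.
        return node.IsLeader ? Health.Healthy : Health.Degraded;
    }
}
```

A standby node therefore evaluates to Unhealthy with no filter and to Degraded with `roleFilter: "admin"`, so a single parameter alone only reproduces both rows if the non-leader outcome is tied to it, as assumed here.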
+
+### §6 Response writer
+
+| | OtOpcUa | MxAccessGateway | ScadaBridge |
+|---|---|---|---|
+| Writer | Default (plain-text/JSON) | Bespoke `GatewayHealthReply` JSON | `UIResponseWriter.WriteHealthCheckUIResponse` |
+
+→ **Gap W1:** the shared `ZB.MOM.WW.Health` package ships a canonical JSON response writer
+  (lifting `HealthChecks.UI.Client` style to the default). All three projects adopt it on
+  `MapZbHealth()` call — no per-project writer wiring needed.
+
+### §7 Endpoint authentication
+
+Both OtOpcUa and ScadaBridge expose health endpoints without authentication (`AllowAnonymous` or
+open by default). MxAccessGateway's `/health/live` has no authentication requirement. The spec
+canonizes this: health tiers are `AllowAnonymous`; `MapZbHealth()` applies `AllowAnonymous` by
+default.
+
+No gap — consistent across all three. `MapZbHealth()` should document and enforce this default.
+
+## Adoption backlog (ordered)
+
+| # | Item | Projects | Priority | Effort | Risk | Notes |
+|---|---|---|---|---|---|---|
+| 1 | MxAccessGateway: remove dead `/health/live` + `AddHealthChecks()`, add `GrpcDependencyHealthCheck` (worker IPC) + `MapZbHealth()` | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3 — no probes/tiers today; highest delta |
+| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (`AkkaClusterHealthCheck` OtOpcUaCompat + `ActiveNodeHealthCheck` role-filtered + `DatabaseHealthCheck` ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
+| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + `CanConnectAsync`) + add `/healthz` + unify `ActiveNodeGate` with `IActiveNodeGate` seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
+| 4 | OtOpcUa + MxAccessGateway: add `GrpcDependencyHealthCheck` for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2 — closes the silent-gateway-down scenario |
+| 5 | All: adopt canonical response writer (switch from per-project writers to `MapZbHealth` default) | all 3 | P3 | XS | Low | Gap W1 — mechanical; bundled with #1–3 |
+| 6 | DB injection style: switch ScadaBridge from injected `DbContext` to `IDbContextFactory` | ScadaBridge | P3 | XS | Low | Gap D2 — background-service safety |
+
+**Note: adoption items #1–6 are all follow-on tasks.** They are tracked here as the backlog for
+after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs, tests) is a
+separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
+the library is built first; adoption by the three apps is the next step.
+
+## Decisions still open
+
+- Whether `GrpcDependencyHealthCheck` takes a named channel (from DI) or a raw `ChannelBase` —
+  affects how MxAccessGateway registers the worker-IPC probe without a standard gRPC channel.
+- Whether `IActiveNodeGate` lives in `ZB.MOM.WW.Health` (making it a hard dependency) or stays
+  in ScadaBridge's `InboundAPI` project (keeping the gate as a ScadaBridge concern).
+- Whether the `OtOpcUaCompat` preset for `AkkaClusterHealthCheck` is a named constant or just
+  documented configuration.
diff --git a/components/health/README.md b/components/health/README.md
new file mode 100644
index 0000000..7e89837
--- /dev/null
+++ b/components/health/README.md
@@ -0,0 +1,89 @@
+# Health (readiness / liveness / active-node)
+
+Second normalized component under the operability cluster. **Goal: path to shared code** — converge
+the three sister projects onto a common three-tier health endpoint convention and a set of shared
+probe implementations, proposed as the `ZB.MOM.WW.Health` library set (3 packages), while each
+project keeps its own probe registration and orchestrator wiring.
+
+- The one target: [`spec/SPEC.md`](spec/SPEC.md)
+- The proposed shared library: [`shared-contract/ZB.MOM.WW.Health.md`](shared-contract/ZB.MOM.WW.Health.md)
+- Divergences + backlog: [`GAPS.md`](GAPS.md)
+- Current state, per project: [`current-state/`](current-state/)
+
+## Why health is a strong normalization candidate
+
+Both OtOpcUa and ScadaBridge trace their health-check structure to the same "ScadaLink three-tier
+pattern" (`HealthEndpoints.cs:13` says so explicitly) but have already diverged in probe logic,
+status semantics, response writer, and endpoint registration style. MxAccessGateway has no shared
+ancestry here — it has a single hardcoded `/health/live` endpoint with no real probes at all.
+The common core (three tiers, database probe, Akka cluster probe, active-node probe) is
+re-implemented twice and absent once. Shared probe implementations with configurable policies
+close the gap without forcing identical behavior onto projects with legitimately different cluster
+semantics.
+
+## Status by project
+
+| Project | Endpoints today | Probes today | Response writer | `/healthz` | `IActiveNodeGate` | Adoption status |
+|---|---|---|---|---|---|---|
+| **OtOpcUa** | `/health/ready`, `/health/active`, `/healthz` | Database (query), AkkaCluster (2-way), AdminRoleLeader (role-filtered) | Default (plain-text/JSON) | ✅ present | — | Not started |
+| **MxAccessGateway** | `/health/live` only (raw `MapGet`; hardcoded `"Healthy"`) | **None** (`AddHealthChecks()` called but unused) | Bespoke `GatewayHealthReply` JSON | ⛔ absent | — | Not started |
+| **ScadaBridge** | `/health/ready`, `/health/active` | Database (`CanConnectAsync`), AkkaCluster (3-way), ActiveNode (role-less) | `HealthChecks.UI.Client` JSON | ⛔ absent | `ActiveNodeGate` (backs Inbound API 503 gate) | Not started |
+
+See each project's [`current-state//CURRENT-STATE.md`](current-state/) for the
+code-verified detail and its adoption plan.
+
+## Normalized vs. left per-project
+
+**Normalized (the shared target):**
+
+- Three-tier endpoint convention: `/health/ready` (tag `ready`), `/health/active` (tag `active`),
+  `/healthz` (bare liveness). Mapped by `app.MapZbHealth()` from `ZB.MOM.WW.Health`.
+- Canonical JSON response writer (lifted from `HealthChecks.UI.Client` style; no per-project
+  writer wiring needed).
+- `IActiveNodeGate` seam — generalized from ScadaBridge's `ActiveNodeGate`; wired into `MapZbHealth`
+  for automatic active-tier response.
+- `GrpcDependencyHealthCheck` — reachability probe for a downstream gRPC dependency (covers
+  OtOpcUa → MxAccessGateway channel and MxAccessGateway → worker IPC).
+- `AkkaClusterHealthCheck` (in `ZB.MOM.WW.Health.Akka`) with a configurable status policy.
+  Default = ScadaBridge's three-way policy; `OtOpcUaCompat` preset preserves OtOpcUa's two-way
+  self-Up-among-members scan.
+- `ActiveNodeHealthCheck` (in `ZB.MOM.WW.Health.Akka`) with an optional role filter. Role-less =
+  ScadaBridge's behavior (Up + cluster leader); role-filtered = OtOpcUa's `AdminRoleLeader`
+  behavior.
+- `DatabaseHealthCheck` (in `ZB.MOM.WW.Health.EntityFrameworkCore`) with default
+  `CanConnectAsync` and an optional `ProbeQuery` delegate.
+- `AllowAnonymous` on all three tiers by default (consistent across all three projects today).
+
+**Left per-project (not forced together):**
+
+- Which probes each app registers, their names, and which tags they carry.
+- Orchestrator / Traefik wiring (sidecars, route rules, upstreams).
+- ScadaBridge's `HealthMonitoring/` distributed aggregation pipeline (`SiteHealthCollector`,
+  `CentralHealthAggregator`, `HealthReportSender`, etc.) — domain-specific, no shared-library
+  equivalent.
+- MxAccessGateway's `GatewayHealthReply` metadata (`DefaultBackend`, `WorkerProtocolVersion`) —
+  keep as a bespoke `/info` endpoint.
+- The x86 worker process — out of process and out of scope; the gateway-side
+  `GrpcDependencyHealthCheck` observes it indirectly.
+- Per-project `IActiveNodeGate` contract location (whether the interface lives in the shared
+  library or in each project's own surface).
+
+## Package structure
+
+`ZB.MOM.WW.Health` ships as three dependency-split packages:
+
+| Package | Contents | Consumers |
+|---|---|---|
+| `ZB.MOM.WW.Health` | Core tiers, `MapZbHealth`, canonical writer, `IActiveNodeGate`, `GrpcDependencyHealthCheck` | All three |
+| `ZB.MOM.WW.Health.Akka` | `AkkaClusterHealthCheck` + status presets, `ActiveNodeHealthCheck` + role filter | OtOpcUa, ScadaBridge |
+| `ZB.MOM.WW.Health.EntityFrameworkCore` | `DatabaseHealthCheck` + optional probe delegate | OtOpcUa, ScadaBridge |
+
+MxAccessGateway consumes the core package only (no Akka, no EF). OtOpcUa and ScadaBridge consume
+all three.
+
+## Component status
+
+**Status: Draft.** Spec and shared-contract written; current-state docs verified; GAPS backlog
+populated. Library (`ZB.MOM.WW.Health` @ 0.1.0) built and tested in this repo at
+[`../../ZB.MOM.WW.Health/`](../../ZB.MOM.WW.Health/). Adoption by the three apps is a follow-on
+tracked in [`GAPS.md`](GAPS.md).
diff --git a/components/health/current-state/mxaccessgw/CURRENT-STATE.md b/components/health/current-state/mxaccessgw/CURRENT-STATE.md
new file mode 100644
index 0000000..c99310f
--- /dev/null
+++ b/components/health/current-state/mxaccessgw/CURRENT-STATE.md
@@ -0,0 +1,133 @@
+# Health — current state: MxAccessGateway
+
+Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (x86), gRPC;
+solution `src/MxGateway.sln`.
+Health code lives in `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`. All paths relative
+to repo root.
+Verified 2026-06-01.
+
+**Summary: bare liveness only.** MxAccessGateway has a single `/health/live` endpoint that returns
+a hardcoded `GatewayHealthReply` JSON object. `AddHealthChecks()` is called at startup but is
+entirely unused — no `IHealthCheck` implementations are registered, `MapHealthChecks` is never
+called, and there is no readiness or active-node tier. The net48 x86 worker process has no HTTP
+server and therefore no health endpoint of any kind.
+
+## 1. Endpoint wiring
+
+`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:
+
+- `:61` — `builder.Services.AddHealthChecks()` is called in the DI registration block. **This call
+  is dead**: no `.AddCheck()` call follows it, no `MapHealthChecks` is ever called. The
+  framework registers the health-check infrastructure but nothing is wired through it.
+- `:139–145` — `MapGatewayEndpoints` maps a raw `endpoints.MapGet("/health/live", ...)` (not
+  `MapHealthChecks`). The handler is an inline lambda that returns `Results.Ok(new GatewayHealthReply(...))`
+  with a hardcoded `Status: "Healthy"`:
+
+  ```csharp
+  endpoints.MapGet(
+      "/health/live",
+      () => Results.Ok(new GatewayHealthReply(
+          Status: "Healthy",
+          DefaultBackend: GatewayContractInfo.DefaultBackendName,
+          WorkerProtocolVersion: GatewayContractInfo.WorkerProtocolVersion)))
+      .WithName("LiveHealth");
+  ```
+
+This endpoint always returns HTTP 200 `{"Status":"Healthy",...}` as long as the process is alive.
+It carries no authentication requirement (no `[Authorize]` or `.RequireAuthorization()`).
+
+## 2. Response shape
+
+`GatewayHealthReply` is a record with three fields:
+- `Status` — always `"Healthy"` (hardcoded string, not the ASP.NET Core `HealthStatus` enum)
+- `DefaultBackend` — value of `GatewayContractInfo.DefaultBackendName` (the configured backend
+  name, useful for confirming which gateway instance a probe hit)
+- `WorkerProtocolVersion` — value of `GatewayContractInfo.WorkerProtocolVersion` (the gRPC
+  protocol version the gateway expects from the worker, useful for version-skew detection)
+
+The response is not `HealthChecks.UI.Client` JSON and is not the standard ASP.NET Core health
+response shape. It is a bespoke JSON record.
+
+## 3. Probes
+
+None. There is no `IHealthCheck` registered. The `/health/live` response does not reflect:
+
+- Whether the SQLite auth-store is reachable
+- Whether any active MXAccess session is functional
+- Whether the x86 worker named-pipe IPC is connected or the worker process is alive
+- Whether the gRPC service is actually accepting calls
+
+The endpoint is purely a process liveness indicator.
+
+## 4. Tier coverage
+
+| Tier | Endpoint | Status |
+|---|---|---|
+| Process liveness | `/health/live` (raw `MapGet`) | ✅ present (but non-standard) |
+| Readiness | `/health/ready` | ⛔ absent |
+| Active node | `/health/active` | ⛔ absent (not Akka-based; not applicable as-is) |
+| `healthz` convention | `/healthz` | ⛔ absent |
+
+MxAccessGateway is not an Akka.NET application — it has no cluster, no leader election, and no
+active-node concept. The "active" tier in the shared spec translates here to "is the worker process
+connected and the gRPC service ready to accept calls?" rather than cluster leadership.
+
+## 5. x86 worker
+
+`ZB.MOM.WW.MxGateway.Worker` is a .NET 4.8 console application communicating with the gateway
+over Windows named-pipe IPC. It has no HTTP server, no health endpoint, and no exposure to any
+probe mechanism. Its liveness must be inferred indirectly — either via the gateway process
+monitoring it (not currently implemented) or via the `GrpcDependencyHealthCheck` the gateway
+could use to probe the IPC channel.
+
+## 6. Notable gaps
+
+- `AddHealthChecks()` at `:61` is dead code. No `IHealthCheck` is ever registered via this call.
+- `/health/live` uses `MapGet` (a raw minimal-API handler) rather than `MapHealthChecks`. It
+  bypasses the ASP.NET Core health-check middleware entirely, which means it does not participate
+  in the standard health-check pipeline (no `IHealthCheckPublisher`, no `HealthReport`, no UI
+  integration).
+- The hardcoded `"Healthy"` status means the endpoint cannot reflect real probe results even if
+  probes were added later — the handler must be replaced, not just supplemented.
+- No readiness gating: orchestrators (Kubernetes, Traefik) that rely on `/health/ready` returning
+  503 until the process is actually ready will receive 200 (or 404) from MxAccessGateway today.
+
+---
+
+## Adoption plan → `ZB.MOM.WW.Health`
+
+**Replace `/health/live` + wire the shared tiers:**
+
+The `AddHealthChecks()` call at `GatewayApplication.cs:61` is already present — it just needs
+probes registered against it. The raw `MapGet("/health/live", ...)` handler at `:139–145` must be
+removed and replaced with `app.MapZbHealth()` from `ZB.MOM.WW.Health`.
+
+Steps:
+
+1. **Remove** the inline `MapGet("/health/live", ...)` lambda (`:139–145`). The `GatewayHealthReply`
+   record and `DefaultBackend`/`WorkerProtocolVersion` metadata can be surfaced differently (e.g., a
+   `/info` endpoint or as custom data on the health response).
+2. **Register a `GrpcDependencyHealthCheck`** (from `ZB.MOM.WW.Health`) that probes the
+   named-pipe IPC channel to the x86 worker. Tag `["ready"]`. This replaces the hardcoded
+   liveness-only response with a real probe that reflects whether the worker is reachable.
+3. **Optionally add a `GrpcDependencyHealthCheck`** for any downstream gRPC dependency (e.g., the
+   Galaxy Repository connection) if the gateway is expected to be healthy only when its upstreams are
+   reachable. Tag `["ready"]`.
+4. **Call `app.MapZbHealth()`** — this maps `/health/ready` (tag `ready`), `/health/active` (tag
+   `active`; initially empty — no active-node concept in MxGateway), and `/healthz` (bare liveness).
+   The `/healthz` endpoint replaces the semantic role that `/health/live` served today.
+5. **Do not add `ZB.MOM.WW.Health.Akka`** — MxAccessGateway has no Akka dependency. The consumer
+   matrix in the design specifies MxGateway uses the core package only.
+
+**Keep bespoke:**
+
+- The `WorkerProtocolVersion` / `DefaultBackend` metadata from `GatewayHealthReply` is
+  MxAccessGateway-specific; keep it as a separate `/info` endpoint or embed it as `Data` on a
+  custom probe rather than normalizing it into the shared contract.
+- The x86 worker itself (net48 console, named-pipe IPC, no HTTP) remains outside the shared health
+  scheme. The `GrpcDependencyHealthCheck` observes the worker indirectly from the gateway side.
+- Per-gateway auth and TLS concerns on who may call health endpoints remain per-project.
+
+**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
+library build. MxGateway is the **highest-priority adopter** (P1 gap — no probes/tiers today)
+and should be the first app wired up once the nupkg is available.
diff --git a/components/health/current-state/otopcua/CURRENT-STATE.md b/components/health/current-state/otopcua/CURRENT-STATE.md
new file mode 100644
index 0000000..19ae6ea
--- /dev/null
+++ b/components/health/current-state/otopcua/CURRENT-STATE.md
@@ -0,0 +1,150 @@
+# Health — current state: OtOpcUa
+
+Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
+Health code lives in `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/`. All paths relative to repo root.
+Verified 2026-06-01.
+
+Full three-tier pattern: `/health/ready`, `/health/active`, and `/healthz`. Three probes covering
+the database, the Akka cluster, and the admin-role leader. All endpoints are `AllowAnonymous` to
+permit Traefik and load-balancer probing without credentials.
+
+## 1. Endpoint wiring
+
+`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs`:
+
+- `:13` — XML comment explicitly names this as "ScadaLink's three-tier pattern: `ready` = boot ok;
+  `active` = fully serving traffic; `healthz` = bare process liveness."
+- `:17` — `AddOtOpcUaHealth(IServiceCollection)` calls `services.AddHealthChecks()` and registers
+  all three probes (lines 20–22):
+  - `DatabaseHealthCheck` name `"configdb"`, tags `["ready","active"]`
+  - `AkkaClusterHealthCheck` name `"akka"`, tags `["ready","active"]`
+  - `AdminRoleLeaderHealthCheck` name `"admin-leader"`, tags `["active"]` only
+- `:28` — `MapOtOpcUaHealth(IEndpointRouteBuilder)` maps three endpoints (lines 33–44):
+  - `/health/ready` — predicate `c => c.Tags.Contains("ready")`, `.AllowAnonymous()` (lines 33–36)
+  - `/health/active` — predicate `c => c.Tags.Contains("active")`, `.AllowAnonymous()` (lines 37–40)
+  - `/healthz` — predicate `_ => false` (no probes run; bare process liveness only), `.AllowAnonymous()` (lines 41–44)
+
+`Program.cs`:
+- `:137` — `builder.Services.AddOtOpcUaHealth()`
+- `:159` — `app.MapOtOpcUaHealth()`
+
+Response writer: default ASP.NET Core plain-text/JSON (no `HealthChecks.UI.Client`).
+
+## 2. Probes
+
+### DatabaseHealthCheck
+`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs`:
+
+- `:9` — injects `IDbContextFactory`
+- `:25–37` — opens a pooled context via `CreateDbContextAsync`, runs
+  `db.Deployments.AsNoTracking().Take(1).ToListAsync()`. If the query succeeds →
+  `HealthCheckResult.Healthy("ConfigDb reachable")` (`:31`). If it throws →
+  `HealthCheckResult.Unhealthy("ConfigDb unreachable", ex)` (`:35`). No `Degraded` path.
+
+The probe exercises a real query (not just `CanConnectAsync`) — it confirms the `Deployments` table
+is readable, which implies the schema migration has run. This is **stricter** than ScadaBridge's
+`CanConnectAsync` but more opaque about the failure reason.
+
+Tags on registration: `["ready","active"]` — the database must be reachable for both readiness and
+active-node determination.
+
+### AkkaClusterHealthCheck
+`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs`:
+
+- `:9` — injects `ActorSystem` directly
+- `:27–33` — calls `Cluster.Get(_system)`, scans `cluster.State.Members` for the member whose
+  `Address == cluster.SelfAddress` and `Status == MemberStatus.Up`:
+  - Found Up → `HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")` (`:32`)
+  - Not found → `HealthCheckResult.Degraded("Self not yet Up in cluster")` (`:33`)
+
+No `Unhealthy` path — joining/leaving/removed nodes are all reported as `Degraded`. This differs from
+ScadaBridge's more granular three-way policy (see GAPS).
+
+Tags on registration: `["ready","active"]`.
+
+### AdminRoleLeaderHealthCheck
+`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs`:
+
+- `:14` — injects `IClusterRoleInfo`
+- `:27–38` — three-path logic:
+  - Node does not carry the `"admin"` role → `Healthy("Node does not carry admin role")` (`:30`) —
+    non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
+  - Admin role + node is the role leader → `Healthy($"Admin leader ({...})")` (`:36`)
+  - Admin role + not the leader → `Degraded($"Admin member but not leader (leader=...)")` (`:37`)
+
+Tags on registration: `["active"]` only — does not participate in `/health/ready`. The intent is
+Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes
+are reachable for data-plane OPC UA but report `Degraded` on `/health/active` so the load balancer
+does not route control-plane traffic to them.
+
+Note: no `Unhealthy` path for the role-filter case. If the ActorSystem is not running, `IClusterRoleInfo`
+presumably returns safe defaults (no role); this is not separately health-checked.
+
+## 3. Tag / tier summary
+
+| Probe | `/health/ready` | `/health/active` | `/healthz` |
+|---|---|---|---|
+| `DatabaseHealthCheck` | ✅ | ✅ | — |
+| `AkkaClusterHealthCheck` | ✅ | ✅ | — |
+| `AdminRoleLeaderHealthCheck` | — | ✅ | — |
+| (no probes) | — | — | ✅ (bare liveness) |
+
+`/healthz` runs zero probes — it is a pure process liveness sentinel (process reachable = healthy;
+a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
+monitors use this tier.
+
+## 4. Downstream dependency coverage
+
+No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
+reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but `/health/ready`
+and `/health/active` will not reflect it). This is a gap that the shared `GrpcDependencyHealthCheck`
+probe in `ZB.MOM.WW.Health` would close.
+
+## 5. Notable design choices
+
+- **AllowAnonymous on all tiers** — see `HealthEndpoints.cs:30–32` comment: "Without it the
+  `AddOtOpcUaAuth` fallback policy 401s every probe and Traefik marks every backend unhealthy."
+- **Query probe, not `CanConnectAsync`** — the `Deployments` query validates that the schema has
+  been applied. ScadaBridge uses `CanConnectAsync`; neither is wrong but they diverge.
+- **`Degraded` semantics** — the Akka check uses `Degraded` (not `Unhealthy`) for a joining/pre-Up
+  node. ASP.NET Core maps `Degraded` to HTTP 200 by default; Traefik sees 200 and considers the
+  node ready. If `Unhealthy` (HTTP 503) is required to gate traffic, the `Degraded` path is
+  insufficient.
+- **`IClusterRoleInfo` abstraction** — the admin-leader check depends on `IClusterRoleInfo`, an OtOpcUa
+  interface, not the raw `Akka.Cluster.Cluster` API. This is a testability-friendly layer absent in
+  ScadaBridge's direct Akka usage.
+
+---
+
+## Adoption plan → `ZB.MOM.WW.Health`
+
+**Replace with shared probes:**
+
+- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the
+  **`OtOpcUaCompat` preset** (self-Up-among-members scan → Healthy/Degraded). The preset keeps
+  OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
+- `AdminRoleLeaderHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with
+  `RoleFilter = "admin"`. The role-filter parameter produces identical behavior: non-admin nodes
+  immediately healthy, admin leader healthy, admin non-leader degraded.
+- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck`
+  with a `ProbeQuery` delegate of `db => db.Deployments.AsNoTracking().Take(1).ToListAsync()`.
+  The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
+- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
+  dependency gap noted in §4). Tag `["ready","active"]`.
+- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with `services.AddZbHealthChecks()` +
+  `app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default —
+  no separate wiring needed.
+
+**Keep bespoke:**
+
+- `IClusterRoleInfo` and its Akka implementation — this is an OtOpcUa abstraction used for more
+  than health checks; it should remain in the OtOpcUa codebase. The shared `ActiveNodeHealthCheck`
+  will accept `IClusterRoleInfo` (or an equivalent cluster-info abstraction) as an injection point.
+- The `AllowAnonymous` policy — this is an OtOpcUa auth concern; `MapZbHealth` must document that
+  callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
+- Which probes are registered and their tag assignments — the shared library supplies the check
+  implementations; the wiring (which names, which tags, which options) remains per-project.
+
+**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
+library build. The library build delivers the shared implementations; adoption lands in the
+OtOpcUa repo as a separate commit once the nupkg is available.
diff --git a/components/health/current-state/scadabridge/CURRENT-STATE.md b/components/health/current-state/scadabridge/CURRENT-STATE.md
new file mode 100644
index 0000000..ff1d816
--- /dev/null
+++ b/components/health/current-state/scadabridge/CURRENT-STATE.md
@@ -0,0 +1,185 @@
+# Health — current state: ScadaBridge
+
+Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET; solution `ZB.MOM.WW.ScadaBridge.slnx`.
+Health code centers on `src/ZB.MOM.WW.ScadaBridge.Host/Health/` (ASP.NET probes) and the
+separate `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` project (domain aggregation pipeline).
+All paths relative to repo root.
+Verified 2026-06-01.
+
+Two-tier pattern: `/health/ready` and `/health/active` — no `/healthz`. Three probes (database,
+Akka cluster, active-node). ScadaBridge also has a bespoke distributed `HealthMonitoring/`
+pipeline that is entirely separate from the ASP.NET health checks and is out of scope for the
+shared library.
+
+## 1. Endpoint wiring
+
+`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`:
+
+- `:114–117` — `builder.Services.AddHealthChecks()` followed by three `.AddCheck()` calls
+  (no tags, checked by name at the endpoint level):
+  - `.AddCheck("database")`
+  - `.AddCheck("akka-cluster")`
+  - `.AddCheck("active-node")`
+- `:131` — `builder.Services.AddSingleton()` registers the
+  production `IActiveNodeGate` implementation (Inbound API gating, not a health-check probe).
+- `:222–226` — `/health/ready` mapped with `Predicate = check => check.Name != "active-node"` and
+  `ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse` (from `HealthChecks.UI.Client`).
+  Excludes the active-node check so a healthy standby node reports ready.
+- `:229–233` — `/health/active` mapped with `Predicate = check => check.Name == "active-node"` and
+  `ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse`. Active-node check only.
+
+No `/healthz` endpoint. Both mapped endpoints use `HealthChecks.UI.Client` JSON (not the default
+plain-text writer), which is a divergence from OtOpcUa.
+
+## 2. Probes
+
+### DatabaseHealthCheck
+`src/ZB.MOM.WW.ScadaBridge.Host/Health/DatabaseHealthCheck.cs`:
+
+- `:11` — injects `ScadaBridgeDbContext` directly (not a factory)
+- `:33–43` — calls `_dbContext.Database.CanConnectAsync(cancellationToken)`:
+  - Returns `true` → `HealthCheckResult.Healthy("Database connection is available.")` (`:34–35`)
+  - Returns `false` → `HealthCheckResult.Unhealthy("Database connection failed.")` (`:36`)
+  - Throws → `HealthCheckResult.Unhealthy("Database connection failed.", ex)` (`:40`)
+
+`CanConnectAsync` tests the connection layer only — it does not run any query or verify schema
+state. This is less strict than OtOpcUa's `Deployments` query but more transparent about failure
+cause (connection vs. schema). No `Degraded` path.
+
+### AkkaClusterHealthCheck
+`src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterHealthCheck.cs`:
+
+- `:13` — injects `AkkaHostedService` (not `ActorSystem` directly)
+- `:33–50` — gets `_akkaService.ActorSystem`, guards on null → `Degraded("ActorSystem not yet
+  available.")`, then reads `cluster.SelfMember.Status`:
+  - `Up` or `Joining` → `Healthy($"Akka cluster member status: {status}")` (`:43`)
+  - `Leaving` or `Exiting` → `Degraded($"Akka cluster member status: {status}")` (`:45`)
+  - anything else (Removed, Down, WeaklyUp…) → `Unhealthy($"Akka cluster member status: {status}")` (`:47`)
+
+Three-way status policy: Healthy / Degraded / Unhealthy. This is more granular than OtOpcUa's
+two-way policy (self-Up-or-not → Healthy/Degraded with no Unhealthy path).
+
+### ActiveNodeHealthCheck
+`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeHealthCheck.cs`:
+
+- `:13` — injects `AkkaHostedService`
+- `:29–44` — three-path logic:
+  - `ActorSystem == null` → `Unhealthy("ActorSystem not yet available.")` (`:31`)
+  - `SelfMember.Status != Up` → `Unhealthy($"Node not Up (status: ...)")` (`:37`)
+  - `Up` AND `cluster.State.Leader == self.Address` → `Healthy("Active node (cluster leader).")` (`:41`)
+  - `Up` but not leader → `Unhealthy("Standby node (not cluster leader).")` (`:43`)
+
+No `Degraded` path — `ActiveNodeHealthCheck` uses `Unhealthy` for standby and non-Up states,
+which causes `/health/active` to return HTTP 503 on a standby. This is the intended behavior for
+Traefik active-node routing.
+
+## 3. Tag / tier summary
+
+ScadaBridge uses **name-based predicates** at the endpoint level rather than tags on the check
+registration. Tags are absent from all three `.AddCheck()` calls.
+
+| Probe | `/health/ready` | `/health/active` | `/healthz` |
+|---|---|---|---|
+| `DatabaseHealthCheck` | ✅ | — (excluded by name) | ⛔ absent |
+| `AkkaClusterHealthCheck` | ✅ | — (excluded by name) | ⛔ absent |
+| `ActiveNodeHealthCheck` | — (excluded by name) | ✅ | ⛔ absent |
+
+`/healthz` is absent — there is no bare process liveness endpoint. Kubernetes or Traefik liveness
+probes must either use `/health/ready` or tolerate its 503-until-ready behavior.
+
+## 4. IActiveNodeGate and Inbound API gating
+
+`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeGate.cs`:
+
+- `:24` — `ActiveNodeGate` implements `IActiveNodeGate` (from the `InboundAPI` project)
+- `:40` — `IsActiveNode` property: returns `true` only when `_akkaService.ActorSystem != null`
+  AND `cluster.SelfMember.Status == MemberStatus.Up` AND `cluster.State.Leader == self.Address`.
+  Defaults to `false` safely during startup (`:45–46`).
+- `:131` in `Program.cs` — registered as a singleton. The `InboundApiEndpointFilter` consults this
+  gate on every `/api/*` request and returns HTTP 503 on a standby node.
+
+`ActiveNodeGate` mirrors the exact same logic as `ActiveNodeHealthCheck` — both check Up + leader.
+They are separate types serving two different concerns (the health endpoint and the API gate) but
+are not abstracted into a shared service; each reads cluster state independently.
+
+`IActiveNodeGate` is the generalized seam the `ZB.MOM.WW.Health` core package lifts to the shared
+library.
+
+## 5. HealthMonitoring domain pipeline (out of scope for shared library)
+
+`src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` is a separate project implementing a distributed
+health aggregation pipeline. It is **not ASP.NET Core health checks** and is **not in scope** for
+`ZB.MOM.WW.Health`.
+
+Key types:
+- `SiteHealthCollector` — thread-safe singleton accumulating per-site error counters, connection
+  metrics, and tag-read metrics. Populated by actors in the DCL layer.
+- `HealthReportSender` — a background service on site clusters that serializes `SiteHealthState`
+  and ships it to the central cluster via Akka remoting at a configurable interval.
+- `CentralHealthReportLoop` — central-only background service that generates a synthetic
+  `SiteHealthReport` for the central cluster itself (siteId `"$central"`) and feeds it into the
+  central aggregator.
+- `CentralHealthAggregator` — a `BackgroundService` on the central cluster tracking the latest
+  health report per site and detecting offline sites via heartbeat timeout. Exposes
+  `GetAggregatedHealth()` to the Central UI's `/monitoring/health` endpoint.
+
+This pipeline is domain-specific (multi-site ScadaBridge topology) and will remain per-project
+regardless of shared-library adoption.
+
+## 6. Notable design choices
+
+- **Name-based predicates vs. tags** — ScadaBridge uses `check.Name == "active-node"` predicate
+  logic at the endpoint level. OtOpcUa uses tag membership (`c.Tags.Contains("ready")`). The tag
+  approach is more composable (a probe can participate in multiple tiers), the name approach is
+  more explicit. The shared `MapZbHealth` should use tags by default.
+- **`HealthChecks.UI.Client` response writer** — ScadaBridge uses the richer JSON response writer
+  from the `AspNetCore.HealthChecks.UI.Client` package. OtOpcUa uses the default plain-text writer.
+  The shared library's canonical response writer standardizes this.
+- **`ActiveNodeHealthCheck` returns `Unhealthy` for standby** — a standby is not *unhealthy* in the
+  system sense; it is a deliberate routing discriminator. Using `Unhealthy` here ensures `/health/active`
+  returns HTTP 503 (Traefik sees the node as down for active traffic). The naming is semantically
+  imprecise but operationally correct.
+- **`IActiveNodeGate` + `ActiveNodeGate` duplication** — the gate and the health check implement the
+  same logic independently. The shared library's `IActiveNodeGate` seam + `ActiveNodeHealthCheck`
+  unify them: one backing service, two consumers.
+
+---
+
+## Adoption plan → `ZB.MOM.WW.Health`
+
+**Replace with shared probes:**
+
+- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the **Default
+  policy** (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy). ScadaBridge's
+  existing three-way policy is the Default — no preset selection needed.
+- `ActiveNodeHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with no role filter
+  (role-less default: Up && leader = Healthy, else Unhealthy). The shared implementation also backs
+  `IActiveNodeGate`, eliminating the duplicated leader-check logic between `ActiveNodeHealthCheck`
+  and `ActiveNodeGate`.
+- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck`
+  using the default `CanConnectAsync` probe (ScadaBridge's existing behavior). No `ProbeQuery`
+  delegate needed.
+- Replace the name-based predicates with tag-based predicates by adding tags at registration time:
+  `"database"` and `"akka-cluster"` → `["ready"]`; `"active-node"` → `["active"]`. Then call
+  `app.MapZbHealth()` instead of the two manual `MapHealthChecks` calls.
+- **Add `/healthz`** — `MapZbHealth()` maps the bare liveness tier automatically. ScadaBridge
+  currently lacks this endpoint.
+- Switch `ResponseWriter` from `UIResponseWriter.WriteHealthCheckUIResponse` to the shared
+  canonical writer (a convergence item — `HealthChecks.UI.Client` style lifted to the default in
+  `ZB.MOM.WW.Health`).
+
+**Keep bespoke:**
+
+- `HealthMonitoring/` domain pipeline (`SiteHealthCollector`, `CentralHealthAggregator`, etc.) —
+  entirely per-project, no shared-library equivalent.
+- `IActiveNodeGate` from the `InboundAPI` project is the contract the `InboundApiEndpointFilter`
+  depends on; it can be implemented by the shared `ActiveNodeHealthCheck` backing service but the
+  interface definition stays in the InboundAPI project (or moves to a shared abstractions package).
+- The Central UI's `/monitoring/health` endpoint — powered by `CentralHealthAggregator`, not by
+  ASP.NET health checks.
+- The comment at `Program.cs:217–221` explains the readiness design decision (standby nodes are
+  ready; leadership is a separate concern). This intent is preserved by the tag-based approach.
+
+**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
+library build. The library build delivers the shared implementations; adoption lands in the
+ScadaBridge repo as a separate commit once the nupkg is available.
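
Reviewer note: the tag-based wiring that the ScadaBridge adoption plan describes can be approximated with stock ASP.NET Core health-check APIs. The following is a minimal sketch under stated assumptions, not the shared library. `StubCheck` is a hypothetical placeholder for the three real probes, and mapping `/healthz` with an always-false predicate is only one plausible way to build a bare liveness tier. `MapZbHealth()` is deliberately not called, since its signature is not part of this patch.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Tags replace the name-based predicates: each check declares its own tier.
// The names and tag values follow the adoption plan; StubCheck is hypothetical.
builder.Services.AddHealthChecks()
    .AddCheck<StubCheck>("database", tags: new[] { "ready" })
    .AddCheck<StubCheck>("akka-cluster", tags: new[] { "ready" })
    .AddCheck<StubCheck>("active-node", tags: new[] { "active" });

var app = builder.Build();

// Readiness tier: every check tagged "ready" (active-node is excluded by tag).
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

// Active tier: only checks tagged "active".
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("active")
});

// Assumption: bare liveness runs no checks at all, so it reports Healthy
// whenever the process is able to serve HTTP.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false
});

app.Run();

// Hypothetical stand-in for DatabaseHealthCheck, AkkaClusterHealthCheck,
// and ActiveNodeHealthCheck; it always reports Healthy.
sealed class StubCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
        => Task.FromResult(HealthCheckResult.Healthy("stub"));
}
```

With the default `HealthCheckOptions` status mapping, `Healthy` and `Degraded` both produce HTTP 200 and `Unhealthy` produces HTTP 503. A standby node would therefore answer 200 on `/health/ready` and 503 on `/health/active`, provided the real `active-node` check keeps returning `Unhealthy` on a standby as described in section 2.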