docs(health): current-state x3 + GAPS + README

Code-verified current-state docs for OtOpcUa (three-tier full), ScadaBridge
(two-tier, no /healthz), and MxAccessGateway (bare liveness only / no probes).
GAPS backlog with P1 for MxGateway and convergence items for Akka status policy,
DB probe technique, and response writer. README with per-project status table.
Joseph Doherty
2026-06-01 06:23:53 -04:00
parent 1dc35a8c43
commit 3d25ee5090
5 changed files with 698 additions and 0 deletions
@@ -0,0 +1,141 @@
# Health — gaps & adoption backlog
Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Health` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
## Divergence vs spec
### §1 Endpoint tiers
| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `/health/ready` (tag `ready`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/health/active` (tag `active`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/healthz` (bare process liveness) | ✅ present | ⛔ absent | ⛔ absent |
| `/health/live` (non-standard) | — | ⛔ present (hardcoded `"Healthy"`, bypasses health-check pipeline) | — |

**Gap T1 (P1):** MxAccessGateway has no standard health tiers. The existing `/health/live`
`MapGet` lambda must be replaced by `app.MapZbHealth()` + real probes.

**Gap T2:** ScadaBridge lacks `/healthz`. `MapZbHealth()` adds it automatically.

**Gap T3:** MxAccessGateway's `/health/live` uses a raw `MapGet` that bypasses the ASP.NET Core
health-check middleware — it does not participate in `IHealthCheckPublisher`, `HealthReport`, or
UI integration. Must be removed.
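
Gaps T1 to T3 come down to routing every tier through the health-check middleware instead of a hand-written `MapGet`. The following is a minimal sketch of that shape using only stock ASP.NET Core APIs. `MapTieredHealth` is a hypothetical stand-in for `MapZbHealth()`, whose real implementation is not shown in this document, and the check names and lambdas are placeholders rather than real probes.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Placeholder checks: real probes would replace these lambdas.
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy(), tags: new[] { "ready" })
    .AddCheck("active-node", () => HealthCheckResult.Healthy(), tags: new[] { "active" });

var app = builder.Build();
app.MapTieredHealth();
app.Run();

// Hypothetical helper (assumption): approximates what MapZbHealth() might map.
public static class TieredHealthEndpoints
{
    public static void MapTieredHealth(this IEndpointRouteBuilder endpoints)
    {
        // Readiness: only checks tagged "ready".
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("ready"),
        }).AllowAnonymous();

        // Active/leader tier: only checks tagged "active".
        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("active"),
        }).AllowAnonymous();

        // Bare liveness: runs no checks, yet still flows through the middleware.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        }).AllowAnonymous();
    }
}
```

Because `/healthz` excludes every registration, it reports on process liveness only, while still producing a `HealthReport` through the same pipeline as the other tiers.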
### §2 Probe coverage
| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | ✅ `DatabaseHealthCheck` (query probe) | ⛔ none | ✅ `DatabaseHealthCheck` (`CanConnectAsync`) |
| Akka cluster membership | ✅ `AkkaClusterHealthCheck` (2-way) | n/a (no Akka) | ✅ `AkkaClusterHealthCheck` (3-way) |
| Active / leader node | ✅ `AdminRoleLeaderHealthCheck` (role-filtered) | n/a | ✅ `ActiveNodeHealthCheck` (role-less) |
| Downstream gRPC dependency | ⛔ none | ⛔ none | ⛔ none |

**Gap P1 (P1):** MxAccessGateway has zero probes — `AddHealthChecks()` at
`GatewayApplication.cs:61` is dead code. Minimum viable: a `GrpcDependencyHealthCheck`
targeting the x86 worker IPC channel.

**Gap P2:** No project probes its downstream gRPC dependency. OtOpcUa should probe the
MxAccessGateway channel; MxAccessGateway should probe the worker IPC.

**Gap P3:** Dead `AddHealthChecks()` in MxAccessGateway (`GatewayApplication.cs:61`) should be
removed or replaced — it currently implies health checks are configured when they are not.
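
One possible shape for the probe named in Gaps P1 and P2 is sketched below. It assumes the dependency exposes the standard `grpc.health.v1.Health` service and that the check receives a raw `ChannelBase`. Both are assumptions: the channel question is still open, and the worker IPC may not be a standard gRPC channel. The class body, the timeout default, and the messages are illustrative, not the shipped `GrpcDependencyHealthCheck`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Illustrative sketch (assumption): probes a dependency via grpc.health.v1.
public sealed class GrpcDependencyHealthCheck : IHealthCheck
{
    private readonly Health.HealthClient _client;
    private readonly TimeSpan _timeout;

    public GrpcDependencyHealthCheck(ChannelBase channel, TimeSpan? timeout = null)
    {
        _client = new Health.HealthClient(channel);
        _timeout = timeout ?? TimeSpan.FromSeconds(2); // hypothetical default
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // An empty request asks about the server as a whole.
            var reply = await _client.CheckAsync(
                new HealthCheckRequest(),
                deadline: DateTime.UtcNow.Add(_timeout),
                cancellationToken: cancellationToken);

            return reply.Status == HealthCheckResponse.Types.ServingStatus.Serving
                ? HealthCheckResult.Healthy("gRPC dependency is serving")
                : HealthCheckResult.Unhealthy($"gRPC dependency reported {reply.Status}");
        }
        catch (RpcException ex)
        {
            // Unavailable, DeadlineExceeded, Unimplemented, and similar failures.
            return HealthCheckResult.Unhealthy("gRPC dependency unreachable", ex);
        }
    }
}
```

A dependency that does not implement the health service surfaces as an `RpcException` and is reported Unhealthy, so this sketch only fits channels where that service is actually registered.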
### §3 Akka status-policy divergence
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans `State.Members` for self by address | Reads `SelfMember.Status` directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | — (none; `ActorSystem` injected directly) | ✅ Degraded if null |

The two implementations classify `Joining` differently: ScadaBridge reports it as Healthy,
whereas OtOpcUa reports Degraded because its member scan only accepts a self entry whose status
is `Up`. They also differ on `Removed`/`Down`: ScadaBridge reports Unhealthy, OtOpcUa Degraded.

The shared `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` ships two presets to preserve both
behaviors rather than forcing one onto the other:

- **Default** — ScadaBridge's three-way policy (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded,
else Unhealthy)
- **OtOpcUaCompat** — OtOpcUa's self-Up-among-members scan (found Up=Healthy, not found=Degraded)

**Gap A1:** OtOpcUa adopts the `OtOpcUaCompat` preset; ScadaBridge adopts the `Default` preset.
Both preserve existing behavior without forcing convergence on a single policy.

**Gap A2:** OtOpcUa's `AkkaClusterHealthCheck` injects `ActorSystem` directly (no null guard).
The shared implementation injects via `AkkaHostedService` for startup safety.
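
The two presets can be read as two small classification functions. The sketch below restates them against Akka.NET's `MemberStatus` and `Member` types. `AkkaStatusPolicy` and its method names are hypothetical, and how the shared package actually exposes the presets is one of the open decisions.

```csharp
using System.Collections.Generic;
using System.Linq;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical names (assumption): mirrors the two presets described above.
public static class AkkaStatusPolicy
{
    // Default preset (ScadaBridge): classify the node's own member status.
    public static HealthStatus Default(MemberStatus self) => self switch
    {
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy, // Down, Removed, and any other status
    };

    // OtOpcUaCompat preset: Healthy only if self is listed as Up among members.
    public static HealthStatus OtOpcUaCompat(IEnumerable<Member> members, Address self) =>
        members.Any(m => m.Address.Equals(self) && m.Status == MemberStatus.Up)
            ? HealthStatus.Healthy
            : HealthStatus.Degraded;
}
```

Given `var cluster = Cluster.Get(system);`, the first preset would be fed `cluster.SelfMember.Status` and the second `cluster.State.Members` with `cluster.SelfAddress`. A `Joining` node therefore comes out Healthy under `Default` and Degraded under `OtOpcUaCompat`, which is the divergence the table records.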
### §4 Database probe technique
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | `db.Deployments.AsNoTracking().Take(1).ToListAsync()` (query) | `_dbContext.Database.CanConnectAsync()` (connection only) |
| Injection style | `IDbContextFactory<T>` (pooled, safe for concurrent probes) | `DbContext` directly (scoped, requires care in background use) |
| Schema verification | ✅ implies schema is applied | ⛔ connection only |

**Gap D1:** `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<TContext>` uses
`CanConnectAsync` as the default (ScadaBridge behavior). An optional `ProbeQuery` delegate covers
OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced
to change unless desired.

**Gap D2:** ScadaBridge injects `DbContext` directly; the shared probe should use
`IDbContextFactory<TContext>` for safe reuse from a background-service health-check context.
ScadaBridge's DI registration will need updating on adoption.
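
A minimal sketch of how Gaps D1 and D2 might combine in one generic check: a context obtained from `IDbContextFactory<TContext>`, `CanConnectAsync` as the default, and an optional probe delegate for the stricter query. The constructor shape and the delegate signature are assumptions, not the published `DatabaseHealthCheck<TContext>` API, and `CreateDbContextAsync` assumes EF Core 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Illustrative sketch (assumption): not the shipped ZB.MOM.WW.Health type.
public sealed class DatabaseHealthCheck<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly Func<TContext, CancellationToken, Task>? _probeQuery;

    public DatabaseHealthCheck(
        IDbContextFactory<TContext> factory,
        Func<TContext, CancellationToken, Task>? probeQuery = null)
    {
        _factory = factory;
        _probeQuery = probeQuery;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // A fresh context per probe avoids sharing a scoped DbContext.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            if (_probeQuery is null)
            {
                // Default: connection-only probe.
                return await db.Database.CanConnectAsync(cancellationToken)
                    ? HealthCheckResult.Healthy("Database reachable")
                    : HealthCheckResult.Unhealthy("Database unreachable");
            }

            // Stricter: a real query also fails when the schema is missing.
            await _probeQuery(db, cancellationToken);
            return HealthCheckResult.Healthy("Probe query succeeded");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            return HealthCheckResult.Unhealthy("Database probe failed", ex);
        }
    }
}
```

Under these assumptions, OtOpcUa's existing query would be passed as the delegate, for example `(db, ct) => db.Deployments.AsNoTracking().Take(1).ToListAsync(ct)`, while ScadaBridge would omit it and keep connection-only semantics.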
### §5 Active-node / leader check
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | `AdminRoleLeaderHealthCheck` (role-filtered: `"admin"`) | `ActiveNodeHealthCheck` (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| `IActiveNodeGate` backing | Not present | `ActiveNodeGate` (separate type, duplicated logic) |

**Gap L1:** `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with an optional `RoleFilter`
parameter unifies both behaviors. OtOpcUa passes `RoleFilter = "admin"` (role-filtered);
ScadaBridge uses no role filter.

**Gap L2:** ScadaBridge's `ActiveNodeGate` duplicates `ActiveNodeHealthCheck` logic. The shared
`IActiveNodeGate` seam + a backing singleton eliminates the duplication.
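
The role-filtered and role-less behaviors in the table can be expressed as one decision with two inputs that vary per project: the optional role filter and the status reported for a standby node. The sketch below is an assumption about how that unification could look. `ActiveNodePolicy` and the `standbyStatus` parameter are hypothetical, as is the assumption that both projects require the node to be `Up` as well as leader.

```csharp
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch (assumption): not the shipped ActiveNodeHealthCheck.
public static class ActiveNodePolicy
{
    // Pure decision, kept free of Akka types so it is easy to unit test.
    public static HealthStatus Classify(
        bool bearsRole, bool isUpAndLeader, HealthStatus standbyStatus)
    {
        // A role-filtered check on a node without the role is Healthy immediately.
        if (!bearsRole) return HealthStatus.Healthy;
        return isUpAndLeader ? HealthStatus.Healthy : standbyStatus;
    }

    // Adapter that reads the inputs from the local cluster view.
    public static HealthStatus Evaluate(
        Cluster cluster, string? roleFilter, HealthStatus standbyStatus)
    {
        var self = cluster.SelfMember;
        var bearsRole = roleFilter is null || self.HasRole(roleFilter);

        // Cluster-wide leader when role-less, role leader when filtered.
        var leader = roleFilter is null
            ? cluster.State.Leader
            : cluster.State.RoleLeader(roleFilter);

        var isUpAndLeader = self.Status == MemberStatus.Up && self.Address.Equals(leader);
        return Classify(bearsRole, isUpAndLeader, standbyStatus);
    }
}
```

With this shape, OtOpcUa's current behavior corresponds to `Evaluate(cluster, "admin", HealthStatus.Degraded)` and ScadaBridge's to `Evaluate(cluster, null, HealthStatus.Unhealthy)`. An `IActiveNodeGate` implementation could delegate to the same function, which is the duplication Gap L2 aims to remove.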
### §6 Response writer
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default (plain-text/JSON) | Bespoke `GatewayHealthReply` JSON | `UIResponseWriter.WriteHealthCheckUIResponse` |

**Gap W1:** the shared `ZB.MOM.WW.Health` package ships a canonical JSON response writer
(lifting `HealthChecks.UI.Client` style to the default). All three projects adopt it on
`MapZbHealth()` call — no per-project writer wiring needed.
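
For orientation, a response writer is just a function from `HttpContext` and `HealthReport` to a `Task`, assigned to `HealthCheckOptions.ResponseWriter`. The sketch below shows one under assumed field names. The JSON shape is illustrative only and is neither the `HealthChecks.UI.Client` schema nor the canonical writer the package will ship.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Illustrative writer (assumption): field names are placeholders.
public static class JsonHealthWriter
{
    // Pure projection of the report, separated out so it can be tested.
    public static string ToJson(HealthReport report) => JsonSerializer.Serialize(new
    {
        status = report.Status.ToString(),
        totalDurationMs = report.TotalDuration.TotalMilliseconds,
        entries = report.Entries.ToDictionary(
            entry => entry.Key,
            entry => new
            {
                status = entry.Value.Status.ToString(),
                description = entry.Value.Description,
                durationMs = entry.Value.Duration.TotalMilliseconds,
            }),
    });

    // Matches the Func<HttpContext, HealthReport, Task> expected by ResponseWriter.
    public static Task WriteAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json; charset=utf-8";
        return context.Response.WriteAsync(ToJson(report));
    }
}
```

It would be wired per endpoint as `new HealthCheckOptions { ResponseWriter = JsonHealthWriter.WriteAsync }`. Centralizing that assignment inside one mapping helper is what removes the per-project wiring.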
### §7 Endpoint authentication
Both OtOpcUa and ScadaBridge expose health endpoints without authentication (`AllowAnonymous` or
open by default). MxAccessGateway's `/health/live` has no authentication requirement. The spec
canonizes this: health tiers are `AllowAnonymous`; `MapZbHealth()` applies `AllowAnonymous` by
default.

No gap — consistent across all three. `MapZbHealth()` should document and enforce this default.
## Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove hardcoded `/health/live` + dead `AddHealthChecks()`, add `GrpcDependencyHealthCheck` (worker IPC) + `MapZbHealth()` | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3 — no probes/tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (`AkkaClusterHealthCheck` OtOpcUaCompat + `ActiveNodeHealthCheck` role-filtered + `DatabaseHealthCheck<T>` ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + `CanConnectAsync`) + add `/healthz` + unify `ActiveNodeGate` with `IActiveNodeGate` seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add `GrpcDependencyHealthCheck` for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2 — closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to `MapZbHealth` default) | all 3 | P3 | XS | Low | Gap W1 — mechanical; bundled with #1–3 |
| 6 | DB injection style: switch ScadaBridge from injected `DbContext` to `IDbContextFactory<T>` | ScadaBridge | P3 | XS | Low | Gap D2 — background-service safety |

**Note: adoption items #1–6 are all follow-on tasks.** They are tracked here as the backlog for
after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs, tests) is a
separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
the library is built first; adoption by the three apps is the next step.
## Decisions still open
- Whether `GrpcDependencyHealthCheck` takes a named channel (from DI) or a raw `ChannelBase`;
this affects how MxAccessGateway registers the worker-IPC probe without a standard gRPC channel.
- Whether `IActiveNodeGate` lives in `ZB.MOM.WW.Health` (making it a hard dependency) or stays
in ScadaBridge's `InboundAPI` project (keeping the gate as a ScadaBridge concern).
- Whether the `OtOpcUaCompat` preset for `AkkaClusterHealthCheck` is a named constant or just
documented configuration.