scadaproj/components/health/GAPS.md
Joseph Doherty 07d5907258 docs(health): resolve spec/contract/gaps consistency (review fixes)
Applies canonical resolutions for eight settled decisions:
- GAPS: remove three stale "Decisions still open" bullets (#1 IActiveNodeGate placement, #2 GrpcChannel type, #3 OtOpcUaCompat named constant)
- Shared contract: AkkaClusterHealthCheck, ActiveNodeHealthCheck constructors take IServiceProvider (lazy ActorSystem, Degraded-when-not-ready)
- Shared contract: AkkaActiveNodeGate takes IServiceProvider; reads SelfMember+leader directly, null-guarded; does not proxy ActiveNodeHealthCheck
- Shared contract: DatabaseHealthCheckOptions.Probe renamed to ProbeQuery; consumer matrix updated
- Shared contract: settled AddZbHealthChecks open question removed (spec §5 is per-project AddHealthChecks)
- SPEC §2.2: OtOpcUaCompat Leaving/Exiting cell updated from — to Degraded + footnote; §2.3 startup-safety note added
- README: status line corrected from "built and tested" to "scaffolded … implementation is follow-on (task #7)"; IActiveNodeGate "left per-project" bullet removed
- OtOpcUa current-state: AddZbHealthChecks → AddHealthChecks().AddCheck<...>(); IClusterRoleInfo note reframed as accepted trade-off
- ScadaBridge current-state: IActiveNodeGate bullet rewritten — interface moves to ZB.MOM.WW.Health on adoption, InboundApiEndpointFilter references shared interface
2026-06-01 06:33:42 -04:00

# Health — gaps & adoption backlog
Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Health` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
## Divergence vs spec
### §1 Endpoint tiers
| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `/health/ready` (tag `ready`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/health/active` (tag `active`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/healthz` (bare process liveness) | ✅ present | ⛔ absent | ⛔ absent |
| `/health/live` (non-standard) | — | ⛔ present (hardcoded `"Healthy"`, bypasses health-check pipeline) | — |
**Gap T1 (P1):** MxAccessGateway has no standard health tiers. The existing `/health/live`
`MapGet` lambda must be replaced by `app.MapZbHealth()` + real probes.
**Gap T2:** ScadaBridge lacks `/healthz`. `MapZbHealth()` adds it automatically.
**Gap T3:** MxAccessGateway's `/health/live` uses a raw `MapGet` that bypasses the ASP.NET Core
health-check middleware — it does not participate in `IHealthCheckPublisher`, `HealthReport`, or
UI integration. Must be removed.
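As a rough illustration of the tier layout in the table above, the three standard endpoints can be mapped with stock ASP.NET Core `MapHealthChecks` calls. This is a sketch only: the internals of `MapZbHealth()` are not shown in this document, and the `"self"` check is a hypothetical placeholder for real probes.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Hypothetical placeholder probe; real probes would carry the "ready" and/or "active" tags.
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "ready" });

var app = builder.Build();

// Readiness tier: runs only the checks tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("ready"),
}).AllowAnonymous();

// Active-node tier: runs only the checks tagged "active".
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("active"),
}).AllowAnonymous();

// Bare process liveness: the predicate excludes every check, so the endpoint
// reports Healthy whenever the process can serve the request at all.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false,
}).AllowAnonymous();

app.Run();
```

Unlike a raw `MapGet` lambda, endpoints mapped this way go through the health-check middleware and produce a `HealthReport`.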
### §2 Probe coverage
| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | ✅ `DatabaseHealthCheck` (query probe) | ⛔ none | ✅ `DatabaseHealthCheck` (`CanConnectAsync`) |
| Akka cluster membership | ✅ `AkkaClusterHealthCheck` (2-way) | n/a (no Akka) | ✅ `AkkaClusterHealthCheck` (3-way) |
| Active / leader node | ✅ `AdminRoleLeaderHealthCheck` (role-filtered) | n/a | ✅ `ActiveNodeHealthCheck` (role-less) |
| Downstream gRPC dependency | ⛔ none | ⛔ none | ⛔ none |
**Gap P1 (P1):** MxAccessGateway has zero probes — `AddHealthChecks()` at
`GatewayApplication.cs:61` is dead code. Minimum viable: a `GrpcDependencyHealthCheck`
targeting the x86 worker IPC channel.
**Gap P2:** No project probes its downstream gRPC dependency. OtOpcUa should probe the
MxAccessGateway channel; MxAccessGateway should probe the worker IPC.
**Gap P3:** Dead `AddHealthChecks()` in MxAccessGateway (`GatewayApplication.cs:61`) should be
removed or replaced — it currently implies health checks are configured when they are not.
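One possible shape for a gRPC dependency probe is sketched below. It assumes the downstream service implements the standard `grpc.health.v1` health service (from the `Grpc.HealthCheck` package), which this document does not confirm for the worker IPC channel. The class name, constructor, and timeout parameter are hypothetical and are not the shared `GrpcDependencyHealthCheck` contract.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;
using Grpc.Net.Client;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch: probes a downstream gRPC service through grpc.health.v1.
public sealed class GrpcDependencyHealthCheckSketch : IHealthCheck
{
    private readonly GrpcChannel _channel;
    private readonly TimeSpan _timeout;

    public GrpcDependencyHealthCheckSketch(GrpcChannel channel, TimeSpan timeout)
    {
        _channel = channel;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var client = new Health.HealthClient(_channel);
        try
        {
            // An empty service name asks for the overall server status.
            var reply = await client.CheckAsync(
                new HealthCheckRequest(),
                deadline: DateTime.UtcNow.Add(_timeout),
                cancellationToken: cancellationToken);

            return reply.Status == HealthCheckResponse.Types.ServingStatus.Serving
                ? HealthCheckResult.Healthy()
                : new HealthCheckResult(
                    context.Registration.FailureStatus,
                    $"Dependency reported {reply.Status}.");
        }
        catch (RpcException ex)
        {
            // Unreachable, deadline exceeded, or unimplemented health service.
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                $"gRPC health call failed: {ex.StatusCode}.",
                ex);
        }
    }
}
```

If the worker does not expose `grpc.health.v1`, the same structure would apply with a cheap application-level call substituted for `CheckAsync`.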
### §3 Akka status-policy divergence
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans `State.Members` for self by address | Reads `SelfMember.Status` directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | — (none; `ActorSystem` injected directly) | ✅ Degraded if null |
The two implementations diverge in how they classify `Joining`: ScadaBridge reports Healthy,
while OtOpcUa reports Degraded because a self member in `Joining` status does not appear as
`Up` in the member scan. They also diverge on `Removed`/`Down`: ScadaBridge reports Unhealthy,
OtOpcUa reports Degraded.
The shared `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` ships two presets to preserve both
behaviors rather than forcing one onto the other:
- **Default** — ScadaBridge's three-way policy (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded,
else Unhealthy)
- **OtOpcUaCompat** — OtOpcUa's self-Up-among-members scan (found Up=Healthy, not found=Degraded)
**Gap A1:** OtOpcUa adopts the `OtOpcUaCompat` preset; ScadaBridge adopts the `Default` preset.
Both preserve existing behavior without forcing convergence on a single policy.
**Gap A2:** OtOpcUa's `AkkaClusterHealthCheck` injects `ActorSystem` directly (no null guard).
The shared implementation injects via `AkkaHostedService` for startup safety.
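The two presets described in this section reduce to a small status mapping, sketched below as a pure function. The enums are local stand-ins for `Akka.Cluster.MemberStatus` and `HealthStatus` so the sketch has no package dependencies. The type and method names are hypothetical, and a `null` status models both "ActorSystem not ready" and "self not found in the member scan".

```csharp
// Local stand-ins for Akka.Cluster.MemberStatus and
// Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus.
public enum MemberStatus { Joining, WeaklyUp, Up, Leaving, Exiting, Down, Removed }
public enum HealthStatus { Unhealthy, Degraded, Healthy }

// Hypothetical name for the two presets described in this section.
public enum ClusterStatusPreset { Default, OtOpcUaCompat }

public static class ClusterStatusPolicySketch
{
    public static HealthStatus Evaluate(ClusterStatusPreset preset, MemberStatus? selfStatus)
    {
        // No status available (system not ready, or self absent from the scan): Degraded.
        if (selfStatus is null)
            return HealthStatus.Degraded;

        if (preset == ClusterStatusPreset.OtOpcUaCompat)
        {
            // Two-way policy: self found as Up is Healthy, anything else is Degraded.
            return selfStatus.Value == MemberStatus.Up
                ? HealthStatus.Healthy
                : HealthStatus.Degraded;
        }

        // Default three-way policy.
        switch (selfStatus.Value)
        {
            case MemberStatus.Up:
            case MemberStatus.Joining:
                return HealthStatus.Healthy;
            case MemberStatus.Leaving:
            case MemberStatus.Exiting:
                return HealthStatus.Degraded;
            default:
                // "Else" branch of the policy as stated above (Removed, Down, ...).
                return HealthStatus.Unhealthy;
        }
    }
}
```

Under this mapping the presets disagree exactly where the divergence table says they do: `Joining` is Healthy under `Default` and Degraded under `OtOpcUaCompat`, and `Down`/`Removed` are Unhealthy under `Default` and Degraded under `OtOpcUaCompat`.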
### §4 Database probe technique
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | `db.Deployments.AsNoTracking().Take(1).ToListAsync()` (query) | `_dbContext.Database.CanConnectAsync()` (connection only) |
| Injection style | `IDbContextFactory<T>` (pooled, safe for concurrent probes) | `DbContext` directly (scoped, requires care in background use) |
| Schema verification | ✅ implies schema is applied | ⛔ connection only |
**Gap D1:** `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<TContext>` uses
`CanConnectAsync` as the default (ScadaBridge behavior). An optional `ProbeQuery` delegate covers
OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced
to change unless desired.
**Gap D2:** ScadaBridge injects `DbContext` directly; the shared probe should use
`IDbContextFactory<TContext>` for safe reuse from a background-service health-check context.
ScadaBridge's DI registration will need updating on adoption.
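A minimal sketch of how Gaps D1 and D2 could fit together: the probe resolves a short-lived context from `IDbContextFactory<TContext>`, defaults to `CanConnectAsync`, and runs an optional `ProbeQuery` delegate when one is supplied. The class names and the delegate signature are assumptions for illustration, not the shared contract. `CreateDbContextAsync` requires EF Core 6 or later.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical options type; the ProbeQuery signature is an assumption.
public sealed class DatabaseHealthCheckOptionsSketch<TContext> where TContext : DbContext
{
    // When null, the check falls back to CanConnectAsync (connection only).
    public Func<TContext, CancellationToken, Task>? ProbeQuery { get; set; }
}

public sealed class DatabaseHealthCheckSketch<TContext> : IHealthCheck where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly DatabaseHealthCheckOptionsSketch<TContext> _options;

    public DatabaseHealthCheckSketch(
        IDbContextFactory<TContext> factory,
        DatabaseHealthCheckOptionsSketch<TContext> options)
    {
        _factory = factory;
        _options = options;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // A fresh context per probe avoids sharing a scoped DbContext across calls.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            if (_options.ProbeQuery is { } probe)
            {
                // Stricter mode: a real query also fails if the schema is missing.
                await probe(db, cancellationToken);
                return HealthCheckResult.Healthy();
            }

            return await db.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy()
                : new HealthCheckResult(
                    context.Registration.FailureStatus, "Database connection failed.");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            return new HealthCheckResult(
                context.Registration.FailureStatus, "Database probe threw.", ex);
        }
    }
}
```

Under these assumptions, OtOpcUa's existing behavior would correspond to a `ProbeQuery` such as `(db, ct) => db.Deployments.AsNoTracking().Take(1).ToListAsync(ct)`, while ScadaBridge would leave `ProbeQuery` unset.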
### §5 Active-node / leader check
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | `AdminRoleLeaderHealthCheck` (role-filtered: `"admin"`) | `ActiveNodeHealthCheck` (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| `IActiveNodeGate` backing | Not present | `ActiveNodeGate` (separate type, duplicated logic) |
**Gap L1:** `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with an optional `RoleFilter`
parameter unifies both behaviors. OtOpcUa passes `RoleFilter = "admin"` (role-filtered);
ScadaBridge uses no role filter.
**Gap L2:** ScadaBridge's `ActiveNodeGate` duplicates `ActiveNodeHealthCheck` logic. The shared
`IActiveNodeGate` seam + a backing singleton eliminates the duplication.
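The table above can be condensed into one decision function, sketched here without Akka dependencies. `HealthStatus` is a local stand-in, and the caller is assumed to supply whether the node is `Up` and whether it is the relevant leader (cluster leader, or role leader when a filter is set). The `standbyStatus` parameter is a hypothetical addition to express the Degraded-versus-Unhealthy difference for standby nodes; the document only names `RoleFilter`.

```csharp
#nullable enable
using System.Collections.Generic;

// Local stand-in for Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class ActiveNodePolicySketch
{
    public static HealthStatus Evaluate(
        bool selfIsUp,
        bool selfIsLeader,
        IReadOnlySet<string> selfRoles,
        string? roleFilter,
        HealthStatus standbyStatus)
    {
        // Role-filtered mode: a node that does not carry the role is not a
        // leader candidate, so it reports Healthy immediately.
        if (roleFilter is not null && !selfRoles.Contains(roleFilter))
            return HealthStatus.Healthy;

        // Active node: Up and leader. Any other node is treated as standby.
        return selfIsUp && selfIsLeader ? HealthStatus.Healthy : standbyStatus;
    }
}
```

In these terms, OtOpcUa's current behavior corresponds to `roleFilter: "admin"` with a Degraded standby status, and ScadaBridge's to `roleFilter: null` with an Unhealthy standby status.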
### §6 Response writer
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default (plain-text/JSON) | Bespoke `GatewayHealthReply` JSON | `UIResponseWriter.WriteHealthCheckUIResponse` |
**Gap W1:** The shared `ZB.MOM.WW.Health` package ships a canonical JSON response writer
(making the `HealthChecks.UI.Client` style the default). All three projects adopt it by
calling `MapZbHealth()`; no per-project writer wiring is needed.
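For orientation, a JSON response writer plugs into ASP.NET Core through `HealthCheckOptions.ResponseWriter`. The sketch below shows one way such a writer could look. The class name and JSON field names are illustrative assumptions and do not reproduce the exact `HealthChecks.UI.Client` schema or the canonical writer.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical writer: serializes the overall status plus one entry per check.
public static class JsonHealthResponseWriterSketch
{
    public static Task WriteAsync(HttpContext http, HealthReport report)
    {
        http.Response.ContentType = "application/json; charset=utf-8";

        var payload = new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration.ToString(),
            entries = report.Entries.ToDictionary(
                entry => entry.Key,
                entry => new
                {
                    status = entry.Value.Status.ToString(),
                    description = entry.Value.Description,
                    duration = entry.Value.Duration.ToString(),
                }),
        };

        return http.Response.WriteAsync(JsonSerializer.Serialize(payload));
    }
}
```

It would be attached per endpoint with `new HealthCheckOptions { ResponseWriter = JsonHealthResponseWriterSketch.WriteAsync }`, which is the wiring a shared `MapZbHealth()` could centralize.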
### §7 Endpoint authentication
Both OtOpcUa and ScadaBridge expose health endpoints without authentication (`AllowAnonymous` or
open by default). MxAccessGateway's `/health/live` has no authentication requirement. The spec
canonizes this: health tiers are `AllowAnonymous`; `MapZbHealth()` applies `AllowAnonymous` by
default.
No gap — consistent across all three. `MapZbHealth()` should document and enforce this default.
## Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove dead `/health/live` + `AddHealthChecks()`, add `GrpcDependencyHealthCheck` (worker IPC) + `MapZbHealth()` | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3 — no probes/tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (`AkkaClusterHealthCheck` OtOpcUaCompat + `ActiveNodeHealthCheck` role-filtered + `DatabaseHealthCheck<T>` ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + `CanConnectAsync`) + add `/healthz` + unify `ActiveNodeGate` with `IActiveNodeGate` seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add `GrpcDependencyHealthCheck` for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2 — closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to `MapZbHealth` default) | all 3 | P3 | XS | Low | Gap W1; mechanical, bundled with #1-3 |
| 6 | DB injection style: switch ScadaBridge from injected `DbContext` to `IDbContextFactory<T>` | ScadaBridge | P3 | XS | Low | Gap D2 — background-service safety |
**Note: adoption items #1-6 are all follow-on tasks.** They are tracked here as the backlog to
work through once `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs,
tests) is a separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme`
are structured: the library is built first; adoption by the three apps is the next step.