docs(health): current-state x3 + GAPS + README

Code-verified current-state docs for OtOpcUa (three-tier full), ScadaBridge
(two-tier, no /healthz), and MxAccessGateway (bare liveness only / no probes).
GAPS backlog with P1 for MxGateway and convergence items for Akka status policy,
DB probe technique, and response writer. README with per-project status table.
Joseph Doherty
2026-06-01 06:23:53 -04:00
parent 1dc35a8c43
commit 3d25ee5090
5 changed files with 698 additions and 0 deletions
@@ -0,0 +1,133 @@
# Health — current state: MxAccessGateway
Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (x86), gRPC;
solution `src/MxGateway.sln`.
Health code lives in `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`. All paths relative
to repo root.
Verified 2026-06-01.
**Summary: bare liveness only.** MxAccessGateway has a single `/health/live` endpoint that returns
a hardcoded `GatewayHealthReply` JSON object. `AddHealthChecks()` is called at startup but is
entirely unused — no `IHealthCheck` implementations are registered, `MapHealthChecks` is never
called, and there is no readiness or active-node tier. The net48 x86 worker process has no HTTP
server and therefore no health endpoint of any kind.
## 1. Endpoint wiring
`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:
- `:61`: `builder.Services.AddHealthChecks()` is called in the DI registration block. **This call
is dead**: no `.AddCheck<T>()` call follows it, no `MapHealthChecks` is ever called. The
framework registers the health-check infrastructure but nothing is wired through it.
- `:139-145`: `MapGatewayEndpoints` maps a raw `endpoints.MapGet("/health/live", ...)` (not
`MapHealthChecks`). The handler is an inline lambda that returns `Results.Ok(new GatewayHealthReply(...))`
with a hardcoded `Status: "Healthy"`:
```csharp
endpoints.MapGet(
        "/health/live",
        () => Results.Ok(new GatewayHealthReply(
            Status: "Healthy",
            DefaultBackend: GatewayContractInfo.DefaultBackendName,
            WorkerProtocolVersion: GatewayContractInfo.WorkerProtocolVersion)))
    .WithName("LiveHealth");
```
This endpoint always returns HTTP 200 `{"Status":"Healthy",...}` as long as the process is alive.
It carries no authentication requirement (no `[Authorize]` or `.RequireAuthorization()`).
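For contrast, a minimal sketch of a registration that is not dead, using only the stock ASP.NET Core health-check APIs. The check name `self`, the `live` tag, and the choice of route are illustrative assumptions; none of this is taken from `GatewayApplication.cs`.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// AddHealthChecks() only registers the infrastructure; a check has to be
// chained onto the returned builder for anything to run.
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" }); // hypothetical name and tag

var app = builder.Build();

// MapHealthChecks routes the request through the health-check middleware,
// which runs the registrations selected by the predicate.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("live"),
});

app.Run();
```

Both halves are needed: a check chained onto `AddHealthChecks()` and a `MapHealthChecks` route that selects it.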
## 2. Response shape
`GatewayHealthReply` is a record with three fields:
- `Status` — always `"Healthy"` (hardcoded string, not the ASP.NET Core `HealthStatus` enum)
- `DefaultBackend` — value of `GatewayContractInfo.DefaultBackendName` (the configured backend
name, useful for confirming which gateway instance a probe hit)
- `WorkerProtocolVersion` — value of `GatewayContractInfo.WorkerProtocolVersion` (the gRPC
protocol version the gateway expects from the worker, useful for version-skew detection)
The response is not `HealthChecks.UI.Client` JSON and is not the standard ASP.NET Core health
response shape. It is a bespoke JSON record.
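A caller that wants to use `WorkerProtocolVersion` for version-skew detection could read the reply roughly as follows. This is a sketch under assumptions: all three fields are modelled as strings (the real field types are not shown here), and `HealthReplyReader` and its methods are hypothetical names.

```csharp
using System;
using System.Text.Json;

// Assumed shape: the source names the three fields but not their types.
public sealed record GatewayHealthReply(
    string Status,
    string DefaultBackend,
    string WorkerProtocolVersion);

public static class HealthReplyReader
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // Accepts "Status" or "status" on the wire, so the reader does not
        // depend on the server's JSON naming policy.
        PropertyNameCaseInsensitive = true,
    };

    public static GatewayHealthReply Parse(string json) =>
        JsonSerializer.Deserialize<GatewayHealthReply>(json, Options)
        ?? throw new JsonException("Empty health reply.");

    // True when the gateway reports a different worker protocol version
    // than the caller expects.
    public static bool HasVersionSkew(GatewayHealthReply reply, string expectedVersion) =>
        !string.Equals(reply.WorkerProtocolVersion, expectedVersion, StringComparison.Ordinal);
}
```

If the real `WorkerProtocolVersion` is numeric on the wire, the record's field type would need to change accordingly.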
## 3. Probes
None. There is no `IHealthCheck` registered. The `/health/live` response does not reflect:
- Whether the SQLite auth-store is reachable
- Whether any active MXAccess session is functional
- Whether the x86 worker named-pipe IPC is connected or the worker process is alive
- Whether the gRPC service is actually accepting calls
The endpoint is purely a process liveness indicator.
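To illustrate the kind of probe that is missing, here is a minimal `IHealthCheck` sketch for the worker IPC state. The `isConnected` delegate is a hypothetical hook into whatever tracks the named pipe inside the gateway, and the class name is made up. The adoption plan below uses the shared `GrpcDependencyHealthCheck` rather than a hand-written class like this one.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class WorkerConnectionHealthCheck : IHealthCheck
{
    // Hypothetical hook: returns true while the worker pipe is connected.
    private readonly Func<bool> _isConnected;

    public WorkerConnectionHealthCheck(Func<bool> isConnected) =>
        _isConnected = isConnected ?? throw new ArgumentNullException(nameof(isConnected));

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Report the observed state instead of a hardcoded "Healthy".
        var result = _isConnected()
            ? HealthCheckResult.Healthy("Worker IPC channel is connected.")
            : HealthCheckResult.Unhealthy("Worker IPC channel is not connected.");
        return Task.FromResult(result);
    }
}
```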
## 4. Tier coverage
| Tier | Endpoint | Status |
|---|---|---|
| Process liveness | `/health/live` (raw `MapGet`) | ✅ present (but non-standard) |
| Readiness | `/health/ready` | ⛔ absent |
| Active node | `/health/active` | ⛔ absent (not Akka-based; not applicable as-is) |
| `healthz` convention | `/healthz` | ⛔ absent |
MxAccessGateway is not an Akka.NET application — it has no cluster, no leader election, and no
active-node concept. The "active" tier in the shared spec translates here to "is the worker process
connected and the gRPC service ready to accept calls?" rather than cluster leadership.
## 5. x86 worker
`ZB.MOM.WW.MxGateway.Worker` is a .NET 4.8 console application communicating with the gateway
over Windows named-pipe IPC. It has no HTTP server, no health endpoint, and no exposure to any
probe mechanism. Its liveness must be inferred indirectly — either via the gateway process
monitoring it (not currently implemented) or via the `GrpcDependencyHealthCheck` the gateway
could use to probe the IPC channel.
## 6. Notable gaps
- `AddHealthChecks()` at `:61` is dead code. No `IHealthCheck` is ever registered via this call.
- `/health/live` uses `MapGet` (a raw minimal-API handler) rather than `MapHealthChecks`. It
bypasses the ASP.NET Core health-check middleware entirely, which means it does not participate
in the standard health-check pipeline (no `IHealthCheckPublisher`, no `HealthReport`, no UI
integration).
- The hardcoded `"Healthy"` status means the endpoint cannot reflect real probe results even if
probes were added later — the handler must be replaced, not just supplemented.
- No readiness gating: orchestrators (Kubernetes, Traefik) that expect `/health/ready` to return
503 until the process is actually ready get a 404 from MxAccessGateway today, because the route
does not exist. Pointed at `/health/live` instead, they get an unconditional 200.
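Readiness gating of the kind described in the last bullet is commonly done with a check that stays unhealthy until startup work completes. The sketch below assumes a hypothetical `MarkReady()` hook called by the host once initialization finishes; the class name is illustrative.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class StartupGateHealthCheck : IHealthCheck
{
    private volatile bool _ready;

    // Hypothetical hook: call once startup work has finished.
    public void MarkReady() => _ready = true;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default) =>
        Task.FromResult(_ready
            ? HealthCheckResult.Healthy("Startup complete.")
            : HealthCheckResult.Unhealthy("Startup still in progress."));
}
```

Registered as a singleton and added with `AddCheck<StartupGateHealthCheck>("startup", tags: new[] { "ready" })`, a route mapped with `MapHealthChecks` returns 503 while the check is unhealthy, since the default `HealthCheckOptions.ResultStatusCodes` map `Unhealthy` to 503 and `Healthy` to 200.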
---
## Adoption plan → `ZB.MOM.WW.Health`
**Replace `/health/live` + wire the shared tiers:**
The `AddHealthChecks()` call at `GatewayApplication.cs:61` is already present — it just needs
probes registered against it. The raw `MapGet("/health/live", ...)` handler at `:139-145` must be
removed and replaced with `app.MapZbHealth()` from `ZB.MOM.WW.Health`.
Steps:
1. **Remove** the inline `MapGet("/health/live", ...)` lambda (`:139-145`). The `GatewayHealthReply`
record and `DefaultBackend`/`WorkerProtocolVersion` metadata can be surfaced differently (e.g., a
`/info` endpoint or as custom data on the health response).
2. **Register a `GrpcDependencyHealthCheck`** (from `ZB.MOM.WW.Health`) that probes the
named-pipe IPC channel to the x86 worker. Tag `["ready"]`. This replaces the hardcoded
liveness-only response with a real probe that reflects whether the worker is reachable.
3. **Optionally add a `GrpcDependencyHealthCheck`** for any downstream gRPC dependency (e.g., the
Galaxy Repository connection) if the gateway is expected to be healthy only when its upstreams are
reachable. Tag `["ready"]`.
4. **Call `app.MapZbHealth()`** — this maps `/health/ready` (tag `ready`), `/health/active` (tag
`active`; initially empty — no active-node concept in MxGateway), and `/healthz` (bare liveness).
The `/healthz` endpoint takes over the semantic role that `/health/live` serves today.
5. **Do not add `ZB.MOM.WW.Health.Akka`** — MxAccessGateway has no Akka dependency. The consumer
matrix in the design specifies MxGateway uses the core package only.
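The tier layout in step 4 can be approximated with stock ASP.NET Core APIs. The sketch below is a stand-in written for illustration, not the `ZB.MOM.WW.Health` implementation of `MapZbHealth()`. The method name `MapTiersSketch` is hypothetical, and only the routes and tag names come from the steps above.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;

public static class HealthTierSketch
{
    // Stand-in for the three routes that step 4 attributes to MapZbHealth().
    public static IEndpointRouteBuilder MapTiersSketch(this IEndpointRouteBuilder endpoints)
    {
        // Readiness: runs only checks tagged "ready".
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("ready"),
        });

        // Active node: runs only checks tagged "active" (none in MxGateway initially).
        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains("active"),
        });

        // Bare liveness: the predicate selects no checks, so the route
        // answers as long as the process can serve HTTP.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        });

        return endpoints;
    }
}
```

With the stock middleware, a route whose predicate matches no checks reports `Healthy`, so an empty `active` tier would return 200. Whether `MapZbHealth()` behaves the same way for an empty tier is not stated here and should be confirmed against the library.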
**Keep bespoke:**
- The `WorkerProtocolVersion` / `DefaultBackend` metadata from `GatewayHealthReply` is
MxAccessGateway-specific; keep it as a separate `/info` endpoint or embed it as `Data` on a
custom probe rather than normalizing it into the shared contract.
- The x86 worker itself (net48 console, named-pipe IPC, no HTTP) remains outside the shared health
scheme. The `GrpcDependencyHealthCheck` observes the worker indirectly from the gateway side.
- Per-gateway auth and TLS concerns on who may call health endpoints remain per-project.
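One way to embed the bespoke metadata as `Data` on a custom probe, as the first bullet suggests, is sketched below. The class name and constructor parameters are hypothetical, and both values are modelled as strings as an assumption.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class ContractInfoHealthCheck : IHealthCheck
{
    private readonly IReadOnlyDictionary<string, object> _data;

    // Values would come from GatewayContractInfo in the gateway; they are
    // passed in here to keep the sketch self-contained.
    public ContractInfoHealthCheck(string defaultBackend, string workerProtocolVersion)
    {
        _data = new Dictionary<string, object>
        {
            ["DefaultBackend"] = defaultBackend,
            ["WorkerProtocolVersion"] = workerProtocolVersion,
        };
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default) =>
        // Always healthy: the check exists only to carry metadata.
        Task.FromResult(HealthCheckResult.Healthy("Gateway contract metadata.", _data));
}
```

The default `MapHealthChecks` response writer emits only the aggregate status as plain text, so `Data` reaches callers only if the response writer in use serializes it.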
**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. MxGateway is the **highest-priority adopter** (P1 gap — no probes/tiers today)
and should be the first app wired up once the nupkg is available.