docs(health): current-state x3 + GAPS + README
Code-verified current-state docs for OtOpcUa (three-tier full), ScadaBridge (two-tier, no /healthz), and MxAccessGateway (bare liveness only / no probes). GAPS backlog with P1 for MxGateway and convergence items for Akka status policy, DB probe technique, and response writer. README with per-project status table.
@@ -0,0 +1,141 @@
# Health — gaps & adoption backlog

Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Health` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.

## Divergence vs spec

### §1 Endpoint tiers

| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `/health/ready` (tag `ready`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/health/active` (tag `active`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/healthz` (bare process liveness) | ✅ present | ⛔ absent | ⛔ absent |
| `/health/live` (non-standard) | — | ⛔ present (hardcoded `"Healthy"`, bypasses health-check pipeline) | — |

→ **Gap T1 (P1):** MxAccessGateway has no standard health tiers. The existing `/health/live`
`MapGet` lambda must be replaced by `app.MapZbHealth()` + real probes.
→ **Gap T2:** ScadaBridge lacks `/healthz`. `MapZbHealth()` adds it automatically.
→ **Gap T3:** MxAccessGateway's `/health/live` uses a raw `MapGet` that bypasses the ASP.NET Core
health-check middleware — it does not participate in `IHealthCheckPublisher`, `HealthReport`, or
UI integration. Must be removed.
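
As a rough illustration of the three tiers in the table above, the sketch below maps them with the stock ASP.NET Core `MapHealthChecks` API. The names `ThreeTierHealthSketch`, `TaggedWith`, and `MapThreeTierHealthSketch` are hypothetical. This document does not show how `MapZbHealth()` is implemented, so the code is an assumption about its general shape, not the library's code.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper, not the real MapZbHealth(): maps the three spec tiers
// using only stock ASP.NET Core health-check APIs.
public static class ThreeTierHealthSketch
{
    // Selects the registered checks that carry the given tag.
    public static Func<HealthCheckRegistration, bool> TaggedWith(string tag) =>
        registration => registration.Tags.Contains(tag);

    public static IEndpointRouteBuilder MapThreeTierHealthSketch(this IEndpointRouteBuilder endpoints)
    {
        endpoints.MapHealthChecks("/health/ready",
            new HealthCheckOptions { Predicate = TaggedWith("ready") }).AllowAnonymous();
        endpoints.MapHealthChecks("/health/active",
            new HealthCheckOptions { Predicate = TaggedWith("active") }).AllowAnonymous();
        // Bare liveness: the predicate rejects every check, so no probe runs.
        endpoints.MapHealthChecks("/healthz",
            new HealthCheckOptions { Predicate = _ => false }).AllowAnonymous();
        return endpoints;
    }
}
```

Unlike the raw `MapGet` in Gap T3, every endpoint here goes through the health-check middleware, so registered checks and the resulting `HealthReport` determine the response.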

### §2 Probe coverage

| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | ✅ `DatabaseHealthCheck` (query probe) | ⛔ none | ✅ `DatabaseHealthCheck` (`CanConnectAsync`) |
| Akka cluster membership | ✅ `AkkaClusterHealthCheck` (2-way) | n/a (no Akka) | ✅ `AkkaClusterHealthCheck` (3-way) |
| Active / leader node | ✅ `AdminRoleLeaderHealthCheck` (role-filtered) | n/a | ✅ `ActiveNodeHealthCheck` (role-less) |
| Downstream gRPC dependency | ⛔ none | ⛔ none | ⛔ none |

→ **Gap P1 (P1):** MxAccessGateway has zero probes — `AddHealthChecks()` at
`GatewayApplication.cs:61` is dead code. Minimum viable: a `GrpcDependencyHealthCheck`
targeting the x86 worker IPC channel.
→ **Gap P2:** No project probes its downstream gRPC dependency. OtOpcUa should probe the
MxAccessGateway channel; MxAccessGateway should probe the worker IPC.
→ **Gap P3:** Dead `AddHealthChecks()` in MxAccessGateway (`GatewayApplication.cs:61`) should be
removed or replaced — it currently implies health checks are configured when they are not.

### §3 Akka status-policy divergence

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans `State.Members` for self by address | Reads `SelfMember.Status` directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | — (none; `ActorSystem` injected directly) | ✅ Degraded if null |

The two implementations diverge in how they classify `Joining`: ScadaBridge reports it as Healthy,
while OtOpcUa reports it as Degraded, because a self member whose status is `Joining` does not
match `Up` in the member scan. They also diverge in the Removed/Down classification (ScadaBridge
Unhealthy, OtOpcUa Degraded).

The shared `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` ships two presets to preserve both
behaviors rather than forcing one onto the other:
- **Default** — ScadaBridge's three-way policy (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded,
  else Unhealthy)
- **OtOpcUaCompat** — OtOpcUa's self-Up-among-members scan (found Up=Healthy, not found=Degraded)

→ **Gap A1:** OtOpcUa adopts the `OtOpcUaCompat` preset; ScadaBridge adopts the `Default` preset.
Both preserve existing behavior without forcing convergence on a single policy.
→ **Gap A2:** OtOpcUa's `AkkaClusterHealthCheck` injects `ActorSystem` directly (no null guard).
The shared implementation injects via `AkkaHostedService` for startup safety.
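
The two presets can be read as two mappings from the node's own member status to a health status. The following is a minimal sketch under that reading. `AkkaStatusPolicySketch` and its method names are hypothetical; the shared library's actual preset API is still an open decision. The compat mapping is also simplified: the real OtOpcUa probe scans `State.Members` instead of receiving a status value.

```csharp
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical preset holder; the shared library's real type and member names
// are not shown in this document.
public static class AkkaStatusPolicySketch
{
    // "Default" preset: ScadaBridge's three-way policy.
    public static HealthStatus Default(MemberStatus self) => self switch
    {
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy, // Removed, Down, and anything else
    };

    // "OtOpcUaCompat" preset: only a self member that is Up counts as Healthy.
    // Every other state is Degraded; there is no Unhealthy outcome.
    public static HealthStatus OtOpcUaCompat(MemberStatus self) =>
        self == MemberStatus.Up ? HealthStatus.Healthy : HealthStatus.Degraded;
}
```

Under this sketch a `Joining` node is Healthy with the default mapping and Degraded with the compat mapping, which is the divergence the table records.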

### §4 Database probe technique

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | `db.Deployments.AsNoTracking().Take(1).ToListAsync()` (query) | `_dbContext.Database.CanConnectAsync()` (connection only) |
| Injection style | `IDbContextFactory<T>` (pooled, safe for concurrent probes) | `DbContext` directly (scoped, requires care in background use) |
| Schema verification | ✅ implies schema is applied | ⛔ connection only |

→ **Gap D1:** `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<TContext>` uses
`CanConnectAsync` as the default (ScadaBridge behavior). An optional `ProbeQuery` delegate covers
OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced
to change unless desired.
→ **Gap D2:** ScadaBridge injects `DbContext` directly; the shared probe should use
`IDbContextFactory<TContext>` for safe reuse from a background-service health-check context.
ScadaBridge's DI registration will need updating on adoption.
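
One way to combine Gap D1 and Gap D2 is a check that always creates its context from `IDbContextFactory<TContext>` and falls back to `CanConnectAsync` when no probe delegate is supplied. The sketch below assumes that design. `DatabaseHealthCheckSketch<TContext>` and the delegate signature are hypothetical; only the EF Core and health-check APIs it calls are real.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch of a shared database probe; not the library's actual code.
public sealed class DatabaseHealthCheckSketch<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly Func<TContext, CancellationToken, Task>? _probeQuery;

    public DatabaseHealthCheckSketch(
        IDbContextFactory<TContext> factory,
        Func<TContext, CancellationToken, Task>? probeQuery = null)
    {
        _factory = factory;
        _probeQuery = probeQuery;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // A fresh context per probe avoids sharing a scoped DbContext.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            if (_probeQuery is not null)
            {
                // Stricter mode: run a real query, e.g. read one row from a table.
                await _probeQuery(db, cancellationToken);
                return HealthCheckResult.Healthy("Probe query succeeded");
            }

            // Default mode: connection-only check.
            return await db.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy("Database reachable")
                : HealthCheckResult.Unhealthy("Database unreachable");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database probe failed", ex);
        }
    }
}
```

In this shape OtOpcUa would pass its existing `Deployments` query as the delegate, and ScadaBridge would pass nothing and keep connection-only semantics.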

### §5 Active-node / leader check

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | `AdminRoleLeaderHealthCheck` (role-filtered: `"admin"`) | `ActiveNodeHealthCheck` (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| `IActiveNodeGate` backing | Not present | `ActiveNodeGate` (separate type, duplicated logic) |

→ **Gap L1:** `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with an optional `RoleFilter`
parameter unifies both behaviors. OtOpcUa passes `RoleFilter = "admin"` (role-filtered);
ScadaBridge uses no role filter.
→ **Gap L2:** ScadaBridge's `ActiveNodeGate` duplicates `ActiveNodeHealthCheck` logic. The shared
`IActiveNodeGate` seam + a backing singleton eliminates the duplication.
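
The decision logic in the table can be sketched as a small pure function. `ActiveNodePolicySketch` and its parameters are hypothetical. In particular, the `standbyStatus` parameter is an assumption added here because the two apps report different statuses for a standby node (Degraded versus Unhealthy) and Gap L1 only mentions the role filter.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical policy function; parameter names are illustrative only.
public static class ActiveNodePolicySketch
{
    // hasFilteredRole: true when no role filter is configured, or when the node
    //                  carries the filtered role (for example "admin").
    // isLeader:        true when the node is Up and is the (role) leader.
    // standbyStatus:   what a non-leader reports; the current apps disagree
    //                  (OtOpcUa: Degraded, ScadaBridge: Unhealthy).
    public static HealthStatus Evaluate(bool hasFilteredRole, bool isLeader, HealthStatus standbyStatus)
    {
        if (!hasFilteredRole)
        {
            return HealthStatus.Healthy; // the probe never gates a node without the role
        }

        return isLeader ? HealthStatus.Healthy : standbyStatus;
    }
}
```

With this shape, OtOpcUa's current behavior corresponds to `standbyStatus = HealthStatus.Degraded` and ScadaBridge's to `HealthStatus.Unhealthy`.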

### §6 Response writer

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default ASP.NET Core writer (plain-text status) | Bespoke `GatewayHealthReply` JSON | `UIResponseWriter.WriteHealthCheckUIResponse` |

→ **Gap W1:** the shared `ZB.MOM.WW.Health` package ships a canonical JSON response writer
(lifting `HealthChecks.UI.Client` style to the default). All three projects adopt it on
`MapZbHealth()` call — no per-project writer wiring needed.
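
For orientation, a JSON response writer plugs into ASP.NET Core through `HealthCheckOptions.ResponseWriter`. The sketch below shows one possible writer. `JsonHealthWriterSketch` is a hypothetical name, and the JSON shape is a simplified assumption; it is not claimed to match the `HealthChecks.UI.Client` schema or the shared library's output.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical writer with a simplified, assumed JSON shape.
public static class JsonHealthWriterSketch
{
    // Pure serialization step, kept separate so it can be tested without HTTP.
    public static string Serialize(HealthReport report) =>
        JsonSerializer.Serialize(new
        {
            status = report.Status.ToString(),
            totalDurationMs = report.TotalDuration.TotalMilliseconds,
            entries = report.Entries.ToDictionary(
                e => e.Key,
                e => new
                {
                    status = e.Value.Status.ToString(),
                    description = e.Value.Description,
                    durationMs = e.Value.Duration.TotalMilliseconds,
                }),
        });

    // Signature matches HealthCheckOptions.ResponseWriter.
    public static Task WriteAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json";
        return context.Response.WriteAsync(Serialize(report));
    }
}
```

A per-endpoint hookup would look like `new HealthCheckOptions { ResponseWriter = JsonHealthWriterSketch.WriteAsync }`; a shared mapping helper could set this once for all tiers.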

### §7 Endpoint authentication

Both OtOpcUa and ScadaBridge expose health endpoints without authentication (`AllowAnonymous` or
open by default). MxAccessGateway's `/health/live` has no authentication requirement. The spec
canonizes this: health tiers are `AllowAnonymous`; `MapZbHealth()` applies `AllowAnonymous` by
default.

No gap — consistent across all three. `MapZbHealth()` should document and enforce this default.

## Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove hardcoded `/health/live`, replace the dead `AddHealthChecks()` call with real registrations, add `GrpcDependencyHealthCheck` (worker IPC) + `MapZbHealth()` | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3 — no probes/tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (`AkkaClusterHealthCheck` OtOpcUaCompat + `ActiveNodeHealthCheck` role-filtered + `DatabaseHealthCheck<T>` ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + `CanConnectAsync`) + add `/healthz` + unify `ActiveNodeGate` with `IActiveNodeGate` seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add `GrpcDependencyHealthCheck` for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2 — closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to `MapZbHealth` default) | all 3 | P3 | XS | Low | Gap W1 — mechanical; bundled with #1–3 |
| 6 | DB injection style: switch ScadaBridge from injected `DbContext` to `IDbContextFactory<T>` | ScadaBridge | P3 | XS | Low | Gap D2 — background-service safety |

**Note: adoption items #1–6 are all follow-on tasks.** They are tracked here as the backlog for
after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs, tests) is a
separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
the library is built first; adoption by the three apps is the next step.

## Decisions still open

- Whether `GrpcDependencyHealthCheck` takes a named channel (from DI) or a raw `ChannelBase` —
  affects how MxAccessGateway registers the worker-IPC probe without a standard gRPC channel.
- Whether `IActiveNodeGate` lives in `ZB.MOM.WW.Health` (making it a hard dependency) or stays
  in ScadaBridge's `InboundAPI` project (keeping the gate as a ScadaBridge concern).
- Whether the `OtOpcUaCompat` preset for `AkkaClusterHealthCheck` is a named constant or just
  documented configuration.
@@ -0,0 +1,89 @@
# Health (readiness / liveness / active-node)

Second normalized component under the operability cluster. **Goal: path to shared code** — converge
the three sister projects onto a common three-tier health endpoint convention and a set of shared
probe implementations, proposed as the `ZB.MOM.WW.Health` library set (3 packages), while each
project keeps its own probe registration and orchestrator wiring.

- The one target: [`spec/SPEC.md`](spec/SPEC.md)
- The proposed shared library: [`shared-contract/ZB.MOM.WW.Health.md`](shared-contract/ZB.MOM.WW.Health.md)
- Divergences + backlog: [`GAPS.md`](GAPS.md)
- Current state, per project: [`current-state/`](current-state/)

## Why health is a strong normalization candidate

Both OtOpcUa and ScadaBridge trace their health-check structure to the same "ScadaLink three-tier
pattern" (`HealthEndpoints.cs:13` says so explicitly) but have already diverged in probe logic,
status semantics, response writer, and endpoint registration style. MxAccessGateway has no shared
ancestry here — it has a single hardcoded `/health/live` endpoint with no real probes at all.
The common core (three tiers, database probe, Akka cluster probe, active-node probe) is
re-implemented twice and absent once. Shared probe implementations with configurable policies
close the gap without forcing identical behavior onto projects with legitimately different cluster
semantics.

## Status by project

| Project | Endpoints today | Probes today | Response writer | `/healthz` | `IActiveNodeGate` | Adoption status |
|---|---|---|---|---|---|---|
| **OtOpcUa** | `/health/ready`, `/health/active`, `/healthz` | Database (query), AkkaCluster (2-way), AdminRoleLeader (role-filtered) | Default ASP.NET Core writer (plain-text status) | ✅ present | — | Not started |
| **MxAccessGateway** | `/health/live` only (raw `MapGet`; hardcoded `"Healthy"`) | **None** (`AddHealthChecks()` called but unused) | Bespoke `GatewayHealthReply` JSON | ⛔ absent | — | Not started |
| **ScadaBridge** | `/health/ready`, `/health/active` | Database (`CanConnectAsync`), AkkaCluster (3-way), ActiveNode (role-less) | `HealthChecks.UI.Client` JSON | ⛔ absent | `ActiveNodeGate` (backs Inbound API 503 gate) | Not started |

See each project's [`current-state/<project>/CURRENT-STATE.md`](current-state/) for the
code-verified detail and its adoption plan.

## Normalized vs. left per-project

**Normalized (the shared target):**

- Three-tier endpoint convention: `/health/ready` (tag `ready`), `/health/active` (tag `active`),
  `/healthz` (bare liveness). Mapped by `app.MapZbHealth()` from `ZB.MOM.WW.Health`.
- Canonical JSON response writer (lifted from `HealthChecks.UI.Client` style; no per-project
  writer wiring needed).
- `IActiveNodeGate` seam — generalized from ScadaBridge's `ActiveNodeGate`; wired into `MapZbHealth`
  for automatic active-tier response.
- `GrpcDependencyHealthCheck` — reachability probe for a downstream gRPC dependency (covers
  OtOpcUa → MxAccessGateway channel and MxAccessGateway → worker IPC).
- `AkkaClusterHealthCheck` (in `ZB.MOM.WW.Health.Akka`) with a configurable status policy.
  Default = ScadaBridge's three-way policy; `OtOpcUaCompat` preset preserves OtOpcUa's two-way
  self-Up-among-members scan.
- `ActiveNodeHealthCheck` (in `ZB.MOM.WW.Health.Akka`) with an optional role filter. Role-less =
  ScadaBridge's behavior (Up + cluster leader); role-filtered = OtOpcUa's `AdminRoleLeader`
  behavior.
- `DatabaseHealthCheck<TContext>` (in `ZB.MOM.WW.Health.EntityFrameworkCore`) with default
  `CanConnectAsync` and an optional `ProbeQuery` delegate.
- `AllowAnonymous` on all three tiers by default (consistent across all three projects today).

**Left per-project (not forced together):**

- Which probes each app registers, their names, and which tags they carry.
- Orchestrator / Traefik wiring (sidecars, route rules, upstreams).
- ScadaBridge's `HealthMonitoring/` distributed aggregation pipeline (`SiteHealthCollector`,
  `CentralHealthAggregator`, `HealthReportSender`, etc.) — domain-specific, no shared-library
  equivalent.
- MxAccessGateway's `GatewayHealthReply` metadata (`DefaultBackend`, `WorkerProtocolVersion`) —
  keep as a bespoke `/info` endpoint.
- The x86 worker process — out of process and out of scope; the gateway-side
  `GrpcDependencyHealthCheck` observes it indirectly.
- Per-project `IActiveNodeGate` contract location (whether the interface lives in the shared
  library or in each project's own surface).

## Package structure

`ZB.MOM.WW.Health` ships as three dependency-split packages:

| Package | Contents | Consumers |
|---|---|---|
| `ZB.MOM.WW.Health` | Core tiers, `MapZbHealth`, canonical writer, `IActiveNodeGate`, `GrpcDependencyHealthCheck` | All three |
| `ZB.MOM.WW.Health.Akka` | `AkkaClusterHealthCheck` + status presets, `ActiveNodeHealthCheck` + role filter | OtOpcUa, ScadaBridge |
| `ZB.MOM.WW.Health.EntityFrameworkCore` | `DatabaseHealthCheck<TContext>` + optional probe delegate | OtOpcUa, ScadaBridge |

MxAccessGateway consumes the core package only (no Akka, no EF). OtOpcUa and ScadaBridge consume
all three.

## Component status

**Status: Draft.** Spec and shared-contract written; current-state docs verified; GAPS backlog
populated. Library (`ZB.MOM.WW.Health` @ 0.1.0) built and tested in this repo at
[`../../ZB.MOM.WW.Health/`](../../ZB.MOM.WW.Health/). Adoption by the three apps is a follow-on
tracked in [`GAPS.md`](GAPS.md).
@@ -0,0 +1,133 @@
# Health — current state: MxAccessGateway

Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (x86), gRPC;
solution `src/MxGateway.sln`.
Health code lives in `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`. All paths relative
to repo root.
Verified 2026-06-01.

**Summary: bare liveness only.** MxAccessGateway has a single `/health/live` endpoint that returns
a hardcoded `GatewayHealthReply` JSON object. `AddHealthChecks()` is called at startup but is
entirely unused — no `IHealthCheck` implementations are registered, `MapHealthChecks` is never
called, and there is no readiness or active-node tier. The net48 x86 worker process has no HTTP
server and therefore no health endpoint of any kind.

## 1. Endpoint wiring

`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`:

- `:61` — `builder.Services.AddHealthChecks()` is called in the DI registration block. **This call
  is dead**: no `.AddCheck<T>()` call follows it, no `MapHealthChecks` is ever called. The
  framework registers the health-check infrastructure but nothing is wired through it.
- `:139–145` — `MapGatewayEndpoints` maps a raw `endpoints.MapGet("/health/live", ...)` (not
  `MapHealthChecks`). The handler is an inline lambda that returns `Results.Ok(new GatewayHealthReply(...))`
  with a hardcoded `Status: "Healthy"`:

  ```csharp
  endpoints.MapGet(
      "/health/live",
      () => Results.Ok(new GatewayHealthReply(
          Status: "Healthy",
          DefaultBackend: GatewayContractInfo.DefaultBackendName,
          WorkerProtocolVersion: GatewayContractInfo.WorkerProtocolVersion)))
      .WithName("LiveHealth");
  ```

This endpoint always returns HTTP 200 `{"Status":"Healthy",...}` as long as the process is alive.
It carries no authentication requirement (no `[Authorize]` or `.RequireAuthorization()`).

## 2. Response shape

`GatewayHealthReply` is a record with three fields:
- `Status` — always `"Healthy"` (hardcoded string, not the ASP.NET Core `HealthStatus` enum)
- `DefaultBackend` — value of `GatewayContractInfo.DefaultBackendName` (the configured backend
  name, useful for confirming which gateway instance a probe hit)
- `WorkerProtocolVersion` — value of `GatewayContractInfo.WorkerProtocolVersion` (the gRPC
  protocol version the gateway expects from the worker, useful for version-skew detection)

The response is not `HealthChecks.UI.Client` JSON and is not the standard ASP.NET Core health
response shape. It is a bespoke JSON record.

## 3. Probes

None. There is no `IHealthCheck` registered. The `/health/live` response does not reflect:

- Whether the SQLite auth-store is reachable
- Whether any active MXAccess session is functional
- Whether the x86 worker named-pipe IPC is connected or the worker process is alive
- Whether the gRPC service is actually accepting calls

The endpoint is purely a process liveness indicator.

## 4. Tier coverage

| Tier | Endpoint | Status |
|---|---|---|
| Process liveness | `/health/live` (raw `MapGet`) | ✅ present (but non-standard) |
| Readiness | `/health/ready` | ⛔ absent |
| Active node | `/health/active` | ⛔ absent (not Akka-based; not applicable as-is) |
| `healthz` convention | `/healthz` | ⛔ absent |

MxAccessGateway is not an Akka.NET application — it has no cluster, no leader election, and no
active-node concept. The "active" tier in the shared spec translates here to "is the worker process
connected and the gRPC service ready to accept calls?" rather than cluster leadership.

## 5. x86 worker

`ZB.MOM.WW.MxGateway.Worker` is a .NET 4.8 console application communicating with the gateway
over Windows named-pipe IPC. It has no HTTP server, no health endpoint, and no exposure to any
probe mechanism. Its liveness must be inferred indirectly — either via the gateway process
monitoring it (not currently implemented) or via the `GrpcDependencyHealthCheck` the gateway
could use to probe the IPC channel.

## 6. Notable gaps

- `AddHealthChecks()` at `:61` is dead code. No `IHealthCheck` is ever registered via this call.
- `/health/live` uses `MapGet` (a raw minimal-API handler) rather than `MapHealthChecks`. It
  bypasses the ASP.NET Core health-check middleware entirely, which means it does not participate
  in the standard health-check pipeline (no `IHealthCheckPublisher`, no `HealthReport`, no UI
  integration).
- The hardcoded `"Healthy"` status means the endpoint cannot reflect real probe results even if
  probes were added later — the handler must be replaced, not just supplemented.
- No readiness gating: orchestrators (Kubernetes, Traefik) that rely on `/health/ready` returning
  503 until the process is actually ready typically get a 404 from MxAccessGateway today, because
  that path is not mapped, while `/health/live` returns 200 unconditionally.

---

## Adoption plan → `ZB.MOM.WW.Health`

**Replace `/health/live` + wire the shared tiers:**

The `AddHealthChecks()` call at `GatewayApplication.cs:61` is already present — it just needs
probes registered against it. The raw `MapGet("/health/live", ...)` handler at `:139–145` must be
removed and replaced with `app.MapZbHealth()` from `ZB.MOM.WW.Health`.

Steps:

1. **Remove** the inline `MapGet("/health/live", ...)` lambda (`:139–145`). The `GatewayHealthReply`
   record and `DefaultBackend`/`WorkerProtocolVersion` metadata can be surfaced differently (e.g., a
   `/info` endpoint or as custom data on the health response).
2. **Register a `GrpcDependencyHealthCheck`** (from `ZB.MOM.WW.Health`) that probes the
   named-pipe IPC channel to the x86 worker. Tag `["ready"]`. This replaces the hardcoded
   liveness-only response with a real probe that reflects whether the worker is reachable.
3. **Optionally add a `GrpcDependencyHealthCheck`** for any downstream gRPC dependency (e.g., the
   Galaxy Repository connection) if the gateway is expected to be healthy only when its
   dependencies are reachable. Tag `["ready"]`.
4. **Call `app.MapZbHealth()`** — this maps `/health/ready` (tag `ready`), `/health/active` (tag
   `active`; initially empty — no active-node concept in MxGateway), and `/healthz` (bare liveness).
   The `/healthz` endpoint takes over the semantic role that `/health/live` serves today.
5. **Do not add `ZB.MOM.WW.Health.Akka`** — MxAccessGateway has no Akka dependency. The consumer
   matrix in the design specifies MxGateway uses the core package only.
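
Step 2 above can be sketched with stock health-check APIs as follows. `WorkerIpcHealthCheckSketch`, `AddWorkerIpcProbe`, and the check name `"worker-ipc"` are hypothetical stand-ins for `GrpcDependencyHealthCheck`. The reachability test is passed in as a delegate because the worker IPC API is not shown in this document.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for GrpcDependencyHealthCheck: the reachability test is
// injected as a delegate, since the real IPC client is not part of this sketch.
public sealed class WorkerIpcHealthCheckSketch : IHealthCheck
{
    private readonly Func<CancellationToken, Task<bool>> _isWorkerReachable;

    public WorkerIpcHealthCheckSketch(Func<CancellationToken, Task<bool>> isWorkerReachable) =>
        _isWorkerReachable = isWorkerReachable;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            return await _isWorkerReachable(cancellationToken)
                ? HealthCheckResult.Healthy("Worker IPC reachable")
                : HealthCheckResult.Unhealthy("Worker IPC not reachable");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Worker IPC probe failed", ex);
        }
    }
}

public static class GatewayHealthRegistrationSketch
{
    // Registers the probe on the existing AddHealthChecks() builder, tagged "ready".
    public static IHealthChecksBuilder AddWorkerIpcProbe(
        this IHealthChecksBuilder builder,
        Func<CancellationToken, Task<bool>> isWorkerReachable) =>
        builder.AddCheck(
            "worker-ipc",
            new WorkerIpcHealthCheckSketch(isWorkerReachable),
            tags: new[] { "ready" });
}
```

Because the check carries the `ready` tag, a tag-filtered `/health/ready` endpoint would return 503 while the worker is unreachable, instead of the unconditional 200 that `/health/live` returns today.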

**Keep bespoke:**

- The `WorkerProtocolVersion` / `DefaultBackend` metadata from `GatewayHealthReply` is
  MxAccessGateway-specific; keep it as a separate `/info` endpoint or embed it as `Data` on a
  custom probe rather than normalizing it into the shared contract.
- The x86 worker itself (net48 console, named-pipe IPC, no HTTP) remains outside the shared health
  scheme. The `GrpcDependencyHealthCheck` observes the worker indirectly from the gateway side.
- Per-gateway auth and TLS concerns on who may call health endpoints remain per-project.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. MxGateway is the **highest-priority adopter** (P1 gap — no probes/tiers today)
and should be the first app wired up once the nupkg is available.
@@ -0,0 +1,150 @@
# Health — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Health code lives in `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/`. All paths relative to repo root.
Verified 2026-06-01.

Full three-tier pattern: `/health/ready`, `/health/active`, and `/healthz`. Three probes covering
the database, the Akka cluster, and the admin-role leader. All endpoints are `AllowAnonymous` to
permit Traefik and load-balancer probing without credentials.

## 1. Endpoint wiring

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs`:

- `:13` — XML comment explicitly names this as "ScadaLink's three-tier pattern: `ready` = boot ok;
  `active` = fully serving traffic; `healthz` = bare process liveness."
- `:17` — `AddOtOpcUaHealth(IServiceCollection)` calls `services.AddHealthChecks()` and registers
  all three probes (lines 20–22):
  - `DatabaseHealthCheck` name `"configdb"`, tags `["ready","active"]`
  - `AkkaClusterHealthCheck` name `"akka"`, tags `["ready","active"]`
  - `AdminRoleLeaderHealthCheck` name `"admin-leader"`, tags `["active"]` only
- `:28` — `MapOtOpcUaHealth(IEndpointRouteBuilder)` maps three endpoints (lines 33–44):
  - `/health/ready` — predicate `c => c.Tags.Contains("ready")`, `.AllowAnonymous()` (lines 33–36)
  - `/health/active` — predicate `c => c.Tags.Contains("active")`, `.AllowAnonymous()` (lines 37–40)
  - `/healthz` — predicate `_ => false` (no probes run; bare process liveness only), `.AllowAnonymous()` (lines 41–44)

`Program.cs`:
- `:137` — `builder.Services.AddOtOpcUaHealth()`
- `:159` — `app.MapOtOpcUaHealth()`

Response writer: the default ASP.NET Core writer, which emits the plain-text status (no
`HealthChecks.UI.Client`).

## 2. Probes

### DatabaseHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs`:

- `:9` — injects `IDbContextFactory<OtOpcUaConfigDbContext>`
- `:25–37` — opens a pooled context via `CreateDbContextAsync`, runs
  `db.Deployments.AsNoTracking().Take(1).ToListAsync()`. If the query succeeds →
  `HealthCheckResult.Healthy("ConfigDb reachable")` (`:31`). If it throws →
  `HealthCheckResult.Unhealthy("ConfigDb unreachable", ex)` (`:35`). No `Degraded` path.

The probe exercises a real query (not just `CanConnectAsync`) — it confirms the `Deployments` table
is readable, which implies the schema migration has run. This is **stricter** than ScadaBridge's
`CanConnectAsync` but more opaque about the failure reason.

Tags on registration: `["ready","active"]` — the database must be reachable for both readiness and
active-node determination.

### AkkaClusterHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs`:

- `:9` — injects `ActorSystem` directly
- `:27–33` — calls `Cluster.Get(_system)`, scans `cluster.State.Members` for the member whose
  `Address == cluster.SelfAddress` and `Status == MemberStatus.Up`:
  - Found Up → `HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")` (`:32`)
  - Not found → `HealthCheckResult.Degraded("Self not yet Up in cluster")` (`:33`)

No `Unhealthy` path — joining/leaving/removed nodes are all reported as `Degraded`. This differs from
ScadaBridge's more granular three-way policy (see GAPS).

Tags on registration: `["ready","active"]`.

### AdminRoleLeaderHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs`:

- `:14` — injects `IClusterRoleInfo`
- `:27–38` — three-path logic:
  - Node does not carry the `"admin"` role → `Healthy("Node does not carry admin role")` (`:30`) —
    non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
  - Admin role + node is the role leader → `Healthy($"Admin leader ({...})")` (`:36`)
  - Admin role + not the leader → `Degraded($"Admin member but not leader (leader=...)")` (`:37`)

Tags on registration: `["active"]` only — does not participate in `/health/ready`. The intent is
Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes
are reachable for data-plane OPC UA but report `Degraded` on `/health/active` so the load balancer
does not route control-plane traffic to them.

Note: no `Unhealthy` path for the role-filter case. If the ActorSystem is not running, `IClusterRoleInfo`
presumably returns safe defaults (no role); this is not separately health-checked.

## 3. Tag / tier summary

| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | ✅ | — |
| `AkkaClusterHealthCheck` | ✅ | ✅ | — |
| `AdminRoleLeaderHealthCheck` | — | ✅ | — |
| (no probes) | — | — | ✅ (bare liveness) |

`/healthz` runs zero probes — it is a pure process liveness sentinel (process reachable = healthy;
a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
monitors use this tier.

## 4. Downstream dependency coverage

No probe for the downstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
still reports healthy (the GalaxyDriver will surface errors in OPC UA diagnostics, but `/health/ready`
and `/health/active` will not reflect it). This is a gap that the shared `GrpcDependencyHealthCheck`
probe in `ZB.MOM.WW.Health` would close.

## 5. Notable design choices

- **AllowAnonymous on all tiers** — see `HealthEndpoints.cs:30–32` comment: "Without it the
  `AddOtOpcUaAuth` fallback policy 401s every probe and Traefik marks every backend unhealthy."
- **Query probe, not `CanConnectAsync`** — the `Deployments` query validates that the schema has
  been applied. ScadaBridge uses `CanConnectAsync`; neither is wrong but they diverge.
- **`Degraded` semantics** — the Akka check uses `Degraded` (not `Unhealthy`) for a joining/pre-Up
  node. ASP.NET Core maps `Degraded` to HTTP 200 by default; Traefik sees 200 and considers the
  node ready. If `Unhealthy` (HTTP 503) is required to gate traffic, the `Degraded` path is
  insufficient.
- **`IClusterRoleInfo` abstraction** — the admin-leader check depends on `IClusterRoleInfo`, an OtOpcUa
  interface, not the raw `Akka.Cluster.Cluster` API. This is a testability-friendly layer absent in
  ScadaBridge's direct Akka usage.
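
On the `Degraded` point in the list above: ASP.NET Core lets an endpoint override the status-code mapping through `HealthCheckOptions.ResultStatusCodes`. The sketch below shows one way to make `Degraded` return 503 for a single tier. `StrictStatusCodesSketch` and `ForTag` are hypothetical names, and whether any tier should gate on `Degraded` is a project decision that this document leaves open.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: builds endpoint options where Degraded takes the node
// out of rotation. OtOpcUa does not do this today.
public static class StrictStatusCodesSketch
{
    public static HealthCheckOptions ForTag(string tag)
    {
        var options = new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains(tag),
        };

        // The framework default is 200 for Degraded; override it to 503 for this tier.
        options.ResultStatusCodes[HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable;
        return options;
    }
}
```

Passing `StrictStatusCodesSketch.ForTag("active")` to `MapHealthChecks` would make a standby node answer 503 on that tier, while other tiers keep the default mapping.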

---

## Adoption plan → `ZB.MOM.WW.Health`

**Replace with shared probes:**

- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the
  **`OtOpcUaCompat` preset** (self-Up-among-members scan → Healthy/Degraded). The preset keeps
  OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- `AdminRoleLeaderHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with
  `RoleFilter = "admin"`. The role-filter parameter produces identical behavior: non-admin nodes
  immediately healthy, admin leader healthy, admin non-leader degraded.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext>`
  with a `ProbeQuery` delegate of `db => db.Deployments.AsNoTracking().Take(1).ToListAsync()`.
  The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
  dependency gap noted in §4). Tag `["ready","active"]`.
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with `services.AddZbHealthChecks()` +
  `app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default —
  no separate wiring needed.
|
||||
|
||||
**Keep bespoke:**
|
||||
|
||||
- `IClusterRoleInfo` and its Akka implementation — this is an OtOpcUa abstraction used for more
|
||||
than health checks; it should remain in the OtOpcUa codebase. The shared `ActiveNodeHealthCheck`
|
||||
will accept `IClusterRoleInfo` (or an equivalent cluster-info abstraction) as an injection point.
|
||||
- The `AllowAnonymous` policy — this is an OtOpcUa auth concern; `MapZbHealth` must document that
|
||||
callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
|
||||
- Which probes are registered and their tag assignments — the shared library supplies the check
|
||||
implementations; the wiring (which names, which tags, which options) remains per-project.
|
||||
|
||||
**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
|
||||
library build. The library build delivers the shared implementations; adoption lands in the
|
||||
OtOpcUa repo as a separate commit once the nupkg is available.
|
||||
@@ -0,0 +1,185 @@
|
||||
# Health — current state: ScadaBridge
|
||||
|
||||
Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET; solution `ZB.MOM.WW.ScadaBridge.slnx`.
|
||||
Health code centers on `src/ZB.MOM.WW.ScadaBridge.Host/Health/` (ASP.NET probes) and the
|
||||
separate `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` project (domain aggregation pipeline).
|
||||
All paths relative to repo root.
|
||||
Verified 2026-06-01.
|
||||
|
||||
Two-tier pattern: `/health/ready` and `/health/active` — no `/healthz`. Three probes (database,
|
||||
Akka cluster, active-node). ScadaBridge also has a bespoke distributed `HealthMonitoring/`
|
||||
pipeline that is entirely separate from the ASP.NET health checks and is out of scope for the
|
||||
shared library.
|
||||
|
||||
## 1. Endpoint wiring
|
||||
|
||||
`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`:
|
||||
|
||||
- `:114–117` — `builder.Services.AddHealthChecks()` followed by three `.AddCheck<T>()` calls
|
||||
(no tags, checked by name at the endpoint level):
|
||||
- `.AddCheck<DatabaseHealthCheck>("database")`
|
||||
- `.AddCheck<AkkaClusterHealthCheck>("akka-cluster")`
|
||||
- `.AddCheck<ActiveNodeHealthCheck>("active-node")`
|
||||
- `:131` — `builder.Services.AddSingleton<IActiveNodeGate, ActiveNodeGate>()` registers the
|
||||
production `IActiveNodeGate` implementation (Inbound API gating, not a health-check probe).
|
||||
- `:222–226` — `/health/ready` mapped with `Predicate = check => check.Name != "active-node"` and
|
||||
`ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse` (from `HealthChecks.UI.Client`).
|
||||
Excludes the active-node check so a healthy standby node reports ready.
|
||||
- `:229–233` — `/health/active` mapped with `Predicate = check => check.Name == "active-node"` and
|
||||
`ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse`. Active-node check only.
|
||||
|
||||
No `/healthz` endpoint. Both mapped endpoints use `HealthChecks.UI.Client` JSON (not the default
|
||||
plain-text writer), which is a divergence from OtOpcUa.
|
||||
|
||||
## 2. Probes
|
||||
|
||||
### DatabaseHealthCheck
|
||||
`src/ZB.MOM.WW.ScadaBridge.Host/Health/DatabaseHealthCheck.cs`:
|
||||
|
||||
- `:11` — injects `ScadaBridgeDbContext` directly (not a factory)
|
||||
- `:33–43` — calls `_dbContext.Database.CanConnectAsync(cancellationToken)`:
|
||||
- Returns `true` → `HealthCheckResult.Healthy("Database connection is available.")` (`:34–35`)
|
||||
- Returns `false` → `HealthCheckResult.Unhealthy("Database connection failed.")` (`:36`)
|
||||
- Throws → `HealthCheckResult.Unhealthy("Database connection failed.", ex)` (`:40`)
|
||||
|
||||
`CanConnectAsync` tests the connection layer only — it does not run any query or verify schema
|
||||
state. This is less strict than OtOpcUa's `Deployments` query but more transparent about failure
|
||||
cause (connection vs. schema). No `Degraded` path.
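
The three outcomes listed above (true, false, throws) can be sketched without EF Core by passing the connectivity call in as a delegate. `ProbeResult` and `ConnectivityProbe` are hypothetical names for illustration; in the real check the delegate would be `_dbContext.Database.CanConnectAsync`.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Local stand-ins so the sketch runs without ASP.NET Core or EF Core.
public enum HealthStatus { Unhealthy, Degraded, Healthy }
public sealed record ProbeResult(HealthStatus Status, string Description, Exception? Error = null);

// Hypothetical wrapper that maps a connectivity call onto a health result.
public static class ConnectivityProbe
{
    public static async Task<ProbeResult> RunAsync(
        Func<CancellationToken, Task<bool>> canConnect,
        CancellationToken ct = default)
    {
        try
        {
            // true -> Healthy, false -> Unhealthy; there is no Degraded path.
            return await canConnect(ct)
                ? new ProbeResult(HealthStatus.Healthy, "Database connection is available.")
                : new ProbeResult(HealthStatus.Unhealthy, "Database connection failed.");
        }
        catch (Exception ex)
        {
            // A thrown exception is reported as Unhealthy with the cause attached.
            return new ProbeResult(HealthStatus.Unhealthy, "Database connection failed.", ex);
        }
    }
}
```

Swapping the delegate for one that runs a small query would give the stricter OtOpcUa-style probe while leaving the outcome mapping unchanged.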
|
||||
|
||||
### AkkaClusterHealthCheck
|
||||
`src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterHealthCheck.cs`:
|
||||
|
||||
- `:13` — injects `AkkaHostedService` (not `ActorSystem` directly)
|
||||
- `:33–50` — gets `_akkaService.ActorSystem`, guards on null → `Degraded("ActorSystem not yet
|
||||
available.")`, then reads `cluster.SelfMember.Status`:
|
||||
- `Up` or `Joining` → `Healthy($"Akka cluster member status: {status}")` (`:43`)
|
||||
- `Leaving` or `Exiting` → `Degraded($"Akka cluster member status: {status}")` (`:45`)
|
||||
- anything else (Removed, Down, WeaklyUp…) → `Unhealthy($"Akka cluster member status: {status}")` (`:47`)
|
||||
|
||||
Three-way status policy: Healthy / Degraded / Unhealthy. This is more granular than OtOpcUa's
|
||||
two-way policy (self-Up-or-not → Healthy/Degraded with no Unhealthy path).
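
As a rough illustration of the two policies, the mapping can be written as a pure function. The `MemberStatus` enum below is a local stand-in that copies Akka.NET's member status names, and `ClusterStatusPolicy` is a hypothetical name. The two-way variant is a simplification of OtOpcUa's behavior as summarized here, not its actual code.

```csharp
// Local stand-ins so the sketch runs without Akka.NET or ASP.NET Core.
public enum MemberStatus { Joining, WeaklyUp, Up, Leaving, Exiting, Down, Removed }
public enum HealthStatus { Unhealthy, Degraded, Healthy }

// Hypothetical policy holder; null models "ActorSystem not yet available".
public static class ClusterStatusPolicy
{
    // ScadaBridge-style three-way policy.
    public static HealthStatus ThreeWay(MemberStatus? status) => status switch
    {
        null => HealthStatus.Degraded,
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy, // Removed, Down, WeaklyUp
    };

    // Simplified two-way policy: Up is Healthy, everything else is Degraded.
    public static HealthStatus TwoWay(MemberStatus? status) =>
        status == MemberStatus.Up ? HealthStatus.Healthy : HealthStatus.Degraded;
}
```

The practical difference shows up for `Down` or `Removed`: the three-way policy yields `Unhealthy` (HTTP 503 by default), while the two-way policy never does.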
|
||||
|
||||
### ActiveNodeHealthCheck
|
||||
`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeHealthCheck.cs`:
|
||||
|
||||
- `:13` — injects `AkkaHostedService`
|
||||
- `:29–44` — four-path logic:
|
||||
- `ActorSystem == null` → `Unhealthy("ActorSystem not yet available.")` (`:31`)
|
||||
- `SelfMember.Status != Up` → `Unhealthy($"Node not Up (status: ...)")` (`:37`)
|
||||
- `Up` AND `cluster.State.Leader == self.Address` → `Healthy("Active node (cluster leader).")` (`:41`)
|
||||
- `Up` but not leader → `Unhealthy("Standby node (not cluster leader).")` (`:43`)
|
||||
|
||||
No `Degraded` path — `ActiveNodeHealthCheck` uses `Unhealthy` for standby and non-Up states,
|
||||
which causes `/health/active` to return HTTP 503 on a standby. This is the intended behavior for
|
||||
Traefik active-node routing.
|
||||
|
||||
## 3. Tag / tier summary
|
||||
|
||||
ScadaBridge uses **name-based predicates** at the endpoint level rather than tags on the check
|
||||
registration. Tags are absent from all three `.AddCheck<T>()` calls.
|
||||
|
||||
| Probe | `/health/ready` | `/health/active` | `/healthz` |
|
||||
|---|---|---|---|
|
||||
| `DatabaseHealthCheck` | ✅ | — (excluded by name) | ⛔ absent |
|
||||
| `AkkaClusterHealthCheck` | ✅ | — (excluded by name) | ⛔ absent |
|
||||
| `ActiveNodeHealthCheck` | — (excluded by name) | ✅ | ⛔ absent |
|
||||
|
||||
`/healthz` is absent — there is no bare process liveness endpoint. Kubernetes or Traefik liveness
|
||||
probes must fall back to `/health/ready` and tolerate its 503-until-ready behavior.
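
The difference between name-based and tag-based tier selection can be sketched as plain filtering over the registrations. `Registration` and `TierSelection` are hypothetical types. The tag assignments follow the adoption plan (`"database"` and `"akka-cluster"` tagged `ready`, `"active-node"` tagged `active`) and are not present in the current code.

```csharp
using System;
using System.Linq;

// Hypothetical model of a health-check registration: a name plus optional tags.
public sealed record Registration(string Name, string[] Tags);

public static class TierSelection
{
    // Tags shown here are the proposed ones; today's registrations carry none.
    public static readonly Registration[] Checks =
    {
        new("database", new[] { "ready" }),
        new("akka-cluster", new[] { "ready" }),
        new("active-node", new[] { "active" }),
    };

    // Current approach: the endpoint filters by check name.
    public static string[] ByName(Func<Registration, bool> predicate) =>
        Checks.Where(predicate).Select(c => c.Name).ToArray();

    // Proposed approach: the endpoint filters by tag membership.
    public static string[] ByTag(string tag) =>
        Checks.Where(c => c.Tags.Contains(tag)).Select(c => c.Name).ToArray();
}
```

For these three probes, `ByName(c => c.Name != "active-node")` and `ByTag("ready")` select the same set. The tag form keeps working when a new probe is added, whereas the name predicate would silently include it in `/health/ready`.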
|
||||
|
||||
## 4. IActiveNodeGate and Inbound API gating
|
||||
|
||||
`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeGate.cs`:
|
||||
|
||||
- `:24` — `ActiveNodeGate` implements `IActiveNodeGate` (from the `InboundAPI` project)
|
||||
- `:40` — `IsActiveNode` property: returns `true` only when `_akkaService.ActorSystem != null`
|
||||
AND `cluster.SelfMember.Status == MemberStatus.Up` AND `cluster.State.Leader == self.Address`.
|
||||
Defaults to `false` safely during startup (`:45–46`).
|
||||
- `:131` in `Program.cs` — registered as a singleton. The `InboundApiEndpointFilter` consults this
|
||||
gate on every `/api/*` request and returns HTTP 503 on a standby node.
|
||||
|
||||
`ActiveNodeGate` applies the same logic as `ActiveNodeHealthCheck`: both check Up + leader.
|
||||
They are separate types serving two different concerns (the health endpoint and the API gate) but
|
||||
are not abstracted into a shared service; each reads cluster state independently.
|
||||
|
||||
`IActiveNodeGate` is the seam that the `ZB.MOM.WW.Health` core package generalizes into the shared
|
||||
library.
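
One way to picture a single backing service with two consumers is sketched below. `ClusterSnapshot` and `ActiveNodeState` are hypothetical names, the `MemberStatus` enum is a local stand-in for Akka.NET's, and the cluster read is injected as a delegate so the sketch runs standalone. It is an assumption about the shape of the shared seam, not the library's design.

```csharp
using System;

// Local stand-in for Akka.NET's member status names.
public enum MemberStatus { Joining, WeaklyUp, Up, Leaving, Exiting, Down, Removed }

// Hypothetical point-in-time view of the local node.
public readonly record struct ClusterSnapshot(bool ActorSystemAvailable, MemberStatus Status, bool IsLeader);

// Same shape as the gate described above: a single boolean property.
public interface IActiveNodeGate { bool IsActiveNode { get; } }

// One service backing both the API gate and the health probe.
public sealed class ActiveNodeState : IActiveNodeGate
{
    private readonly Func<ClusterSnapshot> _read;
    public ActiveNodeState(Func<ClusterSnapshot> read) => _read = read;

    // Gate view: false until the node is Up and is the cluster leader.
    public bool IsActiveNode
    {
        get
        {
            var s = _read();
            return s.ActorSystemAvailable && s.Status == MemberStatus.Up && s.IsLeader;
        }
    }

    // Probe view: same decision, plus the description the health check reports.
    public (bool Healthy, string Description) Check()
    {
        var s = _read();
        if (!s.ActorSystemAvailable) return (false, "ActorSystem not yet available.");
        if (s.Status != MemberStatus.Up) return (false, $"Node not Up (status: {s.Status})");
        return s.IsLeader
            ? (true, "Active node (cluster leader).")
            : (false, "Standby node (not cluster leader).");
    }
}
```

Because both views read the same snapshot source, the gate and the probe cannot drift apart the way two independently written checks can.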
|
||||
|
||||
## 5. HealthMonitoring domain pipeline (out of scope for shared library)
|
||||
|
||||
`src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` is a separate project implementing a distributed
|
||||
health aggregation pipeline. It is **not ASP.NET Core health checks** and is **not in scope** for
|
||||
`ZB.MOM.WW.Health`.
|
||||
|
||||
Key types:
|
||||
- `SiteHealthCollector` — thread-safe singleton accumulating per-site error counters, connection
|
||||
metrics, and tag-read metrics. Populated by actors in the DCL layer.
|
||||
- `HealthReportSender` — a background service on site clusters that serializes `SiteHealthState`
|
||||
and ships it to the central cluster via Akka remoting at a configurable interval.
|
||||
- `CentralHealthReportLoop` — central-only background service that generates a synthetic
|
||||
`SiteHealthReport` for the central cluster itself (siteId `"$central"`) and feeds it into the
|
||||
central aggregator.
|
||||
- `CentralHealthAggregator` — a `BackgroundService` on the central cluster tracking the latest
|
||||
health report per site and detecting offline sites via heartbeat timeout. Exposes
|
||||
`GetAggregatedHealth()` to the Central UI's `/monitoring/health` endpoint.
|
||||
|
||||
This pipeline is domain-specific (multi-site ScadaBridge topology) and will remain per-project
|
||||
regardless of shared-library adoption.
|
||||
|
||||
## 6. Notable design choices
|
||||
|
||||
- **Name-based predicates vs. tags** — ScadaBridge uses `check.Name == "active-node"` predicate
|
||||
logic at the endpoint level. OtOpcUa uses tag membership (`c.Tags.Contains("ready")`). The tag
|
||||
approach is more composable (a probe can participate in multiple tiers); the name approach is
|
||||
more explicit. The shared `MapZbHealth` should use tags by default.
|
||||
- **`HealthChecks.UI.Client` response writer** — ScadaBridge uses the richer JSON response writer
|
||||
from the `AspNetCore.HealthChecks.UI.Client` package. OtOpcUa uses the default plain-text writer.
|
||||
The shared library's canonical response writer standardizes this.
|
||||
- **`ActiveNodeHealthCheck` returns `Unhealthy` for standby** — a standby is not *unhealthy* in the
|
||||
system sense; it is a deliberate routing discriminator. Using `Unhealthy` here ensures `/health/active`
|
||||
returns HTTP 503 (Traefik sees the node as down for active traffic). The naming is semantically
|
||||
imprecise but operationally correct.
|
||||
- **`IActiveNodeGate` + `ActiveNodeGate` duplication** — the gate and the health check implement the
|
||||
same logic independently. The shared library's `IActiveNodeGate` seam + `ActiveNodeHealthCheck`
|
||||
unify them: one backing service, two consumers.
|
||||
|
||||
---
|
||||
|
||||
## Adoption plan → `ZB.MOM.WW.Health`
|
||||
|
||||
**Replace with shared probes:**
|
||||
|
||||
- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the **Default
|
||||
policy** (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy). ScadaBridge's
|
||||
existing three-way policy is the Default — no preset selection needed.
|
||||
- `ActiveNodeHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with no role filter
|
||||
(role-less default: Up && leader = Healthy, else Unhealthy). The shared implementation also backs
|
||||
`IActiveNodeGate`, eliminating the duplicated leader-check logic between `ActiveNodeHealthCheck`
|
||||
and `ActiveNodeGate`.
|
||||
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<ScadaBridgeDbContext>`
|
||||
using the default `CanConnectAsync` probe (ScadaBridge's existing behavior). No `ProbeQuery`
|
||||
delegate needed.
|
||||
- Replace the name-based predicates with tag-based predicates by adding tags at registration time:
|
||||
`"database"` and `"akka-cluster"` → `["ready"]`; `"active-node"` → `["active"]`. Then call
|
||||
`app.MapZbHealth()` instead of the two manual `MapHealthChecks` calls.
|
||||
- **Add `/healthz`** — `MapZbHealth()` maps the bare liveness tier automatically. ScadaBridge
|
||||
currently lacks this endpoint.
|
||||
- Switch `ResponseWriter` from `UIResponseWriter.WriteHealthCheckUIResponse` to the shared
|
||||
canonical writer (a convergence item — `HealthChecks.UI.Client` style lifted to the default in
|
||||
`ZB.MOM.WW.Health`).
|
||||
|
||||
**Keep bespoke:**
|
||||
|
||||
- `HealthMonitoring/` domain pipeline (`SiteHealthCollector`, `CentralHealthAggregator`, etc.) —
|
||||
entirely per-project, no shared-library equivalent.
|
||||
- `IActiveNodeGate` from the `InboundAPI` project is the contract the `InboundApiEndpointFilter`
|
||||
depends on; it can be implemented by the shared `ActiveNodeHealthCheck` backing service but the
|
||||
interface definition stays in the InboundAPI project (or moves to a shared abstractions package).
|
||||
- The Central UI's `/monitoring/health` endpoint — powered by `CentralHealthAggregator`, not by
|
||||
ASP.NET health checks.
|
||||
- The comment at `Program.cs:217–221` explains the readiness design decision (standby nodes are
|
||||
ready; leadership is a separate concern). This intent is preserved by the tag-based approach.
|
||||
|
||||
**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
|
||||
library build. The library build delivers the shared implementations; adoption lands in the
|
||||
ScadaBridge repo as a separate commit once the nupkg is available.
|
||||