scadaproj/components/health/current-state/otopcua/CURRENT-STATE.md
Joseph Doherty 07d5907258 docs(health): resolve spec/contract/gaps consistency (review fixes)
Applies canonical resolutions for eight settled decisions:
- GAPS: remove three stale "Decisions still open" bullets (#1 IActiveNodeGate placement, #2 GrpcChannel type, #3 OtOpcUaCompat named constant)
- Shared contract: AkkaClusterHealthCheck, ActiveNodeHealthCheck constructors take IServiceProvider (lazy ActorSystem, Degraded-when-not-ready)
- Shared contract: AkkaActiveNodeGate takes IServiceProvider; reads SelfMember+leader directly, null-guarded; does not proxy ActiveNodeHealthCheck
- Shared contract: DatabaseHealthCheckOptions.Probe renamed to ProbeQuery; consumer matrix updated
- Shared contract: settled AddZbHealthChecks open question removed (spec §5 is per-project AddHealthChecks)
- SPEC §2.2: OtOpcUaCompat Leaving/Exiting cell updated from — to Degraded + footnote; §2.3 startup-safety note added
- README: status line corrected from "built and tested" to "scaffolded … implementation is follow-on (task #7)"; IActiveNodeGate "left per-project" bullet removed
- OtOpcUa current-state: AddZbHealthChecks → AddHealthChecks().AddCheck<...>(); IClusterRoleInfo note reframed as accepted trade-off
- ScadaBridge current-state: IActiveNodeGate bullet rewritten — interface moves to ZB.MOM.WW.Health on adoption, InboundApiEndpointFilter references shared interface
2026-06-01 06:33:42 -04:00


Health — current state: OtOpcUa

Repo: ~/Desktop/OtOpcUa. Stack: .NET 10, Akka.NET, OPC UA; solution ZB.MOM.WW.OtOpcUa.slnx. Health code lives in src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/. All paths relative to repo root. Verified 2026-06-01.

Full three-tier pattern: /health/ready, /health/active, and /healthz. Three probes covering the database, the Akka cluster, and the admin-role leader. All endpoints are AllowAnonymous to permit Traefik and load-balancer probing without credentials.

1. Endpoint wiring

src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs:

  • :13 - XML comment explicitly names this as "ScadaLink's three-tier pattern: ready = boot ok; active = fully serving traffic; healthz = bare process liveness."
  • :17 - AddOtOpcUaHealth(IServiceCollection) calls services.AddHealthChecks() and registers all three probes (lines 20-22):
    • DatabaseHealthCheck name "configdb", tags ["ready","active"]
    • AkkaClusterHealthCheck name "akka", tags ["ready","active"]
    • AdminRoleLeaderHealthCheck name "admin-leader", tags ["active"] only
  • :28 - MapOtOpcUaHealth(IEndpointRouteBuilder) maps three endpoints (lines 33-44):
    • /health/ready: predicate c => c.Tags.Contains("ready"), .AllowAnonymous() (lines 33-36)
    • /health/active: predicate c => c.Tags.Contains("active"), .AllowAnonymous() (lines 37-40)
    • /healthz: predicate _ => false (no probes run; bare process liveness only), .AllowAnonymous() (lines 41-44)

Program.cs:

  • :137 - builder.Services.AddOtOpcUaHealth()
  • :159 - app.MapOtOpcUaHealth()

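Response writer: the ASP.NET Core default, which returns only the aggregate status name (Healthy, Degraded, or Unhealthy) as plain text. No JSON writer such as HealthChecks.UI.Client is configured.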
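The wiring described in this section can be sketched as a minimal standalone program. This is an illustration under stated assumptions, not the contents of HealthEndpoints.cs: the two inline delegate checks are stand-ins for the real probe classes so the sketch compiles on its own, and the extension-method wrappers (AddOtOpcUaHealth / MapOtOpcUaHealth) are flattened into top-level statements.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Stand-in probes (assumption): the real code registers DatabaseHealthCheck,
// AkkaClusterHealthCheck and AdminRoleLeaderHealthCheck via AddCheck<T>().
builder.Services.AddHealthChecks()
    .AddCheck("configdb", () => HealthCheckResult.Healthy("stand-in"),
        tags: new[] { "ready", "active" })
    .AddCheck("admin-leader", () => HealthCheckResult.Degraded("stand-in"),
        tags: new[] { "active" });

var app = builder.Build();

// Tier 1: only checks tagged "ready" run.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("ready"),
}).AllowAnonymous();

// Tier 2: only checks tagged "active" run.
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("active"),
}).AllowAnonymous();

// Tier 3: the predicate rejects every check, so no probe runs at all.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false,
}).AllowAnonymous();

app.Run();
```

Because the tag predicate is evaluated per registration, a probe joins a tier purely through its tag list; adding or removing a tier never requires touching the probe class itself.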

2. Probes

DatabaseHealthCheck

src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs:

  • :9 - injects IDbContextFactory<OtOpcUaConfigDbContext>
  • :25-37 - opens a pooled context via CreateDbContextAsync, runs db.Deployments.AsNoTracking().Take(1).ToListAsync(). If the query succeeds → HealthCheckResult.Healthy("ConfigDb reachable") (:31). If it throws → HealthCheckResult.Unhealthy("ConfigDb unreachable", ex) (:35). No Degraded path.

The probe exercises a real query (not just CanConnectAsync) — it confirms the Deployments table is readable, which implies the schema migration has run. This is stricter than ScadaBridge's CanConnectAsync but more opaque about the failure reason.
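A probe with the behavior described above might look roughly like the following. It is a reconstruction from the description, not a copy of DatabaseHealthCheck.cs; the Deployment entity and the context body are minimal stand-ins (assumptions) so the sketch compiles without the rest of the repo.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Stand-in entity and context (assumption): the real types live in the OtOpcUa repo.
public sealed class Deployment
{
    public int Id { get; set; }
}

public sealed class OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbContext> options)
    : DbContext(options)
{
    public DbSet<Deployment> Deployments => Set<Deployment>();
}

public sealed class DatabaseHealthCheck(IDbContextFactory<OtOpcUaConfigDbContext> factory)
    : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // A real read against Deployments: fails if the database is down
            // or if the table does not exist yet (schema not migrated).
            await using var db = await factory.CreateDbContextAsync(cancellationToken);
            _ = await db.Deployments.AsNoTracking().Take(1).ToListAsync(cancellationToken);
            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // Every failure collapses into one Unhealthy result; the cause is only in ex.
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```

The single catch block is what makes the probe opaque: a missing table and an unreachable server produce the same description, and only the attached exception distinguishes them.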

Tags on registration: ["ready","active"] — the database must be reachable for both readiness and active-node determination.

AkkaClusterHealthCheck

src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs:

  • :9 - injects ActorSystem directly
  • :27-33 - calls Cluster.Get(_system), scans cluster.State.Members for the member whose Address == cluster.SelfAddress and Status == MemberStatus.Up:
    • Found Up → HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)") (:32)
    • Not found → HealthCheckResult.Degraded("Self not yet Up in cluster") (:33)

No Unhealthy path — joining/leaving/removed nodes are all reported as Degraded. This differs from ScadaBridge's more granular three-way policy (see GAPS).
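The two-way policy can be sketched as below. This is an approximation written from the description above rather than the file itself; it assumes the Akka.Cluster package and an ActorSystem registered in DI.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class AkkaClusterHealthCheck(ActorSystem system) : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Fully qualified to avoid any clash between the Cluster type and its namespace.
        var cluster = Akka.Cluster.Cluster.Get(system);
        var members = cluster.State.Members;

        // Healthy only if this node appears in the member list with status Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up);

        // Joining, Leaving, Exiting and Removed all fall through to Degraded.
        var result = selfUp
            ? HealthCheckResult.Healthy($"Self Up; {members.Count} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster");

        return Task.FromResult(result);
    }
}
```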

Tags on registration: ["ready","active"].

AdminRoleLeaderHealthCheck

src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs:

  • :14 - injects IClusterRoleInfo
  • :27-38 - three-path logic:
    • Node does not carry the "admin" role → Healthy("Node does not carry admin role") (:30) — non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
    • Admin role + node is the role leader → Healthy($"Admin leader ({...})") (:36)
    • Admin role + not the leader → Degraded($"Admin member but not leader (leader=...)") (:37)

Tags on registration: ["active"] only — does not participate in /health/ready. The intent is Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes are reachable for data-plane OPC UA but report Degraded on /health/active so the load balancer does not route control-plane traffic to them.

Note: no Unhealthy path for the role-filter case. If the ActorSystem is not running, IClusterRoleInfo presumably returns safe defaults (no role); this is not separately health-checked.

3. Tag / tier summary

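| Probe | /health/ready | /health/active | /healthz |
|---|---|---|---|
| DatabaseHealthCheck | yes | yes | no |
| AkkaClusterHealthCheck | yes | yes | no |
| AdminRoleLeaderHealthCheck | no | yes | no |
| (no probes) | no | no | yes (bare liveness) |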

/healthz runs zero probes — it is a pure process liveness sentinel (process reachable = healthy; a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime monitors use this tier.

4. Downstream dependency coverage

No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but /health/ready and /health/active will not reflect it). This is a gap that the shared GrpcDependencyHealthCheck probe in ZB.MOM.WW.Health would close.

5. Notable design choices

  • AllowAnonymous on all tiers (see the comment at HealthEndpoints.cs:30-32): "Without it the AddOtOpcUaAuth fallback policy 401s every probe and Traefik marks every backend unhealthy."
  • Query probe, not CanConnectAsync — the Deployments query validates that the schema has been applied. ScadaBridge uses CanConnectAsync; neither is wrong but they diverge.
  • Degraded semantics — the Akka check uses Degraded (not Unhealthy) for a joining/pre-Up node. ASP.NET Core maps Degraded to HTTP 200 by default; Traefik sees 200 and considers the node ready. If Unhealthy (HTTP 503) is required to gate traffic, the Degraded path is insufficient.
  • IClusterRoleInfo abstraction — the admin-leader check depends on IClusterRoleInfo, an OtOpcUa interface, not the raw Akka.Cluster.Cluster API. This is a testability-friendly layer absent in ScadaBridge's direct Akka usage.
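On the Degraded semantics point in the list above: if a Degraded result should take a node out of rotation, one option is to remap the status code per endpoint through HealthCheckOptions.ResultStatusCodes. The sketch below is illustrative only and is not something the current code does; the inline "akka" check is a stand-in (assumption) that always reports Degraded.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Stand-in probe (assumption) that mimics a node which has not reached Up.
builder.Services.AddHealthChecks()
    .AddCheck("akka", () => HealthCheckResult.Degraded("Self not yet Up in cluster"),
        tags: new[] { "ready" });

var app = builder.Build();

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("ready"),
    // Override the default mapping (Degraded -> 200) so Degraded gates traffic.
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
    },
}).AllowAnonymous();

app.Run();
```

The mapping is per endpoint, so /health/ready could return 503 on Degraded while /health/active keeps the default, or the reverse. Note that remapping /health/active this way would also turn the admin-leader check's Degraded (standby) result into a 503.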

Adoption plan → ZB.MOM.WW.Health

Replace with shared probes:

  • AkkaClusterHealthCheck → ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck using the OtOpcUaCompat preset (self-Up-among-members scan → Healthy/Degraded). The preset keeps OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
  • AdminRoleLeaderHealthCheck → ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck with RoleFilter = "admin". The role-filter parameter produces identical behavior: non-admin nodes immediately healthy, admin leader healthy, admin non-leader degraded.
  • DatabaseHealthCheck → ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext> with a ProbeQuery delegate of db => db.Deployments.AsNoTracking().Take(1).ToListAsync(). The delegate preserves the stricter query probe rather than falling back to CanConnectAsync.
  • Add GrpcDependencyHealthCheck targeting the MxAccessGateway channel (closes the downstream dependency gap noted in §4). Tag ["ready","active"].
  • Replace AddOtOpcUaHealth / MapOtOpcUaHealth with services.AddHealthChecks().AddCheck<...>() (one call per probe, per spec §5) + app.MapZbHealth(). The /healthz bare-liveness tier is part of MapZbHealth by default — no separate wiring needed.

Keep bespoke:

  • IClusterRoleInfo and its Akka implementation — on adoption this testability seam is given up for the health-check path. The shared ActiveNodeHealthCheck reads cluster role state from the ActorSystem directly (resolving it lazily via IServiceProvider); it does not accept IClusterRoleInfo as an injection point. This is an accepted trade-off: the shared implementation is simpler and consistent across projects, while IClusterRoleInfo remains available elsewhere in the OtOpcUa codebase where it is used outside health checks.
  • The AllowAnonymous policy — this is an OtOpcUa auth concern; MapZbHealth must document that callers are responsible for applying AllowAnonymous (or the shared helper applies it by default).
  • Which probes are registered and their tag assignments — the shared library supplies the check implementations; the wiring (which names, which tags, which options) remains per-project.

Adoption is a follow-on task (tracked in GAPS.md), not part of the ZB.MOM.WW.Health library build. The library build delivers the shared implementations; adoption lands in the OtOpcUa repo as a separate commit once the nupkg is available.