5 Commits

Author SHA1 Message Date
Joseph Doherty 1d729fb0f8 feat: adopt shared ZB.MOM.WW.Health probes (preserve tiers + OtOpcUaCompat policy) 2026-06-01 13:36:28 -04:00
Joseph Doherty 0b99aceacb build: reference ZB.MOM.WW.Health packages from the Gitea feed 2026-06-01 13:30:13 -04:00
Joseph Doherty d57b42bcd6 chore: gitignore local credentials file and runtime PKI store
v2-ci / build (push) Failing after 45s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
sql_login.txt holds DB creds and the Host pki/ dir is the runtime OPC UA
certificate store (private keys + issued/trusted certs); neither belongs
in source control, and ignoring them prevents an accidental git add .
2026-05-31 10:27:59 -04:00
Joseph Doherty 5e87f7e16f docs(alarms): record 2026-05-31 live re-confirmation of native alarm feed
v2-ci / build (push) Failing after 41s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
Independently re-ran the D.1 alarm-source smoke against the live gateway
(10.100.0.48:5120) to back the native MxAccess alarm-event claim with a
fresh empirical run, not just the original 2026-05-29 capture.
2026-05-31 10:12:47 -04:00
Joseph Doherty 695fa6408b docs(alarms): record native alarms verified working; add D.1 smoke
v2-ci / build (push) Failing after 47s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
The 2026-04-30 alarm plan banners claimed worker-side native alarm
subscription was blocked on a COM-bitness finding. That's stale: the
mxaccessgw .NET client now has true MxAccess alarm-event support, and a
live StreamAlarms check (+ new Skip-gated GatewayGalaxyAlarmFeedLiveTests
through the lmxopcua consumer) confirms native alarms — operator comment,
category, severity, timestamps — flow end-to-end. Reconcile both plan docs
to reality and add docs/plans/alarms-d1-smoke-artifact.md as the D.1
alarm-source deliverable. Historian-write live smoke + full server->A&C
round-trip remain (Windows parity rig only).
2026-05-31 09:59:01 -04:00
13 changed files with 302 additions and 163 deletions
+6
View File
@@ -42,3 +42,9 @@ config_cache*.db
# Client CLI/UI runtime scratch (last-connected endpoint cache)
session.dat
# Secrets / local credentials — never commit
sql_login.txt
# OPC UA certificate store (runtime PKI: own/trusted/issued/rejected certs + keys)
src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/
+3
View File
@@ -96,6 +96,9 @@
<PackageVersion Include="xunit" Version="2.9.2" />
<PackageVersion Include="xunit.runner.visualstudio" Version="3.0.2" />
<PackageVersion Include="xunit.v3" Version="1.1.0" />
<PackageVersion Include="ZB.MOM.WW.Health" Version="0.1.0" />
<PackageVersion Include="ZB.MOM.WW.Health.Akka" Version="0.1.0" />
<PackageVersion Include="ZB.MOM.WW.Health.EntityFrameworkCore" Version="0.1.0" />
<PackageVersion Include="ZB.MOM.WW.MxGateway.Client" Version="0.1.0" />
<PackageVersion Include="ZB.MOM.WW.MxGateway.Contracts" Version="0.1.0" />
</ItemGroup>
+13
View File
@@ -3,5 +3,18 @@
<packageSources>
<add key="nuget.org" value="https://api.nuget.org/v3/index.json" protocolVersion="3" />
<add key="local-mxgw" value="./nuget-packages" />
<add key="dohertj2-gitea" value="https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json" />
</packageSources>
<packageSourceMapping>
<packageSource key="nuget.org">
<package pattern="*" />
</packageSource>
<packageSource key="local-mxgw">
<package pattern="ZB.MOM.WW.MxGateway.*" />
</packageSource>
<packageSource key="dohertj2-gitea">
<package pattern="ZB.MOM.WW.Health" />
<package pattern="ZB.MOM.WW.Health.*" />
</packageSource>
</packageSourceMapping>
</configuration>
+106
View File
@@ -0,0 +1,106 @@
# Alarms D.1 — smoke artifact
> **Status (2026-05-29): alarm-source leg VERIFIED. Historian-write leg still
> pending the Windows sidecar + live AVEVA Historian.**
>
> **Re-confirmed 2026-05-31** against the same gateway (`http://10.100.0.48:5120`):
> the Skip-gated live test passed again, pulling a native `Raise` transition
> (`Galaxy!TestArea.TestMachine_001.TestAlarm001`, raw sev 500 → OPC UA 750/High,
> category `TestArea`, operator comment `Test alarm #1`) through the production
> consumer. Independent re-run, not the original capture.
>
> This is the D.1 deliverable called for by `docs/plans/alarms-worker-wiring-plan.md`
> — captured evidence that a live Galaxy alarm reaches lmxopcua through the native
> gateway path (not the sub-attribute fallback). It supersedes the "A.2 blocked"
> banners in `alarms-over-gateway.md` / `alarms-worker-wiring-plan.md`, which were
> written 2026-04-30 before the gateway's alarm feed was working.
## What was verified
The mxaccessgw gateway **does** serve native MxAccess alarms today, and the lmxopcua
consumer ingests them with full fidelity — **including operator-comment**, the field
the 2026-04-30 plan flagged as "the only v1 regression."
Verified from the macOS dev box against the live gateway at `http://10.100.0.48:5120`
(reachable; `nc -z` succeeds). No acknowledge / no writes were issued — read-only
`StreamAlarms`.
### 1. Gateway boundary — raw `StreamAlarms` (`ZB.MOM.WW.MxGateway.Client`)
A standalone client streamed the active-alarm snapshot: **20 active alarms**, each
carrying native metadata. Sample (one of 20):
```json
{ "alarmFullReference": "Galaxy!TestArea.TestMachine_001.TestAlarm001",
"sourceObjectReference": "TestMachine_001.TestAlarm001",
"alarmTypeName": "DSC", "severity": 500,
"currentState": "ALARM_CONDITION_STATE_ACTIVE", "category": "TestArea",
"lastTransitionTimestamp": "2026-05-24T16:04:10.856Z",
"operatorComment": "Test alarm #1" }
```
Followed by the `SnapshotComplete` marker. `operatorComment`, `category`, `severity`,
`currentState`, and `lastTransitionTimestamp` are all populated.
### 2. lmxopcua consumer — `GatewayGalaxyAlarmFeed` → `GalaxyAlarmTransition`
The Skip-gated live test
`Runtime/GatewayGalaxyAlarmFeedLiveTests.Live_gateway_delivers_native_alarm_transitions_through_the_consumer`
wires the real `MxGatewayClient.StreamAlarmsAsync` into the production consumer seam
and **passes**. Captured output (`D1_SMOKE_OUT`):
```
# consumer transitions observed: 2+
Raise Galaxy!TestArea.TestMachine_001.TestAlarm001 | sev=750(High) raw=500 | cat=TestArea | comment='Test alarm #1' | xitionUtc=2026-05-24T16:04:10.856Z
Raise Galaxy!TestArea.TestMachine_003.TestAlarm001 | sev=750(High) raw=500 | cat=TestArea | comment='Test alarm #1' | xitionUtc=2026-05-07T18:14:00.594Z
```
The consumer preserves `operatorComment` + `category` + transition timestamp and
applies the OPC UA severity-bucket mapping (`MxAccessSeverityMapper`: raw 500 →
OPC UA 750, bucket `High`).
### 3. Full chain to the OPC UA Part 9 surface (code-path verified)
`GalaxyDriver.OnAlarmFeedTransition` maps `GalaxyAlarmTransition`
`AlarmEventArgs`, carrying `OperatorComment`, `OriginalRaiseTimestampUtc`,
`AlarmCategory`, and the severity bucket onto `IAlarmSource.OnAlarmEvent`.
`AlarmEventArgs` already declares those fields — so the **E.7 contract extension is
done**, not pending. The server's Part-9 condition layer consumes `IAlarmSource`
via `AlarmSurfaceInvoker``GenericDriverNodeManager`. Unit coverage:
`GalaxyDriverAlarmSourceTests`, `GatewayGalaxyAlarmFeedTests`.
## How to re-run
```bash
export MXGW_ENDPOINT="http://10.100.0.48:5120"
export GALAXY_MXGW_API_KEY="<dev key from docker-dev/docker-compose.yml>"
export D1_SMOKE_OUT="/tmp/d1-consumer-transitions.txt" # optional capture
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests \
--filter "FullyQualifiedName~GatewayGalaxyAlarmFeedLiveTests"
```
Without the env vars the test `Skip`s, so normal `dotnet test` runs are unaffected.
## Not covered here (still open)
1. **Scripted-alarm historian write-back → AVEVA Historian** (C.1's live leg). The
`SdkAlarmHistorianWriteBackend` (real `HistorianAccess.AddStreamedValue` path) is
implemented and unit-tested, but its `Live_*` write smoke needs the Windows
historian sidecar + a live AVEVA Historian — neither reachable from the macOS dev
box. Capture this leg on the Windows parity rig.
2. **Running-server → OPC UA A&C client round-trip.** This artifact proves the driver
consumer end; it does not exercise a full OtOpcUa server surfacing the condition to
an OPC UA client, because the docker-dev stack stubs the Galaxy driver on Linux
(`DriverInstanceActor.ShouldStub`). Capture on the Windows parity rig (or a Linux
host with `ShouldStub` overridden to point the real driver at the gateway).
## Mechanism — true MxAccess alarm-event support
The gateway delivers these alarms via **true MxAccess alarm-event support** in the
mxaccessgw .NET client — a real alarm-event subscription, **not** the value-driven
sub-attribute fallback. (Confirmed by the gateway maintainer; the client-side stream
check above can only observe the resulting feed, which is why this artifact records the
mechanism here rather than inferring it.) So A.2 is implemented as originally specified:
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION` carries genuine native alarm-event metadata, and
the operator-comment / original-raise-time / category fields are first-class — not
reconstructed from attribute reads.
+33 -16
View File
@@ -9,24 +9,41 @@
> the new RPCs; the sub-attribute fallback path keeps Galaxy alarms
> functional today.
>
> ⚠️ **Worker-side native alarm subscription blocked on a dev-rig
> finding (2026-04-30):** the MXAccess COM Toolkit at
> **UPDATE 2026-05-29 — native alarm feed VERIFIED working; the
> 2026-04-30 "blocked" finding below is superseded.** A live
> `StreamAlarms` check against the gateway at `10.100.0.48:5120`
> returned the active-alarm snapshot (20 alarms) with full native
> metadata — `severity`, `category`, `currentState`,
> `lastTransitionTimestamp`, **and `operatorComment`** (the field the
> note below called "the only v1 regression"). The lmxopcua consumer
> (`GatewayGalaxyAlarmFeed` → `GalaxyAlarmTransition` →
> `AlarmEventArgs` → `IAlarmSource`) ingests it with full fidelity and
> the OPC UA severity-bucket mapping applied — proven by the passing
> Skip-gated live test `GatewayGalaxyAlarmFeedLiveTests`. `AlarmEventArgs`
> already carries operator-comment / original-raise-time / category, so
> **E.7 is done too**. See `docs/plans/alarms-d1-smoke-artifact.md` for
> the captured evidence. The gateway delivers this via **true MxAccess
> alarm-event support** in the mxaccessgw .NET client (a real
> alarm-event subscription — **not** the sub-attribute fallback), so A.2
> is implemented as originally specified. Still open: the scripted-alarm
> → AVEVA Historian write-back live smoke (C.1's `Live_*` leg) and a full
> running-server → OPC UA A&C round-trip — both need the Windows parity rig.
>
> ⚠️ **[SUPERSEDED — kept for history] Worker-side native alarm
> subscription blocked on a dev-rig finding (2026-04-30):** the MXAccess
> COM Toolkit at
> `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll`
> exposes no alarm-event family — only `OnDataChange`,
> `OnWriteComplete`, `OperationComplete`, `OnBufferedDataChange`.
> exposed no alarm-event family — only `OnDataChange`,
> `OnWriteComplete`, `OperationComplete`, `OnBufferedDataChange` — and
> AVEVA's `aaAlarmManagedClient` / `ArchestrAAlarmsAndEvents.SDK`
> assemblies are x64-only and incompatible with the worker's x86
> bitness. **Operator decision needed before
> `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` carries any events:** either
> accept the value-driven sub-attribute path as the production
> architecture (operator-comment fidelity is the only v1 regression)
> or add an x64 alarm-helper sub-process alongside the worker. See
> `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs` in the
> mxaccessgw repo for the architectural notes. Live
> `aahClientManaged` alarm-event write call site
> (`SdkAlarmHistorianWriteBackend` placeholder from PR C.1) and the
> D.1 smoke artifact ship once those decisions resolve. The
> remainder of this document is preserved as the design record.
> assemblies are x64-only vs. the worker's x86 bitness. The operator
> decision (accept the value-driven sub-attribute path, or add an x64
> alarm-helper sub-process) has since been resolved on the gateway side
> — `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` now carries events (verified
> above). The C.1 `SdkAlarmHistorianWriteBackend` is **no longer a
> placeholder** — it writes through the real
> `HistorianAccess.AddStreamedValue` path (only its live-rig write
> smoke remains).
Coordinated epic across two repos:
+27 -10
View File
@@ -1,5 +1,18 @@
# Alarms Worker Wiring Plan
> ✅ **UPDATE 2026-05-29 — the blocker below is RESOLVED on the gateway side; this
> plan is largely complete.** A live `StreamAlarms` check against `10.100.0.48:5120`
> returns the active-alarm snapshot with full native metadata **including
> `operatorComment`**, and the lmxopcua consumer ingests it end-to-end (passing live
> test `GatewayGalaxyAlarmFeedLiveTests`). So **A.2 / A.3 / A.4** are functionally done
> at the gateway boundary (the worker now emits native alarm transitions and the client
> exposes `AcknowledgeAlarm` / `QueryActiveAlarms` RPCs). **C.1** ships real code
> (`SdkAlarmHistorianWriteBackend``HistorianAccess.AddStreamedValue`). **D.1**'s
> alarm-source leg is captured in `docs/plans/alarms-d1-smoke-artifact.md`. Only two
> things remain, both needing the Windows parity rig: C.1's live historian-write smoke
> and a full running-server → OPC UA A&C round-trip. The per-item detail below is kept
> as the historical record of the original blocked state.
>
> **Context**: The alarms-over-gateway epic shipped 19 PRs across the
> `lmxopcua` and `mxaccessgw` repos (merged 2026-04-30). Contracts are live;
> the sub-attribute fallback path keeps Galaxy alarms functional today. Four
@@ -16,7 +29,7 @@
---
## Dev-rig finding that blocks everything (2026-04-30)
## Dev-rig finding that blocks everything (2026-04-30) — [SUPERSEDED 2026-05-29]
During PR A.2 work the following was discovered on the dev box:
@@ -318,16 +331,20 @@ fallback as production).
## Summary of blocks
| Item | Blocked by | Estimated effort once unblocked |
|------|-----------|--------------------------------|
| A.2 | Architectural decision (x64 alarm-helper vs. sub-attribute fallback as production) | 23 days implementation; 1 day tests |
| A.3 | A.2 delivering WorkerEvent bodies | 12 days |
| A.4 | A.2 (active-alarm query needs AlarmClient session) | 1 day |
| C.1 | aahClientManaged SDK access (available on dev box); NOT blocked by A.2 | 12 days |
| D.1 | A.2 + A.3 + C.1 all passing on parity rig | 0.5 day (smoke + artifact capture) |
> **Resolved as of 2026-05-29** — see the update banner at the top and
> `docs/plans/alarms-d1-smoke-artifact.md`. Original status table kept for history.
C.1 can proceed in parallel with A.2 / A.3 since the sidecar's `aahClientManaged`
is x64 and does not share the worker bitness constraint.
| Item | Status (2026-05-29) | Original block |
|------|--------------------|----------------|
| A.2 | ✅ **True MxAccess alarm-event support** in the gateway client (real alarm-event subscription, not the sub-attribute fallback); verified via live `StreamAlarms` with operator-comment fidelity | Architectural decision (x64 alarm-helper vs. sub-attribute fallback) |
| A.3 | ✅ Dispatch + `AcknowledgeAlarm` RPC present on the client surface | A.2 delivering WorkerEvent bodies |
| A.4 | ✅ `QueryActiveAlarms` RPC present on the client surface | A.2 (active-alarm query needs AlarmClient session) |
| C.1 | ✅ Code shipped (`AddStreamedValue` path); ⏳ live historian-write smoke needs the Windows rig | aahClientManaged SDK access |
| D.1 | ◑ Alarm-source leg captured (`alarms-d1-smoke-artifact.md`); ⏳ historian-write leg + full server→A&C round-trip need the Windows rig | A.2 + A.3 + C.1 all passing on parity rig |
The gateway delivers operator-comment fidelity through **true MxAccess alarm-event
support** in the mxaccessgw .NET client — a real alarm-event subscription, not the
value-driven sub-attribute path. The sub-attribute fallback is now legacy.
---
@@ -1,39 +0,0 @@
using Microsoft.Extensions.Diagnostics.HealthChecks;
using ZB.MOM.WW.OtOpcUa.Commons.Interfaces;
namespace ZB.MOM.WW.OtOpcUa.Host.Health;
/// <summary>
/// Reports Healthy on the admin-role leader, Degraded on a non-leader admin member. Used by
/// the <c>/health/active</c> endpoint so external load balancers can route admin-singleton
/// traffic to the current leader (cookie sessions still work on either node — DataProtection
/// keys are shared).
/// </summary>
public sealed class AdminRoleLeaderHealthCheck : IHealthCheck
{
private readonly IClusterRoleInfo _roleInfo;
/// <summary>Initializes a new instance of the AdminRoleLeaderHealthCheck class.</summary>
/// <param name="roleInfo">The cluster role information provider.</param>
public AdminRoleLeaderHealthCheck(IClusterRoleInfo roleInfo)
{
_roleInfo = roleInfo;
}
/// <summary>Checks the health status of the admin role leader.</summary>
/// <param name="context">The health check context.</param>
/// <param name="cancellationToken">The cancellation token.</param>
/// <returns>A task representing the health check operation.</returns>
public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
{
if (!_roleInfo.HasRole("admin"))
return Task.FromResult(HealthCheckResult.Healthy("Node does not carry admin role"));
var leader = _roleInfo.RoleLeader("admin");
var isLeader = leader is not null && leader.Value.Equals(_roleInfo.LocalNode);
return Task.FromResult(isLeader
? HealthCheckResult.Healthy($"Admin leader ({_roleInfo.LocalNode})")
: HealthCheckResult.Degraded($"Admin member but not leader (leader={leader?.Value ?? "<unknown>"})"));
}
}
@@ -1,35 +0,0 @@
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;
namespace ZB.MOM.WW.OtOpcUa.Host.Health;
public sealed class AkkaClusterHealthCheck : IHealthCheck
{
private readonly ActorSystem _system;
/// <summary>
/// Initializes a new instance of the AkkaClusterHealthCheck class.
/// </summary>
/// <param name="system">The Akka actor system to check cluster health for.</param>
public AkkaClusterHealthCheck(ActorSystem system)
{
_system = system;
}
/// <summary>
/// Checks the health of the Akka cluster asynchronously.
/// </summary>
/// <param name="context">The health check context.</param>
/// <param name="cancellationToken">Cancellation token.</param>
public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
{
var cluster = Akka.Cluster.Cluster.Get(_system);
var selfUp = cluster.State.Members.Any(m =>
m.Address == cluster.SelfAddress && m.Status == MemberStatus.Up);
return Task.FromResult(selfUp
? HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")
: HealthCheckResult.Degraded("Self not yet Up in cluster"));
}
}
@@ -1,38 +0,0 @@
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using ZB.MOM.WW.OtOpcUa.Configuration;
namespace ZB.MOM.WW.OtOpcUa.Host.Health;
public sealed class DatabaseHealthCheck : IHealthCheck
{
private readonly IDbContextFactory<OtOpcUaConfigDbContext> _dbFactory;
/// <summary>
/// Initializes a new instance of the <see cref="DatabaseHealthCheck"/> class.
/// </summary>
/// <param name="dbFactory">The database context factory for the config database.</param>
public DatabaseHealthCheck(IDbContextFactory<OtOpcUaConfigDbContext> dbFactory)
{
_dbFactory = dbFactory;
}
/// <summary>
/// Checks the health of the configuration database.
/// </summary>
/// <param name="context">The health check context.</param>
/// <param name="cancellationToken">Cancellation token.</param>
public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
{
try
{
await using var db = await _dbFactory.CreateDbContextAsync(cancellationToken);
await db.Deployments.AsNoTracking().Take(1).ToListAsync(cancellationToken);
return HealthCheckResult.Healthy("ConfigDb reachable");
}
catch (Exception ex)
{
return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
}
}
}
@@ -1,25 +1,40 @@
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using ZB.MOM.WW.Health;
using ZB.MOM.WW.Health.Akka;
using ZB.MOM.WW.Health.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
namespace ZB.MOM.WW.OtOpcUa.Host.Health;
public static class HealthEndpoints
{
/// <summary>
/// Registers the standard ASP.NET Core health-check infrastructure plus the OtOpcUa-specific
/// probes. Mirrors ScadaLink's three-tier pattern: <c>ready</c> = boot ok; <c>active</c> =
/// fully serving traffic; <c>healthz</c> = bare process liveness.
/// Registers the shared ZB.MOM.WW health probes. Tier semantics preserved: configdb + akka on
/// ready+active; admin-leader on active only.
/// </summary>
/// <param name="services">The service collection to register health checks with.</param>
public static IServiceCollection AddOtOpcUaHealth(this IServiceCollection services)
{
services.AddHealthChecks()
.AddCheck<DatabaseHealthCheck>("configdb", tags: new[] { "ready", "active" })
.AddCheck<AkkaClusterHealthCheck>("akka", tags: new[] { "ready", "active" })
.AddCheck<AdminRoleLeaderHealthCheck>("admin-leader", tags: new[] { "active" });
.AddTypeActivatedCheck<DatabaseHealthCheck<OtOpcUaConfigDbContext>>(
"configdb",
failureStatus: null,
tags: new[] { ZbHealthTags.Ready, ZbHealthTags.Active },
args: new DatabaseHealthCheckOptions<OtOpcUaConfigDbContext>
{
ProbeQuery = static (db, ct) => db.Deployments.AsNoTracking().Take(1).ToListAsync(ct),
})
.AddTypeActivatedCheck<AkkaClusterHealthCheck>(
"akka",
failureStatus: null,
tags: new[] { ZbHealthTags.Ready, ZbHealthTags.Active },
args: AkkaClusterStatusPolicy.OtOpcUaCompat)
.AddTypeActivatedCheck<ActiveNodeHealthCheck>(
"admin-leader",
failureStatus: null,
tags: new[] { ZbHealthTags.Active },
args: "admin");
return services;
}
@@ -27,21 +42,7 @@ public static class HealthEndpoints
/// <param name="app">The endpoint route builder.</param>
public static IEndpointRouteBuilder MapOtOpcUaHealth(this IEndpointRouteBuilder app)
{
// AllowAnonymous on all three — Traefik / k8s liveness probes / load-balancers
// hit these without credentials. Without it the AddOtOpcUaAuth fallback policy
// 401s every probe and Traefik marks every backend unhealthy.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
Predicate = c => c.Tags.Contains("ready"),
}).AllowAnonymous();
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
Predicate = c => c.Tags.Contains("active"),
}).AllowAnonymous();
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
Predicate = _ => false, // process-liveness only — no probes run.
}).AllowAnonymous();
app.MapZbHealth();
return app;
}
}
@@ -27,6 +27,9 @@
</PackageReference>
<PackageReference Include="OpenTelemetry.Extensions.Hosting"/>
<PackageReference Include="OpenTelemetry.Exporter.Prometheus.AspNetCore"/>
<PackageReference Include="ZB.MOM.WW.Health" />
<PackageReference Include="ZB.MOM.WW.Health.Akka" />
<PackageReference Include="ZB.MOM.WW.Health.EntityFrameworkCore" />
</ItemGroup>
<ItemGroup>
@@ -0,0 +1,82 @@
using ZB.MOM.WW.MxGateway.Client;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests.Runtime;
/// <summary>
/// D.1 smoke (alarm-source leg): drives the REAL gateway <c>StreamAlarms</c> feed through the
/// production lmxopcua consumer (<see cref="GatewayGalaxyAlarmFeed"/>) and asserts native alarm
/// transitions — with operator comment, category, original raise time, and the mapped OPC UA
/// severity bucket preserved — reach the driver-side boundary that feeds
/// <c>IAlarmSource.OnAlarmEvent</c>.
/// <para>
/// Skip-gated: runs only when <c>MXGW_ENDPOINT</c> + <c>GALAXY_MXGW_API_KEY</c> are set to a
/// reachable gateway. Captured 2026-05-29 against <c>10.100.0.48:5120</c> — see
/// <c>docs/plans/alarms-d1-smoke-artifact.md</c>. Set <c>D1_SMOKE_OUT</c> to dump the observed
/// transitions to a file for artifact capture.
/// </para>
/// </summary>
[Trait("Category", "Integration")]
public sealed class GatewayGalaxyAlarmFeedLiveTests
{
[Fact]
public async Task Live_gateway_delivers_native_alarm_transitions_through_the_consumer()
{
var endpoint = Environment.GetEnvironmentVariable("MXGW_ENDPOINT");
var apiKey = Environment.GetEnvironmentVariable("GALAXY_MXGW_API_KEY");
if (string.IsNullOrWhiteSpace(endpoint) || string.IsNullOrWhiteSpace(apiKey))
Assert.Skip("Set MXGW_ENDPOINT + GALAXY_MXGW_API_KEY to run the live gateway alarm-feed smoke.");
var client = MxGatewayClient.Create(new MxGatewayClientOptions
{
Endpoint = new Uri(endpoint!, UriKind.Absolute),
ApiKey = apiKey!,
UseTls = false,
ConnectTimeout = TimeSpan.FromSeconds(10),
DefaultCallTimeout = TimeSpan.FromSeconds(30),
StreamTimeout = TimeSpan.FromSeconds(30),
});
var observed = new List<GalaxyAlarmTransition>();
var gotOne = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
// Wire the live client's StreamAlarms method group into the production consumer seam.
await using var feed = new GatewayGalaxyAlarmFeed(client.StreamAlarmsAsync, clientName: "D1Smoke");
feed.OnAlarmTransition += (_, t) =>
{
lock (observed) { observed.Add(t); }
gotOne.TrySetResult(true);
};
feed.Start();
// The stream opens with the active-alarm snapshot, so we expect ≥1 transition promptly.
await Task.WhenAny(gotOne.Task, Task.Delay(TimeSpan.FromSeconds(20), TestContext.Current.CancellationToken));
List<GalaxyAlarmTransition> snapshot;
lock (observed) snapshot = observed.ToList();
snapshot.ShouldNotBeEmpty(
"Live gateway should deliver at least the active-alarm snapshot through the lmxopcua consumer.");
var first = snapshot[0];
first.AlarmFullReference.ShouldNotBeNullOrWhiteSpace();
first.OpcUaSeverity.ShouldBeGreaterThan(0); // severity bucket mapping applied by the consumer
foreach (var t in snapshot.Take(8))
TestContext.Current.SendDiagnosticMessage(
$"{t.TransitionKind,-11} {t.AlarmFullReference} sev={t.OpcUaSeverity}({t.SeverityBucket}) cat={t.Category} comment='{t.OperatorComment}'");
TestContext.Current.SendDiagnosticMessage($"TOTAL consumer transitions observed: {snapshot.Count}");
// Deterministic artifact capture (only when D1_SMOKE_OUT is set).
var outPath = Environment.GetEnvironmentVariable("D1_SMOKE_OUT");
if (!string.IsNullOrWhiteSpace(outPath))
{
var lines = snapshot.Take(50).Select(t =>
$"{t.TransitionKind,-11} {t.AlarmFullReference} | sev={t.OpcUaSeverity}({t.SeverityBucket}) raw={t.RawMxAccessSeverity} | cat={t.Category} | comment='{t.OperatorComment}' | xitionUtc={t.TransitionTimestampUtc:o}");
await File.WriteAllLinesAsync(outPath!,
new[] { $"# consumer transitions observed: {snapshot.Count}" }.Concat(lines),
TestContext.Current.CancellationToken);
}
}
}
@@ -25,6 +25,9 @@
<ItemGroup>
<PackageReference Include="ZB.MOM.WW.MxGateway.Contracts" />
<!-- Client package: only the Skip-gated live alarm-feed smoke (GatewayGalaxyAlarmFeedLiveTests)
constructs a real MxGatewayClient. Unit tests use the fake stream-factory seam. -->
<PackageReference Include="ZB.MOM.WW.MxGateway.Client" />
</ItemGroup>
</Project>