Files
mxaccessgw/src/MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs
T
Joseph Doherty 1aafd6bde4 Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings
Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
  state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:28:54 -04:00

271 lines
11 KiB
C#

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
using MxGateway.Contracts.Proto;
using MxGateway.Worker.MxAccess;
using Xunit.Abstractions;
namespace MxGateway.Worker.Tests.Probes;
/// <summary>
/// Live dev-rig smoke test for the alarms-over-gateway pipeline.
/// Exercises <see cref="WnWrapAlarmConsumer"/> + <see cref="AlarmDispatcher"/> +
/// <see cref="MxAccessAlarmEventSink"/> end-to-end against the actual
/// AVEVA System Platform install: subscribes to
/// <c>\\&lt;machine&gt;\Galaxy!DEV</c>, waits for at least one alarm
/// transition (the dev rig's flip script writes
/// <c>TestMachine_001.TestAlarm001</c> every 10s), drains the proto
/// <c>OnAlarmTransitionEvent</c> from the queue, then ack-by-name's
/// it and verifies the ack registers as a subsequent
/// <see cref="AlarmTransitionKind.Acknowledge"/> transition.
///
/// Skip-gated; flip <c>Skip=null</c> on the dev rig with the flip
/// script running.
/// </summary>
public sealed class AlarmsLiveSmokeTests
{
private static readonly string SubscriptionExpression =
$@"\\{Environment.MachineName}\Galaxy!DEV";
private static readonly TimeSpan PumpDuration = TimeSpan.FromSeconds(45);
private static readonly TimeSpan TransitionWaitTimeout = TimeSpan.FromSeconds(20);
private const string SessionId = "alarms-live-smoke";
private readonly ITestOutputHelper output;
private readonly Stopwatch elapsed = Stopwatch.StartNew();
private readonly ConcurrentQueue<string> log = new ConcurrentQueue<string>();
public AlarmsLiveSmokeTests(ITestOutputHelper output)
{
this.output = output;
}
[Fact(Skip = "Live dev-rig smoke test — flip Skip=null with AVEVA + the alarm flip script running. Verified working 2026-05-01.")]
public void Alarms_FullPipelineRoundTrip_RaisesAndAcknowledges()
{
Exception? threadException = null;
var done = new ManualResetEventSlim(false);
var thread = new Thread(() =>
{
try { RunSmoke(); }
catch (Exception ex) { threadException = ex; }
finally { done.Set(); }
});
thread.IsBackground = false;
thread.SetApartmentState(ApartmentState.STA);
thread.Start();
done.Wait();
thread.Join();
output.WriteLine($"Captured {log.Count} log line(s):");
while (log.TryDequeue(out string? line))
{
output.WriteLine(line);
}
if (threadException != null)
{
throw threadException;
}
}
private void RunSmoke()
{
Log($"Subscription expression: {SubscriptionExpression}");
Log($"Pump duration: {PumpDuration.TotalSeconds:F0}s; transition wait timeout: {TransitionWaitTimeout.TotalSeconds:F0}s");
MxAccessEventQueue queue = new MxAccessEventQueue();
// The consumer owns no internal timer; we drive PollOnce manually
// from the STA below (the wnwrap COM is ThreadingModel=Apartment,
// and this test doesn't run a Win32 message pump on its STA).
WnWrapAlarmConsumer consumer = new WnWrapAlarmConsumer(
new WNWRAPCONSUMERLib.wwAlarmConsumerClass(),
maxAlarmsPerFetch: 1024);
MxAccessAlarmEventSink sink = new MxAccessAlarmEventSink(queue, new MxAccessEventMapper());
using AlarmDispatcher dispatcher = new AlarmDispatcher(consumer, sink, SessionId);
Log("Constructed consumer + sink + dispatcher.");
dispatcher.Subscribe(SubscriptionExpression);
Log("Subscribe -> ok. Driving PollOnce manually from this STA...");
// The wnwrap COM object is ThreadingModel=Apartment. The consumer
// owns no internal timer, so we drive PollOnce manually here on the
// STA. Production hosting routes polls through the worker's
// StaRuntime.
// 1. Wait for the first transition (any kind), then keep waiting
// for one with kind=Raise so the alarm is currently Active when
// we try to ack. AVEVA rejects acks of cleared alarms with -55,
// so we have to time the ack against the flip script's 10s
// cadence.
OnAlarmTransitionEvent? raiseBody = null;
DateTime raiseDeadline = DateTime.UtcNow + TimeSpan.FromSeconds(30);
while (DateTime.UtcNow < raiseDeadline && raiseBody is null)
{
WorkerEvent? evt = WaitForTransition(queue, TransitionWaitTimeout, "raise", consumer);
if (evt is null) break;
OnAlarmTransitionEvent body = evt.Event.OnAlarmTransition;
Log("Transition: " + DescribeTransition(body));
Assert.Equal(SessionId, evt.Event.SessionId);
if (body.TransitionKind == AlarmTransitionKind.Raise)
{
raiseBody = body;
}
}
Assert.NotNull(raiseBody);
Assert.False(string.IsNullOrEmpty(raiseBody!.AlarmFullReference));
Assert.Contains("Galaxy", raiseBody.AlarmFullReference);
// 2. Snapshot the active set + verify the captured alarm is there.
var snapshot = dispatcher.SnapshotActiveAlarms();
Log($"SnapshotActiveAlarms count={snapshot.Count}");
foreach (var s in snapshot)
{
Log(" active: " + DescribeSnapshot(s));
}
Assert.NotEmpty(snapshot);
Assert.Contains(snapshot, s => s.AlarmFullReference == raiseBody.AlarmFullReference);
// 3. Ack-by-name using the captured reference. Parse the reference
// via the same convention the gateway dispatcher uses
// (Provider!Group.Tag where the tag may contain dots).
Assert.True(TryParseReference(
raiseBody.AlarmFullReference,
out string provider, out string group, out string alarmName),
$"Captured reference '{raiseBody.AlarmFullReference}' did not parse as Provider!Group.Tag.");
Log($"Ack target: provider='{provider}' group='{group}' name='{alarmName}'");
// Try the ack with real Windows identity. AVEVA's AlarmAckByName
// may reject synthetic operator strings; using the current process
// identity gives the alarm-history a recognizable principal.
string realUser = Environment.UserName;
string realNode = Environment.MachineName;
string realDomain = Environment.UserDomainName ?? string.Empty;
Log($"Ack identity: user='{realUser}' node='{realNode}' domain='{realDomain}'");
int rc = dispatcher.AcknowledgeByName(
alarmName: alarmName,
providerName: provider,
groupName: group,
ackComment: "alarms-live-smoke ack",
ackOperatorName: realUser,
ackOperatorNode: realNode,
ackOperatorDomain: realDomain,
ackOperatorFullName: realUser);
Log($"AcknowledgeByName(real identity) -> rc={rc}");
Assert.Equal(0, rc);
// 4. Wait for the post-ack transition. With the alarm flipping every
// 10s and the consumer polling every 500ms, the next state
// change should be either kind=Acknowledge (the ack we just
// sent registered as a state delta UnackAlm → AckAlm) or the
// flip script's next Clear (UnackAlm → UnackRtn).
WorkerEvent? second = WaitForTransition(queue, TransitionWaitTimeout, "post-ack", consumer);
Assert.NotNull(second);
OnAlarmTransitionEvent secondBody = second!.Event.OnAlarmTransition;
Log("Post-ack transition: " + DescribeTransition(secondBody));
Assert.NotEqual(AlarmTransitionKind.Unspecified, secondBody.TransitionKind);
// 5. Pump a little longer to confirm the consumer keeps reporting
// transitions on the 10s flip cadence.
DateTime deadline = DateTime.UtcNow + PumpDuration;
int additional = 0;
while (DateTime.UtcNow < deadline)
{
consumer.PollOnce();
if (queue.TryDequeue(out WorkerEvent? evt) && evt is not null)
{
additional++;
OnAlarmTransitionEvent body = evt.Event.OnAlarmTransition;
Log($" +{additional}: " + DescribeTransition(body));
}
Thread.Sleep(500);
}
Log($"Pump completed; additional transitions captured: {additional}.");
}
private WorkerEvent? WaitForTransition(
MxAccessEventQueue queue,
TimeSpan timeout,
string label,
WnWrapAlarmConsumer consumer)
{
DateTime deadline = DateTime.UtcNow + timeout;
int pollCount = 0;
while (DateTime.UtcNow < deadline)
{
try
{
consumer.PollOnce();
pollCount++;
if (pollCount == 1) Log("First PollOnce returned without throw.");
}
catch (Exception ex)
{
Log($"PollOnce threw on poll #{pollCount + 1}: {ex.GetType().Name}: {ex.Message}");
if (ex is System.Runtime.InteropServices.COMException ce)
{
Log($" HResult=0x{(uint)ce.HResult:X8}");
}
throw;
}
if (queue.TryDequeue(out WorkerEvent? evt) && evt is not null)
{
if (evt.Event.Family == MxEventFamily.OnAlarmTransition)
{
return evt;
}
Log($"Skipped non-alarm event (family={evt.Event.Family}) while waiting for {label}.");
}
Thread.Sleep(500);
}
Log($"Timed out waiting for {label} transition after {timeout.TotalSeconds:F0}s (poll count={pollCount}).");
return null;
}
private static bool TryParseReference(
string reference,
out string provider,
out string group,
out string alarmName)
{
provider = group = alarmName = string.Empty;
if (string.IsNullOrWhiteSpace(reference)) return false;
int bang = reference.IndexOf('!');
if (bang <= 0 || bang == reference.Length - 1) return false;
string left = reference.Substring(0, bang);
string right = reference.Substring(bang + 1);
int dot = right.IndexOf('.');
if (dot <= 0 || dot == right.Length - 1) return false;
provider = left;
group = right.Substring(0, dot);
alarmName = right.Substring(dot + 1);
return true;
}
private static string DescribeTransition(OnAlarmTransitionEvent body)
{
return string.Format(
"kind={0} ref='{1}' source='{2}' type='{3}' severity={4} operator='{5}' comment='{6}' ts={7:o}",
body.TransitionKind, body.AlarmFullReference, body.SourceObjectReference,
body.AlarmTypeName, body.Severity, body.OperatorUser, body.OperatorComment,
body.TransitionTimestamp?.ToDateTime() ?? DateTime.MinValue);
}
private static string DescribeSnapshot(ActiveAlarmSnapshot s)
{
return string.Format(
"ref='{0}' state={1} severity={2} operator='{3}' comment='{4}' ts={5:o}",
s.AlarmFullReference, s.CurrentState, s.Severity, s.OperatorUser,
s.OperatorComment,
s.LastTransitionTimestamp?.ToDateTime() ?? DateTime.MinValue);
}
private void Log(string line)
{
log.Enqueue($"[t={elapsed.Elapsed.TotalSeconds:F3}s] {line}");
}
}