1aafd6bde4
Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
string (it must be a valid SPDX expression), so `pip wheel .` and
`pip install -e .` both fail before any source compiles. Tests
still pass because pytest bypasses the build backend via
`pythonpath`. Dropped the invalid license string, kept the
`License :: Other/Proprietary License` classifier, and added
`tests/test_packaging.py` so a future regression of the same shape
is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
on WorkerPipeSessionOptions bounds the in-flight-command watchdog
suppression so a truly stuck COM call still triggers StaHung
instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
cross-language bench comparison is apples-to-apples again;
`failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
serialisation pattern to DeployEventStream so close() arriving
after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
stability check after UnAdvise instead of strict equality against
the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
log sink the WriteSecured live test owns (worker stdout/stderr,
gateway logs, direct WriteLine) so the credential is proven
absent from the full output buffer, not just the diagnostic
message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
for the previously-uncovered Write2Bulk and WriteSecured2Bulk
arms of WriteBulkConstraintPlan.SetPayload.
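The two-window stability idea from IntegrationTests-017 can be sketched as below. This is an illustration under assumptions, not the test's actual code (which is C#): `count_is_stable`, `read_count`, and the window lengths are hypothetical names.

```python
import time
from typing import Callable


def count_is_stable(
    read_count: Callable[[], int],
    drain_s: float,
    observe_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the event count stops changing after a drain window."""
    sleep(drain_s)            # window 1: let events already in flight arrive
    baseline = read_count()
    sleep(observe_s)          # window 2: no further events may arrive
    return read_count() == baseline


if __name__ == "__main__":
    samples = iter([12, 12])  # hypothetical counts read after each window
    print(count_is_stable(lambda: next(samples), 0.0, 0.0))
```

Comparing against a baseline taken after the first window, rather than against the count captured before UnAdvise, is what removes the race with in-flight events.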

Lows (41, highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
AlarmsOptions validated at startup (Server-026); Authorization.md
Constraint Enforcement snippet/prose enumerate the bulk write/read
family (Server-027); bulk-read-commands and bulk-write-commands
capability tokens added to OpenSession (Server-029);
NotWiredAlarmRpcDispatcher XML doc cleaned up and the missing
scope-resolver and state-machine tests added (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
guard the poll path uses, at every command entry (Worker-024);
RunAsync null-checks the runtime-session factory result
(Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
CancelCommandReturnValue serialised under lock (Worker.Tests-027);
Probes namespace lifted to MxGateway.Worker.Tests.Probes
(Worker.Tests-029); cancel-envelope sequence numbers monotonised
(Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
(Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
test backed by a TaskCompletionSource fake (Tests-022); companion
FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
(Tests-023); constraint plan reply-count divergence pinned
(Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
end-to-end (IntegrationTests-018); abnormal-exit keyword set
tightened to pipe-disconnected/end-of-stream and the test now
asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
default 30s wall-clock budget doesn't kill them (015);
BenchStreamEventsAsync observes the inner stream task on every
exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
%w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
RFC3339Nano with fractional seconds (019); runStreamEvents installs
signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
cancellation contract Client.Java-015 established (022); stream-events
text path uses Long.toUnsignedString for worker_sequence (023);
bench-read-bulk no longer pollutes success-latency histogram with
failure durations (024); --shutdown-timeout CLI option propagates
through to ClientOptions (025); seven new MxGatewayCliTests cover
the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
wheel-build smoke test added under tests/test_packaging.py (020);
README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
document the AsRef<str> read_bulk genericism (019);
next_correlation_id re-exported at the crate root, with a
property-style doc contract and an explicit disclaimer that the
literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
IConstraintEnforcer mechanism instead of "tag-allowlist filter"
(014); BulkReadResult gains explicit per-arm payload-population
documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
271 lines
11 KiB
C#
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
using MxGateway.Contracts.Proto;
using MxGateway.Worker.MxAccess;
using Xunit.Abstractions;

namespace MxGateway.Worker.Tests.Probes;

/// <summary>
/// Live dev-rig smoke test for the alarms-over-gateway pipeline.
/// Exercises <see cref="WnWrapAlarmConsumer"/> + <see cref="AlarmDispatcher"/> +
/// <see cref="MxAccessAlarmEventSink"/> end-to-end against the actual
/// AVEVA System Platform install: subscribes to
/// <c>\\&lt;machine&gt;\Galaxy!DEV</c>, waits for at least one alarm
/// transition (the dev rig's flip script writes
/// <c>TestMachine_001.TestAlarm001</c> every 10s), drains the proto
/// <c>OnAlarmTransitionEvent</c> from the queue, then acknowledges it
/// by name and verifies that a further transition follows (normally
/// <see cref="AlarmTransitionKind.Acknowledge"/>, otherwise the flip
/// script's next Clear).
///
/// Skip-gated; flip <c>Skip=null</c> on the dev rig with the flip
/// script running.
/// </summary>
public sealed class AlarmsLiveSmokeTests
{
    private static readonly string SubscriptionExpression =
        $@"\\{Environment.MachineName}\Galaxy!DEV";
    private static readonly TimeSpan PumpDuration = TimeSpan.FromSeconds(45);
    private static readonly TimeSpan TransitionWaitTimeout = TimeSpan.FromSeconds(20);

    private const string SessionId = "alarms-live-smoke";

    private readonly ITestOutputHelper output;
    private readonly Stopwatch elapsed = Stopwatch.StartNew();
    private readonly ConcurrentQueue<string> log = new ConcurrentQueue<string>();

    public AlarmsLiveSmokeTests(ITestOutputHelper output)
    {
        this.output = output;
    }

    [Fact(Skip = "Live dev-rig smoke test — flip Skip=null with AVEVA + the alarm flip script running. Verified working 2026-05-01.")]
    public void Alarms_FullPipelineRoundTrip_RaisesAndAcknowledges()
    {
        // The wnwrap COM object needs an STA, so run the smoke body on a
        // dedicated STA thread and rethrow any failure on the test thread.
        Exception? threadException = null;
        var done = new ManualResetEventSlim(false);
        var thread = new Thread(() =>
        {
            try { RunSmoke(); }
            catch (Exception ex) { threadException = ex; }
            finally { done.Set(); }
        });
        thread.IsBackground = false;
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        done.Wait();
        thread.Join();

        output.WriteLine($"Captured {log.Count} log line(s):");
        while (log.TryDequeue(out string? line))
        {
            output.WriteLine(line);
        }

        if (threadException != null)
        {
            throw threadException;
        }
    }

    private void RunSmoke()
    {
        Log($"Subscription expression: {SubscriptionExpression}");
        Log($"Pump duration: {PumpDuration.TotalSeconds:F0}s; transition wait timeout: {TransitionWaitTimeout.TotalSeconds:F0}s");

        MxAccessEventQueue queue = new MxAccessEventQueue();
        // The consumer owns no internal timer; we drive PollOnce manually
        // from the STA below (the wnwrap COM is ThreadingModel=Apartment,
        // and this test doesn't run a Win32 message pump on its STA).
        WnWrapAlarmConsumer consumer = new WnWrapAlarmConsumer(
            new WNWRAPCONSUMERLib.wwAlarmConsumerClass(),
            maxAlarmsPerFetch: 1024);
        MxAccessAlarmEventSink sink = new MxAccessAlarmEventSink(queue, new MxAccessEventMapper());
        using AlarmDispatcher dispatcher = new AlarmDispatcher(consumer, sink, SessionId);

        Log("Constructed consumer + sink + dispatcher.");
        dispatcher.Subscribe(SubscriptionExpression);
        Log("Subscribe -> ok. Driving PollOnce manually from this STA...");

        // The wnwrap COM object is ThreadingModel=Apartment. The consumer
        // owns no internal timer, so we drive PollOnce manually here on the
        // STA. Production hosting routes polls through the worker's
        // StaRuntime.

        // 1. Wait for the first transition (any kind), then keep waiting
        //    for one with kind=Raise so the alarm is currently Active when
        //    we try to ack. AVEVA rejects acks of cleared alarms with -55,
        //    so we have to time the ack against the flip script's 10s
        //    cadence.
        OnAlarmTransitionEvent? raiseBody = null;
        DateTime raiseDeadline = DateTime.UtcNow + TimeSpan.FromSeconds(30);
        while (DateTime.UtcNow < raiseDeadline && raiseBody is null)
        {
            WorkerEvent? evt = WaitForTransition(queue, TransitionWaitTimeout, "raise", consumer);
            if (evt is null) break;
            OnAlarmTransitionEvent body = evt.Event.OnAlarmTransition;
            Log("Transition: " + DescribeTransition(body));
            Assert.Equal(SessionId, evt.Event.SessionId);
            if (body.TransitionKind == AlarmTransitionKind.Raise)
            {
                raiseBody = body;
            }
        }
        Assert.NotNull(raiseBody);
        Assert.False(string.IsNullOrEmpty(raiseBody!.AlarmFullReference));
        Assert.Contains("Galaxy", raiseBody.AlarmFullReference);

        // 2. Snapshot the active set + verify the captured alarm is there.
        var snapshot = dispatcher.SnapshotActiveAlarms();
        Log($"SnapshotActiveAlarms count={snapshot.Count}");
        foreach (var s in snapshot)
        {
            Log(" active: " + DescribeSnapshot(s));
        }
        Assert.NotEmpty(snapshot);
        Assert.Contains(snapshot, s => s.AlarmFullReference == raiseBody.AlarmFullReference);

        // 3. Ack-by-name using the captured reference. Parse the reference
        //    via the same convention the gateway dispatcher uses
        //    (Provider!Group.Tag where the tag may contain dots).
        Assert.True(TryParseReference(
                raiseBody.AlarmFullReference,
                out string provider, out string group, out string alarmName),
            $"Captured reference '{raiseBody.AlarmFullReference}' did not parse as Provider!Group.Tag.");
        Log($"Ack target: provider='{provider}' group='{group}' name='{alarmName}'");

        // Try the ack with real Windows identity. AVEVA's AlarmAckByName
        // may reject synthetic operator strings; using the current process
        // identity gives the alarm-history a recognizable principal.
        string realUser = Environment.UserName;
        string realNode = Environment.MachineName;
        string realDomain = Environment.UserDomainName ?? string.Empty;
        Log($"Ack identity: user='{realUser}' node='{realNode}' domain='{realDomain}'");

        int rc = dispatcher.AcknowledgeByName(
            alarmName: alarmName,
            providerName: provider,
            groupName: group,
            ackComment: "alarms-live-smoke ack",
            ackOperatorName: realUser,
            ackOperatorNode: realNode,
            ackOperatorDomain: realDomain,
            ackOperatorFullName: realUser);
        Log($"AcknowledgeByName(real identity) -> rc={rc}");

        Assert.Equal(0, rc);

        // 4. Wait for the post-ack transition. With the alarm flipping every
        //    10s and the consumer polling every 500ms, the next state
        //    change should be either kind=Acknowledge (the ack we just
        //    sent registered as a state delta UnackAlm → AckAlm) or the
        //    flip script's next Clear (UnackAlm → UnackRtn).
        WorkerEvent? second = WaitForTransition(queue, TransitionWaitTimeout, "post-ack", consumer);
        Assert.NotNull(second);
        OnAlarmTransitionEvent secondBody = second!.Event.OnAlarmTransition;
        Log("Post-ack transition: " + DescribeTransition(secondBody));
        Assert.NotEqual(AlarmTransitionKind.Unspecified, secondBody.TransitionKind);

        // 5. Pump a little longer to confirm the consumer keeps reporting
        //    transitions on the 10s flip cadence.
        DateTime deadline = DateTime.UtcNow + PumpDuration;
        int additional = 0;
        while (DateTime.UtcNow < deadline)
        {
            consumer.PollOnce();
            if (queue.TryDequeue(out WorkerEvent? evt) && evt is not null)
            {
                additional++;
                OnAlarmTransitionEvent body = evt.Event.OnAlarmTransition;
                Log($" +{additional}: " + DescribeTransition(body));
            }
            Thread.Sleep(500);
        }
        Log($"Pump completed; additional transitions captured: {additional}.");
    }

    private WorkerEvent? WaitForTransition(
        MxAccessEventQueue queue,
        TimeSpan timeout,
        string label,
        WnWrapAlarmConsumer consumer)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        int pollCount = 0;
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                consumer.PollOnce();
                pollCount++;
                if (pollCount == 1) Log("First PollOnce returned without throw.");
            }
            catch (Exception ex)
            {
                Log($"PollOnce threw on poll #{pollCount + 1}: {ex.GetType().Name}: {ex.Message}");
                if (ex is System.Runtime.InteropServices.COMException ce)
                {
                    Log($" HResult=0x{(uint)ce.HResult:X8}");
                }
                throw;
            }
            if (queue.TryDequeue(out WorkerEvent? evt) && evt is not null)
            {
                if (evt.Event.Family == MxEventFamily.OnAlarmTransition)
                {
                    return evt;
                }
                Log($"Skipped non-alarm event (family={evt.Event.Family}) while waiting for {label}.");
            }
            Thread.Sleep(500);
        }
        Log($"Timed out waiting for {label} transition after {timeout.TotalSeconds:F0}s (poll count={pollCount}).");
        return null;
    }

    private static bool TryParseReference(
        string reference,
        out string provider,
        out string group,
        out string alarmName)
    {
        provider = group = alarmName = string.Empty;
        if (string.IsNullOrWhiteSpace(reference)) return false;
        // Split at the first '!' and then at the first '.' after it, so any
        // later dots stay inside alarmName.
        int bang = reference.IndexOf('!');
        if (bang <= 0 || bang == reference.Length - 1) return false;
        string left = reference.Substring(0, bang);
        string right = reference.Substring(bang + 1);
        int dot = right.IndexOf('.');
        if (dot <= 0 || dot == right.Length - 1) return false;
        provider = left;
        group = right.Substring(0, dot);
        alarmName = right.Substring(dot + 1);
        return true;
    }

    private static string DescribeTransition(OnAlarmTransitionEvent body)
    {
        return string.Format(
            "kind={0} ref='{1}' source='{2}' type='{3}' severity={4} operator='{5}' comment='{6}' ts={7:o}",
            body.TransitionKind, body.AlarmFullReference, body.SourceObjectReference,
            body.AlarmTypeName, body.Severity, body.OperatorUser, body.OperatorComment,
            body.TransitionTimestamp?.ToDateTime() ?? DateTime.MinValue);
    }

    private static string DescribeSnapshot(ActiveAlarmSnapshot s)
    {
        return string.Format(
            "ref='{0}' state={1} severity={2} operator='{3}' comment='{4}' ts={5:o}",
            s.AlarmFullReference, s.CurrentState, s.Severity, s.OperatorUser,
            s.OperatorComment,
            s.LastTransitionTimestamp?.ToDateTime() ?? DateTime.MinValue);
    }

    private void Log(string line)
    {
        log.Enqueue($"[t={elapsed.Elapsed.TotalSeconds:F3}s] {line}");
    }
}