Commit Graph

125 Commits

Author SHA1 Message Date
Joseph Doherty 40ca4b6908 Add gateway central alarm monitor and StreamAlarms feed
The gateway now monitors alarms continuously, independent of any client
session, and fans the feed out to every client.

GatewayAlarmMonitor is an always-on hosted service that owns one
gateway-managed worker session dedicated to alarms: it subscribes the
configured provider, caches the active-alarm set from the worker's
transition events (reconciled periodically against the worker's
authoritative snapshot), re-opens the session if the worker faults,
and broadcasts to all subscribers.

The new session-less StreamAlarms RPC opens with the current
active-alarm snapshot, then streams live transitions; any number of
clients fan out from the single monitor without opening a worker
session. AcknowledgeAlarm is now session-less and routes through the
monitor. The session-scoped QueryActiveAlarms RPC and the per-session
alarm auto-subscribe hook are removed, along with the now-dead
IAlarmRpcDispatcher trio; the dashboard Alarms tab reads the monitor's
in-process cache directly.

This intentionally reverses the v1 "no multi-subscriber fan-out"
decision for the alarm subsystem.

Contracts regenerated; gateway, dashboard and tests build clean,
94 alarm-affected tests pass, and the monitor is verified live.
Language-client stubs are regenerated in a follow-up change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:23:56 -04:00
Joseph Doherty bf73985481 Fix hanging and timing-fragile WorkerClient event-channel tests
The Server-032 change made event-channel overflow wait
EventChannelFullModeTimeout before faulting, instead of faulting
instantly. Two pre-existing overflow tests were not updated and left
EventChannelFullModeTimeout at its 5s default, which races the 5s
TestTimeout: ReadLoop_WhenEventQueueOverflows_FaultsClient and
ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess. Pin it to 50ms in
both so overflow faults promptly.

EnqueueWorkerEvent_WhenChannelFullPastTimeout_FaultsWithRichDiagnostic
wrote 6 events into a 4-slot channel, but the worker client faults
while reading the 5th and its read loop then stops — the 6th event is
never drained and the test's pipe write for it blocks forever on a
full OS pipe buffer, hanging the test host. Write exactly 5 (4 to fill
plus 1 to overflow) as the test comment already intends, and bound the
post-fault event drain with TestTimeout so a future regression fails
instead of hanging.

No production change: the Server-031/032 WorkerClient logic is correct
— these were test-only defects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 15:19:59 -04:00
Joseph Doherty 0a54fa5e35 Render array values and element type in the Browse panel
The subscription panel showed "(array)" for any array-valued tag and
"Unspecified" for its type. A scalar MxValue carries its type in
MxValue.DataType, but an array leaves that Unspecified and carries the
element type and dimensions on the MxArray itself.

DashboardMxValueFormatter.FormatValue now joins the typed MxArray
elements (e.g. "[1.5, 2.25, 3]"), capped at 24 elements with a
"... N total" suffix, and FormatDataType reads the element type and
dimensions off the array (e.g. "Double[10]").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:34:21 -04:00
Joseph Doherty cec84bf572 Harden worker-client heartbeat watchdog and event backpressure
Server-031: HeartbeatLoopAsync now skips the HeartbeatExpired fault
while a command is in flight on the gateway-worker pipe, up to
WorkerClientOptions.HeartbeatStuckCeiling (75s default) — a heartbeat
gap caused by a slow STA command or an event-drain write burst no
longer faults a healthy worker. Mirrors the worker-side Worker-023
guard. A command older than the ceiling still faults so a genuinely
stuck COM call cannot hide the worker indefinitely.

Server-032: EnqueueWorkerEventAsync now honors the configured
EventChannelFullModeTimeout by awaiting WriteAsync against the
wait-mode channel, instead of faulting on the first missed slot with
the non-blocking TryWrite. A transient consumer hiccup is absorbed up
to the timeout; the overflow diagnostic names the channel depth,
capacity, and the actionable fix.

Adds the Server-031 and Server-032 findings entries and WorkerClient
regression tests covering both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:20:09 -04:00
Joseph Doherty 099d4783b0 Fix worker dropping the OnDataChange source timestamp
MXAccess fires OnDataChange with pftItemTimeStamp marshaled as a
VT_BSTR string (e.g. "3/26/2026 1:38:22.907 PM"), not a FILETIME or
VT_DATE. VariantConverter classified it as a plain string, so
ApplySourceTimestamp never set MxEvent.SourceTimestamp — every
OnDataChange event and every cached ReadBulk result carried no
source timestamp for all gRPC clients.

Parse the string and set SourceTimestamp. MXAccess formats it in the
worker host's local time (verified empirically: a fast-changing tag's
timestamp landed exactly the host UTC offset behind wall-clock UTC),
so it is parsed as local and converted to UTC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:53:38 -04:00
Joseph Doherty c1fe7fbc4a Add Browse and Alarms dashboard tabs
Browse renders the Galaxy hierarchy tree from IGalaxyHierarchyCache:
expandable areas/objects with attribute name, data type and the
alarm/historized flags, plus a name/reference filter. Right-click or
double-click an attribute to add it to a subscription panel that polls
live value, quality and source timestamp every two seconds.

Alarms lists the worker's currently-active alarm set via
IAlarmRpcDispatcher, defaulting to unacknowledged Active alarms with
filters for acknowledged alarms, area, severity range and text. It is
read-only and warns when alarm auto-subscribe is disabled.

Both tabs read live MXAccess data through a new singleton
DashboardLiveDataService that owns one shared, lazily-opened gateway
session (one worker) for the whole dashboard, re-opened transparently
if it faults or its lease expires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:53:28 -04:00
Joseph Doherty b39848b5f5 Enable alarm auto-subscribe on session open
Set MxGateway:Alarms with Enabled=true and the dev-rig subscription
expression \DESKTOP-6JL3KKO\Galaxy!DEV. Worker sessions now subscribe
their wnwrap alarm consumer on open; without this the QueryActiveAlarms
RPC returns an empty stream because no session is ever alarm-subscribed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:53:15 -04:00
Joseph Doherty b794c46bc7 File and fix Server-030 and Client.Dotnet-017 from e2e surfacing
Both findings surfaced when running the cross-language e2e matrix
(scripts/run-client-e2e-tests.ps1) against the redeployed gateway at
commit 84d36b7. Filed in code-reviews/Server/findings.md and
code-reviews/Client.Dotnet/findings.md and fixed in the same change.

Server-030 (Medium / Error handling): GatewaySession.GetReadyWorkerClient
gated on `_state == Ready && _workerClient.State == Ready` but only
formatted `_state` into the SessionManagerException message. Under load
the gateway-driven `_state` and the worker-driven `WorkerClient.State`
can diverge, producing a self-contradictory diagnostic ("Session ... is
not ready. Current state is Ready."). The Java e2e client hit this on
the 56th item after 55 successful add-items. Rewrote the message to
include both states ("Session state is X; worker state is Y"), added
an XML doc explaining the two-state contract and that this branch is
the fail-fast for a divergence race, and added regression test
SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates
that pins both states appear in the message. The deeper race (should
the gateway briefly wait for worker-Ready before failing?) remains
open as a follow-up.

Client.Dotnet-017 (Low / Error handling): stream-events CLI threw
OperationCanceledException as an unhandled exception when the user's
--timeout expired before --max-events was reached. Exit code
-532462766, no aggregate JSON. The other client CLIs (Go, Rust, Python,
Java) exit 0 in this case. Wrapped the `await foreach` in
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`
so the supplied token's cancellation (--timeout, Ctrl+C, or parent
CTS) becomes graceful completion; the aggregate `{ "events": [...] }`
JSON still runs after the catch. Added regression test
RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero
backed by a new FakeCliClient.StreamHangAfterEvents hook that yields
the configured events then parks on the cancellation token.

Side cleanup: the GatewayApplicationTests test added under Server-020
was asserting an invariant (`/dashboard/dashboard/X` doesn't exist)
that I broke by reverting Server-020 in 84d36b7. The doubled endpoint
shapes do exist now (MapGroup("/dashboard") prefixing an already
"/dashboard/X" @page directive) but they're harmless — no client
requests `/dashboard/dashboard/X`. Replaced the test with a positive
assertion (`/dashboard/X` routes ARE registered) and rewrote the XML
doc to record the actual contract.

Verified: dotnet test src/MxGateway.Tests passes 480/480, dotnet test
clients/dotnet/MxGateway.Client.Tests passes 77/77, gateway redeployed
at this commit and GET http://localhost:5130/dashboard returns 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:07:39 -04:00
Joseph Doherty 84d36b7638 Restore dashboard @page "/dashboard/X" directives — Server-020 reversal
Server-020 incorrectly removed the duplicate `@page "/dashboard/X"` directive
from each dashboard Razor page on the assumption that `MapGroup("/dashboard")`
would prepend the prefix to Blazor SSR route matching. It does not — Blazor's
`@page` template matcher operates on the full URL path, not relative to a
MapGroup. The removal left the dashboard returning HTTP 500 with
"Unable to find the provided template '/dashboard/'" from
RouteTableFactory.CreateEntry on every page.

Restored the eight `@page "/dashboard/X"` directives. The accompanying
regression test still passes (it asserts the genuinely-double-prefixed shape
`/dashboard/dashboard/X` never appears — it never did, since the original
duplicates were `"/"` + `"/dashboard/"`, not `"/dashboard/"` repeated). XML
doc on the test rewritten to record what was actually wrong with Server-020.

Verified: gateway redeploy + `GET http://localhost:5130/dashboard` returns
HTTP 200 7.3 KB; `/dashboard/sessions` also 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 12:07:18 -04:00
Joseph Doherty 1aafd6bde4 Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings
Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
  state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:28:54 -04:00
Joseph Doherty a0203503a7 Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules
Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document
  the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
  CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
  in generated tonic code by reformatting the ReadBulkCommand proto
  comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
  make DisposeAsync race-safe against in-flight CloseAsync (-016);
  add constraint-enforcement test coverage for the bulk-plan path
  (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
  can distinguish graceful shutdown from a real STA-affinity
  violation (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow
  ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
  11 GatewaySession bulk methods (-013); replace the real TCP probe
  in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
  (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
  test and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
  abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
  beforeStart() (-014); return a CancellingCompletableFuture that
  actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
  the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for
  the five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
  command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the
  duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
  XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
  three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real TaskCompletion
  source; switch the heartbeat-expires test to ManualTimeProvider;
  add InvariantCulture to the remaining DateTimeOffset.Parse sites;
  document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default
  connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
  so absent env vars SKIP not pass; PascalCase rename of probe
  [Fact]s; deterministic deadline test; new frame-protocol error
  tests; ComputeTransitions diff-coverage; relocate dev-rig probes
  to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport
  faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in
  writeUsage; drain galaxy-watch events on limit; switch io.EOF
  comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
  option; regex-based credential redaction; Long.toUnsignedString
  for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
  stop hard-coding correlation IDs; resync RustClientDesign.md
  with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:46:47 -04:00
Joseph Doherty 5e375f6d3d Add bulk read/write command family across worker, gateway, and clients
Adds five new MXAccess command kinds (WriteBulk, Write2Bulk,
WriteSecuredBulk, WriteSecured2Bulk, ReadBulk) that ride the existing
"one round-trip, per-entry results" bulk shape used by AddItemBulk and
SubscribeBulk today. MXAccess COM has no native bulk API; the worker
runs each bulk operation as a sequential loop on its STA, returning
one BulkWriteResult / BulkReadResult per requested entry so per-item
MXAccess failures surface as was_successful=false rather than throwing.

ReadBulk has no MXAccess analogue. The worker satisfies it by:

  - Returning the last cached OnDataChange payload (was_cached=true)
    when the requested tag is already in the session''s item registry
    AND advised — the existing subscription is NOT touched, since the
    caller did not create it.
  - Otherwise taking the AddItem + Advise + wait-for-OnDataChange +
    UnAdvise + RemoveItem snapshot lifecycle itself (was_cached=false)
    and leaving the session exactly as it was. The wait pumps Windows
    messages on the STA so the inbound MXAccess event can dispatch
    while the executor still holds the thread.

The new MxAccessValueCache lives on each MxAccessSession, shared with
MxAccessBaseEventSink which populates it on every OnDataChange after
the event clears the outbound queue. Eviction on RemoveItem keeps
reused MXAccess handles from serving stale values from a previous
lifetime.

Gateway-side authorization wires WriteBulk/Write2Bulk to invoke:write,
WriteSecuredBulk/WriteSecured2Bulk to invoke:secure, ReadBulk to
invoke:read. The constraint-filter pipeline is refactored from a single
BulkConstraintPlan record into an abstract base plus three concretes
(SubscribeBulk, WriteBulk, ReadBulk), each owning its own denied-entry
merge so the dispatch site never branches on reply shape. A new
FilterWriteBulkAsync<TEntry> generic over the four write-entry shapes
runs CheckWriteHandleAsync per entry; denied entries surface as the
BulkWriteResult shape, preserving original-index order.

All five language clients (.NET, Go, Rust, Python, Java) gained the
five new methods following their existing bulk pattern, with regenerated
protobufs.

Tests added:
  - MxAccessValueCacheTests (6 cases) — Set/TryGet, Remove resets the
    version, TryWaitForUpdate signals on Set, pump step fires each poll.
  - MxAccessBaseEventSinkTests — OnDataChange populates the cache,
    ValueCache property exposes the bound instance.
  - MxAccessCommandExecutorTests — four bulk-write variants (per-entry
    success/failure, value+timestamp forwarding, secured user ids),
    ReadBulk snapshot lifecycle on uncached tag (timeout surfaces as
    was_successful=false), invalid-payload reply.
  - GatewayGrpcScopeResolverTests — five new MxCommandKind cases.
  - SessionManagerTests — WriteBulk and ReadBulk forwarding through
    FakeWorkerHarness; ReadBulk forwards timeout_ms.
  - Per-client (.NET, Go, Rust, Python, Java) — WriteBulk builds the
    right command and returns per-entry results, ReadBulk forwards the
    timeout and unpacks the was_cached flag.

Cross-language e2e CLI subcommands for the new bulks are deliberately
scoped out of this change (each of the five client CLIs would need
five new subcommands plus matching phases in
scripts/run-client-e2e-tests.ps1); coverage equivalent to the existing
bulk-subscribe coverage is provided by worker + gateway + per-client
unit tests.

Docs updated in the same commit: gateway.md (Public MXAccess Command
Surface), docs/DesignDecisions.md (new "Bulk Command Family" section
with the ReadBulk cache-then-snapshot rationale), and every client
README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 03:42:38 -04:00
Joseph Doherty 06030dd1ef Implement MXAccess write commands in the worker
The .proto contract and MxCommandKind already defined Write, Write2,
WriteSecured, and WriteSecured2, but the worker's MxAccessCommandExecutor
had no case for any of them — every write kind fell through to
CreateInvalidRequestReply ("Unsupported MXAccess command kind Write").

Implement all four:

- VariantConverter.ConvertToComValue projects an MxValue into a
  COM-marshalable object (scalars, arrays, null) — the inverse of the
  existing COM-to-MxValue projection.
- IMxAccessServer / MxAccessComServer gain Write/Write2/WriteSecured/
  WriteSecured2, routed to ILMXProxyServer / ILMXProxyServer4.
- MxAccessSession and MxAccessCommandExecutor add the four write paths,
  following the existing ExecuteAdvise pattern; the reply is a plain OK
  reply and the outcome surfaces later as an OnWriteComplete event.

Verified live: a Write now returns PROTOCOL_STATUS_CODE_OK and produces
an OnWriteComplete event where it previously returned InvalidRequest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:45:35 -04:00
Joseph Doherty 964b40dcbc Fix stale WorkerProjectReferenceTests MXAccess-interop assertion
MxAccessInteropReference_ExistsOnlyInWorkerProject asserted the MXAccess COM
interop was referenced only by MxGateway.Worker. The worker test project now
legitimately references ArchestrA.MxAccess and Interop.WNWRAPCONSUMERLib so it
can exercise the COM-facing worker code (WnWrapAlarmConsumer, the alarm
tests). Renamed to ..._ExistsOnlyInWorkerAndWorkerTestProjects, updated the
assertion to expect both projects, and made it order-independent. The
architecture invariant the test protects — the gateway/contracts never
reference MXAccess COM — still holds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:19:32 -04:00
Joseph Doherty bb5603b7ec Fix flaky GalaxyHierarchyRefreshServiceTests timing race
ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFault
BackgroundService cancelled the service immediately after StartAsync, so
under parallel load the first RefreshAsync could be skipped (RefreshCallCount
0) and `await executeTask` rethrew TaskCanceledException before the IsFaulted
assertion. The test now waits for a TaskCompletionSource signal that the
first refresh was attempted before cancelling, and uses Task.WhenAny so a
Canceled ExecuteTask does not rethrow. Confirmed stable across full-suite
runs (408/408).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:14:47 -04:00
Joseph Doherty ee959e46e6 Resolve Contracts-001/004/005/006/007/008 code-review findings
Contracts-001: docs/Grpc.md still described "four MxAccessGateway RPCs" —
updated to the actual six (adding AcknowledgeAlarm and QueryActiveAlarms to
the handler and validation-rule sections).

Contracts-003 (Won't Fix): the finding is factually wrong — the <Protobuf>
item for mxaccess_worker.proto already sets ProtoRoot="Protos"; all three
items are consistent (confirmed back to the reviewed commit).

Contracts-004: corrected the stale GatewayContractInfo XML summary
("before generated protobuf contracts are introduced").

Contracts-005: no proto field/enum value was ever removed, so no reserved
ranges were invented. Added a wire-compatibility policy comment to all three
.proto files instructing future editors to reserve removed numbers.

Contracts-006: documented MxStatusProxy.success — it mirrors the COM
MXSTATUS_PROXY numeric success member, is not a boolean, and clients should
branch on category.

Contracts-007: added 13 round-trip tests covering galaxy_repository.proto
messages, bulk-subscribe payloads, and raw-value/IPC worker bodies.

Contracts-008: WorkerAlarmRpcDispatcher never assigns AcknowledgeAlarmReply.
status, so the old "native status" proto comment was misleading. Corrected
the hresult/status proto comments and documented the worker native_status →
public reply mapping in AlarmClientDiscovery.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:12:00 -04:00
Joseph Doherty b4f5e8eb48 Resolve IntegrationTests-007..010 code-review findings
IntegrationTests-007: the three live test classes contend for shared
singletons (one MXAccess COM, one ZB SQL DB, one GLAuth). Added
LiveResourcesCollection with DisableParallelization and applied it to all
three so they no longer run concurrently.

IntegrationTests-008: the three live fact attributes each re-implemented the
env-var check. Added IntegrationTestEnvironment.IsEnabled and all three now
delegate to it.

IntegrationTests-009: reworded the misleading "Mock server call context" XML
doc — it is a hand-written stub with no verification behavior.

IntegrationTests-010: WaitForMessageAsync ignored cancellation. It now takes
an optional CancellationToken linked with the timeout; the smoke test shares
one cancellation source with the StreamEvents call context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:18 -04:00
Joseph Doherty 371bcb3f91 Resolve Worker.Tests-008..015 code-review findings
Worker.Tests-008: moved the misplaced WorkerLogRedactor test out of
VariantConverterTests into Bootstrap/WorkerLogRedactorTests.

Worker.Tests-009: renamed 46 snake_case alarm-test methods to PascalCase
Method_Scenario_Expectation.

Worker.Tests-010: replaced a weak Assert.Contains with an exact assertion
against the real diagnostic message and corrected the XML doc.

Worker.Tests-011: renamed and re-documented a cancellation test that
overstated what it proved.

Worker.Tests-012: added an oversized-frame (MessageTooLarge) test; renamed
the mislabeled zero-length-payload test.

Worker.Tests-013: removed the fixed-100ms ThrowIfCompletedAsync helper; the
caller now races runTask deterministically.

Worker.Tests-014: consolidated duplicated test fakes/helpers
(FakeRuntimeSession, NoopComApartmentInitializer, NoopEventSink, frame
helpers) into a shared TestSupport namespace.

Worker.Tests-015: added MxAccessEventQueue coverage for drain-all (maxEvents
0), empty-queue drain, and enqueue-after-fault.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:07 -04:00
Joseph Doherty 9582de077b Resolve Tests-007..012 code-review findings
Tests-007: TestServerCallContext and stream-writer/constraint helpers were
copy-pasted across five test files. Consolidated into a shared
MxGateway.Tests.TestSupport namespace; duplicates deleted.

Tests-008: renamed snake_case alarm-test methods to PascalCase
Method_Condition_Result and dropped redundant usings. Re-triaged two
inaccurate sub-claims (the "wnwrap" name and a required CompilerServices
using).

Tests-009: corrected three copy-paste-mismatched XML <summary> comments in
SessionManagerTests.

Tests-010: added the missing anonymous-localhost security negatives —
bypass disallowed, and loopback-allowed from a remote address.

Tests-011: SessionWorkerClientFactoryFakeWorkerTests discarded worker tasks.
The test class now tracks each launcher and observes its task in DisposeAsync.

Tests-012: added xunit.runner.json pinning collection parallelism and
documented the ephemeral-port convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:01 -04:00
Joseph Doherty 1764eff1cf Resolve Worker-009..015 code-review findings
Worker-009: WorkerFrameWriter serialized twice and WorkerFrameReader
allocated a payload byte[] per frame. The writer now serializes once into a
single prefix+payload buffer; the reader rents the payload buffer from
ArrayPool and honors the logical frame length.

Worker-010: VariantConverter projected a uint+Time value as a full FILETIME,
producing a near-1601 timestamp. The FILETIME projection is now gated on
`value is long`; uint falls through to the integer projection.

Worker-011: replaced the opaque retryAttempts formula in WorkerPipeClient
with MaxRetryAttempts = int.MaxValue, leaving the connect deadline as the
sole bound.

Worker-012: rewrote stale "future PR / polls on a Timer" comments in
AlarmDispatcher, AlarmCommandHandler, MxAccessAlarmEventSink and
MxAccessEventMapper to match the shipped, post-Worker-001 behavior.

Worker-013 (re-triaged): already resolved — StaMessagePumpTests and
MxAccessStaSessionTests cover the pump and poll loop directly.

Worker-014: moved IAlarmCommandHandler into its own file so
AlarmCommandHandler.cs declares one public type.

Worker-015: clarified the MxAccessBaseEventSink.EnqueueEvent overflow-catch
comment explaining the deliberate double RecordFault no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:17 -04:00
Joseph Doherty fe9044115b Resolve Server-007..014 code-review findings
Server-007: GalaxyHierarchyProjector re-filtered the whole hierarchy per
page (O(total) paging). It now memoizes the filtered list per cache-entry +
filter signature so subsequent pages are an O(pageSize) slice.

Server-008: WatchDeployEvents re-resolved browse subtrees and rebuilt globs
per streamed event. ResolveBrowseSubtrees is hoisted out of the loop and
GalaxyGlobMatcher caches compiled Regex instances per pattern.

Server-009: auth-store connections used no busy timeout or WAL. A new
OpenConnectionAsync applies journal_mode=WAL and a busy_timeout; all auth
call sites use it. docs/Authentication.md updated.

Server-010: the dashboard rendered Rotate/Revoke for revoked keys, where
Rotate silently reactivates them. ApiKeysPage now shows actions only for
Active keys. docs/Authentication.md updated.

Server-011: WorkerAlarmRpcDispatcher converted to a primary constructor and
brought in line with module conventions.

Server-012: CLAUDE.md corrected to the canonical *:* scope strings.

Server-013 (partly re-triaged): three named coverage gaps were already
closed; the genuine gap (WorkerExecutableValidator) is now covered.

Server-014: rewrote stale "alarm path not yet wired" comments in
MxAccessGatewayService to describe the production WorkerAlarmRpcDispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:06 -04:00
Joseph Doherty 1f546c46ee Resolve Contracts-002 code-review finding
MxCommandReply.payload has no by-name ack case: MX_COMMAND_KIND_ACKNOWLEDGE_
ALARM_BY_NAME reuses the acknowledge_alarm reply payload. Verified the worker
(MxAccessCommandExecutor.ExecuteAcknowledgeAlarmByName) and gateway
(WorkerAlarmRpcDispatcher) already implement this correctly — the gap was
purely undocumented contract asymmetry. Documented the reuse on the proto
oneof case and the AcknowledgeAlarmReplyPayload message comment (regenerating
the .NET contract), and in docs/AlarmClientDiscovery.md. Added
ProtobufContractRoundTripTests.MxCommandReply_AcknowledgeAlarmByName_Reuses
AcknowledgeAlarmPayloadCase to pin the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:50:57 -04:00
Joseph Doherty f13f35bc79 Resolve IntegrationTests-003..006 code-review findings
IntegrationTests-003: the live MXAccess smoke test asserted on the first
streamed event, which a registration/quality bootstrap event could occupy.
The recording writer now waits for the first event matching a predicate
(Family == OnDataChange).

IntegrationTests-004: the cleanup `finally` could throw and mask an original
assertion failure. Shutdown now routes through a helper that logs cleanup
exceptions instead of propagating them.

IntegrationTests-005: added live MXAccess parity tests — a Write round-trip
to an advised item, and an invalid-handle command surfacing the MXAccess
failure without a transport fault.

IntegrationTests-006: added live LDAP failure-path tests — wrong password
(no password leak), unknown username, and server-unreachable.

docs/GatewayTesting.md updated to describe the new cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:45:11 -04:00
Joseph Doherty 18ce2922e2 Resolve Worker.Tests-003..007 code-review findings
Worker.Tests-003: removed the wall-clock `Elapsed < 2s` assertion from
InvokeAsync_WakesIdlePumpForQueuedCommand; the awaited completion against a
30s idle period already proves the wake event drove dispatch.

Worker.Tests-004: MxAccessStaSession.Dispose now joins the alarm poll task
after cancelling the CTS (consistent with ShutdownGracefullyAsync), and
Dispose_StopsAlarmPollLoop asserts deterministically instead of via Task.Delay.

Worker.Tests-005: undisposed MemoryStream instances across the frame-protocol
and pipe-session tests are now `using` declarations.

Worker.Tests-006: Dispose_StopsAlarmPollLoop now constructs MxAccessStaSession
with `using` so a failed assertion cannot leak the STA poll loop.

Worker.Tests-007: docs/WorkerFrameProtocol.md verification section corrected
to target MxGateway.Worker.Tests / MxGateway.Worker with -p:Platform=x86.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:45:01 -04:00
Joseph Doherty 5ade3f4f48 Resolve Tests-003, -004, -005, -006 code-review findings
Tests-003: temp auth-DB directories leaked under %TEMP%. Added the
TempDatabaseDirectory IDisposable helper (clears the Sqlite connection pool,
then recursively deletes); SqliteAuthStoreTests and ApiKeyAdminCliRunnerTests
now dispose every directory they create.

Tests-004: added end-to-end coverage composing the real authorization
interceptor in front of the real MxAccessGatewayService, plus scope-resolver
tests confirming an unmapped request type fails closed to the admin scope.

Tests-005: added coverage for a worker faulting mid-command — a pipe
disconnect and a worker fault while an InvokeAsync is in flight both fail the
pending invoke. No product change needed.

Tests-006 (re-triaged): the flaky ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess
is a test race, not a product bug — the kill runs synchronously inside
SetFaulted. Rewrote it to await FakeWorkerProcess exit deterministically, and
replaced fixed Task.Delay timing in the late-reply and heartbeat tests with
FIFO ordering and an injected ManualTimeProvider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:44:55 -04:00
Joseph Doherty 54325343bd Resolve Worker-004, -005, -006, -007, -008 code-review findings
Worker-004: post-watchdog-fault heartbeats reported a non-faulted state.
ReportWatchdogFaultIfNeededAsync now sets _state = Faulted before writing
the StaHung fault.

Worker-005 (re-triaged): the cited OnPoll site was removed by Worker-001;
the real silent-failure bug was in MxAccessStaSession.RunAlarmPollLoopAsync,
which caught only graceful-stop exceptions. A failing PollOnce now records a
WorkerFault on the event queue instead of vanishing on a non-awaited task.

Worker-006: RunAsync's finally skipped runtime disposal when shutdown timed
out, leaking the STA thread and COM object. It now always disposes
(MxAccessStaSession.Dispose is idempotent and bounded).

Worker-007 (re-triaged): replaced MxAccessComServer's Type.InvokeMember
reflection fallback with an IMxAccessServer fast path plus typed
ILMXProxyServer* casts; a non-conforming object now fails fast.

Worker-008: alarm consumer STA affinity was unenforced. MxAccessStaSession
records the alarm consumer's STA thread id and asserts every PollOnce runs
on it via a unit-testable guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:23 -04:00
Joseph Doherty 1d9e3afadd Resolve Server-002, -004, -005, -006 code-review findings
Server-002: the gateway never terminated leftover MxGateway.Worker.exe
processes at startup, contradicting gateway.md and CLAUDE.md. Added
IRunningProcessInspector/SystemRunningProcessInspector, OrphanWorkerTerminator,
and OrphanWorkerCleanupHostedService (best-effort, runs before sessions are
accepted); updated gateway.md to describe the implemented behavior.

Server-004: API-key scopes were persisted verbatim with no validation. Added
GatewayScopes.All/IsKnown; the CLI parser and dashboard create path now
reject unknown scope strings.

Server-005: a non-SqlException/InvalidOperationException fault on the initial
Galaxy hierarchy load faulted the BackgroundService. ExecuteAsync now catches
all non-cancellation exceptions on first load and RefreshCoreAsync broadens
its catch so the cache records Stale/Unavailable instead.

Server-006: OpenSessionAsync incremented the open-sessions gauge before
alarm auto-subscribe; an auto-subscribe failure leaked the gauge. The catch
path now calls SessionRemoved() when the gauge was incremented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:10 -04:00
Joseph Doherty 1b4dcf32d5 Resolve Worker.Tests-001 and Worker.Tests-002 code-review findings
Worker.Tests-001: StaMessagePump had no direct unit test. Added
Sta/StaMessagePumpTests.cs — 8 STA-thread facts covering WaitForWorkOrMessages
(wake-event signalled before/during the wait, timeout expiry, zero-timeout
fast path, the QS_ALLINPUT posted-message wake path) and PumpPendingMessages
drain counting.

Worker.Tests-002: no test drove a COM event through the integrated
sink -> mapper -> queue path. Added MxAccess/MxAccessBaseEventSinkTests.cs —
5 facts driving OnDataChange, OnWriteComplete, OperationComplete and
OnBufferedDataChange through a real MxAccessBaseEventSink + mapper + queue and
asserting the converted WorkerEvent lands in MxAccessEventQueue. The four COM
event handlers were widened private -> internal and InternalsVisibleTo for
MxGateway.Worker.Tests was added, mirroring MxAccessAlarmEventSink's existing
test seam; no worker behavior changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:07:48 -04:00
Joseph Doherty 53e3973209 Resolve Worker-001, Worker-002, Worker-003 code-review findings
Worker-001: WnWrapAlarmConsumer armed a System.Threading.Timer whose OnPoll
callback ran GetXmlCurrentAlarms2 on a thread-pool thread against the
Apartment-threaded wnwrap COM object, which can deadlock on cross-apartment
marshaling. Removed the pollTimer/pollIntervalMs fields, OnPoll, the
poll-interval constructor parameter, and the timer arm/disposal. Polls are
driven externally by the STA via StaRuntime.InvokeAsync(PollOnce).

Worker-002: RunHeartbeatLoopAsync delayed a full HeartbeatInterval before
the first heartbeat. Restructured so the first beat is sent immediately on
entering the loop and the delay applies only between subsequent beats.

Worker-003: ProcessCommandAsync silently returned without a reply when
_state was not a command-serving state after dispatch. Both drop sites now
log a WorkerCommandResultDropped diagnostic with correlation_id via
IWorkerLogger; _state is now volatile.

Three pre-existing tests that asserted strict frame ordering were updated to
tolerate an interleaved first heartbeat (Worker-002 consequence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:59:46 -04:00
Joseph Doherty b381bfcaf1 Resolve Tests-001 and Tests-002 code-review findings
Tests-001: FakeSessionManager.TryGetSession unconditionally synthesized a
session, so Invoke_WhenSessionMissing_ThrowsNotFound did not actually
verify the missing-session path. Added ResolveOnlySeededSessions/SeedSession
to the fake, rewrote the missing-session test, and added seeded-resolution
and alarm-RPC missing-session coverage.

Tests-002: re-triaged. GalaxyRepository issues only constant SQL; filters
are applied in-memory by GalaxyHierarchyProjector/GalaxyGlobMatcher. Kept
as a valid coverage gap and added GalaxyFilterInputSafetyTests exercising
filter/glob input safety directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:46:02 -04:00
Joseph Doherty a8aafdf974 Enforce dashboard authorization on all component routes
Fixes code-review findings Server-001 (Critical) and Server-003 (High).

Server-001: the dashboard Razor components were mapped with no
authorization policy, so every dashboard page — including the API Keys
page — was reachable unauthenticated. MapRazorComponents<App>() now
requires DashboardAuthenticationDefaults.AuthorizationPolicy;
unauthenticated requests are challenged by the cookie scheme and
redirected to the login page.

Server-003: DashboardAuthenticator.CreatePrincipal never issued the
'scope' claim that DashboardAuthorizationHandler checks when
Dashboard:RequireAdminScope is enabled, so enforcing the policy would
have denied every LDAP login. CreatePrincipal (reached only after the
required-group check passes) now emits the admin scope claim.

Replaces the GatewayApplicationTests case that asserted dashboard
routes allow anonymous access — it encoded the bug as expected
behavior — with tests that verify component routes require the policy
and the login/logout/denied endpoints allow anonymous.

All 309 MxGateway.Tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:45:29 -04:00
Joseph Doherty a67a5a4857 fix(worker): wire alarm command handler and STA poll loop (Gap 1 + Gap 2)
Gap 1 — WorkerPipeSession now passes `eq => new AlarmCommandHandler(eq)` as
the alarmCommandHandlerFactory in all three places it constructs
MxAccessStaSession (two convenience constructors and InitializeMxAccessAsync).
Previously the parameterless MxAccessStaSession() set the factory to null,
so every SubscribeAlarms / AcknowledgeAlarm / QueryActiveAlarms command
returned "alarm consumer not configured" in a deployed worker.

  - Added internal `MxAccessStaSession(Func<MxAccessEventQueue, IAlarmCommandHandler>?)`
    constructor that builds all defaults but accepts a factory.
  - Added public `MxAccessStaSession(StaRuntime, factory, eventQueue, alarmFactory?)`
    4-arg overload to complete the constructor chain.

Gap 2 — WnWrapAlarmConsumer now disables its internal threadpool Timer
(pollIntervalMilliseconds=0 in the default constructor). MxAccessStaSession
starts a `RunAlarmPollLoopAsync` background task that sleeps off-STA then
calls `staRuntime.InvokeAsync(() => handler.PollOnce())` at 500ms intervals.
This satisfies the ThreadingModel=Apartment requirement of wwAlarmConsumerClass:
every GetXmlCurrentAlarms2 call now runs on the worker's STA.

  - Added `PollOnce()` to `IMxAccessAlarmConsumer`, `AlarmDispatcher`,
    `IAlarmCommandHandler`, and `AlarmCommandHandler`.
  - Poll loop cancelled and awaited before alarm handler disposal in both
    ShutdownGracefullyAsync and Dispose.

Tests: 4 new tests in MxAccessStaSessionTests verify that
  - SubscribeAlarms reaches the handler when the factory is wired (Gap 1)
  - SubscribeAlarms returns InvalidRequest without a factory (regression guard)
  - PollOnce is called on the STA thread within 3s (Gap 2)
  - The poll loop stops after Dispose (Gap 2 lifecycle)
All fake IMxAccessAlarmConsumer / IAlarmCommandHandler test implementations
updated with no-op PollOnce() to satisfy the new interface member.

Worker tests: 199 passed / 1 pre-existing failure / 4 skipped (was 195/1/4).
Server tests: 308 passed / 0 failures (unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 06:30:14 -04:00
Joseph Doherty e00ee61cf0 Place Last Refresh next to Last Deploy on the Galaxy page
Group the two double-width timestamp cards at the start of the metric
row so the deploy/refresh pair reads together, ahead of the count cards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:20:30 -04:00
Joseph Doherty 271bf7edff Give Galaxy timestamp cards double-width boxes
The Last Deploy and Last Refresh metric cards hold full timestamps that
wrapped to three or four lines in a single-width card. Add a Wide option
to MetricCard (grid-column: span 2) and set it on both Galaxy timestamp
cards. Also switch .metric-value to overflow-wrap: break-word so a date
token is never split mid-value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:16:56 -04:00
Joseph Doherty f598b3a647 Stop dashboard table cells breaking whole words
overflow-wrap: anywhere let the table layout shrink a column below its
longest word, splitting tokens like "unconstrained" mid-word. Switch to
break-word so a word only breaks when it genuinely cannot fit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:56:55 -04:00
Joseph Doherty 509b0118d4 Keep dashboard button labels on a single line
Bootstrap 5 .btn no longer pins white-space, so the Rotate/Revoke action
buttons broke mid-word in the narrow API Keys actions column. Pin .btn
to white-space: nowrap so a label always reads as one word.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:54:53 -04:00
Joseph Doherty 298836d2f3 Keep dashboard status chips on a single line
Status chips wrapped to two lines in narrow table columns (e.g. the
session State column). Pin .chip to white-space: nowrap so an enumerated
state always reads as one token.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:19:36 -04:00
Joseph Doherty 96bea1d478 Apply technical-light design system to the gateway dashboard
Restyles the Blazor dashboard onto a portable token-based theme so it
reads like an instrument panel: warm-paper background, hairline-ruled
panels, IBM Plex type, monospace tabular numerics, and status carried by
colour chips. Vendors theme.css + IBM Plex fonts, rewrites dashboard.css
as a thin token-driven view layer, and swaps the Bootstrap navbar and
status badges for the design-system app bar and chips.

Also includes pending API-key management, Galaxy hierarchy projection,
and constraint-enforcement work with their tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:31:04 -04:00
Joseph Doherty a4ed605f74 A.3 (live smoke): full alarms-over-gateway pipeline verified end-to-end
Skip-gated AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip ran
against the dev rig with the flip script firing
TestMachine_001.TestAlarm001 every 10s. Verified:
  - Subscribe + 1st PollOnce yield real transition events
  - Field-by-field decode correct (provider, group, tag, severity,
    UTC timestamp, comment, type)
  - SnapshotActiveAlarms reflects current state
  - AcknowledgeByName(real identity) -> rc=0
  - Pipeline keeps streaming transitions on the 10s cadence post-ack

Three production quirks surfaced and were fixed in
WnWrapAlarmConsumer:

1. SetXmlAlarmQuery is mandatory for reads. Skipping it (per the
   earlier discovery-doc recommendation) makes the first
   GetXmlCurrentAlarms2 fail with E_FAIL. The doc's claim that the
   call is unnecessary because the round-trip echo is mangled was
   wrong — mangled echo or not, the call is required.

2. SetXmlAlarmQuery breaks AlarmAckByName on the same consumer
   instance (returns -55). Workaround: provision a parallel
   "ack-only" wnwrap consumer that runs Initialize → Register →
   Subscribe via the v1-prefixed methods, no SetXmlAlarmQuery.
   Production WnWrapAlarmConsumer now holds two COM clients;
   AcknowledgeByName always dispatches through the ack-only one.

3. AlarmAckByName has v2 (8-arg) and v1 (6-arg) overloads. The v2
   8-arg overload returns -55 on this AVEVA build (apparently a
   stub); the v1 6-arg overload works. Production now calls the
   6-arg overload, discarding the proto's operator_domain and
   operator_full_name fields. The proto contract keeps both for
   forward-compat if AVEVA fixes the v2 method.

Bonus finding (not fixed here): AlarmAckByGUID throws
NotImplementedException on wnwrap. Reference→GUID lookup that we
initially planned to plumb is therefore not viable; all acks must
go through AlarmAckByName. WorkerAlarmRpcDispatcher.AcknowledgeAsync
already routes references through the by-name path, so this only
affects the GUID-input branch (which the worker tries first if the
input parses as a GUID — that branch will surface
NotImplementedException as MxaccessFailure if a client supplies one).

Threading caveat: wnwrap is ThreadingModel=Apartment, so the
consumer's internal Timer (firing on threadpool threads) blocks on
cross-apartment marshaling without an STA message pump. The smoke
test sidesteps this with pollIntervalMilliseconds=0 (Timer disabled)
+ manual PollOnce calls from the test STA. Production hosting will
route polls through the worker's StaRuntime in a follow-up; PollOnce
is now public so the wire-up is straightforward.

Test counts after this slice:
  Worker: 195 pass / 4 skipped (live probes incl. new live smoke) /
          1 pre-existing structure-fail (untouched)
  Server: 308 pass / 0 fail
Solution builds clean.

docs/AlarmClientDiscovery.md "Live smoke-test discoveries" section
records all five findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:17:39 -04:00
Joseph Doherty 4e02927f01 A.3 (alarm-ack-by-name): public AcknowledgeAlarm now accepts Provider!Group.Tag references
Closes the gap where the public AcknowledgeAlarm RPC required canonical
GUIDs but OnAlarmTransitionEvent.AlarmFullReference is "Provider!Group.Tag".
Adds an AVEVA AlarmAckByName path that wraps wwAlarmConsumerClass.AlarmAckByName
so callers can ack with the natural reference.

Proto:
- New MxCommandKind.AcknowledgeAlarmByName (=29).
- New AcknowledgeAlarmByNameCommand(alarm_name, provider_name, group_name,
  comment, operator_user/node/domain/full_name) on MxCommand oneof.
- AcknowledgeAlarmReplyPayload (existing) carries the AVEVA native
  status; reused for the by-name path.

Worker:
- IMxAccessAlarmConsumer + WnWrapAlarmConsumer + AlarmDispatcher +
  AlarmCommandHandler all gain an AcknowledgeByName(name, provider,
  group, comment, operator-identity) overload that maps to
  wwAlarmConsumerClass.AlarmAckByName.
- MxAccessCommandExecutor: new switch arm routes
  MxCommandKind.AcknowledgeAlarmByName to the handler. Empty alarm_name
  yields InvalidRequest; handler exceptions surface as MxaccessFailure.

Gateway:
- WorkerAlarmRpcDispatcher.TryParseAlarmReference: parses
  "Provider!Group.Tag" with the convention that the FIRST '!' separates
  provider, the FIRST '.' after '!' separates group; tag may contain
  more dots.
- AcknowledgeAsync now branches: GUID input → AcknowledgeAlarm command
  (existing path); reference input → AcknowledgeAlarmByName command
  (new path); neither parses → InvalidRequest with a clear diagnostic.

Tests: 13 new unit tests cover each layer end-to-end:
- WorkerAlarmRpcDispatcher.TryParseAlarmReference (3 valid + 8 invalid
  forms) including the realistic 4-component "Galaxy!TestArea.
  TestMachine_001.TestAlarm001" reference.
- WorkerAlarmRpcDispatcher.AcknowledgeAsync routes references through
  AcknowledgeAlarmByName + propagates the full operator tuple.
- Executor switch arm carries the by-name tuple and rejects empty
  alarm_name.
- AlarmDispatcher.AcknowledgeByName forwards to consumer.
- Existing fakes extended for the new overload.

Counts: server 308/0, worker 195/3 skip / 1 pre-existing structure-fail
(untouched). Solution builds clean.

End-to-end alarms-over-gateway now serves the full lmxopcua flow:
client.AcknowledgeAlarm(reference="Galaxy!TestArea.TestMachine_001.TestAlarm001",
operator_user="alice") → gateway parses → IPC AcknowledgeAlarmByName →
worker AlarmAckByName → AVEVA history. The remaining piece for full
parity is a live dev-rig smoke test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:17:15 -04:00
Joseph Doherty 47b1fd422c A.3 (auto-subscribe): SessionManager issues SubscribeAlarms on session open
Adds the missing trigger that activates the worker's wnwrap consumer.
Without this, every session opened in OK state but the consumer never
started, so AcknowledgeAlarm/QueryActiveAlarms returned "alarm consumer
not configured" forever.

New AlarmsOptions config block (under MxGateway:Alarms):
  - Enabled (default false): gates the auto-subscribe path so existing
    deployments without alarm configuration are unaffected.
  - SubscriptionExpression: explicit AVEVA expression like
    \<machine>\Galaxy!<area>.
  - DefaultArea: fallback used when SubscriptionExpression is empty;
    composes \$(MachineName)\Galaxy!$(DefaultArea).
  - RequireSubscribeOnOpen (default false): when true, an auto-subscribe
    failure faults the session; when false, the failure is logged and
    the session stays Ready (data subscriptions keep working, alarms
    return "not subscribed" until the operator retries).

SessionManager.OpenSessionAsync gains a TryAutoSubscribeAlarmsAsync hook
that runs after MarkReady. Skips when alarms are disabled; otherwise
builds a SubscribeAlarmsCommand, invokes it on the session's worker
client, and either logs the resulting status or escalates per
RequireSubscribeOnOpen. SessionManagerException is the failure mode for
the strict path so callers in MxAccessGatewayService surface it as
session-open-failed.

Tests: 7 new unit tests cover the disabled lane, expression-driven
subscribe, DefaultArea fallback, success path, soft-failure (require
off), strict-failure (require on), and missing-config-strict-throw.
Server suite total: 295 pass / 0 fail. Solution builds clean.

End-to-end alarms-over-gateway path is now live (with config). Open a
session against a gateway with Alarms.Enabled=true + a valid
SubscriptionExpression; the worker's wnwrap consumer auto-subscribes;
QueryActiveAlarms streams snapshots; AcknowledgeAlarm acks by GUID.
Reference→GUID resolution (AlarmAckByName worker command) and the live
dev-rig smoke test remain follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:10:13 -04:00
Joseph Doherty 9b21ca3554 A.3 (gateway dispatcher): WorkerAlarmRpcDispatcher routes alarm RPCs over the worker pipe
Replaces NotWiredAlarmRpcDispatcher in DI with a production
implementation that issues the new MxCommandKind.{AcknowledgeAlarm,
QueryActiveAlarms} commands across the IPC and unwraps the resulting
MxCommandReply into the public RPC types.

QueryActiveAlarms is fully wired: builds the QueryActiveAlarmsCommand
(forwarding alarm_filter_prefix), invokes it on the resolved
GatewaySession's worker client, and yields each ActiveAlarmSnapshot
from the QueryActiveAlarmsReplyPayload as the RPC stream. Worker
failures + missing sessions yield an empty stream — matches the
ConditionRefresh contract clients already speak to.

AcknowledgeAlarm is partially wired: the public RPC takes
AlarmFullReference (Provider!Group.Tag), but the worker's wnwrap
consumer acks by GUID. Strategy:
- If AlarmFullReference parses as a canonical GUID, forward it
  directly through MxCommandKind.AcknowledgeAlarm. Native status
  flows back via MxCommandReply.Hresult and the dedicated
  AcknowledgeAlarmReplyPayload.NativeStatus.
- Otherwise, return InvalidRequest with a clear diagnostic naming the
  follow-up — reference→GUID lookup needs a worker-side AlarmAckByName
  command wrapping wwAlarmConsumerClass.AlarmAckByName.

DI: SessionServiceCollectionExtensions registers WorkerAlarmRpcDispatcher
as the default IAlarmRpcDispatcher; MxAccessGatewayService picks it up
via constructor injection. NotWiredAlarmRpcDispatcher is retained for
test fixtures that want the no-side-effect fake.

Tests: 7 new unit tests cover session-not-found short-circuit, GUID-vs-
reference branching, native-status propagation, worker MxaccessFailure
diagnostic propagation, and snapshot-stream yielding. Server test
suite total: 288/0 fail. Solution builds clean.

End-to-end alarms-over-gateway pipeline status:
  consumer → sink → queue (A.2 + A.3 in-process slice)
  worker IPC commands (A.3 worker slice)
  gateway dispatcher (this slice)

Remaining for full E2E:
  - Auto-issue SubscribeAlarms on session open (or add a public
    SubscribeAlarms RPC). Without this trigger the consumer never
    starts and Acknowledge/Query return "not subscribed".
  - AlarmAckByName worker command for ack-by-reference.
  - End-to-end live test against the dev rig.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:58:40 -04:00
Joseph Doherty 01f5e6ad91 A.3 (worker IPC slice): proto SubscribeAlarms/Acknowledge/QueryActive commands + executor routing
Adds the worker-side IPC surface for the alarm subsystem so the gateway
can drive the AlarmDispatcher across the named-pipe boundary. Adds four
proto MxCommandKind values + matching command messages and two
MxCommandReply payload variants:

- SubscribeAlarmsCommand(subscription_expression)
- UnsubscribeAlarmsCommand
- AcknowledgeAlarmCommand(alarm_guid, comment, operator_user/node/domain/full_name)
- QueryActiveAlarmsCommand(alarm_filter_prefix)
- AcknowledgeAlarmReplyPayload(native_status)
- QueryActiveAlarmsReplyPayload(repeated ActiveAlarmSnapshot snapshots)

Worker plumbing:

- New IAlarmCommandHandler interface + AlarmCommandHandler production
  impl. Lazy-creates an AlarmDispatcher (with a wnwrap-backed consumer
  by default) on the first SubscribeAlarms; routes Acknowledge / QueryActive /
  Unsubscribe through it. Idempotent under repeated Unsubscribe; rejects
  a second Subscribe without an intervening Unsubscribe; cleans up the
  consumer if the underlying Subscribe call throws.
- MxAccessCommandExecutor: 4 new switch arms map MxCommandKind values to
  IAlarmCommandHandler calls. Acknowledge surfaces the AVEVA native
  status into both MxCommandReply.Hresult and the dedicated
  AcknowledgeAlarmReplyPayload.NativeStatus so gateway-side consumers
  can echo it without unpacking the outer envelope. Invalid GUIDs and
  missing payloads return InvalidRequest; handler exceptions return
  MxaccessFailure with the exception message in DiagnosticMessage.
- MxAccessStaSession: new constructor overload accepts an
  alarmCommandHandlerFactory; it's invoked on the STA thread during
  StartAsync and the resulting handler is passed into the executor.
  ShutdownGracefullyAsync + Dispose tear it down on the STA before the
  data-side cleanup runs.

Tests: 20 new unit tests covering AlarmCommandHandler lazy lifecycle
(Subscribe/Unsubscribe/Acknowledge/Query/Dispose, error paths) and the
executor's 4 alarm switch arms (OK/InvalidRequest/MxaccessFailure paths,
hresult propagation, prefix filtering). Worker test suite total: 192
passed / 3 skipped (live probes) / 1 pre-existing structure-test fail
(untouched).

Deferred to next slice: gateway-side WorkerAlarmRpcDispatcher that
replaces NotWiredAlarmRpcDispatcher, builds + sends these commands across
the IPC, and unwraps the resulting MxCommandReply into AcknowledgeAlarmReply
/ ActiveAlarmSnapshot stream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:52:04 -04:00
Joseph Doherty 82eb0ad569 A.3 (in-process slice): AlarmDispatcher wires consumer events onto event queue
Adds the in-process plumbing that connects WnWrapAlarmConsumer's
AlarmTransitionEmitted stream to the worker's MxAccessEventQueue via
MxAccessAlarmEventSink. With this change a transition raised by the
consumer lands as an OnAlarmTransitionEvent proto on the queue,
SessionId attached, ready for IPC dispatch.

Mapping: provider!group.tag → AlarmFullReference, tag → SourceObjectReference,
priority → severity, wnwrap STATE → AlarmConditionState (Active /
ActiveAcked / Inactive — wnwrap's ack-vs-unack-on-cleared distinction
collapses since OPC UA Part 9 doesn't model it). State delta drives
AlarmTransitionKind via the existing AlarmRecordTransitionMapper table.

Holding off on the proto IPC additions (SubscribeAlarms /
AcknowledgeAlarm / QueryActiveAlarms commands + WorkerAlarmRpcDispatcher)
for a follow-up — those touch every layer of the worker IPC and warrant
their own PR. This slice proves the consumer→sink→queue pipeline
end-to-end with unit tests and clears the path for the proto additions
to plug in cleanly.

Tests: 10 new unit tests cover field-by-field mapping, the
"unchanged-state-doesn't-emit" filter, the state→transition kind table,
Subscribe / Acknowledge passthrough, SnapshotActiveAlarms → proto
ActiveAlarmSnapshot mapping, and Dispose detaches the handler. All
passing; total worker test count 172/3 skip / 1 pre-existing structure
fail (untouched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 09:52:35 -04:00
Joseph Doherty f711a55be4 A.2: replace AlarmClientConsumer with wnwrap-based polling consumer
Switch the worker's alarm-consumer surface from `aaAlarmManagedClient.AlarmClient`
to `WNWRAPCONSUMERLib.wwAlarmConsumerClass` (CLSID 7AB52E5F-…) hosted by
`wnwrapConsumer.dll`. The new path returns alarm records as a BSTR XML
payload via `GetXmlCurrentAlarms2`, bypassing the FILETIME→DateTime
auto-marshaling that crashed `GetHighPriAlarm` with
ArgumentOutOfRangeException on every poll. Live captured 60/60 polls
clean against `\DESKTOP-6JL3KKO\Galaxy!DEV` while a System Platform
script flipped TestMachine_001.TestAlarm001 every 10s; the GUID,
priority, state (UNACK_ALM ↔ UNACK_RTN), and ASCII-formatted timestamps
arrived end-to-end.

Implementation:
- `Interop.WNWRAPCONSUMERLib.dll` generated via tlbimp, checked in under
  `lib/` so dev boxes don't need the SDK to build.
- New `WnWrapAlarmConsumer` (replaces `AlarmClientConsumer`): owns a
  500ms polling timer, parses `GetXmlCurrentAlarms2` output, diffs the
  snapshot keyed by alarm GUID, and raises one
  `MxAlarmTransitionEvent` per state change. Includes the
  Initialize→Register-before-Subscribe ordering fix found during
  Discovery probe runs.
- New library-agnostic types `MxAlarmSnapshotRecord` /
  `MxAlarmStateKind` / `MxAlarmTransitionEvent` so the proto-build
  path is testable without an AVEVA install.
- `AlarmRecordTransitionMapper` retired the COM-coupled
  `MapTransitionKind(eAlmTransitions)`; new pure helpers
  `ParseStateKind`, `MapTransition(prev, curr)`, and
  `ParseTransitionTimestampUtc` cover XML decode + state-delta logic.
- `IMxAccessAlarmConsumer` event surface changed from
  `EventHandler<AlarmRecord>` to `EventHandler<MxAlarmTransitionEvent>`
  and `SnapshotActiveAlarms()` returns `MxAlarmSnapshotRecord` —
  decoupling the interface from any specific COM library.
- Worker csproj drops `aaAlarmManagedClient` / `IAlarmMgrDataProvider`
  refs; adds `Interop.WNWRAPCONSUMERLib`.

Tests:
- 36 new unit tests (state-string mapping, prev/current → proto kind
  decision table, timestamp UTC reassembly, XML payload parser, 32-char
  hex GUID round-trip) covering everything that doesn't touch the live
  COM surface — all passing.
- Skip-gated `WnWrapConsumerProbeTests.ProbeWnWrapConsumer` archives
  the live capture flow for regression / future probes.

Docs:
- `docs/AlarmClientDiscovery.md` "Option A — captured" section records
  sample XML payloads, the mangled `SetXmlAlarmQuery` round-trip
  (prefer `Subscribe` for filtering), the `GetStatistics`
  AccessViolationException quirk, and the worker-integration outline.

Pre-existing failure noted (separate):
`MxAccessInteropReference_ExistsOnlyInWorkerProject` was already
failing on HEAD — the test project still references `ArchestrA.MxAccess`
for the Skip-gated discovery probes. Not regressed by this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 09:44:15 -04:00
Joseph Doherty f490ae2593 docs: revise interop fix path — wnwrapConsumer.dll is the right surface
Reflection on aaAlarmManagedClient.AlarmClient shows it implements
only IDisposable (no [ComImport] interface, no class GUID) and
has a single field "CwwAlarmConsumer* m_almUnmanaged". So
AlarmClient is a C++/CLI managed wrapper around a native C++
class -- NOT a COM-interop class. The DateTime conversion happens
INSIDE AVEVA's wrapper IL, not at the .NET-COM marshaling
boundary. There's no separate COM interface to QI to.

Revised approach (in docs/AlarmClientDiscovery.md):

A. wnwrapConsumer.dll -- separate standalone COM library AVEVA
   ships at "C:\Program Files (x86)\Common Files\ArchestrA"
   exposing WNWRAPCONSUMERLib.wwAlarmConsumerClass with
   SetXmlAlarmQuery / GetXmlCurrentAlarms. XML-string output
   bypasses FILETIME marshaling entirely. Best fit -- real COM,
   self-contained, conventional production-grade approach.
B. Patch aaAlarmManagedClient.dll IL -- direct but modifies a
   vendor binary, brittle to upgrades.
C. Reflect into m_almUnmanaged and call native vtable directly
   -- requires reverse-engineering the C++ class layout.

Picking A. Probe restored to Skip; next commit starts the
wnwrapConsumer integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 09:15:37 -04:00
Joseph Doherty 39f9fd8946 probe: BREAKTHROUGH — alarms flow via canonical \Node\Galaxy!Area, blocked by DateTime marshaling
Two findings that turn the alarm capture path on:

1. Subscription expression: \<MachineName>\Galaxy!<Area> is the
   canonical AlarmClient subscription format per ArchestrA docs:
   \Node\Provider!Area!Filter, with Provider literally "Galaxy"
   (not the Galaxy name) and Node being the machine name. For
   this rig: \DESKTOP-6JL3KKO\Galaxy!DEV catches alarms.
2. InitializeConsumer before RegisterConsumer — discovered
   earlier; bug-fix for PR A.5's AlarmClientConsumer.

With these in place, GetHighPriAlarm returned a record on every
poll for 60s straight (117/117 calls). But every call throws
ArgumentOutOfRangeException: Not a valid Win32 FileTime, because
AlarmRecord has five DateTime fields (ar_Time / ar_OrigTime /
ar_AckTime / ar_RtnTime / ar_SubTime) and AVEVA writes sentinel
FILETIME values for unset ones (e.g., ar_AckTime on an
unacknowledged alarm). The aaAlarmManagedClient.dll auto-marshals
FILETIME -> DateTime and rejects out-of-range values.

GetStatistics still reports total=0 active=0 even with
GetHighPriAlarm returning records — those two APIs have
different views. The active read API for current alarms is
GetHighPriAlarm, not GetStatistics's change array.

So the consumer chain works. The blocking issue is now
extracting the payload past the AVEVA-shipped DateTime
auto-marshaling. Three approaches for the next PR:

1. Patch aaAlarmManagedClient.dll via ildasm/ilasm round-trip.
2. Define a custom [ComImport] interface with safe-blittable
   types and Marshal.QueryInterface to it.
3. Use IDispatch late binding to bypass strong-typed marshaling.

Option 2 is cleanest; needs the AlarmClient COM IID.

Probe changes:

- Subscription expression set to \<MachineName>\Galaxy!DEV.
- GetHighPriAlarm tally counters (ok-with-record vs throw).
- 117 throws / 0 ok-with-record over 60s confirms alarms are
  flowing continuously while the user's flip script runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 09:06:45 -04:00
Joseph Doherty bb7be14d1d probe: aaAlarmManagedClient receives no alarm data — full consumer chain verified
Sixth probe iteration with every consumer-side knob exhausted:

- Subscriptions tried (all rc=0): \Galaxy!, \Galaxy!*, \Galaxy!,
  \Galaxy!TestArea, \.\Galaxy!.
- Read channels polled at 500ms: GetStatistics, GetHighPriAlarm,
  SFCreateSnapshot + SFGetStatistics.
- Filters: priority 0..32767, qtSummary + qtHistory both tried,
  asAlarmActiveNow.
- AlarmRecord pre-init to FILETIME epoch to dodge marshaler bug
  on default(DateTime).

Result: every read API returns empty for the entire 60s window
even with TestMachine_001.TestAlarm001 firing every 10s and
aaObjectViewer confirming InAlarm transitions. The
aaAlarmManagedClient.AlarmClient is not the receive surface
AVEVA's alarm pipeline routes to in this Galaxy configuration.

The consumer chain is verified working end-to-end: Initialize +
Register + Subscribe all succeed, GetProviders finds the
provider, the WM 0xC275 heartbeat fires at 1Hz to AVEVA's
internal hwnd. There is simply no alarm data flowing through
this consumer surface.

Next investigation is not consumer-side: either find the SDK
aaObjectViewer's alarm panel uses, or query the historian
event storage directly. If alarms only flow via the historian
path on this customer's Galaxy, the worker's PR A.5 architecture
is a dead-end and A.2 needs a different transport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:26:29 -04:00
Joseph Doherty 8ac6642bf8 probe: subscribe-parameter sweep — alarms still absent, producer-side blocked
Tried every documented subscription knob with InitializeConsumer
present + provider visible at status 100:

- qtSummary AND qtHistory (the only eQueryType values).
- Priority 1..999 AND 0..32767.
- FilterMask/Spec asNone AND asAlarmActiveNow.

eAlarmFilterState is single-state-valued (asNone=0,
asAlarmActiveNow=1, asAlarmAcked=2, asShelved=3), not flag bits,
so the filter surface is exhausted.

GetStatistics continued to report total=0 active=0 codes=[7]
for every poll across all combinations.

User confirmation: the BoolAlarm extension on
TestMachine_001.TestAlarm001 is evaluating (the $Alarm.InAlarm
sub-attribute flips true/false in lockstep with the script
writes, visible in aaObjectViewer). So the consumer chain is
verified working end-to-end on our side. What's missing is
producer-side publication into the aaAlarmManagedClient stream.

Probable causes (config, not code):

- BoolAlarm extension's "publish to alarm manager" / "Active" /
  "Enabled" flag may be off.
- Alarm-vs-event mode setting may have it routing to events,
  not alarms.
- Platform alarm area may not match the consumer's subscription
  scope.

Resolution path: check the BoolAlarm extension's config in System
Platform IDE; check aaObjectViewer's Active Alarms panel (not
attribute panel) to see if the alarm appears there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 07:53:26 -04:00
Joseph Doherty 4e8928cf71 probe: InitializeConsumer required — provider visible after, alarms still absent
InitializeConsumer was the missing call. Adding it before
RegisterConsumer makes the \Galaxy! provider appear in
GetProviders (status 0 -> 100 within 500ms). Without Initialize,
GetProviders returns an empty list even though everything else
returns rc=0 (success).

Probe trace 2026-05-01:

  InitializeConsumer -> 0
  RegisterConsumer -> 0
  GetProviders [after Register] -> count=0 list=[]
  Subscribe('\Galaxy!') -> 0
  GetProviders [after Subscribe] -> count=1 list=[  0 \Galaxy!]
  GetProviders [poll #1] -> count=1 list=[100 \Galaxy!]

Despite the provider being at "100% query complete" for the
entire 60s window, GetStatistics continued to report
total=0 active=0 codes=[7] -- no alarm transitions reached the
consumer even with a System Platform script flipping
TestMachine_001.TestAlarm001 every 10s during the run.

So the consumer chain works end-to-end. What's missing is alarm
traffic from the producer side. The next discriminator is
whether ObjectViewer (or another live consumer) sees the alarm
fire while the script runs.

API-ordering bug fix to apply to PR A.5's AlarmClientConsumer
regardless of how A.2 lands: AlarmClientConsumer.Subscribe
should call InitializeConsumer before RegisterConsumer (currently
omits Initialize entirely, which means the provider chain is
never visible from the worker either). That fix lifts a
fundamental bug independent of the polling-vs-callback question.

Probe changes:

- Added InitializeConsumer call before RegisterConsumer.
- Added LogProviders helper that logs only on change; called
  after Register, after Subscribe, and on every poll. Easier
  to spot when the provider chain transitions from empty to
  populated.
- Restored Skip-gating after run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 07:43:06 -04:00