Commit Graph

233 Commits

Author SHA1 Message Date
Joseph Doherty cd92048f4e Regenerate stale Java client protobuf code
The checked-in generated Java sources under clients/java/src/main/generated/
were out of sync with both the .proto contracts and the configured
protobuf 4.33.1 toolchain: they were missing the alarm command kinds
(MX_COMMAND_KIND_SUBSCRIBE_ALARMS..ACKNOWLEDGE_ALARM_BY_NAME, 25-29), the
alarm/galaxy message additions, and the protobuf 4.x generated-code layout.
Regenerated via `gradle generateProto`; `gradle test` passes against the
refreshed sources. No hand edits — pure protoc/protoc-gen-grpc-java output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:21:13 -04:00
Joseph Doherty 964b40dcbc Fix stale WorkerProjectReferenceTests MXAccess-interop assertion
MxAccessInteropReference_ExistsOnlyInWorkerProject asserted the MXAccess COM
interop was referenced only by MxGateway.Worker. The worker test project now
legitimately references ArchestrA.MxAccess and Interop.WNWRAPCONSUMERLib so it
can exercise the COM-facing worker code (WnWrapAlarmConsumer, the alarm
tests). Renamed to ..._ExistsOnlyInWorkerAndWorkerTestProjects, updated the
assertion to expect both projects, and made it order-independent. The
architecture invariant the test protects — the gateway/contracts never
reference MXAccess COM — still holds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:19:32 -04:00
Joseph Doherty bb5603b7ec Fix flaky GalaxyHierarchyRefreshServiceTests timing race
ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFault
BackgroundService cancelled the service immediately after StartAsync, so
under parallel load the first RefreshAsync could be skipped (RefreshCallCount
0) and `await executeTask` rethrew TaskCanceledException before the IsFaulted
assertion. The test now waits for a TaskCompletionSource signal that the
first refresh was attempted before cancelling, and uses Task.WhenAny so a
Canceled ExecuteTask does not rethrow. Confirmed stable across full-suite
runs (408/408).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:14:47 -04:00
Joseph Doherty 24de7e21d9 Regenerate code-reviews index after Low findings Batch 3
Reflects resolution of Contracts-001/004/005/006/007/008 (and Contracts-003
re-triaged Won't Fix). All code-review findings across every module are now
closed. Also normalizes the Contracts-003 Status to the canonical
`Won't Fix` value the index generator expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:12:34 -04:00
Joseph Doherty ee959e46e6 Resolve Contracts-001/004/005/006/007/008 code-review findings
Contracts-001: docs/Grpc.md still described "four MxAccessGateway RPCs" —
updated to the actual six (adding AcknowledgeAlarm and QueryActiveAlarms to
the handler and validation-rule sections).

Contracts-003 (Won't Fix): the finding is factually wrong — the <Protobuf>
item for mxaccess_worker.proto already sets ProtoRoot="Protos"; all three
items are consistent (confirmed back to the reviewed commit).

Contracts-004: corrected the stale GatewayContractInfo XML summary
("before generated protobuf contracts are introduced").

Contracts-005: no proto field/enum value was ever removed, so no reserved
ranges were invented. Added a wire-compatibility policy comment to all three
.proto files instructing future editors to reserve removed numbers.

Contracts-006: documented MxStatusProxy.success — it mirrors the COM
MXSTATUS_PROXY numeric success member, is not a boolean, and clients should
branch on category.

Contracts-007: added 13 round-trip tests covering galaxy_repository.proto
messages, bulk-subscribe payloads, and raw-value/IPC worker bodies.

Contracts-008: WorkerAlarmRpcDispatcher never assigns AcknowledgeAlarmReply.
status, so the old "native status" proto comment was misleading. Corrected
the hresult/status proto comments and documented the worker native_status →
public reply mapping in AlarmClientDiscovery.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:12:00 -04:00
Joseph Doherty 771229b39f Regenerate code-reviews index after Low findings Batch 2
Reflects resolution of Tests-007..012, Worker.Tests-008..015,
IntegrationTests-007..010, Client.Python-001/002/004/006/007/008/010/011/012.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:35 -04:00
Joseph Doherty a7bf1ef95d Resolve Client.Python-001/002/004/006/007/008/010/011/012 findings
Client.Python-001: dropped "scaffold" from the stale pyproject description.
Client.Python-002 (re-triaged): stale finding — MxGatewayCommandError is
already exported and in __all__; no change needed.
Client.Python-004: removed the dead `closed` variable in _smoke; the CLI
smoke now uses `async with session`.
Client.Python-006: close() on both clients and Session had an unlocked
check-then-set race; `_closed` is now set before the await.
Client.Python-007: gateway stream iterators now share one helper that
explicitly catches CancelledError and cancels the call.
Client.Python-008: to_mx_value now rejects nan/inf; float/bytes mapping
documented.
Client.Python-010: removed the circular-import-workaround late imports in
favour of TYPE_CHECKING / module-scope imports.
Client.Python-011: ensure_mxaccess_success no longer treats a proto3-default
success==0 with an unset category as a failure.
Client.Python-012 (Won't Fix): invoke_raw deliberately skips MXAccess-failure
detection for parity tests; documented the contract instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:24 -04:00
Joseph Doherty b4f5e8eb48 Resolve IntegrationTests-007..010 code-review findings
IntegrationTests-007: the three live test classes contend for shared
singletons (one MXAccess COM, one ZB SQL DB, one GLAuth). Added
LiveResourcesCollection with DisableParallelization and applied it to all
three so they no longer run concurrently.

IntegrationTests-008: the three live fact attributes each re-implemented the
env-var check. Added IntegrationTestEnvironment.IsEnabled and all three now
delegate to it.

IntegrationTests-009: reworded the misleading "Mock server call context" XML
doc — it is a hand-written stub with no verification behavior.

IntegrationTests-010: WaitForMessageAsync ignored cancellation. It now takes
an optional CancellationToken linked with the timeout; the smoke test shares
one cancellation source with the StreamEvents call context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:18 -04:00
Joseph Doherty 371bcb3f91 Resolve Worker.Tests-008..015 code-review findings
Worker.Tests-008: moved the misplaced WorkerLogRedactor test out of
VariantConverterTests into Bootstrap/WorkerLogRedactorTests.

Worker.Tests-009: renamed 46 snake_case alarm-test methods to PascalCase
Method_Scenario_Expectation.

Worker.Tests-010: replaced a weak Assert.Contains with an exact assertion
against the real diagnostic message and corrected the XML doc.

Worker.Tests-011: renamed and re-documented a cancellation test that
overstated what it proved.

Worker.Tests-012: added an oversized-frame (MessageTooLarge) test; renamed
the mislabeled zero-length-payload test.

Worker.Tests-013: removed the fixed-100ms ThrowIfCompletedAsync helper; the
caller now races runTask deterministically.

Worker.Tests-014: consolidated duplicated test fakes/helpers
(FakeRuntimeSession, NoopComApartmentInitializer, NoopEventSink, frame
helpers) into a shared TestSupport namespace.

Worker.Tests-015: added MxAccessEventQueue coverage for drain-all (maxEvents
0), empty-queue drain, and enqueue-after-fault.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:07 -04:00
Joseph Doherty 9582de077b Resolve Tests-007..012 code-review findings
Tests-007: TestServerCallContext and stream-writer/constraint helpers were
copy-pasted across five test files. Consolidated into a shared
MxGateway.Tests.TestSupport namespace; duplicates deleted.

Tests-008: renamed snake_case alarm-test methods to PascalCase
Method_Condition_Result and dropped redundant usings. Re-triaged two
inaccurate sub-claims (the "wnwrap" name and a required CompilerServices
using).

Tests-009: corrected three copy-paste-mismatched XML <summary> comments in
SessionManagerTests.

Tests-010: added the missing anonymous-localhost security negatives —
bypass disallowed, and loopback-allowed from a remote address.

Tests-011: SessionWorkerClientFactoryFakeWorkerTests discarded worker tasks.
The test class now tracks each launcher and observes its task in DisposeAsync.

Tests-012: added xunit.runner.json pinning collection parallelism and
documented the ephemeral-port convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:01 -04:00
Joseph Doherty bd3096533d Regenerate code-reviews index after Low findings Batch 1
Reflects resolution of Server-007..014, Worker-009..015,
Client.Dotnet-004..008, Client.Go-004..010, Client.Java-006..012.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:43:02 -04:00
Joseph Doherty 6eb9ea9105 Resolve Client.Java-006..012 code-review findings
Client.Java-006: close() on both clients only called shutdown(). It now
awaits termination up to the connect timeout and shutdownNow()s on timeout.

Client.Java-007: added MxGatewayLowFindingsTests covering the alarm surface,
async streaming, MxEventStream overflow, and TLS channel construction. A
latent bug surfaced: a missing CA file throws IllegalArgumentException, not
SSLException — the channel-builder catch was broadened accordingly.

Client.Java-008: async thenApply sites now route stray RuntimeExceptions
through MxGatewayErrors.fromGrpc via a normalising validator.

Client.Java-009: extracted ~80 duplicated lines (createChannel, withDeadline,
toCompletable, ...) into a shared MxGatewayChannels; both clients delegate.

Client.Java-010 (re-triaged): the README's metadata:read scope was correct;
the acknowledgeAlarm Javadoc's invoke:alarm-ack was wrong — corrected to the
admin scope.

Client.Java-011: documented the intentional fail-fast event-stream
backpressure in Javadoc and the README.

Client.Java-012: replaced CommonOptions.resolved()'s mutate-and-return-this
with side-effect-free resolvedApiKey()/resolvedTimeout() accessors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:51 -04:00
Joseph Doherty 555fe4c0ba Resolve Client.Go-004..010 code-review findings
Client.Go-004: ran gofmt on alarms_test.go and galaxy_test.go; the tree is
now gofmt-clean.

Client.Go-005/009/010: migrated Dial/DialGalaxy off the deprecated
grpc.DialContext/WithBlock to grpc.NewClient via a shared dial helper, with
a DialTimeout-bounded readiness probe to keep fail-fast semantics; shared
callContext deadline arithmetic; updated the stale Dial doc comment. Test
harnesses use passthrough:///bufnet for the NewClient default-scheme change.

Client.Go-006: added GatewayError.Code() and an IsTransient(err) helper so
callers can classify transient gRPC failures.

Client.Go-007: newCorrelationID no longer returns an empty id when
crypto/rand fails — it falls back to a non-empty time+counter id.

Client.Go-008: added coverage_test.go for transport-credential resolution,
callContext deadline arithmetic, and native value/array edge kinds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:33 -04:00
Joseph Doherty 89043cb2b6 Resolve Client.Dotnet-004..008 code-review findings
Client.Dotnet-004: documented DefaultCallTimeout as both the per-attempt
deadline and the shared retry budget, and removed DeadlineExceeded from the
transient-retry set (a client-imposed deadline cannot be helped by retrying).

Client.Dotnet-005: RegisterAsync/AddItemAsync/AddItem2Async silently returned
0 when a successful reply lacked the typed payload. They now throw a
descriptive MxGatewayException.

Client.Dotnet-006: added XML docs to the previously undocumented public
members MaxGrpcMessageBytes, GatewayProtocolVersion, WorkerProtocolVersion.

Client.Dotnet-007: corrected the AcknowledgeAlarmAsync XML comment — the RPC
requires the admin scope, not a non-existent invoke:alarm-ack sub-scope.

Client.Dotnet-008: the CLI redactor missed env-var-sourced keys because the
caller passed only the --api-key option. Redaction now uses the same
resolver, stripping env-var keys too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:27 -04:00
Joseph Doherty 1764eff1cf Resolve Worker-009..015 code-review findings
Worker-009: WorkerFrameWriter serialized twice and WorkerFrameReader
allocated a payload byte[] per frame. The writer now serializes once into a
single prefix+payload buffer; the reader rents the payload buffer from
ArrayPool and honors the logical frame length.

Worker-010: VariantConverter projected a uint+Time value as a full FILETIME,
producing a near-1601 timestamp. The FILETIME projection is now gated on
`value is long`; uint falls through to the integer projection.

Worker-011: replaced the opaque retryAttempts formula in WorkerPipeClient
with MaxRetryAttempts = int.MaxValue, leaving the connect deadline as the
sole bound.

Worker-012: rewrote stale "future PR / polls on a Timer" comments in
AlarmDispatcher, AlarmCommandHandler, MxAccessAlarmEventSink and
MxAccessEventMapper to match the shipped, post-Worker-001 behavior.

Worker-013 (re-triaged): already resolved — StaMessagePumpTests and
MxAccessStaSessionTests cover the pump and poll loop directly.

Worker-014: moved IAlarmCommandHandler into its own file so
AlarmCommandHandler.cs declares one public type.

Worker-015: clarified the MxAccessBaseEventSink.EnqueueEvent overflow-catch
comment explaining the deliberate double RecordFault no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:17 -04:00
Joseph Doherty fe9044115b Resolve Server-007..014 code-review findings
Server-007: GalaxyHierarchyProjector re-filtered the whole hierarchy per
page (O(total) paging). It now memoizes the filtered list per cache-entry +
filter signature so subsequent pages are an O(pageSize) slice.

Server-008: WatchDeployEvents re-resolved browse subtrees and rebuilt globs
per streamed event. ResolveBrowseSubtrees is hoisted out of the loop and
GalaxyGlobMatcher caches compiled Regex instances per pattern.

Server-009: auth-store connections used no busy timeout or WAL. A new
OpenConnectionAsync applies journal_mode=WAL and a busy_timeout; all auth
call sites use it. docs/Authentication.md updated.

Server-010: the dashboard rendered Rotate/Revoke for revoked keys, where
Rotate silently reactivates them. ApiKeysPage now shows actions only for
Active keys. docs/Authentication.md updated.

Server-011: WorkerAlarmRpcDispatcher converted to a primary constructor and
brought in line with module conventions.

Server-012: CLAUDE.md corrected to the canonical *:* scope strings.

Server-013 (partly re-triaged): three named coverage gaps were already
closed; the genuine gap (WorkerExecutableValidator) is now covered.

Server-014: rewrote stale "alarm path not yet wired" comments in
MxAccessGatewayService to describe the production WorkerAlarmRpcDispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:06 -04:00
Joseph Doherty a02faa6ade Regenerate code-reviews index after Medium findings Batch C
Reflects resolution of Contracts-002 — all Medium findings now closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:51:03 -04:00
Joseph Doherty 1f546c46ee Resolve Contracts-002 code-review finding
MxCommandReply.payload has no by-name ack case: MX_COMMAND_KIND_ACKNOWLEDGE_
ALARM_BY_NAME reuses the acknowledge_alarm reply payload. Verified the worker
(MxAccessCommandExecutor.ExecuteAcknowledgeAlarmByName) and gateway
(WorkerAlarmRpcDispatcher) already implement this correctly — the gap was
purely undocumented contract asymmetry. Documented the reuse on the proto
oneof case and the AcknowledgeAlarmReplyPayload message comment (regenerating
the .NET contract), and in docs/AlarmClientDiscovery.md. Added
ProtobufContractRoundTripTests.MxCommandReply_AcknowledgeAlarmByName_Reuses
AcknowledgeAlarmPayloadCase to pin the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:50:57 -04:00
Joseph Doherty 6a4833bd32 Regenerate code-reviews index after Medium findings Batch B
Reflects resolution of Tests-003..006, Worker.Tests-003..007,
IntegrationTests-003..006, Client.Python-003/005/009.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:45:29 -04:00
Joseph Doherty e4fbbb541a Resolve Client.Python-003, -005, -009 code-review findings
Client.Python-003: stream_events_raw and query_active_alarms passed `timeout`
to the stub with no TypeError fallback, unlike _unary. Both now route through
a shared _open_stream helper that strips `timeout` on TypeError.

Client.Python-005: discover_hierarchy buffered the entire Galaxy hierarchy in
memory. Added GalaxyRepositoryClient.iter_hierarchy, a lazy async generator
yielding objects page-by-page; discover_hierarchy is now a thin wrapper that
preserves its list contract. README documents iter_hierarchy.

Client.Python-009: added regression coverage for previously untested paths —
write2/add_item2 request shape, the MAX_BULK_ITEMS boundary, the None-argument
TypeError guards, TLS ca_file reading, and the non-auth map_rpc_error fallthrough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:45:16 -04:00
Joseph Doherty f13f35bc79 Resolve IntegrationTests-003..006 code-review findings
IntegrationTests-003: the live MXAccess smoke test asserted on the first
streamed event, which a registration/quality bootstrap event could occupy.
The recording writer now waits for the first event matching a predicate
(Family == OnDataChange).

IntegrationTests-004: the cleanup `finally` could throw and mask an original
assertion failure. Shutdown now routes through a helper that logs cleanup
exceptions instead of propagating them.

IntegrationTests-005: added live MXAccess parity tests — a Write round-trip
to an advised item, and an invalid-handle command surfacing the MXAccess
failure without a transport fault.

IntegrationTests-006: added live LDAP failure-path tests — wrong password
(no password leak), unknown username, and server-unreachable.

docs/GatewayTesting.md updated to describe the new cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:45:11 -04:00
Joseph Doherty 18ce2922e2 Resolve Worker.Tests-003..007 code-review findings
Worker.Tests-003: removed the wall-clock `Elapsed < 2s` assertion from
InvokeAsync_WakesIdlePumpForQueuedCommand; the awaited completion against a
30s idle period already proves the wake event drove dispatch.

Worker.Tests-004: MxAccessStaSession.Dispose now joins the alarm poll task
after cancelling the CTS (consistent with ShutdownGracefullyAsync), and
Dispose_StopsAlarmPollLoop asserts deterministically instead of via Task.Delay.

Worker.Tests-005: undisposed MemoryStream instances across the frame-protocol
and pipe-session tests are now `using` declarations.

Worker.Tests-006: Dispose_StopsAlarmPollLoop now constructs MxAccessStaSession
with `using` so a failed assertion cannot leak the STA poll loop.

Worker.Tests-007: docs/WorkerFrameProtocol.md verification section corrected
to target MxGateway.Worker.Tests / MxGateway.Worker with -p:Platform=x86.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:45:01 -04:00
Joseph Doherty 5ade3f4f48 Resolve Tests-003, -004, -005, -006 code-review findings
Tests-003: temp auth-DB directories leaked under %TEMP%. Added the
TempDatabaseDirectory IDisposable helper (clears the Sqlite connection pool,
then recursively deletes); SqliteAuthStoreTests and ApiKeyAdminCliRunnerTests
now dispose every directory they create.

Tests-004: added end-to-end coverage composing the real authorization
interceptor in front of the real MxAccessGatewayService, plus scope-resolver
tests confirming an unmapped request type fails closed to the admin scope.

Tests-005: added coverage for a worker faulting mid-command — a pipe
disconnect and a worker fault while an InvokeAsync is in flight both fail the
pending invoke. No product change needed.

Tests-006 (re-triaged): the flaky ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess
is a test race, not a product bug — the kill runs synchronously inside
SetFaulted. Rewrote it to await FakeWorkerProcess exit deterministically, and
replaced fixed Task.Delay timing in the late-reply and heartbeat tests with
FIFO ordering and an injected ManualTimeProvider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:44:55 -04:00
Joseph Doherty 98f9b7792b Regenerate code-reviews index after Medium findings Batch A
Reflects resolution of Server-002/004/005/006, Worker-004..008,
Client.Dotnet-001/002/003, Client.Go-002/003, Client.Java-001..005.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:32:02 -04:00
Joseph Doherty ff41556b9a Resolve Client.Java-001..005 code-review findings
Client.Java-001: redactApiKey echoed the last 4 secret characters. It now
keeps only the non-secret mxgw_<key-id>_ prefix plus ***; non-gateway-shaped
tokens return <redacted>.

Client.Java-002: a close() after a queue-overflow could wipe the enqueued
overflow exception. Terminal transitions are now serialized through a single
guarded terminate() — first terminal condition wins.

Client.Java-003: openSession never read gateway_protocol_version. Both
openSession paths now call ensureGatewayProtocolCompatible, rejecting a
non-zero mismatch and accepting unset (0) for older gateways.

Client.Java-004: register/addItem/addItem2 fell back to a return_value that
silently yields 0 when unset. The fallback is now guarded by hasReturnValue()
and throws on a protocol violation.

Client.Java-005: close() in try-with-resources could mask the body exception
when the CloseSession RPC failed. close() now catches and logs the
close-time failure; closeRaw() still surfaces it for callers that want it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:46 -04:00
Joseph Doherty f88a029ecc Resolve Client.Go-002, -003 code-review findings
Client.Go-002: the Events/EventsAfter compatibility path silently dropped
events when the 16-slot results channel filled — it cancelled the stream and
closed the channel with no error delivered. sendEventResult now evicts an
old buffered event and delivers a terminal EventResult carrying the new
exported ErrEventBufferOverflow before close, so the overflow is observable.

Client.Go-003: parseInt32List panicked on a malformed -item-handles token,
crashing the CLI with a stack trace. It now returns an error that
runUnsubscribeBulk propagates, exiting 2 with a clean message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:36 -04:00
Joseph Doherty 8023eccfa6 Resolve Client.Dotnet-001, -002, -003 code-review findings
Client.Dotnet-001: MapRpcException typed only Unauthenticated and
PermissionDenied; every other gRPC status collapsed to an untyped exception
with the status code discarded. Added a nullable StatusCode to
MxGatewayException, extracted the duplicated mappers into a shared
RpcExceptionMapper that records the code for every status, and documented it.

Client.Dotnet-002: the production retry branch (MxGatewayException wrapping
RpcException) was never exercised. FakeGatewayTransport gained a
MapTransportExceptions mode that runs thrown RpcExceptions through
RpcExceptionMapper exactly as the production transport does.

Client.Dotnet-003: MxGatewaySession.DisposeAsync disposed _closeLock while a
concurrent CloseAsync could be parked in WaitAsync. DisposeAsync now drains
in-flight CloseAsync callers before disposing the semaphore; the client's
_disposed flag is accessed via Interlocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:33 -04:00
Joseph Doherty 54325343bd Resolve Worker-004, -005, -006, -007, -008 code-review findings
Worker-004: post-watchdog-fault heartbeats reported a non-faulted state.
ReportWatchdogFaultIfNeededAsync now sets _state = Faulted before writing
the StaHung fault.

Worker-005 (re-triaged): the cited OnPoll site was removed by Worker-001;
the real silent-failure bug was in MxAccessStaSession.RunAlarmPollLoopAsync,
which caught only graceful-stop exceptions. A failing PollOnce now records a
WorkerFault on the event queue instead of vanishing on a non-awaited task.

Worker-006: RunAsync's finally skipped runtime disposal when shutdown timed
out, leaking the STA thread and COM object. It now always disposes
(MxAccessStaSession.Dispose is idempotent and bounded).

Worker-007 (re-triaged): replaced MxAccessComServer's Type.InvokeMember
reflection fallback with an IMxAccessServer fast path plus typed
ILMXProxyServer* casts; a non-conforming object now fails fast.

Worker-008: alarm consumer STA affinity was unenforced. MxAccessStaSession
records the alarm consumer's STA thread id and asserts every PollOnce runs
on it via a unit-testable guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:23 -04:00
Joseph Doherty 1d9e3afadd Resolve Server-002, -004, -005, -006 code-review findings
Server-002: the gateway never terminated leftover MxGateway.Worker.exe
processes at startup, contradicting gateway.md and CLAUDE.md. Added
IRunningProcessInspector/SystemRunningProcessInspector, OrphanWorkerTerminator,
and OrphanWorkerCleanupHostedService (best-effort, runs before sessions are
accepted); updated gateway.md to describe the implemented behavior.

Server-004: API-key scopes were persisted verbatim with no validation. Added
GatewayScopes.All/IsKnown; the CLI parser and dashboard create path now
reject unknown scope strings.

Server-005: a non-SqlException/InvalidOperationException fault on the initial
Galaxy hierarchy load faulted the BackgroundService. ExecuteAsync now catches
all non-cancellation exceptions on first load and RefreshCoreAsync broadens
its catch so the cache records Stale/Unavailable instead.

Server-006: OpenSessionAsync incremented the open-sessions gauge before
alarm auto-subscribe; an auto-subscribe failure leaked the gauge. The catch
path now calls SessionRemoved() when the gauge was incremented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:10 -04:00
Joseph Doherty 5e795aeeb8 Regenerate code-reviews index after High/Critical resolution batch
Reflects the resolution of Tests-001/002, IntegrationTests-001/002,
Client.Go-001, Worker-001/002/003 and Worker.Tests-001/002.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:08:06 -04:00
Joseph Doherty 1b4dcf32d5 Resolve Worker.Tests-001 and Worker.Tests-002 code-review findings
Worker.Tests-001: StaMessagePump had no direct unit test. Added
Sta/StaMessagePumpTests.cs — 8 STA-thread facts covering WaitForWorkOrMessages
(wake-event signalled before/during the wait, timeout expiry, zero-timeout
fast path, the QS_ALLINPUT posted-message wake path) and PumpPendingMessages
drain counting.

Worker.Tests-002: no test drove a COM event through the integrated
sink -> mapper -> queue path. Added MxAccess/MxAccessBaseEventSinkTests.cs —
5 facts driving OnDataChange, OnWriteComplete, OperationComplete and
OnBufferedDataChange through a real MxAccessBaseEventSink + mapper + queue and
asserting the converted WorkerEvent lands in MxAccessEventQueue. The four COM
event handlers were widened private -> internal and InternalsVisibleTo for
MxGateway.Worker.Tests was added, mirroring MxAccessAlarmEventSink's existing
test seam; no worker behavior changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:07:48 -04:00
Joseph Doherty 53e3973209 Resolve Worker-001, Worker-002, Worker-003 code-review findings
Worker-001: WnWrapAlarmConsumer armed a System.Threading.Timer whose OnPoll
callback ran GetXmlCurrentAlarms2 on a thread-pool thread against the
Apartment-threaded wnwrap COM object, which can deadlock on cross-apartment
marshaling. Removed the pollTimer/pollIntervalMs fields, OnPoll, the
poll-interval constructor parameter, and the timer arm/disposal. Polls are
driven externally by the STA via StaRuntime.InvokeAsync(PollOnce).

Worker-002: RunHeartbeatLoopAsync delayed a full HeartbeatInterval before
the first heartbeat. Restructured so the first beat is sent immediately on
entering the loop and the delay applies only between subsequent beats.

Worker-003: ProcessCommandAsync silently returned without a reply when
_state was not a command-serving state after dispatch. Both drop sites now
log a WorkerCommandResultDropped diagnostic with correlation_id via
IWorkerLogger; _state is now volatile.

Three pre-existing tests that asserted strict frame ordering were updated to
tolerate an interleaved first heartbeat (Worker-002 consequence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:59:46 -04:00
Joseph Doherty e967e85973 Resolve Client.Go-001 code-review finding
MxAccessError.Unwrap returned e.Command directly; on the HRESULT-only path
Command is a nil *CommandError, so Unwrap returned a non-nil error wrapping
a typed nil and errors.As bound a nil *CommandError. Unwrap now returns an
untyped nil when Command is nil. Added errors_test.go regression coverage
for the HRESULT-only and populated-Command paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:46:12 -04:00
Joseph Doherty bc55396334 Resolve IntegrationTests-001 and IntegrationTests-002 code-review findings
IntegrationTests-001: documented the live Galaxy Repository test suite and
its MXGATEWAY_RUN_LIVE_GALAXY_TESTS / MXGATEWAY_LIVE_GALAXY_CONN gating in
docs/GatewayTesting.md.

IntegrationTests-002: documented the live LDAP test suite in
docs/GatewayTesting.md and added a concrete "Provisioning the GwAdmin group"
step to glauth.md so DashboardLdapLiveTests' GwAdmin-membership assumption
is reproducible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:46:09 -04:00
Joseph Doherty b381bfcaf1 Resolve Tests-001 and Tests-002 code-review findings
Tests-001: FakeSessionManager.TryGetSession unconditionally synthesized a
session, so Invoke_WhenSessionMissing_ThrowsNotFound did not actually
verify the missing-session path. Added ResolveOnlySeededSessions/SeedSession
to the fake, rewrote the missing-session test, and added seeded-resolution
and alarm-RPC missing-session coverage.

Tests-002: re-triaged. GalaxyRepository issues only constant SQL; filters
are applied in-memory by GalaxyHierarchyProjector/GalaxyGlobMatcher. Kept
as a valid coverage gap and added GalaxyFilterInputSafetyTests exercising
filter/glob input safety directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:46:02 -04:00
Joseph Doherty 2a635c8522 Add code-reviews/prompt.md orchestration prompt
Reusable prompt for working the code-reviews/ backlog: batches one
subagent per module, TDD per finding, per-module commits, regenerates
the index. Adapted to mxaccessgw toolchains and module layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:18:39 -04:00
Joseph Doherty 9082e504a9 Mark Client.Rust findings resolved
All eleven Client.Rust findings are fixed in 0d8a28d; their Status is
now Resolved with the fixing commit recorded. Adds Client.Rust-012 —
an additional clippy::clone_on_copy violation in galaxy.rs found while
verifying that `cargo clippy -- -D warnings` passes — already Resolved
in the same commit. Regenerates code-reviews/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:10:30 -04:00
Joseph Doherty 0d8a28d2fe Fix all MxGateway.Client.Rust code-review findings
Resolves Client.Rust-001 through Client.Rust-011.

Build/test/clippy gate (Client.Rust-001/002/003):
- options.rs: doc comments on with_max_grpc_message_bytes /
  max_grpc_message_bytes (#![warn(missing_docs)])
- session.rs: rename BulkReplyKind variants to drop the shared `Bulk`
  suffix (clippy::enum_variant_names)
- galaxy.rs: deref instead of clone on Option<Timestamp>
  (clippy::clone_on_copy — an extra violation the gate also hit)
- mxgw-cli: assert version_json against GATEWAY/WORKER_PROTOCOL_VERSION
  constants instead of the stale literal 2
`cargo clippy --workspace --all-targets -- -D warnings` now passes.

Correctness / error handling:
- version.rs: CLIENT_VERSION = env!("CARGO_PKG_VERSION") (Client.Rust-004)
- session.rs: register/add_item/add_item2 handle extractors and
  bulk_results now return Err(Error::MalformedReply) instead of a
  silent 0 / empty vec on a shapeless OK reply (Client.Rust-005/006)
- error.rs: new Error::Unavailable classifies Code::Unavailable /
  ResourceExhausted as transient (Client.Rust-010)
- session.rs: per-call unique correlation ids via an atomic counter
  (Client.Rust-011)

Other:
- value.rs: MxValue/MxArrayValue compute the projection on demand
  instead of caching it, so a wire-only value pays no projection cost
  (Client.Rust-008)
- RustClientDesign.md: correct the crate layout, drop the unused
  `tracing` dependency (Client.Rust-007)
- client_behavior.rs: tests for the bulk-size cap, a mid-stream status
  fault, and the unreadable-CA-file path (Client.Rust-009)

cargo fmt / test --workspace (27 tests) / clippy all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:08:55 -04:00
Joseph Doherty f0a4af62b9 Review the clients/ language clients; mark Server-001/003 resolved
Adds per-module code reviews for the five language clients under
clients/ (Client.Dotnet, Client.Go, Client.Java, Client.Python,
Client.Rust) at commit 3cc53a8 — 53 findings (4 High, 15 Medium,
34 Low; all Open). Extends REVIEW-PROCESS.md so a "module" may also be
a language client under clients/, not only a src/ project.

Marks Server-001 (Critical) and Server-003 (High) Resolved — fixed in
a8aafdf — and regenerates code-reviews/README.md (now 11 modules).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:51:00 -04:00
Joseph Doherty a8aafdf974 Enforce dashboard authorization on all component routes
Fixes code-review findings Server-001 (Critical) and Server-003 (High).

Server-001: the dashboard Razor components were mapped with no
authorization policy, so every dashboard page — including the API Keys
page — was reachable unauthenticated. MapRazorComponents<App>() now
requires DashboardAuthenticationDefaults.AuthorizationPolicy;
unauthenticated requests are challenged by the cookie scheme and
redirected to the login page.

Server-003: DashboardAuthenticator.CreatePrincipal never issued the
'scope' claim that DashboardAuthorizationHandler checks when
Dashboard:RequireAdminScope is enabled, so enforcing the policy would
have denied every LDAP login. CreatePrincipal (reached only after the
required-group check passes) now emits the admin scope claim.

Replaces the GatewayApplicationTests case that asserted dashboard
routes allow anonymous access — it encoded the bug as expected
behavior — with tests that verify component routes require the policy
and the login/logout/denied endpoints allow anonymous.

All 309 MxGateway.Tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:45:29 -04:00
Joseph Doherty 3cc53a8c69 Harden code-review tooling and align REVIEW-PROCESS.md with mxaccessgw
- regen-readme.py: use `python` not the broken `python3` Store alias in
  the generated note and docstring; --check now also fails when a module
  header's "Open findings" count disagrees with finding statuses or a
  finding has an unrecognised Status (find_inconsistencies)
- REVIEW-PROCESS.md: rewritten for mxaccessgw (was describing ScadaLink)
  — MxGateway.* modules, "mxaccessgw conventions" checklist category,
  gateway.md/docs/ design context, `python` command
- scripts/check-code-reviews-readme.ps1: CI/pre-commit wrapper for
  regen-readme.py --check
- code-reviews/test_regen_readme.py: dependency-free parser tests
- code-reviews/README.md: regenerated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:36:25 -04:00
Joseph Doherty ae164ea34f Add per-module code review tree under code-reviews/
Set up the code review process scaffolding adapted to mxaccessgw and
record a full per-module review of every src/MxGateway.* project at
commit 6c64030.

- code-reviews/_template/findings.md: per-module findings template
- code-reviews/regen-readme.py: generates README.md from findings.md
  files; --check fails if stale
- code-reviews/<Module>/findings.md: reviews for Contracts, Server,
  Worker, Tests, Worker.Tests, IntegrationTests (74 findings:
  1 Critical, 10 High, 23 Medium, 40 Low; all Open)
- code-reviews/README.md: generated cross-module index
- REVIEW-PROCESS.md: review process document

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:28:50 -04:00
Joseph Doherty 6c640306e5 Merge branch 'fix/alarm-sta-wiring'
Close the two alarm production gaps: wire alarmCommandHandlerFactory in
WorkerPipeSession, and drive WnWrapAlarmConsumer.PollOnce from the STA
instead of a threadpool timer.
2026-05-18 06:31:05 -04:00
Joseph Doherty a67a5a4857 fix(worker): wire alarm command handler and STA poll loop (Gap 1 + Gap 2)
Gap 1 — WorkerPipeSession now passes `eq => new AlarmCommandHandler(eq)` as
the alarmCommandHandlerFactory in all three places it constructs
MxAccessStaSession (two convenience constructors and InitializeMxAccessAsync).
Previously the parameterless MxAccessStaSession() set the factory to null,
so every SubscribeAlarms / AcknowledgeAlarm / QueryActiveAlarms command
returned "alarm consumer not configured" in a deployed worker.

  - Added internal `MxAccessStaSession(Func<MxAccessEventQueue, IAlarmCommandHandler>?)`
    constructor that builds all defaults but accepts a factory.
  - Added public `MxAccessStaSession(StaRuntime, factory, eventQueue, alarmFactory?)`
    4-arg overload to complete the constructor chain.

Gap 2 — WnWrapAlarmConsumer now disables its internal threadpool Timer
(pollIntervalMilliseconds=0 in the default constructor). MxAccessStaSession
starts a `RunAlarmPollLoopAsync` background task that sleeps off-STA then
calls `staRuntime.InvokeAsync(() => handler.PollOnce())` at 500ms intervals.
This satisfies the ThreadingModel=Apartment requirement of wwAlarmConsumerClass:
every GetXmlCurrentAlarms2 call now runs on the worker's STA.

  - Added `PollOnce()` to `IMxAccessAlarmConsumer`, `AlarmDispatcher`,
    `IAlarmCommandHandler`, and `AlarmCommandHandler`.
  - Poll loop cancelled and awaited before alarm handler disposal in both
    ShutdownGracefullyAsync and Dispose.

Tests: 4 new tests in MxAccessStaSessionTests verify that
  - SubscribeAlarms reaches the handler when the factory is wired (Gap 1)
  - SubscribeAlarms returns InvalidRequest without a factory (regression guard)
  - PollOnce is called on the STA thread within 3s (Gap 2)
  - The poll loop stops after Dispose (Gap 2 lifecycle)
All fake IMxAccessAlarmConsumer / IAlarmCommandHandler test implementations
updated with no-op PollOnce() to satisfy the new interface member.

Worker tests: 199 passed / 1 pre-existing failure / 4 skipped (was 195/1/4).
Server tests: 308 passed / 0 failures (unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 06:30:14 -04:00
Joseph Doherty e00ee61cf0 Place Last Refresh next to Last Deploy on the Galaxy page
Group the two double-width timestamp cards at the start of the metric
row so the deploy/refresh pair reads together, ahead of the count cards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:20:30 -04:00
Joseph Doherty 271bf7edff Give Galaxy timestamp cards double-width boxes
The Last Deploy and Last Refresh metric cards hold full timestamps that
wrapped to three or four lines in a single-width card. Add a Wide option
to MetricCard (grid-column: span 2) and set it on both Galaxy timestamp
cards. Also switch .metric-value to overflow-wrap: break-word so a date
token is never split mid-value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:16:56 -04:00
Joseph Doherty 3397e99783 Document the dashboard API Keys management page
The dashboard's API Keys page (list plus Create/Rotate/Revoke and the
create dialog) had no design-doc coverage even though Authorization.md
already documents the constraint model it exposes. Add an "API keys
page" section to GatewayDashboardDesign.md describing the table columns,
the LDAP-group-gated management actions, the one-time secret reveal, and
audit logging. Cross-link it from the constraint-enforcement section of
Authorization.md and the CLI section of Authentication.md so the two
key-management surfaces reference each other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:07:35 -04:00
Joseph Doherty f598b3a647 Stop dashboard table cells breaking whole words
overflow-wrap: anywhere let the table layout shrink a column below its
longest word, splitting tokens like "unconstrained" mid-word. Switch to
break-word so a word only breaks when it genuinely cannot fit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:56:55 -04:00
Joseph Doherty 509b0118d4 Keep dashboard button labels on a single line
Bootstrap 5 .btn no longer pins white-space, so the Rotate/Revoke action
buttons broke mid-word in the narrow API Keys actions column. Pin .btn
to white-space: nowrap so a label always reads as one word.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:54:53 -04:00
Joseph Doherty 298836d2f3 Keep dashboard status chips on a single line
Status chips wrapped to two lines in narrow table columns (e.g. the
session State column). Pin .chip to white-space: nowrap so an enumerated
state always reads as one token.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:19:36 -04:00