Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document
  the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
  CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
  in generated tonic code by reformatting the ReadBulkCommand proto
  comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
  make DisposeAsync race-safe against in-flight CloseAsync (-016);
  add constraint-enforcement test coverage for the bulk-plan path
  (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
  can distinguish graceful shutdown from a real STA-affinity
  violation (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow
  ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
  11 GatewaySession bulk methods (-013); replace the real TCP probe
  in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
  (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
  test and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
  abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
  beforeStart() (-014); return a CancellingCompletableFuture that
  actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
  the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for
  the five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
  command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the
  duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
  XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
  three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real TaskCompletion
  source; switch the heartbeat-expires test to ManualTimeProvider;
  add InvariantCulture to the remaining DateTimeOffset.Parse sites;
  document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default
  connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
  so absent env vars SKIP not pass; PascalCase rename of probe
  [Fact]s; deterministic deadline test; new frame-protocol error
  tests; ComputeTransitions diff-coverage; relocate dev-rig probes
  to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport
  faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in
  writeUsage; drain galaxy-watch events on limit; switch io.EOF
  comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
  option; regex-based credential redaction; Long.toUnsignedString
  for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
  stop hard-coding correlation IDs; resync RustClientDesign.md
  with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-20 09:46:47 -04:00
parent 1cd51bbda3
commit a0203503a7
122 changed files with 8723 additions and 757 deletions
+271 -12
View File
@@ -4,25 +4,29 @@
|---|---|
| Module | `clients/python` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `3cc53a8` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
A re-review at commit `1cd51bb` over the same module. Prior findings
(Client.Python-001 — Client.Python-012) remain closed and are kept as
history. This section reflects categories evaluated in this pass.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: dead `closed` variable (Client.Python-004); float/bytes value-mapping assumptions (Client.Python-008). |
| 2 | mxaccessgw conventions | Largely adheres; one missing export and a `*_raw` MXAccess-failure documentation gap (Client.Python-002, Client.Python-012). |
| 3 | Concurrency & thread safety | Issue found: `close()` idempotency claim does not hold under concurrent close (Client.Python-006). |
| 4 | Error handling & resilience | Issues found: inconsistent timeout-kwarg fallback (Client.Python-003); `success == 0` default-value hazard (Client.Python-011); inconsistent cancel helpers (Client.Python-007). |
| 5 | Security | No issues found — API keys redacted in repr and CLI output, TLS supported, no secret logging. |
| 6 | Performance & resource management | Issue found: `discover_hierarchy` buffers the whole hierarchy in memory (Client.Python-005). |
| 7 | Design-document adherence | Matches the design docs closely; minor CLI doc drift (Client.Python-001). |
| 8 | Code organization & conventions | Issues found: `MxGatewayCommandError` omitted from `__all__` (Client.Python-002); fragile circular-import workaround (Client.Python-010). |
| 9 | Testing coverage | Issue found: `write2`, `add_item2`, bulk-size limits, TLS `ca_file`, and CLI command bodies untested (Client.Python-009). |
| 10 | Documentation & comments | Issue found: stale "scaffold" package description (Client.Python-001). |
| 1 | Correctness & logic bugs | Issue found: `_use_plaintext` silently downgrades any `localhost:` / `127.0.0.1:` endpoint to plaintext (Client.Python-013). |
| 2 | mxaccessgw conventions | No new issues found — secrets redacted, MXAccess parity preserved, generated code untouched, no Blazor/COM violations apply (Python client). |
| 3 | Concurrency & thread safety | No new issues foundclose-idempotency hazard fixed in Client.Python-006, shared `_canceling_iterator` cancels on `CancelledError`. |
| 4 | Error handling & resilience | No new issues found at this commit (prior 003, 007, 011 remain closed). |
| 5 | Security | Issue found: implicit plaintext-on-localhost (Client.Python-013) means a user explicitly listing a TLS-fronted loopback endpoint with `--api-key` but without `--tls`/`--plaintext` silently transmits the bearer token in cleartext. |
| 6 | Performance & resource management | No new issues found `iter_hierarchy` streams pages lazily (Client.Python-005 resolution). |
| 7 | Design-document adherence | No new issues found — `PythonClientDesign.md` matches the implemented surface. |
| 8 | Code organization & conventions | Issue found: duplicate `from mxgateway.values import` lines in `commands.py:22-23` (Client.Python-014). |
| 9 | Testing coverage | Issues found: `bench_read_bulk` CLI body, `MAX_AGGREGATE_EVENTS` event-cap, and `_use_plaintext` localhost-auto-plaintext path are untested (Client.Python-015, Client.Python-016). |
| 10 | Documentation & comments | Issues found: `pyproject.toml` lacks PyPI metadata (`authors`, `license`, `classifiers`, `urls`) and no PEP 561 `py.typed` marker (Client.Python-017); auto-plaintext behaviour is undocumented (Client.Python-013). |
## Findings
@@ -205,3 +209,258 @@
**Recommendation:** Document explicitly (README + docstring) that `*_raw` methods surface MXAccess HRESULT/status failures only inside the reply and do not raise `MxAccessError`, so parity-test callers know to inspect `protocol_status`/`hresult`/`statuses` themselves.
**Resolution:** 2026-05-18 — Won't Fix (no behaviour change). Confirmed this is intentional, correct parity behaviour: the `*_raw` methods exist precisely so parity-test callers can inspect an unmodified gateway reply, including embedded MXAccess HRESULT/status failures, without an exception masking them. Changing `invoke_raw` to raise `MxAccessError` would defeat its purpose and duplicate `Session.invoke`. The finding's only actionable point is the documentation gap, which has been addressed: `clients/python/README.md` now states explicitly that `*_raw` methods enforce gateway protocol success only and do **not** run MXAccess-failure detection, and the docstrings of `GatewayClient.invoke_raw` and `Session.invoke_raw` say the same and point callers to inspect `protocol_status`/`hresult`/`statuses` (and to `Session.invoke` for the checked variant). No code/test change — the runtime contract is unchanged and correct.
### Client.Python-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `clients/python/src/mxgateway_cli/commands.py:757-762` |
| Status | Resolved |
**Description:** `_use_plaintext` silently returns `True` whenever the endpoint
string starts with `localhost:` or `127.0.0.1:`, even if neither `--plaintext`
nor `--tls` is supplied on the command line. Any CLI subcommand (e.g.
`mxgw-py open-session --endpoint localhost:5001 --api-key mxgw_<secret>`) then
attaches the API key to a plaintext gRPC channel without warning. This is a
silent security downgrade: a user who deliberately ran the gateway behind TLS
on loopback (e.g. for testing a production-shaped TLS config locally) and who
passes `--api-key` expecting the secret to be transport-protected gets a
plaintext bearer token instead. The auto-downgrade is also undocumented —
`README.md` and the CLI `--help` text both describe `--plaintext` and `--tls`
as the controls, with no mention that endpoint-prefix matching can override
either. The other client CLIs do not auto-downgrade: the .NET CLI uses
`https://`-prefix detection on a URI scheme (an explicit signal), Go and Java
require an explicit `--plaintext`/`--tls` choice, and Rust defaults to
plaintext only when `plaintext = true` is set on the options struct.
**Recommendation:** Drop the localhost-prefix auto-plaintext branch and
require the user to pass `--plaintext` or `--tls` (or default to TLS to match
the rest of the matrix). If the implicit-localhost behaviour is kept for
ergonomics, document it prominently in both `README.md` and `--help`, emit a
stderr warning when `--api-key` is combined with the auto-downgrade path, and
add a CLI test asserting the auto-downgrade is in fact active so it is not
silently lost in a future refactor.
**Resolution:** 2026-05-20 — Removed the silent `localhost:` / `127.0.0.1:`
auto-plaintext branch from `_use_plaintext`. The new contract matches the Go
and Java CLIs: **TLS is the default**, `--plaintext` is the only way to opt
in to an unencrypted channel, and `--tls` is accepted as a redundant, explicit
affirmation of the default (mutually exclusive with `--plaintext`, which now
raises `click.UsageError`). The `--plaintext` / `--tls` `--help` text and
`clients/python/README.md` both call out the new behaviour. Added six
regression tests in `clients/python/tests/test_cli.py` covering: (a) a
`localhost:` endpoint with no flags resolves to TLS, (b) a `127.0.0.1:`
endpoint with no flags resolves to TLS, (c) `--plaintext` opts in to plaintext,
(d) `--tls` is accepted and idempotent with the default, (e) `--plaintext`
combined with `--tls` is rejected, and (f) an end-to-end CliRunner test
asserting `ClientOptions.plaintext == False` flows through to
`GatewayClient.connect` when no flag is supplied against a `localhost:`
endpoint. **Behaviour change for callers:** scripts that previously relied on
`mxgw-py … --endpoint localhost:5000 …` selecting plaintext silently must now
add an explicit `--plaintext` flag (or set up TLS on the gateway). Calling
`mxgw-py` with an `--api-key` against a plaintext-only gateway without
`--plaintext` will now fail to connect rather than silently leaking the bearer
token.
### Client.Python-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/python/src/mxgateway_cli/commands.py:22-23` |
| Status | Resolved |
**Description:** `commands.py` has two consecutive `from mxgateway.values
import` lines:
```python
from mxgateway.values import to_mx_value
from mxgateway.values import MxValueInput
```
These import from the same module and should be combined into a single
`from mxgateway.values import MxValueInput, to_mx_value`. The split form is
inconsistent with the rest of the file (every other module is imported in a
single statement) and would be flagged by `ruff`/`isort` if any linter were
configured. Pure style, no behavioural impact.
**Recommendation:** Collapse the two imports into one statement, ordered to
match the conventional alphabetical-within-module pattern:
`from mxgateway.values import MxValueInput, to_mx_value`.
**Resolution:** 2026-05-20 — Collapsed the two consecutive
`from mxgateway.values import to_mx_value` / `from mxgateway.values import MxValueInput`
lines in `clients/python/src/mxgateway_cli/commands.py` into a single
`from mxgateway.values import MxValueInput, to_mx_value` statement, matching
the alphabetical-within-module pattern used elsewhere in the file. Pure style
fix — no behavioural impact, covered by the existing CLI tests.
### Client.Python-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/python/src/mxgateway_cli/commands.py:273-294,564-647`, `clients/python/tests/` |
| Status | Resolved |
**Description:** `_bench_read_bulk` is a ~80-line CLI body that opens its own
session, registers, subscribe_bulks, runs a warm-up loop, a measurement loop,
collects per-call latencies, computes a percentile summary, and emits the
shared cross-language JSON schema. It is the largest untested CLI command in
the module — `tests/` has no `bench_read_bulk` test, fake-stub-driven or
otherwise. A drift in the schema field names (`callsPerSecond`,
`cachedReadResults`, `latencyMs.p50`, …) would break the cross-language
`scripts/bench-read-bulk.ps1` aggregation silently. `_percentile_summary` and
`_percentile` are also untested — the boundary cases (`n == 0`, `n == 1`,
quantile interpolation) would benefit from a small unit test since the
identical algorithm is duplicated in the .NET / Go / Rust / Java drivers and
a divergence would corrupt cross-language comparisons.
**Recommendation:** Add a fake-stub-driven `bench_read_bulk` test that drives
a short `--duration-seconds 0 --warmup-seconds 0` run through `CliRunner` and
asserts the JSON schema (`language == "python"`, the full key set,
`latencyMs.p50/p95/p99/max/mean` present). Add unit tests for `_percentile`
covering `n == 0`, `n == 1`, and a known-good interpolated value at p95 so
the implementation cannot silently drift from the other clients.
**Resolution:** 2026-05-20 — Added `clients/python/tests/test_cli_bench_and_helpers.py`
with three layers of coverage. (1) `_percentile` unit tests pin the
cross-language algorithm (`rank = q * (n - 1)`, linear interpolation between
adjacent ranks): empty sample returns `0.0`, single element returns that
element, exact-rank queries return the sample value (p50 of `[10,20,30,40,50]`
is `30.0`), and the interpolated p95/p99 values (`48.0` / `49.6` for that same
five-element sample) are locked down so any drift from the .NET / Go / Rust /
Java drivers fails fast. (2) `_percentile_summary` tests assert the full
`{p50, p95, p99, max, mean}` dict shape, the zero-sample placeholder, and the
3-decimal rounding contract. (3) A `bench-read-bulk` smoke test
(`test_bench_read_bulk_emits_cross_language_schema`) drives the CLI through
`CliRunner` with `--duration-seconds 0 --warmup-seconds 0` against a fake stub
that handles `OpenSession`, `Register`, `SubscribeBulk`, `ReadBulk`, and
`UnsubscribeBulk`, then asserts the emitted JSON has exactly the 16
cross-language schema keys (`language`, `command`, `endpoint`, `clientName`,
`bulkSize`, `durationSeconds`, `warmupSeconds`, `durationMs`, `tags`,
`totalCalls`, `successfulCalls`, `failedCalls`, `totalReadResults`,
`cachedReadResults`, `callsPerSecond`, `latencyMs`) and that `latencyMs` is a
`{p50, p95, p99, max, mean}` sub-object — guarding against silent breakage of
`scripts/bench-read-bulk.ps1`'s cross-language aggregation. No source change —
this is a pure coverage finding.
### Client.Python-016
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/python/src/mxgateway_cli/commands.py:25,757-775,805-830` |
| Status | Resolved |
**Description:** Three CLI helper paths are not covered by `tests/`:
1. `_use_plaintext` localhost auto-downgrade (line 762) — the
`endpoint.startswith("localhost:") or endpoint.startswith("127.0.0.1:")`
branch (see also Client.Python-013) is untested; no test asserts that an
endpoint without `--plaintext` and without `--tls` resolves to plaintext.
2. `_collect_events` `MAX_AGGREGATE_EVENTS` guard (line 811-815) — passing
`--max-events` greater than `MAX_AGGREGATE_EVENTS` raises
`click.BadParameter`, but no test exercises the guard. A silent removal of
the constant or the comparison would not be caught.
3. `_api_key_from_env` (line 765-768) — only the implicit path through
`_secrets` is exercised; there is no test that verifies an env-var name
resolves to a value and that an unset env var produces `None`.
These are all small, fake-stub-driven CLI behaviours rather than end-to-end
paths. The previous coverage finding (Client.Python-009) closed without
adding tests for these specific paths.
**Recommendation:** Add three small `CliRunner` / unit tests: one asserting
the localhost auto-plaintext (or its replacement, if Client.Python-013 is
fixed), one asserting `--max-events 10001` exits non-zero with the
`MAX_AGGREGATE_EVENTS` error message, and one asserting
`_api_key_from_env("MXGATEWAY_API_KEY")` returns the env value and `None` for
an unset variable.
**Resolution:** 2026-05-20 — Scope adjusted: Client.Python-013 has since
removed the `_use_plaintext` localhost auto-plaintext branch, so item (1) is
no longer a real code path — the
`test_use_plaintext_requires_explicit_flag_for_localhost_endpoint` and
`test_cli_localhost_endpoint_defaults_to_tls_via_open_session` regressions
added under Client.Python-013 already pin the new TLS-by-default contract.
The remaining two helpers are now covered in
`clients/python/tests/test_cli_bench_and_helpers.py`. (2)
`MAX_AGGREGATE_EVENTS` cap:
`test_collect_events_rejects_max_events_above_aggregate_cap` drives
`stream-events` with `--max-events 10001` through `CliRunner` against
stubbed `_connect` / `_session` fakes and asserts the CLI exits non-zero with
the documented `less than or equal to 10000` message;
`test_collect_events_accepts_max_events_at_aggregate_cap_boundary` confirms
`--max-events 10000` is accepted at the boundary and returns an empty event
list. (3) `_api_key_from_env`:
`test_api_key_from_env_resolves_value_when_variable_is_set` (env-var
populated → returned),
`test_api_key_from_env_returns_none_when_variable_is_unset` (env-var unset
`None`), `test_api_key_from_env_returns_none_when_name_is_none` (the
`name is None` early-return), and
`test_api_key_from_env_returns_none_when_name_is_empty_string` (the
`if not name` truthiness guard). No source change — pure coverage finding.
### Client.Python-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `clients/python/pyproject.toml:5-25`, `clients/python/src/mxgateway/` |
| Status | Resolved |
**Description:** The package metadata in `pyproject.toml` is minimal for a
published wheel:
* No `authors` field. PyPI / `pip show` will display no author.
* No `license` field, no `license-files` field, and no `LICENSE` file is
referenced from the project. The repo as a whole has no top-level
`LICENSE` either, but other client packages (Java has a license entry, the
.NET package has a license expression in the `csproj`) tend to set this.
* No `classifiers` (no `Programming Language :: Python :: 3.12`,
`Operating System :: Microsoft :: Windows`, `Topic :: …`, no
development-status classifier). Without these the PyPI search facets are
empty and tooling like `pip` cannot tell whether the package is
alpha/beta/stable.
* No `keywords`, no `[project.urls]` (no homepage / source / issue link
pointing back to the repo).
* The package ships no PEP 561 `py.typed` marker file in
`src/mxgateway/`. Type hints are written throughout the module
(`from __future__ import annotations`, full annotations on every public
function), but downstream consumers running `mypy` on `mxaccess-gateway-client`
will not see those hints — PEP 561 requires the marker file to opt the
package into type-stub distribution.
**Recommendation:** Add `authors`, `license = "<spdx>"`, `keywords`, and
`[project.urls]` to `pyproject.toml`; add at least the standard `classifiers`
trio (`Development Status`, `Programming Language :: Python :: 3.12`,
`Intended Audience`); create an empty `src/mxgateway/py.typed` file and
include it in the wheel via `[tool.setuptools.package-data]` so consumers
running `mypy` against an installed wheel pick up the type information.
**Resolution:** 2026-05-20 — Filled out `clients/python/pyproject.toml`
with the missing PyPI metadata: `authors = [{ name = "MXAccess Gateway
Authors" }]`, `license = "Proprietary"` (the repo has no top-level
`LICENSE` file and no other client publishes under an OSS licence, so the
SPDX `Proprietary` expression matches the de-facto status), the standard
classifier set (`Development Status :: 4 - Beta`, `Intended Audience ::
Developers` / `Information Technology`, `Operating System :: Microsoft ::
Windows` and `:: POSIX`, `Programming Language :: Python` /
`Python :: 3` / `Python :: 3.12`, `Topic :: Software Development ::
Libraries :: Python Modules`, `Topic :: System :: Distributed Computing`,
and `Typing :: Typed`), a `keywords` list
(`mxaccess`, `archestra`, `gateway`, `grpc`, `industrial`, `scada`), and
`[project.urls]` with `Homepage` / `Source` / `Issues` pointing at the
Gitea repo. Added the PEP 561 marker file
`clients/python/src/mxgateway/py.typed` (empty, as the spec requires) and
declared it in `[tool.setuptools.package-data] mxgateway = ["py.typed"]`
so the wheel ships the marker and downstream `mypy` users see the
inline type hints. Pure metadata / packaging change — `python -m pytest -q`
still passes (91 tests).