fe2a6db786
rust / build / test / clippy / fmt (push) Has been cancelled
Layout:
- src/ .NET 10 x64 reference: MxNativeCodec, MxNativeClient,
MxAsbClient, probes, tests, harnesses. Executable spec.
- design/ Architectural plan for the Rust port (M0–M6), error
model, protocol invariants, risks (R1–R16), adversarial
review log (review.md).
- rust/ Rust workspace. M0 skeleton + M1 codec parity.
mxaccess-codec: 215 unit tests + 2 cross-implementation
parity tests (byte-identical against .NET reference).
Other crates are M0 stubs awaiting M2+.
- captures/ Frida + netsh + pcap evidence per CLAUDE.md
("captures are evidence, not throwaway logs").
- analysis/ Decompiled C# (frida/proxy/decompiled-*),
Ghidra exports for native DLLs (`exports/` only —
working state at `projects/` and AVEVA's input
binaries at `input/` are gitignored).
- docs/ Reverse-engineering reference docs.
- tools/ Setup-LiveProbeEnv.ps1 (Infisical credential fetcher),
Compute-Crc.ps1 (.NET parity helper).
- .github/workflows/ Rust CI: fmt + build + test + clippy on Windows.
- LICENSE MIT (Joseph Doherty, 2026).
Verified:
- cargo test --workspace → 217 passed (215 unit + 2 .NET parity), 0 failed
- cargo clippy --workspace -- -D warnings → clean
- cargo fmt --all -- --check → clean
- cargo publish --dry-run -p mxaccess-codec → packages cleanly
Excluded from history (see .gitignore):
- **/bin, **/obj, **/target — build artifacts
- analysis/ghidra/projects/ — Ghidra working state (regenerable)
- analysis/ghidra/input/ — AVEVA proprietary DLLs (vendor IP)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
322 lines
31 KiB
Markdown
322 lines
31 KiB
Markdown
# Risks and open questions
|
||
|
||
This is the punch list of things that could break or are unproven. Each entry is tagged R(isk) or Q(uestion), with a current best answer and what would settle it.
|
||
|
||
## Protocol-level
|
||
|
||
### R1 — net.tcp / WCF framing **and binary message encoding** complexity
|
||
|
||
**Severity: P0** (project-blocker — entire ASB data plane, ~3000 LoC)
|
||
|
||
The .NET reference uses `System.ServiceModel.NetTcpBinding` for ASB (`src/MxAsbClient/MxAsbDataClient.cs:663`: `new NetTcpBinding(SecurityMode.None)` with no message-encoding override). With no override, WCF defaults to the **binary message encoder** — i.e. .NET Binary XML ([MC-NBFX]) with a static dictionary lookup ([MC-NBFS]) — **not** SOAP/XML. There is no Rust port of WCF, and `quick-xml` (or any other XML toolkit) is **not sufficient** to read or write these payloads: the body bytes are tokenised binary nodes that reference dictionary string IDs.
|
||
|
||
So the hand-rolled scope is two layers, not one:
|
||
|
||
1. **Framing** per [MS-NMF] (record types: preamble, preamble-ack, sized-envelope, end, fault) plus the reliable-session ack handling on the underlying `net.tcp` channel.
|
||
2. **Message encoding** per [MC-NBFX] (binary XML node tokens, length-prefixed strings, prefixed/typed attributes, end-element markers) **plus** [MC-NBFS] (the static dictionary that holds the SOAP/WS-Addressing/`IASBIDataV2`-action strings the encoder references by ID instead of inlining).
|
||
|
||
**Options:**
|
||
1. Hand-roll both framing ([MS-NMF]) and binary message encoding ([MC-NBFX] + [MC-NBFS]). Estimate ~3000 LoC across both layers (the encoder/dictionary work is the majority — framing alone is ~1500 LoC; the binary XML codec, dictionary tables, and operation-action mapping are roughly the same again).
|
||
2. Switch ASB to its HTTP variant if the deployed AVEVA instance supports it (this would let us use a normal text SOAP/XML stack and skip both [MS-NMF] and [MC-NBFX]/[MC-NBFS] entirely).
|
||
3. Wrap the .NET ASB DLL in a process and call it via stdin/stdout JSON-RPC.
|
||
|
||
**Current best answer:** option 1 (hand-roll both layers). The two specs are public, the encoder is deterministic, and the .NET reference's `AsbMessageDumpBehavior` already produces ground-truth byte vectors for the dictionary and operation set we use. `quick-xml` may help with any auxiliary text-XML the wider stack uses, but it cannot decode the binary-encoded message bodies — that requires the [MC-NBFX] + [MC-NBFS] codec.
|
||
|
||
**Settles when:** `mxaccess-asb-nettcp` parses every captured request/reply byte-identical to the .NET reference's `IClientChannel` payload dump for the proven type matrix, including correct dictionary-ID resolution and round-trip of every observed binary XML node tag.
|
||
|
||
### R2 — Buffered subscription is delivery cadence, not multi-sample payloads
|
||
|
||
**Severity: P3** (likely a non-issue — see verification below)
|
||
|
||
`subscribe_buffered` was originally framed as "we don't know if the codec layout for multi-sample `DataChangeBatch` is right." Verification against `wwtools/mxaccesscli/docs/api-notes.md:97-100,138-140,154-157` reverses this framing: `OnBufferedDataChange(hServer, hItem, MxDataType, value, quality, timestamp, statuses)` is **single-sample-per-event**, identical in shape to `OnDataChange`. The "buffer" is a delivery cadence — `SetBufferedUpdateInterval(ms)` collates per-tick updates and flushes them at the configured interval — **not** a multi-sample payload bundle. The native multi-sample bodies the original R2 worried about may not exist on the LMX surface at all.
|
||
|
||
**Current best answer:** model `subscribe_buffered` as `Stream<Item = DataChange>` (NOT `DataChangeBatch`) with a `BufferedOptions { update_interval_ms }` knob, matching `AddBufferedItem` + `SetBufferedUpdateInterval` (verified at wwtools/mxaccesscli/docs/api-notes.md:140). If a future capture surfaces a true multi-sample body, reopen — but the burden of proof has flipped. **Do not synthesise** multi-sample bodies; the LMX surface emits one per event.
|
||
|
||
**Settles when:** either (a) a captured `OnBufferedDataChange` event with multi-sample body bytes is observed (which would contradict the LMX docs and require codec rework), or (b) the V1 codec ships and no consumer reports missing multi-sample semantics. Default-positive: this likely settles silently as "not a real risk."
|
||
|
||
### R3 — `OperationComplete` trigger unproven
|
||
|
||
**Severity: P1** (significant blocker for OperationComplete consumers — ships verbatim, no typed promotion)
|
||
|
||
`work_remain.md:154–163`: ASB has no native OperationComplete; NMX completion-only frames have no proven mapping table. The .NET reference does not synthesise the event; the Rust port must not either.
|
||
|
||
**Current best answer:** expose `Session::operation_status_events()` as `Stream<Item = RawOperationStatus>` carrying frame bytes. Promote to a typed `WriteCompleted` only if the frame matches the proven `00 00 50 80 00` 5-byte pattern.
|
||
|
||
**Settles when:** indefinitely deferred — see Open evidence gaps table. Settle criteria depends on a Ghidra mapping table (the `aaDCT` tables in `Lmx.dll`) that does not exist in `analysis/ghidra/` and has no owner. No current artifact in this repo produces the byte→status mapping. Reopen if a future capture or decompiled output produces evidence.
|
||
|
||
### R4 — Completion-only byte mapping
|
||
|
||
**Severity: P1** (significant blocker for typed completion semantics — ships verbatim)
|
||
|
||
`0x00`, `0x41`, `0xEF` are observed as raw 1-byte completion frames (`work_remain.md:164–174`). They get preserved as `RawOperationStatus { byte: u8 }` without typed promotion.
|
||
|
||
**Current best answer:** `Session::operation_status_events()` carries `RawOperationStatus(u8)` for these. Document as "preserved verbatim until mapping table is found." Same Ghidra dependency as R3.
|
||
|
||
**Settles when:** indefinitely deferred — see Open evidence gaps table. Settle criteria depends on the same Ghidra mapping table as R3, which does not exist in `analysis/ghidra/` and has no owner. Reopen if a future capture or decompiled output produces evidence.
|
||
|
||
### R5 — Activate / Suspend behaviour
|
||
|
||
**Severity: P1** (significant blocker for Activate/Suspend consumers — surfaced as experimental)
|
||
|
||
`MxNativeCompatibilityServer.Suspend` and `Activate` return MxStatus but the trigger conditions beyond "pending/requesting" are unknown. The .NET reference does not call them on a live path.
|
||
|
||
**Current best answer:** expose `Session::suspend(item)` and `Session::activate(item)` returning `Result<MxStatus, Error>`. Document as experimental until a deployed scenario exercises them. Do not build callback-driven state transitions on top.
|
||
|
||
**Settles when:** a live capture shows the operation triggering an observable state change in `NmxSvc` plus a corresponding callback frame.
|
||
|
||
### R6 — `0x80004021` in `MxNativeSession.WriteSecuredAsync` is a .NET-reference defect, not a real LMX constraint
|
||
|
||
**Severity: P3** (formerly P1 — downgraded after `wwtools/mxaccesscli/` verification)
|
||
|
||
Original framing of this risk asserted that "`WriteSecured` (without `2`) returns `0x80004021` before sending the body" and concluded the single-token form was deprecated or rejected at the wire. That framing was wrong. Verification against `wwtools/mxaccesscli/` (a working CLI built on the production `LMXProxyServerClass` 32-bit COM proxy, i.e. the actual MxAccess surface) establishes:
|
||
|
||
- The LMX `WriteSecured` ALWAYS takes **two** user ids: `(currentUserId, verifierUserId, value)` (`wwtools/mxaccesscli/docs/api-notes.md:60-72`, `wwtools/mxaccesscli/src/MxAccess.Cli/Mx/MxItem.cs:69-70`).
|
||
- "Single-user secured write" is the same API called with `currentUserId == verifierUserId` — it is **not** a separate API surface (`wwtools/mxaccesscli/src/MxAccess.Cli/Commands/WriteCommand.cs:151-155,196-199`).
|
||
- `WriteSecured2` adds a timestamp parameter; it does **not** add a second token. The 1-vs-2 distinction in this design's earlier drafts was a confusion between "with timestamp" (Write2 vs Write) and "two-token" (which is always true).
|
||
- The `0x80004021` failure observed in `src/MxNativeClient/MxNativeSession.cs:218-221` is therefore a defect of the .NET native reimplementation, not behaviour the LMX proxy itself produces.
|
||
|
||
**Current best answer:** `mxaccess` exposes `write_secured(reference, value, current_user_id, verifier_user_id)` (no timestamp) and `write_secured_at(reference, value, timestamp, current_user_id, verifier_user_id)` (with timestamp), matching `WriteSecured` and `WriteSecured2` respectively. Both always pass two user ids; callers performing single-user secured writes pass the same id twice. The `Error::Unsupported` mapping for "single-token form" has been removed from `50-error-model.md`.
|
||
|
||
**Settles when:** the `MxNativeSession.WriteSecuredAsync` defect is fixed in the .NET reference, or a captured frame shows the LMX proxy itself producing `0x80004021` on a `WriteSecured` call (which would resurrect the original framing). Default-positive: this likely settles silently as "not a real risk."
|
||
|
||
### R7 — Status mapping for non-success ASB cases
|
||
|
||
**Severity: P2** (nice-to-have / minor — unknown bytes preserved as raw)
|
||
|
||
`work_remain.md:132–143`: live probes have not yet exercised access-denied and no-communication on the current VM. The Rust port mirrors what the .NET reference proves; remaining ASB error/quality/detail bytes are preserved as raw and surfaced through `MxStatus.detail` until a safe live capture lands.
|
||
|
||
**Current best answer:** preserve unknown payloads. Document the gap.
|
||
|
||
**Settles when:** live capture against a configured access-denied tag and a no-communication endpoint produces the expected `MxStatus` shape.
|
||
|
||
## Implementation-level
|
||
|
||
### R8 — NTLMv2 cross-domain auth
|
||
|
||
**Severity: P1** (significant blocker for cross-domain deployments — single-domain ships)
|
||
|
||
Captured traffic is single-domain (local AVEVA install). Cross-domain NTLM requires AV pair handling that has not been tested.
|
||
|
||
**Current best answer:** implement AV pair parsing per [MS-NLMP] §2.2.2.1 and document `mxaccess-rpc` as untested across domains. Provide fixtures from any successful cross-domain probe.
|
||
|
||
**Settles when:** a cross-domain probe runs successfully end-to-end with packet-integrity signatures verified.
|
||
|
||
### R9 — DPAPI dependency for ASB
|
||
|
||
**Severity: P2** (nice-to-have / minor — explicit `shared_secret` constructor is the escape hatch)
|
||
|
||
ASB shared-secret retrieval uses `ProtectedData.Unprotect` (LocalMachine scope). Linux has no DPAPI. There is no portable replacement; the secret is encrypted at rest with a Windows-specific KCV.
|
||
|
||
**Current best answer:** `mxaccess-asb` requires Windows for the credential read path. Provide an explicit `AsbCredentials::shared_secret(secret: &[u8])` constructor that bypasses DPAPI for tooling that has the secret in plaintext (e.g. CI tests, ops automation).
|
||
|
||
**Settles when:** never. DPAPI is not portable; the escape hatch is the explicit constructor.
|
||
|
||
### R10 — Galaxy SQL schema versioning
|
||
|
||
**Severity: P1** (significant blocker per affected feature — break-loud on mismatch)
|
||
|
||
The recursive CTE in `GalaxyRepositoryTagResolver.cs` assumes the current AVEVA schema. Older Galaxy versions may have different table layouts.
|
||
|
||
**Current best answer:** target the schema that ships with the AVEVA version `MxNativeClient` validates against. Document the expected schema version. Break loudly on mismatch (`ConfigError::Galaxy { reason }`).
|
||
|
||
**Settles when:** a multi-version test matrix is set up. Probably not in V1.
|
||
|
||
### R11 — x86 proxy/stub workaround
|
||
|
||
**Severity: P2** (nice-to-have / minor — integration test catches binding-shape drift)
|
||
|
||
`NmxSvcps.dll` is x86-only. The replacement strategy bypasses the in-proc proxy by speaking ORPC directly. This works because we control both Type1/Type3 marshalling and `RemQueryInterface`. But it depends on `NmxSvc` continuing to expose IPv4 NCACN_IP_TCP bindings via the OXID.
|
||
|
||
**Current best answer:** add an `mxaccess-rpc` integration test that asserts `ResolveOxid` returns at least one `ncacn_ip_tcp` binding. Fail fast if the binding shape changes in a future AVEVA release.
|
||
|
||
**Settles when:** that integration test is in CI gating.
|
||
|
||
### R12 — Performance — codec allocations
|
||
|
||
**Severity: P2** (nice-to-have / minor — micro-optimisation in M6)
|
||
|
||
The .NET reference reuses `byte[]` arrays via `MemoryPool`; the Rust port should use `bytes::Bytes` for zero-copy on receive and pre-allocate via `BytesMut` on encode. The codec currently allocates `Vec<u8>` per encode; tolerable for V1, worth optimising in M6.
|
||
|
||
**Current best answer:** use `BytesMut::with_capacity(MAX_FRAME)` per session. Bench in M6. Aim for < 5 allocations per write at steady state.
|
||
|
||
**Settles when:** `cargo bench` shows the target allocation count.
|
||
|
||
### R13 — DataUpdate `recordCount != 1` panic risk
|
||
|
||
**Severity: P1** (significant blocker for production stability — soft-error path documented)
|
||
|
||
`src/MxNativeCodec/NmxSubscriptionMessage.cs:71-74` hard-throws `ArgumentException` on any `0x33` DataUpdate whose `recordCount` is not exactly 1:
|
||
|
||
```csharp
|
||
if (recordCount != 1)
|
||
{
|
||
throw new ArgumentException("Observed NMX DataUpdate callback parser currently supports one record per body.", nameof(inner));
|
||
}
|
||
```
|
||
|
||
R2 covers the missing **fixture** for the multi-record case, but the bigger production-side risk is separate: the first time AVEVA emits a multi-record `0x33` against a deployed Rust client, the codec — if it ports the .NET behaviour faithfully — will panic / return a hard decode error and tear down the subscription. We have no fixture proving multi-record bodies don't happen on real installs; we only have evidence they haven't happened on **our** install.
|
||
|
||
**Options:**
|
||
1. Mirror the .NET reference and hard-error on `recordCount != 1`. Loud, but kills the session.
|
||
2. Surface as a typed soft error (e.g. `ProtocolError::Decode { reason: "multi-record DataUpdate not yet supported" }`), log at warn, and drop the frame. The subscription stays alive; the consumer sees a single missed update, not a teardown.
|
||
3. Speculatively decode multi-record (assume the per-record layout from the single-record case repeats) — explicitly forbidden by CLAUDE.md "Do not fabricate protocol behavior."
|
||
|
||
**Current best answer:** option 2 in Rust. Map the condition to `ProtocolError::Decode { reason: "multi-record DataUpdate not yet supported" }`, emit a `tracing::warn!` with the raw frame bytes attached as a hex field, and continue. Do **not** synthesise per-record decoding. The .NET-style hard throw stays as-is in the .NET reference (it is the executable spec, and a panic there is what produces the fixture we need — see R2). The Rust port deliberately diverges here on production-safety grounds, with the divergence documented in `50-error-model.md`.
|
||
|
||
**Settles when:** R2's multi-record fixture lands and the codec gains a proven typed decode path; then R13 collapses into "supported, no special handling" and the soft-error branch becomes dead code that can be removed.
|
||
|
||
### R14 — Fabricated `0x80004021 → StaleItem` mapping
|
||
|
||
**Severity: P1** (significant blocker — fabrication risk; corrected in `50-error-model.md`)
|
||
|
||
A draft of `50-error-model.md` mapped `HRESULT 0x80004021` to a typed `StaleItem` error category for regular (non-secured) operations. **This mapping is unevidenced.**
|
||
|
||
- R6 already covers `0x80004021` on secured-write specifically: per `wwtools/mxaccesscli/` verification, this is a `MxNativeSession.WriteSecuredAsync` defect (the .NET native reimplementation throws `NotSupportedException` before reaching the wire), **not** a real LMX-proxy constraint. The production LMX surface accepts `WriteSecured` with two user ids unconditionally. R6 explicitly does **not** generalise the .NET defect to a typed "stale" error.
|
||
- For regular operations, the actual stale-handle / invalid-arg HRESULT observed in captures is `0x80070057` (`E_INVALIDARG`). There is no captured frame, decompiled mapping table, or live probe in this repo that produces `0x80004021` on a non-secured path, and certainly none that justifies tagging it `StaleItem`.
|
||
|
||
This is a fabrication risk: the kind of "looks plausible from naming" mapping that CLAUDE.md "Do not fabricate protocol behavior" exists to prevent.
|
||
|
||
**Options:**
|
||
1. Drop the `StaleItem` category entirely. Regular-op `0x80004021`, if ever observed, falls through to the generic `Hresult { code, hint: None }` branch with the raw HRESULT preserved.
|
||
2. Keep `StaleItem` but rename the source HRESULT to `0x80070057` and require a captured fixture before promoting any frame to that category.
|
||
3. Keep the `0x80004021 → StaleItem` mapping. **Forbidden** — no evidence backs it.
|
||
|
||
**Current best answer:** option 1 for V1. Surface unknown HRESULTs as `Error::Hresult { code }` and let consumers match on the raw value. `50-error-model.md` is being corrected in parallel (review cluster 3) to remove the `StaleItem` reference; this risk register entry exists so the mistake is recorded for future contributors and not silently re-introduced when someone reaches for an ergonomic typed name.
|
||
|
||
**Settles when:** indefinitely deferred — no current artifact maps either `0x80004021` or `0x80070057` to a "stale handle" semantic, and inventing one violates the "don't fabricate protocol behaviour" rule. If a future capture or decompiled mapping table produces evidence, reopen as a typed-error proposal.
|
||
|
||
### R15 — Drop-time async cleanup hazards
|
||
|
||
**Severity: P1** (significant blocker — server-side handle leak on runtime shutdown)
|
||
|
||
`design/00-overview.md:38` states the principle "no spawn from inside Drop." `design/20-async-layer.md` and `design/50-error-model.md` describe Subscription drop semantics that fire `UnAdvise`/`UnregisterEngine` against the server. Reconciling these is non-trivial because:
|
||
|
||
- `tokio::spawn` from `Drop` panics if no Tokio runtime is current at drop time. A user dropping a `Session` from a `std::thread` after `Runtime::shutdown_timeout` returns will hit this.
|
||
- During `Runtime::shutdown_timeout`, spawned tasks are aborted before they can flush. Even if a runtime is current, spawning the cleanup from `Drop` does not guarantee the unadvise/unregister actually reaches the server — the drop call returns immediately and the spawned task may be cancelled before the bytes hit the wire.
|
||
- The result is a **server-side handle leak in `NmxSvc`**: subscriptions stay live, registered engines stay registered, until the TCP connection itself is torn down (which only happens once the kernel notices the socket is dead).
|
||
|
||
**Options:**
|
||
1. Best-effort `tokio::spawn` from `Drop`. Documented hazard. Leaks on runtime shutdown and panics on no-runtime.
|
||
2. Drop sends `UnAdvise`/`UnregisterEngine` via a `tokio::sync::oneshot` (or unbounded `mpsc`) to a long-lived connection task that owns the cleanup loop. **Drop itself never spawns** — it pushes a message onto the channel and returns. The connection task drains the channel until the TCP connection is itself dropped, at which point the server cleans up by socket close anyway.
|
||
3. Require the consumer to call `Session::shutdown(timeout).await` and document Drop as "best-effort, may leak under shutdown" — no automatic cleanup at all.
|
||
|
||
**Current best answer:** option 2. A long-lived connection task owns the cleanup channel and drains it; `Drop` pushes a `UnAdvise`/`UnregisterEngine` request onto a `tokio::sync::oneshot` (one per resource) or a per-connection unbounded `mpsc` and returns synchronously. This keeps `Drop` cheap, satisfies "no spawn from Drop," and gives the cleanup a reasonable best-effort guarantee while the connection task is alive. **Runtime-shutdown leak window remains** — if the connection task is itself aborted by `Runtime::shutdown_timeout` before draining the channel, the cleanup messages are dropped on the floor and the server-side handles remain registered until the TCP socket close is observed by `NmxSvc`. This window is documented in `50-error-model.md`'s cancellation semantics; consumers running under explicit shutdown should call `Session::shutdown(timeout).await` for deterministic cleanup. Cite `design/00-overview.md:38` (no-spawn-from-Drop principle), `design/20-async-layer.md` (Subscription drop semantics), `design/50-error-model.md` (cancellation semantics).
|
||
|
||
**Settles when:** the connection-task cleanup channel is implemented in M4, a stress test under churn confirms drop semantics on a live runtime do not leak, and the runtime-shutdown leak window is captured in a runnable test fixture (consumer drops `Session` after `Runtime::shutdown_timeout`; assert that the leak is bounded by socket-close timeout).
|
||
|
||
### R16 — Crypto/auth crate maintenance drift
|
||
|
||
**Severity: P1** (significant blocker — yank/advisory in CI breaks the build)
|
||
|
||
The auth surface area depends on a small cluster of marginal-maintenance crates. `design/30-crate-topology.md:130` pins `rc4`, `sha-1`, `md-5`, `num-bigint`; `design/10-raw-layer.md:252` instructs "Do not pull `ring` — hand-roll MD4." Of these:
|
||
|
||
- `rc4` is at minimum-maintenance, with a small maintainer pool and no recent releases.
|
||
- `sha-1` v0.10 is the last RustCrypto release that ships with a deprecation warning (the algorithm itself, not the crate's quality, is what's deprecated upstream).
|
||
- `md-5` and `num-bigint` are stable but not on the active-development frontier.
|
||
- The hand-rolled MD4 in `mxaccess-rpc` has no upstream at all — it lives in this repo.
|
||
|
||
The risk is that any one of these crates gets **yanked**, picks up an `RUSTSEC` advisory, or stops compiling against a future Rust toolchain, and `cargo-deny` (or `cargo audit`) in CI fails the build for everyone — without any actual bug being found in our usage. This is especially bad if it happens during a live release window.
|
||
|
||
**Options:**
|
||
1. Pin to known-good versions in workspace `Cargo.toml` and let CI break when an advisory lands. Triage manually.
|
||
2. Pin **and** subscribe to `cargo-deny` advisory feeds with a documented response process; pre-stage replacement plans for each crate (e.g. "if `rc4` is yanked, fall back to a hand-rolled cipher in `mxaccess-rpc::crypto::rc4` — RC4 is ~30 LoC and we already hand-roll MD4").
|
||
3. Hand-roll all of them up front (RC4, SHA-1, MD5, MD4 are all small) and depend on `num-bigint` only. Reduces the surface area to one external crate; increases the in-repo cryptographic LoC.
|
||
|
||
**Current best answer:** option 2 for V1. Pin to known-good versions in workspace `Cargo.toml`; subscribe `cargo-deny` advisories in CI; document a fallback plan per crate (hand-rolled RC4 if `rc4` is yanked, hand-rolled SHA-1/MD5 if `sha-1`/`md-5` are pulled, swap `num-bigint` for `crypto-bigint` if it's pulled). Reassess in M6 and consider option 3 (hand-roll-everything) if any of the pins fire during V1 development. Cite `design/30-crate-topology.md:130` and `design/10-raw-layer.md:252`.
|
||
|
||
**Settles when:** `cargo-deny check advisories` runs green in CI on a fresh advisory database, the workspace `Cargo.toml` pins are documented inline with their fallback plans, and a "yank rehearsal" (manually mark a pin as yanked locally and confirm the fallback compiles) has been done at least once per crate.
|
||
|
||
## Open questions
|
||
|
||
### Q1 — Where does the Rust workspace live? **(unresolved)**
|
||
|
||
`CLAUDE.md` proposes a sibling `rust/` directory at `c:\Users\dohertj2\Desktop\mxaccess\rust\`, but this is a *proposal*, not a confirmation: a glob of `rust/` confirms zero files exist there today, and `CLAUDE.md` itself hedges with "when it is started." **M0 cannot start until this is confirmed.**
|
||
|
||
**Owner:** project lead.
|
||
|
||
**Action:** confirm the path `c:\Users\dohertj2\Desktop\mxaccess\rust\` or pick an alternative location; create the empty `rust/` directory (or sibling) before M0 begins.
|
||
|
||
**Current best answer:** still pending. The CLAUDE.md proposal is the default and is what M0 will assume unless overridden, but treat this as an open decision rather than a confirmed answer.
|
||
|
||
**Settles when:** the workspace directory exists on disk and contains a `Cargo.toml` (even an empty one).
|
||
|
||
### Q2 — License? **(resolved: MIT)**
|
||
|
||
The .NET reference has no LICENSE file at the repo root. The Rust crates need one before publish.
|
||
|
||
**Resolved (2026-05-05):** **MIT** (single-license, not the dual `MIT OR Apache-2.0`). All workspace deps verified MIT/Apache-2.0 compatible; MIT alone satisfies every dep's downstream license obligation. `LICENSE` file added at the project root (`c:\Users\dohertj2\Desktop\mxaccess\LICENSE`). All crate `Cargo.toml`s set `license = "MIT"` via `workspace.package`.
|
||
|
||
**Settles when:** N/A — resolved.
|
||
|
||
### Q3 — Cross-platform reach (Linux, macOS)
|
||
|
||
The codec, ASB SOAP framing, and the async session are theoretically portable. Galaxy SQL via `tiberius` works on Linux. NTLM works on Linux. DPAPI does not. Active Directory authentication on Linux requires `gssapi` (Kerberos) which is out of scope.
|
||
|
||
**Current best answer:** Linux is a **stretch goal** for V1, not a supported target — consistent with `30-crate-topology.md`'s `mxaccess-codec` Targets line ("stretch goal") and `60-roadmap.md`'s "What this roadmap deliberately does not include" (Linux behind feature flags). If pursued, the path is `default-features = false` with the consumer providing credentials and shared secret explicitly. macOS unsupported in V1 (no Galaxy SQL TDS testing on macOS).
|
||
|
||
**Settles when:** a Linux integration test runs successfully against a remote AVEVA install. Until then, treat Linux support as aspirational and gate all Linux-specific code paths behind opt-in feature flags.
|
||
|
||
### Q4 — How does `mxaccess-compat` handle COM event sinks?
|
||
|
||
The .NET `MxNativeCompatibilityServer` raises `OnDataChange` etc. as COM events. `mxaccess-compat` is a Rust API; do we expose them as `Stream`s, callbacks, or both?
|
||
|
||
**Current best answer:** Streams, with a separate optional `mxaccess-compat-com` crate (post-V1) that registers `windows-rs`-generated COM classes. The compat crate's primary surface is Rust.
|
||
|
||
**Settles when:** a concrete consumer requests COM exposure.
|
||
|
||
### Q5 — How do we surface `MxStatus` in `Subscription` items vs `Session` operations?
|
||
|
||
For `Session::write()`, a non-Ok status maps to `Error::Status`. For `Subscription::next()`, a non-Ok status comes through as `DataChange { status: MxStatus, ... }` — it is not necessarily an error (a "stale" data change is still a valid frame).
|
||
|
||
**Current best answer:** `Session::write()` returns `Err` on non-Ok category. `Subscription::next()` returns `Ok(DataChange { ... })` and the consumer inspects `change.status`. Documented in `50-error-model.md`.
|
||
|
||
**Settles when:** API stabilises after consumer feedback.
|
||
|
||
### Q6 — Should `Session` be `Clone`?
|
||
|
||
Cheap clones via `Arc<SessionInner>` are convenient (handlers can take `Session` by value). But cloning makes shutdown semantics fuzzy: when does `UnregisterEngine` fire?
|
||
|
||
**Current best answer:** `Clone + Send + Sync`. Drop of the last clone runs `UnregisterEngine` best-effort via `tokio::spawn`. `Session::shutdown(timeout)` is the explicit, awaitable way for production code.
|
||
|
||
**Settles when:** stress test under churn confirms drop semantics are correct.
|
||
|
||
### Q7 — M1 `hasDetailStatus` audit
|
||
|
||
During M1 wave-1 codec ports, the `subscription_message.rs` agent draft conditionally read the `status: i32` field only when `hasDetailStatus = true`, while requiring a minimum record length of 15 (DataUpdate) regardless. The result: 4 leading status bytes were left unconsumed, then misread as `quality` further down. The defect was caught by round-trip tests (`data_update_boolean_round_trip`, `data_update_has_no_correlation_id`) and fixed: `status: i32` is now read unconditionally per `src/MxNativeCodec/NmxSubscriptionMessage.cs:126-127`; only `detail_status: Option<i32>` is gated on `hasDetailStatus` (`NmxSubscriptionMessage.cs:130-134`).
|
||
|
||
**Follow-up:** audit any other codec port (current or future) that takes a `has_detail_status` / `hasDetailStatus` parameter for the same defect pattern — specifically, verify that fields read unconditionally in the .NET source remain unconditional in the Rust port. Likely affected scope: any future helper that ports `ParseRecord` semantics from `NmxSubscriptionMessage.cs`. The inline note at `mxaccess-codec/src/subscription_message.rs` `parse_record` documents the fix.
|
||
|
||
**Settles when:** post-M1 audit confirms no other codec module conditionally skips fields the .NET reference reads unconditionally.
|
||
|
||
## Open evidence gaps
|
||
|
||
These are missing fixtures that the design assumes will land by their respective milestone.
|
||
|
||
| Fixture | Needed by | Captured how |
|
||
|---|---|---|
|
||
| Multi-sample buffered batch | M6 | provider tuning to exceed buffered queue threshold |
|
||
| Cross-domain NTLM Type1/2/3 | M2+ | multi-domain AVEVA test harness |
|
||
| Activate/Suspend transition | M6 | deployed object that goes pending |
|
||
| `OperationComplete` for non-write op | indefinitely | unknown |
|
||
| Ghidra mapping table for completion-only bytes (R3/R4) | indefinitely | Ghidra decompile of `Lmx.dll`'s `aaDCT` tables — table not yet present in `analysis/ghidra/` and has no owner |
|
||
| ASB write timestamp + status fields | M5 | extended ASB Write/PublishWriteComplete probe |
|
||
| ASB no-communication source-level evidence (`work_remain.md:198`) | M5 | live capture against an unconfigured ASB endpoint |
|
||
| Partial-cleanup behavior after channel failure (`work_remain.md:196-197`) | M4/M5 | inject mid-flight failure during subscribe, observe cleanup state |
|
||
| Galaxy schema older version | indefinitely | not in scope for V1 |
|
||
|
||
## Things that look risky but aren't
|
||
|
||
### "Decode the NDR-bridge to find the value bytes"
|
||
|
||
`docs/Transport-Correlation.md:65-70` notes that distinct value probes do not appear in raw TCP — the `CNmxAdapter::PutRequest`/`CNmxAdapter::TransferData` buffers are an "internal adapter representation, not the TCP wire format." This is because the values flow as DCE/RPC stub bytes *inside* the `TransferData` payload, which itself is the 46-byte envelope plus the inner write/advise/subscribe body. The "bridge" is just our codec re-applied at a different boundary; once we encode the envelope correctly, the bytes are there.
|
||
|
||
The .NET reference confirms this — `src/MxNativeClient/ManagedNmxService2Client.cs:159-183` (`TransferData` + `ValidateTransferDataBody`) writes the 46-byte envelope directly into the DCE/RPC Request stub body, then forwards the inner; the validator explicitly rejects bodies that lack "an inner message after the 46-byte envelope" (line 182). There is no extra layer. The probe-vs-pcap mismatch is an artefact of not reassembling the inner body, not a missing protocol layer.
|
||
|
||
**No risk.** Documented for clarity so future contributors don't chase a non-existent encryption layer.
|
||
|
||
### "We need a custom TLB / proxy DLL"
|
||
|
||
The .NET reference avoids registering a custom TLB by hand-rolling the callback `IRemUnknown` server in `src/MxNativeClient/ManagedCallbackExporter.cs:44-54` (`CreateCallbackObjRef` builds an OBJREF in memory) plus `src/MxNativeClient/ManagedCallbackExporter.cs:164,195-196` (the `IRemUnknown::RemQueryInterface` server-side handler returns the negotiated `INmxSvcCallback` IPID without any registry-resident TLB or proxy/stub DLL). The Rust port does the same in `mxaccess-callback`. The only registry touchpoint is OXID resolution (read-only) and reading the ASB shared secret (read-only via DPAPI). No installer, no admin elevation.
|
||
|
||
**No risk.** Documented because it commonly comes up in DCOM contexts.
|