Files
mxaccess/design/70-risks-and-open-questions.md
T
Joseph Doherty 4dfc0cee65 [R3 + R4 + R8] settle protocol-level risks per Ghidra evidence
Ghidra headless decompile of `Lmx.dll`'s `aaDCT` symbol + the
`LmxProxy.dll` Fire_* event handlers (logs at
`analysis/ghidra/exports/Lmx.dll.aadct-decompile.md` and
`analysis/ghidra/exports/LmxProxy.dll.completion-status-decompile.md`)
settles **R3** and **R4** as "no static byte→status lookup table
exists":

- `Lmx.aaDCT` at `0x10178fc0` is a `SysAllocString(L"Lmx.aaDCT")` into
  a global BSTR — a logging category name, not a table.
- `MXSTATUS_PROXY` is a 4-field struct (success/category/detectedBy/
  detail), used as the marshalled COM event payload — not a static
  array of pre-mapped statuses.
- `Fire_OnDataChange` / `Fire_OnWriteComplete` / `Fire_OperationComplete` /
  `Fire_OnBufferedDataChange` (RVAs 0x15f72, 0x1611f, 0x16271, 0x163c0
  in `LmxProxy.dll`) receive ALREADY-POPULATED `MXSTATUS_PROXY[]`
  arrays — the byte-to-struct synthesis happens upstream in the
  proxy's NMX-callback ingestion code, not via a table lookup. The
  synthesis is per-event computation from operation context (engine
  ids, item handles, retry counters), not a static promotion.

R3/R4 status updated from "indefinitely deferred — no Ghidra table"
to "settled — no table exists; verbatim preservation is the canonical
answer." The .NET reference's `NmxOperationStatusMessage.TryParseInner`
+ the Rust port's `mxaccess-codec/src/operation_status.rs` already
match this canonical behaviour; no code change required.

Reopen R3/R4 only if a context-aware capture surfaces a per-byte
synthesis logic that depends on operation context — at which point
the codec would need access to the originating operation's context,
which is upstream of the bytes themselves.

**R8** marked permanently deferred — implementation already parses
NTLM AV pairs per [MS-NLMP] §2.2.2.1 (including the cross-domain
shapes `MsvAvDnsTreeName` / `MsvAvDnsComputerName` carrying the
trusted-domain DNS suffix), what's missing is the live capture, and
the live capture requires a multi-domain Windows lab not available
on this dev host. Same status pattern as F3 in `design/followups.md`.

Open evidence gaps table updated to reflect:
- Cross-domain NTLM: deferred (R8)
- Ghidra mapping table for completion-only bytes: no table exists
  (R3/R4 settled)
- Activate/Suspend transition (wire): partial (F44 + F46), live re-run
  pending (F50)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 06:23:05 -04:00

377 lines
38 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Risks and open questions
This is the punch list of things that could break or are unproven. Each entry is tagged R(isk) or Q(uestion), with a current best answer and what would settle it.
## Protocol-level
### R1 — net.tcp / WCF framing **and binary message encoding** complexity
**Severity: P0** (project-blocker — entire ASB data plane, ~3000 LoC)
The .NET reference uses `System.ServiceModel.NetTcpBinding` for ASB (`src/MxAsbClient/MxAsbDataClient.cs:663`: `new NetTcpBinding(SecurityMode.None)` with no message-encoding override). With no override, WCF defaults to the **binary message encoder** — i.e. .NET Binary XML ([MC-NBFX]) with a static dictionary lookup ([MC-NBFS]) — **not** SOAP/XML. There is no Rust port of WCF, and `quick-xml` (or any other XML toolkit) is **not sufficient** to read or write these payloads: the body bytes are tokenised binary nodes that reference dictionary string IDs.
So the hand-rolled scope is two layers, not one:
1. **Framing** per [MS-NMF] (record types: preamble, preamble-ack, sized-envelope, end, fault) plus the reliable-session ack handling on the underlying `net.tcp` channel.
2. **Message encoding** per [MC-NBFX] (binary XML node tokens, length-prefixed strings, prefixed/typed attributes, end-element markers) **plus** [MC-NBFS] (the static dictionary that holds the SOAP/WS-Addressing/`IASBIDataV2`-action strings the encoder references by ID instead of inlining).
**Options:**
1. Hand-roll both framing ([MS-NMF]) and binary message encoding ([MC-NBFX] + [MC-NBFS]). Estimate ~3000 LoC across both layers (the encoder/dictionary work is the majority — framing alone is ~1500 LoC; the binary XML codec, dictionary tables, and operation-action mapping are roughly the same again).
2. Switch ASB to its HTTP variant if the deployed AVEVA instance supports it (this would let us use a normal text SOAP/XML stack and skip both [MS-NMF] and [MC-NBFX]/[MC-NBFS] entirely).
3. Wrap the .NET ASB DLL in a process and call it via stdin/stdout JSON-RPC.
**Current best answer:** option 1 (hand-roll both layers). The two specs are public, the encoder is deterministic, and the .NET reference's `AsbMessageDumpBehavior` already produces ground-truth byte vectors for the dictionary and operation set we use. `quick-xml` may help with any auxiliary text-XML the wider stack uses, but it cannot decode the binary-encoded message bodies — that requires the [MC-NBFX] + [MC-NBFS] codec.
**Settles when:** `mxaccess-asb-nettcp` parses every captured request/reply byte-identical to the .NET reference's `IClientChannel` payload dump for the proven type matrix, including correct dictionary-ID resolution and round-trip of every observed binary XML node tag.
### R2 — Buffered subscription multi-sample body **(settled per option (a) — codec change landed under F44)**
**Severity: P3** (settled — codec accepts multi-record DataUpdate)
**Status (2026-05-06): SETTLED PER OPTION (a) — multi-sample body observed; codec relaxed.**
`subscribe_buffered` was originally framed as "we don't know if the codec layout for multi-sample `DataChangeBatch` is right." A first verification pass against `wwtools/mxaccesscli/docs/api-notes.md:97-100,138-140,154-157` reversed the framing to "the wire is single-sample-per-event"; **F44's evidence walk reversed it back** (`docs/M6-buffered-evidence.md`).
`captures/094-frida-buffered-separate-writer/frida-events.tsv:145` (`2026-04-25T21:40:34.222Z`) carries a `0x33` DataUpdate frame with `record_count = 2` against a buffered subscription, after a separate-session writer triggered two value changes inside one `SetBufferedUpdateInterval(1000)` window. Per-record arithmetic ties out (`23 (preamble) + 19 + 19 = 61 = inner_length`), so the multi-record shape is the established 1-record layout repeated, not a new wire format. The .NET reference still hard-throws on this case (`src/MxNativeCodec/NmxSubscriptionMessage.cs:71-74`); the Rust codec deliberately diverges and decodes it.
The `OnBufferedDataChange` **public event shape** the wwtools api-notes describe (`hServer, hItem, MxDataType, value, quality, timestamp, statuses` — singular `value`) is correct. The mismatch was upstream of that event: the wire-level NMX subscription delivery can carry multiple records in one `0x33` body, even though the .NET compatibility server fans those out to one event each.
**Current best answer:** `mxaccess-codec` decodes `0x33` DataUpdate bodies of any positive `record_count`; `subscribe_buffered` continues to expose `Stream<Item = DataChange>`, fanning the records out one per Stream item. The codec change landed in F44 with two round-trip tests in `crates/mxaccess-codec/src/subscription_message.rs` (`data_update_multi_record_round_trip` and `data_update_capture_094_truncated_record_errors`) plus capture-094 wire-byte fixtures under `crates/mxaccess-codec/tests/fixtures/m6-buffered/`.
**Settles when:** ✅ settled per option (a). Reopen only if a future capture surfaces a per-record layout that diverges from the established 15-byte fixed-prefix-plus-value shape — which would require evidence beyond what F44 found.
### R3 — `OperationComplete` trigger unproven **(settled 2026-05-06 — no mapping table exists; verbatim-preserve is the canonical answer)**
**Severity: P1** (was a blocker; now settled per option: verbatim preserve is the canonical native behaviour)
**Status (2026-05-06): SETTLED.** Ghidra headless decompile + string-ref walk across `Lmx.dll`, `LmxProxy.dll`, `NmxAdptr.dll`, `NmxSvc.exe`, and `NmxSvcps.dll` (logs at `analysis/ghidra/exports/Lmx.dll.aadct-decompile.md` + `analysis/ghidra/exports/LmxProxy.dll.completion-status-decompile.md`) confirms there is **no static byte→status lookup table** to extract. Specifically:
- The `Lmx.aaDCT` symbol referenced at `0x10178fc0` is a `SysAllocString(L"Lmx.aaDCT")` call into a global BSTR — a logging category name, not a status-mapping table. Decompiled function body is a textbook static initializer, no array / lookup logic.
- `MXSTATUS_PROXY` (`analysis/decompiled-interop/Interop.Lmx/Interop/Lmx/MXSTATUS_PROXY.cs`) is a 4-field struct (`success: i16`, `category: MxStatusCategory`, `detectedBy: MxStatusSource`, `detail: i16`), used as the marshalled COM event payload — not a static array of pre-mapped statuses.
- The `Fire_OnDataChange` / `Fire_OnWriteComplete` / `Fire_OperationComplete` / `Fire_OnBufferedDataChange` event-firing functions in `LmxProxy.dll` (RVAs `0x15f72`, `0x1611f`, `0x16271`, `0x163c0`) receive **already-populated** `MXSTATUS_PROXY[]` arrays — the byte-to-struct synthesis happens upstream in the proxy's NMX-callback ingestion code, not via a table lookup. The synthesis is per-call computation from operation state (engine ids, item handles, retry counters), not a static byte→status promotion.
This means the .NET reference's verbatim-preservation strategy IS the canonical native behaviour: there is no table to mirror because the native code computes the `MXSTATUS_PROXY` from operation context per-event, not from a lookup. The 1-byte completion frames `0x00`, `0x41`, `0xEF` etc. are intermediate NMX-internal signaling that the proxy synthesizes contextual status from; the only frame with a proven typed promotion is the 5-byte status-word `00 00 50 80 00``MxStatus.WriteCompleteOk`.
**Current best answer:** unchanged — `Session::operation_status_events()` exposes `Stream<Item = RawOperationStatus>` carrying frame bytes. Promote to a typed `WriteCompleted` only on the proven `00 00 50 80 00` 5-byte pattern. Other bytes stay raw as `MxStatus { Success: 0, Category: Unknown, DetectedBy: Unknown, Detail: byte }`. The Rust codec mirrors `src/MxNativeCodec/NmxOperationStatusMessage.cs:TryParseInner`.
**Reopen when:** a live capture surfaces a 1-byte completion frame whose surrounding context (e.g. observed `MXSTATUS_PROXY` struct fired through the .NET probe alongside the byte) lets us back-derive a context-aware promotion. Since the native code synthesises the struct from operation state rather than a table, the promotion logic would itself need to be context-aware — i.e. the codec would need access to the originating operation's context, which is upstream of the bytes themselves. Until then, verbatim preservation is correct by construction.
### R4 — Completion-only byte mapping **(settled 2026-05-06 — collapses into R3's resolution)**
**Severity: P1** (was a blocker; now settled per the same R3 finding)
**Status (2026-05-06): SETTLED.** Same Ghidra walk as R3. The 1-byte completion frames `0x00`, `0x41`, `0xEF` (`work_remain.md:164174`) are the same intermediate NMX signals that R3 covers. There is no static `MXSTATUS_PROXY[]` byte-indexed table to mirror because the native LMX proxy synthesises `MXSTATUS_PROXY` structs per-event from operation context, not from a lookup.
**Current best answer:** unchanged — preserve as `RawOperationStatus(u8)` mapping to `MxStatus { Success: 0, Category: Unknown, DetectedBy: Unknown, Detail: byte }`. The .NET reference does the same; the Rust port matches.
**Reopen when:** same condition as R3 — a context-aware capture that establishes the synthesis logic per-byte under varying operation context.
### R5 — Activate / Suspend behaviour **(partially observed — F44 documented client-side trigger; wire-side residual gap filed as F46, hook landed pending live re-run)**
**Severity: P2** (downgraded from P1 — client-side acceptance criteria are
now documented; LMX-proxy wire emission remains unconfirmed)
**Status (2026-05-06): PARTIALLY OBSERVED — Frida hooks ready, live capture pending.**
F44's evidence walk on
`captures/077-frida-suspend-advised-scanstate/` (per `docs/M6-buffered-evidence.md`)
documents:
- `Suspend` returns synchronously with `MxStatus.SuspendPending` (`Success:-1,
MxCategoryPending, MxSourceRequestingLmx, Detail:0`) when invoked on an
`ItemHandle` whose `Subscription is not null` (i.e. immediately after a
successful `Advise` / `AdviseSupervisory`).
- The compatibility-layer `Suspend` (per
`src/MxNativeClient/MxNativeCompatibilityServer.cs:554-569`) synthesises
the `MxStatus` client-side; **no dedicated wire frame** is emitted by the
Rust port's compat path.
What capture 077 could **not** answer: whether the production
`LmxProxy.dll` stack issues a separate ORPC method for `Suspend` / `Activate`
(e.g. an `ILMXProxyServer5` opnum) or also handles them client-side. Capture
077's Frida script did not hook
`LmxProxy.dll!CLMXProxyServer.Suspend`/`.Activate`, so the wire-side
behaviour is invisible.
**Next step — F46.** `analysis/frida/mx-nmx-trace.js` now carries
`Interceptor.attach` blocks for `LmxProxy.dll!CLMXProxyServer.Suspend`
(RVA `0x13d9c`, `FUN_10013d9c`) and `.Activate` (RVA `0x14028`,
`FUN_10014028`), emitting `mx.suspend.begin/end` and
`mx.activate.begin/end` events with the `MxStatus*` out-parameter
decoded as 4 × int16. No `Resume` / `Reactivate` symbols exist in
`LmxProxy.dll` — verified against
`analysis/ghidra/exports/LmxProxy.dll.ghidra.md` and the decompiled
`ILMXProxyServer5` / `ILMXProxyServer4` interfaces. R5 stays open
until a live re-run on the AVEVA host produces
`captures/NNN-frida-suspend-activate-instrumented/` per the procedure
documented at the top of `analysis/frida/mx-nmx-trace.js`.
**Current best answer:** expose `Session::suspend(item)` and
`Session::activate(item)` returning `Result<MxStatus, Error>`. The success
criteria match the .NET reference's client-side gating: the item must have
an active subscription. If F46's wire capture later proves the LMX proxy
issues a separate ORPC method, add the wire emission here in M6 follow-up.
Do not build callback-driven state transitions on top until F46 settles.
**Settles when:** F45 produces a Frida capture instrumenting
`LmxProxy.dll!CLMXProxyServer.Suspend` / `.Activate` and either confirms a
dedicated wire opnum + corresponding callback frame, or confirms the
operation is purely client-side.
### R6 — `0x80004021` in `MxNativeSession.WriteSecuredAsync` is a .NET-reference defect, not a real LMX constraint
**Severity: P3** (formerly P1 — downgraded after `wwtools/mxaccesscli/` verification)
Original framing of this risk asserted that "`WriteSecured` (without `2`) returns `0x80004021` before sending the body" and concluded the single-token form was deprecated or rejected at the wire. That framing was wrong. Verification against `wwtools/mxaccesscli/` (a working CLI built on the production `LMXProxyServerClass` 32-bit COM proxy, i.e. the actual MxAccess surface) establishes:
- The LMX `WriteSecured` ALWAYS takes **two** user ids: `(currentUserId, verifierUserId, value)` (`wwtools/mxaccesscli/docs/api-notes.md:60-72`, `wwtools/mxaccesscli/src/MxAccess.Cli/Mx/MxItem.cs:69-70`).
- "Single-user secured write" is the same API called with `currentUserId == verifierUserId` — it is **not** a separate API surface (`wwtools/mxaccesscli/src/MxAccess.Cli/Commands/WriteCommand.cs:151-155,196-199`).
- `WriteSecured2` adds a timestamp parameter; it does **not** add a second token. The 1-vs-2 distinction in this design's earlier drafts was a confusion between "with timestamp" (Write2 vs Write) and "two-token" (which is always true).
- The `0x80004021` failure observed in `src/MxNativeClient/MxNativeSession.cs:218-221` is therefore a defect of the .NET native reimplementation, not behaviour the LMX proxy itself produces.
**Current best answer:** `mxaccess` exposes `write_secured(reference, value, current_user_id, verifier_user_id)` (no timestamp) and `write_secured_at(reference, value, timestamp, current_user_id, verifier_user_id)` (with timestamp), matching `WriteSecured` and `WriteSecured2` respectively. Both always pass two user ids; callers performing single-user secured writes pass the same id twice. The `Error::Unsupported` mapping for "single-token form" has been removed from `50-error-model.md`.
**Settles when:** the `MxNativeSession.WriteSecuredAsync` defect is fixed in the .NET reference, or a captured frame shows the LMX proxy itself producing `0x80004021` on a `WriteSecured` call (which would resurrect the original framing). Default-positive: this likely settles silently as "not a real risk."
### R7 — Status mapping for non-success ASB cases
**Severity: P2** (nice-to-have / minor — unknown bytes preserved as raw)
`work_remain.md:132143`: live probes have not yet exercised access-denied and no-communication on the current VM. The Rust port mirrors what the .NET reference proves; remaining ASB error/quality/detail bytes are preserved as raw and surfaced through `MxStatus.detail` until a safe live capture lands.
**Current best answer:** preserve unknown payloads. Document the gap.
**Settles when:** live capture against a configured access-denied tag and a no-communication endpoint produces the expected `MxStatus` shape.
## Implementation-level
### R8 — NTLMv2 cross-domain auth **(permanently deferred 2026-05-06 — external infrastructure gap)**
**Severity: P1** (significant blocker for cross-domain deployments — single-domain ships)
**Status (2026-05-06): PERMANENTLY DEFERRED.** The implementation already parses NTLM AV pairs per [MS-NLMP] §2.2.2.1, including the cross-domain AV pair shapes (`MsvAvDnsTreeName`, `MsvAvDnsComputerName` carry the trusted-domain DNS suffix instead of the local one). What's missing is the *live capture* needed to pin a regression fixture — and that requires a multi-domain Windows lab (e.g. `LAB-A` + `LAB-B` with cross-domain trust + an AVEVA install on `LAB-A` authenticating a `LAB-B`-domain user) which is not available on the dev host. Same external-infrastructure constraint as `F3` in `design/followups.md`. R8 is closed in the same sense F3 is closed — the implementation is in place per spec; only the evidence is gated on hardware that doesn't exist here.
Captured traffic is single-domain (local AVEVA install). Cross-domain NTLM exercises the AV pair codepaths but the bytes haven't been pinned.
**Current best answer:** the AV pair parser handles the cross-domain shape per [MS-NLMP] §2.2.2.1; document `mxaccess-rpc` as untested across domains in the README. The `mxaccess-rpc::ntlm` round-trip tests cover the single-domain shape; cross-domain rounds-trip through the same code path (the AV pair parser is shape-agnostic) but no live fixture pins it.
**Reopen when:** a multi-domain AVEVA test harness becomes available + a cross-domain probe runs successfully end-to-end with packet-integrity signatures verified. Until then, this risk is permanently deferred — same status pattern as F3.
### R9 — DPAPI dependency for ASB
**Severity: P2** (nice-to-have / minor — explicit `shared_secret` constructor is the escape hatch)
ASB shared-secret retrieval uses `ProtectedData.Unprotect` (LocalMachine scope). Linux has no DPAPI. There is no portable replacement; the secret is encrypted at rest with a Windows-specific KCV.
**Current best answer:** `mxaccess-asb` requires Windows for the credential read path. Provide an explicit `AsbCredentials::shared_secret(secret: &[u8])` constructor that bypasses DPAPI for tooling that has the secret in plaintext (e.g. CI tests, ops automation).
**Settles when:** never. DPAPI is not portable; the escape hatch is the explicit constructor.
### R10 — Galaxy SQL schema versioning
**Severity: P1** (significant blocker per affected feature — break-loud on mismatch)
The recursive CTE in `GalaxyRepositoryTagResolver.cs` assumes the current AVEVA schema. Older Galaxy versions may have different table layouts.
**Current best answer:** target the schema that ships with the AVEVA version `MxNativeClient` validates against. Document the expected schema version. Break loudly on mismatch (`ConfigError::Galaxy { reason }`).
**Settles when:** a multi-version test matrix is set up. Probably not in V1.
### R11 — x86 proxy/stub workaround
**Severity: P2** (nice-to-have / minor — integration test catches binding-shape drift)
`NmxSvcps.dll` is x86-only. The replacement strategy bypasses the in-proc proxy by speaking ORPC directly. This works because we control both Type1/Type3 marshalling and `RemQueryInterface`. But it depends on `NmxSvc` continuing to expose IPv4 NCACN_IP_TCP bindings via the OXID.
**Current best answer:** add an `mxaccess-rpc` integration test that asserts `ResolveOxid` returns at least one `ncacn_ip_tcp` binding. Fail fast if the binding shape changes in a future AVEVA release.
**Settles when:** that integration test is in CI gating.
### R12 — Performance — codec allocations
**Severity: P2** (nice-to-have / minor — micro-optimisation in M6)
The .NET reference reuses `byte[]` arrays via `MemoryPool`; the Rust port should use `bytes::Bytes` for zero-copy on receive and pre-allocate via `BytesMut` on encode. The codec currently allocates `Vec<u8>` per encode; tolerable for V1, worth optimising in M6.
**Current best answer:** use `BytesMut::with_capacity(MAX_FRAME)` per session. Bench in M6. Aim for < 5 allocations per write at steady state.
**Settles when:** `cargo bench` shows the target allocation count.
### R13 — DataUpdate `recordCount != 1` panic risk
**Severity: P1** (significant blocker for production stability — soft-error path documented)
`src/MxNativeCodec/NmxSubscriptionMessage.cs:71-74` hard-throws `ArgumentException` on any `0x33` DataUpdate whose `recordCount` is not exactly 1:
```csharp
if (recordCount != 1)
{
throw new ArgumentException("Observed NMX DataUpdate callback parser currently supports one record per body.", nameof(inner));
}
```
R2 covers the missing **fixture** for the multi-record case, but the bigger production-side risk is separate: the first time AVEVA emits a multi-record `0x33` against a deployed Rust client, the codec — if it ports the .NET behaviour faithfully — will panic / return a hard decode error and tear down the subscription. We have no fixture proving multi-record bodies don't happen on real installs; we only have evidence they haven't happened on **our** install.
**Options:**
1. Mirror the .NET reference and hard-error on `recordCount != 1`. Loud, but kills the session.
2. Surface as a typed soft error (e.g. `ProtocolError::Decode { reason: "multi-record DataUpdate not yet supported" }`), log at warn, and drop the frame. The subscription stays alive; the consumer sees a single missed update, not a teardown.
3. Speculatively decode multi-record (assume the per-record layout from the single-record case repeats) — explicitly forbidden by CLAUDE.md "Do not fabricate protocol behavior."
**Current best answer:** option 2 in Rust. Map the condition to `ProtocolError::Decode { reason: "multi-record DataUpdate not yet supported" }`, emit a `tracing::warn!` with the raw frame bytes attached as a hex field, and continue. Do **not** synthesise per-record decoding. The .NET-style hard throw stays as-is in the .NET reference (it is the executable spec, and a panic there is what produces the fixture we need — see R2). The Rust port deliberately diverges here on production-safety grounds, with the divergence documented in `50-error-model.md`.
**Settles when:** R2's multi-record fixture lands and the codec gains a proven typed decode path; then R13 collapses into "supported, no special handling" and the soft-error branch becomes dead code that can be removed.
### R14 — Fabricated `0x80004021 → StaleItem` mapping
**Severity: P1** (significant blocker — fabrication risk; corrected in `50-error-model.md`)
A draft of `50-error-model.md` mapped `HRESULT 0x80004021` to a typed `StaleItem` error category for regular (non-secured) operations. **This mapping is unevidenced.**
- R6 already covers `0x80004021` on secured-write specifically: per `wwtools/mxaccesscli/` verification, this is a `MxNativeSession.WriteSecuredAsync` defect (the .NET native reimplementation throws `NotSupportedException` before reaching the wire), **not** a real LMX-proxy constraint. The production LMX surface accepts `WriteSecured` with two user ids unconditionally. R6 explicitly does **not** generalise the .NET defect to a typed "stale" error.
- For regular operations, the actual stale-handle / invalid-arg HRESULT observed in captures is `0x80070057` (`E_INVALIDARG`). There is no captured frame, decompiled mapping table, or live probe in this repo that produces `0x80004021` on a non-secured path, and certainly none that justifies tagging it `StaleItem`.
This is a fabrication risk: the kind of "looks plausible from naming" mapping that CLAUDE.md "Do not fabricate protocol behavior" exists to prevent.
**Options:**
1. Drop the `StaleItem` category entirely. Regular-op `0x80004021`, if ever observed, falls through to the generic `Hresult { code, hint: None }` branch with the raw HRESULT preserved.
2. Keep `StaleItem` but rename the source HRESULT to `0x80070057` and require a captured fixture before promoting any frame to that category.
3. Keep the `0x80004021 → StaleItem` mapping. **Forbidden** — no evidence backs it.
**Current best answer:** option 1 for V1. Surface unknown HRESULTs as `Error::Hresult { code }` and let consumers match on the raw value. `50-error-model.md` is being corrected in parallel (review cluster 3) to remove the `StaleItem` reference; this risk register entry exists so the mistake is recorded for future contributors and not silently re-introduced when someone reaches for an ergonomic typed name.
**Settles when:** indefinitely deferred — no current artifact maps either `0x80004021` or `0x80070057` to a "stale handle" semantic, and inventing one violates the "don't fabricate protocol behaviour" rule. If a future capture or decompiled mapping table produces evidence, reopen as a typed-error proposal.
### R15 — Drop-time async cleanup hazards
**Severity: P1** (significant blocker — server-side handle leak on runtime shutdown)
`design/00-overview.md:38` states the principle "no spawn from inside Drop." `design/20-async-layer.md` and `design/50-error-model.md` describe Subscription drop semantics that fire `UnAdvise`/`UnregisterEngine` against the server. Reconciling these is non-trivial because:
- `tokio::spawn` from `Drop` panics if no Tokio runtime is current at drop time. A user dropping a `Session` from a `std::thread` after `Runtime::shutdown_timeout` returns will hit this.
- During `Runtime::shutdown_timeout`, spawned tasks are aborted before they can flush. Even if a runtime is current, spawning the cleanup from `Drop` does not guarantee the unadvise/unregister actually reaches the server — the drop call returns immediately and the spawned task may be cancelled before the bytes hit the wire.
- The result is a **server-side handle leak in `NmxSvc`**: subscriptions stay live, registered engines stay registered, until the TCP connection itself is torn down (which only happens once the kernel notices the socket is dead).
**Options:**
1. Best-effort `tokio::spawn` from `Drop`. Documented hazard. Leaks on runtime shutdown and panics on no-runtime.
2. Drop sends `UnAdvise`/`UnregisterEngine` via a `tokio::sync::oneshot` (or unbounded `mpsc`) to a long-lived connection task that owns the cleanup loop. **Drop itself never spawns** — it pushes a message onto the channel and returns. The connection task drains the channel until the TCP connection is itself dropped, at which point the server cleans up by socket close anyway.
3. Require the consumer to call `Session::shutdown(timeout).await` and document Drop as "best-effort, may leak under shutdown" — no automatic cleanup at all.
**Current best answer:** option 2. A long-lived connection task owns the cleanup channel and drains it; `Drop` pushes a `UnAdvise`/`UnregisterEngine` request onto a `tokio::sync::oneshot` (one per resource) or a per-connection unbounded `mpsc` and returns synchronously. This keeps `Drop` cheap, satisfies "no spawn from Drop," and gives the cleanup a reasonable best-effort guarantee while the connection task is alive. **Runtime-shutdown leak window remains** — if the connection task is itself aborted by `Runtime::shutdown_timeout` before draining the channel, the cleanup messages are dropped on the floor and the server-side handles remain registered until the TCP socket close is observed by `NmxSvc`. This window is documented in `50-error-model.md`'s cancellation semantics; consumers running under explicit shutdown should call `Session::shutdown(timeout).await` for deterministic cleanup. Cite `design/00-overview.md:38` (no-spawn-from-Drop principle), `design/20-async-layer.md` (Subscription drop semantics), `design/50-error-model.md` (cancellation semantics).
**Settles when:** the connection-task cleanup channel is implemented in M4, a stress test under churn confirms drop semantics on a live runtime do not leak, and the runtime-shutdown leak window is captured in a runnable test fixture (consumer drops `Session` after `Runtime::shutdown_timeout`; assert that the leak is bounded by socket-close timeout).
### R16 — Crypto/auth crate maintenance drift
**Severity: P1** (significant blocker — yank/advisory in CI breaks the build)
The auth surface area depends on a small cluster of marginal-maintenance crates. `design/30-crate-topology.md:130` pins `rc4`, `sha-1`, `md-5`, `num-bigint`; `design/10-raw-layer.md:252` instructs "Do not pull `ring` — hand-roll MD4." Of these:
- `rc4` is at minimum-maintenance, with a small maintainer pool and no recent releases.
- `sha-1` v0.10 is the last RustCrypto release that ships with a deprecation warning (the algorithm itself, not the crate's quality, is what's deprecated upstream).
- `md-5` and `num-bigint` are stable but not on the active-development frontier.
- The hand-rolled MD4 in `mxaccess-rpc` has no upstream at all — it lives in this repo.
The risk is that any one of these crates gets **yanked**, picks up an `RUSTSEC` advisory, or stops compiling against a future Rust toolchain, and `cargo-deny` (or `cargo audit`) in CI fails the build for everyone — without any actual bug being found in our usage. This is especially bad if it happens during a live release window.
**Options:**
1. Pin to known-good versions in workspace `Cargo.toml` and let CI break when an advisory lands. Triage manually.
2. Pin **and** subscribe to `cargo-deny` advisory feeds with a documented response process; pre-stage replacement plans for each crate (e.g. "if `rc4` is yanked, fall back to a hand-rolled cipher in `mxaccess-rpc::crypto::rc4` — RC4 is ~30 LoC and we already hand-roll MD4").
3. Hand-roll all of them up front (RC4, SHA-1, MD5, MD4 are all small) and depend on `num-bigint` only. Reduces the surface area to one external crate; increases the in-repo cryptographic LoC.
**Current best answer:** option 2 for V1. Pin to known-good versions in workspace `Cargo.toml`; subscribe `cargo-deny` advisories in CI; document a fallback plan per crate (hand-rolled RC4 if `rc4` is yanked, hand-rolled SHA-1/MD5 if `sha-1`/`md-5` are pulled, swap `num-bigint` for `crypto-bigint` if it's pulled). Reassess in M6 and consider option 3 (hand-roll-everything) if any of the pins fire during V1 development. Cite `design/30-crate-topology.md:130` and `design/10-raw-layer.md:252`.
**Settles when:** `cargo-deny check advisories` runs green in CI on a fresh advisory database, the workspace `Cargo.toml` pins are documented inline with their fallback plans, and a "yank rehearsal" (manually mark a pin as yanked locally and confirm the fallback compiles) has been done at least once per crate.
## Open questions
### Q1 — Where does the Rust workspace live? **(unresolved)**
`CLAUDE.md` proposes a sibling `rust/` directory at `c:\Users\dohertj2\Desktop\mxaccess\rust\`, but this is a *proposal*, not a confirmation: a glob of `rust/` confirms zero files exist there today, and `CLAUDE.md` itself hedges with "when it is started." **M0 cannot start until this is confirmed.**
**Owner:** project lead.
**Action:** confirm the path `c:\Users\dohertj2\Desktop\mxaccess\rust\` or pick an alternative location; create the empty `rust/` directory (or sibling) before M0 begins.
**Current best answer:** still pending. The CLAUDE.md proposal is the default and is what M0 will assume unless overridden, but treat this as an open decision rather than a confirmed answer.
**Settles when:** the workspace directory exists on disk and contains a `Cargo.toml` (even an empty one).
### Q2 — License? **(resolved: MIT)**
The .NET reference has no LICENSE file at the repo root. The Rust crates need one before publish.
**Resolved (2026-05-05):** **MIT** (single-license, not the dual `MIT OR Apache-2.0`). All workspace deps verified MIT/Apache-2.0 compatible; MIT alone satisfies every dep's downstream license obligation. `LICENSE` file added at the project root (`c:\Users\dohertj2\Desktop\mxaccess\LICENSE`). All crate `Cargo.toml`s set `license = "MIT"` via `workspace.package`.
**Settles when:** N/A — resolved.
### Q3 — Cross-platform reach (Linux, macOS)
The codec, ASB SOAP framing, and the async session are theoretically portable. Galaxy SQL via `tiberius` works on Linux. NTLM works on Linux. DPAPI does not. Active Directory authentication on Linux requires `gssapi` (Kerberos) which is out of scope.
**Current best answer:** Linux is a **stretch goal** for V1, not a supported target — consistent with `30-crate-topology.md`'s `mxaccess-codec` Targets line ("stretch goal") and `60-roadmap.md`'s "What this roadmap deliberately does not include" (Linux behind feature flags). If pursued, the path is `default-features = false` with the consumer providing credentials and shared secret explicitly. macOS unsupported in V1 (no Galaxy SQL TDS testing on macOS).
**Settles when:** a Linux integration test runs successfully against a remote AVEVA install. Until then, treat Linux support as aspirational and gate all Linux-specific code paths behind opt-in feature flags.
### Q4 — How does `mxaccess-compat` handle COM event sinks?
The .NET `MxNativeCompatibilityServer` raises `OnDataChange` etc. as COM events. `mxaccess-compat` is a Rust API; do we expose them as `Stream`s, callbacks, or both?
**Current best answer:** Streams, with a separate optional `mxaccess-compat-com` crate (post-V1) that registers `windows-rs`-generated COM classes. The compat crate's primary surface is Rust.
**Settles when:** a concrete consumer requests COM exposure.
### Q5 — How do we surface `MxStatus` in `Subscription` items vs `Session` operations?
For `Session::write()`, a non-Ok status maps to `Error::Status`. For `Subscription::next()`, a non-Ok status comes through as `DataChange { status: MxStatus, ... }` — it is not necessarily an error (a "stale" data change is still a valid frame).
**Current best answer:** `Session::write()` returns `Err` on non-Ok category. `Subscription::next()` returns `Ok(DataChange { ... })` and the consumer inspects `change.status`. Documented in `50-error-model.md`.
**Settles when:** API stabilises after consumer feedback.
### Q6 — Should `Session` be `Clone`?
Cheap clones via `Arc<SessionInner>` are convenient (handlers can take `Session` by value). But cloning makes shutdown semantics fuzzy: when does `UnregisterEngine` fire?
**Current best answer:** `Clone + Send + Sync`. Drop of the last clone runs `UnregisterEngine` best-effort via `tokio::spawn`. `Session::shutdown(timeout)` is the explicit, awaitable way for production code.
**Settles when:** stress test under churn confirms drop semantics are correct.
### Q7 — M1 `hasDetailStatus` audit
During M1 wave-1 codec ports, the `subscription_message.rs` agent draft conditionally read the `status: i32` field only when `hasDetailStatus = true`, while requiring a minimum record length of 15 (DataUpdate) regardless. The result: 4 leading status bytes were left unconsumed, then misread as `quality` further down. The defect was caught by round-trip tests (`data_update_boolean_round_trip`, `data_update_has_no_correlation_id`) and fixed: `status: i32` is now read unconditionally per `src/MxNativeCodec/NmxSubscriptionMessage.cs:126-127`; only `detail_status: Option<i32>` is gated on `hasDetailStatus` (`NmxSubscriptionMessage.cs:130-134`).
**Follow-up:** audit any other codec port (current or future) that takes a `has_detail_status` / `hasDetailStatus` parameter for the same defect pattern — specifically, verify that fields read unconditionally in the .NET source remain unconditional in the Rust port. Likely affected scope: any future helper that ports `ParseRecord` semantics from `NmxSubscriptionMessage.cs`. The inline note at `mxaccess-codec/src/subscription_message.rs` `parse_record` documents the fix.
**Settles when:** post-M1 audit confirms no other codec module conditionally skips fields the .NET reference reads unconditionally.
## Open evidence gaps
These are missing fixtures that the design assumes will land by their respective milestone.
| Fixture | Needed by | Captured how |
|---|---|---|
| ~~Multi-sample buffered batch~~ | ~~M6~~ | **CAPTURED (F44)**`captures/094-frida-buffered-separate-writer/frida-events.tsv:145`; fixture under `crates/mxaccess-codec/tests/fixtures/m6-buffered/` |
| ~~Cross-domain NTLM Type1/2/3~~ | ~~M2+~~ | **DEFERRED (R8)** — permanently external-blocked; needs multi-domain Windows lab not available on this dev host |
| Activate/Suspend transition (wire) | M6 / F46 | **PARTIAL (F44 + F46)** — client-side conditions documented from capture 077; F46 added Frida hooks (`LmxProxy.dll!CLMXProxyServer.Suspend/.Activate` at RVAs `0x13d9c` / `0x14028`); live re-run pending (F50) |
| `OperationComplete` for non-write op | indefinitely | unknown |
| ~~Ghidra mapping table for completion-only bytes (R3/R4)~~ | ~~indefinitely~~ | **NO TABLE EXISTS (R3/R4 settled 2026-05-06)**`analysis/ghidra/exports/Lmx.dll.aadct-decompile.md` confirms `aaDCT` is a logging BSTR name, not a table; `LmxProxy.dll`'s Fire_* event handlers receive already-populated `MXSTATUS_PROXY[]` from per-event context synthesis upstream, not from a static lookup. Verbatim preservation is the canonical answer. |
| ASB write timestamp + status fields | M5 | extended ASB Write/PublishWriteComplete probe |
| ASB no-communication source-level evidence (`work_remain.md:198`) | M5 | live capture against an unconfigured ASB endpoint |
| Partial-cleanup behavior after channel failure (`work_remain.md:196-197`) | M4/M5 | inject mid-flight failure during subscribe, observe cleanup state |
| Galaxy schema older version | indefinitely | not in scope for V1 |
## Things that look risky but aren't
### "Decode the NDR-bridge to find the value bytes"
`docs/Transport-Correlation.md:65-70` notes that distinct value probes do not appear in raw TCP — the `CNmxAdapter::PutRequest`/`CNmxAdapter::TransferData` buffers are an "internal adapter representation, not the TCP wire format." This is because the values flow as DCE/RPC stub bytes *inside* the `TransferData` payload, which itself is the 46-byte envelope plus the inner write/advise/subscribe body. The "bridge" is just our codec re-applied at a different boundary; once we encode the envelope correctly, the bytes are there.
The .NET reference confirms this — `src/MxNativeClient/ManagedNmxService2Client.cs:159-183` (`TransferData` + `ValidateTransferDataBody`) writes the 46-byte envelope directly into the DCE/RPC Request stub body, then forwards the inner; the validator explicitly rejects bodies that lack "an inner message after the 46-byte envelope" (line 182). There is no extra layer. The probe-vs-pcap mismatch is an artefact of not reassembling the inner body, not a missing protocol layer.
**No risk.** Documented for clarity so future contributors don't chase a non-existent encryption layer.
### "We need a custom TLB / proxy DLL"
The .NET reference avoids registering a custom TLB by hand-rolling the callback `IRemUnknown` server in `src/MxNativeClient/ManagedCallbackExporter.cs:44-54` (`CreateCallbackObjRef` builds an OBJREF in memory) plus `src/MxNativeClient/ManagedCallbackExporter.cs:164,195-196` (the `IRemUnknown::RemQueryInterface` server-side handler returns the negotiated `INmxSvcCallback` IPID without any registry-resident TLB or proxy/stub DLL). The Rust port does the same in `mxaccess-callback`. The only registry touchpoint is OXID resolution (read-only) and reading the ASB shared secret (read-only via DPAPI). No installer, no admin elevation.
**No risk.** Documented because it commonly comes up in DCOM contexts.