Joseph Doherty f98ab9846d design/70-risks: record the .NET reference's WriteCompleted half-implementation

R3's verdict gains an aside documenting why the original native
MxAccess `OnWriteComplete` event has historically only fired for the
one exact 5-byte pattern `00 00 50 80 00` (= `MxStatus.WriteCompleteOk`).

Verified at:
- `src/MxNativeClient/MxNativeCompatibilityServer.cs:756` —
  `if (!evt.Message.IsMxAccessWriteComplete) return;` gates the
  consumer-facing `WriteCompleted` event.
- `src/MxNativeCodec/NmxOperationStatusMessage.cs:18` —
  `IsMxAccessWriteComplete` requires
  `Format == StatusWord && StatusCode == 0x8050 && CompletionCode == 0x00`.

Every other completion frame is silently dropped — the 1-byte
`0x00`/`0x41`/`0xEF` ones, plus any non-success status word.

This was the underlying reason R3/R4 looked unsolvable for a year:
the answer "we don't know how to map" was actually "the native
compatibility shim deliberately doesn't map these because firing
typed failure events on ambiguous bytes was never a goal."

Path A's `MxStatus::from_packed_u32` (commit `c73a33e`) closes the
asymmetry on the Rust side: `Session::operation_status_events()`
exposes ALL typed outcomes the upstream synthesizer produces, not
just the WriteCompleteOk slice. The Rust port now has strictly
broader operation-status visibility than the .NET reference offered.

Recorded so future contributors don't re-derive this from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-06 07:13:28 -04:00


Risks and open questions

This is the punch list of things that could break or are unproven. Each entry is tagged R(isk) or Q(uestion), with a current best answer and what would settle it.

Protocol-level

R1 — net.tcp / WCF framing and binary message encoding complexity

Severity: P0 (project-blocker — entire ASB data plane, ~3000 LoC)

The .NET reference uses System.ServiceModel.NetTcpBinding for ASB (src/MxAsbClient/MxAsbDataClient.cs:663: new NetTcpBinding(SecurityMode.None) with no message-encoding override). With no override, WCF defaults to the binary message encoder — i.e. .NET Binary XML ([MC-NBFX]) with a static dictionary lookup ([MC-NBFS]) — not SOAP/XML. There is no Rust port of WCF, and quick-xml (or any other XML toolkit) is not sufficient to read or write these payloads: the body bytes are tokenised binary nodes that reference dictionary string IDs.

So the hand-rolled scope is two layers, not one:

Framing per [MC-NMF] (record types: preamble, preamble-ack, sized-envelope, end, fault) plus the reliable-session ack handling on the underlying net.tcp channel.
Message encoding per [MC-NBFX] (binary XML node tokens, length-prefixed strings, prefixed/typed attributes, end-element markers) plus [MC-NBFS] (the static dictionary that holds the SOAP/WS-Addressing/IASBIDataV2-action strings the encoder references by ID instead of inlining).
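The framing layer's sized-envelope record can be sketched as follows. This is a minimal sketch based on a reading of the public framing spec, assuming a record-type byte of 0x06 followed by a variable-length size (7 bits per byte, low group first, high bit as continuation flag) and then the payload. The function names are hypothetical, and the byte values should be checked against the captured vectors before being relied on.

```rust
/// Record-type byte assumed for a sized-envelope record.
const SIZED_ENVELOPE: u8 = 0x06;

/// Encode `n` as a variable-length size: 7 bits per byte, low group first,
/// high bit set on every byte except the last.
fn encode_size(mut n: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Decode a variable-length size; returns (value, bytes consumed).
/// Returns None on truncation or on a value that does not fit in u32.
fn decode_size(buf: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &b) in buf.iter().enumerate().take(5) {
        let group = (b & 0x7F) as u32;
        // The fifth byte can only contribute the top 4 bits of a u32.
        if i == 4 && group > 0x0F {
            return None;
        }
        value |= group << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Wrap a payload in a sized-envelope record (payloads over u32::MAX bytes
/// are out of scope for this sketch).
fn frame_envelope(payload: &[u8]) -> Vec<u8> {
    let mut out = vec![SIZED_ENVELOPE];
    encode_size(payload.len() as u32, &mut out);
    out.extend_from_slice(payload);
    out
}

/// Split one sized-envelope record off the front of `buf`.
/// Returns (payload, remaining bytes), or None if the record is incomplete.
fn parse_envelope(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    let (&tag, rest) = buf.split_first()?;
    if tag != SIZED_ENVELOPE {
        return None;
    }
    let (len, used) = decode_size(rest)?;
    let end = used + len as usize;
    let body = rest.get(used..end)?;
    Some((body, &rest[end..]))
}
```

The payload returned by `parse_envelope` is still binary-encoded XML; decoding it is the job of the second layer.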

Options:

1. Hand-roll both framing ([MC-NMF]) and binary message encoding ([MC-NBFX] + [MC-NBFS]). Estimate ~3000 LoC across both layers, split roughly evenly: framing is ~1500 LoC, and the binary XML codec, dictionary tables, and operation-action mapping are roughly the same again.
2. Switch ASB to its HTTP variant if the deployed AVEVA instance supports it (this would let us use a normal text SOAP/XML stack and skip both [MC-NMF] and [MC-NBFX]/[MC-NBFS] entirely).
3. Wrap the .NET ASB DLL in a process and call it via stdin/stdout JSON-RPC.

Current best answer: option 1 (hand-roll both layers). The two specs are public, the encoder is deterministic, and the .NET reference's AsbMessageDumpBehavior already produces ground-truth byte vectors for the dictionary and operation set we use. quick-xml may help with any auxiliary text-XML the wider stack uses, but it cannot decode the binary-encoded message bodies — that requires the [MC-NBFX] + [MC-NBFS] codec.

Settles when: mxaccess-asb-nettcp parses every captured request/reply byte-identical to the .NET reference's IClientChannel payload dump for the proven type matrix, including correct dictionary-ID resolution and round-trip of every observed binary XML node tag.

R2 — Buffered subscription multi-sample body (settled per option (a) — codec change landed under F44)

Severity: P3 (settled — codec accepts multi-record DataUpdate)

Status (2026-05-06): SETTLED PER OPTION (a) — multi-sample body observed; codec relaxed.

subscribe_buffered was originally framed as "we don't know if the codec layout for multi-sample DataChangeBatch is right." A first verification pass against wwtools/mxaccesscli/docs/api-notes.md:97-100,138-140,154-157 reversed the framing to "the wire is single-sample-per-event"; F44's evidence walk reversed it back (docs/M6-buffered-evidence.md).

captures/094-frida-buffered-separate-writer/frida-events.tsv:145 (2026-04-25T21:40:34.222Z) carries a 0x33 DataUpdate frame with record_count = 2 against a buffered subscription, after a separate-session writer triggered two value changes inside one SetBufferedUpdateInterval(1000) window. Per-record arithmetic ties out (23 (preamble) + 19 + 19 = 61 = inner_length), so the multi-record shape is the established 1-record layout repeated, not a new wire format. The .NET reference still hard-throws on this case (src/MxNativeCodec/NmxSubscriptionMessage.cs:71-74); the Rust codec deliberately diverges and decodes it.
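The length arithmetic from capture 094 can be sketched as a consistency check. This is a minimal sketch that assumes every record is 19 bytes, which is only what this one capture shows; records carrying other value types may differ in size. The constant and function names are hypothetical.

```rust
/// Preamble length observed in capture 094.
const PREAMBLE_LEN: usize = 23;
/// Per-record length observed in capture 094 (assumption: fixed for this value type only).
const RECORD_LEN: usize = 19;

/// inner_length expected for `record_count` records of the capture-094 shape.
fn expected_inner_length(record_count: usize) -> usize {
    PREAMBLE_LEN + record_count * RECORD_LEN
}

/// Split the inner body into per-record slices, or None if the lengths do not tie out.
fn split_records(inner: &[u8], record_count: usize) -> Option<Vec<&[u8]>> {
    if record_count == 0 || inner.len() != expected_inner_length(record_count) {
        return None;
    }
    Some(inner[PREAMBLE_LEN..].chunks_exact(RECORD_LEN).collect())
}
```

For `record_count = 2` this gives 23 + 19 + 19 = 61, matching the `inner_length` in the capture.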

The OnBufferedDataChange public event shape the wwtools api-notes describe (hServer, hItem, MxDataType, value, quality, timestamp, statuses — singular value) is correct. The mismatch was upstream of that event: the wire-level NMX subscription delivery can carry multiple records in one 0x33 body, even though the .NET compatibility server fans those out to one event each.

Current best answer: mxaccess-codec decodes 0x33 DataUpdate bodies of any positive record_count; subscribe_buffered continues to expose Stream<Item = DataChange>, fanning the records out one per Stream item. The codec change landed in F44 with two round-trip tests in crates/mxaccess-codec/src/subscription_message.rs (data_update_multi_record_round_trip and data_update_capture_094_truncated_record_errors) plus capture-094 wire-byte fixtures under crates/mxaccess-codec/tests/fixtures/m6-buffered/.

Settles when: ✅ settled per option (a). Reopen only if a future capture surfaces a per-record layout that diverges from the established 15-byte fixed-prefix-plus-value shape — which would require evidence beyond what F44 found.

R3 — `OperationComplete` trigger unproven (settled 2026-05-06 — Path A landed: synthesizer kernel + typed `OperationStatus` events ported)

Severity: P1 (was a blocker; settled per Path A — typed promotion landed via MxStatus::from_packed_u32)

Status (2026-05-06): SETTLED PER PATH A. The five-stage Ghidra walk that previously settled the verdict at "verbatim preserve" was extended with a sixth stage that found the actual byte→MXSTATUS_PROXY synthesizer. It is Lmx.dll!FUN_10100ce0 — a single 4-byte u32 LE → MxStatus decoder used by every NMX-frame parser in Lmx.dll. Bit layout:

bit 31:        success    (-1 if set, 0 if clear)
bits 27..24:   category   (4 bits, masked by 0xF)
bits 23..20:   detected_by (4 bits, masked by 0xF)
bits 15..0:    detail     (i16 — low 16 bits, signed)
bits 30..28, 19..16: reserved/padding

The Rust port now ships this kernel as [MxStatus::from_packed_u32] (and the inverse to_packed_u32 for round-trip parity). Session::operation_status_events() emits typed [OperationStatus] events for every 0x32/0x33-or-similar callback the wire delivers; the synthesizer is byte-deterministic and context-free, so the operation-tracking state machine the original verdict deferred is not required for the kernel itself. Per-operation context tracking (correlating completion frames back to outstanding writes/subscribes) is filed as a follow-up: see F54 below.
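The bit layout above can be sketched as a pure decode/encode pair. This is a minimal sketch, not the shipped `mxaccess-codec` code: the struct here uses raw numeric fields for `category` and `detected_by` (the real port presumably uses typed enums), and the reserved bits are simply ignored on decode and written as zero on encode.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MxStatus {
    success: i16,    // -1 if bit 31 is set, 0 otherwise
    category: u8,    // bits 27..24
    detected_by: u8, // bits 23..20
    detail: i16,     // bits 15..0, reinterpreted as signed
}

impl MxStatus {
    /// Decode a packed status word following the documented bit layout.
    fn from_packed_u32(packed: u32) -> Self {
        MxStatus {
            success: if packed & 0x8000_0000 != 0 { -1 } else { 0 },
            category: ((packed >> 24) & 0xF) as u8,
            detected_by: ((packed >> 20) & 0xF) as u8,
            detail: (packed & 0xFFFF) as u16 as i16,
        }
    }

    /// Inverse of `from_packed_u32`; reserved bits are emitted as zero.
    fn to_packed_u32(self) -> u32 {
        let success: u32 = if self.success != 0 { 0x8000_0000 } else { 0 };
        success
            | ((self.category as u32 & 0xF) << 24)
            | ((self.detected_by as u32 & 0xF) << 20)
            | (self.detail as u16 as u32)
    }

    /// Convenience for the 4-byte little-endian form read from the stream.
    fn from_le_bytes(bytes: [u8; 4]) -> Self {
        Self::from_packed_u32(u32::from_le_bytes(bytes))
    }
}
```

Because the function depends only on its input word, it needs no per-operation state, which is why the kernel could land before the correlation work.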

A second mapping was also ported: MxStatus::from_nmx_response_code covers the constructed-from-response-code path in Lmx.dll!FUN_1010bd10:741-770 (ScanOnDemandCallback::GetResponse), which builds an MxStatus from a 1-byte NMX responseCode field when no payload status word is present. Six proven mappings: 0x01/0x02 → (CommunicationError, RequestingNmx), 0x03 → (ConfigurationError, RequestingNmx), 0x04 → (ConfigurationError, RespondingNmx), 0x05 → (CommunicationError, RespondingNmx), 0x1A → (CommunicationError, RequestingNmx). Unmapped codes return None and the consumer falls back to verbatim preservation per CLAUDE.md "Do not fabricate protocol behavior."
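The six mappings listed above can be sketched as a single `match`. This is a minimal sketch: it returns only the (category, detected-by) pair, because the `success` and `detail` values the real `from_nmx_response_code` fills in are not stated here, and the enum definitions are simplified stand-ins for the crate's types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Category {
    CommunicationError,
    ConfigurationError,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DetectedBy {
    RequestingNmx,
    RespondingNmx,
}

/// Map a 1-byte NMX responseCode to its proven (category, detected-by) pair.
/// Any code outside the six proven values yields None, so the caller can fall
/// back to verbatim preservation instead of guessing.
fn from_nmx_response_code(code: u8) -> Option<(Category, DetectedBy)> {
    use Category::*;
    use DetectedBy::*;
    match code {
        0x01 | 0x02 | 0x1A => Some((CommunicationError, RequestingNmx)),
        0x03 => Some((ConfigurationError, RequestingNmx)),
        0x04 => Some((ConfigurationError, RespondingNmx)),
        0x05 => Some((CommunicationError, RespondingNmx)),
        _ => None,
    }
}
```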

What about the 1-byte completion frames 0x00/0x41/0xEF? Those are NOT decoded by FUN_10100ce0 — they're a different wire field (the NMX operation-status callback payload, not the INmxService.GetResponse2 responseCode parameter). Lmx.dll's decoder for those frames does not invoke any status-synthesis logic; they propagate as raw byte → MxStatus { success: 0, Unknown, Unknown, detail: byte }. The Rust port preserves this exactly. R4 is settled by the same fact (see below).

Aside — the .NET-reference shim was always half-implemented. Verified at src/MxNativeClient/MxNativeCompatibilityServer.cs:756 + src/MxNativeCodec/NmxOperationStatusMessage.cs:18: MxNativeCompatibilityServer fires WriteCompleted only when IsMxAccessWriteComplete is true, which gates strictly on Format == StatusWord && StatusCode == 0x8050 && CompletionCode == 0x00 — i.e. the one exact 5-byte pattern 00 00 50 80 00 (= MxStatus.WriteCompleteOk). Every other completion frame (the 1-byte 0x00/0x41/0xEF ones and any non-success status word) is silently dropped at the gate. The native consumer-facing WriteCompleted event has therefore only ever fired for unambiguous successful writes — failure outcomes have been invisible at the compatibility-shim layer for the entire history of the .NET reference. Path A's kernel (from_packed_u32) closes this asymmetry on the Rust side: Session::operation_status_events() exposes all typed outcomes the upstream synthesizer produces, not just the WriteCompleteOk slice. The Rust port now has strictly broader operation-status visibility than the .NET reference offered.
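The effect of that gate can be sketched in a few lines. This is an illustrative sketch of the .NET shim's behaviour, written in Rust for consistency with the port; it compares the whole 5-byte pattern rather than parsing the individual fields, and the function names are hypothetical.

```rust
/// The single 5-byte pattern the .NET shim lets through (MxStatus.WriteCompleteOk).
const WRITE_COMPLETE_OK: [u8; 5] = [0x00, 0x00, 0x50, 0x80, 0x00];

/// Mirrors the shim's gate: true only for the exact success pattern.
fn shim_fires_write_completed(frame: &[u8]) -> bool {
    frame == WRITE_COMPLETE_OK.as_slice()
}

/// Count how many frames in a batch the gate would silently drop.
fn dropped_by_shim(frames: &[&[u8]]) -> usize {
    frames
        .iter()
        .filter(|f| !shim_fires_write_completed(**f))
        .count()
}
```

Given one success frame and the 1-byte frames 0x41 and 0xEF, the gate forwards one event and drops two, which is the visibility gap described in the aside.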

Logs:

analysis/ghidra/exports/Lmx.dll.aadct-decompile.md — aaDCT symbol (stage 1)
analysis/ghidra/exports/LmxProxy.dll.completion-status-decompile.md — Fire_* event handlers (stage 2)
analysis/ghidra/exports/LmxProxy.dll.fire-event-xrefs.md — xrefs to Fire_* (stage 3)
analysis/ghidra/exports/LmxProxy.dll.status-synthesis-decompile.md — Fire_* callers (stage 4)
analysis/ghidra/exports/LmxProxy.dll.mxstatus-safearray-decompile.md — FUN_10003f60 (stage 5)
analysis/ghidra/exports/Lmx.dll.set-attribute-result-decompile.md — PreboundReference::OnSetAttributeResult (stage 6, entry to next ring)
analysis/ghidra/exports/Lmx.dll.set-attribute-result-xrefs.md — xrefs to OnSetAttributeResult/CancelWithStatus/OperationComplete (next-ring discovery)
analysis/ghidra/exports/Lmx.dll.synthesizer-decompile.md — ScanOnDemandCallback::OperationComplete/MultipleOperationComplete (FUN_1010b990), RemotePlatformResolver::OperationComplete (FUN_1010dc80), and the constructed-from-responseCode synthesizer FUN_1010bd10 (lines 698-770)
analysis/ghidra/exports/Lmx.dll.synthesizer-helpers-decompile.md — FUN_10003fc0 (the <success %d category %d ...> formatter), FUN_1008f150 (the dispatch helper), PreboundReference constructors
analysis/ghidra/exports/Lmx.dll.synthesizer-helpers2-decompile.md — the synthesizer kernel FUN_10100ce0 (4-byte u32 → MxStatus decoder), FUN_10100bc0 (3×u16 reader), FUN_1005e580 (4-byte stream reader), FUN_1010ee00 (sister NMX-frame parser using the same kernel)
analysis/ghidra/exports/Lmx.dll.synthesizer-callers-xrefs.md — caller graph for the synthesizer ring

Findings, layer by layer (the wire bytes flow inward; the synthesis flows outward):

Lmx.aaDCT at 0x10178fc0 is a SysAllocString(L"Lmx.aaDCT") into a global BSTR — a tracing category name, not a status-mapping table. No array / lookup logic.
MXSTATUS_PROXY (16 bytes, Pack=4) is a 4-field marshalled struct: success: i16 at offset 0, category: i16 at offset 4, detectedBy: i16 at offset 8, detail: i16 at offset 12. It is the output of synthesis, not a lookup-table entry.
LmxProxy.dll Fire_* event handlers (FUN_10015f72, FUN_1001611f, FUN_10016271, FUN_100163c0) take an already-populated MXSTATUS_PROXY[] and forward it through ATL connection-point dispatch. No synthesis here.
LmxProxy.dll Fire_* callers (FUN_1001657f for OnDataChange / OnBufferedDataChange, FUN_10016b50 for OnWriteComplete, FUN_10016d4b for OperationComplete) call FUN_10003f60(out_safearray, in_status_ptr, count=1) which creates the SafeArray. FUN_10003f60 is a verbatim memcpy of an existing 14-byte buffer into the SAFEARRAY data, with no transformation.
Lmx.dll PreboundReference::OnSetAttributeResult (FUN_10114a90) — the CALLER of step 4's path — receives an already-populated short *param_7 status buffer; synthesis is upstream of THIS function too.
The synthesizer kernel itself: Lmx.dll!FUN_10100ce0 (see analysis/ghidra/exports/Lmx.dll.synthesizer-helpers2-decompile.md). A 4-byte u32 LE read from a stream → 4-tuple MxStatus decoder. Pure transformation, no operation-context dependency. Used by every NMX-frame parser in Lmx.dll (FUN_1010bd10 ScanOnDemandCallback::GetResponse, FUN_1010ee00 AccessManager::ProcessNmxRequest, FUN_10110986, etc.) — the upstream decoder reads the wire bytes, the kernel translates them.
The constructed-when-no-bytes path: when an NMX responseCode != 0 arrives without a payload status word, FUN_1010bd10:741-770 constructs an MxStatus from the responseCode itself via a fixed switch. Six proven response codes (1, 2, 3, 4, 5, 0x1A); see the table in the MxStatus::from_nmx_response_code doc.

Path A landed. The synthesizer kernel and the constructed-from-response-code switch were both portable as pure functions — no operation-tracking state machine required for the kernel itself, because FUN_10100ce0 is byte-deterministic. Rust port:

mxaccess-codec::status::MxStatus::from_packed_u32(packed: u32) -> MxStatus — the kernel.
mxaccess-codec::status::MxStatus::to_packed_u32() -> u32 — inverse, for round-trip parity.
mxaccess-codec::status::MxStatus::from_nmx_response_code(byte: u8) -> Option<MxStatus> — the response-code switch.
mxaccess::OperationKind + mxaccess::OperationContext types for future correlation work (per-operation tracking is filed as F54).
mxaccess::Session::operation_status_events() returns broadcast::Receiver<Arc<OperationStatus>>; operation_status_stream() returns the Stream<Item = Result<...>> variant.
mxaccess::OperationStatus { raw, status, context, is_during_recovery } — matches MxNativeOperationStatusEvent (MxNativeSession.cs:73-78) plus typed MxStatus promotion.
The callback router (session::callback_router) now tries operation-status parsing first, mirroring MxNativeSession.OnCallbackReceived:574.

What about the 1-byte completion frames 0x00/0x41/0xEF? They are NOT decoded by FUN_10100ce0 (they're a different wire field at a different layer — the NMX operation-status callback payload, not the INmxService.GetResponse2 responseCode parameter). Per CLAUDE.md "Do not fabricate protocol behavior" they continue to propagate as MxStatus { success: 0, Unknown, Unknown, detail: byte }. R4 is settled by the same fact.

Current best answer: Path A landed. Session::operation_status_events() emits typed OperationStatus events. The synthesizer kernel (MxStatus::from_packed_u32) is exposed for any consumer that holds a 4-byte packed status word (e.g. extracted from a subscription record's status: i32 field). Per-operation context (correlating completion frames back to outstanding writes/subscribes) is the next step — filed as F54.

Reopen when: F54 lands per-operation correlation, or a future capture surfaces a fresh wire field whose synthesis logic doesn't reduce to FUN_10100ce0 + from_nmx_response_code (no such field has been observed to date).

R4 — Completion-only byte mapping (settled 2026-05-06 — verbatim-preserve confirmed; synthesizer doesn't apply at this layer)

Severity: P1 (was a blocker; now settled per the same R3 Path A finding — by exclusion)

Status (2026-05-06): SETTLED. R3's Path A walk traced the byte→MxStatus synthesizer to Lmx.dll!FUN_10100ce0, a 4-byte u32 LE → MxStatus decoder. The 1-byte completion frames 0x00, 0x41, 0xEF (work_remain.md:164–174) are NOT input to that decoder — they're a different wire field, observed at a different layer (the NMX operation-status callback payload, not the INmxService.GetResponse2 responseCode parameter or any 4-byte packed status field). Lmx.dll's decoder for the 1-byte completion-only inner body does not invoke any synthesis logic; the bytes propagate untransformed.

Current best answer: unchanged — preserve as MxStatus { Success: 0, Category: Unknown, DetectedBy: Unknown, Detail: byte }. mxaccess-codec::NmxOperationStatusMessage::promote_to_typed returns the verbatim placeholder for these frames; mxaccess::Session::operation_status_events() surfaces them via the typed OperationStatus.status field with the byte preserved in detail.
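The verbatim placeholder can be sketched as below. This is a minimal sketch with simplified stand-in types (the real `promote_to_typed` signature is not reproduced here, and `preserve_completion_byte` is a hypothetical name); the only behaviour shown is that the byte lands in `detail` untouched and nothing else is inferred from it.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Category {
    Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DetectedBy {
    Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MxStatus {
    success: i16,
    category: Category,
    detected_by: DetectedBy,
    detail: i16,
}

/// Preserve a 1-byte completion frame without assigning it any meaning.
fn preserve_completion_byte(byte: u8) -> MxStatus {
    MxStatus {
        success: 0,
        category: Category::Unknown,
        detected_by: DetectedBy::Unknown,
        detail: i16::from(byte), // zero-extended, so 0xEF stays 0x00EF
    }
}
```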

Reopen when: a fresh capture proves a synthesis rule for a specific 1-byte completion code under a specific operation context (e.g. via Frida traces that pair the LmxProxy.dll!FUN_10003f60 input with the observed event payload). At that point file a sub-followup with the captured (byte, context, observed status) triple and decide whether to add a typed mapping.

R5 — Activate / Suspend behaviour (partially observed — F44 documented client-side trigger; wire-side residual gap filed as F46, hook landed pending live re-run)

Severity: P2 (downgraded from P1 — client-side acceptance criteria are now documented; LMX-proxy wire emission remains unconfirmed)

Status (2026-05-06): PARTIALLY OBSERVED — Frida hooks ready, live capture pending. F44's evidence walk on captures/077-frida-suspend-advised-scanstate/ (per docs/M6-buffered-evidence.md) documents:

Suspend returns synchronously with MxStatus.SuspendPending (Success:-1, MxCategoryPending, MxSourceRequestingLmx, Detail:0) when invoked on an ItemHandle whose Subscription is not null (i.e. immediately after a successful Advise / AdviseSupervisory).
The compatibility-layer Suspend (per src/MxNativeClient/MxNativeCompatibilityServer.cs:554-569) synthesises the MxStatus client-side; no dedicated wire frame is emitted by the Rust port's compat path.

What capture 077 could not answer: whether the production LmxProxy.dll stack issues a separate ORPC method for Suspend / Activate (e.g. an ILMXProxyServer5 opnum) or also handles them client-side. Capture 077's Frida script did not hook LmxProxy.dll!CLMXProxyServer.Suspend/.Activate, so the wire-side behaviour is invisible.

Next step — F46. analysis/frida/mx-nmx-trace.js now carries Interceptor.attach blocks for LmxProxy.dll!CLMXProxyServer.Suspend (RVA 0x13d9c, FUN_10013d9c) and .Activate (RVA 0x14028, FUN_10014028), emitting mx.suspend.begin/end and mx.activate.begin/end events with the MxStatus* out-parameter decoded as 4 × int16. No Resume / Reactivate symbols exist in LmxProxy.dll — verified against analysis/ghidra/exports/LmxProxy.dll.ghidra.md and the decompiled ILMXProxyServer5 / ILMXProxyServer4 interfaces. R5 stays open until a live re-run on the AVEVA host produces captures/NNN-frida-suspend-activate-instrumented/ per the procedure documented at the top of analysis/frida/mx-nmx-trace.js.

Current best answer: expose Session::suspend(item) and Session::activate(item) returning Result<MxStatus, Error>. The success criteria match the .NET reference's client-side gating: the item must have an active subscription. If F46's wire capture later proves the LMX proxy issues a separate ORPC method, add the wire emission here in M6 follow-up. Do not build callback-driven state transitions on top until F46 settles.

Settles when: F46 produces a Frida capture instrumenting LmxProxy.dll!CLMXProxyServer.Suspend / .Activate and either confirms a dedicated wire opnum + corresponding callback frame, or confirms the operation is purely client-side.

R6 — `0x80004021` in `MxNativeSession.WriteSecuredAsync` is a .NET-reference defect, not a real LMX constraint

Severity: P3 (formerly P1 — downgraded after wwtools/mxaccesscli/ verification)

Original framing of this risk asserted that "WriteSecured (without 2) returns 0x80004021 before sending the body" and concluded the single-token form was deprecated or rejected at the wire. That framing was wrong. Verification against wwtools/mxaccesscli/ (a working CLI built on the production LMXProxyServerClass 32-bit COM proxy, i.e. the actual MxAccess surface) establishes:

The LMX WriteSecured ALWAYS takes two user ids: (currentUserId, verifierUserId, value) (wwtools/mxaccesscli/docs/api-notes.md:60-72, wwtools/mxaccesscli/src/MxAccess.Cli/Mx/MxItem.cs:69-70).
"Single-user secured write" is the same API called with currentUserId == verifierUserId — it is not a separate API surface (wwtools/mxaccesscli/src/MxAccess.Cli/Commands/WriteCommand.cs:151-155,196-199).
WriteSecured2 adds a timestamp parameter; it does not add a second token. The 1-vs-2 distinction in this design's earlier drafts was a confusion between "with timestamp" (Write2 vs Write) and "two-token" (which is always true).
The 0x80004021 failure observed in src/MxNativeClient/MxNativeSession.cs:218-221 is therefore a defect of the .NET native reimplementation, not behaviour the LMX proxy itself produces.

Current best answer: mxaccess exposes write_secured(reference, value, current_user_id, verifier_user_id) (no timestamp) and write_secured_at(reference, value, timestamp, current_user_id, verifier_user_id) (with timestamp), matching WriteSecured and WriteSecured2 respectively. Both always pass two user ids; callers performing single-user secured writes pass the same id twice. The Error::Unsupported mapping for "single-token form" has been removed from 50-error-model.md.

Settles when: the MxNativeSession.WriteSecuredAsync defect is fixed in the .NET reference, or a captured frame shows the LMX proxy itself producing 0x80004021 on a WriteSecured call (which would resurrect the original framing). Default-positive: this likely settles silently as "not a real risk."

R7 — Status mapping for non-success ASB cases

Severity: P2 (nice-to-have / minor — unknown bytes preserved as raw)

work_remain.md:132–143: live probes have not yet exercised access-denied and no-communication on the current VM. The Rust port mirrors what the .NET reference proves; remaining ASB error/quality/detail bytes are preserved as raw and surfaced through MxStatus.detail until a safe live capture lands.

Current best answer: preserve unknown payloads. Document the gap.

Settles when: live capture against a configured access-denied tag and a no-communication endpoint produces the expected MxStatus shape.

Implementation-level

R8 — NTLMv2 cross-domain auth (permanently deferred 2026-05-06 — external infrastructure gap)

Severity: P1 (significant blocker for cross-domain deployments — single-domain ships)

Status (2026-05-06): PERMANENTLY DEFERRED. The implementation already parses NTLM AV pairs per [MS-NLMP] §2.2.2.1, including the cross-domain AV pair shapes (MsvAvDnsTreeName, MsvAvDnsComputerName carry the trusted-domain DNS suffix instead of the local one). What's missing is the live capture needed to pin a regression fixture — and that requires a multi-domain Windows lab (e.g. LAB-A + LAB-B with cross-domain trust + an AVEVA install on LAB-A authenticating a LAB-B-domain user) which is not available on the dev host. Same external-infrastructure constraint as F3 in design/followups.md. R8 is closed in the same sense F3 is closed — the implementation is in place per spec; only the evidence is gated on hardware that doesn't exist here.

Captured traffic is single-domain (local AVEVA install). Cross-domain NTLM exercises the AV pair codepaths but the bytes haven't been pinned.

Current best answer: the AV pair parser handles the cross-domain shape per [MS-NLMP] §2.2.2.1; document mxaccess-rpc as untested across domains in the README. The mxaccess-rpc::ntlm round-trip tests cover the single-domain shape; the cross-domain shape round-trips through the same code path (the AV pair parser is shape-agnostic) but no live fixture pins it.

Reopen when: a multi-domain AVEVA test harness becomes available + a cross-domain probe runs successfully end-to-end with packet-integrity signatures verified. Until then, this risk is permanently deferred — same status pattern as F3.

R9 — DPAPI dependency for ASB

Severity: P2 (nice-to-have / minor — explicit shared_secret constructor is the escape hatch)

ASB shared-secret retrieval uses ProtectedData.Unprotect (LocalMachine scope). Linux has no DPAPI. There is no portable replacement; the secret is encrypted at rest with a Windows-specific KCV.

Current best answer: mxaccess-asb requires Windows for the credential read path. Provide an explicit AsbCredentials::shared_secret(secret: &[u8]) constructor that bypasses DPAPI for tooling that has the secret in plaintext (e.g. CI tests, ops automation).

Settles when: never. DPAPI is not portable; the escape hatch is the explicit constructor.

R10 — Galaxy SQL schema versioning

Severity: P1 (significant blocker per affected feature — break-loud on mismatch)

The recursive CTE in GalaxyRepositoryTagResolver.cs assumes the current AVEVA schema. Older Galaxy versions may have different table layouts.

Current best answer: target the schema that ships with the AVEVA version MxNativeClient validates against. Document the expected schema version. Break loudly on mismatch (ConfigError::Galaxy { reason }).

Settles when: a multi-version test matrix is set up. Probably not in V1.

R11 — x86 proxy/stub workaround

Severity: P2 (nice-to-have / minor — integration test catches binding-shape drift)

NmxSvcps.dll is x86-only. The replacement strategy bypasses the in-proc proxy by speaking ORPC directly. This works because we control both Type1/Type3 marshalling and RemQueryInterface. But it depends on NmxSvc continuing to expose IPv4 NCACN_IP_TCP bindings via the OXID.

Current best answer: add an mxaccess-rpc integration test that asserts ResolveOxid returns at least one ncacn_ip_tcp binding. Fail fast if the binding shape changes in a future AVEVA release.
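The core of that assertion can be sketched as below. This is a minimal sketch, assuming the resolved bindings have already been parsed into a simple struct and that ncacn_ip_tcp is identified by tower id 0x0007; the `StringBinding` type and `first_tcp_binding` helper are hypothetical, not the `mxaccess-rpc` API.

```rust
/// Tower id assumed for ncacn_ip_tcp in a resolved string binding.
const NCACN_IP_TCP: u16 = 0x0007;

#[derive(Debug, Clone, PartialEq, Eq)]
struct StringBinding {
    tower_id: u16,
    network_address: String,
}

/// Return the first ncacn_ip_tcp binding, or a descriptive error so the
/// integration test fails fast when the binding shape changes.
fn first_tcp_binding(bindings: &[StringBinding]) -> Result<&StringBinding, String> {
    bindings
        .iter()
        .find(|b| b.tower_id == NCACN_IP_TCP)
        .ok_or_else(|| {
            format!(
                "ResolveOxid returned {} binding(s), none of them ncacn_ip_tcp",
                bindings.len()
            )
        })
}
```

An integration test would feed this the bindings from a live ResolveOxid call and assert the result is `Ok`.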

Settles when: that integration test is in CI gating.

R12 — Performance — codec allocations

Severity: P2 (nice-to-have / minor — micro-optimisation in M6)

The .NET reference reuses byte[] arrays via MemoryPool; the Rust port should use bytes::Bytes for zero-copy on receive and pre-allocate via BytesMut on encode. The codec currently allocates Vec<u8> per encode; tolerable for V1, worth optimising in M6.

Current best answer: use BytesMut::with_capacity(MAX_FRAME) per session. Bench in M6. Aim for < 5 allocations per write at steady state.

Settles when: cargo bench shows the target allocation count.

R13 — DataUpdate `recordCount != 1` panic risk

Severity: P1 (significant blocker for production stability — soft-error path documented)

src/MxNativeCodec/NmxSubscriptionMessage.cs:71-74 hard-throws ArgumentException on any 0x33 DataUpdate whose recordCount is not exactly 1:

if (recordCount != 1)
{
    throw new ArgumentException("Observed NMX DataUpdate callback parser currently supports one record per body.", nameof(inner));
}

R2 covers the missing fixture for the multi-record case, but the bigger production-side risk is separate: the first time AVEVA emits a multi-record 0x33 against a deployed Rust client, the codec — if it ports the .NET behaviour faithfully — will panic / return a hard decode error and tear down the subscription. We have no fixture proving multi-record bodies don't happen on real installs; we only have evidence they haven't happened on our install.

Options:

1. Mirror the .NET reference and hard-error on recordCount != 1. Loud, but kills the session.
2. Surface as a typed soft error (e.g. ProtocolError::Decode { reason: "multi-record DataUpdate not yet supported" }), log at warn, and drop the frame. The subscription stays alive; the consumer sees a single missed update, not a teardown.
3. Speculatively decode multi-record (assume the per-record layout from the single-record case repeats). Explicitly forbidden by CLAUDE.md "Do not fabricate protocol behavior."

Current best answer: option 2 in Rust. Map the condition to ProtocolError::Decode { reason: "multi-record DataUpdate not yet supported" }, emit a tracing::warn! with the raw frame bytes attached as a hex field, and continue. Do not synthesise per-record decoding. The .NET-style hard throw stays as-is in the .NET reference (it is the executable spec, and a panic there is what produces the fixture we need — see R2). The Rust port deliberately diverges here on production-safety grounds, with the divergence documented in 50-error-model.md.
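The soft-error branch described in this entry can be sketched as follows. This is a minimal sketch: `eprintln!` stands in for `tracing::warn!` so the block has no external dependencies, the `u16` type for the record count is an assumption, and `accept_data_update` is a hypothetical helper rather than the codec's real entry point.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ProtocolError {
    Decode { reason: &'static str },
}

/// Render raw frame bytes as lowercase hex for the log line.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Typed soft error instead of a panic when the record count is not exactly 1.
fn check_record_count(record_count: u16) -> Result<(), ProtocolError> {
    if record_count == 1 {
        Ok(())
    } else {
        Err(ProtocolError::Decode {
            reason: "multi-record DataUpdate not yet supported",
        })
    }
}

/// Returns true if the frame should be decoded, false if it was logged and dropped.
/// The subscription stays alive either way.
fn accept_data_update(record_count: u16, raw: &[u8]) -> bool {
    match check_record_count(record_count) {
        Ok(()) => true,
        Err(err) => {
            // Stand-in for tracing::warn! with the raw frame attached as a hex field.
            eprintln!("warn: dropping DataUpdate frame: {err:?} raw={}", hex(raw));
            false
        }
    }
}
```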

Settles when: R2's multi-record fixture lands and the codec gains a proven typed decode path; then R13 collapses into "supported, no special handling" and the soft-error branch becomes dead code that can be removed.

R14 — Fabricated `0x80004021 → StaleItem` mapping

Severity: P1 (significant blocker — fabrication risk; corrected in 50-error-model.md)

A draft of 50-error-model.md mapped HRESULT 0x80004021 to a typed StaleItem error category for regular (non-secured) operations. This mapping is unevidenced.

R6 already covers 0x80004021 on secured-write specifically: per wwtools/mxaccesscli/ verification, this is a MxNativeSession.WriteSecuredAsync defect (the .NET native reimplementation throws NotSupportedException before reaching the wire), not a real LMX-proxy constraint. The production LMX surface accepts WriteSecured with two user ids unconditionally. R6 explicitly does not generalise the .NET defect to a typed "stale" error.
For regular operations, the actual stale-handle / invalid-arg HRESULT observed in captures is 0x80070057 (E_INVALIDARG). There is no captured frame, decompiled mapping table, or live probe in this repo that produces 0x80004021 on a non-secured path, and certainly none that justifies tagging it StaleItem.

This is a fabrication risk: the kind of "looks plausible from naming" mapping that CLAUDE.md "Do not fabricate protocol behavior" exists to prevent.

Options:

1. Drop the StaleItem category entirely. Regular-op 0x80004021, if ever observed, falls through to the generic Hresult { code, hint: None } branch with the raw HRESULT preserved.
2. Keep StaleItem but rename the source HRESULT to 0x80070057 and require a captured fixture before promoting any frame to that category.
3. Keep the 0x80004021 → StaleItem mapping. Forbidden: no evidence backs it.

Current best answer: option 1 for V1. Surface unknown HRESULTs as Error::Hresult { code } and let consumers match on the raw value. 50-error-model.md is being corrected in parallel (review cluster 3) to remove the StaleItem reference; this risk register entry exists so the mistake is recorded for future contributors and not silently re-introduced when someone reaches for an ergonomic typed name.
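The raw pass-through can be sketched as below. This is a minimal sketch with a one-variant stand-in for the real error enum; the only rule it encodes is the standard HRESULT convention that a set high bit means failure, and `check_hresult` is a hypothetical helper name.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Error {
    /// Unknown failure HRESULT, preserved verbatim with no typed category attached.
    Hresult { code: u32 },
}

/// Pass success codes through and wrap failure codes without interpreting them.
fn check_hresult(code: u32) -> Result<(), Error> {
    if code & 0x8000_0000 != 0 {
        Err(Error::Hresult { code })
    } else {
        Ok(())
    }
}

/// Consumers match on the raw value themselves.
fn is_invalid_arg(err: &Error) -> bool {
    matches!(err, Error::Hresult { code: 0x8007_0057 })
}
```

Nothing in this path names a "stale" condition; a consumer that wants one has to decide that from the raw code and its own evidence.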

Settles when: indefinitely deferred — no current artifact maps either 0x80004021 or 0x80070057 to a "stale handle" semantic, and inventing one violates the "don't fabricate protocol behaviour" rule. If a future capture or decompiled mapping table produces evidence, reopen as a typed-error proposal.

R15 — Drop-time async cleanup hazards

Severity: P1 (significant blocker — server-side handle leak on runtime shutdown)

design/00-overview.md:38 states the principle "no spawn from inside Drop." design/20-async-layer.md and design/50-error-model.md describe Subscription drop semantics that fire UnAdvise/UnregisterEngine against the server. Reconciling these is non-trivial because:

tokio::spawn from Drop panics if no Tokio runtime is current at drop time. A user dropping a Session from a std::thread after Runtime::shutdown_timeout returns will hit this.
During Runtime::shutdown_timeout, spawned tasks are aborted before they can flush. Even if a runtime is current, spawning the cleanup from Drop does not guarantee the unadvise/unregister actually reaches the server — the drop call returns immediately and the spawned task may be cancelled before the bytes hit the wire.
The result is a server-side handle leak in NmxSvc: subscriptions stay live, registered engines stay registered, until the TCP connection itself is torn down (which only happens once the kernel notices the socket is dead).

Options:

1. Best-effort tokio::spawn from Drop. Documented hazard. Leaks on runtime shutdown and panics on no-runtime.
2. Drop sends UnAdvise/UnregisterEngine via a tokio::sync::oneshot (or unbounded mpsc) to a long-lived connection task that owns the cleanup loop. Drop itself never spawns — it pushes a message onto the channel and returns. The connection task drains the channel until the TCP connection is itself dropped, at which point the server cleans up by socket close anyway.
3. Require the consumer to call Session::shutdown(timeout).await and document Drop as "best-effort, may leak under shutdown" — no automatic cleanup at all.

Current best answer: option 2. A long-lived connection task owns the cleanup channel and drains it; Drop pushes an UnAdvise/UnregisterEngine request onto a tokio::sync::oneshot (one per resource) or a per-connection unbounded mpsc and returns synchronously. This keeps Drop cheap, satisfies "no spawn from Drop," and gives the cleanup a reasonable best-effort guarantee while the connection task is alive. A runtime-shutdown leak window remains: if the connection task is itself aborted by Runtime::shutdown_timeout before draining the channel, the cleanup messages are dropped on the floor and the server-side handles remain registered until NmxSvc observes the TCP socket close. This window is documented in 50-error-model.md's cancellation semantics; consumers running under explicit shutdown should call Session::shutdown(timeout).await for deterministic cleanup. See design/00-overview.md:38 (no-spawn-from-Drop principle), design/20-async-layer.md (Subscription drop semantics), and design/50-error-model.md (cancellation semantics).
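The shape of option 2 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the M4 implementation: `std::sync::mpsc` stands in for the Tokio unbounded mpsc, and the `Cleanup` enum, its field names, `cleanup_channel`, and `drain_pending` are hypothetical names introduced for the example.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Cleanup requests a dropped resource hands to the connection task.
/// (Variant payloads are assumptions for this sketch.)
#[derive(Debug, PartialEq, Eq)]
pub enum Cleanup {
    UnAdvise { item_handle: u32 },
    UnregisterEngine { engine_id: u32 },
}

/// Hypothetical subscription handle; it holds only what Drop needs.
pub struct Subscription {
    item_handle: u32,
    cleanup_tx: Sender<Cleanup>,
}

impl Subscription {
    pub fn new(item_handle: u32, cleanup_tx: Sender<Cleanup>) -> Self {
        Self { item_handle, cleanup_tx }
    }
}

impl Drop for Subscription {
    fn drop(&mut self) {
        // Never spawn and never block: push the request and return.
        // A send error means the receiving side is already gone; the
        // request is then lost and the server cleans up on socket close.
        let _ = self.cleanup_tx.send(Cleanup::UnAdvise {
            item_handle: self.item_handle,
        });
    }
}

/// Create the per-connection cleanup channel.
pub fn cleanup_channel() -> (Sender<Cleanup>, Receiver<Cleanup>) {
    channel()
}

/// Stand-in for one pass of the connection task's drain loop:
/// collect whatever is queued right now without blocking.
pub fn drain_pending(rx: &Receiver<Cleanup>) -> Vec<Cleanup> {
    rx.try_iter().collect()
}
```

The point of the design is visible in `drop`: it is synchronous, infallible from the caller's perspective, and independent of whether any async runtime is current. The leak window described above corresponds to the case where nothing ever calls the drain step again.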

Settles when: the connection-task cleanup channel is implemented in M4, a stress test under churn confirms drop semantics on a live runtime do not leak, and the runtime-shutdown leak window is captured in a runnable test fixture (consumer drops Session after Runtime::shutdown_timeout; assert that the leak is bounded by socket-close timeout).

R16 — Crypto/auth crate maintenance drift

Severity: P1 (significant blocker — yank/advisory in CI breaks the build)

The auth surface area depends on a small cluster of marginal-maintenance crates. design/30-crate-topology.md:130 pins rc4, sha-1, md-5, num-bigint; design/10-raw-layer.md:252 instructs "Do not pull ring — hand-roll MD4." Of these:

rc4 is at minimum-maintenance, with a small maintainer pool and no recent releases.
sha-1 v0.10 is the last RustCrypto release that ships with a deprecation warning (the algorithm itself, not the crate's quality, is what's deprecated upstream).
md-5 and num-bigint are stable but not on the active-development frontier.
The hand-rolled MD4 in mxaccess-rpc has no upstream at all — it lives in this repo.

The risk is that any one of these crates gets yanked, picks up a RUSTSEC advisory, or stops compiling against a future Rust toolchain, and cargo-deny (or cargo audit) in CI fails the build for everyone, without any actual bug being found in our usage. This is especially bad if it happens during a live release window.

Options:

1. Pin to known-good versions in workspace Cargo.toml and let CI break when an advisory lands. Triage manually.
2. Pin and subscribe to cargo-deny advisory feeds with a documented response process; pre-stage replacement plans for each crate (e.g. "if rc4 is yanked, fall back to a hand-rolled cipher in mxaccess-rpc::crypto::rc4 — RC4 is ~30 LoC and we already hand-roll MD4").
3. Hand-roll all of them up front (RC4, SHA-1, MD5, MD4 are all small) and depend on num-bigint only. Reduces the surface area to one external crate; increases the in-repo cryptographic LoC.

Current best answer: option 2 for V1. Pin to known-good versions in workspace Cargo.toml; subscribe to cargo-deny advisories in CI; document a fallback plan per crate (hand-rolled RC4 if rc4 is yanked, hand-rolled SHA-1/MD5 if sha-1/md-5 are pulled, swap num-bigint for crypto-bigint if it's pulled). Reassess in M6 and consider option 3 (hand-roll-everything) if any of the pins fire during V1 development. See design/30-crate-topology.md:130 and design/10-raw-layer.md:252.
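To give a sense of the size of the RC4 fallback, here is a minimal sketch of the textbook algorithm (key scheduling plus keystream generation). It is not the contents of any existing module in this repo; the type name `Rc4` and method names are assumptions chosen for the example, and a real fallback would still need review against the call sites in the auth path.

```rust
/// Textbook RC4 state: a 256-byte permutation plus two indices.
pub struct Rc4 {
    s: [u8; 256],
    i: u8,
    j: u8,
}

impl Rc4 {
    /// Key-scheduling algorithm (KSA). The key must be 1..=256 bytes.
    pub fn new(key: &[u8]) -> Self {
        assert!(!key.is_empty() && key.len() <= 256, "RC4 key must be 1..=256 bytes");
        let mut s = [0u8; 256];
        for (idx, b) in s.iter_mut().enumerate() {
            *b = idx as u8; // identity permutation
        }
        let mut j: u8 = 0;
        for i in 0..256usize {
            j = j.wrapping_add(s[i]).wrapping_add(key[i % key.len()]);
            s.swap(i, j as usize);
        }
        Rc4 { s, i: 0, j: 0 }
    }

    /// XOR the keystream into `data` in place (PRGA).
    /// Encryption and decryption are the same operation.
    pub fn apply_keystream(&mut self, data: &mut [u8]) {
        for byte in data.iter_mut() {
            self.i = self.i.wrapping_add(1);
            self.j = self.j.wrapping_add(self.s[self.i as usize]);
            self.s.swap(self.i as usize, self.j as usize);
            let t = self.s[self.i as usize].wrapping_add(self.s[self.j as usize]);
            *byte ^= self.s[t as usize];
        }
    }
}
```

A yank rehearsal for rc4 would swap the crate for a module of this shape and confirm the existing auth tests still pass; the widely published RC4 test vectors are a natural first check before that.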

Settles when: cargo-deny check advisories runs green in CI on a fresh advisory database, the workspace Cargo.toml pins are documented inline with their fallback plans, and a "yank rehearsal" (manually mark a pin as yanked locally and confirm the fallback compiles) has been done at least once per crate.

Open questions

Q1 — Where does the Rust workspace live? (unresolved)

CLAUDE.md proposes a sibling rust/ directory at c:\Users\dohertj2\Desktop\mxaccess\rust\, but this is a proposal, not a confirmation: a glob of rust/ confirms zero files exist there today, and CLAUDE.md itself hedges with "when it is started." M0 cannot start until this is confirmed.

Owner: project lead.

Action: confirm the path c:\Users\dohertj2\Desktop\mxaccess\rust\ or pick an alternative location; create the empty rust/ directory (or sibling) before M0 begins.

Current best answer: still pending. The CLAUDE.md proposal is the default and is what M0 will assume unless overridden, but treat this as an open decision rather than a confirmed answer.

Settles when: the workspace directory exists on disk and contains a Cargo.toml (even an empty one).

Q2 — License? (resolved: MIT)

The .NET reference has no LICENSE file at the repo root. The Rust crates need one before publish.

Resolved (2026-05-05): MIT (single-license, not the dual MIT OR Apache-2.0). All workspace deps verified MIT/Apache-2.0 compatible; MIT alone satisfies every dep's downstream license obligation. LICENSE file added at the project root (c:\Users\dohertj2\Desktop\mxaccess\LICENSE). All crate Cargo.tomls set license = "MIT" via workspace.package.

Settles when: N/A — resolved.

Q3 — Cross-platform reach (Linux, macOS)

The codec, ASB SOAP framing, and the async session are theoretically portable. Galaxy SQL via tiberius works on Linux. NTLM works on Linux. DPAPI does not. Active Directory authentication on Linux requires gssapi (Kerberos), which is out of scope.

Current best answer: Linux is a stretch goal for V1, not a supported target — consistent with 30-crate-topology.md's mxaccess-codec Targets line ("stretch goal") and 60-roadmap.md's "What this roadmap deliberately does not include" (Linux behind feature flags). If pursued, the path is default-features = false with the consumer providing credentials and shared secret explicitly. macOS unsupported in V1 (no Galaxy SQL TDS testing on macOS).

Settles when: a Linux integration test runs successfully against a remote AVEVA install. Until then, treat Linux support as aspirational and gate all Linux-specific code paths behind opt-in feature flags.

Q4 — How does `mxaccess-compat` handle COM event sinks?

The .NET MxNativeCompatibilityServer raises OnDataChange etc. as COM events. mxaccess-compat is a Rust API; do we expose them as Streams, callbacks, or both?

Current best answer: Streams, with a separate optional mxaccess-compat-com crate (post-V1) that registers windows-rs-generated COM classes. The compat crate's primary surface is Rust.

Settles when: a concrete consumer requests COM exposure.

Q5 — How do we surface `MxStatus` in `Subscription` items vs `Session` operations?

For Session::write(), a non-Ok status maps to Error::Status. For Subscription::next(), a non-Ok status comes through as DataChange { status: MxStatus, ... } — it is not necessarily an error (a "stale" data change is still a valid frame).

Current best answer: Session::write() returns Err on non-Ok category. Subscription::next() returns Ok(DataChange { ... }) and the consumer inspects change.status. Documented in 50-error-model.md.
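The asymmetry can be sketched with plain types. Everything below is a simplified stand-in: the real `MxStatus`, `DataChange`, and `Error` carry more fields, the `Category` variants shown are placeholders, and `write_outcome` / `subscription_item` are hypothetical free functions that model only the return-type decision.

```rust
/// Placeholder categories; the real set is defined by the codec.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Ok,
    NotOk,
}

/// Reduced stand-in for the real status type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MxStatus {
    pub category: Category,
    pub detail: u16,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    Status(MxStatus),
}

/// Reduced stand-in for a subscription item; `value` is an i32 for brevity.
#[derive(Debug, PartialEq, Eq)]
pub struct DataChange {
    pub value: i32,
    pub status: MxStatus,
}

/// Write path: a non-Ok category becomes an error.
pub fn write_outcome(status: MxStatus) -> Result<(), Error> {
    match status.category {
        Category::Ok => Ok(()),
        _ => Err(Error::Status(status)),
    }
}

/// Subscription path: the frame is always delivered; the consumer
/// inspects `status` on the item instead of handling an Err.
pub fn subscription_item(value: i32, status: MxStatus) -> Result<DataChange, Error> {
    Ok(DataChange { value, status })
}
```

In the subscription path, `Err` stays reserved for failures of the stream itself, so a consumer loop can use `?` for transport errors and still see every data change, including ones with a non-Ok status.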

Settles when: API stabilises after consumer feedback.

Q6 — Should `Session` be `Clone`?

Cheap clones via Arc<SessionInner> are convenient (handlers can take Session by value). But cloning makes shutdown semantics fuzzy: when does UnregisterEngine fire?

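Current best answer: Clone + Send + Sync. Drop of the last clone requests UnregisterEngine best-effort by pushing onto the connection task's cleanup channel (R15, option 2) rather than via tokio::spawn, which R15 rules out. Session::shutdown(timeout) is the explicit, awaitable way for production code.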
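The "last clone" rule falls out of `Arc` directly, as this std-only sketch shows. The `SessionInner` here is a hypothetical stand-in whose Drop merely bumps a counter; in the real crate that is the point where the UnregisterEngine request would be issued. `clone_count` is a helper name invented for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for the shared session state.
struct SessionInner {
    // Counts how many times cleanup was requested (for demonstration only).
    cleanup_requests: Arc<AtomicUsize>,
}

impl Drop for SessionInner {
    fn drop(&mut self) {
        // Runs exactly once, when the last Session clone goes away.
        self.cleanup_requests.fetch_add(1, Ordering::SeqCst);
    }
}

/// Cheap-to-clone handle: cloning only bumps the Arc reference count.
#[derive(Clone)]
pub struct Session {
    inner: Arc<SessionInner>,
}

impl Session {
    pub fn new(cleanup_requests: Arc<AtomicUsize>) -> Self {
        Session {
            inner: Arc::new(SessionInner { cleanup_requests }),
        }
    }

    /// Number of live handles sharing this session.
    pub fn clone_count(&self) -> usize {
        Arc::strong_count(&self.inner)
    }
}
```

Dropping any clone but the last leaves the counter untouched; only the final drop runs `SessionInner::drop`. That answers "when does UnregisterEngine fire?" but not "does it reach the server?", which is why the explicit shutdown call remains the recommended path.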

Settles when: stress test under churn confirms drop semantics are correct.

Q7 — M1 `hasDetailStatus` audit

During M1 wave-1 codec ports, the subscription_message.rs agent draft conditionally read the status: i32 field only when hasDetailStatus = true, while requiring a minimum record length of 15 (DataUpdate) regardless. The result: 4 leading status bytes were left unconsumed, then misread as quality further down. The defect was caught by round-trip tests (data_update_boolean_round_trip, data_update_has_no_correlation_id) and fixed: status: i32 is now read unconditionally per src/MxNativeCodec/NmxSubscriptionMessage.cs:126-127; only detail_status: Option<i32> is gated on hasDetailStatus (NmxSubscriptionMessage.cs:130-134).

Follow-up: audit any other codec port (current or future) that takes a has_detail_status / hasDetailStatus parameter for the same defect pattern — specifically, verify that fields read unconditionally in the .NET source remain unconditional in the Rust port. Likely affected scope: any future helper that ports ParseRecord semantics from NmxSubscriptionMessage.cs. The inline note at mxaccess-codec/src/subscription_message.rs parse_record documents the fix.
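The defect pattern the audit looks for can be shown in isolation. The sketch below is deliberately not the NMX record layout: the offsets, the little-endian reads, and the names `StatusFields` and `read_status_fields` are assumptions made for illustration. Only the control flow matters, namely that `status` is read unconditionally and only `detail_status` is gated on the flag.

```rust
/// Result of reading the two status fields (illustrative layout only).
#[derive(Debug, PartialEq, Eq)]
pub struct StatusFields {
    pub status: i32,
    pub detail_status: Option<i32>,
    /// Bytes consumed from the start of `buf`.
    pub consumed: usize,
}

/// Read a little-endian i32 at `at`, or None if the buffer is too short.
/// (Endianness is an assumption of this sketch.)
fn read_i32_le(buf: &[u8], at: usize) -> Option<i32> {
    let b = buf.get(at..at + 4)?;
    Some(i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

pub fn read_status_fields(buf: &[u8], has_detail_status: bool) -> Option<StatusFields> {
    // Unconditional: the buggy draft put this read inside the `if` below,
    // which left four bytes unconsumed and shifted every later field.
    let status = read_i32_le(buf, 0)?;
    let mut consumed = 4;

    // Gated: only this field depends on the flag.
    let detail_status = if has_detail_status {
        let v = read_i32_le(buf, consumed)?;
        consumed += 4;
        Some(v)
    } else {
        None
    };

    Some(StatusFields { status, detail_status, consumed })
}
```

For the audit, a useful check per module is that `consumed` with the flag clear differs from `consumed` with the flag set by exactly the size of the gated fields and nothing else.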

Settles when: post-M1 audit confirms no other codec module conditionally skips fields the .NET reference reads unconditionally.

Open evidence gaps

These are missing fixtures that the design assumes will land by their respective milestone.

| Fixture | Needed by | Captured how |
| --- | --- | --- |
| ~~Multi-sample buffered batch~~ | M6 | CAPTURED (F44) — `captures/094-frida-buffered-separate-writer/frida-events.tsv:145`; fixture under `crates/mxaccess-codec/tests/fixtures/m6-buffered/` |
| ~~Cross-domain NTLM Type1/2/3~~ | ~~M2+~~ | DEFERRED (R8) — permanently external-blocked; needs multi-domain Windows lab not available on this dev host |
| Activate/Suspend transition (wire) | M6 / F46 | PARTIAL (F44 + F46) — client-side conditions documented from capture 077; F46 added Frida hooks (`LmxProxy.dll!CLMXProxyServer.Suspend/.Activate` at RVAs `0x13d9c` / `0x14028`); live re-run pending (F50) |
| `OperationComplete` for non-write op | indefinitely | unknown |
| ~~Ghidra mapping table for completion-only bytes (R3/R4)~~ | ~~indefinitely~~ | NO TABLE EXISTS (R3/R4 settled 2026-05-06) — `analysis/ghidra/exports/Lmx.dll.aadct-decompile.md` confirms `aaDCT` is a logging BSTR name, not a table; `LmxProxy.dll`'s Fire_* event handlers receive already-populated `MXSTATUS_PROXY[]` from per-event context synthesis upstream, not from a static lookup. Verbatim preservation is the canonical answer. |
| ASB write timestamp + status fields | M5 | extended ASB Write/PublishWriteComplete probe |
| ASB no-communication source-level evidence (`work_remain.md:198`) | M5 | live capture against an unconfigured ASB endpoint |
| Partial-cleanup behavior after channel failure (`work_remain.md:196-197`) | M4/M5 | inject mid-flight failure during subscribe, observe cleanup state |
| Galaxy schema older version | indefinitely | not in scope for V1 |

Things that look risky but aren't

"Decode the NDR-bridge to find the value bytes"

docs/Transport-Correlation.md:65-70 notes that distinct value probes do not appear in raw TCP — the CNmxAdapter::PutRequest/CNmxAdapter::TransferData buffers are an "internal adapter representation, not the TCP wire format." This is because the values flow as DCE/RPC stub bytes inside the TransferData payload, which itself is the 46-byte envelope plus the inner write/advise/subscribe body. The "bridge" is just our codec re-applied at a different boundary; once we encode the envelope correctly, the bytes are there.

The .NET reference confirms this — src/MxNativeClient/ManagedNmxService2Client.cs:159-183 (TransferData + ValidateTransferDataBody) writes the 46-byte envelope directly into the DCE/RPC Request stub body, then forwards the inner; the validator explicitly rejects bodies that lack "an inner message after the 46-byte envelope" (line 182). There is no extra layer. The probe-vs-pcap mismatch is an artefact of not reassembling the inner body, not a missing protocol layer.
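The boundary described here is simple enough to sketch. This is an illustration of the length rule only, under stated assumptions: `split_transfer_data` and `BodyError` are hypothetical names, and whatever else the .NET validator checks beyond "an inner message after the 46-byte envelope" is not modelled.

```rust
/// Length of the envelope that precedes the inner message.
pub const ENVELOPE_LEN: usize = 46;

#[derive(Debug, PartialEq, Eq)]
pub enum BodyError {
    /// The body ends at or before the envelope boundary.
    MissingInnerMessage { len: usize },
}

/// Split a TransferData body into (envelope, inner message).
/// Bodies with nothing after the envelope are rejected.
pub fn split_transfer_data(body: &[u8]) -> Result<(&[u8], &[u8]), BodyError> {
    if body.len() <= ENVELOPE_LEN {
        return Err(BodyError::MissingInnerMessage { len: body.len() });
    }
    // Everything past the envelope is the inner write/advise/subscribe
    // body; there is no additional layer to peel off.
    Ok(body.split_at(ENVELOPE_LEN))
}
```

When comparing a probe buffer against a pcap, the inner slice returned here is the part to reassemble and feed back into the codec; the value bytes are expected inside it rather than in the envelope.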

No risk. Documented for clarity so future contributors don't chase a non-existent encryption layer.

"We need a custom TLB / proxy DLL"

The .NET reference avoids registering a custom TLB by hand-rolling the callback IRemUnknown server in src/MxNativeClient/ManagedCallbackExporter.cs:44-54 (CreateCallbackObjRef builds an OBJREF in memory) plus src/MxNativeClient/ManagedCallbackExporter.cs:164,195-196 (the IRemUnknown::RemQueryInterface server-side handler returns the negotiated INmxSvcCallback IPID without any registry-resident TLB or proxy/stub DLL). The Rust port does the same in mxaccess-callback. The only registry touchpoint is OXID resolution (read-only) and reading the ASB shared secret (read-only via DPAPI). No installer, no admin elevation.

No risk. Documented because it commonly comes up in DCOM contexts.
