Files
mxaccess/design/50-error-model.md
Joseph Doherty fe2a6db786
rust / build / test / clippy / fmt (push) Has been cancelled
Initial project state: .NET reference, design, Rust port (M0+M1), evidence
Layout:
- src/                    .NET 10 x64 reference: MxNativeCodec, MxNativeClient,
                          MxAsbClient, probes, tests, harnesses. Executable spec.
- design/                 Architectural plan for the Rust port (M0–M6), error
                          model, protocol invariants, risks (R1–R16), adversarial
                          review log (review.md).
- rust/                   Rust workspace. M0 skeleton + M1 codec parity.
                          mxaccess-codec: 215 unit tests + 2 cross-implementation
                          parity tests (byte-identical against .NET reference).
                          Other crates are M0 stubs awaiting M2+.
- captures/               Frida + netsh + pcap evidence per CLAUDE.md
                          ("captures are evidence, not throwaway logs").
- analysis/               Decompiled C# (frida/proxy/decompiled-*),
                          Ghidra exports for native DLLs (`exports/` only —
                          working state at `projects/` and AVEVA's input
                          binaries at `input/` are gitignored).
- docs/                   Reverse-engineering reference docs.
- tools/                  Setup-LiveProbeEnv.ps1 (Infisical credential fetcher),
                          Compute-Crc.ps1 (.NET parity helper).
- .github/workflows/      Rust CI: fmt + build + test + clippy on Windows.
- LICENSE                 MIT (Joseph Doherty, 2026).

Verified:
- cargo test --workspace → 217 passed (215 unit + 2 .NET parity), 0 failed
- cargo clippy --workspace -- -D warnings → clean
- cargo fmt --all -- --check → clean
- cargo publish --dry-run -p mxaccess-codec → packages cleanly

Excluded from history (see .gitignore):
- **/bin, **/obj, **/target — build artifacts
- analysis/ghidra/projects/ — Ghidra working state (regenerable)
- analysis/ghidra/input/ — AVEVA proprietary DLLs (vendor IP)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 06:21:00 -04:00

16 KiB

Error model

Goals

  • No HRESULT bare integers in the public API.
  • Every operation returns Result<T, mxaccess::Error>.
  • Error is #[non_exhaustive] so adding variants is non-breaking.
  • Every error variant carries enough context to debug without re-running.
  • No panics in the public API.

Public mxaccess::Error

#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum Error {
    #[error("connection: {0}")]
    Connection(#[from] ConnectionError),

    #[error("authentication: {0}")]
    Auth(#[from] AuthError),

    #[error("protocol: {0}")]
    Protocol(#[from] ProtocolError),

    #[error("configuration: {0}")]
    Configuration(#[from] ConfigError),

    #[error("type mismatch on {reference}: expected {expected:?}, got {actual:?}")]
    TypeMismatch { reference: Arc<str>, expected: MxValueKind, actual: MxValueKind },
    // The `{expected:?} {actual:?}` formatting above requires `MxValueKind: Debug`.
    // The codec layer guarantees this: `MxValueKind`, `MxValue`, `MxStatus`,
    // `MxStatusCategory`, and `MxStatusSource` all `#[derive(Debug)]` (this is a
    // public-API requirement of the codec crate, not just a convenience). The
    // .NET reference enum at src/MxNativeCodec/MxValueKind.cs:3-18 is the spec
    // for the variants; the Rust port reproduces them and pins `Debug` as part
    // of the contract so `Error::Display` and `tracing` fields render usefully.

    #[error("security: {0}")]
    Security(#[from] SecurityError),

    #[error("unsupported on {transport:?} transport: {operation}")]
    Unsupported { operation: Cow<'static, str>, transport: TransportKind },

    #[error("operation timed out after {0:?}")]
    Timeout(Duration),

    #[error("operation cancelled")]
    Cancelled,

    #[error("status: success={success} category={category:?} source={source:?} detail={detail}")]
    Status {
        /// `MxStatus.Success` (`short`). `-1` is the documented OK sentinel
        /// (src/MxNativeCodec/MxStatus.cs:29,36-58). Carried verbatim for
        /// byte-parity diagnostics; consumers usually inspect `category` first.
        success: i16,
        category: MxStatusCategory,
        source: MxStatusSource,
        detail: i16,
    },

    #[error("io: {0}")]
    Io(#[from] std::io::Error),
}

Sub-errors (ConnectionError, AuthError, ProtocolError, ConfigError, SecurityError) are similar thiserror-derived #[non_exhaustive] enums. Sources preserved via #[source].

MxStatusCategory and MxStatusSource (re-exported from mxaccess-codec) are likewise #[non_exhaustive] — AVEVA may legally introduce new short-valued category/source variants, and the .NET reference already has Unknown=-1 open-set sentinels (src/MxNativeCodec/MxStatus.cs:3,17). Without #[non_exhaustive], downstream match statements would freeze the API on the first new category. The Rust port mirrors the seven MxStatusSource values one-for-one as named in the .NET enum — Unknown=-1, RequestingLmx=0, RespondingLmx=1, RequestingNmx=2, RespondingNmx=3, RequestingAutomationObject=4, RespondingAutomationObject=5 (src/MxNativeCodec/MxStatus.cs:17-26). DetectedBy is essential for diagnostics: it tells the consumer which layer (requesting LMX vs responding NMX vs the automation object itself) detected the fault. Verified against wwtools/mxaccesscli/docs/api-notes.md:107-136 — every MxStatusInfo[] entry exposes DetectedBy.

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[non_exhaustive]
#[repr(i16)]
pub enum MxStatusCategory {
    Unknown            = -1,
    Ok                 = 0,
    Pending            = 1,
    Warning            = 2,
    CommunicationError = 3,
    ConfigurationError = 4,
    OperationalError   = 5,
    SecurityError      = 6,
    SoftwareError      = 7,
    OtherError         = 8,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[non_exhaustive]
#[repr(i16)]
pub enum MxStatusSource {
    Unknown                    = -1,
    RequestingLmx              = 0,
    RespondingLmx              = 1,
    RequestingNmx              = 2,
    RespondingNmx              = 3,
    RequestingAutomationObject = 4,
    RespondingAutomationObject = 5,
}
#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum ConnectionError {
    #[error("tcp connect to {addr} failed")]
    TcpConnect { addr: SocketAddr, #[source] source: std::io::Error },
    #[error("OXID resolve failed")]
    OxidResolve { #[source] source: RpcError },
    // `RpcError` is a public, stable, `std::error::Error`-implementing type
    // defined in the `mxaccess-rpc` crate (per the topology in 30-crate-topology.md)
    // and re-exported here. Its declaration is:
    //
    //     #[derive(Debug, thiserror::Error)]
    //     #[non_exhaustive]
    //     pub enum RpcError { /* DCE/RPC framing/PDU/NDR variants */ }
    //
    // Required because `#[source]` on the `OxidResolve` variant above demands
    // `std::error::Error + 'static`. The variant set is documented in the raw
    // layer; this crate guarantees only that `RpcError: std::error::Error +
    // Send + Sync + Debug + Display + 'static` and is `#[non_exhaustive]` so
    // new RPC failure modes can be added without a breaking change.
    #[error("RPC server unavailable")]
    ServerUnavailable,
    #[error("callback proxy/stub not registered (REGDB_E_CLASSNOTREG)")]
    CallbackProxyMissing,
    #[error("engine not registered (UninitializedObject)")]
    EngineNotRegistered,
}

#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum AuthError {
    #[error("NTLM rejected: {reason}")]
    Ntlm { reason: String },
    #[error("ASB DH handshake failed: {reason}")]
    AsbHandshake { reason: String },
    #[error("DPAPI shared secret unavailable: {reason}")]
    SharedSecret { reason: String },
}

#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum ProtocolError {
    #[error("decode at offset {offset} ({reason}); buffer len {buffer_len}")]
    Decode { offset: usize, reason: &'static str, buffer_len: usize },
    #[error("inner length {declared} does not match body length {actual}")]
    InnerLengthMismatch { declared: i32, actual: usize },
    #[error("unexpected opcode {0:#x}")]
    UnexpectedOpcode(u8),
    #[error("envelope mismatch: {reason}")]
    EnvelopeMismatch { reason: &'static str },
}

#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum ConfigError {
    /// Covers both ill-formed arguments and stale/unknown item handles.
    /// The proven x86 stack returns `0x80070057 E_INVALIDARG` for stale
    /// handles (captures/008-write-test-int-same-value/harness.log:7,
    /// analysis/ghidra/exports/LmxProxy.dll.item-helper-decompile.md:60,75,88,164).
    /// Discriminate via `detail` if a finer split is justified by future evidence.
    #[error("invalid argument: {detail}")]
    InvalidArgument { detail: String },
    #[error("galaxy resolver: {reason}")]
    Galaxy { reason: String },
}

#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum SecurityError {
    #[error("callback OBJREF rejected (HRESULT 0x8001011D)")]
    CallbackObjRefRejected,
    #[error("verifier user token required for secured write")]
    VerifierRequired,
}

Mapping rules

The async layer translates raw protocol outcomes into typed errors at exactly one place: at the boundary where a Result<_, RawError> from mxaccess-nmx or mxaccess-asb becomes a Result<_, Error> exposed to the consumer.

Source Maps to
TCP connect failure Error::Connection(ConnectionError::TcpConnect { addr, source })
OXID resolve failure Error::Connection(ConnectionError::OxidResolve { source })
NTLM Type1/Type3 reject Error::Auth(AuthError::Ntlm { reason })
DH handshake mismatch (ASB) Error::Auth(AuthError::AsbHandshake { reason })
HRESULT 0x80070057 (incl. stale/unknown item handles) Error::Configuration(ConfigError::InvalidArgument { detail })
HRESULT 0x80040154 Error::Connection(ConnectionError::CallbackProxyMissing)
HRESULT 0x8007139F Error::Connection(ConnectionError::EngineNotRegistered)
HRESULT 0x8001011D Error::Security(SecurityError::CallbackObjRefRejected) (citation: docs/NMX-COM-Contracts.md:590-594, the full surrounding paragraph; no capture currently logs this HRESULT, so promotion to a typed error is based on probe analysis rather than a recorded wire fixture — re-examine if a future capture contradicts the narrative)
HRESULT 0x800706BA Error::Connection(ConnectionError::ServerUnavailable)
HRESULT 0x80004021 from a Write operation preserved verbatim as Error::Status { ..raw HRESULT carried } — no synthesized typed error. Note: this HRESULT was previously mapped to a synthesized "single-token WriteSecured not supported" error; that mapping has been removed per wwtools/mxaccesscli/ verification. Real MxAccess WriteSecured always takes two user ids ((currentUserId, verifierUserId, value)) and the LMX proxy CLI exposes single-user secured writes as currentUserId == verifierUserId. The 0x80004021 in MxNativeSession.WriteSecuredAsync (src/MxNativeClient/MxNativeSession.cs:218-221) is a defect of that .NET reimplementation, not a real LMX constraint — see R6 in 70-risks-and-open-questions.md
Decoder failure on inbound frame Error::Protocol(ProtocolError::Decode { ... })
MxStatus with category != Ok from read/write/subscribe Error::Status { success, category, source, detail } (the full 4-tuple from the .NET MxStatus record at src/MxNativeCodec/MxStatus.cs:28-33 — success is the documented OK sentinel -1 and is carried for byte-parity diagnostics per CLAUDE.md "preserve unknown bytes")
MxStatus with category != Ok on a DataChange from a Subscription not an error — set on DataChange.status, surfaced through the stream

The split between operation-status-as-error and subscription-status-as-data is the only sensible mapping. A subscription frame with category=Warning, detail=stale is data the caller wants to receive and inspect; it is not an exception. An operation that returns category=ConfigurationError, detail=21 (invalid reference) is a failure to produce a value and should be surfaced as Err.

Cancellation semantics

  • Drop: dropping a future cancels it. The library does not block_on inside drop.
  • Subscription drop: holds a tokio::sync::oneshot::Sender<UnAdviseRequest> to signal UnAdvise from the task that owns the connection. On drop, the sender is consumed; the connection task fires UnAdvise best-effort. The drop returns immediately; no async cleanup is awaited. If the best-effort UnAdvise fails (e.g. connection already gone), the failure is logged via tracing::warn! with the subscription handle and the underlying error. The drop completes regardless. (Log signature precedent: captures/probe-add-remove.log:7, which records the .NET LMX_UnAdvise per-handle warning emitted when the proxy reports a non-Ok status during teardown.)
  • Session shutdown: Session::shutdown(timeout) returns Ok(()) once UnregisterEngine completes or the timeout elapses. Drop without shutdown runs an unbounded best-effort cleanup spawn — fine for tests, suboptimal for production where shutdown is the recommendation.
  • CancellationToken: long operations (subscribe_buffered, recovery, connect) accept an optional token. Cancellation surfaces as Error::Cancelled.
  • Timeout: tokio::time::timeout works on every operation because async fns are cancel-correct by construction.

Panic policy

Public API panics only for invariants the type system enforces (e.g. an array length that cannot be wrong). In practice this means: no unwrap(), expect(), or panic!() in non-test code paths.

Lints:

  • clippy::unwrap_used = "deny"
  • clippy::expect_used = "deny"
  • clippy::panic = "deny"
  • clippy::indexing_slicing = "deny" — out-of-bounds slice[i] is a panic; codecs use slice.get(i).ok_or(...)? instead.
  • clippy::unreachable = "deny"unreachable!() is still a panic; force Err(ProtocolError::...) on supposedly-impossible decoder branches so the wire is never trusted to match the type system.
  • clippy::todo = "deny"todo!() must not ship in non-test code.
  • clippy::arithmetic_side_effects = "warn" — overflow-on-debug is a real footgun, but the codec does legitimate u8/u16 field arithmetic where the lint is noisy; warn keeps it visible without blocking honest cases. Use wrapping_*/checked_* explicitly when the intent is non-default.

Test modules opt out of the unwrap/expect/indexing rules via the actual pattern:

#![cfg_attr(test, allow(clippy::unwrap_used, clippy::expect_used, clippy::indexing_slicing))]

placed at the top of the test module (or #[cfg(test)] mod tests { #![allow(...)] ... } for inline test modules). CI rejects PRs that introduce any of the denied lints in non-test paths.

Diagnostics surface

Error::Display is short and operator-friendly. Error::source() walks the cause chain. Error::Debug is verbose for logs.

match session.write("X.Y", v).await {
    Ok(()) => {}
    Err(e) => {
        tracing::error!(error = ?e, error.source = ?e.source(), "write failed");
        return Err(e);
    }
}

For protocol-level decode failures, every ProtocolError::Decode carries:

  • byte offset in the frame
  • reason (&'static str — interned message)
  • buffer length

This matches the .NET reference's tendency to throw ArgumentException with byte-offset messages.

Recovery decisions

Error::is_retryable() -> bool informs the default RecoveryPolicy:

Variant Retryable?
Connection(ServerUnavailable) no — 0x800706BA in the proven stack is a structural callback-marshalling failure (no SYN to the advertised callback port after security bindings are added), not a flapping NmxSvc.exe. Retrying loops forever. Evidence: analysis/proxy/managed-registerengine2-callback-probe.txt:8, docs/NMX-COM-Contracts.md:592.
Connection(TcpConnect) depends on the inner io::Error.kind() — same dispatch as Io(_) below: WouldBlock/Interrupted/TimedOut/ConnectionReset → yes; AddrNotAvailable/HostUnreachable/ConnectionRefused → no. Implementation: is_retryable()'s Connection(TcpConnect { source, .. }) arm delegates to source.kind() rather than blanket-yes; otherwise a name-not-found hot-loops.
Connection(OxidResolve) yes
Connection(CallbackProxyMissing) no (config issue)
Connection(EngineNotRegistered) yes (re-register)
Auth(Ntlm) no (credentials are wrong)
Auth(AsbHandshake) no
Auth(SharedSecret) no
Configuration(*) no
TypeMismatch no
Security(*) no
Status with category=CommunicationError yes
Status with category=Pending yes (with backoff)
Status with category=ConfigurationError depends on detail: detail == 21 ("Invalid reference", src/MxNativeCodec/MxStatus.cs:76) → yes, after a Galaxy-resolver-cache refresh — the .NET reference's per-operation ResolveTagAsync (src/MxNativeClient/MxNativeSession.cs:173,196,232,255) makes a stale-cache miss recoverable on the next attempt. All other ConfigurationError details → no. Implementation: is_retryable() matches the inner detail and returns true only for the resolver-refresh-eligible codes.
Status other categories no by default; consumer can override
Timeout yes
Cancelled no
Protocol(*) no — protocol-level decode bugs are not retryable
Io(_) depends on kind(): WouldBlock/Interrupted/TimedOut → yes; everything else → no

The default RecoveryPolicy::exponential calls is_retryable() before re-attempting and emits RecoveryEvent::Failed { error, .. } if not.

Status as data on subscriptions

pub struct DataChange {
    pub reference: Arc<str>,
    pub value: MxValue,
    pub quality: u16,
    pub timestamp: SystemTime,
    pub status: MxStatus,
}

Consumers inspect status.category to distinguish good/uncertain/bad data. The default mxaccess does not filter; if a consumer wants Ok-only data, they filter the stream themselves:

let mut sub = session.subscribe("X.Y").await?
    .filter_map(|change| async move {
        match change {
            Ok(c) if c.status.category == MxStatusCategory::Ok => Some(Ok(c)),
            Ok(_) => None,                  // drop non-Ok
            Err(e) => Some(Err(e)),
        }
    });

This keeps Error reserved for actual failures and keeps stream items composable.