# Error model ## Goals - No `HRESULT` bare integers in the public API. - Every operation returns `Result`. - `Error` is `#[non_exhaustive]` so adding variants is non-breaking. - Every error variant carries enough context to debug without re-running. - No panics in the public API. ## Public `mxaccess::Error` ```rust #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum Error { #[error("connection: {0}")] Connection(#[from] ConnectionError), #[error("authentication: {0}")] Auth(#[from] AuthError), #[error("protocol: {0}")] Protocol(#[from] ProtocolError), #[error("configuration: {0}")] Configuration(#[from] ConfigError), #[error("type mismatch on {reference}: expected {expected:?}, got {actual:?}")] TypeMismatch { reference: Arc, expected: MxValueKind, actual: MxValueKind }, // The `{expected:?} {actual:?}` formatting above requires `MxValueKind: Debug`. // The codec layer guarantees this: `MxValueKind`, `MxValue`, `MxStatus`, // `MxStatusCategory`, and `MxStatusSource` all `#[derive(Debug)]` (this is a // public-API requirement of the codec crate, not just a convenience). The // .NET reference enum at src/MxNativeCodec/MxValueKind.cs:3-18 is the spec // for the variants; the Rust port reproduces them and pins `Debug` as part // of the contract so `Error::Display` and `tracing` fields render usefully. #[error("security: {0}")] Security(#[from] SecurityError), #[error("unsupported on {transport:?} transport: {operation}")] Unsupported { operation: Cow<'static, str>, transport: TransportKind }, #[error("operation timed out after {0:?}")] Timeout(Duration), #[error("operation cancelled")] Cancelled, #[error("status: success={success} category={category:?} source={source:?} detail={detail}")] Status { /// `MxStatus.Success` (`short`). `-1` is the documented OK sentinel /// (src/MxNativeCodec/MxStatus.cs:29,36-58). Carried verbatim for /// byte-parity diagnostics; consumers usually inspect `category` first. success: i16, category: MxStatusCategory, source: MxStatusSource, detail: i16, }, #[error("io: {0}")] Io(#[from] std::io::Error), } ``` Sub-errors (`ConnectionError`, `AuthError`, `ProtocolError`, `ConfigError`, `SecurityError`) are similar `thiserror`-derived `#[non_exhaustive]` enums. Sources preserved via `#[source]`. `MxStatusCategory` and `MxStatusSource` (re-exported from `mxaccess-codec`) are likewise `#[non_exhaustive]` — AVEVA may legally introduce new short-valued category/source variants, and the .NET reference already has `Unknown=-1` open-set sentinels (src/MxNativeCodec/MxStatus.cs:3,17). Without `#[non_exhaustive]`, downstream `match` statements would freeze the API on the first new category. The Rust port mirrors the **seven** `MxStatusSource` values one-for-one as named in the .NET enum — `Unknown=-1, RequestingLmx=0, RespondingLmx=1, RequestingNmx=2, RespondingNmx=3, RequestingAutomationObject=4, RespondingAutomationObject=5` (src/MxNativeCodec/MxStatus.cs:17-26). `DetectedBy` is essential for diagnostics: it tells the consumer which layer (requesting LMX vs responding NMX vs the automation object itself) detected the fault. Verified against wwtools/mxaccesscli/docs/api-notes.md:107-136 — every `MxStatusInfo[]` entry exposes `DetectedBy`. ```rust #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] #[non_exhaustive] #[repr(i16)] pub enum MxStatusCategory { Unknown = -1, Ok = 0, Pending = 1, Warning = 2, CommunicationError = 3, ConfigurationError = 4, OperationalError = 5, SecurityError = 6, SoftwareError = 7, OtherError = 8, } #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] #[non_exhaustive] #[repr(i16)] pub enum MxStatusSource { Unknown = -1, RequestingLmx = 0, RespondingLmx = 1, RequestingNmx = 2, RespondingNmx = 3, RequestingAutomationObject = 4, RespondingAutomationObject = 5, } ``` ```rust #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum ConnectionError { #[error("tcp connect to {addr} failed")] TcpConnect { addr: SocketAddr, #[source] source: std::io::Error }, #[error("OXID resolve failed")] OxidResolve { #[source] source: RpcError }, // `RpcError` is a public, stable, `std::error::Error`-implementing type // defined in the `mxaccess-rpc` crate (per the topology in 30-crate-topology.md) // and re-exported here. Its declaration is: // // #[derive(Debug, thiserror::Error)] // #[non_exhaustive] // pub enum RpcError { /* DCE/RPC framing/PDU/NDR variants */ } // // Required because `#[source]` on the `OxidResolve` variant above demands // `std::error::Error + 'static`. The variant set is documented in the raw // layer; this crate guarantees only that `RpcError: std::error::Error + // Send + Sync + Debug + Display + 'static` and is `#[non_exhaustive]` so // new RPC failure modes can be added without a breaking change. #[error("RPC server unavailable")] ServerUnavailable, #[error("callback proxy/stub not registered (REGDB_E_CLASSNOTREG)")] CallbackProxyMissing, #[error("engine not registered (UninitializedObject)")] EngineNotRegistered, } #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum AuthError { #[error("NTLM rejected: {reason}")] Ntlm { reason: String }, #[error("ASB DH handshake failed: {reason}")] AsbHandshake { reason: String }, #[error("DPAPI shared secret unavailable: {reason}")] SharedSecret { reason: String }, } #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum ProtocolError { #[error("decode at offset {offset} ({reason}); buffer len {buffer_len}")] Decode { offset: usize, reason: &'static str, buffer_len: usize }, #[error("inner length {declared} does not match body length {actual}")] InnerLengthMismatch { declared: i32, actual: usize }, #[error("unexpected opcode {0:#x}")] UnexpectedOpcode(u8), #[error("envelope mismatch: {reason}")] EnvelopeMismatch { reason: &'static str }, } #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum ConfigError { /// Covers both ill-formed arguments and stale/unknown item handles. /// The proven x86 stack returns `0x80070057 E_INVALIDARG` for stale /// handles (captures/008-write-test-int-same-value/harness.log:7, /// analysis/ghidra/exports/LmxProxy.dll.item-helper-decompile.md:60,75,88,164). /// Discriminate via `detail` if a finer split is justified by future evidence. #[error("invalid argument: {detail}")] InvalidArgument { detail: String }, #[error("galaxy resolver: {reason}")] Galaxy { reason: String }, } #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum SecurityError { #[error("callback OBJREF rejected (HRESULT 0x8001011D)")] CallbackObjRefRejected, #[error("verifier user token required for secured write")] VerifierRequired, } ``` ## Mapping rules The async layer translates raw protocol outcomes into typed errors at exactly one place: at the boundary where a `Result<_, RawError>` from `mxaccess-nmx` or `mxaccess-asb` becomes a `Result<_, Error>` exposed to the consumer. | Source | Maps to | |---|---| | TCP connect failure | `Error::Connection(ConnectionError::TcpConnect { addr, source })` | | OXID resolve failure | `Error::Connection(ConnectionError::OxidResolve { source })` | | NTLM Type1/Type3 reject | `Error::Auth(AuthError::Ntlm { reason })` | | DH handshake mismatch (ASB) | `Error::Auth(AuthError::AsbHandshake { reason })` | | HRESULT `0x80070057` (incl. stale/unknown item handles) | `Error::Configuration(ConfigError::InvalidArgument { detail })` | | HRESULT `0x80040154` | `Error::Connection(ConnectionError::CallbackProxyMissing)` | | HRESULT `0x8007139F` | `Error::Connection(ConnectionError::EngineNotRegistered)` | | HRESULT `0x8001011D` | `Error::Security(SecurityError::CallbackObjRefRejected)` (citation: `docs/NMX-COM-Contracts.md:590-594`, the full surrounding paragraph; no capture currently logs this HRESULT, so promotion to a typed error is based on probe analysis rather than a recorded wire fixture — re-examine if a future capture contradicts the narrative) | | HRESULT `0x800706BA` | `Error::Connection(ConnectionError::ServerUnavailable)` | | HRESULT `0x80004021` from a Write operation | preserved verbatim as `Error::Status { ..raw HRESULT carried }` — no synthesized typed error. Note: this HRESULT was previously mapped to a synthesized "single-token `WriteSecured` not supported" error; that mapping has been **removed** per `wwtools/mxaccesscli/` verification. Real MxAccess `WriteSecured` always takes two user ids (`(currentUserId, verifierUserId, value)`) and the LMX proxy CLI exposes single-user secured writes as `currentUserId == verifierUserId`. The 0x80004021 in `MxNativeSession.WriteSecuredAsync` (src/MxNativeClient/MxNativeSession.cs:218-221) is a defect of that .NET reimplementation, not a real LMX constraint — see R6 in `70-risks-and-open-questions.md` | | Decoder failure on inbound frame | `Error::Protocol(ProtocolError::Decode { ... })` | | `MxStatus` with `category != Ok` from `read`/`write`/`subscribe` | `Error::Status { success, category, source, detail }` (the full 4-tuple from the .NET `MxStatus` record at src/MxNativeCodec/MxStatus.cs:28-33 — `success` is the documented OK sentinel `-1` and is carried for byte-parity diagnostics per CLAUDE.md "preserve unknown bytes") | | `MxStatus` with `category != Ok` on a `DataChange` from a `Subscription` | **not** an error — set on `DataChange.status`, surfaced through the stream | The split between operation-status-as-error and subscription-status-as-data is the only sensible mapping. A subscription frame with `category=Warning, detail=stale` is data the caller wants to receive and inspect; it is not an exception. An operation that returns `category=ConfigurationError, detail=21 (invalid reference)` is a failure to produce a value and should be surfaced as `Err`. ## Cancellation semantics - **Drop**: dropping a future cancels it. The library does not `block_on` inside drop. - **Subscription drop**: holds a `tokio::sync::oneshot::Sender` to signal `UnAdvise` from the task that owns the connection. On drop, the sender is consumed; the connection task fires `UnAdvise` best-effort. The drop returns immediately; no async cleanup is awaited. If the best-effort `UnAdvise` fails (e.g. connection already gone), the failure is logged via `tracing::warn!` with the subscription handle and the underlying error. The drop completes regardless. (Log signature precedent: `captures/probe-add-remove.log:7`, which records the .NET `LMX_UnAdvise` per-handle warning emitted when the proxy reports a non-Ok status during teardown.) - **Session shutdown**: `Session::shutdown(timeout)` returns `Ok(())` once `UnregisterEngine` completes or the timeout elapses. Drop without `shutdown` runs an unbounded best-effort cleanup spawn — fine for tests, suboptimal for production where `shutdown` is the recommendation. - **`CancellationToken`**: long operations (`subscribe_buffered`, recovery, `connect`) accept an optional token. Cancellation surfaces as `Error::Cancelled`. - **Timeout**: `tokio::time::timeout` works on every operation because `async fn`s are cancel-correct by construction. ## Panic policy Public API panics only for invariants the type system enforces (e.g. an array length that cannot be wrong). In practice this means: **no `unwrap()`, `expect()`, or `panic!()` in non-test code paths**. Lints: - `clippy::unwrap_used = "deny"` - `clippy::expect_used = "deny"` - `clippy::panic = "deny"` - `clippy::indexing_slicing = "deny"` — out-of-bounds `slice[i]` is a panic; codecs use `slice.get(i).ok_or(...)?` instead. - `clippy::unreachable = "deny"` — `unreachable!()` is still a panic; force `Err(ProtocolError::...)` on supposedly-impossible decoder branches so the wire is never trusted to match the type system. - `clippy::todo = "deny"` — `todo!()` must not ship in non-test code. - `clippy::arithmetic_side_effects = "warn"` — overflow-on-debug is a real footgun, but the codec does legitimate `u8`/`u16` field arithmetic where the lint is noisy; `warn` keeps it visible without blocking honest cases. Use `wrapping_*`/`checked_*` explicitly when the intent is non-default. Test modules opt out of the unwrap/expect/indexing rules via the actual pattern: ```rust #![cfg_attr(test, allow(clippy::unwrap_used, clippy::expect_used, clippy::indexing_slicing))] ``` placed at the top of the test module (or `#[cfg(test)] mod tests { #![allow(...)] ... }` for inline test modules). CI rejects PRs that introduce any of the denied lints in non-test paths. ## Diagnostics surface `Error::Display` is short and operator-friendly. `Error::source()` walks the cause chain. `Error::Debug` is verbose for logs. ```rust match session.write("X.Y", v).await { Ok(()) => {} Err(e) => { tracing::error!(error = ?e, error.source = ?e.source(), "write failed"); return Err(e); } } ``` For protocol-level decode failures, every `ProtocolError::Decode` carries: - byte offset in the frame - reason (`&'static str` — interned message) - buffer length This matches the .NET reference's tendency to throw `ArgumentException` with byte-offset messages. ## Recovery decisions `Error::is_retryable() -> bool` informs the default `RecoveryPolicy`: | Variant | Retryable? | |---|---| | `Connection(ServerUnavailable)` | no — `0x800706BA` in the proven stack is a structural callback-marshalling failure (no SYN to the advertised callback port after security bindings are added), not a flapping `NmxSvc.exe`. Retrying loops forever. Evidence: `analysis/proxy/managed-registerengine2-callback-probe.txt:8`, `docs/NMX-COM-Contracts.md:592`. | | `Connection(TcpConnect)` | depends on the inner `io::Error.kind()` — same dispatch as `Io(_)` below: `WouldBlock`/`Interrupted`/`TimedOut`/`ConnectionReset` → yes; `AddrNotAvailable`/`HostUnreachable`/`ConnectionRefused` → no. Implementation: `is_retryable()`'s `Connection(TcpConnect { source, .. })` arm delegates to `source.kind()` rather than blanket-yes; otherwise a name-not-found hot-loops. | | `Connection(OxidResolve)` | yes | | `Connection(CallbackProxyMissing)` | no (config issue) | | `Connection(EngineNotRegistered)` | yes (re-register) | | `Auth(Ntlm)` | no (credentials are wrong) | | `Auth(AsbHandshake)` | no | | `Auth(SharedSecret)` | no | | `Configuration(*)` | no | | `TypeMismatch` | no | | `Security(*)` | no | | `Status` with `category=CommunicationError` | yes | | `Status` with `category=Pending` | yes (with backoff) | | `Status` with `category=ConfigurationError` | depends on `detail`: `detail == 21` ("Invalid reference", src/MxNativeCodec/MxStatus.cs:76) → yes, after a Galaxy-resolver-cache refresh — the .NET reference's per-operation `ResolveTagAsync` (src/MxNativeClient/MxNativeSession.cs:173,196,232,255) makes a stale-cache miss recoverable on the next attempt. All other `ConfigurationError` details → no. Implementation: `is_retryable()` matches the inner `detail` and returns `true` only for the resolver-refresh-eligible codes. | | `Status` other categories | no by default; consumer can override | | `Timeout` | yes | | `Cancelled` | no | | `Protocol(*)` | no — protocol-level decode bugs are not retryable | | `Io(_)` | depends on `kind()`: `WouldBlock`/`Interrupted`/`TimedOut` → yes; everything else → no | The default `RecoveryPolicy::exponential` calls `is_retryable()` before re-attempting and emits `RecoveryEvent::Failed { error, .. }` if not. ## Status as data on subscriptions ```rust pub struct DataChange { pub reference: Arc, pub value: MxValue, pub quality: u16, pub timestamp: SystemTime, pub status: MxStatus, } ``` Consumers inspect `status.category` to distinguish good/uncertain/bad data. The default `mxaccess` does not filter; if a consumer wants Ok-only data, they filter the stream themselves: ```rust let mut sub = session.subscribe("X.Y").await? .filter_map(|change| async move { match change { Ok(c) if c.status.category == MxStatusCategory::Ok => Some(Ok(c)), Ok(_) => None, // drop non-Ok Err(e) => Some(Err(e)), } }); ``` This keeps `Error` reserved for actual failures and keeps stream items composable.