Files
mxaccess/design/50-error-model.md
Joseph Doherty fe2a6db786
rust / build / test / clippy / fmt (push) Has been cancelled
Initial project state: .NET reference, design, Rust port (M0+M1), evidence
Layout:
- src/                    .NET 10 x64 reference: MxNativeCodec, MxNativeClient,
                          MxAsbClient, probes, tests, harnesses. Executable spec.
- design/                 Architectural plan for the Rust port (M0–M6), error
                          model, protocol invariants, risks (R1–R16), adversarial
                          review log (review.md).
- rust/                   Rust workspace. M0 skeleton + M1 codec parity.
                          mxaccess-codec: 215 unit tests + 2 cross-implementation
                          parity tests (byte-identical against .NET reference).
                          Other crates are M0 stubs awaiting M2+.
- captures/               Frida + netsh + pcap evidence per CLAUDE.md
                          ("captures are evidence, not throwaway logs").
- analysis/               Decompiled C# (frida/proxy/decompiled-*),
                          Ghidra exports for native DLLs (`exports/` only —
                          working state at `projects/` and AVEVA's input
                          binaries at `input/` are gitignored).
- docs/                   Reverse-engineering reference docs.
- tools/                  Setup-LiveProbeEnv.ps1 (Infisical credential fetcher),
                          Compute-Crc.ps1 (.NET parity helper).
- .github/workflows/      Rust CI: fmt + build + test + clippy on Windows.
- LICENSE                 MIT (Joseph Doherty, 2026).

Verified:
- cargo test --workspace → 217 passed (215 unit + 2 .NET parity), 0 failed
- cargo clippy --workspace -- -D warnings → clean
- cargo fmt --all -- --check → clean
- cargo publish --dry-run -p mxaccess-codec → packages cleanly

Excluded from history (see .gitignore):
- **/bin, **/obj, **/target — build artifacts
- analysis/ghidra/projects/ — Ghidra working state (regenerable)
- analysis/ghidra/input/ — AVEVA proprietary DLLs (vendor IP)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 06:21:00 -04:00

305 lines
16 KiB
Markdown

# Error model
## Goals
- No `HRESULT` bare integers in the public API.
- Every operation returns `Result<T, mxaccess::Error>`.
- `Error` is `#[non_exhaustive]` so adding variants is non-breaking.
- Every error variant carries enough context to debug without re-running.
- No panics in the public API.
## Public `mxaccess::Error`
```rust
#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum Error {
#[error("connection: {0}")]
Connection(#[from] ConnectionError),
#[error("authentication: {0}")]
Auth(#[from] AuthError),
#[error("protocol: {0}")]
Protocol(#[from] ProtocolError),
#[error("configuration: {0}")]
Configuration(#[from] ConfigError),
#[error("type mismatch on {reference}: expected {expected:?}, got {actual:?}")]
TypeMismatch { reference: Arc<str>, expected: MxValueKind, actual: MxValueKind },
// The `{expected:?} {actual:?}` formatting above requires `MxValueKind: Debug`.
// The codec layer guarantees this: `MxValueKind`, `MxValue`, `MxStatus`,
// `MxStatusCategory`, and `MxStatusSource` all `#[derive(Debug)]` (this is a
// public-API requirement of the codec crate, not just a convenience). The
// .NET reference enum at src/MxNativeCodec/MxValueKind.cs:3-18 is the spec
// for the variants; the Rust port reproduces them and pins `Debug` as part
// of the contract so `Error::Display` and `tracing` fields render usefully.
#[error("security: {0}")]
Security(#[from] SecurityError),
#[error("unsupported on {transport:?} transport: {operation}")]
Unsupported { operation: Cow<'static, str>, transport: TransportKind },
#[error("operation timed out after {0:?}")]
Timeout(Duration),
#[error("operation cancelled")]
Cancelled,
#[error("status: success={success} category={category:?} source={source:?} detail={detail}")]
Status {
/// `MxStatus.Success` (`short`). `-1` is the documented OK sentinel
/// (src/MxNativeCodec/MxStatus.cs:29,36-58). Carried verbatim for
/// byte-parity diagnostics; consumers usually inspect `category` first.
success: i16,
category: MxStatusCategory,
source: MxStatusSource,
detail: i16,
},
#[error("io: {0}")]
Io(#[from] std::io::Error),
}
```
Sub-errors (`ConnectionError`, `AuthError`, `ProtocolError`, `ConfigError`, `SecurityError`) are similar `thiserror`-derived `#[non_exhaustive]` enums. Sources preserved via `#[source]`.
`MxStatusCategory` and `MxStatusSource` (re-exported from `mxaccess-codec`) are likewise `#[non_exhaustive]` — AVEVA may legally introduce new short-valued category/source variants, and the .NET reference already has `Unknown=-1` open-set sentinels (src/MxNativeCodec/MxStatus.cs:3,17). Without `#[non_exhaustive]`, downstream `match` statements would freeze the API on the first new category. The Rust port mirrors the **seven** `MxStatusSource` values one-for-one as named in the .NET enum — `Unknown=-1, RequestingLmx=0, RespondingLmx=1, RequestingNmx=2, RespondingNmx=3, RequestingAutomationObject=4, RespondingAutomationObject=5` (src/MxNativeCodec/MxStatus.cs:17-26). `DetectedBy` is essential for diagnostics: it tells the consumer which layer (requesting LMX vs responding NMX vs the automation object itself) detected the fault. Verified against wwtools/mxaccesscli/docs/api-notes.md:107-136 — every `MxStatusInfo[]` entry exposes `DetectedBy`.
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[non_exhaustive]
#[repr(i16)]
pub enum MxStatusCategory {
Unknown = -1,
Ok = 0,
Pending = 1,
Warning = 2,
CommunicationError = 3,
ConfigurationError = 4,
OperationalError = 5,
SecurityError = 6,
SoftwareError = 7,
OtherError = 8,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[non_exhaustive]
#[repr(i16)]
pub enum MxStatusSource {
Unknown = -1,
RequestingLmx = 0,
RespondingLmx = 1,
RequestingNmx = 2,
RespondingNmx = 3,
RequestingAutomationObject = 4,
RespondingAutomationObject = 5,
}
```
```rust
#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum ConnectionError {
#[error("tcp connect to {addr} failed")]
TcpConnect { addr: SocketAddr, #[source] source: std::io::Error },
#[error("OXID resolve failed")]
OxidResolve { #[source] source: RpcError },
// `RpcError` is a public, stable, `std::error::Error`-implementing type
// defined in the `mxaccess-rpc` crate (per the topology in 30-crate-topology.md)
// and re-exported here. Its declaration is:
//
// #[derive(Debug, thiserror::Error)]
// #[non_exhaustive]
// pub enum RpcError { /* DCE/RPC framing/PDU/NDR variants */ }
//
// Required because `#[source]` on the `OxidResolve` variant above demands
// `std::error::Error + 'static`. The variant set is documented in the raw
// layer; this crate guarantees only that `RpcError: std::error::Error +
// Send + Sync + Debug + Display + 'static` and is `#[non_exhaustive]` so
// new RPC failure modes can be added without a breaking change.
#[error("RPC server unavailable")]
ServerUnavailable,
#[error("callback proxy/stub not registered (REGDB_E_CLASSNOTREG)")]
CallbackProxyMissing,
#[error("engine not registered (UninitializedObject)")]
EngineNotRegistered,
}
#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum AuthError {
#[error("NTLM rejected: {reason}")]
Ntlm { reason: String },
#[error("ASB DH handshake failed: {reason}")]
AsbHandshake { reason: String },
#[error("DPAPI shared secret unavailable: {reason}")]
SharedSecret { reason: String },
}
#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum ProtocolError {
#[error("decode at offset {offset} ({reason}); buffer len {buffer_len}")]
Decode { offset: usize, reason: &'static str, buffer_len: usize },
#[error("inner length {declared} does not match body length {actual}")]
InnerLengthMismatch { declared: i32, actual: usize },
#[error("unexpected opcode {0:#x}")]
UnexpectedOpcode(u8),
#[error("envelope mismatch: {reason}")]
EnvelopeMismatch { reason: &'static str },
}
#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum ConfigError {
/// Covers both ill-formed arguments and stale/unknown item handles.
/// The proven x86 stack returns `0x80070057 E_INVALIDARG` for stale
/// handles (captures/008-write-test-int-same-value/harness.log:7,
/// analysis/ghidra/exports/LmxProxy.dll.item-helper-decompile.md:60,75,88,164).
/// Discriminate via `detail` if a finer split is justified by future evidence.
#[error("invalid argument: {detail}")]
InvalidArgument { detail: String },
#[error("galaxy resolver: {reason}")]
Galaxy { reason: String },
}
#[derive(Debug, thiserror::Error)]
#[non_exhaustive]
pub enum SecurityError {
#[error("callback OBJREF rejected (HRESULT 0x8001011D)")]
CallbackObjRefRejected,
#[error("verifier user token required for secured write")]
VerifierRequired,
}
```
## Mapping rules
The async layer translates raw protocol outcomes into typed errors at exactly one place: at the boundary where a `Result<_, RawError>` from `mxaccess-nmx` or `mxaccess-asb` becomes a `Result<_, Error>` exposed to the consumer.
| Source | Maps to |
|---|---|
| TCP connect failure | `Error::Connection(ConnectionError::TcpConnect { addr, source })` |
| OXID resolve failure | `Error::Connection(ConnectionError::OxidResolve { source })` |
| NTLM Type1/Type3 reject | `Error::Auth(AuthError::Ntlm { reason })` |
| DH handshake mismatch (ASB) | `Error::Auth(AuthError::AsbHandshake { reason })` |
| HRESULT `0x80070057` (incl. stale/unknown item handles) | `Error::Configuration(ConfigError::InvalidArgument { detail })` |
| HRESULT `0x80040154` | `Error::Connection(ConnectionError::CallbackProxyMissing)` |
| HRESULT `0x8007139F` | `Error::Connection(ConnectionError::EngineNotRegistered)` |
| HRESULT `0x8001011D` | `Error::Security(SecurityError::CallbackObjRefRejected)` (citation: `docs/NMX-COM-Contracts.md:590-594`, the full surrounding paragraph; no capture currently logs this HRESULT, so promotion to a typed error is based on probe analysis rather than a recorded wire fixture — re-examine if a future capture contradicts the narrative) |
| HRESULT `0x800706BA` | `Error::Connection(ConnectionError::ServerUnavailable)` |
| HRESULT `0x80004021` from a Write operation | preserved verbatim as `Error::Status { ..raw HRESULT carried }` — no synthesized typed error. Note: this HRESULT was previously mapped to a synthesized "single-token `WriteSecured` not supported" error; that mapping has been **removed** per `wwtools/mxaccesscli/` verification. Real MxAccess `WriteSecured` always takes two user ids (`(currentUserId, verifierUserId, value)`) and the LMX proxy CLI exposes single-user secured writes as `currentUserId == verifierUserId`. The 0x80004021 in `MxNativeSession.WriteSecuredAsync` (src/MxNativeClient/MxNativeSession.cs:218-221) is a defect of that .NET reimplementation, not a real LMX constraint — see R6 in `70-risks-and-open-questions.md` |
| Decoder failure on inbound frame | `Error::Protocol(ProtocolError::Decode { ... })` |
| `MxStatus` with `category != Ok` from `read`/`write`/`subscribe` | `Error::Status { success, category, source, detail }` (the full 4-tuple from the .NET `MxStatus` record at src/MxNativeCodec/MxStatus.cs:28-33 — `success` is the documented OK sentinel `-1` and is carried for byte-parity diagnostics per CLAUDE.md "preserve unknown bytes") |
| `MxStatus` with `category != Ok` on a `DataChange` from a `Subscription` | **not** an error — set on `DataChange.status`, surfaced through the stream |
The split between operation-status-as-error and subscription-status-as-data is the only sensible mapping. A subscription frame with `category=Warning, detail=stale` is data the caller wants to receive and inspect; it is not an exception. An operation that returns `category=ConfigurationError, detail=21 (invalid reference)` is a failure to produce a value and should be surfaced as `Err`.
## Cancellation semantics
- **Drop**: dropping a future cancels it. The library does not `block_on` inside drop.
- **Subscription drop**: holds a `tokio::sync::oneshot::Sender<UnAdviseRequest>` to signal `UnAdvise` from the task that owns the connection. On drop, the sender is consumed; the connection task fires `UnAdvise` best-effort. The drop returns immediately; no async cleanup is awaited. If the best-effort `UnAdvise` fails (e.g. connection already gone), the failure is logged via `tracing::warn!` with the subscription handle and the underlying error. The drop completes regardless. (Log signature precedent: `captures/probe-add-remove.log:7`, which records the .NET `LMX_UnAdvise` per-handle warning emitted when the proxy reports a non-Ok status during teardown.)
- **Session shutdown**: `Session::shutdown(timeout)` returns `Ok(())` once `UnregisterEngine` completes or the timeout elapses. Drop without `shutdown` runs an unbounded best-effort cleanup spawn — fine for tests, suboptimal for production where `shutdown` is the recommendation.
- **`CancellationToken`**: long operations (`subscribe_buffered`, recovery, `connect`) accept an optional token. Cancellation surfaces as `Error::Cancelled`.
- **Timeout**: `tokio::time::timeout` works on every operation because `async fn`s are cancel-correct by construction.
## Panic policy
Public API panics only for invariants the type system enforces (e.g. an array length that cannot be wrong). In practice this means: **no `unwrap()`, `expect()`, or `panic!()` in non-test code paths**.
Lints:
- `clippy::unwrap_used = "deny"`
- `clippy::expect_used = "deny"`
- `clippy::panic = "deny"`
- `clippy::indexing_slicing = "deny"` — out-of-bounds `slice[i]` is a panic; codecs use `slice.get(i).ok_or(...)?` instead.
- `clippy::unreachable = "deny"``unreachable!()` is still a panic; force `Err(ProtocolError::...)` on supposedly-impossible decoder branches so the wire is never trusted to match the type system.
- `clippy::todo = "deny"``todo!()` must not ship in non-test code.
- `clippy::arithmetic_side_effects = "warn"` — overflow-on-debug is a real footgun, but the codec does legitimate `u8`/`u16` field arithmetic where the lint is noisy; `warn` keeps it visible without blocking honest cases. Use `wrapping_*`/`checked_*` explicitly when the intent is non-default.
Test modules opt out of the unwrap/expect/indexing rules via the actual pattern:
```rust
#![cfg_attr(test, allow(clippy::unwrap_used, clippy::expect_used, clippy::indexing_slicing))]
```
placed at the top of the test module (or `#[cfg(test)] mod tests { #![allow(...)] ... }` for inline test modules). CI rejects PRs that introduce any of the denied lints in non-test paths.
## Diagnostics surface
`Error::Display` is short and operator-friendly. `Error::source()` walks the cause chain. `Error::Debug` is verbose for logs.
```rust
match session.write("X.Y", v).await {
Ok(()) => {}
Err(e) => {
tracing::error!(error = ?e, error.source = ?e.source(), "write failed");
return Err(e);
}
}
```
For protocol-level decode failures, every `ProtocolError::Decode` carries:
- byte offset in the frame
- reason (`&'static str` — interned message)
- buffer length
This matches the .NET reference's tendency to throw `ArgumentException` with byte-offset messages.
## Recovery decisions
`Error::is_retryable() -> bool` informs the default `RecoveryPolicy`:
| Variant | Retryable? |
|---|---|
| `Connection(ServerUnavailable)` | no — `0x800706BA` in the proven stack is a structural callback-marshalling failure (no SYN to the advertised callback port after security bindings are added), not a flapping `NmxSvc.exe`. Retrying loops forever. Evidence: `analysis/proxy/managed-registerengine2-callback-probe.txt:8`, `docs/NMX-COM-Contracts.md:592`. |
| `Connection(TcpConnect)` | depends on the inner `io::Error.kind()` — same dispatch as `Io(_)` below: `WouldBlock`/`Interrupted`/`TimedOut`/`ConnectionReset` → yes; `AddrNotAvailable`/`HostUnreachable`/`ConnectionRefused` → no. Implementation: `is_retryable()`'s `Connection(TcpConnect { source, .. })` arm delegates to `source.kind()` rather than blanket-yes; otherwise a name-not-found hot-loops. |
| `Connection(OxidResolve)` | yes |
| `Connection(CallbackProxyMissing)` | no (config issue) |
| `Connection(EngineNotRegistered)` | yes (re-register) |
| `Auth(Ntlm)` | no (credentials are wrong) |
| `Auth(AsbHandshake)` | no |
| `Auth(SharedSecret)` | no |
| `Configuration(*)` | no |
| `TypeMismatch` | no |
| `Security(*)` | no |
| `Status` with `category=CommunicationError` | yes |
| `Status` with `category=Pending` | yes (with backoff) |
| `Status` with `category=ConfigurationError` | depends on `detail`: `detail == 21` ("Invalid reference", src/MxNativeCodec/MxStatus.cs:76) → yes, after a Galaxy-resolver-cache refresh — the .NET reference's per-operation `ResolveTagAsync` (src/MxNativeClient/MxNativeSession.cs:173,196,232,255) makes a stale-cache miss recoverable on the next attempt. All other `ConfigurationError` details → no. Implementation: `is_retryable()` matches the inner `detail` and returns `true` only for the resolver-refresh-eligible codes. |
| `Status` other categories | no by default; consumer can override |
| `Timeout` | yes |
| `Cancelled` | no |
| `Protocol(*)` | no — protocol-level decode bugs are not retryable |
| `Io(_)` | depends on `kind()`: `WouldBlock`/`Interrupted`/`TimedOut` → yes; everything else → no |
The default `RecoveryPolicy::exponential` calls `is_retryable()` before re-attempting and emits `RecoveryEvent::Failed { error, .. }` if not.
## Status as data on subscriptions
```rust
pub struct DataChange {
pub reference: Arc<str>,
pub value: MxValue,
pub quality: u16,
pub timestamp: SystemTime,
pub status: MxStatus,
}
```
Consumers inspect `status.category` to distinguish good/uncertain/bad data. The default `mxaccess` does not filter; if a consumer wants Ok-only data, they filter the stream themselves:
```rust
let mut sub = session.subscribe("X.Y").await?
.filter_map(|change| async move {
match change {
Ok(c) if c.status.category == MxStatusCategory::Ok => Some(Ok(c)),
Ok(_) => None, // drop non-Ok
Err(e) => Some(Err(e)),
}
});
```
This keeps `Error` reserved for actual failures and keeps stream items composable.