diff --git a/design/followups.md b/design/followups.md index 2b0fce0..298b354 100644 --- a/design/followups.md +++ b/design/followups.md @@ -25,39 +25,6 @@ Between each publish: wait for the crate to be indexed before the next one's `ca **Resolves when:** crates.io shows all 9 crates published + the V1 tag is pushed. -### F49 — Live verification sweep for the M6 features -**Status:** **Resolved 2026-05-06.** All five steps pass live against the local AVEVA install (`docs/M6-live-verification.md`): -- Step 1 (F36 buffered subscribe) — `tests/buffered_subscribe_live.rs` -- Step 2 (F45 buffered recovery replay) — `tests/buffered_recovery_replay_live.rs` -- Step 3 (F47 buffered unsubscribe skip) — `tests/buffered_unsubscribe_skip_live.rs` -- Step 4 (F40 metrics smoke) — `tests/metrics_smoke_live.rs` -- Step 5 (F54 OnWriteComplete) — `tests/lmx_write_complete_live.rs` - -F55 (DCOM-managed `INmxSvcCallback`) and F56 (missing `EnsurePublisherConnected` + AdviseSupervisory after RegisterReference for buffered) were the two real Rust-port bugs uncovered along the way; both are resolved. -**Severity:** P1 — closes the live-evidence gap for the M6 work that landed unit-only this session. -**Source:** F36, F40, F45, F47, F54 closeouts — each ships with unit tests but most were not exercised against the live AVEVA install in this session. -**Blocked-by:** F12 hardening (`Session::connect_nmx_auto` returns `RPC_S_SERVER_UNAVAILABLE` (1722) under `cargo test`'s tokio multi-thread runtime — see "Live attempt 2026-05-06" below). The COM-activation path itself works in isolation (`cargo run -p mxaccess-rpc --example com-marshal-probe --features windows-com` succeeds), so the failure is downstream — likely a COM apartment threading issue when CoInitializeEx runs on a tokio worker thread. - -**Scope.** Run the following against the live AVEVA host with `MX_LIVE=1`: -1. **F36 buffered subscribe** — `cargo run -p mxaccess --example subscribe-buffered -- --tag TestChildObject.TestInt`. Confirm `OnBufferedDataChange`-rate updates flow at the configured cadence; capture wire bytes via `analysis/frida/mx-nmx-trace.js` and confirm exactly one `RegisterReference` (`0x10`) frame with `.property(buffer)` suffix, no separate `SetBufferedUpdateInterval` RPC, and no separate `AdviseSupervisory` follow-up. -2. **F45 recovery replay for buffered** — start the `subscribe-buffered` example, force a `Session::recover_connection` mid-flight (e.g. via a `wwtools` helper that bumps the NMX TCP socket), assert the post-recovery NMX traffic carries an `RegisterReference` (NOT `AdviseSupervisory`) with the same correlation id and `.property(buffer)` suffix. -3. **F47 buffered unsubscribe skip** — instrument `Session::unsubscribe` with a `tracing::debug` log line on the buffered branch, run the example to completion + drop, confirm no `UnAdvise` frame in the wire trace. -4. **F40 metrics** — install a `metrics` exporter (`metrics-exporter-prometheus` is the lightest), run `connect-write-read` + `subscribe` examples with `--features metrics`, confirm at least one counter increment and one histogram observation per metric name in the registered set. -5. **F54 OnWriteComplete (LmxClient round-trip)** — scaffold lives at `crates/mxaccess-compat/tests/lmx_write_complete_live.rs`. Run `cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture` to drive `LmxClient::write` → drain `client.on_write_complete()` and assert the `WriteCompleteEvent { server_handle, item_handle, statuses, is_during_recovery }` shape matches `LMX_OnWriteComplete(int hLMXServerHandle, int phItemHandle, ref MXSTATUS_PROXY[] pVars)`. - -**Live attempt 2026-05-06.** Steps 1-4 not run yet. Step 5 attempted; the test compiled and ran past Frida-style `--probe-resolve-oxid-managed-ntlm-integrity` resolution + `--probe-remqi-managed` IPID extraction, but `connect_nmx_auto` (preferred path) and `connect_nmx` (fallback with probe-resolved IPID) both fail with `Status { detail: 1722 }` (RPC_S_SERVER_UNAVAILABLE). The .NET `MxNativeClient.Probe --probe-session-write` runs the same scenario successfully end-to-end against the same AVEVA install, so the wire is functional and the failure is Rust-port specific. Root-caused as F55 (hand-rolled callback exporter rejected by NmxSvc's SCM-side OXID resolution); not a tokio-runtime COM-activation issue. - -**Step 5 unblocked 2026-05-06 by F55 / Path A.** `cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture` passes against the live AVEVA install: RegisterEngine2 OK, write round-trips, OnWriteComplete fires with the expected `WriteCompleteEvent { server_handle, item_handle, statuses, is_during_recovery }` shape. Steps 1-4 still pending. - -**Step 1 attempted 2026-05-06 — blocked by F56.** Added `crates/mxaccess-compat/tests/buffered_subscribe_live.rs` driving `Session::subscribe_buffered` via `Session::connect_nmx_auto`. RegisterReference completes successfully against the live engine, but no `0x33` DataUpdate frames are ever received — only op-status frames per write and the `0x11` registration-result. The Rust port's wire frame for RegisterReference must differ from the .NET reference's in some field. The codec router was also fixed during this attempt (envelope-peeling for `NmxSubscriptionMessage` and the `0x11` `NmxReferenceRegistrationResultMessage` path were both missing); those are real bugs that would have hidden any DataUpdate had one arrived. F56 captures the open work. - -**Definition of done:** -1. Per-feature evidence summary in `docs/M6-live-verification.md` (one paragraph per feature with the wire-trace excerpt or metrics-exporter snapshot). -2. If any feature fails live: file a sub-followup with the captured failure and link it from the evidence doc. -3. F12's tokio-runtime COM activation issue resolved (the `connect_nmx_auto` 1722 error above) so the live tests can actually run. - -**Resolves when:** all five features have a live evidence row + no sub-followups remain unresolved. - ### F50 — Run the F46 Suspend/Activate Frida capture live **Severity:** P3 — residual from F46 (script ready, capture not yet run). **Source:** F46 closeout (`design/followups.md`) + `analysis/frida/mx-nmx-trace.js` header procedure. @@ -98,6 +65,7 @@ F55 (DCOM-managed `INmxSvcCallback`) and F56 (missing `EnsurePublisherConnected` **Resolves when:** all three optimisations land or are deliberately rejected with a note in the baseline doc. ### F53 — Enable `#![warn(missing_docs)]` workspace-wide +**Status:** Consumer crates resolved 2026-05-06: `#![warn(missing_docs)]` enabled on `mxaccess` and `mxaccess-compat` lib roots, every public item now carries at least a one-line doc comment, `RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps` clean. Protocol crates (`mxaccess-codec`, `mxaccess-rpc`, `mxaccess-galaxy`, `mxaccess-nmx`, `mxaccess-callback`, `mxaccess-asb`, `mxaccess-asb-nettcp`) deliberately deferred per the strategy paragraph below — their consumers (`mxaccess` + `mxaccess-compat`) already document the surfaces they re-export, and forcing one-liners on every transport-internal item adds noise without consumer value. **Severity:** P3 — doc-coverage tightening; not a correctness or release blocker. **Source:** F42 closeout — the missing-docs lint was deferred because enabling it surfaces hundreds of low-priority public-item gaps that are out of scope for that F-number. @@ -111,7 +79,9 @@ F55 (DCOM-managed `INmxSvcCallback`) and F56 (missing `EnsurePublisherConnected` **Resolves when:** the lint is on and the workspace doc build is warning-clean with it. ### F56 — `subscribe` / `subscribe_buffered` complete on the wire but never receive `0x33` DataUpdate frames -**Status:** **Resolved 2026-05-06.** Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` round-trip that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise against a given publishing engine. Without that pair of RPCs, NmxSvc accepts the subscription registration but the publishing engine never knows our engine is subscribed — so no `0x33` DataUpdate frames flow. +**Status:** **Resolved 2026-05-06.** See Resolved section below for the full closeout. + +Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` round-trip that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise against a given publishing engine. Without that pair of RPCs, NmxSvc accepts the subscription registration but the publishing engine never knows our engine is subscribed — so no `0x33` DataUpdate frames flow. Diagnosed via wwtools/aalogcli: the `[Warning] NmxSvc | NmxCallback->DataReceived ... failed with error 0x{N}` log lines turned out to be NmxSvc's normal log spam where N is the bufferSize, NOT an actual error — the .NET reference's own probe triggers identical entries while still receiving `0x33` DataUpdate frames successfully. The real issue was that those frames never started being sent in the first place. diff --git a/rust/crates/mxaccess-compat/src/lib.rs b/rust/crates/mxaccess-compat/src/lib.rs index c25bd1b..885a197 100644 --- a/rust/crates/mxaccess-compat/src/lib.rs +++ b/rust/crates/mxaccess-compat/src/lib.rs @@ -50,6 +50,7 @@ //! live-trigger work. #![forbid(unsafe_code)] +#![warn(missing_docs)] use std::collections::{HashMap, VecDeque}; use std::pin::Pin; @@ -73,12 +74,20 @@ use tokio_stream::wrappers::BroadcastStream; /// `MxNativeDataChangeEvent` (`MxNativeCompatibilityServer.cs:6-13`). #[derive(Debug, Clone)] pub struct DataChangeEvent { + /// LMX server handle that produced this event. pub server_handle: i32, + /// Item handle within `server_handle` whose value changed. pub item_handle: i32, + /// Decoded value payload. pub value: MxValue, + /// Legacy 16-bit OPC quality. pub quality: u16, + /// Wire-recorded timestamp (Windows FILETIME-derived). pub timestamp: SystemTime, + /// Richer category-model status (complements `quality`). pub status: MxStatus, + /// `true` when the event was emitted while a `recover_connection` + /// attempt was in flight. pub is_during_recovery: bool, } @@ -90,13 +99,21 @@ pub struct DataChangeEvent { /// capture proves multi-sample bodies real. #[derive(Debug, Clone)] pub struct BufferedDataChangeEvent { + /// LMX server handle that produced this event. pub server_handle: i32, + /// Item handle within `server_handle`. pub item_handle: i32, + /// `MxDataType` discriminator for the carried values. pub mx_data_type: i16, + /// Sample values — length 1 per R2's single-sample verdict. pub values: Vec, + /// Per-sample legacy 16-bit OPC qualities. Same length as `values`. pub qualities: Vec, + /// Per-sample timestamps. Same length as `values`. pub timestamps: Vec, + /// Per-sample richer-category status. Same length as `values`. pub statuses: Vec, + /// `true` when the event was emitted during recovery. pub is_during_recovery: bool, } @@ -104,9 +121,13 @@ pub struct BufferedDataChangeEvent { /// `MxNativeWriteCompleteEvent` (`cs:15-19`). #[derive(Debug, Clone)] pub struct WriteCompleteEvent { + /// LMX server handle that issued the original write. pub server_handle: i32, + /// Item handle the write was targeted at. pub item_handle: i32, + /// Per-write completion statuses (one per `MXSTATUS_PROXY` slot). pub statuses: Vec, + /// `true` when the write completed while recovery was in flight. pub is_during_recovery: bool, } @@ -117,9 +138,13 @@ pub struct WriteCompleteEvent { /// once the trigger is captured. #[derive(Debug, Clone)] pub struct OperationCompleteEvent { + /// LMX server handle the operation belongs to. pub server_handle: i32, + /// Item handle the operation was targeted at. pub item_handle: i32, + /// Per-operation statuses. pub statuses: Vec, + /// `true` when the event was emitted during recovery. pub is_during_recovery: bool, } diff --git a/rust/crates/mxaccess/src/lib.rs b/rust/crates/mxaccess/src/lib.rs index 61675da..62f0e87 100644 --- a/rust/crates/mxaccess/src/lib.rs +++ b/rust/crates/mxaccess/src/lib.rs @@ -18,6 +18,7 @@ //! deferred work tracker. #![forbid(unsafe_code)] +#![warn(missing_docs)] use std::borrow::Cow; use std::sync::Arc; @@ -60,11 +61,16 @@ pub struct Session { /// `wwtools/mxaccesscli/docs/api-notes.md:104-105`. #[derive(Debug, Clone)] pub struct DataChange { + /// Tag reference the update applies to. pub reference: Arc, + /// Decoded value payload. pub value: MxValue, /// Legacy 16-bit OPC quality. Distinct from `status: MxStatus`. pub quality: u16, + /// Wire-recorded timestamp (Windows FILETIME-derived) for the update. pub timestamp: SystemTime, + /// Richer category-model status. Complements `quality` for callers + /// that want the modern MxStatus taxonomy. pub status: MxStatus, } @@ -125,9 +131,16 @@ impl BufferedOptions { } } +/// Caller identity for secured-write operations. Mirrors the .NET +/// reference's two-user-id pattern (`current_user_id` is the actor; +/// `verifier_user_id` is the supervisor approving the change). Both +/// fields are required by the wire op even when the same user fills +/// both roles — pass the same id twice for single-user secured writes. #[derive(Debug, Clone)] pub struct SecurityContext { + /// User id of the caller initiating the write. pub current_user_id: i32, + /// User id of the verifier (supervisor) approving the write. pub verifier_user_id: i32, } @@ -135,17 +148,32 @@ pub struct SecurityContext { #[derive(Debug, Clone)] pub struct ConnectionOptions; +/// Discriminator for which transport produced (or should consume) a +/// given session, error, or capability set. Surfaces in +/// [`Error::Unsupported`] to identify the transport that rejected an +/// operation and in [`TransportCapabilities`] indirectly. #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] #[non_exhaustive] pub enum TransportKind { + /// NMX (DCE/RPC over `INmxService2` to NmxSvc.exe). Nmx, + /// ASB (`IASBIDataV2` over net.tcp to MxDataProvider). Asb, } +/// Per-transport capability flags — used by feature-detection helpers +/// to query what an open session supports without spelunking through +/// transport-specific docs. #[derive(Debug, Clone, Copy)] pub struct TransportCapabilities { + /// `true` if the transport supports a server-side buffered + /// delivery cadence (NMX yes; ASB no — gated at API level). pub buffered_subscribe: bool, + /// `true` if the transport supports the `Activate` / `Suspend` + /// pair on a subscribed item (NMX yes; ASB no). pub activate_suspend: bool, + /// `true` if the transport surfaces an OperationComplete frame + /// post-write (NMX yes via `OnWriteComplete`; ASB no). pub operation_complete_frame: bool, } @@ -196,23 +224,38 @@ impl Default for RecoveryPolicy { } } +/// Recovery-attempt lifecycle event broadcast on +/// [`Session::recovery_events`]. +/// /// Not `Clone` — `Error` is not `Clone`-able (thiserror chains an /// `io::Error` source which is not `Clone`). Consumers that need to clone an /// event should wrap it in `Arc`. #[derive(Debug)] #[non_exhaustive] pub enum RecoveryEvent { + /// `recover_connection` started a new attempt. `attempt` is + /// 1-indexed and counts only attempts driven by the current + /// `recover_connection` invocation (resets on a fresh call). Started { + /// 1-indexed attempt counter. attempt: u32, }, + /// An attempt failed. `error` is the underlying cause; `will_retry` + /// is `true` when the configured [`RecoveryPolicy`] still has + /// retry budget. Failed { + /// 1-indexed attempt counter. attempt: u32, + /// Underlying error that caused the failure. error: Error, /// Whether the configured policy will retry. Mirrors /// `MxNativeRecoveryFailureEvent.WillRetry` (`MxNativeSession.cs:47-51`). will_retry: bool, }, + /// Recovery completed successfully. `attempt` is the index of the + /// successful attempt. Recovered { + /// 1-indexed attempt counter. attempt: u32, }, } @@ -242,12 +285,20 @@ pub enum RecoveryEvent { /// `RegisterEngine2`. #[derive(Debug, Clone, PartialEq, Eq, Hash)] pub struct SessionOptions { + /// Local engine id advertised to NmxSvc. Default: `0x7000 + (pid & 0x0FFF)`. pub local_engine_id: i32, + /// Engine name string sent to `RegisterEngine2`. Default: `mxaccess.`. pub engine_name: String, + /// Partner-version negotiated with NmxSvc. Default: `6`. pub partner_version: i32, + /// Galaxy id for routing. Default: `1`. pub galaxy_id: u8, + /// Source-platform id for outbound NMX envelopes. Default: `1`. pub source_platform_id: i32, + /// `Some(n)` enables `SetHeartbeatSendInterval(n, ...)` after register; + /// `None` skips the heartbeat config call entirely (default). pub heartbeat_ticks_per_beat: Option, + /// Heartbeat tolerance — only used when `heartbeat_ticks_per_beat` is `Some`. pub heartbeat_max_missed_ticks: i32, } @@ -290,43 +341,63 @@ impl Default for SessionOptions { // ---- Error taxonomy ------------------------------------------------------ +/// Top-level error returned by every fallible `mxaccess` API. The +/// variants partition errors into stable categories so consumers can +/// match on shape without spelunking nested error types. #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum Error { + /// Transport bring-up or runtime connection-state failure. #[error("connection: {0}")] Connection(#[from] ConnectionError), + /// NTLM or other authentication failure. #[error("authentication: {0}")] Auth(#[from] AuthError), + /// Wire protocol / codec violation (decode failure, size mismatch). #[error("protocol: {0}")] Protocol(#[from] ProtocolError), + /// Caller-supplied options or arguments rejected pre-flight. #[error("configuration: {0}")] Configuration(#[from] ConfigError), + /// `MxValue` kind doesn't match what the resolver / engine expects + /// for the named tag. #[error("type mismatch on {reference}: expected {expected:?}, got {actual:?}")] TypeMismatch { + /// Tag reference whose write/read triggered the mismatch. reference: Arc, + /// Kind the engine / resolver expected. expected: MxValueKind, + /// Kind the caller supplied (or the wire returned). actual: MxValueKind, }, + /// Galaxy- or session-level security check rejected the operation. #[error("security: {0}")] Security(#[from] SecurityError), + /// Operation isn't supported on the chosen transport + /// (e.g. `subscribe_buffered` on ASB). #[error("unsupported on {transport:?} transport: {operation}")] Unsupported { + /// Human-readable name of the operation that was rejected. operation: Cow<'static, str>, + /// Transport that rejected the operation. transport: TransportKind, }, + /// Operation didn't complete within its timeout budget. #[error("operation timed out after {0:?}")] Timeout(Duration), + /// Operation was cancelled (e.g. via cancellation token / `drop`). #[error("operation cancelled")] Cancelled, + /// Server-reported MxStatus (decoded category + detail). // Field is named `detected_by` (not `source`) to match the codec's // `MxStatus.detected_by` and to avoid thiserror's `#[source]` attribute // semantics (which would require `MxStatusSource: std::error::Error`). @@ -334,23 +405,37 @@ pub enum Error { "status: success={success} category={category:?} detected_by={detected_by:?} detail={detail}" )] Status { + /// `0` = success, non-zero = failure with the rest of the + /// fields populated. success: i16, + /// Status category as decoded from the wire (`MxStatusCategory`). category: MxStatusCategory, + /// Layer that originally detected the failure. detected_by: MxStatusSource, + /// Detail code carrying the specific reason. detail: i16, }, + /// Underlying I/O error from the transport socket. #[error("io: {0}")] Io(#[from] std::io::Error), } +/// Connection / transport-bring-up failure modes. #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum ConnectionError { + /// RPC server (NmxSvc / MxDataProvider) unreachable or refused + /// the connection. #[error("RPC server unavailable")] ServerUnavailable, + /// COM proxy/stub for the callback interface isn't registered on + /// the box (`REGDB_E_CLASSNOTREG` from the SCM). #[error("callback proxy/stub not registered (REGDB_E_CLASSNOTREG)")] CallbackProxyMissing, + /// `RegisterEngine2` returned non-zero, OR an operation was + /// attempted on a session whose engine wasn't registered (e.g. + /// post-shutdown). #[error("engine not registered (UninitializedObject / ERROR_INVALID_STATE)")] EngineNotRegistered, /// Transport bring-up failed during preamble exchange or @@ -359,38 +444,68 @@ pub enum ConnectionError { /// keep the public taxonomy small. ASB-specific (F26 step 2); /// `EngineNotRegistered` covers the analogous NMX failure mode. #[error("transport bring-up failed: {detail}")] - TransportFailure { detail: String }, + TransportFailure { + /// Stringified underlying-error detail. + detail: String, + }, } +/// Authentication-side failures. #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum AuthError { + /// NTLM handshake rejected by the peer. #[error("NTLM rejected: {reason}")] - Ntlm { reason: String }, + Ntlm { + /// Human-readable reason from the underlying NTLM stack. + reason: String, + }, } +/// Wire-protocol violations surfaced by the codec layer. #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum ProtocolError { + /// Decode failure at a specific buffer offset. #[error("decode at offset {offset} ({reason}); buffer len {buffer_len}")] Decode { + /// Byte offset within the buffer where decoding failed. offset: usize, + /// Static description of the violation. reason: &'static str, + /// Total buffer length (for context in error messages). buffer_len: usize, }, + /// Outer envelope's declared inner length doesn't match the actual body size. #[error("inner length {declared} does not match body length {actual}")] - InnerLengthMismatch { declared: i32, actual: usize }, + InnerLengthMismatch { + /// Inner length declared in the envelope header. + declared: i32, + /// Actual remaining body length on the wire. + actual: usize, + }, + /// First-byte command opcode wasn't recognised by the parser. #[error("unexpected opcode {0:#x}")] UnexpectedOpcode(u8), } +/// Caller-side configuration / argument errors. #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum ConfigError { + /// Argument failed pre-flight validation. `detail` carries the + /// specific reason (which arg, what was wrong). #[error("invalid argument: {detail}")] - InvalidArgument { detail: String }, + InvalidArgument { + /// Human-readable description of the rejected argument. + detail: String, + }, + /// Galaxy resolver returned an error during tag resolution. #[error("galaxy resolver: {reason}")] - Galaxy { reason: String }, + Galaxy { + /// Underlying resolver error message. + reason: String, + }, /// `Session::recover_connection` was called without a /// [`crate::RebuildFactory`] installed via /// [`crate::Session::set_recovery_factory`]. F16. @@ -400,11 +515,15 @@ pub enum ConfigError { RecoveryNotConfigured, } +/// Security-related operation rejection. #[derive(Debug, thiserror::Error)] #[non_exhaustive] pub enum SecurityError { + /// NmxSvc rejected our callback OBJREF (`HRESULT 0x8001011D`). #[error("callback OBJREF rejected (HRESULT 0x8001011D)")] CallbackObjRefRejected, + /// Caller invoked a secured-write op without supplying the + /// verifier user id. #[error("verifier user token required for secured write")] VerifierRequired, } @@ -414,10 +533,14 @@ pub enum SecurityError { /// Generic-only trait — `dyn Transport` is intentionally unsupported (see /// design/20-async-layer.md L53 fix). Consumers parameterise on ``. pub trait Transport: Send + Sync + 'static { + /// Reports per-transport capability flags so callers can + /// feature-gate without hard-coding transport identity. fn capabilities(&self) -> TransportCapabilities; + /// Reports which [`TransportKind`] this implementation produces. fn kind(&self) -> TransportKind; } + // ---- Session API surface ------------------------------------------------- // // The `*_value` family in `session.rs` takes `WriteValue` (the codec's