natsdotnet/dtp_updates.md
Joseph Doherty 5de4962bd3 Improve docs coverage and refresh profiling parser artifacts
Add domain-specific XML documentation across src server components to satisfy CommentChecker, and update dotTrace parsing outputs used for diagnostics.
2026-03-14 04:06:04 -04:00


DTP Snapshot Extractor — Requested Changes

Problem

The current extractor produces too few nodes to be useful for performance analysis. A 30-second dotTrace sampling snapshot of the NATS server handling 1M messages (5s publish + 1.3s consume) yields only 202 nodes in the JSON output. The entire publish/consume hot path is invisible — no ProcessCommandsAsync, ProcessMessage, DeliverPullFetchMessagesAsync, SendMessageNoFlush, SubList.Match, FileStore.AppendAsync, MsgBlock.WriteAt, or any other NATS server method appears in the call tree. Only server startup/shutdown and InternalEventSystem event serialization show up.

By contrast, the dotTrace GUI shows thousands of samples across these functions with clear call trees and accurate timing. The .dtp file has the data — the extractor is not surfacing it.

Evidence

Snapshot: threads=11, nodes=202
Top NATS inclusive hotspots:
  NatsServer.WaitForShutdown    29741.8ms  (idle wait)
  NatsServer..ctor                 36.5ms  (one-time init)
  InternalEventSystem              26.0ms  (periodic stats)

The actual hot path (ProcessCommandsAsync → ProcessMessage → fan-out/delivery), which runs for ~6 seconds of wall time, is completely absent.


Change 1: Increase Hotspot Limit

Current: Take(50) in BuildHotspots for both inclusive and exclusive lists.

Requested: Increase to at least 200, or make it configurable via a CLI flag (e.g., --top N). With only 50 hotspots, important functions lower in the ranking are silently dropped.
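A minimal sketch of what the configurable flag could look like on the Python side, assuming the CLI is (or can be) argparse-based. The `--top` name matches the request above; the parser structure and other flag names here are hypothetical:

```python
import argparse

def build_cli() -> argparse.ArgumentParser:
    # Hypothetical sketch: the real tools/dtp_parse.py argument set may differ.
    parser = argparse.ArgumentParser(
        description="Extract a dotTrace .dtp snapshot to JSON")
    parser.add_argument("snapshot", help="path to the .dtp file")
    parser.add_argument("--top", type=int, default=200,
                        help="hotspots to keep per list (default: 200)")
    parser.add_argument("--out", help="output JSON path (default: stdout)")
    return parser

# Example invocation with an explicit argv, so nothing is read from sys.argv.
args = build_cli().parse_args(["snapshots/foo.dtp", "--top", "300"])
```

The Python side would then forward the value to the .NET helper, which replaces its hard-coded Take(50) with Take(top).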


Change 2: Add --filter Flag to Python CLI

Add a --filter option that passes a substring filter to the .NET helper, so the JSON output only includes nodes whose name matches the filter. This reduces noise and lets me focus on the relevant code:

python3 tools/dtp_parse.py snapshots/foo.dtp --filter NATS --out /tmp/result.json

The .NET helper should filter the hotspot lists and prune the call tree to only include paths that contain at least one matching node (keeping ancestors and descendants of matching nodes).
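One way to express the pruning rule, sketched in Python against an assumed node shape of {"name", "children", ...} (the real .NET helper's tree model will differ): a node survives if its own name matches the filter (keeping its whole subtree, i.e. descendants), or if any descendant matches (keeping the non-matching ancestor as connective tissue).

```python
def prune_tree(node, needle):
    """Prune a call tree to paths containing at least one matching node.

    A matching node keeps its entire subtree; a non-matching node is kept
    only if some descendant matches. Returns None if nothing under this
    node matches. Matching is a case-insensitive substring test.
    """
    if needle.lower() in node.get("name", "").lower():
        return node  # match: keep the whole subtree as-is
    kept = [p for p in (prune_tree(c, needle)
                        for c in node.get("children", [])) if p]
    if kept:
        pruned = dict(node)          # shallow copy, so the input is untouched
        pruned["children"] = kept    # keep only surviving branches
        return pruned
    return None
```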


Change 3: Add Flat Call-Path Output Mode

The current nested call tree is hard to consume programmatically for hot-path analysis. Add a --flat or --paths mode that outputs the top N heaviest call paths as flat strings with timing:

{
  "hotPaths": [
    {
      "path": "ThreadPool > ProcessCommandsAsync > ProcessMessage > DeliverPullFetchMessagesAsync > SendMessageNoFlush",
      "inclusiveMs": 342.5,
      "leafExclusiveMs": 89.2
    },
    {
      "path": "ThreadPool > ProcessCommandsAsync > ProcessMessage > SubList.Match",
      "inclusiveMs": 156.3,
      "leafExclusiveMs": 156.3
    }
  ]
}

This is the most useful output format for LLM-driven analysis — I can immediately see which call chains are expensive without walking the tree.
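A sketch of the flattening, assuming the nested node shape {"name", "inclusiveMs", "exclusiveMs", "children"} (an assumption about the extractor's JSON; field names may differ). Each leaf contributes one path, weighted by the leaf's inclusive time, matching the example output above:

```python
def hot_paths(root, top_n=20):
    """Flatten a nested call tree into its heaviest root-to-leaf paths."""
    paths = []

    def walk(node, trail):
        trail = trail + [node["name"]]
        children = node.get("children", [])
        if not children:  # leaf: emit one flat path record
            paths.append({
                "path": " > ".join(trail),
                "inclusiveMs": node["inclusiveMs"],
                "leafExclusiveMs": node.get("exclusiveMs",
                                            node["inclusiveMs"]),
            })
        for child in children:
            walk(child, trail)

    walk(root, [])
    return sorted(paths, key=lambda p: p["inclusiveMs"], reverse=True)[:top_n]
```

Note this sketch only emits paths ending at leaves; a fuller version might also emit a path for any interior node with significant exclusive time.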


Change 4: Exclude Idle/Wait Functions from Hotspots

Functions like WaitHandle.WaitOneNoCheck, SemaphoreSlim.WaitCore, LowLevelLifoSemaphore.WaitForSignal, Monitor.Wait, SocketAsyncEngine.EventLoop, Thread.PollGC, and Interop+Sys.WaitForSocketEvents dominate the hotspot lists but represent idle waiting, not actual CPU work. Either:

  • Add a --exclude-idle flag (default on) that strips these from hotspot lists, or
  • Always exclude them from the exclusive hotspot list (they have zero useful exclusive time) and keep them in inclusive only if requested.
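A sketch of the denylist filter, in Python for illustration (the real change would live in the .NET helper's hotspot construction). The marker list is taken verbatim from the functions named above; substring matching tolerates namespace prefixes:

```python
# Denylist of known idle/wait frames, from the hotspot dump above.
IDLE_MARKERS = (
    "WaitHandle.WaitOneNoCheck",
    "SemaphoreSlim.WaitCore",
    "LowLevelLifoSemaphore.WaitForSignal",
    "Monitor.Wait",
    "SocketAsyncEngine.EventLoop",
    "Thread.PollGC",
    "Interop+Sys.WaitForSocketEvents",
)

def strip_idle(hotspots):
    """Drop hotspot entries whose name contains a known idle/wait marker."""
    return [h for h in hotspots
            if not any(marker in h["name"] for marker in IDLE_MARKERS)]
```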

Change 5: Investigate Missing Nodes (Critical)

This is the most important issue. 202 nodes from a 30-second sampling profile is far too few. The dotTrace GUI shows the same snapshot with a full, deep call tree across all ThreadPool workers. Possible causes:

  1. The DFS reader is not reading all sections. The callTreeSections.AllHeaders() call may not be returning headers for all threads or all sampling intervals. Check whether there are multiple call tree section families and the current code only reads one.

  2. Node merging/deduplication is losing data. If two threads call the same function, they may share a FunctionUID but have different CallTreeSectionOffset values. Verify that the nodeMap dictionary keyed by offset isn't accidentally losing nodes from different threads.

  3. The totalNodeCount calculation may be wrong. The formula (SectionSize - SectionHeaderSize) / RecordSize() may not account for all record types or section layouts in sampling snapshots.

  4. Sampling vs tracing data layout differences. The code may have been tested primarily with tracing snapshots. Sampling snapshots store data differently — verify that the same reader API works for both.

The fix should result in thousands of nodes for a typical 30-second sampling snapshot, not 202. If the current dotTrace API approach fundamentally can't extract sampling data at full fidelity, document that limitation and suggest an alternative approach (e.g., using dotTrace's built-in report export if available on macOS, or switching to a different API surface).


Change 6: Add Time Unit to JSON Output

The current inclusiveTime / exclusiveTime values are in an unspecified unit (they appear to be nanoseconds, judging by magnitude). Add a timeUnit field to the snapshot section:

{
  "snapshot": {
    "path": "...",
    "payloadType": "time",
    "timeUnit": "nanoseconds",
    "threadCount": 11,
    "nodeCount": 1923
  }
}
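An explicit unit also removes guesswork from downstream conversion. A minimal sketch, assuming the raw payload values are nanoseconds unless the snapshot section says otherwise:

```python
NS_PER_MS = 1_000_000

def to_ms(raw, time_unit="nanoseconds"):
    """Convert a raw profiler time value to milliseconds.

    The default unit is an assumption; reading timeUnit from the
    snapshot section makes the conversion unambiguous.
    """
    if time_unit == "nanoseconds":
        return raw / NS_PER_MS
    if time_unit == "microseconds":
        return raw / 1_000
    if time_unit == "milliseconds":
        return float(raw)
    raise ValueError(f"unknown timeUnit: {time_unit}")
```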

Change 7: Add Summary Statistics

Add a summary section to the output with quick-reference stats:

{
  "summary": {
    "wallTimeMs": 30155,
    "activeTimeMs": 6340,
    "totalSamples": 15234,
    "topExclusiveMethod": "SendMessageNoFlush",
    "topExclusiveMs": 89.2
  }
}

This lets me immediately assess whether the profile captured meaningful work without parsing the full tree.
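A sketch of how the summary could be derived from data the extractor already has, in Python for illustration. Here activeTimeMs is approximated as the sum of exclusive time across methods — an assumption about the intended definition, not a spec:

```python
def build_summary(wall_time_ms, exclusive_hotspots, total_samples):
    """Derive quick-reference stats from the exclusive hotspot list.

    exclusive_hotspots: [{"name": str, "exclusiveMs": float}, ...]
    activeTimeMs is approximated as total exclusive time (assumption).
    """
    top = max(exclusive_hotspots,
              key=lambda h: h["exclusiveMs"], default=None)
    return {
        "wallTimeMs": round(wall_time_ms),
        "activeTimeMs": round(sum(h["exclusiveMs"]
                                  for h in exclusive_hotspots)),
        "totalSamples": total_samples,
        "topExclusiveMethod": top["name"] if top else None,
        "topExclusiveMs": top["exclusiveMs"] if top else 0.0,
    }
```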


Priority Order

  1. Change 5 (missing nodes) — without this, everything else is moot
  2. Change 4 (exclude idle) — makes hotspots immediately useful
  3. Change 1 (increase limit) — more hotspots visible
  4. Change 6 (time unit) — eliminates guesswork
  5. Change 3 (flat paths) — most useful output format for analysis
  6. Change 2 (filter) — nice to have for focused analysis
  7. Change 7 (summary) — nice to have for quick assessment