natsdotnet/dtp_updates.md

# DTP Snapshot Extractor — Requested Changes

## Problem

The current extractor produces too few nodes to be useful for performance analysis. A 30-second dotTrace sampling snapshot of the NATS server handling 1M messages (5s publish + 1.3s consume) yields only **202 nodes** in the JSON output. The entire publish/consume hot path is invisible — no `ProcessCommandsAsync`, `ProcessMessage`, `DeliverPullFetchMessagesAsync`, `SendMessageNoFlush`, `SubList.Match`, `FileStore.AppendAsync`, `MsgBlock.WriteAt`, or any other NATS server method appears in the call tree. Only server startup/shutdown and `InternalEventSystem` event serialization show up.

By contrast, the dotTrace GUI shows thousands of samples across these functions with clear call trees and accurate timing. The `.dtp` file has the data — the extractor is not surfacing it.

### Evidence

```
Snapshot: threads=11, nodes=202
Top NATS inclusive hotspots:
  NatsServer.WaitForShutdown    29741.8ms  (idle wait)
  NatsServer..ctor                 36.5ms  (one-time init)
  InternalEventSystem              26.0ms  (periodic stats)
```

The actual hot path (`ProcessCommandsAsync` → `ProcessMessage` → fan-out/delivery) which runs for ~6 seconds of wall time is completely absent.

---

## Change 1: Increase Hotspot Limit

**Current:** `Take(50)` in `BuildHotspots` for both inclusive and exclusive lists.

**Requested:** Increase to at least **200**, or make it configurable via a CLI flag (e.g., `--top N`). With only 50 hotspots, important functions lower in the ranking are silently dropped.

---

## Change 2: Add `--filter` Flag to Python CLI

Add a `--filter` option that passes a substring filter to the .NET helper, so the JSON output only includes nodes whose name matches the filter. This reduces noise and lets me focus on the relevant code:

```bash
python3 tools/dtp_parse.py snapshots/foo.dtp --filter NATS --out /tmp/result.json
```

The .NET helper should filter the hotspot lists and prune the call tree to only include paths that contain at least one matching node (keeping ancestors and descendants of matching nodes).

---

## Change 3: Add Flat Call-Path Output Mode

The current nested call tree is hard to consume programmatically for hot-path analysis. Add a `--flat` or `--paths` mode that outputs the **top N heaviest call paths** as flat strings with timing:

```json
{
  "hotPaths": [
    {
      "path": "ThreadPool > ProcessCommandsAsync > ProcessMessage > DeliverPullFetchMessagesAsync > SendMessageNoFlush",
      "inclusiveMs": 342.5,
      "leafExclusiveMs": 89.2
    },
    {
      "path": "ThreadPool > ProcessCommandsAsync > ProcessMessage > SubList.Match",
      "inclusiveMs": 156.3,
      "leafExclusiveMs": 156.3
    }
  ]
}
```

This is the most useful output format for LLM-driven analysis — I can immediately see which call chains are expensive without walking the tree.

---

## Change 4: Exclude Idle/Wait Functions from Hotspots

Functions like `WaitHandle.WaitOneNoCheck`, `SemaphoreSlim.WaitCore`, `LowLevelLifoSemaphore.WaitForSignal`, `Monitor.Wait`, `SocketAsyncEngine.EventLoop`, `Thread.PollGC`, and `Interop+Sys.WaitForSocketEvents` dominate the hotspot lists but represent idle waiting, not actual CPU work. Either:

- Add a `--exclude-idle` flag (default on) that strips these from hotspot lists, or
- Always exclude them from the `exclusive` hotspot list (they have zero useful exclusive time) and keep them in `inclusive` only if requested.

---

## Change 5: Investigate Missing Nodes (Critical)

This is the most important issue. **202 nodes from a 30-second sampling profile is far too few.** The dotTrace GUI shows the same snapshot with a full, deep call tree across all ThreadPool workers. Possible causes:

1. **The DFS reader is not reading all sections.** The `callTreeSections.AllHeaders()` call may not be returning headers for all threads or all sampling intervals. Check whether there are multiple call tree section families and the current code only reads one.

2. **Node merging/deduplication is losing data.** If two threads call the same function, they may share a `FunctionUID` but have different `CallTreeSectionOffset` values. Verify that the `nodeMap` dictionary keyed by offset isn't accidentally losing nodes from different threads.

3. **The `totalNodeCount` calculation may be wrong.** The formula `(SectionSize - SectionHeaderSize) / RecordSize()` may not account for all record types or section layouts in sampling snapshots.

4. **Sampling vs tracing data layout differences.** The code may have been tested primarily with tracing snapshots. Sampling snapshots store data differently — verify that the same reader API works for both.

The fix should result in **thousands of nodes** for a typical 30-second sampling snapshot, not 202. If the current dotTrace API approach fundamentally can't extract sampling data at full fidelity, document that limitation and suggest an alternative approach (e.g., using dotTrace's built-in report export if available on macOS, or switching to a different API surface).

---

## Change 6: Add Time Unit to JSON Output

The current `inclusiveTime` / `exclusiveTime` values are in an unspecified unit (nanoseconds based on magnitude). Add a `timeUnit` field to the `snapshot` section:

```json
{
  "snapshot": {
    "path": "...",
    "payloadType": "time",
    "timeUnit": "nanoseconds",
    "threadCount": 11,
    "nodeCount": 1923
  }
}
```

---

## Change 7: Add Summary Statistics

Add a `summary` section to the output with quick-reference stats:

```json
{
  "summary": {
    "wallTimeMs": 30155,
    "activeTimeMs": 6340,
    "totalSamples": 15234,
    "topExclusiveMethod": "SendMessageNoFlush",
    "topExclusiveMs": 89.2
  }
}
```

This lets me immediately assess whether the profile captured meaningful work without parsing the full tree.

---

## Priority Order

1. **Change 5** (missing nodes) — without this, everything else is moot
2. **Change 4** (exclude idle) — makes hotspots immediately useful
3. **Change 1** (increase limit) — more hotspots visible
4. **Change 6** (time unit) — eliminates guesswork
5. **Change 3** (flat paths) — most useful output format for analysis
6. **Change 2** (filter) — nice to have for focused analysis
7. **Change 7** (summary) — nice to have for quick assessment