network/ww_gpu.md
dohertj2 8069f21240 Remove Infisical credential pointers; inline credentials in component docs
Reverses the recent Infisical-pointer convention. Each <service>.md
holds its credentials inline under the Access section again. The
Infisical service itself still runs as a Docker stack on the docker
host — it just isn't the source of truth for these docs anymore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:04:34 -04:00


WW VM — GPU Passthrough

NVIDIA Quadro P1000 PCI passthrough to WW_DEV_VM (10.100.0.48). Executed 2026-04-28. GPU is live in Windows: nvidia-smi reports driver 582.41 / CUDA 13.0, 4 GiB VRAM, status WDDM, no Code 43.

Final state

| Item | Value |
| --- | --- |
| Quadro P1000 driver in guest | 582.41 / CUDA 13.0 (already installed; took the device on first boot) |
| Guest PCI bus address | 00000000:23:00.0 |
| Audio function | High Definition Audio Controller present, status OK |
| ESXi graphicsInfo.graphicsType | direct (was already set before this task) |
| ESXi pciPassthruInfo for 87:00.0 / 87:00.1 | passthruEnabled=true, passthruActive=true (flipped on without host reboot) |
| VM nestedHVEnabled | false |
| VM memoryHotAddEnabled | false |
| VM memory reservation | 32768 MB / 32768 MB (locked) |
| Other VMs touched during the change | None; host stayed up |

What graphicsType: direct actually means (lesson learned)

graphicsInfo.graphicsType: direct and pciPassthruInfo.passthruEnabled are two separate settings, and both must be set for direct GPU passthrough:

  1. graphicsType: direct tells the graphics subsystem that the card is a passthrough device, not vSGA/vGPU. Set it in the vSphere UI: Host → Configure → Hardware → Graphics.
  2. pciPassthruInfo.passthruEnabled is the generic per-PCI-device passthrough flag. Set it via govc host.esxcli hardware pci pcipassthru set -d=<address> -e=true. Without it, the device doesn't appear in govc device.pci.ls -vm <VM>, so VMs can't claim it.

The "no host reboot needed" benefit only kicks in when graphicsType: direct is already in effect — the runtime activation flag (-a=true on the esxcli call) succeeds because the device isn't actively serving as a host graphics device. If graphicsType is still shared (default), flipping pcipassthru requires a host reboot for the activation to land.
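The rule above can be written down as a small decision helper. This is a minimal sketch that encodes only what this page states; the function names are hypothetical and nothing here queries ESXi.

```python
def activation_needs_host_reboot(graphics_type: str) -> bool:
    """Return True if enabling pcipassthru is expected to need a host reboot.

    Encodes the rule from this page: runtime activation (-a=true) lands
    without a reboot only when graphicsType is already "direct".
    """
    return graphics_type.strip().lower() != "direct"


def vm_can_claim_device(graphics_type: str,
                        passthru_enabled: bool,
                        passthru_active: bool) -> bool:
    """Both settings must be in place before a VM can claim the device."""
    return (graphics_type.strip().lower() == "direct"
            and passthru_enabled
            and passthru_active)


if __name__ == "__main__":
    # State of this host before step 4: direct graphics, flag not yet set.
    print(activation_needs_host_reboot("direct"))
    print(vm_can_claim_device("direct", False, False))
```

Feed it the values read from graphicsInfo and pciPassthruInfo before deciding whether a maintenance window is needed.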

Procedure (the one that worked)

1. Finish inside-VM teardown — already done before this task

WSL2 + VirtualMachinePlatform Windows features were disabled during the Docker→DOCKER migration. The reboot to finalize that disable also serves as the "shut down before passthrough" step.

# -FeatureName accepts a single name, so filter the full feature list instead
ssh dohertj2@10.100.0.48 'Get-WindowsOptionalFeature -Online | Where-Object FeatureName -in VirtualMachinePlatform,Microsoft-Windows-Subsystem-Linux | Select-Object FeatureName,State'
# Expect: both Disabled

2. Shut down the VM (graceful)

export GOVC_URL=https://10.2.0.12/sdk GOVC_USERNAME=govc GOVC_PASSWORD='Tn9.xKw-m4Vp' GOVC_INSECURE=true
govc vm.power -s=true WW_DEV_VM
until govc vm.info WW_DEV_VM | grep -q "Power state:  poweredOff"; do sleep 5; done
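The until loop above polls forever if the VM never reaches poweredOff. A minimal sketch of the same wait with a timeout, assuming govc is on PATH and the GOVC_* variables are exported; wait_until and vm_is_powered_off are hypothetical helper names, not part of govc.

```python
import subprocess
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float = 300.0,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. sleep and clock are
    injectable so the helper can be exercised without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def vm_is_powered_off(vm_name: str) -> bool:
    """Run `govc vm.info` and look for the poweredOff state in its output."""
    result = subprocess.run(["govc", "vm.info", vm_name],
                            capture_output=True, text=True, check=True)
    return "poweredOff" in result.stdout
```

Usage would look like wait_until(lambda: vm_is_powered_off("WW_DEV_VM"), timeout_s=600); a False return means the guest did not shut down in time and needs a look before continuing.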

3. Flip the VM hardware flags (VM must be off)

govc vm.change -vm WW_DEV_VM -nested-hv-enabled=false
govc vm.change -vm WW_DEV_VM -memory-hot-add-enabled=false

govc vm.info -json=true WW_DEV_VM | python3 -c "import json,sys;v=json.load(sys.stdin)['virtualMachines'][0]['config'];print('nestedHV:',v.get('nestedHVEnabled'));print('memHotAdd:',v.get('memoryHotAddEnabled'))"
# Expect: nestedHV: False, memHotAdd: False
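The one-liner above only prints the two flags. A slightly longer sketch can turn the check into a pass/fail result. The JSON key names are taken from the one-liner; the function name and the sample dictionaries are hypothetical, and the real govc output contains many more fields.

```python
from __future__ import annotations

import json
import sys

# Flags that must be explicitly False before attaching a passthrough device.
REQUIRED_FALSE = ("nestedHVEnabled", "memoryHotAddEnabled")


def vm_flag_problems(vm_info: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the VM is ready.

    A missing key is reported too, since it cannot be confirmed as False.
    """
    config = vm_info["virtualMachines"][0]["config"]
    problems = []
    for key in REQUIRED_FALSE:
        value = config.get(key)
        if value is not False:
            problems.append(f"{key}={value!r}, expected False")
    return problems


if __name__ == "__main__":
    # Intended use: govc vm.info -json=true WW_DEV_VM | python3 check_flags.py
    found = vm_flag_problems(json.load(sys.stdin))
    for line in found:
        print(line)
    sys.exit(1 if found else 0)
```

The non-zero exit code lets a wrapper script stop before step 4 if either flag is still set.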

4. Enable pcipassthru for both Quadro PCI functions

graphicsType: direct was already set, so -a=true activates the flag immediately — no host reboot. (Note: govc gpu.vm.add is for vGPU profiles, not direct PCI passthrough, and fails on this card with "no vGPU profiles available". Use device.pci.add instead.)

govc host.esxcli hardware pci pcipassthru set -d=0000:87:00.0 -e=true -a=true
govc host.esxcli hardware pci pcipassthru set -d=0000:87:00.1 -e=true -a=true

# Confirm both are active
govc host.info -json=true | python3 -c "
import json,sys
d=json.load(sys.stdin)
for p in d['hostSystems'][0]['config'].get('pciPassthruInfo', []):
    if '87:00' in p.get('id',''): print(p)
"
# Expect: passthruEnabled=True, passthruActive=True for both

# Confirm the Quadro now shows up as available for VMs
govc device.pci.ls -vm WW_DEV_VM | grep -i nvidia
# Expect: 0000:87:00.0 and 0000:87:00.1 listed

A harmless quirk: the second pcipassthru set command may emit "Device owner is already configured to passthru" if the audio function was previously partially configured. Check the post-state in pciPassthruInfo; both functions should report passthruActive=True.
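Because the esxcli message can be misleading, the post-state is the thing to check. A minimal sketch that reduces pciPassthruInfo to a yes/no answer for both functions, using the same JSON path as the confirmation snippet above; the function names and the sample data in the usage note are hypothetical.

```python
from __future__ import annotations

EXPECTED_FUNCTIONS = ("0000:87:00.0", "0000:87:00.1")


def passthru_status(host_info: dict, fragment: str = "87:00") -> dict[str, bool]:
    """Map each matching PCI id to True when it is both enabled and active."""
    entries = host_info["hostSystems"][0]["config"].get("pciPassthruInfo", [])
    return {
        entry["id"]: bool(entry.get("passthruEnabled")) and bool(entry.get("passthruActive"))
        for entry in entries
        if fragment in entry.get("id", "")
    }


def both_functions_ready(host_info: dict,
                         expected: tuple[str, ...] = EXPECTED_FUNCTIONS) -> bool:
    """True only if every expected function is present, enabled and active."""
    status = passthru_status(host_info)
    return all(status.get(address, False) for address in expected)
```

With a trimmed sample such as {"hostSystems": [{"config": {"pciPassthruInfo": [{"id": "0000:87:00.0", "passthruEnabled": True, "passthruActive": True}]}}]}, both_functions_ready returns False because the audio function 0000:87:00.1 is missing from the list.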

5. Attach the GPU + audio to the VM

govc device.pci.add -vm WW_DEV_VM 0000:87:00.0
govc device.pci.add -vm WW_DEV_VM 0000:87:00.1

# Verify two VirtualPCIPassthrough devices exist
govc device.info -vm WW_DEV_VM 'pcipassthrough-*'

6. Power on, verify

govc vm.power -on=true WW_DEV_VM
until ssh -o ConnectTimeout=3 -o BatchMode=yes dohertj2@10.100.0.48 'hostname' 2>/dev/null; do sleep 5; done

# Confirm the GPU is detected and the driver bound
ssh dohertj2@10.100.0.48 'Get-PnpDevice -Class Display | Where-Object FriendlyName -match "Quadro" | Select-Object FriendlyName,Status'

# Confirm CUDA / driver runtime
ssh dohertj2@10.100.0.48 'nvidia-smi'
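For a scripted check, nvidia-smi can print machine-readable fields with --query-gpu=name,driver_version,memory.total --format=csv,noheader. The sketch below parses one such line. The sample line in the usage note is an assumption built from the values on this page, not captured output, and the function names are hypothetical.

```python
def parse_gpu_line(line: str) -> dict:
    """Parse one 'name, driver_version, memory.total' CSV line from nvidia-smi."""
    name, driver, memory = [field.strip() for field in line.split(",")]
    # memory.total is reported with a unit suffix, e.g. "4096 MiB".
    return {"name": name, "driver": driver, "memory_mib": int(memory.split()[0])}


def gpu_matches(line: str, model: str = "P1000", driver: str = "582.41") -> bool:
    """True if the line describes the expected model and driver version."""
    info = parse_gpu_line(line)
    return model in info["name"] and info["driver"] == driver
```

For example, gpu_matches("Quadro P1000, 582.41, 4096 MiB") evaluates to True, while a line with a different driver version evaluates to False.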

Notes for future operators

  1. gpu.vm.add vs device.pci.add: govc's gpu.vm.add is for vGPU profiles (data-center cards like the A40 with NVIDIA vGPU licensing). For workstation Quadro cards in direct passthrough mode, use device.pci.add. gpu.host.profile.ls returns "no vGPU profiles available" on a host whose only NVIDIA card is a non-vGPU Quadro.
  2. Audio function 87:00.1 must be attached to the same VM as 87:00.0 — they share an IOMMU group via parent bridge 0000:80:03.0 and ESXi rejects splitting them.
  3. No host reboot was needed because graphicsType: direct was already in effect from earlier vSphere UI work. If you ever swap GPUs, set graphicsType: direct first (vSphere UI: Host → Configure → Hardware → Graphics → Edit → Direct) and reboot the host once; from then on, per-VM attach/detach is a runtime operation.
  4. Driver was pre-installed: the previous Windows install already had NVIDIA driver 582.41, so the GPU appeared with status OK on first boot. A fresh Windows install would need the driver from https://www.nvidia.com/Download/index.aspx (Quadro P1000).
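  5. Rollback: govc device.pci.remove -vm WW_DEV_VM 0000:87:00.0 0000:87:00.1 (device.pci.remove takes PCI addresses; to remove by device name instead, use govc device.remove -vm WW_DEV_VM pcipassthrough-13000 pcipassthrough-13001) → re-enable nestedHVEnabled / memoryHotAddEnabled → power VM on. Host PCI flags can stay enabled; they don't hurt.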
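Note 2 can be turned into a pre-flight check before running device.pci.add. The sketch below parses domain:bus:device.function addresses and reports any host function that shares a slot with an attached function but is not attached itself. Sharing a slot is used here as a simple stand-in for "same IOMMU group"; that holds for this card but is an assumption in general. All names are hypothetical.

```python
from __future__ import annotations

import re

# domain:bus:device.function, e.g. 0000:87:00.1
BDF = re.compile(r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$")


def slot_of(address: str) -> str:
    """Return the domain:bus:device part of a PCI address, lower-cased."""
    match = BDF.match(address)
    if match is None:
        raise ValueError(f"not a PCI address: {address!r}")
    return f"{match[1]}:{match[2]}:{match[3]}".lower()


def missing_functions(host_functions: list[str], vm_functions: list[str]) -> list[str]:
    """Host functions in the same slot as an attached one, but not attached."""
    attached = {address.lower() for address in vm_functions}
    attached_slots = {slot_of(address) for address in vm_functions}
    return sorted(
        address for address in host_functions
        if slot_of(address) in attached_slots and address.lower() not in attached
    )
```

Calling missing_functions(["0000:87:00.0", "0000:87:00.1"], ["0000:87:00.0"]) returns ["0000:87:00.1"], which is the audio function that still has to be attached.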

Inventory

| Field | Value |
| --- | --- |
| Model | NVIDIA Quadro P1000 (GP107GL) |
| GPU PCI ID (host) | 0000:87:00.0 (vendor 0x10de, device 0x1cb1) |
| Audio PCI ID (host) | 0000:87:00.1 (vendor 0x10de, device 0x0fb9) |
| Subsystem | Dell (0x1028:0x11bc) |
| Parent bridge | 0000:80:03.0 |
| VRAM | 4 GiB |
| Driver in guest | 582.41 (Windows 10 WDDM) |