8069f21240
Reverses the recent Infisical-pointer convention. Each <service>.md holds its credentials inline under the Access section again. The Infisical service itself still runs as a Docker stack on the docker host — it just isn't the source of truth for these docs anymore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
123 lines
6.6 KiB
Markdown
# WW VM — GPU Passthrough
NVIDIA Quadro P1000 PCI passthrough to `WW_DEV_VM` (10.100.0.48). **Executed 2026-04-28.** GPU is live in Windows: `nvidia-smi` reports driver 582.41 / CUDA 13.0, 4 GiB VRAM, status WDDM, no Code 43.

## Final state

| Item | Value |
|---|---|
| Quadro P1000 driver in guest | 582.41 / CUDA 13.0 (already installed; took the device on first boot) |
| Guest PCI bus address | `00000000:23:00.0` |
| Audio function | High Definition Audio Controller present, status OK |
| ESXi `graphicsInfo.graphicsType` | `direct` (was already set before this task) |
| ESXi `pciPassthruInfo` for `87:00.0` / `87:00.1` | `passthruEnabled=true, passthruActive=true` (flipped on without host reboot) |
| VM `nestedHVEnabled` | `false` |
| VM `memoryHotAddEnabled` | `false` |
| VM memory reservation | 32768 MB / 32768 MB (locked) |
| Other VMs touched during the change | None; the host stayed up |
## What `graphicsType: direct` actually means (lesson learned)
`graphicsInfo.graphicsType: direct` and `pciPassthruInfo.passthruEnabled` are **two parallel mechanisms**. Both must be set for direct GPU passthrough:

1. `graphicsType: direct`: the graphics subsystem treats the card as a passthrough device, not vSGA/vGPU. Set in the vSphere UI: Host → Configure → Hardware → Graphics.
2. `pciPassthruInfo.passthruEnabled`: the generic per-PCI-device passthrough flag. Set via `host.esxcli hardware pci pcipassthru set -e=true`. Without it, the device doesn't appear in `device.pci.ls -vm <VM>`, so VMs can't claim it.

The "no host reboot needed" benefit only applies when `graphicsType: direct` is **already** in effect: the runtime activation flag (`-a=true` on the esxcli call) succeeds because the device isn't actively serving as a host graphics device. If `graphicsType` is still `shared` (the default), flipping `pcipassthru` requires a host reboot before the activation takes effect.
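The reboot rule described here can be condensed into a small decision helper. This is a minimal sketch in Python that encodes only the behavior observed on this host; the function name `passthrough_plan` and its return strings are hypothetical and are not part of govc or the vSphere API.

```python
def passthrough_plan(graphics_type: str, enabled: bool, active: bool) -> str:
    """Suggest the next action for one PCI function.

    Hypothetical helper: the inputs mirror graphicsInfo.graphicsType and the
    pciPassthruInfo passthruEnabled / passthruActive fields.
    """
    if enabled and active:
        return "ready"  # the device can already be attached to a VM
    if graphics_type == "direct":
        return "enable-at-runtime"  # pcipassthru set -e=true -a=true, no reboot
    return "set-direct-then-reboot"  # still 'shared': change graphicsType first


# The states this host went through, plus the default 'shared' case.
print(passthrough_plan("direct", False, False))
print(passthrough_plan("direct", True, True))
print(passthrough_plan("shared", False, False))
```

The helper deliberately treats "enabled but not active" the same as "not enabled": in both cases the activation still has to land before a VM can claim the device.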
## Procedure (the one that worked)
### 1. Finish inside-VM teardown (already done before this task)

The WSL2 and VirtualMachinePlatform Windows features were disabled during the Docker→DOCKER migration. The reboot that finalizes the disable also serves as the "shut down before passthrough" step.

```powershell
ssh dohertj2@10.100.0.48 'Get-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform,Microsoft-Windows-Subsystem-Linux | Select-Object FeatureName,State'
# Expect: both Disabled
```
### 2. Shut down the VM (graceful)
```bash
export GOVC_URL=https://10.2.0.12/sdk GOVC_USERNAME=govc GOVC_PASSWORD='Tn9.xKw-m4Vp' GOVC_INSECURE=true
govc vm.power -s=true WW_DEV_VM
until govc vm.info WW_DEV_VM | grep -Eq "Power state:[[:space:]]+poweredOff"; do sleep 5; done
```
### 3. Flip the VM hardware flags (VM must be off)

```bash
govc vm.change -vm WW_DEV_VM -nested-hv-enabled=false
govc vm.change -vm WW_DEV_VM -memory-hot-add-enabled=false

govc vm.info -json=true WW_DEV_VM | python3 -c "import json,sys;v=json.load(sys.stdin)['virtualMachines'][0]['config'];print('nestedHV:',v.get('nestedHVEnabled'));print('memHotAdd:',v.get('memoryHotAddEnabled'))"
# Expect: nestedHV: False, memHotAdd: False
```
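The same verification can be written as a reusable check. The sketch below assumes the JSON layout used by the one-liner above (`virtualMachines[0].config`). The `hardware.memoryMB` and `memoryAllocation.reservation` key names are assumptions based on the vSphere API property names, and `passthrough_blockers` is a hypothetical helper.

```python
def passthrough_blockers(config: dict) -> list:
    """Return the VM settings that still conflict with PCI passthrough."""
    blockers = []
    if config.get("nestedHVEnabled"):
        blockers.append("nestedHVEnabled is true")
    if config.get("memoryHotAddEnabled"):
        blockers.append("memoryHotAddEnabled is true")
    # Assumed key names: configured memory vs. reserved memory, both in MB.
    memory_mb = config.get("hardware", {}).get("memoryMB")
    reserved_mb = config.get("memoryAllocation", {}).get("reservation")
    if memory_mb is not None and reserved_mb != memory_mb:
        blockers.append("memory is not fully reserved")
    return blockers


# Illustrative config matching the "Final state" table in this document.
sample = {
    "nestedHVEnabled": False,
    "memoryHotAddEnabled": False,
    "hardware": {"memoryMB": 32768},
    "memoryAllocation": {"reservation": 32768},
}
print(passthrough_blockers(sample))
```

An empty list means none of the three checks found a conflict; it does not prove the VM will power on with the device attached.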
### 4. Enable `pcipassthru` for both Quadro PCI functions

`graphicsType: direct` was already set, so `-a=true` activates the flag immediately, with no host reboot. (Note: `govc gpu.vm.add` is for **vGPU profiles**, not direct PCI passthrough, and fails on this card with "no vGPU profiles available". Use `device.pci.add` instead.)

```bash
govc host.esxcli hardware pci pcipassthru set -d=0000:87:00.0 -e=true -a=true
govc host.esxcli hardware pci pcipassthru set -d=0000:87:00.1 -e=true -a=true

# Confirm both are active
govc host.info -json=true | python3 -c "
import json,sys
d=json.load(sys.stdin)
for p in d['hostSystems'][0]['config'].get('pciPassthruInfo', []):
    if '87:00' in p.get('id',''): print(p)
"
# Expect: passthruEnabled=True, passthruActive=True for both

# Confirm the Quadro now shows up as available for VMs
govc device.pci.ls -vm WW_DEV_VM | grep -i nvidia
# Expect: 0000:87:00.0 and 0000:87:00.1 listed
```

A harmless quirk: the second `pcipassthru set` command may emit `Device owner is already configured to passthru` if the audio function was previously partially configured. Check the post-state with `pciPassthruInfo`; both functions should report `passthruActive=True`.
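That post-state check can be turned into a pass/fail test instead of reading the printed entries by eye. A minimal sketch, assuming the `pciPassthruInfo` entries carry the `id`, `passthruEnabled`, and `passthruActive` keys used in this step; `passthru_status` is a hypothetical helper and the sample list is illustrative, not captured output.

```python
def passthru_status(pci_info: list, needle: str = "87:00") -> dict:
    """Map each matching PCI id to True when it is both enabled and active."""
    return {
        entry["id"]: bool(entry.get("passthruEnabled") and entry.get("passthruActive"))
        for entry in pci_info
        if needle in entry.get("id", "")
    }


# Illustrative entries: the audio function is enabled but not yet active.
sample = [
    {"id": "0000:87:00.0", "passthruEnabled": True, "passthruActive": True},
    {"id": "0000:87:00.1", "passthruEnabled": True, "passthruActive": False},
    {"id": "0000:00:1f.2", "passthruEnabled": False, "passthruActive": False},
]
status = passthru_status(sample)
print(status)
print("all active:", bool(status) and all(status.values()))
```

The `bool(status) and ...` guard matters because `all()` of an empty mapping is true, which would otherwise hide the case where no `87:00` entry was found at all.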
### 5. Attach the GPU + audio to the VM

```bash
govc device.pci.add -vm WW_DEV_VM 0000:87:00.0
govc device.pci.add -vm WW_DEV_VM 0000:87:00.1

# Verify two VirtualPCIPassthrough devices exist
govc device.info -vm WW_DEV_VM 'pcipassthrough-*'
```
### 6. Power on, verify

```bash
govc vm.power -on=true WW_DEV_VM
until ssh -o ConnectTimeout=3 -o BatchMode=yes dohertj2@10.100.0.48 'hostname' 2>/dev/null; do sleep 5; done

# Confirm the GPU is detected and the driver bound
ssh dohertj2@10.100.0.48 'Get-PnpDevice -Class Display | Where-Object FriendlyName -match "Quadro" | Select-Object FriendlyName,Status'

# Confirm CUDA / driver runtime
ssh dohertj2@10.100.0.48 'nvidia-smi'
```
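For a scripted check, `nvidia-smi` can emit CSV through `--query-gpu=name,driver_version,memory.total --format=csv,noheader`, which is easier to compare than the default table. The sketch below parses one such line. The sample string is illustrative (assembled from values in this document, not captured output), and `parse_gpu_line` is a hypothetical helper.

```python
def parse_gpu_line(line: str) -> dict:
    """Split one 'name, driver_version, memory.total' CSV line into fields."""
    name, driver, memory = [field.strip() for field in line.split(",")]
    return {"name": name, "driver": driver, "memory": memory}


# Illustrative line; the exact memory formatting is an assumption.
sample_line = "Quadro P1000, 582.41, 4096 MiB"
gpu = parse_gpu_line(sample_line)
print(gpu)
print("driver matches:", gpu["driver"] == "582.41")
```

The parser assumes exactly three fields and a GPU name without commas; a line with a different field count raises `ValueError` rather than returning partial data.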
## Notes for future operators

1. **`gpu.vm.add` vs `device.pci.add`**: govc's `gpu.vm.add` is for vGPU profiles (data-center cards like the A40 with NVIDIA vGPU licensing). For workstation Quadro cards in direct passthrough mode, use `device.pci.add`. `gpu.host.profile.ls` returns "no vGPU profiles available" on a host whose only NVIDIA card is a non-vGPU Quadro.
2. **Audio function `87:00.1`** must be attached to the same VM as `87:00.0`: they share an IOMMU group via parent bridge `0000:80:03.0`, and ESXi rejects splitting them.
3. **No host reboot was needed** because `graphicsType: direct` was already in effect from earlier vSphere UI work. If you ever swap GPUs, set `graphicsType: direct` first (vSphere UI: Host → Configure → Hardware → Graphics → Edit → Direct) and reboot the host once; from then on, per-VM attach/detach needs no further host reboot.
4. **Driver was pre-installed**: the previous Windows install already had NVIDIA driver 582.41, so the GPU appeared with status OK on first boot. A fresh Windows install would need the driver from https://www.nvidia.com/Download/index.aspx (Quadro P1000).
5. **Rollback**: shut the VM down → `govc device.pci.remove -vm WW_DEV_VM pcipassthrough-13000 pcipassthrough-13001` → re-enable `nestedHVEnabled` / `memoryHotAddEnabled` → power VM on. The host PCI flags can stay enabled; they do no harm.
## Inventory

| Field | Value |
|---|---|
| Model | NVIDIA Quadro P1000 (GP107GL) |
| GPU PCI ID (host) | `0000:87:00.0` (vendor `0x10de`, device `0x1cb1`) |
| Audio PCI ID (host) | `0000:87:00.1` (vendor `0x10de`, device `0x0fb9`) |
| Subsystem | Dell (`0x1028:0x11bc`) |
| Parent bridge | `0000:80:03.0` |
| VRAM | 4 GiB |
| Driver in guest | 582.41 (Windows 10 WDDM) |