dohertj2 8069f21240 Remove Infisical credential pointers; inline credentials in component docs
Reverses the recent Infisical-pointer convention. Each <service>.md
holds its credentials inline under the Access section again. The
Infisical service itself still runs as a Docker stack on the docker
host — it just isn't the source of truth for these docs anymore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:04:34 -04:00

# WW VM — GPU Passthrough
NVIDIA Quadro P1000 PCI passthrough to `WW_DEV_VM` (10.100.0.48). **Executed 2026-04-28.** The GPU is live in Windows: `nvidia-smi` reports driver 582.41 / CUDA 13.0, 4 GiB VRAM, and driver model WDDM, with no Code 43 error on the device.
## Final state
| Item | Value |
|---|---|
| Quadro P1000 driver in guest | 582.41 / CUDA 13.0 (already installed; bound to the device on first boot) |
| Guest PCI bus address | `00000000:23:00.0` |
| Audio function | High Definition Audio Controller present, status OK |
| ESXi `graphicsInfo.graphicsType` | `direct` (was already set before this task) |
| ESXi `pciPassthruInfo` for `87:00.0` / `87:00.1` | `passthruEnabled=true, passthruActive=true` (flipped on without host reboot) |
| VM `nestedHVEnabled` | `false` |
| VM `memoryHotAddEnabled` | `false` |
| VM memory reservation | 32768 MB / 32768 MB (locked) |
| Other VMs touched during the change | None — host stayed up |
## What `graphicsType: direct` actually means (lesson learned)
`graphicsInfo.graphicsType: direct` and `pciPassthruInfo.passthruEnabled` are **two parallel mechanisms**. Both must be set for direct GPU passthrough:
1. `graphicsType: direct` — graphics subsystem says "this card is a passthrough device, not vSGA/vGPU". Set in vSphere UI: Host → Configure → Hardware → Graphics.
2. `pciPassthruInfo.passthruEnabled` — generic per-PCI-device passthrough flag. Set via `host.esxcli hardware pci pcipassthru set -e=true`. Without this, the device doesn't appear in `device.pci.ls -vm <VM>`, so VMs can't claim it.
The "no host reboot needed" benefit only kicks in when `graphicsType: direct` is **already** in effect — the runtime activation flag (`-a=true` on the esxcli call) succeeds because the device isn't actively serving as a host graphics device. If `graphicsType` is still `shared` (default), flipping `pcipassthru` requires a host reboot for the activation to land.
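The reboot rule above can be written down as a small decision helper. This is a minimal sketch that encodes only the behaviour described in this section; the function name and its arguments are hypothetical, and the rule itself comes from this one host's experience rather than from a VMware specification.

```python
def passthru_needs_host_reboot(graphics_type: str, passthru_active: bool) -> bool:
    """Return True if enabling passthrough is expected to need a host reboot.

    graphics_type   -- value of graphicsInfo.graphicsType ("direct" or "shared")
    passthru_active -- current pciPassthruInfo.passthruActive for the device
    """
    if passthru_active:
        # Nothing left to activate, so no reboot is involved.
        return False
    # Runtime activation (-a=true) only landed here because the card was
    # already in "direct" mode; any other mode is treated as reboot-required.
    return graphics_type != "direct"


if __name__ == "__main__":
    for gtype, active in (("direct", False), ("shared", False), ("shared", True)):
        print(gtype, active, passthru_needs_host_reboot(gtype, active))
```

Feeding it the values read from `govc host.info -json=true` before step 4 gives an early warning that a maintenance window may be needed.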
## Procedure (the one that worked)
### 1. Finish inside-VM teardown — already done before this task
WSL2 + VirtualMachinePlatform Windows features were disabled during the Docker→DOCKER migration. The reboot to finalize that disable also serves as the "shut down before passthrough" step.
```powershell
# -FeatureName accepts a single name, so filter the full feature list instead
ssh dohertj2@10.100.0.48 'Get-WindowsOptionalFeature -Online | Where-Object FeatureName -in VirtualMachinePlatform,Microsoft-Windows-Subsystem-Linux | Select-Object FeatureName,State'
# Expect: both Disabled
```
### 2. Shut down the VM (graceful)
```bash
export GOVC_URL=https://10.2.0.12/sdk GOVC_USERNAME=govc GOVC_PASSWORD='Tn9.xKw-m4Vp' GOVC_INSECURE=true
govc vm.power -s=true WW_DEV_VM
# Match any amount of whitespace after the label; govc pads its output columns
until govc vm.info WW_DEV_VM | grep -Eq 'Power state:[[:space:]]+poweredOff'; do sleep 5; done
```
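The shell loop above waits forever if the guest never shuts down. A bounded wait could look like the following sketch. `wait_until` and `is_powered_off` are hypothetical helpers, and the JSON path `virtualMachines[0].runtime.powerState` is an assumption based on the lowercase key style that the other snippets in this document read from `govc vm.info -json=true`.

```python
import json
import time


def wait_until(predicate, timeout_s=300.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. sleep and clock are
    injectable so the loop can be exercised without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def is_powered_off(vm_info_json: str) -> bool:
    """Check the power state in `govc vm.info -json=true` output (assumed layout)."""
    vms = json.loads(vm_info_json).get("virtualMachines", [])
    return bool(vms) and vms[0].get("runtime", {}).get("powerState") == "poweredOff"


if __name__ == "__main__":
    # Illustrative payload, not captured output.
    sample = '{"virtualMachines": [{"runtime": {"powerState": "poweredOff"}}]}'
    print(wait_until(lambda: is_powered_off(sample), timeout_s=1, interval_s=0))
```

In practice the predicate would run `govc vm.info -json=true WW_DEV_VM` through `subprocess.run` and pass its stdout to `is_powered_off`.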
### 3. Flip the VM hardware flags (VM must be off)
```bash
govc vm.change -vm WW_DEV_VM -nested-hv-enabled=false
govc vm.change -vm WW_DEV_VM -memory-hot-add-enabled=false
govc vm.info -json=true WW_DEV_VM | python3 -c "import json,sys;v=json.load(sys.stdin)['virtualMachines'][0]['config'];print('nestedHV:',v.get('nestedHVEnabled'));print('memHotAdd:',v.get('memoryHotAddEnabled'))"
# Expect: nestedHV: False, memHotAdd: False
```
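The three VM settings listed under Final state (nested HV off, memory hot-add off, full memory reservation) can be checked in one pass before attaching the device. The sketch below is a hypothetical pre-flight helper; the keys `hardware.memoryMB` and `memoryAllocation.reservation` follow the vSphere `VirtualMachineConfigInfo` property names and are assumed to appear in that form in the `config` object of `govc vm.info -json=true`.

```python
def passthrough_blockers(config: dict) -> list:
    """Return human-readable reasons why a VM config is not passthrough-ready."""
    blockers = []
    if config.get("nestedHVEnabled"):
        blockers.append("nestedHVEnabled is true")
    if config.get("memoryHotAddEnabled"):
        blockers.append("memoryHotAddEnabled is true")
    size_mb = config.get("hardware", {}).get("memoryMB")
    reserved_mb = config.get("memoryAllocation", {}).get("reservation")
    # Passthrough devices need the VM's memory fully reserved.
    if size_mb is None or reserved_mb is None or reserved_mb < size_mb:
        blockers.append(
            f"memory reservation {reserved_mb} MB is below VM size {size_mb} MB"
        )
    return blockers


# Illustrative config mirroring the Final state table, not captured output.
sample_config = {
    "nestedHVEnabled": False,
    "memoryHotAddEnabled": False,
    "hardware": {"memoryMB": 32768},
    "memoryAllocation": {"reservation": 32768},
}

if __name__ == "__main__":
    print(passthrough_blockers(sample_config) or "ready for passthrough")
```

An empty list means the VM matches the state recorded above; anything else names the setting to fix while the VM is still powered off.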
### 4. Enable `pcipassthru` for both Quadro PCI functions
`graphicsType: direct` was already set, so `-a=true` activates the flag immediately — no host reboot. (Note: `govc gpu.vm.add` is for **vGPU profiles**, not direct PCI passthrough, and fails on this card with "no vGPU profiles available". Use `device.pci.add` instead.)
```bash
govc host.esxcli hardware pci pcipassthru set -d=0000:87:00.0 -e=true -a=true
govc host.esxcli hardware pci pcipassthru set -d=0000:87:00.1 -e=true -a=true
# Confirm both are active
govc host.info -json=true | python3 -c "
import json,sys
d=json.load(sys.stdin)
# Print only the passthrough entries for the Quadro's two PCI functions
for p in d['hostSystems'][0]['config'].get('pciPassthruInfo', []):
    if '87:00' in p.get('id',''): print(p)
"
# Expect: passthruEnabled=True, passthruActive=True for both
# Confirm the Quadro now shows up as available for VMs
govc device.pci.ls -vm WW_DEV_VM | grep -i nvidia
# Expect: 0000:87:00.0 and 0000:87:00.1 listed
```
A harmless quirk: the second `pcipassthru set` command may emit `Device owner is already configured to passthru` if the audio function was previously partially configured. Check the post-state with `pciPassthruInfo` — both should be `passthruActive=True`.
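Because the second `pcipassthru set` call can print a warning and still succeed, the post-state is the thing to trust. The following sketch turns that check into an explicit pass/fail. The helper names are hypothetical, the entry fields (`id`, `passthruEnabled`, `passthruActive`) are the ones this document already reads from `pciPassthruInfo`, and the sample list is illustrative rather than captured output.

```python
def passthru_status(pci_passthru_info, id_fragment):
    """Map device id -> (enabled, active) for entries whose id contains id_fragment."""
    return {
        entry["id"]: (bool(entry.get("passthruEnabled")),
                      bool(entry.get("passthruActive")))
        for entry in pci_passthru_info
        if id_fragment in entry.get("id", "")
    }


def all_active(status, expected_ids):
    """True only if every expected device is both enabled and active."""
    return all(status.get(dev_id) == (True, True) for dev_id in expected_ids)


# Illustrative entries shaped like config.pciPassthruInfo.
sample_info = [
    {"id": "0000:00:1f.2", "passthruEnabled": False, "passthruActive": False},
    {"id": "0000:87:00.0", "passthruEnabled": True, "passthruActive": True},
    {"id": "0000:87:00.1", "passthruEnabled": True, "passthruActive": True},
]

if __name__ == "__main__":
    status = passthru_status(sample_info, "87:00")
    print(all_active(status, ["0000:87:00.0", "0000:87:00.1"]))
```

Checking against an explicit list of expected ids also catches the case where one function is missing from the host's list entirely, which a plain `grep` would not flag.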
### 5. Attach the GPU + audio to the VM
```bash
govc device.pci.add -vm WW_DEV_VM 0000:87:00.0
govc device.pci.add -vm WW_DEV_VM 0000:87:00.1
# Verify two VirtualPCIPassthrough devices exist
govc device.info -vm WW_DEV_VM 'pcipassthrough-*'
```
### 6. Power on, verify
```bash
govc vm.power -on=true WW_DEV_VM
until ssh -o ConnectTimeout=3 -o BatchMode=yes dohertj2@10.100.0.48 'hostname' 2>/dev/null; do sleep 5; done
# Confirm the GPU is detected and the driver bound
ssh dohertj2@10.100.0.48 'Get-PnpDevice -Class Display | Where-Object FriendlyName -match "Quadro" | Select-Object FriendlyName,Status'
# Confirm CUDA / driver runtime
ssh dohertj2@10.100.0.48 'nvidia-smi'
```
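For a scripted verification, `nvidia-smi` can emit machine-readable output with `--query-gpu=name,driver_version,memory.total,pci.bus_id --format=csv,noheader`. The sketch below parses that form. The `parse_nvidia_smi_csv` helper is hypothetical, and the sample line is assembled from values recorded in this document rather than copied from a real run.

```python
import csv
import io

# Field order must match the --query-gpu list passed to nvidia-smi.
FIELDS = ("name", "driver_version", "memory_total", "pci_bus_id")


def parse_nvidia_smi_csv(text: str) -> list:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]


# Illustrative line only; the exact memory figure on the real card may differ.
SAMPLE = "Quadro P1000, 582.41, 4096 MiB, 00000000:23:00.0\n"

if __name__ == "__main__":
    for gpu in parse_nvidia_smi_csv(SAMPLE):
        print(gpu["name"], gpu["driver_version"], gpu["pci_bus_id"])
```

Comparing `driver_version` and `pci_bus_id` against the Final state table gives a quick regression check after driver updates or host maintenance.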
## Notes for future operators
1. **`gpu.vm.add` vs `device.pci.add`**: govc's `gpu.vm.add` is for vGPU profiles (data-center cards like A40 with NVIDIA vGPU licensing). For workstation Quadro cards without vGPU support, in direct passthrough mode, use `device.pci.add`. `gpu.host.profile.ls` returns "no vGPU profiles available" on a host whose only NVIDIA card is a non-vGPU Quadro.
2. **Audio function `87:00.1`** must be attached to the same VM as `87:00.0` — they share an IOMMU group via parent bridge `0000:80:03.0` and ESXi rejects splitting them.
3. **No host reboot was needed** because `graphicsType: direct` was already in effect from earlier vSphere UI work. If you ever swap GPUs, set `graphicsType: direct` first (vSphere UI: Host → Configure → Hardware → Graphics → Edit → Direct) and reboot the host once; from then on, per-VM attach/detach is a runtime operation.
4. **Driver was pre-installed**: the previous Windows install already had NVIDIA driver 582.41, so the GPU appeared with status OK on first boot. A fresh Windows install would need the driver from https://www.nvidia.com/Download/index.aspx (Quadro P1000).
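5. **Rollback**: with the VM powered off, `govc device.pci.remove -vm WW_DEV_VM 0000:87:00.0 0000:87:00.1` (`device.pci.remove` takes PCI addresses; to remove by device name instead, use `govc device.remove -vm WW_DEV_VM pcipassthrough-13000 pcipassthrough-13001`) → re-enable `nestedHVEnabled` / `memoryHotAddEnabled` → power VM on. Host PCI flags can stay enabled; they don't hurt.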
## Inventory
| Field | Value |
|---|---|
| Model | NVIDIA Quadro P1000 (GP107GL) |
| GPU PCI ID (host) | `0000:87:00.0` (vendor `0x10de`, device `0x1cb1`) |
| Audio PCI ID (host) | `0000:87:00.1` (vendor `0x10de`, device `0x0fb9`) |
| Subsystem | Dell (`0x1028:0x11bc`) |
| Parent bridge | `0000:80:03.0` |
| VRAM | 4 GiB |
| Driver in guest | 582.41 (Windows 10 WDDM) |
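The addresses in this document appear in two spellings: the host form `0000:87:00.0` and the 8-digit-domain form `00000000:23:00.0` that `nvidia-smi` prints in the guest. A small parser makes them comparable and can confirm that two functions belong to the same physical card. This is a sketch; `PciAddress`, `parse_pci_address` and `same_physical_device` are hypothetical names.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


# Optional 4 to 8 hex digit domain, then bus:device.function.
_PCI_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4,8}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_pci_address(text: str) -> PciAddress:
    """Parse 'DDDD:BB:DD.F' (domain optional) into numeric components."""
    match = _PCI_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {text!r}")
    domain, bus, device, function = match.groups()
    return PciAddress(int(domain or "0", 16), int(bus, 16),
                      int(device, 16), int(function))


def same_physical_device(a: str, b: str) -> bool:
    """True if both addresses share domain, bus and device (functions may differ)."""
    return parse_pci_address(a)[:3] == parse_pci_address(b)[:3]


if __name__ == "__main__":
    print(parse_pci_address("0000:87:00.0"))
    print(same_physical_device("0000:87:00.0", "0000:87:00.1"))
```

Note that the guest address (`23:00.0`) and the host address (`87:00.0`) are expected to differ, since the hypervisor assigns the guest-side slot; the parser only normalises the notation.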