[BUG] Creating a virtual machine while the host is handling a probe-isolated-devices request fails with an error and times out #21610

Open
yulongz opened this issue Nov 15, 2024 · 2 comments

yulongz commented Nov 15, 2024

What happened:
Two virtual machines failed to be created. The host service on the corresponding host logged the following:
[info 2024-11-14 07:19:34 isolated_device.getPassthroughGPUS(gpu.go:75)] filter address []
[info 2024-11-14 07:19:35 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:321)] PCI address 03:00.0 is boot_vga: /sys/devices/pci0000:00/0000:00:1c.2/0000:02:00.0/0000:03:00.0/boot_vga
[info 2024-11-14 07:19:35 isolated_device.getPassthroughGPUS(gpu.go:98)] skip boot vga device 03:00.0
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 4f:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 4f:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 52:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 52:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 workmanager.(*workerTask).Run(manager.go:95)] DelayTask complete: {"telegraf_deployed":false}
[info 2024-11-14 07:19:37 modules.TaskComplete(task.go:34)] Sync task a8fb0d5c-f5a6-415f-84da-585d48be5f7f complete succ
[info 2024-11-14 07:19:37 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7_-U9zE9ymrQmA= 200 882365-bbf8c2 POST /servers/cb6eb842-c430-409f-8e7a-6ccd0914b192/start (10.x.x.x:52693:compute_v2) 6.17ms
[error 2024-11-14 07:19:37 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
goroutine 57750 [running]:
runtime/debug.Stack()
/usr/lib/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
/usr/lib/go/src/runtime/debug/stack.go:16 +0x19
yunion.io/x/onecloud/pkg/appsrv.execCallback.func1()
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:246 +0xdd
panic({0x29a2920, 0x54c7ac0})
/usr/lib/go/src/runtime/panic.go:838 +0x207
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).hasGPU(0xc000f54380)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:870 +0x9f
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).HideKVM(0xc000f54380)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:153 +0xfd
yunion.io/x/onecloud/pkg/hostman/guestman/arch.(*X86).GenerateCpuDesc(0xc000f54380?, 0x10, 0xf0, {0x3565cf8, 0xc000f54380})
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/arch/x86.go:131 +0x52
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initCpuDesc(0xc000f54380, 0x0)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:886 +0x7a
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initGuestDesc(0xc000f54380)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/pci.go:53 +0x25
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).updateGuestDesc(0xc000f54380)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:159 +0x1f3
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).asyncScriptStart(0xc000f54380, {0x355a300, 0xc00184e660}, {0x2e6a6a0?, 0xc00228aa60})
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:816 +0x1a5
yunion.io/x/onecloud/pkg/hostman/guestman.(*guestStartTask).Run(0xc00228afa0)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:2032 +0x3b
yunion.io/x/onecloud/pkg/appsrv.execCallback(0xc001286f50?)
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:249 +0x58
yunion.io/x/onecloud/pkg/appsrv.(*SWorker).run(0xc0028089f0)
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:92 +0x70
created by yunion.io/x/onecloud/pkg/appsrv.(*SWorkerManager).scheduleWithLock
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:268 +0x165
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 56:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 56:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 57:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 57:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] ce:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] ce:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 workmanager.(*workerTask).Run(manager.go:95)] DelayTask complete: {"telegraf_deployed":false}
[info 2024-11-14 07:19:38 modules.TaskComplete(task.go:34)] Sync task 6dc8a284-d3e4-4882-894a-54d72d4c8be3 complete succ
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d1:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d1:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:39 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7_-U9zE9ymrQmA= 200 5b5cde-d3e11d POST /servers/c3d8d60a-c311-47e8-8c00-4e84707893aa/start (10.x.x.x:26790:compute_v2) 4.28ms
[error 2024-11-14 07:19:39 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
goroutine 57857 [running]:
runtime/debug.Stack()
/usr/lib/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
/usr/lib/go/src/runtime/debug/stack.go:16 +0x19
yunion.io/x/onecloud/pkg/appsrv.execCallback.func1()
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:246 +0xdd
panic({0x29a2920, 0x54c7ac0})
/usr/lib/go/src/runtime/panic.go:838 +0x207
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).hasGPU(0xc000eba460)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:870 +0x9f
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).HideKVM(0xc000eba460)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:153 +0xfd
yunion.io/x/onecloud/pkg/hostman/guestman/arch.(*X86).GenerateCpuDesc(0xc000eba460?, 0x10, 0xf0, {0x3565cf8, 0xc000eba460})
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/arch/x86.go:131 +0x52
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initCpuDesc(0xc000eba460, 0x0)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:886 +0x7a
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initGuestDesc(0xc000eba460)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/pci.go:53 +0x25
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).updateGuestDesc(0xc000eba460)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:159 +0x1f3
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).asyncScriptStart(0xc000eba460, {0x355a300, 0xc00234ad20}, {0x2e6a6a0?, 0xc0003f7e80})
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:816 +0x1a5
yunion.io/x/onecloud/pkg/hostman/guestman.(*guestStartTask).Run(0xc0017603e0)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:2032 +0x3b
yunion.io/x/onecloud/pkg/appsrv.execCallback(0xc002118780?)
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:249 +0x58
yunion.io/x/onecloud/pkg/appsrv.(*SWorker).run(0xc000ba79b0)
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:92 +0x70
created by yunion.io/x/onecloud/pkg/appsrv.(*SWorkerManager).scheduleWithLock
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:268 +0x165
[info 2024-11-14 07:19:39 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d5:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:39 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d5:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:40 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d6:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:40 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d6:00.0 already use vfio-pci driver
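
Both stack traces above end in `hasGPU`, which panics with a nil pointer dereference. The actual code at `qemu-kvmhelper.go:870` is not shown in this issue, so the following is only a minimal sketch of how a lookup of this kind can be guarded. `Device`, `DeviceManager`, and `GetByAddr` are hypothetical names invented for illustration, not the onecloud API.

```go
package main

import "fmt"

// Device is a hypothetical stand-in for a probed passthrough device.
type Device struct {
	Addr string
	Type string
}

// DeviceManager is a hypothetical registry keyed by PCI address.
type DeviceManager struct {
	byAddr map[string]*Device
}

// GetByAddr returns nil when the address is not registered, for example
// while the device list is being rebuilt, or when the manager itself is nil.
func (m *DeviceManager) GetByAddr(addr string) *Device {
	if m == nil {
		return nil
	}
	return m.byAddr[addr]
}

// hasGPU reports whether any of the guest's addresses resolves to a GPU.
// Without the nil check, dev.Type on a missing entry would panic with
// "invalid memory address or nil pointer dereference".
func hasGPU(m *DeviceManager, guestAddrs []string) bool {
	for _, addr := range guestAddrs {
		dev := m.GetByAddr(addr)
		if dev == nil {
			continue // device not known right now; do not dereference
		}
		if dev.Type == "GPU" {
			return true
		}
	}
	return false
}

func main() {
	full := &DeviceManager{byAddr: map[string]*Device{
		"4f:00.0": {Addr: "4f:00.0", Type: "GPU"},
	}}
	empty := &DeviceManager{byAddr: map[string]*Device{}}

	fmt.Println(hasGPU(full, []string{"4f:00.0"}))  // true
	fmt.Println(hasGPU(empty, []string{"4f:00.0"})) // false
	fmt.Println(hasGPU(nil, []string{"4f:00.0"}))   // false
}
```

A guard like this only avoids the panic. Skipping an unknown device silently could make a GPU guest look like a non-GPU guest, so returning an error and retrying the start may be the safer choice in practice.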

Environment:

  • OS (e.g. cat /etc/os-release): ubuntu2204
  • Kernel (e.g. uname -a): Linux cloud-node-0133 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Host: (e.g. dmidecode | egrep -i 'manufacturer|product' |sort -u)
    idProduct: 0x03ee
    Manufacturer: Intel(R) Corporation
    Manufacturer: NO DIMM
    Manufacturer: Samsung
    Manufacturer: Supermicro
    Manufacturer: SUPERMICRO
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Manufacturer ID: Unknown
    Module Product ID: Unknown
    Product Name: SYS-420GP-TNR
    Product Name: X12DPG-OA6
  • Service Version (e.g. kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list):
    3.10.15
yulongz added the bug label Nov 15, 2024

yulongz commented Nov 15, 2024

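Additional information: while the errors were occurring, someone had probably clicked Host > Passthrough Devices, which made the region service send a probe-isolated-devices request to the host. The request ran for roughly 7 to 8 seconds (the access log below records 7446.48ms). Two virtual machines happened to need creating within that window, and both creations timed out.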

[info 2024-11-14 07:19:42 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7_-U9zE9ymrQmA= 200 6fb8db-54d8d5-3f07c8 POST /hosts/0a70d90d-f1d5-4dc5-8aaa-0306d88936f9/probe-isolated-devices (10.x.x.x:62394:compute_v2) 7446.48ms

yulongz changed the title from "[BUG] hasGPU panic runtime error: invalid memory address or nil pointer dereference" to "[BUG] Creating a virtual machine while the host is handling a probe-isolated-devices request fails with an error and times out" Nov 15, 2024
wanyaoqi self-assigned this Nov 19, 2024
wanyaoqi (Member) commented

@yulongz These two operations do conflict with each other. We will fix this.
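
For readers wondering what resolving such a conflict can look like: the fix the maintainers chose is not described in this thread. One common pattern, sketched below under that caveat, is to run the slow probe without holding a lock and then swap the finished device table in under a write lock, so that a concurrent VM start never reads a half-built table. `registry`, `Reprobe`, and `TypeOf` are hypothetical names, not onecloud code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// registry is a hypothetical device table shared by probes and VM starts.
type registry struct {
	mu      sync.RWMutex
	devices map[string]string // PCI address -> device type
}

// Reprobe builds the new table off to the side and swaps it in under the
// write lock. The slow probe itself runs without holding the lock.
func (r *registry) Reprobe(probe func() map[string]string) {
	fresh := probe()
	r.mu.Lock()
	r.devices = fresh
	r.mu.Unlock()
}

// TypeOf looks up a device under the read lock, so it always sees either
// the old complete table or the new complete table.
func (r *registry) TypeOf(addr string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	t, ok := r.devices[addr]
	return t, ok
}

func main() {
	r := &registry{devices: map[string]string{"4f:00.0": "GPU"}}
	probe := func() map[string]string {
		return map[string]string{"4f:00.0": "GPU"}
	}

	var wg sync.WaitGroup
	var missing int32
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { // simulated probe-isolated-devices request
			defer wg.Done()
			r.Reprobe(probe)
		}()
		go func() { // simulated VM start that needs the device table
			defer wg.Done()
			if _, ok := r.TypeOf("4f:00.0"); !ok {
				atomic.AddInt32(&missing, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("lookups that missed the device:", atomic.LoadInt32(&missing))
}
```

Because the table is replaced as a whole, every lookup in this sketch finds the device, and the program reports zero misses.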
