M.2 TPU device violates PCI specification #48

Open
lamw opened this issue Aug 18, 2023 · 23 comments
Labels
Hardware:M.2 Accelerator A+E, subtype:ubuntu/linux, type:build/install

Comments

@lamw

lamw commented Aug 18, 2023

Description

Customers who attempt to pass the M.2 TPU through to a virtual machine using the VMware ESXi hypervisor have found that the Apex driver fails to initialize.

# dmesg
<snip>
[    3.780139] apex 0000:02:03.0: enabling device (0000 -> 0002)
[    3.785860] apex 0000:02:03.0: Page table init timed out
[    3.786103] apex 0000:02:03.0: MSI-X table init timed out

After an initial investigation, VMware Engineering concluded the following:

Unfortunately, the device in question violates the PCI specification by mapping the PBA, the MSI-X vector table (VT), and other registers into the same 4 KB page (the PBA is at 0x46068 and the VT at 0x46800, but there are a number of other registers in the 0x46XXX range). The PCIe 6.0 specification, page 1020, has this to say:

<quote>
If a Base Address Register or entry in the Enhanced Allocation capability that maps address space for the MSI-X Table or
MSI-X PBA also maps other usable address space that is not associated with MSI-X structures, locations (e.g., for CSRs)
used in the other address space must not share any naturally aligned 4-KB address range with one where either MSI-X
structure resides. This allows system software where applicable to use different processor attributes for MSI-X structures
and the other address space. (Some processor architectures do not support having different processor attributes
associated with the same naturally aligned 4-KB physical address range.) The MSI-X Table and MSI-X PBA are permitted
to co-reside within a naturally aligned 4-KB address range, though they must not overlap with each other.
</quote>

So having CSRs in the same page as the MSI-X VT violates the spec, and under ESXi the CSRs become unreachable (writes are ignored, reads return zeroes). As a result, the device driver cannot correctly initialize the device.
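
The page-sharing condition described above can be checked with a little arithmetic. The following is a minimal sketch, not taken from the driver or from ESXi: the PBA and VT offsets are the ones quoted in this report, while `EXAMPLE_CSR` is a hypothetical offset standing in for the "other registers in the 0x46XXX range", whose exact addresses the report does not list.

```python
PAGE_SIZE = 4096  # the "naturally aligned 4-KB address range" from the spec text


def page_base(offset: int, page_size: int = PAGE_SIZE) -> int:
    """Return the start of the naturally aligned page that contains offset."""
    return offset & ~(page_size - 1)


def shares_page(a: int, b: int, page_size: int = PAGE_SIZE) -> bool:
    """True if both offsets fall inside the same naturally aligned page."""
    return page_base(a, page_size) == page_base(b, page_size)


PBA_OFFSET = 0x46068   # MSI-X PBA offset quoted in the report
VT_OFFSET = 0x46800    # MSI-X vector table offset quoted in the report
EXAMPLE_CSR = 0x46018  # hypothetical CSR offset, for illustration only

if __name__ == "__main__":
    # PBA and VT sharing a page is permitted by the spec text.
    print("PBA/VT share a page:", shares_page(PBA_OFFSET, VT_OFFSET))
    # A CSR in the same page as the VT is the reported violation.
    print("VT/CSR share a page:", shares_page(VT_OFFSET, EXAMPLE_CSR))
    # A register one page higher would not conflict.
    print("VT/0x47000 share a page:", shares_page(VT_OFFSET, 0x47000))
```

Any register whose offset lies between 0x46000 and 0x46FFF would give the same result as `EXAMPLE_CSR` here.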

If firmware can modify the device's behavior so that the VT/PBA arrays do not share a 4 KB page with other registers, the device will work with ESXi's passthrough. Alternatively, if firmware can hide the MSI-X capability from PCI configuration space, that would fix the issue as well.

I'm not sure whether this has already been reported, but if Google/Coral can either fix the device's behavior to conform to the PCI specification or hide the MSI-X capability, passthrough of the M.2 TPU should work with ESXi, which is a popular hypervisor platform for development purposes.


Issue Type

Build/Install

Operating System

Ubuntu

Coral Device

M.2 Accelerator A+E

Other Devices

No response

Programming Language

No response

Relevant Log Output

No response

google-coral-bot added the Hardware:M.2 Accelerator A+E, subtype:ubuntu/linux, and type:build/install labels on Aug 18, 2023
@goldserve

Yes, please do look into addressing this!

@ManuelPerrot

Very interested to have this fixed as well.
Looks like Xen could have the same issue: https://xcp-ng.org/forum/topic/6304/google-coral-tpu-pcie-passthrough-woes/20

@k1n6b0b

k1n6b0b commented Oct 20, 2023

Adding another vote to fix this here!! There are a ton of threads/requests for this but they're all over.

google-coral/edgetpu#343

google-coral/edgetpu#729

blakeblackshear/frigate#6331

blakeblackshear/frigate#94

blakeblackshear/frigate#305

@grembling22

+1 for a fix

@c-po

c-po commented Nov 5, 2023

+1

@tbozik

tbozik commented Nov 6, 2023

+1 for a fix, not only for the M.2 but for the Mini PCIe as well

@kentkravitz

+1 fix please.

@TokugawaHeavyIndustries

TokugawaHeavyIndustries commented Nov 13, 2023

+1 for fix, commenting to follow. Note this also affects the Mini-PCIe model (as expected)

@syncnj

syncnj commented Nov 13, 2023

+1

@kentkravitz

Can anyone think of any other possible workarounds for this problem? It seems like ESXi could also use a quirks mode for PCIe cards that need some tweaking.

@kuantek

kuantek commented Nov 19, 2023

+1 for a fix please

@Brandon314

+1 for a fix please

@gknepper

+1 for a fix please

@vobelic

vobelic commented Jan 6, 2024

+1 for the fix

@fama-lama

+1

@zaolin

zaolin commented Jan 19, 2024

Just try to disable the msi bus for the bridge if possible, echo 1 > /sys/bus/pci/devices/$bridge/msi_bus as a temporary fix. For me it looks like there is a lot of hacky stuff in the kernel driver:
https://github.com/google/gasket-driver/blob/09385d485812088e04a98a6e1227bf92663e0b59/src/gasket_interrupt.c#L245

@bridge-four

+1 vote for fix!

@alexsahka

+1 vote for fix!

@Claudio1L

+1 :-(

@thefl0yd

This is not likely to ever get fixed now that Broadcom is deprecating free ESXi. I'm aware this is a TPU issue, but the ESXi user base is just going to keep shrinking at this point.

@Sanman96

@thefl0yd I do not believe this is the case. I have a need to deploy the m.2 in multiple enterprise VMware deployments via passthru.

+1 For a fix

@SunvidWong

+1 vote for fix!

@johnlento

Just try to disable the msi bus for the bridge if possible, echo 1 > /sys/bus/pci/devices/$bridge/msi_bus as a temporary fix. For me it looks like there is a lot of hacky stuff in the kernel driver: https://github.com/google/gasket-driver/blob/09385d485812088e04a98a6e1227bf92663e0b59/src/gasket_interrupt.c#L245

This already appears to be set on my fresh install and passthrough. No luck.
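
For anyone who wants to confirm what `msi_bus` is currently set to before or after trying the workaround, here is a minimal sketch that reads the attribute. It is not part of the gasket driver. The bridge address in the usage line is a hypothetical placeholder, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory. As far as I understand the kernel's sysfs ABI documentation, a value of 0 disallows MSI/MSI-X for drivers that later bind below the bridge and a nonzero value allows it, so it is worth double-checking which value you actually intend to write.

```python
from pathlib import Path


def msi_allowed(bridge: str, sysfs_root: str = "/sys/bus/pci/devices") -> bool:
    """Return True if the bridge's msi_bus attribute holds a nonzero value."""
    attr = Path(sysfs_root) / bridge / "msi_bus"
    return int(attr.read_text().strip()) != 0


if __name__ == "__main__":
    # Hypothetical bridge address; substitute the bridge above the TPU,
    # for example as shown by `lspci -t`.
    bridge = "0000:00:15.0"
    try:
        print(f"{bridge}: MSI allowed = {msi_allowed(bridge)}")
    except FileNotFoundError:
        print(f"{bridge}: no msi_bus attribute found")
```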
