Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PJRT] PJRT_Client_Compile ignores xla_gpu_cuda_data_dir #21428

Open
steeve opened this issue Jan 14, 2025 · 8 comments
Open

[PJRT] PJRT_Client_Compile ignores xla_gpu_cuda_data_dir #21428

steeve opened this issue Jan 14, 2025 · 8 comments

Comments

@steeve
Copy link

steeve commented Jan 14, 2025

Hi,

Since 5f1fe6a, it seems CompileOptions.executable_build_options.debug_options.xla_gpu_cuda_data_dir in the CompileOptions pb passed as parameter is ignored.

We're trying to bisect the offending commit, but this is a rather (a few weeks/months ago) regression.

Setting it via XLA_FLAGS works fine.

Regards,

@steeve
Copy link
Author

steeve commented Jan 14, 2025

I see that the CompilerProvider is being created at startup time, and I don't see the options being passed to do using compilation, unfortunately. I see it being created using PJRT_Init ?

@steeve
Copy link
Author

steeve commented Jan 14, 2025

This is highly likely the offending commit: 7d8d569

@beckerhe
Copy link
Member

beckerhe commented Jan 15, 2025

Hi Steeve,
that behaviour is intentional. The problem is that the instance of the compiler class has multiple layers of caching that have always been ignoring the state of DebugOptions in case of a cache hit. This led to subtle bugs.

Since this is a debug option it is supposed to be provided through XLA_FLAGS.

Can you explain a little more about your use case? Are you using the flag to change between different CUDA data dirs with each Compile call or is it about setting the data dir programmatically?

steeve added a commit to zml/zml that referenced this issue Jan 15, 2025
This is the way it should be [1]. Thankfully, other plugins don't use it.

1. openxla/xla#21428
@gwenzek
Copy link

gwenzek commented Jan 16, 2025

Hi, we are working on https://github.com/zml/zml.
Part of our value proposition is to make the toolchain very portable so we sandbox ptxas.
We just want to be able to set the Pjrt plugin to use the sandboxed ptxas, and we don't change it once the program started.

Would moving the cuda data dir out of debug options and into GpuClientOptions an acceptable solution ?
I can contribute a PR if you want.

struct GpuClientOptions {
GpuAllocatorConfig allocator_config;
int node_id = 0;
int num_nodes = 1;
std::optional<std::set<int>> allowed_devices = std::nullopt;
std::optional<std::string> platform_name = std::nullopt;

@beckerhe
Copy link
Member

Putting something into xla_gpu_client_options.h sounds good to me.

But I'm afraid I don't think that's gonna be enough. The main problem is that currently there is no way to pass any kind of parameter to the constructor of the Compiler instance (see

[]() { return std::make_unique<xla::gpu::NVPTXCompiler>(); });
). This can be fixed by changing the function signature of `xla::Compiler::GetForPlatform to take an instance of DebugOptions for example as a second parameter.

Then it has to be plumbed through all the way from the PJRT Client to where the Compiler class is accessed. I haven't looked yet how feasible that is. I'm happy to review a PR for a change along those lines.

A hacky but easier solution is probably to set XLA_FLAGS (the environment variable) before doing anything in XLA. Then on first usage it should parse the flags from there and instantiate the compiler correctly.

@beckerhe
Copy link
Member

Change of plans.

Please hold off from investing any time into this right now. I got two other reports about the same issue (also today interestingly). Since the problem is a lot more widespread than I thought I will try to fix it (do some hacky thing) in NVPTXCompiler directly so that no downstream changes are needed until we can fully move to library compilation which is hopefully coming soon.

I hope that I have a fix by tomorrow.

Does that work for you?

@gwenzek
Copy link

gwenzek commented Jan 16, 2025

yes, it works for us. Our current mitigation is to set XLA_FLAGS early in the process like you suggested.

@beckerhe
Copy link
Member

Hey,
a04b262 should fix the issue.

Can you please verify that it works for you and then close the bug as fixed - or complain to me if it doesn't. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants