[PJRT] PJRT_Client_Compile ignores xla_gpu_cuda_data_dir #21428

steeve · 2025-01-14T21:20:22Z

Hi,

Since 5f1fe6a, it seems CompileOptions.executable_build_options.debug_options.xla_gpu_cuda_data_dir in the CompileOptions pb passed as parameter is ignored.

We're trying to bisect the offending commit, but this is a rather (a few weeks/months ago) regression.

Setting it via XLA_FLAGS works fine.

Regards,

The text was updated successfully, but these errors were encountered:

steeve · 2025-01-14T23:12:11Z

I see that the CompilerProvider is being created at startup time, and I don't see the options being passed to do using compilation, unfortunately. I see it being created using PJRT_Init ?

steeve · 2025-01-14T23:14:12Z

This is highly likely the offending commit: 7d8d569

beckerhe · 2025-01-15T06:50:05Z

Hi Steeve,
that behaviour is intentional. The problem is that the instance of the compiler class has multiple layers of caching that have always been ignoring the state of DebugOptions in case of a cache hit. This led to subtle bugs.

Since this is a debug option it is supposed to be provided through XLA_FLAGS.

Can you explain a little more about your use case? Are you using the flag to change between different CUDA data dirs with each Compile call or is it about setting the data dir programmatically?

This is the way it should be [1]. Thankfully, other plugins don't use it. 1. openxla/xla#21428

gwenzek · 2025-01-16T11:10:11Z

Hi, we are working on https://github.com/zml/zml.
Part of our value proposition is to make the toolchain very portable so we sandbox ptxas.
We just want to be able to set the Pjrt plugin to use the sandboxed ptxas, and we don't change it once the program started.

Would moving the cuda data dir out of debug options and into GpuClientOptions an acceptable solution ?
I can contribute a PR if you want.

xla/xla/pjrt/plugin/xla_gpu/xla_gpu_client_options.h

Lines 30 to 40 in 3e87afa

    
           struct GpuClientOptions { 
        
             GpuAllocatorConfig allocator_config; 
        
             int node_id = 0; 
        
             int num_nodes = 1; 
        
             std::optional<std::set<int>> allowed_devices = std::nullopt; 
        
             std::optional<std::string> platform_name = std::nullopt;

beckerhe · 2025-01-16T14:04:47Z

Putting something into xla_gpu_client_options.h sounds good to me.

But I'm afraid I don't think that's gonna be enough. The main problem is that currently there is no way to pass any kind of parameter to the constructor of the Compiler instance (see

xla/xla/service/gpu/nvptx_compiler_registration.cc

Line 25 in 1149399

[]() { return std::make_unique<xla::gpu::NVPTXCompiler>(); });

). This can be fixed by changing the function signature of `xla::Compiler::GetForPlatform to take an instance of DebugOptions for example as a second parameter.

Then it has to be plumbed through all the way from the PJRT Client to where the Compiler class is accessed. I haven't looked yet how feasible that is. I'm happy to review a PR for a change along those lines.

A hacky but easier solution is probably to set XLA_FLAGS (the environment variable) before doing anything in XLA. Then on first usage it should parse the flags from there and instantiate the compiler correctly.

beckerhe · 2025-01-16T15:32:04Z

Change of plans.

Please hold off from investing any time into this right now. I got two other reports about the same issue (also today interestingly). Since the problem is a lot more widespread than I thought I will try to fix it (do some hacky thing) in NVPTXCompiler directly so that no downstream changes are needed until we can fully move to library compilation which is hopefully coming soon.

I hope that I have a fix by tomorrow.

Does that work for you?

gwenzek · 2025-01-16T16:13:09Z

yes, it works for us. Our current mitigation is to set XLA_FLAGS early in the process like you suggested.

beckerhe · 2025-01-17T11:40:47Z

Hey,
a04b262 should fix the issue.

Can you please verify that it works for you and then close the bug as fixed - or complain to me if it doesn't. :-)

steeve added a commit to zml/zml that referenced this issue Jan 15, 2025

runtimes/cuda,zml: setup xla_gpu_cuda_data_dir via environment variable

c540ea7

This is the way it should be [1]. Thankfully, other plugins don't use it. 1. openxla/xla#21428

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PJRT] PJRT_Client_Compile ignores xla_gpu_cuda_data_dir #21428

[PJRT] PJRT_Client_Compile ignores xla_gpu_cuda_data_dir #21428

steeve commented Jan 14, 2025 •

edited

Loading

steeve commented Jan 14, 2025

steeve commented Jan 14, 2025

beckerhe commented Jan 15, 2025 •

edited

Loading

gwenzek commented Jan 16, 2025

beckerhe commented Jan 16, 2025

beckerhe commented Jan 16, 2025

gwenzek commented Jan 16, 2025

beckerhe commented Jan 17, 2025

[PJRT] PJRT_Client_Compile ignores xla_gpu_cuda_data_dir #21428

[PJRT] PJRT_Client_Compile ignores xla_gpu_cuda_data_dir #21428

Comments

steeve commented Jan 14, 2025 • edited Loading

steeve commented Jan 14, 2025

steeve commented Jan 14, 2025

beckerhe commented Jan 15, 2025 • edited Loading

gwenzek commented Jan 16, 2025

beckerhe commented Jan 16, 2025

beckerhe commented Jan 16, 2025

gwenzek commented Jan 16, 2025

beckerhe commented Jan 17, 2025

steeve commented Jan 14, 2025 •

edited

Loading

beckerhe commented Jan 15, 2025 •

edited

Loading