Temp test #1

Closed

wants to merge 7 commits into from
Conversation

i-chaochen

No description provided.

ghpvnist and others added 7 commits October 5, 2023 10:57
…ension partitioning.

PiperOrigin-RevId: 571073338
The current macro substitution wasn't working and was resolving to %{nccl_version} rather than a version string. Instead, add code to create an NCCL config header (nccl_config.h), and use that to detect the .so version we should be opening.

This change should only affect NCCL when obtained via a stub, which is currently only used by an unreleased version of JAX.
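
As a rough sketch of the mechanism (the generated header contents and the macro name below are assumptions for illustration, not the exact generated code), the stub builds the versioned soname from the config header instead of relying on the template substitution:

```
// nccl_stub_example.cc -- minimal sketch; NCCL_MAJOR_VERSION stands in for
// whatever the generated nccl_config.h actually defines.
#include <dlfcn.h>
#include <cstdio>
#include <string>

#define NCCL_MAJOR_VERSION "2"  // stand-in for #include "nccl_config.h"

void* OpenNcclLibrary() {
  // Build "libnccl.so.<major>" from the configured version instead of an
  // unexpanded %{nccl_version} template substitution.
  std::string soname = std::string("libnccl.so.") + NCCL_MAJOR_VERSION;
  void* handle = dlopen(soname.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "failed to open %s: %s\n", soname.c_str(), dlerror());
  }
  return handle;
}
```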

PiperOrigin-RevId: 571081917
PiperOrigin-RevId: 571120368
PiperOrigin-RevId: 571128305
…g them to the Python object.

We need to keep callback objects alive as long as any running executables are alive. It is possible to discard the Python data structures for an executable before the runtime has finished running that executable, which can lead to a use after free. Instead, make the runtime keep host callbacks alive.
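
A minimal sketch of the ownership pattern, using hypothetical HostCallback and LoadedExecutable names rather than the real runtime types: the executable shares ownership of its callbacks, so dropping the Python wrapper cannot free a callback that a still-running execution needs.

```
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the runtime's host-callback and executable types.
struct HostCallback {
  std::function<void()> fn;
};

class LoadedExecutable {
 public:
  explicit LoadedExecutable(std::vector<std::shared_ptr<HostCallback>> callbacks)
      : callbacks_(std::move(callbacks)) {}

  void Execute() {
    // The executable (and any in-flight execution) holds shared_ptrs, so the
    // callbacks stay alive even if the Python-side wrapper is discarded
    // before the execution finishes.
    for (const auto& cb : callbacks_) cb->fn();
  }

 private:
  std::vector<std::shared_ptr<HostCallback>> callbacks_;
};
```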

PiperOrigin-RevId: 571141106
@i-chaochen i-chaochen closed this Oct 5, 2023
@i-chaochen i-chaochen deleted the temp_test branch October 5, 2023 22:50
pemeliya pushed a commit that referenced this pull request Nov 3, 2023
Imported from GitHub PR openxla#6599

The FP8 cublasLt matmul uses fast accumulation when both operands' precisions are DEFAULT; otherwise it falls back to high-precision accumulation. Issue: openxla#6168

This PR is closely related to Flax PR [3416](google/flax#3416).
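
For reference, a minimal sketch of how fast accumulation is typically requested through the cublasLt descriptor API (illustrative only, not the exact code in this PR):

```
#include <cublasLt.h>
#include <cstdint>

// Configure an FP8 matmul descriptor; enable fast accumulation only when both
// operand precisions are DEFAULT (the condition described above).
cublasLtMatmulDesc_t MakeFp8MatmulDesc(bool both_precisions_default) {
  cublasLtMatmulDesc_t desc = nullptr;
  cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  const int8_t fast_accum = both_precisions_default ? 1 : 0;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_FAST_ACCUM,
                                 &fast_accum, sizeof(fast_accum));
  return desc;
}
```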
Copybara import of the project:

--
a4140da by shuw <[email protected]>:

Add FP8 fast accumulation support for cublasLt.

--
9684568 by shuw <[email protected]>:

Improve based on review #1

--
e906d76 by shuw <[email protected]>:

Improve based on review #2

Merging this change closes openxla#6599

COPYBARA_INTEGRATE_REVIEW=openxla#6599 from wenscarl:fp8_fast_accumulation e906d76
PiperOrigin-RevId: 578948593
pemeliya pushed a commit that referenced this pull request Dec 18, 2023
Imported from GitHub PR openxla#7751

With fast accumulation turned on in forward mode, the cublasLt FP8 GEMM with a GELU epilogue can run as a single fused kernel. Compared against the XLA-generated GELU kernel on H100, it shows some improvement for [8192, 4096] x [4096, 16384] + GELU:

Execution time for matmul using cublasLt and gelu (XLA): 1.28ms
Execution time for matmul_gelu using cublasLt: 1.25ms
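
As an illustrative sketch (assumed attribute setup, not the PR's actual code), the fusion comes from selecting a GELU epilogue on the matmul descriptor so the activation runs inside the same kernel as the FP8 GEMM:

```
#include <cublasLt.h>

// Request an approximate-GELU epilogue on an existing matmul descriptor so the
// activation is fused into the FP8 GEMM kernel instead of running as a
// separate XLA-generated kernel.
void EnableGeluEpilogue(cublasLtMatmulDesc_t desc) {
  const cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_GELU;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
}
```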
Copybara import of the project:

--
e8abce3 by Shu Wang <[email protected]>:

Support cublasLt Fp8 Approx Gelu epilogue fusion.

--
818127c by shuw <[email protected]>:

Remove F32 check

--
5ce3108 by shuw <[email protected]>:

Improve based on review #1

Merging this change closes openxla#7751

COPYBARA_INTEGRATE_REVIEW=openxla#7751 from wenscarl:cublaslt_fp8_gelu 5ce3108
PiperOrigin-RevId: 591236441
ekuznetsov139 pushed a commit that referenced this pull request Feb 25, 2024
…execution scope

Instead of always constructing the while operation's conditional in the default execution scope, use the scope of the while operation itself.

This generates correct CUDA graph: https://gist.github.com/ezhulenev/a84192fe8b46a4bf1a934a8baa08ea60

A memset operation launched in execution scope #1 is not synchronized with the initial condition handle update.
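
A toy illustration of the scoping concern, using plain CUDA stream capture rather than XLA's command buffer internals: work captured on different streams is unordered unless explicitly linked, so recording the condition update on the while operation's own stream keeps it ordered with the work that consumes it.

```
#include <cuda_runtime.h>

int main() {
  int* cond = nullptr;
  cudaMalloc(&cond, sizeof(int));

  cudaStream_t while_scope;
  cudaStreamCreate(&while_scope);

  cudaGraph_t graph;
  cudaStreamBeginCapture(while_scope, cudaStreamCaptureModeGlobal);

  // Recording the initial condition update on the while op's own scope keeps
  // it ordered before any captured work that reads `cond`; on a different,
  // unsynchronized stream it would not be.
  cudaMemsetAsync(cond, 1, sizeof(int), while_scope);
  // ... capture the while body / conditional work on the same stream ...

  cudaStreamEndCapture(while_scope, &graph);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  cudaGraphLaunch(exec, while_scope);
  cudaStreamSynchronize(while_scope);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(while_scope);
  cudaFree(cond);
  return 0;
}
```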

PiperOrigin-RevId: 609742672
pemeliya pushed a commit that referenced this pull request May 6, 2024
ekuznetsov139 pushed a commit that referenced this pull request May 13, 2024
…d phase to Initialize()

Imported from GitHub PR openxla#12228

The first time that a NormThunk is executed, it will build a cudnn execution plan. This build step can hang if a NCCL collective is running at the same time. To fix this, I've moved the build step to take place during thunk initialization. We only observe this hang when using cudnn 9.

Here's a backtrace from the hang that will be fixed:
```
Thread 585 (Thread 0x7fb9391ff640 (LWP 41364) "main.py"):
#0  0x00007fd3d17cffd9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd3d17da24f in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd070967dfe in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fd0709c928a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f1970d76102 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#5  0x00007f1970f2c999 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#6  0x00007f1970a7d4ab in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#7  0x00007f1970d0a9cb in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#8  0x00007fce60b2a98c in cudnn::backend::ExecutionPlan::finalize_internal() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#9  0x00007fce60aefbb1 in cudnn::backend::Descriptor::finalize() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#10 0x00007fce60b15bec in cudnnBackendFinalize () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#11 0x00007fd2521b8f39 in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#12 0x00007fd2521734ba in stream_executor::gpu::(anonymous namespace)::GetExecPlanFromHeuristics(cudnn_frontend::OperationGraph_v8&&, stream_executor::gpu::(anonymous namespace)::CudnnHandle const&, bool) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#13 0x00007fd25216ff9b in stream_executor::gpu::CudnnSupport::NormRunnerFromDesc(stream_executor::Stream*, stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormKind, double, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#14 0x00007fd24e36b88b in stream_executor::dnn::NormOp::RunnerFromAlgorithmDesc(stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#15 0x00007fd24e36ae37 in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}::operator()() const () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#16 0x00007fd24e36adbc in void absl::lts_20230802::base_internal::CallOnceImpl<stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}>(std::atomic<unsigned int>*, absl::lts_20230802::base_internal::SchedulingMode, stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}&&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#17 0x00007fd24e36a9bd in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#18 0x00007fd24e369d29 in xla::gpu::RunGpuNorm(xla::gpu::GpuNormConfig const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, stream_executor::DeviceMemoryBase const&, stream_executor::Stream*, xla::gpu::RunNormOptions) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#19 0x00007fd24e368be6 in xla::gpu::NormThunk::ExecuteOnStream(xla::gpu::Thunk::ExecuteParams const&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
```
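
A minimal sketch of the fix described above, with hypothetical thunk and plan names rather than the actual XLA classes: the expensive cudnn plan build happens once in Initialize(), so ExecuteOnStream() never has to build a plan while a NCCL collective may be in flight.

```
#include <memory>
#include <mutex>

// Hypothetical stand-ins for the cudnn execution plan and thunk interfaces.
struct CudnnExecutionPlan {};

std::unique_ptr<CudnnExecutionPlan> BuildCudnnNormPlan() {
  // Placeholder for the expensive cudnnBackendFinalize()-based plan build.
  return std::make_unique<CudnnExecutionPlan>();
}

class NormThunk {
 public:
  // Called once before execution starts; safe to do slow cudnn work here.
  void Initialize() {
    std::call_once(init_flag_, [this] { plan_ = BuildCudnnNormPlan(); });
  }

  // Hot path: only launches work with the prebuilt plan, so it cannot stall
  // on the cudnn build while a NCCL collective is running.
  void ExecuteOnStream() {
    // ... enqueue the norm using *plan_ ...
  }

 private:
  std::once_flag init_flag_;
  std::unique_ptr<CudnnExecutionPlan> plan_;
};
```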
Copybara import of the project:

--
f535330 by Trevor Morris <[email protected]>:

Fix hang with cudnn layer norm by moving cudnn init to Initialize()

Merging this change closes openxla#12228

COPYBARA_INTEGRATE_REVIEW=openxla#12228 from trevor-m:tmorris-norm-init f535330
PiperOrigin-RevId: 633220207
draganmladjenovic pushed a commit that referenced this pull request Aug 2, 2024
name                                     old cpu/op   new cpu/op   delta
BM_SelectAndScatterF32/128/process_time   889µs ± 1%   740µs ± 3%  -16.70%
BM_SelectAndScatterF32/256/process_time  3.64ms ± 2%  3.00ms ± 1%  -17.64%
BM_SelectAndScatterF32/512/process_time  15.3ms ± 1%  13.1ms ± 3%  -14.61%

PiperOrigin-RevId: 658063846