A100 RuntimeError: CUDA error: no kernel image is available for execution on the device #27

Open
marvin-0042 opened this issue Nov 18, 2024 · 4 comments


marvin-0042 commented Nov 18, 2024

When I run example.py, I hit:
RuntimeError: CUDA error: no kernel image is available for execution on the device (at /home/ubuntu/nunchaku/src/kernels/awq/gemv_awq.cu:311)

I'm on a Lambda A100 GPU instance running Ubuntu. Since the environment's CUDA version is 12.4, I changed the pip install command to use cu124 instead of cu121:

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
-->
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124

A100 Ubuntu environment:
PyTorch version: 2.3.1
CUDA available: True
CUDA version: 12.4
Device: NVIDIA A100-SXM4-40GB
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
python --version
Python 3.10.12

(nunchaku) ubuntu@129-146-61-54:~/nunchaku$ python example.py

RuntimeError: CUDA error: no kernel image is available for execution on the device (at /home/ubuntu/nunchaku/src/kernels/awq/gemv_awq.cu:311)
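
For context, this error generally means the compiled extension contains no machine code (SASS) or PTX that the current GPU can load. The compatibility rule can be sketched as below, assuming the build embedded only sm_86 and sm_89 machine code and no PTX, as the project's default flags suggest. `can_run` and `DEFAULT_TARGETS` are hypothetical names used for illustration and are not part of nunchaku.

```python
def can_run(device_cc, sass_targets, ptx_targets=()):
    """Return True if a binary built for the given targets can load on device_cc.

    device_cc:    (major, minor) compute capability, e.g. (8, 0) for A100.
    sass_targets: capabilities with embedded machine code (code=sm_XY).
    ptx_targets:  capabilities with embedded PTX (code=compute_XY).
    """
    major, minor = device_cc
    # Machine code runs only on the same major version with an equal or higher minor.
    for t_major, t_minor in sass_targets:
        if t_major == major and t_minor <= minor:
            return True
    # PTX can be JIT-compiled for any device of equal or higher capability.
    for target in ptx_targets:
        if tuple(target) <= tuple(device_cc):
            return True
    return False


# Assumption: the default build targets only sm_86 and sm_89.
DEFAULT_TARGETS = [(8, 6), (8, 9)]

if __name__ == "__main__":
    a100 = (8, 0)
    print("default build on A100:", can_run(a100, DEFAULT_TARGETS))
    print("sm_80 build on A100:", can_run(a100, [(8, 0), (8, 9)]))
```

On a real machine the device tuple can be read with `torch.cuda.get_device_capability()`, which reports (8, 0) on an A100.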


marvin-0042 commented Nov 18, 2024

Sorry, I just found the existing issue about A100. When I followed the instructions quoted below and changed setup.py for A100, I hit:

RuntimeError: Error compiling objects for extension

#1

We currently support sm_86 (Ampere, RTX3090/A6000) and sm_89 (Ada, RTX4090). The kernel may run on sm_80 (A100) but expect a significant performance drop. If you want to try it on A100 you could edit setup.py and change arch=compute_86,code=sm_86 to arch=compute_80,code=sm_80.
Unfortunately, we don't support Turing (RTX20 series) and earlier architectures since we depend on FlashAttention. Hopper (H100) also does not work due to the lack of INT4 TensorCore.
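
The setup.py edit described in the quote amounts to changing which `-gencode` pairs are passed to nvcc. A minimal sketch of generating those flags follows, assuming capabilities are written as strings such as "8.0". `gencode_flags` is a hypothetical helper for illustration, not the project's actual setup.py code.

```python
def gencode_flags(capabilities):
    """Build nvcc -gencode arguments that embed machine code for each capability.

    capabilities: iterable of strings such as "8.0" or "8.9".
    """
    flags = []
    for cc in capabilities:
        digits = cc.replace(".", "")  # "8.0" -> "80"
        flags += ["-gencode", f"arch=compute_{digits},code=sm_{digits}"]
    return flags


if __name__ == "__main__":
    # A100 (sm_80) plus Ada (sm_89)
    print(" ".join(gencode_flags(["8.0", "8.9"])))
```

Called with `["8.0", "8.9"]`, this produces the same two `-gencode` pairs that appear in the nvcc command in the build log.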

(nunchaku) ubuntu@129-146-61-54:~/nunchaku$ pip install -e .
Obtaining file:///home/ubuntu/nunchaku
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: torch>=2.4.1 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from nunchaku==0.0.0b0) (2.4.1+cu124)
Requirement already satisfied: diffusers>=0.30.3 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from nunchaku==0.0.0b0) (0.31.0)
Requirement already satisfied: transformers in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from nunchaku==0.0.0b0) (4.46.2)
Requirement already satisfied: accelerate in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from nunchaku==0.0.0b0) (1.1.1)
Requirement already satisfied: sentencepiece in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from nunchaku==0.0.0b0) (0.2.0)
Requirement already satisfied: protobuf in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from nunchaku==0.0.0b0) (5.28.3)
Requirement already satisfied: importlib-metadata in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from diffusers>=0.30.3->nunchaku==0.0.0b0) (8.5.0)
Requirement already satisfied: filelock in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from diffusers>=0.30.3->nunchaku==0.0.0b0) (3.13.1)
Requirement already satisfied: huggingface-hub>=0.23.2 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from diffusers>=0.30.3->nunchaku==0.0.0b0) (0.26.2)
Requirement already satisfied: numpy in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from diffusers>=0.30.3->nunchaku==0.0.0b0) (1.26.3)
Requirement already satisfied: regex!=2019.12.17 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from diffusers>=0.30.3->nunchaku==0.0.0b0) (2024.11.6)
Requirement already satisfied: requests in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from diffusers>=0.30.3->nunchaku==0.0.0b0) (2.32.3)
Requirement already satisfied: safetensors>=0.3.1 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from diffusers>=0.30.3->nunchaku==0.0.0b0) (0.4.5)
Requirement already satisfied: Pillow in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from diffusers>=0.30.3->nunchaku==0.0.0b0) (10.2.0)
Requirement already satisfied: typing-extensions>=4.8.0 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (4.9.0)
Requirement already satisfied: sympy in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (1.13.1)
Requirement already satisfied: networkx in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (3.2.1)
Requirement already satisfied: jinja2 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (3.1.3)
Requirement already satisfied: fsspec in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (2024.2.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.99 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (12.4.99)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.99 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (12.4.99)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.99 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (12.4.99)
Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (9.1.0.70)
Requirement already satisfied: nvidia-cublas-cu12==12.4.2.65 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (12.4.2.65)
Requirement already satisfied: nvidia-cufft-cu12==11.2.0.44 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (11.2.0.44)
Requirement already satisfied: nvidia-curand-cu12==10.3.5.119 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (10.3.5.119)
Requirement already satisfied: nvidia-cusolver-cu12==11.6.0.99 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (11.6.0.99)
Requirement already satisfied: nvidia-cusparse-cu12==12.3.0.142 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (12.3.0.142)
Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (2.20.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.99 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (12.4.99)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.99 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (12.4.99)
Requirement already satisfied: triton==3.0.0 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from torch>=2.4.1->nunchaku==0.0.0b0) (3.0.0)
Requirement already satisfied: packaging>=20.0 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from accelerate->nunchaku==0.0.0b0) (24.2)
Requirement already satisfied: psutil in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from accelerate->nunchaku==0.0.0b0) (5.9.8)
Requirement already satisfied: pyyaml in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from accelerate->nunchaku==0.0.0b0) (6.0.2)
Requirement already satisfied: tokenizers<0.21,>=0.20 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from transformers->nunchaku==0.0.0b0) (0.20.3)
Requirement already satisfied: tqdm>=4.27 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from transformers->nunchaku==0.0.0b0) (4.67.0)
Requirement already satisfied: zipp>=3.20 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from importlib-metadata->diffusers>=0.30.3->nunchaku==0.0.0b0) (3.21.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from jinja2->torch>=2.4.1->nunchaku==0.0.0b0) (2.1.5)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from requests->diffusers>=0.30.3->nunchaku==0.0.0b0) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from requests->diffusers>=0.30.3->nunchaku==0.0.0b0) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from requests->diffusers>=0.30.3->nunchaku==0.0.0b0) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from requests->diffusers>=0.30.3->nunchaku==0.0.0b0) (2024.8.30)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages (from sympy->torch>=2.4.1->nunchaku==0.0.0b0) (1.3.0)
Building wheels for collected packages: nunchaku
Building editable for nunchaku (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building editable for nunchaku (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [314 lines of output]
/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = conversion_method_template(device=torch.device("cpu"))
/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/dist.py:330: InformationOnly: Normalizing '0.0.0beta0' to '0.0.0b0'
self.metadata.version = self.normalize_version(self.metadata.version)
running editable_wheel
creating /tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku.egg-info
writing /tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku.egg-info/dependency_links.txt
writing requirements to /tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku.egg-info/requires.txt
writing top-level names to /tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku.egg-info/top_level.txt
writing manifest file '/tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.hpp' under directory 'src'
warning: no files found matching '*.ipp' under directory 'src'
warning: no files found matching '*.hpp' under directory 'nunchaku/csrc'
warning: no files found matching '*.ipp' under directory 'nunchaku/csrc'
warning: no files found matching '*.cu' under directory 'nunchaku/csrc'
warning: no files found matching '*.cuh' under directory 'nunchaku/csrc'
warning: no files found matching '*.hpp' under directory 'third_party/Block-Sparse-Attention/csrc/block_sparse_attn'
warning: no files found matching '*.ipp' under directory 'third_party/Block-Sparse-Attention/csrc/block_sparse_attn'
warning: no files found matching '*.cpp' under directory 'third_party/cutlass/include'
warning: no files found matching '*.ipp' under directory 'third_party/cutlass/include'
warning: no files found matching '*.cu' under directory 'third_party/cutlass/include'
warning: no files found matching '*.cuh' under directory 'third_party/cutlass/include'
warning: no files found matching '*.cpp' under directory 'third_party/json/include'
warning: no files found matching '*.h' under directory 'third_party/json/include'
warning: no files found matching '*.ipp' under directory 'third_party/json/include'
warning: no files found matching '*.cpp' under directory 'third_party/mio/include'
warning: no files found matching '*.h' under directory 'third_party/mio/include'
warning: no files found matching '*.cpp' under directory 'third_party/spdlog/include'
warning: no files found matching '*.hpp' under directory 'third_party/spdlog/include'
warning: no files found matching '*.ipp' under directory 'third_party/spdlog/include'
adding license file 'LICENCE.txt'
writing manifest file '/tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku.egg-info/SOURCES.txt'
creating '/tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku-0.0.0b0.dist-info'
creating /tmp/pip-wheel-pis4qmre/.tmp-fz92l5lg/nunchaku-0.0.0b0.dist-info/WHEEL
running build_py
running build_ext
/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py:416: UserWarning: The detected CUDA version (12.6) has a minor version mismatch with the version that was used to compile PyTorch (12.4). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py:426: UserWarning: There are no /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ version bounds defined for CUDA version 12.6
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'nunchaku.C' extension
creating /tmp/tmp5fuu9lel.build-temp/nunchaku/csrc
creating /tmp/tmp5fuu9lel.build-temp/src
creating /tmp/tmp5fuu9lel.build-temp/src/interop
creating /tmp/tmp5fuu9lel.build-temp/src/kernels
creating /tmp/tmp5fuu9lel.build-temp/src/kernels/awq
creating /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn
creating /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src
Emitting ninja build file /tmp/tmp5fuu9lel.build-temp/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/src/FluxModel.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/FluxModel.cpp -o /tmp/tmp5fuu9lel.build-temp/src/FluxModel.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' 
'-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=C -D_GLIBCXX_USE_CXX11_ABI=0
FAILED: /tmp/tmp5fuu9lel.build-temp/src/FluxModel.o
/home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/src/FluxModel.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/FluxModel.cpp -o /tmp/tmp5fuu9lel.build-temp/src/FluxModel.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' 
'-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=C -D_GLIBCXX_USE_CXX11_ABI=0
/home/ubuntu/nunchaku/src/FluxModel.cpp:7:10: fatal error: nvtx3/nvToolsExt.h: No such file or directory
7 | #include <nvtx3/nvToolsExt.h>
| ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
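
The fatal error above is a missing header rather than an architecture problem: the compiler could not find nvtx3/nvToolsExt.h in any directory on its include path. One way to check where, if anywhere, the header exists is sketched below. The conda paths mirror the `-I` flags in the log; the `CUDA_HOME` fallback of /usr/local/cuda is an assumption about a system-wide toolkit install, and `find_header` is a hypothetical helper.

```python
import os


def find_header(relative_path, include_dirs):
    """Return the include directories that actually contain relative_path."""
    return [
        d for d in include_dirs
        if os.path.isfile(os.path.join(d, relative_path))
    ]


if __name__ == "__main__":
    conda_prefix = os.environ.get("CONDA_PREFIX", "")
    candidates = [
        os.path.join(conda_prefix, "include"),
        os.path.join(conda_prefix, "targets", "x86_64-linux", "include"),
        # Assumption: a system-wide toolkit lives under CUDA_HOME or /usr/local/cuda.
        os.path.join(os.environ.get("CUDA_HOME", "/usr/local/cuda"), "include"),
    ]
    hits = find_header("nvtx3/nvToolsExt.h", candidates)
    print(hits if hits else "nvtx3/nvToolsExt.h not found in any candidate directory")
```

If the header turns up only under the system toolkit, the conda environment's nvcc (reported as CUDA 12.6 in the warning earlier in the log) is probably missing its NVTX headers.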
[2/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api_adapter.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api_adapter.cpp -o 
/tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api_adapter.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=C -D_GLIBCXX_USE_CXX11_ABI=0
[3/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/src/layernorm.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/layernorm.cpp -o /tmp/tmp5fuu9lel.build-temp/src/layernorm.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' 
'-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=C -D_GLIBCXX_USE_CXX11_ABI=0
[4/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.o 
-DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=C -D_GLIBCXX_USE_CXX11_ABI=0
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp: In function 'std::vector mha_fwd(Tensor&, const Tensor&, const Tensor&, std::optional&, std::optional&, float, float, bool, int, int, bool, std::optional<pytorch_compat::at::Generator>)':
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp:387:25: warning: unused variable 'device_guard' [-Wunused-variable]
387 | at::cuda::CUDAGuard device_guard{(char)q.get_device()};
| ^~~~~~~~~~~~
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp:438:13: warning: unused variable 'counter_offset' [-Wunused-variable]
438 | int64_t counter_offset = params.b * params.h * 32;
| ^~~~~~~~~~~~~~
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp: In function 'std::vector mha_varlen_fwd(const Tensor&, const Tensor&, const Tensor&, std::optional&, Tensor&, Tensor&, std::optional&, std::optional&, int, int, float, float, bool, bool, int, int, bool, std::optional<pytorch_compat::at::Generator>)':
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp:601:25: warning: unused variable 'device_guard' [-Wunused-variable]
601 | at::cuda::CUDAGuard device_guard{(char)q.get_device()};
| ^~~~~~~~~~~~
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp:640:13: warning: unused variable 'counter_offset' [-Wunused-variable]
640 | int64_t counter_offset = params.b * params.h * 32;
| ^~~~~~~~~~~~~~
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp: In function 'std::vector mha_fwd_block(const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, int, int, Tensor&, std::optional&, std::optional&, int, int, float, float, bool, bool, bool, int, int, std::optional<pytorch_compat::at::Generator>)':
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp:1620:25: warning: unused variable 'device_guard' [-Wunused-variable]
1620 | at::cuda::CUDAGuard device_guard{(char)q.get_device()};
| ^~~~~~~~~~~~
/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/flash_api.cpp:1678:13: warning: unused variable 'counter_offset' [-Wunused-variable]
1678 | int64_t counter_offset = params.b * params.h * 32;
| ^~~~~~~~~~~~~~
[5/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/src/Linear.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/Linear.cpp -o /tmp/tmp5fuu9lel.build-temp/src/Linear.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' 
'-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=C -D_GLIBCXX_USE_CXX11_ABI=0
[6/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/src/activation.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/activation.cpp -o /tmp/tmp5fuu9lel.build-temp/src/activation.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' 
'-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
[7/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/src/kernels/activation_kernels.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/kernels/activation_kernels.cu -o /tmp/tmp5fuu9lel.build-temp/src/kernels/activation_kernels.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[8/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/src/Serialization.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/Serialization.cpp -o /tmp/tmp5fuu9lel.build-temp/src/Serialization.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' 
'-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
[9/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/src/kernels/layernorm_kernels.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/kernels/layernorm_kernels.cu -o /tmp/tmp5fuu9lel.build-temp/src/kernels/layernorm_kernels.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[10/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/src/kernels/awq/gemv_awq.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/kernels/awq/gemv_awq.cu -o /tmp/tmp5fuu9lel.build-temp/src/kernels/awq/gemv_awq.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[11/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/src/kernels/gemm_batched.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/kernels/gemm_batched.cu -o /tmp/tmp5fuu9lel.build-temp/src/kernels/gemm_batched.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[12/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/src/kernels/gemm_f16.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/kernels/gemm_f16.cu -o /tmp/tmp5fuu9lel.build-temp/src/kernels/gemm_f16.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[13/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/src/interop/torch.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/interop/torch.cpp -o /tmp/tmp5fuu9lel.build-temp/src/interop/torch.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' 
'-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
[14/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-c++ -MMD -MF /tmp/tmp5fuu9lel.build-temp/nunchaku/csrc/pybind.o.d -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -fPIC -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/include -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib -L/home/ubuntu/miniconda3/envs/nunchaku/targets/x86_64-linux/lib/stubs -fPIC -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/nunchaku/csrc/pybind.cpp -o /tmp/tmp5fuu9lel.build-temp/nunchaku/csrc/pybind.o -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -fvisibility=hidden -g -std=c++20 -UNDEBUG -Og -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' 
'-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
In file included from /home/ubuntu/miniconda3/envs/nunchaku/x86_64-conda-linux-gnu/include/c++/11.2.0/cassert:44,
                 from /home/ubuntu/nunchaku/third_party/spdlog/include/spdlog/details/circular_q.h:7,
                 from /home/ubuntu/nunchaku/third_party/spdlog/include/spdlog/details/backtracer.h:6,
                 from /home/ubuntu/nunchaku/third_party/spdlog/include/spdlog/logger.h:18,
                 from /home/ubuntu/nunchaku/third_party/spdlog/include/spdlog/details/registry-inl.h:12,
                 from /home/ubuntu/nunchaku/third_party/spdlog/include/spdlog/details/registry.h:128,
                 from /home/ubuntu/nunchaku/third_party/spdlog/include/spdlog/spdlog.h:13,
                 from /home/ubuntu/nunchaku/src/common.h:23,
                 from /home/ubuntu/nunchaku/src/interop/torch.h:5,
                 from /home/ubuntu/nunchaku/nunchaku/csrc/gemm.h:3,
                 from /home/ubuntu/nunchaku/nunchaku/csrc/pybind.cpp:1:
/home/ubuntu/nunchaku/nunchaku/csrc/gemm.h: In member function 'std::string QuantizedGEMM::dumpTensorINT4(Tensor)':
/home/ubuntu/nunchaku/nunchaku/csrc/gemm.h:89:43: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]
   89 |                         assert(offset + i < x.numel() / 4);
      |                                ~~~~~~~~~~~^~~~~~~~~~~~~~~
In file included from /home/ubuntu/nunchaku/nunchaku/csrc/pybind.cpp:2:
/home/ubuntu/nunchaku/nunchaku/csrc/flux.h: In lambda function:
/home/ubuntu/nunchaku/nunchaku/csrc/flux.h:181:48: warning: comparison of integer expressions of different signedness: 'int' and 'std::vector<float, std::allocator<float> >::size_type' {aka 'long unsigned int'} [-Wsign-compare]
  181 |                 for (int i = skipRanks / 16; i < m->lora_scales.size(); i++) {
      |                                              ~~^~~~~~~~~~~~~~~~~~~~~~~
[15/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/src/kernels/misc_kernels.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/kernels/misc_kernels.cu -o /tmp/tmp5fuu9lel.build-temp/src/kernels/misc_kernels.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[16/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim64_bf16_sm80.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim64_bf16_sm80.cu -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim64_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[17/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim64_fp16_sm80.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim64_fp16_sm80.cu -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim64_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[18/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim128_bf16_sm80.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim128_bf16_sm80.cu -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[19/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim128_fp16_sm80.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim128_fp16_sm80.cu -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
[20/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/src/kernels/gemm_w4a4.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/src/kernels/gemm_w4a4.cu -o /tmp/tmp5fuu9lel.build-temp/src/kernels/gemm_w4a4.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
/home/ubuntu/nunchaku/src/kernels/gemm_w4a4.cu(3063): warning #549-D: variable "epilogueArgs" is used before its value is set
M, N, K, epilogueArgs,
^

  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
  
  /home/ubuntu/nunchaku/src/kernels/gemm_w4a4.cu(3063): warning #549-D: variable "epilogueArgs" is used before its value is set
            M, N, K, epilogueArgs,
                     ^
  
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
  
  /home/ubuntu/nunchaku/src/kernels/gemm_w4a4.cu(3063): warning #549-D: variable "epilogueArgs" is used before its value is set
            M, N, K, epilogueArgs,
                     ^
  
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
  
  [21/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim64_bf16_sm80.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim64_bf16_sm80.cu -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim64_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info 
--ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
  nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
  [22/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim64_fp16_sm80.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim64_fp16_sm80.cu -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim64_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info 
--ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
  nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
  [23/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim128_bf16_sm80.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim128_bf16_sm80.cu -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info 
--ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
  nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
  [24/24] /home/ubuntu/miniconda3/envs/nunchaku/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim128_fp16_sm80.o.d -I/home/ubuntu/nunchaku/src -I/home/ubuntu/nunchaku/third_party/cutlass/include -I/home/ubuntu/nunchaku/third_party/json/include -I/home/ubuntu/nunchaku/third_party/mio/include -I/home/ubuntu/nunchaku/third_party/spdlog/include -I/home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/TH -I/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/include/THC -I/home/ubuntu/miniconda3/envs/nunchaku/include -I/home/ubuntu/miniconda3/envs/nunchaku/include/python3.11 -c -c /home/ubuntu/nunchaku/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim128_fp16_sm80.cu -o /tmp/tmp5fuu9lel.build-temp/third_party/Block-Sparse-Attention/csrc/block_sparse_attn/src/flash_fwd_block_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DENABLE_BF16=1 -DBUILD_NUNCHAKU=1 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_89,code=sm_89 -g -std=c++20 -UNDEBUG -Xcudafe --diag_suppress=20208 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_HALF2_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --threads=2 --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info 
--ptxas-options=--allow-expensive-optimizations=true -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ubuntu/miniconda3/envs/nunchaku/bin/x86_64-conda-linux-gnu-cc
  nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
      subprocess.run(
    File "/home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/subprocess.py", line 571, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 139, in run
      self._create_wheel_file(bdist_wheel)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 340, in _create_wheel_file
      files, mapping = self._run_build_commands(dist_name, unpacked, lib, tmp)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 263, in _run_build_commands
      self._run_build_subcommands()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 290, in _run_build_subcommands
      self.run_command(name)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
      self.distribution.run_command(command)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 994, in run_command
      super().run_command(command)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 973, in run_command
      cmd_obj.run()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 99, in run
      _build_ext.run(self)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
      self.build_extensions()
    File "<string>", line 18, in build_extensions
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 868, in build_extensions
      build_ext.build_extensions(self)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 476, in build_extensions
      self._build_extensions_serial()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 502, in _build_extensions_serial
      self.build_extension(ext)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 264, in build_extension
      _build_ext.build_extension(self, ext)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 557, in build_extension
      objects = self.compiler.compile(
                ^^^^^^^^^^^^^^^^^^^^^^
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 681, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1784, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  /tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py:973: _DebuggingTips: Problem in editable installation.
  !!
  
          ********************************************************************************
          An error happened while installing `nunchaku` in editable mode.
  
          The following steps are recommended to help debug this problem:
  
          - Try to install the project normally, without using the editable mode.
            Does the error still persist?
            (If it does, try fixing the problem before attempting the editable mode).
          - If you are using binary extensions, make sure you have all OS-level
            dependencies installed (e.g. compilers, toolchains, binary libraries, ...).
          - Try the latest version of setuptools (maybe the error was already fixed).
          - If you (or your project dependencies) are using any setuptools extension
            or customization, make sure they support the editable mode.
  
          After following the steps above, if the problem still persists and
          you think this is related to how setuptools handles editable installations,
          please submit a reproducible example
          (see https://stackoverflow.com/help/minimal-reproducible-example) to:
  
              https://github.com/pypa/setuptools/issues
  
          See https://setuptools.pypa.io/en/latest/userguide/development_mode.html for details.
          ********************************************************************************
  
  !!
    cmd_obj.run()
  Traceback (most recent call last):
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
      subprocess.run(
    File "/home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/subprocess.py", line 571, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
      main()
    File "/home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/miniconda3/envs/nunchaku/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 273, in build_editable
      return hook(wheel_directory, config_settings, metadata_directory)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 476, in build_editable
      return self._build_with_temp_dir(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 407, in _build_with_temp_dir
      self.run_setup()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 320, in run_setup
      exec(code, locals())
    File "<string>", line 115, in <module>
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/__init__.py", line 117, in setup
      return distutils.core.setup(**attrs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 183, in setup
      return run_commands(dist)
             ^^^^^^^^^^^^^^^^^^
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 199, in run_commands
      dist.run_commands()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 954, in run_commands
      self.run_command(cmd)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 994, in run_command
      super().run_command(command)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 973, in run_command
      cmd_obj.run()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 139, in run
      self._create_wheel_file(bdist_wheel)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 340, in _create_wheel_file
      files, mapping = self._run_build_commands(dist_name, unpacked, lib, tmp)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 263, in _run_build_commands
      self._run_build_subcommands()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 290, in _run_build_subcommands
      self.run_command(name)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
      self.distribution.run_command(command)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 994, in run_command
      super().run_command(command)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 973, in run_command
      cmd_obj.run()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 99, in run
      _build_ext.run(self)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
      self.build_extensions()
    File "<string>", line 18, in build_extensions
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 868, in build_extensions
      build_ext.build_extensions(self)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 476, in build_extensions
      self._build_extensions_serial()
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 502, in _build_extensions_serial
      self.build_extension(ext)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 264, in build_extension
      _build_ext.build_extension(self, ext)
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 557, in build_extension
      objects = self.compiler.compile(
                ^^^^^^^^^^^^^^^^^^^^^^
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 681, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1784, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/tmp/pip-build-env-8l_kl2e5/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building editable for nunchaku
Failed to build nunchaku
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (nunchaku)
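
The tracebacks above only report that `ninja -v` exited with status 1; the compiler message that actually stopped the build sits somewhere earlier in the log. Below is a minimal sketch for pulling such lines out of a saved log. The helper name `find_compiler_errors`, the regular expression, and the file names in the sample are assumptions for illustration (modelled on the usual nvcc/gcc `error:` and `fatal error:` formats), not something pip or nunchaku provides.

```python
import re

# Matches lowercase "error:", "fatal error:" or "error #123" as printed by
# nvcc and gcc, but not Python exception names such as "RuntimeError:".
_ERROR_RE = re.compile(r"\b(?:fatal error|error)\s*(?::|#\d+)")


def find_compiler_errors(log_text):
    """Return (line_number, line) pairs for lines that look like compiler errors."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if _ERROR_RE.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up sample lines; replace with the text of a real build log.
    sample = "\n".join([
        "nvcc warning : incompatible redefinition for option 'compiler-bindir'",
        "src/example.cu(12): warning #549-D: variable is used before its value is set",
        "src/example.cpp:3:10: fatal error: example_header.h: No such file or directory",
        "ninja: build stopped: subcommand failed.",
        "RuntimeError: Error compiling objects for extension",
    ])
    for number, line in find_compiler_errors(sample):
        print(f"{number}: {line}")
```

Saving the build output first (for example with `pip install -e . 2>&1 | tee build.log`) and passing the file contents to the helper keeps the search separate from the build itself.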

@sxtyzhangzk
Collaborator

It looks like NVTX is missing. NVTX ships with the CUDA Toolkit, so please confirm that CUDA is properly installed on your system.

Also note that in the current version, the performance achieved on A100 is much lower than peak. One of the main reasons is that int-to-float conversion is slow on architectures before sm_86.
We might release a fix for that in the future, but the performance may still be unsatisfactory: CUDA cores on A100 are relatively weak compared to its Tensor Cores, and we need those CUDA cores to perform group scaling in the W4A4 GEMM kernel.
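
On the NVTX point, one quick way to check a CUDA installation is to look for the NVTX headers under its root. The sketch below is only that: the two candidate paths reflect where CUDA Toolkit installs commonly place the NVTX v3 and legacy headers, and the posted log does not show which header the build actually wanted, so treat both the paths and the helper name `find_nvtx_headers` as assumptions. Since the log invokes nvcc from the conda environment, that environment's prefix may be the root worth checking.

```python
import os
from pathlib import Path

# Assumed locations of the NVTX headers relative to the CUDA root:
# header-only NVTX v3 first, then the legacy header.
NVTX_HEADER_CANDIDATES = ("include/nvtx3/nvToolsExt.h", "include/nvToolsExt.h")


def find_nvtx_headers(cuda_home):
    """Return the candidate NVTX headers that exist under the given CUDA root."""
    root = Path(cuda_home)
    return [root / rel for rel in NVTX_HEADER_CANDIDATES if (root / rel).is_file()]


if __name__ == "__main__":
    # CUDA_HOME / CUDA_PATH are the conventional variables; /usr/local/cuda
    # is a common default on Linux.
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH") or "/usr/local/cuda"
    found = find_nvtx_headers(cuda_home)
    if found:
        for path in found:
            print(f"found {path}")
    else:
        print(f"no NVTX headers under {cuda_home}")
```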

@dianyo

dianyo commented Nov 21, 2024

Hi @marvin-0042

I'm also using a Lambda Labs NVIDIA A100-SXM4-40GB. However, I didn't run the installation in the instance's original environment. Instead, I ran it inside a Docker container based on the NGC image nvcr.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04. You have to install conda first by downloading an installation script from its official website.

Hope this can help you!
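
The thread does not show the exact command used to start the container, so the following is a sketch under assumptions: `--gpus all` relies on the NVIDIA Container Toolkit being set up on the host, and the `/workspace` mount point and the helper name `docker_run_argv` are hypothetical. It only assembles the `docker run` argument list and prints it, so nothing is launched.

```python
import shlex

IMAGE = "nvcr.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04"


def docker_run_argv(image=IMAGE, host_dir=None):
    """Build an interactive `docker run` command line with all GPUs exposed."""
    # No --rm: the container is kept after exit, so a conda install inside it
    # is not thrown away.
    argv = ["docker", "run", "-it", "--gpus", "all"]
    if host_dir is not None:
        # Bind-mount a host directory (hypothetical /workspace target) so
        # files such as a repo checkout are visible inside the container.
        argv += ["-v", f"{host_dir}:/workspace", "-w", "/workspace"]
    argv += [image, "bash"]
    return argv


if __name__ == "__main__":
    print(shlex.join(docker_run_argv(host_dir="/home/ubuntu/nunchaku")))
```

The printed line can be pasted into a shell, or the list passed to `subprocess.run` if launching from Python is preferred.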

@marvin-0042
Author

marvin-0042 commented Nov 21, 2024

Thank you so much, @dianyo and @sxtyzhangzk! It works!

Lambda Labs NVIDIA A100-SXM4-40GB instructions

  1. Add your user to the docker group ($USER is ubuntu on the Lambda A100 instance) and apply the change in the current session:
    sudo usermod -aG docker $USER
    newgrp docker

  2. Pull the Docker image
    docker pull nvcr.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

  3. Download Miniconda installer
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

  4. Install Miniconda inside the Docker container
    bash Miniconda3-latest-Linux-x86_64.sh
    source ~/.bashrc

  5. Then follow the nunchaku instructions in the README with two changes: install the cu124 PyTorch wheels instead of cu121 (CUDA 12.4 rather than 12.1), and, after cloning the repo and before running pip install -e ., edit setup.py so that it targets sm_80 (A100):
    setup.py: "arch=compute_80,code=sm_80",
    Remove the other compute_8x arch lines from setup.py.

    conda create -n nunchaku python=3.11
    conda activate nunchaku
    pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
    pip install diffusers ninja wheel transformers accelerate sentencepiece protobuf
    pip install huggingface_hub peft opencv-python einops gradio spaces GPUtil

    git clone https://github.com/mit-han-lab/nunchaku.git
    cd nunchaku
    git submodule init
    git submodule update
    pip install -e .
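
Why the arch line in step 5 matters can be sketched in a few lines. A `-gencode arch=compute_XY,code=sm_XY` entry embeds only machine code for sm_XY, and such code runs only on devices with the same major version and an equal or higher minor version, which is why a build for sm_86 or sm_89 has no kernel image an A100 (8.0) can execute. The helper names `gencode_flag` and `sass_runs_on` are made up for this illustration and are not part of setup.py.

```python
def gencode_flag(major, minor):
    """Format the nvcc -gencode value for one compute capability."""
    cc = f"{major}{minor}"
    return f"arch=compute_{cc},code=sm_{cc}"


def sass_runs_on(built_for, device):
    """True if machine code built for `built_for` can execute on `device`.

    Both arguments are (major, minor) tuples. Compiled sm_XY code is only
    compatible with devices of the same major version and an equal or
    higher minor version.
    """
    return built_for[0] == device[0] and built_for[1] <= device[1]


if __name__ == "__main__":
    a100 = (8, 0)
    print(gencode_flag(*a100))
    for built in [(8, 0), (8, 6), (8, 9)]:
        print(built, "runs on A100:", sass_runs_on(built, a100))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple of the current device, which can be passed to `gencode_flag` to get the string to put in setup.py.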
