
[Issue] Triton Compilation Error in Unsloth Fine-Tuning Script on Kernel 5.4.0 #1336

Open
gityeop opened this issue Nov 25, 2024 · 8 comments
Labels: fixed - pending confirmation (Fixed, waiting for confirmation from poster)

Comments


gityeop commented Nov 25, 2024

Description

When trying to run the Unsloth fine-tuning script, I encounter a Triton compilation error originating in ReduceOpToLLVM.cpp.

Error Message

python /data/ephemeral/home/unsloth_example.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.9: Fast Llama patching. Transformers = 4.46.3
   \\   /|    GPU: Tesla V100-SXM2-32GB. Max memory: 31.739 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth 2024.11.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 210,289 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
  0%|                                               | 0/60 [00:00<?, ?it/s]
python: /project/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp:31: virtual mlir::LogicalResult {anonymous}::ReduceOpConversion::matchAndRewrite(mlir::triton::ReduceOp, mlir::ConvertOpToLLVMPattern<mlir::triton::ReduceOp>::OpAdaptor, mlir::ConversionPatternRewriter&) const: Assertion `helper.isSupportedLayout() && "Unexpected srcLayout in ReduceOpConversion"' failed.
Aborted (core dumped)

System Information

  • OS Kernel: 5.4.0-99-generic
  • GPU: Tesla V100-SXM2-32GB (see the capability check after this list)
  • CUDA Version: 12.2
  • Driver Version: 535.161.08
  • GPU Memory: 32GB
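
A point of comparison (my addition, not part of the original report): the V100s failing in this thread are compute capability 7.0 (sm_70), while the T4s reported later as working are 7.5 (sm_75). A minimal sketch to confirm what PyTorch sees:

# Sketch (not from the original report): confirm GPU architecture and toolchain.
# A V100 should report (7, 0), i.e. sm_70; a T4 reports (7, 5).
import torch

print(torch.cuda.get_device_name(0))        # e.g. Tesla V100-SXM2-32GB
print(torch.cuda.get_device_capability(0))  # e.g. (7, 0)
print(torch.__version__, torch.version.cuda)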

Code

from unsloth import FastLanguageModel 
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates

Additional Context

  • The error occurs as soon as training starts (the progress bar reaches 0/60 before the abort)
  • Kernel version (5.4.0) is below the recommended minimum of 5.5.0 (see the sketch after this list)
  • Using Unsloth's pre-quantized 4-bit model
  • Attempting to run on a single GPU setup
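
A minimal sketch (my addition) of the version comparison behind the kernel warning quoted above:

# Sketch: compare the running kernel to the 5.5.0 minimum from the warning.
# platform.release() returns e.g. "5.4.0-99-generic" on this system.
import platform

kernel = platform.release().split("-")[0]
minimum = "5.5.0"
ok = tuple(map(int, kernel.split("."))) >= tuple(map(int, minimum.split(".")))
print(f"kernel {kernel} >= {minimum}: {ok}")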

Steps to Reproduce

  1. Set up conda environment with PyTorch and CUDA
  2. Install Unsloth
  3. Run the example script for fine-tuning
  4. Error occurs as soon as training begins

Questions

  1. Is this error related to the kernel version being below the recommended minimum (5.4.0 < 5.5.0)?
  2. Are there any specific version requirements or compatibility issues with Triton that need to be addressed?
  3. Are there any workarounds available for systems that cannot upgrade their kernel version?
@chengju-zhou

Same issue on V100, but it works fine on T4.

@ergosumdre

I also have a V100 and I'm getting this error too.


LiaoPan commented Nov 26, 2024

I also encountered the same error on V100.


LiaoPan commented Nov 26, 2024

I also encountered the same error on V100.

Temporary solution:

Perhaps change the version of triton, though it will raise some warnings:
$ pip install triton==2.3.0

Waiting for the final solution.
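
If you try this pin, a quick check (my addition, not from the original comment) that the downgrade actually took effect in the active environment:

# Sketch: confirm which triton version gets imported after the pin.
import triton

print(triton.__version__)  # expect "2.3.0" after the downgrade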

@hykilpikonna

> I also encountered the same error on V100.
>
> Temporary solution:
>
> Perhaps change the version of triton, though it will raise some warnings:
> $ pip install triton==2.3.0
>
> Waiting for the final solution.

Which torch version did you use? It seems that torch 2.5.1 isn't compatible:

unsloth_env/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 195, in get_system
    from triton.compiler.compiler import triton_key
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (unsloth_env/lib/python3.11/site-packages/triton/compiler/compiler.py)
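
This ImportError is the usual symptom of a torch/triton version mismatch: each torch wheel pins the triton release its inductor backend expects, so pinning triton independently of torch can break this import. A hedged sketch to compare the installed pair:

# Sketch: print the installed torch and triton distributions side by side;
# they should be a pairing that torch's inductor backend supports.
from importlib.metadata import version

print("torch  :", version("torch"))
print("triton :", version("triton"))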


LiaoPan commented Nov 26, 2024

@hykilpikonna pytorch '2.4.0+cu121'

@danielhanchen
Contributor

Apologies everyone! @LiaoPan @hykilpikonna @ergosumdre @gityeop @chengju-zhou I added a flag to disable some other kernels - I'm unsure if it worked though.

Torch 2.5 and torch 2.4 should now be supported - sadly Colab got rid of V100s so I can't test them - I'm assuming a specific kernel from Apple's Cut Cross Entropy package is the one causing the issues.

Please try updating Unsloth without dependencies and see if that works:

pip uninstall unsloth unsloth-zoo
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth-zoo
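
A quick check (my addition; assumes the distribution names match the pip command above) that the upgrade took effect:

# Sketch: print the installed unsloth / unsloth-zoo versions after upgrading.
from importlib.metadata import version

print("unsloth     :", version("unsloth"))
print("unsloth-zoo :", version("unsloth-zoo"))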

@danielhanchen
Contributor

By the way, to get Torch 2.4, simply run wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python - to get the optimal installation command.

danielhanchen added the fixed - pending confirmation label Nov 26, 2024