Multi-GPU/CPU offloading is still not working as intended #689

Open
haitham-boxmind opened this issue Jan 7, 2025 · 0 comments

Comments


haitham-boxmind commented Jan 7, 2025

I believe this issue should have been fixed by #668; however, it is still happening when I run "examples/quantize.py" on a fine-tuned version of meta-llama/Llama-3.3-70B-Instruct. My system has 125GB of available RAM and 2x A6000 GPUs with 48GB of VRAM each.

Running the script without modifications results in a RAM OOM and the process is automatically killed (only "Killed" is printed after the tqdm progress bar).

If I change "device_map" to "auto", I get a CUDA OOM.

If I add a "max_memory" dict, the script runs, but at some point I get the following error:

NotImplementedError: Cannot copy out of meta tensor; no data!

I tried experimenting with different values for max_memory, as well as for n_parallel_calib_samples and max_calib_samples, but the issue persists.
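
For context, my max_memory experiments were variations of a dict like the one below; the exact budgets and the "cpu" entry are only illustrative, and whether CPU offload is actually handled correctly here is exactly what I am unsure about:

# Illustrative only: per-device budgets that accelerate uses when inferring
# the device map. Keys 0 and 1 are the two A6000s; the "cpu" entry allows
# the remaining weights to be offloaded to system RAM.
max_memory = {0: "38GiB", 1: "38GiB", "cpu": "100GiB"}

model = AutoAWQForCausalLM.from_pretrained(
    args.model_path,
    device_map="auto",
    max_memory=max_memory,
)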

Here is my code:

import argparse
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

def parse_args():
    parser = argparse.ArgumentParser(description="Quantize a model using AWQ.")
    parser.add_argument('--model_path', type=str, required=True, help='Path to the pretrained model.')
    parser.add_argument('--quant_path', type=str, required=True, help='Path to save the quantized model.')
    parser.add_argument('--max_memory', type=str, default="38GIB", help='Max memory allocation per device.')
    parser.add_argument('--max_calib_samples', type=int, default=128, help='Maximum number of calibration samples.')
    return parser.parse_args()

def main():
    args = parse_args()

    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
    model_init_kwargs = {"max_memory": {0: args.max_memory, 1: args.max_memory}}

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(args.model_path, device_map="auto", **model_init_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)

    # Define data loading methods
    def load_custom_dataset():
        data = load_dataset('****', split="train")

        # concatenate data
        def concatenate_data(x):
            return {"text": x['question'] + '\n' + x['answer']}
        
        concatenated = data.map(concatenate_data)
        return [text for text in concatenated["text"]]

    # Quantize
    model.quantize(
        tokenizer,
        calib_data=load_custom_dataset(),
        quant_config=quant_config,
        max_calib_samples=args.max_calib_samples,
    )

    # Save quantized model
    model.save_quantized(args.quant_path)
    tokenizer.save_pretrained(args.quant_path)

    print(f'Model is quantized and saved at "{args.quant_path}"')

if __name__ == "__main__":
    main()
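
For reference, I invoke the script roughly like this (the paths below are placeholders, not the real ones):

python quantize.py \
    --model_path /path/to/finetuned-llama-3.3-70b \
    --quant_path /path/to/output-awq \
    --max_memory 38GIB \
    --max_calib_samples 128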