I believe this issue should have been solved in #668; however, it is still happening when I run `examples/quantize.py` on a finetuned version of meta-llama/Llama-3.3-70B-Instruct. My system has 125GB of available RAM and 2x A6000 GPUs with 48GB of VRAM each.
Running the script without modifications results in a RAM OOM, and the process is killed (it only prints "Killed" after the tqdm progress bar).
If I change `device_map` to `auto`, I get a CUDA OOM. If I add a `max_memory` dict, the script runs, but at some point I get this error:

```
NotImplementedError: Cannot copy out of meta tensor; no data!
```
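For what it's worth, my understanding is that this error shows up when accelerate has left some weights on the meta device (i.e., they were never materialized, so there is no data to copy). A minimal diagnostic along these lines (assuming the underlying HF model is reachable as `model.model`, as in AutoAWQ's wrapper) lists which parameters ended up there:

```python
# Hypothetical diagnostic: list parameters that accelerate left on the
# meta device; such tensors hold no data, which is exactly what the
# "Cannot copy out of meta tensor" error complains about.
for name, param in model.model.named_parameters():
    if param.device.type == "meta":
        print(name)
```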
I tried experimenting with different values in `max_memory` and different values for `n_parallel_calib_samples` and `max_calib_samples`, but the issue persists.
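Concretely, the variations looked roughly like this (the names refer to the full script below; the exact values are illustrative, not the precise ones I tried, and I'm assuming `n_parallel_calib_samples` is accepted as a keyword argument by AutoAWQ's `quantize`):

```python
# Illustrative variants (values are examples only):
model = AutoAWQForCausalLM.from_pretrained(
    args.model_path,
    device_map="auto",
    max_memory={0: "30GIB", 1: "30GIB", "cpu": "80GIB"},  # also capping CPU offload
)
model.quantize(
    tokenizer,
    calib_data=load_custom_dataset(),
    quant_config=quant_config,
    n_parallel_calib_samples=32,  # process calibration samples in smaller batches
    max_calib_samples=64,         # fewer calibration samples overall
)
```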
Here is the full script:
```python
import argparse

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset


def parse_args():
    parser = argparse.ArgumentParser(description="Quantize a model using AWQ.")
    parser.add_argument('--model_path', type=str, required=True, help='Path to the pretrained model.')
    parser.add_argument('--quant_path', type=str, required=True, help='Path to save the quantized model.')
    parser.add_argument('--max_memory', type=str, default="38GIB", help='Max memory allocation per device.')
    parser.add_argument('--max_calib_samples', type=int, default=128, help='Maximum number of calibration samples.')
    return parser.parse_args()


def main():
    args = parse_args()
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
    model_init_kwargs = {"max_memory": {0: args.max_memory, 1: args.max_memory}}

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(args.model_path, device_map="auto", **model_init_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)

    # Define data loading methods
    def load_custom_dataset():
        data = load_dataset('****', split="train")

        # concatenate data
        def concatenate_data(x):
            return {"text": x['question'] + '\n' + x['answer']}

        concatenated = data.map(concatenate_data)
        return [text for text in concatenated["text"]]

    # Quantize
    model.quantize(
        tokenizer,
        calib_data=load_custom_dataset(),
        quant_config=quant_config,
        max_calib_samples=args.max_calib_samples,
    )

    # Save quantized model
    model.save_quantized(args.quant_path)
    tokenizer.save_pretrained(args.quant_path)

    print(f'Model is quantized and saved at "{args.quant_path}"')


if __name__ == "__main__":
    main()
```
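For reference, I invoke the script along these lines (paths are placeholders):

```bash
python quantize.py \
  --model_path /path/to/llama-3.3-70b-finetune \
  --quant_path /path/to/llama-3.3-70b-awq \
  --max_memory 38GIB \
  --max_calib_samples 128
```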