How to run the code with smaller ckpt like opt-6.7B #24

Open
czq693497091 opened this issue Apr 6, 2024 · 14 comments

Comments

@czq693497091

With limited GPU resources, how can I use opt-6.7b just to run the code?

@czq693497091
Author

I successfully ran collect_sp_data.sh with opt-6.7b and opt-66b. But when I use sparse_predictor to train the MLP predictor for opt-6.7b and opt-66b, both show an all-zero y, which means that mlp_label_0.mmap is all zero, and the training process ends. Is this problem widespread, and how can it be solved?

@xbzjsj

xbzjsj commented Apr 17, 2024

The predictor also needs to be trained. Are you running with a single GPU?

@czq693497091
Author

I trained the predictor with more than one GPU. But when I tried to train the MLP predictor, the labels were all zero. I don't know how to solve it.

@kechengcode

I encountered the same problem, but I don't know how to solve it. Have you solved it yet?

@czq693497091
Author

Still not. I emailed the author, but she said that it should not be all zero.

@kechengcode

Sad.. (T_T).. Thanks for your response.

@kechengcode

Hey, I have found the reason for the problem. The issue is with fp_label: the author allocated an fp_label.mmap file of size [400000, (4 * hidden_size)] for storing fp_label, but in reality fp_label does not contain that much data. When splitting off the validation set, the author selected the last 0.05 * len(fp_label) rows of the fp_label file, which resulted in reading empty data. As a result, the MLP cannot be trained effectively. To solve this problem, you can modify the get_data(args, l) function in ./DejaVu/sparse_predictor/main_mlp.py.

def get_data(args, l):
    # Assumes numpy is imported as np and CONFIG/DATA are defined in main_mlp.py.
    if CONFIG[args.model]['ckt_storage'] == "bylayer":
        # path = f"{DATA[args.model][args.dataset]}/mlp_x_{l}.mmap"
        path = f"{DATA[args.model][args.dataset]}/mlp_sp_x_{l}.mmap"
        print(f"Reading query from {path}")
        query = np.array(np.memmap(path, dtype='float16', mode='r', shape=(400000, CONFIG[args.model]['d']))[: CONFIG[args.model]['N']])
        path = f"{DATA[args.model][args.dataset]}/mlp_label_{l}.mmap"
        print(f"Reading MLP label from {path}")
        label = np.array(np.memmap(path, dtype='float16', mode='r', shape=(400000, CONFIG[args.model]['d'] * 4))[: CONFIG[args.model]['N']])

        # Count the rows that actually contain labels; rows never written stay all zero.
        num_valid = (label.sum(-1) > 0).sum()
        print(num_valid)
        # Keep only the leading num_valid rows (assumes the written rows are
        # contiguous at the start of the file) so the validation split is not empty.
        return query[:num_valid], label[:num_valid]
        # return query, label

This is my solution, I hope it will be helpful to you😀
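
The failure mode described above can be reproduced on a toy file. This is a minimal sketch under stated assumptions, not the DejaVu code: the sizes `ALLOCATED_ROWS`, `LABEL_DIM`, and `WRITTEN_ROWS` and the helpers `make_partial_label_file`, `load_labels`, and `tail_split` are hypothetical stand-ins for the 400000-row preallocation, the 4 * hidden_size label width, and the "last 0.05 of the rows" validation split.

```python
import os
import tempfile

import numpy as np

# Hypothetical small sizes standing in for the 400000 preallocated rows
# and the 4 * hidden_size label width discussed above.
ALLOCATED_ROWS = 1000
LABEL_DIM = 16
WRITTEN_ROWS = 120  # rows the collection run actually filled


def make_partial_label_file(path):
    """Preallocate a label mmap and fill only the first WRITTEN_ROWS rows."""
    rng = np.random.default_rng(0)
    mm = np.memmap(path, dtype="float16", mode="w+",
                   shape=(ALLOCATED_ROWS, LABEL_DIM))
    rows = (rng.random((WRITTEN_ROWS, LABEL_DIM)) > 0.5).astype("float16")
    rows[:, 0] = 1.0  # guarantee every written row has at least one active unit
    mm[:WRITTEN_ROWS] = rows
    mm.flush()
    del mm


def load_labels(path):
    """Read the whole preallocated file into memory, zero tail included."""
    mm = np.memmap(path, dtype="float16", mode="r",
                   shape=(ALLOCATED_ROWS, LABEL_DIM))
    return np.array(mm)


def tail_split(label, frac=0.05):
    """Take the last `frac` of the rows as the validation split."""
    n_val = int(frac * len(label))
    return label[:-n_val], label[-n_val:]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mlp_label_demo.mmap")
    make_partial_label_file(path)
    label = load_labels(path)

# Naive split: the validation rows come from the never-written tail.
_, naive_val = tail_split(label)
naive_val_nonzero = int(np.count_nonzero(naive_val))

# Trim to the rows that contain any label before splitting.
num_valid = int((label.sum(-1) > 0).sum())
_, trimmed_val = tail_split(label[:num_valid])
trimmed_val_nonzero = int(np.count_nonzero(trimmed_val))

print(naive_val_nonzero, num_valid, trimmed_val_nonzero)
```

In this toy setup the naive validation split contains no nonzero labels, while the trimmed split does. Note that trimming with `label[:num_valid]` is only safe when the written rows sit contiguously at the start of the file, which is what the sketch constructs.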

@czq693497091
Author

Cool, thanks for your solution. I will try it 😀

@susavlsh10

Thanks!! this works :D

@jason-huang03

jason-huang03 commented Jun 27, 2024

Sorry, but I ran into a problem when running the 6.7b model. I got

RuntimeError: The expanded size of the tensor (2048) must match the existing size (2047) at non-singleton dimension 1.  Target sizes: [1, 2048].  Tensor sizes: [2047]

at line 485 of Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py:

        self.ret_tokens[
            index * self.micro_batch_size : (index + 1) * self.micro_batch_size,
            : self.i_current_token,
        ] = original_indices

I am running on a single GPU, and the script looks like this:

#!/bin/bash

file=./c4_train.jsonl
    
echo "start running ${file}"

ARGS="--model-name opt-6.7b-converted \
--model-type opt-save \
--seed 42 \
--fp16 \
--num-layers 32 \
--max-layers 96 \
--budget 22800 \
--num-iters 2000 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 1 \
--world-size 1 --pipeline-group-size 1 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"

python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0

Does anyone know what the problem is?
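
The shape mismatch in the error message can be reproduced in isolation. The sketch below uses NumPy rather than PyTorch, so the exception type and wording differ from the traceback above. The names `ret_tokens`, `original_indices`, and `i_current_token` are hypothetical stand-ins sized to match the numbers in the message. It only illustrates what the message says (a slice 2048 columns wide receiving 2047 values); it does not identify why the pipeline ends up with 2047 tokens.

```python
import numpy as np

# Hypothetical stand-ins for self.ret_tokens and original_indices,
# sized after the numbers in the error message.
ret_tokens = np.zeros((1, 2048), dtype=np.int64)
original_indices = np.arange(2047, dtype=np.int64)
i_current_token = 2048

try:
    # Target slice has shape (1, 2048); the source has shape (2047,).
    ret_tokens[0:1, :i_current_token] = original_indices
    mismatch = None
except ValueError as err:
    mismatch = str(err)

print(mismatch)

# For illustration only (not a proposed patch): the assignment goes
# through once the slice width equals the source length.
ret_tokens[0:1, : original_indices.shape[0]] = original_indices
```

A useful first check in the real code, under the same caveat, would be to print `self.i_current_token` and `original_indices.shape` just before the failing assignment to see which of the two is off by one.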

@rhmaaa

rhmaaa commented Jul 22, 2024

Did you solve it?

@zzzlxhhh

Hi, thanks for the solution. That works for me. Also, do you know why the size is specifically set to 400000, which is rather large? I think 400000 far exceeds the size set in run_infer_opt_175b_collect_sp_data.sh. I do not understand the training or the process of collecting the sparse training data. The code is confusing.

@mli-tian

I have the same problem as jason-huang03 above (the RuntimeError at line 485 when running the 6.7b model).

@kechengcode

Sorry, I don't know why the size is specifically set to 400000. (╯︵╰)


8 participants