
Embedding the MLDR English dataset with the BGE-M3 model is very slow #1295

Open
JackTan25 opened this issue Dec 18, 2024 · 1 comment

JackTan25 commented Dec 18, 2024

I'm currently using the BGE-M3 model to encode the MLDR dataset into dense vectors. Here is my code:

import os
import struct
import time
import datasets
import numpy as np
from tqdm import tqdm
from FlagEmbedding import FlagModel
from dataclasses import dataclass, field
from transformers import HfArgumentParser
from mldr_common_tools import EvalArgs, check_languages, load_corpus


@dataclass
class ModelArgs:
    encoder: str = field(default="BAAI/bge-m3", metadata={'help': 'Name or path of encoder'})
    fp16: bool = field(default=True, metadata={'help': 'Use fp16 in inference?'})
    pooling_method: str = field(default='cls', metadata={'help': "Pooling method. Available methods: 'cls', 'mean'"})
    normalize_embeddings: bool = field(default=True, metadata={'help': "Normalize embeddings or not"})


def get_model(model_args: ModelArgs):
    model = FlagModel(model_args.encoder, pooling_method=model_args.pooling_method,
                      normalize_embeddings=model_args.normalize_embeddings, use_fp16=model_args.fp16)
    return model


def generate_dense(model: FlagModel, corpus: datasets.Dataset, max_passage_length: int, batch_size: int, begin_pos: int,
                   end_pos: int):
    dense_embeddings = model.encode_corpus(corpus["text"][begin_pos: end_pos], batch_size=batch_size,
                                           max_length=max_passage_length)
    dense_embeddings = dense_embeddings.astype(np.float32)
    return dense_embeddings


def save_result(dense_embeddings, dense_save_file: str):
    with open(dense_save_file, 'wb') as f:
        for one_dense in tqdm(dense_embeddings, desc="Saving dense embeddings"):
            dim = one_dense.shape[-1]
            f.write(struct.pack('<i', dim))
            one_dense.astype('float32').tofile(f)


def main():
    parser = HfArgumentParser([ModelArgs, EvalArgs])
    model_args, eval_args = parser.parse_args_into_dataclasses()
    model_args: ModelArgs
    eval_args: EvalArgs

    languages = check_languages(eval_args.languages)

    if model_args.encoder[-1] == '/':
        model_args.encoder = model_args.encoder[:-1]

    model = get_model(model_args=model_args)

    encoder = model_args.encoder
    if os.path.basename(encoder).startswith('checkpoint-'):
        encoder = os.path.dirname(encoder) + '_' + os.path.basename(encoder)

    print("==================================================")
    print("Start generating embedding with model:")
    print(model_args.encoder)

    print('Generate embedding of following languages: ', languages)
    for lang in languages:
        print("**************************************************")
        embedding_save_dir = os.path.join(eval_args.embedding_save_dir, os.path.basename(encoder), lang)
        if not os.path.exists(embedding_save_dir):
            os.makedirs(embedding_save_dir)
        dense_save_file = os.path.join(embedding_save_dir, f'dense-{eval_args.begin_pos}-{eval_args.end_pos}.fvecs')
        if os.path.exists(dense_save_file) and not eval_args.overwrite:
            print(f'Embedding of {lang} already exists. Skip...')
            continue

        print(f"Start generating embedding of {lang} ...")
        corpus = load_corpus(lang)
        start_time = time.time()
        dense_embeddings = generate_dense(model=model, corpus=corpus, max_passage_length=eval_args.max_passage_length,
                                          batch_size=eval_args.batch_size, begin_pos=eval_args.begin_pos,
                                          end_pos=eval_args.end_pos)
        end_time = time.time()
        # Compute the elapsed time and convert it to milliseconds
        elapsed_time_ms = (end_time - start_time) * 1000
        print(f"generate_dense_embedding time cost: {elapsed_time_ms:.2f} ms")
        with open('generate_dense_embedding_time.txt', 'w') as file:
            file.write(f"Elapsed time: {elapsed_time_ms:.2f} ms\n")
        save_result(dense_embeddings, dense_save_file)

    print("==================================================")
    print("Finish generating embeddings with model:")
    print(model_args.encoder)


if __name__ == "__main__":
    main()
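(For reference: `save_result` above writes the classic .fvecs layout, i.e. for each vector a little-endian int32 dimension followed by `dim` float32 values. A minimal matching reader sketch; `read_fvecs` is a hypothetical helper, not part of the original script:)

```python
import struct

import numpy as np


def read_fvecs(path: str) -> np.ndarray:
    """Read vectors written in .fvecs layout: per vector, a little-endian
    int32 dimension header followed by `dim` little-endian float32 values."""
    vectors = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:  # end of file
                break
            dim = struct.unpack('<i', header)[0]
            vec = np.fromfile(f, dtype='<f4', count=dim)
            vectors.append(vec)
    return np.vstack(vectors)
```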

I run the following script:

python generate_dense_embedding.py \
--begin_pos 0 \
--end_pos 200000 \
--languages en \
--embedding_save_dir ./corpus-embedding \
--max_passage_length 8192 \
--batch_size 1 \
--fp16 True

Here is my machine's GPU configuration:

Wed Dec 18 19:46:22 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:18:00.0 Off |                  Off |
| 30%   30C    P8             12W /  250W |    5720MiB /  32760MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:3B:00.0 Off |                    1 |
| 30%   35C    P8             17W /  250W |      14MiB /  30712MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:86:00.0 Off |                    0 |
| 30%   38C    P8             13W /  250W |    5720MiB /  30712MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:AF:00.0 Off |                  Off |
| 54%   78C    P2            206W /  250W |   14887MiB /  32760MiB |     87%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4070      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A    341143      C   /mnt/HDD1/wmz19/miniconda3/bin/python3       5702MiB |
|    1   N/A  N/A      4070      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      4070      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A    341237      C   /mnt/HDD1/wmz19/miniconda3/bin/python3       5702MiB |
|    3   N/A  N/A      4070      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A    341286      C   /mnt/HDD1/wmz19/miniconda3/bin/python3       5702MiB |
|    3   N/A  N/A    392717      C   python                                       9164MiB |
+-----------------------------------------------------------------------------------------+

It has been running since 11 PM last night; here is the progress so far:

pre tokenize: 100%|██████████████████████████████████████████████████████████████████| 50000/50000 [06:17<00:00, 132.47it/s]
pre tokenize:  71%|██████████████████████████████████████████████▊                   | 35438/50000 [06:21<01:54, 126.81it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
pre tokenize: 100%|██████████████████████████████████████████████████████████████████| 50000/50000 [06:20<00:00, 131.41it/s]
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
pre tokenize: 100%|██████████████████████████████████████████████████████████████████| 50000/50000 [08:10<00:00, 101.90it/s]
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|█████████████████████████████████████████████████████████| 50000/50000 [2:22:23<00:00,  5.85it/s]
Inference Embeddings: 100%|█████████████████████████████████████████████████████████| 50000/50000 [2:46:34<00:00,  5.00it/s]
Inference Embeddings: 100%|█████████████████████████████████████████████████████████| 50000/50000 [3:31:52<00:00,  3.93it/s]
Chunks:  75%|████████████████████████████████████████████████████████▎                  | 3/4 [3:40:11<1:02:32, 3752.99s/it]  

Do you have any suggestions? It feels quite slow. Also, what are these "Chunks" doing?

@JackTan25 JackTan25 changed the title from "Ran into problems embedding the MLDR English dataset with the BGE-M3 model" to "Embedding the MLDR English dataset with the BGE-M3 model is very slow" on Dec 18, 2024
@545999961 (Collaborator)

Because the texts in MLDR are all very long, processing is indeed somewhat slow.
"Chunks" is multi-GPU, multi-process acceleration: each chunk corresponds to one GPU worker process launched on one card.
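That matches the log above: 200,000 passages split across 4 visible GPUs gives 4 chunks of 50,000 passages each, which is why the progress bar reads `Chunks: 3/4`. A minimal sketch of how such an even split can be computed (illustrative only; `split_into_chunks` is a hypothetical helper, not FlagEmbedding's actual implementation):

```python
import math


def split_into_chunks(n_items: int, n_procs: int) -> list[tuple[int, int]]:
    """Split n_items into contiguous (begin, end) ranges, one per worker
    process, so each GPU encodes its own chunk of the corpus."""
    chunk_size = math.ceil(n_items / n_procs)
    return [(i, min(i + chunk_size, n_items))
            for i in range(0, n_items, chunk_size)]


# 200000 passages across 4 GPUs -> 4 chunks of 50000 each, as in the log
print(split_into_chunks(200000, 4))
```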
