Vietnamese Massive Text Embedding Benchmark

Installation

V-MTEB is developed based on MTEB.

Clone this repo and install it in editable mode:

git clone https://github.com/Iambestfeed/V-MTEB.git
cd V-MTEB
pip install -e .

Evaluation

Evaluate reranker

python eval_cross_encoder.py --model_name_or_path BAAI/bge-reranker-base

Evaluate embedding model

  • With scripts
    Scripts will be updated soon.

  • With sentence-transformers

You can use V-MTEB easily in the same way as MTEB.

from mteb import MTEB
from V_MTEB import *
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "fill-your-model-name"

model = SentenceTransformer(model_name)
evaluation = MTEB(task_langs=['vie'])
results = evaluation.run(model, output_folder=f"vi_results/{model_name}")
  • Using a custom model
    To evaluate a new model, load it through sentence_transformers if it is supported there. Otherwise, implement the model as a class like the one below: it needs an encode function that takes a list of sentences and returns a list of embeddings (each embedding can be an np.array, a torch.tensor, etc.):
class MyModel():
    def encode(self, sentences, batch_size=32, **kwargs):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["Vietnamese_Student_Topic"])
evaluation.run(model)
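
To make the encode contract above concrete, here is a minimal sketch of a class that satisfies it. HashingEncoder and its dim parameter are hypothetical names invented for illustration and are not part of V-MTEB or MTEB. The hashed bag-of-words vectors are only a stand-in for a real model, and they are returned as plain Python lists to keep the sketch dependency-free; for an actual evaluation you would most likely return np.ndarray or tensor embeddings instead.

```python
import hashlib
import math
from typing import List


class HashingEncoder:
    """Toy stand-in for an embedding model (hypothetical, illustration only)."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def _embed(self, sentence: str) -> List[float]:
        vec = [0.0] * self.dim
        for token in sentence.lower().split():
            # Use a stable hash so a token always lands in the same bucket.
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "little") % self.dim
            vec[bucket] += 1.0
        # L2-normalise; an empty sentence stays as the zero vector.
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm > 0 else vec

    def encode(self, sentences, batch_size=32, **kwargs):
        """Return one fixed-size embedding per input sentence, in order."""
        embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            embeddings.extend(self._embed(s) for s in batch)
        return embeddings


model = HashingEncoder(dim=64)
vectors = model.encode(["xin chào", "học máy", "xin chào"], batch_size=2)
```

The **kwargs parameter lets the class accept extra arguments a caller may pass to encode without failing. An instance like this would be passed to evaluation.run in place of MyModel.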

Leaderboard

Will be updated soon.

Tasks

An overview of the tasks and datasets available in V-MTEB is provided in the following table:

Name | Hub URL | Description | Type | Category | Test #Samples

Acknowledgement

We thank the Massive Text Embedding Benchmark (MTEB) project for its great tooling and the Vietnamese NLP community for its open-source datasets.

Citation

If you find this repository useful, please consider citing this repo.