update docs

hanhainebula · Dec 3, 2024 · 1374b98
1 parent 6d9fa4e
commit 1374b98

Showing 20 changed files with 435 additions and 52 deletions.
diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
@@ -13,7 +13,7 @@ jobs:
       - uses: actions/setup-python@v5
       - name: Install dependencies
         run: |
-          pip install . sphinx sphinx_rtd_theme myst_parser myst-nb furo
+          pip install . sphinx myst_parser myst-nb sphinx-design pydata-sphinx-theme
       - name: Sphinx build
         run: |
           sphinx-build docs/source docs/build

diff --git a/README.md b/README.md
@@ -159,7 +159,7 @@ Currently we are updating the [tutorials](./Tutorials/), we aim to create a comp
 The following contents are releasing in the upcoming weeks:
 
 - Evaluation
-- RAG
+- BGE-EN-ICL
 
 <details>
   <summary>The whole tutorial roadmap</summary>

diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -1,3 +1,5 @@
 sphinx
 myst-nb
-furo
+sphinx-design
+pydata-sphinx-theme
+# furo
diff --git a/docs/source/API/abc.rst b/docs/source/API/abc.rst
@@ -3,4 +3,5 @@ Abstract Class
 
 .. toctree::
     abc/inference
+    abc/evaluation
     abc/finetune
diff --git a/docs/source/API/index.rst b/docs/source/API/index.rst
@@ -0,0 +1,11 @@
+API
+===
+
+.. toctree::
+   :hidden:
+   :maxdepth: 1
+
+   abc
+   inference
+   evaluation
+   finetune
diff --git a/docs/source/FAQ/index.rst b/docs/source/FAQ/index.rst
@@ -0,0 +1,2 @@
+FAQ
+===
diff --git a/docs/source/Introduction/concept.rst b/docs/source/Introduction/concept.rst
@@ -0,0 +1,37 @@
+Concept
+=======
+
+Embedder
+--------
+
+An embedder, or embedding model, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
+These embeddings capture the semantic meaning or key features of the input, which enables efficient comparison and analysis.
+
+A very famous demonstration is the example from `word2vec <https://arxiv.org/abs/1301.3781>`_. It shows how word embeddings capture semantic relationships through vector arithmetic:
+
+.. image:: ../_static/img/word2vec.png
+   :width: 500
+   :align: center
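The vector arithmetic behind the word2vec example can be sketched with toy numbers. The 2-d vectors below are made up for illustration (one axis loosely stands for royalty, the other for gender); they are not real word2vec embeddings, and ``nearest`` is a hypothetical helper.

```python
import math

# Made-up 2-d "embeddings": axis 0 ~ royalty, axis 1 ~ gender (illustrative only).
vocab = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(target, exclude):
    """Return the vocabulary word closest to `target`, skipping the words in `exclude`."""
    candidates = [w for w in vocab if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vocab[w]))


# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]
answer = nearest(target, exclude={"king", "man", "woman"})
```

With these toy vectors the nearest remaining word is "queen"; real embeddings only approximate this behaviour.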
+
+Nowadays, embedders are capable of mapping sentences and even passages into vector space.
+They are widely used in real-world tasks such as retrieval and clustering.
+In the era of LLMs, embedding models play a pivotal role in RAG, enabling LLMs to access and integrate relevant context from vast external datasets.
+
+Reranker
+--------
+
+A reranker, or Cross-Encoder, is a model that refines the ranking of candidate pairs (e.g., query-document pairs) by jointly encoding and scoring them.
+
+Typically, we use an embedder as a Bi-Encoder: it first computes the embeddings of the two input sentences, then computes their similarity using a metric such as cosine similarity or Euclidean distance.
+A reranker, in contrast, takes both sentences at the same time and directly computes a score representing their similarity.
+
+The following figure shows their difference:
+
+.. figure:: https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png
+   :width: 500
+   :align: center
+
+   Bi-Encoder & Cross-Encoder (from Sentence Transformers)
+
+Although a Cross-Encoder usually performs better than a Bi-Encoder, it is extremely time-consuming to use when there is a large amount of data.
+Thus a widely accepted approach is to use a Bi-Encoder for initial retrieval (e.g., selecting the top 100 candidates from 100,000 sentences) and then refine the ranking of the selected candidates using a Cross-Encoder for more accurate results.
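The two-stage approach described above can be sketched as follows. This is a minimal illustration in plain Python, not FlagEmbedding code: ``retrieve_then_rerank`` is a hypothetical helper, the vectors are toy data, and the "cross-encoder" is stood in for by a lookup table of made-up scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, doc_vecs, pair_scorer, top_k):
    """Stage 1: keep the top_k documents by embedding similarity (Bi-Encoder).
    Stage 2: reorder only those candidates with the more expensive pair scorer."""
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:top_k]
    return sorted(candidates, key=pair_scorer, reverse=True)


# Toy embeddings for one query and four documents.
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]

# Made-up cross-encoder scores for (query, document i); a real reranker would compute these.
cross_scores = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}

ranked = retrieve_then_rerank(query_vec, doc_vecs, lambda i: cross_scores[i], top_k=2)
```

Here only documents 0 and 1 reach the second stage, and the reranker swaps their order. Document 2 is never rescored despite its high pair score, which is the recall cost of keeping the first stage cheap.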
diff --git a/docs/source/Introduction/index.rst b/docs/source/Introduction/index.rst
@@ -0,0 +1,19 @@
+Introduction
+============
+
+BGE builds a one-stop retrieval toolkit for search and RAG. We provide inference, evaluation, and fine-tuning for embedding models and rerankers.
+
+.. figure:: ../_static/img/RAG_pipeline.png
+   :width: 700
+   :align: center
+
+   BGE embedder and reranker in an RAG pipeline.
+
+Quickly get started with:
+
+.. toctree::
+   :maxdepth: 1
+
+   installation
+   concept
+   quick_start
diff --git a/docs/source/Introduction/installation.rst b/docs/source/Introduction/installation.rst
@@ -40,4 +40,9 @@ For development in editable mode:
     # If you do not want to finetune the models, you can install the package without the finetune dependency:
     pip install -e .
     # If you want to finetune the models, you can install the package with the finetune dependency:
-    pip install -e .[finetune]
+    pip install -e .[finetune]
+
+PyTorch-CUDA
+------------
+
+If you want to use CUDA GPUs for inference and fine-tuning, install an appropriate version of `PyTorch <https://pytorch.org/get-started/locally/>`_ with CUDA support.
diff --git a/docs/source/_static/css/custom.css b/docs/source/_static/css/custom.css
@@ -0,0 +1,9 @@
+.bd-sidebar-primary {
+    width: 22%;
+    line-height: 1.4;
+}
+
+.col-lg-3 {
+    flex: 0 0 auto;
+    width: 22%;
+}
diff --git a/docs/source/_static/img/RAG_pipeline.png b/docs/source/_static/img/RAG_pipeline.png
diff --git a/docs/source/_static/img/word2vec.png b/docs/source/_static/img/word2vec.png
diff --git a/docs/source/bge/bge_m3.rst b/docs/source/bge/bge_m3.rst
@@ -1,2 +1,117 @@
+======
 BGE-M3
-======
+======
+
+BGE-M3 is a compound and powerful embedding model distinguished by its versatility in:
+
+- **Multi-Functionality**: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
+- **Multi-Linguality**: It can support more than 100 working languages.
+- **Multi-Granularity**: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
+
++-------------------------------------------------------------------+-----------------+------------+--------------+-----------------------------------------------------------------------+
+|                                  Model                            |    Language     | Parameters |  Model Size  |                              Description                              |
++===================================================================+=================+============+==============+=======================================================================+
+| `BAAI/bge-m3 <https://huggingface.co/BAAI/bge-m3>`_               |  Multi-Lingual  |    569M    |    2.27 GB   | Multi-Functionality, Multi-Linguality, and Multi-Granularity          |
++-------------------------------------------------------------------+-----------------+------------+--------------+-----------------------------------------------------------------------+
+
+Multi-Linguality
+================
+
+BGE-M3 was trained on multiple datasets covering more than 170 languages.
+Because the amount of training data is highly unbalanced across languages, the model's actual performance varies from language to language.
+
+For more information on the datasets and evaluation results, please check out our `paper <https://arxiv.org/pdf/2402.03216>`_.
+
+Multi-Granularity
+=================
+
+We extend the maximum position to 8192, enabling the embedding of longer inputs.
+We also propose a simple but effective method, MCLS (Multiple CLS), to enhance the model's ability on long text without additional fine-tuning.
+
+Multi-Functionality
+===================
+
+.. code:: python
+
+    from FlagEmbedding import BGEM3FlagModel
+
+    model = BGEM3FlagModel('BAAI/bge-m3')
+    sentences_1 = ["What is BGE M3?", "Definition of BM25"]
+    sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
+                   "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
+
+Dense Retrieval
+---------------
+
+Similar to BGE v1 or v1.5 models, BGE-M3 uses the normalized hidden state of the special token [CLS] as the dense embedding:
+
+.. math:: e_q = norm(H_q[0])
+
+Next, the relevance score between the query and passage is computed as:
+
+.. math:: s_{dense}=f_{sim}(e_p, e_q)
+
+where :math:`e_p, e_q` are the embedding vectors of passage and query, respectively.
+
+:math:`f_{sim}` is the score function (such as inner product or L2 distance) for computing the similarity of two embeddings.
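The dense score above can be sketched in plain Python. This is an illustration only, assuming the inner product as the score function; the hidden states are toy numbers and ``dense_score`` is a hypothetical helper, not part of FlagEmbedding.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dense_score(h_q, h_p):
    """h_q and h_p are per-token hidden states; index 0 is the [CLS] position.
    The score is the inner product of the two normalized [CLS] vectors."""
    e_q = l2_normalize(h_q[0])
    e_p = l2_normalize(h_p[0])
    return sum(a * b for a, b in zip(e_q, e_p))


# Toy hidden states: two tokens each, hidden size 2.
H_q = [[3.0, 4.0], [1.0, 1.0]]
H_p = [[4.0, 3.0], [0.0, 2.0]]

s_dense = dense_score(H_q, H_p)  # (0.6, 0.8) . (0.8, 0.6) = 0.96
```

Because both vectors are normalized, the inner product here coincides with cosine similarity.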
+
+Sparse Retrieval
+----------------
+
+BGE-M3 generates sparse embeddings by adding a linear layer and a ReLU activation function on top of the hidden states:
+
+.. math:: w_{qt} = \text{ReLU}(W_{lex}^T H_q [i])
+
+where :math:`W_{lex}` represents the weights of the linear layer and :math:`H_q[i]` is the encoder's output for the :math:`i^{th}` token.
+
+Based on the token weights of the query and passage, their relevance score is computed from the joint importance of the terms that occur in both:
+
+.. math:: s_{lex} = \sum_{t\in q\cap p}(w_{qt} * w_{pt})
+
+where :math:`w_{qt}, w_{pt}` are the importance weights of each shared term :math:`t` in the query and passage, respectively.
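The two sparse formulas above can be sketched as follows. The tokens, hidden states, and ``w_lex`` are toy values, and both helpers are hypothetical. Keeping the maximum weight when a token repeats is an assumption of this sketch, not something the text above specifies.

```python
def relu(x):
    return max(0.0, x)


def lexical_weights(tokens, hidden_states, w_lex):
    """Map each token to ReLU(w_lex . H[i]).
    Assumption: if a token repeats, keep its largest weight."""
    weights = {}
    for token, h in zip(tokens, hidden_states):
        w = relu(sum(a * b for a, b in zip(w_lex, h)))
        weights[token] = max(weights.get(token, 0.0), w)
    return weights


def lexical_score(w_q, w_p):
    """Sum of weight products over the terms present in both query and passage."""
    shared = w_q.keys() & w_p.keys()
    return sum(w_q[t] * w_p[t] for t in shared)


w_lex = [1.0, -1.0]  # toy projection from hidden size 2 to a scalar

w_q = lexical_weights(["bm25", "definition"], [[2.0, 0.5], [0.5, 1.0]], w_lex)
w_p = lexical_weights(["bm25", "ranks", "bm25"], [[1.0, 0.0], [3.0, 1.0], [2.0, 0.0]], w_lex)

s_lex = lexical_score(w_q, w_p)  # only "bm25" is shared: 1.5 * 2.0
```

Terms that appear on only one side ("definition", "ranks") contribute nothing to the score.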
+
+Multi-Vector
+------------
+
+The multi-vector method utilizes the entire output embeddings for the representation of query :math:`E_q` and passage :math:`E_p`.
+
+.. math:: 
+
+    E_q = norm(W_{mul}^T H_q)
+
+    E_p = norm(W_{mul}^T H_p)
+
+where :math:`W_{mul}` is the learnable projection matrix.
+
+Following ColBERT, BGE-M3 uses late interaction to compute the fine-grained relevance score:
+
+.. math:: s_{mul}=\frac{1}{N}\sum_{i=1}^N\max_{j=1}^M E_q[i]\cdot E_p^T[j]
+
+where :math:`E_q, E_p` are the entire output embeddings of query and passage, respectively.
+
+This is the average, over all vectors :math:`v\in E_q`, of the maximum similarity between :math:`v` and the vectors in :math:`E_p`.
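The late-interaction score can be sketched in plain Python. The sketch assumes the token vectors have already been projected and normalized; the inputs are toy data and ``late_interaction_score`` is a hypothetical helper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(E_q, E_p):
    """For each query vector take its best match among the passage vectors,
    then average those maxima over the N query vectors."""
    best_matches = [max(dot(q, p) for p in E_p) for q in E_q]
    return sum(best_matches) / len(E_q)


# Toy, already-normalized token vectors: N = 2 query tokens, M = 3 passage tokens.
E_q = [[1.0, 0.0], [0.0, 1.0]]
E_p = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]

s_mul = late_interaction_score(E_q, E_p)  # (1.0 + 0.8) / 2
```

The cost grows with N times M per query-passage pair, which is why this score is better suited to reranking a short candidate list than to first-stage retrieval.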
+
+Hybrid Ranking
+--------------
+
+BGE-M3's multi-functionality makes hybrid ranking possible as a way to improve retrieval.
+First, because the multi-vector method is expensive, we can retrieve the candidates with either the dense or the sparse method.
+Then, to get the final result, we can rerank the candidates based on the integrated relevance score:
+
+.. math:: s_{rank} = w_1\cdot s_{dense}+w_2\cdot s_{lex} + w_3\cdot s_{mul}
+
+where the values chosen for :math:`w_1`, :math:`w_2` and :math:`w_3` vary depending on the downstream scenario.
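The weighted combination above can be sketched as a reranking step. The per-candidate scores and the weights below are made up for illustration and are not recommended values; ``hybrid_score`` is a hypothetical helper.

```python
def hybrid_score(s_dense, s_lex, s_mul, weights):
    """Weighted sum of the dense, sparse (lexical), and multi-vector scores."""
    w1, w2, w3 = weights
    return w1 * s_dense + w2 * s_lex + w3 * s_mul


# Made-up (s_dense, s_lex, s_mul) scores for candidates returned by first-stage retrieval.
candidates = {
    "doc_a": (0.80, 0.10, 0.70),
    "doc_b": (0.75, 0.40, 0.72),
}

weights = (0.4, 0.2, 0.4)  # illustrative only; tune per downstream scenario

ranked = sorted(
    candidates,
    key=lambda doc: hybrid_score(*candidates[doc], weights=weights),
    reverse=True,
)
```

With these numbers ``doc_b`` overtakes ``doc_a``, although ``doc_a`` has the higher dense score, because the lexical and multi-vector terms favor it.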
+
+
+Usage
+=====
+
+.. code:: python
+
+    from FlagEmbedding import BGEM3FlagModel
+
+    model = BGEM3FlagModel('BAAI/bge-m3')
+
+    sentences_1 = ["What is BGE M3?", "Definition of BM25"]
+
+    output = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
+    dense, sparse, multiv = output['dense_vecs'], output['lexical_weights'], output['colbert_vecs']
diff --git a/docs/source/bge/bge_v1.rst → docs/source/bge/bge_v1_v1.5.rst b/docs/source/bge/bge_v1.rst → docs/source/bge/bge_v1_v1.5.rst
@@ -1,5 +1,7 @@
-BGE-v1
-======
+BGE v1 & v1.5
+=============
+
+BGE v1 and v1.5 are series of encoder-only models based on BERT. They achieved the best performance among models of the same size at the time of release.
 
 BGE
 ---
@@ -26,7 +28,7 @@ C-MTEB benchmarks at the time released.
 BGE-v1.5
 --------
 
-Then to enhance its retrieval ability without instruction and alleviate the issue of the similarity distribution, :code:`bge-*-1.5` models 
+Then to enhance its retrieval ability without instruction and alleviate the issue of the similarity distribution, :code:`bge-*-v1.5` models 
 were released in Sep 2023. They are still the most popular embedding models that balanced well between embedding quality and model sizes.
 
 +-----------------------------------------------------------------------------+-----------+------------+--------------+--------------+
@@ -37,13 +39,39 @@ were released in Sep 2023. They are still the most popular embedding models that
 | `BAAI/bge-base-en-v1.5 <https://huggingface.co/BAAI/bge-base-en-v1.5>`_     |  English  |    109M    |    438 MB    | reasonable   |
 +-----------------------------------------------------------------------------+-----------+------------+--------------+ similarity   +
 | `BAAI/bge-small-en-v1.5 <https://huggingface.co/BAAI/bge-small-en-v1.5>`_   |  English  |    33.4M   |    133 MB    | distribution |
-+-----------------------------------------------------------------------------+-----------+------------+--------------+              +
-| `BAAI/bge-large-zh-v1.5 <https://huggingface.co/BAAI/bge-large-zh-v1.5>`_   |  Chinese  |    326M    |    1.3 GB    |              |
++-----------------------------------------------------------------------------+-----------+------------+--------------+ and better   +
+| `BAAI/bge-large-zh-v1.5 <https://huggingface.co/BAAI/bge-large-zh-v1.5>`_   |  Chinese  |    326M    |    1.3 GB    | performance  |
 +-----------------------------------------------------------------------------+-----------+------------+--------------+              +
 | `BAAI/bge-base-zh-v1.5 <https://huggingface.co/BAAI/bge-base-zh-v1.5>`_     |  Chinese  |    102M    |    409 MB    |              |
 +-----------------------------------------------------------------------------+-----------+------------+--------------+              +
 | `BAAI/bge-small-zh-v1.5 <https://huggingface.co/BAAI/bge-small-zh-v1.5>`_   |  Chinese  |    24M     |    95.8 MB   |              |
 +-----------------------------------------------------------------------------+-----------+------------+--------------+--------------+
 
 
+Usage
+-----
+
+To use a BGE v1 or v1.5 model for inference, load it as follows:
+
+.. code:: python
+
+    from FlagEmbedding import FlagModel
+
+    model = FlagModel('BAAI/bge-base-en-v1.5')
+
+    sentences = ["Hello world", "I am inevitable"]
+    embeddings = model.encode(sentences)
+
+.. tip::
+
+    For simple tasks that only encode a few sentences, like the one above, it is faster to use a single GPU than multiple GPUs:
+
+    .. code:: python
+
+        import os
+        os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+    or
+
+    .. code:: python
 
+        model = FlagModel('BAAI/bge-base-en-v1.5', devices=0)
diff --git a/docs/source/bge/index.rst b/docs/source/bge/index.rst
@@ -0,0 +1,19 @@
+BGE
+===
+
+**BGE** stands for **BAAI General Embeddings**, which is a series of embedding models released by BAAI.
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Embedder
+
+   bge_v1_v1.5
+   bge_m3
+   bge_icl
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Reranker
+
+   bge_reranker
+
diff --git a/docs/source/bge/introduction.rst b/docs/source/bge/introduction.rst
diff --git a/docs/source/community/index.rst b/docs/source/community/index.rst
@@ -0,0 +1,2 @@
+Community
+=========