Download the data and convert it to our format:

```shell
bash ./prepare_data/download_data.sh
```

The data will be saved into `./data/NQ/dataset`.
We use DPR as the text encoder:

```shell
python ./prepare_data/get_embeddings.py \
--data_dir ./data/NQ/dataset \
--preprocess_dir ./data/NQ/preprocess \
--tokenizer_name bert-base-uncased \
--max_doc_length 256 \
--max_query_length 32 \
--output_dir ./data/NQ/evaluate/dpr
```
The code will preprocess the data into `preprocess_dir` (for training the encoder) and generate embeddings into `output_dir` (for training the index). For more information about the data format, please refer to dataset.README.md.
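After this step you can sanity-check the generated embeddings by loading them with NumPy. Note that the filename, the flat-float32 layout, and the 768-dim DPR embedding size below are all assumptions for illustration; check your actual `output_dir` for the real names and shapes:

```python
import os
import tempfile

import numpy as np

dim = 768  # hidden size of DPR (bert-base-uncased); an assumption for this sketch

# Simulate what an embedding dump might look like: a flat float32 array on disk.
# (The filename and layout here are hypothetical, not LibVQ's actual output.)
path = os.path.join(tempfile.gettempdir(), "docs.memmap")
np.random.rand(10, dim).astype(np.float32).tofile(path)

# Load it back as a memory-map and restore the 2-D (num_docs, dim) shape.
doc_emb = np.memmap(path, dtype=np.float32, mode="r").reshape(-1, dim)
print(doc_emb.shape)  # (10, 768)
```

Memory-mapping avoids reading the whole corpus into RAM, which matters once the doc collection grows past a few million vectors.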
Build an IVF-OPQ index with Faiss:

```shell
python ./basic_index/faiss_index.py \
--preprocess_dir ./data/NQ/preprocess \
--embeddings_dir ./data/NQ/evaluate/dpr \
--index_method ivf_opq \
--ivf_centers_num 1000 \
--subvector_num 8 \
--subvector_bits 8 \
--nprobe 10
```
Results: Recall@5: 0.2886, Recall@10: 0.3665, Recall@20: 0.4501, Recall@30: 0.4956, Recall@50: 0.5421, Recall@100: 0.6014
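For reference, each Recall@k above is the fraction of queries whose top-k retrieved list contains at least one ground-truth positive doc. A minimal sketch of the metric (the ids below are toy data, not the NQ results):

```python
def recall_at_k(retrieved, positives, k):
    """Fraction of queries with >= 1 positive doc among the top-k results."""
    hits = sum(1 for ranked, pos in zip(retrieved, positives)
               if set(ranked[:k]) & set(pos))
    return hits / len(retrieved)

# Toy example: 3 queries, their ranked doc ids, and their positive doc ids.
retrieved = [[4, 2, 9], [1, 7, 3], [8, 0, 5]]
positives = [[2], [6], [5]]
print(recall_at_k(retrieved, positives, 2))  # only query 1 hits in the top-2 -> 1/3
```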
Build an index with ScaNN:

```shell
python ./basic_index/scann_index.py \
--preprocess_dir ./data/NQ/preprocess \
--embeddings_dir ./data/NQ/evaluate/dpr \
--ivf_centers_num 1000 \
--subvector_num 8 \
--nprobe 10
```
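Conceptually, `--subvector_num 8 --subvector_bits 8` means each vector is split into 8 sub-vectors, and each sub-vector is snapped to one of 2^8 = 256 learned centroids, so a doc compresses to just 8 bytes. A self-contained product-quantization sketch (toy data and a naive k-means, not LibVQ's or ScaNN's implementation; 4 bits per sub-vector here to keep the toy k-means fast):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(x, k, iters=10):
    """Naive k-means: returns (k, d) centroids."""
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return centers

def pq_train(x, subvector_num=8, subvector_bits=8):
    """Train one codebook of 2**bits centroids per sub-vector block."""
    sub = np.split(x, subvector_num, axis=1)
    return [kmeans(s, 2 ** subvector_bits) for s in sub]

def pq_encode(x, codebooks):
    """Encode each vector as subvector_num nearest-centroid ids."""
    sub = np.split(x, len(codebooks), axis=1)
    return np.stack([np.argmin(((s[:, None] - c[None]) ** 2).sum(-1), axis=1)
                     for s, c in zip(sub, codebooks)], axis=1)

x = rng.standard_normal((500, 64)).astype(np.float32)
books = pq_train(x, subvector_num=8, subvector_bits=4)  # 16 centroids per block
codes = pq_encode(x, books)
print(codes.shape)  # (500, 8): 8 centroid ids per vector
```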
Finetune the index with fixed embeddings (requires the embeddings of queries and docs):

```shell
python ./learnable_index/train_index.py \
--preprocess_dir ./data/NQ/preprocess \
--embeddings_dir ./data/NQ/evaluate/dpr \
--index_method ivf_opq \
--ivf_centers_num 1000 \
--subvector_num 8 \
--subvector_bits 8 \
--nprobe 10 \
--training_mode {distill_index, distill_index_nolabel, contrastive_index}
```
Jointly train the index and query encoder, which usually yields better performance (requires embeddings and a query encoder):

```shell
python ./learnable_index/train_index_and_encoder.py \
--data_dir ./data/NQ/dataset \
--preprocess_dir ./data/NQ/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/NQ/evaluate/dpr \
--index_method ivf_opq \
--ivf_centers_num 1000 \
--subvector_num 8 \
--subvector_bits 8 \
--nprobe 10 \
--training_mode {distill_index-and-query-encoder, distill_index-and-query-encoder_nolabel, contrastive_index-and-query-encoder}
```
We provide several training modes:

- `contrastive_*`: contrastive learning;
- `distill_*`: knowledge distillation, which transfers knowledge (i.e., the ranking of docs) from the dense vectors to the IVF and PQ structures;
- `distill_*_nolabel`: knowledge distillation for unlabeled data; in this mode, we first find the top-k docs for each training query by brute-force search (or an index with high accuracy), then use these results to construct new training data.

For more implementation details, please refer to train_index.py and train_index_and_encoder.py.
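The distillation idea can be sketched as a listwise loss: for each query, the teacher ranking comes from exact dense-vector scores, the student scores come from the quantized index, and training minimizes the KL divergence between the two score distributions. A minimal NumPy illustration with toy scores (the actual loss lives in train_index.py and may differ in detail):

```python
import numpy as np

def softmax(s, temp=1.0):
    e = np.exp((s - s.max()) / temp)
    return e / e.sum()

def listwise_distill_loss(teacher_scores, student_scores, temp=1.0):
    """KL(teacher || student) over the candidate docs of one query."""
    p = softmax(teacher_scores, temp)   # target ordering from dense vectors
    q = softmax(student_scores, temp)   # ordering induced by IVF/PQ scores
    return float((p * np.log(p / q)).sum())

teacher = np.array([5.0, 3.0, 1.0])    # dense-vector scores for 3 docs
aligned = np.array([4.8, 2.9, 1.1])    # quantized scores, similar ranking
shuffled = np.array([1.0, 3.0, 5.0])   # quantized scores, reversed ranking
print(listwise_distill_loss(teacher, aligned)
      < listwise_distill_loss(teacher, shuffled))  # True: aligned scores cost less
```

The loss is zero only when the student reproduces the teacher's score distribution exactly, so minimizing it pushes the quantized index toward the dense ranking without needing any relevance labels.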
Methods | Recall@5 | Recall@10 | Recall@20 | Recall@100 |
---|---|---|---|---|
Faiss-IVFPQ | 0.1504 | 0.2052 | 0.2722 | 0.4523 |
Faiss-IVFOPQ | 0.3332 | 0.4279 | 0.5110 | 0.6817 |
Scann | 0.2526 | 0.3351 | 0.4144 | 0.6016 |
LibVQ(contrastive_index) | 0.3398 | 0.4415 | 0.5232 | 0.6911 |
LibVQ(distill_index) | 0.3952 | 0.4900 | 0.5667 | 0.7232 |
LibVQ(distill_index_nolabel) | 0.4066 | 0.4936 | 0.5759 | 0.7301 |
LibVQ(contrastive_index-and-query-encoder) | 0.3548 | 0.4470 | 0.5390 | 0.7120 |
LibVQ(distill_index-and-query-encoder) | 0.4725 | 0.5681 | 0.6429 | 0.7739 |
LibVQ(distill_index-and-query-encoder_nolabel) | 0.4977 | 0.5822 | 0.6484 | 0.7764 |
For PQ, you can reuse the above commands and only change `--index_method` to `pq` or `opq`. For example:
```shell
python ./learnable_index/train_index.py \
--data_dir ./data/NQ/dataset \
--preprocess_dir ./data/NQ/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/NQ/evaluate/dpr \
--index_method opq \
--subvector_num 8 \
--subvector_bits 8 \
--training_mode distill_index_nolabel
```
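The difference between `pq` and `opq` is that OPQ additionally learns an orthogonal rotation of the vectors that lowers quantization error before PQ is applied; a standard way to fit it is to alternate between quantizing and solving an orthogonal Procrustes problem for the rotation. A simplified illustration on toy data (the "quantizer" here is a stand-in that just drops half the coordinates, not LibVQ's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Anisotropic toy data whose variance directions are hidden by a random rotation.
base = rng.standard_normal((200, 16)) * np.linspace(3.0, 0.1, 16)
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
x = base @ q

def quantize(v):
    """Stand-in 'quantizer': keep the first 8 coordinates, drop the rest."""
    out = v.copy()
    out[:, 8:] = 0.0
    return out

R = np.eye(16)              # start from the identity rotation
for _ in range(5):          # alternate: quantize, then refit the rotation
    xq = quantize(x @ R)    # quantize the rotated data
    # Orthogonal Procrustes: R = U V^T minimizes ||x R - xq||_F over rotations.
    u, _, vt = np.linalg.svd(x.T @ xq)
    R = u @ vt

err_rotated = np.linalg.norm(x @ R - quantize(x @ R))
err_identity = np.linalg.norm(x - quantize(x))
print(err_rotated < err_identity)  # the learned rotation reduces quantization error
```

Each alternation step can only decrease the quantization error, which is why OPQ consistently beats plain PQ in the tables here.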
Besides, you can train both the doc encoder and the query encoder when training only PQ (`training_mode = distill_index-and-two-encoders`):
```shell
python ./learnable_index/train_index_and_encoder.py \
--data_dir ./data/NQ/dataset \
--preprocess_dir ./data/NQ/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/NQ/evaluate/dpr \
--index_method opq \
--subvector_num 8 \
--subvector_bits 8 \
--training_mode distill_index-and-two-encoders
```
Methods | Recall@5 | Recall@10 | Recall@20 | Recall@100 |
---|---|---|---|---|
Faiss-PQ | 0.1301 | 0.1861 | 0.2495 | 0.4188 |
Faiss-OPQ | 0.3166 | 0.4105 | 0.4961 | 0.6836 |
Scann | 0.2526 | 0.3351 | 0.4144 | 0.6013 |
LibVQ(distill_index) | 0.3817 | 0.4806 | 0.5681 | 0.7357 |
LibVQ(distill_index_nolabel) | 0.3880 | 0.4858 | 0.5819 | 0.7423 |
LibVQ(distill_index-and-query-encoder) | 0.4709 | 0.5689 | 0.6481 | 0.7930 |
LibVQ(distill_index-and-query-encoder_nolabel) | 0.4883 | 0.5903 | 0.6678 | 0.7914 |
LibVQ(distill_index-and-two-encoders) | 0.5637 | 0.6515 | 0.7171 | 0.8257 |
LibVQ(distill_index-and-two-encoders_nolabel) | 0.5285 | 0.6144 | 0.7296 | 0.8096 |