PAIR: Perspective-Aligned Information Retrieval


This repository contains data and code for the paper Measuring and Addressing Indexical Bias in Information Retrieval. For more information, please reach out to the authors:


Caleb Ziems

William Held

Jane Dwivedi-Yu

Diyi Yang

What is PAIR?

🧑‍🤝‍🧑 PAIR is designed to help you identify and mitigate indexical biases in your IR systems. 🧑‍🤝‍🧑 PAIR includes a set of evaluation metrics, data resources, and human subjects study interfaces that help you measure and experimentally understand the Search Engine Manipulation Effect.

Setup

From Source

$ git clone https://github.com/SALT-NLP/pair.git
$ cd pair
$ conda create -n pair python=3.9.16
$ conda activate pair
$ pip install -r requirements.txt

Quick Example

You can run this example in the Demo.ipynb Jupyter notebook.

from src.metrics.duo import Duo, get_relevant_corpus, get_relevant_corpus_retrieved, get_relevant_ranking
from src.utils import load_wiki_balance
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.retrieval.models import SentenceBERT
from beir.retrieval.evaluation import EvaluateRetrieval

# ----- RETRIEVAL -----
## load the WikiBalance Natural retrieval corpus
corpus, queries, qrels = load_wiki_balance(subset='natural')
## load an IR model from BEIR
retriever = EvaluateRetrieval(DRES(SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16))
## retrieve documents
retrieved = retriever.retrieve(corpus, queries)

# ----- INDEXICAL BIAS EVALUATION -----
## initialize the metric 
d = Duo(embedding_model="sentence-t5-xl", step_size=1, random_state=7)

## load the synthetic corpus for fitting the Duo metric
fit_corpus, fit_queries, fit_qrels = load_wiki_balance(subset='synthetic')

## evaluate on the first query
query_idx = list(retrieved.keys())[0]

## embed documents to polarization scores
d.embed(
    transform_docs=get_relevant_corpus_retrieved(corpus, retrieved, query_idx, qrels),
    fit_docs=get_relevant_corpus(fit_corpus, query_idx, fit_qrels),
)

## compute the DUO score
duo_score = d.Duo(ranking=get_relevant_ranking(retrieved, query_idx, qrels))
print(duo_score)
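
The calls above pass around the dictionary formats that BEIR uses: `qrels` maps each query ID to its judged documents (`{doc_id: relevance}`), and `retrieved` maps each query ID to the retriever's output (`{doc_id: score}`). The toy sketch below shows one way such a result could be reduced to a ranked list of relevant document IDs. `relevant_ranking_sketch` is a hypothetical helper written only for illustration; it is not the repository's `get_relevant_ranking`, whose exact behavior may differ, and all IDs and scores are made up.

```python
from typing import Dict, List

# Toy data in BEIR's dictionary formats (all IDs and values are made up).
qrels: Dict[str, Dict[str, int]] = {"q1": {"d1": 1, "d3": 1, "d4": 0}}
retrieved: Dict[str, Dict[str, float]] = {
    "q1": {"d1": 0.2, "d2": 0.9, "d3": 0.7, "d4": 0.5}
}


def relevant_ranking_sketch(
    retrieved: Dict[str, Dict[str, float]],
    query_id: str,
    qrels: Dict[str, Dict[str, int]],
) -> List[str]:
    """Hypothetical helper: relevant doc IDs for one query, best score first."""
    judged = qrels.get(query_id, {})
    scores = retrieved.get(query_id, {})
    # Keep only retrieved documents with a positive relevance judgment.
    relevant = [doc_id for doc_id in scores if judged.get(doc_id, 0) > 0]
    # Order the survivors by retrieval score, highest first.
    return sorted(relevant, key=lambda doc_id: scores[doc_id], reverse=True)


ranking = relevant_ranking_sketch(retrieved, "q1", qrels)
print(ranking)
```

Here `d2` is dropped because it has no judgment and `d4` because it is judged non-relevant, so only the relevant documents remain, in the order the retriever ranked them.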

Datasets

You can view the WikiBalance datasets on Hugging Face.

Dataset Huggingface Name Gold Labels Type Topics Queries Documents
WikiBalance Synthetic SALT-NLP/wiki-balance-synthetic test 1.4k 4k 31.5k
WikiBalance Natural SALT-NLP/wiki-balance-natural test 288 452 4.6k

System Audits

You can replicate all system audits from Tables 4 and 5 in the paper by running the following script:

bash run_audit.sh

Only BM-25 and ColBERT require special setup to run. To set up ColBERT, follow the [BEIR demo instructions here](https://github.com/beir-cellar/beir/tree/main/examples/retrieval/evaluation/late-interaction). To run BM-25, use the following steps:

On Mac

  1. Download elasticsearch.zip and unpack it locally: elastic.co/downloads/elasticsearch
  2. Edit config/elasticsearch.yml to disable the security features by setting xpack.security.enabled, xpack.security.http.ssl.enabled, and xpack.security.transport.ssl.enabled to false
  3. Move to the elasticsearch directory and start Elasticsearch with bin/elasticsearch
  4. Run python -m src.modeling.run_bm25 --dataset "idea/wiki" --model "bm25"

On Linux

Follow these instructions: linuxize.com/post/how-to-install-elasticsearch-on-ubuntu-18-04
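
Before launching the BM-25 script, it can help to confirm that Elasticsearch is actually answering over plain HTTP. The sketch below is a minimal check using only the Python standard library. It assumes Elasticsearch listens on its default address, http://localhost:9200, and that security is disabled as described in the steps above; `elasticsearch_is_up` is a hypothetical helper and not part of this repository.

```python
import urllib.request


def elasticsearch_is_up(url: str = "http://localhost:9200", timeout: float = 2.0) -> bool:
    """Hypothetical helper: True if an HTTP GET to `url` returns status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (for example 401 when security is still enabled).
        return False


if __name__ == "__main__":
    # Assumption: default local address; change it if your setup differs.
    print("Elasticsearch reachable:", elasticsearch_is_up())
```

If this prints `False` while Elasticsearch is running, the security settings from step 2 are a likely place to look first.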

Validations and Additional Experiments

  1. To print the summary tables from the paper, run print_tables.py from the main directory.
  2. To replicate our metric validations in Table 2 (as well as Tables 6 and 7 in the Appendix), run python -m src.experiments.metric_validation
  3. To replicate the SEME experiments, you can do the following:
     a. Re-run the experiments with your own participants using the HIT interface, hit/seme/hit_pair_seme.html, OR
     b. Download the experimental data from [this Drive link](https://drive.google.com/file/d/1TXKZueZFo_VbzMyui-V5YkQVvixysQuA/view?usp=drive_link) and place it in the hit/seme directory.
     c. Run python -m src.experiments.seme_experiment
