BlockBoost: Scalable and Efficient Blocking through Boosting

This repository contains the code for BlockBoost: Scalable and Efficient Blocking through Boosting, published in AISTATS 2024.

Package setup

All scripts should be run from the root directory of the git repository.
The project root must be in the PYTHONPATH variable. One way to do this is to run the following line, or add it to your .profile script:
```
export PYTHONPATH=.
```
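
Because `PYTHONPATH=.` is a relative entry, it only resolves to the project root when the current directory is the repository root. The following is a minimal sketch of a sanity check for that; the helper name `root_on_path` is hypothetical and not part of this repository.

```python
import sys
from pathlib import Path


def root_on_path(root: Path, path_entries: list[str]) -> bool:
    """Return True if `root` appears among the given import-path entries."""
    resolved_root = root.resolve()
    for entry in path_entries:
        # An empty entry in sys.path stands for the current directory.
        candidate = Path(entry or ".").resolve()
        if candidate == resolved_root:
            return True
    return False


if __name__ == "__main__":
    # Run this from the repository root after exporting PYTHONPATH.
    print("repository root importable:", root_on_path(Path.cwd(), sys.path))
```

If the check prints `False`, the scripts are probably being launched from a subdirectory rather than from the root.
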
Python packages are managed with Poetry, which can be installed via pipx. To activate the environment and install the dependencies, run from the root directory:
```
  poetry shell
  poetry install
```

Data setup

Downloading

To download a dataset used in our paper, run

   python src/data/download/download_{dataset_name}.py

Processing

The provided code processes the datasets used in our paper. It performs a textual vectorization of the dataset, generates record and entity IDs for each entry, and creates train, validation, and test splits.

   python src/data/download/process_{dataset_name}.py
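
To make the ID and split steps concrete, here is a rough, self-contained sketch. It is not the repository's processing code: the function name, the `entity_key` callback, the split ratios, and the choice to split by entity (so that records of the same entity never land in different splits) are all assumptions made for illustration.

```python
import random


def assign_ids_and_split(records, entity_key, ratios=(0.6, 0.2, 0.2), seed=0):
    """Attach record and entity IDs to each record, then split by entity."""
    entity_ids = {}
    rows = []
    for record_id, record in enumerate(records):
        key = entity_key(record)
        # Records sharing a key are treated as the same entity.
        entity_id = entity_ids.setdefault(key, len(entity_ids))
        rows.append({"record_id": record_id, "entity_id": entity_id, "text": record})

    # Shuffle entities (not records) so duplicates stay in one split.
    shuffled = list(entity_ids.values())
    random.Random(seed).shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_valid = int(ratios[1] * len(shuffled))

    split_of = {}
    for position, entity_id in enumerate(shuffled):
        if position < n_train:
            split_of[entity_id] = "train"
        elif position < n_train + n_valid:
            split_of[entity_id] = "valid"
        else:
            split_of[entity_id] = "test"

    for row in rows:
        row["split"] = split_of[row["entity_id"]]
    return rows


if __name__ == "__main__":
    # Toy data: case-insensitive matches count as the same entity.
    for row in assign_ids_and_split(["a", "A", "b", "c", "C", "d"], str.lower):
        print(row)
```

The actual scripts may derive entity IDs from ground-truth labels and use different ratios; consult the `process_{dataset_name}.py` files for the real behavior.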

Vectorization

In our paper, we describe multiple vectorization approaches that you can choose from. To execute a particular vectorization, simply use the following command:

   python src/data/download/vectorize_{vectorization_name}.py -db {dataset_name} [other specific vectorization arguments]

Each vectorization script documents its specific arguments and how to use them.

.sh scripts

Alternatively, to process or vectorize all datasets at once, use the .sh scripts process_all.sh and vectorize_all_{vectorization_name}.sh. Each script runs its step on every dataset with a single command.
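
For a custom subset of datasets, the same idea can be sketched in Python. This is only an illustration built from the command templates in this README; the function name `pipeline_commands` and the dataset and vectorization names in the usage example are placeholders, not names taken from the repository.

```python
import subprocess
import sys


def pipeline_commands(dataset_names, vectorization_name, extra_args=()):
    """Build the process and vectorize commands for each dataset, in order."""
    commands = []
    for name in dataset_names:
        commands.append([sys.executable, f"src/data/download/process_{name}.py"])
        commands.append([
            sys.executable,
            f"src/data/download/vectorize_{vectorization_name}.py",
            "-db", name,
            *extra_args,
        ])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing script.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder names; replace them with real dataset/vectorization names.
    run_all(pipeline_commands(["dataset_a", "dataset_b"], "some_vectorization"))
```

Run it from the repository root, and keep `dry_run=True` until the printed commands look right.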

Running models

BlockBoost

Compilation

Just run make at the root of the project.

Note that the OpenMP and Boost libraries are required to compile the binaries. On Debian/Ubuntu, these can be installed with

sudo apt install -y libboost-all-dev openmpi-common

Execution

./src/models/blockboost/generate_experiments-fp32.sh

DeepBlocker

DeepBlocker experiments were conducted using the official repository available at https://github.com/qcri/DeepBlocker.

Other benchmark models

To use other benchmark models, you can simply run the following command:

python src/models/{benchmark_model_name}/predict.py -db {vectorized_database_name} [other specific model arguments]

Each model file documents its specific arguments and how to use them.

Evaluating

Evaluating experiments

From the root directory, run

./src/eval/eval.sh <model_name>

Creating tables containing the results

To generate the .csv tables containing each metric for each dataset and benchmark model, run:

python src/eval/create_eval_tables.py
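
Once the tables exist, they can be inspected with the standard library alone. The sketch below is hedged: this README does not state where the .csv files are written or which columns they contain, so the directory is passed in by the caller and the helper name `load_result_tables` is hypothetical.

```python
import csv
from pathlib import Path


def load_result_tables(results_dir):
    """Read every .csv file in `results_dir` into a list of row dictionaries."""
    tables = {}
    for csv_path in sorted(Path(results_dir).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            # DictReader uses the header row as keys; values stay strings.
            tables[csv_path.name] = list(csv.DictReader(handle))
    return tables


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory holding the tables.
    for name, rows in load_result_tables(".").items():
        print(name, "->", len(rows), "rows")
```

Values are returned as strings, so convert metric columns with `float(...)` before comparing models.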

