Source code and data for the paper: Towards Universal Dense Blocking for Entity Resolution.
First, install the dependencies and download the resources.
```shell
# clone project
git lfs install
git clone https://github.com/tshu-w/uniblocker # It will take a while for LFS to download the benchmark data
cd uniblocker
unzip data/blocking.zip -d data

# [SUGGESTED] use conda environment
conda env create -f environment.yaml
conda activate uniblocker

# [ALTERNATIVE] install requirements directly
pip install -r requirements.txt

# [OPTIONAL] download resources
python scripts/download_gittables.py    # pre-training corpus
bash scripts/download_fasttext_model.sh # fastText model for DeepBlocker
```
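For context, dense blocking retrieves candidate match pairs by embedding records and searching for their nearest neighbors in embedding space. A minimal, self-contained sketch of that idea, using toy character-trigram count vectors and cosine similarity as a stand-in for the paper's learned encoder (not the repo's actual implementation):

```python
from collections import Counter
from math import sqrt

def embed(record: str, n: int = 3) -> Counter:
    """Toy 'embedding': character n-gram counts (stand-in for a learned dense encoder)."""
    text = record.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def block(records: list[str], top_k: int = 1) -> set[tuple[int, int]]:
    """Return candidate pairs: each record paired with its top-k nearest neighbors."""
    embs = [embed(r) for r in records]
    pairs = set()
    for i, u in enumerate(embs):
        sims = sorted(
            ((cosine(u, v), j) for j, v in enumerate(embs) if j != i),
            reverse=True,
        )
        for _, j in sims[:top_k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

records = ["Apple iPhone 13", "iphone 13 apple", "Samsung Galaxy S21"]
print(block(records))  # the two iPhone records should end up paired
```

A real dense blocker replaces the n-gram vectors with neural embeddings and the exhaustive similarity loop with an approximate nearest-neighbor index, but the candidate-generation logic is the same.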
Next, to obtain the main results of the paper:
```shell
# Pre-training
./run --config configs/uniblocker.yaml --config configs/gittables.yaml

# Evaluation
bash scripts/sweep_uniblocker.sh
bash scripts/sweep_deepblocker.sh
bash scripts/sweep_sudowoodo.sh
python scripts/sweep_sparse_join.py
python scripts/sweep_blocking_workflows.py

# Scalability Evaluation
python scripts/scala_prepare.py
for f in scripts/scala_*.py; do python "$f"; done
```
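Blocking evaluations such as the sweeps above are conventionally reported in terms of pair completeness (the recall of true matches among candidates) and reduction ratio (how much of the full cross-product is pruned). A short sketch of these standard metrics, not the repo's evaluation code:

```python
def pair_completeness(candidates: set, gold: set) -> float:
    """Fraction of true matching pairs retained by blocking (i.e. recall)."""
    return len(candidates & gold) / len(gold)

def reduction_ratio(candidates: set, n_left: int, n_right: int) -> float:
    """Fraction of the full cross-product pruned away by blocking."""
    return 1.0 - len(candidates) / (n_left * n_right)

candidates = {(0, 0), (1, 1), (2, 3)}  # pairs proposed by a blocker
gold = {(0, 0), (1, 1), (2, 2)}        # true matches
print(pair_completeness(candidates, gold))  # 2 of 3 true matches kept
print(reduction_ratio(candidates, 4, 4))    # only 3 of 16 pairs considered
```

A good blocker keeps pair completeness near 1.0 while keeping the reduction ratio high, since every surviving pair must be scored by a downstream matcher.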
You can also run experiments independently using the `run` script.
```shell
# fit with the config files and command-line arguments
./run fit --config configs/uniblocker.yaml --config configs/cora.yaml --data.batch_size 32 --trainer.devices 0,

# evaluate with a checkpoint
./run test --config configs/uniblocker.yaml --ckpt_path CKPT_PATH

# get the script help
./run --help
./run fit --help
```
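Flags like `--data.batch_size 32` follow the dotted-key convention used by Lightning-style CLIs: each dotted flag overrides the matching nested key in the YAML configs passed via `--config`. A rough illustration of that mapping with a hypothetical helper (not the repo's code):

```python
def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Set a nested config entry from a dotted CLI key, e.g. 'data.batch_size'."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # create intermediate dicts as needed
    node[leaf] = value
    return config

# Nested config as it might be loaded from the YAML files
config = {"data": {"batch_size": 16}, "trainer": {"devices": 1}}
apply_override(config, "data.batch_size", 32)   # ./run fit --data.batch_size 32
apply_override(config, "trainer.devices", [0])  # ./run fit --trainer.devices 0,
print(config)
```

Later `--config` files and command-line flags win over earlier ones, so project-wide defaults can sit in one YAML while per-dataset settings and quick experiments override them.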
Details of the constructed benchmark can be found in the data README.

TODO:
- Add the benchmark README
- Separate NNBlocker into a standalone package to facilitate further research on nearest-neighbor blocking techniques.