Source code and data for the paper: Towards Universal Dense Blocking for Entity Resolution.
First, install the dependencies and download the resources.
```shell
# clone project
git lfs install
git clone https://github.com/tshu-w/uniblocker # It will take a while for LFS to download the benchmark data
cd uniblocker
unzip data/blocking.zip -d data

# [SUGGESTED] use conda environment
conda env create -f environment.yaml
conda activate uniblocker

# [ALTERNATIVE] install requirements directly
pip install -r requirements.txt

# [OPTIONAL] download resources
python scripts/download_gittables.py    # pre-training corpus
bash scripts/download_fasttext_model.sh # fastText model for DeepBlocker
```
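For context, dense blocking retrieves candidate match pairs by embedding records and searching for their nearest neighbors in embedding space. A minimal, self-contained sketch of that idea, using toy character-trigram count vectors and cosine similarity as a stand-in for the paper's learned encoder (not the repo's actual implementation):

```python
from collections import Counter
from math import sqrt

def embed(record: str, n: int = 3) -> Counter:
    """Toy 'embedding': character n-gram counts (stand-in for a learned dense encoder)."""
    text = record.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def block(records: list[str], top_k: int = 1) -> set[tuple[int, int]]:
    """Return candidate pairs: each record paired with its top-k nearest neighbors."""
    embs = [embed(r) for r in records]
    pairs = set()
    for i, u in enumerate(embs):
        sims = sorted(
            ((cosine(u, v), j) for j, v in enumerate(embs) if j != i),
            reverse=True,
        )
        for _, j in sims[:top_k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

records = ["Apple iPhone 13", "iphone 13 apple", "Samsung Galaxy S21"]
print(block(records))  # the two iPhone records should end up paired
```

A real dense blocker replaces the n-gram vectors with neural embeddings and the exhaustive similarity loop with an approximate nearest-neighbor index, but the candidate-generation logic is the same.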
Next, to obtain the main results of the paper:
```shell
# Pre-training
./run --config configs/uniblocker.yaml --config configs/gittables.yaml

# Evaluation
bash scripts/sweep_uniblocker.sh
bash scripts/sweep_deepblocker.sh
bash scripts/sweep_sudowoodo.sh
python scripts/sweep_sparse_join.py
python scripts/sweep_blocking_workflows.py

# Scalability Evaluation
python scripts/scala_prepare.py
for f in scripts/scala_*.py; do python "$f"; done
```
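Blocking evaluations such as the sweeps above are conventionally reported in terms of pair completeness (the recall of true matches among candidates) and reduction ratio (how much of the full cross-product is pruned). A short sketch of these standard metrics, not the repo's evaluation code:

```python
def pair_completeness(candidates: set, gold: set) -> float:
    """Fraction of true matching pairs retained by blocking (i.e. recall)."""
    return len(candidates & gold) / len(gold)

def reduction_ratio(candidates: set, n_left: int, n_right: int) -> float:
    """Fraction of the full cross-product pruned away by blocking."""
    return 1.0 - len(candidates) / (n_left * n_right)

candidates = {(0, 0), (1, 1), (2, 3)}  # pairs proposed by a blocker
gold = {(0, 0), (1, 1), (2, 2)}        # true matches
print(pair_completeness(candidates, gold))  # 2 of 3 true matches kept
print(reduction_ratio(candidates, 4, 4))    # only 3 of 16 pairs considered
```

A good blocker keeps pair completeness near 1.0 while keeping the reduction ratio high, since every surviving pair must be scored by a downstream matcher.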
You can also run experiments independently using the `run` script.
```shell
# fit with the config files and command-line arguments
./run fit --config configs/uniblocker.yaml --config configs/cora.yaml --data.batch_size 32 --trainer.devices 0,

# evaluate with a checkpoint
./run test --config configs/uniblocker.yaml --ckpt_path CKPT_PATH

# get the script help
./run --help
./run fit --help
```
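Flags like `--data.batch_size 32` follow the dotted-key convention used by Lightning-style CLIs: each dotted flag overrides the matching nested key in the YAML configs passed via `--config`. A rough illustration of that mapping with a hypothetical helper (not the repo's code):

```python
def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Set a nested config entry from a dotted CLI key, e.g. 'data.batch_size'."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # create intermediate dicts as needed
    node[leaf] = value
    return config

# Nested config as it might be loaded from the YAML files
config = {"data": {"batch_size": 16}, "trainer": {"devices": 1}}
apply_override(config, "data.batch_size", 32)   # ./run fit --data.batch_size 32
apply_override(config, "trainer.devices", [0])  # ./run fit --trainer.devices 0,
print(config)
```

Later `--config` files and command-line flags win over earlier ones, so project-wide defaults can sit in one YAML while per-dataset settings and quick experiments override them.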
Details of the constructed benchmark can be found in the data README.

TODO:
- Add the benchmark README
- Separate NNBlocker into a standalone package to facilitate further research on nearest-neighbor blocking techniques.