This repo contains the code and data for the paper "Pylon: Semantic Table Union Search in Data Lakes".
To set up the environment, create it from the provided conda file:
conda env create -f pylon_environment.yml
Model checkpoints are available on our Google Drive.
pylon_benchmark.tar.gz contains the benchmark we created from a real-world corpus (GitTables) for semantic table union search.
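The benchmark ships as a gzip-compressed tarball, so it has to be unpacked before the scripts can read it. Below is a minimal sketch of one way to do that. The helper name `extract_benchmark` and the destination directory `data` are assumptions for illustration; the repo does not prescribe where the benchmark should live, so point the script paths at whatever location you choose.

```shell
# Hypothetical helper: unpack the benchmark archive into a target directory.
# The default destination "data" is an assumption, not a path the repo requires.
extract_benchmark() {
  archive="$1"
  dest="${2:-data}"
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  # Create the destination first; tar -C fails if the directory is missing.
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example usage (run from the directory that holds the download):
# extract_benchmark pylon_benchmark.tar.gz data
```

Checking for the archive up front gives a clearer error than the one tar prints when the download is missing or misnamed.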
To run the evaluation, follow the steps below:
cd wte_cl
- Change paths as appropriate in the bash scripts under scripts/
- Run, for example, the evaluation on the Pylon benchmark
./scripts/run_wte_cl_pylon.sh
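The evaluation steps above depend on being in the right directory, so a small guard can save a confusing "No such file" error. The following is a sketch only: `run_eval` is a hypothetical wrapper, not part of the repo, and it assumes the script can be invoked through bash with no arguments, as the command above suggests.

```shell
# Hypothetical wrapper: run an evaluation script only if it exists.
# The default path is the script named above; call this from inside wte_cl/.
run_eval() {
  script="${1:-./scripts/run_wte_cl_pylon.sh}"
  if [ ! -f "$script" ]; then
    echo "script not found: $script (are you inside wte_cl/?)" >&2
    return 1
  fi
  # Invoke through bash so a missing executable bit does not block the run.
  bash "$script"
}

# Example usage:
# cd wte_cl && run_eval
```

Passing a different script path as the first argument runs that script instead, which may be convenient if scripts/ holds scripts for other benchmarks.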
To pre-train an embedding model, follow the steps below:
cd wte_cl/wte_cl_training
- Change paths and hyperparameters as appropriate in train_model.py
python train_model.py
If you find our work useful or related to yours, please cite our paper with the entry below:
@article{DBLP:journals/corr/abs-2301-04901,
author = {Tianji Cong and
Fatemeh Nargesian and
H. V. Jagadish},
title = {Pylon: Semantic Table Union Search in Data Lakes},
journal = {CoRR},
volume = {abs/2301.04901},
year = {2023}
}
TL;DR: A paper with the same idea as ours was published in VLDB 2023. Its first author was a friend.