This repository contains the source code for StruBERT, a new structure-aware BERT model proposed to solve three table-related downstream tasks: keyword-based table retrieval, content-based table retrieval, and table similarity. StruBERT fuses the textual and structural information of a data table to produce four context-aware representations covering both the textual and the tabular content of the table. Two fine-grained features represent the context-aware embeddings of rows and columns, where horizontal and vertical attention are applied over the column- and row-based sequences, respectively. Two coarse-grained features capture the textual information from the row- and column-based views of the table. These features are incorporated into a new end-to-end ranking model, called miniBERT, which is formed of one layer of Transformer blocks and operates directly on the embedding-level sequences formed from the StruBERT features to capture the cross-matching signals of rows and columns.
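As a rough illustration of how miniBERT could operate on these features, the sketch below (assuming a recent PyTorch; the hidden size, the learned [CLS]-style embedding, and the scoring head are illustrative assumptions, not the exact implementation in this repository) applies one Transformer encoder layer over the concatenated StruBERT feature sequences of a table pair and produces a matching score.

import torch
import torch.nn as nn

# Hypothetical sketch of a miniBERT-style ranker: one Transformer layer over
# the embedding-level sequences built from StruBERT features of a table pair.
class MiniBERTSketch(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        # learned [CLS]-style embedding prepended to the fused sequence (assumption)
        self.cls = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.encoder = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, feats_a, feats_b):
        # feats_a / feats_b: (batch, seq_len, hidden) sequences obtained by
        # concatenating the fine-grained row/column embeddings and the two
        # coarse-grained features of each table in the pair
        batch = feats_a.size(0)
        seq = torch.cat([self.cls.expand(batch, -1, -1), feats_a, feats_b], dim=1)
        out = self.encoder(seq)           # cross-matching over rows and columns
        return self.scorer(out[:, 0])     # score read from the [CLS] position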
First, install the conda environment strubert with supporting libraries.
conda env create --file scripts/env.yml
conda create --name strubert python=3.6
conda activate strubert
pip install torch==1.3.1 torchvision -f https://download.pytorch.org/whl/cu100/torch_stable.html
pip install torch-scatter==1.3.2
pip install fairseq==0.8.0
cd scripts
pip install "--editable=git+https://github.com/huggingface/transformers.git@372a5c1ceec49b52c503707e9657bfaae7c236a0#egg=pytorch_pretrained_bert" --no-cache-dir
pip install -r requirements.txt
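As an optional sanity check (not part of the original setup instructions), the following snippet confirms that the pinned PyTorch build is installed and CUDA is visible:

import torch

print(torch.__version__)          # expected: 1.3.1, as pinned above
print(torch.cuda.is_available())  # True on a CUDA-enabled machine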
Download TaBERT_Base_(K=3) from the TaBERT Google Drive shared folder. Please uncompress the tarball files before usage.
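To verify the download, the uncompressed checkpoint can be loaded following the upstream TaBERT usage example; the checkpoint path below is a placeholder.

from table_bert import TableBertModel

# Load the pretrained TaBERT checkpoint (path is a placeholder).
model = TableBertModel.from_pretrained('path/to/pretrained/model/checkpoint.bin')
print(type(model).__name__)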
This part is related to using StruBERT for table similarity.
Two datasets are used for table similarity: WikiTables and PMC.
The WikiTables corpus contains over 1.6M tables extracted from Wikipedia. Each table has five indexable fields: table caption, attributes (column headings), data rows, page title, and section title. Download and uncompress the WikiTables corpus. We use the same queries that were used by Zhang and Balog, where every query-table pair is judged with one of three grades: 0 means “irrelevant”, 1 means “partially relevant”, and 2 means “relevant”. We iterate over all the queries of WikiTables, and if two tables are relevant to the same query, the table pair is given label 1. Conversely, a table that is irrelevant to a query is considered not similar to any table that is relevant to that query, so such a pair is given label 0. We provide the 5-fold cross-validation splits with table pairs and binary labels in the source code.
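The pair-labeling rule above can be sketched roughly as follows; the (query id, table id, grade) input format, the relevance threshold, and the function name are assumptions for illustration, not the exact code used to build the provided splits.

from itertools import combinations, product

# Hypothetical sketch of the pair-labeling rule described above.
def build_table_pairs(qrels):
    """qrels: iterable of (query_id, table_id, grade), grade in {0, 1, 2}."""
    by_query = {}
    for qid, tid, grade in qrels:
        bucket = by_query.setdefault(qid, {"relevant": [], "irrelevant": []})
        # any positive grade is treated as relevant here (an assumption)
        bucket["relevant" if grade > 0 else "irrelevant"].append(tid)

    pairs = {}
    for tables in by_query.values():
        # two tables relevant to the same query -> positive pair (label 1)
        for a, b in combinations(tables["relevant"], 2):
            pairs[tuple(sorted((a, b)))] = 1
        # an irrelevant table is not similar to the relevant ones -> label 0
        for a, b in product(tables["irrelevant"], tables["relevant"]):
            pairs.setdefault(tuple(sorted((a, b))), 0)
    return pairs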
The PMC corpus is formed from the PubMed Central (PMC) Open Access subset and is used for evaluation on the table similarity task. This collection is related to biomedicine and life sciences. Each table contains a caption and data values. We provide the 5-fold cross-validation splits with table pairs and binary labels in the source code.
To evaluate StruBERT on the table similarity task for WikiTables:
cd wikitables_similarity_experiments/
python main.py \
--table_folder path/to/wikitables_corpus \
--tabert_path path/to/pretrained/model/checkpoint.bin \
--device 0 \
--epochs 5 \
--batch_size 4 \
--lr 3e-5
To evaluate StruBERT on the table similarity task for PMC:
cd pmc_similarity_experiments/
python main.py \
--tabert_path path/to/pretrained/model/checkpoint.bin \
--device 0 \
--epochs 5 \
--batch_size 4 \
--lr 3e-5
This part is related to using StruBERT for content-based table retrieval.
Query by Example Data: this dataset is composed of 50 Wikipedia tables used as input queries. The query tables are related to multiple topics, and each table has at least five rows and three columns. For the ground-truth relevance scores of table pairs, each pair is judged with one of three grades: 2 means highly relevant and indicates that the retrieved table is about the same topic as the query table with additional content; 1 means relevant and indicates that the retrieved table contains content that largely overlaps with the query table; and 0 means irrelevant.
cd content_based_table_retrieval/
chmod +x trec_eval
python main.py \
--table_folder path/to/wikitables_corpus \
--tabert_path path/to/pretrained/model/checkpoint.bin \
--device 0 \
--epochs 5 \
--batch_size 4 \
--lr 3e-5 \
--balance_data
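The trec_eval binary made executable above scores runs in the standard TREC format. The sketch below shows one way a run file could be written and passed to trec_eval; the file names, qrels path, and subprocess call are illustrative assumptions and may differ from what main.py actually produces.

import subprocess

# Hypothetical helper: write model scores as a TREC run file and score it
# with trec_eval. File names and the qrels path are placeholder assumptions.
def write_trec_run(scores, path="strubert.run", tag="StruBERT"):
    """scores: dict mapping query_id -> list of (table_id, score)."""
    with open(path, "w") as f:
        for qid, ranked in scores.items():
            for rank, (tid, score) in enumerate(
                    sorted(ranked, key=lambda x: x[1], reverse=True), start=1):
                # TREC run format: qid Q0 docid rank score run_tag
                f.write(f"{qid} Q0 {tid} {rank} {score:.4f} {tag}\n")

# subprocess.run(["./trec_eval", "path/to/qrels.txt", "strubert.run"], check=True)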
This part is related to using StruBERT for keyword-based table retrieval.
The WikiTables corpus is used for keyword-based table retrieval.
cd keyword_based_table_retrieval/
chmod +x trec_eval
python main.py \
--table_folder path/to/wikitables_corpus \
--tabert_path path/to/pretrained/model/checkpoint.bin \
--device 0 \
--epochs 3 \
--batch_size 4 \
--lr 3e-5
If you plan to use StruBERT in your project, please consider citing our paper:
@inproceedings{trabelsi22www,
author = {Trabelsi, Mohamed and Chen, Zhiyu and Zhang, Shuo and Davison, Brian D and Heflin, Jeff},
title = {Stru{BERT}: Structure-aware BERT for Table Search and Matching},
year = {2022},
booktitle = {Proceedings of the ACM Web Conference},
numpages = {10},
location = {Virtual Event, Lyon, France},
series = {WWW '22}
}
If you have any questions, please contact Mohamed Trabelsi at mot218@lehigh.edu.
The TaBERT folders (preprocess, table_bert) are important parts of StruBERT and are initially downloaded from the TaBERT repository.