This repository contains the source code for the SIGMOD 2023 paper "Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration". The paper introduces Unicorn, a unified model that supports common data matching tasks in general. This unified model enables knowledge sharing by learning from multiple tasks and multiple datasets, and also supports zero-shot prediction for new tasks with zero labeled matching/non-matching pairs. Unicorn employs one generic Encoder that converts any pair of data elements (a, b) into a learned representation, and uses a Matcher, a binary classifier, to decide whether (a, b) is a match. In between, Unicorn adopts a mixture-of-experts (MoE) layer that refines the learned representation into a better one, which further boosts prediction performance.
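To make the Encoder → MoE → Matcher pipeline concrete, here is a minimal, self-contained PyTorch sketch. It is illustrative only: the class names and the toy `EmbeddingBag` encoder are our assumptions for this README, not the repository's actual modules (see `unicorn/model/` for those).

```python
# Illustrative sketch of Unicorn's three-module pipeline; NOT the repo's exact API.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy stand-in for the LM encoder (DeBERTa in the paper)."""
    def __init__(self, vocab=30522, dim=768):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)  # assumption: real encoder is a transformer

    def forward(self, token_ids):                # token_ids: [batch, seq_len]
        return self.emb(token_ids)               # -> [batch, dim]

class MoE(nn.Module):
    """Mixture-of-experts: gate-weighted combination of expert transforms."""
    def __init__(self, dim=768, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, h):
        w = torch.softmax(self.gate(h), dim=-1)                      # [batch, n_experts]
        e = torch.stack([expert(h) for expert in self.experts], 1)   # [batch, n_experts, dim]
        return (w.unsqueeze(-1) * e).sum(dim=1)                      # refined representation

class Matcher(nn.Module):
    """Binary classifier over the refined representation."""
    def __init__(self, dim=768):
        super().__init__()
        self.cls = nn.Linear(dim, 2)

    def forward(self, h):
        return self.cls(h)                        # logits for non-matching / matching

# A serialized pair (a, b) is encoded, refined by the MoE, then classified.
encoder, moe, matcher = Encoder(), MoE(), Matcher()
token_ids = torch.randint(0, 30522, (2, 16))      # toy batch of 2 serialized pairs
logits = matcher(moe(encoder(token_ids)))
print(logits.argmax(dim=-1))                      # 1 = matching, 0 = non-matching
```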
    |-- data                  # datasets for 20 matching tasks
    |-- figs                  # figures
    |-- main.py               # pre-train Unicorn under the unified prediction setting with the given 20 datasets (Section 5.2 in the paper)
    |-- main-zero.py          # pre-train Unicorn under the zero-shot setting (Section 5.3 in the paper)
    |-- main-zero-ins.py      # pre-train Unicorn under the zero-shot setting with instructions (Section 5.3 in the paper)
    |-- finetune.py           # fine-tune Unicorn on a new dataset
    |-- test.py               # test a new dataset with the pre-trained Unicorn
    |-- unicorn               # code for Unicorn
        |-- dataprocess       # data processing
            |-- dataformat.py # dataset configuration
            |-- predata.py    # data processing functions
        |-- model             # implementation of the model
            |-- encoder.py    # encoder module: converts a serialized (a, b) into a representation (see the serialization sketch below)
            |-- moe.py        # mixture-of-experts module: refines the representation into a better one
            |-- matcher.py    # matcher module: maps the representation to 0 (non-matching) / 1 (matching)
        |-- trainer           # model learning
            |-- pretrain.py   # model training function
            |-- evaluate.py   # evaluation function
        |-- utils             # configuration files and tools
            |-- param.py      # necessary parameters
            |-- utils.py      # auxiliary functions
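As a concrete example of the serialized (a, b) input that `encoder.py` consumes, below is one common serialization scheme for entity records. The `[COL]`/`[VAL]` convention here is an assumption for illustration; the actual scheme lives in `unicorn/dataprocess/predata.py` and may differ.

```python
# One common serialization of a record pair for an LM encoder; illustrative only,
# the actual scheme in unicorn/dataprocess/predata.py may differ.
def serialize(record: dict) -> str:
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

a = {"name": "Sony WH-1000XM4", "price": "348.00"}
b = {"name": "Sony WH1000XM4 Wireless Headphones", "price": "348.00"}
pair_text = serialize(a) + " [SEP] " + serialize(b)
print(pair_text)  # the encoder consumes this single string
```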
We publish 20 datasets covering 7 matching tasks in Unicorn. Each dataset contains `train.json` / `valid.json` / `test.json`; the details can be found in our paper.
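To peek at one split, you can load it with the standard `json` module. The path and record layout below are assumptions; `unicorn/dataprocess/dataformat.py` defines the actual configuration for each dataset.

```python
import json

# The path and record layout are assumptions; see unicorn/dataprocess/dataformat.py
# for the actual schema and configuration of each dataset.
with open("data/em-wa/train.json") as f:
    examples = json.load(f)

print(len(examples), "training examples")
print(examples[0])  # inspect one (a, b) pair and its matching label
```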
- Entity Matching
  - em-wa: Walmart-Amazon
  - em-ds: DBLP-Scholar
  - em-fz: Fodors-Zagats
  - em-ia: iTunes-Amazon
  - em-beer: Beer
- Column Type Annotation
  - efthymiou: Efthymiou
  - t2d_col_type_anno: T2D
  - Limaye_col_type_anno: Limaye
- Entity Linking
  - t2d: T2D
  - Limaye: Limaye
- String Matching
  - smurf-addr: Address
  - smurf-names: Names
  - smurf-res: Researchers
  - smurf-prod: Product
  - smurf-cit: Citation
- Schema Matching
  - fabricated_dataset: FabricatedDatasets
  - DeepMDatasets: DeepMDatasets
- Ontology Matching
  - Illinois-onm: Cornell-Washington
- Entity Alignment
  - dbp_yg: SRPRS: DBP-YG
  - dbp_wd: SRPRS: DBP-WD
- Before running the code, please make sure your Python version is 3.6.5 and your CUDA version is 11.1. Then install the necessary packages:

      pip install -r requirements.txt
      pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
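A quick sanity check that the pinned PyTorch build can see your GPU:

```python
import torch

print(torch.__version__)          # expect 1.7.1+cu110
print(torch.version.cuda)         # CUDA version the wheel was built against (11.0)
print(torch.cuda.is_available())  # should print True on a CUDA 11.1 machine
```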
- Run the script for Unicorn:

      python main.py --pretrain --model deberta_base

- Run the script for Unicorn++ (adds shuffled multi-task training and MoE load balancing; see the sketch after this list):

      python main.py --pretrain --model deberta_base --shuffle 1 --load_balance 1 --modelname UnicornPlus

- Run the script for Unicorn Zero-shot:

      python main-zero.py --pretrain --model deberta_base

- Run the script for Unicorn Zero-shot with instructions:

      python main-zero-ins.py --pretrain --model deberta_base
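`--load_balance 1` enables a load-balancing objective for the MoE gate so that no single expert dominates. As a hedged illustration of the general idea (not necessarily the repository's exact loss), here is the common Switch-Transformer-style auxiliary term:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(gate_probs: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss; one plausible reading of
    --load_balance, NOT necessarily the repository's exact formulation.

    gate_probs: [batch, n_experts] softmax output of the gating network.
    """
    n_experts = gate_probs.size(-1)
    # f: fraction of examples routed (top-1) to each expert.
    f = F.one_hot(gate_probs.argmax(dim=-1), n_experts).float().mean(dim=0)
    # p: mean gate probability assigned to each expert.
    p = gate_probs.mean(dim=0)
    return n_experts * torch.sum(f * p)  # minimized when routing is uniform

gate_probs = torch.softmax(torch.randn(32, 4), dim=-1)
print(load_balance_loss(gate_probs))     # ~1.0 when balanced, larger when skewed
```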
After pre-training, a `checkpoint` folder is generated and the three modules of the model are saved: `encoder.pt`, `moe.pt`, and `cls.pt`. If you do not want to pre-train the model yourself, you can download our pre-trained models directly from HuggingFace and save them in the `checkpoint` folder.
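Once the three files are in place, each module can be restored from its own file. The directory layout below is an assumption based on the file names above and the `--ckpt` flag used later:

```python
import torch

# Directory layout is an assumption based on the file names above and the
# --ckpt flag; adjust to wherever your modules were saved or downloaded.
ckpt_dir = "checkpoint/UnicornPlus"
encoder_state = torch.load(f"{ckpt_dir}/encoder.pt", map_location="cpu")
moe_state = torch.load(f"{ckpt_dir}/moe.pt", map_location="cpu")
cls_state = torch.load(f"{ckpt_dir}/cls.pt", map_location="cpu")
# Each file is then loaded into its module, e.g. via load_state_dict, depending
# on whether the checkpoint stores a state_dict or the whole module object.
```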
To fine-tune a pre-trained model with new datasets, run:

    python finetune.py --load --ckpt UnicornPlus --model deberta_base --train_dataset_path "train_file_path1.json train_file_path2.json ..." --valid_dataset_path "valid_file_path1.json valid_file_path2.json ..." --test_dataset_path "test_file_path1.json test_file_path2.json ..." --train_metrics "f1 f1 ..." --test_metrics "f1 f1 ..." --modelname UnicornPlusNew
- This script loads the pre-trained model `UnicornPlus`, fine-tunes it on the training data given by `--train_dataset_path`, and then saves the resulting new model as `UnicornPlusNew`.
- Note that `--train_dataset_path` is required, while `--valid_dataset_path` and `--test_dataset_path` are optional.
To test datasets with a pre-trained model, run:

    python test.py --load --ckpt UnicornPlus --model deberta_base --dataset_path "test_file_path1.json test_file_path2.json ..." --test_metrics "f1 f1 ..."
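`--test_metrics f1` reports one score per test file; for matching tasks this is the standard binary F1 over the matching class. For reference (assuming `scikit-learn` is available):

```python
from sklearn.metrics import f1_score

# Standard binary F1 over the matching class (label 1), as reported per test file.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(f1_score(y_true, y_pred))  # F1 = 2*P*R / (P+R) = 0.666...
```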