Official repository for the paper Instruct and Extract: Instruction Tuning for On-Demand Information Extraction by Yizhu Jiao, Ming Zhong, Sha Li, Ruining Zhao, Siru Ouyang, Heng Ji, and Jiawei Han.
We release the InstructIE dataset, which includes 14,579 training samples (dataset/training_data.json and dataset/training_data_cot.json) and 150 test samples (dataset/test_data.json). This instruction data can be used to instruction-tune language models so that they follow instructions better on the task of on-demand information extraction.
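For a quick look at the released files, the sketch below loads and counts the samples. It assumes each file is a single JSON array; the field names printed at the end depend on the actual schema of the data.

```python
import json

# Load the training and test splits released with the repo.
with open("dataset/training_data.json", "r", encoding="utf-8") as f:
    train = json.load(f)
with open("dataset/test_data.json", "r", encoding="utf-8") as f:
    test = json.load(f)

print(f"training samples: {len(train)}")  # expected: 14,579
print(f"test samples: {len(test)}")       # expected: 150
# Inspect the fields of one sample (schema-dependent).
print("fields of the first training sample:", list(train[0].keys()))
```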
To generate training data from your own seed tasks or with other models, we open-source the scripts for the entire pipeline here. The current code is tested with the GPT-3.5-turbo model accessible via the OpenAI API. To run the pipeline, please see the implementation details and usage instructions in the Data Generation directory.
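As a rough illustration of the API usage the pipeline relies on, here is a minimal sketch of a GPT-3.5-turbo call with the OpenAI Python client. The system message and seed task below are placeholders, not the pipeline's actual prompt templates, which live in the Data Generation directory.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder seed task; the real prompt construction is in the Data Generation scripts.
seed_task = "Extract the company name, founding year, and headquarters from the passage."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You generate instruction-following data for information extraction."},
        {"role": "user", "content": seed_task},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```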
We fine-tune LLaMA-7B with LoRA, a parameter-efficient fine-tuning technique, on the training set of our InstructIE data to obtain the model ODIE. We format the data with a chatbot-style schema that concatenates the interaction between the user and the language model into a single input sequence. Please find more details about the training stage in the Training directory.
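For reference, the sketch below shows how LoRA adapters can be attached to LLaMA-7B with the PEFT library. The model path and hyperparameters are illustrative rather than the exact ODIE configuration; the values actually used are documented in the Training directory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "path/to/llama-7b"  # placeholder path to the LLaMA-7B weights
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative LoRA hyperparameters; not necessarily those used for ODIE.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```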
To evaluate the table headers with semantic similarity, run the following script:
python evaluation/sim_for_header.py PATH_OF_FILE
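For intuition, the sketch below shows one way to compute semantic similarity between predicted and gold table headers using sentence-transformers; the embedding model and matching logic used by evaluation/sim_for_header.py may differ.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; the evaluation script may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

pred_headers = ["Company", "Year Founded", "Headquarters"]
gold_headers = ["Organization", "Founding Year", "Location"]

pred_emb = model.encode(pred_headers, convert_to_tensor=True)
gold_emb = model.encode(gold_headers, convert_to_tensor=True)

# Cosine similarity between every predicted header and every gold header.
scores = util.cos_sim(pred_emb, gold_emb)
print(scores)
```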
To evaluate the table content with ROUGE-L, run the following script:
python evaluation/rougel_for_content.py PATH_OF_FILE
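For intuition, the sketch below computes ROUGE-L for a single predicted cell against its reference using the rouge-score package; the actual script aggregates such scores over the whole evaluated file.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# ROUGE-L between one reference cell and one predicted cell.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(
    "headquartered in Mountain View, California",  # reference content
    "Mountain View, California",                   # predicted content
)
print(score["rougeL"].fmeasure)
```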
Please provide the path of the file to be evaluated when running these two scripts; otherwise, they evaluate the output of ODIE, model_output/ODIE-7b-filter.json, by default. You can also evaluate the outputs of other models under this directory.
If you find this repo helpful, please cite our paper:
@article{jiao2023instruct,
title={Instruct and Extract: Instruction Tuning for On-Demand Information Extraction},
author={Jiao, Yizhu and Zhong, Ming and Li, Sha and Zhao, Ruining and Ouyang, Siru and Ji, Heng and Han, Jiawei},
journal={arXiv preprint arXiv:2310.16040},
year={2023}
}