Retrieval-Augmented Generation-based Relation Extraction

Primary language: Jupyter Notebook. License: MIT.

RAG4RE

Python 3.10.9

This repository contains the source code for the journal paper "Retrieval-Augmented Generation-based Relation Extraction", which has been submitted to the Semantic Web Journal (SWJ).

Note: The paper is still under review at SWJ.

To cite the preprint:

@misc{efeoglu2024retrievalaugmented,
      title={Retrieval-Augmented Generation-based Relation Extraction}, 
      author={Sefika Efeoglu and Adrian Paschke},
      year={2024},
      eprint={2404.13397},
      archivePrefix={arXiv}
}

Please use the settings in this branch. No sampling is applied when generating the T5 predictions. Please use the original TACRED dataset from the LDC.

Hardware: NVIDIA GeForce GTX 1080 Ti (4 GPUs × 12 GB; CPU memory: 300 GB).

Note that TACRED is licensed by the Linguistic Data Consortium (LDC). We therefore cannot directly publish the prompts or the raw results of the experiments conducted with Llama and Mistral, because the responses of these models include the prompts in their instruction parts. However, we have published the returned results obtained when Llama and Mistral were integrated. The data can be obtained from the LDC upon official request, and the experiments can then be replicated by following the instructions provided.

Project Folder Hierarchy

.
├── LICENSE
├── README.md
├── data                            ---> dataset, such as tacred, tacrev, re-tacred and semeval
├── results                         ---> results will be saved here.
└── src
    ├── config.ini                  ---> configuration for dataset, approach and llm and results.
    ├── data_preparation
    ├── main.py                     ---> the pipeline is started with this
    ├── retrieval                   ---> retrieval module
    │   ├── refinement.py
    │   └── retriever.py
    ├── data_augmentation           ---> regenerates the user query
    │   ├── embeddings
    │   └── prompt_generation
    ├── generation_module           ---> llm prompting.
    │   └── generation.py
    ├── evaluation                  ---> evaluate and visualize results. 
    │   ├── results_analysis.py
    │   └── vizualization.py
    └── utils.py                    

How to run

Adjust the paths and settings in src/config.ini for your experiment.
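
The options in config.ini are not listed in this README, so the following is only a minimal sketch of how such a file could be read with Python's standard `configparser`. The section and option names (`PATH`, `SETTINGS`, `dataset_dir`, `results_dir`, `dataset`, `approach`, `llm`) and their values are hypothetical placeholders, not the repository's actual keys; check src/config.ini for the real ones.

```python
import configparser

# Hypothetical example content; the real src/config.ini may use
# different section and option names.
EXAMPLE_INI = """
[PATH]
dataset_dir = data/semeval
results_dir = results

[SETTINGS]
dataset = semeval
approach = rag
llm = t5
"""


def load_settings(ini_text: str) -> dict:
    """Parse INI text and return all options as a flat dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    settings = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            # Keys are stored as "<SECTION>.<option>".
            settings[f"{section}.{key}"] = value
    return settings


settings = load_settings(EXAMPLE_INI)
print(settings["PATH.dataset_dir"], settings["SETTINGS.llm"])
```

To read a file on disk instead of a string, `parser.read("src/config.ini")` can be used in place of `read_string`.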

  • 1.) Datasets

    Put the following datasets under the data folder.

    • The TACRED dataset is licensed by the Linguistic Data Consortium (LDC), so please download it from the LDC.

    • The TACREV dataset is constructed from TACRED using the TACREV code.

    • The Re-TACRED dataset is derived from TACRED using the Re-TACRED repository.

    • SemEval is available on Hugging Face and is also included under the data folder.

  • 2.) Install the requirements

    pip install -r requirements.txt

  • 3.) Compute embeddings and similarities for the benchmark datasets in advance

    cd src/data_augmentation/embeddings
    python sentence_embeddings.py
    python sentence_sim.py

  • 4.) Run the project (from the repository root)

    python src/main.py
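
Step 3 computes sentence embeddings and their similarities ahead of time, presumably so that the retrieval module can later look up similar sentences without recomputing them. As a rough illustration of that idea only, the sketch below picks the most similar vector by cosine similarity. The function names `cosine_similarity` and `most_similar` and the toy vectors are assumptions made for this example; they are not taken from sentence_embeddings.py or sentence_sim.py, which may use a different similarity measure or library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query: list[float], corpus: list[list[float]]) -> tuple[int, float]:
    """Return the index and score of the corpus vector closest to the query."""
    scores = [cosine_similarity(query, vec) for vec in corpus]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy 3-dimensional vectors standing in for real sentence embeddings.
train_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
test_embedding = [0.9, 0.8, 0.0]

index, score = most_similar(test_embedding, train_embeddings)
print(index, round(score, 3))
```

With real embeddings, a vectorized implementation (for example with NumPy) would usually be preferable to this pure-Python loop.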