Documentation for Dataset Preparation
samveldfwork opened this issue · 4 comments
Hello,
Thank you for your great work; I managed to run the code with the provided dataset.
However, I have been exploring the repository as well as the paper and noticed that the method for preparing the dataset is not clearly documented.
While there are several methods and scripts related to data processing, comprehensive documentation on how to prepare a dataset from our own data for training and evaluation is missing.
Specifically, it would be much appreciated if you could provide detailed instructions on:
- The format and structure of the input data.
- Steps to preprocess and prepare the dataset.
- Any specific requirements or dependencies needed for dataset preparation.
- Examples of commands or scripts to run for dataset preparation, especially about how to build and embed the subgraphs.
Thank you for your attention to this matter.
Is the subgraph retrieved every time a query is asked?
Thanks for your suggestions!
Here are the basic steps:
GNN:
- Entity Linking: The question entities are linked to the KG.
- Subgraph Extraction: The KG subgraphs (e.g., 4-hop neighbors) are extracted based on the linked entities (a rough sketch is given below).
For WebQSP and CWQ, we follow the algorithm of NSM. You can find their preprocessing steps here.
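As a rough illustration of the extraction step, here is a minimal BFS sketch, assuming the KG is available as (head, relation, tail) triples. `extract_khop_subgraph` is a hypothetical helper, not the repo's API; the actual NSM/GraftNet preprocessing additionally prunes the neighborhood (e.g., with PageRank-style scores) so subgraphs stay manageable.

```python
from collections import defaultdict

def extract_khop_subgraph(triples, seed_entities, k=4):
    """Collect all triples within k hops of the seed entities.

    `triples` is an iterable of (head, relation, tail) tuples. This is a
    plain BFS; the real preprocessing also prunes the neighborhood to
    keep the subgraph small.
    """
    # Index the KG by entity (both directions) for fast neighbor lookup.
    by_entity = defaultdict(list)
    for h, r, t in triples:
        by_entity[h].append((h, r, t))
        by_entity[t].append((h, r, t))

    frontier = set(seed_entities)
    visited = set(frontier)
    subgraph = set()
    for _ in range(k):
        next_frontier = set()
        for entity in frontier:
            for h, r, t in by_entity[entity]:
                subgraph.add((h, r, t))
                for neighbor in (h, t):
                    if neighbor not in visited:
                        visited.add(neighbor)
                        next_frontier.add(neighbor)
        frontier = next_frontier
    return subgraph
```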
After these steps are executed, the input data file (JSON format, e.g., `test.json`) consists of the following fields: `question`, `seed_entities` (obtained via entity linking), `subgraph` tuples (obtained via subgraph extraction; these are in the format (head id, relation id, tail id) or (head name, relation, tail name)), and `answer`.
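For concreteness, a record with these fields might look like the sketch below. The field names follow the description above, but the entity ids are placeholders (not real Freebase MIDs) and the exact nesting of `subgraph` is a guess; check the repo's data loader for the authoritative schema.

```python
import json

# Invented record illustrating the fields described above; ids are
# placeholders and the "subgraph" nesting is an assumption.
example = {
    "question": "who directed the movie inception",
    "seed_entities": ["m.01234x"],  # from entity linking (placeholder id)
    "subgraph": [
        ["m.01234x", "film.film.directed_by", "m.05678y"],  # (head, relation, tail)
    ],
    "answer": ["m.05678y"],
}

with open("test.json", "w") as f:
    json.dump([example], f, indent=2)
```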
Performing the above steps for the train and test questions results in the data files needed to train your GNN.
We train the ReaRev GNN described here, but you can use a different one.
RAG:
Please run inference with the GNN as described here. This generates the GNN's candidate answers in the right format.
For RAG, we follow RoG (their GitHub code looks deactivated at the moment), as described in the `GNN-RAG/llm` folder.
The shortest paths obtained by the GNN are verbalized and concatenated into the LLM input to produce the predictions.
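A minimal sketch of the verbalization step, assuming each path arrives as a list of (head, relation, tail) triples; `verbalize_paths` and the arrow template are illustrative, and the repo's actual template (following RoG) may differ:

```python
def verbalize_paths(paths):
    """Turn KG paths into prompt text.

    `paths` is a list of paths, each a list of (head, relation, tail)
    triples. The arrow template is illustrative only.
    """
    lines = []
    for path in paths:
        # Chain "entity -> relation" hops, then close with the final tail.
        hops = " -> ".join(f"{h} -> {r}" for h, r, _ in path)
        lines.append(f"{hops} -> {path[-1][2]}")
    return "\n".join(lines)

# One hypothetical 2-hop path from a question entity to a candidate answer:
path = [("Inception", "directed_by", "Christopher Nolan"),
        ("Christopher Nolan", "place_of_birth", "London")]
print(verbalize_paths([path]))
# Inception -> directed_by -> Christopher Nolan -> place_of_birth -> London
```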
Overall:
KG subgraphs are needed for each question for GNN training/evaluation. Then, the shortest paths between question entities and answer candidates are extracted for RAG. In our work, we follow previous works (GraftNet, NSM) and their preprocessing steps to get the KG subgraphs for WebQSP and CWQ from the Freebase KG. If you want to test your own data, you should follow similar preprocessing steps.
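For the shortest-path step, here is a minimal sketch using networkx (an assumption for brevity, not necessarily what the repo uses), with `shortest_paths_to_candidates` as a hypothetical helper:

```python
import networkx as nx

def shortest_paths_to_candidates(subgraph, seed_entities, candidates):
    """Extract shortest paths from question entities to GNN candidates.

    `subgraph` is a list of (head, relation, tail) triples. networkx is
    used only for brevity; this is an illustration, not the repo's code.
    """
    g = nx.Graph()
    relation = {}  # keeps one relation label per entity pair
    for h, r, t in subgraph:
        g.add_edge(h, t)
        relation[(h, t)] = r
        relation[(t, h)] = r

    paths = []
    for seed in seed_entities:
        for cand in candidates:
            try:
                nodes = nx.shortest_path(g, seed, cand)
            except (nx.NetworkXNoPath, nx.NodeNotFound):
                continue
            # Re-attach relation labels to consecutive node pairs.
            paths.append([(a, relation[(a, b)], b)
                          for a, b in zip(nodes, nodes[1:])])
    return paths
```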
I will try to upload code samples in the coming weeks, thanks for your patience!
Thank you for your reply.
Looking forward to your uploads next week :)
@cmavro How do I use the MetaQA data? Do I need to change the pipeline or the data format?