Welcome to the GitHub repository for the Instruct-to-SPARQL dataset! This repository contains the source code, data processing scripts, and examples for creating, fine-tuning, and evaluating models with the Instruct-to-SPARQL dataset.
- Instruct-to-SPARQL Overview
- Dataset Details
- Repository Structure
- Installation
- Data Collection and Processing
- Fine-tuning and Evaluation
- Contributing
- License
- Contact
Instruct-to-SPARQL is a dataset of paired natural language instructions and SPARQL queries. It was created by crawling Wikipedia pages and tutorials for real examples of Wikidata SPARQL queries. The dataset contains 2.8k examples in total, split into train, validation, and test sets.
The dataset has the following features:
- id: A unique identifier for each example.
- instructions: A list of natural language instructions and questions.
- sparql_raw: The SPARQL query that was crawled and cleaned.
- sparql_annotated: The SPARQL query with annotations and prefixes.
- sparql_query: The final SPARQL query, with prefixes, used to retrieve data.
- complexity: A measure of the query's complexity: simple, medium, or complex.
- complexity_description: A description of the query's complexity.
- query_results: The results obtained from executing the SPARQL query.
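As a quick sanity check on the feature list above, here is a minimal sketch that validates one record against those field names. The Python types, the `validate_example` helper, and the hand-written sample record are assumptions made for illustration; they are not taken from the repository. The records themselves would typically be loaded with the Hugging Face `datasets` library from the dataset page given in the citation.

```python
from typing import Any

# Field names come from the feature list; the Python types are assumptions.
EXPECTED_FIELDS = {
    "id": (str, int),
    "instructions": list,
    "sparql_raw": str,
    "sparql_annotated": str,
    "sparql_query": str,
    "complexity": str,
    "complexity_description": str,
}
COMPLEXITY_LEVELS = {"simple", "medium", "complex"}


def validate_example(example: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in example:
            problems.append(f"missing field: {name}")
        elif not isinstance(example[name], expected_type):
            problems.append(f"unexpected type for field: {name}")
    # The format of query_results is not specified here, so only presence is checked.
    if "query_results" not in example:
        problems.append("missing field: query_results")
    if example.get("complexity") not in COMPLEXITY_LEVELS:
        problems.append("complexity must be simple, medium, or complex")
    return problems


# Hypothetical record, written by hand for illustration only.
sample = {
    "id": "example-0",
    "instructions": ["List all items that are instances of house cat."],
    "sparql_raw": "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . }",
    "sparql_annotated": "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . }",
    "sparql_query": (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . }"
    ),
    "complexity": "simple",
    "complexity_description": "Single triple pattern.",
    "query_results": [],
}

print(validate_example(sample))
```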
The repository is organized as follows:
instruct-to-sparql/
├── data/
│ ├── raw/
│ ├── processed/
│ └── nl_generation/
├── scripts/
├── src/
├── sparql-wikidata.yml
└── README.md
- data/: Contains scripts and notebooks for data collection, cleaning, augmentation, and natural language generation.
- src/: Contains scripts for fine-tuning models, calculating metrics, and evaluating model performance.
- scripts/: Contains bash scripts for the experiments and model training.
- sparql-wikidata.yml: Conda environment file with all required dependencies.
To use this repository, you need to clone it and set up the required environment.
git clone https://github.com/your_username/instruct-to-sparql.git
cd instruct-to-sparql
We recommend using a Conda environment. You can create the environment with the required dependencies from the sparql-wikidata.yml file:
conda env create -f sparql-wikidata.yml
conda activate instruct-to-sparql
The dataset was created through several steps:
- Data Collection: Crawling Wikipedia pages and tutorials for real examples of Wikidata SPARQL queries.
- Data Cleaning: Cleaning the collected data to ensure consistency and correctness.
- Data Augmentation: Augmenting the dataset with additional examples to enhance diversity.
- Natural Language Generation: Generating natural language instructions corresponding to the SPARQL queries.
Scripts and notebooks for data collection can be found in the data/ directory.
Scripts and notebooks for natural language generation can be found in the data/nl_generation directory.
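To make the cleaning step more concrete, the sketch below shows one plausible normalization: prepending the standard Wikidata prefix declarations that a crawled query uses but does not declare. This is an illustration under assumptions, not the repository's actual cleaning code; the `add_missing_prefixes` helper and the choice of prefixes are hypothetical.

```python
import re

# Commonly used Wikidata Query Service prefixes (a subset, chosen for illustration).
WIKIDATA_PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
    "wikibase": "http://wikiba.se/ontology#",
    "bd": "http://www.bigdata.com/rdf#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def add_missing_prefixes(query: str) -> str:
    """Prepend PREFIX lines for known prefixes that are used but not declared."""
    declared = set(re.findall(r"PREFIX\s+(\w+):", query, flags=re.IGNORECASE))
    needed = [
        name
        for name in WIKIDATA_PREFIXES
        # The lookbehind avoids matching "wd:" inside a longer name or an IRI.
        if name not in declared and re.search(rf"(?<![\w:/]){name}:", query)
    ]
    header = "\n".join(
        f"PREFIX {name}: <{WIKIDATA_PREFIXES[name]}>" for name in needed
    )
    body = query.strip()
    return f"{header}\n{body}" if header else body


raw_query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . }"
full_query = add_missing_prefixes(raw_query)
print(full_query)
```

Applying the function a second time leaves the query unchanged, because the prefixes are then already declared.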
The repository includes code for fine-tuning models on the Instruct-to-SPARQL dataset and evaluating their performance.
The scripts/ directory contains scripts for fine-tuning and evaluating models. For example, to fine-tune the following models, run:
- For the model Llama3-8B-SPARQL-annotated
./scripts/llama3_sparql.sh --annotated --batch_size=2 --accelerate="deepspeed-fp16" --left_padding_side
- For the model Mistral-7B-v0.3-SPARQL
./scripts/mistral_sparql.sh --batch_size=2 --accelerate="deepspeed-bf16" --left_padding_side
The script scripts/evaluate.sh is used for evaluating model performance. To evaluate a model checkpoint, run:
./scripts/evaluate.sh --model_name MODEL_NAME --batch_size BATCH_SIZE --annotated
The performance of models on the Instruct-to-SPARQL dataset is evaluated using a combination of machine translation metrics and execution result metrics.
- BLEU (Bilingual Evaluation Understudy)
  BLEU measures the similarity between the generated SPARQL query and a reference SPARQL query by calculating n-gram precision. It ranges from 0 to 1, where 1 indicates a perfect match.
  Formula:
  $$\text{BLEU} = \exp \left( \min\left(0, 1 - \frac{\text{len}(\text{ref})}{\text{len}(\text{gen})}\right) + \sum_{n=1}^{N} w_n \log p_n \right)$$
  where $p_n$ is the precision of n-grams, $w_n$ are weights, and $\text{len}(\text{ref})$ and $\text{len}(\text{gen})$ are the lengths of the reference and generated queries, respectively.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
  ROUGE measures the overlap between the generated SPARQL query and the reference SPARQL query, focusing on recall. The most commonly used variants are ROUGE-N (n-gram recall) and ROUGE-L (longest common subsequence).
  ROUGE-N Formula:
  $$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}_{\text{match}}(gram_n)}{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}(gram_n)}$$
  ROUGE-L Formula (recall form):
  $$\text{ROUGE-L} = \frac{LCS(X, Y)}{\text{len}(X)}$$
  where $LCS(X, Y)$ is the length of the longest common subsequence between the reference $X$ and the generated query $Y$.
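The following is a minimal sketch of these text metrics for a single reference, assuming whitespace tokenization, uniform BLEU weights $w_n = 1/N$, and no smoothing. The function names are illustrative and this is not the repository's evaluation code. For ROUGE-L, the sketch returns the LCS length normalized both by the reference length (recall) and by the generated length (precision).

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, generated: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU with uniform weights (an assumption)."""
    ref, gen = reference.split(), generated.split()
    if not gen:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        gen_counts, ref_counts = ngrams(gen, n), ngrams(ref, n)
        total = sum(gen_counts.values())
        matched = sum((gen_counts & ref_counts).values())  # clipped matches
        if total == 0 or matched == 0:
            return 0.0  # log p_n is undefined, so the unsmoothed score is 0
        log_precision_sum += (1.0 / max_n) * math.log(matched / total)
    brevity = min(0.0, 1.0 - len(ref) / len(gen))
    return math.exp(brevity + log_precision_sum)


def rouge_n(reference: str, generated: str, n: int = 1) -> float:
    """N-gram recall of the generated query against the reference."""
    ref_counts = ngrams(reference.split(), n)
    gen_counts = ngrams(generated.split(), n)
    total = sum(ref_counts.values())
    return sum((ref_counts & gen_counts).values()) / total if total else 0.0


def lcs_length(x: list[str], y: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, generated: str) -> tuple[float, float]:
    """Return (recall, precision) based on the LCS length."""
    x, y = reference.split(), generated.split()
    lcs = lcs_length(x, y)
    recall = lcs / len(x) if x else 0.0
    precision = lcs / len(y) if y else 0.0
    return recall, precision


reference = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . }"
generated = "SELECT ?item WHERE { ?item wdt:P31 wd:Q144 . }"
print(bleu(reference, generated), rouge_n(reference, generated), rouge_l(reference, generated))
```

Because SPARQL punctuation is not always separated by spaces, a real evaluation would likely need a tokenizer that is aware of SPARQL syntax; whitespace splitting is used here only to keep the sketch short.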
Before evaluating the generated SPARQL queries, we execute them to obtain their results. We then apply a semantic mapping that matches the keys of the generated results to the keys of the target results. Model performance is evaluated based on the similarity between the results obtained from the target and generated queries.
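As an illustration of the key-matching step, the sketch below turns two result sets in the standard SPARQL JSON results layout (`results` → `bindings`) into comparable sets of rows. The `bindings_to_rows` helper and the hand-supplied `key_map` are assumptions; how the repository actually derives the mapping between keys is not described here.

```python
from typing import Optional


def bindings_to_rows(response: dict, key_map: Optional[dict] = None) -> set:
    """Convert SPARQL JSON results into a set of rows.

    Each row is a frozenset of (key, value) pairs, so row order and
    column order do not affect the comparison. key_map renames the
    variables of a generated query to those of the target query.
    """
    key_map = key_map or {}
    rows = set()
    for binding in response["results"]["bindings"]:
        row = frozenset(
            (key_map.get(var, var), cell["value"]) for var, cell in binding.items()
        )
        rows.add(row)
    return rows


# Hypothetical responses, written by hand for illustration only.
target_response = {
    "head": {"vars": ["item"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"}},
    ]},
}
generated_response = {
    "head": {"vars": ["cat"]},
    "results": {"bindings": [
        {"cat": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"}},
        {"cat": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"}},
    ]},
}

target_rows = bindings_to_rows(target_response)
generated_rows = bindings_to_rows(generated_response, key_map={"cat": "item"})
print(target_rows == generated_rows)
```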
- The Overlap Coefficient measures the similarity between the sets of results returned by the target and generated SPARQL queries. It is defined as the size of the intersection divided by the size of the smaller set.
  Formula:
  $$\text{Overlap Coefficient} = \frac{|A \cap B|}{\min(|A|, |B|)}$$
  where $A$ and $B$ are the sets of results from the target and generated queries, respectively.
- The Jaccard Similarity measures the similarity between the sets of results returned by the target and generated SPARQL queries. It is defined as the size of the intersection divided by the size of the union.
  Formula:
  $$\text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}$$
  where $A$ and $B$ are the sets of results from the target and generated queries, respectively.
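Both set metrics can be written in a few lines. The sketch below follows the two formulas directly; returning 0.0 when a denominator would be zero is an assumption, since the behavior for empty result sets is not specified here.

```python
def overlap_coefficient(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|); returns 0.0 if either set is empty (assumption)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; returns 0.0 if both sets are empty (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical result sets: three target items, four generated items, two shared.
target = {"Q1", "Q2", "Q3"}
generated = {"Q2", "Q3", "Q4", "Q5"}

print(round(overlap_coefficient(target, generated), 3))  # 0.667
print(round(jaccard_similarity(target, generated), 3))   # 0.4
```

The overlap coefficient reaches 1 whenever one result set is contained in the other, while Jaccard similarity also penalizes the extra results, which is why the two are reported together.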
If you use this dataset or code in your research, please cite it as follows:
@dataset{instruct_to_sparql,
author = {Mehdi Ben Amor and Alexis Strappazon and Michael Granitzer and Jelena Mitrovic},
title = {Instruct-to-SPARQL},
year = {2024},
howpublished = {https://huggingface.co/datasets/PaDaS-Lab/Instruct-to-SPARQL},
note = {A dataset of natural language instructions and corresponding SPARQL queries}
}
This repository is licensed under the Apache-2.0 license. See the LICENSE file for more details.
For questions or comments about the dataset or repository, please contact here