Instruct-to-SPARQL: GitHub Repository

Welcome to the GitHub repository for the Instruct-to-SPARQL dataset! This repository contains the source code, data processing scripts, and examples for creating, fine-tuning, and evaluating models with the Instruct-to-SPARQL dataset.

Instruct-to-SPARQL Overview
Dataset Details
Repository Structure
Installation
Data Collection and Processing
Fine-tuning and Evaluation
Contributing
License
Contact

Instruct-to-SPARQL Overview

Instruct-to-SPARQL is a dataset of natural language instructions paired with SPARQL queries. It was created by crawling Wikipedia pages and tutorials for real examples of Wikidata SPARQL queries. The dataset has a total of 2.8k examples, split into train, validation, and test sets.

Dataset Details

The dataset has the following features:

id: A unique identifier for each example.
instructions: A list of natural language instructions and questions.
sparql_raw: The SPARQL query that was crawled and cleaned.
sparql_annotated: The SPARQL query with annotations and prefixes.
sparql_query: The final SPARQL query, with prefixes, used to retrieve data.
complexity: A measure of the query's complexity: simple, medium, or complex.
complexity_description: A description of the query's complexity.
query_results: The results obtained from executing the SPARQL query.
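
As a rough illustration of the schema above, the following sketch models one record as a Python TypedDict and checks that a record carries every listed field. Only the field names and the three complexity levels come from the list above; the field types, the `check_example` helper, and the sample record are assumptions made up for illustration.

```python
from typing import Any, List, TypedDict


class Example(TypedDict):
    """One dataset record. Field names follow the feature list; types are assumed."""
    id: str
    instructions: List[str]
    sparql_raw: str
    sparql_annotated: str
    sparql_query: str
    complexity: str              # "simple", "medium", or "complex"
    complexity_description: str
    query_results: Any           # assumed: whatever executing the query returned


COMPLEXITY_LEVELS = {"simple", "medium", "complex"}


def check_example(record: dict) -> list:
    """Return a list of problems found in a record (an empty list means none)."""
    problems = [f"missing field: {name}"
                for name in Example.__annotations__ if name not in record]
    if record.get("complexity") not in COMPLEXITY_LEVELS:
        problems.append("complexity must be simple, medium, or complex")
    return problems


# Hypothetical record, invented for illustration; it is not taken from the dataset.
query = (
    "PREFIX wd: <http://www.wikidata.org/entity/> "
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/> "
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . }"
)
sample = {
    "id": "example-0",
    "instructions": ["List all items that are an instance of house cat."],
    "sparql_raw": "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . }",
    "sparql_annotated": query,   # the real annotation format is not reproduced here
    "sparql_query": query,
    "complexity": "simple",
    "complexity_description": "Single triple pattern.",
    "query_results": [],
}

print(check_example(sample))
```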

Repository Structure

The repository is organized as follows:

instruct-to-sparql/
├── data/
│   ├── raw/
│   ├── processed/
│   └── nl_generation/
├── scripts/
├── src/
├── sparql-wikidata.yml
└── README.md

data/: Contains scripts and notebooks for data collection, cleaning, augmentation, and natural language generation.
src/: Contains scripts for fine-tuning models, calculating metrics, and evaluating model performance.
scripts/: Contains bash scripts for the experiments and model training.
sparql-wikidata.yml: Conda environment file with all required dependencies.

Installation

To use this repository, you need to clone it and set up the required environment.

Clone the Repository

git clone https://github.com/your_username/instruct-to-sparql.git
cd instruct-to-sparql

Set Up the Environment

We recommend using a Conda environment. You can create the environment with the required dependencies using the sparql-wikidata.yml file:

conda env create -f sparql-wikidata.yml
conda activate instruct-to-sparql

Data Collection and Processing

The dataset was created through several steps:

Data Collection: Crawling Wikipedia pages and tutorials for real examples of Wikidata SPARQL queries.
Data Cleaning: Cleaning the collected data to ensure consistency and correctness.
Data Augmentation: Augmenting the dataset with additional examples to enhance diversity.
Natural Language Generation: Generating natural language instructions corresponding to the SPARQL queries.

Data Collection and Cleaning

Scripts and notebooks for data collection can be found in the data/ directory.

Natural Language Generation & Augmentation

Scripts and notebooks for natural language generation can be found in the data/nl_generation directory.

Fine-tuning and Evaluation

The repository includes code for fine-tuning models on the Instruct-to-SPARQL dataset and evaluating their performance.

Fine-tuning

The scripts/ directory contains scripts for fine-tuning and evaluating models. For example, to fine-tune the following models, run:

For the model Llama3-8B-SPARQL-annotated

./scripts/llama3_sparql.sh --annotated --batch_size=2 --accelerate="deepspeed-fp16" --left_padding_side

For the model Mistral-7B-v0.3-SPARQL

./scripts/mistral_sparql.sh --batch_size=2 --accelerate="deepspeed-bf16" --left_padding_side

Evaluation

The script scripts/evaluate.sh is used for evaluating model performance. To evaluate a model checkpoint, run:

./scripts/evaluate.sh --model_name MODEL_NAME --batch_size BATCH_SIZE --annotated

Metrics

The performance of models on the Instruct-to-SPARQL dataset is evaluated using a combination of machine translation metrics and execution result metrics.

Machine Translation Metrics

BLEU (Bilingual Evaluation Understudy)

BLEU measures the similarity between the generated SPARQL query and a reference SPARQL query by calculating n-gram precision. It ranges from 0 to 1, where 1 indicates a perfect match.

Formula:

$$\text{BLEU} = \exp \left( \min\left(0, 1 - \frac{\text{len(ref)}}{\text{len(gen)}}\right) + \sum_{n=1}^{N} w_n \log p_n \right)$$

where $p_n$ is the precision of n-grams, $w_n$ are weights, and $\text{len(ref)}$ and $\text{len(gen)}$ are the lengths of the reference and generated queries, respectively.
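
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's evaluation code: it assumes whitespace tokenization, uniform weights $w_n = 1/N$ with $N = 4$, clipped n-gram counts for $p_n$, and a score of 0 whenever some $p_n$ is zero. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, generated: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights and no smoothing (assumed settings)."""
    ref, gen = reference.split(), generated.split()
    if not gen:
        return 0.0
    weighted_log_precision = 0.0
    for n in range(1, max_n + 1):
        gen_counts = ngram_counts(gen, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(gen_counts.values())
        # Clipped matches: a generated n-gram counts at most as often as it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in gen_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # log(0) is undefined; without smoothing the score is 0
        weighted_log_precision += (1.0 / max_n) * math.log(matched / total)
    # Brevity term: min(0, 1 - len(ref) / len(gen)), as in the formula.
    brevity = min(0.0, 1.0 - len(ref) / len(gen))
    return math.exp(brevity + weighted_log_precision)


reference = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }"
generated = "SELECT ?item WHERE { ?item wdt:P31 wd:Q144 }"
print(bleu(reference, reference), bleu(reference, generated))
```

Because no smoothing is applied, very short queries with fewer than four tokens score 0 under this sketch; library implementations usually offer smoothing options for that case.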

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE measures the overlap between the generated SPARQL query and the reference SPARQL query, focusing on recall. The most commonly used variants are ROUGE-N (n-gram recall) and ROUGE-L (longest common subsequence).

ROUGE-N Formula:

$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}_{\text{match}}(gram_n)}{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}(gram_n)}$$

ROUGE-L Formula:

$$\text{ROUGE-L} = \frac{LCS(X,Y)}{\text{len}(X)}$$

where $LCS(X, Y)$ is the length of the longest common subsequence between the reference $X$ and the generated query $Y$.
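
A small Python sketch may make the two ROUGE variants concrete. It assumes whitespace tokenization and a single reference query; the function names are hypothetical and this is not the repository's implementation. For ROUGE-L, the sketch returns the LCS length divided by the reference length (recall) and by the generated length (precision), so either normalization can be read off.

```python
from collections import Counter


def rouge_n_recall(reference: str, generated: str, n: int = 2) -> float:
    """Fraction of the reference n-grams that also occur in the generated query."""
    ref, gen = reference.split(), generated.split()
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    gen_counts = Counter(tuple(gen[i:i + n]) for i in range(len(gen) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, gen_counts[gram]) for gram, count in ref_counts.items())
    return matched / total


def lcs_length(x, y) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(y) + 1)
    for x_tok in x:
        curr = [0]
        for j, y_tok in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if x_tok == y_tok else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, generated: str) -> dict:
    """LCS length normalized by the reference length and by the generated length."""
    x, y = reference.split(), generated.split()
    lcs = lcs_length(x, y)
    return {
        "recall": lcs / len(x) if x else 0.0,
        "precision": lcs / len(y) if y else 0.0,
    }


reference = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }"
generated = "SELECT ?item WHERE { ?item wdt:P31 wd:Q144 }"
print(rouge_n_recall(reference, generated, n=1), rouge_l(reference, generated))
```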

Execution Results Metrics

Before evaluating the generated SPARQL queries, we execute them to obtain their results. We then perform a semantic mapping that matches the keys of the generated results to the keys of the target results. Model performance is then evaluated by the similarity between the results of the target and generated queries.
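
One way such a key mapping could work is sketched below. This is a hypothetical heuristic, not the repository's actual procedure: it assumes each result is a list of rows (dictionaries from variable name to value) and pairs each target key with the unused generated key whose column shares the most values. The names `match_keys` and `rename_rows` are invented for this example.

```python
def columns(rows):
    """Collect the set of values seen under each key."""
    cols = {}
    for row in rows:
        for key, value in row.items():
            cols.setdefault(key, set()).add(value)
    return cols


def match_keys(target_rows, generated_rows):
    """Map generated keys to target keys by greatest value overlap (greedy, assumed heuristic)."""
    target_cols = columns(target_rows)
    generated_cols = columns(generated_rows)
    unused = set(generated_cols)
    mapping = {}
    for t_key, t_values in target_cols.items():
        # sorted() keeps the choice deterministic when two columns tie.
        best = max(sorted(unused),
                   key=lambda g_key: len(t_values & generated_cols[g_key]),
                   default=None)
        if best is not None and t_values & generated_cols[best]:
            mapping[best] = t_key
            unused.discard(best)
    return mapping


def rename_rows(rows, mapping):
    """Rename the keys of each row according to the mapping; unmapped keys are kept."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


target = [{"item": "Q1", "label": "a"}, {"item": "Q2", "label": "b"}]
generated = [{"x": "Q1", "name": "a"}, {"x": "Q2", "name": "c"}]
mapping = match_keys(target, generated)
print(mapping, rename_rows(generated, mapping))
```

After renaming, the two row sets use the same keys and can be compared with set-based metrics.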

Overlap Coefficient

The Overlap Coefficient measures the similarity between the sets of results returned by the target and generated SPARQL queries. It is defined as the size of the intersection divided by the size of the smaller set.

Formula:

$$\text{Overlap Coefficient} = \frac{|A \cap B|}{\min(|A|, |B|)}$$

where $A$ and $B$ are the sets of results from the target and generated queries, respectively.
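
As a minimal sketch, the coefficient can be computed directly on Python sets. The result values below are made up, and returning 0.0 when either set is empty is an assumed convention, since the formula is undefined in that case.

```python
def overlap_coefficient(a, b) -> float:
    """|A intersect B| / min(|A|, |B|); 0.0 if either set is empty (assumed convention)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


target_results = {"Q1", "Q2", "Q3"}            # hypothetical result values
generated_results = {"Q2", "Q3", "Q4", "Q5"}
print(overlap_coefficient(target_results, generated_results))
```

Note that the coefficient is 1 whenever one result set is a subset of the other, so a generated query that returns only part of the target results can still get a perfect score.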

Jaccard Similarity

The Jaccard Similarity measures the similarity between the sets of results returned by the target and generated SPARQL queries. It is defined as the size of the intersection divided by the size of the union.

Formula:

$$\text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are the sets of results from the target and generated queries, respectively.
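
The same kind of sketch works for the Jaccard Similarity. The result values are made up, and treating two empty result sets as identical (score 1.0) is an assumed convention for the undefined 0/0 case.

```python
def jaccard_similarity(a, b) -> float:
    """|A intersect B| / |A union B|; 1.0 if both sets are empty (assumed convention)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


target_results = {"Q1", "Q2", "Q3"}            # hypothetical result values
generated_results = {"Q2", "Q3", "Q4", "Q5"}
print(jaccard_similarity(target_results, generated_results))
```

Unlike the Overlap Coefficient, this score penalizes extra or missing results: it reaches 1 only when both queries return exactly the same set.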

Citation

If you use this dataset or code in your research, please cite it as follows:

@dataset{instruct_to_sparql,
  author = {Mehdi Ben Amor and Alexis Strappazon and Michael Granitzer and Jelena Mitrovic},
  title = {Instruct-to-SPARQL},
  year = {2024},
  howpublished = {https://huggingface.co/datasets/PaDaS-Lab/Instruct-to-SPARQL},
  note = {A dataset of natural language instructions and corresponding SPARQL queries}
}

License

This repository is licensed under the Apache-2.0 license. See the LICENSE file for more details.

Contact

For questions or comments about the dataset or repository, please contact here

padas-lab-de/instruct-to-sparql