Polish translation of the Spider dataset.

Pol-Spider 🕷️

This repository provides Polish translations of the Spider, CoSQL, SParC, Spider-DK, and Spider-Syn datasets, along with code for some experiments.

📄 Associated master's thesis: download link.

Ready datasets

Polish translations are ready to download from Hugging Face Datasets 🤗

Dataset synthesis

The datasets directory contains scripts for dataset synthesis.

Setup environment

# clone repository
git clone https://github.com/klima7/Polish-Spider
cd Polish-Spider

# create environment
conda create -n pol-spider python=3.19
conda activate pol-spider
pip install -r requirements.txt

# download spacy model
python -m spacy download xx_sent_ud_sm

Then download the original English databases from here and place them inside datasets/components/database.

Example dataset synthesis

Synthesize a dataset named pol-spider-en, based on samples from Spider. Translate the questions to Polish, apply context-curated translation to schema names, and translate string literals in the SQL queries to Polish:

python datasets/scripts/synthesize.py spider pol-spider-en \
  --question-lang pl \
  --schema-translation context-curated \
  --query-lang pl \
  --with-db
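To make the three translation steps concrete, here is a minimal Python sketch of what they could do to a single sample. It is only an illustration, not the code in datasets/scripts/synthesize.py: the sample, the Polish wording, the two lookup tables, and the translate_query helper are all hypothetical, and the field names assume the public Spider sample format (db_id, question, query).

```python
import re

# Hypothetical Spider-style sample (field names assumed from the public Spider format).
sample = {
    "db_id": "concert_singer",
    "question": "How many singers are from France?",
    "query": "SELECT count(*) FROM singer WHERE country = 'France'",
}

# Hypothetical hand-written translations standing in for the real translation step.
question_pl = "Ilu piosenkarzy pochodzi z Francji?"
schema_map = {"singer": "piosenkarz", "country": "kraj"}
literal_map = {"France": "Francja"}


def translate_query(query, schema_map, literal_map):
    """Rename schema identifiers and translate single-quoted string literals."""
    # Split on quoted literals so identifiers and literals are handled separately.
    parts = re.split(r"('[^']*')", query)
    out = []
    for part in parts:
        if len(part) >= 2 and part.startswith("'") and part.endswith("'"):
            text = part[1:-1]
            out.append("'" + literal_map.get(text, text) + "'")
        else:
            # Replace whole words only, leaving SQL keywords untouched.
            out.append(
                re.sub(r"\b\w+\b", lambda m: schema_map.get(m.group(0), m.group(0)), part)
            )
    return "".join(out)


translated = {
    "db_id": sample["db_id"],
    "question": question_pl,
    "query": translate_query(sample["query"], schema_map, literal_map),
}
print(translated["question"])
print(translated["query"])
```

The sketch treats schema names and string literals separately, mirroring the separate --schema-translation and --query-lang options in the command above.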

Joining datasets

Create pol-spider dataset by joining pol-spider-en and pol-spider-pl:

python datasets/scripts/join.py pol-spider pol-spider-en pol-spider-pl
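As a rough mental model, joining can be thought of as concatenating the samples of the input datasets. The sketch below assumes exactly that; the actual behavior of join.py may differ (for example in how databases are merged), and the join_samples helper, the "source" field, and the example samples are hypothetical.

```python
def join_samples(named_datasets):
    """Concatenate sample lists in order, tagging each sample with its source dataset."""
    joined = []
    for name, samples in named_datasets:
        for sample in samples:
            # Copy the sample so the input lists are left unchanged.
            joined.append({**sample, "source": name})
    return joined


# Hypothetical one-sample datasets.
pol_spider_en = [{"question": "Ilu piosenkarzy pochodzi z Francji?", "db_id": "concert_singer"}]
pol_spider_pl = [{"question": "Ile jest koncertów?", "db_id": "koncert_piosenkarz"}]

pol_spider = join_samples([
    ("pol-spider-en", pol_spider_en),
    ("pol-spider-pl", pol_spider_pl),
])
print(len(pol_spider))
```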

App

The app directory contains a Streamlit app that makes it easy to use the C3SQL and RESDSQL models.

Starting app

To use the RESDSQL model, first download its weights from Hugging Face 🤗 and place them inside app/models.

cd app
docker compose up --build

Experiments

The experiments directory contains dockerized code for experiments with RAT-SQL, BRIDGE, RESDSQL, and C3.

Evaluation

The evaluation directory contains code for calculating metrics.
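This README does not list the metrics. One metric commonly reported for Spider-style text-to-SQL benchmarks is execution accuracy, and the sketch below shows one simple way to compute it with Python's built-in sqlite3 module. It is an assumption-laden illustration rather than the code in the evaluation directory: the execution_match helper, the table, and the query pairs are hypothetical.

```python
import sqlite3


def execution_match(conn, gold_sql, pred_sql):
    """Return True if both queries produce the same rows, ignoring row order."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(map(repr, gold_rows)) == sorted(map(repr, pred_rows))


# Hypothetical in-memory database with translated schema names and values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE piosenkarz (imie TEXT, kraj TEXT)")
conn.executemany(
    "INSERT INTO piosenkarz VALUES (?, ?)",
    [("Anna", "Francja"), ("Jan", "Polska")],
)

# (gold query, predicted query) pairs.
pairs = [
    ("SELECT count(*) FROM piosenkarz WHERE kraj = 'Francja'",
     "SELECT count(imie) FROM piosenkarz WHERE kraj = 'Francja'"),
    ("SELECT imie FROM piosenkarz", "SELECT kraj FROM piosenkarz"),
    ("SELECT imie FROM piosenkarz", "SELECT imie FROM brak_tabeli"),
]

results = [execution_match(conn, gold, pred) for gold, pred in pairs]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

Comparing sorted rows ignores result order, which is convenient for queries without ORDER BY but too lenient for queries where order matters.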