This Python script splits text2query datasets. It provides a flexible command-line interface for specifying the splitting strategy, making it easy to generate training and test subsets.
The expected input format is the following:

```json
[{
    "query": {
        "en": "SELECT count(*) FROM head WHERE age > 56;",
        "ru": "SELECT count(*) FROM head WHERE age > 56;"
    },
    "question": {
        "en": "How many heads of the departments are older than 56 ?",
        "ru": "Сколько руководителей отделов старше 56 лет?"
    },
    "mask_with_value_and_schema": "SELECT count ( ATTRIBUTE_1 ) FROM TABLE_1 WHERE ATTRIBUTE_2 > NUMERIC_VALUE_1"
}]
```
The `mask_with_value_and_schema` key is expected to be prepared before splitting.
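For illustration, such a mask can be produced by replacing literal values (and, given schema access, table and column names) with numbered placeholders. The snippet below is a hypothetical sketch of that preparation step; `build_mask` and its regex-based approach are illustrative and not part of this tool:

```python
import re

def build_mask(query: str) -> str:
    """Hypothetical sketch: replace literal values in a SQL query with
    numbered placeholders. Real preparation would also map table and
    column names to TABLE_N / ATTRIBUTE_N using the database schema."""
    mask = query.rstrip(";")
    # Replace quoted string literals first, then bare numbers.
    for pattern, label in ((r"'[^']*'", "TEXT_VALUE"),
                           (r"\b\d+(?:\.\d+)?\b", "NUMERIC_VALUE")):
        counter = 0
        def repl(_match):
            nonlocal counter
            counter += 1
            return f"{label}_{counter}"
        mask = re.sub(pattern, repl, mask)
    return mask

print(build_mask("SELECT count(*) FROM head WHERE age > 56;"))
# SELECT count(*) FROM head WHERE age > NUMERIC_VALUE_1
```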
The script accepts the following command-line arguments:

- `--dataset_name` (type: str): The name of the original dataset.
- `--split_name` (type: str): The splitting strategy. Supported values: `long_train`, `long_test`, `template`, `paraphrase`.
- `--list_of_splits_path` (nargs='+', type: str): List of paths to the original split files.
- `--proxy_key` (type: str, default: `mask_with_value_and_schema`): Key name for the proxy.
- `--question_key` (type: str, default: `question`): Key name for the question (used in certain split strategies).
- `--language` (type: str, default: None): Language code (used in certain split strategies).
- `--test_proportion` (type: float, default: 0.2): Proportion of data to allocate to the test set.
- `--random_seed` (type: int, default: 58): Seed value for randomization.
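These flags map directly onto an `argparse` parser. The following is a sketch of how such a parser could be declared, with names and defaults taken from the list above (the actual script may organize this differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Split a text2query dataset.")
    parser.add_argument("--dataset_name", type=str, required=True,
                        help="Name of the original dataset.")
    parser.add_argument("--split_name", type=str, required=True,
                        choices=["long_train", "long_test", "template", "paraphrase"],
                        help="Splitting strategy.")
    parser.add_argument("--list_of_splits_path", nargs="+", type=str,
                        help="Paths to the original split files.")
    parser.add_argument("--proxy_key", type=str, default="mask_with_value_and_schema",
                        help="Key name for the proxy.")
    parser.add_argument("--question_key", type=str, default="question",
                        help="Key name for the question.")
    parser.add_argument("--language", type=str, default=None,
                        help="Language code, e.g. 'en' or 'ru'.")
    parser.add_argument("--test_proportion", type=float, default=0.2,
                        help="Proportion of data to allocate to the test set.")
    parser.add_argument("--random_seed", type=int, default=58,
                        help="Seed value for randomization.")
    return parser
```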
Here's an example command for splitting a dataset using this tool:
```bash
python split_dataset.py --dataset_name your_dataset --split_name template \
    --list_of_splits_path split1.json split2.json \
    --proxy_key mask_with_value_and_schema \
    --question_key question \
    --language en \
    --test_proportion 0.3 \
    --random_seed 58
```
Current per-split statistics for the WikiSQL and PAUQ datasets can be found here.
### Long Train (`long_train`)

- Splits data into train and test sets based on the length of the samples.
- Use `--split_name long_train` to apply this strategy; it puts the shortest targets into the test set (see the length-based sketch after the `long_test` section below).
### Long Test (`long_test`)

- Splits data into train and test sets based on the length of the samples, placing long targets in the test set.
- Use `--split_name long_test` to apply this strategy.
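Both length-based strategies amount to sorting samples by target length and cutting `test_proportion` of them off one end. The sketch below illustrates the idea under the assumption that length is measured on the English query string; `length_based_split` is a hypothetical helper, not the tool's actual code:

```python
def length_based_split(samples, test_proportion=0.2, long_in_test=False):
    """Sort samples by target-query length; the shortest targets go to
    test for long_train (long_in_test=False), the longest for long_test
    (long_in_test=True)."""
    ordered = sorted(samples, key=lambda s: len(s["query"]["en"]),
                     reverse=not long_in_test)
    n_test = int(len(ordered) * test_proportion)
    # After sorting, the tail of `ordered` holds the samples destined for test.
    return {"train": ordered[:-n_test] if n_test else ordered,
            "test": ordered[-n_test:] if n_test else []}
```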
### Template (`template`)

- Generates splits using a template-based strategy, considering the `proxy_key` and `test_proportion`.
- Use `--split_name template` to apply this strategy.
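A common way to realize a template split is to group samples by their `proxy_key` value and assign whole groups to train or test, so that no query template appears in both sets. The following sketch (`template_split` is illustrative, not the tool's actual code) shows that grouping logic:

```python
import random
from collections import defaultdict

def template_split(samples, proxy_key="mask_with_value_and_schema",
                   test_proportion=0.2, random_seed=58):
    """Group samples by query template and send whole templates to the
    test set until roughly test_proportion of the samples is reached."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[proxy_key]].append(sample)
    templates = list(groups)
    random.Random(random_seed).shuffle(templates)
    target = int(len(samples) * test_proportion)
    train, test = [], []
    for template in templates:
        # Fill the test set first; everything else goes to train.
        bucket = test if len(test) < target else train
        bucket.extend(groups[template])
    return {"train": train, "test": test}
```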
### Paraphrase (`paraphrase`)

- Generates splits based on paraphrase data, considering the `proxy_key`, `question_key`, `test_proportion`, and `language`.
- Use `--split_name paraphrase` to apply this strategy.
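The tool's exact paraphrase logic is not documented here. One plausible reading is that samples sharing a query mask but carrying differently worded questions in the chosen language form paraphrase groups; the speculative sketch below keeps the first phrasing per mask in train and routes further paraphrases to test (`paraphrase_split` is illustrative only):

```python
from collections import defaultdict

def paraphrase_split(samples, proxy_key="mask_with_value_and_schema",
                     question_key="question", language="en"):
    """Speculative sketch: treat samples that share a query mask but
    carry differently worded questions as paraphrases. The first
    phrasing per mask stays in train, later paraphrases go to test.
    A real implementation would also cap test at test_proportion."""
    seen = defaultdict(set)
    train, test = [], []
    for sample in samples:
        question = sample[question_key][language]
        mask = sample[proxy_key]
        if question in seen[mask]:
            continue  # exact duplicate question, skip it
        bucket = train if not seen[mask] else test
        seen[mask].add(question)
        bucket.append(sample)
    return {"train": train, "test": test}
```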
### Output

- The tool saves the generated split dictionary in a JSON file.
- The file is saved in a directory named `task_splits/<dataset_name>` with the name `<dataset_name>_<split_strategy>_split.json`.
- You will see a message indicating the path to the saved split file.
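Based on the description above, the save step presumably resembles the following (`save_split` is an illustrative name; actual path handling may differ):

```python
import json
import os

def save_split(split_dict, dataset_name, split_name):
    """Write the split dictionary to
    task_splits/<dataset_name>/<dataset_name>_<split_name>_split.json."""
    out_dir = os.path.join("task_splits", dataset_name)
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"{dataset_name}_{split_name}_split.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(split_dict, f, ensure_ascii=False, indent=2)
    print(f"Split saved to {out_path}")
    return out_path
```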