Splitting strategies for measuring compositionality in text2query tasks

Overview

This Python script splits text2query datasets. It provides a flexible command-line interface for choosing a splitting strategy and generates the corresponding training and test subsets.

The expected input format is a JSON list of records like the following:

[{
    "query": {
        "en": "SELECT count(*) FROM head WHERE age  >  56;",
        "ru": "SELECT count(*) FROM head WHERE age  >  56;"
    },
    "question": {
        "en": "How many heads of the departments are older than 56 ?",
        "ru": "Сколько руководителей отделов старше 56 лет?"
    },
    "mask_with_value_and_schema": "SELECT count ( ATTRIBUTE_1 ) FROM TABLE_1 WHERE ATTRIBUTE_2 > NUMERIC_VALUE_1"
}]

The mask_with_value_and_schema key is expected to be prepared before splitting.
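
For illustration, here is a minimal sketch of loading and validating such a file. The helper name and the checks it performs are assumptions for this example, not part of the script:

import json

# Keys every record is expected to carry before splitting.
REQUIRED_KEYS = {"query", "question", "mask_with_value_and_schema"}

def load_split_file(path):
    # Hypothetical helper: read one split file and verify its records.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for i, record in enumerate(data):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"Record {i} in {path} is missing keys: {missing}")
    return data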

Command-Line Arguments

  • --dataset_name (type: str): The name of the original dataset.
  • --split_name (type: str): The splitting strategy. Supported values: "long_train", "long_test", "template", or "paraphrase".
  • --list_of_splits_path (type: str, nargs='+'): List of paths to the original split files.
  • --proxy_key (type: str, default: "mask_with_value_and_schema"): Key holding the proxy target (the masked query) used to group samples.
  • --question_key (type: str, default: "question"): Key name for question (used in certain split strategies).
  • --language (type: str, default: None): Language type (used in certain split strategies).
  • --test_proportion (type: float, default: 0.2): Proportion of data to allocate to the test set.
  • --random_seed (type: int, default: 58): Seed value for randomization.
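
For reference, the interface above corresponds roughly to the following argparse sketch, reconstructed from the list; the actual script may define it differently:

import argparse

# Sketch of the CLI reconstructed from the argument list above.
parser = argparse.ArgumentParser(description="Split a text2query dataset.")
parser.add_argument("--dataset_name", type=str)
parser.add_argument("--split_name", type=str,
                    choices=["long_train", "long_test", "template", "paraphrase"])
parser.add_argument("--list_of_splits_path", nargs="+", type=str)
parser.add_argument("--proxy_key", type=str, default="mask_with_value_and_schema")
parser.add_argument("--question_key", type=str, default="question")
parser.add_argument("--language", type=str, default=None)
parser.add_argument("--test_proportion", type=float, default=0.2)
parser.add_argument("--random_seed", type=int, default=58)
args = parser.parse_args()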

Example Usage

Here's an example command for splitting a dataset using this tool:

python split_dataset.py --dataset_name your_dataset --split_name template \
    --list_of_splits_path split1.json split2.json \
    --proxy_key mask_with_value_and_schema \
    --question_key question \
    --language en \
    --test_proportion 0.3 \
    --random_seed 58

Splitting Strategies

Current statistics per split for WikiSQL and PAUQ datasets can be found here.

Long Train (long_train)

  • Splits data into train and test sets by target length, placing shorter targets in the test set.
  • Use --split_name long_train to apply this strategy (a sketch of both length-based splits follows the long_test description below).

Long Test (long_test)

  • Splits data into train and test sets by target length, placing longer targets in the test set.
  • Use --split_name long_test to apply this strategy (see the sketch below).
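
Below is a minimal sketch of how such a length-based split could work, covering both long_train and long_test. The function name and the use of the masked query's character length are assumptions; the script may measure length differently:

def length_split(data, test_proportion=0.2, long_test=True,
                 proxy_key="mask_with_value_and_schema"):
    # Order samples by the length of the masked target query.
    ordered = sorted(data, key=lambda r: len(r[proxy_key]))
    n_test = int(len(ordered) * test_proportion)
    if long_test:
        # long_test: the longest targets form the test set.
        cut = len(ordered) - n_test
        train, test = ordered[:cut], ordered[cut:]
    else:
        # long_train: the shortest targets form the test set.
        train, test = ordered[n_test:], ordered[:n_test]
    return train, test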

Template (template)

  • Generates splits using a template-based strategy, grouping samples by proxy_key and sizing the test set with test_proportion.
  • Use --split_name template to apply this strategy (see the sketch below).
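
Below is a minimal sketch of one common form of template split: samples are grouped by their proxy_key value, and whole groups are assigned to the test set so that no test template appears in training. The function name and exact grouping are assumptions; the script's actual logic may differ:

import random
from collections import defaultdict

def template_split(data, proxy_key="mask_with_value_and_schema",
                   test_proportion=0.2, random_seed=58):
    # Group samples by their masked-query template.
    groups = defaultdict(list)
    for record in data:
        groups[record[proxy_key]].append(record)

    templates = list(groups)
    random.Random(random_seed).shuffle(templates)

    # Move whole templates into the test set until it is large enough,
    # so train and test never share a template.
    target = int(len(data) * test_proportion)
    train, test = [], []
    for template in templates:
        bucket = test if len(test) < target else train
        bucket.extend(groups[template])
    return train, test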

Paraphrase (paraphrase)

  • Generates splits based on paraphrase data, considering the proxy_key, question_key, test_proportion, and language.
  • Use --split_name paraphrase to apply this strategy (see the sketch below).
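
The README does not spell out the paraphrase logic. Purely as an illustration, one plausible interpretation is sketched below: samples that share a masked query but differ in question wording are treated as paraphrases, and held-out phrasings form the test set. The function name and grouping are assumptions, not the script's actual algorithm:

import random
from collections import defaultdict

def paraphrase_split(data, proxy_key="mask_with_value_and_schema",
                     question_key="question", language="en",
                     test_proportion=0.2, random_seed=58):
    # Samples sharing a masked query but phrased differently are treated
    # as paraphrases of one another.
    groups = defaultdict(list)
    for record in data:
        groups[record[proxy_key]].append(record)

    rng = random.Random(random_seed)
    target = int(len(data) * test_proportion)
    train, test = [], []
    for records in groups.values():
        # Distinct question texts in the chosen language count as phrasings.
        by_question = {}
        for r in records:
            text = r[question_key][language] if language else str(r[question_key])
            by_question.setdefault(text, []).append(r)
        phrasings = list(by_question.values())
        rng.shuffle(phrasings)
        if len(phrasings) > 1 and len(test) < target:
            # Keep one phrasing in train; hold the others out for test.
            train.extend(phrasings[0])
            for held_out in phrasings[1:]:
                test.extend(held_out)
        else:
            for p in phrasings:
                train.extend(p)
    return train, test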

Output

  • The tool will save the generated split dictionary in a JSON file.
  • The file will be saved in a directory named task_splits/<dataset_name> with the filename <dataset_name>_<split_strategy>_split.json.
  • You will see a message indicating the path to the saved split file.
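
A sketch of the saving step implied by this naming scheme (the helper name and path construction are assumptions):

import json
import os

def save_split(split_dict, dataset_name, split_strategy):
    # Build task_splits/<dataset_name>/<dataset_name>_<split_strategy>_split.json
    out_dir = os.path.join("task_splits", dataset_name)
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"{dataset_name}_{split_strategy}_split.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(split_dict, f, ensure_ascii=False, indent=2)
    print(f"Saved split to {out_path}")
    return out_path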