This Python script splits text2query datasets. It provides a flexible command-line interface for specifying the splitting strategy, making it easy to generate training and test subsets.
The expected input format is the following:

```json
[{
    "query": {
        "en": "SELECT count(*) FROM head WHERE age > 56;",
        "ru": "SELECT count(*) FROM head WHERE age > 56;"
    },
    "question": {
        "en": "How many heads of the departments are older than 56 ?",
        "ru": "Сколько руководителей отделов старше 56 лет?"
    },
    "mask_with_value_and_schema": "SELECT count ( ATTRIBUTE_1 ) FROM TABLE_1 WHERE ATTRIBUTE_2 > NUMERIC_VALUE_1"
}]
```
The `mask_with_value_and_schema` key is expected to be prepared before splitting.
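For illustration, such a mask can be produced by replacing literal values (and, given schema access, table and column names) with numbered placeholders. The snippet below is a hypothetical sketch of that preparation step; `build_mask` and its regex-based approach are illustrative and not part of this tool:

```python
import re

def build_mask(query: str) -> str:
    """Hypothetical sketch: replace literal values in a SQL query with
    numbered placeholders. Real preparation would also map table and
    column names to TABLE_N / ATTRIBUTE_N using the database schema."""
    mask = query.rstrip(";")
    # Replace quoted string literals first, then bare numbers.
    for pattern, label in ((r"'[^']*'", "TEXT_VALUE"),
                           (r"\b\d+(?:\.\d+)?\b", "NUMERIC_VALUE")):
        counter = 0
        def repl(_match):
            nonlocal counter
            counter += 1
            return f"{label}_{counter}"
        mask = re.sub(pattern, repl, mask)
    return mask

print(build_mask("SELECT count(*) FROM head WHERE age > 56;"))
# SELECT count(*) FROM head WHERE age > NUMERIC_VALUE_1
```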
The script accepts the following command-line arguments:

- `--dataset_name` (type: str): The name of the original dataset.
- `--split_name` (type: str): The splitting strategy. Supported values: `long_train`, `long_test`, `template`, `paraphrase`.
- `--list_of_splits_path` (nargs='+', type: str): List of paths to the original split files.
- `--proxy_key` (type: str, default: `mask_with_value_and_schema`): Key name for the proxy.
- `--question_key` (type: str, default: `question`): Key name for the question (used in certain split strategies).
- `--language` (type: str, default: None): Language code (used in certain split strategies).
- `--test_proportion` (type: float, default: 0.2): Proportion of data to allocate to the test set.
- `--random_seed` (type: int, default: 58): Seed value for randomization.
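These flags map directly onto an `argparse` parser. The following is a sketch of how such a parser could be declared, with names and defaults taken from the list above (the actual script may organize this differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Split a text2query dataset.")
    parser.add_argument("--dataset_name", type=str, required=True,
                        help="Name of the original dataset.")
    parser.add_argument("--split_name", type=str, required=True,
                        choices=["long_train", "long_test", "template", "paraphrase"],
                        help="Splitting strategy.")
    parser.add_argument("--list_of_splits_path", nargs="+", type=str,
                        help="Paths to the original split files.")
    parser.add_argument("--proxy_key", type=str, default="mask_with_value_and_schema",
                        help="Key name for the proxy.")
    parser.add_argument("--question_key", type=str, default="question",
                        help="Key name for the question.")
    parser.add_argument("--language", type=str, default=None,
                        help="Language code, e.g. 'en' or 'ru'.")
    parser.add_argument("--test_proportion", type=float, default=0.2,
                        help="Proportion of data to allocate to the test set.")
    parser.add_argument("--random_seed", type=int, default=58,
                        help="Seed value for randomization.")
    return parser
```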
Here's an example command for splitting a dataset using this tool:
```bash
python split_dataset.py --dataset_name your_dataset --split_name template \
    --list_of_splits_path split1.json split2.json \
    --proxy_key mask_with_value_and_schema \
    --question_key question \
    --language en \
    --test_proportion 0.3 \
    --random_seed 58
```
Current per-split statistics for the WikiSQL and PAUQ datasets can be found here.
### Long Train (`long_train`)

- Splits data into train and test sets based on the length of the samples.
- Use `--split_name long_train` to apply this strategy; it puts the shortest targets into the test set (see the length-based sketch after the `long_test` section below).
### Long Test (`long_test`)

- Splits data into train and test sets based on the length of the samples, placing long targets in the test set.
- Use `--split_name long_test` to apply this strategy.
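Both length-based strategies amount to sorting samples by target length and cutting `test_proportion` of them off one end. The sketch below illustrates the idea under the assumption that length is measured on the English query string; `length_based_split` is a hypothetical helper, not the tool's actual code:

```python
def length_based_split(samples, test_proportion=0.2, long_in_test=False):
    """Sort samples by target-query length; the shortest targets go to
    test for long_train (long_in_test=False), the longest for long_test
    (long_in_test=True)."""
    ordered = sorted(samples, key=lambda s: len(s["query"]["en"]),
                     reverse=not long_in_test)
    n_test = int(len(ordered) * test_proportion)
    # After sorting, the tail of `ordered` holds the samples destined for test.
    return {"train": ordered[:-n_test] if n_test else ordered,
            "test": ordered[-n_test:] if n_test else []}
```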
### Template (`template`)

- Generates splits using a template-based strategy, considering the `proxy_key` and `test_proportion`.
- Use `--split_name template` to apply this strategy.
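A common way to realize a template split is to group samples by their `proxy_key` value and assign whole groups to train or test, so that no query template appears in both sets. The following sketch (`template_split` is illustrative, not the tool's actual code) shows that grouping logic:

```python
import random
from collections import defaultdict

def template_split(samples, proxy_key="mask_with_value_and_schema",
                   test_proportion=0.2, random_seed=58):
    """Group samples by query template and send whole templates to the
    test set until roughly test_proportion of the samples is reached."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[proxy_key]].append(sample)
    templates = list(groups)
    random.Random(random_seed).shuffle(templates)
    target = int(len(samples) * test_proportion)
    train, test = [], []
    for template in templates:
        # Fill the test set first; everything else goes to train.
        bucket = test if len(test) < target else train
        bucket.extend(groups[template])
    return {"train": train, "test": test}
```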
### Paraphrase (`paraphrase`)

- Generates splits based on paraphrase data, considering the `proxy_key`, `question_key`, `test_proportion`, and `language`.
- Use `--split_name paraphrase` to apply this strategy.
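The tool's exact paraphrase logic is not documented here. One plausible reading is that samples sharing a query mask but carrying differently worded questions in the chosen language form paraphrase groups; the speculative sketch below keeps the first phrasing per mask in train and routes further paraphrases to test (`paraphrase_split` is illustrative only):

```python
from collections import defaultdict

def paraphrase_split(samples, proxy_key="mask_with_value_and_schema",
                     question_key="question", language="en"):
    """Speculative sketch: treat samples that share a query mask but
    carry differently worded questions as paraphrases. The first
    phrasing per mask stays in train, later paraphrases go to test.
    A real implementation would also cap test at test_proportion."""
    seen = defaultdict(set)
    train, test = [], []
    for sample in samples:
        question = sample[question_key][language]
        mask = sample[proxy_key]
        if question in seen[mask]:
            continue  # exact duplicate question, skip it
        bucket = train if not seen[mask] else test
        seen[mask].add(question)
        bucket.append(sample)
    return {"train": train, "test": test}
```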
### Output

- The tool saves the generated split dictionary in a JSON file.
- The file is saved in a directory named `task_splits/<dataset_name>` with the name `<dataset_name>_<split_strategy>_split.json`.
- You will see a message indicating the path to the saved split file.
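Based on the description above, the save step presumably resembles the following (`save_split` is an illustrative name; actual path handling may differ):

```python
import json
import os

def save_split(split_dict, dataset_name, split_name):
    """Write the split dictionary to
    task_splits/<dataset_name>/<dataset_name>_<split_name>_split.json."""
    out_dir = os.path.join("task_splits", dataset_name)
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"{dataset_name}_{split_name}_split.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(split_dict, f, ensure_ascii=False, indent=2)
    print(f"Split saved to {out_path}")
    return out_path
```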