Code for mixing Enrico sets?
takiholadi opened this issue · 2 comments
takiholadi commented
What is the proper way to mix the datasets provided by Enrico, and what size should the resulting mixture be?
Enrico sets: https://github.com/google-research/FLAN/tree/main/flan/v2#download
Mixture percentages: https://github.com/google-research/FLAN/blob/main/flan/v2/run_example.py
For now I use:
import datasets
cot_submix = datasets.load_dataset('conceptofmind/cot_submix_original')
dialog_submix = datasets.load_dataset('conceptofmind/dialog_submix_original')
niv2_submix = datasets.load_dataset('conceptofmind/niv2_submix_original')
flan2021_submix = datasets.load_dataset('conceptofmind/flan2021_submix_original')
t0_submix = datasets.load_dataset('conceptofmind/t0_submix_original')
cot_zsopt = cot_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
cot_fsopt = cot_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
dialog_zsopt = dialog_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
dialog_fsopt = dialog_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
niv2_zsopt = niv2_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
niv2_fsopt = niv2_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
flan_zsopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
flan_fsopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
flan_zsnoopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'zs_noopt')
flan_fsnoopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'fs_noopt')
t0_zsopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
t0_fsopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
t0_zsnoopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'zs_noopt')
t0_fsnoopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'fs_noopt')
all_datasets = [
    flan_zsopt,
    flan_fsopt,
    flan_zsnoopt,
    flan_fsnoopt,
    # T0
    t0_zsopt,
    t0_fsopt,
    t0_zsnoopt,
    t0_fsnoopt,
    # NIv2
    niv2_zsopt,
    niv2_fsopt,
    # CoT
    cot_zsopt,
    cot_fsopt,
    # Dialog
    dialog_zsopt,
    dialog_fsopt,
]
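probabilities = [
    0.4/4, 0.4/4, 0.4/4, 0.4/4,
    # T0: 0.32 split evenly over 4 template types
    0.32/4, 0.32/4, 0.32/4, 0.32/4,
    # NIv2: 0.2 split evenly over 2 template types
    0.2/2, 0.2/2,
    # CoT: 0.05 split evenly over 2 template types
    0.05/2, 0.05/2,
    # Dialog: 0.03 split evenly over 2 template types
    0.03/2, 0.03/2,
]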
flan2022_submix = datasets.interleave_datasets(
    datasets=all_datasets,
    probabilities=probabilities,
    seed=567,
    stopping_strategy='first_exhausted',
)
flan2022_submix.to_csv('flan2022_submix.csv')
The final dataset has 3699512 examples.
Is this correct?
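A rough sanity check on the numbers above, as a sketch rather than anything taken from the FLAN repo: the per-template probabilities can be derived from the five submix weights instead of typed by hand, and the length under `stopping_strategy='first_exhausted'` can be approximated by whichever source has the smallest size-to-probability ratio, since interleaving stops once any one source runs out. The helper names and the row counts in `toy_sizes` are hypothetical and only there for illustration; the real split sizes would have to come from `len(...)` on each filtered dataset.

```python
import math

# Submix weights as used in the snippet above (they should sum to 1).
SUBMIX_WEIGHTS = {
    "flan2021": 0.40,
    "t0": 0.32,
    "niv2": 0.20,
    "cot": 0.05,
    "dialog": 0.03,
}

# Number of template types kept per submix in the snippet above.
TEMPLATES_PER_SUBMIX = {"flan2021": 4, "t0": 4, "niv2": 2, "cot": 2, "dialog": 2}


def per_template_probabilities(weights, templates):
    """Split each submix weight evenly across its template types."""
    probs = []
    for name, weight in weights.items():
        probs.extend([weight / templates[name]] * templates[name])
    return probs


def estimate_first_exhausted_size(sizes, probs):
    """Rough expected length: the source with the smallest
    size/probability ratio is the one expected to run out first."""
    return min(n / p for n, p in zip(sizes, probs))


probabilities = per_template_probabilities(SUBMIX_WEIGHTS, TEMPLATES_PER_SUBMIX)
assert len(probabilities) == 14
assert math.isclose(sum(probabilities), 1.0)

# Hypothetical row counts, for illustration only (not the real split sizes).
toy_sizes = [1000] * 14
print(estimate_first_exhausted_size(toy_sizes, probabilities))
```

Because sampling is random, the actual length will differ somewhat from this estimate, so it is only useful as an order-of-magnitude check against the reported size.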
shayne-longpre commented
@takiholadi Yes, this looks correct!
vince62s commented
@takiholadi Do you use the output as is, or do you uniformise the prompts across the dataset?