Using trained QuRating model to select data
Thanks for your work on this, very cool!
I'd like to use a trained QuRating model to select and sample data for me. I don't see this use case covered in the repo at present. Is there any way to do this? Concretely, is there a script that takes (1) an input data file, (2) a QuRating model, and (3) a fraction of the data I'd like included in training, and saves a file with just the data I should train on?
Thanks,
Dave
Hi Dave!
Thanks for your interest in our work and for reaching out! We have two scripts for this workflow in the repo:
Step 1: Adding quality annotations
qurater_annotate.py takes a dataset and a QuRater model, and adds new columns to the dataset containing the quality ratings. Here is an example usage if you have documents as jsonl files with a text field ({"text": "..."}):
python -m data_tools.qurater_annotate json <output path for annotated dataset> \
-F <path to jsonl files> \
-M princeton-nlp/QuRater-1.3B \
--text_field text \
--labels writing_style required_expertise facts_and_trivia educational_value
We provide the label names for the new column names, which will be writing_style_chunks (segment-level quality ratings) and writing_style_average (document-level average) for each criterion, similar to the extra columns in QuRatedPajama-260B. The order of these labels corresponds to the head indices of the QuRater model.
The resulting dataset can be inspected with Hugging Face datasets via datasets.load_from_disk(...).
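For instance, a quick sanity check might look like this (paths and column names follow from the command above):

from datasets import load_from_disk

ds = load_from_disk("<output path for annotated dataset>")
print(ds.column_names)                 # now includes e.g. writing_style_chunks, writing_style_average, ...
print(ds[0]["writing_style_average"])  # document-level score for the first example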
Step 2: Selecting a fraction of data
select_subset.py loads input datasets, and can perform top-k selection or sampling with probability proportional to exp(score / temperature). It selects data until a token budget is reached, so it requires that your dataset contain a field with the number of tokens per sequence. Here's an example use-case for selecting 1B tokens according to the educational_value_average field with temperature 2.0:
python -m data_tools.select_subset <path to annotated dataset> <output path for subset> \
--metric_field educational_value_average \
--seq_len_field <column name for sequence lengths> \
--tokens 1_000_000_000 \
--temperature 2.0 \
--normalize \
--num_workers 8
where --normalize normalizes the mean/std of the metric over the dataset. If your data has a domain field, you can select a proportional number of examples from each domain by adding --domain_field <column name for domain string>. This script writes multiple HF datasets under the output path (useful for large datasets). You can read them all with
datasets.concatenate_datasets([datasets.load_from_disk(ds) for ds in glob.glob("<output path>/*")])
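For intuition, here is a minimal sketch of the selection logic (my own illustration, not the actual implementation in select_subset.py): sampling without replacement with probability proportional to exp(score / temperature) can be done with the Gumbel-top-k trick, stopping once the token budget is exhausted.

import numpy as np

def select_indices(scores, seq_lens, token_budget, temperature=2.0, normalize=True, seed=0):
    # Sketch: sample documents without replacement, with probability
    # proportional to exp(score / temperature), up to a token budget.
    scores = np.asarray(scores, dtype=np.float64)
    if normalize:
        # analogous to --normalize: standardize the metric over the dataset
        scores = (scores - scores.mean()) / scores.std()
    rng = np.random.default_rng(seed)
    # Gumbel-top-k trick: perturb the scaled scores with Gumbel noise and sort;
    # this is equivalent to sampling from the softmax without replacement.
    keys = scores / temperature + rng.gumbel(size=scores.shape)
    selected, total = [], 0
    for i in np.argsort(-keys):
        if total >= token_budget:
            break
        selected.append(int(i))
        total += seq_lens[i]
    return selected

As the temperature approaches 0, this reduces to top-k selection by score; higher temperatures flatten the distribution toward uniform sampling.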
Hope this helps and don't hesitate to reach out!
I didn't realize we didn't have these instructions in the README, so I will add more details there, too.
Sorry I lost track of this, thanks for the info!