Using trained QuRating model to select data
Thanks for your work on this, very cool!
I'd like to use a trained QuRating model to select and sample data for me. I don't see this use case covered in the repo at present. Is there any way to do this? Concretely, is there a script that takes (1) an input data file, (2) a QuRating model, and (3) a fraction of the data I'd like included in training, and saves a file with just the data I should train on?
Thanks,
Dave
Hi Dave!
Thanks for your interest in our work and for reaching out! We have two scripts for this workflow in the repo:
Step 1: Adding quality annotations
qurater_annotate.py takes a dataset and a QuRater model, and adds new columns to the dataset containing the quality ratings. Here is an example usage if you have documents as jsonl files with a text field ({"text": "..."}):
python -m data_tools.qurater_annotate json <output path for annotated dataset> \
-F <path to jsonl files> \
-M princeton-nlp/QuRater-1.3B \
--text_field text \
--labels writing_style required_expertise facts_and_trivia educational_value
We provide the label names for the new column names, which will be writing_style_chunks (segment-level quality ratings) and writing_style_average (document-level average) for each criterion, similar to the extra columns in QuRatedPajama-260B. The order of these labels corresponds to the head indices of the QuRater model.
The resulting dataset can be inspected with Hugging Face datasets via datasets.load_from_disk(...).
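For instance, a quick sanity check might look like this (paths and column names follow from the command above):

from datasets import load_from_disk

ds = load_from_disk("<output path for annotated dataset>")
print(ds.column_names)                 # now includes e.g. writing_style_chunks, writing_style_average, ...
print(ds[0]["writing_style_average"])  # document-level score for the first example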
Step 2: Selecting a fraction of data
select_subset.py loads input datasets, and can perform top-k selection or sampling with probability proportional to exp(score / temperature). It selects data until a token budget is reached, so it requires that your dataset contain a field with the number of tokens per sequence. Here's an example use-case for selecting 1B tokens according to the educational_value_average field with temperature 2.0:
python -m data_tools.select_subset <path to annotated dataset> <output path for subset> \
--metric_field educational_value_average \
--seq_len_field <column name for sequence lengths> \
--tokens 1_000_000_000 \
--temperature 2.0 \
--normalize \
--num_workers 8
where --normalize normalizes the mean/std of the metric over the dataset. If your data has a domain field, you can select a proportional number of examples from each domain by adding --domain_field <column name for domain string>. This script writes multiple HF datasets under the output path (useful for large datasets). You can read them all with
datasets.concatenate_datasets([datasets.load_from_disk(ds) for ds in glob.glob("<output path>/*")])
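For intuition, here is a minimal sketch of the selection logic (my own illustration, not the actual implementation in select_subset.py): sampling without replacement with probability proportional to exp(score / temperature) can be done with the Gumbel-top-k trick, stopping once the token budget is exhausted.

import numpy as np

def select_indices(scores, seq_lens, token_budget, temperature=2.0, normalize=True, seed=0):
    # Sketch: sample documents without replacement, with probability
    # proportional to exp(score / temperature), up to a token budget.
    scores = np.asarray(scores, dtype=np.float64)
    if normalize:
        # analogous to --normalize: standardize the metric over the dataset
        scores = (scores - scores.mean()) / scores.std()
    rng = np.random.default_rng(seed)
    # Gumbel-top-k trick: perturb the scaled scores with Gumbel noise and sort;
    # this is equivalent to sampling from the softmax without replacement.
    keys = scores / temperature + rng.gumbel(size=scores.shape)
    selected, total = [], 0
    for i in np.argsort(-keys):
        if total >= token_budget:
            break
        selected.append(int(i))
        total += seq_lens[i]
    return selected

As the temperature approaches 0, this reduces to top-k selection by score; higher temperatures flatten the distribution toward uniform sampling.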
Hope this helps and don't hesitate to reach out!
I didn't realize we didn't have these instructions in the README, so I will add more details there, too.
Sorry I lost track of this, thanks for the info!