musixmatchresearch/umberto

Model fine-tuning Example

loretoparisi opened this issue · 1 comment

Add a new example of fine-tuning the umBERTo model on a specific text domain. For this example we will use a poetry domain, like the one attached here (Dante Alighieri, La Divina Commedia).
dante.txt

To adapt your model to a specific text domain, such as poetry, you can use Hugging Face's run_language_modeling.py script.

Make sure you have a train dataset and a test dataset; you can split your data however you prefer (e.g. 70% train / 30% test). Since the dataset contains no labels, you don't need to track metrics like accuracy or F-score; instead, check that perplexity or loss is decreasing.
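As a starting point, here is a minimal sketch of a 70/30 split over the attached corpus (the file names are illustrative; it assumes dante.txt is a plain UTF-8 text file):

# Minimal 70/30 split of the corpus into train and test files.
# (File names are illustrative; dante.txt is the attachment above.)
from pathlib import Path

lines = Path("dante.txt").read_text(encoding="utf-8").splitlines(keepends=True)
split = int(len(lines) * 0.7)  # 70% train / 30% test

Path("dante_train.txt").write_text("".join(lines[:split]), encoding="utf-8")
Path("dante_test.txt").write_text("".join(lines[split:]), encoding="utf-8")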

This script uses the Trainer class, which takes its training arguments directly from the args passed to run_language_modeling.py. So if you want to set training arguments like per_device_train_batch_size, num_train_epochs and so on, take a look at the TrainingArguments documentation and add those flags to the command below (an extended example follows the basic command).

export TRAIN_FILE=/path/to/dante_train.txt
export TEST_FILE=/path/to/dante_test.txt

python run_language_modeling.py \
    --output_dir=output \
    --model_type=camembert \
    --model_name_or_path=Musixmatch/umberto-commoncrawl-cased-v1 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm
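
For example, to also control batch size, number of epochs and checkpointing, you could extend the basic command like this (the flag values are illustrative, not tuned):

python run_language_modeling.py \
    --output_dir=output \
    --model_type=camembert \
    --model_name_or_path=Musixmatch/umberto-commoncrawl-cased-v1 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --per_device_train_batch_size=8 \
    --num_train_epochs=3 \
    --save_steps=500 \
    --overwrite_output_dir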

You can set --model_name_or_path to either

  • Musixmatch/umberto-commoncrawl-cased-v1

  • or Musixmatch/umberto-wikipedia-uncased-v1.
    If you use Musixmatch/umberto-wikipedia-uncased-v1, the only caveat is that you have to edit run_language_modeling.py to pass the do_lower_case argument, since this model is uncased. To do so, go to the lines where the tokenizer is instantiated and add do_lower_case=True, like:

...
# Pass do_lower_case=True so the uncased model's tokenizer lowercases its input.
if model_args.tokenizer_name:
    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, do_lower_case=True, cache_dir=model_args.cache_dir)
elif model_args.model_name_or_path:
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, do_lower_case=True, cache_dir=model_args.cache_dir)
else:
    raise ValueError(
        "You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it,"
        "and load it from here, using --tokenizer_name"
    )
...
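
Once training and evaluation are done (with --do_eval the script reports perplexity), a quick sanity check is to run fill-mask with the fine-tuned model. This is just a sketch: it assumes the output directory from the command above, and the example verse is illustrative (umBERTo, being CamemBERT-based, uses <mask> as its mask token):

from transformers import pipeline

# Load the fine-tuned model and tokenizer saved by run_language_modeling.py
# ("output" matches --output_dir in the command above).
fill_mask = pipeline("fill-mask", model="output", tokenizer="output")

# Inspect the top predictions for the masked token.
for prediction in fill_mask("nel mezzo del <mask> di nostra vita"):
    print(prediction["token_str"], prediction["score"])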