Model fine-tuning Example
loretoparisi opened this issue · 1 comment
Add a new example of umBERTo model fine-tuning on a specific text domain. For this example we will use a poetry text domain, like the one attached here (Dante Alighieri, La Divina Commedia):
dante.txt
To help your model better understand a specific text domain, such as poetry, you can use Hugging Face's run_language_modeling.py script.
Make sure you have a train dataset and a test dataset; split your data however you prefer (e.g. 70% train / 30% test). In this case the dataset doesn't contain labels, so you don't need to check metrics like accuracy or F-score. Instead, check whether perplexity (or loss) is decreasing.
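For instance, a minimal sketch of such a split (assuming the attached dante.txt has one verse per line, and using dante_train.txt / dante_test.txt as output names to match the command below):

# Minimal sketch: 70% / 30% split of the attached dante.txt (output file names are assumptions).
with open("dante.txt", encoding="utf-8") as f:
    verses = f.readlines()

split = int(len(verses) * 0.7)  # first 70% of the lines go to training

with open("dante_train.txt", "w", encoding="utf-8") as f:
    f.writelines(verses[:split])

with open("dante_test.txt", "w", encoding="utf-8") as f:
    f.writelines(verses[split:])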
This script uses the Trainer class, which takes its training arguments directly from the args passed to run_language_modeling.py. So if you want to add training arguments like per_device_train_batch_size, num_train_epochs and so on, take a look at the TrainingArguments class and add those args to the command below (an extended example follows it).
export TRAIN_FILE=/path/to/dante_train.txt
export TEST_FILE=/path/to/dante_test.txt
python run_language_modeling.py \
--output_dir=output \
--model_type=camembert \
--model_name_or_path=Musixmatch/umberto-commoncrawl-cased-v1 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
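For example, here is the same command extended with a couple of those training arguments (the batch size and number of epochs below are just placeholder values; tune them for your hardware and dataset):
python run_language_modeling.py \
--output_dir=output \
--model_type=camembert \
--model_name_or_path=Musixmatch/umberto-commoncrawl-cased-v1 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm \
--per_device_train_batch_size=8 \
--num_train_epochs=3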
You can change --model_name_or_path, using either:
- Musixmatch/umberto-commoncrawl-cased-v1
- Musixmatch/umberto-wikipedia-uncased-v1
If you use Musixmatch/umberto-wikipedia-uncased-v1, the only caveat is to modify run_language_modeling.py to add the do_lower_case argument, since this model is uncased. To do so, go to the lines where the tokenizer is instantiated and add do_lower_case=True, like:
...
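# do_lower_case=True is the only addition: it makes the tokenizer lowercase the input to match the uncased model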
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, do_lower_case=True, cache_dir=model_args.cache_dir)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, do_lower_case=True, cache_dir=model_args.cache_dir)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it,"
"and load it from here, using --tokenizer_name"
)
...
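Once training has finished, you can sanity-check the fine-tuned model saved in --output_dir. A minimal sketch (assuming the output directory above, output, and a made-up masked verse):

from transformers import pipeline

# Load the fine-tuned model and tokenizer from the --output_dir used above ("output").
fill_mask = pipeline("fill-mask", model="output")

# Hypothetical masked verse: after fine-tuning, Dante-like completions should rank higher.
print(fill_mask(f"Nel mezzo del cammin di nostra {fill_mask.tokenizer.mask_token}"))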