musixmatchresearch/umberto

Sequence Labeling

Closed this issue · 1 comment

Hi,

thank you for the awesome work! Is there a way to reproduce the sequence labeling task?

Hi @AntonioMarsella , thank you.
UmBERTo is a language model that you can fine-tune on whatever Italian NLP task you need.
Here you can find an example of fine-tuning on a NER task using the Hugging Face transformers library.
That example uses GermEval, a German dataset for NER token classification, but you can swap in any Italian dataset you want, such as EVALITA or WikiNER; a sketch of how to fetch the example follows.
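If you want to run it locally, here is a minimal sketch for getting the script (the path below is an assumption: the example has moved between transformers releases, so check the repository for the current location of run_ner.py):

# Clone the Hugging Face transformers repository, which ships the NER example.
git clone https://github.com/huggingface/transformers.git
cd transformers

# In older releases the script lived under examples/ner; in newer ones the
# equivalent version is kept under examples/legacy/token-classification.
cd examples/legacy/token-classification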
The important thing is to change the language model you start from: in your case you want to start from an UmBERTo language model.
The German example starts from:

export BERT_MODEL=bert-base-multilingual-cased

Change it to this (UmBERTo-Wikipedia):

export BERT_MODEL=Musixmatch/umberto-wikipedia-uncased-v1

or to this (UmBERTo-CommonCrawl):

export BERT_MODEL=Musixmatch/umberto-commoncrawl-cased-v1

Then you can start fine-tuning on your Italian dataset (WikiNER or EVALITA) like this:

python3 run_ner.py --data_dir $YOUR_DATASET \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
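Before launching it, set the environment variables the command refers to. A minimal sketch with illustrative values (the numbers are assumptions chosen as reasonable defaults, not requirements; adjust them to your dataset and hardware):

# Folder containing train.txt, dev.txt and test.txt in CoNLL-style format.
export YOUR_DATASET=./data
# Where checkpoints, metrics and predictions will be written.
export OUTPUT_DIR=umberto-ner
# Training hyperparameters (illustrative values only).
export MAX_LENGTH=128
export NUM_EPOCHS=3
export BATCH_SIZE=32
export SAVE_STEPS=750
export SEED=1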

You can use the German example to better understand the file format and all the environment variables.
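As a rough sketch of the expected input format (assuming the same CoNLL-style layout as the German example): each of train.txt, dev.txt and test.txt has one token and its label per line separated by whitespace, with a blank line between sentences, and labels.txt lists each label once. The snippet below creates a tiny, invented sample just to show the layout; the actual label set depends on your dataset:

# labels.txt: one label per line (the exact tag set depends on your dataset).
cat > labels.txt <<EOF
O
B-PER
I-PER
B-LOC
I-LOC
B-ORG
I-ORG
EOF

# A minimal, made-up training file showing the "token label" layout,
# with a blank line separating sentences.
cat > train.txt <<EOF
Giuseppe B-PER
Verdi I-PER
nacque O
a O
Busseto B-LOC
. O
EOF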
Hope this helps!