Part-of-speech tagging using BERT

Primary language: Python. License: MIT.

BERT POS

Quickstart

Download BERT models

./scripts/getmodels.sh

Experiment with FinBERT cased and TDT data

MODELDIR="models/bert-base-finnish-cased"
DATADIR="data/tdt"

python3 train.py \
    --vocab_file "$MODELDIR/vocab.txt" \
    --bert_config_file "$MODELDIR/bert_config.json" \
    --init_checkpoint "$MODELDIR/bert-base-finnish-cased" \
    --data_dir "$DATADIR" \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --predict test \
    --output pred.tsv

python3 scripts/mergepos.py "$DATADIR/test.conllu" pred.tsv > pred.conllu
python3 scripts/conll18_ud_eval.py -v "$DATADIR/gold-test.conllu" pred.conllu
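
The merge step above puts the predicted tags from pred.tsv back into a CoNLL-U file so that the evaluation script can read them. As a rough illustration only, and not the actual code of scripts/mergepos.py, the idea can be sketched in Python as follows. The function name merge_upos is hypothetical, and the sketch assumes that the TSV holds one FORM<TAB>TAG row per word, in the same order as the word lines of the CoNLL-U file.

```python
def merge_upos(conllu_lines, tsv_lines):
    """Copy predicted tags from TSV rows into the UPOS column of CoNLL-U lines.

    Hypothetical helper for illustration; assumes one "FORM<TAB>TAG" row per
    word, in the same order as the CoNLL-U word lines.
    """
    # Skip blank TSV rows (sentence separators) and keep only the tag column.
    tags = (
        row.rstrip("\n").split("\t")[1]
        for row in tsv_lines
        if row.strip()
    )
    merged = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Only plain integer IDs are regular words. Comments, blank lines,
        # multiword ranges (1-2) and empty nodes (1.1) pass through unchanged.
        if len(fields) == 10 and fields[0].isdigit():
            fields[3] = next(tags)  # UPOS is the fourth CoNLL-U column
            line = "\t".join(fields)
        merged.append(line)
    return merged


example_conllu = [
    "# text = Hei maailma",
    "1\tHei\t_\tX\t_\t_\t_\t_\t_\t_",
    "2\tmaailma\t_\tX\t_\t_\t_\t_\t_\t_",
    "",
]
example_tsv = ["Hei\tINTJ", "maailma\tNOUN", ""]
example_merged = merge_upos(example_conllu, example_tsv)
for merged_line in example_merged:
    print(merged_line)
```

Comment lines and blank lines are passed through untouched, so the sentence boundaries of the input CoNLL-U file are preserved.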

CoNLL'18 UD data

Manually annotated data

(A small part of this data is found in data/ud-treebanks-v2.2/)

curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2837/ud-treebanks-v2.2.tgz

tar xvzf ud-treebanks-v2.2.tgz

Predictions from CoNLL'18 participants

(A small part of this data is found in data/official-submissions/)

curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2885/conll2018-test-runs.tgz

tar xvzf conll2018-test-runs.tgz

Evaluation script

wget https://universaldependencies.org/conll18/conll18_ud_eval.py \
    -O scripts/conll18_ud_eval.py
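
The official script reports UPOS as precision, recall and F1 over aligned words. When the system and gold files share the same tokenization, these reduce to plain per-token accuracy. The following is a simplified sketch of that special case, not a replacement for conll18_ud_eval.py. The function names read_tags and upos_accuracy are hypothetical, and the input is assumed to be FORM<TAB>TAG rows with identical tokens in both files.

```python
def read_tags(tsv_lines):
    """Return the tag column of all non-blank "FORM<TAB>TAG" rows."""
    return [row.rstrip("\n").split("\t")[1] for row in tsv_lines if row.strip()]


def upos_accuracy(gold_lines, pred_lines):
    """Percentage of tokens whose predicted tag equals the gold tag.

    Assumes both inputs contain the same tokens in the same order; the
    official script additionally handles differing tokenization.
    """
    gold = read_tags(gold_lines)
    pred = read_tags(pred_lines)
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        raise ValueError("no tokens to evaluate")
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)


example_gold = ["Hei\tINTJ", "maailma\tNOUN", "", "Se\tPRON", "toimii\tVERB"]
example_pred = ["Hei\tINTJ", "maailma\tNOUN", "", "Se\tPRON", "toimii\tNOUN"]
print(upos_accuracy(example_gold, example_pred))  # 3 of 4 tags match: 75.0
```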

Reformat

Gold data

for t in tdt ftb pud; do
    mkdir -p "data/$t"
    for f in data/ud-treebanks-v2.2/*/fi_${t}-ud-*.conllu; do
        # Split name (train, dev or test) taken from the file name
        s=$(echo "$f" | perl -pe 's/.*\/.*-ud-(.*)\.conllu/$1/')
        # Keep word lines and blank lines, then output the FORM and UPOS columns
        grep -E '^([0-9]+'$'\t''|[[:space:]]*$)' "$f" | cut -f 2,4 \
            > "data/$t/$s.tsv"
    done
    # Label set: the unique tags found in the test split
    cut -f 2 "data/$t/test.tsv" | grep -E -v '^[[:space:]]*$' | sort -u \
        > "data/$t/labels.txt"
    mv "data/$t/test.tsv" "data/$t/gold-test.tsv"
    cp data/ud-treebanks-v2.2/*/fi_${t}-ud-test.conllu "data/$t/gold-test.conllu"
done
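
The loop above reduces each CoNLL-U file to two columns (word form and UPOS tag) and collects the tag inventory. For readers less familiar with the grep and cut idiom, the same filtering can be sketched in Python. This is an illustrative equivalent under stated assumptions, not part of the repository: the function names conllu_to_tsv and label_set are hypothetical.

```python
import re

# A regular word line starts with an integer ID followed by a tab. This
# excludes comments (#), multiword ranges (1-2) and empty nodes (1.1).
WORD_LINE = re.compile(r"^[0-9]+\t")


def conllu_to_tsv(conllu_lines):
    """Reduce CoNLL-U lines to "FORM<TAB>UPOS" rows, keeping blank separators."""
    rows = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line.strip():
            rows.append("")  # blank line marks a sentence boundary
        elif WORD_LINE.match(line):
            fields = line.split("\t")
            rows.append(fields[1] + "\t" + fields[3])  # columns 2 and 4
    return rows


def label_set(tsv_rows):
    """Sorted unique tags from the second column of non-blank rows."""
    return sorted({row.split("\t")[1] for row in tsv_rows if row.strip()})


example_lines = [
    "# sent_id = 1",
    "1\tHei\thei\tINTJ\t_\t_\t0\troot\t_\t_",
    "2-3\tettei\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tmaailma\tmaailma\tNOUN\t_\t_\t1\tvocative\t_\t_",
    "",
]
example_rows = conllu_to_tsv(example_lines)
example_labels = label_set(example_rows)
print(example_rows)
print(example_labels)
```

The comment line and the multiword range line are dropped, while the blank line is kept so that sentences stay separated in the TSV output.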

PUD has no train or dev split, so reuse the TDT splits

for s in train dev; do
    cp data/tdt/$s.tsv data/pud
done

Test data with predicted tokens

for t in tdt ftb pud; do
    cp "data/official-submissions/Uppsala-18/fi_$t.conllu" "data/$t/test.conllu"
    # Keep word lines and blank lines; attach the placeholder tag X to each token
    grep -E '^([0-9]+'$'\t''|[[:space:]]*$)' "data/$t/test.conllu" \
        | cut -f 2 | perl -pe 's/(\S+)$/$1\tX/' > "data/$t/test.tsv"
done

Reference results

Best UPOS result for each Finnish treebank in the CoNLL'18 shared task, from https://universaldependencies.org/conll18/results-upos.html

fi_ftb: 1. HIT-SCIR (Harbin): 96.70
fi_pud: 1. LATTICE (Paris)  : 97.65
fi_tdt: 1. HIT-SCIR (Harbin): 97.30

BERT model comparison for Finnish POS tagging

The scripts below are specific to one particular Slurm system configuration. To rerun the comparison, edit them to match your own setup first.

./slurm/run-parameter-selection.sh
python3 slurm/select_params.py logs/*.out | cut -f 1-12 > slurm/selected-params.tsv
./slurm/run-selected-params.sh
python3 slurm/summarize_test.py logs/*.out | cut -f 2,4,11-14 > results.tsv

This should give approximately the following results:

Model             Corpus Mean
FinBERT cased     FTB    98.39
FinBERT uncased   FTB    98.28
M-BERT  cased     FTB    95.87
M-BERT  uncased   FTB    96.00
FinBERT cased     PUD    98.08
FinBERT uncased   PUD    97.94
M-BERT  cased     PUD    97.58
M-BERT  uncased   PUD    97.48
FinBERT cased     TDT    98.23
FinBERT uncased   TDT    98.12
M-BERT  cased     TDT    96.97
M-BERT  uncased   TDT    96.59
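
To put these numbers next to the CoNLL'18 reference results listed earlier, the difference per corpus can be computed directly. The sketch below uses only figures quoted in this README (the FinBERT cased rows and the best CoNLL'18 UPOS scores) and assumes the two sets of scores are directly comparable. The function name gains is hypothetical.

```python
# Best CoNLL'18 UPOS results per Finnish treebank, as listed above.
conll18_best = {"FTB": 96.70, "PUD": 97.65, "TDT": 97.30}

# FinBERT cased means from the results table above.
finbert_cased = {"FTB": 98.39, "PUD": 98.08, "TDT": 98.23}


def gains(ours, reference):
    """Difference in percentage points per corpus, rounded to two decimals."""
    return {corpus: round(ours[corpus] - reference[corpus], 2) for corpus in reference}


finbert_gains = gains(finbert_cased, conll18_best)
for corpus, gain in finbert_gains.items():
    print(f"{corpus}: {gain:+.2f}")
# FTB: +1.69
# PUD: +0.43
# TDT: +0.93
```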