BlueBrain/Search

Compare Huggingface vs. SpaCy performance on NER

Actions

  • Run Huggingface Transformer NER model training on DGX1 on CPU and on GPU. What is the speedup that we get? How do these runtimes compare with what we get when training the NER model using SpaCy (also on CPU and GPU)?
  • Compute NER curves for Huggingface Transformer, and compare them with what we got in #601 for SpaCy.
  • Based on the results (runtime + learning curves) decide whether in the future we should use Huggingface or SpaCy to train our NER models.

Dependencies

Let me first describe both models:

  • SpaCy model: A vanilla NER model not using transformers. The config.cfg was generated automatically via prodigy data-to-spacy.
  • HF model: Uses the pretrained model https://huggingface.co/dmis-lab/biobert-v1.1 with a token classification head on top of it (a minimal sketch is shown right after this list).
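
For concreteness, here is a minimal sketch of what the HF model amounts to; the label set below is a made-up assumption for illustration only:

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical label scheme; the real entity types come from our annotation setup
labels = ["O", "B-CELL_TYPE", "I-CELL_TYPE"]

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForTokenClassification.from_pretrained(
    "dmis-lab/biobert-v1.1",
    num_labels=len(labels),
)  # BioBERT encoder + a freshly initialized token classification head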

Also, the numbers below always refer to the newest (as of writing this post) dataset, which contains 339 paragraphs.

Run Huggingface Transformer NER model training on DGX1 on CPU and on GPU. What is the speedup that we get? How do these runtimes compare with what we get when training the NER model using SpaCy (also on CPU and GPU)?

Regarding the setup:
The SpaCy model training seems to run on a single CPU only, and that is what we used.
The HF Trainer automatically detects and uses all available GPUs, and that is what we used.

The exact runtimes depend on how the early stopping and other parameters (e.g. evaluation frequency) are configured, but both HF and SpaCy train in under 5 minutes. IMO the training time is negligible in both cases, assuming we only want to train once.
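
For reference, this is roughly the kind of HF setup meant above, sketched with hypothetical hyperparameters; model is the token classification model from the earlier sketch, and train_dataset / dev_dataset stand for the tokenized, label-aligned Prodigy splits, assumed to be prepared elsewhere:

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Hypothetical settings: these are the knobs (early stopping, evaluation frequency)
# that make the exact runtime vary
training_args = TrainingArguments(
    output_dir="ner_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=20,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
)

trainer = Trainer(
    model=model,                  # token classification model from the sketch above
    args=training_args,
    train_dataset=train_dataset,  # assumed: tokenized + label-aligned training split
    eval_dataset=dev_dataset,     # assumed: tokenized + label-aligned dev split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()  # runs on all visible GPUs automatically, or on CPU if none are available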

Compute NER curves for Huggingface Transformer, and compare them with what we got in #601 for SpaCy.

Assuming no mistakes, HF is clearly better than SpaCy:
[Figure: hf_vs_spacy learning curves]

Based on the results (runtime + learning curves) decide whether in the future we should use Huggingface or SpaCy to train our NER models.

One thing to consider here is that, to use the HF transformer model and get results on SpaCy tokens, additional token alignment logic is needed, which requires custom code. It is necessary both at training and inference time. If we don't mind this extra complexity, then I would definitely go for HF.

Here are the results comparing:

  • SpaCy model: Vanilla NER model not using transformers. The config.cfg was generated automatically via prodigy data-to-spacy -> referred to as spacy
  • SpaCy model using the biobert_v1.1 transformer backbone. The config.cfg was taken directly from Search/data_and_models/pipelines/ner/config.cfg -> referred to as spacy_hf
  • HF model: Using the pretrained model https://huggingface.co/dmis-lab/biobert-v1.1 with a token classifier on top of it -> referred to as hf

[Figure: learning curves comparing spacy, spacy_hf, and hf]

2022-07-05 Discussion

Based on present results:

  • spacy-tok2vec (the vanilla spacy model above) seems to be worse than the other two models in all cases
  • spacy-transformer (spacy_hf above) seems to be more or less equivalent to huggingface (hf)

We have decided to only use the "raw" HuggingFace transformer from now on.

Some of the main reasons:

  • It seems to perform better than a non-transformer spacy model and on par with a transformer spacy model
  • It is more transparent about what it is doing under the hood (and therefore easier to customize)
  • Multi-GPU training is supported

We further decided that the only token alignment we will perform is from spacy tokens -> WordPiece tokens. Note that this is necessary before training, since our Prodigy annotations use spacy tokens. To do this, we use the flag is_split_into_words=True when calling the WordPiece tokenizer (see docs).
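
A minimal sketch of that alignment, with a made-up sentence and label scheme (both are assumptions for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

# Hypothetical example: spacy-level tokens and labels as they come out of Prodigy
spacy_tokens = ["Dopaminergic", "neurons", "express", "TH", "."]
spacy_labels = ["B-CELL_TYPE", "I-CELL_TYPE", "O", "B-PROTEIN", "O"]
label2id = {"O": 0, "B-CELL_TYPE": 1, "I-CELL_TYPE": 2, "B-PROTEIN": 3}

# is_split_into_words=True tells the tokenizer that the input is already pre-tokenized
encoding = tokenizer(spacy_tokens, is_split_into_words=True)

# word_ids() maps every word piece back to the spacy token it came from, so the
# labels can be propagated; special tokens ([CLS], [SEP]) get the ignore index -100
aligned_labels = [
    -100 if word_id is None else label2id[spacy_labels[word_id]]
    for word_id in encoding.word_ids()
]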

Inference will not support custom "pretokenization" (spacy tokens) and will only work on raw strings:

custom_pretokenization = ["This", "sentence", "is", "great"]  # <- NOT supported
raw_s = "This sentence is great"  # <- supported
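
For example, inference on a raw string could then look like the sketch below; "ner_model" is a hypothetical path, i.e. wherever the fine-tuned checkpoint was saved:

from transformers import pipeline

# "ner_model" is a placeholder for the directory containing the fine-tuned checkpoint
ner = pipeline("token-classification", model="ner_model", aggregation_strategy="simple")
print(ner("This sentence is great"))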