Compare Huggingface vs. SpaCy performance on NER
Actions
- Run Huggingface Transformer NER model training on DGX1 on CPU and on GPU. What is the speedup that we get? How do these runtimes compare with what we get when training the NER model using SpaCy (also on CPU and GPU)?
- Compute NER curves for Huggingface Transformer, and compare them with what we got in #601 for SpaCy.
- Based on the results (runtime + learning curves) decide whether in the future we should use Huggingface or SpaCy to train our NER models.
Dependencies
Let me just describe both of the models:
- SpaCy model: vanilla NER model not using transformers. The `config.cfg` was generated automatically via `prodigy data-to-spacy`.
- HF model: Using the following pretrained model https://huggingface.co/dmis-lab/biobert-v1.1 and putting a token classifier on top of it.
Also, the numbers below always refer to the newest dataset (as of writing this post), which contains 339 paragraphs.
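For context, the HF model setup is roughly the sketch below; the `num_labels` value is just a placeholder, not the label count we actually use.

```python
# Rough sketch of the HF model described above; num_labels is a placeholder.
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForTokenClassification.from_pretrained(
    "dmis-lab/biobert-v1.1",
    num_labels=5,  # placeholder: 2 * n_entity_types + 1 for a BIO scheme
)
```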
Run Huggingface Transformer NER model training on DGX1 on CPU and on GPU. What is the speedup that we get? How do these runtimes compare with what we get when training the NER model using SpaCy (also on CPU and GPU)?
Regarding the setup:
- The SpaCy model training seems to run only on a single CPU, and that is what we used.
- The HF Trainer automatically detects and uses all available GPUs, and that is what we used.
The exact runtimes depend on how exactly the early stopping and other parameters (e.g. evaluation frequency) are set, but both HF and SpaCy train in under 5 minutes. IMO the training time is negligible in both cases, assuming we only want to train once.
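For reference, the HF training is driven by something roughly like the sketch below. The evaluation frequency and early-stopping patience are placeholders (not the exact values behind the runtimes above), and `train_dataset`/`eval_dataset` are assumed to be the tokenized, label-aligned splits.

```python
# Illustrative Trainer setup; the exact hyperparameters are placeholders.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="ner_hf",
    evaluation_strategy="steps",  # evaluation frequency directly affects runtime
    eval_steps=50,
    load_best_model_at_end=True,  # required for early stopping
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,                  # token classification model (see sketch above)
    args=args,
    train_dataset=train_dataset,  # placeholders: tokenized + label-aligned splits
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()                   # automatically uses all visible GPUs
```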
Compute NER curves for Huggingface Transformer, and compare them with what we got in #601 for SpaCy.
Assuming no mistakes, HF is clearly better than SpaCy.
Based on the results (runtime + learning curves) decide whether in the future we should use Huggingface or SpaCy to train our NER models.
One thing to consider here is that to use the HF transformer model and get results on SpaCy tokens, additional token alignment logic is needed, which requires custom code. It is necessary both at train and inference time. If we don't mind this extra complexity, then I would definitely go for HF.
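To give an idea of what that custom code involves, here is a rough sketch of one direction of the alignment (mapping word-piece-level predictions back onto spaCy tokens); the function name and the "keep the first piece's label" rule are just one possible choice.

```python
# Sketch: collapse word-piece predictions back onto the original spaCy tokens.
# Requires a fast tokenizer so that encoding.word_ids() is available.
def wordpieces_to_spacy_labels(spacy_tokens, wordpiece_predictions, tokenizer):
    encoding = tokenizer(spacy_tokens, is_split_into_words=True, truncation=True)
    labels = [None] * len(spacy_tokens)
    for piece_idx, word_idx in enumerate(encoding.word_ids()):
        # Special tokens ([CLS]/[SEP]) map to None; keep the first piece's label.
        if word_idx is not None and labels[word_idx] is None:
            labels[word_idx] = wordpiece_predictions[piece_idx]
    return labels
```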
Here are the results comparing the following models:
- SpaCy model: vanilla NER model not using transformers. The `config.cfg` was generated automatically via `prodigy data-to-spacy` -> referred to as `spacy`
- SpaCy model using the `biobert_v1.1` transformers backbone. The `config.cfg` was taken directly from `Search/data_and_models/pipelines/ner/config.cfg` -> referred to as `spacy_hf`
- HF model: using the pretrained model https://huggingface.co/dmis-lab/biobert-v1.1 and putting a token classifier on top of it -> referred to as `hf`
2022-07-05 Discussion
Based on the present results:
- `spacy-tok2vec` (the plain `spacy` model above) seems to be worse than the other two models in all cases
- `spacy-transformer` (`spacy_hf` above) seems to be more or less equivalent to `huggingface` (`hf` above)
We have decided to only use the "raw" HuggingFace transformer from now on.
Some of the main reasons:
- Seems to perform better than a non-transformer spacy model and on par with a transformer spacy model
- More transparent as to what it is doing under the hood (and therefore easier to customize)
- Multi-GPU training is supported
We further decided that the only token alignment that we will perform is from spacy tokens -> word piece tokens. Note that this is necessary before training since our Prodigy annotations use spacy tokens. To do this, we use the flag `is_split_into_words=True` when calling the word piece tokenizer (see docs).
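Concretely, the training-time alignment then looks roughly like the sketch below (the function name is ours; `-100` is the usual convention for label positions the loss should ignore).

```python
# Sketch: expand per-spaCy-token labels to word pieces for training.
# Only the first piece of each spaCy token keeps the label; the rest get -100,
# which the token-classification loss ignores.
def align_labels_to_wordpieces(spacy_tokens, spacy_labels, tokenizer):
    encoding = tokenizer(spacy_tokens, is_split_into_words=True, truncation=True)
    aligned = []
    previous_word_idx = None
    for word_idx in encoding.word_ids():
        if word_idx is None:                 # special tokens ([CLS], [SEP])
            aligned.append(-100)
        elif word_idx != previous_word_idx:  # first piece of a spaCy token
            aligned.append(spacy_labels[word_idx])
        else:                                # continuation pieces
            aligned.append(-100)
        previous_word_idx = word_idx
    encoding["labels"] = aligned
    return encoding
```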
Inference will not support custom "pretokenization" (spacy tokens) and will only work on raw strings:
```python
custom_pretokenization = ["This", "sentence", "is", "great"]  # <- NOT supported
raw_s = "This sentence is great"  # supported
```
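For completeness, inference on a raw string with the trained model could then look something like this (the `aggregation_strategy` value is just an example choice):

```python
# Sketch of inference on a raw string with the trained model and tokenizer.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge word pieces back into entity spans
)
print(ner(raw_s))
```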