hooshvare/parsbert

How to use NER for a large dataset?

Mohammadtvk opened this issue · 3 comments

Hi,

I want to use your pretrained model for the NER task, but there is a problem: the tutorial notebook feeds documents one-by-one, and this takes too long for my dataset. How can I use it more efficiently? Can I use padding to feed larger batches to the model?

Hi,

As far as I understand, if you want to use ParsBERT for a downstream NLP task, in this case your large NER dataset, you should fine-tune the pre-trained model on it; there is no need to train BERT from scratch again. Moreover, regarding the Sentiment Analysis notebook you mentioned, I used a batch size of 16 for the train, validation, and test parts (Configuration -> # general config)!

Follow this example to see how to use BERT or other pre-trained models for the NER case.
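
For the padding/batching part of the question, here is a rough sketch of batched token-classification inference using the tokenizer's built-in padding. The checkpoint name and batch size below are placeholders, not taken from the notebooks; substitute whichever fine-tuned NER checkpoint you actually use:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder checkpoint -- swap in your own fine-tuned NER model.
model_name = "HooshvareLab/bert-base-parsbert-ner-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

documents = ["...", "..."]  # your raw texts
batch_size = 16

all_labels = []
with torch.no_grad():
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        # padding/truncation lets sequences of different lengths share one batch
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        logits = model(**enc).logits          # (batch, seq_len, num_labels)
        preds = logits.argmax(dim=-1)
        for j in range(len(batch)):
            mask = enc["attention_mask"][j].bool()   # drop padding positions
            ids = preds[j][mask].tolist()
            all_labels.append([model.config.id2label[k] for k in ids])
```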

Hi,
One quick note: if you're not planning to fine-tune, you can use hazm (or simply .split('.')) to split each record into sentences, run every sentence through BERT so all of your data passes through the model, and then join the per-sentence outputs of each record back together in a loop for NER-related tasks.
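
Roughly like this (a sketch only; the checkpoint name is a placeholder, and hazm's sent_tokenize stands in for whatever sentence splitter you prefer):

```python
from hazm import sent_tokenize
from transformers import pipeline

# Placeholder checkpoint -- any ParsBERT NER checkpoint works the same way.
ner = pipeline("ner",
               model="HooshvareLab/bert-base-parsbert-ner-uncased",
               grouped_entities=True)

records = ["...", "..."]  # one long document per record

results = []
for record in records:
    sentences = sent_tokenize(record)   # or record.split('.') as a crude fallback
    # run every sentence of the record through the model, then join the outputs
    entities = []
    for sent in sentences:
        entities.extend(ner(sent))
    results.append(entities)
```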

Back after a while ..
Thanks for the replies.

I have another question: is it possible to use transformer-based models for feature extraction, like word2vec or doc2vec?
I have used FeatureExtractionPipeline with several pretrained models, but there are two problems: 1. I can't set the embedding size (and the models' embedding sizes are small), and 2. it takes so much time (around 0.5 s) per document just to extract word embeddings.
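
A minimal mean-pooling sketch of this kind of batched feature extraction (the model name and batch size are assumptions). Note the hidden size is fixed by the model, 768 for ParsBERT base, so changing the embedding size would need a separate reduction step such as PCA on the resulting vectors:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Base ParsBERT checkpoint (hidden size 768); used here only as an example.
model_name = "HooshvareLab/bert-base-parsbert-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(documents, batch_size=32):
    """Return one mean-pooled 768-d vector per document."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(documents), batch_size):
            enc = tokenizer(documents[i:i + batch_size], padding=True,
                            truncation=True, max_length=512, return_tensors="pt")
            hidden = model(**enc).last_hidden_state           # (batch, seq, 768)
            mask = enc["attention_mask"].unsqueeze(-1).float()  # ignore padding
            pooled = (hidden * mask).sum(1) / mask.sum(1)       # mean over real tokens
            vectors.append(pooled)
    return torch.cat(vectors)
```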