Created a Named Entity Recognition (NER) model with spaCy that detects food names in text in addition to the default entities of the spaCy model.
To get food-related sentences, a Kaggle dataset named 'Amazon Fine Food Reviews' was used. The dataset contains more than 500,000 reviews of different food items, which is far more data than this project needs. To gather relevant sentences from it, a list of 18 common food items was created first: 'gravy', 'ramen', 'burger', 'pizza', 'pasta', 'wings', 'coke', 'sprite', 'water', 'fanta', 'pepsi', 'seven up', 'biriyani', 'rice', 'pulao', 'bread', 'flat bread' and 'rice bowl'. A second list of helper words that are commonly used in food reviews, such as 'flavour', 'flavours', 'tasty', 'delicious', 'juicy' etc., was also created.
For each food, the first 500 reviews mentioning it were considered, and only those containing at least one of the helper words were selected and saved to a text file. To avoid repetition, a sentence that contains the name of a previously iterated food was not selected for the current food, as it is already present in the file.
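The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the function names `contains_term` and `select_reviews` are hypothetical, matching on whole words is an assumption, and the input is assumed to be a plain list of review strings.

```python
import re


def contains_term(text, term):
    """Return True if `term` appears in `text` as a whole word or phrase (case-insensitive)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def select_reviews(reviews, foods, helper_words, limit=500):
    """Pick reviews per food, following the steps described in the text.

    For each food, only the first `limit` reviews that mention it are considered.
    A review is kept if it contains a helper word and does not mention a food
    that was handled in an earlier iteration.
    """
    selected = []
    for index, food in enumerate(foods):
        earlier_foods = foods[:index]
        seen = 0
        for text in reviews:
            if not contains_term(text, food):
                continue
            seen += 1
            if seen > limit:
                break
            # Skip reviews that mention a previously iterated food.
            if any(contains_term(text, other) for other in earlier_foods):
                continue
            if any(contains_term(text, word) for word in helper_words):
                selected.append(text)
    return selected
```

Matching on word boundaries keeps 'rice' from matching inside 'price'; plain substring matching would be simpler but noisier.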
The file with sentences was annotated for NER using an online annotator tool and saved to a file.
annotation file : annotations.json
online annotator : https://tecoholic.github.io/ner-annotator/
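For training with spaCy v3, the annotations usually need to be converted into a `DocBin`. The sketch below assumes the annotator exports JSON of the form `{"classes": [...], "annotations": [[text, {"entities": [[start, end, label], ...]}], ...]}`; that layout is an assumption about the export format and should be checked against the actual annotations.json. The function name `annotations_to_docbin` is hypothetical.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


def annotations_to_docbin(data, nlp):
    """Convert annotator output (assumed layout, see above) into a spaCy DocBin."""
    doc_bin = DocBin()
    for item in data["annotations"]:
        if item is None:  # tolerate empty entries in the export
            continue
        text, meta = item
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in meta["entities"]:
            # char_span returns None when the offsets cannot be aligned to tokens.
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        # Drop overlapping spans, since doc.ents does not allow overlaps.
        doc.ents = filter_spans(spans)
        doc_bin.add(doc)
    return doc_bin
```

The result could then be written with `doc_bin.to_disk(...)` to a path of your choice and used as training data.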
A spaCy model was trained on the processed dataset, but it suffered from catastrophic forgetting. Two approaches were tried to solve the problem:
- Revision Data: The model was trained with some data consisting of the default entities of the model. However, this approach was very resource hungry and didn't yield a worthwhile result.
- Pipe combination: The 'ner' pipe of the food-trained model was added before the 'ner' pipe of the default spaCy language model. This approach showed significant improvement in the result.
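The pipe combination can be sketched with spaCy v3's `add_pipe(..., source=...)`, which copies a component from another pipeline. This is an illustrative sketch rather than the notebook's code: the function name `combine_ner_pipes` and the component name `food_ner` are hypothetical, and both pipelines are assumed to contain a component called 'ner'.

```python
import spacy


def combine_ner_pipes(food_nlp, base_nlp, name="food_ner"):
    """Insert the food model's 'ner' component before the base model's 'ner'.

    The sourced component gets a new name so it does not clash with the
    existing 'ner' component of the base pipeline.
    """
    base_nlp.add_pipe("ner", source=food_nlp, name=name, before="ner")
    return base_nlp
```

In practice `food_nlp` would be loaded with `spacy.load(...)` from the directory of the trained food model and `base_nlp` from a pretrained English pipeline; which pretrained pipeline the project uses is not stated here. Because the food component runs first, the default 'ner' component should then add its entities around the food entities that are already set.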
spacy v3, numpy, pandas
- Clone the repository
git clone https://github.com/saidulK/food_ner_spacy
- Download the reviews dataset
- Generate dataset
run Dataset Generation.ipynb
- Train model
run Train Model.ipynb