🍔🍟🍗 Meal analysis with Theseus 🍞🍖🍕

Dev logs

[01/05/2024] Fix ngrok bug on Colab #32 (Migrate to pyngrok).
[24/10/2023] Clean and refactor repo. Integrate YOLOv8 to food detection.
[07/03/2022] Big refactor. Integrate object detection, image classification, semantic segmentation into one Ship of Theseus.
[31/01/2022] Update to new YOLOv5 latest versions P5-P6. Can load checkpoints from original repo.
[26/12/2021] Update app on Android.
[12/09/2021] Update all features to the web app.
[16/07/2021] All trained checkpoints on custom data have been lost. Now use pretrained models on COCO for inference.

📔 Notebook

For inference, use this notebook to run the web app
For training, refer to these notebooks for your own training:
- Detection:
- Classification:
- Semantic segmentation:

🥇 Pretrained-weights

Link
Detection:

Models	Image Size	Epochs	mAP@0.5	mAP@0.5:0.95
YOLOv5s	640x640	172	0.907	0.671
YOLOv5m	640x640	112	0.897	0.666
YOLOv5l	640x640	118	0.94	0.73
YOLOv5x	640x640	62	0.779	0.533
YOLOv8s	640x640	70	0.963	0.82

Segmentation:

Models	Image Size	Epochs	Pixel AP	Pixel AR	Dice score
UNet++	640x640	5	0.931	0.935	99.95

Classification:

Models	Image Size	Epochs	Acc	Balanced Acc	F1-score
EfficientNet-B4	640x640	7	84.069	86.033	84.116

🌟 Logs detail

In total, there are 3 implementation versions:

Training with our own object detection template. The model source code is inherited from the Ultralytics repo, the dataset is in COCO format, and the training and data processing steps are reimplemented by us in PyTorch. Ensemble technique: merges the results of 4 models, for images only. Label enhancement technique: if the output label (after detection) is either "Food" or "Food-drinks", we use a pretrained EfficientNet-B4 classifier (on 255 classes) to re-classify it to a more specific label.
Big refactor: the training steps are updated and now also come from the Ultralytics repo. The models yield better accuracy. Test-time augmentation is added to the web app.
Update to the Theseus template, which currently supports food detection, food classification, and multi-class food semantic segmentation, on images only. For this version, we introduce Theseus, which is just a part of the Theseus template. We also dropped some weak or unnecessary features to make the project more robust. Theseus is adapted from large project templates such as mmocr, fairseq, timm, and paddleocr.

Those who want to play around with the first version, which retains some features that differ from the new version, can check out the v1 branch.

🌟 Inference

Install requirements.

pip install -e .

Start the app (Windows). It is safe to run over an insecure HTTP connection on localhost. You can generate an SSL certificate to run the app over HTTPS.

run.bat

python3 app.py

🌟 Dataset

Detection: link (merged OID and Vietnamese Lunch dataset)
Classification: link (MAFood121)
Semantic segmentation: link (UECFood)

🌟 Dataset details

To train the food detection model, we surveyed the following datasets:

Open Images V6-Food: Open Images V6 is a huge dataset from Google for computer vision tasks. For our problem, we extracted the subset with food-related labels. The extracted set includes 18 labels with more than 20,000 images.
School Lunch Dataset: includes 3,940 photos of Japanese high school students' lunches, taken from the same frontal angle with the goal of assessing student nutrition. Labels consist of coordinates and dish types, divided into 21 classes: 20 specific dishes plus an "Other Foods" label for dishes that do not belong to any of them.
Vietnamese Food: a self-collected dataset of Vietnamese dishes, including 10 simple dishes of our country such as Pho, Com Tam, Hu Tieu, and Banh Mi. Each category has about 20-30 images, split 80-20 for training and evaluation.

We aggregate all the above datasets for training. Dishes that appear in more than one set are grouped into a single label to avoid duplication. The aggregated result is a large dataset of 60,305 images with 44 different foods from all regions of the world.
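The label grouping described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual merging script: the function name `merge_label_sets`, the `aliases` mapping, and the dataset keys are all hypothetical, and it assumes that duplicate dishes can be matched by normalized name.

```python
from typing import Dict, List, Optional


def merge_label_sets(
    datasets: Dict[str, List[str]],
    aliases: Optional[Dict[str, str]] = None,
) -> Dict[str, List[str]]:
    """Group labels from several datasets under one normalized name.

    Returns a mapping from merged label to the datasets that contain it.
    """
    aliases = aliases or {}
    merged: Dict[str, List[str]] = {}
    for dataset_name, labels in datasets.items():
        for label in labels:
            # Normalize case and whitespace so "Bread" and "bread " match.
            key = label.strip().lower()
            # Optionally map known synonyms onto one canonical name.
            key = aliases.get(key, key)
            merged.setdefault(key, []).append(dataset_name)
    return merged
```

In practice the alias table would have to be curated by hand, since the same dish can be named differently in each source dataset.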

In addition, we find that expanding the problem to include classification makes much more data available. Therefore, to further increase the diversity of dishes, we collect additional datasets to train a classification model:

MAFood-121: consists of 21,175 image samples. The dishes are selected from the top 11 most popular cuisines in the world according to Google Trends statistics. These cuisines come from many countries around the world, including Vietnam. For each cuisine, 11 typical traditional dishes are selected. The dataset has a total of 121 different types of dishes, each belonging to at least 1 of 10 food categories: Bread, Eggs, Fried, Meat, Noodles, Rice, Seafood, Soup, Dumplings, and Vegetables. 85% of the images are used for training and the remaining 15% for evaluation.
Food-101: includes 101 different types of dishes, with 101,000 images in total. For each dish, 250 images are used as test images and the remaining 750 images are used for training. The training images in this set still contain a lot of noise: sometimes the colors are too intense or some samples are mislabeled. This noise was left in intentionally by the authors (as mentioned in their study).

We also aggregate the two datasets above into one. The new set includes 93,748 training images and 26,825 evaluation images with a total of 180 different dishes, so the number of dishes has increased significantly. If the detection model labels a dish as "Other Foods", the classification model is applied to this dish to classify it again.

🌟 Server

Implementation details

The get_prediction function runs inference for detection, classification, or semantic segmentation, depending on the inputs you choose. It is implemented in modules.py, where the image detection process calls the Edamam API to get nutritional information for the food. We also save the nutritional information as CSV files in the folder /static/csv.

We let the user customize the confidence and IoU thresholds so that they can find suitable values for the input image. To avoid rerunning the whole model every time these parameters change, the server computes a perceptual hash of each image sent by the client and uses the resulting string to name the image when saving it. When the client sends an image whose hash already exists in the database, the server only post-processes the previously predicted result without re-running the prediction.
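The caching idea above can be sketched as follows. This is a simplified illustration under several assumptions: the source does not say which perceptual hash is used, so an average hash over a grayscale grid stands in for it; the `PredictionCache` class and its in-memory dictionary are hypothetical (the app names saved files by hash instead); and only the confidence threshold is applied during post-processing.

```python
from typing import Callable, Dict, List, Sequence

Image = Sequence[Sequence[float]]   # grayscale pixels as a 2D grid
Detection = Dict[str, object]       # e.g. {"label": "rice", "score": 0.9}


def average_hash(gray: Image, size: int = 8) -> str:
    """Return a hex average hash of a grayscale image."""
    height, width = len(gray), len(gray[0])
    # Downsample to size x size cells by nearest-neighbour sampling.
    cells = [
        gray[(row * height) // size][(col * width) // size]
        for row in range(size)
        for col in range(size)
    ]
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the mean.
    bits = "".join("1" if value > mean else "0" for value in cells)
    hex_digits = size * size // 4
    return f"{int(bits, 2):0{hex_digits}x}"


class PredictionCache:
    """Run the model once per distinct image, then only re-filter."""

    def __init__(self, predict: Callable[[Image], List[Detection]]) -> None:
        self._predict = predict
        self._raw: Dict[str, List[Detection]] = {}
        self.model_calls = 0

    def get(self, gray: Image, min_score: float) -> List[Detection]:
        key = average_hash(gray)
        if key not in self._raw:
            # First time this image is seen: run the expensive model.
            self._raw[key] = self._predict(gray)
            self.model_calls += 1
        # Cheap post-processing that depends on the user's threshold.
        return [det for det in self._raw[key] if det["score"] >= min_score]
```

Because a perceptual hash is designed to map visually similar images to the same or nearby values, two different but near-identical uploads may share a cache entry, which is usually acceptable for this use case.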

🌟 Additional Methods

To increase the variety of dishes, we apply a classification model:

After testing and observation, we chose a simple and effective model: EfficientNet. EfficientNet was proposed by Google and is one of the state-of-the-art models for image classification while also being computationally efficient. We use the EfficientNet source code from rwightman and select the EfficientNet-B4 version for retraining on the aggregated dataset. This model serves as an additional improvement to the YOLOv5 model: only when YOLOv5 labels a dish as "Other Foods" is EfficientNet applied to predict the label again for this dish.
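The fallback logic above might look roughly like this. It is a sketch only: `refine_labels`, `GENERIC_LABELS`, and the `classify_crop` callable (which stands in for the EfficientNet-B4 classifier applied to the cropped region) are hypothetical names, not the project's actual API.

```python
from typing import Callable, Dict, List, Sequence

Detection = Dict[str, object]  # {"label": str, "score": float, "box": [x1, y1, x2, y2]}

# Labels that are too generic to show to the user (assumed set).
GENERIC_LABELS = {"Other Foods"}


def refine_labels(
    detections: List[Detection],
    classify_crop: Callable[[Sequence[float]], str],
) -> List[Detection]:
    """Re-label generic detections with a second-stage classifier."""
    refined = []
    for det in detections:
        if det["label"] in GENERIC_LABELS:
            # Only generic detections pay the cost of the classifier.
            det = {**det, "label": classify_crop(det["box"])}
        refined.append(det)
    return refined
```

Detections with a specific label pass through unchanged, so the classifier adds cost only for the generic cases.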

To increase the accuracy of the algorithm, we use the ensemble models technique:

For each image, several models of different versions make predictions, and the results are then aggregated using the "weighted boxes fusion" method to give the final result.
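To make the fusion step concrete, here is a simplified sketch of the weighted boxes fusion idea. It is not the implementation used in this repo: it matches each box to the first overlapping cluster rather than the best one, uses no per-model weights, and the function names and the default `iou_thr` are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]                     # [x1, y1, x2, y2]
Prediction = Tuple[Box, float, str]       # (box, score, label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_boxes(per_model: List[List[Prediction]], iou_thr: float = 0.55) -> List[Prediction]:
    """Fuse the predictions of several models into one list."""
    num_models = len(per_model)
    # Pool all predictions and visit them from most to least confident.
    pooled = sorted((p for preds in per_model for p in preds), key=lambda p: p[1], reverse=True)
    clusters: List[dict] = []
    for box, score, label in pooled:
        match = next(
            (c for c in clusters if c["label"] == label and iou(c["box"], box) > iou_thr),
            None,
        )
        if match is None:
            clusters.append({"label": label, "members": [(box, score)], "box": list(box)})
            continue
        match["members"].append((box, score))
        total = sum(s for _, s in match["members"])
        # Fused coordinates are the score-weighted mean of the members.
        match["box"] = [sum(b[i] * s for b, s in match["members"]) / total for i in range(4)]
    fused = []
    for cluster in clusters:
        scores = [s for _, s in cluster["members"]]
        # Average score, scaled down when few models agree on the box.
        confidence = sum(scores) / len(scores) * min(len(scores), num_models) / num_models
        fused.append((cluster["box"], confidence, cluster["label"]))
    return fused
```

Unlike non-maximum suppression, which keeps one box and discards the rest, this approach uses every overlapping box to build the fused one, and boxes found by only a few models end up with lower confidence.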

To increase users' interactivity with the application:

When a dish is predicted, we show the user more information about its nutritional content. This information is queried from the application's database, which is periodically updated from the Edamam API - an API that allows querying the nutrition of a dish by dish name. During prediction, the nutrition information is saved along with the dish name in CSV format. We then fetch the CSV file on the client side to draw nutrition statistics charts using the Chart.js library. There are 2 chart types in total, each of which appears when the user clicks on that chart type.
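The CSV step could be sketched as below. The column names in `FIELDS` and the function `save_nutrition_csv` are hypothetical, chosen for illustration; the actual columns depend on which Edamam fields the app stores, and the Edamam request itself is not modeled here.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union

# Hypothetical column layout; the real app may store different fields.
FIELDS = ["dish", "calories", "protein_g", "fat_g", "carbs_g"]


def save_nutrition_csv(rows: List[Dict[str, object]], path: Union[str, Path]) -> Path:
    """Write one row of nutrition values per predicted dish."""
    path = Path(path)
    # Create the target folder (e.g. static/csv) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

A flat file with a header row is easy to fetch and parse on the client side before handing the values to a charting library.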