This project was carried out from January to May 2020.
An estimated 285 million people worldwide are visually impaired, of whom 39 million are blind and 246 million have low vision. This project aims to assist visually impaired people through deep learning by providing a system that can describe the user's surroundings and answer questions about them. It comprises two models: a Visual Question Answering (VQA) model and an image captioning model. The image captioning model takes an image as input and generates a caption describing it. The VQA model is fed an image and a question, and it predicts the answer to the question with respect to the image.
The image captioning model is trained on the MS-COCO dataset. The VQA model uses both the MS-COCO dataset and the VQA dataset.
- Build and train Image captioning model
- Build and train VQA model
- Construct Speech-to-Text and Text-to-Speech components
- Integrate all the models to form a single product
At run time, the user chooses between having the surroundings described or asking a question about them, and the system speaks the answer back.
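At a high level, the real-time flow looks like the sketch below. This is only an illustration: the imported helper names are hypothetical stand-ins for whatever functions conversions.py and models.py actually define.

```python
# Hypothetical sketch of the real-time decision flow; the imported helper names
# are placeholders for the actual functions in conversions.py and models.py.
from conversions import speech_to_text, text_to_speech   # assumed names
from models import generate_caption, answer_question      # assumed names

def assist(frame):
    """Given a webcam frame, either describe it or answer a spoken question."""
    text_to_speech("Say 'describe' for a description, or ask a question.")
    request = speech_to_text()
    if "describe" in request.lower():
        reply = generate_caption(frame)            # image captioning branch
    else:
        reply = answer_question(frame, request)    # VQA branch
    text_to_speech(reply)
```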
- Python 3 with the necessary libraries listed in the requirements.txt files
- CUDA toolkit version 10.2 (with an up-to-date NVIDIA driver) if you want to train on a GPU
- A PC with a capable NVIDIA GPU and a webcam
- Image_Captioning directory - Contains the files for training and testing the image captioning model. This directory is also needed when running the real-time code.
- VQA - Contains two directories for training and testing the MLP-based and CNN_LSTM-based VQA models. Under each directory you will find main.py, which contains both the training and testing code. Note: it is advised to run the VQA code in Google Colab to avoid unnecessary errors.
- The root directory contains three Python files used for the real-time demo:
- conversions.py: Contains the code for the Speech-to-Text and Text-to-Speech components (sketched after this list)
- models.py: Contains the Image captioning and VQA models
- product.py: Contains the real-time code that uses the webcam
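The Speech-to-Text and Text-to-Speech components can be built with off-the-shelf libraries. The sketch below assumes the SpeechRecognition and pyttsx3 packages and uses the same hypothetical helper names as the flow sketch above; the actual conversions.py may use different libraries.

```python
# Sketch of speech I/O helpers, assuming the SpeechRecognition and pyttsx3
# packages; the actual conversions.py may be implemented differently.
import speech_recognition as sr
import pyttsx3

def speech_to_text():
    """Record from the default microphone and return the transcribed text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)   # online Google Web Speech API

def text_to_speech(text):
    """Speak the given text through the default audio output."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```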
- Download the 2014 train and test datasets (images and annotations) from "http://cocodataset.org/#download"
- Move into Source Code/Image_Captioning directory
- Install the requirements using the command "pip install -r requirements.txt"
- Update the file paths in the code to match your setup
- Uncomment lines 79 and 80 in train.py and run it to generate the image batch features and train the model (see the feature-extraction sketch after this list). On subsequent runs, comment those lines again to avoid regenerating the batch features.
- Run test.py to test the model
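Generating the batch features (the step enabled by lines 79-80 of train.py) typically amounts to running each training image through a pretrained CNN once and caching the output. The sketch below assumes a TensorFlow/InceptionV3 setup with placeholder file paths; train.py may differ in the backbone and storage format it uses.

```python
# Sketch of one-time image feature extraction with a pretrained InceptionV3,
# assuming a TensorFlow-based pipeline; file paths are placeholders.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(base.input, base.output)

def extract_features(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))                       # InceptionV3 input size
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    features = feature_extractor(tf.expand_dims(img, 0))         # (1, 8, 8, 2048)
    return tf.reshape(features, (features.shape[0], -1, features.shape[3]))

# Cache features to disk so later training runs can skip this step.
feats = extract_features("train2014/COCO_train2014_000000000009.jpg")
np.save("features/COCO_train2014_000000000009.npy", feats.numpy())
```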
- Move into the Source Code/VQA/MLP directory
- Download the train and validation MS-COCO images (2014) from "http://cocodataset.org/#download"
- Download the train and validation Questions (2015) and Answers (2015) from "https://visualqa.org/vqa_v1_download.html"
- Download the image features pickle file for train images from "https://drive.google.com/file/d/1icMniCVK8D3pGoDgkBkTl7K2zTsXRf13/view?usp=sharing"
- Download the image features pickle file for validation images from "https://drive.google.com/file/d/1sa_ZEej11NFtiAnmhR18X5o6_Ctc6qcI/view?usp=sharing"
- Download the preprocessed dataset from "https://drive.google.com/drive/folders/1LmOr3poPLLBLDF0e3z50XeMHKmnsQzqI?usp=sharing"
- Install the requirements using the command "pip install -r requirements.txt"
- Run main.py to train and test the model
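For reference, an MLP-based VQA model of this kind usually concatenates a fixed image-feature vector with a pooled question embedding and feeds the result to a small classifier over the most frequent answers. The Keras sketch below is illustrative only; its dimensions (4096-d image features, 300-d question embedding, 1000 answer classes) are assumptions and not necessarily those used in main.py.

```python
# Illustrative Keras definition of an MLP-based VQA classifier; the input
# dimensions and number of answer classes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

IMG_DIM, Q_DIM, NUM_ANSWERS = 4096, 300, 1000

image_in = layers.Input(shape=(IMG_DIM,), name="image_features")        # e.g. CNN fc features
question_in = layers.Input(shape=(Q_DIM,), name="question_embedding")   # e.g. averaged word vectors

x = layers.concatenate([image_in, question_in])
x = layers.Dense(1024, activation="tanh")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(1024, activation="tanh")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_ANSWERS, activation="softmax")(x)                # top-K answer classes

model = tf.keras.Model([image_in, question_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```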
- Move into Source Code/VQA/CNN_LSTM directory
- Download the train and validation MS-COCO images (2014) from "http://cocodataset.org/#download"
- Download the train and validation Questions (2015) and Answers (2015) from "https://visualqa.org/vqa_v1_download.html"
- Download the image features pickle file for train images from "https://drive.google.com/file/d/1icMniCVK8D3pGoDgkBkTl7K2zTsXRf13/view?usp=sharing"
- Download the image features pickle file for validation images from "https://drive.google.com/file/d/1sa_ZEej11NFtiAnmhR18X5o6_Ctc6qcI/view?usp=sharing"
- Download the preprocessed dataset from "https://drive.google.com/drive/folders/1LmOr3poPLLBLDF0e3z50XeMHKmnsQzqI?usp=sharing"
- Install the requirements using the command "pip install -r requirements.txt"
- Run main.py to train and test the model
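A CNN_LSTM VQA model of this kind typically fuses pre-extracted CNN image features with an LSTM encoding of the question before classifying over the answer vocabulary. As with the MLP sketch above, the Keras definition below is illustrative and its dimensions are assumptions, not the exact values used in main.py.

```python
# Illustrative Keras definition of a CNN+LSTM VQA model: pre-extracted CNN image
# features fused with an LSTM encoding of the question. Dimensions are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

IMG_DIM, VOCAB, MAX_LEN, NUM_ANSWERS = 4096, 12000, 26, 1000

image_in = layers.Input(shape=(IMG_DIM,), name="image_features")
img = layers.Dense(1024, activation="tanh")(image_in)

question_in = layers.Input(shape=(MAX_LEN,), name="question_tokens")
q = layers.Embedding(VOCAB, 300, mask_zero=True)(question_in)
q = layers.LSTM(512, return_sequences=True)(q)
q = layers.LSTM(512)(q)
q = layers.Dense(1024, activation="tanh")(q)

x = layers.multiply([img, q])                  # element-wise fusion of the two modalities
x = layers.Dense(1000, activation="tanh")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_ANSWERS, activation="softmax")(x)

model = tf.keras.Model([image_in, question_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```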
- From the root directory, install the requirements using the command "pip install -r requirements.txt"
- Run product.py
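Before launching product.py, you can check that OpenCV can reach your webcam with a few lines like these (assuming opencv-python is among the installed requirements):

```python
# Quick webcam sanity check with OpenCV; camera index 0 is the default device.
import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
print("webcam OK, frame shape:", frame.shape if ok else None)
```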
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv:1502.03044, 2015. http://arxiv.org/abs/1502.03044
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual Question Answering." arXiv:1505.00468, 2015. http://arxiv.org/abs/1505.00468