This project was carried out from January to May 2020.
An estimated 285 million people worldwide are visually impaired, of whom 39 million are blind and 246 million have low vision. This project aims to assist visually impaired people through deep learning by providing a system that can describe the user's surroundings and answer questions about them. It comprises two models: a Visual Question Answering (VQA) model and an image captioning model. The image captioning model takes an image as input and generates a caption describing it. The VQA model is fed an image and a question, and it predicts the answer to the question with respect to the image.
The image captioning model is trained on the MS-COCO dataset. The VQA model uses both the MS-COCO dataset and the VQA dataset.
- Build and train Image captioning model
- Build and train VQA model
- Construct Speech-to-Text and Text-to-Speech components
- Integrate all the models to form a single product
At run time, the user chooses between having the surroundings described or asking a question about them, and the system speaks the answer back.
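At a high level, the real-time flow looks like the sketch below. This is only an illustration: the imported helper names are hypothetical stand-ins for whatever functions conversions.py and models.py actually define.

```python
# Hypothetical sketch of the real-time decision flow; the imported helper names
# are placeholders for the actual functions in conversions.py and models.py.
from conversions import speech_to_text, text_to_speech   # assumed names
from models import generate_caption, answer_question      # assumed names

def assist(frame):
    """Given a webcam frame, either describe it or answer a spoken question."""
    text_to_speech("Say 'describe' for a description, or ask a question.")
    request = speech_to_text()
    if "describe" in request.lower():
        reply = generate_caption(frame)            # image captioning branch
    else:
        reply = answer_question(frame, request)    # VQA branch
    text_to_speech(reply)
```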
- Python 3 with the necessary libraries listed in the requirements.txt files
- CUDA toolkit version 10.2 (with an up-to-date NVIDIA driver) if you want to train on a GPU
- A PC with a capable NVIDIA GPU and a webcam
- Image_Captioning directory - Contains the files for training and testing the image captioning model. This directory is also needed when running the real-time code.
- VQA - Contains two directories for training and testing the MLP-based and CNN_LSTM-based VQA models. Under each directory you will find main.py, which contains both the training and testing code. Note: it is advised to run the VQA code in Google Colab to avoid unnecessary errors.
- The root directory contains three Python files used for the real-time demo:
- conversions.py: Contains the code for the Speech-to-Text and Text-to-Speech components (sketched after this list)
- models.py: Contains the Image captioning and VQA models
- product.py: Contains the real-time code that uses the webcam
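The Speech-to-Text and Text-to-Speech components can be built with off-the-shelf libraries. The sketch below assumes the SpeechRecognition and pyttsx3 packages and uses the same hypothetical helper names as the flow sketch above; the actual conversions.py may use different libraries.

```python
# Sketch of speech I/O helpers, assuming the SpeechRecognition and pyttsx3
# packages; the actual conversions.py may be implemented differently.
import speech_recognition as sr
import pyttsx3

def speech_to_text():
    """Record from the default microphone and return the transcribed text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)   # online Google Web Speech API

def text_to_speech(text):
    """Speak the given text through the default audio output."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```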
- Download the 2014 train and test datasets (images and annotations) from "http://cocodataset.org/#download"
- Move into Source Code/Image_Captioning directory
- Install the requirements using the command "pip install -r requirements.txt"
- Update the file paths in the code to match your setup
- Uncomment lines 79 and 80 in train.py and run it to generate the image batch features and train the model (see the feature-extraction sketch after this list). On subsequent runs, comment those lines again to avoid regenerating the batch features.
- Run test.py to test the model
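Generating the batch features (the step enabled by lines 79-80 of train.py) typically amounts to running each training image through a pretrained CNN once and caching the output. The sketch below assumes a TensorFlow/InceptionV3 setup with placeholder file paths; train.py may differ in the backbone and storage format it uses.

```python
# Sketch of one-time image feature extraction with a pretrained InceptionV3,
# assuming a TensorFlow-based pipeline; file paths are placeholders.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(base.input, base.output)

def extract_features(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))                       # InceptionV3 input size
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    features = feature_extractor(tf.expand_dims(img, 0))         # (1, 8, 8, 2048)
    return tf.reshape(features, (features.shape[0], -1, features.shape[3]))

# Cache features to disk so later training runs can skip this step.
feats = extract_features("train2014/COCO_train2014_000000000009.jpg")
np.save("features/COCO_train2014_000000000009.npy", feats.numpy())
```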
- Move into the Source Code/VQA/MLP directory
- Download the train and validation MS-COCO images (2014) from "http://cocodataset.org/#download"
- Download the train and validation Questions (2015) and Answers (2015) from "https://visualqa.org/vqa_v1_download.html"
- Download the image features pickle file for train images from "https://drive.google.com/file/d/1icMniCVK8D3pGoDgkBkTl7K2zTsXRf13/view?usp=sharing"
- Download the image features pickle file for validation images from "https://drive.google.com/file/d/1sa_ZEej11NFtiAnmhR18X5o6_Ctc6qcI/view?usp=sharing"
- Download the preprocessed dataset from "https://drive.google.com/drive/folders/1LmOr3poPLLBLDF0e3z50XeMHKmnsQzqI?usp=sharing"
- Install the requirements using the command "pip install -r requirements.txt"
- Run main.py to train and test the model
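For reference, an MLP-based VQA model of this kind usually concatenates a fixed image-feature vector with a pooled question embedding and feeds the result to a small classifier over the most frequent answers. The Keras sketch below is illustrative only; its dimensions (4096-d image features, 300-d question embedding, 1000 answer classes) are assumptions and not necessarily those used in main.py.

```python
# Illustrative Keras definition of an MLP-based VQA classifier; the input
# dimensions and number of answer classes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

IMG_DIM, Q_DIM, NUM_ANSWERS = 4096, 300, 1000

image_in = layers.Input(shape=(IMG_DIM,), name="image_features")        # e.g. CNN fc features
question_in = layers.Input(shape=(Q_DIM,), name="question_embedding")   # e.g. averaged word vectors

x = layers.concatenate([image_in, question_in])
x = layers.Dense(1024, activation="tanh")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(1024, activation="tanh")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_ANSWERS, activation="softmax")(x)                # top-K answer classes

model = tf.keras.Model([image_in, question_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```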
- Move into Source Code/VQA/CNN_LSTM directory
- Download the train and validation MS-COCO images (2014) from "http://cocodataset.org/#download"
- Download the train and validation Questions (2015) and Answers (2015) from "https://visualqa.org/vqa_v1_download.html"
- Download the image features pickle file for train images from "https://drive.google.com/file/d/1icMniCVK8D3pGoDgkBkTl7K2zTsXRf13/view?usp=sharing"
- Download the image features pickle file for validation images from "https://drive.google.com/file/d/1sa_ZEej11NFtiAnmhR18X5o6_Ctc6qcI/view?usp=sharing"
- Download the preprocessed dataset from "https://drive.google.com/drive/folders/1LmOr3poPLLBLDF0e3z50XeMHKmnsQzqI?usp=sharing"
- Install the requirements using the command "pip install -r requirements.txt"
- Run main.py to train and test the model
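A CNN_LSTM VQA model of this kind typically fuses pre-extracted CNN image features with an LSTM encoding of the question before classifying over the answer vocabulary. As with the MLP sketch above, the Keras definition below is illustrative and its dimensions are assumptions, not the exact values used in main.py.

```python
# Illustrative Keras definition of a CNN+LSTM VQA model: pre-extracted CNN image
# features fused with an LSTM encoding of the question. Dimensions are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

IMG_DIM, VOCAB, MAX_LEN, NUM_ANSWERS = 4096, 12000, 26, 1000

image_in = layers.Input(shape=(IMG_DIM,), name="image_features")
img = layers.Dense(1024, activation="tanh")(image_in)

question_in = layers.Input(shape=(MAX_LEN,), name="question_tokens")
q = layers.Embedding(VOCAB, 300, mask_zero=True)(question_in)
q = layers.LSTM(512, return_sequences=True)(q)
q = layers.LSTM(512)(q)
q = layers.Dense(1024, activation="tanh")(q)

x = layers.multiply([img, q])                  # element-wise fusion of the two modalities
x = layers.Dense(1000, activation="tanh")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_ANSWERS, activation="softmax")(x)

model = tf.keras.Model([image_in, question_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```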
- From the root directory, install the requirements using the command "pip install -r requirements.txt"
- Run product.py
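Before launching product.py, you can check that OpenCV can reach your webcam with a few lines like these (assuming opencv-python is among the installed requirements):

```python
# Quick webcam sanity check with OpenCV; camera index 0 is the default device.
import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
print("webcam OK, frame shape:", frame.shape if ok else None)
```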
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv:1502.03044, 2015. http://arxiv.org/abs/1502.03044
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual Question Answering." arXiv:1505.00468, 2015. http://arxiv.org/abs/1505.00468