This repository does not include the model. I cannot find a sustainable way to host the model anywhere. :(

aEYE: Visual Question Answering (VQA) for Visually Impaired Individuals

Overview

aEYE is a Visual Question Answering (VQA) system designed to empower visually impaired individuals by providing access to visual content through natural language questions. This system allows users to inquire about their surroundings, identify objects, and understand scenes independently.

Problem Statement

Visually impaired individuals often face challenges accessing visual information, which hinders their understanding of surroundings and objects. Traditional methods such as audio descriptions or tactile representations are limited in spontaneity and comprehensiveness. aEYE aims to bridge this gap by providing real-time visual content understanding through natural language processing.

Objectives

  • Provide an accessible VQA system tailored to visually impaired users.
  • Enable users to independently inquire about their surroundings.
  • Facilitate identification of objects and understanding of scenes through natural language questions.

Implementation Details

Our solution involves fine-tuning a pretrained version of Microsoft’s GenerativeImage2Text (GIT) model on the COCO-QA dataset.
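
A minimal loading sketch, assuming the Hugging Face transformers implementation of GIT and the microsoft/git-base checkpoint (the repository does not name or ship its exact weights):

    # Minimal sketch: load a pretrained GIT checkpoint for fine-tuning.
    # "microsoft/git-base" is an assumed starting checkpoint, not
    # necessarily the one used by aEYE.
    from transformers import AutoProcessor, AutoModelForCausalLM

    processor = AutoProcessor.from_pretrained("microsoft/git-base")
    model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

    # COCO-QA supplies (image, question, answer) triples; loading and
    # batching them is omitted here.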

Model Architecture

  • Image Encoder: A contrastive pre-trained model that takes a raw image as input and outputs a 2D feature map.
  • Text Decoder: A transformer module consisting of multiple transformer blocks, each with a self-attention layer and a feed-forward layer.
  • Input Text: Tokenized and embedded, then concatenated with the image features from the image encoder. During training, the input text consists of the question followed by the ground-truth answer, treated as a single special caption (see the input-preparation sketch after this list).
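
As a concrete illustration of that input text, the following sketch assembles one training example with the Hugging Face GIT processor; the checkpoint name, image path, and example strings are illustrative assumptions:

    # Sketch: build one training example as a question + answer "caption".
    # Checkpoint name and image path are illustrative assumptions.
    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("microsoft/git-base")

    image = Image.open("example.jpg").convert("RGB")
    question = "what is on the table?"
    answer = "laptop"

    # The question and ground-truth answer form a single text sequence;
    # the processor also produces the pixel values for the image encoder.
    inputs = processor(images=image, text=question + " " + answer,
                       return_tensors="pt")
    print(list(inputs.keys()))  # input_ids, attention_mask, pixel_values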

Training

  • Dataset: COCO-QA
  • Epochs: 50
  • Optimizer: Adam
  • Loss Function: Cross-entropy applied to the answer and EOS tokens (see the training-step sketch after this list).
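
A single-step training sketch under these settings is shown below; the learning rate, the exact way question tokens are masked out of the loss, and the data handling are assumptions rather than the repository's actual code:

    # Sketch: one fine-tuning step with Adam; cross-entropy is restricted
    # to the answer and EOS tokens by masking the question prefix with -100.
    # Learning rate, checkpoint, and masking details are assumptions.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    processor = AutoProcessor.from_pretrained("microsoft/git-base")
    model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

    image = Image.open("example.jpg").convert("RGB")
    question, answer = "what is on the table?", "laptop"

    inputs = processor(images=image, text=question + " " + answer,
                       return_tensors="pt")
    labels = inputs.input_ids.clone()

    # Ignore the [CLS] token and the question tokens in the loss; the
    # remaining positions cover the answer and the final [SEP]/EOS token.
    prefix_len = 1 + len(processor.tokenizer(question,
                                             add_special_tokens=False).input_ids)
    labels[:, :prefix_len] = -100

    loss = model(pixel_values=inputs.pixel_values,
                 input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()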

Results

  • Accuracy: Achieved a cosine-similarity accuracy of 0.72 between the predicted and target answers (a computation sketch follows this list).
  • Training and Validation Loss: see the loss-curve figure in the repository.
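
The repository does not state which text representation was used for this score. Purely as an illustration, the sketch below computes a cosine similarity between a predicted and a target answer using a sentence-transformers model as a stand-in:

    # Sketch: cosine similarity between predicted and target answers.
    # The embedding model is an assumed stand-in, not necessarily what
    # aEYE used for its reported 0.72 score.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    predicted, target = "a laptop", "laptop"

    embeddings = encoder.encode([predicted, target], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"cosine similarity: {score:.2f}")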

Limitations and Future Directions

  • Current Limitations:

    • Most answers are single words, since COCO-QA only contains single-word answers.
    • Model complexity results in slow computation during inference, especially on a CPU.
  • Future Directions:

    • Fine-tuning on a larger dataset with more comprehensive answers.
    • Optimizing the model for local performance by caching image features, so that answering further questions about the same image does not recompute them (a caching sketch follows this list).
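
As one possible realization of the second direction, the sketch below caches the image-side tensors so that repeated questions about the same picture skip that work; the helper name, checkpoint, and generation settings are hypothetical, not code from this repository:

    # Sketch: cache per-image tensors so repeated questions about the same
    # image reuse them. answer_question is a hypothetical helper, not code
    # from the repository.
    import hashlib
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    processor = AutoProcessor.from_pretrained("microsoft/git-base")
    model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")
    _pixel_cache = {}

    def answer_question(image_path, question):
        key = hashlib.md5(open(image_path, "rb").read()).hexdigest()
        if key not in _pixel_cache:
            image = Image.open(image_path).convert("RGB")
            _pixel_cache[key] = processor(images=image,
                                          return_tensors="pt").pixel_values

        # Prepend [CLS] to the tokenized question, as in the GIT VQA recipe.
        input_ids = processor(text=question, add_special_tokens=False).input_ids
        input_ids = torch.tensor([processor.tokenizer.cls_token_id] + input_ids)

        generated = model.generate(pixel_values=_pixel_cache[key],
                                   input_ids=input_ids.unsqueeze(0),
                                   max_length=50)
        return processor.batch_decode(generated, skip_special_tokens=True)[0]

Note that this only avoids repeating the image preprocessing; reusing the vision encoder's forward pass across questions would require changes inside the generation code itself.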

Installation

  1. Clone the repository:

    git clone https://github.com/umerkay/aEYE.git
    cd aEYE

  2. Install the required dependencies:

    pip install -r requirements.txt

  3. Run the web application:

    flask run

Usage

  1. Access the web application through your browser at http://localhost:5000/
  2. Upload an image and ask a question about it using natural language.
  3. Receive the answer generated by the model.

References

  • GenerativeImage2Text (GIT): Wang et al., "GIT: A Generative Image-to-text Transformer for Vision and Language," 2022. arXiv:2205.14100.