This repository does not include the model. I cannot find a sustainable way to host the model anywhere. :(

aEYE: Visual Question Answering (VQA) for Visually Impaired Individuals

Overview

aEYE is a Visual Question Answering (VQA) system designed to empower visually impaired individuals by providing access to visual content through natural language questions. This system allows users to inquire about their surroundings, identify objects, and understand scenes independently.

Problem Statement

Visually impaired individuals often face challenges accessing visual information, which hinders their understanding of surroundings and objects. Traditional methods such as audio descriptions or tactile representations are limited in spontaneity and comprehensiveness. aEYE aims to bridge this gap by providing real-time visual content understanding through natural language processing.

Objectives

  • Provide an accessible VQA system tailored to visually impaired users.
  • Enable users to independently inquire about their surroundings.
  • Facilitate identification of objects and understanding of scenes through natural language questions.

Implementation Details

Our solution involves fine-tuning a pretrained version of Microsoft’s GenerativeImage2Text (GIT) model on the COCO-QA dataset.
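
A minimal loading sketch, assuming the Hugging Face transformers implementation of GIT and the microsoft/git-base checkpoint (the repository does not name or ship its exact weights):

    # Minimal sketch: load a pretrained GIT checkpoint for fine-tuning.
    # "microsoft/git-base" is an assumed starting checkpoint, not
    # necessarily the one used by aEYE.
    from transformers import AutoProcessor, AutoModelForCausalLM

    processor = AutoProcessor.from_pretrained("microsoft/git-base")
    model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

    # COCO-QA supplies (image, question, answer) triples; loading and
    # batching them is omitted here.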

Model Architecture

  • Image Encoder: A contrastive pre-trained model that takes a raw image as input and outputs a 2D feature map.
  • Text Decoder: A transformer module consisting of multiple transformer blocks, each with a self-attention layer and a feed-forward layer.
  • Input Text: Tokenized and embedded, then concatenated with the image features from the image encoder. During training, the input text consists of the question followed by the ground-truth answer, treated as a single special caption (see the input-preparation sketch after this list).
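
As a concrete illustration of that input text, the following sketch assembles one training example with the Hugging Face GIT processor; the checkpoint name, image path, and example strings are illustrative assumptions:

    # Sketch: build one training example as a question + answer "caption".
    # Checkpoint name and image path are illustrative assumptions.
    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("microsoft/git-base")

    image = Image.open("example.jpg").convert("RGB")
    question = "what is on the table?"
    answer = "laptop"

    # The question and ground-truth answer form a single text sequence;
    # the processor also produces the pixel values for the image encoder.
    inputs = processor(images=image, text=question + " " + answer,
                       return_tensors="pt")
    print(list(inputs.keys()))  # input_ids, attention_mask, pixel_values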

Training

  • Dataset: COCO-QA
  • Epochs: 50
  • Optimizer: Adam
  • Loss Function: Cross-entropy applied to the answer and EOS tokens (see the training-step sketch after this list).
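
A single-step training sketch under these settings is shown below; the learning rate, the exact way question tokens are masked out of the loss, and the data handling are assumptions rather than the repository's actual code:

    # Sketch: one fine-tuning step with Adam; cross-entropy is restricted
    # to the answer and EOS tokens by masking the question prefix with -100.
    # Learning rate, checkpoint, and masking details are assumptions.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    processor = AutoProcessor.from_pretrained("microsoft/git-base")
    model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

    image = Image.open("example.jpg").convert("RGB")
    question, answer = "what is on the table?", "laptop"

    inputs = processor(images=image, text=question + " " + answer,
                       return_tensors="pt")
    labels = inputs.input_ids.clone()

    # Ignore the [CLS] token and the question tokens in the loss; the
    # remaining positions cover the answer and the final [SEP]/EOS token.
    prefix_len = 1 + len(processor.tokenizer(question,
                                             add_special_tokens=False).input_ids)
    labels[:, :prefix_len] = -100

    loss = model(pixel_values=inputs.pixel_values,
                 input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()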

Results

  • Accuracy: Achieved a cosine-similarity accuracy of 0.72 between the predicted and target answers (a computation sketch follows this list).
  • Training and Validation Loss: see the loss-curve figure in the repository.
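
The repository does not state which text representation was used for this score. Purely as an illustration, the sketch below computes a cosine similarity between a predicted and a target answer using a sentence-transformers model as a stand-in:

    # Sketch: cosine similarity between predicted and target answers.
    # The embedding model is an assumed stand-in, not necessarily what
    # aEYE used for its reported 0.72 score.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    predicted, target = "a laptop", "laptop"

    embeddings = encoder.encode([predicted, target], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"cosine similarity: {score:.2f}")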

Limitations and Future Directions

  • Current Limitations:

    • Most answers are single words, since COCO-QA only contains single-word answers.
    • Model complexity results in slow computation during inference, especially on a CPU.
  • Future Directions:

    • Fine-tuning on a larger dataset with more comprehensive answers.
    • Optimizing the model for local performance by caching image features, so that answering further questions about the same image does not recompute them (a caching sketch follows this list).
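
As one possible realization of the second direction, the sketch below caches the image-side tensors so that repeated questions about the same picture skip that work; the helper name, checkpoint, and generation settings are hypothetical, not code from this repository:

    # Sketch: cache per-image tensors so repeated questions about the same
    # image reuse them. answer_question is a hypothetical helper, not code
    # from the repository.
    import hashlib
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    processor = AutoProcessor.from_pretrained("microsoft/git-base")
    model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")
    _pixel_cache = {}

    def answer_question(image_path, question):
        key = hashlib.md5(open(image_path, "rb").read()).hexdigest()
        if key not in _pixel_cache:
            image = Image.open(image_path).convert("RGB")
            _pixel_cache[key] = processor(images=image,
                                          return_tensors="pt").pixel_values

        # Prepend [CLS] to the tokenized question, as in the GIT VQA recipe.
        input_ids = processor(text=question, add_special_tokens=False).input_ids
        input_ids = torch.tensor([processor.tokenizer.cls_token_id] + input_ids)

        generated = model.generate(pixel_values=_pixel_cache[key],
                                   input_ids=input_ids.unsqueeze(0),
                                   max_length=50)
        return processor.batch_decode(generated, skip_special_tokens=True)[0]

Note that this only avoids repeating the image preprocessing; reusing the vision encoder's forward pass across questions would require changes inside the generation code itself.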

Installation

  1. Clone the repository:

    git clone https://github.com/umerkay/aEYE.git
    cd aEYE

  2. Install the required dependencies:

    pip install -r requirements.txt

  3. Run the web application:

    flask run

Usage

  1. Access the web application through your browser at http://localhost:5000/
  2. Upload an image and ask a question about it using natural language.
  3. Receive the answer generated by the model.

References

  • GenerativeImage2Text (GIT): Wang et al., "GIT: A Generative Image-to-text Transformer for Vision and Language," 2022. arXiv:2205.14100.