
This project is motivated by current VQA applications for visually impaired people. We want to build a system in which users can upload a picture of their surroundings or of a specific item and ask a question about it. The system generates an answer and reads it aloud. The project therefore has two parts: Visual Question Answering (VQA) and Voice Cloning.


We use Pythia as our model for the VQA task. Pythia is a modular framework for Visual Question Answering research, which formed the basis for the winning entry to the VQA Challenge 2018 from Facebook AI Research (FAIR)'s A-STAR team. It is built on top of PyTorch.

For the voice cloning task, we use Real-Time Voice Cloning to read out the answer. It is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real time.
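The two parts described above chain together: the VQA stage produces a text answer, and the voice stage turns that text into audio. The following is only a minimal sketch of that flow. The function names (`answer_and_speak`, `fake_vqa`, `fake_tts`) are hypothetical and the two stages are stand-ins, not the real Pythia or Real-Time Voice Cloning APIs.

```python
from typing import Callable


def answer_and_speak(
    image_path: str,
    question: str,
    answer_question: Callable[[str, str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run the VQA stage, then pass its text answer to the voice stage."""
    answer = answer_question(image_path, question)  # stage 1: VQA
    return synthesize(answer)                       # stage 2: speech


# Stand-in stages for illustration only (hypothetical, no real models).
def fake_vqa(image_path: str, question: str) -> str:
    return "three"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


audio = answer_and_speak(
    "street.jpg", "How many animals are there?", fake_vqa, fake_tts
)
```

Passing the stages in as callables keeps the two parts independent, so either model could be swapped without touching the other.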

Quickstart for VQA


1. Install the Pythia environment

  1. Install Anaconda.
  2. Install cuDNN v7.0 and CUDA 9.0. You can find a tutorial here.
  3. Create an environment for Pythia by running the commands below in a terminal.
conda create --name vqa python=3.6

source activate vqa
pip install demjson pyyaml

pip install

pip install torchvision
pip install tensorboardX

2. Clone the Pythia repository

git clone ~/pythia

3. Install dependencies and set up

cd ~/pythia
python develop

Download Data

Datasets currently supported in Pythia require two kinds of data: features and an ImDB. Features are pre-extracted object features from an object detector. The ImDB is the image database for a dataset and contains information such as questions and answers.

For the VQA task, we need to download the features from the COCO dataset and the VQA 2.0 ImDB. We assume that all data is kept in the data folder under the pythia root folder. If you want to use your own dataset, place it in the data folder as well. This step may take some time.

cd ~/pythia;
# Create data folder
mkdir -p data && cd data;

# Download and extract the features
tar xf coco.tar.gz

# Get vocabularies
tar xf vocab.tar.gz

# Download detectron weights required by some models
tar xf detectron_weights.tar.gz

# Download and extract ImDB
mkdir -p imdb && cd imdb
tar xf vqa.tar.gz

Here vqa2 is the key for the VQA2.0 dataset. If you want to use another dataset such as TextVQA or VizWiz, replace it with the corresponding key. These are the datasets that Pythia currently supports for the VQA task:

| Dataset | Task | Key | ImDB link | Features link |
|---|---|---|---|---|
| VQA2.0 | vqa | vqa2 | VQA 2.0 ImDB | COCO |
| VizWiz | vqa | vizwiz | VizWiz ImDB | VizWiz |
| TextVQA | vqa | textvqa | TextVQA 0.5 ImDB | Openimages |
| VisualGenome | vqa | visual_genome | Automatically downloaded | Automatically downloaded |
| CLEVR | vqa | clevr | Automatically downloaded | Automatically downloaded |
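As a small illustration of how the keys in the table above map onto command-line arguments, the sketch below builds the `--tasks`, `--datasets` and `--model` arguments for a chosen dataset. The helper name `dataset_args` is hypothetical. The script path and the `--config` value are deliberately left out, since they depend on the dataset and on your checkout.

```python
# Dataset keys copied from the table above; every listed dataset uses task "vqa".
DATASET_KEYS = {
    "VQA2.0": "vqa2",
    "VizWiz": "vizwiz",
    "TextVQA": "textvqa",
    "VisualGenome": "visual_genome",
    "CLEVR": "clevr",
}


def dataset_args(dataset_name: str, model: str = "pythia") -> list:
    """Return the --tasks/--datasets/--model arguments for one dataset."""
    try:
        key = DATASET_KEYS[dataset_name]
    except KeyError:
        raise ValueError(f"unsupported dataset: {dataset_name!r}") from None
    return ["--tasks", "vqa", "--datasets", key, "--model", model]


print(" ".join(dataset_args("VizWiz")))
```

Remember that the `--config` file usually has to match the dataset as well.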


After downloading and extracting the data, we can start training the model:

cd ~/pythia;
python tools/ --tasks vqa --datasets vqa2 --model pythia --config \

Pretrained models

Performing inference using pretrained models in Pythia is easy. This section assumes that you have already downloaded the required data as explained above.
Here are the links to the pretrained models:

We use the vqa2_train_val pretrained model. You can download it here. To run inference for EvalAI, run the following commands.

cd ~/pythia/data
mkdir -p models && cd models;
# Download the pretrained model
cd ../..;
python tools/ --tasks vqa --datasets vqa2 --model pythia --config configs/vqa/vqa2/pythia_train_and_val.yml  --run_type inference --evalai_inference 1 --resume_file data/models/pythia_train_val.pth

If you want to train or evaluate on val, change --run_type to train or val accordingly. You can also combine run types: for example, to train and then run inference on both val and test, set --run_type to train+val+inference.

If you remove the --evalai_inference argument, Pythia will perform inference and report results directly on the dataset. Note that this is not possible for test sets, because we don't have answers/targets for them. It is therefore useful for running inference on the val set locally.
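The run-type options described here can be composed programmatically. The sketch below is one possible way to do it, assuming only the three run types named in this section (train, val, inference). The helper `run_type_args` is hypothetical and not part of Pythia.

```python
# Run types mentioned in this section.
VALID_RUN_TYPES = ("train", "val", "inference")


def run_type_args(run_types, evalai_inference=False):
    """Build the --run_type (and optional --evalai_inference) arguments."""
    unknown = [r for r in run_types if r not in VALID_RUN_TYPES]
    if unknown:
        raise ValueError(f"unknown run type(s): {unknown}")
    # Multiple run types are joined with "+", e.g. train+val+inference.
    args = ["--run_type", "+".join(run_types)]
    if evalai_inference:
        args += ["--evalai_inference", "1"]
    return args


# Training followed by inference on val and test:
full_run = run_type_args(["train", "val", "inference"])
# Inference that produces a file for EvalAI:
evalai_run = run_type_args(["inference"], evalai_inference=True)
```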

After the evaluation, you can find the prediction report in the folder /pythia/save/vqa_vqa2_pythia/reports. The results look like this:

[{"question_id": 169624000, "answer": "yes"},
 {"question_id": 93006000, "answer": "car"},
 {"question_id": 46565001, "answer": "surfboard"},
 {"question_id": 13457004, "answer": "white"},
 {"question_id": 243145000, "answer": "florida"},
 {"question_id": 402159003, "answer": "blue and white"},
 {"question_id": 155875004, "answer": "graffiti"},
 {"question_id": 24226001, "answer": "blue"},
 {"question_id": 209024002, "answer": "fast"},
 {"question_id": 365644003, "answer": "black"},
 {"question_id": 428038002, "answer": "yes"},
 {"question_id": 133130004, "answer": "batman"},
 {"question_id": 72711000, "answer": "no"},
 {"question_id": 371925002, "answer": "yes"},
 {"question_id": 364999018, "answer": "playing frisbee"},
 {"question_id": 557744002, "answer": "no"}]
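Because the report is a JSON list of question_id/answer pairs, it is easy to post-process. The following sketch parses a report and counts how often each answer occurs. It reads from an inline string holding three entries from the sample above; for a real report you would read the file contents instead (the report file name is not specified here, so none is assumed).

```python
import json
from collections import Counter


def load_predictions(report_text: str) -> dict:
    """Map each question_id to its predicted answer."""
    return {row["question_id"]: row["answer"] for row in json.loads(report_text)}


# Three entries taken from the sample report above.
sample = (
    '[{"question_id": 169624000, "answer": "yes"},'
    ' {"question_id": 93006000, "answer": "car"},'
    ' {"question_id": 428038002, "answer": "yes"}]'
)

predictions = load_predictions(sample)
answer_counts = Counter(predictions.values())
```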

Demo for VQA

To quickly try out a model interactively with nvidia-docker:

  1. Download our pythia repository.
  2. Build the Docker image using the Dockerfile in the pythia folder:
nvidia-docker build pythia -t pythia:latest
     Alternatively, you can pull our Docker image from Docker Hub:
docker pull shuaiyue0929/pythia
  3. Run the pythia:latest image to open a Jupyter notebook with a demo model that you can ask questions interactively:
docker run --gpus 0 -it -p 8888:8888 pythia:latest

The demo in the Jupyter notebook works like this:
3.1 Enter the image URL and the question you want to ask about the image.
3.2 Click the button Ask Pythia!


3.3 You can view the uploaded image in the window. Below it, the predicted answers are shown in descending order of confidence.


  4. If the container runs on a remote machine, run the following command on your local machine to forward port 8888 and access the Jupyter notebook.
ssh -i thisIsmyKey.pem -L 8888:localhost:8888 ubuntu@ec2-34-227-222-
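Once the tunnel is up, a quick way to confirm that the notebook port is reachable from your machine is to try opening a TCP connection to it. This is a minimal sketch using only the Python standard library. The helper name `is_port_open` is hypothetical, and the default port 8888 is taken from the commands above.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8888,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("Jupyter port reachable:", is_port_open())
```

A True result only shows that something is listening on the port. It does not verify that the listener is the Jupyter server.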

Here is the Dockerfile.

FROM nvidia/cuda:10.2-base
FROM python:3-stretch
FROM jupyter/datascience-notebook

# This is needed to ensure cuda can view GPU

RUN pip install --upgrade pip

# Download files for model
#WORKDIR "/workspace"
# RUN mkdir Pythia 
#ADD ./ /workspace
COPY pythia_demo.ipynb ./

#RUN mkdir content
#RUN cd content
RUN mkdir model_data
RUN wget -O model_data/answers_vqa.txt
RUN wget -O model_data/vocabulary_100k.txt
RUN wget -O model_data/detectron_model.pth
RUN wget -O model_data/pythia.pth
RUN wget -O model_data/pythia.yaml
RUN wget -O model_data/detectron_model.yaml
RUN wget -O model_data/detectron_weights.tar.gz
RUN tar xf model_data/detectron_weights.tar.gz

# Pillow 7.0 currently has a compatibility error
RUN pip install Pillow==6.1

# Install dependencies
RUN pip install ninja yacs cython matplotlib demjson
RUN pip install git+

# Install fastText
RUN git clone fastText && cd fastText && pip install -e .

# Installing Pythia
RUN git clone pythia && cd pythia && pip install -e .

# Installing maskrcnn
RUN git clone && cd vqa-maskrcnn-benchmark && python build && python develop

USER root
RUN apt-get update
RUN apt-get install software-properties-common --assume-yes
RUN add-apt-repository ppa:graphics-drivers/ppa
RUN apt install nvidia-384 nvidia-modprobe --assume-yes
RUN wget
RUN chmod +x cuda_9.0.176_384.81_linux-run 
RUN ./cuda_9.0.176_384.81_linux-run --extract=$HOME
RUN ./ -noprompt
RUN wget
RUN tar -xzvf cudnn-9.0-linux-x64-v7.1.tgz  
RUN cp cuda/include/cudnn.h /usr/local/cuda/include
RUN cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
RUN chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
RUN echo 'export LD_LIBRARY_PATH=\"$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64\"\nexport CUDA_HOME=/usr/local/cuda\nexport PATH=$PATH:/usr/local/cuda/bin'>>~/.bashrc
#Install cuda

# Create jupyter notebook entrypoint
CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=", "--allow-root"]

Quickstart for Voice Cloning

You can either run the demo on localhost or on AWS.

Install dependencies

You will need the following whether you plan to use the demo only or to retrain the models: Python 3.7 (Python 3.6 might work too, but I wouldn't go lower because I make extensive use of pathlib). Run pip install -r requirements.txt to install the necessary packages. Additionally, you will need PyTorch (>=1.0.1). A GPU is mandatory, but you don't necessarily need a high-tier GPU if you only want to use the toolbox.
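The requirements above can be checked with a short script before installing anything else. The sketch below is an assumption about how such a check might look, not part of the Real-Time Voice Cloning repository. The names `parse_version` and `check_environment` are hypothetical; `torch.__version__` and `torch.cuda.is_available()` are standard PyTorch attributes.

```python
import sys

MIN_PYTHON = (3, 6)    # 3.7 recommended, 3.6 might work
MIN_TORCH = (1, 0, 1)  # PyTorch >= 1.0.1


def parse_version(text: str) -> tuple:
    """Turn '1.0.1+cu92' or '1.13.0a0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment() -> dict:
    """Report whether Python, PyTorch and a CUDA GPU meet the requirements."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    try:
        import torch
    except ImportError:
        report["torch_ok"] = False
        report["gpu_ok"] = False
    else:
        report["torch_ok"] = parse_version(torch.__version__) >= MIN_TORCH
        report["gpu_ok"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```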

Pretrained models

Download the latest here.


Before you download any dataset, you can begin by testing your configuration with: python If all tests pass, you're good to go.


You can download the LibriSpeech/train-clean-100 dataset here. Extract the contents as <datasets_root>/LibriSpeech/train-clean-100, where <datasets_root> is a directory of your choosing. The input audio should be in flac, wav, m4a, or mp3 format.
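To confirm that the dataset was extracted to the expected location, you can list the audio files under it. The sketch below assumes the directory layout and the four audio formats named above. The helper `find_audio_files` is hypothetical.

```python
from pathlib import Path

# Audio formats listed above.
AUDIO_EXTENSIONS = {".flac", ".wav", ".m4a", ".mp3"}


def find_audio_files(datasets_root: str) -> list:
    """Return all audio files under <datasets_root>/LibriSpeech/train-clean-100."""
    subset = Path(datasets_root) / "LibriSpeech" / "train-clean-100"
    if not subset.is_dir():
        raise FileNotFoundError(f"expected directory not found: {subset}")
    return sorted(
        p for p in subset.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

For example, `len(find_audio_files("/path/to/datasets_root"))` gives the number of usable audio files, which should be greater than zero before you start training.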

After training, you will have three models, stored in the folders /encoder/, /synthesizer/, and /vocoder/.

Demo for Voice Cloning

You can either run the demo directly using commands:


or run it using Docker.
If you want to use Docker, follow these steps:

  1. Make sure you have access to a machine with a CUDA-compatible GPU.
  2. Install nvidia-docker.
    Follow the instructions here. Note that you also need to have the NVIDIA driver and Docker installed.
  3. Create a Dockerfile. You can use one like this:
FROM pytorch/pytorch

WORKDIR "/workspace"
RUN apt-get clean \
        && apt-get update \
        && apt-get install -y ffmpeg libportaudio2 openssh-server python3-pyqt5 xauth \
        && apt-get -y autoremove \
        && mkdir /var/run/sshd \
        && mkdir /root/.ssh \
        && chmod 700 /root/.ssh \
        && ssh-keygen -A \
        && sed -i "s/^.*PasswordAuthentication.*$/PasswordAuthentication no/" /etc/ssh/sshd_config \
        && sed -i "s/^.*X11Forwarding.*$/X11Forwarding yes/" /etc/ssh/sshd_config \
        && sed -i "s/^.*X11UseLocalhost.*$/X11UseLocalhost no/" /etc/ssh/sshd_config \
        && grep "^X11UseLocalhost" /etc/ssh/sshd_config || echo "X11UseLocalhost no" >> /etc/ssh/sshd_config
ADD Real-Time-Voice-Cloning/requirements.txt /workspace/requirements.txt
RUN pip install -r /workspace/requirements.txt
CMD ["python",""]
  4. Build the Docker image by running:
nvidia-docker build -t pytorch-voice .

Or you can pull the Docker image from Docker Hub.

docker pull shuaiyue0929/vc
  5. Start a container to run the demo:
nvidia-docker run pytorch-voice

If you get a message saying Saved output as demo_output_00.wav, then congratulations, the demo ran successfully. By default, the target voice file is test.flac and the text to synthesize is "The answer of the question is that there are three animals in the picture".
