HotpotNet for TextVQA Task

This is a project repo for Team Hotpot at CMU LTI's 2022 Spring course 11-777 Multimodal Machine Learning.

Our contribution is Hotpot Net, a model that combines features from multiple modalities to tackle visual question answering tasks that require reading the text shown in the image.

File Structure

  • Reports: contains 1 final report and 3 intermediate reports that summarize the progress of our research and analysis throughout the semester.
  • Code: contains our implementation and experiments for the baselines and our proposed model, Hotpot Net.
  • Data: stores data downloaded from the official webpage of the TextVQA challenge.
  • Data_Analysis: contains code for exploratory analysis of the data in Data.
  • Modal_Analysis: contains code and results for quantitative analysis of model output.