Joint-Text-and-Image-Representation

Implementation of joint text and image representations using the VSE++ loss, and an implementation of t-SNE

Overview

This project contains an implementation of the VSE++ loss [Faghri, Fartash et al., 2017], a technique for learning visual-semantic embeddings for cross-modal retrieval, and an implementation of t-SNE [van der Maaten et al., 2008] (school project, Signal Learning and Multimedia class, 2019).
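For reference, below is a minimal PyTorch sketch of the max-of-hinges ("MH") variant of the VSE++ loss, which penalizes only the hardest in-batch negative per query. It is not the repository's code; the function name `vsepp_loss` and the margin value are illustrative assumptions.

```python
import torch

def vsepp_loss(im, cap, margin=0.2):
    """Max-of-hinges VSE++ loss with the hardest in-batch negatives.

    im, cap: (batch, dim) L2-normalised image / caption embeddings,
    where row i of `im` matches row i of `cap`.
    """
    scores = im @ cap.t()                    # pairwise cosine similarities
    pos = scores.diag().view(-1, 1)          # scores of the matching pairs

    # hinge cost of every in-batch negative, per retrieval direction
    cost_cap = (margin + scores - pos).clamp(min=0)     # image -> caption
    cost_im = (margin + scores - pos.t()).clamp(min=0)  # caption -> image

    # ignore the positive pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0.0)
    cost_im = cost_im.masked_fill(mask, 0.0)

    # keep only the hardest negative for each query (the VSE++ "MH" loss)
    return cost_cap.max(dim=1).values.mean() + cost_im.max(dim=0).values.mean()
```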

It is applied to the MS-COCO image captioning dataset [Lin, Tsung-Yi et al., 2014], in particular the val2014 split, which contains a set of 40k images annotated with five captions each. We also use ResNet-50 features [He, Kaiming et al., 2016] and GloVe embeddings [Pennington, Jeffrey et al., 2014].
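To give an idea of how a joint embedding space can be inspected, here is a small sketch that projects image and caption embeddings to 2-D. It uses scikit-learn's t-SNE rather than the t-SNE implemented in this project, and the array shapes and random data are placeholders standing in for the real ResNet-50 / GloVe-based embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder joint embeddings: 200 images and their 200 matching captions.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 512))
caption_emb = rng.normal(size=(200, 512))
joint = np.vstack([image_emb, caption_emb])

# Project the 512-d joint space down to 2-D for visual inspection.
coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(joint)

plt.scatter(coords[:200, 0], coords[:200, 1], s=5, label="images")
plt.scatter(coords[200:, 0], coords[200:, 1], s=5, label="captions")
plt.legend()
plt.show()
```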

A good introduction to representation learning is [Bengio, Y. et al., 2013].

Features

Installation

The project requires python3, python3-pip, the packages listed in requirements.txt, and a recent version of git that supports git-lfs.

To install the required packages:

pip3 install -r requirements.txt
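If the repository's large data files are tracked with Git LFS (as the git-lfs requirement above suggests), they can be fetched after cloning with:

git lfs install
git lfs pull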

Usage

A notebook is available, and each feature is illustrated by an example in the test directory.

References

  • Faghri, F., Fleet, D. J., Kiros, J. R., and Fidler, S. "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives." BMVC, 2018 (arXiv, 2017).
  • van der Maaten, L. and Hinton, G. "Visualizing Data using t-SNE." Journal of Machine Learning Research, 2008.
  • Lin, T.-Y., et al. "Microsoft COCO: Common Objects in Context." ECCV, 2014.
  • He, K., Zhang, X., Ren, S., and Sun, J. "Deep Residual Learning for Image Recognition." CVPR, 2016.
  • Pennington, J., Socher, R., and Manning, C. D. "GloVe: Global Vectors for Word Representation." EMNLP, 2014.
  • Bengio, Y., Courville, A., and Vincent, P. "Representation Learning: A Review and New Perspectives." IEEE TPAMI, 2013.

Authors

  • Charly Lamothe
  • Guillaume Ollier
  • Balthazar Casalé