baby-vision

Self-supervised learning through the eyes of a child

This repository contains code for reproducing the results reported in the following paper:

Orhan AE, Gupta VV, Lake BM (2020) Self-supervised learning through the eyes of a child. arXiv:2007.16289.

Requirements

  • pytorch == 1.5.1
  • torchvision == 0.6.1

Slightly older versions will probably work fine as well.

Datasets

This project uses the SAYCam dataset described in the following paper:

Sullivan J, Mei M, Perfors A, Wojcik EH, Frank MC (2020) SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. PsyArXiv.

The dataset is hosted on the Databrary repository for behavioral science. Unfortunately, we are unable to publicly share the SAYCam dataset here due to the terms of use. However, interested researchers can apply for access to the dataset with approval from their institution's IRB.

This project also uses the Toybox dataset for evaluation purposes. The Toybox dataset is publicly available online.

Code description

For specific usage examples, please see the SLURM scripts provided in the scripts directory.

Pre-trained models

We share below the pre-trained weights for our best self-supervised models trained on the SAYCam dataset. Four pre-trained models are provided: temporal classification models trained on data from the individual children in the SAYCam dataset (TC-S, TC-A, TC-Y) and a temporal classification model trained on data from all three children (TC-SAY).
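
For context, the temporal classification objective treats temporally contiguous segments of each child's video as pseudo-classes, so frames close together in time share a label. The toy snippet below (all numbers made up, not code from this repository) only illustrates that labeling idea and why longer datasets yield more classes:

frames_per_segment = 50   # hypothetical segment length in frames
num_frames = 1000         # hypothetical amount of video for one child
labels = [i // frames_per_segment for i in range(num_frames)]  # frames in the same segment share a class
num_classes = labels[-1] + 1  # more recorded video means more temporal classes (cf. the n_out values below)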

These models come with the classifier heads attached. To load these models, please do something along the lines of:

import torch
import torchvision.models as models

# Build the same architecture the checkpoints expect: a MobileNet-v2 backbone
# whose classifier head is a single linear layer with n_out output units.
model = models.mobilenet_v2(pretrained=False)
model.classifier = torch.nn.Linear(in_features=1280, out_features=n_out, bias=True)

# The saved state dicts expect a DataParallel-wrapped model, so wrap before loading.
model = torch.nn.DataParallel(model).cuda()

checkpoint = torch.load('TC-SAY.tar')
model.load_state_dict(checkpoint['model_state_dict'])

where n_out should be 6269 for TC-SAY, 2765 for TC-S, 1786 for TC-A, and 1718 for TC-Y. These numbers differ because the temporal classes are derived from the videos themselves, so the longer a child's recorded dataset, the more classes its model has. To use these models for a different task, you can detach the classifier head and attach a new classifier head of the appropriate size, e.g.:

model.module.classifier = torch.nn.Linear(in_features=1280, out_features=new_n_out, bias=True)

where new_n_out is the new output dimensionality. We also intend to release models fine-tuned on ImageNet in the near future for wider applicability.
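
As a concrete end-to-end sketch, the snippet below loads the TC-SAY checkpoint, swaps in a fresh head, and freezes the pre-trained trunk so that only the new head is trained (a linear probe on frozen features). Here new_n_out, the learning rate, and the optimizer choice are placeholders, not settings prescribed by this repository:

import torch
import torchvision.models as models

# Rebuild and load the model exactly as above (6269 temporal classes for TC-SAY).
model = models.mobilenet_v2(pretrained=False)
model.classifier = torch.nn.Linear(in_features=1280, out_features=6269, bias=True)
model = torch.nn.DataParallel(model).cuda()
checkpoint = torch.load('TC-SAY.tar')
model.load_state_dict(checkpoint['model_state_dict'])

# Replace the temporal-classification head with a head for the new task.
new_n_out = 10  # placeholder: number of classes in the downstream task
model.module.classifier = torch.nn.Linear(in_features=1280, out_features=new_n_out, bias=True).cuda()

# Freeze the pre-trained features and train only the new head.
for param in model.module.features.parameters():
    param.requires_grad = False
optimizer = torch.optim.SGD(model.module.classifier.parameters(), lr=0.01, momentum=0.9)

From here, training proceeds as usual with a cross-entropy loss, updating only the new classifier head.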

Acknowledgments

We are very grateful to the volunteers who contributed recordings to the SAYCam dataset. We thank Jessica Sullivan for her generous assistance with the dataset. We also thank the team behind the Toybox dataset, as well as the developers of PyTorch and torchvision, for making this work possible. This project was partly funded by NSF Award 1922658, NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science.