movie_cpm


This repository contains code and data for the following preprint:

Finn, Emily S. and Bandettini, Peter A. Movie-watching outperforms rest for functional connectivity-based prediction of behavior. https://doi.org/10.1101/2020.08.23.263723

In this project, we use resting-state and movie-watching data from the Human Connectome Project 7T dataset, acquired at the University of Minnesota. Functional connectivity matrices were constructed by calculating the Pearson correlation coefficient between all pairs of nodes in a 268-node parcellation. The nodewise time series that serve as the input to these matrices are included here under /data/all_shen_roi_ts.
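
For a sense of how these matrices are built, here is a minimal sketch in Python. The file name, array orientation, and .npy format are assumptions for illustration; only the /data/all_shen_roi_ts location comes from this repository.

    import numpy as np

    # Hypothetical file: a (n_timepoints x 268) array of nodewise time series
    ts = np.load('data/all_shen_roi_ts/subject01_MOVIE1.npy')

    # Pearson correlation between all pairs of nodes; np.corrcoef treats rows
    # as variables, so transpose to (268 x n_timepoints) first
    fc = np.corrcoef(ts.T)  # 268 x 268 symmetric matrix

    # Fisher z-transform is commonly applied before further modeling
    fc_z = np.arctanh(np.clip(fc, -0.999999, 0.999999))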

Notes: The 176 subjects used here include many sets of siblings and twins. It is important to control for family structure when cross-validating models, such that siblings are never split between training and testing folds; related subjects can be expected to be more similar in their functional connectivity patterns and/or behavioral scores, which would give the model an unfair advantage. Splitting subjects by family requires knowing which subjects are siblings, and this is considered sensitive information that is part of the restricted-access data. Therefore, to reproduce the analyses in this paper, you will first need to apply for access to these data from HCP (a sketch of such a family-aware split follows the steps below):

  1. Go to db.humanconnectome.org and create an account.
  2. On the main page, under the first dataset ("WU-Minn HCP Data - 1200 Subjects"), click "Data Use Terms Required". Read and accept terms.
  3. Read terms and apply for access to the restricted data by following instructions on this page: https://www.humanconnectome.org/study/hcp-young-adult/document/wu-minn-hcp-consortium-restricted-data-use-terms
  4. Once you have access to the restricted behavioral data, download the .csv file and save it in the /data directory as res_behav_data.csv.
  5. Run the notebook helper_mk_family_list.ipynb to generate the family_list.npy file (that is then used by cpm.py).
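
For intuition, the grouped split can be done with scikit-learn's GroupKFold, which keeps every member of a family in the same fold. This is a minimal sketch assuming family_list.npy holds one family ID per subject, aligned with the subject order of the connectivity and behavior arrays; the repository's actual format may differ.

    import numpy as np
    from sklearn.model_selection import GroupKFold

    # Assumed format: one family ID per subject, in subject order
    family_ids = np.load('data/family_list.npy')

    n_subj = len(family_ids)
    X = np.random.randn(n_subj, 35778)  # placeholder subject-by-edge features (268 choose 2 = 35,778 edges)
    y = np.random.randn(n_subj)         # placeholder behavioral scores

    # GroupKFold guarantees that no family is split across train and test,
    # so siblings never end up on opposite sides of a fold
    gkf = GroupKFold(n_splits=10)
    for train_idx, test_idx in gkf.split(X, y, groups=family_ids):
        assert set(family_ids[train_idx]).isdisjoint(family_ids[test_idx])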

In performing these analyses, we ran many iterations of both true and null models to obtain accurate distributions of model performance, making extensive use of Biowulf, the high-performance computing cluster at the NIH, to parallelize these iterations. Specifically, each true model was run 100 times, and the median of this distribution of performances was compared to a distribution of null models generated by shuffling the connectivity-behavior assignment across subjects 10,000 times. The script mk_jobs.py generates a jobs file to submit to the cluster. For example, to run 100 iterations of a model trained on MOVIE1 data to predict cognitive score, you would run:

python mk_jobs.py --clip MOVIE1 --behav cogn_PC1 --n_iter 100

...and then submit the resulting jobs file to your cluster scheduler.

To generate a null distribution of 10,000 shuffled models to compare against, you would run:

python mk_jobs.py --clip MOVIE1 --behav cogn_PC1 --rand_behav 1 --n_iter 10000
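
Once both sets of results exist, the empirical p-value is the fraction of null models that perform at least as well as the median true model. A minimal sketch, assuming the per-iteration prediction scores (e.g., Pearson r between observed and predicted behavior) have been collected into two arrays; the file names here are hypothetical:

    import numpy as np

    true_scores = np.load('results/MOVIE1_cogn_PC1_true.npy')  # hypothetical; 100 values
    null_scores = np.load('results/MOVIE1_cogn_PC1_null.npy')  # hypothetical; 10,000 values

    obs = np.median(true_scores)

    # Fraction of null models scoring at least as well as the observed median;
    # the +1 terms give the standard permutation-test estimate
    p = (np.sum(null_scores >= obs) + 1) / (len(null_scores) + 1)
    print(f'median r = {obs:.3f}, p = {p:.4f}')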

To run a single iteration of CPM, which takes roughly a minute on most machines and is easily done locally, run:

python cpm_wrapper.py --clip MOVIE1 --behav cogn_PC1
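
For orientation, the core of one CPM fold follows the standard connectome-based predictive modeling recipe (Shen et al., 2017, Nature Protocols): correlate every edge with behavior across training subjects, keep edges passing a significance threshold, sum the selected edges within each subject, and fit a linear model on those sums. The sketch below shows the positive-edge variant; the threshold and modeling details are assumptions, not necessarily what cpm.py implements.

    import numpy as np
    from scipy import stats

    def cpm_single_fold(X_train, y_train, X_test, p_thresh=0.01):
        """One fold of CPM (positive-edge model). X_* are subject-by-edge
        matrices; p_thresh is an assumed default."""
        n = len(y_train)

        # Vectorized Pearson correlation of every edge with behavior
        Xc = X_train - X_train.mean(axis=0)
        yc = y_train - y_train.mean()
        r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

        # Two-sided p-values via the t-distribution
        t = r * np.sqrt((n - 2) / (1 - r ** 2))
        p = 2 * stats.t.sf(np.abs(t), df=n - 2)

        # Select positively correlated edges and collapse them into one
        # summary feature per subject
        mask = (p < p_thresh) & (r > 0)
        train_sum = X_train[:, mask].sum(axis=1)
        test_sum = X_test[:, mask].sum(axis=1)

        # Fit a line on the training sums and predict the test subjects
        slope, intercept = np.polyfit(train_sum, y_train, 1)
        return slope * test_sum + intercept

Presumably cpm_wrapper.py combines steps like these with the family-aware cross-validation described above; see cpm.py for the actual implementation.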