This repository includes the data and code used to produce the manuscript by Leggat, Cohen, Willis, Fulp, deCamp et al., Science (2022). All data have been deidentified, and as of v2.0.0 all code is functional with the deidentified data.

Data Access

If you don't want to run the code but would just like the important data files for your own analysis, you can find the following:

The raw FACS files including .fcs, .xml and .csv files can be found here
Warning - this file is large at ~120 GB
The outputs of the processed FACS files are found in the processed flow directory.
The outputs need to be collated from both trial sites into one collated flow directory.
The FASTQ files from Sanger sequencing are found in the fastq directory.
The annotated, filtered and paired antibody sequences are found in the sequences/ directory and may be downloaded with this link
A merged summary file with all frequencies reported in this study can be found in the flow_summary directory and may be downloaded with this link

Pipeline

Installation pre-requisites

While not strictly necessary, we highly recommend using conda, the open-source package and environment manager. It lets you create a single environment with both Python and R dependencies. For the purposes of this repository, the minimal Anaconda installer (Miniconda) is sufficient.

Miniconda installers

Mac command line installer

Mac GUI installer

Linux command line installer

Because we depend on HMMER, Windows is not supported at the moment.

Installation

This installation assumes that git and conda are on your PATH.
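
Before installing, it can help to confirm that both tools are actually reachable. The following is a minimal sketch and not part of the repository; the helper name missing_tools is hypothetical and only relies on the standard command -v lookup.

```shell
# Hypothetical helper (not part of G001): print each named tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report anything the installation needs but cannot find.
missing=$(missing_tools git conda)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing >&2
else
  echo "git and conda are both available"
fi
```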

# clone repository
git clone https://github.com/SchiefLab/G001.git

# change directory
cd G001

# install G001
./install.sh

# activate environment
conda activate G001

Optional - If you'd like to run the figure generation code, you must pull the large input files using git-lfs.

# initialize git-lfs
git-lfs install

# pull large files
git-lfs pull

Once installed, you can list the available options with:

g001 --help

FACS analysis

The flow processing needs to be run independently for each of the two sites (VRC and FHCRC) from the flow_input data.

# get all raw flow files from public S3 bucket
wget https://iavig001public.s3.us-west-2.amazonaws.com/flow_input.tgz

# extract the files
tar -xvzf flow_input.tgz

# Run for FHCRC
g001 process-flow -s fhcrc -i flow_input/fhcrc/ -o flow_processed_out/

# Run for VRC
g001 process-flow -s vrc -i flow_input/vrc/ -o flow_processed_out/

# For more options, use
g001 process-flow --help
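
Because the two runs above differ only in the site name, they can also be written as a loop. This is a sketch under stated assumptions: the RUN variable and the process_sites wrapper are hypothetical conveniences, not part of the g001 command line, and the g001 flags are exactly those shown above.

```shell
# Dry run by default: RUN=echo prints each command instead of executing it.
# Set RUN to an empty string (for example, RUN= sh run_sites.sh) to really run g001.
RUN=${RUN-echo}

# Hypothetical wrapper: process both trial sites into the same output directory.
process_sites() {
  for site in fhcrc vrc; do
    $RUN g001 process-flow -s "$site" -i "flow_input/$site/" -o flow_processed_out/ || return 1
  done
}

process_sites
```

Stopping at the first failing site (the || return 1) avoids collating a partially processed output directory later.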

Collation of flow data

The following will combine the VRC and FHCRC flow data.

# If you ran the steps above in FACS analysis, you can use the following command to collate
g001 collate -f flow_processed_out/ -o collated_flow

# If you did not run the above steps in FACS analysis, we have already run them and
# placed the output in data/flow/flow_process_out. You can use the following command to collate
g001 collate -f data/flow/flow_process_out/ -o collated_flow

# For more options, use
g001 collate --help

BCR sequence analysis

Run the BCR sequence analysis pipeline (as in Leggat et al., fig. S10).

# run sequence analysis and output to the folder sequence_analysis_output
g001 sequence_analysis -o sequence_analysis_output

# For more options, use
g001 sequence_analysis --help

Combined B cell frequency and BCR sequence analysis

This code combines the sequencing and flow processing results and computes B cell frequencies among various sets of cells (e.g., VRC01-class B cells among all IgG+ memory B cells).

# If you ran the steps above for collate and sequence analysis, 
# and the respective output folders are sequence_analysis_output and collated_flow, 
# you can use the following command to combine the sequence and flow data
g001 combine -s sequence_analysis_output -c collated_flow -o combined_flow_seq


# If you did not run the above steps for collate and sequence analysis, we have already run them and placed the output
# in data/flow/collated_flow and data/sequence. You can use the following command to combine the sequence and flow data
g001 combine -s data/sequence -c data/flow/collated_flow -o combined_flow_seq
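
The choice between your own outputs and the precomputed ones can be made automatically. The sketch below assumes it is run from the repository root; first_existing_dir is a hypothetical helper, and the directory names are the ones used in the commands above.

```shell
# Hypothetical helper: print the first argument that is an existing directory.
first_existing_dir() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Prefer freshly generated outputs; otherwise fall back to the copies under data/.
if seq_dir=$(first_existing_dir sequence_analysis_output data/sequence) &&
   flow_dir=$(first_existing_dir collated_flow data/flow/collated_flow); then
  # echo makes this a dry run; remove it to execute the command.
  echo g001 combine -s "$seq_dir" -c "$flow_dir" -o combined_flow_seq
else
  echo "No sequence or collated flow directory found; run from the repository root" >&2
fi
```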

Figures and tables

Main figures

The following code generates the main text figures from the data in this repository.

g001 figures fig1
g001 figures fig2
g001 figures fig3
g001 figures fig4
g001 figures fig5
g001 figures fig6
g001 figures fig7
g001 figures fig8
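
The eight commands above follow one pattern, so they can be generated in a loop. This is only a sketch: RUN and make_main_figures are hypothetical names, and it assumes the large input files have already been pulled with git-lfs as described under Installation.

```shell
# Dry run by default: RUN=echo prints each command instead of executing it.
# Set RUN to an empty string to really run g001.
RUN=${RUN-echo}

# Hypothetical wrapper: generate main text figures fig1 through fig8 in order.
make_main_figures() {
  n=1
  while [ "$n" -le 8 ]; do
    $RUN g001 figures "fig$n" || return 1
    n=$((n + 1))
  done
}

make_main_figures
```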

Tables

The following code generates all supplementary tables in the Leggat et al. manuscript. Both individual PDFs and a single combined PDF are generated. This command is only supported on Mac.

g001 supptables -c -o supp_tables
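
Since this command is Mac-only, a script that runs the whole pipeline may want to skip it elsewhere. A minimal sketch, assuming the platform is detected with uname -s (which reports Darwin on macOS); supports_supptables is a hypothetical helper, not part of g001.

```shell
# Hypothetical helper: succeed only when the given kernel name is macOS (Darwin).
supports_supptables() {
  [ "$1" = "Darwin" ]
}

if supports_supptables "$(uname -s)"; then
  # echo makes this a dry run; remove it to execute the command.
  echo g001 supptables -c -o supp_tables
else
  echo "Skipping supplementary tables: only supported on Mac" >&2
fi
```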

Issues

Please submit any issues to the issues page and we will be happy to help.