/BRAPeS

BCR reconstruction from short single cell RNA-seq

Primary LanguageMakefileOtherNOASSERTION

BRAPeS

We present BRAPeS (BCR Reconstruction Algorithm for Paired-End Single-cell), a software for reconstruction of B cell receptors (BCR) using short, paired-end single-cell RNA-sequencing.

BRAPeS is an extension of TRAPeS which reconstructs the BCR in 5 steps: For each chain, it first identify the V and J segments by searching for paired reads with one read mapping to the V segment and its mate mapping to the J segment. Then, a set of putative CDR3-originating reads are identified as the set of unmapped reads whose mates map to the V,J and C segments. Next, an iterative dynamic programming algorithm is used to reconstruct the CDR3 region with the putative CDR3 reads. The isotype of the BCR is then determined by running RSEM on the reconstructed sequence with all possible constant segments added to it. Finally, BRAPeS corrects for somatic hypermutations by collecting all reads aligning to the genomic CDR1, CDR2 and Framework sequences and aligning the reads against themselves to obtain a reconstruction of the consensus sequence.

For more information, see our paper in Life Science Alliance or our preprint in bioRxiv

installing BRAPeS

BRAPeS is written in Python and C++ and works on Linux (see comment below if you wish to run BRAPeS on mac). BRAPeS requires the following python libraries:

  • biopython
  • pysam

In addition, BRAPeS requires the following software:

  • bowtie2 (recommended version 2.3.2 and up)
  • RSEM (recommended version 1.2.21 and up)
  • samtools (recommended version 1.7 and up)

before running BRAPeS

BRAPeS takes as input mapped and unmapped files of reads after genomic alignment (e.g. using TopHat).
BRAPeS assumes a certain folder structure: It assumes that each cell has its own folder, and all of those folders are under one path. Also, it assumes that each cell folder has identical subfolder structure.
In case the genomic alignment was done with a software that output only one bam file (e.g. STAR), you can provide BRAPeS with the same file as both the mapped and unmapped file. Please email us if this produces an error.

Running BRAPeS

To run BRAPeS, simply run:

python brapes.py [options]

To display help:

python brapes.py -h

Testing BRAPeS

To test that BRAPeS is installed correctly, run the following command:

python /path-to-BRAPeS/brapes.py -genome mm10_ncbi -path /path-to-BRAPeS/Example/proc_data/ -sumF /path-to-BRAPeS/Example/BRAPeS_out/BCR.out -output BRAPeS_out/BCR.out -score 15 -top 10 -byExp -unmapped unmapped.bam -bam sorted.bam

This should produce the following files:
/path-to-BRAPeS/Example/BRAPeS_out/BCR.out.summary.txt
/path-to-BRAPeS/Example/BRAPeS_out/BCR.out.BCRs.txt

which should be similar to the files:
/path-to-BRAPeS/Example/BRAPeS_out/example.output.summary.txt
/path-to-BRAPeS/Example/BRAPeS_out/example.output.BCRs.txt

Test files were processed from raw fastq files taken from "Wu,Y.L., Stubbington,M.J.T., Daly,M., Teichmann,S.A. and Rada,C. (2016) Intrinsic transcriptional heterogeneity in B cells controls early class switching to IgE. J. Exp. Med."

Running BRAPeS on macOS

To run BRAPeS on macOS, you will need to compile the CDR3 reconstruction algorithm. To do that you will need the seqan c++ pacakge. From the BRAPeS folder compile using the commend:

g++ vdjAlignment.bcr.cpp -I path-to-seqan/include -o vdj.alignment.bcr



Options when running BRAPeS

Input files:

-path : The folder with all the single cell samples. Assumes every subfolder under this path is a folder of a single cell sequencing results.
-bam : The location of the sorted mapped bam file relative to the single cell folder. For example, if under each single cell folder there is a subfolder named “TopHat_Output” and the mapped file is under that folder and named mapped.bam, the command should read “–bam TopHat_Output/mapped.bam”.
-unmapped : the location of the bam file containing the unmapped reads. Similar to the –bam tag, it is relative to the single cell folder.

Output files:

-output : Output prefix for the files generated by BRAPeS in every single cell folder, relative to the single cell folder (e.g. TopHat_Output/BRAPeS/TCR.out). Can include a subdirectory (BRAPeS will create that subdirectory if it does not yet exist).

-sumF : Prefix for the output summary files of the entire path (summary of all single cells together).

Parameters for the CDR3 reconstruction

-downsample : For very deep libraries, there can be many BCR-originating reads which can increase running time significantly. By adding this tag, in case there are more than 5,000 reads mapping to the V and J segment, and more than 5,000 putative CDR3-originating reads, BRAPeS will downsample the reads to include only 5,000 V-J reads and only 5,000 CDR3-originating reads for the reconstruction process, and will only use 40,000 for somatic hypermutations correction. This will not influence the read mapping with RSEM to find the isotype. However, to further decrease running time the number of aligned reads reported at the summary file is set to maximum 100.

-score : The threshold score for the alignment of a read to the V or J segment. Any putative CDR3-originating read that its alignment to the V or J segments passes this score will be used for reconstruction. Default is 15 but should be changed based on sequencing quality and length.

-overlap : Minimum number of bases that the extended V and J segments should overlap in order to stop the reconstruction. Default is 10.

-iterations : Maximum number of times to extend the V and J segments. If after that number of iterations the V and J segments still do not overlap the reconstruction is determined as unsuccessful. Default is 20.

-bases : Number of bases from the V and the J segments that will be used as the initial template for the reconstruction. Default is min(length(V), length(J)).

-NoLowQ : By default, the putative CDR3-originating reads are identified as unmapped reads whose mate is aligned to the V/J/C segments. However, those reads can by chance be mapped to other places in the genome (usually low quality mapping). By including the –NoLowQ tag the set of CDR3-origintating reads will exclude the reads that map to other places in the genome.

-top: Rank all the V-J pairs based on the number of mapped reads, and reconstruct only the top X number of V-J pairs. Default: reconstruct all V-J pairs. Recommended for very deep libraries or libraries when there are many possible V-J pairing.

-byExp: Must be used along with the top parameter. Rank all the V-J pairs based on the number of mapped reads, but instead of reconstructing only the top X number of V-J pairs, randomly choose 2 V-J pairs with the same rank (same number of mapped reads) and reconstruct them. Then BRAPeS will move on to the next rank until the number of V-J pairs specified with top has been reconstructed. Default: off.

-oneSide: Add this parameter to also search for productive reconstructions only from the extended V segment (in case of no overlap between the extended V and extended J segments). Default: off.


Parameters for somatic hypercorrection correction


-skipHVR: Add this flag to avoid reconstructing the hypervariable regions.
-Hr: Maximum mutation rate when aligning reads to the complementary determining regions (CDR1 and CDR2) in order to consider a read for reconstruction. Default is 0.35.
-Fr: Maximum mutation rate when aligning reads to framework regions (FR1, FR2, FR3 and FR4) in order to consider a read for reconstruction. Default is 0.2.

Paths to other software:

-bowtie2 : Path to bowtie2. If not used assumes bowtie2 is in the default PATH

-rsem: Path to RSEM. If not used assumes RSEM is in the default PATH

-samtools: Path to samtools. If not used assumes samtools is in the default PATH

Other parameters:

-genome : The genome used for genomic alignment. By default, BRAPeS only supports hg38 and mm10, and for mm10 with NCBI chromosome naming use mm10_ncbi. However, BRAPeS can be used for any genome or any organism, as long as the genomic annotations for the V/J/C segments are available. In order to run BRAPeS on genomes besides mm10 or hg38, the user must create the relevant annotation files (see here for more information). In addition, this will require to add the '-Hminus'/'-Kminus'/'-Lminus' parameters if necessary.

-Hminus: Only used for user supplied genomes. Add this flag if the (majority) of V and J annotations of the heavy chain are on the minus strand.
-Kminus: Only used for user supplied genomes. Add this flag if the (majority) of V and J annotations of the kappa chain are on the minus strand.
-Lminus: Only used for user supplied genomes. Add this flag if the (majority) of V and J annotations of the lambda chain are on the minus strand.

-strand : Strand orientation of the reads, options are [minus, plus, none]. For transcripts on the positive strand, to which strand does the rightmost (in genomic coordinates) mate of the read map to. Default is minus.


BRAPeS output


- sumF.summary.txt : Summary of the reconstruction status in each cell (successful/unsuccessful).
  • sumF.BCRs.txt : A list of all reconstructions (productive, unproductive and partial) of all cells.

In addition, in each single cell folder you can find the following useful output files:

- output.\[heavy/kappa/lambda\].\[R1/R2\].fa : Set of paired-end reads that are aligned to the reconstructed BCRs in order to quantify the expression of each BCR using RSEM.
- output.\[heavy/kappa/lambda\].rsem.out.genes.results : The isotype ranking (by RSEM) for each V-J pair.
- output.\[heavy/kappa/lambda\].full.BCRs.bestIso.CDR1.CDR2.reconstructions.fasta.BCRs.fasta : Fasta file with the full sequences of the productive BCRs, after selecting the highest expressed isotype and somatic hypermutations correction.
- output.\[heavy/kappa/lambda\].full.BCRs.bestIso.CDR1.CDR2.reconstructions.fasta.CDR1.CDR2.fasta : The reconstructed sequences of the CDR1, CDR2 and Framework regions.
- output.summary.txt : Summary of all the reconstructed chains in this cell.



For any questions or comments, please email Shaked Afik, safik@berkeley.edu.