/SeroCall

Pneumococcal serotype quantification using NGS reads

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

SeroCall

SeroCall can identify and quantitate the capsular serotypes in Illumina whole-genome sequencing samples of S. pneumoniae, calculating abundances of each serotype in mixed cultures. The software is written in Python (compatible with Python 2 or 3), and is freely available under an open source GPLv3 license.

Note (July 30, 2019): The software has just been released, and is still considered in beta testing.

Contents


Installation and Dependencies

SeroCall has the following dependencies:

  • Python 2.7+ or 3+ (tested using 2.7.11 and 3.5.1)
  • BWA version >= 0.7.15
  • Samtools version >= 1.3 (only used if you change/update the reference database files)

Currently, to install SeroCall, download or clone this github repository, then add the SeroCall directory to your PATH environment variable. Installation through GitHub releases, pip, bioconda and Docker are forthcoming.

Running SeroCall

usage: serocall [-h] [-o OUTPUT] [-t THREAD] r1file r2file

positional arguments:
  r1file                R1 FASTQ file (can be gzipped)
  r2file                R2 FASTQ file (can be gzipped)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Prefix for output files [default 'sero']
  -t THREAD, --thread THREAD
                        Number of processors to use [default 1]

In its simplest form, the serocall command just takes the the R1 and R2 fastq files from a paired-end sequencing run. The files can be gzipped or uncompressed, and there are no limitations or patterns to the given filenames.

A "-t #" option can be used to have BWA MEM use multiple threads for processing (BWA MEM is very efficient with parallelizing the alignments, so using multiple threads will reduce the running time of the tool).

SeroCall will generate two output files, a *_counts.txt containing the intermediate "bin" counts, counting how many reads aligned across the regions of the serotype references, and a *_calls.txt containing the final calls. By default, the output file names will use the prefix "sero" (outputting files "sero_counts.txt" and "sero_calls.txt"), but that can be changed using the "-o prefix" option.

The following is an example sero_calls.txt file:

##fileformat=SeroCallv1.0
##NumReads=3420303
##NumUnmapped=129037
##NumGenome=3232094
##NumOther=31298
##NumCapsule=27874
#SEROTYPE       PERCENTAGE
19F     61.8%
01      38.2%

The first line of the output is a file format tag. The other lines, before the "#SEROTYPE..." header, contain metrics about the read data, including the following:

  • NumReads - the number of input reads
  • NumUnmapped - number of reads not mapped to serotype or S. pneumoniae genomes
  • NumGenome - number of reads mapped to the S. pneumoniae genomes (not the serotype sequences)
  • NumOther - number of other alignments, including partial, high error and chimeric alignments
  • NumCapsule - number of reads well-aligned to the serotype capsular sequences (i.e., used to call serotypes)

The "#SEROTYPE..." header and following lines are tab-delimited lines reporting the serotypes called and their percentages. If no lines appear after the header, no serotypes were called from the data.

The sero_counts.txt file is technically an intermediate file, but can be useful for post-analysis QC of the calls. The initial lines of an example file are the following:

##fileformat=BinCountsv1.0
##NumReads=3420303
##NumUnmapped=129037
##NumGenome=3232094
##NumOther=31298
##NumCapsule=27874
#SAMPLE SEROTYPE        START   END     TOTAL   UNIQUE
sero    01      1       500     373     369
sero    01      501     1000    546     508
sero    01      1001    1500    612     584
sero    01      1501    2000    640     339
sero    01      2001    2500    512     322
sero    01      2501    3000    556     556
sero    01      3001    3500    568     551
sero    01      3501    4000    669     669
sero    01      4001    4500    637     637
sero    01      4501    5000    852     852
sero    01      5001    5500    775     775
sero    01      5501    6000    698     698
sero    01      6001    6500    655     655
sero    01      6501    7000    729     729
sero    01      7001    7500    786     786
sero    01      7501    8000    732     732
sero    01      8001    8500    763     763
sero    01      8501    9000    780     780
sero    01      9001    9500    729     729
.
.
.

After the header lines, the rest of the file is a tab-delimited file identifying the regions of the serotype sequences, and then reporting the total counts of aligned reads, and the count of uniquely aligning reads in each region. The final lines in the file contain an extra column with a "diff" identifier, marking them as difference locations for distinguishing closely related serogroups.

Building/Updating the Reference Database

WARNING: This functionality has not been tested externally, does require some (to a lot of) care when making changes, and should be needed by only a small subset of users.

The current software has pre-built databases containing the same serotype sequences as the PneumoCaT v1.2 database, and building the database is not required in order to use the serocall command. The package does provide the ability to change and update the references used by SeroCall, but this has not been extensively tested.

The builddata sub-directory contains the data files and scripts necessary to rebuild the SeroCall database files (found in the "data" sub-directory of the repository). In that sub-directory, the serotype capsular sequences are found in the "serotypes" sub-directory, the S. pneumoniae references are found in the "genomes" sub-directory and the difference locations are found in the "pcat_diffs.txt" file.

To add a new serotype capsular sequence to the database, create a FASTA file containing only that sequence, and then place it in the serotypes sub-directory. IMPORTANT: The file must end with ".fa" and the name of the file must match the accession given on the FASTA header. So, for a new serotype like "10X", the name of the file should be "10X.fa" and the first line of the file should be ">10X ..." where after the first space on the line, any text is allowed.

To add a new genome reference to the database, create a masked FASTA file of the reference sequence and place it in the genomes sub-directory. Any name can be given to the file, however the accession on the FASTA header must begin with "Spneumo_". IMPORTANT: The capsular sequence must be masked from the genome sequence (or all of the reads will align to it, confusing SeroCall). To do that, megablast/blat/minimap2 the serotype sequences against the genome reference, identify any region of the genome with very high identity (>98%) longer than 100 bp, and replace those regions' bases with N's in the FASTA file. (Remember, the goal is not to match to this sequence, but to use it as a decoy so that genomic reads don't align to the serotype sequences.)

If you believe you want to make changes/additions to the difference locations, please email j.knight@yale.edu for instructions on how to do that (there are a number of subtle issues that must be accounted for, in order for the software to properly identify serotypes using these difference).

Once the changes have been made to these files, then the command "bash builddata.sh" command will rebuild the database files and install them into the top-level "data" directory, for use by the serocall command. Things to note when running this command:

  • Both bwa and samtools are required for this step, so they should be in your PATH.
  • All of the initial construction of the files occurs in the "build" directory, then a final transfer occurs into the top-level "data" directory.
  • A backup of the existing top-level "data" directory files is made, placed in "data/bak". (But, only a single backup is made, so if builddata.sh is run multiple times, the previous backup is removed.)
  • This command will download a pair of FASTQ files, totalling 1.5 GB, and store them in the "genomeOnly" sub-directory the first time it is run. These files contain a real run of a capsule-free sample, and is used to identify those serotype regions where genomic reads may align into the serotypes.

Citation

To cite the SeroCall package, please use the following citation:

Knight J, Dunne E, Mulholland E, Saha S, Satzke C, Tothpal A, Weinberger D. Determining the Serotype Composition of Mixed Samples of Pneumococcus using Whole Genome Sequencing. To be submitted.