
Personalized Reference Editor for Somatic Mutation discovery in cancer genomics

PRESM

PRESM stands for Personalized Reference Editor for Somatic Mutation discovery. Other reference genome editors generate a diploid reference genome, which may distribute the reads across two sites and impair the soundness of the downstream statistical framework. PRESM instead provides two haploid reference genomes.

The PRESM pipeline involves three steps. First, germline mutations are discovered by another tool, e.g., GATK, and are used to make the personalized references for calling somatic mutations. Second, a reference genome composed of all personal variants (including both heterozygous and homozygous sites) is used as a “decoy” to capture the heterozygous variants in the reads. Third, PRESM changes the reads by replacing all heterozygous alleles with the corresponding reference alleles and maps the modified reads back to another personalized reference genome that contains only the homozygous changes. The output of this step is a BAM file ready for any somatic mutation caller to use.

We intend to offer long-term maintenance for PRESM and to continue adding new functions to it.

Installation

PRESM is a batteries-included JAR executable, so no installation is needed. Copy the executable presm.jar and run it with the standard Java command: java [-Xmx] -jar presm.jar
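As a hedged illustration, the invocation can be wrapped in a small shell function so that later commands stay short. The function name presm and the variables PRESM_JAVA, PRESM_JAR, and PRESM_HEAP are inventions of this sketch, not part of PRESM, and the 4g heap size is only a placeholder. Note that the options are typed with plain ASCII hyphens.

```shell
# Hypothetical helper, not shipped with PRESM.
# PRESM_JAVA: Java binary to use (default: java on the PATH)
# PRESM_HEAP: value passed to -Xmx (default: 4g, an arbitrary starting point)
# PRESM_JAR:  location of the jar (default: presm.jar in the current directory)
presm() {
    "${PRESM_JAVA:-java}" "-Xmx${PRESM_HEAP:-4g}" -jar "${PRESM_JAR:-presm.jar}" "$@"
}
```

With the defaults, calling presm with any arguments runs java -Xmx4g -jar presm.jar followed by those arguments.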

Functions

  • Processing variant files generated by GATK, Pindel, or other variant-calling software: combining two variant files (one for SNPs and one for indels); selecting homozygous or heterozygous variants; removing variants with duplicated coordinates.
  • Generating the personalized reference genome from the germline mutations provided by the user.
  • Generating modified background database files that match the personalized reference genomes, for example a personalized dbSNP, db.Indel, or cosmic.vcf. (Several downstream somatic mutation callers require these files.)
  • Mapping the coordinates of somatic variants called against the personalized reference genome back to the coordinates of the universal reference genome.
  • Replacing the alternative alleles in the reads with reference bases at the heterozygous variants provided by the user.
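
The three pipeline steps from the introduction can be strung together from these functions roughly as follows. This is a sketch under assumptions, not a prescribed workflow: the helper names (presm, presm_build_references, presm_replace_reads, presm_map_back), the variables PRESM_JAVA and PRESM_JAR, and all intermediate file names are made up here. Read alignment and somatic calling happen outside PRESM and are not shown. Passing the homozygous calls to MapVariants is an inference from the fact that the final reference contains only homozygous changes.

```shell
# Hypothetical wrapper around the jar; not part of PRESM.
presm() { "${PRESM_JAVA:-java}" -jar "${PRESM_JAR:-presm.jar}" "$@"; }

# Step 1: build the two haploid references from the germline calls.
# $1 universal reference, $2 SNP calls, $3 indel calls (e.g. from GATK).
presm_build_references() {
    presm -F CombineVariants -R "$1" -variant1 "$2" -variant2 "$3" -O germline.vcf || return
    presm -F SelectGenotype -genotype homo -variants germline.vcf -O homo.vcf || return
    presm -F SelectGenotype -genotype heter -variants germline.vcf -O heter.vcf || return
    # Decoy reference: every germline variant, heterozygous and homozygous.
    presm -F MakePersonalizedReference -I "$1" -germlinemutations germline.vcf -O decoy.fa || return
    # Final reference: homozygous changes only.
    presm -F MakePersonalizedReference -I "$1" -germlinemutations homo.vcf -O homo.fa
}

# Steps 2 and 3: once the reads are aligned to decoy.fa (aligner not shown),
# swap the heterozygous alleles in the reads back to reference bases.
# $1 SAM file aligned to decoy.fa, $2 sequencing read length.
presm_replace_reads() {
    presm -F ReplaceGenotype -I "$1" -germlinemutations heter.vcf -O replaced.sam -readlength "$2"
}

# After realigning the modified reads to homo.fa and calling somatic
# mutations (both outside PRESM), map the calls to universal coordinates.
# $1 somatic calls in homo.fa coordinates.
presm_map_back() {
    presm -F MapVariants -I "$1" -O somatic.universal.vcf -germlinemutations homo.vcf
}
```

A run would call presm_build_references, align, call presm_replace_reads, realign and call somatic mutations, then finish with presm_map_back.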

Commands and options

All functions are invoked as java [-Xmx] -jar /path/to/presm.jar, followed by -F, the function name, and its arguments, as shown in the usage lines below.

CombineVariants: Combine two variant call files according to the reference genome.

> -F CombineVariants -R ref.fasta -variant1 input1.vcf -variant2 input2.vcf -O output.vcf

Parameters:

  • -R: input the reference genome file
  • -variant1: input variant file 1 (in vcf format)
  • -variant2: input variant file 2 (in vcf format)
  • -O: output the combined variant call file in vcf format

SelectGenotype: Select homozygous or heterozygous variants in the variant call file provided by the users.

> -F SelectGenotype -genotype homo/heter -variants input.vcf -O output.vcf

Parameters:

  • -genotype: specify which genotype to select: homo (homozygous variants) or heter (heterozygous variants)
  • -variants: input the variants in vcf format
  • -O: output the variants of the specified genotype in vcf format

RemoveOverlaps: Remove overlapping variants in a variant call file.

> -F RemoveOverlaps -R ref.fasta -variants input.vcf -O output.vcf

Parameters:

  • -R: input the reference genome file
  • -variants: input the variant in vcf format
  • -O: output the variants with overlaps removed in vcf format

SortVariants: Sort variants according to the reference genome coordinates.

> -F SortVariants -R ref.fasta -variants input.vcf -O output.vcf

Parameters:

  • -R: input the reference genome file
  • -variants: input the variant in vcf format
  • -O: output the sorted variant in vcf format

MakePersonalizedReference: Generate personalized reference genome according to the germline mutations provided by the users.

> -F MakePersonalizedReference -I ref.fasta -germlinemutations input.vcf -O output.fa [-intervals input.intervals] [-genotype homo/heter]

Parameters:

  • -I: input the reference genome file
  • -germlinemutations: input the germline mutations in vcf format
  • -O: output the personalized reference genome in fasta format

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants

MakePersonalizedVariantsDB: Generate personalized variants database files according to the germline mutations provided by the users.

> -F MakePersonalizedVariants -I input.vcf -O output.vcf -variants variant.vcf [-intervals input.intervals] [-genotype homo/heter] [-removeduplicates]

Parameters:

  • -I: input the variants database in vcf format
  • -O: output the personalized variants database in vcf format
  • -variants: input the mutations in vcf format

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants
  • -removeduplicates: remove duplicated variants
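
When several database files need personalizing (for instance dbSNP and COSMIC), the same command can be looped over them. The sketch below is illustrative only: the helper names presm and personalize_databases, the variables PRESM_JAVA and PRESM_JAR, and the personalized. output prefix are assumptions. The usage line above spells the function MakePersonalizedVariants while the heading says MakePersonalizedVariantsDB; the sketch follows the usage line.

```shell
# Hypothetical wrapper around the jar; not part of PRESM.
presm() { "${PRESM_JAVA:-java}" -jar "${PRESM_JAR:-presm.jar}" "$@"; }

# $1: germline mutations (vcf); remaining arguments: database files (vcf).
# Writes personalized.<name> into the current directory for each database.
personalize_databases() {
    germline=$1
    shift
    for db in "$@"; do
        presm -F MakePersonalizedVariants -I "$db" \
            -O "personalized.$(basename "$db")" -variants "$germline" || return
    done
}
```

For example, personalize_databases germline.vcf dbsnp.vcf cosmic.vcf would issue one command per database file.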

MapVariants: Map the personalized reference genome-based coordinates of the variants to their corresponding coordinates in the universal reference genome.

> -F MapVariants -I input.vcf -O output.vcf -germlinemutations variant.vcf [-intervals input.intervals] [-genotype homo/heter] [-removeduplicates]

Parameters:

  • -I: input the somatic mutations in vcf format
  • -O: output the somatic mutations being mapped to the universal reference genome in vcf format
  • -germlinemutations: input the germline mutations in vcf format

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants
  • -removeduplicates: remove duplicated variants
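
To make the coordinate shift concrete, here is a toy model of the mapping. It is not PRESM's implementation. It assumes a single chromosome, variants sorted by position and non-overlapping, every listed variant applied to the personalized reference, and VCF-style indels in which the first base of REF and ALT is a shared anchor. The function name to_universal and the three-column input (POS REF ALT) are made up for this sketch.

```shell
# to_universal PERSONALIZED_POS < variants.txt
# Each input line: "POS REF ALT" in universal coordinates.
to_universal() {
    awk -v p="$1" '
    {
        # Position of this variant on the personalized reference.
        anchor = $1 + delta_sum
        # The query lies at or before this variant: later variants cannot shift it.
        if (p <= anchor) exit
        # Net length change: positive for insertions, negative for deletions.
        d = length($3) - length($2)
        # Inserted bases have no counterpart on the universal reference.
        if (d > 0 && p <= anchor + d) {
            msg = sprintf("inside insertion at %d, offset %d", $1, p - anchor)
            exit
        }
        delta_sum += d
    }
    END {
        if (msg != "") { print msg } else { print p - delta_sum }
    }'
}
```

With an insertion at 100 (A to ACGT) and a deletion at 200 (ACGT to A), personalized position 104 maps to universal position 101 and 203 maps to 200. Position 102 falls inside the inserted bases and has no universal coordinate, which is the situation the SomaticMutationsOnGermlineInsertion function reports on.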

ReplaceGenotype: Replace the alternative alleles in the sequencing reads with reference bases according to the heterozygous variants provided by the users.

> -F ReplaceGenotype -I input.sam -germlinemutations germlinemutations.vcf -O output.sam -readlength len [-intervals input.intervals] [-genotype homo/heter]

Parameters:

  • -I: input the sequence alignment map file in sam format
  • -germlinemutations: input the germline mutations in vcf format
  • -O: output the replaced sequence alignment map file in sam format
  • -readlength: the sequencing read length

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants

ViewFasta: View a specified region of the reference genome sequence.

> -F ViewFasta -R ref.fasta [-L input.list] [-region specified region]

Parameters:

  • -R: input the reference genome file
  • -L: input a file listing the regions to view; use this to view multiple regions
  • -region: input one region; use this to view a single region

Examples of the region specification format:

chr1: Output the whole sequence of chromosome 1 in the reference genome.

chr2: 5000 Output the chromosome 2 sequence from base position 5000 to the end of chromosome 2.

chr3: 500-600 Output the chromosome 3 sequence from base position 500 to base position 600.

SomaticMutationsOnGermlineInsertion: Output the relative coordinate of somatic mutations located on germline insertions.

> -F SomaticMutationsOnGermlineInsertion -germlinemutations germlinemutation.vcf -I input.vcf -O output.txt [-intervals input.intervals] [-genotype homo/heter]

Parameters:

  • -germlinemutations: input the germline mutations in vcf format
  • -I: input the somatic mutations (using the personalized coordinate system) in vcf format
  • -O: output the locations of somatic mutations on germline insertions

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants

Contacts

Copyright License (MIT Open Source)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.