HgtSIM

A simulator for horizontal gene transfer (HGT) in microbial communities

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Publication

Workflow

[Figure: HgtSIM workflow]

Dependencies

Change Log

  • 2019-01-06:

    • HgtSIM can be installed with "pip3 install HgtSIM" now.
  • 2018-04-06:

    • combined the '-mixed', '-mini' and '-maxi' options into one: '-mixed min-max'.
  • 2017-09-16:

    • added support for draft genomes.
    • added support for dynamic flanking sequences.
    • added support for the 'mixed' mode.
    • added support for the 'keep_cds' option.

To-do

  • run Prodigal if "-keep_cds" was specified
  • check Ns in provided gene sequences
  • check whether provided sequences to transfer are ORFs, exit if not

Installation

  • HgtSIM is implemented in Python 3. You can install it with:

      pip3 install HgtSIM
    
  • HgtSIM requires BLAST+. You can either add it to your system path or specify the full paths to the "blastn" and "blastp" executables with the options "-blastn" and "-blastp".
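
Before running HgtSIM, it can help to confirm that the BLAST+ executables are reachable. The following is a minimal sketch and not part of HgtSIM itself; the function name `locate_blast` is hypothetical. It uses the standard-library `shutil.which` to resolve the two program names (or full paths) the same way a shell would.

```python
import shutil


def locate_blast(blastn="blastn", blastp="blastp"):
    """Return a dict mapping each program to its resolved path, or None if not found.

    The arguments may be bare names (looked up on the system path) or full paths.
    """
    return {"blastn": shutil.which(blastn), "blastp": shutil.which(blastp)}


if __name__ == "__main__":
    for name, path in locate_blast().items():
        if path:
            print(f"{name}: {path}")
        else:
            print(f"{name}: not found; pass its full path to HgtSIM with -{name}")
```

If either entry is reported as not found, install BLAST+ or supply the full path with "-blastn" / "-blastp" as described above.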

Help information

    HgtSIM -h

      -t          sequences of genes to be transferred (multi-fasta format)
      -i          mutation level
      -d          distribution of transfers to the recipient genomes
      -f          folder holds recipient genomes
      -r          ratio of mutation types
      -x          file extension of recipient genomes
      -lf         left end flanking sequences
      -rf         right end flanking sequences
      -mixed      randomly assign mutation levels between specified values, parameter format: min-max
      -keep_cds   insert transfers only to non-coding regions, need the annotation files (in gbk format) of recipient genomes
      -a          folder holds the annotation files (in gbk format) of recipient genomes
      -l          minimum length of intergenic region to be considered for insertion
      -blastn     path to blastn executable, default: blastn
      -blastp     path to blastp executable, default: blastp

Input files and arguments

  1. Sequences of genes to be transferred (in multi-fasta format).

  2. A folder that holds all recipient genomes, one file per genome.

  3. The mutation level of genes to be transferred. This can be specified either as a fixed value ('-i') or as a range ('-mixed', the 'mixed' mode). In 'mixed' mode, HgtSIM randomly selects a value between the user-specified minimum and maximum mutation levels for each gene transfer.

     # with fixed mutation level (e.g. 10%).
     HgtSIM -t genes.fasta -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna -i 10
    
     # with 'mixed' mode (e.g. 5-25%)
     HgtSIM -t genes.fasta -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna -mixed 5-25
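
To illustrate what the 'mixed' mode means, here is a small sketch that parses a 'min-max' value such as '5-25' and draws one mutation level per gene. It is not HgtSIM's own code: the function names are hypothetical, and drawing from a uniform distribution over real numbers is an assumption (the source only states that a value is selected randomly between the minimum and maximum).

```python
import random


def parse_mixed_range(value):
    """Parse a 'min-max' string such as '5-25' into a (min, max) tuple of floats."""
    low_text, high_text = value.split("-")
    low, high = float(low_text), float(high_text)
    if not 0 <= low <= high <= 100:
        raise ValueError(f"invalid mutation range: {value!r}")
    return low, high


def assign_mutation_levels(gene_ids, value, seed=None):
    """Draw one mutation level (in percent) per gene from the given range.

    Assumption: levels are drawn uniformly; HgtSIM may sample differently.
    """
    rng = random.Random(seed)
    low, high = parse_mixed_range(value)
    return {gene_id: rng.uniform(low, high) for gene_id in gene_ids}


if __name__ == "__main__":
    levels = assign_mutation_levels(["AAM_03063", "AKV_01007"], "5-25", seed=1)
    for gene_id, level in levels.items():
        print(f"{gene_id}: {level:.1f}%")
```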
    
  4. The ratio of mutation categories (separated by dashes). The default setting is '1-0-1-1'. Please refer to the publication (http://dx.doi.org/10.7717/peerj.4015) or the figure below for how to set it.

    [Figure: ratio_selection]

  5. The distribution of transfers to the recipient genomes. The first column gives the recipient genome (without file extension), followed by a comma-separated list of the genes to be transferred into it.

     BAD,AAM_03063,AKV_01007,AMAC_01196,AMAU_02632,AMS_01785
     BDS,AAM_00175,AKV_00943,AMAC_00215,AMAU_02085,AMS_01465
     BGC,AAM_00176,AKV_01272,AMAC_01576,AMAU_00617,AMS_02653
     BHS,AAM_00195,AKV_01273,AMAC_01674,AMAU_05963,AMS_03303
     BNM,AAM_00209,AKV_00282,AMAC_02914,AMAU_02414,AMS_03378
     BRT,AAM_00308,AKV_02353,AMAC_03303,AMAU_00830,AMS_01655
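
The distribution file format described above can be read with a few lines of Python. This is only a sketch for checking your own input before a run; `parse_distribution` is a hypothetical helper, not a function provided by HgtSIM.

```python
def parse_distribution(lines):
    """Map each recipient genome ID to the list of gene IDs assigned to it.

    Each line is expected to look like 'GENOME,gene_1,gene_2,...'.
    Blank lines are skipped.
    """
    transfers = {}
    for line in lines:
        fields = [field.strip() for field in line.strip().split(",")]
        if not fields[0]:
            continue  # blank line or missing genome ID
        genome_id = fields[0]
        transfers[genome_id] = [gene_id for gene_id in fields[1:] if gene_id]
    return transfers


if __name__ == "__main__":
    example = [
        "BAD,AAM_03063,AKV_01007,AMAC_01196,AMAU_02632,AMS_01785",
        "BDS,AAM_00175,AKV_00943,AMAC_00215,AMAU_02085,AMS_01465",
    ]
    for genome_id, gene_ids in parse_distribution(example).items():
        print(genome_id, len(gene_ids), "transfers")
```

To read a real file, pass an open file handle, for example `parse_distribution(open("distribution.txt"))`.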
    
  6. The flanking sequences to be added to the ends of gene transfers. These can be specified with '-lf' and '-rf'; the default value is None.

     # introduce gene transfers without adding flanking sequences
     HgtSIM -t genes.fasta -i 10 -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna
    
     # or, add same pair of flanking sequences (e.g. 'TAGATGAGTGATTAGTTAGTTA') to all gene transfers
     HgtSIM -t genes.fasta -i 10 -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna -lf TAGATGAGTGATTAGTTAGTTA -rf TAGATGAGTGATTAGTTAGTTA
    
     # or, add flanking sequences dynamically to the two ends of each gene transfer
     HgtSIM -t genes.fasta -i 10 -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna -lf lf.fasta -rf rf.fasta
    

    If you want to add flanking sequences dynamically to the gene transfers, you can specify the left and right side sequences in two multi-fasta files. The IDs of the flanking sequences need to be exactly the same as those of their corresponding gene transfers.

    As an illustration, suppose you have four transfers (transfer_A, transfer_B, transfer_C and transfer_D) and have provided the following two files:

    lf.fasta

     >transfer_A
     AAAAAAAAAA
     >transfer_B
     TTT
    

    rf.fasta

     >transfer_A
     GGGGGGG
     >transfer_C
     CCCCC
    

    HgtSIM will then:

    1. add 'AAAAAAAAAA' to the left and 'GGGGGGG' to the right end of transfer_A;
    2. add 'TTT' to the left and nothing to the right end of transfer_B;
    3. add nothing to the left and 'CCCCC' to the right end of transfer_C;
    4. add nothing to either end of transfer_D.
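
The matching rule in the example above (flanks are looked up by transfer ID, and a missing ID means nothing is added on that side) can be sketched as follows. The helper names `read_fasta` and `flanks_for` are hypothetical and this is not HgtSIM's implementation; it only reproduces the four outcomes listed above.

```python
def read_fasta(text):
    """Parse multi-fasta text into a dict of {sequence ID: sequence}."""
    records, current_id = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def flanks_for(transfer_ids, left, right):
    """Return {transfer ID: (left flank, right flank)}; missing IDs get ''."""
    return {tid: (left.get(tid, ""), right.get(tid, "")) for tid in transfer_ids}


lf = read_fasta(">transfer_A\nAAAAAAAAAA\n>transfer_B\nTTT\n")
rf = read_fasta(">transfer_A\nGGGGGGG\n>transfer_C\nCCCCC\n")
flanks = flanks_for(["transfer_A", "transfer_B", "transfer_C", "transfer_D"], lf, rf)

for transfer_id, (left_flank, right_flank) in flanks.items():
    print(transfer_id, repr(left_flank), repr(right_flank))
```
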
  7. Transfers can be restricted to intergenic regions by specifying the 'keep_cds' option. The annotation files (in GenBank format) of the recipient genomes are needed to enable this option.
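
To make the idea of "intergenic regions of a minimum length" (options '-keep_cds' and '-l') concrete, here is a sketch that finds the gaps between annotated coding regions. It is not HgtSIM's own algorithm. The function name and the example coordinates are hypothetical, coordinates are assumed to be 0-based and half-open on a single linear sequence, and treating the sequence ends as intergenic is an assumption. In practice the CDS coordinates would come from the GenBank annotation files.

```python
def intergenic_regions(cds_intervals, genome_length, min_length):
    """Return (start, end) gaps between CDS intervals that are >= min_length.

    cds_intervals: iterable of (start, end) pairs, 0-based, end exclusive.
    Overlapping CDS intervals are handled by tracking the furthest end seen.
    """
    regions, cursor = [], 0
    for start, end in sorted(cds_intervals):
        if start - cursor >= min_length:
            regions.append((cursor, start))
        cursor = max(cursor, end)
    if genome_length - cursor >= min_length:
        regions.append((cursor, genome_length))
    return regions


if __name__ == "__main__":
    # Hypothetical CDS coordinates on a 2000 bp sequence.
    cds = [(100, 400), (350, 700), (1200, 1500)]
    print(intergenic_regions(cds, genome_length=2000, min_length=200))
```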

Output files

  1. The produced genomes with transferred genes, placed in the folder 'Genomes_with_transfers'.
  2. The amino acid sequences of input genes to be transferred.
  3. The nucleotide and amino acid sequences of mutated input genes.
  4. The mutation report file, which includes two parts:
    1. at the top, the nucleotide (nc) and amino acid (aa) identities between the input and mutated sequences for each transfer;
    2. followed by a summary of the changed nucleotide bases for each transfer.
  5. The insertion report file.