MetaCHIP

Horizontal gene transfer (HGT) identification pipeline

Primary language: Python. License: GNU Affero General Public License v3.0 (AGPL-3.0)


Publication:

  • Song WZ, Wemheuer B, Zhang S, Steensen K, Thomas T* (2019) MetaCHIP: community-level horizontal gene transfer identification through the combination of best-match and phylogenetic approaches. Microbiome. 7:36 https://doi.org/10.1186/s40168-019-0649-y
  • Contact: Weizhi Song (songwz03@gmail.com), Torsten Thomas (t.thomas@unsw.edu.au)
  • Centre for Marine Science and Innovation (CMSI), University of New South Wales, Sydney, Australia

⚠️ Before you start

  • ⚠️ MetaCHIP was designed to predict HGT among prokaryotes. Please do NOT include eukaryotic genomes in your genome folder.
  • ⚠️ To get reliable HGT predictions, input genomes need to be at least 40% complete.
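
The completeness threshold above can be applied before running MetaCHIP. The following is a minimal sketch, assuming you already have a completeness estimate (in percent) for each genome, for example from a tool such as CheckM. The function name and the example values are hypothetical and not part of MetaCHIP.

```python
MIN_COMPLETENESS = 40.0  # percent, the threshold stated above


def split_by_completeness(completeness, threshold=MIN_COMPLETENESS):
    """Split genome IDs into (kept, dropped) lists, each sorted by ID.

    `completeness` maps a genome ID to its estimated completeness in percent.
    Genomes at or above `threshold` are kept.
    """
    kept = sorted(g for g, c in completeness.items() if c >= threshold)
    dropped = sorted(g for g, c in completeness.items() if c < threshold)
    return kept, dropped


# Hypothetical completeness estimates, for illustration only.
example = {"bin_1": 92.5, "bin_2": 38.0, "bin_3": 40.0}
kept, dropped = split_by_completeness(example)
print("keep:", kept)
print("drop:", dropped)
```

Genomes in the dropped list would then be moved out of the genome folder before the PI module is run.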

Change Log:

  • v1.10.10 (2022-07-24) - ⚠️ The plot of flanking regions of identified HGTs is now disabled by default. Provide "-pfr" to the BP module to get the plot
  • v1.10.6 (2021-09-19) - You can now designate a place/directory to hold the output files with "-o"
  • v1.10.4 (2021-06-05) - Removed the limitation that contig length needs to be shorter than 22bp
  • v1.9.0 (2020-06-01) - Add supplementary module: rename_seqs
  • v1.6.0 (2019-07-23) - Support customized grouping of query genomes
  • v1.5.2 (2019-07-23) - Pfam hmm profiles updated to v32.0, TIGRFAMS db version is v14.0
  • v1.5.0 (2019-07-19) - Add supplementary module: update_hmms
  • v1.4.0 (2019-07-15) - Add supplementary module: filter_HGT
  • v1.3.0 (2019-07-12) - Add supplementary module: CMLP
  • v1.2.0 (2019-04-29) - Support multiple-level detections
  • v1.1.0 (2019-01-19) - Support multiprocessing
  • v1.0.0 (2018-12-29) - Initial release

Dependencies:

How to install:

  1. MetaCHIP can be installed via pip3:

     # First-time installation
     pip3 install MetaCHIP
     
     # for upgrade
     pip3 install --upgrade MetaCHIP
    
  2. You can either add MetaCHIP's third-party dependencies to your system path or specify the full paths to their executables in MetaCHIP_config.py, which can be found in Python's lib/site-packages/MetaCHIP folder.

  3. ⚠️ If you clone the repository directly off GitHub you might end up with a version that is still under development.
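
To check step 2, the sketch below reports which executables are missing from the system path and where the installed MetaCHIP_config.py is expected to be. The executable names in `EXECUTABLES` are an assumption made for illustration; replace them with the dependencies required by your MetaCHIP version.

```python
import importlib.util
import os
import shutil

# Hypothetical list: edit it to match the dependencies of your MetaCHIP version.
EXECUTABLES = ["blastn", "mafft", "hmmsearch", "prodigal"]


def find_missing(executables):
    """Return the executables that cannot be found on the system PATH."""
    return [exe for exe in executables if shutil.which(exe) is None]


def expected_config_path(package="MetaCHIP"):
    """Return the expected path of MetaCHIP_config.py, or None if not installed."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return os.path.join(os.path.dirname(spec.origin), "MetaCHIP_config.py")


missing = find_missing(EXECUTABLES)
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("All listed executables were found on PATH.")
print("Config file expected at:", expected_config_path())
```

Any executable reported as missing needs either to be added to the system path or to have its full path set in MetaCHIP_config.py.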

How to run:

  1. The input files for MetaCHIP include a folder that holds the sequence files of all query genomes, as well as a text file that provides either the taxonomic classification or a customized grouping of your input genomes. The file extension of your input genomes (e.g. fa, fasta) should NOT be included in the taxonomy or grouping file.

  2. GTDB-Tk is recommended for taxonomic classification of input genomes. Only the first two columns ('user_genome' and 'classification') in GTDB-Tk's output file are needed.

  3. The argument '-r' in the PI and BP modules accepts any combination of d (domain), p (phylum), c (class), o (order), f (family), g (genus) and s (species).

  4. Some examples:

    • Detect HGT among classes

      MetaCHIP PI -p NorthSea -r c -t 6 -i bin_folder -x fasta -taxon GTDB_classifications.tsv
      MetaCHIP BP -p NorthSea -r c -t 6
      
    • Detect HGT among phyla, classes, orders, families and genera

      MetaCHIP PI -p NorthSea -r pcofg -t 12 -o path/to/output_dir -i MAG_folder -x fasta -taxon GTDB_classifications.tsv
      MetaCHIP BP -p NorthSea -r pcofg -t 12 -o path/to/output_dir
      
    • Detect HGT among customized groups

      MetaCHIP PI -p NorthSea -g customized_grouping.txt -t 6 -i NS_37bins -x fasta
      MetaCHIP BP -p NorthSea -g customized_grouping.txt -t 6
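
The input preparation described in steps 1 to 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not part of MetaCHIP: all function names are hypothetical, the GTDB-Tk output is assumed to be a tab-separated file with a header row, and keeping that header row in the trimmed file is also an assumption.

```python
import csv

VALID_RANKS = "dpcofgs"  # domain, phylum, class, order, family, genus, species


def validate_ranks(ranks):
    """Return `ranks` unchanged if it only uses the letters accepted by '-r'."""
    if not ranks:
        raise ValueError("no rank given")
    bad = sorted(set(ranks) - set(VALID_RANKS))
    if bad:
        raise ValueError("unknown rank letter(s): " + "".join(bad))
    return ranks


def genome_id(filename, extension):
    """Strip the file extension so the ID matches the taxonomy or grouping file."""
    suffix = "." + extension
    return filename[: -len(suffix)] if filename.endswith(suffix) else filename


def extract_taxonomy(rows):
    """Keep only the 'user_genome' and 'classification' fields of each row."""
    return [(row["user_genome"], row["classification"]) for row in rows]


def trim_gtdbtk_output(in_path, out_path):
    """Write a two-column copy of a GTDB-Tk output file; return the row count."""
    with open(in_path, newline="") as handle:
        pairs = extract_taxonomy(csv.DictReader(handle, delimiter="\t"))
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["user_genome", "classification"])  # header kept (assumption)
        writer.writerows(pairs)
    return len(pairs)
```

For example, `validate_ranks("pcofg")` passes, while `validate_ranks("x")` raises a `ValueError`, and `genome_id("bin_1.fasta", "fasta")` gives the ID to use in the taxonomy file.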
      

Output files:

  1. A tab-delimited text file containing all identified HGTs. Filename format: [prefix]_[taxon_ranks]_detected_HGTs.txt

    | Column | Description |
    |---|---|
    | Gene_1 | The 1st gene involved in an HGT event |
    | Gene_2 | The 2nd gene involved in an HGT event |
    | Identity | Identity between Gene_1 and Gene_2 |
    | Occurence(taxon_ranks) | Only for multiple-level detections. If you performed HGT detection at phylum, class and order levels, a value of "011" means the current HGT was identified at the class and order levels, but not at the phylum level. |
    | End_match | End match or not (see examples below) |
    | Full_length_match | Full length match or not (see examples below) |
    | Direction | The direction of gene flow. The number in parentheses is the percentage of times this direction was observed, given when the HGT was detected at multiple ranks and Ranger-DTL provided different directions. |

  2. Nucleotide and amino acid sequences of identified donor and recipient genes.

  3. Flanking regions of identified HGTs. Genes encoded on the forward strand are displayed in light blue, and genes encoded on the reverse strand are displayed in light green. The names of genes predicted to be HGTs are highlighted in a large blue font, with pairwise identity given in parentheses. Contig names are provided at the bottom left of the sequence tracks, and the numbers following each contig name refer to the distances between the gene subject to HGT and either the left or right end of the contig. Red bars show similarities of the matched regions between the contigs based on BLASTN results.

  4. Gene flow between groups. Bands connect donors and recipients, with the width of the band correlating to the number of HGTs and the colour corresponding to the donors.

  5. Examples of contig end matches.

  6. Examples of full-length contig matches.
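
The HGT table from item 1 can be read with Python's standard library. The sketch below builds the filename from the format given above, reads the tab-delimited table, and decodes an occurrence string such as "011". The function names are hypothetical, the table is assumed to start with a header row, and the example row is made up for illustration.

```python
import csv
import io


def hgt_table_name(prefix, ranks):
    """Build the filename [prefix]_[taxon_ranks]_detected_HGTs.txt."""
    return f"{prefix}_{ranks}_detected_HGTs.txt"


def decode_occurrence(code, ranks):
    """Map each rank letter to True if the HGT was detected at that rank.

    `code` holds one '0' or '1' per rank, in the same order as `ranks`.
    """
    if len(code) != len(ranks):
        raise ValueError("code and ranks must have the same length")
    if set(code) - {"0", "1"}:
        raise ValueError("code may only contain 0 and 1")
    return {rank: flag == "1" for rank, flag in zip(ranks, code)}


def read_hgt_table(handle):
    """Read a tab-delimited HGT table (header row assumed) into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up single-row table, for illustration only.
demo = io.StringIO("Gene_1\tGene_2\tIdentity\ngene_A\tgene_B\t98.5\n")
for row in read_hgt_table(demo):
    print(row["Gene_1"], row["Gene_2"], row["Identity"])

print(hgt_table_name("NorthSea", "pcofg"))
print(decode_occurrence("011", "pco"))
```

With detection at phylum, class and order levels, `decode_occurrence("011", "pco")` marks class and order as detected and phylum as not detected, matching the description of the occurrence column. To read a real result file, pass `open(path, newline="")` to `read_hgt_table` in place of the in-memory example.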