/Antarctica_scripts

Scripts used as part of metagenome/16S data analysis

Primary LanguagePython

Antarctica_scripts

Scripts used as part of metagenome/16S data analysis. These are made available for transparency and for the benefit of others who may find themselves performing similar analyses.

Contained within this repository are scripts used for the analysis of a project currently titled "A Metagenomic Profile of Antarctica's Microbiome Reveals Effects Mediated by Human Sewage Outflow into the Southern Ocean" that is still to be published. Scripts are partitioned into 8 folders to enable easier retrieval of scripts related to each part of the project.

Alpha diversity

This folder contains scripts used for processing k-SLAM data to enable alpha diversity calculations. In order of use:

    1. kslam_xml_to_alphaCounts.py
    • This script will read in a k-SLAM output .xml file and produce a table of raw counts associated with each taxon. Run this script on each individual .xml for the samples.
    1. kslam_alphaCounts_to_simpson.py
    • This script will read in multiple tables produced by the first script and create a combined table that can be read by VEGAN to perform alpha diversity calculations
    1. kslam_alphadiv.R
    • This R script will perform the alpha diversity calculation.

Bubbleplots

This folder contains R scripts used to generate the bubbleplots that form part of this project. A file exists for each of the three metagenome taxonomy prediction programs (kslam, MetaPhlAn2, and QIIME). The input is a NMDS formatted file generated by scripts in the 'NMDS' folder.

edgerR-GOseq

This folder contains scripts used for generating a read count table used by edgeR, and scripts used subsequent to this to perform edgeR and GOseq analysis. In order of use:

    1. cluster_read_counts.py
    • This script will read in a cdhit .clstr file and htseq output file and clusters reads from redundant mappings into a single count corresponding to the representative sequence from each cluster. Perform this for each sample. This serves as a precursor to:
    1. clustered_read_counts_combine.py
    • This script reads in the files generated by the first script and formats a single table that can be used by edgeR to find differentially abundant genes.
    1. edgeR_logfc.R
    • This script will perform each pairwise comparison and generate output files that can be used for later GOseq.
    1. antarctica_goseq.R
    • This script will automatically find enriched/depleted annotation terms based upon input mapping files (i.e., GO, Pfam, FOAM, and Resfams) for each pairwise comparison.
    1. antarctica_goseqfdr_corr.R
    • This script will read in the outputs generated by the previous R script, enforce a FDR cut-off, and generate outputs that can be used by the signature terms script.
    1. goseq_signature_terms.py
    • This script reads in a directory containing tab-delimited text files of all pairwise comparisons of treatments generated by the previous script. A user specified FDR value can be used as a cut-off.

k-SLAM data formatting

This folder contains scripts used for formatting k-SLAM data which acts as a precursor to some other analyses. In order of use:

    1. kslam_xml_taxIdMap.py
    • This script will read in the kslam xml output file and create a tab-delimited text file that can be used for further downstream ktImportText Krona plot generation or NMDS plot generation.
    1. kslam_taxIdMap_to_ktImportText.py
    • This script reads in the output(s) of kslam_xml_taxIdMap.py and formats file(s) that can be called by ktImportText to produce Krona plot(s).

NMDS

This folder contains scripts used for generating NMDS tables and for perform NMDS computation and plotting. For most of these scripts, you will need a full NCBI taxonomy file. I used the script "generate_taxonomy_taxid.py" provided as part of the MetaPalette repository (https://github.com/dkoslicki/MetaPalette). In the case that this repository is closed in the future, I have also added the script to this folder. In order of use:

k-SLAM

    1. kslam_taxIdMap_to_nmds.py
    • This script reads in the output(s) of kslam_xml_taxIdMap.py and formats a file that can be read in to VEGAN in R.
    1. kslam_nmds.R
    • Creates NMDS plots based upon NMDS formatted table file from above script. The outputs are intended to be created as PDFs to be edited in Illustrator manually to organise the labels optimally.

METAPHLAN2

    1. metaphlan2_to_nmds.py
    • This script reads in the output(s) of MetaPhlAn2 and formats a file that can be read in to VEGAN in R.
    • Use the kslam_nmds.R script for the NMDS plotting.

QIIME

    1. qiime_table_to_nmds.py
    • This script will read in a QIIME table file (originating from the summarize_taxa_through_plots.py script or equivalent) and create a table that is amenable to NMDS.
    1. qiime_nmds.R
    • Creates NMDS plots based upon NMDS formatted table file from above script. The outputs are intended to be created as PDFs to be edited in Illustrator manually to organise the labels optimally.

Also provided is qiime_table_to_nmds_keepPrev.py. This was helpful for manual inspection of results to see what the higher taxonomy levels were for each prediction.

QIIME data formatting

This folder contains a single script.

    1. qiime_to_ktImportText.py
    • This script will read in a QIIME table file (originating from the summarize_taxa_through_plots.py script or equivalent) and create text file(s) that are amenable to ktImportText to produce Krona plot(s).

Statistics and helper scripts

This folder includes scripts that were used for performing basic statistical tests as well as for analysing the overall results.

    1. anova_tukeys_stats.R
    • This script will compute ANOVA and Tukey's HSD for input files.
    1. nmds_convenience.py
    • This script was made quickly just to provide assistance with parsing a NMDS file. It will pull apart a NMDS file and put sorted columns of taxa abundance into your clipboard which can be pasted into Excel/an equivalent program. Data in this format can be used by the below script.
    1. top_taxa_kslam_multicol.py
    • This script was made quickly just to provide assistance with parsing columns generated by the above script. It requires that you manually alter variables into the script. Its purpose is to get a summed abundance count for predictions that match a specified taxonomic rank (e.g, Bacteria) and compute the proportion that this taxa contributes to the overall microbial abundance.

Swiss-Prot

This folder contains a single script.

    1. blast_hit_counter.py
    • This script will count how many sequences obtain a hit more significant than a specified E-value. It assumes the file is ordered by query ID, but not necessarily by E-value.

Notes

While the main reason these scripts are provided is for transparency, if someone who reads the paper this repository is associated with wants to use these scripts for their own analyses, I am okay with fielding any questions especially if you have problems using these. Either open an issue, or email me at zkstewart1@gmail.com.