/pannenkoek

A tool for generating LEfSe ready data from an OTU table and mapping file. Sample metadata is added as a new row based on user inputs.

Primary LanguagePythonMIT LicenseMIT

Pannenkoek

v0.1.8

Big thank to Huttenhower/Segata group for developing LEfSe!

Nicola Segata, Jacques Izard, Levi Walron, Dirk Gevers, Larisa Miropolsky, Wendy Garrett, Curtis Huttenhower. "Metagenomic Biomarker Discovery and Explanation" Genome Biology, 2011 Jun 24;12(6):R60

Easily go from BIOM-OTU table to LEfSe results in one step.

A tool to create LEfSe-ready tables or results from your main OTU table in QIIME. LEfSe is a great tool for microbial diversity analysis, but there are many repetitive steps if you want to analyze multiple time points between two groups. Pannenkoek was made to simplify that process. You can input your main OTU biom file, Treatment group column name and Timepoint column name and it will do the hard work summarizing the table, removing the unused metadata fields, and create multiple tables based on time.

Get Started

You must either start macqiime in terminal or have qiime installed locally with 'pip install qiime'


pip install https://github.com/twbattaglia/pannenkoek/zipball/master

Dependencies

R has to be initialized from the command line first to install packages. Run the command below to install the R dependencies necessary to run LefSe. If you only want to summarize the tables and not run LefSe, you do not need to run the command below.



# 1) Initialize R in terminal by running the command 'R'
$ R

# 2) Run the command below to install the necessary packages. You may get an error stating, 'packages ‘splines’, ‘stats4’ are not available (as a binary package for R version 3.1.3)' but this can be ignored.
$ install.packages(c('splines','stats4','survival','mvtnorm','modeltools','coin','MASS'), repos = "http://cran.us.r-project.org")


  

Start a new terminal session to exit out of R

Running the Command

Enter the macqiime environment before running pannenkoek.py!!

# Example Command
pannenkoek.py -i INPUT_BIOM.biom -o OUTPUT_FOLDER/ -m MAPPING_FILE.txt -level 6
-class colTreatment -split colDay -compare CON,Group1 -a 0.01


usage: pannenkoek.py [-h] [-v] [-i INPUT] [-o OUTPUT] [-m MAPPING]
                     [-s SUBJECTID] [-c CLASSID] [-sc SUBCLASSID] [-l LEVEL]
                     [-spl SPLITBY] [-comp COMPARE] [-t] [-nl]
                     [-a ANOVA_CUTOFF] [-e LDA_CUTOFF] [-strict STRICTNESS]

Summarize taxa and subset by time.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i INPUT, -input INPUT
                        Location of the Biom/OTU file to subset and use for
                        LEfSe analysis.(Must be .biom format)
  -o OUTPUT, -output OUTPUT
                        Directory to place all the files.
  -m MAPPING, -map MAPPING
                        Location of the mapping file associated with Biom
                        File.
  -s SUBJECTID, -subject SUBJECTID
                        By Default = #SampleID. Only change if your SampleID
                        header differs.
  -c CLASSID, -class CLASSID
                        This will be the column within your mapping file that
                        has the data which differentiates between groups. This
                        is typically the Treatment column of your data. When
                        formatting the data for LEfSe analysis, this will be
                        the column used as a class. When choosing particular
                        comparisons with "-compare" this classID columns is
                        where it looks for the variables. Choose wisely.
  -sc SUBCLASSID, -subclass SUBCLASSID
                        This will be used in similar way to the classID, but
                        is typically reserved for another level of
                        comparisons. See the use of subclasses with LEfSe
                        data. "http://genomebiology.com/2011/12/6/r60"
  -l LEVEL, -level LEVEL
                        level to use for summarize_taxa.py.
  -spl SPLITBY, -split SPLITBY
                        This is the column name of the variable to use for
                        splitting the table. The table is split over a
                        variable to reduce the amount of work to create a new
                        formatted lefse table for each timepoint of an
                        experiment. This is one of the core features of
                        pannenkoek. Choose the column which will create the
                        analysis you want.
  -comp COMPARE, -compare COMPARE
                        This option lets you select a comparison between
                        variables found in your provded classID column. For
                        example if you had 3 different experimental groups and
                        you wanted to only see the comparisons between Control
                        and Group2 over time you can input, "Control,Group2"
                        in this field and it will subset and only include the
                        sampleIDs that are from either group. This option is
                        very useful for experiments with multiple groups.
  -t, --notranspose     If you do not want the data tranposed, add this option
                        to the command. Default = False
  -nl, --nolefse        If you only want subsetted tables and no formatted or
                        lefse-run data, add this option to the command.
                        Default = False
  -a ANOVA_CUTOFF, -anovap ANOVA_CUTOFF
                        Change the cutoff for significance between OTUs.
                        Default is 0.05.
  -e LDA_CUTOFF, -effect_size LDA_CUTOFF
                        Change the cutoff for effect size. Default is 2.
  -strict STRICTNESS, -strictness STRICTNESS
                        Change the strictness of the comparisons. Can be
                        changed to less strict(1). Default is 0 (more-strict).