16S_Amplicon_Pipeline

Scripts for 16S amplicon analysis

######The 16S analysis was completed following the MOTHUR 454 SOP and [Costello stool analysis protocol] (http://www.mothur.org/wiki/Costello_stool_analysis) with a few modifications.

#####1. Parse Amplicons

Split ITS, LSU and 16S reads using Matt's dbcAmplicon python script: https://github.com/msettles/dbcAmplicons.

dbcAmplicons preprocess -b barcodeLookupTable.txt -p 16S_ITS_LSU_PrimerTable.txt -s ZanneAmplicons_MattPipeline.txt -u -U -v -1 Undetermined_S0_L001_R1_001.fastq -2 MariyaAmy002_S1_L001_R1_001.fastq -3 MariyaAmy002_S1_L001_R2_001.fastq -4 Undetermined_S0_L001_R2_001.fastq &

The -s flag gives the name of a sample sheet file. The sample sheet used for the Tyson year 3 harvest is included in this repository (Tyson_samplesheet.txt).

#####2. Split by Sample Split sequences into individual files by sample and generate a file mapping file name to sample name for mothur analysis.

python  parsefastq.py -s ZanneAmplicons_MattPipeline.txt -r1 R1.fastq -r2 R2.fastq -o output/directory -p PrimerName

This script will generate a file called "stability.files" this file contains a mapping of file names to their corresponding samples. The stability file will be the input file to mothur. Navigate to the specified ouput directory and start mothur ...

######The following commands can be run in batch mode from the 16S_mothur_script.sh rather than interactivly in the mothur shell as described below.

#####3. Assemble Paried End Reads

mothur > make.contigs(file=stability.files, processors=10)

Output File Names: 
stability.trim.contigs.fasta
stability.contigs.report
stability.scrap.contigs.fasta
stability.contigs.groups

#####4. Sequence Processing and Cleanup

Look at the quality of LSU sequences.

mothur > summary.seqs(fasta=stability.trim.contigs.fasta)

Using 10 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       277     277     0       3       1
2.5%-tile:      1       431     431     0       4       152765
25%-tile:       1       433     433     0       4       1527650
Median:         1       447     447     0       5       3055299
75%-tile:       1       472     472     1       5       4582948
97.5%-tile:     1       498     498     16      7       5957833
Maximum:        1       563     562     73      280     6110597
Mean:   1       453.881 453.881 1.42854 4.85083
# of Seqs:      6110597

Output File Names: 
stability.trim.contigs.summary

Clean up sequences by removing all sequences with more than 1 ambiguity or homopolymers longer than 8 bp.

Note: mothur has many parameters for quality screening sequence using the trim.seqs() command. Quality screening should be based on the results of the summary.seqs() command for each particular set of sequences.

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=1, minlength=300, maxhomop=8)

Output File Names: 
stability.trim.contigs.good.fasta
stability.trim.contigs.bad.accnos
stability.contigs.good.groups

mothur > summary.seqs(fasta=stability.trim.contigs.good.fasta)

Using 10 processors.
[WARNING]: This command can take a namefile and you did not provide one. The current namefile is stability.trim.contigs.good.names which seems to match stability.trim.contigs.good.fasta.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       300     300     0       3       1
2.5%-tile:      1       431     431     0       4       128916
25%-tile:       1       433     433     0       4       1289160
Median:         1       446     446     0       5       2578319
75%-tile:       1       468     468     0       5       3867478
97.5%-tile:     1       495     495     1       6       5027721
Maximum:        1       561     560     1       8       5156636
Mean:   1       452.306 452.306 0.119068        4.78745
# of Seqs:      5156636

Output File Names: 
stability.trim.contigs.good.summary

Collapse all identical sequences.

mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta)

  Output File Names: 
  stability.trim.contigs.good.names
  stability.trim.contigs.good.unique.fasta
  
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.fasta)

Using 10 processors.
[WARNING]: This command can take a namefile and you did not provide one. The current namefile is stability.trim.contigs.good.names which seems to match stability.trim.contigs.good.unique.fasta.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       300     300     0       3       1
2.5%-tile:      1       431     431     0       4       95710
25%-tile:       1       433     433     0       4       957099
Median:         1       447     447     0       5       1914198
75%-tile:       1       472     472     0       5       2871297
97.5%-tile:     1       498     498     1       6       3732686
Maximum:        1       561     560     1       8       3828395
Mean:   1       453.966 453.966 0.155886        4.83139
# of Seqs:      3828395

Output File Names: 
stability.trim.contigs.good.unique.summary

#####5. Alignment

Align 16S sequences to referance 16S silva alignment.

mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=../silva.bacteria/silva.bacteria.fasta, processors=10)

Some of you sequences generated alignments that eliminated too many bases, a list is provided in stability.trim.contigs.good.unique.flip.accnos. If you set the flip parameter to true mothur will try aligning the
 reverse compliment as well.
It took 10160 secs to align 3828395 sequences.


Output File Names: 
stability.trim.contigs.good.unique.align
stability.trim.contigs.good.unique.align.report
stability.trim.contigs.good.unique.flip.accnos

mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align)

Using 10 processors.
[WARNING]: This command can take a namefile and you did not provide one. The current namefile is stability.trim.contigs.good.names which seems to match stability.trim.contigs.good.unique.align.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0       0       0       0       1       1
2.5%-tile:      1044    10303   431     0       4       95710
25%-tile:       1044    13125   433     0       4       957099
Median:         1044    13125   447     0       5       1914198
75%-tile:       1044    13125   470     0       5       2871297
97.5%-tile:     1044    13859   497     1       6       3732686
Maximum:        43116   43116   560     1       8       3828395
Mean:   1072.42 13061.1 452.867 0.15577 4.8272
# of Seqs:      3828395

Output File Names: 
stability.trim.contigs.good.unique.summary

#####6. Check for Chimeras

Use chimera slayer and the referance 16S alignment to check for chimeric 16S sequences.

mothur > chimera.slayer(fasta=Zanne_LSU_aligned.trim.unique.align, template=James_2006_FungiOnly_LSU_LR0R_LR3.fasta, processors=10, blastlocation=/usr/bin/)

##To Update ... Remove chimeric sequences from analysis files.

mothur > remove.seqs(accnos=Zanne_LSU_aligned.trim.unique.slayer.accnos, name=Zanne-LSU_aligned.trim.names)
    
      Output File Names:
      Zanne-LSU_aligned.trim.pick.names
      
mothur > remove.seqs(accnos=Zanne_LSU_aligned.trim.unique.slayer.accnos, group=LSU.groups)
      
      Output File Names: 
      LSU.pick.groups
      
mothur > remove.seqs(accnos=Zanne_LSU_aligned.trim.unique.slayer.accnos, fasta=Zanne_LSU_aligned.trim.unique.align)
      
      Output File Names: 
      Zanne_LSU_aligned.trim.unique.pick.align

#####7. Clean Alignment

Check quality of alignment with chimeras removed.

mothur > summary.seqs(fasta=Zanne_LSU_aligned.trim.unique.pick.align, processors=15)
      
      Using 15 processors.
      
                      Start   End     NBases  Ambigs  Polymer NumSeqs
      Minimum:        0       0       0       0       1       1
      2.5%-tile:      1       264     195     0       4       129794
      25%-tile:       18      644     520     0       4       1297937
      Median:         18      686     543     0       5       2595874
      75%-tile:       18      954     565     1       5       3893810
      97.5%-tile:     53      954     569     1       7       5061953
      Maximum:        972     972     575     1       11      5191746
      Mean:   25.6089 756.843 517.662 0.420943        4.85864
      # of Seqs:      5191746

Remove sequences that align after base 53 or are less than 500 bp long.

Note: mothur has many parameters for quality screening alignments using the screen.seqs() command. Quality screening should be based on the results of the summary.seqs() command for each particular alignment.

mothur > screen.seqs(fasta=Zanne_LSU_aligned.trim.unique.pick.align, start=53, minlength=500)

      Output File Names: 
      Zanne_LSU_aligned.trim.unique.pick.good.align
      Zanne_LSU_aligned.trim.unique.pick.bad.accnos
      
      It took 51 secs to screen 5191746 sequences.
      
mothur > summary.seqs(fasta=Zanne_LSU_aligned.trim.unique.pick.good.align)
      
      Using 15 processors
      
                      Start   End     NBases  Ambigs  Polymer NumSeqs
      Minimum:        1       609     500     0       3       1
      2.5%-tile:      18      632     507     0       4       106087
      25%-tile:       18      654     530     0       4       1060862
      Median:         18      721     555     1       5       2121723
      75%-tile:       18      954     566     1       5       3182584
      97.5%-tile:     44      954     569     1       7       4137359
      Maximum:        53      972     575     1       11      4243445
      Mean:   20.5771 806.123 547.62  0.508047        4.92197
      # of Seqs:      4243445

Remove regions with only gaps from the alignment

mothur > filter.seqs(fasta=Zanne_LSU_aligned.trim.unique.pick.good.align)
      
      Length of filtered alignment: 771
      Number of columns removed: 201
      Length of the original alignment: 972
      Number of sequences used to construct filter: 4243445
      
      Output File Names: 
      Zanne_LSU_aligned.filter
      Zanne_LSU_aligned.trim.unique.pick.good.filter.fasta
      
mothur > summary.seqs(fasta=Zanne_LSU_aligned.trim.unique.pick.good.filter.fasta)
      
      Using 15 processors
      
                      Start   End     NBases  Ambigs  Polymer NumSeqs
      Minimum:        1       605     500     0       3       1
      2.5%-tile:      18      628     507     0       4       106087
      25%-tile:       18      650     530     0       4       1060862
      Median:         18      717     555     1       5       2121723
      75%-tile:       18      754     566     1       5       3182584
      97.5%-tile:     44      754     569     1       7       4137359
      Maximum:        53      771     575     1       11      4243445
      Mean:   20.5771 706.101 547.62  0.508047        4.92197
      # of Seqs:      4243445
     
      Output File Names: 
      Zanne_LSU_aligned.trim.unique.pick.good.filter.summary

Extract sequence names maintained in filtered alignment and subset mothur files to only those names.

mothur > list.seqs(fasta=Zanne_LSU_aligned.trim.unique.pick.good.filter.fasta)
    
      Output File Names: 
      Zanne_LSU_aligned.trim.unique.pick.good.filter.accnos
      
mothur > get.seqs(accnos=Zanne_LSU_aligned.trim.unique.pick.good.filter.accnos, name=Zanne-LSU_aligned.trim.names) 
     
      Output File Names:
      Zanne-LSU_aligned.trim.pick.names
      
mothur > get.seqs(accnos=Zanne_LSU_aligned.trim.unique.pick.good.filter.accnos, name=LSU.groups) \n",
      
      Output File Names: 
      LSU.pick.groups

#####8. Phylogenetic Analysis Reconstruct a neighbor joining tree from the LSU alignment.

Note: this command did not sucessfully run on a 30 gb ram server for a 5 million sequence alignment.

mothur > clearcut(fasta=Zanne_LSU_aligned.trim.unique.pick.good.filter.fasta, DNA=t, verbose=t)