Rarefaction analyzer is a simple program that can be used to perform rarefaction analysis. This analysis is particularly useful when you wish to know the fraction of the target population that is captured in your sequence data versus the fraction of the target population that has not been described due to not sequencing deep enough. This is a standard analysis performed for microbiome research and is typically recommended for any kind of metagenomics where counts are of interest. Input to this tool includes a FASTA formatted reference database, a SAM formatted alignment file, and a CSV formatted annotation database.
The output of this tool is four TSV text files, with the proportion of reads sampled in the first column and the number of unique genes, groups, mechanisms, or classes identified in the rarefied data. These outputs can be easilly graphed with common R packages like ggplot2 or Python libraries such as matplotlib. To graph, place the proportion of reads sampled (first column) on the x-axis and the number of unique genes, groups, mechanisms, or classes (second column) on the y-axis.
$ git clone https://github.com/cdeanj/RarefactionAnalyzer.git
$ cd RarefactionAnalyzer
$ make
$ cp rarefaction /usr/local/bin
$ ./rarefaction
$ ./rarefaction \
-ref_fp ref.fa \
-sam_fp alignments.sam \
-annot_fp annotations.csv \
-gene_fp gene.tsv \
-group_fp group.tsv \
-class_fp class.tsv \
-mech_fp mech.tsv \
-min 5 \
-max 100 \
-samples 1 \
-t 80
Option | Type | Description |
---|---|---|
ref_fp | FILE | Path to FASTA formatted reference database |
annot_fp | FILE | Path to CSV formatted annotation database |
sam_fp | FILE | Path to SAM formatted alignment file |
gene_fp | FILE | File to write gene level results to |
group_fp | FILE | File to write group level results to |
class_fp | FILE | File to write class level results to |
mech_fp | FILE | File to write mechanism level results to |
min | INT | Starting sample level |
max | INT | Ending sample level |
skip | INT | Skip pattern |
samples | INT | Number of samples to run |
t | INT | Threshold to determine gene significance |