An R package for data mining in microbial community ecology
In microbial community ecology, the development of high-throughput sequencing techniques has increased the amount and complexity of data, making community data analysis and management a challenge. Many R packages have been created for microbiome profiling analysis. However, it is still difficult to perform data mining quickly and efficiently. Therefore, we created the R package microeco.
- R6 Class to store and analyze data; fast, flexible and modularized
- Taxonomic abundance analysis
- Venn diagram
- Alpha diversity
- Beta diversity
- Differential abundance analysis
- Indicator species analysis
- Environmental data analysis
- Null model analysis
- Network analysis
- Functional analysis
If you do not already have R/RStudio installed, do as follows.
Add R to the system PATH environment variable, for example your_directory\R-4.0.0\bin\x64
Open RStudio...Tools...Global Options...Packages and select an appropriate mirror under Primary CRAN repository.
Install the microeco package directly from CRAN.
install.packages("microeco")
Or install the latest development version from GitHub.
# If devtools package is not installed, first install it
install.packages("devtools")
# then install microeco
devtools::install_github("ChiLiubio/microeco")
See the detailed package tutorial (https://chiliubio.github.io/microeco_tutorial/) and the help documentation (e.g. ?microtable). If you want to run all of the code in the tutorial, you need to install some additional packages; please see the Notes part that follows. Constructing the basic microtable object from other tools/platforms (e.g. QIIME, QIIME2, HUMAnN and phyloseq) can be easily achieved with the package file2meco (https://github.com/ChiLiubio/file2meco). The mecodev package (https://github.com/ChiLiubio/mecodev/) is designed to develop more classes for data analysis based on the microeco package.
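For data that is already held in plain R data frames, a microtable object can also be built directly. The following is a minimal sketch with made-up toy tables. It assumes the `otu_table`, `sample_table` and `tax_table` arguments of microtable$new() as described in ?microtable, and it only constructs the object when microeco is installed.

```r
# Toy inputs (made up for illustration): 3 features x 2 samples
otu_table <- data.frame(
  S1 = c(10, 0, 5),
  S2 = c(3, 7, 0),
  row.names = c("OTU1", "OTU2", "OTU3")
)
sample_table <- data.frame(
  Group = c("A", "B"),
  row.names = c("S1", "S2")
)
tax_table <- data.frame(
  Kingdom = rep("k__Bacteria", 3),
  Phylum = c("p__Proteobacteria", "p__Firmicutes", "p__Actinobacteria"),
  row.names = c("OTU1", "OTU2", "OTU3")
)

# Check that sample and feature names agree across the three tables
names_match <- identical(colnames(otu_table), rownames(sample_table)) &&
  identical(rownames(otu_table), rownames(tax_table))

# Build the object only if microeco is available in this R library
if (names_match && requireNamespace("microeco", quietly = TRUE)) {
  mt <- microeco::microtable$new(
    otu_table = otu_table,
    sample_table = sample_table,
    tax_table = tax_table
  )
  print(mt)
}
```

For real projects, the tutorial and file2meco remain the reference for importing data from other tools.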
Chi Liu, Yaoming Cui, Xiangzhen Li and Minjie Yao. 2021. microeco: an R package for data mining in microbial community ecology. FEMS Microbiology Ecology, 97(2): fiaa255. https://doi.org/10.1093/femsec/fiaa255
To keep installation and first use simple, microeco depends on only a few packages, which are installed automatically from CRAN and are essential for the data analysis. As a result, you may encounter an error when using a class or function that invokes an additional package, like this:
library(microeco)
data(dataset)
t1 <- trans_network$new(dataset = dataset, cal_cor = NA, taxa_level = "OTU", filter_thres = 0.0005)
t1$cal_network(network_method = "SpiecEasi")
Error in t1$cal_network(network_method = "SpiecEasi"): igraph package not installed ...
The reason is that network construction requires the igraph package. We do not list igraph and some other packages (e.g. SpiecEasi on GitHub) in the "Imports" field of the microeco package.
The solutions:
- Install the package when you encounter such an error. It is easy to install packages from CRAN or Bioconductor.
- Install the packages in advance. We recommend this solution if you are interested in most of the methods in the microeco package and want to repeat the analyses in the tutorial.
The following packages are published on CRAN and are not installed automatically.
Package | Where used | Description |
---|---|---|
reshape2 | microtable class | data transformation |
MASS | trans_diff$new(method = "lefse",…) | linear discriminant analysis |
GUniFrac | cal_betadiv() | UniFrac distance matrix |
ggpubr | plot_alpha() | some plotting functions |
randomForest | trans_diff$new(method = "rf",…) | random forest analysis |
ggdendro | plot_clustering() | plotting clustering dendrogram |
ggrepel | trans_rda class | reduce the text overlap in the plot |
agricolae | cal_diff(method = "anova") | multiple comparisons |
gridExtra | trans_diff class | merge plots |
picante | cal_alphadiv() | Faith’s phylogenetic alpha diversity |
pheatmap | plot_corr(pheatmap = TRUE) | correlation heatmap with clustering dendrogram |
tidytree | trans_diff class | plot the taxonomic tree |
igraph | trans_network class | network related operations |
rgexf | save_network() | save network in gexf format |
ggalluvial | plot_bar(use_alluvium = TRUE) | alluvial plot |
To install all or some of these packages, you can do the following:
# If a package is not installed, it will be installed from CRAN.
# First select the packages of interest
packages <- c("reshape2", "MASS", "GUniFrac", "ggpubr", "randomForest", "ggdendro", "ggrepel", "agricolae", "gridExtra", "picante", "pheatmap", "igraph", "rgexf", "ggalluvial")
# Now check or install
lapply(packages, function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, dependencies = TRUE)
  }
})
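The check above attaches every package with require(). If you only want to know which optional packages are missing, without attaching or installing anything, a minimal sketch like the following may help. `missing_packages` is a hypothetical helper written for this README, not a microeco function.

```r
# Hypothetical helper (not part of microeco): return the names of the
# packages that cannot be loaded, without attaching or installing them.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Example: check the packages used for network analysis
to_install <- missing_packages(c("igraph", "rgexf"))
print(to_install)

# Install only what is missing (uncomment to run; needs a CRAN mirror)
# if (length(to_install) > 0) install.packages(to_install)
```

Because requireNamespace() loads a namespace without attaching it, your search path stays unchanged after the check.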
Some other packages are also useful for certain functions. These may be R packages published on GitHub or Bioconductor, or tools written in other languages.
Plotting the cladogram from the LEfSe result requires the ggtree package from Bioconductor (https://bioconductor.org/packages/release/bioc/html/ggtree.html).
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ggtree")
The R package SpiecEasi can be used for network construction with the SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference) approach. The package can be installed from GitHub: https://github.com/zdk123/SpiecEasi
Gephi is an excellent network visualization tool and can be used to open the saved network file, i.e. network.gexf in the tutorial. You can download Gephi and learn how to use it from https://gephi.org/users/download/
For correlation-based networks with a very large number of species, the correlation algorithm in WGCNA is much faster than the 'cor' option in trans_network.
install.packages("WGCNA", dependencies = TRUE)
Tax4Fun is an R package used for the prediction of functional potential of prokaryotic communities.
- Install the Tax4Fun package
install.packages("RJSONIO")
install.packages(system.file("extdata", "biom_0.3.12.tar.gz", package="microeco"), repos = NULL, type = "source")
install.packages(system.file("extdata", "qiimer_0.9.4.tar.gz", package="microeco"), repos = NULL, type = "source")
install.packages(system.file("extdata", "Tax4Fun_0.3.1.tar.gz", package="microeco"), repos = NULL, type = "source")
- Download the SILVA123 reference data from http://tax4fun.gobics.de/, unzip SILVA123.zip, and move it to a location that you can remember.
Tax4Fun2 is another R package for the prediction of functional profiles and functional gene redundancies of prokaryotic communities. It is reported to have higher accuracy than PICRUSt and Tax4Fun. The Tax4Fun2 approach implemented in microeco is slightly different from the original package. Using the Tax4Fun2 approach requires the representative fasta file. The user does not need to install the Tax4Fun2 R package. The only requirement is to download the blast tools and the Ref99NR/Ref100NR database. Download the blast tools from "ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+" ; e.g. ncbi-blast-****-x64-win64.tar.gz for Windows. Download Ref99NR.zip from "https://cloudstor.aarnet.edu.au/plus/s/DkoZIyZpMNbrzSw/download" or Ref100NR.zip from "https://cloudstor.aarnet.edu.au/plus/s/jIByczak9ZAFUB4/download" . Uncompress all the folders. The final folder structures should look like this:
blast tools:
|-- ncbi-blast-2.11.0+
|---- bin
|------ blastn.exe
|------ makeblastdb.exe
|------ ......
Ref99NR/Ref100NR:
|-- Tax4Fun2_ReferenceData_v2
|---- Ref99NR
|------ otu000001.tbl.gz
|------ ......
|------ Ref99NR.fasta
|------ Ref99NR.tre
The paths "ncbi-blast-2.11.0+/bin" and "Tax4Fun2_ReferenceData_v2" are required by the trans_func$cal_tax4fun2() function.
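Before running the analysis, it can help to confirm that the uncompressed folders are where you expect them. The following is a minimal sketch. `check_tax4fun2_paths` is a hypothetical helper (not a microeco function), "your_directory" is a placeholder, and the fasta file name for databases other than Ref99NR is assumed to follow the same pattern as in the layout above.

```r
# Placeholder locations; replace "your_directory" with the real parent folder
blast_tool_path <- file.path("your_directory", "ncbi-blast-2.11.0+", "bin")
ref_data_path <- file.path("your_directory", "Tax4Fun2_ReferenceData_v2")

# Hypothetical helper: report which parts of the expected layout exist
check_tax4fun2_paths <- function(blast_bin, ref_dir, database = "Ref99NR") {
  c(
    blast_bin = dir.exists(blast_bin),
    ref_dir = dir.exists(ref_dir),
    database = dir.exists(file.path(ref_dir, database)),
    # Assumes the fasta file is named after the database, e.g. Ref99NR.fasta
    fasta = file.exists(file.path(ref_dir, database, paste0(database, ".fasta")))
  )
}

# Each element is TRUE when the corresponding folder or file is found
print(check_tax4fun2_paths(blast_tool_path, ref_data_path))
```

Any FALSE entry points to a folder that was not uncompressed or was moved; see ?trans_func for the arguments that take these paths.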
# seqinr should be installed for reading and writing fasta files
install.packages("seqinr", dependencies = TRUE)
# Now we show how to read the fasta file
# see https://github.com/ChiLiubio/file2meco if you have not installed file2meco
rep_fasta_path <- system.file("extdata", "rep.fna", package="file2meco")
rep_fasta <- seqinr::read.fasta(rep_fasta_path)
# then see the help document of the microtable class about rep_fasta in microtable$new().
Most of the plotting in the package relies on the ggplot2 package. We provide some parameters to change the corresponding plot. If you want to modify the output plot, you can also assign the output to a name and use ggplot2-style grammar to modify it as needed. Each data table used for plotting is stored in the object and can be extracted for customized analysis and plotting. Of course, you can also directly modify the classes and reload them.
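As a minimal sketch of this workflow, the example below uses a plain ggplot2 object built from the built-in mtcars data as a stand-in for a plot returned by a microeco plotting method; the layers added here are only illustrative choices.

```r
library(ggplot2)

# Stand-in for the output of a microeco plotting method (assumption:
# the method returns a ggplot object, as described above)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Modify the stored plot with ordinary ggplot2 grammar
p2 <- p +
  theme_bw() +
  labs(title = "Customized plot", x = "Weight", y = "Miles per gallon")

# The data behind the plot stays available for further analysis
plot_data <- p2$data
print(head(plot_data))
```

The same pattern applies to any layer, scale or theme that ggplot2 provides.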
Previous descriptions of how to construct a microtable object from QIIME, QIIME2 and phyloseq have been moved to the package file2meco (https://github.com/ChiLiubio/file2meco). The package file2meco is designed to convert files between other tools/platforms and the microtable object.
We welcome any contribution, including but not limited to code, ideas and tutorials. Please report errors and questions on GitHub Issues. Any contribution via pull requests or email (liuchi0426@126.com) will be appreciated. By participating in this project you agree to abide by the terms outlined in the Contributor Code of Conduct.
- Louca, S., Parfrey, L. W., & Doebeli, M. (2016). Decoupling function and taxonomy in the global ocean microbiome. Science, 353(6305), 1272. DOI: 10.1126/science.aaf4507
- Nguyen, N. H., Song, Z., Bates, S. T., Branco, S., Tedersoo, L., Menke, J., … Kennedy, P. G. (2016). FUNGuild: An open annotation tool for parsing fungal community datasets by ecological guild. Fungal Ecology, 20(1), 241–248.
- Põlme, S., Abarenkov, K., Henrik Nilsson, R. et al. FungalTraits: a user-friendly traits database of fungi and fungus-like stramenopiles. Fungal Diversity 105, 1–16 (2020). DOI: 10.1007/s13225-020-00466-2
- Aßhauer, K. P., Wemheuer, B., Daniel, R., & Meinicke, P. (2015). Tax4Fun: Predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics, 31(17), 2882–2884.
- Wemheuer, F., Taylor, J.A., Daniel, R. et al. Tax4Fun2: prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA gene sequences. Environmental Microbiome 15, 11 (2020). DOI: 10.1186/s40793-020-00358-7
- Liu, C., Yao, M., Stegen, J. C., Rui, J., Li, J., & Li, X. (2017). Long-term nitrogen addition affects the phylogenetic turnover of soil microbial community responding to moisture pulse. Scientific Reports, 7(1), 17492.
- Segata, N., Izard, J., Waldron, L., Gevers, D., Miropolsky, L., Garrett, W. S., & Huttenhower, C. (2011). Metagenomic biomarker discovery and explanation. Genome Biology, 12(6), R60.
- Chi Liu, Yaoming Cui, Xiangzhen Li, Minjie Yao, microeco: an R package for data mining in microbial community ecology, FEMS Microbiology Ecology, Volume 97, Issue 2, February 2021, fiaa255.
- An, J., Liu, C., Wang, Q., Yao, M., Rui, J., Zhang, S., & Li, X. (2019). Soil bacterial community structure in Chinese wetlands. Geoderma, 337, 290–299.
- Tackmann, J., Matias Rodrigues, J. F., & Mering, C. von. (2019). Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell Systems, 9(3), 286–296 e8.
- White, J., Nagarajan, N., & Pop, M. (2009). Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology, 5(4), e1000352.
- Kurtz ZD, Muller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 2015; 11: e1004226.
- McMurdie PJ, Holmes S (2013) phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS ONE 8(4): e61217.
- Paulson, J., Stine, O., Bravo, H. et al. Differential abundance analysis for microbial marker-gene surveys. Nat Methods 10, 1200–1202 (2013). DOI: 10.1038/nmeth.2658
- Deng Y, Jiang Y-H, Yang Y, He Z, Luo F, Zhou J. Molecular ecological network analyses. BMC bioinformatics 2012; 13: 113.
- Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, et al. Vegan: Community ecology package. 2019.
- Kembel SW, Cowan PD, Helmus MR, Cornwell WK, Morlon H, Ackerly DD, et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics 2010; 26: 1463–1464.