A suite of models for time-series prediction of the human gut microbiome. The models currently implemented are:
- Feed-forward network (FFN)
- Long short-term memory (LSTM)
- Encoder-decoder. The encoder-decoder is currently the most successful model, and all reporting is done on it (a rough architectural sketch appears below).
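The README does not state the framework or layer sizes here, so the following is only a minimal sketch of what a sequence-to-sequence encoder-decoder for multivariate abundance forecasting might look like; the choice of PyTorch, the layer sizes, and all names are assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Hypothetical sketch: encode a window of past abundances, decode future steps."""

    def __init__(self, n_taxa: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(n_taxa, hidden, batch_first=True)
        self.decoder = nn.GRU(n_taxa, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_taxa)

    def forward(self, past: torch.Tensor, horizon: int) -> torch.Tensor:
        # past: (batch, time, n_taxa); the encoder's final hidden state seeds the decoder
        _, state = self.encoder(past)
        step = past[:, -1:, :]            # start decoding from the last observed sample
        preds = []
        for _ in range(horizon):
            dec_out, state = self.decoder(step, state)
            step = self.out(dec_out)      # predicted abundances for the next time point
            preds.append(step)
        return torch.cat(preds, dim=1)    # (batch, horizon, n_taxa)

# Example: predict 5 future samples from 30 past samples of 100 taxa
model = EncoderDecoder(n_taxa=100)
future = model(torch.randn(8, 30, 100), horizon=5)   # shape (8, 5, 100)
```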
The input data come from QIITA (a repository for microbiome data). The data are:
- Study_id=11052. This is a different set of Rob Knight's time-series data.
- Study_id=2202. This is another time-series study with two other individuals.
- Study_id=10283. This is Larry Smarr's pre- and post-op data for a colonoscopy. There are also data from a few other women in this study.
- Study_id=1015. This is a dataset of Rob Knight's and a few other researchers' data from a trip abroad. Rob's data from this study are excluded because they are present in Study 11052.
After studies have been downloaded from QIITA (in the `*.biom` format), they need to be cleaned up before being fed to the model. The metadata for a study should also be downloaded from QIITA at the same time. Additionally, a taxonomy file is needed (this can be generated with QIIME).
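For orientation, the downloaded artifacts can be inspected in Python with the `biom-format` and `pandas` packages; the file names below are placeholders, not files shipped with this repository.

```python
import biom
import pandas as pd

# Load a feature table downloaded from QIITA (placeholder file name)
table = biom.load_table("study_11052.biom")
print(table.shape)   # (number of observations, number of samples)

# Per-sample metadata comes as a tab-separated file
metadata = pd.read_csv("study_11052_metadata.txt", sep="\t", index_col=0)
print(metadata.columns.tolist())
```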
The general workflow is to run the scripts in the `data_preprocessing` directory in this order:
1. `metadata_taxonomy_adder`
2. `biom_combiner`
3. `host_site_separator_time_sorting` (if necessary). At this point, only the sampling sites or hosts of interest can be selected, since each is now in its own file (as opposed to being combined into one `*.biom` file).
4. `sum_truncate_sort_taxonomy`
5. `filtering_normalization_completion`
6. `top_N_strains`

How to use each of these scripts is described in the respective file; a rough sketch of running them in sequence follows.
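As an illustration of the ordering only, one might chain the steps like this; the script file names, extensions, and the absence of arguments are assumptions, and each script's actual usage is documented inside the script itself.

```python
import subprocess

# Hypothetical orchestration of the preprocessing steps in the documented order.
# Real argument lists are described in each script; none are shown here.
steps = [
    "data_preprocessing/metadata_taxonomy_adder.py",
    "data_preprocessing/biom_combiner.py",
    "data_preprocessing/host_site_separator_time_sorting.py",  # only if necessary
    "data_preprocessing/sum_truncate_sort_taxonomy.py",
    "data_preprocessing/filtering_normalization_completion.py",
    "data_preprocessing/top_N_strains.py",
]
for script in steps:
    subprocess.run(["python", script], check=True)  # append each script's own arguments
```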
The output of the data-preprocessing pipeline is available in `input_data`. The directory `summed_completed_no_chloro` was used to train the models reported on in this study. The `_no_chloro` suffix refers to the fact that an organism identified as part of a chloroplast was manually removed from those datasets.
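That removal was done by hand, but for reference the equivalent filter in pandas would look roughly like the following; the CSV file name and the `taxonomy` column are assumptions about the data layout, not something specified here.

```python
import pandas as pd

# Hypothetical example: drop rows whose taxonomy string identifies a chloroplast.
df = pd.read_csv("all_strains.csv")
is_chloro = df["taxonomy"].str.contains("chloroplast", case=False, na=False)
df[~is_chloro].to_csv("all_strains_no_chloro.csv", index=False)
```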
Given a directory of data produced as above (call it `input_dir`), training is very simple. Each model type has its own directory, `dev/models/<model_type>/`; in that directory there is a file called `params.py` where pertinent training parameters can be altered.
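The actual parameter names are defined in each model's own `params.py`; purely as an illustration, such a file might hold values along these lines (every name and value here is hypothetical).

```python
# Hypothetical contents of dev/models/<model_type>/params.py.
# The real parameter names live in the repository's own file.
HIDDEN_SIZE = 64       # width of the hidden / recurrent layers
LOOKBACK = 30          # number of past time points fed to the model
HORIZON = 5            # number of future time points to predict
BATCH_SIZE = 16
LEARNING_RATE = 1e-3
EPOCHS = 200
```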
To train a model, run:
`python dev/models/<some_model>/trainer.py -d input_data/summed_completed_no_chloro/all_strains_top_N`
If you also want a testing dataset, remove the dataset CSV(s) from the directory listed above and put them in their own directory (e.g., `test`). The command above then becomes:
`python dev/models/<some_model>/trainer.py -d input_data/summed_completed_no_chloro/all_strains_top_N -t input_data/summed_completed_no_chloro/test`
The IPython notebook under `notebooks/Model Evaluator.ipynb` has all of the tools necessary for assessing model quality. The IPython notebook under `notebooks/Input Analysis` has all of the tools necessary for assessing trends in the data.