This directory contains work related to downloading and processing all V4 datasets together.
Target applications are Fangqiong's population estimation study and Eric's batch effects business.
- Write down the S3 datasets that correspond to all the V4 regions (manually, using `datasets_info.txt` from my meta-analysis work).
- Download the dataset from the S3 mbit bucket (see the download sketch after this list)
- Download the summary file from my almlab bucket
    - or modify it so they're all the same?
    - I think modify it, yeah!
- Process the dataset
- Grab relevant files...
    - `raw_dereplicated.fasta`
    - dereplication map
- Concatenate the raw_dereplicated fastas
- See how big that file is. Hopefully less than 4 GB. If not... dereplicate that again!?!
- ???
- Fix batch effects. Profit.
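For the download step, here's a minimal sketch of what I mean, assuming `S3_V4_datasets.txt` has one S3 path per line and that the AWS CLI is set up. The bucket path in the comment is a placeholder, not the real one.

```python
import subprocess
from pathlib import Path

def download_datasets(dataset_list="S3_V4_datasets.txt", dest="data/"):
    """Download every dataset listed in dataset_list from S3 into dest."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with open(dataset_list) as f:
        s3_paths = [line.strip() for line in f if line.strip()]
    for s3_path in s3_paths:
        # e.g. s3_path = "s3://mbit/some_dataset.tar.gz" (placeholder path)
        subprocess.run(["aws", "s3", "cp", s3_path, dest], check=True)

if __name__ == "__main__":
    download_datasets()
```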
We'll be making 97% OTUs and assigning taxonomy with RDP.
I think the goal for processing is to find the lowest common denominator that can work for all datasets. This means that our parameters will likely be super conservative.
Okay. There are 11 V4 datasets.

- 2 of these started out with fastas.
- 1 used max expected errors filtering (maxee = 2) (t1d_alkanani).
- The minimum trim length was 150.
- 1 was split by barcodes and had primers removed (t1d_mejialeon).

So, I think we should use max expected errors = 2 and trim length = 150.
And while we're at it, let's make min count = 10 again.
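As a sketch of what `update_summary_file.py` has to do, here's roughly how I'd force those parameters into each dataset's summary file. The key names (`MAX_ERRORS`, `TRIM_LENGTH`, `MIN_COUNT`) and the tab-separated `KEY value` line format are assumptions about the summary file layout, not necessarily what the pipeline actually expects.

```python
# Sketch only: the parameter names and "KEY\tvalue" format are assumptions.
PARAMS = {"MAX_ERRORS": "2", "TRIM_LENGTH": "150", "MIN_COUNT": "10"}

def update_summary_file(path):
    """Overwrite (or append) the processing parameters in a summary file."""
    with open(path) as f:
        lines = f.readlines()
    seen = set()
    out = []
    for line in lines:
        key = line.split()[0] if line.split() else ""
        if key in PARAMS:
            out.append(f"{key}\t{PARAMS[key]}\n")
            seen.add(key)
        else:
            out.append(line)
    # Append any parameters that weren't already in the file
    for key, value in PARAMS.items():
        if key not in seen:
            out.append(f"{key}\t{value}\n")
    with open(path, "w") as f:
        f.writelines(out)
```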
Just run `./download_and_process_datasets.sh S3_V4_datasets.txt` and make sure you've copied `master_summary_file.txt` to `data/for_pipeline`.
The `download_and_process_datasets.sh` script:

- downloads each dataset from S3
- updates its summary file with the right parameters
- processes it through the pipeline
- copies the resulting `*.raw_trimmed.fasta` to `data/raw_trimmed/`
- also downloads the corresponding metadata from the `almlab` bucket and puts it in `data/metadata/`
- concatenates all of the metadata files into `data/for_pipeline/`
- dereplicates each raw_trimmed file into a raw_dereplicated file, in `data/derep_data/`
    - this uses the default `min_count = 10`, i.e. it throws out sequences which had fewer than 10 reads in each dataset (see the dereplication sketch after this list)
- concatenates all of the raw_trimmed fastas into `data/derep_data/dereped_datasets_concated.raw_trimmed.fasta`
- re-dereplicates this file into `data/derep_data/dereped_datasets_concated.raw_dereplicated.fasta`
    - this dereplication step uses `min_count = 1`, i.e. a sequence is kept if it showed up in at least one dataset
- reorders and relabels the sequence IDs according to total size across all studies, in `data/derep_data/dereped_datasets_concated.raw_dereplicated.fasta.relabeled_and_sorted`
- clusters this super-dereplicated fasta with usearch
- makes the OTU table by mapping OTUs back to the dereplicated sequences and then back to the original sequences in each dataset
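The dereplication sketch mentioned above: a minimal illustration of dereplicating with a `min_count` cutoff, i.e. collapsing identical reads and keeping only sequences seen at least `min_count` times in a given dataset. The `;size=` header labels follow the usearch convention; whether the pipeline's `3.dereplicate.py` writes exactly this format is an assumption.

```python
from collections import Counter

def read_fasta(path):
    """Yield (header, sequence) pairs from a fasta file."""
    header, seq = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def dereplicate(fasta_in, fasta_out, min_count=10):
    """Collapse identical sequences; discard those seen fewer than min_count times."""
    counts = Counter(seq for _, seq in read_fasta(fasta_in))
    with open(fasta_out, "w") as out:
        for i, (seq, n) in enumerate(counts.most_common()):
            if n < min_count:
                break  # most_common() is sorted by count, so everything after is too rare
            out.write(f">seq{i};size={n}\n{seq}\n")
```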
The `download_and_process_datasets.sh` script calls the following Python scripts:

- `update_summary_file.py` updates a dataset's summary file with the processing parameters (as above).
- `manipulate_metadata_files.py` reads through all the metadata files, checks if there are duplicate sample IDs, concatenates all the metadata files, and writes that metadata to `data/for_pipeline`.
- `dereplicate_individual_datasets.py` reads all the `*.raw_trimmed.fasta` files in `data/raw_trimmed/` and dereplicates them with `3.dereplicate.py`. It also relabels output files for downstream steps. All outputs go in `data/derep_data/`.
- `re_provenance_files.py` reads through the re-dereplicated fasta and relabels and sorts sequences according to their total size (sketched below). I should probably call this something different.
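Here's a sketch of that relabel-and-sort step, again assuming usearch-style `;size=N` annotations in the headers (that format is an assumption, and the `seq0, seq1, ...` relabeling scheme is just for illustration).

```python
import re

def relabel_and_sort(fasta_in, fasta_out):
    """Sort sequences by their ;size= annotation (descending) and relabel them."""
    with open(fasta_in) as f:
        # Each record is "header\nsequence lines..." after splitting on ">"
        records = [r for r in f.read().split(">") if r.strip()]
    parsed = []
    for rec in records:
        header, *seq_lines = rec.splitlines()
        m = re.search(r";size=(\d+)", header)
        size = int(m.group(1)) if m else 1
        parsed.append((size, "".join(seq_lines)))
    parsed.sort(key=lambda r: r[0], reverse=True)
    with open(fasta_out, "w") as out:
        for i, (size, seq) in enumerate(parsed):
            out.write(f">seq{i};size={size}\n{seq}\n")
```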
You can also concatenate all of the `*.raw_trimmed.fasta` files into one massive `raw_trimmed.fasta`. If you use a larger cutoff for `MIN_COUNT` (e.g. `MIN_COUNT = 50`), this file can be directly run through the pipeline (with `RAW_FASTA_FILE` input).
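A small sketch of that concatenation route (the glob pattern and output path just mirror the directory layout above):

```python
import glob

def concatenate_raw_trimmed(in_dir="data/raw_trimmed", out_file="raw_trimmed.fasta"):
    """Concatenate every *.raw_trimmed.fasta in in_dir into one big fasta."""
    with open(out_file, "w") as out:
        for path in sorted(glob.glob(f"{in_dir}/*.raw_trimmed.fasta")):
            with open(path) as f:
                out.write(f.read())
```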
An example summary file is in `data/for_pipeline`.
Some samples have 16S data but no metadata. Some samples have metadata but no 16S data. These results are not corrected for these discrepancies (i.e. there are ~2115 samples with metadata, 2418 with 16S data, but only ~2086 with both).
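If we ever want to correct for that, a quick way to count the overlap would be something like the sketch below. It assumes the concatenated metadata is tab-delimited with sample IDs in the first column and that 16S sample IDs can be parsed out of the fasta headers as `>sampleID_readnumber`; both of those are assumptions about formats I'd need to double-check.

```python
def samples_with_both(metadata_file, fasta_file):
    """Count samples that have both metadata and 16S data."""
    with open(metadata_file) as f:
        next(f)  # skip header row (assumed to be present)
        meta_samples = {line.split("\t")[0] for line in f if line.strip()}
    with open(fasta_file) as f:
        # Assumes headers look like ">sampleID_readnumber"
        seq_samples = {line[1:].split("_")[0] for line in f if line.startswith(">")}
    both = meta_samples & seq_samples
    print(f"metadata: {len(meta_samples)}, 16S: {len(seq_samples)}, both: {len(both)}")
    return both
```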