This is a basic example of using Dat, Gasket and Bionode to do reproducible bioinformatics.
On your command line, run the following:
git clone git@github.com:bionode/bionode-example-dat-gasket.git
cd bionode-example-dat-gasket
npm install
There are 4 pipelines defined in package.json
that you can run in order.
npm run init
npm run fetch
The previous command will run a pipeline named fetch-eukaryota-genomes-metadata
that is stored inside the package.json file of this git repository.
The following is a description of that pipeline:
// Initialize a Dat repository (stored locally in the .dat folder)
"dat init --no-prompt",
// Query the NCBI API for the metadata and returns it in Newline Delimited JSON format
"bionode ncbi search genome eukaryota",
// Store the data in the dat repository
"dat import --json"
You can look at the data that got stored by doing dat listen
(then go to localhost:6461 in your browser) or dat cat | head
while in the same folder where the .dat
folder is located.
npm run search
Like the previous gasket command, this runs another pipeline inside the package.json file.
Here's the description of that pipeline:
// Output all the data in the .dat repository
"dat cat",
// Collect only the JSON object that matches Guillardia
"grep Guillardia",
// Extract the value of the assemblyid property
"tool-stream extractProperty assemblyid",
// Download that genome assembly, it is the reference genome
// http://en.wikipedia.org/wiki/Genome_project#Genome_assembly
// http://en.wikipedia.org/wiki/Reference_genome
"bionode ncbi download assembly",
// Wait for the download to complete
"tool-stream collectMatch status completed ",
// Get the NCBI unique ID for this assembly again
"tool-stream extractProperty uid",
// Query the NCBI API to get the UID of the project related to this assembly
"bionode ncbi link assembly bioproject ",
// Extract the UID of the project
"tool-stream extractProperty destUID ",
// Query NCBI to get the UIDs of the related Sequence Read Archive (SRA) files
// http://www.ncbi.nlm.nih.gov/books/NBK47539/
"bionode ncbi link bioproject sra ",
// Extract the SRAs UIDs
"tool-stream extractProperty destUID"
// **Cheating** To have this example run in minutes and not hours/day,
// we will only download the smallest dataset by using its unique ID that
// we figured out in advance. In a normal situation, the following command
// would not be here.
"grep 35526"
Align the Guillardia theta genomic sequences
// Download the SRA file
"bionode ncbi download sra",
// Extract a FASTQ file containing the DNA sequences from the SRA file
"bionode sra fastq-dump",
// Get the path of the FASTQ file
"tool-stream extractProperty destFile",
// Map the DNA sequences to the reference genome
// http://en.wikipedia.org/wiki/Sequence_alignment
// http://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment
"bionode bwa mem **/*fna.gz",
// Wait for the previous step to finish and produce a Sequence Alignment/Map (SAM) file
"tool-stream collectMatch status finished",
// Get the path of the SAM file
"tool-stream extractProperty sam",
// Convert the SAM file to a binary format (BAM)
"bionode sam"
Once you have a BAM file, you can view which [scaffold](http://en.wikipedia.org/wiki/Contig#Sequence_contigs) of the reference genomes has more [reads](http://www.k.u-tokyo.ac.jp/pros-e/person/shinichi_morishita/genome-assembly.jpg) mapped to it
```bash
samtools idxstats 35526/SRR070675.bam | sort -k3 -n
Then, you can look for variants (SNPs) in that scaffold (JH993052.1)
gunzip 503988/GCA_000315625.1_Guith1_genomic.fna.gz
samtools mpileup -uf 503988/GCA_000315625.1_Guith1_genomic.fna 35526/SRR070675.bam | bcftools view -v snps - | grep JH993052.1
There are several variants, we can start to have a look at the last one by using the Text Alignment Viewer (tview)
samtools tview -p JH993052.1:31185 35526/SRR070675.bam 503988/GCA_000315625.1_Guith1_genomic.fna
# Press ? for help