Grab-N-Go Genomes (GrabNGoGenomes): Automating Sequence Data Retrieval

Purpose: A wrapper that searches NCBI's SRA database for sequence data through Entrez's E-utilities (10.9) and downloads sequencing data using NCBI's SRA Toolkit (2.9.6).

Introduction

GrabNGoGenomes was created with the "intro to biocomputing" student in mind. Graduate students are often new to the bioinformatic skills and programs needed for their research. GrabNGoGenomes helps them get started by combining the sequence search and download steps into a single streamlined process.

To get started, visit the setup repo at GnGG_setup.

After completing the steps in the setup repository above, you can use the get_SeqRec and pull_SeqRec scripts contained in this repository. Their usage is explained through examples in the sections below.

Dependencies

GrabNGoGenomes is a wrapper for two NCBI toolkits, E-utilities and SRA Toolkit, which are used for searching and sharing data from biomedical and genomic databases. E-utilities is set up during installation. Because of the large storage and computational resources required, these scripts are meant to be executed on an HPC cluster, so please use your cluster's syntax for loading the SRA Toolkit module. Add a module load USER_VERSION_SRA-TOOLKIT statement to your job header.

module load sra/2.8.1

Note: The command above is only an example; the module name and version will vary between clusters. Please search your cluster's available modules for more information.
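Because module names differ between clusters, it can help to confirm that the required commands are actually on your PATH before a job starts. The following is a minimal sketch, not part of GrabNGoGenomes: check_deps is a hypothetical helper name, and the list of commands to check is an assumption based on the tools this README mentions.

```shell
# Hypothetical helper (not shipped with GrabNGoGenomes): report which of the
# given commands are missing from PATH. Returns non-zero if any are absent.
check_deps() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use in a job script, placed after the module load line:
#   module load sra/2.8.1        # example only; name and version will vary
#   check_deps esearch xtract fastq-dump || exit 1
```

If a command is reported as missing, revisit the setup repository (for E-utilities) or your cluster's module list (for SRA Toolkit).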

Getting Sequence Information

Usage and Arguments

get_SeqRec requires three arguments:

get_SeqRec [-F|-P] ["QUERY_ORGANISM"|"Q_Genus Q_species"] [WGS|WXS|AMPLICON|RNA-Seq|RAD-Seq|ChIP-Seq|Hi-C]

pull_SeqRec requires two arguments:

pull_SeqRec ["QUERY_ORGANISM"|"Q_Genus Q_species"] [Run_accession_file]

Arguments are case-sensitive and must follow the syntax shown above to produce the desired results. Further information about the scripts can be found by using the -h flag or by calling a script with no arguments:

pull_SeqRec -h

get_SeqRec
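Since the arguments are case-sensitive, a small pre-flight check can catch typos before a job is submitted. The sketch below is an illustration only: validate_get_args is a hypothetical function, not part of the scripts, and it simply mirrors the get_SeqRec usage line documented above.

```shell
# Hypothetical pre-flight check mirroring the documented get_SeqRec usage.
# Returns 0 only when the mode and sequencing method match the listed values.
validate_get_args() {
    if [ "$#" -ne 3 ]; then
        echo "expected 3 arguments, got $#" >&2
        return 1
    fi
    case "$1" in
        -F|-P) ;;
        *) echo "mode must be -F or -P: $1" >&2; return 1 ;;
    esac
    if [ -z "$2" ]; then
        echo "query organism must not be empty" >&2
        return 1
    fi
    case "$3" in
        WGS|WXS|AMPLICON|RNA-Seq|RAD-Seq|ChIP-Seq|Hi-C) ;;
        *) echo "unrecognized sequencing method: $3" >&2; return 1 ;;
    esac
}

# Example:
#   validate_get_args -F "Microcebus rufus" WGS && get_SeqRec -F "Microcebus rufus" WGS
```

Note that an unquoted two-word organism name is split into two arguments by the shell, which this check reports as a wrong argument count.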

Full Mode (-F) vs. Partial Mode (-P)

Full Mode

GrabNGoGenomes can be run in two modes, full and partial. In full mode, get_SeqRec obtains SRA run info for all nucleotide sequences of a given organism and then downloads the sequences themselves. The organism can be given as a genus or species scientific name or as a common name. Query organisms should be wrapped in quotes (e.g., "Microcebus", "Microcebus rufus", "dog").

[user@hostname](~)[22:55]: get_SeqRec -F "Microcebus rufus" WGS

The run info is written to two easy-to-read, tab-delimited files:

[user@hostname](Microcebusrufus~Apr_20)[22:55]: ls 
Microcebusrufus~full_SRA_info_Apr_20.txt
Microcebusrufus~filtered_SRA_info_Apr_20.txt

Microcebusrufus~full_SRA_info_Apr_20.txt contains the archived results, restricted to runs with public consent and SRR accessions (as opposed to ERR or DRR).

Microcebusrufus~filtered_SRA_info_Apr_20.txt contains the full results filtered by the user-specified sequencing method.

Output example:

[user@hostname](Microcebusrufus~Apr_20)[22:55]: cat ~GrabNGoGenomes/Microcebusrufus~Apr_20/Microcebusrufus~full_SRA_info_Apr_20.txt

Run_ID    Lib_Size(MB)    Lib_Type    Sample_ID    Scientific_Name    Sequencing_Platform    Model    Consent        Apr_20
SRR3496213	428	WGS	SRS1412880	Microcebus rufus	ILLUMINA	NextSeq 500	public
SRR3496243	324	WGS	SRS1412880	Microcebus rufus	ILLUMINA	NextSeq 500	public
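To illustrate how the filtered file relates to the full one, the filtering step could be approximated with awk. This is a sketch under assumptions, not the code used inside get_SeqRec: filter_by_method is a hypothetical name, and it assumes the file is tab-delimited with Lib_Type in the third column, as in the output example above.

```shell
# Hypothetical helper: keep the header line plus rows whose Lib_Type
# (column 3) exactly equals the requested sequencing method.
filter_by_method() {
    # $1 = sequencing method (e.g. WGS), $2 = tab-delimited run info file
    awk -F '\t' -v method="$1" 'NR == 1 || $3 == method' "$2"
}

# Example:
#   filter_by_method WGS Microcebusrufus~full_SRA_info_Apr_20.txt
```

The comparison is exact and case-sensitive, which matches the case-sensitive arguments described under Usage and Arguments.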

A list of run accessions is created:

[user@hostname](Microcebusrufus~Apr_20)[22:55]: cat ~GrabNGoGenomes/Microcebusrufus~Apr_20/Microcebusrufus~run_accession_Apr_20.txt
SRR3496213
SRR3496243
SRR3496245
SRR3496215

The script passes the accession list to fastq-dump, a tool in the SRA Toolkit, which downloads the sequences for each accession. Sequences are gzip-compressed and, for paired-end libraries, split into two files. A read suffix (.1 or .2) is appended to the ID in each header line.

Output example:

[user@hostname](Microcebusrufus~files_Apr_20)[23:00] ls
SRR3496213_1.fastq.gz  
SRR3496215_1.fastq.gz  
SRR3496243_1.fastq.gz  
SRR3496245_1.fastq.gz

[user@hostname](Microcebusrufus~files_Apr_20)[23:00] zcat SRR3496213_1.fastq.gz | head -4
@SRR3496213.1.1 2_11101_18707_1025_1 length=146
CATGCAGGAAACTACCTTAACCCAAAGCAACAAGGTTCAAATAAAAATTAGTTCATTAAATAAAAAGTTGAATGAAGGAGAAAGACCATAAAAATAATAGGTATGTACTTTTGATATCTTTTGAACTTAAAACATATAAAAACACA
+SRR3496213.1.1 2_11101_18707_1025_1 length=146
/A/EAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAAEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEE/EEEEEEEEEEE//AAEEEEAEEEEEAE/EAEEEEEEEE<AEEEEEEEE/6E6<EE/EEEAAE</E//<</

Note: sequencing results were not paired end, hence the lack of SRR#_2.fastq.gz files.
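For readers who want to see what the download step might look like on its own, here is a minimal sketch of a loop over a run accession file. It is not the script's actual code: the flag combination (--gzip, --split-files, --readids) is an assumption chosen to match the behavior described above, and download_runs and DUMP_CMD are hypothetical names, with DUMP_CMD included only so the loop can be dry-run.

```shell
# Sketch only: run fastq-dump once per accession listed in a file.
# DUMP_CMD is a hypothetical override used to dry-run the loop with echo.
download_runs() {
    # $1 = run accession file (one accession per line), $2 = output directory
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue    # skip blank lines
        ${DUMP_CMD:-fastq-dump} --gzip --split-files --readids \
            --outdir "$2" "$acc" < /dev/null
    done < "$1"
}

# Dry run (prints the commands instead of downloading):
#   DUMP_CMD="echo fastq-dump" download_runs Microcebusrufus~run_accession_Apr_20.txt out_dir
```

Running the dry run first is a cheap way to confirm the accession file is formatted as expected before committing cluster time and storage to the downloads.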

Partial Mode

[user@hostname](~)[22:55]: get_SeqRec -P "Microcebus rufus" WGS

The partial option will obtain SRA run info just as described in full mode, but not the sequences themselves.

Partial mode requires the use of both the get_SeqRec and pull_SeqRec scripts to obtain sequences.

This allows users to filter data with their own parameters. pull_SeqRec requires a list of SRR accessions. If sequences are desired, a file of SRR accessions must be created (see the Full Mode output for an example) and provided to pull_SeqRec as an argument, which will then obtain the desired sequences.

awk '{print $1}' Microcebusrufus~filtered_SRA_info_Apr_20.txt| tail -n +2 > Microcebusrufus~run_accession_Apr_20.txt
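As one example of filtering with your own parameters, the sketch below keeps only runs at or below a chosen library size before writing the accession list. It is an illustration, not part of GrabNGoGenomes: runs_under_size is a hypothetical name, and it assumes a tab-delimited file with Run_ID in column 1 and Lib_Size(MB) in column 2, as in the Full Mode output example.

```shell
# Hypothetical filter: print Run_IDs (column 1) whose Lib_Size(MB) (column 2)
# is at or below a chosen limit. The header line is skipped.
runs_under_size() {
    # $1 = maximum library size in MB, $2 = tab-delimited run info file
    awk -F '\t' -v max="$1" 'NR > 1 && $2 + 0 <= max + 0 {print $1}' "$2"
}

# Example:
#   runs_under_size 400 Microcebusrufus~filtered_SRA_info_Apr_20.txt \
#       > Microcebusrufus~run_accession_Apr_20.txt
```

The output is already one accession per line, so it can be passed to pull_SeqRec without further processing.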

If RNA-Seq is provided as the sequencing method, an additional file with biosample metadata will be created to provide more filtering parameters.

Metadata for Sceloporus undulatus. Biosample IDs are from Sceloporusundulatus~filtered_SRA_info_Apr_22.txt.
SAMN06312743    Sund_embryo     SRS1964566      not applicable  2 days after egg laying early embryonic stage   pooled male and female  embryo
SAMN06312742    Sund_muscle_M2  SRS1964607      not applicable  not collected   adult   female  skeletal muscle
SAMN06312741    Sund_brain_B1   SRS1964608      not applicable  not collected   adult   female  brain
SAMN01823435    Fence Lizard    SRS378966       Adult   liver
SAMN08687241    SU_1755 SRS3035612      SU_1755 Adult   male    Liver   Tonia Schwartz  Tonia Schwartz Lab Members      July-2017       USA: Auburn University  Lee County, Alabama (Auburn University) 32.59 N 85.48 W
SAMN08687240    SU_1752 SRS3035611      SU_1752 Adult   male    Liver   Tonia Schwartz  Tonia Schwartz Lab Members      July-2017       USA: Auburn University  Lee County, Alabama (Auburn, Greenhouse)        32.61 N 85.46 W
SAMN08687239    SU_1751 SRS3035610      SU_1751 Adult   male    Liver   Tonia Schwartz  Tonia Schwartz Lab Members      July-2017       USA: Auburn University  Lee County, Alabama (Auburn, Greenhouse)        32.61 N 85.46 W

Biosample IDs correspond to the filtered run information file and parameters can be used to further filter datasets.
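One way to use the metadata for filtering is to pick samples by keyword and then look up their runs in the filtered run information file. The following is a sketch under several assumptions: runs_matching_metadata is a hypothetical name, both files are assumed to be tab-delimited, the sample ID (SRS...) is assumed to be in column 3 of the metadata file and column 4 (Sample_ID) of the run info file, and the metadata file is assumed to be non-empty.

```shell
# Hypothetical helper: print Run_IDs whose sample has metadata containing a
# keyword (case-insensitive). Assumes a non-empty, tab-delimited metadata file.
runs_matching_metadata() {
    # $1 = keyword (e.g. liver), $2 = biosample metadata file, $3 = run info file
    awk -F '\t' -v kw="$1" '
        BEGIN { kw = tolower(kw) }
        # First file: remember sample IDs (column 3) on lines containing the keyword
        NR == FNR { if (index(tolower($0), kw)) keep[$3] = 1; next }
        # Second file: skip the header, print Run_ID when Sample_ID (column 4) was kept
        FNR > 1 && ($4 in keep) { print $1 }
    ' "$2" "$3"
}

# Example (METADATA_FILE is a placeholder for the biosample metadata file):
#   runs_matching_metadata liver METADATA_FILE Sceloporusundulatus~filtered_SRA_info_Apr_22.txt
```

Because the keyword is matched anywhere on the line, it is worth inspecting the selected rows by eye before downloading, since metadata fields are not consistent between submitters.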

The awk and tail one-liner shown earlier in this section can be used on user-filtered files to create run accession lists suitable for input to pull_SeqRec:

[user@hostname](Microcebusrufus~files_Apr_20)[23:00] pull_SeqRec "Microcebus rufus" Microcebusrufus~run_accession_Apr_21.txt

Troubleshooting

The xtract command used in the get_SeqRec script should be included in the edirect download. However, if you receive an error saying that xtract cannot be found, try downloading it with the following commands:

ftp-cp ftp.ncbi.nlm.nih.gov /entrez/entrezdirect xtract.Linux.gz
gunzip -f xtract.Linux.gz
chmod +x xtract.Linux

After this, replace every occurrence of xtract in get_SeqRec with xtract.Linux, which should solve the issue.

For SRA Toolkit issues, see the note in the Dependencies section.
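The replacement can be done by hand in a text editor, or with sed. The sketch below is one possible approach rather than an official fix: use_xtract_linux is a hypothetical name, and the command assumes GNU sed (typical on Linux clusters) for the -E, -i, and \b features. It only touches the whole word xtract, so words such as "extract" are left alone, and running it twice does not change the file a second time.

```shell
# Sketch: rewrite the whole word "xtract" to "xtract.Linux" in a script file.
# Assumes GNU sed; the original file is kept as a backup with a .bak suffix.
use_xtract_linux() {
    # $1 = path to the script to edit, e.g. get_SeqRec
    sed -E -i.bak 's/\bxtract(\.Linux)?\b/xtract.Linux/g' "$1"
}

# Example:
#   use_xtract_linux get_SeqRec
```

If anything goes wrong, the original script can be restored from the .bak copy written next to it.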

adc0032/GrabNGoGenomes