A Java command-line application that retrieves a sequence from GenBank by its accession ID and records it in a Google Docs spreadsheet.
For any questions contact andrew@smartdeveloping.com
This custom-built application retrieves a FASTA-formatted nucleotide sequence file from GenBank by its accession ID. It places the FASTA file in Google Drive and records it in a Google spreadsheet. The file name contains a unique, auto-incremented numeric ID taken from the spreadsheet.
The configuration file, “config.properties”, is located in the same directory as the jar file. It defines the following properties:
TEMP_FILE_PATH = absolute path to temp directory
GENBANK_SEARCH_URL=http://www.ncbi.nlm.nih.gov/nuccore/
GENBANK_FASTA_DOWNLOAD_URL=http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=nuccore&dopt=fasta&extrafeat=0&maxplex=1&val=
GOOGLE_API_USER= Google account username used to access Google Drive and Google Spreadsheets
GOOGLE_API_PASSWORD= Google account password
GOOGLE_SPREADSHEET_FEED_URL=https://spreadsheets.google.com/feeds/spreadsheets/private/full
GOOGLE_MASTER_SPREADSHEET= name of an existing spreadsheet in which to keep sequences
GOOGLE_DRIVE_MRNA_SEQUENCES_FOLDER_URL= URL of the Google Drive folder in which to keep FASTA files
GOOGLE_DRIVE_MRNA_SEQUENCES_FOLDER_LOCAL_PATH= local path to the mounted Google Drive
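As an illustration of how these settings might be consumed, the sketch below loads the properties with java.util.Properties and builds a FASTA download URL for one accession ID. The class ConfigSketch and its methods are hypothetical and are not taken from GenBankRetriever. Appending the accession ID to the trailing "val=" of GENBANK_FASTA_DOWNLOAD_URL is an assumption based on the shape of that URL, and so is the character check on the accession ID.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper for illustration; not taken from GenBankRetriever.
final class ConfigSketch {

    // Reads KEY=VALUE pairs from any Reader; "KEY = VALUE" spacing is accepted too.
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Convenience overload for a file such as config.properties next to the jar.
    static Properties load(Path file) {
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumption: the accession ID is appended to the configured URL,
    // which ends in "val=".
    static String fastaUrl(Properties props, String accessionId) {
        String base = props.getProperty("GENBANK_FASTA_DOWNLOAD_URL");
        if (base == null) {
            throw new IllegalStateException("GENBANK_FASTA_DOWNLOAD_URL is not set");
        }
        // Assumption: accession IDs consist of letters, digits, '_' and '.' only.
        if (!accessionId.matches("[A-Za-z0-9_.]+")) {
            throw new IllegalArgumentException("Unexpected accession ID: " + accessionId);
        }
        return base.trim() + accessionId;
    }

    public static void main(String[] args) {
        Properties props = load(Paths.get("config.properties"));
        System.out.println(fastaUrl(props, "D49400"));
    }
}
```

Run from the directory that contains config.properties, this prints the download URL for accession D49400. Rejecting unexpected characters keeps the sketch from producing a malformed URL without needing URL encoding.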
Before running the application, make sure that the spreadsheet is sorted by the “AccessionID” column in ascending order; otherwise the application won’t be able to name new FASTA files correctly.
java -jar GenBankRetriever.jar action=insert AccessionID="access_id_value" COL_NAME="COL_VALUE" …
The only required parameter is AccessionID. All other parameters represent columns and their values, which are added to the spreadsheet after the FASTA file has been found and uploaded successfully.
COL_NAME - name of the column in the Google spreadsheet to which data will be added
COL_VALUE - data to be added to the Google spreadsheet
Example: java -jar GenBankRetriever.jar AccessionID="D49400" Description="ATP synthase subunit F, vacuolar" TargetSequenceData="AGGGATGGCGGGGAGGG" TargetCoordinates="NA"
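The NAME="VALUE" convention used on the command line can be sketched as follows. This is a minimal illustration and not GenBankRetriever's actual argument handling: the class ArgsSketch is hypothetical, and treating a missing action as "insert" is an assumption drawn from the example above, which omits it. The shell removes the surrounding quotes before Java sees the arguments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; GenBankRetriever's real argument handling may differ.
final class ArgsSketch {

    // Splits each NAME=VALUE argument at the first '=' so values may contain '='.
    static Map<String, String> parse(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected NAME=VALUE but got: " + arg);
            }
            params.put(arg.substring(0, eq), arg.substring(eq + 1));
        }
        // Assumption: a missing action means "insert", as in the example above.
        params.putIfAbsent("action", "insert");
        if ("insert".equals(params.get("action")) && !params.containsKey("AccessionID")) {
            throw new IllegalArgumentException("AccessionID is required for action=insert");
        }
        return params;
    }

    public static void main(String[] args) {
        // Print every parsed parameter, one per line.
        parse(args).forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Every parameter other than action would then map to a spreadsheet column name and the value to store in it.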
To extract protein-coding region coordinates using each sequence's GenBank accession ID: java -jar GenBankRetriever.jar action=updatecds
To find Ensembl transcript identifiers based on the GenBank accession ID (uses the bioDBnet web service: http://biodbnet.abcc.ncifcrf.gov): java -jar GenBankRetriever.jar action=updateest